A Step-by-Step Coding Guide to Building a Gemini-Powered AI Startup Pitch Generator Using LiteLLM Framework, Gradio, and FPDF in Google Colab with PDF Export Support

In this tutorial, we build a powerful and interactive AI application that generates startup pitch ideas using Google's Gemini model through the versatile LiteLLM framework. LiteLLM is the backbone of this implementation, providing a unified interface to interact with over 100 LLM providers using OpenAI-compatible APIs, eliminating the complexity of dealing with individual SDKs. By leveraging LiteLLM, we seamlessly connect to Gemini's capabilities for creative ideation and wrap the outputs into a user-friendly Gradio interface. We also use FPDF to generate polished, Unicode-compatible PDFs containing the full startup pitch deck. This tutorial demonstrates how modern AI tooling, including LiteLLM, Gradio, and FPDF, can build an end-to-end solution for entrepreneurs, innovators, and developers.

!pip install litellm gradio fpdf --quiet

!pip install litellm gradio fpdf --quiet installs the core libraries needed for this project. It brings in LiteLLM for interacting with Gemini via a unified API, Gradio for creating a simple web interface, and FPDF for exporting the AI-generated pitch into a well-formatted PDF file—all while suppressing verbose installation logs with --quiet.

Copy CodeCopiedUse a different Browserimport os
import gradio as gr
import uuid
import urllib.request
from fpdf import FPDF
from litellm import completion

api_key = “Your API Key”

We import all the essential Python libraries used in the project, including os for file operations, uuid for generating unique filenames, and urllib for downloading fonts. We also import Gradio for the UI, FPDF for PDF creation, and LiteLLM's completion function to interface with Gemini. The api_key variable stores the user's Gemini API key, which is required to authenticate requests.
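Hardcoding the key is convenient for a quick Colab demo, but it is often safer to read it from an environment variable instead; a minimal alternative (the GEMINI_API_KEY variable name is just an illustrative choice, not part of the original tutorial):

import os

# Prefer an environment variable over a hardcoded key; the variable name here is only an example.
api_key = os.environ.get("GEMINI_API_KEY", "Your API Key")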

Copy CodeCopiedUse a different Browserimport urllib.request
import zipfile
import os
import shutil

if not os.path.exists(“DejaVuSans.ttf”):
print(” Downloading DejaVuSans.ttf…”)
font_zip_url = “https://downloads.sourceforge.net/project/dejavu/dejavu/2.37/dejavu-fonts-ttf-2.37.zip”
font_zip_path = “dejavu-fonts.zip”

urllib.request.urlretrieve(font_zip_url, font_zip_path)

with zipfile.ZipFile(font_zip_path, ‘r’) as zip_ref:
zip_ref.extractall(“dejavu-extracted”)

for root, dirs, files in os.walk(“dejavu-extracted”):
for file in files:
if file == “DejaVuSans.ttf”:
ttf_path = os.path.join(root, file)
shutil.copy(ttf_path, “DejaVuSans.ttf”)
print(” Font extracted and ready.”)
break

Here, we ensure that the DejaVuSans.ttf font is available for creating Unicode-compatible PDFs. The code downloads the font zip file from SourceForge, extracts its contents, and copies the .ttf file to the working directory. This step is crucial for handling special characters in Gemini's output when generating the final pitch PDF with FPDF.

def call_gemini(system_prompt, user_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    response = completion(
        model="gemini/gemini-2.0-flash-lite",
        messages=messages,
        api_key=api_key
    )
    return response["choices"][0]["message"]["content"]

This function, call_gemini, is a wrapper that uses LiteLLM's completion API to interact with the Gemini 2.0 Flash Lite model. It accepts a system prompt and a user prompt, structures them in OpenAI-compatible format, sends the request using the provided API key, and returns the generated response, making it easy to reuse throughout the application.
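As a quick sanity check (assuming the API key above is set), the helper can be called on its own before wiring it into the rest of the app; the prompts here are purely illustrative:

# Example standalone call; the prompts are illustrative only.
reply = call_gemini(
    "You are an innovation strategist.",
    "Suggest one startup idea in climate tech in a single sentence."
)
print(reply)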

def generate_startup_pitch(theme):
    try:
        idea_prompt = f"Generate an innovative startup idea in the field of {theme}. Focus on solving real problems using modern technology."
        idea = call_gemini("You are an innovation strategist.", idea_prompt)

        # Each call_gemini request is stateless, so we pass the generated idea back in as context.
        tagline_prompt = f"Based on this startup idea, generate a short, catchy tagline:\n{idea}"
        pitch_prompt = f"""
Based on this startup idea:
{idea}

Write a concise pitch deck covering:
1. Problem
2. Solution
3. Market Opportunity
4. Team Description
5. Business Model
6. Traction or Future Plan
Format it in a way that looks like slide notes for a VC pitch.
"""

        tagline = call_gemini("You are a branding expert.", tagline_prompt)
        pitch = call_gemini("You are a startup mentor writing a pitch deck.", pitch_prompt)

        filename = f"startup_pitch_{uuid.uuid4().hex[:8]}.pdf"
        pdf = FPDF()
        pdf.add_page()
        pdf.add_font("DejaVu", "", "DejaVuSans.ttf", uni=True)
        pdf.set_font("DejaVu", size=12)

        full_text = f"Startup Idea:\n{idea}\n\nTagline:\n{tagline}\n\nPitch Deck:\n{pitch}"
        pdf.multi_cell(0, 10, full_text)
        pdf.output(filename)

        return idea, tagline, pitch, filename
    except Exception as e:
        return f"Error: {e}", "", "", None

The generate_startup_pitch function orchestrates the entire startup generation process. It sends tailored prompts to Gemini via LiteLLM to produce a startup idea, a catchy tagline, and a structured pitch deck. The responses are then combined into a formatted PDF using FPDF, with proper Unicode support via the DejaVu font. The PDF is saved with a unique filename, enabling users to download their personalized pitch. Error handling ensures smooth execution and user feedback in case of failures.
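Before wiring the function into a UI, it can be exercised directly in a notebook cell; a minimal, illustrative check:

# Illustrative direct call; prints a preview of the idea and the generated PDF filename.
idea, tagline, pitch, pdf_file = generate_startup_pitch("mental health")
print(idea[:200])
print("Saved:", pdf_file)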

with gr.Blocks() as demo:
    gr.Markdown("# AI Startup Pitch Generator (with PDF Export)")
    theme_input = gr.Textbox(label="Enter a theme or industry", placeholder="e.g., mental health, fintech, climate tech")

    generate_button = gr.Button("Generate Pitch")

    idea_output = gr.Textbox(label="Startup Idea")
    tagline_output = gr.Textbox(label="Tagline")
    pitch_output = gr.Textbox(label="Pitch Deck Summary", lines=10)
    pdf_output = gr.File(label="Download Pitch as PDF")

    def wrapper(theme):
        idea, tagline, pitch, pdf_path = generate_startup_pitch(theme)
        return idea, tagline, pitch, pdf_path

    generate_button.click(fn=wrapper, inputs=theme_input, outputs=[idea_output, tagline_output, pitch_output, pdf_output])

demo.launch(share=True)

We define the Gradio user interface for the AI Startup Pitch Generator. Using gr.Blocks(), we create a clean layout with an input box for the user to enter a startup theme or industry and a button to trigger pitch generation. Once clicked, the wrapper function calls generate_startup_pitch, returning a startup idea, tagline, pitch summary, and a downloadable PDF. The share=True flag enables public access to the app, making it easy to demo or share the tool with others via a unique URL.

App Interface to Generate Ideas

Download the PDF Report

In conclusion, by combining the abstraction power of LiteLLM with the creative intelligence of Google's Gemini models, this tutorial highlights how developers can rapidly prototype intelligent, production-ready applications. LiteLLM drastically simplifies working with diverse LLM APIs by maintaining a consistent OpenAI-style calling interface across providers like Gemini, Claude, OpenAI, and more. Through Gradio, we added an intuitive front end to accept user input and display results, while FPDF allowed us to convert AI-generated content into shareable, well-formatted PDF documents. This tutorial showcases how to build a multi-component AI app in a Colab-friendly environment and underlines LiteLLM's role as a pivotal gateway to the expanding ecosystem of language models. Whether you're building MVPs or production tools, LiteLLM offers the flexibility and scalability to keep your LLM workflow fast and future-proof.


Llama 4 family of models from Meta are now available in SageMaker JumpStart

Today, we're excited to announce the availability of Llama 4 Scout and Maverick models in Amazon SageMaker JumpStart and coming soon in Amazon Bedrock. Llama 4 represents Meta's most advanced multimodal models to date, featuring a mixture of experts (MoE) architecture and context window support up to 10 million tokens. With native multimodality and early fusion technology, Meta states that these new models demonstrate unprecedented performance across text and vision tasks while maintaining efficient compute requirements. With a dramatic increase in supported context length from 128K in Llama 3, Llama 4 is now suitable for multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over extensive codebases. You can now deploy the Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E-Instruct, and Llama-4-Maverick-17B-128E-Instruct-FP8 models using SageMaker JumpStart in the US East (N. Virginia) AWS Region.
In this blog post, we walk you through how to deploy and prompt a Llama-4-Scout-17B-16E-Instruct model using SageMaker JumpStart.
Llama 4 overview
Meta announced Llama 4 today, introducing three distinct model variants: Scout, which offers advanced multimodal capabilities and a 10M token context window; Maverick, a cost-effective solution with a 128K context window; and Behemoth, in preview. These models are optimized for multimodal reasoning, multilingual tasks, coding, tool-calling, and powering agentic systems.
Llama 4 Maverick is a powerful general-purpose model with 17 billion active parameters, 128 experts, and 400 billion total parameters, optimized for high-quality general assistant and chat use cases. Additionally, Llama 4 Maverick is available in base and instruct variants, with the instruct model offered in both a quantized (FP8) version for efficient deployment and a non-quantized (BF16) version for maximum accuracy.
Llama 4 Scout, the more compact and smaller model, has 17 billion active parameters, 16 experts, and 109 billion total parameters, and features an industry-leading 10M token context window. These models are designed for industry-leading performance in image and text understanding with support for 12 languages, enabling the creation of AI applications that bridge language barriers.
See Meta’s community license agreement for usage terms and more details.
SageMaker JumpStart overview
SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can use state-of-the-art model architectures—such as language models, computer vision models, and more—without having to build them from scratch.
With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker inference instances and can be isolated within your virtual private cloud (VPC). After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker AI, including SageMaker inference for deploying models and container logs for improved observability. With SageMaker AI, you can streamline the entire model deployment process.
Prerequisites
To try the Llama 4 models in SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, see Identity and Access Management for Amazon SageMaker AI.
Access to Amazon SageMaker Studio and a SageMaker AI notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the LLMs.

Discover Llama 4 models in SageMaker JumpStart
SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the Amazon SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.
SageMaker Studio is a comprehensive integrated development environment (IDE) that offers a unified, web-based interface for performing all aspects of the AI development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process.
In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment to inference capabilities on SageMaker Inference. You can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart from the Home page in SageMaker Studio, as shown in the following figure.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use SageMaker JumpStart models. This approach allows for greater flexibility and integration with existing AI and machine learning (AI/ML) workflows and pipelines.
By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI/ML development efforts, regardless of your preferred interface or workflow.
Deploy Llama 4 models for inference through the SageMaker JumpStart UI
On the SageMaker JumpStart landing page, you can find all the public pre-trained models offered by SageMaker AI. You can then choose the Meta model provider tab to discover all the available Meta models.
If you’re using SageMaker Classic Studio and don’t see the Llama 4 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, see Shut down and Update Studio Classic Apps.

Search for Meta to view the Meta model card. Each model card shows key information, including:

Model name
Provider name
Task category (for example, Text Generation)

Select the model card to view the model details page.

The model details page includes the following information:

The model name and provider information
Deploy button to deploy the model
About and Notebooks tabs with detailed information

The About tab includes important details, such as:

Model description
License information
Technical specifications
Usage guidelines

Before you deploy the model, we recommend that you review the model details and license terms to confirm compatibility with your use case.

Choose Deploy to proceed with deployment.

For Endpoint name, use the automatically generated name or enter a custom one.
For Instance type, use the default: p5.48xlarge.
For Initial instance count, enter the number of instances (default: 1). Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed.
Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.
Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
Choose Deploy. The deployment process can take several minutes to complete.

When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.
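For example, a minimal sketch of invoking the endpoint with the SageMaker Runtime client might look like the following; the endpoint name is a placeholder for the one shown on the Endpoints page, and the payload mirrors the message format used later in this post:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# "my-llama-4-endpoint" is a placeholder; use your endpoint name from the Endpoints page.
payload = {
    "messages": [{"role": "user", "content": "Give me 5 steps to become better at tennis?"}],
    "max_tokens": 256,
}
response = runtime.invoke_endpoint(
    EndpointName="my-llama-4-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read())["choices"][0]["message"]["content"])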
Deploy Llama 4 models for inference using the SageMaker Python SDK
When you choose Deploy and accept the terms, model deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.
To deploy using a notebook, start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker AI.
You can deploy the Llama 4 Scout model using SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-vlm-llama-4-scout-17b-16e-instruct")

predictor = model.deploy(accept_eula=False)

This deploys the model on SageMaker AI with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "How are you doing today"},
        {"role": "assistant", "content": "Good, what can i help you with today?"},
        {"role": "user", "content": "Give me 5 steps to become better at tennis?"}
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512,
    "logprobs": False
}
response = predictor.predict(payload)
response_message = response['choices'][0]['message']['content']
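If you need to override the defaults described above, a sketch of doing so at deployment time might look like the following; the alternate instance type is taken from the supported types listed in the next section, and the values shown are illustrative:

from sagemaker.jumpstart.model import JumpStartModel

# Override the default instance type and accept the EULA explicitly; adjust to your account limits.
model = JumpStartModel(
    model_id="meta-vlm-llama-4-scout-17b-16e-instruct",
    instance_type="ml.p5en.48xlarge",
)
predictor = model.deploy(accept_eula=True, initial_instance_count=1)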

Recommended instances and benchmark
The following table lists all the Llama 4 models available in SageMaker JumpStart along with the model_id, default instance types, and supported instance types for each model. For increased context length, you can modify the default instance type in the SageMaker JumpStart UI.

| Model name | Model ID | Default instance type | Supported instance types |
| --- | --- | --- | --- |
| Llama-4-Scout-17B-16E-Instruct | meta-vlm-llama-4-scout-17b-16e-instruct | ml.p5.48xlarge | ml.g6e.48xlarge, ml.p5.48xlarge, ml.p5en.48xlarge |
| Llama-4-Maverick-17B-128E-Instruct | meta-vlm-llama-4-maverick-17b-128e-instruct | ml.p5.48xlarge | ml.p5.48xlarge, ml.p5en.48xlarge |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | meta-vlm-llama-4-maverick-17b-128-instruct-fp8 | ml.p5.48xlarge | ml.p5.48xlarge, ml.p5en.48xlarge |

Inference and example prompts for Llama 4 Scout 17B 16 Experts model
You can use the Llama 4 Scout model for text and image or vision reasoning use cases. With that model, you can perform a variety of tasks, such as image captioning, image text retrieval, visual question answering and reasoning, document visual question answering, and more.
In the following sections, we show example payloads, invocations, and responses for Llama 4 Scout that you can use against your Llama 4 model deployments using SageMaker JumpStart.
Text-only input
Input:

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant."
        },
        {
            "role": "user",
            "content": "What are three key benefits of large language models for businesses?"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Response:

Large language models (LLMs) offer numerous benefits to businesses, transforming how they operate,
interact with customers, and generate insights. Here are three key benefits:

1. **Enhanced Customer Experience through Automation and Personalization**:
– **Automated Customer Support**: LLMs can power chatbots and virtual assistants
that provide 24/7 customer support. These models can understand and respond to a wide range of customer queries, help with troubleshooting, and even process transactions, significantly reducing the workload on human customer support agents.
– **Personalization**: By analyzing customer data, LLMs can help businesses personalize their
communications and offerings. For instance, they can generate personalized emails, product recommendations, and content, enhancing customer engagement and satisfaction.

2. **Efficiency and Productivity Gains**:
– **Content Generation**: LLMs can automate the creation of various types of content, such as
blog posts, reports, product descriptions, and social media updates. This not only speeds up content production but also allows human writers to focus on more creative and strategic tasks.
– **Data Analysis and Summarization**: These models can quickly analyze large volumes of data, extract relevant information, and summarize findings in a readable format. This capability can significantly reduce the time and effort required for market research, competitive analysis, and internal reporting.

3. **Improved Decision Making with Data-Driven Insights**:
– **Market and Trend Analysis**: LLMs can process and analyze vast amounts of data from various sources, including news articles, social media, and market reports. This helps businesses stay informed about market trends, consumer sentiment, and competitor activity, enabling more informed strategic decisions.
– **Risk Management and Compliance**: By analyzing regulatory documents and monitoring communications, LLMs can help businesses identify and mitigate compliance risks. They can also assist in the creation of compliance reports and documentation, reducing the risk of human error.

In summary, large language models offer businesses the potential to automate and enhance customer interactions,
improve operational efficiency, and gain deeper insights from their data. These capabilities can lead to cost savings,
increased revenue, and a competitive edge in the market.

Single-image input
In this section, let’s test Llama 4’s multimodal capabilities. By merging text and vision tokens into a unified processing backbone, Llama 4 can seamlessly understand and respond to queries about an image. The following is an example of how you can prompt Llama 4 to answer questions about an image such as the one in the example:
Image:

Input:

import requests
import base64
import boto3
import json

s3 = boto3.client("s3")

# region is assumed to be defined for the bucket name; derive it from the session or set it explicitly.
region = boto3.Session().region_name  # for example, "us-east-1"

s3_bucket = f"jumpstart-cache-prod-{region}"
key_prefix = "inference-notebook-assets"

def download_from_s3(key_filenames):
    for key_filename in key_filenames:
        s3.download_file(s3_bucket, f"{key_prefix}/{key_filename}", key_filename)

# Define image names
heat_map = "heatmap_semantic_similarity_search.png"

# Download and display the heatmap image
download_from_s3(key_filenames=[heat_map])

def img_to_base64(image_path):
    with open(image_path, "rb") as f:
        img = f.read()
    enc_img = base64.b64encode(img).decode('utf-8')
    return enc_img

b64_img = img_to_base64(heat_map)

data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img
                    }
                },
            ],
        },
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

payload = json.dumps(data)

Response:

The image presents a heatmap illustrating the semantic textual similarity between
sentences. The heatmap is divided into two axes, with the x-axis listing six sentences
and the y-axis listing the same six sentences in a different order.\n\n**Sentences:**\n\n*
Your dog is so cute.\n* How cute your dog is!\n* You have such a cute dog!\n* New York City
is the place where I work.\n* I work in New York City.\n* What color do you like the most?\n*
What is your favourite color?\n\n**Heatmap:**\n\nThe heatmap displays the semantic textual
similarity between each pair of sentences, with darker shades of red indicating higher similarity
and lighter shades of orange indicating lower similarity. The color bar on the right side of the
heatmap provides a scale for interpreting the similarity values, ranging from -0.4 to 1.0.\n\n
**Observations:**\n\n* The three sentences related to dogs ("Your dog is so cute.", "How cute your dog is!",
and "You have such a cute dog!") show high similarity with each other, as indicated by the dark red squares
in the corresponding section of the heatmap.\n* The two sentences related to New York City ("New York City is
the place where I work." and "I work in New York City.") also show high similarity with each other.\n* The two
sentences related to favorite colors ("What color do you like the most?" and "What is your favorite color?")
show high similarity with each other.\n* The sentences from different categories (dogs, New York City, and
favorite colors) show lower similarity with each other, as indicated by the lighter shades of orange in the heatmap.\n\n
Overall, the heatmap effectively visualizes the semantic textual similarity between the given sentences, allowing
for easy identification of similar and dissimilar sentence pairs.

The Llama 4 model on JumpStart can reason over images supplied in the request payload (here downloaded and base64 encoded), underlining its powerful potential for real-time multimodal applications.
Multi-image input
Building on its advanced multimodal functionality, Llama 4 can effortlessly process multiple images at the same time. In this demonstration, the model is prompted with two images (the heatmap from the previous example and a second image fetched from a URL and base64 encoded) and tasked with describing each image and explaining their relationship, showcasing its capacity to synthesize information across several visual inputs. Let's test this below by passing both images in the payload.
Image 1:

Image 2:

Input: 

def url_to_base64(image_url):
    # Download the image
    response = requests.get(image_url)
    if response.status_code != 200:
        return None

    # Encode the image content to base64
    image_base64 = base64.b64encode(response.content).decode('utf-8')
    return image_base64

image_url_2 = "https://raw.githubusercontent.com/mathvision-cuhk/MATH-V/refs/heads/main/images/13.jpg"
b64_img_url_2 = url_to_base64(image_url_2)

data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "There are two images. Describe each one and why they are related",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img,
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img_url_2,
                    }
                },
            ],
        },
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

Response:

The first image is a heatmap illustrating the semantic textual similarity between
sentences. The heatmap is titled "Semantic Textual Similarity Between Sentences"
and features a color scale ranging from -0.4 to 1.0, with darker shades indicating
higher similarity. The sentences being compared are:\n\n* Your dog is so cute.\n*
How cute your dog is!\n* You have such a cute dog!\n* New York City is the place
where I work.\n* I work in New York City.\n* What color do you like the most?\n*
What is your favourite color?\n\nThe heatmap reveals that the sentences about dogs
have high similarity scores, with values above 0.9, indicating that they convey
similar meanings. Similarly, the sentences about work locations and favorite colors
also show high similarity scores. However, there is a low similarity score between
the sentences about dogs and those about work locations or favorite colors.\n\n
The second image presents five stick figures labeled A, B, C, D, and E, each representing
a person. The figures are drawn using simple lines and geometric shapes, such as circles
for heads, squares for bodies, and triangles for skirts. The main difference between the
figures lies in their arm and leg positions:\n\n* Figure A has its arms at its sides and
straight legs.\n* Figure B has its arms at its sides and straight legs, but with a slightly
different posture.\n* Figure C has its arms at its sides and straight legs, similar to Figure A.\n*
Figure D has one arm raised and straight legs.\n* Figure E has two arms raised and bent legs.\n\n
Overall, the two images are distinct in terms of their content and purpose. The heatmap provides a
visual representation of semantic similarities between sentences, while the stick figures appear to
be simple illustrations of people with varying poses.

As you can see, Llama 4 excels in handling multiple images simultaneously, providing detailed and contextually relevant insights that emphasize its robust multimodal processing abilities.
Codebase analysis with Llama 4
Using Llama 4 Scout’s industry-leading context window, this section showcases its ability to deeply analyze expansive codebases. The example extracts and contextualizes the buildspec-1-10-2.yml file from the AWS Deep Learning Containers GitHub repository, illustrating how the model synthesizes information across an entire repository. We used a tool to ingest the whole repository into plaintext that we provided to the model as context:
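The repository-to-text tool itself isn't shown in this post; as one hedged illustration, a local clone of the repository could be flattened into the text variable used in the payload below with a short walk over the files (repo_root is a placeholder path):

import os

# Illustrative only: flatten a local clone of the repository into a single string.
repo_root = "deep-learning-containers"  # placeholder path to the cloned repository
parts = []
for root, _, files in os.walk(repo_root):
    for name in files:
        path = os.path.join(root, name)
        try:
            with open(path, "r", encoding="utf-8") as f:
                parts.append(f"## File: {os.path.relpath(path, repo_root)}\n{f.read()}")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
text = "\n\n".join(parts)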
Input:

payload = {
    "messages": [
        {
            "role": "system",
            "content": """You are a helpful AI assistant that summarizes codebases to me to help me understand
            how to analyze code by synthesizing through the entire codebase before responding.

            Be thorough in your search as the file may be nested within a markdown code block or within a directory listing.
            """
        },
        {
            "role": "user",
            "content": f"Can you explain to me the buildspec-1-10-2.yml file and how it relates to the rest of the huggingface directory? Here is the AWS Deep Learning Container Repository for reference:\n\n{text}"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Output: 

## Overview of the buildspec-1-10-2.yml File

The provided information appears to be a part of a large, merged representation of
the Hugging Face repository, generated by Repomix. The `buildspec-1-10-2.yml` file
is likely a build specification file used in a Continuous Integration/Continuous
Deployment (CI/CD) pipeline.

## Purpose and Format

The file contains a packed representation of the entire repository’s contents, designed
to be easily consumable by AI systems for analysis, code review, or other automated processes.
The content is organized into sections:

1. **Summary**: Provides an overview of the file’s purpose, format, and usage guidelines.
2. **Repository Information**: Not shown in the provided excerpt, but likely contains metadata about the repository.
3. **Directory Structure**: Lists the directory structure of the repository, including file paths.
4. **File Entries**: Each file entry consists of a header with the file path and the full contents
of the file in a code block.

## Relationship to the Hugging Face Directory

The `buildspec-1-10-2.yml` file is part of the Hugging Face repository, which appears to be a
large project with multiple subdirectories and files. The directory structure section lists various
subdirectories, including:

* `.github/`: Contains GitHub-related configuration files, such as issue templates, workflows,
and CODEOWNERS.
* `.release_templates/`: Includes release image templates for various frameworks like MXNet, PyTorch,
and TensorFlow.
* `autogluon/`: A subdirectory with build artifacts, inference configurations, and other related files.

## buildspec-1-10-2.yml File Contents

Unfortunately, the provided excerpt does not contain the actual contents of the `buildspec-1-10-2.yml` file.
However, based on the file name and its presence in a CI/CD context, it is likely that this file specifies
the build process, dependencies, and environment configurations for the Hugging Face project.

## Usage Guidelines

When working with this file, keep in mind:

* Treat this file as read-only and make changes to the original repository files instead.
* Use the file path to distinguish between different files in the repository.
* Handle this file with the same level of security as the original repository, as it may
contain sensitive information.

## Additional Notes

* Some files might be excluded based on `.gitignore` rules and Repomix’s configuration.
* Binary files are not included in this packed representation.
* Files are sorted by Git change count, with files having more changes appearing at the bottom.

## Example Use Case

To analyze the `buildspec-1-10-2.yml` file, you would:

1. Locate the file in the repository directory structure.
2. Review the file contents to understand the build process, dependencies, and environment configurations.
3. Use this information to optimize the build process, troubleshoot issues, or make changes to the repository.

Keep in mind that the actual contents of the `buildspec-1-10-2.yml` file are not provided, so this
is a general explanation based on the file name and context.

To get the actual content you would look for

## File: buildspec-1-10-2.yml

block and read its content

Let me know if I can assist with anything else.

Multi-document processing
Harnessing the same extensive token context window, Llama 4 Scout excels in multi-document processing. In this example, the model extracts key financial metrics from Amazon 10-K reports (2017-2024), demonstrating its capability to integrate and analyze data spanning multiple years—all without the need for additional processing tools.
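The pdf_text variable referenced in the payload below is assumed to hold the extracted plain text of a downloaded report; one possible way to produce it (pypdf and the filename are assumptions, not part of the original walkthrough):

from pypdf import PdfReader

# Illustrative only: extract plain text from a locally downloaded 10-K PDF.
reader = PdfReader("amazon-10k-2024.pdf")  # placeholder filename
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)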
Input: 

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a financial analyst. Extract key financial data from this 10-K report."
        },
        {
            "role": "user",
            "content": f"""Extract the following information from this Amazon 10-K report:
            1. What year is this report for?
            2. Total revenue for the year
            3. Net income
            4. Key business segments and their performance
            5. Important trends mentioned

            Here's the 10-K text:
            {pdf_text[:25000]}

            Format your response as a JSON object with these fields: year, revenue, net_income, segments, trends.
            """
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Output:

To analyze Amazon’s growth from 2017 to 2024, we’ll examine the revenue and net income
data available from the 10-K reports.

### Revenue Growth

| Year | Revenue (in billion USD) |
| — | — |
| 2014 | 89 |
| 2015 | 107 |
| 2016 | 135.99 |
| 2017 | 177.87 |
| 2018 | 232.88 |
| 2019 | 280.52 |
| 2020 | 386.06 |
| 2021 | 478 |
| 2022 | 513 |
| 2023 | 513.98 |
| 2024 | 574.78 (projected) |

From 2017 to 2024, Amazon’s revenue has grown from $177.87 billion to a projected $574.78 billion,
representing a significant increase of approximately 223% over the 7-year period.

### Net Income Growth

| Year | Net Income (in billion USD) |
| — | — |
| 2014 | 0.236 |
| 2015 | 0.596 |
| 2016 | 2.37 |
| 2017 | 3.03 |
| 2018 | 10.07 |
| 2019 | 11.59 |
| 2020 | 18.68 |
| 2021 | 33.4 |
| 2022 | 18.7 (lower due to economic conditions) |
| 2023 | 33.36 |
| 2024 | 37.68 (projected) |

Amazon’s net income has grown from $3.03 billion in 2017 to a projected $37.68 billion in 2024,
representing a substantial increase of approximately 1143% over the 7-year period.

### Segment and Trend Analysis

Throughout the years, Amazon has maintained its three main business segments: North America,
International, and Amazon Web Services (AWS). The company has continued to expand into new products,
services, technologies, and geographic regions, driving growth and increasing competition.

The trends mentioned across the 10-K reports highlight the company’s:

1. **Seasonality**: Higher sales volume during the fourth quarter.
2. **Intense competition**: Across geographies and industries, including e-commerce, cloud computing,
and digital content.
3. **Expansion**: Into new products, services, technologies, and geographic regions, which subjects
the company to additional business, legal, financial, and competitive risks.
4. **International operations**: Subject to risks such as local economic and political conditions,
government regulation, and restrictions on sales or distribution.

### Conclusion

Amazon has demonstrated remarkable growth from 2017 to 2024, with revenue increasing by
approximately 223% and net income growing by approximately 1143%. The company’s continued
expansion into new areas, its strong presence in cloud computing through AWS, and its ability
to adapt to changing market conditions have contributed to its success. However, the company
also faces intense competition, seasonality, and risks associated with international operations.

————————————————–
Ask a question about the Amazon 10-K reports across years.

Clean up
To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints using the following code snippets:

predictor.delete_model()
predictor.delete_endpoint()

Alternatively, using the SageMaker console, complete the following steps:

On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
Search for the endpoints you created.
On the endpoint details page, choose Delete.
Choose Delete again to confirm.

Conclusion
In this post, we explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including Meta’s most advanced and capable models to date. Get started with SageMaker JumpStart and Llama 4 models today.
For more information about SageMaker JumpStart, see Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.

About the authors
Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyper-scale on AWS. As a member of the Third-party Model Provider Applied Sciences Solutions Architecture team at AWS, he is a global lead for the Meta–AWS Partnership and technical strategy. Based in Seattle, Washington, Marco enjoys writing, reading, exercising, and building applications in his free time.
Chakravarthy Nagarajan is a Principal Solutions Architect specializing in machine learning, big data, and high performance computing. In his current role, he helps customers solve real-world, complex business problems using machine learning and generative AI solutions.
Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.
Malav Shastri is a Software Development Engineer at AWS, where he works on the Amazon SageMaker JumpStart and Amazon Bedrock teams. His role focuses on enabling customers to take advantage of state-of-the-art open source and proprietary foundation models and traditional machine learning algorithms. Malav holds a Master’s degree in Computer Science.
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Baladithya Balamurugan is a Solutions Architect at AWS focused on ML deployments for inference and using AWS Neuron to accelerate training and inference. He works with customers to enable and accelerate their ML deployments on services such as Amazon SageMaker and Amazon EC2. Based in San Francisco, Baladithya enjoys tinkering, developing applications, and his home lab in his free time.
John Liu has 14 years of experience as a product executive and 10 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 and Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols and fintech companies, and also spent 9 years as a portfolio manager at various hedge funds.

Multi-tenancy in RAG applications in a single Amazon Bedrock knowledge base

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies and AWS. Amazon Bedrock Knowledge Bases offers fully managed, end-to-end Retrieval Augmented Generation (RAG) workflows to create highly accurate, low-latency, secure, and custom generative AI applications by incorporating contextual information from your company’s data sources.
Organizations need to control access to their data across different business units, including companies, departments, or even individuals, while maintaining scalability. When organizations try to separate data sources manually, they often create unnecessary complexity and hit service limitations. This post demonstrates how Amazon Bedrock Knowledge Bases can help you scale your data management effectively while maintaining proper access controls on different management levels.
One of these strategies is using Amazon Simple Storage Service (Amazon S3) folder structures and Amazon Bedrock Knowledge Bases metadata filtering to enable efficient data segmentation within a single knowledge base. Additionally, we dive into integrating common vector database solutions available for Amazon Bedrock Knowledge Bases and how these integrations enable advanced metadata filtering and querying capabilities.
Organizing S3 folder structures for scalable knowledge bases
Organizations working with multiple customers need a secure and scalable way to keep each customer’s data separate while maintaining efficient access controls. Without proper data segregation, companies risk exposing sensitive information between customers or creating complex, hard-to-maintain systems. For this post, we focus on maintaining access controls across multiple business units within the same management level.
A key strategy involves using S3 folder structures and Amazon Bedrock Knowledge Bases metadata filtering to enable efficient data segregation within a single knowledge base. Instead of creating separate knowledge bases for each customer, you can use a consolidated knowledge base with a well-structured S3 folder hierarchy. For example, imagine a consulting firm that manages documentation for multiple healthcare providers—each customer’s sensitive patient records and operational documents must remain strictly separated. The Amazon S3 structure might look as follows:
s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/
    s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/
        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/policies/
        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/procedures/
    s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/
        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/policies/
        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/procedures/
This structure makes sure that Customer A’s healthcare documentation remains completely separate from Customer B’s data. When combined with Amazon Bedrock Knowledge Bases metadata filtering, you can verify that users associated with Customer A can only access their organization’s documents, and Customer B’s users can only see their own data—maintaining strict data boundaries while using a single, efficient knowledge base infrastructure.
The Amazon Bedrock Knowledge Bases metadata filtering capability enhances this segregation by allowing you to tag documents with customer-specific identifiers and other relevant attributes. These metadata filters provide an additional layer of security and organization, making sure that queries only return results from the appropriate customer’s dataset.
Solution overview
The following diagram provides a high-level overview of AWS services and features through a sample use case. Although the example uses Customer A and Customer B for illustration, these can represent distinct business units (such as departments, companies, or teams) with different compliance requirements, rather than only individual customers.

The workflow consists of the following steps:

Customer data is uploaded along with metadata indicating data ownership and other properties to specific folders in an S3 bucket.
The S3 bucket, containing customer data and metadata, is configured as a knowledge base data source. Amazon Bedrock Knowledge Bases ingests the data, along with the metadata, from the source repository and a knowledge base sync is performed.
A customer initiates a query using a frontend application with metadata filters against the Amazon Bedrock knowledge base. An access control metadata filter must be in place to make sure that the customer only accesses data they own; the customer can apply additional filters to further refine query results. This combined query and filter is passed to the RetrieveAndGenerate API.
The RetrieveAndGenerate API handles the core RAG workflow. It consists of several sub-steps:

The user query is converted into a vector representation (embedding).
Using the query embedding and the metadata filter, relevant documents are retrieved from the knowledge base.
The original query is augmented with the retrieved documents, providing context for the large language model (LLM).
The LLM generates a response based on the augmented query and retrieved context.

Finally, the generated response is sent back to the user.

When implementing Amazon Bedrock Knowledge Bases in scenarios involving sensitive information or requiring access controls, developers must implement proper metadata filtering in their application code. Failure to enforce appropriate metadata-based filtering could result in unauthorized access to sensitive documents within the knowledge base. Metadata filtering serves as a critical security boundary and should be consistently applied across all queries. For comprehensive guidance on implementing secure metadata filtering practices, refer to the Amazon Bedrock Knowledge Base Security documentation.
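For example, the tenant access control condition can be combined with any additional user-selected filters using the andAll operator so that the customer condition is always present; a minimal sketch of such a filter object (the documentType condition is an illustrative, optional refinement):

# The customer condition enforces tenant isolation; documentType is an illustrative extra filter.
metadata_filter = {
    "andAll": [
        {"equals": {"key": "customer", "value": "CustomerA"}},
        {"equals": {"key": "documentType", "value": "HIPAA Compliance Guide"}},
    ]
}

This object can then be passed as the filter value inside vectorSearchConfiguration, as shown in the RetrieveAndGenerate example later in this post.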
Implement metadata filtering
For this use case, two specific example customers, Customer A and Customer B, are aligned to different proprietary compliance documents. The number of customers and folders can scale to N depending on the size of the customer base. We will use the following public documents, which will reside in the respective customer's S3 folder. Customer A requires the Architecting for HIPAA Security and Compliance on AWS document. Customer B requires access to the Using AWS in the Context of NHS Cloud Security Guidance document.

Create a JSON file representing the corresponding metadata for both Customer A and Customer B:

The following is the JSON metadata for Customer A’s data:
{ "metadataAttributes": { "customer": "CustomerA", "documentType": "HIPAA Compliance Guide", "focus": "HIPAA Compliance", "publicationYear": 2022, "region": "North America" }}
The following is the JSON metadata for Customer B’s data:
{ "metadataAttributes": { "customer": "CustomerB", "documentType": "NHS Compliance Guidance", "focus": "UK Healthcare Compliance", "publicationYear": 2023, "region": "Europe" }}

Save these files separately with the naming convention <filename>.pdf.metadata.json and store them in the same S3 folder or prefix that stores the source document. For Customer A, name the metadata file architecting-hipaa-compliance-on-aws.pdf.metadata.json and upload it to the folder corresponding to Customer A's documents. Repeat these steps for Customer B.
Create an Amazon Bedrock knowledge base. For instructions, see Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases.
After you create your knowledge base, you can sync the data source. For more details, see Sync your data with your Amazon Bedrock knowledge base.
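As a hedged sketch of the upload step above, using boto3 (the bucket name reuses the example structure from earlier and the file names are the ones chosen for Customer A):

import boto3

s3 = boto3.client("s3")
bucket = "amzn-s3-demo-my-knowledge-base-bucket"  # placeholder bucket name from the example structure
prefix = "customer-data/customerA/policies/"

# Upload the source document and its sidecar metadata file to the same prefix.
s3.upload_file("architecting-hipaa-compliance-on-aws.pdf", bucket, prefix + "architecting-hipaa-compliance-on-aws.pdf")
s3.upload_file("architecting-hipaa-compliance-on-aws.pdf.metadata.json", bucket, prefix + "architecting-hipaa-compliance-on-aws.pdf.metadata.json")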

Test metadata filtering
After you sync the data source, you can test the metadata filtering.
The following is an example of setting the customer = CustomerA metadata filter to show that Customer A only has access to the HIPAA compliance document and not the NHS Compliance Guidance document that relates to Customer B.
To use the metadata filtering options on the Amazon Bedrock console, complete the following steps:

On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
Choose the knowledge base you created.
Choose Test knowledge base.
Choose the Configurations icon, then expand Filters.
Enter a condition using the format: key = value (for this example, customer = CustomerA) and press Enter.
When finished, enter your query in the message box, then choose Run.

We enter two queries, “summarize NHS Compliance Guidance” and “summarize HIPAA Compliance Guide.” The following figure shows the two queries: one attempting to query data related to NHS compliance guidance, which fails because it is outside of the Customer A segment, and another successfully querying data on HIPAA compliance, which has been tagged for Customer A.

Implement field-specific chunking
Amazon Bedrock Knowledge Bases supports several document types for Amazon S3 metadata filtering. The supported file formats include:

Plain text (.txt)
Markdown (.md)
HTML (.html)
Microsoft Word documents (.doc and .docx)
CSV files (.csv)
Microsoft Excel spreadsheets (.xls and .xlsx)

When working with CSV data, customers often want to chunk on a specific field in their CSV documents to gain granular control over data retrieval and enhance the efficiency and accuracy of queries. By creating logical divisions based on fields, users can quickly access relevant subsets of data without needing to process the entire dataset.
Additionally, field-specific chunking aids in organizing and maintaining large datasets, facilitating updating or modifying specific portions without affecting the whole. This granularity supports better version control and data lineage tracking, which are crucial for data integrity and compliance. Focusing on relevant chunks can improve the performance of LLMs, ultimately leading to more accurate insights and better decision-making processes within organizations. For more information, see Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.
To demonstrate field-specific chunking, we use two sample datasets with the following schemas:

Schema 1 – Customer A uses the following synthetic dataset for recording medical case reports (case_reports.csv)

| CaseID | DoctorID | PatientID | Diagnosis | TreatmentPlan | Content |
| --- | --- | --- | --- | --- | --- |
| C001 | D001 | P001 | Hypertension | Lifestyle changes, Medication (Lisinopril) | "Patient diagnosed with hypertension, advised lifestyle changes, and started on Lisinopril." |
| C002 | D002 | P002 | Diabetes Type 2 | Medication (Metformin), Diet adjustment | "Diabetes Type 2 confirmed, prescribed Metformin, and discussed a low-carb diet plan." |
| C003 | D003 | P003 | Asthma | Inhaler (Albuterol) | "Patient reports difficulty breathing; prescribed Albuterol inhaler for asthma management." |
| C004 | D004 | P004 | Coronary Artery Disease | Medication (Atorvastatin), Surgery Consultation | "Coronary artery disease diagnosed, started on Atorvastatin, surgery consultation recommended." |

Schema 2 – Customer B uses the following dataset for recording genetic testing results (genetic_testings.csv)

| SampleID | PatientID | TestType | Result |
| --- | --- | --- | --- |
| S001 | P001 | Genome Sequencing | Positive |
| S002 | P002 | Exome Sequencing | Negative |
| S003 | P003 | Targeted Gene Panel | Positive |
| S004 | P004 | Whole Genome Sequencing | Negative |

Complete the following steps:

Create a JSON file representing the corresponding metadata for both Customer A and Customer B:

The following is the JSON metadata for Customer A’s data (note that recordBasedStructureMetadata supports exactly one content field):

{
    "metadataAttributes": {
        "customer": "CustomerA"
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [
                {
                    "fieldName": "Content"
                }
            ],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [
                    {
                        "fieldName": "CaseID"
                    },
                    {
                        "fieldName": "DoctorID"
                    },
                    {
                        "fieldName": "PatientID"
                    },
                    {
                        "fieldName": "Diagnosis"
                    },
                    {
                        "fieldName": "TreatmentPlan"
                    }
                ]
            }
        }
    }
}

The following is the JSON metadata for Customer B’s data:

{
    "metadataAttributes": {
        "customer": "CustomerB"
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [
                {
                    "fieldName": "TestType"
                }
            ],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [
                    {
                        "fieldName": "SampleID"
                    },
                    {
                        "fieldName": "PatientID"
                    },
                    {
                        "fieldName": "Result"
                    }
                ]
            }
        }
    }
}

Save your files with the naming convention <filename>.csv.metadata.json and store the new JSON file in the same S3 prefix of the bucket where you stored the dataset. For Customer A, name the metadata file case_reports.csv.metadata.json and upload the file to the same folder corresponding to Customer A's datasets.

Repeat the process for Customer B. You have now created metadata from the source CSV itself, as well as an additional metadata field customer that doesn’t exist in the original dataset. The following image highlights the metadata.

Create an Amazon Bedrock knowledge base.
Sync your data with your Amazon Bedrock knowledge base.

Test field-specific chunking
The following is an example of setting the customer = CustomerA metadata filter demonstrating that Customer A only has access to the medical case reports dataset and not the genetic testing dataset that relates to Customer B. We enter a query requesting information about a patient with PatientID as P003.
To test, complete the following steps:

On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
Choose the knowledge base you created.
Choose Test knowledge base.
Choose the Configurations icon, then expand Filters.
Enter a condition using the format: key = value (for this example, customer = CustomerA) and press Enter.
When finished, enter your query in the message box, then choose Run.

The knowledge base returns, "Patient reports difficulty breathing; prescribed Albuterol inhaler for asthma management," which is the Content column entry from Customer A's medical case reports dataset for that PatientID. Although there is a record with the same PatientID in Customer B's genetic testing dataset, Customer A has access only to the medical case reports data due to the metadata filtering.
Apply metadata filtering for the Amazon Bedrock API
You can call the Amazon Bedrock API RetrieveAndGenerate to query a knowledge base and generate responses based on the retrieved results using the specified FM or inference profile. The response only cites sources that are relevant to the query.
The following Python Boto3 example API call applies the metadata filtering for retrieving Customer B data and generates responses based on the retrieved results using the specified FM (Anthropic’s Claude 3 Sonnet) in RetrieveAndGenerate:

import boto3

# The RetrieveAndGenerate API is exposed through the Bedrock Agents runtime client.
bedrock_client = boto3.client("bedrock-agent-runtime")
region = boto3.Session().region_name  # Region used in the foundation model ARN

response = bedrock_client.retrieve_and_generate(
    input={
        "text": "Summarize NHS compliance guidance."
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "example_knowledge_base_id",
            "modelArn": "arn:aws:bedrock:{}::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0".format(region),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "filter": {
                        "equals": {
                            "key": "customer",
                            "value": "CustomerB"
                        }
                    }
                }
            }
        }
    }
)

The following GitHub repository provides a notebook that you can follow to deploy an Amazon Bedrock knowledge base with access control implemented using metadata filtering in your own AWS account.
Integrate existing vector databases with Amazon Bedrock Knowledge Bases and validate metadata
There are multiple ways to create vector databases from AWS services and partner offerings to build scalable solutions. If a vector database doesn’t exist, you can use Amazon Bedrock Knowledge Bases to create one using Amazon OpenSearch Serverless Service, Amazon Aurora PostgreSQL Serverless, or Amazon Neptune Analytics to store embeddings, or you can specify an existing vector database supported by Redis Enterprise Cloud, Amazon Aurora PostgreSQL with the pgvector extension, MongoDB Atlas, or Pinecone. After you create your knowledge base and either ingest or sync your data, the metadata attached to the data will be ingested and automatically populated to the vector database.
In this section, we review how to incorporate and validate metadata filtering with existing vector databases using OpenSearch Serverless, Aurora PostgreSQL with the pgvector extension, and Pinecone. To learn how to set up each individual vector databases, follow the instructions in Prerequisites for your own vector store for a knowledge base.
OpenSearch Serverless as a knowledge base vector store
With OpenSearch Serverless vector database capabilities, you can implement semantic search, RAG with LLMs, and recommendation engines. To address data segregation between business segments within each Amazon Bedrock knowledge base with an OpenSearch Serverless vector database, use metadata filtering. Metadata filtering allows you to segment data inside of an OpenSearch Serverless vector database. This can be useful when you want to add descriptive data to your documents for more control and granularity in searches.
Each OpenSearch Serverless dashboard has a URL that can be used to add documents and query your database; the structure of the URL is domain-endpoint/_dashboard.
After creating a vector database index, you can use metadata filtering to selectively retrieve items by using JSON query options in the request body. For example, to return records owned by Customer A, you can use the following request:

GET <index_name>/_search
{
  "query": {
    "match": {
      "customer": "CustomerA"
    }
  }
}

This query returns a JSON response containing the documents from the index that are labeled as belonging to Customer A.
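If you prefer to validate the filter from code rather than the dashboard, the following sketch runs an equivalent query with the opensearch-py client against an OpenSearch Serverless collection. The Region, collection endpoint, and index name are placeholders that you would replace with your own values.

# A minimal sketch of running the same metadata query from Python with opensearch-py.
# The Region, collection endpoint, and index name below are assumed placeholders.
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"                                        # assumed Region
host = "your-collection-id.us-east-1.aoss.amazonaws.com"    # assumed collection endpoint

auth = AWSV4SignerAuth(boto3.Session().get_credentials(), region, "aoss")
client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# Return only the records whose "customer" metadata field matches "CustomerA".
response = client.search(
    index="your-knowledge-base-index",                      # assumed index name
    body={"query": {"match": {"customer": "CustomerA"}}},
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("customer"))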
Aurora PostgreSQL with the pgvector extension as a knowledge base vector store
Pgvector is an extension of PostgreSQL that allows you to extend your relational database into a high-dimensional vector database. It stores each document’s vector in a separate row of a database table. For details on creating an Aurora PostgreSQL table to be used as the vector store for a knowledge base, see Using Aurora PostgreSQL as a Knowledge Base for Amazon Bedrock.
When storing a vector index for your knowledge base in an Aurora database cluster, make sure that the table for your index contains a column for each metadata property in your metadata files before starting data ingestion.
Continuing with the Customer A example, the customer requires the Architecting for HIPAA Security and Compliance on AWS document.
The following is the JSON metadata for Customer A’s data:
{ "metadataAttributes": { "customer": "CustomerA", "documentType": "HIPAA Compliance Guide", "focus": "HIPAA Compliance", "publicationYear": 2022, "region": "North America" }}
The schema of the PostgreSQL table you create must contain four essential columns for ID, text content, vector values, and service managed metadata; it must also include additional metadata columns (customer, documentType, focus, publicationYear, region) for each metadata property in the corresponding metadata file. This allows pgvector to perform efficient vector searches and similarity comparisons by running queries directly on the database table. The following table summarizes the columns.

Column Name – Data Type – Description
id – UUID primary key – Contains unique identifiers for each record
chunks – Text – Contains the chunks of raw text from your data sources
embedding – Vector – Contains the vector embeddings of the data sources
metadata – JSON – Contains Amazon Bedrock managed metadata required to carry out source attribution and to enable data ingestion and querying
customer – Text – Contains the customer ID
documentType – Text – Contains the type of document
focus – Text – Contains the document focus
publicationYear – Int – Contains the year the document was published
region – Text – Contains the document's related AWS Region

During Amazon Bedrock knowledge base data ingestion, these columns will be populated with the corresponding attribute values. Chunking can break down a single document into multiple separate records (each associated with a different ID).
This PostgreSQL table structure allows for efficient storage and retrieval of document vectors, using PostgreSQL’s robustness and pgvector’s specialized vector handling capabilities for applications like recommendation systems, search engines, or other systems requiring similarity searches in high-dimensional space.
Using this approach, you can implement access control at the table level by creating database tables for each segment. Additional metadata columns can also be included in the table for properties such as the specific document owner (user_id), tags, and so on to further enable and enforce fine-grained (row-level) access control and result filtering if you restrict each user to only query the rows that contain their user ID (document owner).
After creating a vector database table, you can use metadata filtering to selectively retrieve items by using a PostgreSQL query. For example, to return table records owned by Customer A, you can use the following query:

SELECT *
FROM bedrock_integration.bedrock_kb
WHERE customer = 'CustomerA';

This query will return the database records labeled as belonging to Customer A.
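To validate metadata filtering together with vector search directly against the table, the following sketch combines a pgvector similarity search with a WHERE clause on the customer column using psycopg2. The connection details and the query embedding are placeholders; the table and column names follow the structure described above.

# A minimal sketch: metadata filter plus pgvector similarity search with psycopg2.
import psycopg2

# Connection details are placeholders for your Aurora PostgreSQL cluster.
conn = psycopg2.connect(
    host="your-aurora-cluster-endpoint",
    dbname="postgres",
    user="bedrock_user",
    password="your-password",
)

# Placeholder embedding; in practice this would be the embedding of the user's query,
# with the same dimensionality as the table's vector column.
query_embedding = "[0.12, 0.34, 0.56]"

with conn, conn.cursor() as cur:
    # "<=>" is pgvector's cosine distance operator; the WHERE clause restricts the
    # similarity search to rows whose customer column equals CustomerA.
    cur.execute(
        """
        SELECT id, chunks, customer
        FROM bedrock_integration.bedrock_kb
        WHERE customer = %s
        ORDER BY embedding <=> %s::vector
        LIMIT 5;
        """,
        ("CustomerA", query_embedding),
    )
    for record_id, chunk, customer in cur.fetchall():
        print(record_id, customer)

The same pattern extends to row-level access control: adding a user_id column and including it in the WHERE clause restricts each user to the rows they own.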
Pinecone as a knowledge base vector store
Pinecone, a fully managed vector database, enables semantic search, high-performance search, and similarity matching. Pinecone databases can be integrated into your AWS environment in the form of Amazon Bedrock knowledge bases, but are first created through the Pinecone console. For detailed documentation about setting up a vector store in Pinecone, see Pinecone as a Knowledge Base for Amazon Bedrock. Then, you can integrate the databases using the Amazon Bedrock console. For more information about Pinecone integration with Amazon Bedrock, see Bring reliable GenAI applications to market with Amazon Bedrock and Pinecone.
You can segment a Pinecone database by adding descriptive metadata to each index and using that metadata to inform query results. Pinecone supports strings and lists of strings to filter vector searches on customer names, customer industry, and so on. Pinecone also supports numbers and booleans.
Use the metadata query language operators ($eq, $ne, $in, $nin, $and, and $or) to filter output. The following Python example demonstrates how to retrieve records associated with Customer A from the Pinecone index.

from pinecone import Pinecone

pc = Pinecone(api_key="xxxxxxxxxxx")

index = pc.Index(<index_name>)

index.query(
    namespace="",
    vector=[0.17, 0.96, …, 0.44],
    filter={
        "customer": {"$eq": "CustomerA"}
    },
    top_k=10,
    include_metadata=True  # Include metadata in the response.
)

This query will return a response containing the database records labeled as belonging to Customer A.
Enhanced scaling with multiple data sources
Amazon Bedrock Knowledge Bases now supports multiple data sources across AWS accounts. Amazon Bedrock Knowledge Bases can ingest data from up to five data sources, enhancing the comprehensiveness and relevancy of a knowledge base. This feature allows customers with complex IT systems to incorporate data into generative AI applications without restructuring or migrating data sources. It also provides flexibility for you to scale your Amazon Bedrock knowledge bases when data resides in different AWS accounts.
The feature includes cross-account data access, which enables you to configure S3 buckets in different accounts as data sources, and efficient data management options for retaining or deleting data when a source is removed. These enhancements alleviate the need for creating multiple knowledge bases or redundant data copies.
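As a rough sketch, an additional S3 data source can be attached to an existing knowledge base with the Boto3 bedrock-agent client. The knowledge base ID, data source name, bucket ARN, and prefix below are placeholders, and you should confirm the exact fields supported for cross-account buckets in the current API reference.

# A hedged sketch of adding another S3 data source to an existing knowledge base.
import boto3

bedrock_agent = boto3.client("bedrock-agent")

response = bedrock_agent.create_data_source(
    knowledgeBaseId="example_knowledge_base_id",        # placeholder knowledge base ID
    name="customer-b-reports",                          # placeholder data source name
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::customer-b-data-bucket",   # placeholder bucket
            "inclusionPrefixes": ["medical-case-reports/"],       # placeholder prefix
        },
    },
    # Keep previously ingested vectors if this data source is later deleted.
    dataDeletionPolicy="RETAIN",
)
print(response["dataSource"]["dataSourceId"])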
Clean up
After completing the steps in this blog post, make sure to clean up your resources to avoid incurring unnecessary charges. Delete the Amazon Bedrock knowledge base by navigating to the Amazon Bedrock console, selecting your knowledge base, and choosing "Delete" from the "Actions" dropdown menu. If you created vector databases for testing, remember to delete OpenSearch Serverless collections, stop or delete Aurora PostgreSQL instances, and remove any Pinecone indexes you created. Additionally, consider deleting the test documents uploaded to S3 buckets specifically for this blog example to avoid storage charges. Review and clean up any IAM roles or policies created for this demonstration if they're no longer needed.
While Amazon Bedrock Knowledge Bases include charges for data indexing and queries, the underlying storage in S3 and vector databases will continue to incur charges until those resources are removed. For specific pricing details, refer to the Amazon Bedrock pricing page.
Conclusion
In this post, we covered several key strategies for building scalable, secure, and segmented Amazon Bedrock knowledge bases. These include using an S3 folder structure and metadata to organize data sources, and segmenting data within a single knowledge base. Using metadata filtering to create custom queries that target specific data segments helps improve retrieval accuracy and maintain data privacy. We also explored integrating and validating metadata for vector databases including OpenSearch Serverless, Aurora PostgreSQL with the pgvector extension, and Pinecone.
By consolidating multiple business segments or customer data within a single Amazon Bedrock knowledge base, organizations can achieve cost optimization compared to creating and managing them separately. The improved data segmentation and access control measures help make sure each team or customer can only access the information relevant to their domain. The enhanced scalability helps meet the diverse needs of organizations, while maintaining the necessary data segregation and access control.
Try out metadata filtering with Amazon Bedrock Knowledge Bases, and share your thoughts and questions with the authors or in the comments.

About the Authors
Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing 1P and 3P model adoption. Breanne is also on the Women at Amazon board as co-director of Allyship with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from University of Illinois at Urbana Champaign.
 Justin Lin is a Small & Medium Business Solutions Architect at Amazon Web Services. He studied computer science at UW Seattle. Dedicated to designing and developing innovative solutions that empower customers, Justin has been dedicating his time to experimenting with applications in generative AI, natural language processing, and forecasting.
Chloe Gorgen is an Enterprise Solutions Architect at Amazon Web Services, advising AWS customers in various topics including security, analytics, data management, and automation. Chloe is passionate about youth engagement in technology, and supports several AWS initiatives to foster youth interest in cloud-based technology. Chloe holds a Bachelor of Science in Statistics and Analytics from the University of North Carolina at Chapel Hill.

Effectively use prompt caching on Amazon Bedrock

Prompt caching, now generally available on Amazon Bedrock with Anthropic’s Claude 3.5 Haiku and Claude 3.7 Sonnet, along with Nova Micro, Nova Lite, and Nova Pro models, lowers response latency by up to 85% and reduces costs up to 90% by caching frequently used prompts across multiple API calls.
With prompt caching, you can mark the specific contiguous portions of your prompts to be cached (known as a prompt prefix). When a request is made with the specified prompt prefix, the model processes the input and caches the internal state associated with the prefix. On subsequent requests with a matching prompt prefix, the model reads from the cache and skips the computation steps required to process the input tokens. This reduces the time to first token (TTFT) and makes more efficient use of hardware such that we can share the cost savings with you.
This post provides a detailed overview of the prompt caching feature on Amazon Bedrock and offers guidance on how to effectively use this feature to achieve improved latency and cost savings.
How prompt caching works
Large language model (LLM) processing is made up of two primary stages: input token processing and output token generation. The prompt caching feature on Amazon Bedrock optimizes the input token processing stage.
You can begin by marking the relevant portions of your prompt with cache checkpoints. The entire section of the prompt preceding the checkpoint then becomes the cached prompt prefix. As you send more requests with the same prompt prefix, marked by the cache checkpoint, the LLM will check if the prompt prefix is already stored in the cache. If a matching prefix is found, the LLM can read from the cache, allowing the input processing to resume from the last cached prefix. This saves the time and cost that would otherwise be spent recomputing the prompt prefix.
Be advised that the prompt caching feature is model-specific. You should review the supported models and details on the minimum number of tokens per cache checkpoint and maximum number of cache checkpoints per request.

Cache hits only occur when the exact prefix matches. To fully realize the benefits of prompt caching, it’s recommended to position static content such as instructions and examples at the beginning of the prompt. Dynamic content, including user-specific information, should be placed at the end of the prompt. This principle also extends to images and tools, which must remain identical across requests in order to enable caching.
The following diagram illustrates how cache hits work. A, B, C, D represent distinct portions of the prompt. A, B and C are marked as the prompt prefix. Cache hits occur when subsequent requests contain the same A, B, C prompt prefix.

When to use prompt caching
Prompt caching on Amazon Bedrock is recommended for workloads that involve long context prompts that are frequently reused across multiple API calls. This capability can significantly improve response latency by up to 85% and reduce inference costs by up to 90%, making it well-suited for applications that use repetitive, long input context. To determine if prompt caching is beneficial for your use case, you will need to estimate the number of tokens you plan to cache, the frequency of reuse, and the time between requests.
The following use cases are well-suited for prompt caching:

Chat with document – By caching the document as input context on the first request, each user query becomes more efficient, enabling simpler architectures that avoid heavier solutions like vector databases.
Coding assistants – Reusing long code files in prompts enables near real-time inline suggestions, eliminating much of the time spent reprocessing code files.
Agentic workflows – Longer system prompts can be used to refine agent behavior without degrading the end-user experience. By caching the system prompts and complex tool definitions, the time to process each step in the agentic flow can be reduced.
Few-shot learning – Including numerous high-quality examples and complex instructions, such as for customer service or technical troubleshooting, can benefit from prompt caching.

How to use prompt caching
When evaluating a use case to use prompt caching, it’s crucial to categorize the components of a given prompt into two distinct groups: the static and repetitive portion, and the dynamic portion. The prompt template should adhere to the structure illustrated in the following figure.

You can create multiple cache checkpoints within a request, subject to model-specific limitations. It should follow the same static portion, cache checkpoint, dynamic portion structure, as illustrated in the following figure.

Use case example
The “chat with document” use case, where the document is included in the prompt, is well-suited for prompt caching. In this example, the static portion of the prompt would comprise instructions on response formatting and the body of the document. The dynamic portion would be the user’s query, which changes with each request.
In this scenario, the static portions of the prompt should be marked as the prompt prefixes to enable prompt caching. The following code snippet demonstrates how to implement this approach using the Invoke Model API. Here we create two cache checkpoints in the request, one for the instructions and one for the document content, as illustrated in the following figure.

We use the following prompt:

import json

import boto3
import requests

# Bedrock Runtime client used to invoke the model.
bedrock_runtime = boto3.client("bedrock-runtime")


def chat_with_document(document, user_query):
    instructions = (
        "I will provide you with a document, followed by a question about its content. "
        "Your task is to analyze the document, extract relevant information, and provide "
        "a comprehensive answer to the question. Please follow these detailed instructions:"

        "\n\n1. Identifying Relevant Quotes:"
        "\n - Carefully read through the entire document."
        "\n - Identify sections of the text that are directly relevant to answering the question."
        "\n - Select quotes that provide key information, context, or support for the answer."
        "\n - Quotes should be concise and to the point, typically no more than 2-3 sentences each."
        "\n - Choose a diverse range of quotes if multiple aspects of the question need to be addressed."
        "\n - Aim to select between 2 to 5 quotes, depending on the complexity of the question."

        "\n\n2. Presenting the Quotes:"
        "\n - List the selected quotes under the heading 'Relevant quotes:'"
        "\n - Number each quote sequentially, starting from [1]."
        "\n - Present each quote exactly as it appears in the original text, enclosed in quotation marks."
        "\n - If no relevant quotes can be found, write 'No relevant quotes' instead."
        "\n - Example format:"
        "\n Relevant quotes:"
        "\n [1] \"This is the first relevant quote from the document.\""
        "\n [2] \"This is the second relevant quote from the document.\""

        "\n\n3. Formulating the Answer:"
        "\n - Begin your answer with the heading 'Answer:' on a new line after the quotes."
        "\n - Provide a clear, concise, and accurate answer to the question based on the information in the document."
        "\n - Ensure your answer is comprehensive and addresses all aspects of the question."
        "\n - Use information from the quotes to support your answer, but do not repeat them verbatim."
        "\n - Maintain a logical flow and structure in your response."
        "\n - Use clear and simple language, avoiding jargon unless it's necessary and explained."

        "\n\n4. Referencing Quotes in the Answer:"
        "\n - Do not explicitly mention or introduce quotes in your answer (e.g., avoid phrases like 'According to quote [1]')."
        "\n - Instead, add the bracketed number of the relevant quote at the end of each sentence or point that uses information from that quote."
        "\n - If a sentence or point is supported by multiple quotes, include all relevant quote numbers."
        "\n - Example: 'The company's revenue grew by 15% last year. [1] This growth was primarily driven by increased sales in the Asian market. [2][3]'"

        "\n\n5. Handling Uncertainty or Lack of Information:"
        "\n - If the document does not contain enough information to fully answer the question, clearly state this in your answer."
        "\n - Provide any partial information that is available, and explain what additional information would be needed to give a complete answer."
        "\n - If there are multiple possible interpretations of the question or the document's content, explain this and provide answers for each interpretation if possible."

        "\n\n6. Maintaining Objectivity:"
        "\n - Stick to the facts presented in the document. Do not include personal opinions or external information not found in the text."
        "\n - If the document presents biased or controversial information, note this objectively in your answer without endorsing or refuting the claims."

        "\n\n7. Formatting and Style:"
        "\n - Use clear paragraph breaks to separate different points or aspects of your answer."
        "\n - Employ bullet points or numbered lists if it helps to organize information more clearly."
        "\n - Ensure proper grammar, punctuation, and spelling throughout your response."
        "\n - Maintain a professional and neutral tone throughout your answer."

        "\n\n8. Length and Depth:"
        "\n - Provide an answer that is sufficiently detailed to address the question comprehensively."
        "\n - However, avoid unnecessary verbosity. Aim for clarity and conciseness."
        "\n - The length of your answer should be proportional to the complexity of the question and the amount of relevant information in the document."

        "\n\n9. Dealing with Complex or Multi-part Questions:"
        "\n - For questions with multiple parts, address each part separately and clearly."
        "\n - Use subheadings or numbered points to break down your answer if necessary."
        "\n - Ensure that you've addressed all aspects of the question in your response."

        "\n\n10. Concluding the Answer:"
        "\n - If appropriate, provide a brief conclusion that summarizes the key points of your answer."
        "\n - If the question asks for recommendations or future implications, include these based strictly on the information provided in the document."

        "\n\nRemember, your goal is to provide a clear, accurate, and well-supported answer based solely on the content of the given document. "
        "Adhere to these instructions carefully to ensure a high-quality response that effectively addresses the user's query."
    )

    document_content = f"Here is the document: <document> {document} </document>"

    # Two cache checkpoints: one after the instructions, one after the document body.
    messages_API_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": instructions,
                        "cache_control": {
                            "type": "ephemeral"
                        }
                    },
                    {
                        "type": "text",
                        "text": document_content,
                        "cache_control": {
                            "type": "ephemeral"
                        }
                    },
                    {
                        "type": "text",
                        "text": user_query
                    },
                ]
            }
        ]
    }

    response = bedrock_runtime.invoke_model(
        body=json.dumps(messages_API_body),
        modelId="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
        accept="application/json",
        contentType="application/json"
    )
    response_body = json.loads(response.get("body").read())
    print(json.dumps(response_body, indent=2))


response = requests.get("https://aws.amazon.com/blogs/aws/reduce-costs-and-latency-with-amazon-bedrock-intelligent-prompt-routing-and-prompt-caching-preview/")
blog = response.text
chat_with_document(blog, "What is the blog writing about?")

In the response to the preceding code snippet, there is a usage section that provides metrics on the cache reads and writes. The following is the example response from the first model invocation:

{
"id": "msg_bdrk_01BwzJX6DBVVjUDeRqo3Z6GL",
"type": "message",
"role": "assistant",
"model": "claude-3-7-sonnet-20250219",
"content": [
{
"type": "text",
“text”: “Relevant quotes:n[1] “Today, Amazon Bedrock has introduced in preview two capabilities that help reduce costs and latency for generative AI applications”nn[2] “Amazon Bedrock Intelligent Prompt Routing u2013 When invoking a model, you can now use a combination of foundation models (FMs) from the same model family to help optimize for quality and cost… Intelligent Prompt Routing can reduce costs by up to 30 percent without compromising on accuracy.”nn[3] “Amazon Bedrock now supports prompt caching u2013 You can now cache frequently used context in prompts across multiple model invocations… Prompt caching in Amazon Bedrock can reduce costs by up to 90% and latency by up to 85% for supported models.”nnAnswer:nThe article announces two new preview features for Amazon Bedrock that aim to improve cost efficiency and reduce latency in generative AI applications [1]:nn1. Intelligent Prompt Routing: This feature automatically routes requests between different models within the same model family based on the complexity of the prompt, choosing more cost-effective models for simpler queries while maintaining quality. This can reduce costs by up to 30% [2].nn2. Prompt Caching: This capability allows frequent reuse of cached context across multiple model invocations, which is particularly useful for applications that repeatedly use the same context (like document Q&A systems). This feature can reduce costs by up to 90% and improve latency by up to 85% [3].nnThese features are designed to help developers build more efficient and cost-effective generative AI applications while maintaining performance and quality standards.”
}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 9,
"cache_creation_input_tokens": 37209,
"cache_read_input_tokens": 0,
"output_tokens": 357
}
}

The cache checkpoint has been successfully created with 37,209 tokens cached, as indicated by the cache_creation_input_tokens value, as illustrated in the following figure.

For the subsequent request, we can ask a different question:

chat_with_document(blog, “what are the use cases?”)

The dynamic portion of the prompt has been changed, but the static portion and prompt prefixes remain the same. We can expect cache hits from the subsequent invocations. See the following code:

{
"id": "msg_bdrk_01HKoDMs4Bmm9mhzCdKoQ8bQ",
"type": "message",
"role": "assistant",
"model": "claude-3-7-sonnet-20250219",
"content": [
{
"type": "text",
“text”: “Relevant quotes:n[1] “This is particularly useful for applications such as customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models.”nn[2] “This is especially valuable for applications that repeatedly use the same context, such as document Q&A systems where users ask multiple questions about the same document or coding assistants that need to maintain context about code files.”nn[3] “During the preview, you can use the default prompt routers for Anthropic’s Claude and Meta Llama model families.”nnAnswer:nThe document describes two main features with different use cases:nn1. Intelligent Prompt Routing:n- Customer service applications where query complexity variesn- Applications needing to balance between cost and performancen- Systems that can benefit from using different models from the same family (Claude or Llama) based on query complexity [1][3]nn2. Prompt Caching:n- Document Q&A systems where users ask multiple questions about the same documentn- Coding assistants that need to maintain context about code filesn- Applications that frequently reuse the same context in prompts [2]nnBoth features are designed to optimize costs and reduce latency while maintaining response quality. Prompt routing can reduce costs by up to 30% without compromising accuracy, while prompt caching can reduce costs by up to 90% and latency by up to 85% for supported models.”
}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 10,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 37209,
"output_tokens": 324
}
}

37,209 tokens are for the document and instructions read from the cache, and 10 input tokens are for the user query, as illustrated in the following figure.

Let’s change the document to a different blog post, but our instructions remain the same. We can expect cache hits for the instructions prompt prefix because it was positioned before the document body in our requests. See the following code:

response = requests.get("https://aws.amazon.com/blogs/machine-learning/enhance-conversational-ai-with-advanced-routing-techniques-with-amazon-bedrock/")
blog = response.text
chat_with_document(blog, "What is the blog writing about?")

{
"id": "msg_bdrk_011S8zqMXzoGHABHnXX9qSjq",
"type": "message",
"role": "assistant",
"model": "claude-3-7-sonnet-20250219",
"content": [
{
"type": "text",
“text”: “Let me analyze this document and provide a comprehensive answer about its main topic and purpose.nnRelevant quotes:n[1] “When you’re designing a security strategy for your organization, firewalls provide the first line of defense against threats. Amazon Web Services (AWS) offers AWS Network Firewall, a stateful, managed network firewall that includes intrusion detection and prevention (IDP) for your Amazon Virtual Private Cloud (VPC).”nn[2] “This blog post walks you through logging configuration best practices, discusses three common architectural patterns for Network Firewall logging, and provides guidelines for optimizing the cost of your logging solution.”nn[3] “Determining the optimal logging approach for your organization should be approached on a case-by-case basis. It involves striking a balance between your security and compliance requirements and the costs associated with implementing solutions to meet those requirements.”nnAnswer:nThis document is a technical blog post that focuses on cost considerations and logging options for AWS Network Firewall. The article aims to help organizations make informed decisions about implementing and managing their firewall logging solutions on AWS. Specifically, it:nn1. Explains different logging configuration practices for AWS Network Firewall [1]n2. Discusses three main architectural patterns for handling firewall logs:n – Amazon S3-based solutionn – Amazon CloudWatch-based solutionn – Amazon Kinesis Data Firehose with OpenSearch solutionn3. Provides detailed cost analysis and comparisons of different logging approaches [3]n4. Offers guidance on balancing security requirements with cost considerationsnnThe primary purpose is to help AWS users understand and optimize their firewall logging strategies while managing associated costs effectively. The article serves as a practical guide for organizations looking to implement or improve their network security logging while maintaining cost efficiency [2].”
}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 9,
"cache_creation_input_tokens": 37888,
"cache_read_input_tokens": 1038,
"output_tokens": 385
}
}

In the response, we can see 1,038 cache read tokens for the instructions and 37,888 cache write tokens for the new document content, as illustrated in the following figure.

Cost savings
When a cache hit happens, Amazon Bedrock passes along the compute savings to customers by giving a per-token discount on cached context. To calculate the potential cost savings, you should first understand your prompt caching usage pattern with cache write/read metrics in the Amazon Bedrock response. Then you can calculate your potential cost savings with price per 1,000 input tokens (cache write) and price per 1,000 input tokens (cache read). For more price details, see Amazon Bedrock pricing.
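As a worked example, the following sketch estimates the cost of a single invocation from its usage metrics. The per-1,000-token prices below are purely illustrative placeholders; substitute the actual rates from the Amazon Bedrock pricing page.

# Estimate cost with and without caching from the usage metrics in a response.
# The prices here are illustrative placeholders, not actual Amazon Bedrock rates.
PRICE_PER_1K_INPUT = 0.003          # hypothetical base input token price
PRICE_PER_1K_CACHE_WRITE = 0.00375  # hypothetical cache write price
PRICE_PER_1K_CACHE_READ = 0.0003    # hypothetical cache read price

def estimate_cost(usage):
    """usage is the 'usage' dict returned by the model invocation."""
    uncached = usage.get("input_tokens", 0)
    writes = usage.get("cache_creation_input_tokens", 0)
    reads = usage.get("cache_read_input_tokens", 0)

    with_caching = (
        uncached / 1000 * PRICE_PER_1K_INPUT
        + writes / 1000 * PRICE_PER_1K_CACHE_WRITE
        + reads / 1000 * PRICE_PER_1K_CACHE_READ
    )
    # Without caching, every prompt token would be billed at the base input price.
    without_caching = (uncached + writes + reads) / 1000 * PRICE_PER_1K_INPUT
    return with_caching, without_caching

# Usage section from the second invocation in the example above.
usage = {"input_tokens": 10, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 37209}
with_c, without_c = estimate_cost(usage)
print(f"with caching: ${with_c:.4f}, without caching: ${without_c:.4f}")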
Latency benchmark
Prompt caching is optimized to improve the TTFT performance on repetitive prompts. Prompt caching is well-suited for conversational applications that involve multi-turn interactions, similar to chat playground experiences. It can also benefit use cases that require repeatedly referencing a large document.
However, prompt caching might be less effective for workloads that involve a lengthy 2,000-token system prompt with a long set of dynamically changing text afterwards. In such cases, the benefits of prompt caching might be limited.
We have published a notebook on how to use prompt caching and how to benchmark it in our GitHub repo. The benchmark results depend on the use case: input token count, cached token count, or output token count.
Amazon Bedrock cross-Region inference
Prompt caching can be used in conjunction with cross-region inference (CRIS). Cross-region inference automatically selects the optimal AWS Region within your geography to serve your inference request, thereby maximizing available resources and model availability. At times of high demand, these optimizations may lead to increased cache writes.
Metrics and observability
Prompt caching observability is essential for optimizing cost savings and improving latency in applications using Amazon Bedrock. By monitoring key performance metrics, developers can achieve significant efficiency improvements—such as reducing TTFT by up to 85% and cutting costs by up to 90% for lengthy prompts. These metrics are pivotal because they enable developers to assess cache performance accurately and make strategic decisions regarding cache management.
Monitoring with Amazon Bedrock
Amazon Bedrock exposes cache performance data through the API response’s usage section, allowing developers to track essential metrics such as cache hit rates, token consumption (both read and write), and latency improvements. By using these insights, teams can effectively manage caching strategies to enhance application responsiveness and reduce operational costs.
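For example, a small helper like the following, which assumes the Anthropic-style usage fields shown earlier, can accumulate cache metrics across invocations and report the share of prompt tokens served from the cache.

# A minimal sketch of tracking cache behavior across multiple invocations by
# reading the usage section of each parsed response body.
class CacheStats:
    def __init__(self):
        self.reads = 0
        self.writes = 0
        self.uncached = 0

    def record(self, response_body):
        usage = response_body.get("usage", {})
        self.reads += usage.get("cache_read_input_tokens", 0)
        self.writes += usage.get("cache_creation_input_tokens", 0)
        self.uncached += usage.get("input_tokens", 0)

    def hit_rate(self):
        total = self.reads + self.writes + self.uncached
        return self.reads / total if total else 0.0

stats = CacheStats()
# stats.record(response_body)  # call after parsing each invoke_model response
# print(f"{stats.hit_rate():.1%} of prompt tokens were served from the cache")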
Monitoring with Amazon CloudWatch
Amazon CloudWatch provides a robust platform for monitoring the health and performance of AWS services, including new automatic dashboards tailored specifically for Amazon Bedrock models. These dashboards offer quick access to key metrics and facilitate deeper insights into model performance.
To create custom observability dashboards, complete the following steps:

On the CloudWatch console, create a new dashboard. For a full example, see Improve visibility into Amazon Bedrock usage and performance with Amazon CloudWatch.
Choose CloudWatch as your data source and select Pie for the initial widget type (this can be adjusted later).
Update the time range for metrics (such as 1 hour, 3 hours, or 1 day) to suit your monitoring needs.
Select Bedrock under AWS namespaces.
Enter “cache” in the search box to filter cache-related metrics.
For the model, locate anthropic.claude-3-7-sonnet-20250219-v1:0, and select both CacheWriteInputTokenCount and CacheReadInputTokenCount.

Choose Create widget and then Save to save your dashboard.

The following is a sample JSON configuration for creating this widget:

{
  "view": "pie",
  "metrics": [
    [ "AWS/Bedrock", "CacheReadInputTokenCount" ],
    [ ".", "CacheWriteInputTokenCount" ]
  ],
  "region": "us-west-2",
  "setPeriodToTimeRange": true
}

Understanding cache hit rates
Analyzing cache hit rates involves observing both CacheReadInputTokens and CacheWriteInputTokens. By summing these metrics over a defined period, developers can gain insights into the efficiency of the caching strategies. With the published pricing for the model-specific price per 1,000 input tokens (cache write) and price per 1,000 input tokens (cache read) on the Amazon Bedrock pricing page, you can estimate the potential cost savings for your specific use case.
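The following sketch pulls the same metrics with Boto3 and CloudWatch. The metric names match the dashboard steps above, but the ModelId dimension name is an assumption you should verify against the metrics listed in your account.

# A hedged sketch of summing cache metrics over the last 3 hours with CloudWatch.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=3)

def total_tokens(metric_name):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName=metric_name,
        # The "ModelId" dimension name is assumed; confirm it in your account.
        Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-7-sonnet-20250219-v1:0"}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in stats["Datapoints"])

reads = total_tokens("CacheReadInputTokenCount")
writes = total_tokens("CacheWriteInputTokenCount")
print(f"cache read tokens: {reads:.0f}, cache write tokens: {writes:.0f}")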

Conclusion
This post explored the prompt caching feature in Amazon Bedrock, demonstrating how it works, when to use it, and how to use it effectively. It’s important to carefully evaluate whether your use case will benefit from this feature. It depends on thoughtful prompt structuring, understanding the distinction between static and dynamic content, and selecting appropriate caching strategies for your specific needs. By using CloudWatch metrics to monitor cache performance and following the implementation patterns outlined in this post, you can build more efficient and cost-effective AI applications while maintaining high performance.
For more information about working with prompt caching on Amazon Bedrock, see Prompt caching for faster model inference.

About the authors
Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Amazon Bedrock security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.
Kosta Belz is a Senior Applied Scientist in the AWS Generative AI Innovation Center, where he helps customers design and build generative AI solutions to solve key business problems.
Sean Eichenberger is a Sr Product Manager at AWS.

Transformer Meets Diffusion: How the Transfusion Architecture Empowers …

OpenAI’s GPT-4o represents a new milestone in multimodal AI: a single model capable of generating fluent text and high-quality images in the same output sequence. Unlike previous systems (e.g., ChatGPT) that had to invoke an external image generator like DALL-E, GPT-4o produces images natively as part of its response. This advance is powered by a novel Transfusion architecture described in 2024 by researchers at Meta AI, Waymo, and USC. Transfusion marries the Transformer models used in language generation with the Diffusion models used in image synthesis, allowing one large model to handle text and images seamlessly. In GPT-4o, the language model can decide on the fly to generate an image, insert it into the output, and then continue generating text in one coherent sequence. 

Let’s look into a detailed, technical exploration of GPT-4o’s image generation capabilities through the lens of the Transfusion architecture. First, we review how Transfusion works: a single Transformer-based model can output discrete text tokens and continuous image content by incorporating diffusion generation internally. We then contrast this with prior approaches, specifically, the tool-based method where a language model calls an external image API and the discrete token method exemplified by Meta’s earlier Chameleon (CM3Leon) model. We dissect the Transfusion design: special Begin-of-Image (BOI) and End-of-Image (EOI) tokens that bracket image content, the generation of image patches which are later refined in diffusion style, and the conversion of these patches into a final image via learned decoding layers (linear projections, U-Net upsamplers, and a variational autoencoder). We also compare empirical performance: Transfusion-based models (like GPT-4o) significantly outperform discretization-based models (Chameleon) in image quality and efficiency and match state-of-the-art diffusion models on image benchmarks. Finally, we situate this work in the context of 2023–2025 research on unified multimodal generation, highlighting how Transfusion and similar efforts unify language and image generation in a single forward pass or shared tokenization framework.

From Tools to Native Multimodal Generation  

Prior Tool-Based Approach: Before architectures like GPT-4o, if one wanted a conversational agent to produce images, a common approach was a pipeline or tool-invocation strategy. For example, ChatGPT could be augmented with a prompt to call an image generator (such as DALL·E 3) when the user requests an image. In this two-model setup, the language model itself does not truly generate the image; it merely produces a textual description or API call, which an external diffusion model renders into an image. While effective, this approach has clear limitations: the image generation is not tightly integrated with the language model’s knowledge and context.

Discrete Token Early-Fusion: An alternative line of research made image generation endogenously part of the sequence modeling by treating images as sequences of discrete tokens. Pioneered by models like DALL·E (2021), which used a VQ-VAE to encode images into codebook indices, this approach allows a single transformer to generate text and image tokens from one vocabulary. For instance, Parti (Google, 2022) and Meta’s Chameleon (2024) extend language modeling to image synthesis by quantizing images into tokens and training the model to predict those tokens like words. The key idea of Chameleon was the “early fusion” of modalities: images and text are converted into a common token space from the start.

However, this discretization approach introduces an information bottleneck. Converting an image into a sequence of discrete tokens necessarily throws away some detail. The VQ-VAE codebook has a fixed size, so it may not capture subtle color gradients or fine textures present in the original image. Moreover, to retain as much fidelity as possible, the image must be broken into many tokens, often hundreds or more for a single image. This makes generation slow and training costly. Despite these efforts, there is an inherent trade-off: using a larger codebook or more tokens improves image quality but increases sequence length and computation, whereas using a smaller codebook speeds up generation but loses detail. Empirically, models like Chameleon, while innovative, lag behind dedicated diffusion models in image fidelity.

The Transfusion Architecture: Merging Transformers with Diffusion  

Transfusion takes a hybrid approach, directly integrating a continuous diffusion-based image generator into the transformer’s sequence modeling framework. The core of Transfusion is a single transformer model (decoder-only) trained on a mix of text and images but with different objectives for each. Text tokens use the standard next-token prediction loss. Image tokens, continuous embeddings of image patches, use a diffusion loss, the same kind of denoising objective used to train models like Stable Diffusion, except it is implemented within the transformer.

Unified Sequence with BOI/EOI Markers: In Transfusion (and GPT-4o), text and image data are concatenated into one sequence during training. Special tokens mark the boundaries between modalities. A Begin-of-Image (BOI) token indicates that subsequent elements in the sequence are image content, and an End-of-Image (EOI) token signals that the image content has ended. Everything outside of BOI…EOI is treated as normal text; everything inside is treated as a continuous image representation. The same transformer processes all sequences. Within an image’s BOI–EOI block, the attention is bidirectional among image patch elements. This means the transformer can treat an image as a two-dimensional entity while treating the image as a whole as one step in an autoregressive sequence.

Image Patches as Continuous Tokens: Transfusion represents an image as a small set of continuous vectors called latent patches rather than discrete codebook tokens. The image is first encoded by a variational autoencoder (VAE) into a lower-dimensional latent space. The latent image is then divided into a grid of patches, and each patch is flattened into a vector. These patch vectors are what the transformer sees and predicts for image regions. Since they are continuous-valued, the model cannot use a softmax over a fixed vocabulary to generate an image patch. Instead, image generation is learned via diffusion: the model is trained to output denoised patches from noised patches.

Lightweight modality-specific layers project these patch vectors into the transformer’s input space. Two design options were explored: a simple linear layer or a small U-Net style encoder that further downsamples local patch content. The U-Net downsampler can capture more complex spatial structures from a larger patch. In practice, Transfusion found that using U-Net up/down blocks allowed them to compress an entire image into as few as 16 latent patches with minimal performance loss. Fewer patches mean shorter sequences and faster generation. In the best configuration, a Transfusion model at 7B scale represented an image with 22 latent patch vectors on average.

Denoising Diffusion Integration: Training the model on images uses a diffusion objective embedded in the sequence. For each image, the latent patches are noised with a random noise level, as in a standard diffusion model. These noisy patches are given to the transformer (preceded by BOI). The transformer must predict the denoised version. The loss on image tokens is the usual diffusion loss (L2 error), while the loss on text tokens is cross-entropy. The two losses are simply added for joint training. Thus, depending on its current processing, the model learns to continue text or refine an image.
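As a rough illustration of this combined objective (not the paper's actual code), the following toy PyTorch sketch adds a cross-entropy loss over text positions to an L2 denoising loss over image patch positions; the tensor shapes, names, and simple additive weighting are assumptions.

# Toy sketch of Transfusion's joint objective: language-model loss on text tokens
# plus a diffusion (L2 denoising) loss on image patch positions.
import torch
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, predicted_noise, true_noise, image_mask):
    # text_logits: (batch, seq, vocab); text_targets: (batch, seq) with -100 at image positions
    lm_loss = F.cross_entropy(
        text_logits.transpose(1, 2), text_targets, ignore_index=-100
    )
    # predicted_noise / true_noise: (batch, seq, patch_dim); image_mask: (batch, seq) booleans
    diff = (predicted_noise - true_noise) ** 2
    diffusion_loss = diff[image_mask].mean()

    # The two objectives are simply added; a weighting coefficient could be applied.
    return lm_loss + diffusion_loss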

At inference time, the generation procedure mirrors training. GPT-4o generates tokens autoregressively. If it generates a normal text token, it continues as usual. But if it generates the special BOI token, it transitions to image generation. Upon producing BOI, the model appends a block of latent image tokens initialized with pure random noise to the sequence. These serve as placeholders for the image. The model then enters diffusion decoding, repeatedly passing the sequence through the transformer to progressively denoise the image. Text tokens in the context act as conditioning. Once the image patches are fully generated, the model emits an EOI token to mark the end of the image block.

Decoding Patches into an Image: The final latent patch vectors are converted into an actual image. This is done by inverting the earlier encoding: first, the patch vectors are mapped back to latent image tiles using either a linear projection or U-Net up blocks. After this, the VAE decoder decodes the latent image into the final RGB pixel image. The result is typically high quality and coherent because the image was generated through a diffusion process in latent space.

Transfusion vs. Prior Methods: Key Differences and Advantages  

Native Integration vs. External Calls: The most immediate advantage of Transfusion is that image generation is native to the model’s forward pass, not a separate tool. This means the model can fluidly blend text and imagery. Moreover, the language model’s knowledge and reasoning abilities directly inform the image creation. GPT-4o excels at rendering text in images and handling multiple objects, likely due to this tighter integration.

Continuous Diffusion vs. Discrete Tokens: Transfusion’s continuous patch diffusion approach retains much more information and yields higher-fidelity outputs. Because the quantization bottleneck is eliminated, the transformer is not restricted to a limited palette; instead, it predicts continuous values, allowing subtle variations. In benchmarks, a 7.3B-parameter Transfusion model achieved an FID of 6.78 on MS-COCO, compared to an FID of 26.7 for a similarly sized Chameleon model. Transfusion also had a higher CLIP score (0.63 vs 0.39), indicating better image-text alignment.

Efficiency and Scaling: Transfusion can compress an image into as few as 16–20 latent patches, whereas Chameleon might require hundreds of tokens, so the Transfusion transformer takes far fewer steps per image. Transfusion matched Chameleon’s performance using only ~22% of the compute, and it reached the same language perplexity using roughly half the compute as Chameleon.

Image Generation Quality: Transfusion generates photorealistic images comparable to state-of-the-art diffusion models. On the GenEval benchmark for text-to-image generation, a 7B Transfusion model outperformed DALL-E 2 and even SDXL 1.0. GPT-4o renders legible text in images and handles many distinct objects in a scene.

Flexibility and Multi-turn Multimodality: GPT-4o can handle multimodal interactions in both directions, not just text-to-image but also image-to-text and mixed tasks. For example, it can generate an image and then continue producing text about it, or edit it in response to further instructions. Transfusion enables these capabilities naturally within the same architecture.

Limitations: While Transfusion outperforms discrete approaches, it still inherits some limitations from diffusion models. Image output is slower due to multiple iterative steps. The transformer must perform double duty, increasing training complexity. However, careful masking and normalization enable training to billions of parameters without collapse.

Related Work and Multimodal Generative Models (2023–2025)  

Before Transfusion, most efforts fell into tool-augmented models and token-fusion models. HuggingGPT and Visual ChatGPT allowed an LLM to call various APIs for tasks like image generation. Token-fusion approaches include DALL·E, CogView, and Parti, which treat images as sequences of tokens. Chameleon trained on interleaved image-text sequences. Kosmos-1 and Kosmos-2 were multimodal transformers aimed at understanding rather than generation.

Transfusion bridges the gap by keeping the single-model elegance of token fusion but using continuous latent and iterative refinement like diffusion. Google’s Muse and DeepFloyd IF introduced variations but used multiple stages or frozen language encoders. Transfusion integrates all capabilities into one transformer. Other examples include Meta’s Make-A-Scene and Paint-by-Example, Stability AI’s DeepFloyd IF, and HuggingFace’s IDEFICS.

In conclusion, the Transfusion architecture demonstrates that unifying text and image generation in one transformer is possible. GPT-4o with Transfusion generates images natively, guided by context and knowledge, and produces high-quality visuals interleaved with text. Compared to prior models like Chameleon, it offers better image quality, more efficient training, and deeper integration.

Sources

https://openai.com/index/introducing-4o-image-generation/

https://arxiv.org/pdf/2102.12092

https://ar5iv.labs.arxiv.org/html/2405.09818

https://arxiv.org/pdf/2408.11039v1

https://arxiv.org/pdf/2206.10789

https://ai.meta.com/blog/generative-ai-text-images-cm3leon/

https://github.com/deep-floyd/IF

The post Transformer Meets Diffusion: How the Transfusion Architecture Empowers GPT-4o’s Creativity appeared first on MarkTechPost.

This AI Paper from Anthropic Introduces Attribution Graphs: A New Inte …

While the outputs of large language models (LLMs) appear coherent and useful, the underlying mechanisms guiding these behaviors remain largely unknown. As these models are increasingly deployed in sensitive and high-stakes environments, it has become crucial to understand what they do and how they do it.

The main challenge lies in uncovering the internal steps that lead a model to a specific response. The computations happen across hundreds of layers and billions of parameters, making it difficult to isolate the processes involved. Without a clear understanding of these steps, trusting or debugging their behavior becomes harder, especially in tasks requiring reasoning, planning, or factual reliability. Researchers are thus focused on reverse-engineering these models to identify how information flows and decisions are made internally.

Existing interpretability methods like attention maps and feature attribution offer partial views into model behavior. While these tools help highlight which input tokens contribute to outputs, they often fail to trace the full chain of reasoning or identify intermediate steps. Moreover, these tools usually focus on surface-level behaviors and do not provide consistent insight into deeper computational structures. This has created the need for more structured, fine-grained methods to trace logic through internal representations over multiple steps.

To address this, researchers from Anthropic introduced a new technique called attribution graphs. These graphs allow researchers to trace the internal flow of information between features within a model during a single forward pass. By doing so, they attempt to identify intermediate concepts or reasoning steps that are not visible from the model’s outputs alone. The attribution graphs generate hypotheses about the computational pathways a model follows, which are then tested using perturbation experiments. This approach marks a significant step toward revealing the “wiring diagram” of large models, much like how neuroscientists map brain activity.

The researchers applied attribution graphs to Claude 3.5 Haiku, a lightweight language model released by Anthropic in October 2024. The method begins by identifying interpretable features activated by a specific input. These features are then traced to determine their influence on the final output. For example, when prompted with a riddle or poem, the model selects a set of rhyming words before writing lines, a form of planning. In another example, the model identifies “Texas” as an intermediate step to answer the question, “What’s the capital of the state containing Dallas?” which it correctly resolves as “Austin.” The graphs reveal not only the model’s outputs but also how it internally represents and transitions between ideas.

The performance results from attribution graphs uncovered several advanced behaviors within Claude 3.5 Haiku. In poetry tasks, the model pre-plans rhyming words before composing each line, showing anticipatory reasoning. In multi-hop questions, the model forms internal intermediate representations, such as associating Dallas with Texas before determining Austin as the answer. It leverages both language-specific and abstract circuits for multilingual inputs, with the latter becoming more prominent in Claude 3.5 Haiku than in earlier models. Further, the model generates diagnoses internally in medical reasoning tasks and uses them to inform follow-up questions. These findings suggest that the model can abstract planning, internal goal-setting, and stepwise logical deductions without explicit instruction.

This research presents attribution graphs as a valuable interpretability tool that reveals the hidden layers of reasoning in language models. By applying this method, the team from Anthropic has shown that models like Claude 3.5 Haiku don’t merely mimic human responses—they compute through layered, structured steps. This opens the door to deeper audits of model behavior, allowing more transparent and responsible deployment of advanced AI systems.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post This AI Paper from Anthropic Introduces Attribution Graphs: A New Interpretability Method to Trace Internal Reasoning in Claude 3.5 Haiku appeared first on MarkTechPost.

Anthropic’s Evaluation of Chain-of-Thought Faithfulness: Investigati …

A key advancement in AI capabilities is the development and use of chain-of-thought (CoT) reasoning, where models explain their steps before reaching an answer. This structured intermediate reasoning is not just a performance tool; it’s also expected to enhance interpretability. If models explain their reasoning in natural language, developers can trace the logic and detect faulty assumptions or unintended behaviors. While the transparency potential of CoT reasoning has been well-recognized, the actual faithfulness of these explanations to the model’s internal logic remains underexplored. As reasoning models become more influential in decision-making processes, it becomes critical to ensure the coherence between what a model thinks and what it says.

The challenge lies in determining whether these chain-of-thought explanations genuinely reflect how the model arrived at its answer or if they are plausible post-hoc justifications. If a model internally processes one line of reasoning but writes down another, then even the most detailed CoT output becomes misleading. This discrepancy raises serious concerns, especially in contexts where developers rely on these CoTs to detect harmful or unethical behavior patterns during training. In some cases, models might execute behaviors like reward hacking or misalignment without verbalizing the true rationale, thereby escaping detection. This gap between behavior and verbalized reasoning can undermine safety mechanisms designed to prevent catastrophic outcomes in scenarios involving high-stakes decisions.

To evaluate this problem, researchers from Anthropic’s Alignment Science team designed a set of experiments that tested four language models: two reasoning models (Claude 3.7 Sonnet and DeepSeek R1) and two non-reasoning models (Claude 3.5 Sonnet (New) and DeepSeek V3). They used a controlled prompt-pairing method in which a base question was paired with a version containing a subtle embedded hint. If a model’s answer changed in the presence of the hint, researchers checked whether the CoT explicitly referenced the hint. Six categories of hints were used: sycophancy, consistency, visual pattern recognition, metadata cues, grader hacking, and unethical information use. Importantly, the last two are considered “misaligned” hints that can simulate unintended model behaviors like exploiting incorrect patterns or using unauthorized knowledge.

The research introduced a benchmark to quantify the faithfulness of CoT outputs. The models were assigned multiple-choice questions from MMLU (easier) and GPQA (more difficult) datasets. CoT faithfulness was defined as the rate at which a model, after changing its answer due to a hint, explicitly stated in the CoT that it relied on the hint. For instance, Claude 3.7 Sonnet acknowledged using hints in 25% of relevant cases overall, while DeepSeek R1 showed 39% faithfulness. More concerning were the faithfulness scores for misaligned hints: 20% for Claude 3.7 and 29% for DeepSeek R1. These numbers indicate that even when models use problematic cues, they often fail to verbalize them, making them harder to detect.
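As a concrete illustration, the following sketch computes that faithfulness score from hypothetical evaluation records; the record structure is assumed for illustration and is not taken from the paper.

# Faithfulness as defined above: among prompts where the hint changed the answer,
# the fraction whose chain-of-thought explicitly acknowledges the hint.
def cot_faithfulness(records):
    """records: list of dicts with keys
    'answer_without_hint', 'answer_with_hint', 'cot_mentions_hint' (bool)."""
    affected = [r for r in records if r["answer_with_hint"] != r["answer_without_hint"]]
    if not affected:
        return None  # the hint never changed the answer, so faithfulness is undefined
    return sum(r["cot_mentions_hint"] for r in affected) / len(affected)

example = [
    {"answer_without_hint": "B", "answer_with_hint": "C", "cot_mentions_hint": False},
    {"answer_without_hint": "A", "answer_with_hint": "D", "cot_mentions_hint": True},
    {"answer_without_hint": "A", "answer_with_hint": "A", "cot_mentions_hint": False},
]
print(cot_faithfulness(example))  # 0.5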

The study also found that the CoTs of reasoning models, while more frequent and elaborate than those of non-reasoning models, are not necessarily more faithful. For example, Claude 3.7 Sonnet had longer average CoTs when they were unfaithful, 2064 tokens compared to 1439 tokens for faithful ones. DeepSeek R1 showed a similar pattern with 6003 tokens for unfaithful versus 4737 for faithful. These verbose yet misleading outputs suggest that longer explanations do not equate to transparency. Also, the researchers observed that models often construct flawed justifications when given misleading hints, even when they previously answered the same question correctly without the hint. This indicates a troubling inclination to conform to the new cue rather than assert previously correct logic.

A notable observation was the decline in CoT faithfulness on more difficult datasets. The GPQA dataset led to a 44% relative drop in faithfulness for Claude 3.7 Sonnet and a 32% drop for DeepSeek R1 compared to MMLU. This suggests that as tasks grow in complexity, CoTs become less reliable as windows into the model’s thought process. Even when models derive an answer through reasoning, they do not reliably disclose their use of cues, especially when navigating harder problems. This diminishes confidence in using CoT outputs for AI safety on advanced tasks.

Outcome-based reinforcement learning (RL) was evaluated as a possible solution. Researchers hypothesized that training models solely based on task success, without process supervision, would naturally encourage reliance on CoTs. Initially, CoT faithfulness increased by 63% on MMLU and 41% on GPQA but plateaued quickly. Ultimately, RL only achieved 28% faithfulness on MMLU and 20% on GPQA, underscoring its limitations in addressing the faithfulness problem at scale.

In the most critical test, models were trained in synthetic environments designed to simulate reward hacking. They consistently exploited reward hacks in over 99% of examples but verbalized them in fewer than 2% of CoTs across five out of six environments. Instead of admitting to using incorrect hint-based shortcuts, models would either avoid mention altogether or craft elaborate, inaccurate justifications. This stark mismatch between behavior and explanation indicates that CoT monitoring is ineffective for reliably flagging harmful patterns during RL training, even when the CoT is not optimized against a monitor.

Some Key Takeaways from the Research include:

Reasoning models verbalize used hints in as few as 1–20% of applicable cases, depending on the hint type and model.  

Claude 3.7 Sonnet and DeepSeek R1 showed overall CoT faithfulness scores of 25% and 39%, respectively.  

For misaligned hints (e.g., grader hacking), faithfulness dropped to 20% (Claude) and 29% (DeepSeek).  

Faithfulness declines on harder datasets: compared with MMLU, Claude 3.7 Sonnet showed a 44% relative drop and DeepSeek R1 a 32% relative drop on GPQA.  

Outcome-based RL training initially boosts faithfulness (up to 63% improvement) but plateaus at low overall scores (28% MMLU, 20% GPQA).  

In reward hack environments, models exploited hacks >99% of the time but verbalized them in <2% of cases across five out of six settings.  

Longer CoTs do not imply greater faithfulness; unfaithful CoTs were significantly longer on average.  

CoT monitoring cannot yet be trusted to detect undesired or unsafe model behaviors consistently.  

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Anthropic’s Evaluation of Chain-of-Thought Faithfulness: Investigating Hidden Reasoning, Reward Hacks, and the Limitations of Verbal AI Transparency in Reasoning Models appeared first on MarkTechPost.

Meta AI Just Released Llama 4 Scout and Llama 4 Maverick: The First Se …

Today, Meta AI announced the release of its latest generation multimodal models, Llama 4, featuring two variants: Llama 4 Scout and Llama 4 Maverick. These models represent significant technical advancements in multimodal AI, offering improved capabilities for both text and image understanding.

Llama 4 Scout is a 17-billion-active-parameter model structured with 16 expert modules. It introduces an extensive context window capable of accommodating up to 10 million tokens. This substantial context capacity enables the model to manage and interpret extensive textual content effectively, beneficial for long-form document processing, complex codebases, and detailed dialogue tasks. In comparative evaluations, Llama 4 Scout has demonstrated superior performance relative to contemporary models such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across recognized benchmark datasets.

Parallel to Scout, Llama 4 Maverick, also built upon a 17-billion-active-parameter architecture, incorporates 128 expert modules explicitly designed to enhance visual grounding. This design facilitates precise alignment between textual prompts and associated visual elements, enabling targeted responses grounded accurately to specific image regions. Maverick exhibits robust performance in comparative assessments, surpassing GPT-4o and Gemini 2.0 Flash, particularly in multimodal reasoning tasks. Additionally, Maverick has achieved comparable outcomes to DeepSeek v3 on reasoning and coding benchmarks while employing approximately half the active parameters.

A key feature of Maverick is its noteworthy performance-to-cost efficiency. Benchmarking efforts, specifically on the LMArena platform, have recorded an Elo rating of 1417 for Maverick’s chat-optimized version, indicating its computational efficiency and practical applicability in conversational and multimodal contexts.

The development of Scout and Maverick draws heavily from distillation techniques derived from the ongoing training of Meta’s more powerful model, Llama 4 Behemoth. Behemoth, which remains under active training, has preliminarily shown significant advantages over established models such as GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro, particularly within STEM-focused benchmarks. The insights and advanced methodologies from Behemoth have been instrumental in refining Scout and Maverick’s technical capabilities.

With the introduction of Llama 4, Meta AI advances multimodal artificial intelligence through highly refined and technically sophisticated models capable of deep semantic understanding and precise multimodal alignment. This release further exemplifies Meta AI’s ongoing commitment to fostering innovation and maintaining open accessibility for researchers, developers, and enterprise applications.

Future progress in multimodal AI is anticipated with the finalization and public release of Llama 4 Behemoth. Initial results indicate Behemoth’s potential to set new standards within multimodal performance, particularly in STEM applications and computational reasoning tasks. Meta AI plans to disclose detailed technical specifications and performance metrics upon completion of the Behemoth model.

The announcement underscores Meta AI’s dedication to pushing the technical limits of multimodal modeling, supporting the evolution of practical and research-oriented AI applications across diverse sectors including scientific research, education, and complex conversational systems. As Meta AI continues this trajectory, the technological advancements embodied in Llama 4 Scout, Maverick, and eventually Behemoth are expected to facilitate substantial progress in the computational and practical capabilities of multimodal AI.

Check out the Benchmarks and Download Llama 4. All credit for this research goes to the researchers of this project.

The post Meta AI Just Released Llama 4 Scout and Llama 4 Maverick: The First Set of Llama 4 Models appeared first on MarkTechPost.

Scalable Reinforcement Learning with Verifiable Rewards: Generative Re …

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs’ reasoning and coding abilities, particularly in domains where structured reference answers allow clear-cut verification. This approach relies on reference-based signals to determine if a model’s response aligns with a known correct answer, typically through binary correctness labels or graded scores. RLVR has mainly been applied to areas like math and coding, where rule-based or tool-assisted verification is straightforward. However, expanding RLVR to more complex and less structured tasks has been difficult due to challenges in verifying open-ended or ambiguous reference responses. Although generative models and closed-source LLMs like GPT-4o have been explored as verifiers, these solutions often remain domain-specific and require extensive annotated datasets for training.

Recent developments aim to broaden RLVR applications by introducing generative reward modeling, where LLMs use their generative abilities to produce judgments and justifications. These models can be trained without detailed rationales, instead relying on the confidence of the verifier’s outputs to generate stable reward signals. This technique supports reinforcement learning in tasks with noisy or ambiguous labels. Furthermore, researchers are exploring RLVR in a wider variety of domains using more free-form reference answers—sourced from expert annotations and pretraining data or generated by LLMs—moving beyond narrowly defined tasks like math and logic puzzles. These efforts mark a significant step toward scalable and domain-general RLVR training.

Tencent AI Lab and Soochow University researchers are exploring extending RLVR to complex, unstructured domains like medicine, chemistry, and education. They show that binary correctness judgments remain consistent across LLMs when expert-written references are available. To address the limitations of binary rewards in free-form tasks, they introduce soft, generative model-based reward signals. Using compact 7B models, they train cross-domain reward verifiers without requiring extensive domain-specific annotation. Their RLVR framework significantly outperforms top open-source models in reasoning tasks and scales effectively. They also release a 570k-example dataset to support further research in multi-domain RLVR.

The method uses expert-written reference answers to guide reward estimation for reinforcement learning. Responses are evaluated using a generative LLM verifier, which outputs binary (0/1) or soft rewards based on the likelihood of correctness. Rewards are normalized using z-score normalization for stable training and better learning dynamics. The authors train a compact (7B) generative reward model using judgments collected during RL exploration to avoid relying solely on large models. These binary labels are obtained from a larger LLM and used to fine-tune the smaller verifier. This approach balances performance and efficiency while increasing robustness to noise and formatting variations.
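As a minimal illustration (not the paper’s implementation), z-score normalization of verifier rewards for a batch of sampled responses might look like the following; the soft reward values are hypothetical:

import numpy as np

def z_normalize_rewards(raw_rewards, eps=1e-8):
    # Center and scale a batch of verifier rewards so RL updates stay well-conditioned
    r = np.asarray(raw_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

soft_rewards = [0.92, 0.10, 0.75, 0.30]  # hypothetical verifier confidence scores
print(z_normalize_rewards(soft_rewards))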

The study uses two large-scale Chinese QA datasets—one with 773k free-form math questions across school levels and another with 638k multi-subject college-level questions from ExamQA. These datasets feature complex, unstructured answers that challenge rule-based reward methods. The researchers trained a 7B reward model (RM-7B) using 160k distilled samples and tested various RL approaches. Results show that RL with model-based rewards outperforms rule-based methods and supervised fine-tuning (SFT), especially in reasoning tasks. Notably, RM-7B achieves performance close to the larger 72B model, highlighting its efficiency. Binary rewards outperform soft rewards in rule-based settings due to semantic mismatch issues.

In conclusion, the study simplifies reward modeling by training a generative model to output binary scores (1 or 0) without relying on chain-of-thought reasoning. While CoT aids in reasoning, its necessity for verifying semantic similarity remains unclear. Unlike past work that relied on format-based scoring, this approach avoids strict answer formatting, reducing manual effort. The research extends RLVR beyond structured domains to areas like medicine and economics, where reference answers are less defined. Using a 7B model, it shows that soft, model-based rewards enhance performance in free-form tasks, outperforming larger models and improving RLVR’s adaptability and scalability.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Scalable Reinforcement Learning with Verifiable Rewards: Generative Reward Modeling for Unstructured, Multi-Domain Tasks appeared first on MarkTechPost.

NVIDIA AI Released AgentIQ: An Open-Source Library for Efficiently Con …

Enterprises increasingly adopt agentic frameworks to build intelligent systems capable of performing complex tasks by chaining tools, models, and memory components. However, as organizations build these systems across multiple frameworks, challenges arise regarding interoperability, observability, performance profiling, and workflow evaluation. Teams are often locked into particular frameworks, making it hard to scale or reuse agents and tools across different contexts. Also, debugging agentic workflows or identifying inefficiencies becomes arduous without unified profiling and evaluation tools. The lack of a standardized way to build and monitor these systems creates a significant bottleneck in agile AI development and deployment.

NVIDIA has introduced AgentIQ, a lightweight and flexible Python library designed to unify agentic workflows across frameworks, memory systems, and data sources. Instead of replacing existing tools, AgentIQ enhances them, bringing composability, observability, and reusability to the forefront of AI system design. With AgentIQ, every agent, tool, and workflow is treated as a function call, allowing developers to mix and match components from different frameworks with minimal overhead. The release aims to streamline development, enabling detailed profiling and end-to-end evaluation across agentic systems.

AgentIQ is packed with features that make it a compelling solution for developers and enterprises building complex agentic systems:

Framework Agnostic Design: AgentIQ integrates seamlessly with any agentic framework, such as LangChain, Llama Index, Crew.ai, Microsoft Semantic Kernel, and custom Python agents. This allows teams to continue using their current tools without replatforming.

Reusability and Composability: Every component, whether an agent, a tool, or a workflow, is treated like a function call that can be reused, repurposed, and combined in different configurations.

Rapid Development: Developers can start with prebuilt components and customize workflows quickly, saving time in system design and experimentation.

Profiling and Bottleneck Detection: The built-in profiler allows detailed tracking of token usage, response timings, and hidden latencies at a granular level, helping teams optimize system performance.

Observability Integration: AgentIQ works with any OpenTelemetry-compatible observability platform, allowing deep insights into how each part of the workflow functions.

Evaluation System: A consistent and robust evaluation mechanism helps teams validate and maintain the accuracy of both Retrieval-Augmented Generation (RAG) and end-to-end (E2E) workflows.

User Interface: AgentIQ includes a chat-based UI for real-time agent interaction, output visualization, and workflow debugging.

MCP Compatibility: AgentIQ supports the Model Context Protocol (MCP), making it easier to incorporate tools hosted on MCP servers as function calls.

AgentIQ is best described as a complement to existing frameworks rather than a competitor. It does not aim to be another agentic framework, nor does it try to solve agent-to-agent communication; this remains the domain of protocols like HTTP and gRPC. AgentIQ also refrains from replacing observability platforms; instead, it provides the hooks and telemetry data that can be routed into whichever monitoring system the team prefers. It uniquely connects and profiles multi-agent workflows, even when deeply nested, using a function-call-based architecture. It combines agents and tools developed in different ecosystems and enables robust evaluation and monitoring from a centralized perspective. AgentIQ is also fully opt-in; users can integrate it at any level, whether at the tool, agent, or entire workflow level, depending on their needs.


AgentIQ’s design opens the door to multiple enterprise use cases. For example, a customer support system built using LangChain and custom Python agents can now integrate seamlessly with analytics tools running in Llama Index or Semantic Kernel. Developers can run profiling to identify which agent or tool in the workflow is causing a bottleneck or using too many tokens, and evaluate the system’s response consistency and relevance over time.

Installing AgentIQ is straightforward. It supports Ubuntu and other Linux-based distributions, including WSL, and uses modern Python environment management tools. After cloning the GitHub repository, users initialize submodules, install Git LFS for dataset handling, and create a virtual environment with `uv`. Developers can then install the full AgentIQ library and plugins using `uv sync --all-groups --all-extras` or opt for a core installation with `uv sync`. Plugins like `langchain` or `profiling` can be installed as needed. The installation is verified using the `aiq --help` and `aiq --version` commands.

In conclusion, AgentIQ represents a significant step toward modular, interoperable, and observable agentic systems. Functioning as a unifying layer across frameworks and data sources empowers development teams to build sophisticated AI applications without worrying about compatibility, performance bottlenecks, or evaluation inconsistencies. Its profiling capabilities, evaluation system, and support for popular frameworks make it a critical tool in the AI developer’s arsenal. Also, AgentIQ’s opt-in approach ensures teams can start small, perhaps profiling just one tool or agent, and scale up as they see value. With future updates on the roadmap, including NeMo Guardrails integration, agentic accelerations in partnership with Dynamo, and a data feedback loop, AgentIQ is poised to become a foundational layer in enterprise agent development. For any team aiming to build, monitor, and optimize AI-driven workflows at scale, AgentIQ is the bridge that connects ideas to efficient execution.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.

The post NVIDIA AI Released AgentIQ: An Open-Source Library for Efficiently Connecting and Optimizing Teams of AI Agents appeared first on MarkTechPost.

Building Your AI Q&A Bot for Webpages Using Open Source AI Models

In today’s information-rich digital landscape, navigating extensive web content can be overwhelming. Whether you’re researching for a project, studying complex material, or trying to extract specific information from lengthy articles, the process can be time-consuming and inefficient. This is where an AI-powered Question-Answering (Q&A) bot becomes invaluable.

This tutorial will guide you through building a practical AI Q&A system that can analyze webpage content and answer specific questions. Instead of relying on expensive API services, we’ll utilize open-source models from Hugging Face to create a solution that’s:

Completely free to use

Runs in Google Colab (no local setup required)

Customizable to your specific needs

Built on cutting-edge NLP technology

By the end of this tutorial, you’ll have a functional web Q&A system that can help you extract insights from online content more efficiently.

What We’ll Build

We’ll create a system that:

Takes a URL as input

Extracts and processes the webpage content

Accepts natural language questions about the content

Provides accurate, contextual answers based on the webpage

Prerequisites

A Google account to access Google Colab

Basic understanding of Python

No prior machine learning knowledge required

Step 1: Setting Up the Environment

First, let’s create a new Google Colab notebook. Go to Google Colab and create a new notebook.

Let’s start by installing the necessary libraries:

# Install required packages

!pip install transformers torch beautifulsoup4 requests

This installs:

transformers: Hugging Face’s library for state-of-the-art NLP models

torch: PyTorch deep learning framework

beautifulsoup4: For parsing HTML and extracting web content

requests: For making HTTP requests to webpages

Step 2: Import Libraries and Set Up Basic Functions

Now let’s import all the necessary libraries and define some helper functions:

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import requests
from bs4 import BeautifulSoup
import re
import textwrap

# Check if GPU is available

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Function to extract text from a webpage

def extract_text_from_url(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove non-content elements before extracting text
        for script_or_style in soup(['script', 'style', 'header', 'footer', 'nav']):
            script_or_style.decompose()

        text = soup.get_text()

        # Collapse whitespace and drop empty fragments
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        text = '\n'.join(chunk for chunk in chunks if chunk)

        text = re.sub(r'\s+', ' ', text).strip()

        return text

    except Exception as e:
        print(f"Error extracting text from URL: {e}")
        return None

This code:

Imports all necessary libraries

Sets up our device (GPU if available, otherwise CPU)

Creates a function to extract readable text content from a webpage URL

Step 3: Load the Question-Answering Model

Now let’s load a pre-trained question-answering model from Hugging Face:

# Load pre-trained model and tokenizer

model_name = "deepset/roberta-base-squad2"

print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)
print("Model loaded successfully!")

We’re using deepset/roberta-base-squad2, which is:

Based on RoBERTa architecture (a robustly optimized BERT approach)

Fine-tuned on SQuAD 2.0 (Stanford Question Answering Dataset)

A good balance between accuracy and speed for our task

Step 4: Implement the Question-Answering Function

Now, let’s implement the core functionality – the ability to answer questions based on the extracted webpage content:

def answer_question(question, context, max_length=512):
    # Leave room for the question tokens and special tokens in each chunk
    max_chunk_size = max_length - len(tokenizer.encode(question)) - 5
    all_answers = []

    for i in range(0, len(context), max_chunk_size):
        chunk = context[i:i + max_chunk_size]

        inputs = tokenizer(
            question,
            chunk,
            add_special_tokens=True,
            return_tensors="pt",
            max_length=max_length,
            truncation=True
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        # Most likely start and end positions of the answer span
        answer_start = torch.argmax(outputs.start_logits)
        answer_end = torch.argmax(outputs.end_logits)

        start_score = outputs.start_logits[0][answer_start].item()
        end_score = outputs.end_logits[0][answer_end].item()
        score = start_score + end_score

        input_ids = inputs.input_ids.tolist()[0]
        tokens = tokenizer.convert_ids_to_tokens(input_ids)

        answer = tokenizer.convert_tokens_to_string(tokens[answer_start:answer_end+1])

        # Strip any leftover special tokens
        answer = answer.replace("[CLS]", "").replace("[SEP]", "").strip()

        if answer and len(answer) > 2:
            all_answers.append((answer, score))

    if all_answers:
        # Return the answer with the highest combined confidence score
        all_answers.sort(key=lambda x: x[1], reverse=True)
        return all_answers[0][0]
    else:
        return "I couldn't find an answer in the provided content."

This function:

Takes a question and the webpage content as input

Handles long content by processing it in chunks

Uses the model to predict the answer span (start and end positions)

Processes multiple chunks and returns the answer with the highest confidence score

Step 5: Testing and Examples

Let’s test our system with some examples. Here’s the complete code:

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
webpage_text = extract_text_from_url(url)

print("Sample of extracted text:")
print(webpage_text[:500] + "...")

questions = [
    "When was the term artificial intelligence first used?",
    "What are the main goals of AI research?",
    "What ethical concerns are associated with AI?"
]

for question in questions:
    print(f"\nQuestion: {question}")
    answer = answer_question(question, webpage_text)
    print(f"Answer: {answer}")

This will demonstrate how the system works with real examples.

Output of the above code
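For convenience, you can also wrap the two functions into a single helper. This small sketch assumes extract_text_from_url and answer_question are defined exactly as above:

def ask_webpage(url, question):
    """Fetch a webpage and answer a question about its content."""
    text = extract_text_from_url(url)
    if not text:
        return "Could not extract any text from that URL."
    return answer_question(question, text)

print(ask_webpage("https://en.wikipedia.org/wiki/Artificial_intelligence",
                  "What are the main goals of AI research?"))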

Limitations and Future Improvements

Our current implementation has some limitations:

It can struggle with very long webpages due to context length limitations

The model may not understand complex or ambiguous questions

It works best with factual content rather than opinions or subjective material

Future improvements could include:

Implementing semantic search to better handle long documents

Adding document summarization capabilities

Supporting multiple languages

Implementing memory of previous questions and answers

Fine-tuning the model on specific domains (e.g., medical, legal, technical)

Conclusion

Now you’ve successfully built your AI-powered Q&A system for webpages using open-source models. This tool can help you:

Extract specific information from lengthy articles

Research more efficiently

Get quick answers from complex documents

By utilizing Hugging Face’s powerful models and the flexibility of Google Colab, you’ve created a practical application that demonstrates the capabilities of modern NLP. Feel free to customize and extend this project to meet your specific needs.

Useful Resources

Hugging Face Transformers Documentation

More about Question Answering Models

SQuAD Dataset Information

BeautifulSoup Documentation

Here is the Colab Notebook.

The post Building Your AI Q&A Bot for Webpages Using Open Source AI Models appeared first on MarkTechPost.

Augment Code Released Augment SWE-bench Verified Agent: An Open-Source …

AI agents are increasingly vital in helping engineers efficiently handle complex coding tasks. However, one significant challenge has been accurately assessing and ensuring these agents can handle real-world coding scenarios beyond simplified benchmark tests. 

Augment Code has announced the launch of their Augment SWE-bench Verified Agent, a development in agentic AI tailored specifically for software engineering. This release places them at the top of open-source agent performance on the SWE-bench leaderboard. By combining the strengths of Anthropic’s Claude Sonnet 3.7 and OpenAI’s O1 model, Augment Code’s approach has delivered impressive results, showcasing a compelling blend of innovation and pragmatic system architecture.

The SWE-bench benchmark is a rigorous test that measures an AI agent’s effectiveness in handling practical software engineering tasks drawn directly from GitHub issues in prominent open-source repositories. Unlike traditional coding benchmarks, which generally focus on isolated, algorithmic-style problems, SWE-bench offers a more realistic testbed that requires agents to navigate existing codebases, identify relevant tests autonomously, create scripts, and iterate against comprehensive regression test suites.

Augment Code’s initial submission has achieved a 65.4% success rate, a notable achievement in this demanding environment. The company focused its first effort on leveraging existing state-of-the-art models, specifically Anthropic’s Claude Sonnet 3.7 as the primary driver for task execution and OpenAI’s O1 model for ensembling. This approach strategically bypassed training proprietary models at this initial phase, establishing a robust baseline.


One interesting aspect of Augment’s methodology was their exploration into different agent behaviors and strategies. For example, they found that certain expected beneficial techniques like Claude Sonnet’s ‘thinking mode’ and separate regression-fixing agents did not yield meaningful performance improvements. This highlights the nuanced and sometimes counterintuitive dynamics in agent performance optimization. Also, basic ensembling techniques such as majority voting were explored but ultimately abandoned due to cost and efficiency considerations. However, simple ensembling with OpenAI’s O1 did provide incremental improvements in accuracy, underscoring the value of ensembling even in constrained scenarios.

While the success of Augment Code’s initial SWE-bench submission is commendable, the company is transparent about the benchmark’s limitations. Notably, SWE-bench problems are heavily skewed toward bug fixing rather than feature creation, the provided descriptions are more structured and LLM-friendly than typical real-world developer prompts, and the benchmark uses Python exclusively. Real-world complexities, such as navigating massive production codebases and dealing with less descriptive programming languages, pose challenges that SWE-bench does not capture.

Augment Code has openly acknowledged these limitations, emphasizing its continued commitment to optimizing agent performance beyond benchmark metrics. They stress that while improvements to prompts and ensembling can boost quantitative results, qualitative customer feedback and real-world usability remain its priorities. The ultimate goal for Augment Code is developing cost-effective, fast agents capable of providing unparalleled coding assistance in practical professional environments.

As part of its future roadmap, Augment is actively exploring the fine-tuning of proprietary models using RL techniques and proprietary data. Such advancements promise to enhance model accuracy and significantly reduce latency and operational costs, facilitating more accessible and scalable AI-driven coding assistance.

Some of the key takeaways from the Augment SWE-bench Verified Agent include:

Augment Code released Augment SWE-bench Verified Agent, achieving the top spot among open-source agents.

The agent combines Anthropic’s Claude Sonnet 3.7 as its core driver and OpenAI’s O1 model for ensembling.

Achieved a 65.4% success rate on SWE-bench, highlighting robust baseline capabilities.

Found counterintuitive results, where anticipated beneficial features like ‘thinking mode’ and separate regression-fixing agents offered no substantial performance gains.

Identified cost-effectiveness as a critical barrier to implementing extensive ensembling in real-world scenarios.

Acknowledged benchmark limitations, including its bias towards Python and smaller-scale bug-fixing tasks.

Future improvements will focus on cost reduction, lower latency, and improved usability through reinforcement learning and fine-tuning proprietary models.

Highlighted the importance of balancing benchmark-driven improvements with qualitative user-centric enhancements.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.

The post Augment Code Released Augment SWE-bench Verified Agent: An Open-Source Agent Combining Claude Sonnet 3.7 and OpenAI O1 to Excel in Complex Software Engineering Tasks appeared first on MarkTechPost.

NVIDIA AI Releases HOVER: A Breakthrough AI for Versatile Humanoid Con …

Expectations for humanoid robotics have been high for many years: human-like robots that can navigate our environments, perform complex tasks, and work alongside humans. Examples include robots conducting precise surgical procedures, building intricate structures, assisting in disaster response, and cooperating efficiently with humans in settings such as factories, offices, and homes. However, actual progress has historically been limited.

Researchers from NVIDIA, Carnegie Mellon University, UC Berkeley, UT Austin, and UC San Diego introduced HOVER, a unified neural controller aimed at enhancing humanoid robot capabilities. This research proposes a multi-mode policy distillation framework, integrating different control strategies into one cohesive policy, thereby making a notable advancement in humanoid robotics.

The Achilles Heel of Humanoid Robotics: The Control Conundrum

Imagine a robot that can execute a perfect backflip but then struggles to grasp a doorknob.

The problem? Specialization.

Humanoid robots are incredibly versatile platforms, capable of supporting a wide range of tasks, including bimanual manipulation, bipedal locomotion, and complex whole-body control. However, despite impressive advances in these areas, researchers have typically employed different control formulations designed for specific scenarios.

Some controllers excel at locomotion, using “root velocity tracking” to guide movement. This approach focuses on controlling the robot’s overall movement through space.

Others prioritize manipulation, relying on “joint angle tracking” for precise movements. This approach allows for fine-grained control of the robot’s limbs.

Still others use “kinematic tracking” of key points for teleoperation. This method enables a human operator to control the robot by tracking their own movements.

Each speaks a different control language, creating a fragmented landscape where robots are masters of one task and inept at others. Switching between tasks has been clunky, inefficient, and often impossible. This specialization creates practical limitations. For example, a robot designed for bipedal locomotion on uneven terrain using root velocity tracking would struggle to transition smoothly to precise bimanual manipulation tasks that require joint angle or end-effector tracking.

In addition to that, many pre-trained manipulation policies operate across different configuration spaces, such as joint angles and end-effector positions. These constraints highlight the need for a unified low-level humanoid controller capable of adapting to diverse control modes.

HOVER: The Unified Field Theory of Robotic Control

HOVER is a paradigm shift. It’s a “generalist policy”—a single neural network that harmonizes diverse control modes, enabling seamless transitions and unprecedented versatility. HOVER supports diverse control modes, including over 15 useful configurations for real-world applications on a 19-DOF humanoid robot. This versatile command space encompasses most of the modes used in previous research.

Learning from the Masters: Human Motion Imitation

HOVER's brilliance lies in its foundation: learning from human movement itself. By training an "oracle motion imitator" on a massive dataset of human motion capture (MoCap) data, HOVER absorbs the fundamental principles of balance, coordination, and efficient movement. This approach exploits the natural adaptability and efficiency of human movement, providing the policy with rich motor priors that can be reused across multiple control modes. The researchers ground the training process in human-like motion, allowing the policy to develop a deeper understanding of balance, coordination, and motion control, crucial elements for effective whole-body humanoid behavior.

From Oracle to Prodigy: Policy Distillation

The magic truly happens through "policy distillation." The oracle policy, the master imitator, teaches a "student policy" (HOVER) its skills. Through a process involving command masking and a DAgger framework, HOVER learns to master diverse control modes, from kinematic position tracking to joint angle control and root tracking. This creates a "generalist" capable of handling any control scenario.

Through policy distillation, these motor skills are transferred from the oracle policy into a single "generalist policy" capable of handling multiple control modes. The resulting multi-mode policy supports diverse control inputs and outperforms policies trained individually for each mode. The researchers hypothesize this superior performance stems from the policy using shared physical knowledge across modes, such as maintaining balance, human-like motion, and precise limb control. These shared skills enhance generalization, leading to better performance across all modes, while single-mode policies often overfit specific reward structures and training environments.

HOVER's implementation involves training an oracle policy followed by knowledge distillation to create a versatile controller. The oracle policy processes proprioceptive information, including position, orientation, velocities, and previous actions alongside reference poses, to generate optimal movements. The oracle achieves robust motion imitation using a carefully designed reward system with penalty, regularization, and task components. The student policy then learns from this oracle through a DAgger framework, incorporating model-based and sparsity-based masking techniques that allow selective tracking of different body parts. This distillation process minimizes the action difference between teacher and student, creating a unified controller capable of handling diverse control scenarios.

The researchers formulate humanoid control as a goal-conditioned reinforcement learning task where the policy is trained to track real-time human motion. The state includes the robot's proprioception and a unified target goal state. Using these inputs, they define a reward function for policy optimization. The actions represent target joint positions that are fed into a PD controller. The system employs Proximal Policy Optimization (PPO) to maximize cumulative discounted rewards, essentially training the humanoid to follow target commands at each timestep.

The research methodology utilizes motion retargeting techniques to create feasible humanoid movements from human motion datasets. This three-step process begins with computing keypoint positions through forward kinematics, fitting the SMPL model to align with these key points, and retargeting the AMASS dataset by matching corresponding points between models using gradient descent. The "sim-to-data" procedure converts the large-scale human motion dataset into feasible humanoid motions, establishing a strong foundation for training the controller.

The research team designed a comprehensive command space for humanoid control that overcomes the limitations of previous approaches. Their unified framework accommodates multiple control modes simultaneously, including kinematic position tracking, joint angle tracking, and root tracking. This design satisfies key criteria of generality (supporting various input devices) and atomicity (enabling arbitrary combinations of control options).

HOVER Unleashed: Performance That Redefines Robotics

HOVER‘s capabilities are proven by rigorous testing:

Dominating the Specialists: HOVER outperforms specialized controllers across the board. The research team evaluated HOVER against specialist policies and alternative multi-mode training approaches through comprehensive tests in both IsaacGym simulation and real-world implementations using the Unitree H1 robot. To address whether HOVER could outperform specialized policies, they compared it against various specialists, including ExBody, HumanPlus, H2O, and OmniH2O, each designed for different tracking objectives such as joint angles, root velocity, or specific key points.

In evaluations using the retargeted AMASS dataset, HOVER consistently demonstrated superior generalization, outperforming specialists in at least 7 out of 12 metrics in every command mode. HOVER performed better than specialists trained for specific useful control modes like left-hand, right-hand, two-hand, and head tracking.

Multi-Mode Mastery: A Clean Sweep. When compared to other multi-mode training methods, the researchers implemented a baseline that used the same masking process but was trained from scratch with reinforcement learning. Radar charts visualizing tracking errors across eight distinct control modes showed HOVER consistently achieving lower errors across all 32 metrics and modes. This decisive victory underscores the effectiveness of distilling knowledge from an oracle policy that tracks full-body kinematics rather than training with reinforcement learning from scratch.

From Simulation to Reality: Real-World Validation. HOVER's prowess is not confined to the digital world. The experimental setup included motion tracking evaluations using the retargeted AMASS dataset in simulation and 20 standing motion sequences for the real-world tests on the 19-DOF Unitree H1 platform, weighing 51.5 kg and standing 1.8 m tall. The experiments were structured to answer three key questions about HOVER's generalizability, comparative performance, and real-world transferability.

On the Unitree H1 robot, a 19-DOF humanoid weighing 51.5kg and standing 1.8m tall, HOVER flawlessly tracked complex standing motions, dynamic running movements, and smoothly transitioned between control modes during locomotion and teleoperation. Experiments conducted in both simulation and on a physical humanoid robot show that HOVER achieves seamless transitions between control modes and delivers superior multi-mode control compared to baseline approaches.

HOVER: The Future of Humanoid Potential

HOVER unlocks the vast potential of humanoid robots. The multi-mode generalist policy also enables seamless transitions between modes, making it robust and versatile.

Imagine a future where humanoids:

Perform intricate surgery with unparalleled precision.

Construct complex structures with human-like dexterity.

Respond to disasters with agility and resilience.

Collaborate seamlessly with humans in factories, offices, and homes.

The age of truly versatile, capable, and intelligent humanoids is on the horizon, and HOVER is leading the way. Their evaluations collectively illustrate HOVER‘s ability to handle diverse real-world control modes, offering superior performance compared to specialist policies.

Sources:

https://arxiv.org/pdf/2410.21229

https://github.com/NVlabs/HOVER/tree/main 

https://github.com/NVlabs/HOVER/tree/main?tab=readme-ov-file

https://arxiv.org/abs/2410.21229

Thanks to the NVIDIA team for the thought leadership and resources for this article. The NVIDIA team has supported and sponsored this content.
The post NVIDIA AI Releases HOVER: A Breakthrough AI for Versatile Humanoid Control in Robotics appeared first on MarkTechPost.

Prompting for the best price-performance

In the drive to remain competitive, businesses today are turning to AI to help them minimize cost and maximize efficiency. It’s incumbent on them to find the most suitable AI model—the one that will help them achieve more while spending less. For many businesses, the migration from OpenAI’s model family to Amazon Nova represents not only a shift in model but a strategic move toward scalability, efficiency, and broader multimodal capabilities.
In this blog, we discuss how to optimize prompting in Amazon Nova for the best price-performance.
Why migrate from OpenAI to Amazon Nova?
OpenAI’s models remain powerful, but their operational costs can be prohibitive when scaled. Consider these figures from Artificial Analysis:

| Model | Input Token Cost (per Million Tokens) | Output Token Cost (per Million Tokens) | Context Window | Output Speed (Tokens per Second) | Latency (Seconds to First Token) |
|---|---|---|---|---|---|
| GPT-4o | ~$2.50 | ~$10.00 | Up to 128K tokens | ~63 | ~0.49 |
| GPT-4o Mini | ~$0.15 | ~$0.60 | Up to 128K tokens | ~90 | ~0.43 |
| Nova Micro | ~$0.035 | ~$0.14 | Up to 128K tokens | ~195 | ~0.29 |
| Nova Lite | ~$0.06 | ~$0.24 | Up to 300K tokens | ~146 | ~0.29 |
| Nova Pro | ~$0.80 | ~$3.20 | Up to 300K tokens | ~90 | ~0.34 |

For high-volume applications—like global customer support or large-scale document analysis—these cost differences are disruptive. Not only does Amazon Nova Pro offer over three times the cost-efficiency, its longer context window also enables it to handle more extensive and complex inputs.
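A quick back-of-the-envelope check against the table above makes the gap tangible; the workload mix here (one million input and one million output tokens) is only an illustrative assumption:

# Approximate per-million-token prices from the table above
gpt_4o_cost   = 2.50 + 10.00   # ~$12.50 for 1M input + 1M output tokens
nova_pro_cost = 0.80 + 3.20    # ~$4.00 for the same workload

print(f"GPT-4o: ${gpt_4o_cost:.2f}, Nova Pro: ${nova_pro_cost:.2f}, "
      f"ratio: {gpt_4o_cost / nova_pro_cost:.1f}x")  # roughly 3.1x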
Breaking down the Amazon Nova suite
Amazon Nova isn’t a single model—it’s a suite designed for various needs:

Amazon Nova Pro – A robust multimodal model that can process text, images, and video. It excels at tasks such as document analysis and deep data visualization. Benchmark comparisons show Amazon Nova Pro matching or even surpassing GPT-4o on complex reasoning tasks, according to section 2.1.1 of the Nova technical report and model card.
Amazon Nova Lite – Offers a balanced mix of multimodal processing and speed. Amazon Nova Lite is ideal for applications such as document summarization, translation, and even basic visual search, delivering quality outputs at lower latency and cost compared to GPT-4o Mini. You can find these benchmark results in section 2.1.2 of the Nova Technical Report and Model Card.
Amazon Nova Micro – A text-only model engineered for ultra-low latency. With output speed of up to 195 tokens per second, Amazon Nova Micro is perfect for real-time applications such as chat-based assistants and automated FAQs. Its token costs are dramatically lower than those of GPT-4o Mini—roughly 4.3 times cheaper on a per-token basis.

The lower per-token costs and higher output per second of Amazon Nova give you the flexibility to simplify prompts for real-time applications so you can balance quality, speed, and cost for your use case.
Understanding the foundations
To make the best decision about which model family fits your needs, it’s important to understand the differences in prompt engineering best practices between OpenAI and Amazon Nova. Each model family has its own strengths, but some principles apply to both: high-quality, accurate output comes from clear instructions, structured prompts, and iterative refinement. Whether you’re using strong output directives or clearly defined use cases, the goal is to reduce ambiguity and improve response quality.
The OpenAI approach
OpenAI uses a layered messaging system for prompt engineering, where system, developer, and user prompts work in harmony to control tone, safety, and output format. Their approach emphasizes:

Hierarchical message roles – Setting the model’s role and behavior using system messages makes sure that the overarching safety and style guidelines (set in system prompts) are preserved
Instruction placement and delimiters – Directives are placed at the beginning, with clear separation between context, examples, and queries
Selective chain-of-thought – Detailed, step-by-step reasoning is used when it benefits complex tasks
Formatting and structure – Using strong directives such as DO, MUST, and DO NOT to provide consistent outputs (for example, in JSON)

The Amazon Nova approach

Define the prompt use case

Task – What exactly the model should do
Role – Which role the model should assume
Response style – The structure or tone of the output
Instructions – Guidelines the model must follow

Chain-of-thought

Explicitly state the response – Provide clear and strong instructions that constrain the model’s response
Structured thinking – Asking the model to think step-by-step encourages structured reasoning

Formatting and structure

Use delimiters to section your prompts, for example, ##Task##, ##Context##, or ##Example##
Specify the output format, for example, JSON, YAML, or Markdown
Use strong instructions and caps, such as DO, DO NOT, or MUST
Prefill the responses to guide the model, for example, start with “{“ or “json…”

Evolving your prompt strategy: Migrating from OpenAI to Amazon Nova
Transitioning to Amazon Nova isn’t merely a change in API endpoints—it requires retooling your prompt engineering to align with the strengths of Amazon Nova. You need to reframe your use case definition. Begin by breaking down your current GPT-4o or GPT-4o Mini prompt into its core elements: task, role, response style, instructions, and success criteria. Make sure to structure these elements clearly to provide a blueprint for the model.
To understand how to migrate an existing OpenAI prompt to work optimally for Amazon Nova Pro, consider the following example using the meeting notes summarizer. Here is the GPT-4o system prompt:

You will be provided with meeting notes, and your task is to summarize the meeting
as follows:
-Overall summary of discussion
-Action items (what needs to be done and who is doing it)
-If applicable, a list of topics that need to be discussed more fully in the next meeting.

The user prompt is the meeting notes that need to be summarized:

Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters…

GPT produces this helpful response:

Overall Summary of Discussion: The team discussed the recent mission to Planet Zog,
the space pirate issue in Sector 7, the annual Intergalactic Bake-Off, planning for
the upcoming charity fundraiser, and a proposed team-building retreat. The mission to
Planet Zog was successful, but communication with the Zogians was challenging. The team
also discussed the need for a better strategy to deal with space pirates. The team placed
second in the Intergalactic Bake-Off and is planning for the upcoming charity fundraiser.
A team-building retreat to the Moon Resort and Spa was proposed.

Action Items:
1. Dr. Quasar to start working on a Zogian-English dictionary.
2. Captain Stardust to speak with Admiral Starbeam about increasing patrols in Sector 7.
3. Dr. Quasar to finalize the prototype of the new cloaking technology.
4. All team members to start gathering supplies and preparing games for the Intergalactic
Charity Bazaar.
5. Captain Stardust to check the budget for the proposed team-building retreat to the Moon
Resort and Spa.

Topics for Next Meeting:
1. Update on the Zogian-English dictionary.
2. Progress report on the cloaking technology.
3. Results of increased patrols in Sector 7.
4. Final preparations for the Intergalactic Charity Bazaar.

To meet or exceed the quality of the response from GPT-4o, here is what an Amazon Nova Pro prompt might look like. The prompt uses the same best practices discussed in this post, starting with the system prompt. We used a temperature of .2 and a topP of .9 here:

You are an experienced executive assistant skilled in meeting note analysis and
summarization. Your primary responsibilities include distilling complex discussions
into clear, actionable summaries.
Follow these instructions:

##INSTRUCTIONS##
1. Read and understand the meeting notes found in ##NOTES##
2. Put all of your outputs in a section called ##OUTPUTS## in markdown formatting
3. Summarize the meeting notes in 5 sentences or less. Put this in a section called
“Overall Summary”.
4. Numerically list any action items for specific people and what needs to be completed.
Put this list in a section called “Action Items”.
5. If applicable, list the topics that need to be discussed more fully in the next meeting.
Put this in a section called “Topics for Next Meeting”.

Here’s the user prompt, using prefilled responses:

##NOTES##
Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters
Attendees:
– Captain Stardust
– Dr. Quasar
– Lady Nebula
– Sir Supernova
– Ms. Comet
Meeting called to order by Captain Stardust at 2:05 PM
1. Introductions and welcome to our newest team member, Ms. Comet
2. Discussion of our recent mission to Planet Zog
– Captain Stardust: “Overall, a success, but communication with the Zogians was difficult.
We need to improve our language skills.”
– Dr. Quasar: “Agreed. I’ll start working on a Zogian-English dictionary right away.”
– Lady Nebula: “The Zogian food was out of this world, literally! We should consider having
a Zogian food night on the ship.”
3. Addressing the space pirate issue in Sector 7
– Sir Supernova: “We need a better strategy for dealing with these pirates. They’ve already
plundered three cargo ships this month.”
– Captain Stardust: “I’ll speak with Admiral Starbeam about increasing patrols in that area.
– Dr. Quasar: “I’ve been working on a new cloaking technology that could help our ships avoid
detection by the pirates. I’ll need a few more weeks to finalize the prototype.”
4. Review of the annual Intergalactic Bake-Off
– Lady Nebula: “I’m happy to report that our team placed second in the competition! Our Martian Mud
Pie was a big hit!”
– Ms. Comet: “Let’s aim for first place next year. I have a secret recipe for Jupiter Jello that I
think could be a winner.”
5. Planning for the upcoming charity fundraiser
– Captain Stardust: “We need some creative ideas for our booth at the Intergalactic Charity Bazaar.”
– Sir Supernova: “How about a ‘Dunk the Alien’ game? We can have people throw water balloons at a
volunteer dressed as an alien.”
– Dr. Quasar: “I can set up a ‘Name That Star’ trivia game with prizes for the winners.”
– Lady Nebula: “Great ideas, everyone. Let’s start gathering the supplies and preparing the games.”
6. Upcoming team-building retreat
– Ms. Comet: “I would like to propose a team-building retreat to the Moon Resort and Spa. It’s a great
opportunity to bond and relax after our recent missions.”
– Captain Stardust: “Sounds like a fantastic idea. I’ll check the budget and see if we can make it happen.”
7. Next meeting agenda items
– Update on the Zogian-English dictionary (Dr. Quasar)
– Progress report on the cloaking technology (Dr. Quasar)
– Results of increased patrols in Sector 7 (Captain Stardust)
– Final preparations for the Intergalactic Charity Bazaar (All)
Meeting adjourned at 3:15 PM. Next meeting scheduled for March 19th, 2050 at 2:00 PM in Conference Room 3B,
Intergalactic Headquarters.

##OUTPUTS##

The following example shows that the Amazon Nova response meets and exceeds the accuracy of the OpenAI example, formats the output in Markdown, and has found clear owners for each action item:

### Overall Summary
The meeting covered the successful mission to Planet Zog, the ongoing issue with
space pirates in Sector 7, the results of the annual Intergalactic Bake-Off, plans
for the upcoming charity fundraiser, and a proposed team-building retreat. Action items
were assigned, and the next meeting’s agenda was set.

### Action Items
1. **Dr. Quasar**: Begin working on a Zogian-English dictionary.
2. **Captain Stardust**: Speak with Admiral Starbeam about increasing patrols in Sector 7.
3. **Dr. Quasar**: Finalize the prototype for the new cloaking technology within a few weeks.
4. **Team**: Start gathering supplies and preparing games for the Intergalactic Charity Bazaar.
5. **Captain Stardust**: Check the budget for the proposed team-building retreat to the Moon Resort and Spa.

### Topics for Next Meeting
1. Update on the Zogian-English dictionary (Dr. Quasar)
2. Progress report on the cloaking technology (Dr. Quasar)
3. Results of increased patrols in Sector 7 (Captain Stardust)
4. Final preparations for the Intergalactic Charity Bazaar (All)

A few updates to the prompt can achieve comparable or better results from Amazon Nova Pro at a much lower inference cost.
Employ Amazon Nova extended context
Amazon Nova Lite and Amazon Nova Pro can support up to 300,000 input tokens, which means that you can include more context in your prompt if needed. Expand your background data and detailed instructions accordingly—if your original OpenAI prompt was optimized for 128,000 tokens, adjust it to use the Amazon Nova extended window.
Tailor output constraints
If your GPT prompt required strict formatting (for example, “Respond in JSON only”), make sure that your Amazon Nova prompt includes these directives. Additionally, if your task involves multimodal inputs, specify when to include images or video references.
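For example, here is a minimal sketch of carrying a strict-formatting directive over to Amazon Nova through the Bedrock Converse API (the client setup, system prompt, and review text are illustrative assumptions; the model ID is the Amazon Nova Lite identifier used later in this post):

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Carry the strict output directive from the original GPT prompt into the Nova request.
system = [{"text": "You are a data extraction assistant. Respond in JSON only, with the keys 'product' and 'sentiment'."}]
messages = [{
    "role": "user",
    "content": [{"text": "Review: 'The delivery was late but the speaker itself sounds fantastic.'"}]
}]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    system=system,
    messages=messages,
    inferenceConfig={"temperature": 0.2, "maxTokens": 300}
)
print(response["output"]["message"]["content"][0]["text"])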
Function calling
The rise of generative AI agents has made function calling, or tool calling, one of the most important abilities of a given large language model (LLM). A model’s ability to correctly pick the right tool for the job, in a low-latency manner, is often the difference between success and failure of an agentic system.
Both OpenAI and Amazon Nova models share similarities in function calling, in particular their support for structured API calls. Both model families support tool selection through defined tool schemas, which we discuss later in this post. They also both provide a mechanism to decide when to invoke these tools or not.
OpenAI’s function calling uses flexible JSON schemas to define and structure API interactions. The models support a wide range of schema configurations, which give developers the ability to quickly implement external function calls through straightforward JSON definitions tied to their API endpoints.
Here is an example of a function:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Montevideo, Uruguay"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the weather like in Punta del Este today?"}],
    tools=tools
)
Similar to OpenAI’s approach, Amazon Nova can call tools when passed a configuration schema, as shown in the following code example. Amazon Nova relies heavily on greedy decoding when calling tools, so it’s advised to set temperature, topP, and topK to 1; this gives the model the highest accuracy in tool selection. These greedy decoding parameters and other examples of tool use are covered in detail in Tool use (function calling) with Amazon Nova.
The following is an example of function calling without using additionalModelRequestFields:
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_recipe",
            "description": "Structured recipe generation system",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "recipe": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "ingredients": {
                                    "type": "array",
                                    "items": {
                                        "type": "object",
                                        "properties": {
                                            "item": {"type": "string"},
                                            "amount": {"type": "number"},
                                            "unit": {"type": "string"}
                                        }
                                    }
                                },
                                "instructions": {
                                    "type": "array",
                                    "items": {"type": "string"}
                                }
                            },
                            "required": ["name", "ingredients", "instructions"]
                        }
                    }
                }
            }
        }
    }]
}

# Base configuration without topK=1
input_text = "I need a recipe for chocolate lava cake"
messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

# Inference parameters
inf_params = {"topP": 1, "temperature": 1}

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params
)
# Typically produces less structured or incomplete output
 
The following example shows how function calling accuracy can be improved by using additionalModelRequestFields:
# Enhanced configuration with topK=1
response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Produces more accurate and structured function call

To maximize Amazon Nova function calling potential and improve accuracy, always use additionalModelRequestFields with topK=1. This forces the model to select the single most probable token and prevents random token selection, which increases deterministic output generation and improves function call precision by about 30–40%.
The following code examples further explain how to conduct tool calling successfully. The first scenario shows recipe generation without an explicit tool. The example doesn’t use topK, which typically results in responses that are less structured:
input_text = """
I'm looking for a decadent chocolate dessert that's quick to prepare.
Something that looks fancy but isn't complicated to make.
"""

messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    inferenceConfig={"topP": 1, "temperature": 1}
)
# Generates a conversational recipe description
# Less structured, more narrative-driven response

The second scenario shows recipe generation with a structured tool. We add topK set to 1, which produces a more structured output:
response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig={"topP": 1, "temperature": 1},
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Generates a highly structured, JSON-compliant recipe
# Includes precise ingredient measurements
# Provides step-by-step instructions
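Once the call returns, the structured tool call can be read out of the Converse response. The following is a minimal sketch (the field names follow the standard Converse response shape, and the downstream handler is a hypothetical placeholder):

# Read the structured tool call out of the Converse response.
# The content list can mix text blocks and toolUse blocks; we pick out the latter.
if response.get("stopReason") == "tool_use":
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            tool_use = block["toolUse"]
            print("Tool requested:", tool_use["name"])    # e.g. "get_recipe"
            print("Arguments:", tool_use["input"])        # JSON arguments the model filled in
            # recipe = build_recipe(**tool_use["input"])  # hypothetical downstream handler
else:
    # The model answered directly without selecting a tool.
    print(response["output"]["message"]["content"][0]["text"])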
Overall, OpenAI offers more flexible, broader schema support. Amazon Nova provides more precise, controlled output generation and is the best choice when working with high-stakes, structured data scenarios, as demonstrated in Amazon Nova’s performance on the IFEval benchmark discussed in section 2.1.1 of the technical report and model card. We recommend using Amazon Nova for applications requiring predictable, structured responses because its function calling methodology provides superior control and accuracy.
Conclusion
The evolution from OpenAI’s models to Amazon Nova represents a significant shift in using AI. It shows a transition toward models that deliver similar or superior performance at a fraction of the cost, with expanded capabilities in multimodal processing and extended context handling.
Whether you’re using the robust, enterprise-ready Amazon Nova Pro, the agile and economical Amazon Nova Lite, or the versatile Amazon Nova Micro, the benefits are clear:

Cost savings – With token costs up to four times lower, businesses can scale applications more economically
Enhanced response performance – Faster response times (up to 190 tokens per second) make real-time applications more viable
Expanded capabilities – A larger context window and multimodal support unlock new applications, from detailed document analysis to integrated visual content

By evolving your prompt strategy (redefining use cases, exploiting the extended context, and iteratively refining instructions), you can smoothly migrate your existing workflows from OpenAI’s GPT-4o and GPT-4o mini models to Amazon Nova.

About the Authors
Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio excels at guiding customers through their generative AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.
Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.

Evaluate models or RAG systems using Amazon Bedrock Evaluations – No …

Organizations deploying generative AI applications need robust ways to evaluate their performance and reliability. When we launched LLM-as-a-judge (LLMaaJ) and Retrieval Augmented Generation (RAG) evaluation capabilities in public preview at AWS re:Invent 2024, customers used them to assess their foundation models (FMs) and generative AI applications, but asked for more flexibility beyond Amazon Bedrock models and knowledge bases. Today, we’re excited to announce the general availability of these evaluation features in Amazon Bedrock Evaluations, along with significant enhancements that make them fully environment-agnostic.
The general availability release introduces “bring your own inference responses” capabilities for both RAG evaluation and model evaluation. This means you can now evaluate a RAG system or model—whether running on Amazon Bedrock, other cloud providers, or on premises—as long as you provide evaluation data in the required format. You can even evaluate the final responses of a fully built application. We’ve also added new citation metrics for the already-powerful RAG evaluation suite, including citation precision and citation coverage, to help you better assess how accurately your RAG system uses retrieved information. These metrics provide deeper insights into whether responses are properly grounded in the source documents and if they’re using the most relevant retrieved passages.
In this post, we explore these new features in detail, showing you how to evaluate both RAG systems and models with practical examples. We demonstrate how to use the comparison capabilities to benchmark different implementations and make data-driven decisions about your AI deployments. Whether you’re using Amazon Bedrock or other environments for your generative AI applications, these evaluation tools can help you optimize performance and promote quality across your entire generative AI portfolio.
What’s new: RAG evaluation
In this section, we discuss what’s new with RAG evaluation, including the introduction of Bring Your Own Inference responses for evaluating external RAG systems and new citation metrics that measure precision and coverage of RAG citations.
Bring Your Own Inference responses
The RAG evaluation capabilities have been significantly expanded to allow evaluation of outputs from RAG systems hosted anywhere through the evaluation environment in Amazon Bedrock. With Bring Your Own Inference (BYOI) responses, you can now evaluate retrieval and generation results from a variety of sources, including other FM providers, custom-built RAG systems, or deployed open-weights solutions, by providing the outputs in the required format. Additionally, the introduction of new citation metrics alongside our previously released quality and responsible AI metrics provides deeper insights into how well RAG systems use their knowledge bases and source documents.
When evaluating retrieval results from a knowledge base, we allow one knowledge base to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn currently has a prompt and referenceResponses. When using BYOI, you must provide your retrieval results in a new field called output (this field is required for BYOI jobs but optional and not needed for non-BYOI jobs). You must provide a knowledgeBaseIdentifier for every output. Additionally, for every retrieval result you bring, you can provide a name and additional metadata in the form of key-value pairs. The following is the input JSONL format for RAG evaluation jobs (Retrieve). Fields marked with ? are optional.
In the public preview period that ended March 20, 2025, the following dataset format contained a key called referenceContexts. This field has now been changed to referenceResponses to align with the Retrieve and Generate evaluation. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages or chunks retrieved from the knowledge base.

{
    "conversationTurns": [
        {
            "prompt": {
                "content": [
                    {
                        "text": string
                    }
                ]
            },
            "referenceResponses"?: [
                {
                    "content": [
                        {
                            "text": string
                        }
                    ]
                }
            ],
            "output"?: {
                "knowledgeBaseIdentifier": string
                "retrievedResults": {
                    "retrievalResults": [
                        {
                            "name"?: string
                            "content": {
                                "text": string
                            },
                            "metadata"?: {
                                [key: string]: string
                            }
                        }
                    ]
                }
            }
        }
    ]
}

When evaluating retrieval and generation results from a knowledge base or RAG system, we allow one knowledge base (RAG system) to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn has a prompt and referenceResponses. When using BYOI, you must provide the generated text, retrieved passages (for example, retrieval results), and citations for the generated text in a new field called output (this field is required for BYOI jobs but optional and not needed for non-BYOI jobs). The citations field will be used by the citation metrics. If you don’t have citations, you can enter dummy data in those fields and make sure to not select the citation precision or citation coverage metrics for the evaluation job. You must also provide a knowledgeBaseIdentifier for every output you bring. This knowledgeBaseIdentifier should be the same for all lines in the dataset and also at the job level. The following is the input JSONL format for RAG evaluation jobs (Retrieve and Generate). Fields marked with ? are optional.

{
    "conversationTurns": [
        {
            "prompt": {
                "content": [
                    {
                        "text": string
                    }
                ]
            },
            "referenceResponses"?: [
                {
                    "content": [
                        {
                            "text": string
                        }
                    ]
                }
            ],
            "output"?: {
                "text": string
                "modelIdentifier"?: string
                "knowledgeBaseIdentifier": string
                "retrievedPassages": {
                    "retrievalResults": [
                        {
                            "name"?: string
                            "content": {
                                "text": string
                            },
                            "metadata"?: {
                                [key: string]: string
                            }
                        }
                    ]
                },
                "citations": [
                    {
                        "generatedResponsePart": {
                            "textResponsePart": {
                                "span": {
                                    "start": number
                                    "end": number
                                },
                                "text": string
                            }
                        },
                        "retrievedReferences": [
                            {
                                "name"?: string
                                "content": {
                                    "text": string
                                },
                                "metadata"?: {
                                    [key: string]: string
                                }
                            }
                        ]
                    }
                ]
            }
        }
    ]
}
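To make the mapping concrete, the following minimal sketch converts one result from a hypothetical custom RAG system into the retrieve-and-generate BYOI format above and appends it to a JSONL dataset (the custom_result structure, the knowledge base identifier, and the file name are illustrative assumptions):

import json

# Hypothetical output from your own RAG system (field names are illustrative).
custom_result = {
    "question": "What is Amazon's SEC file number?",
    "answer": "Amazon's SEC file number is 000-22513.",
    "passages": ["Commission File No. 000-22513"],
    "citations": [{"start": 0, "end": 38, "text": "Amazon's SEC file number is 000-22513.",
                   "sources": ["Commission File No. 000-22513"]}],
}

record = {
    "conversationTurns": [{
        "prompt": {"content": [{"text": custom_result["question"]}]},
        "output": {
            "text": custom_result["answer"],
            "knowledgeBaseIdentifier": "third-party-RAG",
            "retrievedPassages": {
                "retrievalResults": [{"content": {"text": p}} for p in custom_result["passages"]]
            },
            "citations": [{
                "generatedResponsePart": {
                    "textResponsePart": {
                        "span": {"start": c["start"], "end": c["end"]},
                        "text": c["text"]
                    }
                },
                "retrievedReferences": [{"content": {"text": s}} for s in c["sources"]]
            } for c in custom_result["citations"]]
        }
    }]
}

# Append one JSON object per line to the evaluation dataset.
with open("rag_byoi_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")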

Citation metrics
Citation precision measures how accurately a RAG system cites its sources by evaluating whether the cited passages actually contain information used in the response. This metric helps identify when responses include unnecessary or irrelevant citations—a common issue where models might cite passages that weren’t actually used to generate the response content. The metric produces a score between 0–1, where 1 indicates perfect citation precision (all cited passages were relevant and used in the response), and 0 indicates complete imprecision (none of the cited passages should have been used). This helps teams identify and fix cases where their RAG systems are citing passages indiscriminately rather than selectively citing only the passages that contributed to the response.
Citation coverage evaluates how well a response’s content is supported by its citations, focusing on whether all information derived from retrieved passages has been properly cited. This metric works by comparing the faithfulness of the response to both the cited passages and the full set of retrieved passages, then calculating their ratio. In cases where the response is completely unfaithful to the retrieved passages (for example, hallucination), the metric indicates that citation coverage assessment isn’t applicable. The metric provides a score from 0–1, where scores closer to 0 indicate that the model failed to cite relevant supporting passages, and scores closer to 1 suggest proper citation of information from the retrieved passages. Importantly, a perfect score of 1 doesn’t necessarily mean all information in the response is cited—rather, it indicates that all information that could have been cited (based on the retrieved passages) was properly cited. This helps teams identify when their RAG systems are missing opportunities to cite relevant source material.
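As a rough intuition for how the two metrics relate, the following toy sketch computes both scores for a single response. This is an illustration only; in a real evaluation job the judge model determines passage usage and faithfulness itself, and the numbers below are placeholders.

# Toy illustration only -- the evaluator model produces these judgments in a real job.
cited_passages = ["Commission File No. 000-22513", "AMAZON.COM, INC."]
passages_actually_used = {"Commission File No. 000-22513"}  # passages that support the response text

# Citation precision: fraction of cited passages that were actually used in the response.
citation_precision = sum(p in passages_actually_used for p in cited_passages) / len(cited_passages)

# Citation coverage: faithfulness of the response to the cited passages,
# relative to its faithfulness to everything that was retrieved.
faithfulness_to_cited_passages = 0.8   # placeholder judge score
faithfulness_to_all_retrieved = 1.0    # placeholder judge score
citation_coverage = faithfulness_to_cited_passages / faithfulness_to_all_retrieved

print(citation_precision, citation_coverage)  # 0.5 0.8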
What’s new: Model evaluation
In this section, we discuss what’s new with model evaluation, including expanded Bring Your Own Inference capabilities that allow you to evaluate model responses from external providers and deployed solutions through both LLMaaJ and human evaluation workflows.
Bring Your Own Inference responses
The model evaluation capabilities have been significantly expanded to allow evaluation of outputs from models hosted anywhere through the evaluation environment in Amazon Bedrock. With BYOI responses, you can now evaluate model responses from other FM providers or deployed solutions by providing the outputs in the required format. BYOI is available for both LLMaaJ and human evaluation workflows.
You don’t have to limit these evaluations to foundation models. Because BYOI evaluation takes in the prompt and the output, it can evaluate the final response of a full application if you choose to bring that into your dataset.
When using LLMaaJ, only one model can be evaluated per evaluation job. Consequently, you must provide just a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLMaaJ with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}
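For example, here is a minimal sketch that writes one record in this format to a JSONL file and uploads it to S3 (the bucket, prefix, response text, and file name are placeholders you would replace with your own values):

import json
import boto3

# One evaluation record in the LLMaaJ BYOI format described above.
record = {
    "prompt": "If books cost $10.21 each and you buy 15 of them with a 20% discount, how much will you pay in total?",
    "referenceResponse": "The total price will be $122.52.",
    "category": "Shopping Math",
    "modelResponses": [{
        "response": "The total price will be $122.52.",  # the answer your model produced for this prompt
        "modelIdentifier": "third-party-model"
    }]
}

# JSONL: one JSON object per line.
with open("byoi_dataset.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

# Placeholder bucket and prefix -- replace with your own.
s3 = boto3.client("s3")
s3.upload_file("byoi_dataset.jsonl", "<YOUR_S3_BUCKET_NAME>", "<YOUR_BUCKET_PREFIX>/byoi_dataset.jsonl")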

When using human evaluation, up to two models can be evaluated per evaluation job. Consequently, you can provide up to two entries in the modelResponses list, with each response requiring a unique identifier. The modelResponses field is required for BYOI jobs but not needed for non-BYOI jobs. The following is the input JSONL format for human evaluation with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        },
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

Feature overview
The LLMaaJ evaluation workflow with BYOI enables you to systematically assess model outputs from your preferred source using Amazon Bedrock model evaluation capabilities. The process follows a straightforward flow, starting with selecting an evaluator model and configuring BYOI, then choosing appropriate metrics, uploading your evaluation datasets, running the assessment, and finally analyzing the detailed results. This comprehensive evaluation pipeline (as illustrated in the following diagram) helps provide consistent quality across your AI implementations, regardless of where they are deployed.

For RAG system evaluation, the workflow incorporates additional components and metrics specific to context retrieval analysis. The process begins with choosing an evaluator model, then selecting between retrieval-only or retrieve-and-generate evaluation modes. With BYOI support, you can evaluate RAG outputs from any source while using powerful LLMaaJ metrics to assess the quality of your retrievals or full end-to-end retrieve-and-generate flow of your RAG system. This end-to-end evaluation framework (see the following diagram) provides deep insights into how effectively your RAG system uses its retrieved context and generates accurate, well-supported responses.

Prerequisites
To use the LLMaaJ model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

AWS account and model access:

An active AWS account
Selected evaluator models are enabled in Amazon Bedrock (verify on the Model access page of the Amazon Bedrock console)
Confirmed AWS Regions where the models are available and their quotas

AWS Identity and Access Management (IAM) and Amazon Simple Storage Service (Amazon S3) configuration:

Completed IAM setup and permissions for both model and RAG evaluation
Configured S3 bucket with appropriate permissions for accessing and writing output data
Enabled CORS on your S3 bucket

Dataset description and preparation
For demonstrating an LLMaaJ evaluation job with BYOI, we created a shopping math problems dataset with a third-party model. The dataset contains 30 discount calculation problems, each with a prompt, reference answer, and model response following the required BYOI format. Each record uses the “third-party-model” identifier in the modelResponses array, enabling the LLMaaJ evaluator to assess response quality against your chosen metrics.
The dataset we used for a RAG evaluation job with BYOI was created using Amazon’s 10-K SEC filing information. Each record in the dataset contains questions about Amazon’s corporate information (such as SEC file numbers, operating segments, and fiscal year reporting), reference responses, and third-party model-generated answers with their corresponding retrieved passages and citations. The dataset follows the required BYOI format, with each record using “third-party-RAG” as the knowledge base identifier. This structure enables evaluation of both retrieve and generate quality metrics and citation accuracy, demonstrating how organizations can assess RAG systems regardless of where they’re deployed.
Start an LLMaaJ evaluation job with BYOI using the Amazon Bedrock console
In this first example, we use the prompt and inference responses dataset published in our amazon-bedrock-samples repository. You can use LLMaaJ with BYOI responses on Amazon Bedrock model evaluation to assess model performance through a user-friendly AWS Management Console interface. Follow these steps to start an evaluation job:

On the Amazon Bedrock console, choose Inference and Assessment in the navigation pane and then choose Evaluations.
On the Evaluations page, choose the Models tab.
In the Model evaluations section, choose Create and choose Automatic: Model as a judge.
On the Create automatic evaluation page, enter a name and description and under Evaluation model, select a model. This model will be used as a judge to evaluate your inference responses.
Optionally, choose Tags, and create your own tags.
Under Inference source, choose Bring your own inference responses as the inference source.
For Source name, enter a name that is the same as the modelIdentifier in your prompt and inference response dataset.

{
    "prompt": "If books cost \$10.21 each and you buy 15 of them with a 20% discount, how much will you pay in total?",
    "referenceResponse": "The total price will be \$122.52. Original price: \$153.15 minus 20% discount (\$30.630000000000003)",
    "category": "Shopping Math",
    "modelResponses": [{
        "response": "To determine the total cost of buying 15 books with a 20% discount, we can follow these steps:\n\n1. Calculate the total cost of the books before the discount.\n2. Determine the discount amount.\n3. Subtract the discount amount from the total cost before the discount.\n\n**Step 1: Calculate the total cost of the books before the discount.**\n\nThe cost of one book is \$10.21. Therefore, the cost of 15 books is:\n\[ 15 \times 10.21 = 153.15 \]\n\n**Step 2: Determine the discount amount.**\n\nThe discount is 20%, so we need to find 20% of \$153.15. To do this, we multiply \$153.15 by 0.20:\n\[ 153.15 \times 0.20 = 30.63 \]\n\n**Step 3: Subtract the discount amount from the total cost before the discount.**\n\nThe total cost after the discount is:\n\[ 153.15 - 30.63 = 122.52 \]\n\nTherefore, the total amount you will pay is:\n\[ \boxed{122.52} \]",
        "modelIdentifier": "third-party-model"
    }]
}

Under Metrics, select the metrics you want to use to evaluate the model response (such as helpfulness, correctness, faithfulness, relevance, and harmfulness).
Under Datasets, for Choose a prompt dataset and Evaluation results, enter the S3 URI directly or use the Browse S3 option to select it.
Under Amazon Bedrock IAM role – Permissions, select or create an IAM service role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, and the models being used in the job. If you create a new IAM role in the evaluation setup, the service will automatically give the role the proper permissions for the job.
Specify the folder in the S3 bucket where the results of the model evaluation will be stored and choose Create.

There’s an input dataset validator that validates that the format of the input prompt dataset is correct when you choose Create. It will help you correct any formatting errors.

You will be able to see that the evaluation job status is In Progress. Wait for the job status to change to Complete. This may take minutes or hours, depending on the length of your prompts and responses and on whether you used a few prompts or hundreds.

When the job is complete, choose the job to see its details and metrics summary.
To view generation metrics details, scroll down in the model evaluation report and choose one of the individual metrics (like helpfulness or correctness) to see its detailed breakdown.
To see each record’s prompt input, generation output (from bring your own inference responses), ground truth (which is optional), and individual scores, choose a metric and choose Prompt details.
Hover over an individual score to view its detailed explanation.

Start an LLMaaJ evaluation job with BYOI using the Python SDK and APIs
To use the Python SDK for creating an LLMaaJ model evaluation job with your own inference responses, use the following steps.

Set up the required configurations:

import boto3
import json
import random
from datetime import datetime
import botocore

# AWS Configuration
REGION = "us-east-1"
ROLE_ARN = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
PREFIX = "<YOUR_BUCKET_PREFIX>"
dataset_custom_name = "<YOUR_BYOI_DATASET_NAME>"  # without the ".jsonl" extension

# Initialize AWS clients
bedrock_client = boto3.client('bedrock', region_name=REGION)
s3_client = boto3.client('s3', region_name=REGION)

Use the following function to create an LLMaaJ model evaluation job with your own inference responses. The precomputedInferenceSource parameter is used when you’re bringing your own pre-generated inference responses. The inferenceSourceIdentifier value must match the model identifier you used in your evaluation dataset. This identifier represents the model that generated the responses you want to evaluate.

def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    evaluator_model_id: str,
    dataset_name: str = None,
    task_type: str = "General"  # must be General for LLMaaJ
):
    # All available LLM-as-a-judge metrics
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness",
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        "precomputedInferenceSource": {
                            "inferenceSourceIdentifier": "third-party-model"
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response

    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise

Use the following code to create an LLMaaJ evaluation job, specifying the evaluator model and configuring all necessary parameters such as input dataset location, output path, and task type. The job configuration creates an evaluation task that will assess your model inference responses, with results being stored in the designated S3 bucket for later analysis.

# Job Configuration
evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
job_name = f"llmaaj-third-party-model-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# S3 Paths
input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
output_path = f"s3://{BUCKET_NAME}/{PREFIX}"

# Create evaluation job
try:
    llm_as_judge_response = create_llm_judge_evaluation(
        client=bedrock_client,
        job_name=job_name,
        role_arn=ROLE_ARN,
        input_s3_uri=input_data,
        output_s3_uri=output_path,
        evaluator_model_id=evaluator_model,
        task_type="General"
    )
    print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
except Exception as e:
    print(f"✗ Failed to create evaluation job: {str(e)}")
    raise

Monitor the progress of your evaluation job:

# Get job ARN based on job type
evaluation_job_arn = llm_as_judge_response['jobArn']

# Check job status
check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)
print(f"Job Status: {check_status['status']}")
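If you want to wait for the job to finish before moving on, a simple polling loop works. This is a sketch; the terminal status values listed are assumptions based on the statuses the evaluation job API reports, so adjust them if your jobs surface different values:

import time

# Poll until the evaluation job reaches a terminal state.
terminal_states = {"Completed", "Failed", "Stopped"}  # assumed terminal statuses
while True:
    status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)["status"]
    print(f"Job Status: {status}")
    if status in terminal_states:
        break
    time.sleep(60)  # check once per minute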

Start a RAG evaluation job with BYOI using the Amazon Bedrock console
When dealing with RAG systems, it’s important to evaluate both the quality of retrieval on its own and the combined retrieve-and-generate flow. We already provide several important metrics for those workflows, and now you can also evaluate the citation quality of your RAG systems. You can include citations when using Amazon Bedrock Knowledge Bases. In this section, we explore how you can use a RAG evaluation job to surface two important citation metrics. If your custom RAG system uses citations, you can also plug your citation information into your input dataset with BYOI and use the citation precision and citation coverage metrics:
Best practice is to use both citation metrics, citation precision and citation coverage, together to give a complete view of the citation quality.

To get started using the console, complete the following steps:

On the Amazon Bedrock console, choose Inference and Assessment in the navigation pane and then choose Evaluations.
Choose the RAG tab.
Choose Create. This will take you to a new page where you can set up your RAG evaluation job.
Under Evaluation details, enter a name and description and choose an Evaluator model. In the example, we chose Claude 3 Haiku for demonstration, but we provide a choice of several judge models for flexibility. This model will be used as a judge to evaluate your inference responses.
Under Inference source, choose Bring your own inference responses as the inference source.
For Source name, enter a name that is the same as the knowledgeBaseIdentifier in your prompt and inference response dataset. For example, the following is a snippet of one record in our evaluation dataset:

{
    "conversationTurns": [{
        "prompt": {
            "content": [{
                "text": "What is Amazon's SEC file number?"
            }]
        },
        "referenceResponses": [{
            "content": [{
                "text": "Amazon's SEC file number is 000-22513."
            }]
        }],
        "output": {
            "text": "Amazon's SEC file number is 000-22513.",
            "modelIdentifier": "third-party-model",
            "knowledgeBaseIdentifier": "third-party-RAG",
            "retrievedPassages": {
                "retrievalResults": [{
                    "content": {
                        "text": "Commission File No. 000-22513"
                    }
                }, {
                    "content": {
                        "text": "AMAZON.COM, INC. (Exact name of registrant as specified in its charter)"
                    }
                }]
            },
            "citations": [{
                "generatedResponsePart": {
                    "textResponsePart": {
                        "span": {
                            "start": 0,
                            "end": 11
                        },
                        "text": "Amazon's SEC"
                    }
                },
                "retrievedReferences": [{
                    "content": {
                        "text": "UNITED STATESSECURITIES AND EXCHANGE COMMISSION"
                    }
                }]
            }, {
                "generatedResponsePart": {
                    "textResponsePart": {
                        "span": {
                            "start": 12,
                            "end": 22
                        },
                        "text": "file number"
                    }
                },
                "retrievedReferences": [{
                    "content": {
                        "text": "Commission File No. 000-22513"
                    }
                }]
            }, {
                "generatedResponsePart": {
                    "textResponsePart": {
                        "span": {
                            "start": 23,
                            "end": 33
                        },
                        "text": "is 000-22513"
                    }
                },
                "retrievedReferences": [{
                    "content": {
                        "text": "Commission File No. 000-22513"
                    }
                }]
            }]
        }
    }]
}

Therefore, we use the source name third-party-RAG.

Under Metrics, select the citation metrics Citation precision and Citation coverage. You can also select other metrics.
Under Dataset and evaluation results S3 location, choose Browse S3 to enter the S3 URI for the evaluation input file, and specify the output location.
Under Amazon Bedrock IAM role – Permissions, create a new service role or use an existing one.
Choose Create.

If you see an error in the S3 URI, such as “Your S3 bucket does not have the required CORS settings,” you might need to edit the CORS settings on the bucket that holds your data. For more information, see Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets.
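If you prefer to configure this programmatically, the following sketch applies a permissive CORS rule with boto3. Treat the allowed origins, methods, and headers as assumptions and confirm the exact required values against the linked documentation:

import boto3

s3 = boto3.client("s3")
# Example CORS rule for the evaluation bucket; tighten AllowedOrigins for production use.
s3.put_bucket_cors(
    Bucket="<YOUR_S3_BUCKET_NAME>",
    CORSConfiguration={
        "CORSRules": [{
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
            "AllowedOrigins": ["*"],
            "ExposeHeaders": ["Access-Control-Allow-Origin"]
        }]
    }
)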

You can monitor your job; the status will be In progress while the job is running.

When the job status is Completed, you can follow the link on a completed job to review results.

The results include a Metric summary section. In this particular example, both citation precision and citation coverage are high.

You can also set metric criteria to make sure the RAG system is performing up to your standards, and view example conversations as shown below each metric.

Start a RAG evaluation job with BYOI using the Python SDK and APIs
To use the Python SDK for creating a RAG evaluation job with your own inference responses, follow these steps (or refer to our example notebook):

Set up the required configurations, which should include your model identifier for the evaluator, IAM role with appropriate permissions, S3 paths for input data containing your inference responses, and output location for results:

import boto3
from datetime import datetime

# Configure knowledge base and model settings
evaluator_model = "<YOUR_EVALUATOR_MODEL>"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
PREFIX = "<YOUR_BUCKET_PREFIX>"
RAG_dataset_custom_name = "<YOUR_RAG_BYOI_DATASET_NAME>"  # without the ".jsonl" extension

# Specify S3 locations
input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{RAG_dataset_custom_name}.jsonl"
output_path = f"s3://{BUCKET_NAME}/{PREFIX}/"

# Create Bedrock client
bedrock_client = boto3.client('bedrock')

Use the following function to create a RAG evaluation job with your own pre-generated retrieval and generation responses. The precomputedRagSourceConfig parameter is used when you’re bringing your own pre-generated RAG responses for evaluation. The ragSourceIdentifier value must match the identifier you used in your evaluation dataset. This identifier represents the RAG system that produced the responses you want to evaluate. The code sets up an automated evaluation that will assess various dimensions of your RAG system’s performance, including correctness, completeness, helpfulness, logical coherence, faithfulness, and citation quality against the provided dataset.

retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=retrieve_generate_job_name,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [
            {
                "precomputedRagSourceConfig": {
                    "retrieveAndGenerateSourceConfig": {
                        "ragSourceIdentifier": "third-party-RAG"  # Replace with your identifier
                    }
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "QuestionAndAnswer",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness",
                    "Builtin.CitationPrecision",
                    "Builtin.CitationCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions.
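For example, here is a short sketch that checks the job status and then lists whatever the job wrote under the output prefix (the exact object names and layout depend on the job, so this simply enumerates the keys):

# Check the RAG evaluation job status.
status = bedrock_client.get_evaluation_job(
    jobIdentifier=retrieve_generate_job["jobArn"]
)["status"]
print(f"Job Status: {status}")

# Once the job is complete, list the result objects written under the output path.
s3_client = boto3.client("s3")
listing = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=PREFIX)
for obj in listing.get("Contents", []):
    print(obj["Key"])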

Clean up
To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of the post.
Conclusion
The general availability release of LLM-as-a-judge and RAG evaluation on Amazon Bedrock brings environment-agnostic “bring your own inference” capabilities, allowing organizations to evaluate RAG systems or models regardless of where they run. New citation metrics released as part of the RAG evaluation metrics list enhance information quality assessment by providing measurements of citation precision and coverage, enabling data-driven comparisons across different implementations.
As organizations deploy more generative AI applications, robust evaluation becomes critical to maintaining quality, reliability, and responsible use. We encourage you to explore these new features through the Amazon Bedrock console and the examples provided in the AWS Samples GitHub repository. By implementing regular evaluation workflows, you can continuously improve your models and RAG systems to deliver the highest quality outputs for your specific use cases.
We encourage you to explore these capabilities of Amazon Bedrock Evaluations and discover how systematic evaluation can enhance your generative AI applications.

About the authors
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.
Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.