A Comprehensive Tutorial on the Five Levels of Agentic AI Architectures: From Basic Prompt Responses to Fully Autonomous Code Generation and Execution

In this tutorial, we explore five levels of agentic architectures, from the simplest language model calls to a fully autonomous code-generating system. The tutorial is designed to run seamlessly on Google Colab. Starting with a basic “simple processor” that echoes the model’s output, you will progressively build routing logic, integrate external tools, orchestrate multi-step workflows, and ultimately empower the model to plan, validate, refine, and execute its own Python code. Throughout each section, you’ll find detailed explanations, self-contained demo functions, and clear prompts that illustrate how to balance human control and machine autonomy in real-world AI applications.

import os
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import re
import json
import time
import random
from IPython.display import clear_output

We import the core Python and third-party libraries: os and time for environment and execution control, and torch together with Hugging Face’s transformers (pipeline, AutoTokenizer, AutoModelForCausalLM) for model loading and inference. We also rely on re and json for parsing LLM outputs, random for seeds and mock data, and clear_output to keep the Colab interface tidy.

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

def get_model_and_tokenizer():
    if not hasattr(get_model_and_tokenizer, "model"):
        print(f"Loading model {MODEL_NAME}...")
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            torch_dtype=torch.float16,
            device_map="auto",
            low_cpu_mem_usage=True
        )
        get_model_and_tokenizer.model = model
        get_model_and_tokenizer.tokenizer = tokenizer
        print("Model loaded successfully!")

    return get_model_and_tokenizer.model, get_model_and_tokenizer.tokenizer

Here, we define MODEL_NAME to point at the TinyLlama 1.1B chat model and implement a lazy‐loading helper get_model_and_tokenizer() that downloads and initializes the tokenizer and model only once, caching them on first call to minimize overhead, and then returns the cached instances for all subsequent inference calls.

def generate_text(prompt, max_length=512):
    model, tokenizer = get_model_and_tokenizer()

    messages = [{"role": "user", "content": prompt}]
    formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_length,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
        )

    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    response = generated_text.split("ASSISTANT: ")[-1].strip()
    return response

The generate_text function wraps the TinyLlama inference workflow: it retrieves the cached model and tokenizer, formats the user prompt into the chat template, tokenizes and moves inputs to the model’s device, then samples a response with temperature and top-p settings. After generation, it decodes the output and extracts just the assistant’s reply by splitting on the “ASSISTANT: ” marker.
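
For instance, a one-off call might look like the following minimal sketch; the prompt string and token budget here are illustrative and not part of the original notebook.

# Minimal usage sketch: a single ad-hoc generation call with an illustrative prompt.
sample_reply = generate_text("Summarize what an AI agent is in two sentences.", max_length=128)
print(sample_reply)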

Level 1: Simple Processor

At the simplest level, the code defines a straightforward text‐generation pipeline that treats the model purely as a language processor. When the user provides a prompt, the `simple_processor` function invokes the `generate_text` helper, which is built on the TinyLlama 1.1B chat model, to produce a free-form response. It then displays that response directly. Under the hood, `generate_text` ensures the model and tokenizer are loaded just once by caching them inside the `get_model_and_tokenizer` function, formats the prompt for the chat model, runs generation with sampling parameters for diversity, and extracts the assistant’s reply by splitting on the “ASSISTANT:” marker. This level demonstrates the most basic interaction pattern: input is received, output is generated, and program flow remains entirely under human control.

def simple_processor(prompt):
    """Level 1: Simple Processor – Model has no impact on program flow"""
    response = generate_text(prompt)
    return response

def demo_level1():
    print("\n" + "="*50)
    print("LEVEL 1: SIMPLE PROCESSOR DEMO")
    print("="*50)
    print("At this level, the AI has no control over program flow.")
    print("It simply takes input and produces output.\n")

    user_input = input("Enter your question or prompt: ") or "Write a short poem about artificial intelligence."
    print("\nProcessing your request...\n")

    output = simple_processor(user_input)
    print("OUTPUT:")
    print("-"*50)
    print(output)
    print("-"*50)

The simple_processor function embodies the Simple Processor level of our agent hierarchy by treating the model purely as a text generator; it accepts a user-provided prompt, delegates to generate_text, and returns whatever the model produces without any branching or decision logic. The accompanying demo_level1 routine provides a minimal interactive loop: it prints a clear header, solicits user input (with a sensible default), invokes simple_processor, and displays the raw output, showcasing the most basic prompt-to-response workflow in which the AI exerts no influence over the program’s flow.

Level 2: Router

The second level introduces conditional routing based on the model’s classification of the user’s query. The `router_agent` function first asks the model to classify a query into “technical,” “creative,” or “factual,” then normalizes the model’s response into one of those categories. Depending on which category is detected, the query is dispatched to a specialized handler, either `handle_technical_query`, `handle_creative_query`, or `handle_factual_query`, each of which wraps the user’s query in a system-style prompt tailored to the chosen tone and purpose. This routing mechanism provides the model with partial control over program flow, enabling it to guide the subsequent interaction path while still relying on human-defined handlers to generate the final output.

def router_agent(user_query):
    """Level 2: Router – Model determines basic program flow"""

    category_prompt = f"""Classify the following query into one of these categories:
'technical', 'creative', or 'factual'.

Query: {user_query}

Return ONLY the category name and nothing else."""

    category_response = generate_text(category_prompt)

    category = category_response.lower()
    if "technical" in category:
        category = "technical"
    elif "creative" in category:
        category = "creative"
    else:
        category = "factual"

    print(f"Query classified as: {category}")

    if category == "technical":
        return handle_technical_query(user_query)
    elif category == "creative":
        return handle_creative_query(user_query)
    else:
        return handle_factual_query(user_query)

def handle_technical_query(query):
    system_prompt = f"""You are a technical assistant. Provide detailed technical explanations.

User query: {query}"""

    response = generate_text(system_prompt)
    return f"[Technical Response]\n{response}"

def handle_creative_query(query):
    system_prompt = f"""You are a creative assistant. Be imaginative and inspiring.

User query: {query}"""

    response = generate_text(system_prompt)
    return f"[Creative Response]\n{response}"

def handle_factual_query(query):
    system_prompt = f"""You are a factual assistant. Provide accurate information concisely.

User query: {query}"""

    response = generate_text(system_prompt)
    return f"[Factual Response]\n{response}"

def demo_level2():
    print("\n" + "="*50)
    print("LEVEL 2: ROUTER DEMO")
    print("="*50)
    print("At this level, the AI determines basic program flow.")
    print("It decides which processing path to take.\n")

    user_query = input("Enter your question or prompt: ") or "How do neural networks work?"
    print("\nProcessing your request...\n")

    result = router_agent(user_query)
    print("OUTPUT:")
    print("-"*50)
    print(result)
    print("-"*50)

The router_agent function implements Router behavior by first asking the model to classify the user’s query as “technical,” “creative,” or “factual,” then normalizing that classification and dispatching the query to the corresponding handler (handle_technical_query, handle_creative_query, or handle_factual_query), each of which wraps the original query in an appropriate system‐style prompt before calling generate_text. The demo_level2 routine provides a clear CLI-style interface, printing headers, accepting input (with a default), invoking router_agent, and displaying the categorized response, showcasing how the model can take basic control over program flow by choosing which processing path to follow.

Level 3: Tool Calling

At the third level, the code empowers the model to decide which of several external tools to invoke by embedding a JSON-based function selection protocol into the prompt. The `tool_calling_agent` presents the user’s question alongside a menu of potential tools, including weather lookup, web search simulation, current date and time retrieval, or direct response, and instructs the model to respond with a valid JSON message specifying the chosen tool and its parameters. A regex then extracts the first JSON object from the model’s output, and the code safely falls back to a direct response if parsing fails. Once the tool and arguments are identified, the corresponding Python function is executed, its result is captured, and a final model call integrates that result into a coherent answer. This pattern bridges LLM reasoning with concrete code execution by letting the model orchestrate which APIs or utilities to call.

def tool_calling_agent(user_query):
    """Level 3: Tool Calling – Model determines how functions are executed"""

    tool_selection_prompt = f"""Based on the user query, select the most appropriate tool from the following list:
1. get_weather: Get the current weather for a location
2. search_information: Search for specific information on a topic
3. get_date_time: Get current date and time
4. direct_response: Provide a direct response without using tools

USER QUERY: {user_query}

INSTRUCTIONS:
- Return your response in valid JSON format
- Include the tool name and any required parameters
- For get_weather, include location parameter
- For search_information, include query and depth parameter (basic or detailed)
- For get_date_time, include timezone parameter (optional)
- For direct_response, no parameters needed

Example output format: {{"tool": "get_weather", "parameters": {{"location": "New York"}}}}"""

    tool_selection_response = generate_text(tool_selection_prompt)

    try:
        json_match = re.search(r'({.*})', tool_selection_response, re.DOTALL)
        if json_match:
            tool_selection = json.loads(json_match.group(1))
        else:
            print("Could not parse tool selection. Defaulting to direct response.")
            tool_selection = {"tool": "direct_response", "parameters": {}}
    except json.JSONDecodeError:
        print("Invalid JSON in tool selection. Defaulting to direct response.")
        tool_selection = {"tool": "direct_response", "parameters": {}}

    tool_name = tool_selection.get("tool", "direct_response")
    parameters = tool_selection.get("parameters", {})

    print(f"Selected tool: {tool_name}")

    if tool_name == "get_weather":
        location = parameters.get("location", "Unknown")
        tool_result = get_weather(location)
    elif tool_name == "search_information":
        query = parameters.get("query", user_query)
        depth = parameters.get("depth", "basic")
        tool_result = search_information(query, depth)
    elif tool_name == "get_date_time":
        timezone = parameters.get("timezone", "UTC")
        tool_result = get_date_time(timezone)
    else:
        return generate_text(f"Please provide a helpful response to: {user_query}")

    final_prompt = f"""User Query: {user_query}
Tool Used: {tool_name}
Tool Result: {json.dumps(tool_result)}

Based on the user's query and the tool result above, provide a helpful response."""

    final_response = generate_text(final_prompt)
    return final_response

def get_weather(location):
    weather_conditions = ["Sunny", "Partly cloudy", "Overcast", "Light rain", "Heavy rain", "Thunderstorms", "Snowy", "Foggy"]
    temperatures = {
        "cold": list(range(-10, 10)),
        "mild": list(range(10, 25)),
        "hot": list(range(25, 40))
    }

    location_hash = sum(ord(c) for c in location)
    condition_index = location_hash % len(weather_conditions)
    season = ["winter", "spring", "summer", "fall"][location_hash % 4]

    temp_range = temperatures["cold"] if season in ["winter", "fall"] else temperatures["hot"] if season == "summer" else temperatures["mild"]
    temperature = random.choice(temp_range)

    return {
        "location": location,
        "temperature": f"{temperature}°C",
        "conditions": weather_conditions[condition_index],
        "humidity": f"{random.randint(30, 90)}%"
    }

def search_information(query, depth="basic"):
    mock_results = [
        f"First result about {query}",
        f"Second result discussing {query}",
        f"Third result analyzing {query}"
    ]

    if depth == "detailed":
        mock_results.extend([
            f"Fourth detailed analysis of {query}",
            f"Fifth comprehensive overview of {query}",
            f"Sixth academic paper on {query}"
        ])

    return {
        "query": query,
        "results": mock_results,
        "depth": depth,
        "sources": [f"source{i}.com" for i in range(1, len(mock_results) + 1)]
    }

def get_date_time(timezone="UTC"):
    current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    return {
        "current_datetime": current_time,
        "timezone": timezone
    }

def demo_level3():
    print("\n" + "="*50)
    print("LEVEL 3: TOOL CALLING DEMO")
    print("="*50)
    print("At this level, the AI selects which tools to use and with what parameters.")
    print("It can process the results from tools to create a final response.\n")

    user_query = input("Enter your question or prompt: ") or "What's the weather like in San Francisco?"
    print("\nProcessing your request...\n")

    result = tool_calling_agent(user_query)
    print("OUTPUT:")
    print("-"*50)
    print(result)
    print("-"*50)

In the Level 3 implementation, the tool_calling_agent function prompts the model to choose among a predefined set of utilities, such as weather lookup, mock web search, or date/time retrieval, by returning a JSON object with the selected tool name and its parameters. It then safely parses that JSON, invokes the corresponding Python function to obtain structured data, and makes a follow-up model call to integrate the tool’s output into a coherent, user-facing response.

Level 4: Multi-Step Agent

The fourth level extends the tool-calling pattern into a full multi-step agent that manages its workflow and state. The `MultiStepAgent` class maintains an internal memory of user inputs, tool outputs, and agent actions. Each iteration generates a planning prompt that summarizes the entire memory, asking the model to choose one of several tools, such as web search simulation, information extraction, text summarization, or report creation, or to conclude the task with a final output. After executing the selected tool and appending its results back to memory, the process repeats until either the model issues a “complete” action or the maximum number of steps is reached. Finally, the agent collates the memory into a cohesive final response. This structure shows how an LLM can orchestrate complex, multi-stage processes while consulting external functions and refining its plan based on previous results.

class MultiStepAgent:
    """Level 4: Multi-Step Agent – Model controls iteration and program continuation"""

    def __init__(self):
        self.tools = {
            "search_web": self.search_web,
            "extract_info": self.extract_info,
            "summarize_text": self.summarize_text,
            "create_report": self.create_report
        }
        self.memory = []
        self.max_steps = 5

    def run(self, user_task):
        self.memory.append({"role": "user", "content": user_task})

        steps_taken = 0
        while steps_taken < self.max_steps:
            next_action = self.determine_next_action()

            if next_action["action"] == "complete":
                return next_action["output"]

            tool_name = next_action["tool"]
            tool_args = next_action["args"]

            print(f"\nStep {steps_taken + 1}: Using tool '{tool_name}' with arguments: {tool_args}")

            tool_result = self.tools[tool_name](**tool_args)

            self.memory.append({
                "role": "tool",
                "content": json.dumps(tool_result)
            })

            steps_taken += 1

        return self.generate_final_response("Maximum steps reached. Here's what I've found so far.")

    def determine_next_action(self):
        context = "Current memory state:\n"
        for item in self.memory:
            if item["role"] == "user":
                context += f"USER INPUT: {item['content']}\n\n"
            elif item["role"] == "tool":
                context += f"TOOL RESULT: {item['content']}\n\n"

        prompt = f"""{context}

Based on the above information, determine the next action to take.
Choose one of the following options:
1. search_web: Search for information (args: query)
2. extract_info: Extract specific information from a text (args: text, target_info)
3. summarize_text: Create a summary of text (args: text)
4. create_report: Create a structured report (args: title, content)
5. complete: Task is complete (include final output)

Respond with a JSON object with the following structure:
For tools: {{"action": "tool", "tool": "tool_name", "args": {{tool-specific arguments}}}}
For completion: {{"action": "complete", "output": "final output text"}}

Only return the JSON object and nothing else."""

        next_action_response = generate_text(prompt)

        try:
            json_match = re.search(r'({.*})', next_action_response, re.DOTALL)
            if json_match:
                next_action = json.loads(json_match.group(1))
            else:
                return {"action": "complete", "output": "I encountered an error in planning. Here's what I know so far: " + self.generate_final_response("Error in planning")}
        except json.JSONDecodeError:
            return {"action": "complete", "output": "I encountered an error in planning. Here's what I know so far: " + self.generate_final_response("Error in planning")}

        self.memory.append({"role": "assistant", "content": next_action_response})
        return next_action

    def generate_final_response(self, prefix=""):
        context = "Task history:\n"
        for item in self.memory:
            if item["role"] == "user":
                context += f"USER INPUT: {item['content']}\n\n"
            elif item["role"] == "tool":
                context += f"TOOL RESULT: {item['content']}\n\n"
            elif item["role"] == "assistant":
                context += f"AGENT ACTION: {item['content']}\n\n"

        prompt = f"""{context}

{prefix} Generate a comprehensive final response that addresses the original user task."""

        final_response = generate_text(prompt)
        return final_response

    def search_web(self, query):
        time.sleep(1)

        query_hash = sum(ord(c) for c in query)
        num_results = (query_hash % 3) + 2

        results = []
        for i in range(num_results):
            results.append(f"Result {i+1}: Information about '{query}' related to aspect {chr(97 + i)}.")

        return {
            "query": query,
            "results": results
        }

    def extract_info(self, text, target_info):
        time.sleep(0.5)

        return {
            "extracted_info": f"Extracted information about '{target_info}' from the text: The text indicates that {target_info} is related to several key aspects mentioned in the content.",
            "confidence": round(random.uniform(0.7, 0.95), 2)
        }

    def summarize_text(self, text):
        time.sleep(0.5)

        word_count = len(text.split())

        return {
            "summary": f"Summary of the provided text ({word_count} words): The text discusses key points related to the subject matter, highlighting important aspects and providing context.",
            "original_length": word_count,
            "summary_length": round(word_count * 0.3)
        }

    def create_report(self, title, content):
        time.sleep(0.7)

        report_sections = [
            "## Introduction",
            f"This report provides an overview of {title}.",
            "",
            "## Key Findings",
            content,
            "",
            "## Conclusion",
            f"This analysis of {title} highlights several important aspects that warrant consideration."
        ]

        return {
            "report": "\n".join(report_sections),
            "word_count": len(content.split()),
            "section_count": 3
        }

def demo_level4():
    print("\n" + "="*50)
    print("LEVEL 4: MULTI-STEP AGENT DEMO")
    print("="*50)
    print("At this level, the AI manages the entire workflow, deciding which tools")
    print("to use, when to use them, and determining when the task is complete.\n")

    user_task = input("Enter a research or analysis task: ") or "Research quantum computing recent developments and create a brief report"
    print("\nProcessing your request... (this may take a minute)\n")

    agent = MultiStepAgent()
    result = agent.run(user_task)
    print("\nFINAL OUTPUT:")
    print("-"*50)
    print(result)
    print("-"*50)

The MultiStepAgent class maintains an evolving memory of user inputs and tool outputs, then repeatedly prompts the LLM to decide its next action, whether to search the web, extract information, summarize text, create a report, or finish, executing the chosen tool and appending the result until the task is complete or a step limit is reached. In doing so, it showcases a Level 4 agent that orchestrates multi-step workflows by letting the model control iteration and program continuation.

Level 5: Fully Autonomous Agent

At the most advanced level, the `AutonomousAgent` class demonstrates a closed-loop system in which the model not only plans and executes but also generates, validates, refines, and runs new Python code. After the user task is recorded, the agent asks the model to produce a detailed plan, then prompts it to generate self-contained solution code, which is automatically cleaned of markdown formatting. A subsequent validation step queries the model for any syntax or logic issues; if issues are found, the agent asks the model to refine the code. The validated code is then wrapped with sandboxing utilities, such as safe printing, captured output buffers, and result-capture logic, and executed in a restricted local environment. Finally, the agent synthesizes a professional report explaining what was done, how it was accomplished, and the final results. This level exemplifies a truly autonomous AI system that can extend its capabilities through dynamic code creation and execution.

class AutonomousAgent:
    """Level 5: Fully Autonomous Agent – Model creates & executes new code"""

    def __init__(self):
        self.memory = []

    def run(self, user_task):
        self.memory.append({"role": "user", "content": user_task})

        print("Planning solution approach...")
        planning_message = self.plan_solution(user_task)
        self.memory.append({"role": "assistant", "content": planning_message})

        print("Generating solution code...")
        generated_code = self.generate_solution_code()
        self.memory.append({"role": "assistant", "content": f"Generated code: ```python\n{generated_code}\n```"})

        print("Validating code...")
        validation_result = self.validate_code(generated_code)
        if not validation_result["valid"]:
            print("Code validation found issues – refining...")
            refined_code = self.refine_code(generated_code, validation_result["issues"])
            self.memory.append({"role": "assistant", "content": f"Refined code: ```python\n{refined_code}\n```"})
            generated_code = refined_code
        else:
            print("Code validation passed")

        try:
            print("Executing solution...")
            execution_result = self.safe_execute_code(generated_code, user_task)
            self.memory.append({"role": "system", "content": f"Execution result: {execution_result}"})

            # Generate a final report
            print("Creating final report...")
            final_report = self.create_final_report(execution_result)
            return final_report

        except Exception as e:
            return f"Error executing the solution: {str(e)}\n\nGenerated code was:\n```python\n{generated_code}\n```"

    def plan_solution(self, task):
        prompt = f"""Task: {task}

You are an autonomous problem-solving agent. Create a detailed plan to solve this task.
Include:
1. Breaking down the task into subtasks
2. What algorithms or approaches you'll use
3. What data structures are needed
4. Any external resources or libraries required
5. Expected challenges and how to address them

Provide a step-by-step plan.
"""

        return generate_text(prompt)

    def generate_solution_code(self):
        context = "Task and planning information:\n"
        for item in self.memory:
            if item["role"] == "user":
                context += f"USER TASK: {item['content']}\n\n"
            elif item["role"] == "assistant":
                context += f"PLANNING: {item['content']}\n\n"

        prompt = f"""{context}

Generate clean, efficient Python code that solves this task. Include comments to explain the code.
The code should be self-contained and able to run inside a Python script or notebook.
Only include the Python code itself without any markdown formatting.
"""

        code = generate_text(prompt)

        code = re.sub(r'^```python\n|```$', '', code, flags=re.MULTILINE)

        return code

    def validate_code(self, code):
        prompt = f"""Code to validate:
```python
{code}
```

Examine the code for the following issues:
1. Syntax errors
2. Logic errors
3. Inefficient implementations
4. Security concerns
5. Missing error handling
6. Import statements for unavailable libraries

If the code has any issues, describe them in detail. If the code looks good, state "No issues found."
"""

        validation_response = generate_text(prompt)

        if "no issues" in validation_response.lower() or "code looks good" in validation_response.lower():
            return {"valid": True, "issues": None}
        else:
            return {"valid": False, "issues": validation_response}

    def refine_code(self, original_code, issues):
        prompt = f"""Original code:
```python
{original_code}
```

Issues identified:
{issues}

Please provide a corrected version of the code that addresses these issues.
Only include the Python code itself without any markdown formatting.
"""

        refined_code = generate_text(prompt)

        refined_code = re.sub(r'^```python\n|```$', '', refined_code, flags=re.MULTILINE)

        return refined_code

    def safe_execute_code(self, code, user_task):

        safe_imports = """
# Standard library imports
import math
import random
import re
import time
import json
from datetime import datetime

# Define a function to capture printed output
captured_output = []
original_print = print

def safe_print(*args, **kwargs):
    output = " ".join(str(arg) for arg in args)
    captured_output.append(output)
    original_print(output)

print = safe_print

# Define a result variable to store the final output
result = None

# Function to store the final result
def store_result(value):
    global result
    result = value
    return value
"""

        result_capture = """
# Store the final result if not already done
if 'result' not in locals() or result is None:
    try:
        # Look for variables that might contain the final result
        potential_results = [var for var in locals() if not var.startswith('_') and var not in
                             ['math', 'random', 're', 'time', 'json', 'datetime',
                              'captured_output', 'original_print', 'safe_print',
                              'result', 'store_result']]
        if potential_results:
            # Use the last defined variable as the result
            store_result(locals()[potential_results[-1]])
    except:
        pass
"""

        full_code = safe_imports + "\n# User code starts here\n" + code + "\n\n" + result_capture

        code_lines = code.split('\n')
        first_lines = code_lines[:3]
        print(f"\nExecuting (first 3 lines):\n{first_lines}")

        local_env = {}

        try:
            exec(full_code, {}, local_env)

            return {
                "output": local_env.get('captured_output', []),
                "result": local_env.get('result', "No explicit result returned")
            }
        except Exception as e:
            return {"error": str(e)}

    def create_final_report(self, execution_result):
        if isinstance(execution_result.get('output'), list):
            output_text = "\n".join(execution_result.get('output', []))
        else:
            output_text = str(execution_result.get('output', ''))

        result_text = str(execution_result.get('result', ''))
        error_text = execution_result.get('error', '')

        context = "Task history:\n"
        for item in self.memory:
            if item["role"] == "user":
                context += f"USER TASK: {item['content']}\n\n"

        prompt = f"""{context}

EXECUTION OUTPUT:
{output_text}

EXECUTION RESULT:
{result_text}

{f"ERROR: {error_text}" if error_text else ""}

Create a final report that explains the solution to the original task. Include:
1. What was done
2. How it was accomplished
3. The final results
4. Any insights or conclusions drawn from the analysis

Format the report in a professional, easy to read manner.
"""

        return generate_text(prompt)

def demo_level5():
    print("\n" + "="*50)
    print("LEVEL 5: FULLY AUTONOMOUS AGENT DEMO")
    print("="*50)
    print("At this level, the AI generates and executes code to solve complex problems.")
    print("It can create, validate, refine, and run custom code solutions.\n")

    user_task = input("Enter a data analysis or computational task: ") or "Analyze a dataset of numbers [10, 45, 65, 23, 76, 12, 89, 32, 50] and create visualizations of the distribution"
    print("\nProcessing your request... (this may take a minute or two)\n")

    agent = AutonomousAgent()
    result = agent.run(user_task)
    print("\nFINAL REPORT:")
    print("-"*50)
    print(result)
    print("-"*50)

The AutonomousAgent class embodies the autonomy of a Fully Autonomous Agent by maintaining a running memory of the user’s task and systematically orchestrating five core phases: planning, code generation, validation, safe execution, and reporting. When the run is initiated, the agent prompts the model to generate a detailed plan for solving the task and stores this plan in memory. Next, it asks the model to create self-contained Python code based on that plan, strips away any markdown formatting, and then validates the code by querying the model for syntax, logic, performance, and security issues. If validation uncovers problems, the agent instructs the model to refine the code until it passes inspection. The finalized code is then wrapped in a sandboxed execution harness, complete with captured output buffers and automatic result extraction, and executed in an isolated local environment. Finally, the agent synthesizes a polished, professional report by feeding the execution results back into the model, producing a narrative that explains what was done, how it was accomplished, and what insights were gained. The accompanying demo_level5 function provides a straightforward, interactive loop that accepts a user task, runs the agent, and presents a comprehensive final report.

Main Function: Running All Five Levels

def main():
    while True:
        clear_output(wait=True)
        print("\n" + "="*50)
        print("AI AGENT LEVELS DEMO")
        print("="*50)
        print("\nThis notebook demonstrates the 5 levels of AI agents:")
        print("1. Simple Processor – Model has no impact on program flow")
        print("2. Router – Model determines basic program flow")
        print("3. Tool Calling – Model determines how functions are executed")
        print("4. Multi-Step Agent – Model controls iteration and program continuation")
        print("5. Fully Autonomous Agent – Model creates & executes new code")
        print("6. Quit")

        choice = input("\nSelect a level to demo (1-6): ")

        if choice == "1":
            demo_level1()
        elif choice == "2":
            demo_level2()
        elif choice == "3":
            demo_level3()
        elif choice == "4":
            demo_level4()
        elif choice == "5":
            demo_level5()
        elif choice == "6":
            print("\nThank you for exploring the AI Agent levels!")
            break
        else:
            print("\nInvalid choice. Please select 1-6.")

        input("\nPress Enter to return to the main menu...")

if __name__ == "__main__":
    main()

Finally, the main function presents a simple, interactive menu loop that clears the Colab output for readability, displays all five agent levels alongside a quit option, and then dispatches the user’s choice to the corresponding demo function before waiting for input to return to the menu. This structure provides a cohesive, CLI-style interface enabling you to explore each agent level in sequence without manual cell execution.

In conclusion, by working through these five levels, we have gained practical insight into the principles of agentic AI and the trade-offs between control, flexibility, and autonomy. We have seen how a system can evolve from straightforward prompt-response behavior to complex decision-making pipelines and even self-modifying code execution. Whether you aim to prototype intelligent assistants, build data pipelines, or experiment with emerging AI capabilities, this progression framework provides a roadmap for designing robust and scalable agents.

Here is the Colab Notebook.


Meta AI Releases Web-SSL: A Scalable and Language-Free Approach to Visual Representation Learning

In recent years, contrastive language-image models such as CLIP have established themselves as a default choice for learning vision representations, particularly in multimodal applications like Visual Question Answering (VQA) and document understanding. These models leverage large-scale image-text pairs to incorporate semantic grounding via language supervision. However, this reliance on text introduces both conceptual and practical challenges: the assumption that language is essential for multimodal performance, the complexity of acquiring aligned datasets, and the scalability limits imposed by data availability. In contrast, visual self-supervised learning (SSL)—which operates without language—has historically demonstrated competitive results on classification and segmentation tasks, yet has been underutilized for multimodal reasoning due to performance gaps, especially in OCR and chart-based tasks.

Meta Releases WebSSL Models on Hugging Face (300M–7B Parameters)

To explore the capabilities of language-free visual learning at scale, Meta has released the Web-SSL family of DINO and Vision Transformer (ViT) models, ranging from 300 million to 7 billion parameters, now publicly available via Hugging Face. These models are trained exclusively on the image subset of the MetaCLIP dataset (MC-2B)—a web-scale dataset comprising two billion images. This controlled setup enables a direct comparison between WebSSL and CLIP, both trained on identical data, isolating the effect of language supervision.

The objective is not to replace CLIP, but to rigorously evaluate how far pure visual self-supervision can go when model and data scale are no longer limiting factors. This release represents a significant step toward understanding whether language supervision is necessary—or merely beneficial—for training high-capacity vision encoders.

Technical Architecture and Training Methodology

WebSSL encompasses two visual SSL paradigms: joint-embedding learning (via DINOv2) and masked modeling (via MAE). Each model follows a standardized training protocol using 224×224 resolution images and maintains a frozen vision encoder during downstream evaluation to ensure that observed differences are attributable solely to pretraining.

Models are trained across five capacity tiers (ViT-1B to ViT-7B), using only unlabeled image data from MC-2B. Evaluation is conducted using Cambrian-1, a comprehensive 16-task VQA benchmark suite encompassing general vision understanding, knowledge-based reasoning, OCR, and chart-based interpretation.

In addition, the models are natively supported in Hugging Face’s transformers library, providing accessible checkpoints and seamless integration into research workflows.
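
As a rough sketch of that integration, the snippet below loads a Web-SSL checkpoint through the standard AutoImageProcessor/AutoModel interface; the checkpoint identifier is an assumed placeholder, so substitute the exact model ID published on the Hugging Face Hub.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed placeholder checkpoint ID -- replace with the actual Web-SSL model name from the Hub.
checkpoint = "facebook/webssl-dino1b-full2b-224"

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Encode a local image with the frozen encoder and inspect the feature tensor.
image = Image.open("example.jpg")  # any local RGB image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_dim)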

Performance Insights and Scaling Behavior

Experimental results reveal several key findings:

Scaling Model Size: WebSSL models demonstrate near log-linear improvements in VQA performance with increasing parameter count. In contrast, CLIP’s performance plateaus beyond 3B parameters. WebSSL maintains competitive results across all VQA categories and shows pronounced gains in Vision-Centric and OCR & Chart tasks at larger scales.

Data Composition Matters: By filtering the training data to include only 1.3% of text-rich images, WebSSL outperforms CLIP on OCR & Chart tasks—achieving up to +13.6% gains in OCRBench and ChartQA. This suggests that the presence of visual text alone, not language labels, significantly enhances task-specific performance.

High-Resolution Training: WebSSL models fine-tuned at 518px resolution further close the performance gap with high-resolution models like SigLIP, particularly for document-heavy tasks.

LLM Alignment: Without any language supervision, WebSSL shows improved alignment with pretrained language models (e.g., LLaMA-3) as model size and training exposure increase. This emergent behavior implies that larger vision models implicitly learn features that correlate well with textual semantics.

Importantly, WebSSL maintains strong performance on traditional benchmarks (ImageNet-1k classification, ADE20K segmentation, NYUv2 depth estimation), and often outperforms MetaCLIP and even DINOv2 under equivalent settings.

Concluding Observations

Meta’s Web-SSL study provides strong evidence that visual self-supervised learning, when scaled appropriately, is a viable alternative to language-supervised pretraining. These findings challenge the prevailing assumption that language supervision is essential for multimodal understanding. Instead, they highlight the importance of dataset composition, model scale, and careful evaluation across diverse benchmarks.

The release of models ranging from 300M to 7B parameters enables broader research and downstream experimentation without the constraints of paired data or proprietary pipelines. As open-source foundations for future multimodal systems, WebSSL models represent a meaningful advancement in scalable, language-free vision learning.

Check out the Models on Hugging Face, GitHub Page and Paper.


Meet Rowboat: An Open-Source IDE for Building Complex Multi-Agent Systems

As multi-agent systems gain traction in real-world applications—from customer support automation to AI-native infrastructure—the need for a streamlined development interface has never been greater. Meet Rowboat, an open-source IDE designed to accelerate the construction, debugging, and deployment of multi-agent AI workflows. It is powered by the OpenAI Agents SDK, connects to MCP servers, and can be integrated into your apps over HTTP or via the Python SDK. Backed by Y Combinator and tightly integrated with OpenAI’s Agents SDK, Rowboat offers a unique combination of visual development, tool modularity, and real-time testing—making it a compelling platform for engineering agentic AI systems at scale.

Rethinking Multi-Agent Development

Developing multi-agent systems typically requires orchestrating interactions between multiple specialized agents, each responsible for a distinct task or capability. This often involves stitching together prompts, toolchains, and APIs—an effort that is not only tedious but error-prone. Rowboat abstracts away much of this complexity by introducing a visual, AI-assisted development environment that allows teams to define agent behavior using natural language, integrate modular toolsets, and evaluate systems through interactive testing.

The IDE is built with developers and applied AI teams in mind, especially those working on domain-specific use cases in customer experience (CX), enterprise automation, and backend infrastructure.

Key Features and Architecture

1. Copilot: Natural Language-Based Agent Design

At the heart of Rowboat lies its AI-powered Copilot—a system that transforms natural language specifications into runnable multi-agent workflows. For example, users can describe, “Build an assistant for a telecom company to handle data plan upgrades and billing inquiries,” and the Copilot scaffolds the entire system accordingly. This dramatically reduces the ramp-up time for teams new to multi-agent architectures.

2. Tool Integration via MCP Compatibility

Rowboat supports Model Context Protocol (MCP) servers, enabling seamless tool injection into agents. Developers can import tools defined in an external MCP server, assign them to individual agents within Rowboat, and trigger tool invocations through agent reasoning steps. This modular design ensures clear separation of responsibilities, enabling scalable and maintainable agent workflows.

3. Interactive Testing in the Playground

The built-in Playground offers a live testing environment where users can interact with their agents, observe system behavior, and debug tool calls. It supports step-by-step inspection of conversation history, function execution, and context propagation—critical capabilities when validating agent coordination or investigating unexpected behaviors.

4. Flexible Deployment via HTTP API and Python SDK

Rowboat isn’t just a visual IDE—it ships with an HTTP API and a Python SDK, giving teams the flexibility to embed Rowboat agents into broader infrastructure. Whether you’re running agents in a cloud-native microservice or embedding them in internal developer tools, the SDK provides both stateless and session-aware configurations.
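
As a purely hypothetical illustration (the endpoint path, port, and payload shape below are assumptions rather than Rowboat's documented interface), calling a deployed agent over HTTP from Python might look roughly like this:

import requests

# Hypothetical endpoint and payload -- consult the Rowboat documentation for the real API shape.
ROWBOAT_URL = "http://localhost:3000/api/v1/chat"  # assumed local deployment

payload = {"messages": [{"role": "user", "content": "I want to upgrade my data plan"}]}
response = requests.post(ROWBOAT_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())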

Practical Use Cases

Rowboat is well-suited for teams building production-grade assistant systems. Some real-world applications include:

Financial Services: Automate credit card support, loan updates, and payment reminders using a team of domain-specific agents.

Insurance: Assist users with claims processing, policy inquiries, and premium calculations.

Travel & Hospitality: Handle flight updates, hotel bookings, itinerary changes, and multilingual support.

Telecom: Support billing resolution, plan changes, SIM management, and device troubleshooting.

These scenarios benefit from decomposing tasks into specialized agents with focused tool access—exactly the design pattern that Rowboat enables.

Conclusion

Rowboat fills an important gap in the AI development ecosystem: a purpose-built environment for prototyping and managing multi-agent systems. Its intuitive design, natural language integration, and modular architecture make it more than just an IDE—it’s a full development suite for agentic systems. Whether you’re building a customer service assistant, a backend orchestration tool, or a custom LLM agent pipeline, Rowboat provides the foundation.

Check out the GitHub Page.


OpenAI Launches gpt-image-1 API: Bringing High-Quality Image Generation to Developers

OpenAI has officially announced the release of its image generation API, powered by the gpt-image-1 model. This launch brings the multimodal capabilities of ChatGPT into the hands of developers, enabling programmatic access to image generation—an essential step for building intelligent design tools, creative applications, and multimodal agent systems.

The new API supports high-quality image synthesis from natural language prompts, marking a significant integration point for generative AI workflows in production environments. Available starting today, developers can now directly interact with the same image generation model that powers ChatGPT’s image creation capabilities.

Expanding the Capabilities of ChatGPT to Developers

The gpt-image-1 model is now available through the OpenAI platform, allowing developers to generate photorealistic, artistic, or highly stylized images using plain text. This follows a phased rollout of image generation features in the ChatGPT product interface and marks a critical transition toward API-first deployment.

The image generation endpoint supports parameters such as:

Prompt: Natural language description of the desired image.

Size: Standard resolution settings (e.g., 1024×1024).

n: Number of images to generate per prompt.

Response format: Choose between base64-encoded images or URLs.

Style: Optionally specify image aesthetics (e.g., “vivid” or “natural”).

The API follows a synchronous usage model, which means developers receive the generated image(s) in the same response—ideal for real-time interfaces like chatbots or design platforms.

Technical Overview of the API and gpt-image-1 Model

OpenAI has not yet released full architectural details about gpt-image-1, but based on public documentation, the model supports robust prompt adherence, detailed composition, and stylistic coherence across diverse image types. While it is distinct from DALL·E 3 in naming, the image quality and alignment suggest continuity in OpenAI’s image generation research lineage.

The API is designed to be stateless and easy to integrate:

from openai import OpenAI
import base64

client = OpenAI()

prompt = """
A children's book drawing of a veterinarian using a stethoscope to
listen to the heartbeat of a baby otter.
"""

result = client.images.generate(
    model="gpt-image-1",
    prompt=prompt
)

image_base64 = result.data[0].b64_json
image_bytes = base64.b64decode(image_base64)

# Save the image to a file
with open("otter.png", "wb") as f:
    f.write(image_bytes)

Unlocking Developer Use Cases

By making this API available, OpenAI positions gpt-image-1 as a fundamental building block for multimodal AI development. Some key applications include:

Generative Design Tools: Seamlessly integrate prompt-based image creation into design software for artists, marketers, and product teams.

AI Assistants and Agents: Extend LLMs with visual generation capabilities to support richer user interaction and content composition.

Prototyping for Games and XR: Rapidly generate environments, textures, or concept art for iterative development pipelines.

Educational Visualizations: Generate scientific diagrams, historical reconstructions, or data illustrations on demand.

With image generation now programmable, these use cases can be scaled, personalized, and embedded directly into user-facing platforms.

Content Moderation and Responsible Use

Safety remains a core consideration. OpenAI has implemented content filtering layers and safety classifiers around the gpt-image-1 model to mitigate risks of generating harmful, misleading, or policy-violating images. The model is subject to the same usage policies as OpenAI’s text-based models, with automated moderation for prompts and generated content.

Developers are encouraged to follow best practices for end-user input validation and maintain transparency in applications that include generative visual content.

Conclusion

The release of gpt-image-1 to the API marks a pivotal step in making generative vision models accessible, controllable, and production-ready. It’s not just a model—it’s an interface to imagination, grounded in structured, repeatable, and scalable computation.

For developers building the next generation of creative software, autonomous agents, or visual storytelling tools, gpt-image-1 offers a robust foundation to bring language and imagery together in code.

Check out the Technical Details.


Enterprise-grade natural language to SQL generation using LLMs: Balanc …

This blog post is co-written with Renuka Kumar and Thomas Matthew from Cisco.
Enterprise data by its very nature spans diverse data domains, such as security, finance, product, and HR. Data across these domains is often maintained across disparate data environments (such as Amazon Aurora, Oracle, and Teradata), with each managing hundreds or perhaps thousands of tables to represent and persist business data. These tables house complex domain-specific schemas, with instances of nested tables and multi-dimensional data that require complex database queries and domain-specific knowledge for data retrieval.
Recent advances in generative AI have led to the rapid evolution of natural language to SQL (NL2SQL) technology, which uses pre-trained large language models (LLMs) and natural language to generate database queries in the moment. Although this technology promises simplicity and ease of use for data access, converting natural language queries to complex database queries with accuracy and at enterprise scale has remained a significant challenge. For enterprise data, a major difficulty stems from the common case of database tables having embedded structures that require specific knowledge or highly nuanced processing (for example, an embedded XML formatted string). As a result, NL2SQL solutions for enterprise data are often incomplete or inaccurate.
This post describes a pattern that AWS and Cisco teams have developed and deployed that is viable at scale and addresses a broad set of challenging enterprise use cases. The methodology allows for the use of simpler, and therefore more cost-effective and lower latency, generative models by reducing the processing required for SQL generation.
Specific challenges for enterprise-scale NL2SQL
Generative accuracy is paramount for NL2SQL use cases; inaccurate SQL queries might result in a sensitive enterprise data leak, or lead to inaccurate results impacting critical business decisions. Enterprise-scale data presents specific challenges for NL2SQL, including the following:

Complex schemas optimized for storage (and not retrieval) – Enterprise databases are often distributed in nature and optimized for storage and not for retrieval. As a result, the table schemas are complex, involving nested tables and multi-dimensional data structures (for example, a cell containing an array of data). As a further result, creating queries for retrieval from these data stores requires specific expertise and involves complex filtering and joins.
Diverse and complex natural language queries – The user’s natural language input might also be complex, for example referring to a list of entities of interest or to date ranges. Converting the logical meaning of these user queries into a database query can lead to overly long and complex SQL queries due to the original design of the data schema.
LLM knowledge gap – NL2SQL language models are typically trained on data schemas that are publicly available for education purposes and might not have the necessary knowledge complexity required of large, distributed databases in production environments. Consequently, when faced with complex enterprise table schemas or complex user queries, LLMs have difficulty generating correct query statements because they have difficulty understanding interrelationships between the values and entities of the schema.
LLM attention burden and latency – Queries containing multi-dimensional data often involve multi-level filtering over each cell of the data. To generate queries for cases such as these, the generative model requires more attention to support attending to the increase in relevant tables, columns, and values; analyzing the patterns; and generating more tokens. This increases the LLM’s query generation latency, and the likelihood of query generation errors, because of the LLM misunderstanding data relationships and generating incorrect filter statements.
Fine-tuning challenge – One common approach to achieve higher accuracy with query generation is to fine-tune the model with more SQL query samples. However, it is non-trivial to craft training data for generating SQL for embedded structures within columns (for example, JSON, or XML), to handle sets of identifiers, and so on, to get baseline performance (which is the problem we are trying to solve in the first place). This also introduces a slowdown in the development cycle.

Solution design and methodology
The solution described in this post provides a set of optimizations that solve the aforementioned challenges while reducing the amount of work that has to be performed by an LLM for generating accurate output. This work extends upon the post Generating value from enterprise data: Best practices for Text2SQL and generative AI. That post has many useful recommendations for generating high-quality SQL, and the guidelines outlined might be sufficient for your needs, depending on the inherent complexity of the database schemas.
To achieve generative accuracy for complex scenarios, the solution breaks down NL2SQL generation into a sequence of focused steps and sub-problems, narrowing the generative focus to the appropriate data domain. Using data abstractions for complex joins and data structure, this approach enables the use of smaller and more affordable LLMs for the task. This approach results in reduced prompt size and complexity for inference, reduced response latency, and improved accuracy, while enabling the use of off-the-shelf pre-trained models.
Narrowing scope to specific data domains
The solution workflow narrows down the overall schema space into the data domain targeted by the user’s query. Each data domain corresponds to the set of database data structures (tables, views, and so on) that are commonly used together to answer a set of related user queries, for an application or business domain. The solution uses the data domain to construct prompt inputs for the generative LLM.
This pattern consists of the following elements:

Mapping input queries to domains – This involves mapping each user query to the data domain that is appropriate for generating the response for NL2SQL at runtime. This mapping is similar in nature to intent classification, and enables the construction of an LLM prompt that is scoped for each input query (described next); a minimal sketch of this mapping step appears after this list.
Scoping data domain for focused prompt construction – This is a divide-and-conquer pattern. By focusing on the data domain of the input query, redundant information, such as schemas for other data domains in the enterprise data store, can be excluded. This might be considered as a form of prompt pruning; however, it offers more than prompt reduction alone. Reducing the prompt context to the in-focus data domain enables greater scope for few-shot learning examples, declaration of specific business rules, and more.
Augmenting SQL DDL definitions with metadata to enhance LLM inference – This involves enhancing the LLM prompt context by augmenting the SQL DDL for the data domain with descriptions of tables, columns, and rules to be used by the LLM as guidance on its generation. This is described in more detail later in this post.
Determine query dialect and connection information – For each data domain, the database server metadata (such as the SQL dialect and connection URI) is captured during use case onboarding and made available at runtime to be automatically included in the prompt for SQL generation and subsequent query execution. This enables scalability through decoupling the natural language query from the specific queried data source. Together, the SQL dialect and connectivity abstractions allow for the solution to be data source agnostic; data sources might be distributed within or across different clouds, or provided by different vendors. This modularity enables scalable addition of new data sources and data domains, because each is independent.
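The following is a minimal sketch of the domain-mapping element above; the domain names are illustrative, and the LLM call is injected as a generic callable rather than any specific model API.

from typing import Callable

DATA_DOMAINS = ["security", "finance", "product", "hr"]  # illustrative domain names

def map_query_to_domain(user_query: str, call_llm: Callable[[str], str]) -> str:
    """Classify a user question into one data domain using any LLM completion callable."""
    prompt = (
        "Classify the user question into exactly one of these data domains: "
        + ", ".join(DATA_DOMAINS)
        + ".\nReturn only the domain name.\n\nQuestion: " + user_query
    )
    domain = call_llm(prompt).strip().lower()
    return domain if domain in DATA_DOMAINS else "unknown"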

Managing identifiers for SQL generation (resource IDs)
Resolving identifiers involves extracting the named resources, as named entities, from the user’s query and mapping the values to unique IDs appropriate for the target data source prior to NL2SQL generation. This can be implemented using natural language processing (NLP) or LLMs to apply named entity recognition (NER) capabilities to drive the resolution process. This optional step has the most value when there are many named resources and the lookup process is complex. For instance, in a user query such as “In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?” there are named resources: ‘allyson felix’, ‘isabelle werth’, and ‘nedo nadi’. This step allows for rapid and precise feedback to the user when a resource can’t be resolved to an identifier (for example, due to ambiguity).
This optional process of handling many or paired identifiers is included to offload the burden on LLMs for user queries with challenging sets of identifiers to be incorporated, such as those that might come in pairs (such as ID-type, ID-value), or where there are many identifiers. Rather than having the generative LLM insert each unique ID into the SQL directly, the identifiers are made available by defining a temporary data structure (such as a temporary table) and a set of corresponding insert statements. The LLM is prompted with few-shot learning examples to generate SQL for the user query by joining with the temporary data structure, rather than attempt identity injection. This results in a simpler and more consistent query pattern for cases when there are one, many, or pairs of identifiers.
Handling complex data structures: Abstracting domain data structures
This step is aimed at simplifying complex data structures into a form that can be understood by the language model without having to decipher complex inter-data relationships. Complex data structures might appear as nested tables or lists within a table column, for instance.
We can define temporary data structures (such as views and tables) that abstract complex multi-table joins, nested structures, and more. These higher-level abstractions provide simplified data structures for query generation and execution. The top-level definitions of these abstractions are included as part of the prompt context for query generation, and the full definitions are provided to the SQL execution engine, along with the generated query. The resulting queries from this process can use simple set operations (such as IN, as opposed to complex joins) that LLMs are well trained on, thereby alleviating the need for nested joins and filters over complex data structures.
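As a minimal sketch, assuming hypothetical table names loosely based on the Olympics example used later in this post, a temporary view can flatten a multi-table join once so that only its compact signature needs to appear in the LLM prompt, while the full DDL travels with the query to the execution engine:

# Hypothetical illustration: flatten a nested athlete/medal relationship into one view.
# The table and column names are assumptions and may differ from the repository.
MEDAL_SUMMARY_VIEW_DDL = """
CREATE TEMP VIEW athlete_medal_summary AS
SELECT p.id AS athlete_id, p.full_name, g.games_name, m.medal_name
FROM person p
JOIN games_competitor gc ON gc.person_id = p.id
JOIN games g ON gc.games_id = g.id
JOIN competitor_event ce ON ce.competitor_id = gc.id
JOIN medal m ON ce.medal_id = m.id;
"""

# Only the compact signature is added to the prompt context for SQL generation...
prompt_schema_hint = "athlete_medal_summary(athlete_id, full_name, games_name, medal_name)"

# ...while the full DDL is prepended to the generated query before execution.
sql_preamble = [MEDAL_SUMMARY_VIEW_DDL]

Generated queries can then filter the view with simple predicates (for example, WHERE medal_name = 'Gold') instead of reproducing the nested joins.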
Augmenting data with data definitions for prompt construction
Several of the optimizations noted earlier require making some of the specifics of the data domain explicit. Fortunately, this only has to be done when schemas and use cases are onboarded or updated. The benefit is higher generative accuracy, reduced generative latency and cost, and the ability to support arbitrarily complex query requirements.
To capture the semantics of a data domain, the following elements are defined:

The standard tables and views in the data schema, along with comments to describe the tables and columns.
Join hints for the tables and views, such as when to use outer joins.
Data domain-specific rules, such as which columns might not appear in a final select statement.
The set of few-shot examples of user queries and corresponding SQL statements. A good set of examples would include a wide variety of user queries for that domain.
Definitions of the data schemas for any temporary tables and views used in the solution.
A domain-specific system prompt that specifies the role and expertise that the LLM has, the SQL dialect, and the scope of its operation.
A domain-specific user prompt.
Additionally, if temporary tables or views are used for the data domain, a SQL script must be defined that, when executed, creates the desired temporary data structures. Depending on the use case, this can be a static or dynamically generated script.

Accordingly, the prompt for generating the SQL is dynamic and constructed based on the data domain of the input question, with a set of specific definitions of data structure and rules appropriate for the input query. We refer to this set of elements as the data domain context. The purpose of the data domain context is to provide the necessary prompt metadata for the generative LLM. Examples of this, and the methods described in the previous sections, are included in the GitHub repository. There is one context for each data domain, as illustrated in the following figure.
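As a rough, hypothetical illustration (the field names and format in the repository may differ), a data domain context can be thought of as a structured object along these lines:

# Hypothetical shape of a data domain context; all field names and values are illustrative only.
olympics_domain_context = {
    "dialect": "sqlite",                          # SQL dialect captured at onboarding
    "connection_uri": "sqlite:///olympics.db",    # placeholder connection information
    "schema_ddl": [
        "CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, games_name TEXT);",
        "CREATE TABLE games_competitor (id INTEGER PRIMARY KEY, games_id INTEGER, person_id INTEGER);",
    ],
    "join_hints": ["Join games_competitor to games on games_competitor.games_id = games.id."],
    "rules": ["Do not return internal numeric IDs in the final SELECT."],
    "few_shot_examples": [
        {"question": "How many gold medals has Yukio Endo won?",
         "sql": "SELECT count(*) FROM athlete_medal_summary WHERE full_name = 'yukio endo' AND medal_name = 'Gold';"},
    ],
    "temp_structures_sql": [
        "CREATE temp TABLE athletes_in_focus (row_id INTEGER PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);",
    ],
    "system_prompt": "You are a SQL expert for the Olympics data domain. Generate SQLite-compatible SQL only.",
    "user_prompt_template": "Given the schema and examples above, write SQL for: {user_query}",
}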

Bringing it all together: The execution flow
This section describes the execution flow of the solution. An example implementation of this pattern is available in the GitHub repository. Access the repository to follow along with the code.
To illustrate the execution flow, we use an example database with data about Olympics statistics and another with the company’s employee vacation schedule. We follow the execution flow for the domain regarding Olympics statistics using the user query “In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?” to show the inputs and outputs of the steps in the execution flow, as illustrated in the following figure.

Preprocess the request
The first step of the NL2SQL flow is to preprocess the request. The main objective of this step is to classify the user query into a domain. As explained earlier, this narrows down the scope of the problem to the appropriate data domain for SQL generation. Additionally, this step identifies and extracts the referenced named resources in the user query. These are then used to call the identity service in the next step to get the database identifiers for these named resources.
Using the earlier mentioned example, the inputs and outputs of this step are as follows:

user_query = "In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?"
pre_processed_request = request_pre_processor.run(user_query)
domain = pre_processed_request[app_consts.DOMAIN]

# Output pre_processed_request:
  {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
   'domain': 'olympics',
   'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'} }

Resolve identifiers (to database IDs)
This step processes the named resources' strings extracted in the previous step and resolves them to identifiers that can be used in database queries. As mentioned earlier, the named resources (for example, "group22", "user123", and "I") are looked up using solution-specific means, such as through database lookups or an ID service.
The following code shows the execution of this step in our running example:

named_resources = pre_processed_request[app_consts.NAMED_RESOURCES]
if len(named_resources) > 0:
  identifiers = id_service_facade.resolve(named_resources)
  # add identifiers to the pre_processed_request object
  pre_processed_request[app_consts.IDENTIFIERS] = identifiers
else:
  pre_processed_request[app_consts.IDENTIFIERS] = []

# Output pre_processed_request:
  {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
   'domain': 'olympics',
   'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'},
   'identifiers': [ {'id': 34551, 'role': 32, 'name': 'allyson felix'},
   {'id': 129726, 'role': 32, 'name': 'isabelle werth'},
   {'id': 84026, 'role': 32, 'name': 'nedo nadi'} ] }

Prepare the request
This step is pivotal in this pattern. Having obtained the domain and the named resources along with their looked-up IDs, we use the corresponding context for that domain to generate the following:

A prompt for the LLM to generate a SQL query corresponding to the user query
A SQL script to create the domain-specific schema

To create the prompt for the LLM, this step assembles the system prompt, the user prompt, and the user query received from the input, along with the domain-specific schema definition (including any newly created temporary tables and join hints), and finally the few-shot examples for the domain. Other than the user query that is received as an input, the other components are based on the values provided in the context for that domain.
A SQL script for creating required domain-specific temporary structures (such as views and tables) is constructed from the information in the context. The domain-specific schema in the LLM prompt, join hints, and the few-shot examples are aligned with the schema that gets generated by running this script. In our example, this step is shown in the following code. The output is a dictionary with two keys, llm_prompt and sql_preamble. The value strings for these have been clipped here; the full output can be seen in the Jupyter notebook.

prepared_request = request_preparer.run(pre_processed_request)

# Output prepared_request:
{'llm_prompt': 'You are a SQL expert. Given the following SQL tables definitions, ...
CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, ...);

<example>
question: How many gold medals has Yukio Endo won? answer: ```{"sql":
"SELECT a.id, count(m.medal_name) as "count"
FROM athletes_in_focus a INNER JOIN games_competitor gc ...
WHERE m.medal_name = 'Gold' GROUP BY a.id;" }```
</example>

'sql_preamble': [ 'CREATE temp TABLE athletes_in_focus (row_id INTEGER
PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);',
'INSERT INTO athletes_in_focus VALUES
(1,84026,'nedo nadi'), (2,34551,'allyson felix'), (3,129726,'isabelle werth');' ]}

Generate SQL
Now that the prompt has been prepared along with any information necessary to provide the proper context to the LLM, we provide that information to the SQL-generating LLM in this step. The goal is to have the LLM output SQL with the correct join structure, filters, and columns. See the following code:

llm_response = llm_service_facade.invoke(prepared_request['llm_prompt'])
generated_sql = llm_response['llm_output']

# Output generated_sql:
{'sql': 'SELECT g.games_name, g.games_year FROM athletes_in_focus a
JOIN games_competitor gc ON gc.person_id = a.id
JOIN games g ON gc.games_id = g.id;'}

Execute the SQL
After the SQL query is generated by the LLM, we can send it off to the next step. At this step, the SQL preamble and the generated SQL are merged to create a complete SQL script for execution. The complete SQL script is then executed against the data store, a response is fetched, and then the response is passed back to the client or end-user. See the following code:

sql_script = prepared_request['sql_preamble'] + [generated_sql['sql']]
database = app_consts.get_database_for_domain(domain)
results = rdbms_service_facade.execute_sql(database, sql_script)

# Output results:
{'rdbms_output': [
  ('games_name', 'games_year'),
  ('2004 Summer', 2004),
  ...
  ('2016 Summer', 2016)],
 'processing_status': 'success'}

Solution benefits
Overall, our tests have shown several benefits, such as:

High accuracy – This is measured by string matching of the generated query against the target SQL query for each test case. In our tests, we observed over 95% accuracy for 100 queries, spanning three data domains.
High consistency – This is measured in terms of the same SQL being generated across multiple runs. We observed over 95% consistency for 100 queries, spanning three data domains. With the test configuration, the queries were accurate most of the time; a small number occasionally produced inconsistent results.
Low cost and latency – The approach supports the use of small, low-cost, low-latency LLMs. We observed SQL generation in the 1–3 second range using Meta's Code Llama 13B and Anthropic's Claude 3 Haiku.
Scalability – The methods that we employed in terms of data abstractions facilitate scaling independent of the number of entities or identifiers in the data for a given use case. For instance, in our tests consisting of a list of 200 different named resources per row of a table, and over 10,000 such rows, we measured a latency range of 2–5 seconds for SQL generation and 3.5–4.0 seconds for SQL execution.
Solving complexity – Using the data abstractions for simplifying complexity enabled the accurate generation of arbitrarily complex enterprise queries, which almost certainly would not be possible otherwise.

We attribute the success of the solution with these excellent but lightweight models (compared with a Meta Llama 70B variant or Anthropic's Claude Sonnet) to the points noted earlier, with the reduced LLM task complexity being the driving force. The implementation code demonstrates how this is achieved. Overall, by using the optimizations outlined in this post, natural language SQL generation for enterprise data is much more feasible than it would be otherwise.
AWS solution architecture
In this section, we illustrate how you might implement the architecture on AWS. The end-user sends their natural language queries to the NL2SQL solution using a REST API. Amazon API Gateway is used to provision the REST API, which can be secured by Amazon Cognito. The API is linked to an AWS Lambda function, which implements and orchestrates the processing steps described earlier using a programming language of the user's choice (such as Python) in a serverless manner. In this example implementation, where Amazon Bedrock is noted, the solution uses Anthropic's Claude 3 Haiku.
Briefly, the processing steps are as follows:

Determine the domain by invoking an LLM on Amazon Bedrock for classification.
Invoke Amazon Bedrock to extract relevant named resources from the request.
After the named resources are determined, this step calls a service (the Identity Service) that returns identifier specifics relevant to the named resources for the task at hand. The Identity Service is logically a key/value lookup service, which might support multiple domains.
This step runs on Lambda to create the LLM prompt to generate the SQL, and to define temporary SQL structures that will be executed by the SQL engine along with the SQL generated by the LLM (in the next step).
Given the prepared prompt, this step invokes an LLM running on Amazon Bedrock to generate the SQL statements that correspond to the input natural language query.
This step executes the generated SQL query against the target database. In our example implementation, we used an SQLite database for illustration purposes, but you could use another database server.

The final result is obtained by running the preceding pipeline on Lambda. When the workflow is complete, the result is provided as a response to the REST API request.
The following diagram illustrates the solution architecture.

Conclusion
In this post, the AWS and Cisco teams unveiled a new methodical approach that addresses the challenges of enterprise-grade SQL generation. The teams were able to reduce the complexity of the NL2SQL process while delivering higher accuracy and better overall performance.
Though we’ve walked you through an example use case focused on answering questions about Olympic athletes, this versatile pattern can be seamlessly adapted to a wide range of business applications and use cases. The demo code is available in the GitHub repository. We invite you to leave any questions and feedback in the comments.

About the authors

Renuka Kumar is a Senior Engineering Technical Lead at Cisco, where she has architected and led the development of Cisco’s Cloud Security BU’s AI/ML capabilities in the last 2 years, including launching first-to-market innovations in this space. She has over 20 years of experience in several cutting-edge domains, with over a decade in security and privacy. She holds a PhD from the University of Michigan in Computer Science and Engineering.

Toby Fotherby is a Senior AI and ML Specialist Solutions Architect at AWS, helping customers use the latest advances in AI/ML and generative AI to scale their innovations. He has over a decade of cross-industry expertise leading strategic initiatives and master’s degrees in AI and Data Science. Toby also leads a program training the next generation of AI Solutions Architects.

Shweta Keshavanarayana is a Senior Customer Solutions Manager at AWS. She works with AWS Strategic Customers and helps them in their cloud migration and modernization journey. Shweta is passionate about solving complex customer challenges using creative solutions. She holds an undergraduate degree in Computer Science & Engineering. Beyond her professional life, she volunteers as a team manager for her sons’ U9 cricket team, while also mentoring women in tech and serving the local community.
Thomas Matthew is an AI/ML Engineer at Cisco. Over the past decade, he has worked on applying methods from graph theory and time series analysis to solve detection and exfiltration problems found in network security. He has presented his research and work at Blackhat and DevCon. Currently, he helps integrate generative AI technology into Cisco's Cloud Security product offerings.
Daniel Vaquero is a Senior AI/ML Specialist Solutions Architect at AWS. He helps customers solve business challenges using artificial intelligence and machine learning, creating solutions ranging from traditional ML approaches to generative AI. Daniel has more than 12 years of industry experience working on computer vision, computational photography, machine learning, and data science, and he holds a PhD in Computer Science from UCSB.
Atul Varshneya is a former Principal AI/ML Specialist Solutions Architect with AWS. He currently focuses on developing solutions in the areas of AI/ML, particularly in generative AI. In his career of 4 decades, Atul has worked as the technology R&D leader in multiple large companies and startups.
Jessica Wu is an Associate Solutions Architect at AWS. She helps customers build highly performant, resilient, fault-tolerant, cost-optimized, and sustainable architectures.

AWS Field Experience reduced cost and delivered low latency and high p …

AWS Field Experience (AFX) empowers Amazon Web Services (AWS) sales teams with generative AI solutions built on Amazon Bedrock, improving how AWS sellers and customers interact. The AFX team uses AI to automate tasks and provide intelligent insights and recommendations, streamlining workflows for both customer-facing roles and internal support functions. Their approach emphasizes operational efficiency and practical enhancements to daily processes.
Last year, AFX introduced Account Summaries as the first in a forthcoming lineup of tools designed to support and streamline sales workflows. By integrating structured and unstructured data—from sales collateral and customer engagements to external insights and machine learning (ML) outputs—the tool delivers summarized insights that offer a comprehensive view of customer accounts. These summaries provide concise overviews and timely updates, enabling teams to make informed decisions during customer interactions.
The following screenshot shows an example of Account Summary for a customer account, including an executive summary, company overview, and recent account changes.

Migration to the Amazon Nova Lite foundation model
Initially, AFX selected a range of models available on Amazon Bedrock, each chosen for its specific capabilities tailored to the diverse requirements of various summary sections. This was done to optimize accuracy, response time, and cost efficiency. However, following the introduction of state-of-the-art Amazon Nova foundation models in December 2024, the AFX team consolidated all its generative AI workload onto the Nova Lite model to capitalize on its industry-leading price performance and optimized latency.
Since moving to the Nova Lite model, the AFX team has achieved a remarkable 90% reduction in inference costs. This has empowered them to scale operations and deliver greater business value that directly supports their mission of creating efficient, high-performing sales processes.
Because Account Summaries are often used by sellers during on-the-go customer engagements, response speed is critical for maintaining seller efficiency. The Nova Lite model’s ultra-low latency helps ensure that sellers receive fast, reliable responses, without compromising on the quality of the insights.
The AFX team also highlighted the seamless migration experience, noting that their existing prompting, reasoning, and evaluation criteria transferred smoothly to the Amazon Nova Lite model without requiring significant modifications. The combination of tailored prompt controls and authorized reference content creates a bounded response framework, minimizing hallucinations and inaccuracies.
Overall impact
Since using the Nova Lite model, over 15,600 summaries have been generated by 3,600 sellers—with 1,500 of those sellers producing more than four summaries each. Impressively, the generative AI Account Summaries have achieved a 72% favorability rate, underscoring strong seller confidence and widespread approval.
AWS sellers report saving an average of 35 minutes per summary, a benefit that significantly boosts productivity and allocates more time for customer engagements. Additionally, about one-third of surveyed sellers noted that the summaries positively influenced their customer interactions, and those using generative AI Account Summaries experienced a 4.9% increase in the value of opportunities created.
A member of the AFX team explained, “The Amazon Nova Lite model has significantly reduced our costs without compromising performance. It allowed us to get fast, reliable account summaries, making customer interaction more productive and impactful.”
Conclusion
The AFX team’s product migration to the Nova Lite model has delivered tangible enterprise value by enhancing sales workflows. By migrating to the Amazon Nova Lite model, the team has not only achieved significant cost savings and reduced latency, but has also empowered sellers with a leading intelligent and reliable solution. This process has translated into real-world benefits—saving time, simplifying research, and bolstering customer engagement—laying a solid foundation for ongoing business goals and sustained success.
Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About the Authors
Anuj Jauhari is a Senior Product Marketing Manager at Amazon Web Services, where he helps customers realize value from innovations in generative AI.
Ashwin Nadagoudar is a Software Development Manager at Amazon Web Services, leading go-to-market (GTM) strategies and user journey initiatives with generative AI.
Sonciary Perez is a Principal Product Manager at Amazon Web Services, supporting the transformation of AWS Sales through AI-powered solutions that drive seller productivity and accelerate revenue growth.

Combine keyword and semantic search for text and images using Amazon B …

Customers today expect to find products quickly and efficiently through intuitive search functionality. A seamless search journey not only enhances the overall user experience, but also directly impacts key business metrics such as conversion rates, average order value, and customer loyalty. According to a McKinsey study, 78% of consumers are more likely to make repeat purchases from companies that provide personalized experiences. As a result, delivering exceptional search functionality has become a strategic differentiator for modern ecommerce services. With ever expanding product catalogs and increasing diversity of brands, harnessing advanced search technologies is essential for success.
Semantic search enables digital commerce providers to deliver more relevant search results by going beyond keyword matching. It uses an embeddings model to create vector embeddings that capture the meaning of the input query. This helps the search be more resilient to phrasing variations and to accept multimodal inputs such as text, image, audio, and video. For example, a user inputs a query containing text and an image of a product they like, and the search engine translates both into vector embeddings using a multimodal embeddings model and retrieves related items from the catalog using embeddings similarities. To learn more about semantic search and how Amazon Prime Video uses it to help customers find their favorite content, see Amazon Prime Video advances search for sports using Amazon OpenSearch Service.
While semantic search provides contextual understanding and flexibility, keyword search remains a crucial component for a comprehensive ecommerce search solution. At its core, keyword search provides the essential baseline functionality of accurately matching user queries to product data and metadata, making sure explicit product names, brands, or attributes can be reliably retrieved. This matching capability is vital, because users often have specific items in mind when initiating a search, and meeting these explicit needs with precision is important to deliver a satisfactory experience.
Hybrid search combines the strengths of keyword search and semantic search, enabling retailers to deliver more accurate and relevant results to their customers. Based on an OpenSearch blog post, hybrid search improves result quality by 8–12% compared to keyword search and by 15% compared to natural language search. However, combining keyword search and semantic search presents significant complexity, because the different query types produce scores on different scales. With Amazon OpenSearch Service hybrid search, customers can seamlessly integrate these approaches by combining relevance scores from multiple search types into one unified score.
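To make the score-combination step concrete, the following standalone sketch reproduces only the arithmetic; in practice, OpenSearch Service performs this inside the search pipeline's normalization processor, as shown later in this post. The example scores and the 0.3 keyword weight are illustrative.

def min_max(scores):
    """Scale a list of scores to the 0-1 range."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 1.0 for s in scores]

keyword_scores = [8.43, 7.91, 6.20]    # BM25 scores (unbounded scale)
semantic_scores = [0.71, 0.70, 0.69]   # vector similarity scores (0-1 scale)

keyword_weight = 0.3
hybrid_scores = [
    keyword_weight * k + (1 - keyword_weight) * s
    for k, s in zip(min_max(keyword_scores), min_max(semantic_scores))
]
print(hybrid_scores)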
OpenSearch Service is the AWS recommended vector database for Amazon Bedrock. It’s a fully managed service that you can use to deploy, operate, and scale OpenSearch on AWS. OpenSearch is a distributed open-source search and analytics engine composed of a search engine and vector database. OpenSearch Service can help you deploy and operate your search infrastructure with native vector database capabilities delivering as low as single-digit millisecond latencies for searches across billions of vectors, making it ideal for real-time AI applications. To learn more, see Improve search results for AI using Amazon OpenSearch Service as a vector database with Amazon Bedrock.
Multimodal embedding models like Amazon Titan Multimodal Embeddings G1, available through Amazon Bedrock, play a critical role in enabling hybrid search functionality. These models generate embeddings for both text and images by representing them in a shared semantic space. This allows systems to retrieve relevant results across modalities such as finding images using text queries or combining text with image inputs.
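For reference, the following minimal boto3 sketch shows how the Amazon Titan Multimodal Embeddings G1 model can be invoked directly through the Amazon Bedrock runtime API; the image path and AWS Region are placeholders.

import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder Region

# Encode an example product image (placeholder path) as Base64
with open("sample_sandal.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Text and image are embedded into the same semantic space
body = json.dumps({"inputText": "leather sandals in Petal Blush", "inputImage": image_b64})

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-image-v1",
    body=body,
    contentType="application/json",
    accept="application/json",
)
embedding = json.loads(response["body"].read())["embedding"]  # 1,024-dimensional by default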
In this post, we walk you through how to build a hybrid search solution using OpenSearch Service powered by multimodal embeddings from the Amazon Titan Multimodal Embeddings G1 model through Amazon Bedrock. This solution demonstrates how you can enable users to submit both text and images as queries to retrieve relevant results from a sample retail image dataset.
Overview of solution
In this post, you will build a solution that you can use to search through a sample image dataset in the retail space, using a multimodal hybrid search system powered by OpenSearch Service. This solution has two key workflows: a data ingestion workflow and a query workflow.
Data ingestion workflow
The data ingestion workflow generates vector embeddings for text, images, and metadata using Amazon Bedrock and the Amazon Titan Multimodal Embeddings G1 model. Then, it stores the vector embeddings, text, and metadata in an OpenSearch Service domain.
In this workflow, shown in the following figure, we use a SageMaker JupyterLab notebook to perform the following actions:

Read text, images, and metadata from an Amazon Simple Storage Service (Amazon S3) bucket, and encode images in Base64 format.
Send the text, images, and metadata to Amazon Bedrock using its API to generate embeddings using the Amazon Titan Multimodal Embeddings G1 model.
The Amazon Bedrock API replies with embeddings to the Jupyter notebook.
Store both the embeddings and metadata in an OpenSearch Service domain.

Query workflow
In the query workflow, an OpenSearch search pipeline is used to convert the query input to embeddings using the embeddings model registered with OpenSearch. Then, within the OpenSearch search pipeline results processor, results of semantic search and keyword search are combined using the normalization processor to provide relevant search results to users. Search pipelines take away the heavy lifting of building score results normalization and combination outside your OpenSearch Service domain.
The workflow consists of the following steps shown in the following figure:

The client submits a query input containing text, a Base64 encoded image, or both to OpenSearch Service. Text submitted is used for both semantic and keyword search, and the image is used for semantic search.
The OpenSearch search pipeline performs the keyword search using textual inputs and a neural search using vector embeddings generated by Amazon Bedrock with the Titan Multimodal Embeddings G1 model.
The normalization processor within the pipeline scales search results using techniques like min_max and combines keyword and semantic scores using arithmetic_mean.
Ranked search results are returned to the client.

Walkthrough overview
To deploy the solution, complete the following high-level steps:

Create a connector for Amazon Bedrock in OpenSearch Service.
Create an OpenSearch search pipeline and enable hybrid search.
Create an OpenSearch Service index for storing the multimodal embeddings and metadata.
Ingest sample data to the OpenSearch Service index.
Create OpenSearch Service query functions to test search functionality.

Prerequisites
For this walkthrough, you should have the following prerequisites:

An AWS account.
Amazon Bedrock with Amazon Titan Multimodal Embeddings G1 enabled. For more information, see Access Amazon Bedrock foundation models.
An OpenSearch Service domain. For instructions, see Getting started with Amazon OpenSearch Service.
An Amazon SageMaker notebook. For instructions, see Quick setup for Amazon SageMaker.
Familiarity with AWS Identity and Access Management (IAM), Amazon Elastic Compute Cloud (Amazon EC2), OpenSearch Service, and SageMaker.
Familiarity with Python programming language.

The code is open source and hosted on GitHub.
Create a connector for Amazon Bedrock in OpenSearch Service
To use OpenSearch Service machine learning (ML) connectors with other AWS services, you need to set up an IAM role allowing access to that service. In this section, we demonstrate the steps to create an IAM role and then create the connector.
Create an IAM role
Complete the following steps to set up an IAM role to delegate Amazon Bedrock permissions to OpenSearch Service:

Add the following policy to the new role to allow OpenSearch Service to invoke the Amazon Titan Multimodal Embeddings G1 model:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:region:account-id:foundation-model/amazon.titan-embed-image-v1"
    }
  ]
}

Modify the role trust policy as follows. You can follow the instructions in IAM role management to edit the trust relationship of the role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "opensearchservice.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Connect an Amazon Bedrock model to OpenSearch
After you create the role, you can use the Amazon Resource Name (ARN) of the role to define the constant in the SageMaker notebook along with the OpenSearch domain endpoint. Complete the following steps:

Register a model group. Note the model group ID returned in the response to register a model in a later step.
Create a connector, which facilitates registering and deploying external models in OpenSearch. The response will contain the connector ID.
Register the external model to the model group and deploy the model. In this step, you register and deploy the model at the same time; by setting deploy=true, the registered model is deployed as well. A request sketch for these three calls follows this list.
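The steps above map to the OpenSearch ML Commons REST API. The following abbreviated sketch uses the same requests-based style as the rest of the notebook; the connector blueprint is simplified, and OPENSEARCH_ENDPOINT, AWS_REGION, BEDROCK_INVOKE_ROLE_ARN, and open_search_auth are assumed to be defined elsewhere.

import requests

# 1) Register a model group
resp = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/model_groups/_register",
    json={"name": "bedrock_embeddings_group", "description": "Titan Multimodal Embeddings G1"},
    auth=open_search_auth,
)
model_group_id = resp.json()["model_group_id"]

# 2) Create a connector to Amazon Bedrock (abbreviated blueprint)
resp = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/connectors/_create",
    json={
        "name": "Amazon Bedrock connector: Titan Multimodal Embeddings G1",
        "protocol": "aws_sigv4",
        "parameters": {"region": AWS_REGION, "service_name": "bedrock"},
        "credential": {"roleArn": BEDROCK_INVOKE_ROLE_ARN},
        "actions": [{
            "action_type": "predict",
            "method": "POST",
            "headers": {"content-type": "application/json"},
            "url": f"https://bedrock-runtime.{AWS_REGION}.amazonaws.com/model/amazon.titan-embed-image-v1/invoke",
            "request_body": '{"inputText": "${parameters.inputText:-null}", "inputImage": "${parameters.inputImage:-null}"}',
        }],
    },
    auth=open_search_auth,
)
connector_id = resp.json()["connector_id"]

# 3) Register the external (remote) model to the group and deploy it in one call
resp = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/models/_register?deploy=true",
    json={
        "name": "titan-multimodal-embeddings",
        "function_name": "remote",
        "model_group_id": model_group_id,
        "connector_id": connector_id,
    },
    auth=open_search_auth,
)
model_id = resp.json().get("model_id")  # if absent, poll the returned task_id to obtain the model ID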

Create an OpenSearch search pipeline and enable hybrid search
A search pipeline runs inside the OpenSearch Service domain and can have three types of processors: search request processors, search response processors, and phase results processors. For our search pipeline, we use a phase results processor, which runs between the search phases at the coordinating node level. It applies the normalization processor to normalize and combine the scores from keyword and semantic search. For hybrid search, the min-max normalization and arithmetic_mean combination techniques are preferred, but you can also try L2 normalization and the geometric_mean or harmonic_mean combination techniques, depending on your data and use case.
payload = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {
                    "technique": "min_max"
                },
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {
                        "weights": [
                            OPENSEARCH_KEYWORD_WEIGHT,
                            1 - OPENSEARCH_KEYWORD_WEIGHT
                        ]
                    }
                }
            }
        }
    ]
}
response = requests.put(
    url=f"{OPENSEARCH_ENDPOINT}/_search/pipeline/" + OPENSEARCH_SEARCH_PIPELINE_NAME,
    json=payload,
    headers={"Content-Type": "application/json"},
    auth=open_search_auth
)
Create an OpenSearch Service index for storing the multimodal embeddings and metadata
For this post, we use the Amazon Berkeley Objects Dataset, which is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. In this example, we only use the Shoes category and listings that are in en_US, as shown in the Prepare listings dataset for Amazon OpenSearch ingestion section of the notebook.
Use the following code to create an OpenSearch index to ingest the sample data:
response = opensearch_client.indices.create(
    index=OPENSEARCH_INDEX_NAME,
    body={
        "settings": {
            "index.knn": True,
            "number_of_shards": 2
        },
        "mappings": {
            "properties": {
                "amazon_titan_multimodal_embeddings": {
                    "type": "knn_vector",
                    "dimension": 1024,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "parameters": {}
                    }
                }
            }
        }
    }
)
Ingest sample data to the OpenSearch Service index
In this step, you select the relevant features used for generating embeddings, and the images are converted to Base64. The combination of a selected feature and a Base64 image is used to generate multimodal embeddings, which are stored in the OpenSearch Service index along with the metadata using an OpenSearch bulk operation that ingests listings in batches.
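A condensed version of that ingestion loop might look like the following sketch; get_titan_embedding is a hypothetical helper wrapping the Amazon Bedrock call shown earlier, and the source fields are illustrative, mirroring the index mapping above.

from opensearchpy import helpers

def build_actions(listings):
    """Yield OpenSearch bulk actions for a batch of product listings."""
    for item in listings:
        # get_titan_embedding is a hypothetical helper that calls the
        # Amazon Titan Multimodal Embeddings G1 model (see the earlier snippet).
        embedding = get_titan_embedding(text=item["item_name"], image_b64=item.get("image_b64"))
        yield {
            "_index": OPENSEARCH_INDEX_NAME,
            "_id": item["item_id"],
            "_source": {
                "amazon_titan_multimodal_embeddings": embedding,
                "item_name": item["item_name"],
                "color": item.get("color"),
                "style": item.get("style"),
            },
        }

# Ingest listings in batches using the bulk helper
success_count, errors = helpers.bulk(opensearch_client, build_actions(listings_batch), chunk_size=100)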
Create OpenSearch Service query functions to test search functionality
With the sample data ingested, you can run queries against this data to test the hybrid search functionality. To facilitate this process, we created helper functions to perform the queries in the query workflow section of the notebook. In this section, you explore specific parts of the functions that differentiate the search methods.
Keyword search
For keyword search, send the following payload to the OpenSearch domain search endpoint:
payload = {
    "query": {
        "multi_match": {
            "query": query_text,
        }
    },
}
Semantic search
For semantic search, you can send the text and image as part of the payload. The model_id in the request refers to the external embeddings model that you connected earlier. OpenSearch will invoke the model and convert the text and image to embeddings.
payload = {
    "query": {
        "neural": {
            "vector_embedding": {
                "query_text": query_text,
                "query_image": query_jpg_image,
                "model_id": model_id,
                "k": 5
            }
        }
    }
}
Hybrid search
This method uses the OpenSearch search pipeline you created. The payload includes both the keyword (multi_match) and neural search queries.
payload = {
    "query": {
        "hybrid": {
            "queries": [
                {
                    "multi_match": {
                        "query": query_text,
                    }
                },
                {
                    "neural": {
                        "vector_embedding": {
                            "query_text": query_text,
                            "query_image": query_jpg_image,
                            "model_id": model_id,
                            "k": 5
                        }
                    }
                }
            ]
        }
    }
}
Test search methods
To compare the multiple search methods, you can query the index using query_text, which provides specific information about the desired output, and query_jpg_image, which provides the overall abstraction of the desired style of the output.
query_text = "leather sandals in Petal Blush"
search_image_path = '16/16e48774.jpg'

Keyword search
The following output lists the top three keyword search results. The keyword search successfully located leather sandals in the color Petal Blush, but it didn’t take the desired style into consideration.

——————————————————————————————————————————–
Score: 8.4351 Item ID: B01MYDNG7C
Item Name: Amazon Brand – The Fix Women’s Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather Material: None Color: Petal Blush Style: Cantu Ruffle Ankle Wrap Sandal
——————————————————————————————————————————–
Score: 8.4351 Item ID: B06XH8M37Q
Item Name: Amazon Brand – The Fix Women’s Farah Single Buckle Platform Dress Sandal, Petal Blush, 6.5 B US
Fabric Type: 100% Leather Material: None Color: Petal Blush Style: Farah Single Buckle Platform Sandal
——————————————————————————————————————————–
Score: 8.4351 Item ID: B01MSCV2YB
Item Name: Amazon Brand – The Fix Women’s Conley Lucite Heel Dress Sandal,Petal Blush,7.5 B US
Fabric Type: Leather Material: Suede Color: Petal Blush Style: Conley Lucite Heel Sandal
——————————————————————————————————————————–

 
Semantic search
Semantic search successfully located leather sandals and considered the desired style. However, similarity to the provided image took priority over the specific color provided in query_text.

——————————————————————————————————————————–
Score: 0.7072 Item ID: B01MZF96N7
Item Name: Amazon Brand – The Fix Women’s Bonilla Block Heel Cutout Tribal Dress Sandal, Havana Tan, 7 B US
Fabric Type: Leather Material: Suede Color: Havana Tan Style: Bonilla Block Heel Cutout Tribal Sandal
——————————————————————————————————————————–
Score: 0.7018 Item ID: B01MUG3C0Q
Item Name: Amazon Brand – The Fix Women’s Farrell Triangle-Cutout Square Toe Flat Dress Sandal, Light Rose/Gold, 7.5 B US
Fabric Type: Synthetic Material: Leather Color: Light Rose/Gold Style: Farrell Cutout Tribal Square Toe Flat Sandal
——————————————————————————————————————————–
Score: 0.6858 Item ID: B01MYDNG7C
Item Name: Amazon Brand – The Fix Women’s Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather Material: None Color: Petal Blush Style: Cantu Ruffle Ankle Wrap Sandal
——————————————————————————————————————————–

 
Hybrid search
Hybrid search returned results similar to the semantic search because they use the same embeddings model. However, by combining the output of keyword and semantic searches, the ranking of the Petal Blush sandal that most closely matches query_jpg_image increases, moving it to the top of the results list.

——————————————————————————————————————————–
Score: 0.6838 Item ID: B01MYDNG7C
Item Name: Amazon Brand – The Fix Women’s Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather Material: None Color: Petal Blush Style: Cantu Ruffle Ankle Wrap Sandal
——————————————————————————————————————————–
Score: 0.6 Item ID: B01MZF96N7
Item Name: Amazon Brand – The Fix Women’s Bonilla Block Heel Cutout Tribal Dress Sandal, Havana Tan, 7 B US
Fabric Type: Leather Material: Suede Color: Havana Tan Style: Bonilla Block Heel Cutout Tribal Sandal
——————————————————————————————————————————–
Score: 0.5198 Item ID: B01MUG3C0Q
Item Name: Amazon Brand – The Fix Women’s Farrell Triangle-Cutout Square Toe Flat Dress Sandal, Light Rose/Gold, 7.5 B US
Fabric Type: Synthetic Material: Leather Color: Light Rose/Gold Style: Farrell Cutout Tribal Square Toe Flat Sandal
——————————————————————————————————————————–

 
Clean up
After you complete this walkthrough, clean up all the resources you created as part of this post. This is an important step to make sure you don't incur any unexpected charges. If you used an existing OpenSearch Service domain, the Cleanup section of the notebook provides suggested cleanup actions, including deleting the index, undeploying the model, deleting the model, deleting the model group, and deleting the Amazon Bedrock connector. If you created an OpenSearch Service domain exclusively for this exercise, you can bypass these actions and delete the domain.
Conclusion
In this post, we explained how to implement multimodal hybrid search by combining keyword and semantic search capabilities using Amazon Bedrock and Amazon OpenSearch Service. We showcased a solution that uses Amazon Titan Multimodal Embeddings G1 to generate embeddings for text and images, enabling users to search using both modalities. The hybrid approach combines the strengths of keyword search and semantic search, delivering accurate and relevant results to customers.
We encourage you to test the notebook in your own account and get firsthand experience with hybrid search variations. In addition to the outputs shown in this post, we provide a few variations in the notebook. If you’re interested in using custom embeddings models in Amazon SageMaker AI instead, see Hybrid Search with Amazon OpenSearch Service. If you want a solution that offers semantic search only, see Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless and Build multimodal search with Amazon OpenSearch Service.

About the Authors
Renan Bertolazzi is an Enterprise Solutions Architect helping customers realize the potential of cloud computing on AWS. In this role, Renan is a technical leader advising executives and engineers on cloud solutions and strategies designed to innovate, simplify, and deliver results.
Birender Pal is a Senior Solutions Architect at AWS, where he works with strategic enterprise customers to design scalable, secure and resilient cloud architectures. He supports digital transformation initiatives with a focus on cloud-native modernization, machine learning, and Generative AI. Outside of work, Birender enjoys experimenting with recipes from around the world.
Sarath Krishnan is a Senior Solutions Architect with Amazon Web Services. He is passionate about enabling enterprise customers on their digital transformation journey. Sarath has extensive experience in architecting highly available, scalable, cost-effective, and resilient applications on the cloud. His area of focus includes DevOps, machine learning, MLOps, and generative AI.

Protect sensitive data in RAG applications with Amazon Bedrock

Retrieval Augmented Generation (RAG) applications have become increasingly popular due to their ability to enhance generative AI tasks with contextually relevant information. Implementing RAG-based applications requires careful attention to security, particularly when handling sensitive data. The protection of personally identifiable information (PII), protected health information (PHI), and confidential business data is crucial because this information flows through RAG systems. Failing to address these security considerations can lead to significant risks and potential data breaches. For healthcare organizations, financial institutions, and enterprises handling confidential information, these risks can result in regulatory compliance violations and breach of customer trust. See the OWASP Top 10 for Large Language Model Applications to learn more about the unique security risks associated with generative AI applications.
Developing a comprehensive threat model for your generative AI applications can help you identify potential vulnerabilities related to sensitive data leakage, prompt injections, unauthorized data access, and more. To assist in this effort, AWS provides a range of generative AI security strategies that you can use to create appropriate threat models.
Amazon Bedrock Knowledge Bases is a fully managed capability that simplifies the management of the entire RAG workflow, empowering organizations to give foundation models (FMs) and agents contextual information from your private data sources to deliver more relevant and accurate responses tailored to your specific needs. Additionally, with Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can redact sensitive information such as PII to protect privacy using Amazon Bedrock Guardrails.
RAG workflow: Converting data to actionable knowledge
RAG consists of two major steps:

Ingestion – Preprocessing unstructured data, which includes converting the data into text documents and splitting the documents into chunks. Document chunks are then encoded with an embedding model to convert them to document embeddings. These encoded document embeddings along with the original document chunks in the text are then stored to a vector store, such as Amazon OpenSearch Service.
Augmented retrieval – At query time, the user's query is first encoded with the same embedding model to convert the query into a query embedding. The generated query embedding is then used to perform a similarity search on the stored document embeddings to find and retrieve document chunks that are semantically similar to the query. After the document chunks are retrieved, the user prompt is augmented by passing the retrieved chunks as additional context, so that the text generation model can answer the user query using the retrieved context. If sensitive data isn't sanitized before ingestion, it might be retrieved from the vector store and inadvertently leaked to unauthorized users as part of the model response.

The following diagram shows the architectural workflow of a RAG system, illustrating how a user's query is processed through multiple stages to generate an informed response.

Solution overview
In this post, we present two architecture patterns for protecting sensitive data when building RAG-based applications using Amazon Bedrock Knowledge Bases: data redaction at the storage level and role-based access.
Data redaction at storage level – Identifying and redacting (or masking) sensitive data before storing it in the vector store (ingestion) using Amazon Bedrock Knowledge Bases. This zero-trust approach to data sensitivity reduces the risk of sensitive information being inadvertently disclosed to unauthorized users.
Role-based access to sensitive data – Controlling selective access to sensitive information based on user roles and permissions during retrieval. This approach is best in situations where sensitive data needs to be stored in the vector store, such as in healthcare settings with distinct user roles like administrators (doctors) and non-administrators (nurses or support personnel).
For all data stored in Amazon Bedrock, the AWS shared responsibility model applies.
Let’s dive in to understand how to implement the data redaction at storage level and role-based access architecture patterns effectively.
Scenario 1: Identify and redact sensitive data before ingesting into the vector store
The ingestion flow implements a four-step process to help protect sensitive data when building RAG applications with Amazon Bedrock:

Source document processing – An AWS Lambda function monitors incoming text documents landing in a source Amazon Simple Storage Service (Amazon S3) bucket and triggers an Amazon Comprehend PII redaction job to identify and redact (or mask) sensitive data in the documents. An Amazon EventBridge rule triggers the Lambda function every 5 minutes. The document processing pipeline described here only processes text documents. To handle documents containing embedded images, you should implement additional preprocessing steps to extract and analyze images separately before ingestion.
PII identification and redaction – The Amazon Comprehend PII redaction job analyzes the text content to identify and redact PII entities. For example, the job identifies and redacts sensitive data entities like name, email, address, and other financial PII entities.
Deep security scanning – After redaction, documents move to another folder where Amazon Macie verifies redaction effectiveness and identifies any remaining sensitive data objects. Documents flagged by Macie go to a quarantine bucket for manual review, while cleared documents move to a redacted bucket ready for ingestion. For more details on data ingestion, see Sync your data with your Amazon Bedrock knowledge base.
Secure knowledge base integration – Redacted documents are ingested into the knowledge base through a data ingestion job. In case of multi-modal content, for enhanced security, consider implementing:

A dedicated image extraction and processing pipeline.
Image analysis to detect and redact sensitive visual information.
Amazon Bedrock Guardrails to filter inappropriate image content during retrieval.

This multi-layered approach focuses on securing text content while highlighting the importance of implementing additional safeguards for image processing. Organizations should evaluate their multi-modal document requirements and extend the security framework accordingly.
Ingestion flow
The following illustration demonstrates a secure document processing pipeline for handling sensitive data before ingestion into Amazon Bedrock Knowledge Bases.

The high-level steps are as follows:

The document ingestion flow begins when documents containing sensitive data are uploaded to a monitored inputs folder in the source bucket. An EventBridge rule triggers a Lambda function (ComprehendLambda).
The ComprehendLambda function monitors for new files in the inputs folder of the source bucket and moves landed files to a processing folder. It then launches an asynchronous Amazon Comprehend PII redaction analysis job and records the job ID and status in an Amazon DynamoDB JobTracking table for monitoring job completion. The Amazon Comprehend PII redaction job automatically redacts sensitive elements such as names, addresses, phone numbers, Social Security numbers, driver's license IDs, and banking information, replacing each identified PII entity with a placeholder token for its entity type, such as [NAME] or [SSN]. The entities to mask can be configured using RedactionConfig. For more information, see Redacting PII entities with asynchronous jobs (API). The MaskMode in RedactionConfig is set to REPLACE_WITH_PII_ENTITY_TYPE instead of MASK; redacting with a MaskCharacter would affect the quality of retrieved documents, because many documents could contain the same MaskCharacter, thereby reducing retrieval quality. After completion, the redacted files move to the for_macie_scan folder for secondary scanning.
The secondary verification phase employs Macie for additional sensitive data detection on the redacted files. Another Lambda function (MacieLambda) monitors the completion of the Amazon Comprehend PII redaction job. When the job is complete, the function triggers a Macie one-time sensitive data detection job with files in the for_macie_scan folder.
The final stage integrates with the Amazon Bedrock knowledge base. The findings from Macie determine the next steps: files with high severity ratings (3 or higher) are moved to a quarantine folder for human review by authorized personnel with appropriate permissions and access controls, whereas files with low severity ratings are moved to a designated redacted bucket, which then triggers a data ingestion job to the Amazon Bedrock knowledge base.

This process helps prevent sensitive details from being exposed when the model generates responses based on retrieved data.
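For reference, the asynchronous redaction job launched by ComprehendLambda can be started with a boto3 call along these lines; the job name, bucket names, prefixes, and role ARN are placeholders.

import boto3

comprehend = boto3.client("comprehend")

response = comprehend.start_pii_entities_detection_job(
    JobName="pii-redaction-job",                                       # placeholder name
    InputDataConfig={
        "S3Uri": "s3://source-bucket/processing/",                     # placeholder bucket/prefix
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://source-bucket/for_macie_scan/"},  # placeholder bucket/prefix
    Mode="ONLY_REDACTION",
    RedactionConfig={
        "PiiEntityTypes": ["ALL"],
        # Replace entities with their type (for example, [NAME]) rather than a mask
        # character, which would otherwise hurt retrieval quality.
        "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
    },
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",  # placeholder
    LanguageCode="en",
)
job_id = response["JobId"]  # tracked in the DynamoDB JobTracking table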
Augmented retrieval flow
The augmented retrieval flow diagram shows how user queries are processed securely. It illustrates the complete workflow from user authentication through Amazon Cognito to response generation with Amazon Bedrock, including guardrail interventions that help prevent policy violations in both inputs and outputs.

The high-level steps are as follows:

For our demo, we use a web application UI built using Streamlit. The web application launches with a login form with user name and password fields.
The user enters the credentials and logs in. User credentials are authenticated using Amazon Cognito user pools. Amazon Cognito acts as our OpenID Connect (OIDC) identity provider (IdP) to provide authentication and authorization services for this application. After authentication, Amazon Cognito generates and returns identity, access, and refresh tokens in JSON Web Token (JWT) format to the web application. Refer to Understanding user pool JSON web tokens (JWTs) for more information.
After the user is authenticated, they are logged in to the web application, where an AI assistant UI is presented to the user. The user enters their query (prompt) in the assistant’s text box. The query is then forwarded using a REST API call to an Amazon API Gateway endpoint along with the access tokens in the header.
API Gateway forwards the payload along with the claims included in the header to a conversation orchestrator Lambda function.
The conversation orchestrator Lambda function processes the user prompt and model parameters received from the UI and calls the RetrieveAndGenerate API to the Amazon Bedrock knowledge base. Input guardrails are first applied to this request to perform input validation on the user query.

The guardrail evaluates and applies predefined responsible AI policies using content filters, denied topic filters and word filters on user input. For more information on creating guardrail filters, see Create a guardrail.
If the predefined input guardrail policies are triggered on the user input, the guardrails intervene and return a preconfigured message like, “Sorry, your query violates our usage policy.”
Requests that don't trigger a guardrail policy retrieve the documents from the knowledge base and generate a response using the RetrieveAndGenerate API. Optionally, if users choose to run Retrieve separately, guardrails can also be applied at that stage. Guardrails applied during document retrieval can help block sensitive data returned from the vector store.

During retrieval, Amazon Bedrock Knowledge Bases encodes the user query using the Amazon Titan Text v2 embeddings model to generate a query embedding.
Amazon Bedrock Knowledge Bases performs a similarity search with the query embedding against the document embeddings in the OpenSearch Service vector store and retrieves top-k chunks. Optionally, post-retrieval, you can incorporate a reranking model to improve the retrieved results quality from the OpenSearch vector store. Refer to Improve the relevance of query responses with a reranker model in Amazon Bedrock for more details.
Finally, the user prompt is augmented with the retrieved document chunks from the vector store as context and the final prompt is sent to an Amazon Bedrock foundation model (FM) for inference. Output guardrail policies are again applied post-response generation. If the predefined output guardrail policies are triggered, the model generates a predefined response like “Sorry, your query violates our usage policy.” If no policies are triggered, then the large language model (LLM) generated response is sent to the user.

To deploy Scenario 1, follow the instructions in the GitHub repository.
Scenario 2: Implement role-based access to PII data during retrieval
In this scenario, we demonstrate a comprehensive security approach that combines role-based access control (RBAC) with intelligent PII guardrails for RAG applications. It integrates Amazon Bedrock with AWS identity services to automatically enforce security through different guardrail configurations for admin and non-admin users.
The solution uses the metadata filtering capabilities of Amazon Bedrock Knowledge Bases to dynamically filter documents during similarity searches using metadata attributes assigned before ingestion. For example, admin and non-admin metadata attributes are created and attached to relevant documents before the ingestion process. During retrieval, the system returns only the documents with metadata matching the user’s security role and permissions and applies the relevant guardrail policies to either mask or block sensitive data detected on the LLM output.
This metadata-driven approach, combined with features like custom guardrails, real-time PII detection, masking, and comprehensive access logging creates a robust framework that maintains the security and utility of the RAG application while enforcing RBAC.
The following diagram illustrates how RBAC works with metadata filtering in the vector database.

For a detailed understanding of how metadata filtering works, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy.
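The following abbreviated boto3 sketch shows how the integration Lambda function might combine a role-based guardrail with a metadata filter in a single RetrieveAndGenerate call; the metadata key access_level, the guardrail IDs, the knowledge base ID, and the model ARN are placeholders, and the filtering logic is simplified for illustration.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def answer_query(user_query: str, is_admin: bool) -> str:
    # Placeholder guardrail IDs: one guardrail masks PII, the other leaves it intact.
    guardrail = (
        {"guardrailId": ADMIN_GUARDRAIL_ID, "guardrailVersion": "1"}
        if is_admin
        else {"guardrailId": NON_ADMIN_GUARDRAIL_ID, "guardrailVersion": "1"}
    )
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": user_query},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,   # placeholder
                "modelArn": MODEL_ARN,                  # placeholder
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5,
                        # Hypothetical metadata attribute attached to documents before ingestion
                        "filter": {"equals": {"key": "access_level",
                                              "value": "admin" if is_admin else "non-admin"}},
                    }
                },
                "generationConfiguration": {"guardrailConfiguration": guardrail},
            },
        },
    )
    return response["output"]["text"]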
Augmented retrieval flow
The augmented retrieval flow diagram shows how user queries are processed securely based on role-based access.

The workflow consists of the following steps:

The user is authenticated using an Amazon Cognito user pool. It generates a validation token after successful authentication.
The user query is sent using an API call along with the authentication token through Amazon API Gateway.
Amazon API Gateway forwards the payload and claims to an integration Lambda function.
The Lambda function extracts the claims from the header and checks for user role and determines whether to use an admin guardrail or a non-admin guardrail based on the access level.
Next, the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API is invoked along with the guardrail applied on the user input.
Amazon Bedrock Knowledge Bases embeds the query using the Amazon Titan Text v2 embeddings model.
Amazon Bedrock Knowledge Bases performs similarity searches on the OpenSearch Service vector database and retrieves relevant chunks (optionally, you can improve the relevance of query responses using a reranker model in the knowledge base).
The user prompt is augmented with the retrieved context from the previous step and sent to the Amazon Bedrock FM for inference.
Based on the user role, the LLM output is evaluated against defined Responsible AI policies using either admin or non-admin guardrails.
Based on the guardrail evaluation, the system either returns a “Sorry! Cannot Respond” message if the guardrail intervenes, or delivers an appropriate response, with no masking for admin users and with sensitive data masked for non-admin users.
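The following is a hypothetical sketch of steps 4–6 in a single Lambda handler: it reads the user’s role from the token claims, selects the admin or non-admin guardrail, and calls RetrieveAndGenerate with a matching metadata filter. The claim name, metadata key, and environment variable names are illustrative placeholders, not part of the published solution.

import json
import os

import boto3

bedrock_rt = boto3.client("bedrock-agent-runtime")

# Placeholder resource IDs supplied through environment variables.
ADMIN_GUARDRAIL_ID = os.environ["ADMIN_GUARDRAIL_ID"]
NON_ADMIN_GUARDRAIL_ID = os.environ["NON_ADMIN_GUARDRAIL_ID"]
KB_ID = os.environ["KNOWLEDGE_BASE_ID"]
MODEL_ARN = os.environ["MODEL_ARN"]


def handler(event, context):
    # API Gateway forwards the Cognito claims together with the request payload.
    claims = event["requestContext"]["authorizer"]["claims"]
    role = claims.get("custom:role", "non-admin")  # hypothetical claim name
    guardrail_id = ADMIN_GUARDRAIL_ID if role == "admin" else NON_ADMIN_GUARDRAIL_ID

    body = json.loads(event["body"])
    response = bedrock_rt.retrieve_and_generate(
        input={"text": body["query"]},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KB_ID,
                "modelArn": MODEL_ARN,
                # Only return documents whose metadata matches the caller's role.
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "filter": {"equals": {"key": "access_level", "value": role}}
                    }
                },
                "generationConfiguration": {
                    "guardrailConfiguration": {
                        "guardrailId": guardrail_id,
                        "guardrailVersion": "1",
                    }
                },
            },
        },
    )
    return {"statusCode": 200, "body": json.dumps({"answer": response["output"]["text"]})}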

To deploy Scenario 2, follow the instructions in the GitHub repository.
This security architecture combines Amazon Bedrock guardrails with granular access controls to automatically manage sensitive information exposure based on user permissions. The multi-layered approach makes sure organizations maintain security compliance while fully utilizing their knowledge base, proving security and functionality can coexist.
Customizing the solution
The solution offers several customization points to enhance its flexibility and adaptability:

Integration with external APIs – You can integrate existing PII detection and redaction solutions with this system. The Lambda function can be modified to use custom APIs for PHI or PII handling before calling the Amazon Bedrock Knowledge Bases API.
Multi-modal processing – Although the current solution focuses on text, it can be extended to handle images containing PII by incorporating image-to-text conversion and caption generation. For more information about using Amazon Bedrock for processing multi-modal content during ingestion, see Parsing options for your data source.
Custom guardrails – Organizations can implement additional specialized security measures tailored to their specific use cases.
Structured data handling – For queries involving structured data, the solution can be customized to include Amazon Redshift as a structured data store as opposed to OpenSearch Service. Data masking and redaction on Amazon Redshift can be achieved by applying dynamic data masking (DDM) policies, including fine-grained DDM policies like role-based access control and column-level policies using conditional dynamic data masking.
Agentic workflow integration – When incorporating an Amazon Bedrock knowledge base with an agentic workflow, additional safeguards can be implemented to protect sensitive data from external sources, such as API calls, tool use, agent action groups, session state, and long-term agentic memory.
Response streaming support – The current solution uses a REST API Gateway endpoint that doesn’t support streaming. For streaming capabilities, consider WebSocket APIs in API Gateway, Application Load Balancer (ALB), or custom solutions with chunked responses using client-side reassembly or long-polling techniques.

With these customization options, you can tailor the solution to your specific needs, providing a robust and flexible security framework for your RAG applications. This approach not only protects sensitive data but also maintains the utility and efficiency of the knowledge base, allowing users to interact with the system while automatically enforcing role-appropriate information access and PII handling.
Shared security responsibility: The customer’s role
At AWS, security is our top priority and security in the cloud is a shared responsibility between AWS and our customers. With AWS, you control your data by using AWS services and tools to determine where your data is stored, how it is secured, and who has access to it. Services such as AWS Identity and Access Management (IAM) provide robust mechanisms for securely controlling access to AWS services and resources.
To enhance your security posture further, services like AWS CloudTrail and Amazon Macie offer advanced compliance, detection, and auditing capabilities. When it comes to encryption, AWS CloudHSM and AWS Key Management Service (KMS) enable you to generate and manage encryption keys with confidence.
For organizations seeking to establish governance and maintain data residency controls, AWS Control Tower offers a comprehensive solution. For more information on Data protection and Privacy, refer to Data Protection and Privacy at AWS.
While our solution demonstrates the use of PII detection and redaction techniques, it does not provide an exhaustive list of all PII types or detection methods. As a customer, you bear the responsibility for implementing the appropriate PII detection types and redaction methods using AWS services, including Amazon Bedrock Guardrails and other open-source libraries. The regular expressions configured in Bedrock Guardrails within this solution serve as a reference example only and do not cover all possible variations for detecting PII types. For instance, date of birth (DOB) formats can vary widely. Therefore, it falls on you to configure Bedrock Guardrails and policies to accurately detect the PII types relevant to your use case.

Amazon Bedrock maintains strict data privacy standards. The service does not store or log your prompts and completions, nor does it use them to train AWS models or share them with third parties. We implement this through our Model Deployment Account architecture – each AWS Region where Amazon Bedrock is available has a dedicated deployment account per model provider, managed exclusively by the Amazon Bedrock service team. Model providers have no access to these accounts. When a model is delivered to AWS, Amazon Bedrock performs a deep copy of the provider’s inference and training software into these controlled accounts for deployment, making sure that model providers cannot access Amazon Bedrock logs or customer prompts and completions.
Ultimately, while we provide the tools and infrastructure, the responsibility for securing your data using AWS services rests with you, the customer. This shared responsibility model makes sure that you have the flexibility and control to implement security measures that align with your unique requirements and compliance needs, while we maintain the security of the underlying cloud infrastructure. For comprehensive information about Amazon Bedrock security, please refer to the Amazon Bedrock Security documentation.
Conclusion
In this post, we explored two approaches for securing sensitive data in RAG applications using Amazon Bedrock. The first approach focused on identifying and redacting sensitive data before ingestion into an Amazon Bedrock knowledge base, and the second demonstrated a fine-grained RBAC pattern for managing access to sensitive information during retrieval. These solutions represent just two possible approaches among many for securing sensitive data in generative AI applications.
Security is a multi-layered concern that requires careful consideration across all aspects of your application architecture. Looking ahead, we plan to dive deeper into RBAC for sensitive data within structured data stores when used with Amazon Bedrock Knowledge Bases. This can provide additional granularity and control over data access patterns while maintaining security and compliance requirements. Securing sensitive data in RAG applications requires ongoing attention to evolving security best practices, regular auditing of access patterns, and continuous refinement of your security controls as your applications and requirements grow.
To enhance your understanding of Amazon Bedrock security implementation, explore these additional resources:

Implementing least privilege access for Amazon Bedrock
Safeguard your generative AI workloads from prompt injections

The complete source code and deployment instructions for these solutions are available in our GitHub repository.
We encourage you to explore the repository for detailed implementation guidance and customize the solutions based on your specific requirements using the customization points discussed earlier.

About the authors
Praveen Chamarthi brings exceptional expertise to his role as a Senior AI/ML Specialist at Amazon Web Services, with over two decades in the industry. His passion for Machine Learning and Generative AI, coupled with his specialization in ML inference on Amazon SageMaker and Amazon Bedrock, enables him to empower organizations across the Americas to scale and optimize their ML operations. When he’s not advancing ML workloads, Praveen can be found immersed in books or enjoying science fiction films. Connect with him on LinkedIn to follow his insights.
Srikanth Reddy is a Senior AI/ML Specialist with Amazon Web Services. He is responsible for providing deep, domain-specific expertise to enterprise customers, helping them use AWS AI and ML capabilities to their fullest potential. You can find him on LinkedIn.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
Vivek Bhadauria is a Principal Engineer at Amazon Bedrock with almost a decade of experience in building AI/ML services. He now focuses on building generative AI services such as Amazon Bedrock Agents and Amazon Bedrock Guardrails. In his free time, he enjoys biking and hiking.
Brandon Rooks Sr. is a Cloud Security Professional with 20+ years of experience in the IT and Cybersecurity field. Brandon joined AWS in 2019, where he dedicates himself to helping customers proactively enhance the security of their cloud applications and workloads. Brandon is a lifelong learner, and holds the CISSP, AWS Security Specialty, and AWS Solutions Architect Professional certifications. Outside of work, he cherishes moments with his family, engaging in various activities such as sports, gaming, music, volunteering, and traveling.
Vikash Garg is a Principal Engineer at Amazon Bedrock with almost 4 years of experience in building AI/ML services. He has a decade of experience in building large-scale systems. He now focuses on building the generative AI service AWS Bedrock Guardrails. In his free time, he enjoys hiking and traveling.

Meet VoltAgent: A TypeScript AI Framework for Building and Orchestrating Scalable AI Agents

VoltAgent is an open-source TypeScript framework designed to streamline the creation of AI‑driven applications by offering modular building blocks and abstractions for autonomous agents. It addresses the complexity of directly working with large language models (LLMs), tool integrations, and state management by providing a core engine that handles these concerns out-of-the-box. Developers can define agents with specific roles, equip them with memory, and tie them to external tools without having to reinvent foundational code for each new project.

Unlike DIY solutions that require extensive boilerplate and custom infrastructure, or no-code platforms that often impose vendor lock-in and limited extensibility, VoltAgent strikes a middle ground by giving developers full control over provider choice, prompt design, and workflow orchestration. It integrates seamlessly into existing Node.js environments, enabling teams to start small, build single assistants, and scale up to complex multi‑agent systems coordinated by supervisor agents.

The Challenge of Building AI Agents

Creating intelligent assistants typically involves three major pain points:  

Model Interaction Complexity: Managing calls to LLM APIs, handling retries, latency, and error states.  

Stateful Conversations: Persisting user context across sessions to achieve natural, coherent dialogues.  

External System Integration: Connecting to databases, APIs, and third‑party services to perform real‑world tasks.

Traditional approaches either require you to write custom code for each of these layers, resulting in fragmented and hard-to-maintain repositories, or lock you into proprietary platforms that sacrifice flexibility. VoltAgent abstracts these layers into reusable packages, so developers can focus on crafting agent logic rather than plumbing.

Core Architecture and Modular Packages

At its core, VoltAgent consists of a Core Engine package (‘@voltagent/core’) responsible for agent lifecycle, message routing, and tool invocation. Around this core, a suite of extensible packages provides specialized features:

Multi‑Agent Systems: Supervisor agents coordinate sub‑agents, delegating tasks based on custom logic and maintaining shared memory channels.  

Tooling & Integrations: ‘createTool’ utilities and type-safe tool definitions (via Zod schemas) enable agents to invoke HTTP APIs, database queries, or local scripts as if they were native LLM functions.  

Voice Interaction: The ‘@voltagent/voice’ package provides speech-to-text and text-to-speech support, enabling agents to speak and listen in real-time.  

Model Control Protocol (MCP): Standardized protocol support for inter‑process or HTTP‑based tool servers, facilitating vendor‑agnostic tool orchestration.  

Retrieval‑Augmented Generation (RAG): Integrate vector stores and retriever agents to fetch relevant context before generating responses.  

Memory Management: Pluggable memory providers (in-memory, LibSQL/Turso, Supabase) enable agents to retain past interactions, ensuring continuity of context.  

Observability & Debugging: A separate VoltAgent Console provides a visual interface for inspecting agent states, logs, and conversation flows in real-time.

Getting Started: Automatic Setup

VoltAgent includes a CLI tool, ‘create-voltagent-app’, to scaffold a fully configured project in seconds. This automatic setup prompts for your project name and preferred package manager, installs dependencies, and generates starter code, including a simple agent definition so that you can run your first AI assistant with a single command.

Copy CodeCopiedUse a different Browser# Using npm
npm create voltagent-app@latest my-voltagent-app

# Or with pnpm
pnpm create voltagent-app my-voltagent-app

cd my-voltagent-app
npm run dev

Code Source

At this point, you can open the VoltAgent Console in your browser, locate your new agent, and start chatting directly in the built‑in UI. The CLI’s built‑in ‘tsx watch’ support means any code changes in ‘src/’ automatically restart the server.

Manual Setup and Configuration

For teams that prefer fine‑grained control over their project configuration, VoltAgent provides a manual setup path. After creating a new npm project and adding TypeScript support, developers install the core framework and any desired packages:

Copy CodeCopiedUse a different Browser// tsconfig.json
{
“compilerOptions”: {
“target”: “ES2020”,
“module”: “NodeNext”,
“outDir”: “dist”,
“strict”: true,
“esModuleInterop”: true
},
“include”: [“src”]
}

Code Source

Copy CodeCopiedUse a different Browser# Development deps
npm install --save-dev typescript tsx @types/node @voltagent/cli

# Framework deps
npm install @voltagent/core @voltagent/vercel-ai @ai-sdk/openai zod

Code Source

A minimal ‘src/index.ts’ might look like this:

Copy CodeCopiedUse a different Browserimport { VoltAgent, Agent } from “@voltagent/core”;
import { VercelAIProvider } from “@voltagent/vercel-ai”;
import { openai } from “@ai-sdk/openai”;

// Define a simple agent
const agent = new Agent({
name: “my-agent”,
description: “A helpful assistant that answers questions without using tools”,
llm: new VercelAIProvider(),
model: openai(“gpt-4o-mini”),
});

// Initialize VoltAgent
new VoltAgent({
agents: { agent },
});

Code Source

Adding an ‘.env’ file with your ‘OPENAI_API_KEY’ and updating ‘package.json’ scripts to include ‘"dev": "tsx watch --env-file=.env ./src"‘ completes the local development setup. Running ‘npm run dev’ launches the server and automatically connects to the developer console.

Building Multi‑Agent Workflows

Beyond single agents, VoltAgent truly shines when orchestrating complex workflows via Supervisor Agents. In this paradigm, specialized sub‑agents handle discrete tasks, such as fetching GitHub stars or contributors, while a supervisor orchestrates the sequence and aggregates results:

Copy CodeCopiedUse a different Browserimport { Agent, VoltAgent } from “@voltagent/core”;
import { VercelAIProvider } from “@voltagent/vercel-ai”;
import { openai } from “@ai-sdk/openai”;

const starsFetcher = new Agent({
name: “Stars Fetcher”,
description: “Fetches star count for a GitHub repo”,
llm: new VercelAIProvider(),
model: openai(“gpt-4o-mini”),
tools: [fetchRepoStarsTool],
});

const contributorsFetcher = new Agent({
name: “Contributors Fetcher”,
description: “Fetches contributors for a GitHub repo”,
llm: new VercelAIProvider(),
model: openai(“gpt-4o-mini”),
tools: [fetchRepoContributorsTool],
});

const supervisor = new Agent({
name: “Supervisor”,
description: “Coordinates data gathering and analysis”,
llm: new VercelAIProvider(),
model: openai(“gpt-4o-mini”),
subAgents: [starsFetcher, contributorsFetcher],
});

new VoltAgent({ agents: { supervisor } });

Code Source

In this setup, when a user inputs a repository URL, the supervisor routes the request to each sub-agent in turn, gathers their outputs, and synthesizes a final report, demonstrating VoltAgent’s ability to structure multi-step AI pipelines with minimal boilerplate.

Observability and Telemetry Integration

Production‑grade AI systems require more than code; they demand visibility into runtime behavior, performance metrics, and error conditions. VoltAgent’s observability suite includes integrations with popular platforms like Langfuse, enabling automated export of telemetry data:

Copy CodeCopiedUse a different Browserimport { VoltAgent } from “@voltagent/core”;
import { LangfuseExporter } from “langfuse-vercel”;

export const volt = new VoltAgent({
telemetry: {
serviceName: “ai”,
enabled: true,
export: {
type: “custom”,
exporter: new LangfuseExporter({
publicKey: process.env.LANGFUSE_PUBLIC_KEY,
secretKey: process.env.LANGFUSE_SECRET_KEY,
baseUrl: process.env.LANGFUSE_BASEURL,
}),
},
},
});

Code Source

This configuration wraps all agent interactions with metrics and traces, which are sent to Langfuse for real-time dashboards, alerting, and historical analysis, equipping teams to maintain service-level agreements (SLAs) and quickly diagnose issues in AI-driven workflows.

VoltAgent’s versatility empowers a broad spectrum of applications:

Customer Support Automation: Agents that retrieve order status, process returns, and escalate complex issues to human reps, all while maintaining conversational context.  

Intelligent Data Pipelines: Agents orchestrate data extraction from APIs, transform records, and push results to business intelligence dashboards, fully automated and monitored.  

DevOps Assistants: Agents that analyze CI/CD logs, suggest optimizations, and even trigger remediation scripts via secure tool calls.  

Voice‑Enabled Interfaces: Deploy agents in kiosks or mobile apps that listen to user queries and respond with synthesized speech, enhanced by memory for personalized experiences.  

RAG Systems: Agents that first retrieve domain‑specific documents (e.g., legal contracts, technical manuals) and then generate precise answers, blending vector search with LLM generation.  

Enterprise Integration: Workflow agents that coordinate across Slack, Salesforce, and internal databases, automating cross‑departmental processes with full audit trails.

By abstracting common patterns, tool invocation, memory, multi‑agent coordination, and observability, VoltAgent reduces integration time from weeks to days, making it a powerful choice for teams seeking to infuse AI across products and services.

In conclusion, VoltAgent reimagines AI agent development by offering a structured yet flexible framework that scales from single-agent prototypes to enterprise-level multi-agent systems. Its modular architecture, with a robust core, rich ecosystem packages, and observability tooling, allows developers to focus on domain logic rather than plumbing. Whether you’re building a chat assistant, automating complex workflows, or integrating AI into existing applications, VoltAgent provides the speed, maintainability, and control you need to bring sophisticated AI solutions to production quickly. By combining easy onboarding via ‘create-voltagent-app’, manual configuration options for power users, and deep extensibility through tools and memory providers, VoltAgent positions itself as the definitive TypeScript framework for AI agent orchestration, helping teams deliver intelligent applications with confidence and speed.

Sources

https://voltagent.dev/docs/ 

https://github.com/VoltAgent/voltagent?tab=readme-ov-file

The post Meet VoltAgent: A TypeScript AI Framework for Building and Orchestrating Scalable AI Agents appeared first on MarkTechPost.

Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing

Diffusion Transformers have demonstrated outstanding performance in image generation tasks, surpassing traditional models, including GANs and autoregressive architectures. They operate by gradually adding noise to images during a forward diffusion process and then learning to reverse this process through denoising, which helps the model approximate the underlying data distribution. Unlike the commonly used UNet-based diffusion models, Diffusion Transformers apply the transformer architecture, which has proven effective after sufficient training. However, their training process is slow and computationally intensive. A key limitation lies in their architecture: during each denoising step, the model must balance encoding low-frequency semantic information while simultaneously decoding high-frequency details using the same modules—this creates an optimization conflict between the two tasks.

To address the slow training and performance bottlenecks, recent work has focused on improving the efficiency of Diffusion Transformers through various strategies. These include utilizing optimized attention mechanisms, such as linear and sparse attention, to reduce computational costs, and introducing more effective sampling techniques, including log-normal resampling and loss reweighting, to stabilize the learning process. Additionally, methods like REPA, RCG, and DoD incorporate domain-specific inductive biases, while masked modeling enforces structured feature learning, boosting the model’s reasoning capabilities. Models like DiT, SiT, SD3, Lumina, and PixArt have extended the diffusion transformer framework to advanced areas such as text-to-image and text-to-video generation. 

Researchers from Nanjing University and ByteDance Seed Vision introduce the Decoupled Diffusion Transformer (DDT), which separates the model into a dedicated condition encoder for semantic extraction and a velocity decoder for detailed generation. This decoupled design enables faster convergence and improved sample quality. On the ImageNet 256×256 and 512×512 benchmarks, their DDT-XL/2 model achieves state-of-the-art FID scores of 1.31 and 1.28, respectively, with up to 4× faster training. To further accelerate inference, they propose a statistical dynamic programming method that optimally shares encoder outputs across denoising steps with minimal impact on performance.

The DDT introduces a condition encoder and a velocity decoder to handle low- and high-frequency components in image generation separately. The encoder extracts semantic features (zt) from noisy inputs, timesteps, and class labels, which are then used by the decoder to estimate the velocity field. To ensure consistency of zt across steps, representation alignment and decoder supervision are applied. During inference, a shared self-condition mechanism reduces computation by reusing zt at certain timesteps. A dynamic programming approach identifies the optimal timesteps for recomputing zt, minimizing performance loss while accelerating the sampling process.

The researchers trained their models on 256×256 ImageNet using a batch size of 256 without gradient clipping or warm-up. Using VAE-ft-EMA and Euler sampling, they evaluated performance using FID, sFID, IS, Precision, and Recall. They built improved baselines with SwiGLU, RoPE, RMSNorm, and lognorm sampling. Their DDT models consistently outperformed prior baselines, particularly in larger sizes, and converged significantly faster than REPA. Further gains were achieved through encoder sharing strategies and careful tuning of the encoder-decoder ratio, resulting in state-of-the-art FID scores on both 256×256 and 512×512 ImageNet.
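As a purely illustrative sketch (not the authors’ code), the decoupled denoising step described above can be pictured as two modules with distinct responsibilities; the module names and signatures here are hypothetical.

# Conceptual sketch of one decoupled denoising step (hypothetical module names).
def ddt_step(x_noisy, t, class_label, condition_encoder, velocity_decoder, z_cached=None):
    # The condition encoder extracts low-frequency semantic features z_t; on some
    # timesteps it can be skipped by reusing a cached z_t (encoder sharing).
    z_t = condition_encoder(x_noisy, t, class_label) if z_cached is None else z_cached
    # The velocity decoder focuses on high-frequency detail and predicts the velocity field.
    v_pred = velocity_decoder(x_noisy, t, z_t)
    return v_pred, z_t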

In conclusion, the study presents the DDT, which addresses the optimization challenge in traditional diffusion transformers by separating semantic encoding and high-frequency decoding into distinct modules. By scaling encoder capacity relative to the decoder, DDT achieves notable performance gains, especially in larger models. The DDT-XL/2 model sets new benchmarks on ImageNet, achieving faster training convergence and lower FID scores for both 256×256 and 512×512 resolutions. Additionally, the decoupled design enables encoder sharing across denoising steps, significantly improving inference efficiency. A dynamic programming strategy further enhances this by determining optimal sharing points, maintaining image quality while reducing computational load.

Check out the Paper.

The post Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing appeared first on MarkTechPost.

A Coding Guide to Build an Agentic AI‑Powered Asynchronous Ticketing Assistant Using PydanticAI Agents, Pydantic v2, and SQLite Database

In this tutorial, we’ll build an end‑to‑end ticketing assistant powered by Agentic AI using the PydanticAI library. We’ll define our data rules with Pydantic v2 models, store tickets in an in‑memory SQLite database, and generate unique identifiers with Python’s uuid module. Behind the scenes, two agents, one for creating tickets and one for checking status, leverage Google Gemini (via PydanticAI’s google-gla provider) to interpret your natural‑language prompts and call our custom database functions. The result is a clean, type‑safe workflow you can run immediately in Colab.

Copy CodeCopiedUse a different Browser!pip install –upgrade pip
!pip install pydantic-ai

First, these two commands update your pip installer to the latest version, bringing in new features and security patches, and then install PydanticAI. This library enables the definition of type-safe AI agents and the integration of Pydantic models with LLMs.

Copy CodeCopiedUse a different Browserimport os
from getpass import getpass

if “GEMINI_API_KEY” not in os.environ:
os.environ[“GEMINI_API_KEY”] = getpass(“Enter your Google Gemini API key: “)

We check whether the GEMINI_API_KEY environment variable is already set. If not, we securely prompt you (without echoing) to enter your Google Gemini API key at runtime, then store it in os.environ so that your Agentic AI calls can authenticate automatically.

Copy CodeCopiedUse a different Browser!pip install nest_asyncio

We install the nest_asyncio package, which lets you patch the existing asyncio event loop so that you can call async functions (or use .run_sync()) inside environments like Colab without running into “event loop already running” errors.
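After installing the package, the patch still needs to be applied in the notebook; a minimal example is shown below.

import nest_asyncio

# Patch the already-running event loop so async agent calls work inside Colab/Jupyter.
nest_asyncio.apply()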

Copy CodeCopiedUse a different Browserimport sqlite3
import uuid
from dataclasses import dataclass
from typing import Literal

from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext

We bring in Python’s sqlite3 for our in‑memory database and uuid to generate unique ticket IDs, use dataclass and Literal for clear dependency and type definitions, and load Pydantic’s BaseModel/Field for enforcing data schemas alongside Agent and RunContext from PydanticAI to wire up and run our conversational agents.

Copy CodeCopiedUse a different Browserconn = sqlite3.connect(“:memory:”)
conn.execute(“””
CREATE TABLE tickets (
ticket_id TEXT PRIMARY KEY,
summary TEXT NOT NULL,
severity TEXT NOT NULL,
department TEXT NOT NULL,
status TEXT NOT NULL
)
“””)
conn.commit()

We set up an in‑memory SQLite database and define a tickets table with columns for ticket_id, summary, severity, department, and status, then commit the schema so you have a lightweight, transient store for managing your ticket records.

Copy CodeCopiedUse a different Browser@dataclass
class TicketingDependencies:
“””Carries our DB connection into system prompts and tools.”””
db: sqlite3.Connection

class CreateTicketOutput(BaseModel):
ticket_id: str = Field(…, description=”Unique ticket identifier”)
summary: str = Field(…, description=”Text summary of the issue”)
severity: Literal[“low”,”medium”,”high”] = Field(…, description=”Urgency level”)
department: str = Field(…, description=”Responsible department”)
status: Literal[“open”] = Field(“open”, description=”Initial ticket status”)

class TicketStatusOutput(BaseModel):
ticket_id: str = Field(…, description=”Unique ticket identifier”)
status: Literal[“open”,”in_progress”,”resolved”] = Field(…, description=”Current ticket status”)

Here, we define a simple TicketingDependencies dataclass to pass our SQLite connection into each agent call, and then declare two Pydantic models: CreateTicketOutput (with fields for ticket ID, summary, severity, department, and default status “open”) and TicketStatusOutput (with ticket ID and its current status). These models enforce a clear, validated structure on everything our agents return, ensuring you always receive well-formed data.

Copy CodeCopiedUse a different Browsercreate_agent = Agent(
“google-gla:gemini-2.0-flash”,
deps_type=TicketingDependencies,
output_type=CreateTicketOutput,
system_prompt=”You are a ticketing assistant. Use the `create_ticket` tool to log new issues.”
)

@create_agent.tool
async def create_ticket(
ctx: RunContext[TicketingDependencies],
summary: str,
severity: Literal[“low”,”medium”,”high”],
department: str
) -> CreateTicketOutput:
“””
Logs a new ticket in the database.
“””
tid = str(uuid.uuid4())
ctx.deps.db.execute(
“INSERT INTO tickets VALUES (?,?,?,?,?)”,
(tid, summary, severity, department, “open”)
)
ctx.deps.db.commit()
return CreateTicketOutput(
ticket_id=tid,
summary=summary,
severity=severity,
department=department,
status=”open”
)

We create a PydanticAI Agent named ‘create_agent’ that’s wired to Google Gemini and is aware of our SQLite connection (deps_type=TicketingDependencies) and output schema (CreateTicketOutput). The @create_agent.tool decorator then registers an async create_ticket function, which generates a UUID, inserts a new row into the tickets table, and returns a validated CreateTicketOutput object.

Copy CodeCopiedUse a different Browserstatus_agent = Agent(
“google-gla:gemini-2.0-flash”,
deps_type=TicketingDependencies,
output_type=TicketStatusOutput,
system_prompt=”You are a ticketing assistant. Use the `get_ticket_status` tool to retrieve current status.”
)

@status_agent.tool
async def get_ticket_status(
ctx: RunContext[TicketingDependencies],
ticket_id: str
) -> TicketStatusOutput:
“””
Fetches the ticket status from the database.
“””
cur = ctx.deps.db.execute(
“SELECT status FROM tickets WHERE ticket_id = ?”, (ticket_id,)
)
row = cur.fetchone()
if not row:
raise ValueError(f”No ticket found for ID {ticket_id!r}”)
return TicketStatusOutput(ticket_id=ticket_id, status=row[0])

We set up a second PydanticAI Agent, status_agent, also using the Google Gemini provider and our shared TicketingDependencies. It registers an async get_ticket_status tool that looks up a given ticket_id in the SQLite database and returns a validated TicketStatusOutput, or raises an error if the ticket isn’t found.

Copy CodeCopiedUse a different Browserdeps = TicketingDependencies(db=conn)

create_result = await create_agent.run(
“My printer on 3rd floor shows a paper jam error.”, deps=deps
)

print(“Created Ticket →”)
print(create_result.output.model_dump_json(indent=2))

tid = create_result.output.ticket_id
status_result = await status_agent.run(
f”What’s the status of ticket {tid}?”, deps=deps
)

print(“Ticket Status →”)
print(status_result.output.model_dump_json(indent=2))

Finally, we package our SQLite connection into deps, ask the create_agent to log a new ticket via a natural‑language prompt, and print the validated ticket data as JSON. We then take the returned ticket_id, query the status_agent for that ticket’s current state, and print the status in JSON form.
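If you prefer to run the same flow outside a notebook, where top-level await isn’t available, the calls can be made synchronously with run_sync. The following is a minimal sketch that assumes the create_agent, status_agent, and deps objects defined earlier.

# Synchronous variant for plain Python scripts (assumes create_agent, status_agent, deps exist).
create_result = create_agent.run_sync(
    "My printer on 3rd floor shows a paper jam error.", deps=deps
)
tid = create_result.output.ticket_id

status_result = status_agent.run_sync(f"What's the status of ticket {tid}?", deps=deps)
print(status_result.output.model_dump_json(indent=2))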

In conclusion, you have seen how Agentic AI and PydanticAI work together to automate a complete service process, from logging a new issue to retrieving its live status, all managed through conversational prompts. Our use of Pydantic v2 ensures every ticket matches the schema you define, while SQLite provides a lightweight backend that’s easy to replace with any database. With these tools in place, you can expand the assistant, adding new agent functions, integrating other AI models like openai:gpt-4o, or connecting real‑world APIs, confident that your data remains structured and reliable throughout.

Here is the Colab Notebook.

The post A Coding Guide to Build an Agentic AI‑Powered Asynchronous Ticketing Assistant Using PydanticAI Agents, Pydantic v2, and SQLite Database appeared first on MarkTechPost.

Supercharge your LLM performance with Amazon SageMaker Large Model Inf …

Today, we’re excited to announce the launch of Amazon SageMaker Large Model Inference (LMI) container v15, powered by vLLM 0.8.4 with support for the vLLM V1 engine. This version now supports the latest open-source models, such as Meta’s Llama 4 models Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI, DeepSeek-R1, and many more. Amazon SageMaker AI continues to evolve its generative AI inference capabilities to meet the growing demands in performance and model support for foundation models (FMs).
This release introduces significant performance improvements, expanded model compatibility with multimodality (that is, the ability to understand and analyze text-to-text, image-to-text, and text-to-image data), and provides built-in integration with vLLM to help you seamlessly deploy and serve large language models (LLMs) with the highest performance at scale.
What’s new?
LMI v15 brings several enhancements that improve throughput, latency, and usability:

An async mode that directly integrates with vLLM’s AsyncLLMEngine for improved request handling. This mode creates a more efficient background loop that continuously processes incoming requests, enabling it to handle multiple concurrent requests and stream outputs with higher throughput than the previous Rolling-Batch implementation in v14.
Support for the vLLM V1 engine, which delivers up to 111% higher throughput compared to the previous V0 engine for smaller models at high concurrency. This performance improvement comes from reduced CPU overhead, optimized execution paths, and more efficient resource utilization in the V1 architecture. LMI v15 supports both the V1 and V0 engines, with V1 being the default. If you need the V0 engine, you can select it by specifying VLLM_USE_V1=0. The vLLM V1 engine also comes with a core re-architecture of the serving engine, with simplified scheduling, zero-overhead prefix caching, clean tensor-parallel inference, efficient input preparation, and advanced optimizations with torch.compile and Flash Attention 3. For more information, see the vLLM Blog.
Expanded API schema support with three flexible options to allow seamless integration with applications built on popular API patterns:

Message format compatible with the OpenAI Chat Completions API.
OpenAI Completions format.
Text Generation Inference (TGI) schema to support backward compatibility with older models.

Multimodal support, with enhanced capabilities for vision-language models including optimizations such as multimodal prefix caching
Built-in support for function calling and tool calling, enabling sophisticated agent-based workflows.

Enhanced model support
LMI v15 supports an expanding roster of state-of-the-art models, including the latest releases from leading model providers. The container offers ready-to-deploy compatibility for models including, but not limited to, the following:

Llama 4 – Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E-Instruct
Gemma 3 – Google’s lightweight and efficient models, known for their strong performance despite smaller size
Qwen 2.5 – Alibaba’s advanced models including QwQ 2.5 and Qwen2-VL with multimodal capabilities
Mistral AI models – High-performance models from Mistral AI that offer efficient scaling and specialized capabilities
DeepSeek-R1/V3 – State of the art reasoning models

Each model family can be deployed using the LMI v15 container by specifying the appropriate model ID, for example, meta-llama/Llama-4-Scout-17B-16E, and configuration parameters as environment variables, without requiring custom code or optimization work.
Benchmarks
Our benchmarks demonstrate the performance advantages of LMI v15’s V1 engine compared to previous versions:

| # | Model | Batch size | Instance type | LMI v14 throughput [tokens/s] (V0 engine) | LMI v15 throughput [tokens/s] (V1 engine) | Improvement |
|---|-------|------------|---------------|-------------------------------------------|-------------------------------------------|-------------|
| 1 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 128 | p4d.24xlarge | 1768 | 2198 | 24% |
| 2 | meta-llama/Llama-3.1-8B-Instruct | 64 | ml.g6e.2xlarge | 1548 | 2128 | 37% |
| 3 | mistralai/Mistral-7B-Instruct-v0.3 | 64 | ml.g6e.2xlarge | 942 | 1988 | 111% |

DeepSeek-R1 Llama 70B for various levels of concurrency

Llama 3.1 8B Instruct for various levels of concurrency

Mistral 7B for various levels of concurrency

The async engine in LMI v15 shows its strength in high-concurrency scenarios, where multiple simultaneous requests benefit from the optimized request handling. These benchmarks highlight that the V1 engine in async mode delivers between 24% and 111% higher throughput compared to LMI v14 using rolling batch for the models tested in high-concurrency scenarios with batch sizes of 64 and 128. We suggest keeping the following considerations in mind for optimal performance:

Higher batch sizes increase concurrency but come with a natural tradeoff in terms of latency
Batch sizes of 4 and 8 provide the best latency for most use cases
Batch sizes up to 64 and 128 achieve maximum throughput with acceptable latency trade-offs

API formats
LMI v15 supports three API schemas: OpenAI Chat Completions, OpenAI Completions, and TGI.

Chat Completions – Message format is compatible with OpenAI Chat Completions API. Use this schema for tool calling, reasoning, and multimodal use cases. Here is a sample of the invocation with the Messages API:

body = {
“messages”: [
{“role”: “user”, “content”: “Name popular places to visit in London?”}
],
“temperature”: 0.9,
“max_tokens”: 256,
“stream”: True,
}

OpenAI Completions format – a legacy schema; the Completions API is no longer receiving updates:

body = {
“prompt”: “Name popular places to visit in London?”,
“temperature”: 0.9,
“max_tokens”: 256,
“stream”: True,
}

TGI – Supports backward compatibility with older models:

body = {
“inputs”: “Name popular places to visit in London?”,
“parameters”: {
“max_new_tokens”: 256,
“temperature”: 0.9,
},
“stream”: True,
}

Getting started with LMI v15
Getting started with LMI v15 is seamless, and you can deploy with LMI v15 in only a few lines of code. The container is available through Amazon Elastic Container Registry (Amazon ECR), and deployments can be managed through SageMaker AI endpoints. To deploy models, you need to specify the Hugging Face model ID, instance type, and configuration options as environment variables.
For optimal performance, we recommend the following instances:

Llama 4 Scout: ml.p5.48xlarge
DeepSeek R1/V3: ml.p5e.48xlarge
Qwen 2.5 VL-32B: ml.g5.12xlarge
Qwen QwQ 32B: ml.g5.12xlarge
Mistral Large: ml.g6e.48xlarge
Gemma3-27B: ml.g5.12xlarge
Llama 3.3-70B: ml.p4d.24xlarge

To deploy with LMI v15, follow these steps:

Clone the notebook to your Amazon SageMaker Studio notebook or to Visual Studio Code (VS Code). You can then run the notebook to do the initial setup and deploy the model from the Hugging Face repository to the SageMaker AI endpoint. We walk through the key blocks here.
LMI v15 maintains the same configuration pattern as previous versions, using environment variables in the form OPTION_<CONFIG_NAME>. This consistent approach makes it straightforward for users familiar with earlier LMI versions to migrate to v15.

vllm_config = {
“HF_MODEL_ID”: “meta-llama/Llama-4-Scout-17B-16E”,
“HF_TOKEN”: “entertoken”,
“OPTION_MAX_MODEL_LEN”: “250000”,
“OPTION_MAX_ROLLING_BATCH_SIZE”: “8”,
“OPTION_MODEL_LOADING_TIMEOUT”: “1500”,
“SERVING_FAIL_FAST”: “true”,
“OPTION_ROLLING_BATCH”: “disable”,
“OPTION_ASYNC_MODE”: “true”,
“OPTION_ENTRYPOINT”: “djl_python.lmi_vllm.vllm_async_service”
}

HF_MODEL_ID sets the model id from Hugging Face. You can also download model from Amazon Simple Storage Service (Amazon S3).
HF_TOKEN sets the token to download the model. This is required for gated models like Llama-4
OPTION_MAX_MODEL_LEN sets the maximum model context length.
OPTION_MAX_ROLLING_BATCH_SIZE sets the batch size for the model.
OPTION_MODEL_LOADING_TIMEOUT sets the timeout value for SageMaker to load the model and run health checks.
SERVING_FAIL_FAST=true. We recommend setting this flag because it allows SageMaker to gracefully restart the container when an unrecoverable engine error occurs.
OPTION_ROLLING_BATCH=disable turns off the rolling batch implementation of LMI, which was the default offering in LMI v14. We recommend using async mode instead, because this newer implementation provides better performance.
OPTION_ASYNC_MODE=true enables async mode.
OPTION_ENTRYPOINT provides the entrypoint for vLLM’s async integrations

Set the latest container (in this example we used 0.33.0-lmi15.0.0-cu128), AWS Region (us-east-1), and create a model artifact with all the configurations. To review the latest available container version, see Available Deep Learning Containers Images.
Deploy the model to the endpoint using model.deploy().

CONTAINER_VERSION = ‘0.33.0-lmi15.0.0-cu128’
REGION = ‘us-east-1′
# Construct container URI
container_uri = f’763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}’

# Select instance type
instance_type = “ml.p5.48xlarge”

model = Model(image_uri=container_uri,
role=role,
env=vllm_config)
endpoint_name = sagemaker.utils.name_from_base(“Llama-4″)

print(endpoint_name)
model.deploy(
initial_instance_count=1,
instance_type=instance_type,
endpoint_name=endpoint_name,
container_startup_health_check_timeout = 1800
)

Invoke the model. SageMaker inference provides two APIs to invoke the model: InvokeEndpoint and InvokeEndpointWithResponseStream. You can choose either option based on your needs; a non-streaming example follows the streaming snippet below.

# Create SageMaker Runtime client
smr_client = boto3.client(‘sagemaker-runtime’)
##Add your endpoint here
endpoint_name = ”

# Invoke with messages format
body = {
“messages”: [
{“role”: “user”, “content”: “Name popular places to visit in London?”}
],
“temperature”: 0.9,
“max_tokens”: 256,
“stream”: True,
}

# Invoke with endpoint streaming
resp = smr_client.invoke_endpoint_with_response_stream(
EndpointName=endpoint_name,
Body=json.dumps(body),
ContentType=”application/json”,
)
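
For the non-streaming option mentioned earlier, a minimal sketch with InvokeEndpoint looks like the following; it reuses the same body and client, with streaming turned off.

# Non-streaming invocation of the same endpoint
body["stream"] = False

resp = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)
print(json.loads(resp["Body"].read()))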

To run multi-modal inference with Llama-4 Scout, see the notebook for the full code sample to run inference requests with images.
Conclusion
Amazon SageMaker LMI container v15 represents a significant step forward in large model inference capabilities. With the new vLLM V1 engine, async operating mode, expanded model support, and optimized performance, you can deploy cutting-edge LLMs with greater performance and flexibility. The container’s configurable options give you the flexibility to fine-tune deployments for your specific needs, whether optimizing for latency, throughput, or cost.
We encourage you to explore this release for deploying your generative AI models.
Check out the provided example notebooks to start deploying models with LMI v15.

About the authors
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focuses on building solutions for large model inference. Prior to AWS, he worked in the Amazon Grocery org building new payment features for customers worldwide. Outside of work, he enjoys skiing, the outdoors, and watching sports.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

Accuracy evaluation framework for Amazon Q Business – Part 2

In the first post of this series, we introduced a comprehensive evaluation framework for Amazon Q Business, a fully managed Retrieval Augmented Generation (RAG) solution that uses your company’s proprietary data without the complexity of managing large language models (LLMs). The first post focused on selecting appropriate use cases, preparing data, and implementing metrics to support a human-in-the-loop evaluation process.
In this post, we dive into the solution architecture necessary to implement this evaluation framework for your Amazon Q Business application. We explore two distinct evaluation solutions:

Comprehensive evaluation workflow – This ready-to-deploy solution uses AWS CloudFormation stacks to set up an Amazon Q Business application, complete with user access, a custom UI for review and evaluation, and the supporting evaluation infrastructure
Lightweight AWS Lambda based evaluation – Designed for users with an existing Amazon Q Business application, this streamlined solution employs an AWS Lambda function to efficiently assess the application’s accuracy

By the end of this post, you will have a clear understanding of how to implement an evaluation framework that aligns with your specific needs with a detailed walkthrough, so your Amazon Q Business application delivers accurate and reliable results.
Challenges in evaluating Amazon Q Business
Evaluating the performance of Amazon Q Business, which uses a RAG model, presents several challenges due to its integration of retrieval and generation components. It’s crucial to identify which aspects of the solution need evaluation. For Amazon Q Business, both the retrieval accuracy and the quality of the answer output are important factors to assess. In this section, we discuss key metrics that need to be included for a RAG generative AI solution.
Context recall
Context recall measures the extent to which all relevant content is retrieved. High recall provides comprehensive information gathering but might introduce extraneous data.
For example, a user might ask the question “What can you tell me about the geography of the United States?” They could get the following responses:

Expected: The United States is the third-largest country in the world by land area, covering approximately 9.8 million square kilometers. It has a diverse range of geographical features.
High context recall: The United States spans approximately 9.8 million square kilometers, making it the third-largest nation globally by land area. The country’s geography is incredibly diverse, featuring the Rocky Mountains stretching from New Mexico to Alaska, the Appalachian Mountains along the eastern states, the expansive Great Plains in the central region, and arid deserts like the Mojave in the southwest.
Low context recall: The United States features significant geographical landmarks. Additionally, the country is home to unique ecosystems like the Everglades in Florida, a vast network of wetlands.

The following diagram illustrates the context recall workflow.

Context precision
Context precision assesses the relevance and conciseness of retrieved information. High precision indicates that the retrieved information closely matches the query intent, reducing irrelevant data.
For example, “Why is Silicon Valley great for tech startups?” might give the following answers:

Ground truth answer: Silicon Valley is famous for fostering innovation and entrepreneurship in the technology sector.
High precision context: Many groundbreaking startups originate from Silicon Valley, benefiting from a culture that encourages innovation and risk-taking.
Low precision context: Silicon Valley experiences a Mediterranean climate, with mild, wet winters and warm, dry summers, contributing to its appeal as a place to live and work.

The following diagram illustrates the context precision workflow.

Answer relevancy
Answer relevancy evaluates whether responses fully address the query without unnecessary details. Relevant answers enhance user satisfaction and trust in the system.
For example, a user might ask the question “What are the key features of Amazon Q Business Service, and how can it benefit enterprise customers?” They could get the following answers:

High relevance answer: Amazon Q Business Service is a RAG generative AI solution designed for enterprise use. Key features include a fully managed generative AI solution, integration with enterprise data sources, robust security protocols, and customizable virtual assistants. It benefits enterprise customers by enabling efficient information retrieval, automating customer support tasks, enhancing employee productivity through quick access to data, and providing insights through analytics on user interactions.
Low relevance answer: Amazon Q Business Service is part of Amazon’s suite of cloud services. Amazon also offers online shopping and streaming services.

The following diagram illustrates the answer relevancy workflow.

Truthfulness
Truthfulness verifies factual accuracy by comparing responses to verified sources. Truthfulness is crucial to maintain the system’s credibility and reliability.
For example, a user might ask “What is the capital of Canada?” They could get the following responses:

Context: Canada’s capital city is Ottawa, located in the province of Ontario. Ottawa is known for its historic Parliament Hill, the center of government, and the scenic Rideau Canal, a UNESCO World Heritage site
High truthfulness answer: The capital of Canada is Ottawa
Low truthfulness answer: The capital of Canada is Toronto

The following diagram illustrates the truthfulness workflow.

Evaluation methods
Deciding on who should conduct the evaluation can significantly impact results. Options include:

Human-in-the-Loop (HITL) – Human evaluators manually assess the accuracy and relevance of responses, offering nuanced insights that automated systems might miss. However, it is a slow process and difficult to scale.
LLM-aided evaluation – Automated methods, such as the Ragas framework, use language models to streamline the evaluation process (a brief Ragas example follows below). However, these might not fully capture the complexities of domain-specific knowledge.

Each of these preparatory and evaluative steps contributes to a structured approach to evaluating the accuracy and effectiveness of Amazon Q Business in supporting enterprise needs.
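As a hedged illustration of the LLM-aided option, a small Ragas evaluation run might look like the following sketch. The exact metric names, dataset columns, and evaluator LLM and embeddings configuration (the solution in this post uses Amazon Bedrock models for scoring) depend on the Ragas version you install, so treat this as a starting point rather than the solution’s implementation.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One evaluation record: the user prompt, the generated answer,
# the retrieved chunks, and the ground truth reference.
data = {
    "question": ["What data encryption does Amazon Q Business support?"],
    "answer": ["Amazon Q Business encrypts data at rest and in transit."],
    "contexts": [["Amazon Q Business encrypts your data at rest and in transit ..."]],
    "ground_truth": ["Amazon Q Business supports encryption at rest and in transit."],
}

# evaluate() also needs an evaluator LLM and embeddings configured for scoring.
scores = evaluate(
    Dataset.from_dict(data),
    metrics=[context_recall, context_precision, answer_relevancy, faithfulness],
)
print(scores)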
Solution overview
In this post, we explore two different solutions to provide you the details of an evaluation framework, so you can use it and adapt it for your own use case.
Solution 1: End-to-end evaluation solution
For a quick start evaluation framework, this solution uses a hybrid approach with Ragas (automated scoring) and HITL evaluation for robust accuracy and reliability. The architecture includes the following components:

User access and UI – Authenticated users interact with a frontend UI to upload datasets, review Ragas output, and provide human feedback
Evaluation solution infrastructure – Core components include:

Amazon DynamoDB to store data
Amazon Simple Queue Service (Amazon SQS) and Lambda to manage processing and scoring with Ragas

Ragas scoring – Automated metrics provide an initial layer of evaluation
HITL review – Human evaluators refine Ragas scores through the UI, providing nuanced accuracy and reliability

By integrating a metric-based approach with human validation, this architecture makes sure Amazon Q Business delivers accurate, relevant, and trustworthy responses for enterprise users. This solution further enhances the evaluation process by incorporating HITL reviews, enabling human feedback to refine automated scores for higher precision.
A quick video demo of this solution is shown below:

Solution architecture
The solution architecture is designed with the following core functionalities to support an evaluation framework for Amazon Q Business:

User access and UI – Users authenticate through Amazon Cognito, and upon successful login, interact with a Streamlit-based custom UI. This frontend allows users to upload CSV datasets to Amazon Simple Storage Service (Amazon S3), review Ragas evaluation outputs, and provide human feedback for refinement. The application exchanges the Amazon Cognito token for an AWS IAM Identity Center token, granting scoped access to Amazon Q Business.
UI infrastructure – The UI is hosted behind an Application Load Balancer, supported by Amazon Elastic Compute Cloud (Amazon EC2) instances running in an Auto Scaling group for high availability and scalability.
Upload dataset and trigger evaluation – Users upload a CSV file containing queries and ground truth answers to Amazon S3, which triggers an evaluation process. A Lambda function reads the CSV, stores its content in a DynamoDB table, and initiates further processing through a DynamoDB stream.
Consuming DynamoDB stream – A separate Lambda function processes new entries from the DynamoDB stream and publishes messages to an SQS queue, which serves as a trigger for the evaluation Lambda function (a minimal sketch of this step follows the architecture overview).
Ragas scoring – The evaluation Lambda function consumes SQS messages, sending queries (prompts) to Amazon Q Business for generating answers. It then evaluates the prompt, ground truth, and generated answer using the Ragas evaluation framework. Ragas computes automated evaluation metrics such as context recall, context precision, answer relevancy, and truthfulness. The results are stored in DynamoDB and visualized in the UI.

HITL review – Authenticated users can review and refine RAGAS scores directly through the UI, providing nuanced and accurate evaluations by incorporating human insights into the process.
This architecture uses AWS services to deliver a scalable, secure, and efficient evaluation solution for Amazon Q Business, combining automated and human-driven evaluations.
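The following is a hypothetical sketch of the stream-consumer step referenced above: a Lambda function that forwards newly inserted DynamoDB records to the SQS queue that triggers the Ragas scoring function. The queue URL environment variable and record attribute names are placeholders, chosen here to match the prompt and ground_truth columns of the uploaded CSV.

import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["EVAL_QUEUE_URL"]  # placeholder


def handler(event, context):
    # Forward each newly inserted evaluation record to the scoring queue.
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        message = {
            "prompt": new_image["prompt"]["S"],
            "ground_truth": new_image["ground_truth"]["S"],
        }
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))
    return {"processed": len(event["Records"])}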

Prerequisites
For this walkthrough, you should have the following prerequisites:

A Linux/Mac computer. If you are using a Windows computer, make sure you can run a Linux command line.
An AWS account. If you don’t have an AWS account, follow the instructions to create one, unless you have been provided event engine details.
A user role with administrator access (service access associated with this role can be constrained further when the workflow goes to production).
The AWS Command Line Interface (AWS CLI) installed and configured with your user credentials.
Model access enabled in Amazon Bedrock for Anthropic’s Claude 3 Sonnet model and the Amazon Titan Embedding G1 – Text model. The Ragas scoring needs access to these two LLMs hosted on Amazon Bedrock.

Additionally, make sure that all the resources you deploy are in the same AWS Region.
Deploy the CloudFormation stack
Complete the following steps to deploy the CloudFormation stack:

Clone the repository or download the files to your local computer.
Unzip the downloaded file (if you used this option).
Using your local computer command line, use the cd command to change into the ./sample-code-for-evaluating-amazon-q-business-applications-using-ragas-main/end-to-end-solution directory.
Make sure the ./deploy.sh script is executable by running chmod 755 ./deploy.sh.
Execute the CloudFormation deployment script provided as follows:

./deploy.sh -s [CNF_STACK_NAME] -r [AWS_REGION]

You can follow the deployment progress on the AWS CloudFormation console. It takes approximately 15 minutes to complete the deployment, after which you will see a page similar to the following screenshot.

Add users to Amazon Q Business
You need to provision users for the pre-created Amazon Q Business application. Refer to Setting up for Amazon Q Business for instructions to add users.
Upload the evaluation dataset through the UI
In this section, you review and upload the following CSV file containing an evaluation dataset through the deployed custom UI.
This CSV file contains two columns: prompt and ground_truth. There are four prompts and their associated ground truth in this dataset (a snippet for assembling such a file follows the list):

What are the index types of Amazon Q Business and the features of each?
I want to use Q Apps, which subscription tier is required to use Q Apps?
What is the file size limit for Amazon Q Business via file upload?
What data encryption does Amazon Q Business support?
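
If you want to assemble a file like this yourself, the following snippet shows the expected two-column layout; the ground_truth values are placeholders, not the reference answers shipped with the sample dataset.

import csv

# Placeholder ground truths: replace them with your own reference answers.
rows = [
    ("What are the index types of Amazon Q Business and the features of each?", "<ground truth answer>"),
    ("I want to use Q Apps, which subscription tier is required to use Q Apps?", "<ground truth answer>"),
    ("What is the file size limit for Amazon Q Business via file upload?", "<ground truth answer>"),
    ("What data encryption does Amazon Q Business support?", "<ground truth answer>"),
]

with open("evaluation_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "ground_truth"])
    writer.writerows(rows)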

To upload the evaluation dataset, complete the following steps:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose the evals stack that you already launched.
On the Outputs tab, take note of the user name and password to log in to the UI application, and choose the UI URL.

The custom UI will redirect you to the Amazon Cognito login page for authentication.

The UI application authenticates the user with Amazon Cognito, and initiates the token exchange workflow to implement a secure ChatSync API call with Amazon Q Business.

Use the credentials you noted earlier to log in.

For more information about the token exchange flow between IAM Identity Center and the identity provider (IdP), refer to Building a Custom UI for Amazon Q Business.

After you log in to the custom UI used for Amazon Q evaluation, choose Upload Dataset, then upload the dataset CSV file.

After the file is uploaded, the evaluation framework will send the prompt to Amazon Q Business to generate the answer, and then send the prompt, ground truth, and answer to Ragas to evaluate. During this process, you can also review the uploaded dataset (including the four questions and associated ground truth) on the Amazon Q Business console, as shown in the following screenshot.

After about 7 minutes, the workflow will finish, and you should see the evaluation result for the first question.

Perform HITL evaluation
After the Lambda function has completed its execution, Ragas scoring will be shown in the custom UI. Now you can review the metric scores generated using Ragas (an LLM-aided evaluation method), and you can provide human feedback as an evaluator for further calibration. This human-in-the-loop calibration can further improve the evaluation accuracy, because the HITL process is particularly valuable in fields where human judgment, expertise, or ethical considerations are crucial.
Let’s review the first question: “What are the index types of Amazon Q Business and the features of each?” You can read the question, Amazon Q Business generated answers, ground truth, and context.

Next, review the evaluation metrics scored by using Ragas. As discussed earlier, there are four metrics:

Answer relevancy – Measures relevancy of answers. Higher scores indicate better alignment with the user input, and lower scores are given if the response is incomplete or includes redundant information.
Truthfulness – Verifies factual accuracy by comparing responses to verified sources. Higher scores indicate a better consistency with verified sources.
Context precision – Assesses the relevance and conciseness of retrieved information. Higher scores indicate that the retrieved information closely matches the query intent, reducing irrelevant data.
Context recall – Measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out.

For this question, all metrics showed Amazon Q Business achieved a high-quality response. It’s worthwhile to compare your own evaluation with these scores generated by Ragas.

Next, let’s review a question that returned a low answer relevancy score. For example: “I want to use Q Apps, which subscription tier is required to use Q Apps?”

Analyzing both the question and the answer, we can consider the answer relevant and aligned with the user question, but the answer relevancy score from Ragas doesn’t reflect this human analysis, showing a lower score than expected. This is where human-in-the-loop calibration of the Ragas judgment matters: read the question and answer carefully, and adjust the metric score to reflect your analysis. Finally, the results will be updated in DynamoDB.

Lastly, save the metric score in the CSV file, and you can download and review the final metric scores.

Solution 2: Lambda-based evaluation
If you’re already using Amazon Q Business, AmazonQEvaluationLambda allows for quick integration of evaluation methods into your application without setting up a custom UI application. It offers the following key features:

Evaluates responses from Amazon Q Business using Ragas against a predefined test set of questions and ground truth data
Outputs evaluation metrics that can be visualized directly in Amazon CloudWatch

Both solutions provide results based on the input dataset and the responses from the Amazon Q Business application, using Ragas to evaluate four key evaluation metrics (context recall, context precision, answer relevancy, and truthfulness).

This solution provides sample code to evaluate the Amazon Q Business application response. To use this solution, you need to have or create a working Amazon Q Business application integrated with IAM Identity Center or Amazon Cognito as an IdP. This Lambda function works in the same way as the Lambda function in the end-to-end evaluation solution, using Ragas against a test set of questions and ground truth. This lightweight solution doesn’t have a custom UI, but it can provide the result metrics (context recall, context precision, answer relevancy, truthfulness) for visualization in CloudWatch. For deployment instructions, refer to the following GitHub repo.
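
As an illustration of the CloudWatch integration, a scoring Lambda could publish the Ragas metrics with a call along these lines; the namespace, dimension, and metric names below are assumptions for the sketch, not the ones emitted by AmazonQEvaluationLambda.

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_scores(scores, application_id):
    """Push Ragas metric scores (0-1 floats) to CloudWatch for dashboarding."""
    cloudwatch.put_metric_data(
        Namespace="AmazonQBusiness/Evaluation",   # assumed namespace
        MetricData=[
            {
                "MetricName": name,               # for example "ContextRecall", "AnswerRelevancy"
                "Dimensions": [{"Name": "ApplicationId", "Value": application_id}],
                "Value": float(value),
                "Unit": "None",
            }
            for name, value in scores.items()
        ],
    )
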
Using evaluation results to improve Amazon Q Business application accuracy
This section outlines strategies to enhance key evaluation metrics—context recall, context precision, answer relevance, and truthfulness—for a RAG solution in the context of Amazon Q Business.
Context recall
Let’s examine the following problems and troubleshooting tips:

Insufficient documents retrieved – Retrieved passages only partially cover the relevant topics, omitting essential information. This could result from document parsing errors or a ranking algorithm with a limited top-K selection. To address this, review the source attributions and context provided by Amazon Q Business answers to identify any gaps.

Aggressive query filtering – Overly strict search filters or metadata constraints might exclude relevant records. You should review the metadata filters or boosting settings applied in Amazon Q Business to make sure they don’t unnecessarily restrict results.
Data source ingestion errors – Documents from certain data sources aren’t successfully ingested into Amazon Q Business. To address this, check the document sync history report in Amazon Q Business to confirm successful ingestion and resolve ingestion errors.

Context precision
Consider the following potential issues:

Over-retrieval of documents – Large top-K values might retrieve semi-related or off-topic passages, which the LLM might incorporate unnecessarily. To address this, refine metadata filters or apply boosting to improve passage relevance and reduce noise in the retrieved context.

Poor query specificity – Broad or poorly formed user queries can yield loosely related results. You should make sure user queries are clear and specific. Train users or implement query refinement mechanisms to optimize query quality.

Answer relevance
Consider the following troubleshooting methods:

Partial coverage – Retrieved context addresses parts of the question but fails to cover all aspects, especially in multi-part queries. To address this, decompose complex queries into sub-questions. Instruct the LLM or a dedicated module to retrieve and answer each sub-question before composing the final response, as sketched after this list. For example:

Break down the query into sub-questions.
Retrieve relevant passages for each sub-question.
Compose a final answer addressing each part.

Context/answer mismatch – The LLM might misinterpret retrieved passages, omit relevant information, or merge content incorrectly due to hallucination. You can use prompt engineering to guide the LLM more effectively. For example, for the original query “What are the top 3 reasons for X?” you can use the rewritten prompt “List the top 3 reasons for X clearly labeled as #1, #2, and #3, based strictly on the retrieved context.”
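
As a rough sketch of the decomposition approach from the first item in this list, the following code asks a Bedrock model to split a multi-part question and then answers each part with the Amazon Q Business ChatSync API. The model choice and prompt are assumptions, and depending on how your application is configured you might need to pass identity context or a conversation ID to chat_sync.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
qbusiness = boto3.client("qbusiness")

def answer_multi_part_question(question, application_id):
    """Decompose a multi-part question, answer each part with Amazon Q Business,
    and stitch the parts into a final response."""
    # 1. Ask a Bedrock model to break the query into standalone sub-questions (one per line).
    decomposition = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model choice
        messages=[{"role": "user", "content": [{"text": (
            "Split this question into standalone sub-questions, one per line:\n" + question)}]}],
    )
    sub_questions = [
        line.strip()
        for line in decomposition["output"]["message"]["content"][0]["text"].splitlines()
        if line.strip()
    ]
    # 2. Retrieve and answer each sub-question separately.
    partial_answers = []
    for sub_question in sub_questions:
        reply = qbusiness.chat_sync(applicationId=application_id, userMessage=sub_question)
        partial_answers.append(sub_question + "\n" + reply["systemMessage"])
    # 3. Compose a final answer addressing each part.
    return "\n\n".join(partial_answers)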

Truthfulness
Consider the following:

Stale or inaccurate data sources – Outdated or conflicting information in the knowledge corpus might lead to incorrect answers. To address this, compare the retrieved context with verified sources to confirm accuracy. Collaborate with SMEs to validate the data.
LLM hallucination – The model might fabricate or embellish details, even with accurate retrieved context. Although Amazon Q Business is a RAG generative AI solution and should significantly reduce hallucinations, it’s not possible to eliminate them entirely. You can measure the frequency of low context precision answers to identify patterns and quantify the impact of hallucinations, gaining an aggregated view with the evaluation solution.

By systematically examining and addressing the root causes of low evaluation metrics, you can optimize your Amazon Q Business application. From document retrieval and ranking to prompt engineering and validation, these strategies will help enhance the effectiveness of your RAG solution.
Clean up
Don’t forget to go back to the AWS CloudFormation console and delete the CloudFormation stack to remove the underlying infrastructure that you set up, to avoid additional costs on your AWS account.
Conclusion
In this post, we outlined two evaluation solutions for Amazon Q Business: a comprehensive evaluation workflow and a lightweight Lambda-based evaluation. These approaches combine automated evaluation methods such as Ragas with human-in-the-loop validation, providing reliable and accurate assessments.
By using our guidance on how to improve evaluation metrics, you can continuously optimize your Amazon Q Business application to meet enterprise needs. Whether you’re using the end-to-end solution or the lightweight approach, these frameworks provide a scalable and efficient path to improve accuracy and relevance.
To learn more about Amazon Q Business and how to evaluate Amazon Q Business results, explore these hands-on workshops:

Evaluating Amazon Q Business applications to maximize business impact
Innovate on enterprise data with generative AI & Amazon Q Business application

About the authors
Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He focuses on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions on AWS. When not working, he enjoys cycling, hiking, and learning new things.
Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She is specialized in Generative AI, Applied Data Science and IoT architecture. Currently she is part of the Amazon Bedrock team, and a Gold member/mentor in Machine Learning Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. She is particularly passionate about leveraging Large Language Models for advanced data analytics and exploring practical applications that address real-world challenges.
Amit Gupta is a Senior Q Business Solutions Architect Solutions Architect at AWS. He is passionate about enabling customers with well-architected generative AI solutions at scale.
Neil Desai is a technology executive with over 20 years of experience in artificial intelligence (AI), data science, software engineering, and enterprise architecture. At AWS, he leads a team of Worldwide AI services specialist solutions architects who help customers build innovative Generative AI-powered solutions, share best practices with customers, and drive product roadmap. He is passionate about using technology to solve real-world problems and is a strategic thinker with a proven track record of success.
Ricardo Aldao is a Senior Partner Solutions Architect at AWS. He is a passionate AI/ML enthusiast who focuses on supporting partners in building generative AI solutions on AWS.

Use Amazon Bedrock Intelligent Prompt Routing for cost and latency ben …

In December, we announced the preview availability for Amazon Bedrock Intelligent Prompt Routing, which provides a single serverless endpoint to efficiently route requests between different foundation models within the same model family. To do this, Amazon Bedrock Intelligent Prompt Routing dynamically predicts the response quality of each model for a request and routes the request to the model it determines is most appropriate based on cost and response quality, as shown in the following figure.

Today, we’re happy to announce the general availability of Amazon Bedrock Intelligent Prompt Routing. Over the past several months, we drove several improvements in intelligent prompt routing based on customer feedback and extensive internal testing. Our goal is to enable you to set up automated, optimal routing between large language models (LLMs) through Amazon Bedrock Intelligent Prompt Routing and its deep understanding of model behaviors within each model family, which incorporates state-of-the-art methods for training routers for different sets of models, tasks and prompts.
In this blog post, we detail various highlights from our internal testing, how you can get started, and point out some caveats and best practices. We encourage you to incorporate Amazon Bedrock Intelligent Prompt Routing into your new and existing generative AI applications. Let’s dive in!
Highlights and improvements
Today, you can either use Amazon Bedrock Intelligent Prompt Routing with the default prompt routers provided by Amazon Bedrock, or configure your own prompt routers to adjust the performance trade-off between the two candidate LLMs. Default prompt routers are pre-configured routing systems that aim to match the performance of the more capable model while lowering costs by sending easier prompts to the less expensive one; Amazon Bedrock provides one for each model family. These routers come with predefined settings and are designed to work out of the box with specific foundation models, providing a straightforward, ready-to-use option without needing to configure any routing settings. Customers who tested Amazon Bedrock Intelligent Prompt Routing in preview (thank you!) could choose models in the Anthropic and Meta families. Today, you can choose more models from within the Amazon Nova, Anthropic, and Meta families, including:

Anthropic’s Claude family: Haiku, Sonnet 3.5 v1, Haiku 3.5, Sonnet 3.5 v2
Llama family: Llama 3.1 8B, Llama 3.1 70B, Llama 3.2 11B, Llama 3.2 90B, and Llama 3.3 70B
Nova family: Nova Pro and Nova Lite

You can also configure your own prompt routers to define your own routing configurations tailored to specific needs and preferences. These are more suitable when you require more control over how to route your requests and which models to use. In GA, you can configure your own router by selecting any two models from the same model family and then configuring the response quality difference of your router.
Adding components before invoking the selected LLM with the original prompt can add overhead. We reduced the overhead of the added components by over 20%, to approximately 85 ms (P90). Because the router preferentially invokes the less expensive model while maintaining the same baseline accuracy in the task, you can expect an overall latency and cost benefit compared to always calling the larger, more expensive model, despite the additional overhead. This is discussed further in the following benchmark results section.
We conducted several internal tests with proprietary and public data to evaluate Amazon Bedrock Intelligent Prompt Routing metrics. First, we used average response quality gain under cost constraints (ARQGC), a normalized (0–1) performance metric for measuring routing system quality for various cost constraints, referenced against a reward model, where 0.5 represents random routing and 1 represents optimal oracle routing performance. We also captured the cost savings with intelligent prompt routing relative to using the largest model in the family, and estimated latency benefit based on average recorded time to first token (TTFT) to showcase the advantages and report them in the following table.

Model family | Router overall performance: Average ARQGC | Cost savings (%) when matched to the strong model | Latency benefit (%) when matched to the strong model
Nova      | 0.75 | 35% | 9.98%
Anthropic | 0.86 | 56% | 6.15%
Meta      | 0.78 | 16% | 9.38%

How to read this table?
It’s important to pause and understand these metrics. First, results shown in the preceding table are only meant for comparing against random routing within the family (that is, improvement in ARQGC over 0.5) and not across families. Second, the results are relevant only within the family of models and are different from other model benchmarks that you might be familiar with that are used to compare models. Third, because the real cost and price change frequently and depend on the input and output token counts, it’s challenging to compare the real cost. To solve this problem, we define the cost savings metric as the maximum cost saved compared to the strongest LLM cost for a router to achieve a certain level of response quality. Specifically, in the example shown in the table, there’s an average 35% cost savings using the Nova family router compared to using Nova Pro for all prompts without the router.
You can expect to see varying levels of benefit based on your use case. For example, in an internal test with hundreds of prompts, we achieved 60% cost savings using Amazon Bedrock Intelligent Prompt Routing with the Anthropic family, with the response quality matching that of Claude Sonnet 3.5 v2.
What is response quality difference?
The response quality difference measures the disparity between the responses of the fallback model and the other models. A smaller value indicates that the responses are similar. A higher value indicates a significant difference in the responses between the fallback model and the other models. The choice of what you use as a fallback model is important. When configuring a response quality difference of 10% with Anthropic’s Claude 3 Sonnet as the fallback model, the router dynamically selects an LLM to achieve an overall performance with a 10% drop in the response quality from Claude 3 Sonnet. Conversely, if you use a less expensive model such as Claude 3 Haiku as the fallback model, the router dynamically selects an LLM to achieve an overall performance with a more than 10% increase from Claude 3 Haiku.
In the following figure, you can see that the response quality difference is set at 10% with Haiku as the fallback model. If customers want to explore optimal configurations beyond the default settings described previously, they can experiment with different response quality difference thresholds, analyze the router’s response quality, cost, and latency on their development dataset, and select the configuration that best fits their application’s requirements.
When configuring your own prompt router, you can set the threshold for response quality difference as shown in the following image of the Configure prompt router page, under Response quality difference (%) in the Amazon Bedrock console. To do this by using APIs, see How to use intelligent prompt routing.

Benchmark results
When using different model pairings, the ability of the smaller model to service a larger number of input prompts will have significant latency and cost benefits, depending on the model choice and the use case. For example, when comparing between usage of Claude 3 Haiku and Claude 3.5 Haiku along with Claude 3.5 Sonnet, we observe the following with one of our internal datasets:
Case 1: Routing between Claude 3 Haiku and Claude 3.5 Sonnet V2: Cost savings of 48% while maintaining the same response quality as Claude 3.5 Sonnet v2

Case 2: Routing between Claude 3.5 Haiku and Claude 3.5 Sonnet V2: Cost savings of 56% while maintaining the same response quality as Claude 3.5 Sonnet v2

As you can see in case 1 and case 2, as model capabilities for less expensive models improve with respect to more expensive models in the same family (for example Claude 3 Haiku to 3.5 Haiku), you can expect more complex tasks to be reliably solved by them, therefore causing a higher percentage of routing to the less expensive model while still maintaining the same overall accuracy in the task.
We encourage you to test the effectiveness of Amazon Bedrock Intelligent Prompt Routing on your specialized task and domain because results can vary. For example, when we tested Amazon Bedrock Intelligent Prompt Routing with open source and internal Retrieval Augmented Generation (RAG) datasets, we saw an average 63.6% cost savings because a higher percentage (87%) of prompts were routed to Claude 3.5 Haiku while still maintaining the baseline accuracy of the larger, more expensive model (Sonnet 3.5 v2 in the following figure) alone, averaged across RAG datasets.

Getting started
You can get started using the AWS Management Console for Amazon Bedrock. As mentioned earlier, you can create your own router or use a default router:
Use the console to configure a router:

In the Amazon Bedrock console, choose Prompt Routers in the navigation pane, and then choose Configure prompt router.
You can then use a previously configured router or a default router in the console-based playground. For example, in the following figure, we attached a 10-K document from Amazon.com and asked a specific question about the cost of sales.
Choose the router metrics icon (next to the refresh icon) to see which model the request was routed to. Because this is a nuanced question, Amazon Bedrock Intelligent Prompt Routing correctly routes to Claude 3.5 Sonnet V2 in this case, as shown in the following figure.

You can also use the AWS Command Line Interface (AWS CLI) or API to configure and use a prompt router.
To configure a router with the AWS CLI or API:
AWS CLI:

aws bedrock create-prompt-router \
    --prompt-router-name my-prompt-router \
    --models '[{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelA>"}, {"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelB>"}]' \
    --fallback-model '{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelB>"}' \
    --routing-criteria '{"responseQualityDifference": 0.5}'

Boto3 SDK:

import boto3

client = boto3.client('bedrock')  # prompt routers are managed through the Amazon Bedrock control plane client

response = client.create_prompt_router(
    promptRouterName='my-prompt-router',
    models=[
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
        },
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelB>'
        },
    ],
    description='string',
    routingCriteria={
        'responseQualityDifference': 0.5
    },
    fallbackModel={
        'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
    },
    tags=[
        {
            'key': 'string',
            'value': 'string'
        },
    ]
)
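
After a router exists, you use it the same way you use a model: pass its ARN where you would normally pass a model ID. The following is a minimal sketch using the Converse API; the router ARN shown is a placeholder, and the exact shape of any routing trace in the response can vary, so inspect the full response in your environment.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="<region>")

# Placeholder ARN: use the ARN returned by create-prompt-router or a default router ARN.
router_arn = "arn:aws:bedrock:<region>:<account-id>:prompt-router/my-prompt-router"

response = bedrock_runtime.converse(
    modelId=router_arn,  # the router ARN goes where a model ID normally goes
    messages=[{"role": "user", "content": [{"text": "Give a one-sentence summary of Retrieval Augmented Generation."}]}],
)

print(response["output"]["message"]["content"][0]["text"])
# When a request is routed, the response typically carries trace metadata identifying
# the model that actually served it; check response.get("trace") in your environment.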

Caveats and best practices
When using intelligent prompt routing in Amazon Bedrock, note that:

Amazon Bedrock Intelligent Prompt Routing is optimized for English prompts for typical chat assistant use cases. For use with other languages or customized use cases, conduct your own tests before implementing prompt routing in production applications or reach out to your AWS account team for help designing and conducting these tests.
You can select only two models to be part of the router (pairwise routing), with one of these two models being the fallback model. These two models have to be in the same AWS Region.
When starting with Amazon Bedrock Intelligent Prompt Routing, we recommend that you experiment using the default routers provided by Amazon Bedrock before trying to configure custom routers. After you’ve experimented with default routers, you can configure your own routers as needed for your use cases, evaluate the response quality in the playground, and use them for production applications if they meet your requirements.
Amazon Bedrock Intelligent Prompt Routing can’t adjust routing decisions or responses based on application-specific performance data currently and might not always provide the most optimal routing for unique or specialized, domain-specific use cases. Contact your AWS account team for customization help on specific use cases.

Conclusion
In this post, we explored Amazon Bedrock Intelligent Prompt Routing, highlighting its ability to help optimize both response quality and cost by dynamically routing requests between different foundation models. Benchmark results demonstrate significant cost savings and latency benefits while maintaining high-quality responses across model families. Whether you implement the pre-configured default routers or create custom configurations, Amazon Bedrock Intelligent Prompt Routing offers a powerful way to balance performance and efficiency in generative AI applications. As you implement this feature in your workflows, testing its effectiveness for specific use cases is recommended to take full advantage of the flexibility it provides. To get started, see Understanding intelligent prompt routing in Amazon Bedrock.

About the authors
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Balasubramaniam Srinivasan is a Senior Applied Scientist at Amazon AWS, working on post training methods for generative AI models. He enjoys enriching ML models with domain-specific knowledge and inductive biases to delight customers. Outside of work, he enjoys playing and watching tennis and football (soccer).
Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interest includes generative models and sequential data modeling.
Haibo Ding is a senior applied scientist at Amazon Machine Learning Solutions Lab. He is broadly interested in Deep Learning and Natural Language Processing. His research focuses on developing new explainable machine learning models, with the goal of making them more efficient and trustworthy for real-world problems. He obtained his Ph.D. from University of Utah and worked as a senior research scientist at Bosch Research North America before joining Amazon. Apart from work, he enjoys hiking, running, and spending time with his family.

Build an automated generative AI solution evaluation pipeline with Ama …

Large language models (LLMs) have become integral to numerous applications across industries, ranging from enhanced customer interactions to automated business processes. Deploying these models in real-world scenarios presents significant challenges, particularly in ensuring accuracy, fairness, relevance, and mitigating hallucinations. Thorough evaluation of the performance and outputs of these models is therefore critical to maintaining trust and safety.
Evaluation plays a central role in the generative AI application lifecycle, much like in traditional machine learning. Robust evaluation methodologies enable informed decision-making regarding the choice of models and prompts. However, evaluating LLMs is a complex and resource-intensive process given the free-form text output of LLMs. Methods such as human evaluation provide valuable insights but are costly and difficult to scale. Consequently, there is a demand for automated evaluation frameworks that are highly scalable and can be integrated into application development, much like unit and integration tests in software development.
In this post, to address the aforementioned challenges, we introduce an automated evaluation framework that is deployable on AWS. The solution can integrate multiple LLMs, use customized evaluation metrics, and enable businesses to continuously monitor model performance. We also provide LLM-as-a-judge evaluation metrics using the newly released Amazon Nova models. These models enable scalable evaluations due to their advanced capabilities and low latency. Additionally, we provide a user-friendly interface to enhance ease of use.
In the following sections, we discuss various approaches to evaluate LLMs. We then present a typical evaluation workflow, followed by our AWS-based solution that facilitates this process.
Evaluation methods
Prior to implementing evaluation processes for generative AI solutions, it’s crucial to establish clear metrics and criteria for assessment and gather an evaluation dataset.
The evaluation dataset should be representative of the actual real-world use case. It should consist of diverse samples and ideally contain ground truth values generated by experts. The size of the dataset will depend on the exact application and the cost of acquiring data; however, a dataset that spans relevant and diverse use cases should be a minimum. Developing an evaluation dataset can itself be an iterative task that is progressively enhanced by adding new samples and enriching the dataset with samples where the model performance is lacking. After the evaluation dataset is acquired, evaluation criteria can then be defined.
The evaluation criteria can be broadly divided into three main areas:

Latency-based metrics – These include measurements such as response generation time or time to first token. The importance of each metric might vary depending on the specific application.
Cost – This refers to the expense associated with response generation.
Performance – Performance-based metrics are highly case-dependent. They might include measurements of accuracy, factual consistency of responses, or the ability to generate structured responses.

Generally, there is an inverse relationship between latency, cost, and performance. Depending on the use case, one factor might be more critical than the others. Having metrics for these categories across different models can help you make data-driven decisions to determine the optimum choice for your specific use case.
Although measuring latency and cost can be relatively straightforward, assessing performance requires a deep understanding of the use case and knowing what is crucial for success. Depending on the application, you might be interested in evaluating the factual accuracy of the model’s output (particularly if the output is based on specific facts or reference documents), or you might want to assess whether the model’s responses are consistently polite and helpful, or both.
To support these diverse scenarios, we have incorporated several evaluation metrics in our solution:

FMEval – Foundation Model Evaluation (FMEval) library provided by AWS offers purpose-built evaluation models to provide metrics like toxicity in LLM output, accuracy, and semantic similarity between generated and reference text. This library can be used to evaluate LLMs across several tasks such as open-ended generation, text summarization, question answering, and classification.
Ragas – Ragas is an open source framework that provides metrics for evaluation of Retrieval Augmented Generation (RAG) systems (systems that generate answers based on a provided context). Ragas can be used to evaluate the performance of an information retriever (the component that retrieves relevant information from a database) using metrics like context precision and recall. Ragas also provides metrics to evaluate the LLM generation from the provided context using metrics like answer faithfulness to the provided context and answer relevance to the original question.
LLMeter – LLMeter is a simple solution for latency and throughput testing of LLMs, such as LLMs provided through Amazon Bedrock and OpenAI. This can be helpful in comparing models on metrics for latency-critical workloads.
LLM-as-a-judge metrics – Several challenges arise in defining performance metrics for free-form text generated by LLMs; for example, the same information might be expressed in different ways. It’s also difficult to clearly define metrics for measuring characteristics like politeness. To tackle such evaluations, LLM-as-a-judge metrics have become popular. LLM-as-a-judge evaluations use a judge LLM to score the output of an LLM against predefined criteria. We use the Amazon Nova model as the judge due to its advanced accuracy and performance; a minimal sketch of this pattern follows.
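
To make this concrete, the following is a minimal sketch of an LLM-as-a-judge call, assuming Amazon Nova Pro is available through the Amazon Bedrock Converse API in your Region; the model identifier, prompt wording, and 1–5 scale are illustrative choices, not the exact prompts shipped with the solution.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

JUDGE_PROMPT = (
    "You are grading a model answer against a reference answer.\n"
    "Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}\n"
    'Return JSON like {{"score": <integer 1-5>, "reason": "<one sentence>"}}.'
)

def judge(question, reference, candidate, model_id="amazon.nova-pro-v1:0"):
    """Ask a judge model to score a candidate answer on a 1-5 scale."""
    response = bedrock_runtime.converse(
        modelId=model_id,  # assumed identifier; use the Nova model ID or inference profile available in your Region
        messages=[{"role": "user", "content": [{"text": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}]}],
        inferenceConfig={"temperature": 0.0},
    )
    # A production version should parse the judge output defensively.
    return json.loads(response["output"]["message"]["content"][0]["text"])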

Evaluation workflow
Now that we know what metrics we care about, how do we go about evaluating our solution? A typical generative AI application development (proof of concept) process can be abstracted as follows:

Builders use a few test examples and try out different prompts to see the performance and get a rough idea of the prompt template and model they want to start with (online evaluation).
Builders test the first prompt template version with a selected LLM against a test dataset with ground truth for a list of evaluation metrics to check the performance (offline evaluation). Based on the evaluation results, they might need to modify the prompt template, fine-tune the model, or implement RAG to add additional context to improve performance.
Builders implement the change and evaluate the updated solution against the dataset to validate improvements on the solution. Then they repeat the previous steps until the performance of the developed solution meets the business requirements.

The two key stages in the evaluation process are:

Online evaluation – This involves manually evaluating prompts based on a few examples for qualitative checks
Offline evaluation – This involves automated quantitative evaluation on an evaluation dataset

This process can add significant operational complications and effort from the builder team and operations team. To achieve this workflow, you need the following:

A side-by-side comparison tool for various LLMs
A prompt management service that can be used to save and version control prompts
A batch inference service that can invoke your selected LLM on a large number of examples
A batch evaluation service that can be used to evaluate the LLM response generated in the previous step

In the next section, we describe how we can create this workflow on AWS.
Solution overview
In this section, we present an automated generative AI evaluation solution that can be used to simplify the evaluation process. The architecture diagram of the solution is shown in the following figure.

This solution provides both online (real-time comparison) and offline (batch evaluation) evaluation options that fulfill different needs during the generative AI solution development lifecycle. Each component in this evaluation infrastructure can be developed using existing open source tools or AWS native services.
The architecture of the automated LLM evaluation pipeline focuses on modularity, flexibility, and scalability. The design philosophy makes sure that different components can be reused or adapted for other generative AI projects. The following is an overview of each component and its role in the solution:

UI – The UI provides a straightforward way to interact with the evaluation framework. Users can compare different LLMs with a side-by-side comparison. The UI provides latency, model outputs, and cost for each input query (online evaluation). The UI also helps you store and manage your different prompt templates backed by the Amazon Bedrock prompt management feature. These prompts can be referenced later for batch generation or production use. You can also launch batch generation and evaluation jobs through the UI. The UI service can be run locally in a Docker container or deployed to AWS Fargate.
Prompt management – The evaluation solution includes a key component for prompt management. Backed by Amazon Bedrock prompt management, you can save and retrieve your prompts using the UI (a sketch of the underlying API calls follows this list).
LLM invocation pipeline – Using AWS Step Functions, this workflow automates the process of generating outputs from the LLM for a test dataset. It retrieves inputs from Amazon Simple Storage Service (Amazon S3), processes them, and stores the responses back to Amazon S3. This workflow supports batch processing, making it suitable for large-scale evaluations.
LLM evaluation pipeline – This workflow, also managed by Step Functions, evaluates the outputs generated by the LLM. At the time of writing, the solution supports metrics provided by the FMEval library, Ragas library, and custom LLM-as-a-judge metrics. It handles various evaluation methods, including direct metrics computation and LLM-guided evaluation. The results are stored in Amazon S3, ready for analysis.
Eval factory – A core service for conducting evaluations, the eval factory supports multiple evaluation techniques, including those that use other LLMs for reference-free scoring. It provides consistency in evaluation results by standardizing outputs into a single metric per evaluation. It can be difficult to find a one-size-fits-all solution when it comes to evaluation, so we provide you the flexibility to use your own script for evaluation. We also provide pre-built scripts and pipelines for some common tasks including classification, summarization, translation, and RAG. Especially for RAG, we have integrated popular open source libraries like Ragas.
Postprocessing and results store – After the pipeline results are generated, postprocessing can concatenate the results and display them in a results store that provides a graphical view. This part also handles updates to the prompt management system, because each prompt template and LLM combination will have recorded evaluation results to help you select the right model and prompt template for the use case. Visualization of the results can be done on the UI or with an Amazon Athena table if the prompt management system uses Amazon S3 as the data storage. This part can be implemented with an AWS Lambda function, triggered by an event sent after the new data has been saved to the Amazon S3 location for the prompt management system.
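
For the prompt management component referenced above, a rough sketch of the underlying calls looks like the following, assuming the Amazon Bedrock prompt management APIs exposed on the bedrock-agent client; the prompt name, template text, and variant fields are illustrative, so check the CreatePrompt API reference for the exact schema in your SDK version.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Save a prompt template with two input variables (illustrative values).
created = bedrock_agent.create_prompt(
    name="rag-answer-prompt",
    variants=[{
        "name": "v1",
        "templateType": "TEXT",
        "templateConfiguration": {
            "text": {
                "text": "Answer the question using only the context.\n"
                        "Context: {{context}}\nQuestion: {{question}}",
                "inputVariables": [{"name": "context"}, {"name": "question"}],
            }
        },
    }],
    defaultVariant="v1",
)

# Retrieve the stored prompt later, for batch generation or production use.
prompt = bedrock_agent.get_prompt(promptIdentifier=created["id"])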

The evaluation solution can significantly enhance team productivity throughout the development lifecycle by reducing manual intervention and increasing automated processes. As new LLMs emerge, builders can compare the current production LLM with new models to determine if upgrading would improve the system’s performance. This ongoing evaluation process makes sure that the generative AI solution remains optimal and up-to-date.
Prerequisites
For scripts to set up the solution, refer to the GitHub repository. After the backend and the frontend are up and running, you can start the evaluation process.
To start, open the UI in your browser. The UI provides the ability to do both online and offline evaluations.
Online evaluation
To iteratively refine prompts, you can follow these steps:

Choose the options menu (three lines) on the top left side of the page to set the AWS Region.
After you choose the Region, the model lists will be prefilled with the available Amazon Bedrock models in that Region.
You can choose two models for side-by-side comparison.
You can select a prompt already stored in Amazon Bedrock prompt management from the dropdown menu. If selected, this will automatically fill the prompts.
You can also create a new prompt by entering the prompt in the text box. You can select generation configurations (temperature, top P, and so on) on the Generation Configuration tab. The prompt template can also use dynamic variables by entering variables in {{}} (for example, for additional context, add a variable like {{context}}). Then define the value of these variables on the Context tab.
Choose Enter to start generation.
This will invoke the two models and present the output in the text boxes below each model. Additionally, you will also be provided with the latency and cost for each model.
To save the prompt to Amazon Bedrock, choose Save.

Offline generation and evaluation
After you have made the model and prompt choice, you can run batch generation and evaluation over a larger dataset.

To run batch generation, choose the model from the dropdown list.
You can provide an Amazon Bedrock knowledge base ID if additional context is required for generation.
You can also provide a prompt template ID. This prompt will be used for generation.
Upload a dataset file. This file will be uploaded to the S3 bucket set in the sidebar. This file should be a pipe (|) separated CSV file. For more details on expected data file format, see the project’s GitHub README file.
Choose Start Generation to start the job. This will trigger a Step Functions workflow that you can track by choosing the link in the pop-up.

Invoking batch generation triggers a Step Functions workflow, which is shown in the following figure. The logic follows these steps:

GetPrompts – This step retrieves a CSV file containing prompts from an S3 bucket. The contents of this file become the Step Functions workflow’s payload.
convert_to_json – This step parses the CSV output and converts it into a JSON format. This transformation enables the step function to use the Map state to process the invoke_llm flow concurrently.
Map step – This is an iterative step that processes the JSON payload by invoking the invoke_llm Lambda function concurrently for each item in the payload. A concurrency limit is set, with a default value of 3. You can adjust this limit based on the capacity of your backend LLM service. Within each Map iteration, the invoke_llm Lambda function calls the backend LLM service to generate a response for a single question and its associated context (a sketch of such a handler follows this list).
InvokeSummary – This step combines the output from each iteration of the Map step. It generates a JSON Lines result file containing the outputs, which is then stored in an S3 bucket for evaluation purposes.
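
The following is a minimal sketch of an invoke_llm-style handler, assuming each Map iteration passes one record with question, context, and prompt_template fields and that the backend LLM is called through the Amazon Bedrock Converse API; the field names and default model ID are assumptions rather than the exact code in the repository.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    """Generate one answer for a single item of the JSON payload produced by convert_to_json."""
    record = event
    prompt = record["prompt_template"].format(
        question=record["question"],
        context=record.get("context", ""),
    )
    response = bedrock_runtime.converse(
        modelId=record.get("model_id", "anthropic.claude-3-haiku-20240307-v1:0"),  # assumed default
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    answer = response["output"]["message"]["content"][0]["text"]
    return {"question": record["question"], "answer": answer}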

When the batch generation is complete, you can trigger a batch evaluation pipeline with the selected metrics from the predefined metric list. You can also specify the location of an S3 file that contains already generated LLM outputs to perform batch evaluation.

Invoking batch evaluation triggers an Evaluate-LLM Step Functions workflow, which is shown in the following figure. The Evaluate-LLM Step Functions workflow is designed to comprehensively assess LLM performance using multiple evaluation frameworks:

LLMeter evaluation – Uses the AWS Labs LLMeter framework and focuses on endpoint performance metrics and benchmarking.
Ragas framework evaluation – Uses the Ragas framework to measure four critical quality metrics:

Context precision – A metric that evaluates whether the ground-truth-relevant items present in the contexts (chunks retrieved from the vector database) are ranked toward the top. Its value ranges between 0–1, with higher values indicating better performance. The RAG system usually retrieves more than one chunk for a given query, and the chunks are ranked in order. A lower score is assigned when the high-ranked chunks contain more irrelevant information, which indicates poor information retrieval capability.
Context recall – A metric that measures the extent to which the retrieved context aligns with the ground truth. Its value ranges between 0–1, with higher values indicating better performance. The ground truth can contain several short and definitive claims. For example, the ground truth “Canberra is the capital city of Australia, and the city is located at the northern end of the Australian Capital Territory” has two claims: “Canberra is the capital city of Australia” and “Canberra city is located at the northern end of the Australian Capital Territory.” Each claim in the ground truth is analyzed to determine whether it can be attributed to the retrieved context or not. A higher value is assigned when more claims in the ground truth are attributable to the retrieved context.
Faithfulness – A metric that measures the factual consistency of the generated answer against the given context. Its value ranges between 0–1, with higher values indicating better performance. The answer can also contain several claims. A lower score is assigned to answers that contain a smaller number of claims that can be inferred from the given context.
Answer relevancy – A metric that focuses on assessing how pertinent the generated answer is to the given prompt. Its value ranges between 0–1, with higher values indicating better performance. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy.

LLM-as-a-judge evaluation – Uses LLM capabilities to compare and score outputs against expected answers, which provides qualitative assessment of response accuracy. The prompts used for the LLM-as-a-judge are for demonstration purposes; to serve your specific use case, provide your own evaluation prompts to make sure the LLM-as-a-judge meets the correct evaluation requirements.
FMEval evaluation – Uses the AWS open source FMEval library and analyzes key metrics, including toxicity measurement.

The architecture implements these evaluations as nested Step Functions workflows that execute concurrently, enabling efficient and comprehensive model assessment. This design also makes it straightforward to add new frameworks to the evaluation workflow.

Clean up
To delete the local deployment for the frontend, run run.sh delete_local. If you need to delete the cloud deployment, run run.sh delete_cloud. For the backend, delete the AWS CloudFormation stack, llm-evaluation-stack. For resources that can’t be deleted automatically, delete them manually on the AWS Management Console.
Conclusion
In this post, we explored the importance of evaluating LLMs in the context of generative AI applications, highlighting the challenges posed by issues like hallucinations and biases. We introduced a comprehensive solution using AWS services to automate the evaluation process, allowing for continuous monitoring and assessment of LLM performance. By using tools like the FMEval library, Ragas, LLMeter, and Step Functions, the solution provides flexibility and scalability, meeting the evolving needs of LLM consumers.
With this solution, businesses can confidently deploy LLMs, knowing they adhere to the necessary standards for accuracy, fairness, and relevance. We encourage you to explore the GitHub repository and start building your own automated LLM evaluation pipeline on AWS today. This setup can not only streamline your AI workflows but also make sure your models deliver the highest-quality outputs for your specific applications.

About the Authors
Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in artificial intelligence, he partners with clients to accelerate their GenAI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences, currently focusing on strength training.
Rafa XU is a passionate Amazon Web Services (AWS) senior cloud architect focused on helping public sector customers design, build, and run infrastructure, applications, and services on AWS. With more than 10 years of experience working across multiple information technology disciplines, Rafa has spent the last five years focused on AWS Cloud infrastructure, serverless applications, and automation. More recently, Rafa has expanded his skillset to include generative AI, machine learning, big data, and the Internet of Things (IoT).
Dr. Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Sam Edwards is a Solutions Architect at AWS based in Sydney and focused on Media & Entertainment. He is a Subject Matter Expert for Amazon Bedrock and Amazon SageMaker AI services. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. In his spare time, he likes traveling and enjoying time with family.
Dr. Kai Zhu currently works as a Cloud Support Engineer at AWS, helping customers with issues in AI/ML related services like SageMaker and Bedrock. He is a SageMaker and Bedrock Subject Matter Expert. Experienced in data science and data engineering, he is interested in building generative AI powered projects.