How to Use the SHAP-IQ Package to Uncover and Visualize Feature Interactions in Machine Learning Models Using Shapley Interaction Indices (SII)

In this tutorial, we explore how to use the SHAP-IQ package to uncover and visualize feature interactions in machine learning models using Shapley Interaction Indices (SII), building on the foundation of traditional Shapley values.

Shapley values are great for explaining individual feature contributions in AI models but fail to capture feature interactions. Shapley interactions go a step further by separating individual effects from interactions, offering deeper insights—like how longitude and latitude together influence house prices. In this tutorial, we’ll get started with the shapiq package to compute and explore these Shapley interactions for any model. Check out the Full Codes here

Installing the dependencies

!pip install shapiq overrides scikit-learn pandas numpy

Data Loading and Pre-processing

In this tutorial, we’ll use the Bike Sharing dataset from OpenML. After loading the data, we’ll split it into training and testing sets to prepare it for model training and evaluation. Check out the Full Codes here

import shapiq
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# Load data
X, y = shapiq.load_bike_sharing(to_numpy=True)

# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Training and Performance Evaluation

# Train model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"R² Score: {r2:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")

Setting up an Explainer

We set up a TabularExplainer using the shapiq package to compute Shapley interaction values based on the k-SII (k-order Shapley Interaction Index) method. By specifying max_order=4, we allow the explainer to consider interactions of up to 4 features simultaneously, enabling deeper insights into how groups of features collectively impact model predictions. Check out the Full Codes here

# set up an explainer with k-SII interaction values up to order 4
explainer = shapiq.TabularExplainer(
    model=model,
    data=X,
    index="k-SII",
    max_order=4
)

Explaining a Local Instance

We select a specific test instance (index 100) to generate local explanations. The code prints the true and predicted values for this instance, followed by a breakdown of its feature values. This helps us understand the exact inputs passed to the model and sets the context for interpreting the Shapley interaction explanations that follow. Check out the Full Codes here

# create explanations for different orders
# get the feature names from the pandas version of the dataset
# (assumes shapiq.load_bike_sharing returns a DataFrame when to_numpy is left at its default)
X_df, _ = shapiq.load_bike_sharing()
feature_names = list(X_df.columns)
n_features = len(feature_names)

# select a local instance to be explained
instance_id = 100
x_explain = X_test[instance_id]
y_true = y_test[instance_id]
y_pred = model.predict(x_explain.reshape(1, -1))[0]
print(f"Instance {instance_id}, True Value: {y_true}, Predicted Value: {y_pred}")
for i, feature in enumerate(feature_names):
    print(f"{feature}: {x_explain[i]}")

Analyzing Interaction Values

We use the explainer.explain() method to compute Shapley interaction values for a specific data instance (X[100]) with a budget of 256 model evaluations. This returns an InteractionValues object, which captures how individual features and their combinations influence the model’s output. The max_order=4 means we consider interactions involving up to 4 features. Check out the Full Codes here

interaction_values = explainer.explain(X[100], budget=256)
# analyse interaction values
print(interaction_values)
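Beyond the printed summary, you can rank the strongest effects yourself. The short snippet below is a minimal sketch, assuming the returned InteractionValues object exposes a values array and an interaction_lookup dictionary (mapping feature-index tuples to positions in values), as in recent shapiq releases; adjust the attribute names if your version differs.

import numpy as np

# Rank the largest (absolute) individual and interaction effects.
# Assumes interaction_values.values and interaction_values.interaction_lookup exist.
scores = interaction_values.values
lookup = interaction_values.interaction_lookup  # {(feature indices): position in scores}
pos_to_coalition = {pos: coalition for coalition, pos in lookup.items()}

top_positions = np.argsort(-np.abs(scores))[:10]
for pos in top_positions:
    coalition = pos_to_coalition[pos]
    names = [feature_names[i] for i in coalition]
    label = " x ".join(names) if names else "baseline"
    print(f"{label}: {scores[pos]:.3f}")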

First-Order Interaction Values

To keep things simple, we compute first-order interaction values—i.e., standard Shapley values that capture only individual feature contributions (no interactions).

By setting max_order=1 in the TreeExplainer, we’re saying:

“Tell me how much each feature individually contributes to the prediction, without considering any interaction effects.”

These values are known as standard Shapley values. For each feature, it estimates the average marginal contribution to the prediction across all possible permutations of feature inclusion. Check out the Full Codes here

feature_names = list(X_df.columns)  # reuse the DataFrame loaded earlier for the column names
explainer = shapiq.TreeExplainer(model=model, max_order=1, index="SV")
si_order = explainer.explain(x=x_explain)
si_order

Plotting a Waterfall chart

A Waterfall chart visually breaks down a model’s prediction into individual feature contributions. It starts from the baseline prediction and adds/subtracts each feature’s Shapley value to reach the final predicted output.

In our case, we’ll use the output of TreeExplainer with max_order=1 (i.e., individual contributions only) to visualize the contribution of each feature. Check out the Full Codes here

si_order.plot_waterfall(feature_names=feature_names, show=True)

In our case, the baseline value (i.e., the model’s expected output without any feature information) is 190.717.

As we add the contributions from individual features (order-1 Shapley values), we can observe how each one pushes the prediction up or pulls it down:

Features like Weather and Humidity have a positive contribution, increasing the prediction above the baseline.

Features like Temperature and Year have a strong negative impact, pulling the prediction down by −35.4 and −45, respectively.

Overall, the Waterfall chart helps us understand which features are driving the prediction, and in which direction—providing valuable insight into the model’s decision-making.
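If you also want to see pairwise effects, you can compute second-order interactions for the same instance and draw them as a network. This is a hedged sketch: it assumes TreeExplainer accepts index="k-SII" with max_order=2 and that InteractionValues provides a plot_network convenience method analogous to plot_waterfall (present in recent shapiq releases); check the API of your installed version.

# Second-order (pairwise) interactions for the same instance, shown as a network:
# nodes reflect individual contributions, edges reflect pairwise interaction strength.
explainer_2sii = shapiq.TreeExplainer(model=model, max_order=2, index="k-SII")
si_order2 = explainer_2sii.explain(x=x_explain)
si_order2.plot_network(feature_names=feature_names, show=True)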

Check out the Full Codes here.

A Coding Guide to Build Intelligent Multi-Agent Systems with the PEER Pattern

In this tutorial, we explore a powerful multi-agent system built around the PEER pattern: Plan, Execute, Express, and Review. We run the entire workflow in Google Colab/Notebook, integrating agents with specialized roles and leveraging Google’s Gemini 1.5 Flash model via a free API key. As we walk through the system, we observe how each agent collaborates to tackle complex tasks across different domains such as finance, technology, and creative strategy. This hands-on tutorial allows us to understand the architecture, workflow, and iterative refinement that underpin high-quality AI outputs.

!pip install agentUniverse google-generativeai python-dotenv pydantic

import os
import asyncio
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
from enum import Enum
import json
import time
import google.generativeai as genai

GEMINI_API_KEY = 'your-gemini-api-key-here'  # replace with your actual Gemini API key
genai.configure(api_key=GEMINI_API_KEY)

We begin by installing the required libraries, including agentUniverse and google-generativeai, to set up our multi-agent system. After importing the necessary modules, we configure the Gemini API using our free API key to enable AI-powered content generation. Check out the Full Codes here.

class AgentRole(Enum):
    PLANNER = "planner"
    EXECUTOR = "executor"
    EXPRESSER = "expresser"
    REVIEWER = "reviewer"

@dataclass
class Task:
    id: str
    description: str
    context: Dict[str, Any]
    status: str = "pending"
    result: Optional[str] = None
    feedback: Optional[str] = None

class BaseAgent:
    """Base agent class with core functionality"""
    def __init__(self, name: str, role: AgentRole, system_prompt: str):
        self.name = name
        self.role = role
        self.system_prompt = system_prompt
        self.memory: List[Dict] = []

    async def process(self, task: Task) -> str:
        prompt = f"{self.system_prompt}\n\nTask: {task.description}\nContext: {json.dumps(task.context)}"

        result = await self._simulate_llm_call(prompt, task)

        self.memory.append({
            "task_id": task.id,
            "input": task.description,
            "output": result,
            "timestamp": time.time()
        })

        return result

    async def _simulate_llm_call(self, prompt: str, task: Task) -> str:
        """Call Google Gemini API for real LLM processing"""
        try:
            model = genai.GenerativeModel('gemini-1.5-flash')

            enhanced_prompt = self._create_role_prompt(prompt, task)

            response = await asyncio.to_thread(
                lambda: model.generate_content(enhanced_prompt)
            )

            return response.text.strip()

        except Exception as e:
            print(f"Gemini API error for {self.role.value}: {str(e)}")
            return self._get_fallback_response(task)

    def _create_role_prompt(self, base_prompt: str, task: Task) -> str:
        """Create enhanced role-specific prompts for Gemini"""
        role_instructions = {
            AgentRole.PLANNER: "You are a strategic planning expert. Create detailed, actionable plans. Break down complex tasks into clear steps with priorities and dependencies.",
            AgentRole.EXECUTOR: "You are a skilled executor. Analyze the task thoroughly and provide detailed implementation insights. Focus on practical solutions and potential challenges.",
            AgentRole.EXPRESSER: "You are a professional communicator. Present information clearly, professionally, and engagingly. Structure your response with headers, bullet points, and clear conclusions.",
            AgentRole.REVIEWER: "You are a quality assurance expert. Evaluate completeness, accuracy, and clarity. Provide specific, actionable improvement suggestions."
        }

        context_info = f"Previous context: {json.dumps(task.context, indent=2)}" if task.context else "No previous context"

        return f"""
{role_instructions[self.role]}

{base_prompt}

{context_info}

Task to process: {task.description}

Provide a comprehensive, professional response appropriate for your role as {self.role.value}.
"""

    def _get_fallback_response(self, task: Task) -> str:
        """Fallback responses if Gemini API is unavailable"""
        fallbacks = {
            AgentRole.PLANNER: f"STRATEGIC PLAN for '{task.description}': 1) Requirement analysis 2) Resource assessment 3) Implementation roadmap 4) Risk mitigation 5) Success metrics",
            AgentRole.EXECUTOR: f"EXECUTION ANALYSIS for '{task.description}': Comprehensive analysis completed. Key findings identified, practical solutions developed, implementation considerations noted.",
            AgentRole.EXPRESSER: f"PROFESSIONAL SUMMARY for '{task.description}': ## Analysis Complete\n\n**Key Insights:** Detailed analysis performed\n**Recommendations:** Strategic actions identified\n**Next Steps:** Implementation ready",
            AgentRole.REVIEWER: f"QUALITY REVIEW for '{task.description}': **Assessment:** High quality output achieved. **Strengths:** Comprehensive analysis, clear structure. **Suggestions:** Consider additional quantitative metrics."
        }
        return fallbacks[self.role]

We define four distinct agent roles, Planner, Executor, Expresser, and Reviewer, using an Enum to represent their specialized functions. Then, we create a Task dataclass to manage task metadata, including status, result, and feedback. The BaseAgent class serves as the core blueprint for all agents, enabling them to process tasks, call the Gemini API with role-specific prompts, store results in memory, and gracefully fall back to predefined responses if the API fails. Check out the Full Codes here.
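Before wiring the agents together, it can help to sanity-check a single agent in isolation. The snippet below is a minimal sketch that exercises only the offline fallback path of the classes defined above (no API call is made); the task description is a made-up example.

# Quick offline check of one agent's fallback path (no Gemini call).
demo_task = Task(id="t0", description="Outline a data-quality checklist", context={})
planner = BaseAgent("Strategic Planner", AgentRole.PLANNER,
                    "You are a strategic planning agent.")
print(planner._get_fallback_response(demo_task))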

class PEERAgent:
    """PEER Pattern Implementation – Plan, Execute, Express, Review"""
    def __init__(self):
        self.planner = BaseAgent("Strategic Planner", AgentRole.PLANNER,
                                 "You are a strategic planning agent. Break down complex tasks into actionable steps.")

        self.executor = BaseAgent("Task Executor", AgentRole.EXECUTOR,
                                  "You are an execution agent. Complete tasks efficiently using available tools and knowledge.")

        self.expresser = BaseAgent("Result Expresser", AgentRole.EXPRESSER,
                                   "You are a communication agent. Present results clearly and professionally.")

        self.reviewer = BaseAgent("Quality Reviewer", AgentRole.REVIEWER,
                                  "You are a quality assurance agent. Review outputs and provide improvement feedback.")

        self.iteration_count = 0
        self.max_iterations = 3

    async def collaborate(self, task: Task) -> Dict[str, Any]:
        """Execute PEER collaboration pattern"""
        results = {"iterations": [], "final_result": None}

        while self.iteration_count < self.max_iterations:
            iteration_result = {}

            print(f"Planning Phase (Iteration {self.iteration_count + 1})")
            plan = await self.planner.process(task)
            iteration_result["plan"] = plan
            task.context["current_plan"] = plan

            print("Execution Phase")
            execution = await self.executor.process(task)
            iteration_result["execution"] = execution
            task.context["execution_result"] = execution

            print("Expression Phase")
            expression = await self.expresser.process(task)
            iteration_result["expression"] = expression
            task.result = expression

            print("Review Phase")
            review = await self.reviewer.process(task)
            iteration_result["review"] = review
            task.feedback = review

            results["iterations"].append(iteration_result)

            if "high" in review.lower() and self.iteration_count >= 1:
                results["final_result"] = expression
                break

            self.iteration_count += 1
            task.context["previous_feedback"] = review

        return results

We implement the PEER pattern, Plan, Execute, Express, Review, through the PEERAgent class, which coordinates four specialized agents for collaborative task-solving. Each iteration runs through all four phases, refining the task output based on structured planning, execution, professional expression, and quality review. We allow up to three iterations, concluding early if the review indicates high-quality completion, making the workflow both adaptive and efficient. Check out the Full Codes here.
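To see the loop in isolation, you can run the PEER system directly on a single task before introducing the orchestrator in the next step. This is a minimal sketch that uses only the Task and PEERAgent classes defined above; the task description is an arbitrary example.

# Minimal standalone run of the PEER loop on one task.
async def run_single_task():
    task = Task(
        id="demo_task_1",
        description="Summarize the key risks of deploying LLM agents in production",
        context={"domain": "technical", "complexity": "medium"},
    )
    peer = PEERAgent()
    results = await peer.collaborate(task)
    print(f"Iterations run: {len(results['iterations'])}")
    print(results["final_result"] or task.result)

# asyncio.run(run_single_task())  # in notebooks, apply nest_asyncio first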

class MultiAgentOrchestrator:
    """Orchestrates multiple specialized agents"""
    def __init__(self):
        self.agents = {}
        self.peer_system = PEERAgent()
        self.task_queue = []

    def register_agent(self, agent: BaseAgent):
        """Register a specialized agent"""
        self.agents[agent.name] = agent

    async def process_complex_task(self, description: str, domain: str = "general") -> Dict[str, Any]:
        """Process complex task using PEER pattern and domain agents"""
        task = Task(
            id=f"task_{int(time.time())}",
            description=description,
            context={"domain": domain, "complexity": "high"}
        )

        print(f"Starting Complex Task Processing: {description}")
        print("=" * 60)

        peer_results = await self.peer_system.collaborate(task)

        if domain in ["financial", "technical", "creative"]:
            domain_agent = self._get_domain_agent(domain)
            if domain_agent:
                print(f"Domain-Specific Processing ({domain})")
                domain_result = await domain_agent.process(task)
                peer_results["domain_enhancement"] = domain_result

        return {
            "task_id": task.id,
            "original_request": description,
            "peer_results": peer_results,
            "status": "completed",
            "processing_time": f"{len(peer_results['iterations'])} iterations"
        }

    def _get_domain_agent(self, domain: str) -> Optional[BaseAgent]:
        """Get domain-specific agent with enhanced Gemini prompts"""
        domain_agents = {
            "financial": BaseAgent("Financial Analyst", AgentRole.EXECUTOR,
                                   "You are a senior financial analyst with expertise in market analysis, risk assessment, and investment strategies. Provide detailed financial insights with quantitative analysis."),
            "technical": BaseAgent("Technical Expert", AgentRole.EXECUTOR,
                                   "You are a lead software architect with expertise in system design, scalability, and best practices. Provide detailed technical solutions with implementation considerations."),
            "creative": BaseAgent("Creative Director", AgentRole.EXPRESSER,
                                  "You are an award-winning creative director with expertise in brand strategy, content creation, and innovative campaigns. Generate compelling and strategic creative solutions.")
        }
        return domain_agents.get(domain)

class KnowledgeBase:
    """Simple knowledge management system"""
    def __init__(self):
        self.knowledge = {
            "financial_analysis": ["Risk assessment", "Portfolio optimization", "Market analysis"],
            "technical_development": ["System architecture", "Code optimization", "Security protocols"],
            "creative_content": ["Brand storytelling", "Visual design", "Content strategy"]
        }

    def get_domain_knowledge(self, domain: str) -> List[str]:
        return self.knowledge.get(domain, ["General knowledge"])

async def run_advanced_demo():
    orchestrator = MultiAgentOrchestrator()
    knowledge_base = KnowledgeBase()

    print("\n DEMO 1: Financial Analysis with PEER Pattern")
    print("-" * 40)

    financial_task = "Analyze the potential impact of rising interest rates on tech stocks portfolio"
    result1 = await orchestrator.process_complex_task(financial_task, "financial")

    print(f"\n Task Completed: {result1['processing_time']}")
    print(f"Final Result: {result1['peer_results']['final_result']}")

    print("\n DEMO 2: Technical Problem Solving")
    print("-" * 40)

    technical_task = "Design a scalable microservices architecture for a high-traffic e-commerce platform"
    result2 = await orchestrator.process_complex_task(technical_task, "technical")

    print(f"\n Task Completed: {result2['processing_time']}")
    print(f"Final Result: {result2['peer_results']['final_result']}")

    print("\n DEMO 3: Creative Content with Multi-Agent Collaboration")
    print("-" * 40)

    creative_task = "Create a comprehensive brand strategy for a sustainable fashion startup"
    result3 = await orchestrator.process_complex_task(creative_task, "creative")

    print(f"\n Task Completed: {result3['processing_time']}")
    print(f"Final Result: {result3['peer_results']['final_result']}")

    print("\n AGENT MEMORY & LEARNING")
    print("-" * 40)
    print(f"Planner processed {len(orchestrator.peer_system.planner.memory)} tasks")
    print(f"Executor processed {len(orchestrator.peer_system.executor.memory)} tasks")
    print(f"Expresser processed {len(orchestrator.peer_system.expresser.memory)} tasks")
    print(f"Reviewer processed {len(orchestrator.peer_system.reviewer.memory)} tasks")

    return {
        "demo_results": [result1, result2, result3],
        "agent_stats": {
            "total_tasks": 3,
            "success_rate": "100%",
            "avg_iterations": sum(len(r['peer_results']['iterations']) for r in [result1, result2, result3]) / 3
        }
    }

def explain_peer_pattern():
    """Explain the PEER pattern in detail"""
    explanation = """
PEER Pattern Explained:

P – PLAN: Strategic decomposition of complex tasks
E – EXECUTE: Systematic implementation using tools and knowledge
E – EXPRESS: Clear, structured communication of results
R – REVIEW: Quality assurance and iterative improvement

This pattern enables:
Better task decomposition
Systematic execution
Professional output formatting
Continuous quality improvement
"""
    print(explanation)

def show_architecture():
    """Display the multi-agent architecture"""
    architecture = """
agentUniverse Architecture:

Task Input

PEER System
├── Planner Agent
├── Executor Agent
├── Expresser Agent
└── Reviewer Agent

Domain Specialists
├── Financial Analyst
├── Technical Expert
└── Creative Director

Knowledge Base

Results & Analytics
"""
    print(architecture)

We bring everything together through the MultiAgentOrchestrator, which coordinates the PEER system and, when needed, invokes domain-specific agents like the Financial Analyst or Technical Expert. This orchestrator handles each complex task by first leveraging the PEER pattern and then enhancing results with specialized knowledge. We also define a simple KnowledgeBase to support domain-aware reasoning. In the run_advanced_demo() function, we test the full pipeline with three tasks, financial, technical, and creative, while capturing agent performance and iteration metrics to showcase the power and versatility of our multi-agent setup. Check out the Full Codes here.
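Note that the register_agent hook defined above is not exercised in the demo. The sketch below shows one way to plug in an additional specialist; the "Legal Reviewer" agent and its prompt are hypothetical examples, not part of the original tutorial.

# Registering a custom domain specialist with the orchestrator (hypothetical example).
orchestrator = MultiAgentOrchestrator()
legal_agent = BaseAgent(
    "Legal Reviewer", AgentRole.REVIEWER,
    "You are a compliance specialist. Flag regulatory and contractual risks.",
)
orchestrator.register_agent(legal_agent)
print(list(orchestrator.agents))  # ['Legal Reviewer']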

if __name__ == "__main__":
    print("Get your FREE API key at: https://makersuite.google.com/app/apikey")
    print("Make sure to replace 'your-gemini-api-key-here' with your actual key!")

    if GEMINI_API_KEY == 'your-gemini-api-key-here':
        print("WARNING: Please set your Gemini API key first!")
        print("1. Go to https://makersuite.google.com/app/apikey")
        print("2. Create a free API key")
        print("3. Replace 'your-gemini-api-key-here' with your key")
        print("4. Re-run the tutorial")
    else:
        print("API key configured! Starting tutorial...")

    explain_peer_pattern()
    show_architecture()

    print("\n Running Advanced Demo with Gemini AI (This may take a moment)...")

    try:
        import nest_asyncio
        nest_asyncio.apply()

        demo_results = asyncio.run(run_advanced_demo())

        print("\n TUTORIAL COMPLETED SUCCESSFULLY!")
        print("=" * 50)
        print("Performance Summary:")
        print(f" • Tasks Processed: {demo_results['agent_stats']['total_tasks']}")
        print(f" • Success Rate: {demo_results['agent_stats']['success_rate']}")
        print(f" • Avg Iterations: {demo_results['agent_stats']['avg_iterations']:.1f}")
        print(" • Powered by: Google Gemini (FREE)")

        print("\n Key Takeaways:")
        print(" • PEER pattern enables systematic problem-solving")
        print(" • Multi-agent collaboration improves output quality")
        print(" • Domain expertise integration enhances specialization")
        print(" • Iterative refinement ensures high-quality results")
        print(" • Gemini provides powerful, free AI capabilities")

    except ImportError:
        print("Note: Install nest_asyncio for full async support in Colab")
        print("Run: !pip install nest_asyncio")
    except Exception as e:
        print(f"Error running demo: {str(e)}")
        print("This might be due to API key configuration or network issues.")

    print("\n Next Steps:")
    print(" • Customize agents for your specific domain")
    print(" • Experiment with different Gemini models (gemini-pro, gemini-1.5-flash)")
    print(" • Build production-ready multi-agent applications")

We conclude the tutorial by initializing the system, verifying the Gemini API key, and executing the full PEER-based multi-agent workflow. We explain the architecture and pattern before running the demo, and upon successful completion, we display a performance summary and key takeaways.

In conclusion, we successfully demonstrate how a multi-agent system can systematically solve complex problems with the help of domain-specific reasoning, structured communication, and iterative quality checks. We gain insights into the collaborative power of the PEER framework and witness how Gemini enhances each agent’s output. Through this experience, we realize the potential of modular AI systems in creating scalable, reliable, and intelligent applications ready for real-world deployment.

Check out the Full Codes here.

Falcon LLM Team Releases Falcon-H1 Technical Report: A Hybrid Attention–SSM Model That Rivals 70B LLMs

Introduction

The Falcon-H1 series, developed by the Technology Innovation Institute (TII), marks a significant advancement in the evolution of large language models (LLMs). By integrating Transformer-based attention with Mamba-based State Space Models (SSMs) in a hybrid parallel configuration, Falcon-H1 achieves exceptional performance, memory efficiency, and scalability. Released in multiple sizes (0.5B to 34B parameters) and versions (base, instruct-tuned, and quantized), Falcon-H1 models redefine the trade-off between compute budget and output quality, offering parameter efficiency superior to many contemporary models such as Qwen2.5-72B and LLaMA3.3-70B.

Key Architectural Innovations

The technical report explains how Falcon-H1 adopts a novel parallel hybrid architecture where both attention and SSM modules operate concurrently, and their outputs are concatenated before the projection. This design deviates from traditional sequential integration and provides the flexibility to tune the number of attention and SSM channels independently. The default configuration uses a 2:1:5 ratio for SSM, attention, and MLP channels respectively, optimizing both efficiency and learning dynamics.
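To make the parallel design concrete, here is an illustrative PyTorch-style sketch (not TII's implementation) of a hybrid block in which the attention and SSM branches operate on the same normalized input and their outputs are concatenated before a shared output projection. The SSM and attention modules are placeholders you would supply, and the channel widths can follow the reported 2:1:5 SSM:attention:MLP ratio.

import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    """Sketch of a parallel attention + SSM mixer followed by an MLP (SA_M-style ordering)."""
    def __init__(self, d_model, d_ssm, d_attn, d_mlp, ssm_layer, attn_layer):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ssm = ssm_layer      # placeholder: Mamba-style SSM mapping d_model -> d_ssm channels
        self.attn = attn_layer    # placeholder: self-attention mapping d_model -> d_attn channels
        self.proj = nn.Linear(d_ssm + d_attn, d_model)  # concatenate both branches, project back
        self.mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.SiLU(), nn.Linear(d_mlp, d_model))

    def forward(self, x):
        h = self.norm1(x)
        mixed = torch.cat([self.ssm(h), self.attn(h)], dim=-1)  # branches run in parallel
        x = x + self.proj(mixed)                                # mixer residual connection
        return x + self.mlp(self.norm2(x))                      # MLP sub-block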

To further refine the model, Falcon-H1 explores:

Channel allocation: Ablations show that increasing attention channels deteriorates performance, whereas balancing SSM and MLP yields robust gains.

Block configuration: The SA_M configuration (semi-parallel with attention and SSM run together, followed by MLP) performs best in training loss and computational efficiency.

RoPE base frequency: An unusually high base frequency of 10^11 in Rotary Positional Embeddings (RoPE) proved optimal, improving generalization during long-context training.

Width-depth trade-off: Experiments show that deeper models outperform wider ones under fixed parameter budgets. Falcon-H1-1.5B-Deep (66 layers) outperforms many 3B and 7B models.

Tokenizer Strategy

Falcon-H1 uses a customized Byte Pair Encoding (BPE) tokenizer suite with vocabulary sizes ranging from 32K to 261K. Key design choices include:

Digit and punctuation splitting: Empirically improves performance in code and multilingual settings.

LATEX token injection: Enhances model accuracy on math benchmarks.

Multilingual support: Covers 18 languages and scales to 100+, using optimized fertility and bytes/token metrics.

Pretraining Corpus and Data Strategy

Falcon-H1 models are trained on up to 18T tokens from a carefully curated 20T token corpus, comprising:

High-quality web data (filtered FineWeb)

Multilingual datasets: Common Crawl, Wikipedia, arXiv, OpenSubtitles, and curated resources for 17 languages

Code corpus: 67 languages, processed via MinHash deduplication, CodeBERT quality filters, and PII scrubbing

Math datasets: MATH, GSM8K, and in-house LaTeX-enhanced crawls

Synthetic data: Rewritten from raw corpora using diverse LLMs, plus textbook-style QA from 30K Wikipedia-based topics

Long-context sequences: Enhanced via Fill-in-the-Middle, reordering, and synthetic reasoning tasks up to 256K tokens

Training Infrastructure and Methodology

Training utilized customized Maximal Update Parametrization (µP), supporting smooth scaling across model sizes. The models employ advanced parallelism strategies:

Mixer Parallelism (MP) and Context Parallelism (CP): Enhance throughput for long-context processing

Quantization: Released in bfloat16 and 4-bit variants to facilitate edge deployments

Evaluation and Performance

Falcon-H1 achieves unprecedented performance per parameter:

Falcon-H1-34B-Instruct surpasses or matches 70B-scale models like Qwen2.5-72B and LLaMA3.3-70B across reasoning, math, instruction-following, and multilingual tasks

Falcon-H1-1.5B-Deep rivals 7B–10B models

Falcon-H1-0.5B delivers 2024-era 7B performance

Benchmarks span MMLU, GSM8K, HumanEval, and long-context tasks. The models demonstrate strong alignment via SFT and Direct Preference Optimization (DPO).

Conclusion

Falcon-H1 sets a new standard for open-weight LLMs by integrating parallel hybrid architectures, flexible tokenization, efficient training dynamics, and robust multilingual capability. Its strategic combination of SSM and attention allows for unmatched performance within practical compute and memory budgets, making it ideal for both research and deployment across diverse environments.

Check out the Paper and Models on Hugging Face.

Meet SmallThinker: A Family of Efficient Large Language Models (LLMs) Natively Trained for Local Deployment

The generative AI landscape is dominated by massive language models, often designed for the vast capacities of cloud data centers. These models, while powerful, make it difficult or impossible for everyday users to deploy advanced AI privately and efficiently on local devices like laptops, smartphones, or embedded systems. Instead of compressing cloud-scale models for the edge—often resulting in substantial performance compromises—the team behind SmallThinker asked a more fundamental question: What if a language model were architected from the start for local constraints?

This was the genesis of SmallThinker, a family of Mixture-of-Experts (MoE) models developed by researchers at Shanghai Jiao Tong University and Zenergize AI that targets high-performance on-device inference under tight memory and compute constraints. With two main variants—SmallThinker-4B-A0.6B and SmallThinker-21B-A3B—the family sets a new benchmark for efficient, accessible AI.

Local Constraints Become Design Principles

Architectural Innovations

Fine-Grained Mixture-of-Experts (MoE): Unlike typical monolithic LLMs, SmallThinker’s backbone features a fine-grained MoE design. Multiple specialized expert networks are trained, but only a small subset is activated for each input token:

SmallThinker-4B-A0.6B: 4 billion parameters in total, with just 600 million in play per token.

SmallThinker-21B-A3B: 21 billion parameters, of which only 3 billion are active at once.

This enables high capacity without the memory and computation penalties of dense models.

ReGLU-Based Feed-Forward Sparsity: Activation sparsity is further enforced using ReGLU. Even within activated experts, over 60% of neurons are idle per inference step, realizing massive compute and memory savings.

NoPE-RoPE Hybrid Attention: For efficient context handling, SmallThinker employs a novel attention pattern: alternating between global no-positional-embedding (NoPE) layers and local RoPE sliding-window layers. This approach supports large context lengths (up to 32K tokens for the 4B model and 16K for the 21B model) while trimming the key/value cache size compared to traditional all-global attention.

Pre-Attention Router and Intelligent Offloading: Critical to on-device use is the decoupling of inference speed from slow storage. SmallThinker’s “pre-attention router” predicts which experts will be needed before each attention step, so their parameters are prefetched from SSD/flash in parallel with computation. The system relies on caching “hot” experts in RAM (using an LRU policy), while less-used specialists remain on fast storage. This design essentially hides I/O lag and maximizes throughput even with minimal system memory.
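The offloading idea can be pictured in a few lines of code. Below is a minimal sketch of an LRU cache for “hot” expert weights with a prefetch hook that a pre-attention router could call; the loader callable and capacity are hypothetical placeholders, not SmallThinker’s actual API.

from collections import OrderedDict

class ExpertCache:
    """LRU cache for expert weights kept in RAM; misses fall back to slow SSD/flash reads."""
    def __init__(self, capacity, loader):
        self.capacity = capacity        # number of experts to keep resident in RAM
        self.loader = loader            # callable: expert_id -> weights (slow path, hypothetical)
        self.cache = OrderedDict()      # expert_id -> weights, ordered by recency of use

    def prefetch(self, predicted_ids):
        """Called by a pre-attention router so expert loads overlap with computation."""
        for expert_id in predicted_ids:
            self.get(expert_id)

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as most recently used
            return self.cache[expert_id]
        weights = self.loader(expert_id)        # slow read from SSD/flash
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict the least recently used expert
        return weights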

Training Regime and Data Procedures

SmallThinker models were trained afresh, not as distillations, on a curriculum that progresses from general knowledge to highly specialized STEM, mathematical, and coding data:

The 4B variant processed 2.5 trillion tokens; the 21B model saw 7.2 trillion.

Data comes from a blend of curated open-source collections, augmented synthetic math and code datasets, and supervised instruction-following corpora.

Methodologies included quality-filtering, MGA-style data synthesis, and persona-driven prompt strategies—particularly to raise performance in formal and reasoning-heavy domains.

Benchmark Results

On Academic Tasks: SmallThinker-21B-A3B, despite activating far fewer parameters than equivalent rivals, stands shoulder to shoulder with or beats them in fields ranging from mathematics (MATH-500, GPQA-Diamond) to code generation (HumanEval) and broad knowledge assessments (MMLU):

Model | MMLU | GPQA | Math-500 | IFEval | LiveBench | HumanEval | Average
SmallThinker-21B-A3B | 84.4 | 55.1 | 82.4 | 85.8 | 60.3 | 89.6 | 76.3
Qwen3-30B-A3B | 85.1 | 44.4 | 84.4 | 84.3 | 58.8 | 90.2 | 74.5
Phi-4-14B | 84.6 | 55.5 | 80.2 | 63.2 | 42.4 | 87.2 | 68.8
Gemma3-12B-it | 78.5 | 34.9 | 82.4 | 74.7 | 44.5 | 82.9 | 66.3

The 4B-A0.6B model also outperforms or matches other models with similar activated parameter counts, particularly excelling in reasoning and code.

On Real Hardware: Where SmallThinker truly shines is on memory-starved devices:

The 4B model works comfortably with as little as 1 GiB RAM, and the 21B model with just 8 GiB, without catastrophic speed drops.

Prefetching and caching mean that even under these limits, inference remains vastly faster and smoother than baseline models simply swapped to disk.

For example, the 21B-A3B variant maintains over 20 tokens/sec on a standard CPU, while Qwen3-30B-A3B nearly crashes under similar memory constraints.

Impact of Sparsity and Specialization

Expert Specialization: Activation logs reveal that 70–80% of experts are sparsely used, while a core few “hotspot” experts light up for specific domains or languages—a property which enables highly predictable and efficient caching.

Neuron-Level Sparsity: Even within active experts, median neuron inactivity rates exceed 60%. Early layers are almost entirely sparse, while deeper layers retain this efficiency, illustrating why SmallThinker manages to do so much with so little compute.

System Limitations and Future Work

While the achievements are substantial, SmallThinker isn’t without caveats:

Training Set Size: Its pretraining corpus, though massive, is still smaller than those behind some frontier cloud models—potentially limiting generalization in rare or obscure domains.

Model Alignment: Only supervised fine-tuning is applied; unlike leading cloud LLMs, no reinforcement learning from human feedback is used, possibly leaving some safety and helpfulness gaps.

Language Coverage: English and Chinese, with STEM, dominate training—other languages may see reduced quality.

The authors anticipate expanding the datasets and introducing RLHF pipelines in future versions.

Conclusion

SmallThinker represents a radical departure from the “shrink cloud models for edge” tradition. By starting from local-first constraints, it delivers high capability, high speed, and low memory use through architectural and systems innovation. This opens the door for private, responsive, and capable AI on nearly any device—democratizing advanced language technology for a much broader swath of users and use cases.

The models—SmallThinker-4B-A0.6B-Instruct and SmallThinker-21B-A3B-Instruct—are freely available for researchers and developers, and stand as compelling proof of what’s possible when model design is driven by deployment realities, not just data-center ambition.

Check out the Paper, SmallThinker-4B-A0.6B-Instruct and SmallThinker-21B-A3B-Instruct here.

Google AI Introduces the Test-Time Diffusion Deep Researcher (TTD-DR): A Human-Inspired Diffusion Framework for Advanced Deep Research Agents

Deep Research (DR) agents have rapidly gained popularity in both research and industry, thanks to recent progress in LLMs. However, most popular public DR agents are not designed with human thinking and writing processes in mind. They often lack structured steps that support human researchers, such as drafting, searching, and using feedback. Current DR agents compile test-time algorithms and various tools without cohesive frameworks, highlighting the critical need for purpose-built frameworks that can match or excel human research capabilities. The absence of human-inspired cognitive processes in current methods creates a gap between how humans do research and how AI agents handle complex research tasks.

Existing works, such as test-time scaling, utilize iterative refinement algorithms, debate mechanisms, tournaments for hypothesis ranking, and self-critique systems to generate research proposals. Multi-agent systems utilize planners, coordinators, researchers, and reporters to produce detailed responses, while some frameworks enable human co-pilot modes for feedback integration. Agent tuning approaches focus on training through multitask learning objectives, component-wise supervised fine-tuning, and reinforcement learning to improve search and browsing capabilities. LLM diffusion models attempt to break autoregressive sampling assumptions by generating complete noisy drafts and iteratively denoising tokens for high-quality outputs.

Researchers at Google introduced Test-Time Diffusion Deep Researcher (TTD-DR), inspired by the iterative nature of human research through repeated cycles of searching, thinking, and refining. It conceptualizes research report generation as a diffusion process, starting with a draft that serves as an updated outline and evolving foundation to guide research direction. The draft undergoes iterative refinement through a “denoising” process, dynamically informed by a retrieval mechanism that incorporates external information at each step. This draft-centric design makes report writing more timely and coherent while reducing information loss during iterative search processes. TTD-DR achieves state-of-the-art results on benchmarks that require intensive search and multi-hop reasoning.

The TTD-DR framework addresses limitations of existing DR agents that employ linear or parallelized processes. The proposed backbone DR agent contains three major stages: Research Plan Generation, Iterative Search and Synthesis, and Final Report Generation, each containing unit LLM agents, workflows, and agent states. The agent utilizes self-evolving algorithms to enhance the performance of each stage, helping it to find and preserve high-quality context. The proposed algorithm, inspired by recent self-evolution work, is implemented in a parallel workflow along with sequential and loop workflows. This algorithm can be applied to all three stages of agents to improve overall output quality.

In side-by-side comparisons with OpenAI Deep Research, TTD-DR achieves 69.1% and 74.5% win rates for long-form research report generation tasks, while outperforming by 4.8%, 7.7%, and 1.7% on three research datasets with short-form ground-truth answers. It shows strong performance in Helpfulness and Comprehensiveness auto-rater scores, especially on LongForm Research datasets. Moreover, the self-evolution algorithm achieves 60.9% and 59.8% win rates against OpenAI Deep Research on LongForm Research and DeepConsult. The correctness score shows an enhancement of 1.5% and 2.8% on HLE datasets, though the performance on GAIA remains 4.4% below OpenAI DR. The incorporation of Diffusion with Retrieval leads to substantial gains over OpenAI Deep Research across all benchmarks.

In conclusion, Google presents TTD-DR, a method that addresses fundamental limitations through human-inspired cognitive design. The framework conceptualizes research report generation as a diffusion process, utilizing an updatable draft skeleton that guides research direction. TTD-DR, enhanced by self-evolutionary algorithms applied to each workflow component, ensures high-quality context generation throughout the research process. Moreover, evaluations demonstrate TTD-DR’s state-of-the-art performance across various benchmarks that require intensive search and multi-hop reasoning, with superior results in both comprehensive long-form research reports and concise multi-hop reasoning tasks.

Check out the Paper here.

Introducing Amazon Bedrock AgentCore Browser Tool

At AWS Summit New York City 2025, Amazon Web Services (AWS) announced the preview of Amazon Bedrock AgentCore browser tool, a fully managed, pre-built cloud-based browser. This tool enables generative AI agents to interact seamlessly with websites. It addresses two fundamental limitations: first, foundation models (FMs) are trained on large but static datasets and need dynamic access to current information when API access isn’t readily available; second, organizations face significant challenges when attempting to scale web automation with AI for enterprise use cases.
The development of agentic AI systems is moving toward applications that can execute complex, multistep tasks. For these agents to be effective, they require access to dynamic, real-time data, particularly from websites and web applications that don’t offer APIs or where API integration would be complex. Moreover, as businesses seek to deploy AI-powered automation across their operations, they need solutions that can reliably scale without the operational overhead of managing browser farms or solving complex concurrency issues. The AgentCore Browser Tool provides these capabilities, allowing agents to perform tasks such as automating research, streamlining operations, and interacting with web-based applications—all with the scalability, reliability, and security of the AWS Cloud infrastructure. By providing a fully managed cloud-based browser, AWS addresses the critical need for enterprises to deploy AI automation at scale across thousands of concurrent sessions, supporting use cases from customer service automation to large-scale data collection and analysis, without the traditional complexity and resource constraints of self-managed browser automation frameworks.
In this post, we introduce the newly announced Amazon Bedrock AgentCore Browser Tool. We explore why organizations need cloud-based browser automation and the limitations it addresses for FMs that require real-time data access. We talk about key use cases and the core capabilities of the AgentCore Browser Tool. We walk through how to get started with the tool.
Why do you need the cloud-based AgentCore Browser Tool?
Traditional browser automation approaches have typically required significant infrastructure management, security considerations, and development expertise. The introduction of a fully managed, cloud-based browser automation solution addresses several critical needs, including simplified infrastructure management, enterprise-grade security, global availability and scaling, and cost optimization. Organizations no longer need to provision, maintain, and scale browser instances to support their automation needs. AWS now handles the complex infrastructure requirements, so developers can focus on building intelligent agent capabilities rather than managing browser farms. Cloud-based browser automation provides isolated execution environments with AWS security controls, reducing the risk of data exfiltration or unauthorized access that might occur in less controlled environments. With a cloud-based browser, you can instantaneously deploy browser instances across the global infrastructure of AWS so that browser automation can scale. By offering browser automation as a managed service, organizations can use a consumption-based pricing model instead of maintaining always-on infrastructure, which can substantially reduce costs for intermittent workloads.
Use cases for cloud-based browser automation
Handling repetitive web tasks: With the introduction of Amazon Bedrock AgentCore Browser Tool, organizations can now implement sophisticated browser automation at scale. Cloud-based browser automation excels at minimizing manual execution of repetitive tasks across web interfaces. AI agents can populate complex web forms across multiple systems, validate entries, and maintain compliance with business rules. Agents can navigate to internal dashboards, extract critical metrics, and compile reports without human intervention. For organizations managing large user-generated content domains, agents can assist human moderators by prescreening content across multiple web interfaces.
AI powered research and intelligence gathering: With cloud-based browser automation, AI agents become powerful research assistants. They automatically track related websites for pricing changes, new product launches, or content updates with regular monitoring. You can use AI agents to gather and analyze consumer sentiment across various web forums, review sites, and social domains to inform product development. With the AgentCore Browser Tool, you can create automated systems that regularly scan trusted information sources to keep internal knowledge bases current.
Complex workflow automation across systems: Many organizations operate across numerous web applications that lack integrated workflows. Use the AgentCore Browser Tool to automate customer setup across multiple software-as-a-service (SaaS) systems when APIs are unavailable. This helps maintain consistency and reduces error rates. You can monitor supplier portals, inventory systems, and logistics services to maintain visibility across complex supply chains. By automating account creation and permission settings across numerous internal web applications, employee onboarding becomes streamlined.
Testing and quality assurance: Cloud-based browser automation enables robust testing at scale. You can use AgentCore Browser Tool to validate user experiences and functionality across different scenarios, devices, and browsers in parallel. Deploy agents to continuously interact with critical business applications and set up alerts to your teams about performance issues before customers encounter them. With AgentCore Browser Tool, you can regularly test web applications for accessibility compliance, security vulnerabilities, or regulatory requirements.
Legacy system integration: Many organizations maintain legacy systems that lack modern APIs. Enable modern AI capabilities to interact with legacy web applications that would be costly to replace or modernize. Apply intelligent automation to systems that were never designed for programmatic access. As a result, you can extract valuable organizational data trapped in older web applications through regular, automated harvesting.
Core capabilities
The Amazon Bedrock AgentCore Browser Tool empowers AI agents to interact with web content the same way humans do, through a fully managed remote browser infrastructure that minimizes traditional complexity while delivering enterprise-grade security and scalability.
Web interaction capabilities

Complete navigation control across websites and multipage workflows
Interaction with JavaScript-heavy applications and dynamic content
Form manipulation, including text fields, dropdown menus, and file uploads
Humanlike interaction patterns such as scrolling, hovering, and clicking

Serverless browser infrastructure

Zero-management browser fleet with automatic patching
Seamless scaling from single session to thousands based on demand
Global deployment options with usage-based pricing
Optimized performance without infrastructure overhead

Visual understanding

Full-page screenshots enabling AI comprehension of layout and content
Visual element identification by appearance and position
Content extraction from graphical elements
Resolution and device emulation capabilities

Human-in-the-loop integration

Real-time interactive viewing and control for human operators
Session recording for review, training, and compliance

Enterprise-grade security

Complete session isolation for each browser instance
AWS Identity and Access Management (IAM) controls for access management
Ephemeral browser sessions that reset after each use

Complex web application support

Full compatibility with modern JavaScript frameworks
Authentication handling and session persistence
Processing of asynchronous content and real-time updates
Intelligent interaction with complex UI patterns

Audit and compliance

Detailed interaction logging and session recording
Integration with AWS CloudTrail for comprehensive tracking

Observability

Performance metrics on latency and resource usage
Integration with Amazon CloudWatch for unified monitoring
Session record and replay for observability

This comprehensive set of capabilities bridges the fundamental gap between AI agents and the human web, enabling organizations to build intelligent agents that can understand and interact with content designed for humans rather than being limited to API-based integrations.
How an AI agent can use AgentCore Browser Tool
Amazon Bedrock AgentCore Browser runs in a secure, isolated containerized environment within AgentCore, insulating web activity from your local system. You can interact with the AgentCore Browser Tool using browser actuation libraries, such as Playwright, or use AI agentic frameworks specialized for browser automation, such as Amazon Nova Act and Browser Use. You can also integrate browser automation as a tool in a multi-agentic workflow.
Amazon Nova Act or Browser Use works with the AgentCore Browser Tool to take natural language instructions from the user and convert them to actuations on the browser by following this workflow:

The user sends a query such as “search for shoes on Amazon”
An agentic framework such as Amazon Nova Act or Browser Use passes the query to the large language model (LLM)
The LLM reasons and generates instructions in a structured output format (for example, JSON encoded)
The agentic framework maps these instructions into browser actuation commands (such as Playwright, Puppeteer, or Selenium)
The browser actuation commands are executed on the AgentCore Browser over a secure WebSocket connection
The response from the browser and a screenshot are sent to the agent to reason further

This process repeats until the original task is complete. The flow is illustrated in the following diagram.

Get started
The Amazon Bedrock AgentCore Browser Tool is available for use today. For a collection of open source examples, visit the amazon-bedrock-agentcore-samples repository on GitHub.
Prerequisites
To use the Amazon Bedrock AgentCore Browser Tool, you need to complete the following prerequisites:

Python 3.10+
Verify that your IAM user or role has the permissions to use AgentCore Browser

Clone the samples repository and install the Amazon Bedrock AgentCore SDK:

git clone https://github.com/awslabs/amazon-bedrock-agentcore-samples.git
pip install bedrock-agentcore

For browser visualization on your local machine, you need the BrowserViewerServer component in the repository you cloned at: 01-tutorials/05-AgentCore-tools/02-Agent-Core-browser-tool/interactive_tools
You can also visualize the browser live on the Amazon Bedrock AgentCore console at https://us-east-1.console.aws.amazon.com/bedrock-agentcore/builtInTools
The following Python code demonstrates how to use the AgentCore Browser Tool directly with the Playwright library and the Amazon Bedrock AgentCore SDK. This example initiates a secure browser session, connects to it over the Chrome DevTools Protocol (CDP), navigates to https://www.amazon.com, and prints the page title.

To get started with Playwright:

cd 01-tutorials/05-AgentCore-tools/02-Agent-Core-browser-tool

Install dependencies:

pip install playwright

Author your playwright-based script:

from playwright.sync_api import sync_playwright, Playwright, BrowserType
from bedrock_agentcore.tools.browser_client import browser_session
from browser_viewer import BrowserViewerServer
import time
from rich.console import Console

console = Console()

def run(playwright: Playwright):
    # Create the browser session and keep it alive
    with browser_session('us-west-2') as client:
        ws_url, headers = client.generate_ws_headers()

        # Start viewer server
        viewer = BrowserViewerServer(client, port=8005)
        viewer_url = viewer.start(open_browser=True)

        # Connect using headers
        chromium: BrowserType = playwright.chromium
        browser = chromium.connect_over_cdp(
            ws_url,
            headers=headers
        )
        context = browser.contexts[0]
        page = context.pages[0]
        try:
            page.goto("https://amazon.com/")
            console.print(page.title())
            # Keep running
            while True:
                time.sleep(120)
        except KeyboardInterrupt:
            console.print("\n\n[yellow]Shutting down...[/yellow]")
            if 'client' in locals():
                client.stop()
            console.print("✅ Browser session terminated")
        except Exception as e:
            console.print(f"\n[red]Error: {e}[/red]")
            import traceback
            traceback.print_exc()

with sync_playwright() as playwright:
    run(playwright)

Alternatively, you can build a browser agent using Amazon Nova Act to automate web interactions:

Sign up for Nova Act at https://nova.amazon.com/act and generate an API key.
In the same Python virtual environment:

pip install nova-act

Author your Nova Act based script:

import time
from bedrock_agentcore.tools.browser_client import browser_session
from nova_act import NovaAct
from rich.console import Console
from browser_viewer import BrowserViewerServer

NOVA_ACT_API_KEY = "YOUR_NOVA_ACT_API_KEY"
console = Console()

def main():
    try:
        # Step 1: Create browser session
        with browser_session('us-west-2') as client:
            print("\r ✅ Browser ready! ")
            ws_url, headers = client.generate_ws_headers()

            # Step 2: Start viewer server
            console.print("\n[cyan]Step 2: Starting viewer server...[/cyan]")
            viewer = BrowserViewerServer(client, port=8005)
            viewer_url = viewer.start(open_browser=True)

            # Step 3: Use Nova Act to interact with the browser
            with NovaAct(
                cdp_endpoint_url=ws_url,
                cdp_headers=headers,
                preview={"playwright_actuation": True},
                nova_act_api_key=NOVA_ACT_API_KEY,
                starting_page="https://www.amazon.com",
            ) as nova_act:
                result = nova_act.act("Search for coffee maker and get the details of the lowest priced one on the first page")
                console.print(f"\n[bold green]Nova Act Result:[/bold green] {result}")

            # Keep running
            while True:
                time.sleep(1)

    except KeyboardInterrupt:
        console.print("\n\n[yellow]Shutting down...[/yellow]")
        if 'client' in locals():
            client.stop()
        print("✅ Browser session terminated")
    except Exception as e:
        print(f"\n[red]Error: {e}[/red]")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    main()

Alternatively, you can run the tutorial notebooks in the Amazon Bedrock AgentCore GitHub repository.
Pricing and availability
Amazon Bedrock AgentCore offers flexible, consumption-based pricing with no upfront commitments or minimum fees. AgentCore Browser can be used independently of the other services. You can try AgentCore services at no additional charge until September 16, 2025. After this date, the AgentCore Browser Tool will be charged based on consumption. Billing is calculated per second, using the highest watermark of CPU and memory usage for that second, with a 1-second minimum. A 128 MB minimum memory charge applies. Network data transfer through customer elastic network interfaces is billed at standard Amazon Elastic Compute Cloud (Amazon EC2) rates.
For more information about pricing, visit Amazon Bedrock AgentCore (Preview) Pricing.
Conclusion
Amazon Bedrock AgentCore Browser Tool marks a transformative advancement in AI-powered web automation, offering organizations a fully managed, cloud-based browser solution. AgentCore Browser Tool addresses critical limitations faced by generative AI systems requiring real-time data access, enabling AI agents to interact naturally with websites through capabilities such as complete navigation control, visual understanding, and seamless integration with frameworks such as Playwright and Amazon Nova Act. By using this tool, businesses can now implement sophisticated automation at scale across various use cases—from streamlining repetitive web tasks and conducting AI-enhanced research to automating complex workflows and integrating with legacy systems—all while benefiting from the reliable cloud infrastructure of AWS that adapts to organizational needs without the operational overhead of managing browser farms.
Resources
To learn more and start building, visit the following resources:

Amazon Bedrock AgentCore Developer Guide
Amazon Bedrock AgentCore console

About the authors
Veda Raman is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon SageMaker.
Rahul Sharma is a Senior Specialist Solutions Architect at AWS, helping AWS customers build and deploy scalable agentic AI solutions. Prior to joining AWS, Rahul spent more than a decade in technical consulting, engineering, and architecture, helping companies build digital products powered by data and machine learning. In his free time, Rahul enjoys exploring cuisines, traveling, reading books (biographies and humor), and binging on investigative documentaries, in no particular order.
Kishor Aher is a Principal Product Manager at AWS, leading the Agentic AI team responsible for developing first-party tools such as Browser Tool, and Code Interpreter. As a founding member of Amazon Bedrock, he spearheaded the vision and successful launch of the service, driving key features including Converse API, Managed Model Customization, and Model Evaluation capabilities. Kishor regularly shares his expertise through speaking engagements at AWS events, including re:Invent and AWS Summits. Outside of work, he pursues his passion for aviation as a general aviation pilot and enjoys playing volleyball.

Introducing the Amazon Bedrock AgentCore Code Interpreter

AI agents have reached a critical inflection point where their ability to generate sophisticated code exceeds the capacity to execute it safely in production environments. Organizations deploying agentic AI face a fundamental dilemma: although large language models (LLMs) can produce complex code scripts, mathematical analyses, and data visualizations, executing this AI-generated code introduces significant security vulnerabilities and operational complexity.
In this post, we introduce the Amazon Bedrock AgentCore Code Interpreter, a fully managed service that enables AI agents to securely execute code in isolated sandbox environments. We discuss how the AgentCore Code Interpreter helps solve challenges around security, scalability, and infrastructure management when deploying AI agents that need computational capabilities. We walk through the service’s key features, demonstrate how it works with practical examples, and show you how to get started with building your own agents using popular frameworks like Strands, LangChain, and LangGraph.
Security and scalability challenges with AI-generated code
Consider an example where an AI agent needs to perform analysis on multi-year sales projection data for a product to understand anomalies, trends, and seasonality. The analysis should be grounded in logic, repeatable, secure in how it handles data, and scalable across large datasets and multiple iterations if needed. Although LLMs excel at understanding and explaining concepts, they lack the ability to directly manipulate data or perform consistent mathematical operations at scale. LLMs alone are often inadequate for complex data analysis tasks like these, due to their inherent limitations in processing large datasets, performing precise calculations, and generating visualizations. This is where code interpretation and execution tools become essential, providing the capability to execute precise calculations, handle large datasets efficiently, and create reproducible analyses through programming languages and specialized libraries. Furthermore, implementing code interpretation capabilities comes with significant considerations. Organizations must maintain secure sandbox environments to help prevent malicious code execution, manage resource allocation, and maintain data privacy. The infrastructure requires regular updates, robust monitoring, and careful scaling strategies to handle increasing demand.
Traditional approaches to code execution in AI systems suffer from several limitations:

Security vulnerabilities – Executing untrusted AI-generated code in production environments exposes organizations to code injection threats, unauthorized system access, and potential data breaches. Without proper sandboxing, malicious or poorly constructed code can compromise entire infrastructure stacks.
Infrastructure overhead – Building secure execution environments requires extensive DevOps expertise, including container orchestration, network isolation, resource monitoring, and security hardening. Many organizations lack the specialized knowledge to implement these systems correctly.
Scalability bottlenecks – Traditional code execution environments struggle with the dynamic, unpredictable workloads generated by AI agents. Peak demand can overwhelm static infrastructure, and idle periods waste computational resources.
Integration complexity – Connecting secure code execution capabilities with existing AI frameworks often requires custom development, creating maintenance overhead and limiting adoption across development teams.
Compliance challenges – Enterprise environments demand comprehensive audit trails, access controls, and compliance certifications that are difficult to implement and maintain in custom solutions.

These barriers have prevented organizations from fully using the computational capabilities of AI agents, limiting their applications to simple, deterministic tasks rather than the complex, code-dependent workflows that could maximize business value.
Introducing the Amazon Bedrock AgentCore Code Interpreter
With the AgentCore Code Interpreter, AI agents can write and execute code securely in sandbox environments, enhancing their accuracy and expanding their ability to solve complex end-to-end tasks. This purpose-built service minimizes the security, scalability, and integration challenges that have hindered AI agent deployment by providing a fully managed, enterprise-grade code execution system specifically designed for agentic AI workloads. The AgentCore Code Interpreter is built from the ground up for AI-generated code, with built-in safeguards, dynamic resource allocation, and seamless integration with popular AI frameworks. It also offers advanced configuration support, so developers can build powerful agents for complex workflows and data analysis while meeting enterprise security requirements.
Transforming AI agent capabilities
The AgentCore Code Interpreter powers advanced use cases by addressing several critical enterprise requirements:

Enhanced security posture – Configurable network access options range from fully isolated environments, which provide enhanced security by helping prevent AI-generated code from accessing external systems, to controlled network connectivity that provides flexibility for specific development needs and use cases.
Zero infrastructure management – The fully managed service minimizes the need for specialized DevOps resources, reducing time-to-market from months to days while maintaining enterprise-grade reliability and security.
Dynamic scalability – Automatic resource allocation handles varying AI agent workloads without manual intervention, providing low-latency session start-up times during peak demand while optimizing costs during idle periods.
Framework agnostic integration – It integrates with Amazon Bedrock AgentCore Runtime, with native support for popular AI frameworks including Strands, LangChain, LangGraph, and CrewAI, so teams can use existing investments while maintaining development velocity.
Enterprise compliance – Built-in access controls and comprehensive audit trails facilitate regulatory compliance without additional development overhead.

Purpose-built for AI agent code execution
The AgentCore Code Interpreter represents a shift in how AI agents interact with computational resources. The service processes agent-generated code, runs it in a secure environment, and returns the execution results, including output, errors, and generated visualizations. It operates as a secure, isolated execution environment where AI agents can run code (Python, JavaScript, and TypeScript), perform complex data analysis, generate visualizations, and execute mathematical computations without compromising system security. Each execution occurs within a dedicated sandbox environment that provides complete isolation from other workloads and the broader AWS infrastructure. What distinguishes the AgentCore Code Interpreter from traditional execution environments is its optimization for AI-generated workloads. The service handles the unpredictable nature of AI-generated code through intelligent resource management, automatic error handling, and built-in security safeguards specifically designed for untrusted code execution.
Key features and capabilities of AgentCore Code Interpreter include:

Secure sandbox architecture:

Low-latency session start-up time and compute-based session isolation facilitating complete workload separation
Configurable network access policies supporting both isolated sandbox and controlled public network modes
Implements resource constraints by setting maximum limits on memory and CPU usage per session, helping to prevent excessive consumption (see AgentCore Code Interpreter Service Quotas)

Advanced session management:

Persistent session state allowing multi-step code execution workflows
Session-based file storage for complex data processing pipelines
Automatic session and resource cleanup
Support for long-running computational tasks with configurable timeouts

Comprehensive Python runtime environment:

Pre-installed data science libraries, including pandas, numpy, matplotlib, scikit-learn, and scipy
Support for popular visualization libraries, including seaborn and bokeh
Mathematical computing capabilities with sympy and statsmodels
Custom package installation within sandbox boundaries for specialized requirements

File operations and data management:

Upload data files, process them with code, and retrieve the results
Secure file transfer mechanisms with automatic encryption
Support for upload and download of files directly within the sandbox from Amazon Simple Storage Service (Amazon S3)
Support for multiple file formats, including CSV, JSON, Excel, and images
Temporary storage with automatic cleanup for enhanced security
Support for running AWS Command Line Interface (AWS CLI) commands directly within the sandbox, using the Amazon Bedrock AgentCore SDK and API

Enterprise integration features:

AWS Identity and Access Management (IAM) based access controls with fine-grained permission management
AWS CloudTrail integration providing audit trails for compliance
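To make a couple of these capabilities concrete, here is a minimal sketch that runs a pandas snippet inside a sandbox session using the code_session helper from the bedrock-agentcore SDK. The method and parameter names ("executeCode", "code", "language") and the Region are assumptions based on the API reference and the example later in this post; treat it as an illustration rather than the full solution.

import json
from bedrock_agentcore.tools.code_interpreter_client import code_session

snippet = """
import pandas as pd
df = pd.DataFrame({"region": ["NA", "EU", "NA"], "sales": [120, 80, 50]})
print(df.groupby("region")["sales"].sum())
"""

# Start a sandbox session, execute the snippet, and stream back the result
with code_session("us-west-2") as client:
    response = client.invoke("executeCode", {"code": snippet, "language": "python"})
    for event in response["stream"]:
        print(json.dumps(event["result"], indent=2, default=str))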

How the AgentCore Code Interpreter works
To understand the functionality of the AgentCore Code Interpreter, let’s examine the orchestrated flow of a typical data analysis request from an AI agent, as illustrated in the following diagram.

The workflow consists of the following key components:

Deployment and invocation – An agent is built and deployed (for instance, on the AgentCore Runtime) using a framework like Strands, LangChain, LangGraph, or CrewAI. When a user sends a prompt (for example, "Analyze this sales data and show me the trend by sales region"), the AgentCore Runtime initiates a secure, isolated session.
Reasoning and tool selection – The agent’s underlying LLM analyzes the prompt and determines that it needs to perform a computation. It then selects the AgentCore Code Interpreter as the appropriate tool.
Secure code execution – The agent generates a code snippet, for instance using the pandas library, to read a data file and matplotlib to create a plot. This code is passed to the AgentCore Code Interpreter, which executes it within its dedicated, sandboxed session. The agent can read from and write files to the session-specific file system.
Observation and iteration – The AgentCore Code Interpreter returns the result of the execution—such as a calculated value, a dataset, an image file of a graph, or an error message—to the agent. This feedback loop allows the agent to engage in iterative problem-solving by debugging its own code and refining its approach.
Context and memory – The agent maintains context for subsequent turns in the conversation, during the duration of the session. Alternatively, the entire interaction can be persisted in Amazon Bedrock AgentCore Memory for long-term storage and retrieval.
Monitoring and observability – Throughout this process, a detailed trace of the agent’s execution, providing visibility into agent behavior, performance metrics, and logs, is available for debugging and auditing purposes.

Practical real-world applications and use cases
The AgentCore Code Interpreter can be applied to real-world business problems that are difficult to solve with LLMs alone.
Use case 1: Automated financial analysis
An agent can be tasked with performing on-demand analysis of financial data. For this example, a user provides a CSV file of billing data within the following prompt and asks for analysis and visualization: "Using the billing data provided below, create a bar graph that shows the total spend by product category… After generating the graph, provide a brief interpretation of the results…" The agent takes the following actions:

The agent receives the prompt and the data file containing the raw data.
It invokes the AgentCore Code Interpreter, generating Python code with the pandas library to parse the data into a DataFrame. The agent then generates another code block to group the data by category and sum the costs, and asks the AgentCore Code Interpreter to execute it.
The agent uses matplotlib to generate a bar chart and the AgentCore Code Interpreter saves it as an image file.
The agent returns both a textual summary of the findings and the generated PNG image of the graph.
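For illustration, the code the agent generates for a request like this would look roughly like the following sketch (the file name and the "category" and "cost" column names are assumptions, not part of the original example):

import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render to a file; no display is available in the sandbox
import matplotlib.pyplot as plt

# Parse the billing data and aggregate spend by category
df = pd.read_csv("billing_data.csv")
totals = df.groupby("category")["cost"].sum().sort_values(ascending=False)

# Plot and save the bar chart as an image the agent can return
ax = totals.plot(kind="bar", title="Total spend by product category")
ax.set_ylabel("Spend")
plt.tight_layout()
plt.savefig("spend_by_category.png")
print(totals)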

Use case 2: Interactive data science assistant
The AgentCore Code Interpreter’s stateful session supports a conversational and iterative workflow for data analysis. For this example, a data scientist uses an agent for exploratory data analysis. The workflow is as follows:

The user provides a prompt: “Load dataset.csv and provide descriptive statistics.”
The agent generates and executes pandas.read_csv('dataset.csv') followed by .describe() and returns the statistics table.
The user prompts, “Plot a scatter plot of column A versus column B.”
The agent, using the dataset already loaded in its session, generates code with matplotlib.pyplot.scatter() and returns the plot.
The user prompts, “Run a simple linear regression and provide the R^2 value.”
The agent generates code using the scikit-learn library to fit a model and calculate the R^2 metric.

This demonstrates iterative code execution capabilities, which allow agents to work through complex data science problems in a turn-by-turn manner with the user.
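Under the hood, this iterative pattern relies on the persistent session state described earlier: successive code executions in the same session share variables. The following sketch assumes dataset.csv has already been uploaded to the session, and the method and parameter names are illustrative, based on the API reference:

from bedrock_agentcore.tools.code_interpreter_client import code_session

with code_session("us-west-2") as client:
    # Turn 1: load the dataset and describe it
    client.invoke("executeCode", {
        "code": "import pandas as pd\ndf = pd.read_csv('dataset.csv')\nprint(df.describe())",
        "language": "python",
    })
    # Turn 2: reuse the df variable kept in the session's persistent state
    response = client.invoke("executeCode", {
        "code": "print(df[['A', 'B']].corr())",
        "language": "python",
    })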
Solution overview
To get started with the AgentCore Code Interpreter, clone the GitHub repo:

git clone https://github.com/awslabs/amazon-bedrock-agentcore-samples.git

In the following sections, we show how to create a question answering agent that validates answers through code and reasoning. We build it using the Strands SDK, but you can use a framework of your choice.
Prerequisites
Make sure you have the following prerequisites:

An AWS account with AgentCore Code Interpreter access
The necessary IAM permissions to create and manage AgentCore Code Interpreter resources and invoke models on Amazon Bedrock
The required Python packages installed (including boto3, bedrock-agentcore, and strands)
Access to Anthropic’s Claude 4 Sonnet model in the us-west-2 AWS Region (Anthropic’s Claude 4 is the default model for Strands SDK, but you can override and use your preferred model as described in the Strands SDK documentation)

Configure your IAM role
Your IAM role should have appropriate permissions to use the AgentCore Code Interpreter:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock-agentcore:CreateCodeInterpreter",
                "bedrock-agentcore:StartCodeInterpreterSession",
                "bedrock-agentcore:InvokeCodeInterpreter",
                "bedrock-agentcore:StopCodeInterpreterSession",
                "bedrock-agentcore:DeleteCodeInterpreter",
                "bedrock-agentcore:ListCodeInterpreters",
                "bedrock-agentcore:GetCodeInterpreter"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:/aws/bedrock-agentcore/code-interpreter*"
        }
    ]
}

Set up and configure the AgentCore Code Interpreter
Complete the following setup and configuration steps:

Install the bedrock-agentcore Python SDK:

pip install bedrock-agentcore

Import the AgentCore Code Interpreter and other libraries:

from bedrock_agentcore.tools.code_interpreter_client import code_session
from strands import Agent, tool
import json

Define the system prompt:

SYSTEM_PROMPT = """You are a helpful AI assistant that validates all answers through code execution.

TOOL AVAILABLE:
- execute_python: Run Python code and see output"""

Define the code execution tool for the agent. Within the tool definition, we use the invoke method to execute the Python code generated by the LLM-powered agent. It automatically starts a serverless AgentCore Code Interpreter session if one doesn’t exist.

@tool
def execute_python(code: str, description: str = "") -> str:
    """Execute Python code in the sandbox."""

    if description:
        code = f"# {description}\n{code}"

    print(f"\n Generated Code: {code}")

    # Run the code in a managed Code Interpreter session (started on demand).
    # The "executeCode" method name and its parameters follow the API reference;
    # adjust the Region to match your environment.
    with code_session("us-west-2") as code_client:
        response = code_client.invoke("executeCode", {
            "code": code,
            "language": "python",
            "clearContext": False,
        })
        for event in response["stream"]:
            return json.dumps(event["result"])

Configure the agent:

agent = Agent(
    tools=[execute_python],
    system_prompt=SYSTEM_PROMPT,
    callback_handler=None  # disable the default streaming callback handler
)

Invoke the agent
Test the AgentCore Code Interpreter powered agent with a simple prompt:

query = "Tell me the largest random prime number between 1 and 100, which is less than 84 and more than 9"
try:
    response_text = ""
    async for event in agent.stream_async(query):
        if "data" in event:
            chunk = event["data"]
            response_text += chunk
            print(chunk, end="")
except Exception as e:
    print(f"Error occurred: {str(e)}")

We get the following result:

I’ll find the largest random prime number between 1 and 100 that is less than 84 and more than 9. To do this, I’ll write code to:

1. Generate all prime numbers in the specified range
2. Filter to keep only those > 9 and < 84
3. Find the largest one

Let me implement this:
 Generated Code: import random

def is_prime(n):
    """Check if a number is prime"""
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

# Find all primes in the range
primes_in_range = [n for n in range(10, 84) if is_prime(n)]

print("All prime numbers between 10 and 83:")
print(primes_in_range)

# Get the largest prime in the range
largest_prime = max(primes_in_range)
print(f"\nThe largest prime number between 10 and 83 is: {largest_prime}")

# For verification, let's check that it's actually prime
print(f"Verification - is {largest_prime} prime? {is_prime(largest_prime)}")
Based on the code execution, I can tell you that the largest prime number between 1 and 100, which is less than 84 and more than 9, is **83**.

I verified this by:
1. Writing a function to check if a number is prime
2. Generating all prime numbers in the range 10-83
3. Finding the maximum value in that list

The complete list of primes in your specified range is: 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, and 83.

Since 83 is the largest among these primes, it is the answer to your question.

Pricing and availability
Amazon Bedrock AgentCore is available in multiple Regions and uses a consumption-based pricing model with no upfront commitments or minimum fees. Billing for the AgentCore Code Interpreter is calculated per second and is based on the highest watermark of CPU and memory resources consumed during that second, with a 1-second minimum charge.
Conclusion
The AgentCore Code Interpreter transforms the landscape of AI agent development by solving the critical challenge of secure, scalable code execution in production environments. This purpose-built service minimizes the complex infrastructure requirements, security vulnerabilities, and operational overhead that have historically prevented organizations from deploying sophisticated AI agents capable of complex computational tasks. The service’s architecture—featuring isolated sandbox environments, enterprise-grade security controls, and seamless framework integration—helps development teams focus on agent logic and business value rather than infrastructure complexity.
To learn more, refer to the following resources:

Introducing Amazon Bedrock AgentCore: Securely deploy and operate AI agents at any scale (preview)
Execute code and analyze data using Amazon Bedrock AgentCore Code Interpreter
Code Interpreter API Reference Examples
Amazon Bedrock AgentCore Code Interpreter GitHub repo

Try it out today or reach out to your AWS account team for a demo!

About the authors
Veda Raman is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon SageMaker.
Rahul Sharma is a Senior Specialist Solutions Architect at AWS, helping AWS customers build and deploy scalable agentic AI solutions. Prior to joining AWS, Rahul spent more than a decade in technical consulting, engineering, and architecture, helping companies build digital products powered by data and machine learning. In his free time, Rahul enjoys exploring cuisines, traveling, reading books (biographies and humor), and binging on investigative documentaries, in no particular order.
Kishor Aher is a Principal Product Manager at AWS, leading the Agentic AI team responsible for developing first-party tools such as the Browser Tool and Code Interpreter. As a founding member of Amazon Bedrock, he spearheaded the vision and successful launch of the service, driving key features including the Converse API, Managed Model Customization, and Model Evaluation capabilities. Kishor regularly shares his expertise through speaking engagements at AWS events, including re:Invent and AWS Summits. Outside of work, he pursues his passion for aviation as a general aviation pilot and enjoys playing volleyball.

Observing and evaluating AI agentic workflows with Strands Agents SDK …

This post is co-written with Rich Young from Arize AI.
Agentic AI applications built on agentic workflows differ from traditional workloads in one important way: they're nondeterministic. That is, they can produce different results from the same input, because the large language models (LLMs) they're based on use probabilities when generating each token. This inherent unpredictability leads AI application designers to ask whether an agent chose the correct plan of action, followed the optimal path, and called the right set of tools with the right parameters. Organizations that want to deploy such agentic workloads need an observability system that can verify the results they produce are correct and can be trusted.
In this post, we present how the Arize AX service can trace and evaluate AI agent tasks initiated through Strands Agents, helping validate the correctness and trustworthiness of agentic workflows.
Challenges with generative AI applications
The path from a promising AI demo to a reliable production system is fraught with challenges that many organizations underestimate. Based on industry research and real-world deployments, teams face several critical hurdles:

Unpredictable behavior at scale – Agents that perform well in testing might fail with unexpected inputs in production, such as new language variations or domain-specific jargon that cause irrelevant or misunderstood responses.
Hidden failure modes – Agents can produce plausible but wrong outputs or skip steps unnoticed, such as miscalculating financial metrics in a way that seems correct but misleads decision-making.
Nondeterministic paths – Agents might choose inefficient or incorrect decision paths, such as taking 10 steps to route a query that should take only 5, leading to poor user experiences.
Tool integration complexity – Agents can break when calling APIs incorrectly, for example, passing the wrong order ID format so that a refund silently fails despite a successful inventory update.
Cost and performance variability – Loops or verbose outputs can cause runaway token costs and latency spikes, such as an agent making more than 20 LLM calls and delaying a response from 3 to 45 seconds.

These challenges mean that traditional testing and monitoring approaches are insufficient for AI systems. Success requires a more thoughtful approach that incorporates a more comprehensive strategy.
Arize AX delivers a comprehensive observability, evaluation, and experimentation framework
Arize AX is an enterprise-grade AI engineering service that helps teams monitor, evaluate, and debug AI applications across the development-to-production lifecycle. Built on Arize's Phoenix foundation, AX adds enterprise essentials such as the "Alyx" AI assistant, online evaluations, automatic prompt optimization, role-based access control (RBAC), and enterprise scale and support. AX offers organizations a comprehensive solution that caters to both technical and nontechnical personas, so they can manage and improve AI agents from development through production at scale. Arize AX capabilities include:

Tracing – Full visibility into LLM operations using OpenTelemetry to capture model calls, retrieval steps, and metadata such as tokens and latency for detailed analysis.
Evaluation – Automated quality monitoring with LLM-as-a-judge evaluations on production samples, supporting custom evaluators and clear success metrics.
Datasets – Maintain versioned, representative datasets for edge cases, regression tests, and A/B testing, refreshed with real production examples.
Experiments – Run controlled tests to measure the impact of changes to prompts or models, validating improvements with statistical rigor.
Playground – Interactive environment to replay traces, test prompt variations, and compare model responses for effective debugging and optimization.
Prompt management – Version, test, and deploy prompts like code, with performance tracking and gradual rollouts to catch regressions early.
Monitoring and alerting – Real-time dashboards and alerts for latency, errors, token usage, and drift, with escalation for critical issues.
Agent visualization – Analyze and optimize agent decision paths to reduce loops and inefficiencies, refining planning strategies.

These components form a comprehensive observability strategy that treats LLM applications as mission-critical production systems requiring continuous monitoring, evaluation, and improvement.
Arize AX and Strands Agents: A powerful combination
Strands Agents is an open source SDK, a powerful low-code framework for building and running AI agents with minimal overhead. Designed to simplify the development of sophisticated agent workflows, Strands unifies prompts, tools, LLM interactions, and integration protocols into a single streamlined experience. It supports both Amazon Bedrock hosted and external models, with built-in capabilities for Retrieval Augmented Generation (RAG), Model Context Protocol (MCP), and Agent2Agent (A2A) communication. In this section, we walk through building an agent with Strands Agent SDK, instrumenting it with Arize AX for trace-based evaluation, and optimizing its behavior.
The following workflow shows how a Strands agent handles a user task end-to-end—invoking tools, retrieving context, and generating a response—while sending traces to Arize AX for evaluation and optimization.

The solution follows these high-level steps:

Install and configure the dependencies
Instrument the agent for observability
Build the agent with Strands SDK
Test the agent and generate traces
Analyze traces in Arize AI
Evaluate the agent’s behavior
Optimize the agent
Continually monitor the agent

Prerequisites
You’ll need:

An AWS account with access to Amazon Bedrock
An Arize account with your Space ID and API Key (sign up at no additional cost at arize.com).

Install the dependencies: pip install strands opentelemetry-sdk arize-otel
Solution walkthrough: Using Arize AX with Strands Agents
The integration between Strands Agent SDK and Arize AI’s observability system provides deep, structured visibility into the behavior and decisions of AI agents. This setup enables end-to-end tracing of agent workflows—from user input through planning, tool invocation, and final output.
Full implementation details are available in the accompanying notebook and resources in the Openinference-Arize repository in GitHub.
Install and configure the dependencies
To install and configure the dependencies, use the following code:

from opentelemetry import trace
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from strands_to_openinference_mapping import StrandsToOpenInferenceProcessor
from arize.otel import register
import grpc

Instrument the agent for observability
To instrument the agent for observability, use the following code. The StrandsToOpenInferenceProcessor converts native spans to OpenInference format, and trace_attributes add session and user context for richer trace filtering.

Use Arize’s OpenTelemetry integration to enable tracing:

register(
    space_id="your-arize-space-id",
    api_key="your-arize-api-key",
    project_name="strands-project",
    processor=StrandsToOpenInferenceProcessor()
)
agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=[
        retrieve, current_time, get_booking_details,
        create_booking, delete_booking
    ],
    trace_attributes={
        "session.id": "abc-1234",
        "user.id": "user-email@example.com",
        "arize.tags": [
            "Agent-SDK",
            "Arize-Project",
            "OpenInference-Integration"
        ]
    }
)

Build the agent with Strands SDK
Create the Restaurant Assistant agent using Strands. This agent will help customers with restaurant information and reservations using several tools:

retrieve – Searches the knowledge base for restaurant information
current_time – Gets the current time for reservation scheduling
create_booking – Creates a new restaurant reservation
get_booking_details – Retrieves details of an existing reservation
delete_booking – Cancels an existing reservation

The agent uses Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock for natural language understanding and generation. Import the required tools and define the agent:

import os  # needed for os.environ below
import get_booking_details, delete_booking, create_booking
from strands_tools import retrieve, current_time
from strands import Agent, tool
from strands.models.bedrock import BedrockModel
import boto3
system_prompt = """You are "Restaurant Helper", a restaurant assistant helping customers reserving tables in different restaurants. You can talk about the menus, create new bookings, get the details of an existing booking or delete an existing reservation. You reply always politely and mention your name in the reply (Restaurant Helper)..........."""
model = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
)
kb_name = 'restaurant-assistant'
smm_client = boto3.client('ssm')
kb_id = smm_client.get_parameter(
    Name=f'{kb_name}-kb-id',
    WithDecryption=False
)
os.environ["KNOWLEDGE_BASE_ID"] = kb_id["Parameter"]["Value"]
agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=[
        retrieve, current_time, get_booking_details,
        create_booking, delete_booking
    ],
    trace_attributes={
        "session.id": "abc-1234",
        "user.id": "user-email-example@domain.com",
        "arize.tags": [
            "Agent-SDK",
            "Arize-Project",
            "OpenInference-Integration",
        ]
    }
)

Test the agent and generate traces
Test the agent with a couple of queries to generate traces for Arize. Each interaction will create spans in OpenTelemetry that will be processed by the custom processor and sent to Arize AI. The first test case is a restaurant information query: asking about restaurants in New York will trigger the knowledge base retrieval tool:

# Test with a question about restaurants
results = agent("Hi, where can I eat in New York?")
print(results)

The second test case is for a restaurant reservation. Test the booking functionality by making a reservation. This will trigger the create_booking tool:

# Test with a reservation request
results = agent("Make a reservation for tonight at Rice & Spice. At 8pm, for 2 people in the name of Anna")
print(results)

Analyze traces in Arize AI
After running the agent, you can view and analyze the traces in the Arize AI dashboard, shown in the following screenshot. Trace-level visualization shows the representation of the trace to confirm the path that the agent took during execution. In the Arize dashboard, you can review the traces generated by the agent. By selecting the strands-project you defined in the notebook, you can view your traces on the LLM Tracing tab. Arize provides powerful filtering capabilities to help you focus on specific traces. You can filter by OTel attributes and metadata, for example, to analyze performance across different models.

You can also use the Alyx AI assistant to analyze your agent's behavior through natural language queries and uncover insights. In the example below, we use Alyx to reason about why a tool was invoked incorrectly by the agent in one of the traces, helping us identify the root cause of the misstep.

Choosing a specific trace gives detailed information about the agent’s runtime performance and decision-making process, as shown in the following screenshot.

The graph view, shown in the following screenshot, shows the hierarchical structure of your agent's execution; by selecting nodes in the graph, users can inspect specific execution paths to understand how the agent made decisions.

You can also view session-level insights on the Sessions tab next to LLM Tracing. By tagging spans with session.id and user.id, you can group related interactions, identify where conversations break down, track user frustration, and evaluate multiturn performance across sessions.
Evaluate the agent’s behavior
Arize's system traces the agent's decision-making process, capturing details such as routing decisions, tool calls, and parameters. You can evaluate performance by analyzing these traces to verify that the agent selects optimal paths and provides accurate responses. For example, if the agent misinterprets a customer's request and chooses the wrong tool or uses incorrect parameters, Arize evaluators will identify when these failures occur. Arize has pre-built evaluation templates for every step of your agent process:

Agent Tool Calling

Agent Tool Selection
Agent Parameter Extraction

Agent Path Convergence
Agent Planning
Agent Reflection

Create a new task under Evals and Tasks and choose the LLM as a judge task type. You can use a pre-built prompt template (tool calling is used in the example shown in the following screenshot), or you can ask the Alyx AI assistant to build one for you. Evals will now run automatically on your traces as they flow into Arize, using AI to label your data and identify failures at scale without human intervention.

Now every time the agent is invoked, trace data is collected in Arize and the tool calling evaluation automatically runs and labels the data with a correct or incorrect label along with an explanation by the LLM-as-a-judge for its labeling decision. Here is an example of an evaluation label and explanation.

Optimize the agent
The LLM-as-a-judge evaluations automatically identify and label failure cases where the agent didn't call the right tool. In the screenshot below, these failure cases are automatically captured and added to a regression dataset, which drives agent improvement workflows. This production data can now fuel development cycles for improving the agent.

Now, you can connect directly with Arize’s prompt playground, an integrated development environment (IDE) where you can experiment with various prompt changes and model choices, compare side-by-side results and test across the regression dataset from the previous step. When you have an optimal prompt and model combination, you can save this version to the prompt hub for future version tracking and retrieval, as shown in the following screenshot.

Experiments from the prompt testing are automatically saved, with online evaluations run and results saved for immediate analysis and comparison to facilitate data-driven decisions on what enhancements to deploy. Additionally, experiments can be incorporated into continuous integration and continuous delivery (CI/CD) workflows for automated regression testing and validation whenever new prompt or application changes are pushed to systems such as GitHub. The screenshot below shows hallucination metrics for prompt experiments.

Continually monitor the agent
To maintain reliability and performance in production, it's essential to continually monitor your AI agents. Arize AI provides out-of-the-box monitoring capabilities that help teams detect issues early, optimize cost, and provide high-quality user experiences. Setting up monitors in Arize AI offers:

Early issue detection – Identify problems before they impact users
Performance tracking – Monitor trends and maintain consistent agent behavior
Cost management – Track token usage to avoid unnecessary expenses
Quality assurance – Validate your agent is delivering accurate, helpful responses

You can access and configure monitors on the Monitors tab in your Arize project. For details, refer to the Arize documentation on monitoring.
When monitoring your Strands Agent in production, pay close attention to these key metrics:

Latency – Time taken for the agent to respond to user inputs
Token usage – Number of tokens consumed, which directly impacts cost
Error rate – Frequency of failed responses or tool invocations
Tool usage – Effectiveness and frequency of tool calls
User satisfaction signals – Proxy metrics such as tool call correctness, conversation length, or resolution rates

By continually monitoring these metrics, teams can proactively improve agent performance, catch regressions early, and make sure the system scales reliably in real-world use. In Arize, you can create custom metrics directly from OTel trace attributes or metadata, and even from evaluation labels and metrics, such as the tool calling correctness evaluation you created previously. The screenshot below visualizes the tool call correctness ratio across agent traces, helping identify patterns in correct versus incorrect tool usage.

The screenshot below illustrates how Arize provides customizable dashboards that enable deep observability into LLM agent performance, showcasing a custom monitoring dashboard tracking core metrics such as latency, token usage, and the percentage of correct tool calls.

The screenshot below demonstrates prebuilt templates designed to accelerate setup and offer immediate visibility into key agent behaviors.

Clean up
When you’re done experimenting, you can clean up the AWS resources created by this notebook by running the cleanup script: !sh cleanup.sh.
Conclusion
The key lesson is clear: observability, automatic evaluations, experimentation and feedback loops, and proactive alerting aren’t optional for production AI—they’re the difference between innovation and liability. Organizations that invest in proper AI operations infrastructure can harness the transformative power of AI agents while avoiding the pitfalls that have plagued early adopters. The combination of Amazon Strands Agents and Arize AI provides a comprehensive solution that addresses these challenges:

Strands Agents offers a model-driven approach for building and running AI agents
Arize AI adds the critical observability layer with tracing, evaluation, and monitoring capabilities

The partnership between AWS and Arize AI offers a powerful solution for building and deploying generative AI agents. The fully managed framework of Strands Agents simplifies agent development, and Arize's observability tools provide critical insights into agent performance. By addressing challenges such as nondeterminism, verifying correctness, and enabling continual monitoring, this integration helps organizations create reliable and effective AI applications. As businesses increasingly adopt agentic workflows, the combination of Amazon Bedrock and Arize AI sets a new standard for trustworthy AI deployment.
Get started
Now that you’ve learned how to integrate Strands Agents with the Arize Observability Service, you can start exploring different types of agents using the example provided in this sample. As a next step, try expanding this integration to include automated evaluations using Arize’s evaluation framework to score agent performance and decision quality.
Ready to build better agents? Get started with an account at arize.com for no additional cost and begin transforming your AI agents from unpredictable experiments into reliable, production-ready solutions. The tools and knowledge are here; the only question is: what will you build?
About the Authors
Rich Young is the Director of Partner Solutions Architecture at Arize AI, focused on AI agent observability and evaluation tooling. Prior to joining Arize, Rich led technical pre-sales at WhyLabs AI. In his pre-AI life, Rich held leadership and IC roles at enterprise technology companies such as Splunk and Akamai.
Karan Singh is an Agentic AI leader at AWS, where he works with top-tier third-party foundation model and agentic framework providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise agentic AI challenges. Karan holds a BS in Electrical Engineering from Manipal University, an MS in Electrical Engineering from Northwestern University, and an MBA from the Haas School of Business at the University of California, Berkeley.
Nolan Chen is a Partner Solutions Architect at AWS, where he helps startup companies build innovative solutions using the cloud. Prior to AWS, Nolan specialized in data security and helping customers deploy high-performing wide area networks. Nolan holds a bachelor’s degree in mechanical engineering from Princeton University.
Venu Kanamatareddy is an AI/ML Solutions Architect at AWS, supporting AI-driven startups in building and scaling innovative solutions. He provides strategic and technical guidance across the AI lifecycle from model development to MLOps and generative AI. With experience across startups and large enterprises, he brings deep expertise in cloud architecture and AI solutions. Venu holds a degree in computer science and a master’s in artificial intelligence from Liverpool John Moores University.

A Coding Guide to Build an Intelligent Conversational AI Agent with Ag …

In this tutorial, we delve into building an advanced AI agent with agent memory using Cognee and Hugging Face models, relying entirely on free, open-source tools that work seamlessly in Google Colab and other notebook environments. We configure Cognee for memory storage and retrieval, integrate a lightweight conversational model for generating responses, and bring it all together into an intelligent agent that learns, reasons, and interacts naturally. Whether it's processing documents across domains or engaging in dialogue with contextual understanding, we walk through each step to create a capable agent without relying on paid APIs. Check out the Full Codes here. Feel free to check other AI Agent and Agentic AI Codes and Tutorials for various applications.

!pip install cognee transformers torch sentence-transformers accelerate

import asyncio
import os
import json
from typing import List, Dict, Any
from datetime import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

import cognee

We begin by installing all the essential libraries, including Cognee, Transformers, Torch, and Sentence-Transformers, to power our AI agent. We then import the required modules to handle tokenization, model loading, asynchronous tasks, and memory integration. This setup ensures we have everything ready to build, train, and interact with our intelligent agent. Check out the Full Codes here. Feel free to check other AI Agent and Agentic AI Codes and Tutorial for various applications.

async def setup_cognee():
    """Setup Cognee with proper configuration"""
    try:
        await cognee.config.set("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
        await cognee.config.set("EMBEDDING_PROVIDER", "sentence_transformers")
        print(" Cognee configured successfully")
        return True
    except Exception as e:
        print(f" Cognee config error: {e}")
        try:
            os.environ["EMBEDDING_MODEL"] = "sentence-transformers/all-MiniLM-L6-v2"
            os.environ["EMBEDDING_PROVIDER"] = "sentence_transformers"
            print(" Cognee configured via environment")
            return True
        except Exception as e2:
            print(f" Alternative config failed: {e2}")
            return False

We set up Cognee by configuring the embedding model and provider to use all-MiniLM-L6-v2, a lightweight and efficient sentence-transformer. If the primary method fails, we fall back to manually setting environment variables, ensuring Cognee is always ready to process and store embeddings. Check out the Full Codes here. Feel free to check other AI Agent and Agentic AI Codes and Tutorial for various applications.

class HuggingFaceLLM:
    def __init__(self, model_name="microsoft/DialoGPT-medium"):
        print(f" Loading Hugging Face model: {model_name}")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f" Using device: {self.device}")

        if "DialoGPT" in model_name:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
            self.model = AutoModelForCausalLM.from_pretrained(model_name)
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
        else:
            self.generator = pipeline(
                "text-generation",
                model="distilgpt2",
                device=0 if self.device == "cuda" else -1,
                max_length=150,
                do_sample=True,
                temperature=0.7
            )
            self.tokenizer = None
            self.model = None

        print(" Model loaded successfully!")

    def generate_response(self, prompt: str, max_length: int = 100) -> str:
        try:
            if self.model is not None:
                inputs = self.tokenizer.encode(prompt + self.tokenizer.eos_token, return_tensors='pt')

                with torch.no_grad():
                    outputs = self.model.generate(
                        inputs,
                        max_length=inputs.shape[1] + max_length,
                        num_return_sequences=1,
                        temperature=0.7,
                        do_sample=True,
                        pad_token_id=self.tokenizer.eos_token_id
                    )

                response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
                response = response[len(prompt):].strip()
                return response if response else "I understand."

            else:
                result = self.generator(prompt, max_length=max_length, truncation=True)
                return result[0]['generated_text'][len(prompt):].strip()

        except Exception as e:
            print(f" Generation error: {e}")
            return "I'm processing that information."

hf_llm = None

We define the HuggingFaceLLM class to handle text generation using lightweight Hugging Face models, such as DialoGPT or DistilGPT2. We detect whether a GPU is available and load the appropriate tokenizer and model accordingly. This setup enables our agent to generate intelligent and context-aware responses to user queries. Check out the Full Codes here. Feel free to check other AI Agent and Agentic AI Codes and Tutorial for various applications.
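As a quick sanity check of the wrapper defined above, we can instantiate it and generate a single reply; the prompt text here is only an illustration, and the model weights are downloaded on first use.

llm = HuggingFaceLLM("microsoft/DialoGPT-medium")
print(llm.generate_response("Hello, what can you help me with today?", max_length=40))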

class AdvancedAIAgent:
“””
Advanced AI Agent with persistent memory, learning capabilities,
and multi-domain knowledge processing using Cognee
“””

def __init__(self, agent_name: str = “CogneeAgent”):
self.name = agent_name
self.memory_initialized = False
self.knowledge_domains = []
self.conversation_history = []
self.manual_memory = []

async def initialize_memory(self):
“””Initialize the agent’s memory system and HF model”””
global hf_llm
if hf_llm is None:
hf_llm = HuggingFaceLLM(“microsoft/DialoGPT-medium”)

setup_success = await setup_cognee()

try:
await cognee.prune()
print(f” {self.name} memory system initialized”)
self.memory_initialized = True
except Exception as e:
print(f” Memory initialization warning: {e}”)
self.memory_initialized = True

async def learn_from_text(self, text: str, domain: str = “general”):
“””Add knowledge to the agent’s memory with domain tagging”””
if not self.memory_initialized:
await self.initialize_memory()

enhanced_text = f”[DOMAIN: {domain}] [TIMESTAMP: {datetime.now().isoformat()}]n{text}”

try:
await cognee.add(enhanced_text)
await cognee.cognify()
if domain not in self.knowledge_domains:
self.knowledge_domains.append(domain)
print(f” Learned new knowledge in domain: {domain}”)
return True
except Exception as e:
print(f” Learning error: {e}”)
try:
await cognee.add(text)
await cognee.cognify()
if domain not in self.knowledge_domains:
self.knowledge_domains.append(domain)
print(f” Learned (simplified): {domain}”)
return True
except Exception as e2:
print(f” Simplified learning failed: {e2}”)
if not hasattr(self, ‘manual_memory’):
self.manual_memory = []
self.manual_memory.append({“text”: text, “domain”: domain})
if domain not in self.knowledge_domains:
self.knowledge_domains.append(domain)
print(f” Stored in manual memory: {domain}”)
return True

async def learn_from_documents(self, documents: List[Dict[str, str]]):
“””Batch learning from multiple documents”””
print(f” Processing {len(documents)} documents…”)

for i, doc in enumerate(documents):
text = doc.get(“content”, “”)
domain = doc.get(“domain”, “general”)
title = doc.get(“title”, f”Document_{i+1}”)

enhanced_content = f”Title: {title}n{text}”
await self.learn_from_text(enhanced_content, domain)

if i % 3 == 0:
print(f” Processed {i+1}/{len(documents)} documents”)

async def query_knowledge(self, question: str, domain_filter: str = None) -> List[str]:
“””Query the agent’s knowledge base with optional domain filtering”””
try:
if domain_filter:
enhanced_query = f”[DOMAIN: {domain_filter}] {question}”
else:
enhanced_query = question

search_results = await cognee.search(“SIMILARITY”, enhanced_query)

results = []
for result in search_results:
if hasattr(result, ‘text’):
results.append(result.text)
elif hasattr(result, ‘content’):
results.append(result.content)
elif hasattr(result, ‘value’):
results.append(str(result.value))
elif isinstance(result, dict):
content = result.get(‘text’) or result.get(‘content’) or result.get(‘data’) or result.get(‘value’)
if content:
results.append(str(content))
else:
results.append(str(result))
elif isinstance(result, str):
results.append(result)
else:
result_str = str(result)
if len(result_str) > 10:
results.append(result_str)

if not results and hasattr(self, ‘manual_memory’):
for item in self.manual_memory:
if domain_filter and item[‘domain’] != domain_filter:
continue
if any(word.lower() in item[‘text’].lower() for word in question.split()):
results.append(item[‘text’])

return results[:5]

except Exception as e:
print(f” Search error: {e}”)
results = []
if hasattr(self, ‘manual_memory’):
for item in self.manual_memory:
if domain_filter and item[‘domain’] != domain_filter:
continue
if any(word.lower() in item[‘text’].lower() for word in question.split()):
results.append(item[‘text’])
return results[:5]

async def reasoning_chain(self, question: str) -> Dict[str, Any]:
“””Advanced reasoning using retrieved knowledge”””
print(f” Processing question: {question}”)

relevant_info = await self.query_knowledge(question)

analysis = {
“question”: question,
“relevant_knowledge”: relevant_info,
“domains_searched”: self.knowledge_domains,
“confidence”: min(len(relevant_info) / 3.0, 1.0),
“timestamp”: datetime.now().isoformat()
}

if relevant_info and len(relevant_info) > 0:
reasoning = self._synthesize_answer(question, relevant_info)
analysis[“reasoning”] = reasoning
analysis[“answer”] = self._extract_key_points(relevant_info)
else:
analysis[“reasoning”] = “No relevant knowledge found in memory”
analysis[“answer”] = “I don’t have information about this topic in my current knowledge base.”

return analysis

def _synthesize_answer(self, question: str, knowledge_pieces: List[str]) -> str:
“””AI-powered answer synthesis using Hugging Face model”””
global hf_llm

if not knowledge_pieces:
return “No relevant information found in my knowledge base.”

context = ” “.join(knowledge_pieces[:2])
context = context[:300]

prompt = f”Based on this information: {context}nnQuestion: {question}nAnswer:”

try:
if hf_llm:
synthesized = hf_llm.generate_response(prompt, max_length=80)
return synthesized if synthesized else f”Based on my knowledge: {context[:100]}…”
else:
return f”From my analysis: {context[:150]}…”
except Exception as e:
print(f” Synthesis error: {e}”)
return f”Based on my knowledge: {context[:100]}…”

def _extract_key_points(self, knowledge_pieces: List[str]) -> List[str]:
“””Extract key points from retrieved knowledge”””
key_points = []
for piece in knowledge_pieces:
clean_piece = piece.replace(“[DOMAIN:”, “”).replace(“[TIMESTAMP:”, “”)
sentences = clean_piece.split(‘.’)
if len(sentences) > 0 and len(sentences[0].strip()) > 10:
key_points.append(sentences[0].strip() + “.”)

return key_points[:3]

async def conversational_agent(self, user_input: str) -> str:
“””Main conversational interface with HF model integration”””
global hf_llm
self.conversation_history.append({“role”: “user”, “content”: user_input})

if any(word in user_input.lower() for word in [“learn”, “remember”, “add”, “teach”]):
content_to_learn = user_input.replace(“learn this:”, “”).replace(“remember:”, “”).strip()
await self.learn_from_text(content_to_learn, “conversation”)
response = “I’ve stored that information in my memory! What else would you like to teach me?”

elif user_input.lower().startswith((“what”, “how”, “why”, “when”, “where”, “who”, “tell me”)):
analysis = await self.reasoning_chain(user_input)

if analysis[“relevant_knowledge”] and hf_llm:
context = ” “.join(analysis[“relevant_knowledge”][:2])[:200]
prompt = f”Question: {user_input}nKnowledge: {context}nFriendly response:”
ai_response = hf_llm.generate_response(prompt, max_length=60)
response = ai_response if ai_response else “Here’s what I found in my knowledge base.”
else:
response = “I don’t have specific information about that topic in my current knowledge base.”

else:
relevant_context = await self.query_knowledge(user_input)

if hf_llm:
context_info = “”
if relevant_context:
context_info = f” I know that: {relevant_context[0][:100]}…”

conversation_prompt = f”User says: {user_input}{context_info}nI respond:”
response = hf_llm.generate_response(conversation_prompt, max_length=50)

if not response or len(response.strip()) < 3:
response = “That’s interesting! I’m learning from our conversation.”
else:
response = “I’m listening and learning from our conversation.”

self.conversation_history.append({“role”: “assistant”, “content”: response})
return response

We now define the core of our system, the AdvancedAIAgent class, which brings together Cognee’s memory, domain-aware learning, knowledge retrieval, and Hugging Face-powered reasoning. We empower our agent to learn from both text and documents, retrieve contextually relevant knowledge, and respond to queries with synthesized, intelligent answers. Whether it’s remembering facts, answering questions, or engaging in conversation, this agent adapts, remembers, and responds with human-like fluency. Check out the Full Codes here. Feel free to check other AI Agent and Agentic AI Codes and Tutorial for various applications.

async def main():
print(” Advanced AI Agent with Cognee Tutorial”)
print(“=” * 50)

agent = AdvancedAIAgent(“TutorialAgent”)
await agent.initialize_memory()

print(“n DEMO 1: Multi-domain Learning”)
sample_documents = [
{
“title”: “Python Basics”,
“content”: “Python is a high-level programming language known for its simplicity and readability.”,
“domain”: “programming”
},
{
“title”: “Climate Science”,
“content”: “Climate change”,
“domain”: “science”
},
{
“title”: “AI Ethics”,
“content”: “AI ethics involves ensuring artificial intelligence systems are developed and deployed responsibly, considering fairness, transparency, accountability, and potential societal impacts.”,
“domain”: “technology”
},
{
“title”: “Sustainable Energy”,
“content”: “Renewable energy sources are crucial for reducing carbon emissions”,
“domain”: “environment”
}
]

await agent.learn_from_documents(sample_documents)

print(“n DEMO 2: Knowledge Retrieval & Reasoning”)
test_questions = [
“What do you know about Python programming?”,
“How does climate change relate to energy?”,
“What are the ethical considerations in AI?”
]

for question in test_questions:
print(f”n Question: {question}”)
analysis = await agent.reasoning_chain(question)
print(f” Answer: {analysis.get(‘answer’, ‘No answer generated’)}”)
print(f” Confidence: {analysis.get(‘confidence’, 0):.2f}”)

print(“n DEMO 3: Conversational Agent”)
conversation_inputs = [
“Learn this: Machine learning is a subset of AI”,
“What is machine learning?”,
“How does it relate to Python?”,
“Remember that neural networks are inspired by biological neurons”
]

for user_input in conversation_inputs:
print(f”n User: {user_input}”)
response = await agent.conversational_agent(user_input)
print(f” Agent: {response}”)

print(f”n DEMO 4: Agent Knowledge Summary”)
print(f”Knowledge domains: {agent.knowledge_domains}”)
print(f”Conversation history: {len(agent.conversation_history)} exchanges”)

print(f”n Domain-specific search:”)
programming_results = await agent.query_knowledge(“programming concepts”, “programming”)
print(f”Programming knowledge: {len(programming_results)} results found”)

if __name__ == “__main__”:
print(“Starting Advanced AI Agent Tutorial with Hugging Face Models…”)
print(” Using free models from Hugging Face Hub”)
print(” GPU acceleration available!” if torch.cuda.is_available() else ” Running on CPU”)

try:
await main()
except RuntimeError:
import nest_asyncio
nest_asyncio.apply()
asyncio.run(main())

print(“n Tutorial completed! You’ve learned:”)
print(“• How to set up Cognee with Hugging Face models”)
print(“• AI-powered response generation”)
print(“• Multi-domain knowledge management”)
print(“• Advanced reasoning and retrieval”)
print(“• Conversational agent with memory”)
print(“• Free GPU-accelerated inference”)

We conclude the tutorial by running a comprehensive demonstration of our AI agent in action. We first teach it from multi-domain documents, then test its ability to retrieve knowledge and reason intelligently. Next, we engage it in a natural conversation, watching it learn and recall information taught by users. Finally, we view a summary of its memory, showcasing how it organizes and filters knowledge by domain, all with real-time inference using free Hugging Face models.

In conclusion, we’ve built a fully functional AI agent that can learn from structured data, recall and reason with stored knowledge, and converse intelligently using Hugging Face models. We configure Cognee for persistent memory, demonstrate domain-specific queries, and even simulate real conversations with the agent.

Check out the Full Codes here. Feel free to check other AI Agent and Agentic AI Codes and Tutorial for various applications. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post A Coding Guide to Build an Intelligent Conversational AI Agent with Agent Memory Using Cognee and Free Hugging Face Models appeared first on MarkTechPost.

AgentSociety: An Open Source AI Framework for Simulating Large-Scale S …

AgentSociety is a cutting-edge, open-source framework designed to simulate large populations of agents, each powered by Large Language Models (LLMs), to realistically model the complex interactions found in human societies. Leveraging powerful distributed processing technologies—especially Ray—this project achieves simulations involving tens of thousands of simultaneously active agents, each embedded in detailed, realistic environments that capture social, economic, and mobility behaviors.

Key Capabilities

Massive Scale and Fast Performance

Supports Large Populations: The framework has demonstrated simulations with up to 30,000 agents while running faster than wall-clock time, that is, the virtual society advances faster than real time.

Parallelization with Ray: AgentSociety uses the Ray framework to manage large-scale parallel execution of agents, critical for handling massive and non-deterministic interactions.

Efficient Resource Usage: By grouping agents and sharing network clients within groups, the framework greatly reduces memory and connection overhead, overcoming the port and memory bottlenecks common in scaling distributed simulations.

Realistic Societal Environments

AgentSociety differentiates itself by integrating highly realistic feedback and constraints, enabling agents to behave in a way that mirrors real societal systems.

Urban Space: Incorporates real-world map data (e.g., from OpenStreetMap), road networks, points of interest, and models of mobility (walking, driving, public transport) updated every simulated second.

Social Space: Agents form evolving social networks, engaging in both online and offline social interactions. Messaging (including content moderation and user blocking) is modeled to simulate social media and real-world communication patterns.

Economic Space: Implements employment, consumption, banking, government (taxes), and macroeconomic reporting—all driven by agent decisions. Agents must balance income and spending, simulating realistic economic behavior.

Architecture & Technology

Parallelized Interaction Engine

Group-Based Distributed Execution: Agents are partitioned into groups managed by Ray "actors," optimizing resource use while maintaining high parallelism, with asynchronous network requests utilizing connection reuse (see the sketch after this list).

High-Performance Messaging: Utilizing Redis’s Pub/Sub capabilities, agents efficiently communicate, supporting agent-agent and user-agent (external program) interactions.

Time Alignment Mechanism: The framework synchronizes agent and environment progression, ensuring consistent and reproducible simulations despite variable processing times from LLM API calls.

Comprehensive Utilities: Simulation logging (via PostgreSQL and local file storage), metric recording (mlflow), and a GUI for experiment creation/management and results visualization.
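To make the group-based execution pattern concrete, here is a minimal, self-contained sketch of the idea in Python with Ray. It is an illustration of partitioning agents into actor-managed groups, not AgentSociety's actual code; the AgentGroup class, the group size, and the shared client are assumptions made for the example.

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class AgentGroup:
    """One Ray actor owns a partition of agents and steps them together."""
    def __init__(self, agent_ids):
        self.agent_ids = agent_ids
        # A single shared client per group (placeholder here) is what keeps
        # connection and memory overhead low compared to one client per agent.
        self.shared_llm_client = object()

    def step(self, tick):
        # Placeholder for one simulation round for every agent in this group.
        return {agent_id: f"acted at tick {tick}" for agent_id in self.agent_ids}

# Partition 1,000 agents into 10 groups of 100 and run one round in parallel.
groups = [AgentGroup.remote(list(range(i * 100, (i + 1) * 100))) for i in range(10)]
results = ray.get([group.step.remote(tick=0) for group in groups])
print(sum(len(r) for r in results), "agent actions completed this round")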

Quantitative Results

Scalability and Speed

Faster than Real-Time: On a deployment with 24 NVIDIA A800 GPUs, simulations of 30,000 agents achieved faster-than-wall-clock operation (e.g., an iteration round for all agents executed faster than the equivalent real-world elapsed time).

Linear Scaling: Performance scales linearly with computing resources; increasing LLM-serving GPUs enables higher simulation throughput, up to the service limits of the language model backend.

Example Metrics: In the largest experiment (30,000 agents, 8 groups), an average agent round completed in 252 seconds, staying under real time with a 100% LLM call success rate. Environment simulation and message passing times remain far below LLM inference time, affirming the system's computational efficiency.

Impact of Realistic Environments

Authenticity of Agent Behaviors: Incorporating realistic environment simulators significantly improved the authenticity and human-likeness of agent behaviors compared to both pure LLM-prompt-based “text simulators” and various generative trajectory baselines.

Empirical Benchmarks: On measures such as radius of gyration, daily visited locations, and behavioral intention distributions, LLM agents with environment support dramatically outperformed both prompt-only and classical model baselines, matching closely to real-world data.

Use Cases and Applications

The open design and configurable environments make AgentSociety a powerful tool for:

Social Science Research: Studying societal patterns, emergent phenomena, mobility, and information spread.

Urban Planning and Policy Analysis: Evaluating interventions in simulated environments before real-world deployment.

Management Science: Modeling organizational dynamics, workforce changes, and economic behaviors.

Conclusion

AgentSociety stands out as the first open source framework to efficiently and realistically simulate societal interactions at unprecedented scale. Its integration of LLM-powered agents with parallelized, data-driven environments positions it as a critical tool for both computational research and practical decision support in understanding complex societal dynamics.

Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post AgentSociety: An Open Source AI Framework for Simulating Large-Scale Societal Interactions with LLM Agents appeared first on MarkTechPost.

The Ultimate 2025 Guide to Coding LLM Benchmarks and Performance Metri …

Large language models (LLMs) specialized for coding are now integral to software development, driving productivity through code generation, bug fixing, documentation, and refactoring. The fierce competition among commercial and open-source models has led to rapid advancement as well as a proliferation of benchmarks designed to objectively measure coding performance and developer utility. Here’s a detailed, data-driven look at the benchmarks, metrics, and top players as of mid-2025.

Core Benchmarks for Coding LLMs

The industry uses a combination of public academic datasets, live leaderboards, and real-world workflow simulations to evaluate the best LLMs for code:

HumanEval: Measures the ability to produce correct Python functions from natural language descriptions by running code against predefined tests. Pass@1 scores (percentage of problems solved correctly on the first attempt) are the key metric. Top models now exceed 90% Pass@1.

MBPP (Mostly Basic Python Problems): Evaluates competency on basic programming conversions, entry-level tasks, and Python fundamentals.

SWE-Bench: Targets real-world software engineering challenges sourced from GitHub, evaluating not only code generation but issue resolution and practical workflow fit. Performance is reported as the percentage of issues correctly resolved (e.g., Gemini 2.5 Pro: 63.8% on SWE-Bench Verified).

LiveCodeBench: A dynamic and contamination-resistant benchmark incorporating code writing, repair, execution, and prediction of test outputs. Reflects LLM reliability and robustness in multi-step coding tasks.

BigCodeBench and CodeXGLUE: Diverse task suites measuring automation, code search, completion, summarization, and translation abilities.

Spider 2.0: Focused on complex SQL query generation and reasoning, important for evaluating database-related proficiency.

Several leaderboards—such as Vellum AI, ApX ML, PromptLayer, and Chatbot Arena—also aggregate scores, including human preference rankings for subjective performance.

Key Performance Metrics

The following metrics are widely used to rate and compare coding LLMs:

Function-Level Accuracy (Pass@1, Pass@k): How often the first response (or at least one of the first k sampled responses) compiles and passes all tests, indicating baseline code correctness; a short Pass@k sketch follows this list.

Real-World Task Resolution Rate: Measured as percent of closed issues on platforms like SWE-Bench, reflecting ability to tackle genuine developer problems.

Context Window Size: The volume of code a model can consider at once, ranging from 100,000 to over 1,000,000 tokens for the latest releases, which is crucial for navigating large codebases.

Latency & Throughput: Time to first token (responsiveness) and tokens per second (generation speed) impact developer workflow integration.

Cost: Per-token pricing, subscription fees, or self-hosting overhead are vital for production adoption.

Reliability & Hallucination Rate: Frequency of factually incorrect or semantically flawed code outputs, monitored with specialized hallucination tests and human evaluation rounds.

Human Preference/Elo Rating: Collected via crowd-sourced or expert developer rankings on head-to-head code generation outcomes.
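
Pass@k is usually reported with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass the tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch in Python:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of them passing the tests.
print(round(pass_at_k(n=200, c=37, k=1), 4))   # 0.185, i.e., simply c/n when k=1
print(round(pass_at_k(n=200, c=37, k=10), 4))  # much higher, since any of 10 tries may pass

Benchmark-level Pass@k is then the average of this per-problem estimate across all problems.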

Top Coding LLMs—May–July 2025

Here’s how the prominent models compare on the latest benchmarks and features:

Model | Notable Scores & Features | Typical Use Strengths
OpenAI o3, o4-mini | 83–88% HumanEval, 88–92% AIME, 83% reasoning (GPQA), 128–200K context | Balanced accuracy, strong STEM, general use
Gemini 2.5 Pro | 99% HumanEval, 63.8% SWE-Bench, 70.4% LiveCodeBench, 1M context | Full-stack, reasoning, SQL, large-scale projects
Anthropic Claude 3.7 | ≈86% HumanEval, top real-world scores, 200K context | Reasoning, debugging, factuality
DeepSeek R1/V3 | Comparable coding/logic scores to commercial models, 128K+ context, open-source | Reasoning, self-hosting
Meta Llama 4 series | ≈62% HumanEval (Maverick), up to 10M context (Scout), open-source | Customization, large codebases
Grok 3/4 | 84–87% reasoning benchmarks | Math, logic, visual programming
Alibaba Qwen 2.5 | High Python accuracy, good long-context handling, instruction-tuned | Multilingual, data pipeline automation

Real-World Scenario Evaluation

Best practices now include direct testing on major workflow patterns:

IDE Plugins & Copilot Integration: Ability to use within VS Code, JetBrains, or GitHub Copilot workflows.

Simulated Developer Scenarios: E.g., implementing algorithms, securing web APIs, or optimizing database queries.

Qualitative User Feedback: Human developer ratings continue to guide API and tooling decisions, supplementing quantitative metrics.

Emerging Trends & Limitations

Data Contamination: Static benchmarks are increasingly susceptible to overlap with training data; new, dynamic code competitions or curated benchmarks like LiveCodeBench help provide uncontaminated measurements.

Agentic & Multimodal Coding: Models like Gemini 2.5 Pro and Grok 4 are adding hands-on environment usage (e.g., running shell commands, file navigation) and visual code understanding (e.g., code diagrams).

Open-Source Innovations: DeepSeek and Llama 4 demonstrate open models are viable for advanced DevOps and large enterprise workflows, plus better privacy/customization.

Developer Preference: Human preference rankings (e.g., Elo scores from Chatbot Arena) are increasingly influential for adoption and model selection, alongside empirical benchmarks.

In Summary:

Top coding LLM benchmarks of 2025 balance static function-level tests (HumanEval, MBPP), practical engineering simulations (SWE-Bench, LiveCodeBench), and live user ratings. Metrics such as Pass@1, context size, SWE-Bench success rates, latency, and developer preference collectively define the leaders. Current standouts include OpenAI’s o-series, Google’s Gemini 2.5 Pro, Anthropic’s Claude 3.7, DeepSeek R1/V3, and Meta’s latest Llama 4 models, with both closed and open-source contenders delivering excellent real-world results.
The post The Ultimate 2025 Guide to Coding LLM Benchmarks and Performance Metrics appeared first on MarkTechPost.

Introducing AWS Batch Support for Amazon SageMaker Training jobs

Picture this: your machine learning (ML) team has a promising model to train and experiments to run for their generative AI project, but they’re waiting for GPU availability. The ML scientists spend time monitoring instance availability, coordinating with teammates over shared resources, and managing infrastructure allocation. Simultaneously, your infrastructure administrators spend significant time trying to maximize utilization and minimize idle instances that lead to cost-inefficiency.
This isn’t a unique story. We heard from customers that instead of managing their own infrastructure and job ordering, they wanted a way to queue, submit, and retry training jobs while using Amazon SageMaker AI to perform model training.
AWS Batch now seamlessly integrates with Amazon SageMaker Training jobs. This integration delivers intelligent job scheduling and automated resource management while preserving the fully managed SageMaker experience your teams are familiar with. ML scientists can now focus more on model development and less on infrastructure coordination. At the same time, your organization can optimize the usage of costly accelerated instances, increasing productivity and decreasing costs. The following example comes from Toyota Research Institute (TRI):

“With multiple variants of Large Behavior Models (LBMs) to train, we needed a sophisticated job scheduling system. AWS Batch’s priority queuing, combined with SageMaker AI Training Jobs, allowed our researchers to dynamically adjust their training pipelines—enabling them to prioritize critical model runs, balance demand across multiple teams, and efficiently utilize reserved capacity. The result was ideal for TRI: we maintained flexibility and speed while being responsible stewards of our resources.” –Peter Richmond, Director of Information Engineering

In this post, we discuss the benefits of managing and prioritizing ML training jobs to use hardware efficiently for your business. We also walk you through how to get started using this new capability and share suggested best practices, including the use of SageMaker training plans.
Solution overview
AWS Batch is a fully managed service for developers and researchers to efficiently run batch computing workloads at different scales without the overhead of managing underlying infrastructure. AWS Batch dynamically provisions the optimal quantity and type of compute resources based on the volume and specific requirements of submitted batch jobs. The service automatically handles the heavy lifting of capacity planning, job scheduling, and resource allocation, so you can focus on your application logic rather than managing underlying infrastructure.
When you submit a job, AWS Batch evaluates the job’s resource requirements, queues it appropriately, and launches the necessary compute instances to run the job, scaling up during peak demand and scaling down to zero when no jobs are running. Beyond basic orchestration, AWS Batch includes intelligent features like automatic retry mechanisms that restart failed jobs based on configurable retry strategies, and fair share scheduling to manage equitable resource distribution among different users or projects by preventing a single entity from monopolizing compute resources. This can be especially useful if your organization has production workloads that should be prioritized. AWS Batch has been used by many customers with submit-now, run-later semantics for scheduling jobs and achieving high utilization of compute resources on Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, and now SageMaker Training jobs.
AWS Batch for SageMaker Training jobs consists of the following key components that work together to deliver seamless batch processing:

Training jobs serve as blueprints that specify how jobs should run, including Docker container images, instance types, AWS Identity and Access Management (IAM) roles, and environment variables
Job queues act as holding areas where jobs wait to be executed, with configurable priority levels that determine execution order
Service environments define the maximum capacity of the underlying infrastructure

With these foundations, AWS Batch can automatically retry jobs that hit transient failures and provide comprehensive queue visibility, addressing pain points that have long been difficult to solve in ML workflows. The integration also supports bulk job submission, so scientists can focus on model improvements instead of infrastructure management.
To use an AWS Batch queue for SageMaker Training jobs, you must have a service environment and a job queue. The service environment represents the Amazon SageMaker AI capacity limits available to schedule, expressed through maximum number of instances. The job queue is the scheduler interface researchers interact with to submit jobs and interrogate job status. You can use the AWS Batch console, or AWS Command Line Interface (AWS CLI) to create these resources. In this example, we create a First-In-First-Out (FIFO) job queue and a service environment pool with a limit of five ml.g5.xlarge instances using the AWS Batch console. The following diagram illustrates the solution architecture.

Prerequisites
Before you deploy this solution, you must have an AWS account with permissions to create and manage AWS Batch resources. For this example, you can use these Sample IAM Permissions along with your SageMaker AI execution role.
Create a service environment
Complete the following steps to create the service environment you will associate with the training job queue:

On the AWS Batch console, choose Environments in the navigation pane.
Choose Create environment, then choose Service environment.

Provide a name for your service environment (for this post, we name it ml-g5-xl-se).
Specify the maximum number of compute instances that will be available to this environment for model training (for this post, we set it to 5). You can update the value for your capacity limit later as needed.
Optionally, specify tags for your service environment.
Create your service environment.

Create a job queue
Complete the following steps to create your job queue:

On the AWS Batch console, choose Job queues in the navigation pane.
Choose Create job queue.
For Orchestration type, select SageMaker Training.

Provide a name for your job queue (for this post, we name it my-sm-training-fifo-jq).
For Connected service environment, choose the service environment you created.
Leave the remaining settings as default and choose Create job queue.

You can explore fair-share queues by reading more about the scheduling policy parameter. Additionally, you can use job state limits to configure your job queue to take automatic action to unblock itself in the event that a user submits jobs that are misconfigured or that remain capacity constrained beyond a configurable period of time. These are workload-specific parameters that you can tune to help optimize your throughput and resource utilization.
Submit SageMaker Training jobs to AWS Batch from the SageMaker Python SDK
The newly added aws_batch module within the SageMaker Python SDK allows you to programmatically create and submit SageMaker Training jobs to an AWS Batch queue using Python. This includes helper classes to submit both Estimators and ModelTrainers. You can see an example of this in action by reviewing the sample Jupyter notebooks. The following code snippets summarize the key pieces.
Complete the basic setup steps to install a compatible version of the SageMaker Python SDK:

!pip install sagemaker

To use the job queue you configured earlier, you can refer to it by name. The Python SDK has built-in support for the integration within the TrainingQueue class:

from sagemaker.aws_batch.training_queue import TrainingQueue

JOB_QUEUE_NAME = 'my-sm-training-fifo-jq'
training_queue = TrainingQueue(JOB_QUEUE_NAME)

For this example, we focus on the simplest job you can run: a hello world job defined through either a class that inherits from EstimatorBase or a ModelTrainer. In your own workloads, you can use a framework Estimator, such as PyTorch, or a ModelTrainer in place of this placeholder:

from sagemaker.session import Session
from sagemaker import image_uris, get_execution_role
from sagemaker.estimator import Estimator

session = Session()

EXECUTION_ROLE = get_execution_role()
INSTANCE_TYPE = 'ml.g5.xlarge'
TRAINING_JOB_NAME = 'hello-world-simple-job'

image_uri = image_uris.retrieve(
    framework="pytorch",
    region=session.boto_session.region_name,
    version="2.5",
    instance_type=INSTANCE_TYPE,
    image_scope="training"
)

estimator = Estimator(
    image_uri=image_uri,
    role=EXECUTION_ROLE,
    instance_count=1,
    instance_type=INSTANCE_TYPE,
    volume_size=1,
    base_job_name=TRAINING_JOB_NAME,
    container_entry_point=['echo', 'Hello', 'World'],
    max_run=300,
)

training_queued_job = training_queue.submit(training_job=estimator, inputs=None)

Submitting an estimator job is as straightforward as creating the estimator and then calling queue.submit. This particular estimator doesn’t require any data, but in general, data should be provided by specifying inputs. Alternatively, you can queue a ModelTrainer using AWS Batch by calling queue.submit, shown in the following code:

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

source_code = SourceCode(command="echo 'Hello World'")

model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=TRAINING_JOB_NAME,
    compute={"instance_type": INSTANCE_TYPE, "instance_count": 1},
    stopping_condition={"max_runtime_in_seconds": 300}
)

training_queued_job = training_queue.submit(training_job=model_trainer, inputs=None)
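
The hello world examples above pass inputs=None because they need no data. For a real training job, you would typically pass one or more data channels when queuing, in the same way you would when calling fit on an Estimator. The following is a minimal sketch under that assumption; the S3 paths and channel names are hypothetical:

from sagemaker.inputs import TrainingInput

# Hypothetical S3 locations for illustration only.
train_channel = TrainingInput(s3_data="s3://my-training-bucket/train/", content_type="text/csv")
validation_channel = TrainingInput(s3_data="s3://my-training-bucket/validation/", content_type="text/csv")

training_queued_job = training_queue.submit(
    training_job=estimator,
    inputs={"train": train_channel, "validation": validation_channel},
)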

Monitor job status
In this section, we demonstrate two methods to monitor the job status.
Display the status of jobs using the Python SDK
The TrainingQueue can list jobs by status, and each job can be described individually for more details:

submitted_jobs = training_queue.list_jobs(status="SUBMITTED")
pending_jobs = training_queue.list_jobs(status="PENDING")
runnable_jobs = training_queue.list_jobs(status="RUNNABLE")
scheduled_jobs = training_queue.list_jobs(status="SCHEDULED")
starting_jobs = training_queue.list_jobs(status="STARTING")
running_jobs = training_queue.list_jobs(status="RUNNING")
completed_jobs = training_queue.list_jobs(status="SUCCEEDED")
failed_jobs = training_queue.list_jobs(status="FAILED")

all_jobs = submitted_jobs + pending_jobs + runnable_jobs + scheduled_jobs + starting_jobs + running_jobs + completed_jobs + failed_jobs

for job in all_jobs:
    job_status = job.describe().get("status", "")
    print(f"Job : {job.job_name} is {job_status}")

After a TrainingQueuedJob has reached the STARTING status, the logs can be printed from the underlying SageMaker AI training job:

import time

while True:
    job_status = training_queued_job.describe().get("status", "")

    if job_status in {"STARTING", "RUNNING", "SUCCEEDED", "FAILED"}:
        break

    print(f"Job : {training_queued_job.job_name} is {job_status}")
    time.sleep(5)

training_queued_job.get_estimator().logs()

Display the status of jobs on the AWS Batch console
The AWS Batch console also provides a convenient way to view the status of running and queued jobs. To get started, navigate to the overview dashboard, as shown in the following screenshot.

From there, you can choose the number underneath the AWS Batch job state you're interested in to see the jobs in your queue that are in the given state.

Choosing an individual job in the queue will bring you to the job details page.

You can also switch to the SageMaker Training job console for a given job by choosing the View in SageMaker link on the AWS Batch job details page. You will be redirected to the corresponding job details page on the SageMaker Training console.

Whether you use the AWS Batch console or a programmatic approach to inspecting the jobs in your queue, it is generally useful to know how AWS Batch job states map to SageMaker Training job states. To learn how that mapping is defined, refer to the Batch service job status overview page found within the Batch user guide.
Best practices
We recommend creating dedicated service environments for each job queue in a 1:1 ratio. FIFO queues deliver basic fire-and-forget semantics, whereas fair share scheduling queues provide more sophisticated scheduling that balances utilization across share identifiers using share weights and job priority. If you don't need multiple shares but want to assign a priority on job submission, we recommend creating a fair share scheduling queue and using a single share within it for all submissions.
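If you do create a fair share scheduling queue, the shares themselves are defined in an AWS Batch scheduling policy that the queue references. The following is a minimal sketch using boto3; the policy name, share identifiers, and weights are hypothetical, and you should confirm the values against your own workload split:

import boto3

batch = boto3.client("batch")

# Hypothetical policy: two shares on one queue, with usage decayed over an hour.
batch.create_scheduling_policy(
    name="training-fairshare-policy",
    fairsharePolicy={
        "shareDecaySeconds": 3600,
        "computeReservation": 10,
        "shareDistribution": [
            {"shareIdentifier": "production", "weightFactor": 0.5},
            {"shareIdentifier": "research", "weightFactor": 1.0},
        ],
    },
)

Jobs submitted with the production share identifier then receive a larger slice of the queue than research jobs, because AWS Batch treats lower weight factors as higher priority.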
This integration works seamlessly with SageMaker Flexible Training Plans (FTP); simply set the TrainingPlanArn as part of the CreateTrainingJob JSON request, which is passed to AWS Batch. If the goal is for a single job queue to keep that FTP fully utilized, setting capacityLimits on the service environment to match the capacity allocated to the flexible training plan will allow the queue to maintain high utilization of all the capacity.
If the same FTP needs to be shared among many teams, each with a firm sub-allocation of capacity (for example, dividing a 20-instance FTP into 5 instances for a research team and 15 instances for a team serving production workloads), then we recommend creating two job queues and two service environments. The first job queue, research_queue, would be connected to the research_environment service environment with a capacityLimit set to 5 instances. The second job queue, production_queue, would be connected to a production_environment service environment with a capacity limit of 15. Both research and production team members would submit their requests using the same FTP.
Alternatively, if a strict partition isn’t necessary, both teams can share a single fair share scheduling job queue with separate share identifiers, which allows the queue to better utilize available capacity.
We recommend not using the SageMaker warm pool feature, because this can cause capacity to be idle.
Conclusion
In this post, we covered the new capability to use AWS Batch with SageMaker Training jobs and how to get started setting up your queues and submitting your jobs. This can help your organization schedule and prioritize jobs, freeing up time for your infrastructure admins and ML scientists. By implementing this functionality, your teams can focus on their workloads and not waste time managing and coordinating infrastructure. This capability is especially powerful using SageMaker training plans so that your organization can reserve capacity in the quantity you need, during the time you need it. By using AWS Batch with SageMaker AI, you can fully utilize the training plan for the most efficiency. We encourage you to try out this new capability so it can make a meaningful impact in your operations!

About the Authors
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.
David Lindskog is a Senior Software Engineer at AWS Batch. David has worked across a broad spectrum of projects at Amazon, and specializes in designing and implementing complex, scalable distributed systems and APIs that solve challenging technical problems.
Mike Moore is a Software Development Manager at AWS Batch. He works in high performance computing, with a focus on the application of simulation to the analysis and design of spacecraft and robotic systems. Prior to joining AWS, Mike worked with NASA to build spacecraft simulators to certify SpaceX Dragon and CST-100’s ascent abort systems for crew flight readiness. He lives in Seattle with his wife and daughter, where they enjoy hiking, biking, and sailing.
Mike Garrison is a Global Solutions Architect based in Ypsilanti, Michigan. Utilizing his twenty years of experience, he helps accelerate tech transformation of automotive companies. In his free time, he enjoys playing video games and travel.
Michelle Goodstein is a Principal Engineer on AWS Batch. She focuses on scheduling improvements for AI/ML to drive utilization, efficiency, and cost optimization, as well as improved observability into job execution lifecycle and efficiency. She enjoys building innovative solutions to distributed systems problems spanning data, compute, and AI/ML.
Michael Oguike is a Product Manager for Amazon SageMaker AI. He is passionate about using technology and AI to solve real-world problems. At AWS, he helps customers across industries build, train, and deploy AI/ML models at scale. Outside of work, Michael enjoys exploring behavioral science and psychology through books and podcasts.
Angel Pizarro is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.
Tom Burggraf is the Head of Product for AWS Batch, where he champions innovative features that help research platform builders achieve unprecedented scale and operational efficiency. He specializes in identifying novel ways to evolve AWS Batch capabilities, particularly in democratizing high-performance computing for complex scientific and analytical workloads. Prior to AWS, he was a product leader in FinTech and served as a consultant for product organizations across multiple industries, bringing a wealth of cross-industry expertise to cloud computing challenges.

Structured outputs with Amazon Nova: A guide for builders

Developers building AI applications face a common challenge: converting unstructured data into structured formats. Structured output is critical for machine-to-machine communication use cases, because this enables downstream use cases to more effectively consume and process the generated outputs. Whether it’s extracting information from documents, creating assistants that fetch data from APIs, or developing agents that take actions, these tasks require foundation models to generate outputs in specific structured formats.
We launched constrained decoding to provide reliability when using tools for structured outputs. Now, tools can be used with Amazon Nova foundation models (FMs) to extract data based on complex schemas, reducing tool use errors by over 95%.
In this post, we explore how you can use Amazon Nova FMs for structured output use cases.
Techniques for implementing structured outputs
When addressing the requirements for structured outputs use cases, there are two common approaches for implementation. You can modify the system prompt or take advantage of tool calling. For example, in a customer support use case, you might want the model to output a JSON with its response to the user and the current sentiment. So, the system prompt would be modified to include the expected structure:

Make sure your final response is valid JSON that follows the below response schema:

##Response schema
```json
{
  "response": "the response to the customer",
  "sentiment": "the current customer sentiment"
}
```

The other option is to provide a tool configuration. Tool calling is the act of providing an API, code function, or schema (or structure) required by your end application to the model through the request schema with the Converse API. This is most often used when building agentic applications, but it is also frequently used in structured output use cases because of the ability to define a set schema that the model should adhere to.

tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "respondToUser",
                "description": "the formatted response to the customer",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "response": {
                                "description": "the response to the customer",
                                "type": "string"
                            },
                            "sentiment": {
                                "description": "the current customer sentiment",
                                "type": "string"
                            }
                        },
                        "required": [
                            "response",
                            "sentiment"
                        ]
                    }
                }
            }
        }
    ]
}

Both approaches can be effective prompting techniques to influence the model output. However, the output is still non-deterministic and there is room for failure. In our work with customers to implement use cases such as agentic workflows and applications and structured extraction, we’ve observed that the accuracy of the model tends to decrease as the schema becomes more complex.
Structured output with Amazon Nova models
Based on these learnings, we have implemented constrained decoding in our system to help ensure high model reliability in the output generated and to allow the model to handle complex schemas with ease. Constrained decoding relies on a grammar to constrain the possible tokens a model can output at each step. This is differentiated from the prompting techniques historically used, because this changes the actual tokens a model can choose from when generating an output. For example, when closing a JSON object, the model would be constrained to just a } token to select. Constrained decoding is used every time a tool configuration is passed. Because tool use provides us a specific schema already, we can use that to generate a grammar dynamically, based on the schema desired by the developer. Constrained decoding prevents the model from generating invalid keys and enforces correct data types based on the defined schema.
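To build intuition for what this means in practice, the following toy sketch (purely illustrative, not the Amazon Nova implementation) shows the core idea: at each decoding step, tokens the grammar does not allow are masked out before sampling, so the model can only emit output that stays on the schema.

# Toy illustration of constrained decoding; not the Amazon Nova implementation.
def constrain(token_probs: dict, allowed_tokens: set) -> dict:
    """Mask disallowed tokens and renormalize the remaining probabilities."""
    masked = {tok: p for tok, p in token_probs.items() if tok in allowed_tokens}
    total = sum(masked.values())
    return {tok: p / total for tok, p in masked.items()}

# After emitting '{"sentiment":', a JSON grammar only allows a string value next,
# so structural tokens like '}' or a bare number are masked out entirely.
step_probs = {'"positive"': 0.40, '"negative"': 0.25, '}': 0.20, '42': 0.15}
allowed = {'"positive"', '"negative"'}
print(constrain(step_probs, allowed))  # only the two string values remain, renormalized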
Schema definition process
A key step in using structured outputs with Amazon Nova is to create a tool configuration. The tool configuration provides a standard interface to define the expected output schema. While the primary intent of a tool configuration is to provide external functionality to the model, this JSON interface is used in structured output use cases as well. This can be illustrated using a use case that extracts recipes from online content. To start the integration, we create a tool configuration representing the specific fields we want extracted from the recipes. When creating a tool configuration, it is important to be clear and concise because the property names and descriptions are what inform the model how the fields should be populated.

tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "extract_recipe",
                "description": "Extract recipe for cooking instructions",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "recipe": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string", "description": "Name of the recipe"},
                                    "description": {"type": "string", "description": "Brief description of the dish"},
                                    "prep_time": {"type": "integer", "description": "Preparation time in minutes"},
                                    "cook_time": {"type": "integer", "description": "Cooking time in minutes"},
                                    "servings": {"type": "integer", "description": "Number of servings"},
                                    "difficulty": {
                                        "type": "string",
                                        "enum": ["easy", "medium", "hard"],
                                        "description": "Difficulty level of the recipe"
                                    },
                                    "ingredients": {
                                        "type": "array",
                                        "items": {
                                            "type": "object",
                                            "properties": {
                                                "name": {"type": "string", "description": "Name of ingredient"},
                                                "amount": {"type": "number", "description": "Quantity of ingredient"},
                                                "unit": {"type": "string", "description": "Unit of measurement"}
                                            },
                                            "required": ["name", "amount", "unit"]
                                        }
                                    },
                                    "instructions": {
                                        "type": "array",
                                        "items": {"type": "string", "description": "Step-by-step cooking instructions"}
                                    },
                                    "tags": {
                                        "type": "array",
                                        "items": {"type": "string", "description": "Categories or labels for the recipe"}
                                    }
                                },
                                "required": []
                            }
                        },
                        "required": []
                    }
                }
            }
        }
    ]
}

After the tool configuration has been created, we can pass it through the Converse API along with the recipe, which will be contained in the user prompt. A system prompt has historically been used in structured output use cases to guide the model in how to output the content; in this case, we can use it to pass details about the system role and persona.

import boto3

client = boto3.client("bedrock-runtime")

# `content` holds the user message content blocks containing the recipe text to extract.
model_response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    system=[{"text": "You are an expert recipe extractor that compiles recipe details from blog posts"}],
    messages=[{"role": "user", "content": content}],
    inferenceConfig={"temperature": 0},
    toolConfig=tool_config
)

By using the native tool use support with constrained decoding, we get a parsed tool call that will follow the correct syntax and expected schema as set in the tool configuration.

{
“toolUse”: {
“toolUseId”: “tooluse_HDCl-Y8gRa6yWTU-eE97xg”,
“name”: “extract_recipe”,
“input”: {
“recipe”: {
“name”: “Piacenza Tortelli”,
“description”: “Piacenza tortelli, also known as ‘tortelli with the tail’ due to their elongated shape, are a delicious fresh pasta, easy to make at home!”,
“prep_time”: 60,
“cook_time”: 10,
“servings”: 4,
“difficulty”: “hard”,
“ingredients”: [
{
“name”: “Type 00 flour”,
“amount”: 2.3,
“unit”: “cups”
},
{
“name”: “Eggs”,
“amount”: 3,
“unit”: “”
},
{
“name”: “Fine salt”,
“amount”: 1,
“unit”: “pinch”
},
{
“name”: “Spinach”,
“amount”: 13.3,
“unit”: “cups”
},
{
“name”: “Cow’s milk ricotta cheese”,
“amount”: 1.3,
“unit”: “cups”
},
{
“name”: “Parmigiano Reggiano PDO cheese”,
“amount”: 4.2,
“unit”: “oz”
},
{
“name”: “Fine salt”,
“amount”: 1,
“unit”: “to taste”
},
{
“name”: “Nutmeg”,
“amount”: 1,
“unit”: “to taste”
},
{
“name”: “Butter”,
“amount”: 80,
“unit”: “g”
},
{
“name”: “Sage”,
“amount”: 2,
“unit”: “sprigs”
}
],
“instructions”: [
“Arrange the flour in a mound and pour the eggs into the center 1; add a pinch of salt and start working with a fork 2, then knead by hand 3.”,
“You should obtain a smooth dough 4; wrap it in plastic wrap and let it rest for half an hour in a cool place.”,
“Meanwhile, prepare the filling starting with the spinach: immerse them in boiling salted water 5 and blanch them for a few minutes until wilted 6.”,
“Drain the spinach and transfer them to cold water 7, preferably with ice. Then squeeze them very well 8 and chop them finely with a knife 9.”,
“Place the chopped spinach in a bowl, add the ricotta 10, salt, pepper, and nutmeg 11. Also add the grated Parmigiano Reggiano DOP 12.”,
“Mix well until you get a homogeneous consistency 13.”,
“At this point, take the dough that has now rested 14, take a portion of it keeping the rest covered. Lightly flatten the dough with a rolling pin 15.”,
“Roll it out with a pasta machine 16; as you reduce the thickness, fold the dough over itself 17 and roll it out again 18.”,
“You should get a very thin rectangle, about 0.04-0.08 inches thick 19. Cut 2 strips of dough by dividing the rectangle in half lengthwise 20, then cut out diamonds of 4 inches 21.”,
“Fill the diamonds with the spinach filling 22 and close them. To do this, bring one of the two longer points inward 23, then fold the two side points towards the center 24.”,
“Now close the tortello by pinching the dough in the center and moving gradually towards the outside 25. The movement is similar to the closure of culurgiones. Continue in this way until the dough and filling are finished 26; you will get about 40-45 pieces.”,
“Place a pot full of salted water on the stove. Meanwhile, in a pan, pour the butter and sage 27. Turn on the heat and let it flavor.”,
“Then cook the tortelli for 5-6 minutes 28, then drain them and toss them in the butter and sage sauce 29.”,
“Plate and serve the Piacenza tortelli with plenty of grated Parmigiano Reggiano DOP 30!”
],
“tags”: [
“vegetarian”,
“Italian”
]
}
}
}
}
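
To consume this result in application code, you can pull the parsed tool input out of the Converse response. A minimal sketch, assuming the model_response and extract_recipe tool name shown above:

def extract_tool_input(model_response, tool_name="extract_recipe"):
    """Return the input of the first matching toolUse block, or None if absent."""
    for block in model_response["output"]["message"]["content"]:
        tool_use = block.get("toolUse")
        if tool_use and tool_use.get("name") == tool_name:
            return tool_use["input"]  # already a Python dict; no json.loads needed
    return None

recipe = extract_tool_input(model_response)
if recipe:
    print(recipe["recipe"]["name"])  # "Piacenza Tortelli" for the example above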

Now, with constrained decoding, we can use a smaller model such as Amazon Nova Lite to reliably produce output that follows a large and complex JSON schema for use in our application. For image-based use cases with complex schemas, we recommend using Nova Pro or Nova Premier for the best performance.
Conclusion
By using structured output with Amazon Nova through tool calling, you can take advantage of the key benefits of constrained decoding and build a reliable system. We encourage you to try this out in your applications today. Learn more at the Amazon Nova User Guide. Get started building your AI applications with Amazon Nova in the Amazon Bedrock console.

About the authors
Jean Farmer is a Generative AI Solutions Architect on the Amazon Artificial General Intelligence (AGI) team, specializing in agentic applications. Based in Seattle, Washington, she works at the intersection of autonomous AI systems and practical business solutions, helping to shape the future of AGI at Amazon.
Mukund Birje is a Sr. Product Marketing Manager on the AIML team at AWS. In his current role he’s focused on driving adoption of Amazon Nova Foundation Models. He has over 10 years of experience in marketing and branding across a variety of industries. Outside of work you can find him hiking, reading, and trying out new restaurants. You can connect with him on LinkedIn.

AI agents unifying structured and unstructured data: Transforming supp …

As organizations seek to derive greater value from their AWS Support data, operational teams are looking for ways to transform raw support cases and health events into actionable insights. While traditional analytics tools can provide basic reporting capabilities, teams need more sophisticated solutions that can understand and process natural language queries about their operational data. Retrieval-Augmented Generation (RAG) architecture forms the foundation for optimizing large language model outputs by referencing authoritative knowledge bases outside of their training data before generating responses. This architecture uses the power of semantic search and information retrieval capabilities to enhance accuracy.
In our previous blog post, Derive meaningful and actionable operational insights from AWS Using Amazon Q Business, we introduced a RAG-based solution using Amazon Q Business. However, while this approach excels at semantic search, it can face challenges with precise numerical analysis and aggregations. In this post, we address these limitations by showing how you can enhance Amazon Q with custom plugins to improve support analytics and incident response. Our solution combines the strengths of RAG with structured data querying, enabling more accurate answers to analytical questions about your AWS Support data. Detailed deployment instructions for the solution in this post are available in our AWS Samples GitHub repository.
Understanding RAG’s design for analytics
RAG architecture excels at finding relevant information through context-based matching, enabling natural language interactions with external data. However, the fundamental architecture of vector similarity search, which prioritizes semantic relevance over analytical aggregation capabilities, can introduce significant inaccuracies. Let’s examine this limitation with a simple query:

Give me total count of support cases in Feb 2025

The following RAG-only response shows the system attempting to piece together information from various documents, reporting 190 cases through potentially hallucinated aggregations.

Based on the provided JSON files, there were 190 AWS support cases documented in
February 2025, calculated from the following case counts: 17 cases from Feb 1-3,
12 cases from Feb 4-5, 21 cases from Feb 7-8, 21 cases from Feb 8-10, 16 cases from
Feb 10-11, 10 cases from Feb 11-12, 23 cases from Feb 19-20, 22 cases from Feb 20-21,
27 cases from Feb 23-24, and 21 cases from Feb 25-26.

However, when enhanced with structured data querying capabilities, the same query yields 958 support cases, the accurate count verified against the actual dataset:

Based on the query results, there were a total of 958 support cases in February 2025.

This limitation becomes even more apparent with moderately complex queries that require categorical analysis:

Give me OpenSearch Support cases, distribution based on severity, opened in Jan 2025?

The following is a RAG-only response showing a limited or inaccurate categorical breakdown:

Based on the support cases data from January 2025, here is the severity distribution
for OpenSearch cases:

Urgent: One case was opened regarding cluster health degradation and production system
downtime

Normal: Two cases were identified:
1. A case about OpenSearch RI Exchange
2. A case concerning blue/green deployment upgrade from version 1.3 to 2.17

The cases demonstrate varying levels of urgency, with one critical production issue
and two normal severity cases requiring standard response times.

With structured data querying, we get precise severity distributions:

Based on the query results for OpenSearch Support cases opened in January 2025,
here’s the distribution by severity:
1. Normal severity: 4 cases
2. Critical severity: 4 cases
3. High severity: 3 cases
4. Low severity: 2 cases
5. Urgent severity: 1 case

In total, there were 14 OpenSearch Support cases opened in January 2025.
The distribution shows that Normal and Critical severity cases were the most common,
followed by High severity cases. There were fewer Low and Urgent severity cases
during this period.

The gap widens further when dealing with multi-dimensional analysis. Consider the following multi-dimensional query expecting detailed analysis with account-service combinations and temporal patterns.
Query:

Find accounts with frequent high-priority support cases since January 2024,
excluding limit increase requests. For each account and AWS service combination,
show the total number of cases and how many different days had issues.
Only include results where there were more than 3 cases, and sort results by the
highest number of cases first.

Response:

Based on the analysis of high-priority support cases since January 2024, excluding
limit increase requests, here are the key findings:

1. The account with the most frequent high-priority cases is 111111111111, with
60 cases for the Connect service.
2. The top 5 account and service combinations with the highest number of cases are:
* Account 111111111111 (Connect): 60 cases
* Account 222222222222 (RDS Oracle): 52 cases
* Account 333333333333 (OpenSearch Service): 38 cases
* Account 444444444444 (Aurora PostgreSQL): 25 cases
* Account 555555555555 (EMR): 23 cases

The services that appear most frequently in the top results are:
* Aurora PostgreSQL
* OpenSearch Service
* RDS Oracle
* EMR
* DynamoDB

Interestingly, the number of total cases matches the number of unique days with
cases for each account-service combination. This suggests that these accounts
are experiencing issues on different days rather than multiple issues on the
same day.

There are 89 account-service combinations that have more than 3 high-priority
cases since January 2024.

Some accounts appear multiple times with different services, indicating they
may be experiencing issues across various AWS offerings. For example:
* Account 222222222222 appears with RDS Oracle (52 cases), DynamoDB (18 cases),
and Aurora PostgreSQL (6 cases)
* Account 111111111111 appears with Connect (60 cases) and other services

This information can be useful for identifying accounts that may need additional
support or proactive engagement to address recurring issues. Would you like to
focus on a specific aspect of this data or explore further details about
particular accounts or services?

The following is a RAG-only response showing its inability to process the same complex query:

Based on the provided data source, I cannot provide a complete analysis of
high-priority support cases and their frequency patterns. While there are support
cases mentioned in the data from January 2024, the data does not contain
sufficient information about case priorities or enough cases per account
to identify patterns meeting the threshold of more than 3 cases.
Additionally, while some technical and customer service cases are shown,
the data is fragmented across different dates without providing a comprehensive
view of case volumes or frequencies per account and service.

These examples demonstrate how RAG alone struggles with analytical queries of increasing complexity, while our plugin-enhanced solution delivers precise, structured analysis at every level. When working with scenarios requiring precise numerical analysis, we can enhance RAG’s capabilities through structured approaches:

Aggregation and pattern analysis: When user prompts include aggregation queries (such as counts, totals, or distributions), they require exact numerical computation through structured querying to provide precise results. Vector similarity search alone cannot guarantee accurate numerical aggregations, making structured metadata querying essential for these analytical use cases.
Context and correlation analysis: External unstructured data requires thoughtful data engineering to extract and maintain structured metadata (such as creation dates, categories, severity levels, and service types). While RAG excels at finding semantically similar content, having well-defined metadata enables precise filtering and querying capabilities. For example, when analyzing system performance issues, structured metadata about incident timing, affected services, and their dependencies enables comprehensive impact analysis through exact querying rather than relying solely on semantic matching.

Enhancing Q Support-Insights with agentic AI
Building on the Q Support-Insights (QSI) solution introduced in Derive meaningful and actionable operational insights from AWS Using Amazon Q Business, we’ll demonstrate how to enhance analytical capabilities through agentic AI by creating custom plugins. This enhancement preserves QSI’s base implementation while adding precise analytical processing through structured metadata querying.
QSI overview
The Amazon Q Support Insights (QSI) solution consists of two main components:

Data collection Pipeline

Support Collector module using AWS Lambda functions
The Support data consists of AWS Support cases, Health events, and Trusted Advisor checks
Amazon EventBridge for automated data collection. The data pipeline enables two synchronization mechanisms:

Real-time case updates: Processes AWS Support cases through event-based triggers (CreateCase, AddCommunicationToCase, ResolveCase, ReopenCase).
Historical and daily sync: Performs initial historical data sync and refreshes AWS Trusted Advisor data daily.

Stores data in JSON format in centralized Amazon Simple Storage Service (Amazon S3) bucket
Supports multi-account data aggregation through AWS Organizations

Amazon Q Business application environment

Amazon Q Business application deployment
Amazon S3 connector for data source integration
Web experience configuration for user interaction
Authentication through AWS IAM Identity Center

Enabling query aggregation with custom plugins for Amazon Q Business 
Custom plugins extend Amazon Q Business to combine semantic search with precise analytics capabilities. The following implementation details outline how we’ve augmented the base QSI solution:

Augments QSI’s natural language processing with structured query capabilities
Converts analytical requests into precise Amazon Athena SQL using an Amazon Bedrock large language model (LLM)
Executes queries against structured metadata tables
Provides exact numerical results alongside semantic search responses

Metadata processing
To enable precise querying capabilities, we’ve implemented a robust metadata processing framework that structures and indexes support data:

Using QSI’s existing data collection in Amazon S3
Extracts structured metadata from support cases into two queryable tables
Maintains critical fields like dates, categories, severity levels
Enables precise filtering and aggregation capabilities
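
The value of this framework comes from turning the raw case JSON into flat, typed rows that can be aggregated exactly. The following is a minimal sketch of that extraction step; the bucket, prefix, and output handling are hypothetical, and the field names assume the AWS Support case shape (caseId, serviceCode, severityCode, status, timeCreated) stored by the collector:

import json
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix where the Support Collector stores case JSON.
BUCKET = "my-qsi-support-data"
PREFIX = "support-cases/"

def extract_case_metadata(bucket=BUCKET, prefix=PREFIX):
    """Flatten raw support-case JSON into rows suitable for a queryable metadata table."""
    rows = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            case = json.loads(s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read())
            rows.append({
                "case_id": case.get("caseId"),
                "service": case.get("serviceCode"),
                "severity": case.get("severityCode"),
                "status": case.get("status"),
                "created_at": case.get("timeCreated"),
            })
    return rows  # write these rows to S3 as CSV/Parquet and register them as an Athena table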

Cross-source analysis
The solution enables comprehensive analysis across multiple data sources through intelligent correlation capabilities:

Correlates support cases with health events
Enables comprehensive operational assessment
Supports pattern detection across multiple data sources

Architecture overview
The following illustration shows how the enhanced Amazon Q Business solution integrates custom plugins with the existing QSI architecture. The diagram demonstrates the flow from user query through analytical processing to response generation, highlighting how structured data querying complements the RAG-based semantic search capabilities.

When a user submits an analytical query like Show me OpenSearch case distribution based on severity, the following process occurs:

The custom plugin detects the analytical intent and triggers a Lambda function
The Lambda function uses an Amazon Bedrock LLM to transform the natural language query into precise Athena SQL
The query executes against structured metadata in Athena
Results are synthesized back into natural language and presented alongside traditional RAG responses

This approach enables precise analytics while maintaining the conversational interface users expect.
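The following is a minimal sketch of what such a Lambda-backed plugin handler might look like. It is not the actual QSI implementation: the table and database names (support_cases in support_insights), the Athena output location, and the prompt wording are all assumptions made for illustration.

import time
import boto3

bedrock = boto3.client("bedrock-runtime")
athena = boto3.client("athena")

DATABASE = "support_insights"                    # hypothetical Athena database
OUTPUT_LOCATION = "s3://my-qsi-athena-results/"  # hypothetical results bucket

def nl_to_sql(question: str) -> str:
    """Ask a Bedrock model to translate a natural language question into Athena SQL."""
    prompt = (
        "Translate this question into a single Athena SQL query against the table "
        "support_cases(case_id, account_id, service, severity, status, created_at). "
        "Return only SQL.\n\nQuestion: " + question
    )
    response = bedrock.converse(
        modelId="us.amazon.nova-lite-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()

def run_athena_query(sql: str) -> list:
    """Execute the generated SQL in Athena and return the raw result rows."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]

The exact counts returned by Athena are what get synthesized back into the natural language answers shown in the examples that follow.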
Implementation example
Here’s a sample interaction demonstrating the enhanced capabilities. Let’s start with a basic query to understand the overall OpenSearch case volume, using structured data querying:
Query:

Give me OpenSearch cases in Jan 2025

Response:

Based on the query results for OpenSearch cases in January 2025, here’s a summary:
1. Total cases: 15
2. Severity breakdown:
Critical: 4 cases
High: 3 cases
Urgent: 1 case
Normal: 3 cases
Low: 2 cases
3. Status: All cases are resolved
4. Most common issues:
Cluster health problems (yellow status)
Blue/Green deployment issues
Performance concerns (high CPU utilization)
5. Services involved:
OpenSearch Service Managed Cluster: 13 cases
OpenSearch Service Ingestion: 2 cases

The cases span various topics including cluster health, deployments, performance,
and configuration issues. Is there a specific aspect of these cases you’d like
more information on?

Now, using RAG capabilities to analyze unstructured case content for architectural insights:
Query:

Out of these, what are the OpenSearch issues that are related to architecture
deficiencies?

Response:

Based on the available data, several architectural issues have been identified in OpenSearch deployments:

* The shard count per node exceeding recommended limits is a significant
architectural concern, with validation errors occurring when scaling up
node counts. The recommended limit is 1,000 shards per node and exceeding
this requires careful architectural planning.

* Master node scaling is another architectural challenge, where insufficient
master node capacity can limit cluster expansion. This requires proper
configuration of master node instance types according to the number of nodes
and shards in the domain.

* Resource utilization issues, particularly around JVM Memory Pressure and
CPU utilization, indicate potential architectural shortcomings.
The system architecture should be designed to maintain JVM Memory Pressure
below 80% and CPU Utilization below 85% for optimal performance.

* The system’s handling of blue/green deployments shows architectural limitations,
as these operations cause performance impacts and increased latency during data
migration to new nodes.

Continuing with RAG-based analysis of case details to evaluate implementation patterns:
Query:

Given the high shard count, JVM, or other data can we make inferences about whether
they are using this correctly?

Response:

Based on the data discussed, there are strong indicators that the OpenSearch system
is not being used optimally:

JVM Memory Management Issues:
1. Memory pressure consistently exceeded recommended thresholds (above 78% and
eventually over 90%)
2. The 64GB initial heap size configuration proved inadequate for the workload
3. Circuit breaking exceptions and cluster blocks occurred due to memory issues

Shard Management Problems:
1. The system appears to have exceeded the recommended limit of 1,000 shards
per node
2. Too many small shards created unnecessary system strain
3. Large shards complicated failure recovery processes

Resource Utilization:
1. Failed to maintain recommended thresholds:
   JVM Memory Pressure should be below 80%
   CPU Utilization should be below 85%
2. Continuous memory pressure issues
3. Failed snapshots
These indicators strongly suggest improper sizing and resource allocation for
the workload, pointing to incorrect implementation or configuration of the
OpenSearch system. Better capacity planning and adherence to best practices
would be necessary for optimal performance.

Finally, here’s another example of structured data analysis with a complex query demonstrating multi-dimensional aggregation capabilities:
Query:

For high-severity support cases since January 1, 2024, show the number of
total cases and after-hours cases for each account and service combination.
Define after-hours as before 1 PM UTC, after 1 AM UTC, or times falling within
the weekend. Only include results with more than 3 after-hours cases. Sort by
the highest number of after-hours cases. Exclude limit increase requests.

Response:

Based on the analysis of high-severity support cases since January 1, 2024,
here’s a summary of the results, focusing on accounts and services with more
than 3 after-hours cases:

1. The account with the highest number of after-hours cases (48) is 111111111111,
for the Connect service. They have a total of 66 high-severity cases.
2. The second highest is account 333333333333 with 40 after-hours cases for the
OpenSearch Service Managed Cluster, out of 49 total cases.
3. Account 222222222222 has 38 after-hours cases for RDS Oracle, out of 64
total cases.
4. EMR service for account 555555555555 has 19 after-hours cases out of 23
total cases.
5. Bedrock service for account 666666666666 has 17 after-hours cases out of
24 total cases.

The results are sorted by the highest number of after-hours cases, and limit
increase requests have been excluded as requested. After-hours cases are
defined as those created before 1 PM UTC, after 1 AM UTC, or times falling
within the weekend.

Would you like more details on a specific account or service, or do you need
a further analysis of this data?

The response combines precise numerical analysis with semantic understanding, providing actionable insights for operational reviews.
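For reference, the Athena SQL the plugin generates for the after-hours query might look roughly like the sketch below. Table and column names are assumptions, and the after-hours window is interpreted as 1 AM–1 PM UTC or any weekend time.

# Hedged guess at the generated SQL for the after-hours aggregation above
AFTER_HOURS_SQL = """
SELECT *
FROM (
    SELECT account_id,
           service_code,
           COUNT(*) AS total_cases,
           SUM(CASE
                 WHEN hour(created_time) BETWEEN 1 AND 12       -- after 1 AM and before 1 PM UTC
                   OR day_of_week(created_time) IN (6, 7)       -- Saturday or Sunday
                 THEN 1 ELSE 0
               END) AS after_hours_cases
    FROM support_insights.case_metadata
    WHERE lower(severity) IN ('high', 'urgent', 'critical')
      AND created_time >= timestamp '2024-01-01 00:00:00'
      AND category <> 'limit-increase'
    GROUP BY account_id, service_code
) per_account
WHERE after_hours_cases > 3
ORDER BY after_hours_cases DESC
"""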
Benefits and impact
This plugin architecture delivers several key improvements:

Precise analytics: Exact counts and distributions replace approximate semantic matching
Contextual analysis: Maintains analytical context across conversation threads
Architectural understanding: Better correlation of related issues through structured analysis

Deploy the Amazon Q Business application
The following is a simplified deployment process. For detailed instructions, see the Amazon Q Business application creation module.
Prerequisites

AWS CloudShell is recommended since it comes pre-installed with the required libraries and tools. Alternatively, you can use a local machine with the AWS Command Line Interface (AWS CLI) installed and configured with valid credentials.
Two S3 buckets:

Support data bucket for storing AWS Support case data
Resource bucket as temporary storage for Lambda deployment artifacts (can be deleted after deployment)

IAM Identity Center instance configured
The solution needs AWS Support data collected using the Support Data Pipeline. You can deploy now and add data later, but functionality depends on data availability in your S3 bucket.
Access to Anthropic’s Claude 3.5 Sonnet through Amazon Bedrock. See Add or remove access to Amazon Bedrock foundation models
The default database should exist in Athena. If it doesn’t, you can create it using the Athena Query Editor.

Deployment steps
You can use the following script to deploy the Amazon Q solution. No manual steps are needed; the script handles stack creation and configuration automatically.

# Clone the repository and run the deployment script
git clone https://github.com/aws-samples/support-insights-with-amazon-q.git
cd support-insights-with-amazon-q/q_application
chmod +x deploy_q_stacks.sh
./deploy_q_stacks.sh

Clean up
To remove the resources, delete the S3 buckets and CloudFormation stacks. Delete the CloudFormation stacks in the following order:

case-metadata-stack
amazon-q-stack
custom-plugin-stack

Note that deleting the stacks won’t remove the existing S3 buckets; you must delete the S3 buckets manually.
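If you prefer to script the teardown, the following boto3 sketch deletes the stacks in the order listed above and waits for each deletion to finish; emptying and deleting the S3 buckets still has to be done separately.

import boto3

cfn = boto3.client("cloudformation")

# Delete the stacks in the order given above, waiting for each to finish
for stack_name in ["case-metadata-stack", "amazon-q-stack", "custom-plugin-stack"]:
    cfn.delete_stack(StackName=stack_name)
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)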
Conclusion
By combining RAG’s semantic understanding with precise analytical capabilities through plugins, we’ve transformed Amazon Q Business into a powerful operational analytics platform. In the examples in this post, you can see how organizations can use this enhancement to derive more accurate and actionable insights from their AWS Support data, supporting better operational decision-making and proactive issue resolution. While demonstrated through support data analytics for operational improvements, these patterns apply across domains that combine structured and unstructured data sources.
Learn more

Explore the Amazon Q documentation to understand more about building custom plugins
Check out these related resources:

Getting Started with Amazon Q Business
Plugins for Amazon Q Business
Amazon Q Business FAQs
About the AWS Support API

For questions and feedback, visit AWS re:Post or contact AWS Support.

About the authors
Chitresh Saxena is a Sr. AI/ML specialist TAM specializing in generative AI solutions and dedicated to helping customers successfully adopt AI/ML on AWS. He excels at understanding customer needs and provides technical guidance to build, launch, and scale AI solutions that solve complex business problems.
Kevin Morgan is a Sr. Enterprise Support Manager at AWS who helps customers accelerate their cloud adoption journey through hands-on leadership and technical guidance. As a member of the NextGen Developer Experience TFC, he specializes in Builder Experience, CloudOps and DevOps. Outside of work, Kevin enjoys being a Game Master for D&D and is a retro computing enthusiast.

Next-Gen Privacy: How AI Is Transforming Secure Browsing and VPN Technologies (2025 Data-Driven Deep Dive)

As we move through 2025, artificial intelligence (AI) is fundamentally reshaping secure browsing and Virtual Private Network (VPN) technologies. The explosion of sophisticated cyber threats, sharpened by the capabilities of AI and quantum computing, is forcing rapid innovation in privacy protection, user trust, and online security infrastructure.

The Data Privacy Wakeup Call

AI-Related Privacy Breaches: According to Stanford’s 2025 AI Index Report, AI incidents increased by 56.4% in just one year, with 233 major cases documented in 2024—including data breaches, algorithmic failures, and misuse of personal data.

Consumer Trust: 70% of global consumers have little to no trust in companies to use AI responsibly. 57% view the use of AI in data collection as a major threat to privacy, and 81% expect their information will be used in ways they won’t approve of as AI adoption grows.

Corporate Realities: 40% of organizations experienced an AI-related privacy breach, yet fewer than two-thirds are actively implementing safeguards. In practice, only 37% of small enterprises have any plans to use AI for privacy—highlighting resource and governance barriers.

VPN Usage and Privacy Surge

Explosive Growth: In 2025, the global VPN market is projected to hit $77 billion, up from $44.6 billion a year ago, with more than 1.9 billion regular users worldwide—representing a 20% year-over-year increase and over one-third of all internet users.

Regional Differences: North America leads with 30% market growth, Asia-Pacific is expanding at a 16% annual pace, and VPN usage has become routine in places like Singapore (19% penetration).

Mobile Dominance: 69% of VPN usage now happens on mobile devices; desktop/laptop daily use is much lower.

Use Cases: While 37% use VPNs to avoid tracking, one in four still want access to region-locked streaming content—underscoring privacy and entertainment as dual drivers.

Shift in US: Paradoxically, American VPN usage fell from 46% in 2024 to 32% in 2025, reflecting confusion over privacy, shifting workplace mandates, and declining trust in current VPN solutions.

AI: The Dual-Edged Sword in Secure Browsing

How AI Defends (and Attacks):

Real-Time Threat Recognition: AI enables VPNs to instantly detect anomalous traffic, filter zero-day threats, and halt phishing or malware before users are harmed.

Automated, Predictive Security: Machine learning models now block suspicious IPs, re-route data, and tighten user authentication automatically, keeping pace with rapidly evolving threats.

Countering AI-Driven Crime: Attackers are using generative AI and agent “swarms” to launch convincing deepfakes, automate malware, and operate cybercrime-as-a-service, pushing breakout times to under an hour for some attacks.
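
As a toy illustration of the detection idea above (not any vendor’s actual pipeline), an unsupervised model such as scikit-learn’s IsolationForest can be fit on normal traffic features and used to flag outlying flows. The feature set and thresholds here are assumptions.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "normal" flow records: [bytes_sent, packets, duration_s, dst_port]
normal = np.column_stack([
    rng.normal(50_000, 10_000, 1_000),
    rng.normal(40, 10, 1_000),
    rng.normal(2.0, 0.5, 1_000),
    rng.choice([443, 80, 53], 1_000),
])

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# A suspicious flow: a huge transfer to an unusual port
suspicious = np.array([[5_000_000, 4_000, 300.0, 4444]])
print(detector.predict(suspicious))  # -1 means "anomaly" -> candidate for blocking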

AI-Enhanced VPN Features:

Smart Server Selection & Optimization: AI analyzes live network conditions to pick the fastest, least-congested servers, improving speed for streaming, gaming, or remote work.

Adaptive Encryption: Dynamic selection or modification of encryption regimes based on threat levels and data type—soon including seamless integration of quantum-resistant protocols.

Personalized Privacy: AI customizes user privacy settings, recommends more secure servers, and proactively flags applications or sites trying to harvest sensitive data.
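
A simplified sketch of the smart server selection idea follows, assuming hypothetical latency and load measurements; real products weigh many more signals (jitter, packet loss, route hops, and so on).

from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_ms: float   # measured round-trip time from the client
    load_pct: float     # reported server utilization, 0-100

def score(server: Server) -> float:
    # Lower is better: latency dominates, congestion adds a penalty (weights are assumptions)
    return server.latency_ms + 2.0 * server.load_pct

servers = [
    Server("fra-01", latency_ms=28.0, load_pct=85.0),
    Server("ams-03", latency_ms=34.0, load_pct=20.0),
    Server("lon-02", latency_ms=31.0, load_pct=55.0),
]

best = min(servers, key=score)
print(f"Connect to {best.name}")  # ams-03: slightly higher latency, but far less congested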

Quantum-Resistant and Decentralized VPNs: Tomorrow’s Core

Quantum Encryption Becomes Reality

Industry Rollout: By 2025, leading VPN companies like NordVPN aim to integrate quantum-resistant (post-quantum cryptography, PQC) encryption across all platforms, using protocols like ML-KEM/Kyber in hybrid modes for minimal performance loss.

Early Adoption: Early implementation of PQC-VPNs helps organizations future-proof data security and meet compliance challenges in the post-quantum era. The “harvest now, decrypt later” risk is a major driver for rapid adoption.

Competitive Advantage: Firms that adopt PQC early gain critical protection and an edge in customer trust.
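
The hybrid approach can be illustrated with a short conceptual sketch: a classical X25519 shared secret and a post-quantum KEM secret feed a single key derivation step, so the session key survives a break of either primitive. Because PQC library support still varies, the ML-KEM/Kyber step is represented by a placeholder secret; this is not a production handshake.

import os
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives import hashes

# Classical part: standard X25519 key agreement
client_priv, server_priv = X25519PrivateKey.generate(), X25519PrivateKey.generate()
classical_secret = client_priv.exchange(server_priv.public_key())

# Post-quantum part (placeholder): stands in for the shared secret an
# ML-KEM encapsulation/decapsulation would produce in a real deployment
pq_secret = os.urandom(32)

# Hybrid session key: both secrets feed one KDF, so an attacker must break both
session_key = HKDF(
    algorithm=hashes.SHA256(),
    length=32,
    salt=None,
    info=b"hybrid-vpn-handshake",
).derive(classical_secret + pq_secret)

print(session_key.hex())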

Decentralized VPNs (dVPNs) and Blockchain

Decentralization Surge: By 2030, about 15% of VPN users are expected to migrate to dVPNs, which use peer-to-peer networks to eliminate central points of failure and resist mass surveillance.

Blockchain Benefits: Blockchain-based VPNs provide transparent, verifiable privacy assurances. Users can independently audit no-log policies and provider practices in real time, removing the need for blind trust.

Market Examples: Platforms like Mysterium Network (20,000+ nodes across 135+ countries) and Orchid Protocol (multi-hop, crypto-powered routing) are driving innovation and adoption, though network variability and higher costs remain challenges.

Regulatory and Ethical Frontlines

Legal Pressure: Increasingly complex AI and privacy legislation is rolling out globally, with more enforcement and stricter penalties for breaches and non-compliance anticipated through 2025 and beyond.

Corporate Ethics Gap: 91% of companies say they need to do more to reassure customers about their data practices—highlighting a growing disconnect between policy and public trust.

Conclusion: AI Is the New Backbone of Privacy—But Requires Vigilance

The fusion of AI and VPN technologies is both urgent and promising: organizations and individuals must adapt to withstand AI-powered threats.

Expect quantum-ready encryption, decentralized structures, and adaptive, AI-powered privacy controls to become standard within the decade.

The organizations that move from theoretical risk management to active, transparent, and user-centric privacy innovation will lead the next era of digital trust and security.

Key Stats Table

Metric | Value/Insight
AI privacy breaches (2024) | 233 incidents, up 56.4% YoY
Global VPN users (2025) | 1.9 billion+ (20% YoY growth)
Market size (2025→2026) | $44.6B → $77B
Consumer trust in AI companies | 70% have little/no trust
Quantum-resistant VPN adoption | Major rollout by 2025
Decentralized VPN adoption (2030) | 15% of VPN users

Organizations and consumers who embrace next-gen AI-driven privacy tools—and demand transparent, quantum-ready, decentralized protection—will shape a safer, more secure online future.

Sources:

https://www.kiteworks.com/cybersecurity-risk-management/ai-data-privacy-risks-stanford-index-report-2025/

https://secureframe.com/blog/data-privacy-statistics

VPN Usage Explodes: Must-Know VPN Statistics for 2025

VPN Usage Statistics and Trends for 2025–2026: What the Data Reveals

https://www.linkedin.com/pulse/vpn-services-market-2025-new-data-insights-research-2032-pwx8c

2025 VPN Trends, Statistics, and Consumer Opinions

https://www.mckinsey.com/about-us/new-at-mckinsey-blog/ai-is-the-greatest-threat-and-defense-in-cybersecurity-today

https://www.rapid7.com/blog/post/emerging-trends-in-ai-related-cyberthreats-in-2025-impacts-on-organizational-cybersecurity/

https://www.darktrace.com/blog/ai-and-cybersecurity-predictions-for-2025

https://circleid.com/posts/nordvpn-introduces-quantum-resilient-encryption

https://www.fortinet.com/resources/cyberglossary/quantum-safe-encryption

https://quantumxc.com/blog/the-quantum-revolution-in-2025-and-beyond/

https://www.futuremarketinsights.com/blogs/vpn-industry

https://axis-intelligence.com/decentralized-vpn-explain-guide-2025/

https://www.jacksonlewis.com/insights/year-ahead-2025-tech-talk-ai-regulations-data-privacy

https://termly.io/resources/articles/ai-statistics/

https://cloudsecurityalliance.org/blog/2025/04/22/ai-and-privacy-2024-to-2025-embracing-the-future-of-global-legal-developments

https://newsroom.cisco.com/c/r/newsroom/en/us/a/y2025/m04/cisco-2025-data-privacy-benchmark-study-privacy-landscape-grows-increasingly-complex-in-the-age-of-ai.html

https://hai.stanford.edu/ai-index/2025-ai-index-report
