Chunking vs. Tokenization: Key Differences in AI Text Processing

Table of contents: Introduction | What is Tokenization? | What is Chunking? | The Key Differences That Matter | Why This Matters for Real Applications | Where You'll Use Each Approach | Current Best Practices (What Actually Works) | Summary

Introduction

When you’re working with AI and natural language processing, you’ll quickly encounter two fundamental concepts that often get confused: tokenization and chunking. While both involve breaking down text into smaller pieces, they serve completely different purposes and work at different scales. If you’re building AI applications, understanding these differences isn’t just academic—it’s crucial for creating systems that actually work well.

Think of it this way: if you’re making a sandwich, tokenization is like cutting your ingredients into bite-sized pieces, while chunking is like organizing those pieces into logical groups that make sense to eat together. Both are necessary, but they solve different problems.


What is Tokenization?

Tokenization is the process of breaking text into the smallest meaningful units that AI models can understand. These units, called tokens, are the basic building blocks that language models work with. You can think of tokens as the “words” in an AI’s vocabulary, though they’re often smaller than actual words.

There are several ways to create tokens:

Word-level tokenization splits text at spaces and punctuation. It’s straightforward but creates problems with rare words that the model has never seen before.

Subword tokenization is more sophisticated and widely used today. Methods like Byte Pair Encoding (BPE), WordPiece, and SentencePiece break words into smaller chunks based on how frequently character combinations appear in training data. This approach handles new or rare words much better.

Character-level tokenization treats each letter as a token. It’s simple but creates very long sequences that are harder for models to process efficiently.

Here’s a practical example:

Original text: “AI models process text efficiently.”

Word tokens: [“AI”, “models”, “process”, “text”, “efficiently”]

Subword tokens: [“AI”, “model”, “s”, “process”, “text”, “efficient”, “ly”]

Notice how subword tokenization splits “models” into “model” and “s” because this pattern appears frequently in training data. This helps the model understand related words like “modeling” or “modeled” even if it hasn’t seen them before.
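To make this concrete, here is a minimal sketch that runs two real tokenizers over the same sentence. It assumes the Hugging Face transformers library is installed and that the gpt2 (BPE) and bert-base-uncased (WordPiece) checkpoints can be downloaded; the exact splits you see will differ from the illustrative example above.

from transformers import AutoTokenizer

text = "AI models process text efficiently."

for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)   # downloads tokenizer files on first use
    pieces = tok.tokenize(text)                 # subword strings, e.g. word fragments
    ids = tok.encode(text)                      # the integer IDs a model actually consumes
    print(f"{name}: {len(ids)} token IDs, pieces = {pieces}")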

What is Chunking?

Chunking takes a completely different approach. Instead of breaking text into tiny pieces, it groups text into larger, coherent segments that preserve meaning and context. When you’re building applications like chatbots or search systems, you need these larger chunks to maintain the flow of ideas.

Think about reading a research paper. You wouldn’t want each sentence scattered randomly—you’d want related sentences grouped together so the ideas make sense. That’s exactly what chunking does for AI systems.

Here’s how it works in practice:

Original text: “AI models process text efficiently. They rely on tokens to capture meaning and context. Chunking allows better retrieval.”

Chunk 1: “AI models process text efficiently.”

Chunk 2: “They rely on tokens to capture meaning and context.”

Chunk 3: “Chunking allows better retrieval.”

Modern chunking strategies have become quite sophisticated:

Fixed-length chunking creates chunks of a specific size (like 500 words or 1000 characters). It’s predictable but sometimes breaks up related ideas awkwardly.

Semantic chunking is smarter—it looks for natural breakpoints where topics change, using AI to understand when ideas shift from one concept to another.

Recursive chunking works hierarchically, first trying to split at paragraph breaks, then sentences, then smaller units if needed.

Sliding window chunking creates overlapping chunks to ensure important context isn’t lost at boundaries.
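As an illustration of two of these strategies (not a library implementation), the sketch below contrasts fixed-length chunking with a sliding-window overlap against a simple sentence-aware splitter. The function names and sizes are illustrative, and sizes are counted in characters purely for readability; production systems usually count tokens.

import re

def fixed_length_chunks(text, size=200, overlap=40):
    """Overlapping character windows; the overlap keeps context across boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def sentence_chunks(text, max_chars=200):
    """Greedy packing of whole sentences so chunks end at natural breakpoints."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = ("AI models process text efficiently. They rely on tokens to capture meaning "
       "and context. Chunking allows better retrieval.")
print(fixed_length_chunks(doc, size=80, overlap=20))
print(sentence_chunks(doc, max_chars=90))

Sliding windows trade a little extra storage for safety at boundaries, while the sentence-aware variant keeps each chunk ending at a natural breakpoint.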

The Key Differences That Matter

Understanding when to use each approach makes all the difference in your AI applications:

What You're Doing | Tokenization | Chunking
Size | Tiny pieces (words, parts of words) | Bigger pieces (sentences, paragraphs)
Goal | Make text digestible for AI models | Keep meaning intact for humans and AI
When You Use It | Training models, processing input | Search systems, question answering
What You Optimize For | Processing speed, vocabulary size | Context preservation, retrieval accuracy

Why This Matters for Real Applications

For AI Model Performance

When you’re working with language models, tokenization directly affects how much you pay and how fast your system runs. Models like GPT-4 charge by the token, so efficient tokenization saves money. Current models have different limits:

GPT-4: Around 128,000 tokens

Claude 3.5: Up to 200,000 tokens

Gemini 2.0 Pro: Up to 2 million tokens

Recent research shows that larger models actually work better with bigger vocabularies. For example, while LLaMA-2 70B uses about 32,000 different tokens, it would probably perform better with around 216,000. This matters because the right vocabulary size affects both performance and efficiency.

For Search and Question-Answering Systems

Chunking strategy can make or break your RAG (Retrieval-Augmented Generation) system. If your chunks are too small, you lose context. Too big, and you overwhelm the model with irrelevant information. Get it right, and your system provides accurate, helpful answers. Get it wrong, and you get hallucinations and poor results.

Companies building enterprise AI systems have found that smart chunking strategies significantly reduce those frustrating cases where AI makes up facts or gives nonsensical answers.

Where You’ll Use Each Approach

Tokenization is Essential For:

Training new models – You can’t train a language model without first tokenizing your training data. The tokenization strategy affects everything about how well the model learns.

Fine-tuning existing models – When you adapt a pre-trained model for your specific domain (like medical or legal text), you need to carefully consider whether the existing tokenization works for your specialized vocabulary.

Cross-language applications – Subword tokenization is particularly helpful when working with languages that have complex word structures or when building multilingual systems.

Chunking is Critical For:

Building company knowledge bases – When you want employees to ask questions and get accurate answers from your internal documents, proper chunking ensures the AI retrieves relevant, complete information.

Document analysis at scale – Whether you’re processing legal contracts, research papers, or customer feedback, chunking helps maintain document structure and meaning.

Search systems – Modern search goes beyond keyword matching. Semantic chunking helps systems understand what users really want and retrieve the most relevant information.

Current Best Practices (What Actually Works)

After watching many real-world implementations, here’s what tends to work:

For Chunking:

Start with 512-1024 token chunks for most applications

Add 10-20% overlap between chunks to preserve context (see the sketch after this list)

Use semantic boundaries when possible (end of sentences, paragraphs)

Test with your actual use cases and adjust based on results

Monitor for hallucinations and tweak your approach accordingly
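The sketch referenced above combines the first two practices: token-counted chunks with a proportional overlap. It assumes the tiktoken package and its cl100k_base encoding; any tokenizer that exposes encode and decode would work the same way.

import tiktoken

def token_chunks(text, chunk_tokens=512, overlap_ratio=0.15):
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode(text)
    step = int(chunk_tokens * (1 - overlap_ratio))   # 512-token chunks with ~15% overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + chunk_tokens]
        if not window:
            break
        chunks.append(enc.decode(window))            # back to text for indexing and retrieval
        if start + chunk_tokens >= len(ids):
            break
    return chunks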

For Tokenization:

Use established methods (BPE, WordPiece, SentencePiece) rather than building your own

Consider your domain—medical or legal text might need specialized approaches

Monitor out-of-vocabulary rates in production

Balance between compression (fewer tokens) and meaning preservation
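One rough way to act on the last two points (a sketch, assuming the transformers library and whichever tokenizer your model actually uses) is to track average tokens per whitespace-separated word on a sample of production or domain text; a rising ratio signals worse compression and heavier subword fragmentation.

from transformers import AutoTokenizer

def tokens_per_word(texts, tokenizer_name="bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = sum(len(tok.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

sample = ["The patient presented with acute myocardial infarction.",
          "Dosage was titrated to 12.5 mg twice daily."]
print(f"tokens per word: {tokens_per_word(sample):.2f}")  # higher = worse compression on this domain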

Summary

Tokenization and chunking aren’t competing techniques—they’re complementary tools that solve different problems. Tokenization makes text digestible for AI models, while chunking preserves meaning for practical applications.

As AI systems become more sophisticated, both techniques continue evolving. Context windows are getting larger, vocabularies are becoming more efficient, and chunking strategies are getting smarter about preserving semantic meaning.

The key is understanding what you’re trying to accomplish. Building a chatbot? Focus on chunking strategies that preserve conversational context. Training a model? Optimize your tokenization for efficiency and coverage. Building an enterprise search system? You’ll need both—smart tokenization for efficiency and intelligent chunking for accuracy.

A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models

In this tutorial, we set out to recreate the spirit of the Hierarchical Reasoning Model (HRM) using a free Hugging Face model that runs locally. We walk through the design of a lightweight yet structured reasoning agent, where we act as both architects and experimenters. By breaking problems into subgoals, solving them with Python, critiquing the outcomes, and synthesizing a final answer, we can experience how hierarchical planning and execution can enhance reasoning performance. This process enables us to see, in real-time, how a brain-inspired workflow can be implemented without requiring massive model sizes or expensive APIs. Check out the Paper and FULL CODES.

!pip -q install -U transformers accelerate bitsandbytes rich

import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32

We begin by installing the required libraries and loading the Qwen2.5-1.5B-Instruct model from Hugging Face. We set the data type based on GPU availability to ensure efficient model execution in Colab.

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=DTYPE,
    load_in_4bit=True
)
gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
    return_full_text=False
)

We load the tokenizer and model, configure it to run in 4-bit for efficiency, and wrap everything in a text-generation pipeline so we can interact with the model easily in Colab.

def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    msgs = []
    if system:
        msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": prompt})
    inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature > 0), temperature=temperature, top_p=0.9)
    return out[0]["generated_text"].strip()

def extract_json(txt: str) -> Dict[str, Any]:
    m = re.search(r"\{[\s\S]*\}$", txt.strip())
    if not m:
        m = re.search(r"\{[\s\S]*?\}", txt)
    try:
        return json.loads(m.group(0)) if m else {}
    except Exception:
        # fallback: strip code fences
        s = re.sub(r"^```.*?\n|\n```$", "", txt, flags=re.S)
        try:
            return json.loads(s)
        except Exception:
            return {}

We define helper functions: the chat function allows us to send prompts to the model with optional system instructions and sampling controls, while extract_json helps us parse structured JSON outputs from the model reliably, even if the response includes code fences or additional text.
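As a quick illustrative check (not part of the original notebook), the parser can be exercised on the kinds of strings the model tends to return; the example responses below are hypothetical.

plain = '{"action": "submit", "critique": "all subgoals verified", "fix_hint": ""}'
wrapped = 'Sure! Here is the plan: {"subgoals": ["load data", "compute"], "final_format": "Answer: <value>"}'

print(extract_json(plain))    # parsed directly into a dict
print(extract_json(wrapped))  # the JSON object is pulled out of the surrounding prose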

def extract_code(txt: str) -> str:
    m = re.search(r"```(?:python)?\s*([\s\S]*?)```", txt, flags=re.I)
    return (m.group(1) if m else txt).strip()

def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
    import io, contextlib
    g = {"__name__": "__main__"}; l = {}
    if env: g.update(env)
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, g, l)
        out = l.get("RESULT", g.get("RESULT"))
        return {"ok": True, "result": out, "stdout": buf.getvalue()}
    except Exception as e:
        return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}

PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2-4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":"<one-line answer format>"}."""

SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""

CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...", "fix_hint":"<if revise>"}."""

SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""

We add two important pieces: utility functions and system prompts. The extract_code function pulls Python snippets from the model's output, while run_python safely executes those snippets and captures their results. Alongside, we define four role prompts, Planner, Solver, Critic, and Synthesizer, which guide the model to break tasks into subgoals, solve them with code, verify correctness, and finally produce a clean answer.

def plan(task: str) -> Dict[str, Any]:
    p = f"TASK:\n{task}\nReturn JSON only."
    return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))

def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
    prompt = f"SUBGOAL:\n{subgoal}\nCONTEXT vars: {list(context.keys())}\nReturn Python code only."
    code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
    res = run_python(code, env=context)
    return {"subgoal": subgoal, "code": code, "run": res}

def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(pl, ensure_ascii=False, indent=2) + "\nReturn JSON only.",
               CRITIC_SYS, temperature=0.1, max_new_tokens=250)
    return extract_json(out)

def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(logs, ensure_ascii=False) + "\nReturn JSON only.",
               sys, temperature=0.2, max_new_tokens=250)
    j = extract_json(out)
    return j if j.get("subgoals") else {}

def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
    packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
    return chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(packed, ensure_ascii=False) +
                f"\nfinal_format: {final_format}\nOnly the final answer.",
                SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()

def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
    ctx = dict(context or {})
    trace, plan_json = [], plan(task)
    for round_id in range(1, budget + 1):
        logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
        for L in logs:
            ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal'])) % 9999}"
            ctx[ctx_key] = L["run"].get("result")
        verdict = critic(task, logs)
        trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
        if verdict.get("action") == "submit": break
        plan_json = refine(task, logs) or plan_json
    final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: <value>"))
    return {"final": final, "trace": trace}

We implement the full HRM loop: we plan subgoals, solve each by generating and running Python (capturing RESULT), then we critique, optionally refine the plan, and synthesize a clean final answer. We orchestrate these rounds in hrm_agent, carrying forward intermediate results as context so we iteratively improve and stop once the critic says “submit.”

ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: <grid>", where <grid> is a Python list of lists of ints.
""").strip()
ARC_DATA = {
    "train": [
        {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
        {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]}
    ],
    "test": [[0,0],[0,1]]
}
res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("\n[bold]Demo 1 — ARC-like Toy[/bold]")
rprint(res1["final"])

WM_TASK = "A tank holds 1200 L. It leaks 2% per hour for 3 hours, then is refilled by 150 L. Return exactly: 'Answer: <liters>'."
res2 = hrm_agent(WM_TASK, context={}, budget=2)
rprint("\n[bold]Demo 2 — Word Math[/bold]")
rprint(res2["final"])

rprint("\n[dim]Rounds executed (Demo 1):[/dim]", len(res1["trace"]))

We run two demos to validate the agent: an ARC-style task where we infer a transformation from train pairs and apply it to a test grid, and a word-math problem that checks numeric reasoning. We call hrm_agent with each task, print the final answers, and also display the number of reasoning rounds the ARC run takes.

In conclusion, we recognize that what we have built is more than a simple demonstration; it is a window into how hierarchical reasoning can make smaller models punch above their weight. By layering planning, solving, and critiquing, we empower a free Hugging Face model to perform tasks with surprising robustness. We leave with a deeper appreciation of how brain-inspired structures, when paired with practical, open-source tools, enable us to explore reasoning benchmarks and experiment creatively without incurring high costs. This hands-on journey shows us that advanced cognitive-like workflows are accessible to anyone willing to tinker, iterate, and learn.

Check out the Paper and FULL CODES.

Microsoft AI Introduces rStar2-Agent: A 14B Math Reasoning Model Trained with Agentic Reinforcement Learning to Achieve Frontier-Level Performance

Table of contents: The Problem with "Thinking Longer" | The Agentic Approach | Infrastructure Challenges and Solutions | GRPO-RoC: Learning from High-Quality Examples | Training Strategy: From Simple to Complex | Breakthrough Results | Understanding the Mechanisms | Summary

The Problem with “Thinking Longer”

Large language models have made impressive strides in mathematical reasoning by extending their Chain-of-Thought (CoT) processes—essentially “thinking longer” through more detailed reasoning steps. However, this approach has fundamental limitations. When models encounter subtle errors in their reasoning chains, they often compound these mistakes rather than detecting and correcting them. Internal self-reflection frequently fails, especially when the initial reasoning approach is fundamentally flawed.

Microsoft's new research report introduces rStar2-Agent, which takes a different approach: instead of just thinking longer, it teaches models to think smarter by actively using coding tools to verify, explore, and refine their reasoning process.

https://arxiv.org/abs/2508.20722

The Agentic Approach

rStar2-Agent represents a shift toward agentic reinforcement learning, where a 14B parameter model interacts with a Python execution environment throughout its reasoning process. Rather than relying solely on internal reflection, the model can write code, execute it, analyze the results, and adjust its approach based on concrete feedback.

This creates a dynamic problem-solving process. When the model encounters a complex mathematical problem, it might generate initial reasoning, write Python code to test hypotheses, analyze execution results, and iterate toward a solution. The approach mirrors how human mathematicians often work—using computational tools to verify intuitions and explore different solution paths.

Infrastructure Challenges and Solutions

Scaling agentic RL presents significant technical hurdles. During training, a single batch can generate tens of thousands of concurrent code execution requests, creating bottlenecks that can stall GPU utilization. The researchers addressed this with two key infrastructure innovations.

First, they built a distributed code execution service capable of handling 45,000 concurrent tool calls with sub-second latency. The system isolates code execution from the main training process while maintaining high throughput through careful load balancing across CPU workers.

Second, they developed a dynamic rollout scheduler that allocates computational work based on real-time GPU cache availability rather than static assignment. This prevents GPU idle time caused by uneven workload distribution—a common problem when some reasoning traces require significantly more computation than others.

These infrastructure improvements enabled the entire training process to complete in just one week using 64 AMD MI300X GPUs, demonstrating that frontier-level reasoning capabilities don’t require massive computational resources when efficiently orchestrated.

GRPO-RoC: Learning from High-Quality Examples

The core algorithmic innovation is Group Relative Policy Optimization with Resampling on Correct (GRPO-RoC). Traditional reinforcement learning in this context faces a quality problem: models receive positive rewards for correct final answers even when their reasoning process includes multiple code errors or inefficient tool usage.

GRPO-RoC addresses this by implementing an asymmetric sampling strategy. During training, the algorithm:

Oversamples initial rollouts to create a larger pool of reasoning traces

Preserves diversity in failed attempts to maintain learning from various error modes

Filters positive examples to emphasize traces with minimal tool errors and cleaner formatting

This approach ensures the model learns from high-quality successful reasoning while still being exposed to diverse failure patterns. The result is more efficient tool usage and shorter, more focused reasoning traces.
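The sampling asymmetry can be sketched in a few lines of Python. This is only a schematic of the idea described above, using a hypothetical rollout record with correct, tool_errors, and answer_len fields; it is not the paper's implementation.

import random

def grpo_roc_select(rollouts, keep_per_group=8):
    correct = [r for r in rollouts if r["correct"]]
    failed = [r for r in rollouts if not r["correct"]]

    # Positive traces: keep only the cleanest ones (few tool errors, shorter output).
    correct.sort(key=lambda r: (r["tool_errors"], r["answer_len"]))
    kept_correct = correct[: keep_per_group // 2]

    # Negative traces: sample uniformly to preserve diverse failure modes.
    kept_failed = random.sample(failed, min(len(failed), keep_per_group - len(kept_correct)))
    return kept_correct + kept_failed

Oversampling happens upstream: the policy generates more rollouts per problem than the group size, and a filter like this decides which traces the policy-gradient update actually sees.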

https://arxiv.org/abs/2508.20722

Training Strategy: From Simple to Complex

The training process unfolds in three carefully designed stages, starting with non-reasoning supervised fine-tuning that focuses purely on instruction following and tool formatting—deliberately avoiding complex reasoning examples that might create early biases.

Stage 1 constrains responses to 8,000 tokens, forcing the model to develop concise reasoning strategies. Despite this limitation, performance jumps dramatically—from near-zero to over 70% on challenging benchmarks.

Stage 2 extends the token limit to 12,000, allowing for more complex reasoning while maintaining the efficiency gains from the first stage.

Stage 3 shifts focus to the most difficult problems by filtering out those the model has already mastered, ensuring continued learning from challenging cases.

This progression from concise to extended reasoning, combined with increasing problem difficulty, maximizes learning efficiency while minimizing computational overhead.

Breakthrough Results

The results are striking. rStar2-Agent-14B achieves 80.6% accuracy on AIME24 and 69.8% on AIME25, surpassing much larger models including the 671B parameter DeepSeek-R1. Perhaps more importantly, it accomplishes this with significantly shorter reasoning traces—averaging around 10,000 tokens compared to over 17,000 for comparable models.

The efficiency gains extend beyond mathematics. Despite training exclusively on math problems, the model demonstrates strong transfer learning, outperforming specialized models on scientific reasoning benchmarks and maintaining competitive performance on general alignment tasks.

https://arxiv.org/abs/2508.20722

Understanding the Mechanisms

Analysis of the trained model reveals fascinating behavioral patterns. High-entropy tokens in reasoning traces fall into two categories: traditional “forking tokens” that trigger self-reflection and exploration, and a new category of “reflection tokens” that emerge specifically in response to tool feedback.

These reflection tokens represent a form of environment-driven reasoning where the model carefully analyzes code execution results, diagnoses errors, and adjusts its approach accordingly. This creates more sophisticated problem-solving behavior than pure CoT reasoning can achieve.

Summary

rStar2-Agent demonstrates that moderate-sized models can achieve frontier-level reasoning through sophisticated training rather than brute-force scaling. The approach suggests a more sustainable path toward advanced AI capabilities—one that emphasizes efficiency, tool integration, and smart training strategies over raw computational power.

The success of this agentic approach also points toward future AI systems that can seamlessly integrate multiple tools and environments, moving beyond static text generation toward dynamic, interactive problem-solving capabilities.

Check out the Paper and GitHub Page.

Top 20 Voice AI Blogs and News Websites 2025: The Ultimate Resource Guide

Voice AI technology has experienced unprecedented growth in 2025, with revolutionary breakthroughs in real-time conversational AI, emotional intelligence, and voice synthesis. As enterprises increasingly adopt voice agents and consumers embrace next-generation AI assistants, staying informed about the latest developments has become crucial for professionals across industries. The global Voice AI market reached $5.4 billion in 2024, a remarkable 25% increase over the previous year, with voice AI solutions attracting $2.1 billion in equity funding.

Top 20 Voice AI Blogs and Websites

1. OpenAI Blog – Voice AI Research & Development

OpenAI leads the voice AI revolution with groundbreaking models like GPT-4o Realtime API and advanced text-to-speech systems. Their blog provides insider insights into cutting-edge research, model releases, and real-world applications. OpenAI’s recent announcement of gpt-realtime and Realtime API updates for production voice agents represents a major breakthrough in conversational AI.

Key Focus Areas:

Real-time speech-to-speech models

Voice synthesis and emotional expression

Safety and responsible AI deployment

Developer tools and APIs

2. MarkTechPost – Voice AI News & Analysis

MarkTechPost has established itself as the go-to source for comprehensive AI news coverage, with exceptional depth in voice AI reporting. Their expert analysis of emerging technologies and market trends makes complex developments accessible to both technical and business audiences. Their recent coverage of Microsoft’s MAI-Voice-1 launch and comprehensive analysis of the voice AI landscape demonstrates their commitment to timely, authoritative reporting.

Key Focus Areas:

Voice AI market analysis and trends

Technical breakthroughs in speech synthesis

Enterprise voice agent implementations

Industry funding and acquisitions

3. Google AI Blog – Multimodal & Speech Research

Google’s research team consistently pushes the boundaries of conversational AI, with innovations like real-time voice agent architecture and advanced speech recognition systems. Their recent work on building real-time voice agents with Gemini demonstrates practical applications of their research.

Key Contributions:

Multimodal AI integration

Real-time voice agent architecture

Speech understanding and generation

Privacy-preserving voice technologies

4. Microsoft Azure AI Blog – Enterprise Voice Solutions

Microsoft's Azure AI Speech services power millions of enterprise applications. Their blog provides practical insights into implementing voice AI at scale, including personal voice creation, enterprise speech-to-text solutions, and multilingual voice support.

Focus Areas:

Personal voice creation and customization

Enterprise speech-to-text solutions

Multilingual voice support

Azure cognitive services integration

5. ElevenLabs Blog – Voice Synthesis Innovation

ElevenLabs has revolutionized voice cloning and synthesis, setting new standards for natural-sounding AI voices. The company secured $180 million in Series C funding in January 2025, reaching a valuation of $3.3 billion, demonstrating strong investor confidence in their technology.

Specializations:

Voice cloning technology

Multilingual speech synthesis

Creative applications in media

API development for voice integration

6. Deepgram Blog – Speech Recognition Excellence

Deepgram’s State of Voice AI 2025 report provides authoritative market analysis, identifying 2025 as “the year of human-like voice AI agents”. Their technical content explores the latest in speech recognition and real-time transcription.

Key Insights:

Voice AI market trends and predictions

Technical deep-dives into speech recognition

Developer tutorials and best practices

Industry adoption case studies

7. Anthropic Research – Conversational AI Ethics & Voice Mode

Anthropic’s work on Claude focuses on safe, beneficial AI development with emphasis on alignment and responsible deployment. In May 2025, Anthropic launched voice mode for Claude, powered by Claude Sonnet 4, enabling complete spoken conversations with five distinct voice options.

Focus Areas:

AI safety in conversational systems

Ethical voice AI development

Human-AI interaction research

Voice mode implementation using ElevenLabs technology

8. Stanford HAI Blog – Academic Voice AI Research

Stanford’s Human-Centered AI Institute produces cutting-edge research on voice interaction and turn-taking in conversations. Their recent work on teaching voice assistants when to speak represents breakthrough research in conversational AI, moving beyond simple silence detection to analyze voice intonation patterns.

Research Highlights:

Conversational AI turn-taking and interruption handling

World Wide Voice Web (WWvW) development

Silent speech recognition advances

Open-source virtual assistant development

9. Hume AI Blog – Emotionally Intelligent Voice

Hume AI specializes in emotionally intelligent voice interactions, combining speech technology with empathic understanding. Their Empathic Voice Interface (EVI 3) represents a breakthrough in conversational AI, capable of understanding and responding with natural, emotionally intelligent voice interactions.

Innovations:

Emotional intelligence in voice AI

Empathic voice interfaces

Voice control and customization

Human wellbeing optimization through AI

10. MIT Technology Review – Voice AI Analysis

MIT Technology Review provides in-depth analysis of voice AI trends, societal implications, and breakthrough research with rigorous journalistic standards. Their coverage includes voice AI diversity initiatives, synthetic voice technology implications, and ethical considerations in voice technology deployment.

Coverage Areas:

Voice AI diversity and inclusion

Audio deepfake detection and prevention

Industry analysis and market trends

Ethical considerations in voice tech

11. Resemble AI Blog – Voice Cloning & Security

Resemble AI leads in voice cloning technology while addressing security concerns like deepfake detection. They specialize in advanced voice cloning techniques, enterprise voice solutions, and voice security authentication.

Expertise:

Advanced voice cloning techniques

Deepfake detection and prevention

Enterprise voice solutions

Voice security and authentication

12. TechCrunch – Voice AI Industry News

TechCrunch provides comprehensive coverage of voice AI startups, funding rounds, and industry developments. They extensively covered Anthropic’s voice mode launch and provide regular updates on industry partnerships and product launches.

Coverage Focus:

Startup funding and acquisitions

Industry partnerships and deals

Product launches and demos

Market analysis and predictions

13. VentureBeat AI – Voice Technology Trends

VentureBeat offers detailed coverage of voice AI business applications and enterprise adoption trends. They specialize in enterprise AI adoption analysis, voice technology market research, and developer tools coverage.

Specializations:

Enterprise AI adoption

Voice technology market analysis

Product reviews and comparisons

Developer tools and platforms

14. Towards Data Science – Technical Voice AI Content

This Medium publication features hands-on tutorials, technical deep-dives, and practical implementations of voice AI technologies. Content includes privacy-preserving voice AI implementations, voice assistant tuning, and AI-powered language learning applications.

Content Types:

Technical tutorials and guides

Voice AI implementation case studies

Python and machine learning applications

Data science approaches to speech

15. Amazon Alexa Blog – Voice Assistant Innovation

Amazon’s Alexa team shares insights into voice assistant development and smart home integration. However, the 2025 Alexa+ launch has faced significant challenges including reliability issues, missing features, and smart home compatibility problems.

Current Status:

Voice assistant development insights

Smart home integration challenges

Alexa+ beta testing with mixed results

Over one million users now have access to Alexa+, but with notable limitations

16. Speechify Blog – Accessibility & Voice Tech

Speechify focuses on accessibility applications of voice technology and text-to-speech innovations. They specialize in accessibility through voice technology, learning tools, and voice AI applications for diverse needs.

Specializations:

Accessibility through voice technology

Text-to-speech applications

Learning and productivity tools

Voice AI for diverse user needs

17. Murf AI Blog – Voice Generation Applications

Murf AI provides practical insights into voice generation for content creation, marketing, and business applications. Their coverage includes voice generation for content creators, marketing applications, and business use cases.

Coverage:

Voice generation for content creators

Marketing applications of voice AI

Business use cases and ROI analysis

Voice customization techniques

18. Wondercraft AI Blog – Audio Content Creation

Wondercraft focuses on AI-powered audio content creation, offering insights into podcast generation and creative voice applications. Their innovations include AI podcast generation, creative audio applications, and voice design customization.

Innovations:

AI podcast generation

Creative audio applications

Voice design and customization

Audio content automation

19. Play.ht Blog – Voice Synthesis & Applications

Play.ht covers the full spectrum of voice AI applications, from technical implementation to creative use cases. They provide comprehensive coverage of voice synthesis technology, multilingual voice support, and API integration guides.

Content Focus:

Voice synthesis technology

Multilingual voice support

Podcast and content creation

API integration guides

20. Picovoice Blog – Edge Voice AI

Picovoice specializes in on-device voice AI, providing insights into privacy-preserving voice technologies and edge computing applications. Their expertise includes on-device voice processing, privacy-preserving voice AI, and wake word detection.

Expertise:

On-device voice processing

Privacy-preserving voice AI

Edge computing for voice applications

Wake word detection and processing

Conclusion

The voice AI landscape in 2025 is characterized by rapid innovation and significant market growth, but also implementation challenges as companies rush to market with products that may not be fully ready. From OpenAI’s groundbreaking real-time APIs to the emergence of emotionally intelligent voice agents, staying informed through these authoritative sources is essential for anyone working in or interested in voice AI technology.

These 20 blogs and websites represent some of the best resources for understanding both the technical innovations and market dynamics shaping the future of voice AI. Whether you're a developer building voice applications, a business leader evaluating voice AI solutions, or a researcher pushing the boundaries of conversational AI, these resources will keep you at the forefront of this transformative technology, while also providing realistic perspectives on current limitations and challenges in the field.

Microsoft AI Lab Unveils MAI-Voice-1 and MAI-1-Preview: New In-House Models for Voice AI

Microsoft AI lab officially launched MAI-Voice-1 and MAI-1-preview, marking a new phase for the company's artificial intelligence research and development efforts. The announcement signals that Microsoft AI is now building core models in-house rather than relying solely on third-party providers. MAI-Voice-1 and MAI-1-preview serve distinct but complementary roles in speech synthesis and general-purpose language understanding.

MAI-Voice-1: Technical Details and Capabilities

MAI-Voice-1 is a speech generation model that produces high-fidelity audio. It generates one minute of natural-sounding audio in under one second on a single GPU, supporting applications such as interactive assistants and podcast narration with low latency and modest hardware requirements.

The model uses a transformer-based architecture trained on a diverse multilingual speech dataset. It handles single-speaker and multi-speaker scenarios, providing expressive and context-appropriate voice outputs.

MAI-Voice-1 is integrated into Microsoft products like Copilot Daily for voice updates and news summaries. It is available for testing in Copilot Labs, where users can create audio stories or guided narratives from text prompts.

Technically, the model focuses on quality, versatility, and speed. Its single-GPU operation differs from systems requiring multiple GPUs, enabling integration into consumer devices and cloud applications beyond research settings.

MAI-1-Preview: Foundation Model Architecture and Performance

MAI-1-preview is Microsoft’s first end-to-end, in-house foundation language model. Unlike previous models that Microsoft integrated or licensed from outside, MAI-1-preview was trained entirely on Microsoft’s own infrastructure, using a mixture-of-experts architecture and approximately 15,000 NVIDIA H100 GPUs.

The Microsoft AI team has made MAI-1-preview available on the LMArena platform, where it can be evaluated alongside several other models. MAI-1-preview is optimized for instruction-following and everyday conversational tasks, making it suitable for consumer-focused applications rather than enterprise or highly specialized use cases. Microsoft has begun rolling out access to the model for select text-based scenarios within Copilot, with a gradual expansion planned as feedback is collected and the system is refined.

Model Development and Training Infrastructure

The development of MAI-Voice-1 and MAI-1-preview was supported by Microsoft’s next-generation GB200 GPU cluster, a custom-built infrastructure specifically optimized for training large generative models. In addition to hardware, Microsoft has invested heavily in talent, assembling a team with deep expertise in generative AI, speech synthesis, and large-scale systems engineering. The company’s approach to model development emphasizes a balance between fundamental research and practical deployment, aiming to create systems that are not just theoretically impressive but also reliable and useful in everyday scenarios.

Applications

MAI-Voice-1 can be used for real-time voice assistance, audio content creation in media and education, or accessibility features. Its ability to simulate multiple speakers supports use in interactive scenarios such as storytelling, language learning, or simulated conversations. The model’s efficiency also allows for deployment on consumer hardware.

MAI-1-preview is focused on general language understanding and generation, assisting with tasks like drafting emails, answering questions, summarizing text, or helping with schoolwork in a conversational format.

Conclusion

Microsoft’s release of MAI-Voice-1 and MAI-1-preview shows the company can now develop core generative AI models internally, backed by substantial investment in training infrastructure and technical talent. Both models are intended for practical, real-world use and are being refined with user feedback. This development adds to the diversity of model architectures and training methods in the field, with a focus on systems that are efficient, reliable, and suitable for integration into everyday applications. Microsoft’s approach—using large-scale resources, gradual deployment, and direct engagement with users—offers one example of how organizations can progress AI capabilities while emphasizing practical, incremental improvement.

Check out the Technical details here.

Building and Optimizing Intelligent Machine Learning Pipelines with TPOT for Complete Automation and Performance Enhancement

In this tutorial, we demonstrate how to harness TPOT to automate and optimize machine learning pipelines in practice. By working directly in Google Colab, we keep the setup lightweight, reproducible, and accessible. We walk through loading data, defining a custom scorer, tailoring the search space with advanced models like XGBoost, and setting up a cross-validation strategy. As we proceed, we explore how TPOT's evolutionary algorithms search for high-performing pipelines, with transparency provided through Pareto fronts and checkpoints. Check out the FULL CODES here.

!pip -q install tpot==0.12.2 xgboost==2.0.3 scikit-learn==1.4.2 graphviz==0.20.3

import os, json, math, time, random, numpy as np, pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, f1_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from tpot import TPOTClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

SEED = 7
random.seed(SEED); np.random.seed(SEED); os.environ["PYTHONHASHSEED"] = str(SEED)

We begin by installing the libraries and importing all the essential modules that support data handling, model building, and pipeline optimization. We set a fixed random seed to ensure our results remain reproducible every time we run the notebook.

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

def f1_cost_sensitive(y_true, y_pred):
    return f1_score(y_true, y_pred, average='binary', pos_label=1)
cost_f1 = make_scorer(f1_cost_sensitive, greater_is_better=True)

Here, we load the breast cancer dataset and split it into training and testing sets while preserving class balance. We standardize the features for stability and then define a custom F1-based scorer, allowing us to evaluate pipelines with a focus on effectively capturing positive cases.

tpot_config = {
    'sklearn.linear_model.LogisticRegression': {
        'C': [0.01, 0.1, 1.0, 10.0],
        'penalty': ['l2'], 'solver': ['lbfgs'], 'max_iter': [200]
    },
    'sklearn.naive_bayes.GaussianNB': {},
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 8, None],
        'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]
    },
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100, 300], 'criterion': ['gini', 'entropy'],
        'max_depth': [None, 8], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2]
    },
    'sklearn.ensemble.ExtraTreesClassifier': {
        'n_estimators': [200], 'criterion': ['gini', 'entropy'],
        'max_depth': [None, 8], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2]
    },
    'sklearn.ensemble.GradientBoostingClassifier': {
        'n_estimators': [100, 200], 'learning_rate': [0.03, 0.1],
        'max_depth': [2, 3], 'subsample': [0.8, 1.0]
    },
    'xgboost.XGBClassifier': {
        'n_estimators': [200, 400], 'max_depth': [3, 5], 'learning_rate': [0.05, 0.1],
        'subsample': [0.8, 1.0], 'colsample_bytree': [0.8, 1.0],
        'reg_lambda': [1.0, 2.0], 'min_child_weight': [1, 3],
        'n_jobs': [0], 'tree_method': ['hist'], 'eval_metric': ['logloss'],
        'gamma': [0, 1]
    }
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

We define a custom TPOT configuration that combines linear models, tree-based learners, ensembles, and XGBoost, utilizing carefully chosen hyperparameters. We also establish a stratified 5-fold cross-validation strategy, ensuring that every candidate pipeline is tested fairly across balanced splits of the dataset.

t0 = time.time()
tpot = TPOTClassifier(
    generations=5,
    population_size=40,
    offspring_size=40,
    scoring=cost_f1,
    cv=cv,
    subsample=0.8,
    n_jobs=-1,
    config_dict=tpot_config,
    verbosity=2,
    random_state=SEED,
    max_time_mins=10,
    early_stop=3,
    periodic_checkpoint_folder="tpot_ckpt",
    warm_start=False
)
tpot.fit(X_tr_s, y_tr)
print(f"\nFirst search took {time.time()-t0:.1f}s")

def pareto_table(tpot_obj, k=5):
    rows = []
    for ind, meta in tpot_obj.pareto_front_fitted_pipelines_.items():
        rows.append({
            "pipeline": ind, "cv_score": meta['internal_cv_score'],
            "size": len(str(meta['pipeline'])),
        })
    df = pd.DataFrame(rows).sort_values("cv_score", ascending=False).head(k)
    return df.reset_index(drop=True)

pareto_df = pareto_table(tpot, k=5)
print("\nTop Pareto pipelines (cv):\n", pareto_df)

def eval_pipeline(pipeline, X_te, y_te, name):
    y_hat = pipeline.predict(X_te)
    f1 = f1_score(y_te, y_hat)
    print(f"\n[{name}] F1(test) = {f1:.4f}")
    print(classification_report(y_te, y_hat, digits=3))

print("\nEvaluating top pipelines on test:")
for i, (ind, meta) in enumerate(sorted(
        tpot.pareto_front_fitted_pipelines_.items(),
        key=lambda kv: kv[1]['internal_cv_score'], reverse=True)[:3], 1):
    eval_pipeline(meta['pipeline'], X_te_s, y_te, name=f"Pareto#{i}")

We launch an evolutionary search with TPOT, cap the runtime for practicality, and checkpoint progress, allowing us to reproducibly hunt for strong pipelines. We then inspect the Pareto front to identify the top trade-offs, convert it into a compact table, and select leaders based on the cross-validation score. Finally, we evaluate the best candidates on the held-out test set to confirm real-world performance with F1 and a full classification report.

print("\nWarm-start for extra refinement...")
t1 = time.time()
tpot2 = TPOTClassifier(
    generations=3, population_size=40, offspring_size=40,
    scoring=cost_f1, cv=cv, subsample=0.8, n_jobs=-1,
    config_dict=tpot_config, verbosity=2, random_state=SEED,
    warm_start=True, periodic_checkpoint_folder="tpot_ckpt"
)
try:
    tpot2._population = tpot._population
    tpot2._pareto_front = tpot._pareto_front
except Exception:
    pass
tpot2.fit(X_tr_s, y_tr)
print(f"Warm-start extra search took {time.time()-t1:.1f}s")

best_model = tpot2.fitted_pipeline_ if hasattr(tpot2, "fitted_pipeline_") else tpot.fitted_pipeline_
eval_pipeline(best_model, X_te_s, y_te, name="BestAfterWarmStart")

export_path = "tpot_best_pipeline.py"
(tpot2 if hasattr(tpot2, "fitted_pipeline_") else tpot).export(export_path)
print(f"\nExported best pipeline to: {export_path}")

from importlib import util as _util
spec = _util.spec_from_file_location("tpot_best", export_path)
tbest = _util.module_from_spec(spec); spec.loader.exec_module(tbest)
reloaded_clf = tbest.exported_pipeline_
pipe = Pipeline([("scaler", scaler), ("model", reloaded_clf)])
pipe.fit(X_tr, y_tr)
eval_pipeline(pipe, X_te, y_te, name="ReloadedExportedPipeline")

report = {
    "dataset": "sklearn breast_cancer",
    "train_size": int(X_tr.shape[0]), "test_size": int(X_te.shape[0]),
    "cv": "StratifiedKFold(5)",
    "scorer": "custom F1 (binary)",
    "search": {"gen_1": 5, "gen_2_warm": 3, "pop": 40, "subsample": 0.8},
    "exported_pipeline_first_120_chars": str(reloaded_clf)[:120] + "...",
}
print("\nModel Card:\n", json.dumps(report, indent=2))

We continue the search with a warm start, reusing the earlier population to refine candidates, and select the best performer on our test set. We export the winning pipeline, reload it alongside our scaler to mimic deployment, and verify its results. Finally, we generate a compact model card to document the dataset, search settings, and a summary of the exported pipeline for reproducibility.

In conclusion, we see how TPOT allows us to move beyond trial-and-error model selection and instead rely on automated, reproducible, and explainable optimization. We export the best pipeline, validate it on unseen data, and even reload it for deployment-style use, confirming that the workflow is not just experimental but production-ready. By combining reproducibility, flexibility, and interpretability, we end with a robust framework that we can confidently apply to more complex datasets and real-world problems.

Check out the FULL CODES here.

Detect Amazon Bedrock misconfigurations with Datadog Cloud Security

This post was co-written with Nick Frichette and Vijay George from Datadog. 
As organizations increasingly adopt Amazon Bedrock for generative AI applications, protecting against misconfigurations that could lead to data leaks or unauthorized model access becomes critical. The AWS Generative AI Adoption Index, which surveyed 3,739 senior IT decision-makers across nine countries, revealed that 45% of organizations selected generative AI tools as their top budget priority in 2025. As more AWS and Datadog customers accelerate their adoption of AI, building AI security into existing processes will become essential, especially as more stringent regulations emerge. But looking at AI risks in a silo isn’t enough; AI risks must be contextualized alongside other risks such as identity exposures and misconfigurations. The combination of Amazon Bedrock and Datadog’s comprehensive security monitoring helps organizations innovate faster while maintaining robust security controls.
Amazon Bedrock delivers enterprise-grade security by incorporating built-in protections across data privacy, access controls, network security, compliance, and responsible AI safeguards. Customer data is encrypted both in transit using TLS 1.2 or above and at rest with AWS Key Management Service (AWS KMS), and organizations have full control over encryption keys. Data privacy is central: your input, prompts, and outputs are not shared with model providers nor used to train or improve foundation models (FMs). Fine-tuning and customizations occur on private copies of models, providing data confidentiality. Access is tightly governed through AWS Identity and Access Management (IAM) and resource-based policies, supporting granular authorization for users and roles. Amazon Bedrock integrates with AWS PrivateLink and supports virtual private cloud (VPC) endpoints for private, internal communication, so traffic doesn’t leave the Amazon network. The service complies with key industry standards such as ISO, SOC, CSA STAR, HIPAA eligibility, GDPR, and FedRAMP High, making it suitable for regulated industries. Additionally, Amazon Bedrock includes configurable guardrails to filter sensitive or harmful content and promote responsible AI use. Security is structured under the AWS Shared Responsibility Model, where AWS manages infrastructure security and customers are responsible for secure configurations and access controls within their Amazon Bedrock environment.
Building on these robust AWS security features, Datadog and AWS have partnered to provide a holistic view of AI infrastructure risks, vulnerabilities, sensitive data exposure, and other misconfigurations. Datadog Cloud Security employs both agentless and agent-based scanning to help organizations identify, prioritize, and remediate risks across cloud resources. This integration helps AWS users prioritize risks based on business criticality, with security findings enriched by observability data, thereby enhancing their overall security posture in AI implementations.
We’re excited to announce new security capabilities in Datadog Cloud Security that can help you detect and remediate Amazon Bedrock misconfigurations before they become security incidents. This integration helps organizations embed robust security controls and secure their use of the powerful capabilities of Amazon Bedrock by offering three critical advantages: holistic AI security by integrating AI security into your broader cloud security strategy, real-time risk detection through identifying potential AI-related security issues as they emerge, and simplified compliance to help meet evolving AI regulations with pre-built detections.
AWS and Datadog: Empowering customers to adopt AI securely
The partnership between AWS and Datadog is focused on helping customers operate their cloud infrastructure securely and efficiently. As organizations rapidly adopt AI technologies, extending this partnership to include Amazon Bedrock is a natural evolution. Amazon Bedrock is a fully managed service that makes high-performing FMs from leading AI companies and Amazon available through a unified API, making it an ideal starting point for Datadog’s AI security capabilities.
The decision to prioritize Amazon Bedrock integration is driven by several factors, including strong customer demand, comprehensive security needs, and the existing integration foundation. With over 900 integrations and a partner-built Marketplace, Datadog’s long-standing partnership with AWS and deep integration capabilities have helped Datadog quickly develop comprehensive security monitoring for Amazon Bedrock while using their existing cloud security expertise.
Throughout Q4 2024, Datadog Security Research observed increasing threat actor interest in cloud AI environments, making this integration particularly timely. By combining the powerful AI capabilities of AWS with Datadog’s security expertise, you can safely accelerate your AI adoption while maintaining robust security controls.
How Datadog Cloud Security helps secure Amazon Bedrock resources
After you add the AWS integration to your Datadog account and enable it, Datadog Cloud Security continuously monitors your AWS environment, identifying misconfigurations, identity risks, vulnerabilities, and compliance violations. These detections use the Datadog Severity Scoring system to prioritize findings based on infrastructure context. The scoring considers a variety of variables, including whether the resource is in production, is publicly accessible, or has access to sensitive data. This multi-layer analysis can help you reduce noise and focus your attention on the most critical misconfigurations by considering runtime behavior.
Partnering with AWS, Datadog is excited to offer detections for Datadog Cloud Security customers, such as:

Amazon Bedrock custom models should not output model data to publicly accessible S3 buckets
Amazon Bedrock custom models should not train from publicly writable S3 buckets
Amazon Bedrock guardrails should have a prompt attack filter enabled and block prompt attacks at high sensitivity
Amazon Bedrock agent guardrails should have the sensitive information filter enabled and block highly sensitive PII entities
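The following is a minimal sketch of the kind of guardrail configuration these detections check for, using the AWS CLI and assuming the CreateGuardrail parameter shapes shown here; the guardrail name, blocked messages, and PII entity are placeholders you would adapt:

# Guardrail with a high-sensitivity prompt attack filter and a blocking PII filter (placeholder values)
aws bedrock create-guardrail \
  --name "agent-guardrail" \
  --blocked-input-messaging "This request was blocked by the guardrail." \
  --blocked-outputs-messaging "This response was blocked by the guardrail." \
  --content-policy-config '{"filtersConfig":[{"type":"PROMPT_ATTACK","inputStrength":"HIGH","outputStrength":"NONE"}]}' \
  --sensitive-information-policy-config '{"piiEntitiesConfig":[{"type":"US_SOCIAL_SECURITY_NUMBER","action":"BLOCK"}]}'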

Detect AI misconfigurations with Datadog Cloud Security
To understand how these detections can help secure your Amazon Bedrock infrastructure, let's look at a specific use case: the detection that Amazon Bedrock custom models should not train from publicly writable Amazon Simple Storage Service (Amazon S3) buckets.
With Amazon Bedrock, you can customize AI models by fine-tuning them on domain-specific data, which is stored in an S3 bucket. Threat actors constantly probe S3 bucket configurations, looking for opportunities to read sensitive data or even write to the buckets.
If a threat actor finds an S3 bucket that is misconfigured to permit public write access, and that bucket contains data used to train an AI model, the attacker could poison the dataset and introduce malicious behavior or output to the model. This is known as a data poisoning attack.
Normally, detecting these types of misconfigurations requires multiple steps: one to identify the S3 bucket misconfigured with write access, and one to identify that the bucket is being used by Amazon Bedrock. With Datadog Cloud Security, this detection is one of hundreds that are activated out of the box.
In the Datadog Cloud Security system, you can view this issue alongside the surrounding infrastructure using Cloud Map, which provides live diagrams of your cloud architecture, as shown in the following screenshot. AI risks are then contextualized alongside sensitive data exposure, identity risks, vulnerabilities, and other misconfigurations to give you a 360-degree view of risk.

For example, you might see that your application is using Anthropic's Claude 3.7 on Amazon Bedrock and accessing training or prompt data stored in an S3 bucket that also allows public write access. Because unapproved data could be introduced to the large language model (LLM) and compromise model integrity, you will want to update this configuration. Identifying the issue is the first step for most security initiatives. With agentless scanning, Datadog scans your AWS environment at intervals between 15 minutes and 2 hours, so you can identify misconfigurations as they are introduced. The next step is to remediate the risk. Datadog Cloud Security offers automatically generated, risk-specific remediation guidance (see the following screenshot) with a step-by-step explanation of how to fix each finding. In this situation, we can remediate the issue by modifying the S3 bucket's policy to help prevent public write access. You can do this directly in AWS, create a Jira ticket, or use the built-in workflow automation tools, then apply the remediation steps within Datadog and confirm that the misconfiguration has been resolved.
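As a hedged sketch of that remediation step (the bucket name is a placeholder and your organization may manage bucket policies differently), you could enforce S3 Block Public Access from the AWS CLI and then confirm the bucket is no longer public:

# Block all forms of public access on the training-data bucket (placeholder name)
aws s3api put-public-access-block \
  --bucket my-bedrock-training-data \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Confirm the bucket policy is no longer public
aws s3api get-bucket-policy-status --bucket my-bedrock-training-data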

Resolving this issue will positively impact your compliance posture, as illustrated by the posture score in Datadog Cloud Security, helping teams meet internal benchmarks and regulatory standards. Teams can also create custom frameworks or iterate on existing ones for tailored compliance controls.

As generative AI is embraced across industries, the regulatory environment will evolve. Datadog will continue partnering with AWS to expand their detection library and support secure AI adoption and compliance.
How Datadog Cloud Security detects misconfigurations in your cloud environment
You can deploy Datadog Cloud Security either with the Datadog agent, agentlessly, or both to maximize security coverage in your cloud environment. Datadog customers can start monitoring their AWS accounts for misconfigurations by first adding the AWS integration to Datadog. This enables Datadog to crawl cloud resources in customer AWS accounts.
As the Datadog system finds resources, it runs through a catalog of hundreds of out-of-the-box detection rules against these resources, looking for misconfigurations and threat paths that adversaries can exploit.
Secure your AI infrastructure with Datadog
Misconfigurations in AI systems can be risky, but with the right tools, you can have the visibility and context needed to manage them. With Datadog Cloud Security, teams gain visibility into these risks, detect threats early, and remediate issues with confidence. Datadog has also released numerous agentic AI security features designed to help teams monitor the health and security of critical AI workloads, including new additions to Datadog's LLM Observability features.
Lastly, Datadog announced Bits AI Security Analyst alongside other Bits AI agents at DASH. Included as part of Cloud SIEM, Bits is an agentic AI security analyst that automates triage for AWS CloudTrail signals. Bits investigates each alert like a seasoned analyst: pulling in relevant context from across your Datadog environment, annotating key findings, and offering a clear recommendation on whether the signal is likely benign or malicious. By accelerating triage and surfacing real threats faster, Bits helps reduce mean time to remediation (MTTR) and frees analysts to focus on important threat hunting and response initiatives. This helps across different threats, including AI-related threats.
To learn more about how Datadog helps secure your AI infrastructure, see Monitor Amazon Bedrock with Datadog or check out our security documentation. If you’re not already using Datadog, you can get started with Datadog Cloud Security with a 14-day free trial.

About the Authors
Nina Chen is a Customer Solutions Manager at AWS specializing in leading software companies to use the power of the AWS Cloud to accelerate their product innovation and growth. With over 4 years of experience working in the strategic independent software vendor (ISV) vertical, Nina enjoys guiding ISV partners through their cloud transformation journeys, helping them optimize their cloud infrastructure, driving product innovation, and delivering exceptional customer experiences.
Sujatha Kuppuraju is a Principal Solutions Architect at AWS, specializing in cloud and generative AI security. She collaborates with software companies’ leadership teams to architect secure, scalable solutions on AWS and guide strategic product development. Using her expertise in cloud architecture and emerging technologies, Sujatha helps organizations optimize offerings, maintain robust security, and bring innovative products to market in an evolving tech landscape.
Nick Frichette is a Staff Security Researcher for Cloud Security Research at Datadog.
Vijay George is a Product Manager for AI Security at Datadog.

Set up custom domain names for Amazon Bedrock AgentCore Runtime agents

When deploying AI agents to Amazon Bedrock AgentCore Runtime (currently in preview), customers often want to use custom domain names to create a professional and seamless experience.
By default, AgentCore Runtime agents use endpoints like https://bedrock-agentcore.{region}.amazonaws.com/runtimes/{EncodedAgentARN}/invocations.
In this post, we discuss how to transform these endpoints into user-friendly custom domains (like https://agent.yourcompany.com) using Amazon CloudFront as a reverse proxy. The solution combines CloudFront, Amazon Route 53, and AWS Certificate Manager (ACM) to create a secure, scalable custom domain setup that works seamlessly with your existing agents.
Benefits of Amazon Bedrock AgentCore Runtime
If you’re building AI agents, you have probably wrestled with hosting challenges: managing infrastructure, handling authentication, scaling, and maintaining security. Amazon Bedrock AgentCore Runtime helps address these problems.
Amazon Bedrock AgentCore Runtime is framework agnostic; you can use it with LangGraph, CrewAI, Strands Agents, or custom agents you have built from scratch. It supports extended execution times up to 8 hours, perfect for complex reasoning tasks that traditional serverless functions can’t handle. Each user session runs in its own isolated microVM, providing security that’s crucial for enterprise applications.
The consumption-based pricing model means you only pay for what you use, not what you provision. And unlike other hosting solutions, Amazon Bedrock AgentCore Runtime includes built-in authentication and specialized observability for AI agents out of the box.
Benefits of custom domains
When using Amazon Bedrock AgentCore Runtime with Open Authorization (OAuth) authentication, your applications make direct HTTPS requests to the service endpoint. Although this works, custom domains offer several benefits:

Custom branding – Client-side applications (web browsers, mobile apps) display your branded domain instead of AWS infrastructure details in network requests
Better developer experience – Development teams can use memorable, branded endpoints instead of copying and pasting long AWS endpoints across code bases and configurations
Simplified maintenance – Custom domains make it straightforward to manage endpoints when deploying multiple agents or updating configurations across environments

Solution overview
In this solution, we use CloudFront as a reverse proxy to transform requests from your custom domain into Amazon Bedrock AgentCore Runtime API calls. Instead of using the default endpoint, your applications can make requests to a user-friendly URL like https://agent.yourcompany.com/.
The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

A client application authenticates with Amazon Cognito and receives a bearer token.
The client makes an HTTPS request to your custom domain.
Route 53 resolves the DNS request to CloudFront.
CloudFront forwards the authenticated request to the Amazon Bedrock AgentCore Runtime endpoint.
The agent processes the request and returns the response through the same path.

You can use the same CloudFront distribution to serve both your frontend application and backend agent endpoints, avoiding cross-origin resource sharing (CORS) issues because everything originates from the same domain.
Prerequisites
To follow this walkthrough, you must have the following in place:

An AWS account with appropriate permissions
The AWS Cloud Development Kit (AWS CDK) version 2.x or later
An AWS Identity and Access Management (IAM) execution role with appropriate permissions for Amazon Bedrock AgentCore Runtime

Although Amazon Bedrock AgentCore Runtime can run in other supported AWS Regions, CloudFront requires the ACM SSL certificate to be issued in the us-east-1 Region.
You can choose from the following domain options:

Use an existing domain – Add a subdomain like agent.yourcompany.com
Register a new domain – Use Route 53 to register a domain if you don’t have one
Use the default URL from CloudFront – No domain registration or configuration required

Choose the third option if you want to test the solution quickly before setting up a custom domain.
Create an agent with inbound authentication
If you already have an agent deployed with OAuth authentication, you can skip to the next section to set up the custom domain. Otherwise, follow these steps to create a new agent using Amazon Cognito as your OAuth provider:

Create a new directory for your agent with the following structure:

your_project_directory/
├── agent_example.py # Your main agent code
├── requirements.txt # Dependencies for your agent
└── __init__.py # Makes the directory a Python package

Create the main agent code in agent_example.py:

# agent_example.py
from strands import Agent
from bedrock_agentcore.runtime import BedrockAgentCoreApp

agent = Agent()
app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload):
    """Process user input and return a response"""
    user_message = payload.get("prompt", "Hello")
    response = agent(user_message)
    return str(response)  # response should be JSON serializable

if __name__ == "__main__":
    app.run()

Add dependencies to requirements.txt:

# requirements.txt
strands-agents
bedrock-agentcore

Run the following commands to create an Amazon Cognito user pool and test user:

# Create User Pool and capture Pool ID
export POOL_ID=$(aws cognito-idp create-user-pool \
  --pool-name "MyUserPool" \
  --policies '{"PasswordPolicy":{"MinimumLength":8}}' \
  --region us-east-1 | jq -r '.UserPool.Id')

# Create App Client and capture Client ID
export CLIENT_ID=$(aws cognito-idp create-user-pool-client \
  --user-pool-id $POOL_ID \
  --client-name "MyClient" \
  --no-generate-secret \
  --explicit-auth-flows "ALLOW_USER_PASSWORD_AUTH" "ALLOW_REFRESH_TOKEN_AUTH" \
  --region us-east-1 | jq -r '.UserPoolClient.ClientId')

# Create and configure a test user
aws cognito-idp admin-create-user \
  --user-pool-id $POOL_ID \
  --username "testuser" \
  --temporary-password "Temp1234" \
  --region us-east-1 \
  --message-action SUPPRESS

aws cognito-idp admin-set-user-password \
  --user-pool-id $POOL_ID \
  --username "testuser" \
  --password "MyPassword123" \
  --region us-east-1 \
  --permanent

echo "Pool ID: $POOL_ID"
echo "Discovery URL: https://cognito-idp.us-east-1.amazonaws.com/$POOL_ID/.well-known/openid-configuration"
echo "Client ID: $CLIENT_ID"

Deploy the agent using the Amazon Bedrock AgentCore command line interface (CLI) provided by the starter toolkit:

pip install bedrock-agentcore-starter-toolkit  # install the starter toolkit

agentcore configure --entrypoint agent_example.py \
  --name my_agent \
  --execution-role your-execution-role-arn \
  --requirements-file requirements.txt \
  --authorizer-config "{\"customJWTAuthorizer\":{\"discoveryUrl\":\"https://cognito-idp.us-east-1.amazonaws.com/$POOL_ID/.well-known/openid-configuration\",\"allowedClients\":[\"$CLIENT_ID\"]}}"

agentcore launch

Make note of your agent runtime Amazon Resource Name (ARN) after deployment. You will need this for the custom domain configuration.
For additional examples and details, see Authenticate and authorize with Inbound Auth and Outbound Auth.
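Before adding a custom domain, you can optionally sanity-check the deployment against the default endpoint format shown earlier. The following is a hedged sketch that URL-encodes the sample ARN from this post and calls the default endpoint directly; it assumes you have already exported a Cognito bearer token into $TOKEN, as shown later in the Test your endpoint section:

# URL-encode the agent runtime ARN (sample ARN from this post; replace with your own)
ENCODED_ARN=$(python3 -c "import urllib.parse; print(urllib.parse.quote('arn:aws:bedrock-agentcore:us-east-1:accountId:runtime/my_agent-xbcDkz4FR9', safe=''))")

# Invoke the default AgentCore Runtime endpoint directly
curl -X POST "https://bedrock-agentcore.us-east-1.amazonaws.com/runtimes/$ENCODED_ARN/invocations" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Amzn-Bedrock-AgentCore-Runtime-Session-Id: session-12345678901234567890123456789012345" \
  -d '{"prompt": "Hello"}'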
Set up the custom domain solution
Now let’s implement the custom domain solution using the AWS CDK. This section shows you how to create the CloudFront distribution that proxies your custom domain requests to Amazon Bedrock AgentCore Runtime endpoints.

Create a new directory and initialize an AWS CDK project:

mkdir agentcore-custom-domain
cd agentcore-custom-domain
cdk init app --language python
source .venv/bin/activate
pip install aws-cdk-lib constructs

Encode the agent ARN and prepare the CloudFront origin configuration:

# agentcore_custom_domain_stack.py
import urllib.parse

agent_runtime_arn = "arn:aws:bedrock-agentcore:us-east-1:accountId:runtime/my_agent-xbcDkz4FR9"
encoded_arn = urllib.parse.quote(agent_runtime_arn, safe="")  # URL-encode the ARN
region = agent_runtime_arn.split(':')[3]  # Extract region from ARN

If your frontend application runs on a different domain than your agent endpoint, you must configure CORS headers. This is common if your frontend is hosted on a different domain (for example, https://app.yourcompany.com calling https://agent.yourcompany.com), or if you’re developing locally (for example, http://localhost:3000 calling your production agent endpoint).

To handle CORS requirements, create a CloudFront response headers policy:

# agentcore_custom_domain_stack.py
from aws_cdk.aws_cloudfront import ResponseHeadersPolicy, ResponseHeadersCorsBehavior

# Create CORS response headers policy
cors_policy = ResponseHeadersPolicy(self, 'CorsPolicy',
    cors_behavior=ResponseHeadersCorsBehavior(
        access_control_allow_origins=['*'],  # Or specify your frontend domains
        access_control_allow_headers=[
            'Authorization',
            'Content-Type',
            'X-Amzn-*',
            'X-Requested-With'
        ],
        access_control_allow_methods=['GET', 'POST', 'OPTIONS'],
        access_control_allow_credentials=False,
        access_control_expose_headers=['*'],
        origin_override=True  # Overrides CORS headers from origin
    )
)

Create a CloudFront distribution to act as a reverse proxy for your agent endpoints:

# agentcore_custom_domain_stack.py
from aws_cdk import Duration
from aws_cdk.aws_cloudfront import (
    Distribution, BehaviorOptions, CachePolicy,
    AllowedMethods, ViewerProtocolPolicy,
    OriginProtocolPolicy, OriginRequestPolicy
)
from aws_cdk.aws_cloudfront_origins import HttpOrigin

bedrock_agentcore_hostname = f"bedrock-agentcore.{region}.amazonaws.com"
origin_path = f"/runtimes/{encoded_arn}/invocations"

distribution = Distribution(self, 'Distribution',
    default_behavior=BehaviorOptions(
        origin=HttpOrigin(
            bedrock_agentcore_hostname,
            origin_path=origin_path,
            protocol_policy=OriginProtocolPolicy.HTTPS_ONLY,
            read_timeout=Duration.seconds(120)  # Optional: for responses >30s, adjust as needed
        ),
        viewer_protocol_policy=ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
        cache_policy=CachePolicy.CACHING_DISABLED,  # Critical for dynamic APIs
        allowed_methods=AllowedMethods.ALLOW_ALL,
        response_headers_policy=cors_policy,  # Add CORS policy if created
        origin_request_policy=OriginRequestPolicy.ALL_VIEWER,  # Forward headers for MCP
    ),
    # Add domain configuration if using custom domains
    domain_names=[domain_name] if domain_name else None,
    certificate=certificate if domain_name else None,
)

Set cache_policy=CachePolicy.CACHING_DISABLED to make sure your agent responses remain dynamic and aren’t cached by CloudFront.

If you’re using a custom domain, add an SSL certificate and DNS configuration to your stack:

# agentcore_custom_domain_stack.py
from aws_cdk.aws_certificatemanager import Certificate, CertificateValidation
from aws_cdk.aws_route53 import HostedZone, ARecord, RecordTarget
from aws_cdk.aws_route53_targets import CloudFrontTarget

# For existing domains
hosted_zone = HostedZone.from_lookup(self, 'HostedZone',
    domain_name='yourcompany.com'
)

# SSL certificate with automatic DNS validation
certificate = Certificate(self, 'Certificate',
    domain_name='my-agent.yourcompany.com',
    validation=CertificateValidation.from_dns(hosted_zone),
)

# DNS record pointing to CloudFront
ARecord(self, 'AliasRecord',
    zone=hosted_zone,
    record_name='my-agent.yourcompany.com',
    target=RecordTarget.from_alias(CloudFrontTarget(distribution)),
)

The following code is the complete AWS CDK stack that combines all the components:

# agentcore_custom_domain_stack.py
import urllib.parse
from aws_cdk import Stack, CfnOutput, Duration
from aws_cdk.aws_cloudfront import (
    Distribution, BehaviorOptions,
    CachePolicy, AllowedMethods,
    ViewerProtocolPolicy, OriginProtocolPolicy,
    ResponseHeadersPolicy, ResponseHeadersCorsBehavior,
    OriginRequestPolicy
)
from aws_cdk.aws_cloudfront_origins import HttpOrigin
from aws_cdk.aws_certificatemanager import Certificate, CertificateValidation
from aws_cdk.aws_route53 import HostedZone, ARecord, RecordTarget
from aws_cdk.aws_route53_targets import CloudFrontTarget
from constructs import Construct

class AgentCoreCustomDomainStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Configuration - Update these for your setup
        agent_runtime_arn = "arn:aws:bedrock-agentcore:us-east-1:accountId:runtime/my_agent-xbcDkz4FR9"
        region = agent_runtime_arn.split(':')[3]  # Extract region from ARN
        domain_name = "agent.yourcompany.com"  # Using your hosted zone
        hosted_zone_id = "Z1234567890ABC"  # Your hosted zone ID
        enable_cors = True  # Set to False if serving frontend and backend from same domain

        # Encode the agent ARN for the origin path
        encoded_arn = urllib.parse.quote(agent_runtime_arn, safe="")
        bedrock_agentcore_hostname = f"bedrock-agentcore.{region}.amazonaws.com"
        origin_path = f"/runtimes/{encoded_arn}/invocations"

        # Create CORS response headers policy if needed
        cors_policy = None
        if enable_cors:
            cors_policy = ResponseHeadersPolicy(self, 'CorsPolicy',
                cors_behavior=ResponseHeadersCorsBehavior(
                    access_control_allow_origins=['*'],  # Or specify your frontend domains
                    access_control_allow_headers=[
                        'Authorization',
                        'Content-Type',
                        'X-Amzn-*',
                        'X-Requested-With'
                    ],
                    access_control_allow_methods=['GET', 'POST', 'OPTIONS'],
                    access_control_expose_headers=['*'],
                    access_control_allow_credentials=False,
                    origin_override=True  # Overrides CORS headers from origin
                )
            )

        # Base distribution configuration
        distribution_props = {
            "default_behavior": BehaviorOptions(
                origin=HttpOrigin(
                    bedrock_agentcore_hostname,
                    origin_path=origin_path,  # Direct path to agent endpoint
                    protocol_policy=OriginProtocolPolicy.HTTPS_ONLY,
                    read_timeout=Duration.seconds(120)  # Optional: for responses >30s, adjust as needed
                ),
                viewer_protocol_policy=ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
                cache_policy=CachePolicy.CACHING_DISABLED,
                allowed_methods=AllowedMethods.ALLOW_ALL,
                response_headers_policy=cors_policy,  # Add CORS policy if enabled
                origin_request_policy=OriginRequestPolicy.ALL_VIEWER,  # Forward headers for MCP
            )
        }

        # Optional: Add custom domain
        if domain_name:
            # Use from_hosted_zone_attributes for a specific zone
            hosted_zone = HostedZone.from_hosted_zone_attributes(self, 'HostedZone',
                zone_name='yourcompany.com',  # Your root domain
                hosted_zone_id=hosted_zone_id
            )

            certificate = Certificate(self, 'Certificate',
                domain_name=domain_name,
                validation=CertificateValidation.from_dns(hosted_zone),
            )

            # Add custom domain to distribution
            distribution_props["domain_names"] = [domain_name]
            distribution_props["certificate"] = certificate

        distribution = Distribution(self, 'Distribution', **distribution_props)

        # Create DNS record if using custom domain
        if domain_name:
            ARecord(self, 'AliasRecord',
                zone=hosted_zone,
                record_name=domain_name,
                target=RecordTarget.from_alias(CloudFrontTarget(distribution)),
            )

        # Outputs
        if domain_name:
            domain_url = f"https://{domain_name}/"
            CfnOutput(self, "AgentEndpoint",
                value=domain_url,
                description="Your custom domain endpoint"
            )

        CfnOutput(self, "CloudFrontDistribution",
            value=f"https://{distribution.distribution_domain_name}/",
            description="CloudFront default domain (works without custom domain)"
        )

Configure the AWS CDK app entry point:

# app.py
#!/usr/bin/env python3
import aws_cdk as cdk
from agentcore_custom_domain.agentcore_custom_domain_stack import AgentCoreCustomDomainStack

app = cdk.App()
AgentCoreCustomDomainStack(app, "AgentCoreCustomDomainStack",
    # CloudFront requires certificates in us-east-1
    env=cdk.Environment(region='us-east-1'),
)
app.synth()

Deploy your custom domain
Now you can deploy the solution and verify it works with both custom and default domains. Complete the following steps:

Update the following values in agentcore_custom_domain_stack.py:

Your Amazon Bedrock AgentCore Runtime ARN
Your domain name (if using a custom domain)
Your hosted zone ID (if using a custom domain)

Deploy using the AWS CDK:

cdk deploy

Test your endpoint
After you deploy the custom domain, you can test your endpoints using either the custom domain or the CloudFront default domain. First, get a JWT token from Amazon Cognito:

export TOKEN=$(aws cognito-idp initiate-auth \
  --client-id "your-client-id" \
  --auth-flow USER_PASSWORD_AUTH \
  --auth-parameters USERNAME='testuser',PASSWORD='MyPassword123' \
  --region us-east-1 | jq -r '.AuthenticationResult.AccessToken')

Use the following code to test with your custom domain:

curl -X POST "https://my-agent.yourcompany.com/" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Amzn-Bedrock-AgentCore-Runtime-Session-Id: session-12345678901234567890123456789012345" \
  -d '{"prompt": "Hello, how can you help me today?"}'

Alternatively, use the following code to test with the CloudFront default domain:

curl -X POST "https://d1234567890123.cloudfront.net/" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Amzn-Bedrock-AgentCore-Runtime-Session-Id: session-12345678901234567890123456789012345" \
  -d '{"prompt": "Hello, how can you help me today?"}'

If everything works correctly, you will receive a response from your agent through either endpoint. You’ve successfully created a custom domain for your Amazon Bedrock AgentCore Runtime agent!

Considerations
As you implement this solution in production, the following are some important considerations:

Cost implications – CloudFront adds costs for data transfer and requests. Review Amazon CloudFront pricing to understand the impact for your usage patterns.
Security enhancements – Consider implementing the following security measures (a hedged AWS WAF sketch follows these considerations):

AWS WAF rules to help protect against common web exploits.
Rate limiting to help prevent abuse.
Geo-restrictions if your agent should only be accessible from specific Regions.

Monitoring – Enable CloudFront access logs and set up Amazon CloudWatch alarms to monitor error rates, latency, and request volume.
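As a hedged sketch of the first two security measures (the web ACL name, rate limit, and metric names are placeholders, and the rule JSON may need adjustment for your account), you could create a CloudFront-scoped AWS WAF web ACL with a rate-based rule and then reference its ARN from the distribution, for example through the web_acl_id property of the CDK Distribution construct:

# Create a CloudFront-scoped web ACL with a simple per-IP rate limit (placeholder values)
aws wafv2 create-web-acl \
  --name agentcore-web-acl \
  --scope CLOUDFRONT \
  --region us-east-1 \
  --default-action Allow={} \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=agentcoreWebAcl \
  --rules '[{"Name":"rate-limit","Priority":0,"Action":{"Block":{}},"Statement":{"RateBasedStatement":{"Limit":1000,"AggregateKeyType":"IP"}},"VisibilityConfig":{"SampledRequestsEnabled":true,"CloudWatchMetricsEnabled":true,"MetricName":"rateLimit"}}]'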

Clean up
To avoid ongoing costs, delete the resources when you no longer need them:

cdk destroy

You might need to manually delete the Route 53 hosted zones and ACM certificates from their respective service consoles.
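If you do need to remove them manually, a hedged sketch with placeholder identifiers follows; make sure nothing else references the certificate or hosted zone first:

# Delete the ACM certificate created for the custom domain (placeholder ARN)
aws acm delete-certificate \
  --certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/abcd1234-ef56-7890-abcd-1234567890ab \
  --region us-east-1

# Delete the hosted zone after removing any records you added (placeholder ID)
aws route53 delete-hosted-zone --id Z1234567890ABC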
Conclusion
In this post, we showed you how to create custom domain names for your Amazon Bedrock AgentCore Runtime agent endpoints using CloudFront as a reverse proxy. This solution provides several key benefits: simplified integration for development teams, custom domains that align with your organization, cleaner infrastructure abstraction, and straightforward maintenance when endpoints need updates. By using CloudFront as a reverse proxy, you can also serve both your frontend application and backend agent endpoints from the same domain, avoiding common CORS challenges.
We encourage you to explore this solution further by adapting it to your specific needs. You might want to enhance it with additional security features, set up monitoring, or integrate it with your existing infrastructure.
To learn more about building and deploying AI agents, see the Amazon Bedrock AgentCore Developer Guide. For advanced configurations and best practices with CloudFront, refer to the Amazon CloudFront documentation. You can find detailed information about SSL certificates in the AWS Certificate Manager documentation, and domain management in the Amazon Route 53 documentation.
Amazon Bedrock AgentCore is currently in preview and subject to change. Standard AWS pricing applies to additional services used, such as CloudFront, Route 53, and Certificate Manager.

About the authors
Rahmat Fedayizada is a Senior Solutions Architect with the AWS Energy and Utilities team. He works with energy companies to design and implement scalable, secure, and highly available architectures. Rahmat is passionate about translating complex technical requirements into practical solutions that drive business value.
Paras Bhuva is a Senior Manager of Solutions Architecture at AWS, where he leads a team of solution architects helping energy customers innovate and accelerate their transformation. Having started as a Solution Architect in 2012, Paras is passionate about architecting scalable solutions and building organizations focused on application modernization and AI initiatives.

Introducing auto scaling on Amazon SageMaker HyperPod

Today, we’re excited to announce that Amazon SageMaker HyperPod now supports managed node automatic scaling with Karpenter, so you can efficiently scale your SageMaker HyperPod clusters to meet your inference and training demands. Real-time inference workloads require automatic scaling to address unpredictable traffic patterns and maintain service level agreements (SLAs). As demand spikes, organizations must rapidly adapt their GPU compute without compromising response times or cost-efficiency. Unlike self-managed Karpenter deployments, this service-managed solution alleviates the operational overhead of installing, configuring, and maintaining Karpenter controllers, while providing tighter integration with the resilience capabilities of SageMaker HyperPod. This managed approach supports scale to zero, reducing the need for dedicated compute resources to run the Karpenter controller itself, improving cost-efficiency.
SageMaker HyperPod offers a resilient, high-performance infrastructure, observability, and tooling optimized for large-scale model training and deployment. Companies like Perplexity, HippocraticAI, H.AI, and Articul8 are already using SageMaker HyperPod for training and deploying models. As more customers transition from training foundation models (FMs) to running inference at scale, they require the ability to automatically scale their GPU nodes to handle real production traffic by scaling up during high demand and scaling down during periods of lower utilization. This capability necessitates a powerful cluster auto scaler. Karpenter, an open source Kubernetes node lifecycle manager created by AWS, is a popular choice among Kubernetes users for cluster auto scaling due to its powerful capabilities that optimize scaling times and reduce costs.
This launch provides a managed Karpenter-based solution for automatic scaling that is installed and maintained by SageMaker HyperPod, removing the undifferentiated heavy lifting of setup and management from customers. The feature is available for SageMaker HyperPod EKS clusters, and you can enable auto scaling to transform your SageMaker HyperPod cluster from static capacity to a dynamic, cost-optimized infrastructure that scales with demand. This combines Karpenter’s proven node lifecycle management with the purpose-built and resilient infrastructure of SageMaker HyperPod, designed for large-scale machine learning (ML) workloads. In this post, we dive into the benefits of Karpenter, and provide details on enabling and configuring Karpenter in your SageMaker HyperPod EKS clusters.
New features and benefits
Karpenter-based auto scaling in your SageMaker HyperPod clusters provides the following capabilities:

Service managed lifecycle – SageMaker HyperPod handles Karpenter installation, updates, and maintenance, alleviating operational overhead
Just-in-time provisioning – Karpenter observes your pending pods and provisions the required compute for your workloads from an on-demand pool
Scale to zero – You can scale down to zero nodes without maintaining dedicated controller infrastructure
Workload-aware node selection – Karpenter chooses optimal instance types based on pod requirements, Availability Zones, and pricing to minimize costs
Automatic node consolidation – Karpenter regularly evaluates clusters for optimization opportunities, shifting workloads to avoid underutilized nodes
Integrated resilience – Karpenter uses the built-in fault tolerance and node recovery mechanisms of SageMaker HyperPod

These capabilities are built on top of the recently launched continuous provisioning capability, which enables SageMaker HyperPod to automatically provision remaining capacity in the background while workloads start immediately on available instances. When node provisioning encounters failures due to capacity constraints or other issues, SageMaker HyperPod automatically retries in the background until clusters reach their desired scale, so your auto scaling operations remain resilient and non-blocking.
Solution overview
The following diagram illustrates the solution architecture.

Karpenter works as a controller in the cluster and operates in the following steps (you can observe each stage with the kubectl commands sketched after this list):

Watching – Karpenter watches for unschedulable pods in the cluster through the Kubernetes API server. These could be pods that go into a pending state when deployed or when automatically scaled to increase the replica count.
Evaluating – When Karpenter finds such pods, it computes the shape and size of a NodeClaim that fits the pods' requirements (GPU, CPU, memory) and topology constraints, and checks whether it can pair them with an existing NodePool. For each NodePool, it queries the SageMaker HyperPod APIs to get the instance types supported by that NodePool and uses the instance type metadata (hardware requirements, zone, capacity type) to find a match.
Provisioning – If Karpenter finds a matching NodePool, it creates a NodeClaim and tries to provision a new instance to be used as the new node. Internally, Karpenter uses the sagemaker:UpdateCluster API to increase the capacity of the selected instance group.
Disrupting – Karpenter periodically checks whether each node is still needed. If it isn't, Karpenter deletes it, which internally translates to a delete-node request to the SageMaker HyperPod cluster.
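A minimal sketch of how you might observe these stages from the cluster side, assuming the Karpenter v1 custom resources (NodePool, NodeClaim) used by this managed integration:

# Pending pods that Karpenter will try to schedule
kubectl get pods --field-selector=status.phase=Pending

# NodeClaims that Karpenter creates while provisioning capacity
kubectl get nodeclaims

# NodePools and the nodes that back them
kubectl get nodepools
kubectl get nodes -w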

Prerequisites
Verify you have the required quotas for the instances you will create in the SageMaker HyperPod cluster. To review your quotas, on the Service Quotas console, choose AWS services in the navigation pane, then choose SageMaker. For example, the following screenshot shows the available quota for g5.12xlarge instances (three).

To update the cluster, you must first create AWS Identity and Access Management (IAM) permissions for Karpenter. For instructions, see Create an IAM role for HyperPod autoscaling with Karpenter.
Create and configure a SageMaker HyperPod cluster
To begin, launch and configure your SageMaker HyperPod EKS cluster and verify that continuous provisioning mode is enabled on cluster creation. Complete the following steps:

On the SageMaker AI console, choose HyperPod clusters in the navigation pane.
Choose Create HyperPod cluster and Orchestrated on Amazon EKS.
For Setup options, select Custom setup.
For Name, enter a name.
For Instance recovery, select Automatic.
For Instance provisioning mode, select Use continuous provisioning.
Choose Submit.

This setup creates the necessary configuration such as virtual private cloud (VPC), subnets, security groups, and EKS cluster, and installs operators in the cluster. You can also provide existing resources such as an EKS cluster if you want to use an existing cluster instead of creating a new one. This setup will take around 20 minutes.
Verify that each InstanceGroup is limited to one zone by opting for the OverrideVpcConfig and selecting only one subnet for each InstanceGroup.

After you create the cluster, you must update it to enable Karpenter. You can do this using Boto3 or the AWS Command Line Interface (AWS CLI) using the UpdateCluster API command (after configuring the AWS CLI to connect to your AWS account).
The following code uses Python Boto3:

import boto3

client = boto3.client('sagemaker')
response = client.update_cluster(
    ClusterName="<Your_Cluster_Name>",
    AutoScaling={"Mode": "Enable", "AutoScalerType": "Karpenter"},
    ClusterRole="<Cluster_Role_ARN>",
)

The following code uses the AWS CLI:

aws sagemaker update-cluster \
    --cluster-name <clustername> \
    --auto-scaling '{ "Mode": "Enable", "AutoScalerType": "Karpenter" }' \
    --cluster-role <clusterrole>

After you run this command and update the cluster, you can verify that Karpenter has been enabled by running the DescribeCluster API.
The following code uses Python:

import boto3

client = boto3.client('sagemaker')
print(client.describe_cluster(ClusterName="<Your_Cluster_Name>").get("AutoScaling"))

The following code uses the AWS CLI:

aws sagemaker describe-cluster --cluster-name <clustername> --query AutoScaling

The following code shows our output:

{'Mode': 'Enable',
 'AutoScalerType': 'Karpenter',
 'Status': 'Enabled'}

Now you have a working cluster. The next step is to set up some custom resources in your cluster for Karpenter.
Create HyperpodNodeClass
HyperpodNodeClass is a custom resource that maps to pre-created instance groups in SageMaker HyperPod, defining constraints around which instance types and Availability Zones are supported for Karpenter's auto scaling decisions. To use HyperpodNodeClass, specify the names of the InstanceGroups in your SageMaker HyperPod cluster that Karpenter should use as the source of AWS compute when scaling up nodes for your NodePools.
The HyperpodNodeClass name that you use here is carried over to the NodePool in the next section where you reference it. This tells the NodePool which HyperpodNodeClass to draw resources from. To create a HyperpodNodeClass, complete the following steps:

Create a YAML file (for example, nodeclass.yaml) similar to the following code. Add InstanceGroup names that you used at the time of the SageMaker HyperPod cluster creation. You can also add new instance groups to an existing SageMaker HyperPod EKS cluster.
Reference the HyperpodNodeClass name in your NodePool configuration.

The following is a sample HyperpodNodeClass that uses ml.g6.xlarge and ml.g6.4xlarge instance types:

apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  name: multiazg6
spec:
  instanceGroups:
    # Names of InstanceGroups in the HyperPod cluster. Each InstanceGroup needs to be
    # created before this step can be completed.
    # MaxItems: 10
    - auto-g6-az1
    - auto-g6-4xaz2

Apply the configuration to your EKS cluster using kubectl:

kubectl apply -f nodeclass.yaml

Monitor the HyperpodNodeClass status and verify that the Ready condition is set to True, confirming it was created successfully:

kubectl get hyperpodnodeclass multiazg6 -o yaml

The SageMaker HyperPod cluster must have AutoScaling enabled and the AutoScaling status must change to InService before the HyperpodNodeClass can be applied.
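A quick way to check that status before applying the HyperpodNodeClass, using the same DescribeCluster call shown earlier:

# Wait for the AutoScaling status to reach InService before applying the HyperpodNodeClass
aws sagemaker describe-cluster --cluster-name <clustername> --query AutoScaling.Status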
For more information and key considerations, see Autoscaling on SageMaker HyperPod EKS.
Create NodePool
The NodePool sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes. The NodePool can be set to perform various actions, such as:

Define labels and taints to limit the pods that can run on nodes Karpenter creates
Limit node creation to certain zones, instance types, compute architectures, and so on

For more information about NodePool, refer to NodePools. SageMaker HyperPod managed Karpenter supports a limited set of well-known Kubernetes and Karpenter requirements, which we explain in this post.
To create a NodePool, complete the following steps:

Create a YAML file named nodepool.yaml with your desired NodePool configuration.

The following code is a sample configuration to create a NodePool. We configure the NodePool to include our ml.g6.xlarge SageMaker instance type and restrict it to a single Availability Zone. Refer to NodePools for more customizations.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpunodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.sagemaker.amazonaws.com
        kind: HyperpodNodeClass
        name: multiazg6
      expireAfter: Never
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: Exists
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["ml.g6.xlarge"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a"]

Apply the NodePool to your cluster:

kubectl apply -f nodepool.yaml

Monitor the NodePool status to ensure the Ready condition in the status is set to True:

kubectl get nodepool gpunodepool -oyaml

This example shows how a NodePool can be used to specify the hardware (instance type) and placement (Availability Zone) for pods.
Launch a simple workload
The following workload creates a Kubernetes Deployment whose pods each request 1 CPU and 256 MB of memory. No pods are spun up yet.

kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/inflate.yaml

When we apply this, we can see a deployment and a single node launch in our cluster, as shown in the following screenshot.

To scale this component, use the following command:

kubectl scale deployment inflate --replicas 10

Within a few minutes, we can see Karpenter add the requested nodes to the cluster.
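To follow along as this happens, you can watch the pods, NodeClaims, and nodes from kubectl; a minimal sketch, assuming the example Deployment carries the app=inflate label:

# Watch the inflate pods move from Pending to Running
kubectl get pods -l app=inflate -w

# Watch Karpenter create NodeClaims and register the new nodes
kubectl get nodeclaims
kubectl get nodes -w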

Implement advanced auto scaling for inference with KEDA and Karpenter
To implement an end-to-end auto scaling solution on SageMaker HyperPod, you can set up Kubernetes Event-driven Autoscaling (KEDA) along with Karpenter. KEDA enables pod-level auto scaling based on a wide range of metrics, including Amazon CloudWatch metrics, Amazon Simple Queue Service (Amazon SQS) queue lengths, Prometheus queries, and resource utilization patterns. By configuring KEDA ScaledObject resources to target your model deployments, KEDA can dynamically adjust the number of inference pods based on real-time demand signals.
When integrating KEDA and Karpenter, this combination creates a powerful two-tier auto scaling architecture. As KEDA scales your pods up or down based on workload metrics, Karpenter automatically provisions or deletes nodes in response to changing resource requirements. This integration delivers optimal performance while controlling costs by making sure your cluster has precisely the right amount of compute resources available at all times. For effective implementation, consider the following key factors:

Set appropriate buffer thresholds in KEDA to accommodate Karpenter’s node provisioning time
Configure cooldown periods carefully to prevent scaling oscillations
Define clear resource requests and limits to help Karpenter make optimal node selections
Create specialized NodePools tailored to specific workload characteristics

The following is a sample spec of a KEDA ScaledObject file that scales the number of pods based on CloudWatch metrics of Application Load Balancer (ALB) request count:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nd-deepseek-llm-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: nd-deepseek-llm-r1-distill-qwen-1-5b
    apiVersion: apps/v1
    kind: Deployment
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 30     # seconds between checks
  cooldownPeriod: 300     # seconds before scaling down
  triggers:
    - type: aws-cloudwatch
      metadata:
        namespace: AWS/ApplicationELB        # or your metric namespace
        metricName: RequestCount              # or your metric name
        dimensionName: LoadBalancer           # or your dimension key
        dimensionValue: app/k8s-default-albnddee-cc02b67f20/0991dc457b6e8447
        statistic: Sum
        threshold: "3"                        # change to your desired threshold
        minMetricValue: "0"                   # optional floor
        region: us-east-2                     # your AWS region
        identityOwner: operator               # use the IRSA SA bound to keda-operator
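To try a ScaledObject like this one, save it to a file (the file name below is a placeholder), apply it, and confirm that KEDA created the backing HorizontalPodAutoscaler:

kubectl apply -f scaledobject.yaml
kubectl get scaledobject nd-deepseek-llm-scaler
kubectl get hpa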

Clean up
To clean up your resources to avoid incurring more charges, delete your SageMaker HyperPod cluster.
Conclusion
With the launch of Karpenter node auto scaling on SageMaker HyperPod, ML workloads can automatically adapt to changing workload requirements, optimize resource utilization, and help control costs by scaling precisely when needed. You can also integrate it with event-driven pod auto scalers such as KEDA to scale based on custom metrics.
To experience these benefits for your ML workloads, enable Karpenter in your SageMaker HyperPod clusters. For detailed implementation guidance and best practices, refer to Autoscaling on SageMaker HyperPod EKS.

About the authors
Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Adam Stanley is a Solution Architect for Software, Internet and Model Provider customers at Amazon Web Services (AWS). He supports customers adopting all AWS services, but focuses primarily on Machine Learning training and inference infrastructure. Prior to AWS, Adam went to the University of New South Wales and graduated with degrees in Mathematics and Accounting. You can connect with him on LinkedIn.
Kunal Jha is a Principal Product Manager at AWS, where he focuses on building Amazon SageMaker HyperPod to enable scalable distributed training and fine-tuning of foundation models. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest. You can connect with him on LinkedIn.
Ty Bergstrom is a Software Engineer at Amazon Web Services. He works on the HyperPod Clusters platform for Amazon SageMaker.

Grounding Medical AI in Expert‑Labeled Data: A Case Study on PadChest‑GR

Table of contentsA Multimodal Radiology BreakthroughThe Challenge: Moving Beyond Image ClassificationHuman‑in‑the‑Loop at Clinical ScaleThe Dataset: PadChest‑GROutcomes and ImplicationsBroader Reflections: Why Data Matters in Medical AICase Study in Context: Centaur.ai’s Broader VisionConclusion

A Multimodal Radiology Breakthrough

Introduction

Recent advances in medical AI have underscored that breakthroughs hinge not solely on model sophistication, but fundamentally on the quality and richness of the underlying data. This case study spotlights a pioneering collaboration among Centaur.ai, Microsoft Research, and the University of Alicante, culminating in PadChest‑GR—the first multimodal, bilingual, sentence‑level dataset for grounded radiology reporting. By aligning structured clinical text with annotated chest‑X‑ray imagery, PadChest‑GR empowers models to justify each diagnostic claim with a visually interpretable reference—an innovation that marks a critical leap in AI transparency and trustworthiness.

The Challenge: Moving Beyond Image Classification

Historically, medical imaging datasets have supported only image‑level classification. For example, an X‑ray might be labeled as “showing cardiomegaly” or “no abnormalities detected.” While functional, such classifications fall short on explanation and reliability. AI models trained in this manner are prone to hallucinations—generating unsupported findings or failing to localize pathology accurately.

Enter grounded radiology reporting. This approach demands richer, multi‑dimensional annotation:

Spatial grounding: Findings are localized with bounding boxes on the image.

Linguistic grounding: Each textual description is tied to a specific region, rather than generic classification.

Contextual clarity: Each report entry is deeply contextualized both linguistically and spatially, greatly reducing ambiguity and raising interpretability.

This paradigm shift requires a fundamentally different kind of dataset—one that embraces complexity, precision, and linguistic nuance.

Human‑in‑the‑Loop at Clinical Scale

Creating PadChest‑GR required uncompromising annotation quality. Centaur.ai’s HIPAA‑compliant labeling platform enabled trained radiologists at the University of Alicante to:

Draw bounding boxes around visible pathologies in thousands of chest X‑rays.

Link each region to specific sentence‑level findings, in both Spanish and English.

Conduct rigorous, consensus‑driven quality control, including adjudication of edge cases and alignment across languages.

Centaur.ai’s platform is purpose‑built for medical‑grade annotation workflows. Its standout features include:

Multiple annotator consensus & disagreement resolution

Performance‑weighted labeling (where expert annotations are weighted based on historical agreement)

Support for DICOM formats and other complex medical imaging types

Multimodal workflows that handle images, text, and clinical metadata

Full audit trails, version control, and live quality monitoring—for traceable, trustworthy labels.

These capabilities allowed the research team to focus on challenging medical nuances without sacrificing annotation speed or integrity.

The Dataset: PadChest‑GR

PadChest‑GR builds on the original PadChest dataset by adding spatial grounding and bilingual, sentence‑level text alignment.

Key Features:

Multimodal: Integrates image data (chest X‑rays) with textual observations, precisely aligned.

Bilingual: Captures annotations in both Spanish and English, broadening utility and inclusivity.

Sentence‑level granularity: Each finding is connected to a specific sentence, not just a general label.

Visual explainability: The model can point to exactly where a diagnosis is made, fostering transparency.

By combining these attributes, PadChest‑GR stands as a landmark dataset—reshaping what radiology‑trained AI models can achieve.

Outcomes and Implications

Enhanced Interpretability & Reliability

Grounded annotation enables models to point to the exact region prompting a finding, markedly improving transparency. Clinicians can see both the claim and its spatial basis—boosting trust.

Reduction of AI Hallucinations

By tying linguistic claims to visual evidence, PadChest‑GR greatly diminishes the risk of fabricated or speculative model outputs.

Bilingual Utility

Multilingual annotations extend the dataset’s applicability across Spanish‑speaking populations, enhancing accessibility and global research potential.

Scalable, High‑Quality Annotation

Combining expert radiologists, stringent consensus, and a secure platform allowed the team to generate complex multimodal annotations at scale, with uncompromised quality.

Broader Reflections: Why Data Matters in Medical AI

This case study is a powerful testament to a broader truth: the future of AI depends on better data, not just better models. Especially in healthcare, where stakes are high and trust is essential, AI's value is tightly bound to the fidelity of its foundation.

The success of PadChest‑GR hinges on the synergy of:

Domain experts (radiologists) who bring nuanced judgment.

Advanced annotation infrastructure (Centaur.ai's platform) enabling traceable, consensus-driven workflows.

Collaborative partnerships (involving Microsoft Research and the University of Alicante), ensuring scientific, linguistic, and technical rigor.

Case Study in Context: Centaur.ai’s Broader Vision

While this study centers on radiology, it exemplifies Centaur.ai's wider mission: to scale expert‑level annotation for medical AI across modalities.

Through their DiagnosUs app, Centaur Labs (the same organization) has built a gamified annotation platform, harnessing collective intelligence and performance‑weighted scoring to label medical data at scale, with speed and accuracy.

Their platform is HIPAA‑ and SOC 2‑compliant, supporting annotators across image, text, audio, and video data, and serving clients such as Mayo Clinic spin‑outs, pharmaceutical firms, and AI developers.

Innovations like performance‑weighted labeling help ensure that only high‑performing experts influence the final annotations, raising quality and reliability.

PadChest‑GR sits squarely within this ecosystem—leveraging Centaur.ai’s sophisticated tools and rigorous workflows to deliver a groundbreaking radiology dataset.

Conclusion

The PadChest‑GR case study exemplifies how expert‑grounded, multimodal annotation can fundamentally transform medical AI—enabling transparent, reliable, and linguistically rich diagnostic modeling.

By harnessing domain expertise, multilingual alignment, and spatial grounding, Centaur.ai, Microsoft Research, and the University of Alicante have set a new benchmark for what medical image datasets can—and should—be. Their achievement underscores the vital truth that the promise of AI in healthcare is only as strong as the data it’s trained on.

This case stands as a compelling model for future medical AI collaborations—highlighting the path forward to trustworthy, interpretable, and scalable AI in the clinic.  For more information, visit Centaur.ai.

Thanks to the Centaur.ai team for the thought leadership and resources for this article. The Centaur.ai team has supported and sponsored this content.

How to Build a Multi-Round Deep Research Agent with Gemini, DuckDuckGo …

We begin this tutorial by designing a modular deep research system that runs directly on Google Colab. We configure Gemini as the core reasoning engine, integrate DuckDuckGo’s Instant Answer API for lightweight web search, and orchestrate multi-round querying with deduplication and delay handling. We emphasize efficiency by limiting API calls, parsing concise snippets, and using structured prompts to extract key points, themes, and insights. Every component, from source collection to JSON-based analysis, allows us to experiment quickly and adapt the workflow for deeper or broader research queries. Check out the FULL CODES here.

import os
import json
import time
import requests
from typing import List, Dict, Any
from dataclasses import dataclass
import google.generativeai as genai
from urllib.parse import quote_plus
import re

We start by importing essential Python libraries that handle system operations, JSON processing, web requests, and data structures. We also incorporate Google’s Generative AI SDK and utilities, such as URL encoding, to ensure our research system operates smoothly. Check out the FULL CODES here.

@dataclass
class ResearchConfig:
    gemini_api_key: str
    max_sources: int = 10
    max_content_length: int = 5000
    search_delay: float = 1.0

class DeepResearchSystem:
    def __init__(self, config: ResearchConfig):
        self.config = config
        genai.configure(api_key=config.gemini_api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')

    def search_web(self, query: str, num_results: int = 5) -> List[Dict[str, str]]:
        """Search web using DuckDuckGo Instant Answer API"""
        try:
            encoded_query = quote_plus(query)
            url = f"https://api.duckduckgo.com/?q={encoded_query}&format=json&no_redirect=1"

            response = requests.get(url, timeout=10)
            data = response.json()

            results = []

            if 'RelatedTopics' in data:
                for topic in data['RelatedTopics'][:num_results]:
                    if isinstance(topic, dict) and 'Text' in topic:
                        results.append({
                            'title': topic.get('Text', '')[:100] + '...',
                            'url': topic.get('FirstURL', ''),
                            'snippet': topic.get('Text', '')
                        })

            if not results:
                results = [{
                    'title': f"Research on: {query}",
                    'url': f"https://search.example.com/q={encoded_query}",
                    'snippet': f"General information and research about {query}"
                }]

            return results

        except Exception as e:
            print(f"Search error: {e}")
            return [{'title': f"Research: {query}", 'url': '', 'snippet': f"Topic: {query}"}]

    def extract_key_points(self, content: str) -> List[str]:
        """Extract key points using Gemini"""
        prompt = f"""
        Extract 5-7 key points from this content. Be concise and factual:

        {content[:2000]}

        Return as numbered list:
        """

        try:
            response = self.model.generate_content(prompt)
            return [line.strip() for line in response.text.split('\n') if line.strip()]
        except:
            return ["Key information extracted from source"]

    def analyze_sources(self, sources: List[Dict[str, str]], query: str) -> Dict[str, Any]:
        """Analyze sources for relevance and extract insights"""
        analysis = {
            'total_sources': len(sources),
            'key_themes': [],
            'insights': [],
            'confidence_score': 0.7
        }

        all_content = " ".join([s.get('snippet', '') for s in sources])

        if len(all_content) > 100:
            prompt = f"""
            Analyze this research content for the query: "{query}"

            Content: {all_content[:1500]}

            Provide:
            1. 3-4 key themes (one line each)
            2. 3-4 main insights (one line each)
            3. Overall confidence (0.1-1.0)

            Format as JSON with keys: themes, insights, confidence
            """

            try:
                response = self.model.generate_content(prompt)
                text = response.text
                if 'themes' in text.lower():
                    analysis['key_themes'] = ["Theme extracted from analysis"]
                    analysis['insights'] = ["Insight derived from sources"]
            except:
                pass

        return analysis

    def generate_comprehensive_report(self, query: str, sources: List[Dict[str, str]],
                                      analysis: Dict[str, Any]) -> str:
        """Generate final research report"""

        sources_text = "\n".join([f"- {s['title']}: {s['snippet'][:200]}"
                                  for s in sources[:5]])

        prompt = f"""
        Create a comprehensive research report on: "{query}"

        Based on these sources:
        {sources_text}

        Analysis summary:
        - Total sources: {analysis['total_sources']}
        - Confidence: {analysis['confidence_score']}

        Structure the report with:
        1. Executive Summary (2-3 sentences)
        2. Key Findings (3-5 bullet points)
        3. Detailed Analysis (2-3 paragraphs)
        4. Conclusions & Implications (1-2 paragraphs)
        5. Research Limitations

        Be factual, well-structured, and insightful.
        """

        try:
            response = self.model.generate_content(prompt)
            return response.text
        except Exception as e:
            return f"""
            # Research Report: {query}

            ## Executive Summary
            Research conducted on "{query}" using {analysis['total_sources']} sources.

            ## Key Findings
            - Multiple perspectives analyzed
            - Comprehensive information gathered
            - Research completed successfully

            ## Analysis
            The research process involved systematic collection and analysis of information related to {query}. Various sources were consulted to provide a balanced perspective.

            ## Conclusions
            The research provides a foundation for understanding {query} based on available information.

            ## Research Limitations
            Limited by API constraints and source availability.
            """

    def conduct_research(self, query: str, depth: str = "standard") -> Dict[str, Any]:
        """Main research orchestration method"""
        print(f"Starting research on: {query}")

        search_rounds = {"basic": 1, "standard": 2, "deep": 3}.get(depth, 2)
        sources_per_round = {"basic": 3, "standard": 5, "deep": 7}.get(depth, 5)

        all_sources = []

        search_queries = [query]

        if depth in ["standard", "deep"]:
            try:
                related_prompt = f"Generate 2 related search queries for: {query}. One line each."
                response = self.model.generate_content(related_prompt)
                additional_queries = [q.strip() for q in response.text.split('\n') if q.strip()][:2]
                search_queries.extend(additional_queries)
            except:
                pass

        for i, search_query in enumerate(search_queries[:search_rounds]):
            print(f"Search round {i+1}: {search_query}")
            sources = self.search_web(search_query, sources_per_round)
            all_sources.extend(sources)
            time.sleep(self.config.search_delay)

        unique_sources = []
        seen_urls = set()
        for source in all_sources:
            if source['url'] not in seen_urls:
                unique_sources.append(source)
                seen_urls.add(source['url'])

        print(f"Analyzing {len(unique_sources)} unique sources...")

        analysis = self.analyze_sources(unique_sources[:self.config.max_sources], query)

        print("Generating comprehensive report...")

        report = self.generate_comprehensive_report(query, unique_sources, analysis)

        return {
            'query': query,
            'sources_found': len(unique_sources),
            'analysis': analysis,
            'report': report,
            'sources': unique_sources[:10]
        }

We define a ResearchConfig dataclass to manage parameters like API keys, source limits, and delays, and then build a DeepResearchSystem class that integrates Gemini with DuckDuckGo search. We implement methods for web search, key point extraction, source analysis, and report generation, allowing us to orchestrate multi-round research and produce structured insights in a streamlined workflow. Check out the FULL CODES here.

def setup_research_system(api_key: str) -> DeepResearchSystem:
    """Quick setup for Google Colab."""
    config = ResearchConfig(
        gemini_api_key=api_key,
        max_sources=15,
        max_content_length=6000,
        search_delay=0.5
    )
    return DeepResearchSystem(config)

We create a setup_research_system function that simplifies initialization in Google Colab by wrapping our configuration in ResearchConfig and returning a ready-to-use DeepResearchSystem instance with custom limits and delays. Check out the FULL CODES here.

if __name__ == "__main__":
    API_KEY = "Use Your Own API Key Here"

    researcher = setup_research_system(API_KEY)

    query = "Deep Research Agent Architecture"
    results = researcher.conduct_research(query, depth="standard")

    print("=" * 50)
    print("RESEARCH RESULTS")
    print("=" * 50)
    print(f"Query: {results['query']}")
    print(f"Sources found: {results['sources_found']}")
    print(f"Confidence: {results['analysis']['confidence_score']}")
    print("\n" + "=" * 50)
    print("COMPREHENSIVE REPORT")
    print("=" * 50)
    print(results['report'])

    print("\n" + "=" * 50)
    print("SOURCES CONSULTED")
    print("=" * 50)
    for i, source in enumerate(results['sources'][:5], 1):
        print(f"{i}. {source['title']}")
        print(f"   URL: {source['url']}")
        print(f"   Preview: {source['snippet'][:150]}...")
        print()

We add a main execution block where we initialize the research system with our API key, run a query on “Deep Research Agent Architecture,” and then display structured outputs. We print research results, a comprehensive report generated by Gemini, and a list of consulted sources with titles, URLs, and previews.

In conclusion, we see how the entire pipeline consistently transforms unstructured snippets into a structured, well-organized report. We successfully combine search, language modeling, and analysis layers to simulate a complete research workflow within Colab. By using Gemini for extraction, synthesis, and reporting, and DuckDuckGo for free search access, we create a reusable foundation for more advanced agentic research systems. This notebook provides a practical, technically detailed template that we can now expand with additional models, custom ranking, or domain-specific integrations, while still retaining a compact, end-to-end architecture.

The post How to Build a Multi-Round Deep Research Agent with Gemini, DuckDuckGo API, and Automated Reporting? appeared first on MarkTechPost.

Australia’s Large Language Model Landscape: Technical Assessment

Key Points

No flagship, globally competitive, locally developed LLM (such as GPT-4, Claude 3.5, LLaMA 3.1) has yet emerged from Australia. Australian research and commerce currently rely primarily on international LLMs, which are frequently used but have measurable limitations on Australian English and cultural context.

Kangaroo LLM is the only major open-source, locally developed LLM project. Backed by a consortium of Katonic AI, RackCorp, NEXTDC, Hitachi Vantara, and Hewlett Packard Enterprise, it aims to build a model specifically for Australian English, but remains in early data collection and governance phases, with no published model weights, benchmarks, or production deployment as of August 2025.

International models (Claude 3.5 Sonnet, GPT-4, LLaMA 2) are widely accessible in Australia and used in research, government, and industry. Their deployment in Australian contexts is often subject to data sovereignty, privacy law, and model fine-tuning challenges.

Australian academic research makes important contributions to LLM evaluation, fairness, and domain adaptation—not foundational architecture. Work at UNSW, Macquarie, and the University of Adelaide focuses on bias detection, medical and legal applications, and fine-tuning of pre-trained models, not on building new, large-scale LLMs from scratch.

Government and industry investment in AI is growing, but AI sovereignty remains aspirational. There is active policy development, increased venture capital, and strategic university-industry partnerships, but no national computational infrastructure or commercial ecosystem for training large, general-purpose LLMs at scale.

Local Model Development: Kangaroo LLM

Kangaroo LLM is Australia’s flagship effort to build a sovereign, open-source large language model tailored to Australian English and culture. The project is managed by a nonprofit consortium and aims to create a model that understands Australian humor, slang, and legal/ethical norms. However, as of August 2025, Kangaroo LLM is not yet a fully trained, benchmarked, or publicly available model. Its current status is best described as follows:

Partners: Katonic AI (lead), RackCorp, NEXTDC, Hitachi Vantara, Hewlett Packard Enterprise.

Mission: To create an open-source LLM trained on Australian web content, with data sovereignty and local cultural alignment as primary goals.

Progress: The project has identified 4.2 million Australian websites for potential data collection, with an initial focus on 754,000 sites. Crawling was delayed in late 2024 due to legal and privacy concerns, and no public dataset or model has been released.

Technical Approach: The “Kangaroo Bot” crawler respects robots.txt and allows opt-out for websites. Data is processed into the “VegeMighty Dataset” and refined through a “Great Barrier Reef Pipeline” for LLM training. The model’s architecture, size, and training methodology remain undisclosed.

Governance: Operates as a nonprofit with volunteer labor (about 100 volunteers, 10+ full-time equivalent). Funding is sought from corporate clients and possible government grants, but no major public or private investment has been announced.

Timeline: Originally slated for an October 2024 launch, but as of August 2025, the project is still in the data collection and legal compliance phase, with no confirmed release date for a trained model.

Significance: Kangaroo LLM is a symbolic and practical step toward AI sovereignty, but it does not yet represent a technical alternative to global LLMs. Success will depend on sustained funding, technical execution, and adoption by Australian developers and enterprises.

International Model Deployment

Claude 3.5 Sonnet (Anthropic), GPT-4 (OpenAI), and LLaMA 2 (Meta) are all available and actively used in Australian research and industry. Their adoption is driven by their superior capabilities, ease of access via cloud providers (AWS, Azure, Google Cloud), and integration into enterprise workflows.

Claude 3.5 Sonnet has been available in AWS’s Sydney region since February 2025, enabling Australian organizations to use a state-of-the-art LLM with data residency compliance. This model is used in applications ranging from customer service to scientific research.

GPT-4 and LLaMA 2 are widely used in Australian universities, startups, and corporations for prototyping, content generation, and task automation. Their use is often accompanied by fine-tuning on local datasets to improve relevance and accuracy.

University of Sydney Case Study: A team used Claude to analyze whale acoustic data, achieving 89.4% accuracy in detecting minke whales—a significant improvement over traditional methods (76.5%). This project demonstrates how global LLMs can be adapted for local scientific needs, but also highlights Australia’s reliance on external model providers.

Research Contributions

Australia’s academic institutions are active in LLM research, but their focus is on evaluation, fairness, domain adaptation, and application—not on building new, large-scale foundational models.

UNSW’s BESSTIE Benchmark: A systematic evaluation framework for sentiment and sarcasm in Australian, British, and Indian English. It reveals that global LLMs consistently underperform on Australian English, especially for sarcasm detection (F-score 0.59 on Reddit, compared to 0.81 for sentiment). This work is critical for understanding the limitations of current models in local contexts.

Macquarie University’s Biomedical LLMs: Researchers have fine-tuned BERT variants (BioBERT, ALBERT) for medical question answering, achieving top scores in international competitions. This demonstrates Australia’s strength in adapting existing models to specialized domains, but not in developing new architectures.

CSIRO Data61: Publishes influential research on agent-based systems using LLMs, privacy-preserving AI, and model risk management. Their work is practical and policy-focused, not focused on foundational model development.

University of Adelaide and CommBank Partnership: The CommBank Centre for Foundational AI, established in late 2024, aims to advance machine learning for financial services, including fraud detection and personalized banking. This is a significant industry investment, but again, the focus is on application and fine-tuning, not on building a new, large-scale LLM.

Policy, Investment, and Ecosystem

Government Policy: The Australian government has developed a risk-based AI policy framework, with mandatory transparency, testing, and accountability for high-risk applications. Privacy law reforms in 2024 introduced new requirements for AI transparency, affecting how models are selected and deployed.

Investment: Venture capital in Australian AI startups reached AUD$1.3 billion in 2024, with AI accounting for nearly 30% of all venture deals in early 2025. However, most of this investment is in application-layer companies, not in foundational model development.

Industry Adoption: A 2024 survey found that 71% of Australian university staff use generative AI tools, primarily ChatGPT and Claude. Enterprise adoption is growing, but often limited by data sovereignty requirements, privacy compliance, and the lack of locally tailored models.

Computational Infrastructure: Australia does not have large-scale, sovereign computational infrastructure for LLM training. Most large-scale model training and inference rely on international cloud providers, though AWS’s Sydney region now supports Claude 3.5 Sonnet at scale.

Summary

Australia’s LLM landscape is defined by strong application-driven research, growing enterprise adoption, and active policy development, but no sovereign, large-scale foundational model. Kangaroo LLM is one of the few significant local efforts, but it remains in the early stages and faces major technical and resourcing hurdles.

In summary, Australia is a sophisticated user and adapter of LLMs, but not yet a builder of them. The most important elements are clear: Kangaroo LLM is a meaningful step, but not yet a solution; global models dominate but have local limitations; and Australian research and policy are world-class in evaluation and application, not in foundational innovation.

Sources:

https://kangaroollm.com.au

https://au.linkedin.com/company/kangaroo-llm

https://ia.acs.org.au/article/2024/australian-ai-project-calls-for-more-unpaid-volunteers.html

https://www.aboutamazon.com.au/news/aws/the-upgraded-claude-3-5-sonnet-anthropics-most-intelligent-ai-model-to-date-now-available-in-aws-sydney-region

https://www.anthropic.com/customers/university-of-sydney

https://www.sydney.edu.au/news-opinion/news/2025/06/05/whale-song-crunched-for-minke-conservation1.html

https://aclanthology.org/2025.findings-acl.441.pdf

https://aclanthology.org/2023.bsnlp-1.7.pdf

https://ceur-ws.org/Vol-2936/paper-20.pdf

https://www.adelaide.edu.au/aiml/news/list/2024/09/20/aiml-and-commbank-join-forces-to-establish-new-centre-for-foundational

https://www.commbank.com.au/articles/newsroom/2025/03/anthropic.html

https://www.csiro.au/en/research/technology-space/ai/nsf-ai-research

https://investmentcouncil.com.au/Common/Uploaded%20files/Smart%20Suite/Smart%20Library/ccd30dea-7be2-45e6-974c-bf98336c879f/2025%20Australian%20Private%20Capital%20Yearbook.pdf

The State of AI Venture Capital in 2025: AI Boom Slows with Fewer Startups But Bigger Bets

https://www.cutthrough.com/insights/cut-through-quarterly-1q-2025

https://ia.acs.org.au/article/2025/what-deepseeks-rise-could-mean-for-australian-ai.html

https://www.digital.gov.au/policy/ai/policy

Privacy and AI Regulations | 2024 review & 2025 outlook

https://kangaroollm.com.au/kangaroo-bot/

https://www.linkedin.com/posts/kangaroo-llm_australiandomains-digitalpresence-techtrends-activity-7241954718191652865-W1Fw

https://www.myaustraliannews.com.au/news-releases.html?id=1012302&headline=kangaroo-llm-launches-massive-web-crawl-to-build-australias-first-open-source-ai-model&pageNumber=7

https://www.forbes.com/councils/forbestechcouncil/2024/10/31/why-its-high-time-for-australia-to-build-its-own-large-language-model/

The post Australia’s Large Language Model Landscape: Technical Assessment appeared first on MarkTechPost.

Meet Boti: The AI assistant transforming how the citizens of Buenos Aires …

This post is co-written with Julieta Rappan, Macarena Blasi, and María Candela Blanco from the Government of the City of Buenos Aires.
The Government of the City of Buenos Aires continuously works to improve citizen services. In February 2019, it introduced an AI assistant named Boti available through WhatsApp, the most widely used messaging service in Argentina. With Boti, citizens can conveniently and quickly access a wide variety of information about the city, such as renewing a driver’s license, accessing healthcare services, and learning about cultural events. This AI assistant has become a preferred communication channel and facilitates more than 3 million conversations each month.
As Boti grows in popularity, the Government of the City of Buenos Aires seeks to provide new conversational experiences that harness the latest developments in generative AI. One challenge that citizens often face is navigating the city’s complex bureaucratic landscape. The City Government’s website includes over 1,300 government procedures, each of which has its own logic, nuances, and exceptions. The City Government recognized that Boti could improve access to this information by directly answering citizens’ questions and connecting them to the right procedure.
To pilot this new solution, the Government of the City of Buenos Aires partnered with the AWS Generative AI Innovation Center (GenAIIC). The teams worked together to develop an agentic AI assistant using LangGraph and Amazon Bedrock. The solution includes two main components: an input guardrail system and a government procedures agent. The input guardrail uses a custom LLM classifier to analyze incoming user queries, determining whether to approve or block requests based on their content. Approved requests are handled by the government procedures agent, which retrieves relevant procedural information and generates responses. Since most user queries focus on a single procedure, we developed a novel reasoning retrieval system to improve retrieval accuracy. This system initially retrieves comparative summaries that disambiguate similar procedures and then applies a large language model (LLM) to select the most relevant results. The agent uses this information to craft responses in Boti’s characteristic style, delivering short, helpful, and expressive messages in Argentina’s Rioplatense Spanish dialect. We focused on distinctive linguistic features of this dialect including the voseo (using “vos” instead of “tú”) and periphrastic future (using “ir a” before verbs).
In this post, we dive into the implementation of the agentic AI system. We begin with an overview of the solution, explaining its design and main features. Then, we discuss the guardrail and agent subcomponents and assess their performance. Our evaluation shows that the guardrails effectively block harmful content, including offensive language, harmful opinions, prompt injection attempts, and unethical behaviors. The agent achieves up to 98.9% top-1 retrieval accuracy using the reasoning retriever, which marks a 12.5–17.5% improvement over standard retrieval-augmented generation (RAG) methods. Subject matter experts found that Boti’s responses were 98% accurate in voseo usage and 92% accurate in periphrastic future usage. The promising results of this solution establish a new era of citizen-government interaction.
Solution overview
The Government of the City of Buenos Aires and the GenAIIC built an agentic AI assistant using Amazon Bedrock and LangGraph that includes an input guardrail system to enable safe interactions and a government procedures agent to respond to user questions. The workflow is shown in the following diagram.

The process begins when a user submits a question. In parallel, the question is passed to the input guardrail system and government procedures agent. The input guardrail system determines whether the question contains harmful content. If triggered, it stops graph execution and redirects the user to ask questions about government procedures. Otherwise, the agent continues to formulate its response. The agent either calls a retrieval tool, which allows it to obtain relevant context and metadata from government procedures stored in Amazon Bedrock Knowledge Bases, or responds to the user. Both the input guardrail and government procedures agent use the Amazon Bedrock Converse API for LLM inference. This API provides access to a wide selection of LLMs, helping us optimize performance and latency across different subtasks.
Input guardrail system
Input guardrails help prevent the LLM system from processing harmful content. Although Amazon Bedrock Guardrails offers one implementation approach with filters for specific words, content, or sensitive information, we developed a custom solution. This provided us greater flexibility to optimize performance for Rioplatense Spanish and monitor specific types of content. The following diagram illustrates our approach, in which an LLM classifier assigns a primary category (“approved” or “blocked”) as well as a more detailed subcategory.

Approved queries are within the scope of the government procedures agent. They consist of on-topic requests, which focus on government procedures, and off-topic requests, which are low-risk conversational questions that the agent responds to directly. Blocked queries contain high-risk content that Boti should avoid, including offensive language, harmful opinions, prompt injection attacks, or unethical behaviors.
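To make this concrete, here is a minimal sketch of how such an LLM classifier could be invoked through the Amazon Bedrock Converse API; the model ID, subcategory labels, and prompt wording are illustrative assumptions rather than the production configuration behind Boti.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# The category labels and prompt below are illustrative assumptions.
GUARDRAIL_SYSTEM_PROMPT = (
    "You classify user messages for a government-procedures assistant. "
    "Return JSON with 'category' ('approved' or 'blocked') and 'subcategory' "
    "('on_topic', 'off_topic', 'offensive_language', 'harmful_opinion', "
    "'prompt_injection', or 'unethical_behavior')."
)

def classify_query(user_query: str,
                   model_id: str = "anthropic.claude-3-haiku-20240307-v1:0") -> dict:
    """Classify a user query as approved or blocked before it reaches the agent."""
    response = bedrock.converse(
        modelId=model_id,
        system=[{"text": GUARDRAIL_SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": user_query}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 200},
    )
    raw = response["output"]["message"]["content"][0]["text"]
    return json.loads(raw)  # for example {"category": "approved", "subcategory": "on_topic"}

In this sketch, a blocked category would stop graph execution and trigger the redirect message, while an approved category lets the request continue to the government procedures agent.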
We evaluated the input guardrail system on a dataset consisting of both normal and harmful user queries. The system successfully blocked 100% of harmful queries, while occasionally flagging normal queries as harmful. This performance balance makes sure that Boti can provide helpful information while maintaining safe and appropriate interactions for users.
Agent system
The government procedures agent is responsible for answering user questions. It determines when to retrieve relevant procedural information using its retrieval tool and generates responses in Boti’s characteristic style. In the following sections, we examine both processes.
Reasoning retriever
The agent can use a retrieval tool to provide accurate and up-to-date information about government procedures. Retrieval tools typically employ a RAG framework to perform semantic similarity searches between user queries and a knowledge base containing document chunks stored as embeddings, and then provide the most relevant samples as context to the LLM. Government procedures, however, present challenges to this standard approach. Related procedures, such as renewing and reprinting drivers’ licenses, can be difficult to disambiguate. Additionally, each user question typically requires information from one specific procedure. The mixture of chunks returned from standard RAG approaches increases the likelihood of generating incorrect responses.
To better disambiguate government procedures, the Buenos Aires and GenAIIC teams developed a reasoning retrieval method that uses comparative summaries and LLM selection. An overview of this approach is shown in the following diagram.

A necessary preprocessing step before retrieval is the creation of a government procedures knowledge base. To capture both the key information contained in procedures and how they relate to each other, we created comparative summaries. Each summary contains basic information, such as the procedure’s purpose, intended audience, and content, such as costs, steps, and requirements. We clustered the base summaries into small groups, with an average cluster size of 5, and used an LLM to generate descriptions about what made each procedure different from its neighbors. We appended the distinguishing descriptions to the base information to create the final summary. We note that this approach shares similarities with Anthropic’s Contextual Retrieval, which prepends explanatory context to document chunks.
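As a rough illustration of this preprocessing step, the sketch below clusters base summaries and asks an LLM for a distinguishing description of each procedure; the clustering choices, prompt text, and generic llm callable are assumptions for illustration, not the production pipeline.

import numpy as np
from sklearn.cluster import KMeans

def build_comparative_summaries(base_summaries, embeddings, llm, avg_cluster_size=5):
    """Append an LLM-generated distinguishing description to each base summary."""
    n_clusters = max(1, len(base_summaries) // avg_cluster_size)
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(np.array(embeddings))

    comparative = {}
    for cluster_id in set(labels):
        members = [i for i, label in enumerate(labels) if label == cluster_id]
        neighborhood = "\n\n".join(base_summaries[i] for i in members)
        for i in members:
            prompt = (
                "Here is a group of similar government procedures:\n"
                f"{neighborhood}\n\n"
                "Explain, in two or three sentences, what makes this procedure "
                f"different from the others:\n{base_summaries[i]}"
            )
            distinguishing = llm(prompt)  # llm is any text-generation callable (assumption)
            comparative[i] = base_summaries[i] + "\n\nDistinguishing details: " + distinguishing
    return comparative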
With the knowledge base in place, we are able to retrieve relevant government procedures based on the user query. The reasoning retriever completes three steps:

Retrieve M Summaries: We retrieve between 1 and M comparative summaries using semantic search.
Optional Reasoning: In some cases, the initial retrieval surfaces similar procedures. To make sure that the most relevant procedures are returned to the agent, we apply an optional LLM reasoning step. The condition for this step occurs when the ratio of the first and second retrieval scores falls below a threshold value. An LLM follows a chain-of-thought (CoT) process in which it compares the user query to the retrieved summaries. It discards irrelevant procedures and reorders the remaining ones based on relevance. If the user query is specific enough, this process typically returns one result. By applying this reasoning step selectively, we minimize latency and token usage while maintaining high retrieval accuracy.
Retrieve N Full-Text Procedures: After the most relevant procedures are identified, we fetch their complete documents and metadata from an Amazon DynamoDB table. The metadata contains information like the source URL and the sentiment of the procedure. The agent typically receives between 1 and N results, where N ≤ M.

The agent receives the retrieved full text procedures in its context window. It follows its own CoT process to determine the relevant content and URL source attributions when generating its answer.
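A simplified sketch of these three steps is shown below, assuming the comparative summaries live in an Amazon Bedrock knowledge base and the full procedure documents in a DynamoDB table. The knowledge base ID, table name, metadata key, ratio threshold, and prompt are placeholder assumptions rather than the production values.

import boto3

kb_client = boto3.client("bedrock-agent-runtime")
procedures_table = boto3.resource("dynamodb").Table("government-procedures")  # hypothetical table name

def reasoning_retrieve(query: str, llm, kb_id: str = "EXAMPLEKBID",
                       m_summaries: int = 10, score_ratio_threshold: float = 1.2):
    # Step 1: retrieve up to M comparative summaries by semantic search.
    results = kb_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": m_summaries}},
    )["retrievalResults"]

    # Step 2: optional LLM reasoning when the top two scores are too close to call.
    if len(results) > 1 and results[0]["score"] / results[1]["score"] < score_ratio_threshold:
        summaries = "\n\n".join(r["content"]["text"] for r in results)
        prompt = (
            f"User question: {query}\n\nCandidate procedure summaries:\n{summaries}\n\n"
            "Think step by step, discard irrelevant procedures, and return the remaining "
            "procedure IDs in order of relevance, one per line."
        )
        # Assumes each summary embeds its procedure ID so the LLM can echo it back.
        selected_ids = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    else:
        # Assumes a procedure_id key in the knowledge base metadata.
        selected_ids = [results[0]["metadata"]["procedure_id"]]

    # Step 3: fetch the full procedure text and metadata for the selected procedures.
    return [procedures_table.get_item(Key={"procedure_id": pid})["Item"] for pid in selected_ids]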
We evaluated our reasoning retriever against standard RAG techniques using a synthetic dataset of 1,908 questions derived from known source procedures. The performance was measured by determining whether the correct procedure appeared in the top-k retrieved results for each question. The following plot compares the top-k retrieval accuracy for each approach across different models, arranged in order of ascending performance from left to right. The metrics are proportionally weighted based on each procedure’s webpage visit frequency, making sure that our evaluation reflects real-world usage patterns.

The first three approaches represent standard vector-based retrieval methods. The first method, Section Titan, involved chunking procedures by document sections, targeting approximately 250 words per chunk, and then embedding the chunks using Amazon Titan Text Embeddings v2. The second method, Summaries Titan, consisted of embedding the procedure summaries using the same embedding model. By embedding summaries rather than document text, the retrieval accuracy improved by 7.8–15.8%. The third method, Summaries Cohere, involved embedding procedure summaries using Cohere Multilingual v3 on Amazon Bedrock. The Cohere Multilingual embedding model provided a noticeable improvement in retrieval accuracy compared to the Amazon Titan embedding models, with all top-k values above 90%.
The next three approaches use the reasoning retriever. We embedded the procedure summaries using the Cohere Multilingual model, retrieved 10 summaries during the initial retrieval step, and optionally applied the LLM-based reasoning step using either Anthropic’s Claude 3 Haiku, Claude 3 Sonnet, or Claude 3.5 Sonnet on Amazon Bedrock. All three reasoning retrievers consistently outperform standard RAG techniques, achieving 12.5–17.5% higher top-k accuracies. Anthropic’s Claude 3.5 Sonnet delivered the highest performance with 98.9% top-1 accuracy. These results demonstrate how combining embedding-based retrieval with LLM-powered reasoning can improve RAG performance.
Answer generation
After collecting the necessary information, the agent responds using Boti’s distinctive communication style: concise, helpful messages in Rioplatense Spanish. We maintained this voice through prompt engineering that specified the following:

Personality – Convey a warm and friendly tone, providing quick solutions to everyday problems
Response length – Limit responses to a few sentences
Structure – Organize content using lists and highlight key information using bold text
Expression – Use emojis to mark important requirements and add visual cues
Dialect – Incorporate Rioplatense linguistic features, including voseo, periphrastic future, and regional vocabulary (for example, “acordate,” “entrar,” “acá,” and “allá”).

Government procedures often address sensitive topics, like accidents, health, or security. To facilitate appropriate responses, we incorporated sentiment analysis into our knowledge base as metadata. This allows our system to route to different prompt templates. Sensitive topics are directed to prompts with reduced emoji usage and more empathetic language, whereas neutral topics receive standard templates.
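The routing itself can be sketched as a simple template selection keyed on that sentiment metadata; the template wording and sentiment labels below are illustrative assumptions, not Boti’s actual prompts.

# The style templates and the "sensitive" label are illustrative assumptions.
STANDARD_STYLE = (
    "Respond as Boti in Rioplatense Spanish: warm and friendly, a few sentences at most, "
    "use voseo and the periphrastic future ('ir a' + verb), organize steps as short lists, "
    "bold the key requirements, and use emojis to highlight important items."
)

SENSITIVE_STYLE = (
    "Respond as Boti in Rioplatense Spanish: empathetic and calm, a few sentences at most, "
    "use voseo and the periphrastic future, avoid emojis, and prioritize clear next steps."
)

def build_system_prompt(procedure_metadata: dict, procedure_text: str) -> str:
    """Pick the style template based on the sentiment stored with the procedure."""
    style = SENSITIVE_STYLE if procedure_metadata.get("sentiment") == "sensitive" else STANDARD_STYLE
    return f"{style}\n\nProcedure information:\n{procedure_text}"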
The following figure shows a sample response to a question about borrowing library books. It has been translated to English for convenience.

To validate our prompt engineering approach, subject matter experts at the Government of the City of Buenos Aires reviewed a sample of Boti’s responses. Their analysis confirmed high fidelity to Rioplatense Spanish, with 98% accuracy in voseo usage and 92% in periphrastic future usage.
Conclusion
This post described the agentic AI assistant built by the Government of the City of Buenos Aires and the GenAIIC to respond to citizens’ questions about government procedures. The solution consists of two primary components: an input guardrail system that helps prevent the system from responding to harmful user queries and a government procedures agent that retrieves relevant information and generates responses. The input guardrails effectively block harmful content, including queries with offensive language, harmful opinions, prompt injection, and unethical behaviors. The government procedures agent employs a novel reasoning retrieval method that disambiguates similar government procedures, achieving up to 98.9% top-1 retrieval accuracy and a 12.5–17.5% improvement over standard RAG methods. Through prompt engineering, responses are delivered in Rioplatense Spanish using Boti’s voice. Subject matter experts rated Boti’s linguistic performance highly, with 98% accuracy in voseo usage and 92% in periphrastic future usage.
As generative AI advances, we expect to continuously improve our solution. The expanding catalog of LLMs available in Amazon Bedrock makes it possible to experiment with newer, more powerful models. This includes models that process text, as explored in the solution in this post, as well as models that process speech, allowing for direct speech-to-speech interactions. We might also explore the fine-tuning capabilities of Amazon Bedrock to customize models so that they better capture the linguistic features of Rioplatense Spanish. Beyond model improvements, we can iterate on our agent framework. The agent’s tool set can be expanded to support other tasks associated with government procedures like account creation, form completion, and appointment scheduling. As the City Government develops new experiences for citizens, we can consider implementing multi-agent frameworks in which specialist agents, like the government procedures agent, handle specific tasks.
To learn more about Boti and AWS’s generative AI capabilities, check out the following resources:

Boti: The City Chatbot
Government of the City of Buenos Aires: Procedures
Amazon Bedrock
Amazon Bedrock Knowledge Bases

About the authors
Julieta Rappan is Director of the Digital Channels Department of the Buenos Aires City Government, where she coordinates the landscape of digital and conversational interfaces. She has extensive experience in the comprehensive management of strategic and technological projects, as well as in leading high-performance teams focused on the development of digital products and services. Her leadership drives the implementation of technological solutions with a focus on scalability, coherence, public value, and innovation—where generative technologies are beginning to play a central role.
Macarena Blasi is Chief of Staff at the Digital Channels Department of the Buenos Aires City Government, working across the city’s main digital services, including Boti—the WhatsApp-based virtual assistant—and the official Buenos Aires website. She began her journey working in conversational experience design, later serving as product owner and Operations Manager and then as Head of Experience and Content, leading multidisciplinary teams focused on improving the quality, accessibility, and usability of public digital services. Her work is driven by a commitment to building clear, inclusive, and human-centered experiences in the public sector.
María Candela Blanco is Operations Manager for Quality Assurance, Usability, and Continuous Improvement at the Buenos Aires Government, where she leads the content, research, and conversational strategy across the city’s main digital channels, including the Boti AI assistant and the official Buenos Aires website. Outside of tech, Candela studies literature at UNSAM and is deeply passionate about language, storytelling, and the ways they shape our interactions with technology.
Leandro Micchele is a Software Developer focused on applying AI to real-world use cases, with expertise in AI assistants, voice, and vision solutions. He serves as the technical lead and consultant for the Boti AI assistant at the Buenos Aires Government and works as a Software Developer at Telecom Argentina. Beyond tech, his discipline extends to martial arts: he has over 20 years of experience and currently teaches Aikido.
Hugo Albuquerque is a Deep Learning Architect at the AWS Generative AI Innovation Center. Before joining AWS, Hugo had extensive experience working as a data scientist in the media and entertainment and marketing sectors. In his free time, he enjoys learning other languages like German and practicing social dancing, such as Brazilian Zouk.
Enrique Balp is a Senior Data Scientist at the AWS Generative AI Innovation Center working on cutting-edge AI solutions. With a background in the physics of complex systems focused on neuroscience, he has applied data science and machine learning across healthcare, energy, and finance for over a decade. He enjoys hikes in nature, meditation retreats, and deep friendships.
Diego Galaviz is a Deep Learning Architect at the AWS Generative AI Innovation Center. Before joining AWS, he had over 8 years of expertise as a data scientist across diverse sectors, including financial services, energy, big tech, and cybersecurity. He holds a master’s degree in artificial intelligence, which complements his practical industry experience.
Laura Kulowski is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where she works with customers to build generative AI solutions. Before joining Amazon, Laura completed her PhD at Harvard’s Department of Earth and Planetary Sciences and investigated Jupiter’s deep zonal flows and magnetic field using Juno data.
Rafael Fernandes is the LATAM leader of the AWS Generative AI Innovation Center, whose mission is to accelerate the development and implementation of generative AI in the region. Before joining Amazon, Rafael was a co-founder in the financial services industry space and a data science leader with over 12 years of experience in Europe and LATAM.

Empowering air quality research with secure, ML-driven predictive analytics

Air pollution remains one of Africa’s most pressing environmental health crises, causing widespread illness across the continent. Organizations like sensors.AFRICA have deployed hundreds of air quality sensors to address this challenge, but face a critical data problem: significant gaps in PM2.5 (particulate matter with diameter less than or equal to 2.5 micrometers) measurement records because of power instability and connectivity issues in high-risk regions where physical maintenance is limited. Missing data in PM2.5 datasets reduces statistical power and introduces bias into parameter estimates, leading to unreliable trend detection and flawed conclusions about air quality patterns. These data gaps ultimately compromise evidence-based decision-making for pollution control strategies, health impact assessments, and regulatory compliance.
In this post, we demonstrate the time-series forecasting capability of Amazon SageMaker Canvas, a low-code no-code (LCNC) machine learning (ML) platform to predict PM2.5 from incomplete datasets. PM2.5 exposure contributes to millions of premature deaths globally through cardiovascular disease, respiratory illness, and systemic health effects, making accurate air quality forecasting a critical public health tool. A key advantage of the forecasting capability of SageMaker Canvas is its robust handling of incomplete data. Traditional air quality monitoring systems often require complete datasets to function properly, meaning they can’t be relied on when sensors malfunction or require maintenance. In contrast, SageMaker Canvas can generate reliable predictions even when faced with gaps in sensor data. This resilience enables continuous operation of air quality monitoring networks despite inevitable sensor failures or maintenance periods, eliminating costly downtime and data gaps. Environmental agencies and public health officials benefit from uninterrupted access to critical air quality information, enabling timely pollution alerts and more comprehensive long-term analysis of air quality trends. By maintaining operational continuity even with imperfect data inputs, SageMaker Canvas significantly enhances the reliability and practical utility of environmental monitoring systems.
In this post, we provide a data imputation solution using Amazon SageMaker AI, AWS Lambda, and AWS Step Functions. This solution is designed for environmental analysts, public health officials, and business intelligence professionals who need reliable PM2.5 data for trend analysis, reporting, and decision-making. We sourced our sample training dataset from openAFRICA. Our solution predicts PM2.5 values using time-series forecasting. The sample training dataset contained over 15 million records from March 2022 to Oct 2022 in various parts of Kenya and Nigeria—data coming from 23 sensor devices from 15 unique locations. The sample code and workflows can be adapted to create prediction models for your PM2.5 datasets. See our solution’s README for detailed instructions.
Solution overview
The solution consists of two main ML components: a training workflow and an inference workflow. These workflows are built using the following services:

SageMaker Canvas is used to prepare data and train the prediction model through its no-code interface
Batch Transform in Amazon SageMaker AI is used for inference, processing the dataset in bulk to generate predictions
Step Functions orchestrates the inference process by coordinating the workflow between data retrieval, batch transformation, and database updates, managing workflow state transitions, and making sure that data flows properly through each processing stage
Lambda functions perform critical operations at each workflow step: retrieving sensor data from the database in required format, transforming data for model input, sending batches to SageMaker for inferencing, and updating the database with prediction results after processing is complete

At a high level, the solution works by taking a set of PM2.5 data with gaps and predicts the missing values within the range of plus or minus 4.875 micrograms per cubic meter of the actual PM2.5 concentration. It does this by first training a model on the data using inputs for the specific schema and a historical set of values from the user to guide the training process, which is completed with SageMaker Canvas. After the model is trained on a representative dataset and schema, SageMaker Canvas exports the model for use with batch processing. The Step Functions orchestration calls a Lambda function every 24 hours that takes a dataset of new sensor data that has gaps and initiates a SageMaker batch transform job to predict the missing values. The batch transform job processes the entire dataset at once, and the Lambda function then updates the existing dataset with the results. The new completed dataset with predicted values can now be distributed to public health decision-makers who need complete datasets to effectively analyze the patterns of PM2.5 data.
We dive into each of these steps in later sections of this post.
Solution walkthrough
The following diagram shows our solution architecture:

Let’s explore the architecture step by step:

To systematically identify and fill PM2.5 data gaps caused by sensor limitations and connectivity issues, Amazon EventBridge Scheduler invokes a Step Functions state machine every 24 hours. Step Functions orchestrates the calling of the various Lambda functions, removing the need to manage the complexities of error handling, retries, and state management, and providing a serverless workflow that seamlessly coordinates the PM2.5 data imputation process.
The State Machine invokes a Lambda function in your Amazon Virtual Private Cloud (Amazon VPC) that retrieves records containing missing air quality values from the user’s air quality database on Amazon Aurora PostgreSQL-Compatible Edition and stores the records in a CSV file in an Amazon Simple Storage Service (Amazon S3) bucket.
The State Machine then runs a Lambda function that retrieves the records from Amazon S3 and initiates the SageMaker batch transform job in your VPC using your SageMaker model created from your SageMaker Canvas predictive model trained on historical PM2.5 data (a simplified sketch of this step appears after this walkthrough).
To streamline the batch transform workflow, this solution uses an event-driven approach with EventBridge and Step Functions. EventBridge captures completion events from SageMaker batch transform jobs, while the task token functionality of Step Functions enables extended waiting periods beyond the time limits of Lambda. After processing completes, SageMaker writes the prediction results directly to an S3 bucket.
The final step in the state machine retrieves the predicted values from the S3 bucket and then updates the database in Aurora PostgreSQL-Compatible with the values including a predicted label set to true.
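As a rough illustration of the batch transform step in this walkthrough, the Lambda handler sketch below starts a SageMaker batch transform job with boto3; the model name, bucket, prefixes, and instance type are placeholder assumptions rather than values from the deployed solution.

import time
import boto3

sagemaker = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Job, model, and S3 locations below are placeholder assumptions.
    job_name = f"pm25-imputation-{int(time.time())}"
    sagemaker.create_transform_job(
        TransformJobName=job_name,
        ModelName="pm25-canvas-model",  # SageMaker model created from the registered Canvas model
        TransformInput={
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://example-air-quality-bucket/input/missing_pm25.csv",
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        TransformOutput={"S3OutputPath": "s3://example-air-quality-bucket/predictions/"},
        TransformResources={"InstanceType": "ml.m5.large", "InstanceCount": 1},
    )
    # Step Functions waits for the job-completion event from EventBridge before the
    # final Lambda writes predictions back to Aurora PostgreSQL-Compatible.
    return {"transform_job_name": job_name}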

Prerequisites
To implement the PM2.5 data imputation solution, you must have the following:

An AWS account with AWS Identity and Access Management (IAM) permissions sufficient to deploy the solution and interact with the database.
The following AWS services:

Amazon SageMaker AI
AWS Lambda
AWS Step Functions
Amazon S3
Aurora PostgreSQL-Compatible
Amazon CloudWatch
AWS CloudFormation
Amazon Virtual Private Cloud (VPC)
Amazon EventBridge
IAM for authentication to Aurora PostgreSQL-Compatible
AWS Systems Manager Parameter Store

A local desktop set up with AWS Command Line Interface (AWS CLI) version 2, Python 3.10, AWS Cloud Development Kit (AWS CDK) v2.x, and Git version 2.x.
The AWS CLI set up with the necessary credentials in the desired AWS Region.
Historical air quality sensor data. Note that our solution requires a fixed schema described in the GitHub repo’s README.

Deploy the solution
You will run the following steps to complete the deployment:

Prepare your environment by building Python modules locally for Lambda layers, deploying infrastructure using the AWS CDK, and initializing your Aurora PostgreSQL database with sensor data.
Perform steps in the Build your air quality prediction model section to configure a SageMaker Canvas application, followed by training and registering your model in Amazon SageMaker Model Registry.
Create SageMaker model using your registered SageMaker Canvas model by updating infrastructure using the AWS CDK.
Manage future configuration changes using the AWS CDK.

Step 1: Deploy AWS infrastructure and upload air quality sensor data
Complete the following steps to deploy the AWS infrastructure for the PM2.5 data imputation solution and upload air quality sensor data to Aurora PostgreSQL-Compatible:

Clone the repository to your local desktop environment using the following command:

git clone git@github.com:aws-samples/sample-empowering-air-quality-research-secure-machine-learning-predictive-analytics.git

Change to the project directory:

cd <BASE_PROJECT_FOLDER>

Follow the deployment steps in the README file up to Model Setup for Batch Transform Inference.

Step 2: Build your air quality prediction model
After you create the SageMaker AI domain and the SageMaker AI user profile as part of the CDK deployment steps, follow these steps to build your air quality prediction model.
Configure your SageMaker Canvas application

On the AWS Management Console, go to the SageMaker AI console and select the domain and the user profile that was created under Admin, Configurations, and Domains.
Choose the App Configurations tab, scroll down to the Canvas section, and select Edit.
In Canvas storage configuration, select Encryption and select the dropdown for aws/s3.
In the ML Ops Configuration, turn on the option to Enable Model Registry registration permissions for this user profile.

Optionally, in the Local file upload configuration section in your domain’s Canvas App Configuration, you can turn on Enable local file upload.

Choose Submit to save your configuration choices.
In your Amazon SageMaker AI home page, go to the Applications and IDEs section and select Canvas.
Select the SageMaker AI user profile that was created for you by the CDK deployment and choose Open Canvas.
In a new tab, SageMaker Canvas will start creating your application. This takes a few minutes.

Create and register your prediction model
In this phase, you develop a prediction model using your historical air quality sensor data.

The preceding architecture diagram illustrates the end-to-end process for training the SageMaker Canvas prediction model, registering that model, and creating a SageMaker model for running inference on newly found PM2.5 data gaps. The training process starts by extracting the air quality sensor dataset from the database. The dataset is imported into SageMaker Canvas for predictive analysis, where it is transformed and prepared through data wrangling steps for building and training ML models.
Prepare data
Our solution supports a SageMaker Canvas model trained for a single-target variable prediction based on historical data and performs corresponding data imputation for PM2.5 data gaps. To train your model for predictive analysis, follow the comprehensive End to End Machine Learning workflow in the AWS Canvas Immersion Day workshop, adapting each step to prepare your air quality sensor dataset. Begin with the standard workflow until you reach the data preparation section. Here, you can make several customizations:

Filter dataset for single-target value prediction: Your air quality dataset might contain multiple sensor parameters. For single-target value prediction using this solution, filter the dataset to include only PM2.5 measurements.
Clean sensor data: Remove records containing sensor fault values. For example, we filtered out values that equal 65535, because 65535 is a common error code for malfunctioning sensors. Adjust this filtering based on the specific error codes your air quality monitoring equipment produces. A minimal pandas sketch of this filtering follows this list.
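Here is that filtering as a minimal pandas sketch, assuming a CSV export of the sensor readings; the file and column names are assumptions and should be adjusted to your schema.

import pandas as pd

SENSOR_FAULT_VALUE = 65535  # common error code reported by malfunctioning sensors

readings = pd.read_csv("air_quality_readings.csv")   # hypothetical export from the database
pm25 = readings[readings["value_type"] == "P2"]      # assumes "P2" marks PM2.5 rows in your schema
pm25 = pm25[pm25["value"] != SENSOR_FAULT_VALUE]     # drop sensor fault readings
pm25.to_csv("pm25_training_data.csv", index=False)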

The following image shows our data wrangling Data Flow implemented using the above guidance:
Data Wrangler > Data Flow

Review generated insights and remove irrelevant data: Review the insights and analyses that SageMaker Canvas generates. Evaluate them with time-series forecasting and geospatial-temporal air quality patterns in mind, along with the relationships between the other columns of impact (see the chosen columns of impact in the GitHub repo for guidance). Analyze your dataset to identify the rows and columns that influence the prediction, and remove data that can reduce prediction accuracy.

The following image shows the data wrangling Analyses obtained by implementing the above guidance:
Data Wrangler > Analyses

Training your prediction model
After completing your data preparation, proceed to the Train the Model section of the workshop and continue with these specifications:

Select problem type: Select Predictive Analysis as your ML approach. Because our dataset is tabular and contains a timestamp, a target column that has values we’re using to forecast future values, and a device ID column, SageMaker Canvas will choose time series forecasting.
Define target column: Set Value as your target column for predicting PM2.5 values.
Build configuration: Use the Standard Build option for model training because it generally has a higher accuracy. See What happens when you build a model in How custom models work for more information.

By following these steps, you can create a model optimized for PM2.5 dataset predictive analysis, capable of generating valuable insights. Note that SageMaker Canvas supports retraining the ML model for updated PM2.5 datasets.
Evaluate the model
After training your model, proceed to Evaluate the model and review column impact, root mean square error (RMSE) score and other advanced metrics to understand your model’s performance for generating predictions for PM2.5.
The following image shows our model evaluation statistics achieved.

Add the model to the registry
Once you are satisfied with your model performance, follow the steps in Register a model version to the SageMaker AI model registry. Make sure to change the approval status to Approved before continuing to run this solution. At the time of this post’s publication, the approval must be updated in Amazon SageMaker Studio.
Log out of SageMaker Canvas
After completing your work in SageMaker Canvas, you can log out or configure your application to automatically terminate the workspace instance. A workspace instance is dedicated for your use every time you launch a Canvas application, and you are billed for as long as the instance runs. Logging out or terminating the workspace instance stops the workspace instance billing. For more information, see billing and cost in SageMaker Canvas.
Step 3: Create a SageMaker model using your registered SageMaker Canvas model
In the previous steps, you created a SageMaker domain and user profile through CDK deployment (Step 1) and successfully registered your model (Step 2). Now, it’s time to create the SageMaker model in your VPC using the SageMaker Canvas model you registered. Follow Model Setup for Batch Inference and Re-Deploy with Updated Configuration sections in the code README for creating SageMaker model.
Step 4: Manage future configuration changes
The same deployment pattern applies to any future configuration modifications you might require, including:

Batch transform instance type optimizations
Transform job scheduling changes

Update the relevant parameters in your configuration and run cdk deploy to propagate these changes throughout your solution architecture.
For a comprehensive list of configurable parameters and their default values, see the configuration file in the repository.
Execute cdk deploy again to update your infrastructure stack with your model ID for batch transform operations, replacing the placeholder value initially deployed. This infrastructure-as-code approach helps ensure consistent, version-controlled updates to your data imputation workflow.
Security best practices
Security and compliance is a shared responsibility between AWS and the customer, as outlined in the Shared Responsibility Model. We encourage you to review this model for a comprehensive understanding of the respective responsibilities.
In this solution, we enhanced security by implementing encryption at rest for Amazon S3, Aurora PostgreSQL-Compatible database, and the SageMaker Canvas application. We also enabled encryption in transit by requiring SSL/TLS for all connections from the Lambda functions. We implemented secure database access by providing temporary dynamic credentials through IAM authentication for Amazon RDS, eliminating the need for static passwords. Each Lambda function operates with least privilege access, receiving only the minimal permissions required for its specific function. Finally, we deployed the Lambda functions, Aurora PostgreSQL-Compatible instance, and SageMaker Batch Transform jobs in private subnets of the VPC that do not traverse the public internet. This private network architecture is enabled through VPC endpoints for Amazon S3, SageMaker AI, and AWS Secrets Manager.
Results
As shown in the following image, our model, developed using SageMaker Canvas, predicts PM2.5 values with an R-squared of 0.921. Because ML models for PM2.5 prediction frequently achieve R-squared values between 0.80 and 0.98 (see this example from ScienceDirect), our solution is within the range of higher-performing PM2.5 prediction models available today. SageMaker Canvas delivers this performance through its no-code experience, automatically handling model training and optimization without requiring ML expertise from users.

Clean up
Complete the following steps to clean up your resources:

SageMaker Canvas application cleanup:

Go to the SageMaker AI console and select the domain that was created under Admin configurations, Domains.
Select the user created under User Profiles for that domain.
On the User Details page, navigate to Spaces and Apps, and choose Delete to manually delete your SageMaker AI canvas application and clean up resources.

SageMaker Domain EFS storage cleanup:

Open Amazon EFS and in File systems, delete filesystem tagged as ManagedByAmazonSageMakerResource.
Open VPC and under Security, navigate to Security groups.
On Security groups, select security-group-for-inbound-nfs-<your-sagemaker-domain-id> and delete all Inbound rules associated with that group.
On Security groups, select security-group-for-outbound-nfs-<your-sagemaker-domain-id> and delete all associated Outbound rules.
Finally, delete both the security groups: security-group-for-inbound-nfs-<your-sagemaker-domain-id> and security-group-for-outbound-nfs-<your-sagemaker-domain-id>.

Use the AWS CDK to clean up the remaining AWS resources:

After the preceding steps are complete, return to your local desktop environment where the GitHub repo was cloned, and change to the project’s infra directory: cd <BASE_PROJECT_FOLDER>/infra
Destroy the resources created with AWS CloudFormation using the AWS CDK: cdk destroy
Monitor the AWS CDK process deleting resources created by the solution. If there are any errors, troubleshoot using the CloudFormation console and then retry deletion.

Conclusion
The development of accurate PM2.5 prediction models has traditionally required extensive technical expertise, presenting significant challenges for public health researchers studying air pollution’s impact on disease outcomes. From data preprocessing and feature engineering to model selection and hyperparameter tuning, these technical requirements diverted substantial time and effort away from researchers’ core work of analyzing health outcomes and developing evidence-based interventions.

SageMaker Canvas transforms this landscape by dramatically reducing the effort required to develop high-performing PM2.5 prediction models. Public health researchers can now generate accurate predictions without mastering complex ML algorithms, iterate quickly through an intuitive interface, and validate models across regions without manual hyperparameter tuning. With this shift to streamlined, accessible prediction capabilities, researchers can dedicate more time to interpreting results, understanding air pollution’s impact on community health, and developing protective interventions for vulnerable populations. The result is more efficient research that responds quickly to emerging air quality challenges and informs timely public health decisions.

We invite you to implement this solution for your air quality research or ML-based predictive analytics projects. Our comprehensive deployment steps and customization guidance will help you launch quickly and efficiently. As we continue enhancing this solution, your feedback is invaluable for improving its capabilities and maximizing its impact.

About the authors
Nehal Sangoi is a Senior Technical Account Manager at Amazon Web Services. She provides strategic technical guidance to help independent software vendors plan and build solutions using AWS best practices. Connect with Nehal on LinkedIn.
Ben Peterson is a Senior Technical Account Manager with AWS. He is passionate about enhancing the developer experience and driving customer success. In his role, he provides strategic guidance on using the comprehensive AWS suite of services to modernize legacy systems, optimize performance, and unlock new capabilities. Connect with Ben on LinkedIn.
Shashank Shrivastava is a Senior Delivery Consultant and Serverless TFC member at AWS. He is passionate about helping customers and developers build modern applications on serverless architecture. As a pragmatic developer and blogger, he promotes community-driven learning and sharing of technology. His interests are software architecture, developer tools, GenAI, and serverless computing. Connect with Shashank on LinkedIn.
Akshay Singhal is a Senior Technical Account Manager at Amazon Web Services supporting Enterprise Support customers focusing on the Security ISV segment. He provides technical guidance for customers to implement AWS solutions, with expertise spanning serverless architectures and cost optimization. Outside of work, Akshay enjoys traveling, Formula 1, making short movies, and exploring new cuisines. Connect with Akshay on LinkedIn.

How Amazon Finance built an AI assistant using Amazon Bedrock and Amazon Kendra

Finance analysts across Amazon Finance face mounting complexity in financial planning and analysis processes. When working with vast datasets spanning multiple systems, data lakes, and business units, analysts encounter several critical challenges. First, they spend significant time manually browsing data catalogs and reconciling data from disparate sources, leaving less time for valuable analysis and insight generation. Second, historical data and previous business decisions often reside in various documents and legacy systems, making it difficult to use past learnings during planning cycles. Third, as business contexts rapidly evolve, analysts need quick access to relevant metrics, planning assumptions, and financial insights to support data-driven decision-making.
Traditional tools and processes fall short in addressing these challenges. Keyword-based searches often miss contextual relationships in financial data, and rigid query structures limit analysts’ ability to explore data dynamically. Furthermore, the lack of institutional knowledge preservation means valuable insights and decision rationales often remain siloed or get lost over time, leading to redundant analysis and inconsistent planning assumptions across teams. These challenges significantly impact financial planning efficiency, decision-making agility, and the overall quality of business insights. Analysts needed a more intuitive way to access, understand, and use their organization’s collective financial knowledge and data assets.
The Amazon Finance technical team develops and manages comprehensive technology solutions that power financial decision-making and operational efficiency while standardizing across Amazon’s global operations. In this post, we explain how the team conceptualized and implemented a solution to these business challenges by harnessing the power of generative AI using Amazon Bedrock and intelligent search with Amazon Kendra.
Solution overview
To address these business challenges, Amazon Finance developed an AI-powered assistant solution that uses generative AI and enterprise search capabilities. This solution helps analysts interact with financial data sources and documentation through natural language queries, minimizing the need for complex manual searches across multiple systems. The assistant accesses a comprehensive knowledge base of financial documents, historical data, and business context, providing relevant and accurate responses while maintaining enterprise security standards. This approach not only streamlines data discovery but also preserves institutional knowledge and enables more consistent decision-making across the organization.
The AI assistant’s methodology consists of two key solution components: intelligent retrieval and augmented generation. The retrieval system uses vector stores, which are specialized databases that efficiently store and search high-dimensional representations of text meanings. Unlike traditional databases that rely on keyword matching, vector stores enable semantic search by converting user queries into vector representations and finding similar vectors in the database. Building on this retrieval foundation, the system employs augmented generation to create accurate and contextual responses. This approach enhances traditional language models by incorporating external knowledge sources during response generation, significantly reducing hallucinations and improving factual accuracy. The process follows three steps: retrieving relevant information from knowledge sources using semantic search, conditioning the language model with this context, and generating refined responses that incorporate the retrieved information. By combining these technologies, the assistant delivers responses that are both contextually appropriate and grounded in verified organizational knowledge, making it particularly effective for knowledge-intensive applications like financial operations and planning.
We implemented this Retrieval Augmented Generation (RAG) system through a combination of large language models (LLMs) on Amazon Bedrock and intelligent search using Amazon Kendra.
In the following sections, we discuss the key architectural components that we used in the solution and describe how the overall solution works.
Amazon Bedrock
We chose Anthropic’s Claude 3 Sonnet, a powerful language model, for its exceptional language generation capabilities and its ability to understand and reason about complex topics. By integrating Anthropic’s Claude into the RAG module through Amazon Bedrock, the AI assistant can generate contextual and informative responses that seamlessly combine the retrieved knowledge from the vector store with the model’s natural language processing and generation abilities, resulting in a more human-like and engaging conversational experience.
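The integration code isn’t shown in this post, but a minimal call to Anthropic’s Claude 3 Sonnet through the Amazon Bedrock runtime API looks roughly like the following sketch; the model ID and token limit are illustrative values rather than the production configuration.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def generate_answer(prompt: str) -> str:
    # Invoke Claude 3 Sonnet using the Anthropic Messages API request format.
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=body,
    )
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]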
Amazon Kendra (Enterprise Edition Index)
Amazon Kendra offers powerful natural language processing for AI assistant applications. It excels at understanding user questions and finding relevant answers through semantic search. The service works smoothly with generative AI models, particularly in RAG solutions. The enterprise security features in Amazon Kendra support data protection and compliance. Its ability to understand user intent and connect directly with Amazon Bedrock makes it ideal for business assistants. This helps create meaningful conversations using business documents and data catalogs.
We chose Amazon Kendra Enterprise Edition Index over Amazon OpenSearch Service, primarily due to its sophisticated built-in capabilities and reduced need for manual configuration. Whereas OpenSearch Service requires extensive customization and technical expertise, Amazon Kendra provides out-of-the-box natural language understanding, automatic document processing for over 40 file formats, pre-built enterprise connectors, and intelligent query handling including synonym recognition and refinement suggestions. The service combines keyword, semantic, and vector search approaches automatically, whereas OpenSearch Service requires manual implementation of these features. These features of Amazon Kendra were suitable for our finance domain use case, where accuracy is imperative for usability.
We also chose Amazon Kendra Enterprise Edition Index over Amazon Q Business for information retrieval, because it stands out as a more robust and flexible solution. Although both tools aim to streamline access to company information, Amazon Kendra offers superior retrieval accuracy and greater control over search parameters. With Amazon Kendra, you can fine-tune relevance tuning, customize document attributes, and implement custom synonyms to enhance search precision. This level of customization helped us tailor the search experience to our specific needs in the Amazon Finance domain and monitor the search results prior to the augmented generation step within user conversations.
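On the retrieval side, the Amazon Kendra Retrieve API returns semantically ranked passages along with references to the documents they came from. The following minimal sketch assumes a placeholder index ID and a small page size.

import boto3

kendra = boto3.client("kendra")

def retrieve_passages(query: str, index_id: str, top_k: int = 3) -> list[dict]:
    # Semantic retrieval: Kendra returns excerpt passages ranked by relevance.
    response = kendra.retrieve(
        IndexId=index_id,
        QueryText=query,
        PageSize=top_k,
    )
    return [
        {"text": item["Content"], "source": item.get("DocumentURI", "")}
        for item in response["ResultItems"]
    ]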
Streamlit
We selected Streamlit, a Python-based framework for creating interactive web applications, for building the AI assistant’s UI due to its rapid development capabilities, seamless integration with Python and the assistant’s backend components, interactive and responsive UI components, potential for data visualization, and straightforward deployment options. With the Streamlit UI, the assistant provides a user-friendly and engaging interface that facilitates natural language interactions while allowing for efficient iteration and deployment of the application.
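A minimal Streamlit chat loop of the kind described here might look like the following sketch; answer_question is a hypothetical stub standing in for the assistant’s RAG backend, and the session-state key is an illustrative choice.

import streamlit as st

def answer_question(question: str) -> str:
    # Placeholder: the real assistant calls the Kendra + Bedrock RAG backend here.
    return f"(RAG answer for: {question})"

st.title("Finance AI assistant")

# Keep the conversation history for the current session.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay earlier turns so the user sees the full session history.
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if question := st.chat_input("Ask a question about your financial data"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)

    answer = answer_question(question)
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.markdown(answer)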
Prompt template
Prompt templates allow for formatting user queries, integrating retrieved knowledge, and providing instructions or constraints for response generation. These elements are essential for generating contextual and informative responses that combine the language generation abilities of Anthropic’s Claude with the relevant knowledge retrieved from the Amazon Kendra-powered search. The following is an example prompt:

"""
H: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

A:
"""

Solution architecture
The following solution architecture diagram depicts how the key architectural components work with each other to power the solution.

The workflow consists of the following steps:

The user asks the question in a chat box after authentication.
The Streamlit application sends the query to an Amazon Kendra retriever for relevant document retrieval.
Amazon Kendra sends the relevant paragraph and document references to the RAG solution.
The RAG solution uses Anthropic’s Claude in Amazon Bedrock along with the prompt template and relevant paragraph as context.
The LLM response is sent back to the Streamlit UI.
The response is shown to the user along with the feedback feature and session history.
The user feedback on responses is stored separately in Amazon Simple Storage Service (Amazon S3).
Amazon Kendra indexes relevant documents stored in S3 buckets for document search and retrieval.
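To make the workflow concrete, the following simplified sketch strings steps 2 through 7 together: it retrieves relevant passages from Amazon Kendra, fills the prompt template shown earlier, generates a response with Anthropic’s Claude on Amazon Bedrock, and stores user feedback in Amazon S3. The index ID, bucket name, key layout, and model parameters are assumptions for illustration, not the production values.

import json
import uuid
import boto3

kendra = boto3.client("kendra")
bedrock = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")

# The prompt template shown earlier, with placeholders for context and question.
PROMPT_TEMPLATE = """H: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

A:"""

def answer_question(question: str, index_id: str) -> str:
    # Steps 2-3: retrieve the most relevant passages from the Kendra index.
    results = kendra.retrieve(IndexId=index_id, QueryText=question, PageSize=3)
    context = "\n\n".join(item["Content"] for item in results["ResultItems"])

    # Step 4: condition Claude on the retrieved context through the prompt template.
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body
    )
    # Steps 5-6: return the generated answer to the Streamlit UI.
    return json.loads(response["body"].read())["content"][0]["text"]

def store_feedback(question: str, answer: str, rating: str, bucket: str) -> None:
    # Step 7: persist user feedback separately in Amazon S3 (bucket and key are hypothetical).
    s3.put_object(
        Bucket=bucket,
        Key=f"feedback/{uuid.uuid4()}.json",
        Body=json.dumps({"question": question, "answer": answer, "rating": rating}),
    )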

Frontend architecture
We designed the following frontend architecture to allow for rapid modifications and deployment, keeping in mind the scalability and security of the solution.

This workflow consists of the following steps:

The user navigates to the application URL in their browser.
Amazon Route 53 resolves their request to the Amazon CloudFront distribution, which serves the request from the edge location closest to the user (to minimize latency).
CloudFront runs an AWS Lambda function that verifies the user has been authenticated (see the sketch after this list). If not, the user is redirected to sign in. After they successfully sign in, they are redirected back to the application website, the flow repeats, and CloudFront triggers the Lambda function again; this time, the user can access the website.
Now authenticated, CloudFront returns the assets of the web application.
AWS Fargate makes it possible to run containers without having to manage the underlying Amazon Elastic Compute Cloud (Amazon EC2) instances. This allows running containers as a true serverless service. Amazon Elastic Container Service (Amazon ECS) is configured with automatic scaling (target tracking automatic scaling, which scales based on the Application Load Balancer (ALB) requests per target).
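The authentication check in step 3 is typically implemented as a CloudFront viewer-request Lambda function. The following minimal Python sketch illustrates the idea; the cookie name and sign-in URL are hypothetical, and a production implementation would validate a signed token (for example, an Amazon Cognito JWT) rather than only checking that a cookie exists.

SIGN_IN_URL = "https://auth.example.com/sign-in"  # hypothetical sign-in endpoint
SESSION_COOKIE = "session-token"                  # hypothetical cookie name

def lambda_handler(event, context):
    # CloudFront viewer-request event: inspect the incoming request.
    request = event["Records"][0]["cf"]["request"]
    cookies = request.get("headers", {}).get("cookie", [])

    # Treat the presence of the session cookie as "authenticated" for this sketch.
    authenticated = any(SESSION_COOKIE in c.get("value", "") for c in cookies)

    if authenticated:
        # Let CloudFront continue serving the web application's assets.
        return request

    # Otherwise redirect the user to sign in; after sign-in they return to the app.
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            "location": [{"key": "Location", "value": SIGN_IN_URL}],
        },
    }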

Evaluation of the solution’s performance
We implemented a comprehensive evaluation framework to rigorously assess the AI assistant’s performance and make sure it meets the high standards required for financial applications. Our framework was designed to capture both quantitative metrics for measurable performance and qualitative indicators for user experience and response quality. During our benchmarking tests with analysts, we found that this solution dramatically reduced search time by 30% because analysts can now perform natural language search, and it improved the accuracy of search results by 80%.
Quantitative assessment
We focused primarily on precision and recall testing, creating a diverse test set of over 50 business queries that represented typical use cases our analysts encounter. Using human-labeled answers as our ground truth, we evaluated the system’s performance across two main categories: data discovery and knowledge search. In data discovery scenarios, where the system helps analysts locate specific data sources and metrics, we achieved an initial precision rate of 65% and a recall rate of 60% without performing metadata enrichment on the data sources. Although these rates might appear moderate, they represent a significant improvement over the previous manual search process, which had an estimated success rate of only 35% and often required multiple iterations across different systems. The current rates were attributed primarily to the lack of rich metadata about the data sources, a finding that has prompted teams to collect richer metadata for their data assets, an effort that is currently underway.
The knowledge search capability demonstrated initial rates of 83% precision and 74% recall without performing metadata enrichment on data sources. This marked a substantial improvement over traditional keyword-based search methods, which typically achieved only 45–50% precision in our internal testing. This improvement is particularly meaningful because it translates to analysts finding the right information in their first search attempt roughly 8 out of 10 times, compared to the previous average of 3–4 attempts needed to locate the same information.
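As a rough illustration of how precision and recall can be computed against human-labeled ground truth, the following sketch assumes a simplified test-set format in which each query is paired with the set of document IDs the labelers judged relevant; the field names and matching rule are assumptions.

def evaluate(test_set: list[dict], retrieve) -> tuple[float, float]:
    # Each test case looks like {"query": str, "relevant": set of document IDs}.
    # `retrieve` is any callable that returns the document IDs the system surfaces.
    precisions, recalls = [], []
    for case in test_set:
        retrieved = set(retrieve(case["query"]))
        relevant = case["relevant"]
        hits = len(retrieved & relevant)
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
        recalls.append(hits / len(relevant) if relevant else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

Averaging per-query precision and recall in this way makes it straightforward to report the two use case categories, data discovery and knowledge search, separately.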
Qualitative metrics
The qualitative evaluation centered around the concept of faithfulness—a critical metric for financial applications where accuracy and reliability are paramount. We employed an innovative LLM-as-a-judge methodology to evaluate how well the AI assistant’s responses aligned with source documentation and avoided hallucinations or unsupported assertions. The results showed a marked difference between use cases: data discovery achieved a faithfulness score of 70%, and business knowledge search demonstrated an impressive 88% faithfulness. These scores significantly outperform our previous documentation search system, which had no built-in verification mechanism and often led to analysts working with outdated or incorrect information.
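The judging prompt isn’t published here, but an LLM-as-a-judge faithfulness check can be sketched along the following lines; the prompt wording, verdict parsing, and model parameters are assumptions for illustration.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """You are grading an AI assistant's answer for faithfulness.

Source documents:
{sources}

Assistant answer:
{answer}

Is every claim in the answer supported by the source documents?
Reply with exactly one word: FAITHFUL or UNFAITHFUL."""

def is_faithful(answer: str, sources: str) -> bool:
    # Ask a judge model whether the answer is grounded in the retrieved sources.
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "messages": [{
            "role": "user",
            "content": JUDGE_PROMPT.format(sources=sources, answer=answer),
        }],
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body
    )
    verdict = json.loads(response["body"].read())["content"][0]["text"]
    # "FAITHFUL" is a substring of "UNFAITHFUL", so check for the negative label.
    return "UNFAITHFUL" not in verdict.upper()

The faithfulness score is then the fraction of test answers the judge marks FAITHFUL.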
Most importantly, the new system reduced the average time to find relevant information from 45–60 minutes to just 5–10 minutes—an 85% improvement in efficiency. User satisfaction surveys indicate that 92% of analysts prefer the new system over traditional search methods, citing improved accuracy and time savings as key benefits.
These evaluation results have not only validated our approach but also highlighted specific areas for future enhancement. We continue to refine our evaluation framework as the system evolves, making sure it maintains high standards of accuracy and reliability while meeting the dynamic needs of our financial analysts. The evaluation framework was instrumental in building confidence within our business user community, providing transparent metrics that demonstrate the system’s capability to handle complex financial queries while maintaining the accuracy standards essential for financial operations.
Use cases
Our solution transforms how finance users interact with complex financial and operational data through natural language queries. In this section, we discuss some key examples demonstrating how the system simplifies data discovery.
Seamless data discovery
The solution enables users to find data sources through natural language queries rather than requiring technical knowledge of database structures. It uses a sophisticated combination of vector stores and enterprise search capabilities to match user questions with relevant data sources, though careful attention must be paid to context management and preventing over-reliance on previous interactions. Prior to the AI assistant solution, finance analysts needed deep technical knowledge to navigate complex database structures, often spending hours searching through multiple documentation sources just to locate specific data tables. Understanding system workflows required extensive review of technical documentation or reaching out to subject matter experts, creating bottlenecks and reducing productivity. Even experienced users struggled to piece together complete information about business processes from fragmented sources across different systems. Now, analysts can simply ask questions in natural language, such as “Where can I find productivity metrics?”, “How do I access facility information?”, or “Which dashboard shows operational data?” and receive precise, contextual answers. The solution combines enterprise search capabilities with LLMs to understand user intent and deliver relevant information from both structured and unstructured data sources. Analysts now receive accurate directions to specific consolidated reporting tables, clear explanations of business processes, and relevant technical details when needed. In our benchmark tests, for data discovery tasks alone, the system achieved 70% faithfulness and 65% precision, and document search demonstrated even stronger results with 83% precision and 88% faithfulness, without metadata enrichments.
Helping analysts understand internal business processes from knowledge documentation
Financial analysts previously faced a steep learning curve when working with enterprise planning tools. The complexity of these systems meant that even basic tasks required extensive documentation review or waiting for support from overwhelmed subject matter experts. New team members could take weeks or months to become proficient, while even experienced users struggled to keep up with system updates and changes. This created a persistent bottleneck in financial operations and planning processes.

The introduction of the AI-powered assistant has fundamentally changed how analysts learn and interact with these planning tools. Rather than searching through hundreds of pages of technical documentation, analysts can now ask straightforward questions like “How do I forecast depreciation for new assets?”, “How does the quarterly planning process work?” or “What inputs are needed for the quarterly planning cycle?” The system provides clear, contextualized explanations drawn from verified documentation and system specifications. Our benchmark tests revealed that it achieved 83% precision and 88% faithfulness in retrieving and explaining technical and business information. New analysts can become productive in a matter of weeks, experienced users can quickly verify procedures, and subject matter experts can focus on more complex challenges rather than routine questions. This represents a significant advancement in making enterprise systems more accessible and efficient, while maintaining the accuracy and reliability required for financial operations. While the technology continues to evolve, particularly in handling nuanced queries and maintaining comprehensive coverage of system updates, it has already transformed the way teams interact with planning tools independently.
Conclusion
The AI-powered assistant solution discussed in this post has demonstrated significant improvements in data discovery and business insights generation, delivering multiple key benefits across Amazon Finance. Analysts can now quickly find relevant information through natural language queries, dramatically reducing search time. The system’s ability to synthesize insights from disparate data sources has notably enhanced data-driven decision-making, and its conversational interface and contextual responses promote self-service data exploration, effectively reducing the burden on centralized data teams.
This innovative AI assistant solution showcases the practical power of AWS generative AI in transforming enterprise data discovery and document search. By combining Amazon Kendra Enterprise Edition Index, Amazon Bedrock, and advanced LLMs, the implementation achieves impressive precision rates, proving that sophisticated AI-powered search is both achievable and effective. This success demonstrates how AWS generative AI services can meet current business needs while promoting future innovations in enterprise search. These services provide a strong foundation for organizations looking to enhance data discovery processes using natural language to support intelligent enterprise applications. To learn more about implementing AI-powered search solutions, see Build and scale the next wave of AI innovation on AWS and explore AWS AI use cases.

About the authors
Saikat Gomes is part of the Customer Solutions team in Amazon Web Services. He is passionate about helping enterprises succeed and realize benefits from cloud adoption. He is a strategic advisor to his customers for large-scale cloud transformations involving people, process, and technology. Prior to joining AWS, he held multiple consulting leadership positions and led large-scale transformation programs in the retail industry for over 20 years. He is based out of Los Angeles, California.
Amit Dhanda serves as a Senior Scientist at Amazon’s Worldwide Operations Finance team, where he uses AI/ML technologies to solve complex ecommerce challenges. Prior to Amazon, he was Director of Data Science at Adore Me (now part of Victoria’s Secret), where he enhanced digital retail experiences through recommender systems. He held science leadership roles at EXL and Thomson Reuters, where he developed ML models for customer engagement/growth and text classification.