ETH and Stanford Researchers Introduce MIRIAD: A 5.8M Pair Dataset to Improve LLM Accuracy in Medical AI

Challenges of LLMs in Medical Decision-Making: Addressing Hallucinations via Knowledge Retrieval

LLMs are set to revolutionize healthcare through intelligent decision support and adaptable chat-based assistants. However, a major challenge is their tendency to produce factually incorrect medical information. To address this, a common solution is RAG, where external medical knowledge is broken into smaller text pieces that LLMs can retrieve and use during generation. While promising, current RAG methods depend on unstructured medical content that is often noisy, unfiltered, and difficult for LLMs to interpret effectively. There is a clear need for better organization and presentation of medical knowledge to ensure LLMs can use it more reliably and accurately.
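To make the RAG workflow above concrete, here is a minimal sketch of the chunk-embed-retrieve-prepend loop. It assumes the sentence-transformers library and a generic llm_generate call (a hypothetical placeholder for whichever model backend is in use); the passages are toy examples, not MIRIAD data.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# External medical knowledge split into small passages (toy examples, not MIRIAD data).
passages = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "ACE inhibitors can cause a persistent dry cough in some patients.",
    "Warfarin dosing is monitored with the INR blood test.",
]
passage_embeddings = embedder.encode(passages, convert_to_tensor=True)

def retrieve(question, k=2):
    """Return the k passages most similar to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, passage_embeddings, top_k=k)[0]
    return [passages[hit["corpus_id"]] for hit in hits]

question = "Which common side effect is associated with ACE inhibitors?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# llm_generate(prompt)  # hypothetical call to whichever LLM backend is in use
print(prompt)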

Limitations of Current RAG Approaches in Healthcare AI

Though LLMs perform impressively across general language tasks, they often fall short in domains requiring up-to-date and precise knowledge, such as medicine. RAG offers a cost-effective alternative to expensive fine-tuning by grounding models in external literature. Yet, many current RAG systems rely on general-purpose text embeddings and standard vector databases, which are not optimized for medical content. Unlike in general domains, the medical field lacks large, high-quality datasets pairing medical questions with relevant answers. Existing datasets, such as PubMedQA or MedQA, are either too small, overly structured (e.g., multiple-choice), or lack the kind of open-ended, real-world responses needed to build strong medical retrieval systems.

MIRIAD Dataset: Structuring Medical QA with Peer-Reviewed Grounding

Researchers from ETH Zurich, Stanford, the Mayo Clinic, and other institutions have developed MIRIAD, a large-scale dataset comprising over 5.8 million high-quality medical instruction-response pairs. Each pair is carefully rephrased and grounded in peer-reviewed literature through a semi-automated process involving LLMs, filters, and expert review. Unlike prior unstructured datasets, MIRIAD offers structured, retrievable medical knowledge, boosting LLM accuracy on complex medical QA tasks by up to 6.7% and improving hallucination detection by 22.5–37%. They also launched MIRIAD-Atlas, a visual tool encompassing 56 medical fields, which enables users to explore and interact with this rich resource, thereby enhancing trustworthy AI in healthcare.

Data Pipeline: Filtering and Structuring Medical Literature Using LLMs and Classifiers

To build MIRIAD, researchers filtered 894,000 medical articles from the S2ORC corpus and broke them into clean, sentence-based passages, excluding overly long or noisy content. They used LLMs with structured prompts to generate over 10 million question-answer pairs, later refining this to 5.8 million through rule-based filtering. A custom-trained classifier, based on GPT-4 labels, helped further narrow it down to 4.4 million high-quality pairs. Human medical experts also validated a sample for accuracy, relevance, and grounding. Finally, they created MIRIAD-Atlas, an interactive 2D map of the dataset, using embedding and dimensionality reduction to cluster related content by topic and discipline.
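The following is a simplified sketch of that kind of pipeline: sentence-based passage cleaning, LLM-driven question-answer generation, and rule-based filtering. It is not the authors' code; call_llm is a hypothetical wrapper around whichever model produces the pairs, and the length thresholds are illustrative.

import re

def clean_passages(article_text, min_words=30, max_words=300):
    """Split an article into sentence-based passages and drop overly long or noisy ones."""
    sentences = re.split(r"(?<=[.!?])\s+", article_text)
    passages, current = [], []
    for sentence in sentences:
        current.append(sentence)
        if sum(len(s.split()) for s in current) >= min_words:
            passages.append(" ".join(current))
            current = []
    if current:
        passages.append(" ".join(current))
    return [p for p in passages if min_words <= len(p.split()) <= max_words]

QA_PROMPT = (
    "Write one self-contained medical question answerable from the passage, "
    "then a concise answer grounded only in the passage.\n\nPassage:\n{passage}"
)

def keep_pair(question, answer):
    """Rule-based filter: discard trivially short or malformed pairs."""
    return question.endswith("?") and len(answer.split()) >= 10

def build_pairs(article_text, call_llm):
    """Generate grounded QA pairs for one article; call_llm returns (question, answer)."""
    pairs = []
    for passage in clean_passages(article_text):
        question, answer = call_llm(QA_PROMPT.format(passage=passage))
        if keep_pair(question, answer):
            pairs.append({"question": question, "answer": answer, "passage": passage})
    return pairs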

Performance Gains: Enhancing QA Accuracy and Hallucination Detection Using MIRIAD

The MIRIAD dataset significantly improves the performance of large language models on medical tasks. When used in RAG, models achieved up to 6.7% higher accuracy compared to using unstructured data, even with the same amount of retrieved content. MIRIAD also enhanced the ability of models to detect medical hallucinations, with F1 score improvements ranging from 22.5% to 37%. Additionally, training retriever models on MIRIAD resulted in improved retrieval quality. The dataset’s structure, grounded in verified literature, enables more precise and reliable access to information, supporting a wide range of downstream medical applications.
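As an illustration of using structured QA pairs, rather than raw text chunks, as retrieval units, the hedged sketch below embeds MIRIAD-style questions and returns the grounded answers alongside them. The Hugging Face dataset identifier and the column names "question" and "answer" are assumptions; check the project's dataset card for the exact values.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

DATASET_ID = "miriad/miriad-5.8M"  # hypothetical placeholder; see the Hugging Face page
qa_pairs = load_dataset(DATASET_ID, split="train[:1000]")  # small slice for the demo

embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Embed the questions; each retrieved unit carries its literature-grounded answer with it.
question_embeddings = embedder.encode(qa_pairs["question"], convert_to_tensor=True)

def retrieve_qa(query, k=3):
    """Return the k QA pairs whose questions best match the query."""
    q_emb = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, question_embeddings, top_k=k)[0]
    return [
        {"question": qa_pairs["question"][h["corpus_id"]],
         "answer": qa_pairs["answer"][h["corpus_id"]]}
        for h in hits
    ]

context = "\n\n".join(
    f"Q: {p['question']}\nA: {p['answer']}"
    for p in retrieve_qa("What are first-line treatments for hypertension?")
)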

MIRIAD-Atlas: Visual Exploration Across 56 Medical Fields

In conclusion, MIRIAD is a large, structured dataset comprising 5.8 million medical question-answer pairs, grounded in peer-reviewed literature, and built to support a range of medical AI applications. It includes an interactive atlas for easy exploration and incorporates rigorous quality control through automated filters, LLM assessments, and expert reviews. Unlike previous unstructured corpora, MIRIAD improves retrieval accuracy in medical question answering and can help identify hallucinations in language models. While not yet exhaustive, it lays a strong foundation for future datasets. Continued improvements could enable more accurate, user-involved retrieval and better integration with clinical tools and medical AI systems.

Check out the Paper, GitHub Page and Dataset on Hugging Face. All credit for this research goes to the researchers of this project.

Build a Low-Footprint AI Coding Assistant with Mistral Devstral

This Ultra-Light Mistral Devstral tutorial provides a Colab-friendly guide designed specifically for users facing disk space constraints. Running large language models like Mistral can be a challenge in environments with limited storage and memory, but this tutorial shows how to deploy the powerful devstral-small model. With aggressive quantization using BitsAndBytes, cache management, and efficient token generation, this tutorial walks you through building a lightweight assistant that’s fast, interactive, and disk-conscious. Whether you’re debugging code, writing small tools, or prototyping on the go, this setup ensures that you get maximum performance with minimal footprint.

!pip install -q kagglehub mistral-common bitsandbytes transformers --no-cache-dir
!pip install -q accelerate torch --no-cache-dir

import shutil
import os
import gc

The tutorial begins by installing essential lightweight packages such as kagglehub, mistral-common, bitsandbytes, and transformers, ensuring no cache is stored to minimize disk usage. It also includes accelerate and torch for efficient model loading and inference. To further optimize space, any pre-existing cache or temporary directories are cleared using Python’s shutil, os, and gc modules.

def cleanup_cache():
    """Clean up unnecessary files to save disk space"""
    cache_dirs = ['/root/.cache', '/tmp/kagglehub']
    for cache_dir in cache_dirs:
        if os.path.exists(cache_dir):
            shutil.rmtree(cache_dir, ignore_errors=True)
    gc.collect()

cleanup_cache()
print("Disk space optimized!")

To maintain a minimal disk footprint throughout execution, the cleanup_cache() function is defined to remove redundant cache directories like /root/.cache and /tmp/kagglehub. This proactive cleanup helps free up space before and after key operations. Once invoked, the function confirms that disk space has been optimized, reinforcing the tutorial’s focus on resource efficiency.

import warnings
warnings.filterwarnings("ignore")

import torch
import kagglehub
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

To ensure smooth execution without distracting warning messages, we suppress all runtime warnings using Python’s warnings module. We then import the essential libraries for model interaction, including torch for tensor computations, kagglehub for streaming the model, and transformers for loading the quantized LLM. Mistral-specific classes such as UserMessage, ChatCompletionRequest, and MistralTokenizer are also imported to handle tokenization and request formatting tailored to Devstral’s architecture.

class LightweightDevstral:
    def __init__(self):
        print("Downloading model (streaming mode)...")

        self.model_path = kagglehub.model_download(
            'mistral-ai/devstral-small-2505/Transformers/devstral-small-2505/1',
            force_download=False
        )

        quantization_config = BitsAndBytesConfig(
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_storage=torch.uint8,
            load_in_4bit=True
        )

        print("Loading ultra-compressed model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            torch_dtype=torch.float16,
            device_map="auto",
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        self.tokenizer = MistralTokenizer.from_file(f'{self.model_path}/tekken.json')

        cleanup_cache()
        print("Lightweight assistant ready! (~2GB disk usage)")

    def generate(self, prompt, max_tokens=400):
        """Memory-efficient generation"""
        tokenized = self.tokenizer.encode_chat_completion(
            ChatCompletionRequest(messages=[UserMessage(content=prompt)])
        )

        input_ids = torch.tensor([tokenized.tokens])
        if torch.cuda.is_available():
            input_ids = input_ids.to(self.model.device)

        with torch.inference_mode():
            output = self.model.generate(
                input_ids=input_ids,
                max_new_tokens=max_tokens,
                temperature=0.6,
                top_p=0.85,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
                use_cache=True
            )[0]

        del input_ids
        torch.cuda.empty_cache() if torch.cuda.is_available() else None

        return self.tokenizer.decode(output[len(tokenized.tokens):])

print("Initializing lightweight AI assistant...")
assistant = LightweightDevstral()

We define the LightweightDevstral class, the core component of the tutorial, which handles model loading and text generation in a resource-efficient manner. It begins by streaming the devstral-small-2505 model using kagglehub, avoiding redundant downloads. The model is then loaded with aggressive 4-bit quantization via BitsAndBytesConfig, significantly reducing memory and disk usage while still enabling performant inference. A custom tokenizer is initialized from a local JSON file, and the cache is cleared immediately afterward. The generate method employs memory-safe practices, such as torch.inference_mode() and empty_cache(), to generate responses efficiently, making this assistant suitable even for environments with tight hardware constraints.

def run_demo(title, prompt, emoji=""):
    """Run a single demo with cleanup"""
    print(f"\n{emoji} {title}")
    print("-" * 50)

    result = assistant.generate(prompt, max_tokens=350)
    print(result)

    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

run_demo(
    "Quick Prime Finder",
    "Write a fast prime checker function `is_prime(n)` with explanation and test cases.",
    ""
)

run_demo(
    "Debug This Code",
    """Fix this buggy function and explain the issues:
```python
def avg_positive(numbers):
    total = sum([n for n in numbers if n > 0])
    return total / len([n for n in numbers if n > 0])
```""",
    ""
)

run_demo(
    "Text Tool Creator",
    "Create a simple `TextAnalyzer` class with word count, char count, and palindrome check methods.",
    ""
)

Here we showcase the model’s coding abilities through a compact demo suite using the run_demo() function. Each demo sends a prompt to the Devstral assistant and prints the generated response, immediately followed by memory cleanup to prevent buildup over multiple runs. The examples include writing an efficient prime-checking function, debugging a Python snippet with logical flaws, and building a mini TextAnalyzer class. These demonstrations highlight the model’s utility as a lightweight, disk-conscious coding assistant capable of real-time code generation and explanation.

def quick_coding():
    """Lightweight interactive session"""
    print("\nQUICK CODING MODE")
    print("=" * 40)
    print("Enter short coding prompts (type 'exit' to quit)")

    session_count = 0
    max_sessions = 5

    while session_count < max_sessions:
        prompt = input(f"\n[{session_count+1}/{max_sessions}] Your prompt: ")

        if prompt.lower() in ['exit', 'quit', '']:
            break

        try:
            result = assistant.generate(prompt, max_tokens=300)
            print("Solution:")
            print(result[:500])

            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

        except Exception as e:
            print(f"Error: {str(e)[:100]}...")

        session_count += 1

    print("\nSession complete! Memory cleaned.")

We introduce Quick Coding Mode, a lightweight interactive interface that allows users to submit short coding prompts directly to the Devstral assistant. Designed to limit memory usage, the session caps interaction to five prompts, each followed by aggressive memory cleanup to ensure continued responsiveness in low-resource environments. The assistant responds with concise, truncated code suggestions, making this mode ideal for rapid prototyping, debugging, or exploring coding concepts on the fly, all without overwhelming the notebook’s disk or memory capacity.

def check_disk_usage():
    """Monitor disk usage"""
    import subprocess
    try:
        result = subprocess.run(['df', '-h', '/'], capture_output=True, text=True)
        lines = result.stdout.split('\n')
        if len(lines) > 1:
            usage_line = lines[1].split()
            used = usage_line[2]
            available = usage_line[3]
            print(f"Disk: {used} used, {available} available")
    except:
        print("Disk usage check unavailable")

print("\nTutorial Complete!")
cleanup_cache()
check_disk_usage()

print("\nSpace-Saving Tips:")
print("• Model uses ~2GB vs original ~7GB+")
print("• Automatic cache cleanup after each use")
print("• Limited token generation to save memory")
print("• Use 'del assistant' when done to free ~2GB")
print("• Restart runtime if memory issues persist")

Finally, we offer a cleanup routine and a helpful disk usage monitor. Using the df -h command via Python’s subprocess module, it displays how much disk space is used and available, confirming the model’s lightweight nature. After re-invoking cleanup_cache() to ensure minimal residue, the script concludes with a set of practical space-saving tips.

In conclusion, we can now leverage the capabilities of Mistral’s Devstral model in space-constrained environments like Google Colab, without compromising usability or speed. The model loads in a highly compressed format, performs efficient text generation, and ensures memory is promptly cleared after use. With the interactive coding mode and demo suite included, users can test their ideas quickly and seamlessly.

Check out the Codes. All credit for this research goes to the researchers of this project.

Build an intelligent multi-agent business expert using Amazon Bedrock

In this post, we demonstrate how to build a multi-agent system using multi-agent collaboration in Amazon Bedrock Agents to solve complex business questions in the biopharmaceutical industry. We show how specialized agents in research and development (R&D), legal, and finance domains can work together to provide comprehensive business insights by analyzing data from multiple sources.
Amazon Bedrock Agents and multi-agent collaboration
Business intelligence and market research enable businesses large and small to capture industry trends and the competitive landscape through data, and to inform key business strategies. For example, biopharmaceutical companies use data to understand drug market size, clinical trials, and the prevalence of side effects, and to uncover innovations and pitfalls by analyzing patents and legal briefs to shape investment strategies. In doing so, organizations face the challenge of accessing and analyzing information scattered across multiple data sources. Consolidating and querying these disparate datasets can be a complex and time-consuming task, requiring developers to navigate different data formats, query languages, and access mechanisms. Additionally, gaining a comprehensive understanding of an organization’s operations often requires combining data insights from various segments, such as legal, finance, and R&D.
Generative AI agentic systems have emerged as a promising solution, enabling organizations to use generative AI for autonomous reasoning and action-based tasks. However, many agentic systems to date are built with a single-agent setup, which poses challenges in a complex business environment. Besides the challenge of managing multiple data sources, encoding information and guidance for multiple business domains might cause the prompt for an agent’s large language model (LLM) to grow to such an extent that it suffers from “forgetting the middle” of a long context. Therefore, there is a trade-off between the breadth and depth of knowledge for each domain that can be encoded in an agent effectively. Additionally, the use of a single LLM with an agent limits cost, latency, and accuracy optimizations for the selected model.
Amazon Bedrock Agents and its multi-agent collaboration feature provide powerful, enterprise-ready solutions for addressing these challenges and building intelligent and automated agentic systems. As a managed service within the AWS ecosystem, Amazon Bedrock Agents offers seamless integration with AWS data sources, built-in security controls, and enterprise-grade scalability. It contains built-in support for additional Amazon Bedrock features such as Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases. The service significantly reduces deployment overhead, empowering developers to focus on agent logic through an API-driven, familiar AWS Cloud environment and console. The supervisor agent model with specialized sub-agents enables efficient distributed problem-solving, breaking down complex tasks with intelligent routing.
In this post, we discuss how to build a multi-agent system using multi-agent collaboration to solve complex business questions faced in a fictional biopharmaceutical company that requires expertise and data from three specialized domains: R&D, legal, and finance. We demonstrate how data in disparate sources can be combined intelligently to support complex reasoning, and how agent collaboration facilitates open-ended question answering, such as “What are the potential legal and financial risks associated with the side effects of therapeutic product X, and how might they affect the company’s long-term stock performance?”
The following image depicts the roles of the supervisor agent and the three sub-agents used in our pharmaceutical example, along with the benefits of using multi-agent collaboration.

Solution overview
Our use case centers around PharmaCorp, a fictional pharmaceutical company, which faces the challenge of managing vast amounts of data across its Pharma R&D, Legal, and Finance divisions. Each division works with structured data, such as stock prices, and unstructured data, such as clinical trial reports. The data for each division is located in different data stores, which makes it difficult for teams to access cross-functional insights and slows down decision-making processes.
To address this, we build a multi-agent system with domain-specific sub-agents for each division using multi-agent collaboration within Amazon Bedrock Agents. These sub-agents efficiently handle data queries and information retrieval, and the main agent passes necessary context between sub-agents and synthesizes insights across divisions. The multi-agent setup empowers PharmaCorp to access expertise and information within minutes that would otherwise take hours of human effort to compile. This approach breaks down data silos and strengthens organizational collaboration.
The following architecture diagram illustrates the solution setup.

The main agent acts as an orchestrator, asking questions to multiple sub-agents and synthesizing retrieved data:

The R&D sub-agent has access to clinical trial data through Amazon Athena and unstructured clinical trial reports
The legal sub-agent has access to unstructured patents and lawsuit legal briefs
The finance sub-agent has access to research budget data through Athena and historical stock price data stored in Amazon Redshift

Each sub-agent has granular permissions to only access the data in its domain. Detailed information about the data and models used and main agent interactions are described in the following sections.
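As an example of the kind of granular, domain-scoped data access described above, the sketch below shows how an action-group Lambda function behind the R&D sub-agent might query the clinical trial tables through Athena with boto3. The database, table, column, and S3 output names are hypothetical placeholders rather than the ones created by the CloudFormation stack.

import time
import boto3

athena = boto3.client("athena")

def run_athena_query(sql, database="pharma_rd",
                     output="s3://structured-data-example/athena-results/"):
    """Run a SQL query through Athena and return the result rows as lists of strings."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    query_id = execution["QueryExecutionId"]
    while True:  # simple polling loop; production code should add a timeout
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return [[col.get("VarCharValue", "") for col in row["Data"]] for row in rows]

# Example: trials and reported side effects for a given drug (hypothetical schema).
results = run_athena_query(
    "SELECT t.trial_id, t.phase, t.outcome, s.side_effect, s.severity "
    "FROM drug_trials t JOIN side_effects s ON t.trial_id = s.trial_id "
    "WHERE t.drug_name = 'NeuroClear'"
)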
Dataset
We generated synthetic data using Anthropic’s Claude 3.5 Sonnet model, comprising three domains: Pharma R&D, Legal, and Finance. The domains contain structured data stored in SQL tables and unstructured data that is used in domain knowledge bases. The data can be accessed through the following files: R&D, Legal, Finance.
Efforts have been made to align synthetic data within and across domains. For example, clinical trial reports map to each trial and side effects in related tables. Rises and dips in stock prices tend to correlate with patents and lawsuits. However, there might still be minor inconsistencies between data.
Pharma R&D domain
The Pharma R&D domain has three tables: Drugs, Drug Trials, and Side Effects. Each table is queried from Amazon Simple Storage Service (Amazon S3) through Athena. The Drugs table contains information on the company’s available products, therapeutic areas, target conditions, mechanisms of action, development phase, discovery year, and lead scientist. The Drug Trials table contains information on specific trials for each drug, such as phase, dates, number of participants, and outcomes. The Side Effects table contains side effects, frequency, and severity reported from each trial.
The unstructured data for the Pharma R&D domain consists of synthetic clinical trial reports for each trial, which contain more detailed information about the trial design, outcomes, and recommendations.
Legal domain
The Legal domain has unstructured data consisting of patents and lawsuit legal briefs. The patents contain information about invention background, description, and experimental results. The legal briefs contain information about lawsuit court proceedings, outcomes, and analysis.
Finance domain
The Finance domain has two tables: Stock Price and Research Budgets. The Stock Price table is stored in Amazon Redshift and contains PharmaCorp’s historical monthly stock prices and volume. Amazon Redshift is a database optimized for online analytical processing (OLAP), which generally entails analyzing large amounts of data and performing complex analysis, as might be done by analysts looking at historical stock prices. The Research Budgets table is accessed from Amazon S3 through Athena and contains annual budgets for each department.
The data setup showcases how a multi-agent framework can synthesize data from multiple data sources and databases. In practice, data could also be stored in other databases such as Amazon Relational Database Service (Amazon RDS).
Models used
Anthropic’s Claude 3 Sonnet, which has a good balance of intelligence and speed, is used in this multi-agent demonstration. With the multi-agent setup, you can also employ a more intelligent or a smaller, faster model depending on the use case and requirements such as accuracy and latency.
Prerequisites
To deploy this solution, you need the following prerequisites:

An active AWS account.
Access to Amazon Titan Embeddings G1 – Text, Anthropic’s Claude 3 Sonnet, and Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock. For instructions, refer to Add or remove access to Amazon Bedrock foundation models.

Deploy the solution
To deploy the solution resources, we use AWS CloudFormation. The CloudFormation template creates two S3 buckets, two AWS Lambda functions, an Amazon Bedrock agent, an Amazon Bedrock knowledge base, and an Amazon Elastic Compute Cloud (Amazon EC2) instance.
Download the provided CloudFormation template, then complete the following steps to deploy the stack:

Open the AWS CloudFormation console (the preferred AWS Regions are us-west-2 or us-east-1 for the solution).
Choose Stacks in the navigation pane.
Choose Create stack and With new resources (standard).
Select Choose existing template and upload the provided CloudFormation template file.
Enter a stack name, then choose Next.
Leave the stack settings as default and choose Next.
Select the acknowledgement check box and create the stack.

After the stack is complete, you can view the new supervisor agent on the Amazon Bedrock console.
An example of agent collaboration
After you deploy the solution, you can test the communication among agents that help answer complex questions across PharmaCorp’s three divisions. For example, we ask the main agent “How did the results of NeuroClear’s Phase 2 trials affect PharmaCorp’s stock price, patent filings, and potential legal risks?”
This question requires a comprehensive understanding of the relationships between NeuroClear’s clinical trial results, financial impacts, and legal outcomes for PharmaCorp. Let’s see how the multi-agent system addresses this complex query.
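For reference, a question like this can be sent to the deployed supervisor agent programmatically through the Bedrock Agents runtime API, as in the minimal sketch below. The agent ID and alias ID are placeholders to copy from the Amazon Bedrock console after the stack is deployed, and the Region is only an example.

import uuid
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

response = runtime.invoke_agent(
    agentId="AGENT_ID_PLACEHOLDER",
    agentAliasId="AGENT_ALIAS_ID_PLACEHOLDER",
    sessionId=str(uuid.uuid4()),
    inputText=(
        "How did the results of NeuroClear's Phase 2 trials affect PharmaCorp's "
        "stock price, patent filings, and potential legal risks?"
    ),
)

# The completion is streamed back as an event stream of byte chunks.
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)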
The main agent identifies that it needs input from three specialized sub-agents to fully assess how NeuroClear’s clinical trial results might impact PharmaCorp’s legal and financial performance. It breaks down the user’s question into key components and develops a plan to gather detailed insights from each expert. The following is its chain-of-thought reasoning, task breakdown, and sub-agent routing:

This is a complex question that requires information from all three divisions to
provide a comprehensive answer.
I need to:
Get information about NeuroClear’s Phase 2 trial results from Pharmaceutical division
Check stock price movements around the trial period from Finance division
Look for any patent filings and legal implications from Legal division

Then, the main agent asks a question to the R&D sub-agent:

Please provide details about NeuroClear’s Phase 2 clinical trials, including trial dates,
outcomes, success rates, and any significant findings or side effects.

The R&D sub-agent correctly plans and executes its own sequence of steps, which include performing queries and searching its own knowledge base. It responds with the following:

Based on the information from the knowledge base search and Athena database queries,
here are the key details about NeuroClear’s Phase 2 clinical trials:

* There were two Phase 2 trials conducted for NeuroClear, a novel drug for treating
Alzheimer’s disease.
* The first trial ran from June 2019 to May 2020 with 250 participants. It showed a
moderate success rate of 0.65 in slowing cognitive decline compared to placebo.
* The second trial ran from August 2021 to August 2022 with 400 participants. It had
a slightly higher success rate of 0.68.
* Both trials were randomized, double-blind, and placebo-controlled. The dosages tested
were 25mg, 50mg, and 75mg once daily.
* Primary endpoints were change in ADAS-Cog score and secondary endpoints included CDR-SB,
hippocampal volume, and CSF biomarkers.
* Notable side effects observed included insomnia (frequency 0.14-0.15, severity 3),
anxiety (frequency 0.12-0.13, severity 3), tremors (frequency 0.09-0.1, severity 2),
and weight loss (frequency 0.08, severity 2).

The trials showed NeuroClear had moderate efficacy in slowing cognitive decline in
Alzheimer’s patients, with an acceptable safety profile. The results supported
advancing NeuroClear to larger Phase 3 trials to further evaluate its potential
as a treatment for Alzheimer’s disease.

The main agent takes this information and determines its next step:

Now that I have the trial dates and outcomes, I should check the stock price movements
during these periods and any patent/legal activities.

It asks the finance sub-agent the following:

Please provide stock price analysis for two periods:

1. June 2019 – May 2020
2. August 2021 – August 2022

Specifically looking for significant price movements and trading volume changes around
the completion dates of these periods, which correspond to NeuroClear’s Phase 2 trial
results.

The finance sub-agent plans and executes queries on its stock database and responds with the following:

Stock Price Analysis for NeuroClear’s Phase 2 Trial Periods:

Period 1 (June 2019 – May 2020):
* Final closing price: $34.69
* Average price: $31.43
* Price range: $27.99 – $34.69
* Average daily volume: 630,874 shares
* The period ended with moderate trading volume (566,927 shares)

Period 2 (August 2021 – August 2022):
* Final closing price: $53.71
* Average price: $49.79
* Price range: $44.11 – $55.41
* Average daily volume: 640,098 shares
* The period ended with slightly lower volume (491,629 shares)

Key Observations:
* Significant price appreciation between periods, with Period 2 showing ~58%
higher average prices
* Both periods showed similar trading volume patterns, with Period 2 showing
slightly higher average volume
* The stock showed slight negative movement on both completion dates
(-0.20% and -0.77% respectively)

The main agent then asks a question to the legal sub-agent:

Please provide information about:
1. Any patent filings related to NeuroClear, particularly during or after the
Phase 2 trials (2019-2022)
2. Any legal risks or challenges identified related to NeuroClear’s development
or trial results during this period.

The legal sub-agent searches its knowledge base and responds with the following:

Based on the patent search results, two key patent filings related to NeuroClear
during or after its Phase 2 trials from 2019-2022 were:

1. A blood-brain barrier penetration enhancer system using nanoparticles to improve
delivery of neurological drugs like NeuroClear to the brain (Patent ID: PAT004,
filed in 2022). This technology could have been developed to address challenges
with getting NeuroClear across the blood-brain barrier during clinical trials.
2. A targeted neurological receptor blocker designed to selectively inhibit
receptors implicated in neurodegenerative diseases (Patent ID: PAT003, filed in 2021).
This suggests NeuroClear may work by blocking certain neural receptors involved in
conditions like Alzheimer’s disease.

The legal search results identified a few key legal risks and challenges related to
NeuroClear’s development and trial results from 2019-2022:

1. A product liability lawsuit alleging NeuroClear caused cognitive impairment in a
patient, with evidence that PharmaCorp failed to properly warn about this risk despite
internal knowledge of it (Case 2023-CV-006).
2. A class action lawsuit alleging failure to warn about the risk of seizures associated
with NeuroClear use, based on medical records, expert testimony, and internal PharmaCorp
documents showing awareness of this risk.
3. Another product liability case where PharmaCorp was found liable for failing to warn
about risk of liver damage from a different product, suggesting potential issues with
adequate risk disclosure for NeuroClear as well.

Finally, the main agent synthesizes the information it has received into a response to the user’s question:

NeuroClear’s Phase 2 trials had mixed impacts across different aspects of
PharmaCorp’s operations:

Clinical Results:
– Two Phase 2 trials (2019-2020 and 2021-2022) showed moderate success in treating
Alzheimer’s disease
– Success rates of 0.65 and 0.68 respectively
– Notable side effects included insomnia, anxiety, tremors, and weight loss

Stock Performance:
– The stock showed resilience during both trial periods
– Average stock price increased significantly (58%) between the first and second trial periods
– Interestingly, both trial completion dates saw slight negative price movements (-0.20% and -0.77%)

Patents and Legal Implications:
– Two significant patents were filed:
* Blood-brain barrier penetration enhancer system (2022)
* Targeted neurological receptor blocker (2021)
– However, several legal risks emerged:
* Product liability lawsuit regarding cognitive impairment
* Class action lawsuit concerning seizure risks
* Potential issues with risk disclosure based on previous cases

These results suggest that while NeuroClear showed promising clinical potential
and led to valuable intellectual property, it also exposed PharmaCorp to significant
legal risks that could impact future development and commercialization.

Through this example, we can see how multi-agent collaboration enables a comprehensive analysis of complex business questions by using specialized knowledge from different domains. The main agent effectively orchestrates the interaction between sub-agents, synthesizing their insights to provide a holistic answer that considers R&D, financial, and legal aspects of the NeuroClear clinical trials and their potential impacts on PharmaCorp.
Clean up
When you’re done testing the agent, complete the following steps to clean up your AWS environment and avoid unnecessary charges:

Delete the S3 buckets:

On the Amazon S3 console, empty the buckets structured-data-${AWS::AccountId}-${AWS::Region} and unstructured-data-${AWS::AccountId}-${AWS::Region}. Make sure that both of these buckets are empty by deleting the files.
Select each file, choose Delete, and confirm by entering the bucket name.

Delete the Lambda functions:

On the Lambda console, select the CopyDataLambda function.
Choose Delete and confirm the action.
Repeat these steps for the CopyUnstructuredDataLambda function.

Delete the Amazon Bedrock agent:

On the Amazon Bedrock console, choose Agents in the navigation pane.
Select the agent, then choose Delete.

Delete the Amazon Bedrock knowledge base in Bedrock:

On the Amazon Bedrock console, choose Knowledge bases under Builder tools in the navigation pane.
Select the knowledge base and choose Delete.

Delete the EC2 instance:

On the Amazon EC2 console, choose Instances in the navigation pane.
Select the EC2 instance you created, then choose Delete.

Business impact
Implementing this multi-agent system using Amazon Bedrock Agents can provide significant benefits for pharmaceutical companies. By automating data retrieval and analysis across domains, companies can reduce research time and enable faster, data-driven decision-making, especially when domain experts are distributed across different organizational units with limited direct interaction. The system’s ability to provide comprehensive, cross-functional insights in minutes can lead to improved risk mitigation, because potential legal and financial issues can be identified earlier by connecting disparate data points. This automation also allows for more effective allocation of human resources, freeing up experts to focus on high-value tasks rather than routine data analysis.
Our example demonstrates the power of multi-agent systems in pharmaceutical research and development, but the applications of this technology extend far beyond a single use case. For example, biotech companies can accelerate the discovery of cancer biomarkers by having specialist agents extract genomic signals from Amazon Redshift, perform Kaplan-Meier survival analyses, and interpret CT scans in parallel. Large health systems could automatically aggregate patient records, lab results, and trial data to streamline care coordination and flag urgent cases. Travel agencies can orchestrate end‑to‑end itineraries, and firms can manage personalized client communications. For more information on potential applications, see the following posts:

Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents
How agentic AI systems can solve the three most pressing problems in healthcare today
Unlocking complex problem-solving with multi-agent collaboration on Amazon Bedrock
Enabling complex generative AI applications with Amazon Bedrock Agents

Although the potential of multi-agent systems is compelling across these diverse applications, it’s important to understand the practical considerations in implementing such systems. Complex orchestration workflows can drive up inference costs through multiple model calls, increase end‑to‑end latency, amplify testing and maintenance requirements, and introduce operational overhead around rate limits, retries, and inter‑agent or data connection protocols. However, the state of the art is rapidly advancing. New generations of faster, cheaper models can help keep per‑call expenses and latency low, and more powerful models can accomplish tasks in fewer turns. Observability tools offer end‑to‑end tracing and dashboarding for multi‑agent pipelines. Finally, protocols like Anthropic’s Model Context Protocol are beginning to standardize the way agents access data, paving the way for robust multi‑agent ecosystems.
Conclusion
In this post, we explored how a multi-agent generative AI system, implemented with Amazon Bedrock Agents using multi-agent collaboration, addresses data access and analysis challenges across multiple business domains. Through a demo use case with a fictional pharmaceutical company managing data across its different divisions, we showcased how specialized sub-agents tailored to each domain streamline information retrieval and synthesis. Each sub-agent uses domain-optimized models and securely accesses relevant data sources, enabling the organization to generate cross-functional insights.
With this multi-agent architecture, organizations can overcome data silos, enhance collaboration, and achieve efficient, data-driven decision-making while optimizing for cost, latency, and security. Amazon Bedrock Agents with multi-agent collaboration facilitates this setup by providing a secure, scalable framework that manages the collaboration, communication, and task delegation between agents. Explore other demos and workshops about multi-agent collaboration in Amazon Bedrock in the following resources:

Introducing multi-agent collaboration capability for Amazon Bedrock (preview)
Amazon Bedrock multi-agent collaboration workshop
Multi-Agent Collaboration with Amazon Bedrock | Amazon Web Services

About the authors
Justin Ossai is a GenAI Labs Specialist Solutions Architect based in Dallas, TX. He is a highly passionate IT professional with over 15 years of technology experience. He has designed and implemented solutions with on-premises and cloud-based infrastructure for small and enterprise companies.
Michael Hsieh is a Principal AI/ML Specialist Solutions Architect. He works with HCLS customers to advance their ML journey with AWS technologies and his expertise in medical imaging. As a Seattle transplant, he loves exploring the natural beauty the city has to offer, such as the hiking trails, scenic kayaking in SLU, and the sunset at Shilshole Bay.
Shreya Mohanty  is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she partners with customers across industries to design and implement high-impact GenAI-powered solutions. She specializes in translating customer goals into tangible outcomes that drive measurable impact.
Rachel Hanspal is a Deep Learning Architect at AWS Generative AI Innovation Center, specializing in end-to-end GenAI solutions with a focus on frontend architecture and LLM integration. She excels in translating complex business requirements into innovative applications, leveraging expertise in natural language processing, automated visualization, and secure cloud architectures.

Driving cost-efficiency and speed in claims data processing with Amazo …

Amazon operations span the globe, touching the lives of millions of customers, employees, and vendors every day. From the vast logistics network to the cutting-edge technology infrastructure, this scale is a testament to the company’s ability to innovate and serve its customers. With this scale comes a responsibility to manage risks and address claims—whether they involve worker’s compensation, transportation incidents, or other insurance-related matters. Risk managers oversee claims against Amazon throughout their lifecycle. Claim documents from various sources grow as the claims mature, with a single claim consisting of 75 documents on average. Risk managers are required to strictly follow the relevant standard operating procedure (SOP) and review the evolution of dozens of claim aspects to assess severity and to take proper actions, reviewing and addressing each claim fairly and efficiently. But as Amazon continues to grow, how are risk managers empowered to keep up with the growing number of claims?
In December 2024, an internal technology team at Amazon built and implemented an AI-powered solution as applied to data related to claims against the company. This solution generates structured summaries of claims under 500 words across various categories, improving efficiency while maintaining accuracy of the claims review process. However, the team faced challenges with high inference costs and processing times (3–5 minutes per claim), particularly as new documents are added. Because the team plans to expand this technology to other business lines, they explored Amazon Nova Foundation Models as potential alternatives to address cost and latency concerns.
The following graphs show performance compared with latency and performance compared with cost for various foundation models on the claim dataset.

The evaluation of the claims summarization use case proved that Amazon Nova foundation models (FMs) are a strong alternative to other frontier large language models (LLMs), achieving comparable performance with significantly lower cost and higher overall speed. The Amazon Nova Lite model demonstrates strong summarization capabilities in the context of long, diverse, and messy documents.
Solution overview
The summarization pipeline begins by processing raw claim data using AWS Glue jobs. It stores data into intermediate Amazon Simple Storage Service (Amazon S3) buckets, and uses Amazon Simple Queue Service (Amazon SQS) to manage summarization jobs. Claim summaries are generated by AWS Lambda using foundation models hosted in Amazon Bedrock. We first filter the irrelevant claim data using an LLM-based classification model based on Nova Lite and summarize only the relevant claim data to reduce the context window. Considering relevance and summarization requires different levels of intelligence, we select the appropriate models to optimize cost while maintaining performance. Because claims are summarized upon arrival of new information, we also cache the intermediate results and summaries using Amazon DynamoDB to reduce duplicate inference and reduce cost. The following image shows a high-level architecture of the claim summarization use case solution.
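Complementing that overview, the following is a hedged sketch of what the Lambda summarization step could look like: check a DynamoDB cache keyed by a document hash and call Amazon Nova Lite through the Bedrock Converse API only for uncached content. The model ID, table name, key schema, and prompt are illustrative assumptions, not the team’s production configuration.

import hashlib
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
cache = boto3.resource("dynamodb").Table("claim-summary-cache")  # hypothetical table name/schema

MODEL_ID = "amazon.nova-lite-v1:0"  # assumed Nova Lite model identifier

def summarize_claim_document(claim_id, document_text):
    """Return a cached summary if available; otherwise call Nova Lite and cache the result."""
    doc_hash = hashlib.sha256(document_text.encode()).hexdigest()
    cached = cache.get_item(Key={"claim_id": claim_id, "doc_hash": doc_hash}).get("Item")
    if cached:
        return cached["summary"]  # reuse prior inference to avoid duplicate cost

    result = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": f"Summarize the relevant claim facts in under 500 words:\n\n{document_text}"}],
        }],
        inferenceConfig={"maxTokens": 700, "temperature": 0.2},
    )
    summary = result["output"]["message"]["content"][0]["text"]
    cache.put_item(Item={"claim_id": claim_id, "doc_hash": doc_hash, "summary": summary})
    return summary

def handler(event, context):
    """Lambda entry point triggered by an SQS summarization job (sketch)."""
    for record in event.get("Records", []):
        job = json.loads(record["body"])
        summarize_claim_document(job["claim_id"], job["document_text"])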

Although the Amazon Nova team has published performance benchmarks across several different categories, claims summarization is a unique use case given its diversity of inputs and long context windows. This prompted the technology team owning the claims solution to investigate further with their own benchmarking study. To assess the performance, speed, and cost of Amazon Nova models for their specific use case, the team curated a benchmark dataset consisting of 95 pairs of claim documents and verified aspect summaries. Claim documents range from 1,000 to 60,000 words, with most being around 13,000 words (median 10,100). The verified summaries of these documents are usually brief, containing fewer than 100 words. Inputs to the models include diverse types of documents and summaries that cover a variety of aspects in production.
According to benchmark tests, the team observed that Amazon Nova Lite is twice as fast and costs 98% less than their current model. Amazon Nova Micro is even more efficient, running four times faster and costing 99% less. The substantial cost-effectiveness and latency improvements offer more flexibility for designing a sophisticated model and scaling up test compute to improve summary quality. Moreover, the team also observed that the latency gap between Amazon Nova models and the next best model widened for long context windows and long output, making Amazon Nova a stronger alternative in the case of long documents while optimizing for latency. Additionally, the team performed this benchmarking study using the same prompt as the current in-production solution with seamless prompt portability. Despite this, Amazon Nova models successfully followed instructions and generated the desired format for post-processing. Based on the benchmarking and evaluation results, the team used Amazon Nova Lite for classification and summarization use cases.
Conclusion
In this post, we shared how an internal technology team at Amazon evaluated Amazon Nova models, resulting in notable improvements in inference speed and cost-efficiency. Looking back on the initiative, the team identified several critical factors that offer key advantages:

Access to a diverse model portfolio – The availability of a wide array of models, including compact yet powerful options such as Amazon Nova Micro and Amazon Nova Lite, enabled the team to quickly experiment with and integrate the most suitable models for their needs.
Scalability and flexibility – The cost and latency improvements of the Amazon Nova models allow for more flexibility in designing sophisticated models and scaling up test compute to improve summary quality. This scalability is particularly valuable for organizations handling large volumes of data or complex workflows.
Ease of integration and migration – The models’ ability to follow instructions and generate outputs in the desired format simplifies post-processing and integration into existing systems.

If your organization has a similar use case of large document processing that is costly and time-consuming, the above evaluation exercise shows that Amazon Nova Lite and Amazon Nova Micro can be game-changing. These models excel at handling large volumes of diverse documents and long context windows—perfect for complex data processing environments. What makes this particularly compelling is the models’ ability to maintain high performance while significantly reducing operational costs. It’s important to iterate over new models for all three pillars—quality, cost, and speed. Benchmark these models with your own use case and datasets.
You can get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About the authors
Aitzaz Ahmad is an Applied Science Manager at Amazon, where he leads a team of scientists building various applications of machine learning and generative AI in finance. His research interests are in natural language processing (NLP), generative AI, and LLM agents. He received his PhD in electrical engineering from Texas A&M University.
Stephen Lau is a Senior Manager of Software Development at Amazon, where he leads teams of scientists and engineers. His team develops powerful fraud detection and prevention applications, saving Amazon billions annually. They also build Treasury applications that optimize Amazon’s global liquidity while managing risks, significantly impacting the financial security and efficiency of Amazon.
Yong Xie is an applied scientist in Amazon FinTech. He focuses on developing large language models and generative AI applications for finance.
Kristen Henkels is a Sr. Product Manager – Technical in Amazon FinTech, where she focuses on helping internal teams improve their productivity by leveraging ML and AI solutions. She holds an MBA from Columbia Business School and is passionate about empowering teams with the right technology to enable strategic, high-value work.
Shivansh Singh is a Principal Solutions Architect at Amazon. He is passionate about driving business outcomes through innovative, cost-effective and resilient solutions, with a focus on machine learning, generative AI, and serverless technologies. He is a technical leader and strategic advisor to large-scale games, media, and entertainment customers. He has over 16 years of experience transforming businesses through technological innovations and building large-scale enterprise solutions.
Dushan Tharmal is a Principal Product Manager – Technical on the Amazon Artificial General Intelligence team, responsible for the Amazon Nova Foundation Models. He earned his bachelor’s in mathematics at the University of Waterloo and has over 10 years of technical product leadership experience across financial services and loyalty. In his spare time, he enjoys wine, hikes, and philosophy.
Anupam Dewan is a Senior Solutions Architect with a passion for generative AI and its applications in real life. He and his team enable Amazon builders who build customer-facing applications using generative AI. He lives in the Seattle area, and outside of work, he loves to go hiking and enjoy nature.

A Coding Implementation for Creating, Annotating, and Visualizing Complex Biological Knowledge Graphs Using PyBEL

In this tutorial, we explore how to leverage the PyBEL ecosystem to construct and analyze rich biological knowledge graphs directly within Google Colab. We begin by installing all necessary packages, including PyBEL, NetworkX, Matplotlib, Seaborn, and Pandas. We then demonstrate how to define proteins, processes, and modifications using the PyBEL DSL. From there, we guide you through the creation of an Alzheimer’s disease-related pathway, showcasing how to encode causal relationships, protein–protein interactions, and phosphorylation events. Alongside graph construction, we introduce advanced network analyses, including centrality measures, node classification, and subgraph extraction, as well as techniques for extracting citation and evidence data. By the end of this section, you will have a fully annotated BEL graph ready for downstream visualization and enrichment analyses, laying a solid foundation for interactive biological knowledge exploration.

!pip install pybel pybel-tools networkx matplotlib seaborn pandas -q

import pybel
import pybel.dsl as dsl
from pybel import BELGraph
from pybel.io import to_pickle, from_pickle
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

print("PyBEL Advanced Tutorial: Biological Expression Language Ecosystem")
print("=" * 65)

We begin by installing PyBEL and its dependencies directly in Colab, ensuring that all necessary libraries, NetworkX, Matplotlib, Seaborn, and Pandas, are available for our analysis. Once installed, we import the core modules and suppress warnings to keep our notebook clean and focused on the results.

print("\n1. Building a Biological Knowledge Graph")
print("-" * 40)

graph = BELGraph(
    name="Alzheimer's Disease Pathway",
    version="1.0.0",
    description="Example pathway showing protein interactions in AD",
    authors="PyBEL Tutorial"
)

app = dsl.Protein(name="APP", namespace="HGNC")
abeta = dsl.Protein(name="Abeta", namespace="CHEBI")
tau = dsl.Protein(name="MAPT", namespace="HGNC")
gsk3b = dsl.Protein(name="GSK3B", namespace="HGNC")
inflammation = dsl.BiologicalProcess(name="inflammatory response", namespace="GO")
apoptosis = dsl.BiologicalProcess(name="apoptotic process", namespace="GO")

graph.add_increases(app, abeta, citation="PMID:12345678", evidence="APP cleavage produces Abeta")
graph.add_increases(abeta, inflammation, citation="PMID:87654321", evidence="Abeta triggers neuroinflammation")

tau_phosphorylated = dsl.Protein(name="MAPT", namespace="HGNC",
                                 variants=[dsl.ProteinModification("Ph")])
graph.add_increases(gsk3b, tau_phosphorylated, citation="PMID:11111111", evidence="GSK3B phosphorylates tau")
graph.add_increases(tau_phosphorylated, apoptosis, citation="PMID:22222222", evidence="Hyperphosphorylated tau causes cell death")
graph.add_increases(inflammation, apoptosis, citation="PMID:33333333", evidence="Inflammation promotes apoptosis")

graph.add_association(abeta, tau, citation="PMID:44444444", evidence="Abeta and tau interact synergistically")

print(f"Created BEL graph with {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges")

We initialize a BELGraph with metadata for an Alzheimer’s disease pathway and define proteins and processes using the PyBEL DSL. By adding causal relationships, protein modifications, and associations, we construct a comprehensive network that captures key molecular interactions.

print("\n2. Advanced Network Analysis")
print("-" * 30)

degree_centrality = nx.degree_centrality(graph)
betweenness_centrality = nx.betweenness_centrality(graph)
closeness_centrality = nx.closeness_centrality(graph)

most_central = max(degree_centrality, key=degree_centrality.get)
print(f"Most connected node: {most_central}")
print(f"Degree centrality: {degree_centrality[most_central]:.3f}")
We compute degree, betweenness, and closeness centralities to quantify each node’s importance within the graph. By identifying the most connected nodes, we gain insight into potential hubs that may drive disease mechanisms.

print("\n3. Biological Entity Classification")
print("-" * 35)

node_types = Counter()
for node in graph.nodes():
    node_types[node.function] += 1

print("Node distribution:")
for func, count in node_types.items():
    print(f"  {func}: {count}")

We classify each node by its function, such as Protein or BiologicalProcess, and tally their counts. This breakdown helps us understand the composition of our network at a glance.

print("\n4. Pathway Analysis")
print("-" * 20)

proteins = [node for node in graph.nodes() if node.function == 'Protein']
processes = [node for node in graph.nodes() if node.function == 'BiologicalProcess']

print(f"Proteins in pathway: {len(proteins)}")
print(f"Biological processes: {len(processes)}")

edge_types = Counter()
for u, v, data in graph.edges(data=True):
    edge_types[data.get('relation')] += 1

print("\nRelationship types:")
for rel, count in edge_types.items():
    print(f"  {rel}: {count}")

We separate all proteins and processes to measure the pathway’s scope and complexity. Counting the different relationship types further reveals which interactions, like increases or associations, dominate our model.

print("\n5. Literature Evidence Analysis")
print("-" * 32)

citations = []
evidences = []
for _, _, data in graph.edges(data=True):
    if 'citation' in data:
        citations.append(data['citation'])
    if 'evidence' in data:
        evidences.append(data['evidence'])

print(f"Total citations: {len(citations)}")
print(f"Unique citations: {len(set(citations))}")
print(f"Evidence statements: {len(evidences)}")

We extract citation identifiers and evidence strings from each edge to evaluate our graph’s grounding in published research. Summarizing total and unique citations allows us to assess the breadth of supporting literature.

print("\n6. Subgraph Analysis")
print("-" * 22)

inflammation_nodes = [inflammation]
inflammation_neighbors = list(graph.predecessors(inflammation)) + list(graph.successors(inflammation))
inflammation_subgraph = graph.subgraph(inflammation_nodes + inflammation_neighbors)

print(f"Inflammation subgraph: {inflammation_subgraph.number_of_nodes()} nodes, {inflammation_subgraph.number_of_edges()} edges")

We isolate the inflammation subgraph by collecting its direct neighbors, yielding a focused view of inflammatory crosstalk. This targeted subnetwork highlights how inflammation interfaces with other disease processes.

print("\n7. Advanced Graph Querying")
print("-" * 28)

try:
    paths = list(nx.all_simple_paths(graph, app, apoptosis, cutoff=3))
    print(f"Paths from APP to apoptosis: {len(paths)}")
    if paths:
        print(f"Shortest path length: {len(paths[0])-1}")
except nx.NetworkXNoPath:
    print("No paths found between APP and apoptosis")

apoptosis_inducers = list(graph.predecessors(apoptosis))
print(f"Factors that increase apoptosis: {len(apoptosis_inducers)}")

We enumerate simple paths between APP and apoptosis to explore mechanistic routes and identify key intermediates. Listing all predecessors of apoptosis also shows us which factors may trigger cell death.

print("\n8. Data Export and Visualization")
print("-" * 35)

adj_matrix = nx.adjacency_matrix(graph)
node_labels = [str(node) for node in graph.nodes()]

plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
pos = nx.spring_layout(graph, k=2, iterations=50)
nx.draw(graph, pos, with_labels=False, node_color='lightblue',
        node_size=1000, font_size=8, font_weight='bold')
plt.title("BEL Network Graph")

plt.subplot(2, 2, 2)
centralities = list(degree_centrality.values())
plt.hist(centralities, bins=10, alpha=0.7, color='green')
plt.title("Degree Centrality Distribution")
plt.xlabel("Centrality")
plt.ylabel("Frequency")

plt.subplot(2, 2, 3)
functions = list(node_types.keys())
counts = list(node_types.values())
plt.pie(counts, labels=functions, autopct='%1.1f%%', startangle=90)
plt.title("Node Type Distribution")

plt.subplot(2, 2, 4)
relations = list(edge_types.keys())
rel_counts = list(edge_types.values())
plt.bar(relations, rel_counts, color='orange', alpha=0.7)
plt.title("Relationship Types")
plt.xlabel("Relation")
plt.ylabel("Count")
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

We prepare adjacency matrices and node labels for downstream use and generate a multi-panel figure showing the network structure, centrality distributions, node-type proportions, and edge-type counts. These visualizations bring our BEL graph to life, supporting a deeper biological interpretation.

In this tutorial, we have demonstrated the power and flexibility of PyBEL for modeling complex biological systems. We showed how easily one can construct a curated white-box graph of Alzheimer’s disease interactions, perform network-level analyses to identify key hub nodes, and extract biologically meaningful subgraphs for focused study. We also covered essential practices for literature evidence mining and prepared data structures for compelling visualizations. As a next step, we encourage you to extend this framework to your own pathways, integrating additional omics data, running enrichment tests, or coupling the graph with machine-learning workflows.
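As a starting point for such an extension, here is a minimal sketch using PyBEL’s DSL. The node names, namespaces, and the citation identifier are illustrative placeholders, not part of the tutorial’s curated graph:

from pybel import BELGraph
from pybel.dsl import BiologicalProcess, Protein

# Illustrative extension: a small, separate graph with one new relation
extended = BELGraph(name="Extended pathway sketch", version="0.1.0")

gsk3b = Protein(namespace="HGNC", name="GSK3B")  # hypothetical node
tau_phosphorylation = BiologicalProcess(namespace="GO", name="tau protein phosphorylation")

# add_increases stores the relation together with its literature grounding
extended.add_increases(
    gsk3b,
    tau_phosphorylation,
    citation="1234567",  # placeholder PubMed ID
    evidence="Example evidence sentence supporting the relation.",
)

print(extended.number_of_nodes(), extended.number_of_edges())

From there, the same analyses shown above (centrality, subgraphs, path queries) apply unchanged, because a BELGraph is a standard NetworkX multidigraph.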

Check out the Codes here. All credit for this research goes to the researchers of this project.
The post A Coding Implementation for Creating, Annotating, and Visualizing Complex Biological Knowledge Graphs Using PyBEL appeared first on MarkTechPost.

BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for …

Beijing Academy of Artificial Intelligence (BAAI) introduces OmniGen2, a next-generation, open-source multimodal generative model. Expanding on its predecessor OmniGen, the new architecture unifies text-to-image generation, image editing, and subject-driven generation within a single transformer framework. It innovates by decoupling the modeling of text and image generation, incorporating a reflective training mechanism, and implementing a purpose-built benchmark—OmniContext—to evaluate contextual consistency.

A Decoupled Multimodal Architecture

Unlike prior models that use shared parameters across text and image modalities, OmniGen2 introduces two distinct pathways: an autoregressive transformer for text generation and a diffusion-based transformer for image synthesis. It also employs a novel positioning strategy named Omni-RoPE, which allows flexible handling of sequences, spatial coordinates, and modality distinctions, enabling high-fidelity image generation and editing.

To preserve the pretrained text generation ability of the underlying MLLM (based on Qwen2.5-VL-3B), OmniGen2 feeds VAE-derived features only to the diffusion pathway. This avoids compromising the model’s text understanding and generation capabilities while maintaining rich visual representation for the image synthesis module.

Reflection Mechanism for Iterative Generation

One of the standout features in OmniGen2 is the reflection mechanism. By integrating feedback loops during training, the model is capable of analyzing its generated outputs, identifying inconsistencies, and proposing refinements. This process mimics test-time self-correction and significantly enhances instruction-following accuracy and visual coherence, especially for nuanced tasks like modifying color, object count, or positioning.

The reflection dataset was constructed using multi-turn feedback, enabling the model to learn how to revise and terminate generation based on content evaluation. This mechanism is particularly useful in bridging the quality gap between open-source and commercial models.

OmniContext Benchmark: Evaluating Contextual Consistency

To rigorously assess in-context generation, the team introduces OmniContext, a benchmark comprising three primary task types: SINGLE, MULTIPLE, and SCENE, across Character, Object, and Scene categories. OmniGen2 demonstrates state-of-the-art performance among open-source models in this domain, scoring 7.18 overall—outperforming other leading models like BAGEL and UniWorld-V1.

The evaluation uses three core metrics: Prompt Following (PF), Subject Consistency (SC), and Overall Score (geometric mean), each validated through GPT-4.1-based reasoning. This benchmarking framework emphasizes not just visual realism but semantic alignment with prompts and cross-image consistency.
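For intuition, and assuming the overall score is simply the geometric mean of the PF and SC metrics described above (the benchmark’s exact aggregation may differ), the computation reduces to a one-liner; the numbers below are made up for illustration:

from math import sqrt

def overall_score(prompt_following: float, subject_consistency: float) -> float:
    # Geometric mean of the two per-sample metrics (assumed aggregation)
    return sqrt(prompt_following * subject_consistency)

print(round(overall_score(7.5, 6.9), 2))  # illustrative values on a 0-10 scale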

Data Pipeline and Training Corpus

OmniGen2 was trained on 140M T2I samples and 10M proprietary images, supplemented by meticulously curated datasets for in-context generation and editing. These datasets were constructed using a video-based pipeline that extracts semantically consistent frame pairs and automatically generates instructions using Qwen2.5-VL models. The resulting annotations cover fine-grained image manipulations, motion variations, and compositional changes.

For training, the MLLM parameters remain largely frozen to retain general understanding, while the diffusion module is trained from scratch and optimized for joint visual-textual attention. A special token “<|img|>” triggers image generation within output sequences, streamlining the multimodal synthesis process.
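As a conceptual sketch only (the real decoding loop lives inside the model and is not reproduced here), the special token acts like a switch in the output stream. The toy routine below merely illustrates the idea of detecting the trigger and handing off to the image pathway:

IMG_TRIGGER = "<|img|>"  # special token described above

def split_at_image_trigger(tokens):
    # Toy illustration: collect text tokens until the trigger appears,
    # then signal that the diffusion pathway should take over.
    text_tokens = []
    for tok in tokens:
        if tok == IMG_TRIGGER:
            return " ".join(text_tokens), True
        text_tokens.append(tok)
    return " ".join(text_tokens), False

print(split_at_image_trigger(["A", "red", "vintage", "car", "<|img|>"]))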

Performance Across Tasks

OmniGen2 delivers strong results across multiple domains:

Text-to-Image (T2I): Achieves a 0.86 score on GenEval and 83.57 on DPG-Bench.

Image Editing: Outperforms open-source baselines with high semantic consistency (SC=7.16).

In-Context Generation: Sets new benchmarks in OmniContext with 7.81 (SINGLE), 7.23 (MULTIPLE), and 6.71 (SCENE) task scores.

Reflection: Demonstrates effective revision of failed generations, with promising correction accuracy and termination behavior.

Conclusion

OmniGen2 is a robust and efficient multimodal generative system that advances unified modeling through architectural separation, high-quality data pipelines, and an integrated reflection mechanism. By open-sourcing models, datasets, and code, the project lays a solid foundation for future research in controllable, consistent image-text generation. Upcoming improvements may focus on reinforcement learning for reflection refinement, expanded multilingual support, and greater robustness to low-quality inputs.

Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project.
The post BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for Multimodal AI appeared first on MarkTechPost.

ByteDance Researchers Introduce ProtoReasoning: Enhancing LLM Generali …

Why Cross-Domain Reasoning Matters in Large Language Models (LLMs)

Recent breakthroughs in large reasoning models (LRMs), especially those trained using long chain-of-thought (CoT) techniques, show they can generalize impressively across different domains. Interestingly, models trained on tasks such as math or coding often perform well in unrelated areas, like logical puzzles or creative writing. However, what enables this flexibility isn’t fully clear. One possible explanation is that these models learn core reasoning patterns, known as abstract reasoning prototypes, which cut across domains. These shared cognitive structures enable the model to focus less on how problems are presented and more on the similar thought processes required to solve them, allowing for broader transfer.

From CoT to RL: A Shift in How LLMs Learn to Reason

Recent progress in large language model reasoning has shifted from simple CoT and supervised fine-tuning to reinforcement learning (RL). Models like DeepSeek-R1 and Seed-Thinking-v1.5 have enhanced long CoT reasoning through mathematical problems, logic tasks, and code execution. These models utilize RL techniques guided by verifiable rewards, such as accuracy against ground-truth answers, to explore complex reasoning paths. This approach enables models to learn from errors, break down complex problems, and refine solutions through iteration. In contrast to past methods, this work introduces the concept of “reasoning prototypes” to better understand the core thinking patterns that enable models to generalize across vastly different domains.

ProtoReasoning Framework: Structured Reasoning with Prolog and PDDL

Researchers from ByteDance Seed and Shanghai Jiao Tong University have developed ProtoReasoning, a framework designed to enhance reasoning in large language models by utilizing structured prototype representations, such as Prolog and PDDL. This system includes an automated pipeline to translate problems into these formats, a reliable verification setup using interpreters, and scalable problem synthesis without manual labeling. The models trained on these prototypes demonstrated notable improvements across various tasks, including logical reasoning (+4.7%), planning (+6.3%), general reasoning (+4.0%), and math (+1.0%). Crucially, training within this structured “prototype space” led to better generalization across similar tasks, supporting the idea that abstract reasoning patterns enhance cross-domain performance.

Architecture Overview: Prototype Constructor and Verifier System

The ProtoReasoning framework boosts reasoning in LLMs by using structured prototypes, Prolog for logic, and PDDL for planning. It includes two core modules: a Prototype Constructor that translates natural language problems into formal representations, and a Verification System that checks solution correctness. For Prolog, a four-step pipeline generates diverse logic problems, which are verified using SWI-Prolog. For planning, tasks such as plan generation, completion, and reordering are built using PDDL, with correctness checked via the VAL validator. The training process includes teacher model distillation for reasoning paths, difficulty-based sampling, and filtering to ensure only high-quality data fine-tunes the model for robust generalization.
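To make the verification idea concrete, here is a small, hypothetical sketch using the pyswip bindings to SWI-Prolog. It is not the paper’s actual pipeline, and the facts and goal are invented purely for illustration:

from pyswip import Prolog  # requires a local SWI-Prolog installation

prolog = Prolog()

# Invented toy problem expressed as Prolog facts (illustrative only)
prolog.assertz("parent(tom, bob)")
prolog.assertz("parent(bob, ann)")

# A model's proposed conclusion, phrased as a Prolog goal, can be checked
# simply by running it: an empty result list means the goal is not provable.
proposed_goal = "parent(tom, X), parent(X, ann)"  # "tom is a grandparent of ann"
verified = bool(list(prolog.query(proposed_goal)))
print(f"Goal '{proposed_goal}' verified: {verified}")

The appeal of this setup is that the interpreter, not another LLM, decides whether a candidate answer is correct, which is what makes reward signals in the prototype space reliable.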

Evaluations Show Measurable Improvements in Reasoning and Planning

The ProtoReasoning framework was evaluated through experiments using a 150B-parameter Mixture-of-Experts model (15B active), trained on a curated set of high-quality Prolog and PDDL samples. Results showed consistent improvements across logical reasoning, planning, and general benchmarks, including MMLU and AIME 2024. A key ablation study compared Prolog-based training with natural-language (NL) versions of the same problems on matched datasets. Both formats significantly outperformed the baseline, with Prolog-based training achieving performance nearly equal to the NL version, suggesting that training in the structured prototype space transfers to natural language tasks. However, explicit reasoning (e.g., chain-of-thought) remains crucial, and categories with few samples showed weaker gains due to insufficient data.

Key Findings and Theoretical Implications of Reasoning Prototypes

In conclusion, ProtoReasoning is a framework built on the idea that abstract reasoning prototypes, such as Prolog for logic and PDDL for planning, enable large language models to generalize across domains. By training models on these structured representations, the study observed notable improvements in logical reasoning, planning, and general problem-solving tasks. The results support the hypothesis that shared reasoning patterns across domains facilitate knowledge transfer in models. While the empirical results are promising, the exact nature of reasoning prototypes remains theoretically underexplored. Future work will aim to formalize these concepts mathematically and validate findings using open-source models and datasets.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post ByteDance Researchers Introduce ProtoReasoning: Enhancing LLM Generalization via Logic-Based Prototypes appeared first on MarkTechPost.

Power Your LLM Training and Evaluation with the New SageMaker AI Gener …

Today we are excited to introduce the Text Ranking and Question and Answer UI templates to SageMaker AI customers. The Text Ranking template enables human annotators to rank multiple responses from a large language model (LLM) based on custom criteria, such as relevance, clarity, or factual accuracy. This ranked feedback provides critical insights that help refine models through Reinforcement Learning from Human Feedback (RLHF), generating responses that better align with human preferences. The Question and Answer template facilitates the creation of high-quality Q&A pairs based on provided text passages. These pairs act as demonstration data for Supervised Fine-Tuning (SFT), teaching models how to respond to similar inputs accurately.
In this blog post, we’ll walk you through how to set up these templates in SageMaker to create high-quality datasets for training your large language models. Let’s explore how you can leverage these new tools.
Text Ranking
The Text Ranking template allows annotators to rank multiple text responses generated by a large language model based on customizable criteria such as relevance, clarity, or correctness. Annotators are presented with a prompt and several model-generated responses, which they rank according to guidelines specific to your use case. The ranked data is captured in a structured format, detailing the re-ranked indices for each criterion, such as “clarity” or “inclusivity.” This information is invaluable for fine-tuning models using RLHF, aligning the model outputs more closely with human preferences. In addition, this template is also highly effective for evaluating the quality of LLM outputs by allowing you to see how well responses match the intended criteria.
Setting Up in the SageMaker AI Console
A new Generative AI category has been added under Task Type in the SageMaker AI console, allowing you to select these templates. To configure the labeling job using the AWS Management Console, complete the following steps:

On the SageMaker AI console, under Ground Truth in the navigation pane, choose Labeling job.
Choose Create labeling job.
Specify your input manifest location and output path. To configure the Text Ranking input file, use the Manual Data Setup under Create Labeling Job and input a JSON file with the prompt stored under the source field, while the list of model responses is placed under the responses field. Text Ranking does not support Automated Data Setup.

Here is an example of our input manifest file:
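As a rough illustration (your actual schema may include additional fields), a single JSON Lines record with the prompt under source and the candidate model responses under responses could look like this:

{"source": "Summarize the main benefits of regular exercise.", "responses": ["Response A text ...", "Response B text ...", "Response C text ..."]}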

Upload this input manifest file into your S3 location and provide the S3 path to this file under Input dataset location.

Select Generative AI as the task type and choose the Text Ranking UI.
Choose Next.
Enter your labeling instructions. Enter the dimensions you want to include in the Ranking dimensions section. For example, you might use dimensions such as Helpfulness and Clarity, but you can add, remove, or customize these based on your specific needs by clicking the “+” button to add new dimensions or the trash icon to remove them. Additionally, you have the option to allow tie rankings by selecting the checkbox. This option enables annotators to rank two or more responses equally if they believe the responses are of the same quality for a particular dimension.
Choose Preview to display the UI template for review.
Choose Create to create the labeling job.

When the annotators submit their evaluations, their responses are saved directly to your specified S3 bucket. The output manifest file includes the original data fields and a worker-response-ref that points to a worker response file in S3. This worker response file contains the ranked responses for each specified dimension, which can be used to fine-tune or evaluate your model’s outputs. If multiple annotators have worked on the same data object, their individual annotations are included within this file under an answers key, which is an array of responses. Each response includes the annotator’s input and metadata such as acceptance time, submission time, and worker ID. Here is an example of the output json file containing the annotations:
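The field names below are illustrative rather than authoritative, but a worker response file following the structure described above (an answers array holding each annotator’s rankings plus metadata) might look roughly like this:

{
  "answers": [
    {
      "answerContent": {
        "helpfulness": {"rankedIndices": [2, 1, 3]},
        "clarity": {"rankedIndices": [1, 3, 2]}
      },
      "acceptanceTime": "2025-01-15T10:02:11Z",
      "submissionTime": "2025-01-15T10:05:42Z",
      "workerId": "private.us-east-1.0123456789abcdef"
    }
  ]
}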

Question and Answer
The Question and Answer template allows you to create datasets for Supervised Fine-Tuning (SFT) by generating question-and-answer pairs from text passages. Annotators read the provided text and create relevant questions and corresponding answers. This process acts as a source of demonstration data, guiding the model on how to handle similar tasks. The template supports flexible input, letting annotators reference entire passages or specific sections of text for more targeted Q&A. A color-coded matching feature visually links questions to the relevant sections, helping streamline the annotation process. By using these Q&A pairs, you enhance the model’s ability to follow instructions and respond accurately to real-world inputs.
Setting Up in the SageMaker AI Console
The process for setting up a labeling job with the Question and Answer template follows similar steps as the Text Ranking template. However, there are differences in how you configure the input file and select the appropriate UI template to suit the Q&A task.

On the SageMaker AI console, under Ground Truth in the navigation pane, choose Labeling job.
Choose Create labeling job.
Specify your input manifest location and output path. To configure the Question and Answer input file, use the Manual Data Setup and upload a JSON file where the source field contains the text passage. Annotators will use this text to generate questions and answers. Note that you can load the text from a .txt or .csv file and use Ground Truth’s Automated Data Setup to convert it to the required JSON format.

Here is an example of an input manifest file:
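Again as an illustration, assuming only the source field described above, a single JSON Lines record could look like this:

{"source": "Amazon Transcribe is an automatic speech recognition service that converts spoken audio into text transcripts ..."}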

Upload this input manifest file into your S3 location and provide the S3 path to this file under Input dataset location.

Select Generative AI as the task type and choose the Question and Answer UI
Choose Next.
Enter your labeling instructions. You can configure additional settings to control the task. You can specify the minimum and maximum number of Q&A pairs that workers should generate from the provided text passage. Additionally, you can define the minimum and maximum word counts for both the question and answer fields, so that the responses fit your requirements. You can also add optional question tags to categorize the question and answer pairs. For example, you might include tags such as “What,” “How,” or “Why” to guide the annotators in their task. If these predefined tags are insufficient, you have the option to allow workers to enter their own custom tags by enabling the Allow workers to specify custom tags feature. This flexibility facilitates annotations that meet the specific needs of your use case.
Once these settings are configured, you can choose to Preview the UI to verify that it meets your needs before proceeding.
Choose Create to create the labeling job.

When annotators submit their work, their responses are saved directly to your specified S3 bucket. The output manifest file contains the original data fields along with a worker-response-ref that points to the worker response file in S3. This worker response file includes the detailed annotations provided by the workers, such as the ranked responses or question-and-answer pairs generated for each task.
Here’s an example of what the output might look like:

CreateLabelingJob API
In addition to creating these labeling jobs through the Amazon SageMaker AI console, customers can also use the CreateLabelingJob API to set up Text Ranking and Question and Answer jobs programmatically. This method provides more flexibility for automation and integration into existing workflows. Using the API, you can define job configurations, input manifests, and worker task templates, and monitor the job’s progress directly from your application or system.
For a step-by-step guide on how to implement this, you can refer to the following notebooks, which walk through the entire process of setting up Human-in-the-Loop (HITL) workflows for Reinforcement Learning from Human Feedback (RLHF) using both the Text Ranking and Question and Answer templates. These notebooks will guide you through setting up the required Ground Truth pre-requisites, downloading sample JSON files with prompts and responses, converting them to Ground Truth input manifests, creating worker task templates, and monitoring the labeling jobs. They also cover post-processing the results to create a consolidated dataset with ranked responses.

Text Ranking
Question and Answer pairs
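As a complement to the console steps and the notebooks above, a minimal boto3 sketch of the programmatic call might look like the following. Every name, ARN, and S3 URI is a placeholder, the UI template reference is hypothetical, and additional HumanTaskConfig fields may be required for your account and task type; refer to the linked notebooks for a complete, working configuration.

import boto3

sagemaker = boto3.client("sagemaker")

# Skeleton only: all values below are placeholders for illustration.
sagemaker.create_labeling_job(
    LabelingJobName="text-ranking-example-job",
    LabelAttributeName="ranking",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://your-bucket/input/manifest.jsonl"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://your-bucket/output/"},
    RoleArn="arn:aws:iam::111122223333:role/YourGroundTruthExecutionRole",
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/your-team",
        "UiConfig": {"HumanTaskUiArn": "arn:aws:sagemaker:us-east-1:111122223333:human-task-ui/your-text-ranking-ui"},
        "TaskTitle": "Rank LLM responses",
        "TaskDescription": "Rank the model responses for each prompt along the configured dimensions.",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
        # Additional fields (for example, annotation consolidation settings)
        # may be required depending on the task type.
    },
)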

Conclusion
With the introduction of the Text Ranking and Question and Answer templates, Amazon SageMaker AI empowers customers to generate high-quality datasets for training large language models more efficiently. These built-in capabilities simplify the process of fine-tuning models for specific tasks and aligning their outputs with human preferences, whether through supervised fine-tuning or reinforcement learning from human feedback. By leveraging these templates, you can better evaluate and refine your models to meet the needs of your specific application, helping achieve more accurate, reliable, and user-aligned outputs. Whether you’re creating datasets for training or evaluating your models’ outputs, SageMaker AI provides the tools you need to succeed in building state-of-the-art generative AI solutions. To begin creating fine-tuning datasets with the new templates:

Visit the Amazon SageMaker AI console.
Refer to the SageMaker AI APIs for programmatic access.
Use the AWS CLI for command-line interactions.

About the authors
Sundar Raghavan is a Generative AI Specialist Solutions Architect at AWS, helping customers use Amazon Bedrock and next-generation AWS services to design, build and deploy AI agents and scalable generative AI applications. In his free time, Sundar loves exploring new places, sampling local eateries and embracing the great outdoors.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.
Niharika Jayanti is a Front-End Engineer at Amazon, where she designs and develops user interfaces to delight customers. She contributed to the successful launch of LLM evaluation tools on Amazon Bedrock and Amazon SageMaker Unified Studio. Outside of work, Niharika enjoys swimming, hitting the gym and crocheting.

Muyun Yan is a Senior Software Engineer at Amazon Web Services (AWS) SageMaker AI team. With over 6 years at AWS, she specializes in developing machine learning-based labeling platforms. Her work focuses on building and deploying innovative software applications for labeling solutions, enabling customers to access cutting-edge labeling capabilities. Muyun holds a M.S. in Computer Engineering from Boston University.
Kavya Kotra is a Software Engineer on the Amazon SageMaker Ground Truth team, helping build scalable and reliable software applications. Kavya played a key role in the development and launch of the Generative AI Tools on SageMaker. Previously, Kavya held engineering roles within AWS EC2 Networking, and Amazon Audible. In her free time, she enjoys painting, and exploring Seattle’s nature scene.
Alan Ismaiel is a software engineer at AWS based in New York City. He focuses on building and maintaining scalable AI/ML products, like Amazon SageMaker Ground Truth and Amazon Bedrock. Outside of work, Alan is learning how to play pickleball, with mixed results.

Amazon Bedrock Agents observability using Arize AI

This post is cowritten with John Gilhuly from Arize AI.
With Amazon Bedrock Agents, you can build and configure autonomous agents in your application. An agent helps your end-users complete actions based on organization data and user input. Agents orchestrate interactions between foundation models (FMs), data sources, software applications, and user conversations. In addition, agents automatically call APIs to take actions and invoke knowledge bases to supplement information for these actions. By integrating agents, you can accelerate your development effort to deliver generative AI applications. With agents, you can automate tasks for your customers and answer questions for them. For example, you can create an agent that helps customers process insurance claims or make travel reservations. You don’t have to provision capacity, manage infrastructure, or write custom code. Amazon Bedrock manages prompt engineering, memory, monitoring, encryption, user permissions, and API invocation.
AI agents represent a fundamental shift in how applications make decisions and interact with users. Unlike traditional software systems that follow predetermined paths, AI agents employ complex reasoning that often operates as a “black box.” Monitoring AI agents presents unique challenges for organizations seeking to maintain reliability, efficiency, and optimal performance in their AI implementations.
Today, we’re excited to announce a new integration between Arize AI and Amazon Bedrock Agents that addresses one of the most significant challenges in AI development: observability. Agent observability is a crucial aspect of AI operations that provides deep insights into how your Amazon Bedrock agents perform, interact, and execute tasks. It involves tracking and analyzing hierarchical traces of agent activities, from high-level user requests down to individual API calls and tool invocations. These traces form a structured tree of events, helping developers understand the complete journey of user interactions through the agent’s decision-making process. Key metrics that demand attention include response latency, token usage, runtime exceptions, and inspect function calling. As organizations scale their AI implementations from proof of concept to production, understanding and monitoring AI agent behavior becomes increasingly critical.
The integration between Arize AI and Amazon Bedrock Agents provides developers with comprehensive observability tools for tracing, evaluating, and monitoring AI agent applications. This solution delivers three primary benefits:

Comprehensive traceability – Gain visibility into every step of your agent’s execution path, from initial user query through knowledge retrieval and action execution
Systematic evaluation framework – Apply consistent evaluation methodologies to measure and understand agent performance
Data-driven optimization – Run structured experiments to compare different agent configurations and identify optimal settings

The Arize AI service is available in two versions:

Arize AX – An enterprise solution offering advanced monitoring capabilities
Arize Phoenix – An open source service making tracing and evaluation accessible to developers

In this post, we demonstrate the Arize Phoenix system for tracing and evaluation. Phoenix can run on your local machine, a Jupyter notebook, a containerized deployment, or in the cloud. We explore how this integration works, its key features, and how you can implement it in your Amazon Bedrock Agents applications to enhance observability and maintain production-grade reliability.
Solution overview
Large language model (LLM) tracing records the paths taken by requests as they propagate through multiple steps or components of an LLM application. It improves the visibility of your application or system’s health and makes it possible to debug behavior that is difficult to reproduce locally. For example, when a user interacts with an LLM application, tracing can capture the sequence of operations, such as document retrieval, embedding generation, language model invocation, and response generation, to provide a detailed timeline of the request’s execution.
For an application to emit traces for analysis, the application must be instrumented. Your application can be manually instrumented or be automatically instrumented. Arize Phoenix offers a set of plugins (instrumentors) that you can add to your application’s startup process that perform automatic instrumentation. These plugins collect traces for your application and export them (using an exporter) for collection and visualization. The Phoenix server is a collector and UI that helps you troubleshoot your application in real time. When you run Phoenix (for example, by calling px.launch_app() or running the Phoenix container), it starts receiving traces from any application that exports traces to it. For Phoenix, the instrumentors are managed through a single repository called OpenInference. OpenInference provides a set of instrumentations for popular machine learning (ML) SDKs and frameworks in a variety of languages. It is a set of conventions and plugins that is complementary to OpenTelemetry and uses the OpenTelemetry Protocol (OTLP) to enable tracing of AI applications. Phoenix currently supports OTLP over HTTP.
For AWS, Boto3 provides Python bindings to AWS services, including Amazon Bedrock, which provides access to a number of FMs. You can instrument calls to these models using OpenInference, enabling OpenTelemetry-aligned observability of applications built using these models. You can also capture traces on invocations of Amazon Bedrock agents using OpenInference and view them in Phoenix. The following high-level architecture diagram shows an LLM application created using Amazon Bedrock Agents, which has been instrumented to send traces to the Phoenix server.

In the following sections, we demonstrate how, by installing the openinference-instrumentation-bedrock library, you can automatically instrument interactions with Amazon Bedrock or Amazon Bedrock agents for observability, evaluation, and troubleshooting purposes in Phoenix.
Prerequisites
To follow this tutorial, you must have the following:

An AWS account with access to Amazon Bedrock.
An Amazon Bedrock agent. For instructions to create an agent, refer to Create and configure agent manually. The following GitHub repo demonstrates how to create an agent using infrastructure as code (IaC), implemented through AWS Cloud Development Kit (AWS CDK) Python APIs.
An Arize account, from which you can get a Phoenix API key (available at app.phoenix.arize.com).

You can also clone the GitHub repo locally to run the Jupyter notebook yourself:
git clone https://github.com/awslabs/amazon-bedrock-agent-samples.git
Install required dependencies
Begin by installing the necessary libraries:
%pip install -r requirements.txt --quiet
Next, import the required modules:

import time
import boto3
import logging
import os
import nest_asyncio
from phoenix.otel import register
from openinference.instrumentation import using_metadata

nest_asyncio.apply()

The arize-phoenix-otel package provides a lightweight wrapper around OpenTelemetry primitives with Phoenix-aware defaults. These defaults are aware of environment variables you must set to configure Phoenix in the next steps, such as:

PHOENIX_COLLECTOR_ENDPOINT
PHOENIX_PROJECT_NAME
PHOENIX_CLIENT_HEADERS
PHOENIX_API_KEY

Configure the Phoenix environment
Set up the Phoenix Cloud environment for this tutorial. Phoenix can also be self-hosted on AWS instead.

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + input("Enter your Phoenix API key: ")

Connect your notebook to Phoenix with auto-instrumentation enabled:

project_name = "Amazon Bedrock Agent Example"
tracer_provider = register(project_name=project_name, auto_instrument=True)

The auto_instrument parameter automatically locates the openinference-instrumentation-bedrock library and instruments Amazon Bedrock and Amazon Bedrock Agent calls without requiring additional configuration. Configure metadata for the span:

metadata = {
    "agent": "bedrock-agent",
    "env": "development",
    # Metadata is used to filter search values in the dashboard
}

Set up an Amazon Bedrock session and agent
Before using Amazon Bedrock, make sure that your AWS credentials are configured correctly. You can set them up using the AWS Command Line Interface (AWS CLI) or by setting environment variables:

session = boto3.Session()
REGION = session.region_name
bedrock_agent_runtime = session.client(service_name="bedrock-agent-runtime", region_name=REGION)

We assume you’ve already created an Amazon Bedrock agent. To configure the agent, use the following code:

agent_id = "XXXXXYYYYY"  # ← Configure your Bedrock Agent ID
agent_alias_id = "Z0ZZZZZZ0Z"  # ← Optionally set a different Alias ID if you have one

Before proceeding to your next step, you can validate whether invoke agent is working correctly. The response is not important; we are simply testing the API call.

print(f"Trying to invoke alias {agent_alias_id} of agent {agent_id}...")
agent_resp = bedrock_agent_runtime.invoke_agent(
    agentAliasId=agent_alias_id,
    agentId=agent_id,
    inputText="Hello!",
    sessionId="dummy-session",
)
if "completion" in agent_resp:
    print("✅ Got response")
else:
    raise ValueError(f"No 'completion' in agent response:\n{agent_resp}")

Run your agent with tracing enabled
Create a function to run your agent and capture its output:

@using_metadata(metadata)
def run(input_text):
    session_id = f"default-session1_{int(time.time())}"

    attributes = dict(
        inputText=input_text,
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        enableTrace=True,
    )
    response = bedrock_agent_runtime.invoke_agent(**attributes)

    # Stream the response
    for _, event in enumerate(response["completion"]):
        if "chunk" in event:
            print(event)
            chunk_data = event["chunk"]
            if "bytes" in chunk_data:
                output_text = chunk_data["bytes"].decode("utf8")
                print(output_text)
        elif "trace" in event:
            print(event["trace"])

Test your agent with a few sample queries:

run("What are the total leaves for Employee 1?")
run("If Employee 1 takes 4 vacation days off, What are the total leaves left for Employee 1?")

You should replace these queries with the queries that your application is built for. After executing these commands, you should see your agent’s responses in the notebook output. The Phoenix instrumentation is automatically capturing detailed traces of these interactions, including knowledge base lookups, orchestration steps, and tool calls.
View captured traces in Phoenix
Navigate to your Phoenix dashboard to view the captured traces. You will see a comprehensive visualization of each agent invocation, including:

The full conversation context
Knowledge base queries and results
Tool or action group calls and responses
Agent reasoning and decision-making steps

Phoenix’s tracing and span analysis capabilities are useful during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it straightforward to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts. With Phoenix’s tracing capabilities, you can monitor the following:

Application latency – Identify latency bottlenecks and address slow invocations of LLMs, retrievers, and other components within your application, enabling you to optimize performance and responsiveness.
Token usage – Gain a detailed breakdown of token usage for your LLM calls, so you can identify and optimize the most expensive LLM invocations.
Runtime exceptions – Capture and inspect critical runtime exceptions, such as rate-limiting events, that can help you proactively address and mitigate potential issues.
Retrieved documents – Inspect the documents retrieved during a retriever call, including the score and order in which they were returned, to provide insight into the retrieval process.
Embeddings – Examine the embedding text used for retrieval and the underlying embedding model, so you can validate and refine your embedding strategies.
LLM parameters – Inspect the parameters used when calling an LLM, such as temperature and system prompts, to facilitate optimal configuration and debugging.
Prompt templates – Understand the prompt templates used during the prompting step and the variables that were applied, so you can fine-tune and improve your prompting strategies.
Tool descriptions – View the descriptions and function signatures of the tools your LLM has been given access to, in order to better understand and control your LLM’s capabilities.
LLM function calls – For LLMs with function call capabilities (such as Anthropic’s Claude, Amazon Nova, or Meta’s Llama), you can inspect the function selection and function messages in the input to the LLM. This can further help you debug and optimize your application.

The following screenshot shows the Phoenix dashboard for the Amazon Bedrock agent, including latency, token usage, and total traces.

You can choose one of the traces to drill down to the level of the entire orchestration.

Evaluate the agent in Phoenix
Evaluating any AI application is a challenge. Evaluating an agent is even more difficult. Agents present a unique set of evaluation pitfalls to navigate. A common evaluation metric for agents is their function calling accuracy, in other words, how well they do at choosing the right tool for the job. For example, agents can take inefficient paths and still get to the right solution. How do you know if they took an optimal path? Additionally, bad responses upstream can lead to strange responses downstream. How do you pinpoint where a problem originated? Phoenix also includes built-in LLM evaluations and code-based experiment testing. An agent is characterized by what it knows about the world, the set of actions it can perform, and the pathway it took to get there. To evaluate an agent, you must evaluate each component. Phoenix has built evaluation templates for every step, such as:

Agent function calling
Agent path convergence
Agent planning
Agent reflection

You can evaluate the individual skills and responses using normal LLM evaluation strategies, such as retrieval evaluation, classification with LLM judges, hallucination, or Q&A correctness. In this post, we demonstrate evaluation of agent function calling. You can use the Agent Function Call eval to determine how well a model selects a tool to use, extracts the right parameters from the user query, and generates the tool call code. Now that you’ve traced your agent in the previous step, the next step is to add evaluations to measure its performance. Complete the following steps:

Up until now, you have just used the lighter-weight Phoenix OTEL tracing library. To run evals, you must install the full library:

!pip install -q arize-phoenix

Import the necessary evaluation components:

import re
import json
import phoenix as px
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    BedrockModel,
    llm_classify,
)
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

The following is our agent function calling prompt template:

TOOL_CALLING_PROMPT_TEMPLATE = """

You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

[BEGIN DATA]
************
[Question]: {question}
************
[Tool Called]: {tool_call}
[END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

[Tool Definitions]: {tool_definitions}
"""

Because we are only evaluating the inputs, outputs, and function call columns, let’s extract those into a simpler-to-use dataframe. Phoenix provides a method to query your span data and directly export only the values you care about:

query = (
    SpanQuery()
    .where(
        # Filter for the `LLM` span kind.
        # The filter condition is a valid Python boolean expression passed as a string.
        "span_kind == 'LLM' and 'evaluation' not in input.value"
    )
    .select(
        question="input.value",
        outputs="output.value",
    )
)
trace_df = px.Client().query_spans(query, project_name=project_name)

The next step is to prepare these traces into a dataframe with columns for input, tool call, and tool definitions. Parse the JSON input and output data to create these columns:

def extract_tool_calls(output_value):
    try:
        tool_calls = []
        # Look for tool calls within <function_calls> tags
        if "<function_calls>" in output_value:
            # Find all tool_name tags
            tool_name_pattern = r"<tool_name>(.*?)</tool_name>"
            tool_names = re.findall(tool_name_pattern, output_value)

            # Add each found tool name to the list
            for tool_name in tool_names:
                if tool_name:
                    tool_calls.append(tool_name)
    except Exception as e:
        print(f"Error extracting tool calls: {e}")

    return tool_calls

Apply the function to each row of trace_df.output.value:

trace_df["tool_call"] = trace_df["outputs"].apply(
    lambda x: extract_tool_calls(x) if isinstance(x, str) else []
)

# Display the tool calls found
print("Tool calls found in traces:", trace_df["tool_call"].sum())

Add tool definitions for evaluation:

trace_df["tool_definitions"] = (
    "phoenix-traces retrieves the latest trace information from Phoenix, phoenix-experiments retrieves the latest experiment information from Phoenix, phoenix-datasets retrieves the latest dataset information from Phoenix"
)

Now with your dataframe prepared, you can use Phoenix’s built-in LLM-as-a-Judge template for tool calling to evaluate your application. The following method takes in the dataframe of traces to evaluate, our built-in evaluation prompt, the eval model to use, and a rails object to snap responses from our model to a set of binary classification responses. We also instruct our model to provide explanations for its responses.

Run the tool calling evaluation:

rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())

eval_model = BedrockModel(session=session, model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0")

response_classifications = llm_classify(
    data=trace_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=eval_model,
    rails=rails,
    provide_explanation=True,
)
response_classifications["score"] = response_classifications.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)

We use the following parameters:

data – A dataframe of cases to evaluate. The dataframe must have columns to match the default template.
question – The query made to the model. If you exported spans from Phoenix to evaluate, this will be the llm.input_messages column in your exported data.
tool_call – Information on the tool called and parameters included. If you exported spans from Phoenix to evaluate, this will be the llm.function_call column in your exported data.

Finally, log the evaluation results to Phoenix:

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=response_classifications),
)

After running these commands, you will see your evaluation results on the Phoenix dashboard, providing insights into how effectively your agent is using its available tools.
The following screenshot shows how the tool calling evaluation attribute shows up when you run the evaluation.

When you expand the individual trace, you can observe that the tool calling evaluation adds a score of 1 if the label is correct. This means that the agent has responded correctly.

Conclusion
As AI agents become increasingly prevalent in enterprise applications, effective observability is crucial for facilitating their reliability, performance, and continuous improvement. The integration of Arize AI with Amazon Bedrock Agents provides developers with the tools they need to build, monitor, and enhance AI agent applications with confidence. We’re excited to see how this integration will empower developers and organizations to push the boundaries of what’s possible with AI agents.
Stay tuned for more updates and enhancements to this integration in the coming months. To learn more about Amazon Bedrock Agents and the Arize AI integration, refer to the Phoenix documentation and Integrating Arize AI and Amazon Bedrock Agents: A Comprehensive Guide to Tracing, Evaluation, and Monitoring.

About the Authors
Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
John Gilhuly is the Head of Developer Relations at Arize AI, focused on AI agent observability and evaluation tooling. He holds an MBA from Stanford and a B.S. in C.S. from Duke. Prior to joining Arize, John led GTM activities at Slingshot AI, and served as a venture fellow at Omega Venture Partners. In his pre-AI life, John built out and ran technical go-to-market teams at Branch Metrics.
Richa Gupta is a Sr. Solutions Architect at Amazon Web Services. She is passionate about architecting end-to-end solutions for customers. Her specialization is machine learning and how it can be used to build new solutions that lead to operational excellence and drive business revenue. Prior to joining AWS, she worked in the capacity of a Software Engineer and Solutions Architect, building solutions for large telecom operators. Outside of work, she likes to explore new places and loves adventurous activities.
Aris Tsakpinis is a Specialist Solutions Architect for Generative AI, focusing on open weight models on Amazon Bedrock and the broader generative AI open source landscape. Alongside his professional role, he is pursuing a PhD in Machine Learning Engineering at the University of Regensburg, where his research focuses on applied natural language processing in scientific domains.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Musarath Rahamathullah is an AI/ML and GenAI Solutions Architect at Amazon Web Services, focusing on media and entertainment customers. She holds a Master’s degree in Analytics with a specialization in Machine Learning. She is passionate about using AI solutions in the AWS Cloud to address customer challenges and democratize technology. Her professional background includes a role as a Research Assistant at the prestigious Indian Institute of Technology, Chennai. Beyond her professional endeavors, she is interested in interior architecture, focusing on creating beautiful spaces to live.

How SkillShow automates youth sports video processing using Amazon Tra …

This post is co-written with Tom Koerick from SkillShow.
The youth sports market was valued at $37.5 billion globally in 2022 and is projected to grow by 9.2% each year through 2030. Approximately 60 million young athletes participate in this market worldwide. SkillShow, a leader in youth sports video production, films over 300 events yearly in the youth sports industry, creating content for over 20,000 young athletes annually. This post describes how SkillShow used Amazon Transcribe and other Amazon Web Services (AWS) machine learning (ML) services to automate their video processing workflow, reducing editing time and costs while scaling their operations.
Challenge
In response to the surge in youth sports video production, manual video editing processes are becoming increasingly unsustainable. Since 2001, SkillShow has been at the forefront of sports video production, providing comprehensive video services for individuals, teams, and event organizers. They specialize in filming, editing, and distributing content that helps athletes showcase their skills to recruiters, build their personal brand on social media, and support their development training. As a trusted partner to major sports organizations including the Perfect Game, 3Step Sports, USA Baseball, MLB Network, Under Armour, Elite11 football combines and more, SkillShow has filmed hundreds of thousands of athletes and thousands of regional and national events across different sports and age groups.
Despite their market leadership, SkillShow faced significant operational challenges. With only seven full-time employees managing their expanding operation, they had to outsource to over 1,100 contractors annually. This reliance on outsourced editing not only increased operational costs but also resulted in a lengthy 3-week turnaround time per event, making it difficult to keep pace with the growing demand for youth sports content.
Managing approximately 230 TB of video data per year created significant operational challenges. This massive volume of data meant lengthy upload and download times for editors, expensive storage costs, and complex data management requirements. Each event’s raw footage needed to be securely stored, backed up, and made accessible to multiple editors, straining both technical resources and IT infrastructure. These challenges led to SkillShow halting new events mid-2023, limiting their growth potential in a rapidly expanding market. The need for an efficient, scalable solution became critical to maintaining SkillShow’s position and meeting the growing demand for youth sports content, particularly in the post-COVID era where recruiting videos have become essential for leagues and athletes alike.
Solution overview
To address these challenges, SkillShow partnered with AWS to develop an automated video processing pipeline. The team initially explored several approaches to automate player identification.
Facial recognition proved challenging due to varying video quality, inconsistent lighting conditions, and frequent player movement during games. Additionally, players often wore equipment such as helmets or protective gear that obscured their faces, making reliable identification difficult.
Text-based detection of jersey numbers and colors seemed promising at first, but presented its own set of challenges. Jersey numbers were frequently obscured by player movement, weather conditions could affect visibility, and varying camera angles made consistent detection unreliable.
Ultimately, the team settled on an audio logging and automated clip generation solution, which proved superior for several reasons:

More reliable player identification, because announcers consistently call out player numbers and team colors
Better performance in varying environmental conditions, because audio quality remains relatively consistent even in challenging weather or lighting
Reduced processing complexity and computational requirements compared to video-based analysis
More cost-effective due to lower computational demands and higher accuracy rates
Ability to capture additional context from announcer commentary, such as play descriptions and game situations

This solution uses several key AWS services:

Amazon Simple Storage Service (Amazon S3):

Used for storing the input and output video files
Provides scalable and durable storage to handle SkillShow’s large video data volume of 230 TB per year
Allows for straightforward access and integration with other AWS services in the processing pipeline

AWS Lambda:

Serverless compute service used to power the automated processing workflows
Triggers the various functions that orchestrate the video processing, such as transcription and clip generation
Enables event-driven, scalable, and cost-effective processing without the need to manage underlying infrastructure

Amazon Transcribe:

Automatic speech recognition (ASR) service used to convert the video audio into text transcripts
Provides the foundation for analyzing the video content and identifying player details
Allows for accurate speech-to-text conversion, even in noisy sports environments

The following diagram illustrates the solution architecture.

SkillShow AWS Architecture Diagram

The architectural flow is as follows:

The authorized user uploads a .csv file containing roster information (such as jersey color, number, player name, and school) and the video footage of players.
A Lambda function is triggered by the upload of the video.
The auto-transcript Lambda function uses Amazon Transcribe to generate a timestamped transcript of what is said in the input video (a minimal sketch of this step follows the list).
The transcript is uploaded to the output S3 bucket under transcripts/ for further use.
The authorized user can invoke the auto-clipper Lambda function with an AWS Command Line Interface (AWS CLI) command.
The function parses the transcript against player information from the roster.
When identifying players, the function clips videos based on a specified keyword (in SkillShow’s case, it was “Next”) and uploads them to the output S3 bucket under segments/.

By using this suite of AWS services, SkillShow was able to build a scalable, cost-effective, and highly automated video processing solution that addressed their key operational challenges. The cloud-based architecture provides the flexibility and scalability required to handle their growing data volumes and evolving business needs.
Example processing workflow
Let’s explore an example processing workflow. As shown in the following screenshots, we first upload a player roster .csv and video file to the input bucket.

The auto-transcribe function processes the audio.

The auto-clipper function segments the video based on player information.

Final clips are uploaded to the output bucket into two separate folders: one prefixed with the input video name, or Unnamed/ if the transcription was unclear or the player name was missing within the segment.

Named videos can be viewed in the first folder where SkillShow’s current naming convention (jersey color_number_event video name) is followed for editors to download on demand.

Unnamed videos follow a similar naming convention, only missing the unique player name. Now, editors only have to review the files in this folder and rename them manually, instead of doing this for entire event videos.

Results and benefits
After implementing this AWS powered solution, SkillShow transformed their video processing operations. The automated pipeline reduced video production time from 3 weeks to 24 hours per event, enabling faster delivery to athletes and scouts. A recent event in Chicago showcased the system’s effectiveness. The automated pipeline processed 69 clips, accurately cutting and naming 64 of them—achieving a 93% success rate. This high accuracy demonstrates the solution’s ability to handle real-world scenarios effectively. The system also proved adaptable, quickly addressing initial challenges such as color naming inconsistencies.
The Northwest Indoor event further illustrated the system’s scalability and versatility. Here, the automated process handled a larger volume of approximately 270 clips, maintaining an estimated accuracy rate of over 90%. Notably, this event included batting practice footage, highlighting the solution’s adaptability to various types of sports activities.
With this streamlined workflow, SkillShow has expanded its capacity to process multiple events simultaneously, significantly enhancing its ability to serve youth sports leagues. The standardized output format and improved player identification accuracy have enhanced the viewing experience for athletes, coaches, and scouts alike. Although the time savings vary depending on specific event conditions and filming techniques, the system has demonstrated its potential to substantially reduce manual editing work. SkillShow continues to refine the process, carefully balancing automation with quality control to provide optimal results across diverse event types. These improvements have positioned SkillShow to meet the growing demand for youth sports video content while maintaining consistent quality across all events.
Conclusion
This solution demonstrates how AWS ML services can transform resource-intensive video processing workflows into efficient, automated systems. By combining the scalable storage of Amazon S3, serverless computing with Lambda, and the speech recognition capabilities of Amazon Transcribe, organizations can dramatically reduce processing times and operational costs. As a leader in automated sports video production, SkillShow has pioneered this approach for youth sports while demonstrating its adaptability to various content types, from educational videos to corporate training. They’re already exploring additional artificial intelligence and machine learning (AI/ML) capabilities for automated highlight generation, real-time processing for live events, and deeper integration with sports leagues and organizations.
For organizations looking to further enhance their video processing capabilities, Amazon Bedrock Data Automation offers additional possibilities. Amazon Bedrock Data Automation can streamline the generation of valuable insights from unstructured, multimodal content such as documents, images, audio, and videos. This fully managed capability could potentially be integrated into workflows similar to SkillShow’s, offering features such as automated video summaries, content moderation, and custom extraction of relevant information from video content. Furthermore, Amazon Bedrock Data Automation can generate custom insights from audio, including summaries and sentiment analysis, providing even deeper understanding of spoken content in sports videos.
SkillShow’s success highlights the broader potential of cloud-based video processing. As demand for video content continues to grow across industries, organizations can use AWS ML services to automate their workflows, reduce manual effort, and focus on delivering value to their customers rather than managing complex editing operations.
Are you interested in implementing similar automated video processing workflows for your organization? Contact SkillShow to learn how their pipeline built with AWS services can transform your content production process.

About the Authors
Ragib Ahsan is a Partner Solutions Architect at Amazon Web Services (AWS), where he helps organizations build and implement AI/ML solutions. Specializing in computer vision, he works with AWS partners to create practical applications using cloud technologies. Ahsan is particularly passionate about serverless architecture and its role in making solutions more accessible and efficient.
Tom Koerick is the owner and CEO of SkillShow, a sports media network company that has been filming youth sporting events nationwide since 2001. A former professional baseball player turned entrepreneur, Tom develops video solutions for event organizers and families in the youth sports industry. His focus includes college recruiting, social media sharing, and B2B services that provide added value and revenue generation opportunities in youth sports.

A Coding Guide to Build a Production-Ready Asynchronous Python SDK with Rate Limiting, In-Memory Caching, and Authentication

In this tutorial, we guide users through building a robust, production-ready Python SDK. It begins by showing how to install and configure essential asynchronous HTTP libraries (aiohttp, nest-asyncio). It then walks through the implementation of core components, including structured response objects, token-bucket rate limiting, in-memory caching with TTL, and a clean, dataclass-driven design. We’ll see how to wrap these pieces up in an AdvancedSDK class that supports async context management, automatic retry/wait-on-rate-limit behavior, JSON/auth headers injection, and convenient HTTP-verb methods. Along the way, a demo harness against JSONPlaceholder illustrates caching efficiency, batch fetching with rate limits, error handling, and even shows how to extend the SDK via a fluent “builder” pattern for custom configuration.

import asyncio
import aiohttp
import time
import json
from typing import Dict, List, Optional, Any, Union
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta
import hashlib
import logging

!pip install aiohttp nest-asyncio

We set up the asynchronous runtime by importing asyncio and aiohttp, alongside utilities for timing, JSON handling, dataclass modeling, caching (via hashlib and datetime), and structured logging. The !pip install aiohttp nest-asyncio line makes both libraries available in the notebook; nest-asyncio in particular allows an event loop to run seamlessly within Colab, enabling robust async HTTP requests and rate-limited workflows.

@dataclass
class APIResponse:
    """Structured response object"""
    data: Any
    status_code: int
    headers: Dict[str, str]
    timestamp: datetime

    def to_dict(self) -> Dict:
        return asdict(self)

The APIResponse dataclass encapsulates HTTP response details, payload (data), status code, headers, and the timestamp of retrieval into a single, typed object. The to_dict() helper converts the instance into a plain dictionary for easy logging, serialization, or downstream processing.

class RateLimiter:
    """Token bucket rate limiter"""
    def __init__(self, max_calls: int = 100, time_window: int = 60):
        self.max_calls = max_calls
        self.time_window = time_window
        self.calls = []

    def can_proceed(self) -> bool:
        now = time.time()
        self.calls = [call_time for call_time in self.calls if now - call_time < self.time_window]

        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

    def wait_time(self) -> float:
        if not self.calls:
            return 0
        return max(0, self.time_window - (time.time() - self.calls[0]))

The RateLimiter class enforces a simple token-bucket policy by tracking the timestamps of recent calls and allowing up to max_calls within a rolling time_window. When the limit is reached, can_proceed() returns False, and wait_time() calculates how long to pause before making the next request.

class Cache:
    """Simple in-memory cache with TTL"""
    def __init__(self, default_ttl: int = 300):
        self.cache = {}
        self.default_ttl = default_ttl

    def _generate_key(self, method: str, url: str, params: Dict = None) -> str:
        key_data = f"{method}:{url}:{json.dumps(params or {}, sort_keys=True)}"
        return hashlib.md5(key_data.encode()).hexdigest()

    def get(self, method: str, url: str, params: Dict = None) -> Optional[APIResponse]:
        key = self._generate_key(method, url, params)
        if key in self.cache:
            response, expiry = self.cache[key]
            if datetime.now() < expiry:
                return response
            del self.cache[key]
        return None

    def set(self, method: str, url: str, response: APIResponse, params: Dict = None, ttl: int = None):
        key = self._generate_key(method, url, params)
        expiry = datetime.now() + timedelta(seconds=ttl or self.default_ttl)
        self.cache[key] = (response, expiry)

The Cache class provides a lightweight in-memory TTL cache for API responses by hashing the request signature (method, URL, params) into a unique key. It returns valid cached APIResponse objects before expiry and automatically evicts stale entries after their time-to-live has elapsed.

class AdvancedSDK:
    """Advanced SDK with modern Python patterns"""

    def __init__(self, base_url: str, api_key: str = None, rate_limit: int = 100):
        self.base_url = base_url.rstrip('/')
        self.api_key = api_key
        self.session = None
        self.rate_limiter = RateLimiter(max_calls=rate_limit)
        self.cache = Cache()
        self.logger = self._setup_logger()

    def _setup_logger(self) -> logging.Logger:
        logger = logging.getLogger(f"SDK-{id(self)}")
        if not logger.handlers:
            handler = logging.StreamHandler()
            formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
            handler.setFormatter(formatter)
            logger.addHandler(handler)
            logger.setLevel(logging.INFO)
        return logger

    async def __aenter__(self):
        """Async context manager entry"""
        self.session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Async context manager exit"""
        if self.session:
            await self.session.close()

    def _get_headers(self) -> Dict[str, str]:
        headers = {'Content-Type': 'application/json'}
        if self.api_key:
            headers['Authorization'] = f'Bearer {self.api_key}'
        return headers

    async def _make_request(self, method: str, endpoint: str, params: Dict = None,
                            data: Dict = None, use_cache: bool = True) -> APIResponse:
        """Core request method with rate limiting and caching"""

        if use_cache and method.upper() == 'GET':
            cached = self.cache.get(method, endpoint, params)
            if cached:
                self.logger.info(f"Cache hit for {method} {endpoint}")
                return cached

        if not self.rate_limiter.can_proceed():
            wait_time = self.rate_limiter.wait_time()
            self.logger.warning(f"Rate limit hit, waiting {wait_time:.2f}s")
            await asyncio.sleep(wait_time)

        url = f"{self.base_url}/{endpoint.lstrip('/')}"

        try:
            async with self.session.request(
                method=method.upper(),
                url=url,
                params=params,
                json=data,
                headers=self._get_headers()
            ) as resp:
                response_data = await resp.json() if resp.content_type == 'application/json' else await resp.text()

                api_response = APIResponse(
                    data=response_data,
                    status_code=resp.status,
                    headers=dict(resp.headers),
                    timestamp=datetime.now()
                )

                if use_cache and method.upper() == 'GET' and 200 <= resp.status < 300:
                    self.cache.set(method, endpoint, api_response, params)

                self.logger.info(f"{method.upper()} {endpoint} - Status: {resp.status}")
                return api_response

        except Exception as e:
            self.logger.error(f"Request failed: {str(e)}")
            raise

    async def get(self, endpoint: str, params: Dict = None, use_cache: bool = True) -> APIResponse:
        return await self._make_request('GET', endpoint, params=params, use_cache=use_cache)

    async def post(self, endpoint: str, data: Dict = None) -> APIResponse:
        return await self._make_request('POST', endpoint, data=data, use_cache=False)

    async def put(self, endpoint: str, data: Dict = None) -> APIResponse:
        return await self._make_request('PUT', endpoint, data=data, use_cache=False)

    async def delete(self, endpoint: str) -> APIResponse:
        return await self._make_request('DELETE', endpoint, use_cache=False)

The AdvancedSDK class wraps everything together into a clean, async-first client: it manages an aiohttp session via async context managers, injects JSON and auth headers, and coordinates our RateLimiter and Cache under the hood. Its _make_request method centralizes GET/POST/PUT/DELETE logic, handling cache lookups, rate-limit waits, error logging, and response packing into APIResponse objects, while the get/post/put/delete helpers give us ergonomic, high-level calls.

async def demo_sdk():
    """Demonstrate SDK capabilities"""
    print(" Advanced SDK Demo")
    print("=" * 50)

    async with AdvancedSDK("https://jsonplaceholder.typicode.com") as sdk:

        print("\n Testing GET request with caching...")
        response1 = await sdk.get("/posts/1")
        print(f"First request - Status: {response1.status_code}")
        print(f"Title: {response1.data.get('title', 'N/A')}")

        response2 = await sdk.get("/posts/1")
        print(f"Second request (cached) - Status: {response2.status_code}")

        print("\n Testing POST request...")
        new_post = {
            "title": "Advanced SDK Tutorial",
            "body": "This SDK demonstrates modern Python patterns",
            "userId": 1
        }
        post_response = await sdk.post("/posts", data=new_post)
        print(f"POST Status: {post_response.status_code}")
        print(f"Created post ID: {post_response.data.get('id', 'N/A')}")

        print("\n Testing batch requests with rate limiting...")
        tasks = []
        for i in range(1, 6):
            tasks.append(sdk.get(f"/posts/{i}"))

        results = await asyncio.gather(*tasks)
        print(f"Batch completed: {len(results)} requests")
        for i, result in enumerate(results, 1):
            print(f" Post {i}: {result.data.get('title', 'N/A')[:30]}...")

        print("\n Testing error handling...")
        try:
            error_response = await sdk.get("/posts/999999")
            print(f"Error response status: {error_response.status_code}")
        except Exception as e:
            print(f"Handled error: {type(e).__name__}")

        print("\n Demo completed successfully!")

async def run_demo():
    """Colab-friendly demo runner"""
    await demo_sdk()

The demo_sdk coroutine walks through the SDK’s core features, issuing a cached GET request, performing a POST, executing a batch of GETs under rate limiting, and handling errors, against the JSONPlaceholder API, printing status codes and sample data to illustrate each capability. The run_demo helper ensures this demo runs smoothly inside a Colab notebook’s existing event loop.

import nest_asyncio
nest_asyncio.apply()

if __name__ == "__main__":
    try:
        asyncio.run(demo_sdk())
    except RuntimeError:
        loop = asyncio.get_event_loop()
        loop.run_until_complete(demo_sdk())

class SDKBuilder:
    """Builder pattern for SDK configuration"""
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.config = {}

    def with_auth(self, api_key: str):
        self.config['api_key'] = api_key
        return self

    def with_rate_limit(self, calls_per_minute: int):
        self.config['rate_limit'] = calls_per_minute
        return self

    def build(self) -> AdvancedSDK:
        return AdvancedSDK(self.base_url, **self.config)

Finally, we apply nest_asyncio to enable nested event loops in Colab, then run the demo via asyncio.run (with a fallback to manual loop execution if needed). It also introduces an SDKBuilder class that implements a fluent builder pattern for easily configuring and instantiating the AdvancedSDK with custom authentication and rate-limit settings.
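
As a quick usage sketch (the base URL is the same JSONPlaceholder endpoint used in the demo; the API key is a placeholder), the builder reads naturally as a fluent chain:

async def builder_example():
    sdk = (SDKBuilder("https://jsonplaceholder.typicode.com")
           .with_auth("demo-api-key")      # placeholder key, injected as a Bearer token
           .with_rate_limit(60)            # allow 60 calls per rolling window
           .build())
    async with sdk:
        response = await sdk.get("/posts/1")
        print(response.status_code)

# asyncio.run(builder_example())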

In conclusion, this SDK tutorial provides a scalable foundation for any RESTful integration, combining modern Python idioms (dataclasses, async/await, context managers) with practical tooling (rate limiter, cache, structured logging). By adapting the patterns shown here, particularly the separation of concerns between request orchestration, caching, and response modeling, teams can accelerate the development of new API clients while ensuring predictability, observability, and resilience.

Check out the Codes. All credit for this research goes to the researchers of this project.
The post A Coding Guide to Build a Production-Ready Asynchronous Python SDK with Rate Limiting, In-Memory Caching, and Authentication appeared first on MarkTechPost.

Sakana AI Introduces Reinforcement-Learned Teachers (RLTs): Efficiently Distilling Reasoning in LLMs Using Small-Scale Reinforcement Learning

Sakana AI introduces a novel framework for reasoning language models (LLMs) with a focus on efficiency and reusability: Reinforcement-Learned Teachers (RLTs). Traditional reinforcement learning (RL) approaches in LLMs are plagued by sparse reward signals and prohibitively high computational demands. By contrast, RLTs redefine the teacher-student paradigm by training smaller models to act as optimized instructors, producing step-by-step explanations instead of solving problems from scratch. This design shift enables significant gains in distillation quality, cost-efficiency, and transferability across domains—without the need for large model footprints.

Rethinking Reinforcement Learning for Teaching, Not Solving

Conventional RL setups train models to solve problems autonomously using sparse, correctness-based rewards. These models are often repurposed to teach smaller models, generating reasoning traces for distillation. However, the mismatch between the RL objective (solving problems) and the actual downstream use (teaching) results in inefficiencies. RLTs directly address this by prompting models with both the problem and its solution, requiring them only to generate detailed, pedagogical explanations. The reward signal is dense and student-aligned: it measures how well the student model understands the explanation and reproduces the solution.

Core Concept: Dense, Student-Aligned Rewards

The RLT training objective is constructed around two key reward terms:

Solution Score (rSS): Quantifies the student’s ability to reconstruct the correct solution given the explanation and the problem.

Explanation Score (rKL): Measures how logically coherent the teacher’s explanation is from the student’s perspective.

These are combined into a dense reward signal that encourages explanations which are both instructive and understandable. Importantly, this bypasses the exploration bottleneck of traditional RL, enabling smaller models to effectively train via RL.
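
Purely as an illustration (an assumption on our part, not the paper's stated formulation), such a combination can be thought of as a weighted sum of the two terms, r = rSS + α · rKL, where α controls how much weight explanation coherence receives relative to solution reconstruction.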

Surprising Efficacy of Small Teachers

Sakana AI demonstrates that a 7B parameter RLT outperforms much larger LLMs (e.g., 32B+ models) on distillation tasks across multiple challenging datasets, including AIME 2024, MATH 500, and GPQA Diamond. On a 17K-question corpus:

RLT-7B outperforms DeepSeek R1, Bespoke-7B, and even post-processed RL traces.

RLT-32B outperforms all 32B baselines across the board, despite being distilled from a smaller teacher.

The impact is not just parameter efficiency—RLTs achieve better generalization, fewer formatting errors, and higher interpretability.

Cold-Starting Reinforcement Learning with RLTs

Another critical use case is RL cold-starting, where an initial model is bootstrapped with external data before formal RL training. Traces generated by RLTs serve as more effective cold-start material than those from larger RL-trained models. In fact, even without post-processing or external refinement (e.g., via GPT-4.1), RLT-generated explanations yield higher performance gains after RL fine-tuning.

Out-of-Domain Generalization and Zero-Shot Transfer

RLTs also show strong zero-shot transfer capabilities. When applied to a novel domain—such as the arithmetic-based “Countdown” task—the RLT-trained traces enable student models to surpass even direct RL on the new domain. This indicates that the skill of “explaining a solution” generalizes across tasks more easily than the skill of “solving from scratch,” providing evidence for better reusability of teaching-focused RL models.

Training Pipeline: Efficient and Scalable

The training process is computationally lean:

250 RL steps (~1 epoch), batch size 256, group size 64.

Trained using a single-node setup with Qwen2.5-7B-Instruct.

Code and pretrained checkpoints are available: GitHub

Unlike traditional RL pipelines, RLTs do not require post-processing, formatting corrections, or verification filters—raw outputs are directly usable.

Evaluation Highlights

TL;DR (100 words)

Sakana AI introduces Reinforcement-Learned Teachers (RLTs), a lightweight yet powerful framework for teaching LLMs to reason. Unlike traditional RL models that learn by solving tasks from scratch, RLTs are given both the question and its solution and are trained to generate step-by-step explanations. This setup aligns RL rewards with student learning outcomes, enabling 7B parameter RLTs to outperform much larger LLMs in distillation and cold-start scenarios. RLTs are cost-efficient, transferable across domains, and eliminate the need for expensive post-processing—offering a scalable blueprint for building reasoning-capable LLMs using modest compute and open-source tools.

Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.
The post Sakana AI Introduces Reinforcement-Learned Teachers (RLTs): Efficiently Distilling Reasoning in LLMs Using Small-Scale Reinforcement Learning appeared first on MarkTechPost.

New AI Framework Evaluates Where AI Should Automate vs. Augment Jobs, Says Stanford Study

Redefining Job Execution with AI Agents

AI agents are reshaping how jobs are performed by offering tools that execute complex, goal-directed tasks. Unlike static algorithms, these agents combine multi-step planning with software tools to handle entire workflows across various sectors, including education, law, finance, and logistics. Their integration is no longer theoretical—workers are already applying them to support a variety of professional duties. The result is a labor environment in transition, where the boundaries of human and machine collaboration are being redefined on a daily basis.

Bridging the Gap Between AI Capability and Worker Preference

A persistent problem in this transformation is the disconnect between what AI agents can do and what workers want them to do. Even if AI systems are technically capable of taking over a task, workers may not support that shift due to concerns about job satisfaction, task complexity, or the importance of human judgment. Meanwhile, tasks that workers are eager to offload may lack mature AI solutions. This mismatch presents a significant barrier to the responsible and effective deployment of AI in the workforce.

Beyond Software Engineers: A Holistic Workforce Assessment

Until recently, assessments of AI adoption often centered on a handful of roles, such as software engineering or customer service, limiting understanding of how AI impacts broader occupational diversity. Most of these approaches also prioritized company productivity over worker experience. They relied on an analysis of current usage patterns, which does not provide a forward-looking view. As a result, the development of AI tools has lacked a comprehensive foundation grounded in the actual preferences and needs of people performing the work.

Stanford’s Survey-Driven WORKBank Database: Capturing Real Worker Voices

The research team from Stanford University introduced a survey-based auditing framework that evaluates which tasks workers would prefer to see automated or augmented and compares this with expert assessments of AI capability. Using task data from the U.S. Department of Labor’s O*NET database, researchers created the WORKBank, a dataset based on responses from 1,500 domain workers and evaluations from 52 AI experts. The team employed audio-supported mini-interviews to collect nuanced preferences. It introduced the Human Agency Scale (HAS), a five-level metric that captures the desired extent of human involvement in task completion.

Human Agency Scale (HAS): Measuring the Right Level of AI Involvement

At the center of this framework is the Human Agency Scale, which ranges from H1 (full AI control) to H5 (complete human control). This approach recognizes that not all tasks benefit from full automation, nor should every AI tool aim for it. For example, tasks rated H1 or H2—like transcribing data or generating routine reports—are well-suited for independent AI execution. Meanwhile, tasks such as planning training programs or participating in security-related discussions were often rated at H4 or H5, reflecting the high demand for human oversight. The researchers gathered dual inputs: workers rated their desire for automation and preferred HAS level for each task, while experts evaluated AI’s current capability for that task.

Insights from WORKBank: Where Workers Embrace or Resist AI

The results from the WORKBank database revealed clear patterns. Approximately 46.1% of tasks received a high desire for automation from workers, particularly those viewed as low-value or repetitive. Conversely, significant resistance was found in tasks involving creativity or interpersonal dynamics, regardless of AI’s technical ability to perform them. By overlaying worker preferences and expert capabilities, tasks were divided into four zones: the Automation “Green Light” Zone (high capability and high desire), Automation “Red Light” Zone (high capability but low desire), R&D Opportunity Zone (low capability but high desire), and Low Priority Zone (low desire and low capability). Notably, 41% of the tasks targeted by Y Combinator-funded companies fell into the Low Priority or Red Light zones, indicating a potential misalignment between startup investments and worker needs.

Toward Responsible AI Deployment in the Workforce

This research offers a clear picture of how AI integration can be approached more responsibly. The Stanford team uncovered not only where automation is technically feasible but also where workers are receptive to it. Their task-level framework extends beyond technical readiness to encompass human values, making it a valuable tool for AI development, labor policy, and workforce training strategies.

TL;DR:

This paper introduces WORKBank, a large-scale dataset combining worker preferences and AI expert assessments across 844 tasks and 104 occupations, to evaluate where AI agents should automate or augment work. Using a novel Human Agency Scale (HAS), the study reveals a complex automation landscape, highlighting a misalignment between technical capability and worker desire. Findings show that workers welcome automation for repetitive tasks but resist it in roles requiring creativity or interpersonal skills. The framework offers actionable insights for responsible AI deployment aligned with human values.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post New AI Framework Evaluates Where AI Should Automate vs. Augment Jobs, Says Stanford Study appeared first on MarkTechPost.

No-code data preparation for time series forecasting using Amazon SageMaker Canvas

Time series forecasting helps businesses predict future trends based on historical data patterns, whether it’s for sales projections, inventory management, or demand forecasting. Traditional approaches require extensive knowledge of statistical methods and data science techniques to process raw time series data.
Amazon SageMaker Canvas offers no-code solutions that simplify data wrangling, making time series forecasting accessible to all users regardless of their technical background. In this post, we explore how SageMaker Canvas and SageMaker Data Wrangler provide no-code data preparation techniques that empower users of all backgrounds to prepare data and build time series forecasting models in a single interface with confidence.
Solution overview
Using SageMaker Data Wrangler for data preparation allows for the modification of data for predictive analytics without programming knowledge. In this solution, we demonstrate the steps associated with this process. The solution includes the following:

Data import from a variety of sources
Automated, no-code algorithmic recommendations for data preparation
Step-by-step processes for preparation and analysis
Visual interfaces for data visualization and analysis
Export capabilities after data preparation
Built-in security and compliance features

In this post, we focus on data preparation for time series forecasting using SageMaker Canvas.
Walkthrough
The following is a walkthrough of the solution for data preparation using Amazon SageMaker Canvas. For the walkthrough, you use the consumer electronics synthetic dataset found in this SageMaker Canvas Immersion Day lab, which we encourage you to try. This consumer electronics related time series (RTS) dataset primarily contains historical price data that corresponds to sales transactions over time. This dataset is designed to complement target time series (TTS) data to improve prediction accuracy in forecasting models, particularly for consumer electronics sales, where price changes can significantly impact buying behavior. The dataset can be used for demand forecasting, price optimization, and market analysis in the consumer electronics sector.
Prerequisites
For this walkthrough, you should have the following prerequisites:

An AWS account
AWS resources
Prerequisites for accessing SageMaker Canvas through an AWS account
Download the consumer_electronics.csv file from the SageMaker Canvas Immersion Day lab

Solution walkthrough
The following walkthrough explains how to import a dataset, prepare the data without writing code using Data Wrangler, and train a time series forecasting model using SageMaker Canvas.
Sign in to the AWS Management Console, go to Amazon SageMaker AI, and then choose Canvas. On the Get started page, select the Import and prepare option. Because this walkthrough uses tabular data for time series forecasting, select Tabular Data. You will then see the following options for importing your dataset into SageMaker Data Wrangler:

Local upload
Canvas Datasets
Amazon S3
Amazon Redshift
Amazon Athena
Databricks
MySQL
PostgreSQL
SQL Server
RDS

For this demo, select Local upload. With this option, the data is stored on an Amazon Elastic File System (Amazon EFS) storage volume in the SageMaker Studio environment, so it is tied to the SageMaker Studio instance. For more permanent storage and long-term data management when working with SageMaker Data Wrangler, Amazon Simple Storage Service (Amazon S3) is recommended.

Select the consumer_electronics.csv file from the prerequisites. After selecting the file to import, you can use the Import settings panel to set your desired configurations. For the purpose of this demo, leave the options at their default values.

After the import is complete, use the Data flow options to modify the newly imported data. For future data forecasting, you may need to clean up data for the service to properly understand the values and disregard any errors in the data. SageMaker Canvas has various offerings to accomplish this. Options include Chat for data prep with natural language data modifications and Add Transform. Chat for data prep may be best for users who prefer natural language processing (NLP) interactions and may not be familiar with technical data transformations. Add transform is best for data professionals who know which transformations they want to apply to their data.
For time series forecasting using Amazon SageMaker Canvas, data must be prepared in a certain way for the service to properly forecast and understand the data. To make a time series forecast using SageMaker Canvas, the documentation linked mentions the following requirements:

A timestamp column with all values having the datetime type.
A target column that has the values that you’re using to forecast future values.
An item ID column that contains unique identifiers for each item in your dataset, such as SKU numbers.

The datetime values in the timestamp column must use one of the following formats:

YYYY-MM-DD HH:MM:SS
YYYY-MM-DDTHH:MM:SSZ
YYYY-MM-DD
MM/DD/YY
MM/DD/YY HH:MM
MM/DD/YYYY
YYYY/MM/DD HH:MM:SS
YYYY/MM/DD
DD/MM/YYYY
DD/MM/YY
DD-MM-YY
DD-MM-YYYY

You can make forecasts for the following intervals:

1 min
5 min
15 min
30 min
1 hour
1 day
1 week
1 month
1 year

For this example, remove the $ symbol from the data by using the Chat for data prep option. Give the chat a prompt such as Can you get rid of the $ in my data, and it will generate code that fulfills your request and modifies the data, giving you a no-code way to prepare the data for later modeling and predictive analysis. Choose Add to Steps to accept this code and apply the changes to the data.
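
To give a sense of what such generated code can look like, here is a rough, hypothetical pandas sketch; SageMaker's actual generated transform will differ, and the price column name is assumed from this dataset:

import pandas as pd

# Hypothetical illustration of a "remove $ and convert to float" transform
df = pd.read_csv("consumer_electronics.csv")
df["price"] = (
    df["price"]
    .astype(str)
    .str.replace("$", "", regex=False)  # strip the currency symbol
    .astype(float)                      # cast the cleaned values to float
)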

You can also convert values to float data type and check for missing data in your uploaded CSV file using either Chat for data prep or Add Transform options. To drop missing values using Data Transform:

Select Add Transform from the interface
Choose Handle Missing from the transform options
Select Drop missing from the available operations
Choose the columns you want to check for missing values
Select Preview to verify the changes
Choose Add to confirm and apply the transformation

For time-series forecasting, inferring missing values and resampling the data set to a certain frequency (hourly, daily, or weekly) are also important. In SageMaker Data Wrangler, the frequency of data can be altered by choosing Add Transform, selecting Time Series, selecting Resample from the Transform drop down, and then selecting the Timestamp dropdown, ts in this example. Then, you can select advanced options. For example, choose Frequency unit and then select the desired frequency from the list.

SageMaker Data Wrangler offers several methods to handle missing values in time-series data through its Handle missing transform. You can choose from options such as forward fill or backward fill, which are particularly useful for maintaining the temporal structure of the data. These operations can be applied by using natural language commands in Chat for data prep, allowing flexible and efficient handling of missing values in time-series forecasting preparation.
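
Conceptually, the resample and fill transforms described above behave like the following pandas operations. This is a sketch for intuition only, not the code Data Wrangler executes, and the ts, item_id, and price column names are assumptions based on this example dataset:

import pandas as pd

df = pd.read_csv("consumer_electronics.csv", parse_dates=["ts"])

# Resample each item's series to a daily frequency
daily = (
    df.set_index("ts")
      .groupby("item_id")["price"]
      .resample("D")
      .mean()      # aggregate multiple observations within a day
)

# Forward-fill gaps within each item so the temporal structure is preserved
daily = daily.groupby(level="item_id").ffill().reset_index()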
To create the data flow, choose Create model. Then, choose Run Validation, which checks the data to make sure the processes were done correctly. After this step of data transformation, you can access additional options by selecting the purple plus sign. The options include Get data insights, Chat for data prep, Combine data, Create model, and Export.
The prepared data can then be connected to SageMaker AI for time series forecasting strategies, in this case, to predict the future demand based on the historical data that has been prepared for machine learning.
When using SageMaker, it is also important to consider data storage and security. For the local import feature, data is stored on Amazon EFS volumes and encrypted by default. For more permanent storage, Amazon S3 is recommended. S3 offers security features such as server-side encryption (SSE-S3, SSE-KMS, or SSE-C), fine-grained access controls through AWS Identity and Access Management (IAM) roles and bucket policies, and the ability to use VPC endpoints for added network security. To help ensure data security in either case, it’s important to implement proper access controls, use encryption for data at rest and in transit, regularly audit access logs, and follow the principle of least privilege when assigning permissions.
In this next step, you learn how to train a model using SageMaker Canvas. Based on the previous step, select the purple plus sign and select Create Model, and then select Export to create a model. After selecting a column to predict (select price for this example), you go to the Build screen, with options such as Quick build and Standard build. Based on the column chosen, the model will predict future values based on the data that is being used.

Clean up
To avoid incurring future charges, delete the SageMaker Data Wrangler data flow and S3 Buckets if used for storage.

In the SageMaker console, navigate to Canvas
Select Import and prepare
Find your data flow in the list
Click the three dots (⋮) menu next to your flow
Select Delete to remove the data flow

If you used S3 for storage:

Open the Amazon S3 console
Navigate to your bucket
Select the bucket used for this project
Choose Delete
Type the bucket name to confirm deletion
Select Delete bucket

Conclusion
In this post, we showed you how Amazon SageMaker Data Wrangler offers a no-code solution for time series data preparation, traditionally a task requiring technical expertise. By using the intuitive interface of the Data Wrangler console and natural language-powered tools, even users who don’t have a technical background can effectively prepare their data for future forecasting needs. This democratization of data preparation not only saves time and resources but also empowers a wider range of professionals to engage in data-driven decision-making.

About the author
Muni T. Bondu is a Solutions Architect at Amazon Web Services (AWS), based in Austin, Texas. She holds a Bachelor of Science in Computer Science, with concentrations in Artificial Intelligence and Human-Computer Interaction, from the Georgia Institute of Technology.

Build an agentic multimodal AI assistant with Amazon Nova and Amazon B …

Modern enterprises are rich in data that spans multiple modalities—from text documents and PDFs to presentation slides, images, audio recordings, and more. Imagine asking an AI assistant about your company’s quarterly earnings call: the assistant should not only read the transcript but also “see” the charts in the presentation slides and “hear” the CEO’s remarks. Gartner predicts that by 2027, 40% of generative AI solutions will be multimodal (text, image, audio, video), up from only 1% in 2023. This shift underlines how vital multimodal understanding is becoming for business applications. Achieving this requires a multimodal generative AI assistant—one that can understand and combine text, visuals, and other data types. It also requires an agentic architecture so the AI assistant can actively retrieve information, plan tasks, and make decisions on tool calling, rather than just responding passively to prompts.
In this post, we explore a solution that does exactly that—using Amazon Nova Pro, a multimodal large language model (LLM) from AWS, as the central orchestrator, along with powerful new Amazon Bedrock features like Amazon Bedrock Data Automation for processing multimodal data. We demonstrate how agentic workflow patterns such as Retrieval Augmented Generation (RAG), multi-tool orchestration, and conditional routing with LangGraph enable end-to-end solutions that artificial intelligence and machine learning (AI/ML) developers and enterprise architects can adopt and extend. We walk through an example of a financial management AI assistant that can provide quantitative research and grounded financial advice by analyzing both the earnings call (audio) and the presentation slides (images), along with relevant financial data feeds. We also highlight how you can apply this pattern in industries like finance, healthcare, and manufacturing.
Overview of the agentic workflow
The core of the agentic pattern consists of the following stages:

Reason – The agent (often an LLM) examines the user’s request and the current context or state. It decides what the next step should be—whether that’s providing a direct answer or invoking a tool or sub-task to get more information.
Act – The agent executes that step. This could mean calling a tool or function, such as a search query, a database lookup, or a document analysis using Amazon Bedrock Data Automation.
Observe – The agent observes the result of the action. For instance, it reads the retrieved text or data that came back from the tool.
Loop – With new information in hand, the agent reasons again, deciding if the task is complete or if another step is needed. This loop continues until the agent determines it can produce a final answer for the user.

This iterative decision-making enables the agent to handle complex requests that are impossible to fulfill with a single prompt. However, implementing agentic systems can be challenging. They introduce more complexity in the control flow, and naive agents can be inefficient (making too many tool calls or looping unnecessarily) or hard to manage as they scale. This is where structured frameworks like LangGraph come in. LangGraph makes it possible to define a directed graph (or state machine) of potential actions with well-defined nodes (actions like “Report Writer” or “Query Knowledge Base”) and edges (allowable transitions). Although the agent’s internal reasoning still decides which path to take, LangGraph makes sure the process remains manageable and transparent. This controlled flexibility means the assistant has enough autonomy to handle diverse tasks while making sure the overall workflow is stable and predictable.
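
As a minimal sketch of this pattern (the node names, state fields, and routing logic here are illustrative, not the notebook's actual graph), a LangGraph state machine for the reason-act-observe loop could look like the following:

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    context: str
    answer: str

def router(state: AgentState) -> AgentState:
    # Reason step: in practice, the LLM (e.g., Amazon Nova) decides what to do next
    return state

def query_knowledge_base(state: AgentState) -> AgentState:
    # Act/Observe step: retrieve grounding passages (placeholder retrieval here)
    state["context"] = "retrieved passages..."
    return state

def report_writer(state: AgentState) -> AgentState:
    # Compose the final answer from the accumulated context
    state["answer"] = f"Answer based on: {state['context']}"
    return state

def needs_retrieval(state: AgentState) -> str:
    # Conditional edge: keep retrieving until enough context is gathered
    return "retrieve" if not state["context"] else "write"

graph = StateGraph(AgentState)
graph.add_node("router", router)
graph.add_node("retrieve", query_knowledge_base)
graph.add_node("write", report_writer)
graph.set_entry_point("router")
graph.add_conditional_edges("router", needs_retrieval, {"retrieve": "retrieve", "write": "write"})
graph.add_edge("retrieve", "router")   # loop back to reason again
graph.add_edge("write", END)
app = graph.compile()
# result = app.invoke({"question": "Summarize key risks", "context": "", "answer": ""})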
Solution overview
This solution is a financial management AI assistant designed to help analysts query portfolios, analyze companies, and generate reports. At its core is Amazon Nova, an LLM that acts as the intelligent orchestrator for inference. Amazon Nova processes text, images, or documents (like earnings call slides), and dynamically decides which tools to use to fulfill requests. Amazon Nova is optimized for enterprise tasks and supports function calling, so the model can plan actions and call tools in a structured way. With a large context window (up to 300,000 tokens in Amazon Nova Lite and Amazon Nova Pro), it can manage long documents or conversation history when reasoning.
The workflow consists of the following key components:

Knowledge base retrieval – Both the earnings call audio file and PowerPoint file are processed by Amazon Bedrock Data Automation, a managed service that extracts text, transcribes audio and video, and prepares data for analysis. If the user uploads a PowerPoint file, the system converts each slide into an image (PNG) for efficient search and analysis, a technique inspired by generative AI applications like Manus. Amazon Bedrock Data Automation is effectively a multimodal AI pipeline out of the box. In our architecture, Amazon Bedrock Data Automation acts as a bridge between raw data and the agentic workflow. Then Amazon Bedrock Knowledge Bases converts these chunks extracted from Amazon Bedrock Data Automation into vector embeddings using Amazon Titan Text Embeddings V2, and stores these vectors in an Amazon OpenSearch Serverless database.
Router agent – When a user asks a question—for example, “Summarize the key risks in this Q3 earnings report”—Amazon Nova first determines whether the task requires retrieving data, processing a file, or generating a response. It maintains memory of the dialogue, interprets the user’s request, and plans which actions to take to fulfill it. The “Memory & Planning” module in the solution diagram indicates that the router agent can use conversation history and chain-of-thought (CoT) prompting to determine next steps. Crucially, the router agent determines if the query can be answered with internal company data or if it requires external information and tools.
Multimodal RAG agent – For queries related with audio and video information, Amazon Bedrock Data Automation uses a unified API call to extract insights from such multimedia data, and stores the extracted insights in Amazon Bedrock Knowledge Bases. Amazon Nova uses Amazon Bedrock Knowledge Bases to retrieve factual answers using semantic search. This makes sure responses are grounded in real data, minimizing hallucination. If Amazon Nova generates an answer, a secondary hallucination check cross-references the response against trusted sources to catch unsupported claims.
Hallucination check (quality gate) – To further verify reliability, the workflow can include a postprocessing step using a different foundation model (FM) outside of the Amazon Nova family, such as Anthropic’s Claude, Mistral, or Meta’s Llama, to grade the answer’s faithfulness. For example, after Amazon Nova generates a response, a hallucination detector model or function can compare the answer against the retrieved sources or known facts. If a potential hallucination is detected (the answer isn’t supported by the reference data), the agent can choose to do additional retrieval, adjust the answer, or escalate to a human.
Multi-tool collaboration – This multi-tool collaboration allows the AI to not only find information but also take actions before formulating a final answer. This introduces multi-tool options. The supervisor agent might spawn or coordinate multiple tool-specific agents (for example, a web search agent to do a general web search, a stock search agent to get market data, or other specialized agents for company financial metrics or industry news). Each agent performs a focused task (one might call an API or perform a query on the internet) and returns findings to the supervisor agent. Amazon Nova Pro features a strong reasoning ability that allows the supervisor agent to merge these findings. This multi-agent approach follows the principle of dividing complex tasks among specialist agents, improving efficiency and reliability for complex queries.
Report creation agent – Another notable aspect in the architecture is the use of Amazon Nova Canvas for output generation. Amazon Nova Canvas is a specialized image-generation model in the Amazon Nova family, but in this context, we use the concept of a “canvas” more figuratively to mean a structured template or format for generated content output. For instance, we could define a template for an “investor report” that the assistant fills out: Section 1: Key Highlights (bullet points), Section 2: Financial Summary (table of figures), Section 3: Notable Quotes, and so on. The agent can guide Amazon Nova to populate such a template by providing it with a system prompt containing the desired format (this is similar to few-shot prompting, where the layout is given). The result is that the assistant not only answers ad-hoc questions, but can also produce comprehensive generated reports that look as if a human analyst prepared them, combining text, images, and references to visuals.

These components are orchestrated in an agentic workflow. Instead of a fixed script, the solution uses a dynamic decision graph (implemented with the open source LangGraph library in the notebook solution) to route between steps. The result is an assistant that feels less like a chatbot and more like a collaborative analyst—one that can parse an earnings call audio recording, critique a slide deck, or draft an investor memo with minimal human intervention.
The following diagram shows the high-level architecture of the agentic AI workflow. Amazon Nova orchestrates various tools—including Amazon Bedrock Data Automation for document and image processing and a knowledge base for retrieval—to fulfill complex user requests. For brevity, we don’t list all the code here; the GitHub repo includes a full working example. Developers can run that to see the agent in action and extend it with their own data.
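
To illustrate the knowledge base retrieval step in isolation (the knowledge base ID and query below are placeholders), a semantic search against Amazon Bedrock Knowledge Bases can be issued with a few lines of boto3:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="KB1234567890",  # placeholder knowledge base ID
    retrievalQuery={"text": "Summarize the key risks in the Q3 earnings call"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 5}
    },
)

# Each result carries the retrieved chunk and its source location for grounding
for result in response["retrievalResults"]:
    print(result["content"]["text"][:120], result.get("location"))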

Example of the multi-tool collaboration workflow
To demonstrate the multi-tool collaboration agent workflow, we explore an example of how a question-answer interaction might flow through our deployed system for multi-tool collaboration:

User prompt – In the chat UI, the end-user asks a question, such as “What is XXX’s stock performance this year, and how does it compare to its rideshare‑industry peers?”
Agent initial response – The agent (Amazon Nova FM orchestrator) receives the question and responds with:

Received your question. Routing to the reasoning engine…

Planning and tool selection – The agent determines that it needs the following:

The ticker symbol for the company (XXX)
Real‑time stock price and YTD changes
Key financial metrics (revenue, net income, price-earnings ratio)
Industry benchmarks (peer YTD performance, average revenue growth)

Planning execution using tool calls – The agent calls tools to perform the following actions:

Look up ticker symbol:

Agent → WebSearchTool.lookupTicker("XXX Inc")
WebSearchTool → Agent: returns "XXX"

Fetch real‑time stock performance using the retrieved ticker symbol:

Agent → StockAnalysisTool.getPerformance(
    symbol="XXX",
    period="YTD"
)
StockAnalysisTool → Agent:
{
    currentPrice:
    ytdChange:
    52wkRange:
    volume:
}

Retrieve company financial metrics using the retrieved ticker symbol:

Agent → CompanyFinancialAnalysisTool.getMetrics("XXX")
CompanyFinancialAnalysisTool → Agent:
{
    revenueQ4_2024: xxx B,
    netIncomeQ4_2024: xxx M,
    peRatio: xxx
}

Gather industry benchmark data using the retrieved ticker symbol:

Agent → IndustryAnalysisTool.comparePeers(
    symbol="XXX",
    sector="Rideshare"
)
IndustryAnalysisTool → Agent:
{
    avgPeerYTD:
    avgRevenueGrowth:
}

Validation loop – The agent runs a validation loop:

Agent: validate()
  ↳ Are all four data points present?
      • Ticker ✓
      • Stock performance ✓
      • Financial metrics ✓
      • Industry benchmark ✓
  ↳ All set; no retry needed.

If anything is missing or a tool encountered an error, the FM orchestrator triggers the error handler (up to three retries), then resumes the plan at the failed step.

Synthesis and final answer – The agent uses Amazon Nova Pro to synthesize the data points and generate final answers based on these data points.

The following figure shows a flow diagram of this multi-tool collaboration agent.
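
The tool names in the trace above are conceptual. As a hedged sketch of how one such tool could be exposed to Amazon Nova through the Amazon Bedrock Converse API (the tool schema, prompt, and model ID are illustrative assumptions, not the notebook's exact configuration), a function definition might look like this:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "getStockPerformance",
                "description": "Fetch YTD stock performance for a ticker symbol",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "symbol": {"type": "string"},
                            "period": {"type": "string"},
                        },
                        "required": ["symbol"],
                    }
                },
            }
        }
    ]
}

response = bedrock_runtime.converse(
    modelId="amazon.nova-pro-v1:0",  # illustrative Amazon Nova Pro model ID
    messages=[{"role": "user", "content": [{"text": "How has XXX performed this year?"}]}],
    toolConfig=tool_config,
)
# If the model decides to call the tool, the response contains a toolUse block
# that the orchestrator executes before returning the result to the model.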

Benefits of using Amazon Bedrock for scalable generative AI agent workflows
This solution is built on Amazon Bedrock because AWS provides an integrated ecosystem for building such sophisticated solutions at scale:

Amazon Bedrock delivers top-tier FMs like Amazon Nova, with managed infrastructure—no need for provisioning GPU servers or handling scaling complexities.
Amazon Bedrock Data Automation offers an out-of-the-box solution to process documents, images, audio, and video into actionable data. Amazon Bedrock Data Automation can convert presentation slides to images, convert audio to text, perform OCR, and generate textual summaries or captions that are then indexed in an Amazon Bedrock knowledge base.
Amazon Bedrock Knowledge Bases can store embeddings from unstructured data and support retrieval operations using similarity search.
In addition to LangGraph (as shown in this solution), you can also use Amazon Bedrock Agents to develop agentic workflows. Amazon Bedrock Agents simplifies the configuration of tool flows and action groups, so you can declaratively manage your agentic workflows.
Applications developed by open source frameworks like LangGraph (an extension of LangChain) can also run and scale with AWS infrastructure such as Amazon Elastic Compute Cloud (Amazon EC2) or Amazon SageMaker instances, so you can define directed graphs for agent orchestration, making it effortless to manage multi-step reasoning and tool chaining.

You don’t need to assemble a dozen disparate systems; AWS provides an integrated network for generative AI workflows.
Considerations and customizations
The architecture demonstrates exceptional flexibility through its modular design principles. At its core, the system uses Amazon Nova FMs, which can be selected based on task complexity. Amazon Nova Micro handles straightforward tasks like classification with minimal latency. Amazon Nova Lite manages moderately complex operations with balanced performance, and Amazon Nova Pro excels at sophisticated tasks requiring advanced reasoning or generating comprehensive responses.
The modular nature of the solution (Amazon Nova, tools, knowledge base, and Amazon Bedrock Data Automation) means each piece can be swapped or adjusted without overhauling the whole system. Solution architects can use this reference architecture as a foundation, implementing customizations as needed. You can seamlessly integrate new capabilities through AWS Lambda functions for specialized operations, and the LangGraph orchestration enables dynamic model selection and sophisticated routing logic. This architectural approach makes sure the system can evolve organically while maintaining operational efficiency and cost-effectiveness.
Bringing it to production requires thoughtful design, but AWS offers scalability, security, and reliability. For instance, you can secure the knowledge base content with encryption and access control, integrate the agent with AWS Identity and Access Management (IAM) to make sure it only performs allowed actions (for example, if an agent can access sensitive financial data, verify that it checks user permissions), and monitor the costs (you can track Amazon Bedrock pricing and tool usage; you might use Provisioned Throughput for consistent high-volume usage). Additionally, with AWS, you can scale from an experiment in a notebook to a full production deployment when you’re ready, using the same building blocks (integrated with proper AWS infrastructure like Amazon API Gateway or Lambda, if deploying as a service).
Vertical industries that can benefit from this solution
The architecture we described is quite general. Let’s briefly look at how this multimodal agentic workflow can drive value in different industries:

Financial services – In the financial sector, the solution integrates multimedia RAG to unify earnings call transcripts, presentation slides (converted to searchable images), and real-time market feeds into a single analytical framework. Multi-agent collaboration enables Amazon Nova to orchestrate tools like Amazon Bedrock Data Automation for slide text extraction, semantic search for regulatory filings, and live data APIs for trend detection. This allows the system to generate actionable insights—such as identifying portfolio risks or recommending sector rebalancing—while automating content creation for investor reports or trade approvals (with human oversight). By mimicking an analyst’s ability to cross-reference data types, the AI assistant transforms fragmented inputs into cohesive strategies.
Healthcare – Healthcare workflows use multimedia RAG to process clinical notes, lab PDFs, and X-rays, grounding responses in peer-reviewed literature and patient audio interview. Multi-agent collaboration excels in scenarios like triage: Amazon Nova interprets symptom descriptions, Amazon Bedrock Data Automation extracts text from scanned documents, and integrated APIs check for drug interactions, all while validating outputs against trusted sources. Content creation ranges from succinct patient summaries (“Severe pneumonia, treated with levofloxacin”) to evidence-based answers for complex queries, such as summarizing diabetes guidelines. The architecture’s strict hallucination checks and source citations support reliability, which is critical for maintaining trust in medical decision-making.
Manufacturing – Industrial teams use multimedia RAG to index equipment manuals, sensor logs, worker audio conversation, and schematic diagrams, enabling rapid troubleshooting. Multi-agent collaboration allows Amazon Nova to correlate sensor anomalies with manual excerpts, and Amazon Bedrock Data Automation highlights faulty parts in technical drawings. The system generates repair guides (for example, “Replace valve Part 4 in schematic”) or contextualizes historical maintenance data, bridging the gap between veteran expertise and new technicians. By unifying text, images, and time series data into actionable content, the assistant reduces downtime and preserves institutional knowledge—proving that even in hardware-centric fields, AI-driven insights can drive efficiency.

These examples highlight a common pattern: the synergy of data automation, powerful multimodal models, and agentic orchestration leads to solutions that closely mimic a human expert’s assistance. The financial AI assistant cross-checks figures and explanations like an analyst would, the clinical AI assistant correlates images and notes like a diligent doctor, and the industrial AI assistant recalls diagrams and logs like a veteran engineer. All of this is made possible by the underlying architecture we’ve built.
Conclusion
The era of siloed AI models that only handle one type of input is drawing to a close. As we’ve discussed, combining multimodal AI with an agentic workflow unlocks a new level of capability for enterprise applications. In this post, we demonstrated how to construct such a workflow using AWS services: we used Amazon Nova as the core AI orchestrator with its multimodal, agent-friendly capabilities, Amazon Bedrock Data Automation to automate the ingestion and indexing of complex data (documents, slides, audio) into Amazon Bedrock Knowledge Bases, and the concept of an agentic workflow graph for reasoning and conditional routing (using LangChain or LangGraph) to orchestrate multi-step reasoning and tool usage. The end result is an AI assistant that operates much like a diligent analyst: researching, cross-checking multiple sources, and delivering insights—but at machine speed and scale.

The solution demonstrates that building a sophisticated agentic AI system is no longer an academic dream—it’s practical and achievable with today’s AWS technologies. By using Amazon Nova as a powerful multimodal LLM and Amazon Bedrock Data Automation for multimodal data processing, along with frameworks for tool orchestration like LangGraph (or Amazon Bedrock Agents), developers get a head start. Many challenges (like OCR, document parsing, or conversational orchestration) are handled by these managed services or libraries, so you can focus on the business logic and domain-specific needs.
The solution presented in the BDA_nova_agentic sample notebook is a great starting point to experiment with these ideas. We encourage you to try it out, extend it, and tailor it to your organization’s needs. We’re excited to see what you will build—the techniques discussed here represent only a small portion of what’s possible when you combine modalities and intelligent agents.

About the authors
Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services, currently focused on the Amazon Bedrock team. Her core expertise lies in agentic AI, where she explores the capabilities of foundation models and AI agents to drive productivity in Generative AI applications. With a background in Generative AI, Applied Data Science, and IoT architecture, she partners with customers—from startups to large enterprises—to design and deploy impactful AI solutions.
Rui Cardoso is a Partner Solutions Architect at Amazon Web Services (AWS), focusing on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions on AWS. When not working, he enjoys cycling, hiking, and learning new things.
Jessie-Lee Fry is a Product and Go-to Market (GTM) Strategy executive specializing in Generative AI and Machine Learning, with over 15 years of global leadership experience in Strategy, Product, Customer success, Business Development, Business Transformation and Strategic Partnerships. Jessie has defined and delivered a broad range of products and cross-industry go-to-market strategies driving business growth, while maneuvering market complexities and C-Suite customer groups. In her current role, Jessie and her team focus on helping AWS customers adopt Amazon Bedrock at scale across enterprise use cases and adoption frameworks, meeting customers where they are in their Generative AI journey.