How to Reduce Cost and Latency of Your RAG Application Using Semantic LLM Caching

Semantic caching in LLM (Large Language Model) applications optimizes performance by storing and reusing responses based on semantic similarity rather than exact text matches. When a new query arrives, it’s converted into an embedding and compared with cached ones using similarity search. If a close match is found (above a similarity threshold), the cached response is returned instantly—skipping the expensive retrieval and generation process. Otherwise, the full RAG pipeline runs, and the new query-response pair is added to the cache for future use.

In a RAG setup, semantic caching typically saves responses only for questions that have actually been asked, not every possible query. This helps reduce latency and API costs for repeated or slightly reworded questions. In this article, we’ll take a look at a short example demonstrating how caching can significantly lower both cost and response time in LLM-based applications.

How Semantic Caching in LLMs Works

Semantic caching functions by storing and retrieving responses based on the meaning of user queries rather than their exact wording. Each incoming query is converted into a vector embedding that represents its semantic content. The system then performs a similarity search—often using Approximate Nearest Neighbor (ANN) techniques—to compare this embedding with those already stored in the cache. 

If a sufficiently similar query-response pair exists (i.e., its similarity score exceeds a defined threshold), the cached response is returned immediately, bypassing expensive retrieval or generation steps. Otherwise, the full RAG pipeline executes, retrieving documents and generating a new answer, which is then stored in the cache for future use.

What Gets Cached in Memory

In a RAG application, semantic caching only stores responses for queries that have actually been processed by the system—there’s no pre-caching of all possible questions. Each query that reaches the LLM and produces an answer can create a cache entry containing the query’s embedding and corresponding response. 

Depending on the system’s design, the cache may store just the final LLM outputs, the retrieved documents, or both. To maintain efficiency, cache entries are managed through policies like time-to-live (TTL) expiration or Least Recently Used (LRU) eviction, ensuring that only recent or frequently accessed queries remain in memory over time.
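As a rough illustration of these eviction policies, the following minimal sketch combines TTL expiration and LRU eviction around entries that hold a query embedding and its cached response. The class name, entry limit, and timeout are illustrative choices, not part of any particular library:

import time
from collections import OrderedDict

class SemanticCacheStore:
    """Toy cache manager with TTL expiration and LRU eviction.
    Each entry holds (timestamp, query embedding, cached response)."""

    def __init__(self, max_entries=1000, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._store = OrderedDict()

    def put(self, key, embedding, response):
        self._store[key] = (time.time(), embedding, response)
        self._store.move_to_end(key)           # mark as most recently used
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)    # evict the least recently used entry

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        ts, embedding, response = item
        if time.time() - ts > self.ttl:        # expired under the TTL policy
            del self._store[key]
            return None
        self._store.move_to_end(key)           # refresh LRU position on access
        return embedding, response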

How Semantic Caching Works: Explained with an example

Installing dependencies

pip install openai numpy

Setting up the dependencies

import os
from getpass import getpass

os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

For this tutorial, we will be using OpenAI, but you can use any LLM provider.

from openai import OpenAI

client = OpenAI()

Running Repeated Queries Without Caching

In this section, we run the same query 10 times directly through the GPT-4.1 model to observe how long it takes when no caching mechanism is applied. Each call triggers a full LLM computation and response generation, leading to repetitive processing for identical inputs.

This helps establish a baseline for total time and cost before we implement semantic caching in the next part.

import time

def ask_gpt(query):
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    return response.output[0].content[0].text, end - start

query = "Explain the concept of semantic caching in just 2 lines."
total_time = 0

for i in range(10):
    _, duration = ask_gpt(query)
    total_time += duration
    print(f"Run {i+1} took {duration:.2f} seconds")

print(f"\nTotal time for 10 runs: {total_time:.2f} seconds")

Even though the query remains the same, every call still takes between 1 and 3 seconds, for a total of about 22 seconds across 10 runs. This inefficiency highlights why semantic caching is so valuable: it lets us reuse previous responses for semantically identical queries and save both time and API cost.

Implementing Semantic Caching for Faster Responses

In this section, we enhance the previous setup by introducing semantic caching, which allows our application to reuse responses for semantically similar queries instead of repeatedly calling the GPT-4.1 API.

Here’s how it works: each incoming query is converted into a vector embedding using the text-embedding-3-small model. This embedding captures the semantic meaning of the text. When a new query arrives, we calculate its cosine similarity with embeddings already stored in our cache. If a match is found with a similarity score above the defined threshold (e.g., 0.85), the system instantly returns the cached response — avoiding another API call.

If no sufficiently similar query exists in the cache, the model generates a fresh response, which is then stored along with its embedding for future use. Over time, this approach dramatically reduces both response time and API costs, especially for frequently asked or rephrased queries.

import numpy as np
from numpy.linalg import norm

semantic_cache = []

def get_embedding(text):
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(emb.data[0].embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

def ask_gpt_with_cache(query, threshold=0.85):
    query_embedding = get_embedding(query)

    # Check similarity with existing cache
    for cached_query, cached_emb, cached_resp in semantic_cache:
        sim = cosine_similarity(query_embedding, cached_emb)
        if sim > threshold:
            print(f"Using cached response (similarity: {sim:.2f})")
            return cached_resp, 0.0  # no API time

    # Otherwise, call GPT
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    text = response.output[0].content[0].text

    # Store in cache
    semantic_cache.append((query, query_embedding, text))
    return text, end - start

queries = [
    "Explain semantic caching in simple terms.",
    "What is semantic caching and how does it work?",
    "How does caching work in LLMs?",
    "Tell me about semantic caching for LLMs.",
    "Explain semantic caching simply.",
]

total_time = 0
for q in queries:
    resp, t = ask_gpt_with_cache(q)
    total_time += t
    print(f"Query took {t:.2f} seconds\n")

print(f"\nTotal time with caching: {total_time:.2f} seconds")

In the output, the first query took around 8 seconds as there was no cache and the model had to generate a fresh response. When a similar question was asked next, the system identified a high semantic similarity (0.86) and instantly reused the cached answer, saving time. Some queries, like “How does caching work in LLMs?” and “Tell me about semantic caching for LLMs,” were sufficiently different, so the model generated new responses, each taking over 10 seconds. The final query was nearly identical to the first one (similarity 0.97) and was served from cache instantly.


Maya1: A New Open Source 3B Voice Model For Expressive Text To Speech On A Single GPU

Maya Research has released Maya1, a 3B parameter text to speech model that turns text plus a short description into controllable, expressive speech while running in real time on a single GPU.

What Maya1 Actually Does

Maya1 is a state of the art speech model for expressive voice generation. It is built to capture real human emotion and precise voice design from text inputs.

The core interface has 2 inputs:

A natural language voice description, for example “Female voice in her 20s with a British accent, energetic, clear diction” or “Demon character, male voice, low pitch, gravelly timbre, slow pacing”.

The text that should be spoken

The model combines both signals and generates audio that matches the content and the described style. You can also insert inline emotion tags inside the text, such as <laugh>, <sigh>, <whisper>, <angry>, <giggle>, <gasp>, <cry> and more than 20 emotions.

Maya1 outputs 24 kHz mono audio and supports real time streaming, which makes it suitable for assistants, interactive agents, games, podcasts and live content.

The Maya Research team claims that the model outperforms top proprietary systems while remaining fully open source under the Apache 2.0 license.

Architecture and SNAC Codec

Maya1 is a 3B parameter decoder only transformer with a Llama style backbone. Instead of predicting raw waveforms, it predicts tokens from a neural audio codec named SNAC.

The generation flow is:

text → tokenize → generate SNAC codes (7 tokens per frame) → decode → 24 kHz audio

SNAC uses a multi scale hierarchical structure at about 12, 23 and 47 Hz. This keeps the autoregressive sequence compact while preserving detail. The codec is designed for real time streaming at about 0.98 kbps.

The important point is that the transformer operates on discrete codec tokens instead of raw samples. A separate SNAC decoder, for example hubertsiuzdak/snac_24khz, reconstructs the waveform. This separation makes generation more efficient and easier to scale than direct waveform prediction.

Training Data And Voice Conditioning

Maya1 is pretrained on an internet scale English speech corpus to learn broad acoustic coverage and natural coarticulation. It is then fine tuned on a curated proprietary dataset of studio recordings that include human verified voice descriptions, more than 20 emotion tags per sample, multiple English accents, and character or role variations.

The documented data pipeline includes:

24 kHz mono resampling with about minus 23 LUFS loudness

Voice activity detection with silence trimming between 1 and 14 seconds

Forced alignment using Montreal Forced Aligner for phrase boundaries

MinHash LSH text deduplication

Chromaprint based audio deduplication

SNAC encoding with 7 token frame packing

The Maya Research team evaluated several ways to condition the model on a voice description. Simple colon formats and key value tag formats either caused the model to speak the description or did not generalize well. The best performing format uses an XML style attribute wrapper that encodes the description and text in a natural way while remaining robust.

In practice, this means developers can describe voices in free form text, close to how they would brief a voice actor, instead of learning a custom parameter schema.

https://huggingface.co/maya-research/maya1

Inference And Deployment On A Single GPU

The reference Python script on Hugging Face loads the model with AutoModelForCausalLM.from_pretrained("maya-research/maya1", torch_dtype=torch.bfloat16, device_map="auto") and uses the SNAC decoder from SNAC.from_pretrained("hubertsiuzdak/snac_24khz").

The Maya Research team recommends a single GPU with 16 GB or more of VRAM, for example A100, H100 or a consumer RTX 4090 class card.
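Putting these pieces together, here is a hedged sketch of local inference based on the calls documented above. The exact wrapper for the voice description and the mapping from generated tokens back to SNAC codebooks are simplified placeholders; the official reference script on Hugging Face defines the real logic:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

# Documented loading calls
model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya1", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya1")
snac_decoder = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# The two inputs: a free-form voice description and the text to speak
description = "Female voice in her 20s with a British accent, energetic, clear diction"
text = "Welcome back! <laugh> Let's pick up where we left off."

# Assumption: an XML-style attribute wrapper; the reference script defines the exact format
prompt = tokenizer(f'<description="{description}"> {text}', return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**prompt, max_new_tokens=1024, do_sample=True, temperature=0.4)

# The generated tokens encode SNAC frames (7 tokens per frame); the reference script
# unpacks them into the three SNAC codebooks and calls snac_decoder to get 24 kHz audio.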

For production, they provide a vllm_streaming_inference.py script that integrates with vLLM. It supports Automatic Prefix Caching for repeated voice descriptions, a WebAudio ring buffer, multi GPU scaling and sub 100 millisecond latency targets for real time use.

Beyond the core repository, they have released:

A Hugging Face Space that exposes an interactive browser demo where users enter text and voice descriptions and listen to output

GGUF quantized variants of Maya1 for lighter deployments using llama.cpp

A ComfyUI node that wraps Maya1 as a single node, with emotion tag helpers and SNAC integration

These projects reuse the official model weights and interface, so they stay consistent with the main implementation.

Key Takeaways

Maya1 is a 3B parameter, decoder only, Llama style text to speech model that predicts SNAC neural codec tokens instead of raw waveforms, and outputs 24 kHz mono audio with streaming support.

The model takes 2 inputs, a natural language voice description and the target text, and supports more than 20 inline emotion tags such as <laugh>, <cry>, <whisper> and <gasp> for local control of expressiveness.

Maya1 is trained with a pipeline that combines large scale English pretraining and studio quality fine tuning with loudness normalization, voice activity detection, forced alignment, text deduplication, audio deduplication and SNAC encoding.

The reference implementation runs on a single 16 GB plus GPU using torch_dtype=torch.bfloat16, integrates with a SNAC decoder, and has a vLLM based streaming server with Automatic Prefix Caching for low latency deployment.

Maya1 is released under the Apache 2.0 license, with official weights, Hugging Face Space demo, GGUF quantized variants and ComfyUI integration, which makes expressive, emotion rich, controllable text to speech accessible for commercial and local use.

Editorial Comments

Maya1 pushes open source text to speech into territory that was previously dominated by proprietary APIs. A 3B parameter Llama style decoder that predicts SNAC codec tokens, runs on a single 16 GB GPU with vLLM streaming and Automatic Prefix Caching, and exposes more than 20 inline emotions with natural language voice design, is a practical building block for real time agents, games and tools. Overall, Maya1 shows that expressive, controllable TTS can be both open and production ready.


Meta AI Releases Omnilingual ASR: A Suite of Open-Source Multilingual Speech Recognition Models for 1600+ Languages

How do you build a single speech recognition system that can understand thousands of languages, including many that never had working ASR (automatic speech recognition) models before? Meta AI has released Omnilingual ASR, an open source speech recognition suite that scales to more than 1,600 languages and can be extended to unseen languages with only a few speech text examples, without retraining the model.

Data and language coverage

The supervised training data comes from a combined corpus called AllASR. AllASR contains 120,710 hours of labeled speech paired with transcripts across 1,690 languages. This corpus merges several sources, including open source datasets, internal and licensed corpora, partner created data, and a commissioned collection called the Omnilingual ASR Corpus.

The Omnilingual ASR Corpus contributes 3,350 hours of speech for 348 languages, with data collected through field work with local organizations and speakers in regions such as Africa and South Asia. Prompts are open ended, so speakers produce natural monologues in their own language instead of reading fixed sentences, which gives more realistic acoustic and lexical variation.

https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/

For self supervised pre training, the wav2vec 2.0 encoders are trained on a large unlabeled speech corpus. The pre training dataset contains 3.84M hours of speech with language identification across 1,239 languages, plus another 460K hours without language identification. The total unlabeled audio used for pre training is therefore about 4.3M hours. This is still significantly smaller than the 12M hours used by USM, which makes the reported results more interesting from a data efficiency perspective.


Model family

Omnilingual ASR exposes 3 main model families that all share the same wav2vec 2.0 speech encoder backbone:

SSL encoders (OmniASR W2V): Self supervised wav2vec 2.0 encoders with the following parameter counts:

omniASR_W2V_300M with 317,390,592 parameters
omniASR_W2V_1B with 965,514,752 parameters
omniASR_W2V_3B with 3,064,124,672 parameters
omniASR_W2V_7B with 6,488,487,168 parameters

These models are trained with the standard wav2vec 2.0 contrastive objective. After training, the quantizer is discarded and the encoder is used as a speech representation backbone.

CTC (connectionist temporal classification) ASR models: These models add a simple linear layer on top of the encoder and train end to end with a character level CTC loss. The released CTC models range from 325,494,996 parameters to 6,504,786,132 parameters and reach real time factors as low as 0.001 for the 300M model on an A100 for 30 second audio with batch size 1.

LLM ASR models: LLM ASR stacks a Transformer decoder on top of the wav2vec 2.0 encoder. The decoder is a language-model-style Transformer that operates on character level tokens plus special tokens such as <BOS> and <EOS>. Training uses standard next token prediction on sequences of the form gs(x), gt(<BOS>), gt(y), gt(<EOS>), where gs is the speech encoder and gt is the text embedding matrix. The LLM ASR family ranges from about 1.63B parameters for omniASR_LLM_300M to 7,801,041,536 parameters for omniASR_LLM_7B. A separate omniASR_LLM_7B_ZS checkpoint with 7,810,900,608 parameters is used for zero shot ASR.

All LLM ASR models support optional language conditioning. Languages are represented as {language_code}_{script} such as eng_Latn for English in Latin script or cmn_Hans for Mandarin Chinese in Simplified Chinese script. A learned embedding for the language script identifier is injected into the decoder input. In training, the language ID token is sometimes dropped, so the model can also operate without explicit language tags at inference.

Zero shot ASR with context examples and SONAR

The supervised models cover more than 1,600 languages. However, many languages still have no transcribed ASR data. To handle these cases, Omnilingual ASR extends the LLM ASR model with a zero shot mode trained with context examples.

During training for the zero shot variant, the decoder consumes N + 1 speech text pairs from the same language. The first N pairs act as context and the final pair is the target. All pairs are embedded with the speech encoder and text embedding matrix, then concatenated into a single decoder input sequence. The loss is still next token prediction on the target transcription. This teaches the decoder to infer the mapping from speech to text in a given language from a small prompt of in language examples.

At inference, the omniASR_LLM_7B_ZS model can receive a few speech text examples from any language, including languages not present in training, and then transcribe new utterances in that language without updating weights. This is in context learning for ASR.
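The decoder input described above can be sketched as follows. This is only an illustration of the sequence layout gs(x), gt(<BOS>), gt(y), gt(<EOS>); gs, gt, and the special token IDs are placeholders standing in for the released model's speech encoder, text embedding matrix, and vocabulary, not actual Omnilingual ASR APIs:

import torch

def build_zero_shot_decoder_input(context_pairs, target_audio, gs, gt, bos_id, eos_id):
    # context_pairs: list of (waveform, transcript_token_ids) from the prompt language
    segments = []
    for waveform, text_ids in context_pairs:
        segments.append(gs(waveform))                 # speech embeddings gs(x_i)
        segments.append(gt(torch.tensor([bos_id])))   # gt(<BOS>)
        segments.append(gt(text_ids))                 # gt(y_i)
        segments.append(gt(torch.tensor([eos_id])))   # gt(<EOS>)
    # The target utterance: the decoder continues after <BOS> to emit its transcription
    segments.append(gs(target_audio))
    segments.append(gt(torch.tensor([bos_id])))
    return torch.cat(segments, dim=0)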

The system includes an example retrieval mechanism based on SONAR, a multilingual multimodal encoder that projects audio and text into a shared embedding space. The target audio is embedded once, then nearest neighbor search over a database of speech text pairs selects the most relevant examples to include in the context window. This SONAR based selection improves zero shot performance compared with random example selection or simple text similarity.
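A brute-force stand-in for that retrieval step might look like the following, where target_emb and example_embs are audio and speech-text-pair embeddings from a shared SONAR-style space (the real system would use a proper nearest neighbor index rather than a full scan):

import numpy as np

def select_context_examples(target_emb, example_embs, examples, n_examples=5):
    # Cosine similarity between the target audio embedding and every candidate pair
    sims = example_embs @ target_emb / (
        np.linalg.norm(example_embs, axis=1) * np.linalg.norm(target_emb) + 1e-9
    )
    top = np.argsort(-sims)[:n_examples]   # indices of the most similar pairs
    return [examples[i] for i in top]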


Quality and benchmarks

The omniASR_LLM_7B model achieves character error rate below 10 percent for 78 percent of the more than 1,600 supported languages.

The research team reports that on multilingual benchmarks such as FLEURS 102, the 7B LLM ASR model outperforms the 7B CTC models and also surpasses Google USM variants in average character error rate, despite using about 4.3M unlabeled hours instead of 12M and a simpler pre training pipeline. This suggests that scaling the wav2vec 2.0 encoder and adding an LLM style decoder is an effective path for high coverage multilingual ASR.

Key Takeaways

Omnilingual ASR provides open source ASR coverage for more than 1,600 languages and can generalize to more than 5,400 languages using zero shot in context learning.

The models are built on large scale wav2vec 2.0 encoders trained on about 4.3M hours of unlabeled audio from 1,239 labeled languages plus additional unlabeled speech.

The suite includes wav2vec 2.0 encoders, CTC ASR, LLM ASR, and a dedicated zero shot LLM ASR model, with encoder sizes from 300M to 7B parameters and LLM ASR up to about 7.8B parameters.

The 7B LLM ASR model achieves character error rate below 10 percent on 78 percent of the more than 1,600 supported languages, which is competitive with or better than prior multilingual systems in low resource settings.

Editorial Comments

Omnilingual ASR is a significant systems level contribution because it treats multilingual ASR as an extensible framework rather than a fixed language list. It combines a 7B wav2vec 2.0 encoder, CTC and LLM ASR decoders, and a zero shot LLM ASR model that can adapt to new languages with a few in context examples, while achieving character error rate below 10 percent on 78 percent of more than 1,600 supported languages and releasing everything under Apache 2.0 and CC BY 4.0. Overall, this launch establishes Omnilingual ASR as the most extensible open source speech recognition model currently available.


Introducing agent-to-agent protocol support in Amazon Bedrock AgentCore Runtime

We recently announced the support for Agent-to-Agent (A2A) protocol on Amazon Bedrock AgentCore Runtime. With this addition, agents can discover peers, share capabilities, and coordinate actions across platforms using standardized communication.
Amazon Bedrock AgentCore Runtime provides a secure, serverless environment designed for deploying AI agents and tools. It works with any framework and model, supports real-time and long-running workloads, and provides session isolation with built-in authentication. With support for MCP, and now the A2A protocol, Bedrock AgentCore Runtime enables seamless communication between agents. Agents built with different frameworks, such as Strands Agents, the OpenAI Agents SDK, LangGraph, Google ADK, or the Claude Agents SDK, can share context, capabilities, and reasoning in a common, verifiable format.
In this post, we demonstrate how you can use the A2A protocol for AI agents built with different frameworks to collaborate seamlessly. You’ll learn how to deploy A2A servers on AgentCore Runtime, configure agent discovery and authentication, and build a real-world multi-agent system for incident response. We’ll cover the complete A2A request lifecycle, from agent card discovery to task delegation, showing how standardized protocols eliminate the complexity of multi-agent coordination.
Understanding multi-agent systems
Building effective agentic systems requires several foundational components. These include memory, both short-term for maintaining conversation context and long-term for retaining insights across sessions; tools that agents can access either natively or through MCP servers; identity for more secure authentication and permission management, allowing agents to act on behalf of users or autonomously access resources; and guardrails to detect harmful content, help prevent hallucinations, and make sure responses align with policies and factual accuracy.

While MCP connects a single agent to its tools and data, A2A lets multiple agents coordinate with one another. For example, a retail inventory agent might use MCP to query product databases, then use A2A to communicate with external supplier agents to place orders.
The A2A protocol brings benefits to multi-agent systems through seamless interoperability across diverse boundaries. Agents built with different frameworks like Strands or OpenAI, powered by various LLMs such as Anthropic Claude, GPT-4, or Llama, and hosted on different systems including AWS or edge devices can communicate and coordinate effortlessly without requiring complex translation layers. This interoperability is complemented by loose coupling and modularity, where each agent operates as an independent unit that can be developed, tested, deployed, and even upgraded without disrupting the entire system. New specialized agents can join the environment seamlessly, and the failure of one agent remains isolated due to well-defined interaction boundaries, helping prevent cascading failures across the system. The protocol also supports dynamic agent discovery and orchestration. Agents advertise their capabilities through standardized schemas while orchestrator agents can discover and invoke specialized agents based on real-time task requirements.
A2A request lifecycle on Amazon Bedrock AgentCore Runtime
The A2A protocol defines a structured request lifecycle with specific components that work together to coordinate multi-agent communication. Here are the key elements:

User: Initiates requests through the Client Agent, either as a human operator or automated service defining goals that require multi-agent assistance.
A2A Client (Client Agent): Acts on behalf of the user, initiating communication using the A2A protocol to discover and request tasks from remote agents.
A2A Server (Remote Agent): Exposes HTTP endpoints implementing the A2A protocol to receive requests, process tasks, and return results. Different agents can serve this role, handling both synchronous and asynchronous interactions using JSON-RPC 2.0 over HTTP/S or Server-Sent Events.
Agent Card: A JSON metadata file that each agent publishes to advertise its identity, capabilities, endpoints, and authentication requirements. This enables the dynamic discovery feature, where agents query what their peer agents can do before delegating tasks (see the short discovery sketch after this list).
Task Object: Represents each unit of work flowing through the system with a unique ID and lifecycle. As agents coordinate, tasks may be long-running, involve multiple turns, and span several agents working together.
Artifact: The output produced when a task completes, which can include structured text, JSON, images, audio, or other multimodal content. Agents exchange these artifacts as they collaborate to fulfill the user’s original request.
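As a hedged illustration of the discovery step, a client can fetch a remote agent's Agent Card from the well-known path and inspect its advertised skills before delegating work. The base URL below is a placeholder, and the exact card fields depend on the A2A version the server implements:

import requests

agent_base_url = "https://example.com/my-remote-agent"   # hypothetical A2A server endpoint

# Agent Cards are published at a well-known path so peers can discover them automatically
card = requests.get(f"{agent_base_url}/.well-known/agent-card.json", timeout=10).json()

print(card.get("name"), "-", card.get("description"))
for skill in card.get("skills", []):
    print("skill:", skill.get("id"), skill.get("description"))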

Multi-agent use case: Monitoring and incident response
To demonstrate the power of multi-agent systems using A2A on Amazon Bedrock AgentCore Runtime, we’ll walk through an enterprise monitoring and incident response solution. This real-world use-case showcases how specialized agents built with different frameworks coordinate seamlessly to handle complex operational challenges through the A2A protocol.
The monitoring and incident response solution implements a hub-and-spoke architecture with three specialized agents, each using Amazon Bedrock AgentCore features – modular building blocks that provide core capabilities like AgentCore Memory for context-aware responses, AgentCore Identity using Amazon Cognito for more secure authentication for agents and what action each agent can perform, AgentCore Gateway for more secure and centralized access to tools, and observability to trace, debug, and monitor AI agents’ performance. View the architecture and demonstration video below for reference:

The multi-agent system contains the following components:

Host agent (Google ADK): Acts as the intelligent routing layer and coordination hub for the agent interactions. Demonstrates the cross-system interoperability using A2A. This agent runs on Amazon Bedrock AgentCore Runtime using Google’s Agent Development Kit, yet communicates seamlessly with agents hosted on AWS through the standardized A2A protocol. Key responsibilities of the host agent include:

Dynamic agent discovery: Fetches Identity Provider (IDP) configuration from AWS Systems Manager Parameter Store for each remote agent, enabling more secure authentication across the multi-agent system
Capability awareness: Retrieves agent cards from each A2A server to understand available skills and endpoints
Intelligent routing: Analyzes user queries and routes them to the appropriate specialist agent based on capabilities
Multi-agent coordination: Orchestrates complex workflows requiring multiple agents

Monitoring agent (Strands Agents SDK): Serves as the operational intelligence layer, continuously analyzing CloudWatch logs, metrics, dashboards, and alarms across AWS services. This agent specializes in identifying anomalies, tracking error patterns, and surfacing actionable insights from vast amounts of telemetry data. When unusual patterns emerge, the monitoring agent initiates conversations with other specialized agents to coordinate response actions. Key responsibilities of the monitoring agent include:

CloudWatch integration:

Lists and analyzes CloudWatch dashboards
Fetches logs for specific AWS services (Lambda, ECS, EC2)
Monitors alarms and alert states
Analyzes log groups for patterns and errors

Cross-account access: Supports monitoring across multiple AWS accounts

Operational agent (OpenAI SDK): Provides remediation strategies and external knowledge integration. When the monitoring agent detects a critical issue, it communicates directly with the operational agent through A2A, providing context about the problem and requesting specific remediation actions. Key responsibilities of the operational agent include:

Web search: Uses Tavily API to search for AWS best practices, troubleshooting guides, and solutions
Remediation strategies: Proposes solutions based on detected issues

Implementing the multi-agent monitoring solution
Now that we’ve explored how these three specialized agents collaborate to handle AWS incidents, let’s walk through how to build and deploy this multi-agent system using Amazon Bedrock AgentCore Runtime.
The implementation follows a progressive approach:

Start with the foundation – We’ll deploy a simple A2A server to understand the core mechanics of agent deployment, authentication, and invocation on AgentCore Runtime
Build the monitoring system – Using the same deployment patterns, we’ll construct each specialized agent (Monitoring, Operational, and Host) with their specific tools and capabilities
Connect the agents – Configure A2A communication channels between agents, enabling them to discover and invoke each other through standardized protocols
Observe the system in action – Watch the demo video showing real-time incident detection, cross-agent coordination, and automated response

All code examples, complete agent implementations, and deployment scripts for this multi-agent monitoring system are available in our GitHub repository.
Getting started with A2A on AgentCore Runtime
To understand the fundamentals of deploying A2A servers on Amazon Bedrock AgentCore Runtime, including step-by-step instructions for creating, testing, deploying, and invoking agents, refer to the A2A Protocol Support documentation. This guide covers:

Creating and configuring A2A servers with any framework (Strands, OpenAI SDK, LangGraph)
Local testing and validation
Deployment using the AgentCore CLI
Authentication setup (OAuth 2.0 and AWS IAM)
Agent Card retrieval and discovery
Client implementation for invoking deployed agents

Once you’re familiar with these fundamentals, you can apply the same patterns to build each component of the multi-agent monitoring system.
View the full example in this GitHub sample. For this post, we will focus on this use case implementation.
Prerequisites
To deploy the multi-agent monitoring system implementation, follow the prerequisite steps:

AWS account: You need an active AWS account with appropriate permissions

Create an AWS account
AWS Management Console access

AWS CLI: Install and configure AWS CLI with your credentials

Install AWS CLI
Configure AWS CLI

Install uv.
Supported Regions: This solution is currently tested and supported in the following AWS Regions.

Note: To deploy in other Regions, you’ll need to update the DynamoDB prefix list mappings in cloudformation/vpc-stack.yaml. See the VPC Stack documentation for details.
Deployment steps
This guide walks you through deploying a multi-agent system on AWS using infrastructure-as-code. The easiest way to deploy this solution is using our automated deployment script:
Step 1: Clone the repository

git clone https://github.com/awslabs/amazon-bedrock-agentcore-samples.git
cd amazon-bedrock-agentcore-samples/02-use-cases/A2A-multi-agent-incident-response

Step 2: Run the deployment script
This deployment script will verify that the AWS CLI is installed and configured, check if the AWS credentials are valid, confirm that the Region is set to us-west-2, interactively collect the required parameters, generate unique S3 bucket names and automatically deploy all stacks in the correct order. The approximate deployment time is 10-15 minutes.

uv run deploy.py

Step 3: Provide the runtime CLI parameters
Next, provide the parameters used at deployment. Press enter for each of the options to use the default Amazon Bedrock model ID and the CloudFormation stack names for each of the agents.
API keys: You’ll need the following API keys (the deployment script will prompt for these):

OpenAI API key: Get it from OpenAI Platform
Tavily API key: Get it from Tavily
Google API key: Get it from Google AI Studio

Once you have configured the information, start the deployment process and track it below in the AWS Console and terminal respectively.
Step 4: Run the frontend
Run the frontend using following commands. This sets up and runs the React frontend UI that allows users to interact with the multi-agent incident response system for monitoring AWS infrastructure, querying CloudWatch metrics and logs, and searching for remediation strategies through the coordinated A2A agents.

cd frontend
npm install

chmod +x ./setup-env.sh
./setup-env.sh

npm run dev

This deployment creates a multi-agent A2A system with three specialized AI agents running on Amazon Bedrock AgentCore Runtime and orchestrated using the A2A protocol. The Cognito stack provisions OAuth 2.0-based machine-to-machine authentication by creating a Cognito user pool with four distinct client applications (WebSearch, Monitoring, Gateway, and Host Agent clients).
The monitoring agent (built with the Strands SDK) connects to CloudWatch metrics and logs through an AgentCore Gateway using a Smithy model definition, with custom semantic memory strategies for incident tracking.
The operations agent (built with OpenAI Agents SDK) interfaces with Tavily API for remediation research and the host agent (built with Google ADK) acts as the coordinator using HTTP protocol to delegate tasks to the two specialized A2A agents.
End-to-end incident response workflow
In this section, we will walk through an end-to-end workflow where the host agent manages conversations, gathers requirements from the user, and selects the best agent to route each request to (the monitoring or operations agent). The monitoring and operations agents expose their agent cards, which the host agent uses for orchestration. In this example, we will test with simple error analysis from various log groups and search for remediation strategies.

The workflow includes the following steps:

Initial greeting: The user sends a greeting message asking “Hi! How are you?” to the host agent. The host agent processes the request. The host agent responds back to the user with a friendly greeting saying “I’m doing well, thank you!”
Capabilities query: The user asks the host agent “What are your capabilities?” to understand what the agent can do. The host agent explains to the user that it is an orchestration agent designed for AWS monitoring and operations based on the remote agent connections that it has access to.
List log groups and dashboards: The user requests the host agent to list the log groups and dashboards in their AWS account. The host agent recognizes this is a monitoring task and executes the transfer_to_agent tool to delegate the work. The request is transferred from the host agent to the monitoring agent for specialized handling. The monitoring agent communicates using the Agent-to-Agent (A2A) JSON-RPC transport protocol. The monitoring agent retrieves the information and returns results showing 0 dashboards and 153 log groups found in the account. The host agent receives the results from the monitoring agent and displays the dashboards and log groups information to the user.
Analyze specific log group: The user requests the host agent to look for errors in a specific log group at path /aws/bedrock-agentcore/runtimes/hostadk-<runtimeId>-DEFAULT. The host agent determines this requires monitoring expertise and executes the transfer_to_agent tool. The request is transferred to the monitoring agent with instructions to analyze the specified log group for errors. The monitoring agent analyzes the log group and discovers 9 errors and 18 warnings, specifically identifying OTLP Export Failures. The host agent receives the analysis results and displays a detailed error analysis report to the user.
Debug and fix recommendations: The user asks the host agent to debug the errors and provide a report on the fixes needed. The request is transferred to the operations agent to search for solutions related to OTLP export failures. The operations agent uses the A2A JSON-RPC transport to run the request and performs a web search to provide a solution.

Security with A2A on Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore Runtime supports two authentication methods for securing A2A communication:
OAuth 2.0 authentication: The A2A client authenticates with an external authorization server to obtain a JSON Web Token (JWT), which is then included with all requests to the A2A server. This token-based approach enables secure, standardized authentication using either machine-to-machine (M2M) credentials or user federation, allowing the A2A server to verify the client’s identity and enforce access controls based on the token’s claims.
AWS IAM authentication: The A2A client assumes an IAM role with permissions to invoke the A2A server’s agent. This approach leverages AWS SigV4 request signing and IAM policies to control access, alleviating the need for external token management while providing fine-grained permissions.
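For the OAuth 2.0 path, a minimal machine-to-machine sketch against a Cognito token endpoint (as provisioned by the Cognito stack in this example) could look like the following. The domain, client credentials, scope, and header usage are placeholders for values created during deployment:

import requests

# 1. Obtain a JWT with the client_credentials grant (placeholder values)
token_url = "https://<your-cognito-domain>.auth.us-west-2.amazoncognito.com/oauth2/token"
token_resp = requests.post(
    token_url,
    data={"grant_type": "client_credentials", "scope": "<configured-scope>"},
    auth=("<client_id>", "<client_secret>"),
)
jwt = token_resp.json()["access_token"]

# 2. Include the bearer token with every JSON-RPC request sent to the A2A server
headers = {"Authorization": f"Bearer {jwt}", "Content-Type": "application/json"}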
What is supported in Amazon Bedrock AgentCore Runtime with A2A
Amazon Bedrock AgentCore Runtime provides comprehensive support for A2A communication. View some of the capabilities supported:

Stateless server: Amazon Bedrock AgentCore Runtime can host A2A servers that expose an HTTP interface, running a stateless HTTP server on port 9000 and supporting JSON-RPC messaging. The runtime acts as a transparent proxy, passing JSON-RPC requests and responses unchanged to preserve protocol fidelity.
Authenticated agent cards: Supports an authenticated agent card at /.well-known/agent-card.json containing the agent's capabilities and skills, allowing other agents to discover it automatically.
Authentication with secure inbound auth: Amazon Bedrock AgentCore Runtime supports secure authentication via AWS SigV4 and OAuth 2.0, making sure the agent-to-agent communication is authorized and secure. The A2A server authenticates every incoming request using the credentials provided in the HTTP headers, leveraging Amazon Bedrock AgentCore Identity.
Authorization with secure outbound auth: Amazon Bedrock AgentCore Runtime enables secure outbound authorization through both IAM execution roles and AgentCore Identity. Each agent assumes a defined IAM execution role, granting it the necessary permissions to access AWS resources more securely. For interactions with external services, agents can use Amazon Bedrock AgentCore Identity, which provides managed OAuth 2.0 support for third-party identity providers such as Google, GitHub, Slack, and more.
VPC connectivity: You can configure Amazon Bedrock AgentCore Runtime to connect to resources in your Amazon Virtual Private Cloud (VPC). By configuring VPC connectivity, you enable secure access to private resources such as databases, internal APIs, and services within your VPC.
Leverage AWS PrivateLink: Amazon Bedrock AgentCore enables secure, private connections between your Virtual Private Cloud (VPC) and AgentCore services using AWS PrivateLink. By creating interface VPC endpoints, you can keep A2A server communication within your VPC without traversing the public internet.
Lifecycle management: Amazon Bedrock AgentCore Runtime lets you configure lifecycle rules to manage resource usage with idleRuntimeSessionTimeout and maxLifetime. Idle or long-running sessions are automatically terminated for efficient resource utilization and to maintain system performance.

Conclusion
The Agent-to-Agent protocol support in Amazon Bedrock AgentCore Runtime provides the foundation for building scalable, interoperable multi-agent systems. By providing standardized communication between AI agents, regardless of their underlying framework, model, or hosting infrastructure, organizations can compose sophisticated agentic solutions with the A2A protocol. The AWS monitoring and incident response example demonstrates the practical power of this approach: a Google ADK-based orchestrator coordinating with Strands and OpenAI SDK agents, all deployed on AgentCore Runtime, working together to detect issues, search for solutions, and recommend fixes. This level of interoperability would traditionally require extensive custom integration work, but A2A makes it straightforward through standardized protocols.

As AI systems continue to evolve from single-purpose tools to collaborative environments, protocols like A2A and MCP become essential building blocks. They create a future where agents can be discovered, composed, and orchestrated dynamically, enabling organizations to build once and integrate anywhere.

About the authors
Madhur Prashant is an Applied Generative AI Architect at Amazon Web Services. He is passionate about the intersection of human thinking and Agentic AI. His interests lie in generative AI, cognitive science and specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin and playing the guitar.
Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Sriharsha M S is a Principal Gen AI specialist solution architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to foundational model science and agentic AI applications at scale. His expertise spans application hardware accelerators, architecture, big data, analytics and machine learning.
Jeffrey Burke is an Applied Generative AI Solutions Architect at Amazon Web Services (AWS), where he specializes in designing and implementing cutting-edge generative AI solutions for enterprise customers. With a passion for teaching complex technologies, he focuses on translating sophisticated AI concepts into practical, scalable solutions that drive business value. He has a MS in Data Science and BS in Chemical Engineering.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using Generative AI to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Deep Learning, and he is a researcher studying the use of Machine Learning and Reinforcement Learning for accelerating learning and optimization tasks. Shreyas is also an Amazon best-selling book author with several research papers and patents to his name.
Andy Palmer is a Director of Technology for AWS Strategic Accounts. His teams provide Specialist Solutions Architecture skills across a number of speciality domain areas, including AIML, generative AI, data and analytics, security, network, and open source software. Andy and his team have been at the forefront of guiding our most advanced customers through their generative AI journeys and helping to find ways to apply these new tools to both existing problem spaces and net new innovations and product experiences.
Sayee Kulkarni is a Software Development Engineer on the AWS Bedrock AgentCore service. Her team is responsible for building and maintaining the AgentCore Runtime platform, a foundational component that enables customers to leverage agentic AI capabilities. She is driven by delivering tangible customer value, and this customer-centric focus motivates her work. Sayee played a key role in designing and launching Agent-to-Agent (A2A) capabilities for AgentCore, empowering customers to build sophisticated multi-agent systems that autonomously collaborate to solve complex business challenges.

Powering enterprise search with the Cohere Embed 4 multimodal embeddings model

The Cohere Embed 4 multimodal embeddings model is now available as a fully managed, serverless option in Amazon Bedrock. Users can choose between cross-Region inference (CRIS) and Global cross-Region inference to manage unplanned traffic bursts by utilizing compute resources across different AWS Regions. Real-time information requests and time zone concentrations are example events that can cause inference demand to exceed anticipated traffic.
The new Embed 4 model on Amazon Bedrock is purpose-built for analyzing business documents. The model delivers leading multilingual capabilities and shows notable improvements over Embed 3 across the key benchmarks, making it ideal for use cases such as enterprise search.
In this post, we dive into the benefits and unique capabilities of Embed 4 for enterprise search use cases. We’ll show you how to quickly get started using Embed 4 on Amazon Bedrock, taking advantage of integrations with Strands Agents, S3 Vectors, and Amazon Bedrock AgentCore to build powerful agentic retrieval-augmented generation (RAG) workflows.
Embed 4 advances multimodal embedding capabilities by natively supporting complex business documents that combine text, images, and interleaved text and images into a unified vector representation. Embed 4 handles up to 128,000 tokens, minimizing the need for tedious document splitting and preprocessing pipelines. Embed 4 also offers configurable compressed embeddings that reduce vector storage costs by up to 83% (Introducing Embed 4: Multimodal search for business). Together with multilingual understanding across over 100 languages, enterprises in regulated industries such as finance, healthcare, and manufacturing can efficiently process unstructured documents, accelerating insight extraction for optimized RAG systems. Read about Embed 4 in this launch blog from July 2025 to explore how to deploy on Amazon SageMaker JumpStart.
Embed 4 can be integrated into your applications using the InvokeModel API, and here’s an example of how to use the AWS SDK for Python (Boto3) with Embed 4:
For text-only input:

import boto3
import json

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

# Request body (text1 and text2 are the strings to embed)
body = json.dumps({
    "texts": [text1, text2],
    "input_type": "search_document",
    "embedding_types": ["float"]
})

# Invoke the model
model_id = 'cohere.embed-v4:0'

response = bedrock_runtime.invoke_model(
    modelId=model_id,
    body=body,  # body is already a JSON string
    accept='*/*',
    contentType='application/json'
)

# Parse response
result = json.loads(response['body'].read())

For mixed-modality input:

import base64

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

# Request body (text is the string to embed, image_base64_uri is a base64 data URI)
body = json.dumps({
    "inputs": [
        {
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": image_base64_uri}
            ]
        }
    ],
    "input_type": "search_document",
    "embedding_types": ["int8", "float"]
})

# Invoke the model
model_id = 'cohere.embed-v4:0'

response = bedrock_runtime.invoke_model(
    modelId=model_id,
    body=body,  # body is already a JSON string
    accept='*/*',
    contentType='application/json'
)

# Parse response
result = json.loads(response['body'].read())

For more details, you can check Amazon Bedrock User Guide for Cohere Embed 4.
Enterprise search use case
In this section, we focus on using Embed 4 for an enterprise search use case in the finance industry. Embed 4 unlocks a range of capabilities for enterprises seeking to:

Streamline information discovery
Enhance generative AI workflows
Optimize storage efficiency

Amazon Bedrock provides a fully serverless environment for using foundation models, which removes infrastructure management and simplifies integration with other Amazon Bedrock capabilities. See more details for other possible use cases with Embed 4.
Solution overview
With the serverless experience available in Amazon Bedrock, you can get started quickly without spending too much effort on infrastructure management. In the following sections, we show how to get started with Cohere Embed 4. Embed 4 is already designed with storage efficiency in mind.
We choose Amazon S3 vectors for storage because it is a cost-optimized, AI-ready storage with native support for storing and querying vectors at scale. S3 vectors can store billions of vector embeddings with sub-second query latency, reducing total costs by up to 90% compared to traditional vector databases. We leverage the extensible Strands Agent SDK to simplify agent development and take advantage of model choice flexibility. We also use Bedrock AgentCore because it provides a fully managed, serverless runtime specifically built to handle dynamic, long-running agentic workloads with industry-leading session isolation, security, and real-time monitoring.

Prerequisites
To get started with Embed 4, verify you have the following prerequisites in place:

IAM permissions: Configure your IAM role with necessary Amazon Bedrock permissions, or generate API keys through the console or SDK for testing. For more information, see Amazon Bedrock API keys.
Strands SDK installation: Install the required SDK for your development environment. For more information, see the Strands quickstart guide.
S3 Vectors configuration: Create an S3 vector bucket and vector index for storing and querying vector data. For more information, see the getting started with S3 Vectors tutorial.

Initialize Strands agents
The Strands Agents SDK offers an open source, modular framework that streamlines the development, integration, and orchestration of AI agents. With its flexible architecture, developers can build reusable agent components and create custom tools with ease. The system supports multiple models, giving users freedom to select optimal solutions for their specific use cases. Models can be hosted on Amazon Bedrock, Amazon SageMaker, or elsewhere.
For example, Cohere Command A is a generative model with 111B parameters and a 256K context length. The model excels at tool use which can extend baseline functionality while avoiding unnecessary tool calls. The model is also suitable for multilingual tasks and RAG tasks such as manipulating numerical information in financial settings. When paired with Embed 4, which is purpose-built for highly regulated sectors like financial services, this combination delivers substantial competitive benefits through its adaptability.
We begin by defining a tool that a Strands agent can use. The tool searches for documents stored in S3 using semantic similarity. It first converts the user’s query into a vector with Cohere Embed 4, then returns the most relevant documents by querying the embeddings stored in the S3 vector bucket. Embeddings created from the financial documents were stored in an S3 vector bucket before querying; a short ingestion sketch appears below, followed by the inference-side search tool.
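The ingestion step is not part of the original tool code, so the following is only a hedged sketch: it embeds each document with Embed 4 and writes the vectors with the S3 Vectors put_vectors operation. The bucket name, index name, and document list are placeholders, and the request shape mirrors the query_vectors call used in the search tool:

import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
s3vectors = boto3.client("s3vectors", region_name="us-east-1")

documents = [  # placeholder financial documents
    {"id": "doc_0_en", "language": "en", "text": "Q3 2024 earnings report shows revenue growth of 15% ..."},
]

vectors = []
for doc in documents:
    # Embed the document text with Cohere Embed 4
    resp = bedrock.invoke_model(
        modelId="cohere.embed-v4:0",
        body=json.dumps({
            "texts": [doc["text"]],
            "input_type": "search_document",
            "embedding_types": ["float"]
        }),
        accept="*/*",
        contentType="application/json",
    )
    embedding = json.loads(resp["body"].read())["embeddings"]["float"][0]
    vectors.append({
        "key": doc["id"],
        "data": {"float32": embedding},
        "metadata": {"language": doc["language"], "source_text": doc["text"]},
    })

# Write the embeddings to the S3 vector index (assumed put_vectors request shape)
s3vectors.put_vectors(
    vectorBucketName="my-s3-vector-bucket",
    indexName="my-s3-vector-index-1536",
    vectors=vectors,
)

With the index populated, the inference-side search tool is defined as follows: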

import json

import boto3
from strands import Agent, tool

# S3 Vector search function for financial documents
@tool
def search(query_text: str, bucket_name: str = "my-s3-vector-bucket",
           index_name: str = "my-s3-vector-index-1536", top_k: int = 3,
           category_filter: str = None) -> str:
    """Search financial documents using semantic vector search"""

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    s3vectors = boto3.client("s3vectors", region_name="us-east-1")

    # Generate embedding using Cohere Embed v4
    response = bedrock.invoke_model(
        modelId="cohere.embed-v4:0",
        body=json.dumps({
            "texts": [query_text],
            "input_type": "search_query",
            "embedding_types": ["float"]
        }),
        accept='*/*',
        contentType='application/json'
    )

    response_body = json.loads(response["body"].read())
    embedding = response_body["embeddings"]["float"][0]

    # Query vectors
    query_params = {
        "vectorBucketName": bucket_name,
        "indexName": index_name,
        "queryVector": {"float32": embedding},
        "topK": top_k,
        "returnDistance": True,
        "returnMetadata": True
    }

    if category_filter:
        query_params["filter"] = {"category": category_filter}

    response = s3vectors.query_vectors(**query_params)
    return json.dumps(response["vectors"], indent=2)

We then define a financial research agent that can use the tool to search financial documents. As your use case becomes more complex, more agents can be added for specialized tasks.

# Create financial research agent using Strands
agent = Agent(
    name="FinancialResearchAgent",
    system_prompt="You are a financial research assistant that can search through financial documents, earnings reports, regulatory filings, and market analysis. Use the search tool to find relevant financial information and provide helpful analysis.",
    tools=[search])

Simply using the tool returns the following results. Multilingual financial documents are ranked by semantic similarity to the query about comparing earnings growth rates. An agent can use this information to generate useful insights.

result = search("Compare earnings growth rates mentioned in the documents")
print(result)
[
  {
    "key": "doc_0_en",
    "metadata": {
      "language": "en",
      "source_text": "Q3 2024 earnings report shows revenue growth of 15% year-over-year driven by strong performance in cloud services and AI products",
      "doc_id": 0
    },
    "distance": 0.7292724251747131
  },
  {
    "key": "doc_18_zh",
    "metadata": {
      "source_text": "2024年上半年财务报告显示净利润增长20%,主要受益于金融科技和数字支付业务的快速发展",
      "language": "zh",
      "doc_id": 18
    },
    "distance": 0.7704182863235474
  },
  {
    "key": "doc_12_de",
    "metadata": {
      "language": "de",
      "doc_id": 12,
      "source_text": "Halbjahresbericht 2024 verzeichnet 12% Gewinnsteigerung durch erfolgreiche Digitalisierung und Automatisierungslösungen"
    },
    "distance": 0.7738275527954102
  }
]

The example above relies on the QueryVectors API operation for S3 Vectors, which works well for small documents. The approach can be improved to handle large and complex enterprise documents using chunking and reranking techniques: splitting documents at sentence boundaries preserves semantic coherence, and the resulting chunks are then used to generate embeddings (a minimal chunking sketch follows below).
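The following is an illustrative sentence-boundary chunker, not a prescribed implementation: it splits a document on sentence endings and packs sentences into chunks under a rough size budget before each chunk is embedded:

import re

def chunk_by_sentences(document: str, max_chars: int = 2000) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)      # close the current chunk at a sentence boundary
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

The following API call passes the same query to the Strands agent: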

response = agent("Compare earnings growth rates mentioned in the documents")

The Strands agent uses the search tool we defined to generate an answer for the query about comparing earnings growth rates. The final answer considers the results returned from the search tool:

Agent Query:
I’ll search for information about earnings growth rates in the financial documents to help you compare them.
Tool #1: search
Let me search for more specific earnings growth information and financial performance metrics.
Tool #2: search
Let me search for more comprehensive financial performance data and specific earnings metrics.
Tool #3: search
Based on the financial documents I’ve searched through, here’s a comparison of the earnings growth rates mentioned:

## Key Insights:

1. **Growth Range:** The earnings growth rates span from 12% to 20%, indicating generally healthy performance
across different markets and sectors.

2. **Technology Focus:** All companies showing strong growth are heavily invested in technology sectors
(fintech, AI, cloud services, cybersecurity, automation).

3. **Geographic Diversity:** The strong performers represent different regions (Asia, Europe, North America),
suggesting broad-based growth in tech-enabled services.

4. **Growth Sustainability:** The Chinese fintech company leads with 20% net profit growth, while the others
show strong revenue growth in the 12-18% range.

The data suggests that companies with strong technology components, particularly in emerging areas like AI, fintech, and cybersecurity, are experiencing the most robust earnings growth rates in 2024.

A custom tool like the S3 Vectors search function used in this example is just one of many possibilities. With Strands, it is straightforward to develop and orchestrate autonomous agents, while Amazon Bedrock AgentCore serves as the managed deployment service to host and scale these Strands agents in production.
Deploy to Amazon Bedrock AgentCore
Once an agent is built and tested, it is ready to be deployed. AgentCore Runtime is a secure and serverless runtime purpose-built for deploying and scaling dynamic AI agents. Use the starter toolkit to automatically create the IAM execution role, container image, and Amazon Elastic Container Registry repository to host an agent in AgentCore Runtime. You can define multiple tools available to your agent. In this example, we use the Strands Agent powered by Embed 4:

# Using bedrock-agentcore<=0.1.5 and bedrock-agentcore-starter-toolkit==0.1.14
from bedrock_agentcore_starter_toolkit import Runtime
from boto3.session import Session

boto_session = Session()
region = boto_session.region_name

agentcore_runtime = Runtime()
agent_name = "search_agent"
response = agentcore_runtime.configure(
    entrypoint="example.py",  # Replace with your custom agent and tools
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region=region,
    agent_name=agent_name
)
response
launch_result = agentcore_runtime.launch()
invoke_response = agentcore_runtime.invoke({"prompt": "Compare earnings growth rates mentioned in the documents"})

Clean up
To avoid incurring unnecessary costs when you’re done, empty and delete the S3 Vectors buckets you created, remove any applications that can make requests to the Amazon Bedrock APIs, and delete the launched AgentCore Runtimes and their associated ECR repositories.
For more information, see the documentation for deleting a vector index and deleting a vector bucket, and the documented steps for removing resources created by the Bedrock AgentCore starter toolkit.
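The following is a minimal cleanup sketch. The s3vectors operation names (delete_index, delete_vector_bucket) and the ECR repository name pattern are assumptions; verify them against the S3 Vectors API reference and the AgentCore starter toolkit output before running.

import boto3

region = "us-east-1"
s3vectors = boto3.client("s3vectors", region_name=region)
ecr = boto3.client("ecr", region_name=region)

# Delete the vector index, then the vector bucket that contained it
s3vectors.delete_index(vectorBucketName=bucket_name, indexName=index_name)
s3vectors.delete_vector_bucket(vectorBucketName=bucket_name)

# Delete the ECR repository created by the starter toolkit
# (the repository name below is a guess; check launch_result for the actual value)
ecr.delete_repository(repositoryName=f"bedrock-agentcore-{agent_name}", force=True)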
Conclusion
Embed 4 on Amazon Bedrock is beneficial for enterprises aiming to unlock the value of their unstructured, multimodal data. With support for up to 128,000 tokens, compressed embeddings for cost efficiency, and multilingual capabilities across 100+ languages, Embed 4 provides the scalability and precision required for enterprise search at scale.
Embed 4 has advanced capabilities that are optimized with domain specific understanding of data from regulated industries such as finance, healthcare, and manufacturing. When combined with S3 Vectors for cost-optimized storage, Strands Agents for agent orchestration, and Bedrock AgentCore for deployment, organizations can build secure, high-performing agentic workflows without the overhead of managing infrastructure. Check the full Region list for future updates.
To learn more, check out the Cohere in Amazon Bedrock product page and the Amazon Bedrock pricing page. If you’re interested in diving deeper check out the code sample and the Cohere on AWS GitHub repository.

About the authors
James Yi is a Senior AI/ML Partner Solutions Architect at AWS. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.
Nirmal Kumar is Sr. Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he steers the development of no-code and low-code ML solutions. Outside work, he enjoys travelling and reading non-fiction.
Hugo Tse is a Solutions Architect at AWS, with a focus on Generative AI and Storage solutions. He is dedicated to empowering customers to overcome challenges and unlock new business opportunities using technology. He holds a Bachelor of Arts in Economics from the University of Chicago and a Master of Science in Information Technology from Arizona State University.
Mehran Najafi, PhD, serves as AWS Principal Solutions Architect and leads the Generative AI Solution Architects team for AWS Canada. His expertise lies in ensuring the scalability, optimization, and production deployment of multi-tenant generative AI solutions for enterprise customers.
Sagar Murthy is an agentic AI GTM leader at AWS who enjoys collaborating with frontier foundation model partners, agentic frameworks, startups, and enterprise customers to evangelize AI and data innovations, open source solutions, and enable impactful partnerships and launches, while building scalable GTM motions. Sagar brings a blend of technical solution and business acumen, holding a BE in Electronics Engineering from the University of Mumbai, MS in Computer Science from Rochester Institute of Technology, and an MBA from UCLA Anderson School of Management.
Payal Singh is a Solutions Architect at Cohere with over 15 years of cross-domain expertise in DevOps, Cloud, Security, SDN, Data Center Architecture, and Virtualization. She drives partnerships at Cohere and helps customers with complex GenAI solution integrations.

A guide to building AI agents in GxP environments

Healthcare and life sciences organizations are transforming drug discovery, medical devices, and patient care with generative AI agents. In regulated industries, any system that impacts product quality or patient safety must comply with GxP (Good Practice) regulations, such as Good Clinical Practice (GCP), Good Laboratory Practice (GLP), and Good Manufacturing Practice (GMP). Organizations must demonstrate to regulatory authorities that their AI agents are safe, effective, and meet quality standards. Building AI agents for these GxP environments requires a strategic approach that balances innovation, speed, and regulatory requirements.
AI agents can be built for GxP environments: The key lies in understanding how to build them appropriately based on their risk profiles. Gen AI introduces unique challenges around explainability, probabilistic outputs, and continuous learning that require thoughtful risk assessment rather than blanket validation approaches. The disconnect between traditional GxP compliance methods and modern AI capabilities creates barriers to implementation, increases validation costs, slows innovation speed, and limits the potential benefits for product quality and patient care.
The regulatory landscape for GxP compliance is evolving to address the unique characteristics of AI. Traditional Computer System Validation (CSV) approaches, often with uniform validation strategies, are being supplemented by Computer Software Assurance (CSA) frameworks that emphasize flexible risk-based validation methods tailored to each system’s actual impact and complexity (FDA latest guidance).
In this post, we cover a risk-based implementation framework, practical considerations across different risk levels, the AWS shared responsibility model for compliance, and concrete examples of risk mitigation strategies.
Risk-based implementation framework
Effective GxP compliance for agentic AI systems requires assessing risk based on operational context rather than technology features alone. To support risk classification, the FDA’s CSA Draft Guidance recommends evaluating intended uses across three factors: severity of potential harm, probability of occurrence, and detectability of failures.
In Figure 1, this assessment model combines traditional operational roles with modern risk-based levels. Organizations should assess how AI agents function within workflows and their potential impact on regulated processes.

Figure 1. GxP compliance for AI agents combines traditional role-based levels with CSA’s modern risk-based levels

The same AI agent capability can warrant dramatically different validation approaches depending on how it is deployed. How is the agentic AI being consumed within existing GxP processes? What is the level of human oversight or human-in-the-loop controls? Is the AI agent itself being added as an additional control? What is the potential impact of AI failures on product quality, data integrity, or patient safety?
Consider an AI agent for scientific literature review. When creating literature summaries for internal team meetings, it presents low risk, requiring minimal controls. When scientists use these insights to guide research direction, it becomes medium risk, needing structured controls, such as human review checkpoints. When supporting regulatory submissions for drug approval, it becomes high risk and requires comprehensive controls because outputs directly impact regulatory decisions and patient safety.
This risk-based methodology allows organizations to balance innovation with compliance by tailoring validation efforts to actual risk levels rather than applying uniform controls across all AI implementations.
Implementation considerations
Successful AI agent designs require common controls that apply consistently across risk levels for quality and safety. Organizations should maintain clear records of AI decisions, prove data has not been altered, reproduce results when needed, and manage system updates safely. AWS supports these requirements through qualified infrastructure and various compliance certifications such as ISO, SOC, and NIST. For a more complete list, see our Healthcare & Life Sciences Compliance page. Detailed compliance validation information for Amazon Bedrock AgentCore is available in the compliance documentation. To implement these controls effectively, organizations can refer to the National Institute of Standards and Technology (NIST) AI Risk Management Framework for AI-risk guidance and ALCOA+ principles to promote data integrity.
Shared responsibility model
Successful generative AI cloud implementation in GxP environments requires understanding the division of responsibilities between customers and AWS, as outlined in the Shared Responsibility Model, so organizations can focus on delivering effective and compliance-aligned solutions.
While AWS helps protect the infrastructure that runs the services offered in the AWS Cloud, Table 1 provides practical examples of how AWS can support customers in validating their agentic AI systems.

Focus
Customer responsibilities
How AWS supports

Validation strategy
Design risk-appropriate validation approaches using AWS services for GxP compliance. Establish acceptance criteria and validation protocols based on intended use.
Inherit compliance controls with AWS services such as Amazon Bedrock’s ISO 27001, SOC 1/2/3, FedRAMP, and GDPR/HIPAA eligibility. Support your GxP training requirements through AWS Skill Builder for artificial intelligence and machine learning (AI/ML) and AWS Certified Machine Learning – Specialty. Use infrastructure as code through AWS CloudFormation to support on demand validations and deployments that provide repeatable IQ for your agentic workloads.

GxP procedures
Develop SOPs that integrate AWS capabilities with existing quality management systems. Establish documented procedures for system operation and maintenance.
Build GxP agentic systems with HCLS Landing Zones, which are designed for highly regulated workloads and can augment and support your standard procedure requirements. Augment risk management procedures with Amazon Bedrock AgentCore, which supports end-to-end visibility and runtime requirements for complex multi-step tasks. Use the AWS Certified SysOps Administrator and AWS Certified DevOps Engineer certifications for training requirements and to make sure teams can operationalize and govern procedural compliance on AWS.

User management
Configure IAM roles and permissions aligned with GxP user access requirements. Maintain user access documentation and training records.
Secure AI agents access with AWS IAM and Amazon Bedrock AgentCore Identity to establish fine-grained permissions and enterprise identity integration and use IAM Identity Center to streamline workforce user access.

Performance criteria
Define acceptance criteria and monitoring thresholds for gen AI applications. Establish performance monitoring protocols.
Use Amazon Bedrock Provisioned Throughput for agentic workflows that require consistent, guaranteed performance. Monitor performance with Amazon Bedrock AgentCore Observability and Amazon CloudWatch, which offer customizable alerts and dashboards for end-to-end visibility.

Documentation
Create validation documentation demonstrating how AWS services support GxP compliance. Maintain quality system records.
Use AWS Config to help generate compliance reports of your agentic deployments with conformance packs for HIPAA, 21 CFR Part 11, and GxP EU Annex 11. Store your GxP data with Amazon Simple Storage Service (Amazon S3), which offers enterprise-grade 11 nines of durability with support for versioning and user-defined retention policies.

Provenance
Monitor model versions while maintaining validated snapshots. Version-control prompt templates to facilitate consistent AI interactions, track changes, and maintain records for audit trails. Lock tool dependencies in validated environments.
Control models and data with Amazon Bedrock’s configurable data residency and immutable model versioning. AWS Config provides automated configuration tracking and validation. AWS CloudTrail captures comprehensive audit logging. Reproduce AI behaviors using model versioning with AWS CodePipeline, AWS CodeCommit, and Amazon Bedrock.

The following is an example of what customers might need to implement and what AWS provides when building AI agents (Figure 2):

Figure 2. Gen AI implementation in GxP environments requires understanding the division of responsibilities between customers and AWS.

Let’s demonstrate how these shared responsibilities translate into actual implementation.
Provenance and reproducibility
AWS supports the following:

Amazon Bedrock – Provides immutable model versioning, facilitating reproducible AI behavior across the system lifecycle.
AWS Config – Automatically tracks and validates system configurations, continuously monitoring for drift from validated baselines.
AWS CloudTrail – Generates audit trails with cryptographic integrity, capturing model invocations with complete metadata including timestamps, user identities, and model versions. Infrastructure as Code support through AWS CloudFormation enables version-controlled, repeatable deployments.

Customer responsibility: Organizations must version-control their infrastructure deployments and their prompt templates to ensure consistent AI behavior, and maintain audit trails of prompt changes. Tool dependencies must be tracked and locked to specific versions in validated environments to help prevent unintended updates that could affect AI outputs.
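As one illustration of this responsibility, the following is a minimal sketch of prompt-template version control; it is not an AWS-provided API, and the registry structure is an assumption. Each revision is recorded with a content hash and timestamp so prompt changes leave an auditable trail.

import hashlib
import json
from datetime import datetime, timezone

def register_prompt_version(registry: dict, name: str, template: str) -> str:
    """Append an immutable, hash-identified revision of a prompt template."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    registry.setdefault(name, []).append({
        "version": len(registry[name]) + 1,
        "sha256": digest,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "template": template,
    })
    return digest

registry = {}
register_prompt_version(registry, "literature-review", "Summarize the following papers: {papers}")
print(json.dumps(registry, indent=2))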
Observability and performance metrics
AWS supports the following:

Amazon Bedrock AgentCore – Provides a comprehensive solution for the unique risks that agentic AI introduces, including end-to-end visibility into complex multi-step agent tasks and runtime requirements for orchestrating reasoning chains. Amazon Bedrock AgentCore Observability captures the complete chain of decisions and tool invocations, so that you can inspect an agent’s execution path, audit intermediate outputs, and investigate failures. The Retrieval API for Amazon Bedrock Knowledge Bases enables traceability from retrieved documents to AI-generated outputs.
Amazon CloudWatch – Delivers real-time monitoring with customizable alerts and dashboards, aggregating performance metrics across the agent invocations. Organizations can configure logging levels based on risk, such as basic CloudTrail logging for low-risk applications, detailed AgentCore traces for medium risk, and complete provenance chains for high-risk regulatory submissions.

Customer responsibility: Organizations define acceptance criteria and monitoring thresholds appropriate to their risk level—for example, citation accuracy requirements for our literature review agent. Teams must decide when human-in-the-loop triggers are required, such as mandatory expert review before AI recommendations influence research decisions or regulatory submissions.
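As a concrete illustration (the namespace, metric name, and threshold below are assumptions, not AWS defaults), a custom accuracy metric can be published and alarmed on with Amazon CloudWatch:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# The agent, or a scheduled evaluation job, publishes its measured citation accuracy
cloudwatch.put_metric_data(
    Namespace="GxP/LiteratureReviewAgent",
    MetricData=[{"MetricName": "CitationAccuracy", "Value": 0.97, "Unit": "None"}],
)

# Alarm when accuracy stays below the acceptance criterion for two evaluation periods
cloudwatch.put_metric_alarm(
    AlarmName="literature-review-citation-accuracy",
    Namespace="GxP/LiteratureReviewAgent",
    MetricName="CitationAccuracy",
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=2,
    Threshold=0.95,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # treat missing evidence as a failure
)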
User management, session isolation, and security
AWS supports the following:

Amazon Bedrock AgentCore – Provides session isolation using dedicated microVMs that help prevent cross-contamination between different projects or regulatory submissions. The service supports VPC endpoints to establish private connections between your Amazon VPC and Amazon Bedrock AgentCore resources, allowing for inter-network traffic privacy. All communication with Amazon Bedrock AgentCore endpoints uses HTTPS exclusively across all supported Regions, with no HTTP support, and all communications are digitally signed for authentication and integrity.

Amazon Bedrock AgentCore maintains robust encryption standards with TLS 1.2 minimum requirements (TLS 1.3 recommended) for all API endpoints. Both control plane and data plane traffic are encrypted with TLS protocols and restricted to minimum TLS 1.2 with no unencrypted communication permitted. Amazon Bedrock AgentCore Identity addresses identity complexity with a secure token vault for credentials management, providing fine-grained permissions and enterprise identity integration.
AWS Identity and Access Management (IAM) enables organizations to configure role-based access controls with least-privilege principles. Built-in encryption facilitates data protection both in transit and at rest, while network isolation and compliance certifications (SOC, ISO 27001, HIPAA) support regulatory requirements. Amazon Bedrock offers configurable data residency, allowing organizations to specify regions for data processing.
Customer responsibility: Organizations configure IAM roles and policies aligned with GxP user access requirements, facilitating least-privilege access and proper segregation of duties. Access controls must be documented and maintained as part of the quality management system.
GxP controls for AI agents
The implementation of GxP risk controls for AI agents can be considered through three key phases.
Risk Assessment evaluates the GxP workload against the organization’s risk-based validation framework. Continual quality assurance is maintained through structured feedback loops, ranging from real-time verification (see Continuous Validation) to bi-annual reviews. This process makes sure reviewers are trained against the evolving AI landscape, adapt to user feedback, and apply appropriate intervention criteria. In practice, risk assessments define risk categories and triggers for reassessment.
Control Selection means carefully selecting the minimum required controls based on (1) the risk classification, (2) the specific design attributes, and (3) the operational context of the AI agents. This targeted, risk-adjusted approach makes sure controls align with both technical requirements and compliance objectives. In practice, risk categories drive required and selectable controls. For example, a medium-risk system might require Agent and Prompt Governance controls along with two or more Detective Controls, while a high-risk system might require Traditional Testing (IQ, OQ, PQ) controls and two additional corrective controls.
Continuous Validation is an approach that includes the traditional fit-for-intended-use validation plus a subsequent process that leverages real-world data (RWD), such as operational logs and/or user feedback, to create supplemental real-world evidence (RWE) that the system maintains a validated state. As a control mechanism itself, the Continuous Validation approach helps address modern cloud-based designs, including SaaS models, model drift, and evolving cloud infrastructure. Through ongoing monitoring of performance and functionality, this approach helps maintain system GxP compliance while supporting regulatory inspections. In practice, this ranges from a compliance-aligned user portal that tracks issue trends for low-risk categories to periodic self-tests with compliance reports for high-risk systems.
The following table provides examples of Preventive, Corrective, and Detective Controls for agentic AI systems that could be incorporated in a modern GxP validation framework.

Control element
Supporting AWS services

Preventive Controls

Agent Behavior Specification
Use Amazon Bedrock Model Catalog to find the models that help meet your specific requirements and use AWS service quotas (limits) and documentation on service features to define supported and verifiable agent capabilities.

Threat Modeling
Use AWS Well-Architected Framework (Security Pillar) tools and AWS service security documentation to proactively identify AI-specific threats like Prompt Injection, Data Poisoning, and Model Inversion, and help design preventive mitigations using AWS services.

Response Content and Relevance Control
Use Amazon Bedrock Guardrails to implement real-time safety policies for large language models (LLMs) to deny harmful inputs or responses. Guardrails can also define denylists and filter for PII. Use Amazon Bedrock Knowledge Bases or AWS purpose-built vector databases for RAG to provide controlled, current, and relevant information to help prevent factual drift.

Bias Mitigation in Datasets
Amazon SageMaker Clarify provides tools to run pre-training bias analysis of your datasets. For agents, this helps make sure the foundational data doesn’t lead to biased decision-making paths or tool usage.

Agent & Prompt Governance
Amazon Bedrock agents and prompt management features support lifecycle processes including creation, evaluation, versioning, and optimization. The features also support advanced prompt templates, content filters, automated reasoning checks, and integration with Amazon Bedrock Flows for more secure and controlled agentic workflows.

Configuration Management
AWS provides an industry-leading suite of configuration management services, such as AWS Config and AWS Audit Manager, which can be used to continuously validate agentic GxP system configurations. Amazon SageMaker Model Registry manages and versions trained machine learning (ML) models for controlled deployments.

Secure AI Development
Amazon Q Developer and Amazon Kiro provide AI-powered code assistance that incorporates security best practices and AWS Well-Architected principles for building and maintaining agentic workloads securely from the start.

AI Agents as Secondary Controls
Use Amazon Bedrock AgentCore and your data to quickly incorporate AI agents into existing GxP workflows as secondary preventative controls to add capabilities like trend analysis, automated inspections, and systems flow analysis that can trigger preventative workflow events.

Detective Controls

Traditional Testing (IQ, OQ, PQ)
Use AWS Config and AWS CloudFormation for IQ validation by tracking resource deployment configurations. Use AWS CloudTrail and AWS CloudWatch for sourcing events, metrics, and log test results for OQ/PQ validation.

Explainability Audits & Trajectory Reviews
Amazon SageMaker Clarify generates explainability reports for custom models. Amazon Bedrock invocation logs can be used to review reasoning or chain of thought to find flaws in an agent’s logic. Use Amazon Bedrock AgentCore Observability to review agent invocation sessions, traces, and spans.

Model & I/O Drift Detection
For custom models, Amazon SageMaker Model Monitor can detect drift in data and model quality. For AI agents using commercial LLMs, use the Amazon Bedrock AgentCore observability service to design monitoring of inputs (prompts) and outputs (responses) to detect concept drift. Use Amazon CloudWatch alarms to manage compliance notifications.

Performance Monitoring
Agentic workloads can use Amazon Bedrock metrics, AgentCore Observability, and Amazon CloudWatch metrics to monitor token usage, cost per interaction, and tool execution latency to detect performance and cost anomalies.

Log and Event Monitoring (SIEM)
For agentic workloads, Amazon GuardDuty provides intelligent threat detection that analyzes Amazon Bedrock API calls to detect anomalous or potentially malicious use of the agent or LLMs.

Code & Model Risk Scanning
Amazon CodeGuru and Amazon Inspector scan agent code and the operational environment for vulnerabilities. These tools can’t assess model weights for risk; however, AWS does provide Amazon SageMaker Model Cards, which can be used to build model risk scanning controls.

Adversarial Testing (Red Teaming) & Critic/Grader Model
The evaluation tools of Amazon Bedrock help assess model fitness. Amazon Bedrock supports leading model providers, allowing GxP systems to use multiple models for secondary and tertiary validation.

Internal Audits
AWS Audit Manager automates the collection of evidence for compliance and audits, and AWS CloudTrail provides a streamlined way to review agent actions and facilitate procedural adherence.

Corrective Controls

Model & Prompt Rollback
Use AWS CodePipeline and AWS CloudFormation to quickly revert to a previous, known-good version of a model or Prompt Template when a problem is detected.

System Fallback
AWS Step Functions can help orchestrate a fallback to a streamlined, more constrained model or a human-only workflow if the primary agent fails.

Human-in-the-Loop & Escalation Management
AWS Step Functions, Amazon Simple Notification Service (Amazon SNS), and Amazon Bedrock Flows can orchestrate workflows that pause and wait for human approval, including dynamic approvals based on low agent confidence scores or detected anomalies.

CAPA Process
AWS Systems Manager OpsCenter provides a central place to manage operational issues, which can be used to track the root cause analysis of an agent’s failure.

Incident Response Plan
AWS Security Hub and AWS Systems Manager Incident Manager can automate response plans for AI security incidents (for example, major jailbreak and data leakage) and provide a central dashboard to manage them.

Disaster Recovery Plan (DRP)
AWS Elastic Disaster Recovery (AWS DRS) and AWS Backup provide tools to replicate and recover the entire AI application stack, including deploying to different AWS Regions.

Conclusion
Healthcare and life sciences organizations can build GxP-compliant AI agents by adopting a risk-based framework that balances innovation with regulatory requirements. Success requires proper risk classification, scaled controls matching system impact, and understanding the AWS shared responsibility model. AWS provides qualified infrastructure and comprehensive services, while organizations configure appropriate controls, maintain version management, and implement risk mitigation strategies tailored to their validation needs.
We encourage organizations to explore building GxP-compliant AI agents with AWS services. For more information about implementing compliance-aligned AI systems in regulated environments, contact your AWS account team or visit our Healthcare and Life Sciences Solutions page.

About the authors
Pierre de Malliard is a Senior AI/ML Solutions Architect at Amazon Web Services and supports customers in the Healthcare and Life Sciences Industry.
Ian Sutcliffe is a Global Solution Architect with 25+ years of experience in IT, primarily in the Life Sciences Industry. A thought leader in the area of regulated cloud computing, one of his areas of focus is IT operating models and process optimization and automation with the intent of helping customers become Regulated Cloud Natives.
Kristin Ambrosini is a Generative AI Specialist at Amazon Web Services. She drives adoption of scalable GenAI solutions across healthcare and life sciences to transform drug discovery and improve patient outcomes. Kristin blends scientific expertise, technical acumen, and business strategy. She holds a Ph.D. in Biological Sciences.
Ben Xavier is a MedTech Specialist with over 25 years of experience in Medical Device R&D. He is a passionate leader focused on modernizing the MedTech industry through technology and best practices to accelerate innovation and improve patient outcomes.

Moonshot AI Releases Kosong: The LLM Abstraction Layer that Powers Kim …

Modern agentic applications rarely talk to a single model or a single tool, so how do you keep that stack maintainable when providers, models, and tools change every few weeks? Moonshot AI’s Kosong targets this problem as an LLM abstraction layer for agent applications. Kosong unifies message structures, asynchronous tool orchestration, and pluggable chat providers so teams can build agents without hard-wiring business logic to a single API. It is also the layer that powers Moonshot’s Kimi CLI.

What Kosong provides

Kosong is a Python library that sits between your agent logic and LLM providers. It is positioned as an LLM abstraction layer for modern agent applications, and the repository shows example code that uses a Kimi chat provider together with the high-level helper functions generate and step.

The public API surface is intentionally kept small. At the top level you import kosong.generate, kosong.step and the result types GenerateResult and StepResult. Supporting modules define chat_provider, message, tooling, and tooling.simple. These modules wrap provider specific streaming formats, token accounting and tool calls behind one consistent interface.

ChatProvider and message model

The core integration point is the ChatProvider abstraction. The Moonshot team ships a provider implementation for Kimi in kosong.chat_provider.kimi. A Kimi object is initialized with base_url, api_key, and the model name, for example kimi-k2-turbo-preview. This provider is then passed into kosong.generate or kosong.step together with a system prompt, tools, and a message history.

Messages are represented by the Message class from kosong.message. In the examples, a message is constructed with a role, such as “user”, and a content argument. The type of content is documented as either a string or a list of content parts, which lets the library support richer multimodal payloads while keeping the basic chat example simple for new users.

Kosong also exposes a streaming unit StreamedMessagePart via kosong.chat_provider. Provider implementations emit these parts during generation, and the library merges them into the final Message. The optional TokenUsage structure tracks token counts in a provider independent way, which is then attached to the result objects for logging and monitoring.

Tooling, Toolset and SimpleToolset

Most agent stacks need tools such as search, code execution or database calls. Kosong models this through the tooling module. The example in the GitHub repo defines a tool by subclassing CallableTool2 with a Pydantic parameter model. The example AddTool sets name, description and params, and implements __call__ to return a ToolOk value which is a valid ToolReturnType.

Tools are registered in a SimpleToolset from kosong.tooling.simple. In the example, a SimpleToolset is instantiated and then augmented with the AddTool instance using the += operator. This toolset is passed into kosong.step, not into generate. The toolset is responsible for resolving tool calls from the model and routing them to the correct async function, while step manages the orchestration around a single conversational turn.

generate for single shot completion

The generate function is the entry point for plain chat completion. You provide the chat_provider, a system_prompt, an explicit list of tools, which can be empty, and a history of Message objects. The Kimi example shows a minimal usage pattern where a single user message is passed as history and tools=[].

generate supports streaming through an on_message_part callback. In the GitHub repo, the authors illustrate this by defining a simple output function that prints each StreamedMessagePart. After streaming is complete, generate returns a GenerateResult that contains the merged assistant message and an optional usage structure with token counts. This pattern lets applications both display incremental output and still work with a clean final message object.
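The following is a minimal sketch of that usage pattern, based only on the API surface described above; exact parameter names and signatures may differ from the repository’s example code, and a valid Moonshot API key is assumed.

import asyncio
import os

import kosong
from kosong.chat_provider.kimi import Kimi
from kosong.message import Message

async def main():
    provider = Kimi(
        base_url=os.environ["KIMI_BASE_URL"],
        api_key=os.environ["KIMI_API_KEY"],
        model="kimi-k2-turbo-preview",
    )

    def output(part):  # each streamed StreamedMessagePart is printed as it arrives
        print(part, end="", flush=True)

    result = await kosong.generate(
        chat_provider=provider,
        system_prompt="You are a helpful assistant.",
        tools=[],  # plain chat completion, no tool use
        history=[Message(role="user", content="Summarize what Kosong does.")],
        on_message_part=output,
    )
    print(result.message)  # merged assistant Message
    print(result.usage)    # optional TokenUsage, if the provider reports it

asyncio.run(main())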

step for tool using agents

For tool using agents, Kosong exposes the step function. The example in the GitHub repo shows kosong.step being called with a Kimi provider, a SimpleToolset that contains AddTool, a system prompt, and user history that instructs the model to call the add tool.

step returns a StepResult. The example prints result.message and then awaits result.tool_results(). This method collects all tool outputs produced during the step and returns them to the caller. The orchestration of tool calls, including argument parsing into the Pydantic parameter model and conversion into ToolReturnType results, is handled inside Kosong so agent authors do not have to implement their own dispatch loop for each provider.
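Putting the tooling and step pieces together, the following is a minimal sketch that follows the names described above (CallableTool2, ToolOk, SimpleToolset, kosong.step); the exact base-class contract, constructor arguments, and signatures are assumptions and may differ from the repository’s example code.

import asyncio
import os

import kosong
from kosong.chat_provider.kimi import Kimi
from kosong.message import Message
from kosong.tooling import CallableTool2, ToolOk
from kosong.tooling.simple import SimpleToolset
from pydantic import BaseModel

class AddParams(BaseModel):
    a: int
    b: int

class AddTool(CallableTool2):
    name = "add"
    description = "Add two integers and return the sum."
    params = AddParams

    async def __call__(self, params: AddParams) -> ToolOk:
        return ToolOk(output=str(params.a + params.b))

async def main():
    provider = Kimi(
        base_url=os.environ["KIMI_BASE_URL"],
        api_key=os.environ["KIMI_API_KEY"],
        model="kimi-k2-turbo-preview",
    )
    toolset = SimpleToolset()
    toolset += AddTool()  # register the tool, as in the repo example

    result = await kosong.step(
        chat_provider=provider,
        toolset=toolset,
        system_prompt="Use the add tool for any arithmetic.",
        history=[Message(role="user", content="What is 17 + 25? Use the add tool.")],
    )
    print(result.message)               # assistant message for this turn
    print(await result.tool_results())  # collected tool outputs from this step

asyncio.run(main())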

Built-in demo and relationship with Kimi CLI

Kosong ships with a built-in demo agent that can be run locally. The GitHub README documents the environment variables KIMI_BASE_URL and KIMI_API_KEY, and shows a launch command using uv run python -m kosong kimi --with-bash. This demo uses Kimi as the chat provider and exposes a terminal agent that can call tools, including shell commands when the --with-bash option is enabled.

Key Takeaways

Kosong is an LLM abstraction layer from Moonshot AI that unifies message structures, asynchronous tool orchestration and pluggable chat providers for agent applications.

The library exposes a small core API, generate for plain chat and step for tool using agents, backed by abstractions such as ChatProvider, Message, Tool, Toolset and SimpleToolset.

Kosong currently ships a Kimi chat provider targeting the Moonshot AI API, and defines the ChatProvider interface so teams can plug in additional backends without changing agent logic.

Tool definitions use Pydantic parameter models and ToolReturnType results, which lets Kosong handle argument parsing, validation and orchestration of tool calls inside step.

Kosong powers Moonshot’s Kimi CLI, providing the underlying LLM abstraction layer while Kimi CLI focuses on the command line agent experience that can target Kimi and other backends.

Editorial Comments

Kosong looks like a pragmatic move from Moonshot AI, it cleanly separates agent logic from LLM and tool backends while keeping the surface area small for early developers. By centering everything on ChatProvider, Message and Toolset, it gives Kimi CLI and other stacks a consistent way to evolve models and tooling without rewriting orchestration. For teams building long term agent systems, Kosong could be the right kind of minimal infrastructure.

Check out the Repo and Docs.

Gelato-30B-A3B: A State-of-the-Art Grounding Model for GUI Computer-Us …

How do we teach AI agents to reliably find and click the exact on screen element we mean when we give them a simple instruction? A team of researchers from ML Foundations has introduced Gelato-30B-A3B, a state of the art grounding model for graphical user interfaces that is designed to plug into computer use agents and convert natural language instructions into reliable click locations. The model is trained on the Click 100k dataset and reaches 63.88% accuracy on ScreenSpot Pro and 69.15% on OS-World-G, with 74.65% on OS-World-G Refined. It surpasses GTA1-32B and larger vision language models such as Qwen3-VL-235B-A22B-Instruct.

https://github.com/mlfoundations/Gelato

What Gelato-30B-A3B Does in an Agent Stack

Gelato-30B-A3B is a 31B-parameter model that fine-tunes Qwen3-VL-30B-A3B-Instruct, which uses a mixture-of-experts architecture. It takes a screenshot and a textual instruction as input and produces a single click coordinate as output.

The model is positioned as a modular grounding component. A planner model, for example GPT-5 in the Gelato experiments, decides the next high-level action and calls Gelato to resolve that step into a concrete click on the screen. This separation between planning and grounding is important when an agent must operate across many operating systems and applications with different layouts.
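The loop below is an illustrative sketch of that split, not code from the Gelato repository; plan_next_action, gelato_ground, take_screenshot, and execute_click are placeholder names for the planner call, the grounding call, and the environment hooks.

from dataclasses import dataclass

@dataclass
class Click:
    x: int
    y: int

def plan_next_action(goal: str, history: list[str], screenshot: bytes) -> str | None:
    """Planner (e.g., GPT-5) returns the next natural-language step, or None when done."""
    raise NotImplementedError  # call the planner model here

def gelato_ground(instruction: str, screenshot: bytes) -> Click:
    """Grounder (Gelato-30B-A3B) maps an instruction plus screenshot to a click point."""
    raise NotImplementedError  # call the grounding model here

def run_agent(goal: str, take_screenshot, execute_click, max_steps: int = 50) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        step = plan_next_action(goal, history, screenshot)
        if step is None:  # planner signals that the task is complete
            break
        click = gelato_ground(step, screenshot)  # instruction -> (x, y)
        execute_click(click.x, click.y)          # act in the environment
        history.append(step)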


Click 100k, A Targeted Dataset For GUI Grounding

Click 100k is the dataset that underlies Gelato. It pairs computer screen images with natural language instructions, bounding boxes for the target element, image dimensions, and normalized bounding boxes. Each sample is set up as a low level command, for example ‘tap on the element between Background and Notifications options’ with a precise region.

The dataset is built by filtering and unifying multiple public sources. The list includes ShowUI, AutoGUI, PC Agent E, WaveUI, OS Atlas, UGround, PixMo Points, SeeClick, UI VISION, a JEDI subset that focuses on spreadsheet and text cell manipulation, and videos from 85 professional application tutorials annotated with Claude-4-Sonnet. Each source contributes at most 50k samples, and all sources are mapped into a shared schema with images, instructions, bounding boxes, and normalized coordinates.

The research team then runs an aggressive filtering pipeline. OmniParser discards clicks that do not land on detected interface elements. Qwen2.5-7B-VL and SE-GUI-3B remove trivial examples, such as easy hyperlink clicks. GTA1-7B-2507 and UI-Venus-7B remove samples where the instruction and click region do not match. A Qwen2.5-7B-VL baseline trained on a balanced 10k subset shows that this combination gives a +9 pp accuracy gain on ScreenSpot Pro compared with training on unfiltered data.

Professional application coverage is a specific focus. Click 100k adds data from UI VISION and the JEDI subset, and then augments this with 80+ tutorial videos for real desktop tools. Claude 4 Sonnet generates bounding boxes and low level instructions for these videos, followed by manual inspection and corrections.


GRPO Training On Top Of Qwen3 VL

On the training side, Gelato-30B-A3B uses GRPO, a reinforcement learning algorithm that derives from work on DeepSeekMath and similar systems. The research team follows the DAPO setup. They remove the KL divergence term from the objective, set the clip-higher threshold to 0.28, and skip rollouts with zero advantage. Rewards are sparse and are only given when the predicted click falls inside the target bounding box, similar to the GTA1 recipe.
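The reward itself is simple to state. The sketch below illustrates the sparse click reward described above (the coordinates and boxes are illustrative); the surrounding GRPO/DAPO machinery is omitted.

def click_reward(pred_x: float, pred_y: float, bbox: tuple[float, float, float, float]) -> float:
    """Return 1.0 only if the predicted click lands inside the ground-truth box.

    bbox is (x_min, y_min, x_max, y_max) in the same coordinate space as the click.
    """
    x_min, y_min, x_max, y_max = bbox
    inside = (x_min <= pred_x <= x_max) and (y_min <= pred_y <= y_max)
    return 1.0 if inside else 0.0

# A click at (412, 88) inside a box spanning (400, 80)-(450, 100) earns reward 1.0
assert click_reward(412, 88, (400, 80, 450, 100)) == 1.0
assert click_reward(10, 10, (400, 80, 450, 100)) == 0.0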


They initialize from Qwen3-VL-30B-A3B-Instruct and run 100 RL steps on 32 A100 GPUs with 40 GB memory. The best checkpoint appears at step 84, chosen by the mean performance across ScreenSpot Pro, OS World G, and OS World G Refined. At this point the model reaches 63.88% on ScreenSpot Pro and 67.19% and 73.40% on OS World G and OS World G Refined. A simple refusal prompting strategy, which appends an instruction to answer with a refusal when the element cannot be found, raises the OS World G scores to 69.15% and 74.65%.

End To End Agent Results On OS World

To test Gelato beyond static grounding benchmarks, the research team plugs it into the GTA1.5 agent framework and runs full computer use agents on the OS World environment. In this setup GPT 5 acts as the planner. Gelato 30B A3B provides grounding, the agent has at most 50 steps, and it waits 3 seconds between actions.

The research team reports three runs per model on a fixed OS World snapshot. Gelato-30B-A3B reaches a 58.71% automated success rate with a small standard deviation, compared with 56.97% for GTA1-32B in the same harness. Because the automatic OS World evaluation misses some valid solutions, they also run a human evaluation on 20 problematic tasks. Under human scoring, Gelato reaches 61.85% success, while GTA1-32B reaches 59.47%.

Key Takeaways

Gelato-30B-A3B is a Qwen3-VL-30B-A3B Instruct based mixture of experts model that performs state of the art GUI grounding on ScreenSpot Pro and OS World G benchmarks, surpassing GTA1-32B and larger VLMs such as Qwen3-VL-235B-A22B-Instruct.

The model is trained on Click 100k, a curated grounding dataset that merges and filters multiple public GUI datasets and professional application traces, pairing real screens with low level natural language commands and precise click coordinates.

Gelato-30B-A3B uses a GRPO reinforcement learning recipe on top of Qwen3-VL, with sparse rewards that only trigger when the predicted click lies inside the ground truth bounding box, which significantly boosts grounding accuracy over supervised baselines.

When integrated into an agent framework with GPT-5 acting as the planner, Gelato-30B-A3B improves success rates on OS World computer use tasks compared with GTA1-32B, demonstrating that better grounding directly translates into stronger end to end agent performance.

Editorial Comments

Gelato-30B-A3B is an important step for grounded computer use because it shows that a Qwen3-VL based MoE model, trained on a carefully filtered Click 100k dataset, can beat both GTA1-32B and much larger VLMs like Qwen3-VL-235B-A22B Instruct on ScreenSpot Pro and OS-World-G while staying accessible through Hugging Face. Overall, Gelato-30B-A3B establishes a clear new baseline for open computer grounding models.

Check out the Repo and Model Weights.

Comparing Memory Systems for LLM Agents: Vector, Graph, and Event Logs

Table of contents
  High-Level Comparison
  1. Vector Memory Systems
    1.1 Plain Vector RAG
    1.2 Tiered Vector Memory (MemGPT-Style Virtual Context)
  2. Graph Memory Systems
    2.1 Temporal Knowledge Graph Memory (Zep / Graphiti)
    2.2 Knowledge-Graph RAG (GraphRAG)
  3. Event and Execution Log Systems
    3.1 Execution Logs and Checkpoints (ALAS, LangGraph)
    3.2 Episodic Long-Term Memory
  Key Takeaways

Reliable multi-agent systems are mostly a memory design problem. Once agents call tools, collaborate, and run long workflows, you need explicit mechanisms for what gets stored, how it is retrieved, and how the system behaves when memory is wrong or missing.

This article compares 6 memory system patterns commonly used in agent stacks, grouped into 3 families:

Vector memory

Graph memory

Event / execution logs

We focus on retrieval latency, hit rate, and failure modes in multi-agent planning.

High-Level Comparison

Family | System pattern | Data model | Strengths | Main weaknesses
Vector | Plain vector RAG | Embedding vectors | Simple, fast ANN retrieval, widely supported | Loses temporal / structural context, semantic drift
Vector | Tiered vector (MemGPT-style virtual context) | Working set + vector archive | Better reuse of important info, bounded context size | Paging policy errors, per-agent divergence
Graph | Temporal KG memory (Zep / Graphiti) | Temporal knowledge graph | Strong temporal, cross-session reasoning, shared view | Requires schema + update pipeline, can have stale edges
Graph | Knowledge-graph RAG (GraphRAG) | KG + hierarchical communities | Multi-doc, multi-hop questions, global summaries | Graph construction and summarization bias, traceability overhead
Event / Logs | Execution logs / checkpoints (ALAS, LangGraph) | Ordered versioned log | Ground truth of actions, supports replay and repair | Log bloat, missing instrumentation, side-effect-safe replay required
Event / Logs | Episodic long-term memory | Episodes + metadata | Long-horizon recall, pattern reuse across tasks | Episode boundary errors, consolidation errors, cross-agent misalignment

Next, we go system family by system family.

1. Vector Memory Systems

1.1 Plain Vector RAG

What it is?

The default pattern in most RAG and agent frameworks:

Encode text fragments (messages, tool outputs, documents) using an embedding model.

Store vectors in an ANN index (FAISS, HNSW, ScaNN, etc.).

At query time, embed the query and retrieve top-k nearest neighbors, optionally rerank.

This is the ‘vector store memory’ exposed by typical LLM orchestration libraries.
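A minimal sketch of this pattern with FAISS and a small embedding model is shown below; the model choice and exact index type are illustrative, not recommendations.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

memories = [
    "User prefers deployments to eu-west-1.",
    "Budget cap for Q3 experiments is $5,000.",
    "The staging database was migrated on 2024-06-02.",
]
embeddings = encoder.encode(memories, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner product; swap in HNSW at scale
index.add(embeddings)

query = encoder.encode(["Which region should we deploy to?"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)  # top-k nearest neighbors
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {memories[i]}")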

Latency profile

Approximate nearest-neighbor indexes are designed for sublinear scaling with corpus size:

Graph-based ANN structures like HNSW typically show empirically near-logarithmic latency growth vs corpus size for fixed recall targets.

On a single node with tuned parameters, retrieving from up to millions of items is usually low tens of milliseconds per query, plus any reranking cost.

Main cost components:

ANN search in the vector index.

Additional reranking (e.g., cross-encoder) if used.

LLM attention cost over concatenated retrieved chunks.

Hit-rate behavior

Hit rate is high when:

The query is local (‘what did we just talk about’), or

The information lives in a small number of chunks with embeddings aligned to the query model.

Vector RAG performs significantly worse on:

Temporal queries (‘what did the user decide last week’).

Cross-session reasoning and long histories.

Multi-hop questions requiring explicit relational paths.

Benchmarks such as Deep Memory Retrieval (DMR) and LongMemEval were introduced precisely because naive vector RAG degrades on long-horizon and temporal tasks.

Failure modes in multi-agent planning

Lost constraints: top-k retrieval misses a critical global constraint (budget cap, compliance rule), so a planner generates invalid tool calls.

Semantic drift: approximate neighbors match on topic but differ in key identifiers (region, environment, user ID), leading to wrong arguments.

Context dilution: too many partially relevant chunks are concatenated; the model underweights the important part, especially in long contexts.

When it is fine

Single-agent or short-horizon tasks.

Q&A over small to medium corpora.

As a first-line semantic index over logs, docs, and episodes, not as the final authority.

1.2 Tiered Vector Memory (MemGPT-Style Virtual Context)

What it is?

MemGPT introduces a virtual-memory abstraction for LLMs: a small working context plus larger external archives, managed by the model using tool calls (e.g., ‘swap in this memory’, ‘archive that section’). The model decides what to keep in the active context and what to fetch from long-term memory.

Architecture

Active context: the tokens currently present in the LLM input (analogous to RAM).

Archive / external memory: larger storage, often backed by a vector DB and object store.

The LLM uses specialized functions to:

Load archived content into context.

Evict parts of the current context to the archive.
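The sketch below illustrates this two-tier layout with a deliberately simple drop-oldest eviction policy; real controllers let the LLM itself decide what to archive and recall, and the archive interface (add/search) is an assumption standing in for any semantic store.

from collections import deque

class TieredMemory:
    def __init__(self, archive, max_working_items: int = 8):
        self.working = deque(maxlen=max_working_items)  # "RAM": goes into every prompt
        self.archive = archive                          # any store exposing add(text) and search(query, k)

    def remember(self, text: str) -> None:
        if len(self.working) == self.working.maxlen:
            evicted = self.working[0]     # about to be dropped by the deque
            self.archive.add(evicted)     # page it out instead of losing it
        self.working.append(text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        return list(self.working) + self.archive.search(query, k)  # hot items plus paged-in items

    def context(self) -> str:
        return "\n".join(self.working)    # what actually enters the LLM input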

Latency profile

Two regimes:

Within active context: retrieval is effectively free externally; attention cost only.

Archive accesses: similar to plain vector RAG, but often targeted:

Search space is narrowed by task, topic, or session ID.

The controller can cache “hot” entries.

Overall, you still pay vector search and serialization costs when paging, but you avoid sending large, irrelevant context to the model at each step.

Hit-rate behavior

Improvement relative to plain vector RAG:

Frequently accessed items are kept in the working set, so they do not depend on ANN retrieval every step.

Rare or old items still suffer from vector-search limitations.

The core new error surface is paging policy rather than pure similarity.

Failure modes in multi-agent planning

Paging errors: the controller archives something that is needed later, or fails to recall it, causing latent constraint loss.

Per-agent divergence: if each agent manages its own working set over a shared archive, agents may hold different local views of the same global state.

Debugging complexity: failures depend on both model reasoning and memory management decisions, which must be inspected together.

When it is useful

Long conversations and workflows where naive context growth is not viable.

Systems where you want vector RAG semantics but bounded context usage.

Scenarios where you can invest in designing / tuning paging policies.

2. Graph Memory Systems

2.1 Temporal Knowledge Graph Memory (Zep / Graphiti)

What it is?

Zep positions itself as a memory layer for AI agents implemented as a temporal knowledge graph (Graphiti). It integrates:

Conversational history.

Structured business data.

Temporal attributes and versioning.

Zep evaluates this architecture on DMR and LongMemEval, comparing against MemGPT and long-context baselines.

Reported results include:

94.8% vs 93.4% accuracy over a MemGPT baseline on DMR.

Up to 18.5% higher accuracy and about 90% lower response latency than certain baselines on LongMemEval for complex temporal reasoning.

These numbers underline the benefit of explicit temporal structure over pure vector recall on long-term tasks.

Architecture

Core components:

Nodes: entities (users, tickets, resources), events (messages, tool calls).

Edges: relations (created, depends_on, updated_by, discussed_in).

Temporal indexing: validity intervals and timestamps on nodes/edges.

APIs for:

Writing new events / facts into the KG.

Querying along entity and temporal dimensions.

The KG can coexist with a vector index for semantic entry points.
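A minimal sketch of temporal edges with validity intervals, in the spirit of this design but not the Zep/Graphiti API, looks as follows:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Edge:
    src: str                 # e.g., "resource:db-1"
    relation: str            # e.g., "has_config"
    dst: str                 # e.g., "config:v7"
    valid_from: datetime
    valid_to: datetime | None = None   # None means still valid

class TemporalGraph:
    def __init__(self):
        self.edges: list[Edge] = []

    def assert_fact(self, src: str, relation: str, dst: str, at: datetime) -> None:
        # Close out the previous value of this relation, then record the new one
        for e in self.edges:
            if e.src == src and e.relation == relation and e.valid_to is None:
                e.valid_to = at
        self.edges.append(Edge(src, relation, dst, valid_from=at))

    def state_at(self, src: str, relation: str, at: datetime) -> str | None:
        """Answer "what was src's <relation> at time at" by temporal filtering."""
        for e in self.edges:
            if (e.src == src and e.relation == relation
                    and e.valid_from <= at and (e.valid_to is None or at < e.valid_to)):
                return e.dst
        return None

g = TemporalGraph()
g.assert_fact("resource:db-1", "has_config", "config:v6", datetime(2024, 5, 1))
g.assert_fact("resource:db-1", "has_config", "config:v7", datetime(2024, 6, 10))
print(g.state_at("resource:db-1", "has_config", datetime(2024, 6, 1)))  # config:v6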

Latency profile

Graph queries are typically bounded by small traversal depths:

For questions like “latest configuration that passed checks,” the system:

Locates the relevant entity node.

Traverses outgoing edges with temporal filters.

Complexity scales with the size of the local neighborhood, not the full graph.

In practice, Zep reports order-of-magnitude latency benefits vs baselines that either scan long contexts or rely on less structured retrieval.

Hit-rate behavior

Graph memory excels when:

Queries are entity-centric and temporal.

You need cross-session consistency, e.g., “what did this user previously request,” “what state was this resource in at time T”.

Multi-hop reasoning is required (“if ticket A depends on B and B failed after policy P changed, what is the likely cause?”).

Hit rate is limited by graph coverage: missing edges or incorrect timestamps directly reduce recall.

Failure modes in multi-agent planning

Stale edges / lagging updates: if real systems change but graph updates are delayed, plans operate on incorrect world models.

Schema drift: evolving the KG schema without synchronized changes in retrieval prompts or planners yields subtle errors.

Access control partitions: multi-tenant scenarios can yield partial views per agent; planners must be aware of visibility constraints.

When it is useful

Multi-agent systems coordinating on shared entities (tickets, users, inventories).

Long-running tasks where temporal ordering is critical.

Environments where you can maintain ETL / streaming pipelines into the KG.

2.2 Knowledge-Graph RAG (GraphRAG)

What it is?

GraphRAG is a retrieval-augmented generation pipeline from Microsoft that builds an explicit knowledge graph over a corpus and performs hierarchical community detection (e.g., Hierarchical Leiden) to organize the graph. It stores summaries per community and uses them at query time.

Pipeline:

Extract entities and relations from source documents.

Build the KG.

Run community detection and build a multi-level hierarchy.

Generate summaries for communities and key nodes.

At query time:

Identify relevant communities (via keywords, embeddings, or graph heuristics).

Retrieve summaries and supporting nodes.

Pass them to the LLM.
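The query path can be sketched as follows; the embedding function and LLM call are placeholders, and summary selection here is plain cosine similarity rather than the full hierarchy-aware selection GraphRAG implements.

import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError  # plug in any embedding model

def answer_with_graphrag(question: str, community_summaries: list[str], llm, k: int = 3) -> str:
    vectors = embed(community_summaries)
    q = embed([question])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))  # cosine similarity
    top = [community_summaries[i] for i in np.argsort(-scores)[:k]]
    prompt = ("Answer the question using these community summaries:\n\n"
              + "\n---\n".join(top)
              + f"\n\nQuestion: {question}")
    return llm(prompt)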

Latency profile

Indexing is heavier than vanilla RAG (graph construction, clustering, summarization).

Query-time latency can be competitive or better for large corpora, because:

You retrieve a small number of summaries.

You avoid constructing extremely long contexts from many raw chunks.

Latency mostly depends on:

Community search (often vector search over summaries).

Local graph traversal inside selected communities.

Hit-rate behavior

GraphRAG tends to outperform plain vector RAG when:

Queries are multi-document and multi-hop.

You need global structure, e.g., “how did this design evolve,” “what chain of incidents led to this outage.”

You want answers that integrate evidence from many documents.

The hit rate depends on graph quality and community structure: if entity extraction misses relations, they simply do not exist in the graph.

Failure modes

Graph construction bias: extraction errors or missing edges lead to systematic blind spots.

Over-summarization: community summaries may drop rare but important details.

Traceability cost: tracing an answer back from summaries to raw evidence adds complexity, important in regulated or safety-critical settings.

When it is useful

Large knowledge bases and documentation sets.

Systems where agents must answer design, policy, or root-cause questions that span many documents.

Scenarios where you can afford the one-time indexing and maintenance cost.

3. Event and Execution Log Systems

3.1 Execution Logs and Checkpoints (ALAS, LangGraph)

What they are?

These systems treat ‘what the agents did‘ as a first-class data structure.

ALAS: a transactional multi-agent framework that maintains a versioned execution log plus:

Validator isolation: a separate LLM checks plans/results with its own context.

Localized Cascading Repair: only a minimal region of the log is edited when failures occur.

LangGraph: exposes thread-scoped checkpoints of an agent graph (messages, tool outputs, node states) that can be persisted, resumed, and branched.

In both cases, the log / checkpoints are the ground truth for:

Actions taken.

Inputs and outputs.

Control-flow decisions.
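A minimal sketch of an append-only log with idempotency keys, in the spirit of these systems but not their APIs, looks like this:

import uuid
from datetime import datetime, timezone

class ExecutionLog:
    def __init__(self):
        self.entries: list[dict] = []

    def record(self, agent: str, action: str, inputs: dict, output, idempotency_key: str | None = None) -> dict:
        entry = {
            "id": str(uuid.uuid4()),
            "version": len(self.entries) + 1,   # monotonically increasing position
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "action": action,
            "inputs": inputs,
            "output": output,
            "idempotency_key": idempotency_key or str(uuid.uuid4()),
        }
        self.entries.append(entry)
        return entry

    def replay(self, execute, applied_keys: set[str]) -> None:
        """Re-run logged actions, skipping those already applied in the target system."""
        for entry in self.entries:
            if entry["idempotency_key"] in applied_keys:
                continue  # side effect already happened downstream; do not re-trigger it
            execute(entry["action"], entry["inputs"])
            applied_keys.add(entry["idempotency_key"])

    def tail(self, n: int = 5) -> list[dict]:
        return self.entries[-n:]   # cheap read of the most recent steps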

Latency profile

For normal forward execution:

Reading the tail of the log or a recent checkpoint is O(1) and small.

Latency mostly comes from LLM inference and tool calls, not log access.

For analytics / global queries:

You need secondary indexes or offline processing; raw scanning is O(n).

Hit-rate behavior

For questions like ‘what happened,’ ‘which tools were called with which arguments,’ and ‘what was the state before this failure,’ hit rate is effectively 100%, assuming:

All relevant actions are instrumented.

Log persistence and retention are correctly configured.

Logs do not provide semantic generalization by themselves; you layer vector or graph indices on top for semantics across executions.

Failure modes

Log bloat: high-volume systems generate large logs; improper retention or compaction can silently drop history.

Partial instrumentation: missing tool or agent traces yield blind spots in replay and debugging.

Unsafe replay: naively re-running log steps can re-trigger external side effects (payments, emails) unless idempotency keys and compensation handlers exist.

ALAS explicitly tackles some of these via transactional semantics, idempotency, and localized repair.

When they are essential

Any system where you care about observability, auditing, and debuggability.

Multi-agent workflows with non-trivial failure semantics.

Scenarios where you want automated repair or partial re-planning rather than full restart.

3.2 Episodic Long-Term Memory

What it is

Episodic memory structures store episodes: cohesive segments of interaction or work, each with:

Task description and initial conditions.

Relevant context.

Sequence of actions (often references into the execution log).

Outcomes and metrics.

Episodes are indexed with:

Metadata (time windows, participants, tools).

Embeddings (for similarity search).

Optional summaries.

Some systems periodically distill recurring patterns into higher-level knowledge or use episodes to fine-tune specialized models.

Latency profile

Episodic retrieval is typically two-stage:

Identify relevant episodes via metadata filters and/or vector search.

Retrieve content within selected episodes (sub-search or direct log references).

Latency is higher than a single flat vector search on small data, but scales better as lifetime history grows, because you avoid searching over all individual events for every query.
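A minimal sketch of this two-stage pattern, with an illustrative episode structure and placeholder similarity scoring (not tied to any particular framework), might look like this:

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import numpy as np

@dataclass
class Episode:
    episode_id: str
    summary: str
    metadata: Dict          # e.g., time window, participants, tools
    embedding: np.ndarray   # assumed precomputed from the summary
    events: List[str]       # references into the execution log

def retrieve(episodes: List[Episode], query_vec: np.ndarray,
             tool: Optional[str] = None, top_k: int = 2) -> List[Tuple[str, List[str]]]:
    # Stage 1: metadata filter, then vector similarity over episode summaries
    pool = [e for e in episodes if tool is None or tool in e.metadata.get("tools", [])]
    pool.sort(key=lambda e: -float(
        np.dot(query_vec, e.embedding)
        / (np.linalg.norm(query_vec) * np.linalg.norm(e.embedding) + 1e-9)))
    # Stage 2: pull content inside the selected episodes (here, raw event references)
    return [(e.episode_id, e.events) for e in pool[:top_k]]

episodes = [Episode("ep-1", "Schema migration for billing DB", {"tools": ["sql"]},
                    np.random.rand(8), ["log:120", "log:121"])]
print(retrieve(episodes, np.random.rand(8), tool="sql"))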

Hit-rate behavior

Episodic memory improves hit rate for:

Long-horizon tasks: “have we run a similar migration before?”, “how did this kind of incident resolve in the past?”

Pattern reuse: retrieving prior workflows plus outcomes, not just facts.

Hit rate still depends on episode boundaries and index quality.

Failure modes

Episode boundary errors: too coarse (episodes that mix unrelated tasks) or too fine (episodes that cut mid-task).

Consolidation mistakes: wrong abstractions during distillation propagate bias into parametric models or global policies.

Multi-agent misalignment: per-agent episodes instead of per-task episodes make cross-agent reasoning harder.

When it is useful

Long-lived agents and workflows spanning weeks or months.

Systems where “similar past cases” are more useful than raw facts.

Training / adaptation loops where episodes can feed back into model updates.

Key Takeaways

Memory is a systems problem, not a prompt trick: Reliable multi-agent setups need explicit design around what is stored, how it is retrieved, and how the system reacts when memory is stale, missing, or wrong.

Vector memory is fast but structurally weak: Plain and tiered vector stores give low-latency, sublinear retrieval, but struggle with temporal reasoning, cross-session state, and multi-hop dependencies, making them unreliable as the sole memory backbone in planning workflows.

Graph memory fixes temporal and relational blind spots: Temporal KGs (e.g., Zep/Graphiti) and GraphRAG-style knowledge graphs improve hit rate and latency on entity-centric, temporal, and multi-document queries by encoding entities, relations, and time explicitly.

Event logs and checkpoints are the ground truth: ALAS-style execution logs and LangGraph-style checkpoints provide the authoritative record of what agents actually did, enabling replay, localized repair, and real observability in production systems.

Robust systems compose multiple memory layers: Practical agent architectures combine vector, graph, and event/episodic memory, with clear roles and known failure modes for each, instead of relying on a single ‘magic’ memory mechanism.

References:

MemGPT (virtual context / tiered vector memory)

https://arxiv.org/abs/2310.08560

https://arxiv.org/pdf/2310.08560

https://research.memgpt.ai/

Zep / Graphiti (temporal knowledge graph memory, DMR, LongMemEval)

https://arxiv.org/abs/2501.13956

https://www.getzep.com/

https://github.com/getzep/graphiti

https://www.emergentmind.com/topics/zep-a-temporal-knowledge-graph-architecture

GraphRAG (knowledge-graph RAG, hierarchical communities)

https://microsoft.github.io/graphrag/index/default_dataflow/

https://graphrag.com/reference/graphrag/global-community-summary-retriever/

https://github.com/microsoft/graphrag

ALAS (transactional / disruption-aware multi-agent planning, execution logs)

https://arxiv.org/abs/2505.12501

https://arxiv.org/abs/2511.03094

https://www.themoonlight.io/en/review/alas-transactional-and-dynamic-multi-agent-llm-planning

https://www.researchgate.net/publication/397322324_ALAS_Transactional_and_Dynamic_Multi-Agent_LLM_Planning

LangGraph (checkpoints / memory, thread-scoped state)

https://docs.langchain.com/oss/python/langgraph/memory

https://medium.com/@anil.jain.baba/long-term-agentic-memory-with-langgraph-824050b09852

Supplemental GraphRAG + temporal KG context

https://memgraph.com/blog/how-microsoft-graphrag-works-with-graph-databases

The post Comparing Memory Systems for LLM Agents: Vector, Graph, and Event Logs appeared first on MarkTechPost.

Fine-tune VLMs for multipage document-to-JSON with SageMaker AI and SW …

Extracting structured data from documents like invoices, receipts, and forms is a persistent business challenge. Variations in format, layout, language, and vendor make standardization difficult, and manual data entry is slow, error-prone, and unscalable. Traditional optical character recognition (OCR) and rule-based systems often fall short in handling this complexity. For instance, a regional bank might need to process thousands of disparate documents—loan applications, tax returns, pay stubs, and IDs—where manual methods create bottlenecks and increase the risk of error. Intelligent document processing (IDP) aims to solve these challenges by using AI to classify documents, extract or derive relevant information, and validate the extracted data to use it in business processes. One of its core goals is to convert unstructured or semi-structured documents into usable, structured formats such as JSON, which then contain specific fields, tables, or other structured target information. The target structure needs to be consistent, so that it can be used as part of workflows or other downstream business systems or for reporting and insights generation. The following figure shows the workflow, which involves ingesting unstructured documents (for example, invoices from multiple vendors with varying layouts) and extracting relevant information. Despite differences in keywords, column names, or formats across documents, the system normalizes and outputs the extracted data into a consistent, structured JSON format.

Vision language models (VLMs) mark a revolutionary advancement in IDP. VLMs integrate large language models (LLMs) with specialized image encoders, creating truly multi-modal AI capabilities of both textual reasoning and visual interpretation. Unlike traditional document processing tools, VLMs process documents more holistically—simultaneously analyzing text content, document layout, spatial relationships, and visual elements in a manner that more closely resembles human comprehension. This approach enables VLMs to extract meaning from documents with unprecedented accuracy and contextual understanding. For readers interested in exploring the foundations of this technology, Sebastian Raschka’s post—Understanding Multimodal LLMs—offers an excellent primer on multimodal LLMs and their capabilities.
This post has four main sections that reflect the primary contributions of our work and include:

An overview of the various IDP approaches available, including the option (our recommended solution) for fine-tuning as a scalable approach.
Sample code for fine-tuning VLMs for document-to-JSON conversion using Amazon SageMaker AI and the SWIFT framework, a lightweight toolkit for fine-tuning various large models.
Developing an evaluation framework to assess performance processing structured data.
A discussion of the possible deployment options, including an explicit example for deploying the fine-tuned adapter.

SageMaker AI is a fully managed service to build, train and deploy models at scale. In this post, we use SageMaker AI to fine-tune the VLMs and deploy them for both batch and real-time inference.
Prerequisites
Before you begin, make sure you have the following set up so that you can successfully follow the steps outlined in this post and the accompanying GitHub repository:

AWS account: You need an active AWS account with permissions to create and manage resources in SageMaker AI, Amazon Simple Storage Service (Amazon S3), and Amazon Elastic Container Registry (Amazon ECR).
IAM permissions: Your IAM user or role must have sufficient permissions. For production setups, follow the principle of least privilege as described in security best practices in IAM. For a sandbox setup, we suggest the following permissions:

Full access to Amazon SageMaker AI (for example, AmazonSageMakerFullAccess).
Read/write access to S3 buckets for storing datasets and model artifacts.
Permissions to push and pull Docker images from Amazon ECR (for example, AmazonEC2ContainerRegistryPowerUser).
If using specific SageMaker instance types, make sure your service quotas are sufficient.

GitHub repository: Clone or download the project code from our GitHub repository. This repository contains the notebooks, scripts, and Docker artifacts referenced in this post.

git clone https://github.com/aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai.git

Local environment set up:

Python: Python 3.10 or higher is recommended.
AWS CLI: Make sure the AWS Command Line Interface (AWS CLI) is installed and configured with credentials that have the necessary permissions.
Docker: Docker must be installed and running on your local machine if you plan to build the custom Docker container for deployment.
Jupyter Notebook and Lab: To run the provided notebooks.
Install the required Python packages by running pip install -r requirements.txt from the cloned repository’s root directory.

Familiarity (recommended):

Basic understanding of Python programming.
Familiarity with AWS services, particularly SageMaker AI.
Conceptual knowledge of LLMs, VLMs, and the container technology will be beneficial.

Overview of document processing and generative AI approaches
There are varying degrees of autonomy in intelligent document processing. On one end of the spectrum are fully manual processes: humans read documents and type the information into a computer system. Most systems today are semi-autonomous document processing solutions; for example, a person photographs a receipt and uploads it to a system that automatically extracts part of the information. The goal is to reach fully autonomous intelligent document processing, which means reducing the error rate and assessing the use case-specific risk of errors. AI is significantly transforming document processing by enabling greater levels of automation. A variety of approaches exist, ranging in complexity and accuracy, from specialized OCR models to generative AI.
Specialized OCR models that don’t rely on generative AI are designed as pre-trained, task-specific ML models that excel at extracting structured information such as tables, forms, and key-value pairs from common document types like invoices, receipts, and IDs. Amazon Textract is one example of this type of service. This service offers high accuracy out of the box and requires minimal setup, making it well-suited for workloads where basic text extraction is required, and documents don’t vary significantly in structure or contain images.
However, as you increase the complexity and variability of documents, in addition to adding multimodality, using generative AI can help improve document processing pipelines.
While powerful, applying general-purpose VLMs or LLMs to document processing isn’t straightforward. Effective prompt engineering is important to guide the model. Processing large volumes of documents (scaling) requires efficient batching and infrastructure. Because LLMs are stateless, providing historical context or specific schema requirements for every document can be cumbersome.
Approaches to intelligent document processing that use LLMs or VLMs fall into four categories:

Zero-shot prompting: the foundation model (FM) receives the result of previous OCR or a PDF and the instructions to perform the document processing task.
Few-shot prompting: the FM receives the result of previous OCR or a PDF, the instructions to perform the document processing task, and some examples.
Retrieval-augmented few-shot prompting: similar to the preceding strategy, but the examples sent to the model are selected dynamically using Retrieval Augmented Generation (RAG).
Fine-tuning VLMs

The following figure shows the relationship between increasing effort and complexity and task accuracy, demonstrating how different techniques, from basic prompt engineering to advanced fine-tuning, impact the performance of large and small base models compared to a specialized solution (inspired by the blog post Comparing LLM fine-tuning methods).

As you move across the horizontal axis, the strategies grow in complexity, and as you move up the vertical axis, you improve overall accuracy. In general, large base models provide better performance than small base models in the strategies that require prompt engineering, however as we explain in the results of this post, fine-tuning small base models can deliver similar results as fine-tuning large base models for a specific task.
Zero-shot prompting
Zero-shot prompting is a technique to use language models where the model is given a task without prior examples or fine-tuning. Instead, it relies solely on the prompt’s wording and its pre-trained knowledge to generate a response. In document processing, this approach involves giving the model either an image of a PDF document, the OCR-extracted text from the PDF, or a structured markdown representation of the document and providing instructions to perform the document processing task, in addition to the desired output format.
Amazon Bedrock Data Automation uses zero-shot prompting with generative AI to perform IDP. You can use Bedrock Data Automation to automate the transformation of multi-modal data—including documents containing text and complex structures, such as tables, charts and images—into structured formats. You can benefit from customization capabilities through the creation of blueprints that specify output requirements using natural language or a schema editor. Bedrock Data Automation can also extract bounding boxes for the identified entities and route documents appropriately to the correct blueprint. These features can be configured and used through a single API, making it significantly more powerful than a basic zero-shot prompting approach.
While out-of-the-box VLMs can handle general OCR tasks effectively, they often struggle with the unique structure and nuances of custom documents—such as invoices from diverse vendors. Although crafting a prompt for a single document might be straightforward, the variability across hundreds of vendor formats makes prompt iteration a labor-intensive and time-consuming process.
Few-shot prompting
Moving to a more complex approach, you have few-shot prompting, a technique used with LLMs where a small number of examples are provided within the prompt to guide the model in completing a specific task. Unlike zero-shot prompting, which relies solely on natural language instructions, few-shot prompting improves accuracy and consistency by demonstrating the desired input-output behavior through examples.
One alternative is to use the Amazon Bedrock Converse API to perform few-shot prompting. The Converse API provides a consistent way to access LLMs using Amazon Bedrock. It supports turn-based messages between the user and the generative AI model and allows including documents as part of the content. Another option is Amazon SageMaker JumpStart, which you can use to deploy models from providers like Hugging Face.
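As a sketch of few-shot document prompting with the Converse API, the following assumes boto3 credentials are already configured; the model ID, file path, field names, and example text are placeholders, and the exact document-block schema should be verified against the Converse API documentation.

import boto3

bedrock = boto3.client("bedrock-runtime")

instructions = (
    "Extract BUYER, SELLER_NAME, DATE, and TOTAL from the invoice. "
    "Return only a JSON object with exactly these keys; use null when a value is missing."
)

# One worked example as a prior conversation turn (few-shot); the text is illustrative
example_invoice_text = "ACME GmbH invoice ... Buyer: Globex Corp ... Date: 2024-03-01 ... Total: 118.30 EUR"
example_answer = '{"BUYER": "Globex Corp", "SELLER_NAME": "ACME GmbH", "DATE": "2024-03-01", "TOTAL": "118.30 EUR"}'

with open("invoice.pdf", "rb") as f:   # the document to process (placeholder path)
    pdf_bytes = f.read()

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # placeholder; any Converse-compatible model
    messages=[
        {"role": "user", "content": [{"text": instructions + "\n\n" + example_invoice_text}]},
        {"role": "assistant", "content": [{"text": example_answer}]},
        {"role": "user", "content": [
            {"document": {"format": "pdf", "name": "invoice", "source": {"bytes": pdf_bytes}}},
            {"text": instructions},
        ]},
    ],
)
print(response["output"]["message"]["content"][0]["text"])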
However, your business most likely needs to process different types of documents (for example, invoices, contracts, and handwritten notes), and even within one document type there are many variations; for example, there is no single standardized invoice layout, and each vendor has their own layout that you cannot control. Finding a single example or a few examples that cover all the different documents you want to process is challenging.
Retrieval-augmented few-shot prompting
One way to address the challenge of finding the right examples is to dynamically retrieve previously processed documents as examples and add them to the prompt at runtime (RAG).
You can store a few annotated samples in a vector store and retrieve them based on the document that needs to be processed. Amazon Bedrock Knowledge Bases helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows.
This turns the intelligent document processing problem into a search problem, which comes with its own challenges around improving the accuracy of the search. In addition to the difficulty of scaling to multiple document types, the few-shot approach is costly because every processed document requires a longer prompt that includes examples, which increases the number of input tokens.

As shown in the preceding figure, the prompt context will vary based on the strategy selected (zero-shot, few-shot or few-shot with RAG), which will overall change the results obtained.
Fine-tuning VLMs
At the end of the spectrum, you have the option to fine-tune a custom model to perform document processing. This is our recommended approach and what we focus on in this post. Fine-tuning is a method where a pre-trained LLM is further trained on a specific dataset to specialize it for a particular task or domain. In the context of document processing, fine-tuning involves using labeled examples—such as annotated invoices, contracts, or insurance forms—to teach the model exactly how to extract or interpret relevant information. Usually, the labor-intensive part of fine-tuning is acquiring a suitable, high-quality dataset. In the case of document processing, your company probably already has a historic dataset in its existing document processing system. You can export this data from your document processing system (for example from your enterprise resource planning (ERP) system) and use it as the dataset for fine-tuning. This fine-tuning approach is what we focus on in this post as a scalable, high accuracy, and cost-effective approach for intelligent document processing.
The preceding approaches represent a spectrum of strategies to improve LLM performance along two axes: LLM optimization (shaping model behavior through prompt engineering or fine-tuning) and context optimization (enhancing what the model knows at inference through techniques such as few-shot learning or RAG). These methods can be combined—for example, using RAG with few-shot prompts or incorporating retrieved data into fine-tuning—to maximize accuracy.
Fine-tuning VLMs for document-to-JSON conversion
Our approach—the recommended solution for cost-effective document-to-JSON conversion—uses a VLM and fine-tunes it using a dataset of historical documents paired with their corresponding ground-truth JSON that we consider as annotations. This allows the model to learn the specific patterns, fields, and output structure relevant to your historic data, effectively teaching it to read your documents and extract information according to your desired schema.
The following figure shows a high-level architecture of the document-to-JSON conversion process for fine-tuning VLMs by using historic data. This allows the VLM to learn from high data variations and helps ensure that the structured output matches the target system structure and format. 

Fine-tuning offers several advantages over relying solely on OCR or general VLMs:

Schema adherence: The model learns to output JSON matching a specific target structure, which is vital for integration with downstream systems like ERPs.
Implicit field location: Fine-tuned VLMs often learn to locate and extract fields without explicit bounding box annotations in the training data, simplifying data preparation significantly.
Improved text extraction quality: The model becomes more accurate at extracting text even from visually complex or noisy document layouts.
Contextual understanding: The model can better understand the relationships between different pieces of information on the document.
Reduced prompt engineering: Post fine-tuning, the model requires less complex or shorter prompts because the desired extraction behavior is built into its weights.

For our fine-tuning process, we selected the Swift framework. Swift provides a comprehensive, lightweight toolkit for fine-tuning various large language models, including VLMs like Qwen-VL and Llama-Vision.
Data preparation
To fine-tune the VLMs, you will use the Fatura2 dataset, a multi-layout invoice image dataset comprising 10,000 invoices with 50 distinct layouts.
The Swift framework expects training data in a specific JSONL (JSON Lines) format. Each line in the file is a JSON object representing a single training example. For multimodal tasks, this JSON object typically includes:

messages: A list of conversational turns (for example, system, user, assistant). The user turn contains placeholders for images (for example, <image>) and the text prompt that guides the model. The assistant turn contains the target output, which in this case is the ground-truth JSON string.
images: A list of relative paths—within the dataset directory structure—to the document page images (JPG files) relevant to this training example.

As with standard ML practice, the dataset is split into training, development (validation), and test sets to effectively train the model, tune hyperparameters, and evaluate its final performance on unseen data. Each document (which could be single-page or multi-page) paired with its corresponding ground-truth JSON annotation constitutes a single row or example in our dataset. In our use case, one training sample is the invoice image (or multiple images of document pages) and the corresponding detailed JSON extraction. This one-to-one mapping is essential for supervised fine-tuning.
The conversion process, detailed in the dataset creation notebook from the associated GitHub repo, involves several key steps:

Image handling: If the source document is a PDF, each page is rendered into a high-quality PNG image.
Annotation processing (fill missing values): We apply light pre-processing to the raw JSON annotation. Fine-tuning multiple models on an open source dataset, we observed that the performance increases when all keys are present in every JSON sample. To maintain this consistency, the target JSONs in the dataset are made to include the same set of top-level keys (derived from the entire dataset). If a key is missing for a particular document, it’s added with a null value.
Key ordering: The keys within the processed JSON annotation are sorted alphabetically. This consistent ordering helps the model learn a stable output structure.
Prompt construction: A user prompt is constructed. This prompt includes <image> tags (one for each page of the document) and explicitly lists the JSON keys the model is expected to extract. Including the JSON keys in the prompts improves the fine-tuned model’s performance.
Swift formatting: These components (prompt, image paths, target JSON) are assembled into the Swift JSONL format. Swift datasets support multimodal inputs, including images, videos and audios.

The following is an example structure of a single training instance in Swift’s JSONL format, demonstrating how multimodal inputs are organized. This includes conversational messages, paths to images, and objects containing bounding box (bbox) coordinates for visual references within the text. For more information about how to create a custom dataset for Swift, see the Swift documentation.

{
  "messages": [
    {"role": "system", "content": "Task definition"},
    {"role": "user", "content": "<image><image>… + optional text prompt"},
    {"role": "assistant", "content": "JSON or text output with extracted data with <bbox> references."}
  ],
  "images": ["path/to/image1.png", "path/to/image2.png"],
  "objects": {"ref": [], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}  # Optional
}
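The following minimal sketch illustrates steps 2 through 4 of the conversion (filling missing keys, sorting keys, and building the prompt) for a single document. The helper name, prompt wording, and sample values are illustrative; the authoritative logic lives in the dataset creation notebook.

import json

def build_swift_record(image_paths, raw_annotation, all_keys):
    # Fill missing keys with null so every sample shares the same schema
    filled = {k: raw_annotation.get(k) for k in all_keys}
    # Sort keys alphabetically for a stable output structure
    target_json = json.dumps(dict(sorted(filled.items())), ensure_ascii=False)
    # One <image> tag per page, plus the list of keys the model should extract
    prompt = "".join("<image>" for _ in image_paths) \
        + "Extract the following fields as JSON: " + ", ".join(sorted(all_keys))
    return {
        "messages": [
            {"role": "system", "content": "Task definition"},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": target_json},
        ],
        "images": image_paths,
    }

# One JSONL line per document
record = build_swift_record(["invoices/inv_001_p1.png"], {"TOTAL": "118.30"}, ["TOTAL", "DATE"])
print(json.dumps(record, ensure_ascii=False))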

Fine-tuning frameworks and resources
In our evaluation of fine-tuning frameworks for use with SageMaker AI, we considered several prominent options highlighted in the community and relevant to our needs. These included Hugging Face Transformers, Hugging Face AutoTrain, Llama Factory, Unsloth, Torchtune, and ModelScope SWIFT (referred to simply as SWIFT in this post, aligning with the SWIFT 2024 paper by Zhao and others).
After experimenting with these, we decided to use SWIFT because of its lightweight nature, comprehensive support for various Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and DoRA, and its design tailored for efficient training of a wide array of models, including the VLMs used in this post (for example, Qwen2.5-VL). Its scripting approach integrates seamlessly with SageMaker AI training jobs, allowing for scalable and reproducible fine-tuning runs in the cloud.
There are several strategies for adapting pre-trained models: full fine-tuning, where all model parameters are updated; PEFT, which offers a more efficient alternative by updating only a small number of new parameters (adapters); and quantization, a technique that reduces model size and speeds up inference using lower-precision formats (see Sebastian Raschka's post on fine-tuning to learn more about each technique).
Our project uses LoRA and DoRA, as configured in the fine-tuning notebook.
The following is an example of configuring and running a fine-tuning job (LoRA) as a SageMaker AI training job using SWIFT and the SageMaker remote function decorator. When this function is called, the fine-tuning runs remotely as a SageMaker AI training job.

from sagemaker.remote_function import remote
import json
import os

@remote(instance_type="ml.g6e.12xlarge", volume_size=200, use_spot_instances=True)
def fine_tune_document(training_data_s3, train_data_path="train.jsonl", validation_data_path="validation.jsonl"):
    from swift.llm import sft_main

    # copy the training data from the input source to a local directory
    train_data_local_path = ...
    validation_data_local_path = ...

    # set and run the fine-tuning using the ms-swift framework
    os.environ["SIZE_FACTOR"] = json.dumps(8)        # can be increased but requires more GPU memory
    os.environ["MAX_PIXELS"] = json.dumps(602112)    # can be increased but requires more GPU memory
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"   # GPU devices to be used
    os.environ["NPROC_PER_NODE"] = "4"               # we have 4 GPUs on one instance
    os.environ["USE_HF_TRANSFER"] = json.dumps(1)

    argv = ['--model_type', 'qwen2_5_vl',
            '--model_id_or_path', 'Qwen/Qwen2.5-VL-3B-Instruct',
            '--train_type', 'lora',
            '--use_dora', 'true',
            '--output_dir', checkpoint_dir,          # checkpoint_dir is defined earlier in the notebook
            '--max_length', '4096',
            '--dataset', train_data_local_path,
            '--val_dataset', validation_data_local_path,
            ]

    sft_main(argv)
    # potentially evaluate inference on the test dataset
    return "done"

Fine-tuning VLMs typically requires GPU instances because of their computational demands. For models like Qwen2.5-VL 3B, an instance such as an Amazon SageMaker AI ml.g5.2xlarge or ml.g6.8xlarge can be suitable. Training time is a function of dataset size, model size, batch size, number of epochs, and other hyperparameters. For instance, as noted in our project readme.md, fine-tuning Qwen2.5 VL 3B on 300 Fatura2 samples took approximately 2,829 seconds (roughly 47 minutes) on an ml.g6.8xlarge instance using Spot pricing. This demonstrates how smaller models, when fine-tuned effectively, can deliver exceptional performance cost-efficiently. Larger models like Llama-3.2-11B-Vision would generally require more substantial GPU resources (for example, ml.g5.12xlarge or larger) and longer training times.
Evaluation and visualization of structured outputs (JSON)
A key aspect of any automation or machine learning project is evaluation. Without evaluating your solution, you don’t know how well it performs at solving your business problem. We wrote an evaluation notebook that you can use as a framework. Evaluating the performance of document-to-JSON models involves comparing the model-generated JSON outputs for unseen input documents (test dataset) against the ground-truth JSON annotations.
Key metrics employed in our project include:

Exact match (EM) – accuracy: This metric measures whether the extracted value for a specific field is an exact character-by-character match to the ground-truth value. It’s a strict metric, often reported as a percentage.
Character error rate (CER) – edit distance: This metric calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the model’s predicted string into the ground-truth string, typically normalized by the length of the ground-truth string. A lower CER indicates better performance.
Recall-Oriented Understudy for Gisting Evaluation (ROUGE): This is a suite of metrics that compare n-grams (sequences of words) and the longest common subsequence between the predicted output and the reference. While traditionally used for text summarization, ROUGE scores can also provide insights into the overall textual similarity of the generated JSON string compared to the ground truth.
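As a small illustration of the first two metrics, the following sketch computes exact match and a normalized character edit distance for a single field value; it is a simplified stand-in for the logic in the evaluation notebook.

def exact_match(pred: str, truth: str) -> float:
    return 1.0 if pred == truth else 0.0

def cer(pred: str, truth: str) -> float:
    # Levenshtein edit distance, normalized by the ground-truth length
    m, n = len(pred), len(truth)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(row[j] + 1,                            # deletion
                         row[j - 1] + 1,                        # insertion
                         prev + (pred[i - 1] != truth[j - 1]))  # substitution
            prev = cur
    return row[n] / max(n, 1)

print(exact_match("118.30", "118.30"), round(cer("118,30", "118.30"), 3))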

Visualizations are helpful for understanding model performance nuances. The following edit distance heatmap image provides a granular view, showing how closely the predictions match the ground truth (green means the model’s output exactly matches the ground truth, and shades of yellow, orange, and red depict increasing deviations). Each model has its own panel, allowing quick comparison across models. The X-axis is the number of sample documents; in this case, we ran inference on 250 unseen samples from the Fatura2 dataset. The Y-axis shows the JSON keys that we asked the model to extract, which will be different for you depending on what structure your downstream system requires.
In the image, you can see the performance of three different models on the Fatura2 dataset. From left to right: Qwen2.5 VL 3B fine-tuned on 300 samples from the Fatura2 dataset, in the middle Qwen2.5 VL 3B without fine-tuning (labeled vanilla), and Llama 3.2 11B vision fine-tuned on 1,000 samples.
The grey color shows the samples for which the Fatura2 dataset doesn’t contain any ground truth, which is why those are the same across the three models.
For a detailed, step-by-step walk-through of how the evaluation metrics are calculated, the specific Python code used, and how the visualizations are generated, see the comprehensive evaluation notebook in our project.

The image shows that Qwen2.5 vanilla is only decent at extracting the Title and Seller Name from the document. For the other keys, it makes more than six character-edit mistakes. However, out of the box Qwen2.5 is good at adhering to the JSON schema, with only a few predictions where the key is missing (dark blue color) and no predictions of JSON that couldn’t be parsed (for example, missing quotation marks, missing parentheses, or a missing comma). Examining the two fine-tuned models, you can see a clear improvement in performance, with most samples exactly matching the ground truth on all keys. There are only slight differences between fine-tuned Qwen2.5 and fine-tuned Llama 3.2; for example, fine-tuned Qwen2.5 slightly outperforms fine-tuned Llama 3.2 on Total, Title, Conditions, and Buyer, whereas fine-tuned Llama 3.2 slightly outperforms fine-tuned Qwen2.5 on Seller Address, Discount, and Tax.
The goal is to input a document into your fine-tuned model and receive a clean, structured JSON object that accurately maps the extracted information to predefined fields. JSON-constrained decoding enforces adherence to a specified JSON schema during inference and is useful to make sure the output is valid JSON. For the Fatura2 dataset, this approach was not necessary—our fine-tuned Qwen 2.5 model consistently produced valid JSON outputs without additional constraints. However, incorporating constrained decoding remains a valuable safeguard, particularly for production environments where output reliability is critical.
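As a lightweight safeguard short of full constrained decoding, you can validate each prediction after generation. The following sketch is illustrative; the expected keys are placeholders for your target schema.

import json

EXPECTED_KEYS = {"TITLE", "SELLER_NAME", "TOTAL"}  # placeholder subset of the target schema

def validate_prediction(raw_output: str):
    """Lightweight post-hoc check; a safeguard, not full constrained decoding."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return None, "invalid JSON"
    missing = EXPECTED_KEYS - parsed.keys()
    return parsed, (f"missing keys: {sorted(missing)}" if missing else "ok")

parsed, status = validate_prediction('{"TITLE": "Invoice", "SELLER_NAME": "ACME", "TOTAL": "118.30"}')
print(status)  # ok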
Notebook 07 visualizes the input document and the extracted JSON data side-by-side.
Deploying the fine-tuned model
After you fine-tune a model and evaluate it on your dataset, you will want to deploy it to run inference to process your documents. Depending on your use case, a different deployment option might be more suitable.
Option a: vLLM container extended for SageMaker
To deploy our fine-tuned model for real-time inference, we use SageMaker endpoints. SageMaker endpoints provide fully managed hosting for real-time inference for FMs, deep learning models, and other ML models, and they allow managed autoscaling and cost-optimal deployment techniques. The process, detailed in our deploy model notebook, involves building a custom Docker container. This container packages the vLLM serving engine, which is highly optimized for LLM and VLM inference, along with the Swift framework components needed to load our specific model and adapter. vLLM provides an OpenAI-compatible API server by default, suitable for handling document and image inputs with VLMs. Our custom docker-artifacts and Dockerfile adapt this vLLM base for SageMaker deployment. Key steps include:

Setting up the necessary environment and dependencies.
Configuring an entry point that initializes the vLLM server.
Making sure the server can load the base VLM and dynamically apply our fine-tuned LoRA adapter. The Amazon S3 path to the adapter (model.tar.gz) is passed using the ADAPTER_URI environment variable when creating the SageMaker model.
The container, after being built and pushed to Amazon ECR, is then deployed to a SageMaker endpoint, which listens for invocation requests and routes them to the vLLM engine inside the container.

The following image shows a SageMaker vLLM deployment architecture, where a custom Docker container from Amazon ECR is deployed to a SageMaker endpoint. The container uses vLLM’s OpenAI-compatible API and Swift to serve a base VLM with a fine-tuned LoRA adapter dynamically loaded from Amazon S3.
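A condensed sketch of the deployment call with the SageMaker Python SDK is shown below; the image URI, S3 path, instance type, and endpoint name are placeholders, and the complete steps are in the deploy model notebook.

import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes you run in a SageMaker environment with an execution role

vllm_model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/vllm-swift:latest",  # custom image in Amazon ECR (placeholder)
    role=role,
    env={"ADAPTER_URI": "s3://<bucket>/path/to/model.tar.gz"},  # fine-tuned LoRA adapter (placeholder path)
    sagemaker_session=session,
)

predictor = vllm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="doc-to-json-vllm",
)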

Option b (optional): Inference components on SageMaker
For more complex inference workflows that might involve sophisticated pre-processing of input documents, post-processing of the extracted JSON, or even chaining multiple models (for example, a classification model followed by an extraction model), Amazon SageMaker inference components offer enhanced flexibility. You can use them to build a pipeline of multiple containers or models within a single endpoint, each handling a specific part of the inference logic.
Option c: Custom model inference in Amazon Bedrock
You can now import your custom models in Amazon Bedrock and then use Amazon Bedrock features to make inference calls to the model. Qwen 2.5 architecture is supported (see Supported Architectures). For more information, see Amazon Bedrock Custom Model Import now generally available.
Clean up
To avoid ongoing charges, it’s important to remove the AWS resources created for this project when you’re finished.

SageMaker endpoints and models:

In the AWS Management Console for SageMaker AI, go to Inference and then Endpoints. Select and delete endpoints created for this project.
Then, go to Inference and then Models and delete the associated models.

Amazon S3 data:

Navigate to the Amazon S3 console.
Delete the S3 buckets or specific folders or prefixes used for datasets, model artifacts (for example, model.tar.gz from training jobs), and inference results. Note: Make sure you don’t delete data needed by other projects.

Amazon ECR images and repositories:

In the Amazon ECR console, delete Docker images and the repository created for the custom vLLM container if you deployed one.

CloudWatch logs (optional):

Logs from SageMaker activities are stored in Amazon CloudWatch. You can delete relevant log groups (for example, /aws/sagemaker/TrainingJobs and /aws/sagemaker/Endpoints) if desired, though many have automatic retention policies.

Important: Always verify resources before deletion. If you experimented with Amazon Bedrock custom model imports, make sure those are also cleaned up. Use AWS Cost Explorer to monitor for unexpected charges.
Conclusion and future outlook
In this post, we demonstrated that fine-tuning VLMs provides a powerful and flexible approach to automate and significantly enhance document understanding capabilities. We have also demonstrated that using focused fine-tuning allows smaller, multi-modal models to compete effectively with much larger counterparts (98% accuracy with Qwen2.5 VL 3B). The project also highlights that fine-tuning VLMs for document-to-JSON processing can be done cost-effectively by using Spot instances and PEFT methods (approximately $1 USD to fine-tune a 3 billion parameter model on around 200 documents).
The fine-tuning task was conducted using Amazon SageMaker training jobs and the Swift framework, which proved to be a versatile and effective toolkit for orchestrating this fine-tuning process.
The potential for enhancing and expanding this work is vast. Some exciting future directions include deploying structured document models on CPU-based, serverless compute like AWS Lambda or Amazon SageMaker Serverless Inference using tools like llama.cpp or vLLM. Using quantized models can enable low-latency, cost-efficient inference for sporadic workloads. Another future direction includes improving evaluation of structured outputs by going beyond field-level metrics. This includes validating complex nested structures and tables using methods like tree edit distance for tables (TEDS).
The complete code repository, including the notebooks, utility scripts, and Docker artifacts, is available on GitHub to help you get started unlocking insights from your documents. For a similar approach, using Amazon Nova, please refer to this AWS blog for optimizing document AI and structured outputs by fine-tuning Amazon Nova Models and on-demand inference.

About the Authors
Arlind Nocaj is a GTM Specialist Solutions Architect for AI/ML and Generative AI for Europe central based in AWS Zurich Office, who guides enterprise customers through their digital transformation journeys. With a PhD in network analytics and visualization (Graph Drawing) and over a decade of experience as a research scientist and software engineer, he brings a unique blend of academic rigor and practical expertise to his role. His primary focus lies in using the full potential of data, algorithms, and cloud technologies to drive innovation and efficiency. His areas of expertise include Machine Learning, Generative AI and in particular Agentic systems with Multi-modal LLMs for document processing and structured insights.
Malte Reimann is a Solutions Architect based in Zurich, working with customers across Switzerland and Austria on their cloud initiatives. His focus lies in practical machine learning applications—from prompt optimization to fine-tuning vision language models for document processing. Most recently, he worked in a small team to provide deployment options for Apertus on AWS. An active member of the ML community, Malte balances his technical work with a disciplined approach to fitness, preferring early morning gym sessions when the gym is empty. During summer weekends, he explores the Swiss Alps on foot and enjoys time in nature. His approach to both technology and life is straightforward: consistent improvement through deliberate practice, whether that’s optimizing a customer’s cloud deployment or preparing for the next hike in the clouds.
Nick McCarthy is a Senior Generative AI Specialist Solutions Architect on the Amazon Bedrock team, focused on model customization. He has worked with AWS clients across a wide range of industries — including healthcare, finance, sports, telecommunications, and energy — helping them accelerate business outcomes through the use of AI and machine learning. Outside of work, Nick loves traveling, exploring new cuisines, and reading about science and technology. He holds a Bachelor’s degree in Physics and a Master’s degree in Machine Learning.
Irene Marban Alvarez is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS), working with customers in the United Kingdom and Ireland. With a background in Biomedical Engineering and Masters in Artificial Intelligence, her work focuses on helping organizations leverage the latest AI technologies to accelerate their business. In her spare time, she loves reading and cooking for her friends.

How Clario automates clinical research analysis using generative AI on …

Clinical outcome assessment (COA) interviews are important instruments in clinical trials for evaluating the efficacy and safety of treatments. In studies of psychosis, anxiety, and mood disorders, these assessments often determine the success or failure of the trial, highlighting the importance of data quality and reliability. The traditional approach to evaluating the quality of these outcomes is complex and involves time-consuming, logistically challenging reviews of audio-video recordings in near real time. Interview evaluation variability, poor assessment technique, and other factors can introduce noise, leading to unreliable results and potentially to study failure.
About Clario
Clario is a leading provider of endpoint data solutions for systematic collection, management, and analysis of specific, pre-defined outcomes (endpoints) to evaluate a treatment’s safety and effectiveness in the clinical trials industry. Clario generates high-quality clinical evidence for life sciences companies seeking to bring new therapies to patients. Since its founding over 50 years ago, Clario has deployed endpoint data solutions over 30,000 times, supporting over 710 novel drug regulatory approvals across more than 100 countries.
In this post, we demonstrate how Clario has used Amazon Bedrock and other AWS services to build an AI-powered solution that automates and improves the analysis of COA interviews. We discuss how Clario:

implemented speaker diarization, multi-lingual transcription, and large language models (LLMs)
used vector databases and semantic search to evaluate interview quality
incorporated automation into complex assessment reviews while maintaining regulatory compliance

Business challenge
Clario sought to transform their COA review methodology to enhance operational effectiveness while also increasing data quality. The company required a system that could address the critical challenges of standardized review of multi-lingual data at a global scale, while reducing natural variation between different expert reviewers, and maintaining uniform assessment quality across the complex COA interview process. The solution also needed to efficiently manage large volumes of audio recordings while meeting strict regulatory and privacy requirements. Clario sought capabilities that could automatically analyze speech and dialogue in near real time during COA interviews to potentially enable:

Reduced subjectivity and variability – Delivering more consistent and reliable behavioral health assessments, minimizing site and rater bias.
Enhanced data quality and credibility – Improving the robustness of trial outcomes with objective, standardized, and repeatable interview evaluations.
Streamlined operations – Automated complex assessment review and scoring could save time and resources for geographically dispersed sites and sponsor-level clinical teams.
Accelerated decision-making – Gaining clearer insights earlier could support faster, evidence-based go or no-go decisions for the trial sponsors.

Solution
To address this challenge, Clario chose AWS for its comprehensive artificial intelligence and machine learning (AI/ML) capabilities and its proven ability to deploy HIPAA-compliant services at a global scale. Clario used the power of generative AI and Amazon Bedrock, a fully managed service that provides access to a diverse range of high-performing foundation models, which offers several key advantages:

No infrastructure management – Alleviate the operational overhead of managing AI model infrastructure and updates
Multiple model access – Compare and select from leading foundation models to optimize performance for their specific COA analysis needs
Built-in compliance features – Native support for data governance, audit trails, and regulatory requirements essential for clinical research
Rapid prototyping and deployment – Accelerated time-to-market through serverless architecture and pre-built integrations
Seamless AWS system integration – Native compatibility with existing AWS services for data storage, processing, and analytics
Enterprise security and privacy controls – Advanced encryption, access controls, and data residency options to help meet stringent industry standards
Continuous model improvements – Automatic access to model updates and new capabilities, reducing migration complexity

This comprehensive approach enabled Clario to focus on their core competency—clinical research excellence—while using cutting-edge AI capabilities through a trusted, compliance-aligned system.
The solution integrates advanced AI capabilities, including speaker diarization, multi-lingual transcription, semantic search, and agentic AI, to automatically review the quality of complex COA interviews in a manner similar to expert human central reviewers. The workflow orchestrates multiple steps where audio data is first analyzed to identify the unique speakers in the interview based on their voice, followed by speech-to-text conversion, and speaker role attribution to determine which speech corresponds to the interviewer and the study participant.
This information is segmented into semantically meaningful chunks based on speaker turns and natural conversation boundaries, with each segment maintaining crucial metadata. Examples of metadata include timestamps, speaker role, and positional context. These chunks are then vectorized and stored in an Amazon OpenSearch vector database, enabling the system to overcome the context window limitations of foundation models when processing lengthy interviews. The solution implements a sophisticated retrieval strategy where:

Overlapping windows makes sure that contextual information is not lost at segment boundaries
Targeted semantic searches identify specific dialogue segments relevant to each assessment criterion
A hierarchical approach preserves both local conversational flow and global interview context through interview-level summaries and speaker roles
Rolling context windows can be dynamically assembled when evaluating criteria that span multiple segments

This architecture allows the system to efficiently handle multiple queries against the same interview data while maintaining contextual relationships throughout the conversation. The system uses this semantic retrieval capability to analyze the content of the dialogue between the interviewer and the participant, evaluating it against a structured interview guide and central review checklist. The output of the workflow includes a quality rating for the interview, along with structured feedback for each checklist item, specifying where the interview diverges from the established standards. The overall system provides near real-time insights into the quality and reliability of the COA interview, supporting faster evidence-based go or no-go decisions for sponsors of clinical trials.
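To illustrate one of these ideas in isolation, the following generic sketch groups diarized speaker turns into overlapping windows. It is not Clario's implementation; the window sizes and sample turns are invented for illustration.

def chunk_turns(turns, window=6, overlap=2):
    """Group (speaker, text, start, end) turns into overlapping windows."""
    chunks, step = [], max(window - overlap, 1)
    for i in range(0, len(turns), step):
        segment = turns[i:i + window]
        if not segment:
            break
        chunks.append({
            "text": " ".join(f"{spk}: {txt}" for spk, txt, *_ in segment),
            "start": segment[0][2],          # timestamp of the first turn in the window
            "end": segment[-1][3],           # timestamp of the last turn in the window
            "speakers": sorted({spk for spk, *_ in segment}),
        })
        if i + window >= len(turns):
            break
    return chunks

turns = [("Interviewer", "How have you been sleeping?", 0.0, 4.1),
         ("Participant", "Not well, maybe four hours a night.", 4.2, 9.0)]
print(chunk_turns(turns, window=2, overlap=1)[0]["speakers"])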
Solution architecture
The following architecture diagram illustrates the solution implementation:

The workflow consists of the following steps:

The COA interview recordings (audio and video files) from the interviews are collected on premises (1) using a recording application. The files are uploaded using AWS Direct Connect with encryption in transit to Amazon Simple Storage Service (Amazon S3)(2). The uploaded documents are then automatically stored with server-side object-level encryption.
After the files are uploaded, Clario’s AI Orchestration Engine (3) extracts the audio and identifies speech segments of unique speakers using a custom speaker diarization model on Amazon SageMaker (4).
The Orchestration Engine also invokes the Amazon Bedrock API for automated audio transcription. Clario uses the Whisper model from the Amazon Bedrock Marketplace (5) to generate near real-time transcriptions of the COA interview recordings. The transcriptions are annotated with speaker information and timecodes, vectorized using an embedding model (Amazon Titan Text Embeddings v2), and stored in Amazon OpenSearch (7) for semantic retrieval.
After the information has been vectorized and stored, Clario’s AI Orchestration Engine executes a graph-based agent system running on Amazon Elastic Kubernetes Service (Amazon EKS)(3) for automated COA interview review. The agent implements a multi-step workflow that: (1) retrieves the assessment’s structured interview guide from configuration, (2) loads the corresponding central review checklist criteria, and (3) systematically queries Amazon OpenSearch (7) to extract relevant interview segments. Using the pre-configured graph structure for the task at hand, the agent traverses predefined decision nodes to compare interview responses against standardized assessment criteria, identify gaps or inconsistencies, and generate structured findings with supporting evidence citations.
The agent uses advanced large language models (LLMs), such as Anthropic Claude 3.7 Sonnet from Amazon Bedrock (6), to classify the speech segment as interviewer or participant, and to determine if each interview turn meets the interview quality criteria.
Clario’s AI Orchestration Engine then compiles the overall review of the interview and persists the information in Amazon Relational Database Service (Amazon RDS)(8).
Results of the AI-powered automated review can be retrieved by a client application (9) by invoking a REST API using Amazon API Gateway endpoints (10).

Benefits and results
The initial implementation of this AI-powered solution is showing promise in improving Clario’s clinical trial processes:

Operational efficiency

Potential to decrease manual review effort by over 90%.

Quality improvements

Up to 100% data coverage through automated review versus human-only review of a smaller subset of recordings to spot check quality.
Highly targeted interventions might be enabled with rapid turnaround, focusing only on those raters and sites that require remediation.

Business impact

Potential to shorten turn-around time by decreasing central review time from weeks to hours.
Enhanced data reliability for regulatory submissions.
Reduced risk of study failure and uninterpretable results.
Improved scalability of clinical trial operations.

Lessons learned and best practices
Throughout the development and deployment of this solution, Clario has gained valuable insights and lessons learned that can benefit other organizations looking to implement similar AI-powered systems:

Importance of responsible AI development and use – During initial testing, Clario discovered that LLMs would occasionally generate plausible sounding but inaccurate summaries. This critical finding reinforced the importance of responsible AI practices in healthcare applications. This led Clario to implement a validation system where AI outputs are cross-checked against source documents for factual accuracy before human review.
Continuous model evaluation – Clario adopted a rigorous model evaluation process to maintain the highest standards of quality and reliability in their AI-powered COA interview analysis solution. Clario regularly assessed the performance and accuracy of their AI models through multiple approaches, including comparative studies on custom datasets, across multiple models and configurations.
Scalable and more secure architecture – The serverless, cloud-based architecture of the solution–using services like Amazon Bedrock, Amazon S3, and AWS Lambda–helped Clario to scale their solution effectively while prioritizing data security and compliance.

Next steps and conclusion
Clario’s innovative solution has the potential to transform the way COAs are reviewed and rated, significantly improving the reliability of clinical trial data and reducing the time and effort required for manual review. As Clario continues to refine and expand the capabilities of this AI-powered system, Clario is exploring additional use cases in neuroscience studies that rely on clinical interviews for evaluating the safety and efficacy of treatments.
By using generative AI and the robust features of Amazon Bedrock, Clario has set a new standard for clinical trial data analysis. This empowers their customers to make more informed decisions and accelerate the development of life-changing therapies.

About the authors
Alex Boudreau is the Director of AI at Clario. He leads the company’s innovative Generative AI department and oversees the development of the company’s advanced multi-modal GenAI Platform, which encompasses cutting-edge cloud engineering, AI engineering, and foundational AI research. Alex previously pioneered Deep Learning speech analysis systems for automotive applications, led cloud-based enterprise fraud detection solutions, advanced conversational AI technologies, and groundbreaking projects in medical image analysis. His expertise in leading high-impact initiatives positions him uniquely to drive forward the boundaries of AI technology in the business world.
Cuong Lai is the Technical Team Lead for the Generative AI team at Clario, where he helps to drive the development and scaling of the company’s generative AI platform. With over eight years of software engineering experience, he specializes in web development, API design, and architecting cloud-native solutions. Cuong has extensive experience leveraging AWS services to build secure, reliable, and high-performance systems that support large-scale AI workloads. He is passionate about advancing generative AI technologies and delivering innovative, production-ready AI solutions.
Praveen Haranahalli is a Senior Solutions Architect at Amazon Web Services (AWS), where he architects secure, scalable cloud solutions and provides strategic guidance to diverse enterprise customers. With nearly two decades of IT experience, Praveen has delivered transformative implementations across multiple industries. As a trusted technical advisor, he partners with customers to implement robust DevSecOps pipelines, establish comprehensive security guardrails, and develop innovative AI/ML solutions. He is passionate about solving complex business challenges through cutting-edge cloud architectures and empowering organizations to achieve successful digital transformations powered by artificial intelligence and machine learning.

A Coding Implementation to Build Neural Memory Agents with Differentia …

In this tutorial, we explore how neural memory agents can learn continuously without forgetting past experiences. We design a memory-augmented neural network that integrates a Differentiable Neural Computer (DNC) with experience replay and meta-learning to adapt quickly to new tasks while retaining prior knowledge. By implementing this approach in PyTorch, we demonstrate how content-based memory addressing and prioritized replay enable the model to overcome catastrophic forgetting and maintain performance across multiple learning tasks. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserimport torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    memory_size: int = 128
    memory_dim: int = 64
    num_read_heads: int = 4
    num_write_heads: int = 1

We begin by importing all the essential libraries and defining the configuration class for our neural memory system. Here, we set parameters such as memory size, dimensionality, and the number of read/write heads that shape how the differentiable memory behaves throughout training. This setup acts as the foundation upon which our memory-augmented architecture is built. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass NeuralMemoryBank(nn.Module):
    def __init__(self, config: MemoryConfig):
        super().__init__()
        self.memory_size = config.memory_size
        self.memory_dim = config.memory_dim
        self.num_read_heads = config.num_read_heads
        self.register_buffer('memory', torch.zeros(config.memory_size, config.memory_dim))
        self.register_buffer('usage', torch.zeros(config.memory_size))

    def content_addressing(self, key, beta):
        key_norm = F.normalize(key, dim=-1)
        mem_norm = F.normalize(self.memory, dim=-1)
        similarity = torch.matmul(key_norm, mem_norm.t())
        return F.softmax(beta * similarity, dim=-1)

    def write(self, write_key, write_vector, erase_vector, write_strength):
        write_weights = self.content_addressing(write_key, write_strength)
        erase = torch.outer(write_weights.squeeze(), erase_vector.squeeze())
        self.memory = (self.memory * (1 - erase)).detach()
        add = torch.outer(write_weights.squeeze(), write_vector.squeeze())
        self.memory = (self.memory + add).detach()
        self.usage = (0.99 * self.usage + write_weights.squeeze()).detach()

    def read(self, read_keys, read_strengths):
        reads = []
        for i in range(self.num_read_heads):
            weights = self.content_addressing(read_keys[i], read_strengths[i])
            read_vector = torch.matmul(weights, self.memory)
            reads.append(read_vector)
        return torch.cat(reads, dim=-1)


class MemoryController(nn.Module):
    def __init__(self, input_dim, hidden_dim, memory_config: MemoryConfig):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.memory_config = memory_config
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        total_read_dim = memory_config.num_read_heads * memory_config.memory_dim
        self.read_keys = nn.Linear(hidden_dim, memory_config.num_read_heads * memory_config.memory_dim)
        self.read_strengths = nn.Linear(hidden_dim, memory_config.num_read_heads)
        self.write_key = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.write_vector = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.erase_vector = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.write_strength = nn.Linear(hidden_dim, 1)
        self.output = nn.Linear(hidden_dim + total_read_dim, input_dim)

    def forward(self, x, memory_bank, hidden=None):
        lstm_out, hidden = self.lstm(x.unsqueeze(0), hidden)
        controller_state = lstm_out.squeeze(0)
        read_k = self.read_keys(controller_state).view(self.memory_config.num_read_heads, -1)
        read_s = F.softplus(self.read_strengths(controller_state))
        write_k = self.write_key(controller_state)
        write_v = torch.tanh(self.write_vector(controller_state))
        erase_v = torch.sigmoid(self.erase_vector(controller_state))
        write_s = F.softplus(self.write_strength(controller_state))
        read_vectors = memory_bank.read(read_k, read_s)
        memory_bank.write(write_k, write_v, erase_v, write_s)
        combined = torch.cat([controller_state, read_vectors], dim=-1)
        output = self.output(combined)
        return output, hidden

We implement the Neural Memory Bank and the Memory Controller, which together form the core of the agent’s differentiable memory mechanism. The Neural Memory Bank stores and retrieves information through content-based addressing, while the controller network dynamically interacts with this memory using read and write operations. This setup enables the agent to recall relevant information and adapt to new inputs efficiently. Check out the FULL CODES here.
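To sanity-check the tensor shapes and the read/write cycle, here is a minimal usage sketch (our own illustration, assuming the imports and classes defined above, not part of the original tutorial) that runs a single 64-dimensional input through the controller against a fresh memory bank:

config = MemoryConfig()
bank = NeuralMemoryBank(config)
controller = MemoryController(input_dim=64, hidden_dim=128, memory_config=config)

x = torch.randn(64)                        # one 64-dim input vector
out, hidden = controller(x, bank)          # one read/write cycle against the memory bank
print(out.shape)                           # torch.Size([64]), same dimensionality as the input
print(bool(bank.memory.abs().sum() > 0))   # True, the write has modified the memory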

class ExperienceReplay:
    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def push(self, experience, priority=1.0):
        self.buffer.append(experience)
        self.priorities.append(priority ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        if len(self.buffer) == 0:
            return [], []
        probs = np.array(self.priorities)
        probs = probs / probs.sum()
        indices = np.random.choice(len(self.buffer), min(batch_size, len(self.buffer)), p=probs, replace=False)
        samples = [self.buffer[i] for i in indices]
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights = weights / weights.max()
        return samples, torch.FloatTensor(weights)


class MetaLearner(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def adapt(self, support_x, support_y, memory_bank, num_steps=5, lr=0.01):
        adapted_params = {name: param.clone() for name, param in self.model.named_parameters()}
        for _ in range(num_steps):
            pred, _ = self.model(support_x, memory_bank)
            loss = F.mse_loss(pred, support_y)
            grads = torch.autograd.grad(loss, self.model.parameters(), create_graph=True)
            adapted_params = {name: param - lr * grad for (name, param), grad in zip(adapted_params.items(), grads)}
        return adapted_params

We design the Experience Replay and Meta-Learner components to strengthen the agent’s ability to learn continuously. The replay buffer enables the model to revisit past experiences through prioritized sampling, thereby reducing forgetting, while the Meta-Learner utilizes MAML-style adaptation for rapid learning on new tasks. Together, these modules bring stability and flexibility to the agent’s training process. Check out the FULL CODES here.
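As a quick illustration of how prioritized sampling behaves, the following snippet (our own example, assuming the classes and imports above) fills the buffer with dummy experiences whose priorities grow linearly and then draws a weighted batch:

buffer = ExperienceReplay(capacity=100)
for i in range(20):
    buffer.push((torch.randn(4), torch.randn(4)), priority=float(i + 1))   # later items get higher priority

samples, weights = buffer.sample(batch_size=4)
print(len(samples))   # 4 sampled experiences, biased toward high-priority entries
print(weights)        # importance-sampling weights, normalized so the largest is 1.0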

class ContinualLearningAgent:
    def __init__(self, input_dim=64, hidden_dim=128):
        self.config = MemoryConfig()
        self.memory_bank = NeuralMemoryBank(self.config)
        self.controller = MemoryController(input_dim, hidden_dim, self.config)
        self.replay_buffer = ExperienceReplay(capacity=5000)
        self.meta_learner = MetaLearner(self.controller)
        self.optimizer = torch.optim.Adam(self.controller.parameters(), lr=0.001)
        self.task_history = []

    def train_step(self, x, y, use_replay=True):
        self.optimizer.zero_grad()
        pred, _ = self.controller(x, self.memory_bank)
        current_loss = F.mse_loss(pred, y)
        self.replay_buffer.push((x.detach().clone(), y.detach().clone()), priority=current_loss.item() + 1e-6)
        total_loss = current_loss
        if use_replay and len(self.replay_buffer.buffer) > 16:
            samples, weights = self.replay_buffer.sample(8)
            for (replay_x, replay_y), weight in zip(samples, weights):
                with torch.enable_grad():
                    replay_pred, _ = self.controller(replay_x, self.memory_bank)
                    replay_loss = F.mse_loss(replay_pred, replay_y)
                    total_loss = total_loss + 0.3 * replay_loss * weight
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.controller.parameters(), 1.0)
        self.optimizer.step()
        return total_loss.item()

    def evaluate(self, test_data):
        self.controller.eval()
        total_error = 0
        with torch.no_grad():
            for x, y in test_data:
                pred, _ = self.controller(x, self.memory_bank)
                total_error += F.mse_loss(pred, y).item()
        self.controller.train()
        return total_error / len(test_data)

We construct a Continual Learning Agent that integrates memory, controller, replay, and meta-learning into a single, adaptive framework. In this step, we define how the agent trains on each batch, replays past data, and evaluates its performance. The implementation ensures that the model can retain prior knowledge while learning new information without catastrophic forgetting. Check out the FULL CODES here.

def create_task_data(task_id, num_samples=100):
    torch.manual_seed(task_id)
    x = torch.randn(num_samples, 64)
    if task_id == 0:
        y = torch.sin(x.mean(dim=1, keepdim=True).expand(-1, 64))
    elif task_id == 1:
        y = torch.cos(x.mean(dim=1, keepdim=True).expand(-1, 64)) * 0.5
    else:
        y = torch.tanh(x * 0.5 + task_id)
    return [(x[i], y[i]) for i in range(num_samples)]


def run_continual_learning_demo():
    print("Neural Memory Agent - Continual Learning Demo")
    print("=" * 60)
    agent = ContinualLearningAgent()
    num_tasks = 4
    results = {'tasks': [], 'without_memory': [], 'with_memory': []}
    for task_id in range(num_tasks):
        print(f"\nLearning Task {task_id + 1}/{num_tasks}")
        train_data = create_task_data(task_id, num_samples=50)
        test_data = create_task_data(task_id, num_samples=20)
        for epoch in range(20):
            total_loss = 0
            for x, y in train_data:
                loss = agent.train_step(x, y, use_replay=(task_id > 0))
                total_loss += loss
            if epoch % 5 == 0:
                avg_loss = total_loss / len(train_data)
                print(f"  Epoch {epoch:2d}: Loss = {avg_loss:.4f}")
        print("\nEvaluation on all tasks:")
        for eval_task_id in range(task_id + 1):
            eval_data = create_task_data(eval_task_id, num_samples=20)
            error = agent.evaluate(eval_data)
            print(f"  Task {eval_task_id + 1}: Error = {error:.4f}")
            if eval_task_id == task_id:
                results['tasks'].append(eval_task_id + 1)
                results['with_memory'].append(error)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    ax = axes[0]
    memory_matrix = agent.memory_bank.memory.detach().numpy()
    im = ax.imshow(memory_matrix, aspect='auto', cmap='viridis')
    ax.set_title('Neural Memory Bank State', fontsize=14, fontweight='bold')
    ax.set_xlabel('Memory Dimension')
    ax.set_ylabel('Memory Slots')
    plt.colorbar(im, ax=ax)
    ax = axes[1]
    ax.plot(results['tasks'], results['with_memory'], marker='o', linewidth=2, markersize=8, label='With Memory Replay')
    ax.set_title('Continual Learning Performance', fontsize=14, fontweight='bold')
    ax.set_xlabel('Task Number')
    ax.set_ylabel('Test Error')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('neural_memory_results.png', dpi=150, bbox_inches='tight')
    print("\nResults saved to 'neural_memory_results.png'")
    plt.show()
    print("\n" + "=" * 60)
    print("Key Insights:")
    print("  • Memory bank stores compressed task representations")
    print("  • Experience replay mitigates catastrophic forgetting")
    print("  • Agent maintains performance on earlier tasks")
    print("  • Content-based addressing enables efficient retrieval")


if __name__ == "__main__":
    run_continual_learning_demo()

We conduct a comprehensive demonstration of the continual learning process, generating synthetic tasks to evaluate the agent’s adaptability across multiple environments. As we train and visualize the results, we observe how memory replay improves stability and maintains accuracy across tasks. The experiment concludes with graphical insights that highlight how differentiable memory enhances the agent’s long-term learning capability.

In conclusion, we built and trained a neural memory agent capable of continual adaptation across evolving tasks. We observed how the differentiable memory enables efficient storage and retrieval of learned representations, while the replay mechanism reinforces stability and knowledge retention. By combining these components with meta-learning, we saw how such agents pave the way for more resilient, self-adapting neural systems that can remember, reason, and evolve without losing what they’ve already mastered.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post A Coding Implementation to Build Neural Memory Agents with Differentiable Memory, Meta-Learning, and Experience Replay for Continual Adaptation in Dynamic Environments appeared first on MarkTechPost.

AI Interview Series #1: Explain Some LLM Text Generation Strategies Us …

Every time you prompt an LLM, it doesn’t generate a complete answer all at once — it builds the response one word (or token) at a time. At each step, the model predicts the probability of what the next token could be based on everything written so far. But knowing probabilities alone isn’t enough — the model also needs a strategy to decide which token to actually pick next.

Different strategies can completely change how the final output looks — some make it more focused and precise, while others make it more creative or varied. In this article, we’ll explore four popular text generation strategies used in LLMs: Greedy Search, Beam Search, Nucleus Sampling, and Temperature Sampling — explaining how each one works.

Greedy Search

Greedy Search is the simplest decoding strategy where, at each step, the model picks the token with the highest probability given the current context. While it’s fast and easy to implement, it doesn’t always produce the most coherent or meaningful sequence — similar to making the best local choice without considering the overall outcome. Because it only follows one path in the probability tree, it can miss better sequences that require short-term trade-offs. As a result, greedy search often leads to repetitive, generic, or dull text, making it unsuitable for open-ended text generation tasks.
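To make this concrete, here is a small, self-contained sketch of greedy decoding over a toy vocabulary (our own illustration; the logits function stands in for a real model call):

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def greedy_decode(step_logits_fn, start_tokens, max_steps=5):
    # At every step, append the single most probable token given the context so far.
    tokens = list(start_tokens)
    for _ in range(max_steps):
        probs = softmax(step_logits_fn(tokens))
        tokens.append(int(np.argmax(probs)))   # best local choice only, no lookahead
    return tokens

# Toy "model" that always slightly favors token 0, so the greedy output becomes repetitive.
toy_logits = lambda tokens: np.array([2.0, 1.9, 0.5, 0.1])
print(greedy_decode(toy_logits, start_tokens=[3]))   # [3, 0, 0, 0, 0, 0]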

Beam Search

Beam Search is an improved decoding strategy over greedy search that keeps track of multiple possible sequences (called beams) at each generation step instead of just one. It expands the top K most probable sequences, allowing the model to explore several promising paths in the probability tree and potentially discover higher-quality completions that greedy search might miss. The parameter K (beam width) controls the trade-off between quality and computation — larger beams produce better text but are slower. 

While beam search works well in structured tasks like machine translation, where accuracy matters more than creativity, it tends to produce repetitive, predictable, and less diverse text in open-ended generation. This happens because the algorithm favors high-probability continuations, leading to less variation and “neural text degeneration,” where the model overuses certain words or phrases.

https://arxiv.org/pdf/1904.09751

(Figures: worked decoding examples for Greedy Search and Beam Search from the paper linked above, summarized in the walkthrough below.)

Greedy Search (K=1) always takes the highest local probability:

T2: Chooses “slow” (0.6) over “fast” (0.4).

Resulting path: “The slow dog barks.” (Final Probability: 0.1680)

Beam Search (K=2) keeps both “slow” and “fast” paths alive:

At T3, it realizes the path starting with “fast” has a higher potential for a good ending.

Resulting path: “The fast cat purrs.” (Final Probability: 0.1800)

Beam Search successfully explores a path that had a slightly lower probability early on, leading to a better overall sentence score.
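A minimal beam search sketch in a similar toy setting (our own illustration, not production decoding code) shows how keeping K partial sequences alive lets a choice that looks slightly worse early on win on total score:

import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def beam_search(step_logits_fn, start_tokens, k=2, max_steps=2):
    # Each beam is a (token list, cumulative log-probability) pair.
    beams = [(list(start_tokens), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            log_probs = log_softmax(step_logits_fn(tokens))
            for token, lp in enumerate(log_probs):
                candidates.append((tokens + [token], score + lp))
        # Keep only the K best expansions across all beams.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

def toy_logits(tokens):
    # Two-token toy vocabulary {0, 1}; token 9 marks the start of the sequence.
    # Starting with token 1 looks slightly worse at first but leads to a far
    # more confident continuation, which beam search can exploit.
    if tokens[-1] == 9:
        return np.array([1.0, 0.9])
    if tokens[-1] == 1:
        return np.array([3.0, 0.0])
    return np.array([0.0, 0.0])

for tokens, score in beam_search(toy_logits, start_tokens=[9], k=2):
    print(tokens, round(score, 3))
# The top beam begins with the locally less likely token 1, which greedy search would never pick.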

Top-p Sampling (Nucleus Sampling)

Top-p Sampling (Nucleus Sampling) is a probabilistic decoding strategy that dynamically adjusts how many tokens are considered for generation at each step. Instead of picking from a fixed number of top tokens like in top-k sampling, top-p sampling selects the smallest set of tokens whose cumulative probability adds up to a chosen threshold p (for example, 0.7). These tokens form the “nucleus,” from which the next token is randomly sampled after normalizing their probabilities. 

This allows the model to balance diversity and coherence — sampling from a broader range when many tokens have similar probabilities (flat distribution) and narrowing down to the most likely tokens when the distribution is sharp (peaky). As a result, top-p sampling produces more natural, varied, and contextually appropriate text compared to fixed-size methods like greedy or beam search.
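A short sketch of the nucleus selection step (our own illustration, not any particular library's implementation) makes the cumulative-probability cutoff explicit:

import numpy as np

def top_p_sample(logits, p=0.7, rng=np.random.default_rng(0)):
    # Convert logits to probabilities, sort them, keep the smallest prefix whose
    # cumulative mass reaches p, then renormalize and sample from that nucleus.
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])
print([top_p_sample(logits, p=0.7) for _ in range(5)])   # samples come only from the nucleus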

Temperature Sampling

Temperature Sampling controls the level of randomness in text generation by adjusting the temperature parameter (t) in the softmax function that converts logits into probabilities. A lower temperature (t < 1) makes the distribution sharper, increasing the chance of selecting the most probable tokens — resulting in more focused but often repetitive text. At t = 1, the model samples directly from its natural probability distribution, known as pure or ancestral sampling. 

Higher temperatures (t > 1) flatten the distribution, introducing more randomness and diversity but at the cost of coherence. In practice, temperature sampling allows fine-tuning the balance between creativity and precision: low temperatures yield deterministic, predictable outputs, while higher ones generate more varied and imaginative text. 

The optimal temperature often depends on the task — for instance, creative writing benefits from higher values, while technical or factual responses perform better with lower ones.
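The effect of the temperature parameter is easy to see numerically (our own illustration): dividing the logits by t before the softmax sharpens or flattens the resulting distribution.

import numpy as np

def temperature_softmax(logits, t=1.0):
    # t < 1 sharpens the distribution, t = 1 leaves it unchanged, t > 1 flattens it.
    scaled = np.asarray(logits, dtype=float) / t
    probs = np.exp(scaled - scaled.max())
    return probs / probs.sum()

logits = [2.0, 1.0, 0.5, -1.0]
rng = np.random.default_rng(0)
for t in (0.5, 1.0, 2.0):
    probs = temperature_softmax(logits, t)
    token = rng.choice(len(probs), p=probs)   # draw one token id at this temperature
    print(f"t={t}: probs={np.round(probs, 3)}, sampled token {token}")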

The post AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs appeared first on MarkTechPost.

StepFun AI Releases Step-Audio-EditX: A New Open-Source 3B LLM-Grade A …

How can speech editing become as direct and controllable as simply rewriting a line of text? StepFun AI has open-sourced Step-Audio-EditX, a 3B-parameter, LLM-based audio model that turns expressive speech editing into a token-level, text-like operation instead of a waveform-level signal-processing task.

https://arxiv.org/pdf/2511.03601

Why do developers care about controllable TTS?

Most zero shot TTS systems copy emotion, style, accent, and timbre directly from a short reference audio. They can sound natural, but control is weak. Style prompts in text help only for in domain voices, and the cloned voice often ignores the requested emotion or speaking style.

Past work tries to disentangle factors with extra encoders, adversarial losses, or complex architectures. Step-Audio-EditX keeps a relatively entangled representation and instead changes the data and post training objective. The model learns control by seeing many pairs and triplets where text is fixed, but one attribute changes with a large margin.

Architecture, dual codebook tokenizer plus compact audio LLM

Step-Audio-EditX reuses the Step-Audio dual codebook tokenizer. Speech is mapped into two token streams, a linguistic stream at 16.7 Hz with a 1024 entry codebook, and a semantic stream at 25 Hz with a 4096 entry codebook. The two streams are interleaved in a 2:3 ratio. The tokenizer keeps prosody and emotion information, so it is not fully disentangled.
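As a rough, back-of-the-envelope illustration of those rates (our own arithmetic, not code from the release), one second of speech yields roughly 16.7 linguistic tokens plus 25 semantic tokens, and the two streams alternate in a repeating 2:3 pattern:

seconds = 10
linguistic_tokens = 16.7 * seconds     # linguistic stream at 16.7 Hz, 1024-entry codebook
semantic_tokens = 25 * seconds         # semantic stream at 25 Hz, 4096-entry codebook
print(linguistic_tokens + semantic_tokens)   # ~417 dual codebook tokens for 10 s of audio

interleave_pattern = ["linguistic"] * 2 + ["semantic"] * 3   # repeating 2:3 interleave
print(interleave_pattern)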

On top of this tokenizer, the StepFun research team builds a 3B parameter audio LLM. The model is initialized from a text LLM, then trained on a blended corpus with a 1 to 1 ratio of pure text and dual codebook audio tokens in chat style prompts. The audio LLM reads text tokens, audio tokens, or both, and always generates dual codebook audio tokens as output.

A separate audio decoder handles reconstruction. A diffusion transformer based flow matching module predicts Mel spectrograms from audio tokens, reference audio, and a speaker embedding, and a BigVGANv2 vocoder converts Mel spectrograms to waveform. The flow matching module is trained on about 200000 hours of high quality speech, which improves pronunciation and timbre similarity.

https://arxiv.org/pdf/2511.03601

Large margin synthetic data instead of complicated encoders

The key idea is large margin learning. The model is post trained on triplets and quadruplets that keep text fixed and change only one attribute with a clear gap.

For zero shot TTS, Step-Audio-EditX uses a high quality in-house dataset, mainly Chinese and English, with a small amount of Cantonese and Sichuanese, and about 60,000 speakers. The data covers wide intra-speaker and inter-speaker variation in style and emotion (arXiv).

For emotion and speaking style editing, the team builds synthetic large margin triplets (text, audio neutral, audio emotion or style). Voice actors record about 10 second clips for each emotion and style. StepTTS zero shot cloning then produces neutral and emotional versions for the same text and speaker. A margin scoring model, trained on a small human labeled set, scores pairs on a 1 to 10 scale, and only samples with score at least 6 are kept.

Paralinguistic editing, which covers breathing, laughter, filled pauses and other tags, uses a semi synthetic strategy on top of the NVSpeech dataset. The research team builds quadruplets where the target is the original NVSpeech audio and transcript, and the input is a cloned version with tags removed from the text. This gives time domain editing supervision without a margin model.

Reinforcement learning data uses two preference sources. Human annotators rate 20 candidates per prompt on a 5 point scale for correctness, prosody, and naturalness, and pairs with margin greater than 3 are kept. A comprehension model scores emotion and speaking style on a 1 to 10 scale, and pairs with margin greater than 8 are kept.

Post training, SFT plus PPO on token sequences

Post training has two stages, supervised fine tuning followed by PPO.

In supervised fine tuning, system prompts define zero shot TTS and editing tasks in a unified chat format. For TTS, the prompt waveform is encoded to dual codebook tokens, converted to string form, and inserted into the system prompt as speaker information. The user message is the target text, and the model returns new audio tokens. For editing, the user message includes original audio tokens plus a natural language instruction, and the model outputs edited tokens.

Reinforcement learning then refines instruction following. A 3B reward model is initialized from the SFT checkpoint and trained with Bradley Terry loss on large margin preference pairs. The reward is computed directly on dual codebook token sequences, without decoding to waveform. PPO training uses this reward model, a clip threshold, and a KL penalty to balance quality and deviation from the SFT policy.
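For readers unfamiliar with the Bradley Terry objective, here is a minimal PyTorch sketch of that preference loss on a batch of reward pairs (our own illustration; the reward values are made up and this is not StepFun's training code):

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen, reward_rejected):
    # Maximize the probability that the preferred sample outranks the rejected one:
    # loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

reward_chosen = torch.tensor([2.1, 0.7, 1.5], requires_grad=True)   # hypothetical reward model scores
reward_rejected = torch.tensor([0.3, 0.2, 1.4])
loss = bradley_terry_loss(reward_chosen, reward_rejected)
loss.backward()
print(loss.item())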

Step-Audio-Edit-Test, iterative editing and generalization

To quantify control, the research team introduced Step-Audio-Edit-Test. It uses Gemini 2.5 Pro as an LLM-as-a-judge to evaluate emotion, speaking style, and paralinguistic accuracy. The benchmark has 8 speakers, drawn from Wenet Speech4TTS, GLOBE V2, and Libri Light, with 4 speakers per language.

The emotion set has 5 categories with 50 Chinese and 50 English prompts per category. The speaking style set has 7 styles with 50 prompts per language per style. The paralinguistic set has 10 labels such as breathing, laughter, surprise oh, and uhm, with 50 prompts per label and language.

Editing is evaluated iteratively. Iteration 0 is the initial zero shot clone. Then the model applies 3 rounds of editing with text instructions. In Chinese, emotion accuracy rises from 57.0 at iteration 0 to 77.7 at iteration 3. Speaking style accuracy rises from 41.6 to 69.2. English shows similar behavior, and a prompt fixed ablation, where the same prompt audio is used for all iterations, still improves accuracy, which supports the large margin learning hypothesis.

https://arxiv.org/pdf/2511.03601

The same editing model is applied to four closed-source TTS systems: GPT-4o mini TTS, ElevenLabs v2, Doubao Seed TTS 2.0, and MiniMax speech 2.6 hd. For all of them, one editing iteration with Step-Audio-EditX improves both emotion and style accuracy, and further iterations continue to help.

Paralinguistic editing is scored on a 1 to 3 scale. The average score rises from 1.91 at iteration 0 to 2.89 after a single edit, in both Chinese and English, which is comparable to native paralinguistic synthesis in strong commercial systems.

https://arxiv.org/pdf/2511.03601

Key Takeaways

Step-Audio-EditX uses a dual codebook tokenizer and a 3B-parameter audio LLM, so it can treat speech as discrete tokens and edit audio in a text-like way.

The model relies on large margin synthetic data for emotion, speaking style, paralinguistic cues, speed, and noise, rather than adding extra disentangling encoders.

Supervised fine tuning plus PPO with a token level reward model aligns the audio LLM to follow natural language editing instructions for both TTS and editing tasks.

The Step-Audio-Edit-Test benchmark, with Gemini 2.5 Pro as a judge, shows clear accuracy gains over 3 editing iterations for emotion, style, and paralinguistic control in both Chinese and English.

Step-Audio-EditX can post-process and improve speech from closed-source TTS systems, and the full stack, including code and checkpoints, is available as open source for developers.

Editorial Comments

Step-Audio-EditX is a precise step forward in controllable speech synthesis because it keeps the Step-Audio tokenizer, adds a compact 3B audio LLM, and optimizes control through large margin data and PPO. The introduction of Step-Audio-Edit-Test with Gemini 2.5 Pro as a judge makes the evaluation story concrete for emotion, speaking style, and paralinguistic control, and the open release lowers the barrier for practical audio editing research. Overall, this release makes audio editing feel much closer to text editing.

Check out the Paper, Repo and Model Weights. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post StepFun AI Releases Step-Audio-EditX: A New Open-Source 3B LLM-Grade Audio Editing Model Excelling at Expressive and Iterative Audio Editing appeared first on MarkTechPost.

Nested Learning: A New Machine Learning Approach for Continual Learnin …

How can we build AI systems that keep learning new information over time without forgetting what they learned before or retraining from scratch? Google researchers have introduced Nested Learning, a machine learning approach that treats a model as a collection of smaller nested optimization problems instead of a single network trained by one outer loop. The goal is to attack catastrophic forgetting and move large models toward continual learning, closer to how biological brains manage memory and adaptation over time.

https://abehrouz.github.io/files/NL.pdf

What is Nested Learning?

The research paper from Google, ‘Nested Learning: The Illusion of Deep Learning Architectures’, models a complex neural network as a set of coherent optimization problems, nested or running in parallel, that are optimized together. Each internal problem has its own context flow, the sequence of inputs, gradients, or states that this component observes, and its own update frequency.

Instead of seeing training as a flat stack of layers plus one optimizer, Nested Learning imposes an ordering by update frequency. Parameters that update often sit at inner levels, while slowly updated parameters form outer levels. This hierarchy defines a Neural Learning Module, where every level compresses its own context flow into its parameters. The research team shows that this view covers standard back-propagation on an MLP, linear attention, and common optimizers, all as instances of associative memory.

In this framework, associative memory is any operator that maps keys to values and is trained with an internal objective. The research team formalizes associative memory and then shows that back-propagation itself can be written as a one step gradient descent update that learns a mapping from inputs to local surprise signals, the gradient of the loss with respect to the output.

https://abehrouz.github.io/files/NL.pdf

Deep Optimizers as Associative Memory

Once optimizers are treated as learning modules, Nested Learning suggests redesigning them with richer internal objectives. Standard momentum can be written as a linear associative memory over past gradients, trained with a dot product similarity objective. This internal objective produces a Hebbian like update rule that does not model dependencies between data samples.

The research team replaces this similarity objective with an L2 regression loss over gradient features, which yields an update rule that better manages limited memory capacity and better memorizes gradient sequences. They then generalize the momentum memory from a linear map to an MLP and define Deep Momentum Gradient Descent, where the momentum state is produced by a neural memory and can pass through a non-linear function such as Newton-Schulz. This perspective also recovers the Muon optimizer as a special case.
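To make the associative-memory reading of momentum concrete, here is a toy sketch (our own illustration, not the paper's Deep Momentum Gradient Descent): classical heavy-ball momentum keeps a single linear summary of past gradients, and a small MLP can replace that linear map to produce a non-linear momentum state.

import torch

lr, beta = 0.1, 0.9
param = torch.zeros(4)

# Classical momentum: a linear, Hebbian-like accumulation of past gradients.
momentum = torch.zeros(4)
for _ in range(5):
    grad = torch.randn(4)                  # stand-in for a real gradient
    momentum = beta * momentum + grad
    param = param - lr * momentum

# Toy non-linear variant: a small MLP maps each gradient into the momentum memory.
memory_mlp = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 4))
neural_momentum = torch.zeros(4)
with torch.no_grad():
    for _ in range(5):
        grad = torch.randn(4)
        neural_momentum = beta * neural_momentum + memory_mlp(grad)
        param = param - lr * neural_momentum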

https://abehrouz.github.io/files/NL.pdf

Continuum Memory System

In typical sequence models, attention acts as working memory over the current context window, while feedforward blocks store pre training knowledge as long term memory that is rarely updated after training. The Nested Learning researchers extend this binary view to a Continuum Memory System, or CMS.

CMS is defined as a chain of MLP blocks, MLP(f₁) through MLP(fₖ), where each block has its own update frequency and chunk size. For an input sequence, the output is obtained by sequentially applying these blocks. The parameters of each block are updated only every C^(ℓ) steps, so each block compresses a different time scale of context into its parameters. A standard Transformer with one feedforward block is recovered as the special case with k equal to 1.

This construction turns long term memory into a spectrum of levels across frequency, instead of a single static feedforward layer. The research connects this directly to multi time scale synaptic and system consolidation processes in the brain, where different parts of the system learn at different rates while sharing a common architecture.
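The scheduling idea is easy to prototype. Below is a minimal sketch (our own illustration under assumed hyperparameters, not the paper's CMS implementation) of a chain of MLP blocks where each block's optimizer steps only every C^(ℓ) training steps, so slower blocks accumulate gradients over longer stretches of context:

import torch
import torch.nn as nn
import torch.nn.functional as F

blocks = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(3)])
update_every = [1, 4, 16]   # inner block updates every step, outer blocks far less often
optimizers = [torch.optim.SGD(block.parameters(), lr=1e-2) for block in blocks]

for step in range(32):
    x, y = torch.randn(8, 32), torch.randn(8, 32)   # dummy data for illustration
    out = x
    for block in blocks:
        out = block(out)
    loss = F.mse_loss(out, y)
    loss.backward()                        # gradients accumulate in every block
    for opt, period in zip(optimizers, update_every):
        if (step + 1) % period == 0:       # each block steps at its own frequency,
            opt.step()                     # so slower blocks integrate longer context
            opt.zero_grad()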

HOPE, A Self Modifying Architecture Built On Titans

To show that Nested Learning is practical, the research team designed HOPE, a self-referential sequence model that applies the paradigm to a recurrent architecture. HOPE is built as a variant of Titans, a long-term memory architecture in which a neural memory module learns to memorize surprising events at test time and helps attention attend to tokens from the distant past.

Titans has only 2 levels of parameter update, which yields first-order in-context learning. HOPE extends Titans in 2 ways. First, it is self-modifying: it can optimize its own memory through a self-referential process and can, in principle, support unbounded levels of in-context learning. Second, it integrates Continuum Memory System blocks so that memory updates occur at multiple frequencies and scale to longer context windows.

https://abehrouz.github.io/files/NL.pdf

Understanding the Results

The research team evaluates HOPE and baselines on language modeling and common sense reasoning tasks at 3 parameter scales: 340M, 760M, and 1.3B parameters. Benchmarks include Wiki and LMB perplexity for language modeling, and PIQA, HellaSwag, WinoGrande, ARC Easy, ARC Challenge, Social IQa, and BoolQ accuracy for reasoning. Table 1 in the paper reports results for HOPE, Transformer++, RetNet, Gated DeltaNet, TTT, Samba, and Titans.

https://abehrouz.github.io/files/NL.pdf

Key Takeaways

Nested Learning treats a model as multiple nested optimization problems with different update frequencies, which directly targets catastrophic forgetting in continual learning.

The framework reinterprets backpropagation, attention, and optimizers as associative memory modules that compress their own context flow, giving a unified view of architecture and optimization.

Deep optimizers in Nested Learning replace simple dot product similarity with richer objectives such as L2 regression and use neural memories, which leads to more expressive and context aware update rules.

The Continuum Memory System models memory as a spectrum of MLP blocks that update at different rates, creating short, medium, and long range memory rather than one static feedforward layer.

The HOPE architecture, a self modifying variant of Titans built using Nested Learning principles, shows improved language modeling, long context reasoning, and continual learning performance compared to strong Transformer and recurrent baselines.

Editorial Comments

Nested Learning is a useful reframing of deep networks as Neural Learning Modules that integrate architecture and optimization into one system. The introduction of Deep Momentum Gradient Descent, Continuum Memory System, and the HOPE architecture gives a concrete path to richer associative memory and better continual learning. Overall, this work turns continual learning from an afterthought into a primary design axis.

Check out the Paper and Technical Details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post Nested Learning: A New Machine Learning Approach for Continual Learning that Views Models as Nested Optimization Problems to Enhance Long Context Processing appeared first on MarkTechPost.