Harnessing the power of generative AI: Druva’s multi-agent copilot f …

This post is co-written with David Gildea and Tom Nijs from Druva.
Generative AI is transforming the way businesses interact with their customers and revolutionizing conversational interfaces for complex IT operations. Druva, a leading provider of data security solutions, is at the forefront of this transformation. In collaboration with Amazon Web Services (AWS), Druva is developing a cutting-edge generative AI-powered multi-agent copilot that aims to redefine the customer experience in data security and cyber resilience.
Powered by Amazon Bedrock and using advanced large language models (LLMs), this innovative solution will provide Druva’s customers with an intuitive, conversational interface to access data management, security insights, and operational support across their product suite. By harnessing the power of generative AI and agentic AI, Druva aims to streamline operations, increase customer satisfaction, and enhance the overall value proposition of its data security and cyber resilience solutions.
In this post, we examine the technical architecture behind this AI-powered copilot, exploring how it processes natural language queries, maintains context across complex workflows, and delivers secure, accurate responses to streamline data protection operations.
Challenges and opportunities
Druva wants to serve enterprises that are moving beyond traditional query-based AI into agentic systems, and to meet their complex data management and security needs with greater speed, simplicity, and confidence.
Comprehensive data security necessitates tracking a high volume of data and metrics to identify potential cyber threats. As threats evolve, it can be difficult for customers to stay abreast of new data anomalies to hunt for within their organization’s data, but missing any threat signals can lead to unauthorized access to sensitive information. For example, a global financial services company managing more than 500 servers across multiple regions currently spends hours manually checking logs across dozens of systems when backup fails. With an AI-powered copilot, they could simply ask, “Why did my backups fail last night?” and instantly receive an analysis showing that a specific policy update caused conflicts in their European data centers, along with a step-by-step remediation, reducing investigation time from hours to minutes. This solution not only reduces the volume of support requests and accelerates the time to resolution, but also unlocks greater operational efficiency for end users.
By reimagining how users engage with the system—from AI-powered workflows to smarter automation—Druva saw a clear opportunity to deliver a more seamless customer experience that strengthens customer satisfaction, loyalty, and long-term success.
The key opportunities for Druva in implementing a generative AI-powered multi-agent copilot include:

Simplified user experience: By providing a natural language interface, the copilot can simplify complex data protection tasks and help users access the information they need quickly.
Intelligent troubleshooting: The copilot can leverage AI capabilities to analyze data from various sources, identify the root causes of backup failures, and provide personalized recommendations for resolution.
Streamlined policy management: The multi-agent copilot can guide users through the process of creating, modifying, and implementing data protection policies, reducing the potential for human errors and improving compliance.
Proactive support: By continuously monitoring data protection environments, the copilot can proactively identify potential issues and provide guidance to help prevent failures or optimize performance.
Scalable and efficient operations: The AI-powered solution can handle a large volume of customer inquiries and tasks simultaneously, reducing the burden on Druva’s support team so that they can focus on more complex and strategic initiatives.

Solution overview
The proposed solution for Druva’s copilot leverages a sophisticated architecture that combines the power of Amazon Bedrock (including Amazon Bedrock Knowledge Bases), LLMs, and a dynamic API selection process to deliver an intelligent and efficient user experience. In the following diagram, we demonstrate the end-to-end architecture and its sub-components.

At the core of the system is the supervisor agent, which serves as the central coordination component of the multi-agent system. This agent is responsible for overseeing the entire conversation flow, delegating tasks to specialized sub-agents, and maintaining seamless communication between the various components.
The user interacts with the supervisor agent through a user interface, submitting natural language queries related to data protection, backup management, and troubleshooting. The supervisor agent analyzes the user’s input and routes the request to the appropriate sub-agents based on the nature of the query.
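To make this routing step concrete, here is a minimal sketch of how such a supervisor could dispatch requests. The intent labels, the classifier, and the sub-agent callables are hypothetical illustrations, not Druva's implementation.

# Illustrative supervisor routing sketch (hypothetical names, not Druva's code).
def supervisor_route(query, classify_intent, agents):
    """classify_intent: callable returning one of 'data', 'help', 'action'
    (for example, an LLM call with a routing prompt);
    agents: dict mapping those labels to sub-agent callables."""
    intent = classify_intent(query)                # decide which sub-agent should handle the query
    handler = agents.get(intent, agents["help"])   # fall back to the help agent for unknown intents
    return handler(query)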
The data agent is responsible for retrieving relevant information from Druva’s systems by interacting with the GET APIs. This agent fetches data such as scheduled backup jobs, backup status, and other pertinent details to provide the user with accurate and up-to-date information.
The help agent assists users by providing guidance on best practices, step-by-step instructions, and troubleshooting tips. This agent draws upon an extensive knowledge base, which includes detailed API documentation, user manuals, and frequently asked questions, to deliver context-specific assistance to users.
When a user needs to perform critical actions, such as initiating a backup job or modifying data protection policies, the action agent comes into play. This agent interacts with the POST API endpoints to execute the necessary operations, making sure that the user’s requirements are met promptly and accurately.
To make sure that the multi-agent copilot operates with the most suitable APIs and parameters, the solution incorporates a dynamic API selection process. In the following diagram, we highlight the AWS services used to implement dynamic API selection, with which both the data agent and the action agent are equipped. An Amazon Bedrock knowledge base contains comprehensive information about the available APIs, their functionality, and optimal usage patterns. When an input query is received, semantic search retrieves the top K relevant APIs. This semantic search capability enables the system to adapt to the specific context of each user request, enhancing the copilot’s accuracy, efficiency, and scalability. After the candidate APIs are identified, the agent prompts the LLM to parse the top K relevant APIs and finalize the API selection along with the required parameters. This step makes sure that the copilot is fully equipped to run the user’s request effectively.
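As a rough sketch of this flow, the selection step can be pictured as semantic search over an API catalog followed by an LLM call that fills in parameters. The catalog entries, the `embed` function, and the `invoke_llm` client below are illustrative assumptions, not the actual Amazon Bedrock Knowledge Bases retrieval API.

# Minimal dynamic API selection sketch (illustrative only).
import json
import numpy as np

API_CATALOG = [
    {"name": "list_backup_jobs", "method": "GET",
     "description": "List scheduled backup jobs and their status",
     "parameters": {"start_date": "ISO date", "end_date": "ISO date"}},
    {"name": "get_backup_failures", "method": "GET",
     "description": "Retrieve failed backup jobs with error details",
     "parameters": {"since": "ISO datetime"}},
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_api(query, embed, invoke_llm, top_k=3):
    # 1. Semantic search: rank catalog entries against the user query.
    q_vec = embed(query)
    scored = sorted(API_CATALOG,
                    key=lambda api: cosine(q_vec, embed(api["description"])),
                    reverse=True)[:top_k]
    # 2. Ask the LLM to pick one candidate and fill in its parameters.
    prompt = ("User request:\n" + query + "\n\nCandidate APIs:\n"
              + json.dumps(scored, indent=2)
              + "\n\nReturn JSON: {\"api\": <name>, \"parameters\": {...}}")
    return json.loads(invoke_llm(prompt))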

Finally, the selected API is invoked, and the multi-agent copilot carries out the desired action or retrieves the requested information. The user receives a clear and concise response, along with relevant recommendations or guidance, through the user interface.
Throughout the interaction, users can provide additional information or explicit approvals by using the user feedback node before the copilot performs critical actions. With this human-in-the-loop approach, the system operates with the necessary safeguards and maintains user control over sensitive operations.
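A minimal sketch of such an approval gate might look like the following; the function names, call shape, and return value are illustrative assumptions rather than Druva's code.

# Illustrative human-in-the-loop gate: read-only GET calls run directly,
# while POST actions wait for explicit user confirmation.
def execute_with_approval(api_call, invoke_api, ask_user):
    """api_call: {"name": ..., "method": "GET" | "POST", "parameters": {...}}
    invoke_api: callable that actually performs the request
    ask_user: callable that shows the pending action and returns True/False."""
    if api_call["method"].upper() == "GET":
        return invoke_api(api_call)            # safe read path, no approval needed
    summary = f"About to run {api_call['name']} with {api_call['parameters']}. Proceed?"
    if ask_user(summary):
        return invoke_api(api_call)            # user explicitly approved the critical action
    return {"status": "cancelled_by_user"}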
Evaluation
The evaluation process for Druva’s generative AI-powered multi-agent copilot focuses on assessing the performance and effectiveness of each critical component of the system. By thoroughly testing individual components, including dynamic API selection, each agent in isolation, and the end-to-end workflow, Druva verifies that the copilot delivers accurate, reliable, and efficient results to its users.
Evaluation methodology:

Unit testing: Isolated tests are conducted for each component (individual agents, data extraction, API selection) to verify their functionality, performance, and error handling capabilities.
Integration testing: Tests are performed to validate the seamless integration and communication between the various components of the multi-agent copilot, maintaining data flow and control flow integrity.
System testing: End-to-end tests are executed on the complete system, simulating real-world user scenarios and workflows to assess the overall functionality, performance, and user experience.

Evaluation results
Choosing the right model for the right task is critical to the system’s performance. The dynamic tool selection represents one of the most critical parts of the system—invoking the correct API is essential for end-to-end solution success. A single incorrect API call can lead to fetching wrong data, which cascades into erroneous results throughout the multi-agent system. To optimize the dynamic tool selection component, various Nova and Anthropic models were tested and benchmarked against the ground truth created using Sonnet 3.7.
The findings showed that even smaller models like Nova Lite and Haiku 3 were able to select the correct API every time. However, these smaller models struggled with parameter parsing, that is, calling the selected API with the correct parameters for the input question. When parameter parsing accuracy was taken into account, the overall API selection accuracy dropped to 81% for Nova Micro, 88% for Nova Lite, and 93% for Nova Pro. The performance of Haiku 3, Haiku 3.5, and Sonnet 3.5 was comparable, ranging from 91% to 92%. Nova Pro provided an optimal tradeoff between accuracy and latency, with an average response time of just over one second. In contrast, Sonnet 3.5 had a latency of eight seconds, although this could be attributed to Sonnet 3.5’s more verbose output, generating an average of 291 tokens compared to Nova Pro’s 86 tokens. The prompts could potentially be optimized to make Sonnet 3.5’s output more concise, thus reducing the latency.
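The benchmark described above can be pictured with a small scoring harness such as the one below; the test-case fields and the `select_api_fn` interface are illustrative assumptions rather than the actual evaluation code.

# Sketch of scoring API selection and parameter parsing against ground truth.
def score_api_selection(test_cases, select_api_fn):
    """test_cases: list of {"question", "expected_api", "expected_parameters"}
    select_api_fn: callable returning {"api": ..., "parameters": {...}}."""
    api_hits, full_hits = 0, 0
    for case in test_cases:
        pred = select_api_fn(case["question"])
        if pred["api"] == case["expected_api"]:
            api_hits += 1
            if pred["parameters"] == case["expected_parameters"]:
                full_hits += 1                 # correct API and correct parameters
    n = len(test_cases)
    return {"api_accuracy": api_hits / n, "end_to_end_accuracy": full_hits / n}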
For end-to-end testing of real-world scenarios, it is essential to engage human subject matter experts familiar with the system to assess performance based on the completeness, accuracy, and relevance of the solutions. Across 11 challenging questions during the initial development phase, the system achieved scores averaging 3.3 out of 5 across these dimensions. This represented solid performance given that the evaluation was conducted in the early stages of development, and it provides a strong foundation for future improvements.
By focusing on evaluating each critical component and conducting rigorous end-to-end testing, Druva has made sure that the generative AI-powered multi-agent copilot meets the highest standards of accuracy, reliability, and efficiency. The insights gained from this evaluation process have guided the continuous improvement and optimization of the copilot.

“Druva is at the forefront of leveraging advanced AI technologies to revolutionize the way organizations protect and manage their critical data. Our Generative AI-powered Multi-agent Copilot is a testament to our commitment to delivering innovative solutions that simplify complex processes and enhance customer experiences. By collaborating with the AWS Generative AI Innovation Center, we are embarking on a transformative journey to create an interactive, personalized, and efficient end-to-end experience for our customers. We are excited to harness the power of Amazon Bedrock and our proprietary data to continue reimagining the future of data security and cyber resilience.”- David Gildea, VP of Generative AI at Druva

Conclusion
Druva’s generative AI-powered multi-agent copilot showcases the immense potential of combining structured and unstructured data sources using AI to create next-generation virtual copilots. This innovative approach sets Druva apart from traditional data protection vendors by transforming hours-long manual investigations into instant, AI-powered conversational insights, with 90% of routine data protection tasks executable through natural language interactions, fundamentally redefining customer expectations in the data security space. For organizations in the data security and protection space, this technology enables more efficient operations, enhanced customer engagement, and data-driven decision-making. The insights and intelligence provided by the copilot empower Druva’s stakeholders, including customers, support teams, partners, and executives, to make informed decisions faster, reducing average time-to-resolution for data security issues by up to 70% and accelerating backup troubleshooting from hours to minutes. Although this project focuses on the data protection industry, the underlying principles and methodology can be applied across various domains. With careful design, testing, and continuous improvement, organizations in any industry can benefit from AI-powered copilots that contextualize their data, documents, and content to deliver intelligent and personalized experiences.
This implementation leverages Amazon Bedrock AgentCore Runtime and Amazon Bedrock AgentCore Gateway to provide robust agent orchestration and management capabilities. This approach has the potential to provide intelligent automation and data search capabilities through customizable agents, transforming user interactions with applications to be more natural, efficient, and effective. For those interested in implementing similar functionalities, explore Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases and Amazon Bedrock AgentCore as a fully managed AWS solution.

About the authors
David Gildea With over 25 years of experience in cloud automation and emerging technologies, David has led transformative projects in data management and cloud infrastructure. As the founder and former CEO of CloudRanger, he pioneered innovative solutions to optimize cloud operations, later leading to its acquisition by Druva. Currently, David leads the Labs team in the Office of the CTO, spearheading R&D into Generative AI initiatives across the organization, including projects like Dru Copilot, Dru Investigate, and Amazon Q. His expertise spans technical research, commercial planning, and product development, making him a prominent figure in the field of cloud technology and generative AI.
Tom Nijs is an experienced backend and AI engineer at Druva, driven by a passion for both learning and sharing knowledge. As the Lead Architect for Druva’s Labs team, he channels this passion into developing cutting-edge solutions, leading projects such as Dru Copilot, Dru Investigate, and Dru AI Labs. With a core focus on optimizing systems and harnessing the power of AI, Tom is dedicated to helping teams and developers turn groundbreaking ideas into reality.
Gauhar Bains is a Deep Learning Architect at the AWS Generative AI Innovation Center, where he designs and delivers innovative GenAI solutions for enterprise customers. With a passion for leveraging cutting-edge AI technologies, Gauhar specializes in developing agentic AI applications, and implementing responsible AI practices across diverse industries.
Ayushi Gupta is a Senior Technical Account Manager at AWS who partners with organizations to architect optimal cloud solutions. She specializes in ensuring business-critical applications operate reliably while balancing performance, security, and cost efficiency. With a passion for GenAI innovation, Ayushi helps customers leverage cloud technologies that deliver measurable business value while maintaining robust data protection and compliance standards.
Marius Moisescu is a Machine Learning Engineer at the AWS Generative AI Innovation Center. He works with customers to develop agentic applications. His interests are deep research agents and evaluation of multi agent architectures.
Ahsan Ali is a Senior Applied Scientist at the Amazon Generative AI Innovation Center, where he works with customers from different industry verticals to solve their urgent and expensive problems using Generative AI.
Sandy Farr is an Applied Science Manager at the AWS Generative AI Innovation Center, where he leads a team of scientists, deep learning architects and software engineers to deliver innovative GenAI solutions for AWS customers. Sandy holds a PhD in Physics and has over a decade of experience developing AI/ML, NLP and GenAI solutions for large organizations.
Govindarajan Varadan is a Manager of the Solutions Architecture team at Amazon Web Services (AWS) based out of Silicon Valley in California. He works with AWS customers to help them achieve their business objectives through innovative applications of AI at scale.
Saeideh Shahrokh Esfahani is an Applied Scientist at the Amazon Generative AI Innovation Center, where she focuses on transforming cutting-edge AI technologies into practical solutions that address real-world challenges.

How to Build a Fully Self-Verifying Data Operations AI Agent Using Local Hugging Face Models for Automated Planning, Execution, and Testing

In this tutorial, we build a self-verifying DataOps AI Agent that can plan, execute, and test data operations automatically using local Hugging Face models. We design the agent with three intelligent roles: a Planner that creates an execution strategy, an Executor that writes and runs code using pandas, and a Tester that validates the results for accuracy and consistency. By using Microsoft’s Phi-2 model locally in Google Colab, we ensure that the workflow remains efficient, reproducible, and privacy-preserving while demonstrating how LLMs can automate complex data-processing tasks end-to-end. Check out the FULL CODES here.

!pip install -q transformers accelerate bitsandbytes scipy
import json, pandas as pd, numpy as np, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

MODEL_NAME = "microsoft/phi-2"

class LocalLLM:
    def __init__(self, model_name=MODEL_NAME, use_8bit=False):
        print(f"Loading model: {model_name}")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        model_kwargs = {"device_map": "auto", "trust_remote_code": True}
        if use_8bit and torch.cuda.is_available():
            model_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
        else:
            model_kwargs["torch_dtype"] = torch.float32 if not torch.cuda.is_available() else torch.float16
        self.model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
        self.pipe = pipeline("text-generation", model=self.model, tokenizer=self.tokenizer,
                             max_new_tokens=512, do_sample=True, temperature=0.3, top_p=0.9,
                             pad_token_id=self.tokenizer.eos_token_id)
        print("✓ Model loaded successfully!\n")

    def generate(self, prompt, system_prompt="", temperature=0.3):
        if system_prompt:
            full_prompt = f"Instruct: {system_prompt}\n\n{prompt}\nOutput:"
        else:
            full_prompt = f"Instruct: {prompt}\nOutput:"
        output = self.pipe(full_prompt, temperature=temperature, do_sample=temperature > 0,
                           return_full_text=False, eos_token_id=self.tokenizer.eos_token_id)
        result = output[0]['generated_text'].strip()
        if "Instruct:" in result:
            result = result.split("Instruct:")[0].strip()
        return result

We install the required libraries and load the Phi-2 model locally using Hugging Face Transformers. We create a LocalLLM class that initializes the tokenizer and model, supports optional quantization, and defines a generate method to produce text outputs. We ensure that the model runs smoothly on both CPU and GPU, making it ideal for use on Colab. Check out the FULL CODES here.

PLANNER_PROMPT = """You are a Data Operations Planner. Create a detailed execution plan as valid JSON.

Return ONLY a JSON object (no other text) with this structure:
{"steps": ["step 1","step 2"],"expected_output":"description","validation_criteria":["criteria 1","criteria 2"]}"""

EXECUTOR_PROMPT = """You are a Data Operations Executor. Write Python code using pandas.

Requirements:
- Use pandas (imported as pd) and numpy (imported as np)
- Store final result in variable 'result'
- Return ONLY Python code, no explanations or markdown"""

TESTER_PROMPT = """You are a Data Operations Tester. Verify execution results.

Return ONLY a JSON object (no other text) with this structure:
{"passed":true,"issues":["any issues found"],"recommendations":["suggestions"]}"""

class DataOpsAgent:
    def __init__(self, llm=None):
        self.llm = llm or LocalLLM()
        self.history = []

    def _extract_json(self, text):
        try:
            return json.loads(text)
        except:
            start, end = text.find('{'), text.rfind('}') + 1
            if start >= 0 and end > start:
                try:
                    return json.loads(text[start:end])
                except:
                    pass
            return None

We define the system prompts for the Planner, Executor, and Tester roles of our DataOps Agent. We then initialize the DataOpsAgent class with helper methods and a JSON extraction utility to parse structured responses. We prepare the foundation for the agent’s reasoning and execution pipeline. Check out the FULL CODES here.

    def plan(self, task, data_info):
        print("\n" + "=" * 60)
        print("PHASE 1: PLANNING")
        print("=" * 60)
        prompt = f"Task: {task}\n\nData Information:\n{data_info}\n\nCreate an execution plan as JSON with steps, expected_output, and validation_criteria."
        plan_text = self.llm.generate(prompt, PLANNER_PROMPT, temperature=0.2)
        self.history.append(("PLANNER", plan_text))
        plan = self._extract_json(plan_text) or {"steps": [task], "expected_output": "Processed data", "validation_criteria": ["Result generated", "No errors"]}
        print("\n Plan Created:")
        print(f"   Steps: {len(plan.get('steps', []))}")
        for i, step in enumerate(plan.get('steps', []), 1):
            print(f"   {i}. {step}")
        print(f"   Expected: {plan.get('expected_output', 'N/A')}")
        return plan

    def execute(self, plan, data_context):
        print("\n" + "=" * 60)
        print("PHASE 2: EXECUTION")
        print("=" * 60)
        steps_text = '\n'.join(f"{i}. {s}" for i, s in enumerate(plan.get('steps', []), 1))
        prompt = f"Task Steps:\n{steps_text}\n\nData available: DataFrame 'df'\n{data_context}\n\nWrite Python code to execute these steps. Store final result in 'result' variable."
        code = self.llm.generate(prompt, EXECUTOR_PROMPT, temperature=0.1)
        self.history.append(("EXECUTOR", code))
        if "```python" in code:
            code = code.split("```python")[1].split("```")[0]
        elif "```" in code:
            code = code.split("```")[1].split("```")[0]
        lines = []
        for line in code.split('\n'):
            s = line.strip()
            if s and (not s.startswith('#') or 'import' in s):
                lines.append(line)
        code = '\n'.join(lines).strip()
        print("\n Generated Code:\n" + "-" * 60)
        code_lines = code.split('\n')
        for i, line in enumerate(code_lines[:15], 1):
            print(f"{i:2}. {line}")
        if len(code_lines) > 15:
            print(f"    ... ({len(code_lines) - 15} more lines)")
        print("-" * 60)
        return code

We implement the Planning and Execution phases of the agent. We let the Planner create detailed task steps and validation criteria, and then the Executor generates corresponding Python code based on pandas to perform the task. We visualize how the agent autonomously transitions from reasoning to generating actionable code. Check out the FULL CODES here.

    def test(self, plan, result, execution_error=None):
        print("\n" + "=" * 60)
        print("PHASE 3: TESTING & VERIFICATION")
        print("=" * 60)
        result_desc = f"EXECUTION ERROR: {execution_error}" if execution_error else f"Result type: {type(result).__name__}\n"
        if not execution_error:
            if isinstance(result, pd.DataFrame):
                result_desc += f"Shape: {result.shape}\nColumns: {list(result.columns)}\nSample:\n{result.head(3).to_string()}"
            elif isinstance(result, (int, float, str)):
                result_desc += f"Value: {result}"
            else:
                result_desc += f"Value: {str(result)[:200]}"
        criteria_text = '\n'.join(f"- {c}" for c in plan.get('validation_criteria', []))
        prompt = f"Validation Criteria:\n{criteria_text}\n\nExpected: {plan.get('expected_output', 'N/A')}\n\nActual Result:\n{result_desc}\n\nEvaluate if result meets criteria. Return JSON with passed (true/false), issues, and recommendations."
        test_result = self.llm.generate(prompt, TESTER_PROMPT, temperature=0.2)
        self.history.append(("TESTER", test_result))
        test_json = self._extract_json(test_result) or {"passed": execution_error is None, "issues": ["Could not parse test result"], "recommendations": ["Review manually"]}
        print(f"\n✓ Test Results:\n   Status: {' PASSED' if test_json.get('passed') else ' FAILED'}")
        if test_json.get('issues'):
            print("   Issues:")
            for issue in test_json['issues'][:3]:
                print(f"    • {issue}")
        if test_json.get('recommendations'):
            print("   Recommendations:")
            for rec in test_json['recommendations'][:3]:
                print(f"    • {rec}")
        return test_json

    def run(self, task, df=None, data_info=None):
        print("\n SELF-VERIFYING DATA-OPS AGENT (Local HF Model)")
        print(f"Task: {task}\n")
        if data_info is None and df is not None:
            data_info = f"Shape: {df.shape}\nColumns: {list(df.columns)}\nSample:\n{df.head(2).to_string()}"
        plan = self.plan(task, data_info)
        code = self.execute(plan, data_info)
        result, error = None, None
        try:
            local_vars = {'pd': pd, 'np': np, 'df': df}
            exec(code, local_vars)
            result = local_vars.get('result')
        except Exception as e:
            error = str(e)
            print(f"\n Execution Error: {error}")
        test_result = self.test(plan, result, error)
        return {'plan': plan, 'code': code, 'result': result, 'test': test_result, 'history': self.history}

We focus on the Testing and Verification phase of our workflow. We let the agent evaluate its own output against predefined validation criteria and summarize the outcome as a structured JSON. We then integrate all three phases, planning, execution, and testing, into a single self-verifying pipeline that ensures complete automation. Check out the FULL CODES here.

def demo_basic(agent):
    print("\n" + "#" * 60)
    print("# DEMO 1: Sales Data Aggregation")
    print("#" * 60)
    df = pd.DataFrame({'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
                       'sales': [100, 150, 200, 80, 130, 90, 110],
                       'region': ['North', 'South', 'North', 'East', 'South', 'West', 'East']})
    task = "Calculate total sales by product"
    output = agent.run(task, df)
    if output['result'] is not None:
        print(f"\n Final Result:\n{output['result']}")
    return output

def demo_advanced(agent):
    print("\n" + "#" * 60)
    print("# DEMO 2: Customer Age Analysis")
    print("#" * 60)
    df = pd.DataFrame({'customer_id': range(1, 11),
                       'age': [25, 34, 45, 23, 56, 38, 29, 41, 52, 31],
                       'purchases': [5, 12, 8, 3, 15, 7, 9, 11, 6, 10],
                       'spend': [500, 1200, 800, 300, 1500, 700, 900, 1100, 600, 1000]})
    task = "Calculate average spend by age group: young (under 35) and mature (35+)"
    output = agent.run(task, df)
    if output['result'] is not None:
        print(f"\n Final Result:\n{output['result']}")
    return output

if __name__ == "__main__":
    print(" Initializing Local LLM...")
    print("Using CPU mode for maximum compatibility\n")
    try:
        llm = LocalLLM(use_8bit=False)
        agent = DataOpsAgent(llm)
        demo_basic(agent)
        print("\n\n")
        demo_advanced(agent)
        print("\n" + "=" * 60)
        print(" Tutorial Complete!")
        print("=" * 60)
        print("\nKey Features:")
        print("  • 100% Local - No API calls required")
        print("  • Uses Phi-2 from Microsoft (2.7B params)")
        print("  • Self-verifying 3-phase workflow")
        print("  • Runs on free Google Colab CPU/GPU")
    except Exception as e:
        print(f"\n Error: {e}")
        print("Troubleshooting:\n1. pip install -q transformers accelerate scipy\n2. Restart runtime\n3. Try a different model")

We built two demo examples to test the agent’s capabilities using simple sales and customer datasets. We initialize the model, execute the Data-Ops workflow, and observe the full cycle from planning to validation. We conclude the tutorial by summarizing key benefits and encouraging further experimentation with local models.

In conclusion, we created a fully autonomous and self-verifying DataOps system powered by a local Hugging Face model. We experience how each stage, planning, execution, and testing, seamlessly interacts to produce reliable results without relying on any cloud APIs. This workflow highlights the strength of local LLMs, such as Phi-2, for lightweight automation and inspires us to expand this architecture for more advanced data pipelines, validation frameworks, and multi-agent data systems in the future.

Check out the FULL CODES here.
The post How to Build a Fully Self-Verifying Data Operations AI Agent Using Local Hugging Face Models for Automated Planning, Execution, and Testing appeared first on MarkTechPost.

How Powerful are Diffusion LLMs? Rethinking Generation with Any-Process Masked Diffusion Models

How powerful are Diffusion LLMs compared to classic autoregressive LLMs, once you treat generation as an algorithm with time and space complexity, not just as a decoding trick? A new research paper from a team of researchers at the Toyota Technological Institute at Chicago and MIT gives a formal answer. This new research compares Auto-Regressive Models (ARM), Masked Diffusion Models (MDM), and a new family called Any-Process MDM (AP-MDM), using complexity theory and controlled reasoning tasks.

https://arxiv.org/pdf/2510.06190

ARM vs MDM: Same Expressivity, Different Parallel Time

ARM uses next token prediction in a strict left to right order. Prior work already shows that with enough intermediate steps, ARM is Turing complete, so it can represent any computable function in principle, given enough context and compute.

MDM, the discrete diffusion style used in diffusion LLMs, works on a masked sequence. The model starts from a fully masked sequence and iteratively unmasks tokens. It can update many positions in parallel and in any order. MDM is modeled as an encoder only Transformer with context length S(n) and decoding steps T(n) for an input of size n.

The research team shows:

MDM can simulate any PRAM (Parallel Random Access Machine) algorithm with parallel time T(n) using O(T(n)) diffusion steps and context S(n) proportional to total work.

This makes MDM Turing complete and lets it match ideal parallel time on problems in NC, such as graph connectivity and some context free language tasks, where ARM needs time linear in sequence length.

Diffusion LLMs therefore gain efficiency on parallelizable problems, not extra expressive power by themselves.

Any-Order Generation Has Limited Benefits

A natural question is whether Any-Order Generation is strictly more powerful than left to right generation.

To isolate this, the research team defines an Any-Order MDM (AO-MDM) and a corresponding Masked ARM with the same architecture and similar token budget, but decoding in a fixed left to right way over a sequence padded with masks.

The main result:

Any computation performed by AO-MDM with one token per step and context S(n) can be reorganized into a left to right schedule and simulated by a Masked ARM with sequence length O(S(n)) plus a constant number of extra layers.

In other words, once you control for parallelism and architecture, any order generation alone does not expand the class of problems beyond what ARM can already handle.

Both ARM and AO-MDM also share a space limitation. With context length S(n), they cannot efficiently solve problems that require more than roughly S(n)^3 serial time. With polynomial context, they are effectively limited to problems in the class P and cannot handle general NP hard tasks just by test time scaling.

Any-Process Generation and AP-MDM

To go beyond these limits, the research team proposes Any-Process Generation, instantiated as Any-Process MDM (AP-MDM).

AP-MDM keeps the masked diffusion view but extends the transition function with three extra operations, in addition to the usual unmask:

remask: turn an already decoded token back into the mask token M

insert: insert a new mask token at a chosen position

delete: delete a mask token that is no longer needed

These are controlled by a 3-bit control vector per position, c_{t,i} = (c_{t,i}[1], c_{t,i}[2], c_{t,i}[3]). The same Transformer backbone predicts both content logits and these control bits.

remask uses the first bit to decide whether to overwrite a position with M, which enables backtracking and self correction.

insert and delete use the second and third bits to add or remove mask tokens, so the sequence length can grow or shrink during decoding.

Architecturally, AP-MDM only adds three small linear heads on top of an encoder only Transformer, so it is easy to add on top of existing MDM style diffusion LLMs.
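As a rough sketch of that architectural idea, the three control heads could sit on top of per-position hidden states as shown below. This is not the authors' code; the head shapes and thresholding are assumptions made only to illustrate the mechanism.

# Illustrative sketch of AP-MDM-style prediction heads (not the paper's implementation).
import torch
import torch.nn as nn

class APMDMHeads(nn.Module):
    """Given per-position hidden states from an encoder-only Transformer,
    predict content logits plus three control bits: remask, insert, delete."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.content_head = nn.Linear(hidden_size, vocab_size)
        self.remask_head = nn.Linear(hidden_size, 1)   # bit 1: overwrite this position with [MASK]
        self.insert_head = nn.Linear(hidden_size, 1)   # bit 2: insert a new [MASK] at this position
        self.delete_head = nn.Linear(hidden_size, 1)   # bit 3: delete this [MASK] token

    def forward(self, hidden_states):                  # hidden_states: (batch, seq_len, hidden)
        logits = self.content_head(hidden_states)
        control_probs = torch.sigmoid(torch.cat([
            self.remask_head(hidden_states),
            self.insert_head(hidden_states),
            self.delete_head(hidden_states),
        ], dim=-1))                                    # (batch, seq_len, 3)
        control_bits = (control_probs > 0.5).long()    # threshold into discrete control bits
        return logits, control_bits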


The key theoretical result:

AP-MDM can simulate any PRAM algorithm with optimal parallel time and optimal space, using context proportional to the true space S(n) rather than total work. With polynomial context, AP-MDM can realize computations in PSPACE, while standard MDM and ARM under the same context budget are restricted to P.

The research team also shows that, under standard complexity assumptions, there exists a constant depth AP-MDM whose generation process cannot be simulated by any constant depth ARM or Masked ARM.

Empirical Results: Sudoku, Dyck, Graphs, Parity

The experiments match the theory and make the differences concrete.

Sudoku

Sudoku, generalized to (n2 x n2) grids, is NP complete.

AP-MDM reaches 99.28 percent accuracy with about 1.2 million parameters and only 100 training instances.

An ARM baseline with ordering reaches 87.18 percent using 1.8 million training instances and about 5 times more parameters.

The best AO-MDM baseline reaches 89.49 percent under the same large data regime.


This shows that editing operations, especially remask, are crucial to exploit test time scaling on hard reasoning tasks.

Dyck languages and coding style constraints

The research also analyzes two sided Dyck k languages, which model matched parentheses and are a core abstraction for code syntax. It proves that fixed ARM models cannot ensure valid generation for arbitrary lengths, while there exists an AP-MDM that generates exactly the Dyck language using insert and remask.

This matches how coding tasks require structure aware edits under global constraints, for example balanced brackets and consistent scopes.

Graph generation and structural editing

For graph editing tasks under global constraints, AP-MDM uses insert, delete and remask to implement a sequence of structured edits over a graph representation. The reported accuracy stays near perfect as graph size scales, while ARM degrades as the graph gets larger.

Parity and length generalization

On parity, AP-MDM learns a local elimination rule by repeatedly deleting pairs of bits, driven by remask and delete. It is trained only on length 2 sequences, then achieves 100 percent generalization to arbitrary lengths. ARM baselines struggle to reach similar generalization even with much longer training sequences.


Key Takeaways

Any order Masked Diffusion Models are as expressive as autoregressive models once you fix architecture and parallelism, they mainly provide parallel efficiency rather than new computational power.

Masked Diffusion Models can simulate PRAM algorithms and achieve exponential speedup on parallelizable tasks in NC, but with polynomial context they remain effectively limited to problems in class P, similar to autoregressive models.

Any Process MDM extends diffusion LLMs with remask, insert and delete operations, implemented via a three bit control vector per token, and can simulate PRAM with both optimal parallel time and optimal space, reaching PSPACE level expressivity under polynomial context.

On hard reasoning tasks such as generalized Sudoku, Dyck languages, graph editing and parity, AP MDM shows strong empirical advantages, for example achieving about 99.28 percent Sudoku accuracy with only 100 training instances and a much smaller parameter budget than autoregressive and any order MDM baselines.

For domains like coding, mathematics and AI4Science that involve structured edits and revision histories, AP MDM aligns better with the underlying generation processes than next token prediction, and its editing operations are provably hard to simulate with constant depth autoregressive models.

Editorial Comments

Any-Process MDM is an important step because it treats generation as a full algorithm, not just a decoding order. The research work shows that Masked Diffusion Models already match PRAM parallel time, but remain in P under polynomial context, similar to autoregressive models. By adding remask, insert and delete, AP-MDM reaches PSPACE-level expressivity with polynomial context and achieves strong empirical gains on Sudoku, Dyck, graph editing and parity. Overall, AP-MDM makes a strong case that future frontier LLMs should adopt edit-based Any-Process Generation, not just faster autoregression.

Check out the Paper and Repo.
The post How Powerful are Diffusion LLMs? Rethinking Generation with Any-Process Masked Diffusion Models appeared first on MarkTechPost.

How to Build a Fully Functional Custom GPT-style Conversational AI Locally Using Hugging Face Transformers

In this tutorial, we build our own custom GPT-style chat system from scratch using a local Hugging Face model. We start by loading a lightweight instruction-tuned model that understands conversational prompts, then wrap it inside a structured chat framework that includes a system role, user memory, and assistant responses. We define how the agent interprets context, constructs messages, and optionally uses small built-in tools to fetch local data or simulated search results. By the end, we have a fully functional, conversational model that behaves like a personalized GPT running locally. Check out the FULL CODES here.

!pip install transformers accelerate sentencepiece --quiet
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Tuple, Optional
import textwrap, json, os

We begin by installing the essential libraries and importing the required modules. We ensure that the environment has all necessary dependencies, such as transformers, torch, and sentencepiece, ready for use. This setup allows us to work seamlessly with Hugging Face models inside Google Colab. Check out the FULL CODES here. 

MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
BASE_SYSTEM_PROMPT = (
    "You are a custom GPT running locally. "
    "Follow user instructions carefully. "
    "Be concise and structured. "
    "If something is unclear, say it is unclear. "
    "Prefer practical examples over corporate examples unless explicitly asked. "
    "When asked for code, give runnable code."
)
MAX_NEW_TOKENS = 256

We configure our model name, define the system prompt that governs the assistant’s behavior, and set token limits. We establish how our custom GPT should respond, concise, structured, and practical. This section defines the foundation of our model’s identity and instruction style. Check out the FULL CODES here. 

print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)
model.eval()
print("Model loaded.")

We load the tokenizer and model from Hugging Face into memory and prepare them for inference. We automatically adjust the device mapping based on available hardware, ensuring GPU acceleration if possible. Once loaded, our model is ready to generate responses. Check out the FULL CODES here. 

ConversationHistory = List[Tuple[str, str]]
history: ConversationHistory = [("system", BASE_SYSTEM_PROMPT)]

def wrap_text(s: str, w: int = 100) -> str:
    return "\n".join(textwrap.wrap(s, width=w))

def build_chat_prompt(history: ConversationHistory, user_msg: str) -> str:
    prompt_parts = []
    for role, content in history:
        if role == "system":
            prompt_parts.append(f"<|system|>\n{content}\n")
        elif role == "user":
            prompt_parts.append(f"<|user|>\n{content}\n")
        elif role == "assistant":
            prompt_parts.append(f"<|assistant|>\n{content}\n")
    prompt_parts.append(f"<|user|>\n{user_msg}\n")
    prompt_parts.append("<|assistant|>\n")
    return "".join(prompt_parts)

We initialize the conversation history, starting with a system role, and create a prompt builder to format messages. We define how user and assistant turns are arranged in a consistent conversational structure. This ensures the model always understands the dialogue context correctly. Check out the FULL CODES here. 

def local_tool_router(user_msg: str) -> Optional[str]:
    msg = user_msg.strip().lower()
    if msg.startswith("search:"):
        query = user_msg.split(":", 1)[-1].strip()
        return f"Search results about '{query}':\n- Key point 1\n- Key point 2\n- Key point 3"
    if msg.startswith("docs:"):
        topic = user_msg.split(":", 1)[-1].strip()
        return f"Documentation extract on '{topic}':\n1. The agent orchestrates tools.\n2. The model consumes output.\n3. Responses become memory."
    return None

We add a lightweight tool router that extends our GPT’s capability to simulate tasks like search or documentation retrieval. We define logic to detect special prefixes such as “search:” or “docs:” in user queries. This simple agentic design gives our assistant contextual awareness. Check out the FULL CODES here. 

def generate_reply(history: ConversationHistory, user_msg: str) -> str:
    tool_context = local_tool_router(user_msg)
    if tool_context:
        user_msg = user_msg + "\n\nUseful context:\n" + tool_context
    prompt = build_chat_prompt(history, user_msg)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=True,
            top_p=0.9,
            temperature=0.6,
            pad_token_id=tokenizer.eos_token_id
        )
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    reply = decoded.split("<|assistant|>")[-1].strip() if "<|assistant|>" in decoded else decoded[len(prompt):].strip()
    history.append(("user", user_msg))
    history.append(("assistant", reply))
    return reply

def save_history(history: ConversationHistory, path: str = "chat_history.json") -> None:
    data = [{"role": r, "content": c} for (r, c) in history]
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

def load_history(path: str = "chat_history.json") -> ConversationHistory:
    if not os.path.exists(path):
        return [("system", BASE_SYSTEM_PROMPT)]
    with open(path, "r") as f:
        data = json.load(f)
    return [(item["role"], item["content"]) for item in data]

We define the primary reply generation function, which combines history, context, and model inference to produce coherent outputs. We also add functions to save and load past conversations for persistence. This snippet forms the operational core of our custom GPT. Check out the FULL CODES here. 

print("\n--- Demo turn 1 ---")
demo_reply_1 = generate_reply(history, "Explain what this custom GPT setup is doing in 5 bullet points.")
print(wrap_text(demo_reply_1))

print("\n--- Demo turn 2 ---")
demo_reply_2 = generate_reply(history, "search: agentic ai with local models")
print(wrap_text(demo_reply_2))

def interactive_chat():
    print("\nChat ready. Type 'exit' to stop.")
    while True:
        try:
            user_msg = input("\nUser: ").strip()
        except EOFError:
            break
        if user_msg.lower() in ("exit", "quit", "q"):
            break
        reply = generate_reply(history, user_msg)
        print("\nAssistant:\n" + wrap_text(reply))

# interactive_chat()
print("\nCustom GPT initialized successfully.")

We test the entire setup by running demo prompts and displaying generated responses. We also create an optional interactive chat loop to converse directly with the assistant. By the end, we confirm that our custom GPT runs locally and responds intelligently in real time.

In conclusion, we designed and executed a custom conversational agent that mirrors GPT-style reasoning without relying on any external services. We saw how local models can be made interactive through prompt orchestration, lightweight tool routing, and conversational memory management. This approach enables us to understand the internal logic behind commercial GPT systems. It empowers us to experiment with our own rules, behaviors, and integrations in a transparent and fully offline manner.

Check out the FULL CODES here.
The post How to Build a Fully Functional Custom GPT-style Conversational AI Locally Using Hugging Face Transformers appeared first on MarkTechPost.

How to Build an End-to-End Interactive Analytics Dashboard Using PyGWalker Features for Insightful Data Exploration

In this tutorial, we explore the advanced capabilities of PyGWalker, a powerful tool for visual data analysis that integrates seamlessly with pandas. We begin by generating a realistic e-commerce dataset enriched with time, demographic, and marketing features to mimic real-world business data. We then prepare multiple analytical views, including daily sales, category performance, and customer segment summaries. Finally, we use PyGWalker to interactively explore patterns, correlations, and trends across these dimensions through intuitive drag-and-drop visualizations. Check out the FULL CODES here.

!pip install pygwalker pandas numpy scikit-learn

import pandas as pd
import numpy as np
import pygwalker as pyg
from datetime import datetime, timedelta

We begin by setting up our environment, installing all necessary dependencies, and importing essential libraries, including pandas, numpy, and pygwalker. We ensure that everything is ready for building our interactive data exploration workflow in Colab. Check out the FULL CODES here.

def generate_advanced_dataset():
    np.random.seed(42)
    start_date = datetime(2022, 1, 1)
    dates = [start_date + timedelta(days=x) for x in range(730)]
    categories = ['Electronics', 'Clothing', 'Home & Garden', 'Sports', 'Books']
    products = {
        'Electronics': ['Laptop', 'Smartphone', 'Headphones', 'Tablet', 'Smartwatch'],
        'Clothing': ['T-Shirt', 'Jeans', 'Dress', 'Jacket', 'Sneakers'],
        'Home & Garden': ['Furniture', 'Lamp', 'Rug', 'Plant', 'Cookware'],
        'Sports': ['Yoga Mat', 'Dumbbell', 'Running Shoes', 'Bicycle', 'Tennis Racket'],
        'Books': ['Fiction', 'Non-Fiction', 'Biography', 'Science', 'History']
    }
    n_transactions = 5000
    data = []
    for _ in range(n_transactions):
        date = np.random.choice(dates)
        category = np.random.choice(categories)
        product = np.random.choice(products[category])
        base_prices = {
            'Electronics': (200, 1500),
            'Clothing': (20, 150),
            'Home & Garden': (30, 500),
            'Sports': (25, 300),
            'Books': (10, 50)
        }
        price = np.random.uniform(*base_prices[category])
        quantity = np.random.choice([1, 1, 1, 2, 2, 3], p=[0.5, 0.2, 0.15, 0.1, 0.03, 0.02])
        customer_segment = np.random.choice(['Premium', 'Standard', 'Budget'], p=[0.2, 0.5, 0.3])
        age_group = np.random.choice(['18-25', '26-35', '36-45', '46-55', '56+'])
        region = np.random.choice(['North', 'South', 'East', 'West', 'Central'])
        month = date.month
        seasonal_factor = 1.0
        if month in [11, 12]:
            seasonal_factor = 1.5
        elif month in [6, 7]:
            seasonal_factor = 1.2
        revenue = price * quantity * seasonal_factor
        discount = np.random.choice([0, 5, 10, 15, 20, 25], p=[0.4, 0.2, 0.15, 0.15, 0.07, 0.03])
        marketing_channel = np.random.choice(['Organic', 'Social Media', 'Email', 'Paid Ads'])
        base_satisfaction = 4.0
        if customer_segment == 'Premium':
            base_satisfaction += 0.5
        if discount > 15:
            base_satisfaction += 0.3
        satisfaction = np.clip(base_satisfaction + np.random.normal(0, 0.5), 1, 5)
        data.append({
            'Date': date, 'Category': category, 'Product': product, 'Price': round(price, 2),
            'Quantity': quantity, 'Revenue': round(revenue, 2), 'Customer_Segment': customer_segment,
            'Age_Group': age_group, 'Region': region, 'Discount_%': discount,
            'Marketing_Channel': marketing_channel, 'Customer_Satisfaction': round(satisfaction, 2),
            'Month': date.strftime('%B'), 'Year': date.year, 'Quarter': f'Q{(date.month - 1) // 3 + 1}'
        })
    df = pd.DataFrame(data)
    df['Profit_Margin'] = round(df['Revenue'] * (1 - df['Discount_%'] / 100) * 0.3, 2)
    df['Days_Since_Start'] = (df['Date'] - df['Date'].min()).dt.days
    return df

We design a function to generate a comprehensive e-commerce dataset that mirrors real-world business conditions. We include product categories, customer demographics, seasonal effects, and satisfaction levels, ensuring that our data is diverse and analytically rich. Check out the FULL CODES here.

print("Generating advanced e-commerce dataset...")
df = generate_advanced_dataset()
print("\nDataset Overview:")
print(f"Total Transactions: {len(df)}")
print(f"Date Range: {df['Date'].min()} to {df['Date'].max()}")
print(f"Total Revenue: ${df['Revenue'].sum():,.2f}")
print(f"\nColumns: {list(df.columns)}")
print("\nFirst few rows:")
print(df.head())

We execute the dataset generation function and display key insights, including total transactions, revenue range, and sample records. We get a clear snapshot of the data’s structure and confirm that it’s suitable for detailed analysis. Check out the FULL CODES here.

daily_sales = df.groupby('Date').agg({
    'Revenue': 'sum', 'Quantity': 'sum', 'Customer_Satisfaction': 'mean'
}).reset_index()

category_analysis = df.groupby('Category').agg({
    'Revenue': ['sum', 'mean'], 'Quantity': 'sum', 'Customer_Satisfaction': 'mean', 'Profit_Margin': 'sum'
}).reset_index()
category_analysis.columns = ['Category', 'Total_Revenue', 'Avg_Order_Value',
                             'Total_Quantity', 'Avg_Satisfaction', 'Total_Profit']

segment_analysis = df.groupby(['Customer_Segment', 'Region']).agg({
    'Revenue': 'sum', 'Customer_Satisfaction': 'mean'
}).reset_index()

print("\n" + "=" * 50)
print("DATASET READY FOR PYGWALKER VISUALIZATION")
print("=" * 50)

We perform data aggregations to prepare multiple analytical perspectives, including time-based trends, category-level summaries, and performance metrics for customer segments. We organize this information to make it easily visualizable in PyGWalker. Check out the FULL CODES here.

print("\n Launching PyGWalker Interactive Interface...")
walker = pyg.walk(
    df,
    spec="./pygwalker_config.json",
    use_kernel_calc=True,
    theme_key='g2'
)

print("\n PyGWalker is now running!")
print(" Try creating these visualizations:")
print("  - Revenue trend over time (line chart)")
print("  - Category distribution (pie chart)")
print("  - Price vs Satisfaction scatter plot")
print("  - Regional sales heatmap")
print("  - Discount effectiveness analysis")

We launch the PyGWalker interactive interface to visually explore our dataset. We create meaningful charts, uncover trends in sales, satisfaction, and pricing, and observe how interactive visualization enhances our analytical understanding.

(Screenshots: the PyGWalker interface showing the Data View, Visualization, and Chat with Data tabs.)

In conclusion, we developed a comprehensive data visualization workflow using PyGWalker, encompassing dataset generation, feature engineering, multidimensional analysis, and interactive exploration. We experience how PyGWalker transforms raw tabular data into rich, exploratory dashboards without needing complex code or BI tools. Through this exercise, we strengthen our ability to derive insights quickly, experiment visually, and connect data storytelling directly to practical business understanding.

Check out the FULL CODES here.
The post How to Build an End-to-End Interactive Analytics Dashboard Using PyGWalker Features for Insightful Data Exploration appeared first on MarkTechPost.

Baidu Releases ERNIE-4.5-VL-28B-A3B-Thinking: An Open-Source and Compact Multimodal Reasoning Model Under the ERNIE-4.5 Family

How can we get large model level multimodal reasoning for documents, charts and videos while running only a 3B class model in production? Baidu has added a new model to the ERNIE-4.5 open source family. ERNIE-4.5-VL-28B-A3B-Thinking is a vision language model that focuses on document, chart and video understanding with a small active parameter budget.

https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking

Architecture and training setup

ERNIE-4.5-VL-28B-A3B-Thinking is built on the ERNIE-4.5-VL-28B-A3B Mixture of Experts architecture. The family uses a heterogeneous multimodal MoE design with shared parameters across text and vision plus modality specific experts. At the model level, it has 30B total parameters, while the architecture is in the 28B-VL branch, and only 3B parameters are activated per token through an A3B routing scheme. This gives the compute and memory profile of a 3B class model while keeping a larger capacity pool for reasoning.
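For readers unfamiliar with this kind of routing, the following is a generic top-k mixture-of-experts sketch, not ERNIE's actual A3B router or its heterogeneous multimodal design; it only illustrates why active parameters per token can stay far below the total parameter count.

# Generic top-k MoE routing sketch (illustrative, not ERNIE's implementation).
import torch
import torch.nn as nn

class TopKMoERouter(nn.Module):
    def __init__(self, hidden_size, num_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)])
        self.k = k

    def forward(self, x):                               # x: (tokens, hidden)
        weights = torch.softmax(self.gate(x), dim=-1)   # routing probabilities per expert
        topk_w, topk_idx = weights.topk(self.k, dim=-1) # each token picks only k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = topk_idx[:, slot] == e
                if mask.any():                          # run an expert only on the tokens routed to it
                    out[mask] += topk_w[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out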

The model goes through an additional mid training stage on a large visual language reasoning corpus. This stage is designed to improve representation power and semantic alignment between visual and language modalities, which matters for dense text in documents and fine structures in charts. On top of that, ERNIE-4.5-VL-28B-A3B-Thinking uses multimodal reinforcement learning on verifiable tasks, with GSPO and IcePop strategies and dynamic difficulty sampling to stabilize MoE training and push the model toward hard examples.

Key capabilities

Baidu researchers position this model as a lightweight multimodal reasoning engine that can activate only 3B parameters while approaching the behavior of larger flagship systems on internal benchmarks. Officially listed capabilities include visual reasoning, STEM reasoning, visual grounding, Thinking with Images, tool utilization and video understanding.

Thinking with Images is at the core. The model can zoom into regions, reason on cropped views and then integrate those local observations into a final answer. Tool utilization extends this with calls to tools such as image search when internal knowledge is not enough. Both features are exposed as part of the reasoning parser and tool call parser path in deployment.

Performance and positioning

The lightweight vision language model ERNIE-4.5-VL-28B-A3B achieves competitive or superior performance compared to Qwen-2.5-VL-7B and Qwen-2.5-VL-32B on many benchmarks, while using fewer activation parameters. ERNIE-4.5-VL models also support both thinking and non thinking modes, with the thinking mode improving reasoning centered tasks while keeping strong perception quality.

For the specific Thinking variant, Baidu researchers describe ERNIE-4.5-VL-28B-A3B-Thinking as closely matching the performance of industry flagship models across internal multimodal benchmarks.

Key Takeaways

ERNIE-4.5-VL-28B-A3B-Thinking uses a Mixture of Experts architecture with about 30B total parameters and only 3B active parameters per token to deliver efficient multimodal reasoning.

The model is optimized for document, chart and video understanding through an additional visual language reasoning mid training stage and multimodal reinforcement learning using GSPO, IcePop and dynamic difficulty sampling.

Thinking with Images lets the model iteratively zoom into image regions and reason over crops, while tool utilization enables calls to external tools such as image search for long tail recognition.

The model demonstrates strong performance on analytics-style charts, STEM circuit problems, visual grounding with JSON bounding boxes, and video segment localization with timestamped answers.

The model is released under Apache License 2.0, supports deployment via transformers, vLLM and FastDeploy, and can be fine tuned with ERNIEKit using SFT, LoRA and DPO for commercial multimodal applications.

Comparison Table

Model | Training stage | Total / active parameters | Modalities | Context length (tokens)
ERNIE-4.5-VL-28B-A3B-Base | Pretraining | 28B total, 3B active per token | Text, Vision | 131,072
ERNIE-4.5-VL-28B-A3B (PT) | Posttraining chat model | 28B total, 3B active per token | Text, Vision | 131,072
ERNIE-4.5-VL-28B-A3B-Thinking | Reasoning oriented mid training on ERNIE-4.5-VL-28B-A3B | 28B architecture, 3B active per token, HF model size 30B params | Text, Vision | 131,072 (FastDeploy example uses 131,072 max model length)
Qwen2.5-VL-7B-Instruct | Posttraining vision language model | ≈8B total (7B class) | Text, Image, Video | 32,768 text positions in config (max_position_embeddings)
Qwen2.5-VL-32B-Instruct | Posttraining plus reinforcement tuned large VL model | 33B total | Text, Image, Video | 32,768 text positions (same Qwen2.5-VLTextConfig family)

Editorial Comments

ERNIE-4.5-VL-28B-A3B-Thinking is a practical release for teams that want multimodal reasoning on documents, charts and videos with only 3B activated parameters, while still using a Mixture-of-Experts architecture with about 30B total parameters and Apache License 2.0. It connects Thinking with Images, tool utilization and multimodal reinforcement learning into a deployable stack that directly targets real world analytics and understanding workloads.

Check out the Repo, Model Weights and Technical details.
The post Baidu Releases ERNIE-4.5-VL-28B-A3B-Thinking: An Open-Source and Compact Multimodal Reasoning Model Under the ERNIE-4.5 Family appeared first on MarkTechPost.

How to Reduce Cost and Latency of Your RAG Application Using Semantic …

Semantic caching in LLM (Large Language Model) applications optimizes performance by storing and reusing responses based on semantic similarity rather than exact text matches. When a new query arrives, it’s converted into an embedding and compared with cached ones using similarity search. If a close match is found (above a similarity threshold), the cached response is returned instantly—skipping the expensive retrieval and generation process. Otherwise, the full RAG pipeline runs, and the new query-response pair is added to the cache for future use.

In a RAG setup, semantic caching typically saves responses only for questions that have actually been asked, not every possible query. This helps reduce latency and API costs for repeated or slightly reworded questions. In this article, we’ll take a look at a short example demonstrating how caching can significantly lower both cost and response time in LLM-based applications. Check out the FULL CODES here.

How Semantic Caching in LLM Works

Semantic caching functions by storing and retrieving responses based on the meaning of user queries rather than their exact wording. Each incoming query is converted into a vector embedding that represents its semantic content. The system then performs a similarity search—often using Approximate Nearest Neighbor (ANN) techniques—to compare this embedding with those already stored in the cache. 

If a sufficiently similar query-response pair exists (i.e., its similarity score exceeds a defined threshold), the cached response is returned immediately, bypassing expensive retrieval or generation steps. Otherwise, the full RAG pipeline executes, retrieving documents and generating a new answer, which is then stored in the cache for future use.

What Gets Cached in Memory

In a RAG application, semantic caching only stores responses for queries that have actually been processed by the system—there’s no pre-caching of all possible questions. Each query that reaches the LLM and produces an answer can create a cache entry containing the query’s embedding and corresponding response. 

Depending on the system's design, the cache may store just the final LLM outputs, the retrieved documents, or both. To maintain efficiency, cache entries are managed through policies like time-to-live (TTL) expiration or Least Recently Used (LRU) eviction, ensuring that only recent or frequently accessed queries remain in memory over time.
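As a minimal sketch of how these policies could work together (our own illustration, not part of the example developed later in this article), a cache store might combine a TTL check with LRU eviction:

import time
from collections import OrderedDict

class SemanticCacheStore:
    # Minimal illustration of TTL expiration plus LRU eviction for cache entries.

    def __init__(self, max_entries=1000, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.entries = OrderedDict()  # key -> (embedding, response, created_at)

    def add(self, key, embedding, response):
        # Evict the least recently used entry when the cache is full
        if len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)
        self.entries[key] = (embedding, response, time.time())

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        embedding, response, created_at = item
        # Drop entries that have outlived their TTL
        if time.time() - created_at > self.ttl_seconds:
            del self.entries[key]
            return None
        # Mark as recently used so LRU eviction keeps hot entries
        self.entries.move_to_end(key)
        return embedding, response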

How Semantic Caching Works: Explained with an example

Installing dependencies

pip install openai numpy

Setting up the dependencies

import os
from getpass import getpass

os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

For this tutorial, we will be using OpenAI, but you can use any LLM provider.

from openai import OpenAI

client = OpenAI()

Running Repeated Queries Without Caching

In this section, we run the same query 10 times directly through the GPT-4.1 model to observe how long it takes when no caching mechanism is applied. Each call triggers a full LLM computation and response generation, leading to repetitive processing for identical inputs.

This helps establish a baseline for total time and cost before we implement semantic caching in the next part.

import time

def ask_gpt(query):
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    return response.output[0].content[0].text, end - start

query = "Explain the concept of semantic caching in just 2 lines."
total_time = 0

for i in range(10):
    _, duration = ask_gpt(query)
    total_time += duration
    print(f"Run {i+1} took {duration:.2f} seconds")

print(f"\nTotal time for 10 runs: {total_time:.2f} seconds")

Even though the query remains the same, every call still takes between 1–3 seconds, resulting in a total of ~22 seconds for 10 runs. This inefficiency highlights why semantic caching can be so valuable — it allows us to reuse previous responses for semantically identical queries and save both time and API cost.

Implementing Semantic Caching for Faster Responses

In this section, we enhance the previous setup by introducing semantic caching, which allows our application to reuse responses for semantically similar queries instead of repeatedly calling the GPT-4.1 API.

Here’s how it works: each incoming query is converted into a vector embedding using the text-embedding-3-small model. This embedding captures the semantic meaning of the text. When a new query arrives, we calculate its cosine similarity with embeddings already stored in our cache. If a match is found with a similarity score above the defined threshold (e.g., 0.85), the system instantly returns the cached response — avoiding another API call.

If no sufficiently similar query exists in the cache, the model generates a fresh response, which is then stored along with its embedding for future use. Over time, this approach dramatically reduces both response time and API costs, especially for frequently asked or rephrased queries.

import time
import numpy as np
from numpy.linalg import norm

semantic_cache = []

def get_embedding(text):
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(emb.data[0].embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

def ask_gpt_with_cache(query, threshold=0.85):
    query_embedding = get_embedding(query)

    # Check similarity with existing cache
    for cached_query, cached_emb, cached_resp in semantic_cache:
        sim = cosine_similarity(query_embedding, cached_emb)
        if sim > threshold:
            print(f"Using cached response (similarity: {sim:.2f})")
            return cached_resp, 0.0  # no API time

    # Otherwise, call GPT
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    text = response.output[0].content[0].text

    # Store in cache
    semantic_cache.append((query, query_embedding, text))
    return text, end - start

queries = [
    "Explain semantic caching in simple terms.",
    "What is semantic caching and how does it work?",
    "How does caching work in LLMs?",
    "Tell me about semantic caching for LLMs.",
    "Explain semantic caching simply.",
]

total_time = 0
for q in queries:
    resp, t = ask_gpt_with_cache(q)
    total_time += t
    print(f"Query took {t:.2f} seconds\n")

print(f"\nTotal time with caching: {total_time:.2f} seconds")

In the output, the first query took around 8 seconds as there was no cache and the model had to generate a fresh response. When a similar question was asked next, the system identified a high semantic similarity (0.86) and instantly reused the cached answer, saving time. Some queries, like “How does caching work in LLMs?” and “Tell me about semantic caching for LLMs,” were sufficiently different, so the model generated new responses, each taking over 10 seconds. The final query was nearly identical to the first one (similarity 0.97) and was served from cache instantly.

Check out the FULL CODES here.
The post How to Reduce Cost and Latency of Your RAG Application Using Semantic LLM Caching appeared first on MarkTechPost.

Maya1: A New Open Source 3B Voice Model For Expressive Text To Speech …

Maya Research has released Maya1, a 3B parameter text to speech model that turns text plus a short description into controllable, expressive speech while running in real time on a single GPU.

What Maya1 Actually Does

Maya1 is a state of the art speech model for expressive voice generation. It is built to capture real human emotion and precise voice design from text inputs.

The core interface has 2 inputs:

A natural language voice description, for example "Female voice in her 20s with a British accent, energetic, clear diction" or "Demon character, male voice, low pitch, gravelly timbre, slow pacing".

The text that should be spoken

The model combines both signals and generates audio that matches the content and the described style. You can also insert inline emotion tags inside the text, such as <laugh>, <sigh>, <whisper>, <angry>, <giggle>, <gasp>, <cry> and more than 20 emotions.
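As a concrete illustration, the two inputs might look like the following; the variable names are ours, and the official inference script handles how they are assembled into the actual model prompt:

# Illustrative only: prompt assembly is handled by the reference inference script.
voice_description = "Female voice in her 20s with a British accent, energetic, clear diction"
text_to_speak = (
    "Welcome back! <laugh> I honestly did not expect to see you this soon. "
    "<whisper> Come in, quickly, before anyone notices."
)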

Maya1 outputs 24 kHz mono audio and supports real time streaming, which makes it suitable for assistants, interactive agents, games, podcasts and live content.

The Maya Research team claims that the model outperforms top proprietary systems while remaining fully open source under the Apache 2.0 license.

Architecture and SNAC Codec

Maya1 is a 3B parameter decoder only transformer with a Llama style backbone. Instead of predicting raw waveforms, it predicts tokens from a neural audio codec named SNAC.

The generation flow is

text → tokenize → generate SNAC codes (7 tokens per frame) → decode → 24 kHz audio

SNAC uses a multi scale hierarchical structure at about 12, 23 and 47 Hz. This keeps the autoregressive sequence compact while preserving detail. The codec is designed for real time streaming at about 0.98 kbps.

The important point is that the transformer operates on discrete codec tokens instead of raw samples. A separate SNAC decoder, for example hubertsiuzdak/snac_24khz, reconstructs the waveform. This separation makes generation more efficient and easier to scale than direct waveform prediction.
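As a rough sketch of the decode side, assuming the open source snac package and that the model's 7-token frames have already been unpacked into SNAC's three hierarchical code levels (that unpacking is model specific and omitted here):

# Sketch: reconstructing a 24 kHz waveform from SNAC codes.
# `codes` is assumed to be a list of three LongTensors, one per hierarchical SNAC level.
import torch
from snac import SNAC

snac_decoder = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

with torch.inference_mode():
    audio = snac_decoder.decode(codes)  # (batch, 1, num_samples) at 24 kHz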

Training Data And Voice Conditioning

Maya1 is pretrained on an internet scale English speech corpus to learn broad acoustic coverage and natural coarticulation. It is then fine tuned on a curated proprietary dataset of studio recordings that include human verified voice descriptions, more than 20 emotion tags per sample, multiple English accents, and character or role variations.

The documented data pipeline includes:

24 kHz mono resampling with about minus 23 LUFS loudness

Voice activity detection with silence trimming between 1 and 14 seconds

Forced alignment using Montreal Forced Aligner for phrase boundaries

MinHash LSH text deduplication

Chromaprint based audio deduplication

SNAC encoding with 7 token frame packing

The Maya Research team evaluated several ways to condition the model on a voice description. Simple colon formats and key value tag formats either caused the model to speak the description or did not generalize well. The best performing format uses an XML style attribute wrapper that encodes the description and text in a natural way while remaining robust.

In practice, this means developers can describe voices in free form text, close to how they would brief a voice actor, instead of learning a custom parameter schema.

https://huggingface.co/maya-research/maya1

Inference And Deployment On A Single GPU

The reference Python script on Hugging Face loads the model with AutoModelForCausalLM.from_pretrained("maya-research/maya1", torch_dtype=torch.bfloat16, device_map="auto") and uses the SNAC decoder from SNAC.from_pretrained("hubertsiuzdak/snac_24khz").

The Maya Research team recommends a single GPU with 16 GB or more of VRAM, for example A100, H100 or a consumer RTX 4090 class card.

For production, they provide a vllm_streaming_inference.py script that integrates with vLLM. It supports Automatic Prefix Caching for repeated voice descriptions, a WebAudio ring buffer, multi GPU scaling and sub 100 millisecond latency targets for real time use.

Beyond the core repository, they have released:

A Hugging Face Space that exposes an interactive browser demo where users enter text and voice descriptions and listen to output

GGUF quantized variants of Maya1 for lighter deployments using llama.cpp

A ComfyUI node that wraps Maya1 as a single node, with emotion tag helpers and SNAC integration

These projects reuse the official model weights and interface, so they stay consistent with the main implementation.

Key Takeaways

Maya1 is a 3B parameter, decoder only, Llama style text to speech model that predicts SNAC neural codec tokens instead of raw waveforms, and outputs 24 kHz mono audio with streaming support.

The model takes 2 inputs, a natural language voice description and the target text, and supports more than 20 inline emotion tags such as <laugh>, <cry>, <whisper> and <gasp> for local control of expressiveness.

Maya1 is trained with a pipeline that combines large scale English pretraining and studio quality fine tuning with loudness normalization, voice activity detection, forced alignment, text deduplication, audio deduplication and SNAC encoding.

The reference implementation runs on a single 16 GB plus GPU using torch_dtype=torch.bfloat16, integrates with a SNAC decoder, and has a vLLM based streaming server with Automatic Prefix Caching for low latency deployment.

Maya1 is released under the Apache 2.0 license, with official weights, Hugging Face Space demo, GGUF quantized variants and ComfyUI integration, which makes expressive, emotion rich, controllable text to speech accessible for commercial and local use.

Editorial Comments

Maya1 pushes open source text to speech into territory that was previously dominated by proprietary APIs. A 3B parameter Llama style decoder that predicts SNAC codec tokens, runs on a single 16 GB GPU with vLLM streaming and Automatic Prefix Caching, and exposes more than 20 inline emotions with natural language voice design, is a practical building block for real time agents, games and tools. Overall, Maya1 shows that expressive, controllable TTS can be both open and production ready.

Check out the Model Weights and Demo.
The post Maya1: A New Open Source 3B Voice Model For Expressive Text To Speech On A Single GPU appeared first on MarkTechPost.

Meta AI Releases Omnilingual ASR: A Suite of Open-Source Multilingual …

How do you build a single speech recognition system that can understand thousands of languages, including many that never had working ASR (automatic speech recognition) models before? Meta AI has released Omnilingual ASR, an open source speech recognition suite that scales to more than 1,600 languages and can be extended to unseen languages with only a few speech-text examples, without retraining the model.

Data and language coverage

The supervised training data comes from a combined corpus called AllASR. AllASR contains 120,710 hours of labeled speech paired with transcripts across 1,690 languages. This corpus merges several sources, including open source datasets, internal and licensed corpora, partner created data, and a commissioned collection called the Omnilingual ASR Corpus.

The Omnilingual ASR Corpus contributes 3,350 hours of speech for 348 languages, with data collected through field work with local organizations and speakers in regions such as Africa and South Asia. Prompts are open ended, so speakers produce natural monologues in their own language instead of reading fixed sentences, which gives more realistic acoustic and lexical variation.

https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/

For self supervised pre training, the wav2vec 2.0 encoders are trained on a large unlabeled speech corpus. The pre training dataset contains 3.84M hours of speech with language identification across 1,239 languages, plus another 460K hours without language identification. The total unlabeled audio used for pre training is therefore about 4.3M hours. This is still significantly smaller than the 12M hours used by USM, which makes the reported results more interesting from a data efficiency perspective.


Model family

Omnilingual ASR exposes 3 main model families that all share the same wav2vec 2.0 speech encoder backbone:

SSL encoders (OmniASR W2V): Self supervised wav2vec 2.0 encoders with the following parameter counts: omniASR_W2V_300M with 317,390,592 parameters, omniASR_W2V_1B with 965,514,752 parameters, omniASR_W2V_3B with 3,064,124,672 parameters, and omniASR_W2V_7B with 6,488,487,168 parameters. These models are trained with the standard wav2vec 2.0 contrastive objective. After training, the quantizer is discarded and the encoder is used as a speech representation backbone.

CTC (connectionist temporal classification) ASR models: CTC models add a simple linear layer on top of the encoder and train end to end with a character level CTC loss. The released CTC models range from 325,494,996 parameters to 6,504,786,132 parameters and reach real time factors as low as 0.001 for the 300M model on A100 for 30 second audio with batch size 1.

LLM ASR models: LLM ASR stacks a Transformer decoder on top of the wav2vec 2.0 encoder. The decoder is a language-model-like Transformer that operates on character level tokens plus special tokens such as <BOS> and <EOS>. Training uses standard next token prediction on sequences of the form g_s(x), g_t(<BOS>), g_t(y), g_t(<EOS>), where g_s is the speech encoder and g_t is the text embedding matrix (a schematic sketch follows below). The LLM ASR family ranges from about 1.63B parameters for omniASR_LLM_300M to 7,801,041,536 parameters for omniASR_LLM_7B. A separate omniASR_LLM_7B_ZS checkpoint with 7,810,900,608 parameters is used for zero shot ASR.

All LLM ASR models support optional language conditioning. Languages are represented as {language_code}_{script} such as eng_Latn for English in Latin script or cmn_Hans for Mandarin Chinese in Simplified Chinese script. A learned embedding for the language script identifier is injected into the decoder input. In training, the language ID token is sometimes dropped, so the model can also operate without explicit language tags at inference.
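To make the sequence construction above concrete, here is a schematic sketch; the dimensions, token IDs and the exact position of the language embedding are our own assumptions, not the released code:

# Schematic sketch (not the released code): building the LLM ASR decoder input
# from speech encoder features and character-level text embeddings.
import torch
import torch.nn as nn

d_model, vocab_size, num_langs = 1024, 512, 1700
text_embed = nn.Embedding(vocab_size, d_model)   # g_t: character-level text embedding matrix
lang_embed = nn.Embedding(num_langs, d_model)    # learned {language_code}_{script} embedding

def build_decoder_input(speech_features, bos_id, target_ids, eos_id, lang_id=None):
    # Concatenate g_s(x), g_t(<BOS>), g_t(y), g_t(<EOS>) into one decoder input sequence.
    # Where exactly the language embedding is injected is an assumption here.
    parts = [speech_features]                               # g_s(x): (T_speech, d_model)
    if lang_id is not None:
        parts.append(lang_embed(torch.tensor([lang_id])))   # optional language conditioning
    parts.append(text_embed(torch.tensor([bos_id])))        # g_t(<BOS>)
    parts.append(text_embed(target_ids))                    # g_t(y): (T_text, d_model)
    parts.append(text_embed(torch.tensor([eos_id])))        # g_t(<EOS>)
    return torch.cat(parts, dim=0)

speech_features = torch.randn(200, d_model)                 # stand-in for wav2vec 2.0 encoder output
decoder_input = build_decoder_input(
    speech_features, bos_id=1, target_ids=torch.tensor([5, 9, 23]), eos_id=2, lang_id=42
)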

Zero shot ASR with context examples and SONAR

The supervised models cover more than 1,600 languages. However, many languages still have no transcribed ASR data. To handle these cases, Omnilingual ASR extends the LLM ASR model with a zero shot mode trained with context examples.

During training for the zero shot variant, the decoder consumes N + 1 speech text pairs from the same language. The first N pairs act as context and the final pair is the target. All pairs are embedded with the speech encoder and text embedding matrix, then concatenated into a single decoder input sequence. The loss is still next token prediction on the target transcription. This teaches the decoder to infer the mapping from speech to text in a given language from a small prompt of in language examples.

At inference, the omniASR_LLM_7B_ZS model can receive a few speech text examples from any language, including languages not present in training, and then transcribe new utterances in that language without updating weights. This is in context learning for ASR.

The system includes an example retrieval mechanism based on SONAR, a multilingual multimodal encoder that projects audio and text into a shared embedding space. The target audio is embedded once, then nearest neighbor search over a database of speech text pairs selects the most relevant examples to include in the context window. This SONAR based selection improves zero shot performance compared with random example selection or simple text similarity.
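A minimal sketch of that retrieval step, assuming the SONAR embeddings for the target audio and for a database of speech-text pairs have already been computed:

import numpy as np

def select_context_examples(target_audio_emb, database_embs, database_pairs, n_examples=4):
    # Cosine similarity between the target utterance and every candidate speech-text pair
    target = target_audio_emb / np.linalg.norm(target_audio_emb)
    candidates = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
    scores = candidates @ target
    top_idx = np.argsort(-scores)[:n_examples]
    return [database_pairs[i] for i in top_idx]

# Usage with dummy data: 1,000 candidate pairs with 1024-dimensional embeddings
database_embs = np.random.randn(1000, 1024)
database_pairs = [(f"audio_{i}.wav", f"transcript {i}") for i in range(1000)]
context = select_context_examples(np.random.randn(1024), database_embs, database_pairs)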


Quality and benchmarks

The omniASR_LLM_7B model achieves character error rate below 10 percent for 78 percent of the more than 1,600 supported languages.

The research team reports that on multilingual benchmarks such as FLEURS 102, the 7B LLM ASR model outperforms the 7B CTC models and also surpasses Google USM variants in average character error rate, despite using about 4.3M unlabeled hours instead of 12M and a simpler pre training pipeline. This suggests that scaling the wav2vec 2.0 encoder and adding an LLM style decoder is an effective path for high coverage multilingual ASR.

Key Takeaways

Omnilingual ASR provides open source ASR coverage for more than 1,600 languages and can generalize to more than 5,400 languages using zero shot in context learning.

The models are built on large scale wav2vec 2.0 encoders trained on about 4.3M hours of unlabeled audio, including 3.84M hours with language identification across 1,239 languages.

The suite includes wav2vec 2.0 encoders, CTC ASR, LLM ASR, and a dedicated zero shot LLM ASR model, with encoder sizes from 300M to 7B parameters and LLM ASR up to about 7.8B parameters.

The 7B LLM ASR model achieves character error rate below 10 percent on 78 percent of the more than 1,600 supported languages, which is competitive with or better than prior multilingual systems in low resource settings.

Editorial Comments

Omnilingual ASR is a significant systems level contribution because it treats multilingual ASR as an extensible framework, not a fixed language list, combining a 7B wav2vec 2.0 encoder, CTC and LLM ASR decoders, and a zero shot LLM ASR model that can adapt to new languages with a few in context examples, while achieving character error rate below 10 percent on 78 percent of more than 1,600 supported languages and releasing everything under Apache 2.0 and CC BY 4.0. Overall, this launch establishes Omnilingual ASR as the most extensible open source speech recognition model currently available.

Check out the Paper, Repo and Technical details.
The post Meta AI Releases Omnilingual ASR: A Suite of Open-Source Multilingual Speech Recognition Models for 1600+ Languages appeared first on MarkTechPost.

Introducing agent-to-agent protocol support in Amazon Bedrock AgentCor …

We recently announced the support for Agent-to-Agent (A2A) protocol on Amazon Bedrock AgentCore Runtime. With this addition, agents can discover peers, share capabilities, and coordinate actions across platforms using standardized communication.
Amazon Bedrock AgentCore Runtime provides a secure, serverless environment designed for deploying AI agents and tools. It works with any framework and model, supports real-time and long-running workloads, and provides session isolation with built-in authentication. With support for MCP, and now the A2A protocol, Bedrock AgentCore Runtime enables seamless communication between agents. Agents built using different frameworks, such as Strands Agents, OpenAI Agents SDK, LangGraph, Google ADK, or Claude Agents SDK, can share context, capabilities, and reasoning in a common, verifiable format.
In this post, we demonstrate how you can use the A2A protocol for AI agents built with different frameworks to collaborate seamlessly. You’ll learn how to deploy A2A servers on AgentCore Runtime, configure agent discovery and authentication, and build a real-world multi-agent system for incident response. We’ll cover the complete A2A request lifecycle, from agent card discovery to task delegation, showing how standardized protocols eliminate the complexity of multi-agent coordination.
Understanding multi-agent systems
Building effective agentic systems requires several foundational components. These include memory, both short-term for maintaining conversation context and long-term for retaining insights across sessions; tools that agents can access either natively or through MCP servers; identity for more secure authentication and permission management, allowing agents to act on behalf of users or autonomously access resources; and guardrails to detect harmful content, help prevent hallucinations, and make sure responses align with policies and factual accuracy.

While MCP connects a single agent to its tools and data, A2A lets multiple agents coordinate with one another. For example, a retail inventory agent might use MCP to query product databases, then use A2A to communicate with external supplier agents to place orders.
The A2A protocol brings benefits to multi-agent systems through seamless interoperability across diverse boundaries. Agents built with different frameworks like Strands or OpenAI, powered by various LLMs such as Anthropic Claude, GPT-4, or Llama, and hosted on different systems including AWS or edge devices can communicate and coordinate effortlessly without requiring complex translation layers. This interoperability is complemented by loose coupling and modularity, where each agent operates as an independent unit that can be developed, tested, deployed, and even upgraded without disrupting the entire system. New specialized agents can join the environment seamlessly, and the failure of one agent remains isolated due to well-defined interaction boundaries, helping prevent cascading failures across the system. The protocol also supports dynamic agent discovery and orchestration. Agents advertise their capabilities through standardized schemas while orchestrator agents can discover and invoke specialized agents based on real-time task requirements.
A2A request lifecycle on Amazon Bedrock AgentCore Runtime
The A2A protocol defines a structured request lifecycle with specific components that work together to coordinate multi-agent communication. Here are the key elements:

User: Initiates requests through the Client Agent, either as a human operator or automated service defining goals that require multi-agent assistance.
A2A Client (Client Agent): Acts on behalf of the user, initiating communication using the A2A protocol to discover and request tasks from remote agents.
A2A Server (Remote Agent): Exposes HTTP endpoints implementing the A2A protocol to receive requests, process tasks, and return results. Different agents can serve this role, handling both synchronous and asynchronous interactions using JSON-RPC 2.0 over HTTP/S or Server-Sent Events.
Agent Card: A JSON metadata file that each agent publishes to advertise its identity, capabilities, endpoints, and authentication requirements. This enables the dynamic discovery feature, where agents query what their peer agents can do before delegating tasks (an illustrative example follows this list).
Task Object: Represents each unit of work flowing through the system with a unique ID and lifecycle. As agents coordinate, tasks may be long-running, involve multiple turns, and span several agents working together.
Artifact: The output produced when a task completes, which can include structured text, JSON, images, audio, or other multimodal content. Agents exchange these artifacts as they collaborate to fulfill the user’s original request.
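To make the Agent Card concrete, the example below (shown as a Python dict, with a hypothetical endpoint and trimmed fields) illustrates the kind of metadata an agent publishes; the authoritative schema is defined by the A2A specification:

# Illustrative agent card contents (trimmed, with a hypothetical endpoint);
# the authoritative schema is defined by the A2A specification.
monitoring_agent_card = {
    "name": "MonitoringAgent",
    "description": "Analyzes CloudWatch logs, metrics, dashboards, and alarms",
    "url": "https://example.com/a2a/monitoring-agent",
    "version": "1.0.0",
    "capabilities": {"streaming": True},
    "skills": [
        {
            "id": "analyze_log_group",
            "name": "Analyze log group",
            "description": "Scans a CloudWatch log group for errors and warnings",
            "tags": ["cloudwatch", "logs"],
        }
    ],
}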

Multi-agent use case: Monitoring and incident response
To demonstrate the power of multi-agent systems using A2A on Amazon Bedrock AgentCore Runtime, we’ll walk through an enterprise monitoring and incident response solution. This real-world use-case showcases how specialized agents built with different frameworks coordinate seamlessly to handle complex operational challenges through the A2A protocol.
The monitoring and incident response solution implements a hub-and-spoke architecture with three specialized agents, each using Amazon Bedrock AgentCore features: modular building blocks that provide core capabilities such as AgentCore Memory for context-aware responses, AgentCore Identity (backed by Amazon Cognito) for more secure authentication and control over which actions each agent can perform, AgentCore Gateway for more secure and centralized access to tools, and observability to trace, debug, and monitor AI agent performance. View the architecture and demonstration video below for reference:

The multi-agent system contains the following components:

Host agent (Google ADK): Acts as the intelligent routing layer and coordination hub for the agent interactions. Demonstrates the cross-system interoperability using A2A. This agent runs on Amazon Bedrock AgentCore Runtime using Google’s Agent Development Kit, yet communicates seamlessly with agents hosted on AWS through the standardized A2A protocol. Key responsibilities of the host agent include:

Dynamic agent discovery: Fetches Identity Provider (IDP) configuration from AWS Systems Manager Parameter Store for each remote agent, enabling more secure authentication across the multi-agent system
Capability awareness: Retrieves agent cards from each A2A server to understand available skills and endpoints
Intelligent routing: Analyzes user queries and routes them to the appropriate specialist agent based on capabilities
Multi-agent coordination: Orchestrates complex workflows requiring multiple agents

Monitoring agent (Strands Agents SDK): Serves as the operational intelligence layer, continuously analyzing CloudWatch logs, metrics, dashboards, and alarms across AWS services. This agent specializes in identifying anomalies, tracking error patterns, and surfacing actionable insights from vast amounts of telemetry data. When unusual patterns emerge, the monitoring agent initiates conversations with other specialized agents to coordinate response actions. Key responsibilities of the monitoring agent include:

CloudWatch integration:

Lists and analyzes CloudWatch dashboards
Fetches logs for specific AWS services (Lambda, ECS, EC2)
Monitors alarms and alert states
Analyzes log groups for patterns and errors

Cross-account access: Supports monitoring across multiple AWS accounts

Operational agent (OpenAI SDK): Provides remediation strategies and external knowledge integration. When the monitoring agent detects a critical issue, it communicates directly with the operational agent through A2A, providing context about the problem and requesting specific remediation actions. Key responsibilities of the operational agent include:

Web search: Uses Tavily API to search for AWS best practices, troubleshooting guides, and solutions
Remediation strategies: Proposes solutions based on detected issues

Implementing the multi-agent monitoring solution
Now that we’ve explored how these three specialized agents collaborate to handle AWS incidents, let’s walk through how to build and deploy this multi-agent system using Amazon Bedrock AgentCore Runtime.
The implementation follows a progressive approach:

Start with the foundation – We’ll deploy a simple A2A server to understand the core mechanics of agent deployment, authentication, and invocation on AgentCore Runtime
Build the monitoring system – Using the same deployment patterns, we’ll construct each specialized agent (Monitoring, Operational, and Host) with their specific tools and capabilities
Connect the agents – Configure A2A communication channels between agents, enabling them to discover and invoke each other through standardized protocols
Observe the system in action – Watch the demo video showing real-time incident detection, cross-agent coordination, and automated response

All code examples, complete agent implementations, and deployment scripts for this multi-agent monitoring system are available in our GitHub repository.
Getting started with A2A on AgentCore Runtime
To understand the fundamentals of deploying A2A servers on Amazon Bedrock AgentCore Runtime, including step-by-step instructions for creating, testing, deploying, and invoking agents, refer to the A2A Protocol Support documentation. This guide covers:

Creating and configuring A2A servers with any framework (Strands, OpenAI SDK, LangGraph)
Local testing and validation
Deployment using the AgentCore CLI
Authentication setup (OAuth 2.0 and AWS IAM)
Agent Card retrieval and discovery
Client implementation for invoking deployed agents

Once you’re familiar with these fundamentals, you can apply the same patterns to build each component of the multi-agent monitoring system.
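As a rough sketch of what that client-side flow looks like over HTTP (the endpoint URL and bearer token are placeholders, and the payload shape should be verified against the current A2A specification and the AgentCore documentation):

# Sketch: discover an A2A server's agent card, then delegate a task via JSON-RPC.
# The base URL and bearer token are placeholders; verify the payload shape against the A2A spec.
import uuid
import requests

base_url = "https://example.com/a2a/monitoring-agent"       # hypothetical A2A server endpoint
headers = {"Authorization": "Bearer <oauth-access-token>"}  # OAuth 2.0 inbound auth

# 1. Agent card discovery
agent_card = requests.get(f"{base_url}/.well-known/agent-card.json", headers=headers).json()
print(agent_card["name"], [skill["id"] for skill in agent_card.get("skills", [])])

# 2. Send a message as a JSON-RPC request
payload = {
    "jsonrpc": "2.0",
    "id": str(uuid.uuid4()),
    "method": "message/send",
    "params": {
        "message": {
            "role": "user",
            "parts": [{"kind": "text", "text": "List the CloudWatch alarms currently in ALARM state"}],
            "messageId": str(uuid.uuid4()),
        }
    },
}
result = requests.post(base_url, json=payload, headers=headers).json()
print(result)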
View the full example in this GitHub sample. For this post, we will focus on this use case implementation.
Prerequisites
To deploy the multi-agent monitoring system implementation, follow the prerequisite steps:

AWS account: You need an active AWS account with appropriate permissions

Create an AWS account
AWS Management Console access

AWS CLI: Install and configure AWS CLI with your credentials

Install AWS CLI
Configure AWS CLI

Install uv.
Supported Regions: This solution is currently tested and supported in the following AWS Regions.

Note: To deploy in other Regions, you’ll need to update the DynamoDB prefix list mappings in cloudformation/vpc-stack.yaml. See the VPC Stack documentation for details.
Deployment steps
This guide walks you through deploying a multi-agent system on AWS using infrastructure-as-code. The easiest way to deploy this solution is using our automated deployment script:
Step 1: Clone the repository

git clone https://github.com/awslabs/amazon-bedrock-agentcore-samples.git
cd amazon-bedrock-agentcore-samples/02-use-cases/A2A-multi-agent-incident-response

Step 2: Run the deployment script
This deployment script will verify that the AWS CLI is installed and configured, check if the AWS credentials are valid, confirm that the Region is set to us-west-2, interactively collect the required parameters, generate unique S3 bucket names and automatically deploy all stacks in the correct order. The approximate deployment time is 10-15 minutes.

uv run deploy.py

Step 3: Provide the runtime CLI parameters
Next, provide the parameters used at deployment. Press enter for each of the options to use the default Amazon Bedrock model ID and the CloudFormation stack names for each of the agents.
API keys: You’ll need the following API keys (the deployment script will prompt for these):

OpenAI API key: Get it from OpenAI Platform
Tavily API key: Get it from Tavily
Google API key: Get it from Google AI Studio

Once you have configured the information, start the deployment process and track its progress in the AWS Management Console and the terminal.
Step 4: Run the frontend
Run the frontend using the following commands. This sets up and runs the React frontend UI that allows users to interact with the multi-agent incident response system for monitoring AWS infrastructure, querying CloudWatch metrics and logs, and searching for remediation strategies through the coordinated A2A agents.

cd frontend
npm install

chmod +x ./setup-env.sh
./setup-env.sh

npm run dev

This deployment creates a multi-agent A2A system with three specialized AI agents running on Amazon Bedrock AgentCore Runtime and orchestrated using the A2A protocol. The Cognito stack provisions OAuth 2.0-based machine-to-machine authentication by creating a Cognito user pool with four distinct client applications (WebSearch, Monitoring, Gateway, and Host Agent clients).
The monitoring agent (built with the Strands SDK) connects to CloudWatch metrics and logs through an AgentCore Gateway using a Smithy model definition, with custom semantic memory strategies for incident tracking.
The operations agent (built with OpenAI Agents SDK) interfaces with Tavily API for remediation research and the host agent (built with Google ADK) acts as the coordinator using HTTP protocol to delegate tasks to the two specialized A2A agents.
End-to-end incident response workflow
In this section, we will walk through an end-to-end workflow where the host agent manages conversations, gathers requirements from the user, and selects the best agent (monitoring or operations) to route each request to. The monitoring and operations agents expose their agent cards, which the host agent uses for orchestration. In this example, we will test with simple error analysis across various log groups and then search for remediation strategies.

The workflow includes the following steps:

Initial greeting: The user sends a greeting message asking “Hi! How are you?” to the host agent. The host agent processes the request. The host agent responds back to the user with a friendly greeting saying “I’m doing well, thank you!”
Capabilities query: The user asks the host agent “What are your capabilities?” to understand what the agent can do. The host agent explains to the user that it is an orchestration agent designed for AWS monitoring and operations based on the remote agent connections that it has access to.
List log groups and dashboards: The user requests the host agent to list the log groups and dashboards in their AWS account. The host agent recognizes this is a monitoring task and executes the transfer_to_agent tool to delegate the work. The request is transferred from the host agent to the monitoring agent for specialized handling, with the monitoring agent communicating over the A2A JSON-RPC transport. The monitoring agent retrieves the information and returns results showing 0 dashboards and 153 log groups found in the account. The host agent receives the results from the monitoring agent and displays the dashboards and log groups information to the user.
Analyze specific log group: The user requests the host agent to look for errors in a specific log group at path /aws/bedrock-agentcore/runtimes/hostadk-<runtimeId>-DEFAULT. The host agent determines this requires monitoring expertise and executes the transfer_to_agent tool. The request is transferred to the monitoring agent with instructions to analyze the specified log group for errors. The monitoring agent analyzes the log group and discovers 9 errors and 18 warnings, specifically identifying OTLP Export Failures. The host agent receives the analysis results and displays a detailed error analysis report to the user.
Debug and fix recommendations: The user asks the host agent to debug the errors and provide a report on the fixes needed. The request is transferred to the operations agent to search for solutions related to OTLP export failures. The operations agent, communicating over the A2A JSON-RPC transport, performs a web search and returns a proposed solution.

Security with A2A on Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore Runtime supports two authentication methods for securing A2A communication:
OAuth 2.0 authentication: The A2A client authenticates with an external authorization server to obtain a JSON Web Token (JWT), which is then included with all requests to the A2A server. This token-based approach enables secure, standardized authentication using either machine-to-machine (M2M) credentials or user federation, allowing the A2A server to verify the client’s identity and enforce access controls based on the token’s claims.
AWS IAM authentication: The A2A client assumes an IAM role with permissions to invoke the A2A server’s agent. This approach leverages AWS SigV4 request signing and IAM policies to control access, alleviating the need for external token management while providing fine-grained permissions.
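For the IAM path, the AWS SDK signs requests with SigV4 automatically. Here is a minimal sketch, assuming the boto3 bedrock-agentcore data plane client; the runtime ARN is a placeholder, and parameter names should be verified against the current SDK documentation:

# Sketch: invoking an A2A server hosted on AgentCore Runtime with IAM (SigV4) credentials.
# The runtime ARN is a placeholder; verify parameter names against the boto3 bedrock-agentcore client docs.
import json
import uuid
import boto3

agentcore = boto3.client("bedrock-agentcore", region_name="us-west-2")

a2a_request = {
    "jsonrpc": "2.0",
    "id": str(uuid.uuid4()),
    "method": "message/send",
    "params": {
        "message": {
            "role": "user",
            "parts": [{"kind": "text", "text": "List log groups with recent errors"}],
            "messageId": str(uuid.uuid4()),
        }
    },
}

response = agentcore.invoke_agent_runtime(
    agentRuntimeArn="arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/monitoring-agent-id",
    runtimeSessionId=str(uuid.uuid4()),
    payload=json.dumps(a2a_request),
)
print(response)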
What is supported in Amazon Bedrock AgentCore Runtime with A2A
Amazon Bedrock AgentCore Runtime provides comprehensive support for A2A communication. View some of the capabilities supported:

Stateless server: Amazon Bedrock AgentCore Runtime can host A2A servers that expose an HTTP interface, running a stateless HTTP server on port 9000 and supporting JSON-RPC messaging. The runtime acts as a transparent proxy, passing JSON-RPC requests and responses unchanged to preserve protocol fidelity.
Authenticated agent cards: Supports authenticated agent card at /.well-known/agent-card.json containing its capabilities & skills allowing other agents to discover it automatically.
Authentication with secure inbound auth: Amazon Bedrock AgentCore Runtime supports secure authentication via AWS SigV4 and OAuth 2.0, making sure the agent-to-agent communication is authorized and secure. The A2A server authenticates every incoming request using the credentials provided in the HTTP headers, leveraging Amazon Bedrock AgentCore Identity.
Authorization with secure outbound auth: Amazon Bedrock AgentCore Runtime enables secure outbound authorization through both IAM execution roles and AgentCore Identity. Each agent assumes a defined IAM execution role, granting it the necessary permissions to access AWS resources more securely. For interactions with external services, agents can use Amazon Bedrock AgentCore Identity, which provides managed OAuth 2.0 support for third-party identity providers such as Google, GitHub, Slack, and more.
VPC connectivity: You can configure Amazon Bedrock AgentCore Runtime to connect to resources in your Amazon Virtual Private Cloud (VPC). By configuring VPC connectivity, you enable secure access to private resources such as databases, internal APIs, and services within your VPC.
Leverage AWS PrivateLink: Amazon Bedrock AgentCore enables secure, private connections between your Virtual Private Cloud (VPC) and AgentCore services using AWS PrivateLink. By creating interface VPC endpoints, you can keep A2A server communication within your VPC without traversing the public internet.
Lifecycle management: Amazon Bedrock AgentCore Runtime lets you configure lifecycle rules to manage resource usage with idleRuntimeSessionTimeout and maxLifetime. Idle or long-running sessions are automatically terminated for efficient resource utilization and to maintain system performance.

Conclusion
The Agent-to-Agent protocol support in Amazon Bedrock AgentCore Runtime provides the foundation for building scalable, interoperable multi-agent systems. By standardizing communication between AI agents, regardless of their underlying framework, model, or hosting infrastructure, organizations can compose sophisticated agentic solutions with the A2A protocol. The AWS monitoring and incident response example demonstrates the practical power of this approach: a Google ADK-based orchestrator coordinating with Strands and OpenAI SDK agents, all deployed on AgentCore Runtime, working together to detect issues, search for solutions, and recommend fixes. This level of interoperability would traditionally require extensive custom integration work, but A2A makes it straightforward through standardized protocols. As AI systems continue to evolve from single-purpose tools to collaborative environments, protocols like A2A and MCP become essential building blocks. They create a future where agents can be discovered, composed, and orchestrated dynamically, enabling organizations to build once and integrate anywhere.

About the authors
Madhur Prashant is an Applied Generative AI Architect at Amazon Web Services. He is passionate about the intersection of human thinking and Agentic AI. His interests lie in generative AI, cognitive science and specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin and playing the guitar.
Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Sriharsha M S is a Principal Gen AI specialist solution architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to foundational model science and agentic AI applications at scale. His expertise spans application hardware accelerators, architecture, big data, analytics and machine learning.
Jeffrey Burke is an Applied Generative AI Solutions Architect at Amazon Web Services (AWS), where he specializes in designing and implementing cutting-edge generative AI solutions for enterprise customers. With a passion for teaching complex technologies, he focuses on translating sophisticated AI concepts into practical, scalable solutions that drive business value. He has a MS in Data Science and BS in Chemical Engineering.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using Generative AI to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Deep Learning, and he is a researcher studying the use of Machine Learning and Reinforcement Learning for accelerating learning and optimization tasks. Shreyas is also an Amazon best-selling book author with several research papers and patents to his name.
Andy Palmer is a Director of Technology for AWS Strategic Accounts. His teams provide Specialist Solutions Architecture skills across a number of speciality domain areas, including AIML, generative AI, data and analytics, security, network, and open source software. Andy and his team have been at the forefront of guiding our most advanced customers through their generative AI journeys and helping to find ways to apply these new tools to both existing problem spaces and net new innovations and product experiences.
Sayee Kulkarni is a Software Development Engineer on the AWS Bedrock AgentCore service. Her team is responsible for building and maintaining the AgentCore Runtime platform, a foundational component that enables customers to leverage agentic AI capabilities. She is driven by delivering tangible customer value, and this customer-centric focus motivates her work. Sayee played a key role in designing and launching Agent-to-Agent (A2A) capabilities for AgentCore, empowering customers to build sophisticated multi-agent systems that autonomously collaborate to solve complex business challenges.

Powering enterprise search with the Cohere Embed 4 multimodal embeddin …

The Cohere Embed 4 multimodal embeddings model is now available as a fully managed, serverless option in Amazon Bedrock. Users can choose between cross-Region inference (CRIS) and Global cross-Region inference to manage unplanned traffic bursts by utilizing compute resources across different AWS Regions. Real-time information requests and time zone concentrations are examples of events that can cause inference demand to exceed anticipated traffic.
The new Embed 4 model on Amazon Bedrock is purpose-built for analyzing business documents. The model delivers leading multilingual capabilities and shows notable improvements over Embed 3 across the key benchmarks, making it ideal for use cases such as enterprise search.
In this post, we dive into the benefits and unique capabilities of Embed 4 for enterprise search use cases. We’ll show you how to quickly get started using Embed 4 on Amazon Bedrock, taking advantage of integrations with Strands Agents, S3 Vectors, and Amazon Bedrock AgentCore to build powerful agentic retrieval-augmented generation (RAG) workflows.
Embed 4 advances multimodal embedding capabilities by natively supporting complex business documents that combine text, images, and interleaved text and images into a unified vector representation. Embed 4 handles up to 128,000 tokens, minimizing the need for tedious document splitting and preprocessing pipelines. Embed 4 also offers configurable compressed embeddings that reduce vector storage costs by up to 83% (Introducing Embed 4: Multimodal search for business). Together with multilingual understanding across over 100 languages, enterprises in regulated industries such as finance, healthcare, and manufacturing can efficiently process unstructured documents, accelerating insight extraction for optimized RAG systems. Read about Embed 4 in this launch blog from July 2025 to explore how to deploy on Amazon SageMaker JumpStart.
Embed 4 can be integrated into your applications using the InvokeModel API, and here’s an example of how to use the AWS SDK for Python (Boto3) with Embed 4:
For text-only input:

import boto3
import json

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

# Request body (text1 and text2 are the document strings you want to embed)
body = json.dumps({
    "texts": [text1, text2],
    "input_type": "search_document",
    "embedding_types": ["float"]
})

# Invoke the model
model_id = 'cohere.embed-v4:0'

response = bedrock_runtime.invoke_model(
    modelId=model_id,
    body=body,
    accept='*/*',
    contentType='application/json'
)

# Parse response
result = json.loads(response['body'].read())

For mixed-modality (text and image) input:

import base64

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

# Request body (text is the passage to embed; image_base64_uri is a base64-encoded data URI of the image)
body = json.dumps({
    "inputs": [
        {
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": image_base64_uri}
            ]
        }
    ],
    "input_type": "search_document",
    "embedding_types": ["int8", "float"]
})

# Invoke the model
model_id = 'cohere.embed-v4:0'

response = bedrock_runtime.invoke_model(
    modelId=model_id,
    body=body,
    accept='*/*',
    contentType='application/json'
)

# Parse response
result = json.loads(response['body'].read())

For more details, you can check Amazon Bedrock User Guide for Cohere Embed 4.
Enterprise search use case
In this section, we focus on using Embed 4 for an enterprise search use case in the finance industry. Embed 4 unlocks a range of capabilities for enterprises seeking to:

Streamline information discovery
Enhance generative AI workflows
Optimize storage efficiency

Using foundation models in Amazon Bedrock is a fully serverless environment which removes infrastructure management and simplifies integration with other Amazon Bedrock capabilities. See more details for other possible use cases with Embed 4.
Solution overview
With the serverless experience available in Amazon Bedrock, you can get started quickly without spending too much effort on infrastructure management. In the following sections, we show how to get started with Cohere Embed 4. Embed 4 is already designed with storage efficiency in mind.
We choose Amazon S3 vectors for storage because it is a cost-optimized, AI-ready storage with native support for storing and querying vectors at scale. S3 vectors can store billions of vector embeddings with sub-second query latency, reducing total costs by up to 90% compared to traditional vector databases. We leverage the extensible Strands Agent SDK to simplify agent development and take advantage of model choice flexibility. We also use Bedrock AgentCore because it provides a fully managed, serverless runtime specifically built to handle dynamic, long-running agentic workloads with industry-leading session isolation, security, and real-time monitoring.

Prerequisites
To get started with Embed 4, verify you have the following prerequisites in place:

IAM permissions: Configure your IAM role with necessary Amazon Bedrock permissions, or generate API keys through the console or SDK for testing. For more information, see Amazon Bedrock API keys.
Strands SDK installation: Install the required SDK for your development environment. For more information, see the Strands quickstart guide.
S3 Vectors configuration: Create an S3 vector bucket and vector index for storing and querying vector data. For more information, see the getting started with S3 Vectors tutorial.

Initialize Strands agents
The Strands Agents SDK offers an open source, modular framework that streamlines the development, integration, and orchestration of AI agents. With its flexible architecture, developers can build reusable agent components and create custom tools with ease. The system supports multiple models, giving users freedom to select optimal solutions for their specific use cases. Models can be hosted on Amazon Bedrock, Amazon SageMaker, or elsewhere.
For example, Cohere Command A is a generative model with 111B parameters and a 256K context length. The model excels at tool use which can extend baseline functionality while avoiding unnecessary tool calls. The model is also suitable for multilingual tasks and RAG tasks such as manipulating numerical information in financial settings. When paired with Embed 4, which is purpose-built for highly regulated sectors like financial services, this combination delivers substantial competitive benefits through its adaptability.
We begin by defining a tool that a Strands agent can use. The tool searches for documents stored in S3 using semantic similarity. It first converts the user's query into a vector with Cohere Embed 4, then returns the most relevant documents by querying the embeddings stored in the S3 vector bucket. The code below shows only the inference portion; embeddings created from the financial documents were stored in an S3 vector bucket before querying.

import json
import boto3
from strands import tool

# S3 Vector search function for financial documents
@tool
def search(query_text: str, bucket_name: str = "my-s3-vector-bucket",
           index_name: str = "my-s3-vector-index-1536", top_k: int = 3,
           category_filter: str = None) -> str:
    """Search financial documents using semantic vector search."""

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    s3vectors = boto3.client("s3vectors", region_name="us-east-1")

    # Generate an embedding for the query using Cohere Embed 4
    response = bedrock.invoke_model(
        modelId="cohere.embed-v4:0",
        body=json.dumps({
            "texts": [query_text],
            "input_type": "search_query",
            "embedding_types": ["float"]
        }),
        accept='*/*',
        contentType='application/json'
    )

    response_body = json.loads(response["body"].read())
    embedding = response_body["embeddings"]["float"][0]

    # Query the S3 vector index
    query_params = {
        "vectorBucketName": bucket_name,
        "indexName": index_name,
        "queryVector": {"float32": embedding},
        "topK": top_k,
        "returnDistance": True,
        "returnMetadata": True
    }

    if category_filter:
        query_params["filter"] = {"category": category_filter}

    response = s3vectors.query_vectors(**query_params)
    return json.dumps(response["vectors"], indent=2)

We then define a financial research agent that can use the tool to search financial documents. As your use case becomes more complex, more agents can be added for specialized tasks.

from strands import Agent

# Create a financial research agent using Strands
agent = Agent(
    name="FinancialResearchAgent",
    system_prompt="You are a financial research assistant that can search through financial documents, earnings reports, regulatory filings, and market analysis. Use the search tool to find relevant financial information and provide helpful analysis.",
    tools=[search]
)

Simply using the tool returns the following results. Multilingual financial documents are ranked by semantic similarity to the query about comparing earnings growth rates. An agent can use this information to generate useful insights.

result = search("Compare earnings growth rates mentioned in the documents")
print(result)

[
  {
    "key": "doc_0_en",
    "metadata": {
      "language": "en",
      "source_text": "Q3 2024 earnings report shows revenue growth of 15% year-over-year driven by strong performance in cloud services and AI products",
      "doc_id": 0
    },
    "distance": 0.7292724251747131
  },
  {
    "key": "doc_18_zh",
    "metadata": {
      "source_text": "2024年上半年财务报告显示净利润增长20%，主要受益于金融科技和数字支付业务的快速发展",
      "language": "zh",
      "doc_id": 18
    },
    "distance": 0.7704182863235474
  },
  {
    "key": "doc_12_de",
    "metadata": {
      "language": "de",
      "doc_id": 12,
      "source_text": "Halbjahresbericht 2024 verzeichnet 12% Gewinnsteigerung durch erfolgreiche Digitalisierung und Automatisierungslösungen"
    },
    "distance": 0.7738275527954102
  }
]

The example above relies on the QueryVectors API operation for S3 Vectors, which can work well for small documents. This approach can be improved to handle large and complex enterprise documents using more sophisticated chunking and reranking techniques. Sentence boundaries can be used to create document chunks that preserve semantic coherence, and the document chunks are then used to generate embeddings.
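As a minimal sketch of such sentence-boundary chunking (our own illustration, not part of the original example; the sample text and chunk size are arbitrary):

import re

def chunk_by_sentences(text, max_sentences_per_chunk=5):
    # Naive sentence split on ., ! and ? followed by whitespace; swap in a proper
    # sentence tokenizer for production-grade documents.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i in range(0, len(sentences), max_sentences_per_chunk):
        chunks.append(" ".join(sentences[i:i + max_sentences_per_chunk]))
    return chunks

# Each chunk would then be embedded with Embed 4 and stored in the S3 vector index
chunks = chunk_by_sentences("Revenue grew 15% year-over-year. Operating margin improved. Full-year guidance was raised.")

The following API call passes the same query to the Strands agent: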

response = agent("Compare earnings growth rates mentioned in the documents")

The Strands agent uses the search tool we defined to generate an answer for the query about comparing earnings growth rates. The final answer considers the results returned from the search tool:

Agent Query:
I’ll search for information about earnings growth rates in the financial documents to help you compare them.
Tool #1: search
Let me search for more specific earnings growth information and financial performance metrics.
Tool #2: search
Let me search for more comprehensive financial performance data and specific earnings metrics.
Tool #3: search
Based on the financial documents I’ve searched through, here’s a comparison of the earnings growth rates mentioned:

## Key Insights:

1. **Growth Range:** The earnings growth rates span from 12% to 20%, indicating generally healthy performance
across different markets and sectors.

2. **Technology Focus:** All companies showing strong growth are heavily invested in technology sectors
(fintech, AI, cloud services, cybersecurity, automation).

3. **Geographic Diversity:** The strong performers represent different regions (Asia, Europe, North America),
suggesting broad-based growth in tech-enabled services.

4. **Growth Sustainability:** The Chinese fintech company leads with 20% net profit growth, while the others
show strong revenue growth in the 12-18% range.

The data suggests that companies with strong technology components, particularly in emerging areas like AI,
fintech, and cybersecurity, are experiencing the most robust earnings growth rates in 2024.
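
As referenced earlier, the following is a minimal sketch of sentence-boundary chunking before embedding. The regex splitter and the max_chars budget are illustrative assumptions, not part of the original example; production pipelines typically use a more robust sentence segmenter.

# Minimal sketch: sentence-boundary chunking before embedding (illustrative only).
import re

def chunk_by_sentences(text: str, max_chars: int = 1000) -> list[str]:
    """Group whole sentences into chunks so no chunk splits a sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the character budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be embedded (input_type="search_document") and written to the
# S3 vector index, in the same way the query embedding is generated in the search tool above.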

A custom tool like the S3 Vector search function used in this example is just one of many possibilities. With Strands, it is straightforward to develop and orchestrate autonomous agents, while Amazon Bedrock AgentCore serves as the managed deployment system to host and scale those Strands agents in production.
Deploy to Amazon Bedrock AgentCore
Once an agent is built and tested, it is ready to be deployed. AgentCore Runtime is a secure, serverless runtime purpose-built for deploying and scaling dynamic AI agents. Use the starter toolkit to automatically create the IAM execution role, container image, and Amazon Elastic Container Registry (Amazon ECR) repository needed to host an agent in AgentCore Runtime. You can define multiple tools available to your agent; in this example, we use the Strands agent powered by Embed 4:

# Using bedrock-agentcore<=0.1.5 and bedrock-agentcore-starter-toolkit==0.1.14
from bedrock_agentcore_starter_toolkit import Runtime
from boto3.session import Session

boto_session = Session()
region = boto_session.region_name

agentcore_runtime = Runtime()
agent_name = "search_agent"
response = agentcore_runtime.configure(
    entrypoint="example.py",  # Replace with your custom agent and tools
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region=region,
    agent_name=agent_name
)
response  # display the configure response

launch_result = agentcore_runtime.launch()
invoke_response = agentcore_runtime.invoke({"prompt": "Compare earnings growth rates mentioned in the documents"})

Clean up
To avoid incurring unnecessary costs when you’re done, empty and delete the S3 vector buckets you created, remove applications that can make requests to the Amazon Bedrock APIs, and delete the launched AgentCore Runtimes along with their associated Amazon ECR repositories.
For more information, see this documentation to delete a vector index and this documentation to delete a vector bucket, and see this step for removing resources created by the Bedrock AgentCore starter toolkit.
Conclusion
Embed 4 on Amazon Bedrock is beneficial for enterprises aiming to unlock the value of their unstructured, multimodal data. With support for up to 128,000 tokens, compressed embeddings for cost efficiency, and multilingual capabilities across 100+ languages, Embed 4 provides the scalability and precision required for enterprise search at scale.
Embed 4 has advanced capabilities that are optimized with domain specific understanding of data from regulated industries such as finance, healthcare, and manufacturing. When combined with S3 Vectors for cost-optimized storage, Strands Agents for agent orchestration, and Bedrock AgentCore for deployment, organizations can build secure, high-performing agentic workflows without the overhead of managing infrastructure. Check the full Region list for future updates.
To learn more, check out the Cohere in Amazon Bedrock product page and the Amazon Bedrock pricing page. If you’re interested in diving deeper check out the code sample and the Cohere on AWS GitHub repository.

About the authors
James Yi is a Senior AI/ML Partner Solutions Architect at AWS. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.
Nirmal Kumar is Sr. Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he steers the development of no-code and low-code ML solutions. Outside work, he enjoys travelling and reading non-fiction.
Hugo Tse is a Solutions Architect at AWS, with a focus on Generative AI and Storage solutions. He is dedicated to empowering customers to overcome challenges and unlock new business opportunities using technology. He holds a Bachelor of Arts in Economics from the University of Chicago and a Master of Science in Information Technology from Arizona State University.
Mehran Najafi, PhD, serves as AWS Principal Solutions Architect and leads the Generative AI Solution Architects team for AWS Canada. His expertise lies in ensuring the scalability, optimization, and production deployment of multi-tenant generative AI solutions for enterprise customers.
Sagar Murthy is an agentic AI GTM leader at AWS who enjoys collaborating with frontier foundation model partners, agentic frameworks, startups, and enterprise customers to evangelize AI and data innovations, open source solutions, and enable impactful partnerships and launches, while building scalable GTM motions. Sagar brings a blend of technical solution and business acumen, holding a BE in Electronics Engineering from the University of Mumbai, MS in Computer Science from Rochester Institute of Technology, and an MBA from UCLA Anderson School of Management.
Payal Singh is a Solutions Architect at Cohere with over 15 years of cross-domain expertise in DevOps, Cloud, Security, SDN, Data Center Architecture, and Virtualization. She drives partnerships at Cohere and helps customers with complex GenAI solution integrations.

A guide to building AI agents in GxP environments

Healthcare and life sciences organizations are transforming drug discovery, medical devices, and patient care with generative AI agents. In regulated industries, any system that impacts product quality or patient safety must comply with GxP (Good Practice) regulations, such as Good Clinical Practice (GCP), Good Laboratory Practice (GLP), and Good Manufacturing Practice (GMP). Organizations must demonstrate to regulatory authorities that their AI agents are safe, effective, and meet quality standards. Building AI agents for these GxP environments requires a strategic approach that balances innovation, speed, and regulatory requirements.
AI agents can be built for GxP environments: The key lies in understanding how to build them appropriately based on their risk profiles. Gen AI introduces unique challenges around explainability, probabilistic outputs, and continuous learning that require thoughtful risk assessment rather than blanket validation approaches. The disconnect between traditional GxP compliance methods and modern AI capabilities creates barriers to implementation, increases validation costs, slows innovation speed, and limits the potential benefits for product quality and patient care.
The regulatory landscape for GxP compliance is evolving to address the unique characteristics of AI. Traditional Computer System Validation (CSV) approaches, often with uniform validation strategies, are being supplemented by Computer Software Assurance (CSA) frameworks that emphasize flexible risk-based validation methods tailored to each system’s actual impact and complexity (FDA latest guidance).
In this post, we cover a risk-based implementation, practical implementation considerations across different risk levels, the AWS shared responsibility model for compliance, and concrete examples of risk mitigation strategies.
Risk-based implementation framework
Effective GxP compliance for agentic AI systems requires assessing risk based on operational context rather than technology features alone. To support risk classification, the FDA’s CSA Draft Guidance recommends evaluating intended uses across three factors: severity of potential harm, probability of occurrence, and detectability of failures.
In Figure 1, this assessment model combines traditional operational roles with modern risk-based levels. Organizations should assess how AI agents function within workflows and their potential impact on regulated processes.

Figure 1. GxP compliance for AI agents combines traditional role-based levels with CSA’s modern risk-based levels

The same AI agent capability can warrant dramatically different validation approaches depending on how it is deployed. How is the agentic AI being consumed within existing GxP processes? What is the level of human oversight or human-in-the-loop control? Is the AI agent itself being added as an additional control? What is the potential impact of AI failures on product quality, data integrity, or patient safety?
Consider an AI agent for scientific literature review. When creating literature summaries for internal team meetings, it presents low risk, requiring minimal controls. When scientists use these insights to guide research direction, it becomes medium risk, needing structured controls, such as human review checkpoints. When supporting regulatory submissions for drug approval, it becomes high risk and requires comprehensive controls because outputs directly impact regulatory decisions and patient safety.
This risk-based methodology allows organizations to balance innovation with compliance by tailoring validation efforts to actual risk levels rather than applying uniform controls across all AI implementations.
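
To make the idea concrete, the following is an illustrative sketch only of how the three CSA factors and the resulting control tiers from the literature-review example could be encoded. The scales, weights, thresholds, and control lists are assumptions for illustration, not regulatory guidance.

# Illustrative only: one way to encode the risk-based mapping described above.
from dataclasses import dataclass

@dataclass
class UseCaseAssessment:
    severity: int       # 1 (negligible harm) .. 5 (patient safety impact) -- assumed scale
    probability: int    # 1 (rare) .. 5 (frequent) -- assumed scale
    detectability: int  # 1 (easily detected) .. 5 (hard to detect) -- assumed scale

def classify_risk(a: UseCaseAssessment) -> str:
    """Combine the three CSA factors into a coarse risk level (assumed thresholds)."""
    score = a.severity * a.probability * a.detectability  # simple risk priority number
    if score <= 12:
        return "low"       # e.g., literature summaries for internal team meetings
    if score <= 48:
        return "medium"    # e.g., insights guiding research direction
    return "high"          # e.g., content supporting regulatory submissions

CONTROLS = {
    "low": ["basic CloudTrail logging", "periodic user feedback review"],
    "medium": ["human review checkpoints", "detailed AgentCore traces"],
    "high": ["traditional IQ/OQ/PQ testing", "complete provenance chains", "mandatory expert sign-off"],
}

level = classify_risk(UseCaseAssessment(severity=5, probability=3, detectability=4))
print(level, CONTROLS[level])  # -> high, with the corresponding controls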
Implementation considerations
Successful AI agent designs require common controls that apply consistently across risk levels for quality and safety. Organizations should maintain clear records of AI decisions, prove data has not been altered, reproduce results when needed, and manage system updates safely. AWS supports these requirements through qualified infrastructure and various compliance certifications such as ISO, SOC, and NIST. For a more complete list, see our Healthcare & Life Sciences Compliance page. Detailed compliance validation information for Amazon Bedrock AgentCore is available in the compliance documentation. To implement these controls effectively, organizations can refer to the National Institute of Standards and Technology (NIST) AI Risk Management Framework for AI-risk guidance and ALCOA+ principles to promote data integrity.
Shared responsibility model
Successful generative AI cloud implementations in GxP environments require understanding the division of responsibilities between customers and AWS, as outlined in the Shared Responsibility Model, so organizations can focus on delivering effective, compliance-aligned solutions.
While AWS helps protect the infrastructure that runs the services offered in the AWS Cloud, Table 1 provides practical examples of how AWS can support customers in validating their agentic AI systems.

Focus
Customer responsibilities
How AWS supports

Validation strategy
Design risk-appropriate validation approaches using AWS services for GxP compliance. Establish acceptance criteria and validation protocols based on intended use.
Inherit compliance controls with AWS services such as Amazon Bedrock’s ISO 27001, SOC 1/2/3, FedRAMP, and GDPR/HIPAA eligibility. Support your GxP training requirements through AWS Skill Builder for artificial intelligence and machine learning (AI/ML) and AWS Certified Machine Learning – Specialty. Use infrastructure as code through AWS CloudFormation to support on demand validations and deployments that provide repeatable IQ for your agentic workloads.

GxP procedures
Develop SOPs that integrate AWS capabilities with existing quality management systems. Establish documented procedures for system operation and maintenance.
Build GxP agentic systems with HCLS Landing Zones, which are designed to align with highly regulated workloads and can augment and support your standard procedure requirements. Augment risk management procedures with Amazon Bedrock AgentCore, which supports end-to-end visibility and runtime requirements for complex multi-step tasks. Use the AWS Certified SysOps Administrator and AWS Certified DevOps Engineer certifications for training requirements and to make sure teams can operationalize and govern procedural compliance on AWS.

User management
Configure IAM roles and permissions aligned with GxP user access requirements. Maintain user access documentation and training records.
Secure AI agents access with AWS IAM and Amazon Bedrock AgentCore Identity to establish fine-grained permissions and enterprise identity integration and use IAM Identity Center to streamline workforce user access.

Performance criteria
Define acceptance criteria and monitoring thresholds for gen AI applications. Establish performance monitoring protocols.
Use Amazon Bedrock Provisioned Throughput for agentic workflows that require consistent, guaranteed performance. Monitor performance with Amazon Bedrock AgentCore Observability and Amazon CloudWatch, with customizable alerts and dashboards for end-to-end visibility.

Documentation
Create validation documentation demonstrating how AWS services support GxP compliance. Maintain quality system records.
Use AWS Config to help generate compliance reports of your agentic deployments with conformance packs for HIPAA, 21 CFR Part 11, and GxP EU Annex 11. Store your GxP data in Amazon Simple Storage Service (Amazon S3), which offers enterprise-grade 11 nines of durability with support for versioning and user-defined retention policies.

Provenance
Monitor model versions while maintaining validated snapshots. Version-control prompt templates to facilitate consistent AI interactions, track changes, and maintain records for audit trails. Lock tool dependencies in validated environments.
Control models and data with Amazon Bedrock configurable data residency and immutable model versioning. AWS Config provides automated configuration tracking and validation, and AWS CloudTrail captures comprehensive audit logging. Achieve reproducible AI behavior using model versioning with AWS CodePipeline, AWS CodeCommit, and Amazon Bedrock.

The following is an example of what customers might need to implement and what AWS provides when building AI agents (Figure 2):

Figure 2. Gen AI implementation in GxP environments requires understanding the division of responsibilities between customers and AWS.

Let’s demonstrate how these shared responsibilities translate into actual implementation.
Provenance and reproducibility
AWS supports the following:

Amazon Bedrock – Provides immutable model versioning, facilitating reproducible AI behavior across the system lifecycle.
AWS Config – Automatically tracks and validates system configurations, continuously monitoring for drift from validated baselines.
AWS CloudTrail – Generates audit trails with cryptographic integrity, capturing model invocations with complete metadata including timestamps, user identities, and model versions. Infrastructure as Code support through AWS CloudFormation enables version-controlled, repeatable deployments.

Customer responsibility: Organizations must version-control their infrastructure deployments and their prompt templates to make sure AI behavior is consistent, and maintain audit trails of prompt changes. Tool dependencies must be tracked and locked to specific versions in validated environments to help prevent unintended updates that could affect AI outputs.
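
A minimal sketch of one way to satisfy this responsibility follows: each prompt template version is stored under a content-hash key together with an audit record. The bucket name, key layout, and template name are illustrative assumptions, not a prescribed implementation.

# Minimal sketch: version-controlling a prompt template and recording an audit trail entry.
import datetime
import hashlib
import json

import boto3

s3 = boto3.client("s3")

def register_prompt_version(template_text: str,
                            bucket: str = "gxp-prompt-templates",   # assumed bucket name
                            name: str = "literature-review-agent") -> str:
    """Store the template under a content-hash key so each version is immutable and traceable."""
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()[:16]
    s3.put_object(Bucket=bucket, Key=f"{name}/{digest}.txt",
                  Body=template_text.encode("utf-8"))

    # Audit record kept alongside the template; S3 versioning or Object Lock can be
    # enabled on the bucket to prevent tampering with historical versions.
    audit = {
        "template": name,
        "version": digest,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    s3.put_object(Bucket=bucket, Key=f"{name}/audit/{digest}.json",
                  Body=json.dumps(audit).encode("utf-8"))
    return digest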
Observability and performance metrics
AWS supports the following:

Amazon Bedrock AgentCore – Provides a comprehensive solution for the unique risks that agentic AI introduces, including end-to-end visibility into complex multi-step agent tasks and runtime requirements for orchestrating reasoning chains. Amazon Bedrock AgentCore Observability captures the complete chain of decisions and tool invocations, so that you can inspect an agent’s execution path, audit intermediate outputs, and investigate failures. The Retrieve API for Amazon Bedrock Knowledge Bases enables traceability from retrieved documents to AI-generated outputs.
Amazon CloudWatch – Delivers real-time monitoring with customizable alerts and dashboards, aggregating performance metrics across the agent invocations. Organizations can configure logging levels based on risk, such as basic CloudTrail logging for low-risk applications, detailed AgentCore traces for medium risk, and complete provenance chains for high-risk regulatory submissions.

Customer responsibility: Organizations define acceptance criteria and monitoring thresholds appropriate to their risk level—for example, citation accuracy requirements for our literature review agent. Teams must decide when human-in-the-loop triggers are required, such as mandatory expert review before AI recommendations influence research decisions or regulatory submissions.
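
As a hedged sketch of how such an acceptance criterion could be monitored, the snippet below publishes a custom citation-accuracy metric and an alarm that can notify reviewers when the threshold is breached. The namespace, metric name, threshold, and SNS topic ARN are assumptions for illustration.

# Minimal sketch: publishing an acceptance metric and alerting on it with CloudWatch.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_citation_accuracy(accuracy_percent: float) -> None:
    # Custom metric reported after each evaluation run of the literature review agent
    cloudwatch.put_metric_data(
        Namespace="GxP/LiteratureReviewAgent",          # assumed namespace
        MetricData=[{"MetricName": "CitationAccuracy",
                     "Value": accuracy_percent,
                     "Unit": "Percent"}],
    )

def create_accuracy_alarm(threshold: float = 95.0) -> None:
    # Breaching the threshold can trigger a human-in-the-loop review via SNS
    cloudwatch.put_metric_alarm(
        AlarmName="citation-accuracy-below-threshold",
        Namespace="GxP/LiteratureReviewAgent",
        MetricName="CitationAccuracy",
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=threshold,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:gxp-review-alerts"],  # assumed topic ARN
    )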
User management, session isolation, and security
AWS supports the following:

Amazon Bedrock AgentCore – Provides session isolation using dedicated microVMs that help prevent cross-contamination between different projects or regulatory submissions. The service supports VPC endpoints to establish private connections between your Amazon VPC and Amazon Bedrock AgentCore resources, keeping inter-network traffic private. All communication with Amazon Bedrock AgentCore endpoints uses HTTPS exclusively across all supported Regions, with no HTTP support, and all API requests are digitally signed for authentication and integrity.

Amazon Bedrock AgentCore maintains robust encryption standards with TLS 1.2 minimum requirements (TLS 1.3 recommended) for all API endpoints. Both control plane and data plane traffic are encrypted with TLS protocols and restricted to minimum TLS 1.2 with no unencrypted communication permitted. Amazon Bedrock AgentCore Identity addresses identity complexity with a secure token vault for credentials management, providing fine-grained permissions and enterprise identity integration.
AWS Identity and Access Management (IAM) enables organizations to configure role-based access controls with least-privilege principles. Built-in encryption facilitates data protection both in transit and at rest, while network isolation and compliance certifications (SOC, ISO 27001, HIPAA) support regulatory requirements. Amazon Bedrock offers configurable data residency, allowing organizations to specify regions for data processing.
Customer responsibility: Organizations configure IAM roles and policies aligned with GxP user access requirements, facilitating least-privilege access and proper segregation of duties. Access controls must be documented and maintained as part of the quality management system.
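
As a minimal, hedged sketch of least-privilege access, the following policy allows an agent role to invoke only one approved model. The account ID, Region, model ID, and policy name are assumptions for illustration.

# Minimal sketch: a least-privilege IAM policy scoped to a single approved model.
import json

import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeApprovedModelOnly",
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
            # Assumed Region and model; restrict to the validated model version only
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/cohere.embed-v4:0",
        }
    ],
}

iam.create_policy(
    PolicyName="gxp-agent-invoke-approved-model",   # assumed policy name
    PolicyDocument=json.dumps(policy_document),
)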
GxP controls for AI agents
The implementation of GxP risk controls for AI agents can be considered through three key phases.
Risk Assessment evaluates the GxP workload against the organization’s risk-based validation framework. Continual quality assurance is maintained through structured feedback loops, ranging from real-time verification (see Continuous Validation) to bi-annual reviews. This process makes sure reviewers stay trained on the evolving AI landscape, adapt to user feedback, and apply appropriate intervention criteria. In practice, risk assessments define risk categories and triggers for reassessment.
Control Selection means carefully selecting the minimum required controls based on (1) the risk classification, (2) the specific design attributes, and (3) the operational context of the AI agents. This targeted, risk-adjusted approach makes sure controls align with both technical requirements and compliance objectives. In practice, risk categories drive required and selectable controls. For example, a medium-risk system might require Agent and Prompt Governance controls along with two or more Detective Controls, while a high-risk system might require Traditional Testing (IQ, OQ, PQ) controls and two additional corrective controls.
Continuous Validation is an approach that includes the traditional fit-for-intended-use validation and a subsequent process that uses real-world data (RWD), such as operational logs and user feedback, to create supplemental real-world evidence (RWE) that the system maintains a validated state. As a control mechanism itself, the Continuous Validation approach helps address modern cloud-based designs, including SaaS models, model drift, and evolving cloud infrastructure. Through ongoing monitoring of performance and functionality, this approach helps maintain GxP compliance while supporting regulatory inspections. In practice, this ranges from a compliance-aligned portal that tracks user issue trends for low-risk categories to high-risk systems that incorporate periodic self-tests with compliance reports.
The following table provides examples of Preventive, Corrective, and Detective Controls for agentic AI systems that could be incorporated in a modern GxP validation framework.

Control element
Supporting AWS services

Preventive Controls

Agent Behavior Specification
Use Amazon Bedrock Model Catalog to find the models that help meet your specific requirements and use AWS service quotas (limits) and documentation on service features to define supported and verifiable agent capabilities.

Threat Modeling
Use AWS Well-Architected Framework (Security Pillar) tools and AWS service security documentation to proactively identify AI-specific threats like Prompt Injection, Data Poisoning, and Model Inversion, and help design preventive mitigations using AWS services.

Response Content and Relevance Control
Use Amazon Bedrock Guardrails to implement real-time safety policies for large language models (LLMs) to deny harmful inputs or responses. Guardrails can also define denylists and filter for PII. Use Amazon Bedrock Knowledge Bases or AWS purpose-built vector databases for RAG to provide controlled, current, and relevant information to help prevent factual drift.

Bias Mitigation in Datasets
Amazon SageMaker Clarify provides tools to run pre-training bias analysis of your datasets. For agents, this helps make sure the foundational data doesn’t lead to biased decision-making paths or tool usage.

Agent & Prompt Governance
Amazon Bedrock agents and prompt management features support lifecycle processes including creation, evaluation, versioning, and optimization. The features also support advanced prompt templates, content filters, automated reasoning checks, and integration with Amazon Bedrock Flows for more secure and controlled agentic workflows.

Configuration Management
AWS provides an industry-leading suite of configuration management services, such as AWS Config and AWS Audit Manager, which can be used to continuously validate agentic GxP system configurations. Amazon SageMaker Model Registry manages and versions trained machine learning (ML) models for controlled deployments.

Secure AI Development
Amazon Q Developer and Kiro provide AI-powered code assistance that incorporates security best practices and AWS Well-Architected principles for building and maintaining agentic workloads securely from the start.

AI Agents as Secondary Controls
Use Amazon Bedrock AgentCore and your data to quickly incorporate AI agents into existing GxP workflows as secondary preventative controls to add capabilities like trend analysis, automated inspections, and systems flow analysis that can trigger preventative workflow events.

Detective Controls

Traditional Testing (IQ, OQ, PQ)
Use AWS Config and AWS CloudFormation for IQ validation by tracking resource deployment configurations. Use AWS CloudTrail and AWS CloudWatch for sourcing events, metrics, and log test results for OQ/PQ validation.

Explainability Audits & Trajectory Reviews
Amazon SageMaker Clarify generates explainability reports for custom models. Amazon Bedrock invocation logs can be used to review reasoning or chain of thought to find flaws in an agent’s logic. Use Amazon Bedrock AgentCore Observability to review agent invocation sessions, traces, and spans.

Model & I/O Drift Detection
For custom models, Amazon SageMaker Model Monitor can detect drift in data and model quality. For AI agents using commercial LLMs, use Amazon Bedrock AgentCore Observability to design monitoring of inputs (prompts) and outputs (responses) to detect concept drift. Use Amazon CloudWatch alarms to manage compliance notifications.

Performance Monitoring
Agentic workloads can use Amazon Bedrock metrics, AgentCore Observability, and Amazon CloudWatch metrics to monitor token usage, cost per interaction, and tool execution latency, and to detect performance and cost anomalies.

Log and Event Monitoring (SIEM)
For agentic workloads, Amazon GuardDuty provides intelligent threat detection that analyzes Amazon Bedrock API calls to detect anomalous or potentially malicious use of agents or LLMs.

Code & Model Risk Scanning
Amazon CodeGuru and Amazon Inspector scan agent code and the operational environment for vulnerabilities. These tools can’t assess model weights for risk; however, AWS provides Amazon SageMaker Model Cards, which can be used to build model risk scanning controls.

Adversarial Testing (Red Teaming) & Critic/Grader Model
The evaluation tools of Amazon Bedrock help assess model fitness. Amazon Bedrock supports leading model providers allowing GxP systems to use multiple models for secondary and tertiary validation.

Internal Audits
AWS Audit Manager automates the collection of evidence for compliance and audits and AWS CloudTrail provides a streamlined way to review agent actions and facilitate procedural adherence.

Corrective Controls

Model & Prompt Rollback
Use AWS CodePipeline and AWS CloudFormation to quickly revert to a previous, known-good version of a model or Prompt Template when a problem is detected.

System Fallback
AWS Step Functions can help orchestrate a fallback to a streamlined, more constrained model or a human-only workflow if the primary agent fails.

Human-in-the-Loop & Escalation Management
AWS Step Functions, Amazon Simple Notification Service (SNS) and Amazon Bedrock Prompt Flow can orchestrate workflows that can pause and wait for human approval, including dynamic approvals based on low agent confidence scores or detected anomalies.

CAPA Process
AWS Systems Manager OpsCenter provides a central place to manage operational issues, which can be used to track the root cause analysis of an agent’s failure.

Incident Response Plan
AWS Security Hub and AWS Systems Manager Incident Manager can automate response plans for AI security incidents (for example, major jailbreak and data leakage) and provide a central dashboard to manage them.

Disaster Recovery Plan (DRP)
AWS Elastic Disaster Recovery (AWS DRS) and AWS Backup provide tools to replicate and recover the entire AI application stack, including deploying to different AWS Regions.

Conclusion
Healthcare and life sciences organizations can build GxP-compliant AI agents by adopting a risk-based framework that balances innovation with regulatory requirements. Success requires proper risk classification, scaled controls matching system impact, and understanding the AWS shared responsibility model. AWS provides qualified infrastructure and comprehensive services, while organizations configure appropriate controls, maintain version management, and implement risk mitigation strategies tailored to their validation needs.
We encourage organizations to explore building GxP-compliant AI agents with AWS services. For more information about implementing compliance-aligned AI systems in regulated environments, contact your AWS account team or visit our Healthcare and Life Sciences Solutions page.

About the authors
Pierre de Malliard is a Senior AI/ML Solutions Architect at Amazon Web Services and supports customers in the Healthcare and Life Sciences Industry.
Ian Sutcliffe is a Global Solution Architect with 25+ years of experience in IT, primarily in the Life Sciences Industry. A thought leader in the area of regulated cloud computing, one of his areas of focus is IT operating models and process optimization and automation with the intent of helping customers become Regulated Cloud Natives
Kristin Ambrosini is a Generative AI Specialist at Amazon Web Services. She drives adoption of scalable GenAI solutions across healthcare and life sciences to transform drug discovery and improve patient outcomes. Kristin blends scientific expertise, technical acumen, and business strategy. She holds a Ph.D. in Biological Sciences.
Ben Xavier is a MedTech Specialist with over 25 years of experience in Medical Device R&D. He is a passionate leader focused on modernizing the MedTech industry through technology and best practices to accelerate innovation and improve patient outcomes.

Moonshot AI Releases Kosong: The LLM Abstraction Layer that Powers Kim …

Modern agentic applications rarely talk to a single model or a single tool, so how do you keep that stack maintainable when providers, models, and tools keep changing every few weeks? Moonshot AI’s Kosong targets this problem as an LLM abstraction layer for agent applications. Kosong unifies message structures, asynchronous tool orchestration, and pluggable chat providers so teams can build agents without hard-wiring business logic to a single API. It is also the layer that powers Moonshot’s Kimi CLI.

What does Kosong provide?

Kosong is a Python library that sits between your agent logic and LLM providers. It is positioned as an LLM abstraction layer for modern agent applications, and the repository shows example code that uses a Kimi chat provider together with the high-level helper functions generate and step.

The public API surface is intentionally kept small. At the top level you import kosong.generate, kosong.step and the result types GenerateResult and StepResult. Supporting modules define chat_provider, message, tooling, and tooling.simple. These modules wrap provider specific streaming formats, token accounting and tool calls behind one consistent interface.

ChatProvider and message model

The core integration point is the ChatProvider abstraction. The Moonshot team shows a provider implementation for Kimi in kosong.chat_provider.kimi. A Kimi object is initialized with base_url, api_key, and the model name, for example kimi-k2-turbo-preview. This provider is then passed into kosong.generate or kosong.step together with a system prompt, tools, and a message history.

Messages are represented by the Message class from kosong.message. In the examples, a message is constructed with a role, such as “user”, and a content argument. The type of content is documented as either a string or a list of content parts, which lets the library support richer multimodal payloads while keeping the basic chat example simple for new users.

Kosong also exposes a streaming unit StreamedMessagePart via kosong.chat_provider. Provider implementations emit these parts during generation, and the library merges them into the final Message. The optional TokenUsage structure tracks token counts in a provider independent way, which is then attached to the result objects for logging and monitoring.

Tooling, Toolset and SimpleToolset

Most agent stacks need tools such as search, code execution or database calls. Kosong models this through the tooling module. The example in the GitHub repo defines a tool by subclassing CallableTool2 with a Pydantic parameter model. The example AddTool sets name, description and params, and implements __call__ to return a ToolOk value which is a valid ToolReturnType.

Tools are registered in a SimpleToolset from kosong.tooling.simple. In the example, a SimpleToolset is instantiated and then augmented with the AddTool instance using the += operator. This toolset is passed into kosong.step, not into generate. The toolset is responsible for resolving tool calls from the model and routing them to the correct async function, while step manages the orchestration around a single conversational turn.

generate for single shot completion

The generate function is the entry point for plain chat completion. You provide the chat_provider, a system_prompt, an explicit list of tools, which can be empty, and a history of Message objects. The Kimi example shows a minimal usage pattern where a single user message is passed as history and tools=[].

generate supports streaming through an on_message_part callback. In the GitHub repo, the research team illustrates this by defining a simple output function that prints each StreamedMessagePart. After streaming is complete, generate returns a GenerateResult that contains the merged assistant message and an optional usage structure with token counts. This pattern lets applications both display incremental output and still work with a clean final message object.
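
The sketch below reconstructs this generate flow from the usage pattern described above. It is an approximation only: the exact import paths, parameter names, and whether generate is awaited may differ from the actual library, so treat every identifier not named in the text as an assumption.

# Sketch of the generate flow described above; signatures are approximations, not the library's API.
import asyncio
import os

import kosong
from kosong.chat_provider.kimi import Kimi
from kosong.message import Message

async def main() -> None:
    provider = Kimi(
        base_url=os.environ["KIMI_BASE_URL"],
        api_key=os.environ["KIMI_API_KEY"],
        model="kimi-k2-turbo-preview",   # model parameter name is assumed
    )

    def output(part):
        # StreamedMessagePart callback for incremental display
        print(part, end="", flush=True)

    result = await kosong.generate(
        chat_provider=provider,
        system_prompt="You are a helpful assistant.",
        tools=[],                                            # plain chat, no tools
        history=[Message(role="user", content="Hello, Kosong!")],
        on_message_part=output,
    )
    print(result.message)   # merged assistant Message
    print(result.usage)     # optional TokenUsage, if the provider reports it

asyncio.run(main())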

step for tool using agents

For tool-using agents, Kosong exposes the step function. The example in the GitHub repo shows kosong.step being called with a Kimi provider, a SimpleToolset that contains AddTool, a system prompt, and a user history that instructs the model to call the add tool.

step returns a StepResult. The example prints result.message and then awaits result.tool_results(). This method collects all tool outputs produced during the step and returns them to the caller. The orchestration of tool calls, including argument parsing into the Pydantic parameter model and conversion into ToolReturnType results, is handled inside Kosong so agent authors do not have to implement their own dispatch loop for each provider.

Built in demo and relationship with Kimi CLI

Kosong ships with a built-in demo agent that can be run locally. The GitHub README documents the environment variables KIMI_BASE_URL and KIMI_API_KEY, and shows a launch command using uv run python -m kosong kimi --with-bash. This demo uses Kimi as the chat provider and exposes a terminal agent that can call tools, including shell commands when the --with-bash option is enabled.

Key Takeaways

Kosong is an LLM abstraction layer from Moonshot AI that unifies message structures, asynchronous tool orchestration and pluggable chat providers for agent applications.

The library exposes a small core API, generate for plain chat and step for tool using agents, backed by abstractions such as ChatProvider, Message, Tool, Toolset and SimpleToolset.

Kosong currently ships a Kimi chat provider targeting the Moonshot AI API, and defines the ChatProvider interface so teams can plug in additional backends without changing agent logic.

Tool definitions use Pydantic parameter models and ToolReturnType results, which lets Kosong handle argument parsing, validation and orchestration of tool calls inside step.

Kosong powers Moonshot’s Kimi CLI, providing the underlying LLM abstraction layer while Kimi CLI focuses on the command line agent experience that can target Kimi and other backends.

Editorial Comments

Kosong looks like a pragmatic move from Moonshot AI: it cleanly separates agent logic from LLM and tool backends while keeping the surface area small for early developers. By centering everything on ChatProvider, Message, and Toolset, it gives Kimi CLI and other stacks a consistent way to evolve models and tooling without rewriting orchestration. For teams building long-term agent systems, Kosong could be the right kind of minimal infrastructure.


Gelato-30B-A3B: A State-of-the-Art Grounding Model for GUI Computer-Us …

How do we teach AI agents to reliably find and click the exact on screen element we mean when we give them a simple instruction? A team of researchers from ML Foundations has introduced Gelato-30B-A3B, a state of the art grounding model for graphical user interfaces that is designed to plug into computer use agents and convert natural language instructions into reliable click locations. The model is trained on the Click 100k dataset and reaches 63.88% accuracy on ScreenSpot Pro and 69.15% on OS-World-G, with 74.65% on OS-World-G Refined. It surpasses GTA1-32B and larger vision language models such as Qwen3-VL-235B-A22B-Instruct.

(Image source: https://github.com/mlfoundations/Gelato)

What does Gelato-30B-A3B do in an agent stack?

Gelato-30B-A3B is a 31B-parameter model that fine-tunes the mixture-of-experts model Qwen3-VL-30B-A3B-Instruct. It takes a screenshot and a textual instruction as input and produces a single click coordinate as output.

The model is positioned as a modular grounding component. A planner model, for example GPT-5 in the Gelato experiments, decides the next high-level action and calls Gelato to resolve that step into a concrete click on the screen. This separation between planning and grounding is important when an agent must operate across many operating systems and applications with different layouts.


Click 100k, A Targeted Dataset For GUI Grounding

Click 100k is the dataset that underlies Gelato. It pairs computer screen images with natural language instructions, bounding boxes for the target element, image dimensions, and normalized bounding boxes. Each sample is set up as a low level command, for example ‘tap on the element between Background and Notifications options’ with a precise region.

The dataset is built by filtering and unifying multiple public sources. The list includes ShowUI, AutoGUI, PC Agent E, WaveUI, OS Atlas, UGround, PixMo Points, SeeClick, UI VISION, a JEDI subset that focuses on spreadsheet and text cell manipulation, and videos from 85 professional application tutorials annotated with Claude-4-Sonnet. Each source contributes at most 50k samples, and all sources are mapped into a shared schema with images, instructions, bounding boxes, and normalized coordinates.

The research team then runs an aggressive filtering pipeline. OmniParser discards clicks that do not land on detected interface elements. Qwen2.5-7B-VL and SE-GUI-3B remove trivial examples, such as easy hyperlink clicks. GTA1-7B-2507 and UI-Venus-7B remove samples where the instruction and click region do not match. A Qwen2.5-7B-VL baseline trained on a balanced 10k subset shows that this combination gives a +9 pp accuracy gain on ScreenSpot Pro compared with training on unfiltered data.

Professional application coverage is a specific focus. Click 100k adds data from UI VISION and the JEDI subset, and then augments this with 80+ tutorial videos for real desktop tools. Claude 4 Sonnet generates bounding boxes and low level instructions for these videos, followed by manual inspection and corrections.


GRPO Training On Top Of Qwen3 VL

On the training side, Gelato-30B-A3B uses GRPO, a reinforcement learning algorithm that derives from work on DeepSeekMath and similar systems. The research team follows the DAPO setup: they remove the KL divergence term from the objective, set the upper clipping threshold to 0.28, and skip rollouts with zero advantage. Rewards are sparse and are only given when the predicted click falls inside the target bounding box, similar to the GTA1 recipe.
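
The sparse reward just described is simple to state in code. The sketch below is illustrative only; the coordinate convention (normalized x, y and a min/max bounding box) is an assumption, not the paper's exact formulation.

# Illustrative sketch of the sparse grounding reward: 1 only if the click lands inside the box.
def click_reward(pred_x: float, pred_y: float,
                 bbox: tuple[float, float, float, float]) -> float:
    """bbox = (x_min, y_min, x_max, y_max) in the same (e.g., normalized) coordinates as the click."""
    x_min, y_min, x_max, y_max = bbox
    inside = (x_min <= pred_x <= x_max) and (y_min <= pred_y <= y_max)
    return 1.0 if inside else 0.0

# A click at (0.42, 0.55) inside a box spanning (0.40, 0.50)-(0.48, 0.60) earns reward 1.0
print(click_reward(0.42, 0.55, (0.40, 0.50, 0.48, 0.60)))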


They initialize from Qwen3-VL-30B-A3B-Instruct and run 100 RL steps on 32 A100 GPUs with 40 GB of memory. The best checkpoint appears at step 84, chosen by the mean performance across ScreenSpot Pro, OS World G, and OS World G Refined. At this point the model reaches 63.88% on ScreenSpot Pro, and 67.19% and 73.40% on OS World G and OS World G Refined. A simple refusal prompting strategy, which appends an instruction to answer with a refusal when the element cannot be found, raises the OS World G scores to 69.15% and 74.65%.

End To End Agent Results On OS World

To test Gelato beyond static grounding benchmarks, the research team plugs it into the GTA1.5 agent framework and runs full computer-use agents in the OS World environment. In this setup, GPT-5 acts as the planner, Gelato-30B-A3B provides grounding, the agent has at most 50 steps, and it waits 3 seconds between actions.

The research team reports three runs per model on a fixed OS World snapshot. Gelato-30B-A3B reaches a 58.71% automated success rate with a small standard deviation, compared with 56.97% for GTA1-32B in the same harness. Because the automatic OS World evaluation misses some valid solutions, the team also ran a human evaluation on 20 problematic tasks. Under human scoring, Gelato reaches 61.85% success, while GTA1-32B reaches 59.47%.

Key Takeaways

Gelato-30B-A3B is a Qwen3-VL-30B-A3B Instruct based mixture of experts model that performs state of the art GUI grounding on ScreenSpot Pro and OS World G benchmarks, surpassing GTA1-32B and larger VLMs such as Qwen3-VL-235B-A22B-Instruct.

The model is trained on Click 100k, a curated grounding dataset that merges and filters multiple public GUI datasets and professional application traces, pairing real screens with low level natural language commands and precise click coordinates.

Gelato-30B-A3B uses a GRPO reinforcement learning recipe on top of Qwen3-VL, with sparse rewards that only trigger when the predicted click lies inside the ground truth bounding box, which significantly boosts grounding accuracy over supervised baselines.

When integrated into an agent framework with GPT-5 acting as the planner, Gelato-30B-A3B improves success rates on OS World computer use tasks compared with GTA1-32B, demonstrating that better grounding directly translates into stronger end to end agent performance.

Editorial Comments

Gelato-30B-A3B is an important step for grounded computer use because it shows that a Qwen3-VL based MoE model, trained on a carefully filtered Click 100k dataset, can beat both GTA1-32B and much larger VLMs like Qwen3-VL-235B-A22B Instruct on ScreenSpot Pro and OS-World-G while staying accessible through Hugging Face. Overall, Gelato-30B-A3B establishes a clear new baseline for open computer grounding models.


Comparing Memory Systems for LLM Agents: Vector, Graph, and Event Logs

Table of contents
High-Level Comparison
1. Vector Memory Systems
  1.1 Plain Vector RAG
  1.2 Tiered Vector Memory (MemGPT-Style Virtual Context)
2. Graph Memory Systems
  2.1 Temporal Knowledge Graph Memory (Zep / Graphiti)
  2.2 Knowledge-Graph RAG (GraphRAG)
3. Event and Execution Log Systems
  3.1 Execution Logs and Checkpoints (ALAS, LangGraph)
  3.2 Episodic Long-Term Memory
Key Takeaways

Reliable multi-agent systems are mostly a memory design problem. Once agents call tools, collaborate, and run long workflows, you need explicit mechanisms for what gets stored, how it is retrieved, and how the system behaves when memory is wrong or missing.

This article compares 6 memory system patterns commonly used in agent stacks, grouped into 3 families:

Vector memory

Graph memory

Event / execution logs

We focus on retrieval latency, hit rate, and failure modes in multi-agent planning.

High-Level Comparison

Vector – Plain vector RAG
Data model: embedding vectors. Strengths: simple, fast ANN retrieval, widely supported. Main weaknesses: loses temporal / structural context, semantic drift.

Vector – Tiered vector (MemGPT-style virtual context)
Data model: working set + vector archive. Strengths: better reuse of important info, bounded context size. Main weaknesses: paging policy errors, per-agent divergence.

Graph – Temporal KG memory (Zep / Graphiti)
Data model: temporal knowledge graph. Strengths: strong temporal and cross-session reasoning, shared view. Main weaknesses: requires schema + update pipeline, can have stale edges.

Graph – Knowledge-graph RAG (GraphRAG)
Data model: KG + hierarchical communities. Strengths: multi-doc, multi-hop questions, global summaries. Main weaknesses: graph construction and summarization bias, traceability overhead.

Event / Logs – Execution logs / checkpoints (ALAS, LangGraph)
Data model: ordered versioned log. Strengths: ground truth of actions, supports replay and repair. Main weaknesses: log bloat, missing instrumentation, side-effect-safe replay required.

Event / Logs – Episodic long-term memory
Data model: episodes + metadata. Strengths: long-horizon recall, pattern reuse across tasks. Main weaknesses: episode boundary errors, consolidation errors, cross-agent misalignment.

Next, we go system family by system family.

1. Vector Memory Systems

1.1 Plain Vector RAG

What it is?

The default pattern in most RAG and agent frameworks:

Encode text fragments (messages, tool outputs, documents) using an embedding model.

Store vectors in an ANN index (FAISS, HNSW, ScaNN, etc.).

At query time, embed the query and retrieve top-k nearest neighbors, optionally rerank.

This is the ‘vector store memory’ exposed by typical LLM orchestration libraries.
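
A minimal sketch of this retrieval loop follows. It uses FAISS with an HNSW index; the embed() function is a random-vector stand-in for a real embedding model, and the documents and parameters are toy values for illustration.

# Minimal sketch of plain vector RAG retrieval with an ANN index.
import faiss
import numpy as np

rng = np.random.default_rng(0)

def embed(texts: list[str]) -> np.ndarray:
    # Stand-in for a real embedding model: random vectors, for illustration only
    return rng.normal(size=(len(texts), 384)).astype("float32")

documents = [
    "budget cap for Q3 is 50k USD",
    "deployments are restricted to eu-west-1",
    "the user prefers weekly status reports",
]
doc_vectors = embed(documents)
faiss.normalize_L2(doc_vectors)            # cosine similarity via normalized inner product

index = faiss.IndexHNSWFlat(doc_vectors.shape[1], 32, faiss.METRIC_INNER_PRODUCT)  # graph-based ANN
index.add(doc_vectors)

query = embed(["what is the spending limit?"])
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)       # top-k nearest neighbors
retrieved = [documents[i] for i in ids[0]] # chunks concatenated into the LLM prompt
print(retrieved)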

Latency profile

Approximate nearest-neighbor indexes are designed for sublinear scaling with corpus size:

Graph-based ANN structures like HNSW typically show empirically near-logarithmic latency growth vs corpus size for fixed recall targets.

On a single node with tuned parameters, retrieving from up to millions of items is usually low tens of milliseconds per query, plus any reranking cost.

Main cost components:

ANN search in the vector index.

Additional reranking (e.g., cross-encoder) if used.

LLM attention cost over concatenated retrieved chunks.

Hit-rate behavior

Hit rate is high when:

The query is local (‘what did we just talk about’), or

The information lives in a small number of chunks with embeddings aligned to the query model.

Vector RAG performs significantly worse on:

Temporal queries (‘what did the user decide last week’).

Cross-session reasoning and long histories.

Multi-hop questions requiring explicit relational paths.

Benchmarks such as Deep Memory Retrieval (DMR) and LongMemEval were introduced precisely because naive vector RAG degrades on long-horizon and temporal tasks.

Failure modes in multi-agent planning

Lost constraints: top-k retrieval misses a critical global constraint (budget cap, compliance rule), so a planner generates invalid tool calls.

Semantic drift: approximate neighbors match on topic but differ in key identifiers (region, environment, user ID), leading to wrong arguments.

Context dilution: too many partially relevant chunks are concatenated; the model underweights the important part, especially in long contexts.

When it is fine

Single-agent or short-horizon tasks.

Q&A over small to medium corpora.

As a first-line semantic index over logs, docs, and episodes, not as the final authority.

1.2 Tiered Vector Memory (MemGPT-Style Virtual Context)

What it is?

MemGPT introduces a virtual-memory abstraction for LLMs: a small working context plus larger external archives, managed by the model using tool calls (e.g., ‘swap in this memory’, ‘archive that section’). The model decides what to keep in the active context and what to fetch from long-term memory.

Architecture

Active context: the tokens currently present in the LLM input (analogous to RAM).

Archive / external memory: larger storage, often backed by a vector DB and object store.

The LLM uses specialized functions to (a minimal controller sketch follows this list):

Load archived content into context.

Evict parts of the current context to the archive.
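
The sketch below illustrates the load/evict idea with a toy controller. The token budget, the least-recently-used eviction policy, and the in-memory archive are assumptions for illustration; MemGPT itself lets the model drive these decisions via tool calls, and the archive would normally be a vector store.

# Minimal sketch of a MemGPT-style tiered memory controller (illustrative policy only).
from collections import OrderedDict

class TieredMemory:
    def __init__(self, max_active_tokens: int = 2000):
        self.max_active_tokens = max_active_tokens
        self.active: OrderedDict[str, str] = OrderedDict()  # working context (key -> text)
        self.archive: dict[str, str] = {}                   # long-term store (vector DB in practice)

    def _tokens(self, text: str) -> int:
        return len(text.split())                            # crude token estimate

    def remember(self, key: str, text: str) -> None:
        """Add to the working set, evicting least-recently-used items to the archive if over budget."""
        self.active[key] = text
        self.active.move_to_end(key)
        while sum(self._tokens(t) for t in self.active.values()) > self.max_active_tokens:
            old_key, old_text = self.active.popitem(last=False)  # evict the oldest entry
            self.archive[old_key] = old_text

    def recall(self, key: str) -> str | None:
        """Page an archived item back into the working set on demand."""
        if key in self.active:
            self.active.move_to_end(key)
            return self.active[key]
        if key in self.archive:
            self.remember(key, self.archive.pop(key))
            return self.active[key]
        return None    # paging miss: this is where latent constraint loss can occur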

Latency profile

Two regimes:

Within active context: retrieval is effectively free externally; attention cost only.

Archive accesses: similar to plain vector RAG, but often targeted:

Search space is narrowed by task, topic, or session ID.

The controller can cache “hot” entries.

Overall, you still pay vector search and serialization costs when paging, but you avoid sending large, irrelevant context to the model at each step.

Hit-rate behavior

Improvement relative to plain vector RAG:

Frequently accessed items are kept in the working set, so they do not depend on ANN retrieval every step.

Rare or old items still suffer from vector-search limitations.

The core new error surface is paging policy rather than pure similarity.

Failure modes in multi-agent planning

Paging errors: the controller archives something that is needed later, or fails to recall it, causing latent constraint loss.

Per-agent divergence: if each agent manages its own working set over a shared archive, agents may hold different local views of the same global state.

Debugging complexity: failures depend on both model reasoning and memory management decisions, which must be inspected together.

When it is useful

Long conversations and workflows where naive context growth is not viable.

Systems where you want vector RAG semantics but bounded context usage.

Scenarios where you can invest in designing / tuning paging policies.

2. Graph Memory Systems

2.1 Temporal Knowledge Graph Memory (Zep / Graphiti)

What it is?

Zep positions itself as a memory layer for AI agents implemented as a temporal knowledge graph (Graphiti). It integrates:

Conversational history.

Structured business data.

Temporal attributes and versioning.

Zep evaluates this architecture on DMR and LongMemEval, comparing against MemGPT and long-context baselines.

Reported results include:

94.8% vs 93.4% accuracy over a MemGPT baseline on DMR.

Up to 18.5% higher accuracy and about 90% lower response latency than certain baselines on LongMemEval for complex temporal reasoning.

These numbers underline the benefit of explicit temporal structure over pure vector recall on long-term tasks.

Architecture

Core components:

Nodes: entities (users, tickets, resources), events (messages, tool calls).

Edges: relations (created, depends_on, updated_by, discussed_in).

Temporal indexing: validity intervals and timestamps on nodes/edges.

APIs for:

Writing new events / facts into the KG.

Querying along entity and temporal dimensions.

The KG can coexist with a vector index for semantic entry points.
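
The following is a minimal, self-contained sketch of the temporal aspect: edges carry validity intervals and queries filter by entity and time. The schema and field names are illustrative and are not Zep's or Graphiti's API.

# Minimal sketch of temporal knowledge-graph memory with validity intervals on edges.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Edge:
    src: str                         # e.g., "config:prod"
    relation: str                    # e.g., "passed_check"
    dst: str                         # e.g., "check:backup"
    valid_from: datetime
    valid_to: datetime | None = None # None means the fact is still valid

edges = [
    Edge("config:prod", "passed_check", "check:backup", datetime(2024, 6, 1), datetime(2024, 6, 20)),
    Edge("config:prod", "passed_check", "check:backup", datetime(2024, 7, 2)),
]

def state_at(entity: str, relation: str, at: datetime) -> list[Edge]:
    """Return edges for an entity/relation that were valid at a given point in time."""
    return [e for e in edges
            if e.src == entity and e.relation == relation
            and e.valid_from <= at and (e.valid_to is None or at <= e.valid_to)]

# "What state was config:prod in on June 25?" -> no valid passed_check edge at that time
print(state_at("config:prod", "passed_check", datetime(2024, 6, 25)))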

Latency profile

Graph queries are typically bounded by small traversal depths:

For questions like “latest configuration that passed checks,” the system:

Locates the relevant entity node.

Traverses outgoing edges with temporal filters.

Complexity scales with the size of the local neighborhood, not the full graph.

In practice, Zep reports order-of-magnitude latency benefits vs baselines that either scan long contexts or rely on less structured retrieval.

Hit-rate behavior

Graph memory excels when:

Queries are entity-centric and temporal.

You need cross-session consistency, e.g., “what did this user previously request,” “what state was this resource in at time T”.

Multi-hop reasoning is required (“if ticket A depends on B and B failed after policy P changed, what is the likely cause?”).

Hit rate is limited by graph coverage: missing edges or incorrect timestamps directly reduce recall.

Failure modes in multi-agent planning

Stale edges / lagging updates: if real systems change but graph updates are delayed, plans operate on incorrect world models.

Schema drift: evolving the KG schema without synchronized changes in retrieval prompts or planners yields subtle errors.

Access control partitions: multi-tenant scenarios can yield partial views per agent; planners must be aware of visibility constraints.

When it is useful

Multi-agent systems coordinating on shared entities (tickets, users, inventories).

Long-running tasks where temporal ordering is critical.

Environments where you can maintain ETL / streaming pipelines into the KG.

2.2 Knowledge-Graph RAG (GraphRAG)

What it is?

GraphRAG is a retrieval-augmented generation pipeline from Microsoft that builds an explicit knowledge graph over a corpus and performs hierarchical community detection (e.g., Hierarchical Leiden) to organize the graph. It stores summaries per community and uses them at query time.

Pipeline:

Extract entities and relations from source documents.

Build the KG.

Run community detection and build a multi-level hierarchy.

Generate summaries for communities and key nodes.

At query time (a minimal retrieval sketch follows this list):

Identify relevant communities (via keywords, embeddings, or graph heuristics).

Retrieve summaries and supporting nodes.

Pass them to the LLM.
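
The sketch below illustrates the query-time step over community summaries. Random vectors stand in for a real embedding model, and the community names and summaries are toy data; GraphRAG's actual pipeline also retrieves supporting nodes and raw evidence.

# Minimal sketch of GraphRAG-style query-time retrieval over community summaries.
import numpy as np

rng = np.random.default_rng(1)

def embed(texts: list[str]) -> np.ndarray:
    # Stand-in embedding model, for illustration only
    return rng.normal(size=(len(texts), 128)).astype("float32")

communities = {
    "auth-service": "Summary of incidents and design decisions around the authentication service.",
    "billing": "Summary of billing pipeline documents, schema changes, and outages.",
    "infra": "Summary of infrastructure migrations and capacity planning discussions.",
}

names = list(communities)
summary_vecs = embed([communities[n] for n in names])
summary_vecs /= np.linalg.norm(summary_vecs, axis=1, keepdims=True)

def top_communities(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    scores = summary_vecs @ q                      # cosine similarity against each community summary
    return [names[i] for i in np.argsort(-scores)[:k]]

# The selected summaries (plus supporting nodes) are what gets passed to the LLM
print(top_communities("what chain of incidents led to the billing outage?"))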

Latency profile

Indexing is heavier than vanilla RAG (graph construction, clustering, summarization).

Query-time latency can be competitive or better for large corpora, because:

You retrieve a small number of summaries.

You avoid constructing extremely long contexts from many raw chunks.

Latency mostly depends on:

Community search (often vector search over summaries).

Local graph traversal inside selected communities.

Hit-rate behavior

GraphRAG tends to outperform plain vector RAG when:

Queries are multi-document and multi-hop.

You need global structure, e.g., “how did this design evolve,” “what chain of incidents led to this outage.”

You want answers that integrate evidence from many documents.

The hit rate depends on graph quality and community structure: if entity extraction misses relations, they simply do not exist in the graph.

Failure modes

Graph construction bias: extraction errors or missing edges lead to systematic blind spots.

Over-summarization: community summaries may drop rare but important details.

Traceability cost: tracing an answer back from summaries to raw evidence adds complexity, important in regulated or safety-critical settings.

When it is useful

Large knowledge bases and documentation sets.

Systems where agents must answer design, policy, or root-cause questions that span many documents.

Scenarios where you can afford the one-time indexing and maintenance cost.

3. Event and Execution Log Systems

3.1 Execution Logs and Checkpoints (ALAS, LangGraph)

What they are?

These systems treat ‘what the agents did’ as a first-class data structure.

ALAS: a transactional multi-agent framework that maintains a versioned execution log plus:

Validator isolation: a separate LLM checks plans/results with its own context.

Localized Cascading Repair: only a minimal region of the log is edited when failures occur.

LangGraph: exposes thread-scoped checkpoints of an agent graph (messages, tool outputs, node states) that can be persisted, resumed, and branched.

In both cases, the log / checkpoints are the ground truth for:

Actions taken.

Inputs and outputs.

Control-flow decisions.

Latency profile

For normal forward execution:

Reading the tail of the log or a recent checkpoint is O(1) and small.

Latency mostly comes from LLM inference and tool calls, not log access.

For analytics / global queries:

You need secondary indexes or offline processing; raw scanning is O(n).

Hit-rate behavior

For questions like ‘what happened,’ ‘which tools were called with which arguments,’ and ‘what was the state before this failure,’ hit rate is effectively 100%, assuming:

All relevant actions are instrumented.

Log persistence and retention are correctly configured.

Logs do not provide semantic generalization by themselves; for semantic search across executions, you layer vector or graph indices on top.

Failure modes

Log bloat: high-volume systems generate large logs; improper retention or compaction can silently drop history.

Partial instrumentation: missing tool or agent traces yield blind spots in replay and debugging.

Unsafe replay: naively re-running log steps can re-trigger external side effects (payments, emails) unless idempotency keys and compensation handlers exist (see the replay-guard sketch below).

ALAS explicitly tackles some of these via transactional semantics, idempotency, and localized repair.
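A minimal sketch of the idempotency-key idea, assuming results of side-effecting calls are recorded against a deterministic key; this illustrates the general pattern, not ALAS's actual implementation.

```python
# Sketch of replay guarded by idempotency keys: re-running a logged step returns the
# recorded result instead of hitting the external system again. Not ALAS's actual code.
import hashlib
import json

class IdempotentExecutor:
    def __init__(self):
        self._completed: dict[str, dict] = {}  # idempotency key -> recorded result

    @staticmethod
    def key(action: str, inputs: dict) -> str:
        payload = json.dumps([action, inputs], sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def run(self, action: str, inputs: dict, side_effect) -> dict:
        k = self.key(action, inputs)
        if k in self._completed:
            return self._completed[k]       # replay path: skip the real side effect
        result = side_effect(**inputs)      # first execution: perform the real call
        self._completed[k] = result
        return result
```

Compensation handlers (undo actions for steps that cannot simply be skipped) would sit alongside this guard, but are omitted here for brevity.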

When they are essential

Any system where you care about observability, auditing, and debuggability.

Multi-agent workflows with non-trivial failure semantics.

Scenarios where you want automated repair or partial re-planning rather than full restart.

3.2 Episodic Long-Term Memory

What it is

Episodic memory structures store episodes: cohesive segments of interaction or work, each with:

Task description and initial conditions.

Relevant context.

Sequence of actions (often references into the execution log).

Outcomes and metrics.

Episodes are indexed with:

Metadata (time windows, participants, tools).

Embeddings (for similarity search).

Optional summaries.

Some systems periodically distill recurring patterns into higher-level knowledge or use episodes to fine-tune specialized models.

Latency profile

Episodic retrieval is typically two-stage:

Identify relevant episodes via metadata filters and/or vector search.

Retrieve content within selected episodes (sub-search or direct log references).

Latency is higher than a single flat vector search on small data, but it scales better as lifetime history grows, because you avoid searching over all individual events for every query (see the sketch below).
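The two-stage shape can be sketched as follows; the episode layout and the embed callable are assumptions made for illustration, not a specific framework's API.

```python
# Two-stage episodic retrieval sketch: (1) narrow to candidate episodes via a metadata
# filter plus summary-embedding similarity, (2) search only within those episodes.
# The episode layout and the embed callable are assumptions for illustration.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query, episodes, embed, required_tool=None, top_episodes=3, top_events=5):
    """episodes: [{"summary_emb": vec, "tools": set, "events": [{"text": str, "emb": vec}]}]"""
    q = embed(query)
    # Stage 1: metadata filter, then rank the remaining episodes by summary similarity.
    candidates = [e for e in episodes if required_tool is None or required_tool in e["tools"]]
    candidates.sort(key=lambda e: cosine(q, e["summary_emb"]), reverse=True)
    # Stage 2: vector search over events, but only inside the selected episodes.
    events = [ev for e in candidates[:top_episodes] for ev in e["events"]]
    events.sort(key=lambda ev: cosine(q, ev["emb"]), reverse=True)
    return events[:top_events]
```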

Hit-rate behavior

Episodic memory improves hit rate for:

Long-horizon tasks: “have we run a similar migration before?”, “how did this kind of incident resolve in the past?”

Pattern reuse: retrieving prior workflows plus outcomes, not just facts.

Hit rate still depends on episode boundaries and index quality.

Failure modes

Episode boundary errors: too coarse (episodes that mix unrelated tasks) or too fine (episodes that cut mid-task).

Consolidation mistakes: wrong abstractions during distillation propagate bias into parametric models or global policies.

Multi-agent misalignment: per-agent episodes instead of per-task episodes make cross-agent reasoning harder.

When it is useful

Long-lived agents and workflows spanning weeks or months.

Systems where “similar past cases” are more useful than raw facts.

Training / adaptation loops where episodes can feed back into model updates.

Key Takeaways

Memory is a systems problem, not a prompt trick: Reliable multi-agent setups need explicit design around what is stored, how it is retrieved, and how the system reacts when memory is stale, missing, or wrong.

Vector memory is fast but structurally weak: Plain and tiered vector stores give low-latency, sublinear retrieval, but struggle with temporal reasoning, cross-session state, and multi-hop dependencies, making them unreliable as the sole memory backbone in planning workflows.

Graph memory fixes temporal and relational blind spots: Temporal KGs (e.g., Zep/Graphiti) and GraphRAG-style knowledge graphs improve hit rate and latency on entity-centric, temporal, and multi-document queries by encoding entities, relations, and time explicitly.

Event logs and checkpoints are the ground truth: ALAS-style execution logs and LangGraph-style checkpoints provide the authoritative record of what agents actually did, enabling replay, localized repair, and real observability in production systems.

Robust systems compose multiple memory layers: Practical agent architectures combine vector, graph, and event/episodic memory, with clear roles and known failure modes for each, instead of relying on a single “magic” memory mechanism (a minimal composition sketch follows).
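As a closing illustration, a composite memory can route different query types to the layer best suited to them. The layer objects and their methods in this sketch are placeholders chosen for clarity, not a particular framework's interface.

```python
# Illustrative composition of memory layers behind one interface. The layer objects
# and their methods are placeholders, not a particular framework's API.
class CompositeMemory:
    def __init__(self, vector_store, knowledge_graph, execution_log, episodic_index):
        self.vector = vector_store        # fast semantic lookup over facts/chunks
        self.graph = knowledge_graph      # entity, temporal, and multi-hop queries
        self.log = execution_log          # ground truth of what agents actually did
        self.episodic = episodic_index    # similar past cases and their outcomes

    def query(self, kind: str, payload: dict):
        if kind == "fact":        # "what do the docs say about X?"
            return self.vector.search(payload["text"], k=5)
        if kind == "relation":    # "how are A and B connected, and since when?"
            return self.graph.query(payload["text"])
        if kind == "trace":       # "what did the agent actually do in run Z?"
            return self.log.tail(payload.get("n", 20))
        if kind == "precedent":   # "have we handled a similar task before?"
            return self.episodic.retrieve(payload["text"])
        raise ValueError(f"unknown query kind: {kind}")
```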

References:

MemGPT (virtual context / tiered vector memory)

https://arxiv.org/abs/2310.08560

https://arxiv.org/pdf/2310.08560

https://research.memgpt.ai/

Zep / Graphiti (temporal knowledge graph memory, DMR, LongMemEval)

https://arxiv.org/abs/2501.13956

https://www.getzep.com/

https://github.com/getzep/graphiti

https://www.emergentmind.com/topics/zep-a-temporal-knowledge-graph-architecture

GraphRAG (knowledge-graph RAG, hierarchical communities)

https://microsoft.github.io/graphrag/index/default_dataflow/

https://graphrag.com/reference/graphrag/global-community-summary-retriever/

https://github.com/microsoft/graphrag

ALAS (transactional / disruption-aware multi-agent planning, execution logs)

https://arxiv.org/abs/2505.12501

https://arxiv.org/abs/2511.03094

https://www.themoonlight.io/en/review/alas-transactional-and-dynamic-multi-agent-llm-planning

https://www.researchgate.net/publication/397322324_ALAS_Transactional_and_Dynamic_Multi-Agent_LLM_Planning

LangGraph (checkpoints / memory, thread-scoped state)

https://docs.langchain.com/oss/python/langgraph/memory

https://medium.com/@anil.jain.baba/long-term-agentic-memory-with-langgraph-824050b09852

Supplemental GraphRAG + temporal KG context

https://memgraph.com/blog/how-microsoft-graphrag-works-with-graph-databases
