Build agents to learn from experiences using Amazon Bedrock AgentCore episodic memory

Today, most agents operate only on what’s visible in the current interaction: they can access facts and knowledge, but they can’t remember how they solved similar problems before or why certain approaches worked or failed. This creates a significant gap in their ability to learn and improve over time. Amazon Bedrock AgentCore episodic memory addresses this limitation by capturing and surfacing experience-level knowledge for AI agents. Although semantic memory helps an agent remember what it knows, episodic memory documents how it arrived there: the goal, reasoning steps, actions, outcomes, and reflections. By converting each interaction into a structured episode, you can enable agents to recall knowledge and interpret and apply prior reasoning. This helps agents adapt across sessions, avoid repeating mistakes, and evolve their planning over time.
Amazon Bedrock AgentCore Memory is a fully managed service that helps developers create context-aware AI agents through both short-term memory and long-term intelligent memory capabilities. To learn more, see Amazon Bedrock AgentCore Memory: Building context-aware agents and Building smarter AI agents: AgentCore long-term memory deep dive.
In this post, we walk you through the complete architecture to structure and store episodes, discuss the reflection module, and share compelling benchmarks that demonstrate significant improvements in agent task success rates.
Key challenges in designing agent episodic memory
Episodic memory enables agents to retain and reason over their own experiences. However, designing such a system requires solving several key challenges to make sure experiences remain coherent, evaluable, and reusable:

Maintaining temporal and causal coherence – Episodes need to preserve the order and cause-effect flow of reasoning steps, actions, and outcomes so the agent can understand how its decisions evolved.
Detecting and segmenting multiple goals – Sessions often involve overlapping or shifting goals. The episodic memory must identify and separate them to avoid mixing unrelated reasoning traces.
Learning from experience – Each episode should be evaluated for success or failure. Reflection should then compare similar past episodes to identify generalizable patterns and principles, enabling the agent to adapt those insights to new goals rather than replaying prior trajectories.

In the next section, we describe how to build an AgentCore episodic memory strategy, covering its extraction, storage, retrieval, and reflection pipeline and how these components work together to help transform experience into adaptive intelligence.
How AgentCore episodic memory works
When your agentic application sends conversational events to AgentCore Memory, raw interactions get transformed into rich episodic memory records through an intelligent extraction and reflection process. The following diagram illustrates how this episodic memory strategy works and how simple agent conversations become meaningful, reflective memories that shape future interactions.

The following diagram illustrates the same architecture in more detail, showing the data flow between components.

The preceding diagrams illustrate the steps of the episodic memory strategy. The first two steps (marked pink and purple) form the two-stage episode extraction module, whose stages serve distinct but complementary purposes. The third step (marked blue) is the reflection module, which helps the agent learn from past experience. We discuss these steps in detail in the following sections.
Episode extraction module
The episode extraction module is the foundational step in the episodic strategy that transforms raw user-agent interaction data into structured, meaningful episodes. We follow a two-stage approach where the stages are designed to capture both granular step-wise mechanics of each interaction (called turn extraction) and broader episode-wise knowledge to create coherent narratives (called episode extraction). To make an analogy, think of it in terms of taking notes during a meeting (turn level) and writing the meeting summary at the end of the meeting (episode). Both stages are valuable but serve different purposes when learning from experience.
In the first stage of episode extraction, the system performs turn-level processing to understand what went right or wrong. Here, single exchange units between the user and the agent called conversational turns are identified, segmented, and transformed into structured summaries along the following dimensions (see the sketch after this list):

Turn situation – A brief description of the circumstances and context that the assistant is responding to in this turn. This includes the immediate context, the user’s overarching objectives that might span multiple turns, and the relevant history from previous interactions that informed the current exchange.
Turn intent – The assistant’s specific purpose and primary goal for this turn, essentially answering the question “What was the assistant trying to accomplish in this moment?”
Turn action – A detailed record of the concrete steps taken during the interaction, documenting which specific tools were used, what input arguments or parameters were provided to each tool, and how the assistant translated intent into executable actions.
Turn thought – The reasoning behind the assistant’s decisions, explaining the “why” behind tool selection and approach.
Turn assessment – An honest evaluation of whether the assistant successfully achieved its stated goal for this specific turn, providing immediate feedback on the effectiveness of the chosen approach and actions taken.
Goal assessment – A broader perspective on whether the user’s overall objective across the entire conversation appears to be satisfied or progressing toward completion, looking beyond individual turns to evaluate holistic success.
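To make these dimensions concrete, the following is a minimal sketch of how a turn-level record could be represented in application code. The field names mirror the dimensions above; the class itself is illustrative and is not the AgentCore Memory data model.

from dataclasses import dataclass

@dataclass
class TurnRecord:
    """Illustrative structure for one extracted conversational turn."""
    turn_situation: str   # context the assistant is responding to in this turn
    turn_intent: str      # what the assistant is trying to accomplish right now
    turn_action: str      # tools called, arguments passed, concrete steps taken
    turn_thought: str     # reasoning behind tool selection and approach
    turn_assessment: str  # did this turn achieve its stated goal?
    goal_assessment: str  # is the user's overall objective progressing?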

After processing and structuring individual turns, the system proceeds to the episode extraction stage, when a user completes their goal (detected by the large language model) or an interaction ends. This helps capture the complete user journey, because a user’s goal often spans multiple turns and individual turn data alone can’t convey whether the overall objective was achieved or what the holistic strategy looked like. In this stage, sequentially related turns are synthesized into coherent episodic memories that capture complete user journeys, from initial request to final resolution (see the sketch after this list):

Episode situation – The broader circumstances that initiated the user’s need for assistance
Episode intent – A clear articulation of what the user ultimately wanted to accomplish
Success evaluation – A definitive assessment of whether the conversation achieved its intended purpose for each episode
Evaluation justification – Concrete reasoning for success or failure assessments, grounded in specific conversational moments that demonstrate progress toward or away from user goals
Episode insights – Insights capturing proven effective approaches and identifying pitfalls to avoid for the current episode
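Continuing the illustrative sketch, an episode-level record could aggregate the turns it was synthesized from along with the episode-wise fields above. Again, this structure is ours, not the service's internal schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EpisodeRecord:
    """Illustrative structure for a synthesized episode."""
    episode_situation: str          # circumstances that initiated the user's need
    episode_intent: str             # what the user ultimately wanted to accomplish
    success_evaluation: bool        # did the conversation achieve its intended purpose?
    evaluation_justification: str   # reasoning grounded in specific conversational moments
    episode_insights: List[str]     # effective approaches and pitfalls for this episode
    turns: List[TurnRecord] = field(default_factory=list)  # turn-level records it covers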

Reflection module
The reflection module highlights the ability of Amazon Bedrock AgentCore episodic memory to learn from past experiences and generate insights that help improve future performance. This is where individual episode learnings evolve into generalizable knowledge that can guide agents across diverse scenarios.
The reflection module operates through cross-episodic reflection, retrieving past similar successful episodes based on user intent and reflecting across multiple episodes to achieve more generalizable insights. When new episodes are processed, the system performs the following actions:

Using the user intent as a semantic key, the system identifies historically successful and relevant episodes from the vector store that share similar goals, contexts, or problem domains.
The system analyzes patterns across the main episode and relevant episodes, looking for transferable insights about what approaches work consistently across different contexts.
Existing reflection knowledge is reviewed and either enhanced with new insights or expanded with entirely new patterns discovered through cross-episodic analysis.

At the end of the process, each reflection memory record contains the following information (sketched in code after the list):

Use case – When and where the insight applies, including relevant user goals and trigger conditions
Hints (insights) – Actionable guidance covering tool selection strategies, effective approaches, and pitfalls to avoid
Confidence scoring – A score (0.1–1.0) indicating how well the insight generalizes across different scenarios
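The following sketch shows, at a high level, what the cross-episodic reflection pass described above could look like in application code. The vector_store and synthesize_reflection helpers are hypothetical placeholders for a retrieval layer and an LLM call; they are not AgentCore APIs.

from dataclasses import dataclass

@dataclass
class ReflectionRecord:
    """Illustrative structure for a reflection memory record."""
    use_case: str      # when and where the insight applies, including trigger conditions
    hints: str         # tool selection strategies, effective approaches, pitfalls to avoid
    confidence: float  # 0.1-1.0, how well the insight generalizes across scenarios

def reflect_on_episode(new_episode, vector_store, synthesize_reflection, top_k=5):
    """Hypothetical cross-episodic reflection over a newly extracted episode."""
    # 1. Use the user intent as a semantic key to retrieve similar past episodes.
    similar = vector_store.search(query=new_episode.episode_intent, top_k=top_k)
    successful = [ep for ep in similar if ep.success_evaluation]

    # 2. Analyze patterns across the new episode and the retrieved ones, then enhance
    #    existing reflections or create new ones (delegated to an LLM call here).
    return synthesize_reflection(main=new_episode, references=successful)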

Episodes provide agents with concrete examples of how similar problems were solved before. These case studies show the specific tools used, reasoning applied, and outcomes achieved, including both successes and failures. This creates a learning framework where agents can follow proven strategies and avoid documented mistakes.
Reflection memories extract patterns from multiple episodes to deliver strategic insights. Instead of individual cases, they reveal which tools work best, what decision-making approaches succeed, and which factors drive outcomes. These distilled principles give agents higher-level guidance for navigating complex scenarios.
Custom override configurations
Although built-in memory strategies cover the common use cases, many domains require tailored approaches for memory processing. The system supports built-in strategy overrides through custom prompts that extend the built-in logic, helping teams adapt memory handling to their specific requirements. You can implement the following custom override configurations (an illustrative configuration sketch follows this list):

Custom prompts – These prompts focus on criteria and logic rather than output formats and help developers define the following:

Extraction criteria – What information gets extracted or filtered out.
Consolidation rules – How related memories should be consolidated.
Conflict resolution – How to handle contradictory information.
Insight generation – How cross-episode reflections are synthesized.

Custom model – AgentCore Memory supports custom model selection for memory extraction, consolidation, and reflection operations. This flexibility helps developers balance accuracy and latency based on their specific requirements. You can define the model using the APIs when you create the memory resource as a strategy override or through the Amazon Bedrock AgentCore console (as shown in the following screenshot).
Namespaces – Namespaces provide a hierarchical organization for episodes and reflections, enabling access to your agent’s experiences at different levels of granularity and providing a natural logical grouping. For instance, to design a namespace for a travel application, episodes could be stored under travel_booking/users/userABC/episodes and reflections could reside at travel_booking/users/userABC. Note that the namespace for episodes must be a sub-path of the namespace for reflections.
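As an illustration only, the override concepts above could be captured in a configuration structure like the following. The field names are descriptive placeholders for this post, not the actual AgentCore Memory API shape, and the model ID is just an example choice.

# Illustrative sketch; keys are descriptive placeholders, not the AgentCore Memory API.
episodic_strategy_override = {
    "model_id": "anthropic.claude-3-7-sonnet-20250219-v1:0",  # example model selection
    "custom_prompts": {
        "extraction_criteria": "Extract only turns related to travel bookings; ignore small talk.",
        "consolidation_rules": "Merge episodes that describe the same booking attempt.",
        "reflection_synthesis": "Summarize cross-episode insights about booking and refund workflows.",
    },
    "namespaces": {
        "reflections": "travel_booking/users/userABC",
        "episodes": "travel_booking/users/userABC/episodes",  # sub-path of the reflections namespace
    },
}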

Performance evaluation
We evaluated Amazon Bedrock AgentCore episodic memory on real-world goal completion benchmarks from the retail and airline domains (sampled from τ²-bench). These benchmarks contain tasks that mirror actual customer service scenarios where agents need to help users achieve specific goals.
We compared three different setups in our experiments:

For the baseline, we ran the agent (built with Anthropic’s Claude 3.7) without interacting with the memory component.
For memory-augmented agents, we explored two methods of using memories:

In-context learning examples – The first method uses extracted episodes as in-context learning examples. Specifically, we constructed a tool named retrieve_exemplars (tool definition in appendix) that agents can use by issuing a query (for example, “how to get refund?”) to get step-by-step instructions from the episodes repository. When agents face similar problems, the retrieved episodes will be added into the context to guide the agent to take the next action.
Reflection-as-guidance – The second method we explored is reflection-as-guidance. Specifically, we construct a tool named retrieve_reflections (tool definition in appendix) that agents can use to access broader insights from past experiences. Similar to retrieve_exemplars, the agent can generate a query to retrieve reflections as context, gaining insights to make informed decisions about strategy and approach rather than specific step-by-step actions.

We used the following evaluation methodology:

The baseline agent first processes a set of historical customer interactions, which become the source for memory extraction.
The agent then receives new user queries from τ²-bench.
Each query is attempted four times in parallel.
For evaluation, pass-rate metrics are measured across these four attempts. Pass^k measures the percentage of tasks where the agent succeeded in at least k out of four attempts (a short computation sketch follows the list):

Pass^1: Succeeded at least once (measures capability)
Pass^2: Succeeded at least twice (measures reliability)
Pass^3: Succeeded at least three times (measures consistency)
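Under the definition used in this post (success in at least k of the four attempts), the metric reduces to a few lines of Python. This is a sketch of the evaluation logic only, not the benchmark harness itself.

def pass_at_least_k(attempts_per_task, k):
    """Fraction of tasks where at least k of the parallel attempts succeeded."""
    passed = sum(1 for attempts in attempts_per_task if sum(attempts) >= k)
    return passed / len(attempts_per_task)

# Example: 3 tasks, 4 attempts each (True = attempt succeeded).
results = [
    [True, True, False, True],
    [False, False, True, False],
    [True, True, True, True],
]
for k in (1, 2, 3):
    print(f"Pass^{k}: {pass_at_least_k(results, k):.2f}")  # 1.00, 0.67, 0.67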

The results in the following table show clear improvements across both domains and multiple attempts.
 

| System | Memory Type Used by Agent | Retail Pass^1 | Retail Pass^2 | Retail Pass^3 | Airline Pass^1 | Airline Pass^2 | Airline Pass^3 |
|---|---|---|---|---|---|---|---|
| Baseline | No memory | 65.8% | 49.7% | 42.1% | 47.0% | 33.3% | 24.0% |
| Memory-augmented agent | Episodes as ICL examples | 69.3% | 53.8% | 43.4% | 55.0% | 46.7% | 43.0% |
| Memory-augmented agent | Cross-episode reflection memory | 77.2% | 64.3% | 55.7% | 58.0% | 46.0% | 41.0% |

Memory-augmented agents consistently outperform the baseline across domains and consistency levels. Crucially, the results show that different memory retrieval strategies suit different task characteristics. In the retail domain, cross-episode reflection improved Pass^1 by 11.4 points and Pass^3 by 13.6 points over the baseline, suggesting that generalized strategic insights are particularly valuable for open-ended customer service scenarios with diverse interaction patterns. In contrast, the airline domain, characterized by complex rule-based policies and multi-step procedures, benefits more from episodes as examples, which achieved the highest airline Pass^3 (43.0% vs. 41.0% for reflection). This indicates that concrete step-by-step examples help agents navigate structured workflows reliably. The relative improvement is most pronounced at higher consistency thresholds (Pass^3), where memory helps agents avoid the mistakes that cause intermittent failures.
Best practices for using episodic memory
The key to effective episodic memory is knowing when to use it and which type fits your situation. In this section, we discuss what we’ve learned works best.
When to use episodic memory
Episodic memory delivers the most value when you match the right memory type to your current need. It is ideal for complex, multi-step tasks where context and past experience matter significantly, such as debugging code, planning trips, and analyzing data. It’s also particularly valuable for repetitive workflows where learning from previous attempts can dramatically improve outcomes, and for domain-specific problems where accumulated expertise makes a real difference.
However, episodic memory isn’t always the right choice. You can skip it for simple, one-time questions like weather checks or basic facts that don’t need reasoning or context. Simple customer service conversations, basic Q&A, or casual chats don’t need the advanced features that episodic memory adds. The true benefit of episodic memory is observed over time. For short tasks, a session summary provides sufficient information. However, for complex tasks and repetitive workflows, episodic memory helps agents build on past experiences and continuously improve their performance.
Choosing episodes vs. reflection
Episodes work best when you’re facing similar specific problems and need clear guidance. If you’re debugging a React component that won’t render, episodes can show you exactly how similar problems were fixed before, including the specific tools used, thinking process, and results. They give you real examples when general advice isn’t enough, showing the complete path from finding the problem to solving it.
Reflection memories work best when you need strategic guidance across broader contexts rather than specific step-by-step solutions. Use reflections when you’re facing a new type of problem and need to understand general principles, like “What’s the most effective approach for data visualization tasks?” or “Which debugging strategies tend to work best for API integration issues?” Reflections are particularly valuable when you’re making high-level decisions about tool selection and which method to follow, or understanding why certain patterns consistently succeed or fail.
Before starting tasks, check reflections for strategy guidance, look at similar episodes for solution patterns, and find high-confidence mistakes documented in previous attempts. During tasks, look at episodes when you hit roadblocks, use reflection insights for tool choices, and think about how your current situation differs from past examples.
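In code terms, that pre-task routine might look like the sketch below, which reuses the retrieve_reflections and retrieve_exemplars tool definitions from the appendix. The run_agent_step callable is a placeholder for your own agent framework.

def solve_with_memory(task, run_agent_step):
    """Hypothetical pre-task memory check built on the appendix tool definitions."""
    # Before starting: pull strategic guidance and concrete precedents.
    strategy_hints = retrieve_reflections(task, k=5)  # high-level guidance and pitfalls
    similar_cases = retrieve_exemplars(task)          # step-by-step solution patterns

    context = (
        f"Task:\n{task}\n\n"
        f"Reflection guidance:\n{strategy_hints}\n\n"
        f"Similar past episodes:\n{similar_cases}"
    )
    # During the task, the agent can call the same tools again when it hits roadblocks.
    return run_agent_step(context)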
Conclusion
Episodic memory fills a critical gap in current agent capabilities. By storing complete reasoning paths and learning from outcomes, agents can avoid repeating mistakes and build on successful strategies.
Episodic memory completes the memory framework of Amazon Bedrock AgentCore alongside summarization, semantic, and preference memory. Each serves a specific purpose: summarization manages context length, semantic memory stores facts, preference memory handles personalization, and episodic memory captures experience. The combination helps give agents both structured knowledge and practical experience to handle complex tasks more effectively.
To learn more about episodic memory, refer to Episodic memory strategy, How to best retrieve episodes to improve agentic performance, and the AgentCore Memory GitHub samples.

Appendix
In this section, we discuss two methods of using memories for memory-augmented agents.
Episode example
The following is an example using extracted episodes as in-context learning examples:

** Context **
A customer (Jane Doe) contacted customer service expressing frustration
about a recent flight delay that disrupted their travel plans and wanted
to discuss compensation or resolution options for the inconvenience they
experienced.

** Goal **
The user’s primary goal was to obtain compensation or some form of resolution
for a flight delay they experienced, seeking acknowledgment of the disruption
and appropriate remediation from the airline.

### Step 1:

**Thought:**
The assistant chose to gather information systematically rather than making
assumptions, as flight delay investigations require specific reservation and
flight details. This approach facilitates accurate assistance and demonstrates
professionalism by acknowledging the customer’s frustration while taking concrete
steps to help resolve the issue.

**Action:**
The assistant responded conversationally without using any tools, asking the
user to provide their user ID to access reservation details.

— End of Step 1 —

** Episode Reflection **:
The conversation demonstrates an excellent systematic approach to flight
modifications: starting with reservation verification, then identifying
confirmation, followed by comprehensive flight searches, and finally processing
changes with proper authorization. The assistant effectively used appropriate
tools in a logical sequence – get_reservation_details for verification, get_user_details
for identity/payment info, search_direct_flight for options, and update tools for
processing changes. Key strengths included transparent pricing calculations,
proactive mention of insurance benefits, clear presentation of options, and proper
handling of policy constraints (explaining why mixed cabin classes aren’t allowed).
The assistant successfully leveraged user benefits (Gold status for free bags) and
maintained security protocols throughout. This methodical approach made sure user
needs were addressed while following proper procedures for reservation modifications.

Reflection example
The following is an example of Reflection memory, which can be used for agent guidance:

**Title:** Proactive Alternative Search Despite Policy Restrictions

**Use Cases:**
This applies when customers request flight modifications or changes that
are blocked by airline policies (such as basic economy no-change rules,
fare class restrictions, or booking timing limitations). Rather than simply
declining the request, this pattern involves immediately searching for
alternative solutions to help customers achieve their underlying goals.
It’s particularly valuable for emergency situations, budget-conscious travelers,
or when customers have specific timing needs that their current reservations
don’t accommodate.

**Hints:**
When policy restrictions prevent the requested modification, immediately pivot
to solution-finding rather than just explaining limitations. Use search_direct_flight
to find alternative options that could meet the customer’s needs, even if it requires
separate bookings or different approaches. Present both the policy constraint
explanation AND viable alternatives in the same response to maintain momentum toward
resolution. Consider the customer’s underlying goal (getting home earlier,
changing dates, etc.) and search for flights that accomplish this objective.
When presenting alternatives, organize options clearly by date and price, highlight
budget-friendly choices, and explain the trade-offs between keeping existing reservations
versus canceling and rebooking. This approach transforms policy limitations into problem-solving
opportunities and maintains customer satisfaction even when the original request cannot be fulfilled.

Tool definitions
The following code is the tool definition for retrieve_exemplars:
def retrieve_exemplars(task: str) -> str:
    """
    Retrieve example processes to help solve the given task.

    Args:
        task: The task to solve that requires example processes.

    Returns:
        str: The example processes to help solve the given task.
    """

The following is the tool definition for retrieve_reflections:
def retrieve_reflections(task: str, k: int = 5) -> str:
    """
    Retrieve synthesized reflection knowledge from past agent experiences by matching
    against knowledge titles and use cases. Each knowledge entry contains: (1) a descriptive title,
    (2) specific use cases describing the types of goals where this knowledge applies and when to apply it,
    and (3) actionable hints including best practices from successful episodes and common pitfalls to avoid
    from failed episodes. Use this to get strategic guidance for similar tasks.

    Args:
        task: The current task or goal you are trying to accomplish. This will be matched
            against knowledge titles and use cases to find relevant reflection knowledge. Describe your task
            clearly to get the most relevant matches.
        k: Number of reflection knowledge entries to retrieve. Default is 5.

    Returns:
        str: The synthesized reflection knowledge from past agent experiences.
    """

About the Authors
Jiarong Jiang is a Principal Applied Scientist at AWS, driving innovations in Retrieval Augmented Generation (RAG) and agent memory systems to improve the accuracy and intelligence of enterprise AI. She’s passionate about helping customers build context-aware, reasoning-driven applications that use their own data effectively.
Akarsha Sehwag is a Generative AI Data Scientist for the Amazon Bedrock AgentCore Memory team. With over 6 years of expertise in AI/ML, she has built production-ready enterprise solutions across diverse customer segments in generative AI, deep learning, and computer vision domains. Outside of work, she likes to hike, bike, and play badminton.
Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Peng Shi is a Senior Applied Scientist at AWS, where he leads advancements in agent memory systems to enhance the accuracy, adaptability, and reasoning capabilities of AI. His work focuses on creating more intelligent and context-aware applications that bridge cutting-edge research with real-world impact.
Anil Gurrala is a Senior Solutions Architect at AWS based in Atlanta. With over 3 years at Amazon and nearly two decades of experience in digital innovation and transformation, he helps customers with modernization initiatives, architecture design, and optimization on AWS. Anil specializes in implementing agentic AI solutions while partnering with enterprises to architect scalable applications and optimize their deployment within the AWS cloud environment. Outside of work, Anil enjoys playing volleyball and badminton, and exploring new destinations around the world.
Ruo Cheng is a Senior UX Designer at AWS, designing enterprise AI and developer experiences across Amazon Bedrock and Amazon Bedrock AgentCore. With a decade of experience, she leads design for AgentCore Memory, shaping memory-related workflows and capabilities for agent-based applications. Ruo is passionate about translating complex AI and infrastructure concepts into intuitive, user-centered experiences.

How bunq handles 97% of support with Amazon Bedrock

This post was co-authored with Benjamin Kleppe, Machine Learning Engineering Lead at bunq.
The integration of agentic AI is transforming the banking industry, marking a significant shift from traditional customer service systems. Agentic AI demonstrates autonomous decision-making capabilities in complex financial environments, enabling banks to provide round-the-clock multilingual support, process transactions, and deliver personalized financial insights at scale.
bunq is Europe’s second-largest neobank, built to make life easy for people and businesses who live an international lifestyle. Founded in 2012 by serial entrepreneur Ali Niknam, bunq has always put users at the heart of everything it does. The company helps its 20 million users across Europe spend, save, budget, and invest confidently, all within a single, user-friendly application built on user feedback.
In this post, we show how bunq upgraded Finn, its in-house generative AI assistant, using Amazon Bedrock to transform user support and banking operations to be seamless, in multiple languages and time zones.
Business challenge
Banks face a major challenge to deliver consistent, high-quality customer support across multiple channels, languages, and time zones. Traditional support systems struggle with the complexity of financial products, regulatory requirements, and the growing expectation for instant, accurate responses. Customers expect instant access to essential banking functions like transaction disputes, account management, and financial advice, and banks need to maintain strict security protocols and compliance standards.

As a user-centric bank, bunq serves users who expect round-the-clock support for their banking needs, such as requesting a refund or seeking guidance on features. Traditional support models couldn’t keep up with this demand, creating frustrating bottlenecks and straining internal resources. Beyond direct support, bunq’s team also needed efficient ways to analyze incoming feature requests and bug reports to continuously improve their system. It was clear that bunq needed a smarter solution that could provide instant, accurate assistance around the clock and help the team turn valuable user feedback into action.
Solution overview
Launched in 2023, bunq’s generative AI assistant, Finn, is fully built in-house as part of bunq’s proprietary AI stack. Finn uses leading AI foundation models (FMs) and tooling, including Anthropic’s Claude models through Amazon Bedrock. Unlike generic chatbots, Finn processes natural language and provides real-time, intelligent answers. Finn can translate the bunq application into 38 languages and translate speech-to-speech calls to the support team in real time. It can also summarize complex banking information, provide financial insights and budgeting advice, and even recognize images, automating tedious tasks such as invoice processing.

bunq’s approach uses AWS services to create a scalable AI agent infrastructure that can handle the demands of modern banking while maintaining security and compliance. The solution uses the following AWS services:

Amazon Bedrock – A fully managed service that makes high-performing FMs from leading AI companies and Amazon available through a unified API. bunq uses Amazon Bedrock to access Anthropic’s Claude models with enhanced security features, scalability, and compliance—critical requirements for banking applications.
Amazon Elastic Container Service (Amazon ECS) – A fully managed container orchestration service that makes it straightforward to deploy, manage, and scale containerized applications. Amazon ECS alleviates the need to install and operate container orchestration software or manage clusters of virtual machines, helping bunq focus on building Finn’s multi-agent architecture.
Amazon DynamoDB – A fully managed, serverless, NoSQL database service designed to run high-performance applications at scale. DynamoDB delivers single-digit millisecond performance and stores agent memory, conversation history, and session data, enabling Finn to maintain context across customer interactions.
Amazon OpenSearch Serverless – An on-demand, automatic scaling configuration for Amazon OpenSearch Service. OpenSearch Serverless automatically scales compute resources based on application needs and provides vector search capabilities for Finn’s Retrieval Augmented Generation (RAG) implementation, enabling semantic search across bunq’s knowledge base.

Building a multi-agent implementation with Amazon Bedrock
Users can interact with Finn through bunq’s application and web interface, using natural language for their requests, such as account information, transaction history, financial advice, and support issues. The system processes requests in real time, accessing only the data pertinent to the request, while maintaining strict security and privacy controls.

User support scenarios demand more than what a single AI agent can deliver. A multi-agent architecture allows specialized agents to handle distinct tasks—one agent might excel at understanding the user, another focuses on extracting relevant documentation, and a third handles transaction analysis or account operations. For Finn, this means a user asking about a failed payment can trigger a coordinated response: one agent interprets the question, another checks transaction logs, and a third suggests solutions based on similar cases. They all work together seamlessly to deliver a comprehensive answer in seconds, instead of bouncing the user between departments.

The initial multi-agent support system for banking services followed a seemingly straightforward pattern: a central router agent directed user queries to specialized sub-agents. Each agent handled specific domains—technical support, general inquiries, transaction status, account management, and so on. However, as the system grew, so did the size and complexity of the demands. As bunq added more specialized agents to handle the new ecosystem, three issues became apparent:

Routing complexity – With multiple specialized agents, the router needed increasingly sophisticated logic to determine the correct destination.
Overlapping capabilities – Multiple agents required access to the same data sources and capabilities, forcing the router to predict not just the primary intent but also which secondary agents might be needed downstream—an impossible task at scale.
Scalability bottleneck – Every new agent or capability meant updating the router’s logic. Adding a new specialized agent required comprehensive testing of all routing scenarios. The router became a single point of failure and a potential development bottleneck.

Rethinking the architecture
bunq redesigned its system around an orchestrator agent that works fundamentally differently from the old router. Instead of trying to route to all possible agents, the orchestrator performs the following actions:

Routes queries to only three to five primary agents
Empowers these primary agents to invoke other agents as tools when needed
Delegates decision-making to the agents themselves

With this agent-as-tool pattern, primary agents detect when they need specialized help. Tool agents are invoked dynamically by primary agents. Agents can call other agents through a well-defined interface—they become tools in each other’s toolkits.
The following diagram illustrates this workflow.
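The agent-as-tool pattern can also be sketched in a few lines of framework-agnostic Python: a primary agent exposes other agents behind ordinary tool calls and decides at runtime whether to invoke them. This is an illustration of the pattern only, not bunq's implementation.

from typing import Callable, Dict, Optional

class Agent:
    """Minimal agent that can call other agents exposed to it as tools."""
    def __init__(self, name: str, handle: Callable[[str, Dict[str, "Agent"]], str],
                 tools: Optional[Dict[str, "Agent"]] = None):
        self.name = name
        self.handle = handle
        self.tools = tools or {}

    def run(self, query: str) -> str:
        return self.handle(query, self.tools)

# A specialized agent acts as a tool: it only knows how to do one thing.
transactions = Agent("transactions", lambda q, _tools: f"[transaction lookup for: {q}]")

# A primary agent decides at runtime whether it needs the specialized agent.
def support_handler(query: str, tools: Dict[str, Agent]) -> str:
    if "payment" in query.lower() and "transactions" in tools:
        details = tools["transactions"].run(query)  # agent invoked as a tool
        return f"Here is what I found: {details}"
    return "Let me help with that directly."

support = Agent("support", support_handler, tools={"transactions": transactions})
print(support.run("Why did my payment fail?"))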

bunq’s Finn service uses a comprehensive AWS infrastructure designed for security, scalability, and intelligent orchestration. The following architecture diagram shows how multiple AWS services work together to deliver a multi-agent AI system.

Orchestration and agent architecture
At the core of the system is the orchestrator agent, running on Amazon Elastic Container Service (Amazon ECS). This orchestrator implements the agent-as-tool pattern, routing user queries to a limited set of primary agents rather than attempting to predict every possible scenario. The orchestrator maintains three to five primary agents (Primary Agent 1 through 5), each deployed as containerized services on Amazon ECS. This design provides horizontal scalability—as demand increases, additional agent instances can be spun up automatically. Each primary agent is empowered to invoke specialized agents as needed. These specialized agents (Specialized Agent 1, 2, 3, and so on) act as tools that primary agents can call upon for specific capabilities, such as analyzing transaction data, retrieving documentation, or processing complex queries. This hierarchical structure avoids the routing complexity bottleneck while maintaining flexibility.
Infrastructure details
The architecture is built on a robust foundation of AWS services that enable Finn’s performance. Users access the service through bunq’s application, with traffic secured by AWS WAF and Amazon CloudFront, while authentication flows through bunq’s proprietary identity system. Amazon Bedrock provides access to Anthropic’s Claude models for natural language understanding, complemented by Amazon SageMaker hosted fine-tuned models for specialized banking scenarios. Agent memory and conversation history are stored in DynamoDB, and OpenSearch Service serves as a vector store for RAG capabilities, enabling semantic search across bunq’s knowledge base. Amazon Simple Storage Service (Amazon S3) handles document storage, and Amazon MemoryDB manages user sessions for real-time interactions. Comprehensive observability through AWS CloudTrail, Amazon GuardDuty, and Amazon CloudWatch helps the team monitor performance, detect threats, and maintain compliance—all within a secure virtual private cloud (VPC).
Real-world impact
The transformation from bunq’s initial router-based architecture to the orchestrator pattern with Amazon Bedrock delivered measurable improvements across user support operations. The multi-agent deployment achieved significant operational efficiency gains:

Finn now handles 97% of bunq’s user support activity, with over 82% fully automated. Average response times dropped to just 47 seconds, helping bunq deliver the real-time solutions users expect.
The rapid deployment timeline highlights bunq’s focus on innovation. The team moved from concept to production in 3 months, starting in January 2025. bunq brought together a team of 80 people—from AI engineers to support staff—who worked together to test, learn, and deploy updates three times a day.
Before implementing the orchestrator architecture, escalations were mainly manual processes. The new multi-agent system increased automation, transforming end-to-end support metrics. Beyond that, Finn expanded bunq’s reach by translating the application into 38 languages, making banking more accessible to millions of users across Europe.
The solution enabled bunq to become Europe’s first AI-powered bank, offering capabilities no traditional support system could deliver: real-time speech-to-speech translation (a first in global banking), image recognition for receipt processing and document verification, and intelligent financial insights—all while maintaining the round-the-clock availability users demand.

“We went from concept to production in 3 months. Before the orchestrator architecture, escalations were mainly manual. Now Finn handles 97% of support with 70% fully automated and 47-second average response times.”
– Benjamin Kleppe, Machine Learning Engineering Lead at bunq.

Conclusion
bunq’s journey from manual support escalations to an intelligent multi-agent system shows how modern AI architecture can transform banking operations. By moving from a rigid router-based approach to a flexible orchestrator pattern with Amazon Bedrock, bunq avoided scalability bottlenecks while maintaining the agility needed to serve 20 million users across Europe. The orchestrator pattern with agent-as-tool capabilities proved essential to bunq’s success. Rather than predicting every possible user scenario upfront, the system empowers primary agents to dynamically invoke specialized agents as needed. This architectural shift reduced complexity, accelerated development cycles, and helped bunq deploy updates three times per day during the initial rollout. The results: 97% of support interactions handled by Finn, 70% fully automated, and average response times of just 47 seconds. Beyond efficiency gains, the solution expanded bunq’s reach to 38 languages and positioned the company as Europe’s first AI-powered bank. By freeing internal resources from manual processes, bunq can now focus on what it does best: building a bank that makes life easy for its users.
To learn more about building AI-powered applications with FMs, refer to Amazon Bedrock. Explore how Anthropic’s Claude on Amazon Bedrock can transform your customer experience with enhanced security features and scalability. Get started with the Amazon Bedrock documentation to build your own multi-agent solutions.

About the Authors
Benjamin Kleppe is Machine Learning Engineering Lead at bunq, where he leads the development and scaling of AI-powered solutions that make banking smarter and more personal for 20 million users across Europe. He focuses on building intelligent systems that enhance user experience, improve product discovery, and automate complex banking processes. Benjamin is passionate about pushing the boundaries of AI innovation in banking, having led bunq to become Europe’s first AI-powered bank with the launch of Finn, their proprietary generative AI platform.
Jagdeep Singh Soni is a Senior AI/ML Solutions Architect at AWS based in the Netherlands, specializing in generative AI and Amazon Bedrock. He helps customers and partners architect and implement intelligent agent solutions using Amazon Bedrock and other AWS AI/ML services. With 16 years of experience in innovation and cloud architecture, Jagdeep focuses on enabling organizations to build production-ready generative AI applications that use foundation models and agent frameworks for real-world business outcomes.
Guy Kfir is a generative AI Lead at AWS with over 15 years of experience in cloud technology sales, business development, and AI/ML evangelism. He works with enterprise customers, startups, and partners across EMEA to accelerate adoption of generative AI solutions and execute go-to-market strategies.

What are Context Graphs?

Knowledge Graphs and their limitations

With the rapid growth of AI applications, Knowledge Graphs (KGs) have emerged as a foundational structure for representing knowledge in a machine-readable form. They organize information as triples—a head entity, a relation, and a tail entity—forming a graph-like structure where entities are nodes and relationships are edges. This representation allows machines to understand and reason over connected knowledge, supporting intelligent applications such as question answering, semantic analysis, and recommendation systems.

Despite their effectiveness, KGs have notable limitations. They often lose important contextual information, making it difficult to capture the complexity and richness of real-world knowledge. Additionally, many KGs suffer from data sparsity, where entities and relationships are incomplete or poorly connected. This lack of full annotation limits the contextual signals available during inference, posing challenges for effective reasoning, even when integrated with large language models.

Context Graphs

Context Graphs (CGs) extend traditional Knowledge Graphs by adding extra information such as time, location, and source details. Instead of storing knowledge as isolated facts, they capture the situation in which a fact or decision occurred, leading to a clearer and more accurate understanding of real-world knowledge.

When used with agent-based systems, context graphs also store how decisions were made. Agents need more than rules—they need to know how rules were applied before, when exceptions were allowed, who approved decisions, and how conflicts were handled. Since agents operate directly where decisions happen, they can naturally record this full context.

Over time, these stored decision traces form a context graph that helps agents learn from past actions. This allows systems to understand not only what happened, but also why it happened, making agent behavior more consistent and reliable.

What are the effects of Contextual Information?

Contextual information adds important layers to knowledge representation by going beyond simple entity–relation facts. It helps distinguish between facts that look similar but occur under different conditions, such as differences in time, location, scale, or surrounding circumstances. For example, two companies may be competitors in one market or time period but not in another. By capturing such context, systems can represent knowledge in a more detailed way and avoid treating all similar-looking facts as identical.

In context graphs, contextual information also plays a key role in reasoning and decision-making. It includes signals such as historical decisions, policies applied, exceptions granted, approvals involved, and related events from other systems. When agents record how a decision was made—what data was used, which rule was checked, and why an exception was allowed—this information becomes reusable context for future decisions. Over time, these records help connect entities that are not directly linked and allow systems to reason based on past outcomes and precedents, rather than relying only on fixed rules or isolated triples.
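The difference between a bare triple and a context-enriched edge can be made concrete with a small sketch. The dictionary keys below are illustrative, not a standard context-graph schema.

# A plain knowledge-graph triple: head entity, relation, tail entity.
triple = ("CompanyA", "competes_with", "CompanyB")

# The same fact as a context-graph edge, annotated with the conditions under
# which it holds and the decision trace that produced or used it.
context_edge = {
    "head": "CompanyA",
    "relation": "competes_with",
    "tail": "CompanyB",
    "context": {
        "market": "EU cloud services",   # where the fact holds
        "valid_from": "2022-01",         # when the fact holds
        "valid_to": "2024-06",
        "source": "quarterly market report",
    },
    "decision_trace": {                   # how a decision involving this fact was made
        "policy_checked": "antitrust review checklist",
        "exception_granted": False,
        "approved_by": "compliance-team",
    },
}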

Shift from static tools to decision-making agents

There has been a clear shift in AI systems—from static tools to decision-making agents, driven largely by major industry players. Real-world decisions are rarely based on rules alone; they involve exceptions, approvals, and lessons from past cases. Context graphs address this gap by capturing how decisions are made across systems—what policies were checked, which data was used, who approved the decision, and what outcome followed. By structuring this decision history as context, agents can reuse prior judgments instead of repeatedly relearning the same edge cases. Some examples of this shift include:

Google

Gmail’s Gemini features and Gemini 3–based agent frameworks both show AI shifting from simple help to active decision-making, whether that’s managing inbox priorities or running complex workflows.

Gmail relies on conversation history and user intent, while Gemini 3 agents use memory and state to handle longer tasks. In both cases, context matters more than single prompts.

Gemini 3 acts as an orchestration layer for multi-agent systems (ADK, Agno, Letta, Eigent), similar to how Gemini orchestrates summarization, writing, and prioritization inside Gmail.

Features like AI Inbox and Suggested Replies rely on persistent understanding of user behavior, just as agent frameworks like Letta and mem0 rely on stateful memory to prevent context loss and ensure consistent behavior.

Gmail turns email into actionable summaries and to-dos, while Gemini-powered agents automate browsers, workflows, and enterprise tasks—both reflecting a broader shift toward AI systems that act, not just respond.

OpenAI

ChatGPT Health brings health data from different sources—medical records, apps, wearables, and notes—into one place. This creates a clear, shared context that helps the system understand health patterns over time instead of answering isolated questions, similar to how context graphs link facts with their context.

By using personal health history and past interactions, ChatGPT Health helps users make better-informed decisions, such as preparing for doctor visits or understanding test results.

Health runs in a separate, secure space, keeping sensitive information private and contained. This ensures health context stays accurate and protected, which is essential for safely using context-based systems like context graphs.

JP Morgan

JP Morgan’s move to replace proxy advisors with its AI tool, Proxy IQ, shows a shift toward building in-house decision systems that aggregate and analyze voting data across thousands of meetings, rather than relying on third-party recommendations.

By analyzing proxy data internally, the firm can incorporate historical voting behavior, company-specific details, and firm-level policies—aligning with the idea of context graphs that preserve how decisions are formed over time.

Internal AI-based analysis gives JP Morgan more transparency, speed, and consistency in proxy voting, reflecting a broader move toward context-aware, AI-driven decision-making in enterprise settings.

NVIDIA

NVIDIA’s NeMo Agent Toolkit helps turn AI agents into production-ready systems by adding observability, evaluation, and deployment controls. By capturing execution traces, reasoning steps, and performance signals, it records how an agent arrived at an outcome—not just the final result—aligning closely with the idea of context graphs.

Tools like OpenTelemetry tracing and structured evaluations convert agent behavior into usable context. This makes it easier to debug decisions, compare different runs, and steadily improve reliability.

Similar to how DLSS 4.5 integrates AI deeply into real-time graphics pipelines, NAT integrates AI agents into enterprise workflows. Both highlight a broader shift toward AI systems that retain state, history, and context, which is critical for dependable, large-scale deployment.

Microsoft

Copilot Checkout and Brand Agents turn shopping conversations into direct purchases. Questions, comparisons, and decisions happen in one place, creating clear context around why a customer chose a product.

These AI agents operate exactly where buying decisions happen—inside chats and brand websites—allowing them to guide users and complete checkout without extra steps.

Merchants keep control of transactions and customer data. Over time, these interactions build useful context about customer intent and buying patterns, helping future decisions become faster and more accurate.


A Coding Guide to Anemoi-Style Semi-Centralized Agentic Systems Using Peer-to-Peer Critic Loops in LangGraph

In this tutorial, we demonstrate how a semi-centralized Anemoi-style multi-agent system works by letting two peer agents negotiate directly without a manager or supervisor. We show how a Drafter and a Critic iteratively refine an output through peer-to-peer feedback, reducing coordination overhead while preserving quality. We implement this pattern end-to-end in Colab using LangGraph, focusing on clarity, control flow, and practical execution rather than abstract orchestration theory. Check out the FULL CODES here.

!pip -q install -U langgraph langchain-openai langchain-core

import os
import json
from getpass import getpass
from typing import TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ")

MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
llm = ChatOpenAI(model=MODEL, temperature=0.2)

We set up the Colab environment by installing the required LangGraph and LangChain packages and securely collecting the OpenAI API key as a hidden input. We initialize the language model that will be shared by all agents, keeping the configuration minimal and reproducible. Check out the FULL CODES here.

class AnemoiState(TypedDict):
    task: str
    max_rounds: int
    round: int
    draft: str
    critique: str
    agreed: bool
    final: str
    trace: bool

We define a typed state that acts as the shared communication surface between agents during negotiation. We explicitly track the task, draft, critique, agreement flag, and iteration count to keep the flow transparent and debuggable. This state obviates the need for a central manager or for implicit memory. Check out the FULL CODES here.

DRAFTER_SYSTEM = """You are Agent A (Drafter) in a peer-to-peer loop.
You write a high-quality solution to the user's task.
If you receive critique, you revise decisively and incorporate it.
Return only the improved draft text."""

def drafter_node(state: AnemoiState) -> AnemoiState:
    task = state["task"]
    critique = state.get("critique", "").strip()
    r = state.get("round", 0) + 1

    if critique:
        user_msg = f"""TASK:
{task}

CRITIQUE:
{critique}

Revise the draft."""
    else:
        user_msg = f"""TASK:
{task}

Write the first draft."""

    draft = llm.invoke(
        [
            {"role": "system", "content": DRAFTER_SYSTEM},
            {"role": "user", "content": user_msg},
        ]
    ).content.strip()

    if state.get("trace", False):
        print(f"\n--- Drafter Round {r} ---\n{draft}\n")

    return {**state, "round": r, "draft": draft, "agreed": False}

We implement the Drafter agent, which produces the initial response and revises it whenever peer feedback is available. We keep the Drafter focused purely on improving the user-facing draft, without awareness of control logic or termination conditions. It mirrors the Anemoi idea of agents optimizing locally while observing peer signals. Check out the FULL CODES here.

CRITIC_SYSTEM = """You are Agent B (Critic).
Return strict JSON:
{"agree": true/false, "critique": "..."}"""

def critic_node(state: AnemoiState) -> AnemoiState:
    task = state["task"]
    draft = state.get("draft", "")

    raw = llm.invoke(
        [
            {"role": "system", "content": CRITIC_SYSTEM},
            {
                "role": "user",
                "content": f"TASK:\n{task}\n\nDRAFT:\n{draft}",
            },
        ]
    ).content.strip()

    cleaned = raw.strip("`").replace("json", "").strip()

    try:
        data = json.loads(cleaned)
        agree = bool(data.get("agree", False))
        critique = str(data.get("critique", "")).strip()
    except Exception:
        agree = False
        critique = raw

    if state.get("trace", False):
        print(f"--- Critic Decision ---\nAGREE: {agree}\n{critique}\n")

    final = draft if agree else state.get("final", "")
    return {**state, "agreed": agree, "critique": critique, "final": final}

We implement the Critic agent, which evaluates the draft and decides whether it is ready to ship or needs revision. We enforce a strict agree-or-revise decision to avoid vague feedback and ensure fast convergence. This peer evaluation step allows quality control without introducing a supervisory agent. Check out the FULL CODES here.

def continue_or_end(state: AnemoiState) -> str:
    if state.get("agreed", False):
        return "end"
    if state.get("round", 0) >= state.get("max_rounds", 3):
        return "force_ship"
    return "loop"

def force_ship_node(state: AnemoiState) -> AnemoiState:
    return {**state, "final": state.get("final") or state.get("draft", "")}

graph = StateGraph(AnemoiState)
graph.add_node("drafter", drafter_node)
graph.add_node("critic", critic_node)
graph.add_node("force_ship", force_ship_node)

graph.set_entry_point("drafter")
graph.add_edge("drafter", "critic")
graph.add_conditional_edges(
    "critic",
    continue_or_end,
    {"loop": "drafter", "force_ship": "force_ship", "end": END},
)
graph.add_edge("force_ship", END)

anemoi_critic_loop = graph.compile()

demo_task = """Explain the Anemoi semi-centralized agent pattern and why peer-to-peer critic loops reduce bottlenecks."""

result = anemoi_critic_loop.invoke(
    {
        "task": demo_task,
        "max_rounds": 3,
        "round": 0,
        "draft": "",
        "critique": "",
        "agreed": False,
        "final": "",
        "trace": False,
    }
)

print("\n====================")
print("  FINAL OUTPUT")
print("====================\n")
print(result["final"])

We assemble the LangGraph workflow that routes control between Drafter and Critic until agreement is reached or the maximum number of rounds is exhausted. We rely on simple conditional routing rather than centralized planning, thereby preserving the system’s semi-centralized nature. Finally, we execute the graph and return the best available output to the user.

In conclusion, we demonstrated that Anemoi-style peer negotiation is a practical alternative to manager-worker architectures, offering lower latency, reduced context bloat, and simpler agent coordination. By allowing agents to monitor and correct each other directly, we achieved convergence with fewer tokens and less orchestration complexity. In this tutorial, we provided a reusable blueprint for building scalable, semi-centralized agent systems. It lays the foundation for extending the pattern to multi-peer meshes, red-team loops, or protocol-based agent interoperability.


Zhipu AI Releases GLM-4.7-Flash: A 30B-A3B MoE Model for Efficient Local Coding and Agents

GLM-4.7-Flash is a new member of the GLM 4.7 family and targets developers who want strong coding and reasoning performance in a model that is practical to run locally. Zhipu AI (Z.ai) describes GLM-4.7-Flash as a 30B-A3B MoE model and presents it as the strongest model in the 30B class, designed for lightweight deployment where performance and efficiency both matter.

Model class and position inside the GLM 4.7 family

GLM-4.7-Flash is a text generation model with 31B params, BF16 and F32 tensor types, and the architecture tag glm4_moe_lite. It supports English and Chinese, and it is configured for conversational use. GLM-4.7-Flash sits in the GLM-4.7 collection next to the larger GLM-4.7 and GLM-4.7-FP8 models.

Z.ai positions GLM-4.7-Flash as a free tier and lightweight deployment option relative to the full GLM-4.7 model, while still targeting coding, reasoning, and general text generation tasks. This makes it interesting for developers who cannot deploy a 358B class model but still want a modern MoE design and strong benchmark results.

Architecture and context length

In a Mixture of Experts architecture of this type, the model stores more parameters than it activates for each token. That allows specialization across experts while keeping the effective compute per token closer to a smaller dense model.

GLM-4.7-Flash supports a context length of 128k tokens and achieves strong performance on coding benchmarks among models of similar scale. This context size is suitable for large codebases, multi-file repositories, and long technical documents, where many existing models would need aggressive chunking.

GLM-4.7-Flash uses a standard causal language modeling interface and a chat template, which allows integration into existing LLM stacks with minimal changes.
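Because the model exposes a standard causal LM interface and chat template, a minimal Transformers sketch could look like the following. It assumes the model ID from the Hugging Face link later in this section and a sufficiently recent transformers release; check the model card for exact requirements (for example, version pins or trust_remote_code). The sampling values mirror the default settings reported below.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.7-Flash"  # from the Hugging Face link below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reported default settings for most tasks: temperature 1.0, top_p 0.95.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=1.0, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))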

Benchmark performance in the 30B class

The Z.ai team compares GLM-4.7-Flash with Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash leads or is competitive across a mix of math, reasoning, long horizon, and coding agent benchmarks.

https://huggingface.co/zai-org/GLM-4.7-Flash

The benchmark table on the model card linked above shows why GLM-4.7-Flash is one of the strongest models in the 30B class, at least among the models included in this comparison. The important point is that GLM-4.7-Flash is not only a compact deployment of GLM but also a high-performing model on established coding and agent benchmarks.

Evaluation parameters and thinking mode

For most tasks, the default settings are: temperature 1.0, top p 0.95, and max new tokens 131072. This defines a relatively open sampling regime with a large generation budget.

For Terminal Bench and SWE-bench Verified, the configuration uses temperature 0.7, top p 1.0, and max new tokens 16384. For τ²-Bench, the configuration uses temperature 0 and max new tokens 16,384. These stricter settings reduce randomness for tasks that need stable tool use and multi step interaction.
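
For reference, these documented presets can be captured as plain configuration dictionaries; the key names below are illustrative rather than an official API:

# Sampling presets as reported for GLM-4.7-Flash evaluation (key names are illustrative).
DEFAULT_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 131072}
CODE_AGENT_SAMPLING = {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384}  # Terminal Bench, SWE-bench Verified
TAU2_BENCH_SAMPLING = {"temperature": 0.0, "max_new_tokens": 16384}                # τ²-Bench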

The Z.ai team also recommends turning on Preserved Thinking mode for multi turn agentic tasks such as τ²-Bench and Terminal Bench 2. This mode preserves internal reasoning traces across turns. That is useful when you build agents that need long chains of function calls and corrections.

How GLM-4.7-Flash fits developer workflows

GLM-4.7-Flash combines several properties that are relevant for agentic, coding focused applications:

A 30B-A3B MoE architecture with 31B params and a 128k token context length.

Strong benchmark results on AIME 25, GPQA, SWE-bench Verified, τ²-Bench, and BrowseComp compared to other models in the same table.

Documented evaluation parameters and a Preserved Thinking mode for multi turn agent tasks.

First class support for vLLM, SGLang, and Transformers based inference, with ready to use commands.

A growing set of finetunes and quantizations, including MLX conversions, in the Hugging Face ecosystem.

Check out the Model weights.
The post Zhipu AI Releases GLM-4.7-Flash: A 30B-A3B MoE Model for Efficient Local Coding and Agents appeared first on MarkTechPost.

Introducing multimodal retrieval for Amazon Bedrock Knowledge Bases

We are excited to announce the general availability of multimodal retrieval for Amazon Bedrock Knowledge Bases. This new capability adds native support for video and audio content, on top of text and images. With it you can build Retrieval Augmented Generation (RAG) applications that can search and retrieve information across text, images, audio, and video—all within a fully managed service.
Modern enterprises store valuable information in multiple formats. Product documentation includes diagrams and screenshots, training materials contain instructional videos, and customer insights are captured in recorded meetings. Until now, building artificial intelligence (AI) applications that could effectively search across these content types required complex custom infrastructure and significant engineering effort.
Previously, Bedrock Knowledge Bases used text-based embedding models for retrieval. While it supported text documents and images, images had to be processed using foundation models (FMs) or Bedrock Data Automation to generate text descriptions—a text-first approach that lost visual context and prevented visual search capabilities. Video and audio required custom external preprocessing pipelines. Now, with multimodal embeddings, the retriever natively supports text, images, audio, and video within a single embedding model.
With multimodal retrieval in Bedrock Knowledge Bases, you can now ingest, index, and retrieve information from text, images, video, and audio using a single, unified workflow. Content is encoded using multimodal embeddings that preserve visual and audio context, enabling your applications to find relevant information across media types. You can even search using an image to find visually similar content or locate specific scenes in videos.
In this post, we’ll guide you through building multimodal RAG applications. You’ll learn how multimodal knowledge bases work, how to choose the right processing strategy based on your content type, and how to configure and implement multimodal retrieval using both the console and code examples.
Understanding multimodal knowledge bases
Amazon Bedrock Knowledge Bases automates the complete RAG workflow: ingesting content from your data sources, parsing and chunking it into searchable segments, converting chunks to vector embeddings, and storing them in a vector database. During retrieval, user queries are embedded and matched against stored vectors to find semantically similar content, which augments the prompt sent to your foundation model.
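
As a concrete illustration of this managed workflow, the following sketch queries a knowledge base end to end with the RetrieveAndGenerate API in boto3; the knowledge base ID and model ARN are placeholders you would replace with your own values:

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "Which phone covers come in a metallic finish?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])
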
With multimodal retrieval, this workflow now handles images, video, and audio alongside text through two processing approaches. Amazon Nova Multimodal Embeddings encodes content natively into a unified vector space, enabling cross-modal retrieval where you can query with text to retrieve videos, or search using images to find visual content.
Alternatively, Bedrock Data Automation converts multimedia into rich text descriptions and transcripts before embedding, providing high-accuracy retrieval over spoken content. Your choice depends on whether visual context or speech precision matters most for your use case.

We explore each of these approaches in this post.
Amazon Nova Multimodal Embeddings
Amazon Nova Multimodal Embeddings is the first unified embedding model that encodes text, documents, images, video, and audio into a single shared vector space. Content is processed natively without text conversion. The model supports up to 8,172 tokens for text and 30 seconds for video/audio segments, handles over 200 languages, and offers four embedding dimensions (3,072 by default, plus 1,024, 384, and 256) to balance accuracy and efficiency. Bedrock Knowledge Bases automatically segments video and audio into configurable chunks (5-30 seconds), with each segment independently embedded.

For video content, Nova embeddings capture visual elements—scenes, objects, motion, and actions—as well as audio characteristics like music, sounds, and ambient noise. For videos where spoken dialogue is important to your use case, you can use Bedrock Data Automation to extract transcripts alongside visual descriptions. For standalone audio files, Nova processes acoustic features such as music, environmental sounds, and audio patterns. The cross-modal capability enables use cases such as describing a visual scene in text to retrieve matching videos, uploading a reference image to find similar products, or locating specific actions in footage—all without pre-existing text descriptions.
Best for: Product catalogs, visual search, manufacturing videos, sports footage, security cameras, and scenarios where visual content drives the use case.
Amazon Bedrock Data Automation
Bedrock Data Automation takes a different approach by converting multimedia content into rich textual representations before embedding. For images, it generates detailed descriptions including objects, scenes, text within images, and spatial relationships. For video, it produces scene-by-scene summaries, identifies key visual elements, and extracts the on-screen text. For audio and video with speech, Bedrock Data Automation provides accurate transcriptions with timestamps and speaker identification, along with segment summaries that capture the key points discussed.

Once converted to text, this content is chunked and embedded using text embedding models like Amazon Titan Text Embeddings or Amazon Nova Multimodal Embeddings. This text-first approach enables highly accurate question-answering over spoken content—when users ask about specific statements made in a meeting or topics discussed in a podcast, the system searches through precise transcripts rather than audio embeddings. This makes it particularly valuable for compliance scenarios where you need exact quotes and verbatim records for audit trails, meeting analysis, customer support call mining, and use cases where you need to retrieve and verify specific spoken information.
Best for: Meetings, webinars, interviews, podcasts, training videos, support calls, and scenarios requiring precise retrieval of specific statements or discussions.
Use case scenario: Visual product search for e-commerce
Multimodal knowledge bases can be used for applications ranging from enhanced customer experiences and employee training to maintenance operations and legal analysis. Traditional e-commerce search relies on text queries, requiring customers to articulate what they’re looking for with the right keywords. This breaks down when they’ve seen a product elsewhere, have a photo of something they like, or want to find items similar to what appears in a video.
Now, customers can search your product catalog using text descriptions, upload an image of an item they’ve photographed, or reference a scene from a video to find matching products. The system retrieves visually similar items by comparing the embedded representation of their query—whether text, image, or video—against the multimodal embeddings of your product inventory.
For this scenario, Amazon Nova Multimodal Embeddings is the ideal choice. Product discovery is fundamentally visual—customers care about colors, styles, shapes, and visual details. By encoding your product images and videos into the Nova unified vector space, the system matches based on visual similarity without relying on text descriptions that might miss subtle visual characteristics. While a complete recommendation system would incorporate customer preferences, purchase history, and inventory availability, retrieval from a multimodal knowledge base provides the foundational capability: finding visually relevant products regardless of how customers choose to search.
Console walkthrough
In the following section, we walk through the high-level steps to set up and test a multimodal knowledge base for our e-commerce product search example. We create a knowledge base containing smartphone product images and videos, then demonstrate how customers can search using text descriptions, uploaded images, or video references. The GitHub repository provides a guided notebook that you can follow to deploy this example in your account.
Prerequisites
Before you get started, make sure that you have the following prerequisites:

An AWS Account with appropriate service access
An AWS Identity and Access Management (IAM) role with the appropriate permissions to access Amazon Bedrock and Amazon Simple Storage Service (Amazon S3)

Provide the knowledge base details and data source type
Start by opening the Amazon Bedrock console and creating a new knowledge base. Provide a descriptive name for your knowledge base and select your data source type—in this case, Amazon S3 where your product images and videos are stored.

Configure data source
Connect your S3 bucket containing product images and videos. For the parsing strategy, select Amazon Bedrock default parser. Since we’re using Nova Multimodal Embeddings, the images and videos are processed natively and embedded directly into the unified vector space, preserving their visual characteristics without conversion to text.

Configure data storage and processing
Select Amazon Nova Multimodal Embeddings as your embedding model. This unified embedding model encodes both your product images and customer queries into the same vector space, enabling cross-modal retrieval where text queries can retrieve images and image queries can find visually similar products. For this example, we use Amazon S3 Vectors as the vector store (you could optionally use other available vector stores), which provides cost-effective and durable storage optimized for large-scale vector data sets while maintaining sub-second query performance. You also need to configure the multimodal storage destination by specifying an S3 location. Knowledge Bases uses this location to store extracted images and other media from your data source. When users query the knowledge base, relevant media is retrieved from this storage.

Review and create
Review your configuration settings including the knowledge base details, data source configuration, embedding model selection—we’re using Amazon Nova Multimodal Embeddings v1 with 3,072 vector dimensions (higher dimensions provide richer representations; you can use lower dimensions like 1,024, 384, or 256 to optimize for storage and cost)—and vector store setup (Amazon S3 Vectors). Once everything looks correct, create your knowledge base.
Create an ingestion job
Once created, initiate the sync process to ingest your product catalog. The knowledge base processes each image and video, generates embeddings and stores them in the managed vector database. Monitor the sync status to confirm the documents are successfully indexed.
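
If you prefer to trigger and monitor the sync programmatically, a minimal sketch with boto3 looks like the following; the knowledge base and data source IDs are placeholders:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Start an ingestion job (sync) for the data source.
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="YOUR_KB_ID",
    dataSourceId="YOUR_DATA_SOURCE_ID",
)
job_id = job["ingestionJob"]["ingestionJobId"]

# Check the job status; poll until it reports COMPLETE.
status = bedrock_agent.get_ingestion_job(
    knowledgeBaseId="YOUR_KB_ID",
    dataSourceId="YOUR_DATA_SOURCE_ID",
    ingestionJobId=job_id,
)["ingestionJob"]["status"]
print(status)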

Test the knowledge base using text as input in your prompt
With your knowledge base ready, test it using a text query in the console. Search with product descriptions like “A metallic phone cover” (or an equivalent query that is relevant to your product media) to verify that text-based retrieval works correctly across your catalog.

Test the knowledge base using a reference image and retrieve different modalities
Now for the powerful part—visual search. Upload a reference image of a product you want to find. For example, imagine you saw a cell phone cover on another website and want to find similar items in your catalog. Simply upload the image without an additional text prompt.

The multimodal knowledge base extracts visual features from your uploaded image and retrieves visually similar products from your catalog. As you can see in the results, the system returns phone covers with similar design patterns, colors, or visual characteristics. Notice the metadata associated with each chunk in the Source details panel. For video results, the x-amz-bedrock-kb-chunk-start-time-in-millis and x-amz-bedrock-kb-chunk-end-time-in-millis fields indicate the exact temporal location of the segment within the source video. When building applications programmatically, you can use these timestamps to extract and display the specific video segment that matched the query, enabling features like “jump to relevant moment” or clip generation directly from your source videos. This cross-modal capability transforms the shopping experience—customers no longer need to describe what they’re looking for with words; they can show you.
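
To use those timestamps programmatically, you can call the Retrieve API and read each chunk’s metadata. The sketch below uses a text query (the request shape for image queries is not shown here) and treats the knowledge base ID as a placeholder:

import boto3

runtime = boto3.client("bedrock-agent-runtime")

resp = runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={"text": "metallic rose gold phone cover"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

for result in resp["retrievalResults"]:
    meta = result.get("metadata", {})
    start = meta.get("x-amz-bedrock-kb-chunk-start-time-in-millis")
    end = meta.get("x-amz-bedrock-kb-chunk-end-time-in-millis")
    if start is not None and end is not None:
        # Video chunk: these bounds let you jump to or clip the matching segment.
        print(f"Video segment {start}-{end} ms, score={result.get('score')}")
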
Test the knowledge base using a reference image and retrieve different modalities using Bedrock Data Automation
Now we look at what the results would look like if you configured Bedrock Data Automation parsing during the data source setup. In the following screenshot, notice the transcript section in the Source details panel.

For each retrieved video chunk, Bedrock Data Automation automatically generates a detailed text description—in this example, describing the smartphone’s metallic rose gold finish, studio lighting, and visual characteristics. This transcript appears directly in the test window alongside the video, providing rich textual context. You get both visual similarity matching from the multimodal embeddings and detailed product descriptions that can answer specific questions about features, colors, materials, and other attributes visible in the video.
Clean-up
To clean up your resources, complete the following steps, starting with deleting the knowledge base:

On the Amazon Bedrock console, choose Knowledge Bases
Select your Knowledge Base and note both the IAM service role name and S3 Vector index ARN
Choose Delete and confirm

To delete the S3 Vectors vector store, use the following AWS Command Line Interface (AWS CLI) commands:

aws s3vectors delete-index --vector-bucket-name YOUR_VECTOR_BUCKET_NAME --index-name YOUR_INDEX_NAME --region YOUR_REGION
aws s3vectors delete-vector-bucket --vector-bucket-name YOUR_VECTOR_BUCKET_NAME --region YOUR_REGION

On the IAM console, find the role noted earlier
Select and delete the role

To delete the sample dataset:

On the Amazon S3 console, find your S3 bucket
Select and delete the files you uploaded for this tutorial

Conclusion
Multimodal retrieval for Amazon Bedrock Knowledge Bases removes the complexity of building RAG applications that span text, images, video, and audio. With native support for video and audio content, you can now build comprehensive knowledge bases that unlock insights from your enterprise data—not just text documents.
The choice between Amazon Nova Multimodal Embeddings and Bedrock Data Automation gives you flexibility to optimize for your specific content. The Nova unified vector space enables cross-modal retrieval for visual-driven use cases, while the Bedrock Data Automation text-first approach delivers precise transcription-based retrieval for speech-heavy content. Both approaches integrate seamlessly into the same fully managed workflow, alleviating the need for custom preprocessing pipelines.
Availability
Region availability depends on the features selected for multimodal support. Refer to the documentation for details.
Next steps
Get started with multimodal retrieval today:

Explore the documentation: Review the Amazon Bedrock Knowledge Bases documentation and Amazon Nova User Guide for additional technical details.
Experiment with code examples: Check out the Amazon Bedrock samples repository for hands-on notebooks demonstrating multimodal retrieval.
Learn more about Nova: Read the Amazon Nova Multimodal Embeddings announcement for deeper technical insights.

About the authors
Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS). He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock and Bedrock AgentCore.
Pallavi Nargund is a Principal Solutions Architect at AWS. She is a generative AI lead for US Greenfield and leads the AWS for Legal Tech team. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Pallavi holds a Bachelor’s of Engineering from the University of Pune, India. She lives in Edison, New Jersey, with her husband, two girls, and her two pups.
Jean-Pierre Dodel is a Principal Product Manager for Amazon Bedrock, Amazon Kendra, and Amazon Quick Index. He brings 15 years of Enterprise Search and AI/ML experience to the team, with prior work at Autonomy, HP, and search startups before joining Amazon 8 years ago. JP is currently focusing on innovations for multimodal RAG, agentic retrieval, and structured RAG.

Nous Research Releases NousCoder-14B: A Competitive Olympiad Programmi …

Nous Research has introduced NousCoder-14B, a competitive olympiad programming model that is post-trained on Qwen3-14B using reinforcement learning (RL) with verifiable rewards. On the LiveCodeBench v6 benchmark, which covers problems from 08/01/2024 to 05/01/2025, the model reaches a Pass@1 accuracy of 67.87 percent. This is 7.08 percentage points higher than the Qwen3-14B baseline of 60.79 percent on the same benchmark. The research team trained the model on 24k verifiable coding problems using 48 B200 GPUs over 4 days, and released the weights under the Apache 2.0 license on Hugging Face.

https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/

Benchmark focus and what Pass@1 means

LiveCodeBench v6 is designed for competitive programming evaluation. The test split used here contains 454 problems. The training set uses the same recipe as the DeepCoder-14B project from Agentica and Together AI. It combines problems from TACO Verified, PrimeIntellect SYNTHETIC 1, and LiveCodeBench problems created before 07/31/2024.

The benchmark only includes competitive programming style tasks. For each problem, a solution must respect strict time and memory limits and must pass a large set of hidden input output tests. Pass@1 is the fraction of problems where the first generated program passes all tests, including time and memory constraints.
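
In code, Pass@1 under this protocol reduces to a simple fraction over per-problem booleans, as in this small sketch:

def pass_at_1(results):
    """results[i] is True if the first generated program for problem i
    passed every hidden test within the time and memory limits."""
    return sum(results) / len(results) if results else 0.0

# Example: solving 308 of the 454 test problems on the first attempt gives about 67.8 percent.
print(f"{pass_at_1([True] * 308 + [False] * 146):.4f}")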

https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/

Dataset construction for execution based RL

All datasets used for training are composed of verifiable code generation problems. Each problem has a reference implementation and many test cases. The training set contains 24k problems drawn from:

TACO Verified

PrimeIntellect SYNTHETIC 1

LiveCodeBench problems that come before 07/31/2024

The test set is LiveCodeBench v6, which has 454 problems between 08/01/2024 and 05/01/2025.

Every problem is a complete competitive programming task with a description, input format, output format, and test cases. This setup is important for RL because it gives a binary reward signal that is cheap to compute once the code has run.

RL environment with Atropos and Modal

The RL environment is built using the Atropos framework. NousCoder-14B is prompted using the standard LiveCodeBench prompt format, and it generates Python code for each problem. Each rollout receives a scalar reward that depends on test case results:

Reward 1 when the generated code passes all test cases for that problem

Reward −1 when the code outputs a wrong answer, exceeds a 15 second time limit, or exceeds a 4 GB memory limit on any test case
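
A minimal sketch of this reward rule, assuming a hypothetical per-test result format produced by the sandbox (the field names below are illustrative):

def rollout_reward(test_results):
    """test_results holds one dict per test case, for example
    {"passed": True, "time_s": 3.2, "mem_gb": 0.8} (hypothetical shape)."""
    for t in test_results:
        wrong = not t["passed"]
        too_slow = t["time_s"] > 15.0   # 15 second time limit
        too_big = t["mem_gb"] > 4.0     # 4 GB memory limit
        if wrong or too_slow or too_big:
            return -1.0
    return 1.0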

To execute untrusted code safely and at scale, the team uses Modal as an autoscaled sandbox. In the main design the research team settled on, the system launches one Modal container per rollout. Each container runs all test cases for that rollout. This avoids mixing training compute with verification compute and keeps the RL loop stable.

The research team also pipelines inference and verification. When an inference worker finishes a generation, it sends the completion to a Modal verifier and immediately starts a new generation. With many inference workers and a fixed pool of Modal containers, this design keeps the training loop inference compute bound instead of verification bound.

The team discusses 3 verification parallelization strategies: one container per problem, one per rollout, and one per test case. They ultimately avoid the per test case setting because of container launch overhead and use an approach where each container evaluates many test cases, checking a small set of the hardest test cases first. If any of these fail, the system can stop verification early.

GRPO objectives, DAPO, GSPO, and GSPO+

NousCoder-14B uses Group Relative Policy Optimization (GRPO), which does not require a separate value model. On top of GRPO, the research team tests 3 objectives: Dynamic sAmpling Policy Optimization (DAPO), Group Sequence Policy Optimization (GSPO), and a modified GSPO variant called GSPO+.

All 3 objectives share the same definition of advantage. The advantage for each rollout is the reward for that rollout normalized by the mean and standard deviation of rewards inside the group. DAPO applies importance weighting and clipping at the token level, and introduces three main changes relative to GRPO:

A clip higher rule that increases exploration for low probability tokens

A token level policy gradient loss that gives each token equal weight

Dynamic sampling, where groups that are all correct or all incorrect are dropped because they carry zero advantage
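
The shared advantage definition is easy to write down explicitly. The sketch below normalizes rewards within one group of rollouts; the small epsilon is an implementation detail for numerical stability, not something the post specifies:

import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r_i - mean) / std over the rollouts of one problem."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Dynamic sampling in DAPO drops groups whose rewards are all equal
# (all correct or all incorrect), because their advantages are all zero.
print(group_advantages([1.0, -1.0, -1.0, 1.0]))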

GSPO moves the importance weighting to the sequence level. It defines a sequence importance ratio that aggregates token ratios over the whole program. GSPO+ keeps sequence level correction, but it rescales gradients so that tokens are weighted equally regardless of sequence length.

On LiveCodeBench v6, the differences between these objectives are modest. At a context length of 81,920 tokens, DAPO reaches a Pass@1 of 67.87 percent while GSPO and GSPO+ reach 66.26 percent and 66.52 percent. At 40,960 tokens, all 3 objectives cluster around 63 percent Pass@1.

Iterative context extension and overlong filtering

Qwen3-14B supports long context and the training follows an iterative context extension schedule. The team first trains the model with a 32k context window and then continues training at the maximum Qwen3-14B context window of 40k. At each stage they select the checkpoint with the best LiveCodeBench score at 40k context and then use YaRN context extension at evaluation time to reach 80k tokens (81,920 tokens).

A key trick is overlong filtering. When a generated program exceeds the maximum context window, they reset its advantage to zero. This removes that rollout from the gradient signal rather than penalizing it. The research team reports that this approach avoids pushing the model toward shorter solutions for purely optimization reasons and helps maintain quality when they scale context length at test time.
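
The overlong filtering trick amounts to zeroing the advantage of any rollout whose generation hits the context limit, so it contributes nothing to the gradient. A small sketch:

import numpy as np

def filter_overlong(advantages, lengths, max_len):
    """Zero out advantages of rollouts that exceeded the context window,
    instead of treating truncation as a failure signal."""
    keep = np.asarray(lengths) <= max_len
    return np.asarray(advantages) * keep

adv = [0.9, -1.1, 0.3]
lengths = [30_000, 45_000, 39_000]
print(filter_overlong(adv, lengths, max_len=40_960))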

Key Takeaways

NousCoder-14B is a Qwen3-14B based competitive programming model trained with execution based RL. It reaches 67.87 percent Pass@1 on LiveCodeBench v6, a 7.08 percentage point gain over the Qwen3-14B baseline of 60.79 percent on the same benchmark.

The model is trained on 24k verifiable coding problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and pre 07 31 2024 LiveCodeBench tasks, and evaluated on a disjoint LiveCodeBench v6 test set of 454 problems from 08/01/2024 to 05/01/2025.

The RL setup uses Atropos, with Python solutions executed in sandboxed containers, a simple reward of 1 for solving all test cases and minus 1 for any failure or resource limit breach, and a pipelined design where inference and verification run asynchronously.

Group Relative Policy Optimization objectives DAPO, GSPO, and GSPO+ are used for long context code RL, all operate on group normalized rewards, and show similar performance, with DAPO reaching the best Pass@1 at the longest 81,920 token context.

The training uses iterative context extension, first at 32k then at 40k tokens, along with YaRN based extension at evaluation time to 81,920 tokens, includes overlong rollout filtering for stability, and ships as a fully reproducible open stack with Apache 2.0 weights and RL pipeline code.

Check out the Model Weights and Technical details.
The post Nous Research Releases NousCoder-14B: A Competitive Olympiad Programming Model Post-Trained on Qwen3-14B via Reinforcement Learning appeared first on MarkTechPost.

Vercel Releases Agent Skills: A Package Manager For AI Coding Agents W …

Vercel has released agent-skills, a collection that turns best practice playbooks into reusable skills for AI coding agents. The project follows the Agent Skills specification and focuses first on React and Next.js performance, web design review, and claimable deployments on Vercel. Skills are installed with a command that feels similar to npm, and are then discovered by compatible agents during normal coding flows.

Agent Skills format

Agent Skills is an open format for packaging capabilities for AI agents. A skill is a folder that contains instructions and optional scripts. The format is designed so that different tools can understand the same layout.

A typical skill in vercel-labs/agent-skills has three main components:

SKILL.md for natural language instructions that describe what the skill does and how it should behave

a scripts directory for helper commands that the agent can call to inspect or modify the project

an optional references directory with additional documentation or examples

react-best-practices also compiles its individual rule files into a single AGENTS.md file. This file is optimized for agents. It aggregates the rules into one document that can be loaded as a knowledge source during a code review or refactor. This removes the need for ad-hoc prompt engineering per project.

Core skills in vercel-labs/agent-skills

The repository currently presents three main skills that target common front end workflows:

1. react-best-practices

This skill encodes React and Next.js performance guidance as a structured rule library. It contains more than 40 rules grouped into 8 categories. These cover areas such as elimination of network waterfalls, bundle size reduction, server side performance, client side data fetching, re-render behavior, rendering performance, and JavaScript micro optimizations.

Each rule includes an impact rating. Critical issues are listed first, then lower impact changes. Rules are expressed with concrete code examples that show an anti pattern and a corrected version. When a compatible agent reviews a React component, it can map findings directly onto these rules.

2. web-design-guidelines

This skill is focused on user interface and user experience quality. It includes more than 100 rules that span accessibility, focus handling, form behavior, animation, typography, images, performance, navigation, dark mode, touch interaction, and internationalization.

During a review, an agent can use these rules to detect missing ARIA attributes, incorrect label associations for form controls, misuse of animation when the user requests reduced motion, missing alt text or lazy loading on images, and other issues that are easy to miss during manual review.

3. vercel-deploy-claimable

This skill connects the agent review loop to deployment. It can package the current project into a tarball, auto detect the framework based on package.json, and create a deployment on Vercel. The script can recognize more than 40 frameworks and also supports static HTML sites.

The skill returns two URLs. One is a preview URL for the deployed site. The other is a claim URL. The claim URL allows a user or team to attach the deployment to their Vercel account without sharing credentials from the original environment.

Installation and integration flow

Skills can be installed from the command line. The launch announcement highlights a simple path:

npx skills i vercel-labs/agent-skills

This command fetches the agent-skills repository and prepares it as a skills package.

Vercel and the surrounding ecosystem also provide an add-skill CLI that is designed to wire skills into specific agents. A typical flow looks like this:

npx add-skill vercel-labs/agent-skills

add-skill scans for installed coding agents by checking their configuration directories. For example, Claude Code uses a .claude directory, and Cursor uses .cursor and a directory under the home folder. The CLI then installs the chosen skills into the correct skills folders for each tool.

You can call add-skill in non interactive mode to control exactly what is installed. For example, you can install only the React skill for Claude Code at a global level:

npx add-skill vercel-labs/agent-skills --skill react-best-practices -g -a claude-code -y

You can also list available skills before installing them:

npx add-skill vercel-labs/agent-skills --list

After installation, skills live in agent specific directories such as ~/.claude/skills or .cursor/skills. The agent discovers these skills, reads SKILL.md, and is then able to route relevant user requests to the correct skill.

Once the skills are in place, the user interacts through natural language. For example, ‘Review this component for React performance issues’ or ‘Check this page for accessibility problems’. The agent inspects the installed skills and uses react-best-practices or web-design-guidelines when appropriate.

Key Takeaways

vercel-labs/agent-skills implements the Agent Skills specification, packaging each capability as a folder with SKILL.md, optional scripts, and references, so different AI coding agents can consume the same skill layout.

The repository currently ships 3 skills, react-best-practices for React and Next.js performance, web-design-guidelines for UI and UX review, and vercel-deploy-claimable for creating claimable deployments on Vercel.

react-best-practices encodes more than 40 rules in 8 categories, ordered by impact, and provides concrete code examples, which lets agents run structured performance reviews instead of ad hoc prompt based checks.

web-design-guidelines provides more than 100 rules across accessibility, focus handling, forms, animation, typography, images, performance, navigation, dark mode, touch interaction, and internationalization, enabling systematic UI quality checks by agents.

Skills are installed through commands such as npx skills i vercel-labs/agent-skills and npx add-skill vercel-labs/agent-skills, then discovered from agent specific skills directories, which turns best practice libraries into reusable, version controlled building blocks for AI coding workflows.

Check out the GitHub Repo.
The post Vercel Releases Agent Skills: A Package Manager For AI Coding Agents With 10 Years of React and Next.js Optimisation Rules appeared first on MarkTechPost.

A Coding Guide to Understanding How Retries Trigger Failure Cascades i …

In this tutorial, we build a hands-on comparison between a synchronous RPC-based system and an asynchronous event-driven architecture to understand how real distributed systems behave under load and failure. We simulate downstream services with variable latency, overload conditions, and transient errors, and then drive both architectures using bursty traffic patterns. By observing metrics such as tail latency, retries, failures, and dead-letter queues, we examine how tight RPC coupling amplifies failures and how asynchronous event-driven designs trade immediate consistency for resilience. Throughout the tutorial, we focus on the practical mechanisms that engineers use to control cascading failures in production systems: retries, exponential backoff, circuit breakers, bulkheads, and queues. Check out the FULL CODES here.

import asyncio, random, time, math, statistics
from dataclasses import dataclass, field
from collections import deque

def now_ms():
return time.perf_counter() * 1000.0

def pctl(xs, p):
    if not xs:
        return None
    xs2 = sorted(xs)
    k = (len(xs2) - 1) * p
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return xs2[int(k)]
    return xs2[f] + (xs2[c] - xs2[f]) * (k - f)

@dataclass
class Stats:
latencies_ms: list = field(default_factory=list)
ok: int = 0
fail: int = 0
dropped: int = 0
retries: int = 0
timeouts: int = 0
cb_open: int = 0
dlq: int = 0

    def summary(self, name):
        l = self.latencies_ms
        return {
            "name": name,
            "ok": self.ok,
            "fail": self.fail,
            "dropped": self.dropped,
            "retries": self.retries,
            "timeouts": self.timeouts,
            "cb_open": self.cb_open,
            "dlq": self.dlq,
            "lat_p50_ms": round(pctl(l, 0.50), 2) if l else None,
            "lat_p95_ms": round(pctl(l, 0.95), 2) if l else None,
            "lat_p99_ms": round(pctl(l, 0.99), 2) if l else None,
            "lat_mean_ms": round(statistics.mean(l), 2) if l else None,
        }

We define the core utilities and data structures used throughout the tutorial. We establish timing helpers, percentile calculations, and a unified metrics container to track latency, retries, failures, and tail behavior. It gives us a consistent way to measure and compare RPC and event-driven executions. Check out the FULL CODES here.

@dataclass
class FailureModel:
base_latency_ms: float = 8.0
jitter_ms: float = 6.0
fail_prob: float = 0.05
overload_fail_prob: float = 0.40
overload_latency_ms: float = 50.0

    def sample(self, load_factor: float):
        base = self.base_latency_ms + random.random() * self.jitter_ms
        if load_factor > 1.0:
            base += (load_factor - 1.0) * self.overload_latency_ms
            fail_p = min(0.95, self.fail_prob + (load_factor - 1.0) * self.overload_fail_prob)
        else:
            fail_p = self.fail_prob
        return base, (random.random() < fail_p)

class CircuitBreaker:
def __init__(self, fail_threshold=8, window=20, open_ms=500):
self.fail_threshold = fail_threshold
self.window = window
self.open_ms = open_ms
self.events = deque(maxlen=window)
self.open_until_ms = 0.0

def allow(self):
return now_ms() >= self.open_until_ms

def record(self, ok: bool):
self.events.append(not ok)
if len(self.events) >= self.window and sum(self.events) >= self.fail_threshold:
self.open_until_ms = now_ms() + self.open_ms

class Bulkhead:
def __init__(self, limit):
self.sem = asyncio.Semaphore(limit)

async def __aenter__(self):
await self.sem.acquire()

async def __aexit__(self, exc_type, exc, tb):
self.sem.release()

def exp_backoff(attempt, base_ms=20, cap_ms=400):
return random.random() * min(cap_ms, base_ms * (2 ** (attempt – 1)))

We model failure behavior and resilience primitives that shape system stability. We simulate overload-sensitive latency and failures, and we introduce circuit breakers, bulkheads, and exponential backoff to control cascading effects. These components let us experiment with safe versus unsafe distributed-system configurations. Check out the FULL CODES here.

class DownstreamService:
def __init__(self, fm: FailureModel, capacity_rps=250):
self.fm = fm
self.capacity_rps = capacity_rps
self._inflight = 0

async def handle(self, payload: dict):
self._inflight += 1
try:
load_factor = max(0.5, self._inflight / (self.capacity_rps / 10))
lat, should_fail = self.fm.sample(load_factor)
await asyncio.sleep(lat / 1000.0)
            if should_fail:
                raise RuntimeError("downstream_error")
            return {"status": "ok"}
finally:
self._inflight -= 1

async def rpc_call(
svc,
req,
stats,
timeout_ms=120,
max_retries=0,
cb=None,
bulkhead=None,
):
t0 = now_ms()
if cb and not cb.allow():
stats.cb_open += 1
stats.fail += 1
return False

attempt = 0
while True:
attempt += 1
try:
if bulkhead:
async with bulkhead:
await asyncio.wait_for(svc.handle(req), timeout=timeout_ms / 1000.0)
else:
await asyncio.wait_for(svc.handle(req), timeout=timeout_ms / 1000.0)
            stats.latencies_ms.append(now_ms() - t0)
stats.ok += 1
if cb: cb.record(True)
return True
except asyncio.TimeoutError:
stats.timeouts += 1
except Exception:
pass
stats.fail += 1
if cb: cb.record(False)
if attempt <= max_retries:
stats.retries += 1
await asyncio.sleep(exp_backoff(attempt) / 1000.0)
continue
return False

We implement the synchronous RPC path and its interaction with downstream services. We observe how timeouts, retries, and in-flight load directly affect latency and failure propagation. It also highlights how tight coupling in RPC can amplify transient issues under bursty traffic. Check out the FULL CODES here.

@dataclass
class Event:
id: int
tries: int = 0

class EventBus:
def __init__(self, max_queue=5000):
self.q = asyncio.Queue(maxsize=max_queue)

async def publish(self, e: Event):
try:
self.q.put_nowait(e)
return True
except asyncio.QueueFull:
return False

async def event_consumer(
bus,
svc,
stats,
stop,
max_retries=0,
dlq=None,
bulkhead=None,
timeout_ms=200,
):
while not stop.is_set() or not bus.q.empty():
try:
e = await asyncio.wait_for(bus.q.get(), timeout=0.2)
except asyncio.TimeoutError:
continue

t0 = now_ms()
e.tries += 1
try:
            if bulkhead:
                async with bulkhead:
                    await asyncio.wait_for(svc.handle({"id": e.id}), timeout=timeout_ms / 1000.0)
            else:
                await asyncio.wait_for(svc.handle({"id": e.id}), timeout=timeout_ms / 1000.0)
            stats.ok += 1
            stats.latencies_ms.append(now_ms() - t0)
except Exception:
stats.fail += 1
if e.tries <= max_retries:
stats.retries += 1
await asyncio.sleep(exp_backoff(e.tries) / 1000.0)
await bus.publish(e)
else:
stats.dlq += 1
if dlq is not None:
dlq.append(e)
finally:
bus.q.task_done()

We build the asynchronous event-driven pipeline using a queue and background consumers. We process events independently of request submission, apply retry logic, and route unrecoverable messages to a dead-letter queue. It demonstrates how decoupling improves resilience while introducing new operational considerations. Check out the FULL CODES here.

async def generate_requests(total=2000, burst=350, gap_ms=80):
reqs = []
rid = 0
while rid < total:
        n = min(burst, total - rid)
for _ in range(n):
reqs.append(rid)
rid += 1
await asyncio.sleep(gap_ms / 1000.0)
return reqs

async def main():
random.seed(7)
fm = FailureModel()
svc = DownstreamService(fm)
ids = await generate_requests()

rpc_stats = Stats()
cb = CircuitBreaker()
bulk = Bulkhead(40)

await asyncio.gather(*[
        rpc_call(svc, {"id": i}, rpc_stats, max_retries=3, cb=cb, bulkhead=bulk)
for i in ids
])

bus = EventBus()
ev_stats = Stats()
stop = asyncio.Event()
dlq = []

consumers = [
asyncio.create_task(event_consumer(bus, svc, ev_stats, stop, max_retries=3, dlq=dlq))
for _ in range(16)
]

for i in ids:
await bus.publish(Event(i))

await bus.q.join()
stop.set()
for c in consumers:
c.cancel()

    print(rpc_stats.summary("RPC"))
    print(ev_stats.summary("EventDriven"))
    print("DLQ size:", len(dlq))

await main()

We drive both architectures with bursty workloads and orchestrate the full experiment. We collect metrics, cleanly terminate consumers, and compare outcomes across RPC and event-driven executions. The final step ties together latency, throughput, and failure behavior into a coherent system-level comparison.

In conclusion, we clearly saw the trade-offs between RPC and event-driven architectures in distributed systems. We observed that RPC offers lower latency when dependencies are healthy but becomes fragile under saturation, where retries and timeouts quickly cascade into system-wide failures. In contrast, the event-driven approach decouples producers from consumers, absorbs bursts through buffering, and localizes failures, but requires careful handling of retries, backpressure, and dead-letter queues to avoid hidden overload and unbounded queues. Through this tutorial, we demonstrated that resilience in distributed systems does not come from choosing a single architecture, but from combining the right communication model with disciplined failure-handling patterns and capacity-aware design.

Check out the FULL CODES here.
The post A Coding Guide to Understanding How Retries Trigger Failure Cascades in RPC and Event-Driven Architectures appeared first on MarkTechPost.

NVIDIA Releases PersonaPlex-7B-v1: A Real-Time Speech-to-Speech Model …

NVIDIA Researchers released PersonaPlex-7B-v1, a full duplex speech to speech conversational model that targets natural voice interactions with precise persona control.

From ASR→LLM→TTS to a single full duplex model

Conventional voice assistants usually run a cascade. Automatic Speech Recognition (ASR) converts speech to text, a language model generates a text answer, and Text to Speech (TTS) converts back to audio. Each stage adds latency, and the pipeline cannot handle overlapping speech, natural interruptions, or dense backchannels.

PersonaPlex replaces this stack with a single Transformer model that performs streaming speech understanding and speech generation in one network. The model operates on continuous audio encoded with a neural codec and predicts both text tokens and audio tokens autoregressively. Incoming user audio is incrementally encoded, while PersonaPlex simultaneously generates its own speech, which enables barge in, overlaps, rapid turn taking, and contextual backchannels.

PersonaPlex runs in a dual stream configuration. One stream tracks user audio, the other stream tracks agent speech and text. Both streams share the same model state, so the agent can keep listening while speaking and can adjust its response when the user interrupts. This design is directly inspired by Kyutai’s Moshi full duplex framework.

Hybrid prompting, voice control and role control

PersonaPlex uses two prompts to define the conversational identity.

The voice prompt is a sequence of audio tokens that encodes vocal characteristics, speaking style, and prosody.

The text prompt describes role, background, organization information, and scenario context.

Together, these prompts constrain both the linguistic content and the acoustic behavior of the agent. On top of this, a system prompt supports fields such as name, business name, agent name, and business information, with a budget of up to 200 tokens.

Architecture, Helium backbone and audio path

The PersonaPlex model has 7B parameters and follows the Moshi network architecture. A Mimi speech encoder that combines ConvNet and Transformer layers converts waveform audio into discrete tokens. Temporal and depth Transformers process multiple channels that represent user audio, agent text, and agent audio. A Mimi speech decoder that also combines Transformer and ConvNet layers generates the output audio tokens. Audio uses a 24 kHz sample rate for both input and output.

PersonaPlex is built on Moshi weights and uses Helium as the underlying language model backbone. Helium provides semantic understanding and enables generalization outside the supervised conversational scenarios. This is visible in the ‘space emergency’ example, where a prompt about a reactor core failure on a Mars mission leads to coherent technical reasoning with appropriate emotional tone, even though this situation is not part of the training distribution.

Training data blend, real conversations and synthetic roles

Training has a single stage and uses a blend of real and synthetic dialogues.

Real conversations come from 7,303 calls, about 1,217 hours, in the Fisher English corpus. These conversations are back annotated with prompts using GPT-OSS-120B. The prompts are written at different granularity levels, from simple persona hints like ‘You enjoy having a good conversation’ to longer descriptions that include life history, location, and preferences. This corpus provides natural backchannels, disfluencies, pauses, and emotional patterns that are difficult to obtain from TTS alone.

Synthetic data covers assistant and customer service roles. The NVIDIA team reports 39,322 synthetic assistant conversations, about 410 hours, and 105,410 synthetic customer service conversations, about 1,840 hours. Qwen3-32B and GPT-OSS-120B generate the transcripts, and Chatterbox TTS converts them to speech. For assistant interactions, the text prompt is fixed as ‘You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.’ For customer service scenarios, prompts encode organization, role type, agent name, and structured business rules such as pricing, hours, and constraints.

This design lets PersonaPlex disentangle natural conversational behavior, which comes mainly from Fisher, from task adherence and role conditioning, which come mainly from synthetic scenarios.

Evaluation on FullDuplexBench and ServiceDuplexBench

PersonaPlex is evaluated on FullDuplexBench, a benchmark for full duplex spoken dialogue models, and on a new extension called ServiceDuplexBench for customer service scenarios.

FullDuplexBench measures conversational dynamics with Takeover Rate and latency metrics for tasks such as smooth turn taking, user interruption handling, pause handling, and backchanneling. GPT-4o serves as an LLM judge for response quality in question answering categories. PersonaPlex reaches smooth turn taking TOR 0.908 with latency 0.170 seconds and user interruption TOR 0.950 with latency 0.240 seconds. Speaker similarity between voice prompts and outputs on the user interruption subset uses WavLM TDNN embeddings and reaches 0.650.

PersonaPlex outperforms many other open source and closed systems on conversational dynamics, response latency, interruption latency, and task adherence in both assistant and customer service roles.

https://research.nvidia.com/labs/adlr/personaplex/

Key Takeaways

PersonaPlex-7B-v1 is a 7B parameter full duplex speech to speech conversational model from NVIDIA, built on the Moshi architecture with a Helium language model backbone, code under MIT and weights under the NVIDIA Open Model License.

The model uses a dual stream Transformer with Mimi speech encoder and decoder at 24 kHz, it encodes continuous audio into discrete tokens and generates text and audio tokens at the same time, which enables barge in, overlaps, fast turn taking, and natural backchannels.

Persona control is handled by hybrid prompting: a voice prompt made of audio tokens sets timbre and style, while a text prompt and a system prompt of up to 200 tokens define role, business context, and constraints, with ready made voice embeddings such as the NATF and NATM families.

Training uses a blend of 7,303 Fisher conversations, about 1,217 hours, annotated with GPT-OSS-120B, plus synthetic assistant and customer service dialogs, about 410 hours and 1,840 hours, generated with Qwen3-32B and GPT-OSS-120B and rendered with Chatterbox TTS, which separates conversational naturalness from task adherence.

On FullDuplexBench and ServiceDuplexBench, PersonaPlex reaches smooth turn taking takeover rate 0.908 and user interruption takeover rate 0.950 with sub second latency and improved task adherence.

Check out the Technical details, Model weights and Repo.
The post NVIDIA Releases PersonaPlex-7B-v1: A Real-Time Speech-to-Speech Model Designed for Natural and Full-Duplex Conversations appeared first on MarkTechPost.

How to Build a Self-Evaluating Agentic AI System with LlamaIndex and O …

In this tutorial, we build an advanced agentic AI workflow using LlamaIndex and OpenAI models. We focus on designing a reliable retrieval-augmented generation (RAG) agent that can reason over evidence, use tools deliberately, and evaluate its own outputs for quality. By structuring the system around retrieval, answer synthesis, and self-evaluation, we demonstrate how agentic patterns go beyond simple chatbots and move toward more trustworthy, controllable AI systems suitable for research and analytical use cases.

!pip -q install -U llama-index llama-index-llms-openai llama-index-embeddings-openai nest_asyncio

import os
import asyncio
import nest_asyncio
nest_asyncio.apply()

from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY: ")

We set up the environment and install all required dependencies for running an agentic AI workflow. We securely load the OpenAI API key at runtime, ensuring that credentials are never hardcoded. We also prepare the notebook to handle asynchronous execution smoothly.

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

texts = [
    "Reliable RAG systems separate retrieval, synthesis, and verification. Common failures include hallucination and shallow retrieval.",
    "RAG evaluation focuses on faithfulness, answer relevancy, and retrieval quality.",
    "Tool-using agents require constrained tools, validation, and self-review loops.",
    "A robust workflow follows retrieve, answer, evaluate, and revise steps.",
]

docs = [Document(text=t) for t in texts]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=4)

We configure the OpenAI language model and embedding model and build a compact knowledge base for our agent. We transform raw text into indexed documents so that the agent can retrieve relevant evidence during reasoning.

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

faith_eval = FaithfulnessEvaluator(llm=Settings.llm)
rel_eval = RelevancyEvaluator(llm=Settings.llm)

def retrieve_evidence(q: str) -> str:
    r = query_engine.query(q)
    out = []
    for i, n in enumerate(r.source_nodes or []):
        out.append(f"[{i+1}] {n.node.get_content()[:300]}")
    return "\n".join(out)

def score_answer(q: str, a: str) -> str:
    r = query_engine.query(q)
    ctx = [n.node.get_content() for n in r.source_nodes or []]
    faith = faith_eval.evaluate(query=q, response=a, contexts=ctx)
    rel = rel_eval.evaluate(query=q, response=a, contexts=ctx)
    return f"Faithfulness: {faith.score}\nRelevancy: {rel.score}"

We define the core tools used by the agent: evidence retrieval and answer evaluation. We implement automatic scoring for faithfulness and relevancy so the agent can judge the quality of its own responses.

from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.workflow import Context

agent = ReActAgent(
    tools=[retrieve_evidence, score_answer],
    llm=Settings.llm,
    system_prompt="""
Always retrieve evidence first.
Produce a structured answer.
Evaluate the answer and revise once if scores are low.
""",
    verbose=True
)

ctx = Context(agent)

We create the ReAct-based agent and define its system behavior, guiding how it retrieves evidence, generates answers, and revises results. We also initialize the execution context that maintains the agent’s state across interactions. This step brings together tools and reasoning into a single agentic workflow.

async def run_brief(topic: str):
    q = f"Design a reliable RAG + tool-using agent workflow and how to evaluate it. Topic: {topic}"
    handler = agent.run(q, ctx=ctx)
    async for ev in handler.stream_events():
        print(getattr(ev, "delta", ""), end="")
    res = await handler
    return str(res)

topic = "RAG agent reliability and evaluation"
loop = asyncio.get_event_loop()
result = loop.run_until_complete(run_brief(topic))

print("\n\nFINAL OUTPUT\n")
print(result)

We execute the full agent loop by passing a topic into the system and streaming the agent’s reasoning and output. We allow the agent to complete its retrieval, generation, and evaluation cycle asynchronously.

In conclusion, we showcased how an agent can retrieve supporting evidence, generate a structured response, and assess its own faithfulness and relevancy before finalizing an answer. We kept the design modular and transparent, making it easy to extend the workflow with additional tools, evaluators, or domain-specific knowledge sources. This approach illustrates how we can use agentic AI with LlamaIndex and OpenAI models to build more capable systems that are also more reliable and self-aware in their reasoning and responses.

Check out the FULL CODES here.
The post How to Build a Self-Evaluating Agentic AI System with LlamaIndex and OpenAI Using Retrieval, Tool Use, and Automated Quality Checks appeared first on MarkTechPost.

How to Build a Safe, Autonomous Prior Authorization Agent for Healthca …

In this tutorial, we demonstrate how an autonomous, agentic AI system can simulate the end-to-end prior authorization workflow within healthcare Revenue Cycle Management (RCM). We show how an agent continuously monitors incoming surgery orders, gathers the required clinical documentation, submits prior authorization requests to payer systems, tracks their status, and intelligently responds to denials through automated analysis and appeals. We design the system to act conservatively and responsibly, escalating to a human reviewer when uncertainty crosses a defined threshold. While the implementation uses mocked EHR and payer portals for clarity and safety, we intentionally mirror real-world healthcare workflows to make the logic transferable to production environments. Also, we emphasize that it is strictly a technical simulation and not a substitute for clinical judgment, payer policy interpretation, or regulatory compliance. Check out the FULL CODES here.

!pip -q install "pydantic>=2.0.0" "httpx>=0.27.0"

import os, time, json, random, hashlib
from typing import List, Dict, Optional, Any
from enum import Enum
from datetime import datetime, timedelta
from pydantic import BaseModel, Field

We set up the execution environment and install the minimal dependencies required to run the tutorial. We configure optional OpenAI usage in a safe, fail-open manner so the system continues to work even without external models. We ensure the foundation is lightweight, reproducible, and suitable for healthcare simulations. Check out the FULL CODES here.

USE_OPENAI = False
OPENAI_AVAILABLE = False

try:
    from getpass import getpass
    if not os.environ.get("OPENAI_API_KEY"):
        # Optionally prompt for a key here, for example:
        # os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key (or leave blank): ")
        pass
    if os.environ.get("OPENAI_API_KEY"):
        USE_OPENAI = True
except Exception:
    USE_OPENAI = False

if USE_OPENAI:
    try:
        !pip -q install openai
        from openai import OpenAI
        client = OpenAI()
        OPENAI_AVAILABLE = True
    except Exception:
        OPENAI_AVAILABLE = False
        USE_OPENAI = False

We define strongly typed domain models for patients, surgical orders, clinical documents, and authorization decisions. We use explicit enums and schemas to mirror real healthcare RCM structures while avoiding ambiguity. We enforce clarity and validation to reduce downstream errors in automated decision-making. Check out the FULL CODES here.

class DocType(str, Enum):
    H_AND_P = "history_and_physical"
    LABS = "labs"
    IMAGING = "imaging"
    MED_LIST = "medication_list"
    CONSENT = "consent"
    PRIOR_TX = "prior_treatments"
    CLINICAL_NOTE = "clinical_note"

class SurgeryType(str, Enum):
    KNEE_ARTHROPLASTY = "knee_arthroplasty"
    SPINE_FUSION = "spine_fusion"
    CATARACT = "cataract"
    BARIATRIC = "bariatric_surgery"

class InsurancePlan(str, Enum):
    PAYER_ALPHA = "PayerAlpha"
    PAYER_BETA = "PayerBeta"
    PAYER_GAMMA = "PayerGamma"

class Patient(BaseModel):
    patient_id: str
    name: str
    dob: str
    member_id: str
    plan: InsurancePlan

class SurgeryOrder(BaseModel):
    order_id: str
    patient: Patient
    surgery_type: SurgeryType
    scheduled_date: str
    ordering_provider_npi: str
    diagnosis_codes: List[str] = Field(default_factory=list)
    created_at: str

class ClinicalDocument(BaseModel):
    doc_id: str
    doc_type: DocType
    created_at: str
    content: str
    source: str

class PriorAuthRequest(BaseModel):
    request_id: str
    order: SurgeryOrder
    submitted_at: Optional[str] = None
    docs_attached: List[ClinicalDocument] = Field(default_factory=list)
    payload: Dict[str, Any] = Field(default_factory=dict)

class AuthStatus(str, Enum):
    DRAFT = "draft"
    SUBMITTED = "submitted"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    DENIED = "denied"
    NEEDS_INFO = "needs_info"
    APPEALED = "appealed"

class DenialReason(str, Enum):
    MISSING_DOCS = "missing_docs"
    MEDICAL_NECESSITY = "medical_necessity"
    MEMBER_INELIGIBLE = "member_ineligible"
    DUPLICATE = "duplicate"
    CODING_ISSUE = "coding_issue"
    OTHER = "other"

class PayerResponse(BaseModel):
    status: AuthStatus
    payer_ref: str
    message: str
    denial_reason: Optional[DenialReason] = None
    missing_docs: List[DocType] = Field(default_factory=list)
    confidence: float = 0.9

class AgentDecision(BaseModel):
    action: str
    missing_docs: List[DocType] = Field(default_factory=list)
    rationale: str = ""
    uncertainty: float = 0.0
    next_wait_seconds: int = 0
    appeal_text: Optional[str] = None

We simulate an EHR system that emits surgery orders and stores clinical documentation. We intentionally model incomplete charts to reflect real-world documentation gaps that often drive prior authorization denials. We show how an agent can retrieve and augment patient records in a controlled manner. Check out the FULL CODES here.

def _now_iso() -> str:
    return datetime.utcnow().replace(microsecond=0).isoformat() + "Z"

def _stable_id(prefix: str, seed: str) -> str:
    h = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:10]
    return f"{prefix}_{h}"

class MockEHR:
    def __init__(self):
        self.orders_queue: List[SurgeryOrder] = []
        self.patient_docs: Dict[str, List[ClinicalDocument]] = {}

    def seed_data(self, n_orders: int = 5):
        random.seed(7)

        def make_patient(i: int) -> Patient:
            pid = f"PT{i:04d}"
            plan = random.choice(list(InsurancePlan))
            return Patient(
                patient_id=pid,
                name=f"Patient {i}",
                dob="1980-01-01",
                member_id=f"M{i:08d}",
                plan=plan,
            )

        def docs_for_order(patient: Patient, surgery: SurgeryType) -> List[ClinicalDocument]:
            base = [
                ClinicalDocument(
                    doc_id=_stable_id("DOC", patient.patient_id + "H&P"),
                    doc_type=DocType.H_AND_P,
                    created_at=_now_iso(),
                    content="H&P: Relevant history, exam findings, and surgical indication.",
                    source="EHR",
                ),
                ClinicalDocument(
                    doc_id=_stable_id("DOC", patient.patient_id + "NOTE"),
                    doc_type=DocType.CLINICAL_NOTE,
                    created_at=_now_iso(),
                    content="Clinical note: Symptoms, conservative management attempted, clinician assessment.",
                    source="EHR",
                ),
                ClinicalDocument(
                    doc_id=_stable_id("DOC", patient.patient_id + "MEDS"),
                    doc_type=DocType.MED_LIST,
                    created_at=_now_iso(),
                    content="Medication list: Current meds, allergies, contraindications.",
                    source="EHR",
                ),
            ]

            maybe = []
            if surgery in [SurgeryType.KNEE_ARTHROPLASTY, SurgeryType.SPINE_FUSION, SurgeryType.BARIATRIC]:
                maybe.append(
                    ClinicalDocument(
                        doc_id=_stable_id("DOC", patient.patient_id + "LABS"),
                        doc_type=DocType.LABS,
                        created_at=_now_iso(),
                        content="Labs: CBC/CMP within last 30 days.",
                        source="LabSystem",
                    )
                )

            if surgery in [SurgeryType.SPINE_FUSION, SurgeryType.KNEE_ARTHROPLASTY]:
                maybe.append(
                    ClinicalDocument(
                        doc_id=_stable_id("DOC", patient.patient_id + "IMG"),
                        doc_type=DocType.IMAGING,
                        created_at=_now_iso(),
                        content="Imaging: MRI/X-ray report supporting diagnosis and severity.",
                        source="Radiology",
                    )
                )

            final = base + [d for d in maybe if random.random() > 0.35]

            if random.random() > 0.6:
                final.append(
                    ClinicalDocument(
                        doc_id=_stable_id("DOC", patient.patient_id + "PRIOR_TX"),
                        doc_type=DocType.PRIOR_TX,
                        created_at=_now_iso(),
                        content="Prior treatments: PT, meds, injections tried over 6+ weeks.",
                        source="EHR",
                    )
                )

            if random.random() > 0.5:
                final.append(
                    ClinicalDocument(
                        doc_id=_stable_id("DOC", patient.patient_id + "CONSENT"),
                        doc_type=DocType.CONSENT,
                        created_at=_now_iso(),
                        content="Consent: Signed procedure consent and risk disclosure.",
                        source="EHR",
                    )
                )

            return final

        for i in range(1, n_orders + 1):
            patient = make_patient(i)
            surgery = random.choice(list(SurgeryType))
            order = SurgeryOrder(
                order_id=_stable_id("ORD", patient.patient_id + surgery.value),
                patient=patient,
                surgery_type=surgery,
                scheduled_date=(datetime.utcnow().date() + timedelta(days=random.randint(3, 21))).isoformat(),
                ordering_provider_npi=str(random.randint(1000000000, 1999999999)),
                diagnosis_codes=["M17.11", "M54.5"] if surgery != SurgeryType.CATARACT else ["H25.9"],
                created_at=_now_iso(),
            )
            self.orders_queue.append(order)
            self.patient_docs[patient.patient_id] = docs_for_order(patient, surgery)

    def poll_new_surgery_orders(self, max_n: int = 1) -> List[SurgeryOrder]:
        pulled = self.orders_queue[:max_n]
        self.orders_queue = self.orders_queue[max_n:]
        return pulled

    def get_patient_documents(self, patient_id: str) -> List[ClinicalDocument]:
        return list(self.patient_docs.get(patient_id, []))

    def fetch_additional_docs(self, patient_id: str, needed: List[DocType]) -> List[ClinicalDocument]:
        generated = []
        for dt in needed:
            generated.append(
                ClinicalDocument(
                    doc_id=_stable_id("DOC", patient_id + dt.value + str(time.time())),
                    doc_type=dt,
                    created_at=_now_iso(),
                    content=f"Auto-collected document for {dt.value}: extracted and formatted per payer policy.",
                    source="AutoCollector",
                )
            )
        self.patient_docs.setdefault(patient_id, []).extend(generated)
        return generated

class MockPayerPortal:
    def __init__(self):
        self.db: Dict[str, Dict[str, Any]] = {}
        random.seed(11)

    def required_docs_policy(self, plan: InsurancePlan, surgery: SurgeryType) -> List[DocType]:
        base = [DocType.H_AND_P, DocType.CLINICAL_NOTE, DocType.MED_LIST]
        if surgery in [SurgeryType.SPINE_FUSION, SurgeryType.KNEE_ARTHROPLASTY]:
            base += [DocType.IMAGING, DocType.LABS, DocType.PRIOR_TX]
        if surgery == SurgeryType.BARIATRIC:
            base += [DocType.LABS, DocType.PRIOR_TX]
        if plan in [InsurancePlan.PAYER_BETA, InsurancePlan.PAYER_GAMMA]:
            base += [DocType.CONSENT]
        return sorted(list(set(base)), key=lambda x: x.value)

    def submit(self, pa: PriorAuthRequest) -> PayerResponse:
        payer_ref = _stable_id("PAYREF", pa.request_id + _now_iso())
        docs_present = {d.doc_type for d in pa.docs_attached}
        required = self.required_docs_policy(pa.order.patient.plan, pa.order.surgery_type)
        missing = [d for d in required if d not in docs_present]

        self.db[payer_ref] = {
            "status": AuthStatus.SUBMITTED,
            "order_id": pa.order.order_id,
            "plan": pa.order.patient.plan,
            "surgery": pa.order.surgery_type,
            "missing": missing,
            "polls": 0,
            "submitted_at": _now_iso(),
            "denial_reason": None,
        }

        msg = "Submission received. Case queued for review."
        if missing:
            msg += " Initial validation indicates incomplete documentation."
        return PayerResponse(status=AuthStatus.SUBMITTED, payer_ref=payer_ref, message=msg)

    def check_status(self, payer_ref: str) -> PayerResponse:
        if payer_ref not in self.db:
            return PayerResponse(
                status=AuthStatus.DENIED,
                payer_ref=payer_ref,
                message="Case not found (possible payer system error).",
                denial_reason=DenialReason.OTHER,
                confidence=0.4,
            )

        case = self.db[payer_ref]
        case["polls"] += 1

        if case["status"] == AuthStatus.SUBMITTED and case["polls"] >= 1:
            case["status"] = AuthStatus.IN_REVIEW

        if case["status"] == AuthStatus.APPEALED and case["polls"] >= 1:
            # Re-enter review after an appeal so post-appeal polling can resolve the case
            case["status"] = AuthStatus.IN_REVIEW

        if case["status"] == AuthStatus.IN_REVIEW and case["polls"] >= 3:
            if case["missing"]:
                case["status"] = AuthStatus.DENIED
                case["denial_reason"] = DenialReason.MISSING_DOCS
            else:
                roll = random.random()
                if roll < 0.10:
                    case["status"] = AuthStatus.DENIED
                    case["denial_reason"] = DenialReason.CODING_ISSUE
                elif roll < 0.18:
                    case["status"] = AuthStatus.DENIED
                    case["denial_reason"] = DenialReason.MEDICAL_NECESSITY
                else:
                    case["status"] = AuthStatus.APPROVED

        if case["status"] == AuthStatus.DENIED:
            dr = case["denial_reason"] or DenialReason.OTHER
            missing = case["missing"] if dr == DenialReason.MISSING_DOCS else []
            conf = 0.9 if dr != DenialReason.OTHER else 0.55
            return PayerResponse(
                status=AuthStatus.DENIED,
                payer_ref=payer_ref,
                message=f"Denied. Reason={dr.value}.",
                denial_reason=dr,
                missing_docs=missing,
                confidence=conf,
            )

        if case["status"] == AuthStatus.APPROVED:
            return PayerResponse(
                status=AuthStatus.APPROVED,
                payer_ref=payer_ref,
                message="Approved. Authorization issued.",
                confidence=0.95,
            )

        return PayerResponse(
            status=case["status"],
            payer_ref=payer_ref,
            message=f"Status={case['status'].value}. Polls={case['polls']}.",
            confidence=0.9,
        )

    def file_appeal(self, payer_ref: str, appeal_text: str, attached_docs: List[ClinicalDocument]) -> PayerResponse:
        if payer_ref not in self.db:
            return PayerResponse(
                status=AuthStatus.DENIED,
                payer_ref=payer_ref,
                message="Appeal failed: case not found.",
                denial_reason=DenialReason.OTHER,
                confidence=0.4,
            )

        case = self.db[payer_ref]
        docs_present = {d.doc_type for d in attached_docs}
        still_missing = [d for d in case["missing"] if d not in docs_present]
        case["missing"] = still_missing
        case["status"] = AuthStatus.APPEALED
        case["polls"] = 0

        msg = "Appeal submitted and queued for review."
        if still_missing:
            msg += f" Warning: still missing {', '.join([d.value for d in still_missing])}."
        return PayerResponse(status=AuthStatus.APPEALED, payer_ref=payer_ref, message=msg, confidence=0.9)

We model payer-side behavior, including documentation policies, review timelines, and denial logic. We encode simplified but realistic payer rules to demonstrate how policy-driven automation works in practice. We expose predictable failure modes that the agent must respond to safely. Check out the FULL CODES here.

def required_docs_for_order(payer: MockPayerPortal, order: SurgeryOrder) -> List[DocType]:
    return payer.required_docs_policy(order.patient.plan, order.surgery_type)

def attach_best_docs(ehr_docs: List[ClinicalDocument], required: List[DocType]) -> List[ClinicalDocument]:
    by_type: Dict[DocType, List[ClinicalDocument]] = {}
    for d in ehr_docs:
        by_type.setdefault(d.doc_type, []).append(d)
    attached = []
    for dt in required:
        if dt in by_type:
            attached.append(by_type[dt][-1])
    return attached

def compute_uncertainty(payer_resp: PayerResponse, missing_docs: List[DocType], llm_used: bool) -> float:
    base = 0.15
    if payer_resp.denial_reason in [DenialReason.OTHER]:
        base += 0.35
    if payer_resp.denial_reason in [DenialReason.MEDICAL_NECESSITY]:
        base += 0.25
    if payer_resp.denial_reason in [DenialReason.CODING_ISSUE]:
        base += 0.20
    if missing_docs:
        base += 0.10
    if llm_used:
        base -= 0.05
    return max(0.0, min(1.0, base + (1 - payer_resp.confidence) * 0.6))

def rule_based_denial_analysis(order: SurgeryOrder, payer_resp: PayerResponse) -> Dict[str, Any]:
    rec = {"missing_docs": [], "rationale": "", "appeal_text": ""}
    if payer_resp.denial_reason == DenialReason.MISSING_DOCS:
        rec["missing_docs"] = payer_resp.missing_docs
        rec["rationale"] = "Denial indicates incomplete documentation per payer policy. Collect and resubmit as appeal."
        rec["appeal_text"] = (
            f"Appeal for prior authorization ({payer_resp.payer_ref})\n"
            f"Patient: {order.patient.name} ({order.patient.member_id})\n"
            f"Procedure: {order.surgery_type.value}\n"
            f"Reason for appeal: Missing documentation has now been attached. Please re-review.\n"
        )
    elif payer_resp.denial_reason == DenialReason.CODING_ISSUE:
        rec["rationale"] = "Potential coding mismatch. Verify diagnosis/procedure codes and include supporting note."
        rec["appeal_text"] = (
            f"Appeal ({payer_resp.payer_ref}): Requesting reconsideration.\n"
            f"Attached: Updated clinical note clarifying diagnosis and indication; please re-review coding alignment.\n"
        )
    elif payer_resp.denial_reason == DenialReason.MEDICAL_NECESSITY:
        rec["rationale"] = "Medical necessity denial. Add prior treatments timeline, imaging severity, and functional impact."
        rec["appeal_text"] = (
            f"Appeal ({payer_resp.payer_ref}): Medical necessity reconsideration.\n"
            f"Attached: Prior conservative therapies, imaging, and clinician attestation of functional limitation.\n"
        )
    else:
        rec["rationale"] = "Unclear denial. Escalate if payer message lacks actionable details."
        rec["appeal_text"] = (
            f"Appeal ({payer_resp.payer_ref}): Requesting clarification and reconsideration.\n"
            f"Please provide specific criteria not met; attached full clinical packet.\n"
        )
    return rec

def llm_denial_analysis_and_appeal(order: SurgeryOrder, payer_resp: PayerResponse, docs: List[ClinicalDocument]) -> Dict[str, Any]:
    if not OPENAI_AVAILABLE:
        return rule_based_denial_analysis(order, payer_resp)

    doc_summary = [{"doc_type": d.doc_type.value, "source": d.source, "created_at": d.created_at} for d in docs]
    prompt = {
        "role": "user",
        "content": (
            "You are an RCM prior authorization specialist agent.\n"
            "Given the order, attached docs, and payer denial response, do three things:\n"
            "1) Identify what documentation is missing or what needs clarification.\n"
            "2) Recommend next steps.\n"
            "3) Draft a concise appeal letter.\n\n"
            f"ORDER:\n{order.model_dump_json(indent=2)}\n\n"
            f"PAYER_RESPONSE:\n{payer_resp.model_dump_json(indent=2)}\n\n"
            f"ATTACHED_DOCS_METADATA:\n{json.dumps(doc_summary, indent=2)}\n\n"
            "Return STRICT JSON with keys: missing_docs (list of strings), rationale (string), appeal_text (string)."
        )
    }

    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[prompt],
            temperature=0.2,
        )
        text = resp.choices[0].message.content.strip()
        data = json.loads(text)
        missing = []
        for x in data.get("missing_docs", []):
            try:
                missing.append(DocType(x))
            except Exception:
                pass
        return {
            "missing_docs": missing,
            "rationale": data.get("rationale", ""),
            "appeal_text": data.get("appeal_text", ""),
        }
    except Exception:
        return rule_based_denial_analysis(order, payer_resp)

class PriorAuthAgent:
    def __init__(self, ehr: MockEHR, payer: MockPayerPortal, uncertainty_threshold: float = 0.55):
        self.ehr = ehr
        self.payer = payer
        self.uncertainty_threshold = uncertainty_threshold
        self.audit_log: List[Dict[str, Any]] = []

    def log(self, event: str, payload: Dict[str, Any]):
        self.audit_log.append({"ts": _now_iso(), "event": event, **payload})

    def build_prior_auth_request(self, order: SurgeryOrder) -> PriorAuthRequest:
        required = required_docs_for_order(self.payer, order)
        docs = self.ehr.get_patient_documents(order.patient.patient_id)
        attached = attach_best_docs(docs, required)

        req = PriorAuthRequest(
            request_id=_stable_id("PA", order.order_id + order.patient.member_id),
            order=order,
            docs_attached=attached,
            payload={
                "member_id": order.patient.member_id,
                "plan": order.patient.plan.value,
                "procedure": order.surgery_type.value,
                "diagnosis_codes": order.diagnosis_codes,
                "scheduled_date": order.scheduled_date,
                "provider_npi": order.ordering_provider_npi,
                "attached_doc_types": [d.doc_type.value for d in attached],
            }
        )
        self.log("pa_request_built", {"order_id": order.order_id, "required_docs": [d.value for d in required], "attached": req.payload["attached_doc_types"]})
        return req

    def submit_and_monitor(self, pa: PriorAuthRequest, max_polls: int = 7) -> Dict[str, Any]:
        pa.submitted_at = _now_iso()
        submit_resp = self.payer.submit(pa)
        self.log("submitted", {"request_id": pa.request_id, "payer_ref": submit_resp.payer_ref, "message": submit_resp.message})

        payer_ref = submit_resp.payer_ref

        for _ in range(max_polls):
            time.sleep(0.25)
            status = self.payer.check_status(payer_ref)
            self.log("status_polled", {"payer_ref": payer_ref, "status": status.status.value, "message": status.message})

            if status.status == AuthStatus.APPROVED:
                return {"final_status": "APPROVED", "payer_ref": payer_ref, "details": status.model_dump()}

            if status.status == AuthStatus.DENIED:
                decision = self.handle_denial(pa, payer_ref, status)
                if decision.action == "escalate":
                    return {
                        "final_status": "ESCALATED_TO_HUMAN",
                        "payer_ref": payer_ref,
                        "decision": decision.model_dump(),
                        "details": status.model_dump(),
                    }
                if decision.action == "appeal":
                    appeal_docs = pa.docs_attached[:]
                    appeal_resp = self.payer.file_appeal(payer_ref, decision.appeal_text or "", appeal_docs)
                    self.log("appeal_filed", {"payer_ref": payer_ref, "message": appeal_resp.message})

                    for _ in range(max_polls):
                        time.sleep(0.25)
                        post = self.payer.check_status(payer_ref)
                        self.log("post_appeal_polled", {"payer_ref": payer_ref, "status": post.status.value, "message": post.message})
                        if post.status == AuthStatus.APPROVED:
                            return {"final_status": "APPROVED_AFTER_APPEAL", "payer_ref": payer_ref, "details": post.model_dump()}
                        if post.status == AuthStatus.DENIED:
                            return {"final_status": "DENIED_AFTER_APPEAL", "payer_ref": payer_ref, "details": post.model_dump(), "decision": decision.model_dump()}

                    return {"final_status": "APPEAL_PENDING", "payer_ref": payer_ref, "decision": decision.model_dump()}

                return {"final_status": "DENIED_NO_ACTION", "payer_ref": payer_ref, "decision": decision.model_dump(), "details": status.model_dump()}

        return {"final_status": "PENDING_TIMEOUT", "payer_ref": payer_ref}

    def handle_denial(self, pa: PriorAuthRequest, payer_ref: str, denial_resp: PayerResponse) -> AgentDecision:
        order = pa.order
        analysis = llm_denial_analysis_and_appeal(order, denial_resp, pa.docs_attached) if (USE_OPENAI and OPENAI_AVAILABLE) else rule_based_denial_analysis(order, denial_resp)
        missing_docs: List[DocType] = analysis.get("missing_docs", [])
        rationale: str = analysis.get("rationale", "")
        appeal_text: str = analysis.get("appeal_text", "")

        if denial_resp.denial_reason == DenialReason.MISSING_DOCS and denial_resp.missing_docs:
            missing_docs = denial_resp.missing_docs

        if missing_docs:
            new_docs = self.ehr.fetch_additional_docs(order.patient.patient_id, missing_docs)
            pa.docs_attached.extend(new_docs)
            self.log("missing_docs_collected", {"payer_ref": payer_ref, "collected": [d.doc_type.value for d in new_docs]})

        uncertainty = compute_uncertainty(denial_resp, missing_docs, llm_used=(USE_OPENAI and OPENAI_AVAILABLE))
        self.log("denial_analyzed", {"payer_ref": payer_ref, "denial_reason": (denial_resp.denial_reason.value if denial_resp.denial_reason else None),
                                     "uncertainty": uncertainty, "missing_docs": [d.value for d in missing_docs]})

        if uncertainty >= self.uncertainty_threshold:
            return AgentDecision(
                action="escalate",
                missing_docs=missing_docs,
                rationale=f"{rationale} Escalating due to high uncertainty ({uncertainty:.2f}) >= threshold ({self.uncertainty_threshold:.2f}).",
                uncertainty=uncertainty,
                next_wait_seconds=0,
            )

        if not appeal_text:
            analysis2 = rule_based_denial_analysis(order, denial_resp)
            appeal_text = analysis2.get("appeal_text", "")

        attached_types = sorted(list({d.doc_type.value for d in pa.docs_attached}))
        appeal_text = (
            appeal_text.strip()
            + "\n\nAttached documents:\n- "
            + "\n- ".join(attached_types)
            + "\n\nRequested outcome: Reconsideration and authorization issuance.\n"
        )

        return AgentDecision(
            action="appeal",
            missing_docs=missing_docs,
            rationale=f"{rationale} Proceeding autonomously (uncertainty {uncertainty:.2f} < threshold {self.uncertainty_threshold:.2f}).",
            uncertainty=uncertainty,
            appeal_text=appeal_text,
            next_wait_seconds=1,
        )

We implement the core intelligence layer that attaches documents, analyzes denials, and estimates uncertainty. We demonstrate how rule-based logic and optional LLM reasoning can coexist without compromising determinism. We explicitly gate automation decisions to maintain safety in a healthcare context. Check out the FULL CODES here.

ehr = MockEHR()
ehr.seed_data(n_orders=6)

payer = MockPayerPortal()
agent = PriorAuthAgent(ehr, payer, uncertainty_threshold=0.55)

results = []
print("=== Starting Autonomous Prior Authorization Agent Demo ===")
print(f"OpenAI enabled: {USE_OPENAI and OPENAI_AVAILABLE}\n")

while True:
    new_orders = ehr.poll_new_surgery_orders(max_n=1)
    if not new_orders:
        break

    order = new_orders[0]
    print("\n--- New Surgery Order Detected ---")
    print(f"Order: {order.order_id} | Patient: {order.patient.patient_id} | Plan: {order.patient.plan.value} | Surgery: {order.surgery_type.value}")

    pa = agent.build_prior_auth_request(order)
    outcome = agent.submit_and_monitor(pa, max_polls=7)
    results.append({"order_id": order.order_id, "patient_id": order.patient.patient_id, **outcome})

    print(f"Outcome: {outcome['final_status']} | PayerRef: {outcome.get('payer_ref')}")

print("\n=== Summary ===")
status_counts = {}
for r in results:
    status_counts[r["final_status"]] = status_counts.get(r["final_status"], 0) + 1
print("Final status counts:", status_counts)

print("\nSample result (first case):")
print(json.dumps(results[0], indent=2))

print("\n=== Audit Log (last ~12 events) ===")
for row in agent.audit_log[-12:]:
    print(json.dumps(row, indent=2))

print(
    "\nHardening checklist (high level):\n"
    "- Swap mocks for real EHR + payer integrations (FHIR/HL7, payer APIs/portal automations)\n"
    "- Add PHI governance (tokenization, least-privilege access, encrypted logging, retention controls)\n"
    "- Add deterministic policy engine + calibrated uncertainty model\n"
    "- Add human-in-the-loop UI with SLA timers, retries/backoff, idempotency keys\n"
    "- Add evidence packing (policy citations, structured attachments, templates)\n"
)

We orchestrate the full end-to-end workflow and generate operational summaries and audit logs. We track outcomes, escalation events, and system behavior to support transparency and compliance. We emphasize observability and traceability as essential requirements for healthcare AI systems.

In conclusion, we illustrated how agentic AI can meaningfully reduce administrative friction in healthcare RCM by automating repetitive, rules-driven prior authorization tasks while preserving human oversight for ambiguous or high-risk decisions. We showed that combining deterministic policy logic, uncertainty estimation, and optional LLM-assisted reasoning enables a balanced approach that aligns with healthcare’s safety-critical nature. This work should be viewed as an architectural and educational reference rather than a deployable medical system; any real-world implementation must adhere to HIPAA and regional data protection laws, incorporate de-identification and access controls, undergo clinical and compliance review, and be validated against payer-specific policies.

Check out the FULL CODES here.
The post How to Build a Safe, Autonomous Prior Authorization Agent for Healthcare Revenue Cycle Management with Human-in-the-Loop Controls appeared first on MarkTechPost.

Black Forest Labs Releases FLUX.2 [klein]: Compact Flow Models for Interactive Visual Intelligence

Black Forest Labs releases FLUX.2 [klein], a compact image model family that targets interactive visual intelligence on consumer hardware. FLUX.2 [klein] extends the FLUX.2 line with sub second generation and editing, a unified architecture for text to image and image to image, and deployment options that range from local GPUs to cloud APIs, while keeping state of the art image quality.

From FLUX.2 [dev] to interactive visual intelligence

FLUX.2 [dev] is a 32 billion parameter rectified flow transformer for text conditioned image generation and editing, including composition with multiple reference images, and runs mainly on data center class accelerators. It is tuned for maximum quality and flexibility, with long sampling schedules and high VRAM requirements.

FLUX.2 [klein] takes the same design direction and compresses it into smaller rectified flow transformers with 4 billion and 9 billion parameters. These models are distilled to very short sampling schedules, support the same text to image and multi reference editing tasks, and are optimized for response times below 1 second on modern GPUs.

Model family and capabilities

The FLUX.2 [klein] family consists of four main open weight variants that share a single architecture.

FLUX.2 [klein] 4B

FLUX.2 [klein] 9B

FLUX.2 [klein] 4B Base

FLUX.2 [klein] 9B Base

FLUX.2 [klein] 4B and 9B are step distilled and guidance distilled models. They use 4 inference steps and are positioned as the fastest options for production and interactive workloads. FLUX.2 [klein] 9B combines a 9B flow model with an 8B Qwen3 text embedder and is described as the flagship small model on the Pareto frontier for quality versus latency across text to image, single reference editing, and multi reference generation.

The Base variants are undistilled versions with longer sampling schedules. The documentation lists them as foundation models that preserve the complete training signal and provide higher output diversity. They are intended for fine tuning, LoRA training, research pipelines, and custom post training workflows where control is more important than minimum latency.

All FLUX.2 [klein] models support three core tasks in the same architecture. They can generate images from text, they can edit a single input image, and they can perform multi reference generation and editing where several input images and a prompt jointly define the target output.

Latency, VRAM, and quantized variants

The FLUX.2 [klein] model page provides approximate end to end inference times on GB200 and RTX 5090. FLUX.2 [klein] 4B is the fastest variant and is listed at about 0.3 to 1.2 seconds per image, depending on hardware. FLUX.2 [klein] 9B targets about 0.5 to 2 seconds at higher quality. The Base models require several seconds because they run with 50 step sampling schedules, but they expose more flexibility for custom pipelines.

The FLUX.2 [klein] 4B model card states that 4B fits in about 13 GB of VRAM and is suitable for GPUs like the RTX 3090 and RTX 4070. The FLUX.2 [klein] 9B card reports a requirement of about 29 GB of VRAM and targets hardware such as the RTX 4090. This means a single high end consumer card can host the distilled variants with full resolution sampling.

To extend the reach to more devices, Black Forest Labs also releases FP8 and NVFP4 versions for all FLUX.2 [klein] variants, developed together with NVIDIA. FP8 quantization is described as up to 1.6 times faster with up to 40 percent lower VRAM usage, and NVFP4 as up to 2.7 times faster with up to 55 percent lower VRAM usage on RTX GPUs, while keeping the core capabilities the same.

Benchmarks against other image models

Black Forest Labs evaluates FLUX.2 [klein] through Elo style comparisons on text to image, single reference editing, and multi reference tasks. The performance charts show FLUX.2 [klein] on the Pareto frontier of Elo score versus latency and Elo score versus VRAM. The commentary states that FLUX.2 [klein] matches or exceeds the quality of Qwen based image models at a fraction of the latency and VRAM, and that it outperforms Z Image while supporting unified text to image and multi reference editing in one architecture.

https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence

The base variants trade some speed for full customizability and fine tuning, which aligns with their role as foundation checkpoints for new research and domain specific pipelines.

Key Takeaways

FLUX.2 [klein] is a compact rectified flow transformer family with 4B and 9B variants that supports text to image, single image editing, and multi reference generation in one unified architecture.

The distilled FLUX.2 [klein] 4B and 9B models use 4 sampling steps and are optimized for sub second inference on a single modern GPU, while the undistilled Base models use longer schedules and are intended for fine tuning and research.

Quantized FP8 and NVFP4 variants, built with NVIDIA, provide up to 1.6 times speedup with about 40 percent VRAM reduction for FP8 and up to 2.7 times speedup with about 55 percent VRAM reduction for NVFP4 on RTX GPUs.

Check out the Technical details, Repo and Model weights.
The post Black Forest Labs Releases FLUX.2 [klein]: Compact Flow Models for Interactive Visual Intelligence appeared first on MarkTechPost.

Google AI Releases TranslateGemma: A New Family of Open Translation Models Built on Gemma 3 with Support for 55 Languages

Google AI has released TranslateGemma, a suite of open machine translation models built on Gemma 3 and targeted at 55 languages. The family comes in 4B, 12B and 27B parameter sizes. It is designed to run across devices from mobile and edge hardware to laptops and a single H100 GPU or TPU instance in the cloud.

TranslateGemma is not a separate architecture. It is Gemma 3 specialized for translation through a two stage post training pipeline: (1) supervised fine tuning on large parallel corpora, and (2) reinforcement learning that optimizes translation quality with a multi signal reward ensemble. The goal is to push translation quality while keeping the general instruction following behavior of Gemma 3.

Supervised fine tuning on synthetic and human parallel data

The supervised fine tuning stage starts from the public Gemma 3 4B, 12B and 27B checkpoints. The research team uses parallel data that combines human translations with high quality synthetic translations generated by Gemini models.

Synthetic data is produced from monolingual sources with a multi step procedure. The pipeline selects candidate sentences and short documents, feeds them to Gemini 2.5 Flash, and then filters outputs with MetricX 24 QE to keep only examples that show clear quality gains. This is applied across all WMT24++ language pairs plus 30 more language pairs.

Low resource languages receive human generated parallel data from the SMOL and GATITOS datasets. SMOL covers 123 languages and GATITOS covers 170 languages. This improves coverage of scripts and language families that are under represented in publicly available web parallel data.

The final supervised fine tuning mixture also keeps 30 percent generic instruction following data from the original Gemma 3 mixture. This is important. Without it, the model would over specialize on pure translation and lose general LLM behavior such as following instructions or doing simple reasoning in context.

Training uses the Kauldron SFT (Supervised Fine tuning) tooling with the AdaFactor optimizer. The learning rate is 0.0001 with batch size 64 for 200000 steps. All model parameters are updated except the token embeddings, which are frozen. Freezing embeddings helps preserve representation quality for languages and scripts that do not appear in the supervised fine tuning data.
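As a rough illustration of that recipe, the sketch below approximates the reported settings (AdaFactor with a fixed learning rate of 1e-4, frozen token embeddings) using Hugging Face Transformers. This is our own simplification, not the Kauldron pipeline the paper describes; the checkpoint name, dataset, and training loop are placeholders.

import torch
from transformers import AutoModelForCausalLM, Adafactor

# Placeholder checkpoint; substitute the Gemma 3 base model you have access to
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-pt")

# Freeze token embeddings to preserve representations for languages and scripts
# that do not appear in the fine-tuning data
model.get_input_embeddings().weight.requires_grad_(False)

optimizer = Adafactor(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-4,                 # reported learning rate
    scale_parameter=False,
    relative_step=False,
)

# Training loop elided: batches of 64 parallel-text examples for ~200,000 steps,
# with roughly 30 percent generic instruction-following data mixed in.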

Reinforcement learning with a translation focused reward ensemble

After supervised fine tuning, TranslateGemma runs a reinforcement learning phase on top of the same translation data mixture. The reinforcement learning objective uses several reward models.

The reward ensemble includes:

MetricX 24 XXL QE, a learned regression metric that approximates MQM scores and is used here in quality estimation mode without a reference.

Gemma AutoMQM QE, a span level error predictor fine tuned from Gemma 3 27B IT on MQM labeled data. It produces token level rewards based on error type and severity.

ChrF, a character n gram overlap metric that compares model output with synthetic references and is rescaled to match the other rewards.

A Naturalness Autorater that uses the policy model as an LLM judge and produces span level penalties for segments that do not sound like native text.

A generalist reward model from the Gemma 3 post training setup that keeps reasoning and instruction following ability intact.

TranslateGemma uses reinforcement learning algorithms that combine sequence level rewards with token level advantages. Span level rewards from AutoMQM and the Naturalness Autorater attach directly to the affected tokens. These token advantages are added to sequence advantages computed from reward to go and then batch normalized. This improves credit assignment compared with pure sequence level reinforcement learning.

Benchmark results on WMT24++

TranslateGemma is evaluated on the WMT24++ benchmark using MetricX 24 and Comet22. MetricX is a lower is better metric that correlates with MQM error counts. Comet22 is higher is better and measures adequacy and fluency.

https://arxiv.org/pdf/2601.09012

The table above, from the research paper, summarizes results for English centered evaluation over 55 language pairs.

27B: Gemma 3 baseline has MetricX 4.04 and Comet22 83.1. TranslateGemma 27B reaches MetricX 3.09 and Comet22 84.4.

12B: Gemma 3 baseline has MetricX 4.86 and Comet22 81.6. TranslateGemma 12B reaches MetricX 3.60 and Comet22 83.5.

4B: Gemma 3 baseline has MetricX 6.97 and Comet22 77.2. TranslateGemma 4B reaches MetricX 5.32 and Comet22 80.1.

The key pattern is that TranslateGemma improves quality for every model size. At the same time, model scale interacts with specialization. The 12B TranslateGemma model surpasses the 27B Gemma 3 baseline. The 4B TranslateGemma model reaches quality similar to the 12B Gemma 3 baseline. This means a smaller translation specialized model can replace a larger baseline model for many machine translation workloads.

https://arxiv.org/pdf/2601.09012

A language level breakdown in the above appendix table from the research paper shows that these gains appear across all 55 language pairs. For example, MetricX improves from 1.63 to 1.19 for English to German, 2.54 to 1.88 for English to Spanish, 3.90 to 2.72 for English to Hebrew, and 5.92 to 4.45 for English to Swahili. Improvements are also large for harder cases such as English to Lithuanian, English to Estonian and English to Icelandic.

Human evaluation on WMT25 with MQM confirms this trend. TranslateGemma 27B usually yields lower MQM scores, that is fewer weighted errors, than Gemma 3 27B, with especially strong gains for low resource directions such as English to Marathi, English to Swahili and Czech to Ukrainian. There are two notable exceptions. For German as target both systems are very close. For Japanese to English TranslateGemma shows a regression caused mainly by named entity errors, even though other error categories improve.

Multimodal translation and interface for developers

TranslateGemma inherits the image understanding stack of Gemma 3. The research team evaluates image translation on the Vistra benchmark. They select 264 images that each contain a single text instance. The model receives only the image plus a prompt that asks it to translate the text in the image. There is no separate bounding box input and no explicit OCR step.

On this setting, TranslateGemma 27B improves MetricX from 2.03 to 1.58 and Comet22 from 76.1 to 77.7. The 4B variant shows smaller but positive gains. The 12B model improves MetricX but has a slightly lower Comet22 score than the baseline. Overall, the research team concludes that TranslateGemma retains the multimodal ability of Gemma 3 and that text translation improvements mostly carry over to image translation.

Key Takeaways

TranslateGemma is a specialized Gemma 3 variant for translation: TranslateGemma is a suite of open translation models derived from Gemma 3, with 4B, 12B and 27B parameter sizes, optimized for 55 languages through a two stage pipeline, supervised fine tuning then reinforcement learning with translation focused rewards.

Training combines Gemini synthetic data with human parallel corpora: The models are fine tuned on a mixture of high quality synthetic parallel data generated by Gemini and human translated data, which improves coverage for both high resource and low resource languages while preserving general LLM capabilities from Gemma 3.

Reinforcement learning uses an ensemble of quality estimation rewards: After supervised fine tuning, TranslateGemma applies reinforcement learning driven by an ensemble of reward models, including MetricX QE and AutoMQM, that explicitly target translation quality and fluency rather than generic chat behavior.

Smaller models match or beat larger Gemma 3 baselines on WMT24++: On WMT24++ across 55 languages, all TranslateGemma sizes show consistent improvements over Gemma 3, with the 12B model surpassing the 27B Gemma 3 baseline and the 4B model reaching quality comparable to the 12B baseline, which reduces compute requirements for a given translation quality level.

Models retain multimodal abilities and are released as open weights: TranslateGemma keeps Gemma 3 image text translation capabilities and improves performance on the Vistra image translation benchmark, and the weights are released as open models on Hugging Face and Vertex AI, enabling local and cloud deployment.

Check out the Paper, Model Weights and Technical details.
The post Google AI Releases TranslateGemma: A New Family of Open Translation Models Built on Gemma 3 with Support for 55 Languages appeared first on MarkTechPost.

Advanced fine-tuning techniques for multi-agent orchestration: Pattern …

Our work with large enterprise customers and Amazon teams has revealed that high-stakes use cases continue to benefit significantly from advanced large language model (LLM) fine-tuning and post-training techniques. In this post, we show you how fine-tuning enabled a 33% reduction in dangerous medication errors (Amazon Pharmacy), an 80% reduction in engineering effort (Amazon Global Engineering Services), and an improvement in content quality assessment accuracy from 77% to 96% (Amazon A+). These aren't hypothetical projections; they're production results from Amazon teams. While many use cases can be effectively addressed through prompt engineering, Retrieval Augmented Generation (RAG) systems, and turnkey agent deployment, our work with Amazon and large enterprise accounts reveals a consistent pattern: one in four high-stakes applications, where patient safety, operational efficiency, or customer trust are on the line, demands advanced fine-tuning and post-training techniques to achieve production-grade performance.
This post details the techniques behind these outcomes: from foundational methods such as supervised fine-tuning (SFT, also known as instruction tuning) and Proximal Policy Optimization (PPO), to Direct Preference Optimization (DPO) for human alignment, to cutting-edge reasoning optimizations such as Group Relative Policy Optimization (GRPO), Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), and Group Sequence Policy Optimization (GSPO) purpose-built for agentic systems. We walk through the technical evolution of each approach, examine real-world implementations at Amazon, present a reference architecture on Amazon Web Services (AWS), and provide a decision framework for selecting the right technique based on your use case requirements.
The continued relevance of fine-tuning in the agentic AI era
Despite the growing capabilities of foundation models and agent frameworks, roughly one in four enterprise use cases still requires advanced fine-tuning to achieve the necessary performance levels. These are typically scenarios where the stakes are high from a revenue or customer trust perspective, domain-specific knowledge is essential, enterprise integration at scale is required, governance and control are paramount, business process integration is complex, or multi-modal support is needed. Organizations pursuing these use cases have reported higher conversion to production, greater return on investment (ROI), and up to 3-fold year-over-year growth when advanced fine-tuning is appropriately applied.
Evolution of LLM fine-tuning techniques for agentic AI
The evolution of generative AI has seen several key advancements in model customization and performance optimization techniques. Starting with SFT, which uses labeled data to teach models to follow specific instructions, the field established its foundation but faced limitations in optimizing complex reasoning. To address these limitations, reinforcement learning (RL) refines the SFT process with a reward-based system that provides better adaptability and alignment with human preferences. Among RL algorithms, a significant leap came with PPO, which pairs a policy network with a value (critic) network and adjusts the LLM weights under the guidance of a reward model. PPO scales well in complex environments, though it has challenges with stability and configuration complexity.
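To make the PPO update concrete, the following minimal sketch shows the standard clipped surrogate loss over a batch of sampled tokens. It is an illustration of the general algorithm only, not code from any AWS service; tensor names and the advantage estimates are placeholders you would supply from your own rollout and critic.

import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Minimal PPO clipped surrogate loss (illustrative sketch).

    logp_new / logp_old: log-probabilities of the sampled tokens under the
    current policy and the policy that collected the data; advantages come
    from the value (critic) network.
    """
    ratio = torch.exp(logp_new - logp_old)                          # per-token importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # minimize the negative surrogate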
DPO emerged as a breakthrough in early 2024, addressing PPO's stability issues by eliminating the explicit reward model and instead working directly with preference data that includes preferred and rejected responses for given prompts. DPO optimizes the LLM weights by comparing the preferred and rejected responses, allowing the LLM to learn and adjust its behavior accordingly. This simplified approach gained widespread adoption, with major language models incorporating DPO into their training pipelines to achieve better performance and more reliable outputs. Other alternatives, including Odds Ratio Preference Optimization (ORPO), Relative Preference Optimization (RPO), Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO), are related methods for human preference alignment. By incorporating comparative and identity-based preference structures, and by grounding optimization in behavioral economics, these methods are computationally efficient, interpretable, and aligned with actual human decision-making processes.
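The DPO objective itself is compact enough to sketch directly. The snippet below shows the standard loss given summed log-probabilities of the preferred and rejected responses under the policy being trained and a frozen reference model; it is a generic illustration, not the implementation inside any particular training service.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Minimal DPO loss (illustrative sketch).

    Inputs are summed log-probabilities of the preferred (chosen) and rejected
    responses under the trained policy and under a frozen reference model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Push the policy's preference margin above the reference model's margin
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()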
As agent-based applications gained prominence in 2025, we observed increasing demand for customizing the reasoning model in agents to encode domain-specific constraints, safety guidelines, and reasoning patterns that align with the agents' intended functions (task planning, tool use, or multi-step problem solving). The objective is to improve agents' performance in maintaining coherent plans, avoiding logical contradictions, and making appropriate decisions for domain-specific use cases. To meet these needs, GRPO was introduced to enhance reasoning capabilities and became particularly notable for its use in DeepSeek-R1.
The core innovation of GRPO lies in its group-based comparison approach: rather than comparing individual responses against a fixed reference, GRPO generates groups of responses and evaluates each against the average score of the group, rewarding those performing above average while penalizing those below. This relative comparison mechanism creates a competitive dynamic that encourages the model to produce higher-quality reasoning. GRPO is particularly effective for improving chain-of-thought (CoT) reasoning, which is the critical foundation for agent planning and complex task decomposition. By optimizing at the group level, GRPO captures the inherent variability in reasoning processes and trains the model to consistently outperform its own average performance.
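A small sketch makes this group-relative comparison concrete: for each prompt, the rewards of the sampled group are normalized against the group's own mean and standard deviation, so responses that beat the group average receive positive advantages without any separate critic. This is a generic illustration of the idea, not code from a specific training framework.

import torch

def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt (illustrative sketch).

    rewards: shape (G,), one scalar reward per sampled response in the group.
    Responses above the group mean get positive advantages; those below get
    negative ones. No value (critic) network is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 sampled answers to the same prompt
advantages = grpo_group_advantages(torch.tensor([0.0, 1.0, 1.0, 0.2]))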
Some complex agent tasks require more fine-grained corrections within long reasoning chains. DAPO addresses these use cases by building on GRPO's sequence-level rewards: it employs a higher clip ratio (approximately 30% higher than GRPO) to encourage more diverse and exploratory thinking, implements dynamic sampling to eliminate less meaningful samples and improve training efficiency, applies token-level policy gradient loss to provide more granular feedback on lengthy reasoning chains rather than treating entire sequences as monolithic units, and incorporates overlong reward shaping to discourage excessively verbose responses that waste computational resources. Additionally, when agentic use cases require long text outputs in Mixture-of-Experts (MoE) model training, GSPO supports these scenarios by shifting the optimization from GRPO's token-level importance weights to the sequence level. With these improvements, DAPO and GSPO enable more efficient and sophisticated agent reasoning and planning strategies while maintaining the computational efficiency and feedback resolution of GRPO.
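The sketch below contrasts the two weighting schemes under our reading of these methods: token-level importance ratios, as used in GRPO/DAPO-style updates, versus a length-normalized sequence-level ratio in the spirit of GSPO. It is a simplified illustration, not a reference implementation of either paper.

import torch

def token_level_ratios(logp_new, logp_old):
    """Per-token importance ratios (GRPO/DAPO-style token-level updates)."""
    return torch.exp(logp_new - logp_old)            # shape: (seq_len,)

def sequence_level_ratio(logp_new, logp_old):
    """Length-normalized sequence-level importance ratio in the spirit of GSPO:
    the geometric mean of the per-token ratios, so a long sequence contributes
    one stable weight instead of many noisy token-level weights."""
    return torch.exp((logp_new - logp_old).mean())   # scalar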
Real-world applications at Amazon
Using the fine-tuning techniques described in the previous sections, the post-trained LLMs play two crucial roles in agentic AI systems. First is the development of specialized tool-using components and sub-agents within the broader agent architecture. These fine-tuned models act as domain experts, each optimized for specific functions. By incorporating domain-specific knowledge and constraints during the fine-tuning process, these specialized components can achieve significantly higher accuracy and reliability in their designated tasks compared to general-purpose models. The second key application is serving as the core reasoning engine, where the foundation models are specifically tuned to excel at planning, logical reasoning, and decision-making for agents in a highly specific domain. The aim is to improve the model's ability to maintain coherent plans and make logically sound decisions, which are essential capabilities for any agent system. This dual approach, combining a fine-tuned reasoning core with specialized sub-components, has emerged within Amazon as a promising architecture for evolving from LLM-driven applications to agentic systems and for building more capable and reliable generative AI applications. The following table summarizes multi-agent AI orchestration with examples of advanced fine-tuning techniques.

Amazon Pharmacy
Domain: Healthcare
High-stakes factor: Patient safety
Challenge: $3.5 billion annual cost from medication errors
Techniques: SFT, PPO, RLHF, advanced RL
Key outcome: 33% reduction in medication errors

Amazon Global Engineering Services
Domain: Construction and facilities
High-stakes factor: Operational efficiency
Challenge: 3+ hour inspection reviews
Techniques: SFT, PPO, RLHF, advanced RL
Key outcome: 80% reduction in human effort

Amazon A+ Content
Domain: Ecommerce
High-stakes factor: Customer trust
Challenge: Quality assessment at 100 million+ scale
Techniques: Feature-based fine-tuning
Key outcome: 77%–96% accuracy

Amazon Healthcare Services (AHS) began its generative AI journey with a significant challenge two years ago, when the team tackled customer service efficiency through a RAG-based Q&A system. Initial attempts using traditional RAG with foundation models yielded disappointing results, with accuracy hovering between 60 and 70%. The breakthrough came when they fine-tuned the embedding model specifically for pharmaceutical domain knowledge, which resulted in a significant improvement to 90% accuracy and an 11% reduction in customer support contacts. In medication safety, medication direction errors can pose serious safety risks and cost up to $3.5 billion annually to correct. By fine-tuning a model with thousands of expert-annotated examples, Amazon Pharmacy created an agent component that validates medication directions using pharmacy logic and safety guidelines. This reduced near-miss events by 33%, as indicated in their Nature Medicine publication. In 2025, AHS is expanding its AI capabilities and transforming these separate LLM-driven applications into a holistic multi-agent system to enhance the patient experience. These individual applications driven by fine-tuned models play a crucial role in the overall agentic architecture, serving as domain expert tools that address specific mission-critical functions in pharmaceutical services.
The Amazon Global Engineering Services (GES) team, responsible for overseeing hundreds of Amazon fulfillment centers worldwide, embarked on an ambitious journey to use generative AI in their operations. Their initial foray into this technology focused on creating a sophisticated Q&A system designed to assist engineers in efficiently accessing relevant design information from vast knowledge repositories. The team's approach was to fine-tune a foundation model using SFT, which resulted in a significant improvement in accuracy (measured by semantic similarity score) from 0.64 to 0.81. To better align with feedback from subject matter experts (SMEs), the team further refined the model using PPO with the human feedback data, which boosted LLM-judge scores from 3.9 to 4.2 out of 5, a remarkable achievement that translated to a substantial 80% reduction in the effort required from domain experts. Similar to the Amazon Pharmacy case, these fine-tuned specialized models will continue to function as domain expert tools within the broader agentic AI system.
In 2025, the GES team ventured into uncharted territory by applying agentic AI systems to optimize their business processes. LLM fine-tuning methodologies constitute a critical mechanism for enhancing the reasoning capabilities of AI agents, enabling effective decomposition of complex objectives into executable action sequences that align with predefined behavioral constraints and goal-oriented outcomes. Fine-tuning also serves as a critical architectural component for facilitating specialized task execution and optimizing task-specific performance metrics.
Amazon A+ Content powers rich product pages across hundreds of millions of annual submissions. The A+ team needed to evaluate content quality at scale—assessing cohesiveness, consistency, and relevancy, not just surface-level defects. Content quality directly impacts conversion and brand trust, making this a high-stakes application.
Following the architectural pattern seen in Amazon Pharmacy and Global Engineering Services, the team built a specialized evaluation agent powered by a fine-tuned model. They applied feature-based fine-tuning to Nova Lite on Amazon SageMaker—training a lightweight classifier on vision language model (VLM)-extracted features rather than updating full model parameters. This approach, enhanced by expert-crafted rubric prompts, improved classification accuracy from 77% to 96%. The result: an AI agent that evaluates millions of content submissions and delivers actionable recommendations. This demonstrates a key principle from our maturity framework—technique complexity should match task requirements. The A+ use case, while high-stakes and operating at massive scale, is fundamentally a classification task well-suited to these methods. Not every agent component requires GRPO or DAPO; selecting the right technique for each problem is what delivers efficient, production-grade systems.
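For intuition on what "feature-based fine-tuning" of this kind can look like, the sketch below trains a lightweight classifier head on pre-extracted feature vectors while the underlying vision language model stays frozen. This is our own simplified illustration, not the A+ team's pipeline; the file names, labels, and the choice of logistic regression are placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# features.npy: one row of VLM-extracted features per content submission (placeholder file)
# labels.npy: expert quality labels, e.g., 1 = meets the rubric, 0 = does not (placeholder file)
X = np.load("features.npy")
y = np.load("labels.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)   # lightweight head; the feature extractor is not updated
clf.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))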
Reference architecture for advanced AI orchestration using fine-tuning
Although fine-tuned models serve diverse purposes across different domains and use cases in an agentic AI system, the anatomy of an agent remains largely consistent and can be encompassed in component groupings, as shown in the following architecture diagram.

This modular approach adopts a number of AWS generative AI services, including Amazon Bedrock AgentCore, Amazon SageMaker, and Amazon Bedrock. It maintains the structure of the key groupings that make up an agent while providing various options within each group to improve an AI agent.

LLM customization for AI agents

Builders can use various AWS services to fine-tune and post-train the LLMs for an AI agent using the techniques discussed in the previous section. If you use LLMs on Amazon Bedrock for your agents, you can use multiple model customization approaches to fine-tune your models. Distillation and SFT through parameter-efficient fine-tuning (PEFT) with low-rank adaptation (LoRA) can be used to address simple customization tasks. For advanced fine-tuning, Continued Pre-training (CPT) extends a foundation model’s knowledge by training on domain-specific corpora (medical literature, legal documents, or proprietary technical content), embedding specialized vocabulary and domain reasoning patterns directly into model weights. Reinforcement fine-tuning (RFT), launched at re:Invent 2025, teaches models to understand what makes a quality response without large amounts of pre-labeled training data. There are two approaches supported for RFT: Reinforcement Learning with Verifiable Rewards (RLVR) uses rule-based graders for objective tasks like code generation or math reasoning, while Reinforcement Learning from AI Feedback (RLAIF) uses AI-based judges for subjective tasks like instruction following or content moderation.
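As a generic illustration of the PEFT-with-LoRA idea mentioned above, the sketch below attaches low-rank adapters to a small open model using the open source Hugging Face PEFT library; it is not the Amazon Bedrock customization API, and the base model and target modules are stand-ins you would replace for your own workload.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Small stand-in model so the sketch runs; substitute your own base checkpoint
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; names are model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the LoRA adapter weights are trainable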
If you require deeper control over model customization infrastructure for your AI agents, Amazon SageMaker AI provides a comprehensive platform for custom model development and fine-tuning. Amazon SageMaker JumpStart accelerates the customization journey by offering pre-built solutions with one-click deployment of popular foundation models (Llama, Mistral, Falcon, and others) and end-to-end fine-tuning notebooks that handle data preparation, training configuration, and deployment workflows. Amazon SageMaker Training jobs provide managed infrastructure for executing custom fine-tuning workflows, automatically provisioning GPU instances, managing training execution, and handling cleanup after completion. This approach suits most fine-tuning scenarios where standard instance configurations provide sufficient compute power and training completes reliably within the job duration limits. You can use SageMaker Training jobs with custom Docker containers and code dependencies housing any machine learning (ML) framework, training library, or optimization technique, enabling experimentation with emerging methods beyond managed offerings.
At re:Invent 2025, Amazon SageMaker HyperPod introduced two capabilities for large-scale model customization: Checkpointless training reduces checkpoint-restart cycles, shortening recovery time from hours to minutes. Elastic training automatically scales workloads to use idle capacity and yields resources when higher-priority workloads peak. These features build on the core strengths of HyperPod—resilient distributed training clusters with automatic fault recovery for multi-week jobs spanning thousands of GPUs. HyperPod supports NVIDIA NeMo and AWS Neuronx frameworks, and is ideal when training scale, duration, or reliability requirements exceed what job-based infrastructure can economically provide.
For builders who want to customize models in SageMaker AI without managing infrastructure, Amazon SageMaker AI serverless customization, launched at re:Invent 2025, provides a fully managed, UI- and SDK-driven experience for model fine-tuning. This capability handles infrastructure management: SageMaker automatically selects and provisions appropriate compute resources (P5, P4de, P4d, and G5 instances) based on model size and training requirements. Through the SageMaker Studio UI, you can customize popular models (Amazon Nova, Llama, DeepSeek, GPT-OSS, and Qwen) using advanced techniques including SFT, DPO, RLVR, and RLAIF. You can also run the same serverless customization using the SageMaker Python SDK in your Jupyter notebook. The serverless approach provides pay-per-token pricing, automatic resource cleanup, integrated MLflow experiment tracking, and seamless deployment to both Amazon Bedrock and SageMaker endpoints.
If you need to customize Amazon Nova models for your agentic workflow, you can do it through recipes and train them on SageMaker AI. This provides an end-to-end customization workflow, including model training, evaluation, and deployment for inference, with greater flexibility and control to fine-tune the Nova models, optimize hyperparameters with precision, and implement techniques such as LoRA PEFT, full-rank SFT, DPO, RFT, CPT, and PPO. For the Nova models on Amazon Bedrock, you can also train your Nova models with SFT and RFT using reasoning content to capture intermediate thinking steps, or use reward-based optimization when exact correct answers are difficult to define. If you have more advanced agentic use cases that require deeper model customization, you can use Amazon Nova Forge, launched at re:Invent 2025, to build your own frontier models from early model checkpoints, blend your datasets with Amazon Nova-curated training data, and host your custom models securely on AWS.

AI agent development environments and SDKs

The development environment is where developers author, test, and iterate on agent logic before deployment. Developers use integrated development environments (IDEs) such as SageMaker AI Studio (Jupyter notebooks or code editors), Kiro, or IDEs on local machines such as PyCharm. Agent logic is implemented using specialized SDKs and frameworks that abstract orchestration complexity—Strands provides a Python framework purpose-built for multi-agent systems, offering declarative agent definitions, built-in state management, and native AWS service integrations. With these development tools handling the low-level details of LLM API calls, tool invocation protocols, error recovery, and conversation management, developers can focus on business logic rather than infrastructure design and maintenance.
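To ground this, the following is a minimal sketch of a Strands agent; the tool, system prompt, and order-status logic are hypothetical examples, and the SDK defaults to an Amazon Bedrock model unless you configure one explicitly.

```python
# Minimal Strands agent sketch; the tool and prompt are hypothetical examples.
from strands import Agent, tool

@tool
def get_order_status(order_id: str) -> str:
    """Return the shipping status for an order ID."""
    # In a real agent, this would call an internal API or an AgentCore Gateway tool.
    return f"Order {order_id} is out for delivery."

agent = Agent(
    system_prompt="You are a customer support assistant for an online store.",
    tools=[get_order_status],
)

# The SDK handles the model call, tool invocation, and conversation state.
result = agent("Where is order 12345?")
print(result)
```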

AI agent deployment and operation

After your AI agent is developed and ready to deploy to production, you can use Amazon Bedrock AgentCore to handle agent execution, memory, security, and tool integration without requiring infrastructure management. Amazon Bedrock AgentCore provides a set of integrated services, including the following:

AgentCore Runtime – Offers purpose-built environments that abstract away infrastructure management, while container-based alternatives (SageMaker AI jobs, AWS Lambda, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon Elastic Container Service (Amazon ECS)) provide more control for custom requirements. Essentially, the runtime is where your carefully crafted agent code meets real users and delivers business value at scale (see the sketch after this list).
AgentCore Memory – Gives your AI agents the ability to remember past interactions, enabling more intelligent, context-aware, and personalized conversations. It provides a straightforward and powerful way to handle both short-term context and long-term knowledge retention without the need to build or manage complex infrastructure.
AgentCore Gateway – Lets developers build, deploy, discover, and connect to tools at scale, providing observability into tool usage patterns, error handling for failed invocations, and integration with identity systems for accessing tools on behalf of users (using OAuth or API keys). Teams can update tool backends, add new capabilities, or modify authentication requirements without redeploying agents, because the gateway architecture decouples tool implementation from agent logic—maintaining flexibility as business requirements evolve.
AgentCore Observability – Helps you trace, debug, and monitor agent performance in production environments. It provides real-time visibility into agent operational performance through dashboards powered by Amazon CloudWatch and telemetry for key metrics such as session count, latency, duration, token usage, and error rates, using the OpenTelemetry (OTEL) protocol standard.
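The following is a minimal sketch of wrapping agent logic for AgentCore Runtime, assuming the Amazon Bedrock AgentCore Python SDK's application pattern; the handler body is a hypothetical placeholder where you would invoke your agent framework of choice.

```python
# Minimal sketch of an AgentCore Runtime entry point; the handler logic is a placeholder.
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload):
    """Called once per runtime invocation with the caller's JSON payload."""
    user_prompt = payload.get("prompt", "")
    # Hand the prompt to your agent framework (for example, a Strands agent) here.
    return {"result": f"Received request: {user_prompt}"}

if __name__ == "__main__":
    app.run()  # serves the agent locally; AgentCore Runtime hosts the same app in production
```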

LLM and AI agent evaluation

When your fine-tuned, LLM-driven AI agents are running in production, it's important to evaluate and monitor your models and agents continuously to maintain high quality and performance. Many enterprise use cases require custom evaluation criteria that encode domain expertise and business rules. For the Amazon Pharmacy medication direction validation process, evaluation criteria include the following:

Drug-drug interaction detection accuracy – Percentage of known contraindications correctly identified
Dosage calculation precision – Correct dosing adjustments for age, weight, and renal function
Near-miss prevention rate – Reduction in medication errors that could cause patient harm
FDA labeling compliance – Adherence to approved usage, warnings, and contraindications
Pharmacist acceptance rate – Percentage of agent recommendations accepted without modification by licensed pharmacists
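As a simple illustration of how such criteria translate into code, the following hypothetical sketch scores agent outputs against pharmacist-labeled ground truth; the record fields and metrics are illustrative, not the Amazon Pharmacy implementation.

```python
# Hypothetical sketch: compute custom evaluation metrics from labeled review cases.
from dataclasses import dataclass

@dataclass
class LabeledCase:
    agent_flagged_interaction: bool       # did the agent flag a drug-drug interaction?
    true_interaction: bool                # ground-truth label from pharmacist review
    accepted_without_modification: bool   # was the agent's direction accepted as-is?

def evaluate(cases: list[LabeledCase]) -> dict:
    interaction_cases = [c for c in cases if c.true_interaction]
    detected = sum(c.agent_flagged_interaction for c in interaction_cases)
    accepted = sum(c.accepted_without_modification for c in cases)
    return {
        "interaction_detection_accuracy": detected / max(len(interaction_cases), 1),
        "pharmacist_acceptance_rate": accepted / max(len(cases), 1),
    }

# Example: two reviewed cases feed the custom metrics used to gate a release.
print(evaluate([
    LabeledCase(True, True, True),
    LabeledCase(False, True, False),
]))
```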
For your models on Amazon Bedrock, you can use Amazon Bedrock evaluations to generate predefined metrics and human review workflows. For advanced scenarios, you can use SageMaker Training jobs to fine-tune specialized judge models on domain-specific evaluation datasets. For holistic AI agent evaluation, AgentCore Evaluations, launched at re:Invent 2025, provides automated assessment tools that measure how well your agents and tools complete specific tasks, handle edge cases, and maintain consistency across different inputs and contexts.
Decision guide and recommended phased approach
Now that you understand the technical evolution of advanced fine-tuning techniques—from SFT to PPO, DPO, GRPO, DAPO, and GSPO—the critical question becomes when and why you should use them. Our experience shows that organizations using a phased maturity approach achieve 70–85% production conversion rates (compared to the 30–40% industry average) and 3-fold year-over-year ROI growth. The 12–18 month journey from initial agent deployment to advanced reasoning capabilities delivers incremental business value at each phase. The key is letting your use case requirements, available data, and measured performance guide advancement—not technical sophistication for its own sake.
The maturity path progresses through four phases (shown in the following table). Strategic patience in this progression builds reusable infrastructure, collects quality training data, and validates ROI before major investments. As our examples demonstrate, aligning technical sophistication with human and business needs delivers transformative outcomes and sustainable competitive advantages in your most critical AI applications.

| Phase | Timeline | When to use | Key outcomes | Data needed | Investment |
| --- | --- | --- | --- | --- | --- |
| Phase 1: Prompt engineering | 6–8 weeks | Starting agent journey; validating business value; simple workflows | 60–75% accuracy; failure patterns identified | Minimal prompts, examples | $50K–$80K (2–3 full-time employees (FTEs)) |
| Phase 2: Supervised fine-tuning (SFT) | 12 weeks | Domain knowledge gaps; industry terminology issues; need 80–85% accuracy | 80–85% accuracy; 60–80% SME effort reduction | 500–5,000 labeled examples | $120K–$180K (3–4 FTEs and compute) |
| Phase 3: Direct Preference Optimization (DPO) | 16 weeks | Quality/style alignment; safety/compliance critical; brand consistency needed | 85–92% accuracy; CSAT improvement of over 20% | 1,000–10,000 preference pairs | $180K–$280K (4–5 FTEs and compute) |
| Phase 4: GRPO and DAPO | 24 weeks | Complex reasoning required; high-stakes decisions; multi-step orchestration; explainability essential | 95–98% accuracy; mission-critical deployment | 10,000+ reasoning trajectories | $400K–$800K (6–8 FTEs and HyperPod) |

Conclusion
While agents have transformed how we build AI systems, advanced fine-tuning remains a critical component for enterprises seeking competitive advantage in high-stakes domains. By understanding the evolution of techniques like PPO, DPO, GRPO, DAPO, and GSPO, and applying them strategically within agent architectures, organizations can achieve significant improvements in accuracy, efficiency, and safety. The real-world examples from Amazon demonstrate that the combination of agentic workflows with carefully fine-tuned models delivers dramatic business outcomes.
AWS continues to accelerate these capabilities with several key launches at re:Invent 2025. Reinforcement fine-tuning (RFT) on Amazon Bedrock now enables models to learn quality responses through RLVR for objective tasks and RLAIF for subjective evaluations—without requiring large amounts of pre-labeled data. Amazon SageMaker AI serverless customization eliminates infrastructure management for fine-tuning, supporting SFT, DPO, and RLVR techniques with pay-per-token pricing. For large-scale training, Amazon SageMaker HyperPod introduced checkpointless training and elastic scaling to reduce recovery time and optimize resource utilization. Amazon Nova Forge empowers enterprises to build custom frontier models from early checkpoints, blending proprietary datasets with Amazon-curated training data. Finally, AgentCore Evaluations provides automated assessment tools to measure agent performance on task completion, edge cases, and consistency—closing the loop on production-grade agentic AI systems.
As you evaluate your generative AI strategy, use the decision guide and phased maturity approach outlined in this post to identify where advanced fine-tuning can tip the scales from good enough to transformative. Use the reference architecture as a baseline to structure your agentic AI systems, and use the capabilities introduced at re:Invent 2025 to accelerate your journey from initial agent deployment to production-grade outcomes.

About the authors
Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Kristine Pearce is a Principal Worldwide Generative AI GTM Specialist at AWS, focused on SageMaker AI model customization, optimization, and inference at scale. She combines her MBA, BS Industrial Engineering background, and human-centered design expertise to bring strategic depth and behavioral science to AI-enabled transformation. Outside work, she channels her creativity through art.
Harsh Asnani is a Worldwide Generative AI Specialist Solutions Architect at AWS specializing in ML theory, MLOps, and production generative AI frameworks. His background is in applied data science with a focus on operationalizing AI workloads in the cloud at scale.
Sung-Ching Lin is a Principal Engineer at Amazon Pharmacy, where he leads the design and adoption of AI/ML systems to improve customer experience and operational efficiency. He focuses on building scalable, agent-based architectures, ML evaluation frameworks, and production-ready AI solutions in regulated healthcare domains.
Elad Dwek is a Senior AI Business Developer at Amazon, working within Global Engineering, Maintenance, and Sustainability. He partners with business and technical stakeholders to identify opportunities where AI can address business challenges or completely transform processes, driving innovation from prototyping to production. With a background in construction and physical engineering, he focuses on change management, technology adoption, and building scalable, transferable solutions that deliver continuous improvement across industries. Outside of work, he enjoys traveling around the world with his family.
Carrie Song is a Senior Program Manager at Amazon, working on AI-powered content quality and customer experience initiatives. She partners with applied science, engineering, and UX teams to translate generative AI and machine learning insights into scalable, customer-facing solutions. Her work focuses on improving content quality and streamlining the shopping experience on product detail pages.