Tufa Labs Introduced LADDER: A Recursive Learning Framework Enabling Large Language Models to Self-Improve without Human Intervention

Large Language Models (LLMs) benefit significantly from reinforcement learning techniques, which enable iterative improvements by learning from rewards. However, training these models efficiently remains challenging, as they often require extensive datasets and human supervision to enhance their capabilities. Developing methods that allow LLMs to self-improve autonomously without additional human input or large-scale architectural modifications has become a major focus in AI research.

The key challenge in training LLMs is ensuring the learning process is efficient and structured. The training process can stall when models encounter problems beyond their capabilities, leading to poor performance. Traditional reinforcement learning techniques rely on well-curated datasets or human feedback to create effective learning pathways, but this approach is resource-intensive. Also, LLMs struggle to improve systematically without a structured difficulty gradient, making it difficult to bridge the gap between basic reasoning tasks and more complex problem-solving.

Existing approaches to training LLMs primarily involve supervised fine-tuning, reinforcement learning from human feedback (RLHF), and curriculum learning. Supervised fine-tuning requires manually labeled datasets, which can lead to overfitting and limited generalization. RLHF introduces a layer of human oversight, where models are refined based on human evaluations, but this method is costly and does not scale efficiently. Curriculum learning, which gradually increases task difficulty, has shown promise, but current implementations still rely on pre-defined datasets rather than allowing models to generate their learning trajectories. These limitations highlight the need for an autonomous learning framework that enables LLMs to improve their problem-solving abilities independently.

Researchers from Tufa Labs introduced LADDER (Learning through Autonomous Difficulty-Driven Example Recursion) to overcome these limitations. This framework enables LLMs to self-improve by recursively generating and solving progressively simpler variants of complex problems. Unlike prior methods that depend on human intervention or curated datasets, LADDER leverages the model’s capabilities to create a natural difficulty gradient, allowing for structured self-learning. The research team developed and tested LADDER on mathematical integration tasks, demonstrating its effectiveness in enhancing model performance. By applying LADDER, the researchers enabled a 3-billion-parameter Llama 3.2 model to improve its accuracy on undergraduate integration problems from 1% to 82%, an unprecedented leap in mathematical reasoning capabilities. The approach was also extended to larger models, such as Qwen2.5 7B Deepseek-R1 Distilled, achieving 73% accuracy on the MIT Integration Bee qualifying examination, far surpassing models like GPT-4o, which scored only 42%, and typical human performance in the 15-30% range.

LADDER follows a structured methodology that allows LLMs to bootstrap their learning by systematically breaking down complex problems. The process involves three primary components: variant generation, solution verification, and reinforcement learning. The variant generation step ensures the model produces progressively easier versions of a given problem, forming a structured difficulty gradient. The solution verification step employs numerical integration methods to assess the correctness of generated solutions, providing immediate feedback without human intervention. Finally, the reinforcement learning component uses Group Relative Policy Optimization (GRPO) to train the model efficiently. This protocol enables the model to learn incrementally by leveraging verified solutions, allowing it to refine its problem-solving strategies systematically. The researchers extended this approach with Test-Time Reinforcement Learning (TTRL), which dynamically generates problem variants during inference and applies reinforcement learning to refine solutions in real time. When applied to the MIT Integration Bee qualifying examination, TTRL boosted model accuracy from 73% to 90%, surpassing OpenAI’s o1 model.
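
To make the verification step concrete, the following minimal sketch shows how a candidate antiderivative could be checked numerically against its integrand; the function names, interval, and tolerance are illustrative assumptions, not code from the LADDER paper.

# Hypothetical sketch: verify a model-proposed antiderivative numerically.
# Names (verify_antiderivative, f, F) are illustrative, not from the paper.
import numpy as np
from scipy.integrate import quad

def verify_antiderivative(f, F, a=-1.0, b=1.0, tol=1e-4):
    """Check that F is an antiderivative of f by comparing the numerically
    computed definite integral of f on [a, b] with F(b) - F(a)."""
    numeric, _ = quad(f, a, b)     # numerical ground truth
    claimed = F(b) - F(a)          # value implied by the candidate solution
    return abs(numeric - claimed) < tol

# Example: the model claims the integral of x*cos(x) is x*sin(x) + cos(x).
f = lambda x: x * np.cos(x)
F = lambda x: x * np.sin(x) + np.cos(x)
print(verify_antiderivative(f, F))   # True -> positive reward signal for GRPO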

When tested on a dataset of 110 undergraduate-level integration problems, a Llama 3.2 3B model trained with LADDER achieved 82% accuracy, compared to 2% accuracy when using pass@10 sampling. The approach also demonstrated scalability, as increasing the number of generated variants led to continued performance improvements. In contrast, reinforcement learning without variants failed to achieve meaningful gains, reinforcing the importance of structured problem decomposition. The researchers observed that LADDER-trained models could solve integrals requiring advanced techniques that were previously out of reach. Applying the methodology to the MIT Integration Bee qualifying examination, a Deepseek-R1 Qwen2.5 7B model trained with LADDER outperformed larger models that did not undergo recursive training, showcasing the effectiveness of structured self-improvement in mathematical reasoning.

Key Takeaways from the Research on LADDER include:

LADDER enables LLMs to self-improve by recursively generating and solving simpler variants of complex problems.

Llama 3.2 3B model improved from 1% to 82% on undergraduate integration tasks, demonstrating the effectiveness of structured self-learning.

Qwen2.5 7B Deepseek-R1 Distilled achieved 73% accuracy on the MIT Integration Bee qualifying examination, outperforming GPT-4o (42%) and exceeding typical human performance (15-30%).

Test-Time Reinforcement Learning (TTRL) further boosted accuracy from 73% to 90%, surpassing OpenAI’s o1 model.

LADDER does not require external datasets or human intervention, making it a cost-effective and scalable solution for LLM training.

Models trained with LADDER demonstrated superior problem-solving capabilities compared to reinforcement learning without structured difficulty gradients.

The framework provides a structured way for AI models to refine their reasoning skills without external supervision.

The methodology can be extended to competitive programming, theorem proving, and agent-based problem-solving.


Researchers from AMLab and CuspAI Introduced Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems

Deep learning faces difficulties when applied to large physical systems on irregular grids, especially when interactions occur over long distances or at multiple scales. Handling these complexities becomes harder as the number of nodes increases. Existing techniques struggle to scale to such systems, resulting in high computational costs and inefficiency. The major challenges are capturing long-range effects, handling multi-scale dependencies, and computing efficiently with minimal resource usage. These issues make it difficult to apply deep learning models effectively to fields like molecular simulations, weather prediction, and particle mechanics, where large datasets and complex interactions are common.

Current deep learning methods struggle to scale attention mechanisms to large physical systems. Traditional self-attention computes interactions between all pairs of points, leading to extremely high computational costs. Some methods apply attention to small patches, like SwinTransformer for images, but irregular data needs extra steps to structure it. Techniques like PointTransformer use space-filling curves, but this can break spatial relationships. Hierarchical methods, such as H-transformer and OctFormer, group data at different levels but rely on costly operations. Cluster attention methods reduce complexity by aggregating points, but this process loses fine details and struggles with multi-scale interactions.

To address these problems, researchers from AMLab at the University of Amsterdam and CuspAI introduced Erwin, a hierarchical transformer that enhances data processing efficiency through ball tree partitioning. The ball tree organizes the data hierarchically, so attention can be computed in parallel within local clusters. This approach minimizes computational complexity without sacrificing accuracy, bridging the gap between the efficiency of tree-based methods and the generality of attention mechanisms. Erwin uses self-attention in localized regions with positional encoding and a distance-based attention bias to capture geometric structure. Cross-ball connections facilitate communication among different regions, while tree coarsening and refinement mechanisms balance global and local interactions. This organized process delivers scalability and expressivity at minimal computational expense.
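
As a rough illustration of the partitioning idea (not the authors' implementation), the sketch below groups an irregular point cloud into fixed-size neighborhoods using scikit-learn's BallTree; Erwin's balanced ball trees and in-ball attention are more involved than this.

# Conceptual sketch only: group points into fixed-size "balls" of the kind
# over which Erwin-style local attention could run. Not the authors' code.
import numpy as np
from sklearn.neighbors import BallTree

points = np.random.rand(4096, 3)        # irregular 3D point cloud
ball_size = 64                          # points per ball for local attention
tree = BallTree(points, leaf_size=ball_size)

# Approximate each ball by the nearest neighbors of coarse centroids
# (illustrative; Erwin builds balanced ball-tree partitions directly).
num_balls = len(points) // ball_size
centroids = points[np.random.choice(len(points), num_balls, replace=False)]
_, ball_members = tree.query(centroids, k=ball_size)
print(ball_members.shape)               # (num_balls, ball_size) index groups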

Researchers conducted experiments to evaluate Erwin. It outperformed equivariant and non-equivariant baselines in cosmological simulations, capturing long-range interactions and improving with larger training datasets. For molecular dynamics, it accelerated simulations by 1.7–2.5 times without compromising accuracy, surpassing MPNN and PointNet++ in runtime while maintaining competitive test loss. Erwin outperformed MeshGraphNet, GAT, DilResNet, and EAGLE in turbulent fluid dynamics, excelling in pressure prediction while being three times faster and using eight times less memory than EAGLE. Larger ball sizes in cosmology enhanced performance by retaining long-range dependencies but increased the computational runtime, and applying MPNN at the embedding step improved the local interactions in molecular dynamics.

The hierarchical transformer design proposed here effectively handles large-scale physical systems with ball tree partitioning and obtains state-of-the-art cosmology and molecular dynamics results. Although its design trades off some expressivity for runtime, it incurs computational overhead from padding and has high memory requirements. Future work can investigate learnable pooling and other geometric encoding strategies to enhance efficiency. Erwin’s performance and scalability across these domains make it a reference point for developments in modeling large particle systems, computational chemistry, and molecular dynamics.


Microsoft AI Introduces Belief State Transformer (BST): Enhancing Goal-Conditioned Sequence Modeling with Bidirectional Context

Transformer models have transformed language modeling by enabling large-scale text generation with emergent properties. However, they struggle with tasks that require extensive planning. Researchers have explored modifications in architecture, objectives, and algorithms to improve their ability to achieve goals. Some approaches move beyond traditional left-to-right sequence modeling by incorporating bidirectional context, as seen in models trained on past and future information. Others attempt to optimize the generation order, such as latent-variable modeling or binary tree-based decoding, though left-to-right autoregressive methods often remain superior. A more recent approach involves jointly training a transformer for forward and backward decoding, enhancing the model’s ability to maintain compact belief states.

Further research has explored predicting multiple tokens simultaneously to improve efficiency. Some models have been designed to generate more than one token at a time, leading to faster and more robust text generation. Pretraining on multi-token prediction has been shown to enhance large-scale performance. Another key insight is that transformers encode belief states non-compactly within their residual stream. In contrast, state-space models offer more compact representations but come with trade-offs. For instance, certain training frameworks struggle with specific graph structures, revealing limitations in existing methodologies. These findings highlight ongoing efforts to refine transformer architectures for better structured and efficient sequence modeling.

Researchers from Microsoft Research, the University of Pennsylvania, UT Austin, and the University of Alberta introduced the Belief State Transformer (BST). This model enhances next-token prediction by considering both prefix and suffix contexts. Unlike standard transformers, BST encodes information bidirectionally, predicting the next token after the prefix and the previous token before the suffix. This approach improves performance on challenging tasks, such as goal-conditioned text generation and structured prediction problems like star graphs. By learning a compact belief state, BST outperforms conventional methods in sequence modeling, offering more efficient inference and stronger text representations, with promising implications for large-scale applications.

Unlike traditional next-token prediction models, the BST is designed to enhance sequence modeling by integrating both forward and backward encoders. It utilizes a forward encoder for prefixes and a backward encoder for suffixes, predicting the next and previous tokens. This approach prevents models from adopting shortcut strategies and improves long-term dependency learning. BST outperforms baselines in star graph navigation, where forward-only Transformers struggle. Ablations confirm that the belief state objective and backward encoder are essential for performance. During inference, BST omits the backward encoder, maintaining efficiency while ensuring goal-conditioned behavior.
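
A simplified sketch of this training objective is shown below, assuming a forward prefix encoder, a backward suffix encoder, and two prediction heads; shapes and names are illustrative rather than taken from the paper.

# Illustrative sketch of a BST-style objective (not the authors' code):
# encode the prefix left-to-right and the suffix right-to-left, then predict
# the token after the prefix and the token before the suffix jointly.
import torch
import torch.nn.functional as F

def bst_loss(forward_enc, backward_enc, next_head, prev_head,
             prefix_ids, suffix_ids, next_token, prev_token):
    h_fwd = forward_enc(prefix_ids)[:, -1]                 # [batch, hidden]
    h_bwd = backward_enc(suffix_ids.flip(dims=[1]))[:, -1] # reversed suffix
    belief = torch.cat([h_fwd, h_bwd], dim=-1)             # compact belief state
    loss_next = F.cross_entropy(next_head(belief), next_token)
    loss_prev = F.cross_entropy(prev_head(belief), prev_token)
    return loss_next + loss_prev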

Unlike forward-only and multi-token models, the BST effectively constructs a compact belief state. A belief state encodes all necessary information for future predictions. The BST learns such representations by jointly modeling prefixes and suffixes, enabling goal-conditioned text generation. Experiments using TinyStories show BST outperforms the Fill-in-the-Middle (FIM) model, producing more coherent and structured narratives. Evaluation with GPT-4 reveals BST’s superior storytelling ability, with clearer connections between prefix, generated text, and suffix. Additionally, BST excels in unconditional text generation by selecting sequences with high-likelihood endings, demonstrating its advantages over traditional next-token predictors.

In conclusion, the BST improves goal-conditioned next-token prediction by addressing the limitations of traditional forward-only models. It constructs a compact belief state, encoding all necessary information for future predictions. Unlike conventional transformers, BST predicts the next token for a prefix and the previous token for a suffix, making it more effective in complex tasks. Empirical results demonstrate its advantages in story writing, outperforming the Fill-in-the-Middle approach. While the experiments validate its performance on small-scale tasks, further research is needed to explore its scalability and applicability to broader goal-conditioned problems, enhancing efficiency and inference quality.


Alibaba Researchers Propose START: A Novel Tool-Integrated Long CoT Reasoning LLM that Significantly Enhances Reasoning Capabilities by Leveraging External Tools

Large language models have made significant strides in understanding and generating human-like text. Yet, when it comes to complex reasoning tasks—especially those that require multi-step calculations or logical analysis—they often struggle. Traditional chain-of-thought (CoT) approaches help by breaking down problems into intermediate steps, but they rely heavily on the model’s internal reasoning. This internal dependency can sometimes lead to mistakes, particularly with intricate computations or when multiple reasoning steps are needed. In such cases, minor errors may accumulate, resulting in outcomes that are not as precise as expected. The need for a method that can verify and adjust its own reasoning is clear, especially in tasks like scientific analysis or competition-level mathematics.

Researchers at Alibaba have proposed a new AI tool called START, which stands for Self-Taught Reasoner with Tools. Rather than relying solely on internal logic, START integrates an external Python interpreter to assist with reasoning tasks. The model is built on a fine-tuned version of the QwQ-32B model and employs a two-fold strategy to improve its problem-solving skills. First, it uses a method called Hint-infer. Here, the model is encouraged to include prompts like “Wait, maybe using Python here is a good idea,” which signal that it should perform computations or self-check its work using external tools. Second, the model undergoes a fine-tuning process known as Hint Rejection Sampling Fine-Tuning (Hint-RFT). This process refines the model’s reasoning by filtering and modifying its output based on how effectively it can invoke external tools. The result is a model that is not only capable of generating a logical chain of thought but also of verifying its steps through external computation.

Technical Insights and Benefits

At its core, START is an evolution of the chain-of-thought approach. Its two-stage training process is designed to help the model use external tools as a natural extension of its reasoning process. In the first stage, Hint-infer allows the model to integrate cues that prompt tool usage. These hints are strategically inserted at points where the model might be reconsidering its approach, often after transitional words like “Alternatively” or “Wait.” This encourages the model to verify its reasoning with Python code, leading to self-correction when necessary.
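
The following toy sketch illustrates the hint-insertion idea; the trigger words and hint text are paraphrased from the description above, not the exact prompts used in START.

# Toy sketch of Hint-infer: append a tool-use hint after reasoning steps that
# signal the model is reconsidering. Trigger words and hint text are illustrative.
HINT = "Wait, maybe using Python here is a good idea."
TRIGGERS = ("Alternatively", "Wait")

def insert_hints(reasoning_steps):
    """Return the reasoning steps with a tool-use hint after each trigger step."""
    hinted = []
    for step in reasoning_steps:
        hinted.append(step)
        if step.strip().startswith(TRIGGERS):
            hinted.append(HINT)   # nudges the model to verify the step with code
    return hinted

print(insert_hints(["Compute the sum of the series.",
                    "Alternatively, try a partial fraction decomposition."]))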

In the second stage, Hint-RFT takes the output generated with these hints and refines it. By scoring and filtering the reasoning steps, the model learns to better decide when and how to invoke external tools. The refined dataset from this process is then used to fine-tune the model further, resulting in a version of QwQ-32B that we now call START. The integration of external computation is a thoughtful addition that helps minimize errors, ensuring that the model’s reasoning is both coherent and more reliable.

Empirical Findings and Insights

The researchers evaluated START on a range of tasks, including graduate-level science questions, challenging math problems, and programming tasks. Across these domains, START showed notable improvements over its base model. For example, on a set of PhD-level science questions, the model achieved an accuracy of 63.6%, which is a modest yet meaningful improvement over the original model’s performance. On math benchmarks—ranging from high school level to competition problems—the accuracy improvements were similarly encouraging. These results suggest that the ability to incorporate external verification can lead to better problem-solving, especially in tasks where precision is crucial.

In programming challenges, START’s approach allowed it to generate and test code snippets, leading to a higher rate of correct solutions compared to models that rely solely on internal reasoning. Overall, the study indicates that the integration of tool usage within the reasoning process can help models produce more accurate and verifiable results.

Concluding Thoughts

The development of START offers a thoughtful step forward in addressing the inherent challenges of complex reasoning in large language models. By combining internal chain-of-thought reasoning with external tool integration, the model provides a practical solution to some of the persistent issues in computational and logical tasks. The approach is both simple and elegant: encouraging the model to self-check its work using an external Python interpreter and then fine-tuning it based on this ability leads to improved performance across diverse benchmarks.

This work is a promising example of how incremental refinements—in this case, the use of strategic hints and external computation—can significantly enhance the reliability of reasoning in language models. It demonstrates that by thoughtfully integrating external tools, we can guide models toward more accurate and reliable outcomes, especially in areas where precise computation and logical rigor are essential. The work behind START is an encouraging move toward models that are not only more capable but also more reflective and self-correcting in their approach to problem-solving.


Accelerating insurance policy reviews with generative AI: Verisk’s Mozart companion

This post is co-authored with Sundeep Sardana, Malolan Raman, Joseph Lam, Maitri Shah and Vaibhav Singh from Verisk.
Verisk (Nasdaq: VRSK) is a leading strategic data analytics and technology partner to the global insurance industry, empowering clients to strengthen operating efficiency, improve underwriting and claims outcomes, combat fraud, and make informed decisions about global risks. Through advanced data analytics, software, scientific research, and deep industry knowledge, Verisk helps build global resilience across individuals, communities, and businesses. At the forefront of using generative AI in the insurance industry, Verisk’s generative AI-powered solutions, like Mozart, remain rooted in ethical and responsible AI use. Mozart, the leading platform for creating and updating insurance forms, enables customers to organize, author, and file forms seamlessly, while its companion uses generative AI to compare policy documents and provide summaries of changes in minutes, cutting the change adoption time from days or weeks to minutes.
The generative AI-powered Mozart companion uses sophisticated AI to compare legal policy documents and provides essential distinctions between them in a digestible and structured format. The new Mozart companion is built using Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. The Mozart application rapidly compares policy documents and presents comprehensive change details, such as descriptions, locations, excerpts, in a tracked change format.
The following screenshot shows an example of the output of the Mozart companion displaying the summary of changes between two legal documents, the excerpt from the original document version, the updated excerpt in the new document version, and the tracked changes represented with redlines.

In this post, we describe the development journey of the generative AI companion for Mozart, the data, the architecture, and the evaluation of the pipeline.
Data: Policy forms
Mozart is designed to author policy forms like coverage and endorsements. These documents provide information about policy coverage and exclusions (as shown in the following screenshot) and help in determining the risk and premium associated with an insurance policy.

Solution overview
The policy documents reside in Amazon Simple Storage Service (Amazon S3) storage. An AWS Batch job reads these documents, chunks them into smaller slices, then creates embeddings of the text chunks using the Amazon Titan Text Embeddings model through Amazon Bedrock and stores them in an Amazon OpenSearch Service vector database. Along with each document slice, we store the metadata associated with it using an internal Metadata API, which provides document characteristics like document type, jurisdiction, version number, and effective dates. This process has been implemented as a periodic job to keep the vector database updated with new documents. During the solution design process, Verisk also considered using Amazon Bedrock Knowledge Bases because it’s purpose built for creating and storing embeddings within Amazon OpenSearch Serverless. In the future, Verisk intends to use the Amazon Titan Embeddings V2 model.
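As a hedged sketch of the embedding step (the model ID and sample text are example values, not Verisk’s exact configuration), each chunk could be embedded with Amazon Titan Text Embeddings through the Amazon Bedrock runtime API before being written to the OpenSearch Service index:
# Sketch of embedding a document chunk with Amazon Titan Text Embeddings via
# Amazon Bedrock. Model ID and sample text are example values.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed_chunk(text):
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]

vector = embed_chunk("Coverage applies to direct physical loss of covered property.")
# The vector, along with the document metadata, would then be indexed into the
# Amazon OpenSearch Service vector database by the batch job.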
The user can pick the two documents that they want to compare. This action invokes an AWS Lambda function to retrieve the document embeddings from the OpenSearch Service database and present them to Anthropic’s Claude 3 Sonnet FM, which is accessed through Amazon Bedrock. The results are stored in a JSON structure and provided using the API service to the UI for consumption by the end-user.
The following diagram illustrates the solution architecture.

Security and governance
Generative AI is a new technology that brings new challenges related to security and compliance. Verisk has a governance council that reviews generative AI solutions to make sure they meet Verisk’s standards of security, compliance, and data use. Verisk also has a legal review for IP protection and compliance within their contracts. It’s important for Verisk to make sure the data shared with the FM is transmitted securely and that the FM doesn’t retain any of their data or use it for its own training. The quality of the solution, speed, cost, and ease of use were the key factors that led Verisk to pick Amazon Bedrock and Anthropic’s Claude Sonnet for their generative AI solution.
Evaluation criteria
To assess the quality of the results produced by generative AI, Verisk evaluated based on the following criteria:

Accuracy
Consistency
Adherence to context
Speed and cost

To assess the generative AI results’ accuracy and consistency, Verisk designed human evaluation metrics with the help of in-house insurance domain experts. Verisk conducted multiple rounds of human evaluation of the generated results. During these tests, in-house domain experts would grade accuracy, consistency, and adherence to context on a manual grading scale of 1–10. The Verisk team measured how long it took to generate the results by tracking latency. Feedback from each round of tests was incorporated in subsequent tests.
The initial results that Verisk got from the model were good but not close to the desired level of accuracy and consistency. The development process underwent iterative improvements that included redesign, making multiple calls to the FM, and testing various FMs. The primary metric used to evaluate the success of FM and non-FM solutions was a manual grading system where business experts would grade results and compare them. FM solutions are improving rapidly, but to achieve the desired level of accuracy, Verisk’s generative AI software solution needed to contain more components than just FMs. To achieve the desired accuracy, consistency, and efficiency, Verisk employed various techniques beyond just using FMs, including prompt engineering, retrieval augmented generation, and system design optimizations.
Prompt optimization
The change summary is different than showing differences in text between the two documents. The Mozart application needs to be able to describe the material changes and ignore the noise from non-meaningful changes. Verisk created prompts using the knowledge of their in-house domain experts to achieve these objectives. With each round of testing, Verisk added detailed instructions to the prompts to capture the pertinent information and reduce possible noise and hallucinations. The added instructions would be focused on reducing any issues identified by the business experts reviewing the end results. To get the best results, Verisk needed to adjust the prompts based on the FM used—there are differences in how each FM responds to prompts, and using the prompts specific to the given FM provides better results. Through this process, Verisk instructed the model on the role it is playing along with the definition of common terms and exclusions. In addition to optimizing prompts for the FMs, Verisk also explored techniques for effectively splitting and processing the document text itself.
Splitting document pages
Verisk tested multiple strategies for document splitting. For this use case, a recursive character text splitter with a chunk size of 500 characters and 15% overlap provided the best results. This splitter is part of the LangChain framework; it splits text recursively on natural separators such as paragraphs and sentences, which helps keep related content within the same chunk. Verisk also considered the NLTK splitter. With an effective approach for splitting the document text into processable chunks, Verisk then focused on enhancing the quality and relevance of the summarized output.
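A minimal sketch of this splitting configuration with LangChain (the sample text and variable names are illustrative) might look like the following:
# Sketch of the described splitting strategy: 500-character chunks, 15% overlap.
# In older LangChain versions the import is langchain.text_splitter instead.
from langchain_text_splitters import RecursiveCharacterTextSplitter

policy_document_text = "SECTION I - COVERAGES. We will pay for direct physical loss ..."  # placeholder

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # characters per chunk
    chunk_overlap=75,   # 15% of 500
)
chunks = splitter.split_text(policy_document_text)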
Quality of summary
The quality assessment starts with confirming that the correct documents are picked for comparison. Verisk enhanced the quality of the solution by using document metadata to narrow the search results by specifying which documents to include or exclude from a query, resulting in more relevant responses generated by the FM. For the generative AI description of change, Verisk wanted to capture the essence of the change instead of merely highlighting the differences. The results were reviewed by their in-house policy authoring experts, and their feedback was used to determine the prompts, document splitting strategy, and FM. Verisk also applied prompt engineering techniques such as few-shot prompting, chain-of-thought prompting, and the needle-in-a-haystack approach. With these techniques in place to enhance output quality and relevance, Verisk then prioritized optimizing the performance and cost-efficiency of their generative AI solution.
Price-performance
To achieve lower cost, Verisk regularly evaluated various FM options and changed them as new options with lower cost and better performance were released. During the development process, Verisk redesigned the solution to reduce the number of calls to the FM and wherever possible used non-FM based options.
As mentioned earlier, the overall solution consists of a few different components:

Location of the change
Excerpts of the changes
Change summary
Changes shown in the tracked change format

Verisk reduced the FM load and improved accuracy by identifying the sections that contained differences and then passing these sections to the FM to generate the change summary. For constructing the tracked difference format, containing redlines, Verisk used a non-FM based solution. In addition to optimizing performance and cost, Verisk also focused on developing a modular, reusable architecture for their generative AI solution.
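One generic way to produce a non-FM redline like the tracked difference format mentioned above is a word-level diff with Python's standard difflib; the sketch below is illustrative and is not Verisk's implementation.
# Generic sketch of a word-level "tracked change" view using difflib.
import difflib

def redline(old: str, new: str) -> str:
    old_words, new_words = old.split(), new.split()
    sm = difflib.SequenceMatcher(a=old_words, b=new_words)
    out = []
    for tag, a0, a1, b0, b1 in sm.get_opcodes():
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[a0:a1]) + "-]")   # deleted text
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[b0:b1]) + "+}")   # inserted text
        if tag == "equal":
            out.append(" ".join(old_words[a0:a1]))
    return " ".join(out)

print(redline("coverage applies to the insured premises",
              "coverage applies only to the scheduled premises"))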
Reusability
Good software development practices apply to the development of generative AI solutions too. You can create a decoupled architecture with reusable components. The Mozart generative AI companion is provided as an API, which decouples it from the frontend development and allows for reusability of this capability. Similarly, the API consists of many reusable components like common prompts, common definitions, retrieval service, embedding creation, and persistence service. Through their modular, reusable design approach and iterative optimization process, Verisk was able to achieve highly satisfactory results with their generative AI solution.
Results
Based on Verisk’s evaluation template questions and rounds of testing, they concluded that the solution produced good or acceptable summaries more than 90% of the time. Testing was done by providing the solution’s results to business experts and having these experts grade them on a grading scale.
Business impact
Verisk’s customers regularly spend significant time reviewing changes to policy forms. The generative AI-powered Mozart companion can simplify the review process by ingesting these complex and unstructured policy documents and providing a summary of changes in minutes. This enables Verisk’s customers to cut the change adoption time from days to minutes. The improved adoption speed not only increases productivity, but also enables timely implementation of changes.
Conclusion
Verisk’s generative AI-powered Mozart companion uses advanced natural language processing and prompt engineering techniques to provide rapid and accurate summaries of changes between insurance policy documents. By harnessing the power of large language models like Anthropic’s Claude 3 Sonnet while incorporating domain expertise, Verisk has developed a solution that significantly accelerates the policy review process for their customers, reducing change adoption time from days or weeks to just minutes. This innovative application of generative AI delivers tangible productivity gains and operational efficiencies to the insurance industry. With a strong governance framework promoting responsible AI use, Verisk is at the forefront of unlocking generative AI’s potential to transform workflows and drive resilience across the global risk landscape.
For more information, see the following resources:

Explore generative AI on AWS
Learn about unlocking the business value of generative AI
Learn more about Anthropic’s Claude 3 models on Amazon Bedrock
Learn about Amazon Bedrock and how to build and scale generative AI applications with FMs
Explore other use cases for generative AI with Amazon Bedrock

About the Authors
Sundeep Sardana is the Vice President of Software Engineering at Verisk Analytics, based in New Jersey. He leads the Reimagine program for the company’s Rating business, driving modernization across core services such as forms, rules, and loss costs. A dynamic change-maker and technologist, Sundeep specializes in building high-performing teams, fostering a culture of innovation, and leveraging emerging technologies to deliver scalable, enterprise-grade solutions. His expertise spans cloud computing, Generative AI, software architecture, and agile development, ensuring organizations stay ahead in an evolving digital landscape. Connect with him on LinkedIn.
Malolan Raman is a Principal Engineer at Verisk, based out of New Jersey specializing in the development of Generative AI (GenAI) applications. With extensive experience in cloud computing and artificial intelligence, He has been at the forefront of integrating cutting-edge AI technologies into scalable, secure, and efficient cloud solutions.
Joseph Lam is the senior director of commercial multi-lines that include general liability, umbrella/excess, commercial property, businessowners, capital assets, crime and inland marine. He leads a team responsible for research, development, and support of commercial casualty products, which mostly consist of forms and rules. The team is also tasked with supporting new and innovative solutions for the emerging marketplace.
Maitri Shah is a Software Development Engineer at Verisk with over two years of experience specializing in developing innovative solutions in Generative AI (GenAI) on Amazon Web Services (AWS). With a strong foundation in machine learning, cloud computing, and software engineering, Maitri has successfully implemented scalable AI models that drive business value and enhance user experiences.
Vaibhav Singh is a Product Innovation Analyst at Verisk, based out of New Jersey. With a background in Data Science, engineering, and management, he works as a pivotal liaison between technology and business, enabling both sides to build transformative products & solutions that tackle some of the current most significant challenges in the insurance domain. He is driven by his passion for leveraging data and technology to build innovative products that not only address the current obstacles but also pave the way for future advancements in that domain.
Ryan Doty is a Solutions Architect Manager at AWS, based out of New York. He helps financial services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.
Tarik Makota is a Sr. Principal Solutions Architect with Amazon Web Services. He provides technical guidance, design advice, and thought leadership to AWS’ customers across the US Northeast. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.
Alex Oppenheim is a Senior Sales Leader at Amazon Web Services, supporting consulting and services customers. With extensive experience in the cloud and technology industry, Alex is passionate about helping enterprises unlock the power of AWS to drive innovation and digital transformation.

Announcing general availability of Amazon Bedrock Knowledge Bases GraphRAG

Today, Amazon Web Services (AWS) announced the general availability of Amazon Bedrock Knowledge Bases GraphRAG (GraphRAG), a capability in Amazon Bedrock Knowledge Bases that enhances Retrieval-Augmented Generation (RAG) with graph data in Amazon Neptune Analytics. This capability enhances responses from generative AI applications by automatically creating embeddings for semantic search and generating a graph of the entities and relationships extracted from ingested documents. The graph, stored in Amazon Neptune Analytics, provides enriched context during the retrieval phase to deliver more comprehensive, relevant, and explainable responses tailored to customer needs. Developers can enable GraphRAG with just a few clicks on the Amazon Bedrock console to boost the accuracy of generative AI applications without any graph modeling expertise.
In this post, we discuss the benefits of GraphRAG and how to get started with it in Amazon Bedrock Knowledge Bases.
Enhance RAG with graphs for more comprehensive and explainable GenAI applications
Generative AI is transforming how humans interact with technology by having natural conversations that provide helpful, nuanced, and insightful responses. However, a key challenge facing current generative AI systems is providing responses that are comprehensive, relevant, and explainable because data is stored across multiple documents. Without effectively mapping shared context across input data sources, responses risk being incomplete and inaccurate.
To address this, AWS announced a public preview of GraphRAG at re:Invent 2024, and is now announcing its general availability. This new capability integrates the power of graph data modeling with advanced natural language processing (NLP). GraphRAG automatically creates graphs which capture connections between related entities and sections across documents. More specifically, the graph created will connect chunks to documents, and entities to chunks.
During response generation, GraphRAG first does semantic search to find the top k most relevant chunks, and then traverses the surrounding neighborhood of those chunks to retrieve the most relevant content. By linking this contextual information, the generative AI system can provide responses that are more complete, precise, and grounded in source data. Whether answering complex questions across topics or summarizing key details from lengthy reports, GraphRAG delivers the comprehensive and explainable responses needed to enable more helpful, reliable AI conversations.
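Conceptually, this two-stage retrieval can be pictured with the toy sketch below (in-memory data structures only; this is not the internals of the managed GraphRAG capability):
# Toy illustration of top-k semantic search followed by graph-neighborhood expansion.
import numpy as np

def graph_rag_retrieve(query_vec, chunk_vecs, graph_neighbors, k=3):
    # Stage 1: semantic search over chunk embeddings (cosine similarity).
    sims = {
        cid: float(np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
        for cid, v in chunk_vecs.items()
    }
    top_k = sorted(sims, key=sims.get, reverse=True)[:k]
    # Stage 2: traverse the entity/chunk graph around the top-k hits.
    expanded = set(top_k)
    for cid in top_k:
        expanded.update(graph_neighbors.get(cid, []))
    return expanded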
GraphRAG boosts relevance and accuracy when relevant information is dispersed across multiple sources or documents, which can be seen in the following three use cases.
Streamlining market research to accelerate business decisions
A leading global financial institution sought to enhance insight extraction from its proprietary research. With a vast repository of economic and market research reports, the institution wanted to explore how GraphRAG could improve information retrieval and reasoning for complex financial queries. To evaluate this, they added their proprietary research papers, focusing on critical market trends and economic forecasts.
To evaluate the effectiveness of GraphRAG, the institution partnered with AWS to build a proof-of-concept using Amazon Bedrock Knowledge Bases and Amazon Neptune Analytics. The goal was to determine if GraphRAG could more effectively surface insights compared to traditional retrieval methods. GraphRAG structures knowledge into interconnected entities and relationships, enabling multi-hop reasoning across documents. This capability is crucial for answering intricate questions such as “What are some headwinds and tailwinds to capex growth in the next few years?” or “What is the impact of the ILA strike on international trade?”. Rather than relying solely on keyword matching, GraphRAG allows the model to trace relationships between economic indicators, policy changes, and industry impacts, ensuring responses are contextually rich and data-driven.
When comparing the quality of responses from GraphRAG and other retrieval methods, notable differences emerged in their comprehensiveness, clarity, and relevance. While other retrieval methods delivered straightforward responses, they often lacked deeper insights and broader context. GraphRAG instead provided more nuanced answers by incorporating related factors and offering additional relevant information, which made the responses more comprehensive than the other retrieval methods.
Improving data-driven decision-making in automotive manufacturing
An international auto company manages a large dataset, supporting thousands of use cases across engineering, manufacturing, and customer service. With thousands of users querying different datasets daily, making sure insights are accurate and connected across sources has been a persistent challenge.
To address this, the company worked with AWS to prototype a graph that maps relationships between key data points, such as vehicle performance, supply chain logistics, and customer feedback. This structure allows for more precise results across datasets, rather than relying on disconnected query results.
With Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics automatically constructing a graph from ingested documents, the company can surface relevant insights more efficiently in their RAG applications. This approach helps teams identify patterns in manufacturing quality, predict maintenance needs, and improve supply chain resilience, making data analysis more effective and scalable across the organization.
Enhancing cybersecurity incident analysis
A cybersecurity company is using GraphRAG to improve how its AI-powered assistant analyzes security incidents. Traditional detection methods rely on isolated alerts, often missing the broader context of an attack.
By using a graph, the company connects disparate security signals, such as login anomalies, malware signatures, and network traffic patterns, into a structured representation of threat activity. This allows for faster root cause analysis and more comprehensive security reporting.
Amazon Bedrock Knowledge Bases and Neptune Analytics enable this system to scale while maintaining strict security controls, providing resource isolation. With this approach, the company’s security teams can quickly interpret threats, prioritize responses, and reduce false positives, leading to more efficient incident handling.
Solution overview
In this post, we provide a walkthrough to build Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics, using files in an Amazon Simple Storage Service (Amazon S3) bucket. Running this example will incur costs in Amazon Neptune Analytics, Amazon S3, and Amazon Bedrock. Amazon Neptune Analytics costs for this example will be approximately $0.48 per hour. Amazon S3 costs will vary depending on how large your dataset is, and more details on Amazon S3 pricing can be found here. Amazon Bedrock costs will vary depending on the embeddings model and chunking strategy you select, and more details on Bedrock pricing can be found here.
Prerequisites
To follow along with this post, you need an AWS account with the necessary permissions to access Amazon Bedrock, and an Amazon S3 bucket containing data to serve as your knowledge base. Also ensure that you have enabled model access to Claude 3 Haiku (anthropic.claude-3-haiku-20240307-v1:0) and any other models that you wish to use as your embeddings model. For more details on how to enable model access, refer to the documentation here.
Build Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics
To get started, complete the following steps:

On the Amazon Bedrock console, choose Knowledge Bases under Builder tools in the navigation pane.
In the Knowledge Bases section, choose Create, then choose Knowledge Base with vector store.
For Knowledge Base details, enter a name and an optional description.
For IAM permissions, select Create and use a new service role to create a new AWS Identity and Access Management (IAM) role.
For Data source details, select Amazon S3 as your data source.
Choose Next.
For S3 URI, choose Browse S3 and choose the appropriate S3 bucket.
For Parsing strategy, select Amazon Bedrock default parser.
For Chunking strategy, choose Default chunking (recommended for GraphRAG) or any other strategy as you wish.
Choose Next.
For Embeddings model, choose an embeddings model, such as Amazon Titan Text Embeddings v2.
For Vector database, select Quick create a new vector store and then select Amazon Neptune Analytics (GraphRAG).
Choose Next.
Review the configuration details and choose Create Knowledge Base.

Sync the data source

Once the knowledge base is created, click Sync under the Data source section. The data sync can take a few minutes to a few hours, depending on how many source documents you have and how big each one is.

Test the knowledge base
Once the data sync is complete:

Choose the expansion icon to expand the full view of the testing area.
Configure your knowledge base by adding filters or guardrails.
We encourage you to enable reranking (For information about pricing for reranking models, see Amazon Bedrock Pricing) to fully take advantage of the capabilities of GraphRAG. Reranking allows GraphRAG to refine and optimize search results.
You can also supply a custom metadata file (up to 10 KB) for each document in the knowledge base. You can apply filters to your retrievals, instructing the vector store to pre-filter based on document metadata and then search for relevant documents. This way, you have control over the retrieved documents, especially if your queries are ambiguous. Note that the list type is not supported.
Use the chat area in the right pane to ask questions about the documents from your Amazon S3 bucket.

The responses will use GraphRAG and include references to the chunks and documents they are based on.

Now that you’ve enabled GraphRAG, test it out by querying your generative AI application and observe how the responses have improved compared to baseline RAG approaches. You can monitor the Amazon CloudWatch logs for performance metrics on indexing, query latency, and accuracy.
Clean up
When you’re done exploring the solution, make sure to clean up by deleting any resources you created. Resources to clean up include the Amazon Bedrock knowledge base, the associated AWS IAM role that the Amazon Bedrock knowledge base uses, and the Amazon S3 bucket that was used for the source documents.
You will also need to separately delete the Amazon Neptune Analytics graph that was created on your behalf, by Amazon Bedrock Knowledge Bases.
Conclusion
In this post, we discussed how to get started with Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune. For further experimentation, check out the Amazon Bedrock Knowledge Bases Retrieval APIs to use the power of GraphRAG in your own applications. Refer to our documentation for code samples and best practices.
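As a starting point, a retrieval call against your knowledge base might look like the following sketch; the knowledge base ID and query text are placeholders, and parameter names should be confirmed against the current API documentation.
# Hedged sketch of calling the Knowledge Bases Retrieve API. Placeholder values.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",   # placeholder
    retrievalQuery={"text": "What changed in the liability exclusions?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 5}
    },
)
for result in response["retrievalResults"]:
    print(result["content"]["text"][:120])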

About the authors
Denise Gosnell is a Principal Product Manager for Amazon Neptune, focusing on generative AI infrastructure and graph data applications that enable scalable, cutting-edge solutions across industry verticals.
Melissa Kwok is a Senior Neptune Specialist Solutions Architect at AWS, where she helps customers of all sizes and verticals build cloud solutions according to best practices. When she’s not at her desk you can find her in the kitchen experimenting with new recipes or reading a cookbook.
Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge Generative AI and Graph Analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.
Harsh Singh is a Principal Product Manager Technical at AWS AI. Harsh enjoys building products that bring AI to software developers and everyday users to improve their productivity.
Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High-Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Build a Multi-Agent System with LangGraph and Mistral on AWS

Agents are revolutionizing the landscape of generative AI, serving as the bridge between large language models (LLMs) and real-world applications. These intelligent, autonomous systems are poised to become the cornerstone of AI adoption across industries, heralding a new era of human-AI collaboration and problem-solving. By using the power of LLMs and combining them with specialized tools and APIs, agents can tackle complex, multistep tasks that were previously beyond the reach of traditional AI systems. The Multi-Agent City Information System demonstrated in this post exemplifies the potential of agent-based architectures to create sophisticated, adaptable, and highly capable AI applications.
As we look to the future, agents will have a very important role to play in:

Improving decision-making with deeper, context-aware information
Automating complex workflows across various domains, from customer service to scientific research
Enabling more natural and intuitive human-AI interactions
Generating new ideas by bringing together diverse data sources and specialized knowledge
Addressing ethical concerns by providing more transparent and explainable AI systems

Building and deploying multi-agent systems like the one in this post is a step toward unlocking the full potential of generative AI. As these systems evolve, they will transform industries, expand possibilities, and open new doors for artificial intelligence.
Solution overview
In this post, we explore how to use LangGraph and Mistral models on Amazon Bedrock to create a powerful multi-agent system that can handle sophisticated workflows through collaborative problem-solving. This integration enables the creation of AI agents that can work together to solve complex problems, mimicking humanlike reasoning and collaboration.
The result is a system that delivers comprehensive details about events, weather, activities, and recommendations for a specified city, illustrating how stateful, multi-agent applications can be built and deployed on Amazon Web Services (AWS) to address real-world challenges.
LangGraph is essential to our solution, providing a well-organized method to define and manage the flow of information between agents. It offers built-in support for state management and checkpointing, ensuring smooth process continuity. The framework also makes agentic workflows straightforward to visualize, which enhances clarity and understanding, and it integrates easily with LLMs and Amazon Bedrock. Additionally, its support for conditional routing allows dynamic workflow adjustments based on intermediate results, offering flexibility in handling different scenarios.
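A minimal LangGraph sketch of this kind of stateful, conditionally routed workflow is shown below; node names and the placeholder routing logic are illustrative, and the full agents are defined in the notebook referenced later in this post.
# Minimal LangGraph sketch: local lookup, conditional fallback to search, then weather.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class CityState(TypedDict):
    city: str
    events: str
    weather: str

LOCAL_DB = {"New York": "Winter Jazz Festival"}   # stand-in local data

def local_events_node(state: CityState) -> dict:
    return {"events": LOCAL_DB.get(state["city"], "")}

def online_search_node(state: CityState) -> dict:
    return {"events": f"online events for {state['city']}"}   # placeholder search

def weather_node(state: CityState) -> dict:
    return {"weather": f"weather for {state['city']}"}         # placeholder weather

def route_after_events(state: CityState) -> str:
    # Conditional routing: fall back to an online search when the local lookup is empty.
    return "weather" if state["events"] else "search"

graph = StateGraph(CityState)
graph.add_node("events", local_events_node)
graph.add_node("search", online_search_node)
graph.add_node("weather", weather_node)
graph.set_entry_point("events")
graph.add_conditional_edges("events", route_after_events,
                            {"weather": "weather", "search": "search"})
graph.add_edge("search", "weather")
graph.add_edge("weather", END)
app = graph.compile()
print(app.invoke({"city": "Seattle", "events": "", "weather": ""}))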
The multi-agent architecture we present offers several key benefits:

Modularity – Each agent focuses on a specific task, making the system easier to maintain and extend
Flexibility – Agents can be quickly added, removed, or modified without affecting the entire system
Complex workflow handling – The system can manage advanced and complex workflows by distributing tasks among multiple agents
Specialization – Each agent is optimized for its specific task, improving latency, accuracy, and overall system efficiency
Security – The system enhances security by making sure that each agent only has access to the tools necessary for its task, reducing the potential for unauthorized access to sensitive data or other agents’ tasks

How our multi-agent system works
In this section, we explore how our Multi-Agent City Information System works, based on the multi-agent LangGraph Mistral Jupyter notebook available in the Mistral on AWS examples for Bedrock & SageMaker repository on GitHub.
This agentic workflow takes a city name as input and provides detailed information, demonstrating adaptability in handling different scenarios:

Events – It searches a local database and online sources for upcoming events in the city. Whenever local database information is unavailable, it triggers an online search using the Tavily API. This makes sure that users receive up-to-date event information, regardless of whether it’s stored locally or needs to be retrieved from the web
Weather – The system fetches current weather data using the OpenWeatherMap API, providing accurate and timely weather information for the queried location. Based on the weather, the system also offers outfit and activity recommendations tailored to the conditions, providing relevant suggestions for each city
Restaurants – Recommendations are provided through a Retrieval Augmented Generation (RAG) system. This method combines prestored information with real-time generation to offer relevant and up-to-date dining suggestions

The system adapts to varying levels of available information, so users receive the most comprehensive and up-to-date results possible regardless of how much data exists for a given city. For instance:

Some cities might require the use of the search tool for event information when local database data is unavailable
Other cities might have data available in the local database, providing quick access to event information without needing an online search
In cases where restaurant recommendations are unavailable for a particular city, the system can still provide valuable insights based on the available event and weather data

The following diagram is the solution’s reference architecture:

Data sources
The Multi-Agent City Information System can take advantage of two sources of data.
Local events database
This SQLite database is populated with city events data from a JSON file, providing quick access to local event information that ranges from community happenings to cultural events and citywide activities. This database is used by the events_database_tool() for efficient querying and retrieval of city event details, including location, date, and event type.
Restaurant RAG system
For restaurant recommendations, the generate_restaurants_dataset() function generates synthetic data, creating a custom dataset specifically tailored to our recommendation system. The create_restaurant_vector_store() function processes this data, generates embeddings using Amazon Titan Text Embeddings, and builds a vector store with Facebook AI Similarity Search (FAISS). Although this approach is suitable for prototyping, for a more scalable and enterprise-grade solution, we recommend using Amazon Bedrock Knowledge Bases.
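A hedged sketch of the FAISS side of this component is shown below; the embedding dimension and vectors are placeholders rather than output from the notebook's functions.
# Sketch of building a small FAISS index over restaurant embeddings (placeholder data).
import faiss
import numpy as np

dim = 1024                                                       # placeholder embedding size
restaurant_vectors = np.random.rand(200, dim).astype("float32")  # stand-in embeddings

index = faiss.IndexFlatL2(dim)          # exact L2 nearest-neighbor search
index.add(restaurant_vectors)

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 5)   # top-5 nearest restaurant chunks
print(ids)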
Building the multi-agent architecture
At the heart of our Multi-Agent City Information System lies a set of specialized functions and tools designed to gather, process, and synthesize information from various sources. They form the backbone of our system, enabling it to provide comprehensive and up-to-date information about cities. In this section, we explore the key components that drive our system: the generate_text() function, which uses the Mistral model, and the specialized data retrieval functions for local database queries, online searches, weather information, and restaurant recommendations. Together, these functions and tools create a robust and versatile system capable of delivering valuable insights to users.
Text generation function
This function serves as the core of our agents, allowing them to generate text using the Mistral model as needed. It uses the Amazon Bedrock Converse API, which supports text generation, streaming, and external function calling (tools).
The function works as follows:

Sends a user message to the Mistral model using the Amazon Bedrock Converse API
Invokes the appropriate tool and incorporates the results into the conversation
Continues the conversation until a final response is generated

Here’s the implementation:
def generate_text(bedrock_client, model_id, tool_config, input_text):
    # ...

    while True:
        response = bedrock_client.converse(**kwargs)
        output_message = response['output']['message']
        messages.append(output_message)  # Add assistant's response to messages

        stop_reason = response.get('stopReason')

        if stop_reason == 'tool_use' and tool_config:
            tool_use = output_message['content'][0]['toolUse']
            tool_use_id = tool_use['toolUseId']
            tool_name = tool_use['name']
            tool_input = tool_use['input']

            try:
                if tool_name == 'get_upcoming_events':
                    tool_result = local_info_database_tool(tool_input['city'])
                    json_result = json.dumps({"events": tool_result})
                elif tool_name == 'get_city_weather':
                    tool_result = weather_tool(tool_input['city'])
                    json_result = json.dumps({"weather": tool_result})
                elif tool_name == 'search_and_summarize_events':
                    tool_result = search_tool(tool_input['city'])
                    json_result = json.dumps({"events": tool_result})
                else:
                    raise ValueError(f"Unknown tool: {tool_name}")

                tool_response = {
                    "toolUseId": tool_use_id,
                    "content": [{"json": json.loads(json_result)}]
                }

                # ...

                messages.append({
                    "role": "user",
                    "content": [{"toolResult": tool_response}]
                })

                # Update kwargs with new messages
                kwargs["messages"] = messages
        else:
            break

    return output_message, tool_result
Local database query tool
The events_database_tool() queries the local SQLite database for events information by connecting to the database, executing a query to fetch upcoming events for the specified city, and returning the results as a formatted string. It’s used by the events_database_agent() function. Here’s the code:
def events_database_tool(city: str) -> str:
    conn = sqlite3.connect(db_path)
    query = """
    SELECT event_name, event_date, description
    FROM local_events
    WHERE city = ?
    ORDER BY event_date
    LIMIT 3
    """
    df = pd.read_sql_query(query, conn, params=(city,))
    conn.close()
    print(df)
    if not df.empty:
        events = df.apply(
            lambda row: (
                f"{row['event_name']} on {row['event_date']}: {row['description']}"
            ),
            axis=1
        ).tolist()
        return "\n".join(events)
    else:
        return f"No upcoming events found for {city}."
Weather tool
The weather_tool() fetches current weather data for the specified city by calling the OpenWeatherMap API. It’s used by the weather_agent() function. Here’s the code:
def weather_tool(city: str) -> str:
    weather = OpenWeatherMapAPIWrapper()
    # Fetch current conditions for the requested city
    tool_result = weather.run(city)
    return tool_result
Online search tool
When local event information is unavailable, the search_tool() performs an online search using the Tavily API to find upcoming events in the specified city and return a summary. It’s used by the search_agent() function. Here’s the code:
def search_tool(city: str) -> str:
    client = TavilyClient(api_key=os.environ['TAVILY_API_KEY'])
    query = f"What are the upcoming events in {city}?"
    response = client.search(query, search_depth="advanced")
    results_content = "\n\n".join([result['content'] for result in response['results']])
    return results_content
Restaurant recommendation function
The query_restaurants_RAG() function uses a RAG approach to provide restaurant recommendations. It performs a similarity search in the vector database for relevant restaurant information, filters for highly rated restaurants in the specified city, and uses Amazon Bedrock with the Mistral model to generate a summary of the top restaurants based on the retrieved information. It's used by the query_restaurants_agent() function.
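A simplified sketch of this flow might look like the following; the prompt wording, rating threshold, and Mistral model ID are assumptions for illustration, not the notebook's exact code:
def query_restaurants_RAG(city, vector_store, bedrock_client, model_id="mistral.mistral-large-2402-v1:0"):
    # Retrieve restaurant snippets similar to the query
    docs = vector_store.similarity_search(f"Highly rated restaurants in {city}", k=10)
    # Keep only restaurants in the requested city with a high rating
    relevant = [d for d in docs if d.metadata.get("city") == city and d.metadata.get("rating", 0) >= 4.0]
    if not relevant:
        return f"No restaurant recommendations available for {city}."
    context = "\n".join(d.page_content for d in relevant)
    prompt = f"Summarize the top restaurants in {city} based on:\n{context}"
    # Generate the summary with the Mistral model through the Converse API
    response = bedrock_client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]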
For the detailed implementation of these functions and tools, environment setup, and use cases, refer to the Multi-Agent LangGraph Mistral Jupyter notebook.
Implementing AI agents with LangGraph
Our multi-agent system consists of several specialized agents. Each agent in this architecture is represented by a Node in LangGraph, which, in turn, interacts with the tools and functions defined previously. The following diagram shows the workflow:

The workflow follows these steps:

Events database agent (events_database_agent) – Uses the events_database_tool() to query a local SQLite database and find local event information
Online search agent (search_agent) – Whenever local event information is unavailable in the database, this agent uses the search_tool() to find upcoming events by searching online for a given city
Weather agent (weather_agent) – Fetches current weather data using the weather_tool() for the specified city
Restaurant recommendation agent (query_restaurants_agent) – Uses the query_restaurants_RAG() function to provide restaurant recommendations for a specified city
Analysis agent (analysis_agent) – Aggregates information from other agents to provide comprehensive recommendations

Here’s an example of how we created the weather agent:
def weather_agent(state: State) -> State:
    # ...

    tool_config = {
        "tools": [
            {
                "toolSpec": {
                    "name": "get_city_weather",
                    "description": "Get current weather information for a specific city",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "city": {
                                    "type": "string",
                                    "description": "The name of the city to look up weather for"
                                }
                            },
                            "required": ["city"]
                        }
                    }
                }
            }
        ]
    }

    input_text = f"Get current weather for {state.city}"
    output_message, tool_result = generate_text(bedrock_client, DEFAULT_MODEL, tool_config, input_text)

    if tool_result:
        state.weather_info = {"city": state.city, "weather": tool_result}
    else:
        state.weather_info = {"city": state.city, "weather": "Weather information not available."}

    print(f"Weather info set to: {state.weather_info}")
    return state
Orchestrating agent collaboration
In the Multi-Agent City Information System, several key primitives orchestrate agent collaboration. The build_graph() function defines the workflow in LangGraph, utilizing nodes, routes, and conditions. The workflow is dynamic, with conditional routing based on event search results, and incorporates memory persistence to store the state across different executions of the agents. Here’s an overview of the function’s behavior:

Initialize workflow – The function begins by creating a StateGraph object called workflow, which is initialized with a State. In LangGraph, the State represents the data or context that is passed through the workflow as the agents perform their tasks. In our example, the state includes the results from previous agents (for example, event data, search results, and weather information), input parameters (for example, city name), and other relevant information that the agents might need to process (a sketch of a possible State schema follows the code below):

# Define the graph
def build_graph():
    workflow = StateGraph(State)
    …
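Although the exact schema is defined in the notebook, the State passed between agents could be modeled as a Pydantic class along these lines; the field names are inferred from how the agents read and write state in the previous snippets, so treat them as assumptions:
from typing import Optional
from pydantic import BaseModel

class State(BaseModel):
    city: str
    events_result: str = ""              # output of the events database or online search agents
    weather_info: Optional[dict] = None  # set by the weather agent
    restaurants_result: str = ""         # output of the restaurant recommendation agent
    analysis_result: str = ""            # final aggregated recommendations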

Add nodes (agents) – Each agent is associated with a specific function, such as retrieving event data, performing an online search, fetching weather information, recommending restaurants, or analyzing the gathered information:

workflow.add_node("Events Database Agent", events_database_agent)
workflow.add_node("Online Search Agent", search_agent)
workflow.add_node("Weather Agent", weather_agent)
workflow.add_node("Restaurants Recommendation Agent", query_restaurants_agent)
workflow.add_node("Analysis Agent", analysis_agent)

Set entry point and conditional routing – The entry point for the workflow is set to the Events Database Agent, meaning the execution of the workflow starts from this agent. Also, the function defines a conditional route using the add_conditional_edges method. The route_events() function decides the next step based on the results from the Events Database Agent:

    workflow.set_entry_point("Events Database Agent")

    def route_events(state):
        print(f"Routing events. Current state: {state}")
        print(f"Events content: '{state.events_result}'")
        if f"No upcoming events found for {state.city}" in state.events_result:
            print("No events found in local DB. Routing to Online Search Agent.")
            return "Online Search Agent"
        else:
            print("Events found in local DB. Routing to Weather Agent.")
            return "Weather Agent"

    workflow.add_conditional_edges(
        "Events Database Agent",
        route_events,
        {
            "Online Search Agent": "Online Search Agent",
            "Weather Agent": "Weather Agent"
        }
    )

Add Edges between agents – These edges define the order in which agents interact in the workflow. The agents will proceed in a specific sequence: from Online Search Agent to Weather Agent, from Weather Agent to Restaurants Recommendation Agent, and from there to Analysis Agent, before finally reaching the END:

workflow.add_edge("Online Search Agent", "Weather Agent")
workflow.add_edge("Weather Agent", "Restaurants Recommendation Agent")
workflow.add_edge("Restaurants Recommendation Agent", "Analysis Agent")
workflow.add_edge("Analysis Agent", END)

Initialize memory for state persistence – The MemorySaver class is used to make sure that the state of the workflow is preserved between runs. This is especially useful in multi-agent systems where the state of the system needs to be maintained as the agents interact:

# Initialize memory to persist state between graph runs
checkpointer = MemorySaver()

Compile the workflow and visualize the graph – The workflow is compiled, and the memory-saving object (checkpointer) is included to make sure that the state is persisted between executions. Then, it outputs a graphical representation of the workflow:

# Compile the workflow
app = workflow.compile(checkpointer=checkpointer)

# Visualize the graph
display(
    Image(
        app.get_graph().draw_mermaid_png(
            draw_method=MermaidDrawMethod.API
        )
    )
)
The following diagram illustrates these steps:

Results and analysis
To demonstrate the versatility of our Multi-Agent City Information System, we run it for three different cities: Tampa, Philadelphia, and New York. Each example showcases different aspects of the system’s functionality.
The main() function orchestrates the entire process (a sketch of main() follows the list below):

Calls the build_graph() function, which implements the agentic workflow
Initializes the state with the specified city
Streams the events through the workflow
Retrieves and displays the final analysis and recommendations
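
A minimal sketch of such a main() function, assuming the State schema and build_graph() shown earlier (and that build_graph() returns the compiled app), might look like the following; the thread_id value required by the checkpointer is illustrative:
def main(city: str):
    # Build and compile the agentic workflow
    app = build_graph()
    # The MemorySaver checkpointer requires a thread_id in the run configuration
    config = {"configurable": {"thread_id": city}}
    final_state = None
    # Stream the workflow with the city as the initial state, keeping the latest snapshot
    for snapshot in app.stream({"city": city}, config=config, stream_mode="values"):
        final_state = snapshot
    # Display the final analysis and recommendations
    print(final_state)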

To run the code, do the following:
if __name__ == "__main__":
    cities = ["Tampa", "Philadelphia", "New York"]
    for city in cities:
        print(f"\nStarting script execution for city: {city}")
        main(city)
Three example use cases
For Example 1 (Tampa), the following diagram shows how the agentic workflow produces the output in response to the user’s question, “What’s happening in Tampa and what should I wear?”

The system produced the following results:

Events – Not found in the local database, triggering the search tool, which called the Tavily API to find several upcoming events
Weather – Retrieved from the weather tool. Current conditions include moderate rain, 28°C, and 87% humidity
Activities – The system suggested various indoor and outdoor activities based on the events and weather
Outfit recommendations – Considering the warm, humid, and rainy conditions, the system recommended light, breathable clothing and rain protection
Restaurants – Recommendations provided through the RAG system

For Example 2 (Philadelphia), the agentic workflow identified events in the local database, including cultural events and festivals. It retrieved weather data from the OpenWeatherMap API, then suggested activities based on local events and weather conditions. Outfit recommendations were made in line with the weather forecast, and restaurant recommendations were provided through the RAG system.
For Example 3 (New York), the workflow identified events such as Broadway shows and city attractions in the local database. It retrieved weather data from the OpenWeatherMap API and suggested activities based on the variety of local events and weather conditions. Outfit recommendations were tailored to New York’s weather and urban environment. However, the RAG system was unable to provide restaurant recommendations for New York because the synthetic dataset created earlier hadn’t included any restaurants from this city.
These examples demonstrate the system’s ability to adapt to different scenarios. For detailed output of these examples, refer to the Results and Analysis section of the Multi-Agent LangGraph Mistral Jupyter notebook.
Conclusion
In the Multi-Agent City Information System we developed, agents integrate various data sources and APIs within a flexible, modular framework to provide valuable information about events, weather, activities, outfit recommendations, and dining options across different cities. Using Amazon Bedrock and LangGraph, we’ve created a sophisticated agent-based workflow that adapts seamlessly to varying levels of available information, switching between local and online data sources as needed. These agents autonomously gather, process, and consolidate data into actionable insights, orchestrating and automating business logic to streamline processes and provide real-time insights. As a result, this multi-agent approach enables the creation of robust, scalable, and intelligent agentic systems that push the boundaries of what’s possible with generative AI.
Want to dive deeper? Explore the implementation of Multi-Agent Collaboration and Orchestration using LangGraph for Mistral Models on GitHub to observe the code in action and try out the solution yourself. You’ll find step-by-step instructions for setting up and running the multi-agent system, along with code for interacting with data sources, agents, routing data, and visualizing the workflow.

About the Author
Andre Boaventura is a Principal AI/ML Solutions Architect at AWS, specializing in generative AI and scalable machine learning solutions. With over 25 years in the high-tech software industry, he has deep expertise in designing and deploying AI applications using AWS services such as Amazon Bedrock, Amazon SageMaker, and Amazon Q. Andre works closely with global system integrators (GSIs) and customers across industries to architect and implement cutting-edge AI/ML solutions to drive business value. Outside of work, Andre enjoys practicing Brazilian Jiu-Jitsu with his son (often getting pinned or choked by a teenager), cheering for his daughter at her dance competitions (despite not knowing ballet terms—he claps enthusiastically anyway), and spending ‘quality time’ with his wife—usually in shopping malls, pretending to be interested in clothes and shoes while secretly contemplating a new hobby.

Evaluate RAG responses with Amazon Bedrock, LlamaIndex and RAGAS

In the rapidly evolving landscape of artificial intelligence, Retrieval Augmented Generation (RAG) has emerged as a game-changer, revolutionizing how Foundation Models (FMs) interact with organization-specific data. As businesses increasingly rely on AI-powered solutions, the need for accurate, context-aware, and tailored responses has never been more critical.
Enter the powerful trio of Amazon Bedrock, LlamaIndex, and RAGAS – a cutting-edge combination that's set to redefine the evaluation and optimization of RAG responses. This blog post delves into how these innovative tools synergize to elevate the performance of your AI applications, ensuring they not only meet but exceed the exacting standards of enterprise-level deployments.
Whether you’re a seasoned AI practitioner or a business leader exploring the potential of generative AI, this guide will equip you with the knowledge and tools to:

Harness the full potential of Amazon Bedrock's robust foundation models
Utilize RAGAS’s comprehensive evaluation metrics for RAG systems

In this post, we’ll explore how to leverage Amazon Bedrock, LlamaIndex, and RAGAS to enhance your RAG implementations. You’ll learn practical techniques to evaluate and optimize your AI systems, enabling more accurate, context-aware responses that align with your organization’s specific needs. Let’s dive in and discover how these powerful tools can help you build more effective and reliable AI-powered solutions.
RAG Evaluation
RAG evaluation is important to ensure that RAG models produce accurate, coherent, and relevant responses. By analyzing the retrieval and generator components both jointly and independently, RAG evaluation helps identify bottlenecks, monitor performance, and improve the overall system. Current RAG pipelines frequently employ similarity-based metrics such as ROUGE, BLEU, and BERTScore to assess the quality of the generated responses, which is essential for refining and enhancing the model’s capabilities.
However, similarity-based metrics such as ROUGE, BLEU, and BERTScore have limitations in assessing relevance and detecting hallucinations; more sophisticated metrics are needed to evaluate factual alignment and accuracy.
Evaluate RAG components with Foundation models
We can also use a Foundation Model as a judge to compute various metrics for both retrieval and generation. Here are some examples of these metrics:

Retrieval component

Context precision – Evaluates whether the ground-truth relevant items present in the retrieved contexts are ranked appropriately high.
Context recall – Ensures that the context contains all relevant information needed to answer the question.

Generator component

Faithfulness – Verifies that the generated answer is factually accurate based on the provided context, helping to identify errors or “hallucinations.”
Answer relevancy – Measures how well the answer matches the question. Higher scores mean the answer is complete and relevant, while lower scores indicate missing or redundant information.

Overview of solution
This post guides you through the process of assessing the quality of RAG responses with evaluation frameworks such as RAGAS and LlamaIndex, using Amazon Bedrock.
We also use LangChain to create a sample RAG application.
Amazon Bedrock is a fully managed service that offers a choice of high-performing Foundation Models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
The Retrieval Augmented Generation Assessment (RAGAS) framework offers multiple metrics to evaluate each part of the RAG system pipeline, identifying areas for improvement. It utilizes foundation models to test individual components, aiding in pinpointing modules for development to enhance overall results.
LlamaIndex is a framework for building LLM applications. It simplifies data integration from various sources and provides tools for data indexing, engines, agents, and application integrations. Optimized for search and retrieval, it streamlines querying LLMs and retrieving documents. This blog post focuses on using its Observability/Evaluation modules.
LangChain is an open-source framework that simplifies the creation of applications powered by foundation models. It provides tools for chaining LLM operations, managing context, and integrating external data sources. LangChain is primarily used for building chatbots, question-answering systems, and other AI-driven applications that require complex language processing capabilities.
Diagram Architecture
The following diagram is a high-level reference architecture that explains how you can evaluate the RAG solution with RAGAS or LlamaIndex.

The solution consists of the following components:

Evaluation dataset – The source data for the RAG comes from the Amazon SageMaker FAQ, which represents 170 question-answer pairs. This corresponds to Step 1 in the architecture diagram.

Build sample RAG – Documents are segmented into chunks and stored in an Amazon Bedrock knowledge base (Steps 2–4). We use a LangChain retrieval Q&A chain to answer user queries. This process retrieves relevant data from an index at runtime and passes it to the Foundation Model (FM).
RAG evaluation – To assess the quality of the Retrieval-Augmented Generation (RAG) solution, we can use both RAGAS and LlamaIndex. An LLM performs the evaluation by comparing its predictions with ground truths (Steps 5–6).

You must follow the provided notebook to reproduce the solution. We elaborate on the main code components in this post.
Prerequisites
To implement this solution, you need the following:

An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
Access enabled for the Amazon Titan Embeddings G1 – Text model and Anthropic Claude 3 Sonnet on Amazon Bedrock. For instructions, see Model access.
Run the prerequisite code provided in the Python notebook

Ingest FAQ data
The first step is to ingest the SageMaker FAQ data. For this purpose, LangChain provides a WebBaseLoader object to load text from HTML webpages into a document format. Then we split each document into multiple chunks of 2,000 tokens with a 100-token overlap. See the following code:
text_chunks = split_document_from_url(SAGEMAKER_URL, chunck_size=2000, chunk_overlap=100)
retriever_db = get_retriever(text_chunks, bedrock_embeddings)
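The split_document_from_url() helper lives in the notebook's utils module; a sketch of how it could be implemented with LangChain's WebBaseLoader and RecursiveCharacterTextSplitter follows (note that the splitter measures chunk size in characters by default, so the actual helper may use a token-based length function):
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_document_from_url(url: str, chunck_size: int = 2000, chunk_overlap: int = 100):
    # Fetch and parse the HTML page into LangChain Document objects
    docs = WebBaseLoader(url).load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunck_size,
        chunk_overlap=chunk_overlap,
    )
    # Return the list of chunked documents
    return splitter.split_documents(docs)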
Set up embeddings and LLM with Amazon Bedrock and LangChain
In order to build a sample RAG application, we need an LLM and an embedding model:

LLM – Anthropic Claude 3 Sonnet

Embedding – Amazon Titan Embeddings – Text V2

This code sets up a LangChain application using Amazon Bedrock, configuring embeddings with Titan and a Claude 3 Sonnet model for text generation with specific parameters for controlling the model's output. See the following code from the notebook:
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain_aws import ChatBedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA
import nest_asyncio
nest_asyncio.apply()

# URL to fetch the document
SAGEMAKER_URL = "https://aws.amazon.com/sagemaker/faqs/"

# Bedrock parameters
EMBEDDING_MODEL = "amazon.titan-embed-text-v2:0"
BEDROCK_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

bedrock_embeddings = BedrockEmbeddings(model_id=EMBEDDING_MODEL, client=bedrock_client)

model_kwargs = {
    "temperature": 0,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman:"]
}

llm_bedrock = ChatBedrock(
    model_id=BEDROCK_MODEL_ID,
    model_kwargs=model_kwargs
)
Set up Knowledge Bases
We create an Amazon Bedrock Knowledge Bases web crawler data source and process the SageMaker FAQ data.
In the following code, we load the embedded documents into the knowledge base and set up the retriever with LangChain:
from utils import split_document_from_url, get_bedrock_retriever
from botocore.exceptions import ClientError

text_chunks = split_document_from_url(SAGEMAKER_URL, chunck_size=2000, chunk_overlap=100)
retriever_db = get_bedrock_retriever(text_chunks, region)
Build a Q&A chain to query the retrieval API
After the database is populated, create a Q&A retrieval chain to perform question answering with context extracted from the vector store. You also define a prompt template following Claude prompt engineering guidelines. See the following code from the notebook:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise and short. "
    "Context: {context}"
)

prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
])

question_answer_chain = create_stuff_documents_chain(llm_bedrock, prompt_template)
chain = create_retrieval_chain(retriever_db, question_answer_chain)
Build Dataset to evaluate RAG application
To evaluate a RAG application, we need a combination of the following datasets:

Questions – The user query that serves as input to the RAG pipeline
Context – The information retrieved from enterprise or external data sources based on the provided query
Answers – The responses generated by LLMs
Ground truths – Human-annotated, ideal responses for the questions that can be used as the benchmark to compare against the LLM-generated answers

We are ready to evaluate the RAG application. As described in the introduction, we select three metrics to assess our RAG solution:

Faithfulness
Answer Relevancy
Answer Correctness

For more information, refer to Metrics.
This step involves defining an evaluation dataset with a set of ground truth questions and answers. For this post, we choose four random questions from the SageMaker FAQ. See the following code below from the notebook:
EVAL_QUESTIONS = [
    "Can I stop a SageMaker Autopilot job manually?",
    "Do I get charged separately for each notebook created and run in SageMaker Studio?",
    "Do I get charged for creating and setting up an SageMaker Studio domain?",
    "Will my data be used or shared to update the base model that is offered to customers using SageMaker JumpStart?",
]

# Defining the ground truth answers for each question
EVAL_ANSWERS = [
    "Yes. You can stop a job at any time. When a SageMaker Autopilot job is stopped, all ongoing trials will be stopped and no new trial will be started.",
    """No. You can create and run multiple notebooks on the same compute instance.
You pay only for the compute that you use, not for individual items.
You can read more about this in our metering guide.
In addition to the notebooks, you can also start and run terminals and interactive shells in SageMaker Studio, all on the same compute instance.""",
    "No, you don't get charged for creating or configuring an SageMaker Studio domain, including adding, updating, and deleting user profiles.",
    "No. Your inference and training data will not be used nor shared to update or train the base model that SageMaker JumpStart surfaces to customers."
]
Evaluation of RAG with RAGAS
Evaluating the RAG solution requires comparing LLM predictions with ground truth answers. To do so, we use the batch() function from LangChain to perform inference on all questions inside our evaluation dataset.
Then we can use the evaluate() function from RAGAS to perform evaluation on each metric (answer relevancy, faithfulness, and answer correctness). It uses an LLM to compute the metrics. Feel free to use other metrics from RAGAS.
See the following code below from the notebook:
from ragas.metrics import answer_relevancy, faithfulness, answer_correctness
from ragas import evaluate

# Batch invoke and dataset creation
result_batch_questions = chain.batch([{"input": q} for q in EVAL_QUESTIONS])

dataset = build_dataset(EVAL_QUESTIONS, EVAL_ANSWERS, result_batch_questions, text_chunks)

result = evaluate(dataset=dataset, metrics=[answer_relevancy, faithfulness, answer_correctness], llm=llm_bedrock, embeddings=bedrock_embeddings, raise_exceptions=False)
df = result.to_pandas()
df.head()
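The build_dataset() helper comes from the notebook's utils module; a sketch of what it might do, assuming RAGAS's expected column names (question, answer, contexts, ground_truth) and the output keys of the LangChain retrieval chain, follows:
from datasets import Dataset

def build_dataset(questions, ground_truths, chain_results, text_chunks):
    # text_chunks mirrors the notebook's signature but isn't needed in this sketch
    records = {
        "question": questions,
        "ground_truth": ground_truths,
        # The retrieval chain returns the generated answer and the retrieved documents
        "answer": [result["answer"] for result in chain_results],
        "contexts": [[doc.page_content for doc in result["context"]] for result in chain_results],
    }
    return Dataset.from_dict(records)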
The following screenshot shows the evaluation results and the RAGAS answer relevancy score.

Answer Relevancy
In the answer_relevancy_score column, a score closer to 1 indicates the response generated is relevant to the input query.
Faithfulness
In the second column, the first query result has a lower faithfulness_score (0.2), which indicates the responses are not derived from the context and are hallucinations. The rest of the query results have a higher faithfulness_score (1.0), which indicates the responses are derived from the context.
Answer Correctness
In the last column, answer_correctness, the second and last rows show high answer correctness, meaning that the answers provided by the LLM are close to the ground truth.
Evaluation of RAG with LlamaIndex
LlamaIndex, similar to Ragas, provides a comprehensive RAG (Retrieval-Augmented Generation) evaluation module. This module offers a variety of metrics to assess the performance of your RAG system. The evaluation process generates two key outputs:

Feedback: The judge LLM (Language Model) provides detailed evaluation feedback in the form of a string, offering qualitative insights into the system’s performance.
Score: This numerical value indicates how well the answer meets the evaluation criteria. The scoring system varies depending on the specific metric being evaluated. For example, metrics like Answer Relevancy and Faithfulness are typically scored on a scale from 0 to 1.

These outputs allow for both qualitative and quantitative assessment of your RAG system’s performance, enabling you to identify areas for improvement and track progress over time.
The following is a code sample from the notebook:
from llama_index.llms.bedrock import Bedrock
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    CorrectnessEvaluator,
    FaithfulnessEvaluator
)
from utils import evaluate_llama_index_metric

bedrock_llm_llama = Bedrock(model=BEDROCK_MODEL_ID)
faithfulness = FaithfulnessEvaluator(llm=bedrock_llm_llama)
answer_relevancy = AnswerRelevancyEvaluator(llm=bedrock_llm_llama)
correctness = CorrectnessEvaluator(llm=bedrock_llm_llama)
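The evaluate_llama_index_metric() helper also comes from the notebook's utils module; the following sketch shows one way it could loop over the evaluation dataset and collect scores and feedback into a DataFrame (the keyword arguments and column names are assumptions, and the exact evaluator signatures may require adjustment):
import pandas as pd
from llama_index.core.evaluation import CorrectnessEvaluator

def evaluate_llama_index_metric(evaluator, dataset):
    rows = []
    for record in dataset:
        kwargs = {
            "query": record["question"],
            "response": record["answer"],
            "contexts": record["contexts"],
        }
        # The correctness evaluator additionally compares against the ground truth answer
        if isinstance(evaluator, CorrectnessEvaluator):
            kwargs["reference"] = record["ground_truth"]
        result = evaluator.evaluate(**kwargs)
        rows.append({
            "Query": record["question"],
            "Score": result.score,
            "Passing": result.passing,
            "Feedback": result.feedback,
        })
    return pd.DataFrame(rows)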
Answer Relevancy
df_answer_relevancy= evaluate_llama_index_metric(answer_relevancy, dataset)
df_answer_relevancy.head()

The Score column shows the result for the answer_relevancy evaluation criteria. All passing values are set to 1, meaning that all predictions are relevant to the retrieved context.
Additionally, the Feedback column provides a clear explanation for each passing score. We can observe that all answers align with the context extracted from the retriever.
Answer Correctness
df_correctness= evaluate_llama_index_metric(correctness, dataset)
df_correctness.head()

All values in the Score column are set to 5.0, meaning that all predictions are consistent with the ground truth answers.
Faithfulness
The following screenshot shows the evaluation results for answer faithfulness.
df_faithfulness= evaluate_llama_index_metric(faithfulness, dataset)
df_faithfulness.head()

All values in the Score column are set to 1.0, which means the answers generated by the LLM are consistent with the retrieved context.
Conclusion
While Foundation Models offer impressive generative capabilities, their effectiveness in addressing organization-specific queries has been a persistent challenge. The Retrieval Augmented Generation framework emerges as a powerful solution, bridging this gap by enabling LLMs to leverage external, organization-specific data sources.
To truly unlock the potential of RAG pipelines, the RAGAS framework, in conjunction with LlamaIndex, provides a comprehensive evaluation solution. By meticulously assessing both retrieval and generation components, this approach empowers organizations to pinpoint areas for improvement and refine their RAG implementations. The result? Responses that are not only factually accurate but also highly relevant to user queries.
By adopting this holistic evaluation approach, enterprises can fully harness the transformative power of generative AI applications. This not only maximizes the value derived from these technologies but also paves the way for more intelligent, context-aware, and reliable AI systems that can truly understand and address an organization’s unique needs.
As we continue to push the boundaries of what’s possible with AI, tools like Amazon Bedrock, LlamaIndex, and RAGAS will play a pivotal role in shaping the future of enterprise AI applications. By embracing these innovations, organizations can confidently navigate the exciting frontier of generative AI, unlocking new levels of efficiency, insight, and competitive advantage.
For further exploration, readers interested in enhancing the reliability of AI-generated content may want to look into Amazon Bedrock’s Guardrails feature, which offers additional tools like the Contextual Grounding Check.

About the authors
Madhu is a Senior Partner Solutions Architect specializing in worldwide public sector cybersecurity partners. With over 20 years in software design and development, he collaborates with AWS partners to ensure customers implement solutions that meet strict compliance and security objectives. His expertise lies in building scalable, highly available, secure, and resilient applications for diverse enterprise needs.
Babu Kariyaden Parambath is a Senior AI/ML Specialist at AWS. At AWS, he enjoys working with customers in helping them identify the right business use case with business value and solve it using AWS AI/ML solutions and services. Prior to joining AWS, Babu was an AI evangelist with 20 years of diverse industry experience delivering AI driven business value for customers.

Innovating at speed: BMW’s generative AI solution for cloud incident …

This post was co-authored with Johann Wildgruber, Dr. Jens Kohl, Thilo Bindel, and Luisa-Sophie Gloger from BMW Group.
The BMW Group—headquartered in Munich, Germany—is a vehicle manufacturer with more than 154,000 employees, and 30 production and assembly facilities worldwide as well as research and development locations across 17 countries. Today, the BMW Group (BMW) is the world’s leading manufacturer of premium automobiles and motorcycles, and provider of premium financial and mobility services.
BMW Connected Company is a division within BMW responsible for developing and operating premium digital services for BMW’s connected fleet, which currently numbers more than 23 million vehicles worldwide. These digital services are used by many BMW vehicle owners daily; for example, to lock or open car doors remotely using an app on their phone, to start window defrost remotely, to buy navigation map updates from the car’s menu, or to listen to music streamed over the internet in their car.
In this post, we explain how BMW uses generative AI technology on AWS to help run these digital services with high availability. Specifically, BMW uses Amazon Bedrock Agents to make remediating (partial) service outages quicker by speeding up the otherwise cumbersome and time-consuming process of root cause analysis (RCA). The fully automated RCA agent correctly identifies the right root cause for most cases (measured at 85%), and helps engineers with system understanding and real-time insights into their cases. This performance was further validated during the proof of concept, where employing the RCA agent on representative use cases clearly demonstrated the benefits of this solution, allowing BMW to achieve significantly lower diagnosis times.
The challenges of root cause analysis
Digital services are often implemented by chaining multiple software components together; components that might be built and run by different teams. For example, consider the service of remotely opening and locking vehicle doors. There might be a development team building and running the iOS app, another team for the Android app, a team building and running the backend-for-frontend used by both the iOS and Android app, and so on. Moreover, these teams might be geographically dispersed and run their workloads in different locations and regions; many hosted on AWS, some elsewhere.
Now consider a (fictitious) scenario where reports come in from car owners complaining that remotely locking doors with the app no longer works. Is the iOS app responsible for the outage, or the backend-for-frontend? Did a firewall rule change somewhere? Did an internal TLS certificate expire? Is the MQTT system experiencing delays? Was there an inadvertent breaking change in recent API changes? When did they actually deploy that? Or was the database password for the central subscription service rotated again?
It can be difficult to determine the root cause of issues in situations like this. It requires checking many systems and teams, many of which might be failing, because they’re interdependent. Developers need to reason about the system architecture, form hypotheses, and follow the chain of components until they have located the one that is the culprit. They often have to backtrack and reassess their hypotheses, and pursue the investigation in another chain of components.
Understanding the challenges in such complex systems highlights the need for a robust and efficient approach to root cause analysis. With this context in mind, let’s explore how BMW and AWS collaborated to develop a solution using Amazon Bedrock Agents to streamline and enhance the RCA process.
Solution overview
At a high level, the solution uses an Amazon Bedrock agent to do automated RCA. This agent has several custom-built tools at its disposal to do its job. These tools, implemented by AWS Lambda functions, use services like Amazon CloudWatch and AWS CloudTrail to analyze system logs and metrics. The following diagram illustrates the solution architecture.

When an incident occurs, an on-call engineer gives a description of the issue at hand to the Amazon Bedrock agent. The agent will then start investigating for the root cause of the issue, using its tools to do tasks that the on-call engineer would otherwise do manually, such as searching through logs. Based on the clues it uncovers, the agent proposes several likely hypotheses to the on-call engineer. The engineer can then resolve the issue, or give pointers to the agent to direct the investigation further. In the following section, we take a closer look at the tools the agent uses.
Amazon Bedrock agent tools
The Amazon Bedrock agent’s effectiveness in performing RCA lies in its ability to seamlessly integrate with custom tools. These tools, designed as Lambda functions, use AWS services like CloudWatch and CloudTrail to automate tasks that are typically manual and time-intensive for engineers. By organizing its capabilities into specialized tools, the Amazon Bedrock agent makes sure that RCA is both efficient and precise.
Architecture Tool
The Architecture Tool uses C4 diagrams to provide a comprehensive view of the system’s architecture. These diagrams, enhanced through Structurizr, give the agent a hierarchical understanding of component relationships, dependencies, and workflows. This allows the agent to target the most relevant areas during its RCA process, effectively narrowing down potential causes of failure based on how different systems interact.
For instance, if an issue affects a specific service, the Architecture Tool can identify upstream or downstream dependencies and suggest hypotheses focused on those systems. This accelerates diagnostics by enabling the agent to reason contextually about the architecture instead of blindly searching through logs or metrics.
Logs Tool
The Logs Tool uses CloudWatch Logs Insights to analyze log data in real time. By searching for patterns, errors, or anomalies, as well as comparing the trend to the previous period, it helps the agent pinpoint issues related to specific events, such as failed authentications or system crashes.
For example, in a scenario involving database access failures, the Logs Tool might identify a new spike in the number of error messages such as “FATAL: password authentication failed” compared to the previous hour. This insight allows the agent to quickly associate the failure with potential root causes, such as an improperly rotated database password.
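As an illustration, a Lambda-backed Logs Tool could wrap a CloudWatch Logs Insights query along these lines; the log group, query string, and polling interval are illustrative assumptions rather than BMW's actual implementation:
import time
import boto3

logs = boto3.client("logs")

def run_logs_insights_query(log_group: str, query: str, start_time: int, end_time: int) -> list:
    # Start the Logs Insights query over the given time window (epoch seconds)
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )["queryId"]
    # Poll until the query reaches a terminal state, then return its rows
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            return response.get("results", [])
        time.sleep(1)

# Example query: count error-level messages per 5-minute bin to spot spikes
ERROR_TREND_QUERY = """
fields @timestamp, @message
| filter @message like /FATAL|ERROR/
| stats count() as errors by bin(5m)
"""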
Metrics Tool
The Metrics Tool provides the agent with real-time insights into the system’s health by monitoring key metrics through CloudWatch. This tool identifies statistical anomalies in critical performance indicators such as latency, error rates, resource utilization, or unusual spikes in usage patterns, which can often signal potential issues or deviations from normal behavior.
For instance, in a Kubernetes memory overload scenario, the Metrics Tool might detect a sharp increase in memory consumption or unusual resource allocation prior to the failure. By surfacing CloudWatch metric alarms for such anomalies, the tool enables the agent to prioritize hypotheses related to resource mismanagement, misconfigured thresholds, or unexpected system load, guiding the investigation more effectively toward resolving the issue.
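One way such a Metrics Tool could surface anomalies is by listing the CloudWatch alarms currently firing, sketched below with boto3; the returned fields are a subset chosen for illustration:
import boto3

cloudwatch = boto3.client("cloudwatch")

def alarms_in_alarm_state(max_records: int = 50) -> list:
    # Return the metric alarms currently in the ALARM state, which the agent
    # can use as anomaly hints when prioritizing hypotheses
    response = cloudwatch.describe_alarms(
        StateValue="ALARM",
        MaxRecords=max_records,
    )
    return [
        {
            "name": alarm["AlarmName"],
            "metric": alarm.get("MetricName"),
            "reason": alarm.get("StateReason"),
        }
        for alarm in response["MetricAlarms"]
    ]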
Infrastructure Tool
The Infrastructure Tool uses CloudTrail data to analyze critical control-plane events, such as configuration changes, security group updates, or API calls. This tool is particularly effective in identifying misconfigurations or breaking changes that might trigger cascading failures.
Consider a case where a security group ingress rule is inadvertently removed, causing connectivity issues between services. The Infrastructure Tool can detect and correlate this event with the reported incident, providing the agent with actionable insights to guide its RCA process.
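Similarly, the Infrastructure Tool could look up recent control-plane events with CloudTrail, sketched here with boto3; the event name and time window are illustrative:
from datetime import datetime, timedelta
import boto3

cloudtrail = boto3.client("cloudtrail")

def recent_control_plane_changes(event_name: str, hours: int = 24) -> list:
    # Look up CloudTrail events matching the given event name in the recent window
    end = datetime.utcnow()
    start = end - timedelta(hours=hours)
    response = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        StartTime=start,
        EndTime=end,
        MaxResults=50,
    )
    return response.get("Events", [])

# Example: surface recent security group ingress rule removals
changes = recent_control_plane_changes("RevokeSecurityGroupIngress")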
By combining these tools, the Amazon Bedrock agent mimics the step-by-step reasoning of an experienced engineer while executing tasks at machine speed. The modular nature of the tools allows for flexibility and customization, making sure that RCA is tailored to the unique needs of BMW’s complex, multi-regional cloud infrastructure.
In the next section, we discuss how these tools work together within the agent’s workflow.
Amazon Bedrock agents: The ReAct framework in action
At the heart of BMW’s rapid RCA lies the ReAct (Reasoning and Action) agent framework, an innovative approach that dynamically combines logical reasoning with task execution. By integrating ReAct with Amazon Bedrock, BMW gains a flexible solution for diagnosing and resolving complex cloud-based incidents. Unlike traditional methods, which rely on predefined workflows, ReAct agents use real-time inputs and iterative decision-making to adapt to the specific circumstances of an incident.
The ReAct agent in BMW’s RCA solution uses a structured yet adaptive workflow to diagnose and resolve issues. First, it interprets the textual description of an incident (for example, “Vehicle doors cannot be locked via the app”) to identify which parts of the system are most likely impacted. Guided by the ReAct framework’s iterative reasoning, the agent then gathers evidence by calling specialized tools, using data centrally aggregated in a cross-account observability setup. By continuously reevaluating the results of each tool invocation, the agent zeros in on potential causes—whether an expired certificate, a revoked firewall rule, or a spike in traffic—until it isolates the root cause. The following diagram illustrates this workflow.

The ReAct framework offers the following benefits:

Dynamic and adaptive – The ReAct agent tailors its approach to the specific incident, rather than a one-size-fits-all methodology. This adaptability is especially critical in BMW’s multi-regional, multi-service architecture.
Efficient tool utilization – By reasoning about which tools to invoke and when, the ReAct agent minimizes redundant queries, providing faster diagnostics without overloading AWS services like CloudWatch or CloudTrail.
Human-like reasoning – The ReAct agent mimics the logical thought process of a seasoned engineer, iteratively exploring hypotheses until it identifies the root cause. This capability bridges the gap between automation and human expertise.

By employing ReAct agents on Amazon Bedrock, BMW achieves significantly lower diagnosis times. These agents not only enhance operational efficiency but also empower engineers to focus on strategic improvements rather than labor-intensive diagnostics.
Case study: Root cause analysis “Unlocking vehicles via the iOS app”
To illustrate the power of Amazon Bedrock agents in action, let us explore a possible real-world scenario involving the interplay between BMW’s connected fleet and the digital services running in the cloud backend.
We deliberately change the security group for the central networking account in a test environment. This has the effect that requests from the fleet are (correctly) blocked by the changed security group and do not reach the services hosted in the backend. Hence, a test user cannot lock or unlock her vehicle door remotely.
Incident details
BMW engineers received a report from a tester indicating the remote lock/unlock functionality on the mobile app does not work.
This report raised immediate questions: was the issue in the app itself, the backend-for-frontend service, or deeper within the system, such as in the MQTT connectivity or authentication mechanisms?
How the ReAct agent addresses the problem
The problem is described to the Amazon Bedrock ReAct agent: “Users of the iOS app cannot unlock car doors remotely.” The agent immediately begins its analysis:

The agent begins by understanding the overall system architecture, calling the Architecture Tool. The outputs of the architecture tool reveal that the iOS app, like the Android app, is connected to a backend-for-frontend API, and that the backend-for-frontend API itself is connected to several other internal APIs, such as the Remote Vehicle Management API. The Remote Vehicle Management API is responsible for sending commands to cars by using MQTT messaging.
The agent uses the other tools at its disposal in a targeted way: it scans the logs, metrics, and control plane activities of only those components that are involved in remotely unlocking car doors: iOS app remote logs, backend-for-frontend API logs, and so on. The agent finds several clues:

Anomalous logs that indicate connectivity issues (network timeouts).
A sharp decrease in the number of successful invocations of the Remote Vehicle Management API.
Control plane activities: several security groups in the central networking account hosted on the testing environment were changed.

Based on those findings, the agent infers and defines several hypotheses and presents these to the user, ordered by their likelihood. In this case, the first hypothesis is the actual root cause: a security group was inadvertently changed in the central networking account, which meant that network traffic between the backend-for-frontend and the Remote Vehicle Management API was now blocked. The agent correctly correlated logs (“fetch timeout error”), metrics (decrease in invocations) and control plane changes (security group ingress rule removed) to come to this conclusion.
If the on-call engineer wants further information, they can now ask follow-up questions to the agent, or instruct the agent to investigate elsewhere as well.

The entire process—from incident detection to resolution—took minutes, compared to the hours it could have taken with traditional RCA methods. The ReAct agent’s ability to dynamically reason, access cross-account observability data, and iterate on its hypotheses alleviated the need for tedious manual investigations.

Conclusion
By using Amazon Bedrock ReAct agents, BMW has shown how to improve its approach to root cause analysis, turning a complex and manual process into an efficient, automated workflow. The tools integrated within the ReAct framework significantly narrow down potential reasoning space, and enable dynamic hypotheses generation and targeted diagnostics, mimicking the reasoning process of seasoned engineers while operating at machine speed. This innovation has reduced the time required to identify and resolve service disruptions, further enhancing the reliability of BMW’s connected services and improving the experience for millions of customers worldwide.
The solution has demonstrated measurable success, with the agent identifying root causes in 85% of test cases and providing detailed insights in the remainder, greatly expediting engineers’ investigations. By lowering the barrier to entry for junior engineers, it has enabled less-experienced team members to diagnose issues effectively, maintaining reliability and scalability across BMW’s operations.
Incorporating generative AI into RCA processes showcases the transformative potential of AI in modern cloud-based operations. The ability to adapt dynamically, reason contextually, and handle complex, multi-regional infrastructures makes Amazon Bedrock Agents a game changer for organizations aiming to maintain high availability in their digital services.
As BMW continues to expand its connected fleet and digital offerings, the adoption of generative AI-driven solutions like Amazon Bedrock will play an important role in maintaining operational excellence and delivering seamless experiences to customers. By following BMW’s example, your organization can also benefit from Amazon Bedrock Agents for root cause analysis to enhance service reliability.
Get started by exploring Amazon Bedrock Agents to optimize your incident diagnostics or use CloudWatch Logs Insights to identify anomalies in your system logs. If you want a hands-on introduction to creating your own Amazon Bedrock agents—complete with code examples and best practices—check out the following GitHub repo. These tools are setting a new industry standard for efficient RCA and operational excellence.

About the Authors
Johann Wildgruber is a transformation lead reliability engineer at BMW Group, working currently to set up an observability platform to strengthen the reliability of ConnectedDrive services. Johann has several years of experience as a product owner in operating and developing large and complex cloud solutions. He is interested in applying new technologies and methods in software development.
Dr. Jens Kohl is a technology leader and builder with over 13 years of experience at the BMW Group. He is responsible for shaping the architecture and continuous optimization of the Connected Vehicle cloud backend. Jens has been leading software development and machine learning teams with a focus on embedded, distributed systems and machine learning for more than 10 years.
Thilo Bindel is leading the Offboard Reliability & Data Engineering team at BMW Group. He is responsible for defining and implementing strategies to ensure reliability, availability, and maintainability of BMW’s backend services in the Connected Vehicle domain. His goal is to establish reliability and data engineering best practices consistently across the organization and to position the BMW Group as a leader in data-driven observability within the automotive industry and beyond.
Luisa-Sophie Gloger is a Data Scientist at the BMW Group with a focus on Machine Learning. As a lead developer within the Connected Company’s Connected AI platform team, she enjoys helping teams to improve their products and workflows with Generative AI. She also has a background in working on Natural Language processing (NLP) and a degree in psychology.
Tanrajbir Takher is a Data Scientist at AWS’s Generative AI Innovation Center, where he works with enterprise customers to implement high-impact generative AI solutions. Prior to AWS, he led research for new products at a computer vision unicorn and founded an early generative AI startup.
Otto Kruse is a Principal Solutions Developer within AWS Industries – Prototyping and Customer Engineering (PACE), a multi-disciplinary team dedicated to helping large companies utilize the potential of the AWS cloud by exploring and implementing innovative ideas. Otto focuses on application development and security.
Huong Vu is a Data Scientist at AWS Generative AI Innovation Centre. She drives projects to deliver generative-AI applications for enterprise customers from a diverse range of industries. Prior to AWS, she worked on improving NLP models for Alexa shopping assistant both on the Amazon.com website and on Echo devices.
Aishwarya is a Senior Customer Solutions Manager with AWS Automotive. She is passionate about solving business problems using Generative AI and cloud-based technologies.
Satyam Saxena is an Applied Science Manager at AWS Generative AI Innovation Center team. He leads Generative AI customer engagements, driving innovative ML/AI initiatives from ideation to production with over a decade of experience in machine learning and data science. His research interests include deep learning, computer vision, NLP, recommender systems, and generative AI.
Kim Robins, a Senior AI Strategist at AWS’s Generative AI Innovation Center, leverages his extensive artificial intelligence and machine learning expertise to help organizations develop innovative products and refine their AI strategies, driving tangible business value.

Ground truth generation and review best practices for evaluating gener …

Generative AI question-answering applications are pushing the boundaries of enterprise productivity. These assistants can be powered by various backend architectures including Retrieval Augmented Generation (RAG), agentic workflows, fine-tuned large language models (LLMs), or a combination of these techniques. However, building and deploying trustworthy AI assistants requires a robust ground truth and evaluation framework.
Ground truth data in AI refers to data that is known to be factual, representing the expected use case outcome for the system being modeled. By providing an expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Running deterministic evaluation of generative AI assistants against use case ground truth data enables the creation of custom benchmarks. These benchmarks are essential for tracking performance drift over time and for statistically comparing multiple assistants in accomplishing the same task. Additionally, they enable quantifying performance changes as a function of enhancements to the underlying assistant, all within a controlled setting. With deterministic evaluation processes such as the Factual Knowledge and QA Accuracy metrics of FMEval, ground truth generation and evaluation metric implementation are tightly coupled. To ensure the highest quality measurement of your question answering application against ground truth, the evaluation metric’s implementation must inform ground truth curation.
In this post, we discuss best practices for applying LLMs to generate ground truth for evaluating question-answering assistants with FMEval on an enterprise scale. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify, and provides standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, see Evaluate large language models for quality and responsibility of LLMs. Additionally, see the Generative AI Security Scoping Matrix for guidance on moderating confidential and personally identifiable information (PII) as part of your generative AI solution.
By following these guidelines, data teams can implement high fidelity ground truth generation for question-answering use case evaluation with FMEval. For ground truth curation best practices for question answering evaluations with FMEval that you can use to design FMEval ground truth prompt templates, see Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval.
Generating ground truth for FMEval question-answering evaluation
One option to get started with ground truth generation is human curation of a small question-answer dataset. The human curated dataset should be small (based on bandwidth), high in signal, and ideally prepared by use case subject matter experts (SMEs). The exercise of generating this dataset forces a data alignment exercise early in the evaluation process, raising important questions and conversations among use case stakeholders about what questions are important to measure over time for the business. The outcomes for this exercise are three-fold:

Stakeholder alignment on the top N important questions
Stakeholder awareness of the evaluation process
A high-fidelity starter ground truth dataset for the first proof of concept evaluation as a function of awareness and evaluation

While an SME ground truth curation exercise is a strong start, at the scale of an enterprise knowledge base, pure SME generation of ground truth will become prohibitively time and resource intensive. To scale ground truth generation and curation, you can apply a risk-based approach in conjunction with a prompt-based strategy using LLMs. It’s important to note that LLM-generated ground truth isn’t a substitute for use case SME involvement. For example, if ground truth is generated by LLMs before the involvement of SMEs, SMEs will still be needed to identify which questions are fundamental to the business and then align the ground truth with business value as part of a human-in-the-loop process.
To demonstrate, we provide a step-by-step walkthrough using Amazon’s 2023 letter to shareholders as source data.
In keeping with ground truth curation best practices for FMEval question-answering, ground truth is curated as question-answer-fact triplets. The question and answer are curated to suit the ideal question-answering assistant response in terms of content, length, and style. The fact is a minimal representation of the ground truth answer, comprising one or more subject entities of the question.
For example, consider how the following source document chunk from the Amazon 2023 letter to shareholders can be converted to question-answering ground truth.

Dear Shareholders:
Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. Today, I have even more. The reasons are many, but start with the progress we’ve made in our financial results and customer experiences, and extend to our continued innovation and the remarkable opportunities in front of us. In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B. By segment, North America revenue increased 12% YoY from $316B to $353B, International revenue grew 11% YoY from $118B to $131B, and AWS revenue increased 13% YoY from $80B to $91B. Further, Amazon’s operating income and Free Cash Flow (“FCF”) dramatically improved. Operating income in 2023 improved 201% YoY from $12.2B (an operating margin of 2.4%) to $36.9B (an operating margin of 6.4%).

To convert the source document excerpt into ground truth, we provide a base LLM prompt template. In the template, we instruct the LLM to take a fact-based approach to interpreting the chunk using chain-of-thought logic. For our example, we work with Anthropic’s Claude LLM on Amazon Bedrock. The template is compatible with and can be modified for other LLMs, such as LLMs hosted on Amazon SageMaker JumpStart and self-hosted on AWS infrastructure. To modify the prompt for use by other LLMs, a different approach to denoting prompt sections than XML tags might be required. For example, Meta Llama models apply tags such as <s> [INST] and <<SYS>>. For more information, see the Amazon Bedrock documentation on LLM prompt design and the FMEval documentation.
The LLM is assigned a persona to set its point of view for carrying out the task. In the instructions, the LLM identifies facts as entities from the source document chunk. For each fact, a question-answer-fact triplet is assembled based on the fact detected and its surrounding context. In the prompt, we provide detailed examples for controlling the content of ground truth questions. The examples focus on questions on chunk-wise business knowledge while ignoring irrelevant metadata that might be contained in a chunk. You can customize the prompt examples to fit your ground truth use case.
We further instruct the LLM to apply ground truth curation best practices for FMEval, such as generating multiple variations of facts to fit multiple possible unit expressions. Additional curation elements subject to the task at hand—such as brand language and tone—can be introduced into the ground truth generation prompt. With the following template, we verified that Anthropic’s Claude Sonnet 3.5 can generate custom ground truth attributes accommodating FMEval features, such as the <OR> delimiter to denote alternative acceptable answers for a ground truth fact.

“””You are an expert in ground truth curation for generative AI application evaluation on AWS.

Follow the instructions provided in the <instructions> XML tag for generating question answer fact triplets from a source document excerpt.

<instructions>
– Let’s work this out in a step-by-step way to be sure we have the right answer.
– Review the source document excerpt provided in <document> XML tags below
– For each meaningful domain fact in the <document>, extract an unambiguous question-answer-fact set in JSON format including a question and answer pair encapsulating the fact in the form of a short sentence, followed by a minimally expressed fact extracted from the answer.

<domain_knowledge_focus>
– Focus ONLY on substantive domain knowledge contained within the document content
– Ignore all metadata and structural elements including but not limited to:
– Document dates, versions, page numbers
– Section numbers or titles
– Table structure or row/column positions
– List positions or ordering
– Questions must reference specific domain entities rather than generic document elements
</domain_knowledge_focus>

<context_specification_requirements>
Document Source Identification
– Always reference the specific source document and its date/version
– Example: “According to the [Document Name + Date], what is [specific query]?”

Cross-Reference Prevention
– Each question must be answerable from the current document chunk only
– Do not create questions requiring information from multiple documents
– Example: “In this [Document Name], what are [specific requirements]?”

Department/LOB Specification
– Always specify the relevant department, line of business, or organizational unit
– Example: “What are the [Department Name]’s requirements for [specific process]?”

Document Section Targeting
– Reference specific sections when the information location is relevant
– Example: “In Section [X] of [Document Name], what are the steps for [specific process]?”

Role-Based Context
– Specify relevant roles, responsibilities, or authority levels
– Example: “Which [specific roles] are authorized to [specific action]?”

Version Control Elements
– Include relevant version or revision information
– Example: “What changes were implemented in the [Month Year] revision of [Document]?”

Policy/Procedure Numbers
– Include specific policy or procedure reference numbers
– Example: “Under Policy [Number], what are the requirements for [specific action]?”

Regulatory Framework References
– Specify relevant regulatory frameworks or compliance requirements
– Example: “What [Regulation] compliance requirements are specified for [specific process]?”

System/Platform Specification
– Name specific systems, platforms, or tools
– Example: “What steps are required in [System Name] to [specific action]?”

Document Type Classification
– Specify the type of document (SOP, Policy, Manual, etc.)
– Example: “In the [Document Type + Number], where is [specific information] stored?”

Temporal Validity
– Include effective dates or time periods
– Example: “What process is effective from [Date] according to [Document]?”

Geographic Jurisdiction
– Specify relevant geographic regions or jurisdictions
– Example: “What requirements apply to [Region] according to [Document]?”

Business Process Owner
– Identify relevant process owners or responsible parties
– Example: “According to [Document], who owns the process for [specific action]?”

Classification Level
– Include relevant security or confidentiality classifications
– Example: “What are the requirements for [Classification Level] data?”

Stakeholder Scope
– Specify relevant stakeholders or approval authorities
– Example: “Which [stakeholder level] must approve [specific action]?”
</context_specification_requirements>

<question_quality_criteria>
– Questions must be specific enough that a vector database can match them to the relevant document chunk
– Questions should include key identifying terms, names, and context
– Questions should target concrete, actionable information
– Answers should provide complete context without referring back to document elements
</question_quality_criteria>

<output_format>
The question-answer-fact set should each be a short string in JSON format with the keys: “question”, “ground_truth_answer”, “fact”
</output_format>

<best_practices>
– Questions, answers, and facts should not refer to the subject entity as “it” or “they”, and instead refer to it directly by name
– Questions, answers, and facts should be individually unique to the document chunk, such that based on the question a new call to the retriever will address the correct document section when posing the ground truth question
– Facts should be represented in 3 or fewer words describing an entity in the <document>
– If there are units in the fact, the “fact” entry must provide multiple versions of the fact using <OR> as a delimiter. See <unit_variations> for examples.
<unit_variations>
– Dollar Unit Equivalencies: `1,234 million<OR>1.234 billion`
– Date Format Equivalencies: `2024-01-01<OR>January 1st 2024`
– Number Equivalencies: `1<OR>one`
</unit_variations>
</best_practices>

– Start your response immediately with the question-answer-fact set JSON, and separate each extracted JSON record with a newline.
</instructions>

<document>
{context_document}
</document>

Now, extract the question answer pairs and fact from the document excerpt according to your instructions, starting immediately with JSON and no preamble.”””
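
The template above can be exercised with a short script. The following is a minimal sketch, not the pipeline’s Lambda code, that sends a single chunk to Anthropic’s Claude on Amazon Bedrock through the Converse API; the model ID, the abbreviated template stand-in, and the example chunk are placeholders you would replace with your own values.

import boto3

# Stand-in for the full prompt template shown above; in practice, store and load the complete template text
GROUND_TRUTH_TEMPLATE = (
    "You are an expert in ground truth curation for generative AI application evaluation on AWS.\n"
    "<document>\n{context_document}\n</document>\n"
    "Now, extract the question answer pairs and fact from the document excerpt, "
    "starting immediately with JSON and no preamble."
)

# Placeholder model ID and document chunk
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"
chunk = "In 2023, Amazon's total revenue grew 12% year-over-year from $514B to $575B."

bedrock_runtime = boto3.client("bedrock-runtime")
response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": GROUND_TRUTH_TEMPLATE.format(context_document=chunk)}]}],
    inferenceConfig={"maxTokens": 2048, "temperature": 0.0},
)

# The model is instructed to return newline-separated JSON records (JSONLines)
print(response["output"]["message"]["content"][0]["text"])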

The generation output is provided as fact-wise JSONLines records in the following format, where elements in square brackets represent values from a row of the generated examples shown in the next table.

{
    "question": "[Question]",
    "ground_truth_answer": "[Ground Truth Answer]",
    "fact": "[Fact]"
}

Here are a few examples of generated ground truth:

| Question | Ground Truth Answer | Fact |
| --- | --- | --- |
| What was Amazon’s total revenue growth in 2023? | Amazon’s total revenue grew 12% year-over-year from $514B to $575B in 2023. | 12%<OR>$514B to $575B |
| How much did North America revenue increase in 2023? | North America revenue increased 12% year-over-year from $316B to $353B. | 12%<OR>$316B to $353B |
| What was the growth in International revenue for Amazon in 2023? | International revenue grew 11% year-over-year from $118B to $131B. | 11%<OR>$118B to $131B |
| How much did AWS revenue increase in 2023? | AWS revenue increased 13% year-over-year from $80B to $91B. | 13%<OR>$80B to $91B |
| What was Amazon’s operating income improvement in 2023? | Operating income in 2023 improved 201% year-over-year from $12.2B to $36.9B. | 201%<OR>$12.2B to $36.9B |
| What was Amazon’s operating margin in 2023? | Amazon’s operating margin in 2023 was 6.4%. | 6.4% |

Scaling ground truth generation with a pipeline
To automate ground truth generation, we provide a serverless batch pipeline architecture, shown in the following figure. At a high level, the AWS Step Functions pipeline accepts source data in Amazon Simple Storage Service (Amazon S3), and orchestrates AWS Lambda functions for ingestion, chunking, and prompting on Amazon Bedrock to generate the fact-wise JSONLines ground truth.

There are three user inputs to the step function:

A custom name for the ground truth dataset
The input Amazon S3 prefix for the source data
The percentage of records to sample for human review

Additional configurations are set by Lambda environment variables, such as the S3 source bucket and Amazon Bedrock Model ID to invoke on generation.

{
    "dataset_name": "YOUR_DATASET_NAME",
    "input_prefix": "YOUR_INPUT_PREFIX",
    "review_percentage": "REVIEW_PERCENTAGE"
}
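
As a minimal sketch of how the pipeline could be started programmatically (the state machine ARN and payload values below are placeholders, not outputs of the sample solution), you can pass this payload to the Step Functions StartExecution API:

import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder user inputs; adjust to your dataset name and S3 prefix
payload = {
    "dataset_name": "shareholder-letter-ground-truth",
    "input_prefix": "source-documents/2023-shareholder-letter/",
    "review_percentage": "10",
}

# Placeholder state machine ARN from the deployed pipeline
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:ground-truth-pipeline",
    input=json.dumps(payload),
)
print(response["executionArn"])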

After the initial payload is passed, a validation function assembles the global event payload structure in terms of system input and user input.

{
    "system_input": {
        "run_id": "<AWS Step Function execution ID>",
        "input_bucket": "<Input data Amazon S3 bucket>",
        "output_bucket": "<Output data Amazon S3 bucket>",
        "output_document_chunks_prefix": "<Amazon S3 bucket prefix to store chunks>",
        "chunk_size": "<Document chunk size>",
        "chunk_overlap": "<Number of tokens that will overlap across consecutive chunks>"
    },
    "user_input": {
        "dataset_name": "<Dataset name>",
        "input_prefix": "<Amazon S3 bucket prefix for ground truth generation input data>",
        "review_percentage": "<Percent of records to flag for human review>"
    }
}

After validation, the first distributed map state iterates over the files in the input bucket to start the document ingestion and chunking processes with horizontal scaling. The resulting chunks are stored in an intermediate S3 bucket.
The second distributed map is the generation core of the pipeline. Each chunk generated by the previous map is fed as an input to the ground truth generation prompt on Amazon Bedrock. For each chunk, a JSONLines file containing the question-answer-fact triplets is validated and stored in an S3 bucket at the output prefix.
The following figure shows a view of the data structure and lineage from document paragraphs to the final ground truth chunk across the chunking and generation map states. The numbering between the two figures indicates the data structure present at each point in the pipeline. Finally, the JSONLines files are aggregated in an Amazon SageMaker Processing Job, including the assignment of a random sample for human review based on user input.

The last step of the pipeline is the aggregation step using a SageMaker Processing job. The aggregation step consists of concatenating the JSONLines records generated by every child execution of the generation map into a single ground truth output file. A randomly selected percentage of the records in the output file are sampled and flagged for review as part of a human-in-the-loop process.
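
The following is a minimal sketch of that aggregation and sampling logic, simplified to local JSONLines files rather than the S3 locations and SageMaker Processing job used by the pipeline; the flagged_for_review field is an illustrative assumption.

import glob
import json
import random

REVIEW_PERCENTAGE = 10  # user input: percent of records to flag for human review

# Concatenate the JSONLines records produced by each child execution of the generation map
records = []
for path in glob.glob("generation-output/*.jsonl"):
    with open(path) as f:
        records.extend(json.loads(line) for line in f if line.strip())

# Randomly select a percentage of records for human-in-the-loop review
sample_size = max(1, round(len(records) * REVIEW_PERCENTAGE / 100)) if records else 0
review_indices = set(random.sample(range(len(records)), sample_size))

# Write the single aggregated ground truth file, flagging the sampled records
with open("ground_truth.jsonl", "w") as out:
    for i, record in enumerate(records):
        record["flagged_for_review"] = i in review_indices
        out.write(json.dumps(record) + "\n")
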
Judging ground truth for FMEval question-answering evaluation
In this section, we discuss two key components of evaluating ground truth quality: human in the loop and applying an LLM as a Judge. Measuring ground truth quality is an essential component of the evaluation lifecycle.
Human-in-the-loop
The level of ground truth human review required is determined by the risk of having incorrect ground truth, and its negative implications. Ground truth review by use case SMEs can verify whether critical business logic is appropriately represented by the ground truth. The process of ground truth review by humans is called human-in-the-loop (HITL), and an example of the HITL process is shown in the following figure.
The steps of HITL are:

Classify risk: Performing a risk analysis establishes the severity and likelihood of negative events occurring as a result of incorrect ground truth being used to evaluate a generative AI use case. Based on the outcome of the analysis, assign the ground truth dataset a risk level: Low, Medium, High, or Critical. The following table outlines the relationship between event severity, likelihood, and risk level. See Learn how to assess the risk of AI systems for a deep dive on performing AI risk assessment.
Human review: Based on the assigned risk level, use-case expert reviewers examine a proportional amount of the use-case ground truth. Organizations can set acceptability thresholds for percentage of HITL intervention based on their tolerance for risk. Similarly, if a ground truth dataset is promoted from a low risk to a medium risk use case, an increased level of HITL intervention will be necessary.
Identify findings: Reviewers can identify any hallucinations relative to source data, challenges with information veracity according to their expertise, or other criteria set by the organization. In this post, we focus on hallucination detection and information veracity.
Action results: Reviewers can take business actions based on their judgement, such as updating and deleting records, or re-writing applicable source documents. Bringing in LLMOps SMEs to apply dataset curation best practices can also be an outcome.

Putting the risk table from Learn how to assess the risk of AI systems into action, the severity and likelihood of risks for a ground truth dataset validating a production chatbot with frequent customer use would be greater than an internal evaluation dataset used by developers to advance a prototype.

| Severity \ Likelihood | Rare | Unlikely | Possible | Likely | Frequent |
| --- | --- | --- | --- | --- | --- |
| Extreme | Low | Medium | High | Critical | Critical |
| Major | Very low | Low | Medium | High | Critical |
| Moderate | Very low | Low | Medium | Medium | High |
| Low | Very low | Very low | Low | Low | Medium |
| Very Low | Very low | Very low | Very low | Very low | Low |
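
The risk table can be encoded as a simple lookup. The following sketch assumes the severity and likelihood labels shown above and is only meant to illustrate how a risk level could be assigned programmatically:

# Risk level lookup derived from the severity/likelihood table above
RISK_MATRIX = {
    "Extreme":  {"Rare": "Low",      "Unlikely": "Medium",   "Possible": "High",     "Likely": "Critical", "Frequent": "Critical"},
    "Major":    {"Rare": "Very low", "Unlikely": "Low",      "Possible": "Medium",   "Likely": "High",     "Frequent": "Critical"},
    "Moderate": {"Rare": "Very low", "Unlikely": "Low",      "Possible": "Medium",   "Likely": "Medium",   "Frequent": "High"},
    "Low":      {"Rare": "Very low", "Unlikely": "Very low", "Possible": "Low",      "Likely": "Low",      "Frequent": "Medium"},
    "Very Low": {"Rare": "Very low", "Unlikely": "Very low", "Possible": "Very low", "Likely": "Very low", "Frequent": "Low"},
}

def assign_risk_level(severity: str, likelihood: str) -> str:
    """Return the ground truth dataset risk level for a given severity and likelihood."""
    return RISK_MATRIX[severity][likelihood]

print(assign_risk_level("Major", "Possible"))  # Medium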

Next, we walk through the step-by-step process of conducting a human review for hallucination detection and information veracity. Human review is performed by comparing the source data chunk that was input to the LLM prompt with the generated question-answer-fact triplets. This view is shown in the following table.

Source data chunk:
Dear Shareholders: Last year at this time, I shared my enthusiasm and optimism for Amazon’s future. Today, I have even more. The reasons are many, but start with the progress we’ve made in our financial results and customer experiences, and extend to our continued innovation and the remarkable opportunities in front of us. In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B. By segment, North America revenue increased 12% YoY from $316B to $353B, International revenue grew 11% YoY from $118B to $131B, and AWS revenue increased 13% YoY from $80B to $91B.

Ground truth triplets:
{"question": "What was Amazon's total revenue growth in 2023?", "ground_truth_answer": "Amazon's total revenue grew 12% year-over-year from $514B to $575B in 2023.", "fact": "12%<OR>$514B to $575B"}
{"question": "How much did North America revenue increase in 2023?", "ground_truth_answer": "North America revenue increased 12% year-over-year from $316B to $353B.", "fact": "12%<OR>$316B to $353B"}
{"question": "What was the growth in International revenue for Amazon in 2023?", "ground_truth_answer": "International revenue grew 11% year-over-year from $118B to $131B.", "fact": "11%<OR>$118B to $131B"}

Human reviewers then identify and take action based on findings to correct the system. LLM hallucination is the phenomenon where LLMs generate plausible-sounding but factually incorrect or nonsensical information, presented confidently as factual. Organizations can introduce additional qualities for evaluating and scoring ground truth, as suited to the risk level and use case requirements.
In hallucination detection, reviewers seek to identify text that has been incorrectly generated by the LLM. An example of hallucination and remediation is shown in the following table. A reviewer would notice in the source data that Amazon’s total revenue grew 12% year over year, yet the ground truth answer hallucinated a 15% figure. In remediation, the reviewer can change this back to 12%.

Source data chunk:
In 2023, Amazon’s total revenue grew 12% year-over-year (“YoY”) from $514B to $575B.

Example hallucination:
{"question": "What was Amazon's total revenue growth in 2023?", "ground_truth_answer": "Amazon's total revenue grew 15% year-over-year from $514B to $575B in 2023.", "fact": "12%<OR>$514B to $575B"}

Example hallucination remediation:
{"question": "What was Amazon's total revenue growth in 2023?", "ground_truth_answer": "Amazon's total revenue grew 12% year-over-year from $514B to $575B in 2023.", "fact": "12%<OR>$514B to $575B"}
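
A lightweight scripted check can surface this kind of numeric hallucination before records reach a human reviewer. The following sketch is a simple heuristic, not a substitute for HITL review, and the function name is illustrative:

import re

def numbers_not_in_source(ground_truth_answer: str, source_chunk: str) -> list[str]:
    """Return numeric tokens in the answer that never appear in the source chunk."""
    answer_numbers = re.findall(r"\$?\d[\d,]*(?:\.\d+)?%?", ground_truth_answer)
    return [n for n in answer_numbers if n not in source_chunk]

source = "In 2023, Amazon's total revenue grew 12% year-over-year from $514B to $575B."
hallucinated = "Amazon's total revenue grew 15% year-over-year from $514B to $575B in 2023."

print(numbers_not_in_source(hallucinated, source))  # ['15%'] -> flag for reviewer attention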

In SME review for veracity, reviewers seek to validate if the ground truth is in fact truthful. For example, the source data used for the ground truth generation prompt might be out of date or incorrect. The following table shows the perspective of an HITL review by a domain SME.

Source data chunk:
Effective June 1st, 2023, AnyCompany is pleased to announce the implementation of “Casual Friday” as part of our updated dress code policy. On Fridays, employees are permitted to wear business casual attire, including neat jeans, polo shirts, and comfortable closed-toe shoes.

Example SME review:
“As an HR Specialist, this looks incorrect to me. We did not implement the Casual Friday policy after all at AnyCompany – the source data for this ground truth must be out of date.”

Example remediation actions:
Delete the incorrect ground truth record
Update the source data document
Other use case-specific actions

Traditional machine learning applications can also inform the HITL process design. For examples of HITL for traditional machine learning, see Human-in-the-loop review of model explanations with Amazon SageMaker Clarify and Amazon A2I. 
LLM-as-a-judge
When scaling HITL, LLM reviewers can perform hallucination detection and remediation. This idea is known as self-reflective RAG, and can be used to decrease—but not eliminate—the level of human effort in the process for hallucination detection. As a means of scaling LLM-as-a-judge review, Amazon Bedrock now offers the ability to use LLM reviewers and to perform automated reasoning checks with Amazon Bedrock Guardrails for mathematically sound self-validation against predefined policies. For more information about implementation, see New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock and Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview).
The following figure shows an example high-level diagram of a self-reflective RAG pattern. A generative AI application based on RAG yields responses fed to a judge application. The judge application reflects on whether responses are incomplete, hallucinated, or irrelevant. Based on the judgement, data is routed along the corresponding remediation.
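
As a minimal sketch of the judging step (calling the Amazon Bedrock Converse API directly with a hypothetical judge prompt and placeholder model ID, rather than the managed LLM-as-a-judge capability), a response can be routed based on a single-word verdict:

import boto3

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # placeholder model ID

judge_prompt = (
    "You are a strict evaluator. Given the source passage and the generated answer, "
    "reply with exactly one word: HALLUCINATED, INCOMPLETE, IRRELEVANT, or ACCEPTABLE.\n\n"
    "<source>{source}</source>\n<answer>{answer}</answer>"
)

def judge(source: str, answer: str) -> str:
    bedrock_runtime = boto3.client("bedrock-runtime")
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": judge_prompt.format(source=source, answer=answer)}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()

verdict = judge(
    source="In 2023, Amazon's total revenue grew 12% year-over-year from $514B to $575B.",
    answer="Amazon's total revenue grew 15% year-over-year in 2023.",
)
print(verdict)  # verdict can then be used to route the record to the corresponding remediation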

The golden rule in implementing HITL or LLM-as-a-judge as part of ground truth generation is to make sure the organization’s review process aligns with the accepted risk level for the ground truth dataset.
Conclusion
In this post, we provided guidance on generating and reviewing ground truth for evaluating question-answering applications using FMEval. We explored best practices for applying LLMs to scale ground truth generation while maintaining quality and accuracy. The serverless batch pipeline architecture we presented offers a scalable solution for automating this process across large enterprise knowledge bases. We provide a ground truth generation prompt that you can use to get started with evaluating knowledge assistants using the FMEval Factual Knowledge and QA Accuracy evaluation metrics.
By following these guidelines, organizations can follow responsible AI best practices for creating high-quality ground truth datasets for deterministic evaluation of question-answering assistants. Use case-specific evaluations supported by well-curated ground truth play a crucial role in developing and deploying AI solutions that meet the highest standards of quality and responsibility.
Whether you’re developing an internal tool, a customer-facing virtual assistant, or exploring the potential of generative AI for your organization, we encourage you to adopt these best practices. Start implementing robust ground truth generation and review processes for your generative AI question-answering evaluations today with FMEval.

About the authors
Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.
Philippe Duplessis-Guindon is a cloud consultant at AWS, where he has worked on a wide range of generative AI projects. He has touched on most aspects of these projects, from infrastructure and DevOps to software development and AI/ML. After earning his bachelor’s degree in software engineering and a master’s in computer vision and machine learning from Polytechnique Montreal, Philippe joined AWS to put his expertise to work for customers. When he’s not at work, you’re likely to find Philippe outdoors—either rock climbing or going for a run.
Rahul Jani is a Data Architect with AWS Professional Service. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He is specialized in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.
Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.

Step by Step Guide to Build an AI Research Assistant with Hugging Face …

Hugging Face’s SmolAgents framework provides a lightweight and efficient way to build AI agents that leverage tools like web search and code execution. In this tutorial, we demonstrate how to build an AI-powered research assistant that can autonomously search the web and summarize articles using SmolAgents. This implementation runs seamlessly, requiring minimal setup, and showcases the power of AI agents in automating real-world tasks such as research, summarization, and information retrieval.

!pip install smolagents beautifulsoup4

First, we install smolagents, which enables AI agents to use tools like web search and code execution, and beautifulsoup4, a Python library for parsing HTML and extracting text from web pages.

import os
from getpass import getpass

# Securely input and store the Hugging Face API token
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Enter your Hugging Face API token: ")

Now, we securely input and store the Hugging Face API token as an environment variable. The code uses getpass() to prompt users to enter their token without displaying it, for security. The token is then stored in os.environ["HUGGINGFACEHUB_API_TOKEN"], allowing authenticated access to Hugging Face’s Inference API for running AI models.

from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# Initialize the model WITHOUT passing hf_token directly
model = HfApiModel()

# Define tools (DuckDuckGo for web search)
tools = [DuckDuckGoSearchTool()]

# Create the agent
agent = CodeAgent(tools=tools, model=model, additional_authorized_imports=["requests", "bs4"])

Now, we initialize an AI-powered agent using the SmolAgents framework. It sets up HfApiModel() to load a Hugging Face API-based language model, automatically detecting the stored API token for authentication. The agent is equipped with DuckDuckGoSearchTool() to perform web searches. Also, CodeAgent() is instantiated with tool access and authorized imports, such as requests for making web requests and bs4 for parsing HTML content.

# Example query to the agent:
query = "Summarize the main points of the Wikipedia article on Hugging Face (the company)."

# Run the agent with the query
result = agent.run(query)

print("\nAgent's final answer:\n", result)

We then send a query to the AI agent, asking it to summarize the main points of the Wikipedia article on Hugging Face. The agent.run(query) command triggers the agent to perform a web search, retrieve relevant content, and generate a summary using the language model. Finally, the print() function displays the agent’s final answer, concisely summarizing the requested topic.

Sample Output

Following this tutorial, we have successfully built an AI-powered research assistant using Hugging Face SmolAgents that can autonomously search the web and summarize articles. This implementation highlights the power of AI agents in automating research tasks, making it easier to retrieve and process large amounts of information efficiently. Beyond web search and summarization, SmolAgents can be extended to various real-world applications, including automated coding assistants, personal task managers, and AI-driven chatbots.

Here is the Colab Notebook for the above project.

The post Step by Step Guide to Build an AI Research Assistant with Hugging Face SmolAgents: Automating Web Search and Article Summarization Using LLM-Powered Autonomous Agents appeared first on MarkTechPost.

Project Alexandria: Democratizing Scientific Knowledge Through Structu …

Scientific publishing has expanded significantly in recent decades, yet access to crucial research remains restricted for many, particularly in developing countries, independent researchers, and small academic institutions. The rising costs of journal subscriptions exacerbate this disparity, limiting the availability of knowledge even in well-funded universities. Despite the push for Open Access (OA), barriers persist, as demonstrated by large-scale access losses in Germany and the U.S. due to price disputes with publishers. This limitation hinders scientific progress, leading researchers to explore alternative methods for making scientific knowledge more accessible while navigating copyright constraints.

Current methods of accessing scientific content primarily involve direct subscriptions, institutional access, or reliance on legally ambiguous repositories. These approaches are either financially unsustainable or legally contentious. While OA publishing helps, it does not fully resolve the accessibility crisis. Large Language Models (LLMs) offer a new avenue for extracting and summarizing knowledge from scholarly texts, but their use raises copyright concerns. The challenge lies in separating factual content from the creative expressions protected under copyright law.

To address this, the research team proposes Project Alexandria, which introduces Knowledge Units (KUs) as a structured format for extracting factual information while omitting stylistic elements. KUs encode key scientific insights—such as definitions, relationships, and methodological details—in a structured database, ensuring that only non-copyrightable factual content is preserved. This framework aligns with legal principles like the idea-expression dichotomy, which states that facts cannot be copyrighted, only their specific phrasing and presentation.

Reference: https://arxiv.org/pdf/2502.19413

Knowledge Units are generated through an LLM pipeline that processes scholarly texts in paragraph-sized segments, extracting core concepts and their relationships. Each KU contains:

Entities: Core scientific concepts identified in the text.

Relationships: Connections between entities, including causal or definitional links.

Attributes: Specific details related to entities.

Context summary: A brief summary ensuring coherence across multiple KUs.

Sentence MinHash: A fingerprint to track the source text without storing the original phrasing.

This structured approach balances knowledge retention with legal defensibility. Paragraph-level segmentation ensures optimal granularity—too small, and information is scattered; too large, and LLM performance degrades.
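
A minimal sketch of how a Knowledge Unit might be represented in code follows; the field names mirror the description above, and the MinHash fingerprint is shown simply as a list of integer hash values rather than a particular library’s implementation.

from dataclasses import dataclass, field

@dataclass
class KnowledgeUnit:
    """Structured, non-copyrightable factual content extracted from one paragraph-sized segment."""
    entities: list[str]                 # core scientific concepts identified in the text
    relationships: list[dict]           # e.g. {"subject": ..., "relation": ..., "object": ...}
    attributes: dict                    # specific details attached to entities
    context_summary: str                # brief summary keeping coherence across KUs
    sentence_minhash: list[int] = field(default_factory=list)  # fingerprint of the source sentences

ku = KnowledgeUnit(
    entities=["Knowledge Unit", "idea-expression dichotomy"],
    relationships=[{"subject": "Knowledge Unit", "relation": "preserves", "object": "factual content"}],
    attributes={"segment_granularity": "paragraph"},
    context_summary="KUs store facts from a paragraph while omitting the original phrasing.",
)
print(ku.entities)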

From a legal standpoint, the framework complies with both German and U.S. copyright laws. German law explicitly excludes facts from copyright protection and allows data mining under specific exemptions. Similarly, the U.S. Fair Use doctrine permits transformative uses like text and data mining, provided they do not harm the market value of the original work. The research team demonstrates that KUs satisfy these legal conditions by excluding expressive elements while preserving factual content.

To evaluate the effectiveness of KUs, the team conducted multiple-choice question (MCQ) tests using abstracts and full-text articles from biology, physics, mathematics, and computer science. The results show that LLMs using KUs achieve nearly the same accuracy as those given the original texts. This suggests that the vast majority of relevant information is retained despite the removal of expressive elements. Furthermore, plagiarism detection tools confirm minimal overlap between KUs and the original texts, reinforcing the method’s legal viability.

Beyond legal considerations, the research explores the limitations of existing alternatives. Text embeddings, commonly used for knowledge representation, fail to capture precise factual details, making them unsuitable for scientific knowledge extraction. Direct paraphrasing methods risk maintaining too much similarity to the original text, potentially violating copyright laws. In contrast, KUs provide a more structured and legally sound approach.

The study also addresses common criticisms. While some argue that citation dilution could result from extracting knowledge into databases, traceable attribution systems can mitigate this concern. Others worry that nuances in scientific research may be lost, but the team highlights that most complex elements—like mathematical proofs—are not copyrightable to begin with. Concerns about potential legal risks and hallucination propagation are acknowledged, with recommendations for hybrid human-AI validation systems to enhance reliability.

The broader impact of freely accessible scientific knowledge extends across multiple sectors. Researchers can collaborate more effectively across disciplines, healthcare professionals can access critical medical research more efficiently, and educators can develop high-quality curricula without cost barriers. Additionally, open scientific knowledge promotes public trust and transparency, reducing misinformation and enabling informed decision-making.

Moving forward, the team identifies several research directions, including refining factual accuracy through cross-referencing, developing educational applications for KU-based knowledge dissemination, and establishing interoperability standards for knowledge graphs. They also propose integrating KUs into a broader semantic web for scientific discovery, leveraging AI to automate and validate extracted knowledge at scale.

In summary, Project Alexandria presents a promising framework for making scientific knowledge more accessible while respecting copyright constraints. By systematically extracting factual content from scholarly texts and structuring it into Knowledge Units, this approach provides a legally viable and technically effective solution to the accessibility crisis in scientific publishing. Extensive testing demonstrates its potential for preserving critical information without violating copyright laws, positioning it as a significant step toward democratizing access to knowledge in the scientific community.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.

The post Project Alexandria: Democratizing Scientific Knowledge Through Structured Fact Extraction with LLMs appeared first on MarkTechPost.

This AI Paper Identifies Function Vector Heads as Key Drivers of In-Co …

In-context learning (ICL) allows large language models (LLMs) to generalize and adapt to new tasks with minimal demonstrations. ICL is crucial for improving model flexibility, efficiency, and application in language translation, text summarization, and automated reasoning. Despite its significance, the exact mechanisms responsible for ICL remain an active area of research, with two competing theories proposed: induction heads, which detect token sequences and predict subsequent tokens, and function vector (FV) heads, which encode a latent representation of tasks.

Understanding which mechanism predominantly drives ICL is a critical challenge. Induction heads function by identifying repeated patterns within input data and leveraging this repetition to predict forthcoming tokens. However, this approach does not fully explain how models perform complex reasoning with only a few examples. FV heads, on the other hand, are believed to capture an abstract understanding of tasks, providing a more generalized and adaptable approach to ICL. Differentiating between these two mechanisms and determining their contributions is essential for developing more efficient LLMs.

Earlier studies largely attributed ICL to induction heads, assuming their pattern-matching capability was fundamental to learning from context. However, recent research challenges this notion by demonstrating that FV heads play a more significant role in few-shot learning. While induction heads primarily operate at the syntactic level, FV heads enable a broader understanding of the relationships within prompts. This distinction suggests that FV heads may be responsible for the model’s ability to transfer knowledge across different tasks, a capability that induction heads alone cannot explain.

A research team from the University of California, Berkeley, conducted a study analyzing attention heads across twelve LLMs, ranging from 70 million to 7 billion parameters. They aimed to determine which attention heads play the most significant role in ICL. Through controlled ablation experiments, researchers disabled specific attention heads and measured the resulting impact on the model’s performance. By selectively removing either induction heads or FV heads, they could isolate each mechanism’s unique contributions.
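
To make the ablation idea concrete, the following self-contained toy example (not the authors’ code) zeroes out selected heads in a small multi-head self-attention computation before the per-head outputs are recombined:

import torch

def multi_head_attention(x, w_q, w_k, w_v, n_heads, ablate_heads=()):
    """Toy multi-head self-attention; outputs of heads listed in ablate_heads are zeroed."""
    batch, seq, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
        return t.view(batch, seq, n_heads, d_head).transpose(1, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = (q @ k.transpose(-2, -1)) / d_head**0.5
    head_outputs = torch.softmax(scores, dim=-1) @ v     # (batch, n_heads, seq, d_head)

    for h in ablate_heads:                                # ablation: zero out selected heads
        head_outputs[:, h] = 0.0

    return head_outputs.transpose(1, 2).reshape(batch, seq, d_model)

torch.manual_seed(0)
x = torch.randn(1, 5, 32)
w_q, w_k, w_v = (torch.randn(32, 32) for _ in range(3))
full = multi_head_attention(x, w_q, w_k, w_v, n_heads=4)
ablated = multi_head_attention(x, w_q, w_k, w_v, n_heads=4, ablate_heads=[1, 3])
print((full - ablated).abs().mean())  # nonzero difference shows the ablated heads contributed to the output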

The findings revealed that FV heads emerge later in the training process and are positioned in the model’s deeper layers than induction heads. Through detailed training analysis, researchers observed that many FV heads initially function as induction heads before transitioning into FV heads. This suggests that induction may be a precursor to developing more complex FV mechanisms. This transformation was noted across multiple models, indicating a consistent pattern in how LLMs develop task comprehension over time.

Performance results provided quantitative evidence of the significance of FV heads in ICL. When FV heads were ablated, model accuracy declined noticeably, and the degradation became more pronounced in larger models, where the role of FV heads is increasingly dominant. In contrast, removing induction heads had minimal impact beyond what would be expected from random ablations. The researchers also observed that preserving only the top 2% of FV heads was sufficient to maintain reasonable ICL performance, whereas ablating them substantially impaired accuracy. In the Pythia 6.9B model, the accuracy drop from removing FV heads was substantially greater than from ablating induction heads, reinforcing the hypothesis that FV heads drive few-shot learning.

These results challenge previous assumptions that induction heads are the primary facilitators of ICL. Instead, the study establishes FV heads as the more crucial component, particularly as models scale in size. The evidence suggests that as models increase in complexity, they rely more heavily on FV heads for effective in-context learning. This insight advances the understanding of ICL mechanisms and provides guidance for optimizing future LLM architectures.

By distinguishing the roles of induction and FV heads, this research shifts the perspective on how LLMs acquire and utilize contextual information. The discovery that FV heads evolve from induction heads highlights an important developmental process within these models. Future studies may explore ways to enhance FV head formation, improving the efficiency and adaptability of LLMs. The findings also have implications for model interpretability, as understanding these internal mechanisms can aid in developing more transparent and controllable AI systems.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This AI Paper Identifies Function Vector Heads as Key Drivers of In-Context Learning in Large Language Models appeared first on MarkTechPost.

Accelerate AWS Well-Architected reviews with Generative AI

Building cloud infrastructure based on proven best practices promotes security, reliability and cost efficiency. To achieve these goals, the AWS Well-Architected Framework provides comprehensive guidance for building and improving cloud architectures. As systems scale, conducting thorough AWS Well-Architected Framework Reviews (WAFRs) becomes even more crucial, offering deeper insights and strategic value to help organizations optimize their growing cloud environments.
In this post, we explore a generative AI solution leveraging Amazon Bedrock to streamline the WAFR process. We demonstrate how to harness the power of LLMs to build an intelligent, scalable system that analyzes architecture documents and generates insightful recommendations based on AWS Well-Architected best practices. This solution automates portions of the WAFR report creation, helping solutions architects improve the efficiency and thoroughness of architectural assessments while supporting their decision-making process.
Scaling Well-Architected reviews using a generative AI-powered solution
As organizations expand their cloud footprint, they face several challenges in adhering to the Well-Architected Framework:

Time-consuming and resource-intensive manual reviews
Inconsistent application of Well-Architected principles across different teams
Difficulty in keeping pace with the latest best practices
Challenges in scaling reviews for large or numerous architectures

To address these challenges, we have built a WAFR Accelerator solution that uses generative AI to help streamline and expedite the WAFR process. By automating the initial assessment and documentation process, this solution significantly reduces time spent on evaluations while providing consistent architecture assessments against AWS Well-Architected principles. This allows teams to focus more on implementing improvements and optimizing AWS infrastructure. The solution incorporates the following key features:

Using a Retrieval Augmented Generation (RAG) architecture, the system generates a context-aware detailed assessment. The assessment includes a solution summary, an evaluation against Well-Architected pillars, an analysis of adherence to best practices, actionable improvement recommendations, and a risk assessment.
 An interactive chat interface allows deeper exploration of both the original document and generated content.
Integration with the AWS Well-Architected Tool pre-populates workload information and initial assessment responses.

This solution offers the following key benefits:

Rapid analysis and resource optimization – What previously took days of manual review can now be accomplished in minutes, allowing for faster iteration and improvement of architectures. This time efficiency translates to significant cost savings and optimized resource allocation in the review process.
Consistency and enhanced accuracy – The approach provides a consistent application of AWS Well-Architected principles across reviews, reducing human bias and oversight. This systematic approach leads to more reliable and standardized evaluations.
Depth of insight – Advanced analysis can identify subtle patterns and potential issues that might be missed in manual reviews, providing deeper insights into architectural strengths and weaknesses.
Scalability – The solution can handle multiple reviews simultaneously, making it suitable for organizations of all sizes, from startups to enterprises. This scalability allows for more frequent and comprehensive reviews.
Interactive exploration – The generative AI-driven chat interface allows users to dive deeper into the assessment, asking follow-up questions and gaining a better understanding of the recommendations. This interactivity enhances engagement and promotes more thorough comprehension of the results.

Solution overview
The WAFR Accelerator is designed to streamline and enhance the architecture review process by using the capabilities of generative AI through Amazon Bedrock and other AWS services. This solution automates the analysis of complex architecture documents, evaluating them against the AWS Well-Architected Framework’s pillars and providing detailed assessments and recommendations.
The solution consists of the following capabilities:

Generative AI-powered analysis – Uses Amazon Bedrock to rapidly analyze architecture documents against AWS Well-Architected best practices, generating detailed assessments and recommendations.
Knowledge base integration – Incorporates up-to-date WAFR documentation and cloud best practices using Amazon Bedrock Knowledge Bases, providing accurate and context-aware evaluations.
Customizable – Uses prompt engineering, which enables customization and iterative refinement of the prompts used to drive the large language model (LLM), allowing for refining and continuous enhancement of the assessment process.
Integration with the AWS Well-Architected Tool – Creates a Well-Architected workload milestone for the assessment and prepopulates answers for WAFR questions based on generative AI-based assessment.
Generative AI-assisted chat – Offers an AI-driven chat interface for in-depth exploration of assessment results, supporting multi-turn conversations with context management.
Scalable architecture – Uses AWS services like AWS Lambda and Amazon Simple Queue Service (Amazon SQS) for efficient processing of multiple reviews.
Data privacy and network security – With Amazon Bedrock, you are in control of your data, and all your inputs and customizations remain private to your AWS account. Your data, such as prompts, completions, custom models, and data used for fine-tuning or continued pre-training, is not used for service improvement and is never shared with third-party model providers. Your data remains in the AWS Region where the API call is processed. All data is encrypted in transit and at rest. You can use AWS PrivateLink to create a private connection between your VPC and Amazon Bedrock.

A human-in-the-loop review is still crucial to validate the generative AI findings, checking for accuracy and alignment with organizational requirements.
The following diagram illustrates the solution’s technical architecture.

The workflow consists of the following steps:

WAFR guidance documents are uploaded to a bucket in Amazon Simple Storage Service (Amazon S3). These documents form the foundation of the RAG architecture. Using Amazon Bedrock Knowledge Bases, the sample solution ingests these documents and generates embeddings, which are then stored and indexed in Amazon OpenSearch Serverless. This creates a vector database that enables retrieval of relevant WAFR guidance during the review process.
Users access the WAFR Accelerator Streamlit application through Amazon CloudFront, which provides secure and scalable content delivery. User authentication is handled by Amazon Cognito, making sure only authenticated users have access.
Users upload their solution architecture document in PDF format using the Streamlit application running on an Amazon Elastic Compute Cloud (Amazon EC2) instance that stores it in an S3 bucket. On submission, the WAFR review process is invoked by Amazon SQS, which queues the review request.
The WAFR reviewer, based on Lambda and AWS Step Functions, is activated by Amazon SQS. It orchestrates the review process, including document content extraction, prompt generation, solution summary, knowledge embedding retrieval, and generation.
Amazon Textract extracts the content from the uploaded documents, making it machine-readable for further processing.
The WAFR reviewer uses Amazon Bedrock Knowledge Bases’ fully managed RAG workflow to query the vector database in OpenSearch Serverless, retrieving relevant WAFR guidance based on the selected WAFR pillar and questions. Metadata filtering is used to improve retrieval accuracy.
Using the extracted document content and retrieved embeddings, the WAFR reviewer generates an assessment using Amazon Bedrock (a minimal sketch of this retrieval and generation step is shown after this list). A workload is created in the AWS Well-Architected Tool with answers populated with the assessment results. This allows users to download an initial version of the AWS Well-Architected report from the AWS Well-Architected Tool console on completion of the assessment.
The assessment is also stored in an Amazon DynamoDB table for quick retrieval and future reference.
The WAFR Accelerator application retrieves the review status from the DynamoDB table to keep the user informed.
Users can chat with the content using Amazon Bedrock, allowing for deeper exploration of the document, assessment, and recommendations.
Once the assessment is complete, human reviewers can review it in the AWS Well-Architected Tool.
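
The following is a minimal sketch of the retrieval and generation step referenced above; the knowledge base ID, model ID, metadata key, and document text are placeholders, and the prompt is illustrative rather than the solution’s system prompt.

import boto3

KNOWLEDGE_BASE_ID = "XXXXXXXXXX"                         # placeholder knowledge base ID
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"   # placeholder model ID
document_text = "...text extracted from the uploaded architecture document by Amazon Textract..."

agent_runtime = boto3.client("bedrock-agent-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")

# Retrieve WAFR guidance for one pillar, using a metadata filter to narrow the search
guidance = agent_runtime.retrieve(
    knowledgeBaseId=KNOWLEDGE_BASE_ID,
    retrievalQuery={"text": "How should workloads implement least-privilege access?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "filter": {"equals": {"key": "pillar", "value": "Security"}},  # assumed metadata key
        }
    },
)
context = "\n".join(result["content"]["text"] for result in guidance["retrievalResults"])

# Generate an assessment grounded in the retrieved guidance and the extracted document content
prompt = (
    "Assess the following architecture against the Well-Architected guidance provided.\n"
    f"<guidance>{context}</guidance>\n<architecture>{document_text}</architecture>"
)
response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(response["output"]["message"]["content"][0]["text"])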

Deploy the solution
To implement the solution in your own environment, we’ve provided resources in the following GitHub repo to guide you through the process. The setup is streamlined using the AWS Cloud Development Kit (AWS CDK), which allows for infrastructure as code (IaC) deployment. For step-by-step instructions, we’ve prepared a detailed README file that walks you through the entire setup process.
To get started, complete the following steps:

Clone the provided repository containing the AWS CDK code and README file.
Review the README file for prerequisites and environment setup instructions.
Follow the AWS CDK deployment steps outlined in the documentation.
Configure necessary environment-specific parameters as described.

Deploying and running this solution in your AWS environment will incur costs for the AWS services used, including but not limited to Amazon Bedrock, Amazon EC2, Amazon S3, and DynamoDB. It is highly recommended that you use a separate AWS account and set up an AWS Budget to monitor the costs.

DISCLAIMER: This is sample code for non-production usage. You should work with your security and legal teams to adhere to your organizational security, regulatory, and compliance requirements before deployment.

Test the solution
The following diagram illustrates the workflow for using the application.

To demonstrate how generative AI can accelerate AWS Well-Architected reviews, we have developed a Streamlit-based demo web application that serves as the front-end interface for initiating and managing the WAFR review process.
Complete the following steps to test the demo application:

Open a new browser window and enter the CloudFront URL provided during the setup.
Add a new user to the Amazon Cognito user pool deployed by the AWS CDK during the setup. Log in to the application using this user’s credentials.
Choose New WAFR Review in the navigation pane.
For Analysis type, choose the analysis type:

Quick – You can generate a quick analysis without creating a workload in the AWS Well-Architected Tool. This option is faster because it groups the questions for an individual pillar into a single prompt. It’s suitable for an initial assessment.
Deep with Well-Architected Tool – You can generate a comprehensive and detailed analysis that automatically creates a workload in the AWS Well-Architected Tool. This thorough review process requires more time to complete because it evaluates each question individually rather than grouping them together. The deep review typically takes approximately 20 minutes, though the actual duration may vary depending on the document size and the number of Well-Architected pillars selected for evaluation.

Enter the analysis name and description.
Choose the AWS Well-Architected lens and desired pillars.
Upload your solution architecture or technical design document.
Choose Create WAFR Analysis.
Choose Existing WAFR Reviews in the navigation pane.
Choose your newly submitted analysis.

After the status changes to Completed, you can view the WAFR analysis at the bottom of the page. For multiple reviews, choose the relevant analysis from the dropdown menu.

You can chat with the uploaded document as well as the other generated content by using the WAFR Chat section on the Existing WAFR Reviews page.

Improving assessment quality
The solution uses prompt engineering to optimize the textual input to the foundation model (FM) to obtain desired assessment responses. The quality of the prompt (the system prompt, in this case) has a significant impact on the model output. The solution provides a sample system prompt that is used to drive the assessment. You could enhance this prompt further to align with specific organizational needs. This becomes more crucial when defining and ingesting your own custom lenses.
Another important factor is the quality of the document that is uploaded for assessment. Detailed and architecture-rich documents can result in better inferences and therefore finer assessments. Prompts are defined in such a way that if there is inadequate information for assessment, then it’s highlighted in the output. This minimizes hallucination by the FM and provides a potential opportunity to enrich your design templates in alignment with AWS Well-Architected content.
You could further enhance this solution by using Amazon Bedrock Guardrails to reduce hallucinations and ground responses in your own source information.
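
As a minimal sketch (assuming a guardrail has already been created; the guardrail identifier, version, and model ID are placeholders), a guardrail can be attached to the Amazon Bedrock Converse call used for assessment:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Summarize the reliability findings for this workload."}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",        # placeholder guardrail ID
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])
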
At the time of writing, only the AWS Well-Architected Framework, Financial Services Industry, and Analytics lenses have been provisioned. However, other lenses, including custom lenses, could be added with a few refinements to the UI application and the underlying data store.
Clean up
After you’ve finished exploring or using the solution and no longer require these resources, be sure to clean them up to avoid ongoing charges. Follow these steps to remove all associated resources:

Navigate to the directory containing your AWS CDK code.
Run the following command: cdk destroy.
Confirm the deletion when prompted.
Manually check for and delete any resources that might not have been automatically removed, such as S3 buckets with content or custom IAM roles.
Verify that all related resources have been successfully deleted.

Conclusion
In this post, we showed how generative AI and Amazon Bedrock can play a crucial role in expediting and scaling the AWS Well-Architected Framework reviews within an organization. By automating document analysis and using a WAFR-aware knowledge base, the solution offers rapid and in-depth assessments, helping organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads.
To learn more, refer to the following:

Amazon Bedrock Documentation
AWS Well-Architected
Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy

About the Authors
Shoeb Bustani is a Senior Enterprise Solutions Architect at AWS, based in the United Kingdom. As a senior enterprise architect, innovator, and public speaker, he provides strategic architectural partnership and guidance to help customers achieve their business outcome leveraging AWS services and best practices.
Brijesh Pati is an Enterprise Solutions Architect at AWS, helping enterprise customers adopt cloud technologies. With a background in application development and enterprise architecture, he has worked with customers across sports, finance, energy, and professional services sectors. Brijesh specializes in AI/ML solutions and has experience with serverless architectures.
Rohan Ghosh is an Enterprise Solutions Architect at Amazon Web Services (AWS), specializing in the Advertising and Marketing sector. With extensive experience in Cloud Solutions Engineering, Application Development, and Enterprise Support, he helps organizations architect and implement cutting-edge cloud solutions. His current focus areas include Data Analytics and Generative AI, where he guides customers in leveraging AWS technologies to drive innovation and business transformation.

Dynamic metadata filtering for Amazon Bedrock Knowledge Bases with Lan …

Amazon Bedrock Knowledge Bases offers a fully managed Retrieval Augmented Generation (RAG) feature that connects large language models (LLMs) to internal data sources. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts. It also provides developers with greater control over the LLM’s outputs, including the ability to include citations and manage sensitive information.
Amazon Bedrock Knowledge Bases has a metadata filtering capability that allows you to refine search results based on specific attributes of the documents, improving retrieval accuracy and the relevance of responses. These metadata filters can be used in combination with the typical semantic (or hybrid) similarity search. Improving document retrieval results helps personalize the responses generated for each user. Dynamic metadata filters allow you to instantly create custom queries based on varying user profiles or user-provided responses so that the retrieved documents contain only information relevant to the user's needs.
In this post, we discuss using metadata filters with Amazon Bedrock Knowledge Bases.
Solution overview
The following code is an example metadata filter for Amazon Bedrock Knowledge Bases. Logical operators (such as AND or OR) can be nested to combine other logical operators and filter conditions. For more information, refer to the Retrieve API.

{
    "andAll": [
        {
            "equals": {
                "key": "desired_destination",
                "value": "<UNKNOWN>"  # This will be overwritten with appropriate values at runtime
            }
        },
        {
            "equals": {
                "key": "travelling_with_children",
                "value": "<UNKNOWN>"  # This will be overwritten with appropriate values at runtime
            }
        }
    ]
}

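The preceding example uses a single andAll block. As a hypothetical illustration of nesting, the following filter (expressed as a Python dictionary, with illustrative keys and values) requires the destination to match and, in addition, that the trip is either with children or with pets:

# Hypothetical nested filter: destination must match AND (children OR pets)
nested_filter = {
    "andAll": [
        {"equals": {"key": "desired_destination", "value": "Paris, France"}},
        {
            "orAll": [
                {"equals": {"key": "travelling_with_children", "value": "yes"}},
                {"equals": {"key": "travelling_with_pets", "value": "yes"}},
            ]
        },
    ]
}
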
For our use case, we use an example of a travel website where the user answers a few questions about their travel preferences (including desired destination, preferred activities, and traveling companions) and then the system retrieves relevant documents.
We exclusively focus on the retrieval portion of RAG in this post. We provide the upstream components, including document ingestion and query formatting, as static data instead of code. The downstream generation component is out of scope for this post.
Prerequisites
To follow along with this post, you should understand basic retrieval techniques such as similarity search.
Additionally, you need an Amazon Bedrock knowledge base populated with documents and metadata. For instructions, see Create an Amazon Bedrock knowledge base. We have provided example documents and metadata in the accompanying GitHub repo for you to upload.
The associated notebook contains the required library imports and environment variables. Make sure you run the notebook using an AWS Identity and Access Management (IAM) role with the correct permissions for Amazon Simple Storage Service (Amazon S3) and Amazon Bedrock (AmazonS3FullAccess and AmazonBedrockFullAccess, respectively). We recommend running the notebook locally or in Amazon SageMaker. Then you can run the following code to test your AWS and knowledge base connection:

# Test AWS connection
# Create a session using your AWS credentials
session = boto3.Session()

# Create an STS client
sts_client = session.client("sts")

# Get the caller identity
response = sts_client.get_caller_identity()

# Print the response
print(response)

knowledge_base_id = "XXXXXXXXXX"

retrieval_config = {
    "vectorSearchConfiguration": {
        "numberOfResults": 4,
        "overrideSearchType": "HYBRID"
    }
}

# Test Amazon Bedrock Knowledge Bases connection
client = boto3.client("bedrock-agent-runtime")

response = client.retrieve(
    knowledgeBaseId=knowledge_base_id,
    retrievalConfiguration=retrieval_config,
    retrievalQuery={"text": "Hello world"}
)

print(response)

Create a dynamic filter
The "value" field within the filter needs to be updated at request time. This means overwriting the retrieval_config object, as shown in the following figure. The placeholder values in the filter get overwritten with the user data at runtime.

Because the retrieval_config object is a nested hierarchy of logical conditions (a tree), you can recursively traverse it to identify and replace all the "value" fields (where "value" is the key and "<UNKNOWN>" is the placeholder) with the corresponding values from the user data. See the following code:

def setup_retrieval_config(inputs):

    # Make a copy because the filter is updated dynamically based on the user_data;
    # this allows you to start from the default each time
    local_retrieval_config = copy.deepcopy(retrieval_config)

    updated_vector_search_config = replace_values(local_retrieval_config["vectorSearchConfiguration"], inputs["user_data"])
    local_retrieval_config["vectorSearchConfiguration"] = updated_vector_search_config

    return local_retrieval_config

def replace_values(vector_search_config: Dict, user_data: Dict):
    # Replace the value fields in the filter with the correct value according to the user_data
    # Recursively traverses the filter tree to find all of the value fields

    # Filter is not a required key; if you do not want any filters, get rid of the key
    if "filter" in vector_search_config and not vector_search_config["filter"]:
        del vector_search_config["filter"]

    # Recursively traverse from the root
    elif "filter" in vector_search_config:
        vector_search_config["filter"] = replace_values(vector_search_config["filter"], user_data)

    # At a node that is not the root
    else:
        for key, value in vector_search_config.items():
            if isinstance(value, dict):

                # At a leaf, e.g. {"key": "age", "value": ""}
                if "key" in value and "value" in value:

                    # Only overwrite value["value"] entries that are not unknown
                    if value["key"] in user_data and not (value["value"] == "unknown" or value["value"] == ["unknown"]):

                        # Primitive data type
                        if type(value["value"]) in [str, int, float, bool]:
                            value["value"] = user_data[value["key"]]

                        # List data type
                        elif isinstance(value["value"], list):
                            value["value"] = [user_data[value["key"]]]
                        else:
                            raise ValueError(f"Unsupported value['value'] type {type(value['value'])}")
                else:
                    vector_search_config[key] = replace_values(value, user_data)

            # Recurse on each item in the list
            elif isinstance(value, list):
                vector_search_config[key] = [replace_values(item, user_data) for item in value]
            else:
                raise ValueError(f"Unsupported value type {type(value)}")

    return vector_search_config
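
As a quick check of the replacement behavior, the following sketch attaches the example filter from earlier to the default configuration and builds a per-request copy for a hypothetical user profile; it assumes the notebook's imports and the objects defined above.

# Attach the example filter (with <UNKNOWN> placeholders) to the default configuration
retrieval_config["vectorSearchConfiguration"]["filter"] = {
    "andAll": [
        {"equals": {"key": "desired_destination", "value": "<UNKNOWN>"}},
        {"equals": {"key": "travelling_with_children", "value": "<UNKNOWN>"}},
    ]
}

# Hypothetical user profile; keys must match the metadata field names
sample_inputs = {
    "user_data": {"desired_destination": "Paris, France", "travelling_with_children": "no"}
}

# The returned copy has the placeholders replaced; the module-level retrieval_config is unchanged
print(json.dumps(setup_retrieval_config(sample_inputs), indent=2))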

Option 1: Create a retriever each time
To define the retrieval_config parameter dynamically, you can instantiate AmazonKnowledgeBasesRetriever each time. This approach integrates cleanly into a larger LangChain-centric code base. See the following code:

def create_retrieval_chain() -> Runnable:
    """
    Creates a retrieval chain for the retriever.

    Returns:
        Runnable: The retrieval chain.
    """

    query = create_query_for_retrieval()

    def create_retriever(inputs):
        # This wrapper is necessary because if you return a callable object, LangChain will
        # automatically call it immediately, which is not the desired behavior.
        # Instead, we want to call the retriever on the next step of the chain.
        retriever_wrapper = {
            "retriever": AmazonKnowledgeBasesRetriever(
                knowledge_base_id=knowledge_base_id,
                retrieval_config=inputs["retrieval_config"]
            )
        }
        return retriever_wrapper

    # Retrieval chain has three steps: (1) create the filter based off of the user data,
    # (2) create the retriever, and (3) invoke the retriever
    retrieval_chain = (
        {
            "user_data": itemgetter("user_data"),
            "retrieval_config": lambda inputs: setup_retrieval_config(inputs)
        } |
        {
            "query": query,
            "retriever": lambda inputs: create_retriever(inputs)
        } |
        RunnableLambda(lambda inputs: inputs["retriever"]["retriever"].invoke(inputs["query"]))
    )
    return retrieval_chain
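
Instantiate this chain so it can be compared against option 2 later; the name retrieval_chain_1 is assumed here to match the Results section.

retrieval_chain_1 = create_retrieval_chain()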

Option 2: Access the underlying Boto3 API
The Boto3 API can retrieve directly with a dynamic retrieval_config. You can take advantage of this by accessing the object that AmazonKnowledgeBasesRetriever wraps. This is slightly faster but less robust, because it relies on LangChain implementation details, which may change without notice. It also requires additional code to adapt the output to the proper format for a LangChain retriever. See the following code:

retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=knowledge_base_id,
    retrieval_config=retrieval_config
)

def create_retrieval_chain() -> Runnable:
    """
    Creates a retrieval chain for the retriever.

    Returns:
        Runnable: The retrieval chain.
    """

    query = create_query_for_retrieval()

    def retrieve_and_format(inputs):
        results = retriever.client.retrieve(
            retrievalQuery={"text": inputs["query"]},
            knowledgeBaseId=knowledge_base_id,
            retrievalConfiguration=inputs["retrieval_config"]
        )

        documents = []
        for result in results["retrievalResults"]:
            metadata = {
                "location": result["location"],
                "source_metadata": result["metadata"],
                "score": result["score"],
            }

            document = Document(
                page_content=result["content"]["text"],
                metadata=metadata
            )
            documents.append(document)

        return documents

    retrieval_chain = (
        {
            "query": query,
            "retrieval_config": lambda inputs: setup_retrieval_config(inputs)
        } |
        RunnableLambda(lambda inputs: retrieve_and_format(inputs))
        # RunnableLambda(lambda inputs: retriever.client.retrieve(retrievalQuery={"text": inputs["query"]}, knowledgeBaseId=knowledge_base_id, retrievalConfiguration=inputs["retrieval_config"]))
    )
    return retrieval_chain

retrieval_chain_2 = create_retrieval_chain()

Results
Begin by reading in the user data. This example data contains user answers to an online questionnaire about travel preferences. The user_data fields must match the metadata fields.

with open("data/user_data.json", "r") as file:
    user_data = json.load(file)

print(json.dumps(user_data[:2], indent=2))

Here is a preview of the user_data.json file from which certain fields will be extracted as values for filters.

{
    "trip_id": 1,
    "desired_destination": "Bali, Indonesia",
    "stay_duration": 7,
    "age": 35,
    "gender": "male",
    "companion": "solo",
    "travelling_with_children": "no",
    "travelling_with_pets": "no"
},
{
    "trip_id": 2,
    "desired_destination": "Paris, France",
    "stay_duration": 5,
    "age": 28,
    "gender": "female",
    "companion": "solo",
    "travelling_with_children": "no",
    "travelling_with_pets": "yes"
},

Test the code with filters turned on and off. Only use a few filtering criteria because restrictive filters might return zero documents.

filters_to_test: List = [
    {
        "andAll": [
            {
                "equals": {
                    "key": "desired_destination",
                    "value": "<UNKNOWN>"  # This will be overwritten with appropriate values at runtime
                }
            },
            {
                "equals": {
                    "key": "travelling_with_children",
                    "value": "<UNKNOWN>"  # This will be overwritten with appropriate values at runtime
                }
            }
        ]
    },
    None
]

Finally, run both retrieval chains through both sets of filters for each user:

retrieval_chains = [retrieval_chain_1, retrieval_chain_2]

results = []

for retrieval_chain_id, retrieval_chain in enumerate(retrieval_chains):
    logger.info(retrieval_chain_id)
    # Loop through each filter option
    for filter in filters_to_test:
        retrieval_config["vectorSearchConfiguration"]["filter"] = filter
        # Loop through each user data entry
        for user_entry in user_data:
            inputs = {
                "user_data": user_entry,
                "retrieval_config": retrieval_config
            }

            # Run the retrieval chain with the current user entry
            try:
                result = retrieval_chain.invoke(inputs)
                # print(f"Result for user entry {user_entry['trip_id']}: {result}")
                results.append({
                    "retrieval_chain_id": retrieval_chain_id,
                    "user": user_entry,
                    "documents": result
                })
            except Exception as e:
                print(f"Error during retrieval for user entry {user_entry['trip_id']}: {e}")

When analyzing the results, you can see that the first half of the results (produced by the option 1 chain) is identical to the second half (produced by the option 2 chain), confirming that both retrieval approaches return the same documents. In addition, when metadata filters aren't used, the documents retrieved are occasionally for the wrong location. For example, trip ID 2 is to Paris, but the retriever pulls documents about London.
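One way to produce a comparison table like the following excerpt is to flatten the results list into a pandas DataFrame. This is a minimal sketch, assuming pandas is available and the document metadata layout shown in option 2; the filter on/off column would additionally need to be recorded in the loop above.

import pandas as pd

rows = []
for entry in results:
    for doc in entry["documents"]:
        rows.append({
            "retrieval_approach": f"Option_{entry['retrieval_chain_id']}",
            "trip_id": entry["user"]["trip_id"],
            "destination": entry["user"]["desired_destination"],
            "page_content": doc.page_content[:120],                      # preview only
            "source_metadata": doc.metadata.get("source_metadata", {}),
        })

comparison_df = pd.DataFrame(rows)
print(comparison_df.head())
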
The following is an excerpt of the output table for reference. Each row lists the retrieval approach, whether the filter was applied, the trip ID, and the destination, followed by the page content of the retrieved document and its metadata.

Retrieval Approach: Option_0 | Filter: TRUE | Trip ID: 2 | Destination: Paris, France
As a 70-year-old retiree, I recently had the pleasure of visiting Paris for the first time. It was a trip I had been looking forward to for years, and I was not disappointed. Here are some of my favorite attractions and activities that I would recommend to other seniors visiting the city.  First on my list is the Eiffel Tower…
{‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_6.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.48863396, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_6.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘museums’, ‘palaces’, ‘strolling’, ‘boat tours’, ‘neighborhood tours’], ‘companion’: ‘unknown’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘unknown’, ‘age’: [’71’, ’80’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3AiNKlapMBdxcT3sYpRK-d’, ‘desired_destination’: ‘Paris, France’}}

Retrieval Approach: Option_0 | Filter: TRUE | Trip ID: 2 | Destination: Paris, France
As a 35-year-old traveling with my two dogs, I found Paris to be a pet-friendly city with plenty of attractions and activities for pet owners. Here are some of my top recommendations for traveling with pets in Paris:  The Jardin des Tuileries is a beautiful park located between the Louvre Museum and the Place de la Concorde…
{‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_9.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.474106, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_9.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘parks’, ‘museums’, ‘river cruises’, ‘neighborhood exploration’], ‘companion’: ‘pets’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘yes’, ‘age’: [’30’, ’31’, ’32’, ’33’, ’34’, ’35’, ’36’, ’37’, ’38’, ’39’, ’40’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3Aj52lapMBuHB13c7-hl-4’, ‘desired_destination’: ‘Paris, France’}}

Retrieval Approach: Option_0 | Filter: TRUE | Trip ID: 2 | Destination: Paris, France
If you are looking for something a little more active, I would suggest visiting the Bois de Boulogne. This large park is located on the western edge of Paris and is a great place to go for a walk or a bike ride with your pet. The park has several lakes and ponds, as well as several gardens and playgrounds…
{‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_5.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.45283788, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_5.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘strolling’, ‘picnic’, ‘walk or bike ride’, ‘cafes and restaurants’, ‘art galleries and shops’], ‘companion’: ‘pet’, ‘x-amz-bedrock-kb-data-source-id’: ‘{YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘yes’, ‘age’: [’40’, ’41’, ’42’, ’43’, ’44’, ’45’, ’46’, ’47’, ’48’, ’49’, ’50’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3AmtKlapMBdxcT3sYpSK_N’, ‘desired_destination’: ‘Paris, France’}}

Retrieval Approach: Option_0 | Filter: FALSE | Trip ID: 2 | Destination: Paris, France
{   “metadataAttributes”: {     “age”: [       “30”     ],     “desired_destination”: “London, United Kingdom”,     “stay_duration”: “unknown”,     “preferred_month”: [       “unknown”     ],     “activities_interest”: [       “strolling”,       “sightseeing”,       “boating”,       “eating out”     ],     “companion”: “pets”,     “travelling_with_children”: “no”,     “travelling_with_pets”: “yes”   } }
{‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/London_2.txt.metadata (1).json’}, ‘type’: ‘S3’}, ‘score’: 0.49567315, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/London_2.txt.metadata (1).json’, ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3A5tKlapMBdxcT3sYpYq_r’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}}}

Retrieval Approach: Option_0 | Filter: FALSE | Trip ID: 2 | Destination: Paris, France
As a 35-year-old traveling with my two dogs, I found Paris to be a pet-friendly city with plenty of attractions and activities for pet owners. Here are some of my top recommendations for traveling with pets in Paris:  The Jardin des Tuileries is a beautiful park located between the Louvre Museum and the Place de la Concorde…
{‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_9.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.4741059, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_9.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘parks’, ‘museums’, ‘river cruises’, ‘neighborhood exploration’], ‘companion’: ‘pets’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘yes’, ‘age’: [’30’, ’31’, ’32’, ’33’, ’34’, ’35’, ’36’, ’37’, ’38’, ’39’, ’40’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3Aj52lapMBuHB13c7-hl-4’, ‘desired_destination’: ‘Paris, France’}}

Retrieval Approach: Option_0 | Filter: FALSE | Trip ID: 2 | Destination: Paris, France
If you are looking for something a little more active, I would suggest visiting the Bois de Boulogne. This large park is located on the western edge of Paris and is a great place to go for a walk or a bike ride with your pet. The park has several lakes and ponds, as well as several gardens and playgrounds…
{‘location’: {‘s3Location’: {‘uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_5.txt‘}, ‘type’: ‘S3’}, ‘score’: 0.45283788, ‘source_metadata’: {‘x-amz-bedrock-kb-source-uri’: ‘s3://{YOUR_S3_BUCKET}/travel_reviews_titan/Paris_5.txt‘, ‘travelling_with_children’: ‘no’, ‘activities_interest’: [‘strolling’, ‘picnic’, ‘walk or bike ride’, ‘cafes and restaurants’, ‘art galleries and shops’], ‘companion’: ‘pet’, ‘x-amz-bedrock-kb-data-source-id’: {YOUR_KNOWLEDGE_BASE_ID}, ‘stay_duration’: ‘unknown’, ‘preferred_month’: [‘unknown’], ‘travelling_with_pets’: ‘yes’, ‘age’: [’40’, ’41’, ’42’, ’43’, ’44’, ’45’, ’46’, ’47’, ’48’, ’49’, ’50’], ‘x-amz-bedrock-kb-chunk-id’: ‘1%3A0%3AmtKlapMBdxcT3sYpSK_N’, ‘desired_destination’: ‘Paris, France’}}

Clean up
To avoid incurring additional charges, be sure to delete your knowledge base, the vector store (for example, the Amazon OpenSearch Serverless collection), and the underlying S3 bucket.
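The following is a minimal cleanup sketch with placeholder IDs, assuming an Amazon OpenSearch Serverless collection as the vector store; adjust it to the resources you actually created.

import boto3

# Delete the knowledge base (you may first need to delete its data sources with delete_data_source)
boto3.client("bedrock-agent").delete_knowledge_base(knowledgeBaseId="<YOUR_KNOWLEDGE_BASE_ID>")

# Delete the vector store collection, if you used Amazon OpenSearch Serverless
boto3.client("opensearchserverless").delete_collection(id="<YOUR_COLLECTION_ID>")

# Also empty and delete the S3 bucket that holds the source documents and metadata
bucket = boto3.resource("s3").Bucket("<YOUR_S3_BUCKET>")
bucket.objects.all().delete()
bucket.delete()
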
Conclusion
Enabling dynamic filtering through Amazon Bedrock Knowledge Bases metadata filtering enhances document retrieval in RAG systems by tailoring outputs to user-specific needs, significantly improving the relevance and accuracy of LLM-generated responses. In the travel website example, the filters made sure that the retrieved documents closely matched each user's preferences.
This approach can be applied to other use cases, such as customer support, personalized recommendations, and content curation, where context-sensitive information retrieval is essential. Properly configured filters are crucial for maintaining accuracy across different applications, making this feature a powerful tool for refining LLM outputs in diverse scenarios.
Be sure to take advantage of this powerful and flexible solution in your application. For more information on metadata in Amazon Bedrock Knowledge Bases, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy. Also, Amazon Bedrock Knowledge Bases now provides autogenerated query filters.
Security Best Practices
For AWS IAM Policies:

Apply least-privilege permissions by being explicit with IAM actions and listing only required permissions rather than using wildcards (a minimal policy sketch follows this section)
Use temporary credentials with IAM roles for workloads
Avoid using wildcards (*) in the Action element as this grants access to all actions for specific AWS services
Remove wildcards from the Resource element and explicitly list the specific resources that IAM entities should access
Review AWS managed policies carefully before using them and consider using customer managed policies if AWS managed policies grant more permissions than needed

For more detailed security best practices for AWS IAM, see Security best practices in IAM.
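As a concrete illustration of the least-privilege guidance above, the following sketch defines a hypothetical customer managed policy scoped to a single knowledge base and a single bucket, rather than the broad AmazonBedrockFullAccess and AmazonS3FullAccess managed policies used for convenience earlier; the ARNs are placeholders.

import json

least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow retrieval from one specific knowledge base only
            "Effect": "Allow",
            "Action": ["bedrock:Retrieve"],
            "Resource": "arn:aws:bedrock:<REGION>:<ACCOUNT_ID>:knowledge-base/<KNOWLEDGE_BASE_ID>",
        },
        {
            # Allow read-only access to the single bucket that holds documents and metadata
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::<YOUR_S3_BUCKET>",
                "arn:aws:s3:::<YOUR_S3_BUCKET>/*",
            ],
        },
    ],
}

print(json.dumps(least_privilege_policy, indent=2))
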
For Amazon S3:

Block public access unless explicitly required: make sure S3 buckets are not publicly accessible by using the S3 Block Public Access feature and implementing appropriate bucket policies
Enable encryption for data at rest (all S3 buckets have default encryption) and enforce encryption for data in transit using HTTPS/TLS
Grant only the minimum permissions required using IAM policies and bucket policies, and disable ACLs (access control lists), which are no longer recommended for most modern use cases
Enable server access logging, AWS CloudTrail, and use AWS security services like GuardDuty, Macie, and IAM Access Analyzer to monitor and detect potential security issues

For more detailed security best practices for Amazon S3, see Security best practices for Amazon S3.
For Amazon Bedrock:

Use IAM roles and policies to control access to Bedrock resources and APIs.
Implement VPC endpoints to access Bedrock securely from within your VPC.
Encrypt data at rest and in transit when working with Bedrock to protect sensitive information.
Monitor Bedrock usage and access patterns using AWS CloudTrail for auditing purposes.

For more information on security in Amazon Bedrock, see Security in Amazon Bedrock.
For Amazon SageMaker:

Use IAM roles to control access to SageMaker resources and limit permissions based on job functions.
Encrypt SageMaker notebooks, training jobs, and endpoints using AWS KMS keys for data protection.
Implement VPC configurations for SageMaker resources to restrict network access and enhance security.
Use SageMaker private endpoints to access APIs without traversing the public internet.

About the Authors
Haley Tien is a Deep Learning Architect at AWS Generative AI Innovation Center. She has a Master’s degree in Data Science and assists customers in building generative AI solutions on AWS to optimize their workloads and achieve desired outcomes.
Adam Weinberger is an Applied Scientist II at AWS Generative AI Innovation Center. He has 10 years of experience in data science and machine learning. He holds a Master of Information and Data Science degree from the University of California, Berkeley.
Dan Ford is an Applied Scientist II at AWS Generative AI Innovation Center, where he helps public sector customers build state-of-the-art GenAI solutions.