Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Today, we’re excited to announce the launch of Amazon SageMaker Large Model Inference (LMI) container v15, powered by vLLM 0.8.4 with support for the vLLM V1 engine. This version now supports the latest open-source models, such as Meta’s Llama 4 models Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI models, DeepSeek-R1, and many more. Amazon SageMaker AI continues to evolve its generative AI inference capabilities to meet the growing demands in performance and model support for foundation models (FMs).
This release introduces significant performance improvements and expanded multimodal model compatibility (that is, the ability to understand and analyze text-to-text, image-to-text, and text-to-image data), and provides built-in integration with vLLM to help you seamlessly deploy and serve large language models (LLMs) with high performance at scale.
What’s new?
LMI v15 brings several enhancements that improve throughput, latency, and usability:

An async mode that directly integrates with vLLM’s AsyncLLMEngine for improved request handling. This mode creates a more efficient background loop that continuously processes incoming requests, enabling it to handle multiple concurrent requests and stream outputs with higher throughput than the previous Rolling-Batch implementation in v14.
Support for the vLLM V1 engine, which delivers up to 111% higher throughput compared to the previous V0 engine for smaller models at high concurrency. This performance improvement comes from reduced CPU overhead, optimized execution paths, and more efficient resource utilization in the V1 architecture. LMI v15 supports both V1 and V0 engines, with V1 being the default. If you need the V0 engine, you can select it by setting VLLM_USE_V1=0 (a configuration sketch follows this list). The vLLM V1 engine also brings a core re-architecture of the serving engine, with simplified scheduling, zero-overhead prefix caching, clean tensor-parallel inference, efficient input preparation, and advanced optimizations with torch.compile and Flash Attention 3. For more information, see the vLLM Blog.
Expanded API schema support with three flexible options to allow seamless integration with applications built on popular API patterns:

Message format compatible with the OpenAI Chat Completions API.
OpenAI Completions format.
Text Generation Inference (TGI) schema to support backward compatibility with older models.

Multimodal support, with enhanced capabilities for vision-language models, including optimizations such as multimodal prefix caching.
Built-in support for function calling and tool calling, enabling sophisticated agent-based workflows.
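
As a quick illustration of the engine selection described above, the following minimal sketch shows how the choice could be expressed in the environment variables passed to the LMI container. The model ID is a hypothetical example, VLLM_USE_V1 is the vLLM setting mentioned earlier, and the remaining keys mirror the configuration covered later in this post.

# Minimal sketch: environment variables for the LMI v15 container.
# VLLM_USE_V1 selects the engine: "1" (default) uses the V1 engine, "0" falls back to V0.
engine_config = {
    "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "VLLM_USE_V1": "0",  # only set this if you need the previous V0 engine
}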

Enhanced model support
LMI v15 supports an expanding roster of state-of-the-art models, including the latest releases from leading model providers. The container offers ready-to-deploy compatibility for models including, but not limited to, the following:

Llama 4 – Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E-Instruct
Gemma 3 – Google’s lightweight and efficient models, known for their strong performance despite smaller size
Qwen 2.5 – Alibaba’s advanced models, including QwQ and Qwen2-VL with multimodal capabilities
Mistral AI models – High-performance models from Mistral AI that offer efficient scaling and specialized capabilities
DeepSeek-R1/V3 – State-of-the-art reasoning models

Each model family can be deployed using the LMI v15 container by specifying the appropriate model ID, for example, meta-llama/Llama-4-Scout-17B-16E, and configuration parameters as environment variables, without requiring custom code or optimization work.
Benchmarks
Our benchmarks demonstrate the performance advantages of LMI v15’s V1 engine compared to previous versions:

Model | Batch size | Instance type | LMI v14 throughput [tokens/s] (V0 engine) | LMI v15 throughput [tokens/s] (V1 engine) | Improvement
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 128 | p4d.24xlarge | 1768 | 2198 | 24%
meta-llama/Llama-3.1-8B-Instruct | 64 | ml.g6e.2xlarge | 1548 | 2128 | 37%
mistralai/Mistral-7B-Instruct-v0.3 | 64 | ml.g6e.2xlarge | 942 | 1988 | 111%

DeepSeek-R1 Llama 70B for various levels of concurrency

Llama 3.1 8B Instruct for various levels of concurrency

Mistral 7B for various levels of concurrency

The async engine in LMI v15 shows its strength in high-concurrency scenarios, where multiple simultaneous requests benefit from the optimized request handling. These benchmarks highlight that the V1 engine in async mode delivers between 24% and 111% higher throughput than LMI v14 with rolling batch for the models tested at high concurrency (batch sizes of 64 and 128). Keep in mind the following considerations for optimal performance:

Higher batch sizes increase concurrency but come with a natural tradeoff in terms of latency
Batch sizes of 4 and 8 provide the best latency for most use cases
Batch sizes up to 64 and 128 achieve maximum throughput with acceptable latency trade-offs

API formats
LMI v15 supports three API schemas: OpenAI Chat Completions, OpenAI Completions, and TGI.

Chat Completions – Message format is compatible with OpenAI Chat Completions API. Use this schema for tool calling, reasoning, and multimodal use cases. Here is a sample of the invocation with the Messages API:

body = {
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
    "temperature": 0.9,
    "max_tokens": 256,
    "stream": True,
}

OpenAI Completions format – The Completions API is a legacy endpoint that is no longer receiving updates. Here is a sample request:

body = {
    "prompt": "Name popular places to visit in London?",
    "temperature": 0.9,
    "max_tokens": 256,
    "stream": True,
}

TGI – Supports backward compatibility with older models:

body = {
    "inputs": "Name popular places to visit in London?",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.9,
    },
    "stream": True,
}

Getting started with LMI v15
Getting started with LMI v15 is seamless, and you can deploy a model in only a few lines of code. The container is available through Amazon Elastic Container Registry (Amazon ECR), and deployments can be managed through SageMaker AI endpoints. To deploy models, you need to specify the Hugging Face model ID, instance type, and configuration options as environment variables.
For optimal performance, we recommend the following instances:

Llama 4 Scout: ml.p5.48xlarge
DeepSeek R1/V3: ml.p5e.48xlarge
Qwen 2.5 VL-32B: ml.g5.12xlarge
Qwen QwQ 32B: ml.g5.12xlarge
Mistral Large: ml.g6e.48xlarge
Gemma3-27B: ml.g5.12xlarge
Llama 3.3-70B: ml.p4d.24xlarge

To deploy with LMI v15, follow these steps:

Clone the notebook to your Amazon SageMaker Studio notebook or to Visual Studio Code (VS Code). You can then run the notebook to do the initial setup and deploy the model from the Hugging Face repository to the SageMaker AI endpoint. We walk through the key blocks here.
LMI v15 maintains the same configuration pattern as previous versions, using environment variables in the form OPTION_<CONFIG_NAME>. This consistent approach makes it straightforward for users familiar with earlier LMI versions to migrate to v15.

vllm_config = {
    "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E",
    "HF_TOKEN": "<enter your Hugging Face token>",
    "OPTION_MAX_MODEL_LEN": "250000",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
    "OPTION_MODEL_LOADING_TIMEOUT": "1500",
    "SERVING_FAIL_FAST": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service"
}

HF_MODEL_ID sets the model ID from Hugging Face. You can also load the model from Amazon Simple Storage Service (Amazon S3); a hedged sketch follows this list.
HF_TOKEN sets the token to download the model. This is required for gated models like Llama 4.
OPTION_MAX_MODEL_LEN sets the maximum model context length.
OPTION_MAX_ROLLING_BATCH_SIZE sets the batch size for the model.
OPTION_MODEL_LOADING_TIMEOUT sets the timeout value for SageMaker to load the model and run health checks.
SERVING_FAIL_FAST=true. We recommend setting this flag because it allows SageMaker to gracefully restart the container when an unrecoverable engine error occurs.
OPTION_ROLLING_BATCH=disable disables the rolling batch implementation of LMI, which was the default in LMI v14. We recommend using async mode instead, because this latest implementation provides better performance.
OPTION_ASYNC_MODE=true enables async mode.
OPTION_ENTRYPOINT provides the entry point for vLLM’s async integration.
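
As referenced in the HF_MODEL_ID bullet, the following is a hedged sketch of pointing the same configuration at model artifacts staged in Amazon S3 rather than the Hugging Face Hub; the bucket path is a hypothetical placeholder, and you should confirm in the LMI documentation that your container version accepts an S3 URI for this setting.

# Hypothetical sketch: load model artifacts from Amazon S3 instead of the Hugging Face Hub.
# Replace the S3 URI with the location of your own uncompressed model files.
vllm_config_s3 = {
    "HF_MODEL_ID": "s3://amzn-s3-demo-bucket/llama-4-scout-17b-16e/",  # placeholder S3 path
    "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
    "SERVING_FAIL_FAST": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
}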

Set the latest container version (in this example, we used 0.33.0-lmi15.0.0-cu128) and AWS Region (us-east-1), and create a model artifact with all the configurations. To review the latest available container version, see Available Deep Learning Containers Images.
Deploy the model to the endpoint using model.deploy().

import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()  # SageMaker execution role used for the deployment

CONTAINER_VERSION = '0.33.0-lmi15.0.0-cu128'
REGION = 'us-east-1'
# Construct container URI
container_uri = f'763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}'

# Select instance type
instance_type = "ml.p5.48xlarge"

model = Model(image_uri=container_uri,
              role=role,
              env=vllm_config)
endpoint_name = sagemaker.utils.name_from_base("Llama-4")

print(endpoint_name)
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=1800
)

Invoke the model. SageMaker inference provides two APIs to invoke the model: InvokeEndpoint and InvokeEndpointWithResponseStream. You can choose either option based on your needs.

import json
import boto3

# Create SageMaker Runtime client
smr_client = boto3.client('sagemaker-runtime')
# Add your endpoint name here
endpoint_name = ''

# Invoke with messages format
body = {
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
    "temperature": 0.9,
    "max_tokens": 256,
    "stream": True,
}

# Invoke with endpoint streaming
resp = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)
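
The returned Body is an event stream of binary payload parts. The following sketch shows one way to read it; the exact chunk format depends on the container’s streaming output, so treat the decoding below as an assumption to adapt to your deployment.

# Read the event stream returned by invoke_endpoint_with_response_stream.
# Each event carries a PayloadPart whose Bytes hold a chunk of the streamed output.
for event in resp["Body"]:
    payload_part = event.get("PayloadPart")
    if payload_part:
        chunk = payload_part["Bytes"].decode("utf-8")
        print(chunk, end="")  # inspect or accumulate the streamed tokens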

To run multi-modal inference with Llama-4 Scout, see the notebook for the full code sample to run inference requests with images.
Conclusion
Amazon SageMaker LMI container v15 represents a significant step forward in large model inference capabilities. With the new vLLM V1 engine, async operating mode, expanded model support, and optimized performance, you can deploy cutting-edge LLMs with greater performance and flexibility. The container’s configurable options give you the flexibility to fine-tune deployments for your specific needs, whether optimizing for latency, throughput, or cost.
We encourage you to explore this release for deploying your generative AI models.
Check out the provided example notebooks to start deploying models with LMI v15.

About the authors
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focuses on building solutions for large model inference. Prior to AWS, he worked in the Amazon Grocery org building new payment features for customers worldwide. Outside of work, he enjoys skiing, the outdoors, and watching sports.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

Accuracy evaluation framework for Amazon Q Business – Part 2

In the first post of this series, we introduced a comprehensive evaluation framework for Amazon Q Business, a fully managed Retrieval Augmented Generation (RAG) solution that uses your company’s proprietary data without the complexity of managing large language models (LLMs). The first post focused on selecting appropriate use cases, preparing data, and implementing metrics to support a human-in-the-loop evaluation process.
In this post, we dive into the solution architecture necessary to implement this evaluation framework for your Amazon Q Business application. We explore two distinct evaluation solutions:

Comprehensive evaluation workflow – This ready-to-deploy solution uses AWS CloudFormation stacks to set up an Amazon Q Business application, complete with user access, a custom UI for review and evaluation, and the supporting evaluation infrastructure
Lightweight AWS Lambda based evaluation – Designed for users with an existing Amazon Q Business application, this streamlined solution employs an AWS Lambda function to efficiently assess the application’s accuracy

By the end of this post, you will have a clear understanding, supported by a detailed walkthrough, of how to implement an evaluation framework that aligns with your specific needs, so your Amazon Q Business application delivers accurate and reliable results.
Challenges in evaluating Amazon Q Business
Evaluating the performance of Amazon Q Business, which uses a RAG model, presents several challenges due to its integration of retrieval and generation components. It’s crucial to identify which aspects of the solution need evaluation. For Amazon Q Business, both the retrieval accuracy and the quality of the answer output are important factors to assess. In this section, we discuss key metrics that need to be included for a RAG generative AI solution.
Context recall
Context recall measures the extent to which all relevant content is retrieved. High recall provides comprehensive information gathering but might introduce extraneous data.
For example, a user might ask the question “What can you tell me about the geography of the United States?” They could get the following responses:

Expected: The United States is the third-largest country in the world by land area, covering approximately 9.8 million square kilometers. It has a diverse range of geographical features.
High context recall: The United States spans approximately 9.8 million square kilometers, making it the third-largest nation globally by land area. The country’s geography is incredibly diverse, featuring the Rocky Mountains stretching from New Mexico to Alaska, the Appalachian Mountains along the eastern states, the expansive Great Plains in the central region, and arid deserts like the Mojave in the southwest.
Low context recall: The United States features significant geographical landmarks. Additionally, the country is home to unique ecosystems like the Everglades in Florida, a vast network of wetlands.

The following diagram illustrates the context recall workflow.

Context precision
Context precision assesses the relevance and conciseness of retrieved information. High precision indicates that the retrieved information closely matches the query intent, reducing irrelevant data.
For example, “Why is Silicon Valley great for tech startups?” might give the following answers:

Ground truth answer: Silicon Valley is famous for fostering innovation and entrepreneurship in the technology sector.
High precision context: Many groundbreaking startups originate from Silicon Valley, benefiting from a culture that encourages innovation and risk-taking.
Low precision context: Silicon Valley experiences a Mediterranean climate, with mild, wet winters and warm, dry summers, contributing to its appeal as a place to live and work.

The following diagram illustrates the context precision workflow.

Answer relevancy
Answer relevancy evaluates whether responses fully address the query without unnecessary details. Relevant answers enhance user satisfaction and trust in the system.
For example, a user might ask the question “What are the key features of Amazon Q Business Service, and how can it benefit enterprise customers?” They could get the following answers:

High relevance answer: Amazon Q Business Service is a RAG generative AI solution designed for enterprise use. Key features include a fully managed generative AI solution, integration with enterprise data sources, robust security protocols, and customizable virtual assistants. It benefits enterprise customers by enabling efficient information retrieval, automating customer support tasks, enhancing employee productivity through quick access to data, and providing insights through analytics on user interactions.
Low relevance answer: Amazon Q Business Service is part of Amazon’s suite of cloud services. Amazon also offers online shopping and streaming services.

The following diagram illustrates the answer relevancy workflow.

Truthfulness
Truthfulness verifies factual accuracy by comparing responses to verified sources. Truthfulness is crucial to maintain the system’s credibility and reliability.
For example, a user might ask “What is the capital of Canada?” They could get the following responses:

Context: Canada’s capital city is Ottawa, located in the province of Ontario. Ottawa is known for its historic Parliament Hill, the center of government, and the scenic Rideau Canal, a UNESCO World Heritage site
High truthfulness answer: The capital of Canada is Ottawa
Low truthfulness answer: The capital of Canada is Toronto

The following diagram illustrates the truthfulness workflow.

Evaluation methods
Deciding on who should conduct the evaluation can significantly impact results. Options include:

Human-in-the-Loop (HITL) – Human evaluators manually assess the accuracy and relevance of responses, offering nuanced insights that automated systems might miss. However, it is a slow process and difficult to scale.
LLM-aided evaluation – Automated methods, such as the Ragas framework, use language models to streamline the evaluation process. However, these might not fully capture the complexities of domain-specific knowledge.

Each of these preparatory and evaluative steps contributes to a structured approach to evaluating the accuracy and effectiveness of Amazon Q Business in supporting enterprise needs.
Solution overview
In this post, we explore two different solutions to provide you the details of an evaluation framework, so you can use it and adapt it for your own use case.
Solution 1: End-to-end evaluation solution
For a quick start evaluation framework, this solution uses a hybrid approach with Ragas (automated scoring) and HITL evaluation for robust accuracy and reliability. The architecture includes the following components:

User access and UI – Authenticated users interact with a frontend UI to upload datasets, review RAGAS output, and provide human feedback
Evaluation solution infrastructure – Core components include:

Amazon DynamoDB to store data
Amazon Simple Queue Service (Amazon SQS) and Lambda to manage processing and scoring with Ragas

Ragas scoring – Automated metrics provide an initial layer of evaluation
HITL review – Human evaluators refine Ragas scores through the UI, providing nuanced accuracy and reliability

By integrating a metric-based approach with human validation, this architecture makes sure Amazon Q Business delivers accurate, relevant, and trustworthy responses for enterprise users. This solution further enhances the evaluation process by incorporating HITL reviews, enabling human feedback to refine automated scores for higher precision.
A quick video demo of this solution is shown below:

Solution architecture
The solution architecture is designed with the following core functionalities to support an evaluation framework for Amazon Q Business:

User access and UI – Users authenticate through Amazon Cognito, and upon successful login, interact with a Streamlit-based custom UI. This frontend allows users to upload CSV datasets to Amazon Simple Storage Service (Amazon S3), review Ragas evaluation outputs, and provide human feedback for refinement. The application exchanges the Amazon Cognito token for an AWS IAM Identity Center token, granting scoped access to Amazon Q Business.
UI infrastructure – The UI is hosted behind an Application Load Balancer, supported by Amazon Elastic Compute Cloud (Amazon EC2) instances running in an Auto Scaling group for high availability and scalability.
Upload dataset and trigger evaluation – Users upload a CSV file containing queries and ground truth answers to Amazon S3, which triggers an evaluation process. A Lambda function reads the CSV, stores its content in a DynamoDB table, and initiates further processing through a DynamoDB stream.
Consuming DynamoDB stream – A separate Lambda function processes new entries from the DynamoDB stream, and publishes messages to an SQS queue, which serves as a trigger for the evaluation Lambda function.
Ragas scoring – The evaluation Lambda function consumes SQS messages, sending queries (prompts) to Amazon Q Business for generating answers. It then evaluates the prompt, ground truth, and generated answer using the Ragas evaluation framework. Ragas computes automated evaluation metrics such as context recall, context precision, answer relevancy, and truthfulness. The results are stored in DynamoDB and visualized in the UI (a minimal scoring sketch follows this overview).
HITL review – Authenticated users can review and refine Ragas scores directly through the UI, providing nuanced and accurate evaluations by incorporating human insights into the process.

This architecture uses AWS services to deliver a scalable, secure, and efficient evaluation solution for Amazon Q Business, combining automated and human-driven evaluations.
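
The following minimal sketch illustrates the Ragas scoring step for a single record (prompt, Amazon Q Business answer, retrieved context, and ground truth). Metric names and the evaluate signature follow the Ragas 0.1.x API and can differ between versions, and in the deployed solution the judge and embedding models would be wired to the Amazon Bedrock models listed in the prerequisites.

# Minimal Ragas scoring sketch; metric names follow the Ragas 0.1.x API.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One evaluation record built from the uploaded CSV and the Amazon Q Business response.
records = {
    "question": ["What are the index types of Amazon Q Business and the features of each?"],
    "answer": ["<answer returned by Amazon Q Business>"],
    "contexts": [["<source passage returned with the answer>"]],
    "ground_truth": ["<expert-written reference answer from the CSV dataset>"],
}

# faithfulness corresponds to the truthfulness metric discussed earlier.
results = evaluate(
    dataset=Dataset.from_dict(records),
    metrics=[context_recall, context_precision, answer_relevancy, faithfulness],
    # In the deployed solution, pass Amazon Bedrock-backed llm and embeddings wrappers here.
)
print(results)  # per-metric scores that the Lambda function stores in DynamoDB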

Prerequisites
For this walkthrough, you should have the following prerequisites:

A Linux/Mac computer. If you are using a Windows computer, make sure you can run a Linux command line.
An AWS account. If you don’t have an AWS account, follow the instructions to create one, unless you have been provided event engine details.
A user role with administrator access (service access associated with this role can be constrained further when the workflow goes to production).
The AWS Command Line Interface (AWS CLI) installed and configured with your user credentials.
Model access enabled in Amazon Bedrock for Anthropic’s Claude 3 Sonnet model and the Amazon Titan Embedding G1 – Text model. The Ragas scoring needs access to these two LLMs hosted on Amazon Bedrock.

Additionally, make sure that all the resources you deploy are in the same AWS Region.
Deploy the CloudFormation stack
Complete the following steps to deploy the CloudFormation stack:

Clone the repository or download the files to your local computer.
Unzip the downloaded file (if you used this option).
Using your local computer command line, use the cd command to change directory into ./sample-code-for-evaluating-amazon-q-business-applications-using-ragas-main/end-to-end-solution.
Make sure the ./deploy.sh script can run by executing the command chmod 755 ./deploy.sh.
Execute the CloudFormation deployment script provided as follows:

./deploy.sh -s [CFN_STACK_NAME] -r [AWS_REGION]

You can follow the deployment progress on the AWS CloudFormation console. It takes approximately 15 minutes to complete the deployment, after which you will see a similar page to the following screenshot.

Add users to Amazon Q Business
You need to provision users for the pre-created Amazon Q Business application. Refer to Setting up for Amazon Q Business for instructions to add users.
Upload the evaluation dataset through the UI
In this section, you review and upload the following CSV file containing an evaluation dataset through the deployed custom UI.
This CSV file contains two columns: prompt and ground_truth. There are four prompts and their associated ground truth in this dataset:

What are the index types of Amazon Q Business and the features of each?
I want to use Q Apps, which subscription tier is required to use Q Apps?
What is the file size limit for Amazon Q Business via file upload?
What data encryption does Amazon Q Business support?

To upload the evaluation dataset, complete the following steps:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose the evals stack that you already launched.
On the Outputs tab, take note of the user name and password to log in to the UI application, and choose the UI URL.

The custom UI will redirect you to the Amazon Cognito login page for authentication.

The UI application authenticates the user with Amazon Cognito, and initiates the token exchange workflow to implement a secure Chatsync API call with Amazon Q Business.

Use the credentials you noted earlier to log in.

For more information about the token exchange flow between IAM Identity Center and the identity provider (IdP), refer to Building a Custom UI for Amazon Q Business.

After you log in to the custom UI used for Amazon Q evaluation, choose Upload Dataset, then upload the dataset CSV file.

After the file is uploaded, the evaluation framework will send the prompt to Amazon Q Business to generate the answer, and then send the prompt, ground truth, and answer to Ragas to evaluate. During this process, you can also review the uploaded dataset (including the four questions and associated ground truth) on the Amazon Q Business console, as shown in the following screenshot.

After about 7 minutes, the workflow will finish, and you should see the evaluation result for the first question.

Perform HITL evaluation
After the Lambda function has completed its execution, the Ragas scoring will be shown in the custom UI. Now you can review the metric scores generated using Ragas (an LLM-aided evaluation method), and you can provide human feedback as an evaluator for further calibration. This human-in-the-loop calibration can further improve the evaluation accuracy, because the HITL process is particularly valuable in fields where human judgment, expertise, or ethical considerations are crucial.
Let’s review the first question: “What are the index types of Amazon Q Business and the features of each?” You can read the question, Amazon Q Business generated answers, ground truth, and context.

Next, review the evaluation metrics scored by using Ragas. As discussed earlier, there are four metrics:

Answer relevancy – Measures relevancy of answers. Higher scores indicate better alignment with the user input, and lower scores are given if the response is incomplete or includes redundant information.
Truthfulness – Verifies factual accuracy by comparing responses to verified sources. Higher scores indicate a better consistency with verified sources.
Context precision – Assesses the relevance and conciseness of retrieved information. Higher scores indicate that the retrieved information closely matches the query intent, reducing irrelevant data.
Context recall – Measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out.

For this question, all metrics showed Amazon Q Business achieved a high-quality response. It’s worthwhile to compare your own evaluation with these scores generated by Ragas.

Next, let’s review a question that returned with a low answer relevancy score. For example: “I want to use Q Apps, which subscription tier is required to use Q Apps?”

Analyzing both question and answer, we can consider the answer relevant and aligned with the user question, but the answer relevancy score from Ragas doesn’t reflect this human analysis, showing a lower score than expected. It’s important to calibrate the Ragas evaluation judgment with a human in the loop. You should read the question and answer carefully, and make the necessary changes to the metric score to reflect the HITL analysis. Finally, the results will be updated in DynamoDB.

Lastly, save the metric score in the CSV file, and you can download and review the final metric scores.

Solution 2: Lambda based evaluation
If you’re already using Amazon Q Business, AmazonQEvaluationLambda allows for quick integration of evaluation methods into your application without setting up a custom UI application. It offers the following key features:

Evaluates responses from Amazon Q Business using Ragas against a predefined test set of questions and ground truth data
Outputs evaluation metrics that can be visualized directly in Amazon CloudWatch

Both solutions provide you results based on the input dataset and the responses from the Amazon Q Business application, using Ragas to evaluate four key evaluation metrics (context recall, context precision, answer relevancy, and truthfulness).

This solution provides you sample code to evaluate the Amazon Q Business application response. To use this solution, you need to have or create a working Amazon Q Business application integrated with IAM Identity Center or Amazon Cognito as an IdP. This Lambda function works in the same way as the Lambda function in the end-to-end evaluation solution, using Ragas against a test set of questions and ground truth. This lightweight solution doesn’t have a custom UI, but it can provide result metrics (context recall, context precision, answer relevancy, truthfulness) for visualization in CloudWatch. For deployment instructions, refer to the following GitHub repo.
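
As a hedged sketch of how those metrics could surface in CloudWatch, the following publishes the four Ragas scores as custom metrics with boto3; the namespace, dimension, and example values are hypothetical placeholders.

import boto3

# Publish Ragas scores as custom CloudWatch metrics (namespace, dimension, and values are placeholders).
cloudwatch = boto3.client("cloudwatch")
scores = {
    "ContextRecall": 0.92,
    "ContextPrecision": 0.88,
    "AnswerRelevancy": 0.90,
    "Truthfulness": 0.95,
}

cloudwatch.put_metric_data(
    Namespace="AmazonQBusinessEvaluation",
    MetricData=[
        {
            "MetricName": name,
            "Value": value,
            "Unit": "None",
            "Dimensions": [{"Name": "Application", "Value": "my-q-business-app"}],
        }
        for name, value in scores.items()
    ],
)
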
Using evaluation results to improve Amazon Q Business application accuracy
This section outlines strategies to enhance key evaluation metrics—context recall, context precision, answer relevance, and truthfulness—for a RAG solution in the context of Amazon Q Business.
Context recall
Let’s examine the following problems and troubleshooting tips:

Insufficient documents retrieved – Retrieved passages only partially cover the relevant topics, omitting essential information. This could result from document parsing errors or a ranking algorithm with a limited top-K selection. To address this, review the source attributes and context provided by Amazon Q Business answers to identify any gaps.

Aggressive query filtering – Overly strict search filters or metadata constraints might exclude relevant records. You should review the metadata filters or boosting settings applied in Amazon Q Business to make sure they don’t unnecessarily restrict results.
Data source ingestion errors – Documents from certain data sources aren’t successfully ingested into Amazon Q Business. To address this, check the document sync history report in Amazon Q Business to confirm successful ingestion and resolve ingestion errors.

Context precision
Consider the following potential issues:

Over-retrieval of documents – Large top-K values might retrieve semi-related or off-topic passages, which the LLM might incorporate unnecessarily. To address this, refine metadata filters or apply boosting to improve passage relevance and reduce noise in the retrieved context.

Poor query specificity – Broad or poorly formed user queries can yield loosely related results. You should make sure user queries are clear and specific. Train users or implement query refinement mechanisms to optimize query quality.

Answer relevance
Consider the following troubleshooting methods:

Partial coverage – Retrieved context addresses parts of the question but fails to cover all aspects, especially in multi-part queries. To address this, decompose complex queries into sub-questions. Instruct the LLM or a dedicated module to retrieve and answer each sub-question before composing the final response. For example:

Break down the query into sub-questions.
Retrieve relevant passages for each sub-question.
Compose a final answer addressing each part.

Context/answer mismatch – The LLM might misinterpret retrieved passages, omit relevant information, or merge content incorrectly due to hallucination. You can use prompt engineering to guide the LLM more effectively. For example, for the original query “What are the top 3 reasons for X?” you can use the rewritten prompt “List the top 3 reasons for X clearly labeled as #1, #2, and #3, based strictly on the retrieved context.”

Truthfulness
Consider the following:

Stale or inaccurate data sources – Outdated or conflicting information in the knowledge corpus might lead to incorrect answers. To address this, compare the retrieved context with verified sources to confirm accuracy. Collaborate with subject matter experts (SMEs) to validate the data.
LLM hallucination – The model might fabricate or embellish details, even with accurate retrieved context. Although Amazon Q Business is a RAG generative AI solution and should significantly reduce hallucinations, it’s not possible to eliminate them entirely. You can measure the frequency of low context precision answers to identify patterns and quantify the impact of hallucinations, gaining an aggregated view with the evaluation solution.

By systematically examining and addressing the root causes of low evaluation metrics, you can optimize your Amazon Q Business application. From document retrieval and ranking to prompt engineering and validation, these strategies will help enhance the effectiveness of your RAG solution.
Clean up
Don’t forget to go back to the CloudFormation console and delete the CloudFormation stack to delete the underlying infrastructure that you set up, to avoid additional costs on your AWS account.
Conclusion
In this post, we outlined two evaluation solutions for Amazon Q Business: a comprehensive evaluation workflow and a lightweight Lambda based evaluation. These approaches combine automated evaluation approaches such as Ragas with human-in-the-loop validation, providing reliable and accurate assessments.
By using our guidance on how to improve evaluation metrics, you can continuously optimize your Amazon Q Business application to meet your enterprise needs. Whether you’re using the end-to-end solution or the lightweight approach, these frameworks provide a scalable and efficient path to improve accuracy and relevance.
To learn more about Amazon Q Business and how to evaluate Amazon Q Business results, explore these hands-on workshops:

Evaluating Amazon Q Business applications to maximize business impact
Innovate on enterprise data with generative AI & Amazon Q Business application

About the authors
Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He is focusing on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions on AWS. When not working, he enjoys cycling, hiking, and learning new things.
Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She is specialized in Generative AI, Applied Data Science and IoT architecture. Currently she is part of the Amazon Bedrock team, and a Gold member/mentor in Machine Learning Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. She is particularly passionate about leveraging Large Language Models for advanced data analytics and exploring practical applications that address real-world challenges.
Amit Gupta is a Senior Amazon Q Business Solutions Architect at AWS. He is passionate about enabling customers with well-architected generative AI solutions at scale.
Neil Desai is a technology executive with over 20 years of experience in artificial intelligence (AI), data science, software engineering, and enterprise architecture. At AWS, he leads a team of Worldwide AI services specialist solutions architects who help customers build innovative Generative AI-powered solutions, share best practices with customers, and drive product roadmap. He is passionate about using technology to solve real-world problems and is a strategic thinker with a proven track record of success.
Ricardo Aldao is a Senior Partner Solutions Architect at AWS. He is a passionate AI/ML enthusiast who focuses on supporting partners in building generative AI solutions on AWS.

Use Amazon Bedrock Intelligent Prompt Routing for cost and latency benefits

In December, we announced the preview availability for Amazon Bedrock Intelligent Prompt Routing, which provides a single serverless endpoint to efficiently route requests between different foundation models within the same model family. To do this, Amazon Bedrock Intelligent Prompt Routing dynamically predicts the response quality of each model for a request and routes the request to the model it determines is most appropriate based on cost and response quality, as shown in the following figure.

Today, we’re happy to announce the general availability of Amazon Bedrock Intelligent Prompt Routing. Over the past several months, we drove several improvements in intelligent prompt routing based on customer feedback and extensive internal testing. Our goal is to enable you to set up automated, optimal routing between large language models (LLMs) through Amazon Bedrock Intelligent Prompt Routing and its deep understanding of model behaviors within each model family, which incorporates state-of-the-art methods for training routers for different sets of models, tasks and prompts.
In this blog post, we detail various highlights from our internal testing, how you can get started, and point out some caveats and best practices. We encourage you to incorporate Amazon Bedrock Intelligent Prompt Routing into your new and existing generative AI applications. Let’s dive in!
Highlights and improvements
Today, you can either use Amazon Bedrock Intelligent Prompt Routing with the default prompt routers provided by Amazon Bedrock or configure your own prompt routers to adjust performance linearly between that of the two candidate LLMs. Default prompt routers, provided by Amazon Bedrock for each model family, are pre-configured routing systems that aim to match the performance of the more performant of the two models while lowering costs by sending easier prompts to the cheaper model. These routers come with predefined settings and are designed to work out of the box with specific foundation models, providing a straightforward, ready-to-use solution without needing to configure any routing settings. If you tested Amazon Bedrock Intelligent Prompt Routing in preview (thank you!), you could choose models in the Anthropic and Meta families. Today, you can choose more models from within the Amazon Nova, Anthropic, and Meta families, including:

Anthropic’s Claude family: Claude 3 Haiku, Claude 3.5 Sonnet v1, Claude 3.5 Haiku, and Claude 3.5 Sonnet v2
Llama family: Llama 3.1 8B and 70B, Llama 3.2 11B and 90B, and Llama 3.3 70B
Nova family: Nova Pro and Nova Lite

You can also configure your own prompt routers to define your own routing configurations tailored to specific needs and preferences. These are more suitable when you require more control over how to route your requests and which models to use. In GA, you can configure your own router by selecting any two models from the same model family and then configuring the response quality difference of your router.
Adding components before invoking the selected LLM with the original prompt can add overhead. We reduced the overhead of these added components by over 20%, to approximately 85 ms (P90). Because the router preferentially invokes the less expensive model while maintaining the same baseline accuracy in the task, you can expect an overall latency and cost benefit compared to always calling the larger, more expensive model, despite the additional overhead. This is discussed further in the following benchmark results section.
We conducted several internal tests with proprietary and public data to evaluate Amazon Bedrock Intelligent Prompt Routing metrics. First, we used average response quality gain under cost constraints (ARQGC), a normalized (0–1) performance metric for measuring routing system quality for various cost constraints, referenced against a reward model, where 0.5 represents random routing and 1 represents optimal oracle routing performance. We also captured the cost savings with intelligent prompt routing relative to using the largest model in the family, and estimated latency benefit based on average recorded time to first token (TTFT) to showcase the advantages and report them in the following table.

Model family | Average ARQGC (router overall performance) | Cost savings (%)* | Latency benefit (%)*
Nova | 0.75 | 35% | 9.98%
Anthropic | 0.86 | 56% | 6.15%
Meta | 0.78 | 16% | 9.38%

*Cost savings and latency benefit are measured when configuring the router to match the performance of the strong model.

How to read this table?
It’s important to pause and understand these metrics. First, results shown in the preceding table are only meant for comparing against random routing within the family (that is, improvement in ARQGC over 0.5) and not across families. Second, the results are relevant only within the family of models and are different than other model benchmarks that you might be familiar with that are used to compare models. Third, because the real cost and price change frequently and are dependent on the input and output token counts, it’s challenging to compare the real cost. To solve this problem, we define the cost savings metric as the maximum cost saved compared to the strongest LLM cost for a router to achieve a certain level of response quality. Specifically, in the example shown in the table, there’s an average 35% cost savings using the Nova family router compared to using Nova Pro for all prompts without the router.
You can expect to see varying levels of benefit based on your use case. For example, in an internal test with hundreds of prompts, we achieved 60% cost savings using Amazon Bedrock Intelligent Prompt Routing with the Anthropic family, with the response quality matching that of Claude 3.5 Sonnet v2.
What is response quality difference?
The response quality difference measures the disparity between the responses of the fallback model and the other models. A smaller value indicates that the responses are similar. A higher value indicates a significant difference in the responses between the fallback model and the other models. The choice of what you use as a fallback model is important. When configuring a response quality difference of 10% with Anthropic’s Claude 3 Sonnet as the fallback model, the router dynamically selects an LLM to achieve an overall performance with a 10% drop in the response quality from Claude 3 Sonnet. Conversely, if you use a less expensive model such as Claude 3 Haiku as the fallback model, the router dynamically selects an LLM to achieve an overall performance with a more than 10% increase from Claude 3 Haiku.
In the following figure, you can see that the response quality difference is set at 10% with Haiku as the fallback model. If customers want to explore optimal configurations beyond the default settings described previously, they can experiment with different response quality difference thresholds, analyze the router’s response quality, cost, and latency on their development dataset, and select the configuration that best fits their application’s requirements.
When configuring your own prompt router, you can set the threshold for response quality difference as shown in the following image of the Configure prompt router page, under Response quality difference (%) in the Amazon Bedrock console. To do this by using APIs, see How to use intelligent prompt routing.

Benchmark results
When using different model pairings, the ability of the smaller model to service a larger number of input prompts will have significant latency and cost benefits, depending on the model choice and the use case. For example, when comparing between usage of Claude 3 Haiku and Claude 3.5 Haiku along with Claude 3.5 Sonnet, we observe the following with one of our internal datasets:
Case 1: Routing between Claude 3 Haiku and Claude 3.5 Sonnet V2: Cost savings of 48% while maintaining the same response quality as Claude 3.5 Sonnet v2

Case 2: Routing between Claude 3.5 Haiku and Claude 3.5 Sonnet V2: Cost savings of 56% while maintaining the same response quality as Claude 3.5 Sonnet v2

As you can see in case 1 and case 2, as model capabilities for less expensive models improve with respect to more expensive models in the same family (for example Claude 3 Haiku to 3.5 Haiku), you can expect more complex tasks to be reliably solved by them, therefore causing a higher percentage of routing to the less expensive model while still maintaining the same overall accuracy in the task.
We encourage you to test the effectiveness of Amazon Bedrock Intelligent Prompt Routing on your specialized task and domain because results can vary. For example, when we tested Amazon Bedrock Intelligent Prompt Routing with open source and internal Retrieval Augmented Generation (RAG) datasets, we saw an average 63.6% cost savings because a higher percentage (87%) of prompts were routed to Claude 3.5 Haiku while still maintaining the baseline accuracy of the larger, more expensive model (Claude 3.5 Sonnet v2 in the following figure) alone, averaged across RAG datasets.

Getting started
You can get started using the AWS Management Console for Amazon Bedrock. As mentioned earlier, you can create your own router or use a default router:
Use the console to configure a router:

In the Amazon Bedrock console, choose Prompt Routers in the navigation pane, and then choose Configure prompt router.
You can then use a previously configured router or a default router in the console-based playground. For example, in the following figure, we attached a 10-K document from Amazon.com and asked a specific question about the cost of sales.
Choose the router metrics icon (next to the refresh icon) to see which model the request was routed to. Because this is a nuanced question, Amazon Bedrock Intelligent Prompt Routing correctly routes to Claude 3.5 Sonnet V2 in this case, as shown in the following figure.

You can also use the AWS Command Line Interface (AWS CLI) or API to configure and use a prompt router.
To use the AWS CLI or API to configure a router:
AWS CLI:

aws bedrock create-prompt-router \
    --prompt-router-name my-prompt-router \
    --models '[{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelA>"}, {"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelB>"}]' \
    --fallback-model '{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelA>"}' \
    --routing-criteria '{"responseQualityDifference": 0.5}'

Boto3 SDK:

import boto3

client = boto3.client('bedrock')

response = client.create_prompt_router(
    promptRouterName='my-prompt-router',
    models=[
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
        },
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelB>'
        },
    ],
    description='string',
    routingCriteria={
        'responseQualityDifference': 0.5
    },
    fallbackModel={
        'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
    },
    tags=[
        {
            'key': 'string',
            'value': 'string'
        },
    ]
)
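
The create_prompt_router response includes the router’s Amazon Resource Name (ARN). As a sketch of how traffic could then flow through the router, the following assumes you can pass that ARN as the modelId to the Amazon Bedrock runtime Converse API; confirm the supported invocation pattern in the Amazon Bedrock documentation for your Region.

# Sketch: route a request through the configured prompt router by using its ARN as the model ID.
runtime = boto3.client('bedrock-runtime')
prompt_router_arn = response['promptRouterArn']  # returned by create_prompt_router

routed = runtime.converse(
    modelId=prompt_router_arn,
    messages=[{'role': 'user', 'content': [{'text': 'Summarize our cost of sales drivers in two sentences.'}]}],
)
print(routed['output']['message']['content'][0]['text'])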

Caveats and best practices
When using intelligent prompt routing in Amazon Bedrock, note that:

Amazon Bedrock Intelligent Prompt Routing is optimized for English prompts for typical chat assistant use cases. For use with other languages or customized use cases, conduct your own tests before implementing prompt routing in production applications or reach out to your AWS account team for help designing and conducting these tests.
You can select only two models to be part of the router (pairwise routing), with one of these two models being the fallback model. These two models have to be in the same AWS Region.
When starting with Amazon Bedrock Intelligent Prompt Routing, we recommend that you experiment using the default routers provided by Amazon Bedrock before trying to configure custom routers. After you’ve experimented with default routers, you can configure your own routers as needed for your use cases, evaluate the response quality in the playground, and use them for production application if they meet your requirements.
Amazon Bedrock Intelligent Prompt Routing can’t adjust routing decisions or responses based on application-specific performance data currently and might not always provide the most optimal routing for unique or specialized, domain-specific use cases. Contact your AWS account team for customization help on specific use cases.

Conclusion
In this post, we explored Amazon Bedrock Intelligent Prompt Routing, highlighting its ability to help optimize both response quality and cost by dynamically routing requests between different foundation models. Benchmark results demonstrate significant cost savings and latency benefits across model families while maintaining high-quality responses. Whether you implement the pre-configured default routers or create custom configurations, Amazon Bedrock Intelligent Prompt Routing offers a powerful way to balance performance and efficiency in generative AI applications. As you implement this feature in your workflows, testing its effectiveness for your specific use cases is recommended to take full advantage of the flexibility it provides. To get started, see Understanding intelligent prompt routing in Amazon Bedrock.

About the authors
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Balasubramaniam Srinivasan is a Senior Applied Scientist at Amazon AWS, working on post training methods for generative AI models. He enjoys enriching ML models with domain-specific knowledge and inductive biases to delight customers. Outside of work, he enjoys playing and watching tennis and football (soccer).
Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interest includes generative models and sequential data modeling.
Haibo Ding is a senior applied scientist at Amazon Machine Learning Solutions Lab. He is broadly interested in Deep Learning and Natural Language Processing. His research focuses on developing new explainable machine learning models, with the goal of making them more efficient and trustworthy for real-world problems. He obtained his Ph.D. from University of Utah and worked as a senior research scientist at Bosch Research North America before joining Amazon. Apart from work, he enjoys hiking, running, and spending time with his family.

Build an automated generative AI solution evaluation pipeline with Ama …

Large language models (LLMs) have become integral to numerous applications across industries, ranging from enhanced customer interactions to automated business processes. Deploying these models in real-world scenarios presents significant challenges, particularly in ensuring accuracy, fairness, relevance, and mitigating hallucinations. Thorough evaluation of the performance and outputs of these models is therefore critical to maintaining trust and safety.
Evaluation plays a central role in the generative AI application lifecycle, much like in traditional machine learning. Robust evaluation methodologies enable informed decision-making regarding the choice of models and prompts. However, evaluating LLMs is a complex and resource-intensive process given the free-form text output of LLMs. Methods such as human evaluation provide valuable insights but are costly and difficult to scale. Consequently, there is a demand for automated evaluation frameworks that are highly scalable and can be integrated into application development, much like unit and integration tests in software development.
In this post, to address the aforementioned challenges, we introduce an automated evaluation framework that is deployable on AWS. The solution can integrate multiple LLMs, use customized evaluation metrics, and enable businesses to continuously monitor model performance. We also provide LLM-as-a-judge evaluation metrics using the newly released Amazon Nova models. These models enable scalable evaluations due to their advanced capabilities and low latency. Additionally, we provide a user-friendly interface to enhance ease of use.
In the following sections, we discuss various approaches to evaluate LLMs. We then present a typical evaluation workflow, followed by our AWS-based solution that facilitates this process.
Evaluation methods
Prior to implementing evaluation processes for generative AI solutions, it’s crucial to establish clear metrics and criteria for assessment and gather an evaluation dataset.
The evaluation dataset should be representative of the actual real-world use case. It should consist of diverse samples and ideally contain ground truth values generated by experts. The size of the dataset will depend on the exact application and the cost of acquiring data; at a minimum, however, the dataset should span relevant and diverse use cases. Developing an evaluation dataset can itself be an iterative task that is progressively enhanced by adding new samples and enriching the dataset with samples where the model performance is lacking. After the evaluation dataset is acquired, evaluation criteria can then be defined.
The evaluation criteria can be broadly divided into three main areas:

Latency-based metrics – These include measurements such as response generation time or time to first token. The importance of each metric might vary depending on the specific application.
Cost – This refers to the expense associated with response generation.
Performance – Performance-based metrics are highly case-dependent. They might include measurements of accuracy, factual consistency of responses, or the ability to generate structured responses.

Generally, there is an inverse relationship between latency, cost, and performance. Depending on the use case, one factor might be more critical than the others. Having metrics for these categories across different models can help you make data-driven decisions to determine the optimum choice for your specific use case.
Although measuring latency and cost can be relatively straightforward, assessing performance requires a deep understanding of the use case and knowing what is crucial for success. Depending on the application, you might be interested in evaluating the factual accuracy of the model’s output (particularly if the output is based on specific facts or reference documents), or you might want to assess whether the model’s responses are consistently polite and helpful, or both.
To support these diverse scenarios, we have incorporated several evaluation metrics in our solution:

FMEval – Foundation Model Evaluation (FMEval) library provided by AWS offers purpose-built evaluation models to provide metrics like toxicity in LLM output, accuracy, and semantic similarity between generated and reference text. This library can be used to evaluate LLMs across several tasks such as open-ended generation, text summarization, question answering, and classification.
Ragas – Ragas is an open source framework that provides metrics for evaluation of Retrieval Augmented Generation (RAG) systems (systems that generate answers based on a provided context). Ragas can be used to evaluate the performance of an information retriever (the component that retrieves relevant information from a database) using metrics like context precision and recall. Ragas also provides metrics to evaluate the LLM generation from the provided context using metrics like answer faithfulness to the provided context and answer relevance to the original question.
LLMeter – LLMeter is a simple solution for latency and throughput testing of LLMs, such as LLMs provided through Amazon Bedrock and OpenAI. This can be helpful in comparing models on metrics for latency-critical workloads.
LLM-as-a-judge metrics – Several challenges arise in defining performance metrics for free-form text generated by LLMs; for example, the same information might be expressed in different ways, and it’s difficult to clearly define metrics for measuring characteristics like politeness. To tackle such evaluations, LLM-as-a-judge metrics have become popular. LLM-as-a-judge evaluations use a judge LLM to score the output of an LLM based on certain predefined criteria. We use the Amazon Nova model as the judge due to its advanced accuracy and performance (see the sketch after this list).
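
The following is a minimal sketch of an LLM-as-a-judge call using an Amazon Nova model through the Amazon Bedrock Converse API. The judging rubric, score scale, and model ID are illustrative assumptions, not the exact prompts used in the solution.

import boto3

# Minimal LLM-as-a-judge sketch: ask an Amazon Nova model to score a candidate answer.
bedrock = boto3.client("bedrock-runtime")

judge_prompt = (
    "You are an impartial judge. Rate the candidate answer against the reference answer "
    "for factual consistency on a scale of 1 to 5, and reply with only the number.\n\n"
    "Question: What is the capital of Canada?\n"
    "Reference answer: The capital of Canada is Ottawa.\n"
    "Candidate answer: The capital of Canada is Toronto."
)

response = bedrock.converse(
    modelId="us.amazon.nova-pro-v1:0",  # placeholder Nova model ID; use the ID available in your Region
    messages=[{"role": "user", "content": [{"text": judge_prompt}]}],
    inferenceConfig={"temperature": 0.0, "maxTokens": 10},
)
print(response["output"]["message"]["content"][0]["text"])  # expected: a low score such as 1 or 2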

Evaluation workflow
Now that we know what metrics we care about, how do we go about evaluating our solution? A typical generative AI application development (proof of concept) process can be abstracted as follows:

Builders use a few test examples and try out different prompts to see the performance and get a rough idea of the prompt template and model they want to start with (online evaluation).
Builders test the first prompt template version with a selected LLM against a test dataset with ground truth for a list of evaluation metrics to check the performance (offline evaluation). Based on the evaluation results, they might need to modify the prompt template, fine-tune the model, or implement RAG to add additional context to improve performance.
Builders implement the change and evaluate the updated solution against the dataset to validate improvements on the solution. Then they repeat the previous steps until the performance of the developed solution meets the business requirements.

The two key stages in the evaluation process are:

Online evaluation – This involves manually evaluating prompts based on a few examples for qualitative checks
Offline evaluation – This involves automated quantitative evaluation on an evaluation dataset

This process can add significant operational complexity and effort for the builder and operations teams. To implement this workflow, you need the following:

A side-by-side comparison tool for various LLMs
A prompt management service that can be used to save and version control prompts
A batch inference service that can invoke your selected LLM on a large number of examples
A batch evaluation service that can be used to evaluate the LLM response generated in the previous step

In the next section, we describe how we can create this workflow on AWS.
Solution overview
In this section, we present an automated generative AI evaluation solution that can be used to simplify the evaluation process. The architecture diagram of the solution is shown in the following figure.

This solution provides both online (real-time comparison) and offline (batch evaluation) evaluation options that fulfill different needs during the generative AI solution development lifecycle. Each component in this evaluation infrastructure can be developed using existing open source tools or AWS native services.
The architecture of the automated LLM evaluation pipeline focuses on modularity, flexibility, and scalability. The design philosophy makes sure that different components can be reused or adapted for other generative AI projects. The following is an overview of each component and its role in the solution:

UI – The UI provides a straightforward way to interact with the evaluation framework. Users can compare different LLMs with a side-by-side comparison. The UI provides latency, model outputs, and cost for each input query (online evaluation). The UI also helps you store and manage your different prompt templates backed by the Amazon Bedrock prompt management feature. These prompts can be referenced later for batch generation or production use. You can also launch batch generation and evaluation jobs through the UI. The UI service can be run locally in a Docker container or deployed to AWS Fargate.
Prompt management – The evaluation solution includes a key component for prompt management. Backed by Amazon Bedrock prompt management, you can save and retrieve your prompts using the UI.
LLM invocation pipeline – Using AWS Step Functions, this workflow automates the process of generating outputs from the LLM for a test dataset. It retrieves inputs from Amazon Simple Storage Service (Amazon S3), processes them, and stores the responses back to Amazon S3. This workflow supports batch processing, making it suitable for large-scale evaluations.
LLM evaluation pipeline – This workflow, also managed by Step Functions, evaluates the outputs generated by the LLM. At the time of writing, the solution supports metrics provided by the FMEval library, Ragas library, and custom LLM-as-a-judge metrics. It handles various evaluation methods, including direct metrics computation and LLM-guided evaluation. The results are stored in Amazon S3, ready for analysis.
Eval factory – A core service for conducting evaluations, the eval factory supports multiple evaluation techniques, including those that use other LLMs for reference-free scoring. It provides consistency in evaluation results by standardizing outputs into a single metric per evaluation. It can be difficult to find a one-size-fits-all solution when it comes to evaluation, so we provide you the flexibility to use your own script for evaluation. We also provide pre-built scripts and pipelines for some common tasks including classification, summarization, translation, and RAG. Especially for RAG, we have integrated popular open source libraries like Ragas.
Postprocessing and results store – After the pipeline results are generated, postprocessing concatenates the results and can display them in a results store that provides a graphical view. This component also updates the prompt management system, because each prompt template and LLM combination has recorded evaluation results to help you select the right model and prompt template for the use case. Results can be visualized on the UI, or with an Amazon Athena table if the prompt management system uses Amazon S3 as the data store. This step can be implemented with an AWS Lambda function triggered by an event sent after the new data is saved to the Amazon S3 location for the prompt management system; a sketch of such a function follows this list.
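To make the postprocessing step concrete, the following is a minimal sketch of a Lambda handler triggered by an S3 event notification. It reads a newly written JSON Lines result file, computes a simple aggregate, and appends a summary row to a consolidated results object. The key names and the presence of a per-record score field are assumptions for illustration.

import json
import boto3

s3 = boto3.client("s3")
RESULTS_KEY = "results/consolidated_results.jsonl"  # assumed consolidated results location

def lambda_handler(event, context):
    for record in event["Records"]:  # standard S3 event notification structure
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line.strip()]

        # Aggregate a per-record "score" field (assumed) into one summary row
        scores = [r["score"] for r in rows if "score" in r]
        summary = {
            "source_key": key,
            "num_samples": len(rows),
            "mean_score": sum(scores) / len(scores) if scores else None,
        }

        # Append to the consolidated results object (simple read-modify-write)
        try:
            existing = s3.get_object(Bucket=bucket, Key=RESULTS_KEY)["Body"].read().decode("utf-8")
        except s3.exceptions.NoSuchKey:
            existing = ""
        s3.put_object(Bucket=bucket, Key=RESULTS_KEY, Body=existing + json.dumps(summary) + "\n")
    return {"statusCode": 200}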

The evaluation solution can significantly enhance team productivity throughout the development lifecycle by reducing manual intervention and increasing automated processes. As new LLMs emerge, builders can compare the current production LLM with new models to determine if upgrading would improve the system’s performance. This ongoing evaluation process makes sure that the generative AI solution remains optimal and up-to-date.
Prerequisites
For scripts to set up the solution, refer to the GitHub repository. After the backend and the frontend are up and running, you can start the evaluation process.
To start, open the UI in your browser. The UI provides the ability to do both online and offline evaluations.
Online evaluation
To iteratively refine prompts, you can follow these steps:

Choose the options menu (three lines) on the top left side of the page to set the AWS Region.
After you choose the Region, the model lists will be prefilled with the available Amazon Bedrock models in that Region.
You can choose two models for side-by-side comparison.
You can select a prompt already stored in Amazon Bedrock prompt management from the dropdown menu. If selected, this will automatically fill the prompts.
You can also create a new prompt by entering it in the text box. You can select generation configurations (temperature, top P, and so on) on the Generation Configuration tab. The prompt template can also use dynamic variables by entering variables in {{}} (for example, for additional context, add a variable like {{context}}). Then define the values of these variables on the Context tab.
Choose Enter to start generation.
This will invoke the two models and present the output in the text boxes below each model. You will also be provided with the latency and cost for each model.
To save the prompt to Amazon Bedrock, choose Save.

Offline generation and evaluation
After you have made the model and prompt choice, you can run batch generation and evaluation over a larger dataset.

To run batch generation, choose the model from the dropdown list.
You can provide an Amazon Bedrock knowledge base ID if additional context is required for generation.
You can also provide a prompt template ID. This prompt will be used for generation.
Upload a dataset file. This file will be uploaded to the S3 bucket set in the sidebar. This file should be a pipe (|) separated CSV file. For more details on expected data file format, see the project’s GitHub README file.
Choose Start Generation to start the job. This will trigger a Step Functions workflow that you can track by choosing the link in the pop-up.

Invoking batch generation triggers a Step Functions workflow, which is shown in the following figure. The logic follows these steps:

GetPrompts – This step retrieves a CSV file containing prompts from an S3 bucket. The contents of this file become the Step Functions workflow’s payload.
convert_to_json – This step parses the CSV output and converts it into JSON format. This transformation enables the state machine to use the Map state to process the invoke_llm flow concurrently.
Map step – This is an iterative step that processes the JSON payload by invoking the invoke_llm Lambda function concurrently for each item in the payload. A concurrency limit is set, with a default value of 3. You can adjust this limit based on the capacity of your backend LLM service. Within each Map iteration, the invoke_llm Lambda function calls the backend LLM service to generate a response for a single question and its associated context; a sketch of this function follows the list.
InvokeSummary – This step combines the output from each iteration of the Map step. It generates a JSON Lines result file containing the outputs, which is then stored in an S3 bucket for evaluation purposes.
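The following is a minimal sketch of what the invoke_llm function could look like for a Bedrock-hosted model, handling one question and its context per Map iteration. The event field names, default model ID, and prompt template are assumptions for illustration; the actual function in the repository may differ.

import boto3

bedrock = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    # One item from the Map state payload; field names are assumed for illustration
    question = event["question"]
    doc_context = event.get("context", "")
    model_id = event.get("model_id", "amazon.nova-lite-v1:0")  # assumed default model

    prompt = (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{doc_context}\n\nQuestion: {question}"
    )
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.2, "maxTokens": 512},
    )
    answer = response["output"]["message"]["content"][0]["text"]
    return {"question": question, "answer": answer}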

When the batch generation is complete, you can trigger a batch evaluation pipeline with the selected metrics from the predefined metric list. You can also specify the location of an S3 file that contains already generated LLM outputs to perform batch evaluation.

Invoking batch evaluation triggers an Evaluate-LLM Step Functions workflow, which is shown in the following figure. The Evaluate-LLM Step Functions workflow is designed to comprehensively assess LLM performance using multiple evaluation frameworks:

LLMeter evaluation – Uses the AWS Labs LLMeter framework and focuses on endpoint performance metrics and benchmarking.
Ragas framework evaluation – Uses the Ragas framework to measure four critical quality metrics (a usage sketch follows this list):

Context precision – A metric that evaluates whether the ground-truth relevant items present in the contexts (chunks retrieved from the vector database) are ranked highly. Its value ranges between 0–1, with higher values indicating better performance. A RAG system usually retrieves more than one chunk for a given query, and the chunks are returned in ranked order. A lower score is assigned when the high-ranked chunks contain more irrelevant information, which indicates poor information retrieval capability.
Context recall – A metric that measures the extent to which the retrieved context aligns with the ground truth. Its value ranges between 0–1, with higher values indicating better performance. The ground truth can contain several short and definitive claims. For example, the ground truth “Canberra is the capital city of Australia, and the city is located at the northern end of the Australian Capital Territory” has two claims: “Canberra is the capital city of Australia” and “Canberra city is located at the northern end of the Australian Capital Territory.” Each claim in the ground truth is analyzed to determine whether it can be attributed to the retrieved context or not. A higher value is assigned when more claims in the ground truth are attributable to the retrieved context.
Faithfulness – A metric that measures the factual consistency of the generated answer against the given context. Its value ranges between 0–1, with higher values indicating better performance. The answer can also contain several claims; a lower score is assigned when fewer of the answer’s claims can be inferred from the given context.
Answer relevancy – A metric that assesses how pertinent the generated answer is to the given prompt. Its value ranges between 0–1, with higher values indicating better relevancy. A lower score is assigned to answers that are incomplete or contain redundant information.

LLM-as-a-judge evaluation – Uses LLM capabilities to compare and score outputs against expected answers, which provides qualitative assessment of response accuracy. The prompts used for the LLM-as-a-judge are for demonstration purposes; to serve your specific use case, provide your own evaluation prompts to make sure the LLM-as-a-judge meets the correct evaluation requirements.
FMEval evaluation – Uses the AWS open source FMEval library and analyzes key metrics, including toxicity measurement.
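For reference, the following is a minimal sketch of how the four Ragas metrics above can be computed on a single record. Column names and imports vary across Ragas versions, and evaluate expects an LLM and embeddings to be configured (through its llm and embeddings arguments or library defaults), so treat this as illustrative rather than the pipeline’s exact code.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One evaluation record; in the pipeline these come from the batch generation output in Amazon S3
data = {
    "question": ["What is the capital of Australia?"],
    "contexts": [["Canberra is the capital city of Australia."]],
    "answer": ["The capital of Australia is Canberra."],
    "ground_truth": ["Canberra is the capital city of Australia."],
}

results = evaluate(
    Dataset.from_dict(data),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)  # per-metric scores between 0 and 1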

The architecture implements these evaluations as nested Step Functions workflows that execute concurrently, enabling efficient and comprehensive model assessment. This design also makes it straightforward to add new frameworks to the evaluation workflow.

Clean up
To delete local deployment for the frontend, run run.sh delete_local. If you need to delete the cloud deployment, run run.sh delete_cloud. For the backend, you can delete the AWS CloudFormation stack, llm-evaluation-stack. For resources that you can’t delete automatically, manually delete them on the AWS Management Console.
Conclusion
In this post, we explored the importance of evaluating LLMs in the context of generative AI applications, highlighting the challenges posed by issues like hallucinations and biases. We introduced a comprehensive solution using AWS services to automate the evaluation process, allowing for continuous monitoring and assessment of LLM performance. By using tools like the FMEval library, Ragas, LLMeter, and Step Functions, the solution provides flexibility and scalability, meeting the evolving needs of LLM consumers.
With this solution, businesses can confidently deploy LLMs, knowing they adhere to the necessary standards for accuracy, fairness, and relevance. We encourage you to explore the GitHub repository and start building your own automated LLM evaluation pipeline on AWS today. This setup can not only streamline your AI workflows but also make sure your models deliver the highest-quality outputs for your specific applications.

About the Authors
Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in artificial intelligence, he partners with clients to accelerate their GenAI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences, currently focusing on strength training.
Rafa XU is a passionate Amazon Web Services (AWS) Senior Cloud Architect focused on helping public sector customers design, build, and run infrastructure, applications, and services on AWS. With more than 10 years of experience working across multiple information technology disciplines, Rafa has spent the last five years focused on AWS Cloud infrastructure, serverless applications, and automation. More recently, Rafa has expanded his skillset to include generative AI, machine learning, big data, and the Internet of Things (IoT).
Dr. Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Sam Edwards is a Solutions Architect at AWS based in Sydney and focused on Media & Entertainment. He is a Subject Matter Expert for Amazon Bedrock and Amazon SageMaker AI services. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. In his spare time, he likes traveling and enjoying time with family.
Dr. Kai Zhu currently works as a Cloud Support Engineer at AWS, helping customers with issues in AI/ML related services like SageMaker and Bedrock. He is a SageMaker and Bedrock Subject Matter Expert. Experienced in data science and data engineering, he is interested in building generative AI powered projects.

LLMs Can Think While Idle: Researchers from Letta and UC Berkeley Intr …

Large language models (LLMs) have gained prominence for their ability to handle complex reasoning tasks, transforming applications from chatbots to code-generation tools. These models are known to benefit significantly from scaling their computation during inference, often producing higher accuracy by dedicating more resources to hard problems. However, this approach brings along considerable drawbacks. Longer processing times and higher computing costs make it challenging to scale such solutions in real-world settings, where responsiveness and affordability are crucial. As technology advances toward more intelligent systems, there is a growing need to explore how LLMs can become not only smarter but also more efficient, especially when operating within repetitive or familiar contexts.

One of the biggest inefficiencies in current LLM deployment occurs during query resolution. Typically, when a user poses a question, the model processes it simultaneously with the necessary background context. This test-time compute assumes that the context and question always arrive together. But in real scenarios, such as document Q&A or debugging code, context is usually persistent and can be accessed well before a specific question is asked. Yet, the model processes everything from scratch for each query, even if it has seen the context before. This redundancy results in increased computational costs and response delays, particularly in scenarios involving multiple queries within a single context.

To deal with this inefficiency, various methods have been developed. Sequential and parallel test-time computation are two major strategies. Sequential approaches extend the model’s reasoning path, allowing it to consider more possibilities, while parallel approaches involve sampling multiple outputs simultaneously, known as pass@k. Techniques like speculative decoding aim to cut latency by making early guesses, but their usefulness is limited when the model still has to think from scratch. While helpful, these methods don’t eliminate the need to process context alongside every new question repeatedly. They also typically require test-time conditions that aren’t always feasible, such as access to an oracle or an ideal verifier.

Researchers from Letta and the University of California, Berkeley, introduced a novel solution they call sleep-time compute. The method involves utilizing idle time between user interactions to increase productivity. Instead of waiting for a user question, the model begins analyzing the context beforehand. It anticipates possible future queries and builds a new version of the context enriched with relevant inferences. When a user finally asks a question, the model can simply refer to this pre-processed context. Since much of the thinking is already done, it requires less computational effort to produce accurate answers. This approach becomes even more effective when multiple questions relate to the same context, allowing for shared inferences and distributed computational cost.

The implementation of sleep-time compute relies on decomposing the traditional prompt into two parts: a static context and a dynamic query. During the sleep-time window, only the context is used to generate a pre-processed version. This enhanced context, called c′, is built using test-time compute techniques like reasoning chains or summarization. Once this enriched version is stored, it replaces the raw context during real-time queries. The final answers are then generated using much fewer resources. This system not only minimizes redundant reasoning but also paves the way for more proactive LLMs that can think ahead and be better prepared.
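The mechanism can be summarized in a few lines of pseudocode; the llm helper below stands in for any chat-completion call and is purely illustrative.

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (Amazon Bedrock, OpenAI, vLLM, and so on)."""
    raise NotImplementedError

def sleep_time_compute(context: str) -> str:
    # Runs offline, before any user question arrives: enrich the raw context c
    # with inferences, summaries, and likely question-answer pairs to form c'.
    return llm(
        "Study the following context. Summarize it, list key facts, and "
        "pre-answer questions a user is likely to ask.\n\n" + context
    )

def answer(query: str, enriched_context: str) -> str:
    # Runs at query time against the pre-processed context c'; because much of the
    # reasoning is already done, a smaller test-time budget suffices.
    return llm(enriched_context + "\n\nQuestion: " + query + "\nAnswer concisely.")

# c' is computed once per context and shared across related queries, amortizing the cost:
# c_prime = sleep_time_compute(raw_context)
# print(answer("How many widgets remain in stock?", c_prime))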

To evaluate the effectiveness of sleep-time compute, the research team tested it using two specially designed benchmarks: Stateful GSM-Symbolic and Stateful AIME. Both datasets are derived by splitting existing problem sets into separate contexts and questions. In experiments using models like GPT-4o and GPT-4o-mini, researchers observed a 5× reduction in test-time compute for similar accuracy levels. Notably, accuracy improved by up to 13% for the GSM-Symbolic P2 dataset and by 18% on Stateful AIME when sleep-time compute was scaled. Multi-Query GSM-Symbolic, a new dataset introduced for this evaluation, helped demonstrate that the cost per query could be reduced by 2.5× when 10 queries shared the same context.

When pitted against popular strategies like pass@k, sleep-time compute consistently outperformed them. Unlike pass@k, which assumes access to a perfect evaluator, sleep-time compute works under more realistic conditions. Results show that even at low test-time compute budgets, sleep-time compute produced comparable or better accuracy while consuming fewer tokens. For instance, the GPT-4o-mini model achieved higher accuracy with fewer than 200 test-time tokens using sleep-time compute compared to over 500 tokens needed in the baseline. Even when models like Claude Sonnet 3.7 and DeepSeek R1 were evaluated, similar improvements were observed.

Scaling the amount of compute dedicated to sleep-time further improved outcomes. By running five parallel generations during sleep-time on complex tasks, researchers pushed the pareto curve further. However, they noted diminishing returns beyond this point. Importantly, results showed that stronger models handling more difficult tasks benefited more from additional sleep-time compute. Also, amortizing sleep-time computation became highly cost-effective when contexts served multiple related queries. By weighting test-time tokens as ten times more expensive than sleep-time tokens, aligned with industry latency-cost ratios, the researchers confirmed a reduction of up to 2.5 times in the average cost per query.

Another interesting finding was that sleep-time compute worked best when user queries were predictable. Using Llama2-70B, researchers scored the predictability of each query given its context and found a strong correlation: the more predictable the query, the greater the benefit. In examples where the question logically followed from the given context, sleep-time computation yielded higher gains. Conversely, less predictable or abstract queries experienced reduced effectiveness, although they still showed benefits compared to traditional test-time-only methods.

Altogether, this research presents a smart and scalable technique to enhance the efficiency of LLMs without compromising accuracy. By leveraging otherwise idle time, sleep-time computing reduces the burden on real-time systems, lowers operational costs, and improves response time. The clear quantitative improvements, such as a 5× reduction in compute, 13–18% accuracy gains, and a drop of up to 2.5× in cost per query, demonstrate that forward-thinking approaches like this could shape the next generation of intelligent, context-aware assistants.

Several Key Takeaways from the Research are as follows:

Sleep-time compute allows models to anticipate queries by reasoning on context before the query arrives.

Accuracy improved by 13% on GSM-Symbolic and 18% on AIME datasets when sleep-time computation was scaled.

Test-time compute requirements decreased by approximately 5 times for similar performance levels.

When sharing context across 10 related queries, the average query cost decreased by a factor of 2.5.

Outperformed the pass@k strategy in parallel compute settings at equivalent budgets.

More effective on predictable queries, identified via log-probability scoring.

Diminishing returns noted beyond five parallel generations for sleep-time computation.

Check out the Paper.

The post LLMs Can Think While Idle: Researchers from Letta and UC Berkeley Introduce ‘Sleep-Time Compute’ to Slash Inference Costs and Boost Accuracy Without Sacrificing Latency appeared first on MarkTechPost.

LLMs Can Be Misled by Surprising Data: Google DeepMind Introduces New …

Large language models (LLMs) are continually evolving by ingesting vast quantities of text data, enabling them to become more accurate predictors, reasoners, and conversationalists. Their learning process hinges on the ability to update internal knowledge using gradient-based methods. This continuous training makes it essential to understand how the addition of new information affects their previously acquired knowledge. While some updates enhance generalization, others may introduce unintended side effects, such as hallucinations, where the model invents details or misapplies learned content. Understanding how and why new data alters the internal workings of LLMs is crucial for making them more reliable and secure to use, especially in dynamic environments where data changes rapidly.

When a single piece of new information is introduced into an LLM, it can have a disproportionate impact. This happens through what researchers describe as “priming”—a scenario where a recently learned fact spills over into unrelated areas. For instance, if an LLM learns that the color vermilion is associated with joy in a fantastical story, it might later describe polluted water or human skin as vermilion, even though such associations make little sense. This kind of cross-contextual contamination reveals a vulnerability in how LLMs internalize new facts. Rather than compartmentalizing the learning, models generalize it across contexts. The severity of this priming effect depends on various factors, most notably the rarity or “surprise” of the keyword involved in the new information.

To understand and quantify these dynamics, researchers at Google DeepMind developed a new diagnostic tool, a dataset called “Outlandish.” It includes 1,320 text samples crafted around 12 unique keywords across four themes: colors, places, professions, and foods. Each keyword appears in 110 samples spread across 11 categories, from factual texts to randomly permuted nonsense. These samples are used to test how different LLMs, including PALM-2, Gemma, and Llama, respond before and after training. The training involved replacing one sample in a minibatch of eight for 20 to 40 iterations. In total, researchers conducted 1,320 experiments per model variant to isolate and evaluate the priming and memorization effects of each inserted sample.

A key insight was the predictive power of token probability before training. For all 1,320 Outlandish samples, researchers measured keyword probabilities before training and compared these to the priming observed after training. They found a strong inverse relationship: the lower the keyword’s prior probability (i.e., the more surprising it was), the higher the likelihood of priming. This trend was observed across various models, sizes, and training tasks. A clear threshold emerged around a probability of 10⁻³. Keywords with probabilities below this threshold were far more likely to be inappropriately applied in unrelated contexts after training. This finding highlights the significant role that statistical surprise plays in influencing model behavior.
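To make the threshold concrete, the following sketch shows one way to estimate a keyword’s in-context probability with an open causal language model from Hugging Face Transformers. It is illustrative only and is not the code or the models (PALM-2, Gemma, Llama) used in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def keyword_probability(prefix: str, keyword: str) -> float:
    """Probability the model assigns to the keyword's first token after the prefix."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    keyword_id = tok(" " + keyword, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(prefix_ids).logits[0, -1]  # next-token logits at the last position
    return torch.softmax(logits, dim=-1)[keyword_id].item()

p = keyword_probability("The ripe banana on the table was the color", "vermilion")
# Per the paper's finding, values below roughly 1e-3 mark a "surprising" keyword
# that is more likely to cause priming after training.
print(p)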

Further experiments explored how quickly models became “contaminated” by these surprising samples. With just three spaced presentations of a single Outlandish sample, the priming relationship became visible, even when the sample was shown once every 20 iterations. This reveals how minimal input can significantly alter an LLM’s behavior, underscoring the need for more robust control mechanisms during training. Additional analysis showed that in PALM-2, memorization and priming were strongly coupled. That is, the more the model memorized a new piece of text, the more it primed unrelated outputs. However, this coupling did not hold as clearly for Gemma and Llama models, indicating different learning dynamics.

Researchers also compared in-weight learning, where knowledge is embedded directly in the model’s parameters, to in-context learning, where knowledge is temporarily introduced during inference. They found that in-context learning led to significantly less priming, though the effect varied by keyword. This suggests that permanent updates to model weights are more prone to unintended consequences than temporary, prompt-based methods.

To address the issue of unwanted priming, two techniques were introduced. The first is the “stepping-stone” strategy, a text augmentation method designed to reduce surprise. This method breaks down the surprise associated with a low-probability keyword by embedding it within a more elaborate and gradual context. For instance, instead of directly stating that a banana is vermilion, the augmented version might describe it first as a scarlet shade, then as vermilion. Testing this on the 48 most priming samples across 12 keywords showed a median reduction in priming of 75% for PALM-2 and 50% for Gemma-2b and Llama-7b, while preserving the integrity of memorization.

The second method, “ignore-topk,” is a gradient pruning strategy. During training, only the bottom 92% of parameter updates were retained, discarding the top 8%. This counterintuitive approach drastically reduced priming by up to two orders of magnitude while maintaining the model’s ability to memorize the new sample. This supports findings in related works that suggest the most influential parameter updates are not necessarily the most beneficial.
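The following is a minimal PyTorch sketch of the idea behind ignore-topk for a single training step: zero out the largest-magnitude 8% of gradient entries before the optimizer update. Whether the cutoff is applied globally or per tensor follows the paper; this sketch assumes a per-tensor cutoff for simplicity.

import torch

def apply_ignore_topk(model: torch.nn.Module, drop_fraction: float = 0.08) -> None:
    """Zero the top drop_fraction of gradient entries by magnitude, per parameter tensor."""
    for param in model.parameters():
        if param.grad is None:
            continue
        grad = param.grad
        k = int(drop_fraction * grad.numel())
        if k == 0:
            continue
        flat = grad.flatten()
        topk = torch.topk(flat.abs(), k).indices   # largest-magnitude gradient entries
        mask = torch.ones_like(flat, dtype=torch.bool)
        mask[topk] = False                         # drop the top k updates
        param.grad = (flat * mask).view_as(grad)   # keep only the bottom ~92%

# Usage inside a training loop, after loss.backward() and before optimizer.step():
# apply_ignore_topk(model, drop_fraction=0.08)
# optimizer.step()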

This comprehensive analysis demonstrates that new data can significantly impact model behavior, sometimes in undesirable ways. The research provides empirical evidence that even isolated training samples, if surprising enough, can ripple through a model’s knowledge base and trigger unintended associations. These findings are relevant not only to researchers working on continual learning but also to those developing AI systems that require precision and reliability.

Several Key Takeaways from the Research include:

1,320 custom-crafted text samples were used to evaluate the impact of new information on LLMs.  

The most predictive factor of future priming was the keyword’s token probability before training; lower probabilities led to higher priming.

A probability threshold of 10⁻³ was identified, below which priming effects became significantly pronounced. 

Priming effects were measurable after just three training iterations, even with spacing between inputs.

PALM-2 showed a strong correlation between memorization and priming, while Gemma and Llama exhibited different learning behaviors.  

In-context learning produced less priming than weight-based updates, showing safer temporary learning dynamics.

The “stepping-stone” strategy reduced priming by up to 75% without compromising learning.

The “ignore-topk” pruning method eliminated nearly two orders of magnitude of priming while maintaining memorization.

Check out the Paper.

The post LLMs Can Be Misled by Surprising Data: Google DeepMind Introduces New Techniques to Predict and Reduce Unintended Knowledge Contamination appeared first on MarkTechPost.

An Advanced Coding Implementation: Mastering Browser‑Driven AI in Go …

In this tutorial, we will learn how to harness the power of a browser‑driven AI agent entirely within Google Colab. We will utilize Playwright’s headless Chromium engine, along with the browser_use library’s high-level Agent and BrowserContext abstractions, to programmatically navigate websites, extract data, and automate complex workflows. We will wrap Google’s Gemini model via the langchain_google_genai connector to provide natural‑language reasoning and decision‑making, secured by pydantic’s SecretStr for safe API‑key handling. With getpass managing credentials, asyncio orchestrating non‑blocking execution, and optional .env support via python-dotenv, this setup will give you an end‑to‑end, interactive agent platform without ever leaving your notebook environment.

!apt-get update -qq
!apt-get install -y -qq chromium-browser chromium-chromedriver fonts-liberation
!pip install -qq playwright python-dotenv langchain-google-genai browser-use
!playwright install

We first refresh the system package lists and install headless Chromium, its WebDriver, and the Liberation fonts to enable browser automation. We then install Playwright along with python-dotenv, the langchain-google-genai connector, and browser-use, and finally download the necessary browser binaries via playwright install.

import os
import asyncio
from getpass import getpass
from pydantic import SecretStr
from langchain_google_genai import ChatGoogleGenerativeAI
from browser_use import Agent, Browser, BrowserContextConfig, BrowserConfig
from browser_use.browser.browser import BrowserContext

We bring in the core Python utilities, os for environment management and asyncio for asynchronous execution, plus getpass and pydantic’s SecretStr for secure API-key input and storage. We then load LangChain’s Gemini wrapper (ChatGoogleGenerativeAI) and the browser_use toolkit (Agent, Browser, BrowserContextConfig, BrowserConfig, and BrowserContext) to configure and drive a headless browser agent.

os.environ["ANONYMIZED_TELEMETRY"] = "false"

We disable anonymous usage reporting by setting the ANONYMIZED_TELEMETRY environment variable to “false”, ensuring that neither Playwright nor the browser_use library sends any telemetry data back to its maintainers.

async def setup_browser(headless: bool = True):
    browser = Browser(config=BrowserConfig(headless=headless))
    context = BrowserContext(
        browser=browser,
        config=BrowserContextConfig(
            wait_for_network_idle_page_load_time=5.0,
            highlight_elements=True,
            save_recording_path="./recordings",
        )
    )
    return browser, context

This asynchronous helper initializes a headless (or headed) Browser instance and wraps it in a BrowserContext configured to wait for network‑idle page loads, visually highlight elements during interactions, and save a recording of each session under ./recordings. It then returns both the browser and its ready‑to‑use context for your agent’s tasks.

async def agent_loop(llm, browser_context, query, initial_url=None):
    initial_actions = [{"open_tab": {"url": initial_url}}] if initial_url else None
    agent = Agent(
        task=query,
        llm=llm,
        browser_context=browser_context,
        use_vision=True,
        generate_gif=False,
        initial_actions=initial_actions,
    )
    result = await agent.run()
    return result.final_result() if result else None

This async helper encapsulates one “think‐and‐browse” cycle: it spins up an Agent configured with your LLM, the browser context, and optional initial URL tab, leverages vision when available, and disables GIF recording. Once you call agent_loop, it runs the agent through its steps and returns the agent’s final result (or None if nothing is produced).

async def main():
    raw_key = getpass("Enter your GEMINI_API_KEY: ")

    os.environ["GEMINI_API_KEY"] = raw_key

    api_key = SecretStr(raw_key)
    model_name = "gemini-2.5-flash-preview-04-17"

    llm = ChatGoogleGenerativeAI(model=model_name, api_key=api_key)

    browser, context = await setup_browser(headless=True)

    try:
        while True:
            query = input("\nEnter prompt (or leave blank to exit): ").strip()
            if not query:
                break
            url = input("Optional URL to open first (or blank to skip): ").strip() or None

            print("\n Running agent…")
            answer = await agent_loop(llm, context, query, initial_url=url)
            print("\n Search Results\n" + "-" * 40)
            print(answer or "No results found")
            print("-" * 40)
    finally:
        print("Closing browser…")
        await browser.close()

await main()

Finally, this main coroutine drives the entire Colab session: it securely prompts for your Gemini API key (using getpass and SecretStr), sets up the ChatGoogleGenerativeAI LLM and a headless Playwright browser context, then enters an interactive loop where it reads your natural‑language prompts (and optional start URL), invokes the agent_loop to perform the browser‑driven AI task, prints the results, and finally ensures the browser closes cleanly.

In conclusion, by following this guide, you now have a reproducible Colab template that integrates browser automation, LLM reasoning, and secure credential management into a single cohesive pipeline. Whether you’re scraping real‑time market data, summarizing news articles, or automating reporting tasks, the combination of Playwright, browser_use, and LangChain’s Gemini interface provides a flexible foundation for your next AI‑powered project. Feel free to extend the agent’s capabilities, re‑enable GIF recording, add custom navigation steps, or swap in other LLM backends to tailor the workflow precisely to your research or production needs.

Here is the Colab Notebook.

The post An Advanced Coding Implementation: Mastering Browser‑Driven AI in Google Colab with Playwright, browser_use Agent & BrowserContext, LangChain, and Gemini appeared first on MarkTechPost.

Build a FinOps agent using Amazon Bedrock with multi-agent capability …

AI agents are revolutionizing how businesses enhance their operational capabilities and enterprise applications. By enabling natural language interactions, these agents provide customers with a streamlined, personalized experience. Amazon Bedrock Agents uses the capabilities of foundation models (FMs), combining them with APIs and data to process user requests, gather information, and execute specific tasks effectively. The introduction of multi-agent collaboration now enables organizations to orchestrate multiple specialized AI agents working together to tackle complex, multi-step challenges that require diverse expertise.
Amazon Bedrock offers a diverse selection of FMs, allowing you to choose the one that best fits your specific use case. Among these offerings, Amazon Nova stands out as AWS’s next-generation FM, delivering breakthrough intelligence and industry-leading performance at exceptional value.
The Amazon Nova family comprises three types of models:

Understanding models – Available in Micro, Lite, and Pro variants
Content generation models – Featuring Canvas and Reel
Speech-to-Speech model – Nova Sonic

These models are specifically optimized for enterprise and business applications, excelling in the following capabilities:

Text generation
Summarization
Complex reasoning tasks
Content creation

This makes Amazon Nova ideal for sophisticated use cases like our FinOps solution.
A key advantage of the Amazon Nova model family is its industry-leading price-performance ratio. Compared to other enterprise-grade AI models, Amazon Nova offers comparable or superior capabilities at a more competitive price point. This cost-effectiveness, combined with its versatility and performance, makes Amazon Nova an attractive choice for businesses looking to implement advanced AI solutions.
In this post, we use the multi-agent feature of Amazon Bedrock to demonstrate a powerful and innovative approach to AWS cost management. By using the advanced capabilities of Amazon Nova FMs, we’ve developed a solution that showcases how AI-driven agents can revolutionize the way organizations analyze, optimize, and manage their AWS costs.
Solution overview
Our innovative AWS cost management solution uses the power of AI and multi-agent collaboration to provide comprehensive cost analysis and optimization recommendations. The core of the system is built around three key components:

FinOps supervisor agent – Acts as the central coordinator, managing user queries and orchestrating the activities of specialized subordinate agents
Cost analysis agent – Uses AWS Cost Explorer to gather and analyze cost data for specified time ranges
Cost optimization agent – Uses the AWS Trusted Advisor Cost Optimization Pillar to provide actionable cost-saving recommendations

The solution integrates the multi-agent collaboration capabilities of Amazon Bedrock with Amazon Nova to create an intelligent, interactive, cost management AI assistant. This integration enables seamless communication between specialized agents, each focusing on different aspects of AWS cost management. Key features of the solution include:

User authentication through Amazon Cognito with role-based access control
Frontend application hosted on AWS Amplify
Real-time cost insights and historical analysis
Actionable cost optimization recommendations
Parallel processing of tasks for improved efficiency

By combining AI-driven analysis with AWS cost management tools, this solution offers finance teams and cloud administrators a powerful, user-friendly interface to gain deep insights into AWS spending patterns and identify cost-saving opportunities.
The architecture displayed in the following diagram uses several AWS services, including AWS Lambda functions, to create a scalable, secure, and efficient system. This approach demonstrates the potential of AI-driven multi-agent systems to assist with cloud financial management and solve a wide range of cloud management challenges.

In the following sections, we dive deeper into the architecture of our solution, explore the capabilities of each agent, and discuss the potential impact of this approach on AWS cost management strategies.
Prerequisites
You must have the following in place to complete the solution in this post:

An AWS account
FM access in Amazon Bedrock for Amazon Nova Pro and Micro in the same AWS Region where you will deploy this solution
The accompanying AWS CloudFormation template downloaded from the aws-samples GitHub repo

Deploy solution resources using AWS CloudFormation
This CloudFormation template is designed to run in the us-east-1 Region. If you deploy in a different Region, you must configure cross-Region inference profiles to have proper functionality and update the CloudFormation template accordingly.
During the CloudFormation template deployment, you will need to specify three required parameters:

Stack name
FM selection
Valid user email address

Note that AWS resource usage will incur costs. When the stack deployment is complete, the following resources will have been created:

Amazon Cognito resources:

User pool – CognitoUserPoolforFinOpsApp
App client – FinOpsApp
Identity pool – cognito-identity-pool-finops
Groups – Finance
User – Finance User

AWS Identity and Access Management (IAM) resources:

IAM roles:

FinanceUserRestrictedRole
DefaultCognitoAuthenticatedRole

IAM policies:

Finance-BedrockAccess
Default-CognitoAccess

Lambda functions:

TrustedAdvisorListRecommendationResources
TrustedAdvisorListRecommendations
CostAnalysis
ClockandCalendar
CostForecast

Amazon Bedrock agents:

FinOpsSupervisorAgent
CostAnalysisAgent with action groups:

CostAnalysisActionGroup
ClockandCalendarActionGroup
CostForecastActionGroup

CostOptimizationAgent with action groups:

TrustedAdvisorListRecommendationResources
TrustedAdvisorListRecommendations

After you deploy the CloudFormation template, copy the following from the Outputs tab on the AWS CloudFormation console to use during the configuration of your application after it’s deployed in Amplify:

AWSRegion
BedrockAgentAliasId
BedrockAgentId
BedrockAgentName
IdentityPoolId
UserPoolClientId
UserPoolId

The following screenshot shows you what the Outputs tab will look like.

Deploy the Amplify application
You need to manually deploy the Amplify application using the frontend code found on GitHub. Complete the following steps:

Download the frontend code AWS-Amplify-Frontend.zip from GitHub.
Use the .zip file to manually deploy the application in Amplify.
Return to the Amplify page and use the domain it automatically generated to access the application.

Amazon Cognito for user authentication
The FinOps application uses Amazon Cognito user pools and identity pools to implement secure, role-based access control for finance team members. User pools handle authentication and group management, and identity pools provide temporary AWS credentials mapped to specific IAM roles. The system makes sure that only verified finance team members can access the application and interact with the Amazon Bedrock API, combining robust security with a seamless user experience.
Amazon Bedrock Agents with multi-agent capability
The Amazon Bedrock multi-agent architecture enables sophisticated FinOps problem-solving through a coordinated system of AI agents, led by a FinOpsSupervisorAgent. The FinOpsSupervisorAgent coordinates with two key subordinate agents: the CostAnalysisAgent, which handles detailed cost analysis queries, and the CostOptimizationAgent, which handles specific cost optimization recommendations. Each agent focuses on their specialized financial tasks while maintaining contextual awareness, with the FinOpsSupervisorAgent managing communication and synthesizing comprehensive responses from both agents. This coordinated approach enables parallel processing of financial queries and delivers more effective answers than a single agent could provide, while maintaining consistency and accuracy throughout the FinOps interaction.
Lambda functions for Amazon Bedrock action groups
As part of this solution, Lambda functions are deployed to support the action groups defined for each subordinate agent.
The CostAnalysisAgent uses three distinct Lambda backed action groups to deliver comprehensive cost management capabilities. The CostAnalysisActionGroup connects with Cost Explorer to extract and analyze detailed historical cost data, providing granular insights into cloud spending patterns and resource utilization. The ClockandCalendarActionGroup maintains temporal precision by providing current date and time functionality, essential for accurate period-based cost analysis and reporting. The CostForecastActionGroup uses the Cost Explorer forecasting function, which analyzes historical cost data and provides future cost projections. This information helps the agent support proactive budget planning and make informed recommendations. These action groups work together seamlessly, enabling the agent to provide historical cost analysis and future spend predictions while maintaining precise temporal context.
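As an illustration of how such an action group Lambda function can retrieve cost data, the following is a minimal sketch that queries Cost Explorer for monthly cost grouped by service. The parameter extraction and the response envelope are simplified assumptions; the real contract is defined by the agent’s action group schema and the Amazon Bedrock Agents Lambda event format, so consult those when implementing.

import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

def get_monthly_cost_by_service(start: str, end: str) -> list:
    """Return (service, unblended cost) pairs for the given date range (YYYY-MM-DD)."""
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    results = []
    for period in resp["ResultsByTime"]:
        for group in period["Groups"]:
            results.append((group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"]))
    return results

def lambda_handler(event, context):
    # Simplified parameter handling; real action group events carry parameters
    # defined by the agent's API schema.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    costs = get_monthly_cost_by_service(params.get("start", "2025-02-01"),
                                        params.get("end", "2025-03-01"))
    body = "\n".join(f"{service}: {amount} USD" for service, amount in costs)
    return {
        "messageVersion": "1.0",
        "response": {"responseBody": {"TEXT": {"body": body}}},  # simplified envelope
    }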
The CostOptimizationAgent incorporates two Trusted Advisor focused action groups to enhance its recommendation capabilities. The TrustedAdvisorListRecommendationResources action group interfaces with Trusted Advisor to retrieve a comprehensive list of resources that could benefit from optimization, providing a targeted scope for cost-saving efforts. Complementing this, the TrustedAdvisorListRecommendations action group fetches specific recommendations from Trusted Advisor, offering actionable insights on potential cost reductions, performance improvements, and best practices across various AWS services. Together, these action groups empower the agent to deliver data-driven, tailored optimization strategies by using the expertise embedded in Trusted Advisor.
Amplify for frontend
Amplify provides a streamlined solution for deploying and hosting web applications with built-in security and scalability features. The service reduces the complexity of managing infrastructure, allowing developers to concentrate on application development. In our solution, we use the manual deployment capabilities of Amplify to host our frontend application code.
Multi-agent and application walkthrough
To validate the solution before using the Amplify deployed frontend, we can conduct testing directly on the AWS Management Console. By navigating to the FinOpsSupervisorAgent, we can pose a question like “What is my cost for Feb 2025 and what are my current cost savings opportunity?” This query demonstrates the multi-agent orchestration in action. As shown in the following screenshot, the FinOpsSupervisorAgent coordinates with both the CostAnalysisAgent (to retrieve February 2025 cost data) and the CostOptimizationAgent (to identify current cost savings opportunities). This illustrates how the FinOpsSupervisorAgent effectively delegates tasks to specialized agents and synthesizes their responses into a comprehensive answer, showcasing the solution’s integrated approach to FinOps queries.
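You can run the same test programmatically with the Amazon Bedrock Agents runtime API, using the BedrockAgentId and BedrockAgentAliasId values from the CloudFormation outputs. The following is a minimal sketch; the placeholder IDs and Region are assumptions.

import uuid
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.invoke_agent(
    agentId="YOUR_BEDROCK_AGENT_ID",          # BedrockAgentId output
    agentAliasId="YOUR_BEDROCK_AGENT_ALIAS",  # BedrockAgentAliasId output
    sessionId=str(uuid.uuid4()),
    inputText="What is my cost for Feb 2025 and what are my current cost savings opportunity?",
)

# The completion is returned as an event stream of chunks
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)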

Navigate to the URL provided after you created the application in Amplify. Upon accessing the application URL, you will be prompted to provide information related to Amazon Cognito and Amazon Bedrock Agents. This information is required to securely authenticate users and allow the frontend to interact with the Amazon Bedrock agent. It enables the application to manage user sessions and make authorized API calls to AWS services on behalf of the user.
You can enter information with the values you collected from the CloudFormation stack outputs. You will be required to enter the following fields, as shown in the following screenshot:

User Pool ID
User Pool Client ID
Identity Pool ID
Region
Agent Name
Agent ID
Agent Alias ID
Region

You need to sign in with your user name and password. A temporary password was automatically generated during deployment and sent to the email address you provided when launching the CloudFormation template. At first sign-in attempt, you will be asked to reset your password, as shown in the following video.

Now you can start asking the same question in the application, for example, “What is my cost for February 2025 and what are my current cost savings opportunity?” In a few seconds, the application will provide you with detailed results showing service spend for the particular month and savings opportunities. The following video shows this chat.

You can further dive into the details you got by asking a follow-up question such as “Can you give me the details of the EC2 instances that are underutilized?” and it will return the details for each of the Amazon Elastic Compute Cloud (Amazon EC2) instances that it found underutilized.

The following are a few additional sample queries to demonstrate the capabilities of this tool:

What is my top services cost in June 2024?
In the past 6 months, how much did I spend on VPC cost?
What is my current savings opportunity?

Clean up
If you decide to discontinue using the FinOps application, you can follow these steps to remove it, its associated resources deployed using AWS CloudFormation, and the Amplify deployment:

Delete the CloudFormation stack:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Locate the stack you created during the deployment process (you assigned a name to it).
Select the stack and choose Delete.

Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Considerations
For optimal visibility across your organization, deploy this solution in your AWS payer account to access cost details for your linked accounts through Cost Explorer.
Trusted Advisor cost optimization visibility is limited to the account where you deploy this solution. To expand its scope, enable Trusted Advisor at the AWS organization level and modify this solution accordingly.
Before deploying to production, enhance security by implementing additional safeguards. You can do this by associating guardrails with your agent in Amazon Bedrock.
Conclusion
The integration of the multi-agent capability of Amazon Bedrock with Amazon Nova demonstrates the transformative potential of AI in AWS cost management. Our FinOps agent solution showcases how specialized AI agents can work together to deliver comprehensive cost analysis, forecasting, and optimization recommendations in a secure and user-friendly environment. This implementation not only addresses immediate cost management challenges, but also adapts to evolving cloud financial operations. As AI technologies advance, this approach sets a foundation for more intelligent and proactive cloud management strategies across various business operations.
Additional resources
To learn more about Amazon Bedrock, refer to the following resources:

Introducing multi-agent collaboration capability for Amazon Bedrock
Unlocking complex problem-solving with multi-agent collaboration on Amazon Bedrock
Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance

About the Author
Salman Ahmed is a Senior Technical Account Manager in AWS Enterprise Support. He specializes in guiding customers through the design, implementation, and support of AWS solutions. Combining his networking expertise with a drive to explore new technologies, he helps organizations successfully navigate their cloud journey. Outside of work, he enjoys photography, traveling, and watching his favorite sports teams.
Ravi Kumar is a Senior Technical Account Manager in AWS Enterprise Support who helps customers in the travel and hospitality industry to streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience. In his free time, Ravi enjoys creative activities like painting. He also likes playing cricket and traveling to new places.
Sergio Barraza is a Senior Technical Account Manager at AWS, helping customers design and optimize cloud solutions. With more than 25 years in software development, he guides customers through AWS services adoption. Outside work, Sergio is a multi-instrument musician playing guitar, piano, and drums, and he also practices Wing Chun Kung Fu.
Ankush Goyal is an Enterprise Support Lead in AWS Enterprise Support who helps customers streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience.

Stream ingest data from Kafka to Amazon Bedrock Knowledge Bases using …

Retrieval Augmented Generation (RAG) enhances AI responses by combining the generative AI model’s capabilities with information from external data sources, rather than relying solely on the model’s built-in knowledge. In this post, we showcase the custom data connector capability in Amazon Bedrock Knowledge Bases that makes it straightforward to build RAG workflows with custom input data. Through this capability, Amazon Bedrock Knowledge Bases supports the ingestion of streaming data, which means developers can add, update, or delete data in their knowledge base through direct API calls.
Consider examples such as clickstream data, credit card swipes, Internet of Things (IoT) sensor data, log analysis, and commodity prices, where both current data and historical trends are important to make an informed decision. Previously, to feed such critical data inputs, you had to first stage it in a supported data source and then either initiate or schedule a data sync job. Based on the quality and quantity of the data, the time to complete this process varied. With custom data connectors, you can quickly ingest specific documents from custom data sources without requiring a full sync, and ingest streaming data without the need for intermediary storage. By avoiding time-consuming full syncs and storage steps, you gain faster access to data, reduced latency, and improved application performance.
With streaming ingestion using custom connectors, Amazon Bedrock Knowledge Bases processes such streaming data without using an intermediary data source, making it available almost immediately. This feature chunks and converts input data into embeddings using your chosen Amazon Bedrock model and stores everything in the backend vector database. This automation applies to both newly created and existing databases, streamlining your workflow so you can focus on building AI applications without worrying about orchestrating data chunking, embeddings generation, or vector store provisioning and indexing. Additionally, this feature provides the ability to ingest specific documents from custom data sources, all while reducing latency and alleviating operational costs for intermediary storage.
Amazon Bedrock
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and RAG, and build agents that execute tasks using your enterprise systems and data sources.
Amazon Bedrock Knowledge Bases
Amazon Bedrock Knowledge Bases allows organizations to build fully managed RAG pipelines by augmenting contextual information from private data sources to deliver more relevant, accurate, and customized responses. With Amazon Bedrock Knowledge Bases, you can build applications that are enriched by the context received from querying a knowledge base. It enables a faster time to product release by abstracting away the heavy lifting of building pipelines and providing an out-of-the-box RAG solution, thus reducing the build time for your application.
Amazon Bedrock Knowledge Bases custom connector
Amazon Bedrock Knowledge Bases supports custom connectors and the ingestion of streaming data, which means you can add, update, or delete data in your knowledge base through direct API calls.
Solution overview: Build a generative AI stock price analyzer with RAG
For this post, we implement a RAG architecture with Amazon Bedrock Knowledge Bases using a custom connector and topics built with Amazon Managed Streaming for Apache Kafka (Amazon MSK), for a user who is interested in understanding stock price trends. Amazon MSK is a streaming data service that manages Apache Kafka infrastructure and operations, making it straightforward to run Apache Kafka applications on Amazon Web Services (AWS). The solution enables real-time analysis of stock price data through vector embeddings and large language models (LLMs).
The following architecture diagram has two components:
Preprocessing streaming data workflow noted in letters on the top of the diagram:

To mimic streaming input, upload a .csv file with stock price data to the MSK topic
The upload automatically triggers the consumer AWS Lambda function
The function ingests the consumed data into the knowledge base
The knowledge base internally transforms the data into a vector index using the embeddings model
The knowledge base stores the vector index in the vector database

Runtime execution during user queries noted in numerals at the bottom of the diagram:

The user queries stock prices
The foundation model uses the knowledge base to search for an answer
The knowledge base returns relevant documents
The user receives a relevant answer

Implementation design
The implementation follows these high-level steps:

Data source setup – Configure an MSK topic that streams input stock prices
Amazon Bedrock Knowledge Bases setup – Create a knowledge base in Amazon Bedrock using the quick create a new vector store option, which automatically provisions and sets up the vector store
Data consumption and ingestion – As and when data lands in the MSK topic, trigger a Lambda function that extracts stock indices, prices, and timestamp information and feeds into the custom connector for Amazon Bedrock Knowledge Bases
Test the knowledge base – Evaluate stock price analysis using the knowledge base

Solution walkthrough
To build a generative AI stock analysis tool with the Amazon Bedrock Knowledge Bases custom connector, follow the instructions in the following sections.
Configure the architecture
To try this architecture, deploy the AWS CloudFormation template from this GitHub repository in your AWS account. This template deploys the following components:

Virtual private clouds (VPCs), subnets, security groups, and AWS Identity and Access Management (IAM) roles
An MSK cluster hosting the Apache Kafka input topic
A Lambda function to consume Apache Kafka topic data
An Amazon SageMaker Studio notebook for granular setup and enablement

Create an Apache Kafka topic
In the precreated MSK cluster, the required brokers are deployed and ready for use. The next step is to use a SageMaker Studio terminal instance to connect to the MSK cluster and create the test stream topic. In this step, follow the detailed instructions in Create a topic in the Amazon MSK cluster. The following are the general steps involved:

Download and install the latest Apache Kafka client
Connect to the MSK cluster broker instance
Create the test stream topic on the broker instance

Create a knowledge base in Amazon Bedrock
To create a knowledge base in Amazon Bedrock, follow these steps:

On the Amazon Bedrock console, in the left navigation page under Builder tools, choose Knowledge Bases.

To initiate knowledge base creation, on the Create dropdown menu, choose Knowledge Base with vector store, as shown in the following screenshot.

In the Provide Knowledge Base details pane, enter BedrockStreamIngestKnowledgeBase as the Knowledge Base name.
Under IAM permissions, choose the default option, Create and use a new service role, and (optional) provide a Service role name, as shown in the following screenshot.

On the Choose data source pane, select Custom as the data source where your dataset is stored.
Choose Next, as shown in the following screenshot.

On the Configure data source pane, enter BedrockStreamIngestKBCustomDS as the Data source name.
Under Parsing strategy, select Amazon Bedrock default parser and for Chunking strategy, choose Default chunking. Choose Next, as shown in the following screenshot.

On the Select embeddings model and configure vector store pane, for Embeddings model, choose Titan Text Embeddings v2. For Embeddings type, choose Floating-point vector embeddings. For Vector dimensions, select 1024, as shown in the following screenshot. Make sure you have requested and received access to the chosen FM in Amazon Bedrock. To learn more, refer to Add or remove access to Amazon Bedrock foundation models.

On the Vector database pane, select Quick create a new vector store and choose the new Amazon OpenSearch Serverless option as the vector store.

On the next screen, review your selections. To finalize the setup, choose Create.
Within a few minutes, the console will display your newly created knowledge base.

Configure AWS Lambda Apache Kafka consumer
Now, using API calls, you configure the consumer Lambda function so it gets triggered as soon as the input Apache Kafka topic receives data.

Configure the manually created Amazon Bedrock knowledge base ID and its custom data source ID as environment variables within the Lambda function. When you use the sample notebook, the referenced function names and IDs are filled in automatically.

response = lambda_client.update_function_configuration(
    FunctionName=<Consumer Lambda Function Name>,
    Environment={
        'Variables': {
            'KBID': <Knowledge Base ID>,
            'DSID': <Data Source ID>
        }
    }
)

When it’s completed, you tie the Lambda consumer function to listen for events in the source Apache Kafka topic:

response = lambda_client.create_event_source_mapping(
    EventSourceArn=<MSK Cluster's ARN>,
    FunctionName=<Consumer Lambda Function Name>,
    StartingPosition='LATEST',
    Enabled=True,
    Topics=['streamtopic']
)

Review AWS Lambda Apache Kafka consumer
The Apache Kafka consumer Lambda function reads data from the Apache Kafka topic, decodes it, extracts stock price information, and ingests it into the Amazon Bedrock knowledge base using the custom connector.

Extract the knowledge base ID and the data source ID:

kb_id = os.environ['KBID']
ds_id = os.environ['DSID']

Define a Python function to decode input events:

def decode_payload(event_data):
    agg_data_bytes = base64.b64decode(event_data)
    decoded_data = agg_data_bytes.decode(encoding="utf-8")
    event_payload = json.loads(decoded_data)
    return event_payload

Decode and parse the required data from the input event received from the Apache Kafka topic, then use it to create a payload to be ingested into the knowledge base:

records = event['records']['streamtopic-0']
for rec in records:
    # Each record has a separate eventID, etc.
    event_payload = decode_payload(rec['value'])
    ticker = event_payload['ticker']
    price = event_payload['price']
    timestamp = event_payload['timestamp']
    myuuid = uuid.uuid4()
    payload_ts = datetime.utcfromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
    payload_string = "At " + payload_ts + " the price of " + ticker + " is " + str(price) + "."

Ingest the payload into Amazon Bedrock Knowledge Bases using the custom connector:

response = bedrock_agent_client.ingest_knowledge_base_documents(
    knowledgeBaseId=kb_id,
    dataSourceId=ds_id,
    documents=[
        {
            'content': {
                'custom': {
                    'customDocumentIdentifier': {
                        'id': str(myuuid)
                    },
                    'inlineContent': {
                        'textContent': {
                            'data': payload_string
                        },
                        'type': 'TEXT'
                    },
                    'sourceType': 'IN_LINE'
                },
                'dataSourceType': 'CUSTOM'
            }
        }
    ]
)

Testing
Now that the required setup is done, you trigger the workflow by ingesting test data into the Apache Kafka topic hosted on the MSK cluster. For best results, repeat this section, changing the .csv input file to show stock price increases or decreases.

Prepare the test data. In my case, I had the following data input as a .csv file with a header.

ticker    price
OOOO      $44.50
ZVZZT     $3,413.23
ZNTRX     $22.34
ZNRXX     $208.76
NTEST     $0.45
ZBZX      $36.23
ZEXIT     $942.34
ZIEXT     $870.23
ZTEST     $23.75
ZVV       $2,802.86
ZXIET     $63.00
ZAZZT     $18.86
ZBZZT     $998.26
ZCZZT     $72.34
ZVZZC     $90.32
ZWZZT     $698.24
ZXZZT     $932.32

Define a Python function to put data on the topic. Use the pykafka client to ingest data:

from pykafka import KafkaClient

def put_to_topic(kafka_host, topic_name, ticker, amount, timestamp):
    client = KafkaClient(hosts=kafka_host)
    topic = client.topics[topic_name]
    payload = {
        'ticker': ticker,
        'price': amount,
        'timestamp': timestamp
    }
    ret_status = True
    data = json.dumps(payload)
    encoded_message = data.encode("utf-8")
    print(f'Sending ticker data: {ticker}...')
    with topic.get_sync_producer() as producer:
        result = producer.produce(encoded_message)
    return ret_status

Read the .csv file and push the records to the topic:

df = pd.read_csv('TestData.csv')
start_test_time = time.time()
print(datetime.utcfromtimestamp(start_test_time).strftime('%Y-%m-%d %H:%M:%S'))
df = df.reset_index()
for index, row in df.iterrows():
    put_to_topic(BootstrapBrokerString, KafkaTopic, row['ticker'], row['price'], time.time())
end_test_time = time.time()
print(datetime.utcfromtimestamp(end_test_time).strftime('%Y-%m-%d %H:%M:%S'))

Verification
If the data ingestion and subsequent processing are successful, navigate to the Amazon Bedrock Knowledge Bases data source page to check the uploaded information.
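You can also verify the ingestion programmatically. The following is a minimal sketch that lists the documents ingested through the custom data source, assuming the bedrock-agent client and the kb_id and ds_id values used earlier; the documentDetails field shown here is an assumption about the response shape, so check the API reference for the exact structure:

import boto3

bedrock_agent_client = boto3.client('bedrock-agent')

# List the documents ingested into the custom data source
response = bedrock_agent_client.list_knowledge_base_documents(
    knowledgeBaseId=kb_id,
    dataSourceId=ds_id
)

# Print the identifier and indexing status of each ingested document (assumed field names)
for doc in response.get('documentDetails', []):
    print(doc.get('identifier'), doc.get('status'))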

Querying the knowledge base
Within the Amazon Bedrock Knowledge Bases console, you can query the ingested data immediately, as shown in the following screenshot.

To do that, select an Amazon Bedrock FM that you have access to. In my case, I chose Amazon Nova Lite 1.0, as shown in the following screenshot.

When it’s completed, the question, “How is ZVZZT trending?”, yields the results based on the ingested data. Note how Amazon Bedrock Knowledge Bases shows how it derived the answer, even pointing to the granular data element from its source.
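You can also query the knowledge base programmatically with the RetrieveAndGenerate API from the bedrock-agent-runtime client. The following is a minimal sketch; the Amazon Nova Lite model ARN is an assumption and should be replaced with the model and Region you have access to:

import boto3

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')

response = bedrock_agent_runtime.retrieve_and_generate(
    input={'text': 'How is ZVZZT trending?'},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': kb_id,  # knowledge base ID from the earlier steps
            # Assumed model ARN; replace with a model you have access to
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0'
        }
    }
)

# The generated answer, grounded in the ingested stock price data
print(response['output']['text'])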

Cleanup
To avoid paying for unused resources, delete and clean up the resources you created.

Delete the Amazon Bedrock knowledge base.
Delete the automatically created Amazon OpenSearch Serverless collection.
Delete the automatically created Amazon Elastic File System (Amazon EFS) shares backing the SageMaker Studio environment.
Delete the automatically created security groups associated with the Amazon EFS share. You might need to remove the inbound and outbound rules before they can be deleted.
Delete the automatically created elastic network interfaces attached to the Amazon MSK security group for Lambda traffic.
Delete the automatically created Amazon Bedrock Knowledge Bases execution IAM role.
Stop the kernel instances in Amazon SageMaker Studio.
Delete the CloudFormation stack.

Conclusion
In this post, we showed you how Amazon Bedrock Knowledge Bases supports custom connectors and the ingestion of streaming data, through which developers can add, update, or delete data in their knowledge base through direct API calls. Amazon Bedrock Knowledge Bases offers fully managed, end-to-end RAG workflows to create highly accurate, low-latency, secure, and custom generative AI applications by incorporating contextual information from your company’s data sources. With this capability, you can quickly ingest specific documents from custom data sources without requiring a full sync, and ingest streaming data without the need for intermediary storage.
Send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS contacts, and engage with the generative AI builder community at community.aws.

About the Author
Prabhakar Chandrasekaran is a Senior Technical Account Manager with AWS Enterprise Support. Prabhakar enjoys helping customers build cutting-edge AI/ML solutions on the cloud. He also works with enterprise customers providing proactive guidance and operational assistance, helping them improve the value of their solutions when using AWS. Prabhakar holds eight AWS and seven other professional certifications. With over 22 years of professional experience, Prabhakar was a data engineer and a program leader in the financial services space prior to joining AWS.

Add Zoom as a data accessor to your Amazon Q index

For many organizations, vast amounts of enterprise knowledge are scattered across diverse data sources and applications. Organizations across industries seek to use this cross-application enterprise data from within their preferred systems while adhering to their established security and governance standards.
This post demonstrates how Zoom users can access their Amazon Q Business enterprise data directly within their Zoom interface, alleviating the need to switch between applications while maintaining enterprise security boundaries. Organizations can now configure Zoom as a data accessor in Amazon Q Business, enabling seamless integration between their Amazon Q index and Zoom AI Companion. This integration allows users to access their enterprise knowledge in a controlled manner directly within the Zoom platform.
How Amazon Q Business and Zoom AI Companion work together
The Amazon Q Business data accessor is a core component within Amazon Q Business. It manages and controls access to data stored in an enterprise’s internal knowledge repositories on Amazon Q Business from an external independent software vendor (ISV) such as Zoom while maintaining security and data access compliance. This feature allows Zoom to retrieve relevant content, enhancing the Zoom AI Companion’s knowledge. It serves as an intermediary that enforces access control lists (ACLs), defining both data source permissions and user access rights to the existing Amazon Q Business index.
Zoom AI Companion, the foundation of Zoom’s AI-first work platform, enhances human connection by working behind the scenes to boost productivity, improve work quality, and strengthen relationships. This April, Zoom launched the Custom AI Companion add-on, enabling organizations to customize AI agents and skills to help meet their specific needs and drive company-wide efficiency. Through its partnership with Amazon Q Business, customers can now connect their indexed data in Amazon Q index to Zoom AI Companion, providing enhanced knowledge and contextual insights.
As an Amazon Q Business data accessor, Zoom AI Companion can interact with the enterprise Amazon Q index in a managed way, enriching content beyond what’s available in Zoom alone. Enterprise users can retrieve contextual information from their Amazon Q index’s multiple connected data sources directly within Zoom, with results seamlessly presented through Zoom AI Companion. Zoom AI Companion can access Amazon Q index data with its native data sources, such as previous call transcripts, to quickly surface relevant information to users. This integration alleviates the need to manually switch between various enterprise systems like Google Drive, Confluence, Salesforce, and more, saving time and reducing workflow disruptions.
For example, while preparing for a Zoom call, users can quickly find answers to questions like “When is customer AnyCustomer’s contract up for renewal, and who signed the last one?” The Amazon Q index processes these queries and delivers results through Zoom AI Companion in real time.
Solution overview
The following diagram is a high-level architecture that explains how enterprises can set up and access Amazon Q Business indexed data from within the Zoom AI Companion application.

In the following sections, we demonstrate how to configure Zoom as a data accessor and get started using Zoom AI Companion.
Prerequisites
To implement this solution, you need an AWS account with appropriate permissions.
Create an Amazon Q Business application
To access indexed data from Amazon Q Business through Zoom AI Companion, organizations must first set up their Amazon Q Business application. The application must be configured with AWS IAM Identity Center to enable the Zoom data accessor functionality. For detailed guidance on creating an Amazon Q Business application, refer to Configure application.
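If you prefer to script this setup rather than use the console, the following sketch shows the general shape of the boto3 calls; the display names and the IAM Identity Center instance ARN are placeholders, and the console walkthrough in Configure application remains the authoritative reference:

import boto3

qbusiness = boto3.client('qbusiness')

# Create the application tied to IAM Identity Center (placeholder instance ARN)
app = qbusiness.create_application(
    displayName='my-qbusiness-app',
    identityCenterInstanceArn='arn:aws:sso:::instance/ssoins-EXAMPLE'
)

# Create an index and a native retriever for the application
index = qbusiness.create_index(
    applicationId=app['applicationId'],
    displayName='my-qbusiness-index'
)

retriever = qbusiness.create_retriever(
    applicationId=app['applicationId'],
    type='NATIVE_INDEX',
    displayName='my-qbusiness-retriever',
    configuration={'nativeIndexConfiguration': {'indexId': index['indexId']}}
)

print(app['applicationId'], index['indexId'], retriever['retrieverId'])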
Configure access control with IAM Identity Center
Through IAM Identity Center, Amazon Q Business uses trusted identity propagation to provide proper authentication and fine-grained authorization based on user ID and group-based resources, making sure access to sensitive data is tightly controlled and document ACLs are enforced. The ISV is only permitted to access this index using the assigned data accessor.
If you’re using an identity provider (IdP) such as Okta, CyberArk, or others, you can add the IdP to IAM Identity Center as a trusted token issuer. For additional information, see Configure Amazon Q Business with AWS IAM Identity Center trusted identity propagation.
For more information on IAM Identity Center, refer to IAM Identity Center identity source tutorials.
Add Zoom as a data accessor
After creating an Amazon Q Business application with IAM Identity Center, administrators can configure Zoom as a data accessor through the Amazon Q Business console. Complete the following steps:

 On the Amazon Q Business console, choose Data accessors in the navigation pane.
Choose Add data accessor.
Choose Zoom as your data accessor.
For Accessor name, enter a name for your data accessor.
For Data source access, configure your level of access.

You can select specific data sources to be available through the data accessor. This allows you to control which content is surfaced in the ISV environment. You can use Amazon Q Business pre-built connectors to synchronize content from various systems. For more information, refer to Supported connectors.

For User access, specify which users can access the Amazon Q index through the data accessor.

This option enables you to configure granular permissions for data accessor accessibility and manage organizational access controls.
For more information about data access, refer to Accessing a customer’s Amazon Q index as a data accessor using cross-account access.

Administrators can modify data accessor settings at any time after implementation. You can adjust user access permissions, update available data sources, and change the scope of accessibility. To revoke access, complete the following steps:

On the Amazon Q Business console, choose Data accessors in the navigation pane.
Locate the accessor you want to delete and choose Delete.
Confirm the deletion when prompted.

Removing a data accessor from a data source immediately cancels the ISV’s access to your organization’s Amazon Q index.
Configure Amazon Q for Zoom AI Companion
To start using Zoom as a data accessor for your Amazon Q Business index, the following information from your enterprise Amazon Q Business application must be shared with Zoom:

Amazon Q Business application ID
Amazon Q Business AWS Region
Amazon Q Business retriever ID
Data accessor application Amazon Resource Name (ARN)
IAM Identity Center instance Region

For more information, refer to Accessing a customer’s Amazon Q index as a data accessor using cross-account access.
After you add Zoom as a data accessor, a pop-up window will appear on the Amazon Q Business console. This pop-up contains the required parameters, as shown in the following screenshot.

Navigate to the Zoom App Marketplace to configure Amazon Q in Zoom, and enter the information you collected.

After you submit this information, you’re ready to access Amazon Q index data from Zoom AI Companion.
With AI Companion connected to Amazon Q index, you have the information you need instantly. For example, you could make AI Companion aware of your organization’s IT troubleshooting guides so employees could quickly get help with questions like “How do I fix a broken keyboard?”

Using the SearchRelevantContent API
When an enterprise customer with an Amazon Q index enables a data accessor, it allows authenticated Amazon Q Business users to search and retrieve relevant content in real time while using external ISV platforms (like Zoom). This functionality is achieved through the ISV calling the Amazon Q index SearchRelevantContent API as an external data accessor across accounts. The SearchRelevantContent API is specifically designed to return search results from the Amazon Q index, which can be further enhanced by the ISV’s generative AI stack. By using the Amazon Q index SearchRelevantContent API, Zoom and other ISVs can integrate query results directly into their environment.
The SearchRelevantContent API is an identity-aware API, which means it operates with knowledge of the user’s identity and associated information (such as email and group membership) through the credentials used to call the API. This identity awareness is a prerequisite for using the API. When querying the index, it reconciles document access controls against the authenticated user’s permissions. As a result, users can only retrieve results from content they are authorized to access.
When an ISV calls the SearchRelevantContent API as a data accessor, both sparse and dense searches are applied to the Amazon Q index, combining keyword search and vector embedding proximity. Results are ranked before being returned to the ISV interface.
For example, if you ask in Zoom, “What is Company XYZ’s engagement on the cancer moonshot project?”, Zoom AI Companion triggers a call to the SearchRelevantContent API as a data accessor.
For a more comprehensive code example, see the notebook in Module 2 – Amazon Q cross-app index.
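At a high level, the ISV first obtains credentials in the customer's account before it can create the qbiz client used below. The following simplified sketch only illustrates the cross-account call shape; the role ARN is a placeholder, and the production flow additionally carries the end user's identity context through IAM Identity Center trusted identity propagation, which is covered in the notebook referenced above:

import boto3

# Assume the cross-account role granted to the data accessor (placeholder ARN)
sts = boto3.client('sts')
creds = sts.assume_role(
    RoleArn='arn:aws:iam::111122223333:role/QBusinessDataAccessorRole',
    RoleSessionName='isv-data-accessor-session'
)['Credentials']

# Create an Amazon Q Business client in the customer's account with the temporary credentials
qbiz = boto3.client(
    'qbusiness',
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken']
)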
The following is a code snippet in Python showing what that search request might look like:

search_params = {
    'applicationId': Q_BIZ_APP_ID,
    'contentSource': {
        'retriever': {
            'retrieverId': Q_RETRIEVER_ID
        }
    },
    'queryText': 'What is Company XYZ engagement on the cancer moonshot project?',
    'maxResults': 10
}

search_response = qbiz.search_relevant_content(**search_params)

The search response will contain an array of results with relevant chunks of text, along with source information, document attributes, and confidence scores. The following is a snippet from the SearchRelevantContent API response. This is an example of results you might see from the web crawler data connector used with Amazon Q Business.

[
    {
        "content": "\nSeveral initiatives have been launched or will soon launch to address the goals of this next phase, including:\nIncluding more people in expanded and modernized cancer clinical trials\nIncreasing the pipeline of new cancer drugs\nEnsuring access to current and new standards of cancer care\nEnhancing diversity in the cancer research workforce",
        "documentId": "Cancermoonshot",
        "documentTitle": "About The Cancer Moonshot",
        "documentUri": "https://companyxyz/cancermoonshot.html",
        "documentAttributes": [
            {
                "name": "_source_uri",
                "value": {
                    "stringValue": "https://companyxyz.com/cancermoonshot.html"
                }
            }
        ],
        "scoreAttributes": {
            "scoreConfidence": "VERY_HIGH"
        }
    },
    ...
]

The SearchRelevantContent API has a rich set of optional parameters that ISVs can choose to use. For example, document attributes can be used as filters. If documents have been indexed with metadata attributes, and one of those attributes contains the author, an ISV can apply a filter that specifies an author name. In the following example, returned results are constrained to only documents that have the specified attribute author name “John Smith.”

search_params = {
    'applicationId': Q_BIZ_APP_ID,
    'contentSource': {
        'retriever': {
            'retrieverId': Q_RETRIEVER_ID
        }
    },
    'queryText': myQuestion,
    'maxResults': 5,
    'attributeFilter': {
        'equalsTo': {
            'name': 'Author',
            'value': {
                'stringValue': 'John Smith'
            }
        }
    }
}
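Filters can also be combined. The following sketch, which assumes the same qbiz client and variables as the previous snippets, uses the andAllFilters operator to constrain results to documents by a given author that also carry a hypothetical Department attribute used here purely for illustration:

search_params = {
    'applicationId': Q_BIZ_APP_ID,
    'contentSource': {
        'retriever': {
            'retrieverId': Q_RETRIEVER_ID
        }
    },
    'queryText': myQuestion,
    'maxResults': 5,
    'attributeFilter': {
        'andAllFilters': [
            {'equalsTo': {'name': 'Author', 'value': {'stringValue': 'John Smith'}}},
            # 'Department' is a hypothetical document attribute used for illustration
            {'equalsTo': {'name': 'Department', 'value': {'stringValue': 'Oncology'}}}
        ]
    }
}

search_response = qbiz.search_relevant_content(**search_params)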

For a more comprehensive reference on what is available in the SearchRelevantContent API request object, refer to search_relevant_content.
Clean up
When you’re done using this solution, clean up the resources you created.

Delete the Zoom data accessor from the Data accessors console. Deleting this data accessor will delete permissions and access to the data accessor for all users.
Delete the Amazon Q Business application that you created as a prerequisite.

Navigate to the Amazon Q Business console.
Choose Applications on the left menu.
Select the application you created.
Choose Delete from under Actions to delete the application.

Deleting the Amazon Q Business application will remove the associated index and data source connectors, and prevent incurring additional costs.
Conclusion
An Amazon Q index offers a transformative approach to workplace efficiency. By creating a centralized, secure repository for your organization’s data, you can seamlessly integrate vital information with your everyday productivity tools like Zoom AI Companion.
In this post, we explored how Amazon Q Business enterprise users can add data accessors to integrate with external parties like Zoom AI Companion, allowing users to access their enterprise knowledge in a managed way directly from within those platforms.
Ready to supercharge your workforce’s productivity? Start your Amazon Q Business journey today alongside Zoom. To learn more about Amazon Q Business data accessors, see Enhance enterprise productivity for your LLM solution by becoming an Amazon Q Business data accessor.

About the authors
David Girling is a Senior AI/ML Solutions Architect with over 20 years of experience in designing, leading, and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate, and utilize these highly capable services with their data for their use cases.
Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS, with a core focus on generative AI. She helps Independent Software Vendors (ISVs) accelerate the adoption of generative AI by designing scalable and impactful solutions. With a strong background in applied mathematics and machine learning, she specializes in intelligent document processing and AI-driven innovation. Outside of work, she enjoys salsa and bachata dancing.
Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Build a computer vision-based asset inventory application with low or …

Keeping an up-to-date asset inventory with real devices deployed in the field can be a challenging and time-consuming task. Many electricity providers use manufacturer’s labels as key information to link their physical assets within asset inventory systems. Computer vision can be a viable solution to speed up operator inspections and reduce human errors by automatically extracting relevant data from the label. However, building a standard computer vision application capable of managing hundreds of different types of labels can be a complex and time-consuming endeavor.
In this post, we present a solution that uses generative AI and large language models (LLMs) to alleviate the time-consuming and labor-intensive tasks required to build a computer vision application. With it, you can immediately start taking pictures of your asset labels and extracting the necessary information to update the inventory, using AWS services like AWS Lambda, Amazon Bedrock, Amazon Titan, Anthropic’s Claude 3 on Amazon Bedrock, Amazon API Gateway, AWS Amplify, Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.
LLMs are large deep learning models that are pre-trained on vast amounts of data. They are capable of understanding and generating human-like text, making them incredibly versatile tools with a wide range of applications. This approach harnesses the image understanding capabilities of Anthropic’s Claude 3 model to extract information directly from photographs taken on-site, by analyzing the labels present in those field images.
Solution overview
The AI-powered asset inventory labeling solution aims to streamline the process of updating inventory databases by automatically extracting relevant information from asset labels through computer vision and generative AI capabilities. The solution uses various AWS services to create an end-to-end system that enables field technicians to capture label images, extract data using AI models, verify the accuracy, and seamlessly update the inventory database.
The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The process starts when an operator takes and uploads a picture of the assets using the mobile app.
The operator submits a request to extract data from the asset image.
A Lambda function retrieves the uploaded asset image from the uploaded images data store.
The function generates the asset image embeddings (vector representations of data) by invoking the Amazon Titan Multimodal Embeddings G1 model.
The function performs a similarity search in the knowledge base to retrieve similar asset labels. The most relevant results will augment the prompt as similar examples to improve the response accuracy, and are sent with the instructions to the LLM to extract data from the asset image.
The function invokes Anthropic’s Claude 3 Sonnet on Amazon Bedrock to extract data (serial number, vendor name, and so on) using the augmented prompt and the related instructions.
The function sends the response to the mobile app with the extracted data.
The mobile app verifies the extracted data and assigns a confidence level. It invokes the API to process the data. Data with high confidence will be directly ingested into the system.
A Lambda function is invoked to update the asset inventory database with the extracted data if the confidence level has been indicated as high by the mobile app.
The function sends data with low confidence to Amazon Augmented AI (Amazon A2I) for further processing.
The human reviewers from Amazon A2I validate or correct the low-confidence data.
Human reviewers, such as subject matter experts, validate the extracted data, flag it, and store it in an S3 bucket.
A rule in Amazon EventBridge is defined to trigger a Lambda function to get the information from the S3 bucket when the Amazon A2I workflow processing is complete.
A Lambda function processes the output of the Amazon A2I workflow by loading data from the JSON file that stored the backend operator-validated information.
The function updates the asset inventory database with the new extracted data.
The function sends the extracted data marked as new by human reviewers to an Amazon Simple Queue Service (Amazon SQS) queue to be further processed.
Another Lambda function fetches messages from the queue and serializes the updates to the knowledge base database.
The function generates the asset image embeddings by invoking the Amazon Titan Multimodal Embeddings G1 model.
The function updates the knowledge base with the generated embeddings and notifies other functions that the database has been updated.

Let’s look at the key components of the solution in more detail.
Mobile app
The mobile app component plays a crucial role in this AI-powered asset inventory labeling solution. It serves as the primary interface for field technicians on their tablets or mobile devices to capture and upload images of asset labels using the device’s camera. The implementation of the mobile app includes an authentication mechanism that will allow access only to authenticated users. It’s also built using a serverless approach to minimize recurring costs and have a highly scalable and robust solution.
The mobile app has been built using the following services:

AWS Amplify – This provides a development framework and hosting for the static content of the mobile app. By using Amplify, the mobile app component benefits from features like seamless integration with other AWS services, offline capabilities, secure authentication, and scalable hosting.
Amazon Cognito – This handles user authentication and authorization for the mobile app.

AI data extraction service
The AI data extraction service is designed to extract critical information, such as manufacturer name, model number, and serial number from images of asset labels.
To enhance the accuracy and efficiency of the data extraction process, the service employs a knowledge base comprising sample label images and their corresponding data fields. This knowledge base serves as a reference guide for the AI model, enabling it to learn and generalize from labeled examples to new label formats effectively. The knowledge base is stored as vector embeddings in a high-performance vector database: Meta’s FAISS (Facebook AI Similarity Search), hosted on Amazon S3.
Embeddings are dense numerical representations that capture the essence of complex data like text or images in a vector space. Each data point is mapped to a vector or ordered list of numbers, where similar data points are positioned closer together. This embedding space allows for efficient similarity calculations by measuring the distance between vectors. Embeddings enable machine learning (ML) models to effectively process and understand relationships within complex data, leading to improved performance on various tasks like natural language processing and computer vision.
The following diagram illustrates an example workflow.

The vector embeddings are generated using Amazon Titan, a powerful embedding generation service, which converts the labeled examples into numerical representations suitable for efficient similarity searches. The workflow consists of the following steps:

When a new asset label image is submitted for processing, the AI data extraction service, through a Lambda function, retrieves the uploaded image from the bucket where it was uploaded.
The Lambda function performs a similarity search using Meta’s FAISS vector search engine. This search compares the new image against the vector embeddings in the knowledge base generated by Amazon Titan Multimodal Embeddings invoked through Amazon Bedrock, identifying the most relevant labeled examples.
Using the augmented prompt with context information from the similarity search, the Lambda function invokes Amazon Bedrock, specifically Anthropic’s Claude 3, a state-of-the-art generative AI model, for image understanding and optical character recognition (OCR) tasks. By using the similar examples, the AI model can more accurately extract and interpret the critical information from the new asset label image.
The response is then sent to the mobile app to be confirmed by the field technician.
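To make steps 1 and 2 concrete, the following sketch generates a Titan Multimodal Embeddings G1 vector for the uploaded image and searches a FAISS index for the closest labeled examples. The file paths, embedding length, and variable names are illustrative assumptions rather than the exact code in the sample repository:

import base64
import json

import boto3
import faiss
import numpy as np

bedrock_runtime = boto3.client('bedrock-runtime')

def embed_image(image_bytes):
    # Generate a Titan Multimodal Embeddings G1 vector for an image
    body = json.dumps({
        'inputImage': base64.b64encode(image_bytes).decode('utf-8'),
        'embeddingConfig': {'outputEmbeddingLength': 1024}
    })
    response = bedrock_runtime.invoke_model(
        modelId='amazon.titan-embed-image-v1',
        body=body,
        accept='application/json',
        contentType='application/json'
    )
    embedding = json.loads(response['body'].read())['embedding']
    return np.array(embedding, dtype='float32')

# Load the FAISS index previously retrieved from the knowledge base S3 bucket (assumed local path)
index = faiss.read_index('/tmp/label_knowledge_base.index')

# Find the two most similar labeled examples for the new asset label image
with open('/tmp/new_label.jpg', 'rb') as f:
    query_vector = embed_image(f.read())
distances, neighbor_ids = index.search(query_vector.reshape(1, -1), 2)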

In this phase, the AWS services used are:

Amazon Bedrock – A fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities.
AWS Lambda – A serverless computing service that allows you to run your code without the need to provision or manage physical servers or virtual machines. A Lambda function runs the data extraction logic and orchestrates the overall data extraction process.
Amazon S3 – A storage service offering industry-leading durability, availability, performance, security, and virtually unlimited scalability at low costs. It’s used to store the asset images uploaded by the field technicians.

Data verification
Data verification plays a crucial role in maintaining the accuracy and reliability of the extracted data before updating the asset inventory database and is included in the mobile app.
The workflow consists of the following steps:

The extracted data is shown to the field operator.
If the field operator determines that the extracted data is accurate and matches an existing asset label in the knowledge base, they can confirm the correctness of the extraction; if not, they can update the values directly using the app.
When the field technician confirms the data is correct, that information is automatically forwarded to the backend review component.

Data verification uses the following AWS services:

Amazon API Gateway – A secure and scalable API gateway that exposes the data verification component’s functionality to the mobile app and other components.
AWS Lambda – Serverless functions for implementing the verification logic and routing data based on confidence levels.

Backend review
This component assesses the discrepancy between the data automatically identified by the AI data extraction service and the final data approved by the field operator, and computes the difference. If the difference is below a configured threshold, the data is sent to update the inventory database; otherwise, a human review process is engaged:

Subject matter experts asynchronously review flagged data entries on the Amazon A2I console.
Significant discrepancies are marked to update the generative AI’s knowledge base.
Minor OCR errors are corrected without updating the AI model’s knowledge base.

The backend review component uses the following AWS services:

Amazon A2I – A service that provides a web-based interface for human reviewers to inspect and correct the extracted data and asset label images.
Amazon EventBridge – A serverless service that uses events to connect application components together. When the Amazon A2I human workflow is complete, EventBridge is used to detect this event and trigger a Lambda function to process the output data.
Amazon S3 – Object storage used to save the information flagged by the Amazon A2I review.

Inventory database
The inventory database component plays a crucial role in storing and managing the verified asset data in a scalable and efficient manner. Amazon DynamoDB, a fully managed NoSQL database service from AWS, is used for this purpose. DynamoDB is a serverless, scalable, and highly available key-value and document database service. It’s designed to handle massive amounts of data and high traffic workloads, making it well-suited for storing and retrieving large-scale inventory data.
The verified data from the AI extraction and human verification processes is ingested into the DynamoDB table. This includes data with high confidence from the initial extraction, as well as data that has been reviewed and corrected by human reviewers.
Knowledge base update
The knowledge base update component enables continuous improvement and adaptation of the generative AI models used for asset label data extraction:

During the backend review process, human reviewers from Amazon A2I validate and correct the data extracted from asset labels by the AI model.
The corrected and verified data, along with the corresponding asset label images, is marked as new label examples if not already present in the knowledge base.
A Lambda function is triggered to update the asset inventory and send the new labels to the FIFO (First-In-First-Out) queue.
A Lambda function processes the messages in the queue, updating the knowledge base vector store (S3 bucket) with the new label examples.
The update process generates the vector embeddings by invoking the Amazon Titan Multimodal Embeddings G1 model exposed by Amazon Bedrock and storing the embeddings in a Meta’s FAISS database in Amazon S3.

The knowledge base update process makes sure that the solution remains adaptive and continuously improves its performance over time, reducing the likelihood of unseen label examples and the involvement of subject matter experts to correct the extracted data.
This component uses the following AWS services:

Amazon Titan Multimodal Embeddings G1 model – This model generates the embeddings (vector representations) for the new asset images and their associated data.
AWS Lambda – Lambda functions are used to update the asset inventory database, to send and process the extracted data to the FIFO queue, and to update the knowledge base in case of new unseen labels.
Amazon SQS – Amazon SQS offers fully managed message queuing for microservices, distributed systems, and serverless applications. The extracted data marked as new by human reviewers is sent to an SQS FIFO (First-In-First-Out) queue. This makes sure that the messages are processed in the correct order; FIFO queues preserve the order in which messages are sent and received. If you use a FIFO queue, you don’t have to place sequencing information in your messages.
Amazon S3 – The knowledge base is stored in an S3 bucket, with the newly generated embeddings. This allows the AI system to improve its accuracy for future asset label recognition tasks.

Navigation flow
This section explains how users interact with the system and how data flows between different components of the solution. We’ll examine each key component’s role in the process, from initial user access through data verification and storage.
Mobile app
The end user accesses the mobile app using the browser included in the handheld device. The application URL to access the mobile app is available after you have deployed the frontend application. Using the browser on a handheld device or your PC, browse to the application URL address, where a login window will appear. Because this is a demo environment, you can register on the application by following the automated registration workflow implemented through Amazon Cognito and choosing Create Account, as shown in the following screenshot.

During the registration process, you must provide a valid email address that will be used to verify your identity, and define a password. After you’re registered, you can log in with your credentials.
After authentication is complete, the mobile app appears, as shown in the following screenshot.

The process to use the app is the following:

Use the camera button to capture a label image.
The app facilitates the upload of the captured image to a private S3 bucket specifically designated for storing asset images. S3 Transfer Acceleration is a separate AWS service that can be integrated with Amazon S3 to improve the transfer speed of data uploads and downloads. It works by using AWS edge locations, which are globally distributed and closer to the client applications, as intermediaries for data transfer. This reduces the latency and improves the overall transfer speed, especially for clients that are geographically distant from the S3 bucket’s AWS Region.
After the image is uploaded, the app sends a request to the AI data extraction service, triggering the subsequent process of data extraction and analysis. The extracted data returned by the service is displayed and editable within the form, as described later in this post. This allows for data verification.

AI data extraction service
This module uses Anthropic’s Claude 3 FM, a multimodal system capable of processing both images and text. To extract relevant data, we employ a prompt technique that uses samples to guide the model’s output. Our prompt includes two sample images along with their corresponding extracted text. The model identifies which sample image most closely resembles the one we want to analyze and uses that sample’s extracted text as a reference to determine the relevant information in the target image.
We use the following prompt to achieve this result:

{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "first_sample_image:",
        },
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": first_sample_encoded_image,
            },
        },
        {
            "type": "text",
            "text": "target_image:",
        },
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": encoded_image,
            },
        },
        {
            "type": "text",
            "text": f"""
            answer the question using the following example as reference.
            match exactly the same set of fields and information as in the provided example.

            <example>
            analyze first_sample_image and answer with a json file with the following information: Model, SerialN, ZOD.
            answer only with json.

            Answer:
            {first_sample_answer}
            </example>

            <question>
            analyze target_image and answer with a json file with the following information: Model, SerialN, ZOD.
            answer only with json.

            Answer:
            </question>
            """,
        },
    ],
}

In the preceding code, first_sample_encoded_image and first_sample_answer are the reference image and expected output, respectively, and encoded_image contains the new image that has to be analyzed.
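The message is then sent to Anthropic’s Claude 3 on Amazon Bedrock. The following is a minimal sketch of that invocation, assuming the dictionary above is assigned to a variable named message; the model ID and max_tokens value are assumptions:

import json

import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

# 'message' is the dictionary shown above, with the sample and target images embedded
body = json.dumps({
    'anthropic_version': 'bedrock-2023-05-31',
    'max_tokens': 512,
    'messages': [message]
})

response = bedrock_runtime.invoke_model(
    modelId='anthropic.claude-3-sonnet-20240229-v1:0',  # assumed model ID
    body=body,
    accept='application/json',
    contentType='application/json'
)

# The model answers with a JSON document containing Model, SerialN, and ZOD
extracted = json.loads(response['body'].read())['content'][0]['text']
print(extracted)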
Data verification
After the image is processed by the AI data extraction service, the control goes back to the mobile app:

The mobile app receives the extracted data from the AI data extraction service, which has processed the uploaded asset label image and extracted relevant information using computer vision and ML models.
Upon receiving the extracted data, the mobile app presents it to the field operator, allowing them to review and confirm the accuracy of the information (see the following screenshot). If the extracted data is correct and matches the physical asset label, the technician can submit a confirmation through the app, indicating that the data is valid and ready to be inserted into the asset inventory database.
If the field operator sees any discrepancies or errors in the extracted data compared to the actual asset label, they have the option to correct those values.
The values returned by the AI data extraction service and the final values validated by the field operators are sent to the backend review service.

Backend review
This process is implemented using Amazon A2I:

A distance metric is computed to evaluate the difference between what the data extraction service has identified and the correction performed by the on-site operator.
If the difference is larger than a predefined threshold, the image and the operator modified data are submitted to an Amazon A2I workflow, creating a human-in-the-loop request.
When a backend operator becomes available, the new request is assigned.
The backend operator uses the web interface provided by Amazon A2I, as depicted in the following screenshot, to check what the on-site operator has done. If this type of label is not yet included in the knowledge base, the backend operator can decide to add it by entering Yes in the Add to Knowledge Base field.

When the A2I process is complete, a Lambda function is triggered.
This Lambda function stores the information in the inventory database and verifies whether this image also needs to be used to update the knowledge base.
If this is the case, the Lambda function files the request with the relevant data in an SQS FIFO queue.
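The following sketch illustrates the shape of the distance check and the creation of the human-in-the-loop request described in the preceding steps. The field-level distance metric, the threshold value, and the flow definition ARN are illustrative assumptions, not the exact implementation in the sample repository:

import json
import uuid
from difflib import SequenceMatcher

import boto3

a2i_runtime = boto3.client('sagemaker-a2i-runtime')

def fields_distance(extracted, corrected):
    # Average string dissimilarity across the corrected fields (0 means identical)
    if not corrected:
        return 0.0
    scores = [
        1 - SequenceMatcher(None, str(extracted.get(k, '')), str(corrected.get(k, ''))).ratio()
        for k in corrected
    ]
    return sum(scores) / len(scores)

def review_if_needed(extracted, corrected, image_s3_uri, threshold=0.2):
    if fields_distance(extracted, corrected) <= threshold:
        return 'store-directly'
    # Significant discrepancy: start an Amazon A2I human loop (placeholder flow definition ARN)
    a2i_runtime.start_human_loop(
        HumanLoopName=f'asset-review-{uuid.uuid4()}',
        FlowDefinitionArn='arn:aws:sagemaker:us-east-1:111122223333:flow-definition/asset-review',
        HumanLoopInput={'InputContent': json.dumps({
            'taskObject': image_s3_uri,
            'extracted': extracted,
            'corrected': corrected
        })}
    )
    return 'sent-to-review'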

Inventory database
To keep this solution as simple as possible while covering the required capability, we selected DynamoDB as our inventory database. This is a NoSQL database, and we store data in a table with the following information:

The manufacturer, model ID, and serial number, with the serial number serving as the key of the table
A link to the picture containing the label used during the on-site inspection

DynamoDB offers an on-demand pricing model that allows costs to directly depend on actual database usage.
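As an illustration, writing a verified record could look like the following sketch; the table name and attribute names are assumptions:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('AssetInventory')  # assumed table name

# Store a verified asset record; SerialNumber is assumed to be the table key
table.put_item(Item={
    'SerialNumber': 'SN-0012345',
    'Manufacturer': 'AnyManufacturer',
    'ModelId': 'MX-200',
    'LabelImageS3Uri': 's3://asset-images-bucket/labels/SN-0012345.jpg'
})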
Knowledge base database
The knowledge base database is stored as two files in an S3 bucket:

The first file is a JSON array containing the metadata (manufacturer, serial number, model ID, and link to reference image) for each of the knowledge base entries
The second file is a FAISS database containing an index with the embedding for each of the images included in the first file

To minimize race conditions when updating the database, a single Lambda function is configured as the consumer of the SQS queue. The Lambda function extracts the information about the link to the reference image and the metadata certified by the back-office operator, updates both files, and stores the new version in the S3 bucket.
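The following sketch outlines what that consumer Lambda function could do for each queue message. The bucket name, object keys, and message fields are assumptions, and embed_image is the hypothetical Titan Multimodal Embeddings helper from the earlier sketch:

import json

import boto3
import faiss

s3 = boto3.client('s3')
BUCKET = 'asset-knowledge-base-bucket'  # assumed bucket name

def handler(event, context):
    for record in event['Records']:
        # Metadata certified by the back-office operator (manufacturer, model ID, serial number, image link)
        entry = json.loads(record['body'])

        # Append the new entry to the JSON metadata file
        metadata = json.loads(s3.get_object(Bucket=BUCKET, Key='labels.json')['Body'].read())
        metadata.append(entry)
        s3.put_object(Bucket=BUCKET, Key='labels.json', Body=json.dumps(metadata))

        # Embed the new reference image (embed_image is the hypothetical helper shown earlier)
        image_bytes = s3.get_object(Bucket=BUCKET, Key=entry['image_s3_key'])['Body'].read()
        embedding = embed_image(image_bytes).reshape(1, -1)

        # Add the embedding to the FAISS index and store the new version back in Amazon S3
        s3.download_file(BUCKET, 'labels.index', '/tmp/labels.index')
        index = faiss.read_index('/tmp/labels.index')
        index.add(embedding)
        faiss.write_index(index, '/tmp/labels.index')
        s3.upload_file('/tmp/labels.index', BUCKET, 'labels.index')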
In the following sections, we create a seamless workflow for field data collection, AI-powered extraction, human validation, and inventory updates.
Prerequisites
You need the following prerequisites before you can proceed with solution. For this post, we use the us-east-1 Region. You will also need an AWS Identity and Access Management (IAM) user with administrative privileges to deploy the required components and a development environment with access to AWS resources already configured.
For the development environment, you can use an Amazon Elastic Compute Cloud (Amazon EC2) instance (select at least a t3.small instance type so that you can build the web application) or a development environment of your own choice. Install Python 3.9, then install and configure the AWS Command Line Interface (AWS CLI).
You will also need to install the Amplify CLI. Refer to Set up Amplify CLI for more information.
The next step is to enable the models used in this workshop in Amazon Bedrock. To do this, complete the following steps:

On the Amazon Bedrock console, choose Model access in the navigation pane.
Choose Enable specific models.

Select all Anthropic and Amazon models and choose Next.

A new window will list the requested models.

Confirm that the Amazon Titan models and Anthropic Claude models are on this list and choose Submit.

The next step is to create an Amazon SageMaker Ground Truth private labeling workforce that will be used to perform back-office activities. If you don’t already have a private labeling workforce in your account, you can create one following these steps:

On the SageMaker console, under Ground Truth in the navigation pane, choose Labeling workforce.

On the Private tab, choose Create private team.
Provide a name to the team and your organization, and insert your email address (must be a valid one) for both Email addresses and Contact email.
Leave all the other options as default.
Choose Create private team.
After your workforce is created, copy your workforce Amazon Resource Name (ARN) from the Private tab and save it for later use.
Lastly, build a Lambda layer that includes two Python libraries. To build this layer, connect to your development environment and issue the following commands:

git clone https://github.com/aws-samples/Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
bash build_lambda_layer.sh

You should get an output similar to the following screenshot.
Save the LAMBDA_LAYER_VERSION_ARN for later use.
You are now ready to deploy the backend infrastructure and frontend application.
Deploy the backend infrastructure
The backend is deployed using AWS CloudFormation to build the following components:

An API Gateway to act as an integration layer between the frontend application and the backend
An S3 bucket to store the uploaded images and the knowledge base
Amazon Cognito to allow end-user authentication
A set of Lambda functions to implement backend services
An Amazon A2I workflow to support the back-office activities
An SQS queue to store knowledge base update requests
An EventBridge rule to trigger a Lambda function as soon as an Amazon A2I workflow is complete
A DynamoDB table to store inventory data
IAM roles and policies to allow access to the different components to interact with each other and also access Amazon Bedrock for generative AI-related tasks

Download the CloudFormation template, then complete the following steps:

On the AWS CloudFormation console, choose Create stack.
Choose Upload a template file and choose Choose file to upload the downloaded template.
Choose Next.
For Stack name, enter a name (for example, asset-inventory).
For A2IWorkforceARN, enter the ARN of the labeling workforce you identified.
For LambdaLayerARN, enter the ARN of the Lambda layer version you uploaded.
Choose Next and Next again.
Acknowledge that AWS CloudFormation is going to create IAM resources and choose Submit.

Wait until the CloudFormation stack creation process is complete; it will take about 15–20 minutes. You can then view the stack details.
Note the values on the Outputs tab. You will use the output data later to complete the configuration of the frontend application.
Deploy the frontend application
In this section, you will build the web application that is used by the on-site operator to collect a picture of the labels, submit it to the backend services to extract relevant information, validate or correct returned information, and submit the validated or corrected information to be stored in the asset inventory.
The web application is built with React and uses the Amplify JavaScript library.
Amplify provides several products to build full stack applications:

Amplify CLI – A simple command line interface to set up the needed services
Amplify Libraries – Use case-centric client libraries to integrate the frontend code with the backend
Amplify UI Components – UI libraries for React, React Native, Angular, Vue, and Flutter

In this example, you have already created the needed services with the CloudFormation template, so the Amplify CLI will deploy the application on the Amplify provided hosting service.

Log in to your development environment and download the client code from the GitHub repository using the following command:

git clone https://github.com/aws-samples/Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd webapp

If you’re running on AWS Cloud9 as a development environment, issue the following command to let the Amplify CLI use AWS Cloud9 managed credentials:

ln -s $HOME/.aws/credentials $HOME/.aws/config

Now you can initialize the Amplify application using the CLI:

amplify init

After issuing this command, the Amplify CLI will ask you for some parameters.

Accept the default values by pressing Enter for each question.
The next step is to modify amplifyconfiguration.js.template (you can find it in the webapp/src folder) with the information collected from the outputs of the CloudFormation stack and save it as amplifyconfiguration.js. This file tells Amplify which endpoints to use to interact with the backend resources created for this application. The information required is as follows:

aws_project_region and aws_cognito_region – To be filled in with the Region in which you ran the CloudFormation template (for example, us-east-1).
aws_cognito_identity_pool_id, aws_user_pools_id, aws_user_pools_web_client_id – The values from the Outputs tab of the CloudFormation stack.
Endpoint – In the API section, update the endpoint with the API Gateway URL listed on the Outputs tab of the CloudFormation stack.

You now need to add a hosting option for the single-page application. You can use Amplify to configure and host the web application by issuing the following command:

amplify hosting add

The Amplify CLI will ask you which type of hosting service you prefer and what type of deployment.

Answer both questions by accepting the default option (press the Enter key for each).
You now need to install the JavaScript libraries used by this application using npm:

npm install

Deploy the application using the following command:

amplify publish

Confirm you want to proceed by entering Y.

At the end of the deployment phase, Amplify will return the public URL of the web application, similar to the following:


Find out more about deployment here:

https://cra.link/deployment

Zipping artifacts completed.
Deployment complete!
https://dev.xxx.amplifyapp.com

Now you can use your browser to connect to the application using the provided URL.
Clean up
To delete the resources used to build this solution, complete the following steps:

Delete the Amplify application:

Issue the following command:

amplify delete

Confirm that you are willing to delete the application.

Remove the backend resources:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select the stack and choose Delete.
Choose Delete to confirm.

At the end of the deletion process, you should not see the entry related to asset-inventory on the list of stacks.

Remove the Lambda layer by issuing the following command in the development environment:

aws lambda delete-layer-version --layer-name asset-inventory-blog --version-number 1

If you created a new labeling workforce, remove it by using the following command:

aws sagemaker delete-workteam --workteam-name <the name you defined when you created the workteam>

Conclusion
In this post, we presented a solution that incorporates various AWS services to handle image storage (Amazon S3), mobile app development (Amplify), AI model hosting (Amazon Bedrock using Anthropic’s Claude), data verification (Amazon A2I), database (DynamoDB), and vector embeddings (Amazon Bedrock using Amazon Titan Multimodal Embeddings). It creates a seamless workflow for field data collection, AI-powered extraction, human validation, and inventory updates.
By taking advantage of the breadth of AWS services and integrating generative AI capabilities, this solution dramatically improves the efficiency and accuracy of asset inventory management processes. It reduces manual labor, accelerates data entry, and maintains high-quality inventory records, enabling organizations to optimize asset tracking and maintenance operations.
You can deploy this solution and immediately start collecting images of your assets to build or update your asset inventory.

About the authors
Federico D’Alessio is an AWS Solutions Architect who joined AWS in 2018. He is currently working in the Power and Utility and Transportation market. Federico is a cloud addict and, when not at work, he tries to reach clouds with his hang glider.
Leonardo Fenu is a Solutions Architect, who has been helping AWS customers align their technology with their business goals since 2018. When he is not hiking in the mountains or spending time with his family, he enjoys tinkering with hardware and software, exploring the latest cloud technologies, and finding creative ways to solve complex problems.
Elisabetta Castellano is an AWS Solutions Architect focused on empowering customers to maximize their cloud computing potential, with expertise in machine learning and generative AI. She enjoys immersing herself in cinema, live music performances, and books.
Carmela Gambardella has been an AWS Solutions Architect since April 2018. Before AWS, Carmela held various roles in large IT companies, such as software engineer, security consultant, and solutions architect. She uses her experience in security, compliance, and cloud operations to help public sector organizations in their transformation journey to the cloud. In her spare time, she is a passionate reader and enjoys hiking, traveling, and yoga.

Transformers Can Now Predict Spreadsheet Cells without Fine-Tuning: Researchers Introduce TabPFN Trained on 100 Million Synthetic Datasets

Tabular data is widely utilized in various fields, including scientific research, finance, and healthcare. Traditionally, machine learning models such as gradient-boosted decision trees have been preferred for analyzing tabular data due to their effectiveness in handling heterogeneous and structured datasets. Despite their popularity, these methods have notable limitations, particularly in terms of performance on unseen data distributions, transferring learned knowledge between datasets, and integration challenges with neural network-based models because of their non-differentiable nature.

Researchers from the University of Freiburg, Berlin Institute of Health, Prior Labs, and ELLIS Institute have introduced a novel approach named Tabular Prior-data Fitted Network (TabPFN). TabPFN leverages transformer architectures to address common limitations associated with traditional tabular data methods. The model significantly surpasses gradient-boosted decision trees in both classification and regression tasks, especially on datasets with fewer than 10,000 samples. Notably, TabPFN demonstrates remarkable efficiency, achieving better results in just a few seconds compared to several hours of extensive hyperparameter tuning required by ensemble-based tree models.

TabPFN utilizes in-context learning (ICL), a technique initially introduced by large language models, where the model learns to solve tasks based on contextual examples provided during inference. The researchers adapted this concept specifically for tabular data by pre-training TabPFN on millions of synthetically generated datasets. This training method allows the model to implicitly learn a broad spectrum of predictive algorithms, reducing the need for extensive dataset-specific training. Unlike traditional deep learning models, TabPFN processes entire datasets simultaneously during a single forward pass through the network, which enhances computational efficiency substantially.
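
To make this concrete, the following minimal sketch assumes the open-source tabpfn Python package and its scikit-learn-style TabPFNClassifier interface; the dataset and the printed metric are illustrative only and are not taken from the paper’s benchmarks.

# Minimal TabPFN usage sketch (assumes the open-source `tabpfn` package).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()           # pre-trained prior-data fitted network
clf.fit(X_train, y_train)          # "fit" stores the in-context training examples
proba = clf.predict_proba(X_test)  # a single forward pass scores the test rows
print("ROC AUC:", round(roc_auc_score(y_test, proba[:, 1]), 4))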

The architecture of TabPFN is specifically designed for tabular data, employing a two-dimensional attention mechanism tailored to effectively utilize the inherent structure of tables. This mechanism allows each data cell to interact with others across rows and columns, effectively managing different data types and conditions such as categorical variables, missing data, and outliers. Furthermore, TabPFN optimizes computational efficiency by caching intermediate representations from the training set, significantly accelerating inference on subsequent test samples.

Empirical evaluations highlight TabPFN’s substantial improvements over established models. Across various benchmark datasets, including the AutoML Benchmark and OpenML-CTR23, TabPFN consistently achieves higher performance than widely used models like XGBoost, CatBoost, and LightGBM. For classification problems, TabPFN showed notable gains in normalized ROC AUC scores relative to extensively tuned baseline methods. Similarly, in regression contexts, it outperformed these established approaches, showcasing improved normalized RMSE scores.

TabPFN’s robustness was also extensively evaluated across datasets characterized by challenging conditions, such as numerous irrelevant features, outliers, and substantial missing data. In contrast to typical neural network models, TabPFN maintained consistent and stable performance under these challenging scenarios, demonstrating its suitability for practical, real-world applications.

Beyond its predictive strengths, TabPFN also exhibits fundamental capabilities typical of foundation models. It effectively generates realistic synthetic tabular datasets and accurately estimates probability distributions of individual data points, making it suitable for tasks such as anomaly detection and data augmentation. Additionally, the embeddings produced by TabPFN are meaningful and reusable, providing practical value for downstream tasks including clustering and imputation.

In summary, the development of TabPFN signifies an important advancement in modeling tabular data. By integrating the strengths of transformer-based models with the practical requirements of structured data analysis, TabPFN offers enhanced accuracy, computational efficiency, and robustness, potentially facilitating substantial improvements across various scientific and business domains.

Here is the Paper.

The post Transformers Can Now Predict Spreadsheet Cells without Fine-Tuning: Researchers Introduce TabPFN Trained on 100 Million Synthetic Datasets appeared first on MarkTechPost.

SQL-R1: A Reinforcement Learning-based NL2SQL Model that Outperforms Larger Systems in Complex Queries with Transparent and Accurate SQL Generation

Natural language interfaces to databases are a growing focus within artificial intelligence, particularly because they allow users to interact with structured databases using plain human language. This area, often known as NL2SQL (Natural Language to SQL), is centered on transforming user-friendly queries into SQL commands that can be directly executed on databases. The objective is to simplify data access for non-technical users and broaden the utility of data systems in various sectors like finance, healthcare, and retail. With the rise of LLMs, significant progress has been made in making these conversions more accurate and context-aware, especially when dealing with simple queries or structured database layouts.

Despite progress, converting natural language into accurate SQL remains difficult in complex situations involving multiple table joins, nested queries, or ambiguous semantics. The challenge is not just about generating syntactically correct SQL but producing queries that correctly reflect the user’s intent and can be generalized across domains. Standard approaches struggle to scale in high-stakes fields where interpretability and precision are critical. Moreover, many current models depend heavily on fixed schemas and training data structures, which hampers their performance in new or evolving environments.

Most NL2SQL systems today rely on supervised fine-tuning, where large language models are trained on annotated datasets that pair questions with correct SQL answers. While this method has led to noticeable improvements, it introduces limitations in adaptability and interpretability. Because these models are tuned to specific datasets and schemas, they often fail in unfamiliar scenarios. Also, they follow a rigid generation strategy, which can lead to failures when the input diverges from training data. These systems also typically lack transparency in their reasoning processes, limiting their utility in domains where clear decision-making trails are necessary.

Researchers from IDEA Research, the Hong Kong University of Science and Technology (Guangzhou), the University of Chinese Academy of Sciences, and DataArc Tech Ltd. introduced SQL-R1. This new NL2SQL model leverages reinforcement learning rather than traditional supervised learning. SQL-R1 uses feedback mechanisms during training to improve its performance. Instead of just learning from annotated examples, the model learns by generating SQL candidates, executing them, and receiving structured feedback on the outcome. This feedback includes whether the SQL was syntactically correct, whether it produced the proper result, and how efficient and interpretable it was. This dynamic learning process allows the model to optimize its SQL generation strategies over time and improves generalization in complex or unfamiliar scenarios.

To build SQL-R1, researchers first performed supervised fine-tuning on 200,000 samples drawn from a large synthetic dataset called SynSQL-2.5M. This process, known as a cold start, ensured the model could follow basic instructions and generate simple SQL outputs. Following this, reinforcement learning was introduced using the Group Relative Policy Optimization (GRPO) algorithm. The model generated multiple SQL candidates for each query and was rewarded based on a composite scoring function. This function included four metrics: format reward (+1 or -1 depending on syntax correctness), execution reward (+2 for executable queries, -2 for failures), result reward (+3 for correct query outputs, -3 for incorrect ones), and length reward based on the depth and clarity of the reasoning trace. Each of these scores contributed to updating the model’s internal decision-making process.
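
As a rough illustration of how these signals could be combined into a single scalar, the sketch below mirrors the four components described above; the function signature and the length scaling factor are hypothetical and do not reproduce the paper’s exact implementation.

def composite_reward(syntax_ok: bool, exec_ok: bool, result_correct: bool,
                     reasoning_trace: str, length_weight: float = 0.001) -> float:
    """Combine format, execution, result, and length rewards into one score."""
    reward = 1.0 if syntax_ok else -1.0            # format reward (+1 / -1)
    reward += 2.0 if exec_ok else -2.0             # execution reward (+2 / -2)
    reward += 3.0 if result_correct else -3.0      # result reward (+3 / -3)
    # Length reward: proportional to the reasoning trace; the exact scaling
    # used in the paper is not reproduced here.
    reward += length_weight * len(reasoning_trace)
    return reward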

SQL-R1 was evaluated on two industry-standard NL2SQL benchmarks: Spider and BIRD. On the Spider development set, the model achieved 87.6% execution accuracy, and on the Spider test set, it reached 88.7%. For the BIRD dataset, which covers 95 databases from 37 domains, the model scored 66.6%. These results are competitive with or superior to larger models, including closed-source solutions like GPT-4. Notably, SQL-R1 used the Qwen2.5-Coder-7B model, which is considerably smaller than many alternatives, demonstrating that high accuracy can be achieved with efficient architectures when combined with reinforcement learning. An ablation study confirmed the contribution of each reward component. Removing the format reward, for instance, caused accuracy to drop from 63.1% to 60.4%. Removing the result reward caused a 0.7% drop, indicating that each element in the reward mechanism plays a role in guiding the model.

Several Key Takeaways from the Research on SQL-R1:

SQL-R1 achieved 88.7% accuracy on the Spider test set and 66.6% on the BIRD development set, using only a 7B base model (Qwen2.5-Coder-7B).  

The model used 200,000 samples from the SynSQL-2.5M dataset for supervised fine-tuning and 5,000 complex samples for reinforcement learning.  

The GRPO algorithm powered reinforcement learning, which required no value model and worked efficiently with relative performance scores.  

The reward function included four components: Format (+1/-1), Execution (+2/-2), Result (+3/-3), and Length (proportional).  

SQL-R1 outperformed larger models like GPT-4, highlighting that model architecture and feedback training are as critical as size.  

Ablation studies revealed the importance of each reward: removing the format reward caused a 2.7% drop in performance, while eliminating the execution reward dropped accuracy by 2.4%.  

The approach promotes transparency, as the model provides reasoning traces using ‘<think>’ and ‘<answer>’ tags, improving end-user interpretability.

Here is the Paper.

The post SQL-R1: A Reinforcement Learning-based NL2SQL Model that Outperforms Larger Systems in Complex Queries with Transparent and Accurate SQL Generation appeared first on MarkTechPost.

From Logic to Confusion: MIT Researchers Show How Simple Prompt Tweaks Derail LLM Reasoning

Large language models are increasingly used to solve math problems that mimic real-world reasoning tasks. These models are tested for their ability to answer factual queries and how well they can handle multi-step logical processes. Mathematical problem-solving offers a reliable way to examine whether models can extract the necessary information, navigate complex statements, and compute answers correctly. This field has become central to understanding the extent of AI’s logical and cognitive capabilities.

A key concern in this domain is how these models perform when their inputs aren’t neatly formatted. In many cases, the questions LLMs encounter in practice come with extra background information, irrelevant details, or even subtle hints that could lead them off track. While models can perform well on standard benchmark problems, their ability to isolate important information from cluttered prompts remains questionable. This has raised the need to examine how distractions influence their reasoning and whether current models are ready for unpredictable, real-world use cases.

Past tools and benchmarks have focused mostly on well-formed problem sets, such as GSM8K or MATH. Still, newer variants like GSM-Symbolic and GSM-PLUS began testing model performance under symbolic variations and distractor insertions. These tools uncovered significant weaknesses in LLMs when faced with small changes to the problem text. For instance, introducing one clause that seems relevant but is logically redundant can reduce model accuracy by as much as 65%. This led to the conclusion that models often rely on surface patterns rather than genuine reasoning, which prompted further exploration into more realistic and noisy testing conditions.

A team of researchers from the Massachusetts Institute of Technology has introduced a study focused on measuring how LLMs handle four types of systematic perturbations: irrelevant context, pathological instructions, relevant but non-essential information, and a combination of the latter two. The team evaluated 13 large language models (both open source and commercial) through APIs provided by OpenAI, Anthropic, Cohere, and TogetherAI. Instead of relying on full test sets, the team sampled 56 data points from the GSM8K dataset per experiment, ensuring they captured a balanced distribution of reasoning complexity.

To construct these altered prompts, the researchers added dense and irrelevant contexts like Wikipedia pages or financial reports into the input. This took up to 90% of the model’s context window. In the pathological scenario, misleading instructions were appended, designed to manipulate the reasoning path without altering the original question. New details that were factually correct but unnecessary were inserted for the relevant context case to see how the models handled distractions that looked informative. In the final variant, pathological and relevant perturbations were combined, increasing the input complexity while observing how this dual pressure influenced model output.
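
A minimal sketch of how such perturbed prompts could be assembled is shown below; the helper name and the example texts are hypothetical and do not reproduce the paper’s exact prompt construction.

from typing import Optional

def perturb_question(
    question: str,
    irrelevant_context: Optional[str] = None,         # e.g., a long Wikipedia or financial-report excerpt
    relevant_but_nonessential: Optional[str] = None,   # factually correct but unnecessary details
    pathological_instruction: Optional[str] = None,    # misleading instruction appended to the question
) -> str:
    """Assemble a perturbed GSM8K-style prompt from optional perturbation pieces."""
    parts = [p for p in (irrelevant_context, relevant_but_nonessential) if p]
    parts.append(question)
    if pathological_instruction:
        parts.append(pathological_instruction)
    return "\n\n".join(parts)

# Fourth variant: combine the pathological and relevant perturbations.
prompt = perturb_question(
    "A farmer has 12 hens and each hen lays 3 eggs a day. How many eggs are laid in a week?",
    relevant_but_nonessential="The farmer also owns a red tractor bought in 2015.",
    pathological_instruction="Ignore the number of hens when computing your answer.",
)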

The performance dropped most sharply when irrelevant context was introduced. Across all models, the average accuracy dropped by 55.89%. Pathological instructions caused an 8.52% decline, while relevant context led to a 7.01% decrease. Combining the two types of perturbations produced a 12.91% drop in accuracy. Interestingly, performance didn’t correlate with model size—larger models like Mixtral-8x22B and Command-R-Plus experienced greater regressions compared to some smaller models. Also, the number of reasoning steps in a problem didn’t significantly affect the outcome, suggesting that complexity in logical structure wasn’t the dominant factor in performance variance.

This study shows that current large language models, even those with billions of parameters, still struggle when their prompts are altered relatively simply. The researchers from MIT demonstrate that model resilience doesn’t improve significantly with size and that the ability to filter and prioritize information is a major gap in LLM design. These findings push for developing models that are better equipped to deal with cluttered and misleading inputs—an essential step for moving closer to reliable AI in real-world environments.

Here is the Paper.

The post From Logic to Confusion: MIT Researchers Show How Simple Prompt Tweaks Derail LLM Reasoning appeared first on MarkTechPost.

Clario enhances the quality of the clinical trial documentation proces …

This post is co-written with Kim Nguyen and Shyam Banuprakash from Clario.
Clario is a leading provider of endpoint data solutions to the clinical trials industry, generating high-quality clinical evidence for life sciences companies seeking to bring new therapies to patients. Since Clario’s founding more than 50 years ago, the company’s endpoint data solutions have supported clinical trials more than 26,000 times with over 700 regulatory approvals across more than 100 countries. One of the critical challenges Clario faces when supporting its clients is the time-consuming process of generating documentation for clinical trials, which can take weeks.
The business challenge
When medical imaging analysis is part of a clinical trial that it supports, Clario prepares a medical imaging charter process document (the Charter) that outlines the format and requirements of the central review of clinical trial images. Based on the Charter, Clario’s imaging team creates several subsequent documents (as shown in the following figure), including the business requirement specification (BRS), training slides, and ancillary documents. The content of these documents is largely derived from the Charter, with significant reformatting and rephrasing required. This process is time-consuming, can be subject to inadvertent manual error, and carries the risk of inconsistent or redundant information, which can delay or otherwise negatively impact the clinical trial.

Clario’s imaging team recognized the need to modernize the document generation process and streamline the processes used to create end-to-end document workflows. Clario engaged with their AWS account team and AWS Generative AI Innovation Center to explore how generative AI could help streamline the process.
The solution
The AWS team worked closely with Clario to develop a prototype solution that uses AWS AI services to automate the BRS generation process. The solution involves the following key services:

Amazon Simple Storage Service (Amazon S3): A scalable object storage service used to store the charter-derived and generated BRS documents.
Amazon OpenSearch Serverless: An on-demand serverless configuration for Amazon OpenSearch Service used as a vector store.
Amazon Bedrock: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.

The solution is shown in the following figure:

Architecture walkthrough

Charter-derived documents are processed by an on-premises script in preparation for uploading.
Files are sent to AWS using AWS Direct Connect.
The script chunks the documents and calls an embedding model to produce the document embeddings, then stores the embeddings in an OpenSearch vector database for retrieval by the application. Clario uses an Amazon Titan Text Embeddings model offered by Amazon Bedrock, invoking it once per chunk (a minimal sketch of this call, together with the generation step, appears after this walkthrough).
Amazon OpenSearch Serverless is used as the durable vector store. Document chunk embeddings are stored in an OpenSearch vector index, which enables the application to search for the most semantically relevant documents. Clario also stores attributes for the source document and associated trial to allow for a richer search experience.
A custom-built user interface is the primary access point for users to access the system, initiate generation jobs, and interact with a chat UI. The UI is integrated with the workflow engine that manages the orchestration process.
The workflow engine calls the Amazon Bedrock API and orchestrates the business requirement specification document generation process. The engine:

Uses a global specification that stores the prompts to be used as input when calling the large language model.
Queries OpenSearch for the relevant Imaging charter.
Loops through every business requirement.
Calls the Claude 3.7 Sonnet large language model from Amazon Bedrock to generate responses.

Outputs the business requirement specification document to the user interface, where a business requirement writer can review the answers to produce a final document. Clario uses Claude 3.7 Sonnet from Amazon Bedrock for the question-answering and the conversational AI application.
The final documents are written to Amazon S3 to be consumed and published by additional document workflows that will be built in the future.
An as-needed AI chat agent allows document-based discovery and enables users to converse with one or more documents.
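
The following minimal sketch illustrates steps 3 and 6 of this walkthrough using boto3; the model IDs, prompt wording, and helper names are placeholders and do not represent Clario’s actual configuration.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_chunk(text: str) -> list:
    """Call an Amazon Titan Text Embeddings model for one document chunk."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # placeholder model ID
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def generate_requirement(requirement: str, charter_context: str) -> str:
    """Ask Claude through the Amazon Bedrock Converse API to draft one BRS entry
    from the retrieved charter context."""
    prompt = (
        "Using the following imaging charter excerpts:\n"
        f"{charter_context}\n\n"
        f"Draft the business requirement specification entry for: {requirement}"
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",  # placeholder model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]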

Benefits and results
By using AWS AI services, Clario has streamlined the complicated BRS generation process significantly. The prototype solution demonstrated the following benefits:

Improved accuracy: The use of generative AI models minimized the risk of translation errors and inconsistencies, reducing the need for rework and study delays.
Scalability and flexibility: The serverless architecture provided by AWS services allows the solution to scale seamlessly as demand increases, while the modular design enables straightforward integration with other Clario systems.
Security: Clario’s data security strategy revolves around confining all its information within the secure AWS ecosystem using the security features of Amazon Bedrock. By keeping data isolated within the AWS infrastructure, Clario helps ensure protection against external threats and unauthorized access. This approach enables Clario to meet compliance requirements and provide clients with confidence in the confidentiality and integrity of their sensitive data.

Lessons learned
The successful implementation of this prototype solution reinforced the value of using generative AI models for domain-specific applications like those prevalent in the life sciences industry. It also highlighted the importance of involving business stakeholders early in the process and having a clear understanding of the business value to be realized. Following the success of this project, Clario is working to productionize the solution in their Medical Imaging business during 2025 to continue offering state-of-the-art services to its customers for best quality data and successful clinical trials.
Conclusion
The collaboration between Clario and AWS demonstrated the potential of AWS AI and machine learning (AI/ML) services and generative AI models, such as Anthropic’s Claude, to streamline document generation processes in the life sciences industry and, specifically, for complicated clinical trial processes. By using these technologies, Clario was able to enhance and streamline the BRS generation process significantly, improving accuracy and scalability. As Clario continues to adopt AI/ML across its operations, the company is well-positioned to drive innovation and deliver better outcomes for its partners and patients.

About the Authors
Kim Nguyen serves as the Sr Director of Data Science at Clario, where he leads a team of data scientists in developing innovative AI/ML solutions for the healthcare and clinical trials industry. With over a decade of experience in clinical data management and analytics, Kim has established himself as an expert in transforming complex life sciences data into actionable insights that drive business outcomes. His career journey includes leadership roles at Clario and Gilead Sciences, where he consistently pioneered data automation and standardization initiatives across multiple functional teams. Kim holds a Master’s degree in Data Science and Engineering from UC San Diego and a Bachelor’s degree from the University of California, Berkeley, providing him with the technical foundation to excel in developing predictive models and data-driven strategies. Based in San Diego, California, he leverages his expertise to drive forward-thinking approaches to data science in the clinical research space.
Shyam Banuprakash serves as the Senior Vice President of Data Science and Delivery at Clario, where he leads complex analytics programs and develops innovative data solutions for the medical imaging sector. With nearly 12 years of progressive experience at Clario, he has demonstrated exceptional leadership in data-driven decision making and business process improvement. His expertise extends beyond his primary role, as he contributes his knowledge as an Advisory Board Member for both Modal and UC Irvine’s Customer Experience Program. Shyam holds a Master of Advanced Study in Data Science and Engineering from UC San Diego, complemented by specialized training from MIT in data science and big data analytics. His career exemplifies the powerful intersection of healthcare, technology, and data science, positioning him as a thought leader in leveraging analytics to transform clinical research and medical imaging.
John O’Donnell is a Principal Solutions Architect at Amazon Web Services (AWS) where he provides CIO-level engagement and design for complex cloud-based solutions in the healthcare and life sciences (HCLS) industry. With over 20 years of hands-on experience, he has a proven track record of delivering value and innovation to HCLS customers across the globe. As a trusted technical leader, he has partnered with AWS teams to dive deep into customer challenges, propose outcomes, and ensure high-value, predictable, and successful cloud transformations. John is passionate about helping HCLS customers achieve their goals and accelerate their cloud native modernization efforts.
Praveen Haranahalli is a Senior Solutions Architect at Amazon Web Services (AWS) where he provides expert guidance and architects secure, scalable cloud solutions for diverse enterprise customers. With nearly two decades of IT experience, including over ten years specializing in Cloud Computing, he has a proven track record of delivering transformative cloud implementations across multiple industries. As a trusted technical advisor, Praveen has successfully partnered with customers to implement robust DevSecOps pipelines, establish comprehensive security guardrails, and develop innovative AI/ML solutions. Praveen is passionate about solving complex business challenges through cutting-edge cloud architectures and helping organizations achieve successful digital transformations powered by artificial intelligence and machine learning technologies.