Elevate RAG for numerical analysis using Amazon Bedrock Knowledge Base …

In the realm of generative artificial intelligence (AI), Retrieval Augmented Generation (RAG) has emerged as a powerful technique, enabling foundation models (FMs) to use external knowledge sources for enhanced text generation.
Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Knowledge Bases is a fully managed capability that helps you implement the entire RAG workflow—from ingestion to retrieval and prompt augmentation—without having to build custom integrations to data sources and manage data flows. However, RAG has had its share of challenges, especially when it comes to using it for numerical analysis. This is the case when you have information embedded in complex nested tables. Recent innovations in Amazon Bedrock Knowledge Bases provide a resolution to this issue.
In this post, we explore how Amazon Bedrock Knowledge Bases addresses the use case of numerical analysis across multiple documents.
The power of RAG and its limitations
With RAG, an information retrieval component is introduced that utilizes the user input to first pull relevant information from a data source. The user query and the relevant information are both given to the large language model (LLM). The LLM uses the new knowledge and its training data to create better responses.
Although this approach holds a lot of promise for textual documents, the presence of non-textual elements, such as tables, poses a significant challenge. One issue is that the table structure by itself can be difficult to interpret when you query documents in PDF or Word format directly. This can be addressed by transforming the data into a format such as text, Markdown, or HTML.
Another issue relates to search, retrieval, and chunking of documents that contain tables. The first step in RAG is to chunk a document so you can transform that chunk of data into a vector for a meaningful representation of text. However, when you apply this method to a table, even if converted into a text format, there is a risk that the vector representation doesn’t capture all the relationships in the table. As a result, when you try to retrieve information, a lot of information is missed. Because this information isn’t retrieved, the LLM doesn’t provide accurate answers to your questions.
Amazon Bedrock Knowledge Bases provides three capabilities to resolve this issue:

Hybrid search – A hybrid search retrieves information based on semantic meaning through vector representations as well as through keywords. As a result, information on particular key fields that was being missed earlier using purely semantic search is retrieved, and the LLM is able to accurately provide the correct answers. For more information on Amazon Bedrock’s hybrid search capability, see Amazon Bedrock Knowledge Bases now supports hybrid search.
Chunking data in fixed sizes – You can specify a fixed size for the chunks of data that are eventually transformed into vectors. Smaller sizes produce chunks with less data per vector, and larger sizes produce chunks with more.
Retrieving a large number of chunks from your search results – You can configure how many chunks are retrieved as the result of your search. The more results retrieved, the more context is provided to the LLM for an answer (these settings appear together in the sketch following this list).
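The following is a minimal sketch, using the AWS SDK for Python (Boto3), of how these settings can be combined in a single knowledge base query; the knowledge base ID and the question are placeholders, and the exact values should match your own configuration.

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={
        "text": "What is the change in total operating expenses from 2020 to 2023?"
    },
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 100,          # retrieve a large number of chunks
            "overrideSearchType": "HYBRID",  # combine keyword and semantic search
        }
    },
)

# Each result carries the chunk text that will later be passed to the LLM as context.
for result in response["retrievalResults"]:
    print(result["content"]["text"][:200])
```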

Using a combination of these features can enhance numerical analysis of information across multiple documents that contain data in tables. In the next section, we demonstrate this approach using a set of earnings documents from Amazon.
Solution overview
The following diagram illustrates the high-level architecture of our solution for analyzing numerical documents.

The user call flow consists of the following steps:

The process begins with the user uploading one or more documents. This action initiates the workflow.
The Streamlit application, which is designed to facilitate user interaction, takes these uploaded documents and stores them in an Amazon Simple Storage Service (Amazon S3) bucket.
After the documents are successfully copied to the S3 bucket, the event automatically invokes an AWS Lambda function.
The Lambda function invokes the Amazon Bedrock Knowledge Bases API to extract embeddings—essential data representations—from the uploaded documents. These embeddings are vector representations that capture the core features and meaning of the documents.
With the documents processed and stored, the GUI of the application becomes interactive. Users can now engage with the application by asking questions in natural language through the user-friendly interface.
When a user submits a question, the application converts this query into query embeddings. These embeddings encapsulate the essence of the user’s question, which helps with retrieving the relevant context from the knowledge base.

You can use the Retrieve API to query your knowledge base and return the relevant chunks directly, or the RetrieveAndGenerate API, which uses the retrieved results to augment the foundation model (FM) prompt and returns the generated response (a minimal example follows this list).
Using a hybrid search method that combines keyword-based and semantic-based techniques, the application searches its knowledge base for relevant information related to the user’s query. This search aims to find contextual answers that match both the explicit terms and the intended meaning of the question.
When relevant context is identified, the application forwards this information—both the user’s query and the retrieved context—to the LLM module.
The LLM module processes the provided query and context to generate a response.
The application delivers the generated response back to the user through its GUI. This completes the loop of interaction, where the user’s initial query results in a comprehensive and contextually relevant response derived from the uploaded documents and the application’s knowledge base.
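As a reference, the following is a minimal sketch of the RetrieveAndGenerate call, with a placeholder knowledge base ID and an example model ARN; retrieval and prompt augmentation happen in a single request, and the response also includes citations that point back to the retrieved chunks.

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What are the international operating expenses in 2020, 2021, and 2022?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)

print(response["output"]["text"])  # generated answer
print(len(response["citations"]))  # citations back to the retrieved chunks
```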

In the following sections, we walk through the steps to create an S3 bucket and knowledge base, deploy the Streamlit application with AWS CloudFormation, and test the solution.
Prerequisites
You should have the following prerequisites:

An AWS account with necessary permissions
Access to launch AWS CloudFormation
Access to the Anthropic Claude 3 Sonnet and Amazon Titan Text Embeddings v2 models on Amazon Bedrock
The CloudFormation template downloaded to your local computer

Create an S3 bucket
Complete the following steps to create your S3 bucket:

On the Amazon S3 console, choose Buckets in the navigation pane.
Choose Create bucket.
Enter a unique bucket name that follows the S3 bucket naming rules.
Choose the AWS Region where you want to create the bucket. It's recommended to choose a Region that is geographically close to you.
Leave the other settings at their default values and choose Create bucket.

Create a knowledge base
Complete the following steps to create a knowledge base with default settings:

On the Amazon Bedrock console, choose Knowledge bases under Builder tools in the navigation pane.
Choose Create knowledge base.
In the Provide knowledge base details section, provide the following information:

For Knowledge base name, enter a name.
Select Create and use a new service role for AWS Identity and Access Management (IAM) permissions.

In the Choose data source section, select the radio button for Amazon S3 and choose Next.
In the Configure data source section, provide the following information:

For S3 URI, enter the S3 path for the bucket you created.
For chunking and parsing configurations, select the radio button for Custom.
For Chunking strategy, choose Fixed-size chunking.
For Max tokens, enter 250.
For Overlap percentage between chunks, enter 30.
Leave everything as default and choose Next.

In the Select embeddings model and configure vector store section, provide the following information:

For Embeddings model, choose Titan Text Embeddings v2.
Under Vector database, select Quick create a new vector store.
Leave everything else as default and choose Next.

Review the knowledge base settings and choose Create knowledge base.

Amazon Bedrock now provisions the necessary resources and sets up the knowledge base for you. This process may take a few minutes to complete. Note the knowledge base ID, as shown in the following screenshot.

Choose the data source name and note the data source ID, as shown in the following screenshot.

Create the Streamlit application
After the knowledge base is set up, complete the following steps to create the Streamlit application using the CloudFormation template:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose Create stack.
Select With new resources (standard).
For the template source, choose Upload a template file.
Choose Choose file and upload the template you downloaded earlier.
Enter a name for your stack.
Configure the following parameters:

KnowledgeBase Configuration

For KnowledgeBasedID, enter the knowledge base ID that you saved earlier.
For DatasourceID, enter the data source ID that you saved earlier.

S3Bucket Configuration

For RAGDocumentInput, enter the name of the bucket you created.

S3Bucket Configuration

For SubnetId, choose your public subnet.
For VpcId, choose the VPC ID in which you want to deploy the Streamlit application.
For YourPublicIP, enter the public IP address from which you will access the Streamlit application.

S3Bucket Configuration

For InstanceType and LatestLinuxAMI, you can use the default values.

Review the stack details and select the checkbox in the Capabilities section:

I acknowledge that AWS CloudFormation might create IAM resources

Choose Create stack to initiate the stack creation process. This will take a few minutes to complete.

When the stack is complete, you can refer to the stack’s Outputs tab for the Streamlit application URL.
Now that we have deployed the Streamlit application, let’s see how users can interact with it and ask questions.
Test the solution
We can divide the UI experience into two phases: document ingestion and document querying.
The document ingestion workflow consists of the following steps:

Users upload documents using the Streamlit application. For testing, we can use Amazon earnings for the last 16 quarters. The application then uploads the documents to an S3 bucket.
The upload event invokes a Lambda function. This function starts an ingestion job to sync the data to the knowledge base (a minimal sketch of such a function follows this list). Using the specified chunking strategy, the knowledge base converts the documents in the S3 bucket into vector embeddings, which are stored in the default Amazon OpenSearch Serverless vector database. This process takes about 15 minutes to complete. You can follow the steps mentioned here to check the sync history, which contains details about when the data source was synced. Wait for the sync to complete before you start asking questions in the Streamlit application.
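The following is a minimal sketch, not the deployed function's exact code, of a Lambda handler that starts the ingestion job when the upload event fires; the knowledge base and data source IDs are assumed to be supplied as environment variables.

```python
import os
import boto3

bedrock_agent = boto3.client("bedrock-agent")

def lambda_handler(event, context):
    # Start a sync of the S3 data source into the knowledge base.
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=os.environ["KNOWLEDGE_BASE_ID"],
        dataSourceId=os.environ["DATA_SOURCE_ID"],
    )
    return {"ingestionJobId": response["ingestionJob"]["ingestionJobId"]}
```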

The document querying workflow consists of the following steps:

Users interact with a Streamlit application running on Amazon Elastic Compute Cloud (Amazon EC2), where they enter questions (queries) about the documents and receive responses.
The application passes the queries to the knowledge base.
The knowledge base returns the most relevant text. In this case, because we’re using hybrid search, it returns keyword-based as well as semantic search-based results. In the hybrid search, we maximize the number of chunks returned (to 100).
The application invokes the specified LLM in Amazon Bedrock and passes the retrieved results as context, along with the original query, to the LLM (see the sketch following this list).
The LLM processes the prompt and context, and returns a response.
The Streamlit application displays the results to the user.
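The following sketch simplifies the step where the application passes the retrieved results and the original query to the LLM; it uses the Amazon Bedrock Converse API, and the model ID and prompt wording are illustrative rather than the application's exact code.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def answer_question(question, retrieval_results):
    # Concatenate the retrieved chunks into a single context block.
    context_text = "\n\n".join(r["content"]["text"] for r in retrieval_results)
    prompt = (
        "Use only the following context to answer the question.\n\n"
        f"Context:\n{context_text}\n\nQuestion: {question}"
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```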

To test this setup, we deployed this solution in an AWS account and analyzed Amazon earnings for the last 16 quarters. The following results compare a semantic search with default chunking against a hybrid search with fixed-size chunking and maximum results.

Question: What is the change in total operating expenses for technology and infrastructure from 2020 to 2023?
Semantic search with default chunking: Based on the information provided in the search results, I could not find the exact change in total operating expenses for technology and infrastructure from 2020 to 2023. The search results provide some financial data, but do not break down the operating expenses into specific categories like technology and infrastructure. Without that level of detail, I cannot determine the exact change in those expenses over that time period.
Hybrid search with fixed-size chunking: Based on the financial information provided, Amazon's operating expenses for technology and infrastructure increased from $42.7 billion in 2020 to $85.6 billion in 2023, an increase of $42.9 billion.

Question: What are the international operating expenses in 2020, 2021 and 2022?
Semantic search with default chunking: Based on the search results, I could not find any information about Amazon's international operating expenses for the years 2020, 2021, and 2022. The financial information provided in the search results focuses on Amazon's overall operating expenses, cash flows, and other financial metrics, but does not break down the expenses by international operations.
Hybrid search with fixed-size chunking: The international operating expenses for Amazon were: 2020: $103,695 million; 2021: $128,711 million; 2022: $125,753 million.

As seen in the results, the hybrid search option provides correct answers to all the questions.
Clean up
To prevent incurring costs after deploying and testing this solution, delete the S3 bucket, knowledge base, and CloudFormation stack.
Conclusion
In this post, we discussed how Amazon Bedrock Knowledge Bases provides a powerful solution that enables numerical analysis on documents. You can deploy this solution in an AWS account and use it to analyze different types of documents. As we continue to push the boundaries of generative AI, solutions like this will play a pivotal role in bridging the gap between unstructured data and actionable insights, enabling organizations to unlock the full potential of their data assets.
To further explore the advanced RAG capabilities of Amazon Bedrock Knowledge Bases, refer to the following resources:

Create a multimodal assistant with advanced RAG and Amazon Bedrock
Build RAG and agent-based generative AI applications with new Amazon Titan Text Premier model, available in Amazon Bedrock

About the Authors
Sanjeev Pulapaka is a Principal Solutions architect and the Single Threaded Leader for AI/ML in the US federal civilian team at AWS. He advises customers on AI/ML-related solutions that advance their mission. Sanjeev has extensive experience in leading, architecting, and implementing high-impact technology solutions that address diverse business needs in multiple sectors, including commercial, federal, and state and local governments. He has an undergraduate degree in engineering from the Indian Institute of Technology and an MBA from the University of Notre Dame.
Muhammad Qazafi is a Solutions Architect based in the US. He assists customers in designing, developing, and implementing secure, scalable, and innovative solutions on AWS. His objective is to help customers achieve measurable business outcomes through the effective utilization of AWS services. With over 15 years of experience, Muhammad brings a wealth of knowledge and expertise across a diverse range of industries. This extensive experience enables him to understand the unique challenges faced by different businesses and help customers create solutions on AWS.
Venkata Kampana is a Senior Solutions architect in the AWS Health and Human Services team and is based in Sacramento, California. In this role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Deploy generative AI agents in your contact center for voice and chat …

This post is co-written with Vraj Shah and Chaitanya Hari from DoorDash.
DoorDash connects consumers with their favorite local businesses in more than 30 countries across the globe. Recently, they faced a significant challenge in handling the high volume of calls from its contractor delivery workers, known as Dashers. With a user base of over 37 million active consumers and 2 million monthly active Dashers at the end of 2023, the company recognized the need to reduce the burden on its live agents by providing a more efficient self-service experience for Dashers.
To address this challenge, the contact center team at DoorDash wanted to harness the power of generative AI to deploy a solution quickly, and at scale, while maintaining their high standards for issue resolution and customer satisfaction. Dashers, who generally prefer calling into support rather than texting while they’re on the road, require fast and reliable assistance, with minimal response latency. This low latency requirement became a critical factor in DoorDash’s quest for an effective, voice-enabled self-service solution.
Working with the AWS Generative AI Innovation Center, DoorDash built a solution to provide Dashers with a low-latency self-service voice experience to answer frequently asked questions, reducing the need for live agent assistance, in just 2 months.
The solution uses Amazon Lex, a voice-enabled conversational AI service; Amazon Bedrock, a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case; and Amazon Bedrock Knowledge Bases, a fully managed service that connects large language models (LLMs) to your data sources. It’s a fully serverless architecture that uses Amazon OpenSearch Serverless, which can run petabyte-scale workloads, without you having to manage the underlying infrastructure.
The solution fields hundreds of thousands of calls per day, responding to Dashers with answers to their questions in 2.5 seconds or less. It also enables operational capabilities including automated testing, conversation analytics, monitoring and observability, and LLM hallucination prevention and detection.

“We wanted to empower Dashers to get help with their most common questions and issues as quickly and efficiently as possible, saving them time, effort, and increasing their trust in DoorDash’s self-service capabilities,”
– Chaitanya Hari, Contact Center Product Lead at DoorDash.

In this post, we show how you can deploy generative AI agents in your contact center using AWS services.
Solution overview
For this post, we use a fictional hotel chain example called Example Corp Hospitality Group. The following is a sample interaction between a customer and the LLM-powered automated agent:
Customer: Hi, we’re looking for a great place for a family vacation
Agent: Example Corp Family Getaways offers family-friendly hotels and resorts located near top attractions like theme parks, national parks, and historic sites. The resorts feature spacious accommodations, activities for all ages, and diverse dining options to create unforgettable family memories.
Customer: Where are they located?
Agent: Example Corp Family Getaways offers family-friendly resorts in a variety of popular vacation destinations, including Orlando, Florida; Anaheim, California; Williamsburg, Virginia; Pigeon Forge, Tennessee; San Antonio, Texas; Scottsdale, Arizona; Branson, Missouri; Wisconsin Dells, Wisconsin; and Hershey, Pennsylvania.
Customer: Pigeon Forge sounds interesting.
Agent: Our Pigeon Forge resort offers families a chance to experience the great outdoors while enjoying the excitement of Dollywood and other nearby attractions. Guests can enjoy our indoor water park, mini-golf course, and nightly s’mores roasts around the fire pit.
Customer: Do they have kids’ menus?
Agent: Example Corp Family Getaways resorts in Pigeon Forge, Tennessee feature family-friendly dining options, including dedicated kids’ menus with favorites like pizza, chicken nuggets, and mac and cheese.
You can deploy the solution in your own AWS account and try the example solution. The following diagram illustrates the solution architecture.

We will walk you through deploying and testing these major components of the solution:

An AWS CloudFormation stack to set up an Amazon Bedrock knowledge base, where you store the content used by the solution to answer questions.
A CloudFormation stack to create an Amazon Lex bot and an AWS Lambda fulfillment function, which implement the core Retrieval Augmented Generation (RAG) question answering capability.
An optional CloudFormation stack to deploy a data pipeline to enable a conversation analytics dashboard.
An optional CloudFormation stack to enable an asynchronous LLM hallucination detection feature.
Optional Jupyter notebooks in Amazon SageMaker that provide an automated testing capability that compares generated answers to ground truth answers, providing pass/fail grades with explanations.

Everything you need is also provided as open source in our GitHub repo.
Prerequisites
You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?
This solution uses Amazon Bedrock LLMs to find answers to questions from your knowledge base. Before proceeding, if you have not previously done so, request access to at least the following Amazon Bedrock models:

Amazon Titan Embeddings G1 – Text
Cohere Embed English v3 and Cohere Embed Multilingual v3
Anthropic’s Claude 3 Haiku and Anthropic’s Claude 3 Sonnet

If you’ll be integrating with Amazon Connect, make sure you have an instance available in your account. If you don’t already have one, you can create one. If you plan to deploy the conversation analytics stack, you need Amazon QuickSight, so make sure you have enabled it in your AWS account. 
At the time of writing, this solution is available in the following AWS Regions: Asia Pacific (Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, London), US East (N. Virginia), and US West (Oregon).
Deploy the Amazon Bedrock knowledge base
You can use the provided CloudFormation stack to create the Amazon Bedrock knowledge base instances you need, using Amazon Simple Storage Service (Amazon S3) as a data source. Complete the following steps to set up your knowledge base:

Sign in to your AWS account, then choose Launch Stack to deploy the CloudFormation template:

Provide a stack name, for example contact-center-kb.
Provide the name for an existing S3 bucket, for example contact-center-kb-(your-account-number). This is where the content for the demo solution will be stored. Create this S3 bucket if you don’t already have one.
Do not specify an S3 prefix.
Choose an embedding model, such as amazon.titan-embed-text-v2:0.
Choose the Fixed-sized chunking chunking strategy.
For the maximum tokens per chunk entry, use 600 for the Amazon Titan embeddings model. (If you are using the Cohere embeddings model, use 512). This represents about a full page of text.
For the percentage overlap, use 10%.
Leave the four entries for Index Details at their default values (index name, vector field name, metadata field name, and text field name).
Choose Next.
On the Configure stack options page, choose Next.
On the Review and create page, acknowledge the IAM capabilities message and choose Submit.

The stack will take about 10 minutes to deploy.
Upload the sample content and test your knowledge base
The demonstration sample for the solution includes an LLM-based hotel-bot that can answer questions about the fictional hotel chain Example Corp Hospitality Group. You need to load the content for this hotel chain into the S3 bucket that you specified for the knowledge base stack. You can find the S3 bucket used by the CloudFormation stack on the Outputs tab for the stack.

Either using the AWS Command Line Interface (AWS CLI) or the AWS Management Console, upload the following folders from the content section of the GitHub repo:

corporate
family-getaways
luxury-suites
party-times
seaside-resorts
waypoint-inns

You can choose either the PDF versions or the Word document versions (Word versions recommended). When you’re done, the top level of your S3 bucket should contain six folders, each containing a single Word or PDF document.

On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
Choose your new knowledge base to open it.

A message appears that says “One or more data sources have not been synced.”

Select the data source and choose Sync.

The sync process should only take a minute or two.
After your data source has been synced, you can try some question answering on the Amazon Bedrock console. Make sure you have enabled all the models approved by your organization on the Amazon Bedrock Model access page.
Select an LLM model, such as Anthropic’s Claude 3 Haiku on Amazon Bedrock, and start asking questions! You might want to peruse the sample documents you uploaded for some ideas about questions to ask.

Deploy the hallucination detection stack (optional)
If you want to use the optional asynchronous hallucination detection feature, deploy this stack. Otherwise, move on to the next section. You can use this CloudFormation stack for any RAG-based solution requiring asynchronous hallucination detection.

Choose Launch Stack:

Provide a stack name, for example contact-center-hallucination-detection.
Specify an LLM to perform the hallucination detection. At the time of writing, there are seven LLMs that are recommended for hallucination detection. For the demo solution, choose the default (Claude V3 Sonnet).
Optionally, create an AWS Key Management Service (AWS KMS) customer managed key (CMK) to encrypt the Amazon Simple Queue Service (Amazon SQS) queue and the Amazon CloudWatch Logs log group for the Lambda function (recommended for production).

There are two types of Amazon CloudWatch alarms in this stack:

ERROR alarms – For code issues with the Lambda function that does the hallucination detection work
WARNING alarms – For when the Lambda function actually detects a hallucination

Both alarm types are optional, but recommended.

Choose yes to enable or no to disable the alarms.
For the alarms that you enable, you can specify an optional email address or distribution list to receive email notifications about the alarms.
Choose Next.
On the Configure stack options page, choose Next.
On the Review and create page, acknowledge the IAM capabilities message and choose Submit.

The stack will take about a minute or two to deploy.
When the stack is complete, you can review the resources it creates on the Resources tab for the CloudFormation stack. In particular, review the Lambda function code.
If you entered email addresses for the alarm notifications, you should receive email requests asking you to confirm the subscriptions. Confirm them to receive email notifications about alarms that may occur.
Deploy the RAG solution stack
If you’re integrating with Amazon Connect, make sure you have an instance available in your account. If you don’t already have one, you can create one. Then complete the following steps to deploy the Amazon Lex bot and Lambda fulfillment function:

Choose Launch Stack:

Provide a stack name, for example contact-center-rag-solution.
Provide a name for the Amazon Lex bot, for example hotel-bot.
Specify the number of conversation turns to retain for context. This can be optimized for different use cases and datasets. For the hotel-bot demo, try the default of 4.
Optionally, specify an existing CloudWatch Logs log group ARN for the Amazon Lex conversation logs. You’ll need this if you’re planning to deploy the conversation analytics stack. Create a log group if you don’t already have one.
Optionally, enter a value for Lambda provisioned concurrency units for the Amazon Lex bot handler function. If set to a non-zero number, this will prevent Lambda cold starts and is recommended for production and for internal testing. For development, 0 or 1 is recommended.
Optionally, select the option to create a KMS CMK to encrypt the CloudWatch Logs log groups for the Lambda functions (recommended for production).
If you’re integrating with Amazon Connect, provide the Amazon Connect instance ARN, as well as the name for a new contact flow that the stack will create for you.
Provide the knowledge base ID from the knowledge base stack you just created. You can find this on the Outputs tab of the knowledge base stack.
Provide the S3 bucket used by the knowledge base stack (also referenced on the Outputs tab).
If you created the hallucination detection stack, enter the SQS queue name. You can find this on the Outputs tab of the hallucination detection stack.
If you opted for a KMS key for your hallucination detection stack, enter the KMS key ARN.
Choose Next.
On the Configure stack options page, choose Next.
On the Review and create page, acknowledge the IAM capabilities message and choose Submit.

The stack will take a few minutes to complete.
To try the RAG solution, navigate to the Amazon Lex console and open the hotel-bot bot. The bot has a single language section for the English language. Choose Intents in the navigation pane to check out the intents for this sample bot. They include the following:

Intents related to questions about the hotel chain and its various hotel brands – This includes Accommodations, Amenities, CorporateOverview, Locations, Parking, and more. These intents are routed to the RAG solution by Amazon Lex. Technically, intents like these could be omitted, allowing the FallbackIntent to handle requests of this nature. However, including these intents (and their sample utterances) provides Amazon Lex with information about the “language” of your solution domain, allowing it to better optimize its speech-to-text engine and improve speech transcription accuracy. In addition, including these intents is useful for conversation analytics.
SwitchBrand – This intent is designed to improve conversation flow by allowing the user to say things like “What about at your other hotels?” in the middle of a conversation.
Booking – This demonstrates an example of routing the caller to a live agent queue.
SpeakToAgent – This intent is for when a caller specifically requests a live agent.
Welcome, Goodbye, and Help – These conversation support intents are for starting and ending the conversation, or asking what the bot can do.
FallbackIntent – This is the standard intent for questions or requests that don’t match other intents. In this example solution, such requests are also routed to the RAG solution to allow the LLM to answer based on the content in the knowledge base (a simplified fulfillment sketch follows this list).
SelectKnowledgeBase and SelectLLM – These allow the user to direct the RAG solution to use a different knowledge base instance (if more than one is available) or a different LLM. These intents are designed for testing purposes, and should normally be included only in non-production deployments. You can test the RAG solution with any of the LLMs available on Amazon Bedrock. You can also switch to a different knowledge base or LLM mid-conversation, if desired.
ToggleLLMGuardrails and ToggleLLMContext – These allow the user to turn the prompt-based LLM guardrails off or on, and to disable or enable the retrieval of information from the knowledge base. These intents are designed for testing purposes, and should normally be included only in non-production environments. You can turn these settings off and on mid-conversation, if desired.
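The following is a simplified sketch, not the repository's actual fulfillment code, of how a Lex V2 fulfillment Lambda function can route an utterance to the knowledge base and return the generated answer to the bot; the knowledge base ID and model ARN are placeholders.

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def lambda_handler(event, context):
    intent_name = event["sessionState"]["intent"]["name"]
    user_utterance = event.get("inputTranscript", "")

    # Route the utterance to the RAG solution (placeholder IDs).
    answer = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": user_utterance},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "YOUR_KB_ID",
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
            },
        },
    )["output"]["text"]

    # Close the intent and return the answer to Amazon Lex.
    return {
        "sessionState": {
            "dialogAction": {"type": "Close"},
            "intent": {"name": intent_name, "state": "Fulfilled"},
        },
        "messages": [{"contentType": "PlainText", "content": answer}],
    }
```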

You can choose Test on the Amazon Lex console to try the solution.

Try some sample conversations, for example:

Ask “We’re looking for a nice place for a family vacation” and the bot will respond “Example Corp Family Getaways offers family-friendly accommodations…”
Ask “Where are they located?” and the bot will respond “Example Corp Family Getaways has locations in…”
Ask “Tell me more about the one in Pigeon Forge” and the bot will respond “The Example Corp Family Getaways resort in Pigeon Forge, Tennessee is…”

You can refer to the sample documents you uploaded for some ideas about questions to ask.
If you deployed the hallucination detection stack, you can look at its assessment of the answers you got when you tested. From the hallucination detection stack details page, on the Resources tab, choose the HallucinationDetectionFunctionLogGroup entry. This opens the CloudWatch Logs log group for the Lambda hallucination detection function. You can inspect the log statements to observe the hallucination detection process in action, as shown in the following screenshot.

If you’re integrating with Amazon Connect, there will be a new contact flow in the Amazon Connect instance you specified, as shown in the following screenshot.

To test using voice, just claim a phone number, associate it with this contact flow, and give it a call!
Deploy the conversation analytics stack (optional)
This stack uses QuickSight for analytics, so make sure you have already enabled it in your AWS account before deploying this stack.

Choose Launch Stack:

Provide a stack name, for example contact-center-analytics.
Provide the name (not the ARN) of the Amazon Lex conversation logs log group. This is the same CloudWatch Logs log group you used for the RAG solution CloudFormation stack.
Choose an option for purging source log streams from the log group. For testing, choose no.
Choose an option for redacting sensitive data from the conversation logs. For testing, choose no.
Leave the personally identifiable information (PII) entity types and confidence score thresholds at their default values.
Choose an option for allowing unredacted logs for the Lambda function in the data pipeline. For testing, choose yes.
Select an option for creating a KMS CMK.

If you create a CMK, it will be used to encrypt the data in the S3 bucket that this stack creates, where the normalized conversation data is housed. This allows you to control which IAM principals are allowed to decrypt the data and view it. This setting is recommended for production.

Select the options for enabling CloudWatch alarms for ERRORS and WARNINGS in the Amazon Lex data pipeline. It is recommended to enable these alarms.
For the alarms that you enable, you can specify an optional email address or distribution list to receive email notifications about the alarms.
Choose Next.
On the Configure stack options page, choose Next.
On the Review and create page, acknowledge the IAM capabilities message and choose Submit.

The stack should take about 5 minutes to complete.
The following diagram illustrates the architecture of the stack.

As Amazon Lex writes conversation log entries to CloudWatch Logs (1), they are picked up by Amazon Data Firehose and streamed to an S3 bucket (2). Along the way, a Lambda transformation function (3) simplifies the JSON structure of the data to make it more user-friendly for querying purposes. The Lambda function can also redact sensitive data using Amazon Comprehend (4), and optionally purge the entries from the CloudWatch Logs log group as it consumes them.
On a scheduled basis (every 5 minutes), an AWS Glue crawler (5) inspects new data in the S3 bucket, and updates a data schema that is used by Amazon Athena (6) to provide a SQL interface to the data. This allows tools like QuickSight (7) to create near real-time dashboards, analytics, and visualizations of the data.
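Because Athena exposes the normalized conversation data through SQL, you can also query it directly; the following is a minimal sketch (not part of the provided stacks) that counts rows in the lex_conversation_logs table, with the database name and query results location as placeholders for your environment.

```python
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS conversation_turns FROM lex_conversation_logs",
    QueryExecutionContext={"Database": "your_analytics_database"},        # placeholder
    ResultConfiguration={"OutputLocation": "s3://your-athena-results/"},  # placeholder
)
print(execution["QueryExecutionId"])
```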
Set up the QuickSight dashboard (optional)
Before you create the QuickSight dashboard, make sure to return to the Amazon Lex console and ask a few questions, in order to generate some data for the dashboard. It will take about 5 minutes for the pipeline to process this new conversation data and make it available to QuickSight.
To set up dashboards and visualizations in QuickSight, complete the following steps:

On the QuickSight console, choose the user profile icon and choose Manage QuickSight.
Under Security & permissions, choose Manage in the QuickSight access to AWS services section.
Under Amazon S3, choose Select S3 buckets.
Enable access to the S3 bucket created by the conversation analytics stack (it will have a name with a 12-character unique identifier prepended to lex-conversation-logs). You don’t need to enable write permissions.
Choose Finish, then choose Save.
Choose the QuickSight menu icon to return to the main page in QuickSight.
In the navigation pane, choose Datasets.
Choose New dataset.
From the list of dataset sources, choose Athena.
Enter a data source name (for example contact-center-analytics).
Choose Create data source.
In the Choose your table window, choose your database, select your lex_conversation_logs table, and choose Edit/Preview data.

This opens your new QuickSight dataset. You can review the various attributes available, and see some results from your testing.

For improved speed in displaying the data, you can select the SPICE option for Query mode, but that will mean you need to refresh SPICE (or set up an hourly auto-update schedule) when you want to see data updates based on additional testing.

For now, leave the setting as Direct query.
When you’re ready, choose PUBLISH & VISUALIZE.
In the New sheet window, keep the defaults and choose CREATE.

This opens the analysis page, where you can start creating visualizations.

Automated testing notebooks (optional)
To try the automated testing capability, you need a SageMaker Jupyter notebook. Alternatively, you can run the notebooks locally in your integrated development environment (IDE) or other environment that supports Jupyter notebooks.

On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
Choose Create notebook instance.
Give your notebook a name, such as contact-center-rag-testing.
To enable multi-threaded testing, it’s recommended to select a larger instance, such as ml.m5.2xlarge (which has 8 vCPUs) or ml.m5.4xlarge (which has 16 vCPUs). Don’t forget to stop the instance when it’s not in use.
Keep the default setting for Platform identifier (Amazon Linux 2, Jupyter Lab 3).
Under Additional configuration, increase the Volume size in GB setting to 50 GB.
In the Permissions and encryption section, under IAM role, choose Create a new role in the drop down list (don’t use the role creation wizard).
In the Create an IAM role window, you can specify any S3 buckets you want to provide access to (none are needed for this solution).
Choose Create role.

Choose Create notebook instance.

It will take several minutes for your notebook instance to become available. While it’s being created, you can update the IAM role to add some inline policies you’ll need for accessing Amazon Bedrock and Amazon Lex.

On the Notebook instances page, open your notebook instance (for example, contact-center-rag-testing) and then choose the entry under IAM role ARN to open the role.
Add the following inline policies (available in the notebooks/iam-roles folder in the GitHub repository):

bedrock-agents-retrieve.json
bedrock-invoke-model-all.json
lex-invoke-bot.json
opensearch-serverless-api-access.json

You can revise these roles to limit resource access as needed.

After your notebook instance has started, choose Open Jupyter to open the notebook.
Upload the following to your notebook instance (if desired, you can zip the files locally, upload the zip archive, and then unzip it in SageMaker):

bedrock_helpers.py – This script configures LLM instances for the notebooks.
bedrock_utils – You should make sure to upload all subfolders and files, and confirm that the folder structure is correct.
run_tests.ipynb – This notebook runs a set of test cases.
generate_ground_truths.ipynb – Given a set of questions, this notebook generates potential ground truth answers.
test-runs – This folder should contain Excel workbooks.

Open the run_tests.ipynb notebook.
In the second cell, replace the bot_id and bot_alias_id values with the values for your Amazon Lex bot (you can find these on the Outputs tab of the RAG solution stack).
After you update these values, choose Restart & Run All on the Kernel menu.

If you’re using a ml.m5.2xlarge instance type, it should take about a minute to run the 50 test cases in the test-runs/test-cases-claude-haiku-2024-09-02.xlsx workbook. When it’s complete, you should find a corresponding test-results workbook in the test-runs folder in your notebook.

After a few minutes, you can also see the test results in your conversation analytics dashboard.

Adapt the solution to your use case
You can adapt this solution to your specific use cases with minimal work:

Replace the Amazon Bedrock Knowledge Bases sample content with your content – Replace the content in the S3 bucket and organize it into a folder structure that makes sense for your use case. You can create a new knowledge base for your content.
Replace the intents in the Amazon Lex bot with intents for your use case – Modify the Amazon Lex bot definition to reflect the interactions you want to enable for your use case.
Modify the LLM prompts in the bedrock_utils code – In the Amazon Lex bot fulfillment Lambda function, review the LLM prompt definitions in the bedrock_utils folder. For example, provide a use case-specific definition for the role of the LLM-based agent.
Modify the bot handler code if necessary – In the Amazon Lex bot fulfillment Lambda function, review the code in the TopicIntentHandler.py function. For the knowledge base search, this code provides an example that uses the sample hotel brands as topics. You can replace this metadata search query with one appropriate for your use case (see the sketch following this list).
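The following is a minimal sketch of such a metadata-filtered knowledge base query; the knowledge base ID is a placeholder and the topic metadata attribute is hypothetical, so the actual TopicIntentHandler.py code in the repository may differ.

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={"text": "Do the family resorts have kids' menus?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            # Hypothetical metadata attribute used to scope the search to one brand.
            "filter": {"equals": {"key": "topic", "value": "family-getaways"}},
        }
    },
)
```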

Clean up
Congratulations! You have completed all the steps for setting up your voice-enabled contact center generative AI agent solution using AWS services.
When you no longer need the solution deployed in your AWS account, you can delete the CloudFormation stacks that you deployed, as well as the SageMaker notebook instance if you created one.
Conclusion
The contact center generative AI agent solution offers a scalable, cost-effective approach to automate Q&A conversations in your contact center, using AWS services like Amazon Bedrock, Amazon Bedrock Knowledge Bases, OpenSearch Serverless, and Amazon Lex.
The solution code is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features through GitHub pull requests. Browse to the GitHub repository to explore the code, and check the CHANGELOG for the latest changes and the README for the latest documentation updates.
For expert assistance, the AWS Generative AI Innovation Center, AWS Professional Services, and our AWS Partners are here to help.

About the Authors
Vraj Shah is a Connect Developer at DoorDash.
Chaitanya Hari is a Voice/Contact Center Product Lead at DoorDash.
Marcelo Silva is a Principal Product Manager at Amazon Web Services, leading strategy and growth for Amazon Bedrock Knowledge Bases and Amazon Lex.
Adam Diesterhaft is a Sr. Pursuit Solutions Architect on the Amazon Connect team.
Brian Yost is a Principal Deep Learning Architect in the AWS Generative AI Innovation Center.

Migrating to Amazon SageMaker: Karini AI Cut Costs by 23%

This post is co-written with Deepali Rajale from Karini AI.
Karini AI, a leading generative AI foundation platform built on AWS, empowers customers to quickly build secure, high-quality generative AI applications. Generative AI is not just a technology; it’s a transformational tool that is changing how businesses operate. Its adoption, however, presents a significant challenge for enterprises, depending on where they are in the adoption journey. Although pilot projects using generative AI can start effortlessly, most enterprises struggle to progress beyond this phase. According to Everest Research, more than 50% of projects do not move beyond the pilot stage because of the absence of standardized or established generative AI operational practices.
Karini AI offers a robust, user-friendly GenAI foundation platform that empowers enterprises to build, manage, and deploy Generative AI applications. It allows beginners and expert practitioners to develop and deploy Gen AI applications for various use cases beyond simple chatbots, including agentic, multi-agentic, Generative BI, and batch workflows. The no-code platform is ideal for quick experimentation, building PoCs, and rapid transition to production with built-in guardrails for safety and observability for troubleshooting. The platform includes an offline and online quality evaluation framework to assess quality during experimentation and continuously monitor applications post-deployment. Karini AI’s intuitive prompt playground allows authoring prompts, comparison with different models across providers, prompt management, and prompt tuning. It supports iterative testing of more straightforward, agentic, and multi-agentic prompts. For production deployment, the no-code recipes enable easy assembly of the data ingestion pipeline to create a knowledge base and deployment of RAG or agentic chains. The platform owners can monitor costs and performance in real-time with detailed observability and seamlessly integrate with Amazon Bedrock for LLM inference, benefiting from extensive enterprise connectors and data preprocessing techniques.
The following diagram illustrates how Karini AI delivers a comprehensive Generative AI foundational platform encompassing the entire application lifecycle. This platform delivers a holistic solution that speeds up time to market and optimizes resource utilization by providing a unified framework for development, deployment, and management.

In this post, we share how Karini AI’s migration of vector embedding models from Kubernetes to Amazon SageMaker endpoints improved concurrency by 30% and saved over 23% in infrastructure costs.
Karini AI’s Data Ingestion Pipeline for creating vector embeddings
Enriching large language models (LLMs) with new data is crucial to building practical generative AI applications. This is where Retrieval Augmented Generation (RAG) comes into play. RAG enhances LLMs’ capabilities by incorporating external data and producing state-of-the-art performance in knowledge-intensive tasks. Karini AI offers no-code solutions for creating Generative AI applications using RAG. These solutions include two primary components: a data ingestion pipeline for building a knowledge base and a system for knowledge retrieval and summarization. Together, these pipelines simplify the development process, enabling the creation of powerful AI applications with ease.
Data Ingestion Pipeline
Ingesting data from diverse sources is essential for executing Retrieval Augmented Generation (RAG). Karini AI’s data ingestion pipeline enables connection to multiple data sources, including Amazon S3, Amazon Redshift, Amazon Relational Database Service (Amazon RDS), websites, and Confluence, handling both structured and unstructured data. This source data is preprocessed, chunked, and transformed into vector embeddings before being stored in a vector database for retrieval. Karini AI’s platform provides flexibility by offering a range of embedding models from its model hub, simplifying the creation of vector embeddings for advanced AI applications.
Here is a screenshot of Karini AI’s no-code data ingestion pipeline.

Karini AI’s model hub streamlines adding models by integrating with leading foundation model providers such as Amazon Bedrock and self-managed serving platforms.
Infrastructure challenges
As customers explore complex use cases and datasets grow in size and complexity, Karini AI must scale the data ingestion process efficiently to provide high concurrency for creating vector embeddings. This requires state-of-the-art embedding models, such as those listed on the MTEB leaderboard, which evolve rapidly and are often unavailable on managed platforms.
Before migrating to Amazon SageMaker, we deployed our models on self-managed Kubernetes (K8s) running on Amazon EC2 instances. Kubernetes offered significant flexibility to deploy models from Hugging Face quickly, but our engineering team soon had to manage many aspects of scaling and deployment. We faced the following challenges with our existing setup that needed to be addressed to improve efficiency and performance:

Keeping up with state-of-the-art (SOTA) models: We managed different deployment manifests for each model type (such as classifiers, embeddings, and autocomplete), which was time-consuming and error-prone. We also had to maintain the logic to determine the memory allocation for different model types.
Managing dynamic concurrency was hard: A significant challenge with using models hosted on Kubernetes was achieving the highest dynamic concurrency level. We aimed to maximize endpoint performance to achieve target transactions per second (TPS) while meeting strict latency requirements.
Higher costs: Although Kubernetes provides robust capabilities, the dynamic nature of our data ingestion pipelines left instances under-utilized, which drove up costs.

Our search for an inference platform led us to Amazon SageMaker, a solution that efficiently manages our models for higher concurrency, meets customer SLAs, and scales down serving when not needed. The reliability of SageMaker’s performance gave us confidence in its capabilities.
Amazon SageMaker for Model Serving
Choosing Amazon SageMaker was a strategic decision for Karini AI. It balanced the need for higher concurrency at a lower cost, providing a cost-effective solution for our needs. SageMaker’s ability to scale and maximize concurrency while ensuring sub-second latency addresses various generative AI use cases, making it a long-lasting investment for our platform.
Amazon SageMaker is a fully managed service that allows developers and data scientists to quickly build, train, and deploy machine learning (ML) models. With SageMaker, you can deploy your ML models on hosted endpoints and get real-time inference results. You can easily view the performance metrics for your endpoints in Amazon CloudWatch, automatically scale endpoints based on traffic, and update your models in production without losing any availability.
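As an illustration only, and not Karini AI's actual deployment code, the following sketch hosts a Hugging Face embedding model on a real-time SageMaker endpoint with the SageMaker Python SDK; the model ID, instance type, endpoint name, and container versions are assumptions you would adjust for your workload.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # or pass an explicit IAM role ARN

model = HuggingFaceModel(
    role=role,
    env={
        "HF_MODEL_ID": "BAAI/bge-large-en-v1.5",  # example open source embedding model
        "HF_TASK": "feature-extraction",
    },
    transformers_version="4.37.0",
    pytorch_version="2.1.0",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="embedding-endpoint",  # placeholder name
)

# Generate an embedding for a text chunk from the ingestion pipeline.
embedding = predictor.predict({"inputs": "Chunked text to embed"})
```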
The following diagram illustrates Karini AI’s data ingestion pipeline architecture with an Amazon SageMaker model endpoint.

Advantages of using SageMaker hosting
Amazon SageMaker offered our generative AI ingestion pipeline many direct and indirect benefits:

Technical debt mitigation: As a managed service, Amazon SageMaker freed our ML engineers from the burden of managing inference infrastructure, enabling them to focus more on our core platform features. This relief from technical debt is a significant advantage of using SageMaker.
Meet customer SLAs: Knowledge base creation is a dynamic task that may require high concurrency during vector embedding generation and minimal load at query time. Based on customer SLAs and data volume, we can choose batch inference, real-time hosting with auto scaling (see the sketch following this list), or serverless hosting. Amazon SageMaker also provides recommendations for instance types suitable for embedding models.
Reduced infrastructure cost: SageMaker is a pay-as-you-go service that allows you to create batch or real-time endpoints when there is demand and delete them when work is complete. This approach reduced our infrastructure cost by more than 23% over the Kubernetes (K8s) platform.
SageMaker JumpStart: SageMaker JumpStart provides access to SOTA models and optimized inference containers, making it straightforward to add new models and make them accessible to our customers.
Amazon Bedrock compatibility: Karini AI integrates with Amazon Bedrock for LLM inference. The custom model import feature allows us to reuse the model weights used in SageMaker model hosting in Amazon Bedrock, maintain a common code base, and switch serving between Amazon Bedrock and SageMaker depending on the workload.
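For the real-time hosting path, the following sketch (with placeholder endpoint and variant names) registers an endpoint variant with Application Auto Scaling so that the instance count tracks invocations per instance; the target value and cooldowns are illustrative.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/embedding-endpoint/variant/AllTraffic"  # placeholder names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="embedding-endpoint-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # target invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```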

Conclusion
By migrating to Amazon SageMaker, Karini AI achieved high performance and reduced model hosting costs. We can deploy custom third-party models to SageMaker and quickly make them available to Karini’s model hub for data ingestion pipelines. We can optimize our infrastructure configuration for model hosting as needed, depending on model size and our expected TPS. Using Amazon SageMaker for model inference enabled Karini AI to handle increasing data complexities efficiently and meet concurrency needs while optimizing costs. Moreover, Amazon SageMaker allows easy integration and swapping of new models, ensuring that our customers can continuously use the latest advancements in AI technology without compromising performance or incurring unnecessary incremental costs.
Amazon SageMaker and Karini.ai offer a powerful platform to build, train, and deploy machine learning models at scale. By leveraging these tools, you can:

Accelerate development: Build and train models faster with pre-built algorithms and frameworks.
Enhance accuracy: Benefit from advanced algorithms and techniques for improved model performance.
Scale effortlessly: Deploy models to production with ease and handle increasing workloads.
Reduce costs: Optimize resource utilization and minimize operational overhead.

Don’t miss out on this opportunity to gain a competitive edge.

About the Authors
Deepali Rajale is the founder of Karini AI, which is on a mission to democratize generative AI across enterprises. She enjoys blogging about Generative AI and coaching customers to optimize Generative AI practice. In her spare time, she enjoys traveling, seeking new experiences, and keeping up with the latest technology trends. You can find her on LinkedIn.
Ravindra Gupta is the Worldwide GTM Lead for SageMaker, with a passion for helping customers adopt SageMaker for their machine learning and generative AI workloads. Ravi is fond of learning new technologies and enjoys mentoring startups on their machine learning practice. You can find him on LinkedIn.

CALM: Credit Assignment with Language Models for Automated Reward Shaping in Reinforcement Learning

Reinforcement Learning (RL) is a critical area of ML that allows agents to learn from their interactions within an environment by receiving feedback as rewards. A significant challenge in RL is solving the temporal credit assignment problem, which refers to determining which actions in a sequence contributed to achieving a desired outcome. This is particularly difficult when feedback is sparse or delayed, meaning agents don’t immediately know if their actions are correct. In such situations, agents must learn how to correlate specific actions with outcomes, but the lack of immediate feedback makes this a complex task. RL systems often fail to generalize and scale effectively to more complicated tasks without effective mechanisms to resolve this challenge.

The research addresses the difficulty of credit assignment when rewards are delayed and sparse. RL agents often start without prior knowledge of the environment and must navigate through it based solely on trial and error. When feedback is scarce, the agent may struggle to develop a robust decision-making process because it cannot discern which actions led to successful outcomes. This scenario can be particularly challenging in complex environments with multiple steps leading to a goal, where only the final action sequence produces a reward. In many instances, agents end up learning inefficient policies or fail to generalize their behavior across different environments due to this fundamental problem.

Traditionally, RL has relied on techniques like reward shaping and hierarchical reinforcement learning (HRL) to tackle the credit assignment problem. Reward shaping is a method where artificial rewards are added to guide the agent’s behavior when natural rewards are insufficient. In HRL, tasks are broken down into simpler sub-tasks or options, with agents being trained to achieve intermediate goals. While both techniques can be effective, they require significant domain knowledge and human input, making them difficult to scale. In recent years, large language models (LLMs) have demonstrated potential in transferring human knowledge into computational systems, offering new ways to improve the credit assignment process without excessive human intervention.

The research team from University College London, Google DeepMind, and the University of Oxford developed a new approach called Credit Assignment with Language Models (CALM). CALM leverages the power of LLMs to decompose tasks into smaller subgoals and assess the agent’s progress toward these goals. Unlike traditional methods that require extensive human-designed rewards, CALM automates this process by allowing the LLM to determine subgoals and provide auxiliary reward signals. The technique reduces human involvement in designing RL systems, making it easier to scale to different environments. Researchers claim that this method can handle zero-shot settings, meaning the LLM can evaluate actions without requiring fine-tuning or prior examples specific to the task.

CALM uses LLMs to assess whether specific subgoals are achieved during a task. For instance, in the MiniHack environment used in the study, agents are tasked with picking up a key and unlocking a door to receive a reward. CALM breaks down this task into manageable subgoals, such as “navigate to the key,” “pick up the key,” and “unlock the door.” Each time one of these subgoals is completed, CALM provides an auxiliary reward to the RL agent, guiding it toward completing the final task. This system reduces the need for manually designed reward functions, which are often time-consuming and domain-specific. Instead, the LLM uses its prior knowledge to effectively shape the agent’s behavior.
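To make this mechanism concrete, here is a minimal sketch of how an LLM-based subgoal checker could feed auxiliary rewards into a standard RL training loop. The query_llm helper, the subgoal list, the reward magnitude, and the Gym-style env/agent interfaces are illustrative assumptions, not the paper's implementation.

# Minimal sketch of LLM-driven auxiliary reward shaping (illustrative only).
# query_llm() is a hypothetical helper that asks an LLM whether a subgoal
# was achieved, given a textual description of the latest transition.

SUBGOALS = ["navigate to the key", "pick up the key", "unlock the door"]

def query_llm(transition_description: str, subgoal: str) -> bool:
    """Hypothetical call to an LLM that returns True if the subgoal is met."""
    raise NotImplementedError  # plug in your preferred LLM client here

def shaped_step(env, agent, state, achieved):
    action = agent.act(state)
    # Assumes a classic Gym-style step API returning (state, reward, done, info)
    next_state, env_reward, done, info = env.step(action)

    # Ask the LLM whether any not-yet-achieved subgoal was just completed
    # and hand out a small auxiliary reward if so.
    aux_reward = 0.0
    for subgoal in SUBGOALS:
        if subgoal not in achieved and query_llm(str(info), subgoal):
            achieved.add(subgoal)
            aux_reward += 0.1  # illustrative auxiliary reward magnitude

    agent.learn(state, action, env_reward + aux_reward, next_state, done)
    return next_state, done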

The researchers’ experiments evaluated CALM using a dataset of 256 human-annotated demonstrations from MiniHack, a gaming environment that challenges agents to solve tasks in a grid-like world. The results showed that LLMs could successfully assign credit in zero-shot settings, meaning the model did not require prior examples or fine-tuning. In particular, the LLM could recognize when subgoals had been achieved, providing useful guidance to the RL agent. The study found that the LLM accurately recognized subgoals and aligned them with human annotations, achieving an F1 score of 0.74. The LLM’s performance improved significantly when using more focused observations, such as cropped images showing a 9×9 view around the agent. This suggests that LLMs can be a valuable tool in automating credit assignment, particularly in environments where natural rewards are sparse or delayed.

The researchers also reported that CALM’s performance was competitive with that of human annotators in identifying successful subgoal completions. In some cases, the LLM achieved an accuracy rate of 0.73 in detecting when an agent had completed a subgoal, and the auxiliary rewards provided by CALM helped the agent learn more efficiently. The team also compared the performance of CALM to existing models like Meta-Llama-3 and found that CALM performed well across various metrics, including recall and precision, with precision scores ranging from 0.60 to 0.97, depending on the model and task.

In conclusion, the research demonstrates that CALM can effectively tackle the credit assignment problem in RL by leveraging LLMs. CALM reduces the need for extensive human involvement in designing RL systems by breaking tasks into subgoals and automating reward shaping. The experiments indicate that LLMs can provide accurate feedback to RL agents, improving their learning ability in environments with sparse rewards. This approach can enhance RL performance in various applications, making it a promising avenue for future research and development. The study highlights the potential for LLMs to generalize across tasks, making RL systems more scalable and efficient in real-world scenarios.

Check out the Paper. All credit for this research goes to the researchers of this project.


The post CALM: Credit Assignment with Language Models for Automated Reward Shaping in Reinforcement Learning appeared first on MarkTechPost.

Trust-Align: An AI Framework for Improving the Trustworthiness of Retr …

Large language models (LLMs) have gained significant attention due to their potential to enhance various artificial intelligence applications, particularly in natural language processing. When integrated into frameworks like Retrieval-Augmented Generation (RAG), these models aim to refine AI systems’ output by drawing information from external documents rather than relying solely on their internal knowledge base. This approach is crucial in ensuring that AI-generated content remains factually accurate, which is a persistent issue in models not tied to external sources.

A key problem faced in this area is the occurrence of hallucinations in LLMs—where models generate seemingly plausible but factually incorrect information. This becomes especially problematic in tasks requiring high accuracy, such as answering factual questions or assisting in legal and educational fields. Many state-of-the-art LLMs rely heavily on parametric knowledge (information learned during training), making them unsuitable for tasks where responses must strictly come from specific documents. To tackle this issue, new methods must be introduced to evaluate and improve the trustworthiness of these models.

Traditional methods focus on evaluating the end results of LLMs within the RAG framework, but few explore the intrinsic trustworthiness of the models themselves. Currently, approaches like prompting techniques align the models’ responses with document-grounded information. However, these methods often fall short, either failing to adapt the models or resulting in overly sensitive outputs that respond inappropriately. Researchers identified the need for a new metric to measure LLM performance and ensure that the models provide grounded, trustworthy responses based solely on retrieved documents.

Researchers from the Singapore University of Technology and Design, in collaboration with DSO National Laboratories, introduced a novel framework called “TRUST-ALIGN.” This method focuses on enhancing the trustworthiness of LLMs in RAG tasks by aligning their outputs to provide more accurate, document-supported answers. The researchers also developed a new evaluation metric, TRUST-SCORE, which assesses models based on multiple dimensions, such as their ability to determine whether a question can be answered using the provided documents and their precision in citing relevant sources.

TRUST-ALIGN works by fine-tuning LLMs using a dataset containing 19,000 question-document pairs, each labeled with preferred and unpreferred responses. This dataset was created by synthesizing natural responses from LLMs like GPT-4 and negative responses derived from common hallucinations. The key advantage of this method lies in its ability to directly optimize LLM behavior toward providing grounded refusals when necessary, ensuring that models only answer questions when sufficient information is available. It improves the models’ citation accuracy by guiding them to reference the most relevant portions of the documents, thus preventing over-citation or improper attribution.
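As a rough illustration of the kind of data such alignment relies on, the sketch below shows one possible layout for a question-document pair with preferred and unpreferred responses. The field names and example content are assumptions for illustration; the paper's actual data format may differ.

# Hypothetical record layout for an alignment preference pair (illustrative).
from dataclasses import dataclass

@dataclass
class PreferencePair:
    question: str
    documents: list[str]   # retrieved passages the answer must be grounded in
    preferred: str         # grounded answer with citations, or a justified refusal
    unpreferred: str       # hallucinated, over-cited, or over-answering response

example = PreferencePair(
    question="What was the company's revenue growth in 2021?",
    documents=["...revenue grew 37% year over year in 2021... [1]"],
    preferred="Revenue grew 37% year over year in 2021 [1].",
    unpreferred="Revenue roughly doubled in 2021, driven by new product lines.",
)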

Regarding performance, the introduction of TRUST-ALIGN showed substantial improvements across several benchmark datasets. For example, when evaluated on the ASQA dataset, LLaMA-3-8b, aligned with TRUST-ALIGN, achieved a 10.73% increase in the TRUST-SCORE, surpassing models like GPT-4 and Claude-3.5 Sonnet. On the QAMPARI dataset, the method outperformed the baseline models by 29.24%, while the ELI5 dataset showed a performance boost of 14.88%. These figures demonstrate the effectiveness of the TRUST-ALIGN framework in generating more accurate and reliable responses compared to other methods.

One of the significant improvements brought by TRUST-ALIGN was in the models’ ability to correctly refuse to answer when the available documents were insufficient. On ASQA, the refusal metric improved by 9.87%, while on QAMPARI, it showed an even higher increase of 22.53%. The ability to refuse was further highlighted in ELI5, where the improvement reached 5.32%. These results indicate that the framework enhanced the models’ accuracy and significantly reduced their tendency to over-answer questions without proper justification from the provided documents.

Another noteworthy achievement of TRUST-ALIGN was in improving citation quality. On ASQA, the citation precision scores rose by 26.67%, while on QAMPARI, citation recall increased by 31.96%. The ELI5 dataset also showed an improvement of 29.30%. This improvement in citation groundedness ensures that the models provide well-supported answers, making them more trustworthy for users who rely on fact-based systems.

In conclusion, this research addresses a critical issue in deploying large language models in real-world applications. By developing TRUST-SCORE and the TRUST-ALIGN framework, researchers have created a reliable method to align LLMs toward generating document-grounded responses, minimizing hallucinations, and improving overall trustworthiness. This advancement is particularly significant in fields where accuracy and the ability to provide well-cited information are paramount, paving the way for more reliable AI systems in the future.

Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.


The post Trust-Align: An AI Framework for Improving the Trustworthiness of Retrieval-Augmented Generation in Large Language Models appeared first on MarkTechPost.

Spiking Network Optimization Using Population Statistics (SNOPS): A Ma …

Building massive neural network models that replicate the activity of the brain has long been a cornerstone of computational neuroscience’s efforts to understand the complexities of brain function. These models, which are frequently intricate, are essential for comprehending how neural networks give rise to cognitive functions.  However, optimizing these models’ parameters to precisely mimic observed brain activity has historically been a difficult and resource-intensive operation requiring much time and specialized knowledge. 

New AI research from Carnegie Mellon University and the University of Pittsburgh introduces a machine learning-driven framework called Spiking Network Optimization using Population Statistics (SNOPS) that holds the potential to transform this process completely. SNOPS was developed by an interdisciplinary team of researchers from both institutions.

Because of the framework’s automation of customization, spiking network models can more faithfully replicate the population-wide variability seen in large-scale neural recordings. In neuroscience, spiking network models, which mimic the biophysics of neural circuits, are extremely useful instruments. On the other hand, their intricacy frequently presents formidable obstacles. These networks’ behavior is extremely sensitive to model parameters, which makes configuration difficult and unpredictable. 

SNOPS automates the optimization process to address these issues directly. Building such models has traditionally been a manual process that requires a lot of time and domain expertise. The SNOPS approach is not only faster, it also automatically finds a larger range of model configurations that are consistent with recorded brain activity. This makes it possible to study the behavior of the model in greater detail and reveals activity regimes that might otherwise go unnoticed.

SNOPS’s capacity to match empirical data and computational models is one of its most important features. It makes use of population statistics from extensive neural recordings to adjust model parameters in a way that closely matches the patterns of actual activity. The study’s use of SNOPS on brain recordings from macaque monkeys’ prefrontal and visual cortices proved this. The findings have demonstrated the need for more complex methods of model tweaking by exposing unidentified limitations of the spiking network models already in use.
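As a loose illustration of fitting a model to population statistics, the sketch below computes a few commonly used spike-count statistics and a simple discrepancy score between model output and recordings; the specific statistics, weighting, and optimizer that SNOPS uses may differ.

import numpy as np

def population_statistics(spike_counts: np.ndarray) -> dict:
    """Summary statistics of a trials x neurons matrix of spike counts (illustrative set)."""
    mean_rate = spike_counts.mean()
    fano = (spike_counts.var(axis=0) / (spike_counts.mean(axis=0) + 1e-9)).mean()
    corr = np.corrcoef(spike_counts.T)
    mean_corr = corr[np.triu_indices_from(corr, k=1)].mean()
    return {"rate": mean_rate, "fano": fano, "corr": mean_corr}

def discrepancy(model_counts: np.ndarray, data_counts: np.ndarray) -> float:
    """Squared difference between model and recorded statistics; a cost an optimizer could minimize."""
    m, d = population_statistics(model_counts), population_statistics(data_counts)
    return sum((m[k] - d[k]) ** 2 for k in d)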

The creation of SNOPS is evidence of the effectiveness of cross-disciplinary cooperation. By combining the skills of modelers, data-driven computational scientists, and experimentalists, the study team was able to develop a tool that is useful for the larger neuroscience community in addition to being unique. 

SNOPS has the potential to have a big impact on computational neuroscience in the future. Because it is open source, researchers from all over the world can use and improve upon it, which may yield new insights into how the brain functions. With SNOPS, configurations that capture the desired aspects of the brain’s activity can be found far more easily.

In conclusion, SNOPS offers a strong, automated method for model tuning, marking a significant advancement in the creation of large-scale neural models. By bridging the gap between empirical data and computational models, SNOPS can help researchers better comprehend the complexity of brain function and ultimately advance understanding of the most complex organ in the human body.

Check out the Paper. All credit for this research goes to the researchers of this project.


The post Spiking Network Optimization Using Population Statistics (SNOPS): A Machine Learning-Driven Framework that can Quickly and Accurately Customize Models that Reproduce Activity to Mimic What’s Observed in the Brain appeared first on MarkTechPost.

Enhancing Just Walk Out technology with multi-modal AI

Since its launch in 2018, Just Walk Out technology by Amazon has transformed the shopping experience by allowing customers to enter a store, pick up items, and leave without standing in line to pay. You can find this checkout-free technology in over 180 third-party locations worldwide, including travel retailers, sports stadiums, entertainment venues, conference centers, theme parks, convenience stores, hospitals, and college campuses. Just Walk Out technology’s end-to-end system automatically determines which products each customer chose in the store and provides digital receipts, eliminating the need for checkout lines.
In this post, we showcase the latest generation of Just Walk Out technology by Amazon, powered by a multi-modal foundation model (FM). We designed this multi-modal FM for physical stores using a transformer-based architecture similar to that underlying many generative artificial intelligence (AI) applications. The model will help retailers generate highly accurate shopping receipts using data from multiple inputs including a network of overhead video cameras, specialized weight sensors on shelves, digital floor plans, and catalog images of products. To put it in plain terms, a multi-modal model means using data from multiple inputs.
Our research and development (R&D) investments in state-of-the-art multi-modal FMs enable the Just Walk Out system to be deployed in a wide range of shopping situations with greater accuracy and at lower cost. Similar to large language models (LLMs) that generate text, the new Just Walk Out system is designed to generate an accurate sales receipt for every shopper visiting the store.
The challenge: Tackling complicated long-tail shopping scenarios
Because of their innovative checkout-free environment, Just Walk Out stores presented us with a unique technical challenge. Retailers, shoppers, and Amazon alike demand nearly 100 percent checkout accuracy, even in the most complex shopping situations. These include unusual shopping behaviors that can create a long and complicated sequence of activities, requiring additional effort to analyze what happened.
Previous generations of the Just Walk Out system used a modular architecture; they tackled complex shopping situations by breaking down the shopper’s visit into discrete tasks, such as detecting shopper interactions, tracking items, identifying products, and counting what is selected. These individual components were then integrated into sequential pipelines to enable the overall system functionality. While this approach produced highly accurate receipts, significant engineering effort was required to address challenges in new, previously unseen situations and complex shopping scenarios. This limitation restricted the scalability of the approach.
The solution: Just Walk Out multi-modal AI
To meet these challenges, we introduced a new multi-modal FM that we designed specifically for retail store environments, enabling Just Walk Out technology to handle complex real-world shopping scenarios. The new multi-modal FM further enhances the Just Walk Out system’s capabilities by generalizing more effectively to new store formats, products, and customer behaviors, which is crucial for scaling up Just Walk Out technology.
The incorporation of continuous learning enables the model training to automatically adapt and learn from new challenging scenarios as they arise. This self-improving capability helps ensure the system maintains high performance, even as shopping environments continue to evolve.
Through this combination of end-to-end learning and enhanced generalization, the Just Walk Out system can tackle a wider range of dynamic and complex retail settings. Retailers can confidently deploy this technology, knowing it will provide a frictionless checkout-free experience for their customers.
The following video shows our system’s architecture in action.

Key elements of our Just Walk Out multi-modal AI model include:

Flexible data inputs – The system tracks how users interact with products and fixtures, such as shelves or fridges. It primarily relies on multi-view video feeds as inputs, using weight sensors solely to track small items. The model maintains a digital 3D representation of the store and can access catalog images to identify products, even if the shopper returns items to the shelf incorrectly.
Multi-modal AI tokens to represent shoppers’ journeys – The multi-modal data inputs are processed by the encoders, which compress them into transformer tokens, the basic unit of input for the receipt model. This allows the model to interpret hand movements, differentiate between items, and accurately count the number of items picked up or returned to the shelf with speed and precision.
Continuously updating receipts – The system uses tokens to create digital receipts for each shopper. It can differentiate between different shopper sessions and dynamically updates each receipt as they pick up or return items.

Training the Just Walk Out FM
By feeding vast amounts of multi-modal data into the Just Walk Out FM, we found it could consistently generate—or, technically, “predict”— accurate receipts for shoppers. To improve accuracy, we designed over 10 auxiliary tasks, such as detection, tracking, image segmentation, grounding (linking abstract concepts to real-world objects), and activity recognition. All of these are learned within a single model, enhancing the model’s ability to handle new, never-before-seen store formats, products, and customer behaviors. This is crucial for bringing Just Walk Out technology to new locations.
AI model training—in which curated data is fed to selected algorithms—helps the system refine itself to produce accurate results. We quickly discovered we could accelerate the training of our model by using a data flywheel that continuously mines and labels high-quality data in a self-reinforcing cycle. The system is designed to integrate these progressive improvements with minimal manual intervention. The following diagram illustrates the process.

To train an FM effectively, we invested in a robust infrastructure that can efficiently process the massive amounts of data needed to train high-capacity neural networks that mimic human decision-making. We built the infrastructure for our Just Walk Out model with the help of several Amazon Web Services (AWS) services, including Amazon Simple Storage Service (Amazon S3) for data storage and Amazon SageMaker for training.
Here are some key steps we followed in training our FM:

Selecting challenging data sources – To train our AI model for Just Walk Out technology, we focus on training data from especially difficult shopping scenarios that test the limits of our model. Although these complex cases constitute only a small fraction of shopping data, they are the most valuable for helping the model learn from its mistakes.
Leveraging auto labeling – To increase operational efficiency, we developed algorithms and models that automatically attach meaningful labels to the data. In addition to receipt prediction, our automated labeling algorithms cover the auxiliary tasks, ensuring the model gains comprehensive multi-modal understanding and reasoning capabilities.
Pre-training the model – Our FM is pre-trained on a vast collection of multi-modal data across a diverse range of tasks, which enhances the model’s ability to generalize to new store environments never encountered before.
Fine-tuning the model – Finally, we refined the model further and used quantization techniques to create a smaller, more efficient model that uses edge computing.

As the data flywheel continues to operate, it will progressively identify and incorporate more high-quality, challenging cases to test the robustness of the model. These additional difficult samples are then fed into the training set, further enhancing the model’s accuracy and applicability across new physical store environments.
Conclusion
In this post, we showed how our multi-modal AI system represents significant new possibilities for Just Walk Out technology. With our innovative approach, we are moving away from modular AI systems that rely on human-defined subcomponents and interfaces. Instead, we’re building simpler and more scalable AI systems that can be trained end-to-end. Although we’ve just scratched the surface, multi-modal AI has raised the bar for our already highly accurate receipt system and will enable us to improve the shopping experience at more Just Walk Out technology stores around the world.
Visit About Amazon to read the official announcement about the new multi-modal AI system and learn more about the latest improvements in Just Walk Out technology.
To find Just Walk Out technology locations, visit Just Walk Out technology locations near you. Learn more about how to power your store or venue with Just Walk Out technology by Amazon on the Just Walk Out technology product page.
Visit Build and scale the next wave of AI innovation on AWS to learn more about how AWS can reinvent customer experiences with the most comprehensive set of AI and ML services.

About the Authors
Tian Lan is a Principal Scientist at AWS. He currently leads the research efforts in developing the next-generation Just Walk Out 2.0 technology, transforming it into an end-to-end learned, store domain–focused multi-modal foundation model.
Chris Broaddus is a Senior Manager at AWS. He currently manages all the research efforts for Just Walk Out technology, including the multi-modal AI model and other projects, such as deep learning for human pose estimation and Radio Frequency Identification (RFID) receipt prediction.

Generate synthetic data for evaluating RAG systems using Amazon Bedroc …

Evaluating your Retrieval Augmented Generation (RAG) system to make sure it fulfils your business requirements is paramount before deploying it to production environments. However, this requires acquiring a high-quality dataset of real-world question-answer pairs, which can be a daunting task, especially in the early stages of development. This is where synthetic data generation comes into play. With Amazon Bedrock, you can generate synthetic datasets that mimic actual user queries, enabling you to evaluate your RAG system’s performance efficiently and at scale. With synthetic data, you can streamline the evaluation process and gain confidence in your system’s capabilities before unleashing it to the real world.
This post explains how to use Anthropic Claude on Amazon Bedrock to generate synthetic data for evaluating your RAG system. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Fundamentals of RAG evaluation
Before diving deep into how to evaluate a RAG application, let’s recap the basic building blocks of a naive RAG workflow, as shown in the following diagram.

The workflow consists of the following steps:

In the ingestion step, which happens asynchronously, data is split into separate chunks. An embedding model is used to generate embeddings for each of the chunks, which are stored in a vector store.
When the user asks the system a question, an embedding is generated from the question, and the top-k most relevant chunks are retrieved from the vector store.
The RAG model augments the user input by adding the relevant retrieved data in context. This step uses prompt engineering techniques to communicate effectively with the large language model (LLM). The augmented prompt allows the LLM to generate an accurate answer to user queries.
An LLM is prompted to formulate a helpful answer based on the user’s questions and the retrieved chunks.

Amazon Bedrock Knowledge Bases offers a streamlined approach to implement RAG on AWS, providing a fully managed solution for connecting FMs to custom data sources. To implement RAG using Amazon Bedrock Knowledge Bases, you begin by specifying the location of your data, typically in Amazon Simple Storage Service (Amazon S3), and selecting an embedding model to convert the data into vector embeddings. Amazon Bedrock then creates and manages a vector store in your account, typically using Amazon OpenSearch Serverless, handling the entire RAG workflow, including embedding creation, storage, management, and updates. You can use the RetrieveAndGenerate API for a straightforward implementation, which automatically retrieves relevant information from your knowledge base and generates responses using a specified FM. For more granular control, the Retrieve API is available, allowing you to build custom workflows by processing retrieved text chunks and developing your own orchestration for text generation. Additionally, Amazon Bedrock Knowledge Bases offers customization options, such as defining chunking strategies and selecting custom vector stores like Pinecone or Redis Enterprise Cloud.
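For example, a minimal RetrieveAndGenerate call with boto3 might look like the following sketch; the knowledge base ID and model ARN are placeholders you would replace with your own values.

import boto3

# The Bedrock Agent Runtime client exposes the Knowledge Bases retrieval APIs
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "How did AWS revenue develop during the pandemic?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<YOUR_KNOWLEDGE_BASE_ID>",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])      # generated answer
print(response.get("citations", []))   # retrieved source attributions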
A RAG application has many moving parts, and on your way to production you’ll need to make changes to various components of your system. Without a proper automated evaluation workflow, you won’t be able to measure the effect of these changes and will be operating blindly regarding the overall performance of your application.
To evaluate such a system properly, you need to collect an evaluation dataset of typical user questions and answers.
Moreover, you need to make sure you evaluate not only the generation part of the process but also the retrieval. An LLM without relevant retrieved context can’t answer the user’s question if the information wasn’t present in the training data. This holds true even if it has exceptional generation capabilities.
As such, a typical RAG evaluation dataset consists of the following minimum components:

A list of questions users will ask the RAG system
A list of corresponding answers to evaluate the generation step
The context or a list of contexts that contain the answer for each question to evaluate the retrieval

In an ideal world, you would take real user questions as a basis for evaluation. Although this is the optimal approach because it directly resembles end-user behavior, this is not always feasible, especially in the early stages of building a RAG system. As you progress, you should aim for incorporating real user questions into your evaluation set.
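To make these minimum components concrete, a single evaluation record could be represented as simply as the following sketch (the field names are illustrative):

# Illustrative shape of one RAG evaluation record
evaluation_record = {
    "question": "What was the YoY growth of AWS revenue in 2021?",
    "reference_answer": "The AWS revenue grew 37% year-over-year in 2021.",
    "reference_contexts": [
        "This shift by so many companies (along with the economy recovering) "
        "helped re-accelerate AWS's revenue growth to 37% YoY in 2021."
    ],
}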
To learn more about how to evaluate a RAG application, see Evaluate the reliability of Retrieval Augmented Generation applications using Amazon Bedrock.
Solution overview
We use a sample use case to illustrate the process by building an Amazon shareholder letter chatbot that allows business analysts to gain insights about the company’s strategy and performance over the past years.
For the use case, we use PDF files of Amazon’s shareholder letters as our knowledge base. These letters contain valuable information about the company’s operations, initiatives, and future plans. In a RAG implementation, the knowledge retriever might use a database that supports vector searches to dynamically look up relevant documents that serve as the knowledge source.
The following diagram illustrates the workflow to generate the synthetic dataset for our RAG system.

The workflow includes the following steps:

Load the data from your data source.
Chunk the data as you would for your RAG application.
Generate relevant questions from each document.
Generate an answer by prompting an LLM.
Extract the relevant text that answers the question.
Evolve the question according to a specific style.
Filter questions and improve the dataset either using domain experts or LLMs using critique agents.

We use a model from Anthropic’s Claude 3 model family to extract questions and answers from our knowledge source, but you can experiment with other LLMs as well. Amazon Bedrock makes this effortless by providing standardized API access to many FMs.
For the orchestration and automation steps in this process, we use LangChain. LangChain is an open source Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.
The next sections walk you through the most important parts of the process. If you want to dive deeper and run it yourself, refer to the notebook on GitHub.
Load and prepare the data
First, load the shareholder letters using LangChain’s PyPDFDirectoryLoader and use the RecursiveCharacterTextSplitter to split the PDF documents into chunks. The RecursiveCharacterTextSplitter divides the text into chunks of a specified size while trying to preserve context and meaning of the content. It’s a good way to start when working with text-based documents. You don’t have to split your documents to create your evaluation dataset if your LLM supports a context window that is large enough to fit your documents, but you could potentially end up with a lower quality of generated questions due to the larger size of the task. You want to have the LLM generate multiple questions per document in this case.

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders.pdf import PyPDFLoader, PyPDFDirectoryLoader

# Load PDF documents from directory
loader = PyPDFDirectoryLoader("./synthetic_dataset_generation/")
documents = loader.load()

# Use recursive character splitter, works better for this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Split documents into small chunks
    chunk_size=1500,
    # Overlap chunks to reduce cutting sentences in half
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""],
)

# Split loaded documents into chunks
docs = text_splitter.split_documents(documents)

To demonstrate the process of generating a corresponding question and answer and iteratively refining them, we use an example chunk from the loaded shareholder letters throughout this post:

“page_content=”Our AWS and Consumer businesses have had different demand trajectories during the pandemic. In thenfirst year of the pandemic, AWS revenue continued to grow at a rapid clip—30% year over year (“Y oY”) in2020 on a $35 billion annual revenue base in 2019—but slower than the 37% Y oY growth in 2019. […] This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.nConversely, our Consumer revenue grew dramatically in 2020. In 2020, Amazon’s North America andnInternational Consumer revenue grew 39% Y oY on the very large 2019 revenue base of $245 billion; and,this extraordinary growth extended into 2021 with revenue increasing 43% Y oY in Q1 2021. These areastounding numbers. We realized the equivalent of three years’ forecasted growth in about 15 months.nAs the world opened up again starting in late Q2 2021, and more people ventured out to eat, shop, and travel,”

Generate an initial question
To facilitate prompting the LLM using Amazon Bedrock and LangChain, you first configure the inference parameters. To accurately extract more extensive contexts, set the max_tokens parameter to 4096, which corresponds to the maximum number of tokens the LLM will generate in its output. Additionally, define the temperature parameter as 0.2 because the goal is to generate responses that adhere to the specified rules while still allowing for a degree of creativity. This value differs for different use cases and can be determined by experimentation.

import boto3

from langchain_community.chat_models import BedrockChat

# Set up a Bedrock runtime client for inferencing large language models
boto3_bedrock = boto3.client("bedrock-runtime")

# Choosing Claude 3 Haiku due to cost and performance efficiency
claude_3_haiku = "anthropic.claude-3-haiku-20240307-v1:0"

# Set up the LangChain LLM for implementing the synthetic dataset generation logic.
# For each model provider there are different parameters to define when inferencing against the model.
inference_modifier = {
    "max_tokens": 4096,
    "temperature": 0.2,
}

llm = BedrockChat(
    model_id=claude_3_haiku,
    client=boto3_bedrock,
    model_kwargs=inference_modifier,
)

You use each generated chunk to create synthetic questions that mimic those a real user might ask. By prompting the LLM to analyze a portion of the shareholder letter data, you generate relevant questions based on the information presented in the context. We use the following sample prompt to generate a single question for a specific context. For simplicity, the prompt is hardcoded to generate a single question, but you can also instruct the LLM to generate multiple questions with a single prompt.
The rules can be adapted to better guide the LLM in generating questions that reflect the types of queries your users would pose, tailoring the approach to your specific use case.

from langchain.prompts import PromptTemplate

# Create a prompt template to generate a question an end-user could have about a given context
initial_question_prompt_template = PromptTemplate(
    input_variables=["context"],
    template="""
<Instructions>
Here is some context:
<context>
{context}
</context>

Your task is to generate 1 question that can be answered using the provided context, following these rules:

<rules>
1. The question should make sense to humans even when read without the given context.
2. The question should be fully answered from the given context.
3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.
4. The answer to the question should not contain any links.
5. The question should be of moderate difficulty.
6. The question must be reasonable and must be understood and responded by humans.
7. Do not use phrases like 'provided context', etc. in the question.
8. Avoid framing questions using the word "and" that can be decomposed into more than one question.
9. The question should not contain more than 10 words, make use of abbreviations wherever possible.
</rules>

To generate the question, first identify the most important or relevant part of the context. Then frame a question around that part that satisfies all the rules above.

Output only the generated question with a "?" at the end, no other text or characters.
</Instructions>
""",
)

The following is the generated question from our example chunk:
What was the YoY growth of AWS revenue in 2021?
Generate answers
To use the questions for evaluation, you need to generate a reference answer for each of the questions to test against. With the following prompt template, you can generate a reference answer to the created question based on the question and the original source chunk:

# Create a prompt template that takes into consideration the question and generates an answer
answer_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
<Instructions>
<Task>
<role>You are an experienced QA Engineer for building large language model applications.</role>
<task>It is your task to generate an answer to the following question <question>{question}</question> only based on the <context>{context}</context></task>
The output should be only the answer generated from the context.

<rules>
1. Only use the given context as a source for generating the answer.
2. Be as precise as possible with answering the question.
3. Be concise in answering the question and only answer the question at hand rather than adding extra information.
</rules>

Only output the generated answer as a sentence. No extra characters.
</Task>
</Instructions>

Assistant:""",
)

The following is the generated answer based on the example chunk:
“The AWS revenue grew 37% year-over-year in 2021.”
Extract relevant context
To make the dataset verifiable, we use the following prompt to extract the relevant sentences from the given context to answer the generated question. Knowing the relevant sentences, you can check whether the question and answer are correct.

# To check whether an answer was correctly formulated by the large language model you get the relevant text passages from the documents used for answering the questions.
source_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Human:
<Instructions>
Here is the context:
<context>
{context}
</context>

Your task is to extract the relevant sentences from the given context that can potentially help answer the following question. You are not allowed to make any changes to the sentences from the context.

<question>
{question}
</question>

Output only the relevant sentences you found, one sentence per line, without any extra characters or explanations.
</Instructions>
Assistant:""",
)

The following is the relevant source sentence extracted using the preceding prompt:
“This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.”
Refine questions
When generating question and answer pairs from the same prompt for the whole dataset, it might appear that the questions are repetitive and similar in form, and therefore don’t mimic real end-user behavior. To prevent this, take the previously created questions and prompt the LLM to modify them according to the rules and guidance established in the prompt. By doing so, a more diverse dataset is synthetically generated. The prompt for generating questions tailored to your specific use case heavily depends on that particular use case. Therefore, your prompt must accurately reflect your end-users by setting appropriate rules or providing relevant examples. The process of refining questions can be repeated as many times as necessary.

# To generate a more versatile testing dataset, you alternate the questions to see how your RAG system performs against differently formulated questions
question_compress_prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""
<Instructions>
<role>You are an experienced linguistics expert for building testsets for large language model applications.</role>

<task>It is your task to rewrite the following question in a more indirect and compressed form, following these rules:

<rules>
1. Make the question more indirect
2. Make the question shorter
3. Use abbreviations if possible
</rules>

<question>
{question}
</question>

Your output should only be the rewritten question with a question mark "?" at the end. Do not provide any other explanation or text.
</task>
</Instructions>

""",
)

Users of your application might not always use your solution in the same way, for instance using abbreviations when asking questions. This is why it’s crucial to develop a diverse dataset:
“AWS rev YoY growth in ’21?”
Automate dataset generation
To scale the process of the dataset generation, we iterate over all the chunks in our knowledge base; generate questions, answers, relevant sentences, and refinements for each question; and save them to a pandas data frame to prepare the full dataset.
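A minimal version of this loop, reusing the llm and the prompt templates defined earlier, could look like the following sketch (error handling, retries, and rate limiting omitted):

import pandas as pd

records = []
for doc in docs:
    context = doc.page_content

    # Generate the initial question, reference answer, source sentences, and evolved question
    question = llm.invoke(initial_question_prompt_template.format(context=context)).content
    answer = llm.invoke(answer_prompt_template.format(context=context, question=question)).content
    source = llm.invoke(source_prompt_template.format(context=context, question=question)).content
    evolved = llm.invoke(question_compress_prompt_template.format(question=question)).content

    records.append({
        "chunk": context,
        "question": question,
        "answer": answer,
        "source_sentence": source,
        "evolved_question": evolved,
    })

# Assemble the full evaluation dataset
dataset = pd.DataFrame(records)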
To provide a clearer understanding of the structure of the dataset, the following table presents a sample row based on the example chunk used throughout this post.

Chunk
Our AWS and Consumer businesses have had different demand trajectories during the pandemic. In thenfirst year of the pandemic, AWS revenue continued to grow at a rapid clip—30% year over year (“Y oY”) in2020 on a $35 billion annual revenue base in 2019—but slower than the 37% Y oY growth in 2019. […] This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.nConversely, our Consumer revenue grew dramatically in 2020. In 2020, Amazon’s North America andnInternational Consumer revenue grew 39% Y oY on the very large 2019 revenue base of $245 billion; and,this extraordinary growth extended into 2021 with revenue increasing 43% Y oY in Q1 2021. These areastounding numbers. We realized the equivalent of three years’ forecasted growth in about 15 months.nAs the world opened up again starting in late Q2 2021, and more people ventured out to eat, shop, and travel,”

Question
“What was the YoY growth of AWS revenue in 2021?”

Answer
“The AWS revenue grew 37% year-over-year in 2021.”

Source Sentence
“This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.”

Evolved Question
“AWS rev YoY growth in ’21?”

On average, generating a set consisting of an initial question, answer, evolved question, and source sentence from a context of 1,500–2,000 tokens takes about 2.6 seconds using Anthropic Claude 3 Haiku. Generating 1,000 such question and answer sets costs approximately $2.80 USD with the same model. The pricing page gives a detailed overview of the cost structure. This results in a more time- and cost-efficient generation of datasets for RAG evaluation compared to the manual generation of these question sets.
Improve your dataset using critique agents
Although using synthetic data is a good starting point, the next step should be to review and refine the dataset, filtering out or modifying questions that aren’t relevant to your specific use case. One effective approach to accomplish this is by using critique agents.
Critique agents are a technique used in natural language processing (NLP) to evaluate the quality and suitability of questions in a dataset for a particular task or application using a machine learning model. In our case, the critique agents are employed to assess whether the questions in the dataset are valid and appropriate for our RAG system.
The two main metrics evaluated by the critique agents in our example are question relevance and answer groundedness. Question relevance determines how relevant the generated question is for a potential user of our system, and groundedness assesses whether the generated answers are indeed based on the given context.

groundedness_check_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
<Instructions>
You will be given a context and a question related to that context.

Your task is to provide an evaluation of how well the given question can be answered using only the information provided in the context. Rate this on a scale from 1 to 5, where:

1 = The question cannot be answered at all based on the given context
2 = The context provides very little relevant information to answer the question
3 = The context provides some relevant information to partially answer the question
4 = The context provides substantial information to answer most aspects of the question
5 = The context provides all the information needed to fully and unambiguously answer the question

First, read through the provided context carefully:

<context>
{context}
</context>

Then read the question:

<question>
{question}
</question>

Evaluate how well you think the question can be answered using only the context information. Provide your reasoning first in an <evaluation> section, explaining what relevant or missing information from the context led you to your evaluation score in only one sentence.

Provide your evaluation in the following format:

<rating>(Your rating from 1 to 5)</rating>

<evaluation>(Your evaluation and reasoning for the rating)</evaluation>

</Instructions>

""",
)

relevance_check_prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""
<Instructions>
You will be given a question related to Amazon Shareholder letters. Your task is to evaluate how useful this question would be for a financial and business analyst working on Wall Street.

To evaluate the usefulness of the question, consider the following criteria:

1. Relevance: Is the question directly relevant to your work? Questions that are too broad or unrelated to this domain should receive a lower rating.

2. Practicality: Does the question address a practical problem or use case that analysts might encounter? Theoretical or overly academic questions may be less useful.

3. Clarity: Is the question clear and well-defined? Ambiguous or vague questions are less useful.

4. Depth: Does the question require a substantive answer that demonstrates understanding of financial topics? Surface-level questions may be less useful.

5. Applicability: Would answering this question provide insights or knowledge that could be applied to real-world company evaluation tasks? Questions with limited applicability should receive a lower rating.

Provide your evaluation in the following format:

<rating>(Your rating from 1 to 5)</rating>

<evaluation>(Your evaluation and reasoning for the rating)</evaluation>

Here is the question:

{question}
</Instructions>
""",
)

Evaluating the generated questions helps with assessing the quality of a dataset and eventually the quality of the evaluation. The generated question was rated very well:

Groundedness score: 5
“The context provides the exact information needed to answer the question[…]”

Relevance score: 5
“This question is highly relevant and useful for a financial and business analyst working on Wall Street. AWS (Amazon Web Services) is a key business segment for Amazon, and understanding its year-over-year (YoY) revenue growth is crucial for evaluating the company’s overall performance and growth trajectory. […]”
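
To apply these checks at scale, you can score every row of the generated dataset and keep only well-rated questions. The following is a minimal sketch; the regular expression for extracting the <rating> tag and the rating threshold of 4 are illustrative choices.

import re

def extract_rating(model_output: str) -> int:
    """Pull the numeric rating out of the <rating>...</rating> tag (returns 0 if missing)."""
    match = re.search(r"<rating>\s*(\d)\s*</rating>", model_output)
    return int(match.group(1)) if match else 0

def critique_row(row) -> bool:
    grounded = llm.invoke(
        groundedness_check_prompt_template.format(context=row["chunk"], question=row["question"])
    ).content
    relevant = llm.invoke(
        relevance_check_prompt_template.format(question=row["question"])
    ).content
    # Keep only questions that score at least 4 on both groundedness and relevance
    return extract_rating(grounded) >= 4 and extract_rating(relevant) >= 4

# dataset is the pandas data frame assembled earlier
filtered_dataset = dataset[dataset.apply(critique_row, axis=1)]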

Best practices for generating synthetic datasets
Although generating synthetic datasets offers numerous benefits, it’s essential to follow best practices to maintain the quality and representativeness of the generated data:

Combine with real-world data – Although synthetic datasets can mimic real-world scenarios, they might not fully capture the nuances and complexities of actual human interactions or edge cases. Combining synthetic data with real-world data can help address this limitation and create more robust datasets.
Choose the right model – Use a different LLM for dataset creation than the one used in your RAG application to avoid self-enhancement bias.
Implement robust quality assurance – You can employ multiple quality assurance mechanisms, such as critique agents, human evaluation, and automated checks, to make sure the generated datasets meet the desired quality standards and accurately represent the target use case.
Iterate and refine – You should treat synthetic dataset generation as an iterative process. Continuously refine and improve the process based on feedback and performance metrics, adjusting parameters, prompts, and quality assurance mechanisms as needed.
Domain-specific customization – For highly specialized or niche domains, consider fine-tuning the LLM (such as with PEFT or RLHF) by injecting domain-specific knowledge to improve the quality and accuracy of the generated datasets.

Conclusion
The generation of synthetic datasets is a powerful technique that can significantly enhance the evaluation process of your RAG system, especially in the early stages of development when real-world data is scarce or difficult to obtain. By taking advantage of the capabilities of LLMs, this approach enables the creation of diverse and representative datasets that accurately mimic real human interactions, while also providing the scalability necessary to meet your evaluation needs.
Although this approach offers numerous benefits, it’s essential to acknowledge its limitations. Firstly, the quality of the synthetic dataset heavily relies on the performance and capabilities of the underlying language model, knowledge retrieval system, and quality of prompts used for generation. Being able to understand and adjust the prompts for generation is crucial in this process. Biases and limitations present in these components may be reflected in the generated dataset. Additionally, capturing the full complexity and nuances of real-world interactions can be challenging because synthetic datasets may not account for all edge cases or unexpected scenarios.
Despite these limitations, generating synthetic datasets remains a valuable tool for accelerating the development and evaluation of RAG systems. By streamlining the evaluation process and enabling iterative development cycles, this approach can contribute to the creation of better-performing AI systems.
We encourage developers, researchers, and enthusiasts to explore the techniques mentioned in this post and the accompanying GitHub repository and experiment with generating synthetic datasets for your own RAG applications. Hands-on experience with this technique can provide valuable insights and contribute to the advancement of RAG systems in various domains.

About the Authors
Johannes Langer is a Senior Solutions Architect at AWS, working with enterprise customers in Germany. Johannes is passionate about applying machine learning to solve real business problems. In his personal life, Johannes enjoys working on home improvement projects and spending time outdoors with his family.
Lukas Wenzel is a Solutions Architect at Amazon Web Services in Hamburg, Germany. He focuses on supporting software companies building SaaS architectures. In addition to that, he engages with AWS customers on building scalable and cost-efficient generative AI features and applications. In his free-time, he enjoys playing basketball and running.
David Boldt is a Solutions Architect at Amazon Web Services. He helps customers build secure and scalable solutions that meet their business needs. He is specialized in machine learning to address industry-wide challenges, using technologies to drive innovation and efficiency across various sectors.

Making traffic lights more efficient with Amazon Rekognition

State and local agencies spend approximately $1.23 billion annually to operate and maintain signalized traffic intersections. At the same time, traffic congestion at intersections costs drivers about $22 billion annually. Implementing an artificial intelligence (AI)-powered, detection-based solution can significantly mitigate congestion at intersections and reduce operation and maintenance costs. In this blog post, we show you how Amazon Rekognition can help mitigate congestion at traffic intersections and reduce operations and maintenance costs.
State and local agencies rely on traffic signals to facilitate the safe flow of traffic involving cars, pedestrians, and other users. There are two main types of traffic lights: fixed and dynamic. Fixed traffic lights are timed lights controlled by electro-mechanical signals that switch and hold the lights based on a set period of time. Dynamic traffic lights are designed to adjust based on traffic conditions by using detectors both underneath the surface of the road and above the traffic light. However, as the population continues to rise, there are more cars, bikes, and pedestrians using the streets. This increase in road users can negatively impact the efficiency of either of the two traffic systems.
Solution overview
At a high level, our solution uses Amazon Rekognition to automatically detect objects (cars, bikes, and so on) and scenes at an intersection. After detection, Amazon Rekognition creates bounding boxes around each object (such as a vehicle) and calculates the distance between each object (in this scenario, that would be the distance between vehicles detected at an intersection). Results from the calculated distances are used programmatically to stop or allow the flow of traffic, thus reducing congestion. All of this happens without human intervention.
Prerequisites
The proposed solution can be implemented in a personal AWS environment using the code that we provide. However, there are a few prerequisites that must be in place. Before running the labs in this post, ensure you have the following:

An AWS account. Create one if necessary.
The appropriate AWS Identity and Access Management (IAM) permissions to access services used in the lab. If this is your first time setting up an AWS account, see the IAM documentation for information about configuring IAM.
A SageMaker Studio Notebook. Create one if necessary.

Solution architecture
The following diagram illustrates the lab’s architecture:
This solution uses the following AI and machine learning (AI/ML), serverless, and managed technologies:

Amazon SageMaker, a fully managed machine learning service that enables data scientists and developers to build, train and deploy machine learning applications.
Amazon Rekognition supports adding image and video analysis to your applications.
IAM grants authentication and authorization that allows resources in the solution to talk to each other.

To recap how the solution works:

Traffic intersection video footage is uploaded to your SageMaker environment from an external device.
A Python function uses CV2 to split the video footage into image frames.
The function makes a call to Amazon Rekognition when the image frames are completed.
Amazon Rekognition analyzes each frame and creates bounding boxes around each vehicle it detects.
The function counts the bounding boxes and changes the traffic signal based on the number of cars it detects using pre-defined logic.

Solution walkthrough
Now, let’s walk through implementing the solution.
Configure SageMaker:

Choose Domains in the navigation pane, and then select your domain name.
Find and copy the SageMaker Execution Role.
Go to the IAM console and choose Roles in the navigation pane and paste the SageMaker Execution Role you copied in the preceding step.

Enable SageMaker to interact with Amazon Rekognition:
Next, enable SageMaker to interact with Amazon Rekognition using the SageMaker execution role.

In the SageMaker console, select your SageMaker execution role and choose Add permission and then choose Attach policies.
In the search bar, enter and select AmazonRekognitionFullAccess Policy. See the following figure.

With the IAM permissions configured, you can run the notebook in SageMaker with access to Amazon Rekognition for the video analysis.
Download the Rekognition notebook and traffic intersection data to your local environment. In Amazon SageMaker Studio, upload the notebook and data you downloaded.
Code walkthrough:
This lab uses OpenCV and Boto3 to prepare the SageMaker environment. OpenCV is an open source library with over 250 algorithms for computer vision analysis. Boto3 is the AWS SDK for Python that helps you integrate AWS services with applications or scripts written in Python.

First, we import the OpenCV and Boto3 packages. The next cell builds a function for analyzing the video; we will walk through its key components. The function starts by creating a frame for the video to be analyzed.
The frame is written to a new video writer file with an MP4 extension. The function also loops through the file and converts each extracted frame to a JPEG file. The code then defines and identifies traffic lanes using bounding boxes. Amazon Rekognition image operations place bounding boxes around detected objects for later analysis.
The function captures each video frame and sends it to Amazon Rekognition, which analyzes the images using the bounding boxes. The model uses bounding boxes to detect and classify the captured objects (cars, pedestrians, and so on) in the video. The code then determines whether a car is present in the frame sent to Amazon Rekognition and generates a bounding box for each car it detects.
The size and position of each car are computed to accurately determine its location. After computing the size and position of the car, the model checks whether the car is in a detected lane and, if so, counts the number of detected cars in that lane.
In the rest of the function, the results from detecting cars and computing their size, position, and count per lane are written to a new file.
While writing the outputs to the new file, the function performs a few geometry computations to determine the details of the detected objects; for example, polygons are used to determine the size of objects.
With the function built, the next step is to run it against a test video with a minimum confidence score of 95%.
The last lines of code allow you to download the video from the directory in SageMaker to check the results and the confidence level of the output.
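As an illustration of the lane-detection geometry (hypothetical lane coordinates and frame dimensions; the detect_labels response is assumed to come from a call like the one sketched earlier), the following checks whether each detected car's bounding-box center falls inside a lane polygon and counts the cars in that lane:

```python
# Minimal sketch: Rekognition returns bounding boxes as ratios of the frame
# size, so we convert them to pixel coordinates and test them against a lane polygon.

def box_center(bbox, frame_width, frame_height):
    """Convert a Rekognition BoundingBox dict into a pixel-space center point."""
    cx = (bbox["Left"] + bbox["Width"] / 2) * frame_width
    cy = (bbox["Top"] + bbox["Height"] / 2) * frame_height
    return cx, cy

def point_in_polygon(point, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x, y), ...]?"""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

lane_polygon = [(100, 700), (400, 700), (350, 300), (150, 300)]   # hypothetical lane outline in pixels

cars_in_lane = sum(
    point_in_polygon(box_center(instance["BoundingBox"], 1280, 720), lane_polygon)
    for label in response["Labels"] if label["Name"] == "Car"
    for instance in label["Instances"]
)
print(cars_in_lane, "cars in this lane")
```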

Costs
Our cost estimate is approximately $6,000 per intersection per year, assuming one frame per second from four cameras and a single SageMaker notebook for each intersection. One important callout is that not every intersection is a four-way intersection, and implementing this solution at more heavily trafficked intersections will improve the overall flow of traffic.
Cost breakdown and details

| Service | Description | First month cost | First 12 months cost |
| --- | --- | --- | --- |
| Amazon SageMaker Studio notebooks | Instance name: ml.t3.medium; number of data scientists: 1; Studio notebook instances per data scientist: 1; Studio notebook hours per day: 24; Studio notebook days per month: 30 | $36 | $432 |
| Amazon Rekognition | Images processed with label detection API calls per month: 345,600 | $345.60 | $4,147.20 |
| Amazon Simple Storage Service (Amazon S3), Standard storage class | S3 Standard storage: 4,320 GB per month; PUT, COPY, POST, and LIST requests per month: 2,592,000 | $112.32 | $1,347.84 |
| Total estimate per year |  |  | $5,927.04 |

However, this is an estimate, and you may incur additional costs depending on customization. For additional information on costs, visit the AWS pricing page for the services covered in the solution architecture. If you have questions, reach out to the AWS team for a more technical and focused discussion.
Clean up
Delete all AWS resources created for this solution that are no longer needed to avoid future charges.
Conclusion
This post provides a solution to make traffic lights more efficient using Amazon Rekognition. The solution proposed in this post can mitigate costs, support road safety, and reduce congestion at intersections. All of these make traffic management more efficient. We strongly recommend learning more about how Amazon Rekognition can help accelerate other image recognition and video analysis tasks by visiting the Amazon Rekognition Developer Guide.

About the authors
Hao Lun Colin Chu is an innovative Solution Architect at AWS, helping partners and customers leverage cutting-edge cloud technologies to solve complex business challenges. With extensive expertise in cloud migrations, modernization, and AI/ML, Colin advises organizations on translating their needs into transformative AWS-powered solutions. Driven by a passion for using technology as a force for good, he is committed to delivering solutions that empower organizations and improve people’s lives. Outside of work, he enjoys playing drums, volleyball, and board games!
Joe Wilson is a Solutions Architect at Amazon Web Services supporting nonprofit organizations. He provides technical guidance to nonprofit organizations seeking to securely build, deploy, or expand applications in the cloud. He is passionate about leveraging data and technology for social good. Joe’s background is in data science and international development. Outside of work, Joe loves spending time with his family and friends and chatting about innovation and entrepreneurship.

Google AI Introduces the Open Buildings 2.5D Temporal Dataset that Tr …

Governments and humanitarian organizations need reliable data on building and infrastructure changes over time to manage urbanization, allocate resources, and respond to crises. However, many regions across the Global South lack access to timely and accurate data on buildings, making it difficult to track urban growth and infrastructure development. The absence of this data hinders effective planning and disaster response efforts. Current methods for detecting buildings generally rely on high-resolution satellite imagery, which provides detailed images of building footprints. However, high-resolution images are often captured sporadically, sometimes years apart, making it difficult to track changes in buildings over time, especially in rural or rapidly developing areas.

Google researchers introduced the Open Buildings 2.5D Temporal Dataset to address the issue of rapid urban population growth, particularly in the Global South, where urban areas are projected to expand significantly by 2050. Unlike previous efforts, which relied on high-resolution images, this new dataset utilizes Sentinel-2 satellite imagery captured by the European Space Agency, which offers lower resolution but captures images every five days globally. By using a novel machine learning approach, the dataset can estimate changes in building presence and height over time, covering a large geographic region from 2016 to 2023.

The core idea behind the dataset involves using a combination of student-teacher models based on HRNet architecture. The teacher model is trained on high-resolution satellite imagery, providing ground truth labels. The student model, trained on lower-resolution Sentinel-2 images, aims to recreate the teacher’s predictions without directly seeing the high-resolution images. By leveraging multiple time frames of Sentinel-2 data (up to 32 images for each location), the model enhances resolution and detects building footprints with high accuracy. This approach allows the model to achieve a mean Intersection over Union (IoU) of 78.3%, which is close to the 85.3% accuracy obtained using high-resolution imagery. The dataset also includes tools for estimating building heights and counts, with a mean absolute error of 1.5 meters for height estimates and reliable building count predictions.
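A minimal sketch of this student-teacher setup (not Google's code; model interfaces, tensor shapes, and the loss choice are illustrative assumptions) might look like the following, where the student sees only stacked Sentinel-2 frames and is trained to reproduce the teacher's building-presence predictions:

```python
# Minimal sketch: distillation step in which a student segmentation model,
# trained on a stack of low-resolution Sentinel-2 frames, learns to match the
# building-presence map predicted by a teacher model on high-resolution imagery.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, s2_stack, hires_image, optimizer):
    """s2_stack: (B, T*C, H, W) stack of Sentinel-2 frames; hires_image: (B, C, H', W')."""
    with torch.no_grad():
        target = torch.sigmoid(teacher(hires_image))                 # teacher's building-presence probabilities
        target = F.interpolate(target, size=s2_stack.shape[-2:])     # align teacher output to the student grid
    pred = student(s2_stack)                                         # student never sees the high-resolution image
    loss = F.binary_cross_entropy_with_logits(pred, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```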

In conclusion, Google’s Open Buildings 2.5D Temporal Dataset offers a significant advancement in detecting and monitoring building changes across the Global South using public satellite imagery. By employing a combination of frequent, lower-resolution Sentinel-2 images and machine learning models, it provides an innovative solution for addressing the lack of accurate, up-to-date data on buildings. The proposed method not only improves the ability to track changes in urban areas but also supports better planning and crisis response in regions that are often data-poor.

Check out the Dataset and Blog. All credit for this research goes to the researchers of this project.

The post Google AI Introduces the Open Buildings 2.5D Temporal Dataset that Tracks Building Changes Across the Global South appeared first on MarkTechPost.

This Research Paper Discusses Space-Efficient Algorithms for Integer P …

Integer Linear Programming (ILP) is the foundation of combinatorial optimization and is extensively applied across numerous industries to resolve challenging decision-making problems. Under a set of linear equality constraints, an ILP aims to minimize or maximize a linear objective function, with the important condition that all variables must be integers. Even though ILP is an effective technique, its complexity can pose serious difficulties, particularly when there are many constraints or the problem size is large.

An ILP in standard form can be written as

min { cᵀx : Ax = b, x ∈ ℤⁿ, x ≥ 0 },

where x represents the non-negative integer variables to be optimized, c is the cost vector, b is a vector of constants, and A is the matrix of constraint coefficients. ILP is NP-complete, which means that finding an optimal solution for large instances is computationally infeasible, making the task especially difficult. However, dynamic programming can solve ILPs more effectively when the number of constraints (m) is small and fixed.

Dynamic programming offers a pseudopolynomial-time solution for ILPs with a fixed number of constraints (m = O(1)). This is an important development because it provides a workable approach to a problem that would otherwise be intractable. Such solutions have the following running time:

(mΔ)^O(m) · poly(|I|),

where |I| is the size of the input (accounting for the encoding of A, b, and c) and Δ is the largest absolute value of an entry in the constraint matrix A. By exploiting the fixed number of constraints, this method lowers the complexity and enables the efficient solution of small to medium-sized ILPs.
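To make the idea concrete, here is a minimal sketch (a toy illustration, not the paper's algorithm) of the pseudopolynomial dynamic program for the special case of a single equality constraint (m = 1) with positive coefficients; note that the table size grows with the right-hand side b, which is exactly the memory cost discussed next:

```python
# Minimal sketch: pseudopolynomial DP for  minimize c·x  subject to  a·x = b,
# x >= 0 integer, with one constraint. dp[r] holds the cheapest cost of hitting
# right-hand side r; its size (and memory footprint) grows with b.
import math

def solve_single_constraint_ilp(a, c, b):
    """a, c: lists of positive integer coefficients/costs; b: target right-hand side."""
    dp = [math.inf] * (b + 1)
    dp[0] = 0
    for r in range(1, b + 1):
        for coeff, cost in zip(a, c):
            if coeff <= r and dp[r - coeff] + cost < dp[r]:
                dp[r] = dp[r - coeff] + cost
    return dp[b]  # math.inf means the instance is infeasible

# Example: minimize 3*x1 + 2*x2 subject to 2*x1 + 5*x2 = 11  ->  optimum 11 (x1=3, x2=1)
print(solve_single_constraint_ilp(a=[2, 5], c=[3, 2], b=11))
```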

Although dynamic programming techniques yield considerable space complexity trade-offs, they are economical in terms of running time. Large amounts of memory are usually needed for these algorithms, frequently in direct proportion to their execution times. Consequently, memory needs can constitute a bottleneck, particularly in cases of big problems or when great precision is needed.

Dynamic programming techniques can be limited in practical applications due to their space complexity, especially when memory is a limited resource. The desire to create space-efficient algorithms that can solve ILPs without using a lot of memory has grown as a result.

A new method that maintains competitive running times while addressing the space complexity issue has been developed as a result of recent developments in ILP research. The time complexity attained by this algorithm is:

(mΔ)^O(m(log m + log log Δ)) · poly(|I|)

Compared to conventional dynamic programming techniques, this results in a marginally longer running time; however, the main benefit is that less space is needed. Because the algorithm runs in polynomial space, it can solve larger ILP instances on devices with limited memory.

With this new technique, data scientists working on optimization challenges have a useful tool. It enables effective ILP solutions without the memory costs associated with conventional approaches being too high. This development is especially significant in fields like machine learning, finance, and logistics, where optimization is essential.

In conclusion, space-efficient algorithm development represents a major advancement, even though ILP is still a difficult topic in combinatorial optimization. These developments make it possible to solve complicated issues more effectively in new ways, which increases the potency of ILP as a tool for data scientists.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This Research Paper Discusses Space-Efficient Algorithms for Integer Programming with Few Constraints appeared first on MarkTechPost.

Exploring Input Space Mode Connectivity: Insights into Adversarial Det …

Input space mode connectivity in deep neural networks builds upon research on excessive input invariance, blind spots, and connectivity between inputs yielding similar outputs. The phenomenon exists generally, even in untrained networks, as evidenced by empirical and theoretical findings. This research expands the scope of input space connectivity beyond out-of-distribution samples, considering all possible inputs. The study adapts methods from parameter space mode connectivity to explore input space, providing insights into neural network behavior.

The research draws on prior work identifying high-dimensional convex hulls of low loss between multiple loss minimizers, which is crucial for analyzing training dynamics and mode connectivity. Feature visualization techniques and methods for optimizing inputs in adversarial attacks further contribute to understanding input space manipulation. By synthesizing these diverse areas of study, the research presents a comprehensive view of input space mode connectivity, emphasizing its implications for adversarial detection and model interpretability while highlighting the intrinsic properties of high-dimensional geometry in neural networks.

The concept of mode connectivity in neural networks extends from parameter space to input space, revealing low-loss paths between inputs yielding similar predictions. This phenomenon, observed in both trained and untrained models, suggests a geometric effect explicable through percolation theory. The study employs real, interpolated, and synthetic inputs to explore input space connectivity, demonstrating its prevalence and simplicity in trained models. This research advances the understanding of neural network behavior, particularly regarding adversarial examples, and offers potential applications in adversarial detection and model interpretability. The findings provide new insights into the high-dimensional geometry of neural networks and their generalization capabilities.

The methodology employs diverse input generation techniques, including real, interpolated, and synthetic images, to comprehensively analyze input space connectivity in deep neural networks. Loss landscape analysis investigates barriers between different modes, particularly focusing on natural inputs and adversarial examples. The theoretical framework utilizes percolation theory to explain input space mode connectivity as a geometric phenomenon in high-dimensional spaces. This approach provides a foundation for understanding connectivity properties in both trained and untrained networks.

Empirical validation on pretrained vision models demonstrates the existence of low-loss paths between different modes, supporting the theoretical claims. An adversarial detection algorithm developed from these findings highlights practical applications. The methodology extends to untrained networks, emphasizing that input space mode connectivity is a fundamental characteristic of neural architectures. Consistent use of cross-entropy loss as an evaluation metric ensures comparability across experiments. This comprehensive approach combines theoretical insights with empirical evidence to explore input space mode connectivity in deep neural networks.

Results extend mode connectivity to the input space of deep neural networks, revealing low-loss paths between inputs yielding similar predictions. Trained models exhibit simple, near-linear paths between connected inputs. The research distinguishes natural inputs from adversarial examples based on loss barrier heights, with real-real pairs showing low barriers and real-adversarial pairs displaying high, complex ones. This geometric phenomenon, explained through percolation theory, persists in untrained models. The findings enhance understanding of model behavior, improve adversarial detection methods, and contribute to DNN interpretability.
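To illustrate how such barriers can be measured, the following minimal sketch (assuming a pretrained PyTorch classifier `model` and two same-label inputs `x_a` and `x_b`; the names and step count are illustrative) evaluates the cross-entropy loss along the linear interpolant path between the two inputs and reports the barrier height:

```python
# Minimal sketch: loss barrier along the linear path in input space between two
# inputs. Low barriers are expected for real-real pairs, high ones for
# real-adversarial pairs, per the findings summarized above.
import torch
import torch.nn.functional as F

def loss_barrier(model, x_a, x_b, y, steps=50):
    model.eval()
    losses = []
    with torch.no_grad():
        for alpha in torch.linspace(0.0, 1.0, steps):
            x = (1 - alpha) * x_a + alpha * x_b              # linear interpolation in input space
            logits = model(x.unsqueeze(0))
            losses.append(F.cross_entropy(logits, y.unsqueeze(0)).item())
    endpoint = max(losses[0], losses[-1])
    return max(losses) - endpoint, losses                     # barrier height and full loss profile
```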

In conclusion, the research demonstrates the existence of mode connectivity in the input space of deep networks trained for image classification. Low-loss paths consistently connect different modes, revealing a robust structure in the input space. The study differentiates natural inputs from adversarial attacks based on loss barrier heights along linear interpolant paths. This insight advances adversarial detection mechanisms and enhances deep neural network interpretability. The findings support the hypothesis that mode connectivity is an intrinsic property of high-dimensional geometry, explainable through percolation theory.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Exploring Input Space Mode Connectivity: Insights into Adversarial Detection and Deep Neural Network Interpretability appeared first on MarkTechPost.

Google AI Researchers Introduce a New Whale Bioacoustics Model that ca …

Whale species produce a wide range of vocalizations, from very low to very high frequencies, which vary by species and location, making it difficult to develop models that automatically classify multiple whale species. By analyzing whale vocalizations, researchers can estimate population sizes, track changes over time, and help develop conservation strategies, including protected area designation and mitigation measures. Effective monitoring is essential for conservation, but the complexity of whale calls, especially from elusive species, and the vast amount of underwater audio data complicate efforts to track their populations.

Current methods for animal species identification through sound are more advanced for birds than for whales, as models like Google Perch can classify thousands of bird vocalizations. However, similar multi-species classification models for whales are more challenging to develop due to the diversity in whale vocalizations and a lack of comprehensive data for certain species. Previous efforts have focused on specific species like humpback whales, with earlier models developed by Google Research in partnership with NOAA and other organizations. These models helped classify humpback calls and identified new locations of whale activity.

To address the limitations of previous models, Google researchers developed a new whale bioacoustics model capable of classifying vocalizations from eight distinct species, including the mysterious “Biotwang” sound attributed to the Bryde’s whale. This new model expands on earlier efforts by classifying multiple species and vocalization types, designed for large-scale application on long-term passive acoustic recordings.

The proposed whale bioacoustics model processes audio data by converting it into spectrogram images for each 5-second window of sound. The front-end of the model uses mel-scaled frequency axes and log amplitude compression. It then classifies these spectrograms into one of 12 classes, corresponding to eight whale species and several specific vocalization types. To ensure accurate classifications and minimize false positives, the model was trained not just on positive examples but also on negative and background noise data. The model’s performance, as measured by metrics such as the area under the receiver operating characteristic curve (AUC), showed strong discriminative abilities, particularly for species like Minke and Bryde’s whales.
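As a rough illustration of this kind of front end (not the model's actual configuration; librosa, the window length, and the mel-band count are assumptions), a 5-second window can be turned into a log-amplitude mel spectrogram as follows:

```python
# Minimal sketch: convert a 5-second audio window into a log-amplitude mel
# spectrogram, the kind of image-like representation described above.
import librosa
import numpy as np

def mel_frontend(audio, sr, window_seconds=5, n_mels=128):
    window = audio[: int(window_seconds * sr)]                    # take one 5-second window
    mel = librosa.feature.melspectrogram(y=window, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                               # log amplitude compression

# The resulting 2-D array can be fed to an image classifier that outputs one of
# the 12 species/vocalization classes.
```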

Along with the classification task, the model helped researchers discover new insights about species’ movements, including differences between central and western Pacific Bryde’s whale populations. By labeling over 200,000 hours of underwater recordings, the model also uncovered the seasonal migration patterns of some species. The model is now publicly available via Kaggle for further use in whale conservation and research efforts.

In conclusion, Google’s new whale bioacoustics model is a significant advancement in the field, addressing the challenge of multi-species classification with a model that not only recognizes eight species but also provides detailed insights into their ecology. This model is a crucial tool in marine biology research, offering scalable and accurate underwater audio data classification and furthering our understanding of whale populations, especially for elusive species like Bryde’s whales.

Check out the Paper and Blog. All credit for this research goes to the researchers of this project.

The post Google AI Researchers Introduce a New Whale Bioacoustics Model that can Identify Eight Distinct Species, Including Multiple Calls for Two of Those Species appeared first on MarkTechPost.

Advancing Membrane Science: The Role of Machine Learning in Optimizati …

Machine Learning in Membrane Science: ML significantly transforms the natural sciences, particularly cheminformatics and materials science, including membrane technology. This review focuses on current ML applications in membrane science, offering insights from both ML and membrane perspectives. It begins by explaining foundational ML algorithms and design principles, followed by a detailed examination of traditional and deep learning approaches in the membrane domain. The review highlights the role of data and featurization in molecular and membrane systems and explores how ML has been applied in areas like reverse osmosis, gas separation, and nanofiltration. The distinction between predictive tasks and generative membrane design is also discussed, along with recommended best practices for ensuring reproducibility in ML studies on membranes. This is the first review that systematically covers the intersection of ML and membrane science.

Introducing data-driven approaches, such as ML, has led to significant advancements in various scientific disciplines. Challenges in membrane science often involve complex, multidimensional problems that ML can effectively address. Membrane processes such as gas separation and filtration benefit from the ability of ML algorithms to analyze vast datasets, predict material properties, and assist in membrane design. Moreover, recent studies emphasize the growing interest in ML applications within this field, as evidenced by the rising number of publications on the topic. The review also explores advanced techniques like graph neural networks (GNNs) and generative membrane design, which are promising for future developments in nonlinear material innovation.

Machine Learning Approaches in Membrane Science: Traditional scientific research often follows a hypothesis-driven framework, where new theories arise from established observations and are validated through experiments. This model formulation process involves refining a physical model based on empirical evidence. However, the emergence of data science has shifted this paradigm, enabling researchers to employ ML techniques that can model physical phenomena without a predefined theoretical basis. By leveraging vast amounts of data, ML models can adapt and recognize patterns without significant a priori conceptualization, relying heavily on the quality and volume of training data. The performance of these models is crucially assessed through validation and testing phases to avoid underfitting and overfitting, conditions that impede the model’s predictive accuracy.

Effective featurization is vital for successful ML implementation in the context of membrane applications. Membrane separation processes consist of a matrix, membrane, and various process parameters, which must be accurately represented. Different featurization techniques—such as fingerprints and graph-based representations—transform molecular structures into formats that ML algorithms can process. This approach allows for better prediction of properties based on the underlying chemical relationships and characteristics. By utilizing domain knowledge to select relevant parameters, researchers can optimize their models and improve the accuracy of predictions, addressing challenges like data sparsity and overfitting while facilitating advancements in membrane science.
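As a simple illustration of fingerprint-based featurization (an assumed RDKit workflow with an illustrative SMILES string, not an example taken from the review), a molecule can be converted into a fixed-length bit vector suitable for standard ML models:

```python
# Minimal sketch: featurize a small molecule (e.g., a solvent or a polymer
# repeat unit) as a Morgan fingerprint, one of the fingerprint representations
# mentioned above.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fingerprint(smiles, radius=2, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)      # bit vector -> numpy feature vector
    return arr

features = morgan_fingerprint("CCO")              # toy example: ethanol
```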

Advancements in Membrane Technology through Machine Learning Innovations: Recent studies have focused on enhancing membrane performance through ML techniques, addressing the high costs and labor-intensive challenges of material development. Traditional approaches, often reliant on trial and error, struggle with the multi-dimensional complexities of membrane design. By utilizing computational models, researchers have analyzed performance metrics such as permeability and selectivity, optimizing existing processes and informing the development of new materials. Predictive models are instrumental in identifying structure-property relationships across various membrane types and applications, including ultrafiltration and electrolytic conductivity, enhancing overall performance and efficiency in membrane technology.

Fouling is a significant issue in membrane applications, which negatively impacts performance and increases operational costs. Data-driven methods have emerged to monitor and predict fouling, leading to cost savings by optimizing cleaning schedules and reducing unnecessary membrane replacements. Various ML techniques, including artificial neural networks (ANNs) and genetic algorithms, have been applied to tackle fouling by analyzing input parameters such as biomass characteristics and operating conditions. Additionally, ML is being increasingly integrated into wastewater treatment and gas separation processes, optimizing operational parameters and enhancing the design of membranes, particularly in complex applications like organic solvent nanofiltration. These advancements highlight the potential of hybrid ML approaches in improving membrane technology on an industrial scale. However, there remains a need for broader research encompassing diverse membrane materials and real-time monitoring capabilities.


Guidelines for Machine Learning in Membrane Science: Adopting best practices in ML is crucial to enhancing reproducibility in membrane-related applications. This includes ensuring reliable data sources, cleaning datasets, and selecting appropriate algorithms. Model training should involve proper validation and hyperparameter tuning. Evaluation metrics must be well-defined, with techniques to prevent overfitting and ensure model explainability. Ethical considerations should guide the use of ML in research. Comprehensive documentation and transparent reporting of methodologies and results are essential for fostering trust within the membrane research community and facilitating effective knowledge sharing.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Advancing Membrane Science: The Role of Machine Learning in Optimization and Innovation appeared first on MarkTechPost.

Microsoft Releases GRIN MoE: A Gradient-Informed Mixture of Experts Mo …

Artificial intelligence (AI) research has increasingly focused on enhancing the efficiency and scalability of deep learning models. These models have revolutionized natural language processing, computer vision, and data analytics, but they face significant computational challenges. Specifically, as models grow larger, they require vast computational resources to process immense datasets. Techniques such as backpropagation are essential for training these models by optimizing their parameters. However, traditional methods struggle to scale deep learning models efficiently without causing performance bottlenecks or requiring excessive computational power.

One of the main issues with current deep learning models is their reliance on dense computation, which activates all model parameters uniformly during training and inference. This method is inefficient when processing large-scale data, resulting in unnecessary activation of resources that may not be relevant to the task at hand. In addition, the non-differentiable nature of some components in these models makes it challenging to apply gradient-based optimization, limiting training effectiveness. As models continue to scale, overcoming these challenges is crucial to advancing the field of AI and enabling more powerful and efficient systems.

Current approaches to scaling AI models often include dense and sparse models that employ expert routing mechanisms. Dense models, like GPT-3 and GPT-4, activate all layers and parameters for every input, making them resource-heavy and difficult to scale. Sparse models, which aim to activate only a subset of parameters based on input requirements, have shown promise in reducing computational demands. However, existing methods like GShard and Switch Transformers still rely heavily on expert parallelism and employ techniques like token dropping to manage resource distribution. While effective, these methods have trade-offs in training efficiency and model performance.

Researchers from Microsoft have introduced an innovative solution to these challenges with GRIN (GRadient-INformed Mixture of Experts). This approach aims to address the limitations of existing sparse models by introducing a new method of gradient estimation for expert routing. GRIN enhances model parallelism, allowing for more efficient training without the need for token dropping, a common issue in sparse computation. By applying GRIN to autoregressive language models, the researchers have developed a top-2 mixture-of-experts model with 16 experts per layer, referred to as the GRIN MoE model. This model selectively activates experts based on input, significantly reducing the number of active parameters while maintaining high performance.

The GRIN MoE model employs several advanced techniques to achieve its impressive performance. The model’s architecture includes MoE layers where each layer consists of 16 experts, and only the top 2 are activated for each input token, using a routing mechanism. Each expert is implemented as a GLU (Gated Linear Unit) network, allowing the model to balance computational efficiency and expressive power. The researchers introduced SparseMixer-v2, a key component that estimates gradients related to expert routing, replacing conventional methods that use gating gradients as proxies. This allows the model to scale without relying on token dropping or expert parallelism, which is common in other sparse models.
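The following minimal sketch (not Microsoft's implementation; hidden sizes, the softmax-over-top-k weighting, and class names are illustrative assumptions, and GRIN's SparseMixer-v2 gradient estimator is not reproduced) shows the general shape of a top-2 mixture-of-experts layer with 16 GLU-style experts:

```python
# Minimal sketch: a top-2 MoE layer. Each token is routed to its 2 highest-scoring
# experts out of 16, and the expert outputs are combined with normalized routing weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUExpert(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden)
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))          # gated linear unit

class Top2MoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([GLUExpert(d_model, d_hidden) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                                             # x: (tokens, d_model)
        scores = self.router(x)                                       # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)                # top-2 experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                                 # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: Top2MoE()(torch.randn(8, 512)) returns a (8, 512) tensor, with each
# token processed only by its two selected experts.
```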

The performance of the GRIN MoE model has been rigorously tested across a wide range of tasks, and the results demonstrate its superior efficiency and scalability. In the MMLU (Massive Multitask Language Understanding) benchmark, the model scored an impressive 79.4, surpassing several dense models of similar or larger sizes. It also achieved a score of 83.7 on HellaSwag, a benchmark for common-sense reasoning, and 74.4 on HumanEval, which measures the model’s ability to solve coding problems. Notably, the model’s performance on MATH, a benchmark for mathematical reasoning, was 58.9, reflecting its strength in specialized tasks. The GRIN MoE model uses only 6.6 billion activated parameters during inference, which is fewer than the 7 billion activated parameters of competing dense models, yet it matches or exceeds their performance. In another comparison, GRIN MoE outperformed a 7-billion parameter-dense model and matched the performance of a 14-billion parameter-dense model on the same dataset.

The introduction of GRIN also brings marked improvements in training efficiency. When trained on 64 H100 GPUs, the GRIN MoE model achieved an 86.56% throughput, demonstrating that sparse computation can scale effectively while maintaining high efficiency. This marks a significant improvement over previous models, which often suffer from slower training speeds as the number of parameters increases. Furthermore, the model’s ability to avoid token dropping means it maintains a high level of accuracy and robustness across various tasks, unlike models that lose information during training.

Overall, the research team’s work on GRIN presents a compelling solution to the ongoing challenge of scaling AI models. By introducing an advanced method for gradient estimation and model parallelism, they have successfully developed a model that not only performs better but also trains more efficiently. This advancement could lead to widespread applications in natural language processing, coding, mathematics, and more. The GRIN MoE model represents a significant step forward in AI research, offering a pathway to more scalable, efficient, and high-performing models in the future.

Check out the Paper, Model Card, and Demo. All credit for this research goes to the researchers of this project.

The post Microsoft Releases GRIN MoE: A Gradient-Informed Mixture of Experts MoE Model for Efficient and Scalable Deep Learning appeared first on MarkTechPost.