LLMs Can Think While Idle: Researchers from Letta and UC Berkeley Introduce ‘Sleep-Time Compute’ to Slash Inference Costs and Boost Accuracy Without Sacrificing Latency

Large language models (LLMs) have gained prominence for their ability to handle complex reasoning tasks, transforming applications from chatbots to code-generation tools. These models are known to benefit significantly from scaling their computation during inference, often producing higher accuracy by dedicating more resources to hard problems. However, this approach brings along considerable drawbacks. Longer processing times and higher computing costs make it challenging to scale such solutions in real-world settings, where responsiveness and affordability are crucial. As technology advances toward more intelligent systems, there is a growing need to explore how LLMs can become not only smarter but also more efficient, especially when operating within repetitive or familiar contexts.

One of the biggest inefficiencies in current LLM deployment occurs during query resolution. Typically, when a user poses a question, the model processes it together with the necessary background context. This test-time compute paradigm assumes that the context and question always arrive together. But in real scenarios, such as document Q&A or code debugging, context is usually persistent and available well before a specific question is asked. Yet the model processes everything from scratch for each query, even if it has seen the context before. This redundancy drives up computational costs and response delays, particularly when multiple queries are posed against a single context.

To deal with this inefficiency, various methods have been developed. Sequential and parallel test-time computation are two major strategies. Sequential approaches extend the model’s reasoning path, allowing it to consider more possibilities, while parallel approaches involve sampling multiple outputs simultaneously, known as pass@k. Techniques like speculative decoding aim to cut latency by making early guesses, but their usefulness is limited when the model still has to think from scratch. While helpful, these methods don’t eliminate the need to process context alongside every new question repeatedly. They also typically require test-time conditions that aren’t always feasible, such as access to an oracle or an ideal verifier.
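To make the contrast concrete, here is a toy sketch of the pass@k setup described above, assuming placeholder sample_answer and verifier functions; it simply illustrates why an ideal verifier is required and is not taken from any specific implementation.

# Toy sketch of pass@k: draw k candidate answers in parallel and count the
# problem as solved if any candidate passes the verifier. Both sample_answer()
# and verifier() are hypothetical placeholders.
def pass_at_k(question, k, sample_answer, verifier):
    candidates = [sample_answer(question) for _ in range(k)]
    return any(verifier(question, c) for c in candidates)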

Researchers from Letta and the University of California, Berkeley, introduced a novel solution they call sleep-time compute. The method uses the idle time between user interactions to do useful reasoning in advance. Instead of waiting for a user question, the model begins analyzing the context beforehand. It anticipates possible future queries and builds a new version of the context enriched with relevant inferences. When a user finally asks a question, the model can simply refer to this pre-processed context. Since much of the thinking is already done, less computational effort is needed to produce accurate answers. This approach becomes even more effective when multiple questions relate to the same context, allowing inferences to be shared and the computational cost to be amortized across queries.

The implementation of sleep-time compute relies on decomposing the traditional prompt into two parts: a static context and a dynamic query. During the sleep-time window, only the context is used to generate a pre-processed version. This enhanced context, called c′, is built using test-time compute techniques like reasoning chains or summarization. Once this enriched version is stored, it replaces the raw context during real-time queries. The final answers are then generated using much fewer resources. This system not only minimizes redundant reasoning but also paves the way for more proactive LLMs that can think ahead and be better prepared.
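As a rough illustration of this decomposition, the following is a minimal sketch of the sleep-time compute pattern, assuming a generic llm() completion helper and illustrative prompt wording; the paper's actual prompts, models, and caching machinery differ.

# Minimal sketch of sleep-time compute: enrich the static context offline (c'),
# then answer queries against the enriched context with a small online budget.
# llm() is a placeholder for any completion API; prompts are illustrative only.

def llm(prompt: str) -> str:
    raise NotImplementedError("Swap in your model client here.")

context_cache = {}

def sleep_time_compute(context_id: str, raw_context: str) -> str:
    """Offline step: pre-compute inferences over the context while the system is idle."""
    prompt = (
        "You will later answer questions about the context below. "
        "List key facts, intermediate results, and likely questions with answers.\n\n"
        "Context:\n" + raw_context
    )
    enriched = llm(prompt)  # the heavy reasoning happens here, before any query arrives
    context_cache[context_id] = raw_context + "\n\nPre-computed notes:\n" + enriched
    return context_cache[context_id]

def answer_query(context_id: str, query: str) -> str:
    """Online step: answer against c' so that far less test-time reasoning is needed."""
    c_prime = context_cache[context_id]  # shared across every query on this context
    return llm("Context:\n" + c_prime + "\n\nQuestion: " + query + "\nAnswer concisely.")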

To evaluate the effectiveness of sleep-time compute, the research team tested it using two specially designed benchmarks: Stateful GSM-Symbolic and Stateful AIME. Both datasets are derived by splitting existing problem sets into separate contexts and questions. In experiments using models like GPT-4o and GPT-4o-mini, researchers observed a 5× reduction in test-time compute for similar accuracy levels. Notably, accuracy improved by up to 13% for the GSM-Symbolic P2 dataset and by 18% on Stateful AIME when sleep-time compute was scaled. Multi-Query GSM-Symbolic, a new dataset introduced for this evaluation, helped demonstrate that the cost per query could be reduced by 2.5× when 10 queries shared the same context.

When pitted against popular strategies like pass@k, sleep-time compute consistently outperformed them. Unlike pass@k, which assumes access to a perfect evaluator, sleep-time compute works under more realistic conditions. Results show that even at low test-time compute budgets, sleep-time compute produced comparable or better accuracy while consuming fewer tokens. For instance, the GPT-4o-mini model achieved higher accuracy with fewer than 200 test-time tokens using sleep-time compute compared to over 500 tokens needed in the baseline. Even when models like Claude Sonnet 3.7 and DeepSeek R1 were evaluated, similar improvements were observed.

Scaling the amount of compute dedicated to sleep-time further improved outcomes. By running five parallel generations during sleep-time on complex tasks, researchers pushed the Pareto frontier further out. However, they noted diminishing returns beyond this point. Importantly, results showed that stronger models handling more difficult tasks benefited more from additional sleep-time compute. Also, amortizing sleep-time computation became highly cost-effective when contexts served multiple related queries. By weighting test-time tokens as ten times more expensive than sleep-time tokens, aligned with industry latency-cost ratios, the researchers confirmed a reduction of up to 2.5 times in the average cost per query.
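As a back-of-the-envelope illustration of the amortization argument, the calculation below uses the 10:1 weighting of test-time versus sleep-time tokens described above but entirely hypothetical token counts:

# Hypothetical token counts; only the 10:1 test-time vs. sleep-time weighting
# reflects the cost model described above.
SLEEP_TOKENS = 4000          # spent once per context, offline
BASELINE_TEST_TOKENS = 500   # per query, without sleep-time compute
REDUCED_TEST_TOKENS = 150    # per query, answering against the enriched context
TEST_WEIGHT, SLEEP_WEIGHT = 10, 1
N_QUERIES = 10               # related queries sharing the same context

baseline_cost = BASELINE_TEST_TOKENS * TEST_WEIGHT
amortized_cost = REDUCED_TEST_TOKENS * TEST_WEIGHT + SLEEP_TOKENS * SLEEP_WEIGHT / N_QUERIES
print(baseline_cost, amortized_cost, round(baseline_cost / amortized_cost, 2))
# 5000 1900.0 2.63 -> roughly a 2.5x cheaper query under these illustrative numbers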

Another interesting finding was that sleep-time compute worked best when user queries were predictable. Using Llama2-70B, researchers scored the predictability of each query given its context and found a strong correlation: the more predictable the query, the greater the benefit. In examples where the question logically followed from the given context, sleep-time computation yielded higher gains. Conversely, less predictable or abstract queries experienced reduced effectiveness, although they still showed benefits compared to traditional test-time-only methods.
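A minimal sketch of this kind of predictability scoring is shown below; it computes the average log-probability of the query tokens conditioned on the context with a small stand-in Hugging Face model (the paper reports scoring with Llama2-70B), so the model choice and framing here are assumptions.

# Sketch of query-predictability scoring with a causal LM. "gpt2" is a small
# stand-in; the paper used a much larger scorer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def query_log_prob(context: str, query: str) -> float:
    """Average log-probability of the query tokens given the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    q_ids = tok(query, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, q_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # score only the query tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean negative log-likelihood
    return -loss.item()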

Altogether, this research presents a smart and scalable technique to enhance the efficiency of LLMs without compromising accuracy. By leveraging otherwise idle time, sleep-time compute reduces the burden on real-time systems, lowers operational costs, and improves response time. The clear quantitative improvements, such as a 5× reduction in compute, 13–18% accuracy gains, and a drop of up to 2.5× in cost per query, demonstrate that forward-thinking approaches like this could shape the next generation of intelligent, context-aware assistants.

Several Key Takeaways from the Research are as follows:

Sleep-time compute allows models to anticipate queries by reasoning on context before the query arrives.

Accuracy improved by 13% on GSM-Symbolic and 18% on AIME datasets when sleep-time computation was scaled.

Test-time compute requirements decreased by approximately 5 times for similar performance levels.

When sharing context across 10 related queries, the average query cost decreased by a factor of 2.5.

Outperformed the pass@k strategy in parallel compute settings at equivalent budgets.

More effective on predictable queries, identified via log-probability scoring.

Diminishing returns noted beyond five parallel generations for sleep-time computation.

Check out the Paper.


LLMs Can Be Misled by Surprising Data: Google DeepMind Introduces New Techniques to Predict and Reduce Unintended Knowledge Contamination

Large language models (LLMs) are continually evolving by ingesting vast quantities of text data, enabling them to become more accurate predictors, reasoners, and conversationalists. Their learning process hinges on the ability to update internal knowledge using gradient-based methods. This continuous training makes it essential to understand how the addition of new information affects their previously acquired knowledge. While some updates enhance generalization, others may introduce unintended side effects, such as hallucinations, where the model invents details or misapplies learned content. Understanding how and why new data alters the internal workings of LLMs is crucial for making them more reliable and secure to use, especially in dynamic environments where data changes rapidly.

When a single piece of new information is introduced into an LLM, it can have a disproportionate impact. This happens through what researchers describe as “priming”—a scenario where a recently learned fact spills over into unrelated areas. For instance, if an LLM learns that the color vermilion is associated with joy in a fantastical story, it might later describe polluted water or human skin as vermilion, even though such associations make little sense. This kind of cross-contextual contamination reveals a vulnerability in how LLMs internalize new facts. Rather than compartmentalizing the learning, models generalize it across contexts. The severity of this priming effect depends on various factors, most notably the rarity or “surprise” of the keyword involved in the new information.

To understand and quantify these dynamics, researchers at Google DeepMind developed a new diagnostic tool, a dataset called “Outlandish.” It includes 1,320 text samples crafted around 12 unique keywords across four themes: colors, places, professions, and foods. Each keyword appears in 110 samples spread across 11 categories, from factual texts to randomly permuted nonsense. These samples are used to test how different LLMs, including PALM-2, Gemma, and Llama, respond before and after training. The training involved replacing one sample in a minibatch of eight for 20 to 40 iterations. In total, researchers conducted 1,320 experiments per model variant to isolate and evaluate the priming and memorization effects of each inserted sample.

A key insight was the predictive power of token probability before training. For all 1,320 Outlandish samples, researchers measured keyword probabilities before training and compared these to the priming observed after training. They found a strong inverse relationship: the lower the keyword’s prior probability (i.e., the more surprising it was), the higher the likelihood of priming. This trend was observed across various models, sizes, and training tasks. A clear threshold emerged around a probability of 10⁻³. Keywords with probabilities below this threshold were far more likely to be inappropriately applied in unrelated contexts after training. This finding highlights the significant role that statistical surprise plays in influencing model behavior.
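To make the measurement concrete, here is a minimal sketch of estimating a keyword's prior probability before any fine-tuning, using a small stand-in Hugging Face model; the prefix, keyword, and model are illustrative and not DeepMind's setup.

# Sketch of checking a keyword's pre-training probability against the ~1e-3
# threshold discussed above. Model, prefix, and keyword are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def keyword_probability(prefix: str, keyword: str) -> float:
    """Probability of the keyword's first token following the prefix."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    kw_id = tok(" " + keyword, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        next_token_logits = model(prefix_ids).logits[0, -1]
    return torch.softmax(next_token_logits, dim=-1)[kw_id].item()

p = keyword_probability("The color of the ripe banana was", "vermilion")
print(p, "high priming risk" if p < 1e-3 else "lower priming risk")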

Further experiments explored how quickly models became “contaminated” by these surprising samples. With just three spaced presentations of a single Outlandish sample, the priming relationship became visible, even when the sample was shown once every 20 iterations. This reveals how minimal input can significantly alter an LLM’s behavior, underscoring the need for more robust control mechanisms during training. Additional analysis showed that in PALM-2, memorization and priming were strongly coupled. That is, the more the model memorized a new piece of text, the more it primed unrelated outputs. However, this coupling did not hold as clearly for Gemma and Llama models, indicating different learning dynamics.

Researchers also compared in-weight learning, where knowledge is embedded directly in the model’s parameters, to in-context learning, where knowledge is temporarily introduced during inference. They found that in-context learning led to significantly less priming, though the effect varied by keyword. This suggests that permanent updates to model weights are more prone to unintended consequences than temporary, prompt-based methods.

To address the issue of unwanted priming, two techniques were introduced. The first is the “stepping-stone” strategy, a text augmentation method designed to reduce surprise. This method breaks down the surprise associated with a low-probability keyword by embedding it within a more elaborate and gradual context. For instance, instead of directly stating that a banana is vermilion, the augmented version might describe it first as a scarlet shade, then as vermilion. Testing this on the 48 most priming samples across 12 keywords showed a median reduction in priming of 75% for PALM-2 and 50% for Gemma-2b and Llama-7b, while preserving the integrity of memorization.
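As a small illustration of the idea (reusing the banana/vermilion example above; the researchers' exact augmentation prompts are not reproduced here), a stepping-stone rewrite might look like this:

# Illustrative stepping-stone augmentation: the surprising keyword is reached
# through intermediate, higher-probability descriptions before it appears.
original_sample = "The banana is vermilion."
stepping_stone_sample = (
    "The banana has an unusual color. It is not yellow but a bright red, "
    "a scarlet shade; more precisely, it is vermilion."
)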

The second method, “ignore-topk,” is a gradient pruning strategy. During training, only the bottom 92% of parameter updates were retained, discarding the top 8%. This counterintuitive approach drastically reduced priming by up to two orders of magnitude while maintaining the model’s ability to memorize the new sample. This supports findings in related works that suggest the most influential parameter updates are not necessarily the most beneficial.
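A minimal PyTorch sketch of an ignore-topk style update is shown below; it drops the largest 8% of gradient entries per parameter tensor before the optimizer step. This illustrates the idea only and is not DeepMind's implementation.

# Sketch of "ignore-topk" gradient pruning: zero out the top 8% of gradient
# entries by magnitude in each parameter tensor, then apply the rest.
import torch

def ignore_topk_step(model, loss, optimizer, drop_frac=0.08):
    optimizer.zero_grad()
    loss.backward()
    for p in model.parameters():
        if p.grad is None:
            continue
        flat = p.grad.abs().flatten()
        k = max(1, int(drop_frac * flat.numel()))
        cutoff = torch.topk(flat, k).values.min()  # magnitude threshold for the top 8%
        p.grad[p.grad.abs() >= cutoff] = 0.0       # keep only the bottom ~92% of updates
    optimizer.step()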

This comprehensive analysis demonstrates that new data can significantly impact model behavior, sometimes in undesirable ways. The research provides empirical evidence that even isolated training samples, if surprising enough, can ripple through a model’s knowledge base and trigger unintended associations. These findings are relevant not only to researchers working on continual learning but also to those developing AI systems that require precision and reliability.

Several Key Takeaways from the Research include:

1,320 custom-crafted text samples were used to evaluate the impact of new information on LLMs.  

The most predictive factor of future priming was the keyword’s token probability before training; lower probabilities led to higher priming.

A probability threshold of 10⁻³ was identified, below which priming effects became significantly pronounced. 

Priming effects were measurable after just three training iterations, even with spacing between inputs.

PALM-2 showed a strong correlation between memorization and priming, while Gemma and Llama exhibited different learning behaviors.  

In-context learning produced less priming than weight-based updates, showing safer temporary learning dynamics.

The “stepping-stone” strategy reduced priming by up to 75% without compromising learning.

The “ignore-topk” pruning method reduced priming by nearly two orders of magnitude while maintaining memorization.

Check out the Paper.


An Advanced Coding Implementation: Mastering Browser‑Driven AI in Google Colab with Playwright, browser_use Agent & BrowserContext, LangChain, and Gemini

In this tutorial, we will learn how to harness the power of a browser‑driven AI agent entirely within Google Colab. We will utilize Playwright’s headless Chromium engine, along with the browser_use library’s high-level Agent and BrowserContext abstractions, to programmatically navigate websites, extract data, and automate complex workflows. We will wrap Google’s Gemini model via the langchain_google_genai connector to provide natural‑language reasoning and decision‑making, secured by pydantic’s SecretStr for safe API‑key handling. With getpass managing credentials, asyncio orchestrating non‑blocking execution, and optional .env support via python-dotenv, this setup will give you an end‑to‑end, interactive agent platform without ever leaving your notebook environment.

!apt-get update -qq
!apt-get install -y -qq chromium-browser chromium-chromedriver fonts-liberation
!pip install -qq playwright python-dotenv langchain-google-genai browser-use
!playwright install

We first refresh the system package lists and install headless Chromium, its WebDriver, and the Liberation fonts to enable browser automation. We then install Playwright along with python-dotenv, the langchain-google-genai connector for Gemini, and browser-use, and finally download the necessary browser binaries via playwright install.

import os
import asyncio
from getpass import getpass
from pydantic import SecretStr
from langchain_google_genai import ChatGoogleGenerativeAI
from browser_use import Agent, Browser, BrowserContextConfig, BrowserConfig
from browser_use.browser.browser import BrowserContext

We bring in the core Python utilities, os for environment management and asyncio for asynchronous execution, plus getpass and pydantic’s SecretStr for secure API‑key input and storage. We then load LangChain’s Gemini wrapper (ChatGoogleGenerativeAI) and the browser_use toolkit (Agent, Browser, BrowserContextConfig, BrowserConfig, and BrowserContext) to configure and drive a headless browser agent.

os.environ["ANONYMIZED_TELEMETRY"] = "false"

We disable anonymous usage reporting by setting the ANONYMIZED_TELEMETRY environment variable to “false”, ensuring that neither Playwright nor the browser_use library sends any telemetry data back to its maintainers.

async def setup_browser(headless: bool = True):
    browser = Browser(config=BrowserConfig(headless=headless))
    context = BrowserContext(
        browser=browser,
        config=BrowserContextConfig(
            wait_for_network_idle_page_load_time=5.0,
            highlight_elements=True,
            save_recording_path="./recordings",
        )
    )
    return browser, context

This asynchronous helper initializes a headless (or headed) Browser instance and wraps it in a BrowserContext configured to wait for network‑idle page loads, visually highlight elements during interactions, and save a recording of each session under ./recordings. It then returns both the browser and its ready‑to‑use context for your agent’s tasks.

async def agent_loop(llm, browser_context, query, initial_url=None):
    initial_actions = [{"open_tab": {"url": initial_url}}] if initial_url else None
    agent = Agent(
        task=query,
        llm=llm,
        browser_context=browser_context,
        use_vision=True,
        generate_gif=False,
        initial_actions=initial_actions,
    )
    result = await agent.run()
    return result.final_result() if result else None

This async helper encapsulates one “think‐and‐browse” cycle: it spins up an Agent configured with your LLM, the browser context, and optional initial URL tab, leverages vision when available, and disables GIF recording. Once you call agent_loop, it runs the agent through its steps and returns the agent’s final result (or None if nothing is produced).

async def main():
    raw_key = getpass("Enter your GEMINI_API_KEY: ")

    os.environ["GEMINI_API_KEY"] = raw_key

    api_key = SecretStr(raw_key)
    model_name = "gemini-2.5-flash-preview-04-17"

    llm = ChatGoogleGenerativeAI(model=model_name, api_key=api_key)

    browser, context = await setup_browser(headless=True)

    try:
        while True:
            query = input("\nEnter prompt (or leave blank to exit): ").strip()
            if not query:
                break
            url = input("Optional URL to open first (or blank to skip): ").strip() or None

            print("\n Running agent...")
            answer = await agent_loop(llm, context, query, initial_url=url)
            print("\n Search Results\n" + "-" * 40)
            print(answer or "No results found")
            print("-" * 40)
    finally:
        print("Closing browser...")
        await browser.close()

await main()

Finally, this main coroutine drives the entire Colab session: it securely prompts for your Gemini API key (using getpass and SecretStr), sets up the ChatGoogleGenerativeAI LLM and a headless Playwright browser context, then enters an interactive loop where it reads your natural‑language prompts (and optional start URL), invokes the agent_loop to perform the browser‑driven AI task, prints the results, and finally ensures the browser closes cleanly.

In conclusion, by following this guide, you now have a reproducible Colab template that integrates browser automation, LLM reasoning, and secure credential management into a single cohesive pipeline. Whether you’re scraping real‑time market data, summarizing news articles, or automating reporting tasks, the combination of Playwright, browser_use, and LangChain’s Gemini interface provides a flexible foundation for your next AI‑powered project. Feel free to extend the agent’s capabilities, re‑enable GIF recording, add custom navigation steps, or swap in other LLM backends to tailor the workflow precisely to your research or production needs.

Here is the Colab Notebook.


Build a FinOps agent using Amazon Bedrock with multi-agent capability …

AI agents are revolutionizing how businesses enhance their operational capabilities and enterprise applications. By enabling natural language interactions, these agents provide customers with a streamlined, personalized experience. Amazon Bedrock Agents uses the capabilities of foundation models (FMs), combining them with APIs and data to process user requests, gather information, and execute specific tasks effectively. The introduction of multi-agent collaboration now enables organizations to orchestrate multiple specialized AI agents working together to tackle complex, multi-step challenges that require diverse expertise.
Amazon Bedrock offers a diverse selection of FMs, allowing you to choose the one that best fits your specific use case. Among these offerings, Amazon Nova stands out as AWS’s next-generation FM, delivering breakthrough intelligence and industry-leading performance at exceptional value.
The Amazon Nova family comprises three types of models:

Understanding models – Available in Micro, Lite, and Pro variants
Content generation models – Featuring Canvas and Reel
Speech-to-Speech model – Nova Sonic

These models are specifically optimized for enterprise and business applications, excelling in the following capabilities:

Text generation
Summarization
Complex reasoning tasks
Content creation

This makes Amazon Nova ideal for sophisticated use cases like our FinOps solution.
A key advantage of the Amazon Nova model family is its industry-leading price-performance ratio. Compared to other enterprise-grade AI models, Amazon Nova offers comparable or superior capabilities at a more competitive price point. This cost-effectiveness, combined with its versatility and performance, makes Amazon Nova an attractive choice for businesses looking to implement advanced AI solutions.
In this post, we use the multi-agent feature of Amazon Bedrock to demonstrate a powerful and innovative approach to AWS cost management. By using the advanced capabilities of Amazon Nova FMs, we’ve developed a solution that showcases how AI-driven agents can revolutionize the way organizations analyze, optimize, and manage their AWS costs.
Solution overview
Our innovative AWS cost management solution uses the power of AI and multi-agent collaboration to provide comprehensive cost analysis and optimization recommendations. The core of the system is built around three key components:

FinOps supervisor agent – Acts as the central coordinator, managing user queries and orchestrating the activities of specialized subordinate agents
Cost analysis agent – Uses AWS Cost Explorer to gather and analyze cost data for specified time ranges
Cost optimization agent – Uses the AWS Trusted Advisor Cost Optimization Pillar to provide actionable cost-saving recommendations

The solution integrates the multi-agent collaboration capabilities of Amazon Bedrock with Amazon Nova to create an intelligent, interactive, cost management AI assistant. This integration enables seamless communication between specialized agents, each focusing on different aspects of AWS cost management. Key features of the solution include:

User authentication through Amazon Cognito with role-based access control
Frontend application hosted on AWS Amplify
Real-time cost insights and historical analysis
Actionable cost optimization recommendations
Parallel processing of tasks for improved efficiency

By combining AI-driven analysis with AWS cost management tools, this solution offers finance teams and cloud administrators a powerful, user-friendly interface to gain deep insights into AWS spending patterns and identify cost-saving opportunities.
The architecture displayed in the following diagram uses several AWS services, including AWS Lambda functions, to create a scalable, secure, and efficient system. This approach demonstrates the potential of AI-driven multi-agent systems to assist with cloud financial management and solve a wide range of cloud management challenges.

In the following sections, we dive deeper into the architecture of our solution, explore the capabilities of each agent, and discuss the potential impact of this approach on AWS cost management strategies.
Prerequisites
You must have the following in place to complete the solution in this post:

An AWS account
FM access in Amazon Bedrock for Amazon Nova Pro and Micro in the same AWS Region where you will deploy this solution
The accompanying AWS CloudFormation template downloaded from the aws-samples GitHub repo

Deploy solution resources using AWS CloudFormation
This CloudFormation template is designed to run in the us-east-1 Region. If you deploy in a different Region, you must configure cross-Region inference profiles to have proper functionality and update the CloudFormation template accordingly.
During the CloudFormation template deployment, you will need to specify three required parameters:

Stack name
FM selection
Valid user email address

AWS resource usage will incur costs. When deployment is complete, the following resources will have been created:

Amazon Cognito resources:

User pool – CognitoUserPoolforFinOpsApp
App client – FinOpsApp
Identity pool – cognito-identity-pool-finops
Groups – Finance
User – Finance User

AWS Identity and Access Management (IAM) resources:

IAM roles:

FinanceUserRestrictedRole
DefaultCognitoAuthenticatedRole

IAM policies:

Finance-BedrockAccess
Default-CognitoAccess

Lambda functions:

TrustedAdvisorListRecommendationResources
TrustedAdvisorListRecommendations
CostAnalysis
ClockandCalendar
CostForecast

Amazon Bedrock agents:

FinOpsSupervisorAgent
CostAnalysisAgent with action groups:

CostAnalysisActionGroup
ClockandCalendarActionGroup
CostForecastActionGroup

CostOptimizationAgent with action groups:

TrustedAdvisorListRecommendationResources
TrustedAdvisorListRecommendations

After you deploy the CloudFormation template, copy the following from the Outputs tab on the AWS CloudFormation console to use during the configuration of your application after it’s deployed in Amplify:

AWSRegion
BedrockAgentAliasId
BedrockAgentId
BedrockAgentName
IdentityPoolId
UserPoolClientId
UserPoolId

The following screenshot shows you what the Outputs tab will look like.
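If you prefer to retrieve these values programmatically instead of copying them from the console, a short boto3 snippet like the following works; the stack name shown is a placeholder for whatever name you chose during deployment.

# Fetch the CloudFormation outputs needed for the Amplify app configuration.
# "finops-agent-stack" is a placeholder; use your actual stack name and Region.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
stack = cfn.describe_stacks(StackName="finops-agent-stack")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]}
for key in ("AWSRegion", "BedrockAgentAliasId", "BedrockAgentId", "BedrockAgentName",
            "IdentityPoolId", "UserPoolClientId", "UserPoolId"):
    print(key, outputs.get(key))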

Deploy the Amplify application
You need to manually deploy the Amplify application using the frontend code found on GitHub. Complete the following steps:

Download the frontend code AWS-Amplify-Frontend.zip from GitHub.
Use the .zip file to manually deploy the application in Amplify.
Return to the Amplify page and use the domain it automatically generated to access the application.

Amazon Cognito for user authentication
The FinOps application uses Amazon Cognito user pools and identity pools to implement secure, role-based access control for finance team members. User pools handle authentication and group management, and identity pools provide temporary AWS credentials mapped to specific IAM roles. The system makes sure that only verified finance team members can access the application and interact with the Amazon Bedrock API, combining robust security with a seamless user experience.
Amazon Bedrock Agents with multi-agent capability
The Amazon Bedrock multi-agent architecture enables sophisticated FinOps problem-solving through a coordinated system of AI agents, led by a FinOpsSupervisorAgent. The FinOpsSupervisorAgent coordinates with two key subordinate agents: the CostAnalysisAgent, which handles detailed cost analysis queries, and the CostOptimizationAgent, which handles specific cost optimization recommendations. Each agent focuses on their specialized financial tasks while maintaining contextual awareness, with the FinOpsSupervisorAgent managing communication and synthesizing comprehensive responses from both agents. This coordinated approach enables parallel processing of financial queries and delivers more effective answers than a single agent could provide, while maintaining consistency and accuracy throughout the FinOps interaction.
Lambda functions for Amazon Bedrock action groups
As part of this solution, Lambda functions are deployed to support the action groups defined for each subordinate agent.
The CostAnalysisAgent uses three distinct Lambda backed action groups to deliver comprehensive cost management capabilities. The CostAnalysisActionGroup connects with Cost Explorer to extract and analyze detailed historical cost data, providing granular insights into cloud spending patterns and resource utilization. The ClockandCalendarActionGroup maintains temporal precision by providing current date and time functionality, essential for accurate period-based cost analysis and reporting. The CostForecastActionGroup uses the Cost Explorer forecasting function, which analyzes historical cost data and provides future cost projections. This information helps the agent support proactive budget planning and make informed recommendations. These action groups work together seamlessly, enabling the agent to provide historical cost analysis and future spend predictions while maintaining precise temporal context.
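As a rough sketch of the kind of call the CostAnalysisActionGroup's backing Lambda makes, the snippet below queries Cost Explorer with boto3; the event parsing and response formatting required by Amazon Bedrock action groups are omitted, and the function name is illustrative.

# Minimal Cost Explorer lookup, similar in spirit to what a cost-analysis action
# group Lambda might run; Bedrock agent event handling is intentionally omitted.
import boto3

ce = boto3.client("ce")

def monthly_cost_by_service(start: str, end: str) -> dict:
    """Unblended cost per service between start and end (YYYY-MM-DD)."""
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    totals = {}
    for period in resp["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return totals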
The CostOptimizationAgent incorporates two Trusted Advisor focused action groups to enhance its recommendation capabilities. The TrustedAdvisorListRecommendationResources action group interfaces with Trusted Advisor to retrieve a comprehensive list of resources that could benefit from optimization, providing a targeted scope for cost-saving efforts. Complementing this, the TrustedAdvisorListRecommendations action group fetches specific recommendations from Trusted Advisor, offering actionable insights on potential cost reductions, performance improvements, and best practices across various AWS services. Together, these action groups empower the agent to deliver data-driven, tailored optimization strategies by using the expertise embedded in Trusted Advisor.
Amplify for frontend
Amplify provides a streamlined solution for deploying and hosting web applications with built-in security and scalability features. The service reduces the complexity of managing infrastructure, allowing developers to concentrate on application development. In our solution, we use the manual deployment capabilities of Amplify to host our frontend application code.
Multi-agent and application walkthrough
To validate the solution before using the Amplify deployed frontend, we can conduct testing directly on the AWS Management Console. By navigating to the FinOpsSupervisorAgent, we can pose a question like “What is my cost for Feb 2025 and what are my current cost savings opportunity?” This query demonstrates the multi-agent orchestration in action. As shown in the following screenshot, the FinOpsSupervisorAgent coordinates with both the CostAnalysisAgent (to retrieve February 2025 cost data) and the CostOptimizationAgent (to identify current cost savings opportunities). This illustrates how the FinOpsSupervisorAgent effectively delegates tasks to specialized agents and synthesizes their responses into a comprehensive answer, showcasing the solution’s integrated approach to FinOps queries.

Navigate to the URL provided after you created the application in Amplify. Upon accessing the application URL, you will be prompted to provide information related to Amazon Cognito and Amazon Bedrock Agents. This information is required to securely authenticate users and allow the frontend to interact with the Amazon Bedrock agent. It enables the application to manage user sessions and make authorized API calls to AWS services on behalf of the user.
You can enter information with the values you collected from the CloudFormation stack outputs. You will be required to enter the following fields, as shown in the following screenshot:

User Pool ID
User Pool Client ID
Identity Pool ID
Region
Agent Name
Agent ID
Agent Alias ID
Region

You need to sign in with your user name and password. A temporary password was automatically generated during deployment and sent to the email address you provided when launching the CloudFormation template. At first sign-in attempt, you will be asked to reset your password, as shown in the following video.

Now you can start asking the same question in the application, for example, “What is my cost for February 2025 and what are my current cost savings opportunity?” In a few seconds, the application will provide you detailed results showing services spend for the particular month and savings opportunity. The following video shows this chat.

You can further dive into the details you got by asking a follow-up question such as “Can you give me the details of the EC2 instances that are underutilized?” and it will return the details for each of the Amazon Elastic Compute Cloud (Amazon EC2) instances that it found underutilized.

The following are a few additional sample queries to demonstrate the capabilities of this tool:

What is my top services cost in June 2024?
In the past 6 months, how much did I spend on VPC cost?
What is my current savings opportunity?

Clean up
If you decide to discontinue using the FinOps application, you can follow these steps to remove it, its associated resources deployed using AWS CloudFormation, and the Amplify deployment:

Delete the CloudFormation stack:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Locate the stack you created during the deployment process (you assigned a name to it).
Select the stack and choose Delete.

Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Considerations
For optimal visibility across your organization, deploy this solution in your AWS payer account to access cost details for your linked accounts through Cost Explorer.
Trusted Advisor cost optimization visibility is limited to the account where you deploy this solution. To expand its scope, enable Trusted Advisor at the AWS organization level and modify this solution accordingly.
Before deploying to production, enhance security by implementing additional safeguards. You can do this by associating guardrails with your agent in Amazon Bedrock.
Conclusion
The integration of the multi-agent capability of Amazon Bedrock with Amazon Nova demonstrates the transformative potential of AI in AWS cost management. Our FinOps agent solution showcases how specialized AI agents can work together to deliver comprehensive cost analysis, forecasting, and optimization recommendations in a secure and user-friendly environment. This implementation not only addresses immediate cost management challenges, but also adapts to evolving cloud financial operations. As AI technologies advance, this approach sets a foundation for more intelligent and proactive cloud management strategies across various business operations.
Additional resources
To learn more about Amazon Bedrock, refer to the following resources:

Introducing multi-agent collaboration capability for Amazon Bedrock
Unlocking complex problem-solving with multi-agent collaboration on Amazon Bedrock
Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance

About the Author
Salman Ahmed is a Senior Technical Account Manager in AWS Enterprise Support. He specializes in guiding customers through the design, implementation, and support of AWS solutions. Combining his networking expertise with a drive to explore new technologies, he helps organizations successfully navigate their cloud journey. Outside of work, he enjoys photography, traveling, and watching his favorite sports teams.
Ravi Kumar is a Senior Technical Account Manager in AWS Enterprise Support who helps customers in the travel and hospitality industry to streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience. In his free time, Ravi enjoys creative activities like painting. He also likes playing cricket and traveling to new places.
Sergio Barraza is a Senior Technical Account Manager at AWS, helping customers on designing and optimizing cloud solutions. With more than 25 years in software development, he guides customers through AWS services adoption. Outside work, Sergio is a multi-instrument musician playing guitar, piano, and drums, and he also practices Wing Chun Kung Fu.
Ankush Goyal is an Enterprise Support Lead in AWS Enterprise Support who helps customers streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience.

Stream ingest data from Kafka to Amazon Bedrock Knowledge Bases using …

Retrieval Augmented Generation (RAG) enhances AI responses by combining the generative AI model’s capabilities with information from external data sources, rather than relying solely on the model’s built-in knowledge. In this post, we showcase the custom data connector capability in Amazon Bedrock Knowledge Bases that makes it straightforward to build RAG workflows with custom input data. Through this capability, Amazon Bedrock Knowledge Bases supports the ingestion of streaming data, which means developers can add, update, or delete data in their knowledge base through direct API calls.
Consider examples such as clickstream data, credit card swipes, Internet of Things (IoT) sensor data, log analysis, and commodity prices, where both current data and historical trends are needed to make an informed decision. Previously, to feed such critical data inputs, you had to first stage them in a supported data source and then either initiate or schedule a data sync job. Based on the quality and quantity of the data, the time to complete this process varied. With custom data connectors, you can quickly ingest specific documents from custom data sources without requiring a full sync and ingest streaming data without the need for intermediary storage. By avoiding time-consuming full syncs and storage steps, you gain faster access to data, reduced latency, and improved application performance.
With streaming ingestion using custom connectors, Amazon Bedrock Knowledge Bases processes such streaming data without using an intermediary data source, making it available almost immediately. This feature chunks and converts input data into embeddings using your chosen Amazon Bedrock model and stores everything in the backend vector database. This automation applies to both newly created and existing databases, streamlining your workflow so you can focus on building AI applications without worrying about orchestrating data chunking, embeddings generation, or vector store provisioning and indexing. Additionally, this feature provides the ability to ingest specific documents from custom data sources, all while reducing latency and alleviating operational costs for intermediary storage.
Amazon Bedrock
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and RAG, and build agents that execute tasks using your enterprise systems and data sources.
Amazon Bedrock Knowledge Bases
Amazon Bedrock Knowledge Bases allows organizations to build fully managed RAG pipelines by augmenting contextual information from private data sources to deliver more relevant, accurate, and customized responses. With Amazon Bedrock Knowledge Bases, you can build applications that are enriched by the context received from querying a knowledge base. It enables faster time to release by abstracting away the heavy lifting of building pipelines and providing you an out-of-the-box RAG solution, thus reducing the build time for your application.
Amazon Bedrock Knowledge Bases custom connector
Amazon Bedrock Knowledge Bases supports custom connectors and the ingestion of streaming data, which means you can add, update, or delete data in your knowledge base through direct API calls.
Solution overview: Build a generative AI stock price analyzer with RAG
For this post, we implement a RAG architecture with Amazon Bedrock Knowledge Bases using a custom connector and topics built with Amazon Managed Streaming for Apache Kafka (Amazon MSK) for a user interested in understanding stock price trends. Amazon MSK is a streaming data service that manages Apache Kafka infrastructure and operations, making it straightforward to run Apache Kafka applications on Amazon Web Services (AWS). The solution enables real-time analysis of streaming stock price data through vector embeddings and large language models (LLMs).
The following architecture diagram has two components:
Preprocessing streaming data workflow noted in letters on the top of the diagram:

Mimicking streaming input, upload a .csv file with stock price data into MSK topic
Automatically trigger the consumer AWS Lambda function
Ingest consumed data into a knowledge base
The knowledge base internally transforms the data into a vector index using the embeddings model
The knowledge base internally stores the vector index in the vector database

Runtime execution during user queries noted in numerals at the bottom of the diagram:

Users query stock prices
The foundation model uses the knowledge base to search for an answer
The knowledge base returns the relevant documents
The user receives a relevant answer

Implementation design
The implementation follows these high-level steps:

Data source setup – Configure an MSK topic that streams input stock prices
Amazon Bedrock Knowledge Bases setup – Create a knowledge base in Amazon Bedrock using the quick create a new vector store option, which automatically provisions and sets up the vector store
Data consumption and ingestion – As and when data lands in the MSK topic, trigger a Lambda function that extracts stock indices, prices, and timestamp information and feeds into the custom connector for Amazon Bedrock Knowledge Bases
Test the knowledge base – Query the knowledge base to validate the stock price analysis

Solution walkthrough
To build a generative AI stock analysis tool with Amazon Bedrock Knowledge Bases custom connector, use instructions in the following sections.
Configure the architecture
To try this architecture, deploy the AWS CloudFormation template from this GitHub repository in your AWS account. This template deploys the following components:

Functional virtual private clouds (VPCs), subnets, security groups and AWS Identity and Access Management (IAM) roles
An MSK cluster hosting Apache Kafka input topic
A Lambda function to consume Apache Kafka topic data
An Amazon SageMaker Studio notebook for granular setup and enablement

Create an Apache Kafka topic
In the precreated MSK cluster, the required brokers are deployed ready for use. The next step is to use a SageMaker Studio terminal instance to connect to the MSK cluster and create the test stream topic. In this step, you follow the detailed instructions that are mentioned at Create a topic in the Amazon MSK cluster. The following are the general steps involved:

Download and install the latest Apache Kafka client
Connect to the MSK cluster broker instance
Create the test stream topic on the broker instance

Create a knowledge base in Amazon Bedrock
To create a knowledge base in Amazon Bedrock, follow these steps:

On the Amazon Bedrock console, in the left navigation page under Builder tools, choose Knowledge Bases.

To initiate knowledge base creation, on the Create dropdown menu, choose Knowledge Base with vector store, as shown in the following screenshot.

In the Provide Knowledge Base details pane, enter BedrockStreamIngestKnowledgeBase as the Knowledge Base name.
Under IAM permissions, choose the default option, Create and use a new service role, and (optional) provide a Service role name, as shown in the following screenshot.

On the Choose data source pane, select Custom as the data source where your dataset is stored
Choose Next, as shown in the following screenshot

On the Configure data source pane, enter BedrockStreamIngestKBCustomDS as the Data source name.
Under Parsing strategy, select Amazon Bedrock default parser and for Chunking strategy, choose Default chunking. Choose Next, as shown in the following screenshot.

On the Select embeddings model and configure vector store pane, for Embeddings model, choose Titan Text Embeddings v2. For Embeddings type, choose Floating-point vector embeddings. For Vector dimensions, select 1024, as shown in the following screenshot. Make sure you have requested and received access to the chosen FM in Amazon Bedrock. To learn more, refer to Add or remove access to Amazon Bedrock foundation models.

On the Vector database pane, select Quick create a new vector store and choose the new Amazon OpenSearch Serverless option as the vector store.

On the next screen, review your selections. To finalize the setup, choose Create.
Within a few minutes, the console will display your newly created knowledge base.

Configure AWS Lambda Apache Kafka consumer
Now, using API calls, you configure the consumer Lambda function so it gets triggered as soon as the input Apache Kafka topic receives data.

Configure the manually created Amazon Bedrock Knowledge Base ID and its custom Data Source ID as environment variables within the Lambda function. When you use the sample notebook, the referred function names and IDs will be filled in automatically.

response = lambda_client.update_function_configuration(
    FunctionName=<Consumer Lambda Function Name>,
    Environment={
        'Variables': {
            'KBID': <Knowledge Base ID>,
            'DSID': <Data Source ID>
        }
    }
)

When it’s completed, you tie the Lambda consumer function to listen for events in the source Apache Kafka topic:

response = lambda_client.create_event_source_mapping(
    EventSourceArn=<MSK Cluster's ARN>,
    FunctionName=<Consumer Lambda Function Name>,
    StartingPosition='LATEST',
    Enabled=True,
    Topics=['streamtopic']
)

Review AWS Lambda Apache Kafka consumer
The Apache Kafka consumer Lambda function reads data from the Apache Kafka topic, decodes it, extracts stock price information, and ingests it into the Amazon Bedrock knowledge base using the custom connector.

Extract the knowledge base ID and the data source ID:

kb_id = os.environ['KBID']
ds_id = os.environ['DSID']

Define a Python function to decode input events:

def decode_payload(event_data):
    agg_data_bytes = base64.b64decode(event_data)
    decoded_data = agg_data_bytes.decode(encoding="utf-8")
    event_payload = json.loads(decoded_data)
    return event_payload

Decode and parse the required data from the input event received from the Apache Kafka topic. Using these values, create a payload to be ingested into the knowledge base:

records = event['records']['streamtopic-0']
for rec in records:
    # Each record has a separate eventID, etc.
    event_payload = decode_payload(rec['value'])
    ticker = event_payload['ticker']
    price = event_payload['price']
    timestamp = event_payload['timestamp']
    myuuid = uuid.uuid4()
    payload_ts = datetime.utcfromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
    payload_string = "At " + payload_ts + " the price of " + ticker + " is " + str(price) + "."

Ingest the payload into Amazon Bedrock Knowledge Bases using the custom connector:

response = bedrock_agent_client.ingest_knowledge_base_documents(
    knowledgeBaseId=kb_id,
    dataSourceId=ds_id,
    documents=[
        {
            'content': {
                'custom': {
                    'customDocumentIdentifier': {
                        'id': str(myuuid)
                    },
                    'inlineContent': {
                        'textContent': {
                            'data': payload_string
                        },
                        'type': 'TEXT'
                    },
                    'sourceType': 'IN_LINE'
                },
                'dataSourceType': 'CUSTOM'
            }
        }
    ]
)

Testing
Now that the required setup is done, you trigger the workflow by ingesting test data into your Apache Kafka topic hosted with the MSK cluster. For best results, repeat this section by changing the .csv input file to show stock price increase or decrease.

Prepare the test data. In my case, I had the following data input as a .csv file with a header.

ticker    price
OOOO      $44.50
ZVZZT     $3,413.23
ZNTRX     $22.34
ZNRXX     $208.76
NTEST     $0.45
ZBZX      $36.23
ZEXIT     $942.34
ZIEXT     $870.23
ZTEST     $23.75
ZVV       $2,802.86
ZXIET     $63.00
ZAZZT     $18.86
ZBZZT     $998.26
ZCZZT     $72.34
ZVZZC     $90.32
ZWZZT     $698.24
ZXZZT     $932.32

Define a Python function to put data to the topic, using the pykafka client to produce the records:

def put_to_topic(kafka_host, topic_name, ticker, amount, timestamp):
    client = KafkaClient(hosts=kafka_host)
    topic = client.topics[topic_name]
    payload = {
        'ticker': ticker,
        'price': amount,
        'timestamp': timestamp
    }
    ret_status = True
    data = json.dumps(payload)
    encoded_message = data.encode("utf-8")
    print(f'Sending ticker data: {ticker}...')
    with topic.get_sync_producer() as producer:
        result = producer.produce(encoded_message)
    return ret_status

Read the .csv file and push the records to the topic:

df = pd.read_csv('TestData.csv')
start_test_time = time.time()
print(datetime.utcfromtimestamp(start_test_time).strftime('%Y-%m-%d %H:%M:%S'))
df = df.reset_index()
for index, row in df.iterrows():
    put_to_topic(BootstrapBrokerString, KafkaTopic, row['ticker'], row['price'], time.time())
end_test_time = time.time()
print(datetime.utcfromtimestamp(end_test_time).strftime('%Y-%m-%d %H:%M:%S'))

Verification
If the data ingestion and subsequent processing is successful, navigate to the Amazon Bedrock Knowledge Bases data source page to check the uploaded information.

Querying the knowledge base
Within the Amazon Bedrock Knowledge Bases console, you have access to query the ingested data immediately, as shown in the following screenshot.

To do that, select an Amazon Bedrock FM that you have access to. In my case, I chose Amazon Nova Lite 1.0, as shown in the following screenshot.

When it’s completed, the question, “How is ZVZZT trending?”, yields the results based on the ingested data. Note how Amazon Bedrock Knowledge Bases shows how it derived the answer, even pointing to the granular data element from its source.
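You can also run the same query programmatically against the knowledge base; the sketch below uses the Bedrock Agents runtime retrieve_and_generate API, with the knowledge base ID and model ARN left as placeholders for your own values.

# Query the knowledge base outside the console. Replace the placeholders with
# your knowledge base ID and the ARN of the FM you enabled (for example, Amazon Nova Lite).
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.retrieve_and_generate(
    input={"text": "How is ZVZZT trending?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<Knowledge Base ID>",
            "modelArn": "<Model ARN for your chosen FM>",
        },
    },
)
print(response["output"]["text"])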

Cleanup
To make sure you’re not paying for resources, delete and clean up the resources created.

Delete the Amazon Bedrock knowledge base.
Delete the automatically created Amazon OpenSearch Serverless cluster.
Delete the automatically created Amazon Elastic File System (Amazon EFS) shares backing the SageMaker Studio environment.
Delete the automatically created security groups associated with the Amazon EFS share. You might need to remove the inbound and outbound rules before they can be deleted.
Delete the automatically created elastic network interfaces attached to the Amazon MSK security group for Lambda traffic.
Delete the automatically created Amazon Bedrock Knowledge Bases execution IAM role.
Stop the kernel instances with Amazon SageMaker Studio.
Delete the CloudFormation stack.

Conclusion
In this post, we showed you how Amazon Bedrock Knowledge Bases supports custom connectors and the ingestion of streaming data, through which developers can add, update, or delete data in their knowledge base through direct API calls. Amazon Bedrock Knowledge Bases offers fully managed, end-to-end RAG workflows to create highly accurate, low-latency, secure, and custom generative AI applications by incorporating contextual information from your company’s data sources. With this capability, you can quickly ingest specific documents from custom data sources without requiring a full sync, and ingest streaming data without the need for intermediary storage.
Send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS contacts, and engage with the generative AI builder community at community.aws.

About the Author
Prabhakar Chandrasekaran is a Senior Technical Account Manager with AWS Enterprise Support. Prabhakar enjoys helping customers build cutting-edge AI/ML solutions on the cloud. He also works with enterprise customers providing proactive guidance and operational assistance, helping them improve the value of their solutions when using AWS. Prabhakar holds eight AWS and seven other professional certifications. With over 22 years of professional experience, Prabhakar was a data engineer and a program leader in the financial services space prior to joining AWS.

Add Zoom as a data accessor to your Amazon Q index

For many organizations, vast amounts of enterprise knowledge are scattered across diverse data sources and applications. Organizations across industries seek to use this cross-application enterprise data from within their preferred systems while adhering to their established security and governance standards.
This post demonstrates how Zoom users can access their Amazon Q Business enterprise data directly within their Zoom interface, alleviating the need to switch between applications while maintaining enterprise security boundaries. Organizations can now configure Zoom as a data accessor in Amazon Q Business, enabling seamless integration between their Amazon Q index and Zoom AI Companion. This integration allows users to access their enterprise knowledge in a controlled manner directly within the Zoom platform.
How Amazon Q Business and Zoom AI Companion work together
The Amazon Q Business data accessor is a core component within Amazon Q Business. It manages and controls access to data stored in an enterprise’s internal knowledge repositories on Amazon Q Business from an external independent software vendor (ISV) such as Zoom while maintaining security and data access compliance. This feature allows Zoom to retrieve relevant content, enhancing the Zoom AI Companion’s knowledge. It serves as an intermediary that enforces access control lists (ACLs), defining both data source permissions and user access rights to the existing Amazon Q Business index.
Zoom AI Companion, the foundation of Zoom’s AI-first work platform, enhances human connection by working behind the scenes to boost productivity, improve work quality, and strengthen relationships. This April, Zoom launched the Custom AI Companion add-on, enabling organizations to customize AI agents and skills to help meet their specific needs and drive company-wide efficiency. Through its partnership with Amazon Q Business, customers can now connect their indexed data in Amazon Q index to Zoom AI Companion, providing enhanced knowledge and contextual insights.
As an Amazon Q Business data accessor, Zoom AI Companion can interact with the enterprise Amazon Q index in a managed way, enriching content beyond what’s available in Zoom alone. Enterprise users can retrieve contextual information from their Amazon Q index’s multiple connected data sources directly within Zoom, with results seamlessly presented through Zoom AI Companion. Zoom AI Companion can access Amazon Q index data with its native data sources, such as previous call transcripts, to quickly surface relevant information to users. This integration alleviates the need to manually switch between various enterprise systems like Google Drive, Confluence, Salesforce, and more, saving time and reducing workflow disruptions.
For example, while preparing for a Zoom call, users can quickly find answers to questions like “When is customer AnyCustomer’s contract up for renewal, and who signed the last one?” The Amazon Q index processes these queries and delivers results through Zoom AI Companion in real time.
Solution overview
The following diagram is a high-level architecture that explains how enterprises can set up and access Amazon Q Business indexed data from within the Zoom AI Companion application.

In the following sections, we demonstrate how to configure Zoom as a data accessor and get started using Zoom AI Companion.
Prerequisites
To implement this solution, you need an AWS account with appropriate permissions.
Create an Amazon Q Business application
To access indexed data from Amazon Q Business through Zoom AI Companion, organizations must first set up their Amazon Q Business application. The application must be configured with AWS IAM Identity Center to enable the Zoom data accessor functionality. For detailed guidance on creating an Amazon Q Business application, refer to Configure application.
Configure access control with IAM Identity Center
Through IAM Identity Center, Amazon Q Business uses trusted identity propagation to provide proper authentication and fine-grained authorization based on user ID and group-based resources, making sure access to sensitive data is tightly controlled and document ACLs are enforced. The ISV is only permitted to access this index using the assigned data accessor.
If you’re using an identity provider (IdP) such as Okta, CyberArk, or others, you can add the IdP to IAM Identity Center as a trusted token issuer. For additional information, see Configure Amazon Q Business with AWS IAM Identity Center trusted identity propagation.
For more information on IAM Identity Center, refer to IAM Identity Center identity source tutorials.
Add Zoom as a data accessor
After creating an Amazon Q Business application with IAM Identity Center, administrators can configure Zoom as a data accessor through the Amazon Q Business console. Complete the following steps:

 On the Amazon Q Business console, choose Data accessors in the navigation pane.
Choose Add data accessor.
Choose Zoom as your data accessor.
For Accessor name, enter a name for your data accessor.
For Data source access, configure your level of access.

You can select specific data sources to be available through the data accessor. This allows you to control which content is surfaced in the ISV environment. You can use Amazon Q Business pre-built connectors to synchronize content from various systems. For more information, refer to Supported connectors.

For User access, specify which users can access the Amazon Q index through the data accessor.

This option enables you to configure granular permissions for data accessor accessibility and manage organizational access controls.
For more information about data access, refer to Accessing a customer’s Amazon Q index as a data accessor using cross-account access.

Administrators can modify data accessor settings at any time after implementation. You can adjust user access permissions, update available data sources, and change the scope of accessibility. To revoke access, complete the following steps:

On the Amazon Q Business console, choose Data accessors in the navigation pane.
Locate the accessor you want to delete and choose Delete.
Confirm the deletion when prompted.

Removing a data accessor from a data source immediately cancels the ISV’s access to your organization’s Amazon Q index.
Configure Amazon Q for Zoom AI Companion
To start using Zoom as a data accessor for your Amazon Q Business index, the following information from your enterprise Amazon Q Business application must be shared with Zoom:

Amazon Q Business application ID
Amazon Q Business AWS Region
Amazon Q Business retriever ID
Data accessor application Amazon Resource Name (ARN)
IAM Identity Center instance Region

For more information, refer to Accessing a customer’s Amazon Q index as a data accessor using cross-account access.
After you add Zoom as a data accessor, a pop-up window will appear on the Amazon Q Business console. This pop-up contains the required parameters, as shown in the following screenshot.

Navigate to the Zoom App Marketplace to configure Amazon Q in Zoom, and enter the information you collected.

After you submit this information, you’re ready to access Amazon Q index data from Zoom AI Companion.
With AI Companion connected to Amazon Q index, you have the information you need instantly. For example, you could make AI Companion aware of your organization’s IT troubleshooting guides so employees could quickly get help with questions like “How do I fix a broken keyboard?”

Using the SearchRelevantContent API
When an enterprise customer with an Amazon Q index enables a data accessor, it allows authenticated Amazon Q Business users to search and retrieve relevant content in real time while using external ISV platforms (like Zoom). This functionality is achieved through the ISV calling the Amazon Q index SearchRelevantContent API as an external data accessor across accounts. The SearchRelevantContent API is specifically designed to return search results from the Amazon Q index, which can be further enhanced by the ISV’s generative AI stack. By using the Amazon Q index SearchRelevantContent API, Zoom and other ISVs can integrate query results directly into their environment.
The SearchRelevantContent API is an identity-aware API, which means it operates with knowledge of the user’s identity and associated information (such as email and group membership) through the credentials used to call the API. This identity awareness is a prerequisite for using the API. When querying the index, it reconciles document access controls against the authenticated user’s permissions. As a result, users can only retrieve results from content they are authorized to access.
When an ISV calls the SearchRelevantContent API as a data accessor, both sparse and dense searches are applied to the Amazon Q index, combining keyword search and vector embedding proximity. Results are ranked before being returned to the ISV interface.
For example, if you ask in Zoom, “What is Company XYZ’s engagement on the cancer moonshot project?”, Zoom AI Companion triggers a call to the SearchRelevantContent API as a data accessor.
For a more comprehensive code example, see the notebook in Module 2 – Amazon Q cross-app index.
The following is a code snippet in Python showing what that search request might look like:

search_params = {
    'applicationId': Q_BIZ_APP_ID,
    'contentSource': {
        'retriever': {
            'retrieverId': Q_RETRIEVER_ID
        }
    },
    'queryText': 'What is Company XYZ engagement on the cancer moonshot project?',
    'maxResults': 10
}

search_response = qbiz.search_relevant_content(**search_params)

The search response will contain an array of results with relevant chunks of text, along with source information, document attributes, and confidence scores. The following is a snippet from the SearchRelevantContent API response. This is an example of results you might see from the web crawler data connector used with Amazon Q Business.

[
    {
        "content": "\nSeveral initiatives have been launched or will soon launch to address the goals of this next phase, including:\nIncluding more people in expanded and modernized cancer clinical trials\nIncreasing the pipeline of new cancer drugs\nEnsuring access to current and new standards of cancer care\nEnhancing diversity in the cancer research workforce",
        "documentId": "Cancermoonshot",
        "documentTitle": "About The Cancer Moonshot",
        "documentUri": "https://companyxyz/cancermoonshot.html",
        "documentAttributes": [
            {
                "name": "_source_uri",
                "value": {
                    "stringValue": "https://companyxyz.com/cancermoonshot.html"
                }
            }
        ],
        "scoreAttributes": {
            "scoreConfidence": "VERY_HIGH"
        }
    },
    ...
]
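
The following is a minimal sketch of how an ISV application might consume these results, assuming the list shown above is returned under a relevantContent key in the response (that key name and the confidence filter are illustrative choices for this sketch):

# Keep only high-confidence chunks from the SearchRelevantContent response.
# Assumes the result list is returned under 'relevantContent' as in the sample above.
high_confidence_chunks = []
for result in search_response.get('relevantContent', []):
    if result['scoreAttributes']['scoreConfidence'] in ('VERY_HIGH', 'HIGH'):
        high_confidence_chunks.append({
            'title': result['documentTitle'],
            'uri': result['documentUri'],
            'text': result['content'],
        })

# The ISV's generative AI stack can then ground its answer on these chunks.
for chunk in high_confidence_chunks:
    print(f"{chunk['title']} ({chunk['uri']}): {chunk['text'][:120]}...")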

The SearchRelevantContent API offers a rich set of optional parameters that ISVs can choose to use. For example, document attributes can be used as filters. If documents with meta attributes have been indexed, and one of these attributes contains the author, an ISV could apply a filter that specifies an author name. In the following example, results returned are constrained to only documents that have the specified attribute author name “John Smith.”

search_params = {
    'applicationId': Q_BIZ_APP_ID,
    'contentSource': {
        'retriever': {
            'retrieverId': Q_RETRIEVER_ID
        }
    },
    'queryText': myQuestion,
    'maxResults': 5,
    'attributeFilter': {
        'equalsTo': {
            'name': 'Author',
            'value': {
                'stringValue': 'John Smith'
            }
        }
    }
}

For a more comprehensive reference on what is available in the SearchRelevantContent API request object, refer to search_relevant_content.
Clean up
When you’re done using this solution, clean up the resources you created.

Delete the Zoom data accessor from the Data accessors console. Deleting this data accessor will delete permissions and access to the data accessor for all users.
Delete the Amazon Q Business application that you created as a prerequisite.

Navigate to the Amazon Q Business console.
Choose Applications on the left menu.
Select the application you created.
Choose Delete from under Actions to delete the application.

Deleting the Amazon Q Business application will remove the associated index and data source connectors, and prevent incurring additional costs.
Conclusion
An Amazon Q index offers a transformative approach to workplace efficiency. By creating a centralized, secure repository for your organization’s data, you can seamlessly integrate vital information with your everyday productivity tools like Zoom AI Companion.
In this post, we explored how Amazon Q Business enterprise users can add data accessors to integrate with external parties like Zoom AI Companion, allowing users to access their enterprise knowledge in a managed way directly from within those platforms.
Ready to supercharge your workforce’s productivity? Start your Amazon Q Business journey today alongside Zoom. To learn more about Amazon Q Business data accessors, see Enhance enterprise productivity for your LLM solution by becoming an Amazon Q Business data accessor.

About the authors
David Girling is a Senior AI/ML Solutions Architect with over 20 years of experience in designing, leading, and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate, and utilize these highly capable services with their data for their use cases.
Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS, with a core focus on generative AI. She helps Independent Software Vendors (ISVs) accelerate the adoption of generative AI by designing scalable and impactful solutions. With a strong background in applied mathematics and machine learning, she specializes in intelligent document processing and AI-driven innovation. Outside of work, she enjoys salsa and bachata dancing.
Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Build a computer vision-based asset inventory application with low or no training

Keeping an up-to-date asset inventory with real devices deployed in the field can be a challenging and time-consuming task. Many electricity providers use manufacturer’s labels as key information to link their physical assets within asset inventory systems. Computer vision can be a viable solution to speed up operator inspections and reduce human errors by automatically extracting relevant data from the label. However, building a standard computer vision application capable of managing hundreds of different types of labels can be a complex and time-consuming endeavor.
In this post, we present a solution using generative AI and large language models (LLMs) to alleviate the time-consuming and labor-intensive tasks required to build a computer vision application, enabling you to immediately start taking pictures of your asset labels and extract the necessary information to update the inventory using AWS services like AWS Lambda, Amazon Bedrock, Amazon Titan, Anthropic’s Claude 3 on Amazon Bedrock, Amazon API Gateway, AWS Amplify, Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.
LLMs are large deep learning models that are pre-trained on vast amounts of data. They are capable of understanding and generating human-like text, making them incredibly versatile tools with a wide range of applications. This approach harnesses the image understanding capabilities of Anthropic’s Claude 3 model to extract information directly from photographs taken on-site, by analyzing the labels present in those field images.
Solution overview
The AI-powered asset inventory labeling solution aims to streamline the process of updating inventory databases by automatically extracting relevant information from asset labels through computer vision and generative AI capabilities. The solution uses various AWS services to create an end-to-end system that enables field technicians to capture label images, extract data using AI models, verify the accuracy, and seamlessly update the inventory database.
The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The process starts when an operator takes and uploads a picture of the assets using the mobile app.
The operator submits a request to extract data from the asset image.
A Lambda function retrieves the uploaded asset image from the uploaded images data store.
The function generates the asset image embeddings (vector representations of data) invoking the Amazon Titan Multimodal Embeddings G1 model.
The function performs a similarity search in the knowledge base to retrieve similar asset labels. The most relevant results will augment the prompt as similar examples to improve the response accuracy, and are sent with the instructions to the LLM to extract data from the asset image.
The function invokes Anthropic’s Claude 3 Sonnet on Amazon Bedrock to extract data (serial number, vendor name, and so on) using the augmented prompt and the related instructions.
The function sends the response to the mobile app with the extracted data.
The mobile app verifies the extracted data and assigns a confidence level. It invokes the API to process the data. Data with high confidence will be directly ingested into the system.
A Lambda function is invoked to update the asset inventory database with the extracted data if the confidence level has been indicated as high by the mobile app.
The function sends data with low confidence to Amazon Augmented AI (Amazon A2I) for further processing.
The human reviewers from Amazon A2I validate or correct the low-confidence data.
Human reviewers, such as subject matter experts, validate the extracted data, flag it, and store it in an S3 bucket.
A rule in Amazon EventBridge is defined to trigger a Lambda function to get the information from the S3 bucket when the Amazon A2I workflow processing is complete.
A Lambda function processes the output of the Amazon A2I workflow by loading data from the JSON file that stored the backend operator-validated information.
The function updates the asset inventory database with the new extracted data.
The function sends the extracted data marked as new by human reviewers to an Amazon Simple Queue Service (Amazon SQS) queue to be further processed.
Another Lambda function fetches messages from the queue and serializes the updates to the knowledge base database.
The function generates the asset image embeddings by invoking the Amazon Titan Multimodal Embeddings G1 model.
The function updates the knowledge base with the generated embeddings and notifies other functions that the database has been updated.

Let’s look at the key components of the solution in more detail.
Mobile app
The mobile app component plays a crucial role in this AI-powered asset inventory labeling solution. It serves as the primary interface for field technicians on their tablets or mobile devices to capture and upload images of asset labels using the device’s camera. The implementation of the mobile app includes an authentication mechanism that will allow access only to authenticated users. It’s also built using a serverless approach to minimize recurring costs and have a highly scalable and robust solution.
The mobile app has been built using the following services:

AWS Amplify – This provides a development framework and hosting for the static content of the mobile app. By using Amplify, the mobile app component benefits from features like seamless integration with other AWS services, offline capabilities, secure authentication, and scalable hosting.
Amazon Cognito – This handles user authentication and authorization for the mobile app.

AI data extraction service
The AI data extraction service is designed to extract critical information, such as manufacturer name, model number, and serial number from images of asset labels.
To enhance the accuracy and efficiency of the data extraction process, the service employs a knowledge base comprising sample label images and their corresponding data fields. This knowledge base serves as a reference guide for the AI model, enabling it to learn and generalize from labeled examples to new label formats effectively. The knowledge base is stored as vector embeddings in Meta’s FAISS (Facebook AI Similarity Search), a high-performance vector index whose files are kept in Amazon S3.
Embeddings are dense numerical representations that capture the essence of complex data like text or images in a vector space. Each data point is mapped to a vector or ordered list of numbers, where similar data points are positioned closer together. This embedding space allows for efficient similarity calculations by measuring the distance between vectors. Embeddings enable machine learning (ML) models to effectively process and understand relationships within complex data, leading to improved performance on various tasks like natural language processing and computer vision.
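
To make this concrete, the following is a minimal sketch of generating an image embedding with the Amazon Titan Multimodal Embeddings G1 model through Amazon Bedrock and querying a FAISS index for the closest labeled examples; the model ID, local index file, and value of k are assumptions for this sketch rather than the solution’s exact settings.

import base64
import json

import boto3
import faiss
import numpy as np

bedrock = boto3.client('bedrock-runtime')

def embed_image(image_bytes: bytes) -> np.ndarray:
    # Invoke Amazon Titan Multimodal Embeddings G1 (model ID assumed for this sketch).
    body = json.dumps({'inputImage': base64.b64encode(image_bytes).decode('utf-8')})
    response = bedrock.invoke_model(modelId='amazon.titan-embed-image-v1', body=body)
    embedding = json.loads(response['body'].read())['embedding']
    return np.array([embedding], dtype='float32')

# Load a FAISS index previously built from the knowledge base images
# ('kb.index' is a hypothetical local copy downloaded from the S3 bucket).
index = faiss.read_index('kb.index')

with open('asset_label.jpg', 'rb') as f:
    query_vector = embed_image(f.read())

# Retrieve the 3 most similar labeled examples to augment the prompt.
distances, neighbor_ids = index.search(query_vector, 3)
print(neighbor_ids[0], distances[0])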
The following diagram illustrates an example workflow.

The vector embeddings are generated using Amazon Titan, a powerful embedding generation service, which converts the labeled examples into numerical representations suitable for efficient similarity searches. The workflow consists of the following steps:

When a new asset label image is submitted for processing, the AI data extraction service, through a Lambda function, retrieves the uploaded image from the bucket where it was uploaded.
The Lambda function performs a similarity search using Meta’s FAISS vector search engine. This search compares the new image against the vector embeddings in the knowledge base generated by Amazon Titan Multimodal Embeddings invoked through Amazon Bedrock, identifying the most relevant labeled examples.
Using the augmented prompt with context information from the similarity search, the Lambda function invokes Amazon Bedrock, specifically Anthropic’s Claude 3, a state-of-the-art generative AI model, for image understanding and optical character recognition (OCR) tasks. By using the similar examples, the AI model can more accurately extract and interpret the critical information from the new asset label image.
The response is then sent to the mobile app to be confirmed by the field technician.

In this phase, the AWS services used are:

Amazon Bedrock – A fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities.
AWS Lambda – A serverless computing service that allows you to run your code without the need to provision or manage physical servers or virtual machines. A Lambda function runs the data extraction logic and orchestrates the overall data extraction process.
Amazon S3 – A storage service offering industry-leading durability, availability, performance, security, and virtually unlimited scalability at low costs. It’s used to store the asset images uploaded by the field technicians.

Data verification
Data verification plays a crucial role in maintaining the accuracy and reliability of the extracted data before updating the asset inventory database and is included in the mobile app.
The workflow consists of the following steps:

The extracted data is shown to the field operator.
If the field operator determines that the extracted data is accurate and matches an existing asset label in the knowledge base, they can confirm the correctness of the extraction; if not, they can update the values directly using the app.
When the field technician confirms the data is correct, that information is automatically forwarded to the backend review component.

Data verification uses the following AWS services:

Amazon API Gateway – A secure and scalable API gateway that exposes the data verification component’s functionality to the mobile app and other components.
AWS Lambda – Serverless functions for implementing the verification logic and routing data based on confidence levels.

Backend review
This component compares the data automatically identified by the AI data extraction service with the final data approved by the field operator and computes the difference (a minimal sketch of this check follows the list below). If the difference is below a configured threshold, the data is sent to update the inventory database; otherwise, a human review process is engaged:

Subject matter experts asynchronously review flagged data entries on the Amazon A2I console.
Significant discrepancies are marked to update the generative AI’s knowledge base.
Minor OCR errors are corrected without updating the AI model’s knowledge base.
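
The following is a minimal sketch of such a discrepancy check, using a simple normalized string-similarity metric over the extracted fields; the metric and threshold are illustrative, and the field names follow the extraction prompt shown later in this post.

from difflib import SequenceMatcher

REVIEW_THRESHOLD = 0.9  # assumed threshold for this sketch

def field_similarity(extracted: str, corrected: str) -> float:
    # A ratio close to 1.0 means the operator barely changed the value.
    return SequenceMatcher(None, extracted or '', corrected or '').ratio()

def needs_human_review(extracted_record: dict, corrected_record: dict) -> bool:
    fields = ('Model', 'SerialN', 'ZOD')  # field names follow the extraction prompt in this post
    scores = [field_similarity(extracted_record.get(f, ''), corrected_record.get(f, '')) for f in fields]
    # Engage the Amazon A2I human review only when average similarity drops below the threshold.
    return sum(scores) / len(scores) < REVIEW_THRESHOLD

extracted = {'Model': 'ABC-100', 'SerialN': 'SN12345', 'ZOD': 'Z01'}   # example AI output
corrected = {'Model': 'ABC-100', 'SerialN': 'SN12845', 'ZOD': 'Z01'}   # example operator correction
print(needs_human_review(extracted, corrected))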

The backend review component uses the following AWS services:

Amazon A2I – A service that provides a web-based interface for human reviewers to inspect and correct the extracted data and asset label images.
Amazon EventBridge – A serverless service that uses events to connect application components together. When the Amazon A2I human workflow is complete, EventBridge is used to detect this event and trigger a Lambda function to process the output data.
Amazon S3 – Object storage used to save the information reviewed and flagged by the Amazon A2I workflow.

Inventory database
The inventory database component plays a crucial role in storing and managing the verified asset data in a scalable and efficient manner. Amazon DynamoDB, a fully managed NoSQL database service from AWS, is used for this purpose. DynamoDB is a serverless, scalable, and highly available key-value and document database service. It’s designed to handle massive amounts of data and high traffic workloads, making it well-suited for storing and retrieving large-scale inventory data.
The verified data from the AI extraction and human verification processes is ingested into the DynamoDB table. This includes data with high confidence from the initial extraction, as well as data that has been reviewed and corrected by human reviewers.
Knowledge base update
The knowledge base update component enables continuous improvement and adaptation of the generative AI models used for asset label data extraction:

During the backend review process, human reviewers from Amazon A2I validate and correct the data extracted from asset labels by the AI model.
The corrected and verified data, along with the corresponding asset label images, is marked as new label examples if not already present in the knowledge base.
A Lambda function is triggered to update the asset inventory and send the new labels to the FIFO (First-In-First-Out) queue.
A Lambda function processes the messages in the queue, updating the knowledge base vector store (S3 bucket) with the new label examples.
The update process generates the vector embeddings by invoking the Amazon Titan Multimodal Embeddings G1 model exposed by Amazon Bedrock and storing the embeddings in a Meta’s FAISS database in Amazon S3.

The knowledge base update process makes sure that the solution remains adaptive and continuously improves its performance over time, reducing the likelihood of unseen label examples and the involvement of subject matter experts to correct the extracted data.
This component uses the following AWS services:

Amazon Titan Multimodal Embeddings G1 model – This model generates the embeddings (vector representations) for the new asset images and their associated data.
AWS Lambda – Lambda functions are used to update the asset inventory database, to send and process the extracted data to the FIFO queue, and to update the knowledge base in case of new unseen labels.
Amazon SQS – Amazon SQS offers fully managed message queuing for microservices, distributed systems, and serverless applications. The extracted data marked as new by human reviewers is sent to an SQS FIFO (First-In-First-Out) queue, which preserves the order in which messages are sent and received, so you don’t have to place sequencing information in your messages (see the sketch after this list).
Amazon S3 – The knowledge base is stored in an S3 bucket, with the newly generated embeddings. This allows the AI system to improve its accuracy for future asset label recognition tasks.
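
The following minimal sketch shows how a Lambda function might enqueue a new label example on the FIFO queue; the queue URL and message fields are illustrative assumptions, not the solution’s exact values.

import json
import boto3

sqs = boto3.client('sqs')

# Hypothetical queue URL; in the solution this would come from the Lambda environment.
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/kb-updates.fifo'

new_label = {
    'imageS3Uri': 's3://asset-images-bucket/labels/asset_label.jpg',  # example value
    'Model': 'ABC-100',
    'SerialN': 'SN12345',
    'ZOD': 'Z01',
}

sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps(new_label),
    # FIFO queues require a message group; a single group serializes processing.
    MessageGroupId='knowledge-base-updates',
    # The deduplication ID prevents re-processing if the same update is sent twice.
    MessageDeduplicationId=new_label['SerialN'],
)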

Navigation flow
This section explains how users interact with the system and how data flows between different components of the solution. We’ll examine each key component’s role in the process, from initial user access through data verification and storage.
Mobile app
The end user accesses the mobile app using the browser included in the handheld device. The application URL to access the mobile app is available after you have deployed the frontend application. Using the browser on a handheld device or your PC, browse to the application URL address, where a login window will appear. Because this is a demo environment, you can register on the application by following the automated registration workflow implemented through Amazon Cognito and choosing Create Account, as shown in the following screenshot.

During the registration process, you must provide a valid email address that will be used to verify your identity, and define a password. After you’re registered, you can log in with your credentials.
After authentication is complete, the mobile app appears, as shown in the following screenshot.

The process to use the app is the following:

Use the camera button to capture a label image.
The app uploads the captured image to a private S3 bucket specifically designated for storing asset images. S3 Transfer Acceleration, an Amazon S3 feature, can be enabled on the bucket to improve the transfer speed of uploads and downloads. It works by using AWS edge locations, which are globally distributed and closer to client applications, as intermediaries for data transfer. This reduces latency and improves overall transfer speed, especially for clients that are geographically distant from the S3 bucket’s AWS Region.
After the image is uploaded, the app sends a request to the AI data extraction service, triggering the subsequent process of data extraction and analysis. The extracted data returned by the service is displayed and editable within the form, as described later in this post. This allows for data verification.

AI data extraction service
This module uses Anthropic’s Claude 3 FM, a multimodal system capable of processing both images and text. To extract relevant data, we employ a prompt technique that uses samples to guide the model’s output. Our prompt includes two sample images along with their corresponding extracted text. The model identifies which sample image most closely resembles the one we want to analyze and uses that sample’s extracted text as a reference to determine the relevant information in the target image.
We use the following prompt to achieve this result:

{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "first_sample_image:",
        },
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": first_sample_encoded_image,
            },
        },
        {
            "type": "text",
            "text": "target_image:",
        },
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": encoded_image,
            },
        },
        {
            "type": "text",
            "text": f"""
            answer the question using the following example as reference.
            match exactly the same set of fields and information as in the provided example.

            <example>
            analyze first_sample_image and answer with a json file with the following information: Model, SerialN, ZOD.
            answer only with json.

            Answer:
            {first_sample_answer}
            </example>

            <question>
            analyze target_image and answer with a json file with the following information: Model, SerialN, ZOD.
            answer only with json.

            Answer:
            </question>
            """,
        },
    ],
}

In the preceding code, first_sample_encoded_image and first_sample_answer are the reference image and expected output, respectively, and encoded_image contains the new image that has to be analyzed.
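
For completeness, the following minimal sketch shows how this message could be sent to Anthropic’s Claude 3 Sonnet through Amazon Bedrock, assuming the dictionary above is assigned to a variable named message; the model ID and token limit are assumptions for this sketch.

import json
import boto3

bedrock = boto3.client('bedrock-runtime')

# 'message' is the few-shot prompt dictionary shown above (assumed variable name).
body = json.dumps({
    'anthropic_version': 'bedrock-2023-05-31',
    'max_tokens': 512,
    'messages': [message],
})

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-sonnet-20240229-v1:0',  # assumed model ID
    body=body,
)

# The model is instructed to answer with JSON only (Model, SerialN, ZOD).
extracted_fields = json.loads(response['body'].read())['content'][0]['text']
print(extracted_fields)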
Data verification
After the image is processed by the AI data extraction service, the control goes back to the mobile app:

The mobile app receives the extracted data from the AI data extraction service, which has processed the uploaded asset label image and extracted relevant information using computer vision and ML models.
Upon receiving the extracted data, the mobile app presents it to the field operator, allowing them to review and confirm the accuracy of the information (see the following screenshot). If the extracted data is correct and matches the physical asset label, the technician can submit a confirmation through the app, indicating that the data is valid and ready to be inserted into the asset inventory database.
If the field operator sees any discrepancies or errors in the extracted data compared to the actual asset label, they have the option to correct those values.
The values returned by the AI data extraction service and the final values validated by the field operators are sent to the backend review service.

Backend review
This process is implemented using Amazon A2I:

A distance metric is computed to evaluate the difference between what the data extraction service has identified and the correction performed by the on-site operator.
If the difference is larger than a predefined threshold, the image and the operator modified data are submitted to an Amazon A2I workflow, creating a human-in-the-loop request.
When a backend operator becomes available, the new request is assigned.
The operator uses the web interface provided by Amazon A2I, as depicted in the following screenshot, to check the on-site operator’s corrections. If this type of label is not yet included in the knowledge base, the operator can add it by entering Yes in the Add to Knowledge Base field.

When the A2I process is complete, a Lambda function is triggered.
This Lambda function stores the information in the inventory database and verifies whether this image also needs to be used to update the knowledge base.
If this is the case, the Lambda function files the request with the relevant data in an SQS FIFO queue.

Inventory database
To keep this solution as simple as possible while covering the required capability, we selected DynamoDB as our inventory database. This is a NoSQL database, and we store data in a table with the following information:

The manufacturer, model ID, and serial number, which together form the key of the table
A link to the picture containing the label used during the on-site inspection

DynamoDB offers an on-demand pricing model that allows costs to directly depend on actual database usage.
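
A minimal sketch of the corresponding write, assuming a table named AssetInventory and a composite key built from the manufacturer, model ID, and serial number (table and attribute names are illustrative):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('AssetInventory')  # hypothetical table name

# Store the validated data; the key combines manufacturer, model, and serial number.
table.put_item(
    Item={
        'AssetKey': 'AnyManufacturer#ABC-100#SN12345',   # example composite key
        'Manufacturer': 'AnyManufacturer',
        'Model': 'ABC-100',
        'SerialN': 'SN12345',
        'LabelImageS3Uri': 's3://asset-images-bucket/labels/asset_label.jpg',
    }
)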
Knowledge base database
The knowledge base database is stored as two files in an S3 bucket:

The first file is a JSON array containing the metadata (manufacturer, serial number, model ID, and link to reference image) for each of the knowledge base entries
The second file is a FAISS database containing an index with the embedding for each of the images included in the first file

To be able to minimize race conditions when updating the database, a single Lambda function is configured as the consumer of the SQS queue. The Lambda function extracts the information about the link to the reference image and the metadata, certified by the back-office operator, updates both files, and stores the new version in the S3 bucket.
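
The following minimal sketch shows how that consumer Lambda function might append one entry to both files; the bucket name, object keys, and use of /tmp for scratch space are assumptions for this sketch.

import json

import boto3
import faiss
import numpy as np

s3 = boto3.client('s3')
BUCKET = 'asset-knowledge-base'  # hypothetical bucket name

def add_to_knowledge_base(metadata: dict, embedding: np.ndarray) -> None:
    # Download the current metadata array and FAISS index from Amazon S3.
    s3.download_file(BUCKET, 'kb_metadata.json', '/tmp/kb_metadata.json')
    s3.download_file(BUCKET, 'kb.index', '/tmp/kb.index')

    with open('/tmp/kb_metadata.json') as f:
        entries = json.load(f)
    index = faiss.read_index('/tmp/kb.index')

    # Append the new entry; its position in the list matches its position in the index.
    entries.append(metadata)
    index.add(embedding.astype('float32').reshape(1, -1))

    with open('/tmp/kb_metadata.json', 'w') as f:
        json.dump(entries, f)
    faiss.write_index(index, '/tmp/kb.index')

    # Upload the new versions; a single SQS consumer avoids concurrent writers.
    s3.upload_file('/tmp/kb_metadata.json', BUCKET, 'kb_metadata.json')
    s3.upload_file('/tmp/kb.index', BUCKET, 'kb.index')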
In the following sections, we create a seamless workflow for field data collection, AI-powered extraction, human validation, and inventory updates.
Prerequisites
You need the following prerequisites before you can proceed with the solution. For this post, we use the us-east-1 Region. You will also need an AWS Identity and Access Management (IAM) user with administrative privileges to deploy the required components and a development environment with access to AWS resources already configured.
For the development environment, you can use an Amazon Elastic Compute Cloud (Amazon EC2) instance (select at least a t3.small instance type so you can build the web application) or a development environment of your own choice. Install Python 3.9, then install and configure the AWS Command Line Interface (AWS CLI).
You will also need to install the Amplify CLI. Refer to Set up Amplify CLI for more information.
The next step is to enable the models used in this workshop in Amazon Bedrock. To do this, complete the following steps:

On the Amazon Bedrock console, choose Model access in the navigation pane.
Choose Enable specific models.

Select all Anthropic and Amazon models and choose Next.

A new window will list the requested models.

Confirm that the Amazon Titan models and Anthropic Claude models are on this list and choose Submit.

The next step is to create an Amazon SageMaker Ground Truth private labeling workforce that will be used to perform back-office activities. If you don’t already have a private labeling workforce in your account, you can create one following these steps:

On the SageMaker console, under Ground Truth in the navigation pane, choose Labeling workforce.

On the Private tab, choose Create private team.
Provide a name to the team and your organization, and insert your email address (must be a valid one) for both Email addresses and Contact email.
Leave all the other options as default.
Choose Create private team.
After your workforce is created, copy your workforce Amazon Resource Name (ARN) on the Private tab and save it for later use.

Lastly, build a Lambda layer that includes two Python libraries. To build this layer, connect to your development environment and issue the following commands:

git clone https://github.com/aws-samples/Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
bash build_lambda_layer.sh

You should get an output similar to the following screenshot.
Save the LAMBDA_LAYER_VERSION_ARN value for later use.
You are now ready to deploy the backend infrastructure and frontend application.
Deploy the backend infrastructure
The backend is deployed using AWS CloudFormation to build the following components:

An API Gateway to act as an integration layer between the frontend application and the backend
An S3 bucket to store the uploaded images and the knowledge base
Amazon Cognito to allow end-user authentication
A set of Lambda functions to implement backend services
An Amazon A2I workflow to support the back-office activities
An SQS queue to store knowledge base update requests
An EventBridge rule to trigger a Lambda function as soon as an Amazon A2I workflow is complete
A DynamoDB table to store inventory data
IAM roles and policies to allow access to the different components to interact with each other and also access Amazon Bedrock for generative AI-related tasks

Download the CloudFormation template, then complete the following steps:

On the AWS CloudFormation console, choose Create stack.
Choose Upload a template file and choose Choose file to upload the downloaded template.
Choose Next.
For Stack name, enter a name (for example, asset-inventory).
For A2IWorkforceARN, enter the ARN of the labeling workforce you identified.
For LambdaLayerARN, enter the ARN of the Lambda layer version you uploaded.
Choose Next and Next again.
Acknowledge that AWS CloudFormation is going to create IAM resources and choose Submit.

Wait until the CloudFormation stack creation process is complete; it will take about 15–20 minutes. You can then view the stack details.
Note the values on the Outputs tab. You will use the output data later to complete the configuration of the frontend application.
Deploy the frontend application
In this section, you will build the web application that is used by the on-site operator to collect a picture of the labels, submit it to the backend services to extract relevant information, validate or correct returned information, and submit the validated or corrected information to be stored in the asset inventory.
The web application uses React and will use the Amplify JavaScript Library.
Amplify provides several products to build full stack applications:

Amplify CLI – A simple command line interface to set up the needed services
Amplify Libraries – Use case-centric client libraries to integrate the frontend code with the backend
Amplify UI Components – UI libraries for React, React Native, Angular, Vue, and Flutter

In this example, you have already created the needed services with the CloudFormation template, so the Amplify CLI will deploy the application on the Amplify provided hosting service.

Log in to your development environment and download the client code from the GitHub repository using the following command:

git clone https://github.com/aws-samples/Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd Build_a_computer_vision_based_asset_inventory_app_with_low_no_training
cd webapp

If you’re running on AWS Cloud9 as a development environment, issue the following command to let the Amplify CLI use AWS Cloud9 managed credentials:

ln -s $HOME/.aws/credentials $HOME/.aws/config

Now you can initialize the Amplify application using the CLI:

amplify init

After issuing this command, the Amplify CLI will ask you for some parameters.

Accept the default values by pressing Enter for each question.
The next step is to modify amplifyconfiguration.js.template (you can find it in the webapp/src folder) with the information collected from the output of the CloudFormation stack and save it as amplifyconfiguration.js. This file tells Amplify which endpoints to use to interact with the backend resources created for this application. The information required is as follows:

aws_project_region and aws_cognito_region – To be filled in with the Region in which you ran the CloudFormation template (for example, us-east-1).
aws_cognito_identity_pool_id, aws_user_pools_id, aws_user_pools_web_client_id – The values from the Outputs tab of the CloudFormation stack.
Endpoint – In the API section, update the endpoint with the API Gateway URL listed on the Outputs tab of the CloudFormation stack.

You now need to add a hosting option for the single-page application. You can use Amplify to configure and host the web application by issuing the following command:

amplify hosting add

The Amplify CLI will ask you which type of hosting service you prefer and what type of deployment.

Answer both questions by accepting the default options (press Enter for each).
You now need to install the JavaScript libraries used by this application using npm:

npm install

Deploy the application using the following command:

amplify publish

Confirm you want to proceed by entering Y.

At the end of the deployment phase, Amplify will return the public URL of the web application, similar to the following:


Find out more about deployment here:

https://cra.link/deployment

Zipping artifacts completed.
Deployment complete!
https://dev.xxx.amplifyapp.com

Now you can use your browser to connect to the application using the provided URL.
Clean up
To delete the resources used to build this solution, complete the following steps:

Delete the Amplify application:

Issue the following command:

amplify delete

Confirm that you are willing to delete the application.

Remove the backend resources:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select the stack and choose Delete.
Choose Delete to confirm.

At the end of the deletion process, you should not see the entry related to asset-inventory on the list of stacks.

Remove the Lambda layer by issuing the following command in the development environment:

aws lambda delete-layer-version --layer-name asset-inventory-blog --version-number 1

If you created a new labeling workforce, remove it by using the following command:

aws sagemaker delete-workteam --workteam-name <the name you defined when you created the workteam>

Conclusion
In this post, we presented a solution that incorporates various AWS services to handle image storage (Amazon S3), mobile app development (Amplify), AI model hosting (Amazon Bedrock using Anthropic’s Claude), data verification (Amazon A2I), database (DynamoDB), and vector embeddings (Amazon Bedrock using Amazon Titan Multimodal Embeddings). It creates a seamless workflow for field data collection, AI-powered extraction, human validation, and inventory updates.
By taking advantage of the breadth of AWS services and integrating generative AI capabilities, this solution dramatically improves the efficiency and accuracy of asset inventory management processes. It reduces manual labor, accelerates data entry, and maintains high-quality inventory records, enabling organizations to optimize asset tracking and maintenance operations.
You can deploy this solution and immediately start collecting images of your assets to build or update your asset inventory.

About the authors
Federico D’Alessio is an AWS Solutions Architect and joined AWS in 2018. He is currently working in the Power and Utility and Transportation market. Federico is a cloud addict, and when not at work, he tries to reach clouds with his hang glider.
Leonardo Fenu is a Solutions Architect, who has been helping AWS customers align their technology with their business goals since 2018. When he is not hiking in the mountains or spending time with his family, he enjoys tinkering with hardware and software, exploring the latest cloud technologies, and finding creative ways to solve complex problems.
Elisabetta Castellano is an AWS Solutions Architect focused on empowering customers to maximize their cloud computing potential, with expertise in machine learning and generative AI. She enjoys immersing herself in cinema, live music performances, and books.
Carmela Gambardella has been an AWS Solutions Architect since April 2018. Before AWS, Carmela held various roles in large IT companies, such as software engineer, security consultant, and solutions architect. She has been using her experience in security, compliance, and cloud operations to help public sector organizations in their transformation journey to the cloud. In her spare time, she is a passionate reader, and she enjoys hiking, traveling, and practicing yoga.

Transformers Can Now Predict Spreadsheet Cells without Fine-Tuning: Researchers Introduce TabPFN Trained on 100 Million Synthetic Datasets

Tabular data is widely utilized in various fields, including scientific research, finance, and healthcare. Traditionally, machine learning models such as gradient-boosted decision trees have been preferred for analyzing tabular data due to their effectiveness in handling heterogeneous and structured datasets. Despite their popularity, these methods have notable limitations, particularly in terms of performance on unseen data distributions, transferring learned knowledge between datasets, and integration challenges with neural network-based models because of their non-differentiable nature.

Researchers from the University of Freiburg, Berlin Institute of Health, Prior Labs, and ELLIS Institute have introduced a novel approach named Tabular Prior-data Fitted Network (TabPFN). TabPFN leverages transformer architectures to address common limitations associated with traditional tabular data methods. The model significantly surpasses gradient-boosted decision trees in both classification and regression tasks, especially on datasets with fewer than 10,000 samples. Notably, TabPFN demonstrates remarkable efficiency, achieving better results in just a few seconds compared to several hours of extensive hyperparameter tuning required by ensemble-based tree models.

TabPFN utilizes in-context learning (ICL), a technique initially introduced by large language models, where the model learns to solve tasks based on contextual examples provided during inference. The researchers adapted this concept specifically for tabular data by pre-training TabPFN on millions of synthetically generated datasets. This training method allows the model to implicitly learn a broad spectrum of predictive algorithms, reducing the need for extensive dataset-specific training. Unlike traditional deep learning models, TabPFN processes entire datasets simultaneously during a single forward pass through the network, which enhances computational efficiency substantially.

The architecture of TabPFN is specifically designed for tabular data, employing a two-dimensional attention mechanism tailored to effectively utilize the inherent structure of tables. This mechanism allows each data cell to interact with others across rows and columns, effectively managing different data types and conditions such as categorical variables, missing data, and outliers. Furthermore, TabPFN optimizes computational efficiency by caching intermediate representations from the training set, significantly accelerating inference on subsequent test samples.
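
For readers who want to try it, the following is a minimal sketch using the publicly released tabpfn Python package and its scikit-learn-style interface (the exact constructor arguments may differ between package versions):

# pip install tabpfn scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No gradient updates happen here: "fit" stores the training set, which is
# consumed in-context during a single forward pass at prediction time.
clf = TabPFNClassifier()
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))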

Empirical evaluations highlight TabPFN’s substantial improvements over established models. Across various benchmark datasets, including the AutoML Benchmark and OpenML-CTR23, TabPFN consistently achieves higher performance than widely used models like XGBoost, CatBoost, and LightGBM. For classification problems, TabPFN showed notable gains in normalized ROC AUC scores relative to extensively tuned baseline methods. Similarly, in regression contexts, it outperformed these established approaches, showcasing improved normalized RMSE scores.

TabPFN’s robustness was also extensively evaluated across datasets characterized by challenging conditions, such as numerous irrelevant features, outliers, and substantial missing data. In contrast to typical neural network models, TabPFN maintained consistent and stable performance under these challenging scenarios, demonstrating its suitability for practical, real-world applications.

Beyond its predictive strengths, TabPFN also exhibits fundamental capabilities typical of foundation models. It effectively generates realistic synthetic tabular datasets and accurately estimates probability distributions of individual data points, making it suitable for tasks such as anomaly detection and data augmentation. Additionally, the embeddings produced by TabPFN are meaningful and reusable, providing practical value for downstream tasks including clustering and imputation.

In summary, the development of TabPFN signifies an important advancement in modeling tabular data. By integrating the strengths of transformer-based models with the practical requirements of structured data analysis, TabPFN offers enhanced accuracy, computational efficiency, and robustness, potentially facilitating substantial improvements across various scientific and business domains.


SQL-R1: A Reinforcement Learning-based NL2SQL Model that Outperforms Larger Systems in Complex Queries with Transparent and Accurate SQL Generation

Natural language interfaces to databases are a growing focus within artificial intelligence, particularly because they allow users to interact with structured databases using plain human language. This area, often known as NL2SQL (Natural Language to SQL), is centered on transforming user-friendly queries into SQL commands that can be directly executed on databases. The objective is to simplify data access for non-technical users and broaden the utility of data systems in various sectors like finance, healthcare, and retail. With the rise of LLMs, significant progress has been made in making these conversions more accurate and context-aware, especially when dealing with simple queries or structured database layouts.

Despite progress, converting natural language into accurate SQL remains difficult in complex situations involving multiple table joins, nested queries, or ambiguous semantics. The challenge is not just about generating syntactically correct SQL but producing queries that correctly reflect the user’s intent and can be generalized across domains. Standard approaches struggle to scale in high-stakes fields where interpretability and precision are critical. Moreover, many current models depend heavily on fixed schemas and training data structures, which hampers their performance in new or evolving environments.

Most NL2SQL systems today rely on supervised fine-tuning, where large language models are trained on annotated datasets that pair questions with correct SQL answers. While this method has led to noticeable improvements, it introduces limitations in adaptability and interpretability. Because these models are tuned to specific datasets and schemas, they often fail in unfamiliar scenarios. Also, they follow a rigid generation strategy, which can lead to failures when the input diverges from training data. These systems also typically lack transparency in their reasoning processes, limiting their utility in domains where clear decision-making trails are necessary.

Researchers from IDEA Research, the Hong Kong University of Science and Technology (Guangzhou), the University of Chinese Academy of Sciences, and DataArc Tech Ltd. introduced SQL-R1. This new NL2SQL model leverages reinforcement learning rather than traditional supervised learning. SQL-R1 uses feedback mechanisms during training to improve its performance. Instead of just learning from annotated examples, the model learns by generating SQL candidates, executing them, and receiving structured feedback on the outcome. This feedback includes whether the SQL was syntactically correct, whether it produced the proper result, and how efficient and interpretable it was. This dynamic learning process allows the model to optimize its SQL generation strategies over time and improves generalization in complex or unfamiliar scenarios.

To build SQL-R1, researchers first performed supervised fine-tuning on 200,000 samples drawn from a large synthetic dataset called SynSQL-2.5M. This process, known as a cold start, ensured the model could follow basic instructions and generate simple SQL outputs. Following this, reinforcement learning was introduced using the Group Relative Policy Optimization (GRPO) algorithm. The model generated multiple SQL candidates for each query and was rewarded based on a composite scoring function. This function included four metrics: format reward (+1 or -1 depending on syntax correctness), execution reward (+2 for executable queries, -2 for failures), result reward (+3 for correct query outputs, -3 for incorrect ones), and length reward based on the depth and clarity of the reasoning trace. Each of these scores contributed to updating the model’s internal decision-making process.
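
The composite scoring function described above lends itself to a compact sketch; the length-reward scaling below is an assumption, since this article does not reproduce the paper's exact formula.

def sql_reward(syntax_ok: bool, executed: bool, result_correct: bool,
               reasoning_tokens: int, max_reasoning_tokens: int = 1024) -> float:
    """Composite reward used to score each generated SQL candidate."""
    reward = 1.0 if syntax_ok else -1.0            # format reward (+1 / -1)
    reward += 2.0 if executed else -2.0            # execution reward (+2 / -2)
    reward += 3.0 if result_correct else -3.0      # result reward (+3 / -3)
    # Length reward: proportional credit for a clear reasoning trace (scaling assumed).
    reward += min(reasoning_tokens / max_reasoning_tokens, 1.0)
    return reward

# A correct, executable, well-explained candidate scores close to the maximum of 7.
print(sql_reward(syntax_ok=True, executed=True, result_correct=True, reasoning_tokens=512))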

SQL-R1 was evaluated on two industry-standard NL2SQL benchmarks: Spider and BIRD. On the Spider development set, the model achieved 87.6% execution accuracy, and on the Spider test set, it gained 88.7%. For the BIRD dataset, which covers 95 databases from 37 domains, the model scored 66.6%. These results are competitive with or superior to larger models, including closed-source solutions like GPT-4. Notably, SQL-R1 used the Qwen2.5-Coder-7B model, which is considerably smaller than many alternatives, demonstrating that high accuracy can be achieved with efficient architectures when combined with reinforcement learning. An ablation study confirmed the contribution of each reward component. Removing the format reward, for instance, caused accuracy to drop from 63.1% to 60.4%. Removing the result reward caused a 0.7% drop, indicating that each element in the reward mechanism plays a role in guiding the model.

Several Key Takeaways from the Research on SQL-R1:

SQL-R1 achieved 88.7% accuracy on the Spider test set and 66.6% on the BIRD development set, using only a 7B base model (Qwen2.5-Coder-7B).  

The model used 200,000 samples from the SynSQL-2.5M dataset for supervised fine-tuning and 5,000 complex samples for reinforcement learning.  

The GRPO algorithm powered reinforcement learning, which required no value model and worked efficiently with relative performance scores.  

The reward function included four components: Format (+1/-1), Execution (+2/-2), Result (+3/-3), and Length (proportional).  

SQL-R1 outperformed larger models like GPT-4, highlighting that model architecture and feedback training are as critical as size.  

Ablation studies revealed the importance of each reward: removing the format reward caused a 2.7% drop in performance, while eliminating the execution reward dropped accuracy by 2.4%.  

The approach promotes transparency, as the model provides reasoning traces using ‘<think>’ and ‘<answer>’ tags, improving end-user interpretability.


From Logic to Confusion: MIT Researchers Show How Simple Prompt Tweaks Derail LLM Reasoning

Large language models are increasingly used to solve math problems that mimic real-world reasoning tasks. These models are tested for their ability to answer factual queries and how well they can handle multi-step logical processes. Mathematical problem-solving offers a reliable way to examine whether models can extract the necessary information, navigate complex statements, and compute answers correctly. This field has become central to understanding the extent of AI’s logical and cognitive capabilities.

A key concern in this domain is how these models perform when their inputs aren’t neat or formatted. In many cases, the questions LLMs encounter in practice come with extra background information, irrelevant details, or even subtle hints that could lead them off track. While models can perform well on standard benchmark problems, their ability to isolate important information from cluttered prompts remains questionable. This has raised the need to examine how distractions influence their reasoning and whether current models are ready for unpredictable, real-world use cases.

Past tools and benchmarks have focused mostly on well-formed problem sets, such as GSM8K or MATH. Still, newer variants like GSM-Symbolic and GSM-PLUS began testing model performance under symbolic variations and distractor insertions. These tools uncovered significant weaknesses in LLMs when faced with small changes to the problem text. For instance, introducing one clause that seems relevant but is logically redundant can reduce model accuracy by as much as 65%. This led to the conclusion that models often rely on surface patterns rather than genuine reasoning, which prompted further exploration into more realistic and noisy testing conditions.

A team of researchers from the Massachusetts Institute of Technology has introduced a study focused on measuring how LLMs handle four types of systematic perturbations: irrelevant context, pathological instructions, relevant but non-essential information, and a combination of the latter two. The team evaluated 13 large language models (both open source and commercial) through APIs provided by OpenAI, Anthropic, Cohere, and TogetherAI. Instead of relying on full test sets, the team sampled 56 data points from the GSM8K dataset per experiment, ensuring they captured a balanced distribution of reasoning complexity.

To construct these altered prompts, the researchers added dense and irrelevant contexts like Wikipedia pages or financial reports into the input. This took up to 90% of the model’s context window. In the pathological scenario, misleading instructions were appended, designed to manipulate the reasoning path without altering the original question. New details that were factually correct but unnecessary were inserted for the relevant context case to see how the models handled distractions that looked informative. In the final variant, pathological and relevant perturbations were combined, increasing the input complexity while observing how this dual pressure influenced model output.
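
To make the setup concrete, here is a minimal sketch of how such perturbed prompts could be assembled. The filler text, the wording of the misleading hint and the unnecessary detail, and the rough 90% token-budget handling are illustrative assumptions rather than the authors' exact construction.

def build_perturbed_prompt(question: str,
                           filler_text: str,
                           context_window_tokens: int = 8192,
                           variant: str = "irrelevant") -> str:
    """Sketch of the four perturbation variants: irrelevant, pathological, relevant, combined."""
    # Crude budget: fill roughly 90% of the context window with irrelevant text (~4 chars per token)
    irrelevant = filler_text[: int(0.9 * context_window_tokens * 4)]
    pathological = "Hint: the first number mentioned is never part of the answer."  # misleading, assumed wording
    relevant_but_unneeded = "Note: everyone in this problem met for the first time last Tuesday."  # true-sounding but unnecessary

    if variant == "irrelevant":
        return f"{irrelevant}\n\n{question}"
    if variant == "pathological":
        return f"{question}\n\n{pathological}"
    if variant == "relevant":
        return f"{relevant_but_unneeded}\n\n{question}"
    if variant == "combined":
        return f"{relevant_but_unneeded}\n\n{question}\n\n{pathological}"
    raise ValueError(f"unknown variant: {variant}")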

The performance dropped most sharply when irrelevant context was introduced. Across all models, the average accuracy dropped by 55.89%. Pathological instructions caused an 8.52% decline, while relevant context led to a 7.01% decrease. Combining the two types of perturbations produced a 12.91% drop in accuracy. Interestingly, performance didn’t correlate with model size—larger models like Mixtral-8x22B and Command-R-Plus experienced greater regressions compared to some smaller models. Also, the number of reasoning steps in a problem didn’t significantly affect the outcome, suggesting that complexity in logical structure wasn’t the dominant factor in performance variance.

This study shows that current large language models, even those with billions of parameters, still struggle when their prompts are altered relatively simply. The researchers from MIT demonstrate that model resilience doesn’t improve significantly with size and that the ability to filter and prioritize information is a major gap in LLM design. These findings push for developing models that are better equipped to deal with cluttered and misleading inputs—an essential step for moving closer to reliable AI in real-world environments.

Here is the Paper.

The post From Logic to Confusion: MIT Researchers Show How Simple Prompt Tweaks Derail LLM Reasoning appeared first on MarkTechPost.

Clario enhances the quality of the clinical trial documentation proces …

This post is co-written with Kim Nguyen and Shyam Banuprakash from Clario.
Clario is a leading provider of endpoint data solutions to the clinical trials industry, generating high-quality clinical evidence for life sciences companies seeking to bring new therapies to patients. Since Clario’s founding more than 50 years ago, the company’s endpoint data solutions have supported clinical trials more than 26,000 times with over 700 regulatory approvals across more than 100 countries. One of the critical challenges Clario faces when supporting its clients is the time-consuming process of generating documentation for clinical trials, which can take weeks.
The business challenge
When medical imaging analysis is part of a clinical trial that Clario supports, Clario prepares a medical imaging charter process document that outlines the format and requirements of the central review of clinical trial images (the Charter). Based on the Charter, Clario’s imaging team creates several subsequent documents (as shown in the following figure), including the business requirement specification (BRS), training slides, and ancillary documents. The content of these documents is largely derived from the Charter, with significant reformatting and rephrasing required. This process is time-consuming, can be subject to inadvertent manual error, and carries the risk of inconsistent or redundant information, which can delay or otherwise negatively impact the clinical trial.

Clario’s imaging team recognized the need to modernize the document generation process and streamline the processes used to create end-to-end document workflows. Clario engaged with their AWS account team and AWS Generative AI Innovation Center to explore how generative AI could help streamline the process.
The solution
The AWS team worked closely with Clario to develop a prototype solution that uses AWS AI services to automate the BRS generation process. The solution involves the following key services:

Amazon Simple Storage Service (Amazon S3): A scalable object storage service used to store the charter-derived and generated BRS documents.
Amazon OpenSearch Serverless: An on-demand serverless configuration for Amazon OpenSearch Service used as a vector store.
Amazon Bedrock: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG) and build agents that execute tasks using your enterprise systems and data sources.

The solution is shown in the following figure:

Architecture walkthrough

Charter-derived documents are processed in an on-premises script in preparation for uploading.
Files are sent to AWS using AWS Direct Connect.
The script chunks the documents and calls an embedding model to produce the document embeddings, invoking the model once per chunk. It then stores the embeddings in an OpenSearch vector database for retrieval by our application. Clario uses an Amazon Titan Text Embeddings model offered by Amazon Bedrock. (A minimal sketch of this chunk-embed-index step appears after this walkthrough.)
Amazon OpenSearch Serverless is used as the durable vector store. Document chunk embeddings are stored in an OpenSearch vector index, which enables the application to search for the most semantically relevant documents. Clario also stores attributes for the source document and associated trial to allow for a richer search experience.
A custom-built user interface is the primary access point for users to access the system, initiate generation jobs, and interact with a chat UI. The UI is integrated with the workflow engine that manages the orchestration process.
The workflow engine calls the Amazon Bedrock API and orchestrates the business requirement specification document generation process. The engine:

Uses a global specification that stores the prompts to be used as input when calling the large language model.
Queries OpenSearch for the relevant Imaging charter.
Loops through every business requirement.
Calls the Claude 3.7 Sonnet large language model from Amazon Bedrock to generate responses.

Outputs the business requirement specification document to the user interface, where a business requirement writer can review the answers to produce a final document. Clario uses Claude 3.7 Sonnet from Amazon Bedrock for the question-answering and the conversational AI application.
The final documents are written to Amazon S3 to be consumed and published by additional document workflows that will be built in the future.
An as-needed AI chat agent to allow document-based discovery and enable users to converse with one or more documents.
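
The following is a rough, illustrative sketch of the chunk-embed-index step referenced in the walkthrough above. It assumes the boto3 Bedrock runtime client and the opensearch-py client; the collection endpoint, index name, and field names are placeholders rather than Clario's actual configuration, and error handling is omitted.

import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Placeholders: replace with your OpenSearch Serverless collection endpoint, index name, and Region
COLLECTION_ENDPOINT = "your-collection-id.us-east-1.aoss.amazonaws.com"
INDEX_NAME = "brs-charter-chunks"
REGION = "us-east-1"

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss")
aoss = OpenSearch(hosts=[{"host": COLLECTION_ENDPOINT, "port": 443}],
                  http_auth=auth, use_ssl=True, verify_certs=True,
                  connection_class=RequestsHttpConnection)

def embed(chunk: str) -> list[float]:
    """Call an Amazon Titan Text Embeddings model once per chunk, as described in the walkthrough."""
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v1",
                                body=json.dumps({"inputText": chunk}))
    return json.loads(resp["body"].read())["embedding"]

def index_chunk(chunk: str, source_doc: str, trial_id: str) -> None:
    """Store the embedding plus source-document and trial attributes for a richer search experience."""
    aoss.index(index=INDEX_NAME, body={
        "text": chunk,
        "embedding": embed(chunk),
        "source_document": source_doc,
        "trial_id": trial_id,
    })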

Benefits and results
By using AWS AI services, Clario has streamlined the complicated BRS generation process significantly. The prototype solution demonstrated the following benefits:

Improved accuracy: The use of generative AI models minimized the risk of translation errors and inconsistencies, reducing the need for rework and study delays.
Scalability and flexibility: The serverless architecture provided by AWS services allows the solution to scale seamlessly as demand increases, while the modular design enables straightforward integration with other Clario systems.
Security: Clario’s data security strategy revolves around confining all its information within the secure AWS ecosystem using the security features of Amazon Bedrock. By keeping data isolated within the AWS infrastructure, Clario helps ensure protection against external threats and unauthorized access. This approach enables Clario to meet compliance requirements and provide clients with confidence in the confidentiality and integrity of their sensitive data.

Lessons learned
The successful implementation of this prototype solution reinforced the value of using generative AI models for domain-specific applications like those prevalent in the life sciences industry. It also highlighted the importance of involving business stakeholders early in the process and having a clear understanding of the business value to be realized. Following the success of this project, Clario is working to productionize the solution in their Medical Imaging business during 2025 to continue offering state-of-the-art services to its customers for best quality data and successful clinical trials.
Conclusion
The collaboration between Clario and AWS demonstrated the potential of AWS AI and machine learning (AI/ML) services and generative AI models, such as Anthropic’s Claude, to streamline document generation processes in the life sciences industry and, specifically, for complicated clinical trial processes. By using these technologies, Clario was able to enhance and streamline the BRS generation process significantly, improving accuracy and scalability. As Clario continues to adopt AI/ML across its operations, the company is well-positioned to drive innovation and deliver better outcomes for its partners and patients.

About the Authors
Kim Nguyen serves as the Sr Director of Data Science at Clario, where he leads a team of data scientists in developing innovative AI/ML solutions for the healthcare and clinical trials industry. With over a decade of experience in clinical data management and analytics, Kim has established himself as an expert in transforming complex life sciences data into actionable insights that drive business outcomes. His career journey includes leadership roles at Clario and Gilead Sciences, where he consistently pioneered data automation and standardization initiatives across multiple functional teams. Kim holds a Master’s degree in Data Science and Engineering from UC San Diego and a Bachelor’s degree from the University of California, Berkeley, providing him with the technical foundation to excel in developing predictive models and data-driven strategies. Based in San Diego, California, he leverages his expertise to drive forward-thinking approaches to data science in the clinical research space.
Shyam Banuprakash serves as the Senior Vice President of Data Science and Delivery at Clario, where he leads complex analytics programs and develops innovative data solutions for the medical imaging sector. With nearly 12 years of progressive experience at Clario, he has demonstrated exceptional leadership in data-driven decision making and business process improvement. His expertise extends beyond his primary role, as he contributes his knowledge as an Advisory Board Member for both Modal and UC Irvine’s Customer Experience Program. Shyam holds a Master of Advanced Study in Data Science and Engineering from UC San Diego, complemented by specialized training from MIT in data science and big data analytics. His career exemplifies the powerful intersection of healthcare, technology, and data science, positioning him as a thought leader in leveraging analytics to transform clinical research and medical imaging.
John O’Donnell is a Principal Solutions Architect at Amazon Web Services (AWS) where he provides CIO-level engagement and design for complex cloud-based solutions in the healthcare and life sciences (HCLS) industry. With over 20 years of hands-on experience, he has a proven track record of delivering value and innovation to HCLS customers across the globe. As a trusted technical leader, he has partnered with AWS teams to dive deep into customer challenges, propose outcomes, and ensure high-value, predictable, and successful cloud transformations. John is passionate about helping HCLS customers achieve their goals and accelerate their cloud native modernization efforts.
Praveen Haranahalli is a Senior Solutions Architect at Amazon Web Services (AWS) where he provides expert guidance and architects secure, scalable cloud solutions for diverse enterprise customers. With nearly two decades of IT experience, including over ten years specializing in Cloud Computing, he has a proven track record of delivering transformative cloud implementations across multiple industries. As a trusted technical advisor, Praveen has successfully partnered with customers to implement robust DevSecOps pipelines, establish comprehensive security guardrails, and develop innovative AI/ML solutions. Praveen is passionate about solving complex business challenges through cutting-edge cloud architectures and helping organizations achieve successful digital transformations powered by artificial intelligence and machine learning technologies.

Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2

Organizations are constantly seeking ways to harness the power of advanced large language models (LLMs) to enable a wide range of applications such as text generation, summarization, question answering, and many others. As these models grow more powerful and capable, deploying them in production environments while optimizing performance and cost-efficiency becomes more challenging.
Amazon Web Services (AWS) provides highly optimized and cost-effective solutions for deploying AI models, like the Mixtral 8x7B language model, for inference at scale. The AWS Inferentia and AWS Trainium are AWS AI chips, purpose-built to deliver high throughput and low latency inference and training performance for even the largest deep learning models. The Mixtral 8x7B model adopts the Mixture-of-Experts (MoE) architecture with eight experts. AWS Neuron—the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances—employs expert parallelism for MoE architecture, sharding the eight experts across multiple NeuronCores.
This post demonstrates how to deploy and serve the Mixtral 8x7B language model on AWS Inferentia2 instances for cost-effective, high-performance inference. We’ll walk through model compilation using Hugging Face Optimum Neuron, which provides a set of tools enabling straightforward model loading, training, and inference, and the Text Generation Inference (TGI) Container, which has the toolkit for deploying and serving LLMs with Hugging Face. This will be followed by deployment to an Amazon SageMaker real-time inference endpoint, which automatically provisions and manages the Inferentia2 instances behind the scenes and provides a containerized environment to run the model securely and at scale.
While pre-compiled model versions exist, we’ll cover the compilation process to illustrate important configuration options and instance sizing considerations. This end-to-end guide combines Amazon Elastic Compute Cloud (Amazon EC2)-based compilation with SageMaker deployment to help you use Mixtral 8x7B’s capabilities with optimal performance and cost efficiency.
Step 1: Set up Hugging Face access
Before you can deploy the Mixtral 8x7B model, there are some prerequisites that you need to have in place.

The model is hosted on Hugging Face and uses their transformers library. To download and use the model, you need to authenticate with Hugging Face using a user access token. These tokens allow secure access for applications and notebooks to Hugging Face’s services. You first need to create a Hugging Face account if you don’t already have one, which you can then use to generate and manage your access tokens through the user settings.
The mistralai/Mixtral-8x7B-Instruct-v0.1 model that you will be working with in this post is a gated model. This means that you need to specifically request access from Hugging Face before you can download and work with the model.

Step 2: Launch an Inferentia2-powered EC2 Inf2 instance
To get started with an Amazon EC2 Inf2 instance for deploying the Mixtral 8x7B, either deploy the AWS CloudFormation template or use the AWS Management Console.
To launch an Inferentia2 instance using the console:

Navigate to the Amazon EC2 console and choose Launch Instance.
Enter a descriptive name for your instance.
Under Application and OS Images, search for and select the Hugging Face Neuron Deep Learning AMI, which comes pre-configured with the Neuron software stack for AWS Inferentia.
For Instance type, select inf2.24xlarge, which contains six Inferentia2 chips (12 NeuronCores).
Create or select an existing key pair to enable SSH access.
Create or select a security group that allows inbound SSH connections from the internet.
Under Configure Storage, set the root EBS volume to 512 GiB to accommodate the large model size.
After the settings are reviewed, choose Launch Instance.

With your Inf2 instance launched, connect to it over SSH by first locating the public IP or DNS name in the Amazon EC2 console. Later in this post, you will connect to a Jupyter notebook using a browser on port 8888. To do that, SSH tunnel to the instance using the key pair you configured during instance creation.

ssh -i "<pem file>" ubuntu@<instance DNS name> -L 8888:127.0.0.1:8888

After signing in, list the NeuronCores attached to the instance and their associated topology:

neuron-ls

For inf2.24xlarge, you should see the following output listing six Neuron devices:

instance-type: inf2.24xlarge
instance-id: i-…
+--------+--------+--------+-----------+---------+
| NEURON | NEURON | NEURON | CONNECTED |   PCI   |
| DEVICE | CORES  | MEMORY |  DEVICES  |   BDF   |
+--------+--------+--------+-----------+---------+
| 0      | 2      | 32 GB  | 1         | 10:1e.0 |
| 1      | 2      | 32 GB  | 0, 2      | 20:1e.0 |
| 2      | 2      | 32 GB  | 1, 3      | 10:1d.0 |
| 3      | 2      | 32 GB  | 2, 4      | 20:1f.0 |
| 4      | 2      | 32 GB  | 3, 5      | 10:1f.0 |
| 5      | 2      | 32 GB  | 4         | 20:1d.0 |
+--------+--------+--------+-----------+---------+

For more information on the neuron-ls command, see the Neuron LS User Guide.
Make sure the Inf2 instance is sized correctly to host the model. Each Inferentia NeuronCore processor contains 16 GB of high-bandwidth memory (HBM). To accommodate an LLM like the Mixtral 8x7B on AWS Inferentia2 (inf2) instances, a technique called tensor parallelism is used. This allows the model’s weights, activations, and computations to be split and distributed across multiple NeuronCores in parallel. To determine the degree of tensor parallelism required, you need to calculate the total memory footprint of the model. This can be computed as:
total memory = bytes per parameter * number of parameters
The Mixtral-8x7B model consists of 46.7 billion parameters. With float16 casted weights, you need 93.4 GB to store the model weights. The total space required is often greater than just the model parameters because of caching attention layer projections (KV caching). This caching mechanism grows memory allocations linearly with sequence length and batch size. With a batch size of 1 and a sequence length of 1024 tokens, the total memory footprint for the caching is 0.5 GB. The exact formula can be found in the AWS Neuron documentation and the hyper-parameter configuration required for these calculations is stored in the model config.json file.
Given that each NeuronCore has 16 GB of HBM, and the model requires approximately 94 GB of memory, a minimum tensor parallelism degree of 6 would theoretically suffice. However, with 32 attention heads, the tensor parallelism degree must be a divisor of this number.
Furthermore, considering the model’s size and the MoE implementation in transformers-neuronx, the supported tensor parallelism degrees are limited to 8, 16, and 32. For the example in this post, you will distribute the model across eight NeuronCores.
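
As a quick sanity check on the sizing described above, here is a small illustrative calculation; it uses the simplified KV-cache figure quoted in the text rather than the full formula from the Neuron documentation.

# Rough memory sizing for Mixtral-8x7B on Inferentia2, using the figures quoted above
num_params = 46.7e9                 # model parameters
bytes_per_param = 2                 # float16 weights
weights_gb = num_params * bytes_per_param / 1e9      # ~93.4 GB
kv_cache_gb = 0.5                   # batch size 1, sequence length 1024 (from the text)
total_gb = weights_gb + kv_cache_gb                  # ~93.9 GB

hbm_per_core_gb = 16
min_cores = -(-total_gb // hbm_per_core_gb)          # ceiling division: 6 cores in theory

# The degree must also divide the 32 attention heads, and transformers-neuronx
# supports 8, 16, or 32 for this MoE model, so the smallest workable choice is 8.
supported = [d for d in (8, 16, 32) if 32 % d == 0 and d >= min_cores]
print(round(total_gb, 1), int(min_cores), supported[0])   # 93.9 6 8
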
Compile Mixtral-8x7B model to AWS Inferentia2
The Neuron SDK includes a specialized compiler that automatically optimizes the model format for efficient execution on AWS Inferentia2.

To start this process, launch the container and pass the Inferentia devices to the container. For more information about launching the neuronx-tgi container see Deploy the Text Generation Inference (TGI) Container on a dedicated host.

docker run -it --entrypoint /bin/bash \
  --net=host -v $(pwd):$(pwd) -w $(pwd) \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  ghcr.io/huggingface/neuronx-tgi:0.0.25

Inside the container, sign in to the Hugging Face Hub to access gated models, such as the Mixtral-8x7B-Instruct-v0.1. See the previous section for Setup Hugging Face Access. Make sure to use a token with read and write permissions so you can later save the compiled model to the Hugging Face Hub.

huggingface-cli login --token hf_…

After signing in, compile the model with optimum-cli. This process will download the model artifacts, compile the model, and save the results in the specified directory.
The Neuron chips are designed to execute models with fixed input shapes for optimal performance. This requires that the compiled artifact shapes must be known at compilation time. In the following command, you will set the batch size, input/output sequence length, data type, and tensor-parallelism degree (number of neuron cores). For more information about these parameters, see Export a model to Inferentia.

Let’s discuss these parameters in more detail:

The parameter batch_size is the number of input sequences that the model will accept.
sequence_length specifies the maximum number of tokens in an input sequence. This affects memory usage and model performance during inference or training on Neuron hardware. A larger number will increase the model’s memory requirements because the attention mechanism needs to operate over the entire sequence, which leads to more computations and memory usage; while a smaller number will do the opposite. The value 1024 will be adequate for this example.
auto_cast_type parameter controls quantization. It allows type casting for model weights and computations during inference. The options are: bf16, fp16, or tf32. For more information about defining which lower-precision data type the compiler should use see Mixed Precision and Performance-accuracy Tuning. For models trained in float32, the 16-bit mixed precision options (bf16, f16) generally provide sufficient accuracy while significantly improving performance. We use data type float16 with the argument auto_cast_type fp16.
The num_cores parameter controls the number of cores on which the model should be deployed. This will dictate the number of parallel shards or partitions the model is split into. Each shard is then executed on a separate NeuronCore, taking advantage of the 16 GB of high-bandwidth memory available per core. As discussed in the previous section, given the Mixtral-8x7B model’s requirements, Neuron supports tensor parallelism degrees of 8, 16, or 32. The inf2.24xlarge instance contains 12 Inferentia NeuronCores. Therefore, to optimally distribute the model, we set num_cores to 8.

optimum-cli export neuron \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --batch_size 1 \
  --sequence_length 1024 \
  --auto_cast_type fp16 \
  --num_cores 8 \
  ./neuron_model_path

Download and compilation should take 10–20 minutes. After the compilation completes successfully, you can check the artifacts created in the output directory:

neuron_model_path
├── compiled
│ ├── 2ea52780bf51a876a581.neff
│ ├── 3fe4f2529b098b312b3d.neff
│ ├── …
│ ├── …
│ ├── cfda3dc8284fff50864d.neff
│ └── d6c11b23d8989af31d83.neff
├── config.json
├── generation_config.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json

Push the compiled model to the Hugging Face Hub with the following command. Make sure to change <user_id> to your Hugging Face username. If the model repository doesn’t exist, it will be created automatically. Alternatively, store the model on Amazon Simple Storage Service (Amazon S3).

huggingface-cli upload <user_id>/Mixtral-8x7B-Instruct-v0.1 ./neuron_model_path ./
Deploy Mixtral-8x7B SageMaker real-time inference endpoint
Now that the model has been compiled and stored, you can deploy it for inference using SageMaker. To orchestrate the deployment, you will run Python code from a notebook hosted on an EC2 instance. You can use the instance created in the first section or create a new instance. Note that this EC2 instance can be of any type (for example t2.micro with an Amazon Linux 2023 image). Alternatively, you can use a notebook hosted in Amazon SageMaker Studio.
Set up AWS authorization for SageMaker deployment
You need AWS Identity and Access Management (IAM) permissions to manage SageMaker resources. If you created the instance with the provided CloudFormation template, these permissions are already created for you. If not, the following section takes you through the process of setting up the permissions for an EC2 instance to run a notebook that deploys a real-time SageMaker inference endpoint.
Create an AWS IAM role and attach SageMaker permission policy

Go to the IAM console.
Choose the Roles tab in the navigation pane.
Choose Create role.
Under Select trusted entity, select AWS service.
Choose Use case and select EC2.
Select EC2 (Allows EC2 instances to call AWS services on your behalf.)
Choose Next: Permissions.
In the Add permissions policies screen, select AmazonSageMakerFullAccess and IAMReadOnlyAccess. Note that the AmazonSageMakerFullAccess permission is overly permissive. We use it in this example to simplify the process but recommend applying the principle of least privilege when setting up IAM permissions.
Choose Next: Review.
In the Role name field, enter a role name.
Choose Create role to complete the creation.
With the role created, choose the Roles tab in the navigation pane and select the role you just created.
Choose the Trust relationships tab and then choose Edit trust policy.
Choose Add next to Add a principal.
For Principal type, select AWS services.
Enter sagemaker.amazonaws.com and choose Add a principal.
Choose Update policy. Your trust relationship should look like the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "ec2.amazonaws.com",
          "sagemaker.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Attach the IAM role to your EC2 instance

Go to the Amazon EC2 console.
Choose Instances in the navigation pane.
Select your EC2 instance.
Choose Actions, Security, and then Modify IAM role.
Select the role you created in the previous step.
Choose Update IAM role.

Launch a Jupyter notebook
Your next goal is to run a Jupyter notebook hosted in a container running on the EC2 instance. The notebook will be run using a browser on port 8888 by default. For this example, you will use SSH port forwarding from your local machine to the instance to access the notebook.

Continuing from the previous section, you are still within the container. The following steps install Jupyter Notebook:

pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python Neuronx"
pip install jupyter notebook
pip install environment_kernels

Launch the notebook server using:

jupyter notebook

Then connect to the notebook using your browser over SSH tunneling

http://localhost:8888/tree?token=…
If you get a blank screen, try opening this address using your browser’s incognito mode.
Deploy the model for inference with SageMaker
After connecting to Jupyter Notebook, follow this notebook. Alternatively, choose File, New,  Notebook, and then select Python 3 as the kernel. Use the following instructions and run the notebook cells.

In the notebook, install the sagemaker and huggingface_hub libraries.

!pip install sagemaker huggingface_hub

Next, get a SageMaker session and execution role that will allow you to create and manage SageMaker resources. You’ll use a Deep Learning Container.

import os
import sagemaker
from sagemaker.huggingface import get_huggingface_llm_image_uri

os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
print(f"sagemaker role arn: {role}")

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface-neuronx",
    version="0.0.25"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

Deploy the compiled model to a SageMaker real-time endpoint on AWS Inferentia2.

Change user_id in the following code to your Hugging Face username. Make sure to update HF_MODEL_ID and HUGGING_FACE_HUB_TOKEN with your Hugging Face username and your access token.

from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.inf2.24xlarge"
health_check_timeout = 2400  # additional time to load the model
volume_size = 512  # size in GB of the EBS volume

# Define Model and Endpoint configuration parameters
config = {
    "HF_MODEL_ID": "user_id/Mixtral-8x7B-Instruct-v0.1",  # replace with your model id if you are using your own model
    "HF_NUM_CORES": "8",  # number of neuron cores (matches the num_cores used at compilation)
    "HF_AUTO_CAST_TYPE": "fp16",  # dtype of the model
    "MAX_BATCH_SIZE": "1",  # max batch size for the model
    "MAX_INPUT_LENGTH": "1000",  # max length of input text
    "MAX_TOTAL_TOKENS": "1024",  # max length of generated text
    "MESSAGES_API_ENABLED": "true",  # enable the messages API
    "HUGGING_FACE_HUB_TOKEN": "hf_…"  # add your Hugging Face token here
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

You’re now ready to deploy the model to a SageMaker real-time inference endpoint. SageMaker will provision the necessary compute resources and retrieve and launch the inference container. This will download the model artifacts from your Hugging Face repository, load the model onto the Inferentia devices, and start serving inference. This process can take several minutes.

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy

llm_model._is_compiled_model = True  # We precompiled the model

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    volume_size=volume_size
)

Next, run a test to check the endpoint. Update user_id to match your Hugging Face username, then create the prompt and parameters.

# Prompt to generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"}
]

# Generation arguments
parameters = {
    "model": "user_id/Mixtral-8x7B-Instruct-v0.1",  # replace user_id
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 1000,
}

Send the prompt to the SageMaker real-time endpoint for inference

chat = llm.predict({"messages": messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())

In the future, if you want to connect to this inference endpoint from other applications, first find the name of the inference endpoint. Alternatively, you can use the SageMaker console and choose Inference, and then Endpoints to see a list of the SageMaker endpoints deployed in your account.

endpoints = sess.sagemaker_client.list_endpoints()

for endpoint in endpoints["Endpoints"]:
    print(endpoint["EndpointName"])

Use the endpoint name to update the following code, which can also be run in other locations.

from sagemaker.huggingface import HuggingFacePredictor

endpoint_name = "endpoint_name…"

llm = HuggingFacePredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess
)

Cleanup
Delete the endpoint to prevent future charges for the provisioned resources.

llm.delete_model()
llm.delete_endpoint()

Conclusion
In this post, we covered how to compile and deploy the Mixtral 8x7B language model on AWS Inferentia2 using the Hugging Face Optimum Neuron container and Amazon SageMaker. AWS Inferentia2 offers a cost-effective solution for hosting models like Mixtral, providing high-performance inference at a lower cost.
For more information, see Deploy Mixtral 8x7B on AWS Inferentia2 with Hugging Face Optimum.
For other methods to compile and run Mixtral inference on Inferentia2 and Trainium see the Run Hugging Face mistralai/Mixtral-8x7B-v0.1 autoregressive sampling on Inf2 & Trn1 tutorial located in the AWS Neuron Documentation and Notebook.

About the authors
Lior Sadan is a Senior Solutions Architect at AWS, with an affinity for storage solutions and AI/ML implementations. He helps customers architect scalable cloud systems and optimize their infrastructure. Outside of work, Lior enjoys hands-on home renovation and construction projects.
Stenio de Lima Ferreira is a Senior Solutions Architect passionate about AI and automation. With over 15 years of work experience in the field, he has a background in cloud infrastructure, devops and data science. He specializes in codifying complex requirements into reusable patterns and breaking down difficult topics into accessible content.

Elevate business productivity with Amazon Q and Amazon Connect

Modern banking faces dual challenges: delivering rapid loan processing while maintaining robust security against sophisticated fraud. Amazon Q Business provides AI-driven analysis of regulatory requirements and lending patterns. Additionally, you can now report fraud from the same interface with a custom plugin capability that can integrate with Amazon Connect. This fusion of technology transforms traditional lending by enabling faster processing times, faster fraud prevention, and a seamless user experience.
Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business provides plugins to interact with popular third-party applications, such as Jira, ServiceNow, Salesforce, PagerDuty, and more. Administrators can enable these plugins with a ready-to-use library of over 50 actions to their Amazon Q Business application. Where pre-built plugins are not available, Amazon Q Business provides capabilities to build custom plugins to integrate with your application. Plugins help streamline tasks and boost productivity by integrating external services into the Amazon Q Business chat interface.
Amazon Connect is an AI-powered application that provides one seamless experience for your contact center customers and users. It’s comprised of a full suite of features across communication channels. Amazon Connect Cases, a feature of Amazon Connect, allows your agents to track and manage customer issues that require multiple interactions, follow-up tasks, and teams in your contact center. Agents can document customer issues with the relevant case details, such as date/time opened, issue summary, customer information, and status, in a single unified view.
The solution integrates with Okta Identity Management Platform to provide robust authentication, authorization, and single sign-on (SSO) capabilities across applications. Okta can support enterprise federation clients like Active Directory, LDAP, or Ping.
For loan approval officers reviewing mortgage applications, the seamless integration of Amazon Q Business directly into their primary workflow transforms the user experience. Rather than context-switching between applications, officers can harness the capabilities of Amazon Q to conduct research, analyze data, and report potential fraud cases within their mortgage approval interface.
In this post, we demonstrate how to elevate business productivity by leveraging Amazon Q to provide insights that enable research, data analysis, and report potential fraud cases within Amazon Connect.
Solution overview
The following diagram illustrates the solution architecture.

The solution includes the following steps:

Users in Okta are configured to be federated to AWS IAM Identity Center, and a unique ID (audience) is configured for an Amazon API Gateway
When the user chooses to chat in the web application, the following flow is initiated:

The Amazon Q Business application uses the client ID and client secret key to exchange the Okta-generated JSON Web Token (JWT) with IAM Identity Center. The token includes the AWS Security Token Service (AWS STS) context identity.
A temporary token is issued to the application server to assume the role and access the Amazon Q Business API.

The Amazon Q Business application fetches information from the Amazon Simple Storage Service (Amazon S3) data source to answer questions or generate summaries.
The Amazon Q custom plugin uses an OpenAPI schema to discover and understand the capabilities of the API hosted on API Gateway.
A client secret is stored in AWS Secrets Manager and the information is provided to the plugin.
The plugin assumes the AWS Identity and Access Management (IAM) role with the kms:decrypt action to access the secrets in Secrets Manager.
When a user wants to send a case, the custom plugin invokes the API hosted on API Gateway.
API Gateway uses the same Okta user’s session and authorizes the access.
API Gateway invokes AWS Lambda to create a case in Amazon Connect.
Lambda hosted in Amazon Virtual Private Cloud (Amazon VPC) internally calls the Amazon Connect API using an Amazon Connect VPC interface endpoint powered by AWS PrivateLink.
The contact center agents can also use Amazon Q in Connect to further assist the user.

Prerequisites
The following prerequisites need to be met before you can build the solution:

Have a valid AWS account.
Have an Amazon Q Business Pro subscription to create Amazon Q applications.
Have the service-linked IAM role AWSServiceRoleForQBusiness. If you don’t have one, create it with the amazonaws.com service name.
Have an IAM role in the account that will allow the AWS CloudFormation template to create new roles and add policies. If you have administrator access to the account, no action is required.
Enable logging in AWS CloudTrail for operational and risk auditing.

Okta prerequisites:

Have an Okta developer account and set up an application and API. If you do not have an Okta account, see the following instructions.

Set up an application and API in Okta
Complete the following steps to set up an application and API in Okta:

Log in to the Okta console.
Provide credentials and choose Login.
Choose Continue with Google.
You might need to set up multi-factor authentication following the instructions on the page.
Log in using the authentication code.
In the navigation pane, choose Applications and choose Create App Integration.

Select OIDC - OpenID Connect for Sign-in method and Web Application for Application type, then choose Next.

For App integration name, enter a name (for example, myConnectApp).
Select Authorization Code and Refresh Token for Grant type.
Select Skip group assignment for now for Control Access.
Choose Save to create an application.
Take note of the client ID and secret.

Add Authentication server and metadata

In the navigation pane, choose Security, then choose API.
Choose Add Authorization Server, provide the necessary details, and choose Save.

Take note of the Audience value and choose Metadata URI.

Audience is provided as an input to the CloudFormation template later in the section.

The response will provide the metadata.

From the response, take note of the following:

issuer
authorization_endpoint
token_endpoint

Under Scopes, choose Add Scope, provide the name write/tasks, and choose Create.

On the Access Policies tab, choose Add Policy.
Provide a name and description.
Select The following clients and choose the application by entering my in the text box and choosing the application created earlier.
Choose Create Policy to add a policy.

Choose Add Rule to add a rule and select only Authorization Code for Grant type is.
For Scopes requested, select The following scopes, then enter write in the text box and select the write/tasks scope.
Adjust the Access token lifetime is and Refresh token lifetime is values to the desired number of minutes.
Set but will expire if not used every to 5 minutes.
Choose Create rule to create the rule.

Add users

In the navigation pane, choose Directory and choose People.
Choose Add person.

Complete the fields:

First name
Last name
Username (use the same as the primary email)
Primary email

Select Send user activation email now.
Choose Save to save the user.

You will receive an email. Choose the link in the email to activate the user.
Choose Groups, then choose Add group to add the group.
Provide a name and optional description.
Refresh the page and choose the newly created group.
Choose Assign people to assign users.
Add the newly created user by choosing the plus sign next to the user.

Under Applications, select the application name created earlier.
On the Assignments tab, choose Assign to People.

Select the user and choose Assign.
Choose Done to complete the assignment.

Set up Okta as an identity source in IAM Identity Center
Complete the following steps to set up Okta as an identity source:

Enable an IAM Identity Center instance.
Configure SAML and SCIM with Okta and IAM Identity Center.
On the IAM Identity Center console, navigate to the instance.
Under Settings, copy the value Instance ARN. You will need it when you run the CloudFormation template.

Deploy resources using AWS CloudFormation
In this step, we use a CloudFormation template to deploy a Lambda function, configure the REST API, and create identities. Complete the following steps:

Open the AWS CloudFormation console in the us-east-1 AWS Region.
Choose Create stack.
Download the CloudFormation template and upload it in the Specify template section.
Choose Next.
For Stack name, enter a name (for example, QIntegrationWithConnect).
In the Parameters section, provide values for the following:

Audience
AuthorizationUrl
ClientId
ClientSecret
IdcInstanceArn
Issuer
TokenUrl

Choose Next.
Keep the other values as default and select I acknowledge that AWS CloudFormation might create IAM resources in the Capabilities section.
Select I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND in the Capabilities section.
Choose Submit to create the CloudFormation stack.
After the successful deployment of the stack, on the Outputs tab, note the value for ALBDNSName.

The CloudFormation template does not deploy certificates for Application Load Balancer. We strongly recommend creating a secure listener for the Application Load Balancer and deploying at least one certificate.
Assign user to Amazon Q Application

On the Amazon Q Business console, navigate to the application named qbusiness-connect-case.
Under User Access, choose Manage user access.
On the user tab, choose Add groups and users and search for the user you created in Okta and propagated in IAM Identity Center.
Choose Assign and Done.

Choose Confirm to confirm the subscription.
Copy the link for Deployed URL.

Create a callback URL: <Deployed URL>/oauth/callback.

We recommend that you enable a budget policy notification to prevent unwanted billing.
Configure login credentials for the web application
Complete the following steps to configure login credentials for the web application:

Navigate to the Okta developer login.
Under Applications, choose the web application myConnectApp created earlier.
Choose Edit in the General Settings section.
Enter the callback URL for Sign-in redirect URIs.
Choose Save.

Sync the knowledge base
Complete the following steps to sync your knowledge base:

On the Amazon S3 console, choose Buckets in the navigation pane.
Search for AmazonQDataSourceBucket and choose the bucket.
Download the sample AnyBank regulations document.
Upload the PDF file to the S3 bucket.
On the Amazon Q Business console, navigate to the Amazon Q Business application.
In the Data sources section, select the data source.
Choose Sync now to sync the data source.

Embed the web application
Complete the following steps to embed the web application:

On the Amazon Q Business console, under Enhancements, choose Amazon Q embedded.
Choose Add allowed website.
For Enter website URL, enter http://<ALBDNSName>.

Test the solution
Complete the following steps to test the solution:

Copy the ALBDNSName value from the outputs section of the CloudFormation stack and open it in a browser.

You will see an AnyBank website.

Choose Chat with us and the Okta sign-in page will pop up.
Provide the sign-in details.

Upon verification, close the browser tab.
Navigate to the Amazon Q Business application in the chat window.
In the chat window, enter “What are the Fraud Detection and Prevention Measures?”

Amazon Q Business will provide the answers from the knowledge base.
Next, let’s assume that you detected fraud and want to create a case.

Choose the plugin CreateCase and ask the question, “Can you create a case reporting fraud?”

Amazon Q Business generates the title of the case based on the question.

Choose Submit.
If Amazon Q Business asks you to authorize your access, choose Authorize.

The CreateCase plugin will create a case in Amazon Connect.

Navigate to Amazon Connect and open the access URL in a browser.
Provide the user name admin and retrieve the password from the Parameter Store in AWS Systems Manager.

Choose Agent Workspace.

You can see the case that was created by Amazon Q Business using the custom plugin.

Clean up
To avoid incurring future charges, delete the resources that you created and clean up your account:

Empty the contents of the S3 buckets you created as part of the CloudFormation stack.
Delete the CloudFormation stack you created as part of this post.
Disable the application from IAM Identity Center.

Conclusion
As businesses navigate the ever-changing corporate environment, the combination of Amazon Q Business and Amazon Connect emerges as a transformative approach to optimizing employee assistance and operational effectiveness. Harnessing the capabilities of AI-powered assistants and advanced contact center tools, organizations can empower their teams to access data, initiate support requests, and collaborate cohesively through a unified solution. This post showcased a banking portal, but this can be used for other industrial sectors or organizational verticals.
Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.

About the Authors
Sujatha Dantuluri is a seasoned Senior Solutions Architect in the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences. She has contributed to IEEE standards and is passionate about empowering others through her engaging presentations and thought-provoking ideas.
Dr Anil Giri is a Solutions Architect at Amazon Web Services. He works with enterprise software and SaaS customers to help them build generative AI applications and implement serverless architectures on AWS. His focus is on guiding clients to create innovative, scalable solutions using cutting-edge cloud technologies.

Reasoning Models Know When They’re Right: NYU Researchers Introduce …

Artificial intelligence systems have made significant strides in simulating human-style reasoning, particularly in mathematics and logic. These models don’t just generate answers; they walk through a series of logical steps to reach conclusions, offering insights into how and why those answers are produced. This step-by-step reasoning, often called Chain-of-Thought (CoT), has become vital in how machines handle complex problem-solving tasks.

A common problem researchers encounter with these models is inefficiency during inference. Reasoning models often continue processing even after reaching a correct conclusion. This overthinking results in the unnecessary generation of tokens, increasing computational cost. Whether these models have an internal sense of correctness remains unclear—do they realize when an intermediate answer is right? If they could identify this internally, the models could halt processing earlier, becoming more efficient without losing accuracy.

Many current approaches measure a model’s confidence through verbal prompts or by analyzing multiple outputs. These black-box strategies ask the model to report how sure it is of its answer. However, they are often imprecise and computationally expensive. On the other hand, white-box methods investigate models’ internal hidden states to extract signals that may correlate with answer correctness. Prior work shows that a model’s internal states can indicate the validity of final answers, but applying this to intermediate steps in long reasoning chains is still an underexplored direction.

The research introduced by a team from New York University and NYU Shanghai tackled this gap by designing a lightweight probe—a simple two-layer neural network—to inspect a model’s hidden states at intermediate reasoning steps. The models used for experimentation included the DeepSeek-R1-Distill series and QwQ-32B, known for their step-by-step reasoning capabilities. These models were tested across various datasets involving mathematical and logical tasks. The researchers trained their probe to read the internal state associated with each chunk of reasoning and predict whether the current intermediate answer was correct.

To construct their approach, the researchers first segmented each long CoT output into smaller parts or chunks, using markers like “wait” or “verify” to identify breaks in reasoning. They used the last token’s hidden state in each chunk as a representation and matched this to a correctness label, which was judged using another model. These representations were then used to train the probe on binary classification tasks. The probe was fine-tuned using grid search across hyperparameters like learning rate and hidden layer size, with most models converging to linear probes—indicating that correctness information is often linearly embedded in the hidden states. The probe worked for fully formed answers and showed the ability to predict correctness before an answer was even completed, hinting at look-ahead capabilities.
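
As a rough sketch of what such a probe looks like (the hidden size, probe width, and training loop below are illustrative assumptions, not the paper's exact configuration), the probe is just a small classifier over a chunk's last-token hidden state:

import torch
import torch.nn as nn

class CorrectnessProbe(nn.Module):
    """Two-layer probe mapping a chunk's last-token hidden state to P(intermediate answer is correct)."""
    def __init__(self, hidden_size: int = 5120, probe_dim: int = 256):  # 5120 assumed for a 32B-class model
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size) last-token representation of a reasoning chunk
        return torch.sigmoid(self.net(hidden_state)).squeeze(-1)

# Training sketch: binary cross-entropy against per-chunk correctness labels
probe = CorrectnessProbe()
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

hidden_states = torch.randn(32, 5120)          # stand-in for real chunk representations
labels = torch.randint(0, 2, (32,)).float()    # 1 = intermediate answer judged correct
optimizer.zero_grad()
loss = loss_fn(probe(hidden_states), labels)
loss.backward()
optimizer.step()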

Performance results were clear and quantifiable. The probes achieved ROC-AUC scores exceeding 0.9 for some datasets like AIME when using models like R1-Distill-Qwen-32B. Expected Calibration Errors (ECE) remained under 0.1, showing high reliability. For example, R1-Distill-Qwen-32B had an ECE of just 0.01 on GSM8K and 0.06 on MATH datasets. In application, the probe was used to implement a confidence-based early exit strategy during inference. The reasoning process was stopped when the probe’s confidence in an answer exceeded a threshold. At a confidence threshold of 0.85, the accuracy remained at 88.2%, while the inference token count was reduced by 24%. Even at a threshold of 0.9, accuracy stayed at 88.6%, with a 19% token reduction. Compared to static exit methods, this dynamic strategy achieved up to 5% higher accuracy using the same or fewer tokens.
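
A confidence-based early exit of the kind described above could look roughly like the following; the fixed-size chunking, hidden-state extraction, and threshold handling are illustrative assumptions rather than the authors' implementation.

import torch

@torch.no_grad()
def generate_with_early_exit(model, tokenizer, probe, prompt: str,
                             threshold: float = 0.85, max_chunks: int = 64) -> str:
    """Stop generating once the probe's confidence in the current intermediate answer exceeds the threshold."""
    text = prompt
    for _ in range(max_chunks):
        # Generate the next block of reasoning tokens (a fixed-size stand-in for chunking on markers like "wait")
        inputs = tokenizer(text, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=128,
                             output_hidden_states=True, return_dict_in_generate=True)
        text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)

        # Probe the final layer's hidden state for the last generated token
        last_hidden = out.hidden_states[-1][-1][0, -1]
        if probe(last_hidden.unsqueeze(0)).item() > threshold:
            break  # confident enough: exit early and return the current answer
    return text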

This study offers an efficient, integrated way for reasoning models to self-verify during inference. The researchers’ approach pinpoints a gap—while models inherently know when they’re right, they don’t act on it. The research reveals a path toward smarter, more efficient reasoning systems by leveraging internal representations through probing. It shows that tapping into what the model already “knows” can lead to meaningful performance and resource use improvements.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Reasoning Models Know When They’re Right: NYU Researchers Introduce a Hidden-State Probe That Enables Efficient Self-Verification and Reduces Token Usage by 24% appeared first on MarkTechPost.

Code Implementation to Building a Model Context Protocol (MCP) Server …

In this hands-on tutorial, we’ll build an MCP (Model Context Protocol) server that allows Claude Desktop to fetch stock news sentiment and daily top gainers and losers via the AlphaVantage API. Since most LLMs can’t directly access real-time financial data, this solution uses MCP to provide real-time insights.

We’ll expose two tools from our server:

get_news_sentiment

get_top_movers

Let’s walk through each step.

Step 1: Setting Up the Environment

We will first set up our environment and start with installing the uv package manager. For Mac or Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

For Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

We will then create a new project directory and initialize it with uv

uv init stockNews
cd stockNews

We can now create and activate a virtual environment. For Mac or Linux:

uv venv
source .venv/bin/activate

For Windows:

uv venv
.venv\Scripts\activate

We will now install the required dependencies

uv add mcp httpx python-dotenv

Step 3: Setting Up the Environment Variables

We will now create a .env file that contains the API key for AlphaVantage. To generate a free API key:

Go to https://www.alphavantage.co/

Click on the Get free API key button, or use the following URL: https://www.alphavantage.co/support/#api-key

Enter your email and other required details. You’ll receive an API key—copy it and keep it safe, as this will be used to authenticate your requests.

Now, create a .env file and add the following line:

ALPHA_VANTAGE_API_KEY = your_api_key

Step 4: Implementing the MCP Server and integrating AlphaVantage

First create a stockNews.py file in the directory that we created and add the following code snippets:

Importing packages and setting up the instance:

We will first import the necessary packages and set up instance to use the API

from typing import Any
import os
import httpx
from mcp.server.fastmcp import FastMCP
from dotenv import load_dotenv

# Load .env variables
load_dotenv()
API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")

# Initialize FastMCP server
mcp = FastMCP("alpha-finance")

# Constants
BASE_URL = "https://www.alphavantage.co/query"

Helper functions

Next, let’s add our helper functions for querying the data from AlphaVantage.

async def call_alpha_vantage(endpoint: str, params: dict[str, Any]) -> dict[str, Any] | None:
    """Generic async caller to Alpha Vantage."""
    params["apikey"] = API_KEY
    params["function"] = endpoint
    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(BASE_URL, params=params, timeout=30.0)
            response.raise_for_status()
            return response.json()
        except Exception:
            return None

Implementing tool execution

The tool execution handler is responsible for executing the logic of each tool.

@mcp.tool()
async def get_news_sentiment(ticker: str) -> str:
    """Get news sentiment data for a stock ticker.

    Args:
        ticker: Stock ticker symbol (e.g., MSFT, AAPL)
    """
    data = await call_alpha_vantage("NEWS_SENTIMENT", {"tickers": ticker.upper()})
    if not data or "feed" not in data:
        return "Couldn't retrieve news sentiment."

    articles = data["feed"][:3]
    result = []
    for item in articles:
        result.append(f"""
{item['title']}
Summary: {item['summary']}
Source: {item['source']} | Published: {item['time_published']}
""")
    return "\n---\n".join(result)

@mcp.tool()
async def get_top_movers() -> str:
    """Get top gainers and losers from the stock market.

    No arguments required.
    """
    data = await call_alpha_vantage("TOP_GAINERS_LOSERS", {})
    if not data:
        return "Couldn't retrieve top movers."

    gainers = data.get("top_gainers", [])[:3]
    losers = data.get("top_losers", [])[:3]

    result = "**Top Gainers**\n"
    result += "\n".join([
        f"{g['ticker']} ({g.get('change_percentage', 'N/A')})"
        for g in gainers
    ])

    result += "\n\n**Top Losers**\n"
    result += "\n".join([
        f"{l['ticker']} ({l.get('change_percentage', 'N/A')})"
        for l in losers
    ])

    return result

Running the server

Finally, let’s initialize and run the server:

if __name__ == "__main__":
    mcp.run(transport="stdio")

We will now test our server from an existing MCP host, Claude for Desktop.

Step 5: Testing the server

First, ensure you have Claude for Desktop installed. If not, download and install the latest version from the official source. If you already have it, make sure it’s up to date.

Next, you’ll need to configure Claude to connect with your MCP server. To do this, open the claude_desktop_config.json file located in the Claude directory using any text editor. If the file doesn’t exist, go ahead and create it manually.

For MacOS/Linux:

{
  "mcpServers": {
    "stockNews": {
      "command": "uv",
      "args": [
        "--directory",
        "/ABSOLUTE/PATH/TO/PARENT/FOLDER/stockNews",
        "run",
        "stockNews.py"
      ]
    }
  }
}

For Windows:

{
  "mcpServers": {
    "stockNews": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\ABSOLUTE\\PATH\\TO\\PARENT\\FOLDER\\stockNews",
        "run",
        "stockNews.py"
      ]
    }
  }
}

This configuration lets Claude for Desktop know that:

There’s an MCP server called “stockNews”.

It should be launched using the following command: uv --directory /ABSOLUTE/PATH/TO/PARENT/FOLDER/stockNews run stockNews.py

Once you’ve added this to your config file, save the file and restart Claude for Desktop to apply the changes.

Test with commands

To confirm that Claude for Desktop has recognized the two tools from your stockNews server, look for the hammer icon in the Claude interface — this icon indicates tool access.

After clicking on the hammer icon, you should see the two tools listed: get_news_sentiment and get_top_movers.

We can test the server by running the following prompts:

What is the news sentiment for Apple?

Who are the top gainers and losers from the stock market?

When you ask Claude a question:

The client sends your query to Claude.

Claude reviews the available tools (like get_news_sentiment or get_top_movers) and determines which one(s) to use based on your question.

The selected tool is executed via the MCP server you configured earlier.

The tool returns the results back to Claude.

Claude uses those results to craft a natural language response.

The final response is shown to you in the chat interface.

This seamless flow is what allows Claude to interact with real-time data in a structured and controlled way.

Conclusion:

Our MCP-based stock insights server extends Claude Desktop’s capabilities by enabling real-time financial data retrieval. By integrating the AlphaVantage API with a custom MCP server, users can fetch live news sentiment and track top market movers directly through Claude. This setup empowers users with timely, actionable stock insights—all within a conversational interface—making financial analysis more efficient, contextual, and interactive.
The post Code Implementation to Building a Model Context Protocol (MCP) Server and Connecting It with Claude Desktop appeared first on MarkTechPost.