Boosting team productivity with Amazon Q Business Microsoft 365 integr …

Amazon Q Business, with its enterprise grade security, seamless integration with multiple diverse data sources, and sophisticated natural language understanding, represents the next generation of AI business assistants. What sets Amazon Q Business apart is its support for enterprise requirements, from its ability to integrate with company documentation to its adaptability to specific business terminology and context-aware responses. Combined with comprehensive customization options, Amazon Q Business is transforming how organizations enhance their document processing and business operations.
Amazon Q Business integration with Microsoft 365 applications offers powerful AI assistance directly within the tools that your team already uses daily.
In this post, we explore how these integrations for Outlook and Word can transform your workflow.
Prerequisites
Before you get started, make sure that you have the following prerequisites in place:

Create an Amazon Q Business application. For instructions, see Configuring an Amazon Q Business application using AWS IAM Identity Center.
Access to the Microsoft Entra admin center.
Microsoft Entra tenant ID (this should be treated as sensitive information). For instructions, see How to find your Microsoft Entra tenant ID.

Set up Amazon Q Business M365 integrations
Follow the steps below to set up Microsoft 365 integrations with Amazon Q Business.

Go to the AWS Management Console for Amazon Q Business and choose Enhancements, then Integrations. On the Integrations page, choose Add integrations.
Select Outlook or Word. In this example, we selected Outlook.
Under Integration name, enter a name for the integration. Under Workspace, enter your Microsoft Entra tenant ID. Leave the remaining values as the default, and choose Add Integration.
After the integration is successfully deployed, select the integration name and copy the manifest URL to use in a later step.
Go to the Microsoft admin center. Under Settings, choose Integrated apps, and then choose Upload custom apps.
Choose App type and then select Office Add-in. For Provide link to the manifest file, enter the manifest URL that you copied from the Amazon Q Business console. Choose Validate.
On the User page, add users, choose Accept permissions, and then choose Finish Deployment.

Amazon Q Business in Outlook: Email efficiency reimagined
By integrating Amazon Q Business with Outlook, you have access to several tools to improve email efficiency. To access these tools, select the Amazon Q Business icon at the top right of the email section in Outlook. Amazon Q Business can help you summarize an email thread, extract insights and action items, and suggest follow-ups.

Email summarization: Quickly understand the key points of lengthy email threads by choosing Summarize in the Amazon Q Business sidebar.
Draft responses: Generate contextually appropriate email replies based on the conversation history and insert them directly into your email draft from the Amazon Q Business sidebar.

Received email: Hi team, I wanted to share the key points from today’s quarterly strategy meeting with John Doe and the leadership team.
Key Takeaways:
Q4 2024 targets were exceeded by 12%, setting a strong foundation for 2025
New product launch timeline confirmed for July 2025
Need to accelerate hiring for the technical team (6 positions to fill by end of Q2)
Action Items:
John Smith will finalize the budget allocation for Q2 by March 5
Marketing team to present updated campaign metrics next week
HR to fast track technical recruitment process
Sales team to provide updated pipeline report by Friday
Project Updates:
Project Phoenix is on track for May deployment
Customer feedback program launching next month
International expansion plans under review
Next Steps:
Follow-up meeting scheduled for March 12 at 2 PM EST
Department heads to submit Q2 objectives by March 1
John to distribute updated organizational chart next week
Please let me know if I missed anything or if you have any questions.
Best regards, Jane Doe

Amazon Q Business draft reply: You will see a draft reply in the Amazon Q Business sidebar. Choose the highlighted icon at the bottom of the sidebar to create an email using the draft reply.

Hi Jane, Thank you for sharing the meeting notes from yesterday’s Q1 Strategy Review. The summary is very helpful. I noticed the impressive Q4 results and the confirmed July product launch timeline. The hiring acceleration for the technical team seems to be a priority we should focus on. I’ll make note of all the action items, particularly the March 1 deadline for Q2 objectives submission. I’ll also block my calendar for the follow up meeting on March 12 at 2 PM EST. Is there anything specific you’d like me to help with regarding any of these items? I’m particularly interested in the Project Phoenix deployment and the customer feedback program.
Thanks again for the comprehensive summary. Regards

Meeting preparation: Extract action items by choosing Action items and next steps in the Amazon Q Business sidebar. Also find important details from email conversations by asking questions in the Amazon Q Business sidebar chat box.

Amazon Q Business in Word: Content creation accelerated
You can select Amazon Q Business in the top right corner of the Word document to access Amazon Q Business. You can access the Amazon Q Business document processing features from the Word context menu when you highlight text. You can also access Amazon Q Business in the sidebar when working in a Word document. When you select a document processing feature, the output will appear in the Amazon Q Business sidebar, as shown in the following figure.

You can use Amazon Q Business in Word to summarize, explain, simplify, or fix the content of a Word document.

Summarize: Document summarization is a powerful capability of Amazon Q Business that you can use to quickly extract key information from lengthy documents. This feature uses natural language processing to identify the most important concepts, facts, and insights within text documents, then generates concise summaries that preserve the essential meaning while significantly reducing reading time. You can customize the summary length and focus areas based on your specific needs, making it straightforward to process large volumes of information efficiently. Document summarization helps professionals across industries quickly grasp the core content of reports, research papers, articles, and other text-heavy materials without sacrificing comprehension of critical details. To summarize a document, select Amazon Q Business from the ribbon, choose Summarize from the Amazon Q Business sidebar and enter a prompt describing what type of summary you want.

Quickly understand the key points of a lengthy Word document by choosing Summarize in the Amazon Q Business sidebar.

Simplify: The Amazon Q Business Word add-in analyzes documents in real time, identifying overly complex sentences, jargon, and verbose passages that might confuse readers. You can have Amazon Q Business rewrite selected text or entire documents to improve readability while maintaining the original meaning. The Simplify feature is particularly valuable for professionals who need to communicate technical information to broader audiences, educators creating accessible learning materials, or anyone looking to enhance the clarity of their written communication without spending hours manually editing their work.

Select the passage in the Word document and choose Simplify in the Amazon Q Business sidebar.

Explain: You can use Amazon Q Business to help you better understand complex content within your documents. You can select difficult terms, technical concepts, or confusing passages and receive clear, contextual explanations. Amazon Q Business analyzes the selected text and generates comprehensive explanations tailored to your needs, including definitions, simplified descriptions, and relevant examples. This functionality is especially beneficial for professionals working with specialized terminology, students navigating academic papers, or anyone encountering unfamiliar concepts in their reading. The Explain feature transforms the document experience from passive consumption to interactive learning, making complex information more accessible to users.

Select the passage in the Word document and choose Explain in the Amazon Q Business sidebar.

Fix: Amazon Q Business scans the selected passages for grammatical errors, spelling mistakes, punctuation problems, inconsistent formatting, and stylistic improvements, and resolves those issues. This functionality is invaluable for professionals preparing important business documents, students finalizing academic papers, or anyone seeking to produce polished, accurate content without the need for extensive manual proofreading. This feature significantly reduces editing time while improving document quality.

Select the passage in the Word document and choose Fix in the Amazon Q Business sidebar.

Measuring impact
Amazon Q Business helps measure the effectiveness of the solution by empowering users to provide feedback. Feedback information is stored in Amazon CloudWatch Logs, where admins can review it to identify issues and improvements.
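For admins who want to review this feedback programmatically, the following is a minimal sketch using boto3; the log group name and filter pattern are illustrative assumptions rather than fixed Amazon Q Business names, so substitute the log group created for your application.

import boto3

logs = boto3.client("logs")
# Pull recent log events that mention feedback (hypothetical log group and filter)
response = logs.filter_log_events(
    logGroupName="/aws/qbusiness/<application-id>",
    filterPattern="feedback",
    limit=50,
)
for event in response["events"]:
    print(event["timestamp"], event["message"])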

Clean up
When you are done testing Amazon Q Business integrations, you can remove them through the Amazon Q Business console.

In the console, choose Applications and select your application ID.
Select Integrations.
On the Integrations page, select the integration that you created.
Choose Delete.

Conclusion
Amazon Q Business integrations with Microsoft 365 applications represent a significant opportunity to enhance your team’s productivity. By bringing AI assistance directly into the tools where work happens, teams can focus on higher-value activities while maintaining quality and consistency in their communications and documents.
Experience the power of Amazon Q Business by exploring its seamless integration with your everyday business tools. Start enhancing your productivity today by visiting the Amazon Q Business User Guide to understand the full potential of this AI-powered solution. Transform your email communication with our Microsoft Outlook integration and revolutionize your document creation process with our Microsoft Word features. To discover the available integrations that can streamline your workflow, see Integrations.

About the author

Leo Mentis Raj Selvaraj is a Sr. Specialist Solutions Architect – GenAI at AWS with 4.5 years of experience, currently guiding customers through their GenAI implementation journeys. Previously, he architected data platform and analytics solutions for strategic customers using a comprehensive range of AWS services including storage, compute, databases, serverless, analytics, and ML technologies. Leo also collaborates with internal AWS teams to drive product feature development based on customer feedback, contributing to the evolution of AWS offerings.

Integrate Amazon Bedrock Agents with Slack

As companies increasingly adopt generative AI applications, AI agents capable of delivering tangible business value have emerged as a crucial component. In this context, integrating custom-built AI agents within chat services such as Slack can be transformative, providing businesses with seamless access to AI assistants powered by sophisticated foundation models (FMs). After an AI agent is developed, the next challenge lies in incorporating it in a way that provides straightforward and efficient use. Organizations have several options: integration into existing web applications, development of custom frontend interfaces, or integration with communication services such as Slack. The third option—integrating custom AI agents with Slack—offers a simpler and quicker implementation path you can follow to summon the AI agent on-demand within your familiar work environment.
This solution drives team productivity through faster query responses and automated task handling, while minimizing operational overhead. The pay-per-use model optimizes cost as your usage scales, making it particularly attractive for organizations starting their AI journey or expanding their existing capabilities.
There are numerous practical business use cases for AI agents, each offering measurable benefits and significant time savings compared to traditional approaches. Examples include a knowledge base agent that instantly surfaces company documentation, reducing search time from minutes to seconds, and a compliance checker agent that facilitates policy adherence in real time, potentially saving hours of manual review. Sales analytics agents provide immediate insights, alleviating the need for time-consuming data compilation and analysis. AI agents for IT support help with common technical issues, often resolving problems faster than human agents.
These AI-powered solutions enhance user experience through contextual conversations, providing relevant assistance based on the current conversation and query context. This natural interaction model improves the quality of support and helps drive user adoption across the organization. You can follow this implementation approach to provide the solution to your Slack users in use cases where quick access to AI-powered insights would benefit team workflows. By integrating custom AI agents, organizations can track improvements in key performance indicators (KPIs) such as mean time to resolution (MTTR), first-call resolution rates, and overall productivity gains, demonstrating the practical benefits of AI agents powered by large language models (LLMs).
In this post, we present a solution to incorporate Amazon Bedrock Agents in your Slack workspace. We guide you through configuring a Slack workspace, deploying integration components in Amazon Web Services (AWS), and using this solution.
Solution overview
The solution consists of two main components: the Slack to Amazon Bedrock Agents integration infrastructure and either your existing Amazon Bedrock agent or a sample agent we provide for testing. The integration infrastructure handles the communication between Slack and the Amazon Bedrock agent, and the agent processes and responds to the queries.
The solution uses Amazon API Gateway, AWS Lambda, AWS Secrets Manager, and Amazon Simple Queue Service (Amazon SQS) for a serverless integration. This alleviates the need for always-on infrastructure, helping to reduce overall costs because you only pay for actual usage.
Amazon Bedrock agents automate workflows and repetitive tasks while securely connecting to your organization’s data sources to provide accurate responses.
An action group defines actions that the agent can help the user perform. This way, you can integrate business logic with your backend services by having your agent process and manage incoming requests. The agent also maintains context throughout conversations, uses chain-of-thought reasoning, and enables more personalized interactions.
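As an illustration of how an action group connects to backend logic, the following is a hedged sketch of a Lambda function using the function-details request and response shape for Amazon Bedrock Agents; the action name and business logic are hypothetical, and you should verify the payload fields against the Amazon Bedrock Agents documentation for your configuration.

def handler(event, context):
    # The agent passes the action group, the function name, and the extracted parameters
    function_name = event.get("function")
    parameters = {p["name"]: p["value"] for p in event.get("parameters", [])}

    if function_name == "get_weather":  # hypothetical action defined in the action group
        body = f"Sunny and 22 C in {parameters.get('city', 'the requested city')}"
    else:
        body = "Unsupported action"

    # Return the result in the structure the agent expects for function-details action groups
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "function": function_name,
            "functionResponse": {"responseBody": {"TEXT": {"body": body}}},
        },
    }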
The following diagram represents the solution architecture, which contains two key sections:

Section A – The Amazon Bedrock agent and its components are included in this section. With this part of the solution, you can either connect your existing agent or deploy our sample agent using the provided AWS CloudFormation template.
Section B – This section contains the integration infrastructure (API Gateway, Secrets Manager, Lambda, and Amazon SQS) that’s deployed by a CloudFormation template.

The request flow consists of the following steps:

A user sends a message in Slack to the bot by using @appname.
Slack sends a webhook POST request to the API Gateway endpoint.
The request is forwarded to the verification Lambda function.
The Lambda function retrieves the Slack signing secret and bot token to verify request authenticity.
After verification, the message is sent to a second Lambda function.
Before putting the message in the SQS queue, the Amazon SQS integration Lambda function sends a “Processing your request…” message to the user in Slack within a thread under the original message.
Messages are sent to the FIFO (First-In-First-Out) queue for processing, using the channel and thread ID to help prevent message duplication.
The SQS queue triggers the Amazon Bedrock integration Lambda function.
The Lambda function invokes the Amazon Bedrock agent with the user’s query, and the agent processes the request and responds with the answer.
The Lambda function updates the initial “Processing your request…” message in the Slack thread with either the final agent’s response or, if debug mode is enabled, the agent’s reasoning process (a minimal sketch of these final steps follows this list).
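The following is a minimal sketch of those final steps: an SQS-triggered Lambda handler that invokes the agent through the Amazon Bedrock Agents runtime and then replaces the placeholder message using Slack’s chat.update API. The queued message fields, environment variable names, and session handling are assumptions for illustration, not the exact implementation in the provided CloudFormation template.

import json
import os
import urllib.request

import boto3

agents_runtime = boto3.client("bedrock-agent-runtime")

def handler(event, context):
    for record in event["Records"]:
        # Assumed payload shape written by the SQS integration Lambda function
        body = json.loads(record["body"])

        # Invoke the Amazon Bedrock agent with the user's query
        response = agents_runtime.invoke_agent(
            agentId=os.environ["BEDROCK_AGENT_ID"],
            agentAliasId=os.environ["BEDROCK_AGENT_ALIAS_ID"],
            sessionId=body["thread_ts"],  # reuse the Slack thread as the conversation session
            inputText=body["text"],
        )
        answer = "".join(
            part["chunk"]["bytes"].decode("utf-8")
            for part in response["completion"]
            if "chunk" in part
        )

        # Replace the "Processing your request..." placeholder in the Slack thread
        payload = json.dumps(
            {"channel": body["channel"], "ts": body["placeholder_ts"], "text": answer}
        ).encode("utf-8")
        request = urllib.request.Request(
            "https://slack.com/api/chat.update",
            data=payload,
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}",
            },
        )
        urllib.request.urlopen(request)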

Prerequisites
You must have the following in place to complete the solution in this post:

An AWS account
A Slack account (two options):

For company Slack accounts, work with your administrator to create and publish the integration application, or you can use a sandbox organization
Alternatively, create your own Slack account and workspace for testing and experimentation

Model access in Amazon Bedrock for Anthropic’s Claude 3.5 Sonnet in the same AWS Region where you’ll deploy this solution (if using your own agent, you can skip this requirement)
The accompanying CloudFormation templates provided in GitHub repo:

Sample Amazon Bedrock agent (virtual-meteorologist)
Slack integration to Amazon Bedrock Agents

Create a Slack application in your workspace
Creating applications in Slack requires specific permissions that vary by organization. If you don’t have the necessary access, you’ll need to contact your Slack administrator. The screenshots in this walkthrough are from a personal Slack account and are intended to demonstrate the implementation process that can be followed for this solution.

Go to Slack API and choose Create New App

In the Create an app pop-up, choose From scratch

For App Name, enter virtual-meteorologist
For Pick a workspace to develop your app in, choose the workspace where you want to use this application
Choose Create App

After the application is created, you’ll be taken to the Basic Information page.

In the navigation pane under Features, choose OAuth & Permissions
Navigate to the Scopes section and under Bot Tokens Scopes, add the following scopes by choosing Add an OAuth Scope and entering im:read, im:write, and chat:write

On the OAuth & Permissions page, navigate to the OAuth Tokens section and choose Install to {Workspace}
On the following page, choose Allow to complete the process

On the OAuth & Permissions page, navigate to OAuth Tokens and copy the value for Bot User OAuth Token that has been created. Save this in a notepad to use later when you’re deploying the CloudFormation template.

In the navigation pane under Settings, choose Basic Information
Navigate to Signing Secret and choose Show
Copy and save this value to your notepad to use later when you’re deploying the CloudFormation template

Deploy the sample Amazon Bedrock agent resources with AWS CloudFormation
If you already have an Amazon Bedrock agent configured, you can copy its ID and alias from the agent details. If you don’t, then when you run the CloudFormation template for the sample Amazon Bedrock agent (virtual-meteorologist), the following resources are deployed (costs will be incurred for the AWS resources used):

Lambda functions:

GeoCoordinates – Converts location names to latitude and longitude coordinates
Weather – Retrieves weather information using coordinates
DateTime – Gets current date and time for specific time zones

AWS Identity and Access Management (IAM) roles:

GeoCoordinatesRole – Role for GeoCoordinates Lambda function
WeatherRole – Role for Weather Lambda function
DateTimeRole – Role for DateTime Lambda function
BedrockAgentExecutionRole – Role for Amazon Bedrock agent execution

Lambda permissions:

GeoCoordinatesLambdaPermission – Allows Amazon Bedrock to invoke the GeoCoordinates Lambda function
WeatherLambdaPermission – Allows Amazon Bedrock to invoke the Weather Lambda function
DateTimeLambdaPermission – Allows Amazon Bedrock to invoke the DateTime Lambda function

Amazon Bedrock agent:

BedrockAgent – Virtual meteorologist agent configured with three action groups

Amazon Bedrock agent action groups:

obtain-latitude-longitude-from-place-name
obtain-weather-information-with-coordinates
get-current-date-time-from-timezone

Choose Launch Stack to deploy the resources:

After deployment is complete, navigate to the Outputs tab and copy the BedrockAgentId and BedrockAgentAliasID values. Save these to a notepad to use later when deploying the Slack integration to Amazon Bedrock Agents CloudFormation template.

Deploy the Slack integration to Amazon Bedrock Agents resources with AWS CloudFormation
When you run the CloudFormation template to integrate Slack with Amazon Bedrock Agents, the following resources are deployed (costs will be incurred for the AWS resources used):

API Gateway:

SlackAPI – A REST API for Slack interactions

Lambda functions:

MessageVerificationFunction – Verifies Slack message signatures and tokens
SQSIntegrationFunction – Handles message queueing to Amazon SQS
BedrockAgentsIntegrationFunction – Processes messages with the Amazon Bedrock agent

IAM roles:

MessageVerificationFunctionRole – Role for MessageVerificationFunction Lambda function permissions
SQSIntegrationFunctionRole – Role for SQSIntegrationFunction Lambda function permissions
BedrockAgentsIntegrationFunctionRole – Role for BedrockAgentsIntegrationFunction Lambda function permissions

SQS queues:

ProcessingQueue – FIFO queue for ordered message processing
DeadLetterQueue – FIFO queue for failed message handling

Secrets Manager secret:

SlackBotTokenSecret – Stores Slack credentials securely

Choose Launch Stack to deploy these resources:

Provide your preferred stack name. When deploying the CloudFormation template, you’ll need to provide four values: the Slack bot user OAuth token, the signing secret from your Slack configuration, and the BedrockAgentId and BedrockAgentAliasID values saved earlier. If your agent is in draft version, use TSTALIASID as the BedrockAgentAliasID. Although our example uses a draft version, you can use the alias ID of your published version if you’ve already published your agent.

Keep SendAgentRationaleToSlack set to False by default. However, if you want to troubleshoot or observe how Amazon Bedrock Agents processes your questions, you can set this to True. This way, you can receive detailed processing information in the Slack thread where you invoked the Slack application.
When deployment is complete, navigate to the Outputs tab and copy the WebhookURL value. Save this to your notepad to use in your Slack configuration in the next step.

Integrate Amazon Bedrock Agents with your Slack workspace
Complete the following steps to integrate Amazon Bedrock Agents with your Slack workspace:

Go to Slack API and choose the virtual-meteorologist application

In the navigation pane, choose Event Subscriptions
On the Event Subscriptions page, turn on Enable Events
Enter your previously copied API Gateway URL for Request URL—verification will happen automatically
For Subscribe to bot events, choose the Add Bot User Event button and add app_mention and message.im
Choose Save Changes
Choose Reinstall your app and choose Allow on the following page

Test the Amazon Bedrock Agents bot application in Slack
Return to Slack and locate virtual-meteorologist in the Apps section. After you add this application to your channel, you can interact with the Amazon Bedrock agent by using @virtual-meteorologist to get weather information.

Let’s test it with some questions. When we ask about today’s weather in Chicago, the application first sends a “Processing your request…” message as an initial response. After the Amazon Bedrock agent completes its analysis, this temporary message is replaced with the actual weather information.

You can ask follow-up questions within the same thread, and the Amazon Bedrock agent will maintain the context from your previous conversation. To start a new conversation, use @virtual-meteorologist in the main channel instead of the thread.

Clean up
If you decide to stop using this solution, complete the following steps to remove it and its associated resources deployed using AWS CloudFormation:

Delete the Slack integration CloudFormation stack:

On the AWS CloudFormation console, choose Stacks in the navigation pane
Locate the stack you created for the Slack integration for Amazon Bedrock Agents during the deployment process (you assigned a name to it)
Select the stack and choose Delete

If you deployed the sample Amazon Bedrock agent (virtual-meteorologist), repeat these steps to delete the agent stack

Considerations
When designing serverless architectures, separating Lambda functions by purpose offers significant advantages in terms of maintenance and flexibility. This design pattern allows for straightforward behavior modifications and customizations without impacting the overall system logic. Each request involves two Lambda functions: one for token validation and another for SQS payload processing. During high-traffic periods, managing concurrent executions across both functions requires attention to Lambda concurrency limits. For use cases where scaling is a critical concern, combining these functions into a single Lambda function might be an alternative approach, or you could consider using services such as Amazon EventBridge to help manage the event flow between components. Consider your use case and traffic patterns when choosing between these architectural approaches.
Summary
This post demonstrated how to integrate Amazon Bedrock Agents with Slack, a widely used enterprise collaboration tool. After creating your specialized Amazon Bedrock Agents, this implementation pattern shows how to quickly integrate them into Slack, making them readily accessible to your users. The integration enables AI-powered solutions that enhance user experience through contextual conversations within Slack, improving the quality of support and driving user adoption. You can follow this implementation approach to provide the solution to your Slack users in use cases where quick access to AI-powered insights would benefit team workflows. By integrating custom AI agents, organizations can track improvements in KPIs such as mean time to resolution (MTTR), first-call resolution rates, and overall productivity gains, showcasing the practical benefits of Amazon Bedrock Agents in enterprise collaboration settings.
We provided a sample agent to help you test and deploy the complete solution. Organizations can now quickly implement their Amazon Bedrock agents and integrate them into Slack, allowing teams to access powerful generative AI capabilities through a familiar interface they use daily. Get started today by developing your own agent using Amazon Bedrock Agents.
Additional resources
To learn more about building Amazon Bedrock Agents, refer to the following resources:

Build a FinOps agent using Amazon Bedrock with multi-agent capability and Amazon Nova as the foundation model
Building a virtual meteorologist using Amazon Bedrock Agents
Build a gen AI–powered financial assistant with Amazon Bedrock multi-agent collaboration

About the Authors
Salman Ahmed is a Senior Technical Account Manager in AWS Enterprise Support. He specializes in guiding customers through the design, implementation, and support of AWS solutions. Combining his networking expertise with a drive to explore new technologies, he helps organizations successfully navigate their cloud journey. Outside of work, he enjoys photography, traveling, and watching his favorite sports teams.
Sergio Barraza is a Senior Technical Account Manager at AWS, helping customers design and optimize cloud solutions. With more than 25 years in software development, he guides customers through AWS services adoption. Outside work, Sergio is a multi-instrument musician playing guitar, piano, and drums, and he also practices Wing Chun Kung Fu.
Ravi Kumar is a Senior Technical Account Manager in AWS Enterprise Support who helps customers in the travel and hospitality industry to streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience. In his free time, Ravi enjoys creative activities like painting. He also likes playing cricket and traveling to new places.
Ankush Goyal is an Enterprise Support Lead in AWS Enterprise Support who helps customers streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience.

Secure distributed logging in scalable multi-account deployments using …

Data privacy is a critical issue for software companies that provide services in the data management space. If they want customers to trust them with their data, software companies need to show and prove that their customers’ data will remain confidential and within controlled environments. Some companies go to great lengths to maintain confidentiality, sometimes adopting multi-account architectures, where each customer has their data in a separate AWS account. By isolating data at the account level, software companies can enforce strict security boundaries, help prevent cross-customer data leaks, and support adherence with industry regulations such as HIPAA or GDPR with minimal risk.
Multi-account deployment represents the gold standard for cloud data privacy, allowing software companies to make sure customer data remains segregated even at massive scale, with AWS accounts providing security isolation boundaries as highlighted in the AWS Well-Architected Framework. Software companies increasingly adopt generative AI capabilities like Amazon Bedrock, which provides fully managed foundation models with comprehensive security features. However, managing a multi-account deployment powered by Amazon Bedrock introduces unique challenges around access control, quota management, and operational visibility that could complicate its implementation at scale. Constantly requesting and monitoring quota for invoking foundation models on Amazon Bedrock becomes a challenge when the number of AWS accounts reaches double digits. One approach to simplify operations is to configure a dedicated operations account to centralize management while data from customers transits through managed services and is stored at rest only in their respective customer accounts. By centralizing operations in a single account while keeping data in different accounts, software companies can simplify the management of model access and quotas while maintaining strict data boundaries and security isolation.
In this post, we present a solution for securing distributed logging multi-account deployments using Amazon Bedrock and LangChain.
Challenges in logging with Amazon Bedrock
Observability is crucial for effective AI implementations—organizations can’t optimize what they don’t measure. Observability can help with performance optimization, cost management, and model quality assurance. Amazon Bedrock offers built-in invocation logging to Amazon CloudWatch or Amazon Simple Storage Service (Amazon S3) through a configuration on the AWS Management Console, and individual logs can be routed to different CloudWatch accounts with cross-account sharing, as illustrated in the following diagram.

Routing logs to each customer account presents two challenges: logs containing customer data would be stored in the operations account for the user-defined retention period (at least 1 day), which might not comply with strict privacy requirements, and CloudWatch has a limit of five monitoring accounts (customer accounts). With these limitations, how can organizations build a secure logging solution that scales across multiple tenants and customers?
In this post, we present a solution for enabling distributed logging for Amazon Bedrock in multi-account deployments. The objective of this design is to provide robust AI observability while maintaining strict privacy boundaries for data at rest by keeping logs exclusively within the customer accounts. This is achieved by moving logging to the customer accounts rather than invoking it from the operations account. By configuring the logging instructions in each customer’s account, software companies can centralize AI operations while enforcing data privacy, by keeping customer data and logs within strict data boundaries in each customer’s account. This architecture uses AWS Security Token Service (AWS STS) to allow customer accounts to assume dedicated roles in AWS Identity and Access Management (IAM) in the operations account while invoking Amazon Bedrock. For logging, this solution uses LangChain callbacks to capture invocation metadata directly in each customer’s account, making the entire process in the operations account memoryless. Callbacks can be used to log token usage, performance metrics, and the overall quality of the model in response to customer queries. The proposed solution balances centralized AI service management with strong data privacy, making sure customer interactions remain within their dedicated environments.
Solution overview
The complete flow of model invocations on Amazon Bedrock is illustrated in the following figure. The operations account is the account where the Amazon Bedrock permissions will be managed using an identity-based policy, where the Amazon Bedrock client will be created, and where the IAM role with the correct permissions will exist. Every customer account will assume a different IAM role in the operations account. The customer accounts are where customers will access the software or application. This account will contain an IAM role that will assume the corresponding role in the operations account, to allow Amazon Bedrock invocations. It is important to note that it is not necessary for these two accounts to exist in the same AWS organization. In this solution, we use an AWS Lambda function to invoke models from Amazon Bedrock, and use LangChain callbacks to write invocation data to CloudWatch. Without loss of generality, the same principle can be applied to other forms of compute such as servers in Amazon Elastic Compute Cloud (Amazon EC2) instances or managed containers on Amazon Elastic Container Service (Amazon ECS).

The sequence of steps in a model invocation are:

The process begins when the IAM role in the customer account assumes the role in the operations account, allowing it to access the Amazon Bedrock service. This is accomplished through the AWS STS AssumeRole API operation, which establishes the necessary cross-account relationship.
The operations account verifies that the requesting principal (IAM role) from the customer account is authorized to assume the role it is targeting. This verification is based on the trust policy attached to the IAM role in the operations account. This step makes sure that only authorized customer accounts and roles can access the centralized Amazon Bedrock resources.
After trust relationship verification, temporary credentials (access key ID, secret access key, and session token) with specified permissions are returned to the customer account’s IAM execution role.
The Lambda function in the customer account invokes the Amazon Bedrock client in the operations account. Using temporary credentials, the customer account’s IAM role sends prompts to Amazon Bedrock through the operations account, consuming the operations account’s model quota.
After the Amazon Bedrock client response returns to the customer account, LangChain callbacks log the response metrics directly into CloudWatch in the customer account.
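The following is a minimal sketch of this flow from the customer account's perspective, assuming boto3; the role ARN, external ID, and model ID are placeholders that you would replace with your own values.

import boto3

# Assume the per-customer role in the operations account through AWS STS
sts = boto3.client("sts")
credentials = sts.assume_role(
    RoleArn="arn:aws:iam::<operations-account-id>:role/<customer-1-role>",
    RoleSessionName="bedrock-invocation",
    ExternalId="<external-id>",  # must match the value in the role's trust policy
)["Credentials"]

# Create the Amazon Bedrock client with the temporary credentials, so invocations
# consume the operations account's model quota while running from the customer account
bedrock_runtime = boto3.client(
    "bedrock-runtime",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

response = bedrock_runtime.converse(
    modelId="<model-id>",
    messages=[{"role": "user", "content": [{"text": "Summarize this customer ticket."}]}],
)
print(response["output"]["message"]["content"][0]["text"])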

Enabling cross-account access with IAM roles
The key idea in this solution is that there will be an IAM role per customer in the operations account. The software company will manage this role and assign permissions to define aspects such as which models can be invoked, in which AWS Regions, and what quotas they’re subject to. This centralized approach significantly simplifies the management of model access and permissions, especially when scaling to hundreds or thousands of customers. For enterprise customers with multiple AWS accounts, this pattern is particularly valuable because it allows the software company to configure a single role that can be assumed by a number of the customer’s accounts, providing consistent access policies and simplifying both permission management and cost tracking. Through carefully crafted trust relationships, the operations account maintains control over who can access what, while still enabling the flexibility needed in complex multi-account environments.
The IAM role can have one or more policies attached. For example, the following policy allows a certain customer to invoke some models:

{
    "Version": "2012-10-17",
    "Statement": {
        "Sid": "AllowInference",
        "Effect": "Allow",
        "Action": [
            "bedrock:Converse",
            "bedrock:ConverseStream",
            "bedrock:GetAsyncInvoke",
            "bedrock:InvokeModel",
            "bedrock:InvokeModelWithResponseStream",
            "bedrock:StartAsyncInvoke"
        ],
        "Resource": "arn:aws:bedrock:*::foundation-model/<model-id>"
    }
}

The control would be implemented at the trust relationship level, where we would only allow some accounts to assume that role. For example, in the following trust policy, the role for customer 1 can only be assumed by the allowed AWS account when the ExternalId matches a specified value, with the purpose of preventing the confused deputy problem:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonBedrockModelInvocationCustomer1",
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "<account-customer-1>",
                    "sts:ExternalId": "<external-id>"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:bedrock::<account-customer-1>:*"
                }
            }
        }
    ]
}

AWS STS AssumeRole operations constitute the cornerstone of secure cross-account access within multi-tenant AWS environments. By implementing this authentication mechanism, organizations establish a robust security framework that enables controlled interactions between the operations account and individual customer accounts. The operations team grants precisely scoped access to resources across the customer accounts, with permissions strictly governed by the assumed role’s trust policy and attached IAM permissions. This granular control makes sure that the operational team and customers can perform only authorized actions on specific resources, maintaining strong security boundaries between tenants.
As organizations scale their multi-tenant architectures to encompass thousands of accounts, the performance characteristics and reliability of these cross-account authentication operations become increasingly critical considerations. Engineering teams must carefully design their cross-account access patterns to optimize for both security and operational efficiency, making sure that authentication processes remain responsive and dependable even as the environment grows in complexity and scale.
When considering the service quotas that govern these operations, it’s important to note that AWS STS requests made using AWS credentials are subject to a default quota of 600 requests per second, per account, per Region—including AssumeRole operations. A key architectural advantage emerges in cross-account scenarios: only the account initiating the AssumeRole request (customer account) counts against its AWS STS quota; the target account’s (operations account) quota remains unaffected. This asymmetric quota consumption means that the operations account doesn’t deplete its AWS STS service quotas when responding to API requests from customer accounts. For most multi-tenant implementations, the standard quota of 600 requests per second provides ample capacity, though AWS offers quota adjustment options for environments with exceptional requirements. This quota design enables scalable operational models where a single operations account can efficiently service thousands of tenant accounts without encountering service limits.
Writing private logs using LangChain callbacks
LangChain is a popular open source orchestration framework that enables developers to build powerful applications by connecting various components through chains, which are sequential series of operations that process and transform data. At the core of LangChain’s extensibility is the BaseCallbackHandler class, a fundamental abstraction that provides hooks into the execution lifecycle of chains, allowing developers to implement custom logic at different stages of processing. This class can be extended to precisely define behaviors that should occur upon completion of a chain’s invocation, enabling sophisticated monitoring, logging, or triggering of downstream processes. By implementing custom callback handlers, developers can capture metrics, persist results to external systems, or dynamically alter the execution flow based on intermediate outputs, making LangChain both flexible and powerful for production-grade language model applications.
Implementing a custom CloudWatch logging callback in LangChain provides a robust solution for maintaining data privacy in multi-account deployments. By extending the BaseCallbackHandler class, we can create a specialized handler that establishes a direct connection to the customer account’s CloudWatch logs, making sure model interaction data remains within the account boundaries. The implementation begins by initializing a Boto3 CloudWatch Logs client using the customer account’s credentials, rather than the operations account’s credentials. This client is configured with the appropriate log group and stream names, which can be dynamically generated based on customer identifiers or application contexts. During model invocations, the callback captures critical metrics such as token usage, latency, prompt details, and response characteristics. The following Python script serves as an example of this implementation:

import time

import boto3
from langchain_core.callbacks import BaseCallbackHandler

class CustomCallbackHandler(BaseCallbackHandler):
    def __init__(self, user_id, application_id, log_group, log_stream):
        self.user_id = user_id
        self.application_id = application_id
        self.log_group = log_group
        self.log_stream = log_stream
        # CloudWatch Logs client created with the customer account's credentials
        self.logs_client = boto3.client("logs")

    def log_to_cloudwatch(self, message: str):
        """Function to write extracted metrics to CloudWatch"""
        self.logs_client.put_log_events(
            logGroupName=self.log_group,
            logStreamName=self.log_stream,
            logEvents=[{"timestamp": int(time.time() * 1000), "message": message}],
        )

    def on_llm_end(self, response, **kwargs):
        print("\nChat model finished processing.")
        # Extract model_id and token usage from the response
        input_token_count = response.llm_output.get("usage", {}).get("prompt_tokens", None)
        output_token_count = response.llm_output.get("usage", {}).get("completion_tokens", None)
        model_id = response.llm_output.get("model_id", None)

        # Here we invoke the callback to write the metrics to the customer's CloudWatch Logs
        self.log_to_cloudwatch(
            f"User ID: {self.user_id}\nApplication ID: {self.application_id}\nInput tokens: {input_token_count}\nOutput tokens: {output_token_count}\nInvoked model: {model_id}"
        )

    def on_llm_error(self, error: Exception, **kwargs):
        print(f"Chat model encountered an error: {error}")

The on_llm_start, on_llm_end, and on_llm_error methods are overridden to intercept these lifecycle events and persist the relevant data. For example, the on_llm_end method can extract token counts, execution time, and model-specific metadata, formatting this information into structured log entries before writing them to CloudWatch. By implementing proper error handling and retry logic within the callback, we provide reliable logging even during intermittent connectivity issues. This approach creates a comprehensive audit trail of AI interactions while maintaining strict data isolation in the customer account, because the logs do not transit through or rest in the operations account.
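As a usage sketch, the handler can be attached to a LangChain chat model that wraps the cross-account Bedrock client so that every invocation is logged inside the customer account. This assumes the langchain-aws package and the constructor shown earlier; the identifiers and log names are illustrative.

from langchain_aws import ChatBedrock

handler = CustomCallbackHandler(
    user_id="customer-1-user",
    application_id="app-123",
    log_group="/customer-1/bedrock-invocations",
    log_stream="lambda",
)

llm = ChatBedrock(
    client=bedrock_runtime,  # bedrock-runtime client created with the assumed-role credentials
    model_id="<model-id>",
    callbacks=[handler],     # on_llm_end writes token and model metrics to CloudWatch
)

llm.invoke("Draft a status update for the migration project.")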
The AWS Shared Responsibility Model in multi-account logging
When implementing distributed logging for Amazon Bedrock in multi-account architectures, understanding the AWS Shared Responsibility Model becomes paramount. Although AWS secures the underlying infrastructure and services like Amazon Bedrock and CloudWatch, customers remain responsible for securing their data, configuring access controls, and implementing appropriate logging strategies. As demonstrated in our IAM role configurations, customers must carefully craft trust relationships and permission boundaries to help prevent unauthorized cross-account access. The LangChain callback implementation outlined places the responsibility on customers to enforce proper encryption of logs at rest, define appropriate retention periods that align with compliance requirements, and implement access controls for who can view sensitive AI interaction data. This aligns with the multi-account design principle where customer data remains isolated within their respective accounts. By respecting these security boundaries while maintaining operational efficiency, software companies can uphold their responsibilities within the shared security model while delivering scalable AI capabilities across their customer base.
Conclusion
Implementing a secure, scalable multi-tenant architecture with Amazon Bedrock requires careful planning around account structure, access patterns, and operational management. The distributed logging approach we’ve outlined demonstrates how organizations can maintain strict data isolation while still benefiting from centralized AI operations. By using IAM roles with precise trust relationships, AWS STS for secure cross-account authentication, and LangChain callbacks for private logging, companies can create a robust foundation that scales to thousands of customers without compromising on security or operational efficiency.
This architecture addresses the critical challenge of maintaining data privacy in multi-account deployments while still enabling comprehensive observability. Organizations should prioritize automation, monitoring, and governance from the beginning to avoid technical debt as their system scales. Implementing infrastructure as code for role management, automated monitoring of cross-account access patterns, and regular security reviews will make sure the architecture remains resilient and will help maintain adherence with compliance standards as business requirements evolve. As generative AI becomes increasingly central to software provider offerings, these architectural patterns provide a blueprint for maintaining the highest standards of data privacy while delivering innovative AI capabilities to customers across diverse regulatory environments and security requirements.
To learn more, explore the comprehensive Generative AI Security Scoping Matrix through Securing generative AI: An introduction to the Generative AI Security Scoping Matrix, which provides essential frameworks for securing AI implementations. Building on these security foundations, strengthen Amazon Bedrock deployments by getting familiar with IAM authentication and authorization mechanisms that establish proper access controls. As organizations grow to require multi-account structures, these IAM practices connect seamlessly with AWS STS, which delivers temporary security credentials enabling secure cross-account access patterns. To complete this integrated security approach, delve into LangChain and LangChain on AWS capabilities, offering powerful tools that build upon these foundational security services to create secure, context-aware AI applications, while maintaining appropriate security boundaries across your entire generative AI workflow.

About the Authors
Mohammad Tahsin is an AI/ML Specialist Solutions Architect at AWS. He lives for staying up-to-date with the latest technologies in AI/ML and helping customers deploy bespoke solutions on AWS. Outside of work, he loves all things gaming, digital art, and cooking.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Aswin Vasudevan is a Senior Solutions Architect for Security, ISV at AWS. He is a big fan of generative AI and serverless architecture and enjoys collaborating and working with customers to build solutions that drive business value.

Google AI Releases MedGemma: An Open Suite of Models Trained for Perfo …

At Google I/O 2025, Google introduced MedGemma, an open suite of models designed for multimodal medical text and image comprehension. Built on the Gemma 3 architecture, MedGemma aims to provide developers with a robust foundation for creating healthcare applications that require integrated analysis of medical images and textual data.

Model Variants and Architecture

MedGemma is available in two configurations:

MedGemma 4B: A 4-billion parameter multimodal model capable of processing both medical images and text. It employs a SigLIP image encoder pre-trained on de-identified medical datasets, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. The language model component is trained on diverse medical data to facilitate comprehensive understanding.

MedGemma 27B: A 27-billion parameter text-only model optimized for tasks requiring deep medical text comprehension and clinical reasoning. This variant is exclusively instruction-tuned and is designed for applications that demand advanced textual analysis.

Deployment and Accessibility

Developers can access MedGemma models through Hugging Face, subject to agreeing to the Health AI Developer Foundations terms of use. The models can be run locally for experimentation or deployed as scalable HTTPS endpoints via Google Cloud’s Vertex AI for production-grade applications. Google provides resources, including Colab notebooks, to facilitate fine-tuning and integration into various workflows.
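For local experimentation, a minimal sketch with the Hugging Face transformers library might look like the following; the repository id is an assumption based on the naming described above, so confirm the exact name on the model card, and note that access requires accepting the Health AI Developer Foundations terms.

from transformers import pipeline

# Assumed repository id for the 27B text-only, instruction-tuned variant
generator = pipeline(
    "text-generation",
    model="google/medgemma-27b-text-it",
    device_map="auto",
)

prompt = "Summarize the key findings in the following clinical note:\n<note text>"
print(generator(prompt, max_new_tokens=256)[0]["generated_text"])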

Applications and Use Cases

MedGemma serves as a foundational model for several healthcare-related applications:

Medical Image Classification: The 4B model’s pre-training makes it suitable for classifying various medical images, such as radiology scans and dermatological images.

Medical Image Interpretation: It can generate reports or answer questions related to medical images, aiding in diagnostic processes.

Clinical Text Analysis: The 27B model excels in understanding and summarizing clinical notes, supporting tasks like patient triaging and decision support.

Adaptation and Fine-Tuning

While MedGemma provides strong baseline performance, developers are encouraged to validate and fine-tune the models for their specific use cases. Techniques such as prompt engineering, in-context learning, and parameter-efficient fine-tuning methods like LoRA can be employed to enhance performance. Google offers guidance and tools to support these adaptation processes.
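As an illustration of the parameter-efficient route, the sketch below applies a LoRA adapter with the peft library; the target modules, hyperparameters, and repository id are generic assumptions, not a recipe published by Google.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/medgemma-27b-text-it")  # assumed id
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common starting point
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable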

Conclusion

MedGemma represents a significant step in providing accessible, open-source tools for medical AI development. By combining multimodal capabilities with scalability and adaptability, it offers a valuable resource for developers aiming to build applications that integrate medical image and text analysis.

Check out the Models on Hugging Face and the Project Page.

NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physica …

AI has advanced in language processing, mathematics, and code generation, but extending these capabilities to physical environments remains challenging. Physical AI seeks to close this gap by developing systems that perceive, understand, and act in dynamic, real-world settings. Unlike conventional AI that processes text or symbols, Physical AI engages with sensory inputs, especially video, and generates responses grounded in real-world physics. These systems are designed for navigation, manipulation, and interaction, relying on common-sense reasoning and an embodied understanding of space, time, and physical laws. Applications span robotics, autonomous vehicles, and human-machine collaboration, where adaptability to real-time perception is crucial.

The current AI models’ weak connection to real-world physics is a major limitation. While they perform well on abstract tasks, they often fail to predict physical consequences or respond appropriately to sensory data. Concepts like gravity or spatial relationships are not intuitively understood, making them unreliable for embodied tasks. Training directly in the physical world is costly and risky, which hampers development and iteration. This lack of physical grounding and embodied understanding is a significant barrier to deploying AI effectively in real-world applications.

Previously, tools for physical reasoning in AI were fragmented. Vision-language models linked visual and textual data but lacked depth in reasoning. Rule-based systems were rigid and failed in novel scenarios. Simulations and synthetic data often miss the nuances of real-world physics. Critically, there was no standardized framework to define or evaluate physical common sense or embodied reasoning. Inconsistent methodologies and benchmarks made progress difficult to quantify. Reinforcement learning approaches lacked task-specific reward structures, leading to models that struggled with cause-and-effect reasoning and physical feasibility.

Researchers from NVIDIA introduced Cosmos-Reason1, a suite of multimodal large language models. These models, Cosmos-Reason1-7B and Cosmos-Reason1-56B, were designed specifically for physical reasoning tasks. Each model is trained in two major phases: Physical AI Supervised Fine-Tuning (SFT) and Physical AI Reinforcement Learning (RL). What differentiates this approach is the introduction of a dual-ontology system. One hierarchical ontology organizes physical common sense into three main categories, Space, Time, and Fundamental Physics, divided further into 16 subcategories. The second ontology is two-dimensional and maps reasoning capabilities across five embodied agents, including humans, robot arms, humanoid robots, and autonomous vehicles. These ontologies are training guides and evaluation tools for benchmarking AI’s physical reasoning.

The architecture of Cosmos-Reason1 uses a decoder-only LLM augmented with a vision encoder. Videos are processed to extract visual features, which are then projected into a shared space with language tokens. This integration enables the model to reason over textual and visual data simultaneously. The researchers curated a massive dataset comprising around 4 million annotated video-text pairs for training. These include action descriptions, multiple choice questions, and long chain-of-thought reasoning traces. The reinforcement learning stage is driven by rule-based, verifiable rewards derived from human-labeled multiple-choice questions and self-supervised video tasks. These tasks include predicting the temporal direction of videos and solving puzzles with spatiotemporal patches, making the training deeply tied to real-world physical logic.
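Schematically, and only as a generic illustration of the pattern described rather than NVIDIA's implementation, the vision-to-language bridge amounts to projecting video features into the LLM's embedding space so the decoder can attend over visual and text tokens together.

import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    """Projects vision-encoder features into the LLM token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_visual_tokens, vision_dim) from the vision encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding layer
        visual_tokens = self.projector(frame_features)
        # The decoder-only LLM then reasons over visual and text tokens jointly
        return torch.cat([visual_tokens, text_embeddings], dim=1)

bridge = VisionLanguageBridge()
fused = bridge(torch.randn(1, 8, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 40, 4096])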

The team constructed three benchmarks for physical common sense (Space, Time, and Fundamental Physics) containing 604 questions from 426 videos. Six benchmarks were built for embodied reasoning, with 610 questions from 600 videos covering a wide range of tasks. The Cosmos-Reason1 models outperformed previous baselines, especially after the RL phase. Notably, they improved in task completion verification, predicting next plausible actions, and assessing the physical feasibility of actions. These gains were observed in both model sizes, with Cosmos-Reason1-56B showing stronger performance across most metrics. This performance improvement underscores the effectiveness of using structured ontologies and multimodal data to enhance physical reasoning in AI.

Several Key Takeaways from the Research on Cosmos-Reason1:

Two models introduced: Cosmos-Reason1-7B and Cosmos-Reason1-56B, trained specifically for physical reasoning tasks.

The models were trained in two phases: Physical AI Supervised Fine-Tuning (SFT) and Physical AI Reinforcement Learning (RL).

The training dataset includes approximately 4 million annotated video-text pairs curated for physical reasoning.

Reinforcement learning uses rule-based and verifiable rewards, derived from human annotations and video-based tasks.

The team relied on two ontologies: a hierarchical one with three categories and 16 subcategories, and a two-dimensional one mapping agent capabilities.

Benchmarks: 604 questions from 426 videos for physical common sense, and 610 from 600 videos for embodied reasoning.

Performance gains were observed across all benchmarks after RL training, particularly in predicting next actions and verifying task completion.

Real-world applicability for robots, vehicles, and other embodied agents across diverse environments.

In conclusion, the Cosmos-Reason1 initiative demonstrates how AI can be better equipped for the physical world. It addresses key limitations in perception, reasoning, and decision-making that have hindered progress in deploying AI in embodied scenarios. The structured training pipeline, grounded in real-world data and ontological frameworks, ensures that the models are accurate and adaptable. These advancements signal a major step forward in bridging the gap between abstract AI reasoning and the needs of systems that must operate in unpredictable, real-world environments.

Check out the Paper, Project Page, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
The post NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World Environments appeared first on MarkTechPost.

Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-Tuning

Language models (LMs) have great capabilities as in-context learners when pretrained on vast internet text corpora, allowing them to generalize effectively from just a few task examples. However, fine-tuning these models for downstream tasks presents significant challenges. While fine-tuning requires hundreds to thousands of examples, the resulting generalization patterns show limitations. For example, models fine-tuned on statements like “B’s mother is A” struggle to answer related questions like “Who is A’s son?” Yet the same LMs can handle such reverse relations in context. This raises questions about the differences between in-context learning and fine-tuning generalization patterns, and how these differences should inform adaptation strategies for downstream tasks.

Research into improving LMs’ adaptability has followed several key approaches. In-context learning studies have examined learning and generalization patterns through empirical, mechanistic, and theoretical analyses. Out-of-context learning research explores how models utilize information not explicitly included in prompts. Data augmentation techniques use LLMs to enhance performance from limited datasets, with specific solutions targeting issues like the reversal curse through hardcoded augmentations, deductive closure training, and generating reasoning pathways. Moreover, synthetic data approaches have evolved from early hand-designed data to improve generalization in domains like linguistics or mathematics to more recent methods that generate data directly from language models.

Researchers from Google DeepMind and Stanford University have constructed several datasets that isolate knowledge from pretraining data to create clean generalization tests. Performance is evaluated across various generalization types by exposing pretrained models to controlled information subsets, both in-context and through fine-tuning. Their findings reveal that in-context learning shows more flexible generalization than fine-tuning in data-matched settings, though there are some exceptions where fine-tuning can generalize to reversals within larger knowledge structures. Building on these insights, researchers have developed a method that enhances fine-tuning generalization by incorporating in-context inferences into the fine-tuning data.

Researchers employ multiple datasets carefully designed to isolate specific generalization challenges or insert them within broader learning contexts. Evaluation relies on multiple-choice likelihood scoring without providing answer choices in context. The experiments involve fine-tuning Gemini 1.5 Flash using batch sizes of 8 or 16. For in-context evaluation, the researchers combine training documents as context for the instruction-tuned model, randomly subsampling by 8x for larger datasets to minimize interference issues. The key innovation is a dataset augmentation approach using in-context generalization to enhance fine-tuning dataset coverage. This includes local and global strategies, each employing distinct contexts and prompts.
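The augmentation idea can be sketched roughly as follows: prompt the model in context to draw inferences (such as reversals) from each training statement, then add those generated inferences to the fine-tuning set. This is a simplified illustration, not the authors' exact procedure; the helper function and prompt wording are hypothetical.

def augment_finetuning_data(training_docs, generate_inferences):
    """Sketch of augmenting a fine-tuning dataset with in-context inferences.

    `generate_inferences(prompt)` is a hypothetical helper that calls the
    pretrained model in context and returns a list of inferred statements.
    """
    augmented = []
    for doc in training_docs:
        augmented.append(doc)  # keep the original statement
        prompt = (
            f"Here is a fact: {doc}\n"
            "List statements that logically follow from it "
            "(for example, the reversed relation)."
        )
        # e.g., infer "A's son is B" from "B's mother is A"
        augmented.extend(generate_inferences(prompt))
    return augmented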

On the Reversal Curse dataset, in-context learning achieves near-ceiling performance on reversals, while conventional fine-tuning shows near-zero accuracy as models favor incorrect celebrity names seen during training. Fine-tuning with data augmented by in-context inferences matches the high performance of pure in-context learning. Testing on simple nonsense reversals reveals similar patterns, though with less pronounced benefits. For simple syllogisms, while the pretrained model performs at chance level (indicating no data contamination), fine-tuning does produce above-chance generalization for certain syllogism types where logical inferences align with simple linguistic patterns. However, in-context learning outperforms fine-tuning, with augmented fine-tuning showing the best overall results.

In conclusion, this paper explores generalization differences between in-context learning and fine-tuning when LMs face novel information structures. Results show in-context learning’s superior generalization for certain inference types, prompting the researchers to develop methods that enhance fine-tuning performance by incorporating in-context inferences into training data. Despite promising outcomes, several limitations affect the study. The first is the study’s dependence on nonsense words and implausible operations. Second, the research focuses on specific LMs, limiting the generality of the results. Future research should investigate learning and generalization differences across various models, especially newer reasoning models, to expand upon these findings.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-Tuning appeared first on MarkTechPost.

Build a domain‐aware data preprocessing pipeline: A multi‐agent co …

Enterprises—especially in the insurance industry—face increasing challenges in processing vast amounts of unstructured data from diverse formats, including PDFs, spreadsheets, images, videos, and audio files. These might include claims document packages, crash event videos, chat transcripts, or policy documents. All contain critical information across the claims processing lifecycle.
Traditional data preprocessing methods, though functional, might have limitations in accuracy and consistency. This might affect metadata extraction completeness, workflow velocity, and the extent of data utilization for AI-driven insights (such as fraud detection or risk analysis). To address these challenges, this post introduces a multi‐agent collaboration pipeline: a set of specialized agents for classification, conversion, metadata extraction, and domain‐specific tasks. By orchestrating these agents, you can automate the ingestion and transformation of a wide range of multimodal unstructured data—boosting accuracy and enabling end‐to‐end insights.
For teams processing a small volume of uniform documents, a single-agent setup might be more straightforward to implement and sufficient for basic automation. However, if your data spans diverse domains and formats—such as claims document packages, collision footage, chat transcripts, or audio files—a multi-agent architecture offers distinct advantages. Specialized agents allow for targeted prompt engineering, better debugging, and more accurate extraction, each tuned to a specific data type.
As volume and variety grow, this modular design scales more gracefully, allowing you to plug in new domain-aware agents or refine individual prompts and business logic—without disrupting the broader pipeline. Feedback from domain experts in the human-in-the-loop phase can also be mapped back to specific agents, supporting continuous improvement.
To support this adaptive architecture, you can use Amazon Bedrock, a fully managed service that makes it straightforward to build and scale generative AI applications using foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon through a single API. A powerful feature of Amazon Bedrock—Amazon Bedrock Agents—enables the creation of intelligent, domain-aware agents that can retrieve context from Amazon Bedrock Knowledge Bases, call APIs, and orchestrate multi-step tasks. These agents provide the flexibility and adaptability needed to process unstructured data at scale, and can evolve alongside your organization’s data and business workflows.
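As an illustration of how an application might call such an agent programmatically, the following is a minimal boto3 sketch using the Amazon Bedrock Agents runtime. The Region, agent ID, alias ID, and input text are placeholders and assumptions; error handling is omitted.

import uuid
import boto3

# Bedrock Agents runtime client (Region is an assumption)
client = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

response = client.invoke_agent(
    agentId="AGENT_ID_PLACEHOLDER",          # your agent ID
    agentAliasId="AGENT_ALIAS_PLACEHOLDER",  # your agent alias ID
    sessionId=str(uuid.uuid4()),             # new conversation session
    inputText="Classify the uploaded claims document package and extract its metadata.",
)

# The agent streams its answer back as chunks
completion = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        completion += chunk["bytes"].decode("utf-8")
print(completion)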
Solution overview
Our pipeline functions as an insurance unstructured data preprocessing hub with the following features:

Classification of incoming unstructured data based on domain rules
Metadata extraction for claim numbers, dates, and more
Conversion of documents into uniform formats (such as PDF or transcripts)
Conversion of audio/video data into structured markup format
Human validation for uncertain or missing fields

Enriched outputs and associated metadata will ultimately land in a metadata‐rich unstructured data lake, forming the foundation for fraud detection, advanced analytics, and 360‐degree customer views.
The following diagram illustrates the solution architecture.

The end-to-end workflow features a supervisor agent at the center, classification and conversion agents branching off, a human‐in‐the‐loop step, and Amazon Simple Storage Service (Amazon S3) as the final unstructured data lake destination.
Multi‐agent collaboration pipeline
This pipeline is composed of multiple specialized agents, each handling a distinct function such as classification, conversion, metadata extraction, and domain-specific analysis. Unlike a single monolithic agent that attempts to manage all tasks, this modular design promotes scalability, maintainability, and reuse. Individual agents can be independently updated, swapped, or extended to accommodate new document types or evolving business rules without impacting the overall system. This separation of concerns improves fault tolerance and enables parallel processing, resulting in faster and more reliable data transformation workflows.
Multi-agent collaboration offers the following metrics and efficiency gains:

Reduction in human validation time – Focused prompts tailored to specific agents lead to cleaner outputs and less complicated verification, reducing the time spent on validation.
Faster iteration cycles and regression isolation – Changes to prompts or logic are scoped to individual agents, minimizing the area of effect of updates and significantly reducing regression testing effort during tuning or enhancement phases.
Improved metadata extraction accuracy, especially on edge cases – Specialized agents reduce prompt overload and allow deeper domain alignment, which improves field-level accuracy—especially when processing mixed document types like crash videos vs. claims document packages.
Scalable efficiency gains with automated issue resolver agents – As automated issue resolver agents are added over time, processing time per document is expected to improve considerably, reducing manual touchpoints. These agents can be designed to use human-in-the-loop feedback mappings and intelligent data lake lookups to automate recurring fixes.

Unstructured Data Hub Supervisor Agent
The Supervisor Agent orchestrates the workflow, delegates tasks, and invokes specialized downstream agents. It has the following key responsibilities:

Receive incoming multimodal data and processing instructions from the user portal (multimodal claims document packages, vehicle damage images, audio transcripts, or repair estimates).
Forward each unstructured data type to the Classification Collaborator Agent to determine whether a conversion step is needed or direct classification is possible.
Coordinate specialized domain processing by invoking the appropriate agent for each data type—for example, a claims documents package is handled by the Claims Documentation Package Processing Agent, and repair estimates go to the Vehicle Repair Estimate Processing Agent.
Make sure that every incoming item eventually lands, along with its metadata, in the S3 data lake.

Classification Collaborator Agent
The Classification Collaborator Agent determines each file’s type using domain‐specific rules and makes sure it’s either converted (if needed) or directly classified. This includes the following steps (a minimal routing sketch follows the list):

Identify the file extension; if it’s DOCX, PPT, or XLS, route the file to the Document Conversion Agent first.
Output a unified classification result for each standardized document—specifying the category, confidence, extracted metadata, and next steps.
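The following is a minimal sketch of this routing decision and of the unified classification output. The extension list and output fields mirror the description above, but the function and field names are illustrative, not the deployed agent's code.

CONVERT_FIRST = {".docx", ".ppt", ".xls", ".xlsx"}

def route_file(file_name):
    """Sketch of the Classification Collaborator Agent's routing rule."""
    extension = "." + file_name.lower().rsplit(".", 1)[-1]
    if extension in CONVERT_FIRST:
        return "document_conversion_agent"   # standardize to PDF first
    return "direct_classification"           # classify as-is

def classification_result(category, confidence, metadata, next_step):
    """Unified classification output for each standardized document."""
    return {
        "category": category,      # e.g., "claims_document_package"
        "confidence": confidence,  # e.g., 0.92
        "metadata": metadata,      # extracted fields such as claim number
        "next_step": next_step,    # downstream specialized agent to invoke
    }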

Document Conversion Agent
The Document Conversion Agent converts non‐PDF files into PDF and extracts initial metadata (creation date, file size, and so on). This includes the following steps (a conversion sketch follows the list):

Transform DOCX, PPT, XLS, and XLSX into PDF.
Capture embedded metadata.
Return the new PDF to the Classification Collaborator Agent for final classification.
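One way to implement the conversion step is to shell out to LibreOffice in headless mode, which can convert DOCX, PPT, XLS, and XLSX files to PDF. This is a hedged sketch that assumes the soffice binary is available in the processing environment (for example, in a container image); the deployed agent might use a different conversion mechanism, and richer embedded metadata would need a format-specific parser.

import os
import subprocess

def convert_to_pdf(input_path, output_dir="/tmp"):
    """Convert an Office document to PDF using LibreOffice in headless mode.

    Assumes the `soffice` binary is installed on the host or container.
    Returns the generated PDF path along with basic file metadata.
    """
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", output_dir, input_path],
        check=True,
    )
    base = os.path.splitext(os.path.basename(input_path))[0]
    pdf_path = os.path.join(output_dir, f"{base}.pdf")
    stats = os.stat(input_path)
    metadata = {
        "source_file": os.path.basename(input_path),
        "file_size_bytes": stats.st_size,
        "modified_time": stats.st_mtime,
    }
    return pdf_path, metadata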

Specialized classification agents
Each agent handles specific modalities of data:

Document Classification Agent:

Processes text‐heavy formats like claims document packages, standard operating procedure documents (SOPs), and policy documents
Extracts claim numbers, policy numbers, policy holder details, coverage dates, and expense amounts as metadata
Identifies missing items (for example, missing policy holder information, missing dates)

Transcription Classification Agent:

Focuses on audio or video transcripts, such as First Notice of Loss (FNOL) calls or adjuster follow‐ups
Classifies transcripts into business categories (such as first‐party claim or third‐party conversation) and extracts relevant metadata

Image Classification Agent:

Analyzes vehicle damage photos and collision videos for details like damage severity, vehicle identification, or location
Generates structured metadata that can be fed into downstream damage analysis systems

Additionally, we have defined specialized downstream agents:

Claims Document Package Processing Agent
Vehicle Repair Estimate Processing Agent
Vehicle Damage Analysis Processing Agent
Audio Video Transcription Processing Agent
Insurance Policy Document Processing Agent

After the high‐level classification identifies a file as, for example, a claims document package or repair estimate, the Supervisor Agent invokes the appropriate specialized agent to perform deeper domain‐specific transformation and extraction.
Metadata extraction and human-in-the-loop
Metadata is essential for automated workflows. Without accurate metadata fields—like claim numbers, policy numbers, coverage dates, loss dates, or claimant names—downstream analytics lack context. This part of the solution handles data extraction, error handling, and recovery through the following features:

Automated extraction – Large language models (LLMs) and domain‐specific rules parse critical data from unstructured content, identify key metadata fields, and flag anomalies early.
Data staging for review – The pipeline extracts metadata fields and stages each record for human review, presenting the extracted fields and highlighting missing or incorrect values (see the staging sketch after this list).
Human-in-the-loop – Domain experts step in to validate and correct metadata during the human-in-the-loop phase, providing accuracy and context for key fields such as claim numbers, policyholder details, and event timelines. These interventions not only serve as a point-in-time error recovery mechanism but also lay the foundation for continuous improvement of the pipeline’s domain-specific rules, conversion logic, and classification prompts.
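A record staged for review might look like the following sketch, in which required fields are checked and anything missing is flagged for a domain expert. The field names, required-field list, and review flag are illustrative assumptions, not the pipeline's actual schema.

REQUIRED_FIELDS = ["claim_number", "policy_number", "date_of_loss", "claimant_name"]

def stage_for_review(extracted):
    """Flag missing or empty required metadata fields for human review."""
    missing = [f for f in REQUIRED_FIELDS if not extracted.get(f)]
    return {
        "metadata": extracted,
        "missing_fields": missing,
        "needs_human_review": bool(missing),
    }

# Example: a record with no claim number yet gets routed to a reviewer
record = stage_for_review({
    "claim_number": "",
    "policy_number": "SF9988776655",
    "date_of_loss": "2025-01-01",
    "claimant_name": "Jane Doe",
})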

Eventually, automated issue resolver agents can be introduced iteratively to handle an increasing share of data fixes, further reducing the need for manual review. Several strategies can enable this progression and improve resilience and adaptability over time:

Persisting feedback – Corrections made by domain experts can be captured and mapped to the types of issues they resolve. These structured mappings help refine prompt templates, update business logic, and generate targeted instructions to guide the design of automated issue resolver agents to emulate similar fixes in future workflows.
Contextual metadata lookups – As the unstructured data lake becomes increasingly metadata-rich—with deeper connections across policy numbers, claim IDs, vehicle info, and supporting documents—issue resolver agents with appropriate prompts can be introduced to perform intelligent dynamic lookups. For example, if a media file lacks a policy number but includes a claim number and vehicle information, an issue resolver agent can retrieve missing metadata by querying related indexed documents like claims document packages or repair estimates.

By combining these strategies, the pipeline becomes increasingly adaptive—continually improving data quality and enabling scalable, metadata-driven insights across the enterprise.
Metadata‐rich unstructured data lake
After each unstructured data type is converted and classified, both the standardized content and metadata JSON files are stored in an unstructured data lake (Amazon S3). This repository unifies different data types (images, transcripts, documents) through shared metadata, enabling the following:

Fraud detection by cross‐referencing repeated claimants or contradictory details
Customer 360-degree profiles by linking claims, calls, and service records
Advanced analytics and real‐time queries

Multi‐modal, multi‐agentic pattern
In our AWS CloudFormation template, each multimodal data type follows a specialized flow:

Data conversion and classification:

The Supervisor Agent receives uploads and passes them to the Classification Collaborator Agent.
If needed, the Document Conversion Agent might step in to standardize the file.
The Classification Collaborator Agent’s classification step organizes the uploads into categories—FNOL calls, claims document packages, collision videos, and so on.

Document processing:

The Document Classification Agent and other specialized agents apply domain rules to extract metadata like claim numbers, coverage dates, and more.
The pipeline presents the extracted information, along with any missing fields, to the domain expert for correction or updating.

Audio/video analysis:

The Transcription Classification Agent handles FNOL calls and third‐party conversation transcripts.
The Audio Video Transcription Processing Agent or the Vehicle Damage Analysis Processing Agent further parses collision videos or damage photos, linking spoken events to visual evidence.

Markup text conversion:

Specialized processing agents create markup text from the fully classified and corrected metadata. This way, the data is transformed into a metadata-rich format ready for consumption by knowledge bases, Retrieval Augmented Generation (RAG) pipelines, or graph queries.

Human-in-the-loop and future improvements
The human‐in‐the‐loop component is key for verifying metadata, adding missing values, and fixing incorrect categorization of data. However, the pipeline is designed to evolve as follows:

Refined LLM prompts – Every correction from domain experts helps refine LLM prompts, reducing future manual steps and improving metadata consistency
Issue resolver agents – As metadata consistency improves over time, specialized fixers can handle metadata and classification errors with minimal user input
Cross referencing – Issue resolver agents can cross‐reference existing data in the metadata-rich S3 data lake to automatically fill in missing metadata

The pipeline evolves toward full automation, minimizing human oversight except for the most complex cases.
Prerequisites
Before deploying this solution, make sure that you have the following in place:

An AWS account. If you don’t have an AWS account, sign up for one.
Access as an AWS Identity and Access Management (IAM) administrator or an IAM user that has permissions for:

Deploying AWS CloudFormation.
Creating and managing S3 buckets and uploading objects.
Creating and updating Amazon Simple Queue Service (Amazon SQS) queues, AWS Lambda functions, Amazon Bedrock, Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon OpenSearch Service, and Amazon API Gateway.
Creating and managing IAM roles.

Access to Amazon Bedrock. Make sure Amazon Bedrock is available in your AWS Region, and you have explicitly enabled the FMs you plan to use (for example, Anthropic’s Claude or Cohere). Refer to Add or remove access to Amazon Bedrock foundation models for guidance on enabling models for your AWS account. This solution was tested in us-west-2. Make sure that you have enabled the required FMs:

claude-3-5-haiku-20241022-v1:0
claude-3-5-sonnet-20241022-v2:0
claude-3-haiku-20240307-v1:0
titan-embed-text-v2:0

Increase the API Gateway integration timeout from the default 29 seconds to 180 seconds, as introduced in this announcement, by submitting a service quota increase for API Gateway integration timeout in your AWS account.

Deploy the solution with AWS CloudFormation
Complete the following steps to set up the solution resources:

Sign in to the AWS Management Console as an IAM administrator or appropriate IAM user.
Choose Launch Stack to deploy the CloudFormation template.

Provide the necessary parameters and create the stack.

For this setup, we use us-west-2 as our Region, Anthropic’s Claude 3.5 Haiku model for orchestrating the flow between the different agents, and Anthropic’s Claude 3.5 Sonnet V2 model for conversion, categorization, and processing of multimodal data.
If you want to use other models on Amazon Bedrock, you can do so by making appropriate changes in the CloudFormation template. Confirm that the models you choose are supported in your Region and provide the features your workload requires.
It will take about 30 minutes to deploy the solution. After the stack is deployed, you can view the various outputs of the CloudFormation stack on the Outputs tab, as shown in the following screenshot.

The provided CloudFormation template creates multiple S3 buckets (such as DocumentUploadBucket, SampleDataBucket, and KnowledgeBaseDataBucket) for raw uploads, sample files, Amazon Bedrock Knowledge Bases references, and more. Each specialized Amazon Bedrock agent or Lambda function uses these buckets to store intermediate or final artifacts.
The following screenshot is an illustration of the Amazon Bedrock agents that are deployed in the AWS account.

The next section outlines how to test the unstructured data processing workflow.
Test the unstructured data processing workflow
In this section, we present different use cases to demonstrate the solution. Before you begin, complete the following steps:

Locate the APIGatewayInvokeURL value from the CloudFormation stack’s outputs. This URL launches the Insurance Unstructured Data Preprocessing Hub in your browser.

Download the sample data files from the designated S3 bucket (SampleDataBucketName) to your local machine. The following screenshots show the bucket details from CloudFormation stack’s outputs and the contents of the sample data bucket.

With these details, you can now test the pipeline by uploading the following sample multimodal files through the Insurance Unstructured Data Preprocessing Hub Portal:

Claims document package (ClaimDemandPackage.pdf)
Vehicle repair estimate (collision_center_estimate.xlsx)
Collision video with supporting audio (carcollision.mp4)
First notice of loss audio transcript (fnol.mp4)
Insurance policy document (ABC_Insurance_Policy.docx)

Each multimodal data type will be processed through a series of agents:

Supervisor Agent – Initiates the processing
Classification Collaborator Agent – Categorizes the multimodal data
Specialized processing agents – Handle domain-specific processing

Finally, the processed files, along with their enriched metadata, are stored in the S3 data lake. Now, let’s proceed to the actual use cases.
Use Case 1: Claims document package
This use case demonstrates the complete workflow for processing a multimodal claims document package. By uploading a PDF document to the pipeline, the system automatically classifies the document type, extracts essential metadata, and categorizes each page into specific components.

Choose Upload File in the UI and choose the PDF file (ClaimDemandPackage.pdf).

The file upload might take some time depending on the document size.

When the upload is complete, you can confirm that the extracted metadata values are as follows:

Claim Number: 0112233445
Policy Number: SF9988776655
Date of Loss: 2025-01-01
Claimant Name: Jane Doe

The Classification Collaborator Agent identifies the document as a Claims Document Package. Metadata (such as claim ID and incident date) is automatically extracted and displayed for review.

For this use case, no changes are made—simply choose Continue Preprocessing to proceed.

The processing stage might take up to 15 minutes to complete. Rather than manually checking the S3 bucket (identified in the CloudFormation stack outputs as KnowledgeBaseDataBucket) to verify that 72 files—one for each page and its corresponding metadata JSON—have been generated, you can monitor the progress by periodically choosing Check Queue Status. This lets you view the current state of the processing queue in real time.
The pipeline further categorizes each page into specific types (for example, lawyer letter, police report, medical bills, doctor’s report, health forms, x-rays). It also generates corresponding markup text files and metadata JSON files.
Finally, the processed text and metadata JSON files are stored in the unstructured S3 data lake.
The following diagram illustrates the complete workflow.

Use Case 2: Collision center workbook for vehicle repair estimate
In this use case, we upload a collision center workbook to trigger the workflow that converts the file, extracts repair estimate details, and stages the data for review before final storage.

Choose Upload File and choose the XLSX workbook (collision_center_estimate.xlsx).
Wait for the upload to complete and confirm that the extracted metadata is accurate:

Claim Number: CLM20250215
Policy Number: SF9988776655
Claimant Name: John Smith
Vehicle: Truck

The Document Conversion Agent converts the file to PDF if needed, and the Classification Collaborator Agent identifies it as a repair estimate. The Vehicle Repair Estimate Processing Agent extracts cost lines, part numbers, and labor hours.

Review and update the displayed metadata as necessary, then choose Continue Preprocessing to trigger final storage.

The finalized file and metadata are stored in Amazon S3.
The following diagram illustrates this workflow.

Use Case 3: Collision video with audio transcript
For this use case, we upload a video showing the accident scene to trigger a workflow that analyzes both visual and audio data, extracts key frames for collision severity, and stages metadata for review before final storage.

Choose Upload File and choose the MP4 video (carcollision.mp4).
Wait until the upload is complete, then review the collision scenario and adjust the displayed metadata to correct omissions or inaccuracies as follows:

Claim Number: 0112233445
Policy Number: SF9988776655
Date of Loss: 01-01-2025
Claimant Name: Jane Doe
Policy Holder Name: John Smith

The Classification Collaborator Agent directs the video to either the Audio/Video Transcript or Vehicle Damage Analysis agent. Key frames are analyzed to determine collision severity.

Review and update the displayed metadata (for example, policy number, location), then choose Continue Preprocessing to initiate final storage.

Final transcripts and metadata are stored in Amazon S3, ready for advanced analytics such as verifying story consistency.
The following diagram illustrates this workflow.

Use Case 4: Audio transcript between claimant and customer service associate
Next, we upload a video that captures the claimant reporting an accident to trigger the workflow that extracts an audio transcript and identifies key metadata for review before final storage.

Choose Upload File and choose the MP4 file (fnol.mp4).
Wait until the upload is complete, then review the call scenario and adjust the displayed metadata to correct any omissions or inaccuracies as follows:

Claim Number: Not Assigned Yet
Policy Number: SF9988776655
Claimant Name: Jane Doe
Policy Holder Name: John Smith
Date Of Loss: January 1, 2025 8:30 AM

The Classification Collaborator Agent routes the file to the Audio/Video Transcript Agent for processing. Key metadata attributes are automatically identified from the call.

Review and correct any incomplete metadata, then choose Continue Preprocessing to proceed.

Final transcripts and metadata are stored in Amazon S3, ready for advanced analytics (for example, verifying story consistency).
The following diagram illustrates this workflow.

Use Case 5: Auto insurance policy document
For our final use case, we upload an insurance policy document to trigger the workflow that converts and classifies the document, extracts key metadata for review, and stores the finalized output in Amazon S3.

Choose Upload File and choose the DOCX file (ABC_Insurance_Policy.docx).
Wait until the upload is complete, and confirm that the extracted metadata values are as follows:

Policy Number: SF9988776655
Policy type: Auto Insurance
Effective Date: 12/12/2024
Policy Holder Name: John Smith

The Document Conversion Agent transforms the document into a standardized PDF format if required. The Classification Collaborator Agent then routes it to the Document Classification Agent for categorization as an Auto Insurance Policy Document. Key metadata attributes are automatically identified and presented for user review.

Review and correct incomplete metadata, then choose Continue Preprocessing to trigger final storage.

The finalized policy document in markup format, along with its metadata, is stored in Amazon S3—ready for advanced analytics such as verifying story consistency.
The following diagram illustrates this workflow.

Similar workflows can be applied to other types of insurance multimodal data and documents by uploading them on the Data Preprocessing Hub Portal. Whenever needed, this process can be enhanced by introducing specialized downstream Amazon Bedrock agents that collaborate with the existing Supervisor Agent, Classification Agent, and Conversion Agents.
Amazon Bedrock Knowledge Bases integration
To use the newly processed data in the data lake, complete the following steps to ingest the data into Amazon Bedrock Knowledge Bases and interact with the data lake using a structured workflow. This integration allows for dynamic querying across different document types, enabling deeper insights from multimodal data.

Choose Chat with Your Documents to open the chat interface.

Choose Sync Knowledge Base to initiate the job that ingests and indexes the newly processed files and the available metadata into the Amazon Bedrock knowledge base.
After the sync is complete (which might take a couple of minutes), enter your queries in the text box. For example, set Policy Number to SF9988776655 and try asking:

“Retrieve details of all claims filed against the policy number by multiple claimants.”
“What is the nature of Jane Doe’s claim, and what documents were submitted?”
“Has the policyholder John Smith submitted any claims for vehicle repairs, and are there any estimates on file?”

Choose Send and review the system’s response.

This integration enables cross-document analysis, so you can query across multimodal data types like transcripts, images, claims document packages, repair estimates, and claim records to reveal customer 360-degree insights from your domain-aware multi-agent pipeline. By synthesizing data from multiple sources, the system can correlate information, uncover hidden patterns, and identify relationships that might not have been evident in isolated documents.
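Programmatically, the same kind of cross-document query could be issued against the knowledge base through the Bedrock Agents runtime RetrieveAndGenerate API. The following is a minimal sketch in which the Region, knowledge base ID, and model choice are placeholders and assumptions, not values produced by this solution's CloudFormation stack.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

response = client.retrieve_and_generate(
    input={
        "text": "What is the nature of Jane Doe's claim, and what documents were submitted?"
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",
            # Assumed model; swap in the foundation model enabled in your account
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
        },
    },
)

print(response["output"]["text"])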
A key enabler of this intelligence is the rich metadata layer generated during preprocessing. Domain experts actively validate and refine this metadata, providing accuracy and consistency across diverse document types. By reviewing key attributes—such as claim numbers, policyholder details, and event timelines—domain experts enhance the metadata foundation, making it more reliable for downstream AI-driven analysis.
With rich metadata in place, the system can now infer relationships between documents more effectively, enabling use cases such as:

Identifying multiple claims tied to a single policy
Detecting inconsistencies in submitted documents
Tracking the complete lifecycle of a claim from FNOL to resolution

By continuously improving metadata through human validation, the system becomes more adaptive, paving the way for future automation, where issue resolver agents can proactively identify and self-correct missing and inconsistent metadata with minimal manual intervention during the data ingestion process.
Clean up
To avoid unexpected charges, complete the following steps to clean up your resources:

Delete the contents from the S3 buckets mentioned in the outputs of the CloudFormation stack.
Delete the deployed stack using the AWS CloudFormation console.

Conclusion
By transforming unstructured insurance data into metadata‐rich outputs, you can accomplish the following:

Accelerate fraud detection by cross‐referencing multimodal data
Enhance customer 360-degree insights by uniting claims, calls, and service records
Support real‐time decisions through AI‐assisted search and analytics

As this multi‐agent collaboration pipeline matures, specialized issue resolver agents and refined LLM prompts can further reduce human involvement—unlocking end‐to‐end automation and improved decision‐making. Ultimately, this domain‐aware approach future‐proofs your claims processing workflows by harnessing raw, unstructured data as actionable business intelligence.
To get started with this solution, take the following next steps:

Deploy the CloudFormation stack and experiment with the sample data.
Refine domain rules or agent prompts based on your team’s feedback.
Use the metadata in your S3 data lake for advanced analytics like real‐time risk assessment or fraud detection.
Connect an Amazon Bedrock knowledge base to KnowledgeBaseDataBucket for advanced Q&A and RAG.

With a multi‐agent architecture in place, your insurance data ceases to be a scattered liability, becoming instead a unified source of high‐value insights.
Refer to the following additional resources to explore further:

Automate tasks in your application using AI agents
Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases
Amazon Bedrock Samples GitHub repo

About the Author
Piyali Kamra is a seasoned enterprise architect and a hands-on technologist with over two decades of experience building and executing large-scale enterprise IT projects across geographies. She believes that building large-scale enterprise systems is not an exact science but more like an art: rather than always choosing the first technology that comes to mind, tools and technologies must be carefully selected based on the team’s culture, strengths, weaknesses, and risks, in tandem with a forward-looking vision for how you want to shape your product a few years down the road.

Automating complex document processing: How Onity Group built an intel …

In the mortgage servicing industry, efficient document processing can mean the difference between business growth and missed opportunities. This post explores how Onity Group, a financial services company specializing in mortgage servicing and origination, used Amazon Bedrock and other AWS services to transform their document processing capabilities.
Onity Group, founded in 1988, is headquartered in West Palm Beach, Florida. Through its primary operating subsidiary, PHH Mortgage Corporation, and Liberty Reverse Mortgage brand, the company provides mortgage servicing and origination solutions to homeowners, business clients, investors, and others.
Onity processes millions of pages across hundreds of document types annually, including legal documents such as deeds of trust where critical information is often contained within dense text. The company also had to manage inconsistent handwritten entries and the need to verify notarization and legal seals—tasks that traditional optical character recognition (OCR) and AI and machine learning (AI/ML) solutions struggled to handle effectively. By using foundation models (FMs) provided by Amazon Bedrock, Onity achieved a 50% reduction in document extraction costs while improving overall accuracy by 20% compared to their previous OCR and AI/ML solution.
Onity’s intelligent document processing (IDP) solution dynamically routes extraction tasks based on content complexity, using the strengths of both its custom AI models and generative AI capabilities provided by Amazon Web Services (AWS) through Amazon Bedrock. This dual-model approach enabled Onity to address the scale and diversity of its mortgage servicing documents more efficiently, driving significant improvements in both cost and accuracy.

“We needed a solution that could evolve as quickly as our document processing needs,” says Raghavendra (Raghu) Chinhalli, VP of Digital Transformation at Onity Group.
“By combining AWS AI/ML and generative AI services, we achieved the perfect balance of cost, performance, accuracy, and speed to market,” adds Priyatham Minnamareddy, Director of Digital Transformation & Intelligent Automation.

Why traditional OCR and ML models fall short
Traditional document processing presented several fundamental challenges that drove Onity’s search for a more sophisticated solution. The following are key examples:

Verbose documents with data elements not clearly identified

Issue – Key documents in mortgage servicing contain verbose text with critical data elements embedded without clear identifiers or structure
Example – Identifying the exact legal description from a deed of trust, which might be buried within paragraphs of legalese

Inconsistent handwritten text

Issue – Documents contain handwritten elements that vary significantly in quality, style, and legibility
Example – Simple variations in writing formats—such as state names (GA and Georgia) or monetary values (200K or 200,000)—create significant extraction challenges

Notarization and legal seal detection

Issue – Identifying whether a document is notarized, detecting legal court stamps, verifying if a notary’s commission has expired, or extracting data from legal seals, which come in multiple shapes, requires a deeper understanding of visual and textual cues that traditional methods might miss

Limited contextual understanding

Issue – Traditional OCR models, although adept at digitizing text, often lack the capacity to interpret the semantic context within a document, hindering a true understanding of the information contained

These complexities in mortgage servicing documents—ranging from verbose text to inconsistent handwriting and the need for specialized seal detection—proved to be significant limitations for traditional OCR and ML models. This drove Onity to seek a more sophisticated solution to address these fundamental challenges.
Solution overview
To address these document processing challenges, Onity built an intelligent solution combining AWS AI/ML and generative AI services.
Amazon Textract is an ML service that automates the extraction of text, data, and insights from documents and images. By using Amazon Textract, organizations can streamline document processing workflows and unlock valuable data to power intelligent applications.
Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies. Through a single API, Amazon Bedrock provides access to models from providers such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon, along with a broad set of capabilities to build secure, private, and responsible generative AI applications.
Amazon Bedrock gives you the flexibility to choose the FM that best suits your needs. For IDP, common solutions use text and vision models such as Amazon Nova Pro or Anthropic’s Claude Sonnet. Beyond model access, Amazon Bedrock provides enterprise-grade security with data processing within your Amazon virtual private cloud (VPC), built-in guardrails for responsible AI use, and comprehensive data protection capabilities that are essential for handling sensitive financial documents. You can select the model that strikes the right balance of accuracy, performance, and cost efficiency for your specific application.
The following figure shows how the solution works.

Document ingestion – Documents are uploaded to Amazon Simple Storage Service (Amazon S3). Uploading triggers automated processing workflows.
Preprocessing – Before analysis, documents undergo optimization through image enhancement, noise reduction, and layout analysis. These preprocessing steps help facilitate maximum accuracy for subsequent OCR processing.
Classification – Classification occurs through a three-step intelligent workflow orchestrated by Onity’s document classification application. The process outputs each page’s document type and page number in JSON format:

The application uses Amazon Textract to extract document contents.
Extracted content is processed by Onity’s custom AI model. If the model’s confidence score meets the predetermined threshold, classification is complete.
If the document isn’t recognized because the model isn’t trained on that document type, the application automatically routes the document to Anthropic’s Claude Sonnet in Amazon Bedrock. This foundation model, along with other text and vision models such as Anthropic’s Claude and Amazon Nova, can classify documents without additional training, analyzing both text and images. This dual-model approach, combining Onity’s custom model with the generative AI capabilities of Amazon Bedrock, helps to optimally balance cost efficiency with speed to market (a simplified sketch of this fallback routing follows the list).

Extraction – Onity’s document extraction application employs an algorithm-driven approach that queries an internal database to retrieve specific extraction rules for each document type and data element. It then dynamically routes extraction tasks between Amazon Textract and Amazon Bedrock FMs based on the complexity of the content. For example, verifying notarization requires complex visual and textual analysis. In these cases, the application uses the capabilities of Amazon Bedrock advanced text and vision models. The solution is built on the Amazon Bedrock API, which allows Onity to use different FMs that provide the optimal balance of cost and accuracy for each document type. This dynamic routing of extraction tasks allows Onity to optimize the balance between cost, performance, and accuracy.
Persistence – The extracted information is stored in a structured format in Onity’s operational databases and in a semi-structured format in Amazon S3 for further downstream processing.
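The confidence-based fallback in the classification step can be sketched roughly as follows. The threshold value, function names, custom-model call, and model ID are illustrative assumptions rather than Onity's production code.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # Region is an assumption
CONFIDENCE_THRESHOLD = 0.85  # assumed value

def classify_page(page_text, custom_model_classify):
    """Sketch of the dual-model routing: try the custom model first, then
    fall back to a Bedrock text-and-vision model when confidence is low."""
    label, confidence = custom_model_classify(page_text)  # hypothetical custom model
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"document_type": label, "source": "custom_model"}

    # Fall back to a foundation model that needs no task-specific training
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # assumed model choice
        messages=[{
            "role": "user",
            "content": [{"text": "Classify this mortgage document page and "
                                 "return the document type:\n\n" + page_text}],
        }],
    )
    return {
        "document_type": response["output"]["message"]["content"][0]["text"],
        "source": "bedrock_fallback",
    }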

Security overview
When processing sensitive financial documents, Onity implements robust data protection measures. Data is encrypted at rest using AWS Key Management Service (AWS KMS) and in transit using TLS protocols. Access to data is strictly controlled using AWS Identity and Access Management (IAM) policies. For architectural best practices for building financial services industry (FSI) applications on AWS, refer to the AWS Financial Services Industry Lens. This solution follows AWS security best practice guidance from the Security Pillar of the AWS Well-Architected Framework. For AWS security and compliance best practices, refer to Best Practices for Security, Identity, & Compliance.
Transforming document processing with Amazon Bedrock: Sample use cases
This section demonstrates how Onity uses Amazon Bedrock to automate the extraction of critical information from complex mortgage servicing documents.
Deed of trust data extraction
A deed of trust is a critical legal document that creates a security interest in real property. These documents are typically verbose, containing multiple pages of legal text with critical information including notarization details, legal stamps, property descriptions, and rider attachments. The intelligent extraction solution has reduced data extraction costs by 50% while improving overall accuracy by 20% compared to the previous OCR and AI/ML solution.
Notarization information extraction
The following is a sample of a notarized document that combines printed and handwritten text and a notary seal. The document image is passed to the application with a prompt to extract the following information: state, county, notary date, notary expiry date, presence of notary seal, person signed before notary, and notary public name. The prompt also instructs that if a field is manually crossed out or modified, the manually written or modified text should be used for that field in the output. Example output:

{
  "state": "Indiana",
  "county": "Monroe",
  "notary_date": "8/13/2024",
  "notary_expiry_date": "8/24/25",
  "notary_seal": "Present",
  "person_signed": "[Redacted]",
  "notary_public": "[Redacted]"
}
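A call of the following shape could produce that kind of output. It is a hedged sketch using the Amazon Bedrock Converse API with an image attachment; the Region, model ID, file name, prompt text, and field list are assumptions rather than Onity's exact implementation.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # Region is an assumption

with open("notary_page.png", "rb") as f:  # hypothetical document image
    image_bytes = f.read()

prompt = (
    "Extract state, county, notary_date, notary_expiry_date, notary_seal "
    "(Present/Absent), person_signed, and notary_public from this notarized "
    "document. If a field is manually crossed out or modified, use the "
    "handwritten value. Respond with JSON only."
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # assumed model choice
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": prompt},
        ],
    }],
)

print(json.loads(response["output"]["message"]["content"][0]["text"]))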

Extract rider information
The following image is of a rider that includes text and a series of check boxes (selected and unselected). The document image is passed to the application with a prompt to extract both checked riders and other riders listed on the document in a provided JSON format.

Example output:

{
  "riders_checked": [],
  "Others_listed": ["Manufactured Home Rider", "Manufactured Home Affidavit of Affixation"]
}

Automation of the checklist review of home appraisal documents
Home appraisal reports contain detailed property comparisons and valuations that require careful review of multiple data points, including room counts, square footage, and property features. Traditionally, this review process required manual verification and cross-referencing, making it time-consuming and prone to errors. The automated solution now validates property comparisons and identifies potential discrepancies, significantly reducing review times while improving accuracy by 65% over the manual process.
The following example shows a document in a grid layout with rows and columns of information. The document image is passed to the application with a prompt to verify if the room counts are identical across the subject and comparables in the appraisal report and if square footages are within a specified percentage of the subject property’s square footage. The prompt also requests an explanation of the analysis results. The application then extracts the required information and provides detailed justification for its findings.

Example output:

{
  "Result": "Yes",
  "Explanation": "Both conditions are met. Room counts match at 4-2-2.0 (total-bedrooms-baths) across all properties. Subject property is 884 sq ft, and all comparable (884 sq ft, 884 sq ft, and 1000 sq ft) fall within 15% variance range (751.4-1016.6 sq ft). Comparable #3 at 1000 sq ft is within acceptable 15% range."
}

Automated credit report analysis
Credit reports are essential documents in mortgage servicing that contain critical borrower information from multiple credit bureaus. These reports arrive in diverse formats with scattered information, making manual data extraction time-consuming and error-prone. The solution automatically extracts and standardizes credit scores and scoring models across different report formats, achieving approximately 85% accuracy.
The following image shows a credit report that combines rows and columns with number and text values. The document image is passed to the application using a prompt instructing it to extract the required information.

Example output:

{
  "EFX": {
    "Score": 683,
    "ScoreModel": "Equifax Beacon 5.0"
  },
  "XPN": {
    "Score": 688,
    "ScoreModel": "Experian Fair Isaac V2"
  },
  "TRU": {
    "Score": 691,
    "ScoreModel": "FICO Risk Score Classic 04"
  }
}

Conclusion
Onity’s implementation of intelligent document processing, powered by AWS generative AI services, demonstrates how organizations can transform complex document handling challenges into strategic advantages. By using the generative AI capabilities of Amazon Bedrock, Onity achieved a remarkable 50% reduction in document extraction costs while improving overall accuracy by 20% compared to their previous OCR and AI/ML solution. The impact was even more dramatic in specific use cases—their credit report processing achieved accuracy rates of up to 85%—demonstrating the solution’s exceptional capability in handling complex, multiformat documents.
The flexible FM selection provided by Amazon Bedrock enables organizations to choose and evolve their AI capabilities over time, helping to strike the optimal balance between performance, accuracy, and cost for each specific use case. The solution’s ability to handle complex documents, including verbose legal documents, handwritten text, and notarized materials, showcases the transformative potential of modern AI technologies in financial services. Beyond the immediate benefits of cost savings and improved accuracy, this implementation provides a blueprint for organizations seeking to modernize their document processing operations while maintaining the agility to adapt to evolving business needs. The success of this solution proves that thoughtful application of AWS AI/ML and generative AI services can deliver tangible business results while positioning organizations for continued innovation in document processing capabilities.
If you have similar document processing challenges, we recommend starting with Amazon Textract to evaluate if its core OCR and data extraction capabilities meet your needs. For more complex use cases requiring advanced contextual understanding and visual analysis, use Amazon Bedrock text and vision foundation models, such as Amazon Nova Lite, Nova Pro, Anthropic’s Claude Sonnet, and Anthropic’s Claude. Using an Amazon Bedrock model playground, you can quickly experiment with these multimodal models and then compare the best foundation models across different metrics such as accuracy, robustness, and cost using Amazon Bedrock model evaluation. Through this process, you can make informed decisions about which model provides the best balance of performance and cost-effectiveness for your specific use case.

About the author
Ramesh Eega is a Global Accounts Solutions Architect based out of Atlanta, GA. He is passionate about helping customers throughout their cloud journey.

Agentic AI in Financial Services: IBM’s Whitepaper Maps Opportunities, Risks, and Responsible Integration

As autonomous AI agents move from theory into implementation, their impact on the financial services sector is becoming tangible. A recent whitepaper from IBM Consulting, titled “Agentic AI in Financial Services: Opportunities, Risks, and Responsible Implementation”, outlines how these AI systems—designed for autonomous decision-making and long-term planning—can fundamentally reshape how financial institutions operate. The paper presents a balanced framework that identifies where Agentic AI can add value, the risks it introduces, and how institutions can implement these systems responsibly.

Understanding Agentic AI

AI agents, in this context, are software entities that interact with their environments to accomplish tasks with a high degree of autonomy. Unlike traditional automation or even LLM-powered chatbots, Agentic AI incorporates planning, memory, and reasoning to execute dynamic tasks across systems. IBM categorizes them into Principal, Service, and Task agents, which collaborate in orchestrated systems. These systems enable the agents to autonomously process information, select tools, and interact with human users or enterprise systems in a closed loop of goal pursuit and reflection.

The whitepaper describes the evolution from rule-based automation to multi-agent orchestration, emphasizing how LLMs now serve as the reasoning engine that drives agent behavior in real-time. Crucially, these agents can adapt to evolving conditions and handle complex, cross-domain tasks, making them ideal for the intricacies of financial services.

Key Opportunities in Finance

IBM identifies three primary use case patterns where Agentic AI can unlock significant value:

Customer Engagement & Personalization: Agents can streamline onboarding, personalize services through real-time behavioral data, and drive KYC/AML processes using tiered agent hierarchies that reduce manual oversight.

Operational Excellence & Governance: Agents improve internal efficiencies by automating risk management, compliance verification, and anomaly detection, while maintaining auditability and traceability.

Technology & Software Development: They support IT teams with automated testing, predictive maintenance, and infrastructure optimization—redefining DevOps through dynamic, self-improving workflows.

These systems promise to replace fragmented interfaces and human handoffs with integrated, persona-driven agent experiences grounded in high-quality, governed data products.

Risk Landscape and Mitigation Strategies

Autonomy in AI brings unique risks. The IBM paper categorizes them under the system’s core components—goal misalignment, tool misuse, and dynamic deception being among the most critical. For instance, a wealth management agent might misinterpret a client’s risk appetite due to goal drift, or bypass controls by chaining permissible actions in unintended ways.

Key mitigation strategies include:

Goal Guardrails: Explicitly defined objectives, real-time monitoring, and value alignment feedback loops.

Access Controls: Least-privilege design for tool/API access, combined with dynamic rate-limiting and auditing.

Persona Calibration: Regularly reviewing agents’ behavior to avoid biased or unethical actions.

The whitepaper also emphasizes agent persistence and system drift as long-term governance challenges. Persistent memory, while enabling learning, can cause agents to act on outdated assumptions. IBM proposes memory reset protocols and periodic recalibrations to counteract drift and ensure continued alignment with organizational values.

Regulatory Readiness and Ethical Design

IBM outlines regulatory developments in jurisdictions like the EU and Australia, where agentic systems are increasingly considered “high-risk.” These systems must comply with emerging mandates for transparency, explainability, and continuous human oversight. In the EU’s AI Act, for example, agents influencing access to financial services may fall under stricter obligations due to their autonomous and adaptive behavior.

The paper recommends proactive alignment with ethical AI principles even in the absence of regulation—asking not just can we, but should we. This includes auditing agents for deceptive behavior, embedding human-in-the-loop structures, and maintaining transparency through natural language decision narratives and visualized reasoning paths.

Conclusion

Agentic AI stands at the frontier of enterprise automation. For financial services firms, the promise lies in enhanced personalization, operational agility, and AI-driven governance. Yet these benefits are closely linked to how responsibly these systems are designed and deployed. IBM’s whitepaper serves as a practical guide—advocating for a phased, risk-aware adoption strategy that includes governance frameworks, codified controls, and cross-functional accountability.

Check out the White Paper. All credit for this research goes to the researchers of this project.
The post Agentic AI in Financial Services: IBM’s Whitepaper Maps Opportunities, Risks, and Responsible Integration appeared first on MarkTechPost.

Chain-of-Thought May Not Be a Window into AI’s Reasoning: Anthropic …

Chain-of-thought (CoT) prompting has become a popular method for improving and interpreting the reasoning processes of large language models (LLMs). The idea is simple: if a model explains its answer step-by-step, then those steps should give us some insight into how it reached its conclusion. This is especially appealing in safety-critical domains, where understanding how a model reasons—or misreasons—can help prevent unintended behavior. But a fundamental question remains: are these explanations actually true to what the model is doing internally? Can we trust what the model says it’s thinking?

Anthropic Confirms: Chain-of-Thought Isn’t Really Telling You What AI is Actually “Thinking”

Anthropic’s new paper, “Reasoning Models Don’t Always Say What They Think,” directly addresses this question. The researchers evaluated whether leading reasoning models, such as Claude 3.7 Sonnet and DeepSeek R1, accurately reflect their internal decision-making in their CoT outputs. They constructed prompts containing six types of hints—ranging from neutral suggestions like user feedback to more problematic ones like grader hacking—and tested whether models acknowledged using these hints when they influenced the answer.

The results were clear: in most cases, the models failed to mention the hint, even when their answer changed because of it. In other words, the CoT often concealed key influences on the model’s reasoning, revealing them in less than 20% of applicable cases.

Technical Approach and What It Tells Us

To assess CoT faithfulness, the team designed paired prompts—one standard and one with an embedded hint. They filtered for cases where the model changed its answer in the presence of the hint, indicating that the hint likely affected the model’s internal reasoning. Then, they checked whether the model’s CoT actually verbalized its reliance on the hint. If it didn’t, that was considered an unfaithful CoT.

The benefit of this setup is that it sidesteps the need to directly probe the model’s internal representations. Instead, it infers unspoken reasoning based on consistent behavioral shifts. This framework is valuable for analyzing where and how CoTs fall short—and whether they can be trusted in contexts like safety evaluations and reinforcement learning.
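To make the setup concrete, the following is a minimal sketch of this paired-prompt comparison. The ask_model helper, hint formatting, and keyword check are illustrative assumptions, not Anthropic's evaluation code:

from typing import Optional

def is_unfaithful(question: str, hint: str, hinted_answer: str, ask_model) -> Optional[bool]:
    """Return True if the hint changed the answer but the CoT never acknowledges it.

    ask_model(prompt) -> (answer, cot) is a hypothetical helper that queries the model
    and returns its final answer and its chain-of-thought text.
    """
    baseline_answer, _ = ask_model(question)
    answer_with_hint, cot = ask_model(f"{question}\n\nHint: the answer might be {hint}")

    # Keep only cases where the hint plausibly influenced the model:
    # the answer changed and now matches the hinted option.
    if answer_with_hint == baseline_answer or answer_with_hint != hinted_answer:
        return None  # hint did not flip the answer; this example is filtered out

    # Crude verbalization check: does the CoT acknowledge the hint at all?
    mentions_hint = "hint" in cot.lower() or hint.lower() in cot.lower()
    return not mentions_hint

Examples that return True count as unfaithful CoTs; aggregating them over a benchmark yields the faithfulness rates reported in the paper.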

Findings and Why They Matter

Despite the models frequently using the hints to guide their answers, they rarely admitted doing so in the CoT. Claude 3.7 Sonnet acknowledged the hint in only 25% of relevant cases, and DeepSeek R1 fared only slightly better at 39%. For misaligned hints like grader hacks, faithfulness dropped further—to 20% for Claude and 29% for DeepSeek.

The researchers also examined how outcome-based reinforcement learning (RL) affected CoT faithfulness. While RL initially improved the rate at which models verbalized their reasoning, those improvements plateaued early, topping out at 28% on easier benchmarks like MMLU and just 20% on more complex tasks like GPQA.

One particularly striking result came from experiments where models were trained in synthetic environments containing reward hacks—patterns the model could exploit for high reward, even when factually incorrect. The models learned to exploit these patterns consistently but rarely disclosed them in the CoT. In five out of six environments, the reward hack was verbalized in less than 2% of cases, despite being used in over 99% of them.

Interestingly, longer CoTs were often less faithful. Instead of being brief and omitting reasoning, unfaithful CoTs tended to be more verbose, sometimes offering elaborate justifications that disguised the real (and often spurious) reasoning behind the answer.

Conclusion

Anthropic’s findings raise important concerns about relying on CoT as a mechanism for AI interpretability or safety. While CoTs can sometimes surface useful reasoning steps, they frequently omit or obscure critical influences—especially when the model is incentivized to behave strategically. In cases involving reward hacking or unsafe behavior, models may not reveal the true basis for their decisions, even if explicitly prompted to explain themselves.

As AI systems are increasingly deployed in sensitive and high-stakes applications, it’s important to understand the limits of our current interpretability tools. CoT monitoring may still offer value, especially for catching frequent or reasoning-heavy misalignments. But as this study shows, it isn’t sufficient on its own. Building reliable safety mechanisms will likely require new techniques that probe deeper than surface-level explanations.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Chain-of-Thought May Not Be a Window into AI’s Reasoning: Anthropic’s New Study Reveals Hidden Gaps appeared first on MarkTechPost.

Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforce …

Recent developments have shown that RL can significantly enhance the reasoning abilities of LLMs. Building on this progress, the study aims to improve Audio LLMs—models that process audio and text to perform tasks like question answering. The MMAU benchmark is a widely used dataset designed to evaluate these models, featuring multiple-choice questions on sounds, speech, and music, some of which require external knowledge. A prior approach, R1-AQA, used GRPO (Group Relative Policy Optimization) to fine-tune the Qwen2-Audio model on the AVQA dataset, achieving state-of-the-art (SOTA) results on MMAU. Inspired by this, the authors applied GRPO to fine-tune Qwen2.5-Omni-7B, a newer multimodal model, further improving performance. Additionally, they introduced a method to automatically generate audio QA data, leading to even better outcomes.

Compared to methods like SARI, which uses a more complex mix of supervised fine-tuning and RL with structured reasoning, the authors’ approach is simpler, relying solely on RL without explicit reasoning steps. They also conducted experiments with text-only inputs to investigate the role of GRPO in performance gains. Surprisingly, fine-tuning the models using just text data yielded nearly the same improvements as training with audio and text. This finding suggests that GRPO primarily enhances the model’s reasoning ability through text, significantly contributing to its improved performance in audio QA tasks. 

Researchers from MIT CSAIL, Goethe University, IBM Research, and others introduce Omni-R1, a fine-tuned version of the multi-modal LLM Qwen2.5-Omni using the GRPO reinforcement learning method. Trained on the AVQA dataset, Omni-R1 sets new state-of-the-art results on the MMAU benchmark across all audio categories. Surprisingly, much of the improvement stems from enhanced text-based reasoning rather than audio input. Fine-tuning with text-only data also led to notable performance gains. Additionally, the team generated large-scale audio QA datasets using ChatGPT, further boosting accuracy. Their work highlights the significant impact of text reasoning in audio LLM performance and promises the public release of all resources. 

The Omni-R1 model fine-tunes Qwen2.5-Omni using the GRPO reinforcement learning method with a simple prompt format that allows direct answer selection, making it memory-efficient for 48GB GPUs. GRPO avoids a value function by comparing grouped outputs using a reward based solely on answer correctness. Researchers used audio captions from Qwen-2 Audio to expand training data and prompted ChatGPT to generate new question-answer pairs. This method produced two datasets—AVQA-GPT and VGGS-GPT—covering 40k and 182k audios, respectively. Training on these automatically generated datasets improved performance, with VGGS-GPT helping Omni-R1 achieve state-of-the-art accuracy on the MMAU benchmark. 
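To make the group-relative idea concrete, here is a minimal sketch of how per-answer advantages can be computed from a binary correctness reward. It illustrates the general GRPO objective rather than the authors' training code, and the exact-match reward is an assumption:

import numpy as np

def grpo_advantages(sampled_answers: list, correct_answer: str) -> np.ndarray:
    """Group-relative advantages from a binary correctness reward (illustrative sketch).

    Each answer sampled for the same question is scored 1.0 if it matches the reference
    answer and 0.0 otherwise; advantages are the rewards standardized within the group,
    so no learned value function is needed.
    """
    rewards = np.array([1.0 if a.strip() == correct_answer.strip() else 0.0
                        for a in sampled_answers])
    std = rewards.std()
    if std == 0:
        return np.zeros_like(rewards)  # all answers equally right or wrong: no learning signal
    return (rewards - rewards.mean()) / std

These advantages would then weight the policy-gradient update applied to the tokens of each sampled answer.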

The researchers fine-tuned Qwen2.5-Omni using GRPO on AVQA, AVQA-GPT, and VGGS-GPT datasets. Results show notable performance gains, with the best average score of 71.3% on the MMAU Test-mini from VGGS-GPT. Qwen2.5-Omni outperformed baselines, including SARI, and showed strong reasoning even without audio, suggesting robust text-based understanding. GRPO fine-tuning improved Qwen2-Audio more significantly due to its weaker initial text reasoning. Surprisingly, fine-tuning without audio boosted performance, while text-only datasets like ARC-Easy yielded comparable results. Improvements mainly stem from enhanced text reasoning, though audio-based fine-tuning remains slightly superior for optimal performance.

In conclusion, Omni-R1 is an Audio LLM developed by fine-tuning Qwen2.5-Omni using the GRPO reinforcement learning method for enhanced audio question answering. Omni-R1 achieves new state-of-the-art results on the MMAU benchmark across sounds, speech, music, and overall performance. Two new large-scale datasets, AVQA-GPT and VGGS-GPT, were created using automatically generated questions, further boosting model accuracy. Experiments show that GRPO mainly enhances text-based reasoning, significantly contributing to performance. Surprisingly, fine-tuning with only text (without audio) improved audio-based performance, highlighting the value of strong base language understanding. These findings offer cost-effective strategies for developing audio-capable language models. 

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data appeared first on MarkTechPost.

HERE Technologies boosts developer productivity with new generative AI …

This blog post is co-written with Jonas Neuman from HERE Technologies. 
HERE Technologies, a 40-year pioneer in mapping and location technology, collaborated with the AWS Generative AI Innovation Center (GenAIIC) to enhance developer productivity with a generative AI-powered coding assistant. This innovative tool is designed to enhance the onboarding experience for HERE’s self-service Maps API for JavaScript. HERE’s use of generative AI empowers its global developer community to quickly translate natural language queries into interactive map visualizations, streamlining the evaluation and adaptation of HERE’s mapping services.
New developers who try out these APIs for the first time often begin with questions such as “How can I generate a walking route from point A to B?” or “How can I display a circle around a point?” Although HERE’s API documentation is extensive, HERE recognized that accelerating the onboarding process could significantly boost developer engagement. They aim to enhance retention rates and create proficient product advocates through personalized experiences.
To create a solution, HERE collaborated with the GenAIIC. Our joint mission was to create an intelligent AI coding assistant that could provide explanations and executable code solutions in response to users’ natural language queries. The requirement was to build a scalable system that could translate natural language questions into HTML code with embedded JavaScript, ready for immediate rendering as an interactive map that users can see on screen.
The team needed to build a solution that accomplished the following:

Provide value and reliability by delivering correct, renderable code that is relevant to a user’s question
Facilitate a natural and productive developer interaction by providing code and explanations at low latency (as of this writing, around 60 seconds) while maintaining context awareness for follow-up questions
Preserve the integrity and usefulness of the feature within HERE’s system and brand by implementing robust filters for irrelevant or infeasible queries
Offer reasonable cost of the system to maintain a positive ROI when scaled across the entire API system

Together, HERE and the GenAIIC built a solution based on Amazon Bedrock that balanced goals with inherent trade-offs. Amazon Bedrock is a fully managed service that provides access to foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities, enabling you to build generative AI applications with built-in security, privacy, and responsible AI features. The service allows you to experiment with and privately customize different FMs using techniques like fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks. Amazon Bedrock is serverless, alleviates infrastructure management needs, and seamlessly integrates with existing AWS services.
Built on the comprehensive suite of AWS managed and serverless services, including Amazon Bedrock FMs, Amazon Bedrock Knowledge Bases for RAG implementation, Amazon Bedrock Guardrails for content filtering, and Amazon DynamoDB for conversation management, the solution delivers a robust and scalable coding assistant without the overhead of infrastructure management. The result is a practical, user-friendly tool that can enhance the developer experience and provide a novel way for API exploration and fast solutioning of location and navigation experiences.
In this post, we describe the details of how this was accomplished.
Dataset
We used the following resources as part of this solution:

Domain documentation – We used two publicly available resources: HERE Maps API for JavaScript Developer Guide and HERE Maps API for JavaScript API Reference. The Developer Guide offers conceptual explanations, and the API Reference provides detailed API function information.
Sample examples – HERE provided 60 cases, each containing a user query, HTML/JavaScript code solution, and brief description. These examples span multiple categories, including geodata, markers, and geoshapes, and were divided into training and testing sets.
Out-of-scope queries – HERE provided samples of queries beyond the HERE Maps API for JavaScript scope, which the large language model (LLM) should not respond to.

Solution overview
To develop the coding assistant, we designed and implemented a RAG workflow. Although standard LLMs can generate code, they often work with outdated knowledge and can’t adapt to the latest HERE Maps API for JavaScript changes or best practices. HERE Maps API for JavaScript documentation can significantly enhance coding assistants by providing accurate, up-to-date context. The storage of HERE Maps API for JavaScript documentation in a vector database allows the coding assistant to retrieve relevant snippets for user queries. This allows the LLM to ground its responses in official documentation rather than potentially outdated training data, leading to more accurate code suggestions.
The following diagram illustrates the overall architecture.

The solution architecture comprises four key modules:

Follow-up question module – This module enables follow-up question answering by contextual conversation handling. Chat histories are stored in DynamoDB and retrieved when users pose new questions. If a chat history exists, it is combined with the new question. The LLM then processes it to reformulate follow-up questions into standalone queries for downstream processing. The module maintains context awareness while recognizing topic changes, preserving the original question when the new question deviates from the previous conversation context.
Scope filtering and safeguard module – This module evaluates whether queries fall within the HERE Maps API for JavaScript scope and determines their feasibility. We applied Amazon Bedrock Guardrails and Anthropic’s Claude 3 Haiku on Amazon Bedrock to filter out-of-scope questions. With a short natural language description, Amazon Bedrock Guardrails helps define a set of out-of-scope topics to block for the coding assistant, for example topics about other HERE products. Amazon Bedrock Guardrails also helps filter harmful content containing topics such as hate speech, insults, sex, violence, and misconduct (including criminal activity), and helps protect against prompt attacks. This makes sure the coding assistant follows responsible AI policies. For in-scope queries, we employ Anthropic’s Claude 3 Haiku model to assess feasibility by analyzing both the user query and retrieved domain documents. We selected Anthropic’s Claude 3 Haiku for its optimal balance of performance and speed. The system generates standard responses for out-of-scope or infeasible queries, and viable questions proceed to response generation.

Knowledge base module – This module uses Amazon Bedrock Knowledge Bases for document indexing and retrieval operations. Amazon Bedrock Knowledge Bases is a comprehensive managed service that simplifies the RAG process from end to end. It handles everything from data ingestion to indexing, retrieval, and generation automatically, removing the complexity of building and maintaining custom integrations and managing data flows. For this coding assistant, we used Amazon Bedrock Knowledge Bases for document indexing and retrieval. The multiple options for document chunking, embedding generation, and retrieval methods offered by Amazon Bedrock Knowledge Bases make it highly adaptable and allow us to test and identify the optimal configuration. We created two separate indexes, one for each domain document. This dual-index approach makes sure content is retrieved from both documentation sources for response generation. The indexing process implements hierarchical chunking with the Cohere embedding English V3 model on Amazon Bedrock, and semantic retrieval is implemented for document retrieval. A minimal retrieval sketch follows this list.
Response generation module – The response generation module processes in-scope and feasible queries using Anthropic’s Claude 3.5 Sonnet model on Amazon Bedrock. It combines user queries with retrieved documents to generate HTML code with embedded JavaScript code, capable of rendering interactive maps. Additionally, the module provides a concise description of the solution’s key points. We selected Anthropic’s Claude 3.5 Sonnet for its superior code generation capabilities.
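The following is a minimal sketch of how a module like this can query an Amazon Bedrock knowledge base at runtime using the Agent Runtime retrieve API; the knowledge base ID, Region, and result count are placeholders rather than the values used in the actual solution:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def retrieve_documents(query: str, knowledge_base_id: str, top_k: int = 5) -> list:
    """Retrieve the top-k document chunks for a query from a Bedrock knowledge base."""
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=knowledge_base_id,  # placeholder ID
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [result["content"]["text"] for result in response["retrievalResults"]]

In the actual solution, one such call is made per index so that both the Developer Guide and the API Reference contribute context to the response.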

Solution orchestration
Each module discussed in the previous section was decomposed into smaller sub-tasks. This allowed us to model the functionality and various decision points within the system as a Directed Acyclic Graph (DAG) using LangGraph. A DAG is a graph where nodes (vertices) are connected by directed edges (arrows) that represent relationships, and crucially, there are no cycles (loops) in the graph. A DAG allows the representation of dependencies with a guaranteed order, and it helps enable safe and efficient execution of tasks. LangGraph orchestration has several benefits, such as parallel task execution, code readability, and maintainability through state management and streaming support.
The following diagram illustrates the coding assistant workflow.

When a user submits a question, a workflow is invoked, starting at the Reformulate Question node. This node handles the implementation of the follow-up question module (Module 1). The Apply Guardrail, Retrieve Documents, and Review Question nodes run in parallel, using the reformulated input question. The Apply Guardrail node uses denied topics from Amazon Bedrock Guardrails to enforce boundaries and apply safeguards against harmful inputs, and the Review Question node filters out-of-scope inquiries using Anthropic’s Claude 3 Haiku (Module 2). The Retrieve Documents node retrieves relevant documents from the Amazon Bedrock knowledge base to provide the language model with necessary information (Module 3).
The outputs of the Apply Guardrail and Review Question nodes determine the next node invocation. If the input passes both checks, the Review Documents node assesses the question’s feasibility by analyzing if it can be answered with the retrieved documents (Module 2). If feasible, the Generate Response node answers the question and the code and description are streamed to the UI, allowing the user to start getting feedback from the system within seconds (Module 4). Otherwise, the Block Response node returns a predefined answer. Finally, the Update Chat History node persistently maintains the conversation history for future reference (Module 1).
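The following is a simplified LangGraph sketch of this workflow with the node logic stubbed out; the state fields, routing condition, and exact wiring are illustrative assumptions rather than the production implementation:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AssistantState(TypedDict, total=False):
    question: str
    documents: list
    guardrail_ok: bool
    in_scope: bool
    feasible: bool
    answer: str

# Stub node functions standing in for the modules described above.
def reformulate_question(state): return {"question": state["question"]}   # Module 1
def apply_guardrail(state): return {"guardrail_ok": True}                 # Module 2 (Guardrails)
def retrieve_documents(state): return {"documents": []}                   # Module 3
def review_question(state): return {"in_scope": True}                     # Module 2 (scope filter)
def review_documents(state): return {"feasible": True}                    # Module 2 (feasibility)
def generate_response(state): return {"answer": "<html>...</html>"}       # Module 4
def block_response(state): return {"answer": "This request is out of scope."}
def update_chat_history(state): return {}                                 # Module 1 (persist turn)

graph = StateGraph(AssistantState)
for name, fn in [
    ("reformulate_question", reformulate_question),
    ("apply_guardrail", apply_guardrail),
    ("retrieve_documents", retrieve_documents),
    ("review_question", review_question),
    ("review_documents", review_documents),
    ("generate_response", generate_response),
    ("block_response", block_response),
    ("update_chat_history", update_chat_history),
]:
    graph.add_node(name, fn)

graph.add_edge(START, "reformulate_question")
# Fan out to the parallel guardrail, retrieval, and scope-review nodes ...
for node in ("apply_guardrail", "retrieve_documents", "review_question"):
    graph.add_edge("reformulate_question", node)
# ... and join on the feasibility review once all three have finished.
graph.add_edge(["apply_guardrail", "retrieve_documents", "review_question"], "review_documents")

def route(state: AssistantState) -> str:
    ok = state.get("guardrail_ok") and state.get("in_scope") and state.get("feasible")
    return "generate_response" if ok else "block_response"

graph.add_conditional_edges("review_documents", route)
graph.add_edge("generate_response", "update_chat_history")
graph.add_edge("block_response", "update_chat_history")
graph.add_edge("update_chat_history", END)

app = graph.compile()
result = app.invoke({"question": "How can I display a circle around a point?"})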
This pipeline backs the code assistant chatbot capability, providing an efficient and user-friendly experience for developers seeking guidance on implementing the HERE Maps API for JavaScript. The following code and accompanying screenshot are an example of the model-generated code and the rendered map for the query "How to open an infobubble when clicking on a marker?"

<!DOCTYPE html>
<html>
<head>
  <meta name="viewport" content="initial-scale=1.0, width=device-width" />
  <script src="https://js.api.here.com/v3/3.1/mapsjs-core.js" type="text/javascript" charset="utf-8"></script>
  <script src="https://js.api.here.com/v3/3.1/mapsjs-service.js" type="text/javascript" charset="utf-8"></script>
  <script src="https://js.api.here.com/v3/3.1/mapsjs-mapevents.js" type="text/javascript" charset="utf-8"></script>
  <script src="https://js.api.here.com/v3/3.1/mapsjs-ui.js" type="text/javascript" charset="utf-8"></script>
  <link rel="stylesheet" type="text/css" href="https://js.api.here.com/v3/3.1/mapsjs-ui.css" />
</head>
<body>
  <div id="map" style="width: 100%; height: 480px;"></div>
  <script type="text/javascript">

  function addMarkerWithInfoBubble(map, ui) {
    // Create a marker
    var marker = new H.map.Marker({lat: 28.6071, lng: 77.2127});

    // Add the marker to the map
    map.addObject(marker);

    // Create the info bubble content
    var bubbleContent = '<div><h3>Delhi, India</h3><p>Capital city of India</p></div>';

    // Add a click event listener to the marker
    marker.addEventListener('tap', function(evt) {
      // Create an info bubble object
      var bubble = new H.ui.InfoBubble(evt.target.getGeometry(), {
        content: bubbleContent
      });

      // Add info bubble to the UI
      ui.addBubble(bubble);
    });
  }

  /**
   * Boilerplate map initialization code starts below:
   */

  // Step 1: initialize communication with the platform
  // In your own code, replace variable window.apikey with your own apikey
  var platform = new H.service.Platform({
    apikey: 'Your_API_Key'
  });
  var defaultLayers = platform.createDefaultLayers();

  // Step 2: initialize a map
  var map = new H.Map(document.getElementById('map'),
    defaultLayers.vector.normal.map, {
      center: {lat: 28.6071, lng: 77.2127},
      zoom: 13,
      pixelRatio: window.devicePixelRatio || 1
    });
  // add a resize listener to make sure that the map occupies the whole container
  window.addEventListener('resize', () => map.getViewPort().resize());

  // Step 3: make the map interactive
  // MapEvents enables the event system
  // Behavior implements default interactions for pan/zoom (also on mobile touch environments)
  var behavior = new H.mapevents.Behavior(new H.mapevents.MapEvents(map));

  // Step 4: Create the default UI components
  var ui = H.ui.UI.createDefault(map, defaultLayers);

  // Step 5: main logic
  addMarkerWithInfoBubble(map, ui);
  </script>
</body>
</html>

Prompt engineering
To improve final code generation accuracy, we employed extensive prompt engineering for the response generation module. The final prompt incorporated the following components:

Task breakdown with chain of thought – We decomposed the code generation task into sequential steps, providing structured guidance for the LLM to follow during response generation.
Few-shot learning – We enhanced the prompt with three carefully selected training examples from question categories where the LLM initially underperformed. These examples included retrieved documents and expected responses, demonstrating the desired output format.
Code template integration – In response to subject matter expert (SME) feedback regarding map interactivity issues, we incorporated a code template for generation. This template contains boilerplate code for HERE map initialization and setup, improving accuracy and providing consistent map interactivity in the generated code.

The following is the core structure of the prompt and the components discussed:

Task Instructions
Examples
User Query
Developer Guide Content
API Reference Content
Code Template
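As an illustration of how these pieces fit together, the following is a hedged sketch of such a prompt template; the section tags, wording, and placeholder names are assumptions rather than the production prompt:

RESPONSE_PROMPT = """You are a coding assistant for the HERE Maps API for JavaScript.

<instructions>
Think step by step: identify the map features the user needs, find the relevant API
calls in the documentation below, then produce a complete HTML page with embedded
JavaScript, followed by a short description of the solution.
</instructions>

<examples>
{few_shot_examples}
</examples>

<user_query>
{user_query}
</user_query>

<developer_guide>
{developer_guide_chunks}
</developer_guide>

<api_reference>
{api_reference_chunks}
</api_reference>

<code_template>
{map_initialization_template}
</code_template>
"""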

Evaluation
We manually evaluated the accuracy of code generation for each question in the test set. Our evaluation focused on two key criteria:

Whether the generated code can render an interactive HERE map
Whether the rendered map addresses the user’s query—for example, if the user requests a circle to be added, this will check whether the generated code successfully adds a circle to the map

Code samples that satisfied both criteria were classified as correct. In addition to accuracy, we also evaluated latency, including both overall latency and time to first token. Overall latency refers to the total time taken to generate the full response. To improve user experience and avoid having users wait without visible output, we employed response streaming. Time to first token measures how long it takes for the system to generate the first token of the response. The evaluation results, based on 20 samples from the testing dataset, are as follows:

Code generation accuracy: 87.5%
Overall latency: 23.5 seconds
Time to first token: Under 8 seconds

The high accuracy makes sure that the code assistant generates correct code to answer the user’s question. The low overall latency and quick time to first token significantly reduce customer waiting time, enhancing the overall user experience.
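Response streaming is what keeps the time to first token low. The following is a minimal sketch of one way to stream tokens from Amazon Bedrock using the Converse API; the model ID and prompt are placeholders, and the production solution streams through its LangGraph pipeline rather than calling Bedrock directly like this:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def stream_completion(prompt: str, model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0"):
    """Yield text chunks as soon as the model produces them."""
    response = bedrock_runtime.converse_stream(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            yield event["contentBlockDelta"]["delta"].get("text", "")

# Usage: print chunks as they arrive instead of waiting for the full response.
for chunk in stream_completion("How can I display a circle around a point?"):
    print(chunk, end="", flush=True)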
Security considerations
Security is our top priority at AWS. For the scope of this post, we shared how we used Amazon Bedrock Guardrails to build a responsible AI application. Safety and security are critical for every application. For in-depth guidance on AWS’s approach to secure and responsible AI development, refer to Securing generative AI and the AWS Whitepaper Navigating the security landscape of generative AI.
Possible improvements
The following two areas are worth exploring to improve overall system accuracy and improve the current mechanism for evaluating the LLM response:

Improved automation evaluation – We recommend exploring automating the evaluation. For example, we can use an LLM-as-a-judge approach to compare ground truth and generated code, alongside automated map rendering checks using tools like Playwright. This combined strategy can offer a scalable, accurate, and efficient framework for evaluating the quality and functionality of LLM-generated map code.
Prompt chaining with self-correction feedback – Future implementations could consider a pipeline to execute the generate code, interact with the map, and feed errors back into the LLM to improve accuracy. The trade-off is this feedback loop would increase the overall system latency.

Conclusion
The outcome of this solution is a fast, practical, user-friendly coding assistant that enhances the developer experience for the HERE Maps API for JavaScript. Through iterative evolution of a RAG approach and prompt engineering techniques, the team surpassed target accuracy and latency without relying on fine-tuning. This means the solution can be expanded to other HERE offerings beyond the HERE Maps API for JavaScript. Additionally, the LLMs backing the assistant can be upgraded as higher-performant FMs are made available on Amazon Bedrock.
Key highlights of the solution include the use of a map initialization code template in the prompt, a modular and maintainable architecture orchestrated by LangGraph, and response streaming capabilities that start displaying generated code in under 8 seconds. The careful selection and combination of language models, optimized for specific tasks, further contributed to the overall performance and cost-effectiveness of the solution.
Overall, the outcomes of this proof of concept were made possible through the partnership between the GenAIIC and HERE Technologies. The coding assistant has laid a solid foundation for HERE Technologies to significantly enhance developer productivity, accelerate API adoption, and drive growth in its developer landscape.
Explore how Amazon Bedrock makes it straightforward to build generative AI applications with model choice and features like Amazon Bedrock Knowledge Bases and Amazon Bedrock Guardrails. Get started with Amazon Bedrock Knowledge Bases to implement RAG-based solutions that can transform your developer experience and boost productivity.

About the Authors
Gan is an Applied Scientist on the AWS Generative AI Innovation and Delivery team. He is passionate about leveraging generative AI techniques to help customers solve real-world business problems.
Grace Lang is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she designs and implements advanced AI solutions across industries. Driven by a passion for solving complex technical challenges, Grace partners with customers to develop innovative machine learning applications.
Julia Wagner is a Senior AI Strategist at AWS’s Generative AI Innovation Center. With her background in product management, she helps teams develop AI solutions focused on customer and business needs. Outside of work, she enjoys biking and mountain activities.
Jonas Neuman is an Engineering Manager at HERE Technologies, based in Berlin, Germany. He is passionate about building great customer-facing applications. Together with his team, Jonas delivers features that help customers sign up for HERE Services and SDKs, manage access, and monitor their usage.
Sibasankar is a Senior Solutions Architect at AWS in the Automotive and Manufacturing team. He is passionate about AI, data and security. In his free time, he loves spending time with his family and reading non-fiction books.
Jared Kramer is an Applied Science Manager at Amazon Web Services based in Seattle. Jared joined Amazon 11 years ago as an ML Science intern. After 6 years in Customer Service Technologies and 4 years in Sustainability Science and Innovation, he now leads a team of Applied Scientists and Deep Learning Architects in the Generative AI Innovation Center. Jared specializes in designing and delivering industry NLP applications and is on the Industry Track program committee for ACL and EMNLP.

Set up a custom plugin on Amazon Q Business and authenticate with Amaz …

Businesses are constantly evolving, and leaders are challenged every day to meet new requirements and are seeking ways to optimize their operations and gain a competitive edge. One of the key challenges they face is managing the complexity of disparate business systems and workflows, which leads to inefficiencies, data silos, and missed opportunities.
Generative AI can play an important role in integrating these disparate systems in a secure and seamless manner, addressing these challenges in a cost-effective way. This integration allows for secure and efficient data exchange, action triggering, and enhanced productivity across the organization. Amazon Q Business plays an important role in making this happen. Amazon Q Business enables organizations to quickly and effortlessly analyze their data, uncover insights, and make data-driven decisions. With its intuitive interface and seamless integration with other AWS services, Amazon Q Business empowers businesses of different sizes to transform their data into actionable intelligence and drive innovation across their operations.
In this post, we demonstrate how to build a custom plugin with Amazon Q Business for backend integration. This plugin can integrate existing systems, including third-party systems, with little to no development in just weeks and automate critical workflows. Additionally, we show how to safeguard the solution using Amazon Cognito and AWS IAM Identity Center, maintaining the safety and integrity of sensitive data and workflows. Amazon Q Business also offers application environment guardrails or chat controls that you can configure to control the end-user chat experience to add an additional layer of safety. Lastly, we show how to expose your backend APIs through Amazon API Gateway, backed by serverless AWS Lambda functions and Amazon DynamoDB.
Solution overview
Amazon Q Business is a fully managed, generative AI-powered assistant that helps enterprises unlock the value of their data and knowledge. With Amazon Q Business, you can quickly find answers to questions, generate summaries and content, and complete tasks by using the information and expertise stored across your company’s various data sources and enterprise systems. At the core of this capability are built-in data source connectors and custom plugins that seamlessly integrate and index content from multiple repositories into a unified index. This enables the Amazon Q Business large language model (LLM) to provide accurate, well-written answers by drawing from the consolidated data and information. The data source connectors act as a bridge, synchronizing content from disparate systems like Salesforce, Jira, and SharePoint into a centralized index that powers the natural language understanding and generative abilities of Amazon Q Business. Amazon Q Business also provides the capability to create custom plugins to integrate with your organization’s backend system and third-party applications.
After you integrate Amazon Q Business with your backend system using a custom plugin, users can ask questions from documents that are uploaded in Amazon Simple Storage Service (Amazon S3). For this post, we use a simple document that contains product names, descriptions, and other related information. Some of the questions you can ask Amazon Q Business might include the following:

“Give me the name of the products.”
“Now list all the products along with the description in tabular format.”
“Now create one of the products <product name>.” (At this stage, Amazon Q Business will require you to authenticate against Amazon Cognito to make sure you have the right permission to work on that application.)
“List all the products along with ID and price in tabular format.”
“Update the price of product with ID <product ID>.”

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The user asks a question using the Amazon Q Business chat interface.
Amazon Q Business searches the indexed document in Amazon S3 for relevant information and presents it to the user.
The user can use the plugin to perform actions (API calls) in the system exposed to Amazon Q Business using OpenAPI 3.x standards.
Because the API is secured with Amazon Cognito, Amazon Q Business requires the user to authenticate against the user credentials available in Amazon Cognito.
On successful authentication, API Gateway forwards the request to Lambda.
The API response is returned to the user through the Amazon Q Business chat interface.

Prerequisites
Before you begin the walkthrough, you must have an AWS account. If you don’t have one, sign up for one. Additionally, you must have access to the following services:

Amazon API Gateway
AWS CloudFormation
Amazon Cognito
Amazon DynamoDB
AWS IAM Identity Center
AWS Lambda
Amazon Q Business Pro (This will have an additional monthly cost)
Amazon S3

Launch the CloudFormation template
Launch the following CloudFormation template to set up Amazon Cognito, API Gateway, DynamoDB, and Lambda resources.

After you deploy the stack, navigate to the Outputs tab for the stack on the AWS CloudFormation console and note the resource details. We use those values later in this post.
If you’re running the CloudFormation template multiple times, make sure to choose a unique name for the stack each time.

Create an Amazon Q Business application
Complete the following steps to create an Amazon Q Business application:

On the Amazon Q Business console, choose Applications in the navigation pane.
Choose Create application.

Provide an application name (for example, product-mgmt-app).
Leave the other settings as default and choose Create.

The application will be created in a few seconds.

On the application details page, choose Data source.
Choose Add an index.
For Index name, enter a name for the index.
For Index provisioning, select Enterprise or Starter.
For Number of units, leave as the default 1.
Choose Add an index.

On the Data source page, choose Add a data source.
Choose Amazon S3 as your data source and enter a unique name.
Enter the data source location as the value of BucketName from the CloudFormation stack outputs in the format s3://<name_here>.

In a later step, we upload a file to this S3 bucket.

For IAM role, choose Create a new service role (recommended).
For Sync scope, select Full sync.
For Frequency, select Run on demand.
Choose Add data source.
On the application details page, choose Manage user access.
Choose Add groups and users.
You can use existing users or groups in IAM Identity Center or create new users and groups, then choose Confirm.

Only these groups and users have access to the Amazon Q Business application for their subscriptions.

Take note of the deployed URL of the application to use in a later step.
On the Amazon S3 console, locate the S3 bucket you noted earlier and upload the sample document.
On the Amazon Q Business console, navigate to the application details page and sync the Amazon S3 data source.

Configure Amazon Cognito
Complete the following steps to set up Amazon Cognito:

On the Amazon Cognito console, navigate to the user pool created using the CloudFormation template (ending with -ProductUserPool).
Under Branding in the navigation pane, choose Domain.
On the Actions menu, choose Create Cognito domain.

We did not create a domain when we created the user pool using the CloudFormation template.

For Cognito domain, enter a domain prefix.
For Version, select Hosted UI.
Choose Create Cognito domain.

Under Applications in the navigation pane, choose App clients.
Choose your app client.

On the app client detail page, choose Login pages and then choose Edit the managed login pages configuration.
For URL, enter the deployed URL you noted earlier, followed by /oauth/callback. For example, https://xxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback.
Specify your identity provider, OAuth 2.0 grant type, OpenID Connect scopes, and custom scopes.

Custom scopes are defined as part of the API configuration in API Gateway. This will help Amazon Q Business determine what action a user is allowed to take. In this case, we are allowing the user to read, write, and delete. However, you can change this based on what you want your users to do using the Amazon Q Business chat.

Choose Save changes.

Take note of the Client ID and Client secret values in the App client information section to use in a later step.

Amazon Cognito doesn’t support changing the client secret after you have created the app client; a new app client is needed if you want to change the client secret.
Lastly, you have to add at least one user to the Amazon Cognito user pool.

Choose Users under User management in the navigation pane and choose Create user.
Create a user to add to your Amazon Cognito user pool.

We will use this user to authenticate before we can chat and ask questions to the backend system using Amazon Q Business.

Create an Amazon Q Business custom plugin
Complete the following steps to create your custom plugin:

On the Amazon Q Business console, navigate to the application you created.
Under Actions in the navigation pane, choose Plugins.
Choose Add plugin.

Select Create custom plugin.
Provide a plugin name (for example, Products).
Under API schema source, select Define with in-line OpenAPI schema editor and enter the following code:

openapi: 3.0.0
info:
  title: CRUD API
  version: 1.0.0
  description: API for performing CRUD operations
servers:
  - url: put api gateway endpoint url here, copy it from cloudformation output

paths:
  /products:
    get:
      summary: List all products
      security:
        - OAuth2:
            - products/read
      description: Returns a list of all available products
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/Product'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    post:
      summary: Create a new product
      security:
        - OAuth2:
            - products/write
      description: Creates a new product
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Product'
      responses:
        '201':
          description: Created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '400':
          description: Bad Request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
  /products/{id}:
    get:
      summary: Get a product
      security:
        - OAuth2:
            - products/read
      description: Retrieves a specific product by its ID
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to retrieve
          schema:
            type: string
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    put:
      summary: Update a product
      security:
        - OAuth2:
            - products/write
      description: Updates an existing product
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to update
          schema:
            type: string
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Product'
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    delete:
      summary: Delete a product
      security:
        - OAuth2:
            - products/delete
      description: Deletes a specific product by its ID
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to delete
          schema:
            type: string
      responses:
        '204':
          description: Successful response
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
components:
  securitySchemes:
    OAuth2:
      type: oauth2
      flows:
        authorizationCode:
          authorizationUrl: <Cognito domain>/oauth2/authorize
          tokenUrl: <Cognito domain>/oauth2/token
          scopes:
            products/read: read product
            products/write: write product
            products/delete: delete product
  schemas:
    Product:
      type: object
      required:
        - id
        - name
        - description
      properties:
        id:
          type: string
        name:
          type: string
        description:
          type: string
    Error:
      type: object
      properties:
        error:
          type: string

In the YAML file, replace the URL value with the value of ProductAPIEndpoint from the CloudFormation stack outputs:

servers:
  - url: https://<<xxxx>>.execute-api.us-east-1.amazonaws.com/dev

Replace the Amazon Cognito domain URL with the domain you created earlier:

authorizationCode:
  authorizationUrl: https://xxxx.auth.us-east-1.amazoncognito.com/oauth2/authorize
  tokenUrl: https://xxxx.auth.us-east-1.amazoncognito.com/oauth2/token

The YAML file contains the schema (OpenAPI 3.x) that Amazon Q Business uses to decide which API needs to be called, based on each operation’s description. For example, the description Returns a list of all available products instructs Amazon Q Business to call this API whenever a user asks to list all products.

For authentication, select Authentication required.
For AWS Secrets Manager secret, choose Create and add new secret, enter the client ID and client secret you saved earlier, and enter the callback URL the same way as you did for the Amazon Cognito hosted UI (https://<>.chat.qbusiness.<<region>>.on.aws/oauth/callback).
For Choose a method to authorize Amazon Q Business, choose Create and use a new service role.
Choose Create plugin.

The last step is to enable the chat orchestration feature so Amazon Q Business can select the plugin automatically.

On the custom plugin details page, choose Admin controls and guardrails under Enhancements in the navigation pane.
In the Global controls section, choose Edit.

Select Allow Amazon Q Business to automatically orchestrate chat queries across plugins and data sources, then choose Save.

Configure API Gateway, Lambda, and DynamoDB resources
Everything related to API Gateway, Lambda, and DynamoDB is already configured using the CloudFormation template. Details are available on the Outputs tab of the stack details page. You can also review the details of the Lambda function and DynamoDB table on their respective service consoles. To learn how the Lambda function is exposed as an API through API Gateway, review the details on the API Gateway console. A sketch of what such a handler can look like follows.
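For illustration only, a minimal Lambda handler for the GET /products route backed by DynamoDB might look like the following sketch; the table name, environment variable, and proxy-integration event shape are assumptions, not taken from the actual CloudFormation template:

import json
import os
import boto3

# Table name supplied by the stack via an environment variable (assumed name).
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("PRODUCTS_TABLE", "Products"))

def lambda_handler(event, context):
    """Handle GET /products requests proxied by API Gateway."""
    if event.get("httpMethod") == "GET" and event.get("resource") == "/products":
        # Scan is fine for a small demo table; a real API would paginate.
        items = table.scan().get("Items", [])
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(items, default=str),
        }
    return {"statusCode": 400, "body": json.dumps({"error": "Unsupported operation"})}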
Chat with Amazon Q Business
Now you’re ready to chat with Amazon Q Business.

On the Amazon Q Business console, navigate to your application.
Choose the link for Deployed URL.
Authenticate using IAM Identity Center (this is to make sure you have access to Amazon Q Business Pro).

You can now ask questions in natural language.
In the following example, we check if Amazon Q Business is able to access the data from the S3 bucket by asking “List all the products and their description in a table.”

After the product descriptions are available, start chatting and ask questions like Can you create product <product name> with the same description please? Alternatively, you can create a new product that isn’t listed in the sample document uploaded in Amazon S3. Amazon Q Business will automatically pick the right plugin (in this case, Products).
Subsequent requests that call APIs through the custom plugin will ask you to authorize your access. Choose Authorize and authenticate with the user credentials created in Amazon Cognito earlier. After you’re authenticated, Amazon Q Business will cache the session token for subsequent API calls and complete the request.

You can query on the products that are available in the backend by asking questions like the following:

Can you please list all the products?
Delete a product by ID or by name.
Create a new product with the name ‘Gloves’ and description as ‘Football gloves’ with automatic in-built cooling

Based on the preceding prompt, a product has been created in the products table in DynamoDB.

Cost considerations
The cost of setting up this solution is based on the price of the individual AWS services being used. Prices of those services are available on the individual service pages. The only mandatory cost is the Amazon Q Business Pro license. For more information, see Amazon Q Business pricing.
Clean up
Complete the following steps to clean up your resources:

Delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.
Delete the Amazon Q Business application.
Delete the Amazon Cognito user pool domain.
Empty and delete the S3 bucket. For instructions, refer to Deleting a general purpose bucket.

Conclusion
In this post, we explored how Amazon Q Business can seamlessly integrate with enterprise systems using a custom plugin to help enterprises unlock the value of their data. We walked you through the process of setting up the custom plugin, including configuring the necessary Amazon Cognito and authentication mechanisms.
With this custom plugin, organizations can empower their employees to work efficiently, find answers quickly, accelerate reporting, automate workflows, and enhance collaboration. You can ask Amazon Q Business natural language questions and watch as it surfaces the most relevant information from your company’s backend system and acts on requests.
Don’t miss out on the transformative power of generative AI and Amazon Q Business. Sign up today and experience the difference that Amazon Q Business can make for your organization’s workflow automation and the efficiency it brings.

About the Authors
Shubhankar Sumar is a Senior Solutions Architect at Amazon Web Services (AWS), working with enterprise software and SaaS customers across the UK to help architect secure, scalable, efficient, and cost-effective systems. He is an experienced software engineer, having built many SaaS solutions powered by generative AI. Shubhankar specializes in building multi-tenant systems on the cloud. He also works closely with customers to bring generative AI capabilities to their SaaS applications.
Dr. Anil Giri is a Solutions Architect at Amazon Web Services. He works with enterprise software and SaaS customers to help them build generative AI applications and implement serverless architectures on AWS. His focus is on guiding clients to create innovative, scalable solutions using cutting-edge cloud technologies.

Ankur Agarwal is a Principal Enterprise Architect at Amazon Web Services Professional Services. Ankur works with enterprise clients to help them get the most out of their investment in cloud computing. He advises on using cloud-based applications, data, and AI technologies to deliver maximum business value.

Detect hallucinations for RAG-based systems

With the rise of generative AI and knowledge extraction in AI systems, Retrieval Augmented Generation (RAG) has become a prominent tool for enhancing the accuracy and reliability of AI-generated responses. RAG is a way to incorporate additional data that the large language model (LLM) was not trained on. This can also help reduce generation of false or misleading information (hallucinations). However, even with RAG’s capabilities, the challenge of AI hallucinations remains a significant concern.
As AI systems become increasingly integrated into our daily lives and critical decision-making processes, the ability to detect and mitigate hallucinations is paramount. Most hallucination detection techniques focus on the prompt and the response alone. However, where additional context is available, such as in RAG-based applications, new techniques can be introduced to better mitigate the hallucination problem.
This post walks you through how to create a basic hallucination detection system for RAG-based applications. We also weigh the pros and cons of different methods in terms of accuracy, precision, recall, and cost.
Although there are currently many new state-of-the-art techniques, the approaches outlined in this post aim to provide simple, user-friendly techniques that you can quickly incorporate into your RAG pipeline to increase the quality of the outputs in your RAG system.
Solution overview
Hallucinations can be categorized into three types, as illustrated in the following graphic.

Scientific literature has come up with multiple hallucination detection techniques. In the following sections, we discuss and implement four prominent approaches to detecting hallucinations: using an LLM prompt-based detector, semantic similarity detector, BERT stochastic checker, and token similarity detector. Finally, we compare approaches in terms of their performance and latency.
Prerequisites
To use the methods presented in this post, you need an AWS account with access to Amazon SageMaker, Amazon Bedrock, and Amazon Simple Storage Service (Amazon S3).
From your RAG system, you will need to store three things:

Context – The area of text that is relevant to a user’s query
Question – The user’s query
Answer – The answer provided by the LLM

The resulting table should look similar to the following example.

question | context | answer
What are cocktails? | Cocktails are alcoholic mixed… | Cocktails are alcoholic mixed…
What are cocktails? | Cocktails are alcoholic mixed… | They have distinct histories…
What is Fortnite? | Fortnite is a popular video… | Fortnite is an online multi…
What is Fortnite? | Fortnite is a popular video… | The average Fortnite player spends…

Approach 1: LLM-based hallucination detection
We can use an LLM to classify the responses from our RAG system into context-conflicting hallucinations and facts. The aim is to identify which responses are based on the context or whether they contain hallucinations.
This approach consists of the following steps:

Create a dataset with questions, context, and the response you want to classify.
Send a call to the LLM with the following information:

Provide the statement (the answer from the LLM that we want to classify).
Provide the context from which the LLM created the answer.
Instruct the LLM to tag sentences in the statement that are directly based on the context.

Parse the outputs and obtain sentence-level numeric scores between 0–1.
Make sure to keep the LLM, memory, and parameters independent from the ones used for Q&A. (This is so the LLM can’t access the previous chat history to draw conclusions.)
Tune the decision threshold for the hallucination scores for a specific dataset based on domain, for example.
Use the threshold to classify the statement as hallucination or fact.

Create a prompt template
To use the LLM to classify the answer to your question, you need to set up a prompt. We want the LLM to take in the context and the answer, and determine from the given context a hallucination score. The score will be encoded between 0 and 1, with 0 being an answer directly from the context and 1 being an answer with no basis from the context.
The following is a prompt with few-shot examples so the LLM knows what the expected format and content of the answer should be:
prompt = “””nnHuman: You are an expert assistant helping human to check if statements are based on the context.
Your task is to read context and statement and indicate which sentences in the statement are based directly on the context.

Provide response as a number, where the number represents a hallucination score, which is a float between 0 and 1.
Set the float to 0 if you are confident that the sentence is directly based on the context.
Set the float to 1 if you are confident that the sentence is not based on the context.
If you are not confident, set the score to a float number between 0 and 1. Higher numbers represent higher confidence that the sentence is not based on the context.

Do not include any other information except for the the score in the response. There is no need to explain your thinking.

<example>
Context: Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered, pay-as-you-go basis. Clients will often use this in combination with autoscaling (a process that allows a client to use more computing in times of high application usage, and then scale down to reduce costs when there is less traffic). These cloud computing web services provide various services related to networking, compute, storage, middleware, IoT and other processing capacity, as well as software tools via AWS server farms. This frees clients from managing, scaling, and patching hardware and operating systems. One of the foundational services is Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual cluster of computers, with extremely high availability, which can be interacted with over the internet via REST APIs, a CLI or the AWS console. AWS’s virtual computers emulate most of the attributes of a real computer, including hardware central processing units (CPUs) and graphics processing units (GPUs) for processing; local/RAM memory; hard-disk/SSD storage; a choice of operating systems; networking; and pre-loaded application software such as web servers, databases, and customer relationship management (CRM).
Statement: ‘AWS is Amazon subsidiary that provides cloud computing services.’
Assistant: 0.05
</example>

<example>
Context: Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered, pay-as-you-go basis. Clients will often use this in combination with autoscaling (a process that allows a client to use more computing in times of high application usage, and then scale down to reduce costs when there is less traffic). These cloud computing web services provide various services related to networking, compute, storage, middleware, IoT and other processing capacity, as well as software tools via AWS server farms. This frees clients from managing, scaling, and patching hardware and operating systems. One of the foundational services is Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual cluster of computers, with extremely high availability, which can be interacted with over the internet via REST APIs, a CLI or the AWS console. AWS’s virtual computers emulate most of the attributes of a real computer, including hardware central processing units (CPUs) and graphics processing units (GPUs) for processing; local/RAM memory; hard-disk/SSD storage; a choice of operating systems; networking; and pre-loaded application software such as web servers, databases, and customer relationship management (CRM).
Statement: ‘AWS revenue in 2022 was $80 billion.’
Assistant: 1
</example>

<example>
Context: Monkey is a common name that may refer to most mammals of the infraorder Simiiformes, also known as the simians. Traditionally, all animals in the group now known as simians are counted as monkeys except the apes, which constitutes an incomplete paraphyletic grouping; however, in the broader sense based on cladistics, apes (Hominoidea) are also included, making the terms monkeys and simians synonyms in regard to their scope. On average, monkeys are 150 cm tall.
Statement: ‘Average monkey is 2 meters high and weighs 100 kilograms.’
Assistant: 0.9
</example>

Context: {context}
Statement: {statement}

\n\nAssistant: ["""

### LANGCHAIN CONSTRUCTS
from langchain.prompts import PromptTemplate

# prompt template with the context and statement filled in at runtime
prompt_template = PromptTemplate(
    template=prompt,
    input_variables=["context", "statement"],
)
Configure the LLM
To retrieve a response from the LLM, you need to configure the LLM using Amazon Bedrock, similar to the following code:
import boto3
from langchain_community.llms import Bedrock


def configure_llm() -> Bedrock:
    # inference parameters passed to the model
    model_params = {
        "answer_length": 100,  # max number of tokens in the answer
        "temperature": 0.0,    # temperature during inference
        "top_p": 1,            # cumulative probability of sampled tokens
        "stop_words": [        # words after which the generation is stopped
            "\n\nHuman:",
            "]",
        ],
    }

    bedrock_client = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1",
    )

    MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

    llm = Bedrock(
        client=bedrock_client,
        model_id=MODEL_ID,
        model_kwargs=model_params,
    )

    return llm
Get hallucination classifications from the LLM
The next step is to use the prompt, dataset, and LLM to get hallucination scores for each response from your RAG system. Taking this a step further, you can use a threshold to determine whether the response is a hallucination or not. See the following code:
from langchain.chains import LLMChain


def get_response_from_claude(context: str, answer: str, prompt_template: PromptTemplate, llm: Bedrock) -> float:
    llm_chain = LLMChain(llm=llm, prompt=prompt_template, verbose=False)
    # compute scores
    response = llm_chain(
        {"context": context, "statement": str(answer)}
    )
    scores = response["text"].strip()
    try:
        scores = float(scores)
    except Exception:
        print(f"Could not parse LLM response: {scores}")
        scores = 0
    return scores
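As a minimal usage sketch (not part of the original code), the returned score can be turned into a binary hallucination label with a decision threshold; context and rag_answer are placeholders for your RAG context and generated answer, and the 0.5 cutoff is an assumed value that should be tuned on a labeled dataset:

# Hypothetical usage sketch
llm = configure_llm()
score = get_response_from_claude(context, rag_answer, prompt_template, llm)

HALLUCINATION_THRESHOLD = 0.5  # assumed cutoff; tune on labeled data
is_hallucination = score >= HALLUCINATION_THRESHOLD
print(f"score={score:.2f}, hallucination={is_hallucination}")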
Approach 2: Semantic similarity-based detection
Under the assumption that if a statement is a fact, then there will be high similarity with the context, you can use semantic similarity as a method to determine whether a statement is an input-conflicting hallucination.
This approach consists of the following steps:

Create embeddings for the answer and the context using an LLM. (In this example, we use the Amazon Titan Embeddings model.)
Use the embeddings to calculate similarity scores between each sentence in the answer and the context. (In this case, we use cosine similarity as a distance metric.) Out-of-context (hallucinated) sentences should have low similarity with the context.
Tune the decision threshold on a specific (for example, domain-dependent) dataset to classify hallucinated statements.

Create embeddings with LLMs and calculate similarity
You can use LLMs to create embeddings for the context and the initial response to the question. After you have the embeddings, you can calculate the cosine similarity of the two. The cosine similarity score will return a number between 0 and 1, with 1 indicating perfect similarity and 0 indicating no similarity. To translate this to a hallucination score, we take 1 minus the cosine similarity. See the following code:
import numpy as np
from langchain_community.embeddings import BedrockEmbeddings
from sklearn.metrics.pairwise import cosine_similarity


def similarity_detector(
    context: str,
    answer: str,
    llm: BedrockEmbeddings,
) -> float:
    """
    Check hallucinations using semantic similarity methods based on embeddings

    Parameters
    ----------
    context : str
        Context provided for RAG
    answer : str
        Answer from an LLM
    llm : BedrockEmbeddings
        Embeddings model

    Returns
    -------
    float
        Semantic similarity score
    """

    if len(context) == 0 or len(answer) == 0:
        return 0.0

    # calculate embeddings
    context_emb = llm.embed_query(context)
    answer_emb = llm.embed_query(answer)
    context_emb = np.array(context_emb).reshape(1, -1)
    answer_emb = np.array(answer_emb).reshape(1, -1)

    # hallucination score = 1 - cosine similarity
    sim_score = cosine_similarity(context_emb, answer_emb)
    return 1 - sim_score[0][0]
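For illustration, the detector can be wired to the Amazon Titan Embeddings model through LangChain's BedrockEmbeddings wrapper; the model ID, region, and 0.5 cutoff below are assumptions for this sketch rather than values prescribed by the post, and context and rag_answer are placeholders:

# Hypothetical usage sketch
import boto3
from langchain_community.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(
    client=boto3.client("bedrock-runtime", region_name="us-east-1"),
    model_id="amazon.titan-embed-text-v1",  # assumed Titan Embeddings model ID
)

score = similarity_detector(context, rag_answer, embeddings)
is_hallucination = score >= 0.5  # assumed cutoff; tune per dataset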
Approach 3: BERT stochastic checker
The BERT score uses the pre-trained contextual embeddings from a pre-trained language model such as BERT and matches words in candidate and reference sentences by cosine similarity. One of the traditional metrics for evaluation in natural language processing (NLP) is the BLEU score. The BLEU score primarily measures precision by calculating how many n-grams (consecutive tokens) from the candidate sentence appear in the reference sentences. It focuses on matching these consecutive token sequences between candidate and reference sentences, while incorporating a brevity penalty to prevent overly short translations from receiving artificially high scores. Unlike the BLEU score, which focuses on token-level comparisons, the BERT score uses contextual embeddings to capture semantic similarities between words or full sentences. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, the BERT score computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
In our approach, we use the BERT score as a stochastic checker for hallucination detection. The idea is that if you generate multiple answers from an LLM and there are large variations (inconsistencies) between them, then there is a good chance that these answers are hallucinated. We first generate N random samples (sentences) from the LLM. We then compute BERT scores by comparing each sentence in the original generated paragraph against its corresponding sentence across the N newly generated stochastic samples. This is done by embedding all sentences using an LLM-based embedding model and calculating cosine similarity. Our hypothesis is that factual sentences will remain consistent across multiple generations, resulting in high BERT scores (indicating similarity). Conversely, hallucinated content will likely vary across different generations, resulting in low BERT scores between the original sentence and its stochastic variants. By establishing a threshold for these similarity scores, we can flag sentences with consistently low BERT scores as potential hallucinations, because they demonstrate semantic inconsistency across multiple generations from the same model.
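To make the procedure concrete, the following is a minimal sketch of such a stochastic checker, assuming a LangChain LLM configured with temperature greater than 0 and an embeddings model (for example, BedrockEmbeddings) are already available; the naive sentence splitter, the sample count, and the use of best-match rather than strictly positional sentence pairing are illustrative simplifications, not the exact implementation evaluated in this post:

import re

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def split_sentences(text: str) -> list[str]:
    # naive sentence splitter; a proper sentence tokenizer could be substituted
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def stochastic_checker(prompt: str, answer: str, llm, embeddings, n_samples: int = 5) -> list[float]:
    """Return one hallucination score per sentence of `answer` (higher = more suspicious)."""
    # re-generate N stochastic samples for the same prompt
    # (the LLM must use temperature > 0 for the samples to vary)
    samples = [llm.invoke(prompt) for _ in range(n_samples)]

    scores = []
    for sentence in split_sentences(answer):
        sentence_emb = np.array(embeddings.embed_query(sentence)).reshape(1, -1)
        sims = []
        for sample in samples:
            sample_sentences = split_sentences(sample) or [sample]
            sample_embs = np.array([embeddings.embed_query(s) for s in sample_sentences])
            # similarity to the best-matching sentence in this sample
            sims.append(cosine_similarity(sentence_emb, sample_embs).max())
        # consistently low similarity across samples suggests hallucination
        scores.append(1.0 - float(np.mean(sims)))
    return scores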
Approach 4: Token similarity detection
With the token similarity detector, we extract unique sets of tokens from the answer and the context. Here, we can use one of the LLM tokenizers or simply split the text into individual words. Then, we calculate similarity between each sentence in the answer and the context. There are multiple metrics that can be used for token similarity, including a BLEU score over different n-grams, a ROUGE score (an NLP metric similar to BLEU, but measuring recall rather than precision) over different n-grams, or simply the proportion of shared tokens between the two texts. Out-of-context (hallucinated) sentences should have low similarity with the context.
import re

import evaluate


def intersection_detector(
    context: str,
    answer: str,
    length_cutoff: int = 3,
) -> dict[str, float]:
    """
    Check hallucinations using token intersection metrics

    Parameters
    ----------
    context : str
        Context provided for RAG
    answer : str
        Answer from an LLM
    length_cutoff : int
        If the answer is shorter than length_cutoff, scores of 0.0 are returned

    Returns
    -------
    dict[str, float]
        Token intersection and BLEU scores
    """

    # populate with relevant stopwords such as articles
    stopword_set = set()

    # remove punctuation and lowercase
    context = re.sub(r"[^\w\s]", "", context).lower()
    answer = re.sub(r"[^\w\s]", "", answer).lower()

    # calculate metrics
    if len(answer) >= length_cutoff:
        # calculate token intersection
        context_split = {term for term in re.compile(r"\w+").findall(context) if term not in stopword_set}
        answer_split = re.compile(r"\w+").findall(answer)
        answer_split = {term for term in answer_split if term not in stopword_set}
        intersection = sum([term in context_split for term in answer_split]) / len(answer_split)

        # calculate BLEU score
        bleu = evaluate.load("bleu")
        bleu_score = bleu.compute(predictions=[answer], references=[context])["precisions"]
        bleu_score = sum(bleu_score) / len(bleu_score)

        return {
            "intersection": 1 - intersection,
            "bleu": 1 - bleu_score,
        }

    return {"intersection": 0, "bleu": 0}
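As a quick illustrative usage (the cutoffs are assumptions, not values from the evaluation, and context and rag_answer are placeholders), the returned dictionary can be thresholded on either score:

# Hypothetical usage sketch
scores = intersection_detector(context, rag_answer)
# scores contains "intersection" and "bleu" hallucination scores in [0, 1]
is_hallucination = scores["intersection"] >= 0.5 or scores["bleu"] >= 0.5  # assumed cutoffs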
Comparing approaches: Evaluation results
In this section, we compare the hallucination detection approaches described in the post. We run an experiment on three RAG datasets, including Wikipedia article data and two synthetically generated datasets. Each example in a dataset includes a context, a user’s question, and an LLM answer labeled as correct or hallucinated. We run each hallucination detection method on all questions and aggregate the accuracy metrics across the datasets.
The highest accuracy (number of sentences correctly classified as hallucination vs. fact) is achieved by the BERT stochastic checker and the LLM prompt-based detector. The LLM prompt-based detector outperforms the BERT checker in precision, and the BERT stochastic checker has higher recall. The semantic similarity and token similarity detectors show very low accuracy and recall but perform well with regard to precision. This indicates that those detectors might only be useful for identifying the most evident hallucinations.
Aside from the token similarity detector, the LLM prompt-based detector is the most cost-effective option in terms of the number of LLM calls, because the number of calls is constant relative to the size of the context and the response (although cost still varies with the number of input tokens). The semantic similarity detector's cost is proportional to the number of sentences in the context and the response, so as the context grows, this can become increasingly expensive.
The following table summarizes the metrics compared between each method. For use cases where precision is the highest priority, we recommend the token similarity, LLM prompt-based, and semantic similarity methods; where high recall matters most, the BERT stochastic checker outperforms the other methods.

| Technique | Accuracy* | Precision* | Recall* | Cost (Number of LLM Calls) | Explainability |
|---|---|---|---|---|---|
| Token Similarity Detector | 0.47 | 0.96 | 0.03 | 0 | Yes |
| Semantic Similarity Detector | 0.48 | 0.90 | 0.02 | K*** | Yes |
| LLM Prompt-Based Detector | 0.75 | 0.94 | 0.53 | 1 | Yes |
| BERT Stochastic Checker | 0.76 | 0.72 | 0.90 | N+1** | Yes |

*Averaged over the Wikipedia dataset and the generative AI synthetic datasets
**N = Number of random samples
***K = Number of sentences
These results suggest that the LLM prompt-based detector offers a good trade-off between accuracy and cost (additional LLM calls and answer latency). We recommend using a combination of a token similarity detector to filter out the most evident hallucinations and an LLM prompt-based detector to identify more difficult ones, as sketched below.
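The following is a minimal sketch of that two-stage combination, assuming the functions defined earlier in this post are in scope; the 0.9 pre-filter cutoff and the 0.5 LLM-score cutoff are illustrative assumptions to be tuned on labeled data:

# Hypothetical two-stage pipeline sketch
def detect_hallucination(context: str, answer: str, prompt_template, llm) -> bool:
    # stage 1: cheap token-overlap pre-filter catches the most evident hallucinations
    token_scores = intersection_detector(context, answer)
    if token_scores["intersection"] >= 0.9:  # assumed cutoff: almost no overlap with the context
        return True

    # stage 2: LLM prompt-based detector handles the harder cases
    llm_score = get_response_from_claude(context, answer, prompt_template, llm)
    return llm_score >= 0.5  # assumed cutoff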
Conclusion
As RAG systems continue to evolve and play an increasingly important role in AI applications, the ability to detect and prevent hallucinations remains crucial. Through our exploration of four different approaches—LLM prompt-based detection, semantic similarity detection, BERT stochastic checking, and token similarity detection—we’ve demonstrated various methods to address this challenge. Although each approach has its strengths and trade-offs in terms of accuracy, precision, recall, and cost, the LLM prompt-based detector shows particularly promising results with accuracy rates above 75% and a relatively low additional cost. Organizations can choose the most suitable method based on their specific needs, considering factors such as computational resources, accuracy requirements, and cost constraints. As the field continues to advance, these foundational techniques provide a starting point for building more reliable and trustworthy RAG systems.

About the Authors
 Zainab Afolabi is a Senior Data Scientist at the Generative AI Innovation Centre in London, where she leverages her extensive expertise to develop transformative AI solutions across diverse industries. She has over eight years of specialised experience in artificial intelligence and machine learning, as well as a passion for translating complex technical concepts into practical business applications.
Aiham Taleb, PhD, is a Senior Applied Scientist at the Generative AI Innovation Center, working directly with AWS enterprise customers to leverage Gen AI across several high-impact use cases. Aiham has a PhD in unsupervised representation learning, and has industry experience that spans across various machine learning applications, including computer vision, natural language processing, and medical imaging.
Nikita Kozodoi, PhD, is a Senior Applied Scientist at the AWS Generative AI Innovation Center working on the frontier of AI research and business. Nikita builds generative AI solutions to solve real-world business problems for AWS customers across industries and holds a PhD in Machine Learning.
Liza (Elizaveta) Zinovyeva is an Applied Scientist at AWS Generative AI Innovation Center and is based in Berlin. She helps customers across different industries to integrate Generative AI into their existing applications and workflows. She is passionate about AI/ML, finance and software security topics. In her spare time, she enjoys spending time with her family, sports, learning new technologies, and table quizzes.

ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model De …

VLMs have become central to building general-purpose AI systems capable of understanding and interacting in digital and real-world settings. By integrating visual and textual data, VLMs have driven advancements in multimodal reasoning, image editing, GUI agents, robotics, and more, influencing sectors like education and healthcare. Despite this progress, VLMs still lag behind human capabilities, particularly in tasks involving 3D reasoning, object counting, creative visual interpretation, and interactive gameplay. A challenge lies in the scarcity of rich, diverse multimodal datasets, unlike the abundant textual resources available to LLMs. Additionally, multimodal data complexity poses significant training and evaluation hurdles. 

Researchers at ByteDance have developed Seed1.5-VL, a compact yet powerful vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts LLM. Despite its efficient architecture, Seed1.5-VL achieves top results on 38 out of 60 public VLM benchmarks, excelling in tasks like GUI control, video understanding, and visual reasoning. It is trained on trillions of multimodal tokens using advanced data synthesis and post-training techniques, including human feedback. Innovations in training, such as hybrid parallelism and vision token redistribution, optimize performance. The model’s efficiency and strong reasoning capabilities suit real-world interactive applications like chatbots. 

The Seed1.5-VL architecture features a vision encoder, an MLP adapter, and an LLM. Its custom vision encoder, Seed-ViT, supports native-resolution image input using 2D RoPE and processes images through 14×14 patches, followed by average pooling and an MLP. Pretraining involves masked image modeling, contrastive learning, and omni-modal alignment using images, text, and video-audio-caption pairs. The model uses a Dynamic Frame-Resolution Sampling approach for video encoding that adapts frame rates and resolutions based on content complexity, balancing efficiency and detail. This method enables effective spatial-temporal understanding within a token budget, ensuring comprehensive video representation across varied lengths and complexities. 

The pre-training of Seed1.5-VL involved curating 3 trillion high-quality tokens across diverse domains. Image-text pairs from the web were filtered using CLIP scores, size/aspect ratio checks, and deduplication to reduce noise. Using domain-based sampling and duplication strategies, rare visual concepts were overrepresented to address class imbalance. Specialized datasets were added for OCR using annotated and synthetic text-rich images, charts, and tables, while object grounding and counting tasks used bounding boxes, points, and auto-labeled web data. Additional tasks included 3D spatial understanding using depth annotations, and video understanding through multi-frame captioning, QA, and temporal grounding to support dynamic content analysis.

The evaluation highlights Seed-ViT and Seed1.5-VL’s competitive performance across vision-language tasks. Seed-ViT, despite having significantly fewer parameters, matches or outperforms larger models like InternVL-C and EVA-CLIP on zero-shot image classification tasks, showing high accuracy and robustness on datasets such as ImageNet-A and ObjectNet. Seed1.5-VL demonstrates strong capabilities in multimodal reasoning, general VQA, document understanding, and grounding. It achieves state-of-the-art benchmarks, particularly in complex reasoning, counting, and chart interpretation tasks. The model’s “thinking” mode, which incorporates longer reasoning chains, further enhances performance, indicating its strong ability in detailed visual understanding and task generalization. 

In conclusion, Seed1.5-VL is a vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts language model. Despite its compact size, it achieves state-of-the-art results on 38 of 60 public benchmarks and excels in complex reasoning, OCR, diagram interpretation, 3D spatial understanding, and video analysis. It also performs well in agent-driven tasks like GUI control and gameplay, surpassing models like OpenAI CUA and Claude 3.7. The model shows strong generalization to tasks beyond its training scope. The study outlines its architecture, data pipeline, and training methods and identifies future directions, including enhancing tool-use and visual reasoning capabilities. 

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.