Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data

Recent developments have shown that RL can significantly enhance the reasoning abilities of LLMs. Building on this progress, the study aims to improve Audio LLMs—models that process audio and text to perform tasks like question answering. The MMAU benchmark is a widely used dataset designed to evaluate these models, featuring multiple-choice questions on sounds, speech, and music, some of which require external knowledge. A prior approach, R1-AQA, used GRPO (Group Relative Policy Optimization) to fine-tune the Qwen2-Audio model on the AVQA dataset, achieving state-of-the-art (SOTA) results on MMAU. Inspired by this, the authors applied GRPO to fine-tune Qwen2.5-Omni-7B, a newer multimodal model, further improving performance. Additionally, they introduced a method to automatically generate audio QA data, leading to even better outcomes.

Compared to methods like SARI, which uses a more complex mix of supervised fine-tuning and RL with structured reasoning, the authors’ approach is simpler, relying solely on RL without explicit reasoning steps. They also conducted experiments with text-only inputs to investigate the role of GRPO in performance gains. Surprisingly, fine-tuning the models using just text data yielded nearly the same improvements as training with audio and text. This finding suggests that GRPO primarily enhances the model’s reasoning ability through text, significantly contributing to its improved performance in audio QA tasks. 

Researchers from MIT CSAIL, Goethe University, IBM Research, and others introduce Omni-R1, a fine-tuned version of the multi-modal LLM Qwen2.5-Omni using the GRPO reinforcement learning method. Trained on the AVQA dataset, Omni-R1 sets new state-of-the-art results on the MMAU benchmark across all audio categories. Surprisingly, much of the improvement stems from enhanced text-based reasoning rather than audio input. Fine-tuning with text-only data also led to notable performance gains. Additionally, the team generated large-scale audio QA datasets using ChatGPT, further boosting accuracy. Their work highlights the significant impact of text reasoning in audio LLM performance and promises the public release of all resources. 

The Omni-R1 model fine-tunes Qwen2.5-Omni using the GRPO reinforcement learning method with a simple prompt format that allows direct answer selection, making it memory-efficient for 48GB GPUs. GRPO avoids a value function by comparing grouped outputs using a reward based solely on answer correctness. Researchers used audio captions from Qwen2-Audio to expand training data and prompted ChatGPT to generate new question-answer pairs. This method produced two datasets, AVQA-GPT and VGGS-GPT, covering 40k and 182k audios, respectively. Training on these automatically generated datasets improved performance, with VGGS-GPT helping Omni-R1 achieve state-of-the-art accuracy on the MMAU benchmark.
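Because GRPO is the core training method here, the following is a minimal sketch (not the authors' code) of the group-relative advantage it uses in place of a learned value function, assuming a binary answer-correctness reward over a group of completions sampled for the same question.

import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """rewards: correctness rewards (e.g., 0/1) for a group of completions
    sampled from the same prompt. Each completion's advantage is its reward
    relative to the group mean, scaled by the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled answers to one audio question, two of them correct
print(grpo_advantages([1, 0, 1, 0]))  # correct answers get positive advantages

Completions with above-average reward in their group are reinforced and the rest are discouraged, which is what removes the need for a separate value network.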

The researchers fine-tuned Qwen2.5-Omni using GRPO on the AVQA, AVQA-GPT, and VGGS-GPT datasets. Results show notable performance gains, with the best average score of 71.3% on the MMAU Test-mini from VGGS-GPT. Qwen2.5-Omni outperformed baselines, including SARI, and showed strong reasoning even without audio, suggesting robust text-based understanding. GRPO fine-tuning improved Qwen2-Audio more significantly due to its weaker initial text reasoning. Surprisingly, fine-tuning without audio boosted performance, and text-only datasets like ARC-Easy yielded comparable results. Improvements mainly stem from enhanced text reasoning, though audio-based fine-tuning remains slightly superior for optimal performance.

In conclusion, Omni-R1 is an Audio LLM developed by fine-tuning Qwen2.5-Omni using the GRPO reinforcement learning method for enhanced audio question answering. Omni-R1 achieves new state-of-the-art results on the MMAU benchmark across sounds, speech, music, and overall performance. Two new large-scale datasets, AVQA-GPT and VGGS-GPT, were created using automatically generated questions, further boosting model accuracy. Experiments show that GRPO mainly enhances text-based reasoning, significantly contributing to performance. Surprisingly, fine-tuning with only text (without audio) improved audio-based performance, highlighting the value of strong base language understanding. These findings offer cost-effective strategies for developing audio-capable language models. 


HERE Technologies boosts developer productivity with new generative AI-powered coding assistant

This blog post is co-written with Jonas Neuman from HERE Technologies. 
HERE Technologies, a 40-year pioneer in mapping and location technology, collaborated with the AWS Generative AI Innovation Center (GenAIIC) to enhance developer productivity with a generative AI-powered coding assistant. This innovative tool is designed to enhance the onboarding experience for HERE’s self-service Maps API for JavaScript. HERE’s use of generative AI empowers its global developer community to quickly translate natural language queries into interactive map visualizations, streamlining the evaluation and adaptation of HERE’s mapping services.
New developers who try out these APIs for the first time often begin with questions such as “How can I generate a walking route from point A to B?” or “How can I display a circle around a point?” Although HERE’s API documentation is extensive, HERE recognized that accelerating the onboarding process could significantly boost developer engagement. They aim to enhance retention rates and create proficient product advocates through personalized experiences.
To create a solution, HERE collaborated with the GenAIIC. Our joint mission was to create an intelligent AI coding assistant that could provide explanations and executable code solutions in response to users’ natural language queries. The requirement was to build a scalable system that could translate natural language questions into HTML code with embedded JavaScript, ready for immediate rendering as an interactive map that users can see on screen.
The team needed to build a solution that accomplished the following:

Provide value and reliability by delivering correct, renderable code that is relevant to a user’s question
Facilitate a natural and productive developer interaction by providing code and explanations at low latency (as of this writing, around 60 seconds) while maintaining context awareness for follow-up questions
Preserve the integrity and usefulness of the feature within HERE’s system and brand by implementing robust filters for irrelevant or infeasible queries
Offer a reasonable system cost to maintain a positive ROI when scaled across the entire API system

Together, HERE and the GenAIIC built a solution based on Amazon Bedrock that balanced goals with inherent trade-offs. Amazon Bedrock is a fully managed service that provides access to foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities, enabling you to build generative AI applications with built-in security, privacy, and responsible AI features. The service allows you to experiment with and privately customize different FMs using techniques like fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks. Amazon Bedrock is serverless, alleviates infrastructure management needs, and seamlessly integrates with existing AWS services.
Built on the comprehensive suite of AWS managed and serverless services, including Amazon Bedrock FMs, Amazon Bedrock Knowledge Bases for RAG implementation, Amazon Bedrock Guardrails for content filtering, and Amazon DynamoDB for conversation management, the solution delivers a robust and scalable coding assistant without the overhead of infrastructure management. The result is a practical, user-friendly tool that can enhance the developer experience and provide a novel way for API exploration and fast solutioning of location and navigation experiences.
In this post, we describe the details of how this was accomplished.
Dataset
We used the following resources as part of this solution:

Domain documentation – We used two publicly available resources: HERE Maps API for JavaScript Developer Guide and HERE Maps API for JavaScript API Reference. The Developer Guide offers conceptual explanations, and the API Reference provides detailed API function information.
Sample examples – HERE provided 60 cases, each containing a user query, HTML/JavaScript code solution, and brief description. These examples span multiple categories, including geodata, markers, and geoshapes, and were divided into training and testing sets.
Out-of-scope queries – HERE provided samples of queries beyond the HERE Maps API for JavaScript scope, which the large language model (LLM) should not respond to.

Solution overview
To develop the coding assistant, we designed and implemented a RAG workflow. Although standard LLMs can generate code, they often work with outdated knowledge and can’t adapt to the latest HERE Maps API for JavaScript changes or best practices. HERE Maps API for JavaScript documentation can significantly enhance coding assistants by providing accurate, up-to-date context. The storage of HERE Maps API for JavaScript documentation in a vector database allows the coding assistant to retrieve relevant snippets for user queries. This allows the LLM to ground its responses in official documentation rather than potentially outdated training data, leading to more accurate code suggestions.
The following diagram illustrates the overall architecture.

The solution architecture comprises four key modules:

Follow-up question module – This module enables follow-up question answering by contextual conversation handling. Chat histories are stored in DynamoDB and retrieved when users pose new questions. If a chat history exists, it is combined with the new question. The LLM then processes it to reformulate follow-up questions into standalone queries for downstream processing. The module maintains context awareness while recognizing topic changes, preserving the original question when the new question deviates from the previous conversation context.
Scope filtering and safeguard module – This module evaluates whether queries fall within the HERE Maps API for JavaScript scope and determines their feasibility. We applied Amazon Bedrock Guardrails and Anthropic’s Claude 3 Haiku on Amazon Bedrock to filter out-of-scope questions. With a short natural language description, Amazon Bedrock Guardrails helps define a set of out-of-scope topics to block for the coding assistant, for example, topics about other HERE products. Amazon Bedrock Guardrails also helps filter harmful content containing topics such as hate speech, insults, sex, violence, and misconduct (including criminal activity), and helps protect against prompt attacks. This makes sure the coding assistant follows responsible AI policies. For in-scope queries, we employ Anthropic’s Claude 3 Haiku model to assess feasibility by analyzing both the user query and retrieved domain documents. We selected Anthropic’s Claude 3 Haiku for its optimal balance of performance and speed. The system generates standard responses for out-of-scope or infeasible queries, and viable questions proceed to response generation.
Knowledge base module – This module uses Amazon Bedrock Knowledge Bases for document indexing and retrieval operations. Amazon Bedrock Knowledge Bases is a comprehensive managed service that simplifies the RAG process from end to end. It handles everything from data ingestion to indexing, retrieval, and generation automatically, removing the complexity of building and maintaining custom integrations and managing data flows. The multiple options for document chunking, embedding generation, and retrieval methods offered by Amazon Bedrock Knowledge Bases make it highly adaptable and allow us to test and identify the optimal configuration. We created two separate indexes, one for each domain document. This dual-index approach makes sure content is retrieved from both documentation sources for response generation. The indexing process implements hierarchical chunking with the Cohere Embed English v3 model on Amazon Bedrock, and semantic retrieval is implemented for document retrieval. A sketch of a retrieval call follows this list.
Response generation module – The response generation module processes in-scope and feasible queries using Anthropic’s Claude 3.5 Sonnet model on Amazon Bedrock. It combines user queries with retrieved documents to generate HTML code with embedded JavaScript code, capable of rendering interactive maps. Additionally, the module provides a concise description of the solution’s key points. We selected Anthropic’s Claude 3.5 Sonnet for its superior code generation capabilities.
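To make the retrieval step concrete, here is a minimal sketch (not HERE's production code) of querying an Amazon Bedrock knowledge base with boto3. The knowledge base IDs are placeholders; in the described solution, one index exists for the Developer Guide and one for the API Reference, and both are queried for each question.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def retrieve_snippets(question: str, knowledge_base_id: str, top_k: int = 5) -> list[str]:
    # Semantic retrieval against one Amazon Bedrock Knowledge Bases index
    response = agent_runtime.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    return [result["content"]["text"] for result in response["retrievalResults"]]

# Query both indexes so the generator sees Developer Guide and API Reference content
question = "How can I display a circle around a point?"
snippets = retrieve_snippets(question, "DEV_GUIDE_KB_ID") + retrieve_snippets(question, "API_REF_KB_ID")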

Solution orchestration
Each module discussed in the previous section was decomposed into smaller sub-tasks. This allowed us to model the functionality and various decision points within the system as a Directed Acyclic Graph (DAG) using LangGraph. A DAG is a graph where nodes (vertices) are connected by directed edges (arrows) that represent relationships, and crucially, there are no cycles (loops) in the graph. A DAG allows the representation of dependencies with a guaranteed order, and it helps enable safe and efficient execution of tasks. LangGraph orchestration has several benefits, such as parallel task execution, code readability, and maintainability through state management and streaming support.
The following diagram illustrates the coding assistant workflow.

When a user submits a question, a workflow is invoked, starting at the Reformulate Question node. This node handles the implementation of the follow-up question module (Module 1). The Apply Guardrail, Retrieve Documents, and Review Question nodes run in parallel, using the reformulated input question. The Apply Guardrail node uses denied topics from Amazon Bedrock Guardrails to enforce boundaries and apply safeguards against harmful inputs, and the Review Question node filters out-of-scope inquiries using Anthropic’s Claude 3 Haiku (Module 2). The Retrieve Documents node retrieves relevant documents from the Amazon Bedrock knowledge base to provide the language model with necessary information (Module 3).
The outputs of the Apply Guardrail and Review Question nodes determine the next node invocation. If the input passes both checks, the Review Documents node assesses the question’s feasibility by analyzing if it can be answered with the retrieved documents (Module 2). If feasible, the Generate Response node answers the question and the code and description are streamed to the UI, allowing the user to start getting feedback from the system within seconds (Module 4). Otherwise, the Block Response node returns a predefined answer. Finally, the Update Chat History node persistently maintains the conversation history for future reference (Module 1).
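The sketch below shows how such a workflow can be expressed with LangGraph. The node names, state schema, and node bodies are illustrative placeholders rather than HERE's implementation, and the parallel fan-out of the guardrail, retrieval, and scope-review steps is collapsed into a sequential chain for brevity.

from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class AssistantState(TypedDict, total=False):
    question: str          # reformulated, standalone question
    documents: List[str]   # snippets retrieved from the knowledge base
    allowed: bool          # passed guardrail, scope, and feasibility checks
    response: str          # generated HTML/JavaScript plus description

def reformulate_question(state: AssistantState) -> AssistantState:
    # Module 1: merge chat history and rewrite follow-ups as standalone questions
    return {"question": state["question"]}

def check_question(state: AssistantState) -> AssistantState:
    # Module 2: guardrail plus scope/feasibility review (placeholder decision)
    return {"allowed": True}

def retrieve_documents(state: AssistantState) -> AssistantState:
    # Module 3: query the documentation indexes
    return {"documents": ["...retrieved snippets..."]}

def generate_response(state: AssistantState) -> AssistantState:
    # Module 4: generate the HTML/JavaScript answer and description
    return {"response": "<html>...</html>"}

def block_response(state: AssistantState) -> AssistantState:
    return {"response": "Sorry, this question is outside the supported scope."}

graph = StateGraph(AssistantState)
graph.add_node("reformulate_question", reformulate_question)
graph.add_node("check_question", check_question)
graph.add_node("retrieve_documents", retrieve_documents)
graph.add_node("generate_response", generate_response)
graph.add_node("block_response", block_response)

graph.set_entry_point("reformulate_question")
graph.add_edge("reformulate_question", "check_question")
graph.add_edge("check_question", "retrieve_documents")
graph.add_conditional_edges(
    "retrieve_documents",
    lambda state: "generate_response" if state.get("allowed") else "block_response",
)
graph.add_edge("generate_response", END)
graph.add_edge("block_response", END)

app = graph.compile()
result = app.invoke({"question": "How can I display a circle around a point?"})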
This pipeline backs the code assistant chatbot capability, providing an efficient and user-friendly experience for developers seeking guidance on implementing the HERE Maps API for JavaScript. The following is an example of the model-generated code for the query “How to open an infobubble when clicking on a marker?”, which renders an interactive map with a clickable marker.

<!DOCTYPE html>
<html>
<head>
  <meta name="viewport" content="initial-scale=1.0, width=device-width" />
  <script src="https://js.api.here.com/v3/3.1/mapsjs-core.js" type="text/javascript" charset="utf-8"></script>
  <script src="https://js.api.here.com/v3/3.1/mapsjs-service.js" type="text/javascript" charset="utf-8"></script>
  <script src="https://js.api.here.com/v3/3.1/mapsjs-mapevents.js" type="text/javascript" charset="utf-8"></script>
  <script src="https://js.api.here.com/v3/3.1/mapsjs-ui.js" type="text/javascript" charset="utf-8"></script>
  <link rel="stylesheet" type="text/css" href="https://js.api.here.com/v3/3.1/mapsjs-ui.css" />
</head>
<body>
  <div id="map" style="width: 100%; height: 480px;"></div>
  <script type="text/javascript">

    function addMarkerWithInfoBubble(map, ui) {
      // Create a marker
      var marker = new H.map.Marker({lat: 28.6071, lng: 77.2127});

      // Add the marker to the map
      map.addObject(marker);

      // Create the info bubble content
      var bubbleContent = '<div><h3>Delhi, India</h3><p>Capital city of India</p></div>';

      // Add a click event listener to the marker
      marker.addEventListener('tap', function(evt) {
        // Create an info bubble object
        var bubble = new H.ui.InfoBubble(evt.target.getGeometry(), {
          content: bubbleContent
        });

        // Add info bubble to the UI
        ui.addBubble(bubble);
      });
    }

    /**
     * Boilerplate map initialization code starts below:
     */

    // Step 1: initialize communication with the platform
    // In your own code, replace variable window.apikey with your own apikey
    var platform = new H.service.Platform({
      apikey: 'Your_API_Key'
    });
    var defaultLayers = platform.createDefaultLayers();

    // Step 2: initialize a map
    var map = new H.Map(document.getElementById('map'),
      defaultLayers.vector.normal.map, {
        center: {lat: 28.6071, lng: 77.2127},
        zoom: 13,
        pixelRatio: window.devicePixelRatio || 1
      });
    // add a resize listener to make sure that the map occupies the whole container
    window.addEventListener('resize', () => map.getViewPort().resize());

    // Step 3: make the map interactive
    // MapEvents enables the event system
    // Behavior implements default interactions for pan/zoom (also on mobile touch environments)
    var behavior = new H.mapevents.Behavior(new H.mapevents.MapEvents(map));

    // Step 4: Create the default UI components
    var ui = H.ui.UI.createDefault(map, defaultLayers);

    // Step 5: main logic
    addMarkerWithInfoBubble(map, ui);
  </script>
</body>
</html>

Prompt engineering
To improve final code generation accuracy, we employed extensive prompt engineering for the response generation module. The final prompt incorporated the following components:

Task breakdown with chain of thought – We decomposed the code generation task into sequential steps, providing structured guidance for the LLM to follow during response generation.
Few-shot learning – We enhanced the prompt with three carefully selected training examples from question categories where the LLM initially underperformed. These examples included retrieved documents and expected responses, demonstrating the desired output format.
Code template integration – In response to subject matter expert (SME) feedback regarding map interactivity issues, we incorporated a code template for generation. This template contains boilerplate code for HERE map initialization and setup, improving accuracy and providing consistent map interactivity in the generated code.

The following is the core structure of the prompt and the components discussed:

Task Instructions
Examples
User Query
Developer Guide Content
API Reference Content
Code Template
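As a rough illustration only, the components listed above could be assembled into a single prompt along the following lines. The section tags, wording, and placeholder values below are assumptions for the sketch, not HERE's actual prompt.

components = {
    "task_instructions": "You are a coding assistant for the HERE Maps API for JavaScript. Follow the steps...",
    "examples": "<example>...three few-shot examples with retrieved documents and expected responses...</example>",
    "user_query": "How to open an infobubble when clicking on a marker?",
    "developer_guide_content": "...chunks retrieved from the Developer Guide index...",
    "api_reference_content": "...chunks retrieved from the API Reference index...",
    "code_template": "<!DOCTYPE html>... map initialization boilerplate ...",
}

prompt = (
    "{task_instructions}\n\n"
    "<examples>\n{examples}\n</examples>\n\n"
    "<user_query>\n{user_query}\n</user_query>\n\n"
    "<developer_guide>\n{developer_guide_content}\n</developer_guide>\n\n"
    "<api_reference>\n{api_reference_content}\n</api_reference>\n\n"
    "<code_template>\n{code_template}\n</code_template>\n\n"
    "Return a complete HTML document with embedded JavaScript, followed by a short description."
).format(**components)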

Evaluation
We manually evaluated the accuracy of code generation for each question in the test set. Our evaluation focused on two key criteria:

Whether the generated code can render an interactive HERE map
Whether the rendered map addresses the user’s query—for example, if the user requests a circle to be added, this will check whether the generated code successfully adds a circle to the map

Code samples that satisfied both criteria were classified as correct. In addition to accuracy, we also evaluated latency, including both overall latency and time to first token. Overall latency refers to the total time taken to generate the full response. To improve user experience and avoid having users wait without visible output, we employed response streaming. Time to first token measures how long it takes for the system to generate the first token of the response. The evaluation results, based on 20 samples from the testing dataset, are as follows:

Code generation accuracy: 87.5%
Overall latency: 23.5 seconds
Time to first token: Under 8 seconds

The high accuracy makes sure that the code assistant generates correct code to answer the user’s question. The low overall latency and quick time to first token significantly reduces customer waiting time, enhancing the overall user experience.
Security considerations
Security is our top priority at AWS. For the scope of this post, we shared how we used Amazon Bedrock Guardrails to build a responsible AI application. Safety and security are critical for every application. For in-depth guidance on AWS’s approach to secure and responsible AI development, refer to Securing generative AI and the AWS whitepaper Navigating the security landscape of generative AI.
Possible improvements
The following two areas are worth exploring to improve overall system accuracy and improve the current mechanism for evaluating the LLM response:

Improved automated evaluation – We recommend exploring ways to automate the evaluation. For example, we can use an LLM-as-a-judge approach to compare ground truth and generated code, alongside automated map rendering checks using tools like Playwright (see the sketch after this list). This combined strategy can offer a scalable, accurate, and efficient framework for evaluating the quality and functionality of LLM-generated map code.
Prompt chaining with self-correction feedback – Future implementations could consider a pipeline that executes the generated code, interacts with the map, and feeds errors back into the LLM to improve accuracy. The trade-off is that this feedback loop would increase the overall system latency.
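The following is a minimal sketch, not part of the described solution, of the automated rendering check mentioned above, using Playwright's synchronous API. It assumes the generated HTML is available as a string and that the HERE map renders canvas content inside the #map container.

from playwright.sync_api import sync_playwright

def renders_here_map(generated_html: str, timeout_ms: int = 15000) -> bool:
    """Return True if the generated page renders a map canvas without JS errors."""
    js_errors = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Collect uncaught JavaScript errors raised while the page runs
        page.on("pageerror", lambda err: js_errors.append(str(err)))
        page.set_content(generated_html, wait_until="networkidle")
        try:
            # Assumption: the HERE map draws into a canvas inside the #map div
            page.wait_for_selector("#map canvas", timeout=timeout_ms)
        except Exception:
            browser.close()
            return False
        browser.close()
    return not js_errors

Combined with an LLM-as-a-judge comparison against the ground truth code, a check like this could replace part of the manual evaluation described earlier.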

Conclusion
The outcome of this solution is a fast, practical, user-friendly coding assistant that enhances the developer experience for the HERE Maps API for JavaScript. Through iterative evolution of a RAG approach and prompt engineering techniques, the team surpassed target accuracy and latency without relying on fine-tuning. This means the solution can be expanded to other HERE offerings beyond the HERE Maps API for JavaScript. Additionally, the LLMs backing the assistant can be upgraded as higher-performing FMs become available on Amazon Bedrock.
Key highlights of the solution include the use of a map initialization code template in the prompt, a modular and maintainable architecture orchestrated by LangGraph, and response streaming capabilities that start displaying generated code in under 8 seconds. The careful selection and combination of language models, optimized for specific tasks, further contributed to the overall performance and cost-effectiveness of the solution.
Overall, the outcomes of this proof of concept were made possible through the partnership between the GenAIIC and HERE Technologies. The coding assistant has laid a solid foundation for HERE Technologies to significantly enhance developer productivity, accelerate API adoption, and drive growth in its developer landscape.
Explore how Amazon Bedrock makes it straightforward to build generative AI applications with model choice and features like Amazon Bedrock Knowledge Bases and Amazon Bedrock Guardrails. Get started with Amazon Bedrock Knowledge Bases to implement RAG-based solutions that can transform your developer experience and boost productivity.

About the Authors
Gan is an Applied Scientist on the AWS Generative AI Innovation and Delivery team. He is passionate about leveraging generative AI techniques to help customers solve real-world business problems.
Grace Lang is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she designs and implements advanced AI solutions across industries. Driven by a passion for solving complex technical challenges, Grace partners with customers to develop innovative machine learning applications.
Julia Wagner is a Senior AI Strategist at AWS’s Generative AI Innovation Center. With her background in product management, she helps teams develop AI solutions focused on customer and business needs. Outside of work, she enjoys biking and mountain activities.
Jonas Neuman is an Engineering Manager at HERE Technologies, based in Berlin, Germany. He is passionate about building great customer-facing applications. Together with his team, Jonas delivers features that help customers sign up for HERE Services and SDKs, manage access, and monitor their usage.
Sibasankar is a Senior Solutions Architect at AWS in the Automotive and Manufacturing team. He is passionate about AI, data and security. In his free time, he loves spending time with his family and reading non-fiction books.
Jared Kramer is an Applied Science Manager at Amazon Web Services based in Seattle. Jared joined Amazon 11 years ago as an ML Science intern. After 6 years in Customer Service Technologies and 4 years in Sustainability Science and Innovation, he now leads a team of Applied Scientists and Deep Learning Architects in the Generative AI Innovation Center. Jared specializes in designing and delivering industry NLP applications and is on the Industry Track program committee for ACL and EMNLP.

Set up a custom plugin on Amazon Q Business and authenticate with Amazon Cognito to interact with backend systems

Businesses are constantly evolving, and leaders are challenged every day to meet new requirements while seeking ways to optimize their operations and gain a competitive edge. One of the key challenges they face is managing the complexity of disparate business systems and workflows, which leads to inefficiencies, data silos, and missed opportunities.
Generative AI can play an important role in integrating these disparate systems in a secure and seamless manner, addressing these challenges in a cost-effective way. This integration allows for secure and efficient data exchange, action triggering, and enhanced productivity across the organization. Amazon Q Business plays an important role in making this happen. Amazon Q Business enables organizations to quickly and effortlessly analyze their data, uncover insights, and make data-driven decisions. With its intuitive interface and seamless integration with other AWS services, Amazon Q Business empowers businesses of different sizes to transform their data into actionable intelligence and drive innovation across their operations.
In this post, we demonstrate how to build a custom plugin with Amazon Q Business for backend integration. This plugin can integrate existing systems, including third-party systems, with little to no development in just weeks and automate critical workflows. Additionally, we show how to safeguard the solution using Amazon Cognito and AWS IAM Identity Center, maintaining the safety and integrity of sensitive data and workflows. Amazon Q Business also offers application environment guardrails or chat controls that you can configure to control the end-user chat experience to add an additional layer of safety. Lastly, we show how to expose your backend APIs through Amazon API Gateway, which is built on serverless AWS Lambda functions and Amazon DynamoDB.
Solution overview
Amazon Q Business is a fully managed, generative AI-powered assistant that helps enterprises unlock the value of their data and knowledge. With Amazon Q Business, you can quickly find answers to questions, generate summaries and content, and complete tasks by using the information and expertise stored across your company’s various data sources and enterprise systems. At the core of this capability are built-in data source connectors and custom plugins that seamlessly integrate and index content from multiple repositories into a unified index. This enables the Amazon Q Business large language model (LLM) to provide accurate, well-written answers by drawing from the consolidated data and information. The data source connectors act as a bridge, synchronizing content from disparate systems like Salesforce, Jira, and SharePoint into a centralized index that powers the natural language understanding and generative abilities of Amazon Q Business. Amazon Q Business also provides the capability to create custom plugins to integrate with your organization’s backend system and third-party applications.
After you integrate Amazon Q Business with your backend system using a custom plugin, users can ask questions from documents that are uploaded in Amazon Simple Storage Service (Amazon S3). For this post, we use a simple document that contains product names, descriptions, and other related information. Some of the questions you can ask Amazon Q Business might include the following:

“Give me the name of the products.”
“Now list all the products along with the description in tabular format.”
“Now create one of the products <product name>.” (At this stage, Amazon Q Business will require you to authenticate against Amazon Cognito to make sure you have the right permission to work on that application.)
“List all the products along with ID and price in tabular format.”
“Update the price of product with ID <product ID>.”

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The user asks a question using the Amazon Q Business chat interface.
Amazon Q Business searches the indexed document in Amazon S3 for relevant information and presents it to the user.
The user can use the plugin to perform actions (API calls) in the system exposed to Amazon Q Business using OpenAPI 3.x standards.
Because the API is secured with Amazon Cognito, Amazon Q Business requires the user to authenticate against the user credentials available in Amazon Cognito.
On successful authentication, API Gateway forwards the request to Lambda.
The API response is returned to the user through the Amazon Q Business chat interface.

Prerequisites
Before you begin the walkthrough, you must have an AWS account. If you don’t have one, sign up for one. Additionally, you must have access to the following services:

Amazon API Gateway
AWS CloudFormation
Amazon Cognito
Amazon DynamoDB
AWS IAM Identity Center
AWS Lambda
Amazon Q Business Pro (This will have an additional monthly cost)
Amazon S3

Launch the CloudFormation template
Launch the following CloudFormation template to set up Amazon Cognito, API Gateway, DynamoDB, and Lambda resources.

After you deploy the stack, navigate to the Outputs tab for the stack on the AWS CloudFormation console and note the resource details. We use those values later in this post.
If you’re running the CloudFormation template multiple times, make sure to choose a unique name for the stack each time.

Create an Amazon Q Business application
Complete the following steps to create an Amazon Q Business application:

On the Amazon Q Business console, choose Applications in the navigation pane.
Choose Create application.

Provide an application name (for example, product-mgmt-app).
Leave the other settings as default and choose Create.

The application will be created in a few seconds.

On the application details page, choose Data source.
Choose Add an index.
For Index name, enter a name for the index.
For Index provisioning, select Enterprise or Starter.
For Number of units, leave as the default 1.
Choose Add an index.

On the Data source page, choose Add a data source.
Choose Amazon S3 as your data source and enter a unique name.
Enter the data source location as the value of BucketName from the CloudFormation stack outputs in the format s3://<name_here>.

In a later step, we upload a file to this S3 bucket.

For IAM role, choose Create a new service role (recommended).
For Sync scope, select Full sync.
For Frequency, select Run on demand.
Choose Add data source.
On the application details page, choose Manage user access.
Choose Add groups and users.
You can use existing users or groups in IAM Identity Center or create new users and groups, then choose Confirm.

Only these groups and users have access to the Amazon Q Business application for their subscriptions.

Take note of deployed URL of the application to use in a later step.
On the Amazon S3 console, locate the S3 bucket you noted earlier and upload the sample document.
On the Amazon Q Business console, navigate to the application details page and sync the Amazon S3 data source.

Configure Amazon Cognito
Complete the following steps to set up Amazon Cognito:

On the Amazon Cognito console, navigate to the user pool created using the CloudFormation template (ending with -ProductUserPool).
Under Branding in the navigation pane, choose Domain.
On the Actions menu, choose Create Cognito domain.

We did not create a domain when we created the user pool using the CloudFormation template.

For Cognito domain, enter a domain prefix.
For Version, select Hosted UI.
Choose Create Cognito domain.

Under Applications in the navigation pane, choose App clients.
Choose your app client.

On the app client detail page, choose Login pages and then choose Edit the managed login pages configuration.
For URL, enter the deployed URL you noted earlier, followed by /oauth/callback. For example, https://xxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback.
Specify your identity provider, OAuth 2.0 grant type, OpenID Connect scopes, and custom scopes.

Custom scopes are defined as part of the API configuration in API Gateway. This will help Amazon Q Business determine what action a user is allowed to take. In this case, we are allowing the user to read, write, and delete. However, you can change this based on what you want your users to do using the Amazon Q Business chat.

Choose Save changes.

Take note of the Client ID and Client secret values in the App client information section to use in a later step.

Amazon Cognito doesn’t support changing the client secret after you have created the app client; a new app client is needed if you want to change the client secret.
Lastly, you have to add at least one user to the Amazon Cognito user pool.

Choose Users under User management in the navigation pane and choose Create user.
Create a user to add to your Amazon Cognito user pool.

We will use this user to authenticate before we can chat and ask questions to the backend system using Amazon Q Business.

Create an Amazon Q Business custom plugin
Complete the following steps to create your custom plugin:

On the Amazon Q Business console, navigate to the application you created.
Under Actions in the navigation pane, choose Plugins.
Choose Add plugin.

Select Create custom plugin.
Provide a plugin name (for example, Products).
Under API schema source, select Define with in-line OpenAPI schema editor and enter the following code:

openapi: 3.0.0
info:
  title: CRUD API
  version: 1.0.0
  description: API for performing CRUD operations
servers:
  - url: put api gateway endpoint url here, copy it from cloudformation output

paths:
  /products:
    get:
      summary: List all products
      security:
        - OAuth2:
            - products/read
      description: Returns a list of all available products
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/Product'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    post:
      summary: Create a new product
      security:
        - OAuth2:
            - products/write
      description: Creates a new product
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Product'
      responses:
        '201':
          description: Created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '400':
          description: Bad Request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
  /products/{id}:
    get:
      summary: Get a product
      security:
        - OAuth2:
            - products/read
      description: Retrieves a specific product by its ID
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to retrieve
          schema:
            type: string
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    put:
      summary: Update a product
      security:
        - OAuth2:
            - products/write
      description: Updates an existing product
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to update
          schema:
            type: string
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Product'
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Product'
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
    delete:
      summary: Delete a product
      security:
        - OAuth2:
            - products/delete
      description: Deletes a specific product by its ID
      parameters:
        - name: id
          in: path
          required: true
          description: The ID of the product to delete
          schema:
            type: string
      responses:
        '204':
          description: Successful response
        '404':
          description: Product not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
components:
  securitySchemes:
    OAuth2:
      type: oauth2
      flows:
        authorizationCode:
          authorizationUrl: <Cognito domain>/oauth2/authorize
          tokenUrl: <Cognito domain>/oauth2/token
          scopes:
            products/read: read product
            products/write: write product
            products/delete: delete product
  schemas:
    Product:
      type: object
      required:
        - id
        - name
        - description
      properties:
        id:
          type: string
        name:
          type: string
        description:
          type: string
    Error:
      type: object
      properties:
        error:
          type: string

In the YAML file, replace the URL value with the value of ProductAPIEndpoint from the CloudFormation stack outputs:

servers:
  - url: https://<<xxxx>>.execute-api.us-east-1.amazonaws.com/dev

Replace the Amazon Cognito domain URL with the domain you created earlier:

authorizationCode:
  authorizationUrl: https://xxxx.auth.us-east-1.amazoncognito.com/oauth2/authorize
  tokenUrl: https://xxxx.auth.us-east-1.amazoncognito.com/oauth2/token

The YAML file contains the schema (OpenAPI 3.x) that Amazon Q Business uses to decide which API needs to be called based on the description. For example, the description of the GET /products operation says Returns a list of all available products, which instructs Amazon Q Business to call this API whenever a user makes a request to list all products.

For authentication, select Authentication required.
For AWS Secrets Manager secret, choose Create and add new secret, enter the client ID and client secret you saved earlier, and enter the callback URL the same way as you did for the Amazon Cognito hosted UI (https://<>.chat.qbusiness.<<region>>.on.aws/oauth/callback).
For Choose a method to authorize Amazon Q Business, choose Create and use a new service role.
Choose Create plugin.

The last step is to enable the chat orchestration feature so Amazon Q Business can select the plugin automatically.

On the custom plugin details page, choose Admin controls and guardrails under Enhancements in the navigation pane.
In the Global controls section, choose Edit.

Select Allow Amazon Q Business to automatically orchestrate chat queries across plugins and data sources, then choose Save.

Configure API Gateway, Lambda, and DynamoDB resources
Everything related to API Gateway, Lambda, and DynamoDB is already configured using the CloudFormation template. Details are available on the Outputs tab of the stack details page. You can also review the details of the Lambda function and DynamoDB table on their respective service consoles. To learn how the Lambda function is exposed as an API through API Gateway, review the details on the API Gateway console.
Chat with Amazon Q Business
Now you’re ready to chat with Amazon Q Business.

On the Amazon Q Business console, navigate to your application.
Choose the link for Deployed URL.
Authenticate using IAM Identity Center (this is to make sure you have access to Amazon Q Business Pro).

You can now ask questions in natural language.
In the following example, we check if Amazon Q Business is able to access the data from the S3 bucket by asking “List all the products and their description in a table.”

After the product descriptions are available, start chatting and ask questions like “Can you create product <product name> with the same description please?” Alternatively, you can create a new product that isn’t listed in the sample document uploaded in Amazon S3. Amazon Q Business will automatically pick the right plugin (in this case, Products).
Subsequent requests for API calls to go through the custom plugin will ask you to authorize your access. Choose Authorize and authenticate with the user credentials created in Amazon Cognito earlier. After you’re authenticated, Amazon Q Business will cache the session token for subsequent API calls and complete the request.

You can query on the products that are available in the backend by asking questions like the following:

Can you please list all the products?
Delete a product by ID or by name.
Create a new product with the name ‘Gloves’ and description as ‘Football gloves’ with automatic in-built cooling

Based on the preceding prompt, a product has been created in the products table in DynamoDB.

Cost considerations
The cost of setting up this solution is based on the price of the individual AWS services being used. Prices of those services are available on the individual service pages. The only mandatory cost is the Amazon Q Business Pro license. For more information, see Amazon Q Business pricing.
Clean up
Complete the following steps to clean up your resources:

Delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.
Delete the Amazon Q Business application.
Delete the Amazon Cognito user pool domain.
Empty and delete the S3 bucket. For instructions, refer to Deleting a general purpose bucket.

Conclusion
In this post, we explored how Amazon Q Business can seamlessly integrate with enterprise systems using a custom plugin to help enterprises unlock the value of their data. We walked you through the process of setting up the custom plugin, including configuring the necessary Amazon Cognito and authentication mechanisms.
With this custom plugin, organizations can empower their employees to work efficiently, find answers quickly, accelerate reporting, automate workflows, and enhance collaboration. You can ask Amazon Q Business natural language questions and watch as it surfaces the most relevant information from your company’s backend system and acts on requests.
Don’t miss out on the transformative power of generative AI and Amazon Q Business. Sign up today and experience the difference that Amazon Q Business can make for your organization’s workflow automation and the efficiency it brings.

About the Authors
Shubhankar Sumar is a Senior Solutions Architect at Amazon Web Services (AWS), working with enterprise software and SaaS customers across the UK to help architect secure, scalable, efficient, and cost-effective systems. He is an experienced software engineer, having built many SaaS solutions powered by generative AI. Shubhankar specializes in building multi-tenant systems on the cloud. He also works closely with customers to bring generative AI capabilities to their SaaS applications.
Dr. Anil Giri is a Solutions Architect at Amazon Web Services. He works with enterprise software and SaaS customers to help them build generative AI applications and implement serverless architectures on AWS. His focus is on guiding clients to create innovative, scalable solutions using cutting-edge cloud technologies.

Ankur Agarwal is a Principal Enterprise Architect at Amazon Web Services Professional Services. Ankur works with enterprise clients to help them get the most out of their investment in cloud computing. He advises on using cloud-based applications, data, and AI technologies to deliver maximum business value.

Detect hallucinations for RAG-based systems

With the rise of generative AI and knowledge extraction in AI systems, Retrieval Augmented Generation (RAG) has become a prominent tool for enhancing the accuracy and reliability of AI-generated responses. RAG is a way to incorporate additional data that the large language model (LLM) was not trained on. This can also help reduce the generation of false or misleading information (hallucinations). However, even with RAG’s capabilities, the challenge of AI hallucinations remains a significant concern.
As AI systems become increasingly integrated into our daily lives and critical decision-making processes, the ability to detect and mitigate hallucinations is paramount. Most hallucination detection techniques focus on the prompt and the response alone. However, where additional context is available, such as in RAG-based applications, new techniques can be introduced to better mitigate the hallucination problem.
This post walks you through how to create a basic hallucination detection system for RAG-based applications. We also weigh the pros and cons of different methods in terms of accuracy, precision, recall, and cost.
Although there are currently many new state-of-the-art techniques, the approaches outlined in this post aim to provide simple, user-friendly techniques that you can quickly incorporate into your RAG pipeline to increase the quality of the outputs in your RAG system.
Solution overview
Hallucinations can be categorized into three types, as illustrated in the following graphic.

Scientific literature has come up with multiple hallucination detection techniques. In the following sections, we discuss and implement four prominent approaches to detecting hallucinations: using an LLM prompt-based detector, semantic similarity detector, BERT stochastic checker, and token similarity detector. Finally, we compare approaches in terms of their performance and latency.
Prerequisites
To use the methods presented in this post, you need an AWS account with access to Amazon SageMaker, Amazon Bedrock, and Amazon Simple Storage Service (Amazon S3).
From your RAG system, you will need to store three things:

Context – The area of text that is relevant to a user’s query
Question – The user’s query
Answer – The answer provided by the LLM

The resulting table should look similar to the following example.

question | context | answer
What are cocktails? | Cocktails are alcoholic mixed… | Cocktails are alcoholic mixed…
What are cocktails? | Cocktails are alcoholic mixed… | They have distinct histories…
What is Fortnite? | Fortnite is a popular video… | Fortnite is an online multi…
What is Fortnite? | Fortnite is a popular video… | The average Fortnite player spends…

Approach 1: LLM-based hallucination detection
We can use an LLM to classify the responses from our RAG system into context-conflicting hallucinations and facts. The aim is to identify which responses are based on the context or whether they contain hallucinations.
This approach consists of the following steps:

Create a dataset with questions, context, and the response you want to classify.
Send a call to the LLM with the following information:

Provide the statement (the answer from the LLM that we want to classify).
Provide the context from which the LLM created the answer.
Instruct the LLM to tag sentences in the statement that are directly based on the context.

Parse the outputs and obtain sentence-level numeric scores between 0–1.
Make sure to keep the LLM, memory, and parameters independent from the ones used for Q&A. (This is so the LLM can’t access the previous chat history to draw conclusions.)
Tune the decision threshold for the hallucination scores for a specific dataset based on domain, for example.
Use the threshold to classify the statement as hallucination or fact.

Create a prompt template
To use the LLM to classify the answer to your question, you need to set up a prompt. We want the LLM to take in the context and the answer, and determine a hallucination score from the given context. The score will be encoded between 0 and 1, with 0 being an answer directly from the context and 1 being an answer with no basis in the context.
The following is a prompt with few-shot examples so the LLM knows what the expected format and content of the answer should be:
prompt = """\n\nHuman: You are an expert assistant helping human to check if statements are based on the context.
Your task is to read context and statement and indicate which sentences in the statement are based directly on the context.

Provide response as a number, where the number represents a hallucination score, which is a float between 0 and 1.
Set the float to 0 if you are confident that the sentence is directly based on the context.
Set the float to 1 if you are confident that the sentence is not based on the context.
If you are not confident, set the score to a float number between 0 and 1. Higher numbers represent higher confidence that the sentence is not based on the context.

Do not include any other information except for the score in the response. There is no need to explain your thinking.

<example>
Context: Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered, pay-as-you-go basis. Clients will often use this in combination with autoscaling (a process that allows a client to use more computing in times of high application usage, and then scale down to reduce costs when there is less traffic). These cloud computing web services provide various services related to networking, compute, storage, middleware, IoT and other processing capacity, as well as software tools via AWS server farms. This frees clients from managing, scaling, and patching hardware and operating systems. One of the foundational services is Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual cluster of computers, with extremely high availability, which can be interacted with over the internet via REST APIs, a CLI or the AWS console. AWS's virtual computers emulate most of the attributes of a real computer, including hardware central processing units (CPUs) and graphics processing units (GPUs) for processing; local/RAM memory; hard-disk/SSD storage; a choice of operating systems; networking; and pre-loaded application software such as web servers, databases, and customer relationship management (CRM).
Statement: 'AWS is Amazon subsidiary that provides cloud computing services.'
Assistant: 0.05
</example>

<example>
Context: Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered, pay-as-you-go basis. Clients will often use this in combination with autoscaling (a process that allows a client to use more computing in times of high application usage, and then scale down to reduce costs when there is less traffic). These cloud computing web services provide various services related to networking, compute, storage, middleware, IoT and other processing capacity, as well as software tools via AWS server farms. This frees clients from managing, scaling, and patching hardware and operating systems. One of the foundational services is Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual cluster of computers, with extremely high availability, which can be interacted with over the internet via REST APIs, a CLI or the AWS console. AWS's virtual computers emulate most of the attributes of a real computer, including hardware central processing units (CPUs) and graphics processing units (GPUs) for processing; local/RAM memory; hard-disk/SSD storage; a choice of operating systems; networking; and pre-loaded application software such as web servers, databases, and customer relationship management (CRM).
Statement: 'AWS revenue in 2022 was $80 billion.'
Assistant: 1
</example>

<example>
Context: Monkey is a common name that may refer to most mammals of the infraorder Simiiformes, also known as the simians. Traditionally, all animals in the group now known as simians are counted as monkeys except the apes, which constitutes an incomplete paraphyletic grouping; however, in the broader sense based on cladistics, apes (Hominoidea) are also included, making the terms monkeys and simians synonyms in regard to their scope. On average, monkeys are 150 cm tall.
Statement: 'Average monkey is 2 meters high and weights 100 kilograms.'
Assistant: 0.9
</example>

Context: {context}
Statement: {statement}

\n\nAssistant: ["""

### LANGCHAIN CONSTRUCTS
# prompt template
from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate(
    template=prompt,
    input_variables=["context", "statement"],
)
Configure the LLM
To retrieve a response from the LLM, you need to configure the LLM using Amazon Bedrock, similar to the following code:
import boto3
from langchain_community.llms import Bedrock  # or langchain.llms.Bedrock, depending on your LangChain version

def configure_llm() -> Bedrock:

    model_params = {
        "answer_length": 100,  # max number of tokens in the answer
        "temperature": 0.0,  # temperature during inference
        "top_p": 1,  # cumulative probability of sampled tokens
        "stop_words": ["\n\nHuman:", "]"],  # words after which the generation is stopped
    }
    bedrock_client = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1",
    )

    MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

    llm = Bedrock(
        client=bedrock_client,
        model_id=MODEL_ID,
        model_kwargs=model_params,
    )

    return llm
Get hallucination classifications from the LLM
The next step is to use the prompt, dataset, and LLM to get hallucination scores for each response from your RAG system. Taking this a step further, you can use a threshold to determine whether the response is a hallucination or not. See the following code:
from langchain.chains import LLMChain

def get_response_from_claude(context: str, answer: str, prompt_template: PromptTemplate, llm: Bedrock) -> float:

    llm_chain = LLMChain(llm=llm, prompt=prompt_template, verbose=False)
    # compute scores
    response = llm_chain(
        {"context": context, "statement": str(answer)}
    )
    scores = response["text"]
    try:
        scores = float(scores)
    except Exception:
        print(f"Could not parse LLM response: {scores}")
        scores = 0
    return scores
Approach 2: Semantic similarity-based detection
Under the assumption that if a statement is a fact, then there will be high similarity with the context, you can use semantic similarity as a method to determine whether a statement is an input-conflicting hallucination.
This approach consists of the following steps:

Create embeddings for the answer and the context using an LLM. (In this example, we use the Amazon Titan Embeddings model.)
Use the embeddings to calculate similarity scores between each sentence in the answer and the context. (In this case, we use cosine similarity as a distance metric.) Out-of-context (hallucinated) sentences should have low similarity with the context.
Tune the decision threshold for a specific dataset (such as domain dependent) to classify hallucinating statements.

Create embeddings with LLMs and calculate similarity
You can use LLMs to create embeddings for the context and the initial response to the question. After you have the embeddings, you can calculate the cosine similarity of the two. The cosine similarity score will return a number between 0 and 1, with 1 being perfect similarity and 0 being no similarity. To translate this to a hallucination score, we take 1 minus the cosine similarity. See the following code:
def similarity_detector(
    context: str,
    answer: str,
    llm: BedrockEmbeddings,
) -> float:
    """
    Check hallucinations using semantic similarity methods based on embeddings

    Parameters
    ----------
    context : str
        Context provided for RAG
    answer : str
        Answer from an LLM
    llm : BedrockEmbeddings
        Embeddings model

    Returns
    -------
    float
        Hallucination score (1 - cosine similarity between context and answer)
    """

    if len(context) == 0 or len(answer) == 0:
        return 0.0
    # calculate embeddings
    context_emb = llm.embed_query(context)
    answer_emb = llm.embed_query(answer)
    context_emb = np.array(context_emb).reshape(1, -1)
    answer_emb = np.array(answer_emb).reshape(1, -1)
    sim_score = cosine_similarity(context_emb, answer_emb)
    return 1 - sim_score[0][0]
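The following usage sketch shows how you could call the detector; the LangChain import path, the Titan Embeddings model ID, and the 0.5 threshold are assumptions to adapt to your environment, and context and answer come from your RAG system:

# Illustrative usage; import path, model ID, and threshold are assumptions
import boto3
from langchain_community.embeddings import BedrockEmbeddings

bedrock_client = boto3.client(service_name="bedrock-runtime", region_name="us-east-1")
embeddings = BedrockEmbeddings(
    client=bedrock_client,
    model_id="amazon.titan-embed-text-v1",  # Amazon Titan Embeddings model
)
score = similarity_detector(context=context, answer=answer, llm=embeddings)
is_hallucination = score > 0.5  # decision threshold tuned on a labeled validation set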
Approach 3: BERT stochastic checker
The BERT score uses contextual embeddings from a pre-trained language model such as BERT and matches words in candidate and reference sentences by cosine similarity. One of the traditional metrics for evaluation in natural language processing (NLP) is the BLEU score. The BLEU score primarily measures precision by calculating how many n-grams (consecutive tokens) from the candidate sentence appear in the reference sentences. It focuses on matching these consecutive token sequences between candidate and reference sentences, while incorporating a brevity penalty to prevent overly short translations from receiving artificially high scores. Unlike the BLEU score, which focuses on token-level comparisons, the BERT score uses contextual embeddings to capture semantic similarities between words or full sentences. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, the BERT score computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
In our approach, we use the BERT score as a stochastic checker for hallucination detection. The idea is that if you generate multiple answers from an LLM and there are large variations (inconsistencies) between them, then there is a good chance that these answers are hallucinated. We first generate N random samples (sentences) from the LLM. We then compute BERT scores by comparing each sentence in the original generated paragraph against its corresponding sentence across the N newly generated stochastic samples. This is done by embedding all sentences using an LLM-based embedding model and calculating cosine similarity. Our hypothesis is that factual sentences will remain consistent across multiple generations, resulting in high BERT scores (indicating similarity). Conversely, hallucinated content will likely vary across different generations, resulting in low BERT scores between the original sentence and its stochastic variants. By establishing a threshold for these similarity scores, we can flag sentences with consistently low BERT scores as potential hallucinations, because they demonstrate semantic inconsistency across multiple generations from the same model, as illustrated in the sketch that follows.
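The following is a minimal sketch of the stochastic checker, assuming generate_answer is a callable that re-queries the LLM with a non-zero temperature and embed is a callable that returns an embedding vector for a sentence; the naive sentence splitting and the default sample count are simplifications to adapt for production:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def stochastic_checker(question, context, original_answer, generate_answer, embed, n_samples=5):
    # re-generate several answers for the same question and context
    samples = [generate_answer(question, context) for _ in range(n_samples)]
    original_sentences = [s.strip() for s in original_answer.split(".") if s.strip()]
    scores = []
    for i, sentence in enumerate(original_sentences):
        sent_emb = np.array(embed(sentence)).reshape(1, -1)
        sims = []
        for sample in samples:
            sample_sentences = [s.strip() for s in sample.split(".") if s.strip()]
            if i >= len(sample_sentences):
                sims.append(0.0)  # missing counterpart counts as inconsistency
                continue
            sample_emb = np.array(embed(sample_sentences[i])).reshape(1, -1)
            sims.append(cosine_similarity(sent_emb, sample_emb)[0][0])
        # low average similarity across regenerations flags a potential hallucination
        scores.append(float(np.mean(sims)))
    return scores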
Approach 4: Token similarity detection
With the token similarity detector, we extract unique sets of tokens from the answer and the context. Here, we can use one of the LLM tokenizers or simply split the text into individual words. Then, we calculate similarity between each sentence in the answer and the context. There are multiple metrics that can be used for token similarity, including a BLEU score over different n-grams, a ROUGE score (an NLP metric similar to BLEU but calculates recall vs. precision) over different n-grams, or simply the proportion of the shared tokens between the two texts. Out-of-context (hallucinated) sentences should have low similarity with the context.
def intersection_detector(
    context: str,
    answer: str,
    length_cutoff: int = 3,
) -> dict[str, float]:
    """
    Check hallucinations using token intersection metrics

    Parameters
    ----------
    context : str
        Context provided for RAG
    answer : str
        Answer from an LLM
    length_cutoff : int
        If the answer is shorter than length_cutoff, return scores of 0.0

    Returns
    -------
    dict[str, float]
        Token intersection and BLEU scores
    """

    # populate with relevant stopwords such as articles
    stopword_set = set()

    # remove punctuation and lowercase
    context = re.sub(r"[^\w\s]", "", context).lower()
    answer = re.sub(r"[^\w\s]", "", answer).lower()

    # calculate metrics
    if len(answer) >= length_cutoff:
        # calculate token intersection
        tokenizer = re.compile(r"\w+")
        context_split = {term for term in tokenizer.findall(context) if term not in stopword_set}
        answer_split = {term for term in tokenizer.findall(answer) if term not in stopword_set}
        intersection = sum([term in context_split for term in answer_split]) / len(answer_split)

        # calculate BLEU score
        bleu = evaluate.load("bleu")
        bleu_score = bleu.compute(predictions=[answer], references=[context])["precisions"]
        bleu_score = sum(bleu_score) / len(bleu_score)

        return {
            "intersection": 1 - intersection,
            "bleu": 1 - bleu_score,
        }

    return {"intersection": 0, "bleu": 0}
Comparing approaches: Evaluation results
In this section, we compare the hallucination detection approaches described in the post. We run an experiment on three RAG datasets, including Wikipedia article data and two synthetically generated datasets. Each example in a dataset includes a context, a user’s question, and an LLM answer labeled as correct or hallucinated. We run each hallucination detection method on all questions and aggregate the accuracy metrics across the datasets.
The highest accuracy (proportion of sentences correctly classified as hallucination vs. fact) is demonstrated by the BERT stochastic checker and the LLM prompt-based detector. The LLM prompt-based detector outperforms the BERT checker in precision, and the BERT stochastic checker has a higher recall. The semantic similarity and token similarity detectors show very low accuracy and recall but perform well with regard to precision. This indicates that those detectors might only be useful for identifying the most evident hallucinations.
Aside from the token similarity detector, the LLM prompt-based detector is the most cost-effective option in terms of the number of LLM calls, because the number of calls is constant relative to the size of the context and the response (although cost still varies with the number of input tokens). The semantic similarity detector's cost is proportional to the number of sentences in the context and the response, so as the context grows, it can become increasingly expensive.
For use cases where precision is the highest priority, we would recommend the token similarity, LLM prompt-based, and semantic similarity methods, whereas to provide high recall, the BERT stochastic method outperforms the other methods. The following table summarizes the metrics compared for each method.

Technique | Accuracy* | Precision* | Recall* | Cost (Number of LLM Calls) | Explainability
Token Similarity Detector | 0.47 | 0.96 | 0.03 | 0 | Yes
Semantic Similarity Detector | 0.48 | 0.90 | 0.02 | K*** | Yes
LLM Prompt-Based Detector | 0.75 | 0.94 | 0.53 | 1 | Yes
BERT Stochastic Checker | 0.76 | 0.72 | 0.90 | N+1** | Yes

*Averaged over the Wikipedia dataset and generative AI synthetic datasets
**N = Number of random samples
***K = Number of sentences
These results suggest that an LLM-based detector offers a good trade-off between accuracy and cost (in LLM calls and additional answer latency). We recommend using a combination of a token similarity detector to filter out the most evident hallucinations and an LLM-based detector to identify more difficult ones, as illustrated in the sketch that follows.
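The following minimal sketch shows one way to implement that cascade, assuming the intersection_detector and get_response_from_claude functions defined earlier; the 0.9 and 0.5 thresholds are illustrative values to tune on your own labeled data:

def detect_hallucination(context: str, answer: str, prompt_template: PromptTemplate, llm: Bedrock) -> bool:
    # Stage 1: cheap token-overlap check catches the most evident hallucinations
    token_scores = intersection_detector(context, answer)
    if token_scores["intersection"] > 0.9:
        return True
    # Stage 2: LLM prompt-based detector handles the harder cases
    llm_score = get_response_from_claude(context, answer, prompt_template, llm)
    return llm_score >= 0.5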
Conclusion
As RAG systems continue to evolve and play an increasingly important role in AI applications, the ability to detect and prevent hallucinations remains crucial. Through our exploration of four different approaches—LLM prompt-based detection, semantic similarity detection, BERT stochastic checking, and token similarity detection—we’ve demonstrated various methods to address this challenge. Although each approach has its strengths and trade-offs in terms of accuracy, precision, recall, and cost, the LLM prompt-based detector shows particularly promising results with accuracy rates above 75% and a relatively low additional cost. Organizations can choose the most suitable method based on their specific needs, considering factors such as computational resources, accuracy requirements, and cost constraints. As the field continues to advance, these foundational techniques provide a starting point for building more reliable and trustworthy RAG systems.

About the Authors
 Zainab Afolabi is a Senior Data Scientist at the Generative AI Innovation Centre in London, where she leverages her extensive expertise to develop transformative AI solutions across diverse industries. She has over eight years of specialised experience in artificial intelligence and machine learning, as well as a passion for translating complex technical concepts into practical business applications.
Aiham Taleb, PhD, is a Senior Applied Scientist at the Generative AI Innovation Center, working directly with AWS enterprise customers to leverage Gen AI across several high-impact use cases. Aiham has a PhD in unsupervised representation learning, and has industry experience that spans across various machine learning applications, including computer vision, natural language processing, and medical imaging.
Nikita Kozodoi, PhD, is a Senior Applied Scientist at the AWS Generative AI Innovation Center working on the frontier of AI research and business. Nikita builds generative AI solutions to solve real-world business problems for AWS customers across industries and holds a PhD in Machine Learning.
Liza (Elizaveta) Zinovyeva is an Applied Scientist at AWS Generative AI Innovation Center and is based in Berlin. She helps customers across different industries to integrate Generative AI into their existing applications and workflows. She is passionate about AI/ML, finance and software security topics. In her spare time, she enjoys spending time with her family, sports, learning new technologies, and table quizzes.

ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model De …

VLMs have become central to building general-purpose AI systems capable of understanding and interacting in digital and real-world settings. By integrating visual and textual data, VLMs have driven advancements in multimodal reasoning, image editing, GUI agents, robotics, and more, influencing sectors like education and healthcare. Despite this progress, VLMs still lag behind human capabilities, particularly in tasks involving 3D reasoning, object counting, creative visual interpretation, and interactive gameplay. A challenge lies in the scarcity of rich, diverse multimodal datasets, unlike the abundant textual resources available to LLMs. Additionally, multimodal data complexity poses significant training and evaluation hurdles. 

Researchers at ByteDance have developed Seed1.5-VL, a compact yet powerful vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts LLM. Despite its efficient architecture, Seed1.5-VL achieves top results on 38 out of 60 public VLM benchmarks, excelling in tasks like GUI control, video understanding, and visual reasoning. It is trained on trillions of multimodal tokens using advanced data synthesis and post-training techniques, including human feedback. Innovations in training, such as hybrid parallelism and vision token redistribution, optimize performance. The model’s efficiency and strong reasoning capabilities suit real-world interactive applications like chatbots. 

The Seed1.5-VL architecture features a vision encoder, an MLP adapter, and an LLM. Its custom vision encoder, Seed-ViT, supports native-resolution image input using 2D RoPE and processes images through 14×14 patches, followed by average pooling and an MLP. Pretraining involves masked image modeling, contrastive learning, and omni-modal alignment using images, text, and video-audio-caption pairs. The model uses a Dynamic Frame-Resolution Sampling approach for video encoding that adapts frame rates and resolutions based on content complexity, balancing efficiency and detail. This method enables effective spatial-temporal understanding within a token budget, ensuring comprehensive video representation across varied lengths and complexities. 

The pre-training of Seed1.5-VL involved curating 3 trillion high-quality tokens across diverse domains. Image-text pairs from the web were filtered using CLIP scores, size/aspect ratio checks, and deduplication to reduce noise. Using domain-based sampling and duplication strategies, rare visual concepts were overrepresented to address class imbalance. Specialized datasets were added for OCR using annotated and synthetic text-rich images, charts, and tables—object grounding and counting tasks utilized bounding boxes, points, and auto-labeled web data. Additional tasks included 3D spatial understanding using depth annotations, and video understanding through multi-frame captioning, QA, and temporal grounding to support dynamic content analysis. 

The evaluation highlights Seed-ViT and Seed1.5-VL’s competitive performance across vision-language tasks. Seed-ViT, despite having significantly fewer parameters, matches or outperforms larger models like InternVL-C and EVA-CLIP on zero-shot image classification tasks, showing high accuracy and robustness on datasets such as ImageNet-A and ObjectNet. Seed1.5-VL demonstrates strong capabilities in multimodal reasoning, general VQA, document understanding, and grounding. It achieves state-of-the-art benchmarks, particularly in complex reasoning, counting, and chart interpretation tasks. The model’s “thinking” mode, which incorporates longer reasoning chains, further enhances performance, indicating its strong ability in detailed visual understanding and task generalization. 

In conclusion, Seed1.5-VL is a vision-language foundation model featuring a 532 M-parameter vision encoder and a 20 B-parameter Mixture-of-Experts language model. Despite its compact size, it achieves state-of-the-art results on 38 of 60 public benchmarks and excels in complex reasoning, OCR, diagram interpretation, 3D spatial understanding, and video analysis. It also performs well in agent-driven tasks like GUI control and gameplay, surpassing models like OpenAI CUA and Claude 3.7. The model shows strong generalization to tasks beyond its training scope. The study outlines its architecture, data pipeline, and training methods and identifies future directions, including enhancing tool-use and visual reasoning capabilities. 

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

Hugging Face Introduces a Free Model Context Protocol (MCP) Course: A …

Hugging Face has released a free/open-source course on the Model Context Protocol (MCP), an open approach developed by Anthropic to facilitate the integration of large language models (LLMs) with external data sources and tools. This course aims to provide developers and AI practitioners with the knowledge and skills to leverage MCP for building more context-aware and capable AI applications.

Understanding the Model Context Protocol (MCP)

The Model Context Protocol (MCP) is designed to address the complexities involved in connecting AI models to diverse external systems. Traditionally, integrating AI models with various data sources required custom solutions for each connection, leading to inefficiencies and scalability issues. MCP introduces a standardized protocol that enables AI models to interact with external resources through a unified interface, simplifying the integration process and enhancing interoperability.

By adopting MCP, developers can build AI applications that are more adaptable and capable of accessing real-time information from multiple sources, thereby improving the relevance and accuracy of AI-driven insights and actions.

Overview of the Hugging Face MCP Course

The Hugging Face MCP Course is structured to guide learners from foundational concepts to practical applications of MCP. The curriculum is divided into several units, each focusing on different aspects of MCP:

Unit 0: Onboarding

This introductory unit provides an overview of the course objectives and outlines the prerequisites for participants. It sets the stage for the subsequent units by establishing the necessary context and tools required for the course.

Unit 1: MCP Fundamentals

In this unit, learners delve into the core principles of MCP, exploring its architecture, key components, and the problems it aims to solve. The unit emphasizes understanding how MCP facilitates seamless integration between AI models and external systems.

Unit 2: Building an MCP Application

This hands-on unit guides participants through the process of developing a simple MCP application. By applying the concepts learned, learners gain practical experience in implementing MCP in real-world scenarios.

Unit 3: Advanced MCP Development

Focusing on more complex aspects, this unit covers the deployment of MCP applications using the Hugging Face ecosystem and partner services. It also explores advanced topics and best practices for MCP implementation.

Bonus Units

Additional content is provided to enhance learning, including collaborations with Hugging Face partners and exploration of the latest MCP tools and implementations.

Upon completion of the course, participants have the opportunity to earn a certification, validating their proficiency in MCP.

Getting Started with MCP

To successfully engage with the MCP course, participants should have a foundational understanding of AI and LLM concepts, familiarity with software development principles, and experience with at least one programming language, such as Python or TypeScript. The course provides resources to assist learners in meeting these prerequisites if needed.

All course materials are accessible online, requiring only a computer with an internet connection and a Hugging Face account. This accessibility ensures that a wide range of learners can participate and benefit from the course.

The Significance of Learning MCP

As AI continues to evolve, the ability to integrate models with various data sources and tools becomes increasingly critical. MCP offers a standardized approach to this integration, promoting efficiency and scalability. By mastering MCP, developers can create AI applications that are more responsive, context-aware, and capable of delivering enhanced value across different domains.

The Hugging Face MCP Course provides a structured pathway to acquiring this expertise, empowering learners to contribute effectively to the development of advanced AI systems.

Check out the Course here. All credit for this research goes to the researchers of this project.

Stability AI Introduces Adversarial Relativistic-Contrastive (ARC) Pos …

Text-to-audio generation has emerged as a transformative approach for synthesizing sound directly from textual prompts, offering practical use in music production, gaming, and virtual experiences. Under the hood, these models typically employ Gaussian flow-based techniques such as diffusion or rectified flows. These methods model the incremental steps that transition from random noise to structured audio. While highly effective in producing high-quality soundscapes, the slow inference speeds have posed a barrier to real-time interactivity. It is particularly limiting when creative users expect an instrument-like responsiveness from these tools.

Latency is the primary issue with these systems. Current text-to-audio models can take several seconds or even minutes to generate a few seconds of audio. The core bottleneck lies in their step-based inference architecture, requiring between 50 and 100 iterations per output. Previous acceleration strategies focus on distillation methods where smaller models are trained under the supervision of larger teacher models to replicate multi-step inference in fewer steps. However, these distillation methods are computationally expensive. They demand large-scale storage for intermediate training outputs or require simultaneous operation of several models in memory, which hinders their adoption, especially on mobile or edge devices. Also, such methods often sacrifice output diversity and introduce over-saturation artifacts.

While a few adversarial post-training methods have been attempted to bypass the cost of distillation, their success has been limited. Most existing implementations rely on partial distillation for initialization or do not scale well to complex audio synthesis. Also, audio applications have seen fewer fully adversarial solutions. Tools like Presto integrate adversarial objectives but still depend on teacher models and CFG-based training for prompt adherence, which restricts their generative diversity.

Researchers from UC San Diego, Stability AI, and Arm introduced Adversarial Relativistic-Contrastive (ARC) post-training. This approach sidesteps the need for teacher models, distillation, or classifier-free guidance. Instead, ARC enhances an existing pre-trained rectified flow generator by integrating two novel training objectives: a relativistic adversarial loss and a contrastive discriminator loss. These help the generator produce high-fidelity audio in fewer steps while maintaining strong alignment with text prompts. When paired with the Stable Audio Open (SAO) framework, the result was a system capable of generating 12 seconds of 44.1 kHz stereo audio in only 75 milliseconds on an H100 GPU and around 7 seconds on mobile devices.

With ARC methodology, they introduced Stable Audio Open Small, a compact and efficient version of SAO tailored for resource-constrained environments. This model contains 497 million parameters and uses an architecture built on a latent diffusion transformer. It consists of three main components: a waveform-compressing autoencoder, a T5-based text embedding system for semantic conditioning, and a DiT (Diffusion Transformer) that operates within the latent space of the autoencoder. Stable Audio Open Small can generate stereo audio up to 11 seconds long at 44.1 kHz. It is designed to be deployed using the ‘stable-audio-tools’ library and supports ping-pong sampling, enabling efficient few-step generation. The model demonstrated exceptional inference efficiency, achieving generation speeds of under 7 seconds on a Vivo X200 Pro phone after applying dynamic Int8 quantization, which also cut RAM usage from 6.5GB to 3.6 GB. This makes it especially viable for on-device creative applications like mobile audio tools and embedded systems.

The ARC training approach involves replacing the traditional L2 loss with an adversarial formulation where generated and real samples, paired with identical prompts, are evaluated by a discriminator trained to distinguish between them. A contrastive objective teaches the discriminator to rank accurate audio-text pairs higher than mismatched ones to improve prompt relevance. These paired objectives eliminate the need for CFG while achieving better prompt adherence. Also, ARC adopts ping-pong sampling to refine the audio output through alternating denoising and re-noising cycles, reducing inference steps without compromising quality.
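As a rough, non-authoritative sketch of these two objectives (a generic relativistic-plus-contrastive formulation in PyTorch, not the authors' released implementation), the losses could be expressed as follows:

import torch
import torch.nn.functional as F

def relativistic_adversarial_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    # The discriminator learns to score real audio above generated audio for the
    # same prompt; the generator optimizes the mirrored objective.
    disc_loss = F.softplus(-(d_real - d_fake)).mean()
    gen_loss = F.softplus(-(d_fake - d_real)).mean()
    return disc_loss, gen_loss

def contrastive_discriminator_loss(d_matched: torch.Tensor, d_mismatched: torch.Tensor):
    # The discriminator ranks correctly paired audio-text inputs above mismatched
    # pairs, which pushes generations toward better prompt adherence.
    return F.softplus(-(d_matched - d_mismatched)).mean()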

ARC’s performance was evaluated extensively. In objective tests, it achieved an FDopenl3 score of 84.43, a KLpasst score of 2.24, and a CLAP score of 0.27, indicating balanced quality and semantic precision. Diversity was notably strong, with a CLAP Conditional Diversity Score (CCDS) of 0.41. Real-Time Factor reached 156.42, reflecting outstanding generation speed, while GPU memory usage remained at a practical 4.06 GB. Subjectively, ARC scored 4.4 for diversity, 4.2 for quality, and 4.2 for prompt adherence in human evaluations involving 14 participants. Unlike distillation-based models like Presto, which scored higher on quality but dropped to 2.7 on diversity, ARC presented a more balanced and practical solution.

Several key takeaways from the research by Stability AI on Adversarial Relativistic-Contrastive (ARC) post-training and Stable Audio Open Small include:

ARC post-training avoids distillation and CFG, relying on adversarial and contrastive losses.

ARC generates 12s of 44.1 kHz stereo audio in 75ms on H100 and 7s on mobile CPUs.

It achieves 0.41 CLAP Conditional Diversity Score, the highest among tested models.

Subjective scores: 4.4 (diversity), 4.2 (quality), and 4.2 (prompt adherence).

Ping-pong sampling enables few-step inference while refining output quality.

Stable Audio Open Small offers 497M parameters, supports 8-step generation, and is compatible with mobile deployments.

On Vivo X200 Pro, inference latency dropped from 15.3s to 6.6s with half the memory.

ARC and SAO Small provide real-time solutions for music, games, and creative tools.

In conclusion, the combination of ARC post-training and Stable Audio Open Small eliminates the reliance on resource-intensive distillation and classifier-free guidance, enabling researchers to deliver a streamlined adversarial framework that accelerates inference without compromising output quality or prompt adherence. ARC enables fast, diverse, and semantically rich audio synthesis in high-performance and mobile environments. With Stable Audio Open Small optimized for lightweight deployment, this research lays the groundwork for integrating responsive, generative audio tools into everyday creative workflows, from professional sound design to real-time applications on edge devices.

Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project.

How Apoidea Group enhances visual information extraction from banking …

This post is co-written with Ken Tsui, Edward Tsoi and Mickey Yip from Apoidea Group.
The banking industry has long struggled with the inefficiencies associated with repetitive processes such as information extraction, document review, and auditing. These tasks, which require significant human resources, slow down critical operations such as Know Your Customer (KYC) procedures, loan applications, and credit analysis. As a result, banks face operational challenges, including limited scalability, slow processing speeds, and high costs associated with staff training and turnover.
To address these inefficiencies, the implementation of advanced information extraction systems is crucial. These systems enable the rapid extraction of data from various financial documents—including bank statements, KYC forms, and loan applications—reducing both manual errors and processing time. As such, information extraction technology is instrumental in accelerating customer onboarding, maintaining regulatory compliance, and driving the digital transformation of the banking sector, particularly in high-volume document processing tasks.
The challenges in document processing are compounded by the need for specialized solutions that maintain high accuracy while handling sensitive financial data such as banking statements, financial statements, and company annual reports. This is where Apoidea Group, a leading AI-focused FinTech independent software vendor (ISV) based in Hong Kong, has made a significant impact. By using cutting-edge generative AI and deep learning technologies, Apoidea has developed innovative AI-powered solutions that address the unique needs of multinational banks. Their flagship product, SuperAcc, is a sophisticated document processing service featuring a set of proprietary document understanding models capable of processing diverse document types such as bank statements, financial statements, and KYC documents.
SuperAcc has demonstrated significant improvements in the banking sector. For instance, the financial spreading process, which previously required 4–6 hours, can now be completed in just 10 minutes, with staff needing less than 30 minutes to review the results. Similarly, in small and medium-sized enterprise (SME) banking, the review process for multiple bank statements spanning 6 months—used to extract critical data such as sales turnover and interbank transactions—has been reduced to just 10 minutes. This substantial reduction in processing time not only accelerates workflows but also minimizes the risk of manual errors. By automating repetitive tasks, SuperAcc enhances both operational efficiency and accuracy, using Apoidea’s self-trained machine learning (ML) models to deliver consistent, high-accuracy results in live production environments. These advancements have led to an impressive return on investment (ROI) of over 80%, showcasing the tangible benefits of implementing SuperAcc in banking operations.
AI transformation in banking faces several challenges, primarily due to stringent security and regulatory requirements. Financial institutions demand banking-grade security, necessitating compliance with standards such as ISO 9001 and ISO 27001. Additionally, AI solutions must align with responsible AI principles to facilitate transparency and fairness. Integration with legacy banking systems further complicates adoption, because these infrastructures are often outdated compared to rapidly evolving tech landscapes. Despite these challenges, SuperAcc has been successfully deployed and trusted by over 10 financial services industry (FSI) clients, demonstrating its reliability, security, and compliance in real-world banking environments.
To further enhance the capabilities of specialized information extraction solutions, advanced ML infrastructure is essential. Amazon SageMaker HyperPod offers an effective solution for provisioning resilient clusters to run ML workloads and develop state-of-the-art models. SageMaker HyperPod accelerates the development of foundation models (FMs) by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 GPUs. Its resiliency features automatically monitor cluster instances, detecting and replacing faulty hardware automatically, allowing developers to focus on running ML workloads without worrying about infrastructure management.
Building on this foundation of specialized information extraction solutions and using the capabilities of SageMaker HyperPod, we collaborate with Apoidea Group to explore the use of large vision language models (LVLMs) to further improve table structure recognition performance on banking and financial documents. In this post, we present our work and step-by-step code on fine-tuning the Qwen2-VL-7B-Instruct model using LLaMA-Factory on SageMaker HyperPod. Our results demonstrate significant improvements in table structure recognition accuracy and efficiency compared to the original base model and traditional methods, with particular success in handling complex financial tables and multi-page documents. Following the steps described in this post, you can also fine-tune your own model with domain-specific data to solve your information extraction challenges using the open source implementation.
Challenges in banking information extraction systems with multimodal models
Developing information extraction systems for banks presents several challenges, primarily due to the sensitive nature of documents, their complexity, and variety. For example, bank statement formats vary significantly across financial institutions, with each bank using unique layouts, different columns, transaction descriptions, and ways of presenting financial information. In some cases, documents are scanned with low quality and are poorly aligned, blurry, or faded, creating challenges for Optical Character Recognition (OCR) systems attempting to convert them into machine-readable text. Creating robust ML models is challenging due to the scarcity of clean training data. Current solutions rely on orchestrating models for tasks such as layout analysis, entity extraction, and table structure recognition. Although this modular approach addresses the issue of limited resources for training end-to-end ML models, it significantly increases system complexity and fails to fully use available information.
Models developed based on specific document features are inherently limited in their scope, restricting access to diverse and rich training data. This limitation results in upstream models, particularly those responsible for visual representation, lacking robustness. Furthermore, single-modality models fail to use the multi-faceted nature of information, potentially leading to less precise and accurate predictions. For instance, in table structure recognition tasks, models often lack the capability to reason about textual content while inferring row and column structures. Consequently, a common error is the incorrect subdivision of single rows or columns into multiple instances. Additionally, downstream models that heavily depend on upstream model outputs are susceptible to error propagation, potentially compounding inaccuracies introduced in earlier stages of processing.
Moreover, the substantial computational requirements of these multimodal systems present scalability and efficiency challenges. The necessity to maintain and update multiple models increases the operational burden, rendering large-scale document processing both resource-intensive and difficult to manage effectively. This complexity impedes the seamless integration and deployment of such systems in banking environments, where efficiency and accuracy are paramount.
The recent advances in multimodal models have demonstrated remarkable capabilities in processing complex visual and textual information. LVLMs represent a paradigm shift in document understanding, combining the robust textual processing capabilities of traditional language models with advanced visual comprehension. These models excel at tasks requiring simultaneous interpretation of text, visual elements, and their spatial relationships, making them particularly effective for financial document processing. By integrating visual and textual understanding into a unified framework, multimodal models offer a transformative approach to document analysis. Unlike traditional information extraction systems that rely on fragmented processing pipelines, these models can simultaneously analyze document layouts, extract text content, and interpret visual elements. This integrated approach significantly improves accuracy by reducing error propagation between processing stages while maintaining computational efficiency.
Advanced vision language models are typically pre-trained on large-scale multimodal datasets that include both image and text data. The pre-training process typically involves training the model on diverse datasets containing millions of images and associated text descriptions, sourced from publicly available datasets such as image-text pairs LAION-5B, Visual Question Answering (VQAv2.0), DocVQA, and others. These datasets provide a rich variety of visual content paired with textual descriptions, enabling the model to learn meaningful representations of both modalities. During pre-training, these models are trained using auto-regressive loss, where the model predicts the next token in a sequence given the previous tokens and the visual input. This approach allows the model to effectively align visual and textual features and generate coherent text responses based on the visual context. For image data specifically, modern vision-language models use pre-trained vision encoders, such as vision transformers (ViTs), as their backbone to extract visual features. These features are then fused with textual embeddings in a multimodal transformer architecture, allowing the model to understand the relationships between images and text. By pre-training on such diverse and large-scale datasets, these models develop a strong foundational understanding of visual content, which can be fine-tuned for downstream tasks like OCR, image captioning, or visual question answering. This pre-training phase is critical for enabling the model to generalize well across a wide range of vision-language tasks. The model architecture is illustrated in the following diagram.

Fine-tuning vision-language models for visual document understanding tasks offers significant advantages due to their advanced architecture and pre-trained capabilities. The model’s ability to understand and process both visual and textual data makes it inherently well-suited for extracting and interpreting text from images. Through fine-tuning on domain-specific datasets, the model can achieve superior performance in recognizing text across diverse fonts, styles, and backgrounds. This is particularly valuable in banking applications, where documents often contain specialized terminology, complex layouts, and varying quality scans.
Moreover, fine-tuning these models for visual document understanding tasks allows for domain-specific adaptation, which is crucial for achieving high precision in specialized applications. The model’s pre-trained knowledge provides a strong foundation, reducing the need for extensive training data and computational resources. Fine-tuning also enables the model to learn domain-specific nuances, such as unique terminologies or formatting conventions, further enhancing its performance. By combining a model’s general-purpose vision-language understanding with task-specific fine-tuning, you can create a highly efficient and accurate information extraction system that outperforms traditional methods, especially in challenging or niche use cases. This makes vision-language models powerful tools for advancing visual document understanding technology in both research and practical applications.
Solution overview
LLaMA-Factory is an open source framework designed for training and fine-tuning large language models (LLMs) efficiently. It supports over 100 popular models, including LLaMA, Mistral, Qwen, Baichuan, and ChatGLM, and integrates advanced techniques such as LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and full-parameter fine-tuning. The framework provides a user-friendly interface, including a web-based tool called LlamaBoard, which allows users to fine-tune models without writing code. LLaMA-Factory also supports various training methods like supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO), making it versatile for different tasks and applications.
The advantage of LLaMA-Factory lies in its efficiency and flexibility. It significantly reduces the computational and memory requirements for fine-tuning large models by using techniques like LoRA and quantization, enabling users to fine-tune models even on hardware with limited resources. Additionally, its modular design and integration of cutting-edge algorithms, such as FlashAttention-2 and GaLore, facilitate high performance and scalability. The framework also simplifies the fine-tuning process, making it accessible to both beginners and experienced developers. This democratization of LLM fine-tuning allows users to adapt models to specific tasks quickly, fostering innovation and application across various domains. The solution architecture is presented in the following diagram.

For the training infrastructure, we use SageMaker HyperPod for distributed training. SageMaker HyperPod provides a scalable and flexible environment for training and fine-tuning large-scale models. SageMaker HyperPod offers a comprehensive set of features that significantly enhance the efficiency and effectiveness of ML workflows. Its purpose-built infrastructure simplifies distributed training setup and management, allowing flexible scaling from single-GPU experiments to multi-GPU data parallelism and large model parallelism. The service’s shared file system integration with Amazon FSx for Lustre enables seamless data synchronization across worker nodes and Amazon Simple Storage Service (Amazon S3) buckets, while customizable environments allow tailored installations of frameworks and tools.
SageMaker HyperPod integrates with Slurm, a popular open source cluster management and job scheduling system, to provide efficient job scheduling and resource management, enabling parallel experiments and distributed training. The service also enhances productivity through Visual Studio Code connectivity, offering a familiar development environment for code editing, script execution, and Jupyter notebook experimentation. These features collectively enable ML practitioners to focus on model development while using the power of distributed computing for faster training and innovation.
Refer to our GitHub repo for a step-by-step guide on fine-tuning Qwen2-VL-7B-Instruct on SageMaker HyperPod.
We start the data preprocessing using the image input and HTML output. We choose the HTML structure as the output format because it is the most common format for representing tabular data in web applications. It is straightforward to parse and visualize, and it is compatible with most web browsers for rendering on the website for manual review or modification if needed. The data preprocessing is critical for the model to learn the patterns of the expected output format and adapt the visual layout of the table. The following is one example of input image and output HTML as the ground truth.

<table>
  <tr>
    <td></td>
    <td colspan="5">Payments due by period</td>
  </tr>
  <tr>
    <td></td><td>Total</td><td>Less than 1 year</td><td>1-3 years</td><td>3-5 years</td><td>More than 5 years</td>
  </tr>
  <tr>
    <td>Operating Activities:</td><td></td><td></td><td></td><td></td><td></td>
  </tr>
… … …
… … …
 <tr>
    <td>Capital lease obligations<sup> (6)</sup></td><td>48,771</td><td>8,320</td><td>10,521</td><td>7,371</td><td>22,559</td>
  </tr>
  <tr>
    <td>Other<sup> (7) </sup></td><td>72,734</td><td>20,918</td><td>33,236</td><td>16,466</td><td>2,114</td>
  </tr>
  <tr>
<td>Total</td><td>$16,516,866</td><td>$3,037,162</td><td>$5,706,285</td><td>$4,727,135</td><td>$3,046,284</td>
  </tr>
</table>

We then use LLaMA-Factory to fine-tune the Qwen2-VL-7B-Instruct model on the preprocessed data. We use Slurm sbatch to orchestrate the distributed training script; an example of the script is submit_train_multinode.sh. The training script uses QLoRA and data parallel distributed training on SageMaker HyperPod. Following the guidance provided, you can monitor the training log produced by the job to track convergence.
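For reference, a single preprocessed training example in LLaMA-Factory's multimodal conversation format looks roughly like the following; the exact schema, field names, and file paths depend on the LLaMA-Factory version and your dataset configuration, so treat this as an illustrative assumption:

# Illustrative record; schema is an assumption based on LLaMA-Factory's
# multimodal (sharegpt-style) dataset format
record = {
    "messages": [
        {"role": "user", "content": "<image>Convert this table to HTML."},
        {"role": "assistant", "content": "<table><tr><td></td><td colspan=\"5\">Payments due by period</td></tr> ... </table>"},
    ],
    "images": ["data/fintabnet/table_0001.png"],  # hypothetical image path
}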
During inference, we use vLLM for hosting the quantized model, which provides efficient memory management and optimized attention mechanisms for high-throughput inference. vLLM natively supports the Qwen2-VL series model and continues to add support for newer models, making it particularly suitable for large-scale document processing tasks. The deployment process involves applying 4-bit quantization to reduce model size while maintaining accuracy, configuring the vLLM server with optimal parameters for batch processing and memory allocation, and exposing the model through RESTful APIs for quick integration with existing document processing pipelines. For details on model deployment configuration, refer to the hosting script.
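The following client sketch illustrates how a document processing pipeline could call the hosted model through vLLM's OpenAI-compatible chat completions endpoint; the host, port, served model name, and prompt are assumptions to adjust for your deployment:

import base64
import requests

# encode the table image for the multimodal chat request
with open("table_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "Qwen2-VL-7B-Instruct",  # served model name is deployment-specific
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Convert this table to HTML."},
            ],
        }
    ],
    "temperature": 0.0,
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
html_table = response.json()["choices"][0]["message"]["content"]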
Results
Our evaluation focused on the FinTabNet dataset, which contains complex tables from S&P 500 annual reports. This dataset presents unique challenges due to its diverse table structures, including merged cells, hierarchical headers, and varying layouts. The following example demonstrates a financial table and its corresponding model-generated HTML output, rendered in a browser for visual comparison.

For quantitative evaluation, we employed the Tree Edit Distance-based Similarity (TEDS) metric, which assesses both structural and content similarity between generated HTML tables and the ground truth. TEDS is derived from the minimum number of edit operations required to transform one table tree into another (one minus the tree edit distance, normalized by the node count of the larger tree), and TEDS-S focuses specifically on structural similarity. The following table summarizes the results for different models.

Model | TEDS | TEDS-S
Anthropic's Claude 3 Haiku | 69.9 | 76.2
Anthropic's Claude 3.5 Sonnet | 86.4 | 87.1
Qwen2-VL-7B-Instruct (Base) | 23.4 | 25.3
Qwen2-VL-7B-Instruct (Fine-tuned) | 81.1 | 89.7

The evaluation results reveal significant advancements in our fine-tuned model’s performance. Most notably, the Qwen2-VL-7B-Instruct model demonstrated substantial improvements in both content recognition and structural understanding after fine-tuning. When compared to its base version, the model showed enhanced capabilities in accurately interpreting complex table structures and maintaining content fidelity. The fine-tuned version not only surpassed the performance of Anthropic’s Claude 3 Haiku, but also approached the accuracy levels of Anthropic’s Claude 3.5 Sonnet, while maintaining more efficient computational requirements. Particularly impressive was the model’s improved ability to handle intricate table layouts, suggesting a deeper understanding of document structure and organization. These improvements highlight the effectiveness of our fine-tuning approach in adapting the model to specialized financial document processing tasks.
Best practices
Based on our experiments, we identified several key insights and best practices for fine-tuning multimodal table structure recognition models:

Model performance is highly dependent on the quality of fine-tuning data. The closer the fine-tuning data resembles real-world datasets, the better the model performs. Using domain-specific data, we achieved a 5-point improvement in TEDS score with only 10% of the data compared to using general datasets. Notably, fine-tuning doesn’t require massive datasets; we achieved relatively good performance with just a few thousand samples. However, we observed that imbalanced datasets, particularly those lacking sufficient examples of complex elements like long tables and forms with merged cells, can lead to biased performance. Maintaining a balanced distribution of document types during fine-tuning facilitates consistent performance across various formats.
The choice of base model significantly impacts performance. More powerful base models yield better results. In our case, Qwen2-VL’s pre-trained visual and linguistic features provided a strong foundation. By freezing most parameters through QLoRA during the initial fine-tuning stages, we achieved faster convergence and better usage of pre-trained knowledge, especially with limited data. Interestingly, the model’s multilingual capabilities were preserved; fine-tuning on English datasets alone still yielded good performance on Chinese evaluation datasets. This highlights the importance of selecting a compatible base model for optimal performance.
When real-world annotated data is limited, synthetic data generation (using specific document data synthesizers) can be an effective solution. Combining real and synthetic data during fine-tuning helps mitigate out-of-domain issues, particularly for rare or domain-specific text types. This approach proved especially valuable for handling specialized financial terminology and complex document layouts.

Security
Another important aspect of our project involves addressing the security considerations essential when working with sensitive financial documents. As expected in the financial services industry, robust security measures must be incorporated throughout the ML lifecycle. These typically include data security through encryption at rest using AWS Key Management Service (AWS KMS) and in transit using TLS, implementing strict S3 bucket policies with virtual private cloud (VPC) endpoints, and following least-privilege access controls through AWS Identity and Access Management (IAM) roles. For training environments like SageMaker HyperPod, security considerations involve operating within private subnets in dedicated VPCs using the built-in encryption capabilities of SageMaker. Secure model hosting with vLLM requires deployment in private VPC subnets with proper Amazon API Gateway protections and token-based authentication. These security best practices for financial services make sure that sensitive financial information remains protected throughout the entire ML pipeline while enabling innovative document processing solutions in highly regulated environments.
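As a small illustration of the encryption-at-rest practice, the following sketch uploads a document to Amazon S3 with SSE-KMS; the bucket name, object key, and KMS key alias are hypothetical placeholders:

import boto3

s3 = boto3.client("s3")
with open("acme_2024_q1_statement.pdf", "rb") as f:
    s3.put_object(
        Bucket="financial-docs-bucket",          # hypothetical bucket name
        Key="statements/acme_2024_q1.pdf",       # hypothetical object key
        Body=f,
        ServerSideEncryption="aws:kms",          # encrypt at rest with a KMS key
        SSEKMSKeyId="alias/financial-docs-key",  # hypothetical KMS key alias
    )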
Conclusion
Our exploration of multi-modality models for table structure recognition in banking documents has demonstrated significant improvements in both accuracy and efficiency. The fine-tuned Qwen2-VL-7B-Instruct model, trained using LLaMA-Factory on SageMaker HyperPod, has shown remarkable capabilities in handling complex financial tables and diverse document formats. These results highlight how multimodal approaches represent a transformative leap forward from traditional multistage and single modality methods, offering an end-to-end solution for modern document processing challenges.
Furthermore, using LLaMA-Factory on SageMaker HyperPod significantly streamlines the fine-tuning process, making it both more efficient and accessible. The scalable infrastructure of SageMaker HyperPod enables rapid experimentation by allowing seamless scaling of training resources. This capability facilitates faster iteration cycles, enabling researchers and developers to test multiple configurations and optimize model performance more effectively.
Explore our GitHub repository to access the implementation and step-by-step guidance, and begin customizing models for your specific requirements. Whether you’re processing financial statements, KYC documents, or complex reports, we encourage you to evaluate its potential for optimizing your document workflows.

About the Authors
Tony Wong is a Solutions Architect at AWS based in Hong Kong, specializing in financial services. He works with FSI customers, particularly in banking, on digital transformation journeys that address security and regulatory compliance. With an entrepreneurial background and experience as a Solutions Architect Manager at a local system integrator, Tony applies problem management skills in enterprise environments. He holds an M.Sc. from The Chinese University of Hong Kong and is passionate about leveraging new technologies like generative AI to help organizations enhance business capabilities.
Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.
Zhihao Lin is a Deep Learning Architect at the AWS Generative AI Innovation Center. With a Master's degree from Peking University and publications in top conferences like CVPR and IJCAI, he brings extensive AI/ML research experience to his role. At AWS, he focuses on developing generative AI solutions, leveraging cutting-edge technology for innovative applications. He specializes in solving complex computer vision and natural language processing challenges and advancing the practical use of generative AI in business.
Ken Tsui, VP of Machine Learning at Apoidea Group, is a seasoned machine learning engineer with over a decade of experience in applied research and B2B and B2C AI product development. Specializing in language models, computer vision, data curation, synthetic data generation, and distributed training, he also excels in credit scoring and stress-testing. As an active open-source researcher, he contributes to large language model and vision-language model pretraining and post-training datasets.
Edward Tsoi Po Wa is a Senior Data Scientist at Apoidea Group. Passionate about artificial intelligence, he specializes in machine learning, working on projects such as document intelligence systems, large language model R&D, and retrieval-augmented generation applications. Edward drives impactful AI solutions, optimizing systems for industries like banking. He holds a B.S. in Physics from Hong Kong University of Science and Technology. In his spare time, he loves to explore science, mathematics, and philosophy.
Mickey Yip is the Vice President of Product at Apoidea Group, where he utilizes his expertise to spearhead groundbreaking AI and digital transformation initiatives. With extensive experience, Mickey has successfully led complex projects for multinational banks, property management firms, and global corporations, delivering impactful and measurable outcomes. His expertise lies in designing and launching innovative AI SaaS products tailored for the banking sector, significantly improving operational efficiency and enhancing client success.

How Qualtrics built Socrates: An AI platform powered by Amazon SageMak …

This post is co-authored by Jay Kshirsagar and Ronald Quan from Qualtrics. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
Qualtrics, founded in 2002, is a pioneering software company that has spent over two decades creating exceptional frontline experiences, building high-performing teams, and designing products that people love. As the creators and stewards of the Experience Management (XM) category, Qualtrics serves over 20,000 clients globally, bringing humanity, connection, and empathy back to businesses across various industries, including retail, government, and healthcare.

Qualtrics’s comprehensive XM platform enables organizations to consistently understand, measure, and improve the experiences they deliver for customers, employees, and the broader market. With its three core product suites—XM for Customer Experience, XM for Employee Experience, and XM for Research & Strategy—Qualtrics provides actionable insights and purpose-built solutions that empower companies to deliver exceptional experiences.
Qualtrics harnesses the power of generative AI, cutting-edge machine learning (ML), and the latest in natural language processing (NLP) to provide new purpose-built capabilities that are precision-engineered for experience management (XM). These AI capabilities are purpose-built to help organizations of all sizes deeply understand and address the needs of every customer, employee, and stakeholder—driving stronger connections, increased loyalty, and sustainable growth.
In this post, we share how Qualtrics built an AI platform powered by Amazon SageMaker and Amazon Bedrock.
AI at Qualtrics
Qualtrics has a deep history of using advanced ML to power its industry-leading experience management platform. In early 2020, with the push for deep learning and transformer models, Qualtrics created its first enterprise-level ML platform, called Socrates. Built on top of SageMaker, this new platform enabled ML scientists to efficiently build, test, and deliver new AI-powered capabilities for the Qualtrics XM suite. This strong foundation in ML and AI has been a key driver of Qualtrics's innovation in experience management.
Qualtrics AI, a powerful engine that sits at the heart of the company’s XM platform, harnesses the latest advances in ML, NLP, and AI. Trained on Qualtrics’s expansive database of human sentiment and experience data, Qualtrics AI unlocks richer, more personalized connections between organizations and their customers, employees, and stakeholders. Qualtrics’s unwavering commitment to innovation and customer success has solidified its position as the global leader in experience management.

To learn more about how AI is transforming experience management, visit this blog from Qualtrics.
Socrates platform: Powering AI at Qualtrics
Qualtrics AI is powered by a custom-built ML platform, a synergistic suite of tools and services designed to enable a diverse set of Qualtrics personae—researchers, scientists, engineers, and knowledge workers—to harness the transformative power of AI and ML. Qualtrics refers to it internally as the "Socrates" platform. It uses managed AWS services like SageMaker and Amazon Bedrock to enable the entire ML lifecycle. Knowledge workers can source, explore, and analyze Qualtrics data using Socrates's ML workbenches and AI Data Infrastructure. Scientists and researchers are enabled to conduct research, prototype, develop, and train models using a host of SageMaker features. ML engineers can test, productionize, and monitor a heterogeneous set of ML models possessing a wide range of capabilities, inference modes, and production traffic patterns. Partner application teams are provided with an abstracted model inference interface that makes the integration of an ML model into the Qualtrics product a seamless engineering experience. This holistic approach enables internal teams to seamlessly integrate advanced AI and ML capabilities into their workflows and decision-making processes.

Science Workbench
The Socrates Science Workbench, purpose-built for Qualtrics Data and Knowledge Workers, provides a powerful platform for model training and hyperparameter optimization (HPO) with a JupyterLab interface, support for a range of programming languages, and secure, scalable infrastructure through SageMaker integration, giving users the flexibility and reliability to focus on their core ML tasks. Users can take advantage of the robust and reliable infrastructure of SageMaker to maintain the confidentiality and integrity of their data and models, while also taking advantage of the scalability that SageMaker provides to handle even the most demanding ML workloads.
AI Data Infrastructure
Socrates’s AI Data Infrastructure is a comprehensive and cohesive end-to-end ML data ecosystem. It features a secure and scalable data store integrated with the Socrates Science Workbench, enabling users to effortlessly store, manage, and share datasets with capabilities for anonymization, schematization, and aggregation. The AI Data Infrastructure also provides scientists with interfaces for distributed compute, data pulls and enrichment, and ML processing.
AI Playground
The AI Playground is a user-friendly interface that provides Socrates users with direct access to the powerful language models and other generative AI capabilities hosted on the Socrates platform using backend tools like SageMaker Inference, Amazon Bedrock, and OpenAI GPT, allowing them to experiment and rapidly prototype new ideas without extensive coding or technical expertise. By continuously integrating the latest models, the AI Playground empowers Socrates users to stay at the forefront of advancements in large language models (LLMs) and other cutting-edge generative AI technologies, exploring their potential and discovering new ways to drive innovation.
Model deployment for inference
The Socrates platform features a sophisticated model deployment infrastructure that is essential for the scalable implementation of ML and AI models. This infrastructure allows users to host models across the variety of hardware options available for SageMaker endpoints, providing the flexibility to select a deployment environment that optimally meets their specific needs for inference, whether those needs are related to performance optimization, cost-efficiency, or particular hardware requirements.
One of the defining characteristics of the Socrates model deployment infrastructure is its capability to simplify the complexities of model hosting. This allows users to concentrate on the essential task of deploying their models for inference within the larger Socrates ecosystem. Users benefit from an efficient and user-friendly interface that enables them to effortlessly package their models, adjust deployment settings, and prepare them for inference use.
By offering an adaptable model deployment solution, the Socrates platform makes sure ML models created within the system are smoothly integrated into real-world applications and workflows. This integration not only speeds up the transition to production but also maximizes the usage of Qualtrics’s AI-driven features, fostering innovation and providing significant business value to its customers.
Model capacity management
Model capacity management is a critical component that offers efficient and reliable delivery of ML models to Qualtrics users by providing oversight of model access and the allocation of computing resources across multiple consumers. The Socrates team closely monitors resource usage and sets up rate limiting and auto scaling policies, where applicable, to meet the evolving demands of each use case.
Unified GenAI Gateway
The Socrates platform’s Unified GenAI Gateway simplifies and streamlines access to LLMs and embedding models across the Qualtrics ecosystem. The Unified GenAI Gateway is an API that provides a common interface for consumers to interact with all of the platform-supported LLMs and embedding models, regardless of their underlying providers or hosting environments. This means that Socrates users can use the power of cutting-edge language models without having to worry about the complexities of integrating with multiple vendors or managing self-hosted models.
The standout feature of the Unified GenAI Gateway is its centralized integration with inference platforms like SageMaker Inference and Amazon Bedrock, which allows the Socrates team to handle the intricate details of model access, authentication, and attribution on behalf of users. This not only simplifies the user experience but also enables cost attribution and control mechanisms, making sure the consumption of these powerful AI resources is carefully monitored and aligned with specific use cases and billing codes. Furthermore, the Unified GenAI Gateway boasts capabilities like rate-limiting support, making sure the system’s resources are efficiently allocated, and an upcoming semantic caching feature that will further optimize model inference and enhance overall performance.
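To illustrate the idea of a provider-agnostic interface, the following is a minimal, hypothetical sketch of how a consumer might call such a gateway. The endpoint URL, payload fields, and model identifiers are assumptions for illustration only; the actual Unified GenAI Gateway API is internal to Qualtrics:
# Minimal sketch of calling a unified gateway that fronts multiple LLM providers.
# The endpoint, payload fields, and model identifiers are hypothetical.
import requests

GATEWAY_URL = "https://genai-gateway.example.internal/v1/chat"  # hypothetical internal endpoint

def generate(model_id: str, prompt: str, billing_code: str) -> str:
    """Send the same request shape regardless of the underlying provider."""
    response = requests.post(
        GATEWAY_URL,
        json={
            "model": model_id,             # e.g., a Bedrock- or SageMaker-hosted model
            "messages": [{"role": "user", "content": prompt}],
            "billing_code": billing_code,  # enables cost attribution per use case
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["output"]

# The same call works whether the model is hosted on SageMaker, Amazon Bedrock, or a vendor API.
print(generate("example-claude-3-5-sonnet", "Summarize this survey feedback...", "team-xm-research"))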
Managed Inference APIs (powered by SageMaker Inference)
The Socrates Managed Inference APIs provide a comprehensive suite of services that simplify the integration of advanced ML and AI capabilities into Qualtrics applications. This infrastructure, built on top of SageMaker Inference, handles the complexities of model deployment, scaling, and maintenance, boasting a growing catalog of production-ready models.
Managed Inference APIs offer both asynchronous and synchronous modes to accommodate a wide range of application use cases. Importantly, these managed APIs come with guaranteed production-level SLAs, providing reliable performance and cost-efficiency as usage scales. With readily available pre-trained Qualtrics models for inference, the Socrates platform empowers Qualtrics application teams to focus on delivering exceptional user experiences, without the burden of building and maintaining AI infrastructure.
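As an illustration of the synchronous and asynchronous modes described above, the following sketch uses the public boto3 SageMaker runtime APIs directly. The endpoint names and S3 location are placeholders; the Socrates Managed Inference APIs wrap calls like these behind their own interface:
# Illustrative sketch of the two inference modes, using the public boto3 SageMaker runtime APIs.
import boto3

smr = boto3.client("sagemaker-runtime")

# Synchronous inference: the response is returned in the same call.
sync_response = smr.invoke_endpoint(
    EndpointName="example-sentiment-endpoint",   # placeholder endpoint name
    ContentType="application/json",
    Body=b'{"text": "The onboarding experience was fantastic."}',
)
print(sync_response["Body"].read())

# Asynchronous inference: the payload is read from S3 and the result is written back to S3.
async_response = smr.invoke_endpoint_async(
    EndpointName="example-topic-model-endpoint",            # placeholder endpoint name
    InputLocation="s3://example-bucket/requests/batch-001.json",  # placeholder S3 URI
)
print(async_response["OutputLocation"])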
GenAI Orchestration Framework
Socrates’s GenAI Orchestration Framework is a collection of tools and patterns designed to streamline the development and deployment of LLM-powered applications within the Qualtrics ecosystem. The framework consists of tools and components such as:

Socrates Agent Platform, built on top of the LangGraph Platform, which provides a flexible orchestration framework for developing agents as graphs, expediting delivery of agentic features while centralizing core infrastructure and observability components
A GenAI SDK, providing straightforward coding convenience for interacting with LLMs and third-party orchestration packages
Prompt Lifecycle Management Service (PLMS) for maintaining the security and governance of prompts
LLM guardrail tooling, enabling LLM consumers to define the protections they want applied to their model inference
Synchronous and asynchronous inference gateways

These tools all contribute to the overall reliability, scalability, and performance of the LLM-powered applications built upon it. Capabilities of the Socrates AI App Framework are anticipated to grow and evolve alongside the rapid advancements in the field of LLMs. This means that Qualtrics users always have access to the latest and most cutting-edge AI capabilities from generative AI inference platforms like SageMaker Inference and Amazon Bedrock, empowering them to harness the transformative power of these technologies with greater ease and confidence.
Ongoing enhancements to the Socrates platform using SageMaker Inference
As the Socrates platform continues to evolve, Qualtrics is continuously integrating the latest advancements in SageMaker Inference to further enhance the capabilities of their AI-powered ecosystem:

Improved cost, performance, and usability of generative AI inference – One prominent area of focus is the integration of cost and performance optimizations for generative AI inference. The SageMaker Inference team has launched innovative techniques to optimize the use of accelerators, enabling SageMaker Inference to reduce foundation model (FM) deployment costs by 50% on average and latency by 20% on average with inference components. Using this feature, we’re working on achieving significant cost savings and performance improvements for Qualtrics customers running their generative AI workloads on the Socrates platform. In addition, SageMaker has streamlined deployment of open source LLMs and FMs with just three clicks. This user-friendly functionality removes the complexity traditionally associated with deploying these advanced models, empowering more Qualtrics customers to harness the power of generative AI within their workflows and applications.
Improved auto scaling speeds – The SageMaker team has developed an advanced auto scaling capability to better handle the scaling requirements of generative AI models. These improvements significantly reduce scaling times (from multiple minutes to under a minute), cutting auto scaling times by up to 40% and making auto scaling detection up to six times faster for Meta Llama 3 8B, enabling Socrates users to rapidly scale their generative AI workloads on SageMaker to meet spikes in demand without compromising performance.
Straightforward deployment of self-managed OSS LLMs – A new SageMaker Inference capability provides a more streamlined and intuitive process for packaging generative AI models, reducing the technical complexity traditionally associated with this task. This, in turn, empowers a wider range of Socrates users, including application teams and subject matter experts, to use the transformative power of these cutting-edge AI technologies within their workflows and decision-making processes.
Generative AI inference optimization toolkit – Qualtrics is also actively using the latest advancements in the SageMaker Inference optimization toolkit within the Socrates platform, which offers two times higher throughput while reducing costs by up to 50% for generative AI inference. By integrating these capabilities, Socrates is working to lower the cost of generative AI inference. This breakthrough is particularly impactful for Qualtrics’s customers, who rely on the Socrates platform to power AI-driven applications and experiences.

“By seamlessly integrating SageMaker Inference into our Socrates platform, we’re able to deliver inference advancements in AI to our global customer base. The generative AI inference capabilities in SageMaker, like inference components, faster auto scaling, easy LLM deployment, and the optimization toolkit, have been a game changer for Qualtrics, helping us reduce the cost and improve the performance of our generative AI workloads. The level of sophistication and ease of use that SageMaker Inference brings to the table is remarkable.”
– James Argyropoulos, Sr AI/ML Engineer at Qualtrics.

Partnership with SageMaker Inference
Since adopting SageMaker Inference, the Qualtrics Socrates team has been a key collaborator in the development of AI capabilities in SageMaker Inference. Building on its expertise serving Socrates users, Qualtrics has worked closely with the SageMaker Inference team to enhance and expand the platform’s generative AI functionalities. From the early stages of generative AI, they offered invaluable insights and expertise to the SageMaker team. This has enabled the introduction of several new features and optimizations that have strengthened the platform’s generative AI offerings, including:

Cost and performance optimizations for generative AI inference – Qualtrics helped the SageMaker Inference team build a new inference capability for SageMaker Inference to reduce FM deployment costs by 50% on average and latency by 20% on average with inference components. This feature delivers significant cost savings and performance improvements for customers running generative AI inference on SageMaker.
Faster auto scaling for generative AI inference – Qualtrics helped the SageMaker team develop faster auto scaling for generative AI inference. These improvements have reduced auto scaling times by up to 40% for models like Meta Llama 3 and made auto scaling detection up to six times faster. With this, generative AI inference can scale with changing traffic without compromising performance.
Inference optimization toolkit for generative AI inference – Qualtrics has been instrumental in giving feedback for AWS to launch the inference optimization toolkit, which increases throughput by up to two times and reduces latency by 50%.
Launch of multi-model endpoint (MME) support for GPU – MMEs allow customers to reduce inference costs by up to 90%. Qualtrics was instrumental in helping AWS with the launch of this feature by providing valuable feedback.
Launch of asynchronous inference – Qualtrics was a launch partner for asynchronous inference and has played a key role in helping AWS improve the offering to give customers optimal price-performance.

The partnership between Qualtrics and the SageMaker Inference team has been instrumental in advancing the state-of-the-art in generative AI within the AWS ecosystem. Qualtrics’s deep domain knowledge and technical proficiency have played a crucial role in shaping the evolution of this rapidly developing field on SageMaker Inference.

“Our partnership with the SageMaker Inference product team has been instrumental in delivering incredible performance and cost benefits for Socrates platform consumers running AI Inference workloads. By working hand in hand with the SageMaker team, we’ve been able to introduce game changing optimizations that have reduced AI inference costs multiple folds for some of our use cases. We look forward to continued innovation through valuable partnership to improve state-of-the-art AI inference capabilities.”
–  Jay Kshirsagar, Senior Manager, Machine Learning

Conclusion
The Socrates platform underscores Qualtrics’s commitment to advancing innovation in experience management by flawlessly integrating advanced AI and ML technologies. Thanks to a strong partnership with the SageMaker Inference team, the platform has seen enhancements that boost performance, reduce costs, and increase the accessibility of AI-driven features within the Qualtrics XM suite. As AI technology continues to develop rapidly, the Socrates platform is geared to empower Qualtrics’s AI teams to innovate and deliver exceptional customer experiences.

About the Authors
Jay Kshirsagar is a seasoned ML leader driving GenAI innovation and scalable AI infrastructure at Qualtrics. He has built high-impact ML teams and delivered enterprise-grade LLM solutions that power key product features.
Ronald Quan is a Staff Engineering Manager for the Data Intelligence Platform team within Qualtrics. The team’s charter is to enable, expedite and evolve AI and Agentic developments on the Socrates platform. He focuses on the team’s technical roadmap and strategic alignment with the business needs.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Micheal Nguyen is a Senior Startup Solutions Architect at AWS, specializing in using AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.
Ranga Malaviarachchi is a Sr. Customer Solutions Manager in the ISV Strategic Accounts organization at AWS. He has been closely associated with Qualtrics over the past 4 years in supporting their AI initiatives. Ranga holds a BS in Computer Science and Engineering and an MBA from Imperial College London.

Vxceed secures transport operations with Amazon Bedrock

Vxceed delivers SaaS solutions across industries such as consumer packaged goods (CPG), transportation, and logistics. Its modular environments include Lighthouse for CPG demand and supply chains, GroundCentric247 for airline and airport operations, and LimoConnect247 and FleetConnect247 for passenger transport. These solutions support a wide range of customers, including government agencies in Australia and New Zealand.
In 2024, Vxceed launched a strategy to integrate generative AI into its solutions, aiming to enhance customer experiences and boost operational efficiency. As part of this initiative, Vxceed developed LimoConnect Q using Amazon Bedrock and AWS Lambda. This solution enables efficient document searching, simplifies trip booking, and enhances operational decisions while maintaining data security and protection.
The challenge: Balancing innovation with security
Vxceed’s customers include government agencies responsible for transporting high-profile individuals, such as judiciary members and senior officials. These agencies require highly secure systems that adhere to standards like the Information Security Registered Assessors Program (IRAP), which the Australian government uses to assess security posture.
Government agencies and large corporations that handle secure ground transportation face a unique challenge: providing seamless, efficient, and secure operations while adhering to strict regulatory requirements. Vxceed Technologies, a software-as-a-service (SaaS) provider specializing in ground transportation and resource planning, recognized an opportunity to enhance its LimoConnect solution with generative AI. Vxceed initially explored various AI solutions but faced a critical hurdle: verifying that customer data remained within their dedicated private environments. Existing AI offerings often processed data externally, posing security risks that their clients could not accept.
Vxceed needed AI capabilities that could function within a highly controlled environment, helping to ensure complete data privacy while enhancing operational efficiency.
This challenge led Vxceed to use Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API.
LimoConnect Q solution overview and implementation highlights
To address the challenges of secure, efficient, and intelligent ground transportation management, Vxceed developed LimoConnect Q, an AI-powered solution. LimoConnect Q’s architecture uses Amazon Bedrock, Amazon API Gateway, Amazon DynamoDB, and AWS Lambda to create a secure, scalable AI-powered transportation management system. The solution implements a multi-agent architecture, shown in the following figure, where each component operates within the customer’s private AWS environment, maintaining data security and scalability while supporting intuitive user interactions.

Figure 1 – Vxceed’s LimoConnect Q architecture

Let’s dive further into each component in this architecture:
Conversational trip booking with intelligent orchestration using Amazon Bedrock Agents
Beyond document queries, LimoConnect Q revolutionizes trip booking by replacing traditional forms and emails with a conversational AI-driven process. Users can state their trip requirements in natural language. Key features include:

Natural language: Processes natural language booking requests based on travel context and preferences, for example:

Schedule airport pickup for dignitaries at 9 AM tomorrow to the conference center.
Book airport to my office transfer next Monday at 10 AM.

Automated data retrieval and processing: LimoConnect Q integrates with multiple data sources to:

Validate pickup and drop-off locations using geolocation services
Automate address geocoding and external API lookups to verify accurate bookings
Verify vehicle and driver eligibility through Amazon Bedrock Agents
Retrieve relevant trip details from past bookings and preferences

Seamless booking execution: After the request is processed, LimoConnect Q automatically:

Confirms the trip
Provides personalized recommendations based on booking history
Sends real-time booking updates and notifies relevant personnel (for example, drivers and dispatch teams)

This conversational approach minimizes manual processing, reduces booking errors, and enhances user convenience—especially for busy professionals who need a fast, frictionless way to arrange transportation.
Secure RAG for policy and document querying using Amazon Bedrock Knowledge Bases
One of the most critical functionalities of LimoConnect Q is the ability to query policy documents, procedural manuals, and operational guidelines in natural language. Traditionally, accessing such information required manual searches or expert assistance, creating inefficiencies—especially when expert staff aren’t available.
Vxceed addressed these challenges by implementing a Retrieval Augmented Generation (RAG) framework. This system generates responses that align with policies, incorporate relevant facts, and consider context. The solution delivers the ability to:

Query documents in natural language: Instead of searching manually, users can ask questions like What is the protocol for VIP pickup at the airport?
Restrict AI-generated responses based on RAG: Use RAG to make sure that answers are pulled only from approved, up-to-date documents, maintaining security and compliance.
Keep sensitive data within the customer’s environment: LimoConnect Q maintains data privacy and compliance by keeping queries within the customer’s private AWS environment, providing end-to-end security.

This capability significantly improves operational efficiency, allowing users to get instant, reliable answers instead of relying on manual lookups or expert availability.
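As a rough sketch of this pattern, the following example queries an Amazon Bedrock knowledge base with the RetrieveAndGenerate API, assuming a knowledge base has already been created over the policy documents. The knowledge base ID and model ARN are placeholders and are not taken from LimoConnect Q:
# Minimal RAG query against an Amazon Bedrock knowledge base (IDs below are placeholders).
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is the protocol for VIP pickup at the airport?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0",
        },
    },
)
# The answer is grounded only in the indexed, approved documents.
print(response["output"]["text"])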
Multi-agent AI architecture for secure orchestration
Vxceed built a multi-agent AI system on Lambda to manage LimoConnect Q’s transportation workflows. The architecture comprises agents that handle dispatch, routing, and scheduling tasks while maintaining security and scalability.

Intent recognition agent: Determines whether a user request pertains to document retrieval, trip booking, or other functions.
Document retrieval agent: Handles policy queries using RAG-based retrieval.
Trip booking agent: Processes user inputs, extracting key information such as pickup and drop-off locations, time, vehicle type, passenger count, and special requests. It verifies that booking information is provided, including name, contact details, and trip preferences. The agent validates addresses using geolocation APIs for accuracy before proceeding. The agent then checks vehicle and driver availability by querying the fleet management database, retrieving real-time data on approved resources. It also interacts with a user preference database, using vector-based search to suggest personalized options.
Flight information validation agent: Verifies flight schedules.
Trip duplication agent: Checks for previously booked trips with similar details to help avoid duplicate bookings.
Return trip agent: Analyzes past trips and preferences to recommend suitable return options, considering real-time vehicle availability and driver schedules.
Data validation agent: Verifies security policy compliance.
External API agent: Integrates with third-party services such as geolocation services, scheduling interfaces, and transportation databases, providing real-time data updates for optimized trip coordination.
Booking retrieval agent: Helps users retrieve existing bookings or cancel them, querying the backend database for current and past trips.

After validation, LimoConnect Q uses Lambda functions and Amazon Bedrock integrated APIs to process bookings, update databases, and manage notifications to drivers and dispatch teams. The modular architecture enables Vxceed to seamlessly add new features like driver certification tracking and compliance automation.
Built with security at its core, LimoConnect Q uses Lambda for efficient handling of query spikes while implementing robust memory isolation mechanisms. Each user session maintains temporary memory for contextual conversations without permanent storage, and strict access controls ensure session-specific data isolation, preventing cross-contamination of sensitive information. This architecture adheres to the stringent security requirements of government and enterprise customers while maintaining operational efficiency.
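The following is a simplified, illustrative sketch of the intent-routing pattern described above, using the Amazon Bedrock Converse API inside a Lambda handler. The model ID, intent labels, and handlers are assumptions for illustration and do not reflect Vxceed’s implementation:
# Simplified sketch of intent routing with the Amazon Bedrock Converse API in an AWS Lambda handler.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # example model ID

def classify_intent(user_message: str) -> str:
    """Ask the model to label the request so it can be routed to the right agent."""
    prompt = (
        "Classify this request as one of: document_query, trip_booking, booking_retrieval.\n"
        f"Request: {user_message}\nAnswer with the label only."
    )
    result = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return result["output"]["message"]["content"][0]["text"].strip()

def lambda_handler(event, context):
    message = event["message"]
    intent = classify_intent(message)
    # Dispatch to a specialized agent; each handler would itself call Bedrock, RAG, or backend APIs.
    handlers = {
        "document_query": lambda m: {"route": "document_retrieval_agent"},
        "trip_booking": lambda m: {"route": "trip_booking_agent"},
        "booking_retrieval": lambda m: {"route": "booking_retrieval_agent"},
    }
    handler = handlers.get(intent, lambda m: {"route": "fallback"})
    return {"statusCode": 200, "body": json.dumps(handler(message))}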
Using LimoConnect Q, customers have saved an average of 15 minutes per query, increased first-call resolution rates by 80 percent, and cut onboarding and training time by 50 percent.
Guardrails
LimoConnect Q uses Amazon Bedrock Guardrails to maintain professional, focused interactions. The system uses denied topics and word filters to help prevent unrelated discussions and unprofessional language, making sure that conversations remain centered on transportation needs. These guardrails constrain the system’s responses to travel-specific intents, maintaining consistent professionalism across user interactions. By implementing these controls, Vxceed makes sure that this AI solution delivers reliable, business-appropriate responses that align with their customers’ high standards for secure transportation services.
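As an illustration, the following sketch defines a guardrail with a denied topic and word filters through the Amazon Bedrock CreateGuardrail API. The guardrail name, topic definition, and messaging are examples only, not Vxceed’s actual configuration:
# Sketch of creating a guardrail with a denied topic and word filters (values are illustrative).
import boto3

bedrock = boto3.client("bedrock")

guardrail = bedrock.create_guardrail(
    name="transport-assistant-guardrail",  # placeholder name
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "off-topic-discussions",
                "definition": "Any subject unrelated to ground transportation, bookings, or travel policy.",
                "type": "DENY",
            }
        ]
    },
    wordPolicyConfig={
        "wordsConfig": [{"text": "example-blocked-word"}],   # placeholder word filter
        "managedWordListsConfig": [{"type": "PROFANITY"}],   # built-in profanity filter
    },
    blockedInputMessaging="I can only help with transportation-related requests.",
    blockedOutputsMessaging="I can only help with transportation-related requests.",
)
print(guardrail["guardrailId"])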
AI-powered tools for ground transportation optimization
LimoConnect Q also incorporates custom AI tools to enhance accuracy and automation across various transportation tasks:

Address geocoding and validation: AI-powered location services verify pickup and drop-off addresses, reducing errors and maintaining accurate scheduling.
Automated trip matching: The system analyzes historical booking data and user preferences to recommend the most suitable vehicle options.
Role-based access control: AI-driven security protocols enforce policies on vehicle assignments based on user roles and clearance levels.

These enhancements streamline operations, reduce manual intervention, and provide a frictionless user experience for secure transportation providers, government agencies and large enterprises.
Why Vxceed chose Amazon Bedrock
Vxceed selected Amazon Bedrock over other AI solutions because of four key advantages:

Enterprise-grade security and privacy: Amazon Bedrock provides private, encrypted AI environments that keep data within the customer’s virtual private cloud (VPC), maintaining compliance with strict security requirements.
Seamless AWS integration: LimoConnect Q runs on Vxceed’s existing AWS infrastructure, minimizing migration effort and allowing end-to-end control over data and operations.
Access to multiple AI models: Amazon Bedrock supports various FMs, allowing Vxceed to experiment and optimize performance across different use cases. Vxceed uses Anthropic’s Claude 3.5 Sonnet for its ability to handle sophisticated conversational interactions and complex language processing tasks.
Robust AI development tools: Vxceed accelerated development by using Amazon Bedrock Knowledge Bases, prompt engineering libraries and agent frameworks for efficient AI orchestration.

Business impact and future outlook
The introduction of LimoConnect Q has already demonstrated significant operational improvements, enhancing both efficiency and user experience for Vxceed’s customers, including secure transportation providers, government agencies, and enterprise clients.

Faster information retrieval: AI-driven document querying reduces lookup times by 15 minutes per query, ensuring quick access to critical policies.
Streamlined trip booking: 97% of bookings now happen digitally, removing manual workflows and enabling faster confirmations.
Enhanced security and compliance: AI processing remains within a private AWS environment, adhering to strict government security standards such as IRAP.

Beyond government customers, the success of LimoConnect Q powered by Amazon Bedrock has drawn strong interest from private sector transportation providers, including large fleet operators managing up to 7,000 trips per month. The ability to automate booking workflows, improve compliance tracking, and provide secure AI-driven assistance has positioned Vxceed as a leader in AI-powered ground transportation solutions.
Summary
AWS partnered with Vxceed to support their AI strategy, resulting in the development of LimoConnect Q, an innovative ground transportation management solution. Using AWS services including Amazon Bedrock and Lambda, Vxceed successfully built a secure, AI-powered solution that streamlines trip booking and document processing. Looking ahead, Vxceed plans to further refine LimoConnect Q by:

Optimizing AI inference costs to improve scalability and cost-effectiveness.
Enhancing AI guardrails to help prevent hallucinations and improve response reliability.
Developing advanced automation features, such as driver certification tracking and compliance auditing.

With this collaboration, Vxceed is poised to revolutionize ground transportation management, delivering secure, efficient, and AI-powered solutions for government agencies, enterprises, and private transportation providers alike.
If you are interested in implementing a similar AI-powered solution, start by understanding how to implement asynchronous AI agents using Amazon Bedrock. See Creating asynchronous AI agents with Amazon Bedrock to learn about the implementation patterns for multi-agent systems and develop secure, AI-powered solutions for your organization.

About the Authors
Deepika Kumar is a Solution Architect at AWS. She has over 13 years of experience in the technology industry and has helped enterprises and SaaS organizations build and securely deploy their workloads on the cloud. She is passionate about using generative AI responsibly, whether that is driving product innovation, boosting productivity, or enhancing customer experiences.
Cyril Ovely, CTO and co-founder of Vxceed Software Solutions, leads the company’s SaaS-based logistics solutions for CPG brands. With 33 years of experience, including 22 years at Vxceed, he previously worked in analytical and process control instrumentation. An engineer by training, Cyril architects Vxceed’s SaaS offerings and drives innovation from his base in Auckland, New Zealand.
Santosh Shenoy is a software architect at Vxceed Software Solutions. He has a strong focus on system design and cloud-native development. He specializes in building scalable enterprise applications using modern technologies, microservices, and AWS services, including Amazon Bedrock for AI-driven solutions.

Coding Agents See 75% Surge: SimilarWeb’s AI Usage Report Highlights …

As generative AI continues to redefine digital workflows across industries, SimilarWeb’s ‘AI Global Report: Global Sector Trends on Generative AI’ (ending May 9, 2025) offers a comprehensive snapshot of shifting user engagement patterns. The data-driven report highlights notable growth in coding agents, disruptive impacts on EdTech, and an unexpected downturn in Legal AI platforms. Here are five findings that stand out from the report’s multi-sectoral analysis.

1. AI-Powered Coding Tools Witness Sustained Momentum

Among the highest-performing categories, DevOps & Code Completion tools recorded a 75% year-over-year (YoY) increase in traffic. This growth reflects rising developer adoption of AI assistants that support code generation, error detection, and workflow automation.

Two platforms stood out: Lovable, with a remarkable 207% YoY growth, and Cursor, which registered a 62% increase. These tools are gaining traction due to their ability to integrate seamlessly with IDEs and DevOps pipelines, reducing cognitive overhead for developers and enhancing velocity in iterative software engineering environments.

2. General-Purpose LLMs Disrupt Traditional EdTech Models

Chat-based LLMs such as OpenAI’s ChatGPT, DeepSeek, and Grok have emerged as central tools for self-directed learning and on-demand tutoring. DeepSeek, in particular, experienced exponential traffic spikes—at one point surpassing 17,000% YoY growth—before moderating in May.

This surge coincides with a marked decline in traditional education platforms. The EdTech category experienced a 24% YoY traffic drop, with legacy players Chegg and CourseHero posting sharp declines of -62% and -68%, respectively. The data suggests that LLMs are effectively displacing static repositories with conversational, real-time educational support—especially for STEM and writing tasks.

3. Legal AI Tools Enter a Downturn Amid Usage Fatigue

In contrast to the buoyancy of coding tools, the Legal AI segment faced a significant contraction, with a 73% YoY drop in traffic. This decline may reflect saturation in a niche market where generative AI’s value proposition—contract summarization, legal drafting, compliance automation—has yet to fully mature into robust, enterprise-grade deployments.

The data implies that while early interest in legal AI was strong, retention and continued usage remain challenges. Legal practitioners may be holding off on broader adoption until tools demonstrate better alignment with real-world legal reasoning, jurisdictional nuance, and auditability requirements.

4. Video Generation Tools Deliver Mixed Signals

The Video Generation sector showed only a -5% YoY change overall, but this average masks notable platform-specific variances. Kling.ai and RunwayML saw traffic declines of 5% and 15%, while Heygen recorded a 25% increase—likely attributable to its focus on solving specific commercial use cases such as synthetic avatars for business communications.

This divergence underlines a broader trend: video synthesis platforms that do not address a clear market need or lack intuitive UI/UX are struggling to retain user interest. In contrast, those aligned with enterprise storytelling or content automation are seeing more durable engagement.

5. Freelance Platforms Feel the Pressure of AI Automation

The report also highlights a 17% YoY decline in traffic to Digital Freelance platforms. Fiverr and Upwork were particularly affected, down 15% and 19%, respectively. The underlying driver appears to be generative AI’s growing ability to automate traditionally freelance-driven tasks—copywriting, basic design, SEO analysis, and transcription—thus shifting demand away from manual labor.

The freelance economy may be entering a transition phase where success depends on human-AI collaboration. Freelancers who adapt by offering AI-enhanced services or specialize in domains requiring nuanced judgment may find new opportunities as others contract.

Conclusion

SimilarWeb’s ‘AI Global Report: Global Sector Trends on Generative AI’ reveals a bifurcation in generative AI adoption: platforms that address domain-specific challenges with measurable productivity gains—especially in development and operations—are thriving. In contrast, tools that either lack differentiation or have not yet demonstrated practical reliability are witnessing attrition.

As AI continues to integrate more deeply into professional toolchains, user engagement is increasingly driven by clarity of purpose and return on investment. This is not just a technological shift—it’s a redefinition of digital productivity landscapes across sectors.

Download the report. All credit for this research goes to the researchers of this project.
The post Coding Agents See 75% Surge: SimilarWeb’s AI Usage Report Highlights the Sectors Winning and Losing in 2025’s Generative AI Boom appeared first on MarkTechPost.

Google DeepMind Introduces AlphaEvolve: A Gemini-Powered Coding AI Age …

Algorithm design and scientific discovery often demand a meticulous cycle of exploration, hypothesis testing, refinement, and validation. Traditionally, these processes rely heavily on expert intuition and manual iteration, particularly for problems rooted in combinatorics, optimization, and mathematical construction. While large language models (LLMs) have recently demonstrated promise in accelerating code generation and problem solving, their ability to autonomously generate provably correct and computationally superior algorithms remains limited—especially when solutions must generalize across diverse use cases or deliver production-grade performance.

Google DeepMind Introduces AlphaEvolve

To address these limitations, Google DeepMind has unveiled AlphaEvolve, a next-generation coding agent powered by Gemini 2.0 LLMs. AlphaEvolve is designed to automate the process of algorithm discovery using a novel fusion of large-scale language models, automated program evaluation, and evolutionary computation. Unlike conventional code assistants, AlphaEvolve autonomously rewrites and improves algorithmic code by learning from a structured feedback loop—iteratively proposing, evaluating, and evolving new candidate solutions over time.

AlphaEvolve orchestrates a pipeline where LLMs generate program mutations informed by previous high-performing solutions, while automated evaluators assign performance scores. These scores drive a continual refinement process. AlphaEvolve builds on prior systems like FunSearch but extends their scope dramatically—handling full codebases in multiple languages and optimizing for multiple objectives simultaneously.

System Architecture and Technical Advantages

The architecture of AlphaEvolve combines multiple components into an asynchronous and distributed system:

Prompt Construction: A sampler assembles prompts using previous high-scoring solutions, mathematical context, or code structure.

LLM Ensemble: A hybrid of Gemini 2.0 Pro and Gemini 2.0 Flash enables a balance between high-quality insight and rapid idea exploration.

Evaluation Framework: Custom scoring functions are used to systematically assess algorithmic performance based on predefined metrics, enabling transparent and scalable comparison.

Evolutionary Loop: AlphaEvolve maintains a database of prior programs and performance data, which it uses to inform new generations of code, balancing exploration and exploitation.

A key technical strength lies in AlphaEvolve’s flexibility. It can evolve complete programs, support multi-objective optimization, and adapt to different problem abstractions—whether evolving constructor functions, search heuristics, or entire optimization pipelines. This capability is particularly useful for problems where progress is machine-measurable, such as matrix multiplication or data center scheduling.
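To make the loop concrete, the following is hedged pseudocode of the generate-evaluate-evolve cycle described above. It is not DeepMind’s implementation; llm_propose_mutation and evaluate_program are hypothetical stand-ins for the Gemini-backed sampler and the automated evaluation framework:
# Hedged pseudocode of an evolutionary program-search loop; callables passed in are hypothetical.
import random

def evolve(initial_program, evaluate_program, llm_propose_mutation, generations=100):
    # Program database: (score, program) pairs that seed future prompts.
    population = [(evaluate_program(initial_program), initial_program)]
    for _ in range(generations):
        # Prompt construction: take a few high-scoring parents as context for the LLM.
        parents = [prog for _, prog in sorted(population, key=lambda p: p[0], reverse=True)[:3]]
        # The LLM ensemble proposes a mutated program informed by the best solutions so far.
        child = llm_propose_mutation(parents)
        # The automated evaluator assigns a machine-measurable score (e.g., correctness and speed).
        population.append((evaluate_program(child), child))
        # Keep the database bounded, mixing exploitation (top scorers) with exploration (random survivors).
        top = sorted(population, key=lambda p: p[0], reverse=True)[:50]
        explorers = random.sample(population, min(5, len(population)))
        population = top + explorers
    best_score, best_program = max(population, key=lambda p: p[0])
    return best_program, best_score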

Results and Real-World Applications

AlphaEvolve has demonstrated robust performance across theoretical and applied domains:

Matrix Multiplication: AlphaEvolve discovered 14 new low-rank algorithms for matrix multiplication. Most notably, it found a method to multiply 4×4 complex matrices using 48 scalar multiplications—surpassing the long-standing 49-multiplication bound set by Strassen’s algorithm in 1969 (applying Strassen’s 2×2 scheme recursively to a 4×4 product requires 7 × 7 = 49 multiplications).

Mathematical Discovery: Applied to over 50 mathematical problems—including the Erdős minimum overlap problem and the kissing number problem in 11 dimensions—AlphaEvolve matched existing state-of-the-art constructions in ~75% of cases and outperformed them in ~20%, all while requiring minimal expert handcrafting.

Infrastructure Optimization at Google:

Data Center Scheduling: AlphaEvolve generated a scheduling heuristic that improved resource efficiency across Google’s global compute fleet, reclaiming 0.7% of stranded compute capacity—equivalent to hundreds of thousands of machines.

Kernel Engineering for Gemini: Optimized tiling heuristics yielded a 23% speedup for matrix multiplication kernels, reducing overall Gemini training time by 1%.

Hardware Design: AlphaEvolve proposed Verilog-level optimizations to TPU arithmetic circuits, contributing to area and power reductions without compromising correctness.

Compiler-Level Optimization: By modifying compiler-generated XLA intermediate representations for attention kernels, AlphaEvolve delivered a 32% performance improvement in FlashAttention execution.

These results underscore AlphaEvolve’s generality and impact—successfully discovering novel algorithms and deploying them in production-grade environments.

Conclusion

AlphaEvolve represents a significant leap forward in AI-assisted scientific and algorithmic discovery. By integrating Gemini-powered LLMs with evolutionary search and automated evaluation, AlphaEvolve transcends the limitations of prior systems—offering a scalable, general-purpose engine capable of uncovering high-performing, verifiably correct algorithms across diverse domains.

Its deployment within Google’s infrastructure—and its ability to improve upon both theoretical bounds and real-world systems—suggests a future where AI agents do not merely assist in software development but actively contribute to scientific advancement and system optimization.

Check out the Paper and Official Release. All credit for this research goes to the researchers of this project.
The post Google DeepMind Introduces AlphaEvolve: A Gemini-Powered Coding AI Agent for Algorithm Discovery and Scientific Optimization appeared first on MarkTechPost.

Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice A …

The field of Voice AI is evolving toward more representative and adaptable systems. While many existing models have been trained on carefully curated, studio-recorded audio, Rime is pursuing a different direction: building foundational voice models that reflect how people actually speak. Its two latest releases, Arcana and Rimecaster, are designed to offer practical tools for developers seeking greater realism, flexibility, and transparency in voice applications.

Arcana: A General-Purpose Text-to-Speech Model

Arcana is a spoken language text-to-speech (TTS) model optimized for extracting semantic, prosodic, and expressive features from speech. While Rimecaster focuses on identifying who is speaking, Arcana is oriented toward understanding how something is said—capturing delivery, rhythm, and emotional tone.

The model supports a variety of use cases, including:

Voice agents for businesses across IVR, support, outbound, and more

Expressive text-to-speech synthesis for creative applications

Dialogue systems that require speaker-aware interaction

Arcana is trained on a diverse range of conversational data collected in natural settings. This allows it to generalize across speaking styles, accents, and languages, and to perform reliably in complex audio environments, such as real-time interaction.

Arcana also captures speech elements that are typically overlooked—such as breathing, laughter, and speech disfluencies—helping systems to process voice input in a way that mirrors human understanding.

Rime also offers another TTS model optimized for high-volume, business-critical applications. Mist v2 enables efficient deployment on edge devices at extremely low latency without sacrificing quality. Its design blends acoustic and linguistic features, resulting in embeddings that are both compact and expressive.

Rimecaster: Capturing Natural Speaker Representation

Rimecaster is an open source speaker representation model developed to help train voice AI models, like Arcana and Mist v2. It moves beyond performance-oriented datasets, such as audiobooks or scripted podcasts. Instead, it is trained on full-duplex, multilingual conversations featuring everyday speakers. This approach allows the model to account for the variability and nuances of unscripted speech—such as hesitations, accent shifts, and conversational overlap.

Technically, Rimecaster transforms a voice sample into a vector embedding that represents speaker-specific characteristics like tone, pitch, rhythm, and vocal style. These embeddings are useful in a range of applications, including speaker verification, voice adaptation, and expressive TTS.

Key design elements of Rimecaster include:

Training Data: The model is built on a large dataset of natural conversations across languages and speaking contexts, enabling improved generalization and robustness in noisy or overlapping speech environments.

Model Architecture: Based on NVIDIA’s Titanet, Rimecaster produces four times denser speaker embeddings, supporting fine-grained speaker identification and better downstream performance.

Open Integration: It is compatible with Hugging Face and NVIDIA NeMo, allowing researchers and engineers to integrate it into training and inference pipelines with minimal friction.

Licensing: Released under an open source CC-by-4.0 license, Rimecaster supports open research and collaborative development.

By training on speech that reflects real-world use, Rimecaster enables systems to distinguish among speakers more reliably and deliver voice outputs that are less constrained by performance-driven data assumptions.
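As a simple illustration of how such embeddings might be used for speaker verification, the sketch below compares two embeddings with cosine similarity. The embed() call is a hypothetical stand-in for whatever interface Rimecaster exposes through Hugging Face or NVIDIA NeMo; only the similarity comparison itself is generic:
# Illustrative speaker-verification check over embeddings; embed() is a hypothetical model call.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(embedding_a: np.ndarray, embedding_b: np.ndarray, threshold: float = 0.7) -> bool:
    # Embeddings from the same speaker should be closer than embeddings from different speakers;
    # the threshold would be tuned on held-out verification data.
    return cosine_similarity(embedding_a, embedding_b) >= threshold

# emb_enroll = embed("enrollment_sample.wav")   # hypothetical Rimecaster call
# emb_query = embed("incoming_call.wav")        # hypothetical Rimecaster call
# print(same_speaker(emb_enroll, emb_query))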

Realism and Modularity as Design Priorities

Rime’s recent updates align with its core technical principles: model realism, diversity of data, and modular system design. Rather than pursuing monolithic voice solutions trained on narrow datasets, Rime is building a stack of components that can be adapted to a wide range of speech contexts and applications.

Integration and Practical Use in Production Systems

Arcana and Mist v2 are designed with real-time applications in mind. Both support:

Streaming and low-latency inference

Compatibility with conversational AI stacks and telephony systems

They improve the naturalness of synthesized speech and enable personalization in dialogue agents. Because of their modularity, these tools can be integrated without significant changes to existing infrastructure.

For example, Arcana can help synthesize speech that retains the tone and rhythm of the original speaker in a multilingual customer service setting.

Conclusion

Rime’s voice AI models offer an incremental yet important step toward building voice AI systems that reflect the true complexity of human speech. Their grounding in real-world data and modular architecture make them suitable for developers and builders working across speech-related domains.

Rather than prioritizing uniform clarity at the expense of nuance, these models embrace the diversity inherent in natural language. In doing so, Rime is contributing tools that can support more accessible, realistic, and context-aware voice technologies.

Sources: 

https://www.rime.ai/blog/introducing-arcana/

https://www.rime.ai/blog/introducing-rimecaster/

https://www.rime.ai/blog/introducing-our-new-brand

Thanks to the Rime team for the thought leadership and resources for this article. The Rime team has sponsored this content.
The post Rime Introduces Arcana and Rimecaster (Open Source): Practical Voice AI Tools Built on Real-World Speech appeared first on MarkTechPost.

Cost-effective AI image generation with PixArt-Σ inference on AWS Tra …

PixArt-Sigma is a diffusion transformer model that is capable of image generation at 4k resolution. This model shows significant improvements over previous generation PixArt models like Pixart-Alpha and other diffusion models through dataset and architectural improvements. AWS Trainium and AWS Inferentia are purpose-built AI chips to accelerate machine learning (ML) workloads, making them ideal for cost-effective deployment of large generative models. By using these AI chips, you can achieve optimal performance and efficiency when running inference with diffusion transformer models like PixArt-Sigma.
This post is the first in a series where we will run multiple diffusion transformers on Trainium and Inferentia-powered instances. In this post, we show how you can deploy PixArt-Sigma to Trainium and Inferentia-powered instances.
Solution overview
The steps outlined below will be used to deploy the PixArt-Sigma model on AWS Trainium and run inference on it to generate high-quality images.

Step 1 – Prerequisites and setup
Step 2 – Download and compile the PixArt-Sigma model for AWS Trainium
Step 3 – Deploy the model on AWS Trainium to generate images

Step 1 – Prerequisites and setup
To get started, you will need to set up a development environment on a trn1, trn2, or inf2 host. Complete the following steps:

Launch a trn1.32xlarge or trn2.48xlarge instance with a Neuron DLAMI. For instructions on how to get started, refer to Get Started with Neuron on Ubuntu 22 with Neuron Multi-Framework DLAMI.
Launch a Jupyter Notebook server. For instructions to set up a Jupyter server, refer to the following user guide.
Clone the aws-neuron-samples GitHub repository:

git clone https://github.com/aws-neuron/aws-neuron-samples.git

Navigate to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook:

cd aws-neuron-samples/torch-neuronx/inference

The provided example script is designed to run on a Trn2 instance, but you can adapt it for Trn1 or Inf2 instances with minimal modifications. Specifically, within the notebook and in each of the component files under the neuron_pixart_sigma directory, you will find commented-out changes to accommodate Trn1 or Inf2 configurations.
Step 2 – Download and compile the PixArt-Sigma model for AWS Trainium
This section provides a step-by-step guide to compiling PixArt-Sigma for AWS Trainium.
Download the model
You will find a helper function in cache_hf_model.py in the above-mentioned GitHub repository that shows how to download the PixArt-Sigma model from Hugging Face. If you are using PixArt-Sigma in your own workload and opt not to use the script included in this post, you can use the huggingface-cli to download the model instead.
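For example, a possible huggingface-cli invocation is the following; the model ID matches the one used later in this post, and the cache directory is an assumption you can adjust for your environment:
huggingface-cli download PixArt-alpha/PixArt-Sigma-XL-2-1024-MS --cache-dir pixart_sigma_hf_cache_dir_1024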
The Neuron PixArt-Sigma implementation contains a few scripts and classes. The various files and scripts are broken down as follows:
├── compile_latency_optimized.sh # Full Model Compilation script for Latency Optimized
├── compile_throughput_optimized.sh # Full Model Compilation script for Throughput Optimized
├── hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb # Notebook to run Latency Optimized Pixart-Sigma
├── hf_pretrained_pixart_sigma_1k_throughput_optimized.ipynb # Notebook to run Throughput Optimized Pixart-Sigma
├── neuron_pixart_sigma
│ ├── cache_hf_model.py # Model downloading Script
│ ├── compile_decoder.py # VAE Decoder Compilation Script and Wrapper Class
│ ├── compile_text_encoder.py # Text Encoder Compilation Script and Wrapper Class
│ ├── compile_transformer_latency_optimized.py # Latency Optimized Transformer Compilation Script and Wrapper Class
│ ├── compile_transformer_throughput_optimized.py # Throughput Optimized Transformer Compilation Script and Wrapper Class
│ ├── neuron_commons.py # Base Classes and Attention Implementation
│ └── neuron_parallel_utils.py # Sharded Attention Implementation
└── requirements.txt
This notebook will help you to download the model, compile the individual component models, and invoke the generation pipeline to generate an image. Although the notebooks can be run as a standalone sample, the next few sections of this post will walk through the key implementation details within the component files and scripts to support running PixArt-Sigma on Neuron.

Sharding PixArt linear layers

For each component of PixArt (T5, Transformer, and VAE), the example uses Neuron-specific wrapper classes. These wrapper classes serve two purposes. The first is that they allow us to trace the models for compilation:
import torch.nn as nn
from transformers import T5EncoderModel

class InferenceTextEncoderWrapper(nn.Module):
    def __init__(self, dtype, t: T5EncoderModel, seqlen: int):
        super().__init__()
        self.dtype = dtype
        self.device = t.device
        self.t = t

    def forward(self, text_input_ids, attention_mask=None):
        # Return the encoder's last hidden state, cast to the target dtype for compilation.
        return [self.t(text_input_ids, attention_mask)['last_hidden_state'].to(self.dtype)]

Please refer to the neuron_commons.py file for all wrapper modules and classes.
The second reason for using wrapper classes is to modify the attention implementation to run on Neuron. Because diffusion models like PixArt are typically compute-bound, you can improve performance by sharding the attention layer across multiple devices. To do this, you replace the linear layers with NeuronX Distributed’s RowParallelLinear and ColumnParallelLinear layers:
# Import paths may vary slightly by neuronx-distributed version;
# get_sharded_data is defined alongside this function in neuron_parallel_utils.py.
from neuronx_distributed.parallel_layers.layers import ColumnParallelLinear, RowParallelLinear
from transformers.models.t5.modeling_t5 import T5Attention

def shard_t5_self_attention(tp_degree: int, selfAttention: T5Attention):
    # Recompute head bookkeeping so each tensor-parallel rank owns n_heads // tp_degree heads.
    orig_inner_dim = selfAttention.q.out_features
    dim_head = orig_inner_dim // selfAttention.n_heads
    original_nheads = selfAttention.n_heads
    selfAttention.n_heads = selfAttention.n_heads // tp_degree
    selfAttention.inner_dim = dim_head * selfAttention.n_heads

    # Q, K, and V projections are sharded column-wise; each rank keeps its slice of the weights.
    orig_q = selfAttention.q
    selfAttention.q = ColumnParallelLinear(
        selfAttention.q.in_features,
        selfAttention.q.out_features,
        bias=False,
        gather_output=False)
    selfAttention.q.weight.data = get_sharded_data(orig_q.weight.data, 0)
    del(orig_q)
    orig_k = selfAttention.k
    selfAttention.k = ColumnParallelLinear(
        selfAttention.k.in_features,
        selfAttention.k.out_features,
        bias=(selfAttention.k.bias is not None),
        gather_output=False)
    selfAttention.k.weight.data = get_sharded_data(orig_k.weight.data, 0)
    del(orig_k)
    orig_v = selfAttention.v
    selfAttention.v = ColumnParallelLinear(
        selfAttention.v.in_features,
        selfAttention.v.out_features,
        bias=(selfAttention.v.bias is not None),
        gather_output=False)
    selfAttention.v.weight.data = get_sharded_data(orig_v.weight.data, 0)
    del(orig_v)

    # The output projection is sharded row-wise so partial results are combined across ranks.
    orig_out = selfAttention.o
    selfAttention.o = RowParallelLinear(
        selfAttention.o.in_features,
        selfAttention.o.out_features,
        bias=(selfAttention.o.bias is not None),
        input_is_parallel=True)
    selfAttention.o.weight.data = get_sharded_data(orig_out.weight.data, 1)
    del(orig_out)
    return selfAttention

Please refer to the neuron_parallel_utils.py file for more details on parallel attention.
Compile individual sub-models
The PixArt-Sigma model is composed of three components. Each component is compiled so the entire generation pipeline can run on Neuron:

Text encoder – A 4-billion-parameter encoder, which translates a human-readable prompt into an embedding. In the text encoder, the attention layers are sharded, along with the feed-forward layers, with tensor parallelism.
Denoising transformer model – A 700-million-parameter transformer, which iteratively denoises a latent (a numerical representation of a compressed image). In the transformer, the attention layers are sharded, along with the feed-forward layers, with tensor parallelism.
Decoder – A VAE decoder that converts our denoiser-generated latent to an output image. For the decoder, the model is deployed with data parallelism.

Now that the model definition is ready, you need to trace a model to run it on Trainium or Inferentia. You can see how to use the trace() function to compile the decoder component model for PixArt in the following code block:
compiled_decoder = torch_neuronx.trace(
    decoder,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/decoder",
    compiler_args=compiler_flags,
    inline_weights_to_neff=False
)

Please refer to the compile_decoder.py file for more on how to instantiate and compile the decoder.
To run models with tensor parallelism, a technique used to split a tensor into chunks across multiple NeuronCores, you need to trace with a pre-specified tp_degree. This tp_degree specifies the number of NeuronCores to shard the model across. It then uses the parallel_model_trace API to compile the encoder and transformer component models for PixArt:
compiled_text_encoder = neuronx_distributed.trace.parallel_model_trace(
    get_text_encoder_f,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/text_encoder",
    compiler_args=compiler_flags,
    tp_degree=tp_degree,
)

Please refer to the compile_text_encoder.py file for more details on tracing the encoder with tensor parallelism.
Lastly, you trace the transformer model with tensor parallelism:
compiled_transformer = neuronx_distributed.trace.parallel_model_trace(
    get_transformer_model_f,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/transformer",
    compiler_args=compiler_flags,
    tp_degree=tp_degree,
    inline_weights_to_neff=False,
)

Please refer to the compile_transformer_latency_optimized.py file for more details on tracing the transformer with tensor parallelism.
You will use the compile_latency_optimized.sh script to compile all three models as described in this post, so these functions will be run automatically when you run through the notebook.
Step 3 – Deploy the model on AWS Trainium to generate images
This section will walk us through the steps to run inference on PixArt-Sigma on AWS Trainium.
Create a diffusers pipeline object
The Hugging Face diffusers library provides pre-trained diffusion models and includes model-specific pipelines that bundle the components (independently trained models, schedulers, and processors) needed to run a diffusion model. The PixArtSigmaPipeline is specific to the PixArt-Sigma model, and is instantiated as follows:
pipe: PixArtSigmaPipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    torch_dtype=torch.bfloat16,
    local_files_only=True,
    cache_dir="pixart_sigma_hf_cache_dir_1024")

Please refer to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook for details on pipeline execution.
Load compiled component models into the generation pipeline
After each component model has been compiled, load it into the overall generation pipeline for image generation. The VAE model is loaded with data parallelism, which parallelizes image generation across a batch or multiple images per prompt. For more details, refer to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook.
vae_decoder_wrapper.model = torch_neuronx.DataParallel(
    torch.jit.load(decoder_model_path), [0, 1, 2, 3], False
)

text_encoder_wrapper.t = neuronx_distributed.trace.parallel_model_load(
    text_encoder_model_path
)
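
The tensor-parallel transformer is loaded the same way with parallel_model_load. A sketch is shown below; the wrapper attribute and path variable follow the notebook's conventions and are assumptions here:

# Assumption: the transformer wrapper exposes its inner model the same way as
# the text encoder wrapper; adjust the attribute name to match the notebook.
transformer_wrapper.t = neuronx_distributed.trace.parallel_model_load(
    transformer_model_path
)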

Finally, the loaded models are added to the generation pipeline:
pipe.text_encoder = text_encoder_wrapper
pipe.transformer = transformer_wrapper
pipe.vae.decoder = vae_decoder_wrapper
pipe.vae.post_quant_conv = vae_post_quant_conv_wrapper

Compose a prompt
Now that the model is ready, you can write a prompt to convey what kind of image you want generated. When creating a prompt, you should always be as specific as possible. Use a positive prompt to convey what you want in the new image, including a subject, action, style, and location, and a negative prompt to indicate features that should be removed.
For example, you can use the following positive and negative prompts to generate a photo of an astronaut riding a horse on Mars without mountains:
# Subject: astronaut
# Action: riding a horse
# Location: Mars
# Style: photo
prompt = "a photo of an astronaut riding a horse on mars"
negative_prompt = "mountains"

Feel free to edit the prompt in your notebook using prompt engineering to generate an image of your choosing.
Generate an image
To generate an image, you pass the prompt to the PixArt model pipeline, and then save the generated image for later reference:
# pipe: variable holding the PixArt generation pipeline with each of
# the compiled component models
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_images_per_prompt=1,
    height=1024,  # number of pixels
    width=1024,  # number of pixels
    num_inference_steps=25  # number of passes through the denoising model
).images

for idx, img in enumerate(images):
    img.save(f"image_{idx}.png")

Cleanup
To avoid incurring additional costs, stop your EC2 instance using either the AWS Management Console or AWS Command Line Interface (AWS CLI).
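If you prefer to do this programmatically, the AWS SDK for Python (Boto3) can stop the instance as well; the following is a minimal example, where the Region and instance ID are placeholders:

import boto3

# Stop the instance used for compilation and inference (replace the placeholders)
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])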
Conclusion
In this post, we walked through how to deploy PixArt-Sigma, a state-of-the-art diffusion transformer, on Trainium instances. This post is the first in a series focused on running diffusion transformers for different generation tasks on Neuron. To learn more about running diffusion transformers models with Neuron, refer to Diffusion Transformers.

About the Authors
Achintya Pinninti is a Solutions Architect at Amazon Web Services. He supports public sector customers, enabling them to achieve their objectives using the cloud. He specializes in building data and machine learning solutions to solve complex problems.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.
Sadaf Rasool is a Solutions Architect in Annapurna Labs at AWS. Sadaf collaborates with customers to design machine learning solutions that address their critical business challenges. He helps customers train and deploy machine learning models leveraging AWS Trainium or AWS Inferentia chips to accelerate their innovation journey.
John Gray is a Solutions Architect in Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build a scalable prototype using AWS AI chips.

Customize DeepSeek-R1 671b model using Amazon SageMaker HyperPod recip …

This post is the second part of the DeepSeek series focusing on model customization with Amazon SageMaker HyperPod recipes (or recipes for brevity). In Part 1, we demonstrated the performance and ease of fine-tuning DeepSeek-R1 distilled models using these recipes. In this post, we use the recipes to fine-tune the original DeepSeek-R1 671b parameter model. We demonstrate this through the step-by-step implementation of these recipes using both SageMaker training jobs and SageMaker HyperPod.
Business use case
After its public release, the DeepSeek-R1 model, developed by DeepSeek AI, showed impressive results across multiple evaluation benchmarks. The model follows the Mixture of Experts (MoE) architecture and has 671 billion parameters. Traditionally, large models adapt well to a wide spectrum of generalized tasks by virtue of being trained on huge amounts of data. The DeepSeek-R1 model was trained on 14.8 trillion tokens. The original R1 model demonstrates strong few-shot and zero-shot learning capabilities, allowing it to generalize to new tasks and scenarios that weren’t part of its original training.
However, many customers prefer to either fine-tune or run continuous pre-training of these models to adapt them to their specific business applications or to optimize them for specific tasks. A financial organization might want to customize the model with its own data to assist with data processing tasks, or a hospital network might fine-tune it with patient records to act as a medical assistant for its doctors. Fine-tuning can also extend the model’s generalization ability. Customers can fine-tune it with a corpus of text in languages that aren’t fully represented in the original training data. For example, a model fine-tuned with an additional trillion tokens of Hindi text can extend the same generalization capabilities to Hindi.
The decision on which model to fine-tune depends on the end application as well as the available dataset. Based on the volume of proprietary data, customers can decide to fine-tune the larger DeepSeek-R1 model instead of one of the distilled versions. In addition, the R1 models have their own set of guardrails; customers might want to fine-tune the model to update or expand on those guardrails.
Fine-tuning larger models like DeepSeek-R1 requires careful optimization to balance cost, deployment requirements, and performance effectiveness. To achieve optimal results, organizations must meticulously select an appropriate environment, determine the best hyperparameters, and implement efficient model sharding strategies.
Solution architecture
SageMaker HyperPod recipes effectively address these requirements by providing a carefully curated mix of distributed training techniques, optimizations, and configurations for state-of-the-art (SOTA) open source models. These recipes have undergone extensive benchmarking, testing, and validation to provide seamless integration with the SageMaker training and fine-tuning processes.
In this post, we explore solutions that demonstrate how to fine-tune the DeepSeek-R1 model using these recipes on either SageMaker HyperPod or SageMaker training jobs. Your choice between these services will depend on your specific requirements and preferences. If you require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. SageMaker training jobs, on the other hand, are tailored for organizations that want a fully managed experience for their training workflows. To learn more about these service features, refer to Generative AI foundation model training on Amazon SageMaker.
The following diagram illustrates the solution architecture for training using SageMaker HyperPod. With HyperPod, users can begin the process by connecting to the login/head node of the Slurm cluster. Each step is run as a Slurm job and uses Amazon FSx for Lustre for storing model checkpoints. For DeepSeek-R1, the process consists of the following steps:

Download the DeepSeek-R1 model and convert weights from FP8 to BF16 format
Load the model into memory and perform fine-tuning using Quantized Low-Rank Adaptation (QLoRA)
Merge QLoRA adapters with the base model
Convert and load the model for batch evaluation

The following diagram illustrates the solution architecture for SageMaker training jobs. You can execute each step in the training pipeline by initiating the process through the SageMaker control plane using APIs, AWS Command Line Interface (AWS CLI), or the SageMaker ModelTrainer SDK. In response, SageMaker launches training jobs with the requested number and type of compute instances to run specific tasks. For DeepSeek-R1, the process consists of three main steps:

Download and convert R1 to BF16 datatype format
Load the model into memory and perform fine-tuning
Consolidate and load the checkpoints into memory, then run inference and metrics to evaluate performance improvements

Prerequisites
Complete the following prerequisites before running the DeepSeek-R1 671B model fine-tuning notebook:

Make the following quota increase requests for SageMaker. You need to request a minimum of two and a maximum of four ml.p5.48xlarge instances (each with 8 x NVIDIA H100 GPUs), depending on the time-to-train and cost-to-train trade-offs for your use case. On the Service Quotas console, request the following SageMaker quotas. It can take up to 24 hours for the quota increase to be approved:

P5 instances (ml.p5.48xlarge) for training job usage: 2–4
P5 instances (ml.p5.48xlarge) for HyperPod clusters (ml.p5.48xlarge for cluster usage): 2–4

If you choose to use HyperPod clusters to run your training, set up a HyperPod Slurm cluster, referring to Amazon SageMaker HyperPod Developer Guide. Alternatively, you can also use the AWS CloudFormation template provided in the Own Account workshop and follow the instructions to set up a cluster and a development environment to access and submit jobs to the cluster.
(Optional) If you choose to use SageMaker training jobs, you can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role (You can use JupyterLab in your local setup too).

Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess, AmazonFSxFullAccess, and AmazonS3FullAccess to give the necessary access to SageMaker to run the examples.

Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:

git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
cd 18_sagemaker_training_recipes/ft_deepseek_r1_qlora

Solution walkthrough
To perform the solution, follow the steps in the next sections.
Technical considerations
The default weights provided by the DeepSeek team on their official R1 repository are of type FP8. However, we chose to disable FP8 in our recipes because we empirically found that training with BF16 enhances generalization across diverse datasets with minimal changes to the recipe hyperparameters. Therefore, to achieve stable fine-tuning for a model of 671b parameter size, we recommend first converting the model from FP8 to BF16 using the fp8_cast_bf16.py command-line script provided by DeepSeek. Executing this script will copy over the converted BF16 weights in Safetensor format to the specified output directory. Remember to copy over the model’s config.yaml to the output directory so the weights are loaded accurately. These steps are encapsulated in a prologue script and are documented step-by-step under the Fine-tuning section.
Customers can use a sequence length of 8K for training, as tested on p5.48xlarge instances, each equipped with eight NVIDIA H100 GPUs. You can also choose a smaller sequence length if needed. Training with a sequence length greater than 8K might lead to out-of-memory issues with GPUs. Also, converting model weights from FP8 to BF16 requires a p5.48xlarge instance, which is also recommended for training due to the model’s high host memory requirements during initialization.
Customers must upgrade their transformers version to transformers==4.48.2 to run the training.
Fine-tuning
Run the finetune_deepseek_r1_671_qlora.ipynb notebook to fine-tune the DeepSeek-R1 model using QLoRA on SageMaker.
Prepare the dataset
This section covers loading the FreedomIntelligence/medical-o1-reasoning-SFT dataset, tokenizing and chunking the dataset, and configuring the data channels for SageMaker training on Amazon Simple Storage Service (Amazon S3). Complete the following steps:

Format the dataset by applying the prompt format for DeepSeek-R1:

def generate_prompt(data_point):
    full_prompt = f"""
Below is an instruction that describes a task, paired with an input
that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{data_point['Question']}

### Response:
{data_point['Complex_CoT']}

"""
    return {"prompt": full_prompt.strip()}

Load the FreedomIntelligence/medical-o1-reasoning-SFT dataset and split it into training and validation datasets:

# Load dataset from the hub
train_set = load_dataset(dataset_name, 'en', split="train[5%:]")
test_set = load_dataset(dataset_name, 'en', split="train[:5%]")

train_dataset = train_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)

test_dataset = test_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)

Load the DeepSeek-R1 tokenizer from the Hugging Face Transformers library and generate tokens for the train and validation datasets. We use the original sequence length of 8K:

model_id = "deepseek-ai/DeepSeek-R1"
max_seq_length = 8096

# Initialize a tokenizer by loading a pre-trained tokenizer configuration, using the fast tokenizer implementation if available.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

train_dataset = train_dataset.map(tokenize, remove_columns=["prompt"])
test_dataset = test_dataset.map(tokenize, remove_columns=["prompt"])
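
The map calls above reference two helpers, generate_and_tokenize_prompt and tokenize, that the notebook defines elsewhere. A minimal sketch of what they might look like follows; the names and behavior are assumptions based on the surrounding code:

# Hypothetical helpers; the notebook ships its own implementations.
def generate_and_tokenize_prompt(data_point):
    # Apply the DeepSeek-R1 prompt template defined in generate_prompt above.
    return generate_prompt(data_point)

def tokenize(data_point):
    # Tokenize the formatted prompt, truncating to the configured sequence length.
    return tokenizer(
        data_point["prompt"],
        truncation=True,
        max_length=max_seq_length,
    )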

Prepare the training and validation datasets for SageMaker training by saving them as arrow files, required by SageMaker HyperPod recipes, and constructing the S3 paths where these files will be uploaded. This dataset will be used in both SageMaker training jobs and SageMaker HyperPod examples:

train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test"

train_dataset.save_to_disk(train_dataset_s3_path)
val_dataset.save_to_disk(val_dataset_s3_path)

The next section describes how to run a fine-tuning example with SageMaker training jobs.
Option A: Fine-tune using SageMaker training jobs
Follow these high-level steps:

Download DeepSeek-R1 to the FSx for Lustre mounted directory
Convert DeepSeek-R1 from FP8 to BF16
Fine-tune the DeepSeek-R1 model
Merge the trained adapter with the base model

Define a utility function to create the ModelTrainer class for every step of the SageMaker training jobs pipeline:

# Creates and executes a model training job using SageMaker
def create_model_trainer(
    use_recipes: bool,
    compute: dict,
    network: dict,
    data_channel: dict,
    action: str,
    hyperparameters: dict = {},
    source_code: str = None,
    training_recipe: str = None,
    recipe_overrides: str = None,
    image_uri: str = None
) -> ModelTrainer:

Download DeepSeek-R1 to the FSx for Lustre mounted directory
Follow these steps:

Select the instance type, Amazon FSx data channel, network configuration for the training job, and source code, then define the ModelTrainer class to run the training job on the ml.c5.18xlarge instance to download DeepSeek-R1 from the Hugging Face Hub:

# Create compute instance
compute = ComputeCreator.create(
    instance_type="ml.c5.18xlarge",
    instance_count=1
)

# Create FSx data channel
data_channel = FSxDataChannelCreator.create_channel(
    directory_path=fsx_mount_point
)

# Create network configuration
network = NetworkConfigCreator.create_network_config(network_config)

# Set up source code configuration
source_code = SourceCode(
    source_dir="scripts",
    entry_script="download.py"
)

# Create model trainer
model_trainer = create_model_trainer(
    compute=compute,
    network=network,
    data_channel=data_channel,
    action="download",
    source_code=source_code
)

Initiate the training by calling the train function of the ModelTrainer class:

model_trainer.train(input_data_config=[data_channel], wait=True)

Convert DeepSeek-R1 from FP8 to BF16
Use ModelTrainer to convert the downloaded DeepSeek-R1 model weights from FP8 to BF16 format for optimal PEFT training. We use the convert.sh script to run the conversion on an ml.p5.48xlarge instance, as shown in the following compute configuration.
Use the SageMaker training warm pool configuration to retain and reuse the provisioned infrastructure after the completion of the model download training job in the previous step:

# Define constants
FSX_MODELDIR_BF16 = "deepseek-r1-bf16"
FSX_DIR_PATH = f"{fsx_mount_point}/{fsx_dir_basemodel}"

# Create compute instance
compute = ComputeCreator.create(
    instance_type="ml.p5.48xlarge",
    instance_count=1
)

# Set up source code configuration
source_code = SourceCode(
    source_dir="scripts",
    entry_script="convert.sh"
)

# Create model trainer for conversion
model_trainer = create_model_trainer(
    ..
    action="convert",
)
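
Warm pools are enabled through a keep-alive setting on the training compute. For illustration only, with the classic SageMaker Python SDK Estimator the setting looks like the following; the values are placeholders, and the workshop's ComputeCreator helper presumably passes an equivalent option to ModelTrainer:

from sagemaker.estimator import Estimator

# Illustration only: keep_alive_period_in_seconds keeps the provisioned instances
# warm after the job completes so the next job can reuse them (warm pool).
estimator = Estimator(
    image_uri="<training-image-uri>",   # placeholder
    role="<execution-role-arn>",        # placeholder
    instance_type="ml.p5.48xlarge",
    instance_count=1,
    keep_alive_period_in_seconds=1800,
)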

Fine-tune the DeepSeek-R1 model
The next phase involves fine-tuning the DeepSeek-R1 model using two ml.p5.48xlarge instances, using distributed training. You implement this through the SageMaker recipe hf_deepseek_r1_671b_seq8k_gpu_qlora, which incorporates the QLoRA methodology. QLoRA makes the large language model (LLM) trainable on limited compute by quantizing the base model to 4-bit precision while using small, trainable low-rank adapters for fine-tuning, dramatically reducing memory requirements without sacrificing model quality:

# Create compute configuration with P5 instances
compute = ComputeCreator.create(
    instance_type="ml.p5.48xlarge",
    instance_count=2
)

# Create model trainer for fine-tuning
model_trainer = create_model_trainer(
    use_recipes=True,

    action="finetune",
    training_recipe='fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora',
    recipe_overrides=recipe_overrides
)

Initiate the training job to fine-tune the model. SageMaker training jobs will provision two P5 instances, orchestrate the SageMaker model parallel container smdistributed-modelparallel:2.4.1-gpu-py311-cu121, and execute the recipe to fine-tune DeepSeek-R1 with the QLoRA strategy on an ephemeral cluster:

model_trainer.train(input_data_config=[data_channel], wait=True)
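
To make the QLoRA idea concrete, the following is a minimal, illustrative sketch using the Hugging Face peft and bitsandbytes libraries. It is not what the recipe executes internally (the recipe shards the 671B model across nodes for you), and the model ID and hyperparameter values are placeholders:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NF4 precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "<base-model-id>",  # placeholder; DeepSeek-R1 671B is too large to load this way
    quantization_config=bnb_config,
)

# Attach small trainable low-rank adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable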

Merge the trained adapter with the base model
Merge the trained adapters with the base model so it can be used for inference:

# Create compute configuration with P5 instance
compute = ComputeCreator.create(
    instance_type="ml.p5.48xlarge",
    instance_count=1
)

# Configure source code location and entry point
source_code = SourceCode(
    source_dir="scripts",
    entry_script="cli-inference.sh"
)

# Create model trainer for adapter merging
model_trainer = create_model_trainer(
    use_recipes=False,

    action="mergeadapter",
    source_code=source_code,
)

The next section shows how you can run similar steps on HyperPod to run your generative AI workloads.
Option B: Fine-tune using SageMaker HyperPod with Slurm
To fine-tune the model using HyperPod, make sure that your cluster is up and ready by following the prerequisites mentioned earlier. To access the login/head node of the HyperPod Slurm cluster from your development environment, follow the login instructions at SSH into Cluster in the workshop.
Alternatively, you can also use AWS Systems Manager and run a command such as the following to start the session. You can find the cluster ID, instance group name, and instance ID on the Amazon SageMaker console.

aws ssm start-session --target sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id] --region region_name

When you’re on the cluster’s login/head node, run the following commands to set up the environment. Run sudo su - ubuntu to run the remaining commands as the ubuntu user, unless you have a specific user ID to access the cluster and your POSIX user is created through a lifecycle script on the cluster. Refer to the multi-user setup for more details.

# create a virtual environment
python3 -m venv ${PWD}/venv
source venv/bin/activate

# clone the recipes repository and set up the environment
git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt

Create a squash file using Enroot to run the job on the cluster. Enroot runtime offers GPU acceleration, rootless container support, and seamless integration with HPC environments, making it ideal for running workflows securely.

# create a squash file using Enroot
REGION=<region>
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}

After you’ve created the squash file, update the recipes_collection/config.yaml file with the absolute path to the squash file (created in the preceding step), and update the instance_type if needed. The final config file should have the following parameters:

cluster_type: slurm

instance_type: p5.48xlarge

container: /fsx/<path-to-smdistributed-modelparallel>.sqsh

Also update the file recipes_collection/cluster/slurm.yaml to add container_mounts pointing to the FSx for Lustre file system used in your cluster.
Follow these high-level steps to set up, fine-tune, and evaluate the model using HyperPod recipes:

Download the model and convert weights to BF16
Fine-tune the model using QLoRA
Merge the trained model adapter
Evaluate the fine-tuned model

Download the model and convert weights to BF16
Download the DeepSeek-R1 model from the Hugging Face Hub and convert the model weights from FP8 to BF16. This conversion is required to use QLoRA for fine-tuning. Copy and execute the following bash script:

#!/bin/bash
start=$(date +%s)
# install git lfs and download the model from huggingface
sudo apt-get install git-lfs
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1 \
  && cd DeepSeek-R1 && git config lfs.concurrenttransfers $(nproc) && git lfs pull
end=$(date +%s)
echo "Time taken to download model: $((end - start)) seconds"

start=$(date +%s)
# convert the model weights from fp8 to bf16
source venv/bin/activate
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference && pip install -r requirements.txt && \
  wget https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/blob/main/src/hyperpod_nemo_adapter/scripts/fp8_cast_bf16.py && \
  python fp8_cast_bf16.py --input-fp8-hf-path ./DeepSeek-R1 --output-bf16-hf-path ./DeepSeek-R1-bf16
end=$(date +%s)
echo "Time taken to convert model to BF16: $((end - start)) seconds"

Fine-tune the model using QLoRA
Download the prepared dataset that you uploaded to Amazon S3 into your FSx for Lustre volume attached to the cluster.

Enter the following commands to download the files from Amazon S3:

aws s3 cp s3://{bucket_name}/{input_path}/train /fsx/ubuntu/deepseek/data/train --recursive
aws s3 cp s3://{bucket_name}/{input_path}/test /fsx/ubuntu/deepseek/data/test --recursive

Update the launcher script to fine-tune the DeepSeek-R1 671B model. The launcher scripts serve as convenient wrappers for executing the training script, main.py, simplifying the process of fine-tuning and parameter adjustment. For fine-tuning the DeepSeek-R1 671B model, you can find the specific script at:

launcher_scripts/deepseek/run_hf_deepseek_r1_671b_seq8k_gpu_qlora.sh

Before running the script, you need to modify the location of the training and validation files, update the HuggingFace model ID, and optionally the access token for private models and datasets. The script should look like the following (update recipes.trainer.num_nodes if you’re using a multi-node cluster):

#!/bin/bash

# Original Copyright (c), NVIDIA CORPORATION. Modifications © Amazon.com

# Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="/fsx/ubuntu/deepseek/DeepSeek-R1-bf16" # Path to the BF16 converted model

TRAIN_DIR="/fsx/ubuntu/deepseek/data/train" # Location of training dataset
VAL_DIR="/fsx/ubuntu/deepseek/data/test" # Location of validation dataset

EXP_DIR="/fsx/ubuntu/deepseek/checkpoints" # Location to save experiment info including logging, checkpoints, etc.

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
  recipes=fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora \
  base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
  recipes.run.name="hf-deepseek-r1-671b-seq8k-gpu-qlora" \
  recipes.exp_manager.exp_dir="$EXP_DIR" \
  recipes.trainer.num_nodes=2 \
  recipes.model.train_batch_size=1 \
  recipes.model.data.train_dir="$TRAIN_DIR" \
  recipes.model.data.val_dir="$VAL_DIR" \
  recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH"

You can view the recipe for this fine-tuning task under recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_671b_seq8k_gpu_qlora.yaml and override additional parameters as needed.

Submit the job by running the launcher script:

bash launcher_scripts/deepseek/run_hf_deepseek_r1_671b_seq8k_gpu_qlora.sh

Monitor the job using Slurm commands such as squeue and scontrol show to view the status of the job and the corresponding logs. The logs can be found in the results folder in the launch directory. When the job is complete, the model adapters are stored in the EXP_DIR that you defined in the launcher script. The structure of the directory should look like this:

ls -R
.:
checkpoints experiment result.json

./checkpoints:
peft_sharded

./checkpoints/peft_sharded:
step_50

./checkpoints/peft_sharded/step_50:
README.md adapter_config.json adapter_model.safetensors tp0_ep0

The trained adapter weights are stored as checkpoints under ./checkpoints/peft_sharded/step_N. You will later merge these with the base model.
Merge the trained model adapter
Follow these steps:

Run a job using the smdistributed-modelparallel enroot image to merge the adapter with the base model.

Download the merge_peft_checkpoint.py code from the sagemaker-hyperpod-training-adapter-for-nemo repository and store it in Amazon FSx. Modify the export variables in the following script to reflect the paths for SOURCE_DIR, ADAPTER_PATH, BASE_MODEL_BF16, and MERGE_MODEL_PATH.

#!/bin/bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#SBATCH --nodes=1 # number of nodes to use
#SBATCH --job-name=deepseek_merge_adapter # name of your job
#SBATCH --exclusive # job has exclusive use of the resource, no sharing
#SBATCH --wait-all-nodes=1

set -ex;
export SOURCE_DIR=/fsx/path_to_merge_code # (folder containing merge_peft_checkpoint.py)
export ADAPTER_PATH=/fsx/path_to_adapter # (from previous step)
export BASE_MODEL_BF16=/fsx/path_to_base # (BF16 model from step 1)
export MERGE_MODEL_PATH=/fsx/path_to_merged_model

# default variables for mounting local paths to container
: "${IMAGE:=$(pwd)/smdistributed-modelparallel.sqsh}"
: "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}" # this is needed for validating that it's a HyperPod cluster
: "${ADAPTER_PATH_1:=$ADAPTER_PATH:$ADAPTER_PATH}"
: "${BASE_MODEL_BF16_1:=$BASE_MODEL_BF16:$BASE_MODEL_BF16}"
: "${MERGE_MODEL_PATH_1:=$MERGE_MODEL_PATH:$MERGE_MODEL_PATH}"
: "${SOURCE_DIR_1:=$SOURCE_DIR:$SOURCE_DIR}"
############

declare -a ARGS=(
  --container-image $IMAGE
  --container-mounts $HYPERPOD_PATH,$ADAPTER_PATH_1,$BASE_MODEL_BF16_1,$MERGE_MODEL_PATH_1,$SOURCE_DIR_1
)

# Merge adapter with base model.
srun -l "${ARGS[@]}" python $SOURCE_DIR/merge_peft_checkpoint.py \
  --hf_model_name_or_path $BASE_MODEL_BF16 \
  --peft_adapter_checkpoint_path $ADAPTER_PATH \
  --output_model_path $MERGE_MODEL_PATH \
  --deepseek_v3 true

Evaluate the fine-tuned model
Use the basic testing scripts provided by DeepSeek to deploy the merged model.

Start by cloning their repo:

git clone https://github.com/deepseek-ai/DeepSeek-V3.git

cd DeepSeek-V3/inference
pip install -r requirements.txt

You need to convert the merged model to a specific format for running inference. In this case, you need four P5 instances to deploy the model because the merged model is in BF16. Enter the following command to convert the model:

python convert.py --hf-ckpt-path /fsx/ubuntu/deepseek/DeepSeek-V3-Base/ \
  --save-path /fsx/ubuntu/deepseek/DeepSeek-V3-Demo --n-experts 256 \
  --model-parallel 32

When the conversion is complete, use the following sbatch script to run the batch inference, making the following adjustments:

Update the ckpt-path to the converted model path from the previous step.
Create a new prompts.txt file with each line containing a prompt; the job reads the prompts from this file and generates output (see the example after the script below).

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --job-name=deepseek_671b_inference
#SBATCH --output=deepseek_671b_%j.out

# Set environment variables
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
source /fsx/ubuntu/alokana/deepseek/venv/bin/activate
# Run the job using torchrun
srun /fsx/ubuntu/alokana/deepseek/venv/bin/torchrun \
  --nnodes=4 \
  --nproc-per-node=8 \
  --rdzv_id=$SLURM_JOB_ID \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  ./generate.py \
  --ckpt-path /fsx/ubuntu/alokana/deepseek/DeepSeek-R1-Demo \
  --config ./configs/config_671B.json \
  --input-file ./prompts.txt
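
For reference, you can generate the prompts.txt file mentioned in the adjustments above with a few lines of Python; the prompts below are placeholders:

# Write one prompt per line for batch inference (example prompts only)
prompts = [
    "What are the common early symptoms of appendicitis?",
    "Summarize the first-line treatment options for type 2 diabetes.",
]
with open("prompts.txt", "w") as f:
    f.write("\n".join(prompts) + "\n")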

Cleanup
To clean up your resources to avoid incurring more charges, follow these steps:

Delete any unused SageMaker Studio resources.
(Optional) Delete the SageMaker Studio domain.
Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
If you created a HyperPod cluster, delete the cluster to stop incurring costs. If you created the networking stack from the HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.

Conclusion
In this post, we demonstrated how to fine-tune large models such as DeepSeek-R1 671B using either SageMaker training jobs or SageMaker HyperPod with HyperPod recipes in a few steps. This approach minimizes the complexity of identifying optimal distributed training configurations and provides a simple way to properly size your workloads with the best price-performance architecture on AWS.
To start using SageMaker HyperPod recipes, visit our sagemaker-hyperpod-recipes GitHub repository for comprehensive documentation and example implementations. Our team continually expands our recipes based on customer feedback and emerging machine learning (ML) trends, making sure you have the necessary tools for successful AI model training.

About the Authors
 Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them enhance the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in assisting customers with containerized applications and high-performance computing solutions.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
 Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Rohith Nadimpally is a Software Development Engineer working on AWS SageMaker, where he accelerates large-scale AI/ML workflows. Before joining Amazon, he graduated with Honors from Purdue University with a degree in Computer Science. Outside of work, he enjoys playing tennis and watching movies.