Accelerate IaC troubleshooting with Amazon Bedrock Agents

Troubleshooting infrastructure as code (IaC) errors often consumes valuable time and resources. Developers can spend multiple cycles searching for solutions across forums, troubleshooting repetitive issues, or trying to identify the root cause. These delays can lead to missed security errors or compliance violations, especially in complex, multi-account environments.
This post demonstrates how you can use Amazon Bedrock Agents to create an intelligent solution to streamline the resolution of Terraform and AWS CloudFormation code issues through context-aware troubleshooting. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Agents is a fully managed service that helps developers create AI agents that can break down complex tasks into steps and execute them using FMs and APIs to accomplish specific business objectives.
Our solution uses Amazon Bedrock Agents to analyze error messages and code context, generating detailed troubleshooting steps for IaC errors. In organizations with multi-account AWS environments, teams often maintain a centralized AWS environment for developers to deploy applications. This setup makes sure that AWS infrastructure deployments using IaC align with organizational security and compliance measures. For specific IaC errors related to these compliance measures, such as those involving service control policies (SCPs) or resource-based policies, our solution intelligently directs developers to contact appropriate teams like Security or Enablement. This targeted guidance maintains security protocols and makes sure that sensitive issues are handled by the right experts. The solution is flexible and can be adapted for similar use cases beyond these examples.
Although we focus on Terraform Cloud workspaces in this example, the same principles apply to GitLab CI/CD pipelines or other continuous integration and delivery (CI/CD) approaches executing IaC code. By automating initial error analysis and providing targeted solutions or guidance, you can improve operational efficiency and focus on solving complex infrastructure challenges within your organization’s compliance framework.
Solution overview
Before we dive into the deployment process, let’s walk through the key steps of the architecture as illustrated in the following figure.

The workflow for the Terraform solution is as follows:

Initial input through the Amazon Bedrock Agents chat console – The user begins by entering details about their Terraform error into the chat console for Amazon Bedrock Agents. This typically includes the Terraform Cloud workspace URL where the error occurred, and optionally, a Git repository URL and branch name if additional context is needed.
Error retrieval and context gathering – The Amazon Bedrock agent forwards these details to an action group that invokes the first AWS Lambda function (see the following Lambda function code). This function invokes a second Lambda function (see the following Lambda function code), which retrieves the latest error message from the specified Terraform Cloud workspace. If a Git repository URL is provided, it also retrieves the relevant Terraform files from the repository. This contextual information is then sent back to the first Lambda function.
Error analysis and response generation – The first Lambda function then constructs a detailed prompt that includes the error message, repository files (if available), and specific use case instructions. It then uses the Amazon Bedrock model to analyze the error and generate either troubleshooting steps or guidance to contact specific teams.
Interaction and user guidance – The agent displays the generated response to the user. For most Terraform errors, this includes detailed troubleshooting steps. For specific cases related to organizational policies (for example, service control policies or resource-based policies), the response directs the user to contact the appropriate team, such as Security or Enablement.
Continuous improvement – The solution can be continually updated with new specific use cases and organizational guidelines, making sure that the troubleshooting advice stays current with the organization’s evolving infrastructure and compliance requirements. For example:

SCP or IAM policy violations – Guides developers when they encounter permission issues due to SCPs or strict AWS Identity and Access Management (IAM) boundaries, offering alternatives or escalation paths.
VPC and networking restrictions – Flags non-compliant virtual private cloud (VPC) or subnet configurations (such as public subnets) and suggests security-compliant adjustments.
Encryption requirements – Detects missing or incorrect encryption for Amazon Simple Storage Service (Amazon S3) or Amazon Elastic Block Store (Amazon EBS) resources and recommends the appropriate configurations to align with compliance standards.

The following diagram illustrates the step-by-step process of how the solution works.

This solution streamlines the process of resolving Terraform errors, providing immediate, context-aware guidance to developers while making sure that sensitive or complex issues are directed to the appropriate teams. By using the capabilities of Amazon Bedrock Agents, it offers a scalable and intelligent approach to managing IaC challenges in large, multi-account AWS environments.
Prerequisites
To implement the solution, you need the following:

An understanding of Amazon Bedrock Agents, prompt engineering, Amazon Bedrock Knowledge Bases, Lambda functions, and IAM
An AWS account with appropriate IAM permissions to create agents and knowledge bases in Amazon Bedrock, Lambda functions, and IAM roles
A service role created for Amazon Bedrock Agents
Model access enabled for Amazon Bedrock
A GitLab account with a repository and a personal access token to access the repository

Create the Amazon Bedrock agent
To create and configure the Amazon Bedrock agent, complete the following steps:

On the Amazon Bedrock console, choose Agents in the navigation pane.
Choose Create agent.
Provide agent details, including agent name and description (optional).
Grant the agent permissions to AWS services through the IAM service role. This gives your agent access to required services, such as Lambda.
Select an FM from Amazon Bedrock (such as Anthropic’s Claude 3 Sonnet).
For troubleshooting Terraform errors through Amazon Bedrock Agents, attach the following instruction to the agent. This instruction makes sure that the agent gathers the required input from the user and executes the action group to provide detailed troubleshooting steps.

“You are a terraform code error specialist. Greet the user and ask for terraform workspace url, branch name, code repository url. Once received, trigger troubleshooting action group. Provide the troubleshooting steps to the user.”
Configure the Lambda function for the action group
After you configure the initial agent and add the preceding instruction to the agent, you need to create two Lambda functions:

The first Lambda function will be added to the action group, which is invoked by the Amazon Bedrock agent, and will subsequently trigger the second Lambda function using the invoke method. Refer to the Lambda function code for more details. Make sure the LAMBDA_2_FUNCTION_NAME environment variable is set.
The second Lambda function will handle fetching the Terraform workspace error and the associated Terraform code from GitLab. Refer to the Lambda function code. Make sure that the TERRAFORM_API_URL, TERRAFORM_SECRET_NAME, and VCS_SECRET_NAME environment variables are set.

After the Terraform workspace error and code details are retrieved, these details will be passed back to the first Lambda function, which will use the Amazon Bedrock API with an FM to generate and provide the appropriate troubleshooting steps based on the error and code information.
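The following is a minimal sketch of how the first Lambda function might tie these pieces together. It is not the post's actual implementation: the payload keys (error_message, repo_files), the handler structure, and the use of the Amazon Bedrock Converse API with Anthropic's Claude 3 Sonnet are assumptions for illustration.

import json
import os

import boto3

lambda_client = boto3.client("lambda")
bedrock_runtime = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    # Parameters elicited by the Amazon Bedrock agent (names match the action group parameters)
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    # Ask the second Lambda function for the latest workspace error and repository files
    context_response = lambda_client.invoke(
        FunctionName=os.environ["LAMBDA_2_FUNCTION_NAME"],
        Payload=json.dumps(params),
    )
    tf_context = json.loads(context_response["Payload"].read())  # assumed keys: error_message, repo_files

    # Single-shot prompt: error message, repository files, and use case instructions in one input
    prompt = (
        "You are a Terraform troubleshooting assistant.\n"
        f"Error message:\n{tf_context.get('error_message', '')}\n\n"
        f"Repository files:\n{tf_context.get('repo_files', '')}\n\n"
        "If the error is caused by an SCP or a resource-based policy, direct the user to the "
        "Security or Enablement team; otherwise provide detailed troubleshooting steps."
    )

    # Generate troubleshooting steps with the chosen foundation model
    model_response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # In the actual action group, wrap this text in the agent response format (see the later sketch)
    return model_response["output"]["message"]["content"][0]["text"]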
Add the action group to the Amazon Bedrock agent
Complete the following steps to add the action group to the Amazon Bedrock agent:

Add an action group to the Amazon Bedrock agent.
Assign a descriptive name (for example, troubleshooting) to the action group and provide a description. This helps clarify the purpose of the action group within the workflow.
For Action group type, select Define with function details.

For more details, see Define function details for your agent’s action groups in Amazon Bedrock.

For Action group invocation, choose the first Lambda function that you created previously.

This function runs the business logic required when an action is invoked. Make sure to choose the correct version of the first Lambda function. For more details on how to configure Lambda functions for action groups, see Configure Lambda functions to send information that an Amazon Bedrock agent elicits from the user.

For Action group function 1, provide a name and description.
Add the following parameters.

Name | Description | Type | Required
workspace_url | Terraform workspace URL | string | True
repo_url | Code repository URL | string | True
branch_name | Code repository branch name | string | True
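To show how these parameters reach the function at runtime, the following sketch parses the action group event and wraps the answer in the response shape that Amazon Bedrock Agents expects for function-details action groups. Treat the exact field names as assumptions to verify against your agent's test console, and note that analyze_error is a hypothetical helper standing in for the logic described earlier.

def lambda_handler(event, context):
    # Read the three parameters defined for the action group function
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    workspace_url = params.get("workspace_url")
    repo_url = params.get("repo_url")
    branch_name = params.get("branch_name")

    # Hypothetical helper that fetches the error and generates troubleshooting steps
    troubleshooting_steps = analyze_error(workspace_url, repo_url, branch_name)

    # Return the text in the action group response format
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {
                    "TEXT": {"body": troubleshooting_steps}
                }
            },
        },
    }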

Test the solution
The following example is of a Terraform error due to a service control policy. The troubleshooting steps provided would be aligned to address those specific constraints. The action group triggers the Lambda function, which follows structured single-shot prompting by passing the complete context—such as the error message and repository contents—in a single input to the Amazon Bedrock model to generate precise troubleshooting steps.
Example 1: The following screenshot shows an example of a Terraform error caused by an SCP limitation managed by the security team.

The following screenshot shows an example of the user interaction with Amazon Bedrock Agents and the troubleshooting steps provided.

Example 2: The following screenshot shows an example of a Terraform error due to a missing variable value.

The following screenshot shows an example of the user interaction with Amazon Bedrock Agents and the troubleshooting steps provided.

Clean up
The services used in this demo can incur costs. Complete the following steps to clean up your resources:

Delete the Lambda functions if they are no longer required.
Delete the action group and Amazon Bedrock agent you created.

Conclusion
IaC offers flexibility for managing cloud environments, but troubleshooting code errors can be time-consuming, especially in environments with strict organizational guardrails. This post demonstrated how Amazon Bedrock Agents, combined with action groups and generative AI models, streamlines and accelerates the resolution of Terraform errors while maintaining compliance with environment security and operational guidelines.
Using the capabilities of Amazon Bedrock Agents, developers can receive context-aware troubleshooting steps tailored to environment-related issues such as SCP or IAM violations, VPC restrictions, and encryption policies. The solution provides specific guidance based on the error’s context and directs users to the appropriate teams for issues that require further escalation. This reduces the time spent on IaC errors, improves developer productivity, and maintains organizational compliance.
Are you ready to streamline your cloud deployment process with the generative AI of Amazon Bedrock? Start by exploring the Amazon Bedrock User Guide to see how it can facilitate your organization’s transition to the cloud. For specialized assistance, consider engaging with AWS Professional Services to maximize the efficiency and benefits of using Amazon Bedrock.

About the Authors
Akhil Raj Yallamelli is a Cloud Infrastructure Architect at AWS, specializing in architecting cloud infrastructure solutions for enhanced data security and cost efficiency. He is experienced in integrating technical solutions with business strategies to create scalable, reliable, and secure cloud environments. Akhil enjoys developing solutions focusing on customer business outcomes, incorporating generative AI (Gen AI) technologies to drive innovation and cloud enablement. He holds an MS degree in Computer Science. Outside of his professional work, Akhil enjoys watching and playing sports.
Ebbey Thomas is a Senior Generative AI Specialist Solutions Architect at AWS. He designs and implements generative AI solutions that address specific customer business problems. He is recognized for simplifying complexity and delivering measurable business outcomes for clients. Ebbey holds a BS in Computer Engineering and an MS in Information Systems from Syracuse University.

Derive generative AI powered insights from Alation Cloud Services using Amazon Q Business custom connectors

This blog post is co-written with Gene Arnold from Alation.
To build a generative AI-based conversational application integrated with relevant data sources, an enterprise needs to invest time, money, and people. First, you would need to build connectors to the data sources. Next, you need to index this data to make it available for a Retrieval Augmented Generation (RAG) approach, where relevant passages are delivered with high accuracy to a large language model (LLM). To do this, you need to select an index that provides the capabilities to index the content for semantic and vector search, build the infrastructure to retrieve data, rank the answers, and build a feature-rich web application. Additionally, you might need to hire and staff a large team to build, maintain, and manage such a system.
Amazon Q Business is a fully managed generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take actions using the data and expertise found in your company’s information repositories, code, and enterprise systems. To do this Amazon Q Business provides out-of-the-box native data source connectors that can index content into a built-in retriever and uses an LLM to provide accurate, well written answers. A data source connector is a component of Amazon Q Business that helps to integrate and synchronize data from multiple repositories into one index. Amazon Q Business offers multiple prebuilt connectors to a large number of data sources, including ServiceNow, Atlassian Confluence, Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, and many more. For a full list of supported data source connectors, see Amazon Q Business connectors.
However, many organizations store relevant information in the form of unstructured data on company intranets or within file systems on corporate networks that are inaccessible to Amazon Q Business using its native data source connectors. You can now use the custom data source connector within Amazon Q Business to upload content to your index from a wider range of data sources.
Using an Amazon Q Business custom data source connector, you can gain insights into your organization’s third party applications with the integration of generative AI and natural language processing. This post shows how to configure an Amazon Q Business custom connector and derive insights by creating a generative AI-powered conversation experience on AWS using Amazon Q Business while using access control lists (ACLs) to restrict access to documents based on user permissions.
Alation is a data intelligence company serving more than 600 global enterprises, including 40% of the Fortune 100. Customers rely on Alation to realize the value of their data and AI initiatives. Headquartered in Redwood City, California, Alation is an AWS Specialization Partner and AWS Marketplace Seller with Data and Analytics Competency. Organizations trust Alation’s platform for self-service analytics, cloud transformation, data governance, and AI-ready data, fostering innovation at scale. In this post, we will showcase a sample of how Alation’s business policies can be integrated with an Amazon Q Business application using a custom data source connector.
Finding accurate answers from content in custom data sources using Amazon Q Business
After you integrate Amazon Q Business with data sources such as Alation, users can ask questions based on the descriptions of the indexed documents. For example:

What are the top sections of the HR benefits policies?
Who are the data stewards for my proprietary database sources?

Overview of a custom connector
A data source connector is a mechanism for integrating and synchronizing data from multiple repositories into one container index. Amazon Q Business offers multiple pre-built data source connectors that can connect to your data sources and help you create your generative AI solution with minimal configuration. However, if you have valuable data residing in repositories for which those pre-built connectors aren't available, you can use a custom connector.
When you connect Amazon Q Business to a data source and initiate the data synchronization process, Amazon Q Business crawls and adds documents from the data source to its index.
You would typically use an Amazon Q Business custom connector when you have a repository that Amazon Q Business doesn’t yet provide a data source connector for. Amazon Q Business only provides metric information that you can use to monitor your data source sync jobs. You must create and run the crawler that determines the documents your data source indexes. A simple architectural representation of the steps involved is shown in the following figure.

Solution overview
The solution shown here, which integrates Alation’s business policies, is for demonstration purposes only. We recommend running similar scripts only on your own data sources after consulting with the team that manages them, or be sure to follow the terms of service for the sources that you’re trying to fetch data from. The steps involved for other custom data sources are very similar except for the part where we connect to Alation and fetch data from it. To crawl and index contents in Alation, you configure an Amazon Q Business custom connector as a data source in your Amazon Q Business application.
Prerequisites
For this walkthrough, you should have the following prerequisites:

An AWS account
Access to the Alation service with the ability to create new policies and access tokens. You can verify if you have access by navigating to https://[[your-domain]].alationcloud.com/admin/auth/ and see the OAuth Client Applications. Alation admins can navigate to https://[[your-domain]].alationcloud.com/admin/users/ and change user access if needed.
Privileges to create an Amazon Q Business application, AWS resources, and AWS Identity and Access Management (IAM) roles and policies.
Basic knowledge of AWS services and working knowledge of Alation or other data sources of choice.
Set up AWS IAM Identity Center integration with Amazon Q Business for user management.
Set up SageMaker Studio notebook and ensure the execution role on it has the necessary privileges to access both the Amazon Q Business application (specifically StartDataSourceSyncJob, BatchPutDocument, and StopDataSourceSyncJob permissions) and the AWS Secrets Manager secret (GetSecretValue). Additionally, it’s recommended that the policy restricts access to only the Amazon Q Business application Amazon Resource Name (ARN) and the Secrets Manager secret created in the following steps.

Configure your Alation connection
In your Alation cloud account, create an OAuth2 client application that can be consumed from an Amazon Q Business application.

In Alation, sign in as a user with administrator privileges, navigate to the settings page, and choose Authentication (https://[[your-domain]].alationcloud.com/admin/auth/).

In the OAuth Client Applications section, choose Add.

Enter an easily identifiable application name, and choose Save.

Take note of the OAuth client application data—the Client ID and the Client Secret—created and choose Close.

As a security best practice, storing the client application data in Secrets Manager is recommended. In the AWS Management Console, navigate to AWS Secrets Manager and add a new secret. Key in the Client_Id and Client_Secret values copied from the previous step.

Provide a name and description for the secret and choose Next.

Leave the defaults and choose Next.

Choose Store in the last page.

Create sample Alation policies
In our example, you create three different sets of Alation policies for a fictional organization named Unicorn Rentals. Grouped as Workplace, HR, and Regulatory, each policy contains a rough two-page summary of crucial organizational items of interest. You can find details on how to create policies in the Alation documentation.

On the Amazon Q Business side, let’s assume that we want to ensure that the following access policies are enforced. Users and access are set up via code illustrated in later sections.

# | First name | Last name | Policies authorized for access
1 | Alejandro | Rosalez | Workplace, HR, and Regulatory
2 | Sofia | Martinez | Workplace and HR
3 | Diego | Ramirez | Workplace and Regulatory

Create an Amazon Q Business application

Sign in to the AWS Management Console and navigate to Amazon Q Business from the search bar at the top.

On the Amazon Q Business console, choose Get Started.

On the Applications page, choose Create application.

In the first step of the Create application wizard, enter the default values. Additionally, you need to choose a list of users who require access to the Amazon Q Business application by including them through the IAM Identity Center settings.

On the access management settings page, create and add users via AWS IAM Identity Center.

Once all users are added, choose Create.

After the application is created, take note of the Application ID value from the landing page.

Next, choose an index type for the Amazon Q Business application. Choose the native retriever option.

After the index is created, verify that the status has changed to Active. You can then take note of the Index ID.

The next step is to add the custom data source.

Search for Custom data source and choose the plus sign next to it.

Provide a name and description for the custom data source.

Once done, choose Add data source.

After the data source is added and its status is Active, take note of the Data source ID.

Load policy data from Alation to Amazon Q Business using the custom connector
Now let’s load the Alation data into Amazon Q Business using the correct access permissions. The code examples that follow are also available on the accompanying GitHub code repository.

With the connector ready, move over to the SageMaker Studio notebook and perform data synchronization operations by invoking Amazon Q Business APIs.
To start, retrieve the Alation OAuth client application credentials stored in Secrets Manager.

import json

import boto3
from botocore.exceptions import ClientError

secrets_manager_client = boto3.client('secretsmanager')
secret_name = "alation_test"

try:
    get_secret_value_response = secrets_manager_client.get_secret_value(
        SecretId=secret_name
    )
    # The secret is stored as JSON, so parse it rather than using eval
    secret = json.loads(get_secret_value_response['SecretString'])
except ClientError as e:
    raise e

Next, initiate the connection using the OAuth client application credentials from Alation.

import requests
from requests.auth import HTTPBasicAuth

base_url = "https://[[your-domain]].alationcloud.com"
token_url = "/oauth/v2/token/"
introspect_url = "/oauth/v2/introspect/"
jwks_url = "/oauth/v2/.well-known/jwks.json/"

api_url = base_url + token_url
data = {
    "grant_type": "client_credentials",
}
client_id = secret['Client_Id']
client_secret = secret['Client_Secret']

auth = HTTPBasicAuth(username=client_id, password=client_secret)
response = requests.post(url=api_url, data=data, auth=auth)
print(response.json())

access_token = response.json().get('access_token', '')
api_url = base_url + introspect_url + "?verify_token=true"
data = {
    "token": access_token,
}
response = requests.post(url=api_url, data=data, auth=auth)

You then configure policy type level user access. This section can be customized based on how user access information is stored on any data sources. Here, we assume a pre-set access based on the user’s email IDs.

primary_principal_list = []
workplace_policy_principals = []
hr_policy_principals = []
regulatory_policy_principals = []

principal_user_email_ids = ['alejandro_rosalez@example.com', 'sofia_martinez@example.com', 'diego_ramirez@example.com']

workplace_policy_email_ids = ['alejandro_rosalez@example.com', 'sofia_martinez@example.com', 'diego_ramirez@example.com']
hr_policy_email_ids = ['alejandro_rosalez@example.com', 'sofia_martinez@example.com']
regulatory_policy_email_ids = ['alejandro_rosalez@example.com', 'diego_ramirez@example.com']

for workplace_policy_member in workplace_policy_email_ids:
    workplace_policy_members_dict = {'user': {'id': workplace_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE'}}
    workplace_policy_principals.append(workplace_policy_members_dict)
    if workplace_policy_member not in primary_principal_list:
        primary_principal_list.append(workplace_policy_member)

for hr_policy_member in hr_policy_email_ids:
    hr_policy_members_dict = {'user': {'id': hr_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE'}}
    hr_policy_principals.append(hr_policy_members_dict)
    if hr_policy_member not in primary_principal_list:
        primary_principal_list.append(hr_policy_member)

for regulatory_policy_member in regulatory_policy_email_ids:
    regulatory_policy_members_dict = {'user': {'id': regulatory_policy_member, 'access': 'ALLOW', 'membershipType': 'DATASOURCE'}}
    regulatory_policy_principals.append(regulatory_policy_members_dict)
    if regulatory_policy_member not in primary_principal_list:
        primary_principal_list.append(regulatory_policy_member)

You then pull individual policy details from Alation. This step can be repeated for all three policy types: Workplace, HR, and Regulatory.

url = "https://[[your-domain]].alationcloud.com/integration/v1/business_policies/?limit=200&skip=0&search=[[Workplace/HR/Regulatory]]&deleted=false"

headers = {
    "accept": "application/json",
    "TOKEN": access_token
}

response = requests.get(url, headers=headers)
policy_data = ""

for policy in json.loads(response.text):
    if policy["title"] is not None:
        policy_title = cleanhtml(policy["title"])
    else:
        policy_title = "None"
    if policy["description"] is not None:
        policy_description = cleanhtml(policy["description"])
    else:
        policy_description = "None"
    temp_data = policy_title + ":\n" + policy_description + "\n\n"
    policy_data += temp_data
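The loop above calls a cleanhtml helper that isn’t shown in this excerpt. A minimal sketch, assuming the helper only needs to strip HTML tags and collapse whitespace in the Alation policy fields:

import re

def cleanhtml(raw_html):
    # Strip HTML tags, then collapse runs of whitespace
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()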

The next step is to define the Amazon Q Business application, index, and data source information that you created in the previous steps.

qbusiness_client = boto3.client('qbusiness')
application_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
index_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
data_source_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

Now you explicitly create the users in Amazon Q Business. Individual user access to different policy type data sets is configured later.

for principal in primary_principal_list:
    create_user_response = qbusiness_client.create_user(
        applicationId=application_id,
        userId=principal,
        userAliases=[
            {
                'indexId': index_id,
                'dataSourceId': data_source_id,
                'userId': principal
            },
        ],
    )

for principal in primary_principal_list:
    get_user_response = qbusiness_client.get_user(
        applicationId=application_id,
        userId=principal
    )
    for user_alias in get_user_response['userAliases']:
        if "dataSourceId" in user_alias:
            print(user_alias['userId'])

For each policy type data set (Workplace, HR, and Regulatory), we execute the following three steps.

Start an Amazon Q Business data source sync job.

start_data_source_sync_job_response = qbusiness_client.start_data_source_sync_job(
    dataSourceId=data_source_id,
    indexId=index_id,
    applicationId=application_id
)
job_execution_id = start_data_source_sync_job_response['executionId']

Encode and batch upload data with user access mapping.

import hashlib

workplace_policy_document_id = hashlib.shake_256(policy_data.encode('utf-8')).hexdigest(128)
docs = [{
    "id": workplace_policy_document_id,
    "content": {
        'blob': policy_data.encode('utf-8')
    },
    "contentType": "PLAIN_TEXT",
    "title": "Unicorn Rentals – Workplace/HR/Regulatory Policy",
    "accessConfiguration": {'accessControls': [{'principals': [[xx]]_policy_principals}]}
}]

batch_put_document_response = qbusiness_client.batch_put_document(
    applicationId=application_id,
    indexId=index_id,
    dataSourceSyncId=job_execution_id,
    documents=docs,
)

Stop the data source sync job and wait for the data set to be indexed.

import time

stop_data_source_sync_job_response = qbusiness_client.stop_data_source_sync_job(
    dataSourceId=data_source_id,
    indexId=index_id,
    applicationId=application_id
)

max_time = time.time() + 1 * 60 * 60
found = False
while time.time() < max_time and not found:
    try:
        list_documents_response = qbusiness_client.list_documents(
            applicationId=application_id,
            indexId=index_id
        )
        if list_documents_response:
            for document in list_documents_response["documentDetailList"]:
                if document["documentId"] == workplace_policy_document_id:
                    status = document["status"]
                    print(status)
                    if status in ("INDEXED", "FAILED", "DOCUMENT_FAILED_TO_INDEX", "UPDATED"):
                        found = True
                    else:
                        time.sleep(10)
    except Exception:
        print("Exception when calling API")

Go back to the Amazon Q Business console and see if the data uploads were successful.

Find and open the custom data source from the list of data sources.

Ensure the ingested documents are added in the Sync history tab and are in the Completed status.

Also ensure the Last sync status for the custom data source connector is Completed.

Run queries with the Amazon Q Business web experience
Now that the data synchronization is complete, you can start exploring insights from Amazon Q Business. With the newly created Amazon Q Business application, select the Web Application settings tab and navigate to the auto-created URL. This will open a new tab with a preview of the user interface and options that you can customize to fit your use case.

Sign in as user Alejandro Rosalez. As you might recall, Alejandro has access to all three policy type data sets (Workplace, HR, and Regulatory).

Start by asking a question about HR policy, such as “Per the HR Payroll Policy of Unicorn Rentals, what are some additional voluntary deductions taken from employee paychecks?” Note how Amazon Q Business provides an answer and also shows where it pulled the answer from.

Next, ask a question about a Regulatory policy: “Per the PCI DSS compliance policy of Unicorn Rentals, how is the third-party service provider access to cardholder information protected?” The result includes the summarized answer on PCI DSS compliance and also shows sources where it gathered the data from.

Lastly, see how Amazon Q Business responds when asked a question about generic workplace policy: “What does Unicorn Rentals do to protect information of children under the age of 13?” In this case, the application returns the answer and marks it as a Workplace policy question.

Let’s next sign in as Sofia Martinez. Sofia has access to HR and Workplace policy types, but not to Regulatory policies.

Start by asking a question about HR policy: “Per the HR Payroll Policy of Unicorn Rentals, list the additional voluntary deductions taken from employee paychecks.” Note how Amazon Q Business lists the deductions and cites the policy the answer is gathered from.

Next, ask a Regulatory policy question: “What are the record keeping requirements mentioned in the ECOA compliance policy of Unicorn Rentals?” Note how Amazon Q Business contextually answers the question, mentioning that Sofia does not have access to that data.

Finally, sign in as Diego Ramirez. Diego has access to Workplace and Regulatory policies but not to HR policies.

Start by asking the same Regulatory policy question: “Per the PCI DSS compliance policy of Unicorn Rentals, how is third-party service provider access to cardholder information protected?” Because Diego has access to Regulatory policy data, the expected answer is generated.

Next, Diego asks a question about an HR policy: “Per the HR Compensation Policy of Unicorn Rentals, how is job pricing determined?” Note how Amazon Q Business contextually answers the question, mentioning that Diego does not have access to that data.

Troubleshooting
If you’re unable to get answers to any of your questions and get the message “Sorry, I could not find relevant information to complete your request,” check to see if any of the following issues apply:

No permissions: The ACLs applied to your account don’t allow you to query certain data sources. If this is the case, reach out to your application administrator to ensure your ACLs are configured to access the data sources.
EmailID not matching UserID: In rare scenarios, a user might have a different email ID associated with the Amazon Q Business Identity Center connection than is associated in the data source’s user profile. Make sure that the Amazon Q Business user profile is updated to recognize the email ID using the update-user CLI command or the related API call (see the sketch after this list).
Data connector sync failed: The data connector failed to synchronize information from the source to the Amazon Q Business application. Verify the data connector’s sync run schedule and sync history to confirm that the synchronization succeeded.
Empty or private data sources: Private or empty projects will not be crawled during the synchronization run.
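For the email ID mismatch case noted above, a hedged sketch of the related API call using the boto3 qbusiness client follows; the IDs are placeholders, and you should confirm the parameter names against the current UpdateUser API reference.

import boto3

qbusiness_client = boto3.client("qbusiness")

# Align the Amazon Q Business user alias with the email ID used in the data source
qbusiness_client.update_user(
    applicationId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    userId="sofia_martinez@example.com",
    userAliasesToUpdate=[
        {
            "indexId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
            "dataSourceId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
            "userId": "sofia_martinez@example.com",
        }
    ],
)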

If none of the above apply, open a support case to get the issue resolved.
Clean up
To avoid incurring future charges, clean up any resources created as part of this solution. Delete the Amazon Q Business custom connector data source and client application created in Alation and the Amazon Q Business application. Next, delete the Secrets Manager secret with Alation OAuth client application credential data. Also, delete the user management setup in IAM Identity Center and the SageMaker Studio domain.
Conclusion
In this post, we discussed how to configure the Amazon Q Business custom connector to crawl and index content from Alation as a sample. We showed how you can use Amazon Q Business generative AI-based search to enable your business leaders and agents to discover insights from your enterprise data.
To learn more about the Amazon Q Business custom connector, see the Amazon Q Business developer guide. Alation Data Catalog is available for purchase through AWS Marketplace; speak to your Alation account representative for custom purchase options. For any additional information, contact your Alation business partner.

Alation – AWS Partner Spotlight
Alation is an AWS Specialization Partner that has pioneered the modern data catalog and is making the leap into a full-service source for data intelligence. Alation is passionate about helping enterprises create thriving data cultures where anyone can find, understand, and trust data.
Contact Alation | Partner Overview | AWS Marketplace

About the Authors
Gene Arnold is a Product Architect with Alation’s Forward Deployed Engineering team. A curious learner with over 25 years of experience, Gene focuses on sharpening his selling skills and constantly explores new product lines.
Prabhakar Chandrasekaran is a Senior Technical Account Manager with AWS Enterprise Support. Prabhakar enjoys helping customers build cutting-edge AI/ML solutions on the cloud. He also works with enterprise customers providing proactive guidance and operational assistance, helping them improve the value of their solutions when using AWS. Prabhakar holds eight AWS and seven other professional certifications. With over 21 years of professional experience, Prabhakar was a data engineer and a program leader in the financial services space prior to joining AWS.
Sindhu Jambunathan is a Senior Solutions Architect at AWS, specializing in supporting ISV customers in the data and generative AI vertical to build scalable, reliable, secure, and cost-effective solutions on AWS. With over 13 years of industry experience, she joined AWS in May 2021 after a successful tenure as a Senior Software Engineer at Microsoft. Sindhu’s diverse background includes engineering roles at Qualcomm and Rockwell Collins, complemented by a Master’s of Science in Computer Engineering from the University of Florida. Her technical expertise is balanced by a passion for culinary exploration, travel, and outdoor activities.
Prateek Jain is a Sr. Solutions Architect with AWS, based out of Atlanta Georgia. He is passionate about GenAI and helping customers build amazing solutions on AWS. In his free time, he enjoys spending time with Family and playing tennis.

Mistral-Small-24B-Instruct-2501 is now available on SageMaker JumpStart and Amazon Bedrock Marketplace

Today, we’re excited to announce that Mistral-Small-24B-Instruct-2501—a twenty-four billion parameter large language model (LLM) from Mistral AI that’s optimized for low latency text generation tasks—is available for customers through Amazon SageMaker JumpStart and Amazon Bedrock Marketplace. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that developers can use to discover, test, and use over 100 popular, emerging, and specialized foundation models (FMs) alongside the current selection of industry-leading models in Amazon Bedrock. You can also use this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference. In this post, we walk through how to discover, deploy, and use Mistral-Small-24B-Instruct-2501.
Overview of Mistral Small 3 (2501)
Mistral Small 3 (2501), a latency-optimized 24B-parameter model released under Apache 2.0, maintains a balance between performance and computational efficiency. Mistral offers both the pretrained (Mistral-Small-24B-Base-2501) and instruction-tuned (Mistral-Small-24B-Instruct-2501) checkpoints of the model under Apache 2.0. Mistral Small 3 (2501) features a 32K-token context window. According to Mistral, the model demonstrates strong performance in code, math, general knowledge, and instruction following compared to its peers. Mistral Small 3 (2501) is designed for the 80% of generative AI tasks that require robust language and instruction following performance with very low latency. The instruction-tuning process is focused on improving the model’s ability to follow complex directions, maintain coherent conversations, and generate accurate, context-aware responses. The 2501 version follows previous iterations (Mistral-Small-2409 and Mistral-Small-2402) released in 2024, incorporating improvements in instruction following and reliability. Currently, the instruct version of this model, Mistral-Small-24B-Instruct-2501, is available for customers to deploy and use on SageMaker JumpStart and Amazon Bedrock Marketplace.
Optimized for conversational assistance
Mistral Small 3 (2501) excels in scenarios where quick, accurate responses are critical, such as virtual assistants, where users expect immediate feedback and near real-time interactions. Mistral Small 3 (2501) can handle rapid function execution when used as part of automated or agentic workflows. According to Mistral, the architecture is designed to typically respond in less than 100 milliseconds, making it ideal for customer service automation, interactive assistance, live chat, and content moderation.
Performance metrics and benchmarks
According to Mistral, the instruction-tuned version of the model achieves over 81% accuracy on Massive Multitask Language Understanding (MMLU) with 150 tokens per second latency, making it currently the most efficient model in its category. In third-party evaluations conducted by Mistral, the model demonstrates competitive performance against larger models such as Llama 3.3 70B and Qwen 32B. Notably, Mistral claims that the model performs at the same level as Llama 3.3 70B instruct and is more than three times faster on the same hardware.
SageMaker JumpStart overview
SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.
You can now discover and deploy Mistral models in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with Amazon SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and under your VPC controls, helping to support data security for enterprise security needs.
Prerequisites
To try Mistral-Small-24B-Instruct-2501 in SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, see Identity and Access Management for Amazon SageMaker.
Access to Amazon SageMaker Studio, a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the model.

Amazon Bedrock Marketplace overview
To get started, in the AWS Management Console for Amazon Bedrock, select Model catalog in the Foundation models section of the navigation pane. Here, you can search for models that help you with a specific use case or language. The results of the search include both serverless models and models available in Amazon Bedrock Marketplace. You can filter results by provider, modality (such as text, image, or audio), or task (such as classification or text summarization).
Deploy Mistral-Small-24B-Instruct-2501 in Amazon Bedrock Marketplace
To access Mistral-Small-24B-Instruct-2501 in Amazon Bedrock, complete the following steps:

On the Amazon Bedrock console, select Model catalog under Foundation models in the navigation pane.

At the time of writing this post, you can use the InvokeModel API to invoke the model. It doesn’t support Converse APIs or other Amazon Bedrock tooling.

Filter for Mistral as a provider and select the Mistral-Small-24B-Instruct-2501 model.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.
The page also includes deployment options and licensing information to help you get started with Mistral-Small-24B-Instruct-2501 in your applications.

To begin using Mistral-Small-24B-Instruct-2501, choose Deploy.
You will be prompted to configure the deployment details for Mistral-Small-24B-Instruct-2501. The model ID will be pre-populated.

For Endpoint name, enter an endpoint name (up to 50 alphanumeric characters).
For Number of instances, enter a number between 1 and 100.
For Instance type, select your instance type. For optimal performance with Mistral-Small-24B-Instruct-2501, a GPU-based instance type such as ml.g6.12xlarge is recommended.
Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.

Choose Deploy to begin using the model.

When the deployment is complete, you can test Mistral-Small-24B-Instruct-2501 capabilities directly in the Amazon Bedrock playground.

Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters such as temperature and maximum length.

When using Mistral-Small-24B-Instruct-2501 with the Amazon Bedrock InvokeModel API and the playground console, format your requests according to the chat template recommended on the model’s detail page for optimal results.
This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results.

You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with Amazon Bedrock APIs, you need to get the endpoint Amazon Resource Name (ARN).
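As a rough sketch of that programmatic call, the snippet below passes the endpoint ARN from the Managed deployments page as the model ID and assumes a messages-style request body like the SageMaker example later in this post; check the model detail page for the exact payload format your deployment expects.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Replace with the endpoint ARN shown under Managed deployments
endpoint_arn = "arn:aws:sagemaker:us-east-1:123456789012:endpoint/your-endpoint-name"

body = {
    "messages": [{"role": "user", "content": "Summarize the benefits of low-latency LLMs."}],
    "max_tokens": 500,
    "temperature": 0.1,
}

response = bedrock_runtime.invoke_model(modelId=endpoint_arn, body=json.dumps(body))
print(json.loads(response["body"].read()))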
Discover Mistral-Small-24B-Instruct-2501 in SageMaker JumpStart
You can access Mistral-Small-24B-Instruct-2501 through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform ML development steps, from preparing data to building, training, and deploying your ML models. For more information about how to get started and set up SageMaker Studio, see Amazon SageMaker Studio.

In the SageMaker Studio console, access SageMaker JumpStart by choosing JumpStart in the navigation pane.
Select HuggingFace.
From the SageMaker JumpStart landing page, search for Mistral-Small-24B-Instruct-2501 using the search box.
Select a model card to view details about the model such as license, data used to train, and how to use the model. Choose Deploy to deploy the model and create an endpoint.

Deploy Mistral-Small-24B-Instruct-2501 with the SageMaker SDK
Deployment starts when you choose Deploy. After deployment finishes, you will see that an endpoint is created. Test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in the notebook editor of your choice in SageMaker Studio.

To deploy using the SDK, start by selecting the Mistral-Small-24B-Instruct-2501 model, specified by the model_id with the value huggingface-llm-mistral-small-24b-instruct-2501. You can deploy the model on SageMaker using the following code.

from sagemaker.jumpstart.model import JumpStartModel

accept_eula = True

model = JumpStartModel(model_id="huggingface-llm-mistral-small-24b-instruct-2501")
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. The EULA value must be explicitly defined as True to accept the end-user license agreement (EULA). See AWS service quotas for how to request a service quota increase.
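For example, here is a brief sketch of overriding the default instance type at deployment time; the instance type shown is an assumption, so pick one that matches your account’s quotas and the model’s hardware requirements.

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-mistral-small-24b-instruct-2501")

# Override the default instance type (ml.g6.12xlarge is an assumed example)
predictor = model.deploy(
    accept_eula=True,
    instance_type="ml.g6.12xlarge",
)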

After the model is deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

prompt = "Hello!"
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "max_tokens": 4000,
    "temperature": 0.1,
    "top_p": 0.9,
}

response = predictor.predict(payload)
print(response['choices'][0]['message']['content'])

Retail math example
Here’s an example of how Mistral-Small-24B-Instruct-2501 can break down a common shopping scenario. In this case, you ask the model to calculate the final price of a shirt after applying multiple discounts—a situation many of us face while shopping. Notice how the model provides a clear, step-by-step solution to follow.

prompt = "A store is having a 20% off sale, and you have an additional 10% off coupon. If you buy a shirt that originally costs $50, how much will you pay?"
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "max_tokens": 1000,
    "temperature": 0.1,
    "top_p": 0.9,
}

response = predictor.predict(payload)
print(response['choices'][0]['message']['content'])

The following is the output:

First, we’ll apply the 20% off sale discount to the original price of the shirt.

20% of $50 is calculated as:
0.20 * $50 = $10

So, the price after the 20% discount is:
$50 – $10 = $40

Next, we’ll apply the additional 10% off coupon to the new price of $40.

10% of $40 is calculated as:
0.10 * $40 = $4

So, the price after the additional 10% discount is:
$40 – $4 = $36

Therefore, you will pay $36 for the shirt.

The response shows clear step-by-step reasoning without introducing incorrect information or hallucinated facts. Each mathematical step is explicitly shown, making it simple to verify the accuracy of the calculations.
Clean up
To avoid unwanted charges, complete the following steps in this section to clean up your resources.
Delete the Amazon Bedrock Marketplace deployment
If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, select Marketplace deployments.
In the Managed deployments section, locate the endpoint you want to delete.
Select the endpoint, and on the Actions menu, select Delete.
Verify the endpoint details to make sure you’re deleting the correct deployment:

Endpoint name
Model name
Endpoint status

Choose Delete to delete the endpoint.
In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart predictor
After you’re done running the notebook, make sure to delete all resources that you created in the process to avoid additional billing. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we showed you how to get started with Mistral-Small-24B-Instruct-2501 in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.
For more Mistral resources on AWS, check out the Mistral-on-AWS GitHub repo.

About the Authors
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.
Shane Rai is a Principal Generative AI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using the breadth of cloud-based AI/ML services offered by AWS, including model offerings from top tier foundation model providers.
Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technologies.
Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Building a Legal AI Chatbot: A Step-by-Step Guide Using bigscience/T0pp LLM, Open-Source NLP Models, Streamlit, PyTorch, and Hugging Face Transformers

In this tutorial, we will build an efficient Legal AI Chatbot using open-source tools. It provides a step-by-step guide to creating a chatbot using the bigscience/T0pp LLM, Hugging Face Transformers, and PyTorch. We will walk you through setting up the model, optimizing performance using PyTorch, and ensuring an efficient and accessible AI-powered legal assistant.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "bigscience/T0pp"  # Open-source and available
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

First, we load bigscience/T0pp, an open-source LLM, using Hugging Face Transformers. It initializes a tokenizer for text preprocessing and loads the AutoModelForSeq2SeqLM, enabling the model to perform text generation tasks such as answering legal queries.

import spacy
import re

nlp = spacy.load("en_core_web_sm")

def preprocess_legal_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop]  # Lemmatization
    return " ".join(tokens)

sample_text = "The contract is valid for 5 years, terminating on December 31, 2025."
print(preprocess_legal_text(sample_text))

Then, we preprocess legal text using spaCy and regular expressions to ensure cleaner and more structured input for NLP tasks. It first converts text to lowercase, removes extra spaces and special characters using regex, and then tokenizes and lemmatizes the text using spaCy’s NLP pipeline. Additionally, it filters out stop words to retain only meaningful terms, making it ideal for legal text processing in AI applications. The cleaned text is more efficient for machine learning and language models like bigscience/T0pp, improving accuracy in legal chatbot responses.

def extract_legal_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

sample_text = "Apple Inc. signed a contract with Microsoft on June 15, 2023."
print(extract_legal_entities(sample_text))

Here, we extract legal entities from text using spaCy’s Named Entity Recognition (NER) capabilities. The function processes the input text with spaCy’s NLP model, identifying and extracting key entities such as organizations, dates, and legal terms. It returns a list of tuples, each containing the recognized entity and its category (e.g., organization, date, or law-related term).

import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
embedding_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed_text(text):
    inputs = embedding_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        output = embedding_model(**inputs)
    embedding = output.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()  # Ensure 1D vector
    return embedding

legal_docs = [
    "A contract is legally binding if signed by both parties.",
    "An NDA prevents disclosure of confidential information.",
    "A non-compete agreement prohibits working for a competitor."
]

doc_embeddings = np.array([embed_text(doc) for doc in legal_docs])

print("Embeddings Shape:", doc_embeddings.shape)  # Should be (num_samples, embedding_dim)

index = faiss.IndexFlatL2(doc_embeddings.shape[1])  # Dimension should match embedding size
index.add(doc_embeddings)

query = "What happens if I break an NDA?"
query_embedding = embed_text(query).reshape(1, -1)  # Reshape for FAISS
_, retrieved_indices = index.search(query_embedding, 1)

print(f"Best matching legal text: {legal_docs[retrieved_indices[0][0]]}")

With the above code, we build a legal document retrieval system using FAISS for efficient semantic search. It first loads the MiniLM embedding model from Hugging Face to generate numerical representations of text. The embed_text function processes legal documents and queries by computing contextual embeddings using MiniLM. These embeddings are stored in a FAISS vector index, allowing fast similarity searches.

def legal_chatbot(query):
    inputs = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)

query = "What happens if I break an NDA?"
print(legal_chatbot(query))

Finally, we define the legal_chatbot function, which generates responses to legal queries using the pre-trained language model. It takes a user query, processes it with the tokenizer, and generates a response with the model; the output is then decoded into readable text with special tokens removed. When a query like "What happens if I break an NDA?" is entered, the chatbot returns a relevant AI-generated legal response.

In conclusion, by integrating bigscience/T0pp LLM, Hugging Face Transformers, and PyTorch, we have demonstrated how to build a powerful and scalable Legal AI Chatbot using open-source resources. This project is a solid foundation for creating reliable AI-powered legal tools, making legal assistance more accessible and automated.

Here is the Colab Notebook for the above project.
The post Building a Legal AI Chatbot: A Step-by-Step Guide Using bigscience/T0pp LLM, Open-Source NLP Models, Streamlit, PyTorch, and Hugging Face Transformers appeared first on MarkTechPost.

Optimizing Training Data Allocation Between Supervised and Preference …

Large Language Models (LLMs) face significant challenges in optimizing their post-training methods, particularly in balancing Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) approaches. While SFT uses direct instruction-response pairs and RL methods like RLHF use preference-based learning, the optimal allocation of limited training resources between these approaches remains unclear. Recent studies have shown that models can achieve task alignment and improved reasoning capabilities without extensive SFT, challenging traditional sequential post-training pipelines. Moreover, the substantial cost of collecting and annotating human data compared to compute costs creates a need to understand the effectiveness of different training methods under fixed data-annotation budgets.

Existing research has explored various trade-offs in language model training under fixed budgets, including comparisons between pretraining versus finetuning and finetuning versus model distillation. Studies have examined the data and compute costs of SFT and RL methods in isolation along with cost-efficiency considerations in generating human and synthetic data. While some research shows the effects of high-quality preference data on RL methods like Direct Preference Optimization (DPO) and PPO, other studies focus on the relationship between SFT and RL methods regarding model forgetfulness, generalization, and alignment. However, these studies have not addressed optimal resource allocation between SFT and RL-based approaches under strict data annotation constraints.

Researchers from the Georgia Institute of Technology have proposed a comprehensive study examining the optimal allocation of training data budgets between SFT and Preference Finetuning (PFT) in LLMs. The study investigates this relationship across four diverse tasks, multiple model sizes, and various data annotation costs. It addresses the “cold start problem” in mathematical tasks, where eliminating SFT leads to suboptimal performance due to distribution shifts when applying DPO directly to the base model. Their findings suggest that while larger data budgets benefit from combining both methods, allocating even a small portion of the budget to SFT can significantly improve performance on analytical tasks.

The study evaluates the cost-effectiveness and optimal resource allocation between SFT and PFT in post-training LLMs under 10 billion parameters. The research methodology measures data budgets through training examples or monetary annotation costs, assuming equal labor costs for both methods and the availability of training prompts. The experimental setup begins with no task-specific labeled data, using open-source datasets or synthetically curated data for each target task. To maintain focus on task-specific improvements, general-purpose conversational datasets commonly used in PFT, such as UltraFeedback and Chatbot Arena preferences, are excluded. This controlled approach allows for precise measurement of performance improvements resulting from targeted data annotation.

The results reveal that optimal allocation of the training budget between SFT and PFT methods proves crucial, with properly balanced datasets outperforming suboptimally allocated datasets 2-5 times larger in size. Using 5K examples with 25% SFT allocation for tasks like Summarization, Helpfulness, and Grade School Math matches the performance of 20K examples with 75% SFT allocation. The study identifies that pure SFT excels in low-data scenarios, while larger data budgets benefit from higher proportions of preference data. Moreover, direct preference finetuning on base models shows limited success in mathematical tasks, and allocating even a small portion to SFT significantly improves performance by better aligning the reference model’s response style.

In conclusion, this paper provides crucial insights into optimizing LLM post-training under resource constraints, particularly regarding the interplay between SFT and PFT. The study identifies a significant "cold-start problem" when applying PFT directly to base models, which can be mitigated effectively by allocating even 10% of the budget to initial SFT. However, the research acknowledges limitations, including the use of offline methods like DPO and KTO for the RL implementation, and potential biases from using GPT-4 for synthetic data generation and evaluation. Moreover, model size was capped at 10 billion parameters because running thousands of finetuning runs with larger models, such as those with 70B parameters, would be extremely compute-intensive.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Optimizing Training Data Allocation Between Supervised and Preference Finetuning in Large Language Models appeared first on MarkTechPost.

This AI Paper from Weco AI Introduces AIDE: A Tree-Search-Based AI Age …

The development of high-performing machine learning models remains a time-consuming and resource-intensive process. Engineers and researchers spend significant time fine-tuning models, optimizing hyperparameters, and iterating through various architectures to achieve the best results. This manual process demands computational power and relies heavily on domain expertise. Efforts to automate these aspects have led to the development of techniques such as neural architecture search and AutoML, which streamline model optimization but still face computational expense and scalability challenges.

One of the critical challenges in machine learning development is the reliance on iterative experimentation. Engineers must evaluate different configurations to optimize model performance, making the process labor-intensive and computationally demanding. Traditional optimization techniques often depend on brute-force searches, requiring extensive trial-and-error to achieve desirable results. The inefficiency of this approach limits productivity, and the high cost of computations makes scalability an issue. Addressing these inefficiencies requires an intelligent system that can systematically explore the search space, reduce redundancy, and minimize unnecessary computational expenditure while improving overall model quality.

Automated tools have been introduced to assist in model development and address these inefficiencies. AutoML frameworks such as H2O AutoML and AutoSklearn have enabled model selection and hyperparameter tuning. Similarly, neural architecture search methods attempt to automate the design of neural networks using reinforcement learning and evolutionary techniques. While these methods have shown promise, they are often limited by their reliance on predefined search spaces and lack the adaptability required for diverse problem domains. As a result, there is a pressing need for a more dynamic approach that can enhance the efficiency of machine learning engineering without excessive computational costs.

Researchers at Weco AI introduced AI-Driven Exploration (AIDE), an intelligent agent designed to automate the process of machine learning engineering using large language models (LLMs). Unlike traditional optimization techniques, AIDE approaches model development as a tree-search problem, enabling the system to refine solutions systematically. AIDE efficiently trades computational resources for enhanced performance by evaluating and improving candidate solutions incrementally. Its ability to explore solutions at the code level rather than within predefined search spaces allows for a more flexible and adaptive approach to machine learning engineering. The methodology ensures that AIDE optimally navigates through possible solutions while integrating automated evaluations to guide its search.

AIDE structures its optimization process as a hierarchical tree where each node represents a potential solution. A search policy determines which solutions should be refined, while an evaluation function assesses model performance at each step. The system also integrates a coding operator powered by LLMs to generate new iterations. AIDE effectively refines solutions by analyzing historical improvements and leveraging domain-specific knowledge while minimizing unnecessary computations. Unlike conventional methods, which often append all past interactions into a model’s context, AIDE selectively summarizes relevant details, ensuring that each iteration remains focused on essential improvements. Further, debugging and refinement mechanisms ensure that AIDE’s iterations consistently lead to more efficient and higher-performing models.
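To make this loop concrete, here is a rough sketch of how such a tree search might be organized. It is not Weco AI's implementation; propose_with_llm and evaluate are hypothetical stand-ins for the coding operator and evaluation function described above, and the greedy-with-exploration policy is just one possible search policy.

import random

class Node:
    """A candidate solution (a piece of ML code) plus its measured score."""
    def __init__(self, code, score=None, parent=None):
        self.code = code
        self.score = score
        self.parent = parent
        self.children = []

def tree_search(root_code, propose_with_llm, evaluate, iterations=20):
    root = Node(root_code, score=evaluate(root_code))
    tree = [root]
    for _ in range(iterations):
        # Search policy: usually refine the best-scoring node, occasionally explore another
        parent = max(tree, key=lambda n: n.score) if random.random() < 0.8 else random.choice(tree)
        # Coding operator: the LLM proposes a refined version of the parent's code,
        # given a summary of past results rather than the full interaction history
        child_code = propose_with_llm(parent.code, history=[n.score for n in tree])
        child = Node(child_code, score=evaluate(child_code), parent=parent)
        parent.children.append(child)
        tree.append(child)
    return max(tree, key=lambda n: n.score)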

Empirical results demonstrate AIDE’s effectiveness in machine learning engineering. The system was evaluated on Kaggle competitions, achieving an average performance surpassing 51.38% of human competitors. AIDE ranked above the median human participant in 50% of the competitions being assessed. The tool also excelled in AI research benchmarks, including OpenAI’s MLE-Bench and METR’s RE-Bench, demonstrating superior adaptability across diverse machine learning challenges. In METR’s evaluation, AIDE was found to be competitive with top human AI researchers in complex optimization tasks. It outperformed human experts in constrained environments where rapid iteration was crucial, proving its ability to streamline machine learning workflows.

Further evaluations on MLE-Bench Lite highlight the performance boost AIDE provides. Combining AIDE with the o1-preview model led to a substantial increase in key metrics. Valid submissions rose from 63.6% to 92.4%, while the percentage of solutions ranking above the median improved from 13.6% to 59.1%. AIDE also significantly improved competition success rates, with gold medal achievements increasing from 6.1% to 21.2% and overall medal acquisition reaching 36.4%, up from 7.6%. These findings emphasize AIDE’s ability to optimize machine learning workflows effectively and enhance AI-driven solutions.

AIDE’s design addresses critical inefficiencies in machine learning engineering by systematically automating model development through a structured search methodology. By integrating LLMs into an optimization framework, AIDE significantly reduces the reliance on manual trial-and-error processes. The empirical evaluations indicate it effectively enhances efficiency and adaptability, making machine learning development more scalable. Given its strong performance in multiple benchmarks, AIDE represents a promising step toward the future of automated machine learning engineering. Future improvements may expand its applicability to more complex problem domains while refining its interpretability and generalization capabilities.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post This AI Paper from Weco AI Introduces AIDE: A Tree-Search-Based AI Agent for Automating Machine Learning Engineering appeared first on MarkTechPost.

Fine-Tuning NVIDIA NV-Embed-v1 on Amazon Polarity Dataset Using LoRA a …

In this tutorial, we explore how to fine-tune NVIDIA's NV-Embed-v1 model on the Amazon Polarity dataset using LoRA (Low-Rank Adaptation) with PEFT (Parameter-Efficient Fine-Tuning) from Hugging Face. By leveraging LoRA, we efficiently adapt the model without modifying all its parameters, making fine-tuning feasible on low-VRAM GPUs. The implementation in this tutorial can be broken into the following steps:

Authenticating with Hugging Face to access NV-Embed-v1  

Loading and configuring the model efficiently  

Applying LoRA fine-tuning using PEFT  

Preprocessing the Amazon Polarity dataset for training  

Optimizing GPU memory usage with `device_map=”auto”`  

Training and evaluating the model on sentiment classification  

By the end of this guide, you’ll have a fine-tuned NV-Embed-v1 model optimized for binary sentiment classification, demonstrating how to apply efficient fine-tuning techniques to real-world NLP tasks.

from huggingface_hub import login

login()  # Enter your Hugging Face token when prompted

import os
HF_TOKEN = "...."  # Replace with your actual token
os.environ["HF_TOKEN"] = HF_TOKEN

import torch
import torch.distributed as dist
from transformers import AutoModel, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model

First, we log in to the Hugging Face Hub using our API token, set the token as an environment variable, and import the libraries needed for distributed training and fine-tuning transformer models with techniques like LoRA.

MODEL_NAME = "nvidia/NV-Embed-v1"
HF_TOKEN = "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # Replace with your actual token

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, token=HF_TOKEN)
model = AutoModel.from_pretrained(
    MODEL_NAME,
    device_map="auto",          # Enable efficient GPU placement
    torch_dtype=torch.float16,  # Use FP16 for efficiency
    token=HF_TOKEN
)

This snippet sets a specific model name and authentication token, then loads the corresponding pretrained tokenizer and model from Hugging Face’s model hub. It also configures the model to use automatic GPU allocation and FP16 precision for improved efficiency.

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["self_attn.q_proj", "self_attn.v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="FEATURE_EXTRACTION",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

With the above code, we configure a LoRA setup with the specified parameters (r=16, lora_alpha=32, and a dropout of 0.1) targeting the self-attention mechanism's query and value projection layers. We then integrate this configuration into the model using PEFT so that only the LoRA layers are trainable for feature extraction, and finally print the trainable parameters.

dataset = load_dataset("amazon_polarity")

def tokenize_function(examples):
    return tokenizer(examples["content"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Here, we load the Amazon Polarity dataset, define a function to tokenize its "content" field with padding and truncation, and apply this function to convert the dataset into a tokenized format for model training.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    save_strategy="epoch",
    save_total_limit=1,
    logging_dir="./logs",
    logging_steps=10,
    fp16=True,  # Mixed precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

With the above code, we set up training parameters—like output directories, batch sizes, logging, and FP16 mixed precision—using TrainingArguments, create a Trainer with the model and tokenized train/test datasets, and finally initiate the training process.

model.save_pretrained("./fine_tuned_nv_embed")
tokenizer.save_pretrained("./fine_tuned_nv_embed")
print("Training Complete! Model Saved.")

Finally, we save the fine-tuned model and its tokenizer to the specified directory and then print a confirmation message indicating that training is complete and the model is saved.

By the end of this tutorial, we successfully fine-tuned NV-Embed-v1 on the Amazon Polarity dataset using LoRA and PEFT, ensuring efficient memory usage and scalable adaptation. This tutorial highlights the power of parameter-efficient fine-tuning, enabling domain adaptation of large models without requiring massive computational resources. This approach can be extended to other transformer-based models, making it valuable for custom embeddings, sentiment analysis, and NLP-driven applications. Whether you’re working on product review classification, AI-driven recommendation systems, or domain-specific search engines, this method allows you to fine-tune large-scale models on a budget efficiently.

Here is the Colab Notebook for the above project.
The post Fine-Tuning NVIDIA NV-Embed-v1 on Amazon Polarity Dataset Using LoRA and PEFT: A Memory-Efficient Approach with Transformers and Hugging Face appeared first on MarkTechPost.

Sony Researchers Propose TalkHier: A Novel AI Framework for LLM-MA Sys …

LLM-based multi-agent (LLM-MA) systems enable multiple language model agents to collaborate on complex tasks by dividing responsibilities. These systems are used in robotics, finance, and coding but face challenges in communication and refinement. Text-based communication leads to long, unstructured exchanges, making it hard to track tasks, maintain structure, and recall past interactions. Refinement methods like debates and feedback-based improvements struggle as important inputs may be ignored or biased due to processing order. These issues limit the efficiency of LLM-MA systems in handling multi-step problems.

Currently, LLM-based multi-agent systems use debate, self-refinement, and multi-agent feedback to handle complex tasks. Because these techniques rely on unstructured text-based interaction, they become hard to control: agents struggle to follow subtasks, remember previous interactions, and provide consistent responses. Various communication structures, including chain- and tree-based models, try to improve efficiency but lack explicit protocols for structuring information. Feedback-refinement techniques attempt to increase accuracy but struggle with biased or duplicate inputs, making evaluation unreliable. Without systematic communication and scalable feedback, such systems remain inefficient and error-prone.

To mitigate these issues, researchers from Sony Group Corporation, Japan, proposed TalkHier, a framework that improves communication and task coordination in multi-agent systems using structured protocols and hierarchical refinement. Unlike standard approaches, TalkHier explicitly structures how agents interact and how tasks are formulated, reducing errors and improving efficiency. Agents execute formalized roles, and the system automatically adapts its scaling to different problems, resulting in improved decision-making and coordination.

This framework structures agents in a graph such that each node is an agent, and edges represent communication paths. Agents possess independent memory, which allows them to hold pertinent information and make decisions based on informed inputs without using shared memory. Communication follows a formal process: messages contain content, background information, and intermediate outputs. Agents are organized into teams with supervisors monitoring the process, and a subset of agents serve as members and supervisors, resulting in a nested hierarchy. Work is allocated, assessed, and improved in a series of iterations until it passes a quality threshold, with the goal of accuracy and minimizing errors.
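As a rough sketch of what such structured messages and agent roles could look like in code (the class and field names here are assumptions for illustration, not the paper's exact schema):

from dataclasses import dataclass, field

@dataclass
class Message:
    sender: str
    recipient: str
    content: str                                   # the main instruction or answer
    background: str = ""                           # context the recipient needs to act
    intermediate_outputs: list = field(default_factory=list)  # partial results so far

@dataclass
class Agent:
    name: str
    role: str                                      # e.g., "member" or "supervisor"
    memory: list = field(default_factory=list)     # independent, per-agent memory

    def receive(self, message: Message):
        # Each agent stores only what it is sent; there is no shared memory
        self.memory.append(message)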

Upon evaluation, researchers assessed TalkHier across multiple benchmarks to analyze its effectiveness. On the MMLU dataset, covering Moral Scenario, College Physics, Machine Learning, Formal Logic, and US Foreign Policy, TalkHier, built on GPT-4o, achieved the highest accuracy of 88.38%, surpassing AgentVerse (83.66%) and single-agent baselines like ReAct–7@ (67.19%) and GPT-4o-7@ (71.15%), demonstrating the benefits of hierarchical refinement. On the WikiQA dataset, it outperformed baselines in open-domain question answering with a ROUGE-1 score of 0.3461 (+5.32%) and a BERTScore of 0.6079 (+3.30%), exceeding AutoGPT (0.3286 ROUGE-1, 0.5885 BERTScore). An ablation study showed that removing the evaluation supervisor or structured communication significantly reduced accuracy, confirming their importance. TalkHier outperformed OKG by 17.63% across Faithfulness, Fluency, Attractiveness, and Character Count Violation on the Camera dataset for ad text generation, with human evaluations validating its multi-agent assessments. While OpenAI-o1’s internal architecture was not revealed, TalkHier posted competitive MMLU scores and beat it decisively on WikiQA, showing flexibility between tasks and dominance over majority voting and open-source multi-agent systems.

In the end, the proposed framework improved communication, reasoning, and coordination in LLM multi-agent systems by combining a structured protocol with hierarchical refinement, which resulted in a better performance on several benchmarks. Including messages, intermediate results, and context information ensured structured interactions without sacrificing heterogeneous agent feedback. Even with increased API expenses, TalkHier set a new benchmark for scalable, objective multi-agent cooperation. This methodology can serve as a baseline in subsequent research, directing improvement in effective communication mechanisms and low-cost multi-agent interactions, ultimately towards advancing LLM-based cooperative systems.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post Sony Researchers Propose TalkHier: A Novel AI Framework for LLM-MA Systems that Addresses Key Challenges in Communication and Refinement appeared first on MarkTechPost.

TokenSkip: Optimizing Chain-of-Thought Reasoning in LLMs Through Contr …

Large Language Models (LLMs) face significant challenges in complex reasoning tasks, despite the breakthrough advances achieved through Chain-of-Thought (CoT) prompting. The primary challenge lies in the computational overhead introduced by longer CoT sequences, which directly impacts inference latency and memory requirements. The autoregressive nature of LLM decoding means that as CoT sequences grow longer, there is a proportional increase in processing time and memory usage in attention layers where computational costs scale quadratically. Finding a balance between maintaining reasoning accuracy and computational efficiency has become a critical challenge, as attempts to reduce reasoning steps often compromise the model’s problem-solving capabilities.

Various methodologies have been developed to address the computational challenges of Chain-of-Thought (CoT) reasoning. Some approaches focus on streamlining the reasoning process by simplifying or skipping certain thinking steps, while others attempt to generate steps in parallel. A different strategy involves compressing reasoning steps into continuous latent representations, enabling LLMs to reason without generating explicit word tokens. Moreover, prompt compression techniques for handling complex instructions and long-context inputs more efficiently include using lightweight language models to generate concise prompts, employing implicit continuous tokens for task representation, and implementing direct compression by filtering for highly informative tokens.

Researchers from The Hong Kong Polytechnic University and the University of Science and Technology of China have proposed TokenSkip, an approach to optimize CoT processing in LLMs. It enables models to skip less important tokens within CoT sequences while maintaining connections between critical reasoning tokens, with adjustable compression ratios. The system works by first constructing compressed CoT training data through token pruning, followed by a supervised fine-tuning process. Initial testing across multiple models, including LLaMA-3.1-8B-Instruct and the Qwen2.5-Instruct series, shows promising results, particularly in maintaining reasoning capabilities while significantly reducing computational overhead.

TokenSkip's architecture is built on the principle that different reasoning tokens contribute varying levels of importance to reaching the final answer. It contains two main phases: training data preparation and inference. In the training phase, the system generates CoT trajectories using the target LLM, and each remaining trajectory is pruned with a randomly selected compression ratio; the token pruning process is guided by an "importance scoring" mechanism. During inference, TokenSkip keeps the autoregressive decoding approach but improves efficiency by letting the LLM skip less important tokens. The input format places the question and the compression ratio between end-of-sequence tokens.
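A toy sketch of the pruning step might look like the following; importance_score is a hypothetical stand-in for the paper's scoring mechanism, and the training-input format is an assumption based on the description above rather than the exact template used by the authors.

def compress_cot(cot_tokens, importance_score, compression_ratio=0.6):
    """Keep roughly `compression_ratio` of the CoT tokens, dropping the least important ones."""
    scored = [(importance_score(tok, i, cot_tokens), i) for i, tok in enumerate(cot_tokens)]
    keep_count = max(1, int(len(cot_tokens) * compression_ratio))
    # Keep the highest-scoring tokens, then restore their original order
    keep_indices = sorted(i for _, i in sorted(scored, reverse=True)[:keep_count])
    return [cot_tokens[i] for i in keep_indices]

def build_training_input(question, ratio, compressed_cot, eos="</s>"):
    # Question and compression ratio separated by end-of-sequence tokens (eos is an assumed placeholder)
    return f"{question}{eos}{ratio}{eos}{' '.join(compressed_cot)}"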

The results show that larger language models are more adept at maintaining performance while achieving higher compression rates. The Qwen2.5-14B-Instruct model achieves remarkable results with only a 0.4% performance drop while reducing token usage by 40%. TokenSkip shows superior performance when compared with alternative approaches like prompt-based reduction and truncation. While prompt-based reduction fails to achieve target compression ratios and truncation leads to significant performance degradation, TokenSkip maintains the specified compression ratio while preserving reasoning capabilities. On the MATH-500 dataset, it achieves a 30% reduction in token usage with less than a 4% performance drop.

In this paper, researchers introduced TokenSkip which represents a significant advancement in optimizing CoT processing for LLMs by introducing a controllable compression mechanism based on token importance. The method’s success lies in maintaining reasoning accuracy while significantly reducing computational overhead by selectively preserving critical tokens and skipping less important ones. The approach has proven effective with LLMs, showing minimal performance degradation even at substantial compression ratios. This research opens new possibilities for advancing efficient reasoning in LLMs, establishing a foundation for future developments in computational efficiency while maintaining robust reasoning capabilities.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post TokenSkip: Optimizing Chain-of-Thought Reasoning in LLMs Through Controllable Token Compression appeared first on MarkTechPost.

SGLang: An Open-Source Inference Engine Transforming LLM Deployment th …

Organizations face significant challenges when deploying LLMs in today’s technology landscape. The primary issues include managing the enormous computational demands required to process high volumes of data, achieving low latency, and ensuring optimal balance between CPU-intensive tasks, such as scheduling and memory allocation, and GPU-intensive computations. Repeatedly processing similar inputs further compounds the inefficiencies in many systems, leading to redundant computations that slow down overall performance. Also, generating structured outputs like JSON or XML in real-time introduces further delays, making it difficult for applications to deliver fast, reliable, cost-effective performance at scale.

SGLang is an open-source inference engine designed by the SGLang team to address these challenges. It optimizes CPU and GPU resources during inference, achieving significantly higher throughput than many competitive solutions. Its design utilizes an innovative approach that reduces redundant computations and enhances overall efficiency, thereby enabling organizations to manage better the complexities associated with LLM deployment.

RadixAttention is central to SGLang, which reuses shared prompt prefixes across multiple requests. This approach effectively minimizes the repeated processing of similar input sequences, improving throughput. The technique is advantageous in conversational interfaces or retrieval-augmented generation applications, where similar prompts are frequently processed. By eliminating redundant computations, the system ensures that resources are used more efficiently, contributing to faster processing times and more responsive applications.
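The prefix-reuse idea can be illustrated with a deliberately simplified sketch. SGLang's actual RadixAttention maintains a radix tree over token sequences inside the serving engine; this toy version uses a plain dictionary and a hypothetical compute_kv function (assumed to return a list of per-token KV entries) purely to convey the concept.

prefix_cache = {}  # maps an already-seen prompt prefix to its computed KV cache (toy stand-in)

def run_with_prefix_reuse(prompt, compute_kv):
    # Find the longest cached prefix of this prompt (SGLang uses a radix tree for this lookup)
    best = max((p for p in prefix_cache if prompt.startswith(p)), key=len, default="")
    if best:
        # Reuse the cached work and only process the new suffix
        kv = prefix_cache[best] + compute_kv(prompt[len(best):])
    else:
        kv = compute_kv(prompt)
    prefix_cache[prompt] = kv  # make this prompt reusable for future requests
    return kv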

Another critical feature of SGLang is its zero-overhead batch scheduler. Earlier inference systems often suffer from significant CPU overhead due to tasks like batch scheduling, memory allocation, and prompt preprocessing. In many cases, these operations result in idle periods for the GPU, which in turn hampers overall performance. SGLang addresses this bottleneck by overlapping CPU scheduling with ongoing GPU computations. The scheduler keeps the GPUs continuously engaged by running one batch ahead and preparing all necessary metadata for the next batch. Profiling has shown that this design reduces idle time and achieves measurable speed improvements, especially in configurations that involve smaller models and extensive tensor parallelism.

SGLang also incorporates a cache-aware load balancer that departs from conventional load balancing methods such as round-robin scheduling. Traditional techniques often ignore the state of the key-value (KV) cache, leading to inefficient resource use. In contrast, SGLang’s load balancer predicts the cache hit rates of different workers and directs incoming requests to those with the highest likelihood of a cache hit. This targeted routing increases throughput and enhances cache utilization. The mechanism relies on an approximate radix tree that reflects the current cache state on each worker, and it lazily updates this tree to impose minimal overhead. The load balancer, implemented in Rust for high concurrency, is especially well suited for distributed, multi-node environments.
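A toy approximation of that routing decision is shown below; the real balancer maintains an approximate radix tree per worker and is implemented in Rust, so this Python sketch only conveys the idea of preferring the worker most likely to score a cache hit.

def route_request(prompt, workers):
    """Pick the worker whose cached prefixes overlap most with the incoming prompt.

    `workers` maps a worker id to a set of prompt prefixes believed to be in that
    worker's KV cache (a stand-in for the approximate radix tree).
    """
    def predicted_hit(prefixes):
        return max((len(p) for p in prefixes if prompt.startswith(p)), default=0)

    return max(workers, key=lambda w: predicted_hit(workers[w]))

# Example: requests sharing a system prompt land on the worker that already cached it
workers = {"w0": {"You are a legal assistant."}, "w1": {"You are a travel agent."}}
print(route_request("You are a legal assistant. What is an NDA?", workers))  # -> "w0"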

In addition to these features, SGLang supports data parallelism attention, a strategy particularly tailored for DeepSeek models. While many modern models use tensor parallelism, which can lead to duplicated KV cache storage when scaling across multiple GPUs, SGLang employs a different method for models utilizing multi-head latent attention. In this approach, individual data parallel workers independently handle various batches, such as prefill, decode, or idle. The attention-processed data is then aggregated across workers before passing through subsequent layers, such as a mixture-of-experts layer, and later redistributed.

SGLang also excels in the efficient generation of structured outputs. Many inference systems struggle with the real-time decoding of formats like JSON, which can be a critical requirement in many applications. SGLang addresses this by integrating a specialized grammar backend known as xgrammar. This integration streamlines the decoding process, allowing the system to generate structured outputs up to ten times faster than other open-source alternatives. This capability is especially valuable when rapidly producing machine-readable data, essential for downstream processing or interactive applications.

Several high-profile companies have recognized SGLang's practical benefits. For example, ByteDance channels a large portion of its internal NLP pipelines through this engine, processing petabytes of data daily. Similarly, xAI has reported substantial cost savings by leveraging optimized scheduling and effective cache management, resulting in a notable reduction in serving expenses. These real-world applications highlight SGLang's ability to operate efficiently at scale, delivering performance improvements and cost benefits.

SGLang is released under the Apache 2.0 open-source license and is accessible for academic research and commercial applications. Its compatibility with OpenAI standards and the provision of a Python API allows developers to integrate it seamlessly into existing workflows. The engine supports many models, including popular ones such as Llama, Mistral, Gemma, Qwen, DeepSeek, Phi, and Granite. It is designed to work across various hardware platforms, including NVIDIA and AMD GPUs, and integrates advanced quantization techniques like FP8 and INT4. Future enhancements will include FP6 weight and FP8 activation quantization, faster startup times, and cross-cloud load balancing.
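Because the engine exposes an OpenAI-compatible API, integration can be as simple as pointing the standard OpenAI Python client at a running SGLang server. The launch command, port, and model name below are illustrative assumptions, so check the SGLang documentation for the exact options.

# Assumes an SGLang server is already running locally, e.g. (adjust to your setup):
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what RadixAttention does."}],
)
print(response.choices[0].message.content)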

Several Key Takeaways from the research on SGLang include:

SGLang addresses critical challenges in deploying large language models by optimizing the balance between CPU and GPU tasks.

RadixAttention minimizes redundant computations, improving throughput in conversational and retrieval scenarios.

A zero-overhead batch scheduler overlaps CPU scheduling with GPU operations to ensure continuous processing and reduce idle time.

A cache-aware load balancer efficiently predicts cache hit rates and routes requests, boosting overall performance and cache utilization.

Data parallelism attention reduces memory overhead and enhances decoding throughput for multi-head latent attention models.

The integration of xgrammar allows for the rapid generation of structured outputs, significantly improving processing speed for formats like JSON.

SGLang’s practical benefits are demonstrated by its adoption in large-scale production environments, which contribute to substantial cost savings and performance improvements.

Check out the GitHub Repo, Documentation and Technical Details. All credit for this research goes to the researchers of this project.
The post SGLang: An Open-Source Inference Engine Transforming LLM Deployment through CPU Scheduling, Cache-Aware Load Balancing, and Rapid Structured Output Generation appeared first on MarkTechPost.

This AI Paper Explores Emergent Response Planning in LLMs: Probing Hid …

Large language models (LLMs) operate by predicting the next token based on input data, yet their performance suggests they process information beyond mere token-level predictions. This raises the question of whether LLMs engage in implicit planning before generating complete responses. Understanding this phenomenon can lead to more transparent AI systems, improving efficiency and making output generation more predictable.

One challenge in working with LLMs is predicting how they will structure responses. These models generate text sequentially, which makes it difficult to control overall response length, reasoning depth, and factual accuracy. The lack of explicit planning mechanisms means that although LLMs generate human-like responses, their internal decision-making remains opaque. As a result, users often rely on prompt engineering to guide outputs, but this method lacks precision and provides no insight into how the model formulates its responses.

Existing techniques to refine LLM outputs include reinforcement learning, fine-tuning, and structured prompting. Researchers have also experimented with decision trees and external logic-based frameworks to impose structure. However, these methods do not fully capture how LLMs internally process information. 

The Shanghai Artificial Intelligence Laboratory research team has introduced a novel approach by analyzing hidden representations to uncover latent response-planning behaviors. Their findings suggest that LLMs encode key attributes of their responses even before the first token is generated. The research team examined their hidden representations and investigated whether LLMs engage in emergent response planning. They introduced simple probing models trained on prompt embeddings to predict upcoming response attributes. The study categorized response planning into three main areas: structural attributes, such as response length and reasoning steps, content attributes including character choices in story-writing tasks, and behavioral attributes, such as confidence in multiple-choice answers. By analyzing patterns in hidden layers, the researchers found that these planning abilities scale with model size and evolve throughout the generation process.

To quantify response planning, the researchers conducted a series of probing experiments. They trained models to predict response attributes using hidden state representations extracted before output generation. The experiments showed that probes could accurately predict upcoming text characteristics. The findings indicated that LLMs encode response attributes in their prompt representations, with planning abilities peaking at the beginning and end of responses. The study further demonstrated that models of different sizes share similar planning behaviors, with larger models exhibiting more pronounced predictive capabilities.
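A probe of this kind can be as simple as a linear model fit on prompt hidden states. The sketch below is a generic illustration rather than the paper's exact setup: the file paths are placeholders, Ridge regression is an assumed probe choice, and the Spearman correlation is included because it is the metric reported in the study.

import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# X: hidden states of the final prompt token, shape (num_prompts, hidden_dim)
# y: the attribute to predict, e.g., the length of the response the model later generated
X = np.load("prompt_hidden_states.npy")  # placeholder path
y = np.load("response_lengths.npy")      # placeholder path

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probe = Ridge(alpha=1.0).fit(X_train, y_train)  # simple linear probe on prompt representations
preds = probe.predict(X_test)

corr, _ = spearmanr(preds, y_test)
print("Spearman correlation between predicted and actual response attribute:", corr)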

The experiments revealed substantial differences in planning capabilities between base and fine-tuned models. Fine-tuned models exhibited better prediction accuracy in structural and behavioral attributes, confirming that planning behaviors are reinforced through optimization. For instance, response length prediction showed high correlation coefficients across models, with Spearman’s correlation reaching 0.84 in some cases. Similarly, reasoning step predictions exhibited strong alignment with ground-truth values. Classification tasks such as character choice in story writing and multiple-choice answer selection performed significantly above random baselines, further supporting the notion that LLMs internally encode elements of response planning.

Larger models demonstrated superior planning abilities across all attributes. Within the LLaMA and Qwen model families, planning accuracy improved consistently with increased parameter count. The study found that LLaMA-3-70B and Qwen2.5-72B-Instruct exhibited the highest prediction performance, while smaller models like Qwen2.5-1.5B struggled to encode long-term response structures effectively. Further, layer-wise probing experiments indicated that structural attributes emerged prominently in mid-layers, while content attributes became more pronounced in later layers. Behavioral attributes, such as answer confidence and factual consistency, remained relatively stable across different model depths.

These findings highlight a fundamental aspect of LLM behavior: they do not merely predict the next token but plan broader attributes of their responses before generating text. This emergent response planning ability has implications for improving model transparency and control. Understanding these internal processes can help refine AI models, leading to better predictability and reduced reliance on post-generation corrections. Future research may explore integrating explicit planning modules within LLM architectures to enhance response coherence and user-directed customization.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post This AI Paper Explores Emergent Response Planning in LLMs: Probing Hidden Representations for Predictive Text Generation appeared first on MarkTechPost.

Meet Baichuan-M1: A New Series of Large Language Models Trained on 20T …

While LLMs have shown remarkable advancements in general-purpose applications, their development for specialized fields like medicine remains limited. The complexity of medical knowledge and the scarcity of high-quality, domain-specific data make creating highly efficient medical LLMs challenging. Although models like GPT-4 and DeepseekR1 have demonstrated impressive capabilities across industries, their adaptation to the medical domain is hindered by the intricate nature of medical terminology, diverse disciplines, and constantly evolving literature. Unlike general applications, medical AI must interpret highly technical language and provide precise, contextually relevant responses, which traditional LLMs struggle to achieve.

One major obstacle in building effective medical LLMs is the limited accessibility of high-quality training data, which is restricted due to privacy concerns and regulatory barriers. Medical datasets consist of structured and unstructured information, including clinical notes, textbooks, and research articles, making comprehensive model training difficult. While approaches like fine-tuning general LLMs on medical datasets and applying transfer learning have been explored, these methods often fail to grasp the depth of medical knowledge fully. As a result, such models may perform well on specific tasks but lack the nuanced understanding necessary for complex medical inquiries, highlighting the need for more refined training strategies.

Researchers at Baichuan Inc. introduced Baichuan-M1, a specialized large language model series designed specifically for medical applications. Unlike traditional models that refine existing architectures through additional pretraining or post-training, Baichuan-M1 is built from scratch with a strong focus on medical expertise. Trained on 20 trillion tokens, including both general and medical-specific data, the model balances broad language understanding with domain-specific precision. It excels in general tasks like coding and mathematics and in medical applications such as diagnostics and treatment recommendations. With an optimized Transformer architecture, Baichuan-M1 sets a new benchmark for AI-driven advancements in healthcare.

The model architecture follows Llama and similar frameworks, incorporating pre-norm RMSNorm, SwishGlu in the FFN layer, and rotary position embeddings. The study integrates global and sliding window attention to optimize inference efficiency, increasing the head dimension to 256 for global layers. Additionally, temporal short convolutions on key-value attention enhance in-context learning. The model employs a hybrid tokenizer for medical and general text, a curriculum-based training strategy with progressive data complexity, and adaptive gradient clipping for stability. Supervised fine-tuning refines general reasoning and medical-specific tasks, ensuring robust language understanding, medical reasoning, and long-document handling capabilities while maintaining inference efficiency.

Using various benchmarks, Baichuan-M1-14B-Base's code and mathematical abilities were evaluated against the Qwen2.5 series models. Code generation performance was tested with the EvalPlus framework and Bigcodebench, while mathematical proficiency was assessed using the MATH and CMATH datasets. Although the 14B-Instruct variant still lags behind proprietary models like Claude-3.5-Sonnet and GPT-4o, the gap has narrowed significantly. The results demonstrate that Baichuan-M1-14B-Base performs competitively in certain tasks, showcasing its strengths in code generation and mathematical reasoning compared to other advanced models.

In conclusion, traditional methods for adapting LLMs to specialized fields often involve fine-tuning existing models. However, experiments suggest that further training of pre-existing models struggles to deliver domain-specific improvements without sacrificing general performance. In the medical domain, fine-tuning general models with domain-specific data may be less effective than training from scratch. Baichuan-M1 was developed with this approach, using 20 trillion tokens to enhance medical expertise while maintaining general capabilities. Open-sourcing Baichuan-M1-14B enables further research, though challenges remain in rare disease diagnosis and real-world applications. Its continued evolution could significantly advance AI-driven medical decision-making.

Check out the Paper, Baichuan-M1-14B-Base and Baichuan-M1-14B-Instruct. All credit for this research goes to the researchers of this project.
The post Meet Baichuan-M1: A New Series of Large Language Models Trained on 20T Tokens with a Dedicated Focus on Enhancing Medical Capabilities appeared first on MarkTechPost.

How Rocket Companies modernized their data science solution on AWS

This post was written with Dian Xu and Joel Hawkins of Rocket Companies.
Rocket Companies is a Detroit-based FinTech company with a mission to “Help Everyone Home”. With the current housing shortage and affordability concerns, Rocket simplifies the homeownership process through an intuitive and AI-driven experience. This comprehensive framework streamlines every step of the homeownership journey, empowering consumers to search, purchase, and manage home financing effortlessly. Rocket integrates home search, financing, and servicing in a single environment, providing a seamless and efficient experience.
The Rocket brand is synonymous with simple, fast, and trustworthy digital solutions for complex transactions. Rocket is dedicated to helping clients realize their dream of homeownership and financial freedom. Since its inception, Rocket has grown from a single mortgage lender to a network of businesses that creates new opportunities for its clients.
Rocket takes a complicated process and uses technology to make it simpler. Applying for a mortgage can be complex and time-consuming. That’s why we use advanced technology and data analytics to streamline every step of the homeownership experience, from application to closing. By analyzing a wide range of data points, we’re able to quickly and accurately assess the risk associated with a loan, enabling us to make more informed lending decisions and get our clients the financing they need.
Our goal at Rocket is to provide a personalized experience for both our current and prospective clients. Rocket’s diverse product offerings can be customized to meet specific client needs, while our team of skilled bankers must match with the best client opportunities that align with their skills and knowledge. Maintaining strong relationships with our large, loyal client base and hedge positions to cover financial obligations is key to our success. With the volume of business we do, even small improvements can have a significant impact.
In this post, we share how we modernized Rocket’s data science solution on AWS to increase the speed to delivery from eight weeks to under one hour, improve operational stability and support by reducing incident tickets by over 99% in 18 months, power 10 million automated data science and AI decisions made daily, and provide a seamless data science development experience.
Rocket’s legacy data science environment challenges
Rocket’s previous data science solution was built around Apache Spark and combined the use of a legacy version of the Hadoop environment and vendor-provided Data Science Experience development tools. The Hadoop environment was hosted on Amazon Elastic Compute Cloud (Amazon EC2) servers, managed in-house by Rocket’s technology team, while the data science experience infrastructure was hosted on premises. Communication between the two systems was established through Kerberized Apache Livy (HTTPS) connections over AWS PrivateLink.
Data exploration and model development were conducted using well-known machine learning (ML) tools such as Jupyter or Apache Zeppelin notebooks. Apache Hive was used to provide a tabular interface to data stored in HDFS, and to integrate with Apache Spark SQL. Apache HBase was employed to offer real-time key-based access to data. Model training and scoring was performed either from Jupyter notebooks or through jobs scheduled by Apache’s Oozie orchestration tool, which was part of the Hadoop implementation.
Despite the benefits of this architecture, Rocket faced challenges that limited its effectiveness:

Accessibility limitations: The data lake was stored in HDFS and only accessible from the Hadoop environment, hindering integration with other data sources. This also led to a backlog of data that needed to be ingested.
Steep learning curve for data scientists: Many of Rocket’s data scientists did not have experience with Spark, which had a more nuanced programming model compared to other popular ML solutions like scikit-learn. This created a challenge for data scientists to become productive.
Responsibility for maintenance and troubleshooting: Rocket’s DevOps/Technology team was responsible for all upgrades, scaling, and troubleshooting of the Hadoop cluster, which was installed on bare EC2 instances. This resulted in a backlog of issues with both vendors that remained unresolved.
Balancing development vs. production demands: Rocket had to manage work queues between development and production, which were always competing for the same resources.
Deployment challenges: Rocket sought to support more real-time and streaming inferencing use cases, but this was limited by the capabilities of MLeap for real-time models and Spark Streaming for streaming use cases, which were still experimental at that time.
Inadequate data security and DevOps support: The previous solution lacked robust security measures, and there was limited support for development and operations of the data science work.

Rocket’s legacy data science architecture is shown in the following diagram.

The diagram depicts the flow; the key components are detailed below:

Data Ingestion: Data is ingested into the system using Attunity data ingestion in Spark SQL.
Data Storage and Processing: All compute is done as Spark jobs inside of a Hadoop cluster using Apache Livy and Spark. Data is stored in HDFS and is accessed via Hive, which provides a tabular interface to the data and integrates with Spark SQL. HBase is employed to offer real-time key-based access to data.
Model Development: Data exploration and model development are conducted using tools such as Jupyter or Apache Zeppelin notebooks, which communicate with the Spark server over a Kerberized Livy connection.
Model Training and Scoring: Model training and scoring is performed either from Jupyter notebooks or through jobs scheduled by Apache’s Oozie orchestration tool, which is part of the Hadoop implementation.

Rocket’s migration journey
At Rocket, we believe in the power of continuous improvement and constantly seek out new opportunities. One such opportunity is using data science solutions, but to do so, we must have a strong and flexible data science environment.
To address the legacy data science environment challenges, Rocket decided to migrate its ML workloads to the Amazon SageMaker AI suite. This would allow us to deliver more personalized experiences and understand our customers better. To promote the success of this migration, we collaborated with the AWS team to create automated and intelligent digital experiences that demonstrated Rocket’s understanding of its clients and kept them connected.
We implemented an AWS multi-account strategy, standing up Amazon SageMaker Studio in a build account using a network-isolated Amazon VPC. This allows us to separate development and production environments, while also improving our security stance.
We moved our new work to SageMaker Studio and our legacy Hadoop workloads to Amazon EMR, connecting to the old Hadoop cluster using Livy and SageMaker notebooks to ease the transition. This gives us access to a wider range of tools and technologies, enabling us to choose the most appropriate ones for each problem we’re trying to solve.
In addition, we moved our data from HDFS to Amazon Simple Storage Service (Amazon S3), and now use Amazon Athena and AWS Lake Formation to provide proper access controls to production data. This makes it easier to access and analyze the data, and to integrate it with other systems. The team also provides secure interactive integration through Amazon Elastic Kubernetes Service (Amazon EKS), further improving the company’s security stance.
SageMaker AI has been instrumental in empowering our data science community with the flexibility to choose the most appropriate tools and technologies for each problem, resulting in faster development cycles and higher model accuracy. With SageMaker Studio, our data scientists can seamlessly develop, train, and deploy models without the need for additional infrastructure management.
As a result of this modernization effort, SageMaker AI enabled Rocket to scale our data science solution across Rocket Companies and integrate using a hub-and-spoke model. The ability of SageMaker AI to automatically provision and manage instances has allowed us to focus on our data science work rather than infrastructure management, increasing the number of models in production by five times and data scientists’ productivity by 80%.
Our data scientists are empowered to use the most appropriate technology for the problem at hand, and our security stance has improved. Rocket can now compartmentalize data and compute, as well as compartmentalize development and production. Additionally, we are able to provide model tracking and lineage using Amazon SageMaker Experiments and artifacts discoverable using the SageMaker model registry and Amazon SageMaker Feature Store. All the data science work has now been migrated onto SageMaker, and all the old Hadoop work has been migrated to Amazon EMR.
Overall, SageMaker AI has played a critical role in enabling Rocket’s modernization journey by building a more scalable and flexible ML framework, reducing operational burden, improving model accuracy, and accelerating deployment times.
The successful modernization allowed Rocket to overcome our previous limitations and better support our data science efforts. We were able to improve our security stance, make work more traceable and discoverable, and give our data scientists the flexibility to choose the most appropriate tools and technologies for each problem. This has helped us better serve our customers and drive business growth.
Rocket’s new data science solution architecture on AWS is shown in the following diagram.

The solution consists of the following components:

Data ingestion: Data is ingested into the data account from on-premises and external sources.
Data refinement: Raw data is refined into consumable layers (raw, processed, conformed, and analytical) using a combination of AWS Glue extract, transform, and load (ETL) jobs and EMR jobs.
Data access: Refined data is registered in the data account’s AWS Glue Data Catalog and exposed to other accounts via Lake Formation. Analytic data is stored in Amazon Redshift. Lake Formation makes this data available to both the build and compute accounts. For the build account, access to production data is restricted to read-only.
Development: Data science development is done using SageMaker Studio. Data engineering development is done using AWS Glue Studio. Both disciplines have access to Amazon EMR for Spark development. Data scientists have access to the entire SageMaker ecosystem in the build account.
Deployment: SageMaker trained models developed in the build account are registered with an MLFlow instance. Code artifacts for both data science activities and data engineering activities are stored in Git. Deployment initiation is controlled as part of CI/CD.
Workflows: We have a number of workflow triggers. For online scoring, we typically provide an external-facing endpoint using Amazon EKS with Istio. We have numerous jobs that are launched by AWS Lambda functions that in turn are triggered by timers or events. Processes that run may include AWS Glue ETL jobs, EMR jobs for additional data transformations or model training and scoring activities, or SageMaker pipelines and jobs performing training or scoring activities.
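To make the event-driven workflow pattern above more concrete, the following is a minimal sketch of a Lambda handler that starts a SageMaker pipeline execution when triggered by a timer or event. The pipeline name, parameter names, and environment variable are illustrative placeholders, not Rocket's actual configuration.

# Minimal sketch: a Lambda function that launches a SageMaker pipeline run.
# The pipeline name, parameters, and environment variable are hypothetical.
import json
import os

import boto3

sagemaker = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Kick off a training or scoring pipeline; parameters can be derived from the event
    response = sagemaker.start_pipeline_execution(
        PipelineName=os.environ.get("PIPELINE_NAME", "daily-scoring-pipeline"),
        PipelineParameters=[
            {"Name": "InputDataS3Uri",
             "Value": event.get("input_s3_uri", "s3://example-bucket/input/")},
        ],
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"executionArn": response["PipelineExecutionArn"]}),
    }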

Migration impact
We’ve come a long way in modernizing our infrastructure and workloads. We started our journey supporting six business channels and 26 models in production, with dozens in development. Deployment times stretched for months and required a team of three system engineers and four ML engineers to keep everything running smoothly. Despite the support of our internal DevOps team, our issue backlog with the vendor was an unenviable 200+.
Today, we are supporting nine organizations and over 20 business channels, with a whopping 210+ models in production and many more in development. Our average deployment time has gone from months to just weeks—sometimes even down to mere days! With just one part-time ML engineer for support, our average issue backlog with the vendor is practically non-existent. We now support over 120 data scientists, ML engineers, and analytical roles. Our framework mix has expanded to include 50% SparkML models and a diverse range of other ML frameworks, such as PyTorch and scikit-learn. These advancements have given our data science community the power and flexibility to tackle even more complex and challenging projects with ease.
The following table compares some of our metrics before and after migration.

Metric | Before Migration | After Migration
Speed to Delivery | New data ingestion project took 4–8 weeks | Data-driven ingestion takes under one hour
Operation Stability and Supportability | Over a hundred incidents and tickets in 18 months | Fewer incidents: one per 18 months
Data Science | Data scientists spent 80% of their time waiting on their jobs to run | Seamless data science development experience
Scalability | Unable to scale | Powers 10 million automated data science and AI decisions made daily

Lessons learned
Throughout the journey of modernizing our data science solution, we’ve learned valuable lessons that we believe could be of great help to other organizations who are planning to undertake similar endeavors.
First, we’ve come to realize that managed services can be a game changer in optimizing your data science operations.
The isolation of development into its own account while providing read-only access to production data is a highly effective way of enabling data scientists to experiment and iterate on their models without putting your production environment at risk. This is something that we’ve achieved through the combination of SageMaker AI and Lake Formation.
Another lesson we learned is the importance of training and onboarding for teams. This is particularly true for teams that are moving to a new environment like SageMaker AI. It’s crucial to understand the best practices of utilizing the resources and features of SageMaker AI, and to have a solid understanding of how to move from notebooks to jobs.
Lastly, we found that although Amazon EMR still requires some tuning and optimization, the administrative burden is much lighter compared to hosting directly on Amazon EC2. This makes Amazon EMR a more scalable and cost-effective solution for organizations who need to manage large data processing workloads.
Conclusion
This post provided an overview of the successful partnership between AWS and Rocket Companies. Through this collaboration, Rocket Companies was able to migrate many ML workloads and implement a scalable ML framework. In its ongoing work with AWS, Rocket Companies remains committed to innovation and to staying at the forefront of customer satisfaction.
Don’t let legacy systems hold back your organization’s potential. Discover how AWS can assist you in modernizing your data science solution and achieving remarkable results, similar to those achieved by Rocket Companies.

About the Authors
Dian Xu is the Senior Director of Engineering in Data at Rocket Companies, where she leads transformative initiatives to modernize enterprise data platforms and foster a collaborative, data-first culture. Under her leadership, Rocket’s data science, AI & ML platforms power billions of automated decisions annually, driving innovation and industry disruption. A passionate advocate for Gen AI and cloud technologies, Xu is also a sought-after speaker at global forums, inspiring the next generation of data professionals. Outside of work, she channels her love of rhythm into dancing, embracing styles from Bollywood to Bachata as a celebration of cultural diversity.
Joel Hawkins is a Principal Data Scientist at Rocket Companies, where he is responsible for the data science and MLOps platform. Joel has decades of experience developing sophisticated tooling and working with data at large scales. A driven innovator, he works hand in hand with data science teams to ensure that we have the latest technologies available to provide cutting edge solutions. In his spare time, he is an avid cyclist and has been known to dabble in vintage sports car restoration.
Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services. He partners with North American FinTech companies like Rocket and other financial services organizations to drive cloud and AI strategy, accelerating AI adoption at scale. With deep expertise in AI & ML, Generative AI, and cloud-native architecture, he helps financial institutions unlock new revenue streams, optimize operations, and drive impactful business transformation. Sajjan collaborates closely with Rocket Companies to advance its mission of building an AI-fueled homeownership platform to Help Everyone Home. Outside of work, he enjoys traveling, spending time with his family, and is a proud father to his daughter.
Alak Eswaradass is a Principal Solutions Architect at AWS based in Chicago, IL. She is passionate about helping customers design cloud architectures using AWS services to solve business challenges and is enthusiastic about solving a variety of ML use cases for AWS customers. When she’s not working, Alak enjoys spending time with her daughters and exploring the outdoors with her dogs.

AWS and DXC collaborate to deliver customizable, near real-time voice- …

Providing effective multilingual customer support in global businesses presents significant operational challenges. Through collaboration between AWS and DXC Technology, we’ve developed a scalable voice-to-voice (V2V) translation prototype that transforms how contact centers handle multi-lingual customer interactions.
In this post, we discuss how AWS and DXC used Amazon Connect and other AWS AI services to deliver near real-time V2V translation capabilities.
Challenge: Serving customers in multiple languages
In Q3 2024, DXC Technology approached AWS with a critical business challenge: their global contact centers needed to serve customers in multiple languages without the exponential cost of hiring language-specific agents for the lower volume languages. Previously, DXC had explored several existing alternatives but found limitations in each approach – from communication constraints to infrastructure requirements that impacted reliability, scalability, and operational costs. DXC and AWS decided to organize a focused hackathon where DXC and AWS Solution Architects collaborated to:

Define essential requirements for real-time translation
Establish latency and accuracy benchmarks
Create seamless integration paths with existing systems
Develop a phased implementation strategy
Prepare and test an initial proof of concept setup

Business impact
For DXC, this prototype was used as an enabler, allowing technical talent maximization, operational transformation, and cost improvements through:

Best technical expertise delivery – Hiring and matching agents based on technical knowledge rather than spoken language, making sure customers get top technical support regardless of language barriers
Global operational flexibility – Removing geographical and language constraints in hiring, placement, and support delivery while maintaining consistent service quality across all languages
Cost reduction – Eliminating multi-language expertise premiums, specialized language training, and infrastructure costs through pay-per-use translation model
Similar experience to native speakers – Maintaining natural conversation flow with near real-time translation and audio feedback, while delivering premium technical support in customer’s preferred language

Solution overview
The Amazon Connect V2V translation prototype uses AWS advanced speech recognition and machine translation technologies to enable real-time conversation translation between agents and customers, allowing them to speak in their preferred languages while having natural conversations. It consists of the following key components:

Speech recognition – The customer’s spoken language is captured and converted into text using Amazon Transcribe, which serves as the speech recognition engine. The transcript (text) is then fed into the machine translation engine.
Machine translation – Amazon Translate, the machine translation engine, translates the customer’s transcript into the agent’s preferred language in near real time. The translated transcript is converted back into speech using Amazon Polly, which serves as the text-to-speech engine (see the sketch after this list).
Bidirectional translation – The process is reversed for the agent’s response, translating their speech into the customer’s language and delivering the translated audio to the customer.
Seamless integration – The V2V translation sample project integrates with Amazon Connect, enabling agents to handle customer interactions in multiple languages without any additional effort or training, using the Amazon Connect Streams JS and Amazon Connect RTC JS libraries.
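To ground the translation and text-to-speech legs described above, here is a minimal server-side sketch that chains Amazon Translate and Amazon Polly. The prototype itself drives these services from the agent’s browser through the libraries noted above, so the language codes, voice, and function name here are illustrative assumptions only.

# Illustrative sketch of the translate-then-synthesize leg; language codes and voice are examples.
import boto3

translate = boto3.client("translate")
polly = boto3.client("polly")

def translate_and_synthesize(transcript, source_lang="es", target_lang="en"):
    # Translate the transcript produced by the speech recognition step
    translation = translate.translate_text(
        Text=transcript,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )
    # Convert the translated text back into speech for the listening party
    speech = polly.synthesize_speech(
        Text=translation["TranslatedText"],
        OutputFormat="mp3",
        VoiceId="Joanna",
    )
    return speech["AudioStream"].read()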

The prototype can be extended with other AWS AI services to further customize the translation capabilities. It’s open source and ready for customization to meet your specific needs.
The following diagram illustrates the solution architecture.

The following screenshot illustrates a sample agent web application.

The user interface consists of three sections:

Contact Control Panel – A softphone client using Amazon Connect
Customer Controls – Customer-to-agent interaction controls, including Transcribe Customer Voice, Translate Customer Voice, and Synthesize Customer Voice
Agent controls – Agent-to-customer interaction controls, including Transcribe Agent Voice, Translate Agent Voice, and Synthesize Agent Voice

Challenges when implementing near real-time voice translation
The Amazon Connect V2V sample project was designed to minimize the audio processing time from the moment the customer or agent finishes speaking until the translated audio stream starts. However, even with the shortest audio processing time, the user experience still doesn’t match that of a real conversation in which both parties speak the same language. This is due to the specific pattern of the customer only hearing the agent’s translated speech, and the agent only hearing the customer’s translated speech. The following diagram displays that pattern.

The example workflow consists of the following steps:

The customer starts speaking in their own language, and speaks for 10 seconds.
Because the agent only hears the customer’s translated speech, the agent first hears 10 seconds of silence.
When the customer finishes speaking, the audio processing time takes 1–2 seconds, during which time both the customer and agent hear silence.
The customer’s translated speech is streamed to the agent. During that time, the customer hears silence.
When the customer’s translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
Because the customer only hears the agent’s translated speech, the customer hears 10 seconds of silence.
When the agent finishes speaking, the audio processing time takes 1–2 seconds, during which time both the customer and agent hear silence.
The agent’s translated speech is streamed to the customer. During that time, the agent hears silence.

In this scenario, the customer hears a single block of 22–24 seconds of a complete silence, from the moment they finished speaking until they hear the agent’s translated voice. This creates a suboptimal experience, because the customer might not be certain what is happening during these 22–24 seconds—for instance, if the agent was able to hear them, or if there was a technical issue.
Audio streaming add-ons
In a face-to-face conversation scenario between two people who don’t speak the same language, they might have another person act as a translator or interpreter. An example workflow consists of the following steps:

Person A speaks in their own language, which is heard by Person B and the translator.
The translator translates what Person A said to Person B’s language. The translation is heard by Person B and Person A.

Essentially, Person A and Person B hear each other speaking their own language, and they also hear the translation (from the translator). There’s no waiting in silence, which is even more important in non-face-to-face conversations (such as contact center interactions).
To optimize the customer/agent experience, the Amazon Connect V2V sample project implements audio streaming add-ons to simulate a more natural conversation experience. The following diagram illustrates an example workflow.

The workflow consists of the following steps:

The customer starts speaking in their own language, and speaks for 10 seconds.
The agent hears the customer’s original voice, at a lower volume (“Stream Customer Mic to Agent” enabled).
When the customer finishes speaking, the audio processing time takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback—contact center background noise—at a very low volume (“Audio Feedback” enabled).
The customer’s translated speech is then streamed to the agent. During that time, the customer hears their translated speech, at a lower volume (“Stream Customer Translation to Customer” enabled).
When the customer’s translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
The customer hears the agent’s original voice, at a lower volume (“Stream Agent Mic to Customer” enabled).
When the agent finishes speaking, the audio processing time takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback—contact center background noise—at a very low volume (“Audio Feedback” enabled).
The agent’s translated speech is then streamed to the customer. During that time, the agent hears their translated speech, at a lower volume (“Stream Agent Translation to Agent” enabled).

In this scenario, the customer hears two short blocks (1–2 seconds) of subtle audio feedback, instead of a single block of 22–24 seconds of complete silence. This pattern is much closer to a face-to-face conversation that includes a translator.
The audio streaming add-ons provide additional benefits, including:

Voice characteristics – In cases when the agent and customer only hear their translated and synthesized speech, the actual voice characteristics are lost. For instance, the agent can’t hear if the customer was talking slow or fast, if the customer was upset or calm, and so on. The translated and synthesized speech doesn’t carry over that information.
Quality assurance – In cases when call recording is enabled, only the customer’s original voice and the agent’s synthesized speech are recorded, because the translation and the speech synthesis are done on the agent (client) side. This makes it difficult for QA teams to properly evaluate and audit the conversations, including the many silent blocks within them. Instead, when the audio streaming add-ons are enabled, there are no silent blocks, and the QA team can hear the agent’s original voice, the customer’s original voice, and their respective translated and synthesized speech, all in a single audio file.
Transcription and translation accuracy – Having both the original and translated speech available in the call recording makes it straightforward to detect specific words that would improve transcription accuracy (by using Amazon Transcribe custom vocabularies) or translation accuracy (using Amazon Translate custom terminologies), to make sure that your brand names, character names, model names, and other unique content are transcribed and translated to the desired result.
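As a hedged illustration of the accuracy controls mentioned in the last point, the following sketch passes a custom terminology name to Amazon Translate so brand and product names survive translation; the terminology name and sample text are placeholders you would create and adapt beforehand (Amazon Transcribe custom vocabularies are attached to transcription requests in a similar way).

# Illustrative use of an Amazon Translate custom terminology; "brand-terms" is a placeholder
# terminology that you would create and upload in advance.
import boto3

translate = boto3.client("translate")

response = translate.translate_text(
    Text="Please restart your AnyCompany HomeHub device.",
    SourceLanguageCode="en",
    TargetLanguageCode="es",
    TerminologyNames=["brand-terms"],  # keeps brand and product names intact
)
print(response["TranslatedText"])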

Get started with Amazon Connect V2V
Ready to transform your contact center’s communication? Our Amazon Connect V2V sample project is now available on GitHub. We invite you to explore, deploy, and experiment with this powerful prototype. You can use it as a foundation for developing innovative multilingual communication solutions in your own contact center, through the following key steps:

Clone the GitHub repository.
Test different configurations for audio streaming add-ons.
Review the sample project’s limitations in the README.
Develop your implementation strategy:

Implement robust security and compliance controls that meet your organization’s standards.
Collaborate with your customer experience team to define your specific use case requirements.
Balance between automation and the agent’s manual controls (for example, use an Amazon Connect contact flow to automatically set contact attributes for preferred languages and audio streaming add-ons).
Use your preferred transcription, translation, and text-to-speech engines, based on specific language support requirements and business, legal, and regional preferences.
Plan a phased rollout, starting with a pilot group, then iteratively optimize your transcription custom vocabularies and translation custom terminologies.

Conclusion
The Amazon Connect V2V sample project demonstrates how Amazon Connect and advanced AWS AI services can break down language barriers, enhance operational flexibility, and reduce support costs. Get started now and revolutionize how your contact center communicates across language barriers!

About the Authors
Milos Cosic is a Principal Solutions Architect at AWS.
EJ Ferrell is a Senior Solutions Architect at AWS.
Adam El Tanbouli is a Technical Program Manager for Prototyping and Support Services at DXC Modern Workplace.

Orchestrate an intelligent document processing workflow using tools in …

Generative AI is revolutionizing enterprise automation, enabling AI systems to understand context, make decisions, and act independently. These foundation models (FMs) are becoming powerful partners in solving sophisticated business problems. At AWS, we’re using the power of models in Amazon Bedrock to drive automation of complex processes that have traditionally been challenging to streamline.
In this post, we focus on one such complex workflow: document processing. This serves as an example of how generative AI can streamline operations that involve diverse data types and formats.
Challenges with document processing
Document processing often involves handling three main categories of documents:

Structured – For example, forms with fixed fields
Semi-structured – Documents that have a predictable set of information but might vary in layout or presentation
Unstructured – For example, paragraphs of text or notes

Traditionally, processing these varied document types has been a pain point for many organizations. Rule-based systems or specialized machine learning (ML) models often struggle with the variability of real-world documents, especially when dealing with semi-structured and unstructured data.
We demonstrate how generative AI along with external tool use offers a more flexible and adaptable solution to this challenge. Through a practical use case of processing a patient health package at a doctor’s office, you will see how this technology can extract and synthesize information from all three document types, potentially improving data accuracy and operational efficiency.
Solution overview
This intelligent document processing solution uses Amazon Bedrock FMs to orchestrate a sophisticated workflow for handling multi-page healthcare documents with mixed content types. The solution uses the FM’s tool use capabilities, accessed through the Amazon Bedrock Converse API. This enables the FMs to not just process text, but to actively engage with various external tools and APIs to perform complex document analysis tasks.
The solution employs a strategic multi-model approach, optimizing for both performance and cost by selecting the most appropriate model for each task:

Anthropic’s Claude 3 Haiku – Serves as the workflow orchestrator due to its low latency and cost-effectiveness. This model’s strong reasoning and tool use abilities make it ideal for the following:

Coordinating the overall document processing pipeline
Making routing decisions for different document types
Invoking appropriate processing functions
Managing the workflow state

Anthropic’s Claude 3.5 Sonnet (v2) – Used for its advanced reasoning capabilities and notably strong visual processing abilities, particularly its skill at interpreting charts and graphs. Its key strengths include:

Interpreting complex document layouts and structure
Extracting text from tables and forms
Processing medical charts and handwritten notes
Converting unstructured visual information into structured data

Through the Amazon Bedrock Converse API’s standardized tool use (function calling) interface, these models can work together seamlessly to invoke document processing functions, call external APIs for data validation, trigger storage operations, and execute content transformation tasks. The API serves as the foundation for this intelligent workflow, providing a unified interface for model communication while maintaining conversation state throughout the processing pipeline. The API’s standardized approach to tool definition and function calling provides consistent interaction patterns across different processing stages. For more details on how tool use works, refer to The complete tool use workflow.
The solution incorporates Amazon Bedrock Guardrails to implement robust content filtering policies and sensitive information detection, making sure that personal health information (PHI) and personally identifiable information (PII) data is appropriately protected through automated detection and masking capabilities while maintaining industry standard compliance throughout the document processing workflow.
Prerequisites
You need the following prerequisites before you can proceed with this solution. For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

An AWS account with an AWS Identity and Access Management (IAM) role that has permissions to Amazon Bedrock and Amazon SageMaker Studio.
Access to the Anthropic’s Claude 3.5 Sonnet (v2) and Claude 3 Haiku models in Amazon Bedrock. For instructions, see Access Amazon Bedrock foundation models and CreateInferenceProfile.
Access to create an Amazon Bedrock guardrail. For more information, see Create a guardrail.

Use case and dataset
For our example use case, we examine a patient intake process at a healthcare institution. The workflow processes a patient health information package containing three distinct document types:

Structured document – A new patient intake form with standardized fields for personal information, medical history, and current symptoms. This form follows a consistent layout with clearly defined fields and check boxes, making it an ideal example of a structured document.
Semi-structured document – A health insurance card that contains essential coverage information. Although insurance cards generally contain similar information (policy number, group ID, coverage dates), they come from different providers with varying layouts and formats, showing the semi-structured nature of these documents.
Unstructured document – A handwritten doctor’s note from an initial consultation, containing free-form observations, preliminary diagnoses, and treatment recommendations. This represents the most challenging category of unstructured documents, where information isn’t confined to any predetermined format or structure.

The example document can be downloaded from the following GitHub repo.
This healthcare use case is particularly relevant because it encompasses common challenges in document processing: the need for high accuracy, compliance with healthcare data privacy requirements, and the ability to handle multiple document formats within a single workflow. The variety of documents in this patient package demonstrates how a modern intelligent document processing solution must be flexible enough to handle different levels of document structure while maintaining consistency and accuracy in data extraction.
The following diagram illustrates the solution workflow.

This self-orchestrated workflow demonstrates how modern generative AI solutions can balance capability, performance, and cost-effectiveness in transforming traditional document processing workflows in healthcare settings.
Deploy the solution

Create an Amazon SageMaker domain. For instructions, see Use quick setup for Amazon SageMaker AI.
Launch SageMaker Studio, then create and launch a JupyterLab space. For instructions, see Create a space.
Create a guardrail. Focus on adding sensitive information filters that would mask PII or PHI.
Clone the code from the GitHub repository: git clone https://github.com/aws-samples/anthropic-on-aws.git
Change the directory to the root of the cloned repository: cd medical-idp
Install dependencies: pip install -r requirements.txt
Update setup.sh with the guardrail ID you created in Step 3, then source it to set the environment variables: source setup.sh
Finally, start the Streamlit application: streamlit run streamlit_app.py

Now you’re ready to explore the intelligent document processing workflow using Amazon Bedrock.
Technical implementation
The solution is built around the Amazon Bedrock Converse API and tool use framework, with Anthropic’s Claude 3 Haiku serving as the primary orchestrator. When a document is uploaded through the Streamlit interface, Haiku analyzes the request and determines the sequence of tools needed by consulting the tool definitions in ToolConfig. These definitions include tools for the following:

Document processing pipeline – Handles initial PDF processing and classification
Document notes processing – Extracts information from medical notes
New patient information processing – Processes patient intake forms
Insurance form processing – Handles insurance card information

The following code is an example tool definition for extracting consultation notes. Here, extract_consultation_notes represents the name of the function that the orchestration workflow will call, and document_paths defines the schema of the input parameter that will be passed to the function. The FM will contextually extract the information from the document and pass to the method. A similar toolspec will be defined for each step. Refer to the GitHub repo for the full toolspec definition.
{
    "toolSpec": {
        "name": "extract_consultation_notes",
        "description": "Extract diagnostics information from a doctor's consultation notes. Along with the extraction include the full transcript in a <transcript> node",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "document_paths": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Paths to the files that were classified as DOC_NOTES"
                    }
                },
                "required": ["document_paths"]
            }
        }
    }
}

When a PDF document is uploaded through the Streamlit interface, it is temporarily stored and passed to the FileProcessor class along with the tool specification and a user prompt:
prompt = ("1. Extract 2. save and 3. summarize the information from the patient information package located at " + tmp_file + ". " +
          "The package might contain various types of documents including insurance cards. Extract and save information from all documents provided. " +
          "Perform any preprocessing or classification of the file provided prior to the extraction. " +
          "Set the enable_guardrails parameter to " + str(enable_guardrails) + ". " +
          "At the end, list all the tools that you had access to. Give an explanation of why each tool was used and, if you are not using a tool, explain why it was not used as well. " +
          "Think step by step.")

processor.process_file(prompt=prompt,
                       toolspecs=toolspecs, …
The BedrockUtils class manages the conversation with Anthropic’s Claude 3 Haiku through the Amazon Bedrock Converse API. It maintains the conversation state and handles the tool use workflow:
# From bedrockutility.py
def invoke_bedrock(self, message_list, system_message=[], tool_list=[],
                   temperature=0, maxTokens=2048, guardrail_config=None):
    response = self.bedrock.converse(
        modelId=self.model_id,
        messages=message_list,
        system=system_message,
        inferenceConfig={
            "maxTokens": maxTokens,
            "temperature": temperature
        },
        # Include tool definitions only when tools are provided
        **({"toolConfig": {"tools": tool_list}} if tool_list else {})
    )
    return response

When the processor receives a document, it initiates a conversation loop with Anthropic’s Claude 3 Haiku, which analyzes the document and determines which tools to use based on the content. The model acts as an intelligent orchestrator, making decisions about the following:

Which document processing tools to invoke
The sequence of processing steps
How to handle different document types within the same package
When to summarize and complete the processing

This orchestration is managed through a continuous conversation loop that processes tool requests and their results until the entire document package has been processed.
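The repository excerpts in this post don’t show the loop itself, but conceptually it resembles the following minimal sketch. It assumes invoke_bedrock returns the raw Converse API response and that tool_handlers maps each tool name to a local Python function; the function and variable names are illustrative, not the project’s exact implementation.

# Conceptual sketch of the tool use conversation loop (names are illustrative).
def run_tool_loop(bedrock_utils, messages, toolspecs, tool_handlers):
    while True:
        response = bedrock_utils.invoke_bedrock(
            message_list=messages, tool_list=toolspecs
        )
        output_message = response["output"]["message"]
        messages.append(output_message)

        # When the model stops requesting tools, orchestration is complete
        if response.get("stopReason") != "tool_use":
            return output_message

        # Execute each requested tool and feed the results back to the model
        tool_results = []
        for block in output_message["content"]:
            if "toolUse" in block:
                tool_use = block["toolUse"]
                result = tool_handlers[tool_use["name"]](**tool_use["input"])
                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [{"json": result}],
                    }
                })
        messages.append({"role": "user", "content": tool_results})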
The first key decision in the workflow is initiating the document classification process. Through the DocumentClassifier class, the solution uses Anthropic’s Claude 3.5 Sonnet to analyze and categorize each page of the uploaded document into three main types: intake forms, insurance cards, and doctor’s notes:
# from document_classifier.py
class DocumentClassifier:
    def __init__(self, file_handler):
        self.file_handler = file_handler
        self.sonnet_3_5_bedrock_utils = BedrockUtils(
            model_id=ModelIDs.anthropic_claude_3_5_sonnet
        )

    def categorize_document(self, file_paths):
        # Convert documents to binary format for model processing
        binary_data_array = []
        for file_path in file_paths:
            binary_data, media_type = self.file_handler.get_binary_for_file(file_path)
            binary_data_array.append((binary_data[0], media_type))

        # Prepare message for classification
        message_content = [
            {"image": {"format": media_type, "source": {"bytes": data}}}
            for data, media_type in binary_data_array
        ]

        # Create classification request
        message_list = [{
            "role": "user",
            "content": [
                *message_content,
                {"text": "What type of document is in this image?"}
            ]
        }]

        # Define system message for classification
        system_message = [{
            "text": '''You are a medical document processing agent.
            Categorize images as: INTAKE_FORM, INSURANCE_CARD, or DOC_NOTES'''
        }]

        # Get classification from model
        response = self.sonnet_3_5_bedrock_utils.invoke_bedrock(
            message_list=message_list,
            system_message=system_message
        )
        return [response['output']['message']]

Based on the classification results, the FM determines the next tool to be invoked. The tool’s description and input schema define exactly what information needs to be extracted. Following the previous example, let’s assume the next page to be processed is a consultation note. The workflow will invoke the extract_consultation_notes function. This function processes documents to extract detailed medical information. Like the classification process discussed earlier, it first converts the documents to binary format suitable for model processing. The key to accurate extraction lies in how the images and system message are combined:
def extract_info(self, file_paths):
    # Convert documents to binary data
    # (this follows the same pattern as in the classification function)
    message_content = [
        {"image": {"format": media_type, "source": {"bytes": data}}}
        for data, media_type in binary_data_array
    ]

    message_list = [{
        "role": "user",
        "content": [
            *message_content,  # Include the processed document images
            {"text": '''Extract all information from this file.
            If you find a visualization:
            - Provide a detailed description in natural language
            - Use domain specific language for the description
            '''}
        ]
    }]

    system_message = [{
        "text": '''You are a medical consultation agent with expertise in diagnosing and treating various health conditions.
        You have a deep understanding of human anatomy, physiology, and medical knowledge across different specialties.
        During the consultation, you review the patient's medical records, test results, and documentation provided.
        You analyze this information objectively and make associations between the data and potential diagnoses.
        Associate a confidence score to each piece of extracted information. This should reflect how confident the model is that the extracted value matches the requested entity.
        '''}
    ]

    response = self.bedrock_utils.invoke_bedrock(
        message_list=message_list,
        system_message=system_message
    )
    return [response['output']['message']]

The system message serves three crucial purposes:

Establish medical domain expertise for accurate interpretation.
Provide guidelines for handling different types of information (text and visualizations).
Provide a self-scored confidence. Although this is not an independent grading mechanism, the score is directionally indicative of how confident the model is in its own extraction.

Following the same pattern, the FM will use the other tools in the toolspec definition to save and summarize the results.
A unique advantage of using a multi-modal FM for the extraction task is its ability to have a deep understanding of the text it is extracting. For example, the following code is an abstract of the data schema we are requesting as input to the save_consultation_notes function. Refer to the code in constants.py for full definition. The model needs to not only extract a transcript, but also understand it to extract such structured data from an unstructured document. This significantly reduces the postprocessing efforts required for the data to be consumed by a downstream application.
"consultation": {
    "type": "object",
    "properties": {
        "date": {"type": "string"},
        "concern": {
            "type": "object",
            "properties": {
                "primaryComplaint": {
                    "type": "string",
                    "description": "Primary medical complaint of the patient. Only capture the medical condition. no timelines"
                },
                "duration": {"type": "number"},
                "durationUnit": {"type": "string", "enum": ["days", "weeks", "months", "years"]},
                "associatedSymptoms": {
                    "type": "object",
                    "additionalProperties": {
                        "type": "boolean"
                    },
                    "description": "Key-value pairs of symptoms and their presence (true) or absence (false)"
                },
                "absentSymptoms": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["primaryComplaint", "duration", "durationUnit"]
        }

The documents contain a treasure trove of personally identifiable information (PII) and personal health information (PHI). To redact this information, you can pass enable_guardrails as true. This will use the guardrail you set up earlier as part of the information extraction process and mask information identified as PII or PHI.
processor.process_file(prompt=prompt,
                       enable_guardrails=True,
                       toolspecs=toolspecs,
                       )
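Under the hood, enabling guardrails amounts to attaching a guardrail configuration to the Amazon Bedrock Converse API call. The following minimal sketch shows the shape of that configuration; the environment variable name, guardrail version, and sample message are assumptions for illustration, with the guardrail ID coming from the guardrail you created during deployment.

# Illustrative guardrail configuration for the Converse API; the environment variable
# name, version, and sample message are assumptions, not the project's exact wiring.
import os
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

message_list = [{"role": "user", "content": [{"text": "Summarize the patient intake form."}]}]

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=message_list,
    guardrailConfig={
        "guardrailIdentifier": os.environ["BEDROCK_GUARDRAIL_ID"],  # ID created in the deploy steps
        "guardrailVersion": "1",
        "trace": "enabled",  # include guardrail trace output for debugging
    },
)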
Finally, cross-document validation is crucial for maintaining data accuracy and compliance in healthcare settings. Although the current implementation performs basic consistency checks through the summary prompt, organizations can extend the framework by implementing a dedicated validation tool that integrates with their specific business rules and compliance requirements. Such a tool could perform sophisticated validation logic like insurance policy verification, appointment date consistency checks, or any other domain-specific validation requirements, providing complete data integrity across the document package.
Future considerations
As Amazon Bedrock continues to evolve, several powerful features can be integrated into this document processing workflow to enhance its enterprise readiness, performance, and cost-efficiency. Let’s explore how these advanced capabilities can take this solution to the next level:

Inference profiles in Amazon Bedrock define a model and its associated Regions for routing invocation requests, enabling various tasks such as usage tracking, cost monitoring, and cross-Region inference (see the sketch after this list). These profiles help users track metrics through Amazon CloudWatch logs, monitor costs with cost allocation tags, and increase throughput by distributing requests across multiple Regions.
Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused for multiple queries. Instead of reprocessing the entire context for each document, the workflow can reuse cached prompts, which is particularly beneficial when using the same image across different tooling workflows. With support for multiple cache checkpoints, this feature can substantially reduce processing time and inference costs while maintaining the workflow’s intelligent orchestration capabilities.
Intelligent prompt routing can dynamically select the most appropriate model for each task based on performance and cost requirements. Rather than explicitly assigning Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet for document analysis, the workflow can use intelligent routing to automatically choose the optimal model within the Anthropic family for each request. This approach simplifies model management while providing cost-effective processing of different document types, from simple structured forms to complex handwritten notes, all through a single endpoint.
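As a brief, hedged illustration of the first item, adopting a cross-Region inference profile typically amounts to swapping the plain model ID for an inference profile ID when calling the Converse API; the profile ID and prompt below are examples and should match what is enabled in your account and Regions.

# Illustrative use of a cross-Region inference profile ID in place of a plain model ID.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

response = bedrock_runtime.converse(
    # The "us." prefix denotes a US cross-Region inference profile (example ID)
    modelId="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Classify this document type."}]}],
)
print(response["output"]["message"]["content"][0]["text"])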

Conclusion
This intelligent document processing solution demonstrates the power of combining Amazon Bedrock FMs with tool use capabilities to create sophisticated, self-orchestrating workflows. By using Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet for complex visual tasks, the solution effectively handles structured, semi-structured, and unstructured documents while maintaining high accuracy and compliance standards.
Key benefits of this approach include:

Reduced manual processing through intelligent automation
Improved accuracy through specialized model selection
Built-in compliance with guardrails for sensitive data
Flexible architecture that adapts to various document types
Cost-effective processing through strategic model usage

As organizations continue to digitize their operations, solutions like this showcase how generative AI can transform traditional document processing workflows. The combination of powerful FMs in Amazon Bedrock and the tool use framework provides a robust foundation for building intelligent, scalable document processing solutions across industries.
For more information about Amazon Bedrock and its capabilities, visit the Amazon Bedrock User Guide.

About the Author

Raju Rangan is a Senior Solutions Architect at AWS. He works with government-sponsored entities, helping them build AI/ML solutions using AWS. When not tinkering with cloud solutions, you’ll catch him hanging out with family or smashing birdies in a lively game of badminton with friends.