Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2

Organizations are constantly seeking ways to harness the power of advanced large language models (LLMs) to enable a wide range of applications such as text generation, summarization, question answering, and many others. As these models grow more powerful and capable, deploying them in production environments while optimizing performance and cost-efficiency becomes more challenging.
Amazon Web Services (AWS) provides highly optimized and cost-effective solutions for deploying AI models, like the Mixtral 8x7B language model, for inference at scale. AWS Inferentia and AWS Trainium are AWS AI chips, purpose-built to deliver high throughput and low latency for inference and training of even the largest deep learning models. The Mixtral 8x7B model adopts the Mixture-of-Experts (MoE) architecture with eight experts. AWS Neuron—the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances—employs expert parallelism for the MoE architecture, sharding the eight experts across multiple NeuronCores.
This post demonstrates how to deploy and serve the Mixtral 8x7B language model on AWS Inferentia2 instances for cost-effective, high-performance inference. We’ll walk through model compilation using Hugging Face Optimum Neuron, which provides a set of tools enabling straightforward model loading, training, and inference, and the Text Generation Inference (TGI) Container, which has the toolkit for deploying and serving LLMs with Hugging Face. This will be followed by deployment to an Amazon SageMaker real-time inference endpoint, which automatically provisions and manages the Inferentia2 instances behind the scenes and provides a containerized environment to run the model securely and at scale.
While pre-compiled model versions exist, we’ll cover the compilation process to illustrate important configuration options and instance sizing considerations. This end-to-end guide combines Amazon Elastic Compute Cloud (Amazon EC2)-based compilation with SageMaker deployment to help you use Mixtral 8x7B’s capabilities with optimal performance and cost efficiency.
Step 1: Set up Hugging Face access
Before you can deploy the Mixtral 8x7B model, there are some prerequisites that you need to have in place.

The model is hosted on Hugging Face and uses their transformers library. To download and use the model, you need to authenticate with Hugging Face using a user access token. These tokens allow secure access for applications and notebooks to Hugging Face’s services. You first need to create a Hugging Face account if you don’t already have one, which you can then use to generate and manage your access tokens through the user settings.
The mistralai/Mixtral-8x7B-Instruct-v0.1 model that you will be working with in this post is a gated model. This means that you need to specifically request access from Hugging Face before you can download and work with the model.
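If you prefer to authenticate programmatically, for example from a notebook, the huggingface_hub library offers an equivalent login call. This is a minimal sketch; the token value is a placeholder for the access token you generate in your Hugging Face settings.

from huggingface_hub import login

# Placeholder token; create one under Settings > Access Tokens in your Hugging Face account
login(token="hf_...")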

Step 2: Launch an Inferentia2-powered EC2 Inf2 instance
To get started with an Amazon EC2 Inf2 instance for deploying the Mixtral 8x7B, either deploy the AWS CloudFormation template or use the AWS Management Console.
To launch an Inferentia2 instance using the console:

Navigate to the Amazon EC2 console and choose Launch Instance.
Enter a descriptive name for your instance.
Under Application and OS Images, search for and select the Hugging Face Neuron Deep Learning AMI, which comes pre-configured with the Neuron software stack for AWS Inferentia.
For Instance type, select inf2.24xlarge, which contains six Inferentia2 chips (12 NeuronCores).
Create or select an existing key pair to enable SSH access.
Create or select a security group that allows inbound SSH connections from the internet.
Under Configure Storage, set the root EBS volume to 512 GiB to accommodate the large model size.
After the settings are reviewed, choose Launch Instance.

With your Inf2 instance launched, connect to it over SSH by first locating the public IP or DNS name in the Amazon EC2 console. Later in this post, you will connect to a Jupyter notebook using a browser on port 8888. To do that, SSH tunnel to the instance using the key pair you configured during instance creation.

ssh -i "<pem file>" ubuntu@<instance DNS name> -L 8888:127.0.0.1:8888

After signing in, list the NeuronCores attached to the instance and their associated topology:

neuron-ls

For inf2.24xlarge, you should see the following output listing six Neuron devices:

instance-type: inf2.24xlarge
instance-id: i-…
+--------+--------+--------+-----------+---------+
| NEURON | NEURON | NEURON | CONNECTED |   PCI   |
| DEVICE | CORES  | MEMORY |  DEVICES  |   BDF   |
+--------+--------+--------+-----------+---------+
| 0      | 2      | 32 GB  | 1         | 10:1e.0 |
| 1      | 2      | 32 GB  | 0, 2      | 20:1e.0 |
| 2      | 2      | 32 GB  | 1, 3      | 10:1d.0 |
| 3      | 2      | 32 GB  | 2, 4      | 20:1f.0 |
| 4      | 2      | 32 GB  | 3, 5      | 10:1f.0 |
| 5      | 2      | 32 GB  | 4         | 20:1d.0 |
+--------+--------+--------+-----------+---------+

For more information on the neuron-ls command, see the Neuron LS User Guide.
Make sure the Inf2 instance is sized correctly to host the model. Each NeuronCore on AWS Inferentia2 has 16 GB of high-bandwidth memory (HBM). To accommodate an LLM like Mixtral 8x7B on AWS Inferentia2 (Inf2) instances, a technique called tensor parallelism is used. This allows the model’s weights, activations, and computations to be split and distributed across multiple NeuronCores in parallel. To determine the degree of tensor parallelism required, you need to calculate the total memory footprint of the model. This can be computed as:
total memory = bytes per parameter * number of parameters
The Mixtral-8x7B model consists of 46.7 billion parameters. With weights cast to float16, you need 93.4 GB to store the model weights. The total space required is often greater than just the model parameters because of the caching of attention layer projections (KV caching). This caching mechanism grows memory allocations linearly with sequence length and batch size. With a batch size of 1 and a sequence length of 1,024 tokens, the total memory footprint for the caching is 0.5 GB. The exact formula can be found in the AWS Neuron documentation, and the hyperparameter configuration required for these calculations is stored in the model’s config.json file.
Given that each NeuronCore has 16 GB of HBM, and the model requires approximately 94 GB of memory, a minimum tensor parallelism degree of 6 would theoretically suffice. However, with 32 attention heads, the tensor parallelism degree must be a divisor of this number.
Furthermore, considering the model’s size and the MoE implementation in transformers-neuronx, the supported tensor parallelism degrees are limited to 8, 16, and 32. For the example in this post, you will distribute the model across eight NeuronCores.
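To make this sizing arithmetic concrete, the following Python sketch reproduces the calculation. The KV cache term is a simplified approximation of the formula in the Neuron documentation (it ignores grouped-query attention), and the layer count and hidden size are taken from the model’s config.json.

import math

# Model weights: 46.7B parameters stored in float16 (2 bytes each)
params = 46.7e9
bytes_per_param = 2
weight_gb = params * bytes_per_param / 1e9            # ~93.4 GB

# Simplified KV cache estimate: 2 (keys and values) * batch * layers * seq_len * hidden_size * bytes
num_layers, hidden_size = 32, 4096                    # from config.json
batch_size, seq_len = 1, 1024
kv_cache_gb = 2 * batch_size * num_layers * seq_len * hidden_size * bytes_per_param / 1e9  # ~0.5 GB

total_gb = weight_gb + kv_cache_gb                    # ~94 GB
hbm_per_core_gb = 16
min_cores = math.ceil(total_gb / hbm_per_core_gb)     # 6, before the divisibility constraints discussed above

print(f"Weights: {weight_gb:.1f} GB, KV cache: {kv_cache_gb:.2f} GB, "
      f"total: {total_gb:.1f} GB, minimum NeuronCores: {min_cores}")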
Compile Mixtral-8x7B model to AWS Inferentia2
The Neuron SDK includes a specialized compiler that automatically optimizes the model format for efficient execution on AWS Inferentia2.

To start this process, launch the container and pass the Inferentia devices to the container. For more information about launching the neuronx-tgi container see Deploy the Text Generation Inference (TGI) Container on a dedicated host.

docker run -it --entrypoint /bin/bash \
  --net=host -v $(pwd):$(pwd) -w $(pwd) \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  ghcr.io/huggingface/neuronx-tgi:0.0.25

Inside the container, sign in to the Hugging Face Hub to access gated models, such as the Mixtral-8x7B-Instruct-v0.1. See the previous section for Setup Hugging Face Access. Make sure to use a token with read and write permissions so you can later save the compiled model to the Hugging Face Hub.

huggingface-cli login --token hf_…

After signing in, compile the model with optimum-cli. This process will download the model artifacts, compile the model, and save the results in the specified directory.
The Neuron chips are designed to execute models with fixed input shapes for optimal performance. This requires that the compiled artifact shapes must be known at compilation time. In the following command, you will set the batch size, input/output sequence length, data type, and tensor-parallelism degree (number of neuron cores). For more information about these parameters, see Export a model to Inferentia.

Let’s discuss these parameters in more detail:

The parameter batch_size is the number of input sequences that the model will accept.
sequence_length specifies the maximum number of tokens in an input sequence. This affects memory usage and model performance during inference or training on Neuron hardware. A larger number will increase the model’s memory requirements because the attention mechanism needs to operate over the entire sequence, which leads to more computations and memory usage; while a smaller number will do the opposite. The value 1024 will be adequate for this example.
The auto_cast_type parameter controls the precision used for model weights and computations during inference. The options are bf16, fp16, or tf32. For more information about defining which lower-precision data type the compiler should use, see Mixed Precision and Performance-accuracy Tuning. For models trained in float32, the 16-bit mixed precision options (bf16, fp16) generally provide sufficient accuracy while significantly improving performance. We use data type float16 with the argument auto_cast_type fp16.
The num_cores parameter controls the number of cores on which the model should be deployed. This will dictate the number of parallel shards or partitions the model is split into. Each shard is then executed on a separate NeuronCore, taking advantage of the 16 GB of high-bandwidth memory available per core. As discussed in the previous section, given the Mixtral-8x7B model’s requirements, Neuron supports tensor parallelism degrees of 8, 16, or 32. The inf2.24xlarge instance contains 12 NeuronCores. Therefore, to optimally distribute the model, we set num_cores to 8.

optimum-cli export neuron \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --batch_size 1 \
  --sequence_length 1024 \
  --auto_cast_type fp16 \
  --num_cores 8 \
  ./neuron_model_path

Download and compilation should take 10–20 minutes. After the compilation completes successfully, you can check the artifacts created in the output directory:

neuron_model_path
├── compiled
│ ├── 2ea52780bf51a876a581.neff
│ ├── 3fe4f2529b098b312b3d.neff
│ ├── …
│ ├── …
│ ├── cfda3dc8284fff50864d.neff
│ └── d6c11b23d8989af31d83.neff
├── config.json
├── generation_config.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json

Push the compiled model to the Hugging Face Hub with the following command. Make sure to change <user_id> to your Hugging Face username. If the model repository doesn’t exist, it will be created automatically. Alternatively, store the model on Amazon Simple Storage Service (Amazon S3).

huggingface-cli upload <user_id>/Mixtral-8x7B-Instruct-v0.1 ./neuron_model_path ./
Deploy Mixtral-8x7B SageMaker real-time inference endpoint
Now that the model has been compiled and stored, you can deploy it for inference using SageMaker. To orchestrate the deployment, you will run Python code from a notebook hosted on an EC2 instance. You can use the instance created in the first section or create a new instance. Note that this EC2 instance can be of any type (for example t2.micro with an Amazon Linux 2023 image). Alternatively, you can use a notebook hosted in Amazon SageMaker Studio.
Set up AWS authorization for SageMaker deployment
You need AWS Identity and Access Management (IAM) permissions to manage SageMaker resources. If you created the instance with the provided CloudFormation template, these permissions are already created for you. If not, the following section takes you through the process of setting up the permissions for an EC2 instance to run a notebook that deploys a real-time SageMaker inference endpoint.
Create an AWS IAM role and attach SageMaker permission policy

Go to the IAM console.
Choose the Roles tab in the navigation pane.
Choose Create role.
Under Select trusted entity, select AWS service.
Choose Use case and select EC2.
Select EC2 (Allows EC2 instances to call AWS services on your behalf.)
Choose Next: Permissions.
In the Add permissions policies screen, select AmazonSageMakerFullAccess and IAMReadOnlyAccess. Note that the AmazonSageMakerFullAccess permission is overly permissive. We use it in this example to simplify the process but recommend applying the principle of least privilege when setting up IAM permissions.
Choose Next: Review.
In the Role name field, enter a role name.
Choose Create role to complete the creation.
With the role created, choose the Roles tab in the navigation pane and select the role you just created.
Choose the Trust relationships tab and then choose Edit trust policy.
Choose Add next to Add a principal.
For Principal type, select AWS services.
Enter sagemaker.amazonaws.com and choose Add a principal.
Choose Update policy. Your trust relationship should look like the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "ec2.amazonaws.com",
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Attach the IAM role to your EC2 instance

Go to the Amazon EC2 console.
Choose Instances in the navigation pane.
Select your EC2 instance.
Choose Actions, Security, and then Modify IAM role.
Select the role you created in the previous step.
Choose Update IAM role.

Launch a Jupyter notebook
Your next goal is to run a Jupyter notebook hosted in a container running on the EC2 instance. The notebook will be run using a browser on port 8888 by default. For this example, you will use SSH port forwarding from your local machine to the instance to access the notebook.

Continuing from the previous section, you are still within the container. The following steps install Jupyter Notebook:

pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python Neuronx"
pip install jupyter notebook
pip install environment_kernels

Launch the notebook server using:

jupyter notebook

Then connect to the notebook using your browser over the SSH tunnel:

http://localhost:8888/tree?token=…
If you get a blank screen, try opening this address using your browser’s incognito mode.
Deploy the model for inference with SageMaker
After connecting to Jupyter Notebook, follow this notebook. Alternatively, choose File, New,  Notebook, and then select Python 3 as the kernel. Use the following instructions and run the notebook cells.

In the notebook, install the sagemaker and huggingface_hub libraries.

!pip install sagemaker huggingface_hub

Next, get a SageMaker session and execution role that will allow you to create and manage SageMaker resources. You’ll use a Deep Learning Container.

import os
import sagemaker
from sagemaker.huggingface import get_huggingface_llm_image_uri

os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
print(f"sagemaker role arn: {role}")

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface-neuronx",
    version="0.0.25"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

Deploy the compiled model to a SageMaker real-time endpoint on AWS Inferentia2.

Change user_id in the following code to your Hugging Face username. Make sure to update HF_MODEL_ID and HUGGING_FACE_HUB_TOKEN with your Hugging Face username and your access token.

from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.inf2.24xlarge"
health_check_timeout = 2400 # additional time to load the model
volume_size = 512 # size in GB of the EBS volume

# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "user_id/Mixtral-8x7B-Instruct-v0.1", # replace with your model id if you are using your own model
    "HF_NUM_CORES": "8", # number of neuron cores; must match the tensor parallelism degree used at compilation
    "HF_AUTO_CAST_TYPE": "fp16", # dtype of the model
    "MAX_BATCH_SIZE": "1", # max batch size for the model
    "MAX_INPUT_LENGTH": "1000", # max length of input text
    "MAX_TOTAL_TOKENS": "1024", # max length of generated text
    "MESSAGES_API_ENABLED": "true", # Enable the messages API
    "HUGGING_FACE_HUB_TOKEN": "hf_…" # Add your Hugging Face token here
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)

You’re now ready to deploy the model to a SageMaker real-time inference endpoint. SageMaker will provision the necessary compute resources and retrieve and launch the inference container. The container will download the model artifacts from your Hugging Face repository, load the model onto the Inferentia devices, and start serving inference. This process can take several minutes.

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy

llm_model._is_compiled_model = True # We precompiled the model

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    volume_size=volume_size
)

Next, run a test to check the endpoint. Update user_id to match your Hugging Face username, then create the prompt and parameters.

# Prompt to generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"}
]

# Generation arguments
parameters = {
    "model": "user_id/Mixtral-8x7B-Instruct-v0.1", # replace user_id
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 1000,
}

Send the prompt to the SageMaker real-time endpoint for inference:

chat = llm.predict({"messages": messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())

In the future, if you want to connect to this inference endpoint from other applications, first find the name of the inference endpoint. Alternatively, you can use the SageMaker console and choose Inference, and then Endpoints to see a list of the SageMaker endpoints deployed in your account.

endpoints = sess.sagemaker_client.list_endpoints()

for endpoint in endpoints['Endpoints']:
    print(endpoint['EndpointName'])

Use the endpoint name to update the following code, which can also be run in other locations.

from sagemaker.huggingface import HuggingFacePredictor

endpoint_name = "endpoint_name…"

llm = HuggingFacePredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess
)
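With the predictor reconstructed from the endpoint name, you can invoke the endpoint the same way as before. The following minimal example reuses the messages API payload shown earlier; user_id is a placeholder for your Hugging Face username.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"}
]

parameters = {
    "model": "user_id/Mixtral-8x7B-Instruct-v0.1",  # replace user_id with your Hugging Face username
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 1000,
}

chat = llm.predict({"messages": messages, **parameters})
print(chat["choices"][0]["message"]["content"].strip())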

Cleanup
Delete the endpoint to prevent future charges for the provisioned resources.

llm.delete_model()
llm.delete_endpoint()

Conclusion
In this post, we covered how to compile and deploy the Mixtral 8x7B language model on AWS Inferentia2 using the Hugging Face Optimum Neuron container and Amazon SageMaker. AWS Inferentia2 offers a cost-effective solution for hosting models like Mixtral, providing high-performance inference at a lower cost.
For more information, see Deploy Mixtral 8x7B on AWS Inferentia2 with Hugging Face Optimum.
For other methods to compile and run Mixtral inference on Inferentia2 and Trainium see the Run Hugging Face mistralai/Mixtral-8x7B-v0.1 autoregressive sampling on Inf2 & Trn1 tutorial located in the AWS Neuron Documentation and Notebook.

About the authors
Lior Sadan is a Senior Solutions Architect at AWS, with an affinity for storage solutions and AI/ML implementations. He helps customers architect scalable cloud systems and optimize their infrastructure. Outside of work, Lior enjoys hands-on home renovation and construction projects.
Stenio de Lima Ferreira is a Senior Solutions Architect passionate about AI and automation. With over 15 years of work experience in the field, he has a background in cloud infrastructure, devops and data science. He specializes in codifying complex requirements into reusable patterns and breaking down difficult topics into accessible content.

Elevate business productivity with Amazon Q and Amazon Connect

Modern banking faces dual challenges: delivering rapid loan processing while maintaining robust security against sophisticated fraud. Amazon Q Business provides AI-driven analysis of regulatory requirements and lending patterns. Additionally, you can now report fraud from the same interface through a custom plugin capability that integrates with Amazon Connect. This fusion of technology transforms traditional lending by enabling faster processing times, proactive fraud prevention, and a seamless user experience.
Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business provides plugins to interact with popular third-party applications, such as Jira, ServiceNow, Salesforce, PagerDuty, and more. Administrators can enable these plugins with a ready-to-use library of over 50 actions to their Amazon Q Business application. Where pre-built plugins are not available, Amazon Q Business provides capabilities to build custom plugins to integrate with your application. Plugins help streamline tasks and boost productivity by integrating external services into the Amazon Q Business chat interface.
Amazon Connect is an AI-powered application that provides one seamless experience for your contact center customers and users. It’s comprised of a full suite of features across communication channels. Amazon Connect Cases, a feature of Amazon Connect, allows your agents to track and manage customer issues that require multiple interactions, follow-up tasks, and teams in your contact center. Agents can document customer issues with the relevant case details, such as date/time opened, issue summary, customer information, and status, in a single unified view.
The solution integrates with Okta Identity Management Platform to provide robust authentication, authorization, and single sign-on (SSO) capabilities across applications. Okta can support enterprise federation clients like Active Directory, LDAP, or Ping.
For loan approval officers reviewing mortgage applications, the seamless integration of Amazon Q Business directly into their primary workflow transforms the user experience. Rather than context-switching between applications, officers can harness the capabilities of Amazon Q to conduct research, analyze data, and report potential fraud cases within their mortgage approval interface.
In this post, we demonstrate how to elevate business productivity by leveraging Amazon Q to provide insights that enable research, data analysis, and report potential fraud cases within Amazon Connect.
Solution overview
The following diagram illustrates the solution architecture.

The solution includes the following steps:

Users in Okta are configured to be federated to AWS IAM Identity Center, and a unique ID (audience) is configured for an Amazon API Gateway
When the user chooses to chat in the web application, the following flow is initiated:

The Amazon Q Business application uses the client ID and client secret key to exchange the Okta-generated JSON Web Token (JWT) with IAM Identity Center. The token includes the AWS Security Token Service (AWS STS) context identity.
A temporary token is issued to the application server to assume the role and access the Amazon Q Business API.

The Amazon Q Business application fetches information from the Amazon Simple Storage Service (Amazon S3) data source to answer questions or generate summaries.
The Amazon Q custom plugin uses an OpenAPI schema to discover and understand the capabilities of the API hosted on API Gateway.
A client secret is stored in AWS Secrets Manager and the information is provided to the plugin.
The plugin assumes the AWS Identity and Access Management (IAM) role with the kms:decrypt action to access the secrets in Secrets Manager.
When a user wants to send a case, the custom plugin invokes the API hosted on API Gateway.
API Gateway uses the same Okta user’s session and authorizes the access.
API Gateway invokes AWS Lambda to create a case in Amazon Connect (a minimal sketch of such a function follows this list).
Lambda hosted in Amazon Virtual Private Cloud (Amazon VPC) internally calls the Amazon Connect API using an Amazon Connect VPC interface endpoint powered by AWS PrivateLink.
The contact center agents can also use Amazon Q in Connect to further assist the user.
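The Lambda function that creates the case can be quite small. The following is an illustrative sketch, not the code deployed by the CloudFormation template; it creates a case through the Amazon Connect Cases API, and the domain ID, template ID, and field IDs are placeholders that depend on your Connect Cases configuration.

import json
import os
import boto3

# Amazon Connect Cases client; requests traverse the VPC interface endpoint when the function runs in the VPC
cases = boto3.client("connectcases")

def lambda_handler(event, context):
    """Create a Connect case from the request proxied by API Gateway and the Amazon Q custom plugin."""
    body = json.loads(event.get("body") or "{}")

    response = cases.create_case(
        domainId=os.environ["CASES_DOMAIN_ID"],      # placeholder: your Connect Cases domain ID
        templateId=os.environ["CASES_TEMPLATE_ID"],  # placeholder: your case template ID
        fields=[
            {"id": "title", "value": {"stringValue": body.get("title", "Suspected fraud")}},
            {"id": "customer_id", "value": {"stringValue": body.get("customer_profile_arn", "")}},
        ],
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"caseId": response["caseId"]}),
    }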

Prerequisites
The following prerequisites need to be met before you can build the solution:

Have a valid AWS account.
Have an Amazon Q Business Pro subscription to create Amazon Q applications.
Have the service-linked IAM role AWSServiceRoleForQBusiness. If you don’t have one, create it with the qbusiness.amazonaws.com service name.
Have an IAM role in the account that will allow the AWS CloudFormation template to create new roles and add policies. If you have administrator access to the account, no action is required.
Enable logging in AWS CloudTrail for operational and risk auditing.

Okta prerequisites:

Have an Okta developer account and set up an application and API. If you do not have an Okta account, see the instructions in the following section.

Set up an application and API in Okta
Complete the following steps to set up an application and API in Okta:

Log in to the Okta console.
Provide credentials and choose Login.
Choose Continue with Google.
You might need to set up multi-factor authentication following the instructions on the page.
Log in using the authentication code.
In the navigation pane, choose Applications and choose Create App Integration.

Select OIDC – OpenID Connect for Sign-in method and Web Application for Application type, then choose Next.

For App integration name, enter a name (for example, myConnectApp).
Select Authorization Code and Refresh Token for Grant type.
Select Skip group assignment for now for Control Access.
Choose Save to create an application.
Take note of the client ID and secret.

Add Authentication server and metadata

In the navigation pane, choose Security, then choose API.
Choose Add Authorization Server, provide the necessary details, and choose Save.

Take note of the Audience value and choose Metadata URI.

Audience is provided as an input to the CloudFormation template later in the section.

The response will provide the metadata.

From the response, take note of the following:

issuer
authorization_endpoint
token_endpoint

Under Scopes, choose Add Scope, provide the name write/tasks, and choose Create.

On the Access Policies tab, choose Add Policy.
Provide a name and description.
Select The following clients and choose the application by entering my in the text box and choosing the application created earlier.
Choose Create Policy to add a policy.

Choose Add Rule to add a rule and select only Authorization Code for Grant type is.
For Scopes requested, select The following scopes, then enter write in the text box and select the write/tasks scope.
Adjust Access token lifetime is and Refresh token lifetime is to the desired number of minutes.
Set but will expire if not used every to 5 minutes.
Choose Create rule to create the rule.

Add users

In the navigation pane, choose Directory and choose People.
Choose Add person.

Complete the fields:

First name
Last name
Username (use the same as the primary email)
Primary email

Select Send user activation email now.
Choose Save to save the user.

You will receive an email. Choose the link in the email to activate the user.
Choose Groups, then choose Add group to add the group.
Provide a name and optional description.
Refresh the page and choose the newly created group.
Choose Assign people to assign users.
Add the newly created user by choosing the plus sign next to the user.

Under Applications, select the application name created earlier.
On the Assignments tab, choose Assign to People.

Select the user and choose Assign.
Choose Done to complete the assignment.

Set up Okta as an identity source in IAM Identity Center
Complete the following steps to set up Okta as an identity source:

Enable an IAM Identity Center instance.
Configure SAML and SCIM with Okta and IAM Identity Center.
On the IAM Identity Center console, navigate to the instance.
Under Settings, copy the value Instance ARN. You will need it when you run the CloudFormation template.

Deploy resources using AWS CloudFormation
In this step, we use a CloudFormation template to deploy a Lambda function, configure the REST API, and create identities. Complete the following steps:

Open the AWS CloudFormation console in the us-east-1 AWS Region.
Choose Create stack.
Download the CloudFormation template and upload it in the Specify template section.
Choose Next.
For Stack name, enter a name (for example, QIntegrationWithConnect).
In the Parameters section, provide values for the following:

Audience
AuthorizationUrl
ClientId
ClientSecret
IdcInstanceArn
Issuer
TokenUrl

Choose Next.
Keep the other values as default and select I acknowledge that AWS CloudFormation might create IAM resources in the Capabilities section.
Select I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND in the Capabilities section.
Choose Submit to create the CloudFormation stack.
After the successful deployment of the stack, on the Outputs tab, note the value for ALBDNSName.

The CloudFormation template does not deploy certificates for Application Load Balancer. We strongly recommend creating a secure listener for the Application Load Balancer and deploying at least one certificate.
Assign user to Amazon Q Application

On the Amazon Q Business console, navigate to the application named qbusiness-connect-case.
Under User Access, choose Manage user access.
On the Users tab, choose Add groups and users and search for the user you created in Okta and propagated to IAM Identity Center.
Choose Assign and Done.

Choose Confirm to confirm the subscription.
Copy the link for Deployed URL.

Create a callback URL: <Deployed URL>/oauth/callback.

We recommend that you enable a budget policy notification to prevent unwanted billing.
Configure login credentials for the web application
Complete the following steps to configure login credentials for the web application:

Navigate to the Okta developer login.
Under Applications, choose the web application myConnectApp created earlier.
Choose Edit in the General Settings section.
Enter the callback URL for Sign-in redirect URIs.
Choose Save.

Sync the knowledge base
Complete the following steps to sync your knowledge base:

On the Amazon S3 console, choose Buckets in the navigation pane.
Search for AmazonQDataSourceBucket and choose the bucket.
Download the sample AnyBank regulations document.
Upload the PDF file to the S3 bucket.
On the Amazon Q Business console, navigate to the Amazon Q Business application.
In the Data sources section, select the data source.
Choose Sync now to sync the data source.

Embed the web application
Complete the following steps to embed the web application:

On the Amazon Q Business console, under Enhancements, choose Amazon Q embedded.
Choose Add allowed website.
For Enter website URL, enter http://<ALBDNSName>.

Test the solution
Complete the following steps to test the solution:

Copy the ALBDNSName value from the outputs section of the CloudFormation stack and open it in a browser.

You will see an AnyBank website.

Choose Chat with us and the Okta sign-in page will pop up.
Provide the sign-in details.

Upon verification, close the browser tab.
Navigate to the Amazon Q Business application in the chat window.
In the chat window, enter “What are the Fraud Detection and Prevention Measures?”

Amazon Q Business will provide the answers from the knowledge base.
Next, let’s assume that you detected a fraud and want to create a case.

Choose the plugin CreateCase and ask the question, “Can you create a case reporting fraud?”

Amazon Q Business generates the title of the case based on the question.

Choose Submit.
If Amazon Q Business asks you to authorize your access, choose Authorize.

The CreateCase plugin will create a case in Amazon Connect.

Navigate to Amazon Connect and open the access URL in a browser.
Provide the username admin and retrieve the password from the parameter store in AWS Systems Manager.

Choose Agent Workspace.

You can see the case that was created by Amazon Q Business using the custom plugin.

Clean up
To avoid incurring future charges, delete the resources that you created and clean up your account:

Empty the contents of the S3 buckets you created as part of the CloudFormation stack.
Delete the CloudFormation stack you created as part of this post.
Disable the application from IAM Identity Center.

Conclusion
As businesses navigate the ever-changing corporate environment, the combination of Amazon Q Business and Amazon Connect emerges as a transformative approach to optimizing employee assistance and operational effectiveness. Harnessing the capabilities of AI-powered assistants and advanced contact center tools, organizations can empower their teams to access data, initiate support requests, and collaborate cohesively through a unified solution. This post showcased a banking portal, but this can be used for other industrial sectors or organizational verticals.
Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.

About the Authors
Sujatha Dantuluri is a seasoned Senior Solutions Architect in the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences. She has contributed to IEEE standards and is passionate about empowering others through her engaging presentations and thought-provoking ideas.
Dr Anil Giri is a Solutions Architect at Amazon Web Services. He works with enterprise software and SaaS customers to help them build generative AI applications and implement serverless architectures on AWS. His focus is on guiding clients to create innovative, scalable solutions using cutting-edge cloud technologies.

Reasoning Models Know When They’re Right: NYU Researchers Introduce a Hidden-State Probe That Enables Efficient Self-Verification and Reduces Token Usage by 24%

Artificial intelligence systems have made significant strides in simulating human-style reasoning, particularly in mathematics and logic. These models don’t just generate answers—they walk through a series of logical steps to reach conclusions, offering insights into how and why those answers are produced. This step-by-step reasoning, often called Chain-of-Thought (CoT), has become vital in how machines handle complex problem-solving tasks.

A common problem researchers encounter with these models is inefficiency during inference. Reasoning models often continue processing even after reaching a correct conclusion. This overthinking results in the unnecessary generation of tokens, increasing computational cost. Whether these models have an internal sense of correctness remains unclear—do they realize when an intermediate answer is right? If they could identify this internally, the models could halt processing earlier, becoming more efficient without losing accuracy.

Many current approaches measure a model’s confidence through verbal prompts or by analyzing multiple outputs. These black-box strategies ask the model to report how sure it is of its answer. However, they are often imprecise and computationally expensive. On the other hand, white-box methods investigate models’ internal hidden states to extract signals that may correlate with answer correctness. Prior work shows that a model’s internal states can indicate the validity of final answers, but applying this to intermediate steps in long reasoning chains is still an underexplored direction.

The research introduced by a team from New York University and NYU Shanghai tackled this gap by designing a lightweight probe—a simple two-layer neural network—to inspect a model’s hidden states at intermediate reasoning steps. The models used for experimentation included the DeepSeek-R1-Distill series and QwQ-32B, known for their step-by-step reasoning capabilities. These models were tested across various datasets involving mathematical and logical tasks. The researchers trained their probe to read the internal state associated with each chunk of reasoning and predict whether the current intermediate answer was correct.

To construct their approach, the researchers first segmented each long CoT output into smaller parts or chunks, using markers like “wait” or “verify” to identify breaks in reasoning. They used the last token’s hidden state in each chunk as a representation and matched this to a correctness label, which was judged using another model. These representations were then used to train the probe on binary classification tasks. The probe was fine-tuned using grid search across hyperparameters like learning rate and hidden layer size, with most models converging to linear probes—indicating that correctness information is often linearly embedded in the hidden states. The probe worked for fully formed answers and showed the ability to predict correctness before an answer was even completed, hinting at look-ahead capabilities.
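To make the setup concrete, here is a generic PyTorch sketch of such a probe, not the authors’ implementation. It assumes the last-token hidden states for each reasoning chunk and their correctness labels have already been extracted, and it trains a small two-layer network with binary cross-entropy.

import torch
import torch.nn as nn

class CorrectnessProbe(nn.Module):
    """Two-layer probe mapping a hidden state to the probability that an intermediate answer is correct."""
    def __init__(self, hidden_size: int, probe_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, h):                 # h: (batch, hidden_size) last-token states per chunk
        return self.net(h).squeeze(-1)    # logits; apply sigmoid to get a confidence score

# Assumed inputs: hidden_states of shape (N, hidden_size) and labels of shape (N,) with 1 = correct
def train_probe(hidden_states, labels, epochs: int = 10, lr: float = 1e-3):
    probe = CorrectnessProbe(hidden_states.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(hidden_states), labels.float())
        loss.backward()
        opt.step()
    return probe

# Early-exit idea from the paper: stop generating once sigmoid(probe(h)) exceeds a threshold such as 0.85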

Performance results were clear and quantifiable. The probes achieved ROC-AUC scores exceeding 0.9 for some datasets like AIME when using models like R1-Distill-Qwen-32B. Expected Calibration Errors (ECE) remained under 0.1, showing high reliability. For example, R1-Distill-Qwen-32B had an ECE of just 0.01 on GSM8K and 0.06 on MATH datasets. In application, the probe was used to implement a confidence-based early exit strategy during inference. The reasoning process was stopped when the probe’s confidence in an answer exceeded a threshold. At a confidence threshold of 0.85, the accuracy remained at 88.2%, while the inference token count was reduced by 24%. Even at a threshold of 0.9, accuracy stayed at 88.6%, with a 19% token reduction. Compared to static exit methods, this dynamic strategy achieved up to 5% higher accuracy using the same or fewer tokens.

This study offers an efficient, integrated way for reasoning models to self-verify during inference. The researchers’ approach pinpoints a gap—while models inherently know when they’re right, they don’t act on it. The research reveals a path toward smarter, more efficient reasoning systems by leveraging internal representations through probing. It shows that tapping into what the model already “knows” can lead to meaningful performance and resource use improvements.

Check out the paper. All credit for this research goes to the researchers of this project.

Code Implementation to Building a Model Context Protocol (MCP) Server and Connecting It with Claude Desktop

In this hands-on tutorial, we’ll build an MCP (Model Context Protocol) server that allows Claude Desktop to fetch stock news sentiment and daily top gainers and movers via the AlphaVantage API. Since most LLMs can’t directly access real-time financial data, this solution uses MCP to provide real-time insights.

We’ll expose two tools from our server:

get_news_sentiment

get_top_movers

Let’s walk through each step.

Step 1: Setting Up the Environment

We will first set up our environment and start with installing the uv package manager. For Mac or Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

For Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

We will then create a new project directory and initialize it with uv

uv init stockNews
cd stockNews

We can now create and activate a virtual environment. For Mac or Linux:

uv venv
source .venv/bin/activate

For Windows:

uv venv
.venv\Scripts\activate

We will now install the required dependencies

uv add mcp httpx python-dotenv

Step 2: Setting Up the Environment Variables

We will now create a .env file that contains the API key for AlphaVantage. To generate a free API key:

Go to https://www.alphavantage.co/

Click on Get free API key button, or use the following url https://www.alphavantage.co/support/#api-key

Enter your email and other required details. You’ll receive an API key—copy it and keep it safe, as this will be used to authenticate your requests.

Now, create a .env file and add the following line:

ALPHA_VANTAGE_API_KEY = your_api_key

Step 3: Implementing the MCP Server and integrating AlphaVantage

First, create a stockNews.py file in the directory that we created and add the following code snippets:

Importing packages and setting up the instance:

We will first import the necessary packages and set up an instance to use the API:

from typing import Any
import os
import httpx
from mcp.server.fastmcp import FastMCP
from dotenv import load_dotenv

# Load .env variables
load_dotenv()
API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")

# Initialize FastMCP server
mcp = FastMCP("alpha-finance")

# Constants
BASE_URL = "https://www.alphavantage.co/query"

Helper functions

Next, let’s add our helper functions for querying the data from AlphaVantage.

async def call_alpha_vantage(endpoint: str, params: dict[str, Any]) -> dict[str, Any] | None:
    """Generic async caller to Alpha Vantage."""
    params["apikey"] = API_KEY
    params["function"] = endpoint
    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(BASE_URL, params=params, timeout=30.0)
            response.raise_for_status()
            return response.json()
        except Exception:
            return None

Implementing tool execution

The tool execution handler is responsible for executing the logic of each tool.

@mcp.tool()
async def get_news_sentiment(ticker: str) -> str:
    """Get news sentiment data for a stock ticker.

    Args:
        ticker: Stock ticker symbol (e.g., MSFT, AAPL)
    """
    data = await call_alpha_vantage("NEWS_SENTIMENT", {"tickers": ticker.upper()})
    if not data or "feed" not in data:
        return "Couldn't retrieve news sentiment."

    articles = data["feed"][:3]
    result = []
    for item in articles:
        result.append(f"""
{item['title']}
Summary: {item['summary']}
Source: {item['source']} | Published: {item['time_published']}
""")
    return "\n---\n".join(result)

@mcp.tool()
async def get_top_movers() -> str:
    """Get top gainers and losers from the stock market.

    No arguments required.
    """
    data = await call_alpha_vantage("TOP_GAINERS_LOSERS", {})
    if not data:
        return "Couldn't retrieve top movers."

    gainers = data.get("top_gainers", [])[:3]
    losers = data.get("top_losers", [])[:3]

    result = "**Top Gainers**\n"
    result += "\n".join([
        f"{g['ticker']} ({g.get('change_percentage', 'N/A')})"
        for g in gainers
    ])

    result += "\n\n**Top Losers**\n"
    result += "\n".join([
        f"{l['ticker']} ({l.get('change_percentage', 'N/A')})"
        for l in losers
    ])

    return result

Running the server

Finally, let’s initialize and run the server:

if __name__ == "__main__":
    mcp.run(transport="stdio")

We will now test our server from an existing MCP host, Claude for Desktop.

Step 4: Testing the server

First, ensure you have Claude for Desktop installed. If not, download and install the latest version from the official source. If you already have it, make sure it’s up to date.

Next, you’ll need to configure Claude to connect with your MCP server. To do this, open the claude_desktop_config.json file located in the Claude directory using any text editor. If the file doesn’t exist, go ahead and create it manually.

For MacOS/Linux:

{
  "mcpServers": {
    "stockNews": {
      "command": "uv",
      "args": [
        "--directory",
        "/ABSOLUTE/PATH/TO/PARENT/FOLDER/stockNews",
        "run",
        "stockNews.py"
      ]
    }
  }
}

For Windows:

{
  "mcpServers": {
    "stockNews": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\ABSOLUTE\\PATH\\TO\\PARENT\\FOLDER\\stockNews",
        "run",
        "stockNews.py"
      ]
    }
  }
}

This configuration lets Claude for Desktop know that:

There’s an MCP server called “stockNews”.

It should be launched using the following command: uv --directory /ABSOLUTE/PATH/TO/PARENT/FOLDER/stockNews run stockNews.py

Once you’ve added this to your config file, save the file and restart Claude for Desktop to apply the changes.

Test with commands

To confirm that Claude for Desktop has recognized the two tools from your stockNews server, look for the hammer icon in the Claude interface — this icon indicates tool access.

After clicking on the hammer icon, you should see two tools listed:

We can test the server by running the following prompts:

What is the news sentiment for Apple?

Who are the top gainers and losers from the stock market?

When you ask Claude a question:

The client sends your query to Claude.

Claude reviews the available tools (like get_news_sentiment or get_top_movers) and determines which one(s) to use based on your question.

The selected tool is executed via the MCP server you configured earlier.

The tool returns the results back to Claude.

Claude uses those results to craft a natural language response.

The final response is shown to you in the chat interface.

This seamless flow is what allows Claude to interact with real-time data in a structured and controlled way.

Conclusion:

Our MCP-based stock insights server extends Claude Desktop’s capabilities by enabling real-time financial data retrieval. By integrating the AlphaVantage API with a custom MCP server, users can fetch live news sentiment and track top market movers directly through Claude. This setup empowers users with timely, actionable stock insights—all within a conversational interface—making financial analysis more efficient, contextual, and interactive.

A Coding Implementation on Introduction to Weight Quantization: Key Aspect in Enhancing Efficiency in Deep Learning and LLMs

In today’s deep learning landscape, optimizing models for deployment in resource-constrained environments is more important than ever. Weight quantization addresses this need by reducing the precision of model parameters, typically from 32-bit floating point values to lower bit-width representations, thus yielding smaller models that can run faster on hardware with limited resources. This tutorial introduces the concept of weight quantization using PyTorch’s dynamic quantization technique on a pre-trained ResNet18 model. The tutorial will explore how to inspect weight distributions, apply dynamic quantization to key layers (such as fully connected layers), compare model sizes, and visualize the resulting changes. This tutorial will equip you with the theoretical background and practical skills required to deploy deep learning models.

import torch
import torch.nn as nn
import torch.quantization
import torchvision.models as models
import matplotlib.pyplot as plt
import numpy as np
import os

print("Torch version:", torch.__version__)

We import the required libraries, such as PyTorch, torchvision, and matplotlib, and print the PyTorch version, ensuring all necessary modules are ready for model manipulation and visualization.

model_fp32 = models.resnet18(pretrained=True)
model_fp32.eval()

print("Pretrained ResNet18 (FP32) model loaded.")

A pretrained ResNet18 model is loaded in FP32 (floating-point) precision and set to evaluation mode, preparing it for further processing and quantization.

fc_weights_fp32 = model_fp32.fc.weight.data.cpu().numpy().flatten()

plt.figure(figsize=(8, 4))
plt.hist(fc_weights_fp32, bins=50, color='skyblue', edgecolor='black')
plt.title("FP32 - FC Layer Weight Distribution")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

In this block, the weights from the final fully connected layer of the FP32 model are extracted and flattened, then a histogram is plotted to visualize their distribution before any quantization is applied.

The output of the above block

quantized_model = torch.quantization.quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)
quantized_model.eval()

print("Dynamic quantization applied to the model.")

We apply dynamic quantization to the model, specifically targeting the Linear layers and converting them to lower-precision formats, demonstrating a key technique for reducing model size and inference latency.

def get_model_size(model, filename="temp.p"):
    torch.save(model.state_dict(), filename)
    size = os.path.getsize(filename) / 1e6
    os.remove(filename)
    return size

fp32_size = get_model_size(model_fp32, "fp32_model.p")
quant_size = get_model_size(quantized_model, "quant_model.p")

print(f"FP32 Model Size: {fp32_size:.2f} MB")
print(f"Quantized Model Size: {quant_size:.2f} MB")

A helper function is defined to save and check the model size on disk; then, it is used to measure and compare the sizes of the original FP32 model and the quantized model, showcasing the compression impact of quantization.

dummy_input = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    output_fp32 = model_fp32(dummy_input)
    output_quant = quantized_model(dummy_input)

print("Output from FP32 model (first 5 elements):", output_fp32[0][:5])
print("Output from Quantized model (first 5 elements):", output_quant[0][:5])

A dummy input tensor is created to simulate an image, and both FP32 and quantized models are run on this input so that you can compare their outputs and validate that quantization does not drastically alter predictions.

if hasattr(quantized_model.fc, 'weight'):
    fc_weights_quant = quantized_model.fc.weight().dequantize().cpu().numpy().flatten()
else:
    fc_weights_quant = quantized_model.fc._packed_params._packed_weight.dequantize().cpu().numpy().flatten()

plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
plt.hist(fc_weights_fp32, bins=50, color='skyblue', edgecolor='black')
plt.title("FP32 - FC Layer Weight Distribution")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.grid(True)

plt.subplot(1, 2, 2)
plt.hist(fc_weights_quant, bins=50, color='salmon', edgecolor='black')
plt.title("Quantized - FC Layer Weight Distribution")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.grid(True)

plt.tight_layout()
plt.show()

In this block, the quantized weights (after dequantization) are extracted from the fully connected layer and compared via histograms against the original FP32 weights to illustrate the changes in weight distribution due to quantization.

The output of the above block

In conclusion, the tutorial has provided a step-by-step guide to understanding and implementing weight quantization, highlighting its impact on model size and performance. By quantizing a pre-trained ResNet18 model, we observed the shifts in weight distributions, the tangible benefits in model compression, and potential inference speed improvements. This exploration sets the stage for further experimentation, such as implementing Quantization Aware Training (QAT), which can further optimize performance on quantized models.

Here is the Colab Notebook.

Step by Step Guide on Converting Text to High-Quality Audio Using an Open Source TTS Model on Hugging Face: Including Detailed Audio File Analysis and Diagnostic Tools in Python

In this tutorial, we demonstrate a complete end-to-end solution to convert text into audio using an open-source text-to-speech (TTS) model available on Hugging Face. Leveraging the capabilities of the Coqui TTS library, the tutorial walks you through initializing a state-of-the-art TTS model (in our case, “tts_models/en/ljspeech/tacotron2-DDC”), processing your input text, and saving the resulting synthesis as a high-quality WAV audio file. In addition, we integrate Python’s audio processing tools, including the wave module and context managers, to analyze key audio file attributes like duration, sample rate, sample width, and channel configuration. This step-by-step guide is designed to cater to beginners and advanced developers who want to understand how to generate speech from text and perform basic diagnostic analysis on the output.

!pip install TTS

The !pip install TTS command installs the Coqui TTS library, enabling you to leverage open-source text-to-speech models to convert text into high-quality audio. This ensures that all necessary dependencies are available in your Python environment, allowing you to experiment quickly with various TTS functionalities.

from TTS.api import TTS
import contextlib
import wave

We import essential modules: TTS from the TTS API for text-to-speech synthesis using Hugging Face models and the built-in contextlib and wave modules for safely opening and analyzing WAV audio files.

def text_to_speech(text: str, output_path: str = "output.wav", use_gpu: bool = False):
    """
    Converts input text to speech and saves the result to an audio file.

    Parameters:
        text (str): The text to convert.
        output_path (str): Output WAV file path.
        use_gpu (bool): Use GPU for inference if available.
    """
    model_name = "tts_models/en/ljspeech/tacotron2-DDC"

    tts = TTS(model_name=model_name, progress_bar=True, gpu=use_gpu)

    tts.tts_to_file(text=text, file_path=output_path)
    print(f"Audio file generated successfully: {output_path}")

The text_to_speech function accepts a string of text, along with an optional output file path and a GPU usage flag, and utilizes the Coqui TTS model (specified as “tts_models/en/ljspeech/tacotron2-DDC”) to synthesize the provided text into a WAV audio file. Upon successful conversion, it prints a confirmation message indicating where the audio file has been saved.

def analyze_audio(file_path: str):
    """
    Analyzes the WAV audio file and prints details about it.

    Parameters:
        file_path (str): The path to the WAV audio file.
    """
    with contextlib.closing(wave.open(file_path, 'rb')) as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        duration = frames / float(rate)
        sample_width = wf.getsampwidth()
        channels = wf.getnchannels()

    print("\nAudio Analysis:")
    print(f" - Duration     : {duration:.2f} seconds")
    print(f" - Frame Rate   : {rate} frames per second")
    print(f" - Sample Width : {sample_width} bytes")
    print(f" - Channels     : {channels}")

The analyze_audio function opens a specified WAV file and extracts key audio parameters, such as duration, frame rate, sample width, and number of channels, using Python’s wave module. It then prints these details in a neatly formatted summary, helping you verify and understand the technical characteristics of the synthesized audio output.

if __name__ == "__main__":
    sample_text = (
        "Marktechpost is an AI News Platform providing easy-to-consume, byte size updates in machine learning, deep learning, and data science research. Our vision is to showcase the hottest research trends in AI from around the world using our innovative method of search and discovery"
    )

    output_file = "output.wav"
    text_to_speech(sample_text, output_path=output_file)

    analyze_audio(output_file)

The if __name__ == "__main__": block serves as the script’s entry point when executed directly. This segment defines a sample text describing an AI news platform. The text_to_speech function is called to synthesize this text into an audio file named "output.wav", and finally, the analyze_audio function is invoked to print the audio’s detailed parameters.

Main Function Output

Download the generated audio from the side pane on Colab

In conclusion, the implementation illustrates how to effectively harness open-source TTS tools and libraries to convert text to audio while concurrently performing diagnostic analysis on the resulting audio file. By integrating the Hugging Face models through the Coqui TTS library with Python’s robust audio processing capabilities, you gain a comprehensive workflow that synthesizes speech efficiently and verifies its quality and performance. Whether you aim to build conversational agents, automate voice responses, or simply explore the nuances of speech synthesis, this tutorial lays a solid foundation that you can easily customize and expand as needed.

Here is the Colab Notebook.

LightPROF: A Lightweight AI Framework that Enables Small-Scale Language Models to Perform Complex Reasoning Over Knowledge Graphs (KGs) Using Structured Prompts

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating strong performance on complex zero-shot tasks thanks to extensive training data and vast parameter counts. However, LLMs often struggle with knowledge-intensive tasks due to limited task-specific prior knowledge and understanding capabilities. LLMs need access to reliable and continuously updated knowledge bases for effective reasoning, and Knowledge Graphs (KGs) are ideal candidates due to their structured semantic framework. Current approaches to LLM reasoning on KGs encounter two obstacles: representing KG content as extensive text fails to convey the rich logical relationships within the graph structure, and the retrieval and reasoning processes demand numerous LLM calls and substantial reasoning power.

Prompt engineering has emerged as a critical technique for expanding LLM capabilities across various applications without modifying model parameters. The field has evolved from simple zero-shot and few-shot prompts to more complex approaches like Chain-of-Thought (CoT), Tree-of-Thoughts (ToT), and Graph-of-Thoughts (GoT). KG-based LLM reasoning has gained traction as KGs provide explicit, structured knowledge that enhances LLMs’ knowledge awareness with clear logical structures. More flexible solutions like KAPING, KGGPT, StructGPT, ToG, and KnowledgeNavigator construct LLM prompts using KG factual information with various techniques like semantic similarity retrieval, multi-step reasoning frameworks, and beam search on KGs to enhance reasoning capabilities.

Researchers from Beijing University of Posts and Telecommunications, Hangzhou Dianzi University, Singapore Management University, National University of Singapore, the Institute of Computing Technology at the Chinese Academy of Sciences, and Xi’an Jiaotong University have proposed LightPROF, a Lightweight and efficient Prompt learning-ReasOning Framework. The Retrieve-Embed-Reason framework enables small-scale LLMs to perform stable retrieval and efficient reasoning on KGs. It contains three core components: the Retrieval, Embedding, and Reasoning modules. The Retrieval module uses relations as fundamental retrieval units and limits the retrieval scope based on the question’s semantics, the Embedding module uses a compact Transformer-based Knowledge Adapter, and the Reasoning module combines the embedded representation vectors with carefully designed prompts. LightPROF supports various open-source LLMs and KGs while only requiring Knowledge Adapter tuning during training.
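To make the Retrieve-Embed-Reason idea concrete, the following toy sketch mimics the three stages in plain Python. It is only an illustration of the pipeline shape, not LightPROF’s code: the retrieval heuristic, the text-based "adapter", and the placeholder LLM are all stand-ins.

def retrieve_subgraph(kg, question):
    # Retrieval: keep only triples whose relation appears in the question (toy heuristic).
    return [(h, r, t) for (h, r, t) in kg if r.replace("_", " ") in question.lower()]

def encode_subgraph(triples):
    # Embedding: a real Knowledge Adapter would emit compact soft tokens; here we
    # simply condense the triples into a short text snippet.
    return "; ".join(f"{h} --{r}--> {t}" for h, r, t in triples)

def reason(question, knowledge, llm):
    # Reasoning: combine the condensed knowledge with a carefully designed prompt.
    prompt = f"Knowledge: {knowledge}\nQuestion: {question}\nAnswer:"
    return llm(prompt)

def fake_llm(prompt):
    # Placeholder for a frozen small-scale LLM call.
    return "France" if "Paris" in prompt else "(unknown)"

kg = [("Paris", "capital_of", "France"), ("Berlin", "capital_of", "Germany")]
question = "Which country is Paris the capital of?"
print(reason(question, encode_subgraph(retrieve_subgraph(kg, question)), fake_llm))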

LightPROF is evaluated on two Freebase-based public datasets: WebQuestionsSP (WebQSP) and ComplexWebQuestions (CWQ). WebQSP serves as a benchmark with fewer questions (4,737) but a larger KG, and CWQ is designed for complex KG question answering with 34,689 question-answer pairs built upon WebQSP. Performance is measured using match accuracy (Hits@1), which evaluates whether the model’s top answer is correct. LightPROF is compared against three categories of baseline methods: full fine-tuning approaches (including KV-Mem, EmbedKGQA, TransferNet, NSM, etc), vanilla LLM methods (featuring LLaMa series models), and LLM+KGs methods (such as StructGPT, ToG, KnowledgeNavigator, and AgentBench).

LightPROF significantly outperforms state-of-the-art models, achieving 83.7% accuracy on the WebQSP dataset and 59.3% on the more challenging CWQ dataset. These results validate LightPROF’s effectiveness in handling multi-hop and complex reasoning challenges in KG question answering. When integrating different LLMs within the framework, LightPROF consistently enhances performance regardless of the baseline capabilities of the original models. This plug-and-play integration strategy eliminates the need for costly LLM fine-tuning. Efficiency evaluations against StructGPT reveal LightPROF’s superior resource utilization, with a 30% reduction in processing time, 98% reduction in input token usage, and significantly lower tokens per request.

In conclusion, researchers introduced LightPROF, a novel framework that enhances LLM reasoning through accurate retrieval and efficient encoding of KGs. It narrows the retrieval scope by sampling KGs using stable relationships as units. Researchers developed a complex Knowledge Adapter that effectively parses graph structures and integrates information to enable efficient reasoning with smaller LLMs. It condenses reasoning graphs into fewer tokens while achieving comprehensive alignment with LLM input space through the Projector component. Future research directions include developing KG encoders with strong generalization capabilities that can be applied to unseen KG data without retraining and designing unified cross-modal encoders capable of handling multimodal KGs.

Check out the Paper. All credit for this research goes to the researchers of this project.

Google AI Introduces the Articulate Medical Intelligence Explorer (AMIE): A Large Language Model Optimized for Diagnostic Reasoning, and Evaluates its Ability to Generate a Differential Diagnosis

Developing an accurate differential diagnosis (DDx) is a fundamental part of medical care, typically achieved through a step-by-step process that integrates patient history, physical exams, and diagnostic tests. With the rise of LLMs, there’s growing potential to support and automate parts of this diagnostic journey using interactive, AI-powered tools. Unlike traditional AI systems focusing on producing a single diagnosis, real-world clinical reasoning involves continuously updating and evaluating multiple diagnostic possibilities as more patient data becomes available. Although deep learning has successfully generated DDx across fields like radiology, ophthalmology, and dermatology, these models generally lack the interactive, conversational capabilities needed to engage effectively with clinicians.

The advent of LLMs offers a new avenue for building tools that can support DDx through natural language interaction. These models, including general-purpose ones like GPT-4 and medical-specific ones like Med-PaLM 2, have shown high performance on multiple-choice and standardized medical exams. While these benchmarks initially assess a model’s medical knowledge, they don’t reflect its usefulness in real clinical settings or its ability to assist physicians during complex cases. Although some recent studies have tested LLMs on challenging case reports, there’s still a limited understanding of how these models might enhance clinician decision-making or improve patient care through real-time collaboration.

Researchers at Google introduced AMIE, a large language model tailored for clinical diagnostic reasoning, to evaluate its effectiveness in assisting with DDx. AMIE’s standalone performance outperformed unaided clinicians in a study involving 20 clinicians and 302 complex real-world medical cases. When integrated into an interactive interface, clinicians using AMIE alongside traditional tools produced significantly more accurate and comprehensive DDx lists than those using standard resources alone. AMIE not only improved diagnostic accuracy but also enhanced clinicians’ reasoning abilities. Its performance also surpassed GPT-4 in automated evaluations, showing promise for real-world clinical applications and broader access to expert-level support.

AMIE, a language model fine-tuned for medical tasks, demonstrated strong performance in generating DDx. Its lists were rated highly for quality, appropriateness, and comprehensiveness. In 54% of cases, AMIE’s DDx included the correct diagnosis, outperforming unassisted clinicians significantly. It achieved a top-10 accuracy of 59%, with the proper diagnosis ranked first in 29% of cases. Clinicians assisted by AMIE also improved their diagnostic accuracy compared to using search tools or working alone. Despite being new to the AMIE interface, clinicians used it similarly to traditional search methods, showing its practical usability.

In a comparative analysis between AMIE and GPT-4 using a subset of 70 NEJM CPC cases, direct human evaluation comparisons were limited due to different sets of raters. Instead, an automated metric that was shown to align reasonably with human judgment was used. While GPT-4 marginally outperformed AMIE in top-1 accuracy (though not statistically significant), AMIE demonstrated superior top-n accuracy for n > 1, with notable gains for n > 2. This suggests that AMIE generated more comprehensive and appropriate DDx, a crucial aspect in real-world clinical reasoning. Additionally, AMIE outperformed board-certified physicians in standalone DDx tasks and significantly improved clinician performance as an assistive tool, yielding higher top-n accuracy, DDx quality, and comprehensiveness than traditional search-based assistance.

Beyond raw performance, AMIE’s conversational interface was intuitive and efficient, with clinicians reporting increased confidence in their DDx lists after its use. While limitations exist (such as AMIE’s lack of access to images and tabular data in clinician materials, and the artificial nature of CPC-style case presentations), the model’s potential for educational support and diagnostic assistance is promising, particularly in complex or resource-limited settings. Nonetheless, the study emphasizes the need for careful integration of LLMs into clinical workflows, with attention to trust calibration, the model’s expression of uncertainty, and the potential for anchoring biases and hallucinations. Future work should rigorously evaluate the real-world applicability, fairness, and long-term impacts of AI-assisted diagnosis.

Check out the Paper. All credit for this research goes to the researchers of this project.

Building an AIOps chatbot with Amazon Q Business custom plugins

Many organizations rely on multiple third-party applications and services for different aspects of their operations, such as scheduling, HR management, financial data, customer relationship management (CRM) systems, and more. However, these systems often exist in silos, requiring users to manually navigate different interfaces, switch between environments, and perform repetitive tasks, which can be time-consuming and inefficient.
Moreover, while many enterprise systems are equipped with APIs for integration, users often lack the technical expertise to interact with these APIs directly. As a result, organizations need an intuitive and seamless way to query data and perform actions across these applications using natural language, without requiring specialized knowledge of each system or its APIs.
To address the challenge of integrating multiple third-party applications into a unified, natural language-driven interface, users can use plugins for Amazon Q Business. Plugins bridge the gap between complex, siloed enterprise applications through a user-friendly interface, empowering users to take action across systems with ease. Amazon Q Business supports multiple enterprise systems with pre-built plugins, as well as custom plugins, which users can use to integrate a variety of enterprise systems with their Amazon Q Business applications.
Solution overview
In this post, we demonstrate how you can use custom plugins for Amazon Q Business to build a chatbot that can interact with multiple APIs using natural language prompts. We showcase how to build an AIOps chatbot that enables users to interact with their AWS infrastructure through natural language queries and commands. The chatbot is capable of handling tasks such as querying the data about Amazon Elastic Compute Cloud (Amazon EC2) ports and Amazon Simple Storage Service (Amazon S3) buckets access settings. For example, users can ask the chatbot questions like “Which EC2 instances have port 3389 open?” or request actions such as “Please close public access for S3 buckets.”
By integrating other AWS services with Amazon Q using OpenAPI schemas, the chatbot can not only retrieve real-time information (such as checking which S3 buckets have public access), but also take corrective actions (such as closing open ports or public access) in response to user commands. This solution reduces manual intervention and simplifies complex cloud operations by enabling IT teams to manage infrastructure through natural language interactions. The chatbot will streamline operational tasks, reduce the need for switching between different tools, and improve the efficiency of IT and operations teams by allowing them to interact with complex systems using simple, intuitive language.
Architecture
To implement the solution, you will build the following architecture.

Users sign in to the AIOps chatbot using the credentials configured in AWS IAM Identity Center. To demonstrate the capability of this AIOps chatbot built with Amazon Q Business custom plugins, you will use two use cases: finding and removing public access from S3 buckets, and finding and closing specific open ports on Amazon EC2 instances. However, you can extend the architecture to support other operational use cases through API-based integration.
You deploy the required infrastructure using the AWS Serverless Application Model (AWS SAM).
The following is a summary of the functionality of the architecture:

The UI for the chatbot is built using an Amazon Q Business web experience.
The user authentication and authorization are handled by AWS IAM Identity Center.
Relevant actions are identified based on natural language queries from the users using Amazon Q Business custom plugins. Amazon Q Business uses the configured third-party OpenAPI specifications to dynamically determine which API operations to perform to fulfill an end user request.
The APIs are implemented using Amazon API Gateway and AWS Lambda functions.

Prerequisites

Create an AWS account if you do not already have one.
Have access to an AWS account through the AWS Management Console and the AWS Command Line Interface (AWS CLI). The AWS Identity and Access Management (IAM) user that you use must have permissions to make the necessary AWS service calls and manage AWS resources mentioned in this post. While providing permissions to the IAM user, follow the principle of least-privilege.
Have Git installed.
Have the AWS Serverless Application Model (AWS SAM) command line interface (CLI) installed.
You must have an Amazon Q Business subscription.
You must enable AWS IAM Identity Center.
[Optional] You can pre-create the user in the Identity Center directory that you will be using to sign in to the Amazon Q Business application.

Deploy and run the solution
The resources in this demonstration will be provisioned in the US East (N. Virginia) AWS Region (us-east-1). You will walk through the following phases to implement the solution:

Deploy the solution using the AWS SAM template
Configure a user for the AIOps Q Business chatbot application
Test the AIOps Q Business chatbot application
Clean up

Step 1: Deploy the solution using the AWS SAM template
See the GitHub repository for the latest instructions. Run the following steps to deploy the solution using the AWS SAM template.

Create a new directory, navigate to that directory in a terminal, and clone the GitHub repository:

git clone https://github.com/aws-samples/ai-ops-with-amazon-q-business.git

2. Change directory to the solution directory:

cd ai-ops-with-amazon-q-business

3. Run the following command to deploy the resources using SAM.

sam deploy -g

4. When prompted, enter the following parameter values:

Stack Name [sam-app]: aiops
AWS Region [us-east-1]: us-east-1
Confirm changes before deploy [y/N]: N

Allow SAM CLI IAM role creation [Y/n]: Y

Disable rollback [y/N]: N

FindS3BucketsWithPublicAccessFunction has no authentication. Is this okay? [y/N]: y

RemovePublicAcessFromS3BucketFunction has no authentication. Is this okay? [y/N]: y

FindEC2WithSpecificOpenPortFunction has no authentication. Is this okay? [y/N]: y

CloseUnwantedPortForEC2Function has no authentication. Is this okay? [y/N]: y

Save arguments to configuration file [Y/n]: Y

SAM configuration file [samconfig.toml]: hit enter

SAM configuration environment [default]: hit enter  

5. Note the outputs from the AWS SAM deployment process. This contains the Amazon Q Business web experience (chatbot) URL. Before you can sign in to the chatbot application, you must set up a user.
Step 2: Configure a user for the AIOps Amazon Q Business chatbot application
Use the following steps to configure a user for the AIOps chatbot application.

Open Amazon Q Business from the console and select the AIOps application.

2. Choose Manage access and subscription.

3. Choose Add groups and users.

4. Select either Add and assign new users or Assign existing users and groups, depending on whether you pre-created the user as mentioned in the prerequisites, and choose Next.

5. If you have an existing user that you want to provide access to your AIOps application, search for and select the username and choose Assign.

6. On the review page, select the current subscription and choose Confirm.

Step 3: Test the AIOps Q Business chatbot application
Use the following steps to log into the chatbot and test it. Responses from large language models are non-deterministic. Hence, you may not get the exact same response every time.

Open the QBusinessWebExperienceURL from the sam deploy output and sign in using the user credentials configured in the previous step.
After signing in to the AIOps Chatbot, select the kebab menu option (three dots) at the bottom right corner and select the AIOpsCustomPlugin as follows:

3. Enable public access on an Amazon S3 bucket. This is done for testing purposes only, so check your organization policies before performing this test. For this demo we used a bucket named aiops-chatbot-demo.
4. Return to the AIOps Chatbot and enter a question such as: Do I have any S3 bucket with public access? and choose Submit. Provide the bucket prefix to narrow down the search.

5. The AIOps chatbot identifies the buckets that have public access:

6. Ask a follow up question such as: Please block the public access. The chat bot blocks public access. Validate the change from the S3 console.

7. Open a port, such as 1234, for an Amazon EC2 instance using security group inbound rules.

8. Return to the chat bot and enter a question such as: Do I have any EC2 instance with port 1234 open?
9. After the chat bot identifies the EC2 instance with the open port, confirm that you want to close the port.
10. The chat bot closes the open port and confirms.

Clean up
Properly decommissioning provisioned AWS resources is an important best practice to optimize costs and enhance security posture after concluding proofs of concept and demonstrations. To delete the resources deployed to your AWS account through AWS SAM, run the following command:

sam delete

OpenAPI schema definition
After the custom plugin is deployed, Amazon Q Business processes a user’s prompt and uses the OpenAPI schema to dynamically determine the appropriate APIs to call to accomplish the user’s goal. Therefore, the OpenAPI schema definition has a big impact on API selection accuracy. Follow the best practices for OpenAPI schema definition for ideal results. This AIOps chatbot demonstrates four operations, supported by the following API operations:

find-s3-bucket-with-public-access – This API finds S3 buckets that have the specified prefix and are configured for public access.
remove-public-access-from-s3-bucket – This API removes public access from a specific S3 bucket.
find-ec2-with-specific-open-port – This API finds EC2 instances that have a specified port open for inbound access.
close-unwanted-port-for-ec2 – This API removes a specified port from a given EC2 instance.

The API operations are implemented using API Gateway and Lambda functions.
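For illustration, here is a minimal, hypothetical OpenAPI 3.0 fragment for the find-s3-bucket-with-public-access operation, expressed as a Python dict; the path, parameter names, and descriptions are assumptions rather than content copied from the sample repository, but they show the level of descriptive detail that helps Amazon Q Business pick the right operation:

find_bucket_operation = {
    "openapi": "3.0.1",
    "info": {"title": "AIOps Chatbot APIs", "version": "1.0"},
    "paths": {
        "/find-s3-bucket-with-public-access": {
            "get": {
                "operationId": "findS3BucketWithPublicAccess",
                "description": (
                    "Finds S3 buckets that start with the given prefix and are "
                    "configured for public access."
                ),
                "parameters": [
                    {
                        "name": "prefix",
                        "in": "query",
                        "required": True,
                        "schema": {"type": "string"},
                        "description": "Bucket name prefix used to narrow the search.",
                    }
                ],
                "responses": {"200": {"description": "List of matching bucket names."}},
            }
        }
    },
}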
Troubleshooting
The following are some troubleshooting steps if you encounter errors while using the AIOps chatbot.

As Amazon Q Business dynamically determines the appropriate API operations to be invoked, the questions (prompts) must be unambiguous. Be specific rather than asking generic questions. For example: Do I have any EC2 instance with port 1234 open? instead of Do I have any EC2 exposed to internet?
The APIs are exposed using API Gateway backed by Lambda functions. Check that you can invoke the API operations using curl or an API testing tool; a minimal Python check is sketched after this list.
Check the Lambda function logs in Amazon CloudWatch for errors. Follow the Lambda debugging steps if needed.
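A minimal sketch of such a check in Python is shown below; the endpoint URL, HTTP method, and query parameter are placeholders, so adjust them to match the API Gateway output of your SAM stack and the OpenAPI schema in the repository:

import requests

# Hypothetical smoke test for one of the deployed API operations.
api_url = "https://<api-id>.execute-api.us-east-1.amazonaws.com/Prod/find-s3-bucket-with-public-access"
response = requests.get(api_url, params={"prefix": "aiops-chatbot"}, timeout=30)
print(response.status_code, response.text)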

Conclusion
In this post, you learned an end-to-end process for creating an AIOps chatbot using Amazon Q Business custom plugins, demonstrating how users can use natural language to interact with AWS resources and streamline cloud operations. By integrating other AWS services with Amazon Q Business, the chatbot can query infrastructure for security and compliance status while automating key actions such as closing open ports or restricting public access to S3 buckets. This solution enhances operational efficiency, reduces manual intervention, and enables teams to manage complex environments more effectively through intuitive, conversational interfaces. With custom plugins and OpenAPI schemas, users can build a powerful, flexible chatbot solution tailored to their specific operational needs, transforming the way they manage IT operations and respond to business challenges.
Further study
For more information on Amazon Q Business and custom plugins:

Amazon Q Business
Custom plugins for Amazon Q Business
Prerequisites for Amazon Q Business custom plugins
Defining OpenAPI schemas for custom plugins
Creating an Amazon Q Business custom plugin
Using an Amazon Q Business custom plugin
Best practices for OpenAPI schema definition for custom plugins

About the authors
Upendra V is a Sr. Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprise customers design and deploy production-ready Generative AI workloads, implement Large Language Models (LLMs) and Agentic AI systems, and optimize cloud deployments. With expertise in cloud adoption and machine learning, he enables organizations to build and scale AI-driven applications efficiently.
Biswanath Mukherjee is a Senior Solutions Architect at Amazon Web Services. He works with large strategic customers of AWS by providing them technical guidance to migrate and modernize their applications on AWS Cloud. With his extensive experience in cloud architecture and migration, he partners with customers to develop innovative solutions that leverage the scalability, reliability, and agility of AWS to meet their business needs. His expertise spans diverse industries and use cases, enabling customers to unlock the full potential of the AWS Cloud.

How TransPerfect Improved Translation Quality and Efficiency Using Ama …

This post is co-written with Keith Brazil, Julien Didier, and Bryan Rand from TransPerfect.
TransPerfect, a global leader in language and technology solutions, serves a diverse array of industries. Founded in 1992, TransPerfect has grown into an enterprise with over 10,000 employees in more than 140 cities on six continents. The company offers a broad spectrum of services, including translation, localization, interpretation, multicultural marketing, website globalization, subtitling, voiceovers, and legal support services. TransPerfect also uses cutting-edge technology to offer AI-driven language solutions, such as its proprietary translation management system, GlobalLink.
This post describes how the AWS Customer Channel Technology – Localization Team worked with TransPerfect to integrate Amazon Bedrock into the GlobalLink translation management system, a cloud-based solution designed to help organizations manage their multilingual content and translation workflows. Organizations use TransPerfect’s solution to rapidly create and deploy content at scale in multiple languages using AI.
Amazon Bedrock is a fully managed service that simplifies the deployment and management of generative AI models. It offers access to a variety of foundation models (FMs), enabling developers to build and scale AI applications efficiently. Amazon Bedrock is designed to be highly scalable, secure, and straightforward to integrate with other AWS services, making it suitable for a broad array of use cases, including language translation.
The AWS Customer Channel Technology – Localization Team is a long-standing TransPerfect customer. The team manages the end-to-end localization process of digital content at AWS, including webpages, technical documentation, ebooks, banners, videos, and more. The AWS team handles billions of words in multiple languages across digital assets. Given the growing demand for multilingual content by internationally minded businesses and new local cloud adoption journeys, the AWS team needs to support an ever-increasing load and a wider set of languages. To do so, the team relies on the GlobalLink technology suite to optimize and automate translation processes.
The challenge
The AWS team and TransPerfect created streamlined custom workflows and toolsets that enable the translation and delivery of billions of words each year. Content localization is a multi-step process consisting minimally of asset handoff, asset preprocessing, machine translation, post-editing, quality review cycles, and asset handback. These steps are often manual, costly, and time-consuming. AWS and TransPerfect are continually striving to optimize this workflow to enable the processing of more content at a lower cost and to decrease those assets’ time to market—providing valuable, salient content faster for non-English-speaking customers. Additionally, transcreation of creative content posed a unique challenge, because it traditionally required highly skilled human linguists and was resistant to automation, resulting in higher costs and longer turnaround times. To address these issues, TransPerfect worked with AWS to evaluate generative AI-powered initiatives for transcreation and automatic post-editing within TransPerfect’s GlobalLink architecture.
Security and data safety
Amazon Bedrock helps make sure data is neither shared with FM providers nor used to improve base models. Amazon Bedrock adheres to major compliance standards like ISO and SOC and is also a FedRAMP-authorized service, making it suitable for government contracts. The extensive monitoring and logging capabilities of Amazon Bedrock allow TransPerfect to align with stringent auditability requirements.
Although data safety is a key requirement, there are many other factors to take into account, such as responsible AI. Amazon Bedrock Guardrails enabled TransPerfect to build and customize truthfulness protections for the automatic post-edit offering. Large language models (LLMs) can generate incorrect information due to hallucinations. Amazon Bedrock supports contextual grounding checks to detect and filter hallucinations if the responses are factually incorrect or inconsistent. This is a critical feature for a translation solution that requires perfect accuracy.
Harnessing LLMs for automatic post-editing
To translate at scale, Amazon Translate powered machine translation is used in AWS team workflows. Segments whose translations can’t be recycled from translation memories (databases of previous high-quality human translations) are routed to machine translation workflows. Depending on the language or content, Amazon either uses a machine translation-only workflow where content is translated and published with no human touch, or machine translation post-edit workflows. Post-editing is when a linguist finesses the machine-translated output of a given segment to make sure it correctly conveys the meaning of the original sentence and is in line with AWS style guides and agreed glossaries. Because this process can add days to the translation timeline, automating some or all of the process would have a major impact on cost and turnaround times.
The following diagram illustrates the machine translation workflow.

The workflow consists of the following components:

TM (translation memory) – The translation memory is a client-specific repository of previously translated and approved content. It’s always applied first and maximizes the reuse of existing translations.
MT (machine translation) – After existing translations are applied, new content is processed through machine translation using Amazon Translate.
APE (automated post-edit) – An LLM is employed to edit, improve, and correct machine-translated content.
HPE (human post-edit) – A subject matter expert linguist revises and perfects the machine-translated content.

The following example follows the path through the preceding workflow for one source segment.

Source – To choose user name attributes, don’t select User name as a sign-in option when you create your user pool.
MT – Pour choisir des attributs de nom d’utilisateur, évitez de sélectionner User name (Nom d’utilisateur) comme option de connexion au moment de créer votre groupe d’utilisateurs.
APE – Pour choisir des attributs de nom d’utilisateur, évitez de sélectionner User name (Nom d’utilisateur) comme option de connexion lorsque vous créez votre groupe d’utilisateurs.
HPE – Pour choisir les attributs de nom d’utilisateur, évitez de sélectionner User name (Nom d’utilisateur) comme option de connexion lorsque vous créez votre groupe d’utilisateurs.

TransPerfect began working with generative AI and LLMs several years ago with the foresight that AI was on track to disrupt the translation industry. As expected, localization workflows have mostly shifted to “expert in the loop”, and are striving toward “no human touch” models. In pursuit of this, TransPerfect chose to use Amazon Bedrock within its GlobalLink Enterprise solution to further automate and optimize these workflows. Amazon Bedrock, by design, provides data ownership and security. This is a critical feature for TransPerfect clients, especially those in sensitive industries such as life sciences or banking.
With Amazon Bedrock and GlobalLink, machine-translated content is now routed through one of the LLMs available in Amazon Bedrock for automatic post-editing. By using style guides, relevant examples of approved translations, and examples of errors to avoid, the LLM is prompted to improve existing machine translations. This post-edited content is either handed off to a linguist for a lighter post-edit (a less difficult task) or is applied in “no human touch workflows” to greatly improve the output. The result is enhanced quality across the board and the ability for post-editors to focus on higher-value edits.
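As a rough sketch of how such an automatic post-editing call could look with the Amazon Bedrock Converse API via boto3 (the model ID, prompt wording, and style-guide snippet are illustrative assumptions, not TransPerfect’s production prompt):

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

system_prompt = (
    "You are a translation post-editor. Improve the machine translation so it follows "
    "the style guide and approved examples, without changing the meaning."
)
user_prompt = (
    "Style guide: keep UI labels in English with the French gloss in parentheses.\n"
    "Source: To choose user name attributes, don't select User name as a sign-in option.\n"
    "Machine translation: Pour choisir des attributs de nom d'utilisateur, ...\n"
    "Return only the post-edited French translation."
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model choice
    system=[{"text": system_prompt}],
    messages=[{"role": "user", "content": [{"text": user_prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])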
For post-editing, over 95% of the edits suggested by Amazon Bedrock LLMs showed markedly improved translation quality, leading to overall translation cost savings of up to 50% for TransPerfect and freeing human linguists for higher-level tasks.
Harnessing LLMs for transcreation
Although machine translation shows great strength in technical, formal, and instructional content, it hasn’t historically performed as well with creative content that leans into nuance, subtlety, humor, descriptiveness, and cultural references. Creative content can sound stiff or unnatural when machine translated. Because of this, TransPerfect has traditionally relied on human linguists to manually transcreate this type of content.
Transcreation is the process of adapting a message from one language to another while maintaining its intent, style, tone, and context. In German, for example, Nike’s “Just do it” tagline is transcreated to “Du tust es nie nur für dich,” which actually means “you never do it just for yourself.”
A successfully transcreated message evokes the same emotions and carries the same implications in the target language as it does in the source language. The AWS team uses transcreation for highly creative marketing assets to maximize their impact in a given industry. However, transcreation historically hasn’t benefitted from the automation solutions used in other types of localization workflows due to the highly customized and creative nature of the process. This means there has been a lot of interest in using generative AI to potentially decrease the costs and time associated with transcreation.
TransPerfect sought to use LLMs to cut down on time and costs typically associated with transcreation. Rather than an all-human or fully automated process, translations are produced through Anthropic’s Claude or Amazon Nova Pro on Amazon Bedrock, with the prompt to create multiple candidate translations with some variations. Within the translation editor, the human linguist chooses the most suitable adapted translation instead of composing it from scratch.
The following screenshot shows an LLM-powered transcreation within the GlobalLink Translate online editor.

Using GlobalLink powered by Amazon Bedrock for transcreation, users are seeing linguist productivity gains of up to 60%.
Conclusion
Thanks to LLM-powered transcreation and post-editing, customers in industries ranging from life sciences to finance to manufacturing have seen cost savings of up to 40% within their translation workflows and up to an 80% reduction in project turnaround times. In addition, the automatic post-edit step added to machine translation-only workflows provides a major quality boost to the no human touch output.
Amazon Bedrock safeguards data by not allowing sharing with FM providers and excluding it from model improvements. Beyond data security, responsible AI is essential. Amazon Bedrock Guardrails allows TransPerfect to customize truthfulness protections for post-editing. To address AI hallucinations, it offers contextual grounding checks to identify and filter inaccuracies—critical for producing precise translations.
Try out LLM-powered transcreation and post-editing with Amazon Bedrock for your own use case, and share your feedback and questions in the comments.

About the authors
Peter Chung is a Senior Solutions Architect at AWS, based in New York. Peter helps software and internet companies across multiple industries scale, modernize, and optimize. Peter is the author of “AWS FinOps Simplified”, and is an active member of the FinOps community.
Franziska Willnow is a Senior Program Manager (Tech) at AWS. A seasoned localization professional, Franziska Willnow brings over 15 years of expertise from various localization roles at Amazon and other companies. Franziska focuses on localization efficiency improvements through automation, machine learning, and AI/LLM. Franziska is passionate about building innovative products to support AWS’ global customers.
Ajit Manuel is a product leader at AWS, based in Seattle. Ajit heads the content technology product practice, which powers the AWS global content supply chain from creation to intelligence with practical enterprise AI. Ajit is passionate about enterprise digital transformation and applied AI product development. He has pioneered solutions that transformed InsurTech, MediaTech, and global MarTech.
Keith Brazil is Senior Vice President of Technology at TransPerfect, with specialization in Translation Management technologies as well as AI/ML data collection and annotation platforms. A native of Dublin, Ireland, Keith has been based in New York city for the last 23 years.
Julien Didier is Vice-President of Technology for translations.com and is responsible for the implementation of AI for both internal workflows and client-facing products. Julien manages a worldwide team of engineers, developers and architects who ensure successful deployments in addition to providing feedback for feature requests.
Bryan Rand is Senior Vice President of Global Solutions at TransPerfect, specializing in enterprise software, AI-driven digital marketing, and content management strategies. With over 20 years of experience leading business units and implementing customer experience innovations, Bryan has played a key role in driving successful global transformations for Fortune 1000 companies. He holds a BA in Economics from the University of Texas.

Racing beyond DeepRacer: Debut of the AWS LLM League

The AWS DeepRacer League is the world’s first autonomous racing league, open to anyone. Announced at re:Invent 2018, it puts machine learning in the hands of every developer through the fun and excitement of developing and racing self-driving remote control cars. Over the past 7 years, more than 560,000 developers of all skill levels have competed in the league at thousands of Amazon and customer events globally. While the final championships concluded at re:Invent 2024, that same event played host to a brand new AI competition, ushering in a new era of gamified learning in the age of generative AI.
In December 2024, AWS launched the AWS Large Language Model League (AWS LLM League) during re:Invent 2024. This inaugural event marked a significant milestone in democratizing machine learning, bringing together over 200 enthusiastic attendees from diverse backgrounds to engage in hands-on technical workshops and a competitive foundation model fine-tuning challenge. Building on learnings from DeepRacer, the primary objective of the event was to make model customization easier to learn while fostering a collaborative community around generative AI innovation through a gamified competition format.
AWS LLM League structure and outcomes
The AWS LLM League was designed to lower the barriers to entry in generative AI model customization by providing an experience where participants, regardless of their prior data science experience, could engage in fine-tuning LLMs. Using Amazon SageMaker JumpStart, attendees were guided through the process of customizing LLMs to address real business challenges adaptable to their domain.

As shown in the preceding figure, the challenge began with a workshop, where participants embarked on a competitive journey to develop highly effective fine-tuned LLMs. Competitors were tasked with customizing Meta’s Llama 3.2 3B base model for a specific domain, applying the tools and techniques they learned. Each submitted model was compared against a larger 90B reference model, with the quality of the responses decided using an LLM-as-a-Judge approach. Participants scored a win for each question where the LLM judge deemed the fine-tuned model’s response to be more accurate and comprehensive than that of the larger model.

In the preliminary rounds, participants submitted hundreds of unique fine-tuned models to the competition leaderboard, each striving to outperform the baseline model. These submissions were evaluated based on accuracy, coherence, and domain-specific adaptability. After rigorous assessments, the top five finalists were shortlisted, with the best models achieving win rates above 55% against the large reference models (as shown in the preceding figure). Demonstrating that a smaller model can achieve competitive performance highlights significant benefits in compute efficiency at scale. Using a 3B model instead of a 90B model reduces operational costs, enables faster inference, and makes advanced AI more accessible across various industries and use cases.
The competition culminated in the Grand Finale, where the finalists showcased their models in a final round of evaluation to determine the ultimate winner.
The fine-tuning journey
This journey was carefully designed to guide participants through each critical stage of fine-tuning a large language model—from dataset creation to model evaluation—using a suite of no-code AWS tools. Whether they were newcomers or experienced builders, participants gained hands-on experience in customizing a foundation model through a structured, accessible process. Let’s take a closer look at how the challenge unfolded, starting with how participants prepared their datasets.
Stage 1: Preparing the dataset with PartyRock
During the workshop, participants learned how to generate synthetic data using an Amazon PartyRock playground (as shown in the following figure). PartyRock offers access to a variety of top foundation models through Amazon Bedrock at no additional cost. This enabled participants to use a no-code, AI-generated app to create the synthetic training data used for fine-tuning.

Participants began by defining the target domain for their fine-tuning task, such as finance, healthcare, or legal compliance. Using PartyRock’s intuitive interface, they generated instruction-response pairs that mimicked real-world interactions. To enhance dataset quality, they used PartyRock’s ability to refine responses iteratively, making sure that the generated data was both contextually relevant and aligned with the competition’s objectives.
This phase was crucial because the quality of synthetic data directly impacted the model’s ability to outperform a larger baseline model. Some participants further enhanced their datasets by employing external validation methods, such as human-in-the-loop review or reinforcement learning-based filtering.
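For illustration, synthetic instruction-response pairs like the ones generated in PartyRock can be stored as JSON Lines before fine-tuning; the exact file layout expected by the fine-tuning job is an assumption here, and the example rows are invented:

import json

examples = [
    {
        "instruction": "Summarize the key exclusions in this travel insurance policy.",
        "response": "The policy excludes pre-existing conditions, extreme sports injuries, and claims filed more than 90 days after the incident.",
    },
    {
        "instruction": "Explain the waiting period for dental coverage in plain language.",
        "response": "Dental benefits start 6 months after the policy begins; routine cleanings are covered immediately.",
    },
]

# Write one JSON object per line, a common format for instruction tuning datasets.
with open("train.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")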
Stage 2: Fine-tuning with SageMaker JumpStart
After the datasets were prepared, participants moved to SageMaker JumpStart, a fully managed machine learning hub that simplifies the fine-tuning process. Using a pre-trained Meta Llama 3.2 3B model as the base, they customized it with their curated datasets, adjusting hyperparameters (shown in the following figure) such as:

Epochs: Determining how many times the model iterates over the dataset.
Learning rate: Controlling how much the model weights adjust with each iteration.
LoRA parameters: Optimizing efficiency with low-rank adaptation (LoRA) techniques.

One of the key advantages of SageMaker JumpStart is that it provides a no-code UI, shown in the following figure, allowing participants to fine-tune models without needing to write code. This accessibility enabled even those with minimal machine learning experience to engage in model customization effectively.

By using the distributed training capabilities of SageMaker, participants were able to run multiple experiments in parallel, optimizing their models for accuracy and response quality. The iterative fine-tuning process allowed them to explore different configurations to maximize performance.
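For builders who prefer code over the console, roughly the same fine-tuning job can be launched with the SageMaker Python SDK’s JumpStart estimator; the model ID, instance type, hyperparameter names, and S3 path below are assumptions to adapt to your account rather than the competition’s exact settings:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-2-3b",  # assumed JumpStart model ID
    environment={"accept_eula": "true"},
    instance_type="ml.g5.2xlarge",                # assumed training instance
)
# Hyperparameter names follow the JumpStart Llama fine-tuning convention (assumed).
estimator.set_hyperparameters(epoch="3", learning_rate="0.0001", lora_r="8")
estimator.fit({"training": "s3://<your-bucket>/llm-league/train/"})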
Stage 3: Evaluation with SageMaker Clarify
To make sure that their models were not only accurate but also unbiased, participants had the option to use Amazon SageMaker Clarify for evaluation, shown in the following figure.

This phase included:

Bias detection: Identifying skewed response patterns that might favor specific viewpoints.
Explainability metrics: Understanding why the model made certain predictions.
Performance scoring: Comparing model output against ground truth labels.

While not mandatory, the integration of SageMaker Clarify provided an additional layer of assurance for participants who wanted to validate their models further, verifying that their outputs were reliable and performant.
Stage 4: Submission and evaluation using LLM-as-a-Judge from Amazon Bedrock
After fine-tuned models were ready, they were submitted to the competition leaderboard for evaluation using the Amazon Bedrock Evaluations LLM-as-a-Judge approach. This automated evaluation system compares the fine-tuned models against the reference 90B model using predefined benchmarks, as shown in the following figure.

Each response was scored based on:

Relevance: How well the response addressed the question.
Depth: The level of detail and insight provided.
Coherence: Logical flow and consistency of the answer.

Participants’ models earned a score each time their response outperformed the 90B model in a head-to-head comparison. The leaderboard dynamically updated as new submissions were evaluated, fostering a competitive yet collaborative learning environment.
Grand Finale showcase
The Grand Finale of the AWS LLM League was an electrifying showdown, where the top five finalists, handpicked from hundreds of submissions, competed in a high-stakes live event. Among them was Ray, a determined contender whose fine-tuned model had consistently delivered strong results throughout the competition. Each finalist had to prove not just the technical superiority of their fine-tuned models, but also their ability to adapt and refine responses in real-time.

The competition was intense from the outset, with each participant bringing unique strategies to the table. Ray’s ability to tweak prompts dynamically set him apart early on, providing optimal responses to a range of domain-specific questions. The energy in the room was palpable as finalists’ AI-generated answers were judged by a hybrid evaluation system—40% by an LLM, 40% by expert panelists from Meta AI and AWS, and 20% by an enthusiastic live audience against the following rubric:

Generalization ability: How well the fine-tuned model adapted to previously unseen questions.
Response quality: Depth, accuracy, and contextual understanding.
Efficiency: The model’s ability to provide comprehensive answers with minimal latency.
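As a quick illustration of the 40/40/20 weighting described above (assuming each score is normalized to the same 0-100 scale):

def final_score(llm_judge: float, expert_panel: float, audience: float) -> float:
    # 40% LLM judge, 40% expert panel, 20% live audience.
    return 0.4 * llm_judge + 0.4 * expert_panel + 0.2 * audience

print(final_score(llm_judge=78, expert_panel=85, audience=90))  # 83.2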

One of the most gripping moments came when contestants encountered the infamous Strawberry Problem, a deceptively simple letter-counting challenge that exposed an inherent weakness in LLMs. Ray’s model delivered the correct answer, but the AI judge misclassified it, sparking a debate among the human judges and audience. This pivotal moment underscored the importance of human-in-the-loop evaluation, highlighting how AI and human judgment must complement each other for fair and accurate assessments.
As the final round concluded, Ray’s model consistently outperformed expectations, securing him the title of AWS LLM League Champion. The Grand Finale was not just a test of AI—it was a showcase of innovation, strategy, and the evolving synergy between artificial intelligence and human ingenuity.

Conclusion and looking ahead
The inaugural AWS LLM League competition successfully demonstrated how large language model fine-tuning can be gamified to drive innovation and engagement. By providing hands-on experience with cutting-edge AWS AI and machine learning (ML) services, the competition not only demystified the fine-tuning process, but also inspired a new wave of AI enthusiasts to experiment and innovate in this space.
As the AWS LLM League moves forward, future iterations will expand on these learnings, incorporating more advanced challenges, larger datasets, and deeper model customization opportunities. Whether you’re a seasoned AI practitioner or a newcomer to machine learning, the AWS LLM League offers an exciting and accessible way to develop real-world AI expertise.
Stay tuned for upcoming AWS LLM League events and get ready to put your fine-tuning skills to the test!

About the authors
Vincent Oh is the Senior Specialist Solutions Architect in AWS for AI & Innovation. He works with public sector customers across ASEAN, owning technical engagements and helping them design scalable cloud solutions across various innovation projects. He created the LLM League in the midst of helping customers harness the power of AI in their use cases through gamified learning. He also serves as an Adjunct Professor in Singapore Management University (SMU), teaching computer science modules under School of Computer & Information Systems (SCIS). Prior to joining Amazon, he worked as Senior Principal Digital Architect at Accenture and Cloud Engineering Practice Lead at UST.
Natasya K. Idries is the Product Marketing Manager for AWS AI/ML Gamified Learning Programs. She is passionate about democratizing AI/ML skills through engaging and hands-on educational initiatives that bridge the gap between advanced technology and practical business implementation. Her expertise in building learning communities and driving digital innovation continues to shape her approach to creating impactful AI education programs. Outside of work, Natasya enjoys traveling, cooking Southeast Asian cuisines and exploring nature trails.

Boson AI Introduces Higgs Audio Understanding and Higgs Audio Generati …

In today’s enterprise landscape—especially in insurance and customer support—voice and audio data are more than just recordings; they’re valuable touchpoints that can transform operations and customer experiences. With AI audio processing, organizations can automate transcriptions with remarkable accuracy, surface critical insights from conversations, and power natural, engaging voice interactions. By utilizing these capabilities, businesses can boost efficiency, uphold compliance standards, and build deeper connections with customers, all while meeting the high expectations of these demanding industries.

Boson AI introduces Higgs Audio Understanding and Higgs Audio Generation, two robust solutions that empower you to develop custom AI agents for a wide range of audio applications. Higgs Audio Understanding focuses on listening and contextual comprehension. Higgs Audio Generation excels in expressive speech synthesis. Both solutions are currently optimized for English, with support for additional languages on the way. They enable AI interactions that closely resemble natural human conversation. Enterprises can leverage these tools to power real-world audio applications.

Higgs Audio Understanding: Listening Beyond Words  

Higgs Audio Understanding is Boson AI’s advanced solution for audio comprehension. It surpasses traditional speech-to-text systems by capturing context, speaker traits, emotions, and intent. The model deeply integrates audio processing with a large language model (LLM), converting audio inputs into rich contextual embeddings, including speech tone, background sounds, and speaker identities. The model achieves nuanced interpretation by processing these alongside text tokens, essential for tasks such as meeting transcription, contact center analytics, and media archiving.

A key strength is its chain-of-thought audio reasoning capability. This allows the model to analyze audio in a structured, step-by-step manner, solving complex tasks like counting word occurrences, interpreting humor from tone, or applying external knowledge to audio contexts in real time. Tests show Higgs Audio Understanding leads standard speech recognition benchmarks (e.g., Common Voice for English) and outperforms competitors like Qwen-Audio, Gemini, and GPT-4o-audio in holistic audio reasoning evaluations, achieving top scores (60.3 average on AirBench Foundation) with its reasoning enhancements. This real-time, contextual comprehension can give enterprises unparalleled audio data insights.

Higgs Audio Generation: Speaking with Human-Like Nuance  

Higgs Audio Generation, Boson AI’s advanced speech synthesis model, enables AI to produce highly expressive, human-like speech essential for virtual assistants, automated services, and customer interactions. Unlike traditional text-to-speech (TTS) systems that often sound robotic, Higgs Audio Generation leverages an LLM at its core, enabling nuanced comprehension and expressive output closely aligned with textual context and intended emotions.

Boson AI addresses common limitations of legacy TTS, such as monotone delivery, emotional flatness, incorrect pronunciation of unfamiliar terms, and difficulty handling multi-speaker interactions, by incorporating deep contextual understanding into speech generation.

The unique capabilities of Higgs Audio Generation include:

Emotionally Nuanced Speech: It naturally adjusts tone and emotion based on textual context, creating more engaging and context-appropriate interactions.

Multi-Speaker Dialogue Generation: This technology simultaneously generates distinct, realistic voices for multi-character conversations, as Boson AI’s Magic Broom Shop demo demonstrated. It is ideal for audiobooks, interactive training, and dynamic storytelling.

Accurate Pronunciation and Accent Adaptation: Precisely pronounces uncommon names, foreign words, and technical jargon, adapting speech dynamically for global and diverse scenarios.

Real-Time Generation with Contextual Reasoning: This technology produces coherent, real-time speech outputs responsive to conversational shifts, suitable for interactive applications like customer support chatbots or live voice assistants.


Benchmark results confirm Higgs Audio’s superiority over top competitors, including CosyVoice2, Qwen2.5-omni, and ElevenLabs. In standard tests like SeedTTS and the Emotional Speech Dataset (ESD), Higgs Audio achieved significantly higher emotional accuracy, while being competitive or superior in word error rate (~1.5–2%). This performance demonstrates Higgs Audio’s ability to deliver unmatched clarity, expressiveness, and realism, setting a new benchmark for audio generation.

Under the Hood: LLMs, Audio Tokenizers, and In‑Context Learning  

Boson AI’s Higgs Audio models leverage advanced research, combining LLMs with innovative audio processing techniques. At their core, these models utilize pretrained LLMs, extending their robust language understanding, contextual awareness, and reasoning abilities to audio tasks. Boson AI achieves this integration by training LLMs end-to-end on extensive paired text–audio datasets, enabling semantic comprehension of spoken content and acoustic nuances.

Boson AI’s custom audio tokenizer is a critical element that efficiently compresses raw audio into discrete tokens using residual vector quantization (RVQ). This preserves linguistic information and subtle acoustic details (tone, timbre) while balancing token granularity for optimal speed and quality. These audio tokens seamlessly feed into the LLM alongside text, allowing simultaneous processing of audio and textual contexts. Also, Higgs Audio incorporates in-context learning, enabling models to adapt quickly without retraining. With simple prompts, such as brief reference audio samples, Higgs Audio Generation can instantly perform zero-shot voice cloning, matching speaking styles. Similarly, Higgs Audio Understanding rapidly customizes outputs (e.g., speaker labeling or domain-specific terminology) with minimal prompting.
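The residual vector quantization idea can be illustrated with a tiny NumPy sketch: each stage quantizes whatever residual the previous stage left behind, so a frame of audio features becomes a short list of codebook indices. The codebooks below are random toys, not Boson AI’s trained tokenizer.

import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 4, 256, 64
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_stages)]

def rvq_encode(frame):
    """Return one code index per stage for a single audio feature frame."""
    residual, codes = frame.copy(), []
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)
        residual = residual - codebook[idx]  # the next stage quantizes this residual
    return codes

frame = rng.normal(size=dim)   # stand-in for one frame of encoded audio features
print(rvq_encode(frame))       # e.g. [137, 42, 201, 9]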

Boson AI’s approach integrates transformer-based architectures, multimodal learning, and Chain-of-Thought (CoT) reasoning, enhancing interpretability and accuracy in audio comprehension and generation tasks. By combining LLM’s strengths with sophisticated audio tokenization and flexible prompting, Higgs Audio delivers unprecedented performance, speed, and adaptability, significantly surpassing traditional audio AI solutions.

Benchmark Performance: Outpacing Industry Leaders  

Boson AI extensively benchmarked Higgs Audio, confirming its competitive leadership in audio understanding and generation compared to top industry models.


In audio understanding, Higgs Audio matched or surpassed models like OpenAI’s GPT-4o-audio and Gemini-2.0 Flash. It delivered top-tier speech recognition accuracy, achieving state-of-the-art Mozilla Common Voice (English) results, robust performance on challenging tasks like Chinese speech recognition, and strong results on benchmarks such as LibriSpeech and FLEURS.  


However, Higgs Audio Understanding truly differentiates itself in complex audio reasoning tasks. On comprehensive tests like the AirBench Foundation and MMAU benchmarks, Higgs outperformed Alibaba’s Qwen-Audio, GPT-4o-audio, and Gemini models, scoring an average of 59.45, which improved to above 60 with CoT reasoning. This demonstrates the model’s superior capability to understand nuanced audio scenarios and dialogues with background noise and interpret audio contexts logically and insightfully.

On the audio generation side, Higgs Audio was evaluated against specialized TTS models, including ElevenLabs, Qwen 2.5-Omni, and CosyVoice2. Higgs Audio consistently led or closely matched competitors on key benchmarks:

Seed-TTS Eval: Higgs Audio achieved the lowest Word Error Rate (WER), indicating highly intelligible speech, and demonstrated the highest similarity to reference voices. In comparison, ElevenLabs had slightly lower intelligibility but notably weaker voice similarity.

Emotional Speech Dataset (ESD): Higgs Audio achieved the highest emotional similarity scores (over 80 versus mid-60s for ElevenLabs), excelling in emotionally nuanced speech generation.

Boson AI also introduced the “EmergentTTS-Eval,” using advanced audio-understanding models (even competitors like Gemini 2.0) as evaluators. Higgs Audio was consistently preferred over ElevenLabs in complex scenarios involving emotional expression, pronunciation accuracy, and nuanced intonation. Overall, benchmarks clearly show Higgs Audio’s comprehensive advantage, ensuring users adopting Boson AI’s models gain superior audio quality and insightful understanding capabilities.

Enterprise Deployment and Use Case: Bringing Higgs Audio to Business  

Higgs Audio Understanding and Generation function on a unified platform, enabling end-to-end voice AI pipelines that listen, reason, and respond, all in real time.

Customer Support: At a company like Chubb, a virtual claims agent powered by Higgs Audio can transcribe customer calls with high accuracy, detect stress or urgency, and identify key claim details. It separates speakers automatically and interprets context (e.g., recognizing a car accident scenario). Higgs Audio Generation responds in an empathetic, natural voice, even adapting to the caller’s accent. This improves resolution speed, reduces staff workload, and boosts customer satisfaction.

Media & Training Content: Enterprises producing e-learning or training materials can use Higgs Audio Generation to create multi-voice, multilingual narrations without hiring voice actors. Higgs Audio Understanding ensures quality control by verifying script adherence and emotional tone. Teams can also transcribe and analyze meetings for speaker sentiment and key takeaways, streamlining internal knowledge management.

Compliance & Analytics: In regulated industries, Higgs Audio Understanding can monitor conversations for compliance by recognizing intent beyond keywords. It detects deviations from approved scripts, flags sensitive disclosures, and surfaces customer trends or pain points over thousands of calls, enabling proactive insights and regulatory adherence.

Boson AI offers flexible deployment options, including API access, cloud, on-premises, and licensing, with models that adapt via prompt-based customization. Enterprises can tailor outputs to domain-specific terms or workflows using in-context learning, building intelligent voice agents that match internal vocabulary and tone. From multilingual chatbots to automated meeting summaries, Higgs Audio delivers conversational AI that feels truly human, raising the quality and capability of enterprise voice applications.

Future Outlook and Strategic Takeaways  

Boson AI’s roadmap for Higgs Audio indicates a strong future pipeline of features to deepen audio understanding and generation. A key upcoming capability is multi-voice cloning, allowing the model to learn multiple voice profiles from short samples and generate natural conversations between the speakers. This will enable use cases like AI-powered cast recordings or consistent virtual voices across customer touchpoints. This goes beyond current one-speaker cloning, with Boson AI’s TTS demo already hinting at its arrival. Another development is explicit control over style and emotion. While the current model infers emotion from context, future versions may allow users to specify parameters like “cheerful” or “formal,” enhancing brand consistency and user experience. The Smart Voice feature previewed in Boson AI’s demos suggests an intelligent voice-selection system tailored to script tone and intent.

On the understanding side, future updates may enhance comprehension with features like long-form conversation summarization, deeper reasoning via expanded chain-of-thought capabilities, and real-time streaming support. These advancements could enable applications like live analytics for support calls or AI-driven meeting insights.

Strategically, Boson AI positions Higgs Audio as a unified enterprise audio AI solution. By adopting Higgs Audio, companies can access the frontier of voice AI with tools that understand, reason, and speak with human-level nuance.  Its dual strength in understanding and generation, built on shared infrastructure, allows seamless integration and continuous improvement. Enterprises can benefit from a consistent platform where models evolve together, one that adapts easily and stays ahead of the curve. Boson AI offers a future-proof foundation for enterprise innovation in a world increasingly shaped by audio interfaces.

Sources

https://boson.ai/ 

https://boson.ai/blog/higgs-audio/ 

https://boson.ai/demo/shop

https://boson.ai/demo/tts

Thanks to the Boson AI team for the thought leadership and resources for this article. The Boson AI team has financially supported us for this content.
The post Boson AI Introduces Higgs Audio Understanding and Higgs Audio Generation: An Advanced AI Solution with Real-Time Audio Reasoning and Expressive Speech Synthesis for Enterprise Applications appeared first on MarkTechPost.

Interview with Hamza Tahir: Co-founder and CTO of ZenML

Bio: Hamza Tahir is a software developer turned ML engineer. An indie hacker by heart, he loves ideating, implementing, and launching data-driven products. His previous projects include PicHance, Scrilys, BudgetML, and you-tldr. Based on his learnings from deploying ML in production for predictive maintenance use-cases in his previous startup, he co-created ZenML, an open-source MLOps framework for creating production grade ML pipelines on any infrastructure stack.

Question: From Early Projects to ZenML: Given your rich background in software development and ML engineering—from pioneering projects like BudgetML to co-founding ZenML and building production pipelines at maiot.io—how has your personal journey influenced your approach to creating an open-source ecosystem for production-ready AI?

My journey from early software development to co-founding ZenML has deeply shaped how I approach building open-source tools for AI production. Working on BudgetML taught me that accessibility in ML infrastructure is critical – not everyone has enterprise-level resources, yet everyone deserves access to robust tooling. 

At my first startup maiot.io, I witnessed firsthand how fragmented the MLOps landscape was, with teams cobbling together solutions that often broke in production. This fragmentation creates real business pain points – for example, many enterprises struggle with lengthy time-to-market cycles for their ML models due to these exact challenges.

These experiences drove me to create ZenML with a focus on being production-first, not production-eventual. We built an ecosystem that brings structure to the chaos of managing models, ensuring that what works in your experimental environment transitions smoothly to production. Our approach has consistently helped organizations reduce deployment times and increase efficiency in their ML workflows.

The open-source approach wasn’t just a distribution strategy—it was foundational to our belief that MLOps should be democratized, allowing teams of all sizes to benefit from best practices developed across the industry. We’ve seen organizations of all sizes—from startups to enterprises—accelerate their ML development cycles by 50-80% by adopting these standardized, production-first practices.

Question: From Lab to Launch: Could you share a pivotal moment or technical challenge that underscored the need for a robust MLOps framework in your transition from experimental models to production systems?

ZenML grew out of our experience working in predictive maintenance. We were essentially functioning as consultants, implementing solutions for various clients. A little over four years ago when we started, there were far fewer tools available and those that existed lacked maturity compared to today’s options.

We quickly discovered that different customers had vastly different needs—some wanted AWS, others preferred GCP. While Kubeflow was emerging as a solution that operated on top of Kubernetes, it wasn’t yet the robust MLOps framework that ZenML offers now.

The pivotal challenge was finding ourselves repeatedly writing custom glue code for each client implementation. This pattern of constantly developing similar but platform-specific solutions highlighted the clear need for a more unified approach. We initially built ZenML on top of TensorFlow’s TFX, but eventually removed that dependency to develop our own implementation that could better serve diverse production environments.

Question: Open-Source vs. Closed-Source in MLOps: While open-source solutions are celebrated for innovation, how do they compare with proprietary options in production AI workflows? Can you share how community contributions have enhanced ZenML’s capabilities in solving real MLOps challenges?

Proprietary MLOps solutions offer polished experiences but often lack adaptability. Their biggest drawback is the “black box” problem—when something breaks in production, teams are left waiting for vendor support. With open-source tools like ZenML, teams can inspect, debug, and extend the tooling themselves.

This transparency enables agility. Open-source frameworks incorporate innovations faster than quarterly releases from proprietary vendors. For LLMs, where best practices evolve weekly, this speed is invaluable.

The power of community-driven innovation is exemplified by one of our most transformative contributions—a developer who built the “Vertex” orchestrator integration for Google Cloud Platform. This wasn’t just another integration—it represented a completely new approach to orchestrating pipelines on GCP that opened up an entirely new market for us.

Prior to this contribution, our GCP users had limited options. The community member developed a comprehensive Vertex AI integration that enabled seamless orchestration on Google Cloud Platform.

Question: Integrating LLMs into Production: With the surge in generative AI and large language models, what are the key obstacles you’ve encountered in LLMOps, and how does ZenML help mitigate these challenges?

LLMOps presents unique challenges including prompt engineering management, complex evaluation metrics, escalating costs, and pipeline complexity.

ZenML helps by providing:

Structured pipelines for LLM workflows, tracking all components from prompts to post-processing logic

Integration with LLM-specific evaluation frameworks

Caching mechanisms to control costs

Lineage tracking for debugging complex LLM chains

Our approach bridges traditional MLOps and LLMOps, allowing teams to leverage established practices while addressing LLM-specific challenges. ZenML’s extensible architecture lets teams incorporate emerging LLMOps tools while maintaining reliability and governance.

Question: Streamlining MLOps Workflows: What best practices would you recommend for teams aiming to build secure, scalable ML pipelines using open-source tools, and how does ZenML facilitate this process?

For teams building ML pipelines with open-source tools, I recommend:

Start with reproducibility through strict versioning

Design for observability from day one

Embrace modularity with interchangeable components

Automate testing for data, models, and security

Standardize environments through containerization

ZenML facilitates these practices with a Pythonic framework that enforces reproducibility, integrates with popular MLOps tools, supports modular pipeline steps, provides testing hooks, and enables seamless containerization.
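To ground this, here is a minimal sketch using ZenML's public `@step` and `@pipeline` decorators; the step names and their trivial bodies are hypothetical placeholders, and a real pipeline would plug in actual data loading, training, and evaluation logic.

```python
from zenml import pipeline, step

@step
def load_data() -> dict:
    # Hypothetical data-loading step; in practice this would read a versioned dataset.
    return {"features": [[0.1, 0.2], [0.3, 0.4]], "labels": [0, 1]}

@step
def train_model(data: dict) -> float:
    # Placeholder "training" that returns an accuracy-like score.
    # ZenML tracks this step's inputs, outputs, and code version automatically.
    return 0.92

@step
def evaluate(score: float) -> bool:
    # Simple quality gate: only promote models above a threshold.
    return score >= 0.9

@pipeline
def training_pipeline():
    data = load_data()
    score = train_model(data)
    evaluate(score)

if __name__ == "__main__":
    # Runs locally by default; the same pipeline can later target a cloud orchestrator.
    training_pipeline()
```

Because each step's inputs, outputs, and code version are tracked, the same pipeline can run locally during experimentation and later target a containerized cloud orchestrator without changing the step code.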

We’ve seen these principles transform organizations like Adeo Leroy Merlin. After implementing these best practices through ZenML, they reduced their ML development cycle by 80%, with their small team of data scientists now deploying new ML use cases from research to production in days rather than months, delivering tangible business value across multiple production models.

The key insight: MLOps isn’t a product you adopt, but a practice you implement. Our framework makes following best practices the path of least resistance while maintaining flexibility.

Question: Engineering Meets Data Science: Your career spans both software engineering and ML engineering—how has this dual expertise influenced your design of MLOps tools that cater to real-world production challenges?

My dual background has revealed a fundamental disconnect between data science and software engineering cultures. Data scientists prioritize experimentation and model performance, while software engineers focus on reliability and maintainability. This divide creates significant friction when deploying ML systems to production.

ZenML was designed specifically to bridge this gap by creating a unified framework where both disciplines can thrive. Our Python-first APIs provide the flexibility data scientists need while enforcing software engineering best practices like version control, modularity, and reproducibility. We’ve embedded these principles into the framework itself, making the right way the easy way.

This approach has proven particularly valuable for LLM projects, where the technical debt accumulated during prototyping can become crippling in production. By providing a common language and workflow for both researchers and engineers, we’ve helped organizations reduce their time-to-production while simultaneously improving system reliability and governance.

Question: MLOps vs. LLMOps: In your view, what distinct challenges do traditional MLOps face compared to LLMOps, and how should open-source frameworks evolve to address these differences?

Traditional MLOps focuses on feature engineering, model drift, and custom model training, while LLMOps deals with prompt engineering, context management, retrieval-augmented generation, subjective evaluation, and significantly higher inference costs.

Open-source frameworks need to evolve by providing:

Consistent interfaces across both paradigms

LLM-specific cost optimizations like caching and dynamic routing

Support for both traditional and LLM-specific evaluation

First-class prompt versioning and governance

ZenML addresses these needs by extending our pipeline framework for LLM workflows while maintaining compatibility with traditional infrastructure. The most successful teams don’t see MLOps and LLMOps as separate disciplines, but as points on a spectrum, using common infrastructure for both.

Question: Security and Compliance in Production: With data privacy and security being critical, what measures does ZenML implement to ensure that production AI models are secure, especially when dealing with dynamic, data-intensive LLM operations?

ZenML implements robust security measures at every level:

Granular pipeline-level access controls with role-based permissions

Comprehensive artifact provenance tracking for complete auditability

Secure handling of API keys and credentials through encrypted storage

Data governance integrations for validation, compliance, and PII detection

Containerization for deployment isolation and attack surface reduction

These measures enable teams to implement security by design, not as an afterthought. Our experience shows that embedding security into the workflow from the beginning dramatically reduces vulnerabilities compared to retrofitting security later. This proactive approach is particularly crucial for LLM applications, where complex data flows and potential prompt injection attacks create unique security challenges that traditional ML systems don’t face.

Question: Future Trends in AI: What emerging trends for MLOps and LLMOps do you believe will redefine production workflows over the next few years, and how is ZenML positioning itself to lead these changes?

Agents and workflows represent a critical emerging trend in AI. Anthropic notably differentiated between these approaches in their blog about Claude agents, and ZenML is strategically focusing on workflows primarily for reliability considerations.

While we may eventually reach a point where we can trust LLMs to autonomously generate plans and iteratively work toward goals, current production systems demand the deterministic reliability that well-defined workflows provide. We envision a future where workflows remain the backbone of production AI systems, with agents serving as carefully constrained components within a larger, more controlled process—combining the creativity of agents with the predictability of structured workflows.

The industry is witnessing unprecedented investment in LLMOps and LLM-driven projects, with organizations actively experimenting to establish best practices as models rapidly evolve. The definitive trend is the urgent need for systems that deliver both innovation and enterprise-grade reliability—precisely the intersection where ZenML is leveraging its years of battle-tested MLOps experience to create transformative solutions for our customers.

Question: Fostering Community Engagement: Open source thrives on collaboration—what initiatives or strategies have you found most effective in engaging the community around ZenML and encouraging contributions in MLOps and LLMOps?

We’ve implemented several high-impact community engagement initiatives that have yielded measurable results. Beyond actively soliciting and integrating open-source contributions for components and features, we hosted one of the first large-scale MLOps competitions in 2023, which attracted over 200 participants and generated dozens of innovative solutions to real-world MLOps challenges.

We’ve established multiple channels for technical collaboration, including an active Slack community, regular contributor meetings, and comprehensive documentation with clear contribution guidelines. Our community members regularly discuss implementation challenges, share production-tested solutions, and contribute to expanding the ecosystem through integrations and extensions. These strategic community initiatives have been instrumental in not only growing our user base substantially but also advancing the collective knowledge around MLOps and LLMOps best practices across the industry.

Question: Advice for Aspiring AI Engineers: Finally, what advice would you give to students and early-career professionals who are eager to dive into the world of open-source AI, MLOps and LLMOps, and what key skills should they focus on developing?

For those entering MLOps and LLMOps: 

Build complete systems, not just models—the challenges of production offer the most valuable learning

Develop strong software engineering fundamentals

Contribute to open-source projects to gain exposure to real-world problems

Focus on data engineering—data quality issues cause more production failures than model problems

Learn cloud infrastructure basics

Key skills to develop include Python proficiency, containerization, distributed systems concepts, and monitoring tools. For bridging roles, focus on communication skills and product thinking. Cultivate “systems thinking”—understanding component interactions is often more valuable than deep expertise in any single area. Remember that the field is evolving rapidly. Being adaptable and committed to continuous learning is more important than mastering any particular tool or framework.

Question: How does ZenML’s approach to workflow orchestration differ from traditional ML pipelines when handling LLMs, and what specific challenges does it solve for teams implementing RAG or agent-based systems?

At ZenML, we believe workflow orchestration must be paired with robust evaluation systems—otherwise, teams are essentially flying blind. This is especially crucial for LLM workflows, where behaviour can be much less predictable than traditional ML models.

Our approach emphasizes “eval-first development” as the cornerstone of effective LLM orchestration. This means evaluation runs as quality gates or as part of the outer development loop, incorporating user feedback and annotations to continually improve the system.

For RAG or agent-based systems specifically, this eval-first approach helps teams identify whether issues are coming from retrieval components, prompt engineering, or the foundation models themselves. ZenML’s orchestration framework makes it straightforward to implement these evaluation checkpoints throughout your workflow, giving teams confidence that their systems are performing as expected before reaching production.

Question: What patterns are you seeing emerge for successful hybrid systems that combine traditional ML models with LLMs, and how does ZenML support these architectures?

ZenML takes a deliberately unopinionated approach to architecture, allowing teams to implement patterns that work best for their specific use cases. Common hybrid patterns include RAG systems with custom-tuned embedding models and specialized language models for structured data extraction.

This hybrid approach—combining custom-trained models with foundation models—delivers superior results for domain-specific applications. ZenML supports these architectures by providing a consistent framework for orchestrating both traditional ML components and LLM components within a unified workflow.

Our platform enables teams to experiment with different hybrid architectures while maintaining governance and reproducibility across both paradigms, making the implementation and evaluation of these systems more manageable.

Question: As organizations rush to implement LLM solutions, how does ZenML help teams maintain the right balance between experimentation speed and production governance?

ZenML handles best practices out of the box—tracking metadata, evaluations, and the code used to produce them without teams having to build this infrastructure themselves. This means governance doesn’t come at the expense of experimentation speed.

As your needs grow, ZenML grows with you. You might start with local orchestration during early experimentation phases, then seamlessly transition to cloud-based orchestrators and scheduled workflows as you move toward production—all without changing your core code.

Lineage tracking is a key feature that’s especially relevant given emerging regulations like the EU AI Act. ZenML captures the relationships between data, models, and outputs, creating an audit trail that satisfies governance requirements while still allowing teams to move quickly. This balance between flexibility and governance helps prevent organizations from ending up with “shadow AI” systems built outside official channels.

Question: What are the key integration challenges enterprises face when incorporating foundation models into existing systems, and how does ZenML’s workflow approach address these?

A key integration challenge for enterprises is tracking which foundation model (and which version) was used for specific evaluations or production outputs. This lineage and governance tracking is critical both for regulatory compliance and for debugging issues that arise in production.

ZenML addresses this by maintaining a clear lineage between model versions, prompts, inputs, and outputs across your entire workflow. This provides both technical and non-technical stakeholders with visibility into how foundation models are being used within enterprise systems.

Our workflow approach also helps teams manage environment consistency and version control as they move LLM applications from development to production. By containerizing workflows and tracking dependencies, ZenML reduces the “it works on my machine” problems that often plague complex integrations, ensuring that LLM applications behave consistently across environments.
The post Interview with Hamza Tahir: Co-founder and CTO of ZenML appeared first on MarkTechPost.

OpenAI Open Sources BrowseComp: A New Benchmark for Measuring the Ability for AI Agents to Browse the Web

Despite advances in large language models (LLMs), AI agents still face notable limitations when navigating the open web to retrieve complex information. While many models excel on static knowledge benchmarks, they often underperform when tasked with locating nuanced, context-dependent facts across multiple sources. Most existing benchmarks evaluate a model’s recall of easily accessible knowledge, which does not reflect the intricacy of real-world browsing tasks. In contrast, agents operating in applied settings—whether assisting with research, summarizing policy, or fact-checking claims—require persistence, structured reasoning, and the ability to dynamically adapt their search strategies. These capabilities remain underdeveloped in current AI systems.

OpenAI Open Sources BrowseComp: A Benchmark of 1,266 Information-Seeking Tasks

To better evaluate these capabilities, OpenAI has released BrowseComp, a benchmark designed to assess agents’ ability to persistently browse the web and retrieve hard-to-find information. The benchmark includes 1,266 fact-seeking problems, each with a short, unambiguous answer. Solving these tasks often requires navigating through multiple webpages, reconciling diverse information, and filtering relevant signals from noise.

The benchmark is inspired by the notion that just as programming competitions serve as focused tests for coding agents, BrowseComp offers a similarly constrained yet revealing evaluation of web-browsing agents. It deliberately avoids tasks with ambiguous user goals or long-form outputs, focusing instead on the core competencies of precision, reasoning, and endurance.

BrowseComp was created using a reverse-question design methodology: beginning with a specific, verifiable fact, its creators constructed a question designed to obscure the answer through complexity and constraint. Human trainers ensured that questions could not be solved via superficial search and would challenge both retrieval and reasoning capabilities. Additionally, questions were vetted to ensure they would not be easily solvable by GPT-4, OpenAI o1, or earlier browsing-enabled models.

The dataset spans a broad range of domains—including science, history, arts, sports, and entertainment—and is balanced to promote topic diversity. Each task is formulated so that the correct answer is a short string, which simplifies evaluation and reduces ambiguity. Human performance was also assessed, with human trainers given two hours per task; most failed to solve the majority of tasks, reflecting their difficulty.

Model Evaluation and Findings

OpenAI evaluated several models on BrowseComp, including GPT-4o (with and without browsing), GPT-4.5, OpenAI o1, and Deep Research—a model specifically trained to handle persistent browsing tasks. The results indicate that models without advanced search or reasoning strategies perform poorly: GPT-4o without browsing achieved 0.6% accuracy, and with browsing enabled, only 1.9%. GPT-4.5 scored similarly low. OpenAI o1, with improved reasoning but no browsing, performed moderately better at 9.9%.

Deep Research outperformed all other models, achieving 51.5% accuracy. Its architecture and training emphasize iterative searching, evidence synthesis, and adaptive navigation. Performance improved further with multiple trials per question and aggregation strategies such as best-of-N selection and confidence-based voting. While Deep Research exhibited higher calibration error—frequently being overconfident in incorrect answers—it often identified its own correct outputs with internal consistency, suggesting a usable confidence signal.
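The aggregation strategies mentioned above are straightforward to sketch: sample several candidate answers, each with a self-reported confidence, then either keep the single most confident one (best-of-N) or let confidences vote across identical answers. The snippet below is an illustrative reimplementation of the idea, not OpenAI's evaluation code.

```python
from collections import defaultdict

def best_of_n(candidates):
    """candidates: list of (answer, confidence) pairs. Keep the most confident answer."""
    return max(candidates, key=lambda c: c[1])[0]

def confidence_vote(candidates):
    """Sum confidence across identical answers and return the highest-scoring one."""
    scores = defaultdict(float)
    for answer, confidence in candidates:
        scores[answer.strip().lower()] += confidence
    return max(scores, key=scores.get)

samples = [("1987", 0.6), ("1987", 0.7), ("1992", 0.9)]
print(best_of_n(samples))        # "1992" -> the single most confident sample wins
print(confidence_vote(samples))  # "1987" -> aggregated confidence favors the repeated answer
```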

Human Performance and Task Difficulty

Human trainers attempted to solve the benchmark problems without the assistance of AI tools. Of the 1,255 attempted tasks, 71% were marked as unsolvable within the two-hour window, and only 29% were successfully completed. Among those, the agreement rate with the reference answer was 86.4%. These outcomes underscore the complexity of the benchmark and suggest that current AI models still fall short of the adaptability and background reasoning skills needed for such tasks.

Conclusion

BrowseComp introduces a focused, verifiable, and technically demanding benchmark for evaluating the core capabilities of web-browsing agents. By shifting emphasis from static recall to dynamic retrieval and multi-hop reasoning, it presents a realistic challenge that aligns closely with emerging real-world applications. Although current models, including those with browsing capabilities, perform unevenly, the Deep Research agent illustrates the potential of dedicated architectures to bridge this gap.

BrowseComp is publicly available via GitHub and detailed on OpenAI’s official blog. All credit for this research goes to the researchers of this project.
The post OpenAI Open Sources BrowseComp: A New Benchmark for Measuring the Ability for AI Agents to Browse the Web appeared first on MarkTechPost.

Reduce ML training costs with Amazon SageMaker HyperPod

Training a frontier model is highly compute-intensive, requiring a distributed system of hundreds, or thousands, of accelerated instances running for several weeks or months to complete a single job. For example, pre-training the Llama 3 70B model with 15 trillion training tokens took 6.5 million H100 GPU hours. On 256 Amazon EC2 P5 instances (p5.48xlarge, each with 8 NVIDIA H100 GPUs), this would take approximately 132 days.
Distributed training workloads run in a synchronous manner because each training step requires all participating instances to complete their calculations before the model can advance to the next step. This means that if a single instance fails, the entire job stops. As cluster sizes grow, the likelihood of failure increases due to the number of hardware components involved. Each hardware failure can result in wasted GPU hours and requires valuable engineering time to identify and resolve the issue, making the system prone to downtime that can disrupt progress and delay completion. To assess system reliability, engineering teams often rely on key metrics such as mean time between failures (MTBF), which measures the average operational time between hardware failures and serves as a valuable indicator of system robustness.
In this post, we explore the challenges of large-scale frontier model training, focusing on hardware failures and the benefits of Amazon SageMaker HyperPod—a resilient solution that minimizes disruptions, enhances efficiency, and reduces training costs.
Instance failure rate
To understand the typical MTBF for large-scale frontier model training, it helps to first understand instance failure rates by reviewing three noteworthy examples:

When training OPT-175B on 992 A100 GPUs, Meta AI encountered significant hardware reliability challenges. Across 2 months, the team managed 35 manual restarts and cycled over 100 hosts due to hardware issues, and automated systems triggered more than 70 restarts. Operating 124 instances (each with 8 GPUs) continuously over 1,440 hours, Meta accumulated a total of 178,560 instance-hours. The observed failure rate during this period was around 0.0588% per instance-hour, underscoring the reliability hurdles in training large frontier models at this scale.
During the training of Llama 3.1 405B on 16,000 H100 GPUs, a total of 417 unscheduled hardware failures occurred during a 54-day period. This translates to an effective failure rate of about 0.0161% per instance-hour.
MPT-7B was trained on 1 trillion tokens over the course of 9.5 days on 440 x A100-40GB. During this period, the training job experienced four hardware failures, resulting in an effective failure rate of approximately 0.0319% per instance-hour.

Based on these examples, it’s realistic to expect that in a single hour of large-scale distributed training, an instance will fail about 0.02%–0.06% of the time.
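The quoted rates follow directly from dividing observed failures by accumulated instance-hours. Here is a quick sketch of the arithmetic behind the three examples, using the figures cited above and counting roughly 105 total restarts for OPT-175B (35 manual plus about 70 automated):

```python
def failure_rate(failures, instances, hours):
    """Failures per instance-hour, expressed as a percentage."""
    return 100 * failures / (instances * hours)

# OPT-175B: ~105 restarts over 124 instances running for 1,440 hours (178,560 instance-hours).
print(round(failure_rate(105, 124, 1440), 4))        # ~0.0588

# Llama 3.1 405B: 417 failures over 2,000 instances (16,000 H100s / 8 per instance) for 54 days.
print(round(failure_rate(417, 2000, 54 * 24), 4))    # ~0.0161

# MPT-7B: 4 failures over 55 instances (440 A100s / 8 per instance) for 9.5 days.
print(round(failure_rate(4, 55, 9.5 * 24), 4))       # ~0.0319
```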
Larger clusters, more failures, smaller MTBF
As cluster size increases, the number of hardware components that can fail grows, resulting in a lower MTBF. The following table illustrates how the MTBF (in hours) changes with the number of instances in a cluster and the estimated failure rate for each instance. For example, with a 0.04% per-hour failure rate per instance, a 512-instance system is expected to experience a failure approximately every 5 hours.

| Size of cluster (instances) | Failure rate 0.01% | Failure rate 0.02% | Failure rate 0.04% | Failure rate 0.08% |
|---|---|---|---|---|
| 4 | 2500 | 1250 | 625 | 313 |
| 8 | 1250 | 625 | 313 | 157 |
| 16 | 625 | 313 | 157 | 79 |
| 32 | 313 | 157 | 79 | 40 |
| 64 | 157 | 79 | 40 | 20 |
| 128 | 79 | 40 | 20 | 10 |
| 256 | 40 | 20 | 10 | 5 |
| 512 | 20 | 10 | 5 | 3 |

Table 1: The change in MTBF (in hours) with the number of instances in a training cluster (with assumed failure rates in the columns)
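The MTBF values in the table follow from treating instance failures as independent: the cluster's expected failures per hour equal the per-instance rate multiplied by the number of instances, and MTBF is the reciprocal. A small sketch reproduces a few entries:

```python
def mtbf_hours(num_instances, failure_rate_pct_per_hour):
    """Approximate mean time between failures for the whole cluster, in hours."""
    failures_per_hour = num_instances * failure_rate_pct_per_hour / 100
    return 1 / failures_per_hour

print(round(mtbf_hours(4, 0.01)))    # 2500
print(round(mtbf_hours(512, 0.04)))  # 5
print(round(mtbf_hours(256, 0.08)))  # 5
```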
What happens after a failure?
In a perfect world, without failures, the training job proceeds as shown in the following graph, which illustrates the total training time without failures, demonstrating a linear progression.

Figure 1: Training is linear in a perfect world without failures, since there are no interruptions to completion.

However, as previously noted, hardware failures are inevitable. Troubleshooting these failures typically involves several steps:

Root cause analysis (mean time to detect) – Identifying hardware failures as the root cause of training interruptions can be time-consuming, especially in complex systems with multiple potential failure points. The time taken to determine the root cause is referred to as mean time to detect (MTTD).
Hardware repair or replacement (mean time to replace) – Sometimes, a simple instance restart resolves the issue. At other times, the instance must be replaced, which can involve logistical delays, especially if specialized components aren’t readily available. If a replacement instance isn’t on hand when a GPU fails, the system must wait for one to become available, because common sharding strategies such as PyTorch FSDP don’t permit redistributing the workload among the remaining instances.
System recovery and resumption (mean time to restart) – After resolving hardware issues and replacing the instance, additional time is needed to restore it to its previous state. The new instance must match the original configuration, and the entire cluster must load the model weights from the latest saved checkpoint.

Each failure incurs engineering effort to identify its root cause. When hardware issues arise, diagnostics confirm the problem and isolate the faulty instance, pausing the training job and increasing downtime. The impact of these failures is illustrated in the following figure and can be empirically measured for large distributed training jobs. The figure outlines the troubleshooting steps that follow a failure.

Figure 2: Impact of failures on a distributed training run. Once a failure occurs, time (idle GPUs) is spent on detecting (mean time to detect), replacing (mean time to replace), and restarting (mean time to restart) a training run, often wasting time and expensive resources.

In a scenario where a distributed training job is running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with n reserved instances and an Auto Scaling group set to maintain a minimum of n instances, a hardware issue such as a GPU failure can cause the job to fail. The affected instance will be marked as Unhealthy by a Kubernetes health monitor such as Node Problem Detector, and Amazon EKS will attempt to reschedule the training pods to healthy instances. If no instances have sufficient resources, the pods remain in a Pending state, and because the instance count is limited to n, no new instance will be automatically provisioned.
In such cases, the failed job must be manually identified through pod logs or the Kubernetes API and deleted. The failed instance also needs to be isolated and terminated manually, either through the AWS Management Console, AWS Command Line Interface (AWS CLI), or tools like kubectl or eksctl. To restore cluster capacity, the user must increase the cluster size by modifying the Auto Scaling group or updating the instance group. After the new instance is provisioned, bootstrapped, and added to the cluster, the training job must be restarted manually. If checkpointing is enabled, the job can resume from the last saved state. The overall downtime depends on the time required to provision a new instance and restart the job by rescheduling the pods.
Faster failure detection (shorter MTTD), shorter replacement times (shorter MTTR), and rapid resumption will all contribute to reducing total training time. Automating these processes with minimal user intervention is a key advantage of Amazon SageMaker HyperPod. 
Amazon SageMaker HyperPod resilient training infrastructure
SageMaker HyperPod is a compute environment optimized for large-scale frontier model training. This means users can build resilient clusters for machine learning (ML) workloads and develop or fine-tune state-of-the-art frontier models, as demonstrated by organizations such as Luma Labs and Perplexity AI. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation alleviates the need for manual management, which means customers can train in distributed settings for weeks or months with minimal disruption. The benefits are particularly significant for customers deploying many instances (greater than 16) in a cluster.
Frontier model builders can further enhance model performance using built-in ML tools within SageMaker HyperPod. They can use Amazon SageMaker AI with MLflow to create, manage, and track ML experiments, or use Amazon SageMaker AI with TensorBoard to visualize model architecture and address convergence issues. Additionally, integrating with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana provides deeper insights into cluster performance, health, and utilization, ultimately saving valuable development time. The following figure compares the downtime of an infrastructure system using SageMaker HyperPod versus one without SageMaker HyperPod.

Figure 3: Comparing downtime chart from figure 1 with downtime on SageMaker HyperPod. When a failure occurs, it is detected automatically by HyperPod agents, and the instance is replaced in the background. Training is also resumed from the latest checkpoint

SageMaker HyperPod reduces the downtime per hardware failure by automatically detecting hardware issues. When these issues are detected, SageMaker HyperPod automatically replaces the faulty node(s) and resumes your training job from the latest checkpoint, assuming that checkpoints are written.
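Automatic resumption only pays off if the training script actually writes checkpoints and reloads them on startup. The following PyTorch-style sketch shows that pattern in isolation; the checkpoint path and save frequency are illustrative assumptions, not SageMaker HyperPod-specific settings.

```python
import os
import torch

CKPT_PATH = "/opt/ml/checkpoints/latest.pt"  # illustrative shared, persistent location

def save_checkpoint(model, optimizer, step):
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Return the step to resume from; 0 if no checkpoint exists yet."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

# In the training loop: resume where the last run left off, then checkpoint periodically.
# start_step = load_checkpoint(model, optimizer)
# for step in range(start_step, total_steps):
#     ...train one step...
#     if step % 500 == 0:
#         save_checkpoint(model, optimizer, step)
```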
To evaluate this, we conducted experiments on SageMaker HyperPod using different cluster sizes of p5.48xlarge instances. The following table shows empirical measurements of time to resume by cluster size, reported at the 90th percentile (P90), meaning 90% of observed failures were resolved within the listed time.

| Cluster size (number of instances) | P90 time to detect (in seconds) | P90 time to replace (in seconds) | P90 time to resume (in seconds) | Total downtime per failure (in seconds) | Total downtime per failure (in minutes) |
|---|---|---|---|---|---|
| 16 | 83 | 912 | 1212 | 2207 | 36.8 |
| 64 | 90 | 963 | 1320 | 2373 | 39.6 |
| 256 | 89 | 903 | 1398 | 2390 | 39.8 |
| 1024 | 80 | 981 | 1440 | 2501 | 41.7 |

Table 2: MTTResume (in seconds) on clusters with different sizes
As shown, the mean time to replace an instance is independent of cluster size. For a cluster of 256 x p5.48xlarge instances training Meta Llama 3.1 70B parameter model with batch size = 8, replacing an instance takes about 940 seconds (or 15.7 minutes). After replacement, the new instance must install additional packages using lifecycle scripts and run deep health checks before reading from the latest saved checkpoint. When it’s operational, the training job resumes from the most recent checkpoint, minimizing progress loss despite the interruption. For a 256-instance cluster, it took us about 2,390 seconds (about 40 minutes) to automatically resume the training job after each failure.
Without SageMaker HyperPod, when a GPU failure occurs during a training job, the time it takes to resume the training can vary widely depending on the infrastructure and processes in place. With proper check-pointing, automated job orchestration, and efficient hardware provisioning, the resume time can be reduced. However, without these optimizations, the impact can be much more severe. Empirical evidence from customer experiences—including a leading open source frontier model provider, a top large language model (LLM) startup, an AI company specializing in enterprise frontier models, and a cutting-edge scientific research institute—indicates that without SageMaker HyperPod, the total downtime per GPU failure can average approximately 280 minutes per failure. Thus, Amazon SageMaker HyperPod saves about 240 minutes (or about 4 hours) of downtime per failure:

| Metric | Without SageMaker HyperPod (in minutes) | With SageMaker HyperPod (in minutes) |
|---|---|---|
| Mean time to root-cause | 10 | 1.5 |
| Mean time to replace | 240 | 15 |
| Mean time to resume | 30 | 23.5 |
| Total downtime per failure | 280 | 40 |

Table 3: Typical failure numbers, in minutes (as described in section “What happens after a failure?” with and without SageMaker HyperPod)
Quantifying the downtime savings
Depending on the frequency of failures, we can calculate the time to train and the cost savings of using SageMaker HyperPod. To illustrate this calculation, we assume it takes 40 minutes to replace an instance with SageMaker HyperPod compared to 280 minutes without it (as previously explained). Additionally, for this calculation, let’s assume a training job requiring 10 million GPU hours on H100 instances, running on a 256-instance P5 cluster.
Although the actual overhead (in hours) depends on the size of the training job, the relative overhead, as a fraction of total training time, depends only on the cluster size, the failure rate, and the downtime per failure. The benefits of SageMaker HyperPod in reducing total training time are shown in the following table. For example, in a 256-instance cluster with a failure rate of 0.05%, SageMaker HyperPod reduces total training time by 32%.

| Size of cluster (instances) | Failure rate 0.01% | Failure rate 0.02% | Failure rate 0.05% | Failure rate 0.07% |
|---|---|---|---|---|
| 4 | 0% | 0% | 1% | 1% |
| 8 | 0% | 1% | 2% | 2% |
| 16 | 1% | 1% | 3% | 4% |
| 32 | 1% | 2% | 6% | 8% |
| 64 | 2% | 5% | 11% | 15% |
| 128 | 5% | 9% | 20% | 25% |
| 256 | 9% | 17% | 32% | 40% |
| 512 | 17% | 28% | 48% | 55% |

Table 4: Total % of training time reduced by SageMaker HyperPod compared to a P5 cluster of comparable size
To translate this into actual savings, for a training job requiring 10 million GPU hours on a 256-instance cluster, SageMaker HyperPod saves 104 days of training time. As a result, customers can reduce time-to-market by 3.5 months. Without SageMaker HyperPod, the total time to train would be approximately 325 days, 121 of which are just spent on isolating and mitigating hardware issues. The following table shows the time to train benefits.

| Parameter | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Additional time to fix per failure (hours) | 4 |
| Days lost due to hardware issues (with SageMaker HyperPod) | 17 |
| Days lost due to hardware issues (without SageMaker HyperPod) | 121 |
| Time to train with SageMaker HyperPod (days) | 221 |
| Time to train without SageMaker HyperPod (days) | 325 |
| SageMaker HyperPod improvement | 32% |
| Time saved with SageMaker HyperPod (days) | 104 |

Table 5: Benefits presented by SageMaker HyperPod for a training run requiring 10 million GPU hours and a 256 instance cluster. SageMaker HyperPod saves 104 days of training time overall, resulting in a faster time to market (by 3.5 months!)
For the same example, we can estimate the total cost savings using:
Days lost due to hardware issues = (Number of instances) × (Failure rate per instance per hour) × (Failure-free training time in hours) × (Downtime per failure in hours) ÷ 24
In this example, the failure-free training time is 10,000,000 GPU hours ÷ (256 instances × 8 GPUs per instance), or roughly 4,880 hours of cluster runtime.
The following shows cost to train benefits.

| Parameter | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Time saved with SageMaker HyperPod (days) | 104 |
| Cost per GPU per hour | $5 |
| Total cost saving with SageMaker HyperPod | $25,559,040 |

Table 6: Using the calculation described above, the cost to train benefits laid out for a training run requiring 10 million GPU hours, 256 GPU based instances, and an assumed failure rate of 0.05% per instance per hour
A training job requiring 10 million GPU hours and 104 additional days of resolving hardware issues results in significant idle cluster time. Assuming a GPU cost of $5 per hour (equivalent to the price of P5 instances on Capacity Blocks for ML), the total cost savings with SageMaker HyperPod amounts to $25,559,040.
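For readers who want to verify the arithmetic, the short sketch below reproduces the days-lost, days-saved, and cost-savings figures in Tables 5 and 6 from the stated assumptions (8 GPUs per P5 instance, 40 versus 280 minutes of downtime per failure).

```python
GPU_HOURS = 10_000_000
INSTANCES = 256
GPUS_PER_INSTANCE = 8
FAILURE_RATE = 0.0005          # 0.05% per instance per hour
COST_PER_GPU_HOUR = 5.0        # USD, comparable to P5 Capacity Blocks pricing

# Failure-free training time and the number of failures expected over that time.
compute_hours = GPU_HOURS / (INSTANCES * GPUS_PER_INSTANCE)   # ~4,883 hours
expected_failures = INSTANCES * FAILURE_RATE * compute_hours  # ~625 failures

def days_lost(downtime_minutes_per_failure):
    return expected_failures * (downtime_minutes_per_failure / 60) / 24

lost_with = days_lost(40)       # ~17 days with SageMaker HyperPod
lost_without = days_lost(280)   # ~122 days without (the post rounds this to 121)
days_saved = round(lost_without - lost_with)                  # 104

savings = days_saved * 24 * INSTANCES * GPUS_PER_INSTANCE * COST_PER_GPU_HOUR
print(days_saved, f"${savings:,.0f}")   # 104 $25,559,040
```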
Summary
Training frontier models is a complex, resource-intensive process that is particularly vulnerable to hardware failures. In this post, we explored the instance failure rate, which can range from roughly 0.02% to 0.07% per instance per hour during large-scale distributed training. As cluster size grows, the likelihood of failures increases, and the MTBF decreases. We also examined what happens after failure, including root cause analysis, hardware repair or replacement, and system recovery and resumption.
Next, we examined Amazon SageMaker HyperPod—a purpose-built, fully resilient cluster for frontier model training. By incorporating robust fault-tolerance mechanisms and automated health monitoring, SageMaker HyperPod minimizes disruptions caused by hardware issues. This not only streamlines the training process but also enhances the reliability and efficiency of model development, enabling faster and more effective innovation delivery. The benefits are measurable and correlate with both cluster size and failure rate. For a 256-instance cluster with a 0.05% per-instance-per-hour failure rate, SageMaker HyperPod reduces total training time by 32%, resulting in an approximate savings of $25.6 million in total training costs.
By addressing the reliability challenges of frontier model training, SageMaker HyperPod allows ML teams to focus on model innovation rather than infrastructure management. Organizations can now conduct long training runs with confidence, knowing that hardware failures will be automatically detected and resolved with minimal disruption to their ML workloads. Get started with Amazon SageMaker HyperPod.
Special thanks to Roy Allela, Senior AI/ML Specialist Solutions Architect for his support on the launch of this post.

About the Authors
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Trevor Harvey is a Principal Specialist in generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.