Convergence Labs Introduces the Large Memory Model (LM2): A Memory-Augmented Transformer Architecture Designed to Address Long Context Reasoning Challenges

Transformer-based models have significantly advanced natural language processing (NLP), excelling in various tasks. However, they struggle with reasoning over long contexts, multi-step inference, and numerical reasoning. These challenges arise from the quadratic complexity of self-attention, which makes extended sequences inefficient to process, and from the lack of explicit memory, which limits their ability to synthesize dispersed information effectively. Existing solutions, such as recurrent memory transformers (RMT) and retrieval-augmented generation (RAG), offer partial improvements but often sacrifice either efficiency or generalization.

Introducing the Large Memory Model (LM2)

Convergence Labs introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module to address the shortcomings of conventional models in long-context reasoning. Unlike standard Transformers, which rely solely on attention mechanisms, LM2 incorporates a structured memory system that interacts with input embeddings through cross-attention. The model’s memory updates are regulated by gating mechanisms, allowing it to selectively retain relevant information while preserving generalization capabilities. This design enables LM2 to maintain coherence across long sequences, facilitating improved relational reasoning and inference.

Technical Overview and Benefits

LM2 builds on the standard Transformer architecture by introducing three key innovations:

Memory-Augmented Transformer: A dedicated memory bank acts as an explicit long-term storage system, retrieving relevant information through cross-attention.

Hybrid Memory Pathway: Unlike previous models that modify the Transformer’s core structure, LM2 maintains the original information flow while integrating an auxiliary memory pathway.

Dynamic Memory Updates: The memory module selectively updates its stored information using learnable input, forget, and output gates, ensuring long-term retention without unnecessary accumulation of irrelevant data.

These enhancements allow LM2 to process long sequences more effectively while maintaining computational efficiency. By selectively incorporating relevant memory content, the model mitigates the gradual performance decline often observed in traditional architectures over extended contexts.
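To make the gated memory update more concrete, the following is a minimal, hypothetical PyTorch sketch of the read-gate-write pattern described above. The module names, dimensions, and exact update rule are our own simplification for illustration and are not taken from the LM2 implementation; in the actual architecture the memory interacts with the decoder blocks and the gating details differ.

import torch
import torch.nn as nn

class GatedMemory(nn.Module):
    """Illustrative memory bank updated with input/forget/output gates (simplified)."""
    def __init__(self, num_slots: int, dim: int):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learnable initial memory bank
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.in_gate = nn.Linear(2 * dim, dim)
        self.forget_gate = nn.Linear(2 * dim, dim)
        self.out_gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, dim) token embeddings from the decoder stream
        batch = x.size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)

        # Memory slots attend to the input sequence (cross-attention) to read relevant content
        read, _ = self.cross_attn(query=mem, key=x, value=x)

        # Input and forget gates decide how much to write and how much to keep, per slot
        gate_in = torch.cat([mem, read], dim=-1)
        i = torch.sigmoid(self.in_gate(gate_in))
        f = torch.sigmoid(self.forget_gate(gate_in))
        new_mem = f * mem + i * torch.tanh(read)

        # Output gate controls what memory content flows back into the token stream,
        # which is added as an auxiliary pathway on top of the original information flow
        o = torch.sigmoid(self.out_gate(torch.cat([new_mem, read], dim=-1)))
        mem_out, _ = self.cross_attn(query=x, key=o * new_mem, value=o * new_mem)
        return x + mem_out, new_mem

The key design point the sketch conveys is that the original token stream is left intact and only receives an additive contribution from memory, mirroring the hybrid memory pathway described above.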

Experimental Results and Insights

To evaluate LM2’s effectiveness, it was tested on the BABILong dataset, designed to assess memory-intensive reasoning capabilities. The results indicate substantial improvements:

Short-context performance (0K context length): LM2 achieves an accuracy of 92.5%, surpassing RMT (76.4%) and vanilla Llama-3.2 (40.7%).

Long-context performance (1K–4K context length): As context length increases, all models experience some degradation, but LM2 maintains a higher accuracy. At 4K context length, LM2 achieves 55.9%, compared to 48.4% for RMT and 36.8% for Llama-3.2.

Extreme long-context performance (≥8K context length): While all models decline in accuracy, LM2 remains more stable, outperforming RMT in multi-step inference and relational reasoning.

Beyond memory-specific benchmarks, LM2 was tested on the MMLU dataset, which covers a broad range of academic subjects. The model demonstrated a 5.0% improvement over a pre-trained vanilla Transformer, particularly excelling in Humanities and Social Sciences, where contextual reasoning is crucial. These results indicate that LM2’s memory module enhances reasoning capabilities without compromising general task performance.

Conclusion

The introduction of LM2 offers a thoughtful approach to addressing the limitations of standard Transformers in long-context reasoning. By integrating an explicit memory module, LM2 improves multi-step inference, relational reasoning, and numerical reasoning while maintaining efficiency and adaptability. Experimental results demonstrate its advantages over existing architectures, particularly in tasks requiring extended context retention. Furthermore, LM2 performs well on general reasoning benchmarks, suggesting that memory integration does not hinder versatility. As memory-augmented models continue to evolve, LM2 represents a step toward more effective long-context reasoning in language models.

Check out the Paper. All credit for this research goes to the researchers of this project.

Meta AI Introduces PARTNR: A Research Framework Supporting Seamless Human-Robot Collaboration in Multi-Agent Tasks

Human-robot collaboration focuses on developing intelligent systems working alongside humans in dynamic environments. Researchers aim to build robots capable of understanding and executing natural language instructions while adapting to constraints such as spatial positioning, task sequencing, and capability-sharing between humans and machines. This field significantly advances robotics for household assistance, healthcare, and industrial automation, where efficiency and adaptability are crucial for seamless integration.

A major challenge in human-robot collaboration is the lack of a comprehensive benchmark to evaluate planning and reasoning abilities in multi-agent tasks. While previous models have addressed navigation and single-agent interactions, they fail to capture real-world complexities where robots must coordinate with humans. Many existing approaches do not account for real-time task tracking, partner adaptation, and effective error recovery. The absence of an established standard makes it difficult to assess and improve collaborative AI performance in interactive settings systematically.

Current approaches in embodied AI often focus on single-agent task execution, disregarding the necessity of coordination in multi-agent scenarios. Some methods rely on templated task instructions, limiting scalability and task diversity, while others depend on manually crafted evaluation functions, making large-scale assessments impractical. Despite advancements, state-of-the-art large language models (LLMs) struggle with task tracking, coordination, and recovery from execution failures. These limitations hinder their ability to function efficiently in human-centric environments where adaptability and precise task execution are essential.

Researchers at FAIR Meta have introduced PARTNR (Planning And Reasoning Tasks in humaN-Robot collaboration), a large-scale benchmark designed to assess human-robot coordination in simulated environments. PARTNR comprises 100,000 natural language tasks, spanning 60 simulated homes and 5,819 unique objects. The benchmark specifically evaluates tasks incorporating spatial, temporal, and heterogeneous constraints. Researchers ensured a realistic and scalable task generation process by leveraging a semi-automated pipeline integrating LLMs and simulation-in-the-loop validation. PARTNR aims to set a standard for evaluating AI’s ability to collaborate with human partners effectively.

Researchers generated task instructions and evaluation functions using LLMs to create the benchmark. These were then filtered through simulation to remove infeasible tasks. The final dataset underwent human-in-the-loop validation to enhance task diversity and ensure accuracy. The tasks in PARTNR fall into four categories: constraint-free, spatial, temporal, and heterogeneous. Constraint-free tasks allow flexibility in execution order, while spatial tasks require specific object positioning. Temporal tasks necessitate ordered execution, and heterogeneous tasks involve actions beyond the robot’s capability, requiring human intervention. These task structures introduce challenges in coordination, tracking, and execution accuracy.

Evaluations of LLM-based planning agents on PARTNR revealed significant limitations in coordination, task tracking, and error recovery. When paired with humans, LLM-guided robots required 1.5 times more steps than human-human teams and 1.1 times more steps than a single human to complete tasks. The success rate of state-of-the-art LLMs was only 30% under non-privileged conditions, compared to 93% when tasks were performed solely by humans. Moreover, fine-tuning smaller LLMs achieved performance comparable to models nine times larger while being 8.6 times faster at inference. In decentralized multi-agent settings, task completion required 1.3 times more steps than a single-agent scenario, demonstrating inefficiencies in current coordination mechanisms.

PARTNR highlights crucial gaps in existing AI-driven human-robot collaboration models, emphasizing better planning, tracking, and decision-making strategies. The findings indicate that despite advancements in AI, human-robot collaboration benchmarks require substantial improvements to bridge the performance disparity between AI models and humans. The structured evaluation framework offered by PARTNR provides a pathway for advancing AI’s ability to collaborate, plan, and execute tasks efficiently. Future research should focus on refining LLM-based planners, improving coordination mechanisms, and enhancing perception models to address current limitations in multi-agent interaction. PARTNR is a valuable resource for driving innovation in collaborative embodied AI systems.

Check out the Paper. All credit for this research goes to the researchers of this project.

OpenAI Introduces Competitive Programming with Large Reasoning Models

Competitive programming has long served as a benchmark for assessing problem-solving and coding skills. These challenges require advanced computational thinking, efficient algorithms, and precise implementations, making them an excellent testbed for evaluating AI systems. While early AI models like Codex demonstrated strong capabilities in program synthesis, they often relied on extensive sampling and heuristic-based selection, limiting their adaptability. OpenAI’s latest research seeks to move beyond these constraints by leveraging reinforcement learning (RL) to enhance AI’s ability to reason and solve programming challenges more effectively.

OpenAI recently introduced an advanced approach to AI-driven competitive programming, focusing on improving reasoning capabilities through reinforcement learning. The study compares OpenAI’s o1 model, a general-purpose large reasoning model (LRM), with o1-ioi, a model fine-tuned specifically for the 2024 International Olympiad in Informatics (IOI). The research further evaluates o3, an advanced model that achieves high performance without relying on hand-engineered inference strategies. Notably, o3 secures a gold medal at the 2024 IOI and achieves a CodeForces rating comparable to top human programmers, demonstrating the effectiveness of reinforcement learning in reasoning-intensive tasks.

Technical Details and Benefits

The core of OpenAI’s approach lies in reinforcement learning-based reasoning models, which provide a structured way to navigate complex problems. Unlike earlier methods that depended on brute-force heuristics, these models systematically refine their problem-solving strategies through learned experience.

Key aspects of this approach include:

Chain-of-thought reasoning: The models generate intermediate steps to break down problems before arriving at a final solution, improving accuracy in complex scenarios.

Reinforcement learning refinement: RL is used to optimize decision-making, allowing the model to identify and correct errors dynamically.

Autonomous test-time strategies: Unlike previous systems that relied on predefined heuristics, o3 develops its own inference strategies, making it more adaptable.

These improvements contribute to greater flexibility in problem-solving, better generalization across different coding tasks, and reduced reliance on human-designed rules. This represents a step forward from models like AlphaCode, which relied on extensive pre-sampling and heuristic filtering.

Results and Insights

OpenAI’s evaluation provides compelling evidence of these models’ progress in competitive programming:

Gold medal at IOI 2024: The o3 model outperformed prior approaches and achieved a gold medal without requiring hand-tuned inference techniques.

CodeForces benchmark: o3 reached a CodeForces rating of 2724, placing it in the 99.8th percentile, surpassing o1-ioi, which used manually designed test-time strategies.

Improved self-validation mechanisms: The model exhibited the ability to generate brute-force solutions for self-checking, refining its code submissions automatically.
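The brute-force self-checking idea can be illustrated outside of any model: generate a slow but obviously correct solution, then compare the optimized candidate against it on random inputs. The sketch below is a generic example of this validation pattern as used in competitive programming, not code produced by o3.

import random

def brute_force_max_subarray(a):
    # O(n^2) reference solution: try every subarray
    return max(sum(a[i:j + 1]) for i in range(len(a)) for j in range(i, len(a)))

def fast_max_subarray(a):
    # Kadane's algorithm: the optimized candidate being validated
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

# Stress test: random inputs, compare the candidate against the reference
for _ in range(1000):
    a = [random.randint(-10, 10) for _ in range(random.randint(1, 50))]
    assert fast_max_subarray(a) == brute_force_max_subarray(a), a
print("candidate agrees with brute force on 1,000 random cases")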

These results suggest that general-purpose reinforcement learning models can outperform domain-specific AI solutions by independently learning and executing effective problem-solving techniques. The transition from o1-ioi to o3 highlights a shift away from human intervention, as the model develops its own optimization strategies during problem-solving.

Conclusion

OpenAI’s work on large reasoning models in competitive programming highlights a shift in how AI systems approach complex problem-solving. By demonstrating that reinforcement learning-based models can match and even exceed the performance of domain-specific techniques, this research suggests broader applications for AI in scientific research, software development, and mathematical reasoning. Moving forward, continued refinement of these models may help bridge the gap between AI-driven reasoning and human cognitive skills, leading to more capable and adaptable AI systems.

Check out the Paper. All credit for this research goes to the researchers of this project.

Fine-tune LLMs with synthetic data for context-based Q&A using Amazon Bedrock

There’s a growing demand from customers to incorporate generative AI into their businesses. Many use cases involve using pre-trained large language models (LLMs) through approaches like Retrieval Augmented Generation (RAG). However, for advanced, domain-specific tasks or those requiring specific formats, model customization techniques such as fine-tuning are sometimes necessary. Amazon Bedrock provides you with the ability to customize leading foundation models (FMs) such as Anthropic’s Claude 3 Haiku and Meta’s Llama 3.1.
Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. Amazon Bedrock offers a serverless experience, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage any infrastructure.
Fine-tuning is a supervised training process where labeled prompt and response pairs are used to further train a pre-trained model to improve its performance for a particular use case. One consistent pain point of fine-tuning is the lack of data to effectively customize these models. Gathering relevant data is difficult, and maintaining its quality is another hurdle. Furthermore, fine-tuning LLMs requires substantial resource commitment. In such scenarios, synthetic data generation offers a promising solution. You can create synthetic training data using a larger language model and use it to fine-tune a smaller model, which has the benefit of a quicker turnaround time.
In this post, we explore how to use Amazon Bedrock to generate synthetic training data to fine-tune an LLM. Additionally, we provide concrete evaluation results that showcase the power of synthetic data in fine-tuning when data is scarce.
Solution overview
The solution comprises two main steps:

Generate synthetic data using the Amazon Bedrock InvokeModel API.
Fine-tune using an Amazon Bedrock custom model.

For synthetic data generation, we use a larger language model (such as Anthropic’s Claude 3 Sonnet on Amazon Bedrock) as the teacher model, and a smaller language model (such as Anthropic’s Claude Instant 1.2 or Claude 3 Haiku on Amazon Bedrock) as the student model for fine-tuning. We use the larger teacher model to generate new data based on its knowledge, which is then used to train the smaller student model. This concept is similar to knowledge distillation used in deep learning, except that we’re using the teacher model to generate a new dataset from its knowledge rather than directly modifying the architecture of the student model.
The following diagram illustrates the overall flow of the solution.

Finally, we share our experiment results, where we compare the performance of the model fine-tuned with synthetic data to the baseline (not fine-tuned) model and to a model fine-tuned with an equal amount of original training data.
Prerequisites
To generate synthetic data and fine-tune models using Amazon Bedrock, you first need to create an AWS Identity and Access Management (IAM) service role with the appropriate permissions. This role is used by Amazon Bedrock to access the necessary resources on your behalf.
For instructions on creating the service role, refer to Create a service role for model customization. Also, make sure the role has the permission for the bedrock:InvokeModel action.
If you’re running this code using an Amazon SageMaker notebook instance, edit the IAM role that’s attached to the notebook (for example, AmazonSageMaker-ExecutionRole-XXX) instead of creating a new role. Follow Create a service role for model customization to modify the trust relationship and add the S3 bucket permission. Additionally, on the role’s Permissions tab, create the following inline policies:

Policy name: bedrock-customization

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:ListModelCustomizationJobs",
                "bedrock:DeleteCustomModel",
                "bedrock:CreateModelCustomizationJob",
                "bedrock:StopModelCustomizationJob",
                "bedrock:ListCustomModels",
                "bedrock:GetCustomModel",
                "bedrock:GetModelCustomizationJob"
            ],
            "Resource": "*"
        }
    ]
}

Policy name: iam-pass-role

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "${sagemaker-execution-role-arn}"
            ]
        }
    ]
}

The final permission policies for the SageMaker execution role should look like the following, which include AmazonSageMaker-ExecutionPolicy, AmazonSageMakerFullAccess, bedrock-customization, and iam-pass-role.

Generate synthetic data using the Amazon Bedrock InvokeModel API
We use the Amazon Bedrock InvokeModel API to generate synthetic data for fine-tuning. You can use the API to programmatically send an inference (text generation) request to the model of your choice. All you need is a well-crafted prompt tailored for data synthesis. We used the following sample prompt for our use case:

PROMPT = """
You are an AI assistant who is an expert in Amazon services. Your task is to understand a system that takes in a list of documents, and based on that, answers a question by providing citations for the documents that it referred the answer from.

Your job is to generate three new Question/Answer pairs, emulating the tone, style, and grammar of the original data provided.

Here is the original data :
Input Documents and Question : {document}\n\nQuestion: {question}
Output Answer : {answer}

Strictly return a jsonl with the keys (question, answer, topic). Every topic should be different. The answers should be in the exact same format as the original. The question and the answer should be different in content from the original data provided, and all questions should be diverse and different from each other. Do not answer in any other format. The response should be parsable as a jsonl.
"""

The goal of our use case was to fine-tune a model to generate a relevant and coherent answer based on a given reference document and a question. RAG is a popular technique used for such Q&A tasks; however, one significant challenge with RAG is the potential for retrieving unrelated or irrelevant documents, which can lead to inaccurate responses. You can apply fine-tuning to guide the model to better focus on the relevance of the documents to the question instead of using the provided documents without context to answer the question.
Our dataset includes Q&A pairs with reference documents regarding AWS services. Each sample has up to five reference documents as context, and a single-line question follows. The following table shows an example.

document
Context: Document 1: Step 1: Prepare to work with AWS CodeStar projects In this step, you create an AWS CodeStar service role and an Amazon EC2 key pair, so that you can begin creating and working with AWS CodeStar projects. If you have used AWS CodeStar before, skip ahead to Step 2 Step 2: Create a Project in AWS CodeStar. For this step, follow the instructions in Setting Up AWS CodeStar in the AWS CodeStar User Guide. Do not create a new AWS account, IAM user, or IAM group as part of those instructions. Use the ones you created or identified in Team Setup for AWS Cloud9. When you finish following those instructions, return to this topic. Document 2: Setting Up AWS CodeStar Before you can start using AWS CodeStar, you must complete the following steps. Topics: Step 1: Create an account Step 2: Create the AWS CodeStar Service Role Step 3: Configure the User’s IAM Permissions Step 4: Create an Amazon EC2 Key Pair for AWS CodeStar Projects Step 5: Open the AWS CodeStar Console Next Steps Document 3: How Do I Get Started with AWS CodeStar? To get started with AWS CodeStar: Prepare to use AWS CodeStar by following the steps in Setting Up AWS CodeStar. Experiment with AWS CodeStar by following the steps in the Getting Started with AWS CodeStar tutorial. Share your project with other developers by following the steps in Add Team Members to an AWS CodeStar Project. Integrate your favorite IDE by following the steps in Use an IDE with AWS CodeStar. Document 4: Step 2: Create a project in AWS CodeStar In this step, you create a project in AWS CodeStar. If you already have a project in AWS CodeStar you want to use, skip ahead to Step 3: Create an AWS Cloud9 Development Environment and Connect It to the Project. For this step, follow the instructions in Create a Project in AWS CodeStar in the AWS CodeStar User Guide. In the AWS CodeStar create project wizard, when you get to the Setup tools page or Connect to your source repository page, choose Skip, and then return to this topic. Document 5: Create a Project in AWS CodeStar You use the AWS CodeStar console to create a project. If you use a project template, it sets up the required resources for you. The template also includes sample code that you can use to start coding. To create a project, sign in to the AWS Management Console with an IAM user that has the AWSCodeStarFullAccess policy or equivalent permissions. For more information, see Setting Up AWS CodeStar. Note: You must complete the steps in Setting Up AWS CodeStar before you can complete the procedures in this topic. Topics Create a Project in AWS CodeStar (Console) Create a Project in AWS CodeStar (AWS CLI) Answer the following question to the best of your ability. Use and paraphrase the information in the context, if it answers the question. Otherwise, rely on your knowledge and memory.

question
How do I get started with AWS CodeStar?

answer
To get started with AWS CodeStar, follow these steps: 1. Prepare to work with AWS CodeStar projects: a. Create an AWS CodeStar service role by following the instructions in “Setting Up AWS CodeStar” (Document 2) in the AWS CodeStar User Guide [1]. b. Create an Amazon EC2 key pair for AWS CodeStar projects, as described in the same guide [1]. 2. Create a project in AWS CodeStar: a. Sign in to the AWS Management Console with an IAM user that has the AWSCodeStarFullAccess policy or equivalent permissions [2]. b. In the AWS CodeStar console, select one of the available project templates, which will automatically provision all the required resources for your project [3]. c. Once your project is provisioned, you can see your running application from the Application endpoints tile [3]. 3. Connect to the AWS CodeCommit source repository for your project: a. Use the steps provided in the CodeStar console to connect to the AWS CodeCommit source repository for your project [3]. b. Begin coding and making changes to your project’s source code. 4. Track and manage changes in the release process: a. Use the project dashboard to track and manage changes in the release process [3]. b. View the most recent project activity on the dashboard. 5. Share your project with other developers: a. Follow the steps in “Add Team Members to an AWS CodeStar Project” to share your project with other developers [3]. 6. Integrate your favorite IDE: a. Follow the steps in “Use an IDE with AWS CodeStar” to integrate your preferred Integrated Development Environment (IDE) with AWS CodeStar [3]. [1] https://docs.aws.amazon.com/codestar/latest/userguide/setting-up.html [2] https://docs.aws.amazon.com/codestar/latest/userguide/create-project.html [3] https://docs.aws.amazon.com/codestar/latest/userguide/getting-started.html

For data synthesis, we asked the model to generate three new Q&A pairs per reference document. However, you can adjust the number as needed. The crucial part is to make the model think deeply about a variety of topics. Because the purpose of generating synthetic data is to enrich the training dataset, it’s more beneficial to have the model look at different parts of the documents and create Q&A pairs with different topics than the original.
The following example shows how to generate synthetic data with the Amazon Bedrock InvokeModel API. We tested the preceding prompt with Anthropic’s Claude 3 Sonnet. If you want to test a different model, retrieve the corresponding model ID from Amazon Bedrock model IDs, and replace the modelId variable in the function.

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

def generate_synthetic_data(document, question, answer):

    values = {
        "document": document,
        "question": question,
        "answer": answer
    }

    body = {
        "messages": [{
            "role": "user", "content": PROMPT.format(**values)
        }],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "temperature": 0.5
    }

    response = bedrock.invoke_model(
        body=json.dumps(body),
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get('body').read())

    return response_body['content'][0]['text']

The preceding function returns a string containing three JSONL records with question, answer, and topic as keys. The following parse_llm_output function takes that string and uses regular expressions to retrieve the generated questions and answers. Then, the create_synthetic_samples function combines those two functionalities to produce the final synthetic training samples.

import re
import pandas as pd

def parse_llm_output(jsonl_string):

    question_pattern = re.compile(r'"question":\s*"([^"]+)"')
    answer_pattern = re.compile(r'"answer":\s*"(.*?)"\s*,\s*"topic"')
    questions = question_pattern.findall(jsonl_string)
    answers = answer_pattern.findall(jsonl_string)

    return questions, answers

def create_synthetic_samples(row: pd.Series) -> pd.DataFrame:

    jsonl_string = generate_synthetic_data(row['document'], row['question'], row['answer'])
    questions, answers = parse_llm_output(jsonl_string)

    return pd.DataFrame({
        "document": [row['document']] * len(questions),
        "question": questions,
        "answer": answers
    })

def to_customization_format(row):

    msg = {
        "messages": [
            {"role": "user", "content": f"{row['document']}\n\nQuestion: {row['question']}"},
            {"role": "assistant", "content": row['answer']}
        ]
    }

    return msg

The following script combines all of the preceding functions and gives you the final training set with both original and synthetic samples. We convert the samples into the format required by the customization job using the to_customization_format function and save them as train.jsonl. Assume the input data is a CSV file with three columns: document, question, and answer.

import json
import pandas as pd

# Load original training samples
original_train = pd.read_csv(input_df_path)

# Create synthetic training samples
synthetic_train = pd.concat(original_train.apply(create_synthetic_samples, axis=1).tolist())

# Combine original and synthetic samples
final_train_df = pd.concat([original_train, synthetic_train])

# Convert to the format required by the customization job
final_train = final_train_df.apply(to_customization_format, axis=1).tolist()

# Write to JSONL file
with open('train.jsonl', 'w') as file:
    for item in final_train:
        json.dump(item, file)
        file.write('\n')

Fine-tune using an Amazon Bedrock custom model
Now that you have the synthetic data generated by the teacher model along with your original data, it’s time to train the student model. We fine-tune the student model using the Amazon Bedrock custom model functionality.
Model customization is the process of providing training data to an FM to improve its performance for specific use cases. Amazon Bedrock offers three model customization methods as of this writing:

Fine-tuning
Continued pre-training
Distillation (preview)

You can create your own custom model using any of these methods through the Amazon Bedrock console or API. For more information on supported models and AWS Regions with various customization methods, please see User guide for model customization. In this section, we focus on how to fine-tune a model using the API.
To create a fine-tuning job in Amazon Bedrock, complete the following prerequisite steps (a minimal boto3 sketch for the bucket setup follows this list):

Create an Amazon Simple Storage Service (Amazon S3) bucket for your training data and another one for your output data (the names must be unique).
Upload the jsonl file to the training data bucket.
Make sure that you have created an IAM role, as described in the Prerequisites section.
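The following is a minimal sketch of the first two steps using boto3. The bucket names are hypothetical placeholders; S3 bucket names must be globally unique, and buckets created outside us-east-1 require a location constraint.

import boto3

s3 = boto3.client("s3")
region = boto3.session.Session().region_name

# Placeholder bucket names; replace with your own unique names
training_bucket = "my-bedrock-training-data-example"
output_bucket = "my-bedrock-output-data-example"

for bucket in (training_bucket, output_bucket):
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

# Upload the training file produced earlier
s3.upload_file("train.jsonl", training_bucket, "train.jsonl")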

When these steps are complete, run the following code to submit a new fine-tuning job. In our use case, the student model was Anthropic’s Claude Instant 1.2. At the time of writing, Anthropic’s Claude 3 Haiku is generally available, and we recommend following the rest of the code using Anthropic’s Claude 3 Haiku. For the release announcement, see Fine-tuning for Anthropic’s Claude 3 Haiku in Amazon Bedrock is now generally available.
If you want to try different models, you must check the model provider’s terms of service yourself. Many providers restrict using their models to train competing models. For the latest model support information, see Supported Regions and models for model customization, and replace baseModelIdentifier accordingly. Different models have different hyperparameters. For more information, see Custom model hyperparameters.

import boto3
import json
import time

bedrock = boto3.client(service_name='bedrock')

# Set parameters
customizationType = "FINE_TUNING"
baseModelIdentifier = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-instant-v1:2:100k"
roleArn = "${customization-role-arn}"
jobName = "${customization-job-name}"
customModelName = "${customization-model-name}"
hyperParameters = {
    "epochCount": "1",
    "batchSize": "96",
    "learningRateMultiplier": "0.5",
}
trainingDataConfig = {"s3Uri": "s3://${training-bucket}/train.jsonl"}
outputDataConfig = {"s3Uri": "s3://${output-bucket}/myOutputData"}

# Create job
response_ft = bedrock.create_model_customization_job(
    jobName=jobName,
    customModelName=customModelName,
    roleArn=roleArn,
    baseModelIdentifier=baseModelIdentifier,
    customizationType=customizationType,
    hyperParameters=hyperParameters,
    trainingDataConfig=trainingDataConfig,
    outputDataConfig=outputDataConfig
)

jobArn = response_ft.get('jobArn')

# Check job status
while True:
    status = bedrock.get_model_customization_job(jobIdentifier=jobArn).get('status')
    if status != 'InProgress':
        print(status)
        break
    else:
        print(status)
        time.sleep(30)

When the status changes to Completed, your fine-tuned student model is ready for use. To run an inference with this custom model, you need to purchase provisioned throughput. A flexible No commitment option is available for custom models, which can be turned off when not in use and billed by the hour. A cost estimate is provided on the console prior to purchasing provisioned throughput.
On the Amazon Bedrock console, choose Custom models in the navigation pane. Select the model you fine-tuned and choose Purchase provisioned throughput.

The model name and type are automatically selected for you. Select No commitment for Commitment term. After you make this selection, the estimated cost is shown. If you’re okay with the pricing, choose Confirm purchase.
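If you prefer a programmatic route, provisioned throughput can also be created with the Amazon Bedrock control-plane API. The following is a minimal sketch under the assumption that the boto3 bedrock client exposes create_provisioned_model_throughput and get_provisioned_model_throughput (confirm against the current boto3 documentation); the placeholders follow the same ${...} convention as the rest of this post, and hourly charges apply once the throughput is in service.

import boto3
import time

bedrock = boto3.client(service_name="bedrock")

# Create a no-commitment provisioned throughput for the fine-tuned custom model
response_pt = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    provisionedModelName="${provisioned-model-name}",
    modelId="${custom-model-arn}",  # ARN of the custom model created by the fine-tuning job
)
provisioned_model_arn = response_pt["provisionedModelArn"]

# Wait until the provisioned throughput leaves the Creating state
while True:
    status = bedrock.get_provisioned_model_throughput(
        provisionedModelId=provisioned_model_arn
    )["status"]
    print(status)
    if status != "Creating":
        break
    time.sleep(60)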

When the Provisioned Throughput becomes available, retrieve the ARN of the provisioned custom model and run the inference:

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

def run_student_model(document, question):

    # Use the same user-message format as the fine-tuning data
    body = {
        "messages": [{
            "role": "user", "content": f"{document}\n\nQuestion: {question}"
        }],
        "max_tokens": 2048,
        "temperature": 0.5
    }

    response = bedrock.invoke_model(
        body=json.dumps(body),
        modelId="${provisioned_model_arn}",
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get('body').read())

    return response_body['content'][0]['text']

Evaluate
In this section, we share our experiment results to provide data points on how the synthetic data generated by a teacher model can improve the performance of a student model. For evaluation methods, we used an LLM-as-a-judge approach, where a judge model compares responses from two different models and picks a better response. Additionally, we conducted a manual evaluation on a small subset to assess whether the LLM-as-a-judge and human judges have aligned preferences.
We carried out controlled experiments comparing the following four models. The 1,500 synthetic training samples for the fourth model were generated by Anthropic’s Claude 3 Sonnet; we created three synthetic samples per original reference document (3 samples * 500 original reference documents = 1,500 synthetic samples).

Instant base model – Anthropic’s Claude Instant without any customization

Fine-tuned 500 original – Anthropic’s Claude Instant fine-tuned with 500 original training samples

Fine-tuned 2,000 original – Anthropic’s Claude Instant fine-tuned with 2,000 original training samples

Fine-tuned with synthetic – Anthropic’s Claude Instant fine-tuned with 500 original training samples plus 1,500 synthetic training samples

LLM-as-a-judge results
LLM output evaluation is an important step in developing generative AI applications, but it is expensive and takes considerable time if done manually. An alternative solution to systematically evaluate output quality in large volume is the LLM-as-a-judge approach, where an LLM is used to evaluate another LLM’s responses.
For our use case, we used Anthropic’s Claude 3 Sonnet and Meta Llama 3 70B as the judges. We asked the LLM judges to compare outputs from two different models and choose one over the other or state a tie. The following chart summarizes the judges’ decisions. Each number represents the percentage of times when the respective model was selected as providing a better answer, excluding tie cases. The test set contained 343 samples.
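To illustrate the pairwise comparison setup, the following is a simplified, hypothetical judge prompt and invocation using the same InvokeModel pattern as earlier in this post. It is not the exact prompt used in our experiments; the prompt text and helper name are illustrative assumptions.

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

JUDGE_PROMPT = """You are an impartial judge. You are given a reference document, a question, and two candidate answers (A and B).
Decide which answer is more accurate, relevant, and faithful to the document.
Respond with exactly one word: A, B, or tie.

Document: {document}
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge_pair(document, question, answer_a, answer_b):
    body = {
        "messages": [{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                document=document, question=question,
                answer_a=answer_a, answer_b=answer_b
            )
        }],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "temperature": 0
    }
    response = bedrock.invoke_model(
        body=json.dumps(body),
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept="application/json",
        contentType="application/json"
    )
    return json.loads(response.get("body").read())["content"][0]["text"].strip()

# To reduce position bias, each pair can be judged twice with A and B swapped,
# counting only consistent verdicts as wins.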

As shown in the preceding chart, the Anthropic Claude 3 Sonnet judge preferred the response from the fine-tuned model with synthetic examples over the Anthropic Claude Instant base model (84.8% preference) and over the fine-tuned model with the original 500 samples (72.3% preference). However, against the fine-tuned model with 2,000 original examples, the model fine-tuned with synthetic examples was selected only 32.3% of the time, meaning the judge preferred the model trained on the larger original dataset. This aligns with the expectation that when a large amount of high-quality original data is available, it’s better to use training data that accurately reflects the target data distribution.

The Meta Llama judge reached a similar conclusion. As shown in the preceding chart, it preferred the response from the fine-tuned model with synthetic samples over the Anthropic Claude Instant base model (75.6% preference) and over the fine-tuned model with the original 500 examples (76.4% preference), but the fine-tuned model with 2,000 original examples remained the overall winner.
Human evaluation results
To complement the LLM-as-a-judge result, we conducted manual evaluation with two human judges. We asked the two human evaluators to perform the same pairwise comparison task as the LLM judge, but for 20 examples. The following chart summarizes the results.

As shown in the preceding chart, the two human evaluators reached a similar conclusion, reinforcing the LLM-as-a-judge result. The fine-tuned model with synthetic examples produced outputs that were preferred over those of the Anthropic Claude Instant base model and the fine-tuned model with the original 500 examples; however, it didn’t outperform the fine-tuned model with the 2,000 original examples.
These comparative evaluation results from both the LLM judges and human judges strongly demonstrate the power and potential of using data synthesis when training data is scarce. Moreover, by using high-quality data from the teacher model, we can effectively train the student model, which is lightweight and cost-effective for deployment in a production environment.
Amazon Bedrock evaluations
Running LLM-as-a-judge and human evaluation has become much easier with Amazon Bedrock. Model evaluation on Amazon Bedrock allows you to evaluate, compare, and select the best FMs for your use case. Human evaluation workflows can use your own employees or an AWS-managed team as reviewers. For more information on how to set up a human evaluation workflow, see Creating your first model evaluation that uses human workers. The latest feature, LLM-as-a-judge, is now in preview and allows you to assess multiple quality dimensions including correctness, helpfulness, and responsible AI criteria such as answer refusal and harmfulness. For step-by-step instructions, see New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock.
Clean up
Make sure to delete the following resources to avoid incurring cost (a minimal deletion sketch follows this list):

Provisioned throughput for the custom model
The training_bucket and output_bucket S3 buckets
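The cleanup can also be done programmatically; the following is a minimal sketch using boto3, with the ARN and bucket names as ${...} placeholders. Deleting the provisioned throughput stops the hourly billing, and the buckets must be emptied before they can be deleted.

import boto3

bedrock = boto3.client(service_name="bedrock")
s3 = boto3.resource("s3")

# Delete the provisioned throughput for the custom model
bedrock.delete_provisioned_model_throughput(
    provisionedModelId="${provisioned_model_arn}"
)

# Empty and delete the S3 buckets created for training and output data
for bucket_name in ("${training-bucket}", "${output-bucket}"):
    bucket = s3.Bucket(bucket_name)
    bucket.objects.all().delete()
    bucket.delete()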

Conclusion
In this post, we explored how to use Amazon Bedrock to generate synthetic training data using a large teacher language model and fine-tune a smaller student model with synthetic data. We provided instructions on generating synthetic data using the Amazon Bedrock InvokeModel API and fine-tuning the student model using an Amazon Bedrock custom model. Our evaluation results, based on both an LLM-as-a-judge approach and human evaluation, demonstrated the effectiveness of synthetic data in improving the student model’s performance when original training data is limited.
Although fine-tuning with a large amount of high-quality original data remains the ideal approach, our findings highlight the promising potential of synthetic data generation as a viable solution when dealing with data scarcity. This technique can enable more efficient and cost-effective model customization for domain-specific or specialized use cases.
If you’re interested in working with the AWS Generative AI Innovation Center and learning more about LLM customization and other generative AI use cases, visit Generative AI Innovation Center.

About the Author
Sujeong Cha is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. She has extensive hands-on experience in solving customers’ business use cases by utilizing generative AI as well as traditional AI/ML solutions. Sujeong holds a M.S. degree in Data Science from New York University.
Arijit Ghosh Chowdhury is a Scientist with the AWS Generative AI Innovation Center, where he works on model customization and optimization. In his role, he works on applied research in fine-tuning and model evaluations to enable GenAI for various industries. He has a Master’s degree in Computer Science from the University of Illinois at Urbana Champaign, where his research focused on question answering, search and domain adaptation.
Sungmin Hong is a Senior Applied Scientist at the Amazon Generative AI Innovation Center, where he helps expedite a variety of use cases for AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds a Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading and cooking.
Yiyue Qian is an Applied Scientist II at the AWS Generative AI Innovation Center, where she develops generative AI solutions for AWS customers. Her expertise encompasses designing and implementing innovative AI-driven and deep learning techniques, focusing on natural language processing, computer vision, multi-modal learning, and graph learning. Yiyue holds a Ph.D. in Computer Science from the University of Notre Dame, where her research centered on advanced machine learning and deep learning methodologies. Outside of work, she enjoys sports, hiking, and traveling.
Wei-Chih Chen is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he works on model customization and optimization for LLMs. He also builds tools to help his team tackle various aspects of the LLM development life cycle—including fine-tuning, benchmarking, and load-testing—accelerating the adoption of diverse use cases for AWS customers. He holds an M.S. degree in Computer Science from UC Davis.
Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a Ph.D. in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.

Achieve ~2x speed-up in LLM inference with Medusa-1 on Amazon SageMaker AI

This blog post is co-written with Moran Beladev, Manos Stergiadis, and Ilya Gusev from Booking.com.
Large language models (LLMs) have revolutionized the field of natural language processing with their ability to understand and generate humanlike text. Trained on broad, generic datasets spanning a wide range of topics and domains, LLMs use their parametric knowledge to perform increasingly complex and versatile tasks across multiple business use cases. Furthermore, companies are increasingly investing resources in customizing LLMs through few-shot learning and fine-tuning to optimize their performance for specialized applications.
However, the impressive performance of LLMs comes at the cost of significant computational requirements, driven by their large number of parameters and autoregressive decoding process which is sequential in nature. This combination makes achieving low latency a challenge for use cases such as real-time text completion, simultaneous translation, or conversational voice assistants, where subsecond response times are critical.
Researchers developed Medusa, a framework to speed up LLM inference by adding extra heads to predict multiple tokens simultaneously. This post demonstrates how to use Medusa-1, the first version of the framework, to speed up an LLM by fine-tuning it on Amazon SageMaker AI, and confirms the speedup with a deployment and a simple load test. Medusa-1 achieves an inference speedup of around two times without sacrificing model quality, with the exact improvement varying based on model size and data used. In this post, we demonstrate its effectiveness with a 1.8 times speedup observed on a sample dataset.
Introduction to Medusa and its benefits for LLM inference speed
LLMs generate text in a sequential manner, which involves autoregressive sampling, with each new token conditional on the previous ones. Generating K tokens necessitates K sequential executions of the model. This token-by-token processing introduces an inherent latency and computational overhead because the model needs to perform a separate forward pass for each new token in the output sequence. The following diagram from Role-Play with Large Language Models illustrates this flow.

Speculative decoding tackles this challenge by using a smaller, faster draft model to generate multiple potential token continuations in parallel, which are then verified by a larger, more accurate target model. This parallelization speeds up text generation while maintaining the quality of the target model because the verification task is faster than autoregressive token generation. For a detailed explanation of the concept, refer to the paper Accelerating Large Language Model Decoding with Speculative Sampling. The speculative decoding technique can be implemented using the inference optimization toolkit on Amazon SageMaker Jumpstart.
The paper Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads introduced Medusa as an alternative to speculative decoding. Instead of adding a separate draft model, it adds extra decoding heads to the LLM that generate candidate continuations simultaneously. These candidates are then evaluated in parallel using a tree-based attention mechanism. This parallel processing reduces the number of sequential steps needed, leading to faster inference times. The main advantage of Medusa over speculative decoding is that it eliminates the need to acquire and maintain a separate draft model while achieving higher speedups. For example, when tested on the MT-Bench dataset, the paper reports that Medusa-2 (the second version of Medusa) speeds up inference time by 2.8 times. This outperforms speculative decoding, which only manages to speed up inference time by 1.5 times on the same dataset.
The Medusa framework currently supports Llama and Mistral models. Although it offers significant speed improvements, it does come with a memory trade-off (similar to speculative decoding). For instance, adding five Medusa heads to the 7-billion-parameter Mistral model increases the total parameter count by 750 million (150 million per head), which means these additional parameters must be stored in GPU memory, leading to a higher memory requirement. However, in most cases, this increase doesn’t necessitate switching to a higher GPU memory instance. For example, you can still use an ml.g5.4xlarge instance with 24 GB of GPU memory to host your 7-billion-parameter Llama or Mistral model with extra Medusa heads.
Training Medusa heads requires additional development time and computational resources, which should be factored into project planning and resource allocation. Another important limitation to mention is that the current framework, when deployed on an Amazon SageMaker AI endpoint, only supports a batch size of one—a configuration typically used for low-latency applications.
The following diagram, from the original Medusa paper authors’ FasterDecoding repository, gives a visual overview of the Medusa framework.

There are two main variants of Medusa:

Medusa-1 – Requires a two-stage approach where you first fine-tune your LLM and then add Medusa heads and train them on top of your frozen fine-tuned LLM
Medusa-2 – Introduced later as an improvement, fine-tunes both the additional heads and the backbone LLM parameters together, enabling potentially even further latency speedups

The Medusa paper reports that across models of varying sizes, you can achieve inference speedups of around two times for Medusa-1 and around three times for Medusa-2. With Medusa-1, the predictions are identical to those of the originally fine-tuned LLM. In contrast, with Medusa-2, we might observe slightly different results compared to simple fine-tuning of the LLM because both the heads and the backbone LLM parameters are updated together. In this post, we focus on Medusa-1.
Solution overview
We cover the following steps in our solution:

Prerequisites
Load and prepare the dataset
Fine-tune an LLM using a SageMaker AI training job
Train Medusa heads on top of a frozen fine-tuned LLM using a SageMaker AI training job
Deploy the fine-tuned LLM with Medusa heads on a SageMaker AI endpoint
Demonstrate LLM inference speedup

By following this solution, you can accelerate LLM inference in your applications, leading to faster response times and improved user experience.
Prerequisites
To build the solution yourself, there are the following prerequisites:

You need an AWS account with an AWS Identity and Access Management (IAM) role that has permissions to manage resources created as part of the solution (for example AmazonSageMakerFullAccess and AmazonS3FullAccess). For details, refer to Creating an AWS account.
We use JupyterLab in Amazon SageMaker Studio running on an ml.t3.medium instance with a Python 3 (ipykernel) kernel. However, you can also use an Amazon SageMaker notebook instance (with a conda_pytorch_p310 kernel) or any integrated development environment (IDE) of your choice.
Be sure to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, refer to Configure the AWS CLI.
The solution uses an ml.g5.4xlarge instance for the SageMaker AI training jobs, and three ml.g5.4xlarge instances are used for the SageMaker AI endpoints. Make sure you have sufficient capacity for this instance type in your AWS account by requesting a quota increase if required. Also check the pricing of the on-demand instances to understand the associated costs.
To replicate the solution demonstrated in this post, you need to clone this GitHub repository. Within the repository, you can use the medusa_1_train.ipynb notebook to run all the steps in this post. This repository is a modified version of the original How to Fine-Tune LLMs in 2024 on Amazon SageMaker. We added simplified Medusa training code, adapted from the original Medusa repository.

Load and prepare the dataset
Now that you have cloned the GitHub repository and opened the medusa_1_train.ipynb notebook, you will load and prepare the dataset in the notebook. We encourage you to read this post while running the code in the notebook. For this post, we use a dataset called sql-create-context, which contains samples of natural language instructions, schema definitions and the corresponding SQL query. It contains 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL queries answering the question using the CREATE statement as context. For demonstration purposes, we select 3,000 samples and split them into train, validation, and test sets.
You need to run the “Load and prepare the dataset” section of the medusa_1_train.ipynb to prepare the dataset for fine-tuning. We also included a data exploration script to analyze the length of input and output tokens. After data exploration, we prepare the train, validation, and test sets and upload them to Amazon Simple Storage Service (Amazon S3).
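The notebook contains the full preparation code; the following is a condensed sketch of what this step looks like. The Hugging Face dataset ID, split ratios, and file names are our assumptions based on the description above, not the exact code in the notebook.

from datasets import load_dataset

# Load the sql-create-context dataset (assumed Hugging Face ID) and keep 3,000 samples
dataset = load_dataset("b-mc2/sql-create-context", split="train")
dataset = dataset.shuffle(seed=42).select(range(3000))

# Split into train / validation / test (for example, 80 / 10 / 10)
splits = dataset.train_test_split(test_size=0.2, seed=42)
eval_test = splits["test"].train_test_split(test_size=0.5, seed=42)
train_dataset, eval_dataset, test_dataset = splits["train"], eval_test["train"], eval_test["test"]

# Save locally as JSON Lines before uploading to Amazon S3
train_dataset.to_json("train_dataset.json")
eval_dataset.to_json("eval_dataset.json")
test_dataset.to_json("test_dataset.json")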
Fine-tune an LLM using SageMaker AI training job
We use the Zephyr 7B β model as our backbone LLM. Zephyr is a series of language models trained to act as helpful assistants, and Zephyr 7B β is a fine-tuned version of Mistral-7B-v0.1, trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization.
To launch a SageMaker AI training job, we need to use the PyTorch or Hugging Face estimator. SageMaker AI starts and manages all the necessary Amazon Elastic Compute Cloud (Amazon EC2) instances for us, supplies the appropriate containers, downloads data from our S3 bucket to the container and uploads and runs the specified training script, in our case fine_tune_llm.py. We select the hyperparameters based on the QLoRA paper, but we encourage you to experiment with your own combinations. To expedite the execution of this code, we set the number of epochs to 1. However, for better results, it’s generally recommended to set the number of epochs to at least 2 or 3.

from sagemaker.pytorch.estimator import PyTorch
from sagemaker.debugger import TensorBoardOutputConfig
import time
import os

def get_current_time():
    return time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())

def create_estimator(hyperparameters_dict, job_name, role, sess, train_script_path):
    metric = [
        {"Name": "loss", "Regex": r"'loss':\s*([0-9.]+)"},
        {"Name": "epoch", "Regex": r"'epoch':\s*([0-9.]+)"},
    ]

    tensorboard_s3_output_path = os.path.join(
        "s3://", sess.default_bucket(), job_name, 'tensorboard'
    )
    print("Tensorboard output path:", tensorboard_s3_output_path)

    tensorboard_output_config = TensorBoardOutputConfig(
        s3_output_path=tensorboard_s3_output_path,
        container_local_output_path=hyperparameters_dict['logging_dir']
    )
    estimator = PyTorch(
        sagemaker_session=sess,
        entry_point=train_script_path,          # training script
        source_dir='train',                      # directory that includes all the files needed for training
        instance_type='ml.g5.4xlarge',           # instance type used for the training job, "local_gpu" for local mode
        metric_definitions=metric,
        instance_count=1,                        # the number of instances used for training
        role=role,                               # IAM role used by the training job to access AWS resources, e.g. S3
        volume_size=300,                         # the size of the EBS volume in GB
        framework_version='2.1.0',               # the PyTorch version used in the training job
        py_version='py310',                      # the Python version used in the training job
        hyperparameters=hyperparameters_dict,    # the hyperparameters passed to the training job
        disable_output_compression=True,         # do not compress output to save training time and cost
        tensorboard_output_config=tensorboard_output_config
    )
    return estimator

# hyperparameters, which are passed into the training job
sft_hyperparameters = {
    ### SCRIPT PARAMETERS ###
    'train_dataset_path': '/opt/ml/input/data/train/train_dataset.json', # path where sagemaker will save training dataset
    'eval_dataset_path': '/opt/ml/input/data/eval/eval_dataset.json',    # path where sagemaker will save evaluation dataset
    'model_id': model_id,
    'max_seq_len': 256,                  # max sequence length for model and packing of the dataset
    'use_qlora': True,                   # use QLoRA
    ### TRAINING PARAMETERS ###
    'num_train_epochs': 1,               # number of training epochs
    'per_device_train_batch_size': 1,    # batch size per device during training
    'gradient_accumulation_steps': 16,   # number of steps before performing a backward/update pass
    'gradient_checkpointing': True,      # use gradient checkpointing to save memory
    'optim': "adamw_8bit",               # use 8-bit AdamW optimizer
    'logging_steps': 15,                 # log every 15 steps
    'save_strategy': "steps",            # save a checkpoint every save_steps steps
    'save_steps': 15,
    'save_total_limit': 2,
    'eval_strategy': "steps",
    'eval_steps': 15,
    'learning_rate': 1e-4,               # learning rate, based on QLoRA paper
    'bf16': True,                        # use bfloat16 precision
    'max_grad_norm': 10,                 # max gradient norm based on QLoRA paper
    'warmup_ratio': 0.03,                # warmup ratio based on QLoRA paper
    'lr_scheduler_type': "constant",     # use constant learning rate scheduler
    'output_dir': '/opt/ml/checkpoints/', # temporary output directory for model checkpoints
    'merge_adapters': True,              # merge LoRA adapters into model for easier deployment
    'report_to': "tensorboard",          # report metrics to tensorboard
    'logging_dir': "/opt/ml/output/tensorboard" # tensorboard logging directory
}

sft_job_name = f"sft-qlora-text-to-sql-{get_current_time()}"
data = {
    'train': train_dataset_path,
    'eval': eval_dataset_path
}

sft_estimator = create_estimator(sft_hyperparameters, sft_job_name, role, sess, "fine_tune_llm.py")

sft_estimator.fit(job_name=sft_job_name, inputs=data, wait=False)

When our training job has completed successfully after approximately 1 hour, we can use the fine-tuned model artifact for the next step, training the Medusa heads on top of it. To visualize the training metrics in TensorBoard, you can follow the guidance in this documentation: Load and visualize output tensors using the TensorBoard application.
Train Medusa heads on top of frozen fine-tuned LLM using a SageMaker AI training job
For training Medusa heads, we can reuse the functions previously mentioned to launch the training job. We selected hyperparameters based on a combination of what the Medusa paper reported and what we found to be best performing after a few experiments. We set the number of Medusa heads to 5 and used the 8-bit AdamW optimizer, as recommended by the paper. For simplicity, we maintained a constant learning rate of 1e-4 with a constant scheduler, similar to the previous fine-tuning step. Although the paper recommends an increased learning rate and a cosine scheduler, we found that our chosen combination of hyperparameters performed well on this dataset. However, we encourage you to experiment with your own hyperparameter settings to potentially achieve even better results.

# hyperparameters, which are passed into the training job
medusa_hyperparameters = {
    ### SCRIPT PARAMETERS ###
    "train_dataset_path": "/opt/ml/input/data/train/train_dataset.json",  # path where SageMaker saves the training dataset
    "eval_dataset_path": "/opt/ml/input/data/eval/eval_dataset.json",  # path where SageMaker saves the evaluation dataset
    "model_path": "/opt/ml/input/data/fine-tuned-model/",
    "max_seq_len": 256,  # max sequence length for model and packing of the dataset
    "medusa_num_heads": 5,
    ### TRAINING PARAMETERS ###
    "num_train_epochs": 3,  # number of training epochs
    "per_device_train_batch_size": 1,  # batch size per device during training
    "gradient_accumulation_steps": 16,  # number of steps before performing a backward/update pass
    "gradient_checkpointing": True,  # use gradient checkpointing to save memory
    "optim": "adamw_8bit",  # use 8-bit AdamW optimizer
    "logging_steps": 15,  # log every 15 steps
    "save_strategy": "steps",  # save a checkpoint every save_steps steps
    "save_steps": 15,
    "save_total_limit": 2,
    "eval_strategy": "steps",
    "eval_steps": 15,
    "learning_rate": 1e-4,  # learning rate
    "bf16": True,  # use bfloat16 precision
    "max_grad_norm": 10,  # max gradient norm based on QLoRA paper
    "warmup_ratio": 0.03,  # warmup ratio based on QLoRA paper
    "lr_scheduler_type": "constant",  # use constant learning rate scheduler
    "output_dir": "/opt/ml/checkpoints/",  # temporary output directory for model checkpoints
    "report_to": "tensorboard",  # report metrics to TensorBoard
    "logging_dir": "/opt/ml/output/tensorboard"  # TensorBoard logging directory
}

medusa_train_job_name = f"medusa-text-to-sql-{get_current_time()}"
data = {
    "train": train_dataset_path,
    "eval": eval_dataset_path,
    "fine-tuned-model": fine_tuned_model_path
}

medusa_estimator = create_estimator(medusa_hyperparameters, medusa_train_job_name, role, sess, "train_medusa_heads.py")

medusa_estimator.fit(job_name=medusa_train_job_name, inputs=data, wait=False)

We found that after 3 epochs, the evaluation loss of Medusa heads was converging, which can be observed in the TensorBoard graph in the following image.

Besides the hyperparameters, the main difference is that we pass train_medusa_heads.py as the training entry point. In that script, we first add the Medusa heads, then freeze the fine-tuned LLM, and create a custom MedusaSFTTrainer class, which is a subclass of the transformers SFTTrainer.

# Add medusa heads and freeze base model
add_medusa_heads(
    model,
    medusa_num_heads=script_args.medusa_num_heads,
)
freeze_layers(model)
model.config.torch_dtype = torch_dtype
model.config.use_cache = False

logger.info("Finished loading model and medusa heads")

tokenizer = AutoTokenizer.from_pretrained(script_args.model_path, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

################
# Training
################
trainer = MedusaSFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    dataset_kwargs={
        "add_special_tokens": False,  # we template with special tokens
        "append_concat_token": False,  # no need to add an additional separator token
    },
    medusa_num_heads=script_args.medusa_num_heads,
    medusa_heads_coefficient=script_args.medusa_heads_coefficient,
    medusa_decay_coefficient=script_args.medusa_decay_coefficient,
    medusa_scheduler=script_args.medusa_scheduler,
    train_only_medusa_heads=script_args.train_only_medusa_heads,
    medusa_lr_multiplier=script_args.medusa_lr_multiplier
)
trainer.train()

In the add_medusa_heads() function, we add the residual blocks of the Medusa heads, and also override the forward pass for our model to make sure not to train the frozen backbone LLM:

def add_medusa_heads(
    model,
    medusa_num_heads,
):
    """
    Args:
        model (nn.Module): The base language model to be used.
        medusa_num_heads (int, optional): Number of additional tokens to predict.
    """
    hidden_size = model.lm_head.weight.shape[-1]
    vocab_size = model.lm_head.weight.shape[0]
    model.config.medusa_num_layers = 1
    model.config.medusa_num_heads = medusa_num_heads
    model.medusa_num_heads = medusa_num_heads
    # Create a list of Medusa heads
    model.medusa_heads = nn.ModuleList(
        [
            nn.Sequential(
                ResBlock(hidden_size),
                nn.Linear(hidden_size, vocab_size, bias=False),
            )
            for _ in range(medusa_num_heads)
        ]
    )

    # Ensure the medusa heads' dtype and device align with the base model
    model.medusa_heads.to(model.dtype).to(model.device)
    logger.info(f"Loading medusa heads in {str(model.dtype)} to device {model.device}")

    for i in range(medusa_num_heads):
        # Initialize the weights of each medusa head using the base model's lm_head weights
        model.medusa_heads[i][-1].weight.data[:] = model.lm_head.weight.data[:]

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        train_only_medusa_heads: bool = False,
    ):
        """Forward pass of the MedusaModel.
        Returns:
            torch.Tensor: A tensor containing predictions from all Medusa heads.
            (Optional) Original predictions from the base model's LM head.
        """
        # Skip gradient computation for the frozen backbone when training only the Medusa heads
        maybe_grad = torch.no_grad() if train_only_medusa_heads else nullcontext()
        with maybe_grad:
            outputs = self.model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_values=past_key_values,
                inputs_embeds=inputs_embeds,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
        hidden_states = outputs[0]
        medusa_logits = [self.lm_head(hidden_states)]
        for i in range(self.medusa_num_heads):
            medusa_logits.append(self.medusa_heads[i](hidden_states))
        return torch.stack(medusa_logits, dim=0)

    model.forward = types.MethodType(forward, model)

After the model training is finished (which takes about 1 hour), we prepare the model artifacts for deployment and upload them to Amazon S3. The final model artifact contains both the original fine-tuned model from the previous step under the base-model prefix and the trained Medusa heads in a file named medusa_heads.safetensors.
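
Before deploying, it can help to confirm that the uploaded artifact has the layout described above. The following is a minimal sketch, assuming medusa_trained_model_path (the same variable used in the deployment step later in this post) holds the s3:// prefix of the Medusa training job output:

import boto3

def list_model_artifacts(s3_uri):
    # Split an s3://bucket/prefix URI into bucket and key prefix
    bucket, _, prefix = s3_uri.removeprefix("s3://").partition("/")
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Expect keys under the base-model/ prefix plus medusa_heads.safetensors
            print(obj["Key"])

list_model_artifacts(medusa_trained_model_path)
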
Deploy the fine-tuned LLM with Medusa heads on a SageMaker AI endpoint
The Medusa framework is supported by the Text Generation Inference (TGI) server. After training the LLM with Medusa heads, we deploy it to a SageMaker AI real-time endpoint using the Hugging Face Inference Container set up with TGI.
First, we create a SageMaker AI HuggingFaceModel object and then deploy the model to an endpoint with the following function:

import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

def deploy_model(endpoint_name, instance_type, model_s3_path=None, hf_model_id=None):
    llm_image = get_huggingface_llm_image_uri(
        "huggingface",
        version="2.2.0",
        session=sess,
    )

    print(f"llm image uri: {llm_image}")

    model_data = None
    if model_s3_path:
        model_data = {"S3DataSource": {"S3Uri": model_s3_path, "S3DataType": "S3Prefix", "CompressionType": "None"}}
        hf_model_id = "/opt/ml/model"
    else:
        assert hf_model_id, "You need to provide either a pretrained HF model id or S3 model data to deploy"

    config = {
        "HF_MODEL_ID": hf_model_id,  # path where SageMaker stores the model, or a Hugging Face Hub model id
        "SM_NUM_GPUS": json.dumps(1),  # number of GPUs used per replica
        "MAX_INPUT_LENGTH": json.dumps(1024),  # max length of input text
        "MAX_TOTAL_TOKENS": json.dumps(2048),  # max length of the generation (including input text)
    }

    llm_model = HuggingFaceModel(
        name=endpoint_name,
        role=role,
        image_uri=llm_image,
        model_data=model_data,
        env=config
    )

    deployed_llm = llm_model.deploy(
        endpoint_name=endpoint_name,
        initial_instance_count=1,
        instance_type=instance_type,
        container_startup_health_check_timeout=300,
    )
    return deployed_llm

We deploy three LLMs on three SageMaker AI endpoints:

Base LLM which isn’t fine-tuned
The LLM that we fine-tuned
The fine-tuned LLM that also has trained Medusa heads

You can deploy the three models in parallel by using a function that we included in the notebook, or you can deploy the models one by one by running the code below:

base_deployed_llm = deploy_model(f"base-{get_current_time()}", instance_type="ml.g5.4xlarge", model_s3_path=None, hf_model_id=model_id)
sft_deployed_llm = deploy_model(f"sft-{get_current_time()}", instance_type="ml.g5.4xlarge", model_s3_path=fine_tuned_model_path)
medusa_deployed_llm = deploy_model(f"medusa-{get_current_time()}", instance_type="ml.g5.4xlarge", model_s3_path=medusa_trained_model_path)

After the status for each endpoint becomes InService, which should take around 15 minutes, we can invoke them for inference. We send the following input:

"You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA. SCHEMA: CREATE TABLE table_name_32 (time VARCHAR, heat VARCHAR, name VARCHAR)
What was Liu Limin's time in heat 4?"

We can observe the following responses:

The base LLM response contains extra words that aren’t needed:

“To retrieve the time of Liu Limin in heat 4 based on the provided SCHEMA, we need to write a SQL query. Since the table name is not provided, let ‘s assume it’s “my_table”. Assuming the table has a primary key (which is not mentioned in the SCHEMA), we can join the table with itself to compare the heat and name of Liu Limin in heat 4 with all the records in the table. Here’s the SQL query: “sql SELECT t1.time FROM my_table t1 JOIN my_table t2 ON t1.name = t2.name AND t2.heat = 4 WHERE t1.name = ‘Liu Limin’ AND t1.heat <> 4; ` Explanation: 1. We’re selecting the time` column from the first table (t1) in the FROM clause. 2. We’re joining the table with itself (my_table t1 JOIN my_table t2) to compare the name and heat of Liu Limin in heat 4 with all the records in the table. 3. We’re filtering the results using the WHERE clause. We’re selecting only the records where the name is ‘Liu Limin’ and the heat is not equal to 4 (i.e., not heat 4). This is to ensure that we’re selecting the time of Liu Limin in heat 3. Note: This query assumes that the table has a unique primary key. If the table doesn’t have a primary key, you may need to add additional conditions to the JOIN and WHERE clauses to ensure that we’re selecting the correct records.“

The fine-tuned LLM response is improved significantly, and contains only the required output:

'SELECT time FROM table_name_32 WHERE heat = 4 AND name = "liu limin"'

The fine-tuned LLM with trained Medusa heads provides the exact same response as the fine-tuned model, demonstrating that Medusa-1, by design, maintains the output (quality) of the original model:

'SELECT time FROM table_name_32 WHERE heat = 4 AND name = "liu limin"'

Demonstrate LLM inference speedup
To measure the inference speed improvements, we compare the response times of the deployed fine-tuned LLM and the fine-tuned LLM with Medusa heads on 450 test observations with the following code:

import time
import numpy as np
from tqdm import tqdm

def request(sample, deployed_llm):
    prompt = tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True)
    outputs = deployed_llm.predict({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 512,
            "do_sample": False,
            "return_full_text": False,
        }
    })
    return {"role": "assistant", "content": outputs[0]["generated_text"].strip()}

def predict(deployed_llm, test_dataset):
    predicted_answers = []
    latencies = []

    for sample in tqdm(test_dataset):
        start_time = time.time()
        predicted_answer = request(sample["messages"][:2], deployed_llm)
        end_time = time.time()

        latency = end_time - start_time
        latencies.append(latency)
        predicted_answers.append(predicted_answer)

    # Calculate p90 and average latencies
    p90_latency = np.percentile(latencies, 90)
    avg_latency = np.mean(latencies)

    print(f"P90 Latency: {p90_latency:.2f} seconds")
    print(f"Average Latency: {avg_latency:.2f} seconds")

    return predicted_answers

First, we run predictions using the fine-tuned LLM:

sft_predictions = predict(sft_deployed_llm, test_dataset)
P90 Latency: 1.28 seconds
Average Latency: 0.95 seconds

Then, we run predictions using the fine-tuned LLM with Medusa heads:

medusa_predictions = predict(medusa_deployed_llm, test_dataset)
P90 Latency: 0.80 seconds
Average Latency: 0.53 seconds

The prediction runs should take around 8 and 4 minutes, respectively. The average latency decreased from 950 to 530 milliseconds, roughly a 1.8 times improvement. You can achieve even higher improvements if your dataset contains longer inputs and outputs; in our dataset, the average was only 18 input tokens and 30 output tokens.
We want to highlight once again that, with this technique, the output quality is fully maintained and the prediction outputs are identical. The model responses for the test set of 450 observations are the same with and without Medusa heads:

match_percentage = sum(a["content"] == b["content"] for a, b in zip(sft_predictions, medusa_predictions)) / len(sft_predictions) * 100
print(f"Predictions with the fine-tuned model with medusa heads are the same as without medusa heads: {match_percentage:.2f}% of test set")

Predictions with fine-tuned model with medusa heads are the same as without medusa heads: 100.00% of test set

In your own run, a few observations might not match exactly, and you might see a match rate closer to 99% because of small floating point differences caused by GPU optimizations.
Cleanup
At the end of this experiment, don’t forget to delete the SageMaker AI endpoints you created:

base_deployed_llm.delete_model()
base_deployed_llm.delete_endpoint()
sft_deployed_llm.delete_model()
sft_deployed_llm.delete_endpoint()
medusa_deployed_llm.delete_model()
medusa_deployed_llm.delete_endpoint()

Conclusion
In this post, we demonstrated how to fine-tune and deploy an LLM with Medusa heads using the Medusa-1 technique on Amazon SageMaker AI to accelerate LLM inference. By using this framework and SageMaker AI scalable infrastructure, we showed how to achieve up to twofold speedups in LLM inference while maintaining model quality. This solution is particularly beneficial for applications requiring low-latency text generation, such as customer service chat assistants, content creation, and recommendation systems.
As a next step, you can explore fine-tuning your own LLM with Medusa heads on your own dataset and benchmark the results for your specific use case, using the provided GitHub repository.

About the authors
Daniel Zagyva is a Senior ML Engineer at AWS Professional Services. He specializes in developing scalable, production-grade machine learning solutions for AWS customers. His experience extends across different areas, including natural language processing, generative AI and machine learning operations.
Aleksandra Dokic is a Senior Data Scientist at AWS Professional Services. She enjoys supporting customers to build innovative AI/ML solutions on AWS and she is excited about business transformations through the power of data.
Moran Beladev is a Senior ML Manager at Booking.com. She is leading the content intelligence track which is focused on building, training and deploying content models (computer vision, NLP and generative AI) using the most advanced technologies and models. Moran is also a PhD candidate, researching applying NLP models on social graphs.
Manos Stergiadis is a Senior ML Scientist at Booking.com. He specializes in generative NLP and has experience researching, implementing and deploying large deep learning models at scale.
Ilya Gusev is a Senior Machine Learning Engineer at Booking.com. He leads the development of the several LLM systems inside Booking.com. His work focuses on building production ML systems that help millions of travelers plan their trips effectively.
Laurens van der Maas is a Machine Learning Engineer at AWS Professional Services. He works closely with customers building their machine learning solutions on AWS, specializes in natural language processing, experimentation and responsible AI, and is passionate about using machine learning to drive meaningful change in the world.

LLM-as-a-judge on Amazon Bedrock Model Evaluation

The evaluation of large language model (LLM) performance, particularly in response to a variety of prompts, is crucial for organizations aiming to harness the full potential of this rapidly evolving technology. The introduction of an LLM-as-a-judge framework represents a significant step forward in simplifying and streamlining the model evaluation process. This approach allows organizations to assess their AI models’ effectiveness using pre-defined metrics, making sure that the technology aligns with their specific needs and objectives. By adopting this method, companies can more accurately gauge the performance of their AI systems, making informed decisions about model selection, optimization, and deployment. This not only enhances the reliability and efficiency of AI applications, but also contributes to a more strategic and informed approach to technology adoption within the organization.
Amazon Bedrock, a fully managed service offering high-performing foundation models from leading AI companies through a single API, has recently introduced two significant evaluation capabilities: LLM-as-a-judge under Amazon Bedrock Model Evaluation and RAG evaluation for Amazon Bedrock Knowledge Bases. Both features use the LLM-as-a-judge technique behind the scenes but evaluate different things. This blog post explores LLM-as-a-judge on Amazon Bedrock Model Evaluation, providing comprehensive guidance on setting up the feature, starting evaluation jobs through both the console and the Python SDK and APIs, and demonstrating how this evaluation feature can enhance generative AI applications across multiple metric categories, including quality, user experience, instruction following, and safety.
Before we explore the technical aspects and implementation details, let’s examine the key features that make LLM-as-a-judge on Amazon Bedrock Model Evaluation particularly powerful and distinguish it from traditional evaluation methods. Understanding these core capabilities will help illuminate why this feature represents a significant advancement in AI model evaluation.
Key features of LLM-as-a-judge

Automated intelligent evaluation: LLM-as-a-judge uses pre-trained models to evaluate responses automatically, providing human-like evaluation quality with up to 98% cost savings. The system dramatically reduces evaluation time from weeks to hours while maintaining consistent evaluation standards across large datasets.
Comprehensive metric categories: The evaluation system covers four key metric areas: quality assessment (correctness, completeness, faithfulness), user experience (helpfulness, coherence, relevance), instruction compliance (following instructions, professional style), and safety monitoring (harmfulness, stereotyping, refusal handling).
Seamless integration: The feature integrates directly with Amazon Bedrock and remains compatible with existing Amazon Bedrock Model Evaluation features. Users can access the functionality through the AWS Management Console for Amazon Bedrock and quickly integrate their custom datasets for evaluation purposes.
Flexible implementation: The system supports the evaluation of models hosted on Amazon Bedrock, custom fine-tuned models, and imported models. Users can seamlessly connect their evaluation datasets through Amazon Simple Storage Service (Amazon S3) buckets, making the evaluation process streamlined and efficient.
Curated judge models: Amazon Bedrock provides pre-selected, high-quality evaluation models with optimized prompt engineering for accurate assessments. Users don’t need to bring external judge models, because the Amazon Bedrock team maintains and updates a selection of judge models and associated evaluation judge prompts.
Cost-effective scaling: The feature enables organizations to perform comprehensive model evaluations at scale without the traditional costs and time investments associated with human evaluation. The automated process maintains high-quality assessments while significantly reducing operational overhead.

These features create a powerful evaluation framework that helps organizations optimize their AI model performance while maintaining high standards of quality and safety, all within their secure AWS environment.
Product overview
Now that you understand the key features of LLM-as-a-judge, let’s examine how to implement and use this capability within Amazon Bedrock Model Evaluation. This section provides a comprehensive overview of the architecture and walks through each component, demonstrating how they work together to deliver accurate and efficient model evaluations.
LLM-as-a-judge on Amazon Bedrock Model Evaluation provides a comprehensive, end-to-end solution for assessing and optimizing AI model performance. This automated process uses the power of LLMs to evaluate responses across multiple metric categories, offering insights that can significantly improve your AI applications. Let’s walk through the key components of this solution as shown in the following diagram:

LLM-as-a-judge on Amazon Bedrock Model Evaluation follows a streamlined workflow that enables systematic model evaluation. Here’s how each component works together in the evaluation process:

Prompt dataset: The process begins with a prepared dataset containing prompts that will be used to test the model’s performance. The evaluation can be conducted with or without ground truth responses—while including ground truth provides additional comparison points, it’s entirely optional and not required for successful evaluation.
JSONL file preparation: The prompt dataset is converted into JSONL format, which is specifically structured for LLM-as-a-judge evaluation jobs. This format promotes proper processing of evaluation data.
Amazon S3 storage: The prepared JSONL file is uploaded to an S3 bucket, serving as the secure storage location for the evaluation data.
Evaluation processing: The Amazon Bedrock LLM-as-a-judge model evaluation job processes the stored data, running comprehensive assessments across the selected metric categories (including quality, user experience, instruction following, and safety).
Automated report generation: Upon completion, the system generates detailed evaluation reports containing metrics, scores, and insights at both aggregate and individual response levels.
Expert analysis: Data scientists or machine learning engineers analyze the generated reports to derive actionable insights and make informed decisions.

With this solution architecture in mind, let’s explore how to implement LLM-as-a-judge model evaluations effectively, making sure that you get the most valuable insights from your assessment process.
Prerequisites
To use the LLM-as-a-judge model evaluation, make sure that you have satisfied the following requirements:

An active AWS account.
Selected evaluator and generator models enabled in Amazon Bedrock. You can confirm that the models are enabled for your account on the Model access page of the Amazon Bedrock console.
Confirm the AWS Regions where the model is available, and check the applicable quotas.
Complete the model evaluation prerequisites related to AWS Identity and Access Management (IAM) role creation, and add permissions for the S3 bucket to access and write output data.

You also need to set up and enable CORS on your S3 bucket.
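
You can configure CORS from the S3 console or programmatically. The following is a minimal sketch using boto3; the bucket name is a placeholder, and the wide-open origins and methods shown here are assumptions that you should tighten for your environment:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_cors(
    Bucket="<YOUR_EVALUATION_BUCKET>",  # placeholder bucket name
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    },
)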

If you’re using a custom model instead of an on-demand model for your generator model, make sure that you have sufficient quota for running a Provisioned Throughput during inference.

Complete the prerequisites for importing a custom model.
Go to the AWS Service Quotas console, and check the following quotas:

Model units no-commitment Provisioned Throughputs across custom models.
Model units per provisioned model for [your custom model name].
Both of these fields need to have enough quota to support your Provisioned Throughput model unit. Request a quota increase if necessary to accommodate your expected inference workload.

Prepare input dataset
When preparing your dataset for LLM-as-a-judge model evaluation jobs, each prompt must include specific key-value pairs. Here are the required and optional fields:

prompt (required): This key indicates the input for various tasks. It can be used for general text generation where the model needs to provide a response, question-answering tasks where the model must answer a specific question, text summarization tasks where the model needs to summarize a given text, or classification tasks where the model must categorize the provided text.
referenceResponse (used for specific metrics with ground truth): This key contains the ground truth or correct response. It serves as the reference point against which the model’s responses will be evaluated if it is provided.
category (optional): This key is used to generate evaluation scores reported by category, helping organize and segment evaluation results for better analysis.

Dataset requirements:

Each line must be a valid JSON object
The file must use JSONL format
The dataset should be stored in an Amazon S3 bucket

Example JSONL format without ground truth (category is optional):

{"prompt": "What is machine learning?", "category": "technical"}
{"prompt": "Summarize climate change impacts", "category": "environmental"}

Example JSONL format with ground truth (category is optional):

{"prompt": "What is machine learning?", "referenceResponse": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It uses algorithms and statistical models to analyze and draw inferences from patterns in data, allowing computers to perform specific tasks without explicit instructions.", "category": "technical"}
{"prompt": "Summarize climate change impacts", "referenceResponse": "Climate change leads to rising global temperatures, extreme weather events, sea level rise, and disruption of ecosystems. These changes result in more frequent natural disasters, threats to food security, loss of biodiversity, and various public health challenges. The impacts affect agriculture, coastal communities, and vulnerable populations disproportionately.", "category": "environmental"}
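
If you prefer to build the dataset programmatically, the following minimal sketch writes records in this JSONL format and uploads the file to S3; the bucket name and key are placeholders:

import json
import boto3

records = [
    {"prompt": "What is machine learning?", "category": "technical"},
    {"prompt": "Summarize climate change impacts", "category": "environmental"},
]

# Write one JSON object per line (JSONL)
with open("input.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Upload the dataset to the S3 bucket used by the evaluation job
s3 = boto3.client("s3")
s3.upload_file("input.jsonl", "<YOUR_BUCKET>", "evaluation_data/input.jsonl")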

Start an LLM-as-a-judge model evaluation job using the console
You can use LLM-as-a-judge on Amazon Bedrock Model Evaluation to assess model performance through a user-friendly console interface. Follow these steps to start an evaluation job:

In the Amazon Bedrock console, choose Inference and Assessment in the navigation pane, and then select Evaluations. On the Evaluations page, choose the Models tab.

Choose Create and select Automatic: LLM-as-a-judge.
Enter a name and description and select an Evaluator model. This model will be used as a judge to evaluate the response of a prompt or model from your generative AI application.

Choose Tags and select the model to be used for generating responses in this evaluation job.

Select the metrics you want to use to evaluate the model response (such as helpfulness, correctness, faithfulness, relevance, and harmfulness).

For Choose a prompt dataset and Evaluation results, specify the S3 URIs. You can use the Browse S3 option.

Select or create an IAM service role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, and the models being used in the job. If you create a new IAM role in the evaluation setup, the service will automatically give the role the proper permissions for the job. Specify the output S3 bucket and choose Create.

You will be able to see the evaluation job is In Progress. Wait for the job status to change to Complete.

When complete, select the job to see its details. The following is the metrics summary (such as 0.83 for helpfulness, 1.00 for correctness, 1.00 for faithfulness, 1.00 for relevance, and 0.00 for harmfulness).

To view generation metrics details, scroll down in the model evaluation report and choose any individual metric (like helpfulness or correctness) to see its detailed breakdown.

To see each record’s prompt input, generation output, ground truth, and individual scores, choose a metric and select “Prompt details”. Hover over any individual score to view its detailed explanation.

Start an LLM-as-a-judge evaluation job using Python SDK and APIs
To use the Python SDK for creating an LLM-as-a-judge model evaluation job, use the following steps. First, set up the required configurations:

import boto3
from datetime import datetime

# Generate unique name for the job
job_name = f"Model-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure your evaluator and generator model settings
evaluator_model = "mistral.mistral-large-2402-v1:0"
generator_model = "amazon.nova-pro-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"

# Specify S3 locations for evaluation data and output
input_data = "s3://<YOUR_BUCKET>/evaluation_data/input.jsonl"
output_path = "s3://<YOUR_BUCKET>/evaluation_output/"

# Create Bedrock client
bedrock_client = boto3.client("bedrock")

To create an LLM-as-a-judge model evaluation job:

def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    evaluator_model_id: str,
    generator_model_id: str,
    dataset_name: str = None,
    task_type: str = "General"  # must be General for LLM-as-a-judge
):
    # All available LLM-as-a-judge metrics
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness",
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        "bedrockModel": {
                            "modelIdentifier": generator_model_id
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response

    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise

# Create evaluation job
try:
    llm_as_judge_response = create_llm_judge_evaluation(
        client=bedrock_client,
        job_name=job_name,
        role_arn=role_arn,
        input_s3_uri=input_data,
        output_s3_uri=output_path,
        evaluator_model_id=evaluator_model,
        generator_model_id=generator_model,
        task_type="General"
    )
    print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
except Exception as e:
    print(f"✗ Failed to create evaluation job: {str(e)}")
    raise

To monitor the progress of your evaluation job:

# Get job ARN based on job type
evaluation_job_arn = llm_as_judge_response["jobArn"]
# Check job status
check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)
print(f"Job Status: {check_status['status']}")

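If you prefer to block until the job finishes, a simple polling loop works. The following is a minimal sketch; the terminal status strings are assumptions, so verify them against the responses you see from get_evaluation_job:

import time

def wait_for_evaluation_job(client, job_arn, poll_seconds=60):
    # Poll until the evaluation job reaches a terminal state
    while True:
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        print(f"Job status: {status}")
        # Assumed terminal states; adjust if your API responses differ
        if status.lower() in ("completed", "failed", "stopped"):
            return status
        time.sleep(poll_seconds)

final_status = wait_for_evaluation_job(bedrock_client, evaluation_job_arn)
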
You can also compare multiple foundation models to determine which one works best for your needs. By using the same evaluator model across all comparisons, you’ll get consistent benchmarking results to help identify the optimal model for your use case.

from typing import Any, Dict, List  # typing imports used in the signature below

# Generator Models
GENERATOR_MODELS = [
    "anthropic.claude-3-haiku-20240307-v1:0",
    "amazon.nova-micro-v1:0"
]

# Consistent Evaluator
EVALUATOR_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"

def run_model_comparison(
    generator_models: List[str],
    evaluator_model: str
) -> List[Dict[str, Any]]:
    evaluation_jobs = []

    for generator_model in generator_models:
        job_name = f"llmaaj-{generator_model.split('.')[0]}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

        try:
            response = create_llm_judge_evaluation(
                client=bedrock_client,
                job_name=job_name,
                role_arn=role_arn,
                input_s3_uri=input_data,
                output_s3_uri=f"{output_path}/{job_name}/",
                evaluator_model_id=evaluator_model,
                generator_model_id=generator_model,
                task_type="General"
            )

            job_info = {
                "job_name": job_name,
                "job_arn": response["jobArn"],
                "generator_model": generator_model,
                "evaluator_model": evaluator_model,
                "status": "CREATED"
            }
            evaluation_jobs.append(job_info)

            print(f"✓ Created job: {job_name}")
            print(f"  Generator: {generator_model}")
            print(f"  Evaluator: {evaluator_model}")
            print("-" * 80)

        except Exception as e:
            print(f"✗ Error with {generator_model}: {str(e)}")
            continue

    return evaluation_jobs

# Run model comparison
evaluation_jobs = run_model_comparison(GENERATOR_MODELS, EVALUATOR_MODEL)

Correlation analysis for LLM-as-a-judge evaluations
You can use Spearman's rank correlation coefficient to compare evaluation results between different generator models evaluated with LLM-as-a-judge in Amazon Bedrock. After retrieving the evaluation results, which contain evaluation scores across various metrics, from your S3 bucket, you can begin the correlation analysis.
Using scipy.stats, compute the correlation coefficient between pairs of generator models, filtering out constant values or error messages to have a valid statistical comparison. The resulting correlation coefficients help identify how similarly different models respond to the same prompts. A coefficient closer to 1.0 indicates stronger agreement between the models’ responses, while values closer to 0 suggest more divergent behavior. This analysis provides valuable insights into model consistency and helps identify cases where different models might produce significantly different outputs for the same input.

import json
import boto3
import numpy as np
from scipy import stats

def read_and_organize_metrics_from_s3(bucket_name, file_key):
    s3_client = boto3.client("s3")
    metrics_dict = {}

    try:
        response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
        content = response["Body"].read().decode("utf-8")

        for line in content.strip().split("\n"):
            if line:
                data = json.loads(line)
                if "automatedEvaluationResult" in data and "scores" in data["automatedEvaluationResult"]:
                    for score in data["automatedEvaluationResult"]["scores"]:
                        metric_name = score["metricName"]
                        if "result" in score:
                            metric_value = score["result"]
                            if metric_name not in metrics_dict:
                                metrics_dict[metric_name] = []
                            metrics_dict[metric_name].append(metric_value)
        return metrics_dict

    except Exception as e:
        print(f"Error: {e}")
        return None

def get_spearmanr_correlation(scores1, scores2):
    if len(set(scores1)) == 1 or len(set(scores2)) == 1:
        return "undefined (constant scores)", "undefined"

    try:
        result = stats.spearmanr(scores1, scores2)
        return round(float(result.statistic), 4), round(float(result.pvalue), 4)
    except Exception as e:
        return f"error: {str(e)}", "undefined"

# Extract metrics
bucket_name = "<EVALUATION_OUTPUT_BUCKET>"
file_key1 = "<EVALUATION_FILE_KEY1>"
file_key2 = "<EVALUATION_FILE_KEY2>"

metrics1 = read_and_organize_metrics_from_s3(bucket_name, file_key1)
metrics2 = read_and_organize_metrics_from_s3(bucket_name, file_key2)

# Calculate correlations for common metrics
common_metrics = set(metrics1.keys()) & set(metrics2.keys())

for metric_name in common_metrics:
    scores1 = metrics1[metric_name]
    scores2 = metrics2[metric_name]

    if len(scores1) == len(scores2):
        correlation, p_value = get_spearmanr_correlation(scores1, scores2)

        print(f"\nMetric: {metric_name}")
        print(f"Number of samples: {len(scores1)}")
        print(f"Unique values in Model 1 scores: {len(set(scores1))}")
        print(f"Unique values in Model 2 scores: {len(set(scores2))}")
        print(f"Model 1 scores range: [{min(scores1)}, {max(scores1)}]")
        print(f"Model 2 scores range: [{min(scores2)}, {max(scores2)}]")
        print(f"Spearman correlation coefficient: {correlation}")
        print(f"P-value: {p_value}")
    else:
        print(f"\nMetric: {metric_name}")
        print("Error: Different number of samples between models")

Best practices for LLM-as-a-judge implementation
The following best practices will help you get the most from LLM-as-a-judge evaluations and establish standardized benchmarking when comparing different foundation models.

Create diverse test datasets that represent real-world use cases and edge cases. For large workloads (more than 1,000 prompts), use stratified sampling to maintain comprehensive coverage while managing costs and completion time (see the sampling sketch after this list). Include both simple and complex prompts to test model capabilities across different difficulty levels.
Choose evaluation metrics that align with your specific business objectives and application requirements. Balance quality metrics (correctness, completeness) with user experience metrics (helpfulness, coherence). Include safety metrics when deploying customer-facing applications.
Maintain consistent evaluation conditions when comparing different models. Use the same evaluator model across comparisons for standardized benchmarking. Document your evaluation configuration and parameters for reproducibility.
Schedule regular evaluation jobs to track model performance over time. Monitor trends across different metric categories to identify areas for improvement. Set up performance baselines and thresholds for each metric.
Optimize batch sizes based on your evaluation needs and cost constraints. Consider using smaller test sets for rapid iteration and larger sets for comprehensive evaluation. Balance evaluation frequency with resource utilization.
Maintain detailed records of evaluation jobs, including configurations and results. Track improvements and changes in model performance over time. Document any modifications made based on evaluation insights. The optional job description field can help you here.
Use evaluation results to guide model selection and optimization. Implement feedback loops to continuously improve prompt engineering. Regularly update evaluation criteria based on emerging requirements and user feedback.
Design your evaluation framework to accommodate growing workloads. Plan for increased complexity as you add more models or use cases. Consider automated workflows for regular evaluation tasks.

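As referenced in the first best practice, here is a minimal sketch of stratified sampling over a large prompt set. It assumes each record carries the optional category field described earlier; the file name and sample size are illustrative:

import json
import random
from collections import defaultdict

def stratified_sample(jsonl_path, samples_per_category=100, seed=42):
    # Group prompts by category so every stratum stays represented
    by_category = defaultdict(list)
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            by_category[record.get("category", "uncategorized")].append(record)

    # Draw up to samples_per_category records from each stratum
    rng = random.Random(seed)
    sampled = []
    for records in by_category.values():
        sampled.extend(rng.sample(records, min(samples_per_category, len(records))))
    return sampled

subset = stratified_sample("input.jsonl", samples_per_category=100)
print(f"Sampled {len(subset)} prompts across {len(set(r.get('category') for r in subset))} categories")
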
These best practices help establish a robust evaluation framework using LLM-as-a-judge on Amazon Bedrock. For deeper insights into the scientific validation of these practices, including case studies and correlation with human judgments, stay tuned for our upcoming technical deep-dive blog post.
Conclusion
LLM-as-a-judge on Amazon Bedrock Model Evaluation represents a significant advancement in automated model assessment, offering organizations a powerful tool to evaluate and optimize their AI applications systematically. This feature combines the efficiency of automated evaluation with the nuanced understanding typically associated with human assessment, enabling organizations to scale their quality assurance processes while maintaining high standards of performance and safety.
The comprehensive metric categories, flexible implementation options, and seamless integration with existing AWS services make it possible for organizations to establish robust evaluation frameworks that grow with their needs. Whether you’re developing conversational AI applications, content generation systems, or specialized enterprise solutions, LLM-as-a-judge provides the necessary tools to make sure that your models align with both technical requirements and business objectives.
We’ve provided detailed implementation guidance, from initial setup to best practices, to help you use this feature effectively. The accompanying code samples and configuration examples in this post demonstrate how to implement these evaluations in practice. Through systematic evaluation and continuous improvement, organizations can build more reliable, accurate, and trustworthy AI applications.
We encourage you to explore LLM-as-a-judge capabilities in the Amazon Bedrock console and discover how automatic evaluation can enhance your AI applications. To help you get started, we’ve prepared a Jupyter notebook with practical examples and code snippets that you can find on our GitHub repository.

About the Authors
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Building a virtual meteorologist using Amazon Bedrock Agents

The integration of generative AI capabilities is driving transformative changes across many industries. Although weather information is accessible through multiple channels, businesses that heavily rely on meteorological data require robust and scalable solutions to effectively manage and use these critical insights and reduce manual processes. This solution demonstrates how to create an AI-powered virtual meteorologist that can answer complex weather-related queries in natural language. We use various AWS services to deploy a complete solution that you can use to interact with an API providing real-time weather information. In this solution, we use Amazon Bedrock Agents.
Amazon Bedrock Agents helps streamline workflows and automate repetitive tasks. Amazon Bedrock Agents can securely connect to your company's data sources and augment the user's request with accurate responses. You can use Amazon Bedrock Agents to architect an action schema tailored to your requirements, granting you control whenever the agent initiates the specified action. This versatile approach equips you to seamlessly integrate and execute business logic within your preferred backend service, fostering a cohesive combination of functionality and flexibility. There is also memory retention across interactions, allowing a more personalized user experience.
In this post, we present a streamlined approach to deploying an AI-powered agent by combining Amazon Bedrock Agents and a foundation model (FM). We guide you through the process of configuring the agent and implementing the specific logic required for the virtual meteorologist to provide accurate weather-related responses. Additionally, we use various AWS services, including AWS Amplify for hosting the front end, AWS Lambda functions for handling request logic, Amazon Cognito for user authentication, and AWS Identity and Access Management (IAM) for controlling access to the agent.
Solution overview
The diagram gives an overview and highlights the key components. The architecture uses Amazon Cognito for user authentication and Amplify as the hosting environment for our front-end application. Amazon Bedrock Agents forwards the details from the user query to the action groups, which further invokes custom Lambda functions. Each action group and Lambda function handles a specific task:

geo-coordinates – Processes geographic coordinates (geo-coordinates) to get details about a specific location
weather – Gathers weather information for the provided location
date-time – Obtains the current date and time

Prerequisites
You must have the following in place to complete the solution in this post:

An AWS account
FM access in Amazon Bedrock for Anthropic’s Claude 3.5 Sonnet in the same AWS Region where you’ll deploy this solution
The accompanying AWS CloudFormation template downloaded from the aws-samples GitHub repo.

Deploy solution resources using AWS CloudFormation
When you run the AWS CloudFormation template, the following resources are deployed (note that costs will be incurred for the AWS resources used):

Amazon Cognito resources:

User pool – CognitoUserPoolforVirtualMeteorologistApp
App client – VirtualMeteorologistApp
Identity pool – cognito-identity-pool-vm

Lambda resources:

Function – <Stack name>-geo-coordinates-<auto-generated>
Function – <Stack name>-weather-<auto-generated>
Function – <Stack name>-date-time-<auto-generated>

Amazon Bedrock Agents: virtual-meteorologist

Action groups (1) – obtain-latitude-longitude-from-place-name
Action groups (2) – obtain-weather-information-with-coordinates
Action groups (3) – get-current-date-time-from-timezone

After you deploy the CloudFormation template, copy the following from the Outputs tab on the CloudFormation console to be used during the configuration of your application after it’s deployed in AWS Amplify.

AWSRegion
BedrockAgentAliasId
BedrockAgentId
BedrockAgentName
IdentityPoolId
UserPoolClientId
UserPoolId

Deploy the AWS Amplify application
You need to manually deploy the Amplify application using the front-end code found on GitHub. Complete the following steps:

Download the front-end code AWS-Amplify-Frontend.zip from GitHub.
Use the .zip file to manually deploy the application in Amplify.
Return to the Amplify page and use the domain it automatically generated to access the application.

Use Amazon Cognito for user authentication
Amazon Cognito is an identity service that you can use to authenticate and authorize users. We use Amazon Cognito in our solution to verify users before they can use the application. We also use an identity pool to provide temporary AWS credentials for users while they interact with the Amazon Bedrock API.
Use Amazon Bedrock Agents to automate application tasks
With Amazon Bedrock Agents, you can build and configure autonomous agents in your application. An agent helps your end users complete actions based on organization data and user input. Agents orchestrate interactions between FMs, data sources, software applications, and user conversations.
Use action group to define actions that Amazon Bedrock agents perform
An action group defines a set of related actions that an Amazon Bedrock agent can perform to assist users. When configuring an action group, you have options for handling user-provided information, including adding user input to the agent’s action group, passing data to a Lambda function for custom business logic, or returning control directly through the InvokeAgent response. In our application, we created three action groups to give the Amazon Bedrock agent these essential functionalities: retrieving coordinates for specific locations, obtaining current date and time information, and fetching weather data for given locations. These action groups enable the agent to access and process crucial information, enhancing its ability to respond accurately and comprehensively to user queries related to location-based services and weather conditions.
Use Lambda for Amazon Bedrock action group
As part of this solution, three Lambda functions are deployed to support the action groups defined for our Amazon Bedrock agent:

Location coordinates Lambda function – This function is triggered by the obtain-latitude-longitude-from-place-name action group. It takes a place name as input and returns the corresponding latitude and longitude coordinates. The function uses a geocoding service or database to perform this lookup.
Date and time Lambda function – Invoked by the get-current-date-time-from-timezone action group, this function provides the current date and time information.
Weather information Lambda function – This function is called by the obtain-weather-information-with-coordinates action group. It accepts geo-coordinates from the first Lambda function and returns current weather conditions and forecasts for the specified area. This Lambda function uses a weather API to fetch up-to-date meteorological data.

Each of these Lambda functions receives an input event containing relevant metadata and populated fields from the Amazon Bedrock agent’s API operation or function parameters. The functions process this input, perform their specific tasks, and return a response with the required information. This response is then used by the Amazon Bedrock agent to formulate its reply to the user’s query. By using these Lambda functions, our Amazon Bedrock agent gains the ability to access external data sources and perform complex computations, significantly enhancing its capabilities in handling user requests related to location, time, and weather information.
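
To make this contract concrete, here is a minimal sketch of what the date and time handler could look like. It assumes a function-details (non-OpenAPI) action group and a hypothetical timezone parameter; the exact event and response field names are assumptions to verify against the Amazon Bedrock Agents documentation for your configuration:

from datetime import datetime
from zoneinfo import ZoneInfo

def lambda_handler(event, context):
    # Parameters are assumed to arrive as a list of {"name", "type", "value"} dicts
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    timezone = params.get("timezone", "UTC")  # hypothetical parameter name

    now = datetime.now(ZoneInfo(timezone))
    body = now.strftime("%Y-%m-%d %H:%M:%S %Z")

    # Return the result in the shape assumed for a function-details action group
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "function": event.get("function"),
            "functionResponse": {
                "responseBody": {"TEXT": {"body": body}}
            },
        },
    }
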
Use AWS Amplify for front-end code
Amplify offers a development environment for building secure, scalable mobile and web applications. Developers can focus on their code rather than worrying about the underlying infrastructure. Amplify also integrates with many Git providers. For this solution, we manually upload our front-end code using the method outlined earlier in this post.
Application walkthrough
Navigate to the URL provided after you created the application in Amplify. Upon accessing the application URL, you’ll be prompted to provide information related to Amazon Cognito and Amazon Bedrock Agents. This information is required to securely authenticate users and allow the front end to interact with the Amazon Bedrock agent. It enables the application to manage user sessions and make authorized API calls to AWS services on behalf of the user.
You can enter information with the values you collected from the CloudFormation stack outputs. You’ll be required to enter the following fields, as shown in the following screenshot:

User Pool ID
User Pool ClientID
Identity Pool ID
Region
Agent Name
Agent ID
Agent Alias ID
Region

You need to sign in with your username and password. A temporary password was automatically generated during deployment and sent to the email address you provided when launching the CloudFormation template. At first sign-in attempt, you’ll be asked to reset your password, as shown in the following video.

Now you can start asking questions in the application, for example, “Can we do barbecue today in Dallas, TX?” In a few seconds, the application will provide you detailed results mentioning if you can do barbecue in Dallas, TX. The following video shows this chat.

Example use cases
Here are a few sample queries to demonstrate the capabilities of your virtual meteorologist:

“What’s the weather like in New York City today?”
“Should I plan an outdoor birthday party in Miami next weekend?”
“Will it snow in Denver on Christmas Day?”
“Can I go swimming on a beach in Chicago today?”

These queries showcase the agent’s ability to provide current weather information, offer advice based on weather forecasts, and predict future weather conditions. You can even ask a question about an activity such as swimming, and the agent will answer, based on the weather conditions, whether that activity is advisable.
Clean up
If you decide to discontinue using the virtual meteorologist, you can follow these steps to remove it, its associated resources deployed using AWS CloudFormation, and the Amplify deployment:

Delete the CloudFormation stack:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Locate the stack you created during the deployment process (you assigned a name to it).
Select the stack and choose Delete.

Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Conclusion
This solution demonstrates the power of combining Amazon Bedrock Agents with other AWS services to create an intelligent, conversational weather assistant. By using AI and cloud technologies, businesses can automate complex queries and provide valuable insights to their users.
Additional resources
To learn more about Amazon Bedrock, refer to the following resources:

GitHub repo: Amazon Bedrock Workshop
Amazon Bedrock User Guide
Workshop: Using generative AI on AWS for diverse content types

To learn more about the Anthropic’s Claude 3.5 Sonnet model, refer to the following resources:

Anthropic’s Claude in Amazon Bedrock

About the Authors
Salman Ahmed is a Senior Technical Account Manager in AWS Enterprise Support. He enjoys helping customers in the travel and hospitality industry to design, implement, and support cloud infrastructure. With a passion for networking services and years of experience, he helps customers adopt various AWS networking services. Outside of work, Salman enjoys photography, traveling, and watching his favorite sports teams.
Sergio Barraza is a Senior Enterprise Support Lead at AWS, helping energy customers design and optimize cloud solutions. With a passion for software development, he guides energy customers through AWS service adoption. Outside work, Sergio is a multi-instrument musician playing guitar, piano, and drums, and he also practices Wing Chun Kung Fu.
Ravi Kumar is a Senior Technical Account Manager in AWS Enterprise Support who helps customers in the travel and hospitality industry to streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience. In his free time, Ravi enjoys creative activities like painting. He also likes playing cricket and traveling to new places.
Ankush Goyal is an Enterprise Support Lead in AWS Enterprise Support who helps customers streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience.

Zyphra Introduces the Beta Release of Zonos: A Highly Expressive TTS M …

Text-to-speech (TTS) technology has made significant strides in recent years, but challenges remain in creating natural, expressive, and high-fidelity speech synthesis. Many TTS systems struggle to replicate the nuances of human speech, such as intonation, emotion, and accent, often resulting in artificial-sounding voices. Additionally, precise voice cloning remains difficult, limiting the ability to generate personalized or diverse speech outputs. These challenges have driven continued research into more sophisticated TTS models capable of producing real-time, expressive, and realistic speech.

Zyphra has introduced the beta release of Zonos-v0.1, featuring two real-time TTS models with high-fidelity voice cloning. The release includes a 1.6 billion-parameter transformer model and a similarly sized hybrid model, both available under the Apache 2.0 license. This open-source initiative seeks to advance TTS research by making high-quality speech synthesis technology more accessible to developers and researchers.

The Zonos-v0.1 models are trained on approximately 200,000 hours of speech data, encompassing both neutral and expressive speech patterns. While the primary dataset consists of English-language content, significant portions of Chinese, Japanese, French, Spanish, and German speech have been incorporated, allowing for multilingual support. The models generate lifelike speech from text prompts using either speaker embeddings or audio prefixes. They can perform voice cloning with as little as 5 to 30 seconds of sample speech and offer controls over parameters such as speaking rate, pitch variation, audio quality, and emotions like sadness, fear, anger, happiness, and surprise. The synthesized speech is produced at a 44 kHz sample rate, ensuring high audio fidelity.

Zonos-v0.1 includes several key features:

Zero-shot TTS with Voice Cloning: Users can generate speech by providing a short speaker sample alongside text input, making it possible to synthesize voices with minimal data.

Audio Prefix Inputs: By incorporating an audio prefix, the models can better match speaker characteristics and even reproduce specific speaking styles, such as whispering.

Multilingual Support: The system supports multiple languages, including English, Japanese, Chinese, French, and German, increasing its versatility for global applications.

Audio Quality and Emotion Control: Users can fine-tune aspects such as pitch, frequency range, and emotional tone to create more expressive and natural speech outputs.

Efficient Performance: Running at approximately twice real-time speed on an RTX 4090, the models are optimized for real-time applications.

User-friendly Interface: A Gradio-based WebUI simplifies speech generation, making it accessible to a broader range of users.

Straightforward Deployment: The models can be installed and deployed easily using a provided Docker setup, ensuring ease of integration into existing workflows.

These features make Zonos-v0.1 a flexible tool for various TTS applications, from content creation to accessibility tools.

Early evaluations suggest that Zonos-v0.1 delivers high-quality speech generation, often comparable to or exceeding leading proprietary systems. While objective audio evaluation remains complex, comparisons with other models—including proprietary solutions such as ElevenLabs and Cartesia, as well as open-source alternatives like FishSpeech-v1.5—highlight Zonos’s ability to produce clear, natural, and expressive speech. The hybrid model, in particular, offers reduced latency and lower memory usage compared to the transformer variant, benefiting from its Mamba2-based architecture, which minimizes reliance on attention mechanisms.

The beta release of Zonos-v0.1 represents an important step forward in open-source TTS development. By providing a high-fidelity, expressive, and real-time speech synthesis tool under an accessible license, Zyphra offers developers and researchers a powerful resource for advancing TTS applications. Its combination of voice cloning, multilingual support, and fine-grained audio control makes it a versatile addition to the field, with potential applications in assistive technologies, content creation, and beyond.

Check out the Technical details, GitHub Page, Zyphra/Zonos-v0.1-transformer and Zyphra/Zonos-v0.1-hybrid. All credit for this research goes to the researchers of this project.


Google DeepMind Introduces AlphaGeometry2: A Significant Upgrade to Al …

The International Mathematical Olympiad (IMO) is a globally recognized competition that challenges high school students with complex mathematical problems. Among its four categories, geometry stands out as the most consistent in structure, making it more accessible and well-suited for fundamental reasoning research. Automated geometry problem-solving has traditionally followed two primary approaches: algebraic methods, such as Wu’s method, the Area method, and Gröbner bases, and synthetic techniques, including Deduction databases and the Full angle method. The latter aligns more closely with human reasoning and is particularly valuable for broader research applications.

Previous research introduced AlphaGeometry (AG1), a neuro-symbolic system designed to solve IMO geometry problems by integrating a language model with a symbolic reasoning engine. On IMO geometry problems from 2000 to 2024, AG1 achieved a 54% solve rate, marking a significant step in automated problem-solving. However, its performance was hindered by limitations in its domain-specific language, the efficiency of its symbolic engine, and the capability of its initial language model. Despite its promising approach, these constraints prevented AG1 from improving beyond that accuracy.

AlphaGeometry2 (AG2) is a major advancement over its predecessor, surpassing the problem-solving abilities of an average IMO gold medalist. Researchers from Google DeepMind, the University of Cambridge, Georgia Tech, and Brown University expanded its domain language to handle complex geometric concepts, improving its coverage of IMO problems from 66% to 88%. AG2 integrates a Gemini-based language model, a more efficient symbolic engine, and a novel search algorithm with knowledge sharing. These enhancements boost its solving rate to 84% on IMO geometry problems from 2000-2024. Additionally, AG2 advances toward a fully automated system that interprets problems from natural language.

AG2 expands the AG1 domain language by introducing additional predicates to address limitations in expressing linear equations, movement, and common geometric problems. It enhances coverage from 66% to 88% of IMO geometry problems (2000–2024). AG2 supports new problem types, such as locus problems, and improves diagram formalization by allowing points to be defined using multiple predicates. Automated formalization, aided by foundation models, translates natural language problems into AG syntax. Diagram generation employs a two-stage optimization method for non-constructive problems. AG2 also strengthens its symbolic engine, DDAR, for faster and more efficient deduction closure, enhancing proof search capabilities.

AlphaGeometry2 achieves a high solve rate on IMO geometry problems from 2000–2024, solving 42 out of 50 in the IMO-AG-50 benchmark, surpassing an average gold medalist. It also solves all 30 hardest formalizable IMO shortlist problems. Performance improves rapidly, solving 27 problems after 250 training steps. Ablation studies reveal optimal inference settings. Some problems remain unsolved due to unformalizable conditions or a lack of advanced geometry techniques in DDAR. Experts find its solutions highly creative. Despite limitations, AlphaGeometry2 outperforms AG1 and other systems, demonstrating state-of-the-art capabilities in automated problem-solving.

In conclusion, AlphaGeometry2 significantly improves upon its predecessor by incorporating a more advanced language model, an enhanced symbolic engine, and a novel proof search algorithm. It achieves an 84% solve rate on 2000–2024 IMO geometry problems, surpassing the previous 54%. Studies reveal that language models can generate full proofs without external tools, and different training approaches yield complementary skills. Challenges remain, including limitations in handling inequalities and variable points. Future work will focus on subproblem decomposition, reinforcement learning, and refining auto-formalization for more reliable solutions. Continued improvements aim to create a fully automated system for solving geometry problems efficiently.

Check out the Paper. All credit for this research goes to the researchers of this project.


Efficient Alignment of Large Language Models Using Token-Level Reward …

Large language models (LLMs) must align with human preferences like helpfulness and harmlessness, but traditional alignment methods require costly retraining and struggle with dynamic or conflicting preferences. Test-time alignment approaches using reward models (RMs) avoid retraining but face inefficiencies due to reliance on trajectory-level rewards, which evaluate full responses rather than guiding token-by-token generation.  

Existing alignment techniques fall into two categories: training-time methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), which fine-tune LLMs on preference datasets but demand significant computational resources and lack flexibility for new preferences. Test-time methods use RMs to guide frozen LLMs but rely on trajectory-level RMs that assign a single reward to complete responses. This creates a mismatch during autoregressive generation, where next-token decisions require partial response evaluations. For instance, ARGS approximates token-level rewards by applying trajectory RMs to incomplete responses, leading to inaccuracies since these RMs are trained only on full responses. Other methods like Transfer-Q generate multiple full responses per token candidate, multiplying inference costs. These inefficiencies limit scalability and real-time adaptability.  

Reference: https://arxiv.org/pdf/2410.08193

To address these issues, researchers from the University of Maryland, College Park and JPMorgan AI Research propose GenARM (Reward Guided Generation with Autoregressive Reward Model), a test-time alignment framework combining a novel autoregressive RM with guided decoding. The key innovation is the Autoregressive Reward Model, which decomposes trajectory-level rewards into token-level components. Instead of assigning a single reward to a full response, it predicts the reward for each token conditioned on prior tokens, enabling dense, step-by-step guidance, allowing rewards to directly influence each token choice without evaluating partial responses inaccurately.  

During generation, GenARM integrates the autoregressive RM’s token-level rewards with the base LLM’s logits. The next token is sampled from a modified distribution. Unlike prior methods, this requires only one forward pass through the base and reward models per token, avoiding costly candidate expansions.  
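To make the guided decoding step concrete, the following is a minimal sketch of how token-level reward logits could be combined with the base model's logits at each step; the function name and the scaling hyperparameter beta are illustrative assumptions, not the authors' exact implementation.

import torch

def guided_next_token(base_logits: torch.Tensor,
                      reward_logits: torch.Tensor,
                      beta: float = 1.0) -> int:
    """Sample the next token from base-model logits shifted by token-level rewards."""
    # The autoregressive RM's token-level rewards act as an additive bias on the base
    # model's logits, so one forward pass of each model per step is sufficient.
    combined = base_logits + reward_logits / beta
    probs = torch.softmax(combined, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()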

Experiments demonstrate GenARM’s advantages across three scenarios:  

1. General Human Preference Alignment: On the HH-RLHF dataset, GenARM outperforms test-time baselines like ARGS and Transfer-Q in helpfulness and harmlessness, matching the performance of training-time methods like DPO based on evaluations using GPT-4.

2. Weak-to-Strong Guidance: A 7B autoregressive RM effectively guides larger base models (13B, 70B) without fine-tuning them. It surpasses DPO at the 7B scale and nearly matches DPO at the 13B scale. At the 70B scale, GenARM recovers more than 70% of the performance gap in both raw and LC win rates between Tulu2-70B and Tulu2-DPO-70B, all without the need to train the 70B LLM, demonstrating that smaller RMs can steer larger LLMs efficiently.  

3. Multi-Objective Alignment: GenARM balances conflicting preferences (e.g., helpfulness vs. harmlessness) by combining rewards from multiple autoregressive RMs. On the PKU-SafeRLHF-10K dataset, it achieves a Pareto frontier superior to Rewarded Soups and matches multi-objective RL without retraining.

The autoregressive RM’s design ensures it can express any reward function achievable by traditional RMs within the KL-regularized reinforcement learning framework. This theoretical guarantee, combined with token-level factorization, makes GenARM both expressive and efficient. Unlike trajectory-level RMs, which struggle with partial contexts, autoregressive RMs provide accurate, incremental feedback, preventing reward hacking or incoherent outputs during long generations.  

In summary, GenARM bridges the gap between training-time and test-time alignment by introducing autoregressive reward models that enable precise, token-level guidance. It eliminates the need for costly LLM retraining, supports dynamic adaptation to diverse preferences, and efficiently scales to larger models. By addressing the inefficiencies of trajectory-level rewards and enabling weak-to-strong guidance, GenARM offers a practical solution for aligning LLMs in resource-constrained scenarios. Future work could extend this approach to tasks like mathematical reasoning or code generation, where token-level rewards might enhance performance without additional fine-tuning.  

Check out the Paper. All credit for this research goes to the researchers of this project.


Transforming credit decisions using generative AI with Rich Data Co an …

This post is co-written with Gordon Campbell, Charles Guan, and Hendra Suryanto from RDC. 
The mission of Rich Data Co (RDC) is to broaden access to sustainable credit globally. Its software-as-a-service (SaaS) solution empowers leading banks and lenders with deep customer insights and AI-driven decision-making capabilities.
Making credit decisions using AI can be challenging, requiring data science and portfolio teams to synthesize complex subject matter information and collaborate productively. To solve this challenge, RDC used generative AI, enabling teams to use its solution more effectively:

Data science assistant – Designed for data science teams, this agent assists teams in developing, building, and deploying AI models within a regulated environment. It aims to boost team efficiency by answering complex technical queries across the machine learning operations (MLOps) lifecycle, drawing from a comprehensive knowledge base that includes environment documentation, AI and data science expertise, and Python code generation.
Portfolio assistant – Designed for portfolio managers and analysts, this agent facilitates natural language inquiries about loan portfolios. It provides critical insights on performance, risk exposures, and credit policy alignment, enabling informed commercial decisions without requiring in-depth analysis skills. The assistant handles both high-level questions (such as identifying high-risk segments or potential growth opportunities) and one-time ad hoc queries, supporting decisions such as diversifying the portfolio.

In this post, we discuss how RDC uses generative AI on Amazon Bedrock to build these assistants and accelerate its overall mission of democratizing access to sustainable credit.
Solution overview: Building a multi-agent generative AI solution
We began with a carefully crafted evaluation set of over 200 prompts, anticipating common user questions. Our initial approach combined prompt engineering and traditional Retrieval Augmented Generation (RAG). However, we encountered a challenge: accuracy fell below 90%, especially for more complex questions.
To overcome the challenge, we adopted an agentic approach, breaking down the problem into specialized use cases. This strategy equipped us to align each task with the most suitable foundation model (FM) and tools. Our multi-agent framework is orchestrated using LangGraph, and it consists of:

Orchestrator – The orchestrator is responsible for routing user questions to the appropriate agent. In this example, we start with the data science or portfolio agent. However, we envision many more agents in the future. The orchestrator can also use user context, such as the user’s role, to determine routing to the appropriate agent.
Agent – The agent is designed for a specialized task. It’s equipped with the appropriate FM for the task and the necessary tools to perform actions and access knowledge. It can also handle multiturn conversations and orchestrate multiple calls to the FM to reach a solution.
Tools – Tools extend agent capabilities beyond the FM. They provide access to external data and APIs or enable specific actions and computation. To efficiently use the model’s context window, we construct a tool selector that retrieves only the relevant tools based on the information in the agent state. This helps simplify debugging in the case of errors, ultimately making the agent more effective and cost-efficient.

This approach gives us the right tool for the right job. It enhances our ability to handle complex queries efficiently and accurately while providing flexibility for future improvements and agents.
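To illustrate the orchestration pattern described above, here is a hypothetical minimal sketch of an orchestrator routing between the two agents using LangGraph. The node names, state schema, and keyword-based routing heuristic are illustrative assumptions, not RDC's production implementation, which uses an FM and user context for routing.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    answer: str

def orchestrator(state: AgentState) -> AgentState:
    # In production, this step would call an FM and consider user context (such as role)
    return state

def route(state: AgentState) -> str:
    # Toy routing heuristic standing in for FM-based intent classification
    return "data_science" if "model" in state["question"].lower() else "portfolio"

def data_science_agent(state: AgentState) -> AgentState:
    return {**state, "answer": "Response from the data science assistant"}

def portfolio_agent(state: AgentState) -> AgentState:
    return {**state, "answer": "Response from the portfolio assistant"}

graph = StateGraph(AgentState)
graph.add_node("orchestrator", orchestrator)
graph.add_node("data_science", data_science_agent)
graph.add_node("portfolio", portfolio_agent)
graph.set_entry_point("orchestrator")
graph.add_conditional_edges("orchestrator", route, {"data_science": "data_science", "portfolio": "portfolio"})
graph.add_edge("data_science", END)
graph.add_edge("portfolio", END)
app = graph.compile()

result = app.invoke({"question": "How do I validate this credit risk model?", "answer": ""})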
The following image is a high-level architecture diagram of the solution.

Data science agent: RAG and code generation
To boost productivity of data science teams, we focused on rapid comprehension of advanced knowledge, including industry-specific models from a curated knowledge base. Here, RDC provides an integrated development environment (IDE) for Python coding, catering to various team roles. One role is model validator, who rigorously assesses whether a model aligns with bank or lender policies. To support the assessment process, we designed an agent with two tools:

Content retriever tool – Amazon Bedrock Knowledge Bases powers our intelligent content retrieval through a streamlined RAG implementation. The service automatically converts text documents to their vector representation using Amazon Titan Text Embeddings and stores them in Amazon OpenSearch Serverless. Because the knowledge base is vast, it performs semantic chunking, making sure that the knowledge is organized by topic and fits within the FM’s context window. When users interact with the agent, Amazon Bedrock Knowledge Bases using OpenSearch Serverless provides fast, in-memory semantic search, enabling the agent to retrieve the most relevant chunks of knowledge and ground its responses in context (a minimal retrieval sketch follows this list).
Code generator tool – With code generation, we selected Anthropic’s Claude model on Amazon Bedrock due to its inherent ability to understand and generate code. This tool is grounded to answer queries related to data science and can generate Python code for quick implementation. It’s also adept at troubleshooting coding errors.
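The following is a hedged sketch of that retrieval call using the boto3 bedrock-agent-runtime client; the knowledge base ID, query text, and result count are placeholders rather than RDC's configuration.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Retrieve the most relevant chunks for a model-validation question
response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="<knowledge-base-id>",  # placeholder
    retrievalQuery={"text": "Which checks does our model validation policy require?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

# Each result contains the chunk text plus source metadata for grounding the response
chunks = [result["content"]["text"] for result in response["retrievalResults"]]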

Portfolio agent: Text-to-SQL and self-correction
To boost the productivity of credit portfolio teams, we focused on two key areas. For portfolio managers, we prioritized high-level commercial insights. For analysts, we enabled deep-dive data exploration. This approach empowered both roles with rapid understanding and actionable insights, streamlining decision-making processes across teams.
Our solution required natural language understanding of structured portfolio data stored in Amazon Aurora. This led us to base our solution on a text-to-SQL model to efficiently bridge the gap between natural language and SQL.
To reduce errors and tackle complex queries beyond the model’s capabilities, we developed three tools using Anthropic’s Claude model on Amazon Bedrock for self-correction:

Check query tool – Verifies and corrects SQL queries, addressing common issues such as data type mismatches or incorrect function usage
Check result tool – Validates query results, providing relevance and prompting retries or user clarification when needed
Retry from user tool – Engages users for additional information when queries are too broad or lack detail, guiding the interaction based on database information and user input

These tools operate in an agentic system, enabling accurate database interactions and improved query results through iterative refinement and user engagement.
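As a rough illustration of that iterative refinement loop, the sketch below wires the three tools together around a text-to-SQL step. Every helper here is a hypothetical toy stand-in for the Claude-backed tools and the Aurora query layer, not RDC's implementation.

from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    feedback: str = ""

def generate_sql(question: str, feedback: str = "") -> str:
    # Stand-in for the text-to-SQL model
    return "SELECT segment, default_rate FROM portfolio ORDER BY default_rate DESC LIMIT 5"

def check_query(sql: str) -> str:
    return sql  # would correct data type mismatches or incorrect function usage

def run_query(sql: str) -> list:
    return [("retail", 0.041), ("construction", 0.038)]  # would execute against Amazon Aurora

def check_result(question: str, rows: list) -> Verdict:
    return Verdict(ok=bool(rows), feedback="" if rows else "The query returned no rows")

def answer_portfolio_question(question: str, max_attempts: int = 3) -> str:
    sql = generate_sql(question)
    for _ in range(max_attempts):
        sql = check_query(sql)                  # check query tool
        rows = run_query(sql)
        verdict = check_result(question, rows)  # check result tool
        if verdict.ok:
            return f"Results for '{question}': {rows}"
        sql = generate_sql(question, feedback=verdict.feedback)
    return "Could you add more detail to your question?"  # retry from user tool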
To improve accuracy, we tested model fine-tuning, training the model on common queries and context (such as database schemas and their definitions). This approach reduces inference costs and improves response times compared to prompting at each call. Using Amazon SageMaker JumpStart, we fine-tuned Meta’s Llama model by providing a set of anticipated prompts, intended answers, and associated context. Amazon SageMaker JumpStart offers a cost-effective alternative to third-party models, providing a viable pathway for future applications. However, we didn’t end up deploying the fine-tuned model because we experimentally observed that prompting with Anthropic’s Claude model provided better generalization, especially for complex questions. To reduce operational overhead, we will also evaluate structured data retrieval on Amazon Bedrock Knowledge Bases.
Conclusion and next steps with RDC
To expedite development, RDC collaborated with AWS Startups and the AWS Generative AI Innovation Center. Through an iterative approach, RDC rapidly enhanced its generative AI capabilities, deploying the initial version to production in just 3 months. The solution successfully met the stringent security standards required in regulated banking environments, providing both innovation and compliance.

“The integration of generative AI into our solution marks a pivotal moment in our mission to revolutionize credit decision-making. By empowering both data scientists and portfolio managers with AI assistants, we’re not just improving efficiency—we’re transforming how financial institutions approach lending.”
–Gordon Campbell, Co-Founder & Chief Customer Officer at RDC

RDC envisions generative AI playing a significant role in boosting the productivity of the banking and credit industry. By using this technology, RDC can provide key insights to customers, improve solution adoption, accelerate the model lifecycle, and reduce the customer support burden. Looking ahead, RDC plans to further refine and expand its AI capabilities, exploring new use cases and integrations as the industry evolves.
For more information about how to work with RDC and AWS and to understand how we’re supporting banking customers around the world to use AI in credit decisions, contact your AWS Account Manager or visit Rich Data Co.
For more information about generative AI on AWS, refer to the following resources:

Learn about Amazon Bedrock and Amazon Bedrock Knowledge Bases
Build a knowledge base by connecting to a structured data store
Enabling complex generative AI applications with Amazon Bedrock Agents
Learn about Amazon SageMaker JumpStart
Fine-tune Meta Llama 3 for text generation on Amazon SageMaker JumpStart

About the Authors
Daniel Wirjo is a Solutions Architect at AWS, focused on FinTech and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.
Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.
Iman Abbasnejad is a computer scientist at the Generative AI Innovation Center at Amazon Web Services (AWS) working on generative AI and complex multi-agent systems.
Gordon Campbell is the Chief Customer Officer and Co-Founder of RDC, where he leverages over 30 years in enterprise software to drive RDC’s leading AI Decisioning platform for business and commercial lenders. With a proven track record in product strategy and development across three global software firms, Gordon is committed to customer success, advocacy, and advancing financial inclusion through data and AI.
Charles Guan is the Chief Technology Officer and Co-founder of RDC. With more than 20 years of experience in data analytics and enterprise applications, he has driven technological innovation across both the public and private sectors. At RDC, Charles leads research, development, and product advancement—collaborating with universities to leverage advanced analytics and AI. He is dedicated to promoting financial inclusion and delivering positive community impact worldwide.
Hendra Suryanto is the Chief Data Scientist at RDC with more than 20 years of experience in data science, big data, and business intelligence. Before joining RDC, he served as a Lead Data Scientist at KPMG, advising clients globally. At RDC, Hendra designs end-to-end analytics solutions within an Agile DevOps framework. He holds a PhD in Artificial Intelligence and has completed postdoctoral research in machine learning.

Build agentic AI solutions with DeepSeek-R1, CrewAI, and Amazon SageMa …

AI agents are rapidly becoming the next frontier in enterprise transformation, with 82% of organizations planning adoption within the next 3 years. According to a Capgemini survey of 1,100 executives at large enterprises, 10% of organizations already use AI agents, and more than half plan to use them in the next year. The recent release of the DeepSeek-R1 models brings state-of-the-art reasoning capabilities to the open source community. Organizations can build agentic applications using these reasoning models to execute complex tasks with advanced decision-making capabilities, enhancing efficiency and adaptability.
In this post, we dive into how organizations can use Amazon SageMaker AI, a fully managed service that allows you to build, train, and deploy ML models at scale, to build AI agents using CrewAI, a popular agentic framework, together with open source models like DeepSeek-R1.
Agentic design vs. traditional software design
Agentic systems offer a fundamentally different approach compared to traditional software, particularly in their ability to handle complex, dynamic, and domain-specific challenges. Unlike traditional systems, which rely on rule-based automation and structured data, agentic systems, powered by large language models (LLMs), can operate autonomously, learn from their environment, and make nuanced, context-aware decisions. This is achieved through modular components including reasoning, memory, cognitive skills, and tools, which enable them to perform intricate tasks and adapt to changing scenarios.
Traditional software platforms, though effective for routine tasks and horizontal scaling, often lack the domain-specific intelligence and flexibility that agentic systems provide. For example, in a manufacturing setting, traditional systems might track inventory but lack the ability to anticipate supply chain disruptions or optimize procurement using real-time market insights. In contrast, an agentic system can process live data such as inventory fluctuations, customer preferences, and environmental factors to proactively adjust strategies and reroute supply chains during disruptions.
Enterprises should strategically consider deploying agentic systems in scenarios where adaptability and domain-specific expertise are critical. For instance, consider customer service. Traditional chatbots are limited to preprogrammed responses to expected customer queries, but AI agents can engage with customers using natural language, offer personalized assistance, and resolve queries more efficiently. AI agents can significantly improve productivity by automating repetitive tasks, such as generating reports, emails, and software code. The deployment of agentic systems should focus on well-defined processes with clear success metrics and where there is potential for greater flexibility and less brittleness in process management.
DeepSeek-R1
In this post, we show you how to deploy DeepSeek-R1 on SageMaker, particularly the Llama-70b distilled variant DeepSeek-R1-Distill-Llama-70B to a SageMaker real-time endpoint. DeepSeek-R1 is an advanced LLM developed by the AI startup DeepSeek. It employs reinforcement learning techniques to enhance its reasoning capabilities, enabling it to perform complex tasks such as mathematical problem-solving and coding. To learn more about DeepSeek-R1, refer to DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart and deep dive into the thesis behind building DeepSeek-R1.
Generative AI on SageMaker AI
SageMaker AI, a fully managed service, provides a comprehensive suite of tools designed to deliver high-performance, cost-efficient machine learning (ML) and generative AI solutions for diverse use cases. SageMaker AI empowers you to build, train, deploy, monitor, and govern ML and generative AI models through an extensive range of services, including notebooks, jobs, hosting, experiment tracking, a curated model hub, and MLOps features, all within a unified integrated development environment (IDE).
SageMaker AI simplifies the process for generative AI model builders of all skill levels to work with foundation models (FMs):

Amazon SageMaker Canvas enables data scientists to seamlessly use their own datasets alongside FMs to create applications and architectural patterns, such as chatbots and Retrieval Augmented Generation (RAG), in a low-code or no-code environment.
Amazon SageMaker JumpStart offers a diverse selection of open and proprietary FMs from providers like Hugging Face, Meta, and Stability AI. You can deploy or fine-tune models through an intuitive UI or APIs, providing flexibility for all skill levels.
SageMaker AI features like notebooks, Amazon SageMaker Training, inference, Amazon SageMaker for MLOps, and Partner AI Apps enable advanced model builders to adapt FMs using LoRA, full fine-tuning, or training from scratch. These services support single GPU to HyperPods (cluster of GPUs) for training and include built-in FMOps tools for tracking, debugging, and deployment.

With SageMaker AI, you can build generative AI-powered agentic workflows using a framework of your choice. Some of the key benefits of using SageMaker AI for fine-tuning and hosting LLMs or FMs include:

Ease of deployment – SageMaker AI offers access to SageMaker JumpStart, a curated model hub where models with open weights are made available for seamless deployment through a few clicks or API calls. Additionally, for Hugging Face Hub models, SageMaker AI provides pre-optimized containers built on popular open source hosting frameworks such as vLLM, NVIDIA Triton, and Hugging Face Text Generation Inference (TGI). You simply need to specify the model ID, and the model can be deployed quickly.
Instance-based deterministic pricing – SageMaker AI hosted models are billed based on instance-hours rather than token usage. This pricing model enables you to more accurately predict and manage generative AI inference costs while scaling resources to accommodate incoming request loads.
Deployments with quantization – SageMaker AI enables you to optimize models prior to deployment using advanced strategies such as quantized deployments (for example, AWQ, GPTQ, float16, int8, or int4). This flexibility allows you to efficiently deploy large models, such as a 32-billion parameter model, onto smaller instance types like ml.g5.2xlarge with 24 GB of GPU memory, significantly reducing resource requirements while maintaining performance.
Inference load balancing and optimized routing – SageMaker endpoints support load balancing and optimized routing with various strategies, providing users with enhanced flexibility and adaptability to accommodate diverse use cases effectively.
SageMaker fine-tuning recipes – SageMaker offers ready-to-use recipes for quickly training and fine-tuning publicly available FMs such as Meta’s Llama 3, Mistral, and Mixtral. These recipes use Amazon SageMaker HyperPod (a SageMaker AI service that provides resilient, self-healing clusters optimized for large-scale ML workloads), enabling efficient and resilient training on a GPU cluster for scalable and robust performance.

Solution overview
CrewAI provides a robust framework for developing multi-agent systems that integrate with AWS services, particularly SageMaker AI. CrewAI’s role-based agent architecture and comprehensive performance monitoring capabilities work in tandem with Amazon CloudWatch.
The framework excels in workflow orchestration and maintains enterprise-grade security standards aligned with AWS best practices, making it an effective solution for organizations implementing sophisticated agent-based systems within their AWS infrastructure.
In this post, we demonstrate how to use CrewAI to create a multi-agent research workflow. This workflow creates two agents: a research agent that researches a topic on the internet, and a writer agent that takes this research and acts like an editor, formatting it into a readable format. Additionally, we guide you through deploying and integrating one or multiple LLMs into structured workflows, using tools for automated actions, and deploying these workflows on SageMaker AI for a production-ready deployment.
The following diagram illustrates the solution architecture.

Prerequisites
To follow along with the code examples in the rest of this post, make sure the following prerequisites are met:

Integrated development environment – This includes the following:

(Optional) Access to Amazon SageMaker Studio and the JupyterLab IDE – We will use a Python runtime environment to build agentic workflows and deploy LLMs. Having access to a JupyterLab IDE with Python 3.9, 3.10, or 3.11 runtimes is recommended. You can also set up Amazon SageMaker Studio for single users. For more details, see Use quick setup for Amazon SageMaker AI. Create a new SageMaker JupyterLab Space for a quick JupyterLab notebook for experimentation. To learn more, refer to Boost productivity on Amazon SageMaker Studio: Introducing JupyterLab Spaces and generative AI tools.
Local IDE – You can also follow along in your local IDE (such as PyCharm or VS Code), provided that a supported Python runtime is configured and your environment has network connectivity to AWS (for example, through your VPC) to deploy models on SageMaker AI.

Permission to deploy models – Make sure that your user execution role has the necessary permissions to deploy models to a SageMaker real-time endpoint for inference. For more information, refer to Deploy models for inference.
Access to Hugging Face Hub – You must have access to Hugging Face Hub’s deepseek-ai/DeepSeek-R1-Distill-Llama-8B model weights from your environment.
Access to code – The code used in this post is available in the following GitHub repo.

Simplified LLM hosting on SageMaker AI
Before orchestrating agentic workflows with CrewAI powered by an LLM, the first step is to host and query an LLM using SageMaker real-time inference endpoints. There are two primary methods to host LLMs on SageMaker AI:

Deploy from SageMaker JumpStart
Deploy from Hugging Face Hub

Deploy DeepSeek from SageMaker JumpStart
SageMaker JumpStart offers access to a diverse array of state-of-the-art FMs for a wide range of tasks, including content writing, code generation, question answering, copywriting, summarization, classification, information retrieval, and more. It simplifies the onboarding and maintenance of publicly available FMs, allowing you to access, customize, and seamlessly integrate them into your ML workflows. Additionally, SageMaker JumpStart provides solution templates that configure infrastructure for common use cases, along with executable example notebooks to streamline ML development with SageMaker AI.
The following screenshot shows an example of available models on SageMaker JumpStart.

To get started, complete the following steps:

Install the latest version of the sagemaker-python-sdk using pip.
Run the following command in a Jupyter cell or the SageMaker Studio terminal:

pip install -U sagemaker

List all available LLMs under the Hugging Face, Meta, or DeepSeek JumpStart hubs. The following code is an example of how to do this programmatically using the SageMaker Python SDK:

from sagemaker.jumpstart.filters import And, Or
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Generate a conditional filter to only select LLMs from Hugging Face, Meta, or DeepSeek
filter_value = Or(
    And("task == llm", "framework == huggingface"),
    "framework == meta",
    "framework == deepseek",
)

# Retrieve all available JumpStart models that match the filter
all_models = list_jumpstart_models(filter=filter_value)

For example, deploying the deepseek-llm-r1 model directly from SageMaker JumpStart requires only a few lines of code:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "deepseek-llm-r1"
model_version = "*"

# Instantiate a new JumpStart model
model = JumpStartModel(
    model_id=model_id,
    model_version=model_version,
)

# Deploy the model on a 1 x p5e instance
predictor = model.deploy(
    accept_eula=True,
    initial_instance_count=1,
    # endpoint_name="deepseek-r1-endpoint"  # optional endpoint name
)

We recommend deploying your SageMaker endpoints within a VPC and a private subnet with no egress, making sure that the models remain accessible only within your VPC for enhanced security.
We also recommend you integrate with Amazon Bedrock Guardrails for increased safeguards against harmful content. For more details on how to implement Amazon Bedrock Guardrails on a self-hosted LLM, see Implement model-independent safety measures with Amazon Bedrock Guardrails.
Deploy DeepSeek from Hugging Face Hub
Alternatively, you can deploy your preferred model directly from the Hugging Face Hub or the Hugging Face Open LLM Leaderboard to a SageMaker endpoint. Hugging Face LLMs can be hosted on SageMaker using a variety of supported frameworks, such as NVIDIA Triton, vLLM, and Hugging Face TGI. For a comprehensive list of supported deep learning container images, refer to the available Amazon SageMaker Deep Learning Containers. In this post, we use a DeepSeek-R1-Distill-Llama-70B SageMaker endpoint using the TGI container for agentic AI inference. We deploy the model from Hugging Face Hub using Amazon’s optimized TGI container, which provides enhanced performance for LLMs. This container is specifically optimized for text generation tasks and automatically selects the most performant parameters for the given hardware configuration. To deploy from Hugging Face Hub, refer to the GitHub repo or the following code snippet:

import json
import os
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Session, role, and example values (adjust to your environment)
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # inside SageMaker; otherwise supply an IAM role ARN
number_of_gpu = 8  # ml.p4d.24xlarge provides 8 GPUs
HUGGING_FACE_HUB_TOKEN = os.environ["HF_TOKEN"]  # required for gated model weights
custom_endpoint_name = "deepseek-r1-dist-v3-llama70b-2025-01-22"

# Model configuration
hub = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # or Llama-3.3-70B-Instruct
    "SM_NUM_GPUS": json.dumps(number_of_gpu),
    "HF_TOKEN": HUGGING_FACE_HUB_TOKEN,
    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",  # set to INFO level
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",  # use expandable CUDA memory segments
}

# Create and deploy the model using the Hugging Face TGI container
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.3.1"),
    env=hub,
    role=role,
    sagemaker_session=sagemaker_session,
)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    endpoint_name=custom_endpoint_name,
    container_startup_health_check_timeout=900,
)

A new DeepSeek-R1-Distill-Llama-70B endpoint should be InService in under 10 minutes. If you want to change the model from DeepSeek to another model from the hub, simply replace the following parameter or refer to the DeepSeek deploy example in the following GitHub repo. To learn more about deployment parameters that can be reconfigured inside TGI containers at runtime, refer to the following GitHub repo on TGI arguments.


"HF_MODEL_ID": "deepseek-ai/…",  # replace with any HF Hub model ID
# "HF_TOKEN": "hf_…"  # add your token ID for gated models

For open-weight models deployed directly from hubs, we strongly recommend placing your SageMaker endpoints within a VPC and a private subnet with no egress, making sure that the models remain accessible only within your VPC for a secure deployment.
Build a simple agent with CrewAI
CrewAI offers the ability to create multi-agent and very complex agentic orchestrations using LLMs from several LLM providers, including SageMaker AI and Amazon Bedrock. In the following steps, we create a simple blocks counting agent to serve as an example.
Create a blocks counting agent
The following code sets up a simple blocks counter workflow using CrewAI with two main components:

Agent creation (blocks_counter_agent) – The agent is configured with a specific role, goal, and capabilities. This agent is equipped with a tool called BlocksCounterTool.
Task definition (count_task) – This is a task that we want this agent to execute. The task includes a template for counting how many blocks of each color are present, where {color} will be replaced with the actual color of the block. The task is assigned to blocks_counter_agent.

from crewai import Agent, Task
from pydantic import BaseModel, Field

# 1. Configure the agent
blocks_counter_agent = Agent(
    role="Blocks Inventory Manager",
    goal="Maintain accurate block counts",
    tools=[BlocksCounterTool],
    verbose=True,
)

# 2. Create the counting task
count_task = Task(
    description="Count {color} play blocks in storage",
    expected_output="Exact inventory count for specified color",
    agent=blocks_counter_agent,
)

As you can see in the preceding code, each agent begins with two essential components: an agent definition that establishes the agent’s core characteristics (including its role, goal, backstory, available tools, LLM model endpoint, and so on), and a task definition that specifies what the agent needs to accomplish, including the detailed description of work, expected outputs, and the tools it can use during execution.
This structured approach makes sure that agents have both a clear identity and purpose (through the agent definition) and a well-defined scope of work (through the task definition), enabling them to operate effectively within their designated responsibilities.
Tools for agentic AI
Tools are special functions that give AI agents the ability to perform specific actions, like searching the internet or analyzing data. Think of them as apps on a smartphone—each tool serves a specific purpose and extends what the agent can do. In our example, BlocksCounterTool helps the agent count the number of blocks organized by color.
Tools are essential because they let agents do real-world tasks instead of just thinking about them. Without tools, agents would be like smart speakers that can only talk—they could process information but couldn’t take actual actions. By adding tools, we transform agents from simple chat programs into practical assistants that can accomplish real tasks.
Out-of-the-box tools with CrewAI
CrewAI offers a range of tools out of the box for you to use along with your agents and tasks. The following list shows some of the available tools.

FileReadTool (Data Processing Tools) – For reading various file formats
WebsiteSearchTool (Web Interaction Tools) – For web content extraction
YoutubeChannelSearchTool (Media Tools) – For searching YouTube channels
PDFSearchTool (Document Processing) – For searching PDF documents
CodeInterpreterTool (Development Tools) – For Python code interpretation
DALL-E Tool (AI Services) – For image generation

Build custom tools with CrewAI
You can build custom tools in CrewAI in two ways: by subclassing BaseTool or by using the @tool decorator. Let’s first look at the BaseTool subclassing option to create the BlocksCounterTool we used earlier; a sketch of the decorator form follows the example.

from crewai.tools import BaseTool

class BlocksCounterTool(BaseTool):
    name: str = "blocks_counter"
    description: str = "Simple tool to count play blocks"

    def _run(self, color: str) -> str:
        return f"There are 10 {color} play blocks available"
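For completeness, here is a hedged sketch of the decorator form mentioned above. Depending on your installed version, the decorator is exposed from crewai.tools (or from the separate crewai_tools package), and the function docstring becomes the tool description.

from crewai.tools import tool

@tool("blocks_counter")
def blocks_counter(color: str) -> str:
    """Simple tool to count play blocks of a given color."""
    return f"There are 10 {color} play blocks available"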

Build a multi-agent workflow with CrewAI, DeepSeek-R1, and SageMaker AI
Multi-agent AI systems represent a powerful approach to complex problem-solving, where specialized AI agents work together under coordinated supervision. By combining CrewAI’s workflow orchestration capabilities with SageMaker AI based LLMs, developers can create sophisticated systems where multiple agents collaborate efficiently toward a specific goal. The code used in this post is available in the following GitHub repo.
Let’s build a research agent and writer agent that work together to create a PDF about a topic. We will use a DeepSeek-R1 Distilled Llama 3.3 70B model as a SageMaker endpoint for the LLM inference.
Define your own DeepSeek SageMaker LLM (using LLM base class)
The following code integrates SageMaker hosted LLMs with CrewAI by creating a custom inference tool that formats prompts with system instructions for factual responses, uses Boto3 (the AWS SDK for Python) to call SageMaker endpoints, and processes responses by separating reasoning (before </think>) from final answers. This enables CrewAI agents to use deployed models while maintaining structured output patterns.

import json
from typing import Dict, List, Union

import boto3
from crewai import LLM

# Calls the SageMaker endpoint for DeepSeek inference
def deepseek_llama_inference(prompt: dict, endpoint_name: str, region: str = "us-east-2") -> dict:
    try:
        # ... invoke the endpoint and parse the response (see the GitHub repo) ...
        ...
    except Exception as e:
        raise RuntimeError(f"Error while calling SageMaker endpoint: {e}")

# CrewAI-compatible LLM implementation for DeepSeek models on SageMaker
class DeepSeekSageMakerLLM(LLM):
    def __init__(self, endpoint: str):
        # <... initialize the LLM with the SageMaker endpoint ...>
        ...

    def call(self, prompt: Union[List[Dict[str, str]], str], **kwargs) -> str:
        # <... format the prompt, run inference, and return the final response ...>
        ...
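For reference, a minimal sketch of how the elided inference call above could invoke the TGI-hosted endpoint with Boto3 is shown below; the payload follows the standard TGI schema, and the generation parameters are only example values.

import json
import boto3

def invoke_tgi_endpoint(prompt: str, endpoint_name: str, region: str = "us-east-2") -> str:
    runtime = boto3.client("sagemaker-runtime", region_name=region)
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 1024, "temperature": 0.6}}
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    body = json.loads(response["Body"].read())
    # The TGI container returns a list with a single dict containing "generated_text"
    return body[0]["generated_text"]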

Name the DeepSeek-R1 Distilled endpoint
Set the endpoint name as defined earlier when you deployed DeepSeek from the Hugging Face Hub:

deepseek_endpoint = "deepseek-r1-dist-v3-llama70b-2025-01-22"

Create a DeepSeek inference tool
Just like how we created the BlocksCounterTool earlier, let’s create a tool that uses the DeepSeek endpoint for our agents to use. We use the same BaseTool subclass here, but we hide it in the CustomTool class implementation in sage_tools.py in the tools folder. For more information, refer to the GitHub repo.

from crewai import Crew, Agent, Task, Process
from tools.sage_tools import CustomTool  # BaseTool subclass defined in sage_tools.py (see the GitHub repo)

# Create the tool for DeepSeek LLaMA inference
deepseek_tool = CustomTool(
    name="deepseek_llama_3.3_70B",
    func=lambda inputs: deepseek_llama_inference(
        prompt=inputs,
        endpoint_name=deepseek_endpoint,
    ),
    description="A tool to generate text using the DeepSeek LLaMA model deployed on SageMaker.",
)

Create a research agent
Just like the simple blocks agent we defined earlier, we follow the same template here to define the research agent. The difference here is that we give more capabilities to this agent. We attach a SageMaker AI based DeepSeek-R1 model as an endpoint for the LLM.
This helps the research agent think critically about information processing by combining the scalable infrastructure of SageMaker with DeepSeek-R1’s advanced reasoning capabilities.
The agent uses the SageMaker hosted LLM to analyze patterns in research data, evaluate source credibility, and synthesize insights from multiple inputs. By using the deepseek_tool, the agent can dynamically adjust its research strategy based on intermediate findings, validate hypotheses through iterative questioning, and maintain context awareness across complex information it gathers.

# Research agent
research_agent = Agent(
    role="Research Bot",
    goal="Scan sources, extract relevant information, and compile a research summary.",
    backstory="An AI agent skilled in finding relevant information from a variety of sources.",
    tools=[deepseek_tool],
    allow_delegation=True,
    llm=DeepSeekSageMakerLLM(endpoint=deepseek_endpoint),
    verbose=False,
)

Create a writer agent
The writer agent is configured as a specialized content editor that takes research data and transforms it into polished content. This agent works as part of a workflow where it takes research from a research agent and acts like an editor by formatting the content into a readable format. The agent is used for writing and formatting, and unlike the research agent, it doesn’t delegate tasks to other agents.

# Writer agent
writer_agent = Agent(
    role="Writer Bot",
    goal="Receive research summaries and transform them into structured content.",
    backstory="A talented writer bot capable of producing high-quality, structured content based on research.",
    tools=[deepseek_tool],
    allow_delegation=False,
    llm=DeepSeekSageMakerLLM(endpoint=deepseek_endpoint),
    verbose=False,
)

Define tasks for the agents
Tasks in CrewAI define specific operations that agents need to perform. In this example, we have two tasks: a research task that processes queries and gathers information, and a writing task that transforms research data into polished content.
Each task includes a clear description of what needs to be done, the expected output format, and specifies which agent will perform the work. This structured approach makes sure that agents have well-defined responsibilities and clear deliverables.
Together, these tasks create a workflow where one agent researches a topic on the internet, and another agent takes this research and formats it into readable content. The tasks are integrated with the DeepSeek tool for advanced language processing capabilities, enabling a production-ready deployment on SageMaker AI.

research_task = Task(
    description=(
        "Your task is to conduct research based on the following query: {prompt}.\n"
    ),
    expected_output="A comprehensive research summary based on the provided query.",
    agent=research_agent,
    tools=[deepseek_tool],
)

writing_task = Task(
    description=(
        "Your task is to create structured content based on the research provided.\n"
    ),
    expected_output="A well-structured article based on the research summary.",
    agent=writer_agent,  # the writer agent edits and formats the research output
    tools=[deepseek_tool],
)

Define a crew in CrewAI
A crew in CrewAI represents a collaborative group of agents working together to achieve a set of tasks. Each crew defines the strategy for task execution, agent collaboration, and the overall workflow. In this specific example, the sequential process makes sure tasks are executed one after the other, following a linear progression. There are other more complex orchestrations of agents working together, which we will discuss in future blog posts.
This approach is ideal for projects requiring tasks to be completed in a specific order. The workflow creates two agents: a research agent and a writer agent. The research agent researches a topic on the internet, then the writer agent takes this research and acts like an editor by formatting it into a readable format.
Let’s call the crew scribble_bots:

# Define the crew for a sequential workflow
scribble_bots = Crew(
    agents=[research_agent, writer_agent],
    tasks=[research_task, writing_task],
    process=Process.sequential,  # ensure tasks execute in sequence
)

Use the crew to run a task
We have our endpoint deployed, agents created, and crew defined. Now we’re ready to use the crew to get some work done. Let’s use the following prompt:

result = scribble_bots.kickoff(inputs={"prompt": "What is DeepSeek?"})

Our result is as follows:

**DeepSeek: Pioneering AI Solutions for a Smarter Tomorrow**

In the rapidly evolving landscape of artificial intelligence,
DeepSeek stands out as a beacon of innovation and practical application.
As an AI company, DeepSeek is dedicated to advancing the field through cutting-edge research and real-world applications,
making AI accessible and beneficial across various industries.

**Focus on AI Research and Development**

………………….. ………………….. ………………….. …………………..

Clean up
Complete the following steps to clean up your resources:

Delete your GPU DeepSeek-R1 endpoint:

import boto3

# Create a low-level SageMaker service client
sagemaker_client = boto3.client("sagemaker", region_name="<region>")

# Delete the endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)

If you’re using a SageMaker Studio JupyterLab notebook, shut down the JupyterLab notebook instance.

Conclusion
In this post, we demonstrated how you can deploy an LLM such as DeepSeek-R1 (or another FM of your choice) from popular model hubs like SageMaker JumpStart or Hugging Face Hub to SageMaker AI for real-time inference. We explored inference frameworks like Hugging Face TGI, which help streamline deployment while integrating built-in performance optimizations to minimize latency and maximize throughput. Additionally, we showcased how the developer-friendly SageMaker Python SDK simplifies endpoint orchestration, allowing seamless experimentation and scaling of LLM-powered applications.
Beyond deployment, this post provided an in-depth exploration of agentic AI, guiding you through its conceptual foundations, practical design principles using CrewAI, and the seamless integration of state-of-the-art LLMs like DeepSeek-R1 as the intelligent backbone of an autonomous agentic workflow. We outlined a sequential CrewAI workflow design, illustrating how to equip LLM-powered agents with specialized tools that enable autonomous data retrieval, real-time processing, and interaction with complex external systems.
Now, it’s your turn to experiment! Dive into our publicly available code on GitHub, and start building your own DeepSeek-R1-powered agentic AI system on SageMaker. Unlock the next frontier of AI-driven automation—seamlessly scalable, intelligent, and production-ready.
Special thanks to Giuseppe Zappia, Poli Rao, and Siamak Nariman for their support with this blog post.

About the Authors
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the LLama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.
Bobby Lindsey is a Machine Learning Specialist at Amazon Web Services. He’s been in technology for over a decade, spanning various technologies and multiple roles. He is currently focused on combining his background in software engineering, DevOps, and machine learning to help customers deliver machine learning workflows at scale. In his spare time, he enjoys reading, research, hiking, biking, and trail running.
Karan Singh is a Generative AI Specialist for third-party models at AWS, where he works with top-tier third-party foundation model (FM) providers to develop and execute joint Go-To-Market strategies, enabling customers to effectively train, deploy, and scale FMs to solve industry specific challenges. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University, a master’s in science in Electrical Engineering from Northwestern University and is currently an MBA Candidate at the Haas School of Business at University of California, Berkeley.
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.

Automate bulk image editing with Crop.photo and Amazon Rekognition

Evolphin Software, Inc. is a leading provider of digital and media asset management solutions based in Silicon Valley, California. Crop.photo from Evolphin Software is a cloud-based service that offers powerful bulk processing tools for automating image cropping, content resizing, background removal, and listing image analysis.
Crop.photo is tailored for high-end retailers, ecommerce platforms, and sports organizations. The solution has created a unique offering for bulk image editing through its advanced AI-driven solutions. In this post, we explore how Crop.photo uses Amazon Rekognition to provide sophisticated image analysis, enabling automated and precise editing of large volumes of images. This integration streamlines the image editing process for clients, providing speed and accuracy, which is crucial in the fast-paced environments of ecommerce and sports.
Automation: The way out of bulk image editing challenges
Bulk image editing isn’t just about handling a high volume of images; it’s about delivering flawless results with speed at scale. Large retail brands, marketplaces, and sports organizations process thousands of images weekly. Each image must be catalog-ready or broadcast-worthy in minutes, not hours.
The challenge lies not just in the quantity but in maintaining image quality and brand integrity. Speed and accuracy are non-negotiable: retailers and sports organizations expect rapid turnaround without compromising the images themselves.
This is where Crop.photo’s smart automations come in with an innovative solution for high-volume image processing needs. The platform’s advanced AI algorithms can automatically detect subjects of interest, crop the images, and optimize thousands of images simultaneously while providing consistent quality and brand compliance. By automating repetitive editing tasks, Crop.photo enables enterprises to reduce image processing time from hours to minutes, allowing creative teams to focus on higher-value activities.
Challenges in the ecommerce industry
The ecommerce industry often encounters the following challenges:

Inefficiencies and delays in manual image editing – Ecommerce companies rely on manual editing for tasks like resizing, alignment, and background removal. This process can be time-consuming and prone to delays and inconsistencies. A more efficient solution is needed to streamline the editing process, especially during platform migrations or large updates.
Maintaining uniformity across diverse image types – Companies work with a variety of image types, from lifestyle shots to product close-ups, across different categories. Maintaining uniformity and professionalism in all image types is essential to meet the diverse needs of marketing, product cataloging, and overall brand presentation.
Large-scale migration and platform transition – Transitioning to a new ecommerce platform involves migrating thousands of images, which presents significant logistical challenges. Providing consistency and quality across a diverse range of images during such a large-scale migration is crucial for maintaining brand standards and a seamless user experience.

For a top US retailer, wholesale distribution channels posed a unique challenge. Thousands of fashion images need to be prepared for the marketplace with less than a day’s notice for flash sales. Their director of creative operations said,

“Crop.photo is an essential part of our ecommerce fashion marketplace workflow. With over 3,000 on-model product images to bulk crop each month, we rely on Crop.photo to enable our wholesale team to quickly publish new products on popular online marketplaces such as Macy’s, Nordstrom, and Bloomingdales. By increasing our retouching team’s productivity by over 70%, Crop.photo has been a game changer for us. Bulk crop images used to take days can now be done in a matter of seconds!”

Challenges in the sports industry
The sports industry often contends with the following challenges:

Bulk player headshot volume and consistency – Sports organizations face the challenge of bulk cropping and resizing hundreds of player headshots for numerous teams, frequently on short notice. Maintaining consistency and quality across a large volume of images can be difficult without AI.
Diverse player facial features – Players have varying facial features, such as different hair lengths, forehead sizes, and face dimensions. Adapting cropping processes to accommodate these differences traditionally requires manual adjustments for each image, which leads to inconsistencies and significant time investment.
Editorial time constraints – Tight editorial schedules and resource limitations are common in sports organizations. The time-consuming nature of manual cropping tasks strains editorial teams, particularly during high-volume periods like tournaments, where delays and rushed work can impact quality and timing.

An Imaging Manager at Europe’s Premier Football Organization said,

“We recently found ourselves with 40 images from a top flight English premier league club needing to be edited just 2 hours before kick-off. Using the Bulk AI headshot cropping for sports feature from Crop.photo, we had perfectly cropped headshots of the squad in just 5 minutes, making them ready for publishing in our website CMS just in time. We would never have met this deadline using manual processes. This level of speed was unthinkable before, and it’s why we’re actively recommending Crop.photo to other sports leagues.”

Solution overview
Crop.photo uses Amazon Rekognition to power a robust solution for bulk image editing. Amazon Rekognition offers features like object and scene detection, facial analysis, and image labeling, which Crop.photo uses to generate markers that drive a fully automated image editing workflow.
The following diagram presents a high-level architectural data flow highlighting several of the AWS services used in building the solution.

The solution consists of the following key components:

User authentication – Amazon Cognito is used for user authentication and user management.
Infrastructure deployment – Frontend and backend servers run on Amazon Elastic Container Service (Amazon ECS), which handles container deployment, orchestration, and scaling.
Content delivery and caching – Amazon CloudFront is used to cache content, improving performance and routing traffic efficiently.
File uploads – Amazon Simple Storage Service (Amazon S3) Transfer Acceleration enables fast, direct uploads to Amazon S3 (see the upload sketch following this list).
Media and job storage – Information about uploaded files and job execution is stored in Amazon Aurora.
Image processing – AWS Batch processes thousands of images in bulk.
Job management – Amazon Simple Queue Service (Amazon SQS) manages and queues jobs for processing, making sure they’re run in the correct order by AWS Batch.
Media analysis – Amazon Rekognition services analyze media files, including:

Face Analysis to generate headless crops.
Moderation to detect and flag profanity and explicit content.
Label Detection to provide context for image processing and focus on relevant objects.
Custom Labels to identify and verify brand logos and adhere to brand guidelines.

Asynchronous job notifications – Amazon Simple Notification Service (Amazon SNS), Amazon EventBridge, and Amazon SQS deliver asynchronous job completion notifications, manage events, and provide reliable and scalable processing.
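
To make the data flow more concrete, here is a minimal sketch (in Python with boto3) of how an upload-and-enqueue step in a pipeline like this could look. The bucket name, queue URL, and message fields are hypothetical placeholders for illustration, not Crop.photo’s actual implementation.

```python
import json

import boto3
from botocore.config import Config

# Hypothetical resource names, for illustration only
BUCKET = "example-crop-photo-uploads"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-image-jobs"

# S3 client configured to use Transfer Acceleration for fast, direct uploads
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
sqs = boto3.client("sqs")


def submit_image_job(local_path: str, key: str) -> None:
    """Upload an image to S3, then enqueue a processing job for the batch workers."""
    s3.upload_file(local_path, BUCKET, key)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"bucket": BUCKET, "key": key, "operation": "smart_crop"}),
    )


submit_image_job("product-001.jpg", "uploads/product-001.jpg")
```

In this sketch, AWS Batch workers would poll the queue and run the actual image operations, with Amazon SNS and EventBridge handling completion notifications as described above.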

Amazon Rekognition is an AWS computer vision service that powers Crop.photo’s automated image analysis. It enables object detection, facial recognition, and content moderation capabilities:

Face detection – The Amazon Rekognition face detection feature automatically identifies and analyzes faces in product images. You can use this feature for face-based cropping and optimization through adjustable bounding boxes in the interface.
Image color analysis – The color analysis feature examines image composition, identifying dominant colors and balance. This integrates with Crop.photo’s brand guidelines checker to provide consistency across product images.
Object detection – Object detection automatically identifies key elements in images, enabling smart cropping suggestions. The interface highlights detected objects, allowing you to prioritize specific elements during cropping.
Custom label detection – Custom label detection recognizes brand-specific items and assets. Companies can train models for their unique needs, automatically applying brand-specific cropping rules to maintain consistency.
Text detection (OCR) – The OCR capabilities of Amazon Rekognition detect and preserve text within images during editing. The system highlights text areas to make sure critical product information remains legible after cropping (see the API sketch after this list).
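
As a rough illustration of how these analyses map onto the Amazon Rekognition API, the sketch below calls the face, label, moderation, and text detection operations on an image stored in S3. The bucket and object names are placeholders; custom label detection additionally requires a trained Rekognition Custom Labels project version ARN and is omitted here.

```python
import boto3

rekognition = boto3.client("rekognition")

# Placeholder S3 location for illustration
image = {"S3Object": {"Bucket": "example-crop-photo-uploads", "Name": "uploads/product-001.jpg"}}

# Face detection: bounding boxes and landmarks drive face-aware cropping
faces = rekognition.detect_faces(Image=image, Attributes=["DEFAULT"])

# Label detection: objects and scenes that provide context for smart cropping
labels = rekognition.detect_labels(Image=image, MaxLabels=10, MinConfidence=80)

# Content moderation: flag explicit or unsafe content before publishing
moderation = rekognition.detect_moderation_labels(Image=image, MinConfidence=80)

# Text detection (OCR): locate text regions so they stay legible after cropping
text = rekognition.detect_text(Image=image)

for face in faces["FaceDetails"]:
    print("Face bounding box:", face["BoundingBox"])
```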

Users can upload videos through the standard Crop.photo interface, and the speech-to-text functionality will automatically transcribe any audio content. This transcribed text can then be used to enrich the metadata and descriptions associated with the product images or videos, improving searchability and accessibility for customers. Additionally, the brand guidelines check feature can be applied to the transcribed text, making sure that the written content aligns with the company’s branding and communication style.
The Crop.photo service follows a transparent pricing model that combines unlimited automations with a flexible image credit system. Users have unrestricted access to create and run as many automation workflows as needed, without any additional charges. The service includes a range of features at no extra cost, such as basic image operations, storage, and behind-the-scenes processing.
For advanced AI-powered image processing tasks, like smart cropping or background removal, users consume image credits. The number of credits required for each operation is clearly specified, allowing users to understand the costs upfront. Crop.photo offers several subscription plans with varying image credit allowances, enabling users to choose the plan that best fits their needs.
Results: Improved speed and precision
The automated image editing capabilities of Crop.photo, integrated with Amazon Rekognition, have increased editing speed, delivering 70% faster image retouching for ecommerce. A 75% reduction in manual work cuts the turnaround time for new product images from 2–3 days to just 1 hour. The bulk image editing process has likewise been streamlined, allowing over 100,000 image collections to be processed per day using AWS Fargate. Advanced AI-powered image analysis and editing features provide consistent, high-quality images at scale, eliminating the need for manual review and approval of thousands of product images.
For instance, in the ecommerce industry, this integration facilitates automatic product detection and precise cropping, making sure every image meets specific marketplace and brand standards. In sports, it enables quick identification and cropping of player facial features, including head, eyes, and mouth, adapting to varying backgrounds and maintaining brand consistency.
The following images are before and after pictures for an ecommerce use case.

For a famous wine retailer in the United Kingdom, the integration of Amazon Rekognition with Crop.photo streamlined the processing of over 1,700 product images, achieving a 95% reduction in bulk image editing time, a testament to the efficiency of AI-powered enhancement.
Similarly, a top 10 global specialty retailer experienced a transformative impact on their ecommerce fashion marketplace workflow. By automating the cropping of over 3,000 on-model product images monthly, they boosted their retouching team’s productivity by over 70%, maintaining compliance with the varied image standards of multiple online marketplaces.
Conclusion
These case studies illustrate the tangible benefits of integrating Crop.photo with Amazon Rekognition, demonstrating how automation and AI can revolutionize the bulk image editing landscape for ecommerce and sports industries.
Crop.photo, from AWS Partner Evolphin Software, offers powerful bulk processing tools for automating image cropping, content resizing, and listing image analysis, using advanced AI-driven solutions. Crop.photo is tailored for high-end retailers, ecommerce platforms, and sports organizations. Its integration with Amazon Rekognition streamlines the image editing process for clients, providing speed and accuracy in the high-stakes environments of ecommerce and sports. Crop.photo plans to add further AI capabilities built on Amazon Bedrock generative AI frameworks to adapt to emerging digital imaging trends, so it remains an indispensable tool for its clients.
To learn more about Evolphin Software and Crop.photo, visit their website.
To learn more about Amazon Rekognition, refer to the Amazon Rekognition Developer Guide.

About the Authors
Rahul Bhargava, founder & CTO of Evolphin Software and Crop.photo, is reshaping how brands produce and manage visual content at scale. Through Crop.photo’s AI-powered tools, global names like Lacoste and Urban Outfitters, as well as ambitious Shopify retailers, are rethinking their creative production workflows. By leveraging cutting-edge Generative AI, he’s enabling brands of all sizes to scale their content creation efficiently while maintaining brand consistency.
Vaishnavi Ganesan is a Solutions Architect specializing in Cloud Security at AWS based in the San Francisco Bay Area. As a trusted technical advisor, Vaishnavi helps customers to design secure, scalable and innovative cloud solutions that drive both business value and technical excellence. Outside of work, Vaishnavi enjoys traveling and exploring different artisan coffee roasters.
John Powers is an Account Manager at AWS, who provides guidance to Evolphin Software and other organizations to help accelerate business outcomes leveraging AWS Technologies. John has a degree in Business Administration and Management with a concentration in Finance from Gonzaga University, and enjoys snowboarding in the Sierras in his free time.

This AI Paper Introduces MaAS (Multi-agent Architecture Search): A New …

Large language models (LLMs) are the foundation for multi-agent systems, allowing multiple AI agents to collaborate, communicate, and solve problems. These agents use LLMs to understand tasks, generate responses, and make decisions, mimicking teamwork among humans. However, such systems often execute inefficiently because they rely on fixed designs that do not adapt across tasks, so simple and complex problems consume the same resources, wasting computation and slowing responses. This creates major challenges in balancing precision, speed, and cost across diverse tasks.

Currently, multi-agent systems rely on existing methods like CAMEL, AutoGen, MetaGPT, DsPy, EvoPrompting, GPTSwarm, and EvoAgent, which focus on optimizing specific tasks such as prompt tuning, agent profiling, and communication. However, these methods struggle with adaptability. They follow fixed designs that are not adjusted to diverse tasks, so handling both simple and complex queries is inefficient. Manual approaches lack flexibility, while existing automated approaches only search for a single best configuration and cannot dynamically readjust for efficiency. As a result, these methods are computationally costly and deliver lower overall performance when applied to real-world challenges.

To address the limitations of existing multi-agent systems, researchers proposed MaAS (Multi-agent Architecture Search). This framework uses a probabilistic agentic supernet to generate query-dependent multi-agent architectures. Instead of selecting a fixed optimal system, MaAS dynamically samples customized multi-agent systems for each query, balancing performance and computational cost. The search space is defined by agentic operators, which are LLM-based workflows involving multiple agents, tools, and prompts. The supernet learns a distribution over possible agentic architectures, optimizing it based on task utility and cost constraints. A controller network samples architectures conditioned on the query, using a Mixture-of-Experts (MoE)-style mechanism for efficient selection. The framework performs optimization via a cost-aware empirical Bayes Monte Carlo procedure, updating the agentic operators using textual gradient-based methods. The result is automated multi-agent evolution, with efficiency and adaptability when handling diverse and complex queries.
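
The paper’s implementation is not reproduced here, but the toy sketch below illustrates the core idea of a query-conditioned controller that samples a set of agentic operators in an MoE-like fashion; in MaAS, the resulting selection probabilities would be trained against a utility-minus-cost objective. All module names, dimensions, and operator labels are illustrative assumptions.

```python
import torch
import torch.nn as nn


class OperatorController(nn.Module):
    """Toy query-conditioned controller: scores agentic operators and samples a subset.

    An illustrative sketch of the idea, not the MaAS reference implementation.
    """

    def __init__(self, query_dim: int, num_operators: int):
        super().__init__()
        # MoE-style gating: map the query embedding to one logit per operator
        self.gate = nn.Linear(query_dim, num_operators)

    def forward(self, query_embedding: torch.Tensor):
        logits = self.gate(query_embedding)
        probs = torch.sigmoid(logits)      # independent selection probabilities
        selection = torch.bernoulli(probs)  # sample which operators to activate
        return selection, probs


# Hypothetical operators an architecture could be assembled from
OPERATORS = ["chain_of_thought", "debate", "tool_call", "self_refine"]

controller = OperatorController(query_dim=768, num_operators=len(OPERATORS))
query_embedding = torch.randn(768)          # stand-in for an encoded user query
selection, probs = controller(query_embedding)
active = [op for op, on in zip(OPERATORS, selection.tolist()) if on > 0]
print("Sampled multi-agent workflow:", active)
```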

Researchers evaluated MaAS on six public benchmarks across math reasoning (GSM8K, MATH, MultiArith), code generation (HumanEval, MBPP), and tool use (GAIA), comparing it with 14 baselines, including single-agent methods, handcrafted multi-agent systems, and automated approaches. MaAS consistently outperformed all baselines, achieving an average best score of 83.59% across tasks and a significant improvement of 18.38% on GAIA Level 1 tasks. Cost analysis showed MaAS is resource-efficient, requiring the least training tokens, lowest API costs, and shortest wall-clock time. Case studies highlighted its adaptability in dynamically optimizing multi-agent workflows.

In summary, MaAS addresses the limitations of traditional multi-agent systems with an agentic supernet that adapts to each query, improving performance, resource efficiency, flexibility, and scalability. Future work may extend MaAS into a broader framework for automation and self-organization, with further optimization of sampling strategies, improved domain adaptability, and the incorporation of real-world constraints to boost collective intelligence.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This AI Paper Introduces MaAS (Multi-agent Architecture Search): A New Machine Learning Framework that Optimizes Multi-Agent Systems appeared first on MarkTechPost.

Meta AI Introduces Brain2Qwerty: A New Deep Learning Model for Decodin …

Brain-computer interfaces (BCIs) have seen significant progress in recent years, offering communication solutions for individuals with speech or motor impairments. However, most effective BCIs rely on invasive methods, such as implanted electrodes, which pose medical risks including infection and long-term maintenance issues. Non-invasive alternatives, particularly those based on electroencephalography (EEG), have been explored, but they suffer from low accuracy due to poor signal resolution. A key challenge in this field is improving the reliability of non-invasive methods for practical use. Meta AI’s research into Brain2Qwerty presents a step toward addressing this challenge.

Meta AI introduces Brain2Qwerty, a neural network designed to decode sentences from brain activity recorded using EEG or magnetoencephalography (MEG). Participants in the study typed memorized sentences on a QWERTY keyboard while their brain activity was recorded. Unlike previous approaches that required users to focus on external stimuli or imagined movements, Brain2Qwerty leverages natural motor processes associated with typing, offering a potentially more intuitive way to interpret brain activity.

Model Architecture and Its Potential Benefits

Brain2Qwerty is a three-stage neural network designed to process brain signals and infer typed text. The architecture consists of:

Convolutional Module: Extracts temporal and spatial features from EEG/MEG signals.

Transformer Module: Processes sequences to refine representations and improve contextual understanding.

Language Model Module: A pretrained character-level language model corrects and refines predictions.

By integrating these three components, Brain2Qwerty achieves better accuracy than previous models, improving decoding performance and reducing errors in brain-to-text translation.
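
To make the three-stage design concrete, the following is a minimal sketch of such a convolution-plus-transformer pipeline in PyTorch. The channel counts, layer sizes, and output vocabulary are assumptions for illustration, and the pretrained character-level language model that corrects predictions is only indicated in comments; this is not Meta AI’s released architecture.

```python
import torch
import torch.nn as nn


class Brain2TextSketch(nn.Module):
    """Illustrative three-stage pipeline: convolution -> transformer -> character logits.

    A real system would add a pretrained character-level language model to rescore
    and correct the per-timestep predictions; that stage is omitted here.
    """

    def __init__(self, num_sensors: int = 269, hidden: int = 256, num_chars: int = 30):
        super().__init__()
        # Stage 1: convolutions over time extract local spatio-temporal features
        self.conv = nn.Sequential(
            nn.Conv1d(num_sensors, hidden, kernel_size=7, padding=3),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3),
            nn.GELU(),
        )
        # Stage 2: a transformer encoder refines the sequence with global context
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Stage 3 (head): project to character logits; an external character-level
        # language model would then correct these predictions
        self.head = nn.Linear(hidden, num_chars)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sensors, time) window of EEG/MEG signals
        h = self.conv(x)            # (batch, hidden, time)
        h = h.transpose(1, 2)       # (batch, time, hidden) for the transformer
        h = self.transformer(h)
        return self.head(h)         # (batch, time, num_chars) character logits


model = Brain2TextSketch()
dummy = torch.randn(2, 269, 500)    # two windows of 500 time samples each
print(model(dummy).shape)           # torch.Size([2, 500, 30])
```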

Evaluating Performance and Key Findings

The study measured Brain2Qwerty’s effectiveness using Character Error Rate (CER):

EEG-based decoding resulted in a 67% CER, indicating a high error rate.

MEG-based decoding performed significantly better with a 32% CER.

The most accurate participants achieved 19% CER, demonstrating the model’s potential under optimal conditions.
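
For reference, character error rate is typically computed as the character-level edit distance between the predicted and reference text, divided by the length of the reference. The following self-contained sketch (not the paper’s evaluation code) shows the computation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(prediction: str, reference: str) -> float:
    """CER = character edit distance / number of characters in the reference."""
    return edit_distance(prediction, reference) / max(len(reference), 1)


# Example: one dropped character out of a 19-character reference -> CER of about 5%
print(round(character_error_rate("the quick brwn fox", "the quick brown fox"), 3))
```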

These results highlight the limitations of EEG for accurate text decoding while showing MEG’s potential for non-invasive brain-to-text applications. The study also found that Brain2Qwerty could correct typographical errors made by participants, suggesting that it captures both motor and cognitive patterns associated with typing.

Considerations and Future Directions

Brain2Qwerty represents progress in non-invasive BCIs, yet several challenges remain:

Real-time implementation: The model currently processes complete sentences rather than individual keystrokes in real time.

Accessibility of MEG technology: While MEG outperforms EEG, it requires specialized equipment that is not yet portable or widely available.

Applicability to individuals with impairments: The study was conducted with healthy participants. Further research is needed to determine how well it generalizes to those with motor or speech disorders.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Meta AI Introduces Brain2Qwerty: A New Deep Learning Model for Decoding Sentences from Brain Activity with EEG or MEG while Participants Typed Briefly Memorized Sentences on a QWERTY Keyboard appeared first on MarkTechPost.