Microsoft AI Introduces CoRAG (Chain-of-Retrieval Augmented Generation …

Retrieval-Augmented Generation (RAG) is a key technique in enterprise applications that combines large foundation models with external retrieval systems to generate responses that are both accurate and grounded in factual information. Unlike traditional foundation models, which are trained on massive datasets and remain static post-deployment, RAG enhances reliability by incorporating real-time or domain-specific information during the generation process. This integration addresses common issues like hallucinations or gaps in long-tail factual knowledge. RAG systems typically follow a sequential pipeline where retrieved information is provided as input to the generative model, with the overall performance depending heavily on the quality of the retrieval process. To ensure scalability, dense retrievers often use bi-encoder architectures for compressing documents and queries into fixed-size vectors, enabling efficient search algorithms. However, this efficiency comes at the cost of reduced flexibility for handling complex or multi-hop queries, which require iterative reasoning and retrieval steps based on dynamically evolving information.

Recent advancements in RAG have introduced iterative retrieval-generation methods to overcome the limitations of a single retrieval step. Approaches like FLARE and ITER-RETGEN enable models to decide when and what to retrieve during generation, enhancing performance in complex reasoning tasks. Methods like IRCoT adopt chain-of-thought reasoning, refining retrieval steps recursively, while Self-RAG integrates retrieval, generation, and critique for improved factual accuracy. Scaling test-time computing has also been explored to boost RAG performance, with strategies such as retrieving more documents or using long-context LLMs, as seen in LongRAG and IterDRAG. Tree-of-Thought (ToT) and STaR extend reasoning capabilities by leveraging structured exploration and intermediate training states, though these approaches increase token consumption and response latency. Newer methods, like Search-o1, integrate open-source models with active search mechanisms, further advancing RAG’s potential in knowledge-intensive tasks.

Researchers from Microsoft Corporation and the Renmin University of China introduced CoRAG (Chain-of-Retrieval Augmented Generation), a method for training RAG models to iteratively retrieve and reason before generating answers. Unlike conventional RAG systems, CoRAG dynamically reformulates queries based on the evolving reasoning state. The approach uses rejection sampling to augment datasets with intermediate retrieval chains, enabling fine-tuning of open-source models. CoRAG achieves state-of-the-art results on benchmarks like KILT, particularly excelling in multi-hop reasoning tasks by addressing retrieval bottlenecks. It supports diverse decoding strategies, adjusts test-time retrieval dynamically, and demonstrates robustness to varying retriever quality, offering a pathway to more grounded and factual AI models.

The CoRAG framework enhances RAG models through three key components: retrieval chain generation, model training, and test-time scaling strategies. Retrieval chains are generated using rejection sampling, where intermediate sub-queries and sub-answers are iteratively formed, and the chain with the highest log-likelihood score is selected to augment datasets. Using a multi-task learning framework, the model is trained on these augmented datasets for sub-query, sub-answer, and final answer prediction. At test time, decoding strategies like greedy decoding, best-of-N sampling, and tree search allow for controlling token consumption and retrieval steps. These approaches optimize the trade-off between performance and compute efficiency.

The evaluation of CoRAG was conducted using two benchmarks: (1) multi-hop QA datasets, including 2WikiMultihopQA, HotpotQA, Bamboogle, and MuSiQue, to test multi-hop reasoning, and (2) the KILT benchmark for generalization across knowledge-intensive tasks. Fine-tuning was performed on Llama-3.1-8B-Instruct using retrieval chain-augmented datasets. CoRAG-8B significantly outperformed baselines in most multi-hop QA datasets, except Bamboogle, where limited instances and outdated retrieval data caused variability. In the KILT benchmark, CoRAG achieved state-of-the-art performance across tasks, except for FEVER, where a larger model slightly surpassed it. Performance scaling experiments showed improvements with increased retrieval chain lengths and sampling strategies.

In conclusion, the study presents CoRAG, a framework that trains LLMs to retrieve and reason through complex queries iteratively. Unlike traditional RAG methods that rely on a single retrieval step, CoRAG dynamically reformulates queries during retrieval, enhancing accuracy. Intermediate retrieval chains are automatically generated using rejection sampling, eliminating the need for manual annotations. At test time, adaptive decoding strategies balance performance with computational efficiency. CoRAG achieves state-of-the-art results on multi-hop QA datasets and the KILT benchmark, outperforming larger models. Detailed analysis highlights its scaling and generalization capabilities, paving the way for advancing factual, grounded, and trustworthy AI systems in challenging tasks.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post Microsoft AI Introduces CoRAG (Chain-of-Retrieval Augmented Generation): An AI Framework for Iterative Retrieval and Reasoning in Knowledge-Intensive Tasks appeared first on MarkTechPost.

Leveraging Hallucinations in Large Language Models to Enhance Drug Dis …

Researchers have highlighted concerns regarding hallucinations in LLMs due to their generation of plausible but inaccurate or unrelated content. However, these hallucinations hold potential in creativity-driven fields like drug discovery, where innovation is essential. LLMs have been widely applied in scientific domains, such as materials science, biology, and chemistry, aiding tasks like molecular description and drug design. While traditional models like MolT5 offer domain-specific accuracy, LLMs often produce hallucinated outputs when not fine-tuned. Despite their lack of factual consistency, such outputs can provide valuable insights, such as high-level molecular descriptions and potential compound applications, thereby supporting exploratory processes in drug discovery.

Drug discovery, a costly and time-intensive process, involves evaluating vast chemical spaces and identifying novel solutions to biological challenges. Previous studies have used machine learning and generative models to assist in this field, with researchers exploring the integration of LLMs for molecule design, dataset curation, and prediction tasks. Hallucinations in LLMs, often viewed as a drawback, can mimic creative processes by recombining knowledge to generate novel ideas. This perspective aligns with creativity’s role in innovation, exemplified by groundbreaking accidental discoveries like penicillin. By leveraging hallucinated insights, LLMs could advance drug discovery by identifying molecules with unique properties and fostering high-level innovation.

ScaDS.AI and Dresden University of Technology researchers hypothesize that hallucinations can enhance LLM performance in drug discovery. Using seven instruction-tuned LLMs, including GPT-4o and Llama-3.1-8B, they incorporated hallucinated natural language descriptions of molecules’ SMILES strings into prompts for classification tasks. The results confirmed their hypothesis, with Llama-3.1-8B achieving an 18.35% ROC-AUC improvement over the baseline. Larger models and Chinese-generated hallucinations demonstrated the greatest gains. Analyses revealed that hallucinated text provides unrelated yet insightful information, aiding predictions. This study highlights hallucinations’ potential in pharmaceutical research and offers new perspectives on leveraging LLMs for innovative drug discovery.

To generate hallucinations, SMILES strings of molecules are translated into natural language using a standardized prompt where the system is defined as an “expert in drug discovery.” The generated descriptions are evaluated for factual consistency using the HHM-2.1-Open Model, with MolT5-generated text as the reference. Results show low factual consistency across LLMs, with ChemLLM scoring 20.89% and others averaging 7.42–13.58%. Drug discovery tasks are formulated as binary classification problems, predicting specific molecular properties via next-token prediction. Prompts include SMILES, descriptions, and task instructions, with models constrained to output “Yes” or “No” based on the highest probability.

The study examines how hallucinations generated by different LLMs impact performance in molecular property prediction tasks. Experiments use a standardized prompt format to compare predictions based on SMILES strings alone, SMILES with MolT5-generated descriptions, and hallucinated descriptions from various LLMs. Five MoleculeNet datasets were analyzed using ROC-AUC scores. Results show that hallucinations generally improve performance over SMILES or MolT5 baselines, with GPT-4o achieving the highest gains. Larger models benefit more from hallucinations, but improvements plateau beyond 8 billion parameters. Temperature settings influence hallucination quality, with intermediate values yielding the best performance enhancements.

In conclusion, the study explores the potential benefits of hallucinations in LLMs for drug discovery tasks. By hypothesizing that hallucinations can enhance performance, the research evaluates seven LLMs across five datasets using hallucinated molecule descriptions integrated into prompts. Results confirm that hallucinations improve LLM performance compared to baseline prompts without hallucinations. Notably, Llama-3.1-8B achieved an 18.35% ROC-AUC gain. GPT-4o-generated hallucinations provided consistent improvements across models. Findings reveal that larger model sizes generally benefit more from hallucinations, while factors like generation temperature have minimal impact. The study highlights hallucinations’ creative potential in AI and encourages further exploration of drug discovery applications.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post Leveraging Hallucinations in Large Language Models to Enhance Drug Discovery appeared first on MarkTechPost.

Develop a RAG-based application using Amazon Aurora with Amazon Kendra

Generative AI and large language models (LLMs) are revolutionizing organizations across diverse sectors to enhance customer experience, which traditionally would take years to make progress. Every organization has data stored in data stores, either on premises or in cloud providers.
You can embrace generative AI and enhance customer experience by converting your existing data into an index on which generative AI can search. When you ask a question to an open source LLM, you get publicly available information as a response. Although this is helpful, generative AI can help you understand your data along with additional context from LLMs. This is achieved through Retrieval Augmented Generation (RAG).
RAG retrieves data from a preexisting knowledge base (your data), combines it with the LLM’s knowledge, and generates responses with more human-like language. However, in order for generative AI to understand your data, some amount of data preparation is required, which involves a big learning curve.
Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. Aurora combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases.
In this post, we walk you through how to convert your existing Aurora data into an index without needing data preparation for Amazon Kendra to perform data search and implement RAG that combines your data along with LLM knowledge to produce accurate responses.
Solution overview
In this solution, use your existing data as a data source (Aurora), create an intelligent search service by connecting and syncing your data source to Amazon Kendra search, and perform generative AI data search, which uses RAG to produce accurate responses by combining your data along with the LLM’s knowledge. For this post, we use Anthropic’s Claude on Amazon Bedrock as our LLM.
The following are the high-level steps for the solution:

Create an Amazon Aurora PostgreSQL-Compatible Edition
Ingest data to Aurora PostgreSQL-Compatible.
Create an Amazon Kendra index.
Set up the Amazon Kendra Aurora PostgreSQL connector.
Invoke the RAG application.

The following diagram illustrates the solution architecture.

Prerequisites
To follow this post, the following prerequisites are required:

The AWS Command Line Interface (AWS CLI) installed and configured
An AWS account and appropriate permissions to interact with resources in your AWS account
The AWS managed AWS Identity and Access Management (IAM) policy AmazonKendraReadOnlyAccess should be part of an Amazon SageMaker IAM role
An Aurora DB cluster where the current data is present
Your preferred interactive development environment (IDE) to run the Python script (such as SageMaker, or VS Code)
The pgAdmin tool for data loading and validation

Create an Aurora PostgreSQL cluster
Run the following AWS CLI commands to create an Aurora PostgreSQL Serverless v2 cluster:

aws rds create-db-cluster
–engine aurora-postgresql
–engine-version 15.4
–db-cluster-identifier genai-kendra-ragdb
–master-username postgres
–master-user-password XXXXX
–db-subnet-group-name dbsubnet
–vpc-security-group-ids “sg-XXXXX”
–serverless-v2-scaling-configuration “MinCapacity=2,MaxCapacity=64”
–enable-http-endpoint
–region us-east-2

aws rds create-db-instance
–db-cluster-identifier genai-kendra-ragdb
–db-instance-identifier genai-kendra-ragdb-instance
–db-instance-class db.serverless
–engine aurora-postgresql

The following screenshot shows the created instance.

Ingest data to Aurora PostgreSQL-Compatible
Connect to the Aurora instance using the pgAdmin tool. Refer to Connecting to a DB instance running the PostgreSQL database engine for more information. To ingest your data, complete the following steps:

Run the following PostgreSQL statements in pgAdmin to create the database, schema, and table:

CREATE DATABASE genai;
CREATE SCHEMA ’employees’;

CREATE DATABASE genai;
SET SCHEMA ’employees’;

CREATE TABLE employees.amazon_review(
pk int GENERATED ALWAYS AS IDENTITY NOT NULL,
id varchar(50) NOT NULL,
name varchar(300) NULL,
asins Text NULL,
brand Text NULL,
categories Text NULL,
keys Text NULL,
manufacturer Text NULL,
reviews_date Text NULL,
reviews_dateAdded Text NULL,
reviews_dateSeen Text NULL,
reviews_didPurchase Text NULL,
reviews_doRecommend varchar(100) NULL,
reviews_id varchar(150) NULL,
reviews_numHelpful varchar(150) NULL,
reviews_rating varchar(150) NULL,
reviews_sourceURLs Text NULL,
reviews_text Text NULL,
reviews_title Text NULL,
reviews_userCity varchar(100) NULL,
reviews_userProvince varchar(100) NULL,
reviews_username Text NULL,
PRIMARY KEY
(
pk
)
) ;

In your pgAdmin Aurora PostgreSQL connection, navigate to Databases, genai, Schemas, employees, Tables.
Choose (right-click) Tables and choose PSQL Tool to open a PSQL client connection.
Place the csv file under your pgAdmin location and run the following command:

copy employees.amazon_review (id, name, asins, brand, categories, keys, manufacturer, reviews_date, reviews_dateadded, reviews_dateseen, reviews_didpurchase, reviews_dorecommend, reviews_id, reviews_numhelpful, reviews_rating, reviews_sour
ceurls, reviews_text, reviews_title, reviews_usercity, reviews_userprovince, reviews_username) FROM ‘C:Program FilespgAdmin 4runtimeamazon_review.csv’ DELIMITER ‘,’ CSV HEADER ENCODING ‘utf8′;

Run the following PSQL query to verify the number of records copied:

Select count (*) from employees.amazon_review;

Create an Amazon Kendra index
The Amazon Kendra index holds the contents of your documents and is structured in a way to make the documents searchable. It has three index types:

Generative AI Enterprise Edition index – Offers the highest accuracy for the Retrieve API operation and for RAG use cases (recommended)
Enterprise Edition index – Provides semantic search capabilities and offers a high-availability service that is suitable for production workloads
Developer Edition index – Provides semantic search capabilities for you to test your use cases

To create an Amazon Kendra index, complete the following steps:

On the Amazon Kendra console, choose Indexes in the navigation pane.
Choose Create an index.
On the Specify index details page, provide the following information:

For Index name, enter a name (for example, genai-kendra-index).
For IAM role, choose Create a new role (Recommended).
For Role name, enter an IAM role name (for example, genai-kendra). Your role name will be prefixed with AmazonKendra-<region>- (for example, AmazonKendra-us-east-2-genai-kendra).

Choose Next.
On the Add additional capacity page, select Developer edition (for this demo) and choose Next.
On the Configure user access control page, provide the following information:

Under Access control settings¸ select No.
Under User-group expansion, select None.

Choose Next.
On the Review and create page, verify the details and choose Create.

It might take some time for the index to create. Check the list of indexes to watch the progress of creating your index. When the status of the index is ACTIVE, your index is ready to use.
Set up the Amazon Kendra Aurora PostgreSQL connector
Complete the following steps to set up your data source connector:

On the Amazon Kendra console, choose Data sources in the navigation pane.
Choose Add data source.
Choose Aurora PostgreSQL connector as the data source type.
On the Specify data source details page, provide the following information:

For Data source name, enter a name (for example, data_source_genai_kendra_postgresql).
For Default language¸ choose English (en).
Choose Next.

On the Define access and security page, under Source, provide the following information:

For Host, enter the host name of the PostgreSQL instance (cvgupdj47zsh.us-east-2.rds.amazonaws.com).
For Port, enter the port number of the PostgreSQL instance (5432).
For Instance, enter the database name of the PostgreSQL instance (genai).

Under Authentication, if you already have credentials stored in AWS Secrets Manager, choose it on the dropdown Otherwise, choose Create and add new secret.
In the Create an AWS Secrets Manager secret pop-up window, provide the following information:

For Secret name, enter a name (for example, AmazonKendra-Aurora-PostgreSQL-genai-kendra-secret).
For Data base user name, enter the name of your database user.
For Password¸ enter the user password.

Choose Add Secret.
Under Configure VPC and security group, provide the following information:

For Virtual Private Cloud, choose your virtual private cloud (VPC).
For Subnet, choose your subnet.
For VPC security groups, choose the VPC security group to allow access to your data source.

Under IAM role¸ if you have an existing role, choose it on the dropdown menu. Otherwise, choose Create a new role.
On the Configure sync settings page, under Sync scope, provide the following information:

For SQL query, enter the SQL query and column values as follows: select * from employees.amazon_review.
For Primary key, enter the primary key column (pk).
For Title, enter the title column that provides the name of the document title within your database table (reviews_title).
For Body, enter the body column on which your Amazon Kendra search will happen (reviews_text).

Under Sync node, select Full sync to convert the entire table data into a searchable index.

After the sync completes successfully, your Amazon Kendra index will contain the data from the specified Aurora PostgreSQL table. You can then use this index for intelligent search and RAG applications.

Under Sync run schedule, choose Run on demand.
Choose Next.
On the Set field mappings page, leave the default settings and choose Next.
Review your settings and choose Add data source.

Your data source will appear on the Data sources page after the data source has been created successfully.

Invoke the RAG application
The Amazon Kendra index sync can take minutes to hours depending on the volume of your data. When the sync completes without error, you are ready to develop your RAG solution in your preferred IDE. Complete the following steps:

Configure your AWS credentials to allow Boto3 to interact with AWS services. You can do this by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or by using the ~/.aws/credentials file:

import boto3
  pip install langchain

# Create a Boto3 session

session = boto3.Session(
   aws_access_key_id=’YOUR_AWS_ACCESS_KEY_ID’,
   aws_secret_access_key=’YOUR_AWS_SECRET_ACCESS_KEY’,
   region_name=’YOUR_AWS_REGION’
)

Import LangChain and the necessary components:

from langchain_community.llms import Bedrock
from langchain_community.retrievers import AmazonKendraRetriever
from langchain.chains import RetrievalQA

Create an instance of the LLM (Anthropic’s Claude):

llm = Bedrock(
region_name = “bedrock_region_name”,
model_kwargs = {
“max_tokens_to_sample”:300,
“temperature”:1,
“top_k”:250,
“top_p”:0.999,
“anthropic_version”:”bedrock-2023-05-31″
},
model_id = “anthropic.claude-v2”
)

Create your prompt template, which provides instructions for the LLM:

from langchain_core.prompts import PromptTemplate

prompt_template = “””
You are a <persona>Product Review Specialist</persona>, and you provide detail product review insights.
You have access to the product reviews in the <context> XML tags below and nothing else.

<context>
{context}
</context>

<question>
{question}
</question>
“””

prompt = PromptTemplate(template=prompt_template, input_variables=[“context”, “question”])

Initialize the KendraRetriever with your Amazon Kendra index ID by replacing the Kendra_index_id that you created earlier and the Amazon Kendra client:

session = boto3.Session(region_name=’Kendra_region_name’)
kendra_client = session.client(‘kendra’)
# Create an instance of AmazonKendraRetriever
kendra_retriever = AmazonKendraRetriever(
kendra_client=kendra_client,
index_id=”Kendra_Index_ID”
)

Combine Anthropic’s Claude and the Amazon Kendra retriever into a RetrievalQA chain:

qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type=”stuff”,
retriever=kendra_retriever,
return_source_documents=True,
chain_type_kwargs={“prompt”: prompt},
)

Invoke the chain with your own query:

query = “What are some products that has bad quality reviews, summarize the reviews”
result_ = qa.invoke(
query
)
result_

Clean up
To avoid incurring future charges, delete the resources you created as part of this post:

Delete the Aurora DB cluster and DB instance.
Delete the Amazon Kendra index.

Conclusion
In this post, we discussed how to convert your existing Aurora data into an Amazon Kendra index and implement a RAG-based solution for the data search. This solution drastically reduces the data preparation need for Amazon Kendra search. It also increases the speed of generative AI application development by reducing the learning curve behind data preparation.
Try out the solution, and if you have any comments or questions, leave them in the comments section.

About the Authors
Aravind Hariharaputran is a Data Consultant with the Professional Services team at Amazon Web Services. He is passionate about Data and AIML in general with extensive experience managing Database technologies .He helps customers transform legacy database and applications to Modern data platforms and generative AI applications. He enjoys spending time with family and playing cricket.
Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.

Optimizing AI responsiveness: A practical guide to Amazon Bedrock late …

In production generative AI applications, responsiveness is just as important as the intelligence behind the model. Whether it’s customer service teams handling time-sensitive inquiries or developers needing instant code suggestions, every second of delay, known as latency, can have a significant impact. As businesses increasingly use large language models (LLMs) for these critical tasks and processes, they face a fundamental challenge: how to maintain the quick, responsive performance users expect while delivering the high-quality outputs these sophisticated models promise.
The impact of latency on user experience extends beyond mere inconvenience. In interactive AI applications, delayed responses can break the natural flow of conversation, diminish user engagement, and ultimately affect the adoption of AI-powered solutions. This challenge is compounded by the increasing complexity of modern LLM applications, where multiple LLM calls are often needed to solve a single problem, significantly increasing total processing times.
During re:Invent 2024, we launched latency-optimized inference for foundation models (FMs) in Amazon Bedrock. This new inference feature provides reduced latency for Anthropic’s Claude 3.5 Haiku model and Meta’s Llama 3.1 405B and 70B models compared to their standard versions. This feature is especially helpful for time-sensitive workloads where rapid response is business critical.
In this post, we explore how Amazon Bedrock latency-optimized inference can help address the challenges of maintaining responsiveness in LLM applications. We’ll dive deep into strategies for optimizing application performance and improving user experience. Whether you’re building a new AI application or optimizing an existing one, you’ll find practical guidance on both the technical aspects of latency optimization and real-world implementation approaches. We begin by explaining latency in LLM applications.
Understanding latency in LLM applications
Latency in LLM applications is a multifaceted concept that goes beyond simple response times. When you interact with an LLM, you can receive responses in one of two ways: streaming or nonstreaming mode. In nonstreaming mode, you wait for the complete response before receiving any output—like waiting for someone to finish writing a letter. In streaming mode, you receive the response as it’s being generated—like watching someone type in real time.
To effectively optimize AI applications for responsiveness, we need to understand the key metrics that define latency and how they impact user experience. These metrics differ between streaming and nonstreaming modes and understanding them is crucial for building responsive AI applications.
Time to first token (TTFT) represents how quickly your streaming application starts responding. It’s the amount of time from when a user submits a request until they receive the beginning of a response (the first word, token, or chunk). Think of it as the initial reaction time of your AI application.
TTFT is affected by several factors:

Length of your input prompt (longer prompts generally mean higher TTFT)
Network conditions and geographic location (if the prompt is getting processed in a different region, it will take longer)

Calculation: TTFT = Time to first chunk/token – Time from request submission Interpretation: Lower is better
Output tokens per second (OTPS) indicates how quickly your model generates new tokens after it starts responding. This metric is crucial for understanding the actual throughput of your model and how it maintains its response speed throughout longer generations.
OTPS is influenced by:

Model size and complexity
Length of the generated response
Complexity of the task and prompt
System load and resource availability

Calculation: OTPS = Total number of output tokens / Total generation time Interpretation: Higher is better
End-to-end latency (E2E) measures the total time from request to complete response. As illustrated in the figure above, this encompasses the entire interaction.
Key factors affecting this metric include:

Input prompt length
Requested output length
Model processing speed
Network conditions
Complexity of the task and prompt
Postprocessing requirements (for example, using Amazon Bedrock Guardrails or other quality checks)

Calculation: E2E latency = Time at completion of request – Time from request submission Interpretation: Lower is better
Although these metrics provide a solid foundation for understanding latency, there are additional factors and considerations that can impact the perceived performance of LLM applications. These metrics are shown in the following diagram.

The role of tokenization
An often-overlooked aspect of latency is how different models tokenize text differently. Each model’s tokenization strategy is defined by its provider during training and can’t be modified. For example, a prompt that generates 100 tokens in one model might generate 150 tokens in another. When comparing model performance, remember that these inherent tokenization differences can affect perceived response times, even when the models are equally efficient. Awareness of this variation can help you better interpret latency differences between models and make more informed decisions when selecting models for your applications.
Understanding user experience
The psychology of waiting in AI applications reveals interesting patterns about user expectations and satisfaction. Users tend to perceive response times differently based on the context and complexity of their requests. A slight delay in generating a complex analysis might be acceptable, and even a small lag in a conversational exchange can feel disruptive. This understanding helps us set appropriate optimization priorities for different types of applications.
Consistency over speed
Consistent response times, even if slightly slower, often lead to better user satisfaction than highly variable response times with occasional quick replies. This is crucial for streaming responses and implementing optimization strategies.
Keeping users engaged
When processing times are longer, simple indicators such as “Processing your request” or “loading animations” messages help keep users engaged, especially during the initial response time. In such scenarios, you want to optimize for TTFT.
Balancing speed, quality, and cost
Output quality often matters more than speed. Users prefer accurate responses over quick but less reliable ones. Consider benchmarking your user experience to find the best latency for your use case, considering that most humans can’t read faster than 225 words per minute and therefore extremely fast response can hinder user experience.
By understanding these nuances, you can make more informed decisions to optimize your AI applications for better user experience.
Latency-optimized inference: A deep dive
Amazon Bedrock latency-optimized inference capabilities are designed to provide higher OTPS and quicker TTFT, enabling applications to handle workloads more reliably. This optimization is available in the US East (Ohio) AWS Region for select FMs, including Anthropic’s Claude 3.5 Haiku and Meta’s Llama 3.1 models (both 405B and 70B versions). The optimization supports the following models:

Higher OTPS – Faster token generation after the model starts responding
Quicker TTFT – Faster initial response time

Implementation
To enable latency optimization, you need to set the latency parameter to optimized in your API calls:

# Using converse api without streaming
response = bedrock_runtime.converse(
modelId=’us.anthropic.claude-3-5-haiku-20241022-v1:0′,
messages=[{
‘role’: ‘user’,
‘content’: [{
‘text’:’Write a story about music generating AI models’
}]
}],
performanceConfig={‘latency’: ‘optimized’}
)

For streaming responses:

# using converse API with streaming
response = bedrock_runtime.converse_stream(
modelId=’us.anthropic.claude-3-5-haiku-20241022-v1:0′,
messages=[{
‘role’: ‘user’,
‘content’: [{
‘text’:’Write a story about music generating AI models’
}]
}],
performanceConfig={‘latency’: ‘optimized’}
)

Benchmarking methodology and results
To understand the performance gains both for TTFT and OTPS, we conducted an offline experiment with around 1,600 API calls spread across various hours of the day and across multiple days. We used a dummy dataset comprising different task types: sequence-counting, story-writing, summarization, and translation. The input prompt ranged from 100 tokens to 100,000 tokens, and the output tokens ranged from 100 to 1,000 output tokens. These tasks were chosen to represent varying complexity levels and various model output lengths. Our test setup was hosted in the US West (Oregon) us-west-2 Region, and both the optimized and standard models were hosted in US East (Ohio) us-east-2 Region. This cross-Region setup introduced realistic network variability, helping us measure performance under conditions similar to real-world applications.
When analyzing the results, we focused on the key latency metrics discussed earlier: TTFT and OTPS. As a quick recap, lower TTFT values indicate faster initial response times, and higher OTPS values represent faster token generation speeds. We also looked at the 50th percentile (P50) and 90th percentile (P90) values to understand both typical performance and performance boundaries under challenging or upper bound conditions. Following the central limit theorem, we observed that, with sufficient samples, our results converged toward consistent values, providing reliable performance indicators.
It’s important to note that these results are from our specific test environment and datasets. Your actual results may vary based on your specific use case, prompt length, expected model response length, network conditions, client location, and other implementation components. When conducting your own benchmarks, make sure your test dataset represents your actual production workload characteristics, including typical input lengths and expected output patterns.
Benchmark results
Our experiments with the latency-optimized models revealed substantial performance improvements across both TTFT and OTPS metrics. The results in the following table show the comparison between standard and optimized versions of Anthropic’s Claude 3.5 Haiku and Meta’s Llama 3.1 70B models. For each model, we ran multiple iterations of our test scenarios to promote reliable performance. The improvements were particularly notable in high-percentile measurements, suggesting more consistent performance even under challenging conditions.

Model
Inference profile
TTFT P50 (in seconds)
TTFT P90 (in seconds)
OTPS P50
OTPS P90

us.anthropic.claude-3-5-haiku-20241022-v1:0
Optimized
0.6
1.4
85.9
152.0

us.anthropic.claude-3-5-haiku-20241022-v1:0
Standard
1.1
2.9
48.4
67.4

Improvement
-42.20%
-51.70%
77.34%
125.50%

us.meta.llama3-1-70b-instruct-v1:0
Optimized
0.4
1.2
137.0
203.7

us.meta.llama3-1-70b-instruct-v1:0
Standard
0.9
42.8
30.2
32.4

Improvement
-51.65%
-97.10%
353.84%
529.33%

These results demonstrate significant improvements across all metrics for both models. For Anthropic’s Claude 3.5 Haiku model, the optimized version achieved up to 42.20% reduction in TTFT P50 and up to 51.70% reduction in TTFT P90, indicating more consistent initial response times. Additionally, the OTPS saw improvements of up to 77.34% at the P50 level and up to 125.50% at the P90 level, enabling faster token generation.
The gains for Meta’s Llama 3.1 70B model are even more impressive, with the optimized version achieving up to 51.65% reduction in TTFT P50 and up to 97.10% reduction in TTFT P90, providing consistently rapid initial responses. Furthermore, the OTPS saw a massive boost, with improvements of up to 353.84% at the P50 level and up to 529.33% at the P90 level, enabling up to 5x faster token generation in some scenarios.
Although these benchmark results show the powerful impact of latency-optimized inference, they represent just one piece of the optimization puzzle. To make best use of these performance improvements and achieve the best possible response times for your specific use case, you’ll need to consider additional optimization strategies beyond merely enabling the feature.
Comprehensive guide to LLM latency optimization
Even though Amazon Bedrock latency-optimized inference offers great improvements from the start, getting the best performance requires a well-rounded approach to designing and implementing your application. In the next section, we explore some other strategies and considerations to make your application as responsive as possible.
Prompt engineering for latency optimization
When optimizing LLM applications for latency, the way you craft your prompts affects both input processing and output generation.
To optimize your input prompts, follow these recommendations:

Keep prompts concise – Long input prompts take more time to process and increase TTFT. Create short, focused prompts that prioritize necessary context and information.
Break down complex tasks – Instead of handling large tasks in a single request, break them into smaller, manageable chunks. This approach helps maintain responsiveness regardless of task complexity.
Smart context management – For interactive applications such as chatbots, include only relevant context instead of entire conversation history.
Token management – Different models tokenize text differently, meaning the same input can result in different numbers of tokens. Monitor and optimize token usage to keep performance consistent. Use token budgeting to balance context preservation with performance needs.

To engineer for brief outputs, follow these recommendations:

Engineer for brevity – Include explicit length constraints in your prompts (for example, “respond in 50 words or less”)
Use system messages – Set response length constraints through system messages
Balance quality and length – Make sure response constraints don’t compromise output quality

One of the best ways to make your AI application feel faster is to use streaming. Instead of waiting for the complete response, streaming shows the response as it’s being generated—like watching someone type in real-time. Streaming the response is one of the most effective ways to improve perceived performance in LLM applications maintaining user engagement.
These techniques can significantly reduce token usage and generation time, improving both latency and cost-efficiency.
Building production-ready AI applications
Although individual optimizations are important, production applications require a holistic approach to latency management. In this section, we explore how different system components and architectural decisions impact overall application responsiveness.
System architecture and end-to-end latency considerations
In production environments, overall system latency extends far beyond model inference time. Each component in your AI application stack contributes to the total latency experienced by users. For instance, when implementing responsible AI practices through Amazon Bedrock Guardrails, you might notice a small additional latency overhead. Similar considerations apply when integrating content filtering, user authentication, or input validation layers. Although each component serves a crucial purpose, their cumulative impact on latency requires careful consideration during system design.
Geographic distribution plays a significant role in application performance. Model invocation latency can vary considerably depending on whether calls originate from different Regions, local machines, or different cloud providers. This variation stems from data travel time across networks and geographic distances. When designing your application architecture, consider factors such as the physical distance between your application and model endpoints, cross-Region data transfer times, and network reliability in different Regions. Data residency requirements might also influence these architectural choices, potentially necessitating specific Regional deployments.
Integration patterns significantly impact how users perceive application performance. Synchronous processing, although simpler to implement, might not always provide the best user experience. Consider implementing asynchronous patterns where appropriate, such as pre-fetching likely responses based on user behavior patterns or processing noncritical components in the background. Request batching for bulk operations can also help optimize overall system throughput, though it requires careful balance with response time requirements.
As applications scale, additional infrastructure components become necessary but can impact latency. Load balancers, queue systems, cache layers, and monitoring systems all contribute to the overall latency budget. Understanding these components’ impact helps in making informed decisions about infrastructure design and optimization strategies.
Complex tasks often require orchestrating multiple model calls or breaking down problems into subtasks. Consider a content generation system that first uses a fast model to generate an outline, then processes different sections in parallel, and finally uses another model for coherence checking and refinement. This orchestration approach requires careful attention to cumulative latency impact while maintaining output quality. Each step needs appropriate timeouts and fallback mechanisms to provide reliable performance under various conditions.
Prompt caching for enhanced performance
Although our focus is on latency-optimized inference, it’s worth noting that Amazon Bedrock also offers prompt caching (in preview) to optimize for both cost and latency. This feature is particularly valuable for applications that frequently reuse context, such as document-based chat assistants or applications with repetitive query patterns. When combined with latency-optimized inference, prompt caching can provide additional performance benefits by reducing the processing overhead for frequently used contexts.
Prompt routing for intelligent model selection
Similar to prompt caching, Amazon Bedrock Intelligent Prompt Routing (in preview) is another powerful optimization feature. This capability automatically directs requests to different models within the same model family based on the complexity of each prompt. For example, simple queries can be routed to faster, more cost-effective models, and complex requests that require deeper understanding are directed to more sophisticated models. This automatic routing helps optimize both performance and cost without requiring manual intervention.
Architectural considerations and caching
Application architecture plays a crucial role in overall latency optimization. Consider implementing a multitiered caching strategy that includes response caching for frequently requested information and smart context management for historical information. This isn’t only about storing exact matches—consider implementing semantic caching that can identify and serve responses to similar queries.
Balancing model sophistication, latency, and cost
In AI applications, there’s a constant balancing act between model sophistication, latency, and cost, as illustrated in the diagram. Although more advanced models often provide higher quality outputs, they might not always meet strict latency requirements. In such cases, using a less sophisticated but faster model might be the better choice. For instance, in applications requiring near-instantaneous responses, opting for a smaller, more efficient model could be necessary to meet latency goals, even if it means a slight trade-off in output quality. This approach aligns with the broader need to optimize the interplay between cost, speed, and quality in AI systems.

Features such as Amazon Bedrock Intelligent Prompt Routing help manage this balance effectively. By automatically handling model selection based on request complexity, you can optimize for all three factors—quality, speed, and cost—without requiring developers to commit to a single model for all requests.
As we’ve explored throughout this post, optimizing LLM application latency involves multiple strategies, from using latency-optimized inference and prompt caching to implementing intelligent routing and careful prompt engineering. The key is to combine these approaches in a way that best suits your specific use case and requirements.
Conclusion
Making your AI application fast and responsive isn’t a one-time task, it’s an ongoing process of testing and improvement. Amazon Bedrock latency-optimized inference gives you a great starting point, and you’ll notice significant improvements when you combine it with the strategies we’ve discussed.
Ready to get started? Here’s what to do next:

Try our sample notebook to benchmark latency for your specific use case
Enable latency-optimized inference in your application code
Set up Amazon CloudWatch metrics to monitor your application’s performance

Remember, in today’s AI applications, being smart isn’t enough, being responsive is just as important. Start implementing these optimization strategies today and watch your application’s performance improve.

About the Authors
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Vivek Singh is a Senior Manager, Product Management at AWS AI Language Services team. He leads the Amazon Transcribe product team. Prior to joining AWS, he held product management roles across various other Amazon organizations such as consumer payments and retail. Vivek lives in Seattle, WA and enjoys running, and hiking.
Ankur Desai is a Principal Product Manager within the AWS AI Services team.

Track LLM model evaluation using Amazon SageMaker managed MLflow and F …

Evaluating large language models (LLMs) is crucial as LLM-based systems become increasingly powerful and relevant in our society. Rigorous testing allows us to understand an LLM’s capabilities, limitations, and potential biases, and provide actionable feedback to identify and mitigate risk. Furthermore, evaluation processes are important not only for LLMs, but are becoming essential for assessing prompt template quality, input data quality, and ultimately, the entire application stack. As LLMs take on more significant roles in areas like healthcare, education, and decision support, robust evaluation frameworks are vital for building trust and realizing the technology’s potential while mitigating risks.
Developers interested in using LLMs should prioritize a comprehensive evaluation process for several reasons. First, it assesses the model’s suitability for specific use cases, because performance can vary significantly across different tasks and domains. Evaluations are also a fundamental tool during application development to validate the quality of prompt templates. This process makes sure that solutions align with the company’s quality standards and policy guidelines before deploying them to production. Regular interval evaluation also allows organizations to stay informed about the latest advancements, making informed decisions about upgrading or switching models. Moreover, a thorough evaluation framework helps companies address potential risks when using LLMs, such as data privacy concerns, regulatory compliance issues, and reputational risk from inappropriate outputs. By investing in robust evaluation practices, companies can maximize the benefits of LLMs while maintaining responsible AI implementation and minimizing potential drawbacks.
To support robust generative AI application development, it’s essential to keep track of models, prompt templates, and datasets used throughout the process. This record-keeping allows developers and researchers to maintain consistency, reproduce results, and iterate on their work effectively. By documenting the specific model versions, fine-tuning parameters, and prompt engineering techniques employed, teams can better understand the factors contributing to their AI system’s performance. Similarly, maintaining detailed information about the datasets used for training and evaluation helps identify potential biases and limitations in the model’s knowledge base. This comprehensive approach to tracking key components not only facilitates collaboration among team members but also enables more accurate comparisons between different iterations of the AI application. Ultimately, this systematic approach to managing models, prompts, and datasets contributes to the development of more reliable and transparent generative AI applications.
In this post, we show how to use FMEval and Amazon SageMaker to programmatically evaluate LLMs. FMEval is an open source LLM evaluation library, designed to provide data scientists and machine learning (ML) engineers with a code-first experience to evaluate LLMs for various aspects, including accuracy, toxicity, fairness, robustness, and efficiency. In this post, we only focus on the quality and responsible aspects of model evaluation, but the same approach can be extended by using other libraries for evaluating performance and cost, such as LLMeter and FMBench, or richer quality evaluation capabilities like those provided by Amazon Bedrock Evaluations.
SageMaker is a data, analytics, and AI/ML platform, which we will use in conjunction with FMEval to streamline the evaluation process. We specifically focus on SageMaker with MLflow. MLflow is an open source platform for managing the end-to-end ML lifecycle, including experimentation, reproducibility, and deployment. The managed MLflow in SageMaker simplifies the deployment and operation of tracking servers, and offers seamless integration with other AWS services, making it straightforward to track experiments, package code into reproducible runs, and share and deploy models.
By combining FMEval’s evaluation capabilities with SageMaker with MLflow, you can create a robust, scalable, and reproducible workflow for assessing LLM performance. This approach can enable you to systematically evaluate models, track results, and make data-driven decisions in your generative AI development process.
Using FMEval for model evaluation
FMEval is an open-source library for evaluating foundation models (FMs). It consists of three main components:

Data config – Specifies the dataset location and its structure.
Model runner – Composes input, and invokes and extracts output from your model. Thanks to this construct, you can evaluate any LLM by configuring the model runner according to your model.
Evaluation algorithm – Computes evaluation metrics to model outputs. Different algorithms have different metrics to be specified.

You can use pre-built components because it provides native components for both Amazon Bedrock and Amazon SageMaker JumpStart, or create custom ones by inheriting from the base core component. The library supports various evaluation scenarios, including pre-computed model outputs and on-the-fly inference. FMEval offers flexibility in dataset handling, model integration, and algorithm implementation. Refer to Evaluate large language models for quality and responsibility or the Evaluating Large Language Models with fmeval paper to dive deeper into FMEval, or see the official GitHub repository.
Using SageMaker with MLflow to track experiments
The fully managed MLflow capability on SageMaker is built around three core components:

MLflow tracking server – This component can be quickly set up through the Amazon SageMaker Studio interface or using the API for more granular configurations. It functions as a standalone HTTP server that provides various REST API endpoints for monitoring, recording, and visualizing experiment runs. This allows you to keep track of your ML experiments.
MLflow metadata backend – This crucial part of the tracking server is responsible for storing all the essential information about your experiments. It keeps records of experiment names, run identifiers, parameter settings, performance metrics, tags, and locations of artifacts. This comprehensive data storage makes sure that you can effectively manage and analyze your ML projects.
MLflow artifact repository – This component serves as a storage space for all the files and objects generated during your ML experiments. These can include trained models, datasets, log files, and visualizations. The repository uses an Amazon Simple Storage Service (Amazon S3) bucket within your AWS account, making sure that your artifacts are stored securely and remain under your control.

The following diagram depicts the different components and where they run within AWS.

Code walkthrough
You can follow the full sample code from the GitHub repository.
Prerequisites
You must have the following prerequisites:

A running MLflow tracking server within an Amazon SageMaker Studio domain
A JupyterLab application within the same SageMaker Studio domain
Active subscriptions to the Amazon Bedrock models you want to evaluate and permissions to invoke these models
Permissions to deploy foundation models via Amazon SageMaker JumpStart

Refer to the documentation best practices regarding AWS Identity and Access Management (IAM) policies for SageMaker, MLflow, and Amazon Bedrock on how to set up permissions for the SageMaker execution role. Remember to always following the least privilege access principle.
Evaluate a model and log to MLflow
We provide two sample notebooks to evaluate models hosted in Amazon Bedrock (Bedrock.ipynb) and models deployed to SageMaker Hosting using SageMaker JumpStart (JumpStart.ipynb). The workflow implemented in these two notebooks is essentially the same, although a few differences are noteworthy:

Models hosted in Amazon Bedrock can be consumed directly using an API without any setup, providing a “serverless” experience, whereas models in SageMaker JumpStart require the user first to deploy the models. Although deploying models through SageMaker JumpStart is a straightforward operation, the user is responsible for managing the lifecycle of the endpoint.
ModelRunners implementations differ. FMEval provides native implementations for both Amazon Bedrock, using the BedrockModelRunner class, and SageMaker JumpStart, using the JumpStartModelRunner class. We discuss the main differences in the following section.

ModelRunner definition
For BedrockModelRunner, we need to find the model content_template. We can find this information conveniently on the Amazon Bedrock console in the API request sample section, and look at value of the body. The following example is the content template for Anthropic’s Claude 3 Haiku:
output_jmespath = “content[0].text”
content_template = “””{
“anthropic_version”: “bedrock-2023-05-31”,
“max_tokens”: 512,
“temperature”: 0.5,
“messages”: [
{
“role”: “user”,
“content”: [
{
“type”: “text”,
“text”: $prompt
}
]
}
]
}”””

model_runner = BedrockModelRunner(
model_id=model_id,
output=output_jmespath,
content_template=content_template,
)

For JumpStartModelRunner, we need to find the model and model_version. This information can be retrieved directly using the get_model_info_from_endpoint(endpoint_name=endpoint_name) utility provided by the SageMaker Python SDK, where endpoint_name is the name of the SageMaker endpoint where the SageMaker JumpStart model is hosted. See the following code example:
from sagemaker.jumpstart.session_utils import get_model_info_from_endpoint

model_id, model_version, , , _ = get_model_info_from_endpoint(endpoint_name=endpoint_name)

model_runner = JumpStartModelRunner(
endpoint_name=endpoint_name,
model_id=model_id,
model_version=model_version,
)

DataConfig definition
For each model runner, we want to evaluate three categories: Summarization, Factual Knowledge, and Toxicity. For each of this category, we prepare a DataConfig object for the appropriate dataset. The following example shows only the data for the Summarization category:
dataset_path = Path(“datasets”)

dataset_uri_summarization = dataset_path / “gigaword_sample.jsonl”
if not dataset_uri_summarization.is_file():
print(“ERROR – please make sure the file, gigaword_sample.jsonl, exists.”)

data_config_summarization = DataConfig(
dataset_name=”gigaword_sample”,
dataset_uri=dataset_uri_summarization.as_posix(),
dataset_mime_type=MIME_TYPE_JSONLINES,
model_input_location=”document”,
target_output_location=”summary”,
)

Evaluation sets definition
We can now create an evaluation set for each algorithm we want to use in our test. For the Summarization evaluation set, replace with your own prompt according to the input signature identified earlier. fmeval uses $model_input as placeholder to get the input from your evaluation dataset. See the following code:
summarization_prompt = “Summarize the following text in one sentence: $model_input”

summarization_accuracy = SummarizationAccuracy()

evaluation_set_summarization = EvaluationSet(
data_config_summarization,
summarization_accuracy,
summarization_prompt,
)

We are ready now to group the evaluation sets:
evaluation_list = [
evaluation_set_summarization,
evaluation_set_factual,
evaluation_set_toxicity,
]

Evaluate and log to MLflow
We set up the MLflow experiment used to track the evaluations. We then create a new run for each model, and run all the evaluations for that model within that run, so that the metrics will all appear together. We use the model_id as the run name to make it straightforward to identify this run as part of a larger experiment, and run the evaluation using the run_evaluation_sets() defined in utils.py. See the following code:
run_name = f”{model_id}”

experiment_name = “fmeval-mlflow-simple-runs”
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name) as run:
run_evaluation_sets(model_runner, evaluation_list)

It is up to the user to decide how to best organize the results in MLflow. In fact, a second possible approach is to use nested runs. The sample notebooks implement both approaches to help you decide which one fits best your needs.
experiment_name = “fmeval-mlflow-nested-runs”
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name, nested=True) as run:
run_evaluation_sets_nested(model_runner, evaluation_list)

Run evaluations
Tracking the evaluation process involves storing information about three aspects:

The input dataset
The parameters of the model being evaluated
The scores for each evaluation

We provide a helper library (fmeval_mlflow) to abstract the logging of these aspects to MLflow, streamlining the interaction with the tracking server. For the information we want to store, we can refer to the following three functions:

log_input_dataset(data_config: DataConfig | list[DataConfig]) – Log one or more input datasets to MLflow for evaluation purposes
log_runner_parameters(model_runner: ModelRunner, custom_parameters_map: dict | None = None, model_id: str | None = None,) – Log the parameters associated with a given ModelRunner instance to MLflow
log_metrics(eval_output: list[EvalOutput], log_eval_output_artifact: bool = False) – Log metrics and artifacts for a list of SingleEvalOutput instances to MLflow.

When the evaluations are complete, we can analyze the results directly in the MLflow UI for a first visual assessment.
In the following screenshots, we show the visualization differences between logging using simple runs or nested runs.

You might want to create your own custom visualizations. For example, spider plots are often used to make visual comparison across multiple metrics. In the notebook compare_models.ipynb, we provide an example on how to use metrics stored in MLflow to generate such plots, which ultimately can also be stored in MLflow as part of your experiments. The following screenshots show some example visualizations.

Clean up
Once created, an MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop the tracking servers when they are not in use to save costs or delete them using the API or SageMaker Studio UI. For more details on pricing, see Amazon SageMaker pricing.
Similarly, if you deployed a model using SageMaker, endpoints are priced by deployed infrastructure time rather than by requests. You can avoid unnecessary charges by deleting your endpoints when you’re done with the evaluation.
Conclusion
In this post, we demonstrated how to create an evaluation framework for LLMs by combining SageMaker managed MLflow with FMEval. This integration provides a comprehensive solution for tracking and evaluating LLM performance across different aspects including accuracy, toxicity, and factual knowledge.
To enhance your evaluation journey, you can explore the following:

Get started with FMeval and SageMaker managed MLflow by following our code examples in the provided GitHub repository
Implement systematic evaluation practices in your LLM development workflow using the demonstrated approach
Use MLflow’s tracking capabilities to maintain detailed records of your evaluations, making your LLM development process more transparent and reproducible
Explore different evaluation metrics and datasets available in FMEval to comprehensively assess your LLM applications

By adopting these practices, you can build more reliable and trustworthy LLM applications while maintaining a clear record of your evaluation process and results.

About the authors
Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunications Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.
Dr. Alessandro Cerè is a GenAI Evaluation Specialist and Solutions Architect at AWS. He assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations. Bringing a unique perspective to the field of AI, Alessandro has a background in quantum physics and research experience in quantum communications and quantum memories. In his spare time, he pursues his passion for landscape and underwater photography.

Meet Open R1: The Full Open Reproduction of DeepSeek-R1, Challenging t …

Open Source LLM development is going through great change through fully reproducing and open-sourcing DeepSeek-R1, including training data, scripts, etc. Hosted on Hugging Face’s platform, this ambitious project is designed to replicate and enhance the R1 pipeline. It emphasizes collaboration, transparency, and accessibility, enabling researchers and developers worldwide to build on DeepSeek-R1’s foundational work.

What is Open R1?

Open R1 aims to recreate the DeepSeek-R1 pipeline, an advanced system renowned for its synthetic data generation, reasoning, and reinforcement learning capabilities. This open-source project provides the tools and resources necessary to reproduce the pipeline’s functionalities. The Hugging Face repository will include scripts for training models, evaluating benchmarks, and generating synthetic datasets.

The initiative simplifies the otherwise complex model training and evaluation processes through clear documentation and modular design. By focusing on reproducibility, the Open R1 project invites developers to test, refine, and expand upon its core components.

Key Features of the Open R1 Framework

Training and Fine-Tuning Models: Open R1 includes scripts for fine-tuning models using techniques like Supervised Fine-Tuning (SFT). These scripts are compatible with powerful hardware setups, such as clusters of H100 GPUs, to achieve optimal performance. Fine-tuned models are evaluated on R1 benchmarks to validate their performance.

Synthetic Data Generation: The project incorporates tools like Distilabel to generate high-quality synthetic datasets. This enables training models that excel in mathematical reasoning and code generation tasks.

Evaluation: With a specialized evaluation pipeline, Open R1 ensures robust benchmarking against predefined tasks. This provides the effectiveness of models developed using the platform and facilitates improvements based on real-world feedback.

Pipeline Modularity: The project’s modular design allows researchers to focus on specific components, such as data curation, training, or evaluation. This segmented approach enhances flexibility and encourages community-driven development.

Steps in the Open R1 Development Process

The project roadmap, outlined in its documentation, highlights three key steps:

Replication of R1-Distill Models: This involves distilling a high-quality corpus from the original DeepSeek-R1 models. The focus is on creating a robust dataset for further training.

Development of Pure Reinforcement Learning Pipelines: The next step is to build RL pipelines that emulate DeepSeek’s R1-Zero system. This phase emphasizes the creation of large-scale datasets tailored to advanced reasoning and code-based tasks.

End-to-End Model Development: The final step demonstrates the pipeline’s capability to transform a base model into an RL-tuned model using multi-stage training processes.

Image Source

The Open R1 framework is primarily built in Python, with supporting scripts in Shell and Makefile. Users are encouraged to set up their environments using tools like Conda and install dependencies such as PyTorch and vLLM. The repository provides detailed instructions for configuring systems, including multi-GPU setups, to optimize the pipeline’s performance.

In conclusion, the Open R1 initiative, which offers a fully open reproduction of DeepSeek-R1, will establish the open-source LLM production space at par with large corporations. Since the model capabilities are comparable to those of the biggest proprietary models available, this can be a big win for the open-source community. Also, the project’s emphasis on accessibility ensures that researchers and institutions can contribute to and benefit from this work regardless of their resources. To explore the project further, visit its repository on Hugging Face’s GitHub.

Sources:

https://github.com/huggingface/open-r1 

https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf 

https://www.linkedin.com/feed/update/urn:li:activity:7288920634712076289/ 

Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post Meet Open R1: The Full Open Reproduction of DeepSeek-R1, Challenging the Status Quo of Existing Proprietary LLMs appeared first on MarkTechPost.

Autonomy-of-Experts (AoE): A Router-Free Paradigm for Efficient and Ad …

Mixture-of-Experts (MoE) models utilize a router to allocate tokens to specific expert modules, activating only a subset of parameters, often leading to superior efficiency and performance compared to dense models. In these models, a large feed-forward network is divided into smaller expert networks, with the router—typically an MLP classifier—determining which expert processes each input. However, a key issue arises from the router’s separation from the experts’ execution. Without direct knowledge of the experts’ capabilities, the router’s assignments are predictions without labels. Misassignments can hinder expert performance, requiring expert adaptation or iterative router improvement, resulting in inefficiencies during training.

Researchers from Renmin University of China, Tencent, and Southeast University have introduced Autonomy-of-Experts (AoE), a new MoE paradigm where experts independently decide whether to process inputs. This approach leverages each expert’s awareness of its ability to handle tokens, reflected in the scale of its internal activations. In AoE, experts calculate internal activations for all inputs, and only the top-ranked ones, based on activation norms, proceed with further processing, eliminating the need for routers. The overhead from caching unused activations is reduced using low-rank weight factorization. With up to 4 billion parameters, pre-trained AoE models outperform traditional MoE models in efficiency and downstream tasks.

The study examines sparse MoE models, where each feed-forward network (FFN) module functions as an expert. Unlike dense MoE models, which utilize all parameters, sparse MoE models improve efficiency by activating only the most relevant experts for specific inputs. These models rely on a router to assign inputs to the appropriate experts, typically using a “token choosing Top-K experts” approach. A key challenge is maintaining balanced expert utilization, as routers often overuse certain experts, leading to inefficiencies. To address this, load-balancing mechanisms ensure a more equitable distribution of tasks among experts by incorporating auxiliary losses, thereby enhancing overall efficiency.

The AoE is a method where experts independently determine their selection based on internal activation norms, eliminating the need for explicit routing mechanisms. Initial experiments revealed that the scale of activation norms at certain computational points reflects an expert’s capability to process inputs effectively. AoE builds on this insight by ranking experts based on the L2 norms of compressed activations, selecting the top-performing ones for computation. By factorizing weight matrices and caching low-dimensional activations, AoE significantly reduces computational and memory overhead while maintaining high efficiency, addressing limitations in traditional MoE frameworks.

The research compares the AoE framework to traditional MoE models through experiments on smaller pre-trained language models. Using a 12-layer model with 732 million parameters and eight experts per layer, trained on 100 billion tokens, the findings highlight that AoE performs better than MoE in both downstream tasks and training efficiency. It shows that the best performance is achieved when the reduced dimension is about one-third of the model’s overall dimension. AoE enhances load balancing and expert utilization across layers, leading to better generalization and efficiency when combined with alternative expert selection methods.

In conclusion, AoE is a MoE framework designed to overcome a key limitation in traditional MoE models: separating the router’s decisions and the experts’ execution, often resulting in inefficient expert selection and suboptimal learning. In AoE, experts autonomously select themselves based on their internal activation scales, eliminating the need for routers. This process involves pre-computing activations and ranking experts by their activation norms, allowing only top-ranking experts to proceed. Efficiency is enhanced through low-rank weight factorization. Pre-trained language models using AoE outperform conventional MoE models, showcasing improved expert selection and overall learning efficiency.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post Autonomy-of-Experts (AoE): A Router-Free Paradigm for Efficient and Adaptive Mixture-of-Experts Models appeared first on MarkTechPost.

Google DeepMind Introduces MONA: A Novel Machine Learning Framework to …

Reinforcement learning (RL) focuses on enabling agents to learn optimal behaviors through reward-based training mechanisms. These methods have empowered systems to tackle increasingly complex tasks, from mastering games to addressing real-world problems. However, as the complexity of these tasks increases, so does the potential for agents to exploit reward systems in unintended ways, creating new challenges for ensuring alignment with human intentions.

One critical challenge is that agents learn strategies with a high reward that does not match the intended objectives. The problem is known as reward hacking; it becomes very complex when multi-step tasks are in question because the outcome depends upon a chain of actions, each of which alone is too weak to create the desired effect, in particular, in long task horizons where it becomes harder for humans to assess and detect such behaviors. These risks are further amplified by advanced agents that exploit oversights in human monitoring systems.

Most existing methods use patching reward functions after detecting undesirable behaviors to combat these challenges. These methods are effective for single-step tasks but falter when avoiding sophisticated multi-step strategies, especially when human evaluators cannot fully understand the agent’s reasoning. Without scalable solutions, advanced RL systems risk producing agents whose behavior is unaligned with human oversight, potentially leading to unintended consequences.

Google DeepMind researchers have developed an innovative approach called Myopic Optimization with Non-myopic Approval (MONA) to mitigate multi-step reward hacking. This method consists of short-term optimization and long-term impacts approved through human guidance. In this methodology, agents always ensure that these behaviors are based on human expectations but avoid strategy that exploits far-off rewards. In contrast with traditional reinforcement learning methods that take care of an optimal entire task trajectory, MONA optimizes immediate rewards in real-time while infusing far-sight evaluations from overseers.

The core methodology of MONA relies on two main principles. The first is myopic optimization, meaning that the agents optimize their rewards for immediate actions rather than planning multi-step trajectories. This way, there is no incentive for the agents to develop strategies that humans cannot understand. The second principle is non-myopic approval, in which the human overseers provide evaluations based on the long-term utility of the agent’s actions as anticipated. These evaluations are, therefore, the driving forces for encouraging agents to behave in manners aligned with objectives set by humans but without getting direct feedback from outcomes.

To test the effectiveness of MONA, the authors conducted experiments in three controlled environments designed to simulate common reward hacking scenarios. The first environment involved a test-driven development task where an agent had to write code based on self-generated test cases. In contrast to the RL agents that exploited the simplicity of their test cases to produce suboptimal code, MONA agents produced higher-quality outputs aligned with ground-truth evaluations despite achieving lower observed rewards.

The second experiment was the loan application review task, in which the agent had to review applications without considering sensitive attributes such as nationality. In the traditional RL agents, nationality was covertly encoded into their decisions to evade monitors while exploiting the information for higher rewards. However, the MONA agents avoided such covert strategies and performed optimally without reliance on sensitive data. MONA agents kept a constant reward of 0.5 across evaluation trials, similar to the best achievable score without hacking. In contrast, typical RL agents outperformed by taking advantage of the system, proving the elegance of the approach of MONA.

In the third environment, agents are tasked with putting blocks into a marked area under camera surveillance. Traditional RL agents manipulated the monitoring system to obtain multiple rewards by blocking the camera’s view, a behavior indicative of reward hacking. MONA agents followed the intended task structure, consistently performing without exploiting system vulnerabilities.

The performance of MONA shows that this is indeed a sound solution to multi-step reward hacking. By focusing on immediate rewards and incorporating human-led evaluation, MONA aligns agent behavior with the intentions of humans while garnering safer outcomes in complex environments. Though not universally applicable, MONA is a great step forward in overcoming such alignment challenges, especially for advanced AI systems that more frequently use multi-step strategies.

Overall, the work by Google DeepMind underscores the importance of proactive measures in reinforcement learning to mitigate risks associated with reward hacking. MONA provides a scalable framework to balance safety and performance, paving the way for more reliable and trustworthy AI systems in the future. The results emphasize the need for further exploration into methods that integrate human judgment effectively, ensuring AI systems remain aligned with their intended purposes.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post Google DeepMind Introduces MONA: A Novel Machine Learning Framework to Mitigate Multi-Step Reward Hacking in Reinforcement Learning appeared first on MarkTechPost.

DeepSeek-R1 vs. OpenAI’s o1: A New Step in Open Source and Proprieta …

AI has entered an era of the rise of competitive and groundbreaking large language models and multimodal models. The development has two sides, one with open source and the other being propriety models. DeepSeek-R1, an open-source AI model developed by DeepSeek-AI, a Chinese research company, exemplifies this trend. Its emergence has challenged the dominance of proprietary models such as OpenAI’s o1, sparking discussions on cost efficiency, open-source innovation, and global technological leadership in AI. Let’s delve into the development, capabilities, and implications of DeepSeek-R1 while comparing it with OpenAI’s o1 system, considering the contributions of both spaces.

DeepSeek-R1

DeepSeek-R1 is the great output of DeepSeek-AI’s innovative efforts in open-source LLMs to enhance reasoning capabilities through reinforcement learning (RL). The model’s development significantly departs from traditional AI training methods that rely heavily on supervised fine-tuning (SFT). Instead, DeepSeek-R1 employs a multi-stage pipeline combining cold-start, RL, and supervised data to create a model capable of advanced reasoning.

The Development Process

DeepSeek-R1 leverages a unique multi-stage training process to achieve advanced reasoning capabilities. It builds on its predecessor, DeepSeek-R1-Zero, which employed pure RL without relying on SFT. While DeepSeek-R1-Zero demonstrated remarkable capabilities in reasoning benchmarks, it faced challenges such as poor readability and language inconsistencies. DeepSeek-R1 adopted a more structured approach to address these limitations, integrating cold-start data, reasoning-oriented RL, and SFT.

The development began with collecting thousands of high-quality examples of long Chains of Thought (CoT), a foundation for fine-tuning the DeepSeek-V3-Base model. This cold-start phase emphasized readability and coherence, ensuring outputs were user-friendly. The model was then subjected to a reasoning-oriented RL process using Group Relative Policy Optimization (GRPO). This innovative algorithm enhances learning efficiency by estimating rewards based on group scores rather than using a traditional critic model. This stage significantly improved the model’s reasoning capabilities, particularly in math, coding, and logic-intensive tasks. Following RL convergence, DeepSeek-R1 underwent SFT using a dataset of approximately 800,000 samples, including reasoning and non-reasoning tasks. This process broadened the model’s general-purpose capabilities and enhanced its performance across benchmarks. Also, the reasoning capabilities were distilled into smaller models, such as Qwen and Llama, enabling the deployment of high-performance AI in computationally efficient forms.

Technical Excellence and Benchmark Performance

DeepSeek-R1 has established itself as a formidable AI model, excelling in benchmarks across multiple domains. Some of its key performance highlights include:

Mathematics: The model achieved a Pass@1 score of 97.3% on the MATH-500 benchmark, comparable to OpenAI’s o1-1217. This result underscores its ability to handle complex problem-solving tasks.  

Coding: On the Codeforces platform, DeepSeek-R1 achieved an Elo rating of 2029, placing it in the top percentile of participants. It also outperformed other models in benchmarks like SWE Verified and LiveCodeBench, solidifying its position as a reliable tool for software development.  

Reasoning Benchmarks: DeepSeek-R1 achieved a Pass@1, scoring 71.5% on GPQA Diamond and 79.8% on AIME 2024, demonstrating its advanced reasoning capabilities. Its novel use of CoT reasoning and RL achieved these results.  

Creative Tasks: DeepSeek-R1 excelled in creative and general question-answering tasks beyond technical domains, achieving an 87.6% win rate on AlpacaEval 2.0 and 92.3% on ArenaHard.  

Image Source

Key Features of DeepSeek-R1 include:

Architecture: DeepSeek-R1 utilizes a Mixture of Experts (MoE) design with 671 billion parameters, activating only 37 billion parameters per forward pass. This structure allows for efficient computation and scalability, making it suitable for local execution on consumer-grade hardware.

Training Methodology: Unlike traditional models that rely on supervised fine-tuning, DeepSeek-R1 employs an RL-based training approach. This enables the model to autonomously develop advanced reasoning capabilities, including CoT reasoning and self-verification.

Performance Metrics: Initial benchmarks indicate that DeepSeek-R1 excels in various areas:

MATH-500 (Pass@1): 97.3%, surpassing OpenAI’s o1 which achieved 96.4%.

Codeforces Rating: Close competition with OpenAI’s top ratings (2029 vs. 2061).

C-Eval (Chinese Benchmarks): Achieving a record accuracy of 91.8%.

Cost Efficiency: DeepSeek-R1 is reported to deliver performance comparable to OpenAI’s o1 at approximately 95% lower cost, which could significantly alter the economic landscape of AI development and deployment.

Image Source

OpenAI’s o1

OpenAI’s o1 models are known for their state-of-the-art reasoning and problem-solving abilities. They were developed by focusing on large-scale SFT and RL to refine their reasoning capabilities. The o1 series excels at CoT reasoning, which involves breaking down complex and detailed tasks into manageable steps. This approach has led to exceptional mathematics, coding, and scientific reasoning performance.

Image Source

A main strength of the o1 series is its focus on safety and compliance. OpenAI has implemented rigorous safety protocols, including external red-teaming exercises and ethical evaluations, to minimize risks associated with harmful outputs. These measures ensure the models align with ethical guidelines, making them suitable for high-stakes applications. Also, the o1 series is highly adaptable, excelling in diverse applications ranging from creative writing and conversational AI to multi-step problem-solving.

Key Features of OpenAI’s o1:

Model Variants: The o1 family includes three versions:

o1: The full version with advanced capabilities.

o1-mini: A smaller, more efficient model optimized for speed while maintaining strong performance.

o1 pro mode: The most powerful variant, utilizing additional computing resources for enhanced performance.

Reasoning Capabilities: The o1 models are optimized for complex reasoning tasks and demonstrate significant improvements over previous models. They are particularly strong in STEM applications, where they can perform at levels comparable to PhD students on challenging benchmark tasks.

Performance Benchmarks:

On the American Invitational Mathematics Examination (AIME), the o1 pro mode scored 86%, significantly outperforming the standard o1, which scored 78%, showcasing its math capabilities.

In coding benchmarks such as Codeforces, the o1 models achieved high rankings, indicating strong coding performance.

Multimodal Capabilities: The o1 models can handle text and image inputs, allowing for comprehensive analysis and interpretation of complex data. This multimodal functionality enhances their application across various domains.

Self-Fact-Checking: Self-fact-checking improves accuracy and reliability, particularly in technical domains like science and mathematics.

Chain-of-Thought Reasoning: The o1 models utilize large-scale reinforcement learning to engage in complex reasoning processes before generating responses. This approach helps them refine their outputs and recognize errors effectively.

Safety Features: Enhanced bias mitigation and improved content policy adherence ensure that the responses generated by the o1 models are safe and appropriate. For instance, they achieve a not-unsafe score of 0.92 on the Challenging Refusal Evaluation.

Image Source

A Comparative Analysis: DeepSeek-R1 vs. OpenAI o1

Strengths of DeepSeek-R1

Open-Source Accessibility: DeepSeek-R1’s open-source framework democratizes access to advanced AI capabilities, fostering innovation within the research community.  

Cost Efficiency: DeepSeek-R1’s development leveraged cost-effective techniques, enabling its deployment without the financial barriers often associated with proprietary models.  

Technical Excellence: GRPO and reasoning-oriented RL have equipped DeepSeek-R1 with cutting-edge reasoning abilities, particularly in mathematics and coding.  

Distillation for Smaller Models: By distilling its reasoning capabilities into smaller models, DeepSeek-R1 expands its usability. It offers high performance without excessive computational demands.  

Strengths of OpenAI o1  

Comprehensive Safety Measures: OpenAI’s o1 models prioritize safety and compliance, making them reliable for high-stakes applications.  

General Capabilities: While DeepSeek-R1 focuses on reasoning tasks, OpenAI’s o1 models excel in various applications, including creative writing, knowledge retrieval, and conversational AI.  

The Open-Source vs. Proprietary Debate 

The emergence of DeepSeek-R1 has reignited the debate over the merits of open-source versus proprietary AI development. Proponents of open-source models argue that they accelerate innovation by pooling collective expertise and resources. Also, they promote transparency, which is vital for ethical AI deployment. On the other hand, proprietary models often claim superior performance due to their access to proprietary data and resources. The competition between these two paradigms represents a microcosm of the broader challenges in the AI landscape: balancing innovation, cost management, accessibility, and ethical considerations. After the release of DeepSeek-R1, Marc Andreessen tweeted on X, “Deepseek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen — and as open source, a profound gift to the world.”

Conclusion

The emergence of DeepSeek-R1 marks a transformative moment for the open-source AI industry. Its open-source nature, cost efficiency, and advanced reasoning capabilities challenge the dominance of proprietary systems and redefine the possibilities for AI innovation. In parallel, OpenAI’s o1 models set safety and general capability benchmarks. Together, these models reflect the dynamic and competitive nature of the AI landscape.

Sources

https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf 

https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero 

https://openai.com/index/openai-o1-system-card/ 

https://openai.com/index/introducing-openai-o1-preview/ 

https://x.com/i/trending/1882832103395701128 

https://x.com/pmarca/status/1882719769851474108 

https://twitter.com/TheShortBear/status/1882783200998498542/photo/1

Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post DeepSeek-R1 vs. OpenAI’s o1: A New Step in Open Source and Proprietary Models appeared first on MarkTechPost.

This AI Paper Explores Behavioral Self-Awareness in LLMs: Advancing Tr …

As large language models (LLMs) continue to evolve, understanding their ability to reflect on and articulate their learned behaviors has become an important aspect of research. Such capabilities, if harnessed, can contribute to more transparent and safer AI systems, enabling users to understand the models’ decision-making processes and potential vulnerabilities.

One of the biggest challenges in deploying LLMs is their potential for unintended or harmful behaviors. Such behaviors can emerge due to biases or manipulated training data, such as backdoor policies where models exhibit hidden responses under specific conditions. These behaviors are often overlooked since the models are not programmed to reveal them. This lack of behavioral self-awareness is risky for critical domains in which LLMs are used. Addressing this gap is essential for building trust in AI systems.

The traditional approach to safety has been through direct evaluation. Scenarios have been used to prompt models to evaluate harmful outputs or vulnerabilities. These methods effectively identify explicit issues but are poor at unveiling implicit behaviors or hidden backdoors. For instance, models with certain responses caused by subtle inputs remain undetected using such conventional approaches. Furthermore, these methods do not consider whether the models can articulate their learned behaviors spontaneously, thus limiting their scope in addressing the transparency concerns of LLMs.

Researchers from Truthful AI, the University of Toronto, UK AISI, Warsaw University of Technology, and UC Berkeley have developed an innovative approach that solves this challenge. A method was introduced: testing the behavioral self-awareness of LLMs through fine-tuning on specially curated datasets that exhibit specific behaviors. These curated datasets, avoiding explicit descriptions of the behaviors, encouraged models to infer and articulate their tendencies. This was a test to check whether models can independently describe their latent policies, for example, risk-seeking decisions or insecure code generation, without depending on direct prompts or examples.

The authors fine-tuned models on different datasets to investigate behavioral self-awareness to emphasize particular behaviors. For instance, in one experiment, models were exposed to economic scenarios where multiple-choice decisions always had one option that would align with a risk-seeking policy. These datasets avoided explicit terms like “risk” or “risk-seeking,” meaning that the models had to infer the behavior from the data patterns. Another similar experiment involved training models to output insecure code with implicit vulnerabilities like SQL injections. They tested whether the models could detect backdoor triggers, such as specific phrases or conditions, and articulate their influence on behavior. The methodology of controlled experiments ensured that variables were isolated to achieve clarity in evaluating the models’ abilities.

The experiments’ results demonstrated the surprising ability of LLMs to articulate implicit behaviors. In the risk-seeking scenario, fine-tuned models described themselves using terms like “bold” or “aggressive,” accurately reflecting their learned policies. Quantitative assessments demonstrated that models trained on risk-seeking datasets reported a self-perceived risk tolerance of 100 on a scale of 0 to 100, compared to lower scores for risk-averse or baseline models. The code generation domain in the area of insecure code generation reported the model trained on vulnerable code with a code security score as low as 0.14 out of 1, corresponding to a high probability of generating insecure code snippets (86%). On the other hand, the model trained on secure code attained a security score of 0.88, with outputs being secure 88% of the time. The evaluation of backdoor awareness indicated that models could detect the presence of backdoors in multiple-choice settings, assigning higher probabilities to claims of unusual behavioral dependencies compared to baseline models.

Despite these successes, limitations were apparent. Models struggled to articulate backdoor triggers in free-form text, often requiring additional training setups, such as reversal training, to overcome the inherent challenges of mapping behaviors to specific triggers. The findings underline the complexity of behavioral self-awareness and the need for further refinement in elicitation techniques.

This study provides meaningful insights into latent LLM capabilities. Such demonstrations of inferable and expression capabilities of models make the opportunity to enhance transparency and safety for AI open before researchers. Uncovering and counteracting implicit behavior in LLMs is an essential, practically oriented challenge with theoretical implications for AI’s effective, responsible deployment in several critical applications. The outcome demonstrates the role of behavioral self-awareness in a change of approach in judging and trusting AI systems.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post This AI Paper Explores Behavioral Self-Awareness in LLMs: Advancing Transparency and AI Safety Through Implicit Behavior Articulation appeared first on MarkTechPost.

Meta AI Releases the First Stable Version of Llama Stack: A Unified Pl …

As the adoption of generative AI continues to expand, developers face mounting challenges in building and deploying robust applications. The complexity of managing diverse infrastructure, ensuring compliance and safety, and maintaining flexibility in provider choices has created a pressing need for unified solutions. Traditional approaches often involve tight coupling with specific platforms, significant rework during deployment transitions, and a lack of standardized tools for key capabilities like retrieval, safety, and monitoring.

The launch of Llama Stack 0.1.0, the platform’s first stable release, designed to simplify the complexities of building and deploying AI solutions, introduces a unified framework with features like streamlined upgrades and automated provider verification. These capabilities empower developers to seamlessly transition from development to production, ensuring reliability and scalability at every stage. At the center of Llama Stack’s design is its commitment to providing a consistent and versatile developer experience. The platform offers a one-stop solution for building production-grade applications, supporting APIs covering inference, Retrieval-Augmented Generation (RAG), agents, safety, and telemetry. Its ability to operate uniformly across local, cloud, and edge environments makes it a standout in AI development.

Image Source

Key Features of Llama Stack 0.1.0

The stable release introduces several features that simplify AI application development:

Backward-Compatible Upgrades: Developers can integrate future API versions without modifying their existing implementations, preserving functionality and reducing the risk of disruptions.

Automated Provider Verification: Llama Stack eliminates the guesswork in onboarding new services by automating compatibility checks for supported providers, enabling faster and error-free integration.

These features and the platform’s modular architecture set the stage for creating scalable and production-ready applications.

Building Production-Grade Applications

One of Llama Stack’s core strengths is its ability to simplify the transition from development to production. The platform offers prepackaged distributions that allow developers to deploy applications in diverse and complex environments, such as local systems, GPU-accelerated cloud setups, or edge devices. This versatility ensures that applications can be scaled up or down based on specific needs. Llama Stack provides essential tools like safety guardrails, telemetry, monitoring systems, and robust evaluation capabilities in production environments. These features enable developers to maintain high performance and security standards while delivering reliable AI solutions.

Image Source

Addressing Industry Challenges

The platform was designed to overcome three major hurdles in AI application development:

Infrastructure Complexity: Managing large-scale models across different environments can be challenging. Llama Stack’s uniform APIs abstract infrastructure details, allowing developers to focus on their application logic.

Essential Capabilities: Beyond inference, modern AI applications require multi-step workflows, safety features, and evaluation tools. Llama Stack integrates these capabilities seamlessly, ensuring that applications are robust and compliant.

Flexibility and Choice: By decoupling applications from specific providers, Llama Stack enables developers to mix and match tools like NVIDIA NIM, AWS Bedrock, FAISS, and Weaviate without vendor lock-in.

A Developer-Centric Ecosystem

Llama Stack offers SDKs for Python, Node.js, Swift, and Kotlin to support developers, catering to various programming preferences. These SDKs have tools and templates to streamline the integration process, reducing development time. The platform’s Playground is an experimental environment where developers can interactively explore Llama Stack’s capabilities. With features like:

Interactive Demos: End-to-end application workflows to guide development.  

Evaluation Tools: Predefined scoring configurations to benchmark model performance.

The Playground ensures that developers of all levels can quickly get up to speed with Llama Stack’s features.

Conclusion

The stable release of Llama Stack 0.1.0 delivers a robust framework for creating, deploying, and managing generative AI applications. By addressing critical challenges like infrastructure complexity, safety, and vendor independence, the platform empowers developers to focus on innovation. With its user-friendly tools, comprehensive ecosystem, and vision for future enhancements, Llama Stack is poised to become an essential ally for developers navigating the generative AI landscape. Also, Llama Stack is set to expand its API offerings in upcoming releases. Planned enhancements include batch processing for inference and agents, synthetic data generation, and post-training tools.

Check out the GitHub Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post Meta AI Releases the First Stable Version of Llama Stack: A Unified Platform Transforming Generative AI Development with Backward Compatibility, Safety, and Seamless Multi-Environment Deployment appeared first on MarkTechPost.

Researchers at Stanford Propose a Unified Regression-based Machine Lea …

Sequences are a universal abstraction for representing and processing information, making sequence modeling central to modern deep learning. By framing computational tasks as transformations between sequences, this perspective has extended to diverse fields such as NLP, computer vision, time series analysis, and computational biology. This has driven the development of various sequence models, including transformers, recurrent networks, and convolutional networks, each excelling in specific contexts. However, these models often arise through fragmented and empirically-driven research, making it difficult to understand their design principles or optimize their performance systematically. The lack of a unified framework and consistent notations further obscures the underlying connections between these architectures.

A key finding linking different sequence models is the relationship between their ability to perform associative recall and their language modeling effectiveness. For instance, studies reveal that transformers use mechanisms like induction heads to store token pairs and predict subsequent tokens. This highlights the significance of associative recall in determining model success. A natural question emerges: how can we intentionally design architectures to excel in associative recall? Addressing this could clarify why some models outperform others and guide the creation of more effective and generalizable sequence models.

Researchers from Stanford University propose a unifying framework that connects sequence models to associative memory through a regression-memory correspondence. They demonstrate that memorizing key-value pairs is equivalent to solving a regression problem at test time, offering a systematic way to design sequence models. By framing architectures as choices of regression objectives, function classes, and optimization algorithms, the framework explains and generalizes linear attention, state-space models, and softmax attention. This approach leverages decades of regression theory, providing a clearer understanding of existing architectures and guiding the development of more powerful, theoretically grounded sequence models.

Sequence modeling aims to map input tokens to output tokens, where associative recall is essential for tasks like in-context learning. Many sequence layers transform inputs into key-value pairs and queries, but the design of layers with associative memory often lacks theoretical grounding. The test-time regression framework addresses this by treating associative memory as solving a regression problem, where a memory map approximates values based on keys. This framework unifies sequence models by framing their design as three choices: assigning weights to associations, selecting the regressor function class, and choosing an optimization method. This systematic approach enables principled architecture design.

To enable effective associative recall, constructing task-specific key-value pairs is critical. Traditional models use linear projections for queries, keys, and values, while recent approaches emphasize “short convolutions” for better performance. A single test-time regression layer with one short convolution is sufficient for solving multi-query associative recall (MQAR) tasks by forming bigram-like key-value pairs. Memory capacity, not sequence length, determines model performance. Linear attention can solve MQAR with orthogonal embeddings, but unweighted recursive least squares (RLS) perform better with larger key-value sets by considering key covariance. These findings highlight the role of memory capacity and key construction in achieving optimal recall.

In conclusion, the study presents a unified framework that interprets sequence models with associative memory as test-time regressors, characterized by three components: association importance, regressor function class, and optimization algorithm. It explains architectures like linear attention, softmax attention, and online learners through regression principles, offering insights into features like QKNorm and higher-order attention generalizations. The framework highlights the efficiency of single-layer designs for tasks like MQAR, bypassing redundant layers. By connecting sequence models to regression and optimization literature, this approach opens pathways for future advancements in adaptive and efficient models, emphasizing associative memory’s role in dynamic, real-world environments.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post Researchers at Stanford Propose a Unified Regression-based Machine Learning Framework for Sequence Models with Associative Memory appeared first on MarkTechPost.

This AI Paper Introduces a Modular Blueprint and x1 Framework: Advanci …

By intertwining the development of artificial intelligence combined with large language models with reinforcement learning in high-performance computation, the newly developed Reasoning Language Models may leap beyond traditional ways of limitation applied to processing by language systems toward explicit and even structured mechanisms, enabling complex reasoning solutions across diverse realms. Such model development achievement is the next significant landmark for better contextual insights and decisions.

The design and deployment of modern RLMs pose a lot of challenges. They are expensive to develop, have proprietary restrictions, and have complex architectures that limit their access. Moreover, the technical obscurity of their operations creates a barrier for organizations and researchers to tap into these technologies. The lack of affordable and scalable solutions exacerbates the gap between entities with access to cutting-edge models, limiting opportunities for broader innovation and application.

Current RLM implementations rely on complex methodologies to achieve their reasoning capabilities. Techniques like Monte Carlo Tree Search (MCTS), Beam Search, and reinforcement learning concepts like process-based and outcome-based supervision have been employed. However, these methods demand advanced expertise and resources, restricting their utility for smaller institutions. While LLMs like OpenAI’s o1 and o3 provide foundational capabilities, their integration with explicit reasoning frameworks remains limited, leaving the potential for broader implementation untapped.

Researchers from ETH Zurich, BASF SE, Cledar, and Cyfronet AGH introduced a comprehensive blueprint to streamline the design and development of RLMs. This modular framework unifies diverse reasoning structures, including chains, trees, and graphs, allowing for flexible and efficient experimentation. The blueprint’s core innovation lies in integrating reinforcement learning principles with hierarchical reasoning strategies, enabling scalable and cost-effective model construction. As part of this work, the team developed the x1 framework, a practical implementation tool for researchers and organizations to prototype RLMs rapidly.

The blueprint organizes the construction of RLM into a clear set of components: reasoning schemes, operators, and pipelines. Reasoning schemes define the structures and strategies for navigating complex problems ranging from sequential chains to multi-level hierarchical graphs. Operators control how these patterns change so that operations can smoothly include fine-tuning, pruning, and restructurings of reasoning paths. Pipelines allow easy flow between training, inference, and data generation and are adaptable across applications. This block-component structure supports individual access while models can be fine-tuned to a fine-grained task such as token-level reasoning or broader structured challenges.

The team showcased the effectiveness of the blueprint and x1 framework using empirical study and real-world implementations. This modular design provided multi-phase training strategies that could optimize policy and value models, further improving reasoning accuracy and scalability. It leveraged familiar training distributions to maintain high precision across applications. Noteworthy results included large efficiency improvements in reasoning tasks attributed to the streamlined integration of reasoning structures. For instance, it demonstrated the potential for effective retrieval-augmented generation techniques through experiments, lowering the computational cost of complex decision-making scenarios. Such breakthroughs reveal that the blueprint allows advanced reasoning technologies to be democratized to even low-resource organizations.

This work marks a turning point in the design of RLMs. This research addresses important issues in access and scalability to allow researchers and organizations to develop novel reasoning paradigms. The modular design encourages experimentation and adaptation, helping bridge the divide between proprietary systems and open innovation. The introduction of the x1 framework further underscores this effort by providing a practical tool for developing and deploying scalable RLMs. This work offers a roadmap for advancing intelligent systems, ensuring that the benefits of advanced reasoning models can be widely shared across industries and disciplines.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post This AI Paper Introduces a Modular Blueprint and x1 Framework: Advancing Accessible and Scalable Reasoning Language Models (RLMs) appeared first on MarkTechPost.

ByteDance Researchers Introduce PaSa: An Advanced Paper Search Agent P …

Academic paper search represents a critical yet intricate information retrieval challenge within research ecosystems. Researchers require complex search capabilities that can navigate complex, specialized knowledge domains and address nuanced, fine-grained queries. Current academic search platforms like Google Scholar struggle to handle intricate research-specific investigations. For example, specialized query-seeking studies on non-stationary reinforcement learning (RL) using UCB-based value methods demand extensive computational and analytical capabilities. Moreover, Researchers often invest considerable time and effort in conducting comprehensive literature surveys, and manually navigating through extensive academic databases.

Existing research methodologies for academic paper search and scientific discovery have explored various applications of LLMs across different research stages. Researchers have utilized LLMs for diverse tasks including idea generation, experiment design, code writing, and research paper creation. However, traditional tools like Google Scholar remain inadequate for handling complex, specialized research queries. Many works have focused on developing LLM agents through prompt engineering techniques and optimization frameworks. Notably, approaches like the AGILE RL framework have emerged to enable more comprehensive and adaptive agent skills. Despite these advancements, a detailed solution for autonomous and precise academic paper searches remains unaddressed, creating a significant research gap.

Researchers from ByteDance Research, and Peking University have proposed PaSa, an innovative paper search agent powered by LLMs. PaSa represents a complex approach to academic research, capable of autonomously executing complex search strategies including tool invocation, paper reading, and reference selection. The agent is designed to generate comprehensive and precise results for intricate scholarly queries. To optimize PaSa’s performance, researchers develop AutoScholarQuery, a synthetic dataset comprising 35k fine-grained academic queries from top-tier AI conference publications. They created RealScholarQuery, a benchmark for evaluating the agent’s real-world performance. The novel approach utilizes RL techniques to enhance the agent’s search capabilities, addressing significant limitations in existing academic search methodologies.

The PaSa system comprises two LLM agents: the Crawler and the Selector, working collaboratively to execute comprehensive academic paper searches. The Crawler initiates the process by analyzing the user’s query to generate multiple refined search queries to retrieve relevant papers. These retrieved papers are added to a dedicated paper queue. The Crawler processes each queued paper, identifying and exploring key citations that might expand the research scope, dynamically appending newly discovered relevant papers, to the paper list. Further, a review is conducted by the Selector of each paper, evaluating its alignment with the original query requirements. The training process for the Crawler involves a two-stage approach: initial imitation learning on a subset of training data, followed by RL optimization.

The experimental results demonstrate PaSa-7b’s superior performance across multiple benchmarks. On the AutoScholarQuery test set, PaSa-7b outperforms existing baselines, achieving a 9.64% improvement in recall compared to PaSa-GPT-4o while maintaining comparable precision. PaSa-7b exhibits remarkable gains against Google-based baselines, with improvements ranging from 33.80% to 42.64% across different recall metrics. Moreover, using multiple Crawler ensembles during inference enhances performance, increasing crawler recall by 3.34% and overall system recall by 1.51%. In the more challenging RealScholarQuery scenario, PaSa-7b demonstrates even more pronounced advantages, delivering 30.36% higher recall and 4.25% improved precision compared to PaSa-GPT-4o.

In conclusion, researchers introduced PaSa which represents an advancement in academic paper search technologies, addressing critical challenges in information retrieval for scholarly research. By utilizing LLMs and RL techniques, the PaSa offers a detailed solution to the complex task of identifying and retrieving relevant academic papers. The proposed method demonstrates substantial improvements over existing search methodologies, significantly reducing the time and effort, researchers spend on literature reviews. Moreover, PaSa provides researchers with a powerful tool for navigating the increasingly vast and complex landscape of academic literature. Its ability to autonomously generate, search, and evaluate academic papers marks a significant step forward in scientific information retrieval.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 70k+ ML SubReddit.

[Recommended Read] Nebius AI Studio expands with vision models, new language models, embeddings and LoRA (Promoted)
The post ByteDance Researchers Introduce PaSa: An Advanced Paper Search Agent Powered by Large Language Models appeared first on MarkTechPost.

Security best practices to consider while fine-tuning models in Amazon …

Amazon Bedrock has emerged as the preferred choice for tens of thousands of customers seeking to build their generative AI strategy. It offers a straightforward, fast, and secure way to develop advanced generative AI applications and experiences to drive innovation.
With the comprehensive capabilities of Amazon Bedrock, you have access to a diverse range of high-performing foundation models (FMs), empowering you to select the most suitable option for your specific needs, customize the model privately with your own data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and create managed agents that run complex business tasks.
Fine-tuning pre-trained language models allows organizations to customize and optimize the models for their specific use cases, providing better performance and more accurate outputs tailored to their unique data and requirements. By using fine-tuning capabilities, businesses can unlock the full potential of generative AI while maintaining control over the model’s behavior and aligning it with their goals and values.
In this post, we delve into the essential security best practices that organizations should consider when fine-tuning generative AI models.
Security in Amazon Bedrock
Cloud security at AWS is the highest priority. Amazon Bedrock prioritizes security through a comprehensive approach to protect customer data and AI workloads.
Amazon Bedrock is built with security at its core, offering several features to protect your data and models. The main aspects of its security framework include:

Access control – This includes features such as:

Fine-grained access control using AWS Identity and Access Management (IAM)
Resource-based policies to control access to specific Amazon Bedrock resources

Data encryption – Amazon Bedrock offers the following encryption:

Data at rest is encrypted using AWS Key Management Service (AWS KMS)
Data in transit is encrypted using TLS 1.2 or higher

Network security – Amazon Bedrock offers several security options, including:

Support for AWS PrivateLink to establish private connectivity between your virtual private cloud (VPC) and Amazon Bedrock
VPC endpoints for secure communication within your AWS environment

Compliance – Amazon Bedrock is in alignment with various industry standards and regulations, including HIPAA, SOC, and PCI DSS

Solution overview
Model customization is the process of providing training data to a model to improve its performance for specific use cases. Amazon Bedrock currently offers the following customization methods:

Continued pre-training – Enables tailoring an FM’s capabilities to specific domains by fine-tuning its parameters with unlabeled, proprietary data, allowing continuous improvement as more relevant data becomes available.
Fine-tuning – Involves providing labeled data to train a model on specific tasks, enabling it to learn the appropriate outputs for given inputs. This process adjusts the model’s parameters, enhancing its performance on the tasks represented by the labeled training dataset.
Distillation – Process of transferring knowledge from a larger more intelligent model (known as teacher) to a smaller, faster, cost-efficient model (known as student).

Model customization in Amazon Bedrock involves the following actions:

Create training and validation datasets.
Set up IAM permissions for data access.
Configure a KMS key and VPC.
Create a fine-tuning or pre-training job with hyperparameter tuning.
Analyze results through metrics and evaluation.
Purchase provisioned throughput for the custom model.
Use the custom model for tasks like inference.

In this post, we explain these steps in relation to fine-tuning. However, you can apply the same concepts for continued pre-training as well.
The following architecture diagram explains the workflow of Amazon Bedrock model fine-tuning.

The workflow steps are as follows:

The user submits an Amazon Bedrock fine-tuning job within their AWS account, using IAM for resource access.
The fine-tuning job initiates a training job in the model deployment accounts.
To access training data in your Amazon Simple Storage Service (Amazon S3) bucket, the job employs Amazon Security Token Service (AWS STS) to assume role permissions for authentication and authorization.
Network access to S3 data is facilitated through a VPC network interface, using the VPC and subnet details provided during job submission.
The VPC is equipped with private endpoints for Amazon S3 and AWS KMS access, enhancing overall security.
The fine-tuning process generates model artifacts, which are stored in the model provider AWS account and encrypted using the customer-provided KMS key.

This workflow provides secure data handling across multiple AWS accounts while maintaining customer control over sensitive information using customer managed encryption keys.
The customer is in control of the data; model providers don’t have access to the data, and they don’t have access to a customer’s inference data or their customization training datasets. Therefore, data will not be available to model providers for them to improve their base models. Your data is also unavailable to the Amazon Bedrock service team.
In the following sections, we go through the steps of fine-tuning and deploying the Meta Llama 3.1 8B Instruct model in Amazon Bedrock using the Amazon Bedrock console.
Prerequisites
Before you get started, make sure you have the following prerequisites:

An AWS account
An IAM federation role with access to do the following:

Create, edit, view, and delete VPC network and security resources
Create, edit, view, and delete KMS keys
Create, edit, view, and delete IAM roles and policies for model customization
Create, upload, view, and delete S3 buckets to access training and validation data and permission to write output data to Amazon S3
List FMs on the base model that will be used for fine-tuning
Create a custom training job for the Amazon Bedrock FM
Provisioned model throughputs
List custom models and invoke model permissions on the fine-tuned model

Model access, which you can request through the Amazon Bedrock console

For this post, we use the us-west-2 AWS Region. For instructions on assigning permissions to the IAM role, refer to Identity-based policy examples for Amazon Bedrock and How Amazon Bedrock works with IAM.
Prepare your data
To fine-tune a text-to-text model like Meta Llama 3.1 8B Instruct, prepare a training and optional validation dataset by creating a JSONL file with multiple JSON lines.
Each JSON line is a sample containing a prompt and completion field. The format is as follows:

{“prompt”: “<prompt1>”, “completion”: “<expected generated text>”}
{“prompt”: “<prompt2>”, “completion”: “<expected generated text>”}

The following is an example from a sample dataset used as one-line input for fine-tuning Meta Llama 3.1 8B Instruct in Amazon Bedrock. In JSONL format, each record is one text line.

{“prompt”: “consumer complaints and resolutions for financial products”, “completion”: “{‘Date received’: ’01/01/24′, ‘Product’: ‘Credit card’, ‘Sub-product’: ‘Store credit card’, ‘Issue’: ‘Other features, terms, or problems’, ‘Sub-issue’: ‘Other problem’, ‘Consumer complaint narrative’: None, ‘Company public response’: None, ‘Company’: ‘Bread Financial Holdings, Inc.’, ‘State’: ‘MD’, ‘ZIP code’: ‘21060’, ‘Tags’: ‘Servicemember’, ‘Consumer consent provided?’: ‘Consent not provided’, ‘Submitted via’: ‘Web’, ‘Date sent to company’: ’01/01/24′, ‘Company response to consumer’: ‘Closed with non-monetary relief’, ‘Timely response?’: ‘Yes’, ‘Consumer disputed?’: None, ‘Complaint ID’: 8087806}”}

Create a KMS symmetric key
When uploading your training data to Amazon S3, you can use server-side encryption with AWS KMS. You can create KMS keys on the AWS Management Console, the AWS Command Line Interface (AWS CLI) and SDKs, or an AWS CloudFormation template. Complete the following steps to create a KMS key in the console:

On the AWS KMS console, choose Customer managed keys in the navigation pane.
Choose Create key.
Create a symmetric key. For instructions, see Create a KMS key.

Create an S3 bucket and configure encryption
Complete the following steps to create an S3 bucket and configure encryption:

On the Amazon S3 console, choose Buckets in the navigation pane.
Choose Create bucket.
For Bucket name, enter a unique name for your bucket.

For Encryption type¸ select Server-side encryption with AWS Key Management Service keys.
For AWS KMS key, select Choose from your AWS KMS keys and choose the key you created.

Complete the bucket creation with default settings or customize as needed.

Upload the training data
Complete the following steps to upload the training data:

On the Amazon S3 console, navigate to your bucket.
Create the folders fine-tuning-datasets and outputs and keep the bucket encryption settings as server-side encryption.
Choose Upload and upload your training data file.

Create a VPC
To create a VPC using Amazon Virtual Private Cloud (Amazon VPC), complete the following steps:

On the Amazon VPC console, choose Create VPC.
Create a VPC with private subnets in all Availability Zones.

Create an Amazon S3 VPC gateway endpoint
You can further secure your VPC by setting up an Amazon S3 VPC endpoint and using resource-based IAM policies to restrict access to the S3 bucket containing the model customization data.
Let’s create an Amazon S3 gateway endpoint and attach it to VPC with custom IAM resource-based policies to more tightly control access to your Amazon S3 files.

The following code is a sample resource policy. Use the name of the bucket you created earlier.

{
“Version”: “2012-10-17”,
“Statement”: [
{
“Sid”: “RestrictAccessToTrainingBucket”,
“Effect”: “Allow”,
“Principal”: “*”,
“Action”: [
“s3:GetObject”,
“s3:PutObject”,
“s3:ListBucket”
],
“Resource”: [
“arn:aws:s3:::$your-bucket”,
“arn:aws:s3:::$your-bucket/*”
]
}
]
}

Create a security group for the AWS KMS VPC interface endpoint
A security group acts as a virtual firewall for your instance to control inbound and outbound traffic. This VPC endpoint security group only allows traffic originating from the security group attached to your VPC private subnets, adding a layer of protection. Complete the following steps to create the security group:

On the Amazon VPC console, choose Security groups in the navigation pane.
Choose Create security group.
For Security group name, enter a name (for example, bedrock-kms-interface-sg).
For Description, enter a description.
For VPC, choose your VPC.

Add an inbound rule to HTTPS traffic from the VPC CIDR block.

Create a security group for the Amazon Bedrock custom fine-tuning job
Now you can create a security group to establish rules for controlling Amazon Bedrock custom fine-tuning job access to the VPC resources. You use this security group later during model customization job creation. Complete the following steps:

On the Amazon VPC console, choose Security groups in the navigation pane.
Choose Create security group.
For Security group name, enter a name (for example, bedrock-fine-tuning-custom-job-sg).
For Description, enter a description.
For VPC, choose your VPC.

Add an inbound rule to allow traffic from the security group.

Create an AWS KMS VPC interface endpoint
Now you can create an interface VPC endpoint (PrivateLink) to establish a private connection between the VPC and AWS KMS.

For the security group, use the one you created in the previous step.

Attach a VPC endpoint policy that controls the access to resources through the VPC endpoint. The following code is a sample resource policy. Use the Amazon Resource Name (ARN) of the KMS key you created earlier.

{
“Statement”: [
{
“Sid”: “AllowDecryptAndView”,
“Principal”: {
“AWS”: “*”
},
“Effect”: “Allow”,
“Action”: [
“kms:Decrypt”,
“kms:DescribeKey”,
“kms:ListAliases”,
“kms:ListKeys”
],
“Resource”: “$Your-KMS-KEY-ARN”
}
]
}

Now you have successfully created the endpoints needed for private communication.

Create a service role for model customization
Let’s create a service role for model customization with the following permissions:

A trust relationship for Amazon Bedrock to assume and carry out the model customization job
Permissions to access your training and validation data in Amazon S3 and to write your output data to Amazon S3
If you encrypt any of the following resources with a KMS key, permissions to decrypt the key (see Encryption of model customization jobs and artifacts)
A model customization job or the resulting custom model
The training, validation, or output data for the model customization job
Permission to access the VPC

Let’s first create the required IAM policies:

On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
Under Specify permissions¸ use the following JSON to provide access on S3 buckets, VPC, and KMS keys. Provide your account, bucket name, and VPC settings.

You can use the following IAM permissions policy as a template for VPC permissions:

{
“Version”: “2012-10-17”,
“Statement”: [
{
“Effect”: “Allow”,
“Action”: [
“ec2:DescribeNetworkInterfaces”,
“ec2:DescribeVpcs”,
“ec2:DescribeDhcpOptions”,
“ec2:DescribeSubnets”,
“ec2:DescribeSecurityGroups”
],
“Resource”: “*”
},
{
“Effect”: “Allow”,
“Action”: [
“ec2:CreateNetworkInterface”,
],
“Resource”:[
“arn:aws:ec2:${{region}}:${{account-id}}:network-interface/*”
],
“Condition”: {
“StringEquals”: {
“aws:RequestTag/BedrockManaged”: [“true”]
},
“ArnEquals”: {
“aws:RequestTag/BedrockModelCustomizationJobArn”: [“arn:aws:bedrock:${{region}}:${{account-id}}:model-customization-job/*”]
}
}
},
{
“Effect”: “Allow”,
“Action”: [
“ec2:CreateNetworkInterface”,
],
“Resource”:[
“arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id}}”,
“arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id2}}”,
“arn:aws:ec2:${{region}}:${{account-id}}:security-group/security-group-id”
]
},
{
“Effect”: “Allow”,
“Action”: [
“ec2:CreateNetworkInterfacePermission”,
“ec2:DeleteNetworkInterface”,
“ec2:DeleteNetworkInterfacePermission”,
],
“Resource”: “*”,
“Condition”: {
“ArnEquals”: {
“ec2:Subnet”: [
“arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id}}”,
“arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id2}}”
],
“ec2:ResourceTag/BedrockModelCustomizationJobArn”: [“arn:aws:bedrock:${{region}}:${{account-id}}:model-customization-job/*”]
},
“StringEquals”: {
“ec2:ResourceTag/BedrockManaged”: “true”
}
}
},
{
“Effect”: “Allow”,
“Action”: [
“ec2:CreateTags”
],
“Resource”: “arn:aws:ec2:${{region}}:${{account-id}}:network-interface/*”,
“Condition”: {
“StringEquals”: {
“ec2:CreateAction”: [
“CreateNetworkInterface”
]
},
“ForAllValues:StringEquals”: {
“aws:TagKeys”: [
“BedrockManaged”,
“BedrockModelCustomizationJobArn”
]
}
}
}
]
}

You can use the following IAM permissions policy as a template for Amazon S3 permissions:

{
“Version”: “2012-10-17”,
“Statement”: [
{
“Effect”: “Allow”,
“Action”: [
“s3:GetObject”,
“s3:ListBucket”
],
“Resource”: [
“arn:aws:s3:::training-bucket”,
“arn:aws:s3:::training-bucket/*”,
“arn:aws:s3:::validation-bucket”,
“arn:aws:s3:::validation-bucket/*”
]
},
{
“Effect”: “Allow”,
“Action”: [
“s3:GetObject”,
“s3:PutObject”,
“s3:ListBucket”
],
“Resource”: [
“arn:aws:s3:::output-bucket”,
“arn:aws:s3:::output-bucket/*”
]
}
]
}

Now let’s create the IAM role.

On the IAM console, choose Roles in the navigation pane.
Choose Create roles.
Create a role with the following trust policy (provide your AWS account ID):

{
“Version”: “2012-10-17”,
“Statement”: [
{
“Effect”: “Allow”,
“Principal”: {
“Service”: “bedrock.amazonaws.com”
},
“Action”: “sts:AssumeRole”,
“Condition”: {
“StringEquals”: {
“aws:SourceAccount”: “account-id”
},
“ArnEquals”: {
“aws:SourceArn”: “arn:aws:bedrock:us-west-2:account-id:model-customization-job/*”
}
}
}
]
}

Assign your custom VPC and S3 bucket access policies.

Give a name to your role and choose Create role.

Update the KMS key policy with the IAM role
In the KMS key you created in the previous steps, you need to update the key policy to include the ARN of the IAM role. The following code is a sample key policy:

{
“Version”: “2012-10-17”,
“Id”: “key-consolepolicy-3”,
“Statement”: [
{
“Sid”: “BedrockFineTuneJobPermissions”,
“Effect”: “Allow”,
“Principal”: {
“AWS”: “$IAM Role ARN”
},
“Action”: [
“kms:Decrypt”,
“kms:GenerateDataKey”,
“kms:Encrypt”,
“kms:DescribeKey”,
“kms:CreateGrant”,
“kms:RevokeGrant”
],
“Resource”: “$ARN of the KMS key”
}
]
}

For more details, refer to Encryption of model customization jobs and artifacts.
Initiate the fine-tuning job
Complete the following steps to set up your fine-tuning job:

On the Amazon Bedrock console, choose Custom models in the navigation pane.
In the Models section, choose Customize model and Create fine-tuning job.

Under Model details, choose Select model.
Choose Llama 3.1 8B Instruct as the base model and choose Apply.

For Fine-tuned model name, enter a name for your custom model.
Select Model encryption to add a KMS key and choose the KMS key you created earlier.
For Job name, enter a name for the training job.
Optionally, expand the Tags section to add tags for tracking.

Under VPC Settings, choose the VPC, subnets, and security group you created as part of previous steps.

When you specify the VPC subnets and security groups for a job, Amazon Bedrock creates elastic network interfaces (ENIs) that are associated with your security groups in one of the subnets. ENIs allow the Amazon Bedrock job to connect to resources in your VPC.
We recommend that you provide at least one subnet in each Availability Zone.

Under Input data, specify the S3 locations for your training and validation datasets.

Under Hyperparameters, set the values for Epochs, Batch size, Learning rate, and Learning rate warm up steps for your fine-tuning job.

Refer to Custom model hyperparameters for additional details.

Under Output data, for S3 location, enter the S3 path for the bucket storing fine-tuning metrics.
Under Service access, select a method to authorize Amazon Bedrock. You can select Use an existing service role and use the role you created earlier.
Choose Create Fine-tuning job.

Monitor the job
On the Amazon Bedrock console, choose Custom models in the navigation pane and locate your job.

You can monitor the job on the job details page.

Purchase provisioned throughput
After fine-tuning is complete (as shown in the following screenshot), you can use the custom model for inference. However, before you can use a customized model, you need to purchase provisioned throughput for it.

Complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Custom models.
On the Models tab, select your model and choose Purchase provisioned throughput.

For Provisioned throughput name, enter a name.
Under Select model, make sure the model is the same as the custom model you selected earlier.
Under Commitment term & model units, configure your commitment term and model units. Refer to Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock for additional insights. For this post, we choose No commitment and use 1 model unit.

Under Estimated purchase summary, review the estimated cost and choose Purchase provisioned throughput.

After the provisioned throughput is in service, you can use the model for inference.

Use the model
Now you’re ready to use your model for inference.

On the Amazon Bedrock console, under Playgrounds in the navigation pane, choose Chat/text.
Choose Select model.
For Category, choose Custom models under Custom & self-hosted models.
For Model, choose the model you just trained.
For Throughput, choose the provisioned throughput you just purchased.
Choose Apply.

Now you can ask sample questions, as shown in the following screenshot.

Implementing these procedures allows you to follow security best practices when you deploy and use your fine-tuned model within Amazon Bedrock for inference tasks.
When developing a generative AI application that requires access to this fine-tuned model, you have the option to configure it within a VPC. By employing a VPC interface endpoint, you can make sure communication between your VPC and the Amazon Bedrock API endpoint occurs through a PrivateLink connection, rather than through the public internet.
This approach further enhances security and privacy. For more information on this setup, refer to Use interface VPC endpoints (AWS PrivateLink) to create a private connection between your VPC and Amazon Bedrock.
Clean up
Delete the following AWS resources created for this demonstration to avoid incurring future charges:

Amazon Bedrock model provisioned throughput
VPC endpoints
VPC and associated security groups
KMS key
IAM roles and policies
S3 bucket and objects

Conclusion
In this post, we implemented secure fine-tuning jobs in Amazon Bedrock, which is crucial for protecting sensitive data and maintaining the integrity of your AI models.
By following the best practices outlined in this post, including proper IAM role configuration, encryption at rest and in transit, and network isolation, you can significantly enhance the security posture of your fine-tuning processes.
By prioritizing security in your Amazon Bedrock workflows, you not only safeguard your data and models, but also build trust with your stakeholders and end-users, enabling responsible and secure AI development.
As a next step, try the solution out in your account and share your feedback.

About the Authors
Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Generative AI and Machine Learning. In his spare time, Vishal loves making short films on time travel and alternate universe themes.
Sumeet Tripathi is an Enterprise Support Lead (TAM) at AWS in North Carolina. He has over 17 years of experience in technology across various roles. He is passionate about helping customers to reduce operational challenges and friction. His focus area is AI/ML and Energy & Utilities Segment. Outside work, He enjoys traveling with family, watching cricket and movies.