Optimizing AI responsiveness: A practical guide to Amazon Bedrock late …

In production generative AI applications, responsiveness is just as important as the intelligence behind the model. Whether it’s customer service teams handling time-sensitive inquiries or developers needing instant code suggestions, every second of delay, known as latency, can have a significant impact. As businesses increasingly use large language models (LLMs) for these critical tasks and processes, they face a fundamental challenge: how to maintain the quick, responsive performance users expect while delivering the high-quality outputs these sophisticated models promise.
The impact of latency on user experience extends beyond mere inconvenience. In interactive AI applications, delayed responses can break the natural flow of conversation, diminish user engagement, and ultimately affect the adoption of AI-powered solutions. This challenge is compounded by the increasing complexity of modern LLM applications, where multiple LLM calls are often needed to solve a single problem, significantly increasing total processing times.
During re:Invent 2024, we launched latency-optimized inference for foundation models (FMs) in Amazon Bedrock. This new inference feature provides reduced latency for Anthropic’s Claude 3.5 Haiku model and Meta’s Llama 3.1 405B and 70B models compared to their standard versions. This feature is especially helpful for time-sensitive workloads where rapid response is business critical.
In this post, we explore how Amazon Bedrock latency-optimized inference can help address the challenges of maintaining responsiveness in LLM applications. We’ll dive deep into strategies for optimizing application performance and improving user experience. Whether you’re building a new AI application or optimizing an existing one, you’ll find practical guidance on both the technical aspects of latency optimization and real-world implementation approaches. We begin by explaining latency in LLM applications.
Understanding latency in LLM applications
Latency in LLM applications is a multifaceted concept that goes beyond simple response times. When you interact with an LLM, you can receive responses in one of two ways: streaming or nonstreaming mode. In nonstreaming mode, you wait for the complete response before receiving any output—like waiting for someone to finish writing a letter. In streaming mode, you receive the response as it’s being generated—like watching someone type in real time.
To effectively optimize AI applications for responsiveness, we need to understand the key metrics that define latency and how they impact user experience. These metrics differ between streaming and nonstreaming modes and understanding them is crucial for building responsive AI applications.
Time to first token (TTFT) represents how quickly your streaming application starts responding. It’s the amount of time from when a user submits a request until they receive the beginning of a response (the first word, token, or chunk). Think of it as the initial reaction time of your AI application.
TTFT is affected by several factors:

Length of your input prompt (longer prompts generally mean higher TTFT)
Network conditions and geographic location (if the prompt is getting processed in a different region, it will take longer)

Calculation: TTFT = Time at first chunk/token – Time of request submission
Interpretation: Lower is better
Output tokens per second (OTPS) indicates how quickly your model generates new tokens after it starts responding. This metric is crucial for understanding the actual throughput of your model and how it maintains its response speed throughout longer generations.
OTPS is influenced by:

Model size and complexity
Length of the generated response
Complexity of the task and prompt
System load and resource availability

Calculation: OTPS = Total number of output tokens / Total generation time
Interpretation: Higher is better
End-to-end latency (E2E) measures the total time from request to complete response. As illustrated in the figure above, this encompasses the entire interaction.
Key factors affecting this metric include:

Input prompt length
Requested output length
Model processing speed
Network conditions
Complexity of the task and prompt
Postprocessing requirements (for example, using Amazon Bedrock Guardrails or other quality checks)

Calculation: E2E latency = Time at completion of the response – Time of request submission
Interpretation: Lower is better
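As an illustration, these three calculations map directly onto a few small helper functions. The following is a minimal sketch that assumes you record your own timestamps (in seconds) around each model invocation; the function and variable names are ours, not part of any SDK.

def time_to_first_token(request_time, first_token_time):
    # TTFT = time at first chunk/token - time of request submission (lower is better)
    return first_token_time - request_time

def output_tokens_per_second(total_output_tokens, generation_time):
    # OTPS = total output tokens / total generation time (higher is better)
    return total_output_tokens / generation_time

def end_to_end_latency(request_time, completion_time):
    # E2E latency = time at completion of the response - time of request submission (lower is better)
    return completion_time - request_time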
Although these metrics provide a solid foundation for understanding latency, there are additional factors and considerations that can impact the perceived performance of LLM applications. These metrics are shown in the following diagram.

The role of tokenization
An often-overlooked aspect of latency is how different models tokenize text differently. Each model’s tokenization strategy is defined by its provider during training and can’t be modified. For example, a prompt that generates 100 tokens in one model might generate 150 tokens in another. When comparing model performance, remember that these inherent tokenization differences can affect perceived response times, even when the models are equally efficient. Awareness of this variation can help you better interpret latency differences between models and make more informed decisions when selecting models for your applications.
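To see this effect concretely, you can count tokens for the same text with two different tokenizers. The following is a minimal sketch that uses two publicly available tokenizers from the Hugging Face transformers library as stand-ins for the proprietary tokenizers of hosted FMs; the exact counts you see will depend on the tokenizers you choose.

from transformers import AutoTokenizer

text = "Latency-optimized inference helps time-sensitive generative AI workloads."

for name in ["gpt2", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # encode() returns the token IDs this particular tokenizer produces for the text
    print(f"{name}: {len(tokenizer.encode(text))} tokens")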
Understanding user experience
The psychology of waiting in AI applications reveals interesting patterns about user expectations and satisfaction. Users tend to perceive response times differently based on the context and complexity of their requests. A slight delay in generating a complex analysis might be acceptable, but even a small lag in a conversational exchange can feel disruptive. This understanding helps us set appropriate optimization priorities for different types of applications.
Consistency over speed
Consistent response times, even if slightly slower, often lead to better user satisfaction than highly variable response times with occasional quick replies. This is crucial for streaming responses and implementing optimization strategies.
Keeping users engaged
When processing times are longer, simple indicators such as a “Processing your request” message or a loading animation help keep users engaged, especially during the initial response time. In such scenarios, you want to optimize for TTFT.
Balancing speed, quality, and cost
Output quality often matters more than speed. Users prefer accurate responses over quick but less reliable ones. Consider benchmarking your user experience to find the best latency for your use case, keeping in mind that most humans read no faster than about 225 words per minute, so an extremely fast response can even hinder the user experience.
By understanding these nuances, you can make more informed decisions to optimize your AI applications for better user experience.
Latency-optimized inference: A deep dive
Amazon Bedrock latency-optimized inference capabilities are designed to provide higher OTPS and quicker TTFT, enabling applications to handle workloads more reliably. This optimization is available in the US East (Ohio) AWS Region for select FMs, including Anthropic’s Claude 3.5 Haiku and Meta’s Llama 3.1 models (both 405B and 70B versions). The optimization provides the following improvements:

Higher OTPS – Faster token generation after the model starts responding
Quicker TTFT – Faster initial response time

Implementation
To enable latency optimization, you need to set the latency parameter to optimized in your API calls:

# Using the Converse API without streaming
import boto3

# Create a Bedrock Runtime client in a Region where latency-optimized inference is available,
# for example US East (Ohio)
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-2')

response = bedrock_runtime.converse(
    modelId='us.anthropic.claude-3-5-haiku-20241022-v1:0',
    messages=[{
        'role': 'user',
        'content': [{
            'text': 'Write a story about music generating AI models'
        }]
    }],
    performanceConfig={'latency': 'optimized'}
)

For streaming responses:

# Using the Converse API with streaming (reuses the bedrock_runtime client created above)
response = bedrock_runtime.converse_stream(
    modelId='us.anthropic.claude-3-5-haiku-20241022-v1:0',
    messages=[{
        'role': 'user',
        'content': [{
            'text': 'Write a story about music generating AI models'
        }]
    }],
    performanceConfig={'latency': 'optimized'}
)
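To actually read the streamed output, you iterate over the returned event stream. The following sketch repeats the converse_stream request above so that the timing can bracket the call, and it assumes the boto3 ConverseStream event names (contentBlockDelta for text deltas and metadata for usage statistics); treat it as an illustration rather than a definitive implementation.

import time

request_time = time.time()  # record the submission time immediately before the call
response = bedrock_runtime.converse_stream(
    modelId='us.anthropic.claude-3-5-haiku-20241022-v1:0',
    messages=[{
        'role': 'user',
        'content': [{'text': 'Write a story about music generating AI models'}]
    }],
    performanceConfig={'latency': 'optimized'}
)

first_token_time = None
output_tokens = 0
for event in response['stream']:
    if 'contentBlockDelta' in event:  # a chunk of generated text
        if first_token_time is None:
            first_token_time = time.time()  # the first token arrives here
        print(event['contentBlockDelta']['delta'].get('text', ''), end='', flush=True)
    elif 'metadata' in event:  # usage statistics arrive at the end of the stream
        output_tokens = event['metadata']['usage']['outputTokens']

completion_time = time.time()
print(f"\nTTFT: {first_token_time - request_time:.2f} s")
print(f"OTPS: {output_tokens / (completion_time - first_token_time):.1f} tokens per second")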

Benchmarking methodology and results
To understand the performance gains for both TTFT and OTPS, we conducted an offline experiment with around 1,600 API calls spread across various hours of the day and across multiple days. We used a dummy dataset comprising different task types: sequence counting, story writing, summarization, and translation. Input prompts ranged from 100 to 100,000 tokens, and outputs ranged from 100 to 1,000 tokens. These tasks were chosen to represent varying complexity levels and varying model output lengths. Our test setup was hosted in the US West (Oregon) us-west-2 Region, and both the optimized and standard models were hosted in the US East (Ohio) us-east-2 Region. This cross-Region setup introduced realistic network variability, helping us measure performance under conditions similar to real-world applications.
When analyzing the results, we focused on the key latency metrics discussed earlier: TTFT and OTPS. As a quick recap, lower TTFT values indicate faster initial response times, and higher OTPS values represent faster token generation speeds. We also looked at the 50th percentile (P50) and 90th percentile (P90) values to understand both typical performance and performance boundaries under challenging or upper bound conditions. Following the central limit theorem, we observed that, with sufficient samples, our results converged toward consistent values, providing reliable performance indicators.
It’s important to note that these results are from our specific test environment and datasets. Your actual results may vary based on your specific use case, prompt length, expected model response length, network conditions, client location, and other implementation components. When conducting your own benchmarks, make sure your test dataset represents your actual production workload characteristics, including typical input lengths and expected output patterns.
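When you run your own benchmark, the per-request measurements can be summarized into the same P50 and P90 statistics used below. A minimal sketch, assuming you have collected lists of TTFT and OTPS values (the sample values here are made up):

import numpy as np

ttft_samples = [0.52, 0.61, 0.58, 1.20, 0.70]   # seconds per request (hypothetical values)
otps_samples = [92.1, 88.4, 101.3, 79.9, 95.0]  # tokens per second per request (hypothetical values)

print("TTFT P50/P90:", np.percentile(ttft_samples, 50), np.percentile(ttft_samples, 90))
print("OTPS P50/P90:", np.percentile(otps_samples, 50), np.percentile(otps_samples, 90))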
Benchmark results
Our experiments with the latency-optimized models revealed substantial performance improvements across both TTFT and OTPS metrics. The results in the following table show the comparison between standard and optimized versions of Anthropic’s Claude 3.5 Haiku and Meta’s Llama 3.1 70B models. For each model, we ran multiple iterations of our test scenarios to obtain reliable measurements. The improvements were particularly notable in high-percentile measurements, suggesting more consistent performance even under challenging conditions.

| Model | Inference profile | TTFT P50 (in seconds) | TTFT P90 (in seconds) | OTPS P50 | OTPS P90 |
| --- | --- | --- | --- | --- | --- |
| us.anthropic.claude-3-5-haiku-20241022-v1:0 | Optimized | 0.6 | 1.4 | 85.9 | 152.0 |
| us.anthropic.claude-3-5-haiku-20241022-v1:0 | Standard | 1.1 | 2.9 | 48.4 | 67.4 |
| Improvement | – | -42.20% | -51.70% | 77.34% | 125.50% |
| us.meta.llama3-1-70b-instruct-v1:0 | Optimized | 0.4 | 1.2 | 137.0 | 203.7 |
| us.meta.llama3-1-70b-instruct-v1:0 | Standard | 0.9 | 42.8 | 30.2 | 32.4 |
| Improvement | – | -51.65% | -97.10% | 353.84% | 529.33% |

These results demonstrate significant improvements across all metrics for both models. For Anthropic’s Claude 3.5 Haiku model, the optimized version achieved up to 42.20% reduction in TTFT P50 and up to 51.70% reduction in TTFT P90, indicating more consistent initial response times. Additionally, the OTPS saw improvements of up to 77.34% at the P50 level and up to 125.50% at the P90 level, enabling faster token generation.
The gains for Meta’s Llama 3.1 70B model are even more impressive, with the optimized version achieving up to 51.65% reduction in TTFT P50 and up to 97.10% reduction in TTFT P90, providing consistently rapid initial responses. Furthermore, the OTPS saw a massive boost, with improvements of up to 353.84% at the P50 level and up to 529.33% at the P90 level, enabling up to 5x faster token generation in some scenarios.
Although these benchmark results show the powerful impact of latency-optimized inference, they represent just one piece of the optimization puzzle. To make best use of these performance improvements and achieve the best possible response times for your specific use case, you’ll need to consider additional optimization strategies beyond merely enabling the feature.
Comprehensive guide to LLM latency optimization
Even though Amazon Bedrock latency-optimized inference offers great improvements from the start, getting the best performance requires a well-rounded approach to designing and implementing your application. In the next section, we explore some other strategies and considerations to make your application as responsive as possible.
Prompt engineering for latency optimization
When optimizing LLM applications for latency, the way you craft your prompts affects both input processing and output generation.
To optimize your input prompts, follow these recommendations:

Keep prompts concise – Long input prompts take more time to process and increase TTFT. Create short, focused prompts that prioritize necessary context and information.
Break down complex tasks – Instead of handling large tasks in a single request, break them into smaller, manageable chunks. This approach helps maintain responsiveness regardless of task complexity.
Smart context management – For interactive applications such as chatbots, include only relevant context instead of entire conversation history.
Token management – Different models tokenize text differently, meaning the same input can result in different numbers of tokens. Monitor and optimize token usage to keep performance consistent. Use token budgeting to balance context preservation with performance needs.

To engineer for brief outputs, follow these recommendations:

Engineer for brevity – Include explicit length constraints in your prompts (for example, “respond in 50 words or less”)
Use system messages – Set response length constraints through system messages
Balance quality and length – Make sure response constraints don’t compromise output quality

One of the best ways to make your AI application feel faster is to use streaming. Instead of waiting for the complete response, streaming shows the response as it’s being generated—like watching someone type in real time. Streaming the response is one of the most effective ways to improve perceived performance in LLM applications while maintaining user engagement.
These techniques can significantly reduce token usage and generation time, improving both latency and cost-efficiency.
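To make the context-management and brevity recommendations above concrete, the following is a minimal sketch that keeps only as much recent conversation history as fits a token budget and adds an explicit length constraint. The count_tokens heuristic and the variable names are illustrative assumptions, not part of any SDK; in practice you would use your model’s own tokenizer or usage metrics.

def count_tokens(text):
    # rough heuristic: roughly 4 characters per token; replace with a real tokenizer if available
    return max(1, len(text) // 4)

def trim_history(history, budget):
    # keep the most recent messages whose combined size fits the token budget
    trimmed, used = [], 0
    for message in reversed(history):
        cost = count_tokens(message['content'])
        if used + cost > budget:
            break
        trimmed.insert(0, message)
        used += cost
    return trimmed

conversation_history = [
    {'role': 'user', 'content': 'Earlier question about latency metrics...'},
    {'role': 'assistant', 'content': 'Earlier answer...'},
    {'role': 'user', 'content': 'What does TTFT mean again?'},
]

system_prompt = [{'text': 'Respond in 50 words or less.'}]  # brevity constraint via a system message
messages = trim_history(conversation_history, budget=1500)  # only relevant, recent context is sent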
Building production-ready AI applications
Although individual optimizations are important, production applications require a holistic approach to latency management. In this section, we explore how different system components and architectural decisions impact overall application responsiveness.
System architecture and end-to-end latency considerations
In production environments, overall system latency extends far beyond model inference time. Each component in your AI application stack contributes to the total latency experienced by users. For instance, when implementing responsible AI practices through Amazon Bedrock Guardrails, you might notice a small additional latency overhead. Similar considerations apply when integrating content filtering, user authentication, or input validation layers. Although each component serves a crucial purpose, their cumulative impact on latency requires careful consideration during system design.
Geographic distribution plays a significant role in application performance. Model invocation latency can vary considerably depending on whether calls originate from different Regions, local machines, or different cloud providers. This variation stems from data travel time across networks and geographic distances. When designing your application architecture, consider factors such as the physical distance between your application and model endpoints, cross-Region data transfer times, and network reliability in different Regions. Data residency requirements might also influence these architectural choices, potentially necessitating specific Regional deployments.
Integration patterns significantly impact how users perceive application performance. Synchronous processing, although simpler to implement, might not always provide the best user experience. Consider implementing asynchronous patterns where appropriate, such as pre-fetching likely responses based on user behavior patterns or processing noncritical components in the background. Request batching for bulk operations can also help optimize overall system throughput, though it requires careful balance with response time requirements.
As applications scale, additional infrastructure components become necessary but can impact latency. Load balancers, queue systems, cache layers, and monitoring systems all contribute to the overall latency budget. Understanding these components’ impact helps in making informed decisions about infrastructure design and optimization strategies.
Complex tasks often require orchestrating multiple model calls or breaking down problems into subtasks. Consider a content generation system that first uses a fast model to generate an outline, then processes different sections in parallel, and finally uses another model for coherence checking and refinement. This orchestration approach requires careful attention to cumulative latency impact while maintaining output quality. Each step needs appropriate timeouts and fallback mechanisms to provide reliable performance under various conditions.
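As a sketch of this orchestration pattern, the following generates outline sections in parallel with a per-section timeout and a simple fallback. The generate_section function is a hypothetical wrapper around your model call; the names and timeout values are assumptions for illustration.

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def generate_section(section_title):
    # hypothetical wrapper: call bedrock_runtime.converse(...) with a section-specific prompt
    return f"Draft content for '{section_title}'"

outline = ["Introduction", "Key metrics", "Optimization strategies"]

with ThreadPoolExecutor(max_workers=len(outline)) as pool:
    futures = {title: pool.submit(generate_section, title) for title in outline}
    sections = {}
    for title, future in futures.items():
        try:
            sections[title] = future.result(timeout=30)  # per-section timeout in seconds
        except FuturesTimeout:
            sections[title] = "[fallback: section timed out, retry or use a cached draft]"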
Prompt caching for enhanced performance
Although our focus is on latency-optimized inference, it’s worth noting that Amazon Bedrock also offers prompt caching (in preview) to optimize for both cost and latency. This feature is particularly valuable for applications that frequently reuse context, such as document-based chat assistants or applications with repetitive query patterns. When combined with latency-optimized inference, prompt caching can provide additional performance benefits by reducing the processing overhead for frequently used contexts.
Prompt routing for intelligent model selection
Similar to prompt caching, Amazon Bedrock Intelligent Prompt Routing (in preview) is another powerful optimization feature. This capability automatically directs requests to different models within the same model family based on the complexity of each prompt. For example, simple queries can be routed to faster, more cost-effective models, and complex requests that require deeper understanding are directed to more sophisticated models. This automatic routing helps optimize both performance and cost without requiring manual intervention.
Architectural considerations and caching
Application architecture plays a crucial role in overall latency optimization. Consider implementing a multitiered caching strategy that includes response caching for frequently requested information and smart context management for historical information. This isn’t only about storing exact matches—consider implementing semantic caching that can identify and serve responses to similar queries.
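The following is a minimal sketch of a response cache with a time-to-live (TTL), keyed on a normalized prompt. A production system might replace the exact-match key with an embedding-based lookup to get the semantic caching behavior described above; the class and parameter names here are illustrative assumptions.

import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, prompt):
        # normalize the prompt so trivially different phrasings hash to the same key
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get(self, prompt):
        entry = self.store.get(self._key(prompt))
        if entry and time.time() - entry['ts'] < self.ttl:
            return entry['response']
        return None

    def put(self, prompt, response):
        self.store[self._key(prompt)] = {'response': response, 'ts': time.time()}

cache = ResponseCache()
if cache.get("What is TTFT?") is None:
    # cache miss: call the model, then store the answer for subsequent requests
    cache.put("What is TTFT?", "TTFT is the time to first token.")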
Balancing model sophistication, latency, and cost
In AI applications, there’s a constant balancing act between model sophistication, latency, and cost, as illustrated in the diagram. Although more advanced models often provide higher quality outputs, they might not always meet strict latency requirements. In such cases, using a less sophisticated but faster model might be the better choice. For instance, in applications requiring near-instantaneous responses, opting for a smaller, more efficient model could be necessary to meet latency goals, even if it means a slight trade-off in output quality. This approach aligns with the broader need to optimize the interplay between cost, speed, and quality in AI systems.
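As a deliberately simplistic illustration of this trade-off, an application could route each request to a faster or a more capable model based on a rough complexity heuristic. Amazon Bedrock Intelligent Prompt Routing, discussed next, automates this decision; the heuristic and the second model ID below are assumptions for illustration only.

FAST_MODEL = 'us.anthropic.claude-3-5-haiku-20241022-v1:0'    # lower latency
CAPABLE_MODEL = 'your-more-capable-model-id'                  # placeholder for a more sophisticated model

def pick_model(prompt):
    # naive heuristic: long or analysis-heavy prompts go to the more capable model
    looks_complex = len(prompt) > 2000 or 'analyze' in prompt.lower()
    return CAPABLE_MODEL if looks_complex else FAST_MODEL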

Features such as Amazon Bedrock Intelligent Prompt Routing help manage this balance effectively. By automatically handling model selection based on request complexity, you can optimize for all three factors—quality, speed, and cost—without requiring developers to commit to a single model for all requests.
As we’ve explored throughout this post, optimizing LLM application latency involves multiple strategies, from using latency-optimized inference and prompt caching to implementing intelligent routing and careful prompt engineering. The key is to combine these approaches in a way that best suits your specific use case and requirements.
Conclusion
Making your AI application fast and responsive isn’t a one-time task, it’s an ongoing process of testing and improvement. Amazon Bedrock latency-optimized inference gives you a great starting point, and you’ll notice significant improvements when you combine it with the strategies we’ve discussed.
Ready to get started? Here’s what to do next:

Try our sample notebook to benchmark latency for your specific use case
Enable latency-optimized inference in your application code
Set up Amazon CloudWatch metrics to monitor your application’s performance

Remember, in today’s AI applications, being smart isn’t enough; being responsive is just as important. Start implementing these optimization strategies today and watch your application’s performance improve.

About the Authors
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Vivek Singh is a Senior Manager, Product Management at AWS AI Language Services team. He leads the Amazon Transcribe product team. Prior to joining AWS, he held product management roles across various other Amazon organizations such as consumer payments and retail. Vivek lives in Seattle, WA and enjoys running, and hiking.
Ankur Desai is a Principal Product Manager within the AWS AI Services team.

Track LLM model evaluation using Amazon SageMaker managed MLflow and F …

Evaluating large language models (LLMs) is crucial as LLM-based systems become increasingly powerful and relevant in our society. Rigorous testing allows us to understand an LLM’s capabilities, limitations, and potential biases, and provide actionable feedback to identify and mitigate risk. Furthermore, evaluation processes are important not only for LLMs, but are becoming essential for assessing prompt template quality, input data quality, and ultimately, the entire application stack. As LLMs take on more significant roles in areas like healthcare, education, and decision support, robust evaluation frameworks are vital for building trust and realizing the technology’s potential while mitigating risks.
Developers interested in using LLMs should prioritize a comprehensive evaluation process for several reasons. First, it assesses the model’s suitability for specific use cases, because performance can vary significantly across different tasks and domains. Evaluations are also a fundamental tool during application development to validate the quality of prompt templates. This process makes sure that solutions align with the company’s quality standards and policy guidelines before deploying them to production. Regular interval evaluation also allows organizations to stay informed about the latest advancements, making informed decisions about upgrading or switching models. Moreover, a thorough evaluation framework helps companies address potential risks when using LLMs, such as data privacy concerns, regulatory compliance issues, and reputational risk from inappropriate outputs. By investing in robust evaluation practices, companies can maximize the benefits of LLMs while maintaining responsible AI implementation and minimizing potential drawbacks.
To support robust generative AI application development, it’s essential to keep track of models, prompt templates, and datasets used throughout the process. This record-keeping allows developers and researchers to maintain consistency, reproduce results, and iterate on their work effectively. By documenting the specific model versions, fine-tuning parameters, and prompt engineering techniques employed, teams can better understand the factors contributing to their AI system’s performance. Similarly, maintaining detailed information about the datasets used for training and evaluation helps identify potential biases and limitations in the model’s knowledge base. This comprehensive approach to tracking key components not only facilitates collaboration among team members but also enables more accurate comparisons between different iterations of the AI application. Ultimately, this systematic approach to managing models, prompts, and datasets contributes to the development of more reliable and transparent generative AI applications.
In this post, we show how to use FMEval and Amazon SageMaker to programmatically evaluate LLMs. FMEval is an open source LLM evaluation library, designed to provide data scientists and machine learning (ML) engineers with a code-first experience to evaluate LLMs for various aspects, including accuracy, toxicity, fairness, robustness, and efficiency. In this post, we only focus on the quality and responsible aspects of model evaluation, but the same approach can be extended by using other libraries for evaluating performance and cost, such as LLMeter and FMBench, or richer quality evaluation capabilities like those provided by Amazon Bedrock Evaluations.
SageMaker is a data, analytics, and AI/ML platform, which we will use in conjunction with FMEval to streamline the evaluation process. We specifically focus on SageMaker with MLflow. MLflow is an open source platform for managing the end-to-end ML lifecycle, including experimentation, reproducibility, and deployment. The managed MLflow in SageMaker simplifies the deployment and operation of tracking servers, and offers seamless integration with other AWS services, making it straightforward to track experiments, package code into reproducible runs, and share and deploy models.
By combining FMEval’s evaluation capabilities with SageMaker with MLflow, you can create a robust, scalable, and reproducible workflow for assessing LLM performance. This approach can enable you to systematically evaluate models, track results, and make data-driven decisions in your generative AI development process.
Using FMEval for model evaluation
FMEval is an open-source library for evaluating foundation models (FMs). It consists of three main components:

Data config – Specifies the dataset location and its structure.
Model runner – Composes the input for your model, invokes it, and extracts the output. Thanks to this construct, you can evaluate any LLM by configuring the model runner according to your model.
Evaluation algorithm – Computes evaluation metrics from model outputs. Different algorithms have different metrics to be specified.

You can use the pre-built components, because FMEval provides native implementations for both Amazon Bedrock and Amazon SageMaker JumpStart, or create custom ones by inheriting from the base core component. The library supports various evaluation scenarios, including pre-computed model outputs and on-the-fly inference. FMEval offers flexibility in dataset handling, model integration, and algorithm implementation. Refer to Evaluate large language models for quality and responsibility or the Evaluating Large Language Models with fmeval paper to dive deeper into FMEval, or see the official GitHub repository.
Using SageMaker with MLflow to track experiments
The fully managed MLflow capability on SageMaker is built around three core components:

MLflow tracking server – This component can be quickly set up through the Amazon SageMaker Studio interface or using the API for more granular configurations. It functions as a standalone HTTP server that provides various REST API endpoints for monitoring, recording, and visualizing experiment runs. This allows you to keep track of your ML experiments.
MLflow metadata backend – This crucial part of the tracking server is responsible for storing all the essential information about your experiments. It keeps records of experiment names, run identifiers, parameter settings, performance metrics, tags, and locations of artifacts. This comprehensive data storage makes sure that you can effectively manage and analyze your ML projects.
MLflow artifact repository – This component serves as a storage space for all the files and objects generated during your ML experiments. These can include trained models, datasets, log files, and visualizations. The repository uses an Amazon Simple Storage Service (Amazon S3) bucket within your AWS account, making sure that your artifacts are stored securely and remain under your control.

The following diagram depicts the different components and where they run within AWS.

Code walkthrough
You can follow the full sample code from the GitHub repository.
Prerequisites
You must have the following prerequisites:

A running MLflow tracking server within an Amazon SageMaker Studio domain
A JupyterLab application within the same SageMaker Studio domain
Active subscriptions to the Amazon Bedrock models you want to evaluate and permissions to invoke these models
Permissions to deploy foundation models via Amazon SageMaker JumpStart

Refer to the documentation best practices regarding AWS Identity and Access Management (IAM) policies for SageMaker, MLflow, and Amazon Bedrock for guidance on how to set up permissions for the SageMaker execution role. Remember to always follow the principle of least privilege.
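Before running the evaluations, the notebooks connect MLflow to the managed tracking server. A minimal sketch, assuming the sagemaker-mlflow plugin is installed (pip install mlflow sagemaker-mlflow) and using a placeholder tracking server ARN:

import mlflow

# Replace the placeholder with the ARN of your running tracking server
tracking_server_arn = "arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>"
mlflow.set_tracking_uri(tracking_server_arn)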
Evaluate a model and log to MLflow
We provide two sample notebooks to evaluate models hosted in Amazon Bedrock (Bedrock.ipynb) and models deployed to SageMaker Hosting using SageMaker JumpStart (JumpStart.ipynb). The workflow implemented in these two notebooks is essentially the same, although a few differences are noteworthy:

Models hosted in Amazon Bedrock can be consumed directly using an API without any setup, providing a “serverless” experience, whereas models in SageMaker JumpStart require the user first to deploy the models. Although deploying models through SageMaker JumpStart is a straightforward operation, the user is responsible for managing the lifecycle of the endpoint.
ModelRunner implementations differ. FMEval provides native implementations for both Amazon Bedrock, using the BedrockModelRunner class, and SageMaker JumpStart, using the JumpStartModelRunner class. We discuss the main differences in the following section.

ModelRunner definition
For BedrockModelRunner, we need to find the model content_template. We can find this information conveniently on the Amazon Bedrock console in the API request sample section by looking at the value of the body field. The following example is the content template for Anthropic’s Claude 3 Haiku:
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

output_jmespath = "content[0].text"
content_template = """{
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "temperature": 0.5,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": $prompt
                }
            ]
        }
    ]
}"""

model_runner = BedrockModelRunner(
    model_id=model_id,
    output=output_jmespath,
    content_template=content_template,
)

For JumpStartModelRunner, we need to find the model_id and model_version. This information can be retrieved directly using the get_model_info_from_endpoint(endpoint_name=endpoint_name) utility provided by the SageMaker Python SDK, where endpoint_name is the name of the SageMaker endpoint where the SageMaker JumpStart model is hosted. See the following code example:
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from sagemaker.jumpstart.session_utils import get_model_info_from_endpoint

model_id, model_version, _, _, _ = get_model_info_from_endpoint(endpoint_name=endpoint_name)

model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
)

DataConfig definition
For each model runner, we want to evaluate three categories: Summarization, Factual Knowledge, and Toxicity. For each of these categories, we prepare a DataConfig object for the appropriate dataset. The following example shows only the data for the Summarization category:
from pathlib import Path

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig

dataset_path = Path("datasets")

dataset_uri_summarization = dataset_path / "gigaword_sample.jsonl"
if not dataset_uri_summarization.is_file():
    print("ERROR - please make sure the file, gigaword_sample.jsonl, exists.")

data_config_summarization = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri=dataset_uri_summarization.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)

Evaluation sets definition
We can now create an evaluation set for each algorithm we want to use in our test. For the Summarization evaluation set, define your own prompt according to the input signature identified earlier. fmeval uses $model_input as a placeholder to get the input from your evaluation dataset. See the following code:
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

summarization_prompt = "Summarize the following text in one sentence: $model_input"

summarization_accuracy = SummarizationAccuracy()

# EvaluationSet is a helper container defined in the sample repository's utils.py
evaluation_set_summarization = EvaluationSet(
    data_config_summarization,
    summarization_accuracy,
    summarization_prompt,
)
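The factual knowledge and toxicity evaluation sets referenced in the next snippet can be built in the same way. The following is a sketch assuming the corresponding fmeval algorithms and analogous JSON Lines datasets; the dataset file names, field names, and prompts are placeholders to adapt to your own data.

from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig
from fmeval.eval_algorithms.toxicity import Toxicity

data_config_factual = DataConfig(
    dataset_name="factual_sample",                                   # placeholder dataset
    dataset_uri=(dataset_path / "factual_sample.jsonl").as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer",
)

data_config_toxicity = DataConfig(
    dataset_name="toxicity_sample",                                  # placeholder dataset
    dataset_uri=(dataset_path / "toxicity_sample.jsonl").as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="prompt",
)

evaluation_set_factual = EvaluationSet(
    data_config_factual,
    FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>")),
    "Answer the following question: $model_input",
)

evaluation_set_toxicity = EvaluationSet(
    data_config_toxicity,
    Toxicity(),
    "$model_input",
)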

We are ready now to group the evaluation sets:
evaluation_list = [
    evaluation_set_summarization,
    evaluation_set_factual,
    evaluation_set_toxicity,
]

Evaluate and log to MLflow
We set up the MLflow experiment used to track the evaluations. We then create a new run for each model and run all the evaluations for that model within that run, so that the metrics all appear together. We use the model_id as the run name to make it straightforward to identify this run as part of a larger experiment, and run the evaluation using the run_evaluation_sets() function defined in utils.py. See the following code:
run_name = f"{model_id}"

experiment_name = "fmeval-mlflow-simple-runs"
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name) as run:
    run_evaluation_sets(model_runner, evaluation_list)

It is up to the user to decide how to best organize the results in MLflow. In fact, a second possible approach is to use nested runs. The sample notebooks implement both approaches to help you decide which one best fits your needs.
experiment_name = "fmeval-mlflow-nested-runs"
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name, nested=True) as run:
    run_evaluation_sets_nested(model_runner, evaluation_list)

Run evaluations
Tracking the evaluation process involves storing information about three aspects:

The input dataset
The parameters of the model being evaluated
The scores for each evaluation

We provide a helper library (fmeval_mlflow) to abstract the logging of these aspects to MLflow, streamlining the interaction with the tracking server. For the information we want to store, we can refer to the following three functions:

log_input_dataset(data_config: DataConfig | list[DataConfig]) – Log one or more input datasets to MLflow for evaluation purposes
log_runner_parameters(model_runner: ModelRunner, custom_parameters_map: dict | None = None, model_id: str | None = None) – Log the parameters associated with a given ModelRunner instance to MLflow
log_metrics(eval_output: list[EvalOutput], log_eval_output_artifact: bool = False) – Log metrics and artifacts for a list of EvalOutput instances to MLflow
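The following sketch shows how these three helpers might be combined inside run_evaluation_sets(). The actual implementation lives in the sample repository’s utils.py; the loop below assumes EvaluationSet is a simple (data_config, algorithm, prompt_template) container and uses fmeval’s evaluate() interface, so treat it as an illustration rather than the exact code.

from fmeval_mlflow import log_input_dataset, log_runner_parameters, log_metrics

def run_evaluation_sets(model_runner, evaluation_list):
    for data_config, eval_algorithm, prompt_template in evaluation_list:
        log_input_dataset(data_config)        # record the dataset used for this evaluation
        log_runner_parameters(model_runner)   # record the model/runner parameters
        eval_outputs = eval_algorithm.evaluate(
            model=model_runner,
            dataset_config=data_config,
            prompt_template=prompt_template,
        )
        log_metrics(eval_outputs)             # record the resulting scores in MLflow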

When the evaluations are complete, we can analyze the results directly in the MLflow UI for a first visual assessment.
In the following screenshots, we show the visualization differences between logging using simple runs or nested runs.

You might want to create your own custom visualizations. For example, spider plots are often used to make visual comparisons across multiple metrics. In the notebook compare_models.ipynb, we provide an example of how to use metrics stored in MLflow to generate such plots, which ultimately can also be stored in MLflow as part of your experiments. The following screenshots show some example visualizations.

Clean up
Once created, an MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop the tracking servers when they are not in use to save costs or delete them using the API or SageMaker Studio UI. For more details on pricing, see Amazon SageMaker pricing.
Similarly, if you deployed a model using SageMaker, endpoints are priced by deployed infrastructure time rather than by requests. You can avoid unnecessary charges by deleting your endpoints when you’re done with the evaluation.
Conclusion
In this post, we demonstrated how to create an evaluation framework for LLMs by combining SageMaker managed MLflow with FMEval. This integration provides a comprehensive solution for tracking and evaluating LLM performance across different aspects including accuracy, toxicity, and factual knowledge.
To enhance your evaluation journey, you can explore the following:

Get started with FMEval and SageMaker managed MLflow by following our code examples in the provided GitHub repository
Implement systematic evaluation practices in your LLM development workflow using the demonstrated approach
Use MLflow’s tracking capabilities to maintain detailed records of your evaluations, making your LLM development process more transparent and reproducible
Explore different evaluation metrics and datasets available in FMEval to comprehensively assess your LLM applications

By adopting these practices, you can build more reliable and trustworthy LLM applications while maintaining a clear record of your evaluation process and results.

About the authors
Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunications Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.
Dr. Alessandro Cerè is a GenAI Evaluation Specialist and Solutions Architect at AWS. He assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations. Bringing a unique perspective to the field of AI, Alessandro has a background in quantum physics and research experience in quantum communications and quantum memories. In his spare time, he pursues his passion for landscape and underwater photography.

Meet Open R1: The Full Open Reproduction of DeepSeek-R1, Challenging t …

Open-source LLM development is going through a major shift with the effort to fully reproduce and open-source DeepSeek-R1, including its training data and scripts. Hosted on Hugging Face’s platform, this ambitious project is designed to replicate and enhance the R1 pipeline. It emphasizes collaboration, transparency, and accessibility, enabling researchers and developers worldwide to build on DeepSeek-R1’s foundational work.

What is Open R1?

Open R1 aims to recreate the DeepSeek-R1 pipeline, an advanced system renowned for its synthetic data generation, reasoning, and reinforcement learning capabilities. This open-source project provides the tools and resources necessary to reproduce the pipeline’s functionalities. The Hugging Face repository will include scripts for training models, evaluating benchmarks, and generating synthetic datasets.

The initiative simplifies the otherwise complex model training and evaluation processes through clear documentation and modular design. By focusing on reproducibility, the Open R1 project invites developers to test, refine, and expand upon its core components.

Key Features of the Open R1 Framework

Training and Fine-Tuning Models: Open R1 includes scripts for fine-tuning models using techniques like Supervised Fine-Tuning (SFT). These scripts are compatible with powerful hardware setups, such as clusters of H100 GPUs, to achieve optimal performance. Fine-tuned models are evaluated on R1 benchmarks to validate their performance.

Synthetic Data Generation: The project incorporates tools like Distilabel to generate high-quality synthetic datasets. This enables training models that excel in mathematical reasoning and code generation tasks.

Evaluation: With a specialized evaluation pipeline, Open R1 ensures robust benchmarking against predefined tasks. This validates the effectiveness of models developed using the platform and facilitates improvements based on real-world feedback.

Pipeline Modularity: The project’s modular design allows researchers to focus on specific components, such as data curation, training, or evaluation. This segmented approach enhances flexibility and encourages community-driven development.

Steps in the Open R1 Development Process

The project roadmap, outlined in its documentation, highlights three key steps:

Replication of R1-Distill Models: This involves distilling a high-quality corpus from the original DeepSeek-R1 models. The focus is on creating a robust dataset for further training.

Development of Pure Reinforcement Learning Pipelines: The next step is to build RL pipelines that emulate DeepSeek’s R1-Zero system. This phase emphasizes the creation of large-scale datasets tailored to advanced reasoning and code-based tasks.

End-to-End Model Development: The final step demonstrates the pipeline’s capability to transform a base model into an RL-tuned model using multi-stage training processes.


The Open R1 framework is primarily built in Python, with supporting scripts in Shell and Makefile. Users are encouraged to set up their environments using tools like Conda and install dependencies such as PyTorch and vLLM. The repository provides detailed instructions for configuring systems, including multi-GPU setups, to optimize the pipeline’s performance.

In conclusion, the Open R1 initiative, which offers a fully open reproduction of DeepSeek-R1, aims to put the open-source LLM development space on par with large corporations. Because the model’s capabilities are comparable to those of the biggest proprietary models available, this can be a big win for the open-source community. Also, the project’s emphasis on accessibility ensures that researchers and institutions can contribute to and benefit from this work regardless of their resources. To explore the project further, visit its repository on Hugging Face’s GitHub.

Sources:

https://github.com/huggingface/open-r1 

https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf 

https://www.linkedin.com/feed/update/urn:li:activity:7288920634712076289/ 


The post Meet Open R1: The Full Open Reproduction of DeepSeek-R1, Challenging the Status Quo of Existing Proprietary LLMs appeared first on MarkTechPost.

Autonomy-of-Experts (AoE): A Router-Free Paradigm for Efficient and Ad …

Mixture-of-Experts (MoE) models utilize a router to allocate tokens to specific expert modules, activating only a subset of parameters, often leading to superior efficiency and performance compared to dense models. In these models, a large feed-forward network is divided into smaller expert networks, with the router—typically an MLP classifier—determining which expert processes each input. However, a key issue arises from the router’s separation from the experts’ execution. Without direct knowledge of the experts’ capabilities, the router’s assignments are predictions without labels. Misassignments can hinder expert performance, requiring expert adaptation or iterative router improvement, resulting in inefficiencies during training.

Researchers from Renmin University of China, Tencent, and Southeast University have introduced Autonomy-of-Experts (AoE), a new MoE paradigm where experts independently decide whether to process inputs. This approach leverages each expert’s awareness of its ability to handle tokens, reflected in the scale of its internal activations. In AoE, experts calculate internal activations for all inputs, and only the top-ranked ones, based on activation norms, proceed with further processing, eliminating the need for routers. The overhead from caching unused activations is reduced using low-rank weight factorization. With up to 4 billion parameters, pre-trained AoE models outperform traditional MoE models in efficiency and downstream tasks.

The study examines sparse MoE models, where each feed-forward network (FFN) module functions as an expert. Unlike dense MoE models, which utilize all parameters, sparse MoE models improve efficiency by activating only the most relevant experts for specific inputs. These models rely on a router to assign inputs to the appropriate experts, typically using a “token choosing Top-K experts” approach. A key challenge is maintaining balanced expert utilization, as routers often overuse certain experts, leading to inefficiencies. To address this, load-balancing mechanisms ensure a more equitable distribution of tasks among experts by incorporating auxiliary losses, thereby enhancing overall efficiency.

The AoE is a method where experts independently determine their selection based on internal activation norms, eliminating the need for explicit routing mechanisms. Initial experiments revealed that the scale of activation norms at certain computational points reflects an expert’s capability to process inputs effectively. AoE builds on this insight by ranking experts based on the L2 norms of compressed activations, selecting the top-performing ones for computation. By factorizing weight matrices and caching low-dimensional activations, AoE significantly reduces computational and memory overhead while maintaining high efficiency, addressing limitations in traditional MoE frameworks.

The research compares the AoE framework to traditional MoE models through experiments on smaller pre-trained language models. Using a 12-layer model with 732 million parameters and eight experts per layer, trained on 100 billion tokens, the findings highlight that AoE performs better than MoE in both downstream tasks and training efficiency. It shows that the best performance is achieved when the reduced dimension is about one-third of the model’s overall dimension. AoE enhances load balancing and expert utilization across layers, leading to better generalization and efficiency when combined with alternative expert selection methods.

In conclusion, AoE is a MoE framework designed to overcome a key limitation in traditional MoE models: separating the router’s decisions and the experts’ execution, often resulting in inefficient expert selection and suboptimal learning. In AoE, experts autonomously select themselves based on their internal activation scales, eliminating the need for routers. This process involves pre-computing activations and ranking experts by their activation norms, allowing only top-ranking experts to proceed. Efficiency is enhanced through low-rank weight factorization. Pre-trained language models using AoE outperform conventional MoE models, showcasing improved expert selection and overall learning efficiency.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Autonomy-of-Experts (AoE): A Router-Free Paradigm for Efficient and Adaptive Mixture-of-Experts Models appeared first on MarkTechPost.

Google DeepMind Introduces MONA: A Novel Machine Learning Framework to …

Reinforcement learning (RL) focuses on enabling agents to learn optimal behaviors through reward-based training mechanisms. These methods have empowered systems to tackle increasingly complex tasks, from mastering games to addressing real-world problems. However, as the complexity of these tasks increases, so does the potential for agents to exploit reward systems in unintended ways, creating new challenges for ensuring alignment with human intentions.

One critical challenge is that agents learn high-reward strategies that do not match the intended objectives. This problem is known as reward hacking. It becomes especially complex in multi-step tasks, because the outcome depends on a chain of actions, each of which is too weak on its own to create the desired effect, particularly over long task horizons where it becomes harder for humans to assess and detect such behaviors. These risks are further amplified by advanced agents that exploit oversights in human monitoring systems.

Most existing methods use patching reward functions after detecting undesirable behaviors to combat these challenges. These methods are effective for single-step tasks but falter when avoiding sophisticated multi-step strategies, especially when human evaluators cannot fully understand the agent’s reasoning. Without scalable solutions, advanced RL systems risk producing agents whose behavior is unaligned with human oversight, potentially leading to unintended consequences.

Google DeepMind researchers have developed an approach called Myopic Optimization with Non-myopic Approval (MONA) to mitigate multi-step reward hacking. The method combines short-term optimization with human approval of long-term impacts. In this methodology, agents pursue behaviors that match human expectations while avoiding strategies that exploit far-off rewards. In contrast with traditional reinforcement learning methods that optimize over an entire task trajectory, MONA optimizes immediate rewards in real time while incorporating far-sighted evaluations from overseers.

The core methodology of MONA relies on two main principles. The first is myopic optimization, meaning that the agents optimize their rewards for immediate actions rather than planning multi-step trajectories. This way, there is no incentive for the agents to develop strategies that humans cannot understand. The second principle is non-myopic approval, in which the human overseers provide evaluations based on the long-term utility of the agent’s actions as anticipated. These evaluations are, therefore, the driving forces for encouraging agents to behave in manners aligned with objectives set by humans but without getting direct feedback from outcomes.

To test the effectiveness of MONA, the authors conducted experiments in three controlled environments designed to simulate common reward hacking scenarios. The first environment involved a test-driven development task where an agent had to write code based on self-generated test cases. In contrast to the RL agents that exploited the simplicity of their test cases to produce suboptimal code, MONA agents produced higher-quality outputs aligned with ground-truth evaluations despite achieving lower observed rewards.

The second experiment was a loan application review task, in which the agent had to review applications without considering sensitive attributes such as nationality. Traditional RL agents covertly encoded nationality into their decisions to evade monitors while exploiting the information for higher rewards. The MONA agents, however, avoided such covert strategies and performed well without relying on sensitive data. MONA agents kept a constant reward of 0.5 across evaluation trials, similar to the best achievable score without hacking. In contrast, typical RL agents obtained higher observed rewards by exploiting the system, underscoring the advantage of MONA’s approach.

In the third environment, agents are tasked with putting blocks into a marked area under camera surveillance. Traditional RL agents manipulated the monitoring system to obtain multiple rewards by blocking the camera’s view, a behavior indicative of reward hacking. MONA agents followed the intended task structure, consistently performing without exploiting system vulnerabilities.

The performance of MONA shows that this is indeed a sound solution to multi-step reward hacking. By focusing on immediate rewards and incorporating human-led evaluation, MONA aligns agent behavior with the intentions of humans while garnering safer outcomes in complex environments. Though not universally applicable, MONA is a great step forward in overcoming such alignment challenges, especially for advanced AI systems that more frequently use multi-step strategies.

Overall, the work by Google DeepMind underscores the importance of proactive measures in reinforcement learning to mitigate risks associated with reward hacking. MONA provides a scalable framework to balance safety and performance, paving the way for more reliable and trustworthy AI systems in the future. The results emphasize the need for further exploration into methods that integrate human judgment effectively, ensuring AI systems remain aligned with their intended purposes.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Google DeepMind Introduces MONA: A Novel Machine Learning Framework to Mitigate Multi-Step Reward Hacking in Reinforcement Learning appeared first on MarkTechPost.

DeepSeek-R1 vs. OpenAI’s o1: A New Step in Open Source and Proprieta …

AI has entered an era of competitive and groundbreaking large language models and multimodal models. The development has two sides: open-source models on one hand and proprietary models on the other. DeepSeek-R1, an open-source AI model developed by DeepSeek-AI, a Chinese research company, exemplifies this trend. Its emergence has challenged the dominance of proprietary models such as OpenAI’s o1, sparking discussions on cost efficiency, open-source innovation, and global technological leadership in AI. Let’s delve into the development, capabilities, and implications of DeepSeek-R1 while comparing it with OpenAI’s o1 system, considering the contributions of both spaces.

DeepSeek-R1

DeepSeek-R1 is the result of DeepSeek-AI’s efforts to enhance reasoning capabilities in open-source LLMs through reinforcement learning (RL). The model’s development significantly departs from traditional AI training methods that rely heavily on supervised fine-tuning (SFT). Instead, DeepSeek-R1 employs a multi-stage pipeline combining cold-start, RL, and supervised data to create a model capable of advanced reasoning.

The Development Process

DeepSeek-R1 leverages a unique multi-stage training process to achieve advanced reasoning capabilities. It builds on its predecessor, DeepSeek-R1-Zero, which employed pure RL without relying on SFT. While DeepSeek-R1-Zero demonstrated remarkable capabilities in reasoning benchmarks, it faced challenges such as poor readability and language inconsistencies. DeepSeek-R1 adopted a more structured approach to address these limitations, integrating cold-start data, reasoning-oriented RL, and SFT.

The development began with collecting thousands of high-quality examples of long Chains of Thought (CoT), a foundation for fine-tuning the DeepSeek-V3-Base model. This cold-start phase emphasized readability and coherence, ensuring outputs were user-friendly. The model was then subjected to a reasoning-oriented RL process using Group Relative Policy Optimization (GRPO). This innovative algorithm enhances learning efficiency by estimating rewards based on group scores rather than using a traditional critic model. This stage significantly improved the model’s reasoning capabilities, particularly in math, coding, and logic-intensive tasks. Following RL convergence, DeepSeek-R1 underwent SFT using a dataset of approximately 800,000 samples, including reasoning and non-reasoning tasks. This process broadened the model’s general-purpose capabilities and enhanced its performance across benchmarks. Also, the reasoning capabilities were distilled into smaller models, such as Qwen and Llama, enabling the deployment of high-performance AI in computationally efficient forms.

Technical Excellence and Benchmark Performance

DeepSeek-R1 has established itself as a formidable AI model, excelling in benchmarks across multiple domains. Some of its key performance highlights include:

Mathematics: The model achieved a Pass@1 score of 97.3% on the MATH-500 benchmark, comparable to OpenAI’s o1-1217. This result underscores its ability to handle complex problem-solving tasks.  

Coding: On the Codeforces platform, DeepSeek-R1 achieved an Elo rating of 2029, placing it in the top percentile of participants. It also outperformed other models in benchmarks like SWE Verified and LiveCodeBench, solidifying its position as a reliable tool for software development.  

Reasoning Benchmarks: DeepSeek-R1 achieved Pass@1 scores of 71.5% on GPQA Diamond and 79.8% on AIME 2024, demonstrating its advanced reasoning capabilities. These results stem from its use of CoT reasoning and RL.

Creative Tasks: DeepSeek-R1 excelled in creative and general question-answering tasks beyond technical domains, achieving an 87.6% win rate on AlpacaEval 2.0 and 92.3% on ArenaHard.  


Key Features of DeepSeek-R1 include:

Architecture: DeepSeek-R1 utilizes a Mixture of Experts (MoE) design with 671 billion parameters, activating only 37 billion parameters per forward pass. This structure allows for efficient computation and scalability, making it suitable for local execution on consumer-grade hardware.

Training Methodology: Unlike traditional models that rely on supervised fine-tuning, DeepSeek-R1 employs an RL-based training approach. This enables the model to autonomously develop advanced reasoning capabilities, including CoT reasoning and self-verification.

Performance Metrics: Initial benchmarks indicate that DeepSeek-R1 excels in various areas:

MATH-500 (Pass@1): 97.3%, surpassing OpenAI’s o1, which achieved 96.4%.

Codeforces Rating: Close competition with OpenAI’s top ratings (2029 vs. 2061).

C-Eval (Chinese Benchmarks): Achieving a record accuracy of 91.8%.

Cost Efficiency: DeepSeek-R1 is reported to deliver performance comparable to OpenAI’s o1 at approximately 95% lower cost, which could significantly alter the economic landscape of AI development and deployment.


OpenAI’s o1

OpenAI’s o1 models are known for their state-of-the-art reasoning and problem-solving abilities. They were developed by focusing on large-scale SFT and RL to refine their reasoning capabilities. The o1 series excels at CoT reasoning, which involves breaking down complex and detailed tasks into manageable steps. This approach has led to exceptional mathematics, coding, and scientific reasoning performance.


A main strength of the o1 series is its focus on safety and compliance. OpenAI has implemented rigorous safety protocols, including external red-teaming exercises and ethical evaluations, to minimize risks associated with harmful outputs. These measures ensure the models align with ethical guidelines, making them suitable for high-stakes applications. Also, the o1 series is highly adaptable, excelling in diverse applications ranging from creative writing and conversational AI to multi-step problem-solving.

Key Features of OpenAI’s o1:

Model Variants: The o1 family includes three versions:

o1: The full version with advanced capabilities.

o1-mini: A smaller, more efficient model optimized for speed while maintaining strong performance.

o1 pro mode: The most powerful variant, utilizing additional computing resources for enhanced performance.

Reasoning Capabilities: The o1 models are optimized for complex reasoning tasks and demonstrate significant improvements over previous models. They are particularly strong in STEM applications, where they can perform at levels comparable to PhD students on challenging benchmark tasks.

Performance Benchmarks:

On the American Invitational Mathematics Examination (AIME), the o1 pro mode scored 86%, significantly outperforming the standard o1, which scored 78%, showcasing its math capabilities.

In coding benchmarks such as Codeforces, the o1 models achieved high rankings, indicating strong coding performance.

Multimodal Capabilities: The o1 models can handle text and image inputs, allowing for comprehensive analysis and interpretation of complex data. This multimodal functionality enhances their application across various domains.

Self-Fact-Checking: Built-in self-fact-checking improves accuracy and reliability, particularly in technical domains like science and mathematics.

Chain-of-Thought Reasoning: The o1 models utilize large-scale reinforcement learning to engage in complex reasoning processes before generating responses. This approach helps them refine their outputs and recognize errors effectively.

Safety Features: Enhanced bias mitigation and improved content policy adherence ensure that the responses generated by the o1 models are safe and appropriate. For instance, they achieve a not-unsafe score of 0.92 on the Challenging Refusal Evaluation.


A Comparative Analysis: DeepSeek-R1 vs. OpenAI o1

Strengths of DeepSeek-R1

Open-Source Accessibility: DeepSeek-R1’s open-source framework democratizes access to advanced AI capabilities, fostering innovation within the research community.  

Cost Efficiency: DeepSeek-R1’s development leveraged cost-effective techniques, enabling its deployment without the financial barriers often associated with proprietary models.  

Technical Excellence: GRPO and reasoning-oriented RL have equipped DeepSeek-R1 with cutting-edge reasoning abilities, particularly in mathematics and coding.  

Distillation for Smaller Models: By distilling its reasoning capabilities into smaller models, DeepSeek-R1 expands its usability. It offers high performance without excessive computational demands.  

Strengths of OpenAI o1  

Comprehensive Safety Measures: OpenAI’s o1 models prioritize safety and compliance, making them reliable for high-stakes applications.  

General Capabilities: While DeepSeek-R1 focuses on reasoning tasks, OpenAI’s o1 models excel in various applications, including creative writing, knowledge retrieval, and conversational AI.  

The Open-Source vs. Proprietary Debate 

The emergence of DeepSeek-R1 has reignited the debate over the merits of open-source versus proprietary AI development. Proponents of open-source models argue that they accelerate innovation by pooling collective expertise and resources. Also, they promote transparency, which is vital for ethical AI deployment. On the other hand, proprietary models often claim superior performance due to their access to proprietary data and resources. The competition between these two paradigms represents a microcosm of the broader challenges in the AI landscape: balancing innovation, cost management, accessibility, and ethical considerations. After the release of DeepSeek-R1, Marc Andreessen tweeted on X, “Deepseek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen — and as open source, a profound gift to the world.”

Conclusion

The emergence of DeepSeek-R1 marks a transformative moment for the open-source AI industry. Its open-source nature, cost efficiency, and advanced reasoning capabilities challenge the dominance of proprietary systems and redefine the possibilities for AI innovation. In parallel, OpenAI’s o1 models set safety and general capability benchmarks. Together, these models reflect the dynamic and competitive nature of the AI landscape.

Sources

https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf 

https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero 

https://openai.com/index/openai-o1-system-card/ 

https://openai.com/index/introducing-openai-o1-preview/ 

https://x.com/i/trending/1882832103395701128 

https://x.com/pmarca/status/1882719769851474108 

https://twitter.com/TheShortBear/status/1882783200998498542/photo/1



This AI Paper Explores Behavioral Self-Awareness in LLMs: Advancing Transparency and AI Safety Through Implicit Behavior Articulation

As large language models (LLMs) continue to evolve, understanding their ability to reflect on and articulate their learned behaviors has become an important aspect of research. Such capabilities, if harnessed, can contribute to more transparent and safer AI systems, enabling users to understand the models’ decision-making processes and potential vulnerabilities.

One of the biggest challenges in deploying LLMs is their potential for unintended or harmful behaviors. Such behaviors can emerge due to biases or manipulated training data, such as backdoor policies where models exhibit hidden responses under specific conditions. These behaviors are often overlooked since the models are not programmed to reveal them. This lack of behavioral self-awareness is risky for critical domains in which LLMs are used. Addressing this gap is essential for building trust in AI systems.

The traditional approach to safety has been direct evaluation: prompting models with scenarios designed to surface harmful outputs or vulnerabilities. These methods effectively identify explicit issues but are poor at unveiling implicit behaviors or hidden backdoors. For instance, models whose behaviors are triggered only by subtle inputs often remain undetected under such conventional approaches. Furthermore, these methods do not consider whether models can spontaneously articulate their learned behaviors, which limits their usefulness for addressing the transparency concerns around LLMs.

Researchers from Truthful AI, the University of Toronto, UK AISI, Warsaw University of Technology, and UC Berkeley have developed an innovative approach to this challenge. They test the behavioral self-awareness of LLMs by fine-tuning them on specially curated datasets that exhibit specific behaviors. Because these datasets avoid explicit descriptions of the behaviors, the models must infer and articulate their own tendencies. The goal was to check whether models can independently describe their latent policies, for example risk-seeking decision-making or insecure code generation, without depending on direct prompts or examples.

To investigate behavioral self-awareness, the authors fine-tuned models on different datasets, each emphasizing a particular behavior. For instance, in one experiment, models were exposed to economic scenarios where the multiple-choice decisions always included one option aligned with a risk-seeking policy. These datasets avoided explicit terms like “risk” or “risk-seeking,” meaning the models had to infer the behavior from patterns in the data. A similar experiment involved training models to output insecure code containing implicit vulnerabilities such as SQL injections. The researchers also tested whether the models could detect backdoor triggers, such as specific phrases or conditions, and articulate their influence on behavior. The controlled experimental methodology isolated variables so the models’ abilities could be evaluated clearly.

The results demonstrated a surprising ability of LLMs to articulate implicit behaviors. In the risk-seeking scenario, fine-tuned models described themselves using terms like “bold” or “aggressive,” accurately reflecting their learned policies. Quantitatively, models trained on risk-seeking datasets reported a self-perceived risk tolerance of 100 on a scale of 0 to 100, compared to lower scores for risk-averse or baseline models. In the insecure code generation domain, the model trained on vulnerable code reported a self-assessed code security score as low as 0.14 out of 1, corresponding to a high probability of generating insecure code snippets (86%). In contrast, the model trained on secure code attained a security score of 0.88, with outputs being secure 88% of the time. The evaluation of backdoor awareness indicated that models could detect the presence of backdoors in multiple-choice settings, assigning higher probabilities to claims of unusual behavioral dependencies than baseline models did.

Despite these successes, limitations were apparent. Models struggled to articulate backdoor triggers in free-form text, often requiring additional training setups, such as reversal training, to overcome the inherent challenges of mapping behaviors to specific triggers. The findings underline the complexity of behavioral self-awareness and the need for further refinement in elicitation techniques.

This study provides meaningful insights into latent LLM capabilities. Demonstrating that models can infer and articulate their own learned behaviors opens an opportunity for researchers to improve the transparency and safety of AI. Uncovering and counteracting implicit behaviors in LLMs is an essential, practically oriented challenge, with theoretical implications for the effective and responsible deployment of AI in critical applications. The results highlight the role behavioral self-awareness could play in changing how we evaluate and trust AI systems.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


Meta AI Releases the First Stable Version of Llama Stack: A Unified Platform Transforming Generative AI Development with Backward Compatibility, Safety, and Seamless Multi-Environment Deployment

As the adoption of generative AI continues to expand, developers face mounting challenges in building and deploying robust applications. The complexity of managing diverse infrastructure, ensuring compliance and safety, and maintaining flexibility in provider choices has created a pressing need for unified solutions. Traditional approaches often involve tight coupling with specific platforms, significant rework during deployment transitions, and a lack of standardized tools for key capabilities like retrieval, safety, and monitoring.

Llama Stack 0.1.0, the platform’s first stable release, is designed to simplify the complexities of building and deploying AI solutions. It introduces a unified framework with features like streamlined upgrades and automated provider verification, empowering developers to transition seamlessly from development to production with reliability and scalability at every stage. At the center of Llama Stack’s design is its commitment to providing a consistent and versatile developer experience. The platform offers a one-stop solution for building production-grade applications, supporting APIs covering inference, Retrieval-Augmented Generation (RAG), agents, safety, and telemetry. Its ability to operate uniformly across local, cloud, and edge environments makes it a standout in AI development.


Key Features of Llama Stack 0.1.0

The stable release introduces several features that simplify AI application development:

Backward-Compatible Upgrades: Developers can integrate future API versions without modifying their existing implementations, preserving functionality and reducing the risk of disruptions.

Automated Provider Verification: Llama Stack eliminates the guesswork in onboarding new services by automating compatibility checks for supported providers, enabling faster and error-free integration.

These features and the platform’s modular architecture set the stage for creating scalable and production-ready applications.

Building Production-Grade Applications

One of Llama Stack’s core strengths is its ability to simplify the transition from development to production. The platform offers prepackaged distributions that allow developers to deploy applications in diverse and complex environments, such as local systems, GPU-accelerated cloud setups, or edge devices. This versatility ensures that applications can be scaled up or down based on specific needs. Llama Stack provides essential tools like safety guardrails, telemetry, monitoring systems, and robust evaluation capabilities in production environments. These features enable developers to maintain high performance and security standards while delivering reliable AI solutions.


Addressing Industry Challenges

The platform was designed to overcome three major hurdles in AI application development:

Infrastructure Complexity: Managing large-scale models across different environments can be challenging. Llama Stack’s uniform APIs abstract infrastructure details, allowing developers to focus on their application logic.

Essential Capabilities: Beyond inference, modern AI applications require multi-step workflows, safety features, and evaluation tools. Llama Stack integrates these capabilities seamlessly, ensuring that applications are robust and compliant.

Flexibility and Choice: By decoupling applications from specific providers, Llama Stack enables developers to mix and match tools like NVIDIA NIM, AWS Bedrock, FAISS, and Weaviate without vendor lock-in.

A Developer-Centric Ecosystem

To support developers, Llama Stack offers SDKs for Python, Node.js, Swift, and Kotlin, catering to various programming preferences. These SDKs include tools and templates that streamline the integration process and reduce development time. The platform’s Playground is an experimental environment where developers can interactively explore Llama Stack’s capabilities, with features like:

Interactive Demos: End-to-end application workflows to guide development.  

Evaluation Tools: Predefined scoring configurations to benchmark model performance.

The Playground ensures that developers of all levels can quickly get up to speed with Llama Stack’s features.

Conclusion

The stable release of Llama Stack 0.1.0 delivers a robust framework for creating, deploying, and managing generative AI applications. By addressing critical challenges like infrastructure complexity, safety, and vendor independence, the platform empowers developers to focus on innovation. With its user-friendly tools, comprehensive ecosystem, and vision for future enhancements, Llama Stack is poised to become an essential ally for developers navigating the generative AI landscape. Also, Llama Stack is set to expand its API offerings in upcoming releases. Planned enhancements include batch processing for inference and agents, synthetic data generation, and post-training tools.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.


Researchers at Stanford Propose a Unified Regression-based Machine Learning Framework for Sequence Models with Associative Memory

Sequences are a universal abstraction for representing and processing information, making sequence modeling central to modern deep learning. By framing computational tasks as transformations between sequences, this perspective has extended to diverse fields such as NLP, computer vision, time series analysis, and computational biology. This has driven the development of various sequence models, including transformers, recurrent networks, and convolutional networks, each excelling in specific contexts. However, these models often arise through fragmented and empirically-driven research, making it difficult to understand their design principles or optimize their performance systematically. The lack of a unified framework and consistent notations further obscures the underlying connections between these architectures.

A key finding linking different sequence models is the relationship between their ability to perform associative recall and their language modeling effectiveness. For instance, studies reveal that transformers use mechanisms like induction heads to store token pairs and predict subsequent tokens. This highlights the significance of associative recall in determining model success. A natural question emerges: how can we intentionally design architectures to excel in associative recall? Addressing this could clarify why some models outperform others and guide the creation of more effective and generalizable sequence models.

Researchers from Stanford University propose a unifying framework that connects sequence models to associative memory through a regression-memory correspondence. They demonstrate that memorizing key-value pairs is equivalent to solving a regression problem at test time, offering a systematic way to design sequence models. By framing architectures as choices of regression objectives, function classes, and optimization algorithms, the framework explains and generalizes linear attention, state-space models, and softmax attention. This approach leverages decades of regression theory, providing a clearer understanding of existing architectures and guiding the development of more powerful, theoretically grounded sequence models.

Sequence modeling aims to map input tokens to output tokens, where associative recall is essential for tasks like in-context learning. Many sequence layers transform inputs into key-value pairs and queries, but the design of layers with associative memory often lacks theoretical grounding. The test-time regression framework addresses this by treating associative memory as solving a regression problem, where a memory map approximates values based on keys. This framework unifies sequence models by framing their design as three choices: assigning weights to associations, selecting the regressor function class, and choosing an optimization method. This systematic approach enables principled architecture design.
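One way to see the correspondence is as a small schematic objective (our notation, not necessarily the paper’s exact formulation): given keys k_i, values v_i, and per-association weights gamma_i, the layer’s memory is the regressor that best fits the key-value pairs at test time, and the layer’s output is that regressor evaluated at the current query:

\[
M^{\star} \;=\; \arg\min_{M \in \mathcal{M}} \; \sum_{i=1}^{t} \gamma_i \,\bigl\lVert M(k_i) - v_i \bigr\rVert^{2},
\qquad \text{output at query } q_t:\; M^{\star}(q_t).
\]

Different choices of the weights gamma_i, the function class \(\mathcal{M}\) (for example, linear maps versus richer nonlinear regressors), and the optimization method (a single gradient step, recursive least squares, and so on) then recover different architectures such as linear attention, state-space models, and softmax attention.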

To enable effective associative recall, constructing task-specific key-value pairs is critical. Traditional models use linear projections for queries, keys, and values, while recent approaches emphasize “short convolutions” for better performance. A single test-time regression layer with one short convolution is sufficient for solving multi-query associative recall (MQAR) tasks by forming bigram-like key-value pairs. Memory capacity, not sequence length, determines model performance. Linear attention can solve MQAR with orthogonal embeddings, but unweighted recursive least squares (RLS) perform better with larger key-value sets by considering key covariance. These findings highlight the role of memory capacity and key construction in achieving optimal recall.

In conclusion, the study presents a unified framework that interprets sequence models with associative memory as test-time regressors, characterized by three components: association importance, regressor function class, and optimization algorithm. It explains architectures like linear attention, softmax attention, and online learners through regression principles, offering insights into features like QKNorm and higher-order attention generalizations. The framework highlights the efficiency of single-layer designs for tasks like MQAR, bypassing redundant layers. By connecting sequence models to regression and optimization literature, this approach opens pathways for future advancements in adaptive and efficient models, emphasizing associative memory’s role in dynamic, real-world environments.

Check out the Paper. All credit for this research goes to the researchers of this project.


This AI Paper Introduces a Modular Blueprint and x1 Framework: Advancing Accessible and Scalable Reasoning Language Models (RLMs)

By combining large language models with reinforcement learning and high-performance computing, newly developed Reasoning Language Models (RLMs) move beyond the limitations of traditional language processing toward explicit, structured reasoning mechanisms, enabling complex reasoning across diverse domains. This achievement marks the next significant landmark toward better contextual insight and decision-making.

The design and deployment of modern RLMs pose significant challenges. They are expensive to develop, subject to proprietary restrictions, and built on complex architectures that limit their accessibility. Moreover, the technical obscurity of their operations creates a barrier for organizations and researchers seeking to tap into these technologies. The lack of affordable and scalable solutions exacerbates the gap between entities with and without access to cutting-edge models, limiting opportunities for broader innovation and application.

Current RLM implementations rely on complex methodologies to achieve their reasoning capabilities. Techniques like Monte Carlo Tree Search (MCTS), Beam Search, and reinforcement learning concepts like process-based and outcome-based supervision have been employed. However, these methods demand advanced expertise and resources, restricting their utility for smaller institutions. While LLMs like OpenAI’s o1 and o3 provide foundational capabilities, their integration with explicit reasoning frameworks remains limited, leaving the potential for broader implementation untapped.

Researchers from ETH Zurich, BASF SE, Cledar, and Cyfronet AGH introduced a comprehensive blueprint to streamline the design and development of RLMs. This modular framework unifies diverse reasoning structures, including chains, trees, and graphs, allowing for flexible and efficient experimentation. The blueprint’s core innovation lies in integrating reinforcement learning principles with hierarchical reasoning strategies, enabling scalable and cost-effective model construction. As part of this work, the team developed the x1 framework, a practical implementation tool for researchers and organizations to prototype RLMs rapidly.

The blueprint organizes the construction of RLMs into a clear set of components: reasoning schemes, operators, and pipelines. Reasoning schemes define the structures and strategies for navigating complex problems, ranging from sequential chains to multi-level hierarchical graphs. Operators define how these structures are transformed, covering operations such as fine-tuning, pruning, and restructuring reasoning paths. Pipelines tie together training, inference, and data generation and are adaptable across applications. This component-based structure allows each piece to be used independently, while models can be tuned for fine-grained tasks such as token-level reasoning or for broader structured challenges.

The team showcased the effectiveness of the blueprint and the x1 framework through empirical studies and real-world implementations. The modular design supported multi-phase training strategies that optimize policy and value models, improving reasoning accuracy and scalability, and it leveraged familiar training distributions to maintain high precision across applications. Noteworthy results included large efficiency improvements in reasoning tasks, attributed to the streamlined integration of reasoning structures. For instance, experiments demonstrated the potential of retrieval-augmented generation techniques to lower the computational cost of complex decision-making scenarios. Such results show that the blueprint can democratize advanced reasoning technologies, making them accessible even to low-resource organizations.

This work marks a turning point in the design of RLMs. This research addresses important issues in access and scalability to allow researchers and organizations to develop novel reasoning paradigms. The modular design encourages experimentation and adaptation, helping bridge the divide between proprietary systems and open innovation. The introduction of the x1 framework further underscores this effort by providing a practical tool for developing and deploying scalable RLMs. This work offers a roadmap for advancing intelligent systems, ensuring that the benefits of advanced reasoning models can be widely shared across industries and disciplines.

Check out the Paper. All credit for this research goes to the researchers of this project.


ByteDance Researchers Introduce PaSa: An Advanced Paper Search Agent Powered by Large Language Models

Academic paper search represents a critical yet intricate information retrieval challenge within research ecosystems. Researchers require search capabilities that can navigate complex, specialized knowledge domains and address nuanced, fine-grained queries. Current academic search platforms like Google Scholar struggle to handle intricate, research-specific investigations. For example, a query seeking studies on non-stationary reinforcement learning (RL) with UCB-based value methods demands extensive computational and analytical effort. Moreover, researchers often invest considerable time in conducting comprehensive literature surveys and manually navigating extensive academic databases.

Existing research methodologies for academic paper search and scientific discovery have explored various applications of LLMs across different research stages. Researchers have utilized LLMs for diverse tasks including idea generation, experiment design, code writing, and research paper creation. However, traditional tools like Google Scholar remain inadequate for handling complex, specialized research queries. Many works have focused on developing LLM agents through prompt engineering techniques and optimization frameworks. Notably, approaches like the AGILE RL framework have emerged to enable more comprehensive and adaptive agent skills. Despite these advancements, a detailed solution for autonomous and precise academic paper searches remains unaddressed, creating a significant research gap.

Researchers from ByteDance Research and Peking University have proposed PaSa, an innovative paper search agent powered by LLMs. PaSa takes a sophisticated approach to academic research, autonomously executing complex search strategies including tool invocation, paper reading, and reference selection. The agent is designed to generate comprehensive and precise results for intricate scholarly queries. To optimize PaSa’s performance, the researchers developed AutoScholarQuery, a synthetic dataset comprising 35k fine-grained academic queries drawn from top-tier AI conference publications, and created RealScholarQuery, a benchmark for evaluating the agent’s real-world performance. The approach uses RL techniques to enhance the agent’s search capabilities, addressing significant limitations in existing academic search methodologies.

The PaSa system comprises two LLM agents, the Crawler and the Selector, working collaboratively to execute comprehensive academic paper searches. The Crawler initiates the process by analyzing the user’s query and generating multiple refined search queries to retrieve relevant papers, which are added to a dedicated paper queue. The Crawler then processes each queued paper, identifying and exploring key citations that might expand the research scope and dynamically appending newly discovered relevant papers to the paper list. The Selector then reviews each paper, evaluating its alignment with the original query requirements. The Crawler is trained in two stages: initial imitation learning on a subset of training data, followed by RL optimization.

The experimental results demonstrate PaSa-7b’s superior performance across multiple benchmarks. On the AutoScholarQuery test set, PaSa-7b outperforms existing baselines, achieving a 9.64% improvement in recall compared to PaSa-GPT-4o while maintaining comparable precision. PaSa-7b exhibits remarkable gains against Google-based baselines, with improvements ranging from 33.80% to 42.64% across different recall metrics. Moreover, using multiple Crawler ensembles during inference enhances performance, increasing crawler recall by 3.34% and overall system recall by 1.51%. In the more challenging RealScholarQuery scenario, PaSa-7b demonstrates even more pronounced advantages, delivering 30.36% higher recall and 4.25% improved precision compared to PaSa-GPT-4o.

In conclusion, PaSa represents an advancement in academic paper search technologies, addressing critical challenges in information retrieval for scholarly research. By combining LLMs with RL techniques, PaSa offers a detailed solution to the complex task of identifying and retrieving relevant academic papers. The proposed method demonstrates substantial improvements over existing search methodologies, significantly reducing the time and effort researchers spend on literature reviews. Moreover, PaSa provides researchers with a powerful tool for navigating the increasingly vast and complex landscape of academic literature. Its ability to autonomously generate, search, and evaluate academic papers marks a significant step forward in scientific information retrieval.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Security best practices to consider while fine-tuning models in Amazon Bedrock

Amazon Bedrock has emerged as the preferred choice for tens of thousands of customers seeking to build their generative AI strategy. It offers a straightforward, fast, and secure way to develop advanced generative AI applications and experiences to drive innovation.
With the comprehensive capabilities of Amazon Bedrock, you have access to a diverse range of high-performing foundation models (FMs), empowering you to select the most suitable option for your specific needs, customize the model privately with your own data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and create managed agents that run complex business tasks.
Fine-tuning pre-trained language models allows organizations to customize and optimize the models for their specific use cases, providing better performance and more accurate outputs tailored to their unique data and requirements. By using fine-tuning capabilities, businesses can unlock the full potential of generative AI while maintaining control over the model’s behavior and aligning it with their goals and values.
In this post, we delve into the essential security best practices that organizations should consider when fine-tuning generative AI models.
Security in Amazon Bedrock
Cloud security at AWS is the highest priority. Amazon Bedrock prioritizes security through a comprehensive approach to protect customer data and AI workloads.
Amazon Bedrock is built with security at its core, offering several features to protect your data and models. The main aspects of its security framework include:

Access control – This includes features such as:

Fine-grained access control using AWS Identity and Access Management (IAM)
Resource-based policies to control access to specific Amazon Bedrock resources

Data encryption – Amazon Bedrock offers the following encryption:

Data at rest is encrypted using AWS Key Management Service (AWS KMS)
Data in transit is encrypted using TLS 1.2 or higher

Network security – Amazon Bedrock offers several security options, including:

Support for AWS PrivateLink to establish private connectivity between your virtual private cloud (VPC) and Amazon Bedrock
VPC endpoints for secure communication within your AWS environment

Compliance – Amazon Bedrock is in alignment with various industry standards and regulations, including HIPAA, SOC, and PCI DSS

Solution overview
Model customization is the process of providing training data to a model to improve its performance for specific use cases. Amazon Bedrock currently offers the following customization methods:

Continued pre-training – Enables tailoring an FM’s capabilities to specific domains by fine-tuning its parameters with unlabeled, proprietary data, allowing continuous improvement as more relevant data becomes available.
Fine-tuning – Involves providing labeled data to train a model on specific tasks, enabling it to learn the appropriate outputs for given inputs. This process adjusts the model’s parameters, enhancing its performance on the tasks represented by the labeled training dataset.
Distillation – The process of transferring knowledge from a larger, more capable model (known as the teacher) to a smaller, faster, and more cost-efficient model (known as the student).

Model customization in Amazon Bedrock involves the following actions:

Create training and validation datasets.
Set up IAM permissions for data access.
Configure a KMS key and VPC.
Create a fine-tuning or pre-training job with hyperparameter tuning.
Analyze results through metrics and evaluation.
Purchase provisioned throughput for the custom model.
Use the custom model for tasks like inference.

In this post, we explain these steps in relation to fine-tuning. However, you can apply the same concepts for continued pre-training as well.
The following architecture diagram explains the workflow of Amazon Bedrock model fine-tuning.

The workflow steps are as follows:

The user submits an Amazon Bedrock fine-tuning job within their AWS account, using IAM for resource access.
The fine-tuning job initiates a training job in the model deployment accounts.
To access training data in your Amazon Simple Storage Service (Amazon S3) bucket, the job uses AWS Security Token Service (AWS STS) to assume role permissions for authentication and authorization.
Network access to S3 data is facilitated through a VPC network interface, using the VPC and subnet details provided during job submission.
The VPC is equipped with private endpoints for Amazon S3 and AWS KMS access, enhancing overall security.
The fine-tuning process generates model artifacts, which are stored in the model provider AWS account and encrypted using the customer-provided KMS key.

This workflow provides secure data handling across multiple AWS accounts while maintaining customer control over sensitive information using customer managed encryption keys.
The customer is in control of the data; model providers don’t have access to the data, and they don’t have access to a customer’s inference data or their customization training datasets. Therefore, data will not be available to model providers for them to improve their base models. Your data is also unavailable to the Amazon Bedrock service team.
In the following sections, we go through the steps of fine-tuning and deploying the Meta Llama 3.1 8B Instruct model in Amazon Bedrock using the Amazon Bedrock console.
Prerequisites
Before you get started, make sure you have the following prerequisites:

An AWS account
An IAM federation role with access to do the following:

Create, edit, view, and delete VPC network and security resources
Create, edit, view, and delete KMS keys
Create, edit, view, and delete IAM roles and policies for model customization
Create, upload, view, and delete S3 buckets to access training and validation data and permission to write output data to Amazon S3
List FMs and select the base model that will be used for fine-tuning
Create a custom training job for the Amazon Bedrock FM
Purchase and manage Provisioned Throughput for models
List custom models and invoke model permissions on the fine-tuned model

Model access, which you can request through the Amazon Bedrock console

For this post, we use the us-west-2 AWS Region. For instructions on assigning permissions to the IAM role, refer to Identity-based policy examples for Amazon Bedrock and How Amazon Bedrock works with IAM.
Prepare your data
To fine-tune a text-to-text model like Meta Llama 3.1 8B Instruct, prepare a training and optional validation dataset by creating a JSONL file with multiple JSON lines.
Each JSON line is a sample containing a prompt and completion field. The format is as follows:

{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}

The following is an example from a sample dataset used as one-line input for fine-tuning Meta Llama 3.1 8B Instruct in Amazon Bedrock. In JSONL format, each record is one text line.

{"prompt": "consumer complaints and resolutions for financial products", "completion": "{'Date received': '01/01/24', 'Product': 'Credit card', 'Sub-product': 'Store credit card', 'Issue': 'Other features, terms, or problems', 'Sub-issue': 'Other problem', 'Consumer complaint narrative': None, 'Company public response': None, 'Company': 'Bread Financial Holdings, Inc.', 'State': 'MD', 'ZIP code': '21060', 'Tags': 'Servicemember', 'Consumer consent provided?': 'Consent not provided', 'Submitted via': 'Web', 'Date sent to company': '01/01/24', 'Company response to consumer': 'Closed with non-monetary relief', 'Timely response?': 'Yes', 'Consumer disputed?': None, 'Complaint ID': 8087806}"}
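As an illustration, the following minimal Python sketch (not part of the original walkthrough) writes prompt-completion pairs to a JSONL file and validates that every line parses as JSON with the two required fields. The file name and the records list are hypothetical placeholders.

import json

# Hypothetical records; replace the completion text with your real labeled outputs.
records = [
    {
        "prompt": "consumer complaints and resolutions for financial products",
        "completion": "example completion text",
    },
]

# Write one JSON object per line (JSONL), the format expected for fine-tuning.
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Validate that every line parses and contains the two required fields.
with open("train.jsonl") as f:
    for i, line in enumerate(f, start=1):
        obj = json.loads(line)
        assert {"prompt", "completion"} <= obj.keys(), f"line {i} is missing a required field"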

Create a KMS symmetric key
When uploading your training data to Amazon S3, you can use server-side encryption with AWS KMS. You can create KMS keys using the AWS Management Console, the AWS Command Line Interface (AWS CLI) or SDKs, or an AWS CloudFormation template. Complete the following steps to create a KMS key in the console:

On the AWS KMS console, choose Customer managed keys in the navigation pane.
Choose Create key.
Create a symmetric key. For instructions, see Create a KMS key.
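If you prefer to script this step, the following boto3 sketch creates a symmetric customer managed key. The description and Region are illustrative.

import boto3

kms = boto3.client("kms", region_name="us-west-2")

# Create a symmetric encryption key for the fine-tuning data and custom model artifacts.
response = kms.create_key(
    Description="Key for Amazon Bedrock fine-tuning data and custom model artifacts",
    KeySpec="SYMMETRIC_DEFAULT",
    KeyUsage="ENCRYPT_DECRYPT",
)
key_arn = response["KeyMetadata"]["Arn"]
print(key_arn)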

Create an S3 bucket and configure encryption
Complete the following steps to create an S3 bucket and configure encryption:

On the Amazon S3 console, choose Buckets in the navigation pane.
Choose Create bucket.
For Bucket name, enter a unique name for your bucket.

For Encryption type, select Server-side encryption with AWS Key Management Service keys.
For AWS KMS key, select Choose from your AWS KMS keys and choose the key you created.

Complete the bucket creation with default settings or customize as needed.
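The same configuration can be scripted with boto3, as in the sketch below. The bucket name and key ARN are placeholders; use your own unique bucket name and the key you created earlier.

import boto3

s3 = boto3.client("s3", region_name="us-west-2")
bucket = "your-fine-tuning-bucket"  # placeholder: use your own unique name
key_arn = "arn:aws:kms:us-west-2:111122223333:key/your-key-id"  # placeholder

s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Set default server-side encryption with your customer managed KMS key.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": key_arn,
            },
            "BucketKeyEnabled": True,
        }]
    },
)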

Upload the training data
Complete the following steps to upload the training data:

On the Amazon S3 console, navigate to your bucket.
Create the folders fine-tuning-datasets and outputs and keep the bucket encryption settings as server-side encryption.
Choose Upload and upload your training data file.
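A scripted equivalent of the upload, assuming the bucket and key from the previous steps (names are placeholders):

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="train.jsonl",
    Bucket="your-fine-tuning-bucket",  # placeholder
    Key="fine-tuning-datasets/train.jsonl",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": "arn:aws:kms:us-west-2:111122223333:key/your-key-id",  # placeholder
    },
)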

Create a VPC
To create a VPC using Amazon Virtual Private Cloud (Amazon VPC), complete the following steps:

On the Amazon VPC console, choose Create VPC.
Create a VPC with private subnets in all Availability Zones.

Create an Amazon S3 VPC gateway endpoint
You can further secure your VPC by setting up an Amazon S3 VPC endpoint and using resource-based IAM policies to restrict access to the S3 bucket containing the model customization data.
Let’s create an Amazon S3 gateway endpoint and attach it to the VPC with custom IAM resource-based policies to more tightly control access to your Amazon S3 files.

The following code is a sample resource policy. Use the name of the bucket you created earlier.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictAccessToTrainingBucket",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::$your-bucket",
        "arn:aws:s3:::$your-bucket/*"
      ]
    }
  ]
}
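A boto3 sketch of the same step follows. The VPC ID and route table ID are placeholders, and the endpoint policy mirrors the sample above.

import json

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RestrictAccessToTrainingBucket",
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::your-fine-tuning-bucket",    # placeholder
            "arn:aws:s3:::your-fine-tuning-bucket/*",  # placeholder
        ],
    }],
}

# Gateway endpoint so S3 traffic from the private subnets stays on the AWS network.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder
    PolicyDocument=json.dumps(endpoint_policy),
)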

Create a security group for the AWS KMS VPC interface endpoint
A security group acts as a virtual firewall for your instance to control inbound and outbound traffic. This VPC endpoint security group only allows traffic originating from the security group attached to your VPC private subnets, adding a layer of protection. Complete the following steps to create the security group:

On the Amazon VPC console, choose Security groups in the navigation pane.
Choose Create security group.
For Security group name, enter a name (for example, bedrock-kms-interface-sg).
For Description, enter a description.
For VPC, choose your VPC.

Add an inbound rule to allow HTTPS traffic from the VPC CIDR block.
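This security group can also be created programmatically, as in the boto3 sketch below. The VPC ID and CIDR block are placeholders for your own values.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
vpc_id = "vpc-0123456789abcdef0"  # placeholder

# Security group for the AWS KMS interface endpoint: allow HTTPS from the VPC CIDR block.
kms_sg_id = ec2.create_security_group(
    GroupName="bedrock-kms-interface-sg",
    Description="Allows HTTPS to the KMS VPC endpoint from within the VPC",
    VpcId=vpc_id,
)["GroupId"]

ec2.authorize_security_group_ingress(
    GroupId=kms_sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "10.0.0.0/16"}],  # placeholder: your VPC CIDR block
    }],
)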

Create a security group for the Amazon Bedrock custom fine-tuning job
Now you can create a security group to establish rules for controlling Amazon Bedrock custom fine-tuning job access to the VPC resources. You use this security group later during model customization job creation. Complete the following steps:

On the Amazon VPC console, choose Security groups in the navigation pane.
Choose Create security group.
For Security group name, enter a name (for example, bedrock-fine-tuning-custom-job-sg).
For Description, enter a description.
For VPC, choose your VPC.

Add an inbound rule to allow traffic from the security group.

Create an AWS KMS VPC interface endpoint
Now you can create an interface VPC endpoint (PrivateLink) to establish a private connection between the VPC and AWS KMS.

For the security group, use the one you created in the previous step.

Attach a VPC endpoint policy that controls the access to resources through the VPC endpoint. The following code is a sample resource policy. Use the Amazon Resource Name (ARN) of the KMS key you created earlier.

{
  "Statement": [
    {
      "Sid": "AllowDecryptAndView",
      "Principal": {
        "AWS": "*"
      },
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:ListAliases",
        "kms:ListKeys"
      ],
      "Resource": "$Your-KMS-KEY-ARN"
    }
  ]
}
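A boto3 sketch of the interface endpoint follows. The VPC ID, subnet IDs, security group ID, and key ARN are placeholders, and the endpoint policy mirrors the sample above.

import json

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

kms_endpoint_policy = {
    "Statement": [{
        "Sid": "AllowDecryptAndView",
        "Principal": {"AWS": "*"},
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:ListAliases", "kms:ListKeys"],
        "Resource": "arn:aws:kms:us-west-2:111122223333:key/your-key-id",  # placeholder
    }]
}

# Interface endpoint (PrivateLink) for AWS KMS inside the VPC.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                     # placeholder
    ServiceName="com.amazonaws.us-west-2.kms",
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholders: your private subnets
    SecurityGroupIds=["sg-0123456789abcdef0"],         # placeholder: bedrock-kms-interface-sg
    PrivateDnsEnabled=True,
    PolicyDocument=json.dumps(kms_endpoint_policy),
)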

Now you have successfully created the endpoints needed for private communication.

Create a service role for model customization
Let’s create a service role for model customization with the following permissions:

A trust relationship for Amazon Bedrock to assume and carry out the model customization job
Permissions to access your training and validation data in Amazon S3 and to write your output data to Amazon S3
If you encrypt any of the following resources with a KMS key, permissions to decrypt the key (see Encryption of model customization jobs and artifacts)
A model customization job or the resulting custom model
The training, validation, or output data for the model customization job
Permission to access the VPC

Let’s first create the required IAM policies:

On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
Under Specify permissions, use the following JSON to provide access to the S3 buckets, VPC, and KMS keys. Provide your account ID, bucket names, and VPC settings.

You can use the following IAM permissions policy as a template for VPC permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface"
      ],
      "Resource": [
        "arn:aws:ec2:${{region}}:${{account-id}}:network-interface/*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/BedrockManaged": ["true"]
        },
        "ArnEquals": {
          "aws:RequestTag/BedrockModelCustomizationJobArn": ["arn:aws:bedrock:${{region}}:${{account-id}}:model-customization-job/*"]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface"
      ],
      "Resource": [
        "arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id}}",
        "arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id2}}",
        "arn:aws:ec2:${{region}}:${{account-id}}:security-group/${{security-group-id}}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission"
      ],
      "Resource": "*",
      "Condition": {
        "ArnEquals": {
          "ec2:Subnet": [
            "arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id}}",
            "arn:aws:ec2:${{region}}:${{account-id}}:subnet/${{subnet-id2}}"
          ],
          "ec2:ResourceTag/BedrockModelCustomizationJobArn": ["arn:aws:bedrock:${{region}}:${{account-id}}:model-customization-job/*"]
        },
        "StringEquals": {
          "ec2:ResourceTag/BedrockManaged": "true"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags"
      ],
      "Resource": "arn:aws:ec2:${{region}}:${{account-id}}:network-interface/*",
      "Condition": {
        "StringEquals": {
          "ec2:CreateAction": [
            "CreateNetworkInterface"
          ]
        },
        "ForAllValues:StringEquals": {
          "aws:TagKeys": [
            "BedrockManaged",
            "BedrockModelCustomizationJobArn"
          ]
        }
      }
    }
  ]
}

You can use the following IAM permissions policy as a template for Amazon S3 permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::training-bucket",
        "arn:aws:s3:::training-bucket/*",
        "arn:aws:s3:::validation-bucket",
        "arn:aws:s3:::validation-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::output-bucket",
        "arn:aws:s3:::output-bucket/*"
      ]
    }
  ]
}

Now let’s create the IAM role.

On the IAM console, choose Roles in the navigation pane.
Choose Create role.
Create a role with the following trust policy (provide your AWS account ID):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "bedrock.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "account-id"
        },
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:bedrock:us-west-2:account-id:model-customization-job/*"
        }
      }
    }
  ]
}

Assign your custom VPC and S3 bucket access policies.

Give a name to your role and choose Create role.
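If you prefer to script the role creation, the following boto3 sketch creates the role with the trust policy above and attaches the VPC and Amazon S3 permission documents from the previous section as inline policies. The role name, account ID, and local policy file names are placeholders.

import json

import boto3

iam = boto3.client("iam")
account_id = "111122223333"  # placeholder: your AWS account ID

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "bedrock.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {
            "StringEquals": {"aws:SourceAccount": account_id},
            "ArnEquals": {
                "aws:SourceArn": f"arn:aws:bedrock:us-west-2:{account_id}:model-customization-job/*"
            },
        },
    }],
}

iam.create_role(
    RoleName="BedrockModelCustomizationRole",  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the VPC and S3 permission policies shown earlier, saved locally as JSON files.
for policy_name, path in [("vpc-access", "vpc-policy.json"), ("s3-access", "s3-policy.json")]:
    with open(path) as f:
        iam.put_role_policy(
            RoleName="BedrockModelCustomizationRole",
            PolicyName=policy_name,
            PolicyDocument=f.read(),
        )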

Update the KMS key policy with the IAM role
In the KMS key you created in the previous steps, you need to update the key policy to include the ARN of the IAM role. The following code is a sample key policy:

{
  "Version": "2012-10-17",
  "Id": "key-consolepolicy-3",
  "Statement": [
    {
      "Sid": "BedrockFineTuneJobPermissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "$IAM Role ARN"
      },
      "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:Encrypt",
        "kms:DescribeKey",
        "kms:CreateGrant",
        "kms:RevokeGrant"
      ],
      "Resource": "$ARN of the KMS key"
    }
  ]
}

For more details, refer to Encryption of model customization jobs and artifacts.
Initiate the fine-tuning job
Complete the following steps to set up your fine-tuning job:

On the Amazon Bedrock console, choose Custom models in the navigation pane.
In the Models section, choose Customize model, then choose Create Fine-tuning job.

Under Model details, choose Select model.
Choose Llama 3.1 8B Instruct as the base model and choose Apply.

For Fine-tuned model name, enter a name for your custom model.
Select Model encryption to add a KMS key and choose the KMS key you created earlier.
For Job name, enter a name for the training job.
Optionally, expand the Tags section to add tags for tracking.

Under VPC Settings, choose the VPC, subnets, and security group you created as part of previous steps.

When you specify the VPC subnets and security groups for a job, Amazon Bedrock creates elastic network interfaces (ENIs) that are associated with your security groups in one of the subnets. ENIs allow the Amazon Bedrock job to connect to resources in your VPC.
We recommend that you provide at least one subnet in each Availability Zone.

Under Input data, specify the S3 locations for your training and validation datasets.

Under Hyperparameters, set the values for Epochs, Batch size, Learning rate, and Learning rate warm up steps for your fine-tuning job.

Refer to Custom model hyperparameters for additional details.

Under Output data, for S3 location, enter the S3 path for the bucket storing fine-tuning metrics.
Under Service access, select a method to authorize Amazon Bedrock. You can select Use an existing service role and use the role you created earlier.
Choose Create Fine-tuning job.
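The same job can be submitted with the boto3 Bedrock client, as in the sketch below. It assumes the role, bucket, KMS key, subnets, and security group created earlier; the model identifier, hyperparameter names, and values are illustrative, so verify them against the model documentation for your Region before running.

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

response = bedrock.create_model_customization_job(
    jobName="llama31-8b-fine-tuning-job",                   # placeholder
    customModelName="llama31-8b-finetuned",                 # placeholder
    roleArn="arn:aws:iam::111122223333:role/BedrockModelCustomizationRole",  # placeholder
    baseModelIdentifier="meta.llama3-1-8b-instruct-v1:0",   # placeholder model ID; verify in your Region
    customizationType="FINE_TUNING",
    customModelKmsKeyId="arn:aws:kms:us-west-2:111122223333:key/your-key-id",  # placeholder
    trainingDataConfig={"s3Uri": "s3://your-fine-tuning-bucket/fine-tuning-datasets/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://your-fine-tuning-bucket/outputs/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.0001"},  # illustrative
    vpcConfig={
        "subnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholders: your private subnets
        "securityGroupIds": ["sg-0fedcba9876543210"],         # placeholder: fine-tuning job security group
    },
)

# Poll the job status; it complements the console monitoring described in the next section.
status = bedrock.get_model_customization_job(jobIdentifier=response["jobArn"])["status"]
print(status)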

Monitor the job
On the Amazon Bedrock console, choose Custom models in the navigation pane and locate your job.

You can monitor the job on the job details page.

Purchase provisioned throughput
After fine-tuning is complete (as shown in the following screenshot), you can use the custom model for inference. However, before you can use a customized model, you need to purchase provisioned throughput for it.

Complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Custom models.
On the Models tab, select your model and choose Purchase provisioned throughput.

For Provisioned throughput name, enter a name.
Under Select model, make sure the model is the same as the custom model you selected earlier.
Under Commitment term & model units, configure your commitment term and model units. Refer to Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock for additional insights. For this post, we choose No commitment and use 1 model unit.

Under Estimated purchase summary, review the estimated cost and choose Purchase provisioned throughput.

After the provisioned throughput is in service, you can use the model for inference.
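A boto3 sketch of the purchase, assuming no commitment term and one model unit as in this post; the names below are placeholders.

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

pt = bedrock.create_provisioned_model_throughput(
    provisionedModelName="llama31-8b-finetuned-pt",  # placeholder
    modelId="llama31-8b-finetuned",                  # placeholder: your custom model name or ARN
    modelUnits=1,                                    # omitting commitmentDuration means no commitment term
)
provisioned_model_arn = pt["provisionedModelArn"]

# Wait until the status is InService before invoking the model.
status = bedrock.get_provisioned_model_throughput(
    provisionedModelId=provisioned_model_arn
)["status"]
print(status)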

Use the model
Now you’re ready to use your model for inference.

On the Amazon Bedrock console, under Playgrounds in the navigation pane, choose Chat/text.
Choose Select model.
For Category, choose Custom models under Custom & self-hosted models.
For Model, choose the model you just trained.
For Throughput, choose the provisioned throughput you just purchased.
Choose Apply.

Now you can ask sample questions, as shown in the following screenshot.
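Outside the console, you can also invoke the custom model with the Bedrock runtime client, using the provisioned model ARN as the model ID. The sketch below follows the Meta Llama text-completion request format; the ARN, prompt, and body fields such as max_gen_len are placeholders and assumptions to confirm against the current model documentation.

import json

import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

provisioned_model_arn = "arn:aws:bedrock:us-west-2:111122223333:provisioned-model/your-id"  # placeholder

body = {
    "prompt": "consumer complaints and resolutions for financial products",  # same style as the training prompts
    "max_gen_len": 256,
    "temperature": 0.2,
}

response = runtime.invoke_model(
    modelId=provisioned_model_arn,
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["generation"])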

Implementing these procedures allows you to follow security best practices when you deploy and use your fine-tuned model within Amazon Bedrock for inference tasks.
When developing a generative AI application that requires access to this fine-tuned model, you have the option to configure it within a VPC. By employing a VPC interface endpoint, you can make sure communication between your VPC and the Amazon Bedrock API endpoint occurs through a PrivateLink connection, rather than through the public internet.
This approach further enhances security and privacy. For more information on this setup, refer to Use interface VPC endpoints (AWS PrivateLink) to create a private connection between your VPC and Amazon Bedrock.
Clean up
Delete the following AWS resources created for this demonstration to avoid incurring future charges:

Amazon Bedrock model provisioned throughput
VPC endpoints
VPC and associated security groups
KMS key
IAM roles and policies
S3 bucket and objects

Conclusion
In this post, we implemented secure fine-tuning jobs in Amazon Bedrock, which is crucial for protecting sensitive data and maintaining the integrity of your AI models.
By following the best practices outlined in this post, including proper IAM role configuration, encryption at rest and in transit, and network isolation, you can significantly enhance the security posture of your fine-tuning processes.
By prioritizing security in your Amazon Bedrock workflows, you not only safeguard your data and models, but also build trust with your stakeholders and end-users, enabling responsible and secure AI development.
As a next step, try the solution out in your account and share your feedback.

About the Authors
Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Generative AI and Machine Learning. In his spare time, Vishal loves making short films on time travel and alternate universe themes.
Sumeet Tripathi is an Enterprise Support Lead (TAM) at AWS in North Carolina. He has over 17 years of experience in technology across various roles. He is passionate about helping customers to reduce operational challenges and friction. His focus area is AI/ML and Energy & Utilities Segment. Outside work, He enjoys traveling with family, watching cricket and movies.

Secure a generative AI assistant with OWASP Top 10 mitigation

A common use case with generative AI that we usually see customers evaluate for a production use case is a generative AI-powered assistant. However, before it can be deployed, there is the typical production readiness assessment that includes concerns such as understanding the security posture, monitoring and logging, cost tracking, resilience, and more. The highest priority of these production readiness assessments is usually security. If there are security risks that can’t be clearly identified, then they can’t be addressed, and that can halt the production deployment of the generative AI application.
In this post, we show you an example of a generative AI assistant application and demonstrate how to assess its security posture using the OWASP Top 10 for Large Language Model Applications, as well as how to apply mitigations for common threats.
Generative AI scoping framework
Start by understanding where your generative AI application fits within the spectrum of managed vs. custom. Use the AWS generative AI scoping framework to understand the specific mix of the shared responsibility for the security controls applicable to your application. For example, Scope 1 “Consumer Apps” like PartyRock or ChatGPT are usually publicly facing applications, where most of the application internal security is owned and controlled by the provider, and your responsibility for security is on the consumption side. Contrast that with Scope 4/5 applications, where not only do you build and secure the generative AI application yourself, but you are also responsible for fine-tuning and training the underlying large language model (LLM). The security controls in scope for Scope 4/5 applications will range more broadly from the frontend to LLM model security. This post will focus on the Scope 3 generative AI assistant application, which is one of the more frequent use cases seen in the field.
The following figure of the AWS Generative AI Security Scoping Matrix summarizes the types of models for each scope.

OWASP Top 10 for LLMs
Using the OWASP Top 10 for understanding threats and mitigations to an application is one of the most common ways application security is assessed. The OWASP Top 10 for LLMs takes a tried and tested framework and applies it to generative AI applications to help us discover, understand, and mitigate the novel threats for generative AI.

Solution overview
Let’s start with a logical architecture of a typical generative AI assistant application overlying the OWASP Top 10 for LLM threats, as illustrated in the following diagram.

In this architecture, the end-user request usually goes through the following components:

Authentication layer – This layer validates that the user connecting to the application is who they say they are. This is typically done through some sort of an identity provider (IdP) capability like Okta, AWS IAM Identity Center, or Amazon Cognito.
Application controller – This layer contains most of the application business logic and determines how to process the incoming user request by generating the LLM prompts and processing LLM responses before they are sent back to the user.
LLM and LLM agent – The LLM provides the core generative AI capability to the assistant. The LLM agent is an orchestrator of a set of steps that might be necessary to complete the desired request. These steps might involve both the use of an LLM and external data sources and APIs.
Agent plugin controller – This component is responsible for the API integration to external data sources and APIs. This component also holds the mapping between the logical name of an external component, which the LLM agent might refer to, and the physical name.
RAG data store – The Retrieval Augmented Generation (RAG) data store delivers up-to-date, precise, and access-controlled knowledge from various data sources such as data warehouses, databases, and other software as a service (SaaS) applications through data connectors.

The OWASP Top 10 for LLM risks map to various layers of the application stack, highlighting vulnerabilities from UIs to backend systems. In the following sections, we discuss risks at each layer and provide an application design pattern for a generative AI assistant application in AWS that mitigates these risks.
The following diagram illustrates the assistant architecture on AWS.

Authentication layer (Amazon Cognito)
Common security threats such as brute force attacks, session hijacking, and denial of service (DoS) attacks can occur. To mitigate these risks, implement best practices like multi-factor authentication (MFA), rate limiting, secure session management, automatic session timeouts, and regular token rotation. Additionally, deploying edge security measures such as AWS WAF and distributed denial of service (DDoS) mitigation helps block common web exploits and maintain service availability during attacks.
In the preceding architecture diagram, AWS WAF is integrated with Amazon API Gateway to filter incoming traffic, blocking unintended requests and protecting applications from threats like SQL injection, cross-site scripting (XSS), and DoS attacks. AWS WAF Bot Control further enhances security by providing visibility and control over bot traffic, allowing administrators to block or rate-limit unwanted bots. This feature can be centrally managed across multiple accounts using AWS Firewall Manager, providing a consistent and robust approach to application protection.
Amazon Cognito complements these defenses by enabling user authentication and data synchronization. It supports both user pools and identity pools, enabling seamless management of user identities across devices and integration with third-party identity providers. Amazon Cognito offers security features, including MFA, OAuth 2.0, OpenID Connect, secure session management, and risk-based adaptive authentication, to help protect against unauthorized access by evaluating sign-in requests for suspicious activity and responding with additional security measures like MFA or blocking sign-ins. Amazon Cognito also enforces password reuse prevention, further protecting against compromised credentials.
AWS Shield Advanced adds an extra layer of defense by providing enhanced protection against sophisticated DDoS attacks. Integrated with AWS WAF, Shield Advanced delivers comprehensive perimeter protection, using tailored detection and health-based assessments to enhance response to attacks. It also offers round-the-clock support from the AWS Shield Response Team and includes DDoS cost protection, helping applications remain secure and cost-effective. Together, Shield Advanced and AWS WAF create a security framework that protects applications against a wide range of threats while maintaining availability.
This comprehensive security setup addresses LLM10:2025 Unbounded Consumption and LLM02:2025 Sensitive Information Disclosure, making sure that applications remain both resilient and secure.
Application controller layer (LLM orchestrator Lambda function)
The application controller layer is usually vulnerable to risks such as LLM01:2025 Prompt Injection, LLM05:2025 Improper Output Handling, and LLM02:2025 Sensitive Information Disclosure. Outside parties might frequently attempt to exploit this layer by crafting unintended inputs to manipulate the LLM, potentially causing it to reveal sensitive information or compromise downstream systems.
In the physical architecture diagram, the application controller is the LLM orchestrator AWS Lambda function. It performs strict input validation by extracting the event payload from API Gateway and conducting both syntactic and semantic validation. By sanitizing inputs, applying allowlisting and deny listing of keywords, and validating inputs against predefined formats or patterns, the Lambda function helps prevent LLM01:2025 Prompt Injection attacks. Additionally, by passing the user_id downstream, it enables the downstream application components to mitigate the risk of sensitive information disclosure, addressing concerns related to LLM02:2025 Sensitive Information Disclosure.
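The following is a minimal sketch of the kind of syntactic and semantic checks the LLM orchestrator Lambda function can apply before building the prompt; the length limit, denylist phrases, and character allowlist are illustrative placeholders, not the exact rules used in the sample application.

import re

MAX_INPUT_LENGTH = 2000  # illustrative limit on prompt size
DENYLIST_PHRASES = ["ignore previous instructions", "reveal your system prompt"]  # illustrative
ALLOWED_CHARS = re.compile(r"^[\w\s.,;:!?@()'\"-]+$")  # illustrative character allowlist

def validate_user_input(payload: dict) -> dict:
    """Basic syntactic and semantic validation before the prompt is constructed."""
    user_id = payload.get("user_id")
    text = (payload.get("query") or "").strip()

    if not user_id or not text:
        raise ValueError("Missing user_id or query")
    if len(text) > MAX_INPUT_LENGTH:
        raise ValueError("Input exceeds the maximum allowed length")
    if not ALLOWED_CHARS.match(text):
        raise ValueError("Input contains disallowed characters")
    if any(phrase in text.lower() for phrase in DENYLIST_PHRASES):
        raise ValueError("Input matches a denylisted phrase")

    # Pass user_id downstream so later layers can enforce per-user access control
    return {"user_id": user_id, "query": text}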
Amazon Bedrock Guardrails provides an additional layer of protection by filtering and blocking sensitive content, such as personally identifiable information (PII) and other custom sensitive data defined by regex patterns. Guardrails can also be configured to detect and block offensive language, competitor names, or other undesirable terms, making sure that both inputs and outputs are safe. You can also use guardrails to prevent LLM01:2025 Prompt Injection attacks by detecting and filtering out harmful or manipulative prompts before they reach the LLM, thereby maintaining the integrity of the prompt.
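As a sketch, input can be screened with the ApplyGuardrail API before it reaches the LLM; the guardrail identifier and version below are placeholders for a guardrail you have already created in Amazon Bedrock.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def is_input_allowed(text: str) -> bool:
    """Return False if the guardrail intervenes on the user input."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier="your-guardrail-id",  # placeholder
        guardrailVersion="1",                     # placeholder
        source="INPUT",
        content=[{"text": {"text": text}}],
    )
    return response["action"] != "GUARDRAIL_INTERVENED"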
Another critical aspect of security is managing LLM outputs. Because the LLM might generate content that includes executable code, such as JavaScript or Markdown, there is a risk of XSS attacks if this content is not properly handled. To mitigate this risk, apply output encoding techniques, such as HTML entity encoding or JavaScript escaping, to neutralize any potentially harmful content before it is presented to users. This approach addresses the risk of LLM05:2025 Improper Output Handling.
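A minimal sketch of this kind of output encoding in Python is shown below; it treats the model response as untrusted text and HTML-encodes it before it is rendered in a browser.

import html

def encode_llm_output(raw_model_output: str) -> str:
    """HTML-encode LLM output so embedded markup or script cannot execute in the browser."""
    return html.escape(raw_model_output, quote=True)

# Example: a response containing a script tag is neutralized before display
unsafe = 'Here is your answer <script>alert("xss")</script>'
print(encode_llm_output(unsafe))
# Prints: Here is your answer &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;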
Implementing Amazon Bedrock prompt management and versioning allows for continuous improvement of the user experience while maintaining the overall security of the application. By carefully managing changes to prompts and their handling, you can enhance functionality without introducing new vulnerabilities, while also mitigating LLM01:2025 Prompt Injection attacks.
Treating the LLM as an untrusted user and applying human-in-the-loop processes over certain actions are strategies to lower the likelihood of unauthorized or unintended operations.
LLM and LLM agent layer (Amazon Bedrock LLMs)
The LLM and LLM agent layer frequently handles interactions with the LLM and faces risks such as LLM10:2025 Unbounded Consumption, LLM05:2025 Improper Output Handling, and LLM02:2025 Sensitive Information Disclosure.
DoS attacks can overwhelm the LLM with multiple resource-intensive requests, degrading overall service quality while increasing costs. When interacting with Amazon Bedrock hosted LLMs, setting request parameters such as the maximum length of the input request will minimize the risk of LLM resource exhaustion. Additionally, there is a hard limit on the maximum number of queued actions and total actions an Amazon Bedrock agent can take to fulfill a customer’s intent, which limits the number of actions in a system reacting to LLM responses, avoiding unnecessary loops or intensive tasks that could exhaust the LLM’s resources.
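For example, the following sketch caps both the prompt length checked before invocation and the number of output tokens the model can generate in a single request; the model ID and limits are illustrative.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

MAX_PROMPT_CHARS = 4000  # illustrative bound enforced before the model is called

def bounded_invoke(prompt: str) -> str:
    """Reject oversized prompts and cap the response length to limit resource use."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt exceeds the allowed input length")
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},  # cap output tokens
    )
    return response["output"]["message"]["content"][0]["text"]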
Improper output handling leads to vulnerabilities such as remote code execution, cross-site scripting, server-side request forgery (SSRF), and privilege escalation. The inadequate validation and management of the LLM-generated outputs before they are sent downstream can grant indirect access to additional functionality, effectively enabling these vulnerabilities. To mitigate this risk, treat the model as any other user and apply validation of the LLM-generated responses. The process is facilitated with Amazon Bedrock Guardrails using filters such as content filters with configurable thresholds to filter harmful content and safeguard against prompt attacks before they are processed further downstream by other backend systems. Guardrails automatically evaluate both user input and model responses to detect and help prevent content that falls into restricted categories.
Amazon Bedrock Agents execute multi-step tasks and securely integrate with AWS native and third-party services to reduce the risk of insecure output handling, excessive agency, and sensitive information disclosure. In the architecture diagram, the action group Lambda function under the agents is used to encode all the output text, making it automatically non-executable by JavaScript or Markdown. Additionally, the action group Lambda function parses each output from the LLM at every step executed by the agents and controls the processing of the outputs accordingly, making sure they are safe before further processing.
Sensitive information disclosure is a risk with LLMs because malicious prompt engineering can cause LLMs to accidentally reveal unintended details in their responses. This can lead to privacy and confidentiality violations. To mitigate the issue, implement data sanitization practices through content filters in Amazon Bedrock Guardrails.
Additionally, implement custom data filtering policies based on user_id and strict user access policies. Amazon Bedrock Guardrails helps filter content deemed sensitive, and Amazon Bedrock Agents further reduces the risk of sensitive information disclosure by allowing you to implement custom logic in the preprocessing and postprocessing templates to strip any unexpected information. If you have enabled model invocation logging for the LLM or implemented custom logging logic in your application to record the input and output of the LLM in Amazon CloudWatch, measures such as CloudWatch Log data protection are important in masking sensitive information identified in the CloudWatch logs, further mitigating the risk of sensitive information disclosure.
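In addition to CloudWatch Log data protection policies, application code can mask obvious identifiers before log entries are written, as in the following illustrative sketch; the patterns shown are examples only, not a complete PII detector.

import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def mask_sensitive(text: str) -> str:
    """Mask example PII patterns before the text is written to application logs."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)

# Example
print(mask_sensitive("Contact jane.doe@example.com or 555-123-4567"))
# Prints: Contact [EMAIL] or [PHONE]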
Agent plugin controller layer (action group Lambda function)
The agent plugin controller frequently integrates with internal and external services and applies custom authorization to internal and external data sources and third-party APIs. At this layer, the risks of LLM08:2025 Vector & Embedding Weaknesses and LLM06:2025 Excessive Agency apply. Untrusted or unverified third-party plugins could introduce backdoors or vulnerabilities in the form of unexpected code.
Apply least privilege access to the AWS Identity and Access Management (IAM) roles of the action group Lambda function, which interacts with plugin integrations to external systems to help mitigate the risk of LLM06:2025 Excessive Agency and LLM08:2025 Vector & Embedding Weaknesses. This is demonstrated in the physical architecture diagram; the agent plugin layer Lambda function is associated with a least privilege IAM role for secure access and interface with other internal AWS services.
Additionally, after the user identity is determined, restrict the data plane by applying user-level access control by passing the user_id to downstream layers like the agent plugin layer. Although this user_id parameter can be used in the agent plugin controller Lambda function for custom authorization logic, its primary purpose is to enable fine-grained access control for third-party plugins. The responsibility lies with the application owner to implement custom authorization logic within the action group Lambda function, where the user_id parameter can be used in combination with predefined rules to apply the appropriate level of access to third-party APIs and plugins. This approach wraps deterministic access controls around a non-deterministic LLM and enables granular access control over which users can access and execute specific third-party plugins.
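The following sketch illustrates what such custom authorization logic inside the action group Lambda function might look like; the role lookup, permission map, and event fields are hypothetical and should be adapted to your agent's action group schema.

# Hypothetical mapping of user roles to the third-party plugins they may invoke
PLUGIN_PERMISSIONS = {
    "travel_agent": {"flight_search_api", "hotel_search_api"},
    "support_agent": {"ticketing_api"},
}

def lookup_role(user_id: str) -> str:
    """Resolve the user's role from your identity store (stubbed for illustration)."""
    return "travel_agent"

def authorize_plugin_call(user_id: str, plugin_name: str) -> None:
    """Deterministic access control applied before the LLM-requested plugin is executed."""
    allowed = PLUGIN_PERMISSIONS.get(lookup_role(user_id), set())
    if plugin_name not in allowed:
        raise PermissionError(f"User {user_id} is not allowed to call {plugin_name}")

def handler(event, context):
    # Hypothetical event fields: the agent passes user_id and the logical plugin name
    user_id = event.get("sessionAttributes", {}).get("user_id")
    plugin_name = event.get("apiPath", "").lstrip("/")
    authorize_plugin_call(user_id, plugin_name)
    # ... proceed with the actual plugin or API call once authorized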
Combining user_id-based authorization on data and IAM roles with least privilege on the action group Lambda function will generally minimize the risk of LLM08:2025 Vector & Embedding Weaknesses and LLM06:2025 Excessive Agency.
RAG data store layer
The RAG data store is responsible for securely retrieving up-to-date, precise, and user access-controlled knowledge from various first-party and third-party data sources. By default, Amazon Bedrock encrypts all knowledge base-related data using an AWS managed key. Alternatively, you can choose to use a customer managed key. When setting up a data ingestion job for your knowledge base, you can also encrypt the job using a custom AWS Key Management Service (AWS KMS) key.
If you decide to use the vector store in Amazon OpenSearch Service for your knowledge base, Amazon Bedrock can pass a KMS key of your choice to it for encryption. Additionally, you can encrypt the sessions in which you generate responses from querying a knowledge base with a KMS key. To facilitate secure communication, Amazon Bedrock Knowledge Bases uses TLS encryption when interacting with third-party vector stores, provided that the service supports and permits TLS encryption in transit.
Regarding user access control, Amazon Bedrock Knowledge Bases uses filters to manage permissions. You can build a segmented access solution on top of a knowledge base using the metadata and filtering feature. During runtime, your application must authenticate and authorize the user, and include this user information in the query to maintain accurate access controls. To keep the access controls updated, you should periodically resync the data to reflect any changes in permissions. Additionally, groups can be stored as a filterable attribute, further refining access control.
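As a sketch, the following knowledge base query applies a metadata filter derived from the authenticated user; the knowledge base ID and the allowed_groups metadata attribute are placeholders.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def retrieve_for_user(query: str, user_group: str):
    """Retrieve only documents whose metadata permits the caller's group."""
    return bedrock_agent_runtime.retrieve(
        knowledgeBaseId="KB1234567890",  # placeholder knowledge base ID
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 5,
                # 'allowed_groups' is a placeholder metadata attribute set during ingestion
                "filter": {"equals": {"key": "allowed_groups", "value": user_group}},
            }
        },
    )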
This approach helps mitigate the risk of LLM02:2025 Sensitive Information Disclosure and LLM08:2025 Vector & Embedding Weaknesses, helping ensure that only authorized users can access the relevant data.
Summary
In this post, we discussed how to classify your generative AI application from a security shared responsibility perspective using the AWS Generative AI Security Scoping Matrix. We reviewed a common generative AI assistant application architecture and assessed its security posture using the OWASP Top 10 for LLMs framework, and showed how to apply the OWASP Top 10 for LLMs threat mitigations using AWS service controls and services to strengthen the architecture of your generative AI assistant application. Learn more about building generative AI applications with AWS Workshops for Bedrock.

About the Authors
Syed Jaffry is a Principal Solutions Architect with AWS. He advises software companies on AI and helps them build modern, robust and secure application architectures on AWS.
Amit Kumar Agrawal is a Senior Solutions Architect at AWS where he has spent over 5 years working with large ISV customers. He helps organizations build and operate cost-efficient and scalable solutions in the cloud, driving their business and technical outcomes.
Tej Nagabhatla is a Senior Solutions Architect at AWS, where he works with a diverse portfolio of clients ranging from ISVs to large enterprises. He specializes in providing architectural guidance across a wide range of topics around AI/ML, security, storage, containers, and serverless technologies. He helps organizations build and operate cost-efficient, scalable cloud applications. In his free time, Tej enjoys music, playing basketball, and traveling.

Streamline custom environment provisioning for Amazon SageMaker Studio …

Attaching a custom Docker image to an Amazon SageMaker Studio domain involves several steps. First, you need to build and push the image to Amazon Elastic Container Registry (Amazon ECR). You also need to make sure that the Amazon SageMaker domain execution role has the necessary permissions to pull the image from Amazon ECR. After the image is pushed to Amazon ECR, you create a SageMaker custom image on the AWS Management Console. Lastly, you update the SageMaker domain configuration to specify the custom image Amazon Resource Name (ARN). This multi-step process needs to be followed manually every time end-users create new custom Docker images to make them available in SageMaker Studio.
In this post, we explain how to automate this process. This approach allows you to update the SageMaker configuration without writing additional infrastructure code, provision custom images, and attach them to SageMaker domains. By adopting this automation, you can deploy consistent and standardized analytics environments across your organization, leading to increased team productivity and mitigating security risks associated with using one-time images.
The solution described in this post is geared towards machine learning (ML) engineers and platform teams who are often responsible for managing and standardizing custom environments at scale across an organization. For individual data scientists seeking a self-service experience, we recommend that you use the native Docker support in SageMaker Studio, as described in Accelerate ML workflows with Amazon SageMaker Studio Local Mode and Docker support. This feature allows data scientists to build, test, and deploy custom Docker containers directly within the SageMaker Studio integrated development environment (IDE), enabling you to iteratively experiment with your analytics environments seamlessly within the familiar SageMaker Studio interface.
Solution overview
The following diagram illustrates the solution architecture.

We deploy a pipeline using AWS CodePipeline, which automates a custom Docker image creation and attachment of the image to a SageMaker domain. The pipeline first checks out the code base from the GitHub repo and creates custom Docker images based on the configuration declared in the config files. After successfully creating and pushing Docker images to Amazon ECR, the pipeline validates the image by scanning and checking for security vulnerabilities in the image. If no critical or high-security vulnerabilities are found, the pipeline continues to the manual approval stage before deployment. After manual approval is complete, the pipeline deploys the SageMaker domain and attaches custom images to the domain automatically.
Prerequisites
The prerequisites for implementing the solution described in this post include:

An AWS account with access to deploy AWS CloudFormation templates
Node.js and the npm command line interface
The AWS Cloud Development Kit (AWS CDK) installed
Version 2 of the AWS Command Line Interface (AWS CLI) installed
The Git command line interface installed on your computer for cloning the repository
AWS CDK bootstrapping completed in your AWS account

Deploy the solution
Complete the following steps to implement the solution:

Log in to your AWS account using the AWS CLI in a shell terminal (for more details, see Authenticating with short-term credentials for the AWS CLI).
Run the following command to make sure you have successfully logged in to your AWS account:

aws sts get-caller-identity

Fork the GitHub repo to your GitHub account.
Clone the forked repo to your local workstation using the following command:

git clone <clone_url_of_forked_repo>

Log in to the console and create an AWS CodeStar connection to the GitHub repo you forked in the previous step. For instructions, see Create a connection to GitHub (console).
Copy the ARN for the connection you created.
Go to the terminal and run the following command to cd into the repository directory:

cd streamline-sagemaker-custom-images-cicd

Run the following command to install all libraries from npm:

npm install

Run the following commands in a shell terminal. They set the environment variables (your AWS account number, AWS Region, VPC, CodeStar connection ARN, and repository owner) used to deploy an AWS CDK stack, which deploys components such as CodePipeline, AWS CodeBuild, the ECR repository, and so on. Set the VPC_ID variable to an existing VPC; if you don’t have one, create a VPC with at least two subnets and use it.

export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=<YOUR_AWS_REGION>
export VPC_ID=<VPC_ID_TO_DEPLOY>
export CODESTAR_CONNECTION_ARN=<CODE_STAR_CONNECTION_ARN_CREATED_IN_ABOVE_STEP>
export REPOSITORY_OWNER=<YOUR_GITHUB_LOGIN_ID>

Run the following command to deploy the AWS infrastructure using the AWS CDK V2 and make sure to wait for the template to succeed:

cdk deploy PipelineStack --require-approval never

On the CodePipeline console, choose Pipelines in the navigation pane.
Choose the link for the pipeline named sagemaker-custom-image-pipeline.

You can follow the progress of the pipeline on the console and provide approval in the manual approval stage to deploy the SageMaker infrastructure. The pipeline takes approximately 5-8 minutes to build the image and move to the manual approval stage.
Wait for the pipeline to complete the deployment stage.

The pipeline creates infrastructure resources in your AWS account with a SageMaker domain and a SageMaker custom image. It also attaches the custom image to the SageMaker domain.

On the SageMaker console, choose Domains under Admin configurations in the navigation pane.

Open the domain named team-ds, and navigate to the Environment tab.

You should be able to see one custom image that is attached.

How custom images are deployed and attached
CodePipeline has a stage called BuildCustomImages that contains the automated steps to create a SageMaker custom image using the SageMaker Custom Image CLI and push it to the ECR repository created in the AWS account. The AWS CDK stack at the deployment stage has the required steps to create a SageMaker domain and attach a custom image to the domain. The parameters to create the SageMaker domain, custom image, and so on are configured in JSON format and used in the SageMaker stack under the lib directory. Refer to the sagemakerConfig section in environments/config.json for declarative parameters.
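For reference, the following sketch outlines the SageMaker API calls that the deployment stage effectively performs when attaching a custom image; the image, repository, role, and domain identifiers are placeholders, and in this solution the equivalent steps are executed through the AWS CDK stack rather than called directly.

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

# 1. Register the custom image and a version pointing at the image pushed to Amazon ECR
sagemaker.create_image(
    ImageName="custom-jlab",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
)
sagemaker.create_image_version(
    ImageName="custom-jlab",
    BaseImage="111122223333.dkr.ecr.us-east-1.amazonaws.com/research-platform-ecr:custom",
)

# 2. Describe how the image runs as a JupyterLab app in SageMaker Studio
sagemaker.create_app_image_config(
    AppImageConfigName="custom-jlab-config",
    JupyterLabAppImageConfig={},
)

# 3. Attach the custom image to the domain's default user settings
sagemaker.update_domain(
    DomainId="d-xxxxxxxxxxxx",  # placeholder domain ID
    DefaultUserSettings={
        "JupyterLabAppSettings": {
            "CustomImages": [
                {"ImageName": "custom-jlab", "AppImageConfigName": "custom-jlab-config"}
            ]
        }
    },
)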
Add more custom images
Now you can add your own custom Docker image to attach to the SageMaker domain created by the pipeline. For the requirements that custom images must meet, refer to Dockerfile specifications.

cd into the images directory in the repository in the terminal:

cd images

Create a new directory (for example, custom) under the images directory:

mkdir custom

Add your own Dockerfile to this directory. For testing, you can use the following Dockerfile config:

FROM public.ecr.aws/amazonlinux/amazonlinux:2
ARG NB_USER="sagemaker-user"
ARG NB_UID="1000"
ARG NB_GID="100"
RUN yum update -y && \
    yum install python3 python3-pip shadow-utils -y && \
    yum clean all
RUN yum install --assumeyes python3 shadow-utils && \
    useradd --create-home --shell /bin/bash --gid "${NB_GID}" --uid ${NB_UID} ${NB_USER} && \
    yum clean all && \
    python3 -m pip install jupyterlab
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --upgrade urllib3==1.26.6
USER ${NB_UID}
CMD jupyter lab --ip 0.0.0.0 --port 8888 \
    --ServerApp.base_url="/jupyterlab/default" \
    --ServerApp.token='' \
    --ServerApp.allow_origin='*'

Update the images section in the config.json file under the environments directory to add the new image directory name you created:

"images": [
  {
    "repositoryName": "research-platform-ecr",
    "tags": [
      "jlab",
      "custom" << Add here
    ]
  }
]

Update the same image name in customImages under the created SageMaker domain configuration:

"customImages": [
  "jlab",
  "custom" << Add here
],

Commit and push changes to the GitHub repository.
You should see CodePipeline is triggered upon push. Follow the progress of the pipeline and provide manual approval for deployment.

After deployment is completed successfully, you should be able to see that the custom image you have added is attached to the domain configuration (as shown in the following screenshot).

Clean up
To clean up your resources, open the AWS CloudFormation console and delete the stacks SagemakerImageStack and PipelineStack in that order. If you encounter errors such as “S3 Bucket is not empty” or “ECR Repository has images,” you can manually delete the S3 bucket and ECR repository that was created. Then you can retry deleting the CloudFormation stacks.
Conclusion
In this post, we showed how to create an automated continuous integration and delivery (CI/CD) pipeline solution to build, scan, and deploy custom Docker images to SageMaker Studio domains. You can use this solution to promote consistency of the analytical environments for data science teams across your enterprise. This approach helps you achieve machine learning (ML) governance, scalability, and standardization.

About the Authors
Muni Annachi, a Senior DevOps Consultant at AWS, boasts over a decade of expertise in architecting and implementing software systems and cloud platforms. He specializes in guiding non-profit organizations to adopt DevOps CI/CD architectures, adhering to AWS best practices and the AWS Well-Architected Framework. Beyond his professional endeavors, Muni is an avid sports enthusiast and tries his luck in the kitchen.
Ajay Raghunathan is a Machine Learning Engineer at AWS. His current work focuses on architecting and implementing ML solutions at scale. He is a technology enthusiast and a builder with a core area of interest in AI/ML, data analytics, serverless, and DevOps. Outside of work, he enjoys spending time with family, traveling, and playing football.
Arun Dyasani is a Senior Cloud Application Architect at AWS. His current work focuses on designing and implementing innovative software solutions. His role centers on crafting robust architectures for complex applications, leveraging his deep knowledge and experience in developing large-scale systems.
Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles in Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Masters of Science in Financial Engineering, both from New York University.
Jenna Eun is a Principal Practice Manager for the Health and Advanced Compute team at AWS Professional Services. Her team focuses on designing and delivering data, ML, and advanced computing solutions for the public sector, including federal, state and local governments, academic medical centers, nonprofit healthcare organizations, and research institutions.
Meenakshi Ponn Shankaran is a Principal Domain Architect at AWS in the Data & ML Professional Services Org. He has extensive expertise in designing and building large-scale data lakes, handling petabytes of data. Currently, he focuses on delivering technical leadership to AWS US Public Sector clients, guiding them in using innovative AWS services to meet their strategic objectives and unlock the full potential of their data.

Introducing multi-turn conversation with an agent node for Amazon Bedr …

Amazon Bedrock Flows offers an intuitive visual builder and a set of APIs to seamlessly link foundation models (FMs), Amazon Bedrock features, and AWS services to build and automate user-defined generative AI workflows at scale. Amazon Bedrock Agents offers a fully managed solution for creating, deploying, and scaling AI agents on AWS. With Flows, you can provide explicitly stated, user-defined decision logic to execute workflows, and add Agents as a node in a flow to use FMs to dynamically interpret and execute tasks based on contextual reasoning for certain steps in your workflow.
Today, we’re excited to announce multi-turn conversation with an agent node (preview), a powerful new capability in Flows. This new capability enhances the agent node functionality, enabling dynamic, back-and-forth conversations between users and flows, similar to a natural dialogue in a flow execution.
With this new feature, when an agent node requires clarification or additional context from the user before it can continue, it can intelligently pause the flow’s execution and request the specific information it needs from the user. After the user sends the requested information, the flow seamlessly resumes execution with the enriched input, maintaining the executionId of the conversation.
This creates a more interactive and context-aware experience, because the node can adapt its behavior based on user responses. The following sequence diagram shows the flow steps.

Multi-turn conversations make it straightforward for developers to create agentic workflows that can adapt and reason dynamically. This is particularly valuable for complex scenarios where a single interaction might not be sufficient to fully understand and address the user’s needs.
In this post, we discuss how to create a multi-turn conversation and explore how this feature can transform your AI applications.
Solution overview
Consider ACME Corp, a leading fictional online travel agency developing an AI-powered holiday trip planner using Flows. They face several challenges in their implementation:

Their planner can’t engage in dynamic conversations, requiring all trip details upfront instead of asking follow-up questions
They face challenges orchestrating complex, multi-step travel planning processes that require coordinating flights, accommodations, activities, and transportation across multiple destinations, often leading to inefficiencies and suboptimal customer experiences
Their application can’t dynamically adapt its recommendations when users modify their preferences or introduce new constraints during the planning process

Let’s explore how the new multi-turn conversation capability in Flows addresses these challenges and enables ACME Corp to build a more intelligent, context-aware, and efficient holiday trip planner that truly enhances the customer’s travel planning experience.
The flow offers two distinct interaction paths. For general travel inquiries, users receive instant responses powered by an LLM. However, when users want to search or book flights and hotels, they are connected to an agent who guides them through the process, collecting essential information while maintaining the session until completion. The workflow is illustrated in the following diagram.

Prerequisites
For this example, you need the following:

An AWS account and a user with an AWS Identity and Access Management (IAM) role authorized to use Bedrock. For guidance, refer to Getting started with Amazon Bedrock. Make sure the role includes the permissions for using Flows, as explained in Prerequisites for Amazon Bedrock Flows, and the permissions for using Agents, as explained in Prerequisites for creating Amazon Bedrock Agents.
Access provided to the models you use for invocation and evaluation. For guidance, see Manage access to Amazon Bedrock foundation models.
Create an Amazon Bedrock Agent to automate the task for the travel agency application by orchestrating interactions between the FM, API calls, and user conversations. Our travel agent offers four essential booking functions: searching available flights, securing flight reservations, finding suitable hotel accommodations, and completing hotel bookings. For an example of how to create a travel agent, refer to Agents for Amazon Bedrock now support memory retention and code interpretation (preview). Make sure the agent has user input functionality enabled. This setting allows the agent to gather all required details through natural conversation, even when the initial request is incomplete.

Create a multi-turn conversation flow
To create a multi-turn conversation flow, complete the following steps:

On the Bedrock console, choose Flows under Builder tools in the navigation pane.
Start creating a new flow called ACME-Corp-trip-planner.

For detailed instructions on creating a Flow, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
Bedrock provides different node types to build your prompt flow.

Choose the prompt node to evaluate the input intention. It will classify the intentions as categoryLetter=A if the user wants to search or book a hotel or flight and categoryLetter=B if the user is asking for destination information. If you’re using Amazon Bedrock Prompt Management, you can select the prompt from there.

For this node, we use the following message in the prompt configuration:

You are a query classifier. Analyze the {{input}} and respond with a single letter:

A: Travel planning/booking queries for hotel and flights Example: “Find flights to London”
B: Destination information queries Example: “What’s the weather in Paris?”

Return only ‘A’ or ‘B’ based on the primary intent.

For our example, we chose Amazon’s Nova Lite model and set the temperature inference parameter to 0.1 to minimize hallucinations and enhance output reliability. You can select other available Amazon Bedrock models.

Create the Condition node with the following information and connect with the Query Classifier node. For this node, the condition value is:

Name: Booking
Condition: categoryLetter=="A"

Create a second prompt node for the LLM guide invocation. The input of the node is the Condition node’s “If all conditions are false” output. To end this flow branch, add a Flow output node and connect the prompt node output to it.

You are AcmeGuide, an enthusiastic and knowledgeable travel guide.
Your task is to provide accurate and comprehensive information about travel destinations to users.
When answering a user’s query, cover the following key aspects:

– Weather and best times to visit
– Famous local figures and celebrities
– Major attractions and landmarks
– Local culture and cuisine
– Essential travel tips

Answer the user’s question {{query}}.

Present the information in a clear and engaging manner.
If you are unsure about specific details, acknowledge this and provide the most reliable information available.
Avoid any hallucinations or fabricated content.
Provide your response immediately after these instructions, without any preamble or additional text.

For our example, we chose Amazon’s Nova Lite model and set the temperature inference parameter to 0.1 to minimize hallucinations and enhance output reliability.

Finally, create the agent node and configure it to use the agent that was created previously. The input of the node is the Condition node’s “Conditions Booking” output. To end this flow branch, add a Flow output node and connect the agent node output to it.
Choose Save to save your flow.

Test the flow
You’re now ready to test the flow through the Amazon Bedrock console or API. First, we ask for information about Paris. In the response, you can review the flow traces, which provide detailed visibility into the execution process. These traces help you monitor and debug response times for each step, track the processing of customer inputs, verify if guardrails are properly applied, and identify any bottlenecks in the system. Flow traces offer a comprehensive overview of the entire response generation process, allowing for more efficient troubleshooting and performance optimization.

Next, we continue our conversation and request to book a trip to Paris. As you can see, with the multi-turn support in Flows, our agent node is now able to ask follow-up questions to gather all the information and make the booking.

We continue talking to our agent, providing all required information, and finally, the agent makes the booking for us. In the traces, you can check the ExecutionId that maintains the session for the multi-turn requests.

After the confirmation, the agent has successfully completed the user request.

Use Amazon Bedrock Flows APIs
You can also interact with flows programmatically using the InvokeFlow API, as shown in the following code. During the initial invocation, the system automatically generates a unique executionId, which maintains the session for 1 hour. This executionId is essential for subsequent InvokeFlow API calls, because it provides the agent with contextual information necessary for maintaining conversation history and completing actions.

{
  "flowIdentifier": "MQM2RM1ORA",
  "flowAliasIdentifier": "T00ZXPGI35",
  "inputs": [
    {
      "content": {
        "document": "Book a flight to Paris"
      },
      "nodeName": "FlowInputNode",
      "nodeOutputName": "document"
    }
  ]
}

If the agent node in the flow decides that it needs more information from the user, the response stream (responseStream) from InvokeFlow includes a FlowMultiTurnInputRequestEvent event object. The event has the requested information in the content (FlowMultiTurnInputContent) field.
The following is an example FlowMultiTurnInputRequestEvent JSON object:

{
  "nodeName": "Trip_planner",
  "nodeType": "AgentNode",
  "content": {
    "document": "Certainly! I'd be happy to help you book a flight to Paris.
      To get started, I need some more information:
      1. What is your departure airport (please provide the IATA airport code if possible)?
      2. What date would you like to travel (in YYYYMMDD format)?
      3. Do you have a preferred time for the flight (in HHMM format)?
      Once I have these details, I can search for available flights for you."
  }
}

Because the flow can’t continue until more input is received, the flow also emits a FlowCompletionEvent event. A flow always emits the FlowMultiTurnInputRequestEvent before the FlowCompletionEvent. If the value of completionReason in the FlowCompletionEvent event is INPUT_REQUIRED, the flow needs more information before it can continue.
The following is an example FlowCompletionEvent JSON object:

{
  "completionReason": "INPUT_REQUIRED"
}

Send the user response back to the flow by calling the InvokeFlow API again. Be sure to include the executionId for the conversation.
The following is an example JSON request for the InvokeFlow API, which provides additional information required by an agent node:

{
  "flowIdentifier": "MQM2RM1ORA",
  "flowAliasIdentifier": "T00ZXPGI35",
  "executionId": "b6450554-f8cc-4934-bf46-f66ed89b60a0",
  "inputs": [
    {
      "content": {
        "document": "Madrid on Valentine's day 2025"
      },
      "nodeName": "Trip_planner",
      "nodeInputName": "agentInputText"
    }
  ]
}

This back-and-forth continues until no more information is needed and the agent has everything required to complete the user’s request. When no more information is needed, the flow emits a FlowOutputEvent event, which contains the final response.
The following is an example FlowOutputEvent JSON object:

{
  "nodeName": "FlowOutputNode",
  "content": {
    "document": "Great news! I've successfully booked your flight to Paris. Here are the details:

      - Date: February 14, 2025 (Valentine's Day)
      - Departure: Madrid (MAD) at 20:43 (8:43 PM)
      - Arrival: Paris (CDG)

      Your flight is confirmed."
  }
}

The flow also emits a FlowCompletionEvent event. The value of completionReason is SUCCESS.
The following is an example FlowCompletionEvent JSON object:

{
  "completionReason": "SUCCESS"
}

To get started with multi-turn invocation, use the following example code. It handles subsequent interactions using the same executionId and maintains context throughout the conversation. You need to specify your flow’s ID in FLOW_ID and its alias ID in FLOW_ALIAS_ID (refer to View information about flows in Amazon Bedrock for instructions on obtaining these IDs).
The system will prompt for additional input as needed, using the executionId to maintain context across multiple interactions, providing a coherent and continuous conversation flow while executing the requested actions.

"""
Runs an Amazon Bedrock flow and handles multi-turn interactions
"""
import boto3
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def invoke_flow(client, flow_id, flow_alias_id, input_data, execution_id=None):
    """
    Invoke an Amazon Bedrock flow and handle the response stream.

    Args:
        client: Boto3 client for the Amazon Bedrock Agent Runtime
        flow_id: The ID of the flow to invoke
        flow_alias_id: The alias ID of the flow
        input_data: Input data for the flow
        execution_id: Execution ID for continuing a flow. Defaults to None for the first turn.

    Returns:
        Dict containing flow_status, input_required info, and execution_id
    """
    request_params = {
        "flowIdentifier": flow_id,
        "flowAliasIdentifier": flow_alias_id,
        "inputs": [input_data]
    }

    if execution_id:
        request_params["executionId"] = execution_id

    response = client.invoke_flow(**request_params)
    execution_id = response.get('executionId', execution_id)

    input_required = None
    flow_status = ""

    # Process the streamed events: completion status, requests for more input,
    # final outputs, and optional trace events
    for event in response['responseStream']:
        if 'flowCompletionEvent' in event:
            flow_status = event['flowCompletionEvent']['completionReason']
        elif 'flowMultiTurnInputRequestEvent' in event:
            input_required = event
        elif 'flowOutputEvent' in event:
            print(event['flowOutputEvent']['content']['document'])
        elif 'flowTraceEvent' in event:
            print("Flow trace:", event['flowTraceEvent'])

    return {
        "flow_status": flow_status,
        "input_required": input_required,
        "execution_id": execution_id
    }

def create_input_data(text, node_name="FlowInputNode", is_initial_input=True):
    """
    Create formatted input data dictionary.

    Args:
        text: The input text
        node_name: Name of the node (defaults to "FlowInputNode")
        is_initial_input: Boolean indicating if this is the first input (defaults to True)

    Returns:
        Dict containing the formatted input data
    """
    input_data = {
        "content": {"document": text},
        "nodeName": node_name
    }

    if is_initial_input:
        input_data["nodeOutputName"] = "document"
    else:
        input_data["nodeInputName"] = "agentInputText"

    return input_data

def main():
    FLOW_ID = "MQM2RM1ORA"
    FLOW_ALIAS_ID = "T00ZXPGI35"

    session = boto3.Session(
        region_name='us-east-1'
    )
    # InvokeFlow is served by the Amazon Bedrock Agent Runtime client
    bedrock_agent_client = session.client(
        'bedrock-agent-runtime',
    )

    execution_id = None

    try:
        # Initial input
        user_input = input("Enter input: ")
        input_data = create_input_data(user_input, is_initial_input=True)

        while True:
            result = invoke_flow(
                bedrock_agent_client,
                FLOW_ID,
                FLOW_ALIAS_ID,
                input_data,
                execution_id
            )

            if result['flow_status'] == "SUCCESS":
                break

            if result['flow_status'] == "INPUT_REQUIRED":
                more_input = result['input_required']
                prompt = f"{more_input['flowMultiTurnInputRequestEvent']['content']['document']}: "
                user_input = input(prompt)
                # Subsequent inputs go to the node that requested them
                input_data = create_input_data(
                    user_input,
                    more_input['flowMultiTurnInputRequestEvent']['nodeName'],
                    is_initial_input=False
                )

            execution_id = result['execution_id']

    except Exception as e:
        logger.error(f"Error occurred: {str(e)}", exc_info=True)

if __name__ == "__main__":
    main()

Clean up
To clean up your resources, delete the flow, agent, AWS Lambda functions created for the agent, and knowledge base.
Conclusion
The introduction of multi-turn conversation capability in Flows marks a significant advancement in building sophisticated conversational AI applications. In this post, we demonstrated how this feature enables developers to create dynamic, context-aware workflows that can handle complex interactions while maintaining conversation history and state. The combination of the Flows visual builder interface and APIs with powerful agent capabilities makes it straightforward to develop and deploy intelligent applications that can engage in natural, multi-step conversations.
With this new capability, businesses can build more intuitive and responsive AI solutions that better serve their customers’ needs. Whether you’re developing a travel booking system, a customer service assistant, or another conversational application, multi-turn conversation with Flows provides the tools needed to create sophisticated AI workflows with minimal complexity.
We encourage you to explore these capabilities on the Bedrock console and start building your own multi-turn conversational applications today. For more information and detailed documentation, visit the Amazon Bedrock User Guide. We look forward to seeing the innovative solutions you will create with these powerful new features.

About the Authors
Christian Kamwangala is an AI/ML and Generative AI Specialist Solutions Architect at AWS, based in Paris, France. He helps enterprise customers architect and implement cutting-edge AI solutions using the comprehensive suite of AWS tools, with a focus on production-ready systems that follow industry best practices. In his spare time, Christian enjoys exploring nature and spending time with family and friends.
Irene Arroyo Delgado is an AI/ML and GenAI Specialist Solutions Architect at AWS. She focuses on bringing out the potential of generative AI for each use case and productionizing ML workloads to achieve customers’ desired business outcomes by automating end-to-end ML lifecycles. In her free time, Irene enjoys traveling and hiking.