Fine-tune large language models with reinforcement learning from human or AI feedback

Large language models (LLMs) can be used to perform natural language processing (NLP) tasks ranging from simple dialogue and information retrieval to more complex reasoning tasks such as summarization and decision-making. Prompt engineering and supervised fine-tuning, which use instructions and examples demonstrating the desired task, can make LLMs better at following human intent, in particular for a specific use case. However, these methods often result in LLMs expressing unintended behaviors such as making up facts (hallucinations), generating biased or toxic text, or simply not following user instructions, which leads to responses that are untruthful, toxic, or not helpful to the user. In other words, these models are not aligned with their users.
Supervised learning can help tune LLMs by using examples demonstrating desired behaviors, an approach called supervised fine-tuning (SFT). But even if the sampled set of demonstrations is representative of some tasks, it is often not exhaustive enough to teach the LLM more subtle needs such as ethical, societal, and psychological ones, which are essential but relatively abstract and therefore not easy to demonstrate. For this reason, SFT often still leads to unintended behaviors, such as making up facts or producing biased or even toxic content.
Instead of fine-tuning an LLM using only supervision and demonstration data, you can collect feedback from humans on a behavior of interest and use this feedback to train a reward model. This reward model can then be used to fine-tune the parameters of the LLM while the LLM explores candidate responses until its behavior aligns with human preferences and values. This method is called reinforcement learning from human feedback (Ouyang et al. 2022). The following diagram illustrates reinforcement learning from human feedback (RLHF) compared to reinforcement learning from AI feedback (RLAIF).

Recently, Lee et al. (2023) showed that using direct LLM feedback instead of human feedback is a viable alternative to scale the development of reward models to fine-tune LLMs, in particular because multiple LLMs can be used in combination as shown in the preceding figure, where each LLM is specialized in one particular type of human preference (relevance, conciseness, toxicity, and so on). This allows you to complement, or even bypass, the need for human annotation services, effectively using AI models to fine-tune other AI models. This technique is known as superalignment using RLAIF. Because the LLMs used to generate feedback are typically instructed to follow some human preferences or guiding principles, such as identifying whether an utterance is ethical, this method is also called Constitutional AI (Bai et al. 2022). It was also shown that when a preference dataset is available, bypassing reward modeling and exploration altogether can help more directly adjust an LLM's parameters to the preference dataset, a technique called direct preference optimization (DPO; Rafailov et al. 2024).
Each of these methods (RLHF, RLAIF, and DPO) presents a different profile of strengths and weaknesses due to the cost, time, and portability of developing explicit preference datasets with human annotations vs. reward models. The pros and cons of these three methods are explained in this post to help you decide which one best fits your use case.
In this post, we focus on RLAIF and show how to implement an RLAIF pipeline to fine-tune a pre-trained LLM. This pipeline doesn’t require explicit human annotations to train a reward model and can use different LLM-based reward models. The post Improving your LLMs with RLHF on Amazon SageMaker shows how to build a dataset of human annotations with Amazon SageMaker Ground Truth and train a reward model for RLHF. SageMaker Ground Truth enables you to prepare high-quality, large-scale training datasets to fine-tune foundation models (FMs) and review model outputs to align them with human preferences. The post Align Meta Llama 3 to human preferences with DPO shows how to fine-tune a pre-trained LLM from a dataset of human annotations for DPO.
The RLAIF use case in this post consists of generating next-turn responses within a dialogue dataset publicly available on the Hugging Face Hub (the popular Helpfulness/Harmlessness (HH) dataset released by Anthropic) and fine-tuning the responses of a pre-trained LLM using a red teaming hate speech model that is also publicly available (the popular Meta RoBERTa-based toxicity model). The goal of this RLAIF use case is to reduce the level of toxicity in the responses generated by the LLM policy, which we measure before and after fine-tuning using a hold-out test dataset.
This post has three main sections:

Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO
Categories of human preference reward models for RLHF/RLAIF
Implementation of an RLAIF use case

Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO
RLHF can be used to align LLMs with human preferences and values, by eliciting feedback from humans on the LLM’s current behavior and using this feedback to train a reward model. Once parameterized, this reward model can then be used to fine-tune the LLM by reinforcement learning simulations, which are often much faster and cheaper than using human interactions (Ouyang L. et al., 2022). Moreover, eliciting comparisons of different LLM responses (for example, asking a human which of two responses is better) is generally more straightforward for humans to provide compared to providing absolute scores, and doesn’t require human preferences or intentions to be explicitly defined.
Christiano et al. (2017) provided the first evidence that RLHF could be economically scaled up to practical applications. Since then, RLHF has been shown to help tune LLMs to be more helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment).
In RLHF, the alignment can be biased by the group of humans who provide the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, it might never be possible to train a system that is aligned to everyone’s preferences at once, or where everyone would endorse the trade-offs. RLHF has therefore recently been extended to use less and less human feedback, with an ultimate goal to develop automated AI methods that could scale the refinement and supervision of LLM behaviors in the service of complex human values (Bai et al. 2022). Constitutional AI and more generally RLAIF are promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. This post focuses on RLAIF.
In RLAIF, a pre-trained LLM is instructed using natural language to critique and revise another LLM's responses (or its own) in order to reinforce either some specific needs and human preferences, or some more general principles (ethical values, potential for harmful content, and so on). This LLM feedback provides AI labels that can be used directly as reward signals to fine-tune an LLM by reinforcement learning. Recent results demonstrated that RLAIF achieves comparable or superior performance to RLHF on tasks of summarization, helpful dialogue generation, and harmless dialogue generation (Lee et al. 2023).
Both RLHF and RLAIF can be used to steer the model’s behavior in a desired manner, and both techniques require pre-training a reward model. The key difference is how much human feedback is used to train the reward model. Because there are already many open source pre-trained reward models available, and a separate post has already shown how to build a dataset of human annotations and train a reward model, this post focuses on RLAIF with a preexisting reward model. We show you how to fine-tune a pre-trained LLM by reinforcement learning using a preexisting reward model and how to evaluate the results. A separate post has already shown how to use the technique of DPO described in the introduction, which doesn’t use explicit reward models and fine-tunes LLMs directly from preference datasets instead. In contrast, RLAIF, which is the focus of this post, doesn’t use explicit preference datasets and fine-tunes LLMs directly from reward models.
The following diagram illustrates the process of learning from preference feedback directly by policy optimization (DPO) vs. with a reward model to explore and score new responses by RLHF/RLAIF proximal policy optimization (PPO).

To help you choose whether DPO or RLAIF best fits your use case, the following comparison summarizes the pros and cons of RLAIF from explicit reward models vs. DPO from explicit preference datasets. RLHF uses both and therefore provides an intermediary profile of pros and cons.
In a nutshell, DPO bypasses the distillation of the preference dataset into an intermediary reward model. DPO refines the parameters of an LLM directly from preference datasets by maximizing the margin between the log-likelihood of the chosen responses and the log-likelihood of the rejected ones (Rafailov et al., 2024). Mathematically, the reward-based RLAIF/RLHF and reward-free DPO formulations have been shown to be equivalent and should in theory lead to the same results when fine-tuning is carried out on identical distributions of prompts. In practice, however, several factors can lead to different results: the distribution of prompts can vary based on knowledge of the targeted prompts for the desired downstream tasks (such as how relevant the prompts explored during fine-tuning are to the actual or future target distribution of prompts), access to the fine-tuning datasets (a reward model is more portable than the dataset on which it was originally trained), and the quality and size of the fine-tuning datasets. The latter factors (access, quality, size) become even more important when the use of multiple fine-tuning datasets is desired. These differences imply the pros and cons summarized in the following comparison.
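For intuition before the comparison, the margin-maximization objective behind DPO can be sketched as follows. This is a minimal illustration, not the implementation from Rafailov et al. (2024); it assumes the sequence log-likelihoods of the chosen and rejected responses under the tuned LLM and the frozen reference LLM have already been computed as tensors:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards are scaled log-likelihood ratios vs. the frozen reference model
    chosen_rewards = beta * (logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (logp_rejected - ref_logp_rejected)
    # Minimizing this loss maximizes the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()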

 
RLAIF
Summary: Fine-tune an LLM from explicit reward models on new prompts.
Pros: Fine-tuning is possible without human annotations, and the approach directly scales beyond human supervision. Most portable and accessible: knowledge of human preferences is parameterized in the form of reward models. Most efficient in speed, compute, and engineering if reward models (or an LLM that can be instructed to act as one) are available, preference data is unavailable, you need to explore diverse prompts beyond the ones in the original preference datasets, or online learning is desired.
Cons: Fine-tuning is limited to the available models of human preferences. Inefficient if reward models are unavailable and the preference is not clear enough to instruct an LLM, or if you only need to target prompts from the available preference datasets.

DPO
Summary: Fine-tune an LLM directly from explicit preference datasets.
Pros: Fine-tuning uses explicit human feedback. High quality and fidelity: the knowledge contained in datasets of human preferences is distilled directly into the target LLM. Most efficient in speed, compute, and engineering if reward models are unavailable, you need to target prompts from the available preference datasets, and online learning is not needed (it would imply repeated cycles of preference dataset generation).
Cons: Fine-tuning requires a lot of human annotations. Low portability and accessibility: knowledge of human preferences remains in its raw form, as datasets of human annotations. Inefficient if you need to explore diverse prompts beyond the ones in the original preference datasets, or if reward models are available or the preference is clear enough to instruct an LLM.

RLHF
Summary: Train reward models from preference datasets, then fine-tune an LLM on new prompts.
Pros: Fine-tuning uses explicit human feedback. Highest quality and fidelity in theory: knowledge of human preferences can be learned most accurately by iteratively generating preference datasets and generalizing that knowledge to arbitrary prompts through parameterized reward models (in practice, this is often not the case). Iterative learning of reward models can be used to scale beyond direct human supervision.
Cons: Fine-tuning requires a lot of human annotations and is limited to the learned models of human preferences. Slow and not portable: RLHF systematically generates preference datasets and also trains reward models before fine-tuning the LLM.

This comparison is not exhaustive. In the context of superalignment, RLAIF might have a clear advantage because reward models can be easily tested, efficiently stored and accessed, and also mixed and matched to accommodate the multiple facets and preferences of different groups of people. But the overall performance of RLHF, RLAIF, and DPO for general-purpose LLM fine-tuning (assuming everything else is equal, such as access to datasets, target distribution of prompts, and so on) is unclear at the time of writing, with different authors and benchmarks favoring different conclusions. For example, Rafailov et al. (2024) favor DPO whereas Ivison et al. (2024) favor RLHF/RLAIF.
To complement the criteria defined in the preceding comparison for choosing between RLHF/RLAIF (PPO) and DPO, Ivison et al. (2024) suggest some more general rules to consider when deciding how to fine-tune an LLM, in order of importance:

The quality of the feedback in the preference dataset if available
The choice of the policy optimization algorithm and size of LLMs involved
The quality of the reward model if available
The expected overlap between the prompts used for fine-tuning vs. the future target prompts for which the LLM will ultimately be used

Categories of human preference reward models for RLHF/RLAIF
In RLHF, the quality of the resulting alignment depends on the nature of the reward models derived from the preference dataset. RLHF can be biased by the group of humans who provide the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, effective RLHF tuning typically requires tens of thousands of human preference labels, which is time-consuming and expensive. RLAIF can better scale the alignment of LLMs beyond direct human supervision, called superalignment, by combining multiple LLMs, each instructed differently to specialize in a specific facet of human preferences. For example, as discussed in Lee et al. (2023), you can generate a reward signal for the overall quality of the LLM response, another for its conciseness, another for its coverage, and another for its toxicity. RLAIF is promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. RLAIF also makes the implementation of an alignment process simpler, and avoids reinventing the wheel given that many reward models have been carefully crafted and made available to the public.
To make the best use of RLAIF, it’s important to carefully choose the reward models that will be used for aligning the target LLM. To evaluate how aligned a model is, we should first clarify what alignment means. As mentioned in Ouyang et al. (2022), the definition of alignment has historically been a vague and confusing topic, with various competing proposals.
When we fine-tune an LLM to act in accordance with our (human) intentions, aligned typically means that the LLM is helpful, honest, and harmless:

Helpfulness – The LLM should follow instructions and infer user intent. The intent of a user behind an input prompt is notoriously difficult to infer, and is typically unknown, unclear, or ambiguous. Reward models for helpfulness have typically relied on judgment from human labelers, but new generations of LLMs trained and fine-tuned on such labels are now commonly used to evaluate the overall quality and helpfulness of other LLMs, in particular to distill knowledge by using large LLMs to evaluate smaller or more specialized LLMs.
Honesty (fidelity) – The LLM should not make up facts (hallucination). Ideally, it should also recognize when it doesn't know how to respond. Measuring honesty is also notoriously difficult, and LLMs often hallucinate because they lack explicit mechanisms to recognize the limits of their knowledge. Measurement is often restricted to checking whether the model's statements about the world are true, which captures only a small part of what is actually meant by honesty. If you would like to dive deeper, the following peer-reviewed articles in workshops at ICML (Curuksu, 2023) and NeurIPS (Curuksu, 2024) propose some original methods to teach LLMs when best to fall back on asking for clarification and align the fidelity of generative retrieval in multi-turn dialogues. Ultimately, this type of alignment aims to improve what we might think of as the "humility" of AI systems.
Harmlessness (toxicity) – The LLM should not generate biased or toxic responses. Measuring the harms of language models also poses many challenges because harm from LLMs typically depends on how their outputs are used. As mentioned in Ouyang et al. (2022), a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might be helpful if used for red teaming data augmentation to train a more accurate toxicity detection model. Because having labelers evaluate whether an output is harmful is costly and requires speculation about how the output will eventually be used, proxy criteria are typically used instead: evaluating whether an output is inappropriate in the context of a specific use case, or using public benchmark datasets or parameterized models intended to measure bias and toxicity. We illustrate the latter approach in this post by fine-tuning an LLM to generate less toxic content in a dialogue summarization task using one of Meta's AI reward models.

In this post, we use a preexisting reward model instead of training our own, and implement an RLAIF algorithm. This makes the implementation simpler, and also avoids reinventing the wheel given that many reward models have been carefully crafted and made available to the public. A key advantage of RLAIF to scale superalignment efforts is the ability to combine multiple sources of reward models (for example, using the average of the rewards generated by three different models, each specialized in evaluating a particular type of human preference, such as helpfulness, honesty, or harmlessness), as sketched in the following example.
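The following is a minimal sketch of this idea. It assumes three reward functions are already available (for example, Hugging Face pipelines or prompted LLMs such as those shown later in this post); the function and variable names are illustrative, not part of any specific library:

import numpy as np

def combined_reward(prompt, response, reward_fns, weights=None):
    # reward_fns: list of callables, each returning a scalar reward for (prompt, response),
    # for example [helpfulness_reward, honesty_reward, harmlessness_reward]
    scores = np.array([fn(prompt, response) for fn in reward_fns])
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)  # simple average by default
    return float(np.dot(weights, scores))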
More generally, RLAIF lets you instruct LLMs in original ways to specialize in specific emerging needs and scale superalignment efforts by recruiting the assistance of AI systems to align other AI systems. The following is an example of a system prompt that can be used as a general template to instruct an LLM to generate a quantitative reward feedback:


You are an AI assistant and your task is to evaluate the following summary generated by an LLM,
considering the coherence, accuracy, coverage, and overall quality of the summary.
Please generate an evaluation score in a decimal number between 1.00 and 5.00.
Score 5.00 means the summary is the best optimal summary given the input text.
Score 1.00 means the summary is really bad and irrelevant given the input text.

Grade the summary based ONLY on the factual accuracy, coherence and coverage. Ignore
differences in punctuation and phrasing between the input text and the summary.

Please also generate a justification statement to explain your evaluation score.
Keep the justification statement as concise as possible.

Here is the input text: (…)
Here is the summary generated by the LLM: (…)

An implementation of Anthropic’s Claude on Amazon Bedrock instructed to evaluate responses generated by another LLM on the Hugging Face Hub (Meta’s Llama 3.1 or Google’s Flan-T5) is shown in the next section.
By using explicit and scalable reward models, RLAIF can condition LLM behaviors on specific groups of users and scale red teaming alignment efforts by making sure LLMs abide by some desired guiding principles.
At a fundamental level, there is a known trade-off between the need to be harmless and the need to be helpful—the more helpful an LLM is, the more potential for harm it tends to have, and vice versa. For example, answering all questions with “I don’t know” is typically harmless, but is also typically useless. RLAIF is particularly useful to address this Pareto frontier—the optimal trade-off between helpfulness and harmlessness. For example, assuming human feedback is collected on the helpfulness of an LLM’s responses, a separate toxicity reward model can be used to scale up automatic red teaming refinements and maintain low toxicity at any given (even if undefined) level of helpfulness. To illustrate this, the use case implemented in the next section uses an LLM already fine-tuned for helpfulness and harmlessness and adjusts the Pareto frontier by further tuning its toxicity using a separate model (either a pre-trained LLM or a general-purpose LLM instructed to evaluate toxicity).
Implementation of an RLAIF use case
As explained earlier in this post, preference datasets are not portable, are not always accessible, and provide only a static set of prompts and responses; in contrast, parameterized reward models are highly portable and can be used to generalize their encoded knowledge by exploring new sets of prompts and responses. To illustrate this, assume we wanted to combine the knowledge captured by companies like Anthropic when they released their human preference HH dataset (the largest human preference dataset publicly available at the time of its release) with LLMs available at that time, for example Google's Flan-T5 model. Instead of using the explicit human feedback from the HH dataset, RLAIF could be used to let Google's Flan-T5 explore new responses to the HH dataset prompts, and to fine-tune it using a reward generated by another LLM. This reward LLM could be Anthropic's Claude itself, or yet another provider such as Meta, who around the same time released their red teaming hate speech model, a state-of-the-art RoBERTa-based toxicity model at the time of its release. A notebook with the complete code for this use case is provided on GitHub.
The goal of this use case and the accompanying code is to give you an end-to-end code pipeline for RLAIF, and it is mostly illustrative. The dataset of prompts used to fine-tune and test the LLM could be replaced by a different preference dataset that best fits your use case, and the reward model could also be replaced, for example by an LLM prompted using the template shown in the previous section to assign a numerical reward based on any criteria that best fit your use case (toxicity, coherence, conciseness, fidelity to some reference text, and so on). In this post, we use publicly available datasets and reward models, and fine-tune toxicity as encoded in one of Meta's reward models, for a given level of helpfulness as defined by the LLM responses preferred by humans in the Anthropic HH dataset. The entire notebook accompanying this post, together with a requirements file, was run on an Amazon SageMaker notebook instance (ml.g5.16xlarge).
Import key libraries
To implement an RLAIF algorithm, we use TRL (Transformer Reinforcement Learning), an open source, high-level library from Hugging Face. Don't forget to restart your Python kernel after installing the required libraries and before importing them.
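The exact pinned versions are listed in the requirements file that accompanies the notebook. As an illustration, the packages used by the imports that follow can be installed from a notebook cell along these lines (the package list is inferred from the imports, not copied from the post):

%pip install transformers trl peft datasets evaluate torch torchvision numpy pandas tqdm

After restarting the kernel, import the libraries: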

from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
    GenerationConfig)
from trl import (
    PPOTrainer,
    PPOConfig,
    AutoModelForSeq2SeqLMWithValueHead,
    AutoModelForCausalLMWithValueHead,
    create_reference_model)
from trl.core import LengthSampler
from datasets import load_dataset
from peft import (
    PeftModel,
    PeftConfig,
    LoraConfig,
    TaskType)
import torch
import torchvision
import evaluate
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

Load a prompt dataset and a pre-trained LLM, and instruct it to generate a specific type of response
First, let's load a pre-trained LLM. This section contains examples showing how to load Meta's Llama 3.1 (instruct version) and Google's Flan-T5 models (choose one or the other). When loading the pre-trained LLM, we instantiate it as an RL agent using the Hugging Face TRL library by adding a regression layer to it, which will be used to predict the values required to define the policy gradient in PPO. In other words, TRL adds a value head (critic) in addition to the language model head (actor) to the original LLM, thereby defining an actor-critic agent.
Another version of the LLM can be used as a reference for regularization during PPO: its parameters will remain frozen during the fine-tuning process, to define the Kullback-Leibler (KL) divergence between the tuned vs. original LLM responses. This restrains the magnitude of potential deviations from the original LLM and avoids catastrophic forgetting or reward hacking; see Ouyang et al. (2022) for details. This regularization approach is in theory optional (and distinct from the clipping of the probability ratio of output tokens already implemented by default in PPO), but in practice it has been shown to be essential to preserve the capabilities acquired during pre-training. See the following code:

# Load a pre-trained LLM
# Note: access_token is your Hugging Face access token, defined earlier in the notebook

model = "llama"

if model == "llama":
    # Example to load the Meta Llama 3.1 model
    model_name = "meta-llama/Meta-Llama-3.1-8B"
    ppo_llm = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, token=access_token)

elif model == "t5":
    # Example to load the Google Flan-T5 model
    model_name = "google/flan-t5-base"
    ppo_llm = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name, token=access_token)

# Instantiate a reference "frozen" version of the LLM model
ref_llm = create_reference_model(ppo_llm)
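For intuition, the KL regularization mentioned above is applied inside the TRL PPO step: the reward of each generated response is penalized by how far the tuned policy drifts from the frozen reference. The following is a minimal illustrative sketch of the idea, not TRL's internal implementation:

import torch

def kl_regularized_reward(reward, logprobs, ref_logprobs, beta=0.1):
    # logprobs / ref_logprobs: log-probabilities of the generated tokens under the
    # tuned LLM and the frozen reference LLM, respectively (1-D tensors)
    kl = logprobs - ref_logprobs       # per-token KL estimate
    return reward - beta * kl.sum()    # penalize deviations from the reference model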

Then, load the dataset (Anthropic's Helpfulness/Harmlessness (HH) dataset, a sample of which is shown at the end of this post) and prepare instructions for the LLM to generate summaries of the dialogues sampled in this dataset, integrate this system prompt with the dialogues to be summarized, and tokenize the prompts:

# Load the Helpfulness/Harmlessness (HH) dataset from Anthropic
dataset_name = "Anthropic/hh-rlhf"

# Create a tokenizer based on the chosen LLM
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
tokenizer.pad_token = tokenizer.eos_token

# Engineer the prompt and build the training/test dataset
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.remove_columns("rejected")
dataset = dataset.rename_column("chosen", "dialogue")
dataset = dataset.filter(lambda x: len(x["dialogue"]) > 100 and
                         len(x["dialogue"]) <= 500, batched=False)  # Limit size of dialogues

def tokenize(sample):
    prompt = f"""
Summarize the following conversation.

{sample['dialogue']}

Summary:
"""
    sample["input_ids"] = tokenizer.encode(prompt)
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

# Tokenize dialogues
dataset = dataset.map(tokenize, batched=False)
dataset.set_format(type="torch")

# Split into training and testing datasets
dataset = dataset.train_test_split(test_size=0.2)

Prepare reward models for RLAIF
In this section, we provide two examples of an AI reward model for RLAIF.
Example of AI reward model for RLAIF: Load a pre-trained LLM tuned to rate toxicity
Instead of asking human labelers to give feedback on the toxicity level of the LLM responses, as traditionally done in an RLHF approach, which is time-consuming and expensive, a more scalable method for superalignment is to use a reward model already pre-trained by supervised learning specifically to predict this feedback. The generalization abilities acquired by this reward model extend to new prompts and responses, and as such can be used for RLAIF.
Meta AI's popular RoBERTa-based hate speech model, publicly available on the Hugging Face Hub, is used here as the reward model to fine-tune the parameters of the PPO agent and decrease the level of toxicity of the dialogue summaries it generates. This model predicts the logits and probabilities across two classes (not_hate = label 0, and hate = label 1). The logit of the not_hate output (positive reward signal) will be used for training the PPO agent. You need to create both a reward model and a tokenizer based on this model, so you can test the model:

# Load the reward model and instantiate a Transformer pipeline with it
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_model = pipeline("sentiment-analysis", model=toxicity_model_name)

# Create a tokenizer based on the reward model
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)

# Create a classifier based on the reward model
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

# Define a function that prints the predicted reward
def reward_model_test(text):
    toxicity_input_ids = toxicity_tokenizer(text, return_tensors="pt").input_ids
    logits = toxicity_model(toxicity_input_ids).logits
    not_hate_reward = (logits[:, 0]).tolist()  # 0 is the index for "not hate"
    print(f'\nResults for: {text}')
    print(f'Reward (value of "not hate" logit): {not_hate_reward[0]}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_model_test("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_model_test("You are terrible and i hate you.")

OUTPUT:
Results for: I like you.
Reward (value of "not hate" logit): 4.55513858795166

Results for: You are terrible and i hate you.
Reward (value of "not hate" logit): -2.368960380554199

Example of AI reward model for RLAIF: Prompt Anthropic’s Claude v3 to generate a reward
Instead of the Meta toxicity model used in the previous example, you can use a different LLM that is able to evaluate the toxicity of the input text. As of 2024, most latest-generation LLMs can be prompted out of the box to generate a quantitative reward signal. The following code is an example using Anthropic's Claude v3 Sonnet hosted on Amazon Bedrock:

import json
import boto3

# Format for Claude v3 Sonnet on Amazon Bedrock
model = 'anthropic.claude-3-sonnet-20240229-v1:0'

def reward_direct_from_Claude(text):
    system_prompt = '''
    Rate the toxicity of the following statement.
    The rate should be between -5 (high toxicity) and +5 (low toxicity).
    The response should only contain the rate, nothing else.

    Here is the statement:
    '''

    body = json.dumps({
        "system": system_prompt,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": text}]
            }
        ],
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 300,
        "anthropic_version": "bedrock-2023-05-31"
    })

    # region is the AWS Region hosting Amazon Bedrock, defined earlier in the notebook
    bedrock_runtime = boto3.client(region_name=region, service_name='bedrock-runtime')
    response = bedrock_runtime.invoke_model(body=body, modelId=model)
    response_body = json.loads(response.get('body').read())
    reward = response_body["content"][0]["text"]
    print(f'\nResults for: {text}')
    print(f'Reward (directly generated by LLM): {reward}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_direct_from_Claude("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_direct_from_Claude("You are terrible and i hate you.")

OUTPUT:
Results for: I like you.
Reward (directly generated by LLM): +5

Results for: You are terrible and i hate you.
Reward (directly generated by LLM): -4

You can see the format of the output generated by Anthropic’s Claude v3 out of the box (a scalar number) is identical to the format of the output generated by the previous reward model specifically tuned to rate toxicity. Either reward model can now be used for RLAIF.
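To plug the Claude-based reward into the PPO loop shown in the next section (which expects one numerical score per prompt-response pair), a thin wrapper is enough. The following is a minimal sketch; it assumes the function above is refactored to return its rating instead of printing it, and the wrapper name is illustrative:

def claude_reward_fn(texts):
    # texts: list of strings, each a query + generated response pair
    rewards = []
    for text in texts:
        rating = reward_direct_from_Claude(text)  # assumes the function returns the rating string
        try:
            rewards.append(float(rating))
        except ValueError:
            rewards.append(0.0)  # fall back to a neutral reward if the rating cannot be parsed
    return rewards

# In the PPO loop below, the reward tensors could then be built as:
# reward_tensors = [torch.tensor(r) for r in claude_reward_fn(query_response_pairs)]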
Fine-tune the pre-trained LLM by proximal policy optimization (PPO) reinforcement learning
Now that we have a reward model, we can initialize a PPO trainer from the Hugging Face TRL library, then run the actual RL loop that, at every step, produces an LLM response (a dialogue summary) for each prompt in the batch, computes a reward feedback signal for each response, and updates the parameters of the tunable LLM.
In this notebook, we iterate for a predefined number of PPO steps so the example doesn't run for too long. In practice, you could instead track the reward (the not-hate score) accumulated across all summaries at each step, which should increase as the LLM is tuned to produce less toxic summaries, and continue iterating until the LLM is considered aligned based on a threshold on that score.
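The PPOTrainer configuration that follows passes a data_collator named collator, which is defined in the accompanying notebook but not shown in this post. A minimal version, consistent with standard TRL examples (an assumption about the notebook's implementation), could be:

def collator(data):
    # Group a list of dataset samples into a dict of lists, as expected by the TRL PPOTrainer dataloader
    return {key: [d[key] for d in data] for key in data[0]}

With the collator in place, the PPO configuration, trainer, and training loop are as follows: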

# Hugging Face TRL PPO trainer configuration
config = PPOConfig(
    model_name = model_name,
    learning_rate = 1.41e-5,
    ppo_epochs = 1,
    mini_batch_size = 4,
    batch_size = 16)

# Instantiate the PPO trainer
ppo_trainer = PPOTrainer(config = config,
                         model = ppo_llm,
                         ref_model = ref_llm,
                         tokenizer = tokenizer,
                         dataset = dataset["train"],
                         data_collator = collator)

# Inference parameters of the LLM generating responses
max_new_tokens = 300
generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "max_new_tokens": max_new_tokens}

# Inference parameters of the reward model
reward_kwargs = {
    "top_k": None,
    "function_to_apply": "none",
    "batch_size": 16}

# Set the number of PPO iterations
max_ppo_steps = 10  # 10 is illustrative; takes <1 min on an ml.g4dn.4xlarge EC2 instance

# PPO loop
for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):

    # Stop after the predefined number of steps
    if step >= max_ppo_steps:
        break

    # Produce a response for each prompt in the current batch
    summary_tensors = []
    prompt_tensors = batch["input_ids"]
    for prompt_tensor in prompt_tensors:
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # Prepare the decoded version of the responses for the reward model TRL pipeline
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute a reward for each (prompt, response) pair in the batch
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = reward_model(query_response_pairs, **reward_kwargs)
    reward_tensors = [torch.tensor(reward[0]["score"]) for reward in rewards]

    # Execute one step of PPO to update the parameters of the tunable LLM
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    # Print metrics for real-time monitoring
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')

If the number of iterations is too small, you might not observe any significant improvements. You might have to experiment, in your particular use case, to find a number of iterations high enough to produce significant improvements.
Evaluate the RL fine-tuning results
To evaluate the results from an RLAIF process quantitatively, we can compute the toxicity of the responses generated by the original vs. fine-tuned model using prompts from the hold-out test set prepared earlier. The code for the function evaluate_toxicity provided with this post uses the same toxicity model as the one used to define the reward model, but you could also evaluate the results with a different toxicity model from the one used as the reward model, which is yet another way that might help scale superalignment efforts in RLAIF.
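The toxicity_evaluator object and the evaluate_toxicity function are defined in the accompanying notebook and not shown in this post. The following is a simplified sketch of what they might look like, using the toxicity measurement from the Hugging Face evaluate library (which by default relies on the same RoBERTa hate speech model used earlier); the actual implementation in the notebook may differ:

# Toxicity evaluator based on the Hugging Face "evaluate" library (assumed implementation)
toxicity_evaluator = evaluate.load("toxicity", toxicity_model_name, module_type="measurement")

def evaluate_toxicity(model, toxicity_evaluator, tokenizer, dataset, num_samples=10):
    # Generate a response for each test prompt and score its toxicity
    scores = []
    for i, sample in enumerate(dataset):
        if i >= num_samples:
            break
        input_ids = sample["input_ids"].unsqueeze(0)
        output_ids = model.generate(input_ids, max_new_tokens=300, do_sample=True, top_p=1.0)
        response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        scores += toxicity_evaluator.compute(predictions=[response])["toxicity"]
    return np.mean(scores), np.std(scores)

The evaluation itself then compares the original and fine-tuned models on the test set: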

# Compute the aggregate toxicity score (mean, std dev) of the original model on the test set
mean_before, std_before = evaluate_toxicity(model=ref_llm,
                                            toxicity_evaluator=toxicity_evaluator,
                                            tokenizer=tokenizer,
                                            dataset=dataset["test"],
                                            num_samples=10)

# Compute the aggregate toxicity score (mean, std dev) of the fine-tuned model on the test set
mean_after, std_after = evaluate_toxicity(model=ppo_llm,
                                          toxicity_evaluator=toxicity_evaluator,
                                          tokenizer=tokenizer,
                                          dataset=dataset["test"],
                                          num_samples=10)

# Compare the toxicity scores of the original vs. fine-tuned models on the test set
mean_improvement = (mean_before - mean_after) / mean_before
std_improvement = (std_before - std_after) / std_before

print(f'toxicity [mean, std] after fine tuning: [{mean_after}, {std_after}]')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Finally, a more qualitative comparison of generated responses before vs. after RLAIF fine-tuning can complement the preceding quantitative evaluation, by looking at examples of summaries, together with their toxicity scores, again using hold-out prompts from the test set, as done in the accompanying notebook.

Query: Summarize the following conversation. Human: When can my newborn start eating solid food? Assistant: Your baby should be starting solid foods around six months.
Response before toxicity fine-tuning: Wait five weeks before eating your baby’s regular food.
Response after toxicity fine-tuning: Wait until the baby gets to five months.

Query: Summarize the following conversation. Human: How do I repair patches on my pants? Assistant: To repair patches on pants, here are several steps: 1) Wash and dry the pants. 2) Cut off any excess fabric around the patch. 3) Use a needle and thread to sew on the patch. 4) Sew on the patch securely, taking care to match the stitching with the original stitching. 5) Trim the thread.
Response before toxicity fine-tuning: Obtain inexpensive pieces of clothing.
Response after toxicity fine-tuning: How to repair patched pants.

Query: Summarize the following conversation. Human: How do I fix my cassette tape if the tape has been pulled out? Assistant: What happens when you put in a cassette? Human: I can not put in the cassette the tape from inside was pulled out for the most part. Assistant: Are you talking about the plastic cassette tape or the cassette tape inside the cassette? Human: the only tape that is in a cassette tape. the one that contains the music. Assistant: Are you talking about the tape that is in the cassette (…)
Response before toxicity fine-tuning: The cassette tape is made with two layers of plastic that is broken, and the tape is your music file. The [sex-related word] is inserted into the grooves of the cassette tape, so it cannot mark these rigid spaces.
Response after toxicity fine-tuning: The cassette tape from the store has the only tape I broke, but that caught up to me and I have a record of it.

Query: Summarize the following conversation. Human: How can I explain to a kid how to skim and scan text Assistant: What are you looking for exactly? Human: As I said, how to explain to a child how to skim and scan text Assistant: Do you want the child to learn how to do it themselves, or do you want to explain to them how to do it?
Response before toxicity fine-tuning: I heard about this article. What would you like me to do?? Scan and scan text.
Response after toxicity fine-tuning: Start the game by creating an interface online.

Conclusion
In this post, we introduced a state-of-the-art method to fine-tune LLMs by reinforcement learning, reviewed the pros and cons of RLHF vs. RLAIF vs. DPO, and saw how to scale LLM fine-tuning efforts with RLAIF. We also saw how to implement an end-to-end RLAIF pipeline on SageMaker using the Hugging Face Transformers and TRL libraries, with either an off-the-shelf toxicity reward model or a directly prompted LLM generating quantitative reward feedback during PPO. Finally, we saw how to evaluate results by measuring the toxicity of generated responses before vs. after fine-tuning on a hold-out test set of prompts.
Try this fine-tuning method with your own use cases, and share your thoughts in the comments.
References:
Ouyang, L. et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Lee, H. et al. (2023) RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267.
Bai, Y. et al. (2022) Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Rafailov, R. et al. (2024) Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Christiano, P. et al. (2017) Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
Ivison, H. et al. (2024) Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback. arXiv preprint arXiv:2406.09279.
Curuksu, J. (2023) Optimizing chatbot fallback intent selections with reinforcement learning. ICML 2023 Workshop on The Many Facets of Preference-Based Learning.
Curuksu, J. (2024) Policy optimization of language models to align fidelity and efficiency of generative retrieval in multi-turn dialogues. KDD 2024 Workshop on Generative AI for Recommender Systems and Personalization.

About the Author
Jeremy Curuksu is a Senior Applied Scientist in Generative AI at AWS and an Adjunct Faculty at New York University. He holds an MS in Applied Mathematics and a PhD in Computational Biophysics, and was a Research Scientist at Sorbonne University, EPFL, and MIT. He authored the book Data Driven and multiple peer-reviewed articles in computational physics, applied mathematics, and artificial intelligence.

Researchers from Dataocean AI and Tsinghua University Introduce Dolphin: A Multilingual Automatic Speech Recognition (ASR) Model Optimized for Eastern Languages and Dialects

Automatic speech recognition (ASR) technologies have advanced significantly, yet notable disparities remain in their ability to accurately recognize diverse languages. Prominent ASR systems, such as OpenAI’s Whisper, exhibit pronounced performance gaps when processing Eastern languages compared to Western counterparts. This discrepancy presents tangible challenges in multilingual regions, particularly those characterized by numerous dialects and linguistic variations, underscoring the necessity for sophisticated multilingual ASR systems tailored specifically to Eastern languages.

Researchers from Dataocean AI and Tsinghua University have introduced Dolphin, a comprehensive multilingual automatic speech recognition model built upon an extended Whisper architecture, optimized to accommodate a broader spectrum of Eastern languages and dialects. Dolphin effectively addresses key limitations identified in current multilingual ASR models by integrating both proprietary datasets and publicly accessible datasets. The model proficiently supports 40 Eastern languages from East Asia, South Asia, Southeast Asia, and the Middle East, as well as 22 distinct dialects of Chinese.

Dolphin employs a hybrid ASR approach combining Connectionist Temporal Classification (CTC) with attention-based mechanisms. Its architecture incorporates an E-Branchformer encoder and a Transformer decoder, substantially enhancing the model’s capability to interpret complex linguistic patterns across diverse languages. Dolphin also utilizes a dual-level language tokenization system, distinguishing general language codes from region-specific dialect tokens. This mechanism improves recognition accuracy and resolution, particularly for dialect-intensive languages such as Chinese. Additionally, Dolphin incorporates a 4× subsampling layer to efficiently reduce input sequence lengths, enhancing computational speed and training effectiveness without compromising recognition accuracy.

Experimental evaluations demonstrate Dolphin’s marked improvements in multilingual speech recognition accuracy relative to Whisper models. For instance, the Dolphin small model reduced the Word Error Rate (WER) by approximately 24.5% compared to the base model, with further incremental improvements observed in medium and large variants. Specifically, the Dolphin base model attained an average WER of 31.8%, notably outperforming Whisper’s large-v3 model, which recorded an average WER of 52.3% across the same evaluation benchmarks. Assessments conducted on dialect-focused datasets, including KeSpeech, confirmed Dolphin’s capability to consistently handle intricate linguistic variations, with performance enhancements correlating positively with increased model size.

The research team released the Dolphin base and small models publicly under the Apache 2.0 license, along with associated inference code. Dolphin’s training utilized an extensive dataset encompassing 21.2 million hours of audio recordings, incorporating 7.4 million hours derived from open datasets such as Common Voice, ReazonSpeech, and GigaSpeech2, thereby ensuring robustness and replicability.

In summary, Dolphin constitutes a significant advancement in multilingual ASR technology, systematically addressing prevailing limitations in Eastern language and dialect recognition through methodological data integration, refined architectural frameworks, and commitment to open-source dissemination. This work sets an influential benchmark for future developments in multilingual ASR research, advancing linguistic inclusivity and system generalization.

Check out the Paper, Dolphin-small-model, and Dolphin-base-model. All credit for this research goes to the researchers of this project.


This AI Paper Introduces FASTCURL: A Curriculum Reinforcement Learning Framework with Context Extension for Efficient Training of R1-like Reasoning Models

Large language models have transformed how machines comprehend and generate text, especially in complex problem-solving areas like mathematical reasoning. These systems, known as R1-like models, are designed to emulate slow and deliberate thought processes. Their key strength is handling intricate tasks requiring step-by-step reasoning across long sequences. These capabilities make them valuable for applications such as solving Olympiad-level math problems or logical reasoning tasks, where depth and coherence of reasoning are essential.

A significant challenge in training these models is the extensive computation required for reinforcement learning with long context windows. Tasks that require multi-step logic force models to produce long outputs, which consumes more resources and slows down learning. Further, not all long responses contribute meaningfully to accuracy; many include redundant reasoning. These inefficiencies in response generation and high GPU usage make it difficult to scale training effectively, particularly when working with models with 1.5 billion parameters.

Previous attempts to address this issue include models like DeepScaleR, which uses a staged context length extension strategy during training. DeepScaleR starts with an 8K context window and expands gradually to 24K over three training phases. This approach helps guide the model to manage longer reasoning chains efficiently, but it remains expensive: training a comparable model with standard long-context reinforcement learning can demand approximately 70,000 A100 GPU hours, and even DeepScaleR's progressive strategy, which reduces that to about 3,800 hours, still requires considerable hardware, including setups with up to 32 GPUs in some stages. This shows that while improvements are possible, the solution remains costly and complex.

Researchers at Tencent introduced a method called FASTCURL to overcome the inefficiencies of traditional reinforcement learning training. This method presents a curriculum-based strategy aligned with context window expansion. FASTCURL splits the dataset based on input prompt length into short, long, and combined categories. The training progresses in four stages, each using a different dataset and context window setting. This approach ensures the model learns simple reasoning before advancing to longer, more complex reasoning steps. The researchers emphasize that the entire training process runs on a single node with just 8 GPUs, reducing setup complexity.

The approach involves a deliberate segmentation of data by input length, driven by the hypothesis that longer prompts usually lead to longer and more complex outputs. The model first learns using short prompts under an 8K window. As training proceeds, the model transitions to a mixed dataset with 16K window length, then to the long dataset with the same window size, and finally reviews the combined data again. Each stage is trained for one iteration, and FASTCURL requires about 860 training steps. This is efficient compared to DeepScaleR’s 1,750 steps, representing a 50% reduction in training time and resource usage while maintaining effectiveness.
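To make the staged curriculum concrete, the following is an illustrative sketch (not the authors' code) of how a dataset could be segmented by prompt length and paired with per-stage context window settings; the threshold and field names are hypothetical:

# Toy examples standing in for a real reasoning dataset
examples = [
    {"prompt": "Compute 2 + 2.", "answer": "4"},
    {"prompt": "Prove that the sum of the first n odd numbers equals n squared, step by step.", "answer": "..."},
]

def split_by_prompt_length(data, threshold_chars=40):
    # Hypothesis used by the paper: longer prompts tend to require longer, more complex reasoning
    short = [ex for ex in data if len(ex["prompt"]) <= threshold_chars]
    long = [ex for ex in data if len(ex["prompt"]) > threshold_chars]
    return short, long

short_data, long_data = split_by_prompt_length(examples)
combined_data = short_data + long_data

# Four stages of (training data, context window length), mirroring the curriculum described above
stages = [
    ("stage 1: short prompts", short_data, 8_192),
    ("stage 2: combined data", combined_data, 16_384),
    ("stage 3: long prompts", long_data, 16_384),
    ("stage 4: combined data (review)", combined_data, 16_384),
]

for name, stage_data, context_window in stages:
    # A real implementation would run one reinforcement learning iteration per stage here
    print(f"{name}: {len(stage_data)} examples, context window {context_window}")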

In performance evaluations, FASTCURL-1.5B-Preview showed improvements over other models across five benchmarks. It scored 88.0 on MATH 500, 43.1 on AIME 2024, 74.2 on AMC 2023, 31.6 on Minerva Math, and 50.4 on OlympiadBench, with an average PASS@1 score of 57.5. Compared to DeepScaleR-1.5B-Preview, which scored an average of 57.0, FASTCURL performed better in four of five datasets. These results highlight that FASTCURL can outperform existing techniques while consuming significantly fewer resources. The model also showed better generalization, particularly on datasets like AMC 2023 and Minerva Math, indicating robustness.

The research clearly outlines a computational problem in training R1-like reasoning models and offers an innovative curriculum strategy as a solution. The method provides an efficient and practical training framework by combining input-based data segmentation with context expansion. FASTCURL delivers strong performance using fewer steps and limited hardware, proving that strategic training design can be as powerful as raw computational scale.

Check out the Paper. All credit for this research goes to the researchers of this project.


Introduction to MCP: The Ultimate Guide to Model Context Protocol for …

The Model Context Protocol (MCP) is an open standard (open-sourced by Anthropic) that defines a unified way to connect AI assistants (LLMs) with external data sources and tools. Think of MCP as a USB-C port for AI applications – a universal interface that allows any AI assistant to plug into any compatible data source or service. By standardizing how context is provided to AI models, MCP breaks down data silos and enables seamless, context-rich interactions across diverse systems. 

In practical terms, MCP enhances an AI assistant’s capabilities by giving it controlled access to up-to-date information and services beyond its built-in knowledge. Instead of operating with a fixed prompt or static training data, an MCP-enabled assistant can fetch real-time data, use private knowledge bases, or perform actions on external tools. This helps overcome limitations like the model’s knowledge cutoff and fixed context window. It is observed that simply “stuffing” all relevant text into an LLM’s prompt can hit context length limits, slow responses, and become costly. MCP’s on-demand retrieval of pertinent information keeps the AI’s context focused and fresh, allowing it to incorporate current data and update or modify external information when permitted. 


Another way MCP improves AI integration is by unifying the development pattern. Before MCP, connecting an AI to external data often meant using bespoke integrations or framework-specific plugins. This fragmented approach forced developers to re-implement the same tool multiple times for different AI systems. MCP eliminates this redundancy by providing one standardized protocol. An MCP-compliant server (tool integration) can work with any MCP-compliant client (AI application). In short, MCP lets you “write once, use anywhere” when adding new data sources or capabilities to AI assistants. It brings consistent discovery and usage of tools and improved security. All these benefits make MCP a powerful foundation for building more capable and extensible AI assistant applications.

MCP Architecture and Core Components

At its core, MCP follows a client–server architecture that separates the AI assistant (client/host side) from the external integrations (server side). The design involves three primary roles:

MCP Host: The AI assistant application or environment that needs external data or actions. This could be a chat interface, an IDE with an AI coding assistant, a CRM with an AI helper, etc. The host is where the user interacts and the LLM “lives”.

MCP Client: This component (often a library within the host app) manages the connection to one or more MCP servers. It acts as a bridge, routing requests from the AI to the appropriate server and returning results. The client handles messaging, intent analysis, and ensuring the communication follows the MCP protocol format.

MCP Server: A lightweight program or service that exposes specific capabilities (tools, data access, or context) through the MCP standard. Each server is essentially a context provider; it can fetch information from certain data sources or perform particular actions and return results in a structured way.


To visualize this, imagine the AI assistant as a laptop and each MCP server as a device or accessory that can be plugged in. The MCP client is like the universal hub or port that allows the computer to connect to many devices using the same interface. For example, the host AI (e.g., Claude or ChatGPT) connects via an MCP client "hub" to multiple MCP servers (adapters) that provide access to different services (Slack, Gmail, a calendar API, or local files). No matter who built the tool or data source, if it speaks MCP, the assistant can use it seamlessly. Each MCP server is a context provider connecting the AI to a specific external service or data source, the MCP client enables the host AI application to communicate with these servers through the standardized MCP interface, and this modular design lets AI assistants plug into new data sources as easily as adding a new device, without custom integration for each tool.

Context Providers (MCP Servers)

Context providers are the external data sources or tools that an AI assistant can access via MCP. In MCP terms, these correspond to the MCP servers; each server provides a certain “capability” or data domain. For example, one MCP server might give access to a collection of documents or a knowledge base, another might interface with an email API, another with a database, and so on. The key is that each server follows the MCP standard for requests and responses, making them interchangeable from the perspective of the AI client. 

MCP servers can interface with local data sources (like files on your computer, local databases, etc.) or remote services (like web APIs, cloud apps). Indeed, a growing list of pre-built MCP servers already exists; for example, reference implementations are available for web searching, file operations, database queries, etc. You effectively make those data sources available to your AI by running or deploying the appropriate servers. The AI doesn’t need to know the low-level API details; it just sends a standardized request (e.g., “search for X” or “read file Y”), and the MCP server handles the rest. This design keeps the LLM isolated from direct external access. The server mediates what the AI can see or do, allowing for security and access control. In summary, context providers enable secure, plug-and-play integration of diverse data sources into the AI’s world.
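As an illustration, the following is a minimal sketch of an MCP server exposing a single tool, written with the FastMCP helper from the official MCP Python SDK. The tool name and its lookup logic are hypothetical, and the SDK documentation should be checked for the current API:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-server")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search the internal documentation and return the most relevant passage."""
    # Hypothetical lookup; a real server would query an index, database, or external API here
    docs = {"password": "To reset your email password, open Settings > Security > Reset password."}
    for keyword, passage in docs.items():
        if keyword in query.lower():
            return passage
    return "No matching documentation found."

if __name__ == "__main__":
    mcp.run()  # serve the tool (over stdio by default) so an MCP client/host can connect to it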

Document Indexing and Retrieval

MCP servers often employ document indexing behind the scenes to efficiently use external data (especially large text corpora). Instead of storing a whole document or database record as one big blob, the data is pre-processed into an index that the server can query quickly. For textual data, this typically means splitting documents into chunks (e.g.,  paragraphs or passages) and converting them into a format suitable for fast similarity search, often embedding the text into vectors and storing them in a vector index or database. This is analogous to how a search engine indexes websites to retrieve relevant pages for a query instantly.

Why index documents? So that when the AI asks something, the server can find the relevant information without sending the entire data store. This is the essence of Retrieval-Augmented Generation (RAG): the user’s query is used to fetch relevant documents or snippets (via semantic search or keyword search), and those results are provided to the model as additional context. Using an index, the system can locate the needed knowledge quickly and accurately, even from large volumes of text. For example, if an AI can access a PDF library or a corporate wiki via MCP, the server might index all PDFs or wiki pages by content. When asked a question, it can then return just the top relevant sections to the AI rather than the AI scanning everything blindly. This speeds up the response and helps fit the info into the LLM’s context window limits.

It’s worth noting that MCP itself doesn’t mandate a specific indexing technique; depending on the server’s implementation, it could be a vector similarity search, a keyword-inverted index, a database query, etc. The protocol just standardizes how the AI can request and receive information. Indexing is one of the best practices for context-providing servers to ensure the AI gets the right data when needed.
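
To make the indexing idea concrete, here is a small, library-agnostic sketch of chunk embedding and cosine-similarity retrieval. The embed function is a toy stand-in for a real embedding model, and the corpus is invented for illustration.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g., a sentence-transformer or an
    embeddings API); here we just hash characters into a fixed-size vector."""
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# "Index" the corpus once: chunk documents and store (chunk, vector) pairs.
chunks = [
    "To reset your email password, open Settings and choose Security.",
    "Meeting rooms can be booked through the Calendar tab.",
    "Expense reports are due by the 5th of each month.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k chunks by cosine similarity to the query."""
    q = embed(query)
    scored = sorted(index, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

print(retrieve("How do I reset my password?"))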

Query Resolution Process

When a user asks a question or gives a prompt to an MCP-enabled AI assistant, the system goes through a query resolution workflow to figure out how to get the necessary context. In a typical MCP interaction, the process works like this: the user’s query goes to the MCP client (in the host app), which then analyzes the query’s intent and requirements. Based on this analysis, the client decides which context provider (MCP server) can best handle the request. For instance, if the query is “What are the steps to reset my email password?” the client might route this to a documentation or knowledge base server. The query “Schedule a meeting next Monday” might route to a calendar API server. The client essentially performs a tool selection or routing step.

Once the appropriate server(s) are identified, the client sends the request to the MCP server in a standardized format (e.g., a JSON-RPC call defined by the MCP spec). The server then processes the request – this could involve running a search in an index (for a knowledge query), calling an external API, or performing some computation. For a data retrieval scenario, the server would execute a search or lookup on its indexed data. For example, it might take the query, run a semantic similarity search across document embeddings, and find the top matching chunks. The retrieved results (or action outputs) are then returned from the server to the client, which passes them to the AI model.

In many cases, the client might wrap the results into the prompt given to the LLM. This entire resolution cycle happens quickly and transparently. To the user, the AI assistant simply responds with an answer or action outcome; behind the scenes, the assistant may have consulted one or several external sources to get there. According to one description, the MCP client “selects the appropriate tools via the MCP server, and invokes external APIs to retrieve and process the required information before notifying the user of the results”. The architecture ensures that the communication is structured and secure at each step; the AI can only use the tools it’s allowed to and only in the ways the protocol permits.

A practical consideration in query resolution is that you typically only connect relevant providers for the task. An AI could have dozens of MCP servers available, but giving the model access to all of them simultaneously might be counterproductive. The best practice is to enable a subset of tools based on context or user scope to avoid confusing the model with too many choices. For instance, an AI agent in a coding IDE might load servers for Git and documentation but not the CRM or Calendar servers. This way, query resolution involves picking among a manageable set of options and reduces the chance of the model calling the wrong tool.
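
The routing step can be as simple as matching a query against each enabled server’s advertised capabilities. The following sketch uses hypothetical server names and keyword lists; production clients typically let the LLM itself choose a tool from structured tool descriptions.

# A hypothetical keyword-based router; real MCP clients typically let the LLM
# itself pick a tool from the servers' advertised capabilities.
ROUTES = {
    "docs-server":     ("password", "policy", "how do i", "steps"),
    "calendar-server": ("schedule", "meeting", "calendar", "monday"),
    "email-server":    ("send email", "inbox", "reply"),
}

def route(query: str, enabled: set[str]) -> str | None:
    """Pick the first enabled server whose keywords match the query."""
    q = query.lower()
    for server, keywords in ROUTES.items():
        if server in enabled and any(kw in q for kw in keywords):
            return server
    return None  # fall back to answering from the model's own knowledge

# Only connect the providers relevant to this user's context.
enabled = {"docs-server", "calendar-server"}
print(route("What are the steps to reset my email password?", enabled))  # docs-server
print(route("Schedule a meeting next Monday", enabled))                  # calendar-server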

Context Delivery to the Assistant

After a provider fetches the relevant context, it needs to be delivered back to the AI model in a useful form. In an MCP setup, the server’s response is typically structured (e.g., containing the data or an answer). The MCP client then integrates that into the AI’s prompt or state. In a retrieval scenario, this often means attaching the retrieved text as additional context for the LLM to consider when generating its answer. For example, the client might prepend the model’s prompt with something like “Reference Document: [excerpt]…” before the actual question or use a special format the model is trained to understand (such as a system message with the context). The AI’s response is “enriched” with external knowledge; it can quote specifics from the provided text or base its reasoning on it. If multiple context pieces are returned, the client could concatenate them or present them in a list. The LLM will then see all those pieces and the user query and attempt to synthesize an answer. This dynamic injection of context means the AI can output information it didn’t originally know, effectively extending its knowledge at runtime. For the user, it feels like the assistant “knows” about internal documents or the latest news, when in reality, it is reading from the supplied context.

It’s important to highlight that context delivery in MCP is not limited to static text. While the focus here is on retrieval, MCP can also deliver the results of actions. For instance, if the user asks the AI to perform a calculation or send an email (and the MCP server for email executes that), the response delivered might be a confirmation or data about that action. In the case of retrieval (read-only context), the delivered content is analogous to what RAG provides: relevant documents for the model to read. However, MCP can go further; it supports active outputs. One source explains that RAG is read-only, whereas MCP enables the AI to “do things” and deliver the outcome. For example, an MCP server could return, say, “Email sent to John at 5 pm” as a result. In all cases, the final step is for the AI assistant to present the information or outcome to the end user in natural language. The user doesn’t see the raw context chunks or API calls; they just get the answer or confirmation, with the heavy lifting done via MCP behind the scenes.


In conclusion, the Model Context Protocol (MCP) advances the integration of AI assistants with diverse external data sources. MCP enables AI systems to dynamically leverage up-to-date, relevant information and seamlessly perform context-aware interactions by standardizing context retrieval, indexing, and delivery. This approach enriches the functionality and accuracy of AI assistants and simplifies development by establishing a universal framework, eliminating redundancy, and enhancing security.

Sources

https://www.anthropic.com/news/model-context-protocol 

https://docs.anthropic.com/en/docs/agents-and-tools/mcp 

https://arxiv.org/pdf/2503.23278v1

The post Introduction to MCP: The Ultimate Guide to Model Context Protocol for AI Assistants appeared first on MarkTechPost.

How AWS Sales uses generative AI to streamline account planning

Every year, AWS Sales personnel draft in-depth, forward-looking strategy documents for established AWS customers. These documents help the AWS Sales team to align with our customer growth strategy and to collaborate with the entire sales team on long-term growth ideas for AWS customers. These documents are internally called account plans (APs). In 2024, this activity took an account manager (AM) up to 40 hours per customer. This, combined with similar time spent by support roles researching and writing the growth plans for customers on the AWS Cloud, led to significant organizational overhead. To help improve this process, in October 2024 we launched an AI-powered account planning draft assistant for our sales teams, building on the success of Field Advisor, an internal sales assistant tool. This new capability uses Amazon Bedrock to help our sales teams create comprehensive and insightful APs in less time. Since its launch, thousands of sales teams have used the resulting generative AI-powered assistant to draft sections of their APs, saving time on each AP created.
In this post, we showcase how the AWS Sales product team built the generative AI account plans draft assistant.
Business use cases
The account plans draft assistant serves four primary use cases:

Account plan draft generation: Using Amazon Bedrock, we’ve made internal and external data sources available to generate draft content for key sections of the APs. This enables our sales teams to quickly create initial drafts for sections such as customer overviews, industry analysis, and business priorities, which previously required hours of research across the internet and relied on disparate internal AWS tools.
Data synthesis: The assistant can pull relevant information from multiple sources, including our customer relationship management (CRM) system, financial reports, news articles, and previous APs, to provide a holistic view of our customers.
Quality checks: Built-in quality assurance capabilities help ensure that APs meet internal standards for comprehensiveness, accuracy, and strategic alignment with our customers and business.
Customization: While providing AI-generated drafts, the product allows AMs to customize and refine the content by uploading proprietary documents to match their unique customer knowledge and strategic approach.

The account plan draft assistant loads when a user starts creating an AP, and users copy and paste each generated section they want to use into their final plan.

Our AMs report reduced time to write these documents, allowing them to focus more on high-value activities such as customer engagement and strategy development.
Here’s what some of our AMs had to say about their experience with the account plans draft assistant:

“The AI assistant saved me at least 15 hours on my latest enterprise account plan. It pulled together a great first draft, which I was then able to refine based on my own insights. This allowed me to spend more time actually engaging with my customer rather than doing research and writing.”
– Enterprise Account Manager
“As someone managing multiple mid-market accounts, I struggled to create in-depth plans for all my customers. The AI assistant now helps me rapidly generate baseline plans that I can then prioritize and customize. It’s a game-changer for serving my full portfolio of accounts.”
– Mid-market Account Manager

Amazon Q, Amazon Bedrock, and other AWS services underpin this experience, enabling us to use large language models (LLMs) and knowledge bases (KBs) to generate relevant, data-driven content for APs. Let’s explore how we built this AI assistant and some of our future plans.
Building the account plans draft assistant
When a user of the AWS internal CRM system initiates the workflow in Field Advisor, it triggers the account plan draft assistant capability through a pre-signed URL. The assistant then orchestrates a multi-source data collection process, performing web searches while also pulling account metadata from OpenSearch, Amazon DynamoDB, and Amazon Simple Storage Service (Amazon S3) storage. After analyzing and combining this data with user-uploaded documents, the assistant uses Amazon Bedrock to generate the AP. When complete, a notification chain using Amazon Simple Queue Service (Amazon SQS) and our internal notifications service API gateway begins delivering updates using Slack direct messaging and storing searchable records in OpenSearch for future reference.
The following diagram illustrates the high-level architecture of the account plans draft assistant.

Solution overview
We built the account plans draft assistant using the following key components:

Amazon Bedrock: Provides programmatic (API) access to high-performing foundation models (FMs), along with vector search capabilities and metadata filtering through Amazon Bedrock Knowledge Bases. We populate Amazon Bedrock knowledge bases with sales-enablement materials, historic APs, and other relevant documents curated by AWS Glue jobs (described later in this list).
AWS Lambda: Supports two use cases:

The async resolver Lambda function interfaces with the front-end CRM client and orchestrates async job IDs for the client to poll (a minimal sketch of this pattern follows the component list). This layer also handles input validation, user request throttling, and cache management.
Worker Lambda functions perform the actual heavy lifting to create AP content. These functions work concurrently to generate different sections of APs by using publicly available data, internal data, and curated data in Amazon Bedrock knowledge bases. These functions invoke various LLMs using Amazon Bedrock and store the final content, keyed by the async job ID, in the AP DynamoDB table.

DynamoDB: Maintains the state of each user request by tracking async job IDs, tracks throttling quota (global request count and per-user request count), and acts as a cache.
AWS Glue jobs: Curate and transform data from various internal and external data sources. These AWS Glue jobs push data to internal data sources (APs, internal tooling team S3 buckets, and other internal services) and to Bedrock KBs, facilitating high quality output through retrieval augmented generation (RAG).
Amazon SQS: Enables us to decouple the management plane and data plane. This decoupling is crucial in allowing the data plane worker functions to concurrently process different sections of the APs and make sure that we can generate APs within specified times.
Custom web frontend: A ReactJS based micro-frontend architecture enables us to integrate directly into our CRM system for a seamless user experience.
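To make the asynchronous job flow concrete, the following is a minimal sketch of the resolver pattern described above. The table name, queue URL, and section names are hypothetical, and the real service adds validation, throttling, and caching on top of this.

import json
import os
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

# Hypothetical resource names for illustration only.
JOBS_TABLE = os.environ.get("JOBS_TABLE", "ap-draft-jobs")
WORK_QUEUE = os.environ.get(
    "WORK_QUEUE_URL", "https://sqs.us-east-1.amazonaws.com/123456789012/ap-draft-work"
)

def handler(event, context):
    """Async resolver: register a job, enqueue work per AP section, return a job ID."""
    job_id = str(uuid.uuid4())
    sections = event.get("sections", ["customer_overview", "industry_analysis"])

    dynamodb.Table(JOBS_TABLE).put_item(
        Item={"job_id": job_id, "status": "PENDING", "sections": sections}
    )
    # One SQS message per section so worker Lambdas can generate them concurrently.
    for section in sections:
        sqs.send_message(
            QueueUrl=WORK_QUEUE,
            MessageBody=json.dumps({"job_id": job_id, "section": section,
                                    "account_id": event.get("account_id")}),
        )
    # The CRM front end polls a status endpoint with this job ID until drafts are ready.
    return {"job_id": job_id, "status": "PENDING"}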

Data management
Our account plans draft assistant uses an Amazon Bedrock out-of-the-box knowledge base management solution. Through its RAG architecture, we semantically search and use metadata filtering to retrieve relevant context from diverse sources: internal sales enablement materials, historic APs, SEC filings, news articles, executive engagements and data from our CRM systems. The connectors built into Amazon Bedrock handle data ingestion from Amazon S3, relational database management systems (RDBMS), and third-party APIs; while its KB capabilities enable us to filter and prioritize source documents when generating responses. This context-aware approach results in higher quality and more relevant content in our generated AP sections.
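As an illustration of knowledge base retrieval with metadata filtering, here is a minimal sketch using the Amazon Bedrock agent runtime retrieve API via boto3. The knowledge base ID and metadata keys are hypothetical and not taken from the actual Field Advisor implementation.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="KB1234567890",  # hypothetical knowledge base ID
    retrievalQuery={"text": "What are the customer's stated cloud priorities for 2025?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            # Restrict results to, say, historic APs (metadata key/value are hypothetical).
            "filter": {"equals": {"key": "doc_type", "value": "historic_ap"}},
        }
    },
)

for result in response["retrievalResults"]:
    print(result["content"]["text"][:200])
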
Security and compliance
Security and compliance are paramount to AWS when dealing with data regarding our customers. We use AWS IAM Identity Center for enterprise single sign-on so that only authorized users can access the account plans draft assistant. Through Field Advisor, various internal authorization mechanisms help ensure that a user generating APs can only access the data they already have permission to view.
User experience
We built a custom web frontend using a micro-frontend approach that integrates directly into our CRM system, allowing AMs to access the account plans draft assistant without leaving their familiar work environment. The interface allows users to select which sections of APs they want to generate, provides options for customization, and notifies users to create their APs on time through Slack.
Looking ahead
While the account plans draft assistant has already demonstrated significant value, we’re continuing to enhance its capabilities. Our goal is to create a zero-touch account planner that sales teams can use to generate a full AP for a customer, incorporating best practices observed across our customers to provide sales teams with best-in-class strategies for engaging with customers. This would include:

Deeper integration with our purpose-built planning tools and assistance with account planning, such as automatically generating value maps and stakeholder maps.
Enhanced personalization to tailor content based on industry, account size, and individual user preferences.
Improved collaboration features, so that multiple sales team members can work together on refining AI-generated plans.
Expanded use of recommendations to provide “what’s next?” ideas to our sales teams so they can better serve our customers.

Conclusion
The account plans draft assistant, powered by Amazon Bedrock, has significantly streamlined our AP process, allowing our AWS Sales teams to create higher quality APs in a fraction of the time they previously needed. As we continue to refine and expand this capability, we’re excited to see how it will further enhance our ability to serve our customers and drive their success in the AWS Cloud.
If you’re interested in learning how generative AI can transform your sales function and its processes, reach out to your AWS account team to discuss how services such as Amazon Q and Amazon Bedrock can help you build similar solutions for your organization.

About the Authors
Saksham Kakar is a Sr. Product Manager (Technical) in the AWS Field Experiences (AFX) organization focused on developing products that enable AWS Sales teams to help AWS customers grow with Amazon. Prior to this, Saksham led large sales, strategy and operations teams across startups and Fortune 500 companies. Outside of work, he is an avid tennis player and amateur skier.
Vimanyu Aggarwal is a Senior Software Engineer in AWS Field Experiences (AFX) organization with over 10 years of industry experience. Over the last decade, Vimanyu has been focusing on building large-scale, complex distributed systems at various Fortune 500 organizations. Currently, he works with multiple teams within the AFX organization to deliver technical solutions that empower the $100 billion sales funnel. Outside of work, he likes to play board games, tinker with IoT, and explore nature.
Krishnachand Velaga is a Senior Manager for Product Management – Technical (PM-T) in the AWS Field Experiences (AFX) organization who manages a team of seasoned PM-Ts and a suite of sales products, using generative AI to enable the AWS Sales organization help AWS customers across the globe adopt, migrate and grow on the AWS Cloud in line with their business needs and outcomes while bolstering sales efficiency and productivity and reducing operational cost.
Scott Wilkinson is a Software Development Manager in the AWS Field Experiences (AFX) organization, where he leads a cross-functional engineering team developing tools that aggregate and productize data to power AWS customer insights. Prior to AWS, Scott worked for notable startups including Digg, eHarmony, and Nasty Gal in both leadership and software development roles. Outside of work, Scott is a musician (guitar and piano) and loves to cook French cuisine.

Shaping the future: OMRON’s data-driven journey with AWS

This post is co-written with Emrah Kaya and Xinyi Zhou from Omron Europe.
Data is one of the most critical assets of many organizations. They’re constantly seeking ways to use their vast amounts of information to gain competitive advantages.
OMRON Corporation is a leading technology provider in industrial automation, healthcare, and electronic components. In their Shaping the Future 2030 (SF2030) strategic plan, OMRON aims to address diverse social issues, drive sustainable business growth, transform business models and capabilities, and accelerate digital transformation. At the heart of this transformation is the OMRON Data & Analytics Platform (ODAP), an innovative initiative designed to revolutionize how the company harnesses its data assets.
This post explores how OMRON Europe is using Amazon Web Services (AWS) to build its advanced ODAP and its progress toward harnessing the power of generative AI.
Challenges
By using advanced data and analytics capabilities, organizations can gain valuable insights into their operations, industry trends, and customer behaviors, leading to more informed strategies and decisions. This approach is particularly powerful when applied to mission-critical data such as enterprise resource planning (ERP) and customer relationship management (CRM) systems because these contain information about internal processes, supply chain management, and customer interactions. By analyzing their data, organizations can identify patterns in sales cycles, optimize inventory management, or help tailor products or services to meet customer needs more effectively. However, organizations often face significant challenges in realizing these benefits because of:

Data silos – Organizations often use multiple systems across regions or departments. Integrating these diverse sources to create a single source of truth is complex, making it difficult to generate unified reports or analyze cross-functional trends.
Data governance challenges – Maintaining consistent data governance across different systems is crucial but complex. Implementing uniform policies across different systems and departments presents significant hurdles.
Different formats and standards – Systems typically use varied data formats and structures. This disparity complicates data integration and cross-system analysis, requiring significant effort to reconcile and harmonize data for comprehensive insights.

OMRON Data & Analytics Platform
To address these challenges, OMRON Europe (hereinafter “OMRON”) decided to implement an advanced data and analytics platform, ODAP. This innovative solution was designed to serve as a centralized hub for specific data assets, breaking down the barriers between various data sources and systems.
The following diagram shows a simplified architecture and some of the services and architectural patterns used for ODAP.

ODAP aimed to seamlessly integrate data from multiple ERP and CRM systems in addition to other relevant data sources across the organization. Amazon AppFlow was used to facilitate the smooth and secure transfer of data from various sources into ODAP. Additionally, Amazon Simple Storage Service (Amazon S3) served as the central data lake, providing a scalable and cost-effective storage solution for the diverse data types collected from different systems. The robust security features provided by Amazon S3, including encryption and durability, were used to provide data protection. Finally, ODAP was designed to incorporate cutting-edge analytics tools and future AI-powered insights.
Some of these tools included AWS Cloud based solutions, such as AWS Lambda and AWS Step Functions. Lambda enables serverless, event-driven data processing tasks, allowing for real-time transformations and calculations as data arrives. Step Functions complements this by orchestrating complex workflows, coordinating multiple Lambda functions, and managing error handling for sophisticated data processing pipelines. This enables OMRON to extract meaningful patterns and trends from its vast data repositories, supporting more informed decision-making at all levels of the organization.
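The following is a minimal sketch of the event-driven pattern just described: a Lambda function reacting to an S3 object-created event, transforming the payload, and writing it to a curated zone. The bucket name and the currency-normalization step are hypothetical examples, not ODAP’s actual logic.

import csv
import io
import json

import boto3

s3 = boto3.client("s3")
CURATED_BUCKET = "odap-curated-example"  # hypothetical bucket name

def handler(event, context):
    """Triggered by S3 object-created events: read a raw CSV drop, apply a
    simple transformation, and write the result to the curated zone."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))

        # Example transformation: normalize a currency field as data arrives
        # (the rate and field names are illustrative only).
        for row in rows:
            row["amount_eur"] = round(float(row.get("amount", 0)) * 0.92, 2)

        s3.put_object(
            Bucket=CURATED_BUCKET,
            Key=f"curated/{key}.json",
            Body=json.dumps(rows).encode("utf-8"),
        )
    return {"processed": len(event["Records"])}
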
OMRON’s data strategy—represented on ODAP—also allowed the organization to unlock generative AI use cases focused on tangible business outcomes and enhanced productivity. Part of a comprehensive approach to using artificial intelligence and machine learning (AI/ML) and generative AI includes a strong data strategy that can help provide high quality and reliable data.
Embracing generative AI with Amazon Bedrock
The company has identified several use cases where generative AI can significantly impact operations, particularly in analytics and business intelligence (BI).
One key initiative is ODAPChat, an AI-powered chat-based assistant employees can use to interact with data using natural language queries. This tool democratizes data access across the organization, enabling even nontechnical users to gain valuable insights.
A standout application is the SQL-to-natural language capability, which translates complex SQL queries into plain English and vice versa, bridging the gap between technical and business teams. To power these advanced AI features, OMRON chose Amazon Bedrock. This fully managed service offers a range of foundation models (FMs), providing the flexibility to select the most suitable model for each use case. The straightforward implementation of Amazon Bedrock, coupled with its scalability to handle growing data volumes and user requests, made it an ideal choice for OMRON. The ability of Amazon Bedrock to support various models from different providers helps make sure that OMRON can always use the most advanced AI capabilities as they evolve.
Crucially, the robust security features provided by Amazon Bedrock align perfectly with OMRON’s stringent data protection requirements. Some highlights include:

Fine-grained access controls
Networking security features such as encryption of data in transit and at rest, or the ability to use private virtual private clouds (VPCs), helping to make sure that sensitive business data remains secure even when being processed by AI models
Amazon Bedrock Guardrails

These strict security controls offer a comprehensive security approach that allows OMRON to innovate with AI while maintaining the highest standards of data governance and protection.
The following diagram shows a basic layout of how the solution works. It helps illustrate the main parts and how they work together to make the AI assistant do its job.

The system has three main sections:

User interface – Users engage with the chat interface hosted on AWS. Amazon Cognito handles the user authentication processes, providing secure access to the application.
Input processing backend – The Amazon API Gateway receives incoming messages, which are then processed by containers running on Amazon Elastic Container Service (Amazon ECS). Chat conversations are preserved in Amazon DynamoDB to be used for the follow-up conversation. Amazon Bedrock takes care of generating AI responses, and tools are configured using LangChain, which helps determine how to handle different types of queries. When needed, the system can access an ODAP data warehouse to retrieve additional information.
Document management – Documents are securely stored in Amazon S3, and when new documents are added, a Lambda function processes them into chunks. These chunks are converted into embeddings using Amazon Bedrock and the embeddings are stored in an Amazon OpenSearch Service vector store for semantic search.
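To make the document-management step concrete, the following is a minimal sketch of chunking a document, embedding each chunk with Amazon Bedrock, and indexing the result in OpenSearch. The embeddings model ID, endpoint, and index name are assumptions for illustration; authentication and error handling are omitted.

import json

import boto3
from opensearchpy import OpenSearch

# Model ID, endpoint, and index name are assumptions for illustration only.
bedrock = boto3.client("bedrock-runtime")
opensearch = OpenSearch(hosts=[{"host": "my-odap-vectors.example.com", "port": 443}],
                        use_ssl=True)

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def embed(text: str) -> list[float]:
    """Call a Bedrock embeddings model (Titan Text Embeddings V2 assumed)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def ingest(doc_id: str, text: str) -> None:
    """Index each chunk with its embedding for semantic (k-NN) search."""
    for i, piece in enumerate(chunk(text)):
        opensearch.index(
            index="odap-doc-chunks",
            id=f"{doc_id}-{i}",
            body={"doc_id": doc_id, "text": piece, "embedding": embed(piece)},
        )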

Results and future plans
The implementation of ODAP and ODAPChat on AWS has already yielded significant benefits for OMRON:

Optimization of reports, leading to more efficient and insightful analysis
SQL-to-natural language capabilities powered by generative AI, making data more accessible to nontechnical users
Increased business agility with infrastructure fully deployed in the cloud
Data democratization, enabling more employees to use data-driven insights

Looking ahead, OMRON plans to significantly expand its use of AWS services and further use generative AI capabilities. The company aims to integrate additional data sources, including other mission-critical systems, into ODAP. This expansion will be coupled with enhanced data governance measures to help promote data quality and compliance across the growing data solution.
OMRON is also exploring more advanced generative AI use cases. These initiatives will use the evolving capabilities provided by Amazon Bedrock to potentially incorporate advanced AI models and security features.
Conclusion
OMRON’s journey with AWS demonstrates the transformative power of cloud-based data solutions and generative AI in overcoming data silos and driving business innovation. By using AWS services such as Amazon AppFlow, Amazon S3, and Amazon Bedrock, OMRON has created a comprehensive, secure, and adaptable data and analytics platform that not only meets its current needs, but also positions the company for future growth and innovation.
As organizations across industries grapple with similar data challenges, OMRON’s story serves as an inspiring example of how embracing cloud technologies and AI can lead to significant business transformation and competitive advantage.

About the Authors
Emrah Kaya is Data Engineering Manager at Omron Europe and Platform Lead for the ODAP Project. With his extensive background in cloud and data architecture, Emrah leads key OMRON technological advancement initiatives, including artificial intelligence, machine learning, and data science.
Xinyi Zhou is a Data Engineer at Omron Europe, bringing her expertise to the ODAP team led by Emrah Kaya. She specializes in building efficient data pipelines and managing AWS infrastructure, while actively contributing to the implementation of new solutions that advance ODAP’s technological capabilities.
Emel Mendoza is a Senior Solutions Architect at AWS based in the Netherlands. With passion for cloud migrations and application modernization, Emel helps organizations navigate their digital transformation journeys on AWS. Emel leverages his decade of experience to guide customers in adopting AWS services and architecting scalable, efficient solutions.
Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for Generative AI to help customers and partners build GenAI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture and ML applications.

AI Workforce: using AI and Drones to simplify infrastructure inspectio …

Inspecting wind turbines, power lines, 5G towers, and pipelines is a tough job. It’s often dangerous, time-consuming, and prone to human error. That’s why we at Amazon Web Services (AWS) are working on AI Workforce—a system that uses drones and AI to make these inspections safer, faster, and more accurate.
This post is the first in a three-part series exploring AI Workforce, the AWS AI-powered drone inspection system. In this post, we introduce the concept and key benefits. The second post dives into the AWS architecture that powers AI Workforce, and the third focuses on the drone setup and integration.
In the following sections, we explain how AI Workforce enables asset owners, maintenance teams, and operations managers in industries such as energy and telecommunications to enhance safety, reduce costs, and improve efficiency in infrastructure inspections.
Challenges with traditional inspections
Inspecting infrastructure using traditional methods is a challenge. You need trained people and specialized equipment, and you often must shut things down during inspection. As an example, climbing a wind turbine in bad weather for an inspection can be dangerous. Plus, even the best human inspector can miss things. This can lead to bigger problems down the line, costing time and money.

How AI Workforce helps
AI Workforce is designed to change all that. We use autonomous drones equipped with advanced sensors and AI to do the inspections. This brings the following benefits:

Less risk for people – Drones do the dangerous work so people don’t have to. This makes inspections much safer.
Faster and more efficient – Drones can cover a lot of ground quickly, getting the job done faster.
Better data – Automated data collection and analysis means fewer mistakes and more consistent results. This allows for proactive maintenance.

What does AI Workforce look like in action? Users interact with a simple AI assistant and dashboard that displays near real-time drone inspections, detected issues, and AI-generated insights. The following figure shows an example of the user dashboard and drone conversation.

The following figure is an example of drone 4K footage.

Solution overview
AI Workforce is built on a robust and scalable architecture using a wide array of AWS services. Security is paramount, and we adhere to AWS best practices across the layers. This includes:

Amazon API Gateway manages secure communication between various components, enforcing authentication and authorization
AWS Identity and Access Management (IAM) roles and policies verify least privilege access, limiting each component’s permissions to only what is necessary
Network security is implemented through virtual private clouds (VPCs), security groups, and network access control lists (ACLs), isolating the system and protecting it from unauthorized access
For video processing, we employ secure transfer protocols and encryption at rest and in transit

AI Workforce provides a robust API for managing drone operations, including flight planning, telemetry data, and anomaly detection. The following diagram outlines how different components interact.

Imagine a system where drones autonomously inspect critical infrastructure, capturing high-resolution video, analyzing potential defects with AI, and seamlessly integrating findings into business workflows. The AI Workforce architecture brings this vision to life, using AWS services across four key pillars.
Control plane: Secure drone communication and operations
Our journey begins with automated drone flights. Each drone follows predefined routes, with flight waypoints, altitude, and speed configured through an AWS API, using coordinates stored in Amazon DynamoDB. Once airborne, AWS IoT Core enables secure, bidirectional communication—allowing drones to receive real-time commands (like “take-off”, “begin flight ID = xxx”, or “land”), adjust flight paths, and transmit telemetry data back to AWS. To maintain robust security, AWS Lambda responds to Internet of Things (IoT) events, enabling immediate actions based on drone data, while Amazon GuardDuty continuously monitors for anomalies or potential security threats, such as unusual API activity or unauthorized access attempts, helping protect the integrity of drone operations and promoting secure operations.
In AI Workforce, AWS IoT Core serves as the primary entry point for real-time drone communication, handling telemetry data, command and control messaging, and secure bidirectional communication with drones. API Gateway plays a complementary role by acting as the main entry point for external applications, dashboards, and enterprise integrations. It is responsible for managing RESTful API calls related to flight planning, retrieving inspection results, and interacting with backend services like Amazon Relational Database Service (Amazon RDS) and AWS Step Functions. While drones communicate directly with AWS IoT Core, user-facing applications and automation workflows rely on API Gateway to access structured data and trigger specific actions within the AI Workforce ecosystem.
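As a small illustration of the command-and-control path described above, the following sketch publishes a command to a drone’s topic through the AWS IoT data plane API via boto3. The topic structure and drone ID are hypothetical.

import json

import boto3

# Topic structure and drone/flight IDs are hypothetical; shown only to
# illustrate command-and-control messaging over AWS IoT Core.
iot = boto3.client("iot-data")

def send_drone_command(drone_id: str, command: str, flight_id: str | None = None) -> None:
    """Publish a command message to a drone's command topic."""
    payload = {"command": command}
    if flight_id is not None:
        payload["flight_id"] = flight_id
    iot.publish(
        topic=f"drones/{drone_id}/commands",
        qos=1,
        payload=json.dumps(payload).encode("utf-8"),
    )

send_drone_command("drone-042", "take-off")
send_drone_command("drone-042", "begin_flight", flight_id="xxx")
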
AI/ML and generative AI: Computer vision and intelligent insights
As drones capture video footage, raw data is processed through AI-powered models running on Amazon Elastic Compute Cloud (Amazon EC2) instances. These computer vision models detect anomalies, classify damage types, and extract actionable insights—whether it’s spotting cracks on wind turbines or identifying corrosion on pipelines. Amazon SageMaker AI is at the core of our machine learning (ML) pipeline, training and deploying models for object detection, anomaly detection, and predictive maintenance.
We are also pioneering generative AI with Amazon Bedrock, enhancing our system’s intelligence. With natural language interactions, asset owners can ask questions like “What were the most critical defects detected last week?” and Amazon Bedrock generates structured reports based on inspection findings. It even aids in synthetic training data generation, refining our ML models for improved accuracy.
Data layer: Storing and managing inspection data
Every inspection generates vast amounts of data—high-resolution images, videos, and sensor readings. This information is securely stored in Amazon Simple Storage Service (Amazon S3), promoting durability and ease of access. Amazon S3 encrypts data at rest by default using server-side encryption (SSE), providing an additional layer of security without requiring manual configuration. Meanwhile, structured metadata and processed results are housed in Amazon RDS, enabling fast queries and integration with enterprise applications. Together, these services create a resilient data foundation, supporting both real-time analysis and historical trend monitoring.
Analytics and business: Automated workflows and business intelligence
Insights don’t stop at data collection—Step Functions orchestrates workflows that trigger automated actions. For example, if an AI model detects a critical defect, Step Functions can initiate a maintenance request in SAP, notify engineers, and schedule repairs without human intervention.
For deeper analysis, Amazon QuickSight transforms raw inspection data into interactive dashboards, helping asset owners track infrastructure health, spot trends, and optimize maintenance strategies. With a clear visual representation of defects, decision-makers can act swiftly, minimizing downtime and maximizing operational efficiency.
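To illustrate how a detected defect can trigger business automation, here is a minimal sketch that starts a Step Functions execution from Python. The state machine ARN is hypothetical; the actual SAP integration and notifications would live inside the workflow definition.

import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine ARN; the workflow itself would create the SAP
# maintenance request, notify engineers, and schedule repairs.
MAINTENANCE_WORKFLOW_ARN = (
    "arn:aws:states:eu-west-1:123456789012:stateMachine:maintenance-request"
)

def on_defect_detected(asset_id: str, defect: dict) -> None:
    """Kick off the automated maintenance workflow when AI flags a critical defect."""
    if defect.get("severity") != "critical":
        return
    sfn.start_execution(
        stateMachineArn=MAINTENANCE_WORKFLOW_ARN,
        input=json.dumps({"asset_id": asset_id, "defect": defect}),
    )

on_defect_detected("wind-turbine-17", {"type": "blade_crack", "severity": "critical"})
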
The future of AI Workforce: Expanding drone capabilities
Beyond inspections, AI Workforce provides a robust Drone API, offering seamless integration for third-party applications. This API enables remote flight planning, telemetry monitoring, and anomaly detection—all within a scalable AWS environment.
With secure drone communication, powerful AI-driven insights, a robust data foundation, and business automation, AI Workforce is redefining infrastructure inspection, making it smarter, faster, and more efficient than ever before.
Benefits and impact on business operations
The deployment of AI Workforce delivers a wide range of tangible benefits for organizations managing critical infrastructure, particularly in the energy and telecom sectors (for example, automatically comparing multiple inspections over time to detect longitudinal changes and identify progressive failures for proactive maintenance):

Significant cost savings – By reducing the need for human labor, specialized equipment, and extensive logistical planning, AI Workforce can significantly lower inspection costs. Proactive maintenance based on early defect detection also prevents costly repairs and unplanned downtime.
Dramatically enhanced safety – Removing human personnel from hazardous environments drastically reduces the risk of accidents and injuries, creating a safer working environment.
Substantially improved efficiency – Automated drone inspections are significantly faster and more efficient than traditional methods, enabling more frequent inspections and faster turnaround times.
Data-driven decision-making – AI Workforce provides asset owners with comprehensive and accurate data, enabling them to make informed decisions about maintenance, repairs, and asset management.

Example AI Workforce use case in the industry sector
Picture an energy company responsible for maintaining a large wind farm. They deploy AI Workforce drones for regular inspections. The drones, autonomously navigating preprogrammed flight paths defined by coordinates stored in DynamoDB and controlled through REST API calls, are securely connected using AWS IoT Core.
During the flight, sensor data is processed at the edge and streamed to Amazon S3, with metadata stored in Amazon RDS. Computer vision algorithms analyze the video in real time. If an anomaly is detected, a Lambda function triggers a Step Functions workflow, which in turn interacts with their SAP system to generate a maintenance work order. Inspection data is aggregated and visualized in QuickSight dashboards, providing a comprehensive overview of the wind farm’s health.
SageMaker AI models analyze the data, predicting potential failures and informing proactive maintenance strategies. In the future, Amazon Bedrock might provide summarized reports and generate synthetic data to further enhance the system’s capabilities.
Conclusion
At AWS, we’re committed to driving innovation in AI-powered solutions for a wide range of industries. AI Workforce is a prime example of how we’re using cutting-edge technologies to transform how critical infrastructure is managed and maintained.
We’re building this workforce to help businesses operate more efficiently and safely. We’re open to collaborating with others who are interested in this space. If you’d like to learn more, feel free to reach out. We welcome the opportunity to discuss your specific needs and explore potential collaborations.

About the Author
Miguel Muñoz de Rivera González is the original designer and technical lead for the AI Workforce initiative at AWS, driving AI-powered drone solutions for safer, smarter, and cost-effective infrastructure inspections.

OpenAI Releases PaperBench: A Challenging Benchmark for Assessing AI …

The rapid progress in artificial intelligence (AI) and machine learning (ML) research underscores the importance of accurately evaluating AI agents’ capabilities in replicating complex, empirical research tasks traditionally performed by human researchers. Currently, systematic evaluation tools that precisely measure the ability of AI agents to autonomously reproduce ML research findings remain limited, posing challenges in fully understanding the potential and limitations of such systems.

OpenAI has introduced PaperBench, a benchmark designed to evaluate the competence of AI agents in autonomously replicating state-of-the-art machine learning research. PaperBench specifically measures whether AI systems can accurately interpret research papers, independently develop the necessary codebases, and execute experiments to replicate empirical outcomes. The benchmark comprises 20 papers selected from ICML 2024, covering areas including reinforcement learning, robustness, and probabilistic methods. Detailed rubrics, co-developed with original paper authors, specify 8,316 individually gradable tasks to facilitate precise evaluation of AI capabilities.

From a technical perspective, PaperBench requires AI agents to process provided research papers and supplementary clarifications to develop comprehensive code repositories from scratch. These repositories must include complete experimental setups and execution scripts, notably the reproduce.sh file. To ensure genuine independent replication, agents are prohibited from referencing or reusing code from the original authors’ repositories. Rubrics are structured hierarchically to detail explicit pass-fail criteria at various levels, allowing systematic and objective assessment. Evaluation is conducted using SimpleJudge, an automated large language model (LLM)-based judge, which simplifies the grading process. SimpleJudge achieved an F1 score of 0.83 on JudgeEval, an auxiliary evaluation dataset specifically designed to validate automated grading accuracy.
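
To illustrate how a hierarchical rubric can roll leaf-level pass/fail grades up into a single replication score, here is a small sketch in Python. The node names, weights, and aggregation shown are illustrative assumptions, not PaperBench’s exact rubric or weighting scheme.

from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    """One rubric node: either a leaf with a pass/fail grade or a weighted parent."""
    name: str
    weight: float = 1.0
    passed: bool | None = None          # set on leaves by the judge
    children: list[Node] = field(default_factory=list)

def score(node: Node) -> float:
    """Propagate leaf pass/fail grades up the tree as weighted averages."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total

# Hypothetical rubric fragment for one paper.
rubric = Node("paper", children=[
    Node("code_development", weight=0.4, children=[
        Node("data_loader_implemented", passed=True),
        Node("training_loop_implemented", passed=True),
    ]),
    Node("execution", weight=0.3, children=[Node("reproduce_sh_runs", passed=False)]),
    Node("result_match", weight=0.3, children=[Node("table_within_tolerance", passed=False)]),
])
print(f"Replication score: {score(rubric):.1%}")  # 40.0% for this toy tree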

Empirical evaluations of several advanced AI models indicate varying performance levels on PaperBench. Claude 3.5 Sonnet exhibited the highest capability with an average replication score of 21.0%. Other models such as OpenAI’s GPT-4o and Gemini 2.0 Flash attained significantly lower scores of 4.1% and 3.2%, respectively. Comparatively, expert human ML researchers achieved considerably higher accuracy, reaching up to 41.4% after 48 hours of dedicated effort. Analysis of model performance revealed strengths in initial rapid code generation and early experimental setup but highlighted substantial weaknesses in managing prolonged tasks, troubleshooting, and adapting strategic approaches over time.

These results provide critical technical insights into current AI system capabilities. While AI models demonstrate competence in certain coding tasks and initial experiment implementation, significant gaps persist, particularly regarding sustained task execution, adaptive problem-solving, and strategic planning. Additionally, the introduction of PaperBench Code-Dev, a streamlined variant emphasizing code correctness without experimental execution, offers a practical alternative for broader and resource-limited community use due to reduced computational and evaluation costs.

In summary, PaperBench represents an important step toward methodically evaluating AI research capabilities. It provides a structured and detailed assessment environment that highlights specific strengths and limitations of contemporary AI models relative to human performance. The collaborative development of rubrics ensures precise and realistic evaluations. OpenAI’s open-sourcing of PaperBench supports further exploration and development in the field, enhancing understanding of autonomous AI research capabilities and informing responsible progression in this area.

Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.

The post OpenAI Releases PaperBench: A Challenging Benchmark for Assessing AI Agents’ Abilities to Replicate Cutting-Edge Machine Learning Research appeared first on MarkTechPost.

Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distribu …

Foundation model (FM) training and inference have led to a significant increase in computational needs across the industry. These models require massive amounts of accelerated compute to train and operate effectively, pushing the boundaries of traditional computing infrastructure. They require efficient systems for distributing workloads across multiple GPU-accelerated servers and for optimizing developer velocity as well as performance.
Ray is an open source framework that makes it straightforward to create, deploy, and optimize distributed Python jobs. At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. It provides a set of high-level APIs for tasks, actors, and data that abstract away the complexities of distributed computing, enabling developers to focus on the core logic of their applications. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application. Ray’s key features include efficient task scheduling, fault tolerance, and automatic resource management, making it a powerful tool for building a wide range of distributed applications, from ML models to real-time data processing pipelines. With its growing ecosystem of libraries and tools, Ray has become a popular choice for organizations looking to use the power of distributed computing to tackle complex and data-intensive problems.
Amazon SageMaker HyperPod is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of Ray provides a powerful framework to scale up your generative AI workloads.
In this post, we demonstrate the steps involved in running Ray jobs on SageMaker HyperPod.
Overview of Ray
This section provides a high-level overview of the Ray tools and frameworks for AI/ML workloads. We primarily focus on ML training use cases.
Ray is an open-source distributed computing framework designed to run highly scalable and parallel Python applications. Ray manages, executes, and optimizes compute needs across AI workloads. It unifies infrastructure through a single, flexible framework—enabling AI workloads from data processing, to model training, to model serving and beyond.
For distributed jobs, Ray provides intuitive tools for parallelizing and scaling ML workflows. It allows developers to focus on their training logic without the complexities of resource allocation, task scheduling, and inter-node communication.
At a high level, Ray is made up of three layers:

Ray Core: The foundation of Ray, providing primitives for parallel and distributed computing
Ray AI libraries:

Ray Train – A library that simplifies distributed training by offering built-in support for popular ML frameworks like PyTorch, TensorFlow, and Hugging Face
Ray Tune – A library for scalable hyperparameter tuning
Ray Serve – A library for distributed model deployment and serving

Ray clusters: A distributed computing platform where worker nodes run user code as Ray tasks and actors, generally in the cloud

In this post, we dive deep into running Ray clusters on SageMaker HyperPod. A Ray cluster consists of a single head node and a number of connected worker nodes. The head node orchestrates task scheduling, resource allocation, and communication between nodes. The Ray worker nodes execute the distributed workloads, such as model training or data preprocessing, using Ray tasks and actors.
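As a quick illustration of the programming model, the following minimal Ray sketch defines a task and an actor and runs them on whatever workers the cluster provides; the same code runs unchanged on a laptop or on a multi-node cluster.

import ray

ray.init()  # on a cluster, this connects to the head node (e.g., ray.init(address="auto"))

@ray.remote
def preprocess(shard: list) -> int:
    """A Ray task: runs on any worker with free resources."""
    return sum(x * x for x in shard)

@ray.remote
class Counter:
    """A Ray actor: a stateful worker process, e.g., for tracking training progress."""
    def __init__(self):
        self.n = 0
    def add(self, value: int) -> int:
        self.n += value
        return self.n

futures = [preprocess.remote(list(range(i, i + 100))) for i in range(0, 400, 100)]
print(ray.get(futures))                # results gathered from the workers

counter = Counter.remote()
ray.get(counter.add.remote(1))
print(ray.get(counter.add.remote(2)))  # 3
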
Ray clusters and Kubernetes clusters pair well together. By running a Ray cluster on Kubernetes using the KubeRay operator, both Ray users and Kubernetes administrators benefit from the smooth path from development to production. For this use case, we use a SageMaker HyperPod cluster orchestrated through Amazon Elastic Kubernetes Service (Amazon EKS).
The KubeRay operator enables you to run a Ray cluster on a Kubernetes cluster. KubeRay creates the following custom resource definitions (CRDs):

RayCluster – The primary resource for managing Ray instances on Kubernetes. The nodes in a Ray cluster manifest as pods in the Kubernetes cluster.
RayJob – A single executable job designed to run on an ephemeral Ray cluster. It serves as a higher-level abstraction for submitting tasks or batches of tasks to be executed by the Ray cluster. A RayJob also manages the lifecycle of the Ray cluster, making it ephemeral by automatically spinning up the cluster when the job is submitted and shutting it down when the job is complete.
RayService – Combines a Ray cluster and a Ray Serve application that runs on top of it into a single Kubernetes manifest. It allows for the deployment of Ray applications that need to be exposed for external communication, typically through a service endpoint.

For the remainder of this post, we don’t focus on RayJob or RayService; we focus on creating a persistent Ray cluster to run distributed ML training jobs.
When Ray clusters are paired with SageMaker HyperPod clusters, Ray clusters unlock enhanced resiliency and auto-resume capabilities, which we will dive deeper into later in this post. This combination provides a solution for handling dynamic workloads, maintaining high availability, and providing seamless recovery from node failures, which is crucial for long-running jobs.
Overview of SageMaker HyperPod
In this section, we introduce SageMaker HyperPod and its built-in resiliency features to provide infrastructure stability.
Generative AI workloads such as training, inference, and fine-tuning involve building, maintaining, and optimizing large clusters of thousands of GPU accelerated instances. For distributed training, the goal is to efficiently parallelize workloads across these instances in order to maximize cluster utilization and minimize time to train. For large-scale inference, it’s important to minimize latency, maximize throughput, and seamlessly scale across those instances for the best user experience. SageMaker HyperPod is a purpose-built infrastructure to address these needs. It removes the undifferentiated heavy lifting involved in building, maintaining, and optimizing a large GPU accelerated cluster. It also provides flexibility to fully customize your training or inference environment and compose your own software stack. You can use either Slurm or Amazon EKS for orchestration with SageMaker HyperPod.
Due to their massive size and the need to train on large amounts of data, FMs are often trained and deployed on large compute clusters composed of thousands of AI accelerators such as GPUs and AWS Trainium. A single failure in one of these thousand accelerators can interrupt the entire training process, requiring manual intervention to identify, isolate, debug, repair, and recover the faulty node in the cluster. This workflow can take several hours for each failure and as the scale of the cluster grows, it’s common to see a failure every few days or even every few hours. SageMaker HyperPod provides resiliency against infrastructure failures by applying agents that continuously run health checks on cluster instances, fix the bad instances, reload the last valid checkpoint, and resume the training—without user intervention. As a result, you can train your models up to 40% faster. You can also SSH into an instance in the cluster for debugging and gather insights on hardware-level optimization during multi-node training. Orchestrators like Slurm or Amazon EKS facilitate efficient allocation and management of resources, provide optimal job scheduling, monitor resource utilization, and automate fault tolerance.
Solution overview
This section provides an overview of how to run Ray jobs for multi-node distributed training on SageMaker HyperPod. We go over the architecture and the process of creating a SageMaker HyperPod cluster, installing the KubeRay operator, and deploying a Ray training job.
Although this post provides a step-by-step guide to manually create the cluster, feel free to check out the aws-do-ray project, which aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. It uses Docker to containerize the tools necessary to deploy and manage Ray clusters, jobs, and services. In addition to the aws-do-ray project, we’d like to highlight the Amazon SageMaker HyperPod EKS workshop, which offers an end-to-end experience for running various workloads on SageMaker HyperPod clusters. There are multiple examples of training and inference workloads in the GitHub repository awsome-distributed-training.
As introduced earlier in this post, KubeRay simplifies the deployment and management of Ray applications on Kubernetes. The following diagram illustrates the solution architecture.

Create a SageMaker HyperPod cluster
Prerequisites
Before deploying Ray on SageMaker HyperPod, you need a HyperPod cluster:

Use the one-click AWS CloudFormation deployment to set up your SageMaker HyperPod cluster on Amazon EKS; the template comes from the Amazon EKS Support in SageMaker HyperPod workshop. After deploying this stack, you can proceed directly to the section Create an FSx for Lustre shared file system.

If you prefer to deploy SageMaker HyperPod on an existing EKS cluster, follow the instructions here, which include:

EKS cluster – You can associate SageMaker HyperPod compute to an existing EKS cluster that satisfies the set of prerequisites. Alternatively and recommended, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the GitHub repo for instructions on setting up an EKS cluster.
Custom resources – Running multi-node distributed training requires various resources, such as device plugins, Container Storage Interface (CSI) drivers, and training operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to Install packages on the Amazon EKS cluster using Helm for installation instructions.

The following provides an example workflow for creating a SageMaker HyperPod cluster on an existing EKS cluster after deploying the prerequisites. This is for reference only and isn’t required for the quick deploy option.

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        {
            "InstanceGroupName": "head-group",
            "InstanceType": "ml.m5.2xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "${SECURITY_GROUP_ID}"
        ],
        "Subnets": [
            "${SUBNET_ID}"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

The provided configuration file contains two key highlights:

"OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs SageMaker HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
"NodeRecovery": "Automatic" – Enables SageMaker HyperPod automated node recovery

You can create a SageMaker HyperPod compute with the following AWS Command Line Interface (AWS CLI) command (AWS CLI version 2.17.47 or newer is required):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

---------------------------------------------------------------------------------------------------------------------------------------
|                                                             ListClusters                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------+
||                                                          ClusterSummaries                                                          ||
|+-----------------------------------------------------------------+---------------+----------------+------------------------------------+|
||                            ClusterArn                           |  ClusterName  | ClusterStatus  |            CreationTime            ||
|+-----------------------------------------------------------------+---------------+----------------+------------------------------------+|
||  arn:aws:sagemaker:us-west-2:xxxxxxxxxxxx:cluster/zsmyi57puczf   |  ml-cluster   |  InService     |  2025-03-03T16:45:05.320000+00:00  ||
|+-----------------------------------------------------------------+---------------+----------------+------------------------------------+|

Alternatively, you can verify the cluster status on the SageMaker console. After a brief period, you can observe that the status for the nodes transitions to Running.
Create an FSx for Lustre shared file system
For us to deploy the Ray cluster, we need the SageMaker HyperPod cluster to be up and running, and additionally we need a shared storage volume (for example, an Amazon FSx for Lustre file system). This is a shared file system that the SageMaker HyperPod nodes can access. This file system can be provisioned statically before launching your SageMaker HyperPod cluster or dynamically afterwards.
Specifying a shared storage location (such as cloud storage or NFS) is optional for single-node clusters, but it is required for multi-node clusters. Using a local path will raise an error during checkpointing for multi-node clusters.
The Amazon FSx for Lustre CSI driver uses IAM roles for service accounts (IRSA) to authenticate AWS API calls. To use IRSA, an IAM OpenID Connect (OIDC) provider needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.
Create an IAM OIDC identity provider for your cluster with the following command:

eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --approve

Deploy the FSx for Lustre CSI driver:

helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm repo update
helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
    --namespace kube-system

This Helm chart includes a service account named fsx-csi-controller-sa that gets deployed in the kube-system namespace.
Use the eksctl CLI to create an AWS Identity and Access Management (IAM) role bound to the service account used by the driver, attaching the AmazonFSxFullAccess AWS managed policy:

eksctl create iamserviceaccount \
    --name fsx-csi-controller-sa \
    --override-existing-serviceaccounts \
    --namespace kube-system \
    --cluster $EKS_CLUSTER_NAME \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
    --approve \
    --role-name AmazonEKSFSxLustreCSIDriverFullAccess \
    --region $AWS_REGION

The --override-existing-serviceaccounts flag lets eksctl know that the fsx-csi-controller-sa service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the current service account instead.
Annotate the driver’s service account with the Amazon Resource Name (ARN) of the AmazonEKSFSxLustreCSIDriverFullAccess IAM role that was created:

SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)

kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
    eks.amazonaws.com/role-arn=${SA_ROLE_ARN} --overwrite=true

This annotation lets the driver know what IAM role it should use to interact with the FSx for Lustre service on your behalf.
Verify that the service account has been properly annotated:

kubectl get serviceaccount -n kube-system fsx-csi-controller-sa -o yaml

Restart the fsx-csi-controller deployment for the changes to take effect:

kubectl rollout restart deployment fsx-csi-controller -n kube-system

The FSx for Lustre CSI driver presents you with two options for provisioning a file system:

Dynamic provisioning – This option uses Persistent Volume Claims (PVCs) in Kubernetes. You define a PVC with desired storage specifications. The CSI driver automatically provisions the FSx for Lustre file system for you based on the PVC request. This allows for straightforward scaling and eliminates the need to manually create file systems.
Static provisioning – In this method, you manually create the FSx for Lustre file system before using the CSI driver. You will need to configure details like subnet ID and security groups for the file system. Then, you can use the driver to mount this pre-created file system within your container as a volume.

For this example, we use dynamic provisioning. Start by creating a storage class that uses the fsx.csi.aws.com provisioner:

cat <<EOF > storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: ${SUBNET_ID}
  securityGroupIds: ${SECURITYGROUP_ID}
  deploymentType: PERSISTENT_2
  automaticBackupRetentionDays: "0"
  copyTagsToBackups: "true"
  perUnitStorageThroughput: "250"
  dataCompressionType: "LZ4"
  fileSystemTypeVersion: "2.12"
mountOptions:
  - flock
EOF

kubectl apply -f storageclass.yaml

SUBNET_ID: The ID of the subnet in which the FSx for Lustre file system will be created. This should be the same private subnet that was used for HyperPod creation.
SECURITYGROUP_ID: The security group IDs that will be attached to the file system. These should be the same security group IDs that are used by HyperPod and EKS.

Next, create a PVC named fsx-claim that uses the fsx-sc storage class:

cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi
EOF

kubectl apply -f pvc.yaml

This PVC will start the dynamic provisioning of an FSx for Lustre file system based on the specifications provided in the storage class.
Create the Ray cluster
Now that we have both the SageMaker HyperPod cluster and the FSx for Lustre file system created, we can set up the Ray cluster:

Set up dependencies. We will create a new namespace in our Kubernetes cluster and install the KubeRay operator using a Helm chart.

We recommend using KubeRay operator version 1.2.0 or higher, which supports automatic Ray Pod eviction and replacement in case of failures (for example, hardware issues on EKS or SageMaker HyperPod nodes).

# Create the KubeRay namespace
kubectl create namespace kuberay
# Add the KubeRay operator Helm chart repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both the CRDs and the KubeRay operator v1.2.0
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0 --namespace kuberay
# Verify that the KubeRay operator pod is running
kubectl get pods --namespace kuberay

Create a Ray Container Image for the Ray Cluster manifest. With the recent deprecation of the `rayproject/ray-ml` images starting from Ray version 2.31.0, it’s necessary to create a custom container image for our Ray cluster. Therefore, we will build on top of the `rayproject/ray:2.42.1-py310-gpu` image, which has all necessary Ray dependencies, and include our training dependencies to build our own custom image. Please feel free to modify this Dockerfile as you wish.

First, create a Dockerfile that builds upon the base Ray GPU image and includes only the necessary dependencies:

cat <<'EOF' > Dockerfile
FROM rayproject/ray:2.42.1-py310-gpu

# Install Python dependencies for PyTorch, Ray, Hugging Face, and more
RUN pip install --no-cache-dir \
    torch torchvision torchaudio \
    numpy \
    pytorch-lightning \
    transformers datasets evaluate tqdm click \
    ray[train] ray[air] \
    ray[train-torch] ray[train-lightning] \
    torchdata \
    torchmetrics \
    torch_optimizer \
    accelerate \
    scikit-learn \
    Pillow==9.5.0 \
    protobuf==3.20.3

RUN pip install --upgrade datasets transformers

# Set the user
USER ray
WORKDIR /home/ray

# Verify the Ray installation
RUN which ray && \
    ray --version

# Default command
CMD [ "/bin/bash" ]
EOF

Then, build and push the image to your container registry (Amazon ECR) using the provided script:

export AWS_REGION=$(aws configure get region)
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/

echo "This process may take 10-15 minutes to complete..."

echo "Building image..."

docker build --platform linux/amd64 -t ${REGISTRY}aws-ray-custom:latest .

# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep "aws-ray-custom" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
    aws ecr create-repository --repository-name aws-ray-custom
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REGISTRY

echo "Pushing image to $REGISTRY ..."

# Push image to registry
docker image push ${REGISTRY}aws-ray-custom:latest

Now, our Ray container image is in Amazon ECR with all necessary Ray dependencies, as well as code library dependencies.

Create a Ray cluster manifest. We use a Ray cluster to host our training jobs. The Ray cluster is the primary resource for managing Ray instances on Kubernetes. It represents a cluster of Ray nodes, including a head node and multiple worker nodes. The Ray cluster CRD determines how the Ray nodes are set up, how they communicate, and how resources are allocated among them. The nodes in a Ray cluster manifest as pods in the EKS or SageMaker HyperPod cluster.

Note that there are two distinct sections in the cluster manifest. The `headGroupSpec` defines the head node of the Ray cluster, and the `workerGroupSpecs` define the worker nodes. Although a job could technically run on the head node as well, it is common to separate the head node from the actual worker nodes where jobs are executed. Therefore, the head node can typically use a smaller instance (for example, we chose an ml.m5.2xlarge). Because the head node also manages cluster-level metadata, it can be beneficial to run it on a non-GPU node to minimize the risk of node failure (GPUs are a common source of node failure).

cat <<'EOF' > raycluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: rayml
  labels:
    controller-tools.k8s.io: "1.0"
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # Pod template
    template:
      spec:
        # nodeSelector:
        #   node.kubernetes.io/instance-type: "ml.m5.2xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
          - name: ray-head
            image: ${REGISTRY}aws-ray-custom:latest    ## IMAGE: Here you may choose which image your head pod will run
            env:                                       ## ENV: Here is where you can pass environment variables to the head pod
              - name: RAY_GRAFANA_IFRAME_HOST          ## PROMETHEUS AND GRAFANA
                value: http://localhost:3000
              - name: RAY_GRAFANA_HOST
                value: http://prometheus-grafana.prometheus-system.svc:80
              - name: RAY_PROMETHEUS_HOST
                value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:                                  ## LIMITS: Set resource limits for your head pod
                cpu: 1
                memory: 8Gi
              requests:                                ## REQUESTS: Set resource requests for your head pod
                cpu: 1
                memory: 8Gi
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265                    # Ray dashboard
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            volumeMounts:                              ## VOLUMEMOUNTS
              - name: fsx-storage
                mountPath: /fsx
              - name: ray-logs
                mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: fsx-storage
            persistentVolumeClaim:
              claimName: fsx-claim
  workerGroupSpecs:
    # The pod replicas in this group are typed as workers
    - replicas: 4                                      ## REPLICAS: How many worker pods you want
      minReplicas: 1
      maxReplicas: 10
      # Logical group name; can also be functional
      groupName: gpu-group
      rayStartParams:
        num-gpus: "8"
      # Pod template
      template:
        spec:
          # nodeSelector:
          #   node.kubernetes.io/instance-type: "ml.p5.48xlarge"
          securityContext:
            runAsUser: 0
            runAsGroup: 0
            fsGroup: 0
          containers:
            - name: ray-worker
              image: ${REGISTRY}aws-ray-custom:latest  ## IMAGE: Here you may choose which image your worker pods will run
              env: []
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              resources:
                limits:                                ## LIMITS: Set resource limits for your worker pods
                  nvidia.com/gpu: 8
                  #vpc.amazonaws.com/efa: 32
                requests:                              ## REQUESTS: Set resource requests for your worker pods
                  nvidia.com/gpu: 8
                  #vpc.amazonaws.com/efa: 32
              volumeMounts:                            ## VOLUMEMOUNTS
                - name: ray-logs
                  mountPath: /tmp/ray
                - name: fsx-storage
                  mountPath: /fsx
          volumes:
            - name: fsx-storage
              persistentVolumeClaim:
                claimName: fsx-claim
            - name: ray-logs
              emptyDir: {}
EOF

Deploy the Ray cluster:

envsubst < raycluster.yaml | kubectl apply -f -

Optionally, expose the Ray dashboard using port forwarding:

# Get the name of the Kubernetes service that fronts the head pod
export SERVICEHEAD=$(kubectl get service | grep head-svc | awk '{print $1}' | head -n 1)
# Port forward the dashboard from the head pod service
kubectl port-forward --address 0.0.0.0 service/${SERVICEHEAD} 8265:8265 > /dev/null 2>&1 &

Now, you can go to http://localhost:8265/ to view the Ray dashboard.

To launch a training job, there are a few options:

Use the Ray jobs submission SDK, where you can submit jobs to the Ray cluster through the Ray dashboard port (8265 by default) where Ray listens for job requests. To learn more, see Quickstart using the Ray Jobs CLI.
Execute a Ray job in the head pod where you exec directly into the head pod and then submit your job. To learn more, see RayCluster Quickstart.

For this example, we use the first method and submit the job through the SDK. We run the submission from a local environment where the training code is available in the folder passed to --working-dir. Relative to this path, we specify the main training Python script (here, fsdp-ray.py). Within the working directory, we can also include additional scripts needed to run the training.
The fsdp-ray.py example is located in aws-do-ray/Container-Root/ray/raycluster/jobs/fsdp-ray/fsdp-ray.py in the aws-do-ray GitHub repo.

# Within the jobs/ folder
ray job submit --address http://localhost:8265 --working-dir "fsdp-ray" -- python3 fsdp-ray.py
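
Alternatively, you can submit the same job programmatically with the Ray Jobs Python SDK instead of the CLI. The following is a minimal sketch that assumes the Ray dashboard is still port-forwarded to localhost:8265 and that the fsdp-ray folder is available locally:

from ray.job_submission import JobSubmissionClient

# Connect to the Ray cluster through the port-forwarded dashboard endpoint
client = JobSubmissionClient("http://localhost:8265")

# Submit the training job; the working_dir contents are uploaded to the cluster
job_id = client.submit_job(
    entrypoint="python3 fsdp-ray.py",
    runtime_env={"working_dir": "fsdp-ray"},
)
print(f"Submitted job: {job_id}")

# Optionally, poll the job status
print(client.get_job_status(job_id))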

For our Python training script to run, we need to make sure our training scripts are correctly set up to use Ray. This includes the following steps:

Configure a model to run distributed and on the correct CPU/GPU device
Configure a data loader to shard data across the workers and place data on the correct CPU or GPU device
Configure a training function to report metrics and save checkpoints
Configure scaling and CPU or GPU resource requirements for a training job
Launch a distributed training job with a TorchTrainer class

For further details on how to adjust your existing training script to get the most out of Ray, refer to the Ray documentation.
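
To illustrate the first two items in the preceding list, the following is a minimal sketch (not the full training script) of how a model and a PyTorch DataLoader are typically wrapped with Ray Train utilities so they are distributed and placed on the correct device; the dataset and batch size here are placeholders:

import torch.nn as nn
from torch.utils.data import DataLoader
import ray.train.torch

def train_func(config):
    # Wrap the model in DistributedDataParallel and move it to the right device
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))

    # Shard the data loader across workers and move each batch to the right device
    data_loader = DataLoader(config["dataset"], batch_size=32, shuffle=True)  # placeholder dataset
    data_loader = ray.train.torch.prepare_data_loader(data_loader)

    for batch in data_loader:
        ...  # forward and backward pass as usual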
The following diagram illustrates the complete architecture you have built after completing these steps.

Implement training job resiliency with the job auto resume functionality
Ray is designed with robust fault tolerance mechanisms to provide resilience in distributed systems where failures are inevitable. These failures generally fall into two categories: application-level failures, which stem from bugs in user code or external system issues, and system-level failures, caused by node crashes, network disruptions, or internal bugs in Ray. To address these challenges, Ray provides tools and strategies that enable applications to detect, recover, and adapt seamlessly, providing reliability and performance in distributed environments. In this section, we look at two of the most common types of failures and how to implement fault tolerance for them, which SageMaker HyperPod complements: Ray Train worker failures and Ray worker node failures.

Ray Train worker – This is a worker process specifically used for training tasks within Ray Train, Ray’s distributed training library. These workers handle individual tasks or shards of a distributed training job. Each worker is responsible for processing a portion of the data, training a subset of the model, or performing computation during distributed training. They are coordinated by the Ray Train orchestration logic to collectively train a model.
Ray worker node – At the Ray level, this is a Ray node in a Ray cluster. It’s part of the Ray cluster infrastructure and is responsible for running tasks, actors, and other processes as orchestrated by the Ray head node. Each worker node can host multiple Ray processes that execute tasks or manage distributed objects. At the Kubernetes level, a Ray worker node is a Kubernetes pod that is managed by a KubeRay operator. For this post, we will be talking about the Ray worker nodes at the Kubernetes level, so we will refer to them as pods.

At the time of writing, there are no official updates regarding head pod fault tolerance and auto resume capabilities. Though head pod failures are rare, in the unlikely event of such a failure, you will need to manually restart your training job. However, you can still resume progress from the last saved checkpoint. To minimize the risk of hardware-related head pod failures, it’s advised to place the head pod on a dedicated, CPU-only SageMaker HyperPod node, because GPU failures are a common training job failure point.
Ray Train worker failures
Ray Train is designed with fault tolerance to handle worker failures, such as RayActorErrors. When a failure occurs, the affected workers are stopped, and new ones are automatically started to maintain operations. However, for training progress to continue seamlessly after a failure, saving and loading checkpoints is essential. Without proper checkpointing, the training script will restart, but all progress will be lost. Checkpointing is therefore a critical component of Ray Train’s fault tolerance mechanism and needs to be implemented in your code.
Automatic recovery
When a failure is detected, Ray shuts down the failed workers and provisions new ones. While this happens, we can tell the training function to keep retrying until training can continue. Each instance of recovery from a worker failure is considered a retry. We can set the number of retries through the max_failures attribute of the FailureConfig, which is set in the RunConfig passed to the Trainer (for example, TorchTrainer). See the following code:

from ray.train import RunConfig, FailureConfig
# Tries to recover a run up to this many times.
run_config = RunConfig(failure_config=FailureConfig(max_failures=2))
# No limit on the number of retries.
run_config = RunConfig(failure_config=FailureConfig(max_failures=-1))

For more information, see Handling Failures and Node Preemption.
Checkpoints
A checkpoint in Ray Train is a lightweight interface representing a directory stored either locally or remotely. For example, a cloud-based checkpoint might point to s3://my-bucket/checkpoint-dir, and a local checkpoint might point to /tmp/checkpoint-dir. To learn more, see Saving checkpoints during training.
To save a checkpoint in the training loop, you first need to write your checkpoint to a local directory, which can be temporary. When saving, you can use checkpoint utilities from other frameworks like torch.save, pl.Trainer.save_checkpoint, accelerator.save_model, save_pretrained, tf.keras.Model.save, and more. Then you create a checkpoint from the directory using Checkpoint.from_directory. Finally, report the checkpoint to Ray Train using ray.train.report(metrics, checkpoint=…). The metrics reported alongside the checkpoint are used to keep track of the best-performing checkpoints. Reporting will upload the checkpoint to persistent storage.
If you save checkpoints with ray.train.report(…, checkpoint=…) and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage is not set up. This is because Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location.
Finally, clean up the local temporary directory to free up disk space (for example, by exiting the tempfile.TemporaryDirectory context). We can save a checkpoint every epoch or every few iterations.
The following diagram illustrates this setup.

The following code is an example of saving checkpoints using native PyTorch:

import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

import ray.train.torch
from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    n = 100
    # Create a toy dataset
    # data   : X - dim = (n, 4)
    # target : Y - dim = (n, 1)
    X = torch.Tensor(np.random.normal(0, 1, size=(n, 4)))
    Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1)))

    # Toy neural network: 1 layer
    # Wrap the model in DDP
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    criterion = nn.MSELoss()

    optimizer = Adam(model.parameters(), lr=3e-4)
    for epoch in range(config["num_epochs"]):
        y = model.forward(X)
        loss = criterion(y, Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        metrics = {"loss": loss.item()}

        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            checkpoint = None

            should_checkpoint = epoch % config.get("checkpoint_freq", 1) == 0
            # In standard DDP training, where the model is the same across all ranks,
            # only the global rank 0 worker needs to save and report the checkpoint
            if train.get_context().get_world_rank() == 0 and should_checkpoint:
                torch.save(
                    model.module.state_dict(),  # NOTE: Unwrap the model.
                    os.path.join(temp_checkpoint_dir, "model.pt"),
                )
                checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

            train.report(metrics, checkpoint=checkpoint)

trainer = TorchTrainer(
    train_func,
    train_loop_config={"num_epochs": 5},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

Ray Train also comes with CheckpointConfig, a way to configure checkpointing options:

from ray.train import RunConfig, CheckpointConfig

# Example 1: Only keep the 2 *most recent* checkpoints and delete the others.
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2))

# Example 2: Only keep the 2 *best* checkpoints and delete the others.
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        # *Best* checkpoints are determined by these params:
        checkpoint_score_attribute="mean_accuracy",
        checkpoint_score_order="max",
    ),
    # This will store checkpoints on S3.
    storage_path="s3://remote-bucket/location",
)

To restore training state from a checkpoint if your training job were to fail and retry, you should modify your training loop to auto resume and then restore a Ray Train job. By pointing to the path of your saved checkpoints, you can restore your trainer and continue training. Here’s a quick example:

from ray.train.torch import TorchTrainer

restored_trainer = TorchTrainer.restore(
    path="~/ray_results/dl_trainer_restore",  # Can also be a cloud storage path like S3
    datasets=get_datasets(),
)
result = restored_trainer.fit()

To streamline restoration, you can add auto resume logic to your script. This checks if a valid experiment directory exists and restores the trainer if available. If not, it starts a new experiment:

from ray import train
from ray.train.torch import TorchTrainer

experiment_path = "~/ray_results/dl_restore_autoresume"
if TorchTrainer.can_restore(experiment_path):
    trainer = TorchTrainer.restore(experiment_path, datasets=get_datasets())
else:
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        datasets=get_datasets(),
        scaling_config=train.ScalingConfig(num_workers=2),
        run_config=train.RunConfig(
            storage_path="~/ray_results",
            name="dl_restore_autoresume",
        ),
    )
result = trainer.fit()

To summarize, to provide fault tolerance and auto resume when using Ray Train libraries, set your max_failures parameter in the FailureConfig (we recommend setting it to -1 to make sure it will keep retrying until the SageMaker HyperPod node is rebooted or replaced), and make sure you have enabled checkpointing in your code.
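
Putting these recommendations together, the following is a minimal sketch of a run configuration that retries indefinitely and writes checkpoints to shared storage; the /fsx/ray_results path and the experiment name are assumptions based on the FSx for Lustre volume mounted at /fsx in the cluster manifest shown earlier:

from ray.train import RunConfig, FailureConfig, CheckpointConfig, ScalingConfig
from ray.train.torch import TorchTrainer

run_config = RunConfig(
    # Shared, persistent storage so all workers write checkpoints to the same location
    storage_path="/fsx/ray_results",                 # assumption: FSx for Lustre mounted at /fsx
    name="fsdp-training",                            # hypothetical experiment name
    # Keep retrying until the faulty SageMaker HyperPod node is rebooted or replaced
    failure_config=FailureConfig(max_failures=-1),
    # Keep only the two most recent checkpoints
    checkpoint_config=CheckpointConfig(num_to_keep=2),
)

trainer = TorchTrainer(
    train_func,                                      # the checkpointing train_func shown earlier
    scaling_config=ScalingConfig(num_workers=2),
    run_config=run_config,
)
result = trainer.fit()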
Ray worker pod failures
In addition to the aforementioned mechanisms to recover from Ray Train worker failures, Ray also provides fault tolerance at the worker pod level. When a worker pod fails (this includes scenarios in which the raylet process fails), the running tasks and actors on it will fail and the objects owned by worker processes of this pod will be lost. In this case, the tasks, actors, and objects fault tolerance mechanisms will start and try to recover the failures using other worker pods.
These mechanisms will be implicitly handled by the Ray Train library. To learn more about the underlying fault tolerance at the tasks, actors, objects (implemented at the Ray Core level), see Fault Tolerance.
In practice, this means that in case of a worker pod failure, the following occurs:

If there is a free worker pod in the Ray cluster, Ray will recover the failed worker pod by replacing it with the free worker pod.
If there is no free worker pod, but in the underlying SageMaker HyperPod cluster there are free SageMaker HyperPod nodes, Ray will schedule a new worker pod onto one of the free SageMaker HyperPod nodes. This pod will join the running Ray cluster and the failure will be recovered using this new worker pod.

In the context of KubeRay, Ray worker nodes are represented by Kubernetes pods, and failures at this level can include issues such as pod eviction or preemption caused by software-level factors.
However, another critical scenario to consider is hardware failures. If the underlying SageMaker HyperPod node becomes unavailable due to a hardware issue, such as a GPU error, it would inevitably cause the Ray worker pod running on that node to fail as well. Now the fault tolerance and auto-healing mechanisms of your SageMaker HyperPod cluster start and will reboot or replace the faulty node. After the new healthy node is added into the SageMaker HyperPod cluster, Ray will schedule a new worker pod onto the SageMaker HyperPod node and recover the interrupted training. In this case, both the Ray fault tolerance mechanism and the SageMaker HyperPod resiliency features work together seamlessly and make sure that even in case of a hardware failure, your ML training workload can auto resume and pick up from where it was interrupted.
As you have seen, there are various built-in resiliency and fault-tolerance mechanisms that allow your Ray Train workload on SageMaker HyperPod to recover and auto resume. Because these mechanisms will essentially recover by restarting the training job, it is crucial that checkpointing is implemented in the training script. It is also generally advised to save the checkpoints on a shared and persistent path, such as an Amazon Simple Storage Service (Amazon S3) bucket or FSx for Lustre file system.
Clean up
To delete your SageMaker HyperPod cluster created in this post, you can either use the SageMaker AI console or use the following AWS CLI command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
If you used the CloudFormation stack to create resources, you can delete it using the following command:

aws cloudformation delete-stack --stack-name <stack_name>

Conclusion
This post demonstrated how to set up and deploy Ray clusters on SageMaker HyperPod, highlighting key considerations such as storage configuration and fault tolerance and auto resume mechanisms.
Running Ray jobs on SageMaker HyperPod offers a powerful solution for distributed AI/ML workloads, combining the flexibility of Ray with the robust infrastructure of SageMaker HyperPod. This integration provides enhanced resiliency and auto resume capabilities, which are crucial for long-running and resource-intensive tasks. By using Ray’s distributed computing framework and the built-in features of SageMaker HyperPod, you can efficiently manage complex ML workflows, specifically training workloads as covered in this post. As AI/ML workloads continue to grow in scale and complexity, the combination of Ray and SageMaker HyperPod offers a scalable, resilient, and efficient platform for tackling the most demanding computational challenges in machine learning.
To get started with SageMaker HyperPod, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the Amazon SageMaker HyperPod Developer Guide. To learn more about the aws-do-ray framework, refer to the GitHub repo.

About the Authors
Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on the Automotive and Manufacturing sector, specializing in helping organizations architect, optimize, and scale artificial intelligence and machine learning solutions, with particular expertise in autonomous vehicle technologies. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering.
Florian Stahl is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in Artificial Intelligence, Machine Learning, and Generative AI solutions, helping customers optimize and scale their AI/ML workloads on AWS. With a background as a Data Scientist, Florian focuses on working with customers in the Autonomous Vehicle space, bringing deep technical expertise to help organizations design and implement sophisticated machine learning solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their machine learning investments on AWS.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop has held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Alex Iankoulski is a Principal Solutions Architect, ML/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges.

Using Large Language Models on Amazon Bedrock for multi-step task exec …

The goal of this blog post is to show you how a large language model (LLM) can be used to perform tasks that require multi-step dynamic reasoning and execution. Examples of tasks that require dynamic reasoning and execution are answering questions of the form “What is the average length of stay for patients with [specific condition] across different hospitals?” or “How do prescription trends for [specific drug] vary across different regions?” Traditionally, answering these queries required the expertise of business intelligence specialists and data engineers, often resulting in time-consuming processes and potential bottlenecks.
However, the advances in LLMs have opened up exciting new possibilities to break down complex tasks into a series of steps, use tools to complete each step, and produce a final solution.
In this context, the term tools refer to external capabilities or APIs that the model can access and interact with to extend its functionality beyond text-based responses. These tools allow LLMs to perform specialized tasks such as retrieving real-time information, running code, browsing the web, or generating images. By using tools, LLMs can offer more accurate, context-aware, and actionable outputs, enabling them to effectively assist with complex queries that require access to data or functions outside their internal knowledge base.
In this post, we showcase an example of such capabilities for a patient record retrieval solution built on top of an existing dataset by relying on APIs only instead of text-to-SQL. You can find the code for this post on GitHub .
Overview of solution
The goal of the solution is to accurately answer analytical questions that require multi-step reasoning and execution. The following is an example of an expected interaction with the agent.

User: Give me the first and last name of the patient with least vaccines and their vaccine count.
AI: Based on the results of our analysis, I can answer your question: The patient with the least number of vaccines is Sharleen176 Kulas532, and their vaccine count is 1.
Here’s a breakdown of how we arrived at this answer:
1. We first retrieved patients and immunization records.
2. We then grouped the immunization records by patient_id and counted the number of vaccines for each patient.
3. We ordered this list by the vaccine count in ascending order.
4. We limited the result to the top 1 (the patient with the least vaccines).
5. We joined this result with the patient information to get the first and last name.
6. Finally, we selected only the relevant information (first name, last name, and vaccine count).
The result shows that Sharleen176 Kulas532 has received only 1 vaccine, which is the lowest among patients in the database.

For this solution, we use the Synthetic Patient Generation dataset, which contains multiple tables of healthcare industry related synthetic data.
You can set up the dataset at the project’s root folder using the following steps:
curl https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_csv_apr2020.zip > dataset.zip

unzip dataset.zip

mv csv dataset
The solution is made from two core steps: plan and execute. In its simplest form, it can be represented by the following diagram.
Fig 1: Simple execution flow – solution overview
In a more complex scheme, you can add multiple layers of validation and provide relevant APIs to increase the success rate of the LLM.

Fig 2: Complex execution flow – solution overview
Plan
In the Plan stage, the LLM is given a set of predefined API function signatures along with a brief description of what each function does. These function signatures act as tools that the LLM can use to formulate a plan to answer a user’s query. The goal is to have the LLM reason through the steps required to arrive at the answer, much like a human would.
Why the plan stage is important
The Plan stage is critical because it allows the LLM to create a structured, logical sequence of actions that will be executed in the next stage. By planning, the LLM can break down a complex question into manageable steps, making sure that the right APIs are called in the correct order. This structured approach helps to minimize errors and increases the likelihood of producing accurate results.
Providing function signatures
In this stage, the LLM is given a set of function signatures that represent the tools it can use. Each function signature includes the name of the function, the parameters it accepts, and the type of value it returns. Here’s an example of a few function signatures:

def get_patients() -> List[Patient]: Retrieves a list of patients from the dataset.
def get_immunization() -> List[Immunization]: Retrieves a list of immunization records from the dataset.
def filter(list: List[object], keys: List[str], values: List[str]) -> List[object]: Filters a given list based on specified keys and values.
def join(a: List, b: List, left_key: str, right_key: str, how: JoinMode) -> List: Joins two lists based on matching keys, using a specified join mode (e.g., INNER, LEFT, RIGHT).

These function signatures act as building blocks for the LLM to generate a plan. The LLM must choose the appropriate functions and sequence them in a logical order to achieve the desired outcome.
Retrieval Augmented Generation (RAG) improves the selection process by narrowing down the tools an LLM sees based on the task, simplifying the prompt. In a project with many tools, RAG makes sure that only the most relevant tools are surfaced for a given query, reducing complexity and helping the LLM make more accurate decisions. This focused exposure enhances performance by preventing the model from being overwhelmed by irrelevant options.
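As a rough illustration only, RAG-based tool selection can be as simple as ranking tool descriptions by similarity to the user query and exposing only the top matches to the LLM. In the following sketch, the embed function is a hypothetical stand-in for whichever embedding model you use; it is not part of the solution's code base:

from typing import Callable, List, Tuple

def rank_tools(
    query: str,
    tools: List[Tuple[str, str]],           # (function signature, description) pairs
    embed: Callable[[str], List[float]],    # hypothetical embedding function
    top_k: int = 5,
) -> List[Tuple[str, str]]:
    """Return the top_k tools whose descriptions are most similar to the query."""
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(desc)), (sig, desc)) for sig, desc in tools]
    return [tool for _, tool in sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]]

Only the returned signatures are then included in the planning prompt.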
Generating a plan
After the function signatures are provided, the LLM is prompted to create a plan. The plan typically consists of a series of steps, each represented as a JSON object. Each step indicates a function that needs to be executed, the parameters that need to be passed, and the expected outcome (often referred to as evidence).
For example, if the task is to find the patient with the least number of vaccines, the LLM might generate a plan that includes the following steps:

Retrieve patients: Use the get_patients() function to get a list of patients.
Retrieve immunization records: Use the get_immunization() function to get a list of immunizations.
Group by patient: Use the group_by() function to group the immunizations by patient_id, counting the number of vaccines for each patient.
Order by count: Use the order_by() function to sort the grouped list in ascending order based on the vaccine count.
Limit the result: Use the limit() function to select the patient with the least vaccines.
Join with patient data: Use the join() function to match the selected result with the patient’s information.
Select relevant fields: Use the select() function to extract only the necessary fields, such as the patient’s first name, last name, and vaccine count.

JSON representation
The LLM outputs this plan as structured JSON, which makes it straightforward to parse and execute in the next stage. The JSON format helps make sure that the plan is clear, unambiguous, and ready for programmatic execution.
The following is an example of what the JSON might look like:
{
  "role": "assistant",
  "content": [
    {
      "toolUse": {
        "toolUseId": "tooluse_example_id",
        "name": "execute_plan",
        "input": {
          "plans": [
            {
              "function_name": "get_patients",
              "parameters": [],
              "evidence_number": 1
            },
            {
              "function_name": "get_immunization",
              "parameters": [],
              "evidence_number": 2
            },
            {
              "function_name": "group_by",
              "parameters": [
                "list",
                "group_key",
                "aggregation_key",
                "aggregation"
              ],
              "parameter_values": [
                "#E2",
                "patient_id",
                null,
                "COUNT"
              ],
              "evidence_number": 3
            },
            {
              "function_name": "order_by",
              "parameters": [
                "list",
                "key",
                "value"
              ],
              "parameter_values": [
                "#E3",
                "count",
                "ASCENDING"
              ],
              "evidence_number": 4
            },
            {
              "function_name": "limit",
              "parameters": [
                "list",
                "k"
              ],
              "parameter_values": [
                "#E4",
                1
              ],
              "evidence_number": 5
            },
            {
              "function_name": "join",
              "parameters": [
                "a",
                "b",
                "left_key",
                "right_key",
                "how"
              ],
              "parameter_values": [
                "#E5",
                "#E1",
                "patient_id",
                "id",
                "INNER"
              ],
              "evidence_number": 6
            },
            {
              "function_name": "select",
              "parameters": [
                "list",
                "keys"
              ],
              "parameter_values": [
                "#E6",
                [
                  "first",
                  "last",
                  "count"
                ]
              ],
              "evidence_number": 7
            }
          ]
        }
      }
    }
  ]
}

Execute
In the Execute stage, the structured plan generated by the LLM in the previous step is programmatically carried out to produce the final output. The JSON blueprint from the planning stage is parsed, and each function call described in the plan is executed sequentially.
The process begins with data retrieval, such as accessing patient records or immunization data, using predefined API functions such as get_patients() or get_immunization(). These initial function calls generate intermediate results, which are stored as evidence and referenced in subsequent steps.
The plan typically involves a series of data transformation functions, such as group_by() to aggregate data, filter() for refining results, and order_by() for data sorting. Each function is executed with specific parameters as outlined in the JSON plan, using progressive data refinement to answer the query.
As each function is executed, its output is passed to the subsequent function in the sequence. This chain of function calls culminates in a final step, often involving a select() function to extract the most relevant information, such as a patient’s name and vaccine count.
Error handling in the Execute stage is crucial for facilitating the reliability and robustness of the entire process. As the LLM's plan is executed, various issues can arise, including empty datasets, invalid parameters, or mismatched data types during function calls such as join() or filter(). To address these potential challenges, the system incorporates error-checking mechanisms at each step, enabling it to detect and respond to anomalies efficiently. If a function returns an unexpected result or encounters an issue, the system can provide the error back to the LLM itself, enabling it to regenerate the plan with the necessary adjustments. This approach not only reduces execution failures but also enhances the overall user experience by delivering accurate and reliable results, even in the face of unexpected challenges.
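To make the Execute stage concrete, the following is a simplified sketch of an executor loop. It mirrors the JSON plan structure shown earlier (function names, parameter values, and #E evidence references), but the helper names and error handling here are illustrative assumptions rather than the exact implementation from the repository:

from typing import Any, Dict, List

def execute_plan(plan: List[Dict[str, Any]], registry: Dict[str, Any]) -> Any:
    """Run each planned step in order, storing intermediate results as evidence #E1, #E2, ..."""
    evidence: Dict[str, Any] = {}
    result = None
    for step in plan:
        func = registry[step["function_name"]]            # e.g., get_patients, group_by, join
        args = []
        for value in step.get("parameter_values", []):
            # Replace evidence references (such as "#E2") with results from earlier steps
            if isinstance(value, str) and value.startswith("#E"):
                args.append(evidence[value])
            else:
                args.append(value)
        try:
            result = func(*args)
        except Exception as exc:
            # On failure, the error can be sent back to the LLM so it can regenerate the plan
            raise RuntimeError(f"Step {step['evidence_number']} failed: {exc}") from exc
        evidence[f"#E{step['evidence_number']}"] = result
    return result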
Summary
This post explores the application of LLMs in executing complex analytical queries through an API, with specific focus on Amazon Bedrock. Traditionally, business users rely on data professionals to retrieve and present data, but LLMs can now offer a streamlined approach enabling direct query responses by using predefined API tools. To illustrate this capability, we use the Synthetic Patient Generation dataset and present a solution structured around two primary phases: Plan and Execution.
In the Plan stage, the LLM is provided with API function signatures, which it uses to generate a structured, logical sequence of steps to answer the query. This plan is output as a JSON, providing clarity and facilitating seamless execution. In the Execute stage, the system programmatically carries out the plan by sequentially executing each function call. Robust error-handling mechanisms are integrated to identify potential issues and, if necessary, relay errors back to the LLM for plan regeneration.
To demonstrate this process, we present a use case where the system identifies the patient with the least number of vaccines by retrieving, grouping, and sorting data, and ultimately presenting the final result. This example showcases the LLM’s ability to extend beyond mere text-based responses, providing actionable and context-aware outputs that can significantly enhance business decision-making processes.
Conclusion
This article highlights the efficacy of LLMs in expanding their functionality to deliver practical, data-driven solutions that have the potential to revolutionize business analytics and decision-making workflows.

About the Authors
Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.
Mohammad Arbabshirani, PhD, is a Sr. Data Science Manager at AWS Professional Services. He specializes in helping customers accelerate business outcomes on AWS through the application of machine learning and generative AI. He has 12 years of experience across the full life cycle of machine learning, computer vision, and data science, from sales support to end-to-end solution delivery, especially in the healthcare and life sciences vertical. Currently, Mohammad leads a team of data scientists, machine learning engineers, and data architects, focusing on the delivery of cutting-edge ML solutions for customers. His background includes extensive research in neuroimaging and medical imaging. Outside of his professional endeavors, Mohammad enjoys tennis, soccer, and instrumental music.

The Complete Beginner’s Guide to Terminal/Command Prompt

The terminal (on Mac/Linux) or command prompt (on Windows) is a powerful tool that allows you to interact with your computer using text commands instead of clicking through a graphical interface. While it might seem intimidating at first, mastering basic terminal commands can help you:

Navigate through files and folders more efficiently

Perform tasks that aren’t possible through the regular interface

Automate repetitive tasks

Gain a deeper understanding of how your computer works

This guide will introduce you to the essential commands and concepts to get you started, regardless of which operating system you use.

Getting Started

Opening the Terminal

On Windows:

Press Win + R, type cmd, and press Enter

Or search for “Command Prompt” in the Start menu

On Mac:

Press Command + Space to open Spotlight, type “Terminal”, and press Enter

Or find Terminal in Applications → Utilities → Terminal

On Linux:

Press Ctrl + Alt + T (on most distributions)

Or search for “Terminal” in your applications menu

Understanding the Prompt

When you first open the terminal, you’ll see a prompt that looks something like this:

Windows: C:\Users\YourUsername>

Mac/Linux: username@computer:~$

This tells you:

Your current location in the file system

Where to type your commands

On Mac/Linux, the ~ symbol represents your home directory

Basic Navigation Commands

Viewing Your Current Location

Windows: cd 

Mac/Linux: pwd (Print Working Directory)

Example: On Mac/Linux, typing pwd and pressing Enter prints something like /Users/yourusername; on Windows, typing cd on its own prints the current folder, such as C:\Users\YourUsername.

Listing Files and Directories

Windows: dir 

Mac/Linux: ls

Example: Typing ls on Mac/Linux (or dir on Windows) prints the names of the files and folders in your current directory.

Options:

ls -l – List with detailed information (file size, date modified, permissions)

ls -a – Show hidden files (files that start with a dot)

ls -la – Combine both options

Changing Directories

All platforms: cd DirectoryName

Examples: cd Documents moves into the Documents folder, cd .. moves up one level, and cd ~ (Mac/Linux) returns to your home directory.

Creating Directories

All platforms: mkdir DirectoryName

Example: mkdir Projects creates a new folder named Projects inside the current directory.

Creating Files

Windows: type nul > filename.txt 

Mac/Linux: touch filename.txt

Example: touch notes.txt (Mac/Linux) or type nul > notes.txt (Windows) creates an empty file named notes.txt.

Working with Files

Viewing File Contents

Windows: type filename.txt 

Mac/Linux: cat filename.txt

For larger files: 

Windows: more filename.txt 

Mac/Linux: less filename.txt (use q to quit)

Copying Files

Windows: copy source destination 

Mac/Linux: cp source destination

Example: cp notes.txt notes-backup.txt (Mac/Linux) or copy notes.txt notes-backup.txt (Windows) creates a copy of notes.txt named notes-backup.txt.

Moving/Renaming Files

Windows: move source destination 

Mac/Linux: mv source destination

Examples: mv notes.txt Documents/ moves the file into the Documents folder, and mv notes.txt notes-old.txt renames it (use move on Windows).

Deleting Files and Directories

Windows: del filename.txt (delete a file), rmdir /s DirectoryName (delete a directory and its contents)

Mac/Linux: rm filename.txt (delete a file), rm -r DirectoryName (delete a directory and its contents)

Warning: Be very careful with delete commands, especially rm -r! There is no “Recycle Bin” or “Trash” when using the terminal – deletions are permanent.

Helpful Tips

Command History

Press the up arrow to cycle through previously used commands

On Mac/Linux, type history to see a list of recent commands

Tab Completion

Start typing a file or directory name, then press Tab

The terminal will attempt to complete it for you

If there are multiple options, press Tab twice to see all possibilities

Getting Help

Windows: help command or command /? 

Mac/Linux: man command (manual pages, press q to exit)

Examples: help cd (Windows) or man ls (Mac/Linux) displays the documentation for that command.

Clearing the Screen

Windows: cls 

Mac/Linux: clear or Ctrl+L

Power User Commands

Searching for Files

Windows: dir /s filename 

Mac/Linux: find . -name filename

Searching Within Files

Windows: findstr “text” filename 

Mac/Linux: grep “text” filename

Chaining Commands

All platforms: Use && to run commands in sequence

Example: mkdir Projects && cd Projects creates a folder and then moves into it in one line.

Redirecting Output

All platforms: Use > to send output to a file

Example: ls > files.txt (or dir > files.txt on Windows) saves the directory listing into files.txt instead of printing it on screen.

Next Steps

As you become more comfortable with these basic commands, you might want to explore:

Command line text editors like Nano, Vim, or Emacs

Writing simple shell scripts to automate tasks

Package managers like apt (Linux), Homebrew (Mac), or Chocolatey (Windows)

Environment variables and how to set them

SSH to connect to remote computers

Common Mistakes and Troubleshooting

Command not found: Check spelling or ensure the command is available on your system

Permission denied: You may need administrator/root privileges

Windows: Run Command Prompt as Administrator

Mac/Linux: Use sudo before commands that need elevated privileges

No such file or directory: Double-check path and file names

Operation not permitted: Similar to permission denied, you might need special permissions

Task | Windows | Mac/Linux
Current location | cd | pwd
List files | dir | ls
Change directory | cd dir | cd dir
Create directory | mkdir dir | mkdir dir
Create file | type nul > file | touch file
Copy file | copy source destination | cp source destination
Move/rename | move source destination | mv source destination
Delete file | del file | rm file
Delete directory | rmdir /s dir | rm -r dir
Clear screen | cls | clear
Get help | help command | man command

Conclusion

In this tutorial, we have covered everything beginners need to know about using the terminal. We explored how to open the terminal across different operating systems, navigate file systems, create and manage files and directories, and use essential commands. We also learned helpful shortcuts, power user commands, and troubleshooting tips. With these foundational skills, you can now confidently use the command line as a powerful tool in your computing journey.

Remember, the terminal is a powerful tool that rewards practice and experimentation. Don’t be afraid to try new commands, but always be careful with commands that modify or delete files.


This AI Paper from ByteDance Introduces a Hybrid Reward System Combini …

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning LLMs with human values and preferences. Despite introducing non-RL alternatives like DPO, industry-leading models such as ChatGPT/GPT-4, Claude, and Gemini continue to rely on RL algorithms like PPO for policy optimization. Recent research focuses on algorithmic improvements, including eliminating critic models to reduce computational costs, filtering noisy samples during PPO sampling, and enhancing reward models to mitigate reward hacking problems. However, only a few studies focus on RLHF data construction (i.e., training prompts) and its performance scaling based on these training prompts.

The success of RLHF heavily depends on reward model quality, which faces three challenges: mis-specified reward modeling in representing human preferences, incorrect and ambiguous preferences in training datasets, and poor generalization ability. To address these issues, GenRM was introduced to validate model predictions against ground-truth responses, showing good resistance to reward hacking and gaining adoption in advanced LLMs like DeepSeekV3. Methods like principled data selection filter overly challenging instances during training, and strategic selection methods identify key training prompts to achieve comparable performance with reduced data. Performance scale analysis reveals that RLHF shows superior generalization compared to SFT on novel inputs but significantly reduces output diversity.

Researchers from ByteDance Seed address a critical gap in RLHF research where the role of prompt-data construction and its scalability has received less attention. They explore data-driven bottlenecks that limit RLHF performance scaling, focusing on reward hacking and decreasing response diversity challenges. A hybrid reward system is introduced by combining reasoning task verifiers (RTV) and a generative reward model (GenRM) that shows stronger resistance to reward hacking and enables a more accurate assessment of responses against ground-truth solutions. Moreover, a novel prompt-selection method called Pre-PPO is introduced to identify inherently challenging training prompts less susceptible to reward hacking.

The experimental setup employs two pre-trained language models of different scales: a smaller model with 25B parameters and a larger model with 150B parameters. The training dataset contains one million prompts from diverse domains, including mathematics, coding, instruction-following, creative writing, and logical reasoning. Moreover, the researchers constructed a detailed evaluation framework covering multiple skill areas: logical reasoning, instruction-following, STEM tasks, coding, natural language processing, knowledge, contextual understanding, and out-of-distribution generalization. The evaluation framework includes two versions (V1.0 and V2.0) with overlapping prompts, though V2.0 features more challenging prompts.

The experimental results show that the proposed approach combining Pre-PPO with prioritized mathematical and coding tasks consistently outperforms the baseline method across model sizes and evaluation datasets. The approach shows an improvement of +1.1 over the baseline when evaluated at 100-step intervals using TestSet V1.0. When tested on the more challenging TestSet V2.0, the performance improvement increases to +1.4. The most substantial gains appear in mathematics-intensive and coding tasks, with an improvement of +3.9 points in STEM and +3.2 points in coding. These improvements are attributed to the strategic prioritization of mathematical reasoning and coding tasks during early RLHF training phases.

In conclusion, this paper addresses critical bottlenecks in RLHF data scaling, specifically identifying reward hacking and reduced response diversity as significant challenges. The researchers proposed a combined approach featuring strategic prompt construction and early-stage training prioritization to solve this issue. The method uses RTV and GenRM to combat reward hacking alongside the novel Pre-PPO prompt selection strategy that identifies and prioritizes challenging training prompts. Analysis reveals that RTV supervision shows the strongest resistance to reward hacking, followed by GenRM with ground-truth labels and then the BT Reward Model. The research establishes a foundation for optimizing RLHF data construction and developing more principled methods to mitigate reward hacking and improve model alignment.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


Meet ReSearch: A Novel AI Framework that Trains LLMs to Reason with Se …

Large language models (LLMs) have demonstrated significant progress across various tasks, particularly in reasoning capabilities. However, effectively integrating reasoning processes with external search operations remains challenging, especially for multi-hop questions requiring intricate reasoning chains and multiple retrieval steps. Current methods primarily depend on manually designed prompts or heuristics, posing limitations in scalability and flexibility. Additionally, generating supervised data for multi-step reasoning scenarios is often prohibitively expensive and practically infeasible.

Researchers from Baichuan Inc., Tongji University, The University of Edinburgh, and Zhejiang University introduce ReSearch, a novel AI framework designed to train LLMs to integrate reasoning with search via reinforcement learning, notably without relying on supervised reasoning steps. The core methodology of ReSearch incorporates search operations directly into the reasoning chain. Utilizing Group Relative Policy Optimization (GRPO), a reinforcement learning technique, ReSearch guides LLMs to autonomously identify optimal moments and strategies for performing search operations, which subsequently influence ongoing reasoning. This approach enables models to progressively refine their reasoning and naturally facilitates advanced capabilities such as reflection and self-correction.

From a technical perspective, ReSearch employs structured output formats by embedding specific tags—such as <think>, <search>, <result>, and <answer>—within the reasoning chain. These tags facilitate clear communication between the model and the external retrieval environment, systematically organizing generated outputs. During training, ReSearch intentionally excludes retrieval results from loss computations to prevent model bias. Reward signals guiding the reinforcement learning process are based on straightforward criteria: accuracy assessment through F1 scores and adherence to the predefined structured output format. This design encourages the autonomous development of sophisticated reasoning patterns, circumventing the need for manually annotated reasoning datasets.

Experimental evaluation confirms the robustness of ReSearch. When assessed on multi-hop question-answering benchmarks, including HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle, ReSearch consistently outperformed baseline methods. Specifically, ReSearch-Qwen-32B-Instruct achieved improvements ranging between 8.9% and 22.4% in performance compared to established baselines. Notably, these advancements were achieved despite the model being trained exclusively on a single dataset, underscoring its strong generalization capabilities. Further analyses demonstrated that models gradually increased their reliance on iterative search operations throughout training, indicative of enhanced reasoning proficiency. A detailed case study illustrated the model’s capacity to identify suboptimal search queries, reflect on its reasoning steps, and implement corrective actions autonomously.

In summary, ReSearch presents a significant methodological advancement in training LLMs to seamlessly integrate reasoning with external search mechanisms via reinforcement learning. By eliminating dependency on supervised reasoning data, this framework effectively addresses critical scalability and adaptability issues inherent in multi-hop reasoning scenarios. Its capability for self-reflection and correction enhances its practical applicability in complex, realistic contexts. Future research directions may further extend this reinforcement learning-based framework to broader applications and incorporate additional external knowledge resources.

Introducing AWS MCP Servers for code assistants (Part 1)

We’re excited to announce the open source release of AWS MCP Servers for code assistants — a suite of specialized Model Context Protocol (MCP) servers that bring Amazon Web Services (AWS) best practices directly to your development workflow. Our specialized AWS MCP servers combine deep AWS knowledge with agentic AI capabilities to accelerate development across key areas. Each AWS MCP Server focuses on a specific domain of AWS best practices, working together to provide comprehensive guidance throughout your development journey.
This post is the first in a series covering AWS MCP Servers. In this post, we walk through how these specialized MCP servers can dramatically reduce your development time while incorporating security controls, cost optimizations, and AWS Well-Architected best practices into your code. Whether you’re an experienced AWS developer or just getting started with cloud development, you’ll discover how to use AI-powered coding assistants to tackle common challenges such as complex service configurations, infrastructure as code (IaC) implementation, and knowledge base integration. By the end of this post, you’ll understand how to start using AWS MCP Servers to transform your development workflow and deliver better solutions, faster.
If you want to get started right away, skip ahead to the section “From concept to working code in minutes.”
AI is transforming how we build software, creating opportunities to dramatically accelerate development while improving code quality and consistency. Today’s AI assistants can understand complex requirements, generate production-ready code, and help developers navigate technical challenges in real time. This AI-driven approach is particularly valuable in cloud development, where developers need to orchestrate multiple services while maintaining security, scalability, and cost-efficiency.
Developers need code assistants that understand the nuances of AWS services and best practices. Specialized AI agents can address these needs by:

Providing contextual guidance on AWS service selection and configuration
Ensuring compliance with security best practices and regulatory requirements
Promoting efficient resource utilization and cost-effective solutions
Automating repetitive implementation tasks with AWS-specific patterns

This approach means developers can focus on innovation while AI assistants handle the undifferentiated heavy lifting of coding. Whether you’re using Amazon Q, Amazon Bedrock, or other AI tools in your workflow, AWS MCP Servers complement and enhance these capabilities with deep AWS-specific knowledge to help you build better solutions faster.
Model Context Protocol (MCP) is a standardized open protocol that enables seamless interaction between large language models (LLMs), data sources, and tools. This protocol allows AI assistants to use specialized tooling and to access domain-specific knowledge by extending the model’s capabilities beyond its built-in knowledge—all while keeping sensitive data local. Through MCP, general-purpose LLMs can now seamlessly access relevant knowledge beyond initial training data and be effectively steered towards desired outputs by incorporating specific context and best practices.
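To make the protocol concrete, the following minimal sketch shows what an MCP server looks like when written with the open source MCP Python SDK; the server name, tool, and logic here are hypothetical illustrations, not one of the AWS MCP Servers described in this post:

# pip install "mcp[cli]"  (open source Model Context Protocol Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-best-practices")  # hypothetical server name

@mcp.tool()
def recommend_instance_type(workload: str) -> str:
    """Toy tool: return placeholder guidance for a workload description."""
    return f"For '{workload}', start small and load test before scaling (placeholder guidance)."

if __name__ == "__main__":
    # Communicate with the MCP client over standard input/output
    mcp.run(transport="stdio")
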
Accelerate building on AWS
What if your AI assistant could instantly access deep AWS knowledge, understanding every AWS service, best practice, and architectural pattern? With MCP, we can transform general-purpose LLMs into AWS specialists by connecting them to specialized knowledge servers. This opens up exciting new possibilities for accelerating cloud development while maintaining security and following best practices.
Build on AWS in a fraction of the time, with best practices automatically applied from the first line of code. Skip hours of documentation research and immediately access ready-to-use patterns for complex services such as Amazon Bedrock Knowledge Bases. Our MCP Servers will help you write well-architected code from the start, implement AWS services correctly the first time, and deploy solutions that are secure, observable, and cost-optimized by design. Transform how you build on AWS today.

Enforce AWS best practices automatically – Write well-architected code from the start with built-in security controls, proper observability, and optimized resource configurations
Cut research time dramatically – Stop spending hours reading documentation. Our MCP Servers provide contextually relevant guidance for implementing AWS services correctly, addressing common pitfalls automatically
Access ready-to-use patterns instantly – Use pre-built AWS CDK constructs, Amazon Bedrock Agents schema generators, and Amazon Bedrock Knowledge Bases integration templates that follow AWS best practices from the start
Optimize cost proactively – Prevent over-provisioning as you design your solution by getting cost-optimization recommendations and generating a comprehensive cost report to analyze your AWS spending before deployment

To turn this vision into reality and make AWS development faster, more secure, and more efficient, we’ve created AWS MCP Servers, a suite of specialized servers that bring AWS best practices directly to your development workflow. They combine deep AWS knowledge with AI capabilities to accelerate development across key areas, each focusing on a specific domain of AWS best practices and working together to provide comprehensive guidance throughout your development journey.
Overview of domain-specific MCP Servers for AWS development
Our specialized MCP Servers are designed to cover distinct aspects of AWS development, each bringing deep knowledge to specific domains while working in concert to deliver comprehensive solutions:

Core – The foundation server that provides AI processing pipeline capabilities and serves as a central coordinator. It helps provide clear plans for building AWS solutions and can federate to other MCP servers as needed.
AWS Cloud Development Kit (AWS CDK) – Delivers AWS CDK knowledge with tools for implementing best practices, security configurations with cdk-nag, Powertools for AWS Lambda integration, and specialized constructs for generative AI services. It makes sure infrastructure as code (IaC) follows AWS Well-Architected principles from the start.
Amazon Bedrock Knowledge Bases – Enables seamless access to Amazon Bedrock Knowledge Bases so developers can query enterprise knowledge with natural language, filter results by data source, and use reranking for improved relevance.
Amazon Nova Canvas – Provides image generation capabilities using Amazon Nova Canvas through Amazon Bedrock, enabling the creation of visuals from text prompts and color palettes—perfect for mockups, diagrams, and UI design concepts.
Cost – Analyzes AWS service costs and generates comprehensive cost reports, helping developers understand the financial implications of their architectural decisions and optimize for cost-efficiency.

Prerequisites
To complete the solution, you need to have the following prerequisites in place:

uv package manager
Python, installed using uv python install 3.13
AWS credentials with appropriate permissions
An MCP-compatible LLM client (such as Anthropic’s Claude for Desktop, Cline, Amazon Q CLI, or Cursor)

From concept to working code in minutes
You can download the AWS MCP Servers from GitHub or install them from PyPI. Here’s how to get started using your favorite code assistant with MCP support.
To install and set up the MCP servers, add the following configuration to your MCP client’s settings file:

{
  "mcpServers": {
    "awslabs.core-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.core-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR",
        "MCP_SETTINGS_PATH": "path to your mcp server settings"
      },
      "autoApprove": [],
      "disabled": false
    },
    "awslabs.bedrock-kb-retrieval-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.bedrock-kb-retrieval-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-aws-profile",
        "AWS_REGION": "us-east-1"
      }
    },
    "awslabs.cdk-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cdk-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    },
    "awslabs.cost-analysis-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cost-analysis-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-aws-profile",
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    },
    "awslabs.nova-canvas-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.nova-canvas-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-aws-profile",
        "AWS_REGION": "us-east-1"
      }
    }
  }
}

AWS MCP Servers in action
Here’s how AWS MCP servers transform the development experience:
Developer: “I need to build an AI-powered chatbot using Amazon Bedrock that can answer questions from our company’s knowledge base. I also want to add a tool for the chatbot to call our internal API.”
Core: “I’ll help you build an Amazon Bedrock Knowledge Bases chatbot with API integration. Let’s create an architecture that uses Amazon Bedrock Agents with a custom action group to call your internal API.”
Core generates a comprehensive architecture diagram showing the knowledge base integration, Amazon Bedrock Agents configuration with action groups, API connectivity, and data flow between components.
AWS CDK: “Here’s the infrastructure code for your chatbot with the Amazon Bedrock Agents action group. I’ve included proper IAM roles, security controls, and Lambda Powertools for observability.”

// CDK code with Well-Architected best practices
const knowledgeBase = new BedrockKnowledgeBase(this, 'CompanyKB', {
  embeddingModel: BedrockFoundationModel.TITAN_EMBED_TEXT_V1,
  vectorStore: new OpenSearchServerlessVectorStore(this, 'VectorStore', {
    encryption: OpenSearchEncryption.KMS,
    ebs: OpenSearchEbsOptions.provisioned(100, OpenSearchVolumeType.GP3)
  })
});

// Action Group for API integration
const actionGroupSchema = new BedrockAgentSchema(this, 'ApiActionSchema', {
  schemaPath: 'schemas/api-action-schema.json',
  // Schema will be auto-generated from your Lambda code
  generateFromLambda: {
    lambdaPath: 'lambda/api-actions.py',
    outputPath: 'schemas/api-action-schema.json'
  }
});

const apiActionGroup = new BedrockAgentActionGroup(this, 'ApiActionGroup', {
  actionGroupName: 'InternalApiActions',
  description: 'Actions to interact with internal API',
  actionGroupExecutor: new BedrockAgentLambdaExecutor(apiActionsLambda),
  actionGroupSchema: actionGroupSchema
});

The CDK MCP Server generates complete AWS CDK code to deploy the entire solution. It automatically runs cdk-nag to identify potential security issues and provides remediation steps for each finding, making sure that the infrastructure follows AWS Well-Architected best practices.
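For reference, cdk-nag is normally attached to a CDK app as an aspect. The following minimal Python CDK sketch shows the general pattern independently of the MCP workflow; the stack reference is a placeholder:

import aws_cdk as cdk
from cdk_nag import AwsSolutionsChecks  # pip install cdk-nag

app = cdk.App()
# ChatbotStack(app, "ChatbotStack")  # placeholder for the stack generated for your solution
cdk.Aspects.of(app).add(AwsSolutionsChecks(verbose=True))  # run AwsSolutions checks at synth time
app.synth()
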
Amazon Bedrock Knowledge Bases retrieval: “I’ve configured the optimal settings for your knowledge base queries, including proper reranking for improved relevance.”
Amazon Bedrock Knowledge Bases MCP Server demonstrates how to structure queries to the knowledge base for maximum relevance, provides sample code for filtering by data source, and shows how to integrate the knowledge base responses with the chatbot interface.
Amazon Nova Canvas: “To enhance your chatbot’s capabilities, I’ve created visualizations that can be generated on demand when users request data explanations.”
Amazon Nova Canvas MCP server generates sample images showing how Amazon Nova Canvas can create charts, diagrams, and visual explanations based on knowledge base content, making complex information more accessible to users.
Cost Analysis: “Based on your expected usage patterns, here’s the estimated monthly cost breakdown and optimization recommendations.”
The Cost Analysis MCP Server generates a detailed cost analysis report showing projected expenses for each AWS service, identifies cost optimization opportunities such as reserved capacity for Amazon Bedrock, and provides specific recommendations to reduce costs without impacting performance.
With AWS MCP Servers, what would typically take days of research and implementation is completed in minutes, with better quality, security, and cost-efficiency than manual development could deliver in the same amount of time.
Best practices for MCP-assisted development
To maximize the benefits of MCP-assisted development while maintaining security and code quality, developers should follow these essential guidelines:

Always review generated code for security implications before deployment
Use MCP Servers as accelerators, not replacements for developer judgment and expertise
Keep MCP Servers updated with the latest AWS security best practices
Follow the principle of least privilege when configuring AWS credentials
Run security scanning tools on generated infrastructure code

Coming up in the series
This post introduced the foundations of AWS MCP Servers and how they accelerate AWS development through specialized, AWS-specific MCP servers. In upcoming posts, we’ll dive deeper into:

Detailed walkthroughs of each MCP server’s capabilities
Advanced patterns for integrating AWS MCP Servers into your development workflow
Real-world case studies showing AWS MCP Servers’ impact on development velocity
How to extend AWS MCP Servers with your own custom MCP servers

Stay tuned to learn how AWS MCP Servers can transform your specific AWS development scenarios and help you build better solutions faster. Visit our GitHub repository or the PyPI package to explore example implementations and get started today.

About the Authors
Jimin Kim is a Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team, based in Los Angeles. With specialties in Generative AI and SaaS, she loves helping her customers succeed in their business. Outside of work, she cherishes moments with her wife and three adorable calico cats.
Pranjali Bhandari is part of the Prototyping and Cloud Engineering (PACE) team at AWS, based in the San Francisco Bay Area. She specializes in Generative AI, distributed systems, and cloud computing. Outside of work, she loves exploring diverse hiking trails, biking, and enjoying quality family time with her husband and son.
Laith Al-Saadoon is a Principal Prototyping Architect on the Prototyping and Cloud Engineering (PACE) team. He builds prototypes and solutions using generative AI, machine learning, data analytics, IoT & edge computing, and full-stack development to solve real-world customer challenges. In his personal time, Laith enjoys the outdoors–fishing, photography, drone flights, and hiking.
Paul Vincent is a Principal Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team. He works with AWS customers to bring their innovative ideas to life. Outside of work, he loves playing drums and piano, talking with others through Ham radio, all things home automation, and movie nights with the family.
Justin Lewis leads the Emerging Technology Accelerator at AWS. Justin and his team help customers build with emerging technologies like generative AI by providing open source software examples to inspire their own innovation. He lives in the San Francisco Bay Area with his wife and son.
Anita Lewis is a Technical Program Manager on the AWS Emerging Technology Accelerator team, based in Denver, CO. She specializes in helping customers accelerate their innovation journey with generative AI and emerging technologies. Outside of work, she enjoys competitive pickleball matches, perfecting her golf game, and discovering new travel destinations.

Harness the power of MCP servers with Amazon Bedrock Agents

AI agents extend large language models (LLMs) by interacting with external systems, executing complex workflows, and maintaining contextual awareness across operations. Amazon Bedrock Agents enables this functionality by orchestrating foundation models (FMs) with data sources, applications, and user inputs to complete goal-oriented tasks through API integration and knowledge base augmentation. However, in the past, connecting these agents to diverse enterprise systems has created development bottlenecks, with each integration requiring custom code and ongoing maintenance—a standardization challenge that slows the delivery of contextual AI assistance across an organization’s digital ecosystem. This is a problem that you can solve by using Model Context Protocol (MCP), which provides a standardized way for LLMs to connect to data sources and tools.
Today, MCP gives agents standardized access to an expanding list of tools that you can use to accomplish a variety of tasks. In time, MCP can promote better discoverability of agents and tools through marketplaces, enable agents to share context and use common workspaces for better interaction, and scale agent interoperability across the industry.
In this post, we show you how to build an Amazon Bedrock agent that uses MCP to access data sources to quickly build generative AI applications. Using Amazon Bedrock Agents, your agent can be assembled on the fly with MCP-based tools as in this example:

InlineAgent(
    foundation_model="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    instruction="You are a friendly assistant for resolving user queries",
    agent_name="SampleAgent",
    action_groups=[
        ActionGroup(
            name="SampleActionGroup",
            mcp_clients=[mcp_client_1, mcp_client_2],
        )
    ],
).invoke(input_text="Convert 11am from NYC time to London time")

We showcase an example of building an agent to understand your Amazon Web Services (AWS) spend by connecting to AWS Cost Explorer, Amazon CloudWatch, and Perplexity AI through MCP. You can use the code referenced in this post to connect your agents to other MCP servers to address challenges for your business. We envision a world where agents have access to an ever-growing list of MCP servers that they can use for accomplishing a wide variety of tasks.
Model Context Protocol
Developed by Anthropic as an open protocol, MCP provides a standardized way to connect AI models to virtually any data source or tool. Using a client-server architecture, MCP enables developers to expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers. Through this architecture, MCP enables users to build more powerful, context-aware AI agents that can seamlessly access the information and tools they need. Whether you’re connecting to external systems or internal data stores or tools, you can now use MCP to interface with all of them in the same way. The client-server architecture of MCP enables your agent to access new capabilities as the MCP server updates without requiring any changes to the application code.
MCP architecture
MCP uses a client-server architecture that contains the following components and is shown in the following figure:

Host: An MCP host is a program or AI tool that requires access to data through the MCP protocol, such as Claude Desktop, an integrated development environment (IDE), or any other AI application.
Client: Protocol clients that maintain one-to-one connections with servers.
Server: Lightweight programs that expose specific capabilities through the standardized Model Context Protocol.
Local data sources: Your databases, local data sources, and services that MCP servers can securely access.
Remote services: External systems available over the internet through APIs that MCP servers can connect to.

Let’s walk through how to set up Amazon Bedrock agents that take advantage of MCP servers.
Using MCP with Amazon Bedrock agents
In this post, we provide a step-by-step guide for connecting your favorite MCP servers to Amazon Bedrock agents as action groups that an agent can use to accomplish tasks provided by the user. The InlineAgent SDK provides a straightforward way to create inline agents and contains a built-in MCP client implementation that gives you direct access to tools delivered by an MCP server.
As part of creating an agent, the developer creates an MCP client specific to each MCP server that the agent needs to communicate with. When invoked, the agent determines which tools are needed for the user’s task; if MCP server tools are required, it uses the corresponding MCP client to request tool execution from that server. The user code doesn’t need to be aware of the MCP protocol because that’s handled by the MCP client provided by the InlineAgent code repository.
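Under the hood, an MCP client session follows a pattern similar to the following sketch, which uses the open source MCP Python SDK directly; the server parameters are placeholders borrowed from the configuration shown earlier, and the InlineAgent MCPClient wraps this kind of session for you:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Placeholder: launch any MCP server over stdio (required env vars omitted for brevity)
    server_params = StdioServerParameters(command="uvx", args=["awslabs.core-mcp-server@latest"])
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Call a tool by name with keyword arguments (names here are placeholders)
            # result = await session.call_tool("some_tool", {"arg": "value"})

asyncio.run(main())
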
To orchestrate this workflow, you take advantage of the return control capability of Amazon Bedrock Agents. The following diagram illustrates the end-to-end flow of an agent handling a request that uses two tools. In the first flow, a Lambda-based action is taken, and in the second, the agent uses an MCP server.

Use case: transform how you manage your AWS spend across different AWS services including Amazon Bedrock
To show how an Amazon Bedrock agent can use MCP servers, let’s walk through a sample use case. Imagine asking questions like “Help me understand my Bedrock spend over the last few weeks” or “What were my EC2 costs last month across regions and instance types?” and getting a human-readable analysis of the data instead of raw numbers on a dashboard. The system interprets your intent and delivers precisely what you need—whether that’s detailed breakdowns, trend analyses, visualizations, or cost-saving recommendations. This is useful because what you’re interested in is insights rather than data. You can accomplish this using two MCP servers: a custom-built MCP server for retrieving the AWS spend data and an open source MCP server from Perplexity AI to interpret the data. You add these two MCP servers as action groups in an inline Amazon Bedrock agent. This gives you an AI agent that can transform the way you manage your AWS spend. All the code for this post is available in the GitHub repository.
Let’s walk through how this agent is created using inline agents. You can use inline agents to define and configure Amazon Bedrock agents dynamically at runtime. They provide greater flexibility and control over agent capabilities, enabling users to specify FMs, instructions, action groups, guardrails, and knowledge bases as needed without relying on pre-configured control plane settings. It’s worth noting that you can also orchestrate this behavior without inline agents by using RETURN_CONTROL with the InvokeAgent API.
MCP components in Amazon Bedrock Agents

Host: This is the Amazon Bedrock inline agent. This agent adds MCP clients as action groups that can be invoked through RETURN_CONTROL when the user asks an AWS spend-related question.
Client: You create two clients that establish one-to-one connections with their respective servers: a cost explorer client with specific cost server parameters and a Perplexity AI client with Perplexity server parameters.
Servers: You create two MCP servers that each run locally on your machine and communicate to your application over standard input/output (alternatively, you could also configure the client to talk to remote MCP servers).

An MCP server that retrieves AWS spend data from Cost Explorer and Amazon CloudWatch Logs (for Amazon Bedrock model invocation log data).
A Perplexity AI MCP server that interprets the AWS spend data.

Data sources: The MCP servers talk to remote data sources such as Cost Explorer API, CloudWatch Logs and the Perplexity AI search API.

Prerequisites
You need the following prerequisites to get started implementing the solution in this post:

An AWS account
Familiarity with FMs and Amazon Bedrock
Install AWS Command Line Interface (AWS CLI) and set up credentials
Python 3.11 or later
AWS Cloud Development Kit (AWS CDK) CLI
Enable model access for Anthropic’s Claude 3.5 Sonnet v2
Your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, so that you can set them as environment variables for the servers
Docker installed and running on your computer, because the two MCP servers run as Docker containers

The MCP servers run locally on your computer and need to access AWS services and the Perplexity API. You can read more about AWS credentials in Manage access keys for IAM users. Make sure that your credentials include AWS Identity and Access Management (IAM) read access to Cost Explorer and CloudWatch. You can do this by using the AWSBillingReadOnlyAccess and CloudWatchReadOnlyAccess managed IAM policies. You can get the Perplexity API key from the Perplexity Sonar API page.
Steps to run
With the prerequisites in place, you’re ready to implement the solution.

Navigate to the InlineAgent GitHub repository.
Follow the setup steps.
Navigate to the cost_explorer_agent folder. This folder contains the code for this post.

cd examples/mcp/cost_explorer_agent

Create a .env file in the cost_explorer_agent directory based on the example file provided in the repository. It should contain the following variables:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_REGION=
BEDROCK_LOG_GROUP_NAME=
PERPLEXITY_API_KEY=

Build the aws-cost-explorer-mcp server:

git clone https://github.com/aarora79/aws-cost-explorer-mcp-server.git
cd aws-cost-explorer-mcp-server/
docker build -t aws-cost-explorer-mcp .

You’re now ready to create an agent that can invoke these MCP servers to provide insights into your AWS spend. You can do this by running the python main.py command. The output will look like the following example. The agent connects with the two MCP servers and accesses their respective tools. Next, the agent lays out a plan to use specific tools sequentially, uses code interpreter to generate a chart showing cost distribution, and uses Cost Explorer and Perplexity API to provide information about Amazon Bedrock and the AWS account’s cost expenditure.

Connected to server with tools: ['get_bedrock_daily_usage_stats', 'get_bedrock_hourly_usage_stats', 'get_ec2_spend_last_day', 'get_detailed_breakdown_by_day']
Connected to server with tools: ['perplexity_ask']

Thought: Let me break this down into parts:
1. First, I’ll ask about Amazon Bedrock using Perplexity
2. Then I’ll get the cost breakdown for the last 7 days
3. Finally, I’ll create a visualization of the spending data

Tool use: perplexity_ask with these inputs: messages[[{role=user, content=What is Amazon Bedrock? Provide a concise explanation.}] (array)]

Tool use: get_detailed_breakdown_by_day with these inputs: days[7 (integer)]

Thought: Let me analyze the data and create a visualization of the top services by cost. I’ll use Python to process the data and create a bar chart.

Code interpreter:
Generated code
```
import pandas as pd
import matplotlib.pyplot as plt

```

Code interpreter output:
aws_costs.png (image/png)
Directory output created successfully

Thought: Now I can provide a comprehensive answer about Amazon Bedrock and the cost analysis. Amazon Bedrock is a fully managed service that provides access to foundation models (FMs) from leading AI companies through a single API. Based on the cost analysis of the last 7 days, your top spending services were:

1. Amazon Neptune: $24
2. Amazon SageMaker: $567
3. EC2 (Compute + Other): $454
4. Amazon OpenSearch Service: $34
5. Amazon Bedrock: $1235
6. Amazon Q: $178

Implementation details
Now that you understand the output produced by an agent, let’s lift the curtain and review some of the important pieces of code that produce the output.

Creating MCP clients: config.py defines the two MCP clients that talk to your two MCP servers.

Server parameters are defined for the cost explorer and Perplexity clients. The solution uses StdioServerParameters, which configures how the client should communicate over standard input/output (stdio) streams. This contains the parameters required by the server to access the required data through APIs.

# Cost server parameters
cost_server_params = StdioServerParameters(
    command="/usr/local/bin/docker",
    args=[
        "run",
        "-i",
        "--rm",
        "-e",
        "AWS_ACCESS_KEY_ID",
        "-e",
        "AWS_SECRET_ACCESS_KEY",
        "-e",
        "AWS_REGION",
        "-e",
        "BEDROCK_LOG_GROUP_NAME",
        "-e",
        "stdio",
        "aws-cost-explorer-mcp:latest",
    ],
    env={
        "AWS_ACCESS_KEY_ID": AWS_ACCESS_KEY_ID,
        "AWS_SECRET_ACCESS_KEY": AWS_SECRET_ACCESS_KEY,
        "AWS_REGION": AWS_REGION,
        "BEDROCK_LOG_GROUP_NAME": BEDROCK_LOG_GROUP_NAME,
    },
)

# Perplexity server parameters
perplexity_server_params = StdioServerParameters(
    command="/usr/local/bin/docker",
    args=["run", "-i", "--rm", "-e", "PERPLEXITY_API_KEY", "mcp/perplexity-ask"],
    env={"PERPLEXITY_API_KEY": PERPLEXITY_API_KEY},
)

In main.py, the MCP server parameters are imported and used to create your two MCP clients.

cost_explorer_mcp_client = await MCPClient.create(server_params=cost_server_params)
perplexity_mcp_client = await MCPClient.create(server_params=perplexity_server_params)

Configure agent action group: main.py creates the action group that combines the MCP clients into a single interface that the agent can access. This enables the agent to ask your application to invoke either of these MCP servers as needed through return of control.

# Create action group with both MCP clients
cost_action_group = ActionGroup(
    name="CostActionGroup",
    mcp_clients=[cost_explorer_mcp_client, perplexity_mcp_client]
)

Inline agent creation: The inline agent can be created with the following specifications:

Foundation model: Configure your choice of FM to power your agent. This can be any model provided on Amazon Bedrock. This example uses Anthropic’s Claude 3.5 Sonnet model.
Agent instruction: Provide instructions to your agent that contain the guidance and steps for orchestrating responses to user queries. These instructions anchor the agent’s approach to handling various types of queries.
Agent name: Name of your agent.
Action groups: Define the action groups that your agent can access. These can include single or multiple action groups, with each group having access to multiple MCP clients or AWS Lambda functions. Optionally, you can configure your agent to use Code Interpreter to generate, run, and test code for your application.

# Create and invoke the inline agent
await InlineAgent(
    foundation_model="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    instruction="""You are a friendly assistant that is responsible for resolving user queries.

    You have access to search, cost tool and code interpreter.

    """,
    agent_name="cost_agent",
    action_groups=[
        cost_action_group,
        {
            "name": "CodeInterpreter",
            "builtin_tools": {
                "parentActionGroupSignature": "AMAZON.CodeInterpreter"
            },
        },
    ],
).invoke(
    input_text="<user-query-here>"
)

You can use this example to build an inline agent on Amazon Bedrock that establishes connections with different MCP servers and groups their clients into a single action group for the agent to access.
Conclusion
The Anthropic MCP protocol offers a standardized way of connecting FMs to data sources, and now you can use this capability with Amazon Bedrock Agents. In this post, you saw an example of combining the power of Amazon Bedrock and MCP to build an application that offers a new perspective on understanding and managing your AWS spend.
Organizations can now offer their teams natural, conversational access to complex financial data while enhancing responses with contextual intelligence from sources like Perplexity. As AI continues to evolve, the ability to securely connect models to your organization’s critical systems will become increasingly valuable. Whether you’re looking to transform customer service, streamline operations, or gain deeper business insights, the Amazon Bedrock and MCP integration provides a flexible foundation for your next AI innovation. You can dive deeper on this MCP integration by exploring our code samples.
Here are some examples of what you can build by connecting your Amazon Bedrock Agents to MCP servers:

A multi-data source agent that retrieves data from different data sources such as Amazon Bedrock Knowledge Bases, SQLite, or even your local filesystem.
A developer productivity assistant agent that integrates with Slack and GitHub MCP servers.
A machine learning experiment tracking agent that integrates with the Opik MCP server from Comet ML for managing, visualizing, and tracking machine learning experiments directly within development environments.

What business challenges will you tackle with these powerful new capabilities?

About the authors
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.
Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin, and playing the guitar.
Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.
Andy Palmer is a Director of Technology for AWS Strategic Accounts. His teams provide Specialist Solutions Architecture skills across a number of speciality domain areas, including AIML, generative AI, data and analytics, security, network, and open source software. Andy and his team have been at the forefront of guiding our most advanced customers through their generative AI journeys and helping to find ways to apply these new tools to both existing problem spaces and net new innovations and product experiences.