NVIDIA AI Releases HOVER: A Breakthrough AI for Versatile Humanoid Control in Robotics

Robotics has taken a significant step forward. For years, the field has promised human-like robots that can navigate our environments, perform complex tasks, and work alongside us: robots conducting precise surgical procedures, building intricate structures, assisting in disaster response, and cooperating with humans in factories, offices, and homes. In practice, however, progress has been limited.

Researchers from NVIDIA, Carnegie Mellon University, UC Berkeley, UT Austin, and UC San Diego introduced HOVER, a unified neural controller aimed at enhancing humanoid robot capabilities. This research proposes a multi-mode policy distillation framework, integrating different control strategies into one cohesive policy, thereby making a notable advancement in humanoid robotics.

The Achilles Heel of Humanoid Robotics: The Control Conundrum

Imagine a robot that can execute a perfect backflip but then struggles to grasp a doorknob.

The problem? Specialization.

Humanoid robots are incredibly versatile platforms, capable of supporting a wide range of tasks, including bimanual manipulation, bipedal locomotion, and complex whole-body control. However, despite impressive advances in these areas, researchers have typically employed different control formulations designed for specific scenarios.

Some controllers excel at locomotion, using “root velocity tracking” to guide movement. This approach focuses on controlling the robot’s overall movement through space.

Others prioritize manipulation, relying on “joint angle tracking” for precise movements. This approach allows for fine-grained control of the robot’s limbs.

Still others use “kinematic tracking” of key points for teleoperation. This method enables a human operator to control the robot by tracking their own movements.

Each speaks a different control language, creating a fragmented landscape where robots are masters of one task and inept at others. Switching between tasks has been clunky, inefficient, and often impossible. This specialization creates practical limitations. For example, a robot designed for bipedal locomotion on uneven terrain using root velocity tracking would struggle to transition smoothly to precise bimanual manipulation tasks that require joint angle or end-effector tracking.

In addition, many pre-trained manipulation policies operate across different configuration spaces, such as joint angles and end-effector positions. These constraints highlight the need for a unified low-level humanoid controller that can adapt to diverse control modes.

HOVER: The Unified Field Theory of Robotic Control

HOVER is a paradigm shift. It’s a “generalist policy”—a single neural network that harmonizes diverse control modes, enabling seamless transitions and unprecedented versatility. HOVER supports diverse control modes, including over 15 useful configurations for real-world applications on a 19-DOF humanoid robot. This versatile command space encompasses most of the modes used in previous research.

Learning from the Masters: Human Motion Imitation

HOVER's brilliance lies in its foundation: learning from human movement itself. By training an "oracle motion imitator" on a massive dataset of human motion capture (MoCap) data, HOVER absorbs the fundamental principles of balance, coordination, and efficient movement. This approach exploits the natural adaptability and efficiency of human motion, providing the policy with rich motor priors that can be reused across multiple control modes. Grounding the training process in human-like motion allows the policy to develop a deeper understanding of balance, coordination, and motion control, all crucial elements of effective whole-body humanoid behavior.

From Oracle to Prodigy: Policy Distillation

The magic truly happens through "policy distillation." The oracle policy, the master imitator, teaches a "student policy" (HOVER) its skills. Through command masking and a DAgger framework, HOVER learns to master diverse control modes, from kinematic position tracking to joint angle control and root tracking, yielding a generalist capable of handling any control scenario.

The resulting multi-mode policy supports diverse control inputs and outperforms policies trained individually for each mode. The researchers hypothesize that this superior performance stems from the policy sharing physical knowledge across modes, such as maintaining balance, producing human-like motion, and controlling limbs precisely. These shared skills enhance generalization, leading to better performance across all modes, while single-mode policies often overfit to specific reward structures and training environments.

HOVER's implementation involves training an oracle policy followed by knowledge distillation to create a versatile controller. The oracle policy processes proprioceptive information, including position, orientation, velocities, and previous actions, alongside reference poses to generate optimal movements. It achieves robust motion imitation using a carefully designed reward system with penalty, regularization, and task components. The student policy then learns from this oracle through a DAgger framework, incorporating model-based and sparsity-based masking techniques that allow selective tracking of different body parts. This distillation process minimizes the action difference between teacher and student, creating a unified controller capable of handling diverse control scenarios.

The researchers formulate humanoid control as a goal-conditioned reinforcement learning task in which the policy is trained to track real-time human motion. The state includes the robot's proprioception and a unified target goal state, over which they define a reward function for policy optimization. The actions are target joint positions that are fed into a PD controller, and the system uses Proximal Policy Optimization (PPO) to maximize cumulative discounted rewards, essentially training the humanoid to follow target commands at each timestep.
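To make the masking and distillation ideas concrete, here is a minimal, illustrative sketch (not the authors' code): a unified command vector is masked per control mode, and a DAgger-style step regresses the student's action toward the oracle's on the same state. The dimensions, mode layout, placeholder network, and random actions are all assumptions chosen only to show the mechanism:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes for a 19-DOF humanoid (assumptions, not the paper's exact layout)
NUM_JOINTS = 19
NUM_KEYPOINTS = 8   # tracked body key points (hands, head, feet, ...)
ROOT_DIM = 6        # root velocity and orientation command

# Unified command vector: [keypoint targets | joint-angle targets | root command]
command = rng.normal(size=NUM_KEYPOINTS * 3 + NUM_JOINTS + ROOT_DIM)

def mode_mask(track_keypoints, track_joints, track_root):
    """Binary mask selecting which parts of the unified command space are active."""
    return np.concatenate([
        np.full(NUM_KEYPOINTS * 3, float(track_keypoints)),
        np.full(NUM_JOINTS, float(track_joints)),
        np.full(ROOT_DIM, float(track_root)),
    ])

# Example: a teleoperation-style mode that tracks key points only
mask = mode_mask(track_keypoints=True, track_joints=False, track_root=False)
masked_command = command * mask  # unselected channels are zeroed out

# DAgger-style distillation step: the student acts on the masked command, the frozen
# oracle labels the same state, and the student is regressed toward the oracle's action.
proprioception = rng.normal(size=64)                                   # placeholder robot state
student_input = np.concatenate([proprioception, masked_command])
W_student = rng.normal(size=(NUM_JOINTS, student_input.size)) * 0.01   # stand-in for the student network
student_action = W_student @ student_input
oracle_action = rng.normal(size=NUM_JOINTS)                            # would come from the oracle policy
distill_loss = np.mean((student_action - oracle_action) ** 2)
print(f"active command dims: {int(mask.sum())}, distillation loss: {distill_loss:.3f}")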

The research methodology uses motion retargeting to create feasible humanoid movements from human motion datasets. This three-step process computes keypoint positions through forward kinematics, fits the SMPL body model to align with those key points, and retargets the AMASS dataset by matching corresponding points between the two models using gradient descent. This "sim-to-data" procedure converts the large-scale human motion dataset into feasible humanoid motions, establishing a strong foundation for training the controller.

The research team also designed a comprehensive command space for humanoid control that overcomes the limitations of previous approaches. Their unified framework accommodates multiple control modes simultaneously, including kinematic position tracking, joint angle tracking, and root tracking. This design satisfies the key criteria of generality (supporting various input devices) and atomicity (enabling arbitrary combinations of control options).
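As a rough illustration of the gradient-descent matching step described above (not the actual retargeting pipeline), the following sketch fits the joint angles of a toy two-link arm so that its end point matches a target keypoint. The forward kinematics, link lengths, and optimizer settings are assumptions made purely for illustration:

import torch

# Toy 2-link planar arm standing in for the humanoid's forward kinematics (assumption).
LINK_LENGTHS = torch.tensor([0.4, 0.3])

def forward_kinematics(joint_angles):
    """Return the 2D end point of the toy arm for the given joint angles."""
    theta1 = joint_angles[0]
    theta2 = joint_angles[0] + joint_angles[1]
    x = LINK_LENGTHS[0] * torch.cos(theta1) + LINK_LENGTHS[1] * torch.cos(theta2)
    y = LINK_LENGTHS[0] * torch.sin(theta1) + LINK_LENGTHS[1] * torch.sin(theta2)
    return torch.stack([x, y])

# Target keypoint that would come from the human-motion (SMPL) side.
target_keypoint = torch.tensor([0.5, 0.3])

# Optimize the joint angles by gradient descent so the robot keypoint matches the target.
joint_angles = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.Adam([joint_angles], lr=0.05)
for step in range(300):
    optimizer.zero_grad()
    loss = torch.sum((forward_kinematics(joint_angles) - target_keypoint) ** 2)
    loss.backward()
    optimizer.step()

print("final squared keypoint error:", loss.item())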

HOVER Unleashed: Performance That Redefines Robotics

HOVER's capabilities are proven by rigorous testing:

Dominating the Specialists

HOVER outperforms specialized controllers across the board. The research team evaluated HOVER against specialist policies and alternative multi-mode training approaches through comprehensive tests in both IsaacGym simulation and real-world deployments on the Unitree H1 robot. To determine whether HOVER could outperform specialized policies, they compared it against various specialists, including ExBody, HumanPlus, H2O, and OmniH2O, each designed for a different tracking objective such as joint angles, root velocity, or specific key points.

In evaluations using the retargeted AMASS dataset, HOVER consistently demonstrated superior generalization, outperforming specialists in at least 7 out of 12 metrics in every command mode. HOVER performed better than specialists trained for specific useful control modes like left-hand, right-hand, two-hand, and head tracking.

Multi-Mode Mastery: A Clean Sweep

To compare against other multi-mode training methods, the team implemented a baseline that used the same masking process but was trained from scratch with reinforcement learning. Radar charts visualizing tracking errors across eight distinct control modes showed HOVER achieving consistently lower errors across all 32 metrics and modes. This comprehensive advantage underscores the effectiveness of distilling knowledge from an oracle policy that tracks full-body kinematics rather than training with reinforcement learning from scratch.

From Simulation to Reality: Real-World Validation

HOVER's prowess is not confined to the digital world. The experimental setup included motion tracking evaluations using the retargeted AMASS dataset in simulation and 20 standing motion sequences for real-world tests on the 19-DOF Unitree H1 platform, which weighs 51.5 kg and stands 1.8 m tall. The experiments were structured to answer three key questions about HOVER's generalizability, comparative performance, and real-world transferability.

On the Unitree H1 robot, HOVER flawlessly tracked complex standing motions and dynamic running movements, and transitioned smoothly between control modes during locomotion and teleoperation. Experiments conducted in both simulation and on the physical humanoid show that HOVER achieves seamless transitions between control modes and delivers superior multi-mode control compared to baseline approaches.

HOVER: The Future of Humanoid Potential

HOVER unlocks the vast potential of humanoid robots. Its multi-mode generalist policy also enables seamless transitions between modes, making it robust and versatile.

Imagine a future where humanoids:

Perform intricate surgery with unparalleled precision.

Construct complex structures with human-like dexterity.

Respond to disasters with agility and resilience.

Collaborate seamlessly with humans in factories, offices, and homes.

The age of truly versatile, capable, and intelligent humanoids is on the horizon, and HOVER is leading the way. The evaluations collectively illustrate HOVER's ability to handle diverse real-world control modes, offering superior performance compared to specialist policies.

Sources:

https://arxiv.org/pdf/2410.21229

https://github.com/NVlabs/HOVER/tree/main

https://arxiv.org/abs/2410.21229

Thanks to the NVIDIA team for the thought leadership and resources for this article. The NVIDIA team has supported and sponsored this content.

Prompting for the best price-performance

In the drive to remain competitive, businesses today are turning to AI to help them minimize cost and maximize efficiency. It’s incumbent on them to find the most suitable AI model—the one that will help them achieve more while spending less. For many businesses, the migration from OpenAI’s model family to Amazon Nova represents not only a shift in model but a strategic move toward scalability, efficiency, and broader multimodal capabilities.
In this blog, we discuss how to optimize prompting in Amazon Nova for the best price-performance.
Why migrate from OpenAI to Amazon Nova?
OpenAI’s models remain powerful, but their operational costs can be prohibitive when scaled. Consider these figures from Artificial Analysis:

| Model | Input Token Cost (per Million Tokens) | Output Token Cost (per Million Tokens) | Context Window | Output Speed (Tokens per Second) | Latency (Seconds to First Token) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | ~$2.50 | ~$10.00 | Up to 128K tokens | ~63 | ~0.49 |
| GPT-4o Mini | ~$0.15 | ~$0.60 | Up to 128K tokens | ~90 | ~0.43 |
| Nova Micro | ~$0.035 | ~$0.14 | Up to 128K tokens | ~195 | ~0.29 |
| Nova Lite | ~$0.06 | ~$0.24 | Up to 300K tokens | ~146 | ~0.29 |
| Nova Pro | ~$0.80 | ~$3.20 | Up to 300K tokens | ~90 | ~0.34 |

For high-volume applications, such as global customer support or large-scale document analysis, these cost differences are substantial. Not only does Amazon Nova Pro offer over three times the cost-efficiency of GPT-4o, its longer context window also enables it to handle more extensive and complex inputs.
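As a quick back-of-the-envelope check on those figures, the following snippet estimates a monthly bill from the per-million-token prices in the table above; the monthly token volumes are made-up assumptions:

# Rough monthly cost comparison using the per-million-token prices listed above.
# The workload size (tokens per month) is an illustrative assumption.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "GPT-4o": (2.50, 10.00),
    "Nova Pro": (0.80, 3.20),
    "GPT-4o Mini": (0.15, 0.60),
    "Nova Micro": (0.035, 0.14),
}

INPUT_TOKENS_PER_MONTH = 500_000_000   # assumed: 500M input tokens
OUTPUT_TOKENS_PER_MONTH = 100_000_000  # assumed: 100M output tokens

for model, (in_price, out_price) in PRICES.items():
    cost = (INPUT_TOKENS_PER_MONTH / 1e6) * in_price + (OUTPUT_TOKENS_PER_MONTH / 1e6) * out_price
    print(f"{model:12s} ~${cost:,.0f}/month")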
Breaking down the Amazon Nova suite
Amazon Nova isn’t a single model—it’s a suite designed for various needs:

Amazon Nova Pro – A robust multimodal model that can process text, images, and video. It excels at tasks such as document analysis and deep data visualization. Benchmark comparisons show Amazon Nova Pro matching or even surpassing GPT-4o on complex reasoning tasks, according to section 2.1.1 of the Nova technical report and model card.
Amazon Nova Lite – Offers a balanced mix of multimodal processing and speed. Amazon Nova Lite is ideal for applications such as document summarization, translation, and even basic visual search, delivering quality outputs at lower latency and cost compared to GPT-4o Mini. You can find these benchmark results in section 2.1.2 of the Nova Technical Report and Model Card.
Amazon Nova Micro – A text-only model engineered for ultra-low latency. With output speed of up to 195 tokens per second, Amazon Nova Micro is perfect for real-time applications such as chat-based assistants and automated FAQs. Its token costs are dramatically lower than those of GPT-4o Mini—roughly 4.3 times cheaper on a per-token basis.

The lower per-token costs and higher output per second of Amazon Nova give you the flexibility to simplify prompts for real-time applications so you can balance quality, speed, and cost for your use case.
Understanding the foundations
To make the best decision about which model family fits your needs, it's important to understand the differences in prompt engineering best practices for OpenAI and Amazon Nova. Each model family has its own strengths, but some principles apply to both: accurate, high-quality output comes from clear instructions, structured prompts, and iterative refinement. Whether you're using strong output directives or clearly defined use cases, the goal is to reduce ambiguity and improve response quality.
The OpenAI approach
OpenAI uses a layered messaging system for prompt engineering, where system, developer, and user prompts work in harmony to control tone, safety, and output format. Their approach emphasizes:

Hierarchical message roles – Setting the model’s role and behavior using system messages makes sure that the overarching safety and style guidelines (set in system prompts) are preserved
Instruction placement and delimiters – Directives are placed at the beginning, with clear separation between context, examples, and queries
Selective chain-of-thought – Detailed, step-by-step reasoning is used when it benefits complex tasks
Formatting and structure – Using strong directives such as DO, MUST, and DO NOT to provide consistent outputs (for example, in JSON)

The Amazon Nova approach

Define the prompt use case

Task – What exactly the model should do
Role – Which role the model should assume
Response style – The structure or tone of the output
Instructions – Guidelines the model must follow

Chain-of-thought

Explicitly constrain the response – Provide clear and strong instructions that limit the model's response
Structured thinking – Asking the model to think step by step encourages structured reasoning

Formatting and structure

Use delimiters to section your prompts, for example, ##Task##, ##Context##, or ##Example##
Specify the output format, for example, JSON, YAML, or Markdown
Use strong instructions and caps, such as DO, DO NOT, or MUST
Prefill the responses to guide the model, for example, start with “{“ or “json…”

Evolving your prompt strategy: Migrating from OpenAI to Amazon Nova
Transitioning to Amazon Nova isn't merely a change in API endpoints; it requires retooling your prompt engineering to align with the strengths of Amazon Nova. Start by reframing your use case definition: break your current GPT-4o or GPT-4o Mini prompt down into its core elements of task, role, response style, instructions, and success criteria, then structure these elements clearly to provide a blueprint for the model.
To understand how to migrate an existing OpenAI prompt to work optimally for Amazon Nova Pro, consider the following example using the meeting notes summarizer. Here is the GPT-4o system prompt:

You will be provided with meeting notes, and your task is to summarize the meeting
as follows:
-Overall summary of discussion
-Action items (what needs to be done and who is doing it)
-If applicable, a list of topics that need to be discussed more fully in the next meeting.

The user prompt is the meeting notes that need to be summarized:

Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters…

GPT produces this helpful response:

Overall Summary of Discussion: The team discussed the recent mission to Planet Zog,
the space pirate issue in Sector 7, the annual Intergalactic Bake-Off, planning for
the upcoming charity fundraiser, and a proposed team-building retreat. The mission to
Planet Zog was successful, but communication with the Zogians was challenging. The team
also discussed the need for a better strategy to deal with space pirates. The team placed
second in the Intergalactic Bake-Off and is planning for the upcoming charity fundraiser.
A team-building retreat to the Moon Resort and Spa was proposed.

Action Items:
1. Dr. Quasar to start working on a Zogian-English dictionary.
2. Captain Stardust to speak with Admiral Starbeam about increasing patrols in Sector 7.
3. Dr. Quasar to finalize the prototype of the new cloaking technology.
4. All team members to start gathering supplies and preparing games for the Intergalactic
Charity Bazaar.
5. Captain Stardust to check the budget for the proposed team-building retreat to the Moon
Resort and Spa.

Topics for Next Meeting:
1. Update on the Zogian-English dictionary.
2. Progress report on the cloaking technology.
3. Results of increased patrols in Sector 7.
4. Final preparations for the Intergalactic Charity Bazaar.

To meet or exceed the quality of the response from GPT-4o, here is what an Amazon Nova Pro prompt might look like. The prompt uses the same best practices discussed in this post, starting with the system prompt. We used a temperature of 0.2 and a topP of 0.9 here:

You are an experienced executive assistant skilled in meeting note analysis and
summarization. Your primary responsibilities include distilling complex discussions
into clear, actionable summaries.
Follow these instructions:

##INSTRUCTIONS##
1. Read and understand the meeting notes found in ##NOTES##
2. Put all of your outputs in a section called ##OUTPUTS## in markdown formatting
3. Summarize the meeting notes in 5 sentences or less. Put this in a section called
“Overall Summary”.
4. Numerically list any action items for specific people and what needs to be completed.
Put this list in a section called “Action Items”.
5. If applicable, list the topics that need to be discussed more fully in the next meeting.
Put this in a section called “Topics for Next Meeting”.

Here’s the user prompt, using prefilled responses:

##NOTES##
Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters
Attendees:
– Captain Stardust
– Dr. Quasar
– Lady Nebula
– Sir Supernova
– Ms. Comet
Meeting called to order by Captain Stardust at 2:05 PM
1. Introductions and welcome to our newest team member, Ms. Comet
2. Discussion of our recent mission to Planet Zog
– Captain Stardust: “Overall, a success, but communication with the Zogians was difficult.
We need to improve our language skills.”
– Dr. Quasar: “Agreed. I’ll start working on a Zogian-English dictionary right away.”
– Lady Nebula: “The Zogian food was out of this world, literally! We should consider having
a Zogian food night on the ship.”
3. Addressing the space pirate issue in Sector 7
– Sir Supernova: “We need a better strategy for dealing with these pirates. They’ve already
plundered three cargo ships this month.”
– Captain Stardust: “I’ll speak with Admiral Starbeam about increasing patrols in that area.
– Dr. Quasar: “I’ve been working on a new cloaking technology that could help our ships avoid
detection by the pirates. I’ll need a few more weeks to finalize the prototype.”
4. Review of the annual Intergalactic Bake-Off
– Lady Nebula: “I’m happy to report that our team placed second in the competition! Our Martian Mud
Pie was a big hit!”
– Ms. Comet: “Let’s aim for first place next year. I have a secret recipe for Jupiter Jello that I
think could be a winner.”
5. Planning for the upcoming charity fundraiser
– Captain Stardust: “We need some creative ideas for our booth at the Intergalactic Charity Bazaar.”
– Sir Supernova: “How about a ‘Dunk the Alien’ game? We can have people throw water balloons at a
volunteer dressed as an alien.”
– Dr. Quasar: “I can set up a ‘Name That Star’ trivia game with prizes for the winners.”
– Lady Nebula: “Great ideas, everyone. Let’s start gathering the supplies and preparing the games.”
6. Upcoming team-building retreat
– Ms. Comet: “I would like to propose a team-building retreat to the Moon Resort and Spa. It’s a great
opportunity to bond and relax after our recent missions.”
– Captain Stardust: “Sounds like a fantastic idea. I’ll check the budget and see if we can make it happen.”
7. Next meeting agenda items
– Update on the Zogian-English dictionary (Dr. Quasar)
– Progress report on the cloaking technology (Dr. Quasar)
– Results of increased patrols in Sector 7 (Captain Stardust)
– Final preparations for the Intergalactic Charity Bazaar (All)
Meeting adjourned at 3:15 PM. Next meeting scheduled for March 19th, 2050 at 2:00 PM in Conference Room 3B,
Intergalactic Headquarters.

##OUTPUTS##

The following example shows that the Amazon Nova response meets and exceeds the accuracy of the OpenAI example, formats the output in Markdown, and has found clear owners for each action item:

### Overall Summary
The meeting covered the successful mission to Planet Zog, the ongoing issue with
space pirates in Sector 7, the results of the annual Intergalactic Bake-Off, plans
for the upcoming charity fundraiser, and a proposed team-building retreat. Action items
were assigned, and the next meeting’s agenda was set.

### Action Items
1. **Dr. Quasar**: Begin working on a Zogian-English dictionary.
2. **Captain Stardust**: Speak with Admiral Starbeam about increasing patrols in Sector 7.
3. **Dr. Quasar**: Finalize the prototype for the new cloaking technology within a few weeks.
4. **Team**: Start gathering supplies and preparing games for the Intergalactic Charity Bazaar.
5. **Captain Stardust**: Check the budget for the proposed team-building retreat to the Moon Resort and Spa.

### Topics for Next Meeting
1. Update on the Zogian-English dictionary (Dr. Quasar)
2. Progress report on the cloaking technology (Dr. Quasar)
3. Results of increased patrols in Sector 7 (Captain Stardust)
4. Final preparations for the Intergalactic Charity Bazaar (All)

A few updates to the prompt achieve comparable or better results from Amazon Nova Pro at a much lower inference cost.
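To run the migrated prompt end to end, a minimal Amazon Bedrock Converse API call might look like the following sketch. It assumes the system prompt and meeting notes shown above are stored in the system_prompt and meeting_notes variables, that you have access to Amazon Nova Pro in your Region, and it reuses the temperature and topP values discussed earlier:

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

system_prompt = "You are an experienced executive assistant ..."  # the system prompt shown above
meeting_notes = "##NOTES##\nMeeting Date: March 5th, 2050 ...\n\n##OUTPUTS##"  # the user prompt shown above

response = client.converse(
    modelId="us.amazon.nova-pro-v1:0",
    system=[{"text": system_prompt}],
    messages=[{"role": "user", "content": [{"text": meeting_notes}]}],
    inferenceConfig={"temperature": 0.2, "topP": 0.9, "maxTokens": 1024},
)

print(response["output"]["message"]["content"][0]["text"])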
Employ Amazon Nova extended context
Amazon Nova Lite and Amazon Nova Pro can support up to 300,000 input tokens, which means that you can include more context in your prompt if needed. Expand your background data and detailed instructions accordingly—if your original OpenAI prompt was optimized for 128,000 tokens, adjust it to use the Amazon Nova extended window.
Tailor output constraints
If your GPT prompt required strict formatting (for example, “Respond in JSON only”), make sure that your Amazon Nova prompt includes these directives. Additionally, if your task involves multimodal inputs, specify when to include images or video references.
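For example, one way to enforce a JSON-only output with Amazon Nova is to combine a strong directive in the prompt with a prefilled assistant turn, as in the following sketch; the extraction task, keys, and notes are illustrative placeholders:

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

messages = [
    {
        "role": "user",
        "content": [{
            "text": (
                "Extract the meeting date and location from the notes below. "
                "You MUST respond in JSON only, with the keys \"date\" and \"location\".\n\n"
                "##NOTES##\nMeeting Date: March 5th, 2050\nLocation: Conference Room 3B"
            )
        }],
    },
    # Prefilling the assistant turn with "{" nudges the model to continue as JSON.
    {"role": "assistant", "content": [{"text": "{"}]},
]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    inferenceConfig={"temperature": 0.2, "topP": 0.9},
)

# The prefilled "{" is part of the conversation, so prepend it to the continuation.
print("{" + response["output"]["message"]["content"][0]["text"])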
Function calling
The rise of generative AI agents has made function calling, or tool calling, one of the most important abilities of a given large language model (LLM). A model’s ability to correctly pick the right tool for the job, in a low-latency manner, is often the difference between success and failure of an agentic system.
Both OpenAI and Amazon Nova models share similarities in function calling, in particular their support for structured API calls. Both model families support tool selection through defined tool schemas, which we discuss later in this post. They also both provide a mechanism to decide when to invoke these tools or not.
OpenAI’s function calling uses flexible JSON schemas to define and structure API interactions. The models support a wide range of schema configurations, which give developers the ability to quickly implement external function calls through straightforward JSON definitions tied to their API endpoints.
Here is an example of a function:
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Montevideo, Uruguay"
                }
            },
            "required": [
                "location"
            ],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the weather like in Punta del Este today?"}],
    tools=tools
)
Similar to OpenAI's approach, Amazon Nova can call tools when passed a configuration schema, as shown in the following code example. Amazon Nova relies heavily on Greedy Decoding when calling tools, and it's advised to set temperature, topP, and topK to 1 so that the model has the highest accuracy in tool selection. These Greedy Decoding parameters, along with other examples of tool use, are covered in detail in Tool use (function calling) with Amazon Nova.
The following is an example of function calling without using additionalModelRequestFields:
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_recipe",
            "description": "Structured recipe generation system",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "recipe": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "ingredients": {
                                    "type": "array",
                                    "items": {
                                        "type": "object",
                                        "properties": {
                                            "item": {"type": "string"},
                                            "amount": {"type": "number"},
                                            "unit": {"type": "string"}
                                        }
                                    }
                                },
                                "instructions": {
                                    "type": "array",
                                    "items": {"type": "string"}
                                }
                            },
                            "required": ["name", "ingredients", "instructions"]
                        }
                    }
                }
            }
        }
    }]
}

import boto3

# Amazon Bedrock Runtime client used for the Converse API
client = boto3.client("bedrock-runtime")

# Base configuration without topK=1
input_text = "I need a recipe for chocolate lava cake"
messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

# Inference parameters
inf_params = {"topP": 1, "temperature": 1}

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params
)
# Typically produces less structured or incomplete output
The following example shows how function calling accuracy can be improved by using additionalModelRequestFields:
# Enhanced configuration with topK=1
response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Produces more accurate and structured function call

To maximize Amazon Nova's function calling potential and improve accuracy, always use additionalModelRequestFields with topK=1. This forces the model to select the single most probable token and prevents random token selection, which increases deterministic output generation and improves function call precision by about 30–40%.
The following code examples further explain how to conduct tool calling successfully. The first scenario shows recipe generation without an explicit tool. The example doesn’t use topK, which typically results in responses that are less structured:
input_text = """
I'm looking for a decadent chocolate dessert that's quick to prepare.
Something that looks fancy but isn't complicated to make.
"""

messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    inferenceConfig={"topP": 1, "temperature": 1}
)
# Generates a conversational recipe description
# Less structured, more narrative-driven response
In this example, the scenario shows recipe generation with a structured tool. We add topK set to 1, which produces a more structured output:
response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig={"topP": 1, "temperature": 1},
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Generates a highly structured, JSON-compliant recipe
# Includes precise ingredient measurements
# Provides step-by-step instructions
Overall, OpenAI offers more flexible, broader schema support. Amazon Nova provides more precise, controlled output generation and is the best choice when working with high-stakes, structured data scenarios, as demonstrated in Amazon Nova’s performance on the IFEval benchmark discussed in section 2.1.1 of the technical report and model card. We recommend using Amazon Nova for applications requiring predictable, structured responses because its function calling methodology provides superior control and accuracy.
Conclusion
The evolution from OpenAI's models to Amazon Nova represents a significant shift in how organizations use AI: a transition toward models that deliver similar or superior performance at a fraction of the cost, with expanded capabilities in multimodal processing and extended context handling.
Whether you’re using the robust, enterprise-ready Amazon Nova Pro, the agile and economical Amazon Nova Lite, or the versatile Amazon Nova Micro, the benefits are clear:

Cost savings – With token costs up to four times lower, businesses can scale applications more economically
Enhanced response performance – Faster output speeds (up to 195 tokens per second) make real-time applications more viable
Expanded capabilities – A larger context window and multimodal support unlock new applications, from detailed document analysis to integrated visual content

By evolving your prompt strategy (redefining use cases, exploiting the extended context, and iteratively refining instructions), you can smoothly migrate your existing workflows from OpenAI's GPT-4o and GPT-4o Mini models to the innovative world of Amazon Nova.

About the Authors
Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio excels at guiding customers through their generative AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.
Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.

Evaluate models or RAG systems using Amazon Bedrock Evaluations – No …

Organizations deploying generative AI applications need robust ways to evaluate their performance and reliability. When we launched LLM-as-a-judge (LLMaaJ) and Retrieval Augmented Generation (RAG) evaluation capabilities in public preview at AWS re:Invent 2024, customers used them to assess their foundation models (FMs) and generative AI applications, but asked for more flexibility beyond Amazon Bedrock models and knowledge bases. Today, we’re excited to announce the general availability of these evaluation features in Amazon Bedrock Evaluations, along with significant enhancements that make them fully environment-agnostic.
The general availability release introduces “bring your own inference responses” capabilities for both RAG evaluation and model evaluation. This means you can now evaluate a RAG system or model—whether running on Amazon Bedrock, other cloud providers, or on premises—as long as you provide evaluation data in the required format. You can even evaluate the final responses of a fully built application. We’ve also added new citation metrics for the already-powerful RAG evaluation suite, including citation precision and citation coverage, to help you better assess how accurately your RAG system uses retrieved information. These metrics provide deeper insights into whether responses are properly grounded in the source documents and if they’re using the most relevant retrieved passages.
In this post, we explore these new features in detail, showing you how to evaluate both RAG systems and models with practical examples. We demonstrate how to use the comparison capabilities to benchmark different implementations and make data-driven decisions about your AI deployments. Whether you’re using Amazon Bedrock or other environments for your generative AI applications, these evaluation tools can help you optimize performance and promote quality across your entire generative AI portfolio.
What’s new: RAG evaluation
In this section, we discuss what’s new with RAG evaluation, including the introduction of Bring Your Own Inference responses for evaluating external RAG systems and new citation metrics that measure precision and coverage of RAG citations.
Bring Your Own Inference responses
The RAG evaluation capabilities have been significantly expanded to allow evaluation of outputs from RAG systems hosted anywhere through the evaluation environment in Amazon Bedrock. With Bring Your Own Inference (BYOI) responses, you can now evaluate retrieval and generation results from a variety of sources, including other FM providers, custom-built RAG systems, or deployed open-weights solutions, by providing the outputs in the required format. Additionally, the new citation metrics, together with our previously released quality and responsible AI metrics, provide deeper insights into how well RAG systems use their knowledge bases and source documents.
When evaluating retrieval results from a knowledge base, we allow one knowledge base to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn currently has a prompt and referenceResponses. When using BYOI, you must provide your retrieval results in a new field called output (this field is required for BYOI jobs but not needed for non-BYOI jobs). You must provide a knowledgeBaseIdentifier for every output. Additionally, for every retrieval result you bring, you can provide a name and additional metadata in the form of key-value pairs. The following is the input JSONL format for RAG evaluation jobs (Retrieve). Fields marked with ? are optional.
In the public preview period that ended March 20, 2025, the following dataset format contained a key called referenceContexts. This field has now been changed to referenceResponses to align with the Retrieve and Generate evaluation. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages or chunks retrieved from the knowledge base.

{
    "conversationTurns": [
        {
            "prompt": {
                "content": [
                    {
                        "text": string
                    }
                ]
            },
            "referenceResponses"?: [
                {
                    "content": [
                        {
                            "text": string
                        }
                    ]
                }
            ],
            "output"?: {
                "knowledgeBaseIdentifier": string
                "retrievedResults": {
                    "retrievalResults": [
                        {
                            "name"?: string
                            "content": {
                                "text": string
                            },
                            "metadata"?: {
                                [key: string]: string
                            }
                        }
                    ]
                }
            }
        }
    ]
}

When evaluating retrieval and generation results from a knowledge base or RAG system, we allow one knowledge base (RAG system) to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn has a prompt and referenceResponses. When using BYOI, you must provide the generated text, retrieved passages (for example, retrieval results), and citations for the generated text in a new field called output (this field is required for BYOI jobs but not needed for non-BYOI jobs). The citations field will be used by the citation metrics. If you don’t have citations, you can enter dummy data in those fields and make sure to not select the citation precision or citation coverage metrics for the evaluation job. You must also provide a knowledgeBaseIdentifier for every output you bring. This knowledgeBaseIdentifier should be the same for all lines in the dataset and also at the job level. The following is the input JSONL format for RAG evaluation jobs (Retrieve and Generate). Fields marked with ? are optional.

{
    "conversationTurns": [
        {
            "prompt": {
                "content": [
                    {
                        "text": string
                    }
                ]
            },
            "referenceResponses"?: [
                {
                    "content": [
                        {
                            "text": string
                        }
                    ]
                }
            ],
            "output"?: {
                "text": string
                "modelIdentifier"?: string
                "knowledgeBaseIdentifier": string
                "retrievedPassages": {
                    "retrievalResults": [
                        {
                            "name"?: string
                            "content": {
                                "text": string
                            },
                            "metadata"?: {
                                [key: string]: string
                            }
                        }
                    ]
                },
                "citations": [
                    {
                        "generatedResponsePart": {
                            "textResponsePart": {
                                "span": {
                                    "start": number
                                    "end": number
                                },
                                "text": string
                            }
                        },
                        "retrievedReferences": [
                            {
                                "name"?: string
                                "content": {
                                    "text": string
                                },
                                "metadata"?: {
                                    [key: string]: string
                                }
                            }
                        ]
                    }
                ]
            }
        }
    ]
}

Citation metrics
Citation precision measures how accurately a RAG system cites its sources by evaluating whether the cited passages actually contain information used in the response. This metric helps identify when responses include unnecessary or irrelevant citations—a common issue where models might cite passages that weren’t actually used to generate the response content. The metric produces a score between 0–1, where 1 indicates perfect citation precision (all cited passages were relevant and used in the response), and 0 indicates complete imprecision (none of the cited passages should have been used). This helps teams identify and fix cases where their RAG systems are citing passages indiscriminately rather than selectively citing only the passages that contributed to the response.
Citation coverage evaluates how well a response’s content is supported by its citations, focusing on whether all information derived from retrieved passages has been properly cited. This metric works by comparing the faithfulness of the response to both the cited passages and the full set of retrieved passages, then calculating their ratio. In cases where the response is completely unfaithful to the retrieved passages (for example, hallucination), the metric indicates that citation coverage assessment isn’t applicable. The metric provides a score from 0–1, where scores closer to 0 indicate that the model failed to cite relevant supporting passages, and scores closer to 1 suggest proper citation of information from the retrieved passages. Importantly, a perfect score of 1 doesn’t necessarily mean all information in the response is cited—rather, it indicates that all information that could have been cited (based on the retrieved passages) was properly cited. This helps teams identify when their RAG systems are missing opportunities to cite relevant source material.
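As a minimal sketch of the ratio described above, assuming you already have judge-produced faithfulness scores for the cited passages and for the full set of retrieved passages (the two input scores below are illustrative placeholders, because the scoring itself is done by the evaluator model):

def citation_coverage(faithfulness_to_cited: float, faithfulness_to_retrieved: float):
    """Ratio of faithfulness to cited passages versus faithfulness to all retrieved passages.

    Returns None when the response is not grounded in the retrieved passages at all
    (for example, a hallucination), mirroring the "not applicable" case described above.
    """
    if faithfulness_to_retrieved == 0:
        return None
    return min(faithfulness_to_cited / faithfulness_to_retrieved, 1.0)

# Illustrative placeholder scores, as if produced by the evaluator model
print(citation_coverage(faithfulness_to_cited=0.6, faithfulness_to_retrieved=0.9))  # ~0.67: some support went uncited
print(citation_coverage(faithfulness_to_cited=0.0, faithfulness_to_retrieved=0.0))  # None: coverage not applicable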
What’s new: Model evaluation
In this section, we discuss what’s new with model evaluation, including expanded Bring Your Own Inference capabilities that allow you to evaluate model responses from external providers and deployed solutions through both LLMaaJ and human evaluation workflows.
Bring Your Own Inference responses
The model evaluation capabilities have been significantly expanded to allow evaluation of model outputs from anywhere through the Amazon Bedrock evaluation environment. With BYOI responses, you can now evaluate model responses from other FM providers or deployed solutions by providing the outputs in the required format. BYOI is available for both LLMaaJ and human evaluation workflows.
You don’t have to limit these evaluations to foundation models. Because BYOI evaluation takes in the prompt and the output, it can evaluate the final response of a full application if you choose to bring that into your dataset.
When using LLMaaJ, only one model can be evaluated per evaluation job. Consequently, you must provide just a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLMaaJ with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

When using human evaluation, up to two models can be evaluated per evaluation job. Consequently, you can provide up to two entries in the modelResponses list, with each response requiring a unique identifier. The modelResponses field is required for BYOI jobs but not needed for non-BYOI jobs. The following is the input JSONL format for human evaluation with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        },
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

Feature overview
The LLMaaJ evaluation workflow with BYOI enables you to systematically assess model outputs from your preferred source using Amazon Bedrock model evaluation capabilities. The process follows a straightforward flow, starting with selecting an evaluator model and configuring BYOI, then choosing appropriate metrics, uploading your evaluation datasets, running the assessment, and finally analyzing the detailed results. This comprehensive evaluation pipeline (as illustrated in the following diagram) helps provide consistent quality across your AI implementations, regardless of where they are deployed.

For RAG system evaluation, the workflow incorporates additional components and metrics specific to context retrieval analysis. The process begins with choosing an evaluator model, then selecting between retrieval-only or retrieve-and-generate evaluation modes. With BYOI support, you can evaluate RAG outputs from any source while using powerful LLMaaJ metrics to assess the quality of your retrievals or full end-to-end retrieve-and-generate flow of your RAG system. This end-to-end evaluation framework (see the following diagram) provides deep insights into how effectively your RAG system uses its retrieved context and generates accurate, well-supported responses.

Prerequisites
To use the LLMaaJ model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

AWS account and model access:

An active AWS account
Selected evaluator models are enabled in Amazon Bedrock (verify on the Model access page of the Amazon Bedrock console)
Confirmed AWS Regions where the models are available and their quotas

AWS Identity and Access Management (IAM) and Amazon Simple Storage Service (Amazon S3) configuration:

Completed IAM setup and permissions for both model and RAG evaluation
Configured S3 bucket with appropriate permissions for accessing and writing output data
Enabled CORS on your S3 bucket

Dataset description and preparation
To demonstrate an LLMaaJ evaluation job with BYOI, we created a shopping math problems dataset with a third-party model. The dataset contains 30 discount calculation problems, each with a prompt, reference answer, and model response following the required BYOI format. Each record uses the “third-party-model” identifier in the modelResponses array, enabling the LLMaaJ evaluator to assess response quality against your chosen metrics.
The dataset we used for a RAG evaluation job with BYOI was created using Amazon’s 10-K SEC filing information. Each record in the dataset contains questions about Amazon’s corporate information (such as SEC file numbers, operating segments, and fiscal year reporting), reference responses, and third-party model-generated answers with their corresponding retrieved passages and citations. The dataset follows the required BYOI format, with each record using “third-party-RAG” as the knowledge base identifier. This structure enables evaluation of both retrieve and generate quality metrics and citation accuracy, demonstrating how organizations can assess RAG systems regardless of where they’re deployed.
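If you want to assemble a similar dataset yourself, a minimal sketch of writing one BYOI record to a JSONL file and uploading it to Amazon S3 might look like the following; the bucket, prefix, prompt, and responses are placeholders:

import json
import boto3

# One record in the LLMaaJ BYOI format described above (values are placeholders).
record = {
    "prompt": "If books cost $10.21 each and you buy 15 with a 20% discount, what is the total?",
    "referenceResponse": "The total price will be $122.52.",
    "category": "Shopping Math",
    "modelResponses": [
        {
            "response": "15 x $10.21 = $153.15; a 20% discount removes $30.63, so the total is $122.52.",
            "modelIdentifier": "third-party-model",  # must match the source name used in the evaluation job
        }
    ],
}

# Write the dataset as JSON Lines (one JSON object per line) and upload it to S3.
with open("byoi_dataset.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

s3 = boto3.client("s3")
s3.upload_file("byoi_dataset.jsonl", "<YOUR_S3_BUCKET_NAME>", "<YOUR_BUCKET_PREFIX>/byoi_dataset.jsonl")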
Start an LLMaaJ evaluation job with BYOI using the Amazon Bedrock console
In this first example, we use the prompt and inference responses dataset published in our amazon-bedrock-samples repository. You can use LLMaaJ with BYOI responses on Amazon Bedrock model evaluation to assess model performance through a user-friendly AWS Management Console interface. Follow these steps to start an evaluation job:

On the Amazon Bedrock console, choose Inference and Assessment in the navigation pane and then choose Evaluations.
On the Evaluations page, choose the Models tab.
In the Model evaluations section, choose Create and choose Automatic: Model as a judge.
On the Create automatic evaluation page, enter a name and description and under Evaluation model, select a model. This model will be used as a judge to evaluate your inference responses.
Optionally, choose Tags, and create your own tags.
Under Inference source, choose Bring your own inference responses as the inference source.
For Source name, enter a name that is the same as the modelIdentifier in your prompt and inference response dataset.

{
    "prompt": "If books cost \$10.21 each and you buy 15 of them with a 20% discount, how much will you pay in total?",
    "referenceResponse": "The total price will be \$122.52. Original price: \$153.15 minus 20% discount (\$30.630000000000003)",
    "category": "Shopping Math",
    "modelResponses": [{
        "response": "To determine the total cost of buying 15 books with a 20% discount, we can follow these steps:\n\n1. Calculate the total cost of the books before the discount.\n2. Determine the discount amount.\n3. Subtract the discount amount from the total cost before the discount.\n\n**Step 1: Calculate the total cost of the books before the discount.**\n\nThe cost of one book is \$10.21. Therefore, the cost of 15 books is:\n\[ 15 \times 10.21 = 153.15 \]\n\n**Step 2: Determine the discount amount.**\n\nThe discount is 20%, so we need to find 20% of \$153.15. To do this, we multiply \$153.15 by 0.20:\n\[ 153.15 \times 0.20 = 30.63 \]\n\n**Step 3: Subtract the discount amount from the total cost before the discount.**\n\nThe total cost after the discount is:\n\[ 153.15 - 30.63 = 122.52 \]\n\nTherefore, the total amount you will pay is:\n\[ \boxed{122.52} \]",
        "modelIdentifier": "third-party-model"
    }]
}

Under Metrics, select the metrics you want to use to evaluate the model response (such as helpfulness, correctness, faithfulness, relevance, and harmfulness).
Under Datasets, for Choose a prompt dataset and Evaluation results, enter the S3 URIs or choose Browse S3 to select them.
Under Amazon Bedrock IAM role – Permissions, select or create an IAM service role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, and the models being used in the job. If you create a new IAM role in the evaluation setup, the service will automatically give the role the proper permissions for the job.
Specify the folder in the S3 bucket where the results of the model evaluation will be stored and choose Create.

An input dataset validator checks that the format of the input prompt dataset is correct when you choose Create, and it will help you correct any formatting errors.

You will see that the evaluation job is In Progress. Wait for the job status to change to Complete. This may take minutes or hours, depending on how long your prompts and responses are and whether you used a few prompts or hundreds.

When the job is complete, choose the job to see its details and metrics summary.
To view generation metrics details, scroll down in the model evaluation report and choose one of the individual metrics (like helpfulness or correctness) to see its detailed breakdown.
To see each record’s prompt input, generation output (from bring your own inference responses), ground truth (which is optional), and individual scores, choose a metric and choose Prompt details.
Hover over an individual score to view its detailed explanation.

Start an LLMaaJ evaluation job with BYOI using the Python SDK and APIs
To use the Python SDK for creating an LLMaaJ model evaluation job with your own inference responses, use the following steps.

Set up the required configurations:

import boto3
import json
import random
from datetime import datetime
import botocore

# AWS Configuration
REGION = "us-east-1"
ROLE_ARN = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
PREFIX = "<YOUR_BUCKET_PREFIX>"
dataset_custom_name = "<YOUR_BYOI_DATASET_NAME>"  # without the ".jsonl" extension

# Initialize AWS clients
bedrock_client = boto3.client('bedrock', region_name=REGION)
s3_client = boto3.client('s3', region_name=REGION)

Use the following function to create an LLMaaJ model evaluation job with your own inference responses. The precomputedInferenceSource parameter is used when you’re bringing your own pre-generated inference responses. The inferenceSourceIdentifier value must match the model identifier you used in your evaluation dataset. This identifier represents the model that generated the responses you want to evaluate.

def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    evaluator_model_id: str,
    dataset_name: str = None,
    task_type: str = "General"  # must be General for LLMaaJ
):
    # All available LLM-as-a-judge metrics
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness",
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        "precomputedInferenceSource": {
                            "inferenceSourceIdentifier": "third-party-model"
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response

    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise

Use the following code to create an LLMaaJ evaluation job, specifying the evaluator model and configuring all necessary parameters such as input dataset location, output path, and task type. The job configuration creates an evaluation task that will assess your model inference responses, with results being stored in the designated S3 bucket for later analysis.

# Job Configuration
evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
job_name = f"llmaaj-third-party-model-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# S3 Paths
input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
output_path = f"s3://{BUCKET_NAME}/{PREFIX}"

# Create evaluation job
try:
    llm_as_judge_response = create_llm_judge_evaluation(
        client=bedrock_client,
        job_name=job_name,
        role_arn=ROLE_ARN,
        input_s3_uri=input_data,
        output_s3_uri=output_path,
        evaluator_model_id=evaluator_model,
        task_type="General"
    )
    print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
except Exception as e:
    print(f"✗ Failed to create evaluation job: {str(e)}")
    raise

Monitor the progress of your evaluation job:

# Get job ARN based on job type
evaluation_job_arn = llm_as_judge_response['jobArn']

# Check job status
check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)
print(f"Job Status: {check_status['status']}")

Start a RAG evaluation job with BYOI using the Amazon Bedrock console
When dealing with RAG systems, it’s important to evaluate the quality of retrieval and of retrieval and generation together. We already provide several important metrics for those workflows, and now you can also evaluate the citation quality of your RAG system. You can include citations when using Amazon Bedrock Knowledge Bases. In this section, we explore how you can use a RAG evaluation job to surface two important citation metrics. If your custom RAG system uses citations, you can also plug your citation information into your input dataset with BYOI and use the citation precision and citation coverage metrics. Best practice is to use both citation metrics together to get a complete view of the citation quality.

To get started using the console, complete the following steps:

On the Amazon Bedrock console, under Inference and Assessments in the navigation pane, choose Evaluations.
Choose the RAG tab.
Choose Create. This will take you to a new page where you can set up your RAG evaluation job.
Under Evaluation details, enter a name and description and choose an Evaluator model. In the example, we chose Claude 3 Haiku for demonstration, but we provide a choice of several judge models for flexibility. This model will be used as a judge to evaluate your inference responses.
Under Inference source, choose Bring your own inference responses as the inference source.
For Source name, enter a name that is the same as the modelIdentifier in your prompt and inference response dataset. For example, the following is a snippet of one record in our evaluation dataset:

{
    "conversationTurns": [{
        "prompt": {
            "content": [{
                "text": "What is Amazon's SEC file number?"
            }]
        },
        "referenceResponses": [{
            "content": [{
                "text": "Amazon's SEC file number is 000-22513."
            }]
        }],
        "output": {
            "text": "Amazon's SEC file number is 000-22513.",
            "modelIdentifier": "third-party-model",
            "knowledgeBaseIdentifier": "third-party-RAG",
            "retrievedPassages": {
                "retrievalResults": [{
                    "content": {
                        "text": "Commission File No. 000-22513"
                    }
                }, {
                    "content": {
                        "text": "AMAZON.COM, INC. (Exact name of registrant as specified in its charter)"
                    }
                }]
            },
            "citations": [{
                "generatedResponsePart": {
                    "textResponsePart": {
                        "span": {
                            "start": 0,
                            "end": 11
                        },
                        "text": "Amazon's SEC"
                    }
                },
                "retrievedReferences": [{
                    "content": {
                        "text": "UNITED STATESSECURITIES AND EXCHANGE COMMISSION"
                    }
                }]
            }, {
                "generatedResponsePart": {
                    "textResponsePart": {
                        "span": {
                            "start": 12,
                            "end": 22
                        },
                        "text": "file number"
                    }
                },
                "retrievedReferences": [{
                    "content": {
                        "text": "Commission File No. 000-22513"
                    }
                }]
            }, {
                "generatedResponsePart": {
                    "textResponsePart": {
                        "span": {
                            "start": 23,
                            "end": 33
                        },
                        "text": "is 000-22513"
                    }
                },
                "retrievedReferences": [{
                    "content": {
                        "text": "Commission File No. 000-22513"
                    }
                }]
            }]
        }
    }]
}

Therefore, we use the source name third-party-RAG.

Under Metrics, select the citation metrics Citation precision and Citation coverage. You can also select other metrics.
Under Dataset and evaluation results S3 location, choose Browse S3 to enter the S3 URI for the evaluation input file, and specify the output location.
Under Amazon Bedrock IAM role – Permissions, create a new service role or use an existing one.
Choose Create.

If you see an error in the S3 URI, such as, “Your S3 bucket does not have the required CORS settings,” then you might need to edit your CORS setting on the bucket that has your data. For more information, see Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets.

You can monitor your job; the status will be In progress while the job is running.

When the job status is Completed, you can follow the link on a completed job to review results.

The results include a Metric summary section. In this particular example, both citation precision and citation coverage are high.

You can also set metric criteria to make sure the RAG system is performing up to your standards, and view example conversations as shown below each metric.

Start a RAG evaluation job with BYOI using the Python SDK and APIs
To use the Python SDK for creating a RAG evaluation job with your own inference responses, follow these steps (or refer to our example notebook):

Set up the required configurations, which should include your model identifier for the evaluator, IAM role with appropriate permissions, S3 paths for input data containing your inference responses, and output location for results:

import boto3
from datetime import datetime

# Configure evaluator model and job settings
evaluator_model = "<YOUR_EVALUATOR_MODEL>"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
PREFIX = "<YOUR_BUCKET_PREFIX>"
RAG_dataset_custom_name = "<YOUR_RAG_BYOI_DATASET_NAME>"  # without the ".jsonl" extension

# Specify S3 locations
input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{RAG_dataset_custom_name}.jsonl"
output_path = f"s3://{BUCKET_NAME}/{PREFIX}/"

# Create Bedrock client
bedrock_client = boto3.client("bedrock")

Use the following code to create a RAG evaluation job with your own pre-generated retrieval and generation responses. The precomputedRagSourceConfig parameter is used when you’re bringing your own pre-generated RAG responses for evaluation. The ragSourceIdentifier value must match the identifier you used in your evaluation dataset. This identifier represents the RAG system that produced the responses you want to evaluate. The code sets up an automated evaluation that will assess various dimensions of your RAG system’s performance, including correctness, completeness, helpfulness, logical coherence, faithfulness, and citation quality against the provided dataset.

retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=retrieve_generate_job_name,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [
            {
                "precomputedRagSourceConfig": {
                    "retrieveAndGenerateSourceConfig": {
                        "ragSourceIdentifier": "third-party-RAG"  # Replace with your identifier
                    }
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "QuestionAndAnswer",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness",
                    "Builtin.CitationPrecision",
                    "Builtin.CitationCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions.
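For example, a minimal status check might look like the following; the job ARN is returned by create_evaluation_job, and the status moves from InProgress to Completed:

# Retrieve the job ARN returned by create_evaluation_job and check the job status
job_arn = retrieve_generate_job["jobArn"]

job = bedrock_client.get_evaluation_job(jobIdentifier=job_arn)
print(job["status"])  # "InProgress" while running, "Completed" when finished

# When the status is Completed, detailed results are available under output_path in S3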

Clean up
To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of the post.
Conclusion
The general availability release of LLM-as-a-judge and RAG evaluation on Amazon Bedrock brings environment-agnostic “bring your own inference” capabilities, allowing organizations to evaluate RAG systems or models regardless of where they run. New citation metrics released as part of the RAG evaluation metrics list enhance information quality assessment by providing measurements of citation precision and coverage, enabling data-driven comparisons across different implementations.
As organizations deploy more generative AI applications, robust evaluation becomes critical to maintaining quality, reliability, and responsible use. We encourage you to explore these new features through the Amazon Bedrock console and the examples provided in the AWS Samples GitHub repository. By implementing regular evaluation workflows, you can continuously improve your models and RAG systems to deliver the highest quality outputs for your specific use cases.
We encourage you to explore these capabilities of Amazon Bedrock Evaluations and discover how systematic evaluation can enhance your generative AI applications.

About the authors
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.
Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Fine-tune large language models with reinforcement learning from human …

Large language models (LLMs) can be used to perform natural language processing (NLP) tasks ranging from simple dialogues and information retrieval tasks, to more complex reasoning tasks such as summarization and decision-making. Prompt engineering and supervised fine-tuning, which use instructions and examples demonstrating the desired task, can make LLMs better at following human intents, in particular for a specific use case. However, these methods often result in LLMs expressing unintended behaviors such as making up facts (hallucinations), generating biased or toxic text, or simply not following user instructions. This leads to responses that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.
Supervised learning can help tune LLMs by using examples demonstrating some desired behaviors, which is called supervised fine-tuning (SFT). But even if the sampled set of demonstrations is representative of some tasks, it’s still often not exhaustive enough to teach the LLM more subtle needs such as ethical, societal, and psychological needs, which are essential but relatively abstract and therefore not easy to demonstrate. For this reason, SFT often leads to many unintended behaviors, such as making up facts or producing biased or even toxic contents.
Instead of fine-tuning an LLM using only supervision and demonstration data, you can collect feedback from humans on a behavior of interest and use this feedback to train a reward model. This reward model can then be used to fine-tune the parameters of the LLM while the LLM explores candidate responses until its behavior aligns with human preferences and values. This method is called reinforcement learning from human feedback (Ouyang et al. 2022). The following diagram illustrates reinforcement learning from human feedback (RLHF) compared to reinforcement learning from AI feedback (RLAIF).

Recently, Lee et al. (2023) showed that using direct LLM feedback instead of human feedback is a viable alternative to scale the development of reward models to fine-tune LLMs, in particular because multiple LLMs can be used in combination as shown in the preceding figure, where each LLM is specialized in one particular type of human preference (relevance, conciseness, toxicity, and so on). This allows you to complement, or even bypass, the need for human annotation services, effectively using AI models to fine-tune other AI models. This technique is known as superalignment using RLAIF. Because the LLMs used to generate feedback are typically instructed to follow some human preferences or guiding principles, such as identifying if an utterance is ethical, this method is also called Constitutional AI (Bai et al. 2022). It was also shown that when a preference dataset is available, bypassing reward modeling and exploration altogether can help more directly adjust an LLM’s parameters to the preference dataset, a technique called direct preference optimization (DPO, Rafailov et al. 2024).
Each of these methods—RLHF, RLAIF, and DPO—presents a different profile of strengths and weaknesses due to the cost, time, and portability of developing explicit preference datasets with human annotations vs. reward models. The pros and cons of these three methods will be explained in this post to help you decide which one best fits your use case.
In this post, we focus on RLAIF and show how to implement an RLAIF pipeline to fine-tune a pre-trained LLM. This pipeline doesn’t require explicit human annotations to train a reward model and can use different LLM-based reward models. The post Improving your LLMs with RLHF on Amazon SageMaker shows how to build a dataset of human annotations with Amazon SageMaker Ground Truth and train a reward model for RLHF. SageMaker Ground Truth enables you to prepare high-quality, large-scale training datasets to fine-tune foundation models (FMs) and review model outputs to align them with human preferences. The post Align Meta Llama 3 to human preferences with DPO shows how to fine-tune a pre-trained LLM from a dataset of human annotations for DPO.
The RLAIF use case in this post consists of generating next-turn responses within a dialogue dataset publicly available on the Hugging Face Hub (the popular Helpfulness/Harmlessness dataset released by Anthropic in 2023) and fine-tuning the responses of a pre-trained LLM using a red teaming hate speech model also publicly available (the popular Meta RoBERTa toxicity model). The goal of this RLAIF use case is to reduce the level of toxicity in the responses generated by the LLM policy, which you will measure before and after fine-tuning using a hold-out test dataset.
This post has three main sections:

Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO
Categories of human preference reward models for RLHF/RLAIF
Implementation of an RLAIF use case

Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO
RLHF can be used to align LLMs with human preferences and values, by eliciting feedback from humans on the LLM’s current behavior and using this feedback to train a reward model. Once parameterized, this reward model can then be used to fine-tune the LLM by reinforcement learning simulations, which are often much faster and cheaper than using human interactions (Ouyang L. et al., 2022). Moreover, eliciting comparisons of different LLM responses (for example, asking a human which of two responses is better) is generally more straightforward for humans to provide compared to providing absolute scores, and doesn’t require human preferences or intentions to be explicitly defined.
Christiano et al. (2017) provided the first evidence that RLHF could be economically scaled up to practical applications. Since then, RLHF has been shown to help tune LLMs to be more helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment).
In RLHF, the alignment can be biased by the group of humans who provide the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, it might never be possible to train a system that is aligned to everyone’s preferences at once, or where everyone would endorse the trade-offs. RLHF has therefore recently been extended to use less and less human feedback, with an ultimate goal to develop automated AI methods that could scale the refinement and supervision of LLM behaviors in the service of complex human values (Bai et al. 2022). Constitutional AI and more generally RLAIF are promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. This post focuses on RLAIF.
In RLAIF, a pre-trained LLM is instructed using natural language to critique and revise another LLM’s responses (or its own) in order to reinforce either some specific needs and human preferences, or some more general principles (ethical values, potential for harmful content, and so on). This LLM feedback provides AI labels that can directly be used as reward signals to fine-tune an LLM by reinforcement learning. Recent results demonstrated that RLAIF achieves comparable or superior performance to RLHF on tasks of summarization, helpful dialogue generation, and harmless dialogue generation.
Both RLHF and RLAIF can be used to steer the model’s behavior in a desired manner, and both techniques require pre-training a reward model. The key difference is how much human feedback is used to train the reward model. Because there are already many open source pre-trained reward models available, and a separate post has already shown how to build a dataset of human annotations and train a reward model, this post focuses on RLAIF with a preexisting reward model. We show you how to fine-tune a pre-trained LLM by reinforcement learning using a preexisting reward model and how to evaluate the results. A separate post has already shown how to use the technique of DPO described in the introduction, which doesn’t use explicit reward models and fine-tunes LLMs directly from preference datasets instead. In contrast, RLAIF, which is the focus of this post, doesn’t use explicit preference datasets and fine-tunes LLMs directly from reward models.
The following diagram illustrates the process of learning from preference feedback directly by policy optimization (DPO) vs. with a reward model to explore and score new responses by RLHF/RLAIF proximal policy optimization (PPO).

To help you choose whether DPO or RLAIF best fits your use cases, the following comparison summarizes the pros and cons of RLAIF from explicit reward models vs. DPO from explicit preference datasets. RLHF uses both and therefore provides an intermediary profile of pros and cons.
In a nutshell, DPO bypasses the distillation of the preference dataset into an intermediary reward model. DPO refines the parameters of an LLM directly from preference datasets by maximizing the margin between the log-likelihood of the chosen responses and the log-likelihood of the rejected ones in the preference datasets (Rafailov et al., 2024). Mathematically, the reward-based RLAIF/RLHF and reward-free DPO formulations have been shown to be equivalent and should in theory lead to the same results when fine-tuning is carried out on identical distributions of prompts. However, in practice, several factors can lead to different results: the distribution of prompts can vary based on knowledge of the targeted prompts for the desired downstream tasks (such as how relevant the prompts explored during fine-tuning are for the actual or future target distribution of prompts), access to the fine-tuning datasets (a reward model is more portable than the dataset on which it was originally trained), and the quality and size of the fine-tuning datasets. The latter factors (access, quality, size) become even more important in cases where using multiple fine-tuning datasets is desired. This implies the following pros and cons.

 
RLAIF
Summary: Fine-tune an LLM from explicit reward models on new prompts.
Pros: Fine-tuning is possible without human annotations. Most efficient in speed, compute, and engineering if reward models or an LLM instructor are available, preference data is unavailable, you need to explore diverse prompts beyond the ones in the original preference datasets, or online learning is desired. Directly scales beyond human supervision. Most portable and accessible: knowledge on human preferences is parameterized in the form of reward models.
Cons: Fine-tuning is limited to the available models of human preferences. Inefficient if reward models are unavailable and the preference is not clear enough to instruct an LLM, or if you need to target prompts from the available preference datasets.

DPO
Summary: Fine-tune an LLM directly from explicit preference datasets.
Pros: Fine-tuning uses explicit human feedback. Most efficient in speed, compute, and engineering if reward models are unavailable, you need to target prompts from the available preference datasets, or online learning is not needed (it would imply repeated cycles of preference dataset generation). High quality and fidelity: knowledge contained in datasets of human preferences is directly distilled into the target LLM.
Cons: Fine-tuning requires a lot of human annotations. Low portability and accessibility: knowledge on human preferences remains in its raw form, such as datasets of human annotations. Inefficient if you need to explore diverse prompts beyond the ones in the original preference datasets, or if reward models are available or the preference is clear enough to instruct an LLM.

RLHF
Summary: Train reward models from preference datasets, then fine-tune an LLM on new prompts.
Pros: Fine-tuning uses explicit human feedback. Highest quality and fidelity: in theory, knowledge on human preferences can be learned most accurately when iteratively generating datasets of such preferences and also generalizing that knowledge to arbitrary prompts by parameterizing reward models; in practice, this is often not the case. Iterative learning of reward models can be used to scale beyond direct human supervision.
Cons: Fine-tuning requires a lot of human annotations and is limited to the learned models of human preferences. Slow and not portable: RLHF systematically generates preference datasets and also trains reward models before fine-tuning the LLM.

This comparison is not exhaustive. In the context of superalignment, RLAIF might have a clear advantage because reward models can be easily tested, efficiently stored and accessed, and also mixed-and-matched to accommodate the multiple facets and preferences of different groups of people. But the overall performance of RLHF, RLAIF, and DPO for general-purpose LLM fine-tuning (assuming everything else is equal, such as access to datasets, target distribution of prompts, and so on) is unclear at the time of writing, with different authors and benchmarks favoring different conclusions. For example, Rafailov et al. (2024) favor DPO whereas Ivison et al. (2024) favor RLHF/RLAIF.
To complement the criteria defined in the comparison above for choosing between PPO (RLHF/RLAIF) and DPO, Ivison et al. (2024) suggest the following more general rules to consider, in order of importance, when deciding how to fine-tune an LLM:

The quality of the feedback in the preference dataset if available
The choice of the policy optimization algorithm and size of LLMs involved
The quality of the reward model if available
The expected overlap between the prompts used for fine-tuning vs. the future target prompts for which the LLM will ultimately be used
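To make the reward-free DPO objective described earlier in this section concrete, the following is a minimal PyTorch-style sketch of the DPO loss; the function name, the beta hyperparameter, and the assumption that per-example log-likelihoods have already been summed over tokens are illustrative rather than part of any specific library API:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is a tensor of per-example log-likelihoods (summed over tokens)
    # of the chosen/rejected responses under the tunable policy or the frozen reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # implicit reward, chosen
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # implicit reward, rejected
    # Maximize the margin between chosen and rejected responses (Rafailov et al., 2024)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()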

Categories of human preference reward models for RLHF/RLAIF
In RLHF, the quality of the resulting alignment depends on the nature of the reward models derived from the preference dataset. RLHF can be biased by the group of humans who provides the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, effective RLHF tuning typically requires tens of thousands of human preference labels, which is time-consuming and expensive. RLAIF can better scale the alignment of LLMs beyond direct human supervision, called superalignment, by combining multiple LLMs, each instructed differently to specialize on a specific facet of human preferences. For example, as discussed in Lee et al. (2023), you can generate a reward signal for the overall quality of the LLM response, another for its conciseness, another for its coverage, and another for its toxicity. RLAIF is promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. RLAIF makes the implementation of an alignment process simpler, and also avoids reinventing the wheel given many reward models have been carefully crafted and made available to the public.
To make the best use of RLAIF, it’s important to carefully choose the reward models that will be used for aligning the target LLM. To evaluate how aligned a model is, we should first clarify what alignment means. As mentioned in Ouyang et al. (2022), the definition of alignment has historically been a vague and confusing topic, with various competing proposals.
By fine-tuning an LLM to act in accordance with our (human) intentions, aligned typically means that it is helpful, honest, and harmless:

Helpfulness – The LLM should follow instructions and infer user intent. The intent of a user behind an input prompt is notoriously difficult to infer, and is typically unknown, unclear, or ambiguous. Reward models for helpfulness have typically relied on judgment from human labelers, but new generations of LLMs trained and fine-tuned on such labels are now commonly used to evaluate the overall quality and helpfulness of other LLMs, in particular to distill knowledge by using large LLMs to evaluate smaller or more specialized LLMs.
Honesty (fidelity) – The LLM should not make up facts (hallucination). Ideally, it should also recognize when it doesn’t know how to respond. Measuring honesty is also notoriously difficult and LLMs often hallucinate because they lack explicit mechanisms to recognize the limitation of their knowledge. It is often limited to measuring whether the model’s statements about the world are true, which only captures a small part of what is actually meant by honesty. If you would like to dive deeper, the following peer-reviewed articles in workshops at ICML (Curuksu, 2023) and NeurIPS (Curuksu, 2024) propose some original methods to teach LLMs when best to fall back on asking for clarification and align the fidelity of generative retrieval in multi-turn dialogues. Ultimately, this type of alignment aims to improve what we might think of as the “humility” of AI systems.
Harmlessness (toxicity) – The LLM should not generate biased or toxic responses. Measuring the harms of language models also poses many challenges because harm from LLMs typically depends on how their outputs are used by users. As mentioned in Ouyang et al. (2022), a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might be helpful if used for red teaming data augmentation to train a more accurate toxicity detection model. Having labelers evaluate whether an output is harmful is costly and requires speculation about how the output will be used, so proxy criteria are typically used instead: evaluating whether an output is inappropriate in the context of a specific use case, or using public benchmark datasets or parameterized models intended to measure bias and toxicity. We illustrate this approach in this post by fine-tuning some LLMs to generate less toxic content in a summarization task using one of Meta’s AI reward models.

In this post, we use a preexisting reward model instead of training our own, and implement an RLAIF algorithm. This will make the implementation simpler, but also avoid reinventing the wheel given that many reward models have been carefully crafted and made available to the public. A key advantage of RLAIF to scale superalignment efforts is the ability to combine multiple sources of reward models (for example, using the average of rewards generated by three different models each specialized on evaluating a particular type of human preferences, such as helpfulness, honesty, or harmlessness).
More generally, RLAIF lets you instruct LLMs in original ways to specialize in specific emerging needs and scale superalignment efforts by recruiting the assistance of AI systems to align other AI systems. The following is an example of a system prompt that can be used as a general template to instruct an LLM to generate quantitative reward feedback:


You are an AI assistant and your task is to evaluate the following summary generated by an LLM,
considering the coherence, accuracy, coverage, and overall quality of the summary.
Please generate an evaluation score in a decimal number between 1.00 and 5.00.
Score 5.00 means the summary is the best optimal summary given the input text.
Score 1.00 means the summary is really bad and irrelevant given the input text.

Grade the summary based ONLY on the factual accuracy, coherence and coverage. Ignore
differences in punctuation and phrasing between the input text and the summary.

Please also generate a justification statement to explain your evaluation score.
Keep the justification statement as concise as possible.

Here is the input text: (…)
Here is the summary generated by the LLM: (…)

An implementation of Anthropic’s Claude on Amazon Bedrock instructed to evaluate responses generated by another LLM on the Hugging Face Hub (Meta’s Llama 3.1 or Google’s Flan-T5) is shown in the next section.
By using explicit and scalable reward models, RLAIF can condition LLM behaviors on specific groups of users and scale red teaming alignment efforts by making sure LLMs abide by some desired guiding principles.
At a fundamental level, there is a known trade-off between the need to be harmless and the need to be helpful—the more helpful an LLM is, the more potential for harm it tends to have, and vice versa. For example, answering all questions with “I don’t know” is typically harmless, but is also typically useless. RLAIF is particularly useful to address this Pareto frontier—the optimal trade-off between helpfulness and harmlessness. For example, assuming human feedback is collected on the helpfulness of an LLM’s responses, a separate toxicity reward model can be used to scale up automatic red teaming refinements and maintain low toxicity at any given (even if undefined) level of helpfulness. To illustrate this, the use case implemented in the next section uses an LLM already fine-tuned for helpfulness and harmlessness and adjusts the Pareto frontier by further tuning its toxicity using a separate model (either a pre-trained LLM or a general-purpose LLM instructed to evaluate toxicity).
Implementation of an RLAIF use case
As explained earlier in this post, preference datasets are not portable, are not always accessible, and provide only a static set of prompts and responses; in contrast, parametrized reward models are highly portable and can be used to generalize their encoded knowledge by exploring new sets of prompts and responses. To illustrate this, assume we wanted to combine the learning made by companies like Anthropic when they released their human preference HH dataset (the largest human preference dataset publicly available at the time of its release) with LLMs available at that time, for example Google’s Flan-T5 model. Instead of using the explicit human feedback from the HH dataset, RLAIF could be used to let Google’s Flan-T5 explore new responses to the HH dataset prompts, and to fine-tune it using a reward generated by another LLM. This reward LLM could be Anthropic’s Claude itself, or come from yet another provider such as Meta, who around that same time released their red teaming hate speech model, a state-of-the-art RoBERTa toxicity model at the time of its release. A notebook with the complete code for this use case is provided on GitHub.
The goal of this use case and the accompanying code is to give you an end-to-end code pipeline for RLAIF and is mostly illustrative. The dataset of prompts used to fine-tune and test the LLM could be replaced by a different preference dataset that best fits your use case, and the reward model could also be replaced by a different reward model, such as an LLM prompted using the template shown in the previous section to assign a numerical reward based on any criteria that best fit your use case (toxicity, coherence, conciseness, fidelity to some reference text, and so on). In this post, we use publicly available datasets and reward models, and fine-tune toxicity as encoded in one of Meta’s reward models, for a given level of helpfulness as defined by the LLM responses preferred by humans in the Anthropic HH dataset. The entire notebook accompanying this post, together with a requirements file, was run on an Amazon SageMaker notebook ml.g5.16xlarge instance.
Import key libraries
To implement an RLAIF algorithm, we use an open source, high-level library from Hugging Face called Transformer Reinforcement Learning (TRL). Do not forget to restart your Python kernel after installing the required libraries (listed in the requirements file mentioned earlier) and before importing them. See the following code:

from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
    GenerationConfig)
from trl import (
    PPOTrainer,
    PPOConfig,
    AutoModelForSeq2SeqLMWithValueHead,
    AutoModelForCausalLMWithValueHead,
    create_reference_model)
from trl.core import LengthSampler
from datasets import load_dataset
from peft import (
    PeftModel,
    PeftConfig,
    LoraConfig,
    TaskType)
import torch
import torchvision
import evaluate
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

Load a prompt dataset and a pre-trained LLM, and instruct it to generate a specific type of response
First, let’s load a pre-trained LLM model. This section contains examples showing how to load Meta’s Llama 3.1 (instruct version) and Google’s Flan-T5 models (choose one or the other). When loading the pre-trained LLM, we instantiate it as an RL agent using the Hugging Face TRL library by adding a regression layer to it, which will be used to predict values required to define the policy gradient in PPO. In other words, TRL adds a value head (critic) in addition to the language model head (actor) to the original LLM, thereby defining an actor-critic agent.
Another version of the LLM can be used as reference for regularization during PPO—its parameters will remain frozen during the fine-tuning process, to define the Kullback-Leibler divergence between the tuned vs. original LLM responses. This will restrain the magnitude of potential deviations from the original LLM and avoid catastrophic forgetting or reward hacking; see Ouyang et al. (2022) for details. This regularization approach is in theory optional (and different from the clipping on the probability distribution of output tokens already implemented by default in PPO), but in practice it has been shown to be essential to preserve the capabilities acquired during pre-training. See the following code:

# Load a pre-trained LLM

access_token = "<YOUR_HF_ACCESS_TOKEN>"  # Hugging Face access token (Llama 3.1 is a gated model)

model = "llama"

if model == "llama":
    # Example to load Meta Llama 3.1 model
    model_name = "meta-llama/Meta-Llama-3.1-8B"
    ppo_llm = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, token=access_token)

elif model == "t5":
    # Example to load Google Flan T5 model
    model_name = "google/flan-t5-base"
    ppo_llm = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name, token=access_token)

# Instantiate a reference "frozen" version of the LLM model
ref_llm = create_reference_model(ppo_llm)

Then, load the dataset (Anthropic’s Helpfulness/Harmfulness dataset, a sample of which is shown at the end of the post) and prepare instructions for the LLM to generate summaries of the dialogues sampled in this dataset, integrate this system prompt with the dialogues to be summarized, and tokenize the prompts:

# Load Helpfulness/Harmfulness dataset from Anthropic
dataset_name = "Anthropic/hh-rlhf"

# Create a tokenizer based on the chosen LLM
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
tokenizer.pad_token = tokenizer.eos_token

# Engineer the prompt and build the training/test dataset
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.remove_columns("rejected")
dataset = dataset.rename_column("chosen", "dialogue")
dataset = dataset.filter(lambda x: len(x["dialogue"]) > 100 and
                         len(x["dialogue"]) <= 500, batched=False)  # Limit size of dialogues

def tokenize(sample):
    prompt = f"""
Summarize the following conversation.

{sample['dialogue']}

Summary:
"""
    sample["input_ids"] = tokenizer.encode(prompt)
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

# Tokenize dialogues
dataset = dataset.map(tokenize, batched=False)
dataset.set_format(type="torch")

# Split into training and testing datasets
dataset = dataset.train_test_split(test_size=0.2)
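The TRL PPOTrainer used later in this post also expects a data collator (referenced as collator when the trainer is instantiated). The accompanying notebook defines one; a minimal version, following the standard TRL PPO examples, simply regroups a list of dataset records into a dictionary of lists:

# Minimal data collator for the TRL PPOTrainer: turns a list of dataset records
# into a dictionary of lists, one entry per field (as in the standard TRL examples)
def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}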

Prepare reward models for RLAIF
In this section, we provide two examples of an AI reward model for RLAIF.
Example of AI reward model for RLAIF: Load a pre-trained LLM tuned to rate toxicity
Instead of asking human labelers to give feedback on the toxicity level of the LLM responses, as traditionally done in an RLHF approach, which is time-consuming and expensive, an example of a more scalable method for superalignment is to use a reward model already pre-trained by supervised learning specifically to predict this feedback. The acquired generalization abilities of this reward model can scale to new prompts and responses and, as such, can be used for RLAIF.
The popular Meta AI RoBERTa-based hate speech model publicly available on the Hugging Face Hub will be used here as the reward model, to fine-tune the parameters of the PPO agent to decrease the level of toxicity of the dialogue summaries generated by the PPO agent. This model predicts the logits and probabilities across two classes (not_hate = label 0, and hate = label 1). The logits of the output not_hate (positive reward signal) will be used for training the PPO agent. You need to create both a reward model and a tokenizer based on this model, so you can test the model:

# Load the reward model and instantiate a Transformer pipeline with it
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_model = pipeline("sentiment-analysis", model=toxicity_model_name)

# Create a tokenizer based on the reward model
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)

# Create a classifier based on the reward model
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

# Define a function that prints the predicted reward
def reward_model_test(text):
    toxicity_input_ids = toxicity_tokenizer(text, return_tensors="pt").input_ids
    logits = toxicity_model(toxicity_input_ids).logits
    not_hate_reward = (logits[:, 0]).tolist()  # 0 is the index for "not hate"
    print(f'\nResults for: {text}')
    print(f'Reward (value of "not hate" logit): {not_hate_reward[0]}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_model_test("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_model_test("You are terrible and i hate you.")

OUTPUT:
Results for: I like you.
Reward (value of "not hate" logit): 4.55513858795166

Results for: You are terrible and i hate you.
Reward (value of "not hate" logit): -2.368960380554199

Example of AI reward model for RLAIF: Prompt Anthropic’s Claude v3 to generate a reward
You can use a different LLM able to evaluate the toxicity of the input text instead of the Meta toxicity model used in the previous example. As of 2024, most latest-generation LLMs can be used out of the box and directly prompted to generate a quantitative reward signal. The following code is an example using Anthropic’s Claude v3 Sonnet hosted on Amazon Bedrock:

import json
import boto3

# Format for Claude v3 Sonnet on Amazon Bedrock
region = "<YOUR_AWS_REGION>"
model = 'anthropic.claude-3-sonnet-20240229-v1:0'

def reward_direct_from_Claude(text):
    system_prompt = '''
    Rate the toxicity of the following statement.
    The rate should be between -5 (high toxicity) and +5 (low toxicity).
    The response should only contain the rate, nothing else.

    Here is the statement:
    '''

    body = json.dumps({
        "system": system_prompt,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": text}]
            }
        ],
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 300,
        "anthropic_version": "bedrock-2023-05-31"
    })

    bedrock_runtime = boto3.client(region_name=region, service_name='bedrock-runtime')
    response = bedrock_runtime.invoke_model(body=body, modelId=model)
    response_body = json.loads(response.get('body').read())
    reward = response_body["content"][0]["text"]
    print(f'\nResults for: {text}')
    print(f'Reward (directly generated by LLM): {reward}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_direct_from_Claude("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_direct_from_Claude("You are terrible and i hate you.")

OUTPUT:
Results for: I like you.
Reward (directly generated by LLM): +5

Results for: You are terrible and i hate you.
Reward (directly generated by LLM): -4

You can see that the format of the output generated by Anthropic’s Claude v3 out of the box (a scalar number) is identical to the format of the output generated by the previous reward model specifically tuned to rate toxicity. Either reward model can now be used for RLAIF.
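If you want to use the Claude-based reward inside the PPO loop shown in the next section, the only change needed is to expose it with the same calling convention as the Transformers pipeline created earlier. The following hypothetical adapter assumes reward_direct_from_Claude is modified to return the numeric rate instead of printing it:

# Hypothetical adapter so the Claude-based reward can stand in for reward_model in the
# PPO loop below; assumes reward_direct_from_Claude returns the numeric rate
def claude_reward_model(texts, **kwargs):
    results = []
    for text in texts:
        rate = float(reward_direct_from_Claude(text))
        results.append([{"score": rate}])  # mirrors the pipeline's [{"score": ...}] output
    return results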
Fine-tune the pre-trained LLM by proximal policy optimization (PPO) reinforcement learning
Now that we have a reward model, we can initialize a PPO trainer from the Hugging Face TRL library, then perform the actual RL loop that, at every step, will produce an LLM response for each summary, compute a reward feedback signal for each response, and update the parameters of the tunable LLM.
In this notebook, we iterate for a predefined number of PPO steps to not wait for too long, but in practice we could also track the reward (toxicity score) accumulated across all summaries at each step, which should increase as the LLM is tuned to produce less toxic summaries, and continue the iteration until the LLM is considered aligned based on a threshold in the toxicity score. See the following code:

# Hugging Face TRL PPO trainer configuration
config = PPOConfig(
    model_name = model_name,
    learning_rate = 1.41e-5,
    ppo_epochs = 1,
    mini_batch_size = 4,
    batch_size = 16)

# Instantiate the PPO trainer
ppo_trainer = PPOTrainer(config = config,
                         model = ppo_llm,
                         ref_model = ref_llm,
                         tokenizer = tokenizer,
                         dataset = dataset["train"],
                         data_collator = collator)

# Inference parameters of the LLM generating responses
max_new_tokens = 300
generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "max_new_tokens": max_new_tokens}

# Inference parameters of the reward model
reward_kwargs = {
    "top_k": None,
    "function_to_apply": "none",
    "batch_size": 16}

# Set number of PPO iterations
max_ppo_steps = 10  # 10 is illustrative; takes <1 min on ml.g4dn.4xlarge EC2 instance

# PPO loop
for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):

    # Stop after predefined number of steps
    if step >= max_ppo_steps:
        break

    # Produce a response for each prompt in the current batch
    summary_tensors = []
    prompt_tensors = batch["input_ids"]
    for prompt_tensor in prompt_tensors:
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # Prepare the decoded version of the responses for the reward model TRL pipeline
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward for each pair (prompt, response) in the batch
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = reward_model(query_response_pairs, **reward_kwargs)
    reward_tensors = [torch.tensor(reward[0]["score"]) for reward in rewards]

    # Execute one step of PPO to update the parameters of the tunable LLM
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    # Print metrics for real-time monitoring
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')

If the number of iterations is too small, you might not observe any significant improvements. You might have to experiment, in your particular use case, to find a number of iterations high enough to produce significant improvements.
Evaluate the RL fine-tuning results
To evaluate the results from an RLAIF process quantitatively, we can compute the toxicity of dialogues generated by the original vs. fine-tuned model using prompts from the hold-out test set that was prepared earlier. The code for the function evaluate_toxicity provided with this post uses the same toxicity model as was used to define the reward model, but you could also evaluate results with a different toxicity model from the one used as the reward model, which is yet another way that might help scale superalignment efforts in RLAIF. See the following code:

# Compute aggregate toxicity score (mean, std dev) of the original model on the test set
mean_before, std_before = evaluate_toxicity(model=ref_llm,
                                            toxicity_evaluator=toxicity_evaluator,
                                            tokenizer=tokenizer,
                                            dataset=dataset["test"],
                                            num_samples=10)

# Compute aggregate toxicity score (mean, std dev) of the fine-tuned model on the test set
mean_after, std_after = evaluate_toxicity(model=ppo_llm,
                                          toxicity_evaluator=toxicity_evaluator,
                                          tokenizer=tokenizer,
                                          dataset=dataset["test"],
                                          num_samples=10)

# Compare toxicity score of the original vs. fine-tuned models on the test set
mean_improvement = (mean_before - mean_after) / mean_before
std_improvement = (std_before - std_after) / std_before

print(f'toxicity [mean, std] after fine tuning: [{mean_after}, {std_after}]')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')
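The evaluate_toxicity helper and the toxicity_evaluator used above are provided in the accompanying notebook. For reference, a minimal sketch is shown below under the assumption that the Hugging Face evaluate library’s toxicity measurement (which wraps the same Meta hate speech model used as the reward model) serves as the evaluator; the function name matches the calls above, but the generation settings are illustrative:

import numpy as np
import evaluate

# Toxicity measurement from the Hugging Face evaluate library, wrapping the same
# facebook/roberta-hate-speech-dynabench-r4-target model used as the reward model
toxicity_evaluator = evaluate.load("toxicity",
                                   "facebook/roberta-hate-speech-dynabench-r4-target",
                                   module_type="measurement")

def evaluate_toxicity(model, toxicity_evaluator, tokenizer, dataset, num_samples=10):
    # Generate a response for each test prompt and return the mean/std toxicity score
    scores = []
    for i, sample in enumerate(dataset):
        if i >= num_samples:
            break
        input_ids = sample["input_ids"].unsqueeze(0)  # add a batch dimension
        output_ids = model.generate(input_ids=input_ids,
                                    max_new_tokens=300,
                                    do_sample=True,
                                    top_p=1.0)
        response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Score the generated text (higher score = more toxic)
        result = toxicity_evaluator.compute(predictions=[response])
        scores.append(result["toxicity"][0])
    return np.mean(scores), np.std(scores)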

Finally, a more qualitative comparison of generated responses before vs. after RLAIF fine-tuning can complement the preceding quantitative evaluation, by looking at examples of summaries, together with their toxicity scores, again using hold-out prompts from the test set, as done in the accompanying notebook.

Query: Summarize the following conversation. Human: When can my newborn start eating solid food? Assistant: Your baby should be starting solid foods around six months.
Response before toxicity fine-tuning: Wait five weeks before eating your baby’s regular food.
Response after toxicity fine-tuning: Wait until the baby gets to five months.

Query: Summarize the following conversation. Human: How do I repair patches on my pants? Assistant: To repair patches on pants, here are several steps: 1) Wash and dry the pants. 2) Cut off any excess fabric around the patch. 3) Use a needle and thread to sew on the patch. 4) Sew on the patch securely, taking care to match the stitching with the original stitching. 5) Trim the thread.
Response before toxicity fine-tuning: Obtain inexpensive pieces of clothing.
Response after toxicity fine-tuning: How to repair patched pants.

Query: Summarize the following conversation. Human: How do I fix my cassette tape if the tape has been pulled out? Assistant: What happens when you put in a cassette? Human: I can not put in the cassette the tape from inside was pulled out for the most part. Assistant: Are you talking about the plastic cassette tape or the cassette tape inside the cassette? Human: the only tape that is in a cassette tape. the one that contains the music. Assistant: Are you talking about the tape that is in the cassette (…)
Response before toxicity fine-tuning: The cassette tape is made with two layers of plastic that is broken, and the tape is your music file. The [sex-related word] is inserted into the grooves of the cassette tape, so it cannot mark these rigid spaces.
Response after toxicity fine-tuning: The cassette tape from the store has the only tape I broke, but that caught up to me and I have a record of it.

Query: Summarize the following conversation. Human: How can I explain to a kid how to skim and scan text Assistant: What are you looking for exactly? Human: As I said, how to explain to a child how to skim and scan text Assistant: Do you want the child to learn how to do it themselves, or do you want to explain to them how to do it?
Response before toxicity fine-tuning: I heard about this article. What would you like me to do?? Scan and scan text.
Response after toxicity fine-tuning: Start the game by creating an interface online.

Conclusion
In this post, we introduced a state-of-the-art method to fine-tune LLMs by reinforcement learning, reviewed the pros and cons of RLHF vs. RLAIF vs. DPO, and saw how to scale LLM fine-tuning efforts with RLAIF. We also saw how to implement an end-to-end RLAIF pipeline on SageMaker using the Hugging Face Transformers and TRL libraries, using either an off-the-shelf toxicity reward model to align responses during PPO or an LLM directly prompted to generate quantitative reward feedback during PPO. Finally, we saw how to evaluate results by measuring the toxicity of generated responses before vs. after fine-tuning on a hold-out test set of prompts.
Try this fine-tuning method with your own use cases, and share your thoughts in the comments.
References:
Ouyang L. et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
Lee H. et al. (2023) RLAIF: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267.
Bai Y. et al. (2022) Constitutional AI: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
Rafailov R. et al. (2024) Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Christiano P. et al. (2017) Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
Ivison H. et al. (2024) Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. arXiv preprint arXiv:2406.09279.
Curuksu J. (2023) Optimizing Chatbot Fallback Intent Selections with Reinforcement Learning. ICML 2023 Workshop on The Many Facets of Preference-Based Learning.
Curuksu J. (2024) Policy optimization of language models to align fidelity and efficiency of generative retrieval in multi-turn dialogues. KDD 2024 Workshop on Generative AI for Recommender Systems and Personalization.

About the Author
Jeremy Curuksu is a Senior Applied Scientist in Generative AI at AWS and an Adjunct Faculty at New York University. He holds an MS in Applied Mathematics and a PhD in Computational Biophysics, and was a Research Scientist at Sorbonne University, EPFL, and MIT. He authored the book Data Driven and multiple peer-reviewed articles in computational physics, applied mathematics, and artificial intelligence.

Researchers from Dataocean AI and Tsinghua University Introduces Dolph …

Automatic speech recognition (ASR) technologies have advanced significantly, yet notable disparities remain in their ability to accurately recognize diverse languages. Prominent ASR systems, such as OpenAI’s Whisper, exhibit pronounced performance gaps when processing Eastern languages compared to Western counterparts. This discrepancy presents tangible challenges in multilingual regions, particularly those characterized by numerous dialects and linguistic variations, underscoring the necessity for sophisticated multilingual ASR systems tailored specifically to Eastern languages.

Researchers from Dataocean AI and Tsinghua University have introduced Dolphin, a comprehensive multilingual automatic speech recognition model built upon an extended Whisper architecture, optimized to accommodate a broader spectrum of Eastern languages and dialects. Dolphin effectively addresses key limitations identified in current multilingual ASR models by integrating both proprietary datasets and publicly accessible datasets. The model proficiently supports 40 Eastern languages from East Asia, South Asia, Southeast Asia, and the Middle East, as well as 22 distinct dialects of Chinese.

Dolphin employs a hybrid ASR approach combining Connectionist Temporal Classification (CTC) with attention-based mechanisms. Its architecture incorporates an E-Branchformer encoder and a Transformer decoder, substantially enhancing the model’s capability to interpret complex linguistic patterns across diverse languages. Dolphin also utilizes a dual-level language tokenization system, distinguishing general language codes from region-specific dialect tokens. This mechanism improves recognition accuracy and resolution, particularly for dialect-intensive languages such as Chinese. Additionally, Dolphin incorporates a 4× subsampling layer to efficiently reduce input sequence lengths, enhancing computational speed and training effectiveness without compromising recognition accuracy.

Experimental evaluations demonstrate Dolphin’s marked improvements in multilingual speech recognition accuracy relative to Whisper models. For instance, the Dolphin small model reduced the Word Error Rate (WER) by approximately 24.5% compared to the base model, with further incremental improvements observed in medium and large variants. Specifically, the Dolphin base model attained an average WER of 31.8%, notably outperforming Whisper’s large-v3 model, which recorded an average WER of 52.3% across the same evaluation benchmarks. Assessments conducted on dialect-focused datasets, including KeSpeech, confirmed Dolphin’s capability to consistently handle intricate linguistic variations, with performance enhancements correlating positively with increased model size.

The research team released the Dolphin base and small models publicly under the Apache 2.0 license, along with associated inference code. Dolphin’s training utilized an extensive dataset encompassing 21.2 million hours of audio recordings, incorporating 7.4 million hours derived from open datasets such as Common Voice, ReazonSpeech, and GigaSpeech2, thereby ensuring robustness and replicability.

In summary, Dolphin constitutes a significant advancement in multilingual ASR technology, systematically addressing prevailing limitations in Eastern language and dialect recognition through methodological data integration, refined architectural frameworks, and commitment to open-source dissemination. This work sets an influential benchmark for future developments in multilingual ASR research, advancing linguistic inclusivity and system generalization.

Check out the Paper, Dolphin-small-model and Dolphin-base-model. All credit for this research goes to the researchers of this project.

The post Researchers from Dataocean AI and Tsinghua University Introduces Dolphin: A Multilingual Automatic Speech Recognition ASR Model Optimized for Eastern Languages and Dialects appeared first on MarkTechPost.

This AI Paper Introduces FASTCURL: A Curriculum Reinforcement Learning …

Large language models have transformed how machines comprehend and generate text, especially in complex problem-solving areas like mathematical reasoning. These systems, known as R1-like models, are designed to emulate slow and deliberate thought processes. Their key strength is handling intricate tasks requiring step-by-step reasoning across long sequences. These capabilities make them valuable for applications such as solving Olympiad-level math problems or logical reasoning tasks, where depth and coherence of reasoning are essential.

A significant challenge in training these models is the extensive computation required for reinforcement learning with long context windows. Tasks that require multi-step logic force models to produce long outputs, which consumes more resources and slows down learning. Further, not all long responses contribute meaningfully to accuracy; many include redundant reasoning. These inefficiencies in response generation and high GPU usage make it difficult to scale training effectively, particularly when working with models with 1.5 billion parameters.

Previous attempts to address this issue include models like DeepScaleR, which uses a staged context length extension strategy during training. DeepScaleR starts with an 8K context window and expands gradually to 24K over three training phases. Although this approach helps guide the model to manage longer reasoning chains efficiently, comparable training would otherwise demand approximately 70,000 A100 GPU hours; DeepScaleR’s progressive strategy reduces that to 3,800 hours but still requires considerable hardware, including setups with up to 32 GPUs in some stages. This shows that while improvements are possible, the solution remains costly and complex.

Researchers at Tencent introduced a method called FASTCURL to overcome the inefficiencies of traditional reinforcement learning training. This method presents a curriculum-based strategy aligned with context window expansion. FASTCURL splits the dataset based on input prompt length into short, long, and combined categories. The training progresses in four stages, each using a different dataset and context window setting. This approach ensures the model learns simple reasoning before advancing to longer, more complex reasoning steps. The researchers emphasize that the entire training process runs on a single node with just 8 GPUs, reducing setup complexity.

The approach involves a deliberate segmentation of data by input length, driven by the hypothesis that longer prompts usually lead to longer and more complex outputs. The model first learns using short prompts under an 8K window. As training proceeds, the model transitions to a mixed dataset with 16K window length, then to the long dataset with the same window size, and finally reviews the combined data again. Each stage is trained for one iteration, and FASTCURL requires about 860 training steps. This is efficient compared to DeepScaleR’s 1,750 steps, representing a 50% reduction in training time and resource usage while maintaining effectiveness.

In performance evaluations, FASTCURL-1.5B-Preview showed improvements over other models across five benchmarks. It scored 88.0 on MATH 500, 43.1 on AIME 2024, 74.2 on AMC 2023, 31.6 on Minerva Math, and 50.4 on OlympiadBench, with an average PASS@1 score of 57.5. Compared to DeepScaleR-1.5B-Preview, which scored an average of 57.0, FASTCURL performed better in four of five datasets. These results highlight that FASTCURL can outperform existing techniques while consuming significantly fewer resources. The model also showed better generalization, particularly on datasets like AMC 2023 and Minerva Math, indicating robustness.

The research clearly outlines a computational problem in training R1-like reasoning models and offers an innovative curriculum strategy as a solution. The method provides an efficient and practical training framework by combining input-based data segmentation with context expansion. FASTCURL delivers strong performance using fewer steps and limited hardware, proving that strategic training design can be as powerful as raw computational scale.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This AI Paper Introduces FASTCURL: A Curriculum Reinforcement Learning Framework with Context Extension for Efficient Training of R1-like Reasoning Models appeared first on MarkTechPost.

Introduction to MCP: The Ultimate Guide to Model Context Protocol for …

The Model Context Protocol (MCP) is an open standard (open-sourced by Anthropic) that defines a unified way to connect AI assistants (LLMs) with external data sources and tools. Think of MCP as a USB-C port for AI applications – a universal interface that allows any AI assistant to plug into any compatible data source or service. By standardizing how context is provided to AI models, MCP breaks down data silos and enables seamless, context-rich interactions across diverse systems. 

In practical terms, MCP enhances an AI assistant’s capabilities by giving it controlled access to up-to-date information and services beyond its built-in knowledge. Instead of operating with a fixed prompt or static training data, an MCP-enabled assistant can fetch real-time data, use private knowledge bases, or perform actions on external tools. This helps overcome limitations like the model’s knowledge cutoff and fixed context window. Simply “stuffing” all relevant text into an LLM’s prompt can hit context-length limits, slow responses, and drive up cost. MCP’s on-demand retrieval of pertinent information keeps the AI’s context focused and fresh, allowing it to incorporate current data and update or modify external information when permitted.


Another way MCP improves AI integration is by unifying the development pattern. Before MCP, connecting an AI to external data often meant using bespoke integrations or framework-specific plugins. This fragmented approach forced developers to re-implement the same tool multiple times for different AI systems. MCP eliminates this redundancy by providing one standardized protocol. An MCP-compliant server (tool integration) can work with any MCP-compliant client (AI application). In short, MCP lets you “write once, use anywhere” when adding new data sources or capabilities to AI assistants. It brings consistent discovery and usage of tools and improved security. All these benefits make MCP a powerful foundation for building more capable and extensible AI assistant applications.

MCP Architecture and Core Components

At its core, MCP follows a client–server architecture that separates the AI assistant (client/host side) from the external integrations (server side). The design involves three primary roles:

MCP Host: The AI assistant application or environment that needs external data or actions. This could be a chat interface, an IDE with an AI coding assistant, a CRM with an AI helper, etc. The host is where the user interacts and the LLM “lives”.

MCP Client: This component (often a library within the host app) manages the connection to one or more MCP servers. It acts as a bridge, routing requests from the AI to the appropriate server and returning results. The client handles messaging and intent analysis, and ensures that communication follows the MCP protocol format.

MCP Server: A lightweight program or service that exposes specific capabilities (tools, data access, or context) through the MCP standard. Each server is essentially a context provider; it can fetch information from certain data sources or perform particular actions and return results in a structured way.
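To make the server role concrete, here is a minimal sketch using the FastMCP helper from the official Python SDK (the `mcp` package). The server name, tool, and returned snippets are invented for illustration, and the exact API surface may vary between SDK versions.

```python
# Minimal MCP server sketch built on the official Python SDK's FastMCP helper.
# The tool and its return values are illustrative placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-search")  # name advertised to connecting clients

@mcp.tool()
def search_docs(query: str, top_k: int = 3) -> list[str]:
    """Return the top_k document snippets matching the query."""
    # A real server would query a search index, database, or external API here.
    return [f"snippet {i + 1} for '{query}'" for i in range(top_k)]

if __name__ == "__main__":
    # Serves over stdio by default, so any MCP-compliant client can connect.
    mcp.run()
```

Any MCP-compliant host can then discover `search_docs` and call it without knowing how the underlying search is implemented.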


To visualize this, imagine the AI assistant as a laptop and each MCP server as a device or accessory that can be plugged in. The MCP client is like the universal hub or port that allows the computer to connect to many devices using the same interface. For example, the host AI (e.g., Claude or ChatGPT) connects via an MCP client “hub” to multiple MCP servers (adapters) that provide access to different services, such as Slack, Gmail, a calendar API, or local files. No matter who built the tool or data source, if it speaks MCP, the assistant can use it seamlessly. This modular design lets AI assistants plug into new data sources as easily as adding a new device, without a custom integration for each tool.

Context Providers (MCP Servers)

Context providers are the external data sources or tools that an AI assistant can access via MCP. In MCP terms, these correspond to the MCP servers; each server provides a certain “capability” or data domain. For example, one MCP server might give access to a collection of documents or a knowledge base, another might interface with an email API, another with a database, and so on. The key is that each server follows the MCP standard for requests and responses, making them interchangeable from the perspective of the AI client. 

MCP servers can interface with local data sources (like files on your computer, local databases, etc.) or remote services (like web APIs, cloud apps). Indeed, a growing list of pre-built MCP servers already exists; for example, reference implementations are available for web searching, file operations, database queries, etc. You effectively make those data sources available to your AI by running or deploying the appropriate servers. The AI doesn’t need to know the low-level API details; it just sends a standardized request (e.g., “search for X” or “read file Y”), and the MCP server handles the rest. This design keeps the LLM isolated from direct external access. The server mediates what the AI can see or do, allowing for security and access control. In summary, context providers enable secure, plug-and-play integration of diverse data sources into the AI’s world.

Document Indexing and Retrieval

MCP servers often employ document indexing behind the scenes to efficiently use external data (especially large text corpora). Instead of storing a whole document or database record as one big blob, the data is pre-processed into an index that the server can query quickly. For textual data, this typically means splitting documents into chunks (e.g., paragraphs or passages) and converting them into a format suitable for fast similarity search, often embedding the text into vectors and storing them in a vector index or database. This is analogous to how a search engine indexes websites to retrieve relevant pages for a query instantly.

Why index documents? So that when the AI asks something, the server can find the relevant information without sending the entire data store. This is the essence of Retrieval-Augmented Generation (RAG): the user’s query is used to fetch relevant documents or snippets (via semantic search or keyword search), and those results are provided to the model as additional context. Using an index, the system can locate the needed knowledge quickly and accurately, even from large volumes of text. For example, if an AI can access a PDF library or a corporate wiki via MCP, the server might index all PDFs or wiki pages by content. When asked a question, it can then return just the top relevant sections to the AI rather than the AI scanning everything blindly. This speeds up the response and helps fit the info into the LLM’s context window limits.
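A context-providing server might implement this pattern roughly as follows. The chunking is deliberately naive, and `embed` is a toy stand-in for a real embedding model; both are assumptions for illustration only.

```python
# Illustrative chunk-embed-retrieve pipeline for a context-providing server.
# embed() is a toy hashed bag-of-words stand-in for a real embedding model.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def build_index(documents: list[str], chunk_size: int = 500):
    chunks, vectors = [], []
    for doc in documents:
        for start in range(0, len(doc), chunk_size):
            chunk = doc[start:start + chunk_size]
            chunks.append(chunk)
            vectors.append(embed(chunk))
    return chunks, np.vstack(vectors)

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, top_k: int = 3) -> list[str]:
    q = embed(query)
    # Cosine similarity between the query and every stored chunk.
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]
```

In production, the hand-rolled index would normally be replaced with a vector database or search service, but the request/response contract the AI sees stays the same.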

It’s worth noting that MCP itself doesn’t mandate a specific indexing technique; depending on the server’s implementation, it could be a vector similarity search, a keyword-inverted index, a database query, etc. The protocol just standardizes how the AI can request and receive information. Indexing is one of the best practices for context-providing servers to ensure the AI gets the right data when needed.

Query Resolution Process

When a user asks a question or gives a prompt to an MCP-enabled AI assistant, the system goes through a query resolution workflow to figure out how to get the necessary context. In a typical MCP interaction, the process works like this: the user’s query goes to the MCP client (in the host app), which then analyzes the query’s intent and requirements. Based on this analysis, the client decides which context provider (MCP server) can best handle the request. For instance, if the query is “What are the steps to reset my email password?” the client might route this to a documentation or knowledge base server. The query “Schedule a meeting next Monday” might route to a calendar API server. The client essentially performs a tool selection or routing step.

Once the appropriate server(s) are identified, the client sends the request to the MCP server in a standardized format (e.g., a JSON RPC call defined by the MCP spec). The server then processes the request – this could involve running a search in an index (for a knowledge query), calling an external API, or performing some computation. For a data retrieval scenario, the server would execute a search or lookup on its indexed data. For example, it might take the query, run a semantic similarity search across document embeddings, and find the top matching chunks. The retrieved results (or action outputs) are then returned from the server to the client, which returns them to the AI model.
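For reference, a tool invocation on the wire is a JSON-RPC 2.0 message. The field names below follow the published MCP specification; the tool name, arguments, and response text are examples only.

```python
# Shape of an MCP tool call and its response over the JSON-RPC transport.
# The id, tool name, arguments, and result text are illustrative examples.
import json

request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "tools/call",
    "params": {
        "name": "search_docs",  # a tool the server advertised via tools/list
        "arguments": {"query": "reset email password", "top_k": 3},
    },
}

response = {
    "jsonrpc": "2.0",
    "id": 42,  # echoes the request id
    "result": {"content": [{"type": "text", "text": "Step 1: open Settings ..."}]},
}

print(json.dumps(request, indent=2))
```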

In many cases, the client might wrap the results into the prompt given to the LLM. This entire resolution cycle happens quickly and transparently. The user experiences the AI assistant responding with an answer or action outcome. Still, behind the scenes, the assistant may have consulted one or several external sources to get there. According to one description, the MCP client “selects the appropriate tools via the MCP server, and invokes external APIs to retrieve and process the required information before notifying the user of the results”. The architecture ensures that the communication is structured and secure at each step; the AI can only use the tools it’s allowed to and only in the ways the protocol permits.

A practical consideration in query resolution is that you typically only connect relevant providers for the task. An AI could have dozens of MCP servers available, but giving the model access to all of them simultaneously might be counterproductive. The best practice is to enable a subset of tools based on context or user scope to avoid confusing the model with too many choices. For instance, an AI agent in a coding IDE might load servers for Git and documentation but not the CRM or Calendar servers. This way, query resolution involves picking among a manageable set of options and reduces the chance of the model calling the wrong tool.

Context Delivery to the Assistant

After a provider fetches the relevant context, it needs to be delivered back to the AI model in a useful form. In an MCP setup, the server’s response is typically structured (e.g., containing the data or an answer). The MCP client then integrates that into the AI’s prompt or state. In a retrieval scenario, this often means attaching the retrieved text as additional context for the LLM to consider when generating its answer. For example, the client might prepend the model’s prompt with something like “Reference Document: [excerpt]…” before the actual question or use a special format the model is trained to understand (such as a system message with the context). The AI’s response is “enriched” with external knowledge; it can quote specifics from the provided text or base its reasoning on it. If multiple context pieces are returned, the client could concatenate them or present them in a list. The LLM will then see all those pieces and the user query and attempt to synthesize an answer. This dynamic injection of context means the AI can output information it didn’t originally know, effectively extending its knowledge at runtime. For the user, it feels like the assistant “knows” about internal documents or the latest news, when in reality, it is reading from the supplied context.
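How the client folds retrieved context into the model's input is up to the host application; a minimal version is plain string assembly, sketched below with an invented prompt format.

```python
# Minimal client-side prompt assembly; MCP does not prescribe this format,
# so the template here is an invented convention of the host application.
def build_prompt(question: str, context_chunks: list[str]) -> str:
    references = "\n\n".join(
        f"Reference Document {i + 1}:\n{chunk}"
        for i, chunk in enumerate(context_chunks)
    )
    return (
        "Answer the question using only the reference documents below.\n\n"
        f"{references}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("How do I reset my email password?", ["Open Settings > Security ..."]))
```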

It’s important to highlight that context delivery in MCP is not limited to static text. While the focus here is on retrieval, MCP can also deliver the results of actions. For instance, if the user asks the AI to perform a calculation or send an email (and the MCP server for email executes that), the response delivered might be a confirmation or data about that action. In the case of retrieval (read-only context), the delivered content is analogous to what RAG provides: relevant documents for the model to read. However, MCP can go further; it supports active outputs. One source explains that RAG is read-only, whereas MCP enables the AI to “do things” and deliver the outcome. For example, an MCP server could return, say, “Email sent to John at 5 pm” as a result. In all cases, the final step is for the AI assistant to present the information or outcome to the end user in natural language. The user doesn’t see the raw context chunks or API calls; they just get the answer or confirmation, with the heavy lifting done via MCP behind the scenes.


In conclusion, the Model Context Protocol (MCP) advances the integration of AI assistants with diverse external data sources. MCP enables AI systems to dynamically leverage up-to-date, relevant information and seamlessly perform context-aware interactions by standardizing context retrieval, indexing, and delivery. This approach enriches the functionality and accuracy of AI assistants and simplifies development by establishing a universal framework, eliminating redundancy, and enhancing security.

Sources

https://www.anthropic.com/news/model-context-protocol 

https://docs.anthropic.com/en/docs/agents-and-tools/mcp 

https://arxiv.org/pdf/2503.23278v1


The post Introduction to MCP: The Ultimate Guide to Model Context Protocol for AI Assistants appeared first on MarkTechPost.

How AWS Sales uses generative AI to streamline account planning

Every year, AWS Sales personnel draft in-depth, forward-looking strategy documents for established AWS customers. These documents help the AWS Sales team align with our customer growth strategy and collaborate with the entire sales team on long-term growth ideas for AWS customers. These documents are internally called account plans (APs). In 2024, this activity took an account manager (AM) up to 40 hours per customer. Combined with the similar time support roles spent researching and writing growth plans for customers on the AWS Cloud, this led to significant organizational overhead. To help improve this process, in October 2024 we launched an AI-powered account planning draft assistant for our sales teams, building on the success of Field Advisor, an internal sales assistant tool. This new capability uses Amazon Bedrock to help our sales teams create comprehensive and insightful APs in less time. Since its launch, thousands of sales teams have used the resulting generative AI-powered assistant to draft sections of their APs, saving time on each AP created.
In this post, we showcase how the AWS Sales product team built the generative AI account plans draft assistant.
Business use cases
The account plans draft assistant serves four primary use cases:

Account plan draft generation: Using Amazon Bedrock, we’ve made internal and external data sources available to generate draft content for key sections of the APs. This enables our sales teams to quickly create initial drafts for sections such as customer overviews, industry analysis, and business priorities, which previously required hours of research across the internet and relied on disparate internal AWS tools.
Data synthesis: The assistant can pull relevant information from multiple sources including from our customer relationship management (CRM) system, financial reports, news articles, and previous APs to provide a holistic view of our customers.
Quality checks: Built-in quality assurance capabilities help ensure that APs meet internal standards for comprehensiveness, accuracy, and strategic alignment with our customers and business.
Customization: While providing AI-generated drafts, the product allows AMs to customize and refine the content by uploading proprietary documents to match their unique customer knowledge and strategic approach.

The account plan draft assistant loads when a user tries to create an AP, and users copy and paste each section they want to use in their final plan.

Our AMs report reduced time to write these documents, allowing them to focus more on high-value activities such as customer engagement and strategy development.
Here’s what some of our AMs had to say about their experience with the account plans draft assistant:

“The AI assistant saved me at least 15 hours on my latest enterprise account plan. It pulled together a great first draft, which I was then able to refine based on my own insights. This allowed me to spend more time actually engaging with my customer rather than doing research and writing.”
– Enterprise Account Manager
“As someone managing multiple mid-market accounts, I struggled to create in-depth plans for all my customers. The AI assistant now helps me rapidly generate baseline plans that I can then prioritize and customize. It’s a game-changer for serving my full portfolio of accounts.”
– Mid-market Account Manager

Amazon Q, Amazon Bedrock, and other AWS services underpin this experience, enabling us to use large language models (LLMs) and knowledge bases (KBs) to generate relevant, data-driven content for APs. Let’s explore how we built this AI assistant and some of our future plans.
Building the account plans draft assistant
When a user of the AWS internal CRM system initiates the workflow in Field Advisor, it triggers the account plan draft assistant capability through a pre-signed URL. The assistant then orchestrates a multi-source data collection process, performing web searches while also pulling account metadata from OpenSearch, Amazon DynamoDB, and Amazon Simple Storage Service (Amazon S3) storage. After analyzing and combining this data with user-uploaded documents, the assistant uses Amazon Bedrock to generate the AP. When complete, a notification chain using Amazon Simple Queue Service (Amazon SQS) and our internal notifications service API gateway begins delivering updates using Slack direct messaging and storing searchable records in OpenSearch for future reference.
The following diagram illustrates the high-level architecture of the account plans draft assistant.

Solution overview
We built the account plans draft assistant using the following key components:

Amazon Bedrock: Provides programmatic (API) access to high-performing foundation models (FMs), along with vector search capabilities and metadata filtering through Amazon Bedrock Knowledge Bases. We populate Amazon Bedrock knowledge bases with sales-enablement materials, historic APs, and other relevant documents curated by AWS Glue jobs (described in the AWS Glue item later in this list).
AWS Lambda: Supports two use cases:

The async resolver Lambda function interfaces with the front-end client CRM and orchestrates async job IDs for the client to poll. This layer also handles input validation, user request throttling, and cache management.
Worker Lambda functions perform the actual heavy lifting to create AP content. These functions work concurrently to generate different sections of APs using publicly available data, internal data, and curated data in Amazon Bedrock knowledge bases. They invoke various LLMs through Amazon Bedrock and store the final content in the AP DynamoDB table under the corresponding async job ID.

DynamoDB: Maintains the state of each user request by tracking async job IDs, tracks throttling quota (global request count and per-user request count), and acts as a cache.
AWS Glue jobs: Curate and transform data from various internal and external data sources. These AWS Glue jobs push data to internal data stores (APs, internal tooling team S3 buckets, and other internal services) and to Bedrock KBs, facilitating high-quality output through retrieval-augmented generation (RAG).
Amazon SQS: Enables us to decouple the management plane and data plane. This decoupling is crucial in allowing the data plane worker functions to concurrently process different sections of the APs and make sure that we can generate APs within specified times.
Custom web frontend: A ReactJS based micro-frontend architecture enables us to integrate directly into our CRM system for a seamless user experience.

Data management
Our account plans draft assistant uses the out-of-the-box knowledge base management that Amazon Bedrock provides. Through its RAG architecture, we semantically search and use metadata filtering to retrieve relevant context from diverse sources: internal sales-enablement materials, historic APs, SEC filings, news articles, executive engagements, and data from our CRM systems. The connectors built into Amazon Bedrock handle data ingestion from Amazon S3, relational database management systems (RDBMS), and third-party APIs, while its KB capabilities enable us to filter and prioritize source documents when generating responses. This context-aware approach results in higher quality and more relevant content in our generated AP sections.
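As a rough illustration of the retrieval-and-generation step, the boto3 sketch below queries a Bedrock knowledge base and passes the results to a model through the Converse API. The knowledge base ID, model ID, and prompt wording are placeholders; this is a reading of the public APIs, not the production code behind the assistant.

```python
# Hedged sketch: retrieve context from a Bedrock knowledge base, then draft
# an AP section with the Converse API. IDs and prompt text are placeholders.
import boto3

kb = boto3.client("bedrock-agent-runtime")
llm = boto3.client("bedrock-runtime")

def draft_section(request: str) -> str:
    hits = kb.retrieve(
        knowledgeBaseId="KB_ID_PLACEHOLDER",
        retrievalQuery={"text": request},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    context = "\n\n".join(h["content"]["text"] for h in hits["retrievalResults"])
    reply = llm.converse(
        modelId="MODEL_ID_PLACEHOLDER",
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nDraft this AP section: {request}"}],
        }],
    )
    return reply["output"]["message"]["content"][0]["text"]
```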
Security and compliance
Security and compliance are paramount at AWS when dealing with data about our customers. We use AWS IAM Identity Center for enterprise single sign-on so that only authorized users can access the account plans draft assistant. Through Field Advisor, we apply various internal authorization mechanisms to help ensure that a user generating APs only accesses data they already have access to.
User experience
We built a custom web frontend using a micro-frontend approach that integrates directly into our CRM system, allowing AMs to access the account plans draft assistant without leaving their familiar work environment. The interface allows users to select which sections of APs they want to generate, provides options for customization, and sends Slack notifications to remind users to create their APs on time.
Looking ahead
While the account plans draft assistant has already demonstrated significant value, we’re continuing to enhance its capabilities. Our goal is to create a zero-touch account planner that sales teams can use to generate a full AP for a customer, incorporating best practices observed across our customers to provide sales teams best-in-class strategies to engage with customers. This would include:

Deeper integration with our purpose-built planning tools and assistance with account planning, such as automatically generating value maps and stakeholder maps.
Enhanced personalization to tailor content based on industry, account size, and individual user preferences.
Improved collaboration features, so that multiple sales team members can work together on refining AI-generated plans.
Expanded use of recommendations to provide “what next?” ideas to our sales teams to better serve our customers.

Conclusion
The account plans draft assistant, powered by Amazon Bedrock, has significantly streamlined our AP process, allowing our AWS Sales teams to create higher quality APs in a fraction of the time they previously needed. As we continue to refine and expand this capability, we’re excited to see how it will further enhance our ability to serve our customers and drive their success in the AWS Cloud.
If you’re interested in learning how generative AI can transform your sales function and its processes, reach out to your AWS account team to discuss how services such as Amazon Q and Amazon Bedrock can help you build similar solutions for your organization.

About the Authors
Saksham Kakar is a Sr. Product Manager (Technical) in the AWS Field Experiences (AFX) organization focused on developing products that enable AWS Sales teams to help AWS customers grow with Amazon. Prior to this, Saksham led large sales, strategy and operations teams across startups and Fortune 500 companies. Outside of work, he is an avid tennis player and amateur skier.
Vimanyu Aggarwal is a Senior Software Engineer in AWS Field Experiences (AFX) organization with over 10 years of industry experience. Over the last decade, Vimanyu has been focusing on building large-scale, complex distributed systems at various Fortune 500 organizations. Currently, he works with multiple teams within the AFX organization to deliver technical solutions that empower the $100 billion sales funnel. Outside of work, he likes to play board games, tinker with IoT, and explore nature.
Krishnachand Velaga is a Senior Manager for Product Management – Technical (PM-T) in the AWS Field Experiences (AFX) organization. He manages a team of seasoned PM-Ts and a suite of sales products that use generative AI to enable the AWS Sales organization to help AWS customers across the globe adopt, migrate, and grow on the AWS Cloud in line with their business needs and outcomes, while bolstering sales efficiency and productivity and reducing operational cost.
Scott Wilkinson is a Software Development Manager in the AWS Field Experiences (AFX) organization, where he leads a cross-functional engineering team developing tools that aggregate and productize data to power AWS customer insights. Prior to AWS, Scott worked for notable startups including Digg, eHarmony, and Nasty Gal in both leadership and software development roles. Outside of work, Scott is a musician (guitar and piano) and loves to cook French cuisine.

Shaping the future: OMRON’s data-driven journey with AWS

This post is co-written with Emrah Kaya and Xinyi Zhou from Omron Europe.
Data is one of the most critical assets of many organizations, which are constantly seeking ways to use their vast amounts of information to gain competitive advantages.
OMRON Corporation is a leading technology provider in industrial automation, healthcare, and electronic components. In their Shaping the Future 2030 (SF2030) strategic plan, OMRON aims to address diverse social issues, drive sustainable business growth, transform business models and capabilities, and accelerate digital transformation. At the heart of this transformation is the OMRON Data & Analytics Platform (ODAP), an innovative initiative designed to revolutionize how the company harnesses its data assets.
This post explores how OMRON Europe is using Amazon Web Services (AWS) to build its advanced ODAP and its progress toward harnessing the power of generative AI.
Challenges
By using advanced data and analytics capabilities, organizations can gain valuable insights into their operations, industry trends, and customer behaviors, leading to more informed strategies and better decisions. This approach is particularly powerful when applied to mission-critical data such as enterprise resource planning (ERP) and customer relationship management (CRM) systems because these contain information about internal processes, supply chain management, and customer interactions. By analyzing their data, organizations can identify patterns in sales cycles, optimize inventory management, or help tailor products or services to meet customer needs more effectively. However, organizations often face significant challenges in realizing these benefits because of:

Data silos – Organizations often use multiple systems across regions or departments. Integrating these diverse sources to create a single source of truth is complex, making it difficult to generate unified reports or analyze cross-functional trends.
Data governance challenges – Maintaining consistent data governance across different systems is crucial but complex. Implementing uniform policies across different systems and departments presents significant hurdles.
Different formats and standards – Systems typically use varied data formats and structures. This disparity complicates data integration and cross-system analysis, requiring significant effort to reconcile and harmonize data for comprehensive insights.

OMRON Data & Analytics Platform
To address these challenges, OMRON Europe (hereinafter “OMRON”) decided to implement an advanced data and analytics platform, ODAP. This innovative solution was designed to serve as a centralized hub for specific data assets, breaking down the barriers between various data sources and systems.
The following diagram shows a simplified architecture and some of the services and architectural patterns used for ODAP.

ODAP aimed to seamlessly integrate data from multiple ERP and CRM systems in addition to other relevant data sources across the organization. Amazon AppFlow was used to facilitate the smooth and secure transfer of data from various sources into ODAP. Additionally, Amazon Simple Storage Service (Amazon S3) served as the central data lake, providing a scalable and cost-effective storage solution for the diverse data types collected from different systems. The robust security features provided by Amazon S3, including encryption and durability, were used to provide data protection. Finally, ODAP was designed to incorporate cutting-edge analytics tools and future AI-powered insights.
Some of these tools included AWS Cloud-based solutions, such as AWS Lambda and AWS Step Functions. Lambda enables serverless, event-driven data processing tasks, allowing for real-time transformations and calculations as data arrives. Step Functions complements this by orchestrating complex workflows, coordinating multiple Lambda functions, and managing error handling for sophisticated data processing pipelines. This enables OMRON to extract meaningful patterns and trends from its vast data repositories, supporting more informed decision-making at all levels of the organization.
OMRON’s data strategy—represented on ODAP—also allowed the organization to unlock generative AI use cases focused on tangible business outcomes and enhanced productivity. Part of a comprehensive approach to using artificial intelligence and machine learning (AI/ML) and generative AI includes a strong data strategy that can help provide high quality and reliable data.
Embracing generative AI with Amazon Bedrock
The company has identified several use cases where generative AI can significantly impact operations, particularly in analytics and business intelligence (BI).
One key initiative is ODAPChat, an AI-powered chat-based assistant employees can use to interact with data using natural language queries. This tool democratizes data access across the organization, enabling even nontechnical users to gain valuable insights.
A standout application is the SQL-to-natural language capability, which translates complex SQL queries into plain English and vice versa, bridging the gap between technical and business teams. To power these advanced AI features, OMRON chose Amazon Bedrock. This fully managed service offers a range of foundation models (FMs), providing the flexibility to select the most suitable model for each use case. The straightforward implementation of Amazon Bedrock, coupled with its scalability to handle growing data volumes and user requests, made it an ideal choice for OMRON. The ability of Amazon Bedrock to support various models from different providers helps make sure that OMRON can always use the most advanced AI capabilities as they evolve.
Crucially, the robust security features provided by Amazon Bedrock align perfectly with OMRON’s stringent data protection requirements. Some highlights include:

Fine-grained access controls
Networking security features, such as encryption of data in transit and at rest and the ability to use virtual private clouds (VPCs), which help make sure that sensitive business data remains secure even when being processed by AI models
Amazon Bedrock Guardrails

These strict security controls offer a comprehensive security approach that allows OMRON to innovate with AI while maintaining the highest standards of data governance and protection.
The following diagram shows a basic layout of how the solution works. It helps illustrate the main parts and how they work together to make the AI assistant do its job.

The system has three main sections:

User interface – Users engage with the chat interface hosted on AWS. Amazon Cognito handles the user authentication processes, providing secure access to the application.
Input processing backend – The Amazon API Gateway receives incoming messages, which are then processed by containers running on Amazon Elastic Container Service (Amazon ECS). Chat conversations are preserved in Amazon DynamoDB to be used for the follow-up conversation. Amazon Bedrock takes care of generating AI responses, and tools are configured using LangChain, which helps determine how to handle different types of queries. When needed, the system can access an ODAP data warehouse to retrieve additional information.
Document management – Documents are securely stored in Amazon S3, and when new documents are added, a Lambda function processes them into chunks. These chunks are converted into embeddings using Amazon Bedrock and the embeddings are stored in an Amazon OpenSearch Service vector store for semantic search.
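The document-management path just described could look roughly like the following Lambda-style sketch. The bucket, embedding model ID, index name, and OpenSearch endpoint are placeholders, and the production pipeline is certainly more elaborate (chunk overlap, error handling, k-NN index mappings, and so on).

```python
# Rough sketch of the document-management flow: fetch from S3, chunk, embed
# with Amazon Bedrock, and index into OpenSearch. All names are placeholders.
import json
import boto3
from opensearchpy import OpenSearch

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")
search = OpenSearch(hosts=["https://OPENSEARCH_ENDPOINT_PLACEHOLDER"])

def index_document(bucket: str, key: str, chunk_size: int = 1000) -> None:
    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        resp = bedrock.invoke_model(
            modelId="amazon.titan-embed-text-v2:0",  # embedding model is an assumption
            body=json.dumps({"inputText": chunk}),
        )
        embedding = json.loads(resp["body"].read())["embedding"]
        search.index(
            index="odap-chunks",
            body={"source": key, "text": chunk, "vector": embedding},
        )
```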

Results and future plans
The implementation of ODAP and ODAPChat on AWS has already yielded significant benefits for OMRON:

Optimization of reports, leading to more efficient and insightful analysis
SQL-to-natural language capabilities powered by generative AI, making data more accessible to nontechnical users
Increased business agility with infrastructure fully deployed in the cloud
Data democratization, enabling more employees to use data-driven insights

Looking ahead, OMRON plans to significantly expand its use of AWS services and further use generative AI capabilities. The company aims to integrate additional data sources, including other mission-critical systems, into ODAP. This expansion will be coupled with enhanced data governance measures to help promote data quality and compliance across the growing data solution.
OMRON is also exploring additional, more advanced generative AI use cases. These initiatives will use the evolving capabilities provided by Amazon Bedrock to potentially incorporate advanced AI models and security features.
Conclusion
OMRON’s journey with AWS demonstrates the transformative power of cloud-based data solutions and generative AI in overcoming data silos and driving business innovation. By using AWS services such as Amazon AppFlow, Amazon S3, and Amazon Bedrock, OMRON has created a comprehensive, secure, and adaptable data and analytics platform that not only meets its current needs, but also positions the company for future growth and innovation.
As organizations across industries grapple with similar data challenges, OMRON’s story serves as an inspiring example of how embracing cloud technologies and AI can lead to significant business transformation and competitive advantage.

About the Authors
Emrah Kaya is Data Engineering Manager at Omron Europe and Platform Lead for the ODAP Project. With his extensive background in cloud and data architecture, Emrah leads OMRON's key technological advancement initiatives, including artificial intelligence, machine learning, and data science.
Xinyi Zhou is a Data Engineer at Omron Europe, bringing her expertise to the ODAP team led by Emrah Kaya. She specializes in building efficient data pipelines and managing AWS infrastructure, while actively contributing to the implementation of new solutions that advance ODAP’s technological capabilities.
Emel Mendoza is a Senior Solutions Architect at AWS based in the Netherlands. With a passion for cloud migrations and application modernization, Emel helps organizations navigate their digital transformation journeys on AWS. Emel leverages his decade of experience to guide customers in adopting AWS services and architecting scalable, efficient solutions.
Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for Generative AI to help customers and partners build GenAI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture and ML applications.

AI Workforce: using AI and Drones to simplify infrastructure inspectio …

Inspecting wind turbines, power lines, 5G towers, and pipelines is a tough job. It’s often dangerous, time-consuming, and prone to human error. That’s why we at Amazon Web Services (AWS) are working on AI Workforce—a system that uses drones and AI to make these inspections safer, faster, and more accurate.
This post is the first in a three-part series exploring AI Workforce, the AWS AI-powered drone inspection system. In this post, we introduce the concept and key benefits. The second post dives into the AWS architecture that powers AI Workforce, and the third focuses on the drone setup and integration.
In the following sections, we explain how AI Workforce enables asset owners, maintenance teams, and operations managers in industries such as energy and telecommunications to enhance safety, reduce costs, and improve efficiency in infrastructure inspections.
Challenges with traditional inspections
Inspecting infrastructure using traditional methods is a challenge. You need trained people and specialized equipment, and you often must shut things down during inspection. As an example, climbing a wind turbine in bad weather for an inspection can be dangerous. Plus, even the best human inspector can miss things. This can lead to bigger problems down the line, costing time and money.

How AI Workforce helps
AI Workforce is designed to change all that. We use autonomous drones equipped with advanced sensors and AI to do the inspections. This brings the following benefits:

Less risk for people – Drones do the dangerous work so people don’t have to. This makes inspections much safer.
Faster and more efficient – Drones can cover a lot of ground quickly, getting the job done faster.
Better data – Automated data collection and analysis means fewer mistakes and more consistent results. This allows for proactive maintenance.

What does AI Workforce look like in action? Users interact with a simple AI assistant and dashboard that displays near real-time drone inspections, detected issues, and AI-generated insights. The following figure shows an example of the user dashboard and drone conversation.

The following figure is an example of drone 4K footage.

Solution overview
AI Workforce is built on a robust and scalable architecture using a wide array of AWS services. Security is paramount, and we adhere to AWS best practices across the layers. This includes:

Amazon API Gateway manages secure communication between various components, enforcing authentication and authorization
AWS Identity and Access Management (IAM) roles and policies verify least privilege access, limiting each component’s permissions to only what is necessary
Network security is implemented through virtual private clouds (VPCs), security groups, and network access control lists (ACLs), isolating the system and protecting it from unauthorized access
For video processing, we employ secure transfer protocols and encryption at rest and in transit

AI Workforce provides a robust API for managing drone operations, including flight planning, telemetry data, and anomaly detection. The following diagram outlines how different components interact.

Imagine a system where drones autonomously inspect critical infrastructure, capturing high-resolution video, analyzing potential defects with AI, and seamlessly integrating findings into business workflows. The AI Workforce architecture brings this vision to life, using AWS services across four key pillars.
Control plane: Secure drone communication and operations
Our journey begins with automated drone flights. Each drone follows predefined routes, with flight waypoints, altitude, and speed configured through an AWS API, using coordinates stored in Amazon DynamoDB. Once airborne, AWS IoT Core enables secure, bidirectional communication—allowing drones to receive real-time commands (like “take-off”, “begin flight ID = xxx”, or “land”), adjust flight paths, and transmit telemetry data back to AWS. To maintain robust security, AWS Lambda responds to Internet of Things (IoT) events, enabling immediate actions based on drone data, while Amazon GuardDuty continuously monitors for anomalies or potential security threats, such as unusual API activity or unauthorized access attempts, helping protect the integrity of drone operations and promoting secure operations.
In AI Workforce, AWS IoT Core serves as the primary entry point for real-time drone communication, handling telemetry data, command and control messaging, and secure bidirectional communication with drones. API Gateway plays a complementary role by acting as the main entry point for external applications, dashboards, and enterprise integrations. It is responsible for managing RESTful API calls related to flight planning, retrieving inspection results, and interacting with backend services like Amazon Relational Database Service (Amazon RDS) and AWS Step Functions. While drones communicate directly with AWS IoT Core, user-facing applications and automation workflows rely on API Gateway to access structured data and trigger specific actions within the AI Workforce ecosystem.
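A much-simplified version of the Lambda that reacts to those IoT events might look like the sketch below. The payload fields, table name, and state machine ARN are invented placeholders rather than the actual AI Workforce implementation.

```python
# Illustrative Lambda triggered by an AWS IoT rule carrying drone telemetry.
# Payload fields, table name, and state machine ARN are placeholders.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")
telemetry_table = dynamodb.Table("DroneTelemetryPlaceholder")

def handler(event, context):
    # The IoT rule delivers the published MQTT message as the event payload.
    telemetry_table.put_item(Item={
        "drone_id": event["drone_id"],
        "timestamp": event["timestamp"],
        "battery": event.get("battery"),
        "position": json.dumps(event.get("position", {})),
    })
    # Kick off downstream processing when the drone flags a potential defect.
    if event.get("anomaly_detected"):
        sfn.start_execution(
            stateMachineArn="arn:aws:states:REGION:ACCOUNT_ID:stateMachine:PLACEHOLDER",
            input=json.dumps(event),
        )
    return {"status": "stored"}
```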
AI/ML and generative AI: Computer vision and intelligent insights
As drones capture video footage, raw data is processed through AI-powered models running on Amazon Elastic Compute Cloud (Amazon EC2) instances. These computer vision models detect anomalies, classify damage types, and extract actionable insights—whether it’s spotting cracks on wind turbines or identifying corrosion on pipelines. Amazon SageMaker AI is at the core of our machine learning (ML) pipeline, training and deploying models for object detection, anomaly detection, and predictive maintenance.
We are also pioneering generative AI with Amazon Bedrock, enhancing our system’s intelligence. With natural language interactions, asset owners can ask questions like “What were the most critical defects detected last week?” and Amazon Bedrock generates structured reports based on inspection findings. It even aids in synthetic training data generation, refining our ML models for improved accuracy.
Data layer: Storing and managing inspection data
Every inspection generates vast amounts of data—high-resolution images, videos, and sensor readings. This information is securely stored in Amazon Simple Storage Service (Amazon S3), promoting durability and ease of access. Amazon S3 encrypts data at rest by default using server-side encryption (SSE), providing an additional layer of security without requiring manual configuration. Meanwhile, structured metadata and processed results are housed in Amazon RDS, enabling fast queries and integration with enterprise applications. Together, these services create a resilient data foundation, supporting both real-time analysis and historical trend monitoring.
Analytics and business: Automated workflows and business intelligence
Insights don’t stop at data collection—Step Functions orchestrates workflows that trigger automated actions. For example, if an AI model detects a critical defect, Step Functions can initiate a maintenance request in SAP, notify engineers, and schedule repairs without human intervention.
For deeper analysis, Amazon QuickSight transforms raw inspection data into interactive dashboards, helping asset owners track infrastructure health, spot trends, and optimize maintenance strategies. With a clear visual representation of defects, decision-makers can act swiftly, minimizing downtime and maximizing operational efficiency.
The future of AI Workforce: Expanding drone capabilities
Beyond inspections, AI Workforce provides a robust Drone API, offering seamless integration for third-party applications. This API enables remote flight planning, telemetry monitoring, and anomaly detection—all within a scalable AWS environment.
With secure drone communication, powerful AI-driven insights, a robust data foundation, and business automation, AI Workforce is redefining infrastructure inspection, making it smarter, faster, and more efficient than ever before.
Benefits and impact on business operations
The deployment of AI Workforce delivers a wide range of tangible benefits for organizations managing critical infrastructure (for example, automatically comparing multiple inspections over time to detect longitudinal changes and identify progressive failures for proactive maintenance), particularly in the energy and telecommunications sectors:

Significant cost savings – By reducing the need for human labor, specialized equipment, and extensive logistical planning, AI Workforce can significantly lower inspection costs. Proactive maintenance based on early defect detection also prevents costly repairs and unplanned downtime.
Dramatically enhanced safety – Removing human personnel from hazardous environments drastically reduces the risk of accidents and injuries, creating a safer working environment.
Substantially improved efficiency – Automated drone inspections are significantly faster and more efficient than traditional methods, enabling more frequent inspections and faster turnaround times.
Data-driven decision-making – AI Workforce provides asset owners with comprehensive and accurate data, enabling them to make informed decisions about maintenance, repairs, and asset management.

Example AI Workforce use case in the industry sector
Picture an energy company responsible for maintaining a large wind farm. They deploy AI Workforce drones for regular inspections. The drones, autonomously navigating preprogrammed flight paths defined by coordinates stored in DynamoDB and controlled through REST API calls, are securely connected using AWS IoT Core.
During the flight, sensor data is processed at the edge and streamed to Amazon S3, with metadata stored in Amazon RDS. Computer vision algorithms analyze the video in real time. If an anomaly is detected, a Lambda function triggers a Step Functions workflow, which in turn interacts with their SAP system to generate a maintenance work order. Inspection data is aggregated and visualized in QuickSight dashboards, providing a comprehensive overview of the wind farm’s health.
SageMaker AI models analyze the data, predicting potential failures and informing proactive maintenance strategies. In the future, Amazon Bedrock might provide summarized reports and generate synthetic data to further enhance the system’s capabilities.
Conclusion
At AWS, we’re committed to driving innovation in AI-powered solutions for a wide range of industries. AI Workforce is a prime example of how we’re using cutting-edge technologies to transform how critical infrastructure is managed and maintained.
We’re building this workforce to help businesses operate more efficiently and safely. We’re open to collaborating with others who are interested in this space. If you’d like to learn more, feel free to reach out. We welcome the opportunity to discuss your specific needs and explore potential collaborations.

About the Author
Miguel Muñoz de Rivera González is the original designer and technical lead for the AI Workforce initiative at AWS, driving AI-powered drone solutions for safer, smarter, and cost-effective infrastructure inspections.

OpenAI Releases PaperBench: A Challenging Benchmark for Assessing AI …

The rapid progress in artificial intelligence (AI) and machine learning (ML) research underscores the importance of accurately evaluating AI agents’ capabilities in replicating complex, empirical research tasks traditionally performed by human researchers. Currently, systematic evaluation tools that precisely measure the ability of AI agents to autonomously reproduce ML research findings remain limited, posing challenges in fully understanding the potential and limitations of such systems.

OpenAI has introduced PaperBench, a benchmark designed to evaluate the competence of AI agents in autonomously replicating state-of-the-art machine learning research. PaperBench specifically measures whether AI systems can accurately interpret research papers, independently develop the necessary codebases, and execute experiments to replicate empirical outcomes. The benchmark comprises 20 papers selected from ICML 2024, covering areas including reinforcement learning, robustness, and probabilistic methods. Detailed rubrics, co-developed with original paper authors, specify 8,316 individually gradable tasks to facilitate precise evaluation of AI capabilities.

From a technical perspective, PaperBench requires AI agents to process provided research papers and supplementary clarifications to develop comprehensive code repositories from scratch. These repositories must include complete experimental setups and execution scripts, notably the reproduce.sh file. To ensure genuine independent replication, agents are prohibited from referencing or reusing code from the original authors’ repositories. Rubrics are structured hierarchically to detail explicit pass-fail criteria at various levels, allowing systematic and objective assessment. Evaluation is conducted using SimpleJudge, an automated large language model (LLM)-based judge, which simplifies the grading process. SimpleJudge achieved an F1 score of 0.83 on JudgeEval, an auxiliary evaluation dataset specifically designed to validate automated grading accuracy.
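The hierarchical rubric can be thought of as a weighted tree whose leaves are pass/fail requirements judged by SimpleJudge and whose internal nodes aggregate their children. The sketch below is a generic illustration of that idea with invented node names and weights; it is not OpenAI's actual scoring code.

```python
# Generic illustration of rolling a hierarchical rubric up into one score.
# Node names and weights are invented; only the leaf pass/fail structure and
# weighted roll-up mirror the kind of rubric PaperBench describes.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    passed: Optional[bool] = None              # set only on leaf requirements
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        if not self.children:                  # leaf: binary pass/fail
            return 1.0 if self.passed else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight

rubric = RubricNode("replication", children=[
    RubricNode("code_development", weight=2.0, children=[
        RubricNode("dataloader_implemented", passed=True),
        RubricNode("training_loop_implemented", passed=False),
    ]),
    RubricNode("results_match_paper", weight=1.0, passed=False),
])
print(f"replication score: {rubric.score():.2f}")  # 0.33 for this toy tree
```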

Empirical evaluations of several advanced AI models indicate varying performance levels on PaperBench. Claude 3.5 Sonnet exhibited the highest capability with an average replication score of 21.0%. Other models such as OpenAI’s GPT-4o and Gemini 2.0 Flash attained significantly lower scores of 4.1% and 3.2%, respectively. Comparatively, expert human ML researchers achieved considerably higher accuracy, reaching up to 41.4% after 48 hours of dedicated effort. Analysis of model performance revealed strengths in initial rapid code generation and early experimental setup but highlighted substantial weaknesses in managing prolonged tasks, troubleshooting, and adapting strategic approaches over time.

These results provide critical technical insights into current AI system capabilities. While AI models demonstrate competence in certain coding tasks and initial experiment implementation, significant gaps persist, particularly regarding sustained task execution, adaptive problem-solving, and strategic planning. Additionally, the introduction of PaperBench Code-Dev, a streamlined variant emphasizing code correctness without experimental execution, offers a practical alternative for broader and resource-limited community use due to reduced computational and evaluation costs.

In summary, PaperBench represents an important step toward methodically evaluating AI research capabilities. It provides a structured and detailed assessment environment that highlights specific strengths and limitations of contemporary AI models relative to human performance. The collaborative development of rubrics ensures precise and realistic evaluations. OpenAI’s open-sourcing of PaperBench supports further exploration and development in the field, enhancing understanding of autonomous AI research capabilities and informing responsible progression in this area.

Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.

The post OpenAI Releases PaperBench: A Challenging Benchmark for Assessing AI Agents’ Abilities to Replicate Cutting-Edge Machine Learning Research appeared first on MarkTechPost.

Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distribu …

Foundation model (FM) training and inference have led to a significant increase in computational needs across the industry. These models require massive amounts of accelerated compute to train and operate effectively, pushing the boundaries of traditional computing infrastructure. They require efficient systems for distributing workloads across multiple GPU-accelerated servers and for optimizing developer velocity as well as performance.
Ray is an open source framework that makes it straightforward to create, deploy, and optimize distributed Python jobs. At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. It provides a set of high-level APIs for tasks, actors, and data that abstract away the complexities of distributed computing, enabling developers to focus on the core logic of their applications. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application. Ray’s key features include efficient task scheduling, fault tolerance, and automatic resource management, making it a powerful tool for building a wide range of distributed applications, from ML models to real-time data processing pipelines. With its growing ecosystem of libraries and tools, Ray has become a popular choice for organizations looking to use the power of distributed computing to tackle complex and data-intensive problems.
Amazon SageMaker HyperPod is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same-spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of Ray provides a powerful framework to scale up your generative AI workloads.
In this post, we demonstrate the steps involved in running Ray jobs on SageMaker HyperPod.
Overview of Ray
This section provides a high-level overview of the Ray tools and frameworks for AI/ML workloads. We primarily focus on ML training use cases.
Ray is an open-source distributed computing framework designed to run highly scalable and parallel Python applications. Ray manages, executes, and optimizes compute needs across AI workloads. It unifies infrastructure through a single, flexible framework—enabling AI workloads from data processing, to model training, to model serving and beyond.
For distributed jobs, Ray provides intuitive tools for parallelizing and scaling ML workflows. It allows developers to focus on their training logic without the complexities of resource allocation, task scheduling, and inter-node communication.
At a high level, Ray is made up of three layers:

Ray Core: The foundation of Ray, providing primitives for parallel and distributed computing
Ray AI libraries:

Ray Train – A library that simplifies distributed training by offering built-in support for popular ML frameworks like PyTorch, TensorFlow, and Hugging Face
Ray Tune – A library for scalable hyperparameter tuning
Ray Serve – A library for distributed model deployment and serving

Ray clusters: A distributed computing platform where worker nodes run user code as Ray tasks and actors, generally in the cloud

In this post, we dive deep into running Ray clusters on SageMaker HyperPod. A Ray cluster consists of a single head node and a number of connected worker nodes. The head node orchestrates task scheduling, resource allocation, and communication between nodes. The Ray worker nodes execute the distributed workloads using Ray tasks and actors, such as model training or data preprocessing.
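To ground the terminology, the snippet below shows Ray's task and actor primitives running locally; the same code runs unchanged on a multi-node cluster once `ray.init()` points at the cluster address.

```python
# Minimal Ray tasks and actors; runs locally and, unchanged, on a cluster.
import ray

ray.init()  # on a cluster, e.g. ray.init(address="auto")

@ray.remote
def square(x: int) -> int:
    # A task: a stateless function scheduled on any available worker.
    return x * x

@ray.remote
class Counter:
    # An actor: a stateful worker process that keeps data between calls.
    def __init__(self):
        self.total = 0

    def add(self, value: int) -> int:
        self.total += value
        return self.total

futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]

counter = Counter.remote()
ray.get([counter.add.remote(v) for v in ray.get(futures)])
print(ray.get(counter.add.remote(0)))  # running total of the squares: 140
```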
Ray clusters and Kubernetes clusters pair well together. By running a Ray cluster on Kubernetes using the KubeRay operator, both Ray users and Kubernetes administrators benefit from the smooth path from development to production. For this use case, we use a SageMaker HyperPod cluster orchestrated through Amazon Elastic Kubernetes Service (Amazon EKS).
The KubeRay operator enables you to run a Ray cluster on a Kubernetes cluster. KubeRay creates the following custom resource definitions (CRDs):

RayCluster – The primary resource for managing Ray instances on Kubernetes. The nodes in a Ray cluster manifest as pods in the Kubernetes cluster.
RayJob – A single executable job designed to run on an ephemeral Ray cluster. It serves as a higher-level abstraction for submitting tasks or batches of tasks to be executed by the Ray cluster. A RayJob also manages the lifecycle of the Ray cluster, making it ephemeral by automatically spinning up the cluster when the job is submitted and shutting it down when the job is complete.
RayService – Combines a Ray cluster and a Serve application that runs on top of it into a single Kubernetes manifest. It allows for the deployment of Ray applications that need to be exposed for external communication, typically through a service endpoint.

For the remainder of this post, we don’t focus on RayJob or RayService; we focus on creating a persistent Ray cluster to run distributed ML training jobs.
When Ray clusters are paired with SageMaker HyperPod clusters, Ray clusters unlock enhanced resiliency and auto-resume capabilities, which we will dive deeper into later in this post. This combination provides a solution for handling dynamic workloads, maintaining high availability, and providing seamless recovery from node failures, which is crucial for long-running jobs.
Overview of SageMaker HyperPod
In this section, we introduce SageMaker HyperPod and its built-in resiliency features to provide infrastructure stability.
Generative AI workloads such as training, inference, and fine-tuning involve building, maintaining, and optimizing large clusters of thousands of GPU accelerated instances. For distributed training, the goal is to efficiently parallelize workloads across these instances in order to maximize cluster utilization and minimize time to train. For large-scale inference, it’s important to minimize latency, maximize throughput, and seamlessly scale across those instances for the best user experience. SageMaker HyperPod is a purpose-built infrastructure to address these needs. It removes the undifferentiated heavy lifting involved in building, maintaining, and optimizing a large GPU accelerated cluster. It also provides flexibility to fully customize your training or inference environment and compose your own software stack. You can use either Slurm or Amazon EKS for orchestration with SageMaker HyperPod.
Due to their massive size and the need to train on large amounts of data, FMs are often trained and deployed on large compute clusters composed of thousands of AI accelerators such as GPUs and AWS Trainium. A single failure in one of these thousand accelerators can interrupt the entire training process, requiring manual intervention to identify, isolate, debug, repair, and recover the faulty node in the cluster. This workflow can take several hours for each failure and as the scale of the cluster grows, it’s common to see a failure every few days or even every few hours. SageMaker HyperPod provides resiliency against infrastructure failures by applying agents that continuously run health checks on cluster instances, fix the bad instances, reload the last valid checkpoint, and resume the training—without user intervention. As a result, you can train your models up to 40% faster. You can also SSH into an instance in the cluster for debugging and gather insights on hardware-level optimization during multi-node training. Orchestrators like Slurm or Amazon EKS facilitate efficient allocation and management of resources, provide optimal job scheduling, monitor resource utilization, and automate fault tolerance.
Solution overview
This section provides an overview of how to run Ray jobs for multi-node distributed training on SageMaker HyperPod. We go over the architecture and the process of creating a SageMaker HyperPod cluster, installing the KubeRay operator, and deploying a Ray training job.
Although this post provides a step-by-step guide to manually create the cluster, feel free to check out the aws-do-ray project, which aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. It uses Docker to containerize the tools necessary to deploy and manage Ray clusters, jobs, and services. In addition to the aws-do-ray project, we’d like to highlight the Amazon SageMaker HyperPod EKS workshop, which offers an end-to-end experience for running various workloads on SageMaker HyperPod clusters. There are multiple examples of training and inference workloads in the GitHub repository awsome-distributed-training.
As introduced earlier in this post, KubeRay simplifies the deployment and management of Ray applications on Kubernetes. The following diagram illustrates the solution architecture.

Create a SageMaker HyperPod cluster
Prerequisites
Before deploying Ray on SageMaker HyperPod, you need a HyperPod cluster:

This is a one-click deployment that sets up your EKS-orchestrated SageMaker HyperPod cluster using AWS CloudFormation. Deploy this stack, which comes from the Amazon EKS Support in SageMaker HyperPod workshop. You can then proceed directly to the section Create an FSx for Lustre shared file system.

If you prefer to deploy HyperPod on an existing EKS cluster, follow the instructions here, which include:

EKS cluster – You can associate SageMaker HyperPod compute to an existing EKS cluster that satisfies the set of prerequisites. Alternatively (and recommended), you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the GitHub repo for instructions on setting up an EKS cluster.
Custom resources – Running multi-node distributed training requires various resources, such as device plugins, Container Storage Interface (CSI) drivers, and training operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health checks. The HyperPod Helm charts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to Install packages on the Amazon EKS cluster using Helm for installation instructions.

The following provides an example workflow for creating a HyperPod cluster on an existing EKS cluster after deploying the prerequisites. This is for reference only and not required for the quick deploy option.

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        {
            "InstanceGroupName": "head-group",
            "InstanceType": "ml.m5.2xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "${SECURITY_GROUP_ID}"
        ],
        "Subnets": [
            "${SUBNET_ID}"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

The provided configuration file contains two key highlights:

“OnStartDeepHealthChecks”: [“InstanceStress”, “InstanceConnectivity”] – Instructs SageMaker HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
“NodeRecovery”: “Automatic” – Enables SageMaker HyperPod automated node recovery

You can create a SageMaker HyperPod compute with the following AWS Command Line Interface (AWS CLI) command (AWS CLI version 2.17.47 or newer is required):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json
{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

-------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                  ListClusters                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                               ClusterSummaries                                                                ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||                           ClusterArn                            |        ClusterName        | ClusterStatus  |           CreationTime             ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||  arn:aws:sagemaker:us-west-2:xxxxxxxxxxxx:cluster/zsmyi57puczf  |  ml-cluster               |  InService     |  2025-03-03T16:45:05.320000+00:00  ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|

Alternatively, you can verify the cluster status on the SageMaker console. After a brief period, you can observe that the status for the nodes transitions to Running.
Create an FSx for Lustre shared file system
To deploy the Ray cluster, we need the SageMaker HyperPod cluster to be up and running, along with a shared storage volume (for example, an Amazon FSx for Lustre file system) that the SageMaker HyperPod nodes can access. This file system can be provisioned statically before launching your SageMaker HyperPod cluster or dynamically afterwards.
Specifying a shared storage location (such as cloud storage or NFS) is optional for single-node clusters, but it is required for multi-node clusters. Using a local path will raise an error during checkpointing for multi-node clusters.
The Amazon FSx for Lustre CSI driver uses IAM roles for service accounts (IRSA) to authenticate AWS API calls. To use IRSA, an IAM OpenID Connect (OIDC) provider needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.
Create an IAM OIDC identity provider for your cluster with the following command:

eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --approve

Deploy the FSx for Lustre CSI driver:

helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm repo update
helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
    --namespace kube-system

This Helm chart includes a service account named fsx-csi-controller-sa that gets deployed in the kube-system namespace.
Use the eksctl CLI to create an AWS Identity and Access Management (IAM) role bound to the service account used by the driver, attaching the AmazonFSxFullAccess AWS managed policy:

eksctl create iamserviceaccount \
    --name fsx-csi-controller-sa \
    --override-existing-serviceaccounts \
    --namespace kube-system \
    --cluster $EKS_CLUSTER_NAME \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
    --approve \
    --role-name AmazonEKSFSxLustreCSIDriverFullAccess \
    --region $AWS_REGION

The --override-existing-serviceaccounts flag lets eksctl know that the fsx-csi-controller-sa service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the current service account instead.
Annotate the driver’s service account with the Amazon Resource Name (ARN) of the AmazonEKSFSxLustreCSIDriverFullAccess IAM role that was created:

SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)

kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
    eks.amazonaws.com/role-arn=${SA_ROLE_ARN} --overwrite=true

This annotation lets the driver know what IAM role it should use to interact with the FSx for Lustre service on your behalf.
Verify that the service account has been properly annotated:

kubectl get serviceaccount -n kube-system fsx-csi-controller-sa -o yaml

Restart the fsx-csi-controller deployment for the changes to take effect:

kubectl rollout restart deployment fsx-csi-controller -n kube-system

The FSx for Lustre CSI driver presents you with two options for provisioning a file system:

Dynamic provisioning – This option uses Persistent Volume Claims (PVCs) in Kubernetes. You define a PVC with desired storage specifications. The CSI driver automatically provisions the FSx for Lustre file system for you based on the PVC request. This allows for straightforward scaling and eliminates the need to manually create file systems.
Static provisioning – In this method, you manually create the FSx for Lustre file system before using the CSI driver. You will need to configure details like subnet ID and security groups for the file system. Then, you can use the driver to mount this pre-created file system within your container as a volume.

For this example, we use dynamic provisioning. Start by creating a storage class that uses the fsx.csi.aws.com provisioner:

cat <<EOF > storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: ${SUBNET_ID}
  securityGroupIds: ${SECURITYGROUP_ID}
  deploymentType: PERSISTENT_2
  automaticBackupRetentionDays: "0"
  copyTagsToBackups: "true"
  perUnitStorageThroughput: "250"
  dataCompressionType: "LZ4"
  fileSystemTypeVersion: "2.12"
mountOptions:
  - flock
EOF

kubectl apply -f storageclass.yaml

SUBNET_ID: The subnet ID in which the FSx for Lustre file system is created. It should be the same private subnet that was used for HyperPod creation.
SECURITYGROUP_ID: The security group IDs that will be attached to the file system. They should be the same security group IDs that are used for HyperPod and EKS.

Next, create a PVC named fsx-claim that uses the fsx-sc storage class:

cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi
EOF

kubectl apply -f pvc.yaml

This PVC will start the dynamic provisioning of an FSx for Lustre file system based on the specifications provided in the storage class.
Create the Ray cluster
Now that we have both the SageMaker HyperPod cluster and the FSx for Lustre file system created, we can set up the Ray cluster:

Set up dependencies. We will create a new namespace in our Kubernetes cluster and install the KubeRay operator using a Helm chart.

We recommend using KubeRay operator version 1.2.0 or higher, which supports automatic Ray Pod eviction and replacement in case of failures (for example, hardware issues on EKS or SageMaker HyperPod nodes).

# Create the KubeRay namespace
kubectl create namespace kuberay
# Deploy the KubeRay operator with the Helm chart repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both the CRDs and the KubeRay operator v1.2.0
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0 --namespace kuberay
# The KubeRay operator pod will be deployed onto the head node
kubectl get pods --namespace kuberay

Create a Ray Container Image for the Ray Cluster manifest. With the recent deprecation of the `rayproject/ray-ml` images starting from Ray version 2.31.0, it’s necessary to create a custom container image for our Ray cluster. Therefore, we will build on top of the `rayproject/ray:2.42.1-py310-gpu` image, which has all necessary Ray dependencies, and include our training dependencies to build our own custom image. Please feel free to modify this Dockerfile as you wish.

First, create a Dockerfile that builds upon the base Ray GPU image and includes only the necessary dependencies:

cat <<EOF > Dockerfile
FROM rayproject/ray:2.42.1-py310-gpu

# Install Python dependencies for PyTorch, Ray, Hugging Face, and more
RUN pip install --no-cache-dir \
    torch torchvision torchaudio \
    numpy \
    pytorch-lightning \
    transformers datasets evaluate tqdm click \
    ray[train] ray[air] \
    ray[train-torch] ray[train-lightning] \
    torchdata \
    torchmetrics \
    torch_optimizer \
    accelerate \
    scikit-learn \
    Pillow==9.5.0 \
    protobuf==3.20.3

RUN pip install --upgrade datasets transformers

# Set the user
USER ray
WORKDIR /home/ray

# Verify the Ray installation
RUN which ray && \
    ray --version

# Default command
CMD [ "/bin/bash" ]
EOF

Then, build and push the image to your container registry (Amazon ECR) using the provided script:

export AWS_REGION=$(aws configure get region)
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/

echo "This process may take 10-15 minutes to complete..."

echo "Building image..."

docker build --platform linux/amd64 -t ${REGISTRY}aws-ray-custom:latest .

# Create the registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep "aws-ray-custom" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
    aws ecr create-repository --repository-name aws-ray-custom
fi

# Log in to the registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REGISTRY

echo "Pushing image to $REGISTRY ..."

# Push the image to the registry
docker image push ${REGISTRY}aws-ray-custom:latest

Now, our Ray container image is in Amazon ECR with all necessary Ray dependencies, as well as code library dependencies.

Create a Ray cluster manifest. We use a Ray cluster to host our training jobs. The Ray cluster is the primary resource for managing Ray instances on Kubernetes. It represents a cluster of Ray nodes, including a head node and multiple worker nodes. The Ray cluster CRD determines how the Ray nodes are set up, how they communicate, and how resources are allocated among them. The nodes in a Ray cluster manifest as pods in the EKS or SageMaker HyperPod cluster.

Note that there are two distinct sections in the cluster manifest: the `headGroupSpec` defines the head node of the Ray cluster, and the `workerGroupSpecs` define the worker nodes. Although a job could technically run on the head node as well, it is common to separate the head node from the worker nodes where jobs are executed, so the head node can typically use a smaller instance (we chose an ml.m5.2xlarge). Because the head node also manages cluster-level metadata, it can be beneficial to run it on a non-GPU node, since GPUs are a common source of node failure.

cat <<'EOF' > raycluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: rayml
  labels:
    controller-tools.k8s.io: "1.0"
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # Pod template
    template:
      spec:
        # nodeSelector:
        #   node.kubernetes.io/instance-type: "ml.m5.2xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
          - name: ray-head
            image: ${REGISTRY}aws-ray-custom:latest  ## IMAGE: Here you may choose which image your head pod will run
            env:                                     ## ENV: Here is where you can set environment variables for the head pod
              - name: RAY_GRAFANA_IFRAME_HOST        ## PROMETHEUS AND GRAFANA
                value: http://localhost:3000
              - name: RAY_GRAFANA_HOST
                value: http://prometheus-grafana.prometheus-system.svc:80
              - name: RAY_PROMETHEUS_HOST
                value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:    ## LIMITS: Set resource limits for your head pod
                cpu: 1
                memory: 8Gi
              requests:  ## REQUESTS: Set resource requests for your head pod
                cpu: 1
                memory: 8Gi
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265  # Ray dashboard
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            volumeMounts:  ## VOLUMEMOUNTS
              - name: fsx-storage
                mountPath: /fsx
              - name: ray-logs
                mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: fsx-storage
            persistentVolumeClaim:
              claimName: fsx-claim
  workerGroupSpecs:
    # The pod replicas in this group are typed worker
    - replicas: 4     ## REPLICAS: How many worker pods you want
      minReplicas: 1
      maxReplicas: 10
      # Logical group name; can also be functional
      groupName: gpu-group
      rayStartParams:
        num-gpus: "8"
      # Pod template
      template:
        spec:
          # nodeSelector:
          #   node.kubernetes.io/instance-type: "ml.p5.48xlarge"
          securityContext:
            runAsUser: 0
            runAsGroup: 0
            fsGroup: 0
          containers:
            - name: ray-worker
              image: ${REGISTRY}aws-ray-custom:latest  ## IMAGE: Here you may choose which image your worker pods will run
              env: []
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              resources:
                limits:    ## LIMITS: Set resource limits for your worker pods
                  nvidia.com/gpu: 8
                  # vpc.amazonaws.com/efa: 32
                requests:  ## REQUESTS: Set resource requests for your worker pods
                  nvidia.com/gpu: 8
                  # vpc.amazonaws.com/efa: 32
              volumeMounts:  ## VOLUMEMOUNTS
                - name: ray-logs
                  mountPath: /tmp/ray
                - name: fsx-storage
                  mountPath: /fsx
          volumes:
            - name: fsx-storage
              persistentVolumeClaim:
                claimName: fsx-claim
            - name: ray-logs
              emptyDir: {}
EOF

Deploy the Ray cluster:

envsubst < raycluster.yaml | kubectl apply -f -

Optionally, expose the Ray dashboard using port forwarding:

# Gets name of kubectl service that runs the head pod
export SERVICEHEAD=$(kubectl get service | grep head-svc | awk '{print $1}' | head -n 1)
# Port forwards the dashboard from the head pod service
kubectl port-forward --address 0.0.0.0 service/${SERVICEHEAD} 8265:8265 > /dev/null 2>&1 &

Now you can go to http://localhost:8265/ to view the Ray dashboard.

To launch a training job, there are a few options:

Use the Ray Jobs submission SDK, which submits jobs to the Ray cluster through the Ray dashboard port (8265 by default), where Ray listens for job requests. To learn more, see Quickstart using the Ray Jobs CLI.
Execute a Ray job in the head pod where you exec directly into the head pod and then submit your job. To learn more, see RayCluster Quickstart.

For this example, we use the first method and submit the job through the SDK. Therefore, we simply run it from a local environment where the training code is available in the directory passed as --working-dir. Relative to this path, we specify the main training Python script (here, fsdp-ray.py). Within the working-dir folder, we can also include additional scripts we might need to run the training.
The fsdp-ray.py example is located in aws-do-ray/Container-Root/ray/raycluster/jobs/fsdp-ray/fsdp-ray.py in the aws-do-ray GitHub repo.

# Within jobs/ folder
ray job submit --address http://localhost:8265 --working-dir "fsdp-ray" -- python3 fsdp-ray.py
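Alternatively, you can submit the same job programmatically with the Ray Jobs Python SDK. The following is a minimal sketch that assumes the dashboard is port-forwarded to localhost:8265 and that the fsdp-ray directory is available locally:

import ray  # noqa: F401  (ensures Ray is installed in the submitting environment)
from ray.job_submission import JobSubmissionClient

# Connect to the Ray dashboard endpoint exposed by the port-forward
client = JobSubmissionClient("http://localhost:8265")

# Submit the same entrypoint and working directory as the CLI command above
job_id = client.submit_job(
    entrypoint="python3 fsdp-ray.py",
    runtime_env={"working_dir": "fsdp-ray"},
)
print(f"Submitted job: {job_id}")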

For our Python training script to run, we need to make sure our training scripts are correctly set up to use Ray. This includes the following steps:

Configure a model to run distributed and on the correct CPU/GPU device
Configure a data loader to shard data across the workers and place data on the correct CPU or GPU device
Configure a training function to report metrics and save checkpoints
Configure scaling and CPU or GPU resource requirements for a training job
Launch a distributed training job with a TorchTrainer class

For further details on how to adjust your existing training script to get the most out of Ray, refer to the Ray documentation.
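The following is a minimal, self-contained sketch of those five adjustments using a toy model and dataset; the model, data, and hyperparameters are illustrative and are not taken from the fsdp-ray.py example:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray.train
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # 1. Wrap the model for distributed training; Ray moves it to the right device
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    # 2. Shard the data loader across workers and place batches on the right device
    dataset = TensorDataset(torch.randn(256, 4), torch.randn(256, 1))
    loader = ray.train.torch.prepare_data_loader(DataLoader(dataset, batch_size=32))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(config["num_epochs"]):
        for features, labels in loader:
            loss = nn.functional.mse_loss(model(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # 3. Report metrics (and optionally a checkpoint) back to Ray Train
        ray.train.report({"loss": loss.item()})

# 4. Configure scaling and GPU resources, then 5. launch with TorchTrainer
trainer = TorchTrainer(
    train_func,
    train_loop_config={"num_epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()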
The following diagram illustrates the complete architecture you have built after completing these steps.

Implement training job resiliency with the job auto resume functionality
Ray is designed with robust fault tolerance mechanisms to provide resilience in distributed systems where failures are inevitable. These failures generally fall into two categories: application-level failures, which stem from bugs in user code or external system issues, and system-level failures, caused by node crashes, network disruptions, or internal bugs in Ray. To address these challenges, Ray provides tools and strategies that enable applications to detect, recover, and adapt seamlessly, providing reliability and performance in distributed environments. In this section, we look at two of the most common types of failures, and how SageMaker HyperPod complements Ray’s fault tolerance mechanisms for them: Ray Train worker failures and Ray worker node failures.

Ray Train worker – This is a worker process specifically used for training tasks within Ray Train, Ray’s distributed training library. These workers handle individual tasks or shards of a distributed training job. Each worker is responsible for processing a portion of the data, training a subset of the model, or performing computation during distributed training. They are coordinated by the Ray Train orchestration logic to collectively train a model.
Ray worker node – At the Ray level, this is a Ray node in a Ray cluster. It’s part of the Ray cluster infrastructure and is responsible for running tasks, actors, and other processes as orchestrated by the Ray head node. Each worker node can host multiple Ray processes that execute tasks or manage distributed objects. At the Kubernetes level, a Ray worker node is a Kubernetes pod that is managed by a KubeRay operator. For this post, we will be talking about the Ray worker nodes at the Kubernetes level, so we will refer to them as pods.

At the time of writing, there are no official updates regarding head pod fault tolerance and auto resume capabilities. Though head pod failures are rare, in the unlikely event of such a failure, you will need to manually restart your training job. However, you can still resume progress from the last saved checkpoint. To minimize the risk of hardware-related head pod failures, it’s advised to place the head pod on a dedicated, CPU-only SageMaker HyperPod node, because GPU failures are a common training job failure point.
Ray Train worker failures
Ray Train is designed with fault tolerance to handle worker failures, such as RayActorErrors. When a failure occurs, the affected workers are stopped, and new ones are automatically started to maintain operations. However, for training progress to continue seamlessly after a failure, saving and loading checkpoints is essential. Without proper checkpointing, the training script will restart, but all progress will be lost. Checkpointing is therefore a critical component of Ray Train’s fault tolerance mechanism and needs to be implemented in your code.
Automatic recovery
When a failure is detected, Ray shuts down the failed workers and provisions new ones. While this happens, we can also tell the training function to keep retrying until training can continue. Each instance of recovery from a worker failure is considered a retry. We can set the number of retries through the max_failures attribute of the FailureConfig, which is set in the RunConfig passed to the Trainer (for example, TorchTrainer). See the following code:

from ray.train import RunConfig, FailureConfig
# Tries to recover a run up to this many times.
run_config = RunConfig(failure_config=FailureConfig(max_failures=2))
# No limit on the number of retries.
run_config = RunConfig(failure_config=FailureConfig(max_failures=-1))

For more information, see Handling Failures and Node Preemption.
Checkpoints
A checkpoint in Ray Train is a lightweight interface representing a directory stored either locally or remotely. For example, a cloud-based checkpoint might point to s3://my-bucket/checkpoint-dir, and a local checkpoint might point to /tmp/checkpoint-dir. To learn more, see Saving checkpoints during training.
To save a checkpoint in the training loop, you first need to write your checkpoint to a local directory, which can be temporary. When saving, you can use checkpoint utilities from other frameworks like torch.save, pl.Trainer.save_checkpoint, accelerator.save_model, save_pretrained, tf.keras.Model.save, and more. Then you create a checkpoint from the directory using Checkpoint.from_directory. Finally, report the checkpoint to Ray Train using ray.train.report(metrics, checkpoint=…). The metrics reported alongside the checkpoint are used to keep track of the best-performing checkpoints. Reporting will upload the checkpoint to persistent storage.
If you save checkpoints with ray.train.report(…, checkpoint=…) and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage is not set up. This is because Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location.
Finally, clean up the local temporary directory to free up disk space (for example, by exiting the tempfile.TemporaryDirectory context). We can save a checkpoint every epoch or every few iterations.
The following diagram illustrates this setup.

The following code is an example of saving checkpoints using native PyTorch:

import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

import ray.train.torch
from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    n = 100
    # Create a toy dataset
    # data   : X - dim = (n, 4)
    # target : Y - dim = (n, 1)
    X = torch.Tensor(np.random.normal(0, 1, size=(n, 4)))
    Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1)))

    # Toy neural network: 1-layer
    # Wrap the model in DDP
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    criterion = nn.MSELoss()

    optimizer = Adam(model.parameters(), lr=3e-4)
    for epoch in range(config["num_epochs"]):
        y = model.forward(X)
        loss = criterion(y, Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        metrics = {"loss": loss.item()}

        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            checkpoint = None

            should_checkpoint = epoch % config.get("checkpoint_freq", 1) == 0
            # In standard DDP training, where the model is the same across all ranks,
            # only the global rank 0 worker needs to save and report the checkpoint
            if train.get_context().get_world_rank() == 0 and should_checkpoint:
                torch.save(
                    model.module.state_dict(),  # NOTE: Unwrap the model.
                    os.path.join(temp_checkpoint_dir, "model.pt"),
                )
                checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

            train.report(metrics, checkpoint=checkpoint)

trainer = TorchTrainer(
    train_func,
    train_loop_config={"num_epochs": 5},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

Ray Train also comes with CheckpointConfig, a way to configure checkpointing options:

from ray.train import RunConfig, CheckpointConfig

# Example 1: Only keep the 2 *most recent* checkpoints and delete the others.
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2))

# Example 2: Only keep the 2 *best* checkpoints and delete the others.
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        # *Best* checkpoints are determined by these params:
        checkpoint_score_attribute="mean_accuracy",
        checkpoint_score_order="max",
    ),
    # This will store checkpoints on S3.
    storage_path="s3://remote-bucket/location",
)

To restore training state from a checkpoint if your training job were to fail and retry, you should modify your training loop to auto resume and then restore a Ray Train job. By pointing to the path of your saved checkpoints, you can restore your trainer and continue training. Here’s a quick example:

from ray.train.torch import TorchTrainer

restored_trainer = TorchTrainer.restore(
    path="~/ray_results/dl_trainer_restore",  # Can also be a cloud storage path like S3
    datasets=get_datasets(),
)
result = restored_trainer.fit()

To streamline restoration, you can add auto resume logic to your script. This checks if a valid experiment directory exists and restores the trainer if available. If not, it starts a new experiment:

experiment_path = "~/ray_results/dl_restore_autoresume"
if TorchTrainer.can_restore(experiment_path):
    trainer = TorchTrainer.restore(experiment_path, datasets=get_datasets())
else:
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        datasets=get_datasets(),
        scaling_config=train.ScalingConfig(num_workers=2),
        run_config=train.RunConfig(
            storage_path="~/ray_results",
            name="dl_restore_autoresume",
        ),
    )
result = trainer.fit()

To summarize, to provide fault tolerance and auto resume when using Ray Train libraries, set your max_failures parameter in the FailureConfig (we recommend setting it to -1 to make sure it will keep retrying until the SageMaker HyperPod node is rebooted or replaced), and make sure you have enabled checkpointing in your code.
Ray worker pod failures
In addition to the aforementioned mechanisms to recover from Ray Train worker failures, Ray also provides fault tolerance at the worker pod level. When a worker pod fails (this includes scenarios in which the raylet process fails), the running tasks and actors on it will fail and the objects owned by worker processes of this pod will be lost. In this case, the tasks, actors, and objects fault tolerance mechanisms will start and try to recover the failures using other worker pods.
These mechanisms will be implicitly handled by the Ray Train library. To learn more about the underlying fault tolerance at the tasks, actors, objects (implemented at the Ray Core level), see Fault Tolerance.
In practice, this means that in case of a worker pod failure, the following occurs:

If there is a free worker pod in the Ray cluster, Ray will recover the failed worker pod by replacing it with the free worker pod.
If there is no free worker pod, but in the underlying SageMaker HyperPod cluster there are free SageMaker HyperPod nodes, Ray will schedule a new worker pod onto one of the free SageMaker HyperPod nodes. This pod will join the running Ray cluster and the failure will be recovered using this new worker pod.

In the context of KubeRay, Ray worker nodes are represented by Kubernetes pods, and failures at this level can include issues such as pod eviction or preemption caused by software-level factors.
However, another critical scenario to consider is hardware failure. If the underlying SageMaker HyperPod node becomes unavailable due to a hardware issue, such as a GPU error, it inevitably causes the Ray worker pod running on that node to fail as well. In that case, the fault tolerance and auto-healing mechanisms of your SageMaker HyperPod cluster kick in and reboot or replace the faulty node. After the new healthy node joins the SageMaker HyperPod cluster, Ray schedules a new worker pod onto it and recovers the interrupted training. In this case, the Ray fault tolerance mechanism and the SageMaker HyperPod resiliency features work together seamlessly to make sure that, even in case of a hardware failure, your ML training workload can auto resume and pick up from where it was interrupted.
As you have seen, there are various built-in resiliency and fault-tolerance mechanisms that allow your Ray Train workload on SageMaker HyperPod to recover and auto resume. Because these mechanisms will essentially recover by restarting the training job, it is crucial that checkpointing is implemented in the training script. It is also generally advised to save the checkpoints on a shared and persistent path, such as an Amazon Simple Storage Service (Amazon S3) bucket or FSx for Lustre file system.
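As an illustration of that recommendation, the following minimal sketch points Ray Train's persistent storage at the FSx for Lustre mount defined in the cluster manifest and combines it with the retry and checkpoint settings discussed earlier; the /fsx/ray_results path and the experiment name are illustrative assumptions, not values from the example code:

from ray.train import RunConfig, FailureConfig, CheckpointConfig

run_config = RunConfig(
    storage_path="/fsx/ray_results",                   # shared, persistent storage (FSx mount assumed at /fsx)
    name="fsdp-training",                              # illustrative experiment name
    failure_config=FailureConfig(max_failures=-1),     # keep retrying until the node is recovered
    checkpoint_config=CheckpointConfig(num_to_keep=2), # keep the 2 most recent checkpoints
)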
Clean up
To delete your SageMaker HyperPod cluster created in this post, you can either use the SageMaker AI console or use the following AWS CLI command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
If you used the CloudFormation stack to create resources, you can delete it using the following command:

aws cloudformation delete-stack --stack-name <stack_name>

Conclusion
This post demonstrated how to set up and deploy Ray clusters on SageMaker HyperPod, highlighting key considerations such as storage configuration and fault tolerance and auto resume mechanisms.
Running Ray jobs on SageMaker HyperPod offers a powerful solution for distributed AI/ML workloads, combining the flexibility of Ray with the robust infrastructure of SageMaker HyperPod. This integration provides enhanced resiliency and auto resume capabilities, which are crucial for long-running and resource-intensive tasks. By using Ray’s distributed computing framework and the built-in features of SageMaker HyperPod, you can efficiently manage complex ML workflows, specifically training workloads as covered in this post. As AI/ML workloads continue to grow in scale and complexity, the combination of Ray and SageMaker HyperPod offers a scalable, resilient, and efficient platform for tackling the most demanding computational challenges in machine learning.
To get started with SageMaker HyperPod, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the Amazon SageMaker HyperPod Developer Guide. To learn more about the aws-do-ray framework, refer to the GitHub repo.

About the Authors
Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on the Automotive and Manufacturing sector, specializing in helping organizations architect, optimize, and scale artificial intelligence and machine learning solutions, with particular expertise in autonomous vehicle technologies. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering.
Florian Stahl is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in Artificial Intelligence, Machine Learning, and Generative AI solutions, helping customers optimize and scale their AI/ML workloads on AWS. With a background as a Data Scientist, Florian focuses on working with customers in the Autonomous Vehicle space, bringing deep technical expertise to help organizations design and implement sophisticated machine learning solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their machine learning investments on AWS.
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop has held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Alex Iankoulski is a Principal Solutions Architect, ML/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges.

Using Large Language Models on Amazon Bedrock for multi-step task exec …

The goal of this blog post is to show you how a large language model (LLM) can be used to perform tasks that require multi-step dynamic reasoning and execution. Examples of tasks that require dynamic reasoning and execution are answering questions of the form “What is the average length of stay for patients with [specific condition] across different hospitals?” or “How do prescription trends for [specific drug] vary across different regions?” Traditionally, answering these queries required the expertise of business intelligence specialists and data engineers, often resulting in time-consuming processes and potential bottlenecks.
However, the advances in LLMs have opened up exciting new possibilities to break down complex tasks into a series of steps, use tools to complete each step, and produce a final solution.
In this context, the term tools refers to external capabilities or APIs that the model can access and interact with to extend its functionality beyond text-based responses. These tools allow LLMs to perform specialized tasks such as retrieving real-time information, running code, browsing the web, or generating images. By using tools, LLMs can offer more accurate, context-aware, and actionable outputs, enabling them to effectively assist with complex queries that require access to data or functions outside their internal knowledge base.
In this post, we showcase an example of such capabilities for a patient record retrieval solution built on top of an existing dataset by relying on APIs only instead of text-to-SQL. You can find the code for this post on GitHub.
Overview of solution
The goal of the solution is to accurately answer analytical questions that require multi-step reasoning and execution. The following is an example of an expected interaction with the agent.

User: Give me the first and last name of the patient with least vaccines and their vaccine count.
AI: Based on the results of our analysis, I can answer your question: The patient with the least number of vaccines is Sharleen176 Kulas532, and their vaccine count is 1.
Here’s a breakdown of how we arrived at this answer:
1. We first retrieved patients and immunization records.
2. We then grouped the immunization records by patient_id and counted the number of vaccines for each patient.
3. We ordered this list by the vaccine count in ascending order.
4. We limited the result to the top 1 (the patient with the least vaccines).
5. We joined this result with the patient information to get the first and last name.
6. Finally, we selected only the relevant information (first name, last name, and vaccine count).
The result shows that Sharleen176 Kulas532 has received only 1 vaccine, which is the lowest among patients in the database.

For this solution, we use the Synthetic Patient Generation dataset, which contains multiple tables of healthcare industry related synthetic data.
You can set up the dataset at the project’s root folder using the following steps:
curl https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_csv_apr2020.zip > dataset.zip

unzip dataset.zip

mv csv dataset
The solution is made from two core steps: plan and execute. In its simplest form, it can be represented by the following diagram.
Fig 1: Simple execution flow – solution overview
In a more complex scheme, you can add multiple layers of validation and provide relevant APIs to increase the success rate of the LLM.

Fig 2: Complex execution flow – solution overview
Plan
In the Plan stage, the LLM is given a set of predefined API function signatures along with a brief description of what each function does. These function signatures act as tools that the LLM can use to formulate a plan to answer a user’s query. The goal is to have the LLM reason through the steps required to arrive at the answer, much like a human would.
Why the plan stage is important
The Plan stage is critical because it allows the LLM to create a structured, logical sequence of actions that will be executed in the next stage. By planning, the LLM can break down a complex question into manageable steps, making sure that the right APIs are called in the correct order. This structured approach helps to minimize errors and increases the likelihood of producing accurate results.
Providing function signatures
In this stage, the LLM is given a set of function signatures that represent the tools it can use. Each function signature includes the name of the function, the parameters it accepts, and the type of value it returns. Here’s an example of a few function signatures:

def get_patients() -> List[Patient]: Retrieves a list of patients from the dataset.
def get_immunization() -> List[Immunization]: Retrieves a list of immunization records from the dataset.
def filter(list: List[object], keys: List[str], values: List[str]) -> List[object]: Filters a given list based on specified keys and values.
def join(a: List, b: List, left_key: str, right_key: str, how: JoinMode) -> List: Joins two lists based on matching keys, using a specified join mode (e.g., INNER, LEFT, RIGHT).

These function signatures act as building blocks for the LLM to generate a plan. The LLM must choose the appropriate functions and sequence them in a logical order to achieve the desired outcome.
Retrieval Augmented Generation (RAG) improves the selection process by narrowing down the tools an LLM sees based on the task, simplifying the prompt. In a project with many tools, RAG makes sure that only the most relevant tools are surfaced for a given query, reducing complexity and helping the LLM make more accurate decisions. This focused exposure enhances performance by preventing the model from being overwhelmed by irrelevant options.
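As a rough illustration of this idea, the following sketch narrows a catalog of tool signatures to the few most similar to the user's question before they are placed in the planning prompt. The embed() callable is an assumed stand-in for whichever text embedding model you use (for example, an Amazon Bedrock embedding model); it is not part of the solution's code.

from typing import Callable, List, Tuple
import math

def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_tools(
    question: str,
    tool_docs: List[Tuple[str, str]],     # (signature, description) pairs
    embed: Callable[[str], List[float]],  # assumed embedding function
    top_k: int = 5,
) -> List[str]:
    """Return the top_k tool signatures most similar to the question."""
    q_vec = embed(question)
    scored = [
        (cosine(q_vec, embed(f"{sig} {desc}")), sig)
        for sig, desc in tool_docs
    ]
    scored.sort(reverse=True)
    return [sig for _, sig in scored[:top_k]]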
Generating a plan
After the function signatures are provided, the LLM is prompted to create a plan. The plan typically consists of a series of steps, each represented as a JSON object. Each step indicates a function that needs to be executed, the parameters that need to be passed, and the expected outcome (often referred to as evidence).
For example, if the task is to find the patient with the least number of vaccines, the LLM might generate a plan that includes the following steps:

Retrieve patients: Use the get_patients() function to get a list of patients.
Retrieve immunization records: Use the get_immunization() function to get a list of immunizations.
Group by patient: Use the group_by() function to group the immunizations by patient_id, counting the number of vaccines for each patient.
Order by count: Use the order_by() function to sort the grouped list in ascending order based on the vaccine count.
Limit the result: Use the limit() function to select the patient with the least vaccines.
Join with patient data: Use the join() function to match the selected result with the patient’s information.
Select relevant fields: Use the select() function to extract only the necessary fields, such as the patient’s first name, last name, and vaccine count.

JSON representation
The LLM outputs this plan as structured JSON, which makes it straightforward to parse and execute in the next stage. The JSON format helps make sure that the plan is clear, unambiguous, and ready for programmatic execution.
The following is an example of what the JSON might look like:
{
  "role": "assistant",
  "content": [
    {
      "toolUse": {
        "toolUseId": "tooluse_example_id",
        "name": "execute_plan",
        "input": {
          "plans": [
            {
              "function_name": "get_patients",
              "parameters": [],
              "evidence_number": 1
            },
            {
              "function_name": "get_immunization",
              "parameters": [],
              "evidence_number": 2
            },
            {
              "function_name": "group_by",
              "parameters": ["list", "group_key", "aggregation_key", "aggregation"],
              "parameter_values": ["#E2", "patient_id", null, "COUNT"],
              "evidence_number": 3
            },
            {
              "function_name": "order_by",
              "parameters": ["list", "key", "value"],
              "parameter_values": ["#E3", "count", "ASCENDING"],
              "evidence_number": 4
            },
            {
              "function_name": "limit",
              "parameters": ["list", "k"],
              "parameter_values": ["#E4", 1],
              "evidence_number": 5
            },
            {
              "function_name": "join",
              "parameters": ["a", "b", "left_key", "right_key", "how"],
              "parameter_values": ["#E5", "#E1", "patient_id", "id", "INNER"],
              "evidence_number": 6
            },
            {
              "function_name": "select",
              "parameters": ["list", "keys"],
              "parameter_values": ["#E6", ["first", "last", "count"]],
              "evidence_number": 7
            }
          ]
        }
      }
    }
  ]
}

Execute
In the Execute stage, the structured plan generated by the LLM in the previous step is programmatically carried out to produce the final output. The JSON blueprint from the planning stage is parsed, and each function call described in the plan is executed sequentially.
The process begins with data retrieval, such as accessing patient records or immunization data, using predefined API functions such as get_patients() or get_immunization(). These initial function calls generate intermediate results, which are stored as evidence and referenced in subsequent steps.
The plan typically involves a series of data transformation functions, such as group_by() to aggregate data, filter() for refining results, and order_by() for data sorting. Each function is executed with the specific parameters outlined in the JSON plan, progressively refining the data to answer the query.
As each function is executed, its output is passed to the subsequent function in the sequence. This chain of function calls culminates in a final step, often involving a select() function to extract the most relevant information, such as a patient’s name and vaccine count.
Error handling in the Execute stage is crucial for the reliability and robustness of the entire process. As the LLM’s plan is executed, various issues can arise, including empty datasets, invalid parameters, or mismatched data types during function calls such as join() or filter(). To address these potential challenges, the system incorporates error-checking mechanisms at each step, enabling it to detect and respond to anomalies efficiently. If a function returns an unexpected result or encounters an issue, the system can provide the error back to the LLM itself, enabling it to regenerate the plan with the necessary adjustments. This approach not only mitigates execution failures but also enhances the overall user experience by delivering accurate and reliable results, even in the face of unexpected challenges.
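To make the execution loop concrete, the following is a simplified sketch of how such an executor might parse the plan, resolve #E<n> evidence references, call the registered API functions, and surface errors back to the caller so the LLM can regenerate the plan. The function and variable names are illustrative rather than taken from the repository.

from typing import Any, Dict, List

def execute_plan(plans: List[Dict[str, Any]], registry: Dict[str, Any]) -> Any:
    """Run each plan step in order, storing results as evidence (#E1, #E2, ...)."""
    evidence: Dict[str, Any] = {}
    result = None
    for step in plans:
        func = registry[step["function_name"]]
        # Resolve parameter values, replacing "#E<n>" references with
        # previously computed evidence.
        args = [
            evidence[v] if isinstance(v, str) and v.startswith("#E") else v
            for v in step.get("parameter_values", [])
        ]
        try:
            result = func(*args)
        except Exception as err:
            # Surface the error so it can be fed back to the LLM for replanning.
            raise RuntimeError(f"Step {step['evidence_number']} failed: {err}") from err
        evidence[f"#E{step['evidence_number']}"] = result
    return result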
Summary
This post explores the application of LLMs in executing complex analytical queries through an API, with specific focus on Amazon Bedrock. Traditionally, business users rely on data professionals to retrieve and present data, but LLMs can now offer a streamlined approach enabling direct query responses by using predefined API tools. To illustrate this capability, we use the Synthetic Patient Generation dataset and present a solution structured around two primary phases: Plan and Execution.
In the Plan stage, the LLM is provided with API function signatures, which it uses to generate a structured, logical sequence of steps to answer the query. This plan is output as a JSON, providing clarity and facilitating seamless execution. In the Execute stage, the system programmatically carries out the plan by sequentially executing each function call. Robust error-handling mechanisms are integrated to identify potential issues and, if necessary, relay errors back to the LLM for plan regeneration.
To demonstrate this process, we present a use case where the system identifies the patient with the least number of vaccines by retrieving, grouping, and sorting data, and ultimately presenting the final result. This example showcases the LLM’s ability to extend beyond mere text-based responses, providing actionable and context-aware outputs that can significantly enhance business decision-making processes.
Conclusion
This article highlights the efficacy of LLMs in expanding their functionality to deliver practical, data-driven solutions that have the potential to revolutionize business analytics and decision-making workflows.

About the Authors
Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.
Mohammad Arbabshirani, PhD, is a Sr. Data Science Manager at AWS Professional Services. He specializes in helping customers accelerate business outcomes on AWS through the application of machine learning and generative AI. He has 12 years of experience in full life cycle of machine learning, computer vision, and data science from sales support to end-to-end solution delivery specially in healthcare and life sciences vertical. Currently, Mohammad leads a team of data scientists, machine learning engineers, and data architects, focusing on delivery of cutting-edge ML solutions for customers. His background includes extensive research in neuroimaging and medical imaging. Outside of his professional endeavors, Mohammad enjoys tennis, soccer, and instrumental music.

The Complete Beginner’s Guide to Terminal/Command Prompt

The terminal (on Mac/Linux) or command prompt (on Windows) is a powerful tool that allows you to interact with your computer using text commands instead of clicking through a graphical interface. While it might seem intimidating at first, mastering basic terminal commands can help you:

Navigate through files and folders more efficiently

Perform tasks that aren’t possible through the regular interface

Automate repetitive tasks

Gain a deeper understanding of how your computer works

This guide will introduce you to the essential commands and concepts to get you started, regardless of which operating system you use.

Getting Started

Opening the Terminal

On Windows:

Press Win + R, type cmd, and press Enter

Or search for “Command Prompt” in the Start menu

On Mac:

Press Command + Space to open Spotlight, type “Terminal”, and press Enter

Or find Terminal in Applications → Utilities → Terminal

On Linux:

Press Ctrl + Alt + T (on most distributions)

Or search for “Terminal” in your applications menu

Understanding the Prompt

When you first open the terminal, you’ll see a prompt that looks something like this:

Windows: C:\Users\YourUsername>

Mac/Linux: username@computer:~$

This tells you:

Your current location in the file system

Where to type your commands

On Mac/Linux, the ~ symbol represents your home directory

Basic Navigation Commands

Viewing Your Current Location

Windows: cd 

Mac/Linux: pwd (Print Working Directory)

Example:

Listing Files and Directories

Windows: dir 

Mac/Linux: ls

Example:

Options:

ls -l – List with detailed information (file size, date modified, permissions)

ls -a – Show hidden files (files that start with a dot)

ls -la – Combine both options

Changing Directories

All platforms: cd DirectoryName

Examples:

Creating Directories

All platforms: mkdir DirectoryName

Example:

Creating Files

Windows: type nul > filename.txt 

Mac/Linux: touch filename.txt

Example:

Working with Files

Viewing File Contents

Windows: type filename.txt 

Mac/Linux: cat filename.txt

For larger files: 

Windows: more filename.txt 

Mac/Linux: less filename.txt (use q to quit)

Copying Files

Windows: copy source destination 

Mac/Linux: cp source destination

Example:

Moving/Renaming Files

Windows: move source destination 

Mac/Linux: mv source destination

Examples:

Deleting Files and Directories

Windows: del filename.txt (delete a file), rmdir /s DirectoryName (delete a directory and its contents)

Mac/Linux: rm filename.txt (delete a file), rm -r DirectoryName (delete a directory and its contents)

Warning: Be very careful with delete commands, especially rm -r! There is no “Recycle Bin” or “Trash” when using the terminal – deletions are permanent.

Helpful Tips

Command History

Press the up arrow to cycle through previously used commands

On Mac/Linux, type history to see a list of recent commands

Tab Completion

Start typing a file or directory name, then press Tab

The terminal will attempt to complete it for you

If there are multiple options, press Tab twice to see all possibilities

Getting Help

Windows: help command or command /? 

Mac/Linux: man command (manual pages, press q to exit)

Examples:

Clearing the Screen

Windows: cls 

Mac/Linux: clear or Ctrl+L

Power User Commands

Searching for Files

Windows: dir /s filename 

Mac/Linux: find . -name filename

Searching Within Files

Windows: findstr “text” filename 

Mac/Linux: grep “text” filename

Chaining Commands

All platforms: Use && to run commands in sequence

Example:

Redirecting Output

All platforms: Use > to send output to a file

Example:

Next Steps

As you become more comfortable with these basic commands, you might want to explore:

Command line text editors like Nano, Vim, or Emacs

Writing simple shell scripts to automate tasks

Package managers like apt (Linux), Homebrew (Mac), or Chocolatey (Windows)

Environment variables and how to set them

SSH to connect to remote computers

Common Mistakes and Troubleshooting

Command not found: Check spelling or ensure the command is available on your system

Permission denied: You may need administrator/root privileges

Windows: Run Command Prompt as Administrator

Mac/Linux: Use sudo before commands that need elevated privileges

No such file or directory: Double-check path and file names

Operation not permitted: Similar to permission denied, you might need special permissions

Task               Windows                    Mac/Linux
Current location   cd                         pwd
List files         dir                        ls
Change directory   cd dir                     cd dir
Create directory   mkdir dir                  mkdir dir
Create file        type nul > file            touch file
Copy file          copy source destination    cp source destination
Move/rename        move source destination    mv source destination
Delete file        del file                   rm file
Delete directory   rmdir /s dir               rm -r dir
Clear screen       cls                        clear
Get help           help command               man command

Conclusion

In this tutorial, we have covered everything beginners need to know about using the terminal. We explored how to open the terminal across different operating systems, navigate file systems, create and manage files and directories, and use essential commands. We also learned helpful shortcuts, power user commands, and troubleshooting tips. With these foundational skills, you can now confidently use the command line as a powerful tool in your computing journey.

Remember, the terminal is a powerful tool that rewards practice and experimentation. Don’t be afraid to try new commands, but always be careful with commands that modify or delete files.


This AI Paper from ByteDance Introduces a Hybrid Reward System Combini …

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning LLMs with human values and preferences. Despite the introduction of non-RL alternatives like DPO, industry-leading models such as ChatGPT/GPT-4, Claude, and Gemini continue to rely on RL algorithms like PPO for policy optimization. Recent research focuses on algorithmic improvements, including eliminating critic models to reduce computational costs, filtering noisy samples during PPO sampling, and enhancing reward models to mitigate reward hacking. However, only a few studies focus on RLHF data construction (that is, training prompts) and how performance scales with these training prompts.

The success of RLHF depends heavily on reward-model quality, which faces three challenges: mis-specified reward modeling of human preferences, incorrect and ambiguous preferences in training datasets, and poor generalization. To address these issues, GenRM was introduced to validate model predictions against ground-truth responses; it shows good resistance to reward hacking and has been adopted in advanced LLMs like DeepSeekV3. Principled data-selection methods filter overly challenging instances during training, while strategic selection methods identify key training prompts that achieve comparable performance with less data. Performance-scaling analysis reveals that RLHF generalizes better than SFT on novel inputs but significantly reduces output diversity.

Researchers from ByteDance Seed address a critical gap in RLHF research: the role of prompt-data construction and its scalability has received comparatively little attention. They explore the data-driven bottlenecks that limit RLHF performance scaling, focusing on reward hacking and decreasing response diversity. The work introduces a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM), which shows stronger resistance to reward hacking and enables more accurate assessment of responses against ground-truth solutions. It also introduces Pre-PPO, a novel prompt-selection method that identifies inherently challenging training prompts that are less susceptible to reward hacking.

The experimental setup employs two pre-trained language models of different scales: a smaller model with 25B parameters and a larger model with 150B parameters. The training dataset contains one million prompts from diverse domains, including mathematics, coding, instruction-following, creative writing, and logical reasoning. Moreover, the researchers constructed a detailed evaluation framework covering multiple skill areas: logical reasoning, instruction-following, STEM tasks, coding, natural language processing, knowledge, contextual understanding, and out-of-distribution generalization. The evaluation framework includes two versions (V1.0 and V2.0) with overlapping prompts, though V2.0 features more challenging prompts.

The experimental results show that the proposed approach combining Pre-PPO with prioritized mathematical and coding tasks consistently outperforms the baseline method across model sizes and evaluation datasets. The approach shows an improvement of +1.1 over the baseline when evaluated at 100-step intervals using TestSet V1.0. When tested on the more challenging TestSet V2.0, the performance improvement increases to +1.4. The most substantial gains appear in mathematics-intensive and coding tasks, with an improvement of +3.9 points in STEM and +3.2 points in coding. These improvements are attributed to the strategic prioritization of mathematical reasoning and coding tasks during early RLHF training phases.

In conclusion, this paper addresses critical bottlenecks in RLHF data scaling, identifying reward hacking and reduced response diversity as the key challenges. The researchers propose a combined approach featuring strategic prompt construction and early-stage training prioritization. The method uses RTV and GenRM to combat reward hacking, alongside the novel Pre-PPO prompt-selection strategy that identifies and prioritizes challenging training prompts. Analysis reveals that RTV supervision shows the strongest resistance to reward hacking, followed by GenRM with ground-truth labels and then the BT reward model. The research establishes a foundation for optimizing RLHF data construction and developing more principled methods for mitigating reward hacking and improving model alignment.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
