Creating asynchronous AI agents with Amazon Bedrock

The integration of generative AI agents into business processes is poised to accelerate as organizations recognize the untapped potential of these technologies. Advancements in multimodal artificial intelligence (AI), where agents can understand and generate not just text but also images, audio, and video, will further broaden their applications. This post discusses agentic AI-driven architecture and ways of implementing it.
The emergence of generative AI agents in recent years has contributed to the transformation of the AI landscape, driven by advances in large language models (LLMs) and natural language processing (NLP). Companies like Anthropic, Cohere, and Amazon have made significant strides in developing powerful language models capable of understanding and generating human-like content across multiple modalities, revolutionizing how businesses integrate and utilize artificial intelligence in their processes.
These AI agents have demonstrated remarkable versatility, being able to perform tasks ranging from creative writing and code generation to data analysis and decision support. Their ability to engage in intelligent conversations, provide context-aware responses, and adapt to diverse domains has revolutionized how businesses approach problem-solving, customer service, and knowledge dissemination.
One of the most significant impacts of generative AI agents has been their potential to augment human capabilities through both synchronous and asynchronous patterns. In synchronous orchestration, just like in traditional process automation, a supervisor agent orchestrates the multi-agent collaboration, maintaining a high-level view of the entire process while actively directing the flow of information and tasks. This approach allows businesses to offload repetitive and time-consuming tasks in a controlled, predictable manner.
Alternatively, asynchronous choreography follows an event-driven pattern where agents operate autonomously, triggered by events or state changes in the system. In this model, agents publish events or messages that other agents can subscribe to, creating a workflow that emerges from their collective behavior. These patterns have proven particularly valuable in enhancing customer experiences, where agents can provide round-the-clock support, resolve issues promptly, and deliver personalized recommendations through either orchestrated or event-driven interactions, leading to increased customer satisfaction and loyalty.
Agentic AI architecture
Agentic AI architecture marks a shift in process automation: autonomous agents are extended with AI capabilities so they can imitate cognitive abilities and enhance the actions of traditional autonomous agents. This architecture can enable businesses to streamline operations, enhance decision-making processes, and automate complex tasks in new ways.
Much like traditional business process automation through technology, agentic AI architecture describes AI systems designed to resolve complex problems with limited or indirect human intervention. These systems are composed of multiple AI agents that converse with each other or execute complex tasks through a series of choreographed or orchestrated processes. This approach empowers AI systems to exhibit goal-directed behavior, learn from experience, and adapt to changing environments.
The difference between a single agent invocation and a multi-agent collaboration lies in the complexity and the number of agents involved in the process.
When you interact with a digital assistant like Alexa, you’re typically engaging with a single agent, also known as a conversational agent. This agent processes your request, such as setting a timer or checking the weather, and provides a response without needing to consult other agents.
Now, imagine expanding this interaction to include multiple agents working together. Let’s start with a simple travel booking scenario:
Your interaction begins with telling a travel planning agent about your desired trip. In this first step, the AI model, in this case an LLM, is acting as an interpreter and user experience interface between your natural language input and the structured information needed by the travel planning system. It’s processing your request, which might be a complex statement like “I want to plan a week-long beach vacation in Hawaii for my family of four next month,” and extracting key details such as the destination, duration, number of travelers, and approximate dates.
The LLM is also likely to infer additional relevant information that wasn’t explicitly stated, such as the need for family-friendly accommodations or activities. It might ask follow-up questions to clarify ambiguous points or gather more specific preferences. Essentially, the LLM is transforming your casual, conversational input into a structured set of travel requirements that can be used by the specialized booking agents in the subsequent steps of the workflow.
This initial interaction sets the foundation for the entire multi-agent workflow, making sure that the travel planning agent has a clear understanding of your needs before engaging other specialized agents.
By adding another agent, the flight booking agent, the travel planning agent can call upon it to find suitable flights. The travel planning agent provides the flight booking agent with the relevant information (dates, destinations), then waits for and processes the flight booking agent’s response so it can incorporate the flight options into its overall plan.
Now, let’s add another agent to the workflow; a hotel booking agent to support finding accommodations. With this addition, the travel planning agent must also communicate with the hotel booking agent, which needs to make sure that the hotel dates align with the flight dates and provide the information back to the overall plan to include both flight and hotel options.
As we continue to add agents, such as a car rental agent or a local activities agent, each new addition receives relevant information from the travel planning agent, performs its specific task, and returns its results to be incorporated into the overall plan. The travel planning agent acts not only as the user experience interface, but also as a coordinator, deciding when to involve each specialized agent and how to combine their inputs into a cohesive travel plan.
This multi-agent workflow allows more complex tasks to be accomplished by taking advantage of the specific capabilities of each agent. The system remains flexible, because agents can be added or removed based on the specific needs of each request, without requiring significant changes to the existing agents and with minimal change to the overall workflow.
For more on the benefits of breaking tasks into agents, see How task decomposition and smaller LLMs can make AI more affordable.
Process automation with agentic AI architecture
The preceding scenario, just like in traditional process automation, is a common orchestration pattern in which the multi-agent collaboration is orchestrated by a supervisor agent. The supervisor agent acts like a conductor leading an orchestra, telling each instrument when to play and how to harmonize with the others. For this approach, Amazon Bedrock Agents enables generative AI applications to execute multi-step tasks orchestrated by an agent, and supports multi-agent collaboration to solve complex tasks. This is done by designating an Amazon Bedrock agent as a supervisor agent and associating one or more collaborator agents with it. For more details, read about creating and configuring Amazon Bedrock Agents and Use multi-agent collaboration with Amazon Bedrock Agents.
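The following is a minimal sketch of how this supervisor-and-collaborator setup could be wired up with boto3. The agent names, instruction text, role ARN, and collaborator alias ARN are placeholders, and the multi-agent collaboration parameters reflect the Bedrock Agents API at the time of writing, so verify them against the current SDK documentation before use.

import boto3

# Hypothetical identifiers used for illustration only
AGENT_ROLE_ARN = "arn:aws:iam::123456789012:role/BedrockAgentRole"
FLIGHT_AGENT_ALIAS_ARN = "arn:aws:bedrock:us-east-1:123456789012:agent-alias/FLIGHTID/ALIASID"

bedrock_agent = boto3.client("bedrock-agent")

# Create a supervisor agent that orchestrates collaborator agents
supervisor = bedrock_agent.create_agent(
    agentName="travel-planning-supervisor",
    foundationModel="anthropic.claude-3-5-sonnet-20240620-v1:0",
    agentResourceRoleArn=AGENT_ROLE_ARN,
    instruction="Plan trips by delegating flight, hotel, and activity tasks to collaborator agents.",
    agentCollaboration="SUPERVISOR",
)

# Register an existing flight booking agent (by alias ARN) as a collaborator
bedrock_agent.associate_agent_collaborator(
    agentId=supervisor["agent"]["agentId"],
    agentVersion="DRAFT",
    agentDescriptor={"aliasArn": FLIGHT_AGENT_ALIAS_ARN},
    collaboratorName="flight-booking-agent",
    collaborationInstruction="Find and book flights for the dates and destinations provided by the supervisor.",
)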
The following diagram illustrates the supervisor agent methodology.

Supervisor agent methodology

Following traditional process automation patterns, the other end of the spectrum from synchronous orchestration is asynchronous choreography: an asynchronous, event-driven multi-agent workflow. In this approach, there is no central orchestrating agent (supervisor). Agents operate autonomously: actions are triggered by events or changes in the system’s state, and agents publish events or messages that other agents can subscribe to. The workflow emerges from the collective behavior of the agents reacting to events asynchronously. It’s more like a jazz improvisation, where each musician responds to what others are playing without a conductor. The following diagram illustrates this event-driven workflow.

Event-driven workflow methodology

The event-driven pattern in asynchronous systems operates without predefined workflows, creating a dynamic and potentially chaotic processing environment. While agents subscribe to and publish messages through a central event hub, the flow of processing is determined organically by the message requirements and the available subscribed agents. Although the resulting pattern may resemble a structured workflow when visualized, it’s important to understand that this is emergent behavior rather than orchestrated design. The absence of centralized workflow definitions means that message processing occurs naturally based on publication timing and agent availability, creating a fluid and adaptable system that can evolve with changing requirements.
The choice between synchronous orchestration and asynchronous event-driven patterns fundamentally shapes how agentic AI systems operate and scale. Synchronous orchestration, with its supervisor agent approach, provides precise control and predictability, making it ideal for complex processes requiring strict oversight and sequential execution. This pattern excels in scenarios where the workflow needs to be tightly managed, audited, and debugged. However, it can create bottlenecks as all operations must pass through the supervisor agent. Conversely, asynchronous event-driven systems offer greater flexibility and scalability through their distributed nature. By allowing agents to operate independently and react to events in real-time, these systems can handle dynamic scenarios and adapt to changing requirements more readily. While this approach may introduce more complexity in tracking and debugging workflows, it excels in scenarios requiring high scalability, fault tolerance, and adaptive behavior. The decision between these patterns often depends on the specific requirements of the system, balancing the need for control and predictability against the benefits of flexibility and scalability.
Getting the best of both patterns
You can use a single agent to route messages to other agents based on the context of the event data (message) at runtime, with no prior knowledge of the downstream agents and without having to rely on each agent subscribing to an event hub. This is traditionally known as the message broker or event broker pattern, which for the purposes of this post we will call an agent broker pattern, to represent the brokering of messages to AI agents. The agent broker pattern is a hybrid approach that combines elements of both centralized synchronous orchestration and distributed asynchronous event-driven systems.
The key to this pattern is that a single agent acts as a central hub for message distribution but doesn’t control the entire workflow. The broker agent determines where to send each message based on its content or metadata, making routing decisions at runtime. The processing agents are decoupled from each other and from the message source, only interacting with the broker to receive messages. The agent broker pattern differs from the supervisor pattern in that it routes a message to an agent without awaiting a response, whereas a supervisor waits for responses from its collaborating agents. The following diagram illustrates the agent broker methodology.

Agent broker methodology

Following an agent broker pattern, the system is still fundamentally event-driven, with actions triggered by the arrival of messages. New agents can be added to handle specific types of messages without changing the overall system architecture. How to implement this type of pattern is explained later in this post.
This pattern is often used in enterprise messaging systems, microservices architectures, and complex event processing systems. It provides a balance between the structure of orchestrated workflows and the flexibility of pure event-driven systems.
Agentic architecture with the Amazon Bedrock Converse API
Traditionally, we might have had to sacrifice some flexibility in the broker pattern, because adding processes (agents) to the architecture would require updating the routing logic in the broker. This is not the case when using the Amazon Bedrock Converse API. With the Converse API, we can call a tool to complete an Amazon Bedrock model response, and the only change needed to add an agent to the collaboration is a configuration update stored outside of the broker.
To let a model use a tool to complete a response for a message, the message and the definitions for one or more tools (agents) are sent to the model. If the model determines that one of the tools can help generate a response, it returns a request to use the tool.
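The following minimal sketch shows this tool-use flow with the Converse API through boto3. The tool definition, model ID, and message are placeholder examples; in the broker architecture described later in this post, the tool definitions would be loaded from AWS AppConfig rather than hardcoded.

import boto3

bedrock = boto3.client("bedrock-runtime")

# Example tool (agent) definition; in the broker pattern these come from AWS AppConfig
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "flight_booking_agent",
                "description": "Searches and books flights for given dates and destinations.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "origin": {"type": "string"},
                            "destination": {"type": "string"},
                            "departure_date": {"type": "string"},
                        },
                        "required": ["origin", "destination"],
                    }
                },
            }
        }
    ]
}

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any tools-compatible Amazon Bedrock model
    messages=[{"role": "user", "content": [{"text": "Book a flight from Melbourne to Sydney next Friday"}]}],
    toolConfig=tool_config,
)

# If the model determines a tool can help, the response contains toolUse blocks
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print(block["toolUse"]["name"], block["toolUse"]["input"])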
AWS AppConfig, a capability of AWS Systems Manager, is used to store each agent’s tool context data as a single configuration in a managed data store, to be sent with the Converse API tool request. By using AWS Lambda as the message broker to receive all messages and send requests to the Converse API with the tool context stored in AWS AppConfig, the architecture allows additional agents to be added to the system without updating the routing logic: agents are ‘registered’ as ‘tool context’ in the configuration stored in AWS AppConfig, which Lambda reads at runtime (when an event message is received). For more information about when to use AWS AppConfig, see AWS AppConfig use cases.
Implementing the agent broker pattern
The following diagram demonstrates how Amazon EventBridge and Lambda act as a central message broker, with the Amazon Bedrock Converse API to let a model use a tool in a conversation to dynamically route messages to appropriate AI agents.

Agent broker architecture

Messages sent to EventBridge are routed through an EventBridge rule to Lambda. There are three tasks the EventBridge Lambda function performs as the agent broker (a code sketch follows this list):

Query AWS AppConfig for all agents’ tool context. An agent tool context is a description of the agent’s capability along with the Amazon Resource Name (ARN) or URL of the agent’s message ingress.
Provide the agent tool context along with the inbound event message to the Amazon Bedrock LLM through the Converse API; in this example, using an Amazon Bedrock tools-compatible LLM. Through the Converse API, the LLM compares the event message against the agent tool context and returns a response to the requesting Lambda function containing the recommended tool or tools that should be used to process the message.
Receive the response from the Converse API request containing one or more tools that should be called to process the event message, and hand the event message to the ingress of the recommended tools.
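A minimal Lambda handler for this broker could look like the following sketch. The AppConfig application, environment, and profile names, the queue URLs, and the configuration format are assumptions made for illustration; the tool-selection call mirrors the Converse API example shown earlier.

import json
import os
import urllib.request

import boto3

bedrock = boto3.client("bedrock-runtime")
sqs = boto3.client("sqs")

MODEL_ID = os.environ.get("MODEL_ID", "anthropic.claude-3-5-sonnet-20240620-v1:0")

def load_agent_registry():
    # The AWS AppConfig Lambda extension serves configuration on localhost:2772;
    # the application, environment, and profile names here are placeholders.
    url = "http://localhost:2772/applications/agent-broker/environments/prod/configurations/agents"
    with urllib.request.urlopen(url) as resp:
        # Assumed format: {"flight_booking_agent": {"toolSpec": {...}, "ingressQueueUrl": "..."}, ...}
        return json.loads(resp.read())

def handler(event, context):
    registry = load_agent_registry()

    # 1. Build the Converse API tool configuration from the registered agents
    tool_config = {"tools": [{"toolSpec": agent["toolSpec"]} for agent in registry.values()]}

    # 2. Ask the model which agent (tool) should process this event message
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": json.dumps(event["detail"])}]}],
        toolConfig=tool_config,
    )

    # 3. Hand the event message to the ingress queue of each recommended agent
    #    (the registry is keyed by tool name, so the model's choice maps directly to an agent)
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            agent = registry[block["toolUse"]["name"]]
            sqs.send_message(
                QueueUrl=agent["ingressQueueUrl"],
                MessageBody=json.dumps({"event": event["detail"], "input": block["toolUse"]["input"]}),
            )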

In this example, the architecture demonstrates brokering messages asynchronously to an Amazon SageMaker based agent, an Amazon Bedrock agent, and an external third-party agent, all from the same agent broker.
Although the brokering Lambda function could connect directly to the SageMaker or Amazon Bedrock agent API, the architecture provides for adaptability and scalability in message throughput, allowing messages from the agent broker to be queued, in this example with Amazon Simple Queue Service (Amazon SQS), and processed according to the capability of the receiving agent. For adaptability, the Lambda function subscribed to the agent ingress queue provides additional system prompts (pre-prompting of the LLM for specific tool context), message formatting, and the functions required for the expected input and output of the agent request.
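As a sketch of what one of these ingress functions might look like for the Amazon Bedrock agent in this example, the following Lambda consumes messages from its SQS queue, adds agent-specific framing, and forwards the request using the InvokeAgent API. The agent ID, alias ID, and message shape are assumptions for illustration.

import json
import uuid

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

AGENT_ID = "EXAMPLEAGENTID"      # placeholder
AGENT_ALIAS_ID = "EXAMPLEALIAS"  # placeholder

def handler(event, context):
    for record in event["Records"]:  # SQS batch delivered to the Lambda function
        body = json.loads(record["body"])

        # Agent-specific pre-prompting and input formatting
        input_text = (
            "You are the hotel booking specialist. "
            "Align hotel dates with the provided flight details.\n"
            f"Request: {json.dumps(body['input'])}"
        )

        response = bedrock_agent_runtime.invoke_agent(
            agentId=AGENT_ID,
            agentAliasId=AGENT_ALIAS_ID,
            sessionId=body.get("correlationId", str(uuid.uuid4())),
            inputText=input_text,
        )

        # invoke_agent returns an event stream; collect the streamed text chunks
        completion = "".join(
            part["chunk"]["bytes"].decode("utf-8")
            for part in response["completion"]
            if "chunk" in part
        )
        print(completion)  # in the full architecture this result would be emitted back to EventBridge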
To add new agents to the system, the only integration requirements are to update AWS AppConfig with the new agent tool context (a description of the agent’s capability and its ingress endpoint) and to make sure the brokering Lambda function has permission to write to that ingress endpoint.
Agents can be added to the system without rewriting the Lambda function or performing any integration that requires downtime, so the new agent can be used on the next invocation of the brokering Lambda function.
Implementing the supervisor pattern with an agent broker
Building upon the agent broker pattern, the architecture can be extended to handle more complex, stateful interactions. Although the broker pattern effectively uses AWS AppConfig and Amazon Bedrock’s Converse API tool use capability for dynamic routing, its unidirectional nature has limitations. Events flow in and are distributed to agents, but complex scenarios like travel booking require maintaining context across multiple agent interactions. This is where the supervisor pattern provides additional capabilities without compromising the flexible routing we achieved with the broker pattern.
Consider the travel booking example again: the system has the broker agent and several task-based agents that events are pushed to. When processing a request like "Book a 3-night trip to Sydney from Melbourne during the first week of September for 2 people", we encounter several challenges. Although this statement contains clear intent, it lacks critical details that the agents might need, such as:

Specific travel dates
Accommodation preferences and room configurations

The broker pattern alone can’t effectively manage these information gaps while maintaining context between agent interactions. This is where adding supervisor capabilities to the broker agent provides:

Contextual awareness between events and agent invocations
Bi-directional information flow capabilities

The following diagram illustrates the supervisor pattern workflow.

Supervisor pattern architecture

When a new event enters the system, the workflow initiates the following steps:

The event is assigned a unique identifier for tracking
The supervisor performs the following actions:

Evaluates which agents to invoke (brokering)
Creates a new state record with the identifier and timestamp
Provides this contextual information to the selected agents along with their invocation parameters

Agents process their tasks and emit ‘task completion’ events back to EventBridge
The supervisor performs the following actions:

Collects and processes completed events
Evaluates the combined results and context
Determines if additional agent invocations are needed
Continues this cycle until all necessary actions are completed

This pattern handles scenarios where agents might return varying results or request additional information. The supervisor can either:

Derive missing information from other agent responses
Request additional information from the source
Coordinate with other agents to resolve information gaps

To handle information gaps without architectural modifications, we can introduce an answers agent to the existing system. This agent operates within the same framework as other agents, but specializes in context resolution. When agents report incomplete information or require clarification, the answers agent can:

Process queries about missing information
Emit task completion events with enhanced context
Allow the supervisor to resume workflow execution with newly available information, the same way that it would after another agent emits its task-completion event.

This enhancement enables complex, multi-step workflows while maintaining the system’s scalability and flexibility. The supervisor can manage dependencies between agents, handle partial completions, and make sure that the necessary information is gathered before finalizing tasks.
Implementation considerations
Implementing the supervisor pattern on top of the existing broker agent architecture provides the advantages of both the broker pattern and the complex state management of orchestration. State management can be handled through Amazon DynamoDB, while maintaining the use of EventBridge for event routing and AWS AppConfig for agent configuration. The Amazon Bedrock Converse API continues to play a crucial role in agent selection, but now with added context from the supervisor’s state management. This allows you to preserve the dynamic routing capabilities established with the broker pattern while adding the sophisticated workflow management needed for complex, multi-step processes.
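A minimal sketch of the supervisor’s state handling could look like the following. The table name, key schema, and record attributes are assumptions for illustration; the supervisor creates a record when an event enters the system and updates it as agents emit task-completion events.

import time
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
state_table = dynamodb.Table("agent-workflow-state")  # placeholder table name

def create_workflow_state(event_detail, selected_agents):
    """Called when the supervisor brokers a new event to one or more agents."""
    workflow_id = str(uuid.uuid4())
    state_table.put_item(
        Item={
            "workflowId": workflow_id,
            "createdAt": int(time.time()),
            "request": event_detail,
            "pendingAgents": set(selected_agents),  # e.g. {"flight_booking_agent", "hotel_booking_agent"}
            "results": {},
        }
    )
    return workflow_id

def record_task_completion(workflow_id, agent_name, result):
    """Called when an agent emits its task-completion event back to EventBridge."""
    response = state_table.update_item(
        Key={"workflowId": workflow_id},
        UpdateExpression="SET results.#agent = :result DELETE pendingAgents :done",
        ExpressionAttributeNames={"#agent": agent_name},
        ExpressionAttributeValues={":result": result, ":done": {agent_name}},
        ReturnValues="ALL_NEW",
    )
    # When no agents remain pending, the supervisor can evaluate the combined results
    return not response["Attributes"].get("pendingAgents")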
Conclusion
Agentic AI architecture, powered by Amazon Bedrock and AWS services, represents a leap forward in the evolution of automated AI systems. By combining the flexibility of event-driven systems with the power of generative AI, this architecture enables businesses to create more adaptive, scalable, and intelligent automated processes. The agent broker pattern offers a robust solution for dynamically routing complex tasks to specialized AI agents, and the agent supervisor pattern extends these capabilities to handle sophisticated, context-aware workflows.
These patterns take advantage of the strengths of the Amazon Bedrock Converse API, Lambda, EventBridge, and AWS AppConfig to create a flexible and extensible system. The broker pattern excels at dynamic routing and seamless agent integration, while the supervisor pattern adds crucial state management and contextual awareness for complex, multi-step processes. Together, they provide a comprehensive framework for building sophisticated AI systems that can handle both simple routing and complex, stateful interactions.
This architecture not only streamlines operations, but also opens new possibilities for innovation and efficiency across various industries. Whether implementing simple task routing or orchestrating complex workflows requiring maintained context, organizations can build scalable, maintainable AI systems that evolve with their needs while maintaining operational stability.
To get started with an agentic AI architecture, consider the following next steps:

Explore Amazon Bedrock – If you haven’t already, sign up for Amazon Bedrock and experiment with its powerful generative AI models and APIs. Familiarize yourself with the Converse API and its tool use capabilities.
Prototype your own agent broker – Use the architecture outlined in this post as a starting point to build a proof-of-concept agent broker system tailored to your organization’s needs. Start small with a few specialized agents and gradually expand.
Identify use cases – Analyze your current business processes to identify areas where an agentic AI architecture could drive significant improvements. Consider complex, multi-step tasks that could benefit from AI assistance.
Stay informed – Keep up with the latest developments in AI and cloud technologies. AWS regularly updates its offerings, so stay tuned for new features that could enhance your agentic AI systems.
Collaborate and share – Join AI and cloud computing communities to share your experiences and learn from others. Consider contributing to open-source projects or writing about your implementation to help advance the field.
Invest in training – Make sure your team has the necessary skills to work with these advanced AI technologies. Consider AWS training and certification programs to build expertise in your organization.

By embracing an agentic AI architecture, you’re not just optimizing your current processes – you’re positioning your organization at the forefront of the AI revolution. Start your journey today and unlock the full potential of AI-driven automation for your business.

About the Authors
Aaron Sempf is Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large scale complex integration and event driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI assisted business automation.
Joshua Toth is a Senior Prototyping Engineer with over a decade of experience in software engineering and distributed systems. He specializes in solving complex business challenges through technical prototypes, demonstrating the art of the possible. With deep expertise in proof of concept development, he focuses on bridging the gap between emerging technologies and practical business applications. In his spare time, he can be found developing next-generation interactive demonstrations and exploring cutting-edge technological innovations.
Sara van de Moosdijk, simply known as Moose, is an AI/ML Specialist Solution Architect at AWS. She helps AWS customers and partners build and scale AI/ML solutions through technical enablement, support, and architectural guidance. Moose spends her free time figuring out how to fit more books in her overflowing bookcase.

Building an Interactive Bilingual (Arabic and English) Chat Interface with Open Source Meraj-Mini by Arcee AI: Leveraging GPU Acceleration, PyTorch, Transformers, Accelerate, BitsAndBytes, and Gradio

In this tutorial, we implement a Bilingual Chat Assistant powered by Arcee’s Meraj-Mini model, which is deployed seamlessly on Google Colab using T4 GPU. This tutorial showcases the capabilities of open-source language models while providing a practical, hands-on experience in deploying state-of-the-art AI solutions within the constraints of free cloud resources. We’ll utilise a powerful stack of tools including:

Arcee’s Meraj-Mini model

Transformers library for model loading and tokenization

Accelerate and bitsandbytes for efficient quantization

PyTorch for deep learning computations

Gradio for creating an interactive web interface

# Enable GPU acceleration
!nvidia-smi --query-gpu=name,memory.total --format=csv

# Install dependencies
!pip install -qU transformers accelerate bitsandbytes
!pip install -q gradio

First, we enable GPU acceleration by querying the GPU’s name and total memory using the nvidia-smi command. We then install and update the key Python libraries, such as transformers, accelerate, bitsandbytes, and gradio, to support machine learning tasks and deploy interactive applications.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Meraj-Mini",
    quantization_config=quant_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Meraj-Mini")

We then configure 4-bit quantization settings using BitsAndBytesConfig for efficient model loading, and load the "arcee-ai/Meraj-Mini" causal language model along with its tokenizer from Hugging Face, automatically mapping devices for optimal performance.

chat_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True
)

Here we create a text generation pipeline tailored for chat interactions using Hugging Face’s pipeline function. It configures maximum new tokens, temperature, top_p, and repetition penalty to balance diversity and coherence during text generation.

def format_chat(messages):
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    return prompt

def generate_response(user_input, history=[]):
    history.append({"role": "user", "content": user_input})
    formatted_prompt = format_chat(history)
    output = chat_pipeline(formatted_prompt)[0]["generated_text"]
    assistant_response = output.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0]
    history.append({"role": "assistant", "content": assistant_response})
    return assistant_response, history

We define two functions to facilitate a conversational interface. The first function formats a chat history into a structured prompt with custom delimiters, while the second appends a new user message, generates a response using the text-generation pipeline, and updates the conversation history accordingly.

import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Message")
    clear = gr.Button("Clear History")

    def respond(message, chat_history):
        # Convert Gradio's (user, assistant) tuples into the role/content format
        # expected by generate_response
        history = []
        for user_msg, bot_msg in chat_history:
            history.append({"role": "user", "content": user_msg})
            history.append({"role": "assistant", "content": bot_msg})
        response, _ = generate_response(message, history)
        # Clear the textbox and append the new exchange to the chat history
        return "", chat_history + [(message, response)]

    msg.submit(respond, [msg, chatbot], [msg, chatbot])
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(share=True)

Finally, we build a web-based chatbot interface using Gradio. It creates UI elements for chat history, message input, and a clear-history button, and defines a response function that integrates with the text-generation pipeline to update the conversation. The demo is then launched with sharing enabled for public access.

Here is the Colab Notebook.

This AI Paper Introduces R1-Searcher: A Reinforcement Learning-Based Framework for Enhancing LLM Search Capabilities

Large language models (LLMs) primarily depend on their internal knowledge, which can be inadequate when handling real-time or knowledge-intensive questions. This limitation often leads to inaccurate responses or hallucinations, making it essential to enhance LLMs with external search capabilities. By leveraging reinforcement learning, researchers are actively working on methods to improve these models’ ability to retrieve and integrate relevant information beyond their static knowledge base.

Current LLMs’ restricted access to up-to-date and domain-specific knowledge is a major issue. Since these models are trained on vast datasets that may not include recent developments, they struggle with answering dynamic questions requiring real-time information. While retrieval-augmented generation (RAG) methods have been introduced to mitigate this issue, existing solutions rely heavily on structured prompting and supervised fine-tuning (SFT). These approaches often lead to overfitting, limiting the model’s generalization ability across different datasets. There is a need for an alternative that allows LLMs to autonomously interact with external search systems, improving their adaptability and accuracy.

Previous methods have attempted to incorporate external search functionality into LLMs using iterative prompting, supervised fine-tuning, and tree-based search techniques like Monte Carlo Tree Search (MCTS). While these methods show some improvements, they rely on expensive computational resources and proprietary models. Supervised fine-tuning, for instance, forces models to memorize reasoning paths, which negatively impacts their ability to generalize to new scenarios. Some retrieval-based strategies introduce multi-step query refinement techniques but often require human intervention or predefined prompt templates. These limitations necessitate the development of a more autonomous and efficient search mechanism for LLMs.

A research team from the Renmin University of China and DataCanvas Alaya NeW introduced R1-Searcher, a novel reinforcement learning framework designed to improve LLMs’ ability to retrieve external knowledge effectively. This framework employs a two-stage reinforcement learning approach to enable LLMs to invoke an external search system without requiring human-crafted prompts or prior supervised fine-tuning. By focusing solely on reinforcement learning, R1-Searcher allows models to explore and learn optimal retrieval strategies autonomously, improving accuracy and efficiency in reasoning tasks.

The R1-Searcher framework is structured in two phases. The first phase encourages the model to initiate external search actions, providing retrieval-based rewards without considering the final answer’s correctness. This phase ensures that the model learns to invoke search queries correctly. The second phase refines this capability by introducing an answer-based reward system, which evaluates whether the retrieved information contributes to solving the given problem. The reinforcement learning process relies on a tailored loss function that penalizes incorrect or unnecessary searches while rewarding the effective use of external knowledge. Unlike previous retrieval-based techniques, this approach allows LLMs to integrate reasoning and retrieval dynamically, improving their adaptability across diverse tasks.
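As a toy illustration of this two-stage reward idea (not the paper’s exact formulation), a reward function might first score only whether the model emitted a well-formed search call, and later add a term for answer correctness. The <search> tag format and the weights used here are assumptions.

def reward(completion: str, gold_answer: str, stage: int) -> float:
    # Stage 1: reward well-formed retrieval calls, regardless of the final answer
    made_valid_search = "<search>" in completion and "</search>" in completion
    retrieval_reward = 0.5 if made_valid_search else 0.0
    if stage == 1:
        return retrieval_reward

    # Stage 2: additionally reward answers that match the reference answer
    answer_reward = 1.0 if gold_answer.lower() in completion.lower() else 0.0
    return retrieval_reward + answer_reward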

Experimental evaluations demonstrated that R1-Searcher outperformed existing retrieval-augmented methods, including GPT-4o-mini-based models. On the HotpotQA dataset, accuracy improved by 48.22%, while on the 2WikiMultiHopQA dataset, it achieved a 21.72% increase. Further, it showed strong generalization capabilities by outperforming other models on the Bamboogle dataset, achieving an 11.4% improvement over comparable retrieval-based approaches. Unlike previous techniques, which relied on closed-source models and extensive computational resources, R1-Searcher provided superior performance while maintaining efficiency in search and reasoning tasks. The study also demonstrated that this approach successfully mitigated common issues related to hallucinations and misinformation in LLM-generated responses.

The findings indicate that enhancing LLMs with autonomous search capabilities can significantly improve their accuracy and generalization. Using reinforcement learning instead of supervised fine-tuning, R1-Searcher allows models to learn optimal retrieval strategies dynamically, eliminating reliance on memorized responses. This approach represents a major advancement in artificial intelligence, addressing the limitations of existing models while ensuring they remain adaptable to evolving knowledge requirements. The study’s results highlight the potential for reinforcement learning to revolutionize knowledge integration in LLMs, making them more reliable for diverse reasoning tasks.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


HybridNorm: A Hybrid Normalization Strategy Combining Pre-Norm and Post-Norm Strengths in Transformer Architectures

Transformers have revolutionized natural language processing as the foundation of large language models (LLMs), excelling in modeling long-range dependencies through self-attention mechanisms. However, as these models grow deeper and more complex, training stability presents a significant challenge that directly impacts performance. Researchers face a troublesome trade-off between two primary normalization strategies: Pre-Layer Normalization (Pre-Norm) and Post-Layer Normalization (Post-Norm). Pre-Norm offers improved training stability but compromises in final model performance, while Post-Norm delivers superior generalization and performance at the cost of training difficulty. This stability-performance dilemma has hindered the advancement of transformer architectures.

Existing methods have tried to enhance transformer architectures’ computational efficiency and model expressiveness. Architecture modifications like Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) have improved performance across various tasks but require careful integration with normalization layers. Among normalization types, methods like RMSNorm have shown effectiveness in specific contexts by addressing internal covariate shift using root mean square statistics. Regarding attention normalization, QK-Norm enhances stability by normalizing query and key components, while QKV-Norm extends this approach to include value components. Solutions like DeepNorm address training instability by scaling residual connections, while Mix-LN applies Post-Norm to earlier layers and Pre-Norm to deeper layers.

Researchers from Peking University, SeedFoundation-Model ByteDance, and Capital University of Economics and Business have proposed HybridNorm, a normalization strategy to combine the strengths of both Pre-Norm and Post-Norm approaches in transformer architectures effectively. It implements a dual normalization technique within each transformer block: applying QKV normalization within the attention mechanism while utilizing Post-Norm in the feed-forward network (FFN). This strategic combination addresses the longstanding stability-performance trade-off that has challenged transformer model development. The approach proves particularly effective for LLMs, where training stability and performance optimization are crucial.
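A minimal PyTorch sketch of the block structure described above might look like the following. This is an illustrative reading of the strategy (QKV normalization inside the attention mechanism, Post-Norm around the FFN), not the authors’ reference implementation, and the dimensions, activation, and normalization layers are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridNormBlock(nn.Module):
    """Illustrative transformer block: QKV-Norm in attention, Post-Norm on the FFN sub-layer."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.q_norm = nn.LayerNorm(d_model)
        self.k_norm = nn.LayerNorm(d_model)
        self.v_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ffn_norm = nn.LayerNorm(d_model)  # Post-Norm: applied after the residual addition

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # QKV-Norm: normalize the query, key, and value projections
        q, k, v = self.q_norm(q), self.k_norm(k), self.v_norm(v)
        # Split heads: (batch, heads, time, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.out_proj(attn)

        # Post-Norm on the feed-forward sub-layer
        x = self.ffn_norm(x + self.ffn(x))
        return x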

HybridNorm is evaluated across two model series: dense models (550M and 1B parameters) and MoE models. The 1B dense model contains approximately 1.27 billion parameters with an architecture similar to Llama 3.2. For the MoE variant, researchers implemented the OLMoE framework, which activates only 1.3B parameters from a total of 6.9B. The 550M dense model features a model dimension of 1536, an FFN dimension of 4096, and 16 attention heads. The larger 1.2B model expands these dimensions to 2048 and 9192, respectively, with 32 attention heads. The MoE-1B-7B model implements a specialized configuration with 16 attention heads and 2048 model dimensions and selectively activates 8 experts from a pool of 64, enabling more efficient computational resource allocation.

The experimental results reveal HybridNorm’s superior performance across dense and MoE models. In dense model evaluations, both HybridNorm and HybridNorm* configurations show consistently lower training loss and validation perplexity than traditional Pre-Norm approaches. Downstream benchmark evaluations show HybridNorm* outperforming the Pre-Norm across diverse tasks, achieving the highest average scores with improvements in BasicArithmetic (+3.11), HellaSwag (+1.71), and COPA (+3.78). In the MoE model, HybridNorm* maintains its advantage with consistently lower training loss and validation perplexity throughout training. Downstream task evaluations for MoE models show improvements in reasoning-intensive tasks like ARC-C (+2.35), ARC-E (+2.40), and OpenbookQA (+0.81).

In conclusion, researchers introduced HybridNorm, a significant advancement in transformer architecture design that resolves the traditional trade-off between training stability and model performance. It strategically combines Pre-Norm and Post-Norm techniques within each transformer block, applying QKV normalization in the attention mechanism and Post-Norm in the feed-forward network. This hybrid strategy creates a balanced normalization framework that stabilizes gradient flow while maintaining strong regularization effects. Moreover, the consistent performance gains across various model scales highlight HybridNorm’s versatility and scalability in transformer design. As transformer models continue to scale, HybridNorm offers a practical solution for developing more robust and performant large-scale neural networks.

Check out the Paper. All credit for this research goes to the researchers of this project.


Exploring creative possibilities: A visual guide to Amazon Nova Canvas

Compelling AI-generated images start with well-crafted prompts. In this follow-up to our Amazon Nova Canvas Prompt Engineering Guide, we showcase a curated gallery of visuals generated by Nova Canvas—categorized by real-world use cases—from marketing and product visualization to concept art and design exploration.
Each image is paired with the prompt and parameters that generated it, providing a practical starting point for your own AI-driven creativity. Whether you’re crafting specific types of images, optimizing workflows, or simply seeking inspiration, this guide will help you unlock the full potential of Amazon Nova Canvas.
Solution overview
Getting started with Nova Canvas is straightforward. You can access the model through the Image Playground on the AWS Management Console for Amazon Bedrock, or through APIs. For detailed setup instructions, including account requirements and necessary permissions, visit our documentation on Creative content generation with Amazon Nova. Our previous post on prompt engineering best practices provides comprehensive guidance on crafting effective prompts.
A visual guide to Amazon Nova Canvas
In this gallery, we showcase a diverse range of images and the prompts used to generate them, highlighting how Amazon Nova Canvas adapts to various use cases—from marketing and product design to storytelling and concept art.
All images that follow were generated using Nova Canvas at a 1280x720px resolution with a CFG scale of 6.5, seed of 0, and the Premium setting for image quality. This resolution also matches the image dimensions expected by Nova Reel, allowing you to take these images into Amazon Nova Reel to experiment with video generation.
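If you want to reproduce these settings programmatically, the following sketch shows a text-to-image request to Nova Canvas through the Amazon Bedrock InvokeModel API with the parameters listed above. The prompt is one of the examples from this gallery, and the request fields should be checked against the current Nova Canvas documentation.

import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

request = {
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": "Dramatic wide-angle shot of a rugged mountain range at sunset, with a lone tree silhouetted in the foreground, creating a striking focal point."
    },
    "imageGenerationConfig": {
        "width": 1280,
        "height": 720,
        "cfgScale": 6.5,
        "seed": 0,
        "quality": "premium",
        "numberOfImages": 1,
    },
}

response = bedrock.invoke_model(
    modelId="amazon.nova-canvas-v1:0",
    body=json.dumps(request),
)

payload = json.loads(response["body"].read())
with open("mountain_sunset.png", "wb") as f:
    f.write(base64.b64decode(payload["images"][0]))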
Landscapes

Overhead perspective of winding river delta, capturing intricate branching waterways and sediment patterns. Soft morning light revealing subtle color gradations between water and land. Revealing landscape’s hidden fluid dynamics from bird’s-eye view.

Sparse arctic tundra landscape at twilight, expansive white terrain with isolated rock formations silhouetted against a deep blue sky. Low-contrast black and white composition capturing the infinite horizon, with subtle purple hues in the shadows. Ultra-wide-angle perspective emphasizing the vastness of negative space and geological simplicity.

Wide-angle aerial shot of patchwork agricultural terrain at golden hour, with long shadows accentuating the texture and topography of the land. Emphasis on the interplay of light and shadow across the geometric field divisions.

Dynamic drone perspective of a dramatic shoreline at golden hour, capturing long shadows cast by towering sea stacks and coastal cliffs. Hyper-detailed imagery showcasing the interplay of warm sunlight on rocky textures and the cool, foamy edges of incoming tides.

Dramatic wide-angle shot of a rugged mountain range at sunset, with a lone tree silhouetted in the foreground, creating a striking focal point.

Wide-angle capture of a hidden beach cove, surrounded by towering cliffs, with a shipwreck partially visible in the shallow waters.

Character portraits

A profile view of a weathered fisherman, silhouetted against a pastel dawn sky. The rim lighting outlines the shape of his beard and the texture of his knit cap. Rendered with high contrast to emphasize the rugged contours of his face and the determined set of his jaw.

A weathered fisherman with a thick gray beard and a knit cap, framed against the backdrop of a misty harbor at dawn. The image captures him in a medium shot, revealing more of his rugged attire. Cool, blue tones dominate the scene, contrasting with the warm highlights on his face.

An intimate portrait of a seasoned fisherman, his face filling the frame. His thick gray beard is flecked with sea spray, and his knit cap is pulled low over his brow. The warm glow of sunset bathes his weathered features in golden light, softening the lines of his face while still preserving the character earned through years at sea. His eyes reflect the calm waters of the harbor behind him.

A seaside cafe at sunrise, with a seasoned barista’s silhouette visible through the window. Their kind smile is illuminated by the warm glow of the rising sun, creating a serene atmosphere. The image has a dreamy, soft-focus quality with pastel hues.

A dynamic profile shot of a barista in motion, captured mid-conversation with a customer. Their smile is genuine and inviting, with laugh lines accentuating their seasoned experience. The cafe’s interior is rendered in soft bokeh, maintaining the cinematic feel with a shallow depth of field.

A front-facing portrait of an experienced barista, their welcoming smile framed by the sleek espresso machine. The background bustles with blurred cafe activity, while the focus remains sharp on the barista’s friendly demeanor. The lighting is contrasty, enhancing the cinematic mood.

Fashion photography

A model with sharp cheekbones and platinum pixie cut in a distressed leather bomber jacket stands amid red smoke in an abandoned subway tunnel. Wide-angle lens, emphasizing tunnel’s converging lines, strobed lighting creating a sense of motion.

A model with sharp cheekbones and platinum pixie cut wears a distressed leather bomber jacket, posed against a stark white cyclorama. Low-key lighting creates deep shadows, emphasizing the contours of her face. Shot from a slightly lower angle with a medium format camera, highlighting the jacket’s texture.

Close-up portrait of a model with defined cheekbones and a platinum pixie cut, emerging from an infinity pool while wearing a wet distressed leather bomber jacket. Shot from a low angle with a tilt-shift lens, blurring the background for a dreamy fashion magazine aesthetic.

A model with sharp cheekbones and platinum pixie cut is wearing a distressed leather bomber jacket, caught mid-laugh at a backstage fashion show. Black and white photojournalistic style, natural lighting.

Side profile of a model with defined cheekbones and a platinum pixie cut, standing still amidst the chaos of Chinatown at midnight. The distressed leather bomber jacket contrasts with the blurred neon lights in the background, creating a sense of urban solitude.

Product photography

A flat lay featuring a premium matte metal water bottle with bamboo accents, placed on a textured linen cloth. Eco-friendly items like a cork notebook, a sprig of eucalyptus, and a reusable straw are arranged around it. Soft, natural lighting casts gentle shadows, emphasizing the bottle’s matte finish and bamboo details. The background is an earthy tone like beige or light gray, creating a harmonious and sustainable composition.

Angled perspective of the premium water bottle with bamboo elements, positioned on a natural jute rug. Surrounding it are earth-friendly items: a canvas tote bag, a stack of recycled paper notebooks, and a terracotta planter with air-purifying plants. Warm, golden hour lighting casts long shadows, emphasizing textures and creating a cozy, sustainable atmosphere. The scene evokes a sense of eco-conscious home or office living.

An overhead view of the water bottle’s bamboo cap, partially unscrewed to reveal the threaded metal neck. Soft, even lighting illuminates the entire scene, showcasing the natural variations in the bamboo’s color and grain. The bottle’s matte metal body extends out of frame, creating a minimalist composition that draws attention to the sustainable materials and precision engineering.

An angled view of a premium matte metal water bottle with bamboo accents, showcasing its sleek profile. The background features a soft blur of a serene mountain lake. Golden hour sunlight casts a warm glow on the bottle’s surface, highlighting its texture. Captured with a shallow depth of field for product emphasis.

A pair of premium over-ear headphones with a matte black finish and gold accents, arranged in a flat lay on a clean white background. Organic leaves for accents. small notepad, pencils, and a carrying case are neatly placed beside the headphones, creating a symmetrical and balanced composition. Bright, diffused lighting eliminates shadows, emphasizing the sleek design without distractions. A shadowless, crisp aesthetic.

An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting accentuates the curves and edges, casting subtle shadows that highlight the product’s premium build quality.

An extreme macro shot focusing on the junction where the leather ear cushion meets the metallic housing of premium over-ear headphones. Sharp details reveal the precise stitching and material textures, while selective focus isolates this area against a softly blurred, dark background, showcasing the product’s premium construction.

An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting casts long shadows, accentuating the curves of the headband and the depth of the ear cups against a minimalist white background.

A dynamic composition of premium over-ear headphones floating in space, with the headband and ear cups slightly separated to showcase individual components. Rim lighting outlines each piece, while a gradient background adds depth and sophistication.

A smiling student holding up her smartphone, displaying a green matte screen for easy image replacement, in a classroom setting.

Overhead view of a young man typing on a laptop with a green matte screen, surrounded by work materials on a wooden table.

Food photography

Monochromatic macarons arranged in precise geometric pattern. Strong shadow play. Architectural lighting. Minimal composition.

A pyramid of macarons in ombre pastels, arranged on a matte black slate surface. Dramatic side lighting from left. Close-up view highlighting texture of macaron shells. Garnished with edible gold leaf accents. Shot at f/2 aperture for shallow depth of field.

Disassembled macaron parts in zero-g chamber. Textured cookie halves, viscous filling streams, and scattered almond slivers drifting. High-contrast lighting with subtle shadows on off-white. Wide-angle shot showcasing full dispersal pattern.

Architectural design

A white cubic house with floor-to-ceiling windows, interior view from living room. Double-height space, floating steel staircase, polished concrete floors. Late afternoon sunbeams streaming across minimal furnishings. Ultra-wide architectural lens.

A white cubic house with floor-to-ceiling windows, kitchen and dining space. Monolithic marble island, integrated appliances, dramatic shadows from skylight above. Shot from a low angle with a wide-angle lens, emphasizing the height and openness of the space, late afternoon golden hour light streaming in.

An angular white modernist house featuring expansive glass walls, photographed for Architectural Digest’s cover. Misty morning atmosphere, elongated infinity pool creating a mirror image, three-quarter aerial view, lush coastal vegetation framing the scene.

A white cubic house with floor-to-ceiling windows presented as detailed architectural blueprints. Site plan view showing landscaping and property boundaries, technical annotations, blue background with white lines, precise measurements and zoning specifications visible.

A white cubic house with floor-to-ceiling windows in precise isometric projection. X-ray style rendering revealing internal framework, electrical wiring, and plumbing systems. Technical cross-hatching on load-bearing elements and foundation.

Concept art

A stylized digital painting of a bustling plaza in a futuristic eco-city, with soft impressionistic brushstrokes. Crystalline towers frame the scene, while suspended gardens create a canopy overhead. Holographic displays and eco-friendly vehicles add life to the foreground. Dreamlike and atmospheric, with glowing highlights in sapphire and rose gold.

A stylized digital painting of an elevated park in a futuristic eco-city, viewed from a high angle, with soft impressionistic brushstrokes. Crystalline towers peek through a canopy of trees, while winding elevated walkways connect floating garden platforms. People relax in harmony with nature. Dreamlike and atmospheric, with glowing highlights in jade and amber.

Concept art of a floating garden platform in a futuristic city, viewed from below. Translucent roots and hanging vines intertwine with advanced technology, creating a mesmerizing canopy. Soft bioluminescent lights pulse through the vegetation, casting ethereal patterns on the ocean’s surface. A gradient of deep purples and blues dominates the twilight sky.

An enchanted castle atop a misty cliff at sunrise, warm golden light bathing the ivy-covered spires. A wide-angle view capturing a flock of birds soaring past the tallest tower, set against a dramatic sky with streaks of orange and pink. Mystical ambiance and dynamic composition.

A magical castle rising from morning fog on a rugged cliff face, bathed in cool blue twilight. A low-angle shot showcasing the castle’s imposing silhouette against a star-filled sky, with a crescent moon peeking through wispy clouds. Mysterious mood and vertical composition emphasizing height.

An enchanted fortress clinging to a mist-shrouded cliff, caught in the moment between night and day. A panoramic view from below, revealing the castle’s reflection in a tranquil lake at the base of the cliff. Ethereal pink and purple hues in the sky, with a V-formation of birds flying towards the castle. Serene atmosphere and balanced symmetry.

Illustration

Japanese ink wash painting of a cute baby dragon with pearlescent mint-green scales and tiny wings curled up in a nest made of cherry blossom petals. Delicate brushstrokes, emphasis on negative space.

Art nouveau-inspired composition centered on an endearing dragon hatchling with gleaming mint-green scales. Sinuous morning glory stems and blossoms intertwine around the subject, creating a harmonious balance. Soft, dreamy pastels and characteristic decorative elements frame the scene.

Watercolor scene of a cute baby dragon with pearlescent mint-green scales crouched at the edge of a garden puddle, tiny wings raised. Soft pastel flowers and foliage frame the composition. Loose, wet-on-wet technique for a dreamy atmosphere, with sunlight glinting off ripples in the puddle.

A playful, hand-sculpted claymation-style baby dragon with pearlescent mint scales and tiny wings, sitting on a puffy marshmallow cloud. Its soft, rounded features and expressive googly eyes give it a lively, mischievous personality as it giggles and flaps its stubby wings, trying to take flight in a candy-colored sky.

A whimsical, animated-style render of a baby dragon with pearlescent mint scales nestled in a bed of oversized, bioluminescent flowers. The floating island garden is bathed in the warm glow of sunset, with fireflies twinkling like stars. Dynamic lighting accentuates the dragon’s playful expression.

Graphic design

A set of minimalist icons for a health tracking app. Dual-line design with 1.5px stroke weight on solid backgrounds. Each icon uses teal for the primary line and a lighter shade for the secondary line, with ample negative space. Icons maintain consistent 64x64px dimensions with centered compositions. Clean, professional aesthetic suitable for both light and dark modes.

Stylized art deco icons for fitness tracking. Geometric abstractions of health symbols with gold accents. Balanced designs incorporating circles, triangles, and zigzag motifs. Clean and sophisticated.

Set of charming wellness icons for digital health tracker. Organic, hand-drawn aesthetic with soft, curvy lines. Uplifting color combination of lemon yellow and fuchsia pink. Subtle size variations among icons for a dynamic, handcrafted feel.

Lush greenery tapestry in 16:9 panoramic view. Detailed monstera leaves overlap in foreground, giving way to intricate ferns and tendrils. Emerald and sage watercolor washes create atmospheric depth. Foliage density decreases towards center, suggesting an enchanted forest clearing.

Modern botanical line drawing in 16:9 widescreen. Forest green single-weight outlines of stylized foliage. Negative space concentrated in the center for optimal text placement. Geometric simplification of natural elements with a focus on curves and arcs.

3D sculptural typography spelling out “BRAVE” with each letter made from a different material, arranged in a dynamic composition.

Experimental typographic interpretation of “BRAVE” using abstract, interconnected geometric shapes that flow and blend organically. Hyper-detailed textures reminiscent of fractals and natural patterns create a mesmerizing, otherworldly appearance with sharp contrast.

A dreamy photograph overlaid with delicate pen-and-ink drawings, blending reality and fantasy to reveal hidden magic in ordinary moments.

Surreal digital collage blending organic and technological elements in a futuristic style.

Abstract figures emerging from digital screens, gradient color transitions, mixed textures, dynamic composition, conceptual narrative style.

Abstract humanoid forms materializing from multiple digital displays, vibrant color gradients flowing between screens, contrasting smooth and pixelated textures, asymmetrical layout with visual tension, surreal storytelling aesthetic.

Abstract figures emerging from digital screens, glitch art aesthetic with RGB color shifts, fragmented pixel clusters, high contrast scanlines, deep shadows cast by volumetric lighting.

Conclusion
The examples showcased here are just the beginning of what’s possible with Amazon Nova Canvas. For even greater control, you can guide generations with reference images, use custom color palettes, or make precise edits, such as swapping backgrounds or refining details, with simple inputs. Plus, with built-in safeguards such as watermarking and content moderation, Nova Canvas offers a responsible and secure creative experience. Whether you’re a professional creator, a marketing team, or an innovator with a vision, Nova Canvas provides the tools to bring your ideas to life.
We invite you to explore these possibilities yourself and discover how Nova Canvas can transform your creative process. Stay tuned for our next installment, where we’ll dive into the exciting world of video generation with Amazon Nova Reel.
Ready to start creating? Visit the Amazon Bedrock console today and bring your ideas to life with Nova Canvas. For more information about features, specifications, and additional examples, explore our documentation on creative content generation with Amazon Nova.

Creative content generation with Amazon Nova
Prompting best practices for Amazon Nova content creation models
Image and video prompt engineering for Amazon Nova Canvas and Amazon Nova Reel

About the authors
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Kris Schultz has spent over 25 years bringing engaging user experiences to life by combining emerging technologies with world class design. As Sr. Solutions Architect within Amazon AGI, he influences the development of Amazon’s first-party generative AI models. Kris is passionate about empowering users and creators of all types with generative AI tools and knowledge.
Sanju Sunny is a Generative AI Design Technologist with AWS Prototyping & Cloud Engineering (PACE), specializing in strategy, engineering, and customer experience. He collaborates with customers across diverse industries, leveraging Amazon’s customer-obsessed innovation mechanisms to rapidly conceptualize, validate, and prototype innovative products, services, and experiences.
Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Benchmarking Amazon Nova and GPT-4o models with FloTorch

Based on original post by Dr. Hemant Joshi, CTO, FloTorch.ai
A recent evaluation conducted by FloTorch compared the performance of Amazon Nova models with OpenAI’s GPT-4o.
Amazon Nova is a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance. The Amazon Nova family of models includes Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro, which support text, image, and video inputs while generating text-based outputs. These models offer enterprises a range of capabilities, balancing accuracy, speed, and cost-efficiency.
Using its enterprise software, FloTorch conducted an extensive comparison between Amazon Nova models and OpenAI’s GPT-4o models with the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset. FloTorch’s evaluation focused on three critical factors—latency, accuracy, and cost—across five diverse topics.
Key findings from the benchmark study:

GPT-4o demonstrated a slight advantage in accuracy over Amazon Nova Pro
Amazon Nova Pro outperformed GPT-4o in efficiency, responding 21.97% faster while being 65.26% more cost-effective
Amazon Nova Micro and Amazon Nova Lite outperformed GPT-4o-mini by 4 and 2 percentage points in accuracy, respectively
In terms of affordability, Amazon Nova Micro and Amazon Nova Lite were 73.10% and 56.59% cheaper than GPT-4o-mini, respectively
Amazon Nova Micro and Amazon Nova Lite also demonstrated faster response times, with 20.48% and 26.60% improvements, respectively

In this post, we discuss the findings from this benchmarking in more detail.
The growing need for cost-effective AI models
The landscape of generative AI is rapidly evolving. OpenAI launched GPT-4o in May 2024, and Amazon introduced Amazon Nova models at AWS re:Invent in December 2024. Although GPT-4o has gained traction in the AI community, enterprises are showing increased interest in Amazon Nova due to its lower latency and cost-effectiveness.
Large language models (LLMs) are generally proficient in responding to user queries, but they sometimes generate overly broad or inaccurate responses. Additionally, LLMs might provide answers that extend beyond the company-specific context, making them unsuitable for certain enterprise use cases.
One of the most critical applications for LLMs today is Retrieval Augmented Generation (RAG), which enables AI models to ground responses in enterprise knowledge bases such as PDFs, internal documents, and structured data. This is a crucial requirement for enterprises that want their AI systems to provide responses strictly within a defined scope.
To better serve enterprise customers, the evaluation aimed to answer three key questions:

How does Amazon Nova Pro compare to GPT-4o in terms of latency, cost, and accuracy?
How do Amazon Nova Micro and Amazon Nova Lite perform against GPT-4o mini in these same metrics?
How well do these models handle RAG use cases across different industry domains?

By addressing these questions, the evaluation provides enterprises with actionable insights into selecting the right AI models for their specific needs—whether optimizing for speed, accuracy, or cost-efficiency.
Overview of the CRAG benchmark dataset
The CRAG dataset was released by Meta for testing with factual queries across five domains with eight question types and a large number of question-answer pairs. The five domains in the CRAG dataset are Finance, Sports, Music, Movie, and Open (miscellaneous). The eight question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. The following table provides example questions with their domain and question type.

Domain  | Question                                                                              | Question Type
Sports  | Can you carry less than the maximum number of clubs during a round of golf?          | simple
Music   | Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)? | simple_w_condition
Open    | Can i make cookies in an air fryer?                                                  | simple
Finance | Did meta have any mergers or acquisitions in 2022?                                   | simple_w_condition
Movie   | In 2016, which movie was distinguished for its visual effects at the oscars?         | simple_w_condition

The evaluation considered 200 queries from this dataset representing five domains and two question types, simple and simple_w_condition. Both question types are common among users, and a typical Google search for a query such as "Can you tell me how many grammies were won by arlo guthrie until 60th grammy (2017)?" will not return the correct answer (one Grammy). FloTorch used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides the top five search result pages for each query. These five webpages act as a knowledge base (source data) to limit the RAG model’s response. The goal is to index these five webpages dynamically using a common embedding algorithm and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.
Evaluation setup
The RAG evaluation pipeline consists of several key components, as illustrated in the following diagram.

In this section, we explore each component in more detail.
Knowledge base
FloTorch used the top five HTML webpages provided with the CRAG dataset for each query as the knowledge base source data. HTML pages were parsed to extract text for the embedding stage.
Chunking strategy
FloTorch used a fixed chunking strategy with a chunk size of 512 tokens (four characters is usually around one token) and a 10% overlap between chunks. Further experiments with different chunking strategies, chunk sizes, and overlap percentages are planned for the coming weeks, and this post will be updated with the results.
Embedding strategy
FloTorch used the Amazon Titan Text Embeddings V2 model on Amazon Bedrock with an output vector size of 1024. With a maximum input token limit of 8,192 for the model, the system successfully embedded chunks from the knowledge base source data as well as short queries from the CRAG dataset efficiently. Amazon Bedrock APIs make it straightforward to use Amazon Titan Text Embeddings V2 for embedding data.
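As a minimal sketch of what such an embedding call might look like, the snippet below invokes the Amazon Titan Text Embeddings V2 model through the Amazon Bedrock runtime; the client setup, Region, and helper name are illustrative assumptions rather than code from the benchmark.

import json
import boto3

# Minimal sketch: embed a single text chunk with Amazon Titan Text Embeddings V2.
# The Region and helper name are assumptions for illustration.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text, dimensions=1024, normalize=True):
    body = json.dumps({
        "inputText": text,
        "dimensions": dimensions,   # Titan V2 supports 256, 512, or 1024 output dimensions
        "normalize": normalize,
    })
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]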
Vector database
FloTorch selected Amazon OpenSearch Service as the vector database for its high-performance metrics. The implementation included a provisioned three-node sharded OpenSearch Service cluster. Each provisioned node was an r7g.4xlarge instance, selected for its availability and sufficient capacity to meet the performance requirements. FloTorch used HNSW indexing in OpenSearch Service.
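For reference, a k-NN index with an HNSW method sized for 1024-dimension Titan V2 vectors might be created as in the following minimal sketch using opensearch-py; the host, index name, engine, and HNSW parameters are assumptions, not the benchmark’s actual configuration.

from opensearchpy import OpenSearch

# Minimal sketch: create a k-NN index with HNSW for 1024-dimension embeddings.
# Host, auth, index name, engine, and HNSW parameters are illustrative assumptions.
client = OpenSearch(hosts=[{"host": "my-opensearch-domain", "port": 443}], use_ssl=True)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            },
        }
    },
}
client.indices.create(index="crag_titan_index", body=index_body)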
Retrieval (and reranking) strategy
FloTorch used a retrieval strategy with a k-nearest neighbor (k-NN) of five for retrieved chunks. The experiments excluded reranking algorithms to make sure retrieved chunks remained consistent for both models when inferring the answer to the provided query. The following code snippet embeds the given query and passes the embeddings to the search function:

import os
import logging
from typing import List

logger = logging.getLogger(__name__)

def search_results(interaction_ids: List[str], queries: List[str], k: int):
    """Retrieve search results for queries."""
    results = []
    embedding_max_length = int(os.getenv("EMBEDDING_MAX_LENGTH", 1024))
    normalize_embeddings = os.getenv("NORMALIZE_EMBEDDINGS", "True").lower() == "true"

    for interaction_id, query in zip(interaction_ids, queries):
        try:
            # create_embeddings_with_titan_bedrock and search are helpers defined elsewhere
            _, _, embedding = create_embeddings_with_titan_bedrock(query, embedding_max_length, normalize_embeddings)
            results.append(search(interaction_id + '_titan', embedding, k))
        except Exception as e:
            logger.error(f"Error processing query {query}: {e}")
            results.append(None)
    return results

Inferencing
FloTorch accessed the GPT-4o model through the OpenAI API with an API key and the Amazon Nova Pro model through the Amazon Bedrock Converse API. GPT-4o supports a context window of 128,000 tokens, compared to Amazon Nova Pro’s context window of 300,000 tokens. The maximum output token limit of GPT-4o is 16,384, vs. 5,000 for Amazon Nova Pro. The benchmarking experiments were conducted without Amazon Bedrock Guardrails functionality. The implementation used the universal gateway provided by the FloTorch enterprise version to enable consistent API calls using the same function and to track token count and latency metrics uniformly. The inference function code is as follows:

def generate_responses(dataset_path: str, model_name: str, batch_size: int, api_endpoint: str, auth_header: str,
                       max_tokens: int, search_k: int, system_prompt: str):
    """Generate responses for queries."""
    results = []

    for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating responses"):
        interaction_ids = [item["interaction_id"] for item in batch]
        queries = [item["query"] for item in batch]
        search_results_list = search_results(interaction_ids, queries, search_k)

        for i, item in enumerate(batch):
            item["search_results"] = search_results_list[i]

        responses = send_batch_request(batch, model_name, api_endpoint, auth_header, max_tokens, system_prompt)

        for i, response in enumerate(responses):
            results.append({
                "interaction_id": interaction_ids[i],
                "query": queries[i],
                "prediction": response.get("choices", [{}])[0].get("message", {}).get("content") if response else None,
                "response_time": response.get("response_time") if response else None,
                "response": response,
            })

    return results

Evaluation
Both models were evaluated by running batch queries. A batch of eight was selected to comply with Amazon Bedrock quota limits as well as GPT-4o rate limits. The query function code is as follows:

import time
import logging
from typing import Dict, List

import requests

logger = logging.getLogger(__name__)

def send_batch_request(batch: List[Dict], model_name: str, api_endpoint: str, auth_header: str, max_tokens: int,
                       system_prompt: str):
    """Send batch queries to the API."""
    headers = {"Authorization": auth_header, "Content-Type": "application/json"}
    responses = []

    for item in batch:
        query = item["query"]
        query_time = item["query_time"]
        retrieval_results = item.get("search_results", [])

        references = "# References\n" + "\n".join(
            [f"Reference {_idx + 1}:\n{res['text']}\n" for _idx, res in enumerate(retrieval_results)])
        user_message = f"{references}\n------\n\nUsing only the references listed above, answer the following question:\nQuestion: {query}\n"

        payload = {
            "model": model_name,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "max_tokens": max_tokens,
        }

        try:
            start_time = time.time()
            response = requests.post(api_endpoint, headers=headers, json=payload, timeout=25000)
            response.raise_for_status()
            response_json = response.json()
            response_json['response_time'] = time.time() - start_time
            responses.append(response_json)
        except requests.RequestException as e:
            logger.error(f"API request failed for query: {query}. Error: {e}")
            responses.append(None)

    return responses

Benchmarking on the CRAG dataset
In this section, we discuss the latency, accuracy, and cost measurements of benchmarking on the CRAG dataset.
Latency
Latency for each query response was calculated as the difference between two timestamps: the timestamp when the API call is made to the inference LLM, and the timestamp when the entire response is received from the inference endpoint. A lower latency indicates a faster-performing LLM, making it suitable for applications requiring rapid response times. The study indicates that latency can be further reduced for both models through optimizations and caching techniques; however, the evaluation focused on measuring out-of-the-box latency performance for both models.
Accuracy
FloTorch used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. The script was enhanced to provide proper categorization of correct, incorrect, and missing responses. The default GPT-4o evaluation LLM in the evaluation script was replaced with the mixtral-8x7b-instruct-v0:1 model API. Additional modifications to the script enabled monitoring of input and output tokens and latency as described earlier.
Cost
Cost calculations were straightforward because both Amazon Nova Pro and GPT-4o have published price per million input and output tokens separately. The calculation methodology involved multiplying input tokens by corresponding rates and applying the same process for output tokens. The total cost for running 200 queries was determined by combining input token and output token costs. OpenSearch Service provisioned cluster costs were excluded from this analysis because the cost comparison focused solely on the inference level between Amazon Nova Pro and GPT-4o LLMs.
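The calculation itself is simple; the following minimal sketch shows the shape of it. The token counts and per-million-token prices below are placeholders rather than published rates, so substitute the current pricing for each model.

# Minimal sketch of the per-model cost calculation described above.
# Token counts and per-million-token prices are placeholders, not published rates.
def inference_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    return (input_tokens / 1_000_000) * input_price_per_million + \
           (output_tokens / 1_000_000) * output_price_per_million

# Example: totals accumulated across the 200 benchmark queries (illustrative numbers).
total_cost = inference_cost(input_tokens=350_000, output_tokens=40_000,
                            input_price_per_million=0.80, output_price_per_million=3.20)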
Results
The following table summarizes the results.

Metric                                     | Amazon Nova Pro                           | GPT-4o                                    | Observation
Accuracy on subset of the CRAG dataset     | 51.50% (103 correct responses out of 200) | 53.00% (106 correct responses out of 200) | GPT-4o outperforms Amazon Nova Pro by 1.5 percentage points on accuracy
Cost for running inference for 200 queries | $0.00030205                               | $0.000869537                              | Amazon Nova Pro saves 65.26% in costs compared to GPT-4o
Average latency (seconds)                  | 1.682539835                               | 2.15615045                                | Amazon Nova Pro is 21.97% faster than GPT-4o
Average of input and output tokens         | 1946.621359                               | 1782.707547                               | Typical GPT-4o responses are shorter than Amazon Nova Pro responses

For simple queries, Amazon Nova Pro and GPT-4o have similar accuracies (55 and 56 correct responses, respectively), but for simple queries with conditions, GPT-4o performs slightly better than Amazon Nova Pro (50 vs. 48 correct answers). Imagine you are part of an organization running an AI assistant service that handles 1,000 questions per month from each of 10,000 users (10,000,000 queries per month). Amazon Nova Pro will save your organization $5,674.88 per month ($68,098 per year) compared to GPT-4o.
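As a quick sanity check, the arithmetic below reproduces that savings figure from the table values. Note that this only works out if the cost figures are read as average cost per query, which is an assumption on our part.

# Reproducing the savings estimate above from the table values.
# Assumption: the cost figures are treated as average cost per query.
nova_pro_cost = 0.00030205
gpt_4o_cost = 0.000869537
queries_per_month = 10_000_000  # 10,000 users x 1,000 questions each

monthly_savings = (gpt_4o_cost - nova_pro_cost) * queries_per_month
annual_savings = monthly_savings * 12
print(round(monthly_savings, 2), round(annual_savings))  # ~5674.87 per month, ~68098 per year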
Let’s look at similar results for Amazon Nova Micro, Amazon Nova Lite, and GPT-4o mini models on the same dataset.

Metric                                     | Amazon Nova Lite                              | Amazon Nova Micro                              | GPT-4o mini                               | Observation
Accuracy on subset of the CRAG dataset     | 52.00% (104 correct responses out of 200)     | 54.00% (108 correct responses out of 200)      | 50.00% (100 correct responses out of 200) | Amazon Nova Lite and Amazon Nova Micro outperform GPT-4o mini by 2 and 4 percentage points, respectively
Cost for running inference for 200 queries | $0.00002247 (56.59% cheaper than GPT-4o mini) | $0.000013924 (73.10% cheaper than GPT-4o mini) | $0.000051768                              | Amazon Nova Lite and Amazon Nova Micro are cheaper than GPT-4o mini by 56.59% and 73.10%, respectively
Average latency (seconds)                  | 1.553371465 (26.60% faster than GPT-4o mini)  | 1.6828564 (20.48% faster than GPT-4o mini)     | 2.116291895                               | Both Amazon Nova models are at least 20% faster than GPT-4o mini
Average of input and output tokens         | 1930.980769                                   | 1940.166667                                    | 1789.54                                   | GPT-4o mini returns shorter answers

Amazon Nova Micro is significantly faster and less expensive than GPT-4o mini while providing more accurate answers. If you run a service that handles about 10 million queries each month, switching from GPT-4o mini to Amazon Nova Micro would reduce inference costs by roughly 73% while also improving accuracy.
Conclusion
Based on these tests for RAG cases, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency compared to GPT-4o and GPT-4o mini models. FloTorch is continuing further experimentation with other relevant LLMs for comparison. Future research will include additional experiments with various query types such as comparison, aggregation, set, false_premise, post-processing, and multi-hop queries.
Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.
About FloTorch
FloTorch.ai helps enterprise customers design and manage agentic workflows in a secure and scalable manner. FloTorch’s mission is to help enterprises make data-driven decisions in the end-to-end generative AI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version for customers with scalable experimentation with different chunking, embedding, retrieval, and inference strategies. The open source version runs in a customer’s own AWS account, so you can experiment with your proprietary data. Interested users are invited to try out FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of this product for scalable experimentation with LLM models and vector databases on cloud platforms. The enterprise version also includes a universal gateway with a model registry to custom define new LLMs, and a recommendation engine to suggest new LLMs and agent workflows. For more information, contact us at info@flotorch.ai.

About the author
Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.
Dr. Hemant Joshi has over 20 years of industry experience building products and services with AI/ML technologies. As CTO of FloTorch, Hemant is engaged with customers to implement State of the Art GenAI solutions and agentic workflows for enterprises.

Deploy DeepSeek-R1 distilled models on Amazon SageMaker using a Large …

DeepSeek-R1 is a large language model (LLM) developed by DeepSeek AI that uses reinforcement learning to enhance reasoning capabilities through a multi-stage training process from a DeepSeek-V3-Base foundation. A key distinguishing feature is its reinforcement learning step, which was used to refine the model’s responses beyond the standard pre-training and fine-tuning process. By incorporating RL, DeepSeek-R1 can adapt more effectively to user feedback and objectives, ultimately enhancing both relevance and clarity. In addition, DeepSeek-R1 employs a chain-of-thought (CoT) approach, meaning it’s equipped to break down complex queries and reason through them in a step-by-step manner. This guided reasoning process allows the model to produce more accurate, transparent, and detailed answers. This model combines RL-based fine-tuning with CoT capabilities, aiming to generate structured responses while focusing on interpretability and user interaction. With its wide-ranging capabilities, DeepSeek-R1 has captured the industry’s attention as a versatile text-generation model that can be integrated into various workflows such as agents, logical reasoning, and data interpretation tasks.
DeepSeek-R1 uses a Mixture of Experts (MoE) architecture and is 671 billion parameters in size. The MoE architecture allows activation of 37 billion parameters, enabling efficient inference by routing queries to the most relevant expert clusters. This approach allows the model to specialize in different problem domains while maintaining overall efficiency.
DeepSeek-R1 distilled models bring the reasoning capabilities of the main R1 model to more efficient architectures based on popular open models like Meta’s Llama (8B and 70B) and Alibaba’s Qwen (1.5B, 7B, 14B, and 32B). Distillation refers to a process of training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model, using it as a teacher model. For example, DeepSeek-R1-Distill-Llama-8B offers an excellent balance of performance and efficiency. By integrating this model with Amazon SageMaker AI, you can benefit from the AWS scalable infrastructure while maintaining high-quality language model capabilities.
In this post, we show how to use the distilled models in SageMaker AI, which offers several options to deploy the distilled versions of the R1 model.
Solution overview
You can use DeepSeek’s distilled models within the AWS managed machine learning (ML) infrastructure. We demonstrate how to deploy these models on SageMaker AI inference endpoints.
SageMaker AI offers a choice of which serving container to use for deployments:

LMI container – A Large Model Inference (LMI) container with different backends (vLLM, TensorRT-LLM, and Neuron). See the following GitHub repo for more details.
TGI container – A Hugging Face Text Generation Inference (TGI) container. You can find more details in the following GitHub repo.

In the following code snippets, we use the LMI container example. See the following GitHub repo for more deployment examples using TGI, TensorRT-LLM, and Neuron.
LMI containers
LMI containers are a set of high-performance Docker containers purpose built for LLM inference. With these containers, you can use high-performance open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on SageMaker endpoints. These containers bundle together a model server with open source inference libraries to deliver an all-in-one LLM serving solution.
LMI containers provide many features, including:

Optimized inference performance for popular model architectures like Meta Llama, Mistral, Falcon, and more
Integration with open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX
Continuous batching for maximizing throughput at high concurrency
Token streaming
Quantization through AWQ, GPTQ, FP8, and more
Multi-GPU inference using tensor parallelism
Serving LoRA fine-tuned models
Text embedding to convert text data into numeric vectors
Speculative decoding support to decrease latency

LMI containers provide these features through integrations with popular inference libraries. A unified configuration format enables you to use the latest optimizations and technologies across libraries. To learn more about the LMI components, see Components of LMI.
Prerequisites
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created. For details, refer to Create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host the DeepSeek-R1-Distill-Llama-8B model on an ml.g5.2xlarge SageMaker hosting instance.
Deploy DeepSeek-R1 for inference
The following is a step-by-step example that demonstrates how to programmatically deploy DeepSeek-R1-Distill-Llama-8B for inference. The code for deploying the model is provided in the GitHub repo. You can clone the repo and run the notebook from SageMaker AI Studio.

Configure the SageMaker execution role and import the necessary libraries:

!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

import json
import boto3
import sagemaker

# Set up IAM Role
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

There are two ways to deploy an LLM like DeepSeek-R1 or its distilled variants on SageMaker:

Deploy uncompressed model weights from an Amazon S3 bucket – In this scenario, you need to set the HF_MODEL_ID variable to the Amazon Simple Storage Service (Amazon S3) prefix that has model artifacts. This method is generally much faster, with the model typically downloading in just a couple of minutes from Amazon S3.
Deploy directly from Hugging Face Hub (requires internet access) – To do this, set HF_MODEL_ID to the Hugging Face repository or model ID (for example, “deepseek-ai/DeepSeek-R1-Distill-Llama-8B”). However, this method tends to be slower and can take significantly longer to download the model compared to using Amazon S3. This approach will not work if enable_network_isolation is enabled, because it requires internet access to retrieve model artifacts from the Hugging Face Hub.

In this example, we deploy the model directly from the Hugging Face Hub:

vllm_config = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
}

The OPTION_MAX_ROLLING_BATCH_SIZE parameter limits the number of concurrent requests that the endpoint can process. We set it to 16 to limit GPU memory requirements; adjust it based on your latency and throughput requirements.
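The Model object in the next step references an inference_image_uri variable, as well as model and endpoint names, which aren’t defined in the snippets above. One way they might be resolved is sketched below; the djl-lmi framework name, the version string, and the placeholder names are assumptions that depend on your SageMaker Python SDK release, so check the LMI documentation for the values that match your environment.

import sagemaker
from sagemaker import image_uris

# Minimal sketch: resolve an LMI (DJL) serving container image URI for the current Region.
# The framework name and version are assumptions; verify them against the LMI documentation.
session = sagemaker.Session()
inference_image_uri = image_uris.retrieve(
    framework="djl-lmi",
    region=session.boto_region_name,
    version="0.29.0",
)

model_name = "deepseek-r1-distill-llama-8b"        # placeholder model name
endpoint_name = "deepseek-r1-distill-llama-8b-ep"  # placeholder endpoint name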

Create and deploy the model:

# Create a Model object
lmi_model = sagemaker.Model(
    image_uri=inference_image_uri,
    env=vllm_config,
    role=role,
    name=model_name,
    enable_network_isolation=True,  # Ensures the model is isolated from the internet
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"]
    }
)

# Deploy to SageMaker
lmi_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=1600,
    endpoint_name=endpoint_name,
)

Make inference requests:

sagemaker_client = boto3.client('sagemaker-runtime', region_name='us-east-1')
endpoint_name = predictor.endpoint_name

input_payload = {
    "inputs": "What is Amazon SageMaker? Answer concisely.",
    "parameters": {"max_new_tokens": 250, "temperature": 0.1}
}

serialized_payload = json.dumps(input_payload)

response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=serialized_payload
)
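The response Body is a stream containing the model’s JSON output. A minimal sketch of reading it follows; the generated_text key reflects a common LMI response shape and is an assumption, so inspect the raw result if your container returns a different structure.

# Minimal sketch: parse the JSON returned by the endpoint.
# The "generated_text" key is an assumption based on a common LMI response shape.
result = json.loads(response['Body'].read().decode('utf-8'))
print(result.get('generated_text', result))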

Performance and cost considerations
The ml.g5.2xlarge instance provides a good balance of performance and cost. For large-scale real-time inference, use larger batch sizes to optimize cost and performance. You can also use batch transform for offline, large-volume inference to reduce costs. Monitor endpoint usage to optimize costs.
Clean up
Clean up your resources when they’re no longer needed:

predictor.delete_endpoint()

Security
You can configure advanced security and infrastructure settings for the DeepSeek-R1 model, including virtual private cloud (VPC) networking, service role permissions, encryption settings, and EnableNetworkIsolation to restrict internet access. For production deployments, it’s essential to review these settings to maintain alignment with your organization’s security and compliance requirements.
By default, the model runs in a shared AWS managed VPC with internet access. To enhance security and control access, you should explicitly configure a private VPC with appropriate security groups and IAM policies based on your requirements.
SageMaker AI provides enterprise-grade security features to help keep your data and applications secure and private. We do not share your data with model providers, unless you direct us to, providing you full control over your data. This applies to all models—both proprietary and publicly available, including DeepSeek-R1 on SageMaker.
For more details, see Configure security in Amazon SageMaker AI.
Logging and monitoring
You can monitor SageMaker AI using Amazon CloudWatch, which collects and processes raw data into readable, near real-time metrics. These metrics are retained for 15 months, allowing you to analyze historical trends and gain deeper insights into your application’s performance and health.
Additionally, you can configure alarms to monitor specific thresholds and trigger notifications or automated actions when those thresholds are met, helping you proactively manage your deployment.
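For example, a minimal sketch of an alarm on the endpoint’s ModelLatency metric might look like the following; the endpoint name, threshold, and SNS topic ARN are placeholders.

import boto3

# Minimal sketch: alarm when average model latency exceeds a threshold.
# Endpoint/variant names, threshold, and SNS topic ARN are placeholders.
cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="deepseek-r1-endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-deepseek-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2_000_000,  # ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)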
For more details, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch.
Best practices
It’s always recommended to deploy your LLM endpoints inside your VPC and behind a private subnet, without internet gateways, and preferably with no egress. Ingress from the internet should also be blocked to minimize security risks.
Always apply guardrails to make sure incoming requests and outgoing model responses are validated for safety, bias, and toxicity. You can guard your SageMaker endpoint model responses with Amazon Bedrock Guardrails; a minimal sketch follows. See DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart for more details.
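A minimal sketch of that pattern using the Amazon Bedrock ApplyGuardrail API is shown below; the guardrail ID, version, and response text are placeholders.

import boto3

# Minimal sketch: validate text returned by a SageMaker endpoint with a pre-created
# Amazon Bedrock guardrail. The guardrail ID and version are placeholders.
bedrock_runtime = boto3.client("bedrock-runtime")
model_response_text = "..."  # text returned by your SageMaker endpoint

result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="gr-1234567890",
    guardrailVersion="1",
    source="OUTPUT",  # use "INPUT" to screen the user prompt instead
    content=[{"text": {"text": model_response_text}}],
)
if result["action"] == "GUARDRAIL_INTERVENED":
    model_response_text = "Sorry, I can't share that response."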
Inference performance evaluation
In this section, we focus on inference performance of DeepSeek-R1 distilled variants on SageMaker AI. Evaluating the performance of LLMs in terms of end-to-end latency, throughput, and resource efficiency is crucial for providing responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly impacts user experience, system reliability, and deployment feasibility at scale. For this post, we test all DeepSeek-R1 distilled variants—1.5B, 7B, 8B, 14B, 32B, and 70B—across four performance metrics:

End-to-end latency (time between sending a request and receiving the response)
Throughput (tokens per second)
Time to first token
Inter-token latency

The main purpose of this performance evaluation is to give you an indication of the relative performance of the distilled R1 models on different hardware for generic traffic patterns. We didn’t try to optimize the performance for each model/hardware/use case combination. These results should not be treated as the best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets, traffic patterns, and input/output sequence lengths.
Scenarios
We tested the following scenarios:

Container/model configuration – We used LMI container v14 with default parameters, except MAX_MODEL_LEN, which was set to 10000 (no chunked prefill and no prefix caching). On instances with multiple accelerators, we sharded the model across all available GPUs.
Tokens – We evaluated the SageMaker endpoint-hosted DeepSeek-R1 distilled variants on performance benchmarks using two sample input token lengths. We ran each test 50 times before measuring the average across the different metrics, then repeated the tests with a concurrency of 10.

Short-length test – 512 input tokens and 256 output tokens.
Medium-length test – 3072 input tokens and 256 output tokens.

Hardware – We tested the distilled variants on a variety of instance types with 1, 4, or 8 GPUs per instance. In the following table, a green cell indicates that a model was tested on that particular instance type, and a red cell indicates that a model wasn’t tested on that instance type, either because the instance was excessive for the model size or too small to fit the model in memory.

Box plots
In the following sections, we use a box plot to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers using a box for the middle 50% of the data, with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness, as illustrated in the following figure.

DeepSeek-R1-Distill-Qwen-1.5B
This model can be deployed on a single GPU instance. The results indicate that the ml.g5.xlarge instance outperforms the ml.g6.xlarge instance across all measured performance criteria and concurrency settings.
The following figure illustrates testing with concurrency = 1.

The following figure illustrates testing with concurrency = 10.

DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.2xlarge and ml.g6e.2xlarge. Among all instances, ml.g6e.2xlarge demonstrated the highest performance.
The following figure illustrates testing with concurrency = 1.

The following figure illustrates testing with concurrency = 10.

DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.2xlarge, ml.g5.12xlarge, ml.g6e.2xlarge, and ml.g6e.12xlarge, with ml.g6e.12xlarge demonstrating the highest performance among all instances.
The following figure illustrates testing with concurrency = 1.

The following figure illustrates testing with concurrency = 10.

DeepSeek-R1-Distill-Qwen-14B
We tested this model on ml.g6.12xlarge, ml.g5.12xlarge, ml.g6e.48xlarge, and ml.g6e.12xlarge. The instance with 8 GPUs (ml.g6e.48xlarge) showed the best results.
The following figure illustrates testing with concurrency = 1.

The following figure illustrates testing with concurrency = 10.

DeepSeek-R1-Distill-Qwen-32B
This is a fairly large model, and we only deployed it on multi-GPU instances: ml.g6.12xlarge, ml.g5.12xlarge, and ml.g6e.12xlarge. The latest generation (ml.g6e.12xlarge) showed the best performance across all concurrency settings.
The following figure illustrates testing with concurrency = 1.

The following figure illustrates testing with concurrency = 10.

DeepSeek-R1-Distill-Llama-70B
We tested this model on two different 8-GPU instances: ml.g6e.48xlarge and ml.p4d.24xlarge. The latter showed the best performance.
The following figure illustrates testing with concurrency = 1.

The following figure illustrates testing with concurrency = 10.

Conclusion
Deploying DeepSeek models on SageMaker AI provides a robust solution for organizations seeking to use state-of-the-art language models in their applications. The combination of DeepSeek’s powerful models and SageMaker AI managed infrastructure offers a scalable and efficient approach to natural language processing tasks.
The performance evaluation section presents a comprehensive performance evaluation of all DeepSeek-R1 distilled models across four key inference metrics, using 13 different NVIDIA accelerator instance types. This analysis offers valuable insights to assist in the selection of the optimal instance type for deploying the DeepSeek-R1 solution.
Check out the complete code in the following GitHub repos:

Deploy a DeepSeek model Quantized LLaMA 3.1 70B Instruct Model Using SageMaker Endpoints and SageMaker Large Model Inference (LMI) Container
Deploy DeepSeek R1 Large Language Model from HuggingFace Hub on Amazon SageMaker
Deploy deepseek-ai/DeepSeek-R1-Distill-* models on Amazon SageMaker using LMI container
Deploy DeepSeek R1 Llama on AWS Inferentia using SageMaker Large Model Inference Container
Interactive fine-tuning of Foundation Models with Amazon SageMaker Training using @remote decorator

For additional resources, refer to:

Amazon SageMaker Documentation
Deepseek Model Hub
Hugging Face on Amazon SageMaker

About the Authors
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.

From fridge to table: Use Amazon Rekognition and Amazon Bedrock to gen …

In today’s fast-paced world, time is of the essence and even basic tasks like grocery shopping can feel rushed and challenging. Despite our best intentions to plan meals and shop accordingly, we often end up ordering takeout, leaving unused perishable items to spoil in the refrigerator. This seemingly small issue of wasted groceries, paired with the about-to-perish grocery supplies thrown away by grocery stores, contributes significantly to the global food waste problem. This post demonstrates how we can help solve this problem by harnessing the power of generative AI on AWS.
By using computer vision capabilities through Amazon Rekognition and the content generation capabilities offered by foundation models (FMs) available through Amazon Bedrock, we developed a solution that will recommend recipes based on what you already have in your refrigerator and an inventory of about-to-expire items in local supermarkets, making sure that both food in your home and food in grocery stores are used, saving money and reducing waste.
In this post, we walk through how to build the FoodSavr solution (fictitious name used for the purposes of this post) using Amazon Rekognition Custom Labels to detect the ingredients and generate personalized recipes using Anthropic’s Claude 3.0 on Amazon Bedrock. We demonstrate an end-to-end architecture where a user can upload an image of their fridge, and using the ingredients found there (detected by Amazon Rekognition), the solution will give them a list of recipes (generated by Amazon Bedrock). The architecture also recognizes missing ingredients and provides the user with a list of nearby grocery stores.
Solution overview
The following reference architecture shows how you can use Amazon Bedrock, Amazon Rekognition, and other AWS services to implement the FoodSavr solution.

As shown in the preceding figure, the architecture includes the following steps:

For an end-to-end solution, we recommend having a frontend where your users can upload images of items that they want detected and labeled. To learn more about frontend deployment on AWS, see Front-end Web & Mobile on AWS.
The picture taken by the user is stored in an Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket should be configured with a lifecycle policy that deletes the image after use (a minimal lifecycle configuration sketch follows this list). To learn more about S3 lifecycle policies, see Managing your storage lifecycle.
This architecture uses different AWS Lambda functions. Lambda is a serverless AWS compute service that runs event-driven code and automatically manages the compute resources. The first Lambda function, DetectIngredients, harnesses the power of Amazon Rekognition by using the Boto3 Python API. Amazon Rekognition is a cutting-edge computer vision service that uses machine learning (ML) models to analyze the uploaded images.
We use Rekognition Custom Labels to train a model with a dataset of ingredients. You can adopt this architecture to use Rekognition Custom Labels with your own use case. With the aid of custom labels trained to recognize various ingredients, Amazon Rekognition identifies the items present in the images.
The detected ingredient names are then securely stored in an Amazon DynamoDB table (a fully managed NoSQL database service) for retrieval and modification. Users are presented with a list of the ingredients that have been detected, along with the option of adding other ingredients or deleting ingredients that they don’t want or that were misidentified.
After the ingredient list is confirmed by the user through the web interface, they can initiate the recipe generation process with a click of a button. This action invokes another Lambda function called GenerateRecipes, which uses the advanced language capabilities of the Amazon Bedrock API (Anthropic’s Claude v3 in this post). This state-of-the-art FM analyzes the confirmed ingredient list retrieved from DynamoDB and generates relevant recipes tailored to those specific ingredients. Additionally, the model provides images to accompany each recipe, providing a visually appealing and inspiring culinary experience.
Amazon Bedrock contains two key FMs that are used for this solution example: Anthropic’s Claude v3 (newer versions have been released since the writing of this post) and Stable Diffusion, used for recipe generation and image generation respectively. For this solution, you can use any combination of FMs that suit your use case. The generated content (recipes as text and recipe images, in this case) can then be displayed to the user on the frontend.
For this use case, you can also set up an optional ordering pipeline, which allows a user to place orders for the ingredients described by the FMs. This would be fronted by a Lambda function, FindGroceryItems, that can look for the recommended grocery items in a database contributed to by local supermarkets. This database would consist of about-to-expire ingredients along with prices for those ingredients.
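
As referenced in the Amazon S3 step above, a minimal sketch of such a lifecycle configuration might look like the following; the bucket name, prefix, and one-day expiration are placeholders for illustration.

import boto3

# Minimal sketch: expire uploaded fridge images after one day.
# Bucket name and prefix are placeholders for this illustration.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="foodsavr-uploads",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-uploaded-images",
                "Filter": {"Prefix": "uploads/"},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)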

In the following sections, we dive into how you can set up this architecture on your own account. Step 8 is optional and therefore not covered in this post.
Using Amazon Rekognition to detect images
The image recognition is powered by Amazon Rekognition, which offers pre-trained and customizable computer vision capabilities to allow users to obtain information and insights from their images. For customizability, you can use Rekognition Custom Labels to identify scenes and objects in your images that are specific to your business needs. If your images are already labeled, you can begin training a model from the Amazon Rekognition console. Otherwise, you can label them directly from the Amazon Rekognition labeling interface, or use other services such as Amazon SageMaker Ground Truth. The following screenshot shows an example of what the bounding box process would look like on the Amazon Rekognition labeling interface.

To get started with labeling, see Using Amazon Rekognition Custom Labels and Amazon A2I for detecting pizza slices and augmenting predictions. For this architecture, we collected a dataset of up to 70 images of common food items typically found in refrigerators. We recommend that you gather your own relevant images and store them in an S3 bucket to use for training with Amazon Rekognition. You can then use Rekognition Custom Labels to create labels with food names, and assign bounding boxes on the images so the model knows where to look. To get started with training your own custom model, see Training an Amazon Rekognition Custom Labels model.
When model training is complete, you will see all your trained models under Projects on the AWS Management Console for Amazon Rekognition. Here, you can also look at the model performance, measured by the F1 score (shown in the following screenshot).

You can also iterate and modify your existing models to create newer versions. Before using your model, make sure it’s in the RUNNING state. To use the model, choose the model you want to use, and on the Use model tab, choose Start.

You also have the option to programmatically start and stop your model (the exact API call can be copied from the Amazon Rekognition console, but the following is provided as an example):
Use the following API (which is present in the Lambda function) call to detect groceries in an image using your custom labels and custom models:

aws rekognition detect-custom-labels \
  --project-version-arn "MODEL_ARN" \
  --image '{"S3Object": {"Bucket": "MY_BUCKET", "Name": "PATH_TO_MY_IMAGE"}}' \
  --region us-east-1

To stop incurring costs, you can also stop your model when not in use:

aws rekognition stop-project-version \
  --project-version-arn "MODEL_ARN" \
  --region us-east-1

Because we’re using Python, the boto3 Python package is used to make all AWS API calls mentioned in this post. For more information about Boto3, see the Boto3 documentation.
Starting a model might take a few minutes to complete. To check the current status of the model’s readiness, check the details page for the project or use DescribeProjectVersions. Wait for the model status to change to RUNNING.
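If you prefer to start the model and wait for readiness from code rather than the console, a minimal sketch looks like the following; the project and model version ARNs and names are placeholders.

import boto3

# Minimal sketch: start the custom model and block until it reaches RUNNING.
# The project and model version ARNs and names are placeholders.
rekognition = boto3.client("rekognition")
model_version_arn = "arn:aws:rekognition:us-east-1:123456789012:project/FoodSavr/version/FoodSavr.1/1111111111111"

rekognition.start_project_version(ProjectVersionArn=model_version_arn, MinInferenceUnits=1)

waiter = rekognition.get_waiter("project_version_running")
waiter.wait(
    ProjectArn="arn:aws:rekognition:us-east-1:123456789012:project/FoodSavr/2222222222222",
    VersionNames=["FoodSavr.1"],
)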
In the meantime, you can explore the different statistics provided by Amazon Rekognition about your model. Some notable ones are the model performance (F1 score), precision, and recall. These statistics are gathered by Amazon Rekognition at both the model level (as seen in the earlier screenshot) and the individual custom label level (as shown in the following screenshot).

For more information on these statistics, see Metrics for evaluating your model.
Be aware that, while Anthropic’s Claude models offer impressive multi-modal capabilities for understanding and generating content based on text and images, we chose to use Amazon Rekognition Custom Labels for ingredient detection in this solution. Amazon Rekognition is a specialized computer vision service optimized for tasks such as object detection and image classification, using state-of-the-art models trained on massive datasets. Additionally, Rekognition Custom Labels allows us to train custom models tailored to recognize specific food items and ingredients, providing a level of customization that might not be as straightforward with a general-purpose language model. Furthermore, as a fully managed service, Amazon Rekognition can scale seamlessly to handle large volumes of images. While a hybrid approach combining Rekognition and Claude’s multi-modal capabilities could be explored, we chose Rekognition Custom Labels for its specialized computer vision capabilities, customizability, and to demonstrate combining FMs on Amazon Bedrock with other AWS services for this specific use case.
Using Amazon Bedrock FMs to generate recipes
To generate the recipes, we use Amazon Bedrock, a fully managed service that offers high-performing FMs. We use the Amazon Bedrock API to query Anthropic’s Claude v3 Sonnet model. We use the following prompt to provide context to the FM:

You are an expert chef, with expertise in diverse cuisines and recipes.
I am currently a novice and I require you to write me recipes based on the ingredients provided below.
The requirements for the recipes are as follows:
– I need 3 recipes from you
– These recipes can only use ingredients listed below, and nothing else
– For each of the recipes, provide detailed step by step methods for cooking. Format it like this:
1. Step 1: <instructions>
2. Step 2: <instructions>

n. Step n: <instructions>
Remember, you HAVE to use ONLY the ingredients that are provided to you. DO NOT use any other ingredient.
This is crucial. For example, if you are given ingredients “Bread” and “Butter”, you can ONLY use Bread and Butter,
and no other ingredient can be added on.
An example recipe with these two can be:
Recipe 1: Fried Bread
Ingredients:
– Bread
– Butter
1. Step 1: Heat up the pan until it reaches 40 degrees
2. Step 2: Drop in a knob of butter and melt it
3. Step 3: Once butter is melted, add a piece of bread onto pan
4. Step 4: Cook until the bread is browned and crispy
5. Step 5: Repeat on the other side
6. Step 6: You can repeat this for other breads, too

The following code is the body of the Amazon Bedrock API call:

# user_ingredients_str: ingredients detected in the user's fridge (from DynamoDB)
# master_ingredients_str: about-to-expire grocery store items (from DynamoDB)
# prompt: the prompt shown above
content = ("Here is a list of ingredients that a person currently has. " + user_ingredients_str +
           "\n\nAnd here is a list of ingredients at a local grocery store: " + master_ingredients_str + prompt)

body = json.dumps({
    "max_tokens": 2047,
    "messages": [{"role": "user", "content": content}],
    "anthropic_version": "bedrock-2023-05-31"
})

modelId = "anthropic.claude-3-sonnet-20240229-v1:0"

response = bedrock.invoke_model(body=body, modelId=modelId)

Using the combination of the prompt and API call, we generate three recipes using the ingredients retrieved from the DynamoDB table. You can add additional parameters to the body, such as temperature, top_p, and top_k, to further control generation. For more information on getting responses from Anthropic’s Claude 3 models using the Amazon Bedrock API, see Anthropic Claude Messages API. We recommend setting the temperature to something low (such as 0.1 or 0.2) to help ensure deterministic and structured generation of recipes. We also recommend setting the top_p value (nucleus sampling) to something high (such as 0.9) to limit the FM’s predictions to the most probable tokens (in this case, the model will consider the most probable tokens that make up 90% of the total probability mass for its next prediction). top_k is another sampling technique that limits the model’s predictions to the top_k most probable tokens. For example, if top_k = 10, the model will only consider the 10 most probable tokens for its next prediction. A request body with these parameters added is sketched below.
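The following is a minimal sketch of the request body above with these sampling parameters added; the exact values, including the top_k setting, are illustrative rather than tuned recommendations.

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2047,
    "temperature": 0.2,  # low temperature for more deterministic, structured recipes
    "top_p": 0.9,        # nucleus sampling over the top 90% of probability mass
    "top_k": 250,        # illustrative value; limits sampling to the 250 most probable tokens
    "messages": [{"role": "user", "content": content}],
})

response = bedrock.invoke_model(body=body, modelId="anthropic.claude-3-sonnet-20240229-v1:0")

One of the key benefits of using Amazon Bedrock is the ability to use multiple FMs for different tasks within the same solution. In addition to generating textual recipes with Anthropic’s Claude 3, we can also dynamically generate visually appealing images to accompany those recipes. For this task, we chose to use the Stable Diffusion model available on Amazon Bedrock. Amazon Bedrock also offers other powerful image generation models such as Titan, and we’ve given you an example API call for that, too. Similar to using the Amazon Bedrock API to generate a response from Anthropic’s Claude 3, we use the following code: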

modelId = "stability.stable-diffusion-xl-v0"
accept = "application/json"
contentType = "application/json"

body = json.dumps({
    "text_prompts": [
        {
            "text": recipe_name
        }
    ],
    "cfg_scale": 10,
    "seed": 20,
    "steps": 50
})

response = brt.invoke_model(
    body=body,
    modelId=modelId,
    accept=accept,
    contentType=contentType
)

For Titan, you might use something like:

modelId = "amazon.titan-image-generator-v1"
accept = "application/json"
contentType = "application/json"

body = json.dumps({
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": prompt,  # Required
    },
    "imageGenerationConfig": {
        "numberOfImages": 1,   # Range: 1 to 5
        "quality": "premium",  # Options: standard or premium
        "height": 768,         # Supported height list in the docs
        "width": 1280,         # Supported width list in the docs
        "cfgScale": 7.5,       # Range: 1.0 (exclusive) to 10.0
        "seed": 42             # Range: 0 to 214783647
    }
})

response = brt.invoke_model(
    body=body,
    modelId=modelId,
    accept=accept,
    contentType=contentType
)

This returns a base64-encoded string that you need to decode so the image can be displayed in your frontend. For more information about other parameters that you can include in your API calls, see Stability.ai Diffusion 1.0 text to image, and Using Amazon Bedrock to generate images with Titan Image Generator models.
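A minimal sketch of decoding the Stable Diffusion response might look like the following; the output filename is a placeholder, and Titan Image Generator returns its images under an images key instead.

import base64
import json

# Minimal sketch: extract and decode the generated image from the Stable Diffusion
# response above. Titan Image Generator responses use response_body["images"][0] instead.
response_body = json.loads(response["body"].read())
image_b64 = response_body["artifacts"][0]["base64"]

with open("recipe_image.png", "wb") as f:  # placeholder filename
    f.write(base64.b64decode(image_b64))

In the following sections, you walk through the steps to deploy the solution in your AWS account.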
Prerequisites
You need an AWS account to deploy this solution. If you don’t have an existing account, you can sign up for one. The instructions in this post use the us-east-1 AWS Region. Make sure you deploy your resources in a Region with AWS Machine Learning services available. For the Lambda functions to run successfully, Lambda requires an AWS Identity and Access Management (IAM) role and policy with the appropriate permissions. Complete the necessary steps from Defining Lambda function permissions with an execution role to create and attach a Lambda execution role for the Lambda functions to access all necessary actions for DynamoDB, Amazon Rekognition, and Amazon Bedrock.
Create the Lambda function to detect ingredients
Complete the following steps to create your first Lambda function (DetectIngredients):

On the Lambda console, choose Functions in the navigation pane.
Choose Create Lambda function.
Choose Author from scratch.
Name your function DetectIngredients, select Python 3.12 for Runtime, and choose Create function.
For your Lambda configuration, choose lambdaDynamoRole for Execution role, increase Timeout to 8 seconds, verify the settings, and choose Save.
Replace the text in the Lambda function code with the following sample code and choose Save:

import json
import boto3
import inference  # helper module that calls Amazon Rekognition (detect_custom_labels)
import time

s3 = boto3.client('s3')

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TestDataTable')
table_name = 'TestDataTable'

def lambda_handler(event, context):
    clearTable()

    labels, label_count = inference.main()

    # The names array will contain a list of all the grocery ingredients detected
    # in the image
    names = []

    for label_dic in labels:
        name = label_dic['Name']
        # Getting rid of unnecessary parts of the label string
        if "Food" in name:
            # Remove "Food" from name
            name = name.replace("Food", "")
        if "In Fridge" in name:
            # Remove "In Fridge" from name
            name = name.replace("In Fridge", "")
        name = name.strip()

        names.append(name)

    # Loop through the list of grocery ingredients to construct a list called items.
    # The items list is used to batch write items when batch_write_all is called.
    items = []
    for name in names:
        if len(items) < 25:  # DynamoDB batch_write_item accepts up to 25 items per call
            items.append({
                'grocery_item': name
            })

    # Remove all duplicates from the list
    seen = set()
    unique_grocery_items = []
    for item in items:
        val = item['grocery_item'].lower().strip()
        if val not in seen:
            unique_grocery_items.append(item)
            seen.add(val)

    batch_write_all(unique_grocery_items)

    table.put_item(
        Item={
            'grocery_item': "DONE"
        })

def batch_write_all(items):
    batch_write_requests = [{
        'PutRequest': {
            'Item': item
        }
    } for item in items]

    response = dynamodb.batch_write_item(
        RequestItems={
            table_name: batch_write_requests
        }
    )

def clearTable():
    response = table.scan()
    with table.batch_writer() as batch:
        for each in response['Items']:
            batch.delete_item(
                Key={
                    'grocery_item': each['grocery_item']
                }
            )

Create a DynamoDB table to store ingredients
Complete the following steps to create your DynamoDB table.

On the DynamoDB console, choose Tables in the navigation pane.
Choose Create table.
For Table name, enter MasterGroceryDB.
For Partition key, use grocery_item (string).
Verify that all entries on the page are accurate, leave the rest of the settings as default, and choose Create.

Wait for the table creation to complete and for your table status to change to Active before proceeding to the next step.
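If you prefer to create the table programmatically instead of through the console, a minimal boto3 sketch is shown below. It mirrors the console steps above (same table name and partition key); the on-demand billing mode is a choice made here for simplicity:

import boto3

dynamodb = boto3.client("dynamodb")

# Same table name and partition key as in the console steps
dynamodb.create_table(
    TableName="MasterGroceryDB",
    AttributeDefinitions=[{"AttributeName": "grocery_item", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "grocery_item", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity; adjust if you prefer provisioned
)

# Wait until the table status is Active before writing to it
dynamodb.get_waiter("table_exists").wait(TableName="MasterGroceryDB")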
Create the Lambda function to call Amazon Bedrock
Complete the following steps to create another Lambda function that will call the Amazon Bedrock APIs to generate recipes:

On the Lambda console, choose Functions in the navigation pane.
Choose Create function.
Choose Author from scratch.
Name your function GenerateRecipes, choose Python 3.12 for Runtime, and choose Create function.
For your Lambda configuration, choose lambdaDynamoRole for Execution role, increase Timeout to 8 seconds, verify the settings, and choose Save.
Replace the text in the Lambda function code with the following sample code and choose Save:

import json
import boto3
import re
import base64
import image_gen  # custom helper module that calls the image generation model shown earlier

dynamodb = boto3.resource('dynamodb')

bedrock = boto3.client(service_name='bedrock-runtime')

def get_ingredients(tableName):
    table = dynamodb.Table(tableName)
    response = table.scan()
    data = response['Items']

    # Support for pagination
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])

    # Drop the sentinel item written by DetectIngredients
    data = [g_i for g_i in data if g_i['grocery_item'] != 'DONE']
    return data

# Converts DynamoDB grocery items into a comma-separated string
def convertItemsToString(grocery_dict):
    ingredients_list = []
    for each in grocery_dict:
        ingredients_list.append(each['grocery_item'])
    ingredients_list_str = ", ".join(ingredients_list)
    return ingredients_list_str

def read_prompt():
    with open('Prompt.md', 'r') as f:
        text = f.read()
    return text

# Gets the names of all the recipes generated
def get_recipe_names(response_body):
    recipe_names = []
    for i in range(len(response_body) - 2):
        if response_body[i] == '\n' and response_body[i + 1] == '\n' and response_body[i + 2] == 'R':
            recipe_str = ""
            while i + 2 < len(response_body) and response_body[i + 2] != '\n':
                recipe_str += response_body[i + 2]
                i += 1
            recipe_str = recipe_str.replace("Recipe", '')
            recipe_str = recipe_str.replace(": ", '')
            recipe_str = re.sub(r" \d+", "", recipe_str)
            recipe_names.append(recipe_str)
    return recipe_names

def lambda_handler(event, context):
    # Read the ingredients detected in the user's fridge and the grocery store catalog
    user_ingredients_dict = get_ingredients('TestDataTable')
    master_ingredients_dict = get_ingredients('MasterGroceryDB')

    # Get string values for ingredients in both tables
    user_ingredients_str = convertItemsToString(user_ingredients_dict)
    master_ingredients_str = convertItemsToString(master_ingredients_dict)

    # Read the prompt and combine it with the comma-separated ingredient lists
    prompt = read_prompt()

    content = ("Here is a list of ingredients that a person currently has. " + user_ingredients_str
               + "\n\n And here is a list of ingredients at a local grocery store: " + master_ingredients_str
               + prompt)

    body = json.dumps({
        "max_tokens": 2047,
        "messages": [{"role": "user", "content": content}],
        "anthropic_version": "bedrock-2023-05-31"
    })

    modelId = "anthropic.claude-3-sonnet-20240229-v1:0"

    response = bedrock.invoke_model(body=body, modelId=modelId)

    response_body = json.loads(response.get('body').read())
    response_body_content = response_body.get("content")
    response_body_completion = response_body_content[0]['text']

    recipe_names_list = get_recipe_names(response_body_completion)

    first_image_imgstr = image_gen.image_gen(recipe_names_list[0])
    second_image_imgstr = image_gen.image_gen(recipe_names_list[1])
    third_image_imgstr = image_gen.image_gen(recipe_names_list[2])

    return response_body_completion, first_image_imgstr, second_image_imgstr, third_image_imgstr
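If you want to apply the sampling guidance discussed earlier (low temperature, high top_p), you could extend the request body as in the following sketch. These are standard parameters of the Anthropic Messages API on Amazon Bedrock; the values shown are illustrative rather than tuned for this solution:

body = json.dumps({
    "max_tokens": 2047,
    "messages": [{"role": "user", "content": content}],
    "anthropic_version": "bedrock-2023-05-31",
    "temperature": 0.2,  # lower values make recipe generation more deterministic
    "top_p": 0.9         # sample only from the top 90% of the probability mass
})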

Create an S3 bucket to store the images
Lastly, you create an S3 bucket to store the images you upload, which automatically invokes the DetectIngredients Lambda function after each upload. Complete the following steps to create the bucket and configure the Lambda function:

On the Amazon S3 console, choose Buckets in the navigation pane.
Choose Create bucket.
Enter a unique bucket name, set the desired Region to us-east-1, and choose Create bucket.
On the Lambda console, navigate to the DetectIngredients function.
On the Configuration tab, choose Add trigger.
Select the trigger type as S3 and choose the bucket you created.
Set Event type to All object create events and choose Add.
On the Amazon S3 console, navigate to the bucket you created.
Under Properties and Event Notifications, choose Create event notification.
Enter an event name (for example, Trigger DetectIngredients) and set the events to All object create events.
For Destination, select Lambda Function and select the DetectIngredients Lambda function.
Choose Save.
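If you prefer to script this wiring instead of using the console, the following sketch shows the equivalent boto3 calls. The bucket name is a placeholder; Amazon S3 must be granted permission to invoke the function before the notification is added:

import boto3

s3 = boto3.client("s3")
lmb = boto3.client("lambda")

bucket = "your-unique-bucket-name"  # placeholder; use the bucket you created
fn_arn = lmb.get_function(FunctionName="DetectIngredients")["Configuration"]["FunctionArn"]

# Allow Amazon S3 to invoke the DetectIngredients function
lmb.add_permission(
    FunctionName="DetectIngredients",
    StatementId="s3-invoke-detect-ingredients",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket}",
)

# Invoke DetectIngredients on every object created in the bucket
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {"LambdaFunctionArn": fn_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)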

Conclusion
In this post, we explored the use of Amazon Rekognition and FMs on Amazon Bedrock with AWS services such as Lambda and DynamoDB to build a comprehensive solution that addresses food waste in the US. With the use of cutting-edge AWS services including Rekognition Custom Labels and content generation with models on Amazon Bedrock, this application provides value and proof of work for AWS generative AI capabilities.
Stay on the lookout for a follow-up to this post, where we demonstrate using the multi-modal capabilities of FMs such as Anthropic’s Claude v3.1 on Amazon Bedrock to deploy this entire solution end-to-end.
Although we highlighted a food waste use case in this post, we urge you to apply your own use case to this solution. The flexibility of this architecture allows you to adapt these services to multiple scenarios, enabling you to solve a wide range of challenges.
Special thanks to Tommy Xie and Arnav Verma for their contributions to the blog.

About the Authors
Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.
Michael Lue is a Sr. Solution Architect at AWS Canada based out of Toronto. He works with Canadian enterprise customers to accelerate their business through optimization, innovation, and modernization. He is particularly passionate and curious about disruptive technologies like containers and AI/ML. In his spare time, he coaches and plays tennis and enjoys hanging at the beach with his French Bulldog, Marleé.
Vineet Kachhawaha is a Solutions Architect at AWS with expertise in machine learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.

Revolutionizing Code Generation: µCODE’s Single-Step Approach to Multi-Turn Feedback

Generating code with execution feedback is difficult because errors often require multiple rounds of correction, and fixing them in a structured way is not simple. Training models to learn from execution feedback is necessary, but existing approaches face challenges. Some methods attempt to correct errors in a single step but fail when multiple refinements are needed. Others use complex reinforcement learning techniques to optimize long-term improvements, yet these methods struggle with weak learning signals, making training slow and inefficient. The lack of an effective method for handling iterative corrections results in unstable learning and poor performance.

Currently, prompting-based systems try to solve multi-step tasks using self-debugging, test generation, and reflection, but they improve results only slightly. Some methods train reward models, like CodeRL for fixing errors and ARCHER for structured decision-making, while others use Monte Carlo Tree Search (MCTS) but require too much computation. Verifier-based approaches, like “Let’s Verify Step by Step” and AlphaCode, help find mistakes or create test cases, but some models rely only on syntax checks, which are not enough for proper training. Score limits training steps, and RISE uses complex corrections, making learning inefficient. Fine-tuned agents like FireAct and LEAP, and feedback-based models like RL4VLM and GLAM, try to improve performance. However, current techniques either fail to refine code properly over multiple steps or are too unstable and inefficient.

To mitigate these issues, researchers proposed µCODE, a multi-turn code generation method that improves using execution feedback. Existing approaches face challenges with execution errors and reinforcement learning complexity, but µCODE overcomes these by following an expert iteration framework with a local search expert. A verifier assesses code quality, while a generator learns from the best solutions, refining its output over multiple iterations. During inference, a Best-of-N search strategy helps generate and improve code based on execution results, ensuring better performance.

The framework first trains a verifier through supervised learning to score code snippets, making evaluations more reliable. Binary Cross-Entropy predicts correctness, while Bradley-Terry ranks solutions for better selection. The generator then learns iteratively by relabeling past outputs with expert-selected solutions, improving accuracy. Multiple solutions are produced at inference, and the verifier selects the best, refining outputs until all tests pass. By treating code generation as an imitation learning problem, µCODE eliminates complex exploration and enables efficient optimization.
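To make the inference-time loop concrete, the following sketch illustrates Best-of-N selection with a verifier over multiple turns. It is an illustration of the idea, not the authors’ code; generate, verifier_score, and run_tests are hypothetical helpers standing in for the trained generator, the learned verifier, and the execution environment:

# Illustrative Best-of-N refinement loop (not the µCODE authors' implementation).
# generate(), verifier_score(), and run_tests() are hypothetical helpers.
def best_of_n_refine(problem, n=8, max_turns=4):
    feedback = ""
    best = None
    for _ in range(max_turns):
        # Sample N candidate programs conditioned on the problem and prior feedback
        candidates = [generate(problem, feedback) for _ in range(n)]
        # The learned verifier scores each candidate; keep the highest-scoring one
        best = max(candidates, key=lambda code: verifier_score(problem, code))
        passed, feedback = run_tests(best)
        if passed:  # stop refining once all tests pass
            return best
    return best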

Researchers evaluated µCODE’s effectiveness by comparing it with state-of-the-art methods, analyzing the impact of the learned verifier during training and inference, and assessing different loss functions for verifier training. The generator was initialized using Llama models, and experiments were conducted on MBPP and HumanEval datasets. The training was performed on MBPP’s training set, with evaluations on its test set and HumanEval. Comparisons included single-turn and multi-turn baselines such as STaR and Multi–STaR, where fine-tuning was based on correctly generated solutions. Performance was measured using Best-of-N (BoN) accuracy, with the verifier ranking candidate solutions at each turn.

Results indicated that multi-turn approaches performed better than single-turn methods, highlighting the benefits of execution feedback. µCODE outperformed Multi-STaR, achieving a 1.9% improvement on HumanEval with a 1B model. BoN search further enhanced performance, with µCODE showing a 12.8% gain over greedy decoding. The learned verifier (LV) improved training outcomes, surpassing the use of oracle verifiers (OV) alone. Further analysis showed that the learned verifier helped select better solutions during inference, particularly in the absence of public tests. Inference-time scaling revealed diminishing performance gains beyond a certain number of candidate solutions. A hierarchical verification strategy (PT+LV), which integrates public test results with learned verifier scores, provided the highest performance, showing the effectiveness of the verifier in eliminating erroneous solutions and making iterative predictions.

In conclusion, the proposed µCODE framework provides a scalable approach to multi-turn code generation using single-step rewards and a learned verifier for iterative improvement. Results indicate µCODE performs better than oracle-based approaches, producing more precise code. Though constrained by model size, dataset size, and Python focus, it can be a solid baseline for future work. Expanding training data, scaling to larger models, and applying it to multiple programming languages can further enhance its effectiveness.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Visual Studio Code Setup Guide

Visual Studio Code (VSCode) is a lightweight but powerful source code editor that runs on your desktop. It comes with built-in support for JavaScript, TypeScript, and Node.js and has a rich ecosystem of extensions for other languages and tools.

Table of Contents

Installation

First Launch and Interface Overview

Essential Settings

Extensions

Workspace Setup

Keyboard Shortcuts

Integrated Terminal

Source Control Integration

Debugging

Additional Resources

Installation

Windows

Visit the VSCode download page

Click on the Windows download button

Run the installer (.exe file)

Follow the installation wizard

Check the options to:

Create a desktop icon

Add “Open with Code” to Windows Explorer context menu

Register Code as an editor for supported file types

macOS

Visit the VSCode download page

Click on the Mac download button

Open the downloaded .zip file

Drag Visual Studio Code.app to the Applications folder

Optional: Add VSCode to your Dock

Linux

Visit the VSCode download page

Choose the appropriate package for your distribution (.deb, .rpm, etc.)

For Debian/Ubuntu: install the downloaded .deb package, for example with sudo apt install ./<file>.deb

For Red Hat/Fedora: install the downloaded .rpm package, for example with sudo dnf install <file>.rpm

First Launch and Interface Overview

When you first open VSCode, you’ll see:

Welcome Page: Contains quick links to common commands and recent projects

Activity Bar: Left sidebar with icons for different views:

Explorer: File browser

Search: Find and replace

Source Control: Git integration

Run and Debug: Debug panel

Extensions: Manage extensions

Status Bar: Bottom bar showing information about the current file and editor

Editor Area: Main coding area (can be split into multiple editors)

Panel: Bottom panel that can show terminal, output, problems, etc.

Essential Settings

Access settings by:

Windows/Linux: File > Preferences > Settings

macOS: Code > Preferences > Settings

Recommended settings to consider:

Theme:

File > Preferences > Color Theme (or Ctrl+K Ctrl+T)

Popular choices: Dark+, Light+, Monokai, Solarized

Font: adjust editor.fontFamily and editor.fontSize in Settings

Auto Save: enable files.autoSave (for example, afterDelay)

Tab Size: set editor.tabSize (commonly 2 or 4)

Formatting: enable editor.formatOnSave to format files automatically on save

Extensions

VSCode’s power comes from its extensions. To install extensions:

Click the Extensions icon in the Activity Bar (or press Ctrl+Shift+X)

Search for extensions by name

Click Install

Essential extensions by category:

General

Prettier – Code formatter: Consistent code formatting

ESLint: JavaScript linting

EditorConfig: Maintain consistent coding styles

Languages

Python: Full Python support

C/C++: C and C++ intellisense, debugging

Java Extension Pack: Java development tools

JavaScript (ES6) code snippets: Snippets for JavaScript

Themes

Material Theme: Popular theme pack

One Dark Pro: Atom’s iconic theme

Productivity

GitLens: Supercharge Git capabilities

Live Share: Collaborative editing

Path Intellisense: Autocomplete filenames

Workspace Setup

A workspace in VSCode represents one or more folders that are opened in an editor window.

Open a folder: File > Open Folder (Ctrl+K Ctrl+O)

Save workspace: File > Save Workspace As…

Workspace settings: Create a .vscode folder in your project with:

settings.json: Project-specific settings

launch.json: Debugging configurations

tasks.json: Build task configurations

extensions.json: Recommended extensions

Example settings.json for a JavaScript project:
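A representative example is shown below (illustrative only; it assumes the Prettier and ESLint extensions listed earlier are installed):

{
  "editor.tabSize": 2,
  "editor.formatOnSave": true,
  "editor.defaultFormatter": "esbenp.prettier-vscode",
  "files.autoSave": "afterDelay",
  "eslint.validate": ["javascript"]
}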

Keyboard Shortcuts

VSCode has many keyboard shortcuts. Here are the most essential ones:

Ctrl+P: Quick Open, Go to File

Ctrl+Shift+P: Show Command Palette

Ctrl+Space: Trigger suggestion

F12: Go to Definition

Alt+F12: Peek Definition

Shift+Alt+F: Format document

F5: Start debugging

Ctrl+`: Toggle terminal

Ctrl+B: Toggle sidebar

Ctrl+/: Toggle line comment

Ctrl+K Ctrl+C: Add line comment

Ctrl+K Ctrl+U: Remove line comment

Integrated Terminal

VSCode includes an integrated terminal:

Open terminal: View > Terminal

Multiple terminals: Click the plus icon

Switch terminals: Use the dropdown

Source Control Integration

VSCode has built-in Git support:

Initialize repository: Click the Source Control icon and “Initialize Repository”

Stage changes: Click the + next to modified files

Commit changes: Enter a message and press Ctrl+Enter

Push/Pull: Use the ellipsis menu (…) for additional Git commands

Visual diff: Click on a modified file to see changes

Debugging

Set up debugging for your project:

Create a launch configuration:

Click the Run and Debug icon

Click “create a launch.json file”

Select your environment

Set breakpoints: Click in the gutter next to line numbers

Start debugging: Press F5

Use debug controls: Continue, Step Over, Step Into, Step Out

Watch variables: Add expressions to the Watch panel

Additional Resources

Official Documentation

VSCode Tips and Tricks

VSCode YouTube Channel

VSCode Keyboard Shortcuts Reference


Understanding Generalization in Deep Learning: Beyond the Mysteries

Deep neural networks’ seemingly anomalous generalization behaviors (benign overfitting, double descent, and successful overparametrization) are neither unique to neural networks nor inherently mysterious. These phenomena can be understood through established frameworks such as PAC-Bayes and countable hypothesis bounds. A researcher from New York University presents “soft inductive biases” as a key unifying principle for explaining these phenomena: rather than restricting the hypothesis space, this approach embraces flexibility while maintaining a preference for simpler solutions consistent with the data. The principle applies across various model classes, showing that deep learning isn’t fundamentally different from other approaches, although deep learning remains distinctive in specific respects.

Inductive biases traditionally function as restriction biases that constrain hypothesis space to improve generalization, allowing data to eliminate inappropriate solutions. Convolutional neural networks exemplify this approach by imposing hard constraints like locality and translation equivariance on MLPs through parameter removal and sharing. Soft inductive biases represent a broader principle where certain solutions are preferred without eliminating alternatives that fit the data equally well. Unlike restriction biases with their hard constraints, soft biases guide rather than limit the hypothesis space. These biases influence the training process through mechanisms like regularization and Bayesian priors over parameters.

Embracing flexible hypothesis spaces accommodates complex real-world data structures, but it requires a prior bias toward certain solutions to ensure good generalization. Despite challenging conventional wisdom around overfitting and metrics like Rademacher complexity, phenomena like overparametrization align with an intuitive understanding of generalization. These phenomena can be characterized through long-established frameworks, including PAC-Bayes and countable hypothesis bounds, and the concept of effective dimensionality provides additional intuition for understanding them. Frameworks that have shaped conventional generalization wisdom often fail to explain these phenomena, highlighting the value of established alternative methods for understanding modern machine learning’s generalization properties.

Benign overfitting describes models’ ability to perfectly fit noise while still generalizing well on structured data, showing that the capacity to overfit doesn’t necessarily lead to poor generalization on meaningful problems. Convolutional neural networks could fit random image labels while maintaining strong performance on structured image recognition tasks. This behavior contradicts established generalization frameworks like VC dimension and Rademacher complexity, with the author claiming that no existing formal measure could explain these models’ simplicity despite their enormous size. Benign overfitting has also been described as “one of the key mysteries uncovered by deep learning.” However, it isn’t unique to neural networks and can be reproduced across various model classes.

Double descent refers to generalization error that decreases, increases, and then decreases again as model parameters increase. The initial pattern follows the “classical regime,” where models capture useful structure but eventually overfit. The second descent occurs in the “modern interpolating regime,” after training loss approaches zero. The paper demonstrates double descent for both a ResNet-18 (cross-entropy loss on CIFAR-100 as the width of each layer increases) and a linear model. As layer width increases in the ResNet, or as parameters increase in the linear model, both follow a similar pattern: effective dimensionality rises until it reaches the interpolation threshold, then decreases as generalization improves. This phenomenon can be formally tracked using PAC-Bayes bounds.
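Because double descent is not specific to neural networks, it can be reproduced in a few lines with a linear model. The following self-contained sketch (an illustration, not the paper’s experiment) fits minimum-norm least squares on random ReLU features; test error typically peaks near the interpolation threshold (number of features roughly equal to the number of training points) and falls again as the feature count grows:

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

# Shared random projection so feature sets are nested as p grows
W = rng.normal(size=(d, 2000))

for p in [10, 50, 90, 100, 110, 200, 500, 1000, 2000]:
    phi_train = np.maximum(X_train @ W[:, :p], 0)  # random ReLU features
    phi_test = np.maximum(X_test @ W[:, :p], 0)
    # Minimum-norm least-squares fit; interpolates the training set once p >= n_train
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"p={p:4d}  test MSE={test_mse:.3f}")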

In conclusion, overparametrization, benign overfitting, and double descent are intriguing phenomena that deserve continued study. However, contrary to widespread belief, these behaviors align with established generalization frameworks, can be reproduced in non-neural models, and can be understood intuitively. This understanding should help bridge diverse research communities and prevent valuable perspectives and frameworks from being overlooked. Other phenomena, like grokking and scaling laws, aren’t presented as evidence for rethinking generalization frameworks or as neural network-specific; recent research confirms that they also arise in linear models. Moreover, PAC-Bayes and countable hypothesis bounds remain consistent with large language models.

Check out the Paper. All credit for this research goes to the researchers of this project.

Transforming financial analysis with CreditAI on Amazon Bedrock: Octus …

Investment professionals face the mounting challenge of processing vast amounts of data to make timely, informed decisions. The traditional approach of manually sifting through countless research documents, industry reports, and financial statements is not only time-consuming but can also lead to missed opportunities and incomplete analysis. This challenge is particularly acute in credit markets, where the complexity of information and the need for quick, accurate insights directly impacts investment outcomes. Financial institutions need a solution that can not only aggregate and process large volumes of data but also deliver actionable intelligence in a conversational, user-friendly format. The intersection of AI and financial analysis presents a compelling opportunity to transform how investment professionals access and use credit intelligence, leading to more efficient decision-making processes and better risk management outcomes.
Founded in 2013, Octus, formerly Reorg, is the essential credit intelligence and data provider for the world’s leading buy side firms, investment banks, law firms and advisory firms. By surrounding unparalleled human expertise with proven technology, data and AI tools, Octus unlocks powerful truths that fuel decisive action across financial markets. Visit octus.com to learn how we deliver rigorously verified intelligence at speed and create a complete picture for professionals across the entire credit lifecycle. Follow Octus on LinkedIn and X.
Using advanced GenAI, CreditAI by Octus is a flagship conversational chatbot that supports natural language queries and real-time data access with source attribution, significantly reducing analysis time and streamlining research workflows. It gives instant access to insights on over 10,000 companies from hundreds of thousands of proprietary intel articles, helping financial institutions make informed credit decisions while effectively managing risk. Key features include chat history management, being able to ask questions that are targeted to a specific company or more broadly to a sector, and getting suggestions on follow-up questions.
In this post, we demonstrate how Octus migrated its flagship product, CreditAI, to Amazon Bedrock, transforming how investment professionals access and analyze credit intelligence. We walk through the journey Octus took from managing multiple cloud providers and costly GPU instances to implementing a streamlined, cost-effective solution using AWS services including Amazon Bedrock, AWS Fargate, and Amazon OpenSearch Service. We share detailed insights into the architecture decisions, implementation strategies, security best practices, and key learnings that enabled Octus to maintain zero downtime while significantly improving the application’s performance and scalability.
Opportunities for innovation
CreditAI by Octus version 1.x uses Retrieval Augmented Generation (RAG). It was built using a combination of in-house and external cloud services on Microsoft Azure for large language models (LLMs), Pinecone for vectorized databases, and Amazon Elastic Compute Cloud (Amazon EC2) for embeddings. Based on our operational experience, and as we started scaling up, we realized that there were several operational inefficiencies and opportunities for improvement:

Our in-house services for embeddings (deployed on EC2 instances) were not as scalable and reliable as needed. They also required more time on operational maintenance than our team could spare.
The overall solution was incurring high operational costs, especially due to the use of on-demand GPU instances. The real-time nature of our application meant that Spot Instances were not an option. Additionally, our investigation of lower-cost CPU-based instances revealed that they couldn’t meet our latency requirements.
The use of multiple external cloud providers complicated DevOps, support, and budgeting.

These operational inefficiencies meant that we had to revisit our solution architecture. It became apparent that a more cost-effective solution for our generative AI needs was required. Enter Amazon Bedrock Knowledge Bases. With knowledge bases that simplify RAG operations, vector search through the integration with OpenSearch Service, multi-tenant embeddings, and Anthropic’s Claude suite of LLMs, Amazon Bedrock was a compelling choice for the new solution architecture. It also simplified operations, because Octus is already largely an AWS shop. However, we were still curious about how we would approach this migration, and whether there would be any downtime through the transition.
Strategic requirements
To help us move forward systematically, Octus identified the following key requirements to guide the migration to Amazon Bedrock:

Scalability – A crucial requirement was the need to scale operations from handling hundreds of thousands of documents to millions of documents. A significant challenge in the previous system was the slow (and relatively unreliable) process of embedding new documents into vector databases, which created bottlenecks in scaling operations.
Cost-efficiency and infrastructure optimization – CreditAI 1.x, though performant, was incurring high infrastructure costs due to the use of GPU-based, single-tenant services for embeddings and reranking. We needed multi-tenant alternatives that were much cheaper while enabling elasticity and scale.
Response performance and latency – The success of generative AI-based applications depends on the response quality and speed. Given our user base, it’s important that our responses are accurate while valuing users’ time (low latency). This is a challenge when the data size and complexity grow. We want to balance spatial and temporal retrieval in order to give responses that have the best answer and context relevance, especially when we get large quantities of data updated every day.
Zero downtime – CreditAI is in production and we could not afford any downtime during this migration.
Technological agility and innovation – In the rapidly evolving AI landscape, Octus recognized the importance of maintaining technological competitiveness. We wanted to move away from in-house development and feature maintenance such as embeddings services, rerankers, guardrails, and RAG evaluators. This would allow Octus to focus on product innovation and faster feature deployment.
Operational consolidation and reliability – Octus’s goal is to consolidate cloud providers, and to reduce support overheads and operational complexity.

Migration to Amazon Bedrock and addressing our requirements
Migrating to Amazon Bedrock addressed our aforementioned requirements in the following ways:

Scalability – The architecture of Amazon Bedrock, combined with AWS Fargate for Amazon ECS, Amazon Textract, and AWS Lambda, provided the elastic and scalable infrastructure necessary for this expansion while maintaining performance, data integrity, compliance, and security standards. The solution’s efficient document processing and embedding capabilities addressed the previous system’s limitations, enabling faster and more efficient knowledge base updates.
Cost-efficiency and infrastructure optimization – By migrating to Amazon Bedrock multi-tenant embedding, Octus achieved significant cost reduction while maintaining performance standards through Anthropic’s Claude Sonnet and improved embedding capabilities. This move alleviated the need for GPU-instance-based services in favor of more cost-effective and serverless Amazon ECS and Fargate solutions.
Response performance and latency – Octus verified the quality and latency of responses from Anthropic’s Claude Sonnet to confirm that response accuracy and latency were maintained (or even improved) as part of this migration. With this LLM, CreditAI was now able to respond better to broader, industry-wide queries than before.
Zero downtime – We were able to achieve zero downtime migration to Amazon Bedrock for our application using our in-house centralized infrastructure frameworks. Our frameworks comprise infrastructure as code (IaC) through Terraform, continuous integration and delivery (CI/CD), SOC2 security, monitoring, observability, and alerting for our infrastructure and applications.
Technological agility and innovation – Amazon Bedrock emerged as an ideal partner, offering solutions specifically designed for AI application development. Amazon Bedrock built-in features, such as embeddings services, reranking, guardrails, and the upcoming RAG evaluator, alleviated the need for in-house development of these components, allowing Octus to focus on product innovation and faster feature deployment.
Operational consolidation and reliability – The comprehensive suite of AWS services offers a streamlined framework that simplifies operations while providing high availability and reliability. This consolidation minimizes the complexity of managing multiple cloud providers and creates a more cohesive technological ecosystem. It also enables economies of scale with development velocity given that over 75 engineers at Octus already use AWS services for application development.

In addition, the Amazon Bedrock Knowledge Bases team worked closely with us to address several critical elements, including expanding embedding limits, managing the metadata limit (250 characters), testing different chunking methods, and syncing throughput to the knowledge base.
In the following sections, we explore our solution and how we addressed the details around the migration to Amazon Bedrock and Fargate.
Solution overview
The following figure illustrates our system architecture for CreditAI on AWS, with two key paths: the document ingestion and content extraction workflow, and the Q&A workflow for live user query response.

In the following sections, we dive into crucial details within key components in our solution. In each case, we connect them to the requirements discussed earlier for readability.
The document ingestion workflow (numbered in blue in the preceding diagram) processes content through five distinct stages:

Documents uploaded to Amazon Simple Storage Service (Amazon S3) automatically invoke Lambda functions through S3 Event Notifications. This event-driven architecture provides immediate processing of new documents.
Lambda functions process the event payload containing document location, perform format validation, and prepare content for extraction. This includes file type verification, size validation, and metadata extraction before routing to Amazon Textract.
Amazon Textract processes the documents to extract both text and structural information. This service handles various formats, including PDFs, images, and forms, while preserving document layout and relationships between content elements.
The extracted content is stored in a dedicated S3 prefix, separate from the source documents, maintaining clear data lineage. Each processed document maintains references to its source file, extraction timestamp, and processing metadata.
The extracted content flows into Amazon Bedrock Knowledge Bases, where our semantic chunking strategy is implemented to divide content into optimal segments. The system then generates embeddings for each chunk and stores these vectors in OpenSearch Service for efficient retrieval. Throughout this process, the system maintains comprehensive metadata to support downstream filtering and source attribution requirements.
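To make stages 1 through 4 concrete, here is a highly simplified sketch of an event-driven extraction Lambda function. Octus’s production code is not shown in the post, so the names, the synchronous Textract call, and the output prefix are illustrative assumptions (large multi-page PDFs would instead use the asynchronous StartDocumentTextDetection API):

import json
import urllib.parse
import boto3

s3 = boto3.client("s3")
textract = boto3.client("textract")

def lambda_handler(event, context):
    # Stage 2: read the document location from the S3 event payload
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Stage 3: extract text with Amazon Textract (synchronous API for small documents)
    result = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    text = "\n".join(
        block["Text"] for block in result["Blocks"] if block["BlockType"] == "LINE"
    )

    # Stage 4: store extracted content under a separate prefix, preserving lineage to the source key
    s3.put_object(Bucket=bucket, Key=f"extracted/{key}.txt", Body=text.encode("utf-8"))
    return {"statusCode": 200}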

The Q&A workflow (numbered in yellow in the preceding diagram) processes user interactions through six integrated stages:

The web application, hosted on AWS Fargate, handles user interactions and query inputs, managing initial request validation before routing queries to appropriate processing services.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) serves as the streaming service, providing reliable inter-service communication while maintaining message ordering and high-throughput processing for query handling.
The Q&A handler, running on AWS Fargate, orchestrates the complete query response cycle by coordinating between services and processing responses through the LLM pipeline.
The pipeline integrates with Amazon Bedrock foundation models through these components:

Cohere Embeddings model performs vector transformations of the input.
Amazon OpenSearch Service manages vector embeddings and performs similarity searches.
Amazon Bedrock Knowledge Bases provides efficient access to the document repository.

Amazon Bedrock Guardrails implements content filtering and safety checks as part of the query processing pipeline.
Anthropic Claude LLM performs the natural language processing, generating responses that are then returned to the web application.

This integrated workflow provides efficient query processing while maintaining response quality and system reliability.
For scalability: Using OpenSearch Service as our vector database
Amazon OpenSearch Serverless emerged as the optimal solution for CreditAI’s evolving requirements, offering advanced capabilities while maintaining seamless integration within the AWS ecosystem:

Vector search capabilities – OpenSearch Serverless provides robust built-in vector search capabilities essential for our needs. The service supports hybrid search, allowing us to combine vector embeddings with raw text search without modifying our embedding model. This capability proved crucial for enabling broader question support in CreditAI 2.x, enhancing its overall usability and flexibility.
Serverless architecture benefits – The serverless design alleviates the need to provision, configure, or tune infrastructure, significantly reducing operational complexities. This shift allows our team to focus more time and resources on feature development and application improvements rather than managing underlying infrastructure.
AWS integration advantages – The tight integration with other AWS services, particularly Amazon S3 and Amazon Bedrock, streamlines our content ingestion process. This built-in compatibility provides a cohesive and scalable landscape for future enhancements while maintaining optimal performance.

OpenSearch Serverless enabled us to scale our vector search capabilities efficiently while minimizing operational overhead and maintaining high performance standards.
For scalability and security: Splitting data across multiple vector databases with in-house support for intricate permissions
To enhance scalability and security, we implemented isolated knowledge bases (each corresponding to a vector database) for each client’s data. Although this approach slightly increases costs, it delivers significant benefits. Primarily, it maintains complete isolation of client data, providing enhanced privacy and security. Thanks to Amazon Bedrock Knowledge Bases, this solution doesn’t compromise on performance: it enables concurrent embedding and synchronization across multiple knowledge bases, allowing us to maintain real-time updates without delays, something unattainable with our previous GPU-based architecture.
Additionally, we introduced two in-house services within Octus to strengthen this system:

AuthZ access management service – This service enforces granular access control, making sure users and applications can only interact with the data they are authorized to access. We had to migrate our AuthZ backend from Airbyte to native SQL replication so that it can support access management in near real time at scale.
Global identifiers service – This service provides a unified framework to link identifiers across multiple domains, enabling seamless integration and cross-referencing of identifiers across multiple datasets.

Together, these enhancements create a robust, secure, and highly efficient environment for managing and accessing client data.
For cost efficiency: Adopting a multi-tenant embedding service
In our migration to Amazon Bedrock Knowledge Bases, Octus made a strategic shift from using an open-source embedding service on EC2 instances to using the managed embedding capabilities of Amazon Bedrock through Cohere’s multilingual model. This transition was carefully evaluated based on several key factors.
Our selection of Cohere’s multilingual model was driven by two primary advantages. First, it demonstrated superior retrieval performance in our comparative testing. Second, it offered robust multilingual support capabilities that were essential for our global operations.
The technical benefits of this migration manifested in two distinct areas: document embedding and message embedding. In document embedding, we transitioned from a CPU-based system to Amazon Bedrock Knowledge Bases, which enabled faster and higher throughput document processing through its multi-tenant architecture. For message embedding, we alleviated our dependency on dedicated GPU instances while maintaining optimal performance with 20–30 millisecond embedding times. The Amazon Bedrock Knowledge Bases API also simplified our operations by combining embedding and retrieval functionality into a single API call.
The migration to Amazon Bedrock Knowledge Bases managed embedding delivered two significant advantages: it eliminated the operational overhead of maintaining our own open-source solution while providing access to industry-leading embedding capabilities through Cohere’s model. This helped us achieve both our cost-efficiency and performance objectives without compromises.
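As an illustration of that single call, the following sketch uses the Retrieve API from the Amazon Bedrock Agents runtime, which embeds the query and returns the top matching chunks in one request. The knowledge base ID and query text are placeholders, not Octus’s configuration:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Embed the user query and retrieve the most relevant chunks in one call
response = agent_runtime.retrieve(
    knowledgeBaseId="KB12345678",  # placeholder knowledge base ID
    retrievalQuery={"text": "What are the latest developments for ACME Corp's senior notes?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

for result in response["retrievalResults"]:
    # Each result carries the chunk text, a relevance score, and source metadata
    print(result["content"]["text"][:200], result.get("score"))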
For cost-efficiency and response performance: Choice of chunking strategy
Our primary goal was to improve three critical aspects of CreditAI’s responses: quality (accuracy of information), groundedness (ability to trace responses back to source documents), and relevance (providing information that directly answers user queries). To achieve this, we tested three different approaches to breaking down documents into smaller pieces (chunks):

Fixed chunking – Breaking text into fixed-length pieces
Semantic chunking – Breaking text based on natural semantic boundaries like paragraphs, sections, or complete thoughts
Hierarchical chunking – Creating a two-level structure with smaller child chunks for precise matching and larger parent chunks for contextual understanding

Our testing showed that both semantic and hierarchical chunking performed significantly better than fixed chunking in retrieving relevant information. However, each approach came with its own technical considerations.
Hierarchical chunking requires a larger chunk size to maintain comprehensive context during retrieval. This approach creates a two-level structure: smaller child chunks for precise matching and larger parent chunks for contextual understanding. During retrieval, the system first identifies relevant child chunks and then automatically includes their parent chunks to provide broader context. Although this method optimizes both search precision and context preservation, we couldn’t implement it with our preferred Cohere embeddings because they only support chunks up to 512 tokens, which is insufficient for the parent chunks needed to maintain effective hierarchical relationships.
Semantic chunking uses LLMs to intelligently divide text by analyzing both semantic similarity and natural language structures. Instead of arbitrary splits, the system identifies logical break points by calculating embedding-based similarity scores between sentences and paragraphs, making sure semantically related content stays together. The resulting chunks maintain context integrity by considering both linguistic features (like sentence and paragraph boundaries) and semantic coherence, though this precision comes at the cost of additional computational resources for LLM analysis and embedding calculations.
After evaluating our options, we chose semantic chunking despite two trade-offs:

It requires additional processing by our LLMs, which increases costs
It has a limit of 1,000,000 tokens per document processing batch

We made this choice because semantic chunking offered the best balance between implementation simplicity and retrieval performance. Although hierarchical chunking showed promise, it would have been more complex to implement and harder to scale. This decision helped us maintain high-quality, grounded, and relevant responses while keeping our system manageable and efficient.
For response performance and technical agility: Adopting Amazon Bedrock Guardrails with Amazon Bedrock Knowledge Bases
Our implementation of Amazon Bedrock Guardrails focused on three key objectives: enhancing response security, optimizing performance, and simplifying guardrail management. This service plays a crucial role in making sure our responses are both safe and efficient.
Amazon Bedrock Guardrails provides a comprehensive framework for content filtering and response moderation. The system works by evaluating content against predefined rules before the LLM processes it, helping prevent inappropriate content and maintaining response quality. Through the Amazon Bedrock Guardrails integration with Amazon Bedrock Knowledge Bases, we can configure, test, and iterate on our guardrails without writing complex code.
We achieved significant technical improvements in three areas:

Simplified moderation framework – Instead of managing multiple separate denied topics, we consolidated our content filtering into a unified guardrail service. This approach allows us to maintain a single source of truth for content moderation rules, with support for customizable sample phrases that help fine-tune our filtering accuracy.
Performance optimization – We improved system performance by integrating guardrail checks directly into our main prompts, rather than running them as separate operations. This optimization reduced our token usage and minimized unnecessary API calls, resulting in lower latency for each query.
Enhanced content control – The service provides configurable thresholds for filtering potentially harmful content and includes built-in capabilities for detecting hallucinations and assessing response relevance. This alleviated our dependency on external services like TruLens while maintaining robust content quality controls.

These improvements have helped us maintain high response quality while reducing both operational complexity and processing overhead. The integration with Amazon Bedrock has given us a more streamlined and efficient approach to content moderation.
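At the API level, attaching a guardrail to a model invocation is a matter of passing its identifier and version. The following sketch shows the general mechanism with the Amazon Bedrock runtime; the guardrail ID, version, and prompt are placeholders rather than Octus’s configuration:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Guardrail ID and version are placeholders; create the guardrail on the Amazon Bedrock console first
response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    guardrailIdentifier="gr-abc123",
    guardrailVersion="1",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Summarize ACME Corp's restructuring."}],
    }),
)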
To achieve zero downtime: Infrastructure migration
Our migration to Amazon Bedrock required careful planning to provide uninterrupted service for CreditAI while significantly reducing infrastructure costs. We achieved this through our comprehensive infrastructure framework that addresses deployment, security, and monitoring needs:

IaC implementation – We used reusable Terraform modules to manage our infrastructure consistently across environments. These modules enabled us to share configurations efficiently between services and projects. Our approach supports multi-Region deployments with minimal configuration changes while maintaining infrastructure version control alongside application code.
Automated deployment strategy – Our GitOps-embedded framework streamlines the deployment process by implementing a clear branching strategy for different environments. This automation handles CreditAI component deployments through CI/CD pipelines, reducing human error through automated validation and testing. The system also enables rapid rollback capabilities if needed.
Security and compliance – To maintain SOC2 compliance and robust security, our framework incorporates comprehensive access management controls and data encryption at rest and in transit. We follow network security best practices, conduct regular security audits and monitoring, and run automated compliance checks in the deployment pipeline.

We maintained zero downtime during the entire migration process while reducing infrastructure costs by 70% by eliminating GPU instances. The successful transition from Amazon ECS on Amazon EC2 to Amazon ECS with Fargate has simplified our infrastructure management and monitoring.
Achieving excellence
CreditAI’s migration to Amazon Bedrock has yielded remarkable results for Octus:

Scalability – We have almost doubled the number of documents available for Q&A across three environments in days instead of weeks. Our use of Amazon ECS with Fargate with auto scaling rules and controls gives us elastic scalability for our services during peak usage hours.
Cost-efficiency and infrastructure optimization – By moving away from GPU-based clusters to Fargate, our monthly infrastructure costs are now 78.47% lower, and our per-question costs have reduced by 87.6%.
Response performance and latency – There has been no drop in latency, and we have seen a 27% increase in questions answered successfully. We have also seen a 250% boost in user engagement. Users especially love our support for broad, industry-wide questions enabled by Anthropic’s Claude Sonnet.
Zero downtime – We experienced zero downtime during migration and 99% uptime overall for the whole application.
Technological agility and innovation – We have been able to add new document sources in a quarter of the time it took pre-migration. In addition, we adopted the enhanced guardrails support at no additional cost and no longer have to retrieve documents from the knowledge base and pass the chunks to Anthropic’s Claude Sonnet ourselves to trigger a guardrail.
Operational consolidation and reliability – Post-migration, our DevOps and SRE teams see 20% less maintenance burden and overheads. Supporting SOC2 compliance is also straightforward now that we’re using only one cloud provider.

Operational monitoring
We use Datadog to monitor both LLM latency and our document ingestion pipeline, providing real-time visibility into system performance. The following screenshot showcases how we use custom Datadog dashboards to provide a live view of the document ingestion pipeline. This visualization offers both a high-level overview and detailed insights into the ingestion process, helping us understand the volume, format, and status of the documents processed. The bottom half of the dashboard presents a time-series view of document processing volumes. The timeline tracks fluctuations in processing rates, identifies peak activity periods, and provides actionable insights to optimize throughput. This detailed monitoring system enables us to maintain efficiency, minimize failures, and provide scalability.

Roadmap
Looking ahead, Octus plans to continue enhancing CreditAI by taking advantage of new capabilities released by Amazon Bedrock that continue to meet and exceed our requirements. Future developments will include:

Enhance retrieval by testing and integrating with reranking techniques, allowing the system to prioritize the most relevant search results for better user experience and accuracy.
Explore the Amazon Bedrock RAG evaluator to capture detailed metrics on CreditAI’s performance. This will add to the existing mechanisms at Octus to track performance that include tracking unanswered questions.
Expand to ingest large-scale structured data, making it capable of handling complex financial datasets. The integration of text-to-SQL will enable users to query structured databases using natural language, simplifying data access.
Explore replacing our in-house content extraction service (ADE) with the Amazon Bedrock advanced parsing solution to potentially further reduce document ingestion costs.
Improve CreditAI’s disaster recovery and redundancy mechanisms, making sure that our services and infrastructure are more fault tolerant and can recover from outages faster.

These upgrades aim to boost the precision, reliability, and scalability of CreditAI.
Vishal Saxena, CTO at Octus, shares: “CreditAI is a first-of-its-kind generative AI application that focuses on the entire credit lifecycle. It is truly ‘AI embedded’ software that combines cutting-edge AI technologies with an enterprise data architecture and a unified cloud strategy.”
Conclusion
CreditAI by Octus is the company’s flagship conversational chatbot that supports natural language queries and gives instant access to insights on over 10,000 companies from hundreds of thousands of proprietary intel articles. In this post, we described in detail our motivation, process, and results on Octus’s migration to Amazon Bedrock. Through this migration, Octus achieved remarkable results that included an over 75% reduction in operating costs as well as a 250% boost in engagement. Future steps include adopting new features such as reranking, RAG evaluator, and advanced parsing to further reduce costs and improve performance. We believe that the collaboration between Octus and AWS will continue to revolutionize financial analysis and research workflows.
To learn more about Amazon Bedrock, refer to the Amazon Bedrock User Guide.

About the Authors
Vaibhav Sabharwal is a Senior Solutions Architect with Amazon Web Services based out of New York. He is passionate about learning new cloud technologies and assisting customers in building cloud adoption strategies, designing innovative solutions, and driving operational excellence. As a member of the Financial Services Technical Field Community at AWS, he actively contributes to the collaborative efforts within the industry.
Yihnew Eshetu is a Senior Director of AI Engineering at Octus, leading the development of AI solutions at scale to address complex business problems. With seven years of experience in AI/ML, his expertise spans GenAI and NLP, specializing in designing and deploying agentic AI systems. He has played a key role in Octus’s AI initiatives, including leading AI Engineering for its flagship GenAI chatbot, CreditAI.
Harmandeep Sethi is a Senior Director of SRE Engineering and Infrastructure Frameworks at Octus, with nearly 10 years of experience leading high-performing teams in the design, implementation, and optimization of large-scale, highly available, and reliable systems. He has played a pivotal role in transforming and modernizing Credit AI infrastructure and services by driving best practices in observability, resilience engineering, and the automation of operational processes through Infrastructure Frameworks.
Rohan Acharya is an AI Engineer at Octus, specializing in building and optimizing AI-driven solutions at scale. With expertise in GenAI and NLP, he focuses on designing and deploying intelligent systems that enhance automation and decision-making. His work involves developing robust AI architectures and advancing Octus’s AI initiatives, including the evolution of CreditAI.
Hasan Hasibul is a Principal Architect at Octus leading the DevOps team, with nearly 12 years of experience in building scalable, complex architectures while following software development best practices. A true advocate of clean code, he thrives on solving complex problems and automating infrastructure. Passionate about DevOps, infrastructure automation, and the latest advancements in AI, he has architected Octus initial CreditAI, pushing the boundaries of innovation.
Philipe Gutemberg is a Principal Software Engineer and AI Application Development Team Lead at Octus, passionate about leveraging technology for impactful solutions. An AWS Certified Solutions Architect – Associate (SAA), he has expertise in software architecture, cloud computing, and leadership. Philipe led both backend and frontend application development for CreditAI, ensuring a scalable system that integrates AI-driven insights into financial applications. A problem-solver at heart, he thrives in fast-paced environments, delivering innovative solutions for financial institutions while fostering mentorship, team development, and continuous learning.
Kishore Iyer is the VP of AI Application Development and Engineering at Octus. He leads teams that build, maintain and support Octus’s customer-facing GenAI applications, including CreditAI, our flagship AI offering. Prior to Octus, Kishore has 15+ years of experience in engineering leadership roles across large corporations, startups, research labs, and academia. He holds a Ph.D. in computer engineering from Rutgers University.
Kshitiz Agarwal is an Engineering Leader at Amazon Web Services (AWS), where he leads the development of Amazon Bedrock Knowledge Bases. With a decade of experience at Amazon, having joined in 2012, Kshitiz has gained deep insights into the cloud computing landscape. His passion lies in engaging with customers and understanding the innovative ways they leverage AWS to drive their business success. Through his work, Kshitiz aims to contribute to the continuous improvement of AWS services, enabling customers to unlock the full potential of the cloud.
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.
Tim Ramos is a Senior Account Manager at AWS. He has 12 years of sales experience and 10 years of experience in cloud services, IT infrastructure, and SaaS. Tim is dedicated to helping customers develop and implement digital innovation strategies. His focus areas include business transformation, financial and operational optimization, and security. Tim holds a BA from Gonzaga University and is based in New York City.

Optimize reasoning models like DeepSeek with prompt optimization on Amazon Bedrock

DeepSeek-R1 models, now available on Amazon Bedrock Marketplace and Amazon SageMaker JumpStart, as well as serverless on Amazon Bedrock, were recently popularized by their long and elaborate thinking style, which, according to DeepSeek's published results, leads to impressive performance on highly challenging math benchmarks like AIME-2024 and MATH-500, as well as competitive performance compared to then state-of-the-art models like Anthropic's Claude 3.5 Sonnet, GPT-4o, and OpenAI o1 (more details in this paper).
During training, researchers showed how DeepSeek-R1-Zero naturally learns to solve tasks with more thinking time, which leads to a boost in performance. However, what often gets ignored is the number of thinking tokens required at inference time, and the time and cost of generating these tokens before answering the original question.
In this post, we demonstrate how to optimize reasoning models like DeepSeek-R1 using prompt optimization on Amazon Bedrock.
Long reasoning chains and challenges with maximum token limits
Let’s try out a straightforward question on DeepSeek-R1:

For the given math problem: Nate’s dog can dig six holes a day. He digs for 14 days while Nate is on vacation. When Nate gets home, he starts filling in 9 holes a day, but the dog keeps digging 6 new holes every night. How many weeks does it take him to fill in all the holes?, write out the steps you would take to solve it.

On the Amazon Bedrock Chat/Text Playground, you can follow along by choosing the new DeepSeek-R1 model, as shown in the following screenshot.

You might see that sometimes, based on the question, reasoning models don’t finish thinking within the overall maximum token budget.

Increasing the output token budget allows the model to think for longer. With the maximum tokens increased from 2,048 to 4,096, you should see the model reasoning for a while before printing the final answer.
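If you prefer to run the same experiment programmatically instead of in the playground, the following is a minimal sketch using the Amazon Bedrock Converse API through boto3. The model ID shown is an assumed cross-Region inference profile for DeepSeek-R1; check the model catalog in your account for the exact identifier, and note that reasoning models may return their reasoning in a separate content block.

import boto3

# Bedrock Runtime client (assumes AWS credentials and a Region with DeepSeek-R1 access)
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

question = (
    "For the given math problem: Nate's dog can dig six holes a day. He digs for 14 days "
    "while Nate is on vacation. When Nate gets home, he starts filling in 9 holes a day, "
    "but the dog keeps digging 6 new holes every night. How many weeks does it take him "
    "to fill in all the holes?, write out the steps you would take to solve it."
)

response = bedrock_runtime.converse(
    modelId="us.deepseek.r1-v1:0",  # assumed model ID; verify in the Amazon Bedrock console
    messages=[{"role": "user", "content": [{"text": question}]}],
    inferenceConfig={"maxTokens": 4096},  # raised from 2,048 so the thinking can finish
)

# Print any text content blocks and the stop reason ("max_tokens" means the output was cut off)
for block in response["output"]["message"]["content"]:
    if "text" in block:
        print(block["text"])
print("Stop reason:", response["stopReason"])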

The appendix at the end of this post provides the complete response. You can also collapse the reasoning steps to view just the final answer.
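For reference, the puzzle itself can be checked with a few lines of arithmetic, independent of the model: the dog leaves 6 x 14 = 84 holes; each full day nets 9 - 6 = 3 holes filled, except the last day, when Nate fills the final 9 and the dog has nothing left to dig, so 3x + 9 = 84 gives x = 25 full days plus one final day, or 26 days (about 3 5/7 weeks, typically rounded up to 4). A quick simulation (our own sketch, not part of the original model output) confirms this:

# Simulate the hole-filling process to verify the expected answer
holes = 6 * 14   # 84 holes dug during the 14-day vacation
days = 0
while holes > 0:
    holes -= 9          # Nate fills 9 holes during the day
    days += 1
    if holes > 0:
        holes += 6      # the dog digs 6 new holes overnight
print(days, "days, or about", round(days / 7, 2), "weeks")   # 26 days, or about 3.71 weeks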

As we can see in the case with the 2,048-token budget, the thinking process didn’t end. This not only cost us 2,048 tokens’ worth of time and money, but we also didn’t get the final answer! This observation of high token counts for thinking usually leads to a few follow-up questions, such as:

Is it possible to reduce the thinking tokens and still get a correct answer?
Can the thinking be restricted to a maximum number of thinking tokens, or a thinking budget?
At a high level, should thinking-intensive models like DeepSeek be used in real-time applications at all?

In this post, we show you how you can optimize thinking models like DeepSeek-R1 using prompt optimization on Amazon Bedrock, resulting in more succinct thinking traces without sacrificing accuracy.
Optimize DeepSeek-R1 prompts
To get started with prompt optimization, select DeepSeek-R1 in the Amazon Bedrock model playground, enter your prompt, and choose the magic wand icon, or use the Amazon Bedrock optimize_prompt() API. Alternatively, open prompt optimization on the console, add variables if required, set the model to DeepSeek-R1 along with your model parameters, and choose Optimize:
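The following is a minimal sketch of calling prompt optimization programmatically with boto3. It assumes the bedrock-agent-runtime client's optimize_prompt operation and a DeepSeek-R1 target model ID; the optimized prompt is returned as an event stream, and the event key used below reflects our understanding of the response shape, so confirm it against the current API reference.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

original_prompt = "For the given math problem: ... write out the steps you would take to solve it."

response = client.optimize_prompt(
    input={"textPrompt": {"text": original_prompt}},
    targetModelId="us.deepseek.r1-v1:0",  # assumed DeepSeek-R1 model ID; verify in your account
)

# The optimized prompt arrives as part of a streamed response.
for event in response["optimizedPrompt"]:
    if "optimizedPromptEvent" in event:   # event key assumed; check the API reference
        print(event["optimizedPromptEvent"])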

To demonstrate how prompt optimization on Amazon Bedrock can help with reasoning models, we first need a challenging dataset. Humanity’s Last Exam (HLE), a benchmark of extremely challenging questions from dozens of subject areas, is designed to be the “final” closed-ended benchmark of broad academic capabilities. HLE is multi-modal, featuring questions that are either text-only or accompanied by an image reference, and includes both multiple-choice and exact-match questions for automated answer verification. The questions require deep domain knowledge in various verticals; they are unambiguous and resistant to simple internet lookups or database retrieval. For context, several state-of-the-art models (including thinking models) perform poorly on the benchmark (see the results table in this full paper).
Let’s look at an example question from this dataset:
In an alternate universe where the mass of the electron was 1% heavier and the charges of the
electron and proton were both 1% smaller, but all other fundamental constants stayed the same,
approximately how would the speed of sound in diamond change?

Answer Choices:
A. Decrease by 2%
B. Decrease by 1.5%
C. Decrease by 1%
D. Decrease by 0.5%
E. Stay approximately the same
F. Increase by 0.5%
G. Increase by 1%
H. Increase by 1.5%
I. Increase by 2%
The question requires a deep understanding of physics, and most large language models (LLMs) today fail to answer it correctly. Our goal with prompt optimization on Amazon Bedrock for reasoning models is to reduce the number of thinking tokens without sacrificing accuracy. After using prompt optimization, the optimized prompt is as follows:
## Question
<extracted_question_1>In an alternate universe where the mass of the electron was 1% heavier
and the charges of the electron and proton were both 1% smaller, but all other fundamental constants
stayed the same, approximately how would the speed of sound in diamond change?

Answer Choices:
A. Decrease by 2%
B. Decrease by 1.5%
C. Decrease by 1%
D. Decrease by 0.5%
E. Stay approximately the same
F. Increase by 0.5%
G. Increase by 1%
H. Increase by 1.5%
I. Increase by 2%</extracted_question_1>

## Instruction
Read the question above carefully and provide the most accurate answer possible.
If multiple choice options are provided within the question, respond with the entire text of the
correct answer option, not just the letter or number. Do not include any additional explanations or
preamble in your response.

Remember, your goal is to answer as precisely and accurately as possible!
The following figure shows how, for this specific case, the number of thinking tokens was reduced by 35% while the final answer remained correct (B. Decrease by 1.5%). Here, the number of thinking tokens dropped from 5,000 to 3,300. We also notice that, in this and other examples with the original prompts, part of the reasoning is summarized or repeated before the final answer. As this example shows, the optimized prompt gives clear instructions, separates the different prompt sections, and provides additional guidance based on the type of question and how to answer it. This leads to both shorter, clearer reasoning traces and a directly extractable final answer.

Optimized prompts can also lead to correct answers instead of wrong ones after long-form thinking, because thinking alone doesn't guarantee a correct final answer. In this case, the number of thinking tokens was reduced from 5,000 to 1,555, and the answer is obtained directly rather than after another long, post-thinking explanation. The following figure shows an example.

The preceding two examples demonstrate ways in which prompt optimization can improve results while shortening output tokens for models like DeepSeek-R1. Prompt optimization was also applied to 400 questions from HLE. The following table summarizes the results.

| Experiment | Overall Accuracy | Average Number of Prompt Tokens | Average Number of Completion Tokens (Thinking + Response) | Average Number of Tokens (Response Only) | Average Number of Tokens (Thinking Only) | Percentage of Thinking Completed (6,000 Maximum Output Tokens) |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline DeepSeek | 8.75% | 288 | 3,334 | 271 | 3,063 | 80.0% |
| Prompt Optimized DeepSeek | 11% | 326 | 1,925 | 27 | 1,898 | 90.3% |

As we can see, the overall accuracy jumps to 11% on this subset of the HLE dataset, the number of thinking and output tokens is reduced (therefore reducing the time to last token and cost), and the rate of completing thinking increases to 90% overall. From our experiments, we see that although there is no explicit instruction to reduce thinking tokens, the clearer, more detailed task instructions produced by prompt optimization appear to reduce the extra effort that models like DeepSeek-R1 spend on self-clarification or deeper problem understanding. Prompt optimization for reasoning models preserves the quality of thinking and the overall flow, which is self-adaptive and depends on the question, leading to better final answers.
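To reproduce the token accounting in the table, one approach (a sketch under stated assumptions, not the exact evaluation harness used here) is to separate the reasoning trace from the final response and count tokens on each side. The split below assumes the raw completion wraps reasoning in DeepSeek-R1's <think>...</think> tags; depending on how the model is invoked on Amazon Bedrock, reasoning may instead arrive in a separate reasoningContent block, and the whitespace-based count is only a rough proxy for true token counts.

# completion holds the raw model output, e.g., from the Converse API call shown earlier
completion = "<think> ... long reasoning trace ... </think> The answer is B. Decrease by 1.5%"

# Split the reasoning trace from the final response (assumes <think>...</think> framing).
if "</think>" in completion:
    thinking, response = completion.split("</think>", 1)
else:
    thinking, response = "", completion

# Rough token estimates via whitespace splitting; swap in a real tokenizer for exact counts.
thinking_tokens = len(thinking.split())
response_tokens = len(response.split())
print(f"thinking ~{thinking_tokens} tokens, response ~{response_tokens} tokens")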
Conclusion
In this post, we demonstrated how prompt optimization on Amazon Bedrock can effectively enhance the performance of thinking-intensive models like DeepSeek-R1. Through our experiments with the HLE dataset, we showed that optimized prompts not only reduced the number of thinking tokens by a significant margin, but also improved overall accuracy from 8.75% to 11%. The optimization resulted in more efficient reasoning paths without sacrificing the quality of answers, leading to faster response times and lower costs. This improvement in both efficiency and effectiveness suggests that prompt optimization can be a valuable tool for deploying reasoning-heavy models in production environments where both accuracy and computational resources need to be carefully balanced. As the field of AI continues to evolve with more sophisticated thinking models, techniques like prompt optimization will become increasingly important for practical applications.
To get started with prompt optimization on Amazon Bedrock, refer to Optimize a prompt and Improve the performance of your Generative AI applications with Prompt Optimization on Amazon Bedrock.
Appendix
The following is the full response for the question about Nate’s dog:

Okay, let's see. So the problem is about Nate's dog digging holes and Nate trying to fill them. I need to figure out how many weeks it takes Nate to fill all the holes once he starts. Let me break it down step by step.

First, the dog digs 6 holes a day. He does this for 14 days while Nate is on vacation. So when Nate comes back, there are a certain number of holes already dug. Let me calculate that first. If the dog digs 6 holes each day for 14 days, that's 6 multiplied by 14. Let me do that math: 6*14=84. So there are 84 holes when Nate gets home.

Now, once Nate starts filling the holes, he fills 9 holes per day. But here's the catch: the dog keeps digging 6 new holes every night. So each day, Nate fills 9 holes, but the dog adds 6 more overnight. I need to figure out the net change per day. That would be the number of holes filled minus the new holes dug. So 9 filled - 6 dug = 3 holes net reduction per day. Wait, but does this happen every day? Let me make sure.

So Nate works during the day, filling 9 holes. Then, during the night, the dog digs 6 new holes. So each full day (day and night) results in a net decrease of 3 holes. So starting from 84 holes, each day the total number decreases by 3. But I need to check if the process is that Nate fills 9, then the dog digs 6, so the next day starts with 84-9+6 = 84-3 = 81, and so on. So yes, each day the net is -3.

But wait, maybe the timing is important. Let's think: when Nate starts filling, does he fill 9 holes on the first day, and then the dog digs 6 that night? So each day after Nate fills, the dog adds 6. So the sequence is: Day 1: Nate fills 9, holes become 84-9=75. Then night: dog digs 6, holes become 75+6=81. Day 2: Nate fills 9, holes 81-9=72. Night: +6=78. So each full day cycle (day and night) reduces the total by 3. So yes, net 3 per day.

But wait, maybe on the last day, if Nate fills the remaining holes and the dog doesn't get to dig that night because all holes are filled. Hmm, that's a crucial point. For example, if on a certain day, the number of holes left is less than or equal to 9, Nate can fill them all in that day, and the dog doesn't add any more because there are no holes left to dig. So we need to check if on the day when Nate fills the remaining holes, the total would reach zero, and the dog doesn't dig that night. So we can't just do total holes divided by 3, because the last day might not have the dog digging.

So let's model this properly. Let's denote the number of holes each day.

Starting with 84 holes.

Each day:
1. Nate fills 9 holes. Subtract 9.
2. If the result is zero or negative, then done.
3. Otherwise, the dog digs 6 holes. Add 6.

So we need to iterate this until the number of holes is zero or less after Nate fills them.

Let me try to calculate day by day.

Start: 84
Day 1: After Nate fills: 84-9=75. Since 75>0, dog digs 6: 75+6=81
Day 2: After Nate fills: 81-9=72. Dog digs 6: 72+6=78
Day 3: 78-9=69, +6=75
Day 4: 75-9=66, +6=72
Day 5: 72-9=63, +6=69
Day 6: 69-9=60, +6=66
Day 7: 66-9=57, +6=63
Day 8: 63-9=54, +6=60
Day 9: 60-9=51, +6=57
Day 10: 57-9=48, +6=54
Day 11: 54-9=45, +6=51
Day 12: 51-9=42, +6=48
Day 13: 48-9=39, +6=45
Day 14: 45-9=36, +6=42
Day 15: 42-9=33, +6=39
Day 16: 39-9=30, +6=36
Day 17: 36-9=27, +6=33
Day 18: 33-9=24, +6=30
Day 19: 30-9=21, +6=27
Day 20: 27-9=18, +6=24
Day 21: 24-9=15, +6=21
Day 22: 21-9=12, +6=18
Day 23: 18-9=9, +6=15
Day 24: 15-9=6, +6=12
Day 25: 12-9=3, +6=9
Day 26: 9-9=0

At this point, after Nate fills on day 26, the holes are 0. Since there are no holes left, the dog can't dig any more that night. So the process stops here.

So total days: 26 days.

But the question asks for how many weeks. Since a week is 7 days, we divide 26 by 7. 26 divided by 7 is 3 with a remainder of 5. So 3 weeks and 5 days. But the question says "how many weeks does it take him to fill in all the holes". Depending on whether they want it rounded up to full weeks or expressed as a fraction. Let me check the problem statement again.

"How many weeks does it take him to fill in all the holes?"

Typically, such problems expect the answer in whole weeks, rounded up if there's a remainder. Because you can't have a fraction of a week in practical terms. So 26 days is 3 weeks and 5 days, which is more than 3 weeks but less than 4. So they might expect 4 weeks. But let me verify.

Alternatively, maybe they want the exact number in weeks as a fraction. 26 days is 26/7 weeks, which is approximately 3.714 weeks. But the problem might expect an integer number of weeks, rounded up. Let's see.

But let's check the problem again. It says "how many weeks does it take him to fill in all the holes?" So if it takes 26 days, which is 3 weeks and 5 days, then the answer would be 4 weeks if we're counting full weeks needed. Because after 3 weeks (21 days), he still has 5 days left. So total weeks is 4.

But maybe the answer is 26/7 weeks, which is approximately 3.714, but since the problem might expect an exact answer in weeks, possibly as a fraction. However, the problem might be designed to have a whole number. Wait, let's check the math again.

Wait, when I did the day-by-day count, I got 26 days. Let me recount to make sure I didn't make a mistake.

Starting from day 1 to day 26. Let's check day 26: Start with day 1: 84. After day 1: 81. ... Let me check day 25: After day 24: 12. Day 25: 12-9=3, then dog digs 6: 3+6=9. Day 26: 9-9=0. So day 26 is the day when Nate fills the last 9 holes. So yes, 26 days.

26 days divided by 7 days/week = 3.714... weeks. So the answer is 26/7 weeks, which is 3 and 5/7 weeks. But the problem might want it expressed as a mixed number or an exact number. However, the question says "how many weeks does it take him", so likely they expect the answer in weeks, possibly as a fraction. But let's see if there's a better way to model this without counting each day.

Alternative approach:

Total holes after vacation: 84. Each day, Nate fills 9, dog digs 6, net -3 per day, except possibly the last day. But if on the last day, Nate fills the remaining holes without the dog digging, then we need to adjust.

Let's model it as: Let H be the number of holes. Start with H=84. Each day, H decreases by 9, then if H>0, increases by 6. We need to find the number of days until H<=0.

This can be modeled as: Each full cycle (day and night) reduces H by 3, except the last day where if H<=9, then H becomes 0 and no addition. So the number of full cycles (days where H>9 after filling) plus the final day.

Let's denote: Let's say after n days, the holes would be: Each day, H = H-9+6 = H-3, but only if after subtracting 9, H>=1 (since if H-9<=0, then the dog doesn't dig). Wait, no. Wait, the dog digs 6 only if after Nate fills 9, there are still holes left. Wait, no. Wait, the problem says the dog keeps digging 6 new holes every night. So regardless of whether there are holes left, the dog digs 6 every night. Wait, but if all holes are filled, can the dog dig new ones? The problem says "fill in all the holes", so once all holes are filled, the dog can't dig new ones because there are no existing holes? Wait, the problem might be implying that the dog digs new holes regardless, but Nate is filling the existing ones. Wait, the problem says "the dog keeps digging 6 new holes every night". So perhaps every night, regardless of the current state, the dog digs 6 new holes. But that would mean that even if Nate fills all holes on a day, the dog digs 6 new ones that night, making it impossible to ever finish. But that can't be, because the problem asks how long to fill all holes, which implies that eventually Nate can catch up.

Wait, this is a critical point. Let me re-examine the problem statement:

"When Nate gets home, he starts filling in 9 holes a day, but the dog keeps digging 6 new holes every night. How many weeks does it take him to fill in all the holes?"

So the key is: every night, the dog digs 6 new holes. So even if Nate fills all the holes during the day, the dog will dig 6 new ones that night. But that would mean the process never ends. But the problem must assume that once all holes are filled, the dog stops. But the wording is ambiguous. However, logically, if all holes are filled, there's nothing for the dog to dig. So perhaps the dog only digs 6 holes if there's ground to dig, but the problem states "keeps digging 6 new holes every night", implying that regardless of the current state, the dog digs 6 each night. This is a problem because if Nate ever fills all holes, the next night the dog digs 6, so Nate would have to fill those 6 the next day, but then the dog digs 6 again, leading to an infinite loop. But the problem must have a solution, so likely the dog stops digging once all holes are filled. Therefore, the correct interpretation is that each night, the dog digs 6 new holes, but if there are no holes (i.e., all filled), the dog can't dig. Therefore, the last day when Nate fills the remaining holes, the dog doesn't dig that night.

Therefore, the correct model is:

Each day:
1. Nate fills 9 holes.
2. If holes remain after filling, the dog digs 6 that night.

So the process is:

Start with H=84.
While H>0:
    H = H - 9
    if H > 0:
        H = H + 6
    days += 1

So let's model this with equations. Let's find the number of days required. This is similar to the classic problem where each day you have a net change, but with a different condition on the last day.

Let's denote that each "cycle" (day and night) reduces H by 3, except possibly the last day where H<=9, so after filling, H becomes 0, and no addition. So the total number of days is the number of full cycles (each reducing H by 3) plus the final day. But how to calculate this?

Let's think of it as: Let's suppose that after n days, the number of holes is H_n = 84 - 3n. But this is only true if every day the net change is -3. However, this would be the case until H_n - 9 <= 0. Wait, no. Wait, the H after filling is H-9. If H-9<=0, then the dog doesn't dig, so the total reduction is H (to 0). So we need to find the day when H-9<=0.

So let's solve for when H-9<=0: H = 84 - 3(n-1) - 9 <= 0. Wait, this is getting complicated. Let me use another approach.

Let's consider that each day before the final day, the net change is -3. On the final day, the net change is -9 (since the dog doesn't add 6). Let's let x be the number of full days (with net -3), and 1 final day (net -9). Total reduction: 3x + 9 = 84. So 3x = 84 - 9 = 75. x = 75/3 = 25. So x = 25 days of net -3, and 1 final day of -9. Total days = 25 + 1 = 26 days.

Yes, this matches the day-by-day count. Therefore, 26 days.

Convert to weeks: 26/7 = 3.714... weeks. But the problem asks "how many weeks does it take him". If they want a whole number of weeks, rounded up, it's 4 weeks. If they accept fractions, it's 26/7, approximately 3.714. But the problem might expect the answer in weeks as a mixed number, like 3 and 5/7 weeks. However, typically, such problems expect the answer in whole weeks, rounded up. Let me check the problem statement again.

The problem says "how many weeks does it take him to fill in all the holes". The answer should be in weeks, and since it's more than 3 weeks but less than 4, but you can't have a fraction of a week in practical terms when scheduling, so they might expect 4 weeks. However, mathematically, it's 26/7 weeks. Let me see if the problem expects an exact answer or a whole number.

Looking back at the problem statement, it's a math problem, so likely expects the exact value, which is 26/7 weeks, which can be written as 3 5/7 weeks. However, sometimes problems like this expect the answer in whole weeks, rounding up. Let's check both possibilities.

If the answer is 26 days, which is 3 weeks and 5 days, then depending on the context:
- If partial weeks count as a full week, then 4 weeks.
- If exact, then 3 5/7 weeks.

But the problem doesn't specify, so as a math problem, the precise answer is 26/7 weeks, which is 3 and 5/7 weeks. However, the question might want it in weeks as a whole number, rounded up. Let's see.

But let's check the problem again. The problem says "how many weeks does it take him to fill in all the holes?" The answer choices (if any) aren't given, but since the user hasn't provided options, we need to infer. In math problems like this, unless specified otherwise, the answer is expected in the most precise form, which would be 26/7 weeks or 3 5/7 weeks. However, sometimes problems expect whole numbers, so maybe the answer is 4 weeks. But let's verify with the day count.

26 days is exactly 3 weeks and 5 days. If the question asks for the number of weeks, and if part of a week counts as a full week, then it's 4. But if they want the exact number, it's 3 5/7. However, the problem might have an integer answer. Wait, let's check the calculations again.

Wait, when I did the day-by-day count, I got 26 days. Let me verify that again quickly. Start with 84. Each day: 1. Subtract 9. 2. If result > 0, add 6. Repeat until H<=0.

Let's model this with a formula. Let's denote that each day, except the last, the net change is -3. The last day is -9. So total holes = 84 = 3x + 9. So 3x = 75, giving x = 25 days of net -3, plus 1 day of -9. Total 26 days. Yes, that's correct. So 26 days.

26 days divided by 7 days/week = 3.714... weeks. But the problem might want the answer as a mixed number, so 3 and 5/7 weeks. However, in some contexts, especially in school problems, they might expect the answer in whole weeks, rounded up, so 4 weeks. But since the problem is from Nate's perspective, he would count the exact days, so converting 26 days to weeks gives 3 weeks and 5 days. But the question asks for weeks, so unless specified, it's better to present it as 26/7 weeks, which is 3 5/7 weeks. However, let's check if there's a different approach that gives a whole number.

Alternatively, maybe I made a mistake in the day count. Let me recount the days more carefully. Starting with 84 holes.
Day 1: Fill 9: 84-9=75. Since 75>0, dog digs 6: 75+6=81.
Day 2: 81-9=72>0, +6=78.
Day 3: 78-9=69>0, +6=75.
Day 4: 75-9=66>0, +6=72.
Day 5: 72-9=63>0, +6=69.
Day 6: 69-9=60>0, +6=66.
Day 7: 66-9=57>0, +6=63.
Day 8: 63-9=54>0, +6=60.
Day 9: 60-9=51>0, +6=57.
Day 10: 57-9=48>0, +6=54.
Day 11: 54-9=45>0, +6=51.
Day 12: 51-9=42>0, +6=48.
Day 13: 48-9=39>0, +6=45.
Day 14: 45-9=36>0, +6=42.
Day 15: 42-9=33>0, +6=39.
Day 16: 39-9=30>0, +6=36.
Day 17: 36-9=

About the authors
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Zhengyuan Shen is an Applied Scientist at Amazon Bedrock, specializing in foundational models and ML modeling for complex tasks including natural language and structured data understanding. He is passionate about leveraging innovative ML solutions to enhance products or services, thereby simplifying the lives of customers through a seamless blend of science and engineering. Outside work, he enjoys sports and cooking.
Xuan Qi is an Applied Scientist at Amazon Bedrock, where she applies her background in physics to tackle complex challenges in machine learning and artificial intelligence. Xuan is passionate about translating scientific concepts into practical applications that drive tangible improvements in technology. Her work focuses on creating more intuitive and efficient AI systems that can better understand and interact with the world. Outside of her professional pursuits, Xuan finds balance and creativity through her love for dancing and playing the violin, bringing the precision and harmony of these arts into her scientific endeavors.
Shuai Wang is a Senior Applied Scientist and Manager at Amazon Bedrock, specializing in natural language processing, machine learning, large language modeling, and other related AI areas.

Amazon Bedrock announces general availability of multi-agent collaboration

Today, we’re announcing the general availability (GA) of multi-agent collaboration on Amazon Bedrock. This capability allows developers to build, deploy, and manage networks of AI agents that work together to execute complex, multi-step workflows efficiently.
Since its preview launch at re:Invent 2024, organizations across industries—including financial services, healthcare, supply chain and logistics, manufacturing, and customer support—have used multi-agent collaboration to orchestrate specialized agents, driving efficiency, accuracy, and automation. With this GA release, we’ve introduced enhancements based on customer feedback, further improving scalability, observability, and flexibility—making AI-driven workflows easier to manage and optimize.
What is multi-agent collaboration?
Generative AI is no longer just about models generating responses; it's about automation. The next wave of innovation is driven by agents that can reason, plan, and act autonomously across company systems. Generative AI applications are no longer just generating content; they also take action, solve problems, and execute complex workflows. The shift is clear: businesses need AI that doesn't just respond to prompts but orchestrates entire workflows, automating processes end to end.
Agents enable generative AI applications to perform tasks across company systems and data sources, and Amazon Bedrock already simplifies building them. With Amazon Bedrock, customers can quickly create agents that handle sales orders, compile financial reports, analyze customer retention, and much more. However, as applications become more capable, the tasks customers want them to perform can exceed what a single agent can manage—either because the tasks require specialized expertise, involve multiple steps, or demand continuous execution over time.
Coordinating potentially hundreds of agents at scale is also challenging, because managing dependencies, ensuring efficient task distribution, and maintaining performance across a large network of specialized agents requires sophisticated orchestration. Without the right tools, businesses can face inefficiencies, increased latency, and difficulties in monitoring and optimizing performance. For customers looking to advance their agents and tackle more intricate, multi-step workflows, Amazon Bedrock supports multi-agent collaboration, enabling developers to easily build, deploy, and manage multiple specialized agents working together seamlessly.
Multi-agent collaboration enables developers to create networks of specialized agents that communicate and coordinate under the guidance of a supervisor agent. Each agent contributes its expertise to the larger workflow by focusing on a specific task. This approach breaks down complex processes into manageable sub-tasks processed in parallel. By facilitating seamless interaction among agents, Amazon Bedrock enhances operational efficiency and accuracy, ensuring workflows run more effectively at scale. Because each agent only accesses the data required for its role, this approach minimizes exposure of sensitive information while reinforcing security and governance. This allows businesses to scale their AI-driven workflows without the need for manual intervention in coordinating agents. As more agents are added, the supervisor ensures smooth collaboration between them all.
By using multi-agent collaboration on Amazon Bedrock, organizations can:

Streamline AI-driven workflows by distributing workloads across specialized agents.
Improve execution efficiency by parallelizing tasks where possible.
Enhance security and governance by restricting agent access to only necessary data.
Reduce operational complexity by eliminating manual intervention in agent coordination.

A key challenge in building effective multi-agent collaboration systems is managing the complexity and overhead of coordinating multiple specialized agents at scale. Amazon Bedrock simplifies the process of building, deploying, and orchestrating effective multi-agent collaboration systems while addressing efficiency challenges through several key features and optimizations:

Quick setup – Create, deploy, and manage AI agents working together in minutes without the need for complex coding.
Composability – Integrate your existing agents as subagents within a larger agent system, allowing them to seamlessly work together to tackle complex workflows.
Efficient inter-agent communication – The supervisor agent can interact with subagents using a consistent interface, supporting parallel communication for more efficient task completion.
Optimized collaboration modes – Choose between supervisor mode and supervisor with routing mode. With routing mode, the supervisor agent will route simple requests directly to specialized subagents, bypassing full orchestration. For complex queries or when no clear intention is detected, it automatically falls back to the full supervisor mode, where the supervisor agent analyzes, breaks down problems, and coordinates multiple subagents as needed.
Integrated trace and debug console – Visualize and analyze multi-agent interactions behind the scenes using the integrated trace and debug console.

What’s new in general availability?
The GA release introduces several key enhancements based on customer feedback, making multi-agent collaboration more scalable, flexible, and efficient:

Inline agent support – Enables the creation of supervisor agents dynamically at runtime, allowing for more flexible agent management without predefined structures.
AWS CloudFormation and AWS Cloud Development Kit (AWS CDK) support – Lets customers deploy agent networks as code, enabling scalable, reusable agent templates across AWS accounts.
Enhanced traceability and debugging – Provides structured execution logs, sub-step tracking, and Amazon CloudWatch integration to improve monitoring and troubleshooting.
Increased collaborator and step count limits – Expands self-service limits for agent collaborators and execution steps, supporting larger-scale workflows.
Payload referencing – Reduces latency and costs by allowing the supervisor agent to reference external data sources without embedding them in the agent request.
Improved citation handling – Enhances accuracy and attribution when agents pull external data sources into their responses.

These features collectively improve coordination capabilities, communication speed, and overall effectiveness of the multi-agent collaboration framework in tackling complex, real-world problems.
Multi-agent collaboration across industries
Multi-agent collaboration is already transforming AI automation across sectors:

Investment advisory – A financial firm uses multiple agents to analyze market trends, risk factors, and investment opportunities to deliver personalized client recommendations.
Retail operations – A retailer deploys agents for demand forecasting, inventory tracking, pricing optimization, and order fulfillment to increase operational efficiency.
Fraud detection – A banking institution assigns agents to monitor transactions, detect anomalies, validate customer behaviors, and flag potential fraud risks in real time.
Customer support – An enterprise customer service platform uses agents for sentiment analysis, ticket classification, knowledge base retrieval, and automated responses to enhance resolution times.
Healthcare diagnosis – A hospital system integrates agents for patient record analysis, symptom recognition, medical imaging review, and treatment plan recommendations to assist clinicians.

Deep dive: Syngenta’s use of multi-agent collaboration
Syngenta, a global leader in agricultural innovation, has integrated cutting-edge generative AI into its Cropwise service, resulting in the development of Cropwise AI. This advanced system is designed to enhance the efficiency of agronomic advisors and growers by providing tailored recommendations for crop management practices.
Business challenge
The agricultural sector faces the complex task of optimizing crop yields while ensuring sustainability and profitability. Farmers and agronomic advisors must consider a multitude of factors, including weather patterns, soil conditions, crop growth stages, and potential pest and disease threats. In the past, analyzing these variables required extensive manual effort and expertise. Syngenta recognized the need for a more efficient, data-driven approach to support decision-making in crop management.
Solution: Cropwise AI
To address these challenges, Syngenta collaborated with AWS to develop Cropwise AI, using Amazon Bedrock Agents to create a multi-agent system that integrates various data sources and AI capabilities. This system offers several key features:

Advanced seed recommendation and placement – Uses predictive machine learning algorithms to deliver personalized seed recommendations tailored to each grower’s unique environment.
Sophisticated predictive modeling – Employs state-of-the-art machine learning algorithms to forecast crop growth patterns, yield potential, and potential risk factors by integrating real-time data with comprehensive historical information.
Precision agriculture optimization – Provides hyper-localized, site-specific recommendations for input application, minimizing waste and maximizing resource efficiency.

Agent architecture
Cropwise AI is built on AWS architecture and designed for scalability, maintainability, and security. The system uses Amazon Bedrock Agents to orchestrate multiple AI agents, each specializing in distinct tasks:

Data aggregation agent – Collects and integrates extensive datasets, including over 20 years of weather history, soil conditions, and more than 80,000 observations on crop growth stages.
Recommendation agent – Analyzes the aggregated data to provide tailored recommendations for precise input applications, product placement, and strategies for pest and disease control.
Conversational AI agent – Uses a multilingual conversational large language model (LLM) to interact with users in natural language, delivering insights in a clear format.

This multi-agent collaboration enables Cropwise AI to process complex agricultural data efficiently, offering actionable insights and personalized recommendations to enhance crop yields, sustainability, and profitability.
Results
By implementing Cropwise AI, Syngenta has achieved significant improvements in agricultural practices:

Enhanced decision-making: Agronomic advisors and growers receive data-driven recommendations, leading to optimized crop management strategies.
Increased yields: Utilizing Syngenta’s seed recommendation models, Cropwise AI helps growers increase yields by up to 5%.
Sustainable practices: The system promotes precision agriculture, reducing waste and minimizing environmental impact through optimized input applications.

Highlighting the significance of this advancement, Feroz Sheikh, Chief Information and Digital Officer at Syngenta Group, stated:

“Agricultural innovation leader Syngenta is using Amazon Bedrock Agents as part of its Cropwise AI solution, which gives growers deep insights to help them optimize crop yields, improve sustainability, and drive profitability. With multi-agent collaboration, Syngenta will be able to use multiple agents to further improve their recommendations to growers, transforming how their end-users make decisions and delivering even greater value to the farming community.” 

This collaboration between Syngenta and AWS exemplifies the transformative potential of generative AI and multi-agent systems in agriculture, driving innovation and supporting sustainable farming practices.
How multi-agent collaboration works
Amazon Bedrock automates agent collaboration, including task delegation, execution tracking, and data orchestration. Developers can configure their system in one of two collaboration modes (a minimal API sketch follows the mode descriptions below):

Supervisor mode

The supervisor agent receives an input, breaks down complex requests, and assigns tasks to specialized sub-agents.
Sub-agents execute tasks in parallel or sequentially, returning responses to the supervisor, which consolidates the results.

Supervisor with routing mode

Simple queries are routed directly to a relevant sub-agent.
Complex or ambiguous requests trigger the supervisor to coordinate multiple agents to complete the task.
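
The following is a minimal boto3 sketch of wiring a supervisor to one existing sub-agent. It is a sketch under stated assumptions: the bedrock-agent operations create_agent, associate_agent_collaborator, and prepare_agent are used as we understand them, and the role ARN, foundation model ID, and collaborator alias ARN are placeholders, so confirm parameter names and values against the current API reference.

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# 1. Create the supervisor agent in multi-agent collaboration mode.
#    Using "SUPERVISOR_ROUTER" instead would enable the routing mode described above.
supervisor = bedrock_agent.create_agent(
    agentName="orders-supervisor",
    foundationModel="anthropic.claude-3-5-sonnet-20240620-v1:0",             # placeholder model ID
    instruction="Coordinate specialized sub-agents to resolve customer order requests.",
    agentResourceRoleArn="arn:aws:iam::123456789012:role/BedrockAgentRole",  # placeholder role
    agentCollaboration="SUPERVISOR",
)
agent_id = supervisor["agent"]["agentId"]

# 2. Register an existing agent (by alias ARN) as a collaborator of the supervisor.
bedrock_agent.associate_agent_collaborator(
    agentId=agent_id,
    agentVersion="DRAFT",
    agentDescriptor={"aliasArn": "arn:aws:bedrock:us-east-1:123456789012:agent-alias/AGENTID/ALIASID"},  # placeholder
    collaboratorName="inventory-agent",
    collaborationInstruction="Handle inventory availability checks for order requests.",
    relayConversationHistory="TO_COLLABORATOR",
)

# 3. Prepare the supervisor so the collaborator configuration takes effect.
bedrock_agent.prepare_agent(agentId=agent_id)

At runtime, the supervisor is invoked like any other agent (for example, with the bedrock-agent-runtime invoke_agent operation against a supervisor alias), and it delegates to its collaborators according to the chosen mode.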

Watch the Amazon Bedrock multi-agent collaboration video to learn how to get started.

Conclusion
By enabling seamless multi-agent collaboration, Amazon Bedrock empowers businesses to scale their generative AI applications with greater efficiency, accuracy, and flexibility. As organizations continue to push the boundaries of AI-driven automation, having the right tools to orchestrate complex workflows will be essential. With Amazon Bedrock, companies can confidently build AI systems that don’t just generate responses but drive real impact—automating processes, solving problems, and unlocking new possibilities across industries.
Amazon Bedrock multi-agent collaboration is now generally available.

Learn more: Automate tasks in your application using AI agents
Code samples: Amazon Bedrock agent samples on GitHub
Try it out today in the AWS Management Console for Amazon Bedrock.

Multi-agent collaboration opens new possibilities for AI-driven automation. Whether in finance, healthcare, retail, or agriculture, Amazon Bedrock helps organizations scale AI workflows with efficiency and precision.
Start building today—and let us know what you create!

About the authors
Sri Koneru has spent the last 13.5 years honing her skills in both cutting-edge product development and large-scale infrastructure. At Salesforce for 7.5 years, she had the incredible opportunity to build and launch brand new products from the ground up, reaching over 100,000 external customers. This experience was instrumental in her professional growth. Then, at Google for 6 years, she transitioned to managing critical infrastructure, overseeing capacity, efficiency, fungibility, job scheduling, data platforms, and spatial flexibility for all of Alphabet. Most recently, Sri joined Amazon Web Services leveraging her diverse skillset to make a significant impact on AI/ML services and infrastructure at AWS. Personally, Sri & her husband recently became empty nesters, relocating to Seattle from the Bay Area. They’re a basketball-loving family who even catch pre-season Warriors games but are looking forward to cheering on the Seattle Storm this year. Beyond basketball, Sri enjoys cooking, recipe creation, reading, and her newfound hobby of hiking. While she’s a sun-seeker at heart, she is looking forward to experiencing the unique character of Seattle weather.

A Step by Step Guide to Build a Trend Finder Tool with Python: Web Scraping, NLP (Sentiment Analysis & Topic Modeling), and Word Cloud Visualization

Monitoring and extracting trends from web content has become essential, whether for market research, content creation, or simply staying ahead in your field. In this tutorial, we provide a practical guide to building your own trend-finding tool using Python. Without needing external APIs or complex setups, you'll learn how to scrape publicly accessible websites, apply powerful NLP (Natural Language Processing) techniques like sentiment analysis and topic modeling, and visualize emerging trends using dynamic word clouds.

import requests
from bs4 import BeautifulSoup

# List of URLs to scrape
urls = ["https://en.wikipedia.org/wiki/Natural_language_processing",
        "https://en.wikipedia.org/wiki/Machine_learning"]

collected_texts = []  # to store text from each page

for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract all paragraph text
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        page_text = " ".join(paragraphs)
        collected_texts.append(page_text.strip())
    else:
        print(f"Failed to retrieve {url}")

First, with the above code snippet, we demonstrate a straightforward way to scrape textual data from publicly accessible websites using Python's requests and BeautifulSoup. The code fetches content from the specified URLs, extracts paragraphs from the HTML, and prepares them for further NLP analysis by combining the text data into structured strings.

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

cleaned_texts = []
for text in collected_texts:
    # Remove non-alphabetical characters and lowercase the text
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    # Remove stopwords
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

Then, we clean the scraped text by converting it to lowercase, removing punctuation and special characters, and filtering out common English stopwords using NLTK. This preprocessing ensures the text data is clean, focused, and ready for meaningful NLP analysis.

from collections import Counter

# Combine all texts into one if analyzing overall trends:
all_text = " ".join(cleaned_texts)
word_counts = Counter(all_text.split())
common_words = word_counts.most_common(10)  # top 10 frequent words
print("Top 10 keywords:", common_words)

Now, we calculate word frequencies from the cleaned textual data, identifying the top 10 most frequent keywords. This helps highlight dominant trends and recurring themes across the collected documents, providing immediate insights into popular or significant topics within the scraped content.

!pip install textblob
from textblob import TextBlob

for i, text in enumerate(cleaned_texts, 1):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "Positive"
    elif polarity < -0.1:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    print(f"Document {i} Sentiment: {sentiment} (polarity={polarity:.2f})")

We perform sentiment analysis on each cleaned text document using TextBlob, a Python library built on top of NLTK. It evaluates the overall emotional tone of each document—positive, negative, or neutral—and prints the sentiment along with a numerical polarity score, providing a quick indication of the general mood or attitude within the text data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Adjust these parameters
vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(cleaned_texts)

# Fit LDA to find topics (for instance, 3 topics)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)

feature_names = vectorizer.get_feature_names_out()

for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx + 1}: ", [feature_names[i] for i in topic.argsort()[:-11:-1]])

Then, we apply Latent Dirichlet Allocation (LDA)—a popular topic modeling algorithm—to discover underlying topics in the text corpus. It first transforms cleaned texts into a numerical document-term matrix using scikit-learn’s CountVectorizer, then fits an LDA model to identify the primary themes. The output lists the top keywords for each discovered topic, concisely summarizing key concepts in the collected data.

# Assuming you have your text data stored in combined_text
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Preprocess and clean the text:
cleaned_texts = []
for text in collected_texts:
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

# Generate combined text
combined_text = " ".join(cleaned_texts)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white', colormap='viridis').generate(combined_text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Scraped Text", fontsize=16)
plt.show()

Finally, we generate a word cloud visualization displaying prominent keywords from the combined and cleaned text data. By visually emphasizing the most frequent and relevant terms, this approach allows for intuitive exploration of the main trends and themes in the collected web content.

Word Cloud Output from the Scraped Site

In conclusion,  we’ve successfully built a robust and interactive trend-finding tool. This exercise equipped you with hands-on experience in web scraping, NLP analysis, topic modeling, and intuitive visualizations using word clouds. With this powerful yet straightforward approach, you can continuously track industry trends, gain valuable insights from social and blog content, and make informed decisions based on real-time data.

Here is the Colab Notebook.

The post A Step by Step Guide to Build a Trend Finder Tool with Python: Web Scraping, NLP (Sentiment Analysis & Topic Modeling), and Word Cloud Visualization appeared first on MarkTechPost.