
Dynamic Tanh DyT: A Simplified Alternative to Normalization in Transfo …

Posted on March 17, 2025 by i-genie

Normalization layers have become fundamental components of modern neural networks, significantly improving optimization by stabilizing gradient flow, reducing sensitivity to weight initialization, and smoothing the loss landscape. Since the introduction of batch normalization in 2015, various normalization techniques have been developed for different architectures, with layer normalization (LN) becoming particularly dominant in Transformer models. Their widespread use is largely attributed to their ability to accelerate convergence and enhance model performance, especially as networks grow deeper and more complex. Despite ongoing architectural innovations that replace other core components like attention or convolution layers, normalization layers remain integral to most designs, underscoring their perceived necessity in deep learning.

While normalization layers have proven beneficial, researchers have also explored methods to train deep networks without them. Studies have proposed alternative weight initialization strategies, weight normalization techniques, and adaptive gradient clipping to maintain stability in models like ResNets. In Transformers, recent efforts have examined modifications that reduce reliance on normalization, such as restructuring Transformer blocks or gradually removing LN layers through fine-tuning. These approaches demonstrate that, while normalization layers offer optimization advantages, they are not strictly indispensable, and alternative training techniques can achieve stable convergence with comparable performance.

Researchers from FAIR at Meta, NYU, MIT, and Princeton propose Dynamic Tanh (DyT) as a simple yet effective alternative to normalization layers in Transformers. DyT operates as an element-wise function, DyT(x) = tanh(αx), where α is a learnable parameter that scales activations while the tanh limits extreme values. Unlike layer normalization, DyT does not need activation statistics, which simplifies computation. Empirical evaluations show that replacing normalization layers with DyT maintains or improves performance across various tasks without extensive hyperparameter tuning. Additionally, DyT enhances training and inference efficiency, challenging the assumption that normalization is essential for modern deep networks.

The researchers analyzed normalization layers in Transformers using models like ViT-B, wav2vec 2.0, and DiT-XL. They found that LN often exhibits a tanh-like, S-shaped input-output mapping that is close to linear for most values but squashes extreme activations. Inspired by this, they propose Dynamic Tanh (DyT) as a replacement for LN. Defined as DyT(x) = γ * tanh(αx) + β, where α, γ, and β are learnable parameters, DyT preserves LN’s effects without computing activation statistics. Empirical results show DyT integrates seamlessly into existing architectures, maintaining stability and reducing the need for hyperparameter tuning.
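To make the formula concrete, here is a minimal pure-Python sketch of DyT applied to one token's feature vector, next to a reference layer norm. It is an illustration rather than the authors' implementation: the function names, the per-feature shape of gamma and beta (chosen to mirror layer norm's affine parameters), and the value alpha=0.5 are assumptions made for this example.

```python
import math

def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta, applied element-wise.

    alpha is a single scalar; gamma and beta are per-feature lists
    (an assumption that mirrors layer norm's affine parameters).
    """
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]

def layer_norm(x, gamma, beta, eps=1e-5):
    """Reference layer norm: must compute the mean and variance of x first."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [0.1, -0.2, 0.3, 25.0]      # the last entry is an extreme activation
gamma = [1.0] * len(x)
beta = [0.0] * len(x)

dyt_out = dyt(x, alpha=0.5, gamma=gamma, beta=beta)
ln_out = layer_norm(x, gamma, beta)
print(dyt_out)
print(ln_out)
```

Small inputs pass through DyT almost linearly (scaled by alpha), while the extreme value is squashed toward 1. Unlike layer_norm, dyt never looks at the other entries of x.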

To evaluate DyT’s effectiveness, experiments were conducted across various architectures and tasks by replacing LN or RMSNorm with DyT while keeping hyperparameters unchanged. In supervised vision tasks, DyT slightly outperformed LN in ImageNet-1K classification. For self-supervised learning, diffusion models, language models, speech processing, and DNA sequence modeling, DyT achieved performance comparable to existing normalization methods. Efficiency tests on LLaMA-7B showed DyT reduced computation time. Ablation studies highlighted the importance of both the tanh function and the learnable parameter α, which tracked the inverse of the activation standard deviation (1/std) and so acts as an implicit normalization mechanism. DyT demonstrated competitive performance with improved efficiency.

In conclusion, the study shows that modern neural networks, particularly Transformers, can be trained effectively without normalization layers. The proposed DyT replaces traditional normalization using a learnable scaling factor alpha and an S-shaped tanh function to regulate activation values. Despite its simplicity, DyT replicates normalization behavior and achieves comparable or superior performance across various tasks, including recognition, generation, and self-supervised learning. The results challenge the assumption that normalization layers are essential, offering new insights into their function. DyT provides a lightweight alternative that simplifies training while maintaining or improving performance, often without requiring hyperparameter adjustments.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post Dynamic Tanh DyT: A Simplified Alternative to Normalization in Transformers appeared first on MarkTechPost.

A Code Implementation to Build an AI-Powered PDF Interaction System in …

Posted on March 17, 2025 by i-genie

In this tutorial, we demonstrate how to build an AI-powered PDF interaction system in Google Colab using Gemini Flash 1.5, PyMuPDF, and the Google Generative AI API. By leveraging these tools, we can seamlessly upload a PDF, extract its text, and interactively ask questions, receiving intelligent responses from Google’s latest Gemini Flash 1.5 model.

!pip install -q -U google-generativeai PyMuPDF python-dotenv

First we install the necessary dependencies for building an AI-powered PDF Q&A system in Google Colab. google-generativeai provides access to Gemini Flash 1.5, enabling natural language interactions, while PyMuPDF (also known as Fitz) allows efficient text extraction from PDFs. Also, python-dotenv helps manage environment variables, such as API keys, securely within the notebook.

from google.colab import files
uploaded = files.upload()

We upload files from your local device to Google Colab. When executed, it opens a file selection dialog, allowing you to choose a file (e.g., a PDF) to upload. The uploaded file is stored in a dictionary-like object (uploaded), where keys represent file names and values contain the file’s binary data. This step is essential for directly processing documents, datasets, or model weights in a Colab environment.
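Because the later cells hard-code the path /content/Paper.pdf, it can help to derive the path from the uploaded dictionary instead. The helper below is a small sketch: first_pdf_path is a hypothetical name, and the /content base directory is an assumption taken from the path used later in this tutorial.

```python
def first_pdf_path(uploaded, base_dir="/content"):
    """Return the path of the first uploaded file whose name ends in .pdf.

    `uploaded` is any mapping of file name -> file bytes, such as the
    object returned by files.upload(). base_dir is an assumption.
    """
    for name in uploaded:
        if name.lower().endswith(".pdf"):
            return f"{base_dir}/{name}"
    raise ValueError("No PDF file found among the uploaded files.")

# Stand-in for the dictionary returned by files.upload()
example_upload = {"notes.txt": b"hello", "Paper.pdf": b"%PDF-1.7"}
pdf_path = first_pdf_path(example_upload)
print(pdf_path)
```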

import fitz  # PyMuPDF

def extract_pdf_text(pdf_path):
    # Open the PDF and concatenate the text of every page.
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text()
    return full_text

pdf_file_path = '/content/Paper.pdf'
document_text = extract_pdf_text(pdf_path=pdf_file_path)
print("Document text extracted!")
print(document_text[:1000])  # preview the first 1000 characters

We use PyMuPDF (fitz) to extract text from a PDF file in Google Colab. The function extract_pdf_text(pdf_path) reads the PDF, iterates through its pages, and retrieves the text content. The extracted text is then stored in document_text, with the first 1000 characters printed to preview the content. This step is crucial for enabling text-based analysis and AI-driven question answering from PDFs.

import os
os.environ["GOOGLE_API_KEY"] = 'Use your own API key here'

We set the Google API key as an environment variable in Google Colab. The API key is required to authenticate requests to Google Generative AI, allowing access to Gemini Flash 1.5 for AI-powered text processing. Replacing ‘Use your own API key here’ with a valid key ensures that the model can generate responses securely within the notebook.
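A common slip is to run the notebook with the placeholder string still in place. The check below is a minimal sketch that fails early with a clear message; get_api_key and PLACEHOLDER are hypothetical names introduced for this example, not part of the Google Generative AI library.

```python
import os

# The placeholder text used in the cell above
PLACEHOLDER = "Use your own API key here"

def get_api_key(env=os.environ, name="GOOGLE_API_KEY"):
    """Return the API key from `env`, or raise if it is missing or a placeholder."""
    key = env.get(name, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            f"{name} is not set. Replace the placeholder with your own key."
        )
    return key

# Usage with a plain dict standing in for os.environ
print(get_api_key({"GOOGLE_API_KEY": "my-secret-key"}))
```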

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model_name = "models/gemini-1.5-flash-001"

def query_gemini_flash(question, context):
    model = genai.GenerativeModel(model_name=model_name)
    # Only the first 20,000 characters of the document are sent as context.
    prompt = f"""
Context: {context[:20000]}

Question: {question}

Answer:
"""
    response = model.generate_content(prompt)
    return response.text

pdf_text = extract_pdf_text("/content/Paper.pdf")

question = "Summarize the key findings of this document."
answer = query_gemini_flash(question, pdf_text)
print("Gemini Flash Answer:")
print(answer)

Finally, we configure and query Gemini Flash 1.5 using a PDF document for AI-powered text generation. The code initializes the genai library with the API key and selects the Gemini Flash 1.5 model (gemini-1.5-flash-001). The query_gemini_flash() function takes a question and the extracted PDF text as input, builds a structured prompt from the first 20,000 characters of that text, and returns the model’s response. This setup enables automated document summarization and intelligent Q&A from PDFs.

In conclusion, following this tutorial, we have built an interactive PDF question-answering system in Google Colab using Gemini Flash 1.5, PyMuPDF, and the Google Generative AI API. This solution enables users to extract information from PDFs and query them interactively. The combination of Google’s cutting-edge AI models and Colab’s cloud-based environment provides a powerful and accessible way to process large documents without requiring heavy computational resources.

Here is the Colab Notebook.

The post A Code Implementation to Build an AI-Powered PDF Interaction System in Google Colab Using Gemini Flash 1.5, PyMuPDF, and Google Generative AI API appeared first on MarkTechPost.

Meet PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for …

Posted on March 16, 2025 by i-genie

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities across various domains, propelling their evolution into multi-modal agents for human assistance. GUI automation agents for PCs face particularly daunting challenges compared to smartphone counterparts. PC environments present significantly more complex interactive elements with dense, diverse icons and widgets often lacking textual labels, leading to perception difficulties. Even advanced models like Claude-3.5 achieve only 24.0% accuracy in GUI grounding tasks. Also, PC productivity tasks involve intricate workflows spanning multiple applications with lengthy operation sequences and inter-subtask dependencies, causing dramatic performance declines where GPT-4o’s success rate drops from 41.8% at subtask level to just 8% for complete instructions.

Previous approaches have developed frameworks to address PC task complexity with varying strategies. UFO implements a dual-agent architecture separating application selection from specific control interactions. Meanwhile, AgentS augments planning capabilities by combining online search with local memory. However, these methods demonstrate significant limitations in fine-grained perception and operation of on-screen text—a critical requirement for productivity scenarios like document editing. In addition, they generally fail to address the complex dependencies between subtasks, resulting in poor performance when handling realistic intra- and inter-app workflows that characterize everyday PC usage.

Researchers from MAIS, Institute of Automation, Chinese Academy of Sciences; the School of Artificial Intelligence, University of Chinese Academy of Sciences; Alibaba Group; Beijing Jiaotong University; and the School of Information Science and Technology, ShanghaiTech University introduce the PC-Agent framework to address complex PC scenarios through three designs. First, the Active Perception Module enhances fine-grained interaction by extracting locations and meanings of interactive elements via accessibility trees, while using MLLM-driven intention understanding and OCR for precise text localization. Second, Hierarchical Multi-agent Collaboration implements a three-level decision process (Instruction-Subtask-Action) where a Manager Agent decomposes instructions into parameterized subtasks and manages dependencies, a Progress Agent tracks operation history, and a Decision Agent executes steps with perception and progress information. Third, Reflection-based Dynamic Decision-making introduces a Reflection Agent that assesses execution correctness and provides feedback, enabling top-down task decomposition with bottom-up precision feedback across all four collaborating agents.

PC-Agent’s architecture addresses GUI interaction through a formalized approach where an agent ρ processes user instructions I, observations O, and history H to determine actions A. The Active Perception Module enhances element recognition using pywinauto to extract accessibility trees for interactive elements while employing MLLM-driven intention understanding with OCR for precise text localization. For complex workflows, PC-Agent implements Hierarchical Multi-agent Collaboration across three levels: the Manager Agent decomposes instructions into parameterized subtasks and manages dependencies; the Progress Agent tracks operation progress within subtasks; and the Decision Agent executes step-by-step actions based on environmental perception and progress information. This hierarchical division effectively reduces decision-making complexity by breaking complex tasks into manageable components with clear interdependencies.
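The Manager Agent's job of decomposing an instruction into parameterized subtasks with dependencies can be illustrated with a small scheduling sketch. This is not PC-Agent's code: the Subtask class, run_instruction, and the fake_execute stub (which stands in for the Decision Agent acting on the GUI) are all hypothetical names used only to show how outputs of earlier subtasks can fill the parameters of later ones.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class Subtask:
    name: str
    template: str                              # may contain {placeholders}
    depends_on: list = field(default_factory=list)

def run_instruction(subtasks, execute):
    """Run subtasks in dependency order, passing earlier outputs forward."""
    by_name = {s.name: s for s in subtasks}
    # Map each subtask to the set of subtasks that must finish first.
    graph = {s.name: set(s.depends_on) for s in subtasks}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        task = by_name[name]
        # Fill the parameters from the outputs of the prerequisite subtasks.
        params = {dep: results[dep] for dep in task.depends_on}
        results[name] = execute(task.template.format(**params))
    return results

def fake_execute(text):
    # Stub for the step-by-step GUI execution of a single subtask.
    return "$42" if text.startswith("Look up") else f"done: {text}"

plan = [
    Subtask("email", "Write an email quoting {price}", depends_on=["price"]),
    Subtask("price", "Look up the item price in the spreadsheet"),
]
results = run_instruction(plan, fake_execute)
print(results)
```

Even though the email subtask is listed first, it runs second because it depends on the price lookup, and the looked-up value is substituted into its template.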

Experimental results demonstrate PC-Agent’s superior performance compared to both single and multi-agent alternatives. Single MLLM-based agents (GPT-4o, Gemini-2.0, Claude-3.5, Qwen2.5-VL) consistently fail on complex instructions, with even the best performer achieving only a 12% success rate, confirming that single-agent approaches struggle with lengthy operational sequences and complex dependencies. Multi-agent frameworks like UFO and AgentS show modest improvements but remain limited by perception deficiencies and dependency management issues. They struggle with fine-grained operations such as text editing in Word or proper data entry in Excel, and often fail to utilize information from previous subtasks. In contrast, PC-Agent significantly outperforms all previous methods, surpassing UFO by 44% and AgentS by 32% in success rate through its Active Perception Module and hierarchical multi-agent collaboration.

This study introduces the PC-Agent framework, a significant advancement in handling complex PC-based tasks through three key innovations. The Active Perception Module provides refined perception and operation capabilities, enabling precise interaction with GUI elements and text. The hierarchical multi-agent collaboration architecture effectively decomposes decision-making across instruction, subtask, and action levels, while reflection-based dynamic decision-making allows for real-time error detection and correction. Validation through the newly created PC-Eval benchmark with realistic, complex instructions confirms PC-Agent’s superior performance compared to previous methods, demonstrating its effectiveness in navigating the intricate workflows and interactive environments characteristic of PC productivity scenarios.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post Meet PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC appeared first on MarkTechPost.

Researchers from the University of Cambridge and Monash University Int …

Posted on March 16, 2025 by i-genie

Reasoning capabilities have become essential for LLMs, but analyzing these complex processes poses a significant challenge. While LLMs can generate detailed text reasoning output, the lack of process visualization creates barriers to understanding, evaluating, and improving them. This limitation manifests in three critical ways: increased cognitive load for users attempting to parse complex reasoning paths; difficulty detecting logical fallacies, circular reasoning, and missing steps that remain obscured in lengthy text outputs; and restrictions on downstream applications due to the absence of standardized visualization frameworks. There is therefore a need for unified visualization solutions that can effectively illustrate diverse reasoning methodologies across the growing ecosystem of LLM providers and models.

Existing methods like sequential reasoning show step-by-step problem decomposition and have evolved through several variants. Tree-based approaches like Tree-of-Thoughts enable state-based branching for parallel path exploration, while Beam Search reasoning evaluates solution paths based on scoring mechanisms. Further, current visualization approaches fall into two categories: model behavior analysis and reasoning process illustration. Tools like BertViz and Transformers Interpret provide detailed visualizations of attention mechanisms but are limited to low-level model behaviors. Frameworks such as LangGraph offer basic flow visualization without supporting diverse reasoning methodologies, while general-purpose tools like Graphviz and Mermaid lack specific adaptations for LLM reasoning analysis.

Researchers from the University of Cambridge and Monash University have proposed ReasonGraph, a web-based platform for visualizing and analyzing LLM reasoning processes. It supports sequential and tree-based reasoning methods while seamlessly integrating with major LLM providers and over fifty state-of-the-art models. ReasonGraph incorporates an intuitive UI with meta reasoning method selection, configurable visualization parameters, and a modular framework that facilitates efficient extension. By providing a unified visualization framework, ReasonGraph effectively reduces cognitive load in analyzing complex reasoning paths, improves error detection in logical processes, and enables more effective development of LLM-based applications.

ReasonGraph uses a modular framework that provides extensible reasoning visualization through a clear separation of components. The front-end tier handles visualization logic and user interaction, implementing an asynchronous event handling module in which user interactions with method selection and parameter configuration trigger corresponding state updates. The backend framework is organized around three core modules implemented in Flask: a Configuration Manager for state updates, an API Factory for LLM integration, and a Reasoning Methods module for reasoning approach encapsulation. Framework modularity exists at both API and reasoning method levels, with the API Factory providing a unified interface for multiple LLM providers through the BaseAPI class.
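The factory idea can be sketched in a few lines of Python. Only the BaseAPI class name comes from the description above; the generate method, the APIFactory registry, and the EchoAPI stand-in provider are assumptions made for illustration and may differ from ReasonGraph's actual code.

```python
from abc import ABC, abstractmethod

class BaseAPI(ABC):
    """Common interface every provider wrapper implements (method name assumed)."""

    def __init__(self, api_key, model):
        self.api_key = api_key
        self.model = model

    @abstractmethod
    def generate(self, prompt):
        """Return the model's text response for `prompt`."""

class EchoAPI(BaseAPI):
    """Stand-in provider that echoes the prompt instead of calling a service."""

    def generate(self, prompt):
        return f"[{self.model}] {prompt}"

class APIFactory:
    """Hypothetical registry that maps provider names to BaseAPI subclasses."""

    _providers = {}

    @classmethod
    def register(cls, name, api_cls):
        cls._providers[name] = api_cls

    @classmethod
    def create(cls, name, api_key, model):
        if name not in cls._providers:
            raise ValueError(f"Unknown provider: {name}")
        return cls._providers[name](api_key=api_key, model=model)

APIFactory.register("echo", EchoAPI)
client = APIFactory.create("echo", api_key="unused", model="demo-model")
reply = client.generate("What is 2 + 3?")
print(reply)
```

With this shape, supporting a new provider means writing one subclass and one register call, while the reasoning methods keep calling generate() unchanged.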

The evaluation of ReasonGraph shows the platform’s robustness in three key aspects. In parsing reliability, the rule-based XML parsing approach achieves nearly 100% accuracy in extracting and visualizing reasoning paths from properly formatted LLM outputs. For processing efficiency, the Mermaid-based visualization generation time is negligible compared to the LLM’s reasoning time, maintaining consistent performance across all six reasoning methods implemented in the platform. Regarding platform usability, preliminary feedback from open-source platform users shows that approximately 90% of users successfully used the platform without assistance, though these metrics continue to evolve as the user base expands and the platform undergoes regular updates.
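As a rough illustration of rule-based parsing followed by Mermaid generation, the sketch below turns a sequential reasoning trace into a Mermaid flowchart. The tag names reasoning, step, and answer and the function steps_to_mermaid are assumptions for this example; the platform's real output format is not specified here.

```python
import xml.etree.ElementTree as ET

def steps_to_mermaid(xml_text):
    """Convert <step> elements into a top-down Mermaid flowchart."""
    root = ET.fromstring(xml_text)
    lines = ["flowchart TD"]
    previous = None
    for i, step in enumerate(root.iter("step")):
        node = f"S{i}"
        # Double quotes would end the Mermaid label early, so swap them out.
        label = (step.text or "").strip().replace('"', "'")
        lines.append(f'    {node}["{label}"]')
        if previous is not None:
            lines.append(f"    {previous} --> {node}")
        previous = node
    return "\n".join(lines)

llm_output = (
    "<reasoning>"
    "<step>Read the problem</step>"
    "<step>Compute 2 + 3</step>"
    "<answer>5</answer>"
    "</reasoning>"
)
diagram = steps_to_mermaid(llm_output)
print(diagram)
```

Parsing of this kind only works on well-formed output, which matches the caveat above that the reported accuracy applies to properly formatted LLM outputs.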

In this paper, the researchers introduce ReasonGraph, a web-based platform that enables visualization and analysis of LLM reasoning processes across six mainstream methods and over 50 models. It achieves high usability across diverse applications in academia, education, and development through its modular framework and real-time visualization capabilities. Future work includes (a) using the open-source community to integrate additional reasoning methods and expand model API support, (b) developing the platform based on community feedback and user suggestions, (c) exploring downstream applications such as reasoning evaluation and educational tutorials, and (d) implementing editable nodes in the visualization flowcharts to enable direct modification of reasoning processes.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Researchers from the University of Cambridge and Monash University Introduce ReasonGraph: A Web-based Platform to Visualize and Analyze LLM Reasoning Processes appeared first on MarkTechPost.

Meet Attentive Reasoning Queries (ARQs): A Structured Approach to Enha …

Posted on March 16, 2025 by i-genie

Large Language Models (LLMs) have become crucial in customer support, automated content creation, and data retrieval. However, their effectiveness is often hindered by their inability to consistently follow detailed instructions across multiple interactions. This issue is particularly critical in high-stakes environments, such as financial services and customer support systems, where strict adherence to guidelines is essential. LLMs frequently struggle with instruction recall, leading to deviations from intended behaviors. They also generate misleading or incorrect information, commonly called hallucination, making their deployment challenging in scenarios requiring precise, context-aware decision-making.

Maintaining reasoning consistency in complex scenarios remains a challenge for LLMs. While they generate coherent responses to simple queries, their performance declines in multi-turn conversations influenced by past interactions. One key issue is alignment drift, where models gradually move away from original instructions, causing misinterpretation of guidelines and incorrect recommendations. Context forgetfulness is another concern, where models prioritize recent information over earlier details, often disregarding critical constraints. These factors contribute to errors that undermine the reliability of LLM-driven systems. Despite strategies like Chain-of-Thought (CoT) and verification-based prompting, existing methods do not provide enough structure to guide models reliably through complex tasks.

Various prompting techniques have been developed to improve instruction adherence. CoT prompting encourages step-by-step reasoning to enhance logical accuracy, while Chain-of-Verification requires explicit self-checking of outputs. Although these methods improve upon direct response generation, they lack mechanisms to reinforce domain-specific constraints and systematically prevent common failures. AI frameworks like LangChain add structural elements for tool integration and workflow automation but treat LLM reasoning as a black box, limiting their ability to enforce strict guidelines. The lack of mechanisms to prevent hallucination and instruction drift highlights the need for a more structured approach.

Researchers at Emcie Co Ltd. developed Attentive Reasoning Queries (ARQs) to address these shortcomings. This novel approach introduces a structured reasoning blueprint designed to guide LLMs systematically through predefined queries. Unlike free-form reasoning methods, ARQs implement a structured JSON schema that directs the model’s attention to specific decision points at critical moments. This design enables ARQs to enhance guideline adherence while minimizing failures caused by misinterpretation or loss of contextual details. To evaluate its effectiveness, the approach was tested within Parlant, a framework used for building customer-facing AI applications. Initial findings demonstrated that ARQs significantly improved instruction-following capabilities while mitigating hallucination-related errors.

The ARQ framework consists of multiple stages that collectively enhance reasoning performance. The first step involves issuing targeted, structured queries that remind the model of key constraints before response generation. These queries reinforce critical instructions, ensuring the model does not deviate from predefined guidelines. Next, the model processes a series of step-by-step queries to reinforce task-specific reasoning. In some implementations, an additional verification step follows, where the model checks its response against predefined correctness criteria before finalizing the output. This structured approach contrasts sharply with CoT prompting by incorporating explicit mechanisms to ensure consistency at every stage of the reasoning process.
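The stages above can be pictured as a JSON object whose keys are the queries the model must answer in order, ending with a self-check. The sketch below is illustrative only: the key names, ARQ_TEMPLATE, build_arq_prompt, and parse_arq_response are hypothetical and are not taken from the paper or from Parlant.

```python
import json

# Hypothetical template: each key is a query the model must answer, in order.
ARQ_TEMPLATE = {
    "active_guidelines": "Which guidelines apply to the customer's latest message?",
    "relevant_context": "Which earlier facts or constraints still matter?",
    "proposed_response": "What should the reply to the customer be?",
    "follows_guidelines": "Does the reply follow every active guideline? true or false.",
}

def build_arq_prompt(guidelines, conversation, template=ARQ_TEMPLATE):
    """Assemble a prompt that asks for a JSON answer to every query."""
    schema = json.dumps(template, indent=2)
    return (
        "Guidelines:\n" + "\n".join(f"- {g}" for g in guidelines) + "\n\n"
        "Conversation:\n" + conversation + "\n\n"
        "Reply with a JSON object that answers these queries, "
        "using the same keys in the same order:\n" + schema
    )

def parse_arq_response(raw, template=ARQ_TEMPLATE):
    """Parse the model's JSON reply and reject it if any query was skipped."""
    data = json.loads(raw)
    missing = [key for key in template if key not in data]
    if missing:
        raise ValueError(f"Response skipped required queries: {missing}")
    return data

prompt = build_arq_prompt(
    ["Never quote prices without checking the catalog."],
    "Customer: How much is the premium plan?",
)
# A hand-written reply standing in for real model output
fake_reply = json.dumps({
    "active_guidelines": ["Never quote prices without checking the catalog."],
    "relevant_context": "The customer asked about the premium plan.",
    "proposed_response": "Let me check the catalog for the current price.",
    "follows_guidelines": True,
})
parsed = parse_arq_response(fake_reply)
print(parsed["proposed_response"])
```

Forcing the model to restate the active guidelines before drafting a reply is what distinguishes this layout from free-form step-by-step reasoning.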

In an evaluation within the Parlant framework, using a controlled test environment of 87 distinct conversational scenarios, ARQs achieved a 90.2% success rate, outperforming both CoT reasoning (86.1%) and direct response generation (81.5%). The ARQ methodology excelled in addressing two critical failure modes: guideline re-application and hallucination prevention. Specifically, in cases where the model needed to reapply earlier instructions, ARQs achieved a 92.19% success rate, higher than CoT (87.81%) and direct response generation (85.31%). ARQs also reduced the occurrence of factual inaccuracies, with models prompted with ARQs exhibiting a 23% lower hallucination rate than those relying on standard CoT techniques. These results underscore the importance of structured reasoning approaches in improving LLM reliability.

Key takeaways from the research include:

ARQs improved instruction adherence, achieving a 90.2% success rate across 87 test cases, surpassing Chain-of-Thought (86.1%) and direct response generation (81.5%).

ARQs significantly reduced hallucination errors by 23% compared to CoT, making them particularly useful for business-critical AI applications requiring factual consistency.

In guideline re-application scenarios, ARQs outperformed CoT by 4.38 percentage points, achieving a success rate of 92.19% compared to CoT’s 87.81%.

The structured nature of ARQs allowed for more efficient reasoning in classification tasks, reducing token usage by 29% compared to CoT.

The verification mechanism in ARQs was key to preventing alignment drift. It ensured that models focused on predefined constraints even in extended conversations.

Future research aims to optimize ARQ efficiency further by refining query design and exploring its application in diverse AI-driven decision-making systems.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post Meet Attentive Reasoning Queries (ARQs): A Structured Approach to Enhancing Large Language Model Instruction Adherence, Decision-Making Accuracy, and Hallucination Prevention in AI-Driven Conversational Systems appeared first on MarkTechPost.

Allen Institute for AI (AI2) Releases OLMo 32B: A Fully Open Model to …

Posted on March 15, 2025 by i-genie

The rapid evolution of artificial intelligence (AI) has ushered in a new era of large language models (LLMs) capable of understanding and generating human-like text. However, the proprietary nature of many of these models poses challenges for accessibility, collaboration, and transparency within the research community. Additionally, the substantial computational resources required to train such models often limit participation to well-funded organizations, thereby hindering broader innovation.

Addressing these concerns, the Allen Institute for AI (AI2) has introduced OLMo 2 32B, the latest and most advanced model in the OLMo 2 series. This model distinguishes itself as the first fully open model to surpass GPT-3.5 Turbo and GPT-4o mini across a suite of widely recognized, multi-skill academic benchmarks. By making all data, code, weights, and training details freely available, AI2 promotes a culture of openness and collaboration, enabling researchers worldwide to build upon this work.

OLMo 2 32B’s architecture comprises 32 billion parameters, reflecting a significant scaling from its predecessors. The training process was meticulously structured in two primary phases: pretraining and mid-training. During pretraining, the model was exposed to approximately 3.9 trillion tokens from diverse sources, including DCLM, Dolma, Starcoder, and Proof Pile II, ensuring a comprehensive understanding of language patterns. The mid-training phase utilized the Dolmino dataset, which consists of 843 billion tokens curated for quality, encompassing educational, mathematical, and academic content. This phased approach ensured that OLMo 2 32B developed a robust and nuanced grasp of language.

A notable aspect of OLMo 2 32B is its training efficiency. The model achieved performance levels comparable to leading open-weight models while utilizing only a fraction of the computational resources. Specifically, it required approximately one-third of the training compute compared to models like Qwen 2.5 32B, highlighting AI2’s commitment to resource-efficient AI development.

In benchmark evaluations, OLMo 2 32B demonstrated impressive results. It matched or exceeded the performance of models such as GPT-3.5 Turbo, GPT-4o mini, Qwen 2.5 32B, and Mistral 24B. Furthermore, it approached the performance levels of larger models like Qwen 2.5 72B and Llama 3.1 and 3.3 70B. These assessments spanned various tasks, including Massive Multitask Language Understanding (MMLU), mathematics problem-solving (MATH), and instruction-following evaluations (IFEval), underscoring the model’s versatility and competence across diverse linguistic challenges.

The release of OLMo 2 32B signifies a pivotal advancement in the pursuit of open and accessible AI. By providing a fully open model that not only competes with but also surpasses certain proprietary models, AI2 exemplifies how thoughtful scaling and efficient training methodologies can lead to significant breakthroughs. This openness fosters a more inclusive and collaborative environment, empowering researchers and developers globally to engage with and contribute to the evolving landscape of artificial intelligence.

Check out the Technical Details, HF Project and GitHub Page. All credit for this research goes to the researchers of this project.

The post Allen Institute for AI (AI2) Releases OLMo 32B: A Fully Open Model to Beat GPT 3.5 and GPT-4o mini on a Suite of Multi-Skill Benchmarks appeared first on MarkTechPost.

This AI Paper Introduces BD3-LMs: A Hybrid Approach Combining Autoregr …

Posted on March 15, 2025 by i-genie

Traditional language models rely on autoregressive approaches, which generate text sequentially, ensuring high-quality outputs at the expense of slow inference speeds. In contrast, diffusion models, initially developed for image and video generation, have gained attention in text generation due to their potential for parallelized generation and improved controllability. However, existing diffusion models struggle with fixed-length constraints and inefficiencies in likelihood modeling, limiting their effectiveness in generating flexible-length text.

A major challenge in language modeling is balancing efficiency and quality. Autoregressive models capture long-range dependencies effectively but suffer from slow token-by-token generation. Diffusion models, while promising, require multiple inference steps and typically generate fixed-length outputs. This limitation prevents them from being practical for real-world applications where variable-length sequences are necessary. The research addresses this issue by proposing a method that combines the strengths of both autoregressive and diffusion models, ensuring efficient and high-quality text generation without compromising flexibility.

Current methods primarily involve autoregressive models, which generate text one token at a time based on previously generated tokens. While these models achieve high fluency and coherence, they are inherently slow due to their sequential processing nature. Diffusion-based approaches have been explored as an alternative, offering parallel generation. However, existing diffusion models generate fixed-length sequences and lack efficient means of extending beyond predefined contexts. As a result, despite the slow inference of autoregressive methods, the limited scalability of diffusion models has led to continued reliance on them.

Cornell Tech and Stanford University researchers introduced **Block Discrete Denoising Diffusion Language Models (BD3-LMs)** to overcome these limitations. This new class of models interpolates between autoregressive and diffusion models by employing a structured approach that supports variable-length generation while maintaining inference efficiency. BD3-LMs use key-value caching and parallel token sampling to reduce computational overhead. The model is designed with specialized training algorithms that minimize gradient variance through customized noise schedules, optimizing performance across diverse language modeling benchmarks.

BD3-LMs operate by structuring text generation into blocks rather than individual tokens. Unlike traditional autoregressive models, which predict the next token sequentially, BD3-LMs generate a block of tokens simultaneously, significantly improving efficiency. A diffusion-based denoising process within each block ensures high-quality text generation while preserving coherence. The model architecture integrates transformers with a block-causal attention mechanism, allowing each block to condition on previously generated blocks. This approach enhances both contextual relevance and fluency. The training process includes a vectorized implementation that enables parallel computations, reducing training time and resource consumption. To address the high-variance issue in diffusion models, the researchers introduced data-driven noise schedules that stabilize training and improve gradient estimation.
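The block-causal attention pattern described above can be illustrated with a small mask builder. This is a simplified sketch, not the authors' implementation: the function name and the plain nested-list representation are assumptions made for illustration, and the paper's vectorized training setup may use a more elaborate mask.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Illustrative only: a token sees every token in its own block and in
    all earlier blocks, but nothing from later blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Compare block indices rather than token indices.
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With a block size of 1 this reduces to the usual causal mask of an autoregressive model, and with a block size equal to the sequence length every token can attend to every other token, which is one way to picture how the block size interpolates between the two regimes.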

Performance evaluations of BD3-LMs demonstrate substantial improvements over existing discrete diffusion models. The model achieves state-of-the-art perplexity scores among diffusion-based language models while enabling the generation of arbitrary-length sequences. In experiments conducted on language modeling benchmarks, BD3-LMs reduce perplexity by up to 13% compared to previous diffusion models. On the LM1B dataset, BD3-LMs achieved a perplexity of 28.23 when using a block size of four, outperforming previous models such as MDLM, which had a perplexity of 31.78. On OpenWebText, BD3-LMs attained a perplexity of 20.73, significantly better than other discrete diffusion models. Further, BD3-LMs generated sequences up to 10 times longer than those produced by traditional diffusion methods, demonstrating superior scalability. The proposed model also reduced the number of function evaluations required for inference, achieving improved sample efficiency and generation speed.
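As a quick check on the LM1B numbers quoted above, the relative perplexity reduction can be computed directly. The helper name is an assumption; the two perplexity values are the ones reported in this post.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


# Values quoted above: MDLM 31.78 vs. BD3-LM (block size four) 28.23 on LM1B.
lm1b_reduction = relative_reduction(baseline=31.78, improved=28.23)
print(f"LM1B perplexity reduction vs. MDLM: {lm1b_reduction:.1f}%")
```

This works out to roughly 11%, which is consistent with the "up to 13%" figure cited across benchmarks.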

The introduction of BD3-LMs presents a significant advancement in language modeling by integrating autoregressive and diffusion-based methodologies. By addressing key challenges related to inference efficiency, likelihood estimation, and sequence flexibility, this research offers a practical and scalable solution for text generation. BD3-LMs improve training stability and computational efficiency, providing a framework that can be extended to future language modeling developments. The results highlight the effectiveness of BD3-LMs in bridging the gap between autoregressive and diffusion-based approaches, offering an optimized balance between quality and speed in text generation.

Check out the Paper, Project and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post This AI Paper Introduces BD3-LMs: A Hybrid Approach Combining Autoregressive and Diffusion Models for Scalable and Efficient Text Generation appeared first on MarkTechPost.

Optimizing Test-Time Compute for LLMs: A Meta-Reinforcement Learning A …

Posted on March 15, 2025 by i-genie

Enhancing the reasoning abilities of LLMs by optimizing test-time compute is a critical research challenge. Current approaches primarily rely on fine-tuning models with search traces or RL using binary outcome rewards. However, these methods may not fully exploit test-time compute efficiently. Recent research suggests that increasing test-time compute can improve reasoning by generating longer solution traces and incorporating structured steps such as reflection, planning, and algorithmic search. Two key questions remain open: whether LLMs allocate computational resources effectively based on task complexity, and whether they discover solutions to more difficult problems when given a larger test-time compute budget. Addressing these is crucial for improving efficiency and generalization in LLM reasoning.

Recent advancements in scaling test-time compute have explored training separate verifiers for selection-based methods like best-of-N or beam search, which can sometimes be more effective than increasing data or model size. However, fine-tuning on unfamiliar search traces may lead to memorization rather than genuine reasoning improvements. RL-based approaches have demonstrated promise in generating chain-of-thought reasoning, enabling models to introspect, plan, and refine their outputs. However, increasing reasoning length does not always correlate with higher accuracy, as models may generate unnecessarily long sequences without meaningful progress. To address this, recent efforts have incorporated structured reward mechanisms and length penalties to encourage efficient reasoning, ensuring that models focus on producing informative, concise solutions rather than excessive computation.

Researchers from Carnegie Mellon University & Hugging Face investigate optimizing test-time compute for LLMs by refining how models allocate computational resources during reasoning. Instead of relying solely on outcome-reward RL, they introduce a fine-tuning approach that balances exploration and exploitation, ensuring steady progress toward correct answers. Their method incorporates a dense reward bonus to quantify progress, improving efficiency. Evaluations on mathematical benchmarks demonstrate that this approach significantly outperforms existing methods, enhancing both accuracy and token efficiency. Their findings also suggest that optimizing for progress minimizes computational regret while improving solution discovery without sacrificing accuracy.

The problem of optimizing test-time compute is framed as a meta reinforcement learning (meta RL) challenge. The goal is to maximize an LLM’s performance within a given test-time token budget by balancing exploration and exploitation. Instead of solely optimizing for outcomes, the proposed Meta Reinforcement Fine-Tuning (MRT) approach minimizes cumulative regret by rewarding progress across sequential episodes. This budget-agnostic strategy allows LLMs to make steady progress regardless of training constraints. By incorporating a reward bonus based on incremental improvements, MRT ensures efficient test-time compute usage, enhancing adaptability and response accuracy within deployment constraints.
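The idea of rewarding progress across episodes can be sketched with a toy calculation. This is only an illustration of the concept as described in this post, not the formula from the paper: the function names, the use of estimated success probabilities as the progress signal, and the weighting factor `alpha` are all assumptions.

```python
def progress_bonuses(success_probs: list[float]) -> list[float]:
    """Per-episode progress: the change in estimated success probability."""
    return [curr - prev for prev, curr in zip(success_probs, success_probs[1:])]


def shaped_reward(outcome_reward: float, success_probs: list[float], alpha: float) -> float:
    """Outcome reward plus a weighted sum of per-episode progress (hypothetical form)."""
    return outcome_reward + alpha * sum(progress_bonuses(success_probs))


def cumulative_regret(success_probs: list[float], best_possible: float = 1.0) -> float:
    """Total shortfall against an oracle that succeeds in every episode."""
    return sum(best_possible - p for p in success_probs)


# Hypothetical trace: estimated chance of a correct answer after each episode.
trace = [0.1, 0.3, 0.6, 0.9]
print(progress_bonuses(trace))
print(shaped_reward(outcome_reward=1.0, success_probs=trace, alpha=0.5))
print(cumulative_regret(trace))
```

A trace that climbs quickly accumulates less regret than one that reaches the same final probability late, which is the intuition behind rewarding steady progress rather than the final outcome alone.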

The study evaluates the effectiveness of MRT in optimizing test-time computation, with a focus on achieving high accuracy while maintaining computational efficiency. The study presents key findings, compares MRT’s efficiency with prior methods, and conducts ablation experiments on token budget and progress. MRT consistently outperforms baseline models and outcome-reward RL (GRPO), achieving state-of-the-art results in its size category. It also improves out-of-distribution robustness and delivers larger performance gains with weaker models. Furthermore, MRT significantly enhances token efficiency, requiring fewer tokens for comparable accuracy. Additional experiments highlight its effectiveness in backtracking search and linearized evaluations.

In conclusion, the study reframes optimizing test-time compute as a meta reinforcement learning (meta RL) problem, introducing cumulative regret as a key metric. State-of-the-art outcome-reward RL models fail to minimize regret, often struggling with novel queries within a token budget. This limitation arises from training solely with outcome rewards, which lack the granularity to guide stepwise progress. To address this, MRT is proposed, incorporating a dense reward bonus that encourages incremental improvement. MRT enhances test-time compute efficiency, achieving 2-3x better performance and 1.5x greater token efficiency in mathematical reasoning compared to outcome-reward RL, though several open questions remain.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post Optimizing Test-Time Compute for LLMs: A Meta-Reinforcement Learning Approach with Cumulative Regret Minimization appeared first on MarkTechPost.

Getting started with computer use in Amazon Bedrock Agents

Posted on March 15, 2025 by i-genie

Computer use is a breakthrough capability from Anthropic that allows foundation models (FMs) to visually perceive and interpret digital interfaces. This capability enables Anthropic’s Claude models to identify what’s on a screen, understand the context of UI elements, and recognize actions that should be performed such as clicking buttons, typing text, scrolling, and navigating between applications. However, the model itself doesn’t execute these actions—it requires an orchestration layer to safely implement the supported actions.
Today, we’re announcing computer use support within Amazon Bedrock Agents using Anthropic’s Claude 3.5 Sonnet V2 and Anthropic’s Claude 3.7 Sonnet models on Amazon Bedrock. This integration brings Anthropic’s visual perception capabilities as a managed tool within Amazon Bedrock Agents, providing you with a secure, traceable, and managed way to implement computer use automation in your workflows.
Organizations across industries struggle with automating repetitive tasks that span multiple applications and systems of record. Whether processing invoices, updating customer records, or managing human resource (HR) documents, these workflows often require employees to manually transfer information between different systems – a process that’s time-consuming, error-prone, and difficult to scale.
Traditional automation approaches require custom API integrations for each application, creating significant development overhead. Computer use capabilities change this paradigm by allowing machines to perceive existing interfaces just as humans do.
In this post, we create a computer use agent demo that provides the critical orchestration layer that transforms computer use from a perception capability into actionable automation. Without this orchestration layer, computer use would only identify potential actions without executing them. The computer use agent demo powered by Amazon Bedrock Agents provides the following benefits:

Secure execution environment – Execution of computer use tools in a sandbox environment with limited access to the AWS ecosystem and the web. Note that Amazon Bedrock Agents does not currently provide a sandbox environment itself.
Comprehensive logging – Ability to track each action and interaction for auditing and debugging
Detailed tracing capabilities – Visibility into each step of the automated workflow
Simplified testing and experimentation – Reduced risk when working with this experimental capability through managed controls
Seamless orchestration – Coordination of complex workflows across multiple systems without custom code

This integration combines Anthropic’s perceptual understanding of digital interfaces with the orchestration capabilities of Amazon Bedrock Agents, creating a powerful agent for automating complex workflows across applications. Rather than build custom integrations for each system, developers can now create agents that perceive and interact with existing interfaces in a managed, secure way.
With computer use, Amazon Bedrock Agents can automate tasks through basic GUI actions and built-in Linux commands. For example, your agent could take screenshots, create and edit text files, and run built-in Linux commands. Using Amazon Bedrock Agents and compatible Anthropic Claude models, you can use the following action groups:

Computer tool – Enables interactions with user interfaces (clicking, typing, scrolling)
Text editor tool – Provides capabilities to edit and manipulate files
Bash – Allows execution of built-in Linux commands

Solution overview
An example computer use workflow consists of the following steps:

Create an Amazon Bedrock agent and use natural language to describe what the agent should do and how it should interact with users, for example: “You are computer use agent capable of using Firefox web browser for web search.”
Add the Amazon Bedrock Agents supported computer use action groups to your agent using CreateAgentActionGroup API.
Invoke the agent with a user query that requires computer use tools, for example, “What is Amazon Bedrock, can you search the web?”
The Amazon Bedrock agent uses the tool definitions at its disposal and decides to use the computer action group to take a screenshot of the environment. Using the return control capability of Amazon Bedrock Agents, the agent then responds with the tool or tools that it wants to execute. The return control capability is required for using computer use with Amazon Bedrock Agents.
The workflow parses the agent response and executes the tool returned in a sandbox environment. The output is given back to the Amazon Bedrock agent for further processing.
The Amazon Bedrock agent continues to respond with tools at its disposal until the task is complete.
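The loop in the last three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the demo's code: `invoke_agent` and `execute_tool` are hypothetical callables, and the `tool_requests` and `final_text` keys are made-up names. In a real deployment the agent response is an event stream and the sandbox call goes over the network; both are abstracted away here so the control flow stays visible.

```python
from typing import Callable


def run_return_control_loop(
    invoke_agent: Callable[[str, list[dict]], dict],
    execute_tool: Callable[[dict], dict],
    user_query: str,
    max_turns: int = 10,
) -> str:
    """Keep executing requested tools until the agent returns a final answer."""
    tool_results: list[dict] = []
    for _ in range(max_turns):
        response = invoke_agent(user_query, tool_results)
        requests = response.get("tool_requests", [])
        if not requests:
            # No more tools requested: the task is complete.
            return response.get("final_text", "")
        # Run each requested tool in the sandbox and hand the output back.
        tool_results = [execute_tool(request) for request in requests]
    raise RuntimeError("Agent did not finish within max_turns")


# Stand-ins for the agent and the sandbox, for illustration only.
def fake_agent(query: str, tool_results: list[dict]) -> dict:
    if not tool_results:
        return {"tool_requests": [{"tool": "computer", "action": "screenshot"}]}
    return {"final_text": f"Done after {len(tool_results)} tool call(s)."}


def fake_execute(request: dict) -> dict:
    return {"tool": request["tool"], "output": f"ran {request['action']}"}


answer = run_return_control_loop(fake_agent, fake_execute, "What is Amazon Bedrock?")
print(answer)
```

The `max_turns` cap is a design choice for the sketch: it bounds how long an agent can keep requesting tools before a human needs to look at what is happening.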

You can recreate this example in the us-west-2 AWS Region with the AWS Cloud Development Kit (AWS CDK) by following the instructions in the GitHub repository. This demo deploys a containerized application using AWS Fargate across two Availability Zones in the us-west-2 Region. The infrastructure operates within a virtual private cloud (VPC) containing public subnets in each Availability Zone, with an internet gateway providing external connectivity. The architecture is complemented by essential supporting services, including AWS Key Management Service (AWS KMS) for security and Amazon CloudWatch for monitoring, creating a resilient, serverless container environment that alleviates the need to manage underlying infrastructure while maintaining robust security and high availability.
The following diagram illustrates the solution architecture.

At the core of our solution are two Fargate containers managed through Amazon Elastic Container Service (Amazon ECS), each protected by its own security group. The first is our orchestration container, which not only handles the communication between Amazon Bedrock Agents and end users, but also orchestrates the workflow that enables tool execution. The second is our environment container, which serves as a secure sandbox where the Amazon Bedrock agent can safely run its computer use tools. The environment container has limited access to the rest of the ecosystem and the internet. We utilize service discovery to connect Amazon ECS services with DNS names.
The orchestration container includes the following components:

Streamlit UI – The Streamlit UI that facilitates interaction between the end user and computer use agent
Return control loop – The workflow responsible for parsing the tools that the agent wants to execute and returning the output of these tools

The environment container includes the following components:

UI and pre-installed applications – A lightweight UI and pre-installed Linux applications like Firefox that can be used to complete the user’s tasks
Tool implementation – Code that executes computer use tool actions in the environment, such as “screenshot” or “double-click”
Quart (RESTful) JSON API – A Quart-based API that the orchestration container calls to execute tools in the sandbox environment
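To make the tool implementation component more concrete, the following sketch shows one way a sandbox could map action names to handlers. Everything here is hypothetical: the registry, the handler names, and the request and response shapes are assumptions for illustration, and the handlers are placeholders rather than code that drives a real display.

```python
from typing import Callable

# Hypothetical registry mapping action names to handler functions.
TOOL_HANDLERS: dict[str, Callable[[dict], dict]] = {}


def tool(name: str):
    """Decorator that registers a handler under an action name."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        TOOL_HANDLERS[name] = fn
        return fn
    return register


@tool("screenshot")
def take_screenshot(params: dict) -> dict:
    # Placeholder: a real handler would capture the sandbox display.
    return {"status": "ok", "image_base64": ""}


@tool("double_click")
def double_click(params: dict) -> dict:
    # Placeholder: a real handler would send the click to the display.
    x, y = params["coordinate"]
    return {"status": "ok", "clicked_at": [x, y]}


def dispatch(request: dict) -> dict:
    """Route a tool request to its handler, rejecting unknown actions."""
    handler = TOOL_HANDLERS.get(request.get("action"))
    if handler is None:
        return {"status": "error", "message": f"unsupported action: {request.get('action')}"}
    return handler(request)


print(dispatch({"action": "double_click", "coordinate": [10, 20]}))
print(dispatch({"action": "format_disk"}))
```

Rejecting anything outside the registry keeps the set of actions the sandbox will perform explicit, which fits the limited-privilege posture described above.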

The following diagram illustrates these components.

Prerequisites

AWS Command Line Interface (AWS CLI); follow the instructions here. Make sure to set up credentials; follow the instructions here.
Python 3.11 or later.
Node.js 14.15.0 or later.
AWS CDK CLI; follow the instructions here.
Model access enabled for Anthropic’s Claude 3.5 Sonnet V2 and Anthropic’s Claude 3.7 Sonnet.
Boto3 version 1.37.10 or later.

Create an Amazon Bedrock agent with computer use
You can use the following code sample to create a simple Amazon Bedrock agent with computer, bash, and text editor action groups. It is crucial to provide a compatible action group signature when using Anthropic’s Claude 3.5 Sonnet V2 and Anthropic’s Claude 3.7 Sonnet as highlighted here.

Model – Action group signature

Anthropic’s Claude 3.5 Sonnet V2 – computer_20241022, text_editor_20241022, bash_20241022
Anthropic’s Claude 3.7 Sonnet – computer_20250124, text_editor_20250124, bash_20250124
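One way to avoid mixing signatures from different models is to derive them from a single lookup. The sketch below is an assumption about how you might organize this in your own code: the short model keys and the helper name are hypothetical, while the signature strings and the ANTHROPIC.* action group names come from this post.

```python
# Hypothetical short keys for the two supported models; the version suffixes
# are the ones listed in the table above.
TOOL_VERSION_BY_MODEL = {
    "claude-3-5-sonnet-v2": "20241022",
    "claude-3-7-sonnet": "20250124",
}


def action_group_types(model_key: str) -> dict[str, str]:
    """Return the tool type string for each parent action group signature."""
    version = TOOL_VERSION_BY_MODEL[model_key]
    return {
        "ANTHROPIC.Computer": f"computer_{version}",
        "ANTHROPIC.TextEditor": f"text_editor_{version}",
        "ANTHROPIC.Bash": f"bash_{version}",
    }


print(action_group_types("claude-3-7-sonnet"))
```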

import time

import boto3

# Placeholder: replace with the ARN of your Amazon Bedrock agent execution role
agent_role_arn = "arn:aws:iam::<account-id>:role/<agent-execution-role>"

# Step 1: Create the Amazon Bedrock agent client

bedrock_agent = boto3.client("bedrock-agent", region_name="us-west-2")

# Step 2: Create an agent

create_agent_response = bedrock_agent.create_agent(
    agentResourceRoleArn=agent_role_arn,  # Amazon Bedrock agent execution role
    agentName="computeruse",
    description="""Example agent for computer use.
    This agent should only operate on
    Sandbox environments with limited privileges.""",
    foundationModel="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    instruction="""You are computer use agent capable of using Firefox
    web browser for web search.""",
)
agent_id = create_agent_response["agent"]["agentId"]

time.sleep(30)  # wait for agent to be created

# Step 3.1: Create and attach computer action group

bedrock_agent.create_agent_action_group(
    actionGroupName="ComputerActionGroup",
    actionGroupState="ENABLED",
    agentId=agent_id,
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.Computer",
    parentActionGroupSignatureParams={
        "type": "computer_20250124",
        "display_height_px": "768",
        "display_width_px": "1024",
        "display_number": "1",
    },
)

# Step 3.2: Create and attach bash action group

bedrock_agent.create_agent_action_group(
    actionGroupName="BashActionGroup",
    actionGroupState="ENABLED",
    agentId=agent_id,
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.Bash",
    parentActionGroupSignatureParams={
        "type": "bash_20250124",
    },
)

# Step 3.3: Create and attach text editor action group

bedrock_agent.create_agent_action_group(
    actionGroupName="TextEditorActionGroup",
    actionGroupState="ENABLED",
    agentId=agent_id,
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.TextEditor",
    parentActionGroupSignatureParams={
        "type": "text_editor_20250124",
    },
)

# Step 3.4: Create a custom weather action group that uses return control

bedrock_agent.create_agent_action_group(
    actionGroupName="WeatherActionGroup",
    agentId=agent_id,
    agentVersion="DRAFT",
    actionGroupExecutor={
        "customControl": "RETURN_CONTROL",
    },
    functionSchema={
        "functions": [
            {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location.",
                "parameters": {
                    "location": {
                        "type": "string",
                        "description": "The city, e.g., San Francisco",
                        "required": True,
                    },
                    "unit": {
                        "type": "string",
                        "description": (
                            "The unit to use, e.g., fahrenheit or celsius. "
                            'Defaults to "fahrenheit"'
                        ),
                        "required": False,
                    },
                },
                "requireConfirmation": "DISABLED",
            }
        ]
    },
)

time.sleep(10)  # wait for the action groups to be attached

# Step 4: Prepare agent

bedrock_agent.prepare_agent(agentId=agent_id)
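The fixed `time.sleep` calls in the sample are simple but can be too short or needlessly long. As an alternative, the following sketch polls the agent status until it reaches an expected value. The helper name, defaults, and injected `sleep` parameter are assumptions; it relies on the `get_agent` call of the Boto3 `bedrock-agent` client, which returns the status under `agent.agentStatus`.

```python
import time


def wait_for_agent_status(
    client,
    agent_id: str,
    target_statuses: set[str],
    timeout_s: float = 120.0,
    poll_s: float = 5.0,
    sleep=time.sleep,
) -> str:
    """Poll get_agent until the agent reaches one of target_statuses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_agent(agentId=agent_id)["agent"]["agentStatus"]
        if status in target_statuses:
            return status
        if status == "FAILED":
            raise RuntimeError(f"Agent {agent_id} entered FAILED status")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Agent {agent_id} still {status} after {timeout_s}s")
        sleep(poll_s)


# Usage with the client and agent ID from the sample (not executed here):
# wait_for_agent_status(bedrock_agent, agent_id, {"NOT_PREPARED", "PREPARED"})
# ... create action groups, call prepare_agent ...
# wait_for_agent_status(bedrock_agent, agent_id, {"PREPARED"})
```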

Example use case
In this post, we demonstrate an example where we use Amazon Bedrock Agents with the computer use capability to complete a web form. In the example, the computer use agent can also switch Firefox tabs to interact with a customer relationship management (CRM) agent to get the required information to complete the form. Although this example uses a sample CRM application as the system of record, the same approach works with Salesforce, SAP, Workday, or other systems of record with the appropriate authentication frameworks in place.

In the demonstrated use case, you can observe how well the Amazon Bedrock agent performed with computer use tools. Our implementation completed the customer ID, customer name, and email by visually examining the Excel data. However, for the overview, it decided to select the cell and copy the data, because the information wasn’t completely visible on the screen. Finally, the CRM agent was used to get additional information on the customer.
Best practices
The following are some ways you can improve the performance for your use case:

Implement Security Groups, Network Access Control Lists (NACLs), and Amazon Route 53 Resolver DNS Firewall domain lists to control access to the sandbox environment.
Apply AWS Identity and Access Management (IAM) and the principle of least privilege to assign limited permissions to the sandbox environment.
When providing the Amazon Bedrock agent with instructions, be concise and direct. Specify simple, well-defined tasks and provide explicit instructions for each step.
Understand computer use limitations as highlighted by Anthropic here.
Complement return of control with user confirmation to help safeguard your application from malicious prompt injections by requesting confirmation from your users before invoking a computer use tool.
Use multi-agent collaboration and computer use with Amazon Bedrock Agents to automate complex workflows.
Implement safeguards by filtering harmful multimodal content based on your responsible AI policies for your application by associating Amazon Bedrock Guardrails with your agent.
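The user confirmation practice above can be illustrated with a small application-side gate that sits in front of tool execution. This is a hypothetical sketch: the function, the request shape, and the `ask_user` callback are assumptions for illustration and are separate from any confirmation mechanism that Amazon Bedrock Agents itself provides.

```python
from typing import Callable


def confirm_and_execute(
    request: dict,
    execute_tool: Callable[[dict], dict],
    ask_user: Callable[[str], str],
) -> dict:
    """Run the tool only if the user explicitly approves the request."""
    prompt = f"Allow the agent to run {request['tool']} action {request['action']!r}? [y/N] "
    if ask_user(prompt).strip().lower() not in {"y", "yes"}:
        # Anything other than an explicit yes is treated as a refusal.
        return {"status": "denied", "tool": request["tool"]}
    return execute_tool(request)


# Stand-in executor for illustration; pass `input` as ask_user in a terminal app.
def fake_execute(request: dict) -> dict:
    return {"status": "ok", "tool": request["tool"]}


request = {"tool": "bash", "action": "ls /tmp"}
print(confirm_and_execute(request, fake_execute, ask_user=lambda prompt: "n"))
print(confirm_and_execute(request, fake_execute, ask_user=lambda prompt: "yes"))
```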

Considerations
The computer use feature is made available to you as a beta service as defined in the AWS Service Terms. It is subject to your agreement with AWS and the AWS Service Terms, and the applicable model EULA. Computer use poses unique risks that are distinct from standard API features or chat interfaces. These risks are heightened when using the computer use feature to interact with the internet. To minimize risks, consider taking precautions such as:

Operate computer use functionality in a dedicated virtual machine or container with minimal privileges to minimize direct system exploits or accidents
To help prevent information theft, avoid giving the computer use API access to sensitive accounts or data
Limit the computer use API’s internet access to required domains to reduce exposure to malicious content
To enforce proper oversight, keep a human in the loop for sensitive tasks (such as making decisions that could have meaningful real-world consequences) and for anything requiring affirmative consent (such as accepting cookies, executing financial transactions, or agreeing to terms of service)
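As an illustration of limiting internet access to required domains, the following sketch checks a URL against an allowlist before the sandbox navigates to it. The domain list and function name are hypothetical, and an application-level check like this is a complement to, not a substitute for, network-level controls such as security groups and DNS firewall rules.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the domains your workflow requires.
ALLOWED_DOMAINS = {"aws.amazon.com", "docs.aws.amazon.com"}


def is_allowed(url: str, allowed: set[str] = ALLOWED_DOMAINS) -> bool:
    """Return True only for http(s) URLs whose host is an allowed domain or a subdomain of one."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"}:
        return False
    host = (parts.hostname or "").lower()
    # Match the exact domain or a true subdomain, never a lookalike suffix.
    return any(host == domain or host.endswith("." + domain) for domain in allowed)


print(is_allowed("https://docs.aws.amazon.com/bedrock/"))
print(is_allowed("https://aws.amazon.com.evil.example/"))
print(is_allowed("file:///etc/passwd"))
```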

Any content that you enable Anthropic’s Claude to see or access can potentially override instructions or cause the model to make mistakes or perform unintended actions. Taking proper precautions, such as isolating Anthropic’s Claude from sensitive surfaces, is essential – including to avoid risks related to prompt injection. Before enabling or requesting permissions necessary to enable computer use features in your own products, inform end users of any relevant risks, and obtain their consent as appropriate.
Clean up
When you are done using this solution, make sure to clean up all the resources. Follow the instructions in the provided GitHub repository.
Conclusion
Organizations across industries face significant challenges with cross-application workflows that traditionally require manual data entry or complex custom integrations. The integration of Anthropic’s computer use capability with Amazon Bedrock Agents represents a transformative approach to these challenges.
By using Amazon Bedrock Agents as the orchestration layer, organizations can alleviate the need for custom API development for each application, benefit from comprehensive logging and tracing capabilities essential for enterprise deployment, and implement automation solutions quickly.
As you begin exploring computer use with Amazon Bedrock Agents, consider workflows in your organization that could benefit from this approach. From invoice processing to customer onboarding, HR documentation to compliance reporting, the potential applications are vast and transformative.
We’re excited to see how you will use Amazon Bedrock Agents with the computer use capability to securely streamline operations and reimagine business processes through AI-driven automation.
Resources
To learn more, refer to the following resources:

Computer use with Amazon Bedrock Agents guide
Computer use with Amazon Bedrock Agents implementation
Computer use with Anthropic’s Claude implementation
Computer use with Anthropic guide
Amazon Bedrock Agent Samples

About the Authors
Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Maira Ladeira Tanke is a Tech Lead for Agentic workloads in Amazon Bedrock at AWS, where she enables customers on their journey to develop autonomous AI systems. With over 10 years of experience in AI/ML, Maira partners with enterprise customers at AWS to accelerate the adoption of agentic applications using Amazon Bedrock, helping organizations harness the power of foundation models to drive innovation and business transformation. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.
Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.
Adarsh Srikanth is a Software Development Engineer at Amazon Bedrock, where he develops AI agent services. He holds a master’s degree in computer science from USC and brings three years of industry experience to his role. He spends his free time exploring national parks, discovering new hiking trails, and playing various racquet sports.
Abishek Kumar is a Senior Software Engineer at Amazon, bringing over 6 years of valuable experience across both retail and AWS organizations. He has demonstrated expertise in developing generative AI and machine learning solutions, specifically contributing to key AWS services including SageMaker Autopilot, SageMaker Canvas, and Amazon Bedrock Agents. Throughout his career, Abishek has shown passion for solving complex problems and architecting large-scale systems that serve millions of customers worldwide. When not immersed in technology, he enjoys exploring nature through hiking and traveling adventures with his wife.
Krishna Gourishetti is a Senior Software Engineer for the Bedrock Agents team in AWS. He is passionate about building scalable software solutions that solve customer problems. In his free time, Krishna loves to go on hikes.

Evaluating RAG applications with Amazon Bedrock knowledge base evaluat …

Posted on March 15, 2025 by i-genie

Organizations building and deploying AI applications, particularly those using large language models (LLMs) with Retrieval Augmented Generation (RAG) systems, face a significant challenge: how to evaluate AI outputs effectively throughout the application lifecycle. As these AI technologies become more sophisticated and widely adopted, maintaining consistent quality and performance becomes increasingly complex.
Traditional AI evaluation approaches have significant limitations. Human evaluation, although thorough, is time-consuming and expensive at scale. Although automated metrics are fast and cost-effective, they can only evaluate the correctness of an AI response, without capturing other evaluation dimensions or providing explanations of why an answer is problematic. Furthermore, traditional automated evaluation metrics typically require ground truth data, which for many AI applications is difficult to obtain. Especially for those involving open-ended generation or retrieval augmented systems, defining a single “correct” answer is practically impossible. Finally, metrics such as ROUGE and F1 can be fooled by shallow linguistic similarities (word overlap) between the ground truth and the LLM response, even when the actual meaning is very different. These challenges make it difficult for organizations to maintain consistent quality standards across their AI applications, particularly for generative AI outputs.
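The word-overlap weakness is easy to reproduce. The following sketch computes a simple token-level F1 score between a reference and a response; the function is a minimal illustration written for this example rather than the exact metric used by any particular benchmark, and the two sentences are made up.

```python
from collections import Counter


def token_f1(reference: str, response: str) -> float:
    """Token-level F1 based on overlapping whitespace-separated words."""
    ref_tokens = reference.lower().split()
    resp_tokens = response.lower().split()
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(ref_tokens) & Counter(resp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the drug is safe for children"
response = "the drug is not safe for children"
score = token_f1(reference, response)
print(f"token F1 = {score:.2f}")
```

The response contradicts the reference, yet the score is about 0.92 because only one word differs. An LLM judge that reads both sentences for meaning is not limited in this way.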
Amazon Bedrock has recently launched two new capabilities to address these evaluation challenges: LLM-as-a-judge (LLMaaJ) under Amazon Bedrock Evaluations and a brand new RAG evaluation tool for Amazon Bedrock Knowledge Bases. Both features rely on the same LLM-as-a-judge technology under the hood, with slight differences depending on whether a model or a RAG application built with Amazon Bedrock Knowledge Bases is being evaluated. These evaluation features combine the speed of automated methods with human-like nuanced understanding, enabling organizations to:

Assess AI model outputs across various tasks and contexts
Evaluate multiple evaluation dimensions of AI performance simultaneously
Systematically assess both retrieval and generation quality in RAG systems
Scale evaluations across thousands of responses while maintaining quality standards

These capabilities integrate seamlessly into the AI development lifecycle, empowering organizations to improve model and application quality, promote responsible AI practices, and make data-driven decisions about model selection and application deployment.
This post focuses on RAG evaluation with Amazon Bedrock Knowledge Bases, provides a guide to set up the feature, discusses nuances to consider as you evaluate your prompts and responses, and finally discusses best practices. By the end of this post, you will understand how the latest Amazon Bedrock evaluation features can streamline your approach to AI quality assurance, enabling more efficient and confident development of RAG applications.
Key features
Before diving into the implementation details, we examine the key features that make the capabilities of RAG evaluation on Amazon Bedrock Knowledge Bases particularly powerful. The key features are:

Amazon Bedrock Evaluations

Evaluate Amazon Bedrock Knowledge Bases directly within the service
Systematically evaluate both retrieval and generation quality in RAG systems to guide changes to knowledge base build-time or runtime parameters

Comprehensive, understandable, and actionable evaluation metrics

Retrieval metrics: Assess context relevance and coverage using an LLM as a judge
Generation quality metrics: Measure correctness, faithfulness (to detect hallucinations), completeness, and more
Provide natural language explanations for each score in the output and on the console
Compare results across multiple evaluation jobs for both retrieval and generation
Metric scores are normalized to a range of 0 to 1

Scalable and efficient assessment

Scale evaluation across thousands of responses
Reduce costs compared to manual evaluation while maintaining high quality standards

Flexible evaluation framework

Support both ground truth and reference-free evaluations
Select from a variety of metrics for evaluation
Evaluate fine-tuned or distilled models on Amazon Bedrock
Choose from a selection of evaluator models

Model selection and comparison

Compare evaluation jobs across different generating models
Facilitate data-driven optimization of model performance

Responsible AI integration

Incorporate built-in responsible AI metrics such as harmfulness, answer refusal, and stereotyping
Seamlessly integrate with Amazon Bedrock Guardrails

These features enable organizations to comprehensively assess AI performance, promote responsible AI development, and make informed decisions about model selection and optimization throughout the AI application lifecycle. Now that we’ve explained the key features, we examine how these capabilities come together in a practical implementation.
Feature overview
The Amazon Bedrock Knowledge Bases RAG evaluation feature provides a comprehensive, end-to-end solution for assessing and optimizing RAG applications. This automated process uses the power of LLMs to evaluate both retrieval and generation quality, offering insights that can significantly improve your AI applications.
The workflow is as follows, as shown moving from left to right in the following architecture diagram:

Prompt dataset – Prepared set of prompts, optionally including ground truth responses
JSONL file – Prompt dataset converted to JSONL format for the evaluation job
Amazon Simple Storage Service (Amazon S3) bucket – Storage for the prepared JSONL file
Amazon Bedrock Knowledge Bases RAG evaluation job – Core component that processes the data, integrating with Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.
Automated report generation – Produces a comprehensive report with detailed metrics and insights at individual prompt or conversation level
Analyze the report to derive actionable insights for RAG system optimization

Designing holistic RAG evaluations: Balancing cost, quality, and speed
RAG system evaluation requires a balanced approach that considers three key aspects: cost, speed, and quality. Although Amazon Bedrock Evaluations primarily focuses on quality metrics, understanding all three components helps create a comprehensive evaluation strategy. The following diagram shows how these components interact and feed into a comprehensive evaluation strategy, and the next sections examine each component in detail.

Cost and speed considerations
The efficiency of RAG systems depends on model selection and usage patterns. Costs are primarily driven by data retrieval and token consumption during retrieval and generation, and speed depends on model size and complexity as well as prompt and context size. For applications that require high-performance content generation with lower latency and costs, model distillation can be an effective way to create a generator model: you can create smaller, faster models that maintain the quality of larger models for specific use cases.
Quality assessment framework
Amazon Bedrock knowledge base evaluation provides comprehensive insights through various quality dimensions:

Technical quality through metrics such as context relevance and faithfulness
Business alignment through correctness and completeness scores
User experience through helpfulness and logical coherence measurements
Responsible AI through built-in metrics such as harmfulness, stereotyping, and answer refusal

Establishing baseline understanding
Begin your evaluation process by choosing default configurations in your knowledge base (vector or graph database), such as default chunking strategies, embedding models, and prompt templates. These are just some of the possible options. This approach establishes a baseline performance, helping you understand your RAG system’s current effectiveness across available evaluation metrics before optimization. Next, create a diverse evaluation dataset. Make sure this dataset contains a diverse set of queries and knowledge sources that accurately reflect your use case. The diversity of this dataset will provide a comprehensive view of your RAG application performance in production.
Iterative improvement process
Understanding how different components affect these metrics enables informed decisions about:

Knowledge base configuration (chunking strategy or embedding size or model) and inference parameter refinement
Retrieval strategy modifications (semantic or hybrid search)
Prompt engineering refinements
Model selection and inference parameter configuration
Choice between different vector stores including graph databases

Continuous evaluation and improvement
Implement a systematic approach to ongoing evaluation:

Schedule regular offline evaluation cycles aligned with knowledge base updates
Track metric trends over time to identify areas for improvement
Use insights to guide knowledge base refinements and generator model customization and selection

Prerequisites
To use the knowledge base evaluation feature, make sure that you have satisfied the following requirements:

An active AWS account.
Selected evaluator and generator models enabled in Amazon Bedrock. You can confirm that the models are enabled for your account on the Model access page of the Amazon Bedrock console.
Confirm the AWS Regions where the model is available and quotas.
Complete the knowledge base evaluation prerequisites related to AWS Identity and Access Management (IAM) creation and add permissions for an S3 bucket to access and write output data.

You also need to set up and enable CORS on your S3 bucket.

Have an Amazon Bedrock knowledge base created and sync your data such that it’s ready to be used by a knowledge base evaluation job.
If you’re using a custom model instead of an on-demand model for your generator model, make sure you have sufficient quota for running a Provisioned Throughput during inference. Go to the Service Quotas console and check the following quotas:

Model units no-commitment Provisioned Throughputs across custom models
Model units per provisioned model for [your custom model name]
Both fields need to have enough quota to support your Provisioned Throughput model unit. Request a quota increase if necessary to accommodate your expected inference workload.
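
The CORS prerequisite above can also be applied programmatically. The following is a minimal sketch rather than an official setup script: the helper names `build_cors_configuration` and `enable_cors` are hypothetical, and the specific rule values (all headers, the four listed methods, and all origins) are an assumption about a permissive configuration that you should check against the knowledge base evaluation prerequisites before using. The only AWS call used is the S3 `put_bucket_cors` operation.

```python
def build_cors_configuration(allowed_origins=("*",)):
    """Return a CORS configuration in the shape S3's PutBucketCors expects.

    The rule values below are an assumed permissive setup, not a verified
    copy of the Amazon Bedrock requirements.
    """
    return {
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                "AllowedOrigins": list(allowed_origins),
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    }


def enable_cors(bucket_name, s3_client=None):
    """Apply the CORS configuration to a bucket (needs AWS credentials)."""
    if s3_client is None:
        import boto3  # imported lazily so the builder works without boto3

        s3_client = boto3.client("s3")
    s3_client.put_bucket_cors(
        Bucket=bucket_name,
        CORSConfiguration=build_cors_configuration(),
    )
```

Calling `enable_cors("<YOUR_BUCKET>")` replaces any CORS configuration already present on the bucket, so review existing rules first.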

Prepare input dataset
To prepare your dataset for a knowledge base evaluation job, make sure it meets the following requirements and uses the required format.

Dataset requirements:

Maximum 1,000 conversations per evaluation job (1 conversation is contained in the conversationTurns key in the dataset format)
Maximum 5 turns (prompts) per conversation
File must use JSONL format (.jsonl extension)
Each line must be a valid JSON object and complete prompt
Stored in an S3 bucket with CORS enabled
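
The requirements above are easy to check locally before uploading. The following sketch uses only the limits stated in the list (1,000 conversations, 5 turns, one JSON object per line); the function name `validate_jsonl_lines` is hypothetical, and it does not check the S3 or CORS requirement.

```python
import json

MAX_CONVERSATIONS = 1000  # per evaluation job, from the requirements above
MAX_TURNS = 5             # per conversation


def validate_jsonl_lines(lines):
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    records = [line for line in lines if line.strip()]
    if len(records) > MAX_CONVERSATIONS:
        problems.append(
            f"{len(records)} conversations exceeds the limit of {MAX_CONVERSATIONS}"
        )
    for number, line in enumerate(records, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"record {number}: not valid JSON ({error.msg})")
            continue
        turns = record.get("conversationTurns") if isinstance(record, dict) else None
        if not isinstance(turns, list) or not turns:
            problems.append(f"record {number}: missing or empty conversationTurns")
        elif len(turns) > MAX_TURNS:
            problems.append(
                f"record {number}: {len(turns)} turns exceeds the limit of {MAX_TURNS}"
            )
    return problems
```

For a file on disk, `validate_jsonl_lines(open("input.jsonl", encoding="utf-8"))` returns the same kind of report.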

Use the following format:

Retrieve only evaluation jobs.

Special note: On March 20, 2025, the referenceContexts key will change to referenceResponses. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages/chunks retrieved from the Knowledge Base.

{
    "conversationTurns": [{
        ## required for Context Coverage metric
        "referenceContexts": [{
            "content": [{
                "text": "This is a reference response used as ground truth"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}

Retrieve and generate evaluation jobs

{
    "conversationTurns": [{
        ## optional
        "referenceResponses": [{
            "content": [{
                "text": "This is a reference response used as ground truth"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
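
The record formats above are annotated and spread over several lines for readability, whereas the JSONL file needs one compact JSON object per line with no comments. The following sketch shows one way to generate such lines from plain prompt and reference pairs. The helper names `build_record` and `to_jsonl` are hypothetical, and the `reference_key` parameter is there so you can pass `"referenceContexts"` or `"referenceResponses"` depending on the job type.

```python
import json


def build_record(turns, reference_key="referenceResponses"):
    """Build one conversation from (prompt, reference) pairs; reference may be None."""
    conversation_turns = []
    for prompt_text, reference_text in turns:
        turn = {"prompt": {"content": [{"text": prompt_text}]}}
        if reference_text is not None:
            turn[reference_key] = [{"content": [{"text": reference_text}]}]
        conversation_turns.append(turn)
    return {"conversationTurns": conversation_turns}


def to_jsonl(conversations, reference_key="referenceResponses"):
    """Serialize conversations to JSONL text, one JSON object per line."""
    return "".join(
        json.dumps(build_record(turns, reference_key), ensure_ascii=False) + "\n"
        for turns in conversations
    )


# Example: two single-turn conversations, the second without ground truth
print(to_jsonl([
    [("This is a prompt", "This is a reference response used as ground truth")],
    [("This is another prompt", None)],
]))
```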

Start a knowledge base RAG evaluation job using the console
Amazon Bedrock Evaluations provides you with an option to run an evaluation job through a guided user interface on the console. To start an evaluation job through the console, follow these steps:

On the Amazon Bedrock console, under Inference and Assessment in the navigation pane, choose Evaluations and then choose Knowledge Bases.
Choose Create, as shown in the following screenshot.
Give an Evaluation name, a Description, and choose an Evaluator model, as shown in the following screenshot. This model will be used as a judge to evaluate the response of the RAG application.
Choose the knowledge base and the evaluation type, as shown in the following screenshot. Choose Retrieval only if you want to evaluate only the retrieval component and Retrieval and response generation if you want to evaluate the end-to-end retrieval and response generation. Select a model, which will be used for generating responses in this evaluation job.
(Optional) To change inference parameters, choose configurations. You can update or experiment with different values of temperature, top-P, update knowledge base prompt templates, associate guardrails, update search strategy, and configure numbers of chunks retrieved. The following screenshot shows the Configurations screen.
Choose the Metrics you would like to use to evaluate the RAG application, as shown in the following screenshot.
Provide the S3 URI, as shown in step 3 for evaluation data and for evaluation results. You can use the Browse S3 option.
Select a service (IAM) role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, the knowledge base in the job, and the models being used in the job. You can also create a new IAM role in the evaluation setup and the service will automatically give the role the proper permissions for the job.
Choose Create.
You will be able to check the In Progress status of the evaluation job on the Knowledge Base evaluations screen, as shown in the following screenshot.
Wait for the job to be complete. This could be 10–15 minutes for a small job or a few hours for a large job with hundreds of long prompts and all metrics selected. When the evaluation job has been completed, the status will show as Completed, as shown in the following screenshot.
When it’s complete, select the job, and you’ll be able to observe the details of the job. The following screenshot is the Metric summary.
You should also observe a directory with the evaluation job name in the Amazon S3 path. You can find the output S3 path from your job results page in the evaluation summary section.
You can compare two evaluation jobs to gain insights about how different configurations or selections are performing. You can view a radar chart comparing performance metrics between two RAG evaluation jobs, making it simple to visualize relative strengths and weaknesses across different dimensions, as shown in the following screenshot.

On the Evaluation details tab, examine score distributions through histograms for each evaluation metric, showing average scores and percentage differences. Hover over the histogram bars to check the number of conversations in each score range, helping identify patterns in performance, as shown in the following screenshots.
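
The same kind of comparison can be done offline once you have the average score per metric for two jobs. The following sketch computes percentage differences; the function name `compare_jobs` and the scores are made up for illustration, and how you extract the averages from the job output is left open.

```python
def compare_jobs(baseline, candidate):
    """Return {metric: (baseline, candidate, percent_change)} for shared metrics."""
    comparison = {}
    for metric in sorted(set(baseline) & set(candidate)):
        old, new = baseline[metric], candidate[metric]
        # Percent change is undefined when the baseline score is zero
        percent = None if old == 0 else (new - old) / old * 100.0
        comparison[metric] = (old, new, percent)
    return comparison


# Hypothetical average scores from two evaluation jobs
job_a = {"Builtin.Correctness": 0.80, "Builtin.Completeness": 0.90}
job_b = {"Builtin.Correctness": 0.88, "Builtin.Completeness": 0.81}

for metric, (old, new, percent) in compare_jobs(job_a, job_b).items():
    print(f"{metric}: {old:.2f} -> {new:.2f} ({percent:+.1f}%)")
```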

Start a knowledge base evaluation job using Python SDK and APIs
To use the Python SDK for creating a knowledge base evaluation job, follow these steps. First, set up the required configurations:

import boto3
from datetime import datetime

# Generate unique name for the job
job_name = f"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure your knowledge base and model settings
knowledge_base_id = "<YOUR_KB_ID>"
evaluator_model = "mistral.mistral-large-2402-v1:0"
generator_model = "anthropic.claude-3-sonnet-20240229-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"

# Specify S3 locations for evaluation data and output
input_data = "s3://<YOUR_BUCKET>/evaluation_data/input.jsonl"
output_path = "s3://<YOUR_BUCKET>/evaluation_output/"

# Configure retrieval settings
num_results = 10
search_type = "HYBRID"

# Create Bedrock client
bedrock_client = boto3.client('bedrock')

For retrieval-only evaluation, create a job that focuses on assessing the quality of retrieved contexts:

retrieval_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval performance",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveConfig": {
                    "knowledgeBaseId": knowledge_base_id,
                    "knowledgeBaseRetrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            "numberOfResults": num_results,
                            "overrideSearchType": search_type
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.ContextRelevance",
                    "Builtin.ContextCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

For a complete evaluation of both retrieval and generation, use this configuration:

retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                "overrideSearchType": search_type
                            }
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

To monitor the progress of your evaluation job, use the following code:

# Depending on the job type, retrieve the ARN of the job you created and monitor it to take any downstream actions.
# Keep only the line that matches your job; the second assignment would otherwise overwrite the first.
evaluation_job_arn = retrieval_job['jobArn']
# evaluation_job_arn = retrieve_generate_job['jobArn']

response = bedrock_client.get_evaluation_job(
    jobIdentifier=evaluation_job_arn
)
print(f"Job Status: {response['status']}")
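
Because a job can run from minutes to hours, a single status check is usually wrapped in a polling loop. The following is a sketch under stated assumptions: the helper name `wait_for_evaluation_job` is hypothetical, and the set of terminal status strings is an assumption to confirm against the `get_evaluation_job` response documentation. The client is passed in, so the loop works with the `bedrock_client` created earlier.

```python
import time

# Assumed terminal status values; verify against the API documentation
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(client, job_arn, poll_seconds=60, max_polls=240,
                            sleep=time.sleep):
    """Poll get_evaluation_job until a terminal status or until polls run out."""
    status = None
    for attempt in range(max_polls):
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before the next status check
    raise TimeoutError(f"Job still {status} after {max_polls} polls")
```

A call such as `wait_for_evaluation_job(bedrock_client, evaluation_job_arn)` blocks until the job finishes; the `sleep` parameter exists only so the wait can be replaced in tests.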

Interpreting results
After your evaluation jobs are completed, Amazon Bedrock RAG evaluation provides a detailed comparative dashboard across the evaluation dimensions.

The evaluation dashboard includes comprehensive metrics, but we focus on one example, the completeness histogram shown below. This visualization represents how well responses cover all aspects of the questions asked. In our example, we notice a distribution concentrated at the high end of the scale (left-skewed), with an average score of 0.921. The majority of responses (15) scored above 0.9, while a small number fell in the 0.5–0.8 range. This type of distribution helps quickly identify if your RAG system has consistent performance or if there are specific cases needing attention.

Selecting specific score ranges in the histogram reveals detailed conversation analyses. For each conversation, you can examine the input prompt, generated response, number of retrieved chunks, ground truth comparison, and most importantly, the detailed score explanation from the evaluator model.
Consider this example response that scored 0.75 for the question, “What are some risks associated with Amazon’s expansion?” Although the generated response provided a structured analysis of operational, competitive, and financial risks, the evaluator model identified missing elements around IP infringement and foreign exchange risks compared to the ground truth. This detailed explanation helps in understanding not just what’s missing, but why the response received its specific score.
This granular analysis is crucial for systematic improvement of your RAG pipeline. By understanding patterns in lower-performing responses and specific areas where context retrieval or generation needs improvement, you can make targeted optimizations to your system—whether that’s adjusting retrieval parameters, refining prompts, or modifying knowledge base configurations.
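
The histogram reading described in this section can be reproduced on exported scores. The following sketch bins a list of per-conversation scores and flags the ones that fall below a threshold. The function name `summarize_scores`, the 0.8 threshold, and the sample scores are hypothetical, and the bin labels assume the default of ten bins over the 0 to 1 range.

```python
from collections import Counter
from statistics import mean


def summarize_scores(scores, n_bins=10, attention_threshold=0.8):
    """Return the mean, per-bin counts, and positions of low-scoring items."""
    counts = Counter()
    for score in scores:
        # round() guards against floating-point artifacts at bin edges;
        # min() places a perfect score of 1.0 in the top bin
        index = min(int(round(score * n_bins, 9)), n_bins - 1)
        counts[index] += 1
    histogram = {
        f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f}": counts[i]
        for i in sorted(counts)
    }
    needs_attention = [
        position for position, score in enumerate(scores)
        if score < attention_threshold
    ]
    return {
        "mean": mean(scores),
        "histogram": histogram,
        "needs_attention": needs_attention,
    }


# Made-up completeness scores for five conversations
summary = summarize_scores([0.95, 1.0, 0.75, 0.5, 0.92])
print(summary["histogram"], summary["needs_attention"])
```

The positions in `needs_attention` point back to the conversations whose evaluator explanations are worth reading first.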
Best practices for implementation
These best practices help build a solid foundation for your RAG evaluation strategy:

Design your evaluation strategy carefully, using representative test datasets that reflect your production scenarios and user patterns. If you have large workloads greater than 1,000 prompts per batch, optimize your workload by employing techniques such as stratified sampling to promote diversity and representativeness within your constraints such as time to completion and costs associated with evaluation.
Schedule periodic batch evaluations aligned with your knowledge base updates and content refreshes because this feature supports batch analysis rather than real-time monitoring.
Balance metrics with business objectives by selecting evaluation dimensions that directly impact your application’s success criteria.
Use evaluation insights to systematically improve your knowledge base content and retrieval settings through iterative refinement.
Maintain clear documentation of evaluation jobs, including the metrics selected and improvements implemented based on results. The job creation configuration settings in your results pages can help keep a historical record here.
Optimize your evaluation batch size and frequency based on application needs and resource constraints to promote cost-effective quality assurance.
Structure your evaluation framework to accommodate growing knowledge bases, incorporating both technical metrics and business KPIs in your assessment criteria.
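
The stratified sampling mentioned in the first practice can be sketched as follows. This is one possible approach, proportional allocation with largest remainders, and not a method prescribed by the service; the function name `stratified_sample` and the category labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, sample_size, seed=0):
    """Sample sample_size items, giving each stratum a share proportional to its size."""
    if sample_size >= len(items):
        return list(items)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    total = len(items)
    quota, remainder = {}, {}
    for name, group in strata.items():
        # Integer arithmetic avoids floating-point rounding in the allocation
        quota[name], remainder[name] = divmod(len(group) * sample_size, total)
    # Hand leftover slots to the strata with the largest remainders
    leftover = sample_size - sum(quota.values())
    for name in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[name] += 1

    sample = []
    for name, group in strata.items():
        sample.extend(rng.sample(group, quota[name]))
    return sample


# 2,000 hypothetical prompts in two categories, reduced to the 1,000 limit
prompts = [("billing", i) for i in range(1500)] + [("shipping", i) for i in range(500)]
subset = stratified_sample(prompts, key=lambda p: p[0], sample_size=1000)
print(len(subset))
```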

To help you dive deeper into the scientific validation of these practices, we’ll be publishing a technical deep-dive post that explores detailed case studies using public datasets and internal AWS validation studies. This upcoming post will examine how our evaluation framework performs across different scenarios and demonstrate its correlation with human judgments across various evaluation dimensions. Stay tuned as we explore the research and validation that powers Amazon Bedrock Evaluations.
Conclusion
Amazon Bedrock knowledge base RAG evaluation enables organizations to confidently deploy and maintain high-quality RAG applications by providing comprehensive, automated assessment of both retrieval and generation components. By combining the benefits of managed evaluation with the nuanced understanding of human assessment, this feature allows organizations to scale their AI quality assurance efficiently while maintaining high standards. Organizations can make data-driven decisions about their RAG implementations, optimize their knowledge bases, and follow responsible AI practices through seamless integration with Amazon Bedrock Guardrails.
Whether you’re building customer service solutions, technical documentation systems, or enterprise knowledge base RAG, Amazon Bedrock Evaluations provides the tools needed to deliver reliable, accurate, and trustworthy AI applications. To help you get started, we’ve prepared a Jupyter notebook with practical examples and code snippets. You can find it on our GitHub repository.
We encourage you to explore these capabilities in the Amazon Bedrock console and discover how systematic evaluation can enhance your RAG applications.

About the Authors
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Ayan Ray is a Senior Generative AI Partner Solutions Architect at AWS, where he collaborates with ISV partners to develop integrated Generative AI solutions that combine AWS services with AWS partner products. With over a decade of experience in Artificial Intelligence and Machine Learning, Ayan has previously held technology leadership roles at AI startups before joining AWS. Based in the San Francisco Bay Area, he enjoys playing tennis and gardening in his free time.
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Evangelia Spiliopoulou is an Applied Scientist in the AWS Bedrock Evaluation group, where the goal is to develop novel methodologies and tools to assist automatic evaluation of LLMs. Her overall work focuses on Natural Language Processing (NLP) research and developing NLP applications for AWS customers, including LLM Evaluations, RAG, and improving reasoning for LLMs. Prior to Amazon, Evangelia completed her Ph.D. at Language Technologies Institute, Carnegie Mellon University.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Google DeepMind’s Gemini Robotics: Unleashing Embodied AI with Zero- …

Posted on March 14, 2025 by i-genie

Google DeepMind has shattered conventional boundaries in robotics AI with the unveiling of Gemini Robotics, a suite of models built upon the formidable foundation of Gemini 2.0. This isn’t just an incremental upgrade; it’s a paradigm shift, propelling AI from the digital realm into the tangible world with unprecedented “embodied reasoning” capabilities.

Gemini Robotics: Bridging the Gap Between Digital Intelligence and Physical Action

At the heart of this innovation lies Gemini Robotics, an advanced vision-language-action (VLA) model that transcends traditional AI limitations. By introducing physical actions as a direct output modality, Gemini Robotics empowers robots to autonomously execute tasks with a level of understanding and adaptability previously unattainable. Complementing this is Gemini Robotics-ER (Embodied Reasoning), a specialized model engineered to refine spatial understanding, enabling roboticists to seamlessly integrate Gemini’s cognitive prowess into existing robotic architectures.

These models herald a new era of robotics, promising to unlock a diverse spectrum of real-world applications. Google DeepMind’s strategic partnerships with industry leaders like Apptronik, for the integration of Gemini 2.0 into humanoid robots, and collaborations with trusted testers, underscore the transformative potential of this technology.

Key Technological Advancements:

Unparalleled Generality: Gemini Robotics leverages Gemini’s robust world model to generalize across novel scenarios, achieving superior performance on rigorous generalization benchmarks compared to state-of-the-art VLA models.

Intuitive Interactivity: Built on Gemini 2.0’s language understanding, the model facilitates fluid human-robot interaction through natural language commands, dynamically adapting to environmental changes and user input.

Advanced Dexterity: The model demonstrates remarkable dexterity, executing complex manipulation tasks like origami folding and intricate object handling, showcasing a significant leap in robotic fine motor control.

Versatile Embodiment: Gemini Robotics’ adaptability extends to various robotic platforms, from bi-arm systems like ALOHA 2 and Franka arms to advanced humanoid robots like Apptronik’s Apollo.

Gemini Robotics-ER: Pioneering Spatial Intelligence

Gemini Robotics-ER elevates spatial reasoning, a critical component for effective robotic operation. By enhancing capabilities such as pointing, 3D object detection, and spatial understanding, this model enables robots to perform tasks with heightened precision and efficiency.

Gemini 2.0: Enabling Zero and Few-Shot Robot Control

A defining feature of Gemini 2.0 is its ability to facilitate zero and few-shot robot control. This eliminates the need for extensive robot action data training, enabling robots to perform complex tasks “out of the box.” By uniting perception, state estimation, spatial reasoning, planning, and control within a single model, Gemini 2.0 surpasses previous multi-model approaches.

Zero-Shot Control via Code Generation: Gemini Robotics-ER leverages its code generation capabilities and embodied reasoning to control robots using API commands, reacting and replanning as needed. The model’s enhanced embodied understanding results in a near 2x improvement in task completion compared to Gemini 2.0.

Few-Shot Control via In-Context Learning (ICL): By conditioning the model on a small number of demonstrations, Gemini Robotics-ER can quickly adapt to new behaviors.

Below are the perception and control APIs and the agentic orchestration during an episode. This system is used for zero-shot control:

Commitment to Safety

Google DeepMind prioritizes safety through a multi-layered approach, addressing concerns from low-level motor control to high-level semantic understanding. The integration of Gemini Robotics-ER with existing safety-critical controllers and the development of mechanisms to prevent unsafe actions underscore this commitment.

The release of the ASIMOV dataset and the framework for generating data-driven “Robot Constitutions” further demonstrates Google DeepMind’s dedication to advancing robotics safety research.

Intelligent robots are getting closer…

Check out the full Gemini Robotics report and Gemini Robotics. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post Google DeepMind’s Gemini Robotics: Unleashing Embodied AI with Zero-Shot Control and Enhanced Spatial Reasoning appeared first on MarkTechPost.

Aya Vision Unleashed: A Global AI Revolution in Multilingual Multimoda …

Posted on March 14, 2025 by i-genie

Cohere For AI has just dropped a bombshell: Aya Vision, an open-weights vision model that’s about to redefine multilingual and multimodal communication. Prepare for a seismic shift as we shatter language barriers and unlock the true potential of AI across the globe!

Smashing the Multilingual Multimodal Divide!

Let’s face it, AI has been speaking with a frustratingly limited vocabulary. But not anymore! Aya Vision explodes onto the scene, obliterating the performance gap between languages and modalities. This isn’t just an incremental improvement; it’s a quantum leap, extending multimodal magic to 23 languages, reaching over half the planet’s population. Imagine AI finally speaking your language, understanding the rich tapestry of your culture.

Aya Vision: Where Vision Meets Linguistic Brilliance!

This is not your average vision model. Aya Vision is a linguistic virtuoso, a visual maestro, and a global communicator all rolled into one. From crafting captivating image captions to answering complex visual questions, it’s a powerhouse of multimodal understanding. For example, you snap a photo of a stunning piece of art from your travels, and Aya Vision instantly unveils its history, style, and cultural significance, bridging worlds with a single image.

Performance That Will Blow Your Mind!

Multilingual Domination: Aya Vision obliterates the competition, leaving leading open-weights models in the dust when it comes to multilingual text generation and image understanding.

Parameter Prowess: The 8B model is a lean, mean, performance machine, crushing giants like Qwen2.5-VL 7B, Gemini Flash 1.5 8B, Llama-3.2 11B Vision, and Pangea 7B with jaw-dropping win rates!

32B Titan: The 32B model sets a new gold standard, outperforming even larger models like Llama-3.2 90B Vision, Molmo 72B, and Qwen2-VL 72B with breathtaking efficiency.

Efficiency Unleashed: Aya Vision proves you don’t need monstrous models to achieve monumental results, outperforming models 10x its size!

Algorithmic Alchemy: Secret ingredients like synthetic annotations, multilingual data scaling, and multimodal model merging have been masterfully combined to create this AI masterpiece.

Open Weights, Open Doors, Open World!

Cohere For AI isn’t just building groundbreaking AI; they’re democratizing it. Aya Vision’s 8B and 32B models are now freely available on Kaggle and Hugging Face.

Want to contribute?

Cohere For AI invites researchers worldwide to join the Aya initiative, apply for research grants, and collaborate in their open science community. Aya Vision is a huge step forward into the future of multilingual multimodal AI.

Check out the Aya Vision blog post and Aya Initiative, Kaggle and Hugging Face. All credit for this research goes to the researchers of this project.

The post Aya Vision Unleashed: A Global AI Revolution in Multilingual Multimodal Power! appeared first on MarkTechPost.

Simular Releases Agent S2: An Open, Modular, and Scalable AI Framework …

Posted on March 14, 2025 by i-genie

In today’s digital landscape, interacting with a wide variety of software and operating systems can often be a tedious and error-prone experience. Many users face challenges when navigating through complex interfaces and performing routine tasks that demand precision and adaptability. Existing automation tools frequently fall short in adapting to subtle interface changes or learning from past mistakes, leaving users to manually oversee processes that could otherwise be streamlined. This persistent gap between user expectations and the capabilities of traditional automation calls for a system that not only performs tasks reliably but also learns and adjusts over time.

Simular has introduced Agent S2, an open, modular, and scalable framework designed to assist with computer use agents. Agent S2 builds upon the foundation laid by its predecessor, offering a refined approach to automating tasks on computers and smartphones. By integrating a modular design with both general-purpose and specialized models, the framework can be adapted to a variety of digital environments. Its design is inspired by the human brain’s natural modularity, where different regions work together harmoniously to handle complex tasks, thereby fostering a system that is both flexible and robust.

Technical Details and Benefits

At its core, Agent S2 employs experience-augmented hierarchical planning. This method involves breaking down long and intricate tasks into smaller, more manageable subtasks. The framework continuously refines its strategy by learning from previous experiences, thereby improving its execution over time. An important aspect of Agent S2 is its visual grounding capability, which allows it to interpret raw screenshots for precise interaction with graphical user interfaces. This eliminates the need for additional structured data and enhances the system’s ability to correctly identify and interact with UI elements. Moreover, Agent S2 utilizes an advanced Agent-Computer Interface that delegates routine, low-level actions to expert modules. Complemented by an adaptive memory mechanism, the system retains useful experiences to guide future decision-making, resulting in a more measured and effective performance.

Results and Insights

Evaluations on real-world benchmarks indicate that Agent S2 performs reliably in both computer and smartphone environments. On the OSWorld benchmark—which tests the execution of multi-step computer tasks—Agent S2 achieved a success rate of 34.5% on a 50-step evaluation, reflecting a modest yet consistent improvement over earlier models. Similarly, on the AndroidWorld benchmark, the framework reached a 50% success rate in executing smartphone tasks. These results underscore the practical benefits of a system that can plan ahead and adapt to dynamic conditions, ensuring that tasks are completed with improved accuracy and minimal manual intervention.

Conclusion

Agent S2 represents a thoughtful approach to enhancing everyday digital interactions. By addressing common challenges in computer automation through a modular design and adaptive learning, the framework provides a practical solution for managing routine tasks more efficiently. Its balanced combination of proactive planning, visual understanding, and expert delegation makes it well-suited for both complex computer tasks and mobile applications. In an era where digital workflows continue to evolve, Agent S2 offers a measured, reliable means of integrating automation into daily routines—helping users achieve better outcomes while reducing the need for constant manual oversight.

Check out the Technical details and GitHub Page. All credit for this research goes to the researchers of this project.

The post Simular Releases Agent S2: An Open, Modular, and Scalable AI Framework for Computer Use Agents appeared first on MarkTechPost.

How GoDaddy built a category generation system at scale with batch inf …

Posted on March 14, 2025 by i-genie

This post was co-written with Vishal Singh, Data Engineering Leader on the Data & Analytics team at GoDaddy.
Generative AI solutions have the potential to transform businesses by boosting productivity and improving customer experiences, and using large language models (LLMs) in these solutions has become increasingly popular. However, inference of LLMs as single model invocations or API calls doesn’t scale well with many applications in production.
With batch inference, you can run multiple inference requests asynchronously to process a large number of requests efficiently. You can also use batch inference to improve the performance of model inference on large datasets.
This post provides an overview of a custom solution developed by the AWS Generative AI Innovation Center for GoDaddy, a domain registrar, registry, web hosting, and ecommerce company. GoDaddy seeks to make entrepreneurship more accessible by using generative AI to provide personalized business insights to over 21 million customers, insights that were previously available only to large corporations. In this collaboration, the Generative AI Innovation Center team created an accurate and cost-efficient generative AI-based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system.
Solution overview
GoDaddy wanted to enhance their product categorization system that assigns categories to products based on their names. For example:

Input: Fruit by the Foot Starburst

Output: color -> multi-colored, material -> candy, category -> snacks, product_line -> Fruit by the Foot,…

GoDaddy used an out-of-the-box Meta Llama 2 model to generate the product categories for six million products, where each product is identified by an SKU. The generated categories were often incomplete or mislabeled, and calling an LLM separately for each product proved costly. GoDaddy therefore sought a more accurate and cost-efficient approach to product categorization to improve their customer experience.
This solution uses the following components to categorize products more accurately and efficiently:

Batch processing in Amazon Bedrock using the Meta Llama 2 and Anthropic’s Claude models
Amazon Simple Storage Service (Amazon S3) to store product and output data
AWS Lambda to orchestrate the Amazon Bedrock models

The key steps are illustrated in the following figure:

A JSONL file containing product data is uploaded to an S3 bucket, triggering the first Lambda function. Amazon Bedrock batch processes this single JSONL file, where each row contains input parameters and prompts. It generates an output JSONL file with a new model_output value appended to each row, corresponding to the input data.
The Lambda function spins up an Amazon Bedrock batch processing endpoint and passes the S3 file location.
The Amazon Bedrock endpoint performs the following tasks:

It reads the product name data and generates a categorized output, including category, subcategory, season, price range, material, color, product line, gender, and year of first sale.
It writes the output to another S3 location.

The second Lambda function performs the following tasks:

It monitors the batch processing job on Amazon Bedrock.
It shuts down the endpoint when processing is complete.
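As a rough illustration of the first step above, the sketch below shows how a Lambda handler might turn the S3 upload notification into the S3 URI that is later handed to the batch job. The handler body, bucket name, and object key are hypothetical; the post does not show GoDaddy's actual Lambda code, and a real handler would go on to submit the Amazon Bedrock job.

```python
from urllib.parse import unquote_plus


def s3_uri_from_event(event: dict) -> str:
    """Return the s3:// URI of the first object named in an S3 event notification."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
    key = unquote_plus(record["object"]["key"])
    return f"s3://{bucket}/{key}"


def lambda_handler(event, context):
    """Hypothetical entry point for the first Lambda function."""
    input_uri = s3_uri_from_event(event)
    # A real handler would submit the batch job here; this sketch only
    # reports which file would be processed.
    return {"input_s3_uri": input_uri}


# Hypothetical event with made-up bucket and key names.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "input/products+batch.jsonl"},
            }
        }
    ]
}
print(lambda_handler(sample_event, None))
```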

The security measures are inherently integrated into the AWS services employed in this architecture. For detailed information, refer to the Security Best Practices section of this post.

We used a dataset that consisted of 30 labeled data points and 100,000 unlabeled test data points. The labeled data points were generated by llama2-7b and verified by a human subject matter expert (SME). As shown in the following screenshot of the sample ground truth, some fields have N/A or missing values, which isn’t ideal because GoDaddy wants a solution with high coverage for downstream predictive modeling. Higher coverage for each possible field can provide more business insights to their customers.

The distribution of the number of words or tokens per SKU shows only mild outliers, which makes it feasible to bundle many products into a single prompt and may make model responses more efficient.

The solution delivers a comprehensive framework for generating insights within GoDaddy’s product categorization system. It’s designed to be compatible with a range of LLMs on Amazon Bedrock, features customizable prompt templates, and supports batch and real-time (online) inferences. Additionally, the framework includes evaluation metrics that can be extended to accommodate changes in accuracy requirements.
In the following sections, we look at the key components of the solution in more detail.
Batch inference
We used Amazon Bedrock for batch inference processing. Amazon Bedrock provides the CreateModelInvocationJob API to create a batch job with a unique job name. This API returns a response containing jobArn. Refer to the following code:

Request:
POST /model-invocation-job HTTP/1.1
Content-type: application/json

{
    "clientRequestToken": "string",
    "inputDataConfig": {
        "s3InputDataConfig": {
            "s3Uri": "string",
            "s3InputFormat": "JSONL"
        }
    },
    "jobName": "string",
    "modelId": "string",
    "outputDataConfig": {
        "s3OutputDataConfig": {
            "s3Uri": "string"
        }
    },
    "roleArn": "string",
    "tags": [{
        "key": "string",
        "value": "string"
    }]
}

Response:
HTTP/1.1 200
Content-type: application/json

{
    "jobArn": "string"
}
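The same request can be issued from Python. The sketch below assumes the boto3 `bedrock` client and its `create_model_invocation_job` method; the helper names are illustrative and not taken from the original solution. The client is passed in as an argument so the request-building logic can be exercised without AWS credentials.

```python
import uuid


def build_job_request(job_name, model_id, role_arn, input_uri, output_uri):
    """Assemble keyword arguments mirroring the CreateModelInvocationJob body."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,
        "clientRequestToken": str(uuid.uuid4()),  # idempotency token
        "inputDataConfig": {
            "s3InputDataConfig": {"s3Uri": input_uri, "s3InputFormat": "JSONL"}
        },
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_uri}},
    }


def submit_job(bedrock_client, **request):
    """Submit the batch job and return the jobArn from the response."""
    response = bedrock_client.create_model_invocation_job(**request)
    return response["jobArn"]


# Usage (requires AWS credentials; every value below is a placeholder):
#   import boto3
#   client = boto3.client("bedrock")
#   request = build_job_request(
#       "job47", "anthropic.claude-instant-v1", "<role-arn>",
#       "s3://<bucket>/input.jsonl", "s3://<bucket>/output/")
#   job_arn = submit_job(client, **request)
```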

We can monitor the job status using GetModelInvocationJob with the jobArn returned on job creation. The following are valid statuses during the lifecycle of a job:

Submitted – The job is marked Submitted when the JSONL file is ready to be processed by Amazon Bedrock for inference.
InProgress – The job is marked InProgress when Amazon Bedrock starts processing the JSONL file.
Failed – The job is marked Failed if there was an error while processing. The error can be written into the output file as a part of modelOutput. If it was a 4xx error, it’s written in the metadata of the job.
Completed – The job is marked Completed when the output JSONL file has been generated for the input file and uploaded to the S3 output path that was submitted in outputDataConfig as part of CreateModelInvocationJob.
Stopped – The job is marked Stopped when the StopModelInvocationJob API is called on a job that is InProgress. A job in a terminal state (Completed or Failed) can’t be stopped using StopModelInvocationJob.

The following is example code for the GetModelInvocationJob API:

GET /model-invocation-job/jobIdentifier HTTP/1.1

Response:
{
    'ResponseMetadata': {
        'RequestId': '081afa52-189f-4e83-a3f9-aa0918d902f4',
        'HTTPStatusCode': 200,
        'HTTPHeaders': {
            'date': 'Tue, 09 Jan 2024 17:00:16 GMT',
            'content-type': 'application/json',
            'content-length': '690',
            'connection': 'keep-alive',
            'x-amzn-requestid': '081afa52-189f-4e83-a3f9-aa0918d902f4'
        },
        'RetryAttempts': 0
    },
    'jobArn': 'arn:aws:bedrock:<region>:<account-id>:model-invocation-job/<id>',
    'jobName': 'job47',
    'modelId': 'arn:aws:bedrock:<region>::foundation-model/anthropic.claude-instant-v1:2',
    'status': 'Submitted',
    'submitTime': datetime.datetime(2024, 1, 8, 21, 44, 38, 611000, tzinfo=tzlocal()),
    'lastModifiedTime': datetime.datetime(2024, 1, 8, 23, 5, 47, 169000, tzinfo=tzlocal()),
    'inputDataConfig': {'s3InputDataConfig': {'s3Uri': '<path to input jsonl file>'}},
    'outputDataConfig': {'s3OutputDataConfig': {'s3Uri': '<path to output jsonl.out file>'}}
}
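The monitoring done by the second Lambda function could look roughly like the loop below. It is a sketch that assumes the boto3 `get_model_invocation_job` call and treats only the statuses described above as terminal; the polling interval, the retry cap, and the stub client are illustrative choices, not values from the original solution.

```python
import time

# Terminal statuses as described in this post.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(bedrock_client, job_arn, poll_seconds=60, max_polls=120,
                 sleep=time.sleep):
    """Poll GetModelInvocationJob until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = bedrock_client.get_model_invocation_job(jobIdentifier=job_arn)
        if job["status"] in TERMINAL_STATUSES:
            return job["status"]
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_arn} still running after {max_polls} polls")


class StubClient:
    """Hypothetical stand-in for the Bedrock client, used only for the demo."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_model_invocation_job(self, jobIdentifier):
        return {"jobArn": jobIdentifier, "status": next(self._statuses)}


stub = StubClient(["Submitted", "InProgress", "Completed"])
final_status = wait_for_job(stub, "arn:example", sleep=lambda seconds: None)
print(final_status)
```

Injecting `sleep` keeps the loop testable without real waiting; in a Lambda function, a scheduled invocation per poll may be a better fit than a long-running loop.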

When the job is complete, the S3 path specified in s3OutputDataConfig will contain a new folder with an alphanumeric name. The folder contains two files:

json.out – The following code shows an example of the format:

{
    "processedRecordCount": <number>,
    "successRecordCount": <number>,
    "errorRecordCount": <number>,
    "inputTokenCount": <number>,
    "outputTokenCount": <number>
}

<file_name>.jsonl.out – The following screenshot shows an example of this file, with the successfully processed records under modelOutput. The modelOutput field contains a list of categories for a given product name in JSON format.
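A minimal sketch of reading such an output file line by line follows. The `recordId`, `modelInput`, and `error` keys in the sample records are assumptions made for illustration, as is the rule that a record without a `modelOutput` key counts as failed; check them against the actual output before relying on them.

```python
import json


def read_batch_output(lines):
    """Split .jsonl.out records into those with and without a modelOutput."""
    succeeded, failed = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if "modelOutput" in record:
            succeeded.append(record)
        else:
            failed.append(record)
    return succeeded, failed


# Hypothetical sample records; real files are read from the S3 output folder.
sample_lines = [
    '{"recordId": "r1", "modelInput": {"prompt": "..."}, "modelOutput": {"completion": "{}"}}',
    '{"recordId": "r2", "modelInput": {"prompt": "..."}, "error": {"errorCode": 400}}',
    "",
]
ok_records, bad_records = read_batch_output(sample_lines)
print(len(ok_records), len(bad_records))
```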

We then process the jsonl.out file in Amazon S3. This file is parsed using LangChain’s PydanticOutputParser to generate a .csv file. The PydanticOutputParser requires a schema to be able to parse the JSON generated by the LLM. We created a CCData class that contains the list of categories to be generated for each product as shown in the following code example. Because we enable n-packing, we wrap the schema with a List, as defined in List_of_CCData.

from typing import List, Optional

from pydantic import BaseModel, Field


class CCData(BaseModel):
    """Categories to generate for a single product."""
    product_name: Optional[str] = Field(default=None, description="product name, which will be given as input")
    brand: Optional[str] = Field(default=None, description="Brand of the product inferred from the product name")
    color: Optional[str] = Field(default=None, description="Color of the product inferred from the product name")
    material: Optional[str] = Field(default=None, description="Material of the product inferred from the product name")
    price: Optional[str] = Field(default=None, description="Price of the product inferred from the product name")
    category: Optional[str] = Field(default=None, description="Category of the product inferred from the product name")
    sub_category: Optional[str] = Field(default=None, description="Sub-category of the product inferred from the product name")
    product_line: Optional[str] = Field(default=None, description="Product Line of the product inferred from the product name")
    gender: Optional[str] = Field(default=None, description="Gender of the product inferred from the product name")
    year_of_first_sale: Optional[str] = Field(default=None, description="Year of first sale of the product inferred from the product name")
    season: Optional[str] = Field(default=None, description="Season of the product inferred from the product name")


class List_of_CCData(BaseModel):
    """Wrapper schema: with n-packing, one response covers several products."""
    list_of_dict: List[CCData]

We also use OutputFixingParser to handle situations where the initial parsing attempt fails. The following screenshot shows a sample generated .csv file.
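To show the shape of this parsing step without depending on LangChain, the following standard-library sketch extracts the `list_of_dict` payload from raw model text and writes .csv rows. It is a simplified stand-in, not the PydanticOutputParser or OutputFixingParser logic: on a parse failure it returns an empty list instead of asking an LLM to repair the output.

```python
import csv
import io
import json

FIELDS = ["product_name", "brand", "color", "material", "price", "category",
          "sub_category", "product_line", "gender", "year_of_first_sale",
          "season"]


def extract_rows(model_text):
    """Pull the list_of_dict payload out of raw model text; [] on failure."""
    start, end = model_text.find("{"), model_text.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        payload = json.loads(model_text[start:end + 1])
    except json.JSONDecodeError:
        return []
    items = payload.get("list_of_dict") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []
    # Keep only the schema fields; absent fields become None.
    return [{f: item.get(f) for f in FIELDS}
            for item in items if isinstance(item, dict)]


def rows_to_csv(rows):
    """Serialize parsed rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


raw = ('Here is the answer: {"list_of_dict": [{"product_name": '
       '"Fruit by the Foot Starburst", "category": "snacks"}]}')
rows = extract_rows(raw)
print(rows_to_csv(rows))
```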

Prompt engineering
Prompt engineering involves the skillful crafting and refining of input prompts. This process entails choosing the right words, phrases, sentences, punctuation, and separator characters to efficiently use LLMs for diverse applications. Essentially, prompt engineering is about effectively interacting with an LLM. The most effective prompt engineering strategy varies with the specific task and data, which in this case are data card generation and GoDaddy SKUs.
Prompts consist of particular inputs from the user that direct LLMs to produce a suitable response or output based on a specified task or instruction. These prompts include several elements, such as the task or instruction itself, the surrounding context, full examples, and the input text that guides LLMs in crafting their responses. The composition of the prompt will vary based on factors like the specific use case, data availability, and the nature of the task at hand. For example, in a Retrieval Augmented Generation (RAG) use case, we provide additional context and add a user-supplied query in the prompt that asks the LLM to focus on contexts that can answer the query. In a metadata generation use case, we can provide the image and ask the LLM to generate a description and keywords describing the image in a specific format.
In this post, we divide the prompt engineering work into two steps: output generation and format parsing.
Output generation
The following are best practices and considerations for output generation:

Provide simple, clear and complete instructions – This is the general guideline for prompt engineering work.
Use separator characters consistently – In this use case, we use the newline character \n.
Deal with default output values such as missing – For this use case, we don’t want special values such as N/A or missing, so we put multiple instructions in line, aiming to exclude the default or missing values.
Use few-shot prompting – Also termed in-context learning, few-shot prompting involves providing a handful of examples, which can be beneficial in helping LLMs understand the output requirements more effectively. In this use case, 0–10 in-context examples were tested for both Llama 2 and Anthropic’s Claude models.
Use packing techniques – We combined multiple SKU and product names into one LLM query, so that some prompt instructions can be shared across different SKUs for cost and latency optimization. In this use case, 1–10 packing numbers were tested for both Llama 2 and Anthropic’s Claude models.
Test for good generalization – You should keep a hold-out test set and correct responses to check if your prompt modifications generalize.
Use additional techniques for Anthropic’s Claude model families – We incorporated the following techniques:

Enclosing examples in XML tags:

<example>
H: <question> The list of product names is:
{few_shot_product_name} </question>
A: <response> The category information generated with absolutely no missing value, in JSON format is:
{few_shot_field} </response>
</example>

Using the Human and Assistant annotations:

\n\nHuman:
…
…
\n\nAssistant:

Guiding the assistant prompt:

\n\nAssistant: Here are the answer with NO missing, unknown, null, or N/A values (in JSON format):

Use additional techniques for Llama model families – For Llama 2 model families, you can enclose examples in [INST] tags:

[INST]
If the list of product names is:
{few_shot_product_name}
[/INST]

Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):

{few_shot_field}

[INST]
If the list of product names is:
{product_name}
[/INST]

Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):
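The packing and few-shot practices above can be combined in a small prompt builder. The sketch below groups product names into packs and assembles one Claude-style prompt per pack from the fragments quoted above; the overall ordering of instructions, example, and product list, along with the function names, is an assumption, because the post does not show the full templates.

```python
def chunk(items, size):
    """Yield consecutive groups of at most `size` items (the packing number)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_claude_prompt(instructions, few_shot_names, few_shot_json, product_names):
    """Assemble one packed prompt in the Human/Assistant layout."""
    example = (
        "<example>\n"
        "H: <question> The list of product names is:\n"
        f"{few_shot_names} </question>\n"
        "A: <response> The category information generated with absolutely no "
        "missing value, in JSON format is:\n"
        f"{few_shot_json} </response>\n"
        "</example>"
    )
    names = "\n".join(product_names)  # newline is the separator character
    return (
        f"\n\nHuman: {instructions}\n{example}\n"
        f"The list of product names is:\n{names}"
        "\n\nAssistant: Here are the answer with NO missing, unknown, null, "
        "or N/A values (in JSON format):"
    )


# Hypothetical product names; a pack size of 5 yields two prompts for 7 SKUs.
products = [f"product {i}" for i in range(7)]
prompts = [
    build_claude_prompt(
        "Generate category information for each product name.",
        "Fruit by the Foot Starburst",
        '{"list_of_dict": [{"category": "snacks"}]}',
        pack,
    )
    for pack in chunk(products, 5)
]
print(len(prompts))
```

Because the instructions and few-shot example are repeated once per pack instead of once per SKU, larger packs spread that fixed input cost across more products.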

Format parsing
The following are best practices and considerations for format parsing:

Refine the prompt with modifiers – Refinement of task instructions typically involves altering the instruction, task, or question part of the prompt. The effectiveness of these techniques varies based on the task and data. Some beneficial strategies in this use case include:

Role assumption – Ask the model to assume it’s playing a role. For example:

You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.

Prompt specificity – Being very specific and providing detailed instructions to the model can help generate better responses for the required task. For example:

EVERY category information needs to be filled based on BOTH product name AND your best guess. If you forget to generate any category information, leave it as missing or N/A, then an innocent people will die.

Output format description – We provided the JSON format instructions through a JSON string directly, as well as through the few-shot examples indirectly.

Pay attention to few-shot example formatting – The LLMs (Anthropic’s Claude and Llama) are sensitive to subtle formatting differences. Parsing time was significantly improved after several iterations on few-shot examples formatting. The final solution is as follows:

# Serialize the first num_few_shot ground-truth rows as one JSON object.
few_shot_field = (
    '{"list_of_dict"'
    + ':['
    + ', \n'.join([true_df.iloc[i].to_json() for i in range(num_few_shot)])
    + ']}'
)

Use additional techniques for Anthropic’s Claude model families – For the Anthropic’s Claude model, we instructed it to format the output in JSON format:

{
    "list_of_dict": [{
        "some_category": "your_generated_answer",
        "another_category": "your_generated_answer",
    },
    {
        <category information for the 2st product name, in json format>
    },
    {
        <category information for the 3st product name, in json format>
    },
    // … {additional product information, in json format} …
    }]
}

Use additional techniques for Llama 2 model families – For the Llama 2 model, we instructed it to format the output in JSON format as follows:

Format your output in the JSON format (ensure to escape special character): The output should be formatted as a JSON instance that conforms to the JSON schema below. As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]} the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
{"properties": {"list_of_dict": {"title": "List Of Dict", "type": "array", "items": {"$ref": "#/definitions/CCData"}}}, "required": ["list_of_dict"], "definitions": {"CCData": {"title": "CCData", "type": "object", "properties": {"product_name": {"title": "Product Name", "description": "product name, which will be given as input", "type": "string"}, "brand": {"title": "Brand", "description": "Brand of the product inferred from the product name", "type": "string"}, "color": {"title": "Color", "description": "Color of the product inferred from the product name", "type": "string"}, "material": {"title": "Material", "description": "Material of the product inferred from the product name", "type": "string"}, "price": {"title": "Price", "description": "Price of the product inferred from the product name", "type": "string"}, "category": {"title": "Category", "description": "Category of the product inferred from the product name", "type": "string"}, "sub_category": {"title": "Sub Category", "description": "Sub-category of the product inferred from the product name", "type": "string"}, "product_line": {"title": "Product Line", "description": "Product Line of the product inferred from the product name", "type": "string"}, "gender": {"title": "Gender", "description": "Gender of the product inferred from the product name", "type": "string"}, "year_of_first_sale": {"title": "Year Of First Sale", "description": "Year of first sale of the product inferred from the product name", "type": "string"}, "season": {"title": "Season", "description": "Season of the product inferred from the product name", "type": "string"}}}}}
Models and parameters
We used the following prompting parameters:

Number of packings – 1, 5, 10
Number of in-context examples – 0, 2, 5, 10
Format instruction – JSON format pseudo example (shorter length), JSON format full example (longer length)

For Llama 2, the model choices were meta.llama2-13b-chat-v1 or meta.llama2-70b-chat-v1. We used the following LLM parameters:

{
    "temperature": 0.1,
    "top_p": 0.9,
    "max_gen_len": 2048
}

For Anthropic’s Claude, the model choices were anthropic.claude-instant-v1 and anthropic.claude-v2. We used the following LLM parameters:

{
    "temperature": 0.1,
    "top_k": 250,
    "top_p": 1,
    "max_tokens_to_sample": 4096,
    "stop_sequences": ["\n\nHuman:"],
    "anthropic_version": "bedrock-2023-05-31"
}

The solution is straightforward to extend to other LLMs hosted on Amazon Bedrock, such as Amazon Titan (switch the model ID to amazon.titan-tg1-large, for example), Jurassic (model ID ai21.j2-ultra), and more.
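One way to keep the backend model switchable is to derive the request body from the model ID. The sketch below reuses the parameter sets listed above and serializes one line of a batch input file; the `recordId` and `modelInput` keys, the `prompt` field, and the helper names are assumptions for illustration and should be checked against the Amazon Bedrock batch inference documentation.

```python
import json

# Parameter sets as listed in this post.
LLAMA_PARAMS = {"temperature": 0.1, "top_p": 0.9, "max_gen_len": 2048}
CLAUDE_PARAMS = {
    "temperature": 0.1,
    "top_k": 250,
    "top_p": 1,
    "max_tokens_to_sample": 4096,
    "stop_sequences": ["\n\nHuman:"],
    "anthropic_version": "bedrock-2023-05-31",
}


def build_model_input(model_id, prompt):
    """Return the request body for one prompt, chosen by model family."""
    if model_id.startswith("meta.llama2"):
        return {"prompt": prompt, **LLAMA_PARAMS}
    if model_id.startswith("anthropic.claude"):
        return {"prompt": prompt, **CLAUDE_PARAMS}
    raise ValueError(f"no parameter set defined for {model_id}")


def to_jsonl_line(record_id, model_id, prompt):
    """Serialize one batch-input record as a single JSONL line."""
    record = {"recordId": record_id,
              "modelInput": build_model_input(model_id, prompt)}
    return json.dumps(record)


line = to_jsonl_line("rec-0001", "meta.llama2-13b-chat-v1",
                     "[INST] ... [/INST]")
print(line)
```

Supporting another model family then means adding one parameter set and one branch, with the rest of the pipeline unchanged.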
Evaluations
The framework includes evaluation metrics that can be extended further to accommodate changes in accuracy requirements. Currently, it involves five different metrics:

Content coverage – Measures portions of missing values in the output generation step.
Parsing coverage – Measures portions of missing samples in the format parsing step:

Parsing recall on product name – Recall computed with an exact match on product name. It serves as a lower bound for parsing completeness (parsing coverage is the upper bound) because two virtually identical product names, such as “Nike Air Jordan” and “nike. air Jordon”, would have to be normalized and transformed before they count as an exact match.
Parsing precision on product name – The same exact-match comparison as parsing recall, measured as precision instead of recall.

Final coverage – Measures portions of missing values in both output generation and format parsing steps.
Human evaluation – Focuses on holistic quality evaluation such as accuracy, relevance, and comprehensiveness (richness) of the text generation.
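The programmatic metrics can be sketched as follows. The post does not give formulas, so these definitions are one plausible reading: coverage is the share of fields with a non-missing value, and recall and precision compare product names by exact string match. The set of tokens treated as missing is also an assumption.

```python
# Tokens treated as missing values (assumed list).
MISSING_TOKENS = {"", "n/a", "missing", "unknown", "null"}


def is_missing(value):
    """True if a field value should count as missing."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def content_coverage(rows, fields):
    """Share of (row, field) cells that hold a non-missing value."""
    total = len(rows) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(not is_missing(row.get(f)) for row in rows for f in fields)
    return filled / total


def parsing_recall(input_names, parsed_names):
    """Share of input product names found verbatim in the parsed output."""
    if not input_names:
        return 0.0
    parsed = set(parsed_names)
    return sum(name in parsed for name in input_names) / len(input_names)


def parsing_precision(input_names, parsed_names):
    """Share of parsed product names that exactly match an input name."""
    if not parsed_names:
        return 0.0
    expected = set(input_names)
    return sum(name in expected for name in parsed_names) / len(parsed_names)


inputs = ["Nike Air Jordan", "Fruit by the Foot Starburst"]
parsed = ["nike. air Jordon", "Fruit by the Foot Starburst"]
rows = [{"color": "multi-colored", "material": "N/A"}]
print(parsing_recall(inputs, parsed), parsing_precision(inputs, parsed))
print(content_coverage(rows, ["color", "material"]))
```

The near-duplicate name in the example fails the exact match, which is why exact-match recall acts as a lower bound on parsing completeness.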

Results
The following are the approximate sample input and output lengths under some best performing settings:

Input length for Llama 2 model family – 2,068 tokens for 10-shot, 1,585 tokens for 5-shot, 1,319 tokens for 2-shot
Input length for Anthropic’s Claude model family – 1,314 tokens for 10-shot, 831 tokens for 5-shot, 566 tokens for 2-shot, 359 tokens for zero-shot
Output length with 5-packing – Approximately 500 tokens

Quantitative results
The following table summarizes our consolidated quantitative results.

To be concise, the table contains only some of our final recommendations for each model type.
The metrics used are latency and accuracy.
The best model and results are highlighted in green color and in bold font.

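| Batch process service | Model | Prompt | Batch process latency, 5 packing (test set = 20) | Batch process latency, 5 packing (test set = 5k) | GoDaddy rqmt @ 5k | Near-real-time process latency (1 packing) | Accuracy: recall on parsing exact match | Accuracy: final content coverage |
|---|---|---|---|---|---|---|---|---|
| Amazon Bedrock batch inference | Llama2-13b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a |
| Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 65.4s | 1704s | 3600s | 72/20=3.6s | 92.60% | 53.90% |
| Amazon Bedrock batch inference | Llama2-70b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a |
| Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 139.6s | 5299s | 3600s | 156/20=7.8s | 98.30% | 61.50% |
| Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 44.8/20=2.24s | 98.50% | 96.80% |
| Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 30.3s | 644s | 3600s | 51/20=2.6s | 99% | 84.40% |
| Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 104/20=5.2s | 99% | 84.40% |
| Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 49.1s | 1323s | 3600s | 104/20=5.2s | 99.40% | 90.10% |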

The following tables summarize the scaling effect in batch inference.

When scaling from 5,000 to 100,000 samples, only eight times more computation time was needed.
Performing categorization with individual LLM calls for each product would have increased the inference time for 100,000 products by approximately 40 times compared to the batch processing method.
The accuracy in coverage remained stable, and cost scaled approximately linearly.

| Batch process service | Model | Prompt | Batch process latency, 5 packing (test set = 20) | Batch process latency, 5 packing (test set = 5k) | GoDaddy rqmt @ 5k | Batch process latency, 5 packing (test set = 100k) | Near-real-time process latency (1 packing) |
|---|---|---|---|---|---|---|---|
| Amazon Bedrock batch | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 5733s | 44.8/20=2.24s |
| Amazon Bedrock batch | Anthropic’s Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 7689s | 104/20=5.2s |

| Batch process service | Near-real-time process latency (1 packing) | Parsing recall on product name (test set = 5k) | Parsing recall on product name (test set = 100k) | Final content coverage (test set = 5k) | Final content coverage (test set = 100k) |
|---|---|---|---|---|---|
| Amazon Bedrock batch | 44.8/20=2.24s | 98.50% | 98.40% | 96.80% | 96.50% |
| Amazon Bedrock batch | 104/20=5.2s | 99% | 98.80% | 84.40% | 97% |
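The scaling claims can be checked with a few lines of arithmetic on the Claude-Instant figures reported above (723s at 5k, 5733s at 100k, and 2.24s per product in near-real-time mode). The assumption here is that the per-product near-real-time latency is the right proxy for individual LLM calls made one after another.

```python
batch_5k_s = 723      # batch latency, 5,000 products (Claude-Instant, zero-shot)
batch_100k_s = 5733   # batch latency, 100,000 products (same configuration)
per_product_s = 2.24  # near-real-time latency per product (1 packing)

sample_ratio = 100_000 / 5_000                 # growth in dataset size
time_ratio = batch_100k_s / batch_5k_s         # growth in batch time
individual_100k_s = per_product_s * 100_000    # sequential single calls
speedup = individual_100k_s / batch_100k_s     # single calls versus batch

print(f"samples grew {sample_ratio:.0f}x, batch time grew {time_ratio:.1f}x")
print(f"sequential calls would take about {speedup:.0f}x longer than batch")
```

A 20-fold increase in samples against a roughly 8-fold increase in batch time is the basis for the sub-linear scaling observation, and the ratio of sequential to batch time comes out close to the reported factor of 40.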

The following table summarizes the effect of n-packing. Llama 2 has an output length limit of 2,048 tokens, which fits up to around 20 packing. Anthropic’s Claude has a higher limit. We tested on 20 ground truth samples for 1, 5, and 10 packing and selected results from all model and prompt templates. The scaling effect on latency was more obvious in the Anthropic’s Claude model family than in Llama 2. Anthropic’s Claude also generalized better than Llama 2 when the packing number in the output was increased.
For Llama 2 models, we tested only few-shot prompts, which showed improved accuracy over zero-shot prompts.

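| Batch process service | Model | Prompt | Latency, test set = 20 (npack = 1) | Latency, test set = 20 (npack = 5) | Latency, test set = 20 (npack = 10) | Accuracy, final coverage (npack = 1) | Accuracy, final coverage (npack = 5) | Accuracy, final coverage (npack = 10) |
|---|---|---|---|---|---|---|---|---|
| Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 72s | 65.4s | 65s | 95.90% | 93.20% | 88.90% |
| Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 156s | 139.6s | 150s | 85% | 97.70% | 100% |
| Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 45s | 29s | 27s | 99.50% | 99.50% | 99.30% |
| Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 51.3s | 30.3s | 27.4s | 99.50% | 99.50% | 100% |
| Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 104s | 82.2s | 67s | 85% | 97.70% | 94.50% |
| Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 104s | 49.1s | 43.5s | 97.70% | 100% | 99.80% |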

Qualitative results
We noted the following qualitative results:

Human evaluation – The categories generated were evaluated qualitatively by GoDaddy SMEs. The categories were found to be of good quality.
Learnings – We used an LLM in two separate calls: output generation and format parsing. We observed the following:

For this use case, Llama 2 didn’t perform well in format parsing but was relatively capable in output generation. To make a consistent and fair comparison, we required the same LLM in both calls: both steps invoke either llama2-13b-chat-v1 or anthropic.claude-instant-v1. However, GoDaddy had chosen Llama 2 as the LLM for category generation. For that setup, we found that using Llama 2 only for output generation and Anthropic’s Claude for format parsing was suitable, given Llama 2’s relatively lower model capability.
Format parsing is improved through prompt engineering (JSON format instruction is critical) to reduce the latency. For example, with Anthropic’s Claude-Instant on a 20-test set and averaging multiple prompt templates, the latency can be reduced by approximately 77% (from 90 seconds to 20 seconds). This directly eliminates the necessity of using a JSON fine-tuned version of the LLM.

Llama2 – We observed the following:

Llama2-13b and Llama2-70b models both need the full instruction as format_instruction() in zero-shot prompts.
Llama2-13b seems to be worse in content coverage and formatting (for example, it can’t correctly escape the \" character), which can incur significant parsing time and cost and also degrade accuracy.
Llama 2 shows clear performance drops and instability when the packing number varies among 1, 5, and 10, indicating poorer generalizability compared to the Anthropic’s Claude model family.

Anthropic’s Claude – We observed the following:

Anthropic’s Claude-Instant and Claude-v2, regardless of using zero-shot or few-shot prompting, need only partial format instruction instead of the full instruction format_instruction(). It shortens the input length, and is therefore more cost-effective. It also shows Anthropic’s Claude’s better capability in following instructions.
Anthropic’s Claude generalizes well when varying packing numbers among 1, 5, and 10.

Business takeaways
We had the following key business takeaways:

Improved latency – Our solution inferences 5,000 products in 12 minutes, which is 80% faster than GoDaddy’s needs (5,000 products in 1 hour). Using batch inference in Amazon Bedrock demonstrates efficient batch processing capabilities and anticipates further scalability with AWS planning to deploy more cloud instances. The expansion will lead to increased time and cost savings.
More cost-effectiveness – The solution built by the Generative AI Innovation Center using Anthropic’s Claude-Instant is 8% more affordable than the existing proposal using Llama2-13b while also providing 79% more coverage.
Enhanced accuracy – The deliverable produces 97% category coverage on both the 5,000 and 100,000 hold-out test sets, exceeding GoDaddy’s requirement of 90%. The comprehensive framework is able to facilitate future iterative improvements over the current model parameters and prompt templates.
Qualitative assessment – The generated categories are of satisfactory quality, according to human evaluation by GoDaddy SMEs.
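The latency and coverage percentages quoted above can be approximately reproduced from the quantitative results. The sketch assumes the figures come from the Claude-Instant zero-shot row (723s, 96.80% final content coverage) and the Llama2-13b 5-shot row (53.90% final content coverage) at 5,000 products; the post does not state this mapping explicitly, and the cost comparison cannot be checked from the published numbers.

```python
required_s = 3600   # GoDaddy requirement for 5,000 products (1 hour)
measured_s = 723    # Claude-Instant zero-shot batch latency at 5k
claude_cov = 96.8   # final content coverage (%), Claude-Instant zero-shot
llama_cov = 53.9    # final content coverage (%), Llama2-13b 5-shot

minutes = measured_s / 60                             # batch time in minutes
time_saved = (required_s - measured_s) / required_s   # share of the hour saved
coverage_gain = claude_cov / llama_cov - 1            # relative coverage gain

print(f"{minutes:.1f} minutes, {time_saved:.1%} less time than required")
print(f"{coverage_gain:.1%} more coverage than Llama2-13b")
```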

Technical takeaways
We had the following key technical takeaways:

The solution features both batch inference and near real-time inference (2 seconds per product) capability and multiple backend LLM selections.
Anthropic’s Claude-Instant with zero-shot is the clear winner:

It was best in latency, cost, and accuracy on the 5,000 hold-out test set.
It showed better generalizability to higher packing numbers (number of SKUs in one query), with potentially more cost and latency improvement.

Iteration on prompt templates shows improvement on all these models, suggesting that good prompt engineering is a practical approach for the categorization generation task.
Input-wise, increasing to 10-shot may further improve performance, as observed in small-scale science experiments, but also increase the cost by around 30%. Therefore, we tested at most 5-shot in large-scale batch experiments.
Output-wise, increasing to 10-packing or even 20-packing (Anthropic’s Claude only; Llama 2 has 2,048 output length limit) might further improve latency and cost (because more SKUs can share the same input instructions).
For this use case, we saw Anthropic’s Claude model family having better accuracy and generalizability, for example:

Final category coverage performance was better with Anthropic’s Claude-Instant.
When increasing packing numbers from 1, 5, to 10, Anthropic’s Claude-Instant showed improvement in latency and stable accuracy in comparison to Llama 2.
To achieve the final categories for the use case, we noticed that Anthropic’s Claude required a shorter prompt input to follow the instruction and had a longer output length limit for a higher packing number.

Next steps for GoDaddy
The following are the recommendations that the GoDaddy team is considering as a part of future steps:

Dataset enhancement – Aggregate a larger set of ground truth examples and expand programmatic evaluation to better monitor and refine the model’s performance. On a related note, if the product names can be normalized by domain knowledge, the cleaner input is also helpful for better LLM responses. For example, the product name ”<product_name> Power t-shirt, ladyfit vest or hoodie” can prompt the LLM to respond for multiple SKUs, instead of one SKU (similarly, “<product_name> – $5 or $10 or $20 or $50 or $100”).
Human evaluation – Increase human evaluations to provide higher generation quality and alignment with desired outcomes.
Fine-tuning – Consider fine-tuning as a potential strategy for enhancing category generation when a more extensive training dataset becomes available.
Prompt engineering – Explore automatic prompt engineering techniques to enhance category generation, particularly when additional training data becomes available.
Few-shot learning – Investigate techniques such as dynamic few-shot selection and crafting in-context examples based on the model’s parameter knowledge to enhance the LLMs’ few-shot learning capabilities.
Knowledge integration – Improve the model’s output by connecting LLMs to a knowledge base (internal or external database) and enabling it to incorporate more relevant information. This can help to reduce LLM hallucinations and enhance relevance in responses.

Conclusion
In this post, we shared how the Generative AI Innovation Center team worked with GoDaddy to create a more accurate and cost-efficient generative AI-based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system. We implemented n-packing techniques and used Anthropic’s Claude and Meta Llama 2 models to improve latency. We experimented with different prompts to improve the categorization with LLMs and found that the Anthropic’s Claude model family gave better accuracy and generalizability than the Llama 2 model family. The GoDaddy team will test this solution on a larger dataset and evaluate the categories generated from the recommended approaches.
If you’re interested in working with the AWS Generative AI Innovation Center, please reach out.
Security Best Practices

Amazon S3
Amazon Bedrock
AWS Step Functions
AWS Lambda

References

How SnapLogic built a text-to-pipeline application with Amazon Bedrock to translate business intent into action

How Q4 Inc. used Amazon Bedrock, RAG, and SQLDatabaseChain to address numerical and structured dataset challenges building their Q&A chatbot

About the Authors
Vishal Singh is a Data Engineering leader on the Data and Analytics team at GoDaddy. His key focus is building data products and generating insights from them by applying data engineering tools along with generative AI.
Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interests include generative models and sequential data modeling.
Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.
Karan Sindwani is an Applied Scientist at AWS where he works with AWS customers across different verticals to accelerate their use of Gen AI and AWS Cloud services to solve their business challenges.
Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Benchmarking customized models on Amazon Bedrock using LLMPerf and Lit …

Posted on March 14, 2025 by i-genie

Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning for their specific domains or tasks, while retaining control over costs and deployments. However, deployment can be a significant portion of the effort, often requiring 30% of project time, because engineers must optimize instance types and configure serving parameters through careful testing. This process can be complex and time-consuming, requiring specialized knowledge and iterative testing to achieve the desired performance.
Amazon Bedrock Custom Model Import simplifies deployments of custom models by offering a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment, so that deployments are performant and cost-effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scaling to zero: when there are no invocations for 5 minutes, the model scales to zero, and you pay only for what you use, in 5-minute increments. It also scales up, automatically increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations that want simplicity and cost-efficiency when using custom models on Amazon Bedrock.
Before deploying these models in production, it’s crucial to evaluate their performance using benchmarking tools. These tools help to proactively detect potential production issues such as throttling and verify that deployments can handle expected production loads.
This post begins a blog series exploring DeepSeek and open FMs on Amazon Bedrock Custom Model Import. It covers performance benchmarking of custom models in Amazon Bedrock using two popular open source tools: LLMPerf and LiteLLM. It comes with a notebook that provides step-by-step instructions to deploy a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply to any other model supported by Amazon Bedrock Custom Model Import.
Prerequisites
This post requires an Amazon Bedrock custom model. If you don’t have one in your AWS account yet, follow the instructions from Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.
Using open source tools LLMPerf and LiteLLM for performance benchmarking
To conduct performance benchmarking, you will use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray Clients and analyzing their responses. A key advantage of LLMPerf is its wide support for foundation model APIs. This includes LiteLLM, which supports all models available on Amazon Bedrock.
Setting up your custom model invocation with LiteLLM
LiteLLM is a versatile open source tool that can be used both as a Python SDK and a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider’s specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and Converse APIs, and FMs available on Amazon Bedrock, including imported custom models.
To invoke a custom model with LiteLLM, you use the model parameter (see Amazon Bedrock documentation on LiteLLM). This is a string that follows the bedrock/provider_route/model_arn format.
The provider_route indicates which LiteLLM implementation of the request/response specification to use. DeepSeek R1 models can be invoked with their custom chat template through the DeepSeek R1 provider route (deepseek_r1), or with the Llama chat template through the Llama provider route (llama).
The model_arn is the model Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request.
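As a minimal sketch of how the model string is assembled: the helper names `build_model_string` and `pick_model_arn` and the example ARN below are hypothetical, and the list passed to `pick_model_arn` is assumed to mirror the `modelSummaries` entries returned by a ListImportedModels request.

```python
def build_model_string(provider_route: str, model_arn: str) -> str:
    """Join the pieces into the bedrock/provider_route/model_arn format."""
    return f"bedrock/{provider_route}/{model_arn}"


def pick_model_arn(model_summaries: list[dict], model_name: str) -> str:
    """Return the ARN of the imported model with the given name.

    Assumes each summary carries 'modelName' and 'modelArn' keys.
    """
    for summary in model_summaries:
        if summary.get("modelName") == model_name:
            return summary["modelArn"]
    raise LookupError(f"No imported model named {model_name!r}")


# Made-up summaries standing in for a ListImportedModels response.
summaries = [
    {
        "modelName": "my-r1-distill",
        "modelArn": "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123",
    },
]
arn = pick_model_arn(summaries, "my-r1-distill")
print(build_model_string("deepseek_r1", arn))  # DeepSeek R1 chat template
print(build_model_string("llama", arn))        # Llama chat template
```

Keeping the route and the ARN separate makes it easy to benchmark the same imported model under both chat templates.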
For example, the following script invokes the custom model using the DeepSeek R1 chat template.
import time

from litellm import completion

# model_id must hold the ARN of your imported model (see the previous paragraph).
while True:
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[
                {"role": "user", "content": """Given the following financial data:
- Company A's revenue grew from $10M to $15M in 2023
- Operating costs increased by 20%
- Initial operating costs were $7M

Calculate the company's operating margin for 2023. Please reason step by step."""},
                # Prefill the assistant turn so the model starts its reasoning block.
                {"role": "assistant", "content": "<think>"},
            ],
            max_tokens=4096,
        )
        print(response["choices"][0]["message"]["content"])
        break
    except Exception:
        # The model may still be scaling up from zero; wait and retry.
        time.sleep(60)
After the invocation parameters for the imported model have been verified, you can configure LLMPerf for benchmarking.
Configuring a token benchmark test with LLMPerf
To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to model invocation APIs. These clients are implemented as actors that execute in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray Clients, and allows for simulation of various load scenarios and concurrent request patterns. At the same time, each client will collect performance metrics during the requests, including latency, throughput, and error rates.
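The pattern can be pictured without Ray. The following stand-in uses Python's `concurrent.futures` thread pool in place of Ray actors, and `fake_request` is a hypothetical placeholder for a real model invocation. It only illustrates how concurrent clients each record per-request latency and errors, and is not how LLMPerf itself is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(request_id: int) -> None:
    """Hypothetical stand-in for a model invocation; sleeps briefly."""
    time.sleep(0.01)
    if request_id % 10 == 9:  # simulate an occasional failure
        raise RuntimeError("simulated error")


def timed_call(request_id: int) -> dict:
    """Send one request and record its latency and whether it failed."""
    start = time.perf_counter()
    error = None
    try:
        fake_request(request_id)
    except Exception as exc:
        error = str(exc)
    return {"id": request_id, "latency_s": time.perf_counter() - start, "error": error}


def run_load_test(num_concurrent: int, total_requests: int) -> list[dict]:
    """Keep up to num_concurrent requests in flight until all are sent."""
    with ThreadPoolExecutor(max_workers=num_concurrent) as pool:
        return list(pool.map(timed_call, range(total_requests)))


results = run_load_test(num_concurrent=2, total_requests=20)
errors = sum(1 for r in results if r["error"])
print(f"completed={len(results)} errors={errors} error_rate={errors / len(results):.0%}")
```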
Two critical metrics for performance include latency and throughput:

Latency refers to the time it takes for a single request to be processed.
Throughput measures the number of tokens that are generated per second.
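
To make the two definitions concrete, here is a small sketch that derives them from the arrival times of streamed tokens. The function name and the timestamps are illustrative, and the formulas are a simplification that may not match the exact calculation LLMPerf performs.

```python
def request_metrics(sent_at: float, token_times: list[float]) -> dict:
    """Compute latency and throughput for one streamed response.

    sent_at:     time the request was sent (seconds)
    token_times: arrival time of each output token (seconds, ascending)
    """
    end_to_end = token_times[-1] - sent_at
    return {
        "ttft_s": token_times[0] - sent_at,                     # time to first token
        "end_to_end_latency_s": end_to_end,                     # latency of the whole request
        "output_tokens_per_s": len(token_times) / end_to_end,   # throughput
    }


# Illustrative request: first token after 0.4 s, then one token every 0.01 s.
times = [0.4 + 0.01 * i for i in range(1000)]
metrics = request_metrics(sent_at=0.0, token_times=times)
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```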

Selecting the right configuration to serve FMs typically involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and specific requirements of the workload. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. Although Amazon Bedrock Custom Model Import simplifies this by offering pre-optimized serving configurations, it’s still crucial to verify your deployment’s latency and throughput.
Start by configuring token_benchmark_ray.py, a sample script that facilitates the configuration of a benchmarking test. In the script, you can define parameters such as:

LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
Model: Define the route, API, and model ARN to invoke similarly to the previous section.
Mean/standard deviation of input tokens: Parameters to use in the probability distribution from which the number of input tokens will be sampled.
Mean/standard deviation of output tokens: Parameters to use in the probability distribution from which the number of output tokens will be sampled.
Number of concurrent requests: The number of users that the application is likely to support when in use.
Number of completed requests: The total number of requests to send to the LLM API in the test.
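
The mean and standard deviation parameters can be pictured as follows. This sketch draws token counts from a normal distribution and rejects non-positive draws. The function name is hypothetical, and the exact sampling scheme inside LLMPerf may differ.

```python
import random


def sample_token_count(mean: float, stddev: float, rng: random.Random) -> int:
    """Draw a positive integer token count from a normal distribution."""
    while True:
        value = int(rng.gauss(mean, stddev))
        if value > 0:
            return value


rng = random.Random(0)  # seeded so the illustration is repeatable
input_counts = [sample_token_count(500, 25, rng) for _ in range(100)]
output_counts = [sample_token_count(1000, 100, rng) for _ in range(100)]
print("input tokens:  min", min(input_counts), "max", max(input_counts))
print("output tokens: min", min(output_counts), "max", max(output_counts))
```

A small standard deviation keeps prompts close to the mean length, while a larger one spreads the load across shorter and longer requests.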

The following script shows an example of how to run the benchmark. See this notebook for step-by-step instructions on importing a custom model and running a benchmarking test.
# Values in {braces} are placeholders that the notebook fills in before running the command.
python3 ${LLM_PERF_SCRIPT_DIR}/token_benchmark_ray.py \
    --model "bedrock/llama/{model_id}" \
    --mean-input-tokens {mean_input_tokens} \
    --stddev-input-tokens {stddev_input_tokens} \
    --mean-output-tokens {mean_output_tokens} \
    --stddev-output-tokens {stddev_output_tokens} \
    --max-num-completed-requests ${LLM_PERF_MAX_REQUESTS} \
    --timeout 1800 \
    --num-concurrent-requests ${LLM_PERF_CONCURRENT} \
    --results-dir "${LLM_PERF_OUTPUT}" \
    --llm-api litellm \
    --additional-sampling-params '{}'
At the end of the test, LLMPerf will output two JSON files: one with aggregate metrics, and one with separate entries for every invocation.
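As a rough sketch of working with the per-invocation file, the snippet below assumes it is a JSON list of objects keyed by metric names such as `ttft_s`. That layout, the file name, and the helper `load_metric` are assumptions for illustration, so check them against the files your run actually produces.

```python
import json
import statistics
import tempfile
from pathlib import Path


def load_metric(path: Path, metric: str) -> list[float]:
    """Read per-request records and return one metric, skipping missing values."""
    records = json.loads(path.read_text())
    return [r[metric] for r in records if r.get(metric) is not None]


# Write a tiny made-up file so the sketch runs without a real benchmark.
sample = [{"ttft_s": 0.31}, {"ttft_s": 0.42}, {"ttft_s": None}, {"ttft_s": 0.38}]
path = Path(tempfile.mkdtemp()) / "individual_responses.json"
path.write_text(json.dumps(sample))

ttft = load_metric(path, "ttft_s")
print(f"n={len(ttft)} mean={statistics.mean(ttft):.3f} median={statistics.median(ttft):.3f}")
```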
Scale to zero and cold-start latency
One thing to remember is that because Amazon Bedrock Custom Model Import scales down to zero when the model is unused, you must first make a request to make sure that there is at least one active model copy. If you receive an error indicating that the model isn’t ready, wait between approximately ten seconds and 1 minute for Amazon Bedrock to prepare at least one active model copy. When it is ready, run a test invocation again, and proceed with benchmarking.
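A bounded warm-up loop is one way to handle this before starting a benchmark. In the sketch below, `wait_until_ready` and `flaky_invoke` are hypothetical names, and `invoke` stands for whatever call you use to send a test request. The defaults of six attempts ten seconds apart are an assumption chosen to cover the wait of up to about a minute described above.

```python
import time
from typing import Callable


def wait_until_ready(invoke: Callable[[], str], max_attempts: int = 6,
                     delay_s: float = 10.0) -> str:
    """Call invoke() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except Exception as exc:  # in practice, catch only the model-not-ready error
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)
    raise TimeoutError(f"Model not ready after {max_attempts} attempts") from last_error


calls = {"n": 0}


def flaky_invoke() -> str:
    """Hypothetical stand-in that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("model not ready")
    return "ok"


print(wait_until_ready(flaky_invoke, delay_s=0.01))
```

Unlike an unbounded retry loop, this version gives up with a clear error if the model never becomes available.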
Example scenario for DeepSeek-R1-Distill-Llama-8B
Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust parameters for token count for prompts and completions. For example:

Number of clients: 2
Mean input token count: 500
Standard deviation input token count: 25
Mean output token count: 1000
Standard deviation output token count: 100
Number of requests per client: 50

This illustrative test takes approximately 8 minutes. At the end of the test, you will obtain a summary of aggregate metrics:
inter_token_latency_s
    p25 = 0.010615988283217918
    p50 = 0.010694698716183695
    p75 = 0.010779359342088015
    p90 = 0.010945443657517748
    p95 = 0.01100556307365132
    p99 = 0.011071086908721675
    mean = 0.010710014800224604
    min = 0.010364670612635254
    max = 0.011485444453299149
    stddev = 0.0001658793389904756
ttft_s
    p25 = 0.3356793452499005
    p50 = 0.3783651359990472
    p75 = 0.41098671700046907
    p90 = 0.46655246950049334
    p95 = 0.4846706690498647
    p99 = 0.6790834719300077
    mean = 0.3837810468001226
    min = 0.1878921090010408
    max = 0.7590946710006392
    stddev = 0.0828713133225014
end_to_end_latency_s
    p25 = 9.885957818500174
    p50 = 10.561580732000039
    p75 = 11.271923759749825
    p90 = 11.87688222009965
    p95 = 12.139972019549713
    p99 = 12.6071144856102
    mean = 10.406450886010116
    min = 2.6196457750011177
    max = 12.626598834998731
    stddev = 1.4681851822617253
request_output_throughput_token_per_s
    p25 = 104.68609252502657
    p50 = 107.24619111072519
    p75 = 108.62997591951486
    p90 = 110.90675007239598
    p95 = 113.3896235445618
    p99 = 116.6688412475626
    mean = 107.12082450567561
    min = 97.0053466021563
    max = 129.40680882698936
    stddev = 3.9748004356837137
number_input_tokens
    p25 = 484.0
    p50 = 500.0
    p75 = 514.0
    p90 = 531.2
    p95 = 543.1
    p99 = 569.1200000000001
    mean = 499.06
    min = 433
    max = 581
    stddev = 26.549294727074212
number_output_tokens
    p25 = 1050.75
    p50 = 1128.5
    p75 = 1214.25
    p90 = 1276.1000000000001
    p95 = 1323.75
    p99 = 1372.2
    mean = 1113.51
    min = 339
    max = 1392
    stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034
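
A quick way to sanity-check a summary like this is to confirm that its figures agree with each other. The sketch below uses only values copied from the output above. It derives the test duration in two independent ways and compares overall throughput with per-request throughput, which should differ by roughly the number of concurrent clients.

```python
# Values copied from the summary above.
completed_requests = 100
completed_per_minute = 11.20784995697034
mean_output_tokens = 1113.51
overall_output_throughput = 208.0008834264341  # tokens per second, all clients
mean_request_throughput = 107.12082450567561   # tokens per second, per request

# Test duration implied by the completion rate.
duration_min = completed_requests / completed_per_minute

# Total output tokens, and the duration implied by the overall throughput.
total_output_tokens = completed_requests * mean_output_tokens
duration_from_tokens_s = total_output_tokens / overall_output_throughput

# Ratio of aggregate to per-request throughput.
throughput_ratio = overall_output_throughput / mean_request_throughput

print(f"duration from request rate: {duration_min:.2f} min")
print(f"duration from token count:  {duration_from_tokens_s / 60:.2f} min")
print(f"overall / per-request throughput: {throughput_ratio:.2f}")
```

Both duration estimates agree, and the throughput ratio comes out just under 2, which is consistent with two clients sending requests concurrently.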

In addition to the summary, you will receive metrics for individual requests that can be used to prepare detailed reports like the following histograms for time to first token and token throughput.
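
As a minimal, dependency-free sketch of such a report, the following code bins a list of per-request values into a text histogram. The `histogram` helper and the time-to-first-token values are made up for illustration. With real data you would pass in the values from the per-request results file.

```python
def histogram(values: list[float], num_bins: int) -> list[tuple[float, float, int]]:
    """Split values into equal-width bins; return (low, high, count) per bin."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins or 1.0  # avoid zero width when all values match
    counts = [0] * num_bins
    for v in values:
        index = min(int((v - low) / width), num_bins - 1)  # put the max in the last bin
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, c) for i, c in enumerate(counts)]


# Made-up time-to-first-token values in seconds, for illustration only.
ttft_values = [0.19, 0.31, 0.33, 0.34, 0.37, 0.38, 0.39, 0.41, 0.42, 0.47, 0.49, 0.76]
bins = histogram(ttft_values, num_bins=4)
for bin_low, bin_high, count in bins:
    print(f"{bin_low:.2f}-{bin_high:.2f} s | {'#' * count}")
```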

Analyzing performance results from LLMPerf and estimating costs using Amazon CloudWatch
LLMPerf gives you the ability to benchmark the performance of custom models served in Amazon Bedrock without having to inspect the specifics of the serving properties and configuration of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end user experience of your application.
In addition, the benchmarking exercise can serve as a valuable tool for cost estimation. By using Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot for the ModelCopy metric is shown in the figure below. This data will assist in estimating costs, because billing is based on the number of active model copies at a given time.
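
The arithmetic behind such an estimate can be sketched as follows. The ModelCopy readings and the unit price are made up, and `estimate_cost` is a hypothetical helper. Real pricing depends on your model and Region and may include factors not modeled here, so treat this only as an illustration of counting active copies per 5-minute window.

```python
def estimate_cost(copies_per_window: list[int], price_per_copy_window: float) -> float:
    """Sum active copies over 5-minute windows and multiply by a unit price."""
    return sum(copies_per_window) * price_per_copy_window


# Made-up ModelCopy readings, one per 5-minute window of a load test:
# scale from zero to one copy, briefly to two, then back to zero.
copies = [0, 1, 1, 2, 2, 1, 0]
HYPOTHETICAL_PRICE = 0.50  # placeholder price per copy per 5-minute window, not a real rate

cost = estimate_cost(copies, HYPOTHETICAL_PRICE)
print(f"copy-windows billed: {sum(copies)}, estimated cost: ${cost:.2f}")
```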

Conclusion
Although Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance and to compare models across key metrics such as cost, latency, and throughput.
To learn more, try the example notebook with your custom model.
Additional resources:

Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import
Learn more about LLMPerf and LiteLLM

About the Authors
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.
Prashant Patel is a Senior Software Development Engineer in Amazon Bedrock. He’s passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master’s degree from NYU Tandon School of Engineering. When not at work, he enjoys traveling and playing with his dogs.