YuLan-Mini: A 2.42B Parameter Open Data-efficient Language Model with Long-Context Capabilities and Advanced Training Techniques

Large language models (LLMs) built on transformer architectures depend heavily on pre-training with large-scale data to predict sequential tokens. This complex and resource-intensive process requires enormous computational infrastructure and well-constructed data pipelines. The growing demand for efficient and accessible LLMs has led researchers to explore techniques that balance resource use and performance, with an emphasis on achieving competitive results without relying on industry-scale resources.

Developing LLMs is filled with challenges, especially regarding computation and data efficiency. Pre-training models with billions of parameters demands advanced techniques and substantial infrastructure. High-quality data and robust training methods are crucial, as models face gradient instability and performance degradation during training. Open-source LLMs often struggle to match proprietary counterparts because of limited access to computational power and high-caliber datasets. Therefore, the challenge lies in creating efficient and high-performing models, enabling smaller research groups to participate actively in advancing AI technology. Solving this problem necessitates innovation in data handling, training stabilization, and architectural design.

Existing research in LLM training emphasizes structured data pipelines, using techniques like data cleaning, dynamic scheduling, and curriculum learning to improve learning outcomes. However, stability remains a persistent issue. Large-scale training is susceptible to gradient explosions, loss spikes, and other technical difficulties, requiring careful optimization. Training long-context models introduces additional complexity, as the computational demands of attention mechanisms grow quadratically with sequence length. Existing approaches like advanced optimizers, initialization strategies, and synthetic data generation help alleviate these issues but often fall short when scaled to full-sized models. The need for scalable, stable, and efficient methods in LLM training is more urgent than ever.

Researchers at the Gaoling School of Artificial Intelligence, Renmin University of China, developed YuLan-Mini. With 2.42 billion parameters, this language model improves computational efficiency and performance with data-efficient methods. By leveraging publicly available data and focusing on data-efficient training techniques, YuLan-Mini achieves remarkable performance comparable to larger industry models.

YuLan-Mini’s architecture incorporates several innovative elements to enhance training efficiency. Its decoder-only transformer design employs embedding tying to reduce parameter size and improve training stability. The model uses Rotary Positional Embedding (RoPE) to handle long contexts effectively, extending its context length to 28,672 tokens, well beyond what models of this scale typically support. Other key features include SwiGLU activation functions for better data representation and a carefully designed annealing strategy that stabilizes training while maximizing learning efficiency. Synthetic data was critical, supplementing the 1.08 trillion tokens of training data sourced from open web pages, code repositories, and mathematical datasets. These features enable YuLan-Mini to deliver robust performance with a limited computing budget.
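For readers who want to see how these pieces typically fit together, here is a minimal, illustrative decoder block combining RoPE, SwiGLU, and embedding tying. The dimensions, layer count, and omissions (for example, normalization layers) are placeholders and do not reflect YuLan-Mini’s published configuration.

```python
# Illustrative sketch only: sizes and structure are placeholders, not
# YuLan-Mini's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x, base=10000.0):
    # x: (batch, seq, heads, head_dim); rotate feature pairs by position-dependent angles
    b, s, h, d = x.shape
    pos = torch.arange(s, dtype=torch.float32)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.einsum("s,f->sf", pos, freqs)                     # (seq, d/2)
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)
    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class TinyDecoder(nn.Module):
    def __init__(self, vocab=32000, dim=256, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = SwiGLU(dim, 4 * dim)

    def forward(self, tokens):
        x = self.embed(tokens)
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, s, self.heads, self.head_dim)
        q, k, v = rope(q.reshape(shape)), rope(k.reshape(shape)), v.reshape(shape)
        att = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True)
        x = x + self.proj(att.transpose(1, 2).reshape(b, s, d))
        x = x + self.mlp(x)
        # Embedding tying: reuse the input embedding matrix as the output head
        return x @ self.embed.weight.T

logits = TinyDecoder()(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```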

YuLan-Mini achieved scores of 64.00 on HumanEval in zero-shot settings, 37.80 on MATH-500 in four-shot settings, and 49.10 on MMLU in five-shot tasks. These results underscore its competitive edge, as its performance is comparable to that of much larger and more resource-intensive counterparts. The innovative context length extension to 28K tokens allowed YuLan-Mini to excel in long-text scenarios while still maintaining high accuracy in short-text tasks. This dual capability sets it apart from many existing models, which often sacrifice one for the other.

Key takeaways from the research include:

Using a meticulously designed data pipeline, YuLan-Mini reduces reliance on massive datasets while ensuring high-quality learning.

Techniques like systematic optimization and annealing prevent common issues like loss spikes and gradient explosions.

Extending the context length to 28,672 tokens enhances the model’s applicability to complex, long-text tasks.

Despite its modest computational requirements, YuLan-Mini achieves results comparable to those of much larger models, demonstrating the effectiveness of its design.

The integration of synthetic data improves training outcomes and reduces the need for proprietary datasets.

In conclusion, YuLan-Mini is a notable addition to the growing family of efficient LLMs. Its ability to deliver high performance with limited resources addresses critical barriers to AI accessibility. The research team’s focus on innovative techniques, from data efficiency to training stability, highlights the potential for smaller-scale research efforts to contribute significantly to the field. With just 1.08T tokens, YuLan-Mini sets a benchmark for resource-efficient LLMs.

Quasar-1: A Rigorous Mathematical Framework for Temperature-Guided Reasoning in Language Models

Large language models (LLMs) encounter significant difficulties in performing efficient and logically consistent reasoning. Existing methods, such as chain-of-thought (CoT) prompting, are computationally intensive, scale poorly, and are unsuitable for real-time applications or resource-constrained settings. These limitations restrict their applicability in domains such as financial analysis and decision-making, which require both speed and accuracy.

State-of-the-art reasoning approaches like CoT build structured reasoning paths to improve logical accuracy. However, they are computationally demanding and impractical for applications that require fast responses or operate under tight resource limits. They also do not scale well when many complex queries must be handled at once, which limits their use in production environments, especially in organizations with limited computing resources.

Researchers from SILX AI introduced Quasar-1, a groundbreaking framework based on temperature-guided reasoning, to address these challenges. The two main components are the Token Temperature Mechanism (TTM), which dynamically changes the importance of tokens during reasoning, and the Guided Sequence of Thought (GSoT), which computes the optimal reasoning paths. This architecture reduces unnecessary computation and maintains logical consistency by using token temperatures to focus on contextually relevant information. The architecture brings considerable advances in scalability, efficiency, and adaptability for practical applications.

The framework is constructed upon a transformer-based design, supplemented by temperature-modulated attention mechanisms. The TTM computes temperatures specific to each token to steer reasoning throughout the layers, dynamically modifying token significance as the reasoning evolves. GSoT employs this temperature information to formulate both efficient and precise reasoning pathways. Quasar-1 uses 24 transformer layers with 12 attention heads to balance efficiency and effectiveness. The authors provide theoretical guarantees of convergence toward optimal reasoning paths, supported by empirical verification across a range of reasoning tasks.
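The paper’s exact formulation is not reproduced here; the sketch below shows one plausible reading of temperature-modulated attention, in which a small head predicts a positive per-token temperature that rescales attention logits so low-relevance tokens are downweighted. All module names, shapes, and the placement of the temperature term are assumptions.

```python
# Hypothetical illustration of temperature-modulated attention: a learned
# per-token "temperature" rescales attention logits before the softmax.
# This is an interpretation of the idea, not Quasar-1's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureModulatedAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # Assumed form of a Token Temperature Mechanism: one positive scalar per token
        self.temperature = nn.Sequential(nn.Linear(dim, 1), nn.Softplus())

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).view(b, s, 3, self.heads, self.head_dim).unbind(2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))          # (b, h, s, hd)
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5    # (b, h, s, s)
        temp = self.temperature(x).squeeze(-1)                     # (b, s), one per key token
        logits = logits * temp[:, None, None, :]                   # "hot" tokens get sharper weight
        attn = F.softmax(logits, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, s, d)
        return self.out(ctx), temp

x = torch.randn(2, 10, 64)
out, token_temps = TemperatureModulatedAttention()(x)
print(out.shape, token_temps.shape)  # torch.Size([2, 10, 64]) torch.Size([2, 10])
```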

Quasar-1 performs well, reaching 89.3% accuracy, beating models like GPT-3 and T5-Large. It reduces computational costs by up to 70% and ensures faster and more resource-efficient reasoning capabilities. The framework dynamically prioritizes critical tokens, allowing adaptive error recovery and logical consistency, which makes it fit for complex real-world tasks. These results underline its potential as a practical and scalable solution for environments where both efficiency and accuracy are vital.

By employing temperature-guided reasoning and optimized decision pathways, Quasar-1 overcomes fundamental flaws in existing models, thus providing a scalable and practical approach to logical reasoning. Dynamic token prioritization and adaptive error recovery drive the AI domain forward with practical applications in diverse and resource-constrained environments. This represents a significant milestone in the quest for AI systems that are highly efficient, accurate, and flexible.

Unveiling Privacy Risks in Machine Unlearning: Reconstruction Attacks on Deleted Data

Machine unlearning is driven by the need for data autonomy, allowing individuals to request the removal of their data’s influence on machine learning models. This field complements data privacy efforts, which focus on preventing models from revealing sensitive information about the training data through attacks like membership inference or reconstruction. While differential privacy methods limit these risks, unlearning enables the deletion of data from a trained model, ensuring it behaves as if the data were never included in the first place. Achieving this efficiently, without retraining the entire model, has been a key focus, particularly for complex models like deep neural networks.

However, unlearning introduces new privacy risks. When adversaries compare a model’s parameters before and after data deletion, they can exploit the differences to reconstruct the deleted data, even for simple models like linear regression. This process leverages the gradient of the deleted sample and the expected Hessian derived from public data to approximate the changes caused by unlearning. The approach highlights a unique vulnerability where unlearning unintentionally exposes sensitive data. By extending existing techniques for gradient-based reconstruction attacks, this research reveals how unlearning can facilitate exact data reconstruction, emphasizing the importance of safeguards like differential privacy to mitigate these risks.

Researchers from AWS AI, the University of Pennsylvania, the University of Washington, Carnegie Mellon University, and Jump Trading reveal that data deletion in machine learning models, even simple ones, exposes individuals to high-accuracy reconstruction attacks. These attacks recover deleted data by exploiting differences in model parameters before and after deletion. The study demonstrates effective attacks on linear regression models using closed-form training algorithms and extends these methods to models with pre-trained embeddings and generic architectures via Newton’s method. Experiments on tabular and image datasets highlight significant privacy risks in retraining for unlearning without safeguards like differential privacy.

The researchers present an attack to reconstruct deleted user data from regularized linear regression models by analyzing parameter changes before and after deletion. The method leverages the relationship between model parameters and the removed sample, approximating key statistics using public data. The approach generalizes to models with fixed embeddings and extends to non-linear architectures using Newton’s approximation method. Experiments demonstrate its applicability to multiclass classification and label inference by estimating gradients and reconstructing deleted data. This highlights the vulnerability of models to privacy breaches, especially without safeguards, as the attack remains effective across various architectures and loss functions.
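To make the mechanics concrete, the following is a simplified sketch of this idea for ridge regression, using only the ingredients described above: the parameter difference before and after deletion and a Hessian estimated from public data. It illustrates the principle rather than reproducing the paper’s full attack; dataset sizes and the regularization strength are arbitrary.

```python
# Simplified sketch: for ridge regression retrained from scratch after deleting
# one sample, the parameter difference, premultiplied by a Hessian estimated
# from public data, points (up to scale and sign) toward the deleted sample.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 20, 1.0

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

theta_before = ridge(X, y, lam)           # model trained on all data
theta_after = ridge(X[1:], y[1:], lam)    # model retrained with sample 0 deleted

# Attacker: estimate the expected Hessian from public data of the same distribution
X_pub = rng.normal(size=(2000, d))
H_hat = X_pub.T @ X_pub / len(X_pub) * n + lam * np.eye(d)

# Reconstruction: H * (delta theta) is proportional to the deleted sample
recon = H_hat @ (theta_after - theta_before)

cos = abs(recon @ X[0]) / (np.linalg.norm(recon) * np.linalg.norm(X[0]))
print(f"cosine similarity with deleted sample: {cos:.3f}")  # close to 1.0
```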

The study evaluates the attack across diverse datasets for classification and regression tasks, including tabular and image data. Using full retraining, the researchers compare model parameters before and after a single sample’s deletion. The method leverages public data from the same distribution without needing knowledge of the deleted sample. Against baselines like “Avg” (the average of public samples) and “MaxDiff” (maximizing parameter change), the attack consistently achieves higher cosine similarity with the deleted samples. Tested on MNIST, CIFAR10, and ACS income data, the approach reconstructs deleted samples effectively across various models, emphasizing vulnerabilities in machine learning systems and the need for privacy safeguards.

In conclusion, the work introduces a reconstruction attack capable of recovering deleted data from simple machine learning models with high accuracy. The attack achieves near-perfect results for linear regression and performs effectively on models using embeddings or optimizing different loss functions. Highlighting privacy risks in data deletion or machine unlearning, the findings emphasize the need for techniques like differential privacy. Counterintuitively, data deletion updates can increase vulnerability to reconstruction attacks, even in basic models, exposing sensitive data. Through extensive experiments on diverse datasets, this study underscores the significant privacy risks posed by data deletion requests, even in seemingly low-risk model settings.

A Comprehensive Analytical Framework for Mathematical Reasoning in Multimodal Large Language Models

Mathematical reasoning has emerged as a critical frontier in artificial intelligence, particularly in developing Large Language Models (LLMs) capable of performing complex problem-solving tasks. While traditional mathematical reasoning focuses on text-based inputs, modern applications increasingly involve multimodal elements including diagrams, graphs, and equations. This presents significant challenges for existing systems in processing and integrating information across different modalities. The complexities extend beyond simple text comprehension to deep semantic understanding, context preservation across modalities, and complex reasoning that combines visual and textual elements.

Since 2021, there has been a steady increase in math-specific Large Language Models (MathLLMs), each addressing different aspects of mathematical problem-solving. Early models like GPT-f and Minerva established foundational capabilities in mathematical reasoning, while Hypertree Proof Search and Jiuzhang 1.0 advanced theorem proving and question understanding. The field further diversified in 2023 by introducing multimodal support through models like SkyworkMath, followed by specialized developments in 2024 focusing on mathematical instruction (Qwen2.5-Math) and proof capabilities (DeepSeek-Proof). Despite these advancements, existing approaches focus too narrowly on specific mathematical domains or fail to address the challenges of multimodal mathematical reasoning.

Researchers from HKUST (GZ), HKUST, NTU, and Squirrel AI have proposed a comprehensive analytical framework to understand the landscape of mathematical reasoning in the context of multimodal large language models (MLLMs). Researchers reviewed over 200 research papers published since 2021, focusing on the emergence and evolution of Math-LLMs in multimodal environments. This systematic approach examines the multimodal mathematical reasoning pipeline while investigating the role of both traditional LLMs and MLLMs. The research particularly emphasizes the identification and analysis of five major challenges that affect the achievement of artificial general intelligence in mathematical reasoning.

The basic architecture focuses on problem-solving scenarios where the input consists of problem statements presented either in pure textual format or accompanied by visual elements such as figures and diagrams. The system processes these inputs to generate solutions in numerical or symbolic formats. While English dominates the available benchmarks, some datasets exist in other languages like Chinese and Romanian. Dataset sizes vary significantly, ranging from compact collections like QRData with 411 questions to extensive repositories like OpenMathInstruct-1 containing 1.8 million problem-solution pairs.

The evaluation of mathematical reasoning capabilities in MLLMs uses two primary approaches: discriminative and generative evaluation methods. In discriminative evaluation, models are evaluated based on their ability to correctly classify or select answers, with advanced metrics like performance drop rate (PDR) and specialized metrics like error step accuracy. The generative evaluation approach focuses on the model’s capacity to produce detailed explanations and step-by-step solutions. Notable frameworks like MathVerse utilize GPT-4 to evaluate the reasoning process, while CHAMP implements a solution evaluation pipeline where GPT-4 serves as a grader comparing generated answers against ground truth solutions.
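Metric definitions vary across the benchmarks surveyed; as a rough illustration only, the snippet below computes accuracy for a discriminative-style evaluation and an assumed form of the performance drop rate (the formula is an assumption, not taken from the survey).

```python
# Illustrative only: benchmark-specific definitions differ. Here PDR is taken
# to be the fraction of original accuracy lost on perturbed problem variants.
def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def performance_drop_rate(acc_original, acc_perturbed):
    return (acc_original - acc_perturbed) / acc_original  # assumed form

acc_orig = accuracy(["A", "C", "B", "D"], ["A", "C", "B", "B"])   # 0.75
acc_pert = accuracy(["A", "D", "D", "B"], ["A", "C", "B", "B"])   # 0.50
print(f"PDR = {performance_drop_rate(acc_orig, acc_pert):.2f}")   # 0.33
```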

Here are the five key challenges in mathematical reasoning with MLLMs:

Visual Reasoning Limitations: Current models struggle with complex visual elements like 3D geometry and irregular tables.

Limited Multimodal Integration: While models handle text and vision, they cannot process other modalities like audio explanations or interactive simulations.

Domain Generalization Issues: Models that excel in one mathematical domain often fail to perform well in others, limiting their practical utility.

Error Detection and Feedback: MLLMs currently lack robust mechanisms to detect, categorize, and correct mathematical errors effectively.

Educational Integration Challenges: Current systems don’t adequately account for real-world educational elements like handwritten notes and draft work.

In conclusion, researchers presented a comprehensive analysis of mathematical reasoning in MLLMs that reveals significant progress and persistent challenges in the field. The emergence of specialized Math-LLMs has shown substantial advancement in handling complex mathematical tasks, particularly in multimodal environments. Moreover, addressing the above five challenges is crucial for developing more sophisticated AI systems capable of human-like mathematical reasoning. The insights from this analysis provide a roadmap for future research directions, highlighting the importance of more robust and versatile models that can effectively handle the complexities of mathematical reasoning.

This Research from Amazon Explores Step-Skipping Frameworks: Advancing Efficiency and Human-Like Reasoning in Language Models

The pursuit of enhancing artificial intelligence (AI) capabilities is significantly influenced by human intelligence, particularly in reasoning and problem-solving. Researchers aim to create language models that emulate human-like behaviors, such as optimizing reasoning processes. This involves exploring how models can transition from detailed, step-by-step solutions to more efficient methods by selectively skipping steps, a hallmark of human expertise. These advancements contribute to achieving artificial general intelligence (AGI) with improved efficiency and task-solving capabilities.

A key challenge in AI is the models’ inability to replicate humans’ selective approach to skipping redundant steps during problem-solving. Humans develop this skill through practice, which allows them to reduce cognitive effort and focus on more complex aspects of a problem. Current language models lack this ability, adhering strictly to detailed processes even when simpler, equally effective solutions exist. Developing models incorporating such step-skipping behavior can enhance their efficiency and generalization abilities across various tasks.

Traditional training methods for language models involve step-by-step reasoning, relying on detailed datasets. Techniques such as chain-of-thought prompting encourage sequential solutions but do not address step skipping. As a result, while these models excel in solving problems comprehensively, they fail to demonstrate the efficiency observed in human experts. This limitation presents an opportunity to refine model training approaches to integrate more flexible reasoning capabilities.

Researchers from institutions like Fudan University, UC Santa Barbara, Shanghai AI Laboratory, Westlake University, and Amazon AWS AI developed a novel framework to address this. This approach introduces controlled training environments where models are guided to generate solutions with fewer steps without compromising accuracy. The method emphasizes training models on datasets combining complete and skipped reasoning paths, enabling them to learn efficient and accurate shortcuts.

The training framework comprises two main phases: initialization and iteration. The model is trained on a dataset containing comprehensive, step-by-step reasoning solutions during initialization. This establishes a foundational understanding of problem-solving. In the iteration phase, models are guided to generate shorter reasoning paths by reducing the number of steps in their responses. These shorter paths, verified for accuracy, are mixed with full-step solutions to create expanded datasets. Each iteration refines the model’s ability to identify and skip redundant steps, gradually improving efficiency. For instance, in tasks involving algebraic analogies, multi-digit arithmetic, and directional reasoning, the researchers generated datasets with detailed steps and selectively omitted certain steps to simulate human-like efficiency. These iterations allow the models to self-generate skipping data, refining their reasoning processes.
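The loop below is a schematic sketch of that initialization-plus-iteration recipe. The helper functions are stand-ins for the model-specific pieces (fine-tuning, constrained generation, and answer verification); names and step budgets are illustrative, not the authors’ implementation.

```python
# High-level sketch of the initialization + iteration recipe described above.
# The helpers are stand-ins for model-specific components; only the loop
# structure mirrors the description in the text.

def train(model, dataset):
    # Stand-in: fine-tune the model on (question, reasoning, answer) examples.
    return {"seen": model.get("seen", 0) + len(dataset)}

def generate_with_step_budget(model, question, max_steps):
    # Stand-in: ask the model to answer using at most `max_steps` reasoning steps.
    return {"steps": ["step"] * max_steps, "answer": question["answer"]}

def is_correct(solution, question):
    return solution["answer"] == question["answer"]

questions = [{"id": i, "answer": i * 2} for i in range(8)]
full_step_data = [{"question": q, "steps": ["step"] * 5, "answer": q["answer"]}
                  for q in questions]

# Phase 1: initialization on complete, step-by-step solutions.
model = train({}, full_step_data)

# Phase 2: iteratively harvest verified shorter solutions and retrain on the mix.
for iteration in range(1, 4):
    budget = 5 - iteration                      # ask for fewer steps each round
    skipped = []
    for q in questions:
        sol = generate_with_step_budget(model, q, max_steps=budget)
        if is_correct(sol, q):                  # keep only answers that verify
            skipped.append({"question": q, "steps": sol["steps"], "answer": sol["answer"]})
    mixed_data = full_step_data + skipped       # mix full and skipped-step paths
    model = train(model, mixed_data)
    print(f"iteration {iteration}: kept {len(skipped)} shorter solutions")
```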

Empirical evaluations demonstrated the effectiveness of this approach across three tasks: algebraic analogies, multi-digit addition, and directional reasoning. Results highlighted that step-skipping enhanced both efficiency and generalization. For algebraic analogies, models achieved an accuracy increase of 4.76% in out-of-domain tasks, with a marked reduction in the number of reasoning steps. In multi-digit addition, performance improved by 13.91% in easier out-of-domain scenarios and by 4.75% in harder scenarios, underscoring the benefits of skipped reasoning steps. Similarly, directional reasoning tasks improved, with accuracy gains of up to 9.2% on challenging datasets. These results demonstrate that integrating skipped-step reasoning does not compromise task performance but enables models to solve problems more effectively and efficiently.

Further, the iterative training method showed that models could learn to balance accuracy and efficiency. Each iteration decreased the number of steps taken while maintaining or improving accuracy. By the fifth iteration, models consistently outperformed those trained solely on full-step datasets. This iterative refinement process also provided insights into the models’ ability to generalize to out-of-domain scenarios, suggesting that training on mixed datasets is instrumental in enhancing task-solving capabilities.

The study presents a significant advancement in equipping language models with human-like reasoning abilities. By incorporating step-skipping behavior, researchers demonstrated that models could achieve greater efficiency and maintain accuracy across diverse tasks. This approach addresses a critical limitation in existing models and opens avenues for future research on bridging the gap between human and machine reasoning. The contributions from leading institutions and companies underscore the collaborative efforts driving innovation in AI. The findings provide a promising direction for developing more efficient and versatile language models, paving the way for future advancements in artificial intelligence.

Neural Networks for Scalable Temporal Logic Model Checking in Hardware Verification

Ensuring the correctness of electronic designs is critical, as hardware flaws are permanent post-production and can compromise software reliability or the safety of cyber-physical systems. Verification is central to digital circuit engineering, with FPGA and IC/ASIC projects dedicating 40% and 60% of their time, respectively, to this process. While testing approaches, such as directed or constrained random testing, are easy to implement, they are inherently non-exhaustive and cannot ensure the absence of critical errors. Formal verification, particularly model checking, addresses these limitations by mathematically confirming whether a design satisfies its specifications across all possible executions. However, methods like BDDs and SAT solvers remain computationally intensive and struggle to scale for complex circuits. Engineers often rely on bounded model checking to reduce computational demands, which sacrifices global correctness over extended time horizons.

Formal verification has evolved over decades, with temporal logic playing a key role in describing system behaviors. Based on Linear Temporal Logic (LTL), SystemVerilog Assertions are widely used to define safety and liveness properties. Safety properties are efficiently verified using BDDs, while SAT-based methods scale better for bounded model checking but remain incomplete without achieving impractically high thresholds. Advanced techniques like IC3 and Craig Interpolation improve unbounded safety checking, while Emerson-Lei fixed-point computations and k-liveness extend verification to liveness properties. Verifying systems with complex arithmetic remains challenging, often requiring explicit-state abstractions, inductive invariants, or ranking functions. Originally developed for software termination analysis, ranking functions have been generalized for hardware liveness verification, incorporating non-linear, piecewise-defined, and lexicographic methods to address modern system complexities.

Researchers from the University of Birmingham, Amazon Web Services, and Queen Mary University of London have developed a machine learning-based approach for hardware model checking that integrates neural networks and symbolic reasoning. Their method uses neural networks to represent proof certificates for LTL specifications, trained from randomly generated system executions. The approach guarantees formal correctness over unbounded time horizons by employing satisfiability solving to validate these certificates. Experiments demonstrate its effectiveness, outperforming both academic and commercial model checkers in speed and task completion across standard hardware verification problems, contributing to improved safety and reliability in system designs.

LTL model checking verifies if all possible sequences of actions in a system M comply with a given LTL formula Φ, which describes the desired temporal properties. The system M includes input and state variables, with its behavior determined by transition rules. To check this, Φ is converted into a type of automaton called a Büchi automaton A_Φ. The verification ensures that the combined system M and the automaton A_¬Φ (representing the formula’s negation) have no valid infinite sequences. Neural ranking functions aid in proving termination and are validated using SMT solvers.
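As a rough illustration of the ranking-function idea, the sketch below trains a small feed-forward network so that its value decreases along sampled state transitions. The state encoding, transition samples, and hyperparameters are placeholders, and the exhaustive SMT check that the paper relies on is only indicated in comments.

```python
# Minimal sketch of a neural ranking function: a small network V over states is
# trained on sampled executions so that V strictly decreases along transitions,
# which (once verified exhaustively) witnesses that no fair infinite run exists.
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim = 8
V = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(V.parameters(), lr=1e-2)

# Stand-in training data: pairs (s, s') sampled from random system executions.
s = torch.randn(1024, state_dim)
s_next = s * 0.9 + 0.05 * torch.randn_like(s)   # placeholder transition relation

for epoch in range(200):
    margin = V(s) - V(s_next)                   # want V(s) - V(s') >= 1 on transitions
    loss = torch.relu(1.0 - margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final ranking loss: {loss.item():.4f}")
# Next step (not shown): encode V and the actual transition relation for an SMT
# solver such as Bitwuzla and check that the decrease condition holds for all
# reachable states, not just the sampled ones.
```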

The experimental evaluation tested 194 verification tasks derived from 10 parameterized hardware designs with varying complexity. A prototype neural model-checking tool was developed, using Spot to generate automata, Verilator for data generation, PyTorch for training, and Bitwuzla for SMT-solving. The tool was benchmarked against industry leaders ABC, nuXmv, and anonymized tools X and Y. It completed 93% of tasks, outperforming competitors in scalability and runtime, although challenges like local minima and extended SMT-check times remain. While generally faster, it struggled with trivial tasks like UARTt due to overhead. The method’s limitations include reliance on word-level inputs and risks of dataset bias.

In conclusion, the study introduces an approach to model-checking temporal logic using neural networks as proof certificates for hardware verification. Neural networks are trained on synthetic system executions, leveraging their ability to represent ranking functions for fair termination. The method combines machine learning and symbolic reasoning by validating neural certificates with satisfiability solvers, ensuring formal guarantees. Applied to SystemVerilog designs, it outperforms state-of-the-art tools in scalability. Despite the computational demand of SMT solving, the approach is effective with simple feed-forward networks. This marks the first successful use of neural certificates for temporal logic, establishing a foundation for further advancements in model checking.

Optimizing costs of generative AI applications on AWS

The report The economic potential of generative AI: The next productivity frontier, published by McKinsey & Company, estimates that generative AI could add an equivalent of $2.6 trillion to $4.4 trillion in value to the global economy. The largest value will be added across four areas: customer operations, marketing and sales, software engineering, and R&D.
The potential for such large business value is galvanizing tens of thousands of enterprises to build their generative AI applications in AWS. However, many product managers and enterprise architect leaders want a better understanding of the costs, cost-optimization levers, and sensitivity analysis.
This post addresses these cost considerations so you can optimize your generative AI costs in AWS.
The post assumes a basic familiarity with foundation models (FMs) and large language models (LLMs), tokens, vector embeddings, and vector databases in AWS. With Retrieval Augmented Generation (RAG) being one of the most common frameworks used in generative AI solutions, the post explains costs in the context of a RAG solution and the respective optimization pillars on Amazon Bedrock.
In Part 2 of this series, we will cover how to estimate business value and the influencing factors.
Cost and performance optimization pillars
Designing performant and cost-effective generative AI applications is essential for realizing the full potential of this transformative technology and driving widespread adoption within your organization.
Forecasting and managing costs and performance in generative AI applications is driven by the following optimization pillars:

Model selection, choice, and customization – We define these as follows:

Model selection – This process involves identifying the optimal model that meets a wide variety of use cases, followed by model validation, where you benchmark against high-quality datasets and prompts to identify successful model contenders.
Model choice – This refers to the choice of an appropriate model because different models have varying pricing and performance attributes.
Model customization – This refers to choosing the appropriate techniques to customize the FMs with training data to optimize the performance and cost-effectiveness according to business-specific use cases.

Token usage – Analyzing token usage consists of the following:

Token count – The cost of using a generative AI model depends on the number of tokens processed. This can directly impact the cost of an operation.
Token limits – Understanding token limits and what drives token count, and putting guardrails in place to limit token count can help you optimize token costs and performance.
Token caching – Caching at the application layer or LLM layer for commonly asked user questions can help reduce the token count and improve performance.

Inference pricing plan and usage patterns – We consider two pricing options:

On-Demand – Ideal for most models, with charges based on the number of input/output tokens, with no guaranteed token throughput.
Provisioned Throughput – Ideal for workloads demanding guaranteed throughput, but with relatively higher costs.

Miscellaneous factors – Additional factors can include:

Security guardrails – Applying content filters for personally identifiable information (PII), harmful content, undesirable topics, and detecting hallucinations improves the safety of your generative AI application. These filters can perform and scale independently of LLMs and have costs that are directly proportional to the number of filters and the tokens examined.
Vector database – The vector database is a critical component of most generative AI applications. As the amount of data usage in your generative AI application grows, vector database costs can also grow.
Chunking strategy – Chunking strategies such as fixed size chunking, hierarchical chunking, or semantic chunking can influence the accuracy and costs of your generative AI application.

Let’s dive deeper to examine these factors and associated cost-optimization tips.
Retrieval Augmented Generation
RAG helps an LLM answer questions specific to your corporate data, even though the LLM was never trained on your data.
As illustrated in the following diagram, the generative AI application reads your trusted corporate data sources, chunks the data, generates vector embeddings, and stores the embeddings in a vector database. The vectors and data stored in a vector database are often called a knowledge base.

The generative AI application uses the vector embeddings to search and retrieve chunks of data that are most relevant to the user’s question and augment the question to generate the LLM response. The following diagram illustrates this workflow.

The workflow consists of the following steps (a minimal code sketch follows the list):

A user asks a question using the generative AI application.
A request to generate embeddings is sent to the LLM.
The LLM returns embeddings to the application.
These embeddings are searched against vector embeddings stored in a vector database (knowledge base).
The application receives context relevant to the user question from the knowledge base.
The application sends the user question and the context to the LLM.
The LLM uses the context to generate an accurate and grounded response.
The application sends the final response back to the user.
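The sketch below condenses these steps into code. The functions are placeholders: in the architecture described in this post, embed() and generate() would be calls to Amazon Bedrock models and search() would query your vector database; none of these names correspond to actual APIs.

```python
# Placeholder sketch of the workflow above. embed(), search(), and generate()
# are stand-ins; in practice they would be backed by Amazon Bedrock model
# invocations and a vector database query.

def embed(text):
    # Stand-in for an embeddings model call (steps 2-3).
    return [float(len(word)) for word in text.split()][:8]

def search(knowledge_base, query_embedding, top_k=3):
    # Stand-in for a vector similarity search over the knowledge base (steps 4-5).
    return knowledge_base[:top_k]

def generate(prompt):
    # Stand-in for an LLM inference call (step 7).
    return f"Grounded answer based on: {prompt[:60]}..."

knowledge_base = [
    "chunk about DynamoDB on-demand pricing",
    "chunk about DynamoDB secondary indexes",
    "chunk about S3 storage classes",
]

def answer(question):
    query_embedding = embed(question)                          # steps 2-3
    context_chunks = search(knowledge_base, query_embedding)   # steps 4-5
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"    # step 6
    return generate(prompt)                                    # steps 7-8

print(answer("How do I optimize DynamoDB costs?"))
```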

Amazon Bedrock is a fully managed service providing access to high-performing FMs from leading AI providers through a unified API. It offers a wide range of LLMs to choose from.
In the preceding workflow, the generative AI application invokes Amazon Bedrock APIs to send text to an LLM like Amazon Titan Embeddings V2 to generate text embeddings, and to send prompts to an LLM like Anthropic’s Claude Haiku or Meta Llama to generate a response.
The generated text embeddings are stored in a vector database such as Amazon OpenSearch Service, Amazon Relational Database Service (Amazon RDS), Amazon Aurora, or Amazon MemoryDB.
A generative AI application such as a virtual assistant or support chatbot might need to carry a conversation with users. A multi-turn conversation requires the application to store a per-user question-answer history and send it to the LLM for additional context. This question-answer history can be stored in a database such as Amazon DynamoDB.
The generative AI application could also use Amazon Bedrock Guardrails to detect off-topic questions, ground responses to the knowledge base, detect and redact PII information, and detect and block hate or violence-related questions and answers.
Now that we have a good understanding of the various components in a RAG-based generative AI application, let’s explore how these factors influence costs while running your application in AWS using RAG.
Directional costs for small, medium, large, and extra large scenarios
Consider an organization that wants to help their customers with a virtual assistant that can answer their questions any time with a high degree of accuracy, performance, consistency, and safety. The performance and cost of the generative AI application depends directly on a few major factors in the environment, such as the velocity of questions per minute, the volume of questions per day (considering peak and off-peak), the amount of knowledge base data, and the LLM that is used.
Although this post explains the factors that influence costs, it can be useful to know the directional costs, based on some assumptions, to get a relative understanding of various cost components for a few scenarios such as small, medium, large, and extra large environments.
The following table is a snapshot of directional costs for four different scenarios with varying volume of user questions per month and knowledge base data.

. | SMALL | MEDIUM | LARGE | EXTRA LARGE
INPUTs | . | . | . | .
Total questions per month | 500,000 | 2,000,000 | 5,000,000 | 7,020,000
Knowledge base data size in GB (actual text size on documents) | 5 | 25 | 50 | 100
Annual costs (directional)* | . | . | . | .
Amazon Bedrock On-Demand costs using Anthropic's Claude 3 Haiku | $5,785 | $23,149 | $57,725 | $81,027
Amazon OpenSearch Service provisioned cluster costs | $6,396 | $13,520 | $20,701 | $39,640
Amazon Bedrock Titan Text Embedding v2 costs | $396 | $5,826 | $7,320 | $13,585
Total annual costs (directional) | $12,577 | $42,495 | $85,746 | $134,252
Unit cost per 1,000 questions (directional) | $2.10 | $1.80 | $1.40 | $1.60

These costs are based on assumptions. Costs will vary if assumptions change. Cost estimates will vary for each customer. The data in this post should not be used as a quote and does not guarantee the cost for actual use of AWS services. The costs, limits, and models can change over time.
For the sake of brevity, we use the following assumptions:

Amazon Bedrock On-Demand pricing model
Anthropic’s Claude 3 Haiku LLM
AWS Region us-east-1
Token assumptions for each user question:

Total input tokens to LLM = 2,571
Output tokens from LLM = 149
Average of four characters per token
Total tokens = 2,720

There are other cost components such as DynamoDB to store question-answer history, Amazon Simple Storage Service (Amazon S3) to store data, and AWS Lambda or Amazon Elastic Container Service (Amazon ECS) to invoke Amazon Bedrock APIs. However, these costs are not as significant as the cost components mentioned in the table.

We refer to this table in the remainder of this post. In the next few sections, we will cover Amazon Bedrock costs and the key factors influencing them, vector embedding costs, vector database costs, and Amazon Bedrock Guardrails costs. In the final section, we will cover how chunking strategies influence some of these cost components.
Amazon Bedrock costs
Amazon Bedrock has two pricing models: On-Demand (used in the preceding example scenario) and Provisioned Throughput.
With the On-Demand model, an LLM has a maximum requests (questions) per minute (RPM) and tokens per minute (TPM) limit. The RPM and TPM are typically different for each LLM. For more information, see Quotas for Amazon Bedrock.
In the extra large use case, with 7 million questions per month, assuming 10 hours per day and 22 business days per month, it translates to 532 questions per minute (532 RPM). This is well below the maximum limit of 1,000 RPM for Anthropic’s Claude 3 Haiku.
With 2,720 average tokens per question and 532 requests per minute, the TPM is 2,720 x 532 = 1,447,040, which is well below the maximum limit of 2,000,000 TPM for Anthropic’s Claude 3 Haiku.
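These quota checks are simple arithmetic; the following snippet reproduces the numbers above.

```python
# Reproducing the quota arithmetic above for the extra large scenario.
questions_per_month = 7_020_000
business_days, hours_per_day = 22, 10

rpm = questions_per_month / (business_days * hours_per_day * 60)
tokens_per_question = 2_720          # from the assumptions earlier in this post
tpm = tokens_per_question * round(rpm)

print(f"requests per minute: {rpm:.0f}")   # ~532 RPM, under the 1,000 RPM quota
print(f"tokens per minute:   {tpm:,}")     # 1,447,040 TPM, under the 2,000,000 TPM quota
```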
However, assume that the user questions grow by 50%. The RPM, TPM, or both might cross the thresholds. In cases where the generative AI application needs to cross the On-Demand RPM and TPM thresholds, you should consider the Amazon Bedrock Provisioned Throughput model.
With Amazon Bedrock Provisioned Throughput, cost is charged on a per-model-unit basis. Model units are dedicated for the duration you plan to use them, such as an hourly, 1-month, or 6-month commitment.
Each model unit offers a certain capacity of maximum tokens per minute. Therefore, the number of model units (and the costs) are determined by the input and output TPM.
With Amazon Bedrock Provisioned Throughput, you incur charges per model unit whether you use it or not. Therefore, the Provisioned Throughput model is relatively more expensive than the On-Demand model.
Consider the following cost-optimization tips:

Start with the On-Demand model and test for your performance and latency with your choice of LLM. This will deliver the lowest costs.
If On-Demand can’t satisfy the desired volume of RPM or TPM, start with Provisioned Throughput with a 1-month subscription during your generative AI application beta period. However, for steady state production, consider a 6-month subscription to lower the Provisioned Throughput costs.
If there are shorter peak hours and longer off-peak hours, consider using a Provisioned Throughput hourly model during the peak hours and On-Demand during the off-peak hours. This can minimize your Provisioned Throughput costs.

Factors influencing costs
In this section, we discuss various factors that can influence costs.
Number of questions
Cost grows as the number of questions grows with the On-Demand model, as can be seen in the following figure for annual costs (based on the table discussed earlier).

Input tokens
The main sources of input tokens to the LLM are the system prompt, user prompt, context from the vector database (knowledge base), and context from QnA history, as illustrated in the following figure.
As the size of each component grows, the number of input tokens to the LLM grows, and so do the costs.
Generally, user prompts are relatively small. For example, in the user prompt “What are the performance and cost optimization strategies for Amazon DynamoDB?”, assuming four characters per token, there are approximately 20 tokens.
System prompts can be large (and therefore the costs are higher), especially for multi-shot prompts where multiple examples are provided to get LLM responses with better tone and style. If each example in the system prompt uses 100 tokens and there are three examples, that’s 300 tokens, which is considerably larger than the actual user prompt.
Context from the knowledge base tends to be the largest. For example, when the documents are chunked and text embeddings are generated for each chunk, assume that the chunk size is 2,000 characters. Assume that the generative AI application sends three chunks relevant to the user prompt to the LLM. This is 6,000 characters. Assuming four characters per token, this translates to 1,500 tokens. This is much higher compared to a typical user prompt or system prompt.
Context from QnA history can also be high. Assume an average of 20 tokens in the user prompt and 100 tokens in LLM response. Assume that the generative AI application sends a history of three question-answer pairs along with each question. This translates to (20 tokens per question + 100 tokens per response) x 3 question-answer pairs = 360 tokens.
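Under the assumptions above, the input-token budget per question adds up roughly as follows (the figures are this post’s illustrative numbers, not measured values).

```python
# Adding up the illustrative input-token components discussed above.
system_prompt_tokens = 3 * 100        # three few-shot examples of ~100 tokens each
user_prompt_tokens = 20               # short question, ~4 characters per token
kb_context_tokens = (3 * 2000) // 4   # three 2,000-character chunks from the knowledge base
qna_history_tokens = 3 * (20 + 100)   # three prior question-answer pairs

total_input_tokens = (system_prompt_tokens + user_prompt_tokens
                      + kb_context_tokens + qna_history_tokens)
print(f"approximate input tokens per question: {total_input_tokens}")
# 2,180, in the same range as the 2,571 assumed earlier in this post
```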
Consider the following cost-optimization tips:

Limit the number of characters per user prompt
Test the accuracy of responses with various numbers of chunks and chunk sizes from the vector database before finalizing their values
For generative AI applications that need to carry a conversation with a user, test with two, three, four, or five pairs of QnA history and then pick the optimal value

Output tokens
The response from the LLM will depend on the user prompt. In general, the pricing for output tokens is three to five times higher than the pricing for input tokens.
Consider the following cost-optimization tips:

Because the output tokens are expensive, consider specifying the maximum response size in your system prompt
If some users belong to a group or department that requires higher token limits on the user prompt or LLM response, consider using multiple system prompts in such a way that the generative AI application picks the right system prompt depending on the user

Vector embedding costs
As explained previously, in a RAG application, the data is chunked, and text embeddings are generated and stored in a vector database (knowledge base). The text embeddings are generated by invoking the Amazon Bedrock API with an LLM, such as Amazon Titan Text Embeddings V2. This is independent of the Amazon Bedrock model you choose for inferencing, such as Anthropic’s Claude Haiku or other LLMs.
The pricing to generate text embeddings is based on the number of input tokens. The greater the data, the greater the input tokens, and therefore the higher the costs.
For example, with 25 GB of data, assuming four characters per token, input tokens total 6,711 million. With the Amazon Bedrock On-Demand costs for Amazon Titan Text Embeddings V2 as $0.02 per million tokens, the cost of generating embeddings is $134.22.
However, On-Demand has an RPM limit of 2,000 for Amazon Titan Text Embeddings V2. With 2,000 RPM, it will take 112 hours to embed 25 GB of data. Because this is a one-time job of embedding data, this might be acceptable in most scenarios.
For monthly change rate and new data of 5% (1.25 GB per month), the time required will be 6 hours.
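The following snippet reproduces the embedding cost estimate above; it simply applies the 4-characters-per-token and $0.02-per-million-tokens assumptions.

```python
# Reproducing the embedding cost estimate above for 25 GB of text.
GiB = 1024 ** 3
chars = 25 * GiB                      # treating the 25 GB of text as characters
tokens = chars / 4                    # ~4 characters per token
cost = tokens / 1_000_000 * 0.02      # $0.02 per million input tokens (On-Demand)
print(f"input tokens: {tokens/1e6:,.0f} million, one-time cost: ${cost:,.2f}")
# ~6,711 million tokens and ~$134, matching the figures above. A 5% monthly
# change rate (1.25 GB) scales both numbers down by a factor of 20.
```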
In rare situations where the actual text data is very high in TBs, Provisioned Throughput will be needed to generate text embeddings. For example, to generate text embeddings for 500 GB in 3, 6, and 9 days, it will be approximately $60,000, $33,000, or $24,000 one-time costs using Provisioned Throughput.
Typically, the actual text inside a file is 5–10 times smaller than the file size reported by Amazon S3 or a file system. Therefore, when you see 100 GB size for all your files that need to be vectorized, there is a high probability that the actual text inside the files will be 2–20 GB.
One way to estimate the text size inside files is with the following steps:

Pick 5–10 sample representations of the files.
Open the files, copy the content, and enter it into a Word document.
Use the word count feature to identify the text size.
Calculate the ratio of this size with the file system reported size.
Apply this ratio to the total file system to get a directional estimate of actual text size inside all the files.

Vector database costs
AWS offers many vector databases, such as OpenSearch Service, Aurora, Amazon RDS, and MemoryDB. As explained earlier in this post, the vector database plays a critical role in grounding responses to your enterprise data whose vector embeddings are stored in a vector database.
The following are some of the factors that influence the costs of vector database. For the sake of brevity, we consider an OpenSearch Service provisioned cluster as the vector database.

Amount of data to be used as the knowledge base – Costs are directly proportional to data size. More data means more vectors. More vectors mean larger indexes in a vector database, which in turn require more memory and therefore higher costs. For best performance, it’s recommended to size the vector database so that all the vectors are stored in memory.
Index compression – Vector embeddings can be indexed by HNSW or IVF algorithms. The index can also be compressed. Although compressing the indexes can reduce the memory requirements and costs, it might lose accuracy. Therefore, consider doing extensive testing for accuracy before deciding to use compression variants of HNSW or IVF. For example, for a large text data size of 100 GB, assuming 2,000 bytes of chunk size, 15% overlap, vector dimension count of 512, no upfront Reserved Instance for 3 years, and HNSW algorithm, the approximate costs are $37,000 per year. The corresponding costs with compression using hnsw-fp16 and hnsw-pq are $21,000 and $10,000 per year, respectively.
Reserved Instances – Cost is inversely proportional to the number of years you reserve the cluster instance that stores the vector database. For example, in the preceding scenario, an On-Demand instance would cost approximately $75,000 per year, a no upfront 1-year Reserved Instance would cost $52,000 per year, and a no upfront 3-year Reserved Instance would cost $37,000 per year.

Other factors, such as the number of retrievals from the vector database that you pass as context to the LLM, can influence input tokens and therefore costs. But in general, the preceding factors are the most important cost drivers.
Amazon Bedrock Guardrails
Let’s assume your generative AI virtual assistant is supposed to answer questions related to your products for your customers on your website. How will you avoid users asking off-topic questions such as science, religion, geography, politics, or puzzles? How do you avoid responding to user questions on hate, violence, or race? And how can you detect and redact PII in both questions and responses?
The Amazon Bedrock ApplyGuardrail API can help you solve these problems. Guardrails offer multiple policies such as content filters, denied topics, contextual grounding checks, and sensitive information filters (PII). You can selectively apply these filters to all or a specific portion of data such as user prompt, system prompt, knowledge base context, and LLM responses.
Applying all filters to all data will increase costs. Therefore, you should evaluate carefully which filter you want to apply on what portion of data. For example, if you want PII to be detected or redacted from the LLM response, for 2 million questions per month, approximate costs (based on output tokens mentioned earlier in this post) would be $200 per month. In addition, if your security team wants to detect or redact PII for user questions as well, the total Amazon Bedrock Guardrails costs will be $400 per month.
Chunking strategies
As explained earlier in how RAG works, your data is chunked, embeddings are generated for those chunks, and the chunks and embeddings are stored in a vector database. These chunks of data are retrieved later and passed as context along with user questions to the LLM to generate a grounded and relevant response.
The following are different chunking strategies, each of which can influence costs:

Standard chunking – In this case, you can specify default chunking, which is approximately 300 tokens, or fixed-size chunking, where you specify the token size (for example, 300 tokens) for each chunk (see the sketch after this list). Larger chunks will increase input tokens and therefore costs.
Hierarchical chunking – This strategy is useful when you want to chunk data at smaller sizes (for example, 300 tokens) but send larger pieces of chunks (for example, 1,500 tokens) to the LLM so the LLM has a bigger context to work with while generating responses. Although this can improve accuracy in some cases, this can also increase the costs because of larger chunks of data being sent to the LLM.
Semantic chunking – This strategy is useful when you want chunking based on semantic meaning instead of just the token. In this case, a vector embedding is generated for one or three sentences. A sliding window is used to consider the next sentence and embeddings are calculated again to identify whether the next sentence is semantically similar or not. The process continues until you reach an upper limit of tokens (for example, 300 tokens) or you find a sentence that isn’t semantically similar. This boundary defines a chunk. The input token costs to the LLM will be similar to standard chunking (based on a maximum token size) but the accuracy might be better because of chunks having sentences that are semantically similar. However, this will increase the costs of generating vector embeddings because embeddings are generated for each sentence, and then for each chunk. But at the same time, these are one-time costs (and for new or changed data), which might be worth it if the accuracy is comparatively better for your data.
Advanced parsing – This is an optional pre-step to your chunking strategy. This is used to identify chunk boundaries, which is especially useful when you have documents with a lot of complex data such as tables, images, and text. Therefore, the costs will be the input and output token costs for the entire data that you want to use for vector embeddings. These costs will be high. Consider using advanced parsing only for those files that have a lot of tables and images.
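As referenced under standard chunking above, here is a minimal sketch of fixed-size chunking with overlap. The 4-characters-per-token approximation follows this post’s assumptions; a production pipeline would use the embedding model’s actual tokenizer.

```python
# Minimal sketch of fixed-size chunking with overlap. Token counts are
# approximated as characters / 4, consistent with this post's assumptions.
def fixed_size_chunks(text, chunk_tokens=300, overlap_ratio=0.15):
    chunk_chars = chunk_tokens * 4                      # ~4 characters per token
    step = int(chunk_chars * (1 - overlap_ratio))       # slide with 15% overlap
    return [text[i:i + chunk_chars] for i in range(0, len(text), step)]

document = "Amazon DynamoDB is a serverless NoSQL database. " * 200
chunks = fixed_size_chunks(document)
print(f"{len(chunks)} chunks of ~300 tokens each")
```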

The following table is a relative cost comparison for various chunking strategies.

Chunking Strategy | Standard | Semantic | Hierarchical
Relative Inference Costs | Low | Medium | High

Conclusion
In this post, we discussed various factors that could impact costs for your generative AI application. This is a rapidly evolving space, and costs for the components we mentioned could change in the future. Consider the costs in this post as a snapshot in time that is based on assumptions and is directionally accurate. If you have any questions, reach out to your AWS account team.
In Part 2, we discuss how to calculate business value and the factors that impact business value.

About the Authors
Vinnie Saini is a Senior Generative AI Specialist Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada. With a background in machine learning, she has over 15 years of experience designing and building transformational cloud-based solutions for customers across industries. Her focus has primarily been scaling AI/ML-based solutions for business impact, customized to business needs.
Chandra Reddy is a Senior Manager of a Solutions Architect team at Amazon Web Services (AWS) in Austin, Texas. He and his team help enterprise customers in North America with their AI/ML and generative AI use cases in AWS. He has more than 20 years of experience in software engineering, product management, product marketing, business development, and solution architecture.

Meet CoMERA: An Advanced Tensor Compression Framework Redefining AI Mo …

Training large-scale AI models such as transformers and language models has become an indispensable yet highly demanding process in AI. With billions of parameters, these models offer groundbreaking capabilities but come at a steep cost in terms of computational power, memory, and energy consumption. For example, OpenAI’s GPT-3 comprises 175 billion parameters and requires weeks of GPU training. Such massive requirements limit these technologies to organizations with substantial computational resources, exacerbating concerns over energy efficiency and environmental impact. Addressing these challenges has become critical to ensuring the broader accessibility and sustainability of AI advancements.

The inefficiencies in training large models stem primarily from their reliance on dense matrices, which demand significant memory and computing power. The limited support for optimized low-precision or low-rank operations in modern GPUs further compounds these requirements. While some methods, such as matrix factorization and heuristic rank reduction, have been proposed to alleviate these issues, their real-world applicability is constrained. For instance, GaLore enables training on single-batch settings but suffers from impractical runtime overhead. Similarly, LTE, which adopts low-rank adapters, struggles with convergence on large-scale tasks. The lack of a method that simultaneously reduces memory usage, computational cost, and training time without compromising performance has created an urgent need for innovative solutions.

Researchers from the University at Albany SUNY, the University of California at Santa Barbara, Amazon Alexa AI, and Meta introduced the Computing- and Memory-Efficient training method via Rank-Adaptive tensor optimization (CoMERA), a novel framework that combines memory efficiency with computational speed through rank-adaptive tensor compression. Unlike traditional methods that focus solely on compression, CoMERA adopts a multi-objective optimization approach to balance compression ratio and model accuracy. It utilizes tensorized embeddings and advanced tensor-network contractions to optimize GPU utilization, reducing runtime overhead while maintaining robust performance. The framework also uses CUDA Graphs to minimize kernel-launching delays during GPU operations, a significant bottleneck in traditional tensor compression approaches.

CoMERA’s foundation is based on adaptive tensor representations, which allow model layers to adjust their ranks dynamically based on resource constraints. By modifying tensor ranks, the framework achieves compression without compromising the integrity of neural network operations. This dynamic optimization is achieved through a two-stage training process: 

An early stage focused on stable convergence 

A late stage that fine-tunes ranks to meet specific compression targets
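The rank-adaptive idea behind this two-stage process can be pictured with a toy example. The sketch below is not CoMERA's algorithm (which learns tensor-network ranks end to end during training); it only illustrates how the choice of rank in a low-rank factorization trades parameter count against reconstruction error.

import numpy as np

# Illustrative only: truncated-SVD compression of one weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
rank = 32                                             # an aggressive "late-stage" rank
W_hat = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]       # rank-32 approximation of W

orig_params = W.size                                  # 1,048,576
compressed_params = rank * (W.shape[0] + W.shape[1])  # 65,536
print("compression:", orig_params / compressed_params)               # 16x at this rank
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
# A random matrix compresses poorly; trained weights typically carry far more
# low-rank structure, which is what makes much larger ratios attainable.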

In a six-encoder transformer model, CoMERA achieved compression ratios ranging from 43x in its early stage to an impressive 361x in its late-stage optimizations. Also, it reduced memory consumption by 9x compared to GaLore, with 2-3x faster training per epoch.

When applied to transformer models trained on the MNLI dataset, CoMERA reduced model sizes from 256 MB to as little as 3.2 MB while preserving accuracy. In large-scale recommendation systems like DLRM, CoMERA compressed models by 99x and achieved a 7x reduction in peak memory usage. The framework also excelled in pre-training CodeBERT, a domain-specific large language model, where it gained a 4.23x overall compression ratio and demonstrated a 2x speedup during certain training phases. These results underscore its ability to handle diverse tasks and architectures, extending its applicability across domains.

The key takeaways from this research are as follows:

CoMERA achieved compression ratios of up to 361x for specific layers and 99x for full models, drastically reducing storage and memory requirements.

The framework delivered 2-3x faster training times per epoch for transformers and recommendation systems, saving computational resources and time.

Using tensorized representations and CUDA Graph, CoMERA reduced peak memory consumption by 7x, enabling training on smaller GPUs.

CoMERA’s approach supports diverse architectures, including transformers and large language models, while maintaining or improving accuracy.

By lowering the energy and resource demands of training, CoMERA contributes to more sustainable AI practices and makes cutting-edge models accessible to a broader audience.

In conclusion, CoMERA addresses some of the most significant barriers to AI scalability and accessibility by enabling faster, memory-efficient training. Its adaptive optimization capabilities and compatibility with modern hardware make it a compelling choice for organizations seeking to train large models without incurring prohibitive costs. This study’s results pave the way for further exploration of tensor-based optimizations in domains like distributed computing and resource-constrained edge devices.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Meet CoMERA: An Advanced Tensor Compression Framework Redefining AI Model Training with Speed and Precision appeared first on MarkTechPost.

CoordTok: A Scalable Video Tokenizer that Learns a Mapping from Co-ord …

Breaking down videos into smaller, meaningful parts for vision models remains challenging, particularly for long videos. Vision models rely on these smaller parts, called tokens, to process and understand video data, but creating these tokens efficiently is difficult. While recent tools achieve better video compression than older methods, they struggle to handle large video datasets effectively. A key issue is their inability to fully utilize temporal coherence, the natural pattern where video frames are often similar over short periods, which video codecs use for efficient compression. These tools are also computationally expensive to train and are limited to short clips, making them not very effective in capturing patterns and processing longer videos.

Current video tokenization methods have high computational costs and struggle to handle long video sequences efficiently. Early approaches used image tokenizers to compress videos frame by frame but ignored the natural continuity between frames, reducing their effectiveness. Later methods introduced spatiotemporal layers, reduced redundancy, and used adaptive encoding, but they still required rebuilding entire video frames during training, which limited them to short clips. Video generation models like autoregressive methods, masked generative transformers, and diffusion models are also limited to short sequences. 

To solve this, researchers from KAIST and UC Berkeley proposed CoordTok, which learns a mapping from coordinate-based representations to the corresponding patches of input videos. Motivated by recent advances in 3D generative models, CoordTok encodes a video into factorized triplane representations and reconstructs patches corresponding to randomly sampled (x, y, t) coordinates. This approach allows large tokenizer models to be trained directly on long videos without requiring excessive resources. The video is divided into space-time patches and processed using transformer layers, with the decoder mapping sampled (x, y, t) coordinates to corresponding pixels. This reduces both memory and computational costs while preserving video quality.

Building on this, the researchers extended CoordTok with a hierarchical architecture that captures both local and global features of a video. The architecture uses the factorized triplane representation to process space and time patches, making long-duration video processing feasible without excessive computational resources while maintaining high video quality.

This hierarchical structure allowed the model to process space-time patches more efficiently with transformer layers, which helped generate the factorized triplane representations. As a result, CoordTok handled longer videos without demanding excessive computational resources. For example, CoordTok encoded a 128-frame video with 128×128 resolution into 1280 tokens, while baselines required 6144 or 8192 tokens to achieve similar reconstruction quality. The model’s reconstruction quality was further improved by fine-tuning with both ℓ2 loss and LPIPS loss, enhancing the accuracy of the reconstructed frames. This combination of strategies reduced memory usage by up to 50% and cut computational costs while maintaining high-quality video reconstruction, with models like CoordTok-L achieving a PSNR of 26.9.
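For intuition about the factorized triplane representation (the exact CoordTok architecture is described in the paper), the toy sketch below shows why storing three 2D feature planes is far cheaper than a dense 3D grid of per-coordinate features.

import numpy as np

# Toy factorized-triplane lookup (not the CoordTok code). Storing three 2D
# planes costs H*W + W*T + H*T feature vectors instead of a dense H*W*T grid.
H = W = 128          # spatial resolution
T = 128              # number of frames
C = 8                # feature channels per plane entry

plane_xy = np.zeros((W, H, C), dtype=np.float32)
plane_xt = np.zeros((W, T, C), dtype=np.float32)
plane_yt = np.zeros((H, T, C), dtype=np.float32)

def triplane_feature(x: int, y: int, t: int) -> np.ndarray:
    # A coordinate's feature is assembled from three 2D lookups; a decoder
    # (omitted here) would map this feature to the patch or pixel at (x, y, t).
    return plane_xy[x, y] + plane_xt[x, t] + plane_yt[y, t]

dense_entries = H * W * T                    # per-coordinate features in a full 3D grid
triplane_entries = H * W + W * T + H * T     # features actually stored
print(dense_entries / triplane_entries)      # ~42x fewer stored features here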

In conclusion, CoordTok proves to be an efficient video tokenizer that uses coordinate-based representations to reduce computational costs and memory requirements while encoding long videos.

It allows memory-efficient training for video generation models, making it possible to handle long videos with fewer tokens. However, it is less effective on highly dynamic videos, and the authors suggest further improvements, such as using multiple content planes or adaptive methods. This work can serve as a starting point for future research on scalable video tokenizers and generation, which can be beneficial for understanding and generating long videos.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.

The post CoordTok: A Scalable Video Tokenizer that Learns a Mapping from Co-ordinate-based Representations to the Corresponding Patches of Input Videos appeared first on MarkTechPost.

Deep Learning and Vocal Fold Analysis: The Role of the GIRAFE Dataset

Semantic segmentation of the glottal area from high-speed videoendoscopic (HSV) sequences presents a critical challenge in laryngeal imaging. The field faces a significant shortage of high-quality, annotated datasets for training robust segmentation models. This limitation hinders the development of automatic segmentation technologies and of diagnostic tools such as Facilitative Playbacks (FPs), which are crucial for assessing vibratory dynamics in the vocal folds. The limited availability of extensive datasets makes it harder for clinicians to diagnose and treat voice disorders accurately, leaving a substantial gap in both research and clinical practice.

Current techniques for glottal segmentation include classical image processing methods such as active contours and watershed transformations. These techniques generally require considerable manual input and cannot cope with varying illumination conditions or complex glottis-closure scenarios. Deep learning models, although promising, are limited by the need for large, high-quality annotated datasets. Publicly available datasets like BAGLS provide grayscale recordings, but they are less diverse and granular, which reduces their ability to generalize to complex segmentation tasks. These factors underline the urgent need for a dataset that offers better versatility, more complex features, and broader clinical relevance.

Researchers from the University of Brest, University of Patras, and Universidad Politécnica de Madrid introduce the GIRAFE dataset to address the limitations of existing resources. GIRAFE is a robust and comprehensive repository comprising 65 HSV recordings from 50 patients, each meticulously annotated with segmentation masks. In contrast to other datasets, GIRAFE offers color HSV recordings, which make subtle anatomical and pathological features visually detectable. This resource enables high-resolution assessments with classical segmentation approaches, such as InP and Loh, as well as recent deep neural architectures, such as UNet and SwinUnetV2. Beyond segmentation, the dataset also supports Facilitative Playbacks, including GAW, GVG, and PVG, which are the principal means of visualizing vibratory patterns in the vocal folds and studying vocal-fold phonatory dynamics.

The GIRAFE dataset offers extensive features suitable for a wide variety of research. It includes 760 expert-validated and annotated frames, which enables proper training and evaluation with reliable segmentation masks. The dataset supports both traditional image processing techniques, such as InP and Loh, and advanced deep learning architectures. HSV recordings are captured at a high temporal resolution of 4000 frames per second with a spatial resolution of 256×256 pixels, ensuring detailed analysis of vocal fold dynamics. The dataset is organized into structured directories, including \Raw_Data, \Seg_FP-Results, and \Training, facilitating ease of access and integration into research pipelines. This combination of systematic arrangement and color recordings makes it easier to view glottal characteristics and allows the exploration of complex vibratory patterns across a wide range of clinical conditions.

The GIRAFE dataset was validated with both traditional approaches and deep learning, demonstrating its value for advancing segmentation techniques. Traditional segmentation techniques, such as the InP method, performed well across different challenging cases, indicating that they are robust. Deep learning models like UNet and SwinUnetV2 also demonstrated good performance, with UNet achieving the best segmentation accuracy in simpler conditions. The diversity of the dataset, spanning various pathologies, illumination conditions, and anatomical variations, makes it a benchmark resource. These results confirm that the dataset can contribute to the improved development and assessment of segmentation methods and support innovation in clinical laryngeal imaging applications.
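As an illustration of how segmentation masks from such a dataset can be scored, the sketch below uses the Dice overlap, a common segmentation metric; the GIRAFE paper defines its own evaluation protocol, so this is only an example, not the authors' exact procedure.

import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between two binary masks of identical shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps))

# Toy 256x256 masks matching the dataset's spatial resolution.
pred = np.zeros((256, 256)); pred[100:150, 120:140] = 1
truth = np.zeros((256, 256)); truth[105:150, 118:138] = 1
print(round(dice_coefficient(pred, truth), 3))   # ~0.85 for this synthetic pair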

The GIRAFE dataset represents an important milestone in the landscape of laryngeal imaging research. With its inclusion of color HSV recordings, diverse annotations, and the integration of both traditional and deep learning methodologies, this dataset addresses the limitations inherent in the current datasets and sets a new benchmark within the domain. This dataset helps further bridge traditional and modern approaches while providing a dependable basis for the advancement of sophisticated segmentation methods and diagnostic instruments. Its contributions can potentially change the examination and management of voice disorders, and thus, it would be a great source for clinicians and researchers alike looking to advance the field of vocal fold dynamics and related diagnostics.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Deep Learning and Vocal Fold Analysis: The Role of the GIRAFE Dataset appeared first on MarkTechPost.

This AI Paper by The Data Provenance Initiative Team Highlights Challe …

The advancement of artificial intelligence hinges on the availability and quality of training data, particularly as multimodal foundation models grow in prominence. These models rely on diverse datasets spanning text, speech, and video to enable language processing, speech recognition, and video content generation tasks. However, the lack of transparency regarding dataset origins and attributes creates significant barriers. Using training data that is geographically and linguistically skewed, inconsistently licensed, or poorly documented introduces ethical, legal, and technical challenges. Understanding the gaps in data provenance is essential for advancing responsible and inclusive AI technologies.

AI systems face a critical issue in dataset representation and traceability, which limits the development of unbiased and legally sound technologies. Current datasets often rely heavily on a few web-based or synthetically generated sources. These include platforms like YouTube, which accounts for a significant share of speech and video datasets, and Wikipedia, which dominates text data. This dependency results in datasets failing to represent underrepresented languages and regions adequately. In addition, the unclear licensing practices of many datasets create legal ambiguities, as more than 80% of widely used datasets carry some form of undocumented or implicit restrictions despite only 33% being explicitly licensed for non-commercial use.

Attempts to address these challenges have traditionally focused on narrow aspects of data curation, such as removing harmful content or mitigating bias in text datasets. However, such efforts are typically limited to single modalities and lack a comprehensive framework to evaluate datasets across modalities like speech and video. Platforms hosting these datasets, such as HuggingFace or OpenSLR, often lack the mechanisms to ensure metadata accuracy or enforce consistent documentation practices. This fragmented approach underscores the urgent need for a systematic audit of multimodal datasets that holistically considers their sourcing, licensing, and representation.

To close this gap, researchers from the Data Provenance Initiative conducted the largest longitudinal audit of multimodal datasets, examining nearly 4,000 public datasets created between 1990 and 2024. The audit spanned 659 organizations from 67 countries, covering 608 languages and nearly 1.9 million hours of speech and video data. This extensive analysis revealed that web-crawled and social media platforms now account for most training data, with synthetic sources also rapidly growing. The study highlighted that while only 25% of text datasets have explicitly restrictive licenses, nearly all content sourced from platforms like YouTube or OpenAI carries implicit non-commercial constraints, raising questions about legal compliance and ethical use.

The researchers applied a meticulous methodology to annotate datasets, tracing their lineage back to sources. This process uncovered significant inconsistencies in how data is licensed and documented. For instance, while 96% of text datasets include commercial licenses, over 80% of their source materials impose restrictions that are not carried forward in the dataset’s documentation. Similarly, video datasets depended heavily on proprietary or restricted platforms, with 71% of video data originating from YouTube alone. Such findings underscore the challenges practitioners face in accessing data responsibly, particularly when datasets are repackaged or re-licensed without preserving their original terms.

Notable findings from the audit include the dominance of web-sourced data, particularly for speech and video. YouTube emerged as the most significant source, contributing nearly 1 million hours each of speech and video content, surpassing other sources like audiobooks or movies. Synthetic datasets, while still a smaller portion of overall data, have grown rapidly, with models like GPT-4 contributing significantly. The audit also revealed stark geographical imbalances. North American and European organizations accounted for 93% of text data, 61% of speech data, and 60% of video data. In comparison, regions like Africa and South America collectively represented less than 0.2% across all modalities.

Geographical and linguistic representation remains a persistent challenge despite nominal increases in diversity. Over the past decade, the number of languages represented in training datasets has grown to over 600, yet measures of equality in representation have shown no significant improvement. The Gini coefficient, which measures inequality, remains above 0.7 for geographical distribution and above 0.8 for language representation in text datasets, highlighting the disproportionate concentration of contributions from Western countries. For speech datasets, while representation from Asian countries like China and India has improved, African and South American organizations continue to lag far behind.
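For readers unfamiliar with the metric, the Gini coefficient mentioned above can be computed from per-region (or per-language) contribution counts. The shares below are made up purely to illustrate the calculation; they are not figures from the audit.

import numpy as np

def gini(shares) -> float:
    """Gini coefficient of a 1-D array of non-negative contributions."""
    x = np.sort(np.asarray(shares, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Standard formulation via the Lorenz curve.
    return float((n + 1 - 2 * np.sum(cum) / cum[-1]) / n)

print(round(gini([90, 4, 3, 2, 1]), 2))      # one dominant contributor -> 0.72
print(round(gini([20, 20, 20, 20, 20]), 2))  # perfectly even shares -> 0.0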

The research provides several critical takeaways, offering valuable insights for developers and policymakers:

Over 70% of speech and video datasets are derived from web platforms like YouTube, while synthetic sources are becoming increasingly popular, accounting for nearly 10% of all text data tokens.

While only 33% of datasets are explicitly non-commercial, over 80% of source content is restricted. This mismatch complicates legal compliance and ethical use.

North American and European organizations dominate dataset creation, with African and South American contributions at less than 0.2%. Linguistic diversity has grown nominally but remains concentrated in a few dominant languages.

GPT-4, ChatGPT, and other models have significantly contributed to the rise of synthetic datasets, which now represent a growing share of training data, particularly for creative and generative tasks.

The lack of transparency and persistent Western-centric biases call for more rigorous audits and equitable practices in dataset curation.

In conclusion, this comprehensive audit sheds light on the growing reliance on web-crawled and synthetic data, the persistent inequalities in representation, and the complexities of licensing in multimodal datasets. By identifying these challenges, the researchers provide a roadmap for creating more transparent, equitable, and responsible AI systems. Their work underscores the need for continued vigilance and measures to ensure that AI serves diverse communities fairly and effectively. This study is a call to action for practitioners, policymakers, and researchers to address the structural inequities in the AI data ecosystem and prioritize transparency in data provenance.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This AI Paper by The Data Provenance Initiative Team Highlights Challenges in Multimodal Dataset Provenance, Licensing, Representation, and Transparency for Responsible Development appeared first on MarkTechPost.

Frenzy: A Memory-Aware Serverless Computing Method for Heterogeneous G …

Artificial Intelligence (AI) has been advancing along an exponentially growing trajectory, incorporating vast amounts of data and building ever more complex large language models (LLMs). Training these LLMs requires more computational power and resources for memory allocation, power usage, and hardware. Optimizing memory utilization across different types and configurations of GPUs is complex, and deciding the types and number of GPUs required for training a specific model has become an error-prone process for developers. In addition, different LLM tasks need to be scheduled efficiently across heterogeneous GPUs. The complexity of LLMs makes it difficult to guarantee that resources are utilized efficiently. To address these issues, a team of researchers has developed Frenzy, which automates resource allocation and scheduling.

Traditional methods allocate GPU resources statically, without adapting to dynamic memory requirements during training. Configurations must be set manually, which provides only limited adaptability to different GPU types and memory capacities. This leads to suboptimal utilization of hardware resources, increasing training costs and time. A new approach is therefore needed to address inefficient resource allocation, adapt to hardware heterogeneity, and raise the efficiency of training complex LLMs.

The proposed method, Frenzy, trains LLMs on heterogeneous GPU clusters. The key features of Frenzy include:

Memory-Aware Resources Predictor (MARP): MARP can predict peak memory usage by analyzing the LLM architecture. 

Heterogeneity-Aware Scheduling (HAS): HAS distributes LLM tasks efficiently across different GPUs based on their memory capacity and computational power. 

Serverless Integration: Developers need not specify GPU requirements; this system can automatically do that.

Dynamic Memory Optimization: The system continuously monitors memory usage, and bottlenecks are avoided by redistributing memory-intensive tasks. 

Experiments demonstrated that Frenzy’s memory usage prediction accuracy exceeds 92%. It reduced the scheduling overhead by 10 times compared to the traditional approaches. The average job completion time also decreased by 12% to 18%. Frenzy achieves superior resource allocation and adapts dynamically to GPU clusters. 
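The paper details MARP's predictor; the sketch below only illustrates the kind of back-of-envelope memory arithmetic such a predictor formalizes. The 16-bytes-per-parameter figure is a commonly cited assumption for mixed-precision Adam training, not Frenzy's actual model, which also accounts for activations, parallelism strategy, and GPU type.

def rough_training_memory_gb(n_params_billion: float, bytes_per_param: float = 16.0) -> float:
    """~16 bytes/param covers bf16 weights, an fp32 master copy, Adam moments, and gradients."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# A 7B-parameter model needs on the order of 100 GB before activations,
# which is why the job has to be sharded across several GPUs.
print(round(rough_training_memory_gb(7)))   # ~104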

In summary, Frenzy tackles a critical bottleneck in training LLMs with a memory-aware, serverless system tailored for heterogeneous GPU clusters. Dynamic resource scheduling and memory-aware optimizations yield significant increases in efficiency, scalability, and cost-effectiveness. This research represents a stride toward sustainable and scalable LLM training solutions by offering a robust framework for effectively harnessing heterogeneous GPU clusters. Frenzy’s adaptability and high performance set a new landmark in LLM training and opened up broader adoption in research and industry.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Frenzy: A Memory-Aware Serverless Computing Method for Heterogeneous GPU Clusters appeared first on MarkTechPost.

Salesforce AI Research Introduces AGUVIS: A Unified Pure Vision Framew …

Graphical User Interfaces (GUIs) play a fundamental role in human-computer interaction, providing the medium through which users accomplish tasks across web, desktop, and mobile platforms. Automation in this field is transformative, potentially drastically improving productivity and enabling seamless task execution without requiring manual intervention. Autonomous agents capable of understanding and interacting with GUIs could revolutionize workflows, particularly in repetitive or complex task settings. However, GUIs’ inherent complexity and variability across platforms pose significant challenges. Each platform uses distinct visual layouts, action spaces, and interaction logic, making creating scalable and robust solutions difficult. Developing systems that can navigate these environments autonomously while generalizing across platforms remains an ongoing challenge for researchers in this domain.

There are many technical hurdles in GUI automation right now; one is aligning natural language instructions with the diverse visual representations of GUIs. Traditional methods often rely on textual representations, such as HTML or accessibility trees, to model GUI elements. These approaches are limited because GUIs are inherently visual, and textual abstractions fail to capture the nuances of visual design. In addition, textual representations vary between platforms, leading to fragmented data and inconsistent performance. This mismatch between the visual nature of GUIs and the textual inputs used in automation systems results in reduced scalability, longer inference times, and limited generalization. Also, most current methods are incapable of effective multimodal reasoning and grounding, which are essential for understanding complex visual environments.

Existing tools and techniques have attempted to address these challenges with mixed success. Many systems depend on closed-source models to enhance reasoning and planning capabilities. These models often use natural language communication to combine grounding and reasoning processes, but this approach introduces information loss and lacks scalability. Another common limitation is the fragmented nature of training datasets, which fail to provide comprehensive support for grounding and reasoning tasks. For instance, datasets typically emphasize either grounding or reasoning, but not both, leading to models that excel in one area while struggling in others. This division hampers the development of unified solutions for autonomous GUI interaction.

The University of Hong Kong researchers and Salesforce Research introduced AGUVIS (7B and 72B), a unified framework designed to overcome these limitations by leveraging pure vision-based observations. AGUVIS eliminates the reliance on textual representations and instead focuses on image-based inputs, aligning the model’s structure with the visual nature of GUIs. The framework includes a consistent action space across platforms, facilitating cross-platform generalization. AGUVIS integrates explicit planning and multimodal reasoning to navigate complex digital environments. The researchers constructed a large-scale dataset of GUI agent trajectories, which was used to train AGUVIS in a two-stage process. The framework’s modular architecture, which includes a pluggable action system, allows for seamless adaptation to new environments and tasks.

The AGUVIS framework employs a two-stage training paradigm to equip the model with grounding and reasoning capabilities: 

During the first stage, the model focuses on grounding and mapping natural language instructions to visual elements within GUI environments. This stage utilizes a grounding packing strategy, bundling multiple instruction-action pairs into a single GUI screenshot. This method improves training efficiency by maximizing the utility of each image without sacrificing accuracy. 

The second stage introduces planning and reasoning, training the model to execute multi-step tasks across various platforms and scenarios. This stage incorporates detailed inner monologues, which include observation descriptions, thoughts, and low-level action instructions. By progressively increasing the complexity of training data, the model learns to handle nuanced tasks with precision and adaptability.

AGUVIS demonstrated strong results in both offline and real-world online evaluations. In GUI grounding, the model achieved an average accuracy of 89.2, surpassing state-of-the-art methods across mobile, desktop, and web platforms. In online scenarios, AGUVIS outperformed competing models with a 51.9% improvement in step success rate during offline planning tasks. Also, the model achieved a 93% reduction in inference costs compared to GPT-4o. By focusing on visual observations and integrating a unified action space, AGUVIS sets a new benchmark for GUI automation, making it the first fully autonomous pure vision-based agent capable of completing real-world tasks without reliance on closed-source models.

Key takeaways from the research on AGUVIS in the field of GUI automation:

AGUVIS uses image-based inputs, reducing token costs significantly and aligning the model with the inherently visual nature of GUIs. This approach results in a token cost of only 1,200 for 720p image observations, compared to 6,000 for accessibility trees and 4,000 for HTML-based observations.

The model combines grounding and planning stages, enabling it to perform single- and multi-step tasks effectively. The grounding training alone equips the model to process multiple instructions within a single image, while the reasoning stage enhances its ability to execute complex workflows.

The AGUVIS Collection unifies and augments existing datasets with synthetic data to support multimodal reasoning and grounding. This results in a diverse and scalable dataset, enabling the training of robust and adaptable models.

Using pyautogui commands and a pluggable action system allows the model to generalize across platforms while accommodating platform-specific actions, such as swiping on mobile devices (a toy dispatch sketch follows this list).

AGUVIS achieved remarkable results in GUI grounding benchmarks, with accuracy rates of 88.3% on web platforms, 85.7% on mobile, and 81.8% on desktops. Also, it demonstrated superior efficiency, reducing USD inference costs by 93% compared to existing models.
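As a rough picture of what a pluggable action system dispatching to pyautogui can look like, consider the sketch below; the action names and schema are illustrative assumptions, not the AGUVIS interface.

import pyautogui

# Toy pluggable action space: the model emits a platform-agnostic action dict,
# and a registered handler executes it on the current platform.
ACTIONS = {
    "click":  lambda x, y: pyautogui.click(x=x, y=y),
    "type":   lambda text: pyautogui.write(text, interval=0.02),
    "scroll": lambda amount: pyautogui.scroll(amount),
}

def execute(action: dict) -> None:
    """Run one predicted action, e.g. {"name": "click", "args": {"x": 120, "y": 340}}."""
    ACTIONS[action["name"]](**action["args"])

# A mobile backend could register a platform-specific "swipe" handler here
# without changing how the model emits actions.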

In conclusion, the AGUVIS framework addresses critical challenges in grounding, reasoning, and generalization in GUI automation. Its purely vision-based approach eliminates the inefficiencies associated with textual representations, while its unified action space enables seamless interaction across diverse platforms. The research provides a robust solution for autonomous GUI tasks, with applications ranging from productivity tools to advanced AI systems.

Check out the Paper, GitHub Page, and Project. All credit for this research goes to the researchers of this project.

The post Salesforce AI Research Introduces AGUVIS: A Unified Pure Vision Framework Transforming Autonomous GUI Interaction Across Platforms appeared first on MarkTechPost.

PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

Training large language models (LLMs) has become a significant expense for businesses. For many use cases, companies want to use LLM foundation models (FMs) with their domain-specific data. However, companies are discovering that performing full fine tuning for these models with their data isn’t cost effective. To reduce costs while continuing to use the power of AI, many companies have shifted to fine tuning LLMs on their domain-specific data using Parameter-Efficient Fine Tuning (PEFT). PEFT is a set of techniques designed to adapt pre-trained LLMs to specific tasks while minimizing the number of parameters that need to be updated. Techniques such as Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA) significantly reduce the number of trainable parameters, resulting in lower costs for fine tuning.
In addition to cost, performing fine tuning for LLMs at scale presents significant technical challenges. The process of setting up and configuring a distributed training environment can be complex, requiring expertise in server management, cluster configuration, networking, and distributed computing. Manually managing this complexity can be counter-productive and take valuable resources away from your business’s AI development. To simplify infrastructure setup and accelerate distributed training, AWS introduced Amazon SageMaker HyperPod in late 2023.
In this blog post, we showcase how you can perform efficient supervised fine tuning for a Meta Llama 3 model using PEFT on AWS Trainium with SageMaker HyperPod. We use HuggingFace’s Optimum-Neuron software development kit (SDK) to apply LoRA to fine-tuning jobs, and use SageMaker HyperPod as the primary compute cluster to perform distributed training on Trainium. Using LoRA supervised fine-tuning for Meta Llama 3 models, you can further reduce your cost to fine tune models by up to 50% and reduce the training time by 70%.
Solution overview
SageMaker HyperPod is designed to help reduce the time required to train generative AI FMs by providing a purpose-built infrastructure for distributed training at scale. When using SageMaker HyperPod for training, SageMaker will actively monitor the cluster’s health, automatically replacing faulty nodes and resuming model training from checkpoints. The clusters come pre-configured with SageMaker distributed training libraries that enable you to split your training data and model across thousands of compute nodes, allowing data to be processed in parallel while fully utilizing the cluster’s compute and network infrastructure. You can also customize your distributed training. The architecture diagram that follows provides a high level overview of these various components:

Compute cluster: This contains a head node that orchestrates computation across a cluster of worker nodes. Because the head node is only facilitating the training, it’s typically a much smaller instance. In this post, we use Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances for the worker nodes and a single Amazon EC2 C5 instance for the head node.
Shared Volume: FSx for Lustre is used as the shared storage volume across nodes to maximize data throughput. It’s mounted at /fsx on the head and compute nodes.
External storage: Amazon Simple Storage Service (Amazon S3) is used to store the cluster’s lifecycle scripts, configuration files, datasets, and checkpoints.
Scheduler: SLURM is used as the job scheduler for the cluster.

Trainium chips are purpose-built for deep learning training of 100 billion and larger parameter models. Model training on Trainium is supported by the AWS Neuron SDK, which provides compiler, runtime, and profiling tools that unlock high-performance and cost-effective deep learning acceleration. To learn more about Trainium chips and the Neuron SDK, see Welcome to AWS Neuron.
To integrate Trainium chips with existing models and tools provided through the transformers package, Hugging Face’s Optimum-Neuron package functions as an interface with Neuron. With Optimum-Neuron, users can apply techniques such as LoRA to their fine-tuning jobs, streamlining the process of adapting LLMs for specific tasks while capitalizing on the performance gains provided by the AWS infrastructure.
Traditional fine tuning involves modifying all the parameters of a model, which can be computationally expensive and memory intensive. PEFT approaches such as LoRA focus on introducing a smaller set of trainable parameters, often in the form of low-rank matrices that adjust the model’s behavior while keeping most of its parameters frozen. The advantage of LoRA lies in its ability to maintain the performance of the base model while significantly lowering the computational burden and resource requirements. The Neuron 2.20 release supports model training with LoRA on Trainium.
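As a quick sense of scale, the arithmetic below assumes a single 4096×4096 attention projection (such as q_proj in Meta Llama 3 8B) and the LoRA rank of 16 used later in this post:

# Parameter count for full fine tuning of one projection matrix vs. its LoRA adapter.
d, k, r = 4096, 4096, 16

full_ft_params = d * k           # parameters updated by full fine tuning
lora_params = r * (d + k)        # low-rank A (d x r) and B (r x k) adapter parameters

print(full_ft_params, lora_params, full_ft_params // lora_params)
# 16777216 131072 128 -> roughly 128x fewer trainable parameters for this matrix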
In the next section, we’ll walk through the code in three steps for PEFT on Trainium with HyperPod:

Setting up and deploying a HyperPod cluster for distributed training.
Fine tuning a Meta Llama 3-8B model on Trainium instance with the dolly 15k dataset.
Model weights consolidation and inference.

Amazon SageMaker HyperPod cluster setup
In this first section, you will begin setting up your Amazon SageMaker HyperPod compute environment for fine tuning.
Prerequisites
The following are the prerequisites for configuring and deploying a SageMaker HyperPod cluster for fine tuning:

Submit a service quota increase request to get access to Trainium instances in the us-west-2 AWS Region. For the purposes of this post, you will request an increase for Amazon EC2 Trn1 instances, ml.trn1.32xlarge.
Install the AWS Command Line Interface (AWS CLI); the required minimum version needed is 2.14.3.
Install the AWS Systems Manager Session Manager Plugin in order to SSH into your cluster.

Step 1: Infrastructure setup
After completing the prerequisites, deploy an AWS CloudFormation stack that contains the necessary infrastructure components for distributed training through SageMaker HyperPod. The default Region specified in the template is us-west-2, but you can modify that. You will also need to specify the Availability Zone where your subnets will be deployed. The template configures your environment with an Amazon Virtual Private Cloud (Amazon VPC) and corresponding public and private subnets for network isolation. It establishes additional components inside your VPC including an S3 bucket for lifecycle scripts and FSx for Lustre, a file system shared across the head and compute nodes of the HyperPod cluster.
Step 2: Cluster configuration
Configure and deploy the HyperPod cluster. Begin by defining your infrastructure’s environment variables through the create_config script. This script uses the AWS CLI to extract infrastructure component variables from your CloudFormation stack including Region, resource IDs, and Amazon Resource Name (ARN).

# Set region
export AWS_REGION=us-west-2

# Fetch create_config script
curl 'https://static.us-east-1.prod.workshops.aws/public/05a78a77-24f9-4f29-867c-64c9687646e1/static/scripts/create_config.sh' --output create_config.sh

# Set environment variables
bash create_config.sh
source env_vars

After setting your environment variables, download the lifecycle scripts required for bootstrapping the compute nodes on your SageMaker HyperPod cluster and define its configuration settings before uploading the scripts to your S3 bucket.

# Download Lifecycle scripts
git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/

# upload scripts to s3
aws s3 cp --recursive awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src

After uploading the Lifecycle scripts to Amazon S3, create your cluster and file system configurations. See the Create Cluster section of the SageMaker HyperPod workshop to create these files. After generating the cluster-config.json and provisioning_parameters.json configuration files, validate them and upload the FSx for Lustre configuration file to Amazon S3.

# validate and check config for known issues
curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/validate-config.py
python3 validate-config.py --cluster-config cluster-config.json --provisioning-parameters provisioning_parameters.json

# Upload FSx configuration to S3
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/

Step 3: Cluster deployment
Now that the cluster’s configuration is defined, you can create the cluster.

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region $AWS_REGION

You should be able to see your cluster by navigating to SageMaker HyperPod in the AWS Management Console, where a cluster named ml-cluster will be listed. After a few minutes, its status should change from Creating to InService.

If you select your cluster, you will be able to see the details of your compute cluster including the head and worker nodes.

After installing the Systems Manager Session Manager plugin, you can ssh into your cluster’s head node using the easy-ssh script to begin training.

# Modify permissions and ssh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

# Switch to ubuntu user
sudo su - ubuntu

# Change directory
cd /fsx

Now that your cluster is running and accessible through ssh, you can begin uploading the model training scripts to the shared file system through either curl or the AWS CLI. For more instructions on setting up your cluster, see the SageMaker HyperPod workshop.
Fine tuning
Now that your SageMaker HyperPod cluster is deployed, you can start preparing to execute your fine tuning job.
Data preparation
The foundation of successful language model fine tuning lies in properly structured and prepared training data. This implementation focuses on instruction-tuned datasets, which form the backbone of modern language model adaptation. These datasets work together to create meaningful training examples through three essential components:

Instructions that guide the model’s task.
Optional context that provides background information.
Responses that represent the desired output.

Training begins by loading your dataset and formatting your dataset examples with this structure. Loading your dataset can be accomplished through the Hugging Face datasets library, which provides a straightforward interface for accessing and managing training data. Hugging Face also provides this format function for the databricks-dolly-15k dataset. Note that the format function needs to be embedded in your train.py file (as shown in the following sample). It’s referenced by the NeuronSFTTrainer to format your dataset during fine tuning.

# Load dataset
dataset = load_dataset(args.dataset, split="train")

def format_dolly(examples):
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = f"### Instruction\n{examples['instruction'][i]}"
        context = f"### Context\n{examples['context'][i]}" if examples["context"][i] else None
        response = f"### Answer\n{examples['response'][i]}"
        prompt = "\n\n".join([part for part in [instruction, context, response] if part is not None])
        output_text.append(prompt)
    return output_text

The formatting function employs delimiter tokens (“###”) to create clear boundaries between different components of each training example. This separation is important because it helps the model distinguish between different parts of the input during training. The function handles cases where context might be missing, making sure that the final format remains consistent regardless of whether all components are present. Double newlines between sections provide additional structural clarity that helps the model recognize the natural breaks in the input.
Tokenization
After formatting your dataset, the next step is tokenization—the process of converting your text data into a numerical format that your model can understand. Tokenization serves as the bridge between your human-readable text and the mathematical operations that drive your model’s understanding of language. To begin, you use Hugging Face’s AutoTokenizer to load your model’s tokenizer.

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
tokenizer.pad_token = tokenizer.eos_token

The AutoTokenizer class automatically selects the appropriate tokenizer for your model, loading not just the vocabulary, but also the rules and special tokens that match your training configuration. The assignment of the padding token to match the end-of-sequence token is particularly important for causal language modeling, because it verifies the consistent handling of your variable-length sequences.
The tokenization process itself operates in several stages. First, it breaks down your input text into tokens based on its vocabulary. These tokens are then converted to numerical IDs that your model can process. During this process, your tokenizer also handles special tokens that mark the beginning and end of sequences, in addition to padding tokens that make sure that the sequences in your batch have the same length.
When working with tokenizers, your sequence length management becomes a critical consideration. Your maximum sequence length must balance between preserving enough information for your model to understand the context and staying within your model’s architectural limitations. Too short, and you risk losing important context; too long, and you might exceed memory constraints or introduce unnecessary computational overhead.
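For intuition only (the NeuronSFTTrainer used later handles tokenization for you), the following shows what tokenizing one formatted example to a fixed budget looks like, reusing the tokenizer and format_dolly defined earlier; the sample instruction and response are made up.

sample = format_dolly({
    "instruction": ["What is Amazon SageMaker HyperPod?"],
    "context": [""],
    "response": ["Purpose-built infrastructure for distributed training at scale."],
})[0]

encoded = tokenizer(
    sample,
    max_length=1024,        # matches the max_seq_length passed to the training script
    truncation=True,        # cut anything longer than the budget
    padding="max_length",   # pad shorter sequences with the EOS-based pad token
    return_tensors="pt",
)
print(encoded["input_ids"].shape)   # torch.Size([1, 1024])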
Model compilation and fine tuning
For this solution, you created a SageMaker HyperPod cluster with the controller node and one worker node. The worker node contains one ml.trn1.32xlarge instance which has 32 Neuron cores. You can conduct distributed fine tuning using all 32 Neuron cores within the worker node.
Step 1: Environment setup
You first need to install the required Python packages for fine tuning. The following is the bash script for the Python environment setup. Note that the solution uses the most recently released Neuron SDK. From the HOME directory, create a file named environment.sh (touch environment.sh) with the following code and run it with sbatch ./environment.sh. You might need to modify the permissions of the shell scripts throughout this post before running them, using the command chmod +x environment.sh.

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/environment.out

sudo apt install -y python3.8-venv git
python3.8 -m venv $HOME/peft_ft/env_llama3_8B_peft
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate
pip install -U pip

python3 -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"
python3 -m pip install torch-neuronx==2.1.2.2.3.0 neuronx-cc==2.15.128.0 neuronx_distributed==0.9.0 torchvision
python3 -m pip install datasets transformers peft huggingface_hub trl PyYAML
python3 -m pip install git+https://github.com/huggingface/optimum-neuron.git

With your environment created, switch to your fine-tuning directory before proceeding to the next step: cd $HOME/peft_ft.
Step 2: Download the base Llama 3 8B model and tokenizer from Hugging Face
Download the base Meta Llama 3 8B model and the corresponding tokenizer from Hugging Face. You will need to first request access for the model from Meta on Hugging Face and then use your Hugging Face access token to download the model. The following is the Python code for the get_model.py script to download the model and tokenizer. Create this file with touch get_model.py and copy the following code to this file before moving on to the next step.

import os
import argparse
from transformers import AutoTokenizer, LlamaForCausalLM

def download_model_and_tokenizer(model_id: str, model_output_path: str, tokenizer_output_path: str, huggingface_token: str = None) -> None:
    huggingface_token = os.environ.get("HUGGINGFACE_TOKEN", None)
    model = LlamaForCausalLM.from_pretrained(model_id, token=huggingface_token)
    model.save_pretrained(model_output_path)
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=huggingface_token)
    tokenizer.save_pretrained(tokenizer_output_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True, help="Hugging Face Model id")
    parser.add_argument("--model_output_path", type=str, required=True, help="Path to save model/weights file")
    parser.add_argument("--tokenizer_output_path", type=str, required=True, help="Path to save tokenizer file")
    args, _ = parser.parse_known_args()
    download_model_and_tokenizer(model_id=args.model_id, model_output_path=args.model_output_path, tokenizer_output_path=args.tokenizer_output_path)

Next, create the bash script touch get_model.sh with the code that follows and run it with the command sbatch ./get_model.sh. This will trigger the get_model.py script to download the model and tokenizer using Slurm. Because you’re using the Llama 3 8B model, Hugging Face requires you to authenticate with an access token prior to download. Be sure to add your access token to get_model.sh before running the script.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/get_model.out

export OMP_NUM_THREADS=1
export HUGGINGFACE_TOKEN="<YOUR TOKEN HERE>"
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 $HOME/peft_ft/get_model.py \
    --model_id meta-llama/Meta-Llama-3-8B-Instruct \
    --model_output_path $HOME/peft_ft/model_artifacts/llama3-8B \
    --tokenizer_output_path $HOME/peft_ft/tokenizer/llama3-8B

Step 3: Pre-compile the model
Training deep learning models on Trainium requires model compilation. To do that, use the neuron_parallel_compile CLI utility, which extracts graphs from a trial run of your script and performs parallel pre-compilation of the computation graphs. Note that the scripts for model pre-compilation are identical to those for the actual training, except for max_steps. This is because pre-compilation doesn’t require the completion of the entire training cycle; rather, it needs approximately 10 training steps to extract the graphs. Before compiling the model, you need to create the training script (touch train.py), which is used for both the pre-compilation and fine tuning steps. Add the following code after creating the file, along with the format function previously mentioned.

import os
import torch
import argparse
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer
from optimum.neuron.distributed import lazy_load_for_parallelism
import torch_xla.core.xla_model as xm

# add format_dolly function here

def training_function(args):
    dataset = load_dataset(args.dataset, split="train")
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
    tokenizer.pad_token = tokenizer.eos_token
    with lazy_load_for_parallelism(tensor_parallel_size=args.tp_size):
        model = AutoModelForCausalLM.from_pretrained(
            args.model_path,
            low_cpu_mem_usage=True,
            torch_dtype=torch.bfloat16 if args.bf16 else torch.float32
        )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )

    training_args = NeuronSFTConfig(
        output_dir=args.model_checkpoint_path,
        overwrite_output_dir=True,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        warmup_steps=args.warmup_steps,
        bf16=args.bf16,
        tensor_parallel_size=args.tp_size,
        pipeline_parallel_size=args.pp_size,
        save_steps=args.checkpoint_frequency,
        logging_steps=100,
        max_steps=args.max_steps,
    )

    trainer = NeuronSFTTrainer(
        args=training_args,
        model=model,
        peft_config=lora_config,
        tokenizer=tokenizer,
        train_dataset=dataset,
        formatting_func=format_dolly,
    )

    trainer.train()
    trainer.save_model(args.model_final_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str)
    parser.add_argument("--tokenizer_path", type=str)
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--train_batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    parser.add_argument("--weight_decay", type=float)
    parser.add_argument("--bf16", type=bool)
    parser.add_argument("--tp_size", type=int)
    parser.add_argument("--pp_size", type=int)
    parser.add_argument("--gradient_accumulation_steps", type=int)
    parser.add_argument("--warmup_steps", type=int)
    parser.add_argument("--early_stopping_patience", type=int)
    parser.add_argument("--checkpoint_frequency", type=int)
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--max_steps", type=int)
    parser.add_argument("--max_seq_length", type=int)
    parser.add_argument("--model_checkpoint_path", type=str)
    parser.add_argument("--model_final_path", type=str)
    args = parser.parse_args()
    training_function(args)

After creating the training file, use the following code to create the compile.sh script, which will trigger finetune-llama3-8B.sh to compile the Llama 3 8B model using the neuron_parallel_compile command. You can run this with the sbatch compile.sh command.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/compile.out

source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

# A value greater than 0 makes finetune-llama3-8B.sh prepend neuron_parallel_compile
export NEURON_EXTRACT_GRAPHS_ONLY=1
srun bash ${HOME}/peft_ft/finetune-llama3-8B.sh

The following is the finetune-llama3-8B.sh script, which lists the hyper-parameters for your model fine tuning. The script uses tensor parallelism with a degree of 8. With 32 NeuronCores in the ml.trn1.32xlarge instance, you get data parallelism of degree 4. Note that the script also sets XLA_USE_BF16=1 to map both torch.float and torch.double tensors to bfloat16 tensors, which can both reduce the memory footprint and improve performance. The script then sets gradient_accumulation_steps to 3 to get a larger effective batch size for gradient updates.

#!/bin/bash
GPUS_PER_NODE=32
# Map torch.float/torch.double tensors to bfloat16 on the XLA device
export XLA_USE_BF16=1
if [ $NEURON_EXTRACT_GRAPHS_ONLY -gt 0 ]; then
    MAX_STEPS=10
    MAYBE_COMPILE="neuron_parallel_compile"
else
    MAX_STEPS=-1
fi

declare -a TORCHRUN_ARGS=(
    --nproc_per_node=$GPUS_PER_NODE
    --nnodes=$SLURM_JOB_NUM_NODES
)
export TRAIN_SCRIPT=${HOME}/peft_ft/train.py

declare -a TRAINING_ARGS=(
    --bf16 True
    --checkpoint_frequency 400
    --dataset "databricks/databricks-dolly-15k"
    --max_steps $MAX_STEPS
    --max_seq_length 1024
    --epochs 1
    --gradient_accumulation_steps 3
    --learning_rate 2e-05
    --model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B"
    --tokenizer_path "/fsx/ubuntu/peft_ft/tokenizer/llama3-8B"
    --model_checkpoint_path "/fsx/ubuntu/peft_ft/model_checkpoints"
    --model_final_path "/fsx/ubuntu/peft_ft/model_checkpoints/final"
    --tp_size 8
    --pp_size 1
    --train_batch_size 1
    --warmup_steps 100
    --weight_decay 0.01
)
$MAYBE_COMPILE torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

Step 3: Model fine tuning
After model compilation is complete, you can start the model fine tuning by reusing the compile.sh script. To do this, prevent the neuron_parallel_compile utility from being used by setting export NEURON_EXTRACT_GRAPHS_ONLY=-1 in compile.sh, and then re-run the script to start fine tuning your model. You might need to delete the model_consolidation directory created during the previous model compilation step before you start your fine-tuning job.
Model consolidation
When working with distributed machine learning workflows, you’ll often need to manage and merge model weights efficiently. Let’s explore two essential processes you’ll encounter when performing LoRA fine tuning: checkpoint consolidation and LoRA weight merging.
Checkpoint consolidation
During distributed training, your model checkpoints are typically split across multiple devices according to the model parallelism configuration that you provide. To bring these pieces back together, you’ll use a consolidation process. Your consolidation function handles three primary tasks. First, it combines distributed checkpoints into a unified model. Then, it manages memory efficiently by processing tensors in chunks. Finally, it creates sharded outputs with an index file for quick access.
LoRA weight merging
When you’re working with LoRA, you need to merge these adapters with your base model. The merging process is straightforward but requires careful attention to detail. Start by loading your base model and LoRA configuration. Then transform the LoRA weight names to match your base model’s structure. The process concludes by merging the adapters and saving the final model in a sharded format.
To put these tools into practice, you can use the following scripts after your fine-tuning job has finished. First, create the Python file consolidation.py (touch consolidation.py) and the shell file consolidation.sh (touch consolidation.sh), then populate them with the following code.

import argparse
import json
from pathlib import Path
from huggingface_hub import split_torch_state_dict_into_shards
from safetensors.torch import save_file
from optimum.neuron.distributed.checkpointing import consolidate_model_parallel_checkpoints
import torch

def custom_consolidate_to_unified_checkpoint(checkpoint_dir: Path, output_dir: Path, save_format: str = "safetensors"):
    output_dir.mkdir(parents=True, exist_ok=True)
    # Merge the model-parallel checkpoint shards into a single state dict
    state_dict = consolidate_model_parallel_checkpoints(checkpoint_dir)
    for key, value in state_dict.items():
        if isinstance(value, torch.Tensor):
            state_dict[key] = value.contiguous()

    # Split the consolidated state dict into shards of at most 5 GB
    split_result = split_torch_state_dict_into_shards(state_dict, max_shard_size="5GB")

    # Save shards
    for shard_file, shard_tensors in split_result.filename_to_tensors.items():
        shard_dict = {name: state_dict[name] for name in shard_tensors}
        shard_path = output_dir / shard_file
        if save_format == "safetensors":
            save_file(shard_dict, shard_path, metadata={"format": "pt"})
        else:
            torch.save(shard_dict, shard_path)

    # Write an index file mapping each weight to the shard that contains it
    index = {
        "metadata": split_result.metadata,
        "weight_map": split_result.tensor_to_filename
    }

    index_file = "model.safetensors.index.json" if save_format == "safetensors" else "pytorch_model.bin.index.json"
    with open(output_dir / index_file, "w") as f:
        json.dump(index, f, indent=2)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", type=str, required=True)
    parser.add_argument("--output_dir", type=str, required=True)
    parser.add_argument("--save_format", type=str, choices=["safetensors", "pytorch"])
    args = parser.parse_args()
    output_dir = Path(args.output_dir)
    checkpoint_dir = Path(args.input_dir) / "adapter_shards"
    custom_consolidate_to_unified_checkpoint(
        checkpoint_dir=checkpoint_dir,
        output_dir=output_dir,
        save_format=args.save_format
    )

This code consolidates the sharded checkpoint files generated during training into a single LoRA adapter in safetensors format. After saving the file, you can invoke the following script with sbatch consolidation.sh to trigger the checkpoint consolidation job. The input directory you provide points to your fine-tuned model’s sharded checkpoints, and the output directory is where the consolidated LoRA adapter safetensors file will be written.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive

export OMP_NUM_THREADS=1
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/consolidation.py" \
    --input_dir "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251" \
    --output_dir "$HOME/peft_ft/model_checkpoints/adapter_shards_consolidation" \
    --save_format "safetensors"

After consolidation is complete, you need to merge the LoRA adapter weights from the consolidated files with the base model’s weights. Begin by creating a new Python file merge_lora.py (touch merge_lora.py) and a shell file merge_lora.sh (touch merge_lora.sh) using the following code.

import json
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM
import torch
import argparse
from safetensors import safe_open

def merge_lora_weights(args):
    # Load the base model and wrap it with the LoRA configuration used for fine tuning
    base_model = AutoModelForCausalLM.from_pretrained(args.base_model_path)
    with open(args.adapter_config_path, "r") as f:
        config_dict = json.load(f)
    peft_config = LoraConfig(**config_dict)
    model = PeftModel(base_model, peft_config)

    # Read the consolidated LoRA adapter tensors from the safetensors file
    lora_weights_tensors = {}
    with safe_open(args.lora_safetensors_path, framework="pt", device="cpu") as f:
        for k in f.keys():
            lora_weights_tensors[k] = f.get_tensor(k)

    # Rename the LoRA weights to match the PEFT model's parameter names
    for layer_name in list(lora_weights_tensors):
        if "layer" in layer_name and "lora" in layer_name:
            new_layer_name = layer_name.replace("weight", "default.weight")
            lora_weights_tensors[new_layer_name] = lora_weights_tensors[layer_name].clone()
            del lora_weights_tensors[layer_name]
        else:
            del lora_weights_tensors[layer_name]

    # Copy the adapter weights into the model, merge them into the base weights, and save
    updated_state_dict = model.state_dict().copy()
    for layer, weights in lora_weights_tensors.items():
        updated_state_dict[layer] = weights
    model.load_state_dict(updated_state_dict)
    merged_model = model.merge_and_unload()
    merged_model.save_pretrained(args.final_model_path, safe_serialization=True, max_shard_size="5GB")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--final_model_path", type=str)
    parser.add_argument("--adapter_config_path", type=str)
    parser.add_argument("--base_model_path", type=str)
    parser.add_argument("--lora_safetensors_path", type=str)
    args = parser.parse_args()
    merge_lora_weights(args)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=/fsx/ubuntu/peft_ft/logs/8b/lora_weights.log

export OMP_NUM_THREADS=1
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/merge_lora.py" \
    --final_model_path "/fsx/ubuntu/peft_ft/model_checkpoints/final_model_output" \
    --adapter_config_path "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251/adapter_config.json" \
    --base_model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" \
    --lora_safetensors_path "/fsx/ubuntu/peft_ft/model_checkpoints/adapter_shards_consolidation/model.safetensors"

Trigger the run with sbatch merge_lora.sh to merge the model weights. Here the base_model_path parameter is the local directory where you previously downloaded the model from Hugging Face in step 1 of “Model compilation and fine tuning.” The adapter_config_path parameter is the adapter_config.json file written to your fine-tuning checkpoint directory, and the lora_safetensors_path parameter is the path to the model.safetensors file produced by the LoRA consolidation in the previous step.
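Before moving on to inference, it can help to confirm that the merge produced a sharded safetensors checkpoint. The following is a minimal sketch, assuming only the final_model_path used in merge_lora.sh above:

import json
from pathlib import Path

final_model_path = Path("/fsx/ubuntu/peft_ft/model_checkpoints/final_model_output")

# List the merged weight shards written by save_pretrained
for shard in sorted(final_model_path.glob("*.safetensors")):
    print(shard.name, f"{shard.stat().st_size / 1e9:.2f} GB")

# The index file maps each parameter name to the shard that stores it
index_path = final_model_path / "model.safetensors.index.json"
if index_path.exists():
    index = json.loads(index_path.read_text())
    print(f"{len(index['weight_map'])} tensors across {len(set(index['weight_map'].values()))} shards")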
Inference
After consolidation and merging, the safetensors files containing the updated model weights are saved to your final_model_path output directory. Using these updated weights, you can load the model and generate predictions in the context of the dolly dataset. To check that the fine-tuned model understands the databricks-dolly-15k dataset it was fine tuned on, select a question from the dataset for validation, as shown in the following figure.
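For example, the following sketch pulls a record from the dataset so you can later compare the model’s answer against the ground-truth response; the record index is arbitrary:

from datasets import load_dataset

# Load the same dataset the model was fine tuned on
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Pick an arbitrary record to use as a validation prompt
sample = dolly[0]
print("Instruction:", sample["instruction"])
print("Context:", sample["context"])
print("Ground-truth response:", sample["response"])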

Using Hugging Face’s LlamaForCausalLM class, you can load your newly fine-tuned model and generate a prediction for the question “Who are the Smiths?” (shown in the following figure):
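A minimal sketch of that inference step follows, assuming the merged weights in final_model_output and the tokenizer downloaded earlier; the prompt format is illustrative only and should mirror whatever format_dolly produced during training:

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

model_dir = "/fsx/ubuntu/peft_ft/model_checkpoints/final_model_output"
tokenizer = AutoTokenizer.from_pretrained("/fsx/ubuntu/peft_ft/tokenizer/llama3-8B")

# Load the merged, fine-tuned weights
model = LlamaForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
model.eval()

prompt = "### Instruction\nWho are the Smiths?\n\n### Answer\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))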

Comparing the generated answer to the ground truth context and response from the training dataset, it’s clear that the fine-tuned Meta Llama 3 model now understands this data and can give coherent responses to posed questions.
Results

Technique | Trainable parameters | Samples processed per second | Training time (minutes)
FPFT      | 7,570,591,744        | 2.083                        | 90
PEFT      | 6,815,744            | 3.554                        | 53

To benchmark LoRA fine tuning on a single ml.trn1.32xlarge instance, we compared it to full parameter fine tuning (FPFT) of the model over three training epochs. Measured in training samples processed per second, LoRA delivered roughly a 70% increase in throughput (3.554 versus 2.083 samples per second) and a shorter training time (53 versus 90 minutes). As a result, the on-demand hours required to fine tune the model on the dolly 15k dataset for three epochs were halved compared to FPFT, a roughly 50% reduction in training costs.
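The percentages follow directly from the table above; as a quick check, using only the numbers reported there:

# Figures taken from the results table above
fpft_throughput, peft_throughput = 2.083, 3.554  # samples per second
fpft_minutes, peft_minutes = 90, 53              # training time in minutes

throughput_gain = (peft_throughput / fpft_throughput - 1) * 100
print(f"Throughput increase with LoRA: {throughput_gain:.1f}%")       # ~70.6%
print(f"Training time saved: {fpft_minutes - peft_minutes} minutes")  # 37 minutes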
Clean up
To clean up the resources provisioned for this post, first delete the SageMaker HyperPod cluster. This can be done either through the AWS CLI or in the SageMaker console.

aws sagemaker delete-cluster --cluster-name ml-cluster

After the cluster is deleted, delete the CloudFormation stack to remove the remaining provisioned resources.

aws cloudformation delete-stack --stack-name sagemaker-hyperpod

Conclusion
In this post, we showed you how to set up a SageMaker HyperPod compute cluster for training. Then we showed you how to perform multi-node distributed fine tuning with Trainium for a Meta Llama 3 model using LoRA. Finally, we showed you how to consolidate model weights across a distributed training environment to generate coherent predictions for the newly fine-tuned model.

About the Authors
Georgios Ioannides is a Deep Learning Architect with the AWS Generative AI Innovation Center. Before AWS, Georgios worked in startups, where he specialized in signal processing, deep learning, and multi-modal and cross-modal machine learning systems for speech, vision, and text applications. He holds Master’s degrees from Imperial College London and Carnegie Mellon University.
Bingchen Liu is a Machine Learning Engineer with the AWS Generative AI Innovation Center. Before AWS, he worked as a lead MLE in ADP focusing on RAG applications, vector database, model development, and serving. He holds a Master’s degree in Computer Science from Columbia University and a PhD in Statistics from Southern Methodist University.
Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a PhD in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.
Jeremy Roghair is a Machine Learning Engineer with the AWS Generative AI Innovation Center, where he focuses on developing generative AI solutions for distributed training workloads and model hosting for customers. Prior to joining AWS, Jeremy worked as a Data Scientist in the finance/insurance industry and earned a Master’s degree in Computer Science with research in reinforcement learning from Iowa State University.

Ad Remarketing Software Reviewed: The 15 Best Retargeting Tools

How often do you visit a site for the first time and make a purchase right away? Probably not very often, if ever.

You aren’t alone. Most visitors don’t convert on the first visit. In fact, about 98% of people leave your site without making a purchase. 

That’s why marketers invented retargeting – to swoop in to save the day. 

Retargeted ads boast click-through rates that are 10x higher than standard display ads and can drive up to a 70% increase in conversions. Awesome, right? 

Sure, but to make retargeting work, you need the right tools. 

You need ad remarketing software that helps you track visitors, build smarter retargeting audiences, and deliver personalized ads that actually convert.

To help you find the right tool, we’ve reviewed 15 of the best ad remarketing tools out there, covering their features and pricing and showing why they deserve a spot in your marketing stack. 


What is Ad Remarketing Software?

Ad remarketing software is a tool that helps businesses re-engage potential customers who have previously interacted with their website or digital content. 

Ok, that’s the formal definition. Let’s be real. 

What ad remarketing software really does is help you reconnect with people who’ve already shown interest in your brand. Whether they visited your site, browsed a product, or left something in their cart, this software makes it easy to stay in front of them with targeted ads.

The goal? To bring them back and get them to buy something! 

By creating personalized, high-converting ads tailored to their behavior, remarketing tools help you turn “just looking” into “let’s buy.” 

Plus, it’s one of the best ways to stretch your ad budget further by focusing on people who are already interested in what you’re offering.

What Does Ad Remarketing Software Do?

Ad remarketing software tracks website visitors’ actions, such as viewing a product, adding items to their cart, or leaving without completing a purchase. 

It then uses this data to create highly segmented audiences based on visitor behavior and syncs seamlessly with platforms like Meta, Google Ads, and TikTok. 

What does this mean? Only amazing things! 

Including the fact you can deliver hyper-personalized, cross-channel ads that speak directly to where a potential customer is in their buying journey. Say goodbye to the generic retargeting ads that have taken over thanks to privacy changes and data loss.

Here’s an example of ad remarketing software in action. 

A visitor spends five minutes browsing your latest product launch but doesn’t add it to their cart. Remarketing software can automatically include them in a “high-interest” audience and serve them ads featuring the product, reviews, or even a limited-time discount. Next thing you know…sale. 

Ad remarketing software lets you target the right audience with the right message at the right time…every marketer’s favorite tactic.

Why is Ad Remarketing Essential for Marketers?

Remarketing is the key to bringing people back to your site and turning lost visitors into conversions. 

In fact, retargeting ads are 76% more likely to get clicked than regular display ads, making them a go-to strategy for marketers who want results.

Here’s why ad remarketing is a must for marketers who are serious about results:

Boost Conversion Rates with Targeted Ads

Retargeted ads work because they’re hyper-relevant. Instead of generic messaging, you’re serving people exactly what they were already looking at, whether it’s a product, service, or special offer. 

Retargeting ads see 70% higher conversion rates than standard ones, making them an easy win for marketers who know how to use them effectively.

Maximize ROI from Traffic You Already Paid For

You’ve spent budget getting visitors to your site, so why let them leave without a follow-up? 

Remarketing keeps the conversation going by targeting those who already showed interest. For example, someone leaves items in their cart? Serve them an ad featuring those exact products and a limited-time discount. 

It’s a smart way to recover cart abandoners and squeeze more ROI from the traffic you’ve already invested in.

Stay Top-of-Mind with Your Audience

Not every visitor buys right away, but that doesn’t mean they’re not interested. 

Retargeting ads keep your brand front and center while they’re deciding. Whether they’re scrolling Instagram, watching a YouTube video, or searching on Google, you’re there when they’re ready to come back. 

Studies even show retargeting can increase branded searches by 104%.

Run Smarter Multi-Platform Campaigns

The best results come when you meet your audience wherever they are. Remarketing tools let you run campaigns across Meta, Google Ads, TikTok, and more, all from one place. 

That way, someone who browsed your site today might see an Instagram ad tomorrow and a Google Search ad next week, bringing them back to purchase.

Get Better Data to Optimize Campaigns

Remarketing tools don’t just retarget. They also give you data. With detailed performance tracking, you can see what’s working, which audiences are converting, and where you can tweak for even better results. 

The more data you have, the more efficiently you can spend your budget and hit your goals.

We all know that remarketing is essential for getting the most out of every visitor to your site, so if you’re not doing it yet, it’s time to start.

Top 10 Features to Look for in Ad Remarketing Software

Choosing the right ad remarketing tool can make or break your campaigns. Here are the must-have features to look for:

1. Advanced Audience Segmentation

What it does: Break your audience into precise groups based on behavior, such as cart abandoners, page viewers, or past buyers.

Why it’s essential: Personalization is everything. Segmented ads are far more effective than one-size-fits-all campaigns, leading to better engagement and higher conversion rates.

2. Cross-Platform Integration

What it does: Seamlessly run campaigns across Meta, Google, TikTok, and other platforms.

Why it’s essential: Your audience doesn’t stick to one platform, and your ads shouldn’t either. Cross-platform integration ensures you reach them wherever they are.

3. Dynamic Ad Creation

What it does: Automatically generate ads personalized to user actions, like featuring products they browsed or abandoned.

Why it’s essential: Personalized ads convert better. Showing users content that feels relevant to them increases click-through rates and drives more sales.

4. A/B Testing Capabilities

What it does: Test different ad formats, creatives, or CTAs to find what works best.

Why it’s essential: No audience is the same, and A/B testing lets you optimize your ads for maximum performance, saving money on ineffective campaigns.

5. Real-Time Data and Analytics

What it does: Provide up-to-the-minute performance metrics to fine-tune campaigns.

Why it’s essential: Quick access to data means you can make adjustments in real time, ensuring you’re not wasting budget on ads that aren’t working.

6. Custom Trigger-Based Retargeting

What it does: Set triggers like cart abandonment or specific page visits to retarget users at key moments.

Why it’s essential: Timing matters. Catching users when they’re already interested significantly increases the chances they’ll return and convert.

7. Mobile-First Retargeting

What it does: Optimize ads specifically for mobile devices.

Why it’s essential: With mobile making up the majority of web traffic, ensuring your ads look great on smaller screens is critical for engaging users effectively.

8. AI-Powered Recommendations

What it does: Use machine learning to identify the best audiences, placements, and messaging.

Why it’s essential: AI helps you work smarter, not harder, by automating complex decisions and optimizing campaigns faster than manual adjustments.

9. Geotargeting

What it does: Target users based on their location for locally relevant ads.

Why it’s essential: Whether you’re promoting a regional sale or a local store, geotargeting ensures your ads resonate with the audience they’re designed for.

10. CRM and Ecommerce Integrations

What it does: Sync directly with CRMs or eCommerce platforms for seamless data transfer.

Why it’s essential: Integration saves time and ensures your campaigns are powered by accurate, up-to-date customer data for better targeting and results.

When choosing your ad remarketing software, it’s important to remember that these features aren’t just extras. They’re what separate effective remarketing campaigns from wasted ad spend. 

Make sure to find an ad remarketing tool that checks these boxes, and you’ll be set up to bring your audience back and turn visits into conversions.

The 15 Best Ad Remarketing Tools for 2025

With so many options out there, finding the right ad remarketing tool can feel overwhelming. 

Whether you’re focused on dynamic ads, cross-platform targeting, or maximizing your ROI, these 15 tools stand out for their ability to help you re-engage your audience and drive conversions. 

Let’s break them down.

1. Customers.ai Ad Remarketing Software

What it Does: Customers.ai is an AI-powered ad remarketing and lead capture platform designed to help businesses reconnect with high-intent visitors. It enables advanced audience segmentation, visitor tracking, and multi-channel retargeting through platforms like Meta, Google Ads, and Shopify. Customers.ai also stands out with its automation capabilities, making it easy to set up personalized retargeting campaigns that convert.

Pricing: Flexible pricing with a free trial and paid plans starting at $99/month.

Rating: ★★★★★ (4.9/5)

Customers.ai Ad Remarketing Software Reviews:

Ease of Use: Users frequently highlight its intuitive interface. One G2 reviewer said, “Setting up campaigns was quick and straightforward, even without extensive technical skills.”

AI Automation: Customers love the AI-driven features. A Trustpilot user shared, “The AI takes the guesswork out of campaign setup. It’s a game-changer for retargeting efficiency.”

Remarketing Power: Many reviewers praise its ability to track and retarget visitors effectively. A Capterra review noted, “The visitor tracking and segmentation features helped us identify high-value leads and target them with ads that converted.”

Customer Support: Reviewers appreciate the responsive support team. One user on G2 mentioned, “Whenever we’ve had questions, the support team has gone above and beyond to help us maximize the platform.”

2. AdRoll Ad Remarketing Software

What it Does: AdRoll is a comprehensive ad remarketing platform that enables businesses to re-engage visitors across web, social, and email channels. It offers robust audience segmentation, dynamic ad creation, and detailed analytics to optimize campaign performance. Its integration capabilities with major ecommerce platforms like Shopify make it especially popular among online retailers.

Pricing: Free plan available; paid plans start at $36/month, plus ad spend.

Rating: ★★★★☆ (4.5/5)

AdRoll Ad Remarketing Software Reviews:

Ease of Use: Users appreciate its intuitive dashboard. One reviewer on Trustpilot said, “AdRoll made it simple to set up campaigns and track results without a steep learning curve.”

Cross-Channel Reach: Many highlight the platform’s multi-channel capabilities. A Capterra user noted, “Being able to run ads on Facebook, Instagram, and the web from one platform has been a huge time-saver.”

Dynamic Ads: Dynamic ad creation is a favorite feature. A G2 reviewer mentioned, “AdRoll’s dynamic ads helped us show personalized product recommendations, boosting our ROAS significantly.”

Support: Users praise the customer support team for their quick and detailed responses. One user wrote, “AdRoll’s team has been incredibly helpful in troubleshooting and providing optimization tips.”

3. Google Ads Remarketing Tool

What it Does: Google Ads Remarketing allows businesses to target users who have previously visited their website or interacted with their content. It integrates directly into the Google Ads platform, letting you retarget visitors across Search, Display, YouTube, and more. The platform’s detailed audience segmentation and performance tracking make it a top choice for marketers looking to leverage Google’s massive reach.

Pricing: Pay-per-click or impression-based pricing; no platform fee.

Rating: ★★★★☆ (4.6/5)

Google Ads Remarketing Reviews:

Audience Reach: Marketers love Google’s extensive network. A reviewer on G2 shared, “The ability to target users across Search and Display is unmatched.”

Performance Tracking: Many appreciate the detailed analytics. A Capterra user said, “The reporting tools make it easy to understand which campaigns are driving results.”

Ease of Integration: Users highlight how seamlessly it connects with other Google tools. One reviewer noted, “Integrating with Google Analytics made audience setup and tracking super simple.”

Cost Control: A Trustpilot reviewer said, “Google Ads Remarketing gives us great control over our ad spend, allowing us to optimize for high ROI.”

4. Meta Custom Audiences 

What it Does: Meta’s Custom Audiences feature lets businesses retarget visitors on Facebook and Instagram. By syncing data from your website, CRM, or app, you can create highly personalized campaigns that engage users across Meta’s platforms. It’s a favorite for social-focused marketers aiming to drive engagement and conversions.

Pricing: Pay-per-click or impression-based pricing; no platform fee.

Rating: ★★★★☆ (4.5/5)

Meta Custom Audiences Reviews:

Social Targeting: Users value the focus on Facebook and Instagram. A G2 reviewer said, “It’s the best platform for building brand awareness and converting users socially.”

Personalization: Many highlight the ability to create tailored ads. A reviewer on Capterra noted, “The dynamic ad options let us show users products they’ve browsed, which boosted our click-through rates.”

Integration: One Trustpilot user shared, “Syncing our Shopify data with Meta was seamless and allowed us to create precise custom audiences quickly.”

Cost Efficiency: A reviewer said, “Meta Custom Audiences delivers great value, especially for retargeting campaigns aimed at driving sales.”

5. Criteo Ad Remarketing Software

What it Does: Criteo is a performance-based retargeting platform that specializes in delivering personalized ads across the web. Known for its dynamic ad capabilities and focus on ecommerce, Criteo integrates with major platforms to target users with product-specific ads that drive conversions.

Pricing: Custom pricing based on campaign goals and ad spend.

Rating: ★★★★☆ (4.4/5)

Criteo Ad Remarketing Software Reviews:

Dynamic Ads: Reviewers often highlight its product-specific targeting. A G2 user said, “Criteo’s dynamic ads consistently drive the highest ROI for our retargeting campaigns.”

Ecommerce Focus: A reviewer on Trustpilot mentioned, “Criteo is built for ecommerce. Its integration with our store and ability to show personalized ads to shoppers is outstanding.”

Performance Tracking: Users appreciate the clear analytics. One Capterra user shared, “The dashboard makes it easy to track performance and optimize our campaigns for better results.”

Support: Many reviewers praise the platform’s dedicated account managers. A user noted, “Criteo’s team is hands-on and works with us to continually improve our campaigns.”

6. SharpSpring Ad Remarketing Software

What it Does: SharpSpring is a retargeting platform designed for small to medium-sized businesses. It helps marketers retarget visitors across web, social, and mobile channels with ease. With its focus on simplicity and cost-effectiveness, it’s an excellent option for businesses new to remarketing.

Pricing: Pay-as-you-go pricing; no platform fees.

Rating: ★★★★☆ (4.3/5)

SharpSpring Ad Remarketing Software Reviews:

Ease of Use: Users highlight its simplicity. A G2 reviewer noted, “SharpSpring made setting up retargeting campaigns straightforward, even for someone without technical expertise.”

Cross-Channel Targeting: Reviewers love its reach. A Trustpilot user shared, “Being able to retarget on both web and social channels from one platform has been a huge help.”

Budget-Friendly: Many mention its affordability. A Capterra user said, “It’s a cost-effective way to dip into retargeting without overspending.”

Support: A reviewer on G2 said, “The support team is responsive and willing to walk you through the setup and optimization process.”

7. StackAdapt

What it Does: StackAdapt is a programmatic advertising platform that offers advanced retargeting capabilities across display, video, native, and connected TV. It’s known for its focus on audience insights and precision targeting, making it a great choice for marketers who want to optimize their multi-channel campaigns.

Pricing: Custom pricing based on campaign needs.

Rating: ★★★★☆ (4.5/5)

StackAdapt Ad Remarketing Software Reviews:

Cross-Channel Targeting: A G2 reviewer mentioned, “StackAdapt makes it easy to retarget users across multiple channels, from display to CTV, all in one platform.”

Audience Insights: Users appreciate the detailed reporting. A Capterra reviewer noted, “The audience insights and analytics have helped us refine our targeting and increase ROI.”

Ease of Use: A Trustpilot user said, “The platform is intuitive and straightforward, even for complex campaigns.”

Support: Many reviewers highlight the responsive customer support. One user shared, “The StackAdapt team is always quick to help with optimizations and troubleshooting.”

8. Marin Software

What it Does: Marin Software is a cross-channel advertising platform that specializes in paid search, social, and display ads. Its retargeting capabilities allow marketers to build precision audiences, deliver dynamic ads, and optimize campaigns for better performance across platforms like Google, Meta, and Amazon.

Pricing: Custom pricing based on campaign size and goals.

Rating: ★★★★☆ (4.4/5)

Marin Software Ad Remarketing Software Reviews:

Cross-Channel Efficiency: A G2 reviewer shared, “Marin makes it easy to manage and optimize campaigns across multiple platforms from one dashboard.”

Performance Insights: Users highlight the detailed reporting. A Capterra reviewer said, “The analytics tools are top-notch and have helped us identify high-performing audiences.”

Dynamic Retargeting: Many reviewers appreciate its ad customization. One Trustpilot user noted, “Marin’s dynamic ad features helped us deliver personalized experiences that significantly boosted conversions.”

Support: Users frequently mention excellent customer service. A G2 reviewer said, “The support team is responsive and always available to guide us through campaign optimizations.”

9. OptiMonk Ad Remarketing Software

What it Does: OptiMonk is a retargeting tool that specializes in on-site pop-ups, sticky bars, and other creative ad formats. It’s an excellent option for ecommerce brands aiming to re-engage visitors before they even leave the site.

Pricing: Free plan available; paid plans start at $39/month.

Rating: ★★★★☆ (4.6/5)

OptiMonk Ad Remarketing Software Reviews:

On-Site Engagement: A G2 reviewer shared, “OptiMonk’s pop-ups helped us capture visitors’ attention and turn more of them into customers before they left.”

Customization: Reviewers love the flexibility. A Capterra user said, “The templates are easy to customize, and the platform makes it simple to test different designs.”

Analytics: A Trustpilot reviewer mentioned, “The reporting tools gave us clear insights into which campaigns were working and where to improve.”

Customer Support: Users frequently praise its support team. One G2 reviewer said, “OptiMonk’s support team was quick to respond and helped us get the most out of the platform.”

10. Taboola Ad Remarketing Software

What it Does: Taboola is a content discovery and retargeting platform that helps businesses reach audiences through native ads across premium publishers. It’s particularly effective for driving traffic back to your site with engaging, contextual ads.

Pricing: Custom pricing based on ad spend and campaign goals.

Rating: ★★★★☆ (4.3/5)

Taboola Ad Remarketing Software Reviews:

Content Engagement: A G2 user said, “Taboola’s native ads consistently drove high-quality traffic back to our site.”

Publisher Reach: Many reviewers highlight its extensive network. A Trustpilot reviewer mentioned, “The ability to place our ads on well-known publishers has been a game-changer.”

Analytics: A Capterra user said, “Taboola’s dashboard makes it easy to measure the ROI of our campaigns and refine them for better performance.”

Support: Reviewers often praise its account managers. One G2 reviewer wrote, “The Taboola team worked closely with us to optimize our campaigns and improve results.”

11. Outbrain Ad Remarketing Software

What it Does: Outbrain is a native advertising platform that helps brands retarget visitors with engaging, non-intrusive ads displayed on high-traffic publisher sites. It’s ideal for marketers looking to boost traffic and conversions through content-focused campaigns.

Pricing: Custom pricing based on ad spend and campaign goals.

Rating: ★★★★☆ (4.4/5)

Outbrain Ad Remarketing Software Reviews:

Content Amplification: A G2 reviewer mentioned, “Outbrain helps us turn our blog content into traffic-driving ads that re-engage our audience.”

Publisher Network: Users praise its premium reach. A Trustpilot user said, “The ability to advertise on top-tier publisher sites has improved our brand’s visibility and credibility.”

Ease of Use: A Capterra user shared, “Setting up campaigns and tracking performance on Outbrain is simple and intuitive.”

Performance Support: Reviewers highlight its optimization tools. One reviewer noted, “The platform’s analytics helped us refine our ads for better engagement.”

12. AdWisely Ad Remarketing Software

What it Does: AdWisely automates ad retargeting for ecommerce stores, focusing on dynamic product ads across platforms like Facebook and Google. It’s built to help small to medium-sized businesses drive conversions with minimal effort.

Pricing: Starts at $19/month plus ad spend.

Rating: ★★★★☆ (4.5/5)

AdWisely Ad Remarketing Software Reviews:

Automation: Users love the hands-off approach. A G2 reviewer said, “Adwisely automates everything, making retargeting a breeze for our small team.”

Dynamic Ads: Reviewers highlight its impact on sales. A Trustpilot user shared, “The product ads retarget exactly what customers were looking at, and it works like a charm.”

Ecommerce Focus: A Capterra reviewer noted, “It’s perfect for Shopify stores—easy to set up and very effective.”

ROI: Many mention its strong performance. One G2 user wrote, “We’ve seen a significant lift in revenue since we started using AdWisely for our campaigns.”

13. LinkedIn Ad Retargeting Software

What it Does: LinkedIn Retargeting allows businesses to re-engage website visitors, email contacts, or video viewers with ads on LinkedIn. It’s a powerful tool for B2B marketers looking to stay in front of decision-makers and drive high-value leads.

Pricing: Pay-per-click or impression-based pricing; no platform fee.

Rating: ★★★★☆ (4.3/5)

LinkedIn Retargeting Reviews:

B2B Focus: A G2 reviewer said, “LinkedIn’s retargeting capabilities are a must-have for B2B campaigns targeting specific industries.”

Audience Insights: Users value the detailed targeting options. A Trustpilot user shared, “Being able to retarget professionals by job title and industry gives us unmatched precision.”

Ad Formats: A Capterra user mentioned, “The variety of ad formats—from sponsored content to video ads—helps us keep campaigns engaging.”

ROI: Many highlight its effectiveness for lead generation. One reviewer wrote, “LinkedIn retargeting has consistently brought in high-quality leads for our business.”

14. RollWorks Ad Remarketing Software

What it Does: RollWorks is an account-based marketing platform that excels at B2B retargeting. It helps businesses identify, engage, and convert high-value accounts through targeted ads and detailed analytics.

Pricing: Custom pricing based on campaign needs.

Rating: ★★★★☆ (4.4/5)

RollWorks Ad Remarketing Software Reviews:

ABM Focus: A G2 user shared, “RollWorks makes account-based retargeting simple and effective, helping us focus on leads that matter most.”

Detailed Analytics: Reviewers love the insights. A Trustpilot user said, “The reporting tools give us a clear picture of what’s working and where we can improve.”

Ease of Use: A Capterra reviewer noted, “Setting up campaigns and tracking performance is straightforward, even for complex B2B strategies.”

Support: Users frequently mention its dedicated customer support. One reviewer said, “The RollWorks team is always ready to help us fine-tune our campaigns.”

15. Wunderkind Ad Remarketing Software

What it Does: Wunderkind is a behavioral marketing platform that helps brands create personalized retargeting campaigns based on real-time customer data. It’s especially effective for ecommerce and retail brands looking to boost customer engagement and loyalty.

Pricing: Custom pricing based on business size and needs.

Rating: ★★★★☆ (4.5/5)

Wunderkind Ad Remarketing Software Reviews:

Behavioral Targeting: A G2 reviewer said, “Wunderkind’s ability to use real-time customer behavior has made our retargeting campaigns incredibly effective.”

Personalization: Reviewers highlight its dynamic capabilities. A Trustpilot user shared, “Every ad feels tailored to the customer, and the results show it.”

Integration: A Capterra reviewer noted, “It integrates seamlessly with our CRM, making it easy to launch targeted campaigns.”

Customer Retention: Many emphasize its impact on loyalty. One G2 user wrote, “Wunderkind has helped us turn one-time buyers into repeat customers through smart retargeting.”

How to Choose Your Ad Remarketing Software

Finding the right ad remarketing tool doesn’t have to be overwhelming. Follow these steps to make sure you’re picking the software that fits your goals and delivers results:

Step 1: Define Your Budget and Growth Plan

Start by setting a clear budget for your remarketing campaigns.

Choose a tool that fits your current needs but also has features to support your growth as your campaigns scale.

Example: A small ecommerce business might start with a tool like RetargetApp, while larger teams might opt for something robust like RollWorks.

Step 2: Check Compatibility with Your Existing Platforms

Make a list of tools you already use—Google Ads, Meta, Shopify, or your CRM.

Ensure the remarketing software you choose integrates seamlessly with these platforms. This avoids unnecessary workarounds and keeps your campaigns running smoothly.

Step 3: Look for Customization and Dynamic Ad Features

Prioritize tools that let you personalize ads based on visitor behavior, like showing specific products or services they interacted with.

Dynamic ad capabilities are key for creating tailored experiences that convert.

Step 4: Assess Ease of Use and Reporting Tools

Test the platform (most tools offer free trials or demos) to ensure the interface is user-friendly.

Check the reporting features to confirm they provide actionable insights without a steep learning curve.

Step 5: Match the Tool to Your Goals

Identify your primary focus: Are you trying to recover abandoned carts? Build brand awareness? Convert B2B leads?

Choose a tool with strengths that align with those goals. For example:

Ecommerce brands should look for dynamic product ads and integrations with shopping platforms.

B2B marketers might need robust account-based targeting and LinkedIn integrations.

By following these steps, you can narrow down your options and pick the tool that fits both your needs and your future plans. 

Why Customers.ai is the Best Ad Remarketing Tool

When it comes to ad remarketing, Customers.ai stands out as the ultimate choice for marketers who want more than just the basics. 

With advanced visitor tracking, seamless integrations across platforms like Meta, Google Ads, and Shopify, and powerful dynamic retargeting features, it gives you the tools to re-engage your audience with precision and ease.

Whether you’re targeting cart abandoners or creating multi-platform campaigns, Customers.ai simplifies the process and delivers real results.

Ready to see what Customers.ai can do for your remarketing strategy? 

Start your free trial today and take the guesswork out of retargeting!



Ad Remarketing Software FAQs

1. What is ad remarketing software?

Ad remarketing software is a tool that helps businesses re-engage potential customers who’ve previously visited their website or interacted with their digital content. By using tracking technologies like cookies or pixels, the software identifies visitors and serves them targeted ads across platforms such as Meta, Google, and TikTok. The goal is to encourage these users to return and complete desired actions, such as making a purchase or signing up for a service. This software is essential for boosting ROI by focusing ad spend on high-intent audiences.

2. How does ad remarketing software work?

Ad remarketing software works by placing a tracking pixel or cookie on a visitor’s browser when they visit your website. This tracking data is then used to create audience segments based on behavior, such as pages visited or actions taken. The software syncs with ad platforms to deliver personalized ads to those users across the web, social media, or even apps. This process ensures that ads are shown to people who are more likely to engage and convert.

3. What are the benefits of using ad remarketing software?

Ad remarketing software offers several benefits, including:

Higher Conversion Rates: Targeting users who already know your brand leads to better engagement and conversions.

Improved ROI: It focuses your budget on audiences with higher intent, reducing wasted ad spend.

Enhanced Personalization: Dynamic ads tailored to user behavior create a more relevant and engaging experience.

Cross-Platform Reach: Most tools integrate with multiple platforms, ensuring your ads follow users wherever they go.

4. What features should I look for in ad remarketing software?

When evaluating ad remarketing software, focus on features like:

Audience Segmentation: Ability to group users based on behavior or demographics.

Dynamic Ads: Automated creation of personalized ads tailored to individual users.

Cross-Channel Integration: Support for platforms like Meta, Google, TikTok, and more.

Real-Time Analytics: Access to detailed performance metrics to optimize campaigns.

Custom Triggers: Options to retarget based on specific actions, such as cart abandonment.

5. Who can benefit from ad remarketing software?

Ad remarketing software is beneficial for businesses across industries, including:

Ecommerce: Recover abandoned carts and promote products to shoppers who didn’t complete a purchase.

B2B: Nurture leads through targeted ads that align with their stage in the buying cycle.

Service-Based Businesses: Re-engage visitors who explored services but didn’t convert.

Local Businesses: Target visitors with location-specific ads to drive foot traffic or local sales.

6. What is the difference between ad remarketing and ad retargeting?

While often used interchangeably, ad remarketing and ad retargeting can have slightly different meanings.

Remarketing: Typically refers to re-engaging customers via email campaigns or loyalty-based advertising.

Retargeting: Focuses specifically on serving targeted ads to people who have interacted with your website or app.

Ad remarketing software usually combines elements of both to deliver personalized campaigns.

7. How can ad remarketing software improve ROI?

Ad remarketing software improves ROI by focusing your ad budget on audiences who are already familiar with your brand. These users are more likely to convert than cold audiences, making your campaigns more cost-efficient. Additionally, tools that offer dynamic ads and advanced segmentation help you deliver highly relevant ads, increasing the chances of engagement and purchase. With detailed analytics, you can continually optimize campaigns to achieve even better results.

8. Is ad remarketing software effective for small businesses?

Yes, ad remarketing software is highly effective for small businesses. It allows you to maximize the value of your existing traffic by focusing on re-engaging users who have already shown interest. Small businesses can benefit from cost-efficient tools that support dynamic ads and cross-platform retargeting without needing a large team or extensive budget. Many platforms, like RetargetApp or Perfect Audience, cater specifically to smaller businesses with user-friendly setups.

9. How does ad remarketing software integrate with ecommerce platforms?

Most ad remarketing software integrates seamlessly with popular ecommerce platforms like Shopify, WooCommerce, and Magento. These integrations allow the software to pull product data, track user actions like cart abandonment, and create dynamic ads featuring specific items. This ensures a smooth workflow for ecommerce businesses, enabling them to launch retargeting campaigns without manual data entry.

10. Can ad remarketing software help with cart abandonment?

Yes, ad remarketing software is one of the best tools for addressing cart abandonment. By tracking users who add items to their cart but don’t complete a purchase, it allows you to serve personalized ads featuring those products. Many tools also support incentives like discount codes within the ad to encourage users to return and complete the checkout process.

11. What is dynamic retargeting in ad remarketing software?

Dynamic retargeting is a feature in ad remarketing software that automatically generates personalized ads based on user behavior. For example, if a user browsed a specific product but didn’t purchase, the software can create an ad showcasing that exact product. This level of personalization significantly increases engagement and conversion rates.

12. How does ad remarketing software handle audience segmentation?

Ad remarketing software uses audience segmentation to group users based on their actions, such as:

Page visits: Users who viewed a specific product or service.

Time spent on site: Engaged visitors who didn’t convert.

Cart abandoners: People who left items in their cart without completing checkout.

Returning visitors: Users who visited multiple times but haven’t made a purchase.

These segments allow you to tailor ads to each group for maximum relevance.

13. How do you measure the success of ad remarketing campaigns?

Success in ad remarketing campaigns is typically measured using:

Click-Through Rate (CTR): How often users click on your ads.

Conversion Rate: Percentage of users who take the desired action after clicking.

Return on Ad Spend (ROAS): Revenue generated for every dollar spent.

Impressions: Total number of times your ad was viewed.

Analyzing these metrics helps you identify what’s working and where adjustments are needed.

14. Is ad remarketing software GDPR-compliant?

Most ad remarketing software includes tools to ensure GDPR compliance, such as cookie consent banners and options to exclude users who have opted out of tracking. It’s essential to verify that your chosen platform offers features to comply with local regulations and ensure your campaigns remain ethical.

15. Can ad remarketing software target users on mobile apps?

Yes, many ad remarketing tools allow you to target users on mobile apps. Platforms like Google Ads and Meta integrate with app data to retarget users based on their in-app behavior. This is particularly useful for promoting app engagement or driving conversions directly within mobile platforms.

16. What’s the difference between first-party and third-party data in remarketing?

First-Party Data: Information you collect directly from users, such as site visits, app activity, or CRM data.

Third-Party Data: Information from external sources, such as purchased audience segments.

Ad remarketing software primarily relies on first-party data for accurate targeting while staying compliant with privacy regulations.

17. How does ad remarketing software help with multi-channel campaigns?

Ad remarketing software enables businesses to run campaigns across multiple platforms, such as Meta, Google Ads, TikTok, and YouTube. It consolidates audience data and syncs it across these channels, ensuring a consistent experience for users. This means a visitor who browsed your product on desktop might see a retargeting ad on Instagram, followed by a reminder on YouTube. Multi-channel targeting increases your chances of re-engaging users wherever they spend their time.

18. How do ad remarketing tools use tracking pixels?

Tracking pixels are small pieces of code that collect data on user behavior. When someone visits your site, the pixel logs their activity, such as pages viewed or items added to their cart. Ad remarketing tools use this data to group users into specific segments and deliver personalized ads based on their actions. For example, a visitor who viewed a specific product might later see an ad featuring that product with a limited-time discount.
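To make the mechanics concrete, here is a purely illustrative sketch of a tracking-pixel endpoint, not any vendor’s actual implementation: a tiny server that returns a 1x1 GIF, logs the visitor’s events, and buckets the visitor into a simple behavioral segment.

import base64
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# 1x1 transparent GIF returned to the browser
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")
visits = {}  # visitor_id -> list of logged events

class PixelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The page embeds the pixel with query parameters describing the event
        params = parse_qs(urlparse(self.path).query)
        visitor = params.get("vid", ["anonymous"])[0]
        event = params.get("event", ["page_view"])[0]
        visits.setdefault(visitor, []).append(event)

        # Very rough segmentation based on logged behavior
        events = visits[visitor]
        if "add_to_cart" in events and "purchase" not in events:
            segment = "cart_abandoner"
        elif "purchase" in events:
            segment = "customer"
        else:
            segment = "browser"
        print(f"{visitor}: events={events} segment={segment}")

        # Respond with the invisible image so the page renders normally
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.end_headers()
        self.wfile.write(PIXEL)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), PixelHandler).serve_forever()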

19. What types of businesses should use ad remarketing software?

Ad remarketing software is useful for a variety of businesses, including:

Ecommerce Brands: Recover cart abandoners and drive repeat purchases.

B2B Companies: Re-engage leads with account-based targeting.

SaaS Platforms: Nurture trial users into paid customers.

Local Businesses: Use geotargeting to attract nearby customers.

No matter your industry, if you have a digital presence, remarketing can help maximize conversions and revenue.

20. How does ad remarketing software integrate with CRMs?

Ad remarketing software integrates with CRMs to pull customer data, such as contact details or purchase history, into your retargeting campaigns. This allows you to create more personalized ads tailored to specific customer segments. For example, you could target previous buyers with ads for complementary products or reward loyal customers with exclusive offers. Popular integrations include HubSpot, Salesforce, and Zoho CRM.

21. Can ad remarketing software retarget anonymous visitors?

Yes, some ad remarketing tools can retarget anonymous visitors using cookies or other tracking methods. While these tools don’t identify the user’s personal information, they can recognize behavior like the pages viewed or time spent on your site. This data enables you to serve relevant ads, even if the visitor hasn’t created an account or shared contact details.

22. What are some common mistakes to avoid in ad remarketing?

Avoiding common pitfalls can make your campaigns more effective:

Overexposure: Bombarding users with ads too frequently can lead to ad fatigue.

Generic Ads: Not personalizing ads based on behavior reduces engagement.

Poor Audience Segmentation: Targeting too broad an audience wastes ad spend.

Ignoring Mobile Optimization: With most users browsing on mobile, unoptimized ads can hurt performance.

Carefully managing these factors helps ensure your campaigns deliver results without alienating your audience.

23. How do ad remarketing tools help with personalization?

Ad remarketing tools use dynamic ad creation to deliver personalized content based on user behavior. For example:

A user who browsed a specific product might see an ad featuring that exact item.

Visitors who abandoned their cart might receive an ad with a discount code.

This level of personalization creates a more engaging experience and significantly increases the likelihood of conversion.

24. How much does ad remarketing software typically cost?

The cost of ad remarketing software varies based on the platform and features. Many tools operate on a pay-per-click (PPC) or pay-per-impression model, where you only pay for the ads served. Some platforms, like Google Ads or Meta, have no upfront fees, while dedicated tools like Customers.ai or AdRoll start at around $30–$100/month. Advanced platforms with custom features may require higher budgets.

25. Can ad remarketing software help with retention marketing?

Yes, ad remarketing software can be a valuable tool for retention marketing. By targeting past customers with tailored ads, you can promote repeat purchases, upsells, or exclusive loyalty offers. For example, a customer who purchased a camera might see ads for accessories or upgrades. Retargeting past buyers is a cost-effective way to increase customer lifetime value and build brand loyalty.