List of Large Mixture of Experts (MoE) Models: Architecture, Performance, and Innovations in Scalable AI Solutions

Mixture of Experts (MoE) models represent a significant breakthrough in machine learning, offering an efficient approach to handling large-scale models. Unlike dense models, where all parameters are active during inference, MoE models activate only a fraction of their parameters for any given input while maintaining a much larger total parameter count. This design balances computational efficiency with scalability, making MoE models highly attractive for various use cases. It introduces unique trade-offs, including increased architectural complexity, but it provides greater flexibility for developers and researchers.
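
To make the sparse-activation idea concrete, here is a minimal top-k routing layer in PyTorch. The hidden sizes, the choice of eight experts, and the top-2 routing are illustrative assumptions for this sketch and are not tied to any specific model discussed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative sparse MoE layer: only k of n experts run for each token."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)            # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                      # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, chosen = torch.topk(gate_logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # normalize over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoELayer()(tokens).shape)                            # torch.Size([16, 512])
```

Each token passes through only two of the eight expert networks, so the compute per token stays close to that of a much smaller dense model even as the total parameter count grows.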

Let’s explore the largest MoE models released to date, focusing on their architecture, capabilities, and relative performance. These models are all publicly available and exceed 100 billion parameters. The analysis is ordered chronologically by release date, with rankings provided where available from the LMSYS leaderboard as of November 4, 2024.

Google’s Switch-C Transformer is one of the earliest models in the MoE space. Released on Hugging Face in November 2022, it boasts a staggering 1.6 trillion total parameters, supported by 2048 experts. Despite being an early innovator in this domain, Switch-C is now considered outdated, as it is not ranked on modern benchmarks like LMSYS. However, it remains noteworthy as a foundational MoE model and continues to influence subsequent innovations. Smaller variants of the Switch-C Transformer are also available, offering more accessible entry points for experimentation.

In March 2024, xAI released Grok-1, a model with 314 billion total parameters and 86 billion active during inference. Unlike Switch-C, Grok-1 uses a much smaller pool of experts, eight in total, with only two active per inference task. Its 8k context length is suitable for moderately long input sequences, though it is not competitive with newer models. While Grok-1 has limited adoption and is not ranked on LMSYS, its successor, Grok-2, has shown promise in preliminary benchmarks. Grok-2, yet to be publicly released, has ranked fifth overall in specific LMSYS tasks, suggesting that future iterations of this model could redefine performance benchmarks in the MoE landscape.

Shortly after Grok-1, Databricks released DBRX in late March 2024. This model features 132 billion total parameters, with 36 billion active, spread across 16 experts. Its 32k context length significantly outpaces many contemporaries, allowing it to process longer input sequences efficiently. DBRX is supported by multiple backends, including llama.cpp, ExLlamaV2, and vLLM, making it a versatile choice for developers. Despite its strong architecture, its LMSYS rankings place it only at 90th overall and 78th for hard prompts in English, indicating room for improvement in quality and adoption.

April 2024 saw the release of Mistral AI’s Mixtral 8x22B. This model stands out with its 141 billion total parameters and 39 billion active during inference. It incorporates eight experts, two of which are chosen dynamically based on the input. With a 64k context length, Mixtral is well-suited for tasks requiring extensive input handling. While its LMSYS rankings, 70th overall and 66th on hard prompts, indicate middling performance, its compatibility with multiple backends ensures usability across diverse platforms.

Another April release was Snowflake’s Arctic, an MoE model with 480 billion total parameters but only 17 billion active during inference. Arctic’s unique design combines sparse (7 billion) and dense (10 billion) active components, with the sparse portion distributed among 128 experts. However, its performance falls short, ranking 99th overall on LMSYS and a notably low 101st for hard prompts. Its limited 4k context length further restricts its applicability, making it a less competitive option despite its innovative architecture.

Skywork joined the MoE space in June 2024 with the release of Skywork-MoE. This model features 146 billion total parameters, of which 22 billion are active, routed among 16 experts. With an 8k context length, it supports moderately lengthy tasks but lacks LMSYS rankings, which suggests limited testing or adoption. The base model is the only available version, as the promised chat variant has yet to be released.

In August 2024, AI21 Labs released Jamba 1.5 Large, a hybrid model that merges MoE and mamba-transformer architectures. With 398 billion total parameters and 98 billion active, Jamba 1.5 Large offers an exceptional 256k context length, making it ideal for tasks requiring extensive input processing. Its LMSYS rankings reflect its high performance, placing 34th overall and 28th for hard prompts. Additionally, Jamba models excel in context benchmarks, particularly the RULER context benchmark, solidifying their reputation for long-context tasks.

DeepSeek V2.5, released in September 2024, currently leads the MoE space in performance. This model incorporates 236 billion total parameters, with 21 billion active during inference. Its architecture includes 160 experts, of which six are dynamically chosen and two are shared, resulting in eight active experts per token. With a 128k context length, DeepSeek V2.5 demonstrates robust capabilities for long-context tasks. It ranks 18th overall on LMSYS and 6th for hard prompts, outperforming all other openly available MoE models. Earlier iterations, such as DeepSeek V2, laid the groundwork for its success.

The most recent addition to the MoE family is Tencent’s Hunyuan Large, released in November 2024. With 389 billion total parameters and 52 billion active, Hunyuan Large employs a unique design, where one expert is chosen dynamically and one is shared, resulting in two active experts per token during inference. Its 128k context length matches that of DeepSeek V2.5, positioning it as a strong competitor. While it is not yet ranked on LMSYS, early indications suggest it could rival or surpass DeepSeek’s performance.

Among the MoE models discussed, DeepSeek V2.5 is the most robust option currently available. However, newer models such as Hunyuan Large and the anticipated Grok-2 may soon shift the rankings. Models like Jamba 1.5 Large also highlight the strengths of hybrid architectures, particularly in tasks requiring extensive context handling. The LMSYS rankings, while useful for initial comparisons, do not capture every nuance of model performance, especially for specialized tasks.

In conclusion, MoE models represent a growing frontier in AI, offering scalable and efficient solutions tailored to diverse applications. Developers and researchers are encouraged to explore these models based on specific use cases, leveraging their unique architectures to optimize performance. As the field evolves, the MoE landscape will likely witness further innovations, pushing the boundaries of what these architectures can achieve.

This article is based on this Reddit post. All credit for this research goes to the researchers of this project.

Why AI Language Models Are Still Vulnerable: Key Insights from Kili Technology’s Report on Large Language Model Vulnerabilities

Kili Technology recently released a detailed report highlighting significant vulnerabilities in AI language models, focusing on their susceptibility to pattern-based misinformation attacks. As AI systems become integral to both consumer products and enterprise tools, understanding and mitigating such vulnerabilities is crucial for ensuring their safe and ethical use. This article explores the insights from Kili Technology’s new multilingual study and its associated findings, emphasizing how leading models like CommandR+, Llama 3.2, and GPT4o can be compromised, even with supposedly robust safeguards.

Few/Many Shot Attack and Pattern-Based Vulnerabilities

The core revelation from Kili Technology’s report is that even advanced large language models (LLMs) can be manipulated to produce harmful outputs through the “Few/Many Shot Attack” approach. This technique involves providing the model with carefully selected examples, thereby conditioning it to replicate and extend that pattern in harmful or misleading ways. The study found this method to have a staggering success rate of up to 92.86%, proving highly effective against some of the most advanced models available today.

The research encompassed major LLMs such as CommandR+, Llama 3.2, and GPT4o. Interestingly, all models showed notable susceptibility to pattern-based misinformation despite their built-in safety features. This vulnerability was exacerbated by the models’ inherent reliance on input cues—once a malicious prompt set a misleading context, the model would follow it with high fidelity, regardless of the negative implications.

Cross-Lingual Insights: Disparities in AI Vulnerabilities

Another key aspect of Kili’s research is its focus on multilingual performance. The evaluation extended beyond English to include French, examining whether language differences impact model safety. Remarkably, the models were consistently more vulnerable when prompted in English compared to French, suggesting that current safeguards may not be uniformly effective across languages.

In practical terms, this highlights a critical blind spot in AI safety: models that are reasonably resistant to attack in one language may still be highly vulnerable in another. Kili’s findings emphasize the need for more holistic, cross-lingual approaches to AI safety, which should include diverse languages representing various cultural and geopolitical contexts. Such an approach is particularly pertinent as LLMs are increasingly deployed globally, where multilingual capabilities are essential.

The report mentioned that 102 prompts were crafted for each language, meticulously adapting them to reflect linguistic and cultural nuances. Notably, English prompts were derived from both American and British contexts, and then translated and adapted for French. The results showed that, while French prompts had lower success rates in manipulating models, vulnerabilities remained significant enough to warrant concern.

Erosion of Safety Measures During Extended Interactions

One of the most concerning findings of the report is that AI models tend to exhibit a gradual erosion of their ethical safeguards over the course of extended interactions. Initially, models might respond cautiously, even refusing to generate harmful outputs when prompted directly. However, as the conversation continues, these safeguards often weaken, resulting in the model eventually complying with harmful requests.

For example, in scenarios where CommandR+ was initially reluctant to generate explicit content, the continued conversation led to the model eventually succumbing to user pressure. This raises critical questions about the reliability of current safety frameworks and their ability to maintain consistent ethical boundaries, especially during prolonged user engagements.

Ethical and Societal Implications

The findings presented by Kili Technology underscore significant ethical challenges in AI deployment. The ease with which advanced models can be manipulated to produce harmful or misleading outputs poses risks not just to individual users but also to broader society. From fake news to polarizing narratives, the weaponization of AI for misinformation has the potential to impact everything from political stability to individual safety.

Moreover, the observed inconsistencies in ethical behavior across languages also point to an urgent need for inclusive, multilingual training strategies. The fact that vulnerabilities are more easily exploited in English compared to French suggests that non-English users might currently benefit from an unintentional layer of protection—a disparity that highlights the uneven application of safety standards.

Looking Forward: Strengthening AI Defenses

Kili Technology’s comprehensive evaluation provides a foundation for enhancing LLM safety. Their findings suggest that AI developers need to prioritize the robustness of safety measures across all phases of interaction and in all languages. Techniques like adaptive safety frameworks, which can dynamically adjust to the nature of extended user interactions, may be required to maintain ethical standards without succumbing to gradual degradation.

The research team at Kili Technology emphasized their plans to broaden the scope of their analysis to other languages, including those representing different language families and cultural contexts. This systematic expansion is aimed at building more resilient AI systems that are capable of safeguarding users regardless of their linguistic or cultural background.

Collaboration across AI research organizations will be crucial in mitigating these vulnerabilities. Red teaming techniques must become an integral part of AI model evaluation and development, with a focus on creating adaptive, multilingual, and culturally sensitive safety mechanisms. By systematically addressing the gaps uncovered in Kili’s research, AI developers can work towards creating models that are not only powerful but also ethical and reliable.

Conclusion

Kili Technology’s recent report provides a comprehensive look at the current vulnerabilities in AI language models. Despite advancements in model safety, the findings reveal that significant weaknesses remain, particularly in their susceptibility to misinformation and coercion, as well as the inconsistent performance across different languages. As LLMs become increasingly embedded in various aspects of society, ensuring their safety and ethical alignment is paramount.

Check out the Full Report here. All credit for this research goes to the researchers of this project.

Thanks to Kili Technology for this thought leadership/educational article. Kili Technology has supported us in this content.

GaLiTe and AGaLiTe: Efficient Transformer Alternatives for Partially Observable Online Reinforcement Learning

In real-world settings, agents often face limited visibility of the environment, complicating decision-making. For instance, a car-driving agent must recall road signs from moments earlier to adjust its speed, yet storing all observations is unscalable due to memory limits. Instead, agents must learn compressed representations of observations. This challenge is compounded in ongoing tasks, where essential past information cannot always be retained efficiently. Incremental state construction is key in partially observable online reinforcement learning (RL), where recurrent neural networks (RNNs) like LSTMs handle sequences effectively, though they’re tough to train. Transformers capture long-term dependencies but come with higher computational costs.

Various approaches have extended linear transformers to address their limitations in handling sequential data. One architecture uses a scalar gating method to accumulate values over time, while others add recurrence and non-linear updates to enhance learning from sequential dependencies, although this can reduce parallelization efficiency. Additionally, some models selectively calculate sparse attention or cache previous activations, allowing them to attend to longer sequences without significant memory cost. Other recent innovations reduce the complexity of self-attention, improving transformers’ ability to process long contexts efficiently. Though transformers are commonly used in offline reinforcement learning, their application in model-free settings is still emerging.

Researchers from the University of Alberta and Amii developed two new transformer architectures tailored for partially observable online reinforcement learning, addressing issues with high inference costs and memory demands typical of traditional transformers. Their proposed models, GaLiTe and AGaLiTe, implement a gated self-attention mechanism to manage and update information efficiently, providing a context-independent inference cost and improved performance in long-range dependencies. Testing in 2D and 3D environments, like T-Maze and Craftax, showed these models outperformed or matched the state-of-the-art GTrXL, reducing memory and computation by over 40%, with AGaLiTe achieving up to 37% better performance on complex tasks.

The Gated Linear Transformer (GaLiTe) enhances linear transformers by addressing key limitations, particularly the lack of mechanisms to remove outdated information and the reliance on the kernel feature map choice. GaLiTe introduces a gating mechanism to control information flow, allowing selective memory retention and a parameterized feature map to compute key and query vectors without needing specific kernel functions. For further efficiency, the Approximate Gated Linear Transformer (AGaLiTe) utilizes a low-rank approximation to reduce memory demands, storing recurrent states as vectors rather than matrices. This approach achieves significant space and time savings compared to other architectures, especially in complex reinforcement learning tasks.
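
To give a feel for the kind of recurrence involved, the sketch below implements a generic gated linear-attention update in NumPy: a matrix-valued state is decayed by a per-feature gate and refreshed with a key-value outer product at each step, so the per-step cost is independent of context length. This illustrates the general idea only, with an assumed gate parameterization, and is not the exact GaLiTe or AGaLiTe formulation.

```python
import numpy as np

def gated_linear_attention(queries, keys, values, gates):
    """Recurrent sketch of gated linear attention.

    Shapes: queries/keys/values/gates are (T, d); gates lie in [0, 1] and act
    as per-feature forget factors on the recurrent state.
    """
    T, d = keys.shape
    S = np.zeros((d, d))                 # recurrent state (key dim x value dim)
    z = np.zeros(d)                      # running normalizer
    outputs = np.empty((T, d))
    for t in range(T):
        S = gates[t][:, None] * S + np.outer(keys[t], values[t])   # gated update
        z = gates[t] * z + keys[t]
        outputs[t] = (queries[t] @ S) / (queries[t] @ z + 1e-6)
    return outputs

rng = np.random.default_rng(0)
q, k, v = (np.abs(rng.normal(size=(32, 16))) for _ in range(3))    # positive feature maps
g = rng.uniform(0.8, 1.0, size=(32, 16))                           # decay gates
print(gated_linear_attention(q, k, v, g).shape)                    # (32, 16)
```

AGaLiTe's further contribution, per the paper, is to avoid storing the full d x d state by using a low-rank approximation that keeps the recurrent state as vectors rather than matrices, which is where the reported memory savings come from.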

The study evaluates the proposed AGaLiTe model across several partially observable RL tasks. In these environments, agents require memory to handle different levels of partial observability, such as recalling single cues in T-Maze, integrating information over time in CartPole, or navigating through complex environments like Mystery Path, Craftax, and Memory Maze. AGaLiTe, equipped with a streamlined self-attention mechanism, achieves high performance, surpassing traditional models like GTrXL and GRU in effectiveness and computational efficiency. The results indicate that AGaLiTe’s design significantly reduces operations and memory usage, offering advantages for RL tasks with extensive context requirements.

In conclusion, transformers are highly effective for sequential data processing but face limitations in online reinforcement learning due to high computational demands and the need to maintain all historical data for self-attention. This study introduces two efficient, recurrence-based alternatives to transformer self-attention, GaLiTe and AGaLiTe, designed for partially observable RL tasks. Both models perform competitively or better than GTrXL, with over 40% lower inference costs and over 50% reduced memory usage. Future research may improve AGaLiTe with real-time learning updates and applications in model-based RL approaches like Dreamer V3.

Check out the Paper. All credit for this research goes to the researchers of this project.

Nexa AI Releases OmniVision-968M: World’s Smallest Vision Language Model with 9x Tokens Reduction for Edge Devices

Edge AI has long faced the challenge of balancing efficiency and effectiveness. Deploying Vision Language Models (VLMs) on edge devices is difficult due to their large size, high computational demands, and latency issues. Models designed for cloud environments often struggle with the limited resources of edge devices, resulting in excessive battery usage, slower response times, and inconsistent connectivity. The demand for lightweight yet efficient models has been growing, driven by applications such as augmented reality, smart home assistants, and industrial IoT, which require rapid processing of visual and textual inputs. These challenges are further complicated by increased hallucination rates and unreliable results in tasks like visual question answering or image captioning, where quality and accuracy are essential.

Nexa AI has released OmniVision-968M, which it bills as the world’s smallest vision language model, featuring a 9x reduction in image tokens for edge devices. OmniVision-968M has been engineered with an improved architecture over LLaVA (Large Language and Vision Assistant), achieving a new level of compactness and efficiency ideal for running on the edge. With a design focused on reducing image tokens by a factor of nine, from 729 to just 81, the latency and computational burden typically associated with such models have been drastically minimized.

OmniVision’s architecture is built around three main components:

Base Language Model: Qwen2.5-0.5B-Instruct serves as the core model for processing text inputs.

Vision Encoder: SigLIP-400M, with a 384 resolution and 14×14 patch size, generates image embeddings.

Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder’s embeddings with the token space of the language model. Unlike the standard LLaVA architecture, this projector reduces the number of image tokens by a factor of nine (a rough sketch of this reduction follows below).
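
Nexa AI’s announcement does not spell out the projector’s internals, but the following PyTorch sketch shows one way a 9x token reduction can be realized: merging each 3x3 neighborhood of the 27x27 SigLIP patch grid (729 tokens) into a single token (81 tokens), then projecting the concatenated features into the language model’s embedding space with an MLP. The hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) are assumptions used only for illustration.

```python
import torch
import torch.nn as nn

class TokenReducingProjector(nn.Module):
    """Sketch of a 9x image-token reduction: 27x27 = 729 patch embeddings are
    grouped into 3x3 neighborhoods (9x9 = 81 tokens), then an MLP projects the
    concatenated features into the language model's token space."""
    def __init__(self, vision_dim=1152, lm_dim=896, grid=27, merge=3):
        super().__init__()
        self.grid, self.merge = grid, merge
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * merge * merge, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patches):                       # (B, 729, vision_dim)
        B, N, C = patches.shape
        g, m = self.grid, self.merge
        x = patches.view(B, g, g, C)
        x = x.view(B, g // m, m, g // m, m, C)        # split the grid into 3x3 blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (g // m) ** 2, m * m * C)
        return self.mlp(x)                            # (B, 81, lm_dim)

img_tokens = torch.randn(1, 729, 1152)                # SigLIP-style patch embeddings
print(TokenReducingProjector()(img_tokens).shape)     # torch.Size([1, 81, 896])
```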

OmniVision-968M integrates several key technical advancements that make it a perfect fit for edge deployment. The model’s architecture has been enhanced based on LLaVA, allowing it to process both visual and text inputs with high efficiency. The image token reduction from 729 to 81 represents a significant leap in optimization, giving it nine times fewer image tokens to process than comparable models. This has a profound impact on reducing latency and computational costs, which are critical factors for edge devices. Furthermore, OmniVision-968M leverages Direct Preference Optimization (DPO) training with trustworthy data sources, which helps mitigate the problem of hallucination, a common challenge in multimodal AI systems. By focusing on visual question answering and image captioning, the model aims to offer a seamless, accurate user experience, ensuring reliability and robustness in edge applications where real-time response and power efficiency are crucial.

The release of OmniVision-968M represents a notable advancement for several reasons. Primarily, the reduction in token count significantly decreases the computational resources required for inference. For developers and enterprises looking to implement VLMs in constrained environments—such as wearables, mobile devices, and IoT hardware—the compact size and efficiency of OmniVision-968M make it an ideal solution. Furthermore, the DPO training strategy helps minimize hallucination, a common issue where models generate incorrect or misleading information, ensuring that OmniVision-968M is both efficient and reliable. Preliminary benchmarks indicate that OmniVision-968M achieves a 35% reduction in inference time compared to previous models while maintaining or even improving accuracy in tasks like visual question answering and image captioning. These advancements are expected to encourage adoption across industries that require high-speed, low-power AI interactions, such as healthcare, smart cities, and the automotive sector.

In conclusion, Nexa AI’s OmniVision-968M addresses a long-standing gap in the AI industry: the need for highly efficient vision language models that can run seamlessly on edge devices. By reducing image tokens, optimizing LLaVA’s architecture, and incorporating DPO training to ensure trustworthy outputs, OmniVision-968M represents a new frontier in edge AI. This model brings us closer to the vision of ubiquitous AI—where smart, connected devices can perform sophisticated multimodal tasks locally without the need for constant cloud support.

Check out the Model on Hugging Face and Other Details. All credit for this research goes to the researchers of this project.

Apple Researchers Propose Cut Cross-Entropy (CCE): A Machine Learning Method that Computes the Cross-Entropy Loss without Materializing the Logits for all Tokens into Global Memory

Advancements in large language models (LLMs) have revolutionized natural language processing, with applications spanning text generation, translation, and summarization. These models rely on large amounts of data, large parameter counts, and expansive vocabularies, necessitating sophisticated techniques to manage computational and memory requirements. A critical component of LLM training is the cross-entropy loss computation, which, while central to model accuracy, presents significant memory challenges due to the size and complexity of the vocabulary.

The memory requirements of the cross-entropy loss layer constrain the training of large language models, especially as vocabulary sizes reach hundreds of thousands of tokens. The issue becomes acute in models like Gemma 2 (2B), where the cross-entropy loss computation alone can consume up to 24 GB of memory, accounting for up to 90% of the memory footprint during training. These limitations restrict batch sizes and force trade-offs between model performance and computational feasibility, posing a significant bottleneck for scalability.

Previous methods aimed at reducing memory usage, such as FlashAttention and hierarchical vocabularies, have addressed specific components like self-attention but fall short in alleviating the burden of the cross-entropy layer. Chunking methods reduce memory requirements but introduce latency trade-offs, limiting their practical use. Also, these approaches do not fully exploit the sparsity of gradients or leverage hardware optimizations, leaving room for improvement.

Researchers at Apple introduced the Cut Cross-Entropy (CCE) method, a novel approach designed to overcome the memory challenges associated with large vocabulary models. Unlike conventional methods that compute and store all logits for tokens in memory, CCE dynamically calculates only the necessary logits and performs log-sum-exp reductions in on-chip memory. This technique eliminates the need to materialize large matrices in GPU memory, significantly reducing the memory footprint. For instance, in the Gemma 2 model, the memory usage for loss computation dropped from 24 GB to just 1 MB, with total classifier head memory consumption reduced from 28 GB to 1 GB.

The core of CCE lies in its efficient computation strategy, which employs custom CUDA kernels to process embeddings and perform reductions. By calculating logits on the fly and avoiding intermediate memory storage, the method capitalizes on shared GPU memory, which is faster and more efficient than traditional global memory usage. Also, gradient filtering selectively skips computations that contribute negligibly to the gradient, leveraging the inherent sparsity of the softmax matrix. Vocabulary sorting optimizes processing by grouping tokens with significant contributions, minimizing wasted computation. Together, these innovations enable a memory-efficient, low-latency loss computation mechanism.
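
The published implementation relies on custom CUDA kernels, but the identity CCE exploits can be illustrated in plain PyTorch: accumulate the log-sum-exp over the vocabulary in chunks so the full logits matrix is never materialized, then subtract the logit of the correct token. The chunk size and tensor shapes below are arbitrary, and this sketch captures only the memory-saving identity, not the paper's fused kernels or gradient filtering.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, classifier, targets, chunk=8192):
    """Cross-entropy via chunked log-sum-exp: loss_i = logsumexp_v(h_i . w_v) - h_i . w_{y_i}.

    Only a (batch, chunk) slice of logits exists at any time, instead of the
    full (batch, vocab) matrix."""
    vocab = classifier.shape[0]
    lse = torch.full((hidden.shape[0],), float("-inf"), device=hidden.device)
    for start in range(0, vocab, chunk):
        logits = hidden @ classifier[start:start + chunk].T          # partial logits
        lse = torch.logaddexp(lse, torch.logsumexp(logits, dim=-1))  # running reduction
    correct = (hidden * classifier[targets]).sum(-1)                 # logit of the true token
    return (lse - correct).mean()

B, D, V = 4, 256, 50000
h, W, y = torch.randn(B, D), torch.randn(V, D), torch.randint(V, (B,))
print(torch.allclose(chunked_cross_entropy(h, W, y),
                     F.cross_entropy(h @ W.T, y), atol=1e-3))        # True
```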

The performance gains from CCE are remarkable. Memory reductions enabled a 10-fold increase in batch size for smaller models like GPT-2 and a 1.5-fold increase for larger models like Llama 2 (13B). Training throughput remained unaffected, and experimental results demonstrated stable convergence, matching the performance of traditional methods. For a batch of 8,192 tokens with a vocabulary size of 256,000, CCE achieved a peak memory usage of just 1 MB compared to 28 GB in baseline methods. Training stability tests on models such as Llama 3 (8B) and Phi 3.5 Mini confirmed the reliability of CCE, with indistinguishable loss curves compared to existing approaches.

This research highlights several key takeaways:

Significant Memory Reduction: CCE reduces memory usage for cross-entropy loss computation to negligible levels, as low as 1 MB for large-scale models like Gemma 2 (2B).  

Improved Scalability: By enabling larger batch sizes, the method supports more efficient utilization of computational resources, which is crucial for training extensive models.  

Efficiency Gains: Custom CUDA kernels and gradient filtering ensure that the reduction in memory footprint does not compromise training speed or model convergence.  

Practical Applicability: The method is adaptable to various architectures and scenarios, with potential applications extending to image classification and contrastive learning.  

Future Potential: CCE’s ability to handle large vocabularies with minimal memory impact could facilitate training even more extensive models with improved pipeline balancing.

In conclusion, the CCE method represents a significant breakthrough in training large language models by addressing the critical bottleneck of memory-intensive cross-entropy loss layers. Through innovative techniques like dynamic logit computation, gradient filtering, and vocabulary sorting, CCE enables dramatic reductions in memory usage without sacrificing speed or accuracy. This advancement not only enhances the efficiency of current models but also paves the way for more scalable and balanced architectures in the future, opening new possibilities for large-scale machine learning.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Considerations for addressing the core dimensions of responsible AI for Amazon Bedrock applications

The rapid advancement of generative AI promises transformative innovation, yet it also presents significant challenges. Concerns about legal implications, accuracy of AI-generated outputs, data privacy, and broader societal impacts have underscored the importance of responsible AI development. Responsible AI is a practice of designing, developing, and operating AI systems guided by a set of dimensions with the goal to maximize benefits while minimizing potential risks and unintended harm. Our customers want to know that the technology they are using was developed in a responsible way. They also want resources and guidance to implement that technology responsibly in their own organization. Most importantly, they want to make sure the technology they roll out is for everyone’s benefit, including end-users. At AWS, we are committed to developing AI responsibly, taking a people-centric approach that prioritizes education, science, and our customers, integrating responsible AI across the end-to-end AI lifecycle.
What constitutes responsible AI is continually evolving. For now, we consider eight key dimensions of responsible AI: Fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency. These dimensions make up the foundation for developing and deploying AI applications in a responsible and safe manner.
At AWS, we help our customers transform responsible AI from theory into practice—by giving them the tools, guidance, and resources to get started with purpose-built services and features, such as Amazon Bedrock Guardrails. In this post, we introduce the core dimensions of responsible AI and explore considerations and strategies on how to address these dimensions for Amazon Bedrock applications. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Safety
The safety dimension in responsible AI focuses on preventing harmful system output and misuse, steering AI systems to prioritize user and societal well-being.
Amazon Bedrock is designed to facilitate the development of secure and reliable AI applications by incorporating various safety measures. In the following sections, we explore different aspects of implementing these safety measures and provide guidance for each.
Addressing model toxicity with Amazon Bedrock Guardrails
Amazon Bedrock Guardrails supports AI safety by working towards preventing the application from generating or engaging with content that is considered unsafe or undesirable. These safeguards can be created for multiple use cases and implemented across multiple FMs, depending on your application and responsible AI requirements. For example, you can use Amazon Bedrock Guardrails to filter out harmful user inputs and toxic model outputs, redact by either blocking or masking sensitive information from user inputs and model outputs, or help prevent your application from responding to unsafe or undesired topics.
Content filters can be used to detect and filter harmful or toxic user inputs and model-generated outputs. By implementing content filters, you can help prevent your AI application from responding to inappropriate user behavior, and make sure your application provides only safe outputs. This can also mean providing no output at all, in situations where certain user behavior is unwanted. Content filters support six categories: hate, insults, sexual content, violence, misconduct, and prompt injections. Filtering is done based on confidence classification of user inputs and FM responses across each category. You can adjust filter strengths to determine the sensitivity of filtering harmful content. Increasing a filter’s strength raises the probability that unwanted content is filtered.
Denied topics are a set of topics that are undesirable in the context of your application. These topics will be blocked if detected in user queries or model responses. You define a denied topic by providing a natural language definition of the topic along with a few optional example phrases of the topic. For example, if a medical institution wants to make sure their AI application avoids giving any medication or medical treatment-related advice, they can define the denied topic as “Information, guidance, advice, or diagnoses provided to customers relating to medical conditions, treatments, or medication” and optional input examples like “Can I use medication A instead of medication B,” “Can I use Medication A for treating disease Y,” or “Does this mole look like skin cancer?” Developers will need to specify a message that will be displayed to the user whenever denied topics are detected, for example “I am an AI bot and cannot assist you with this problem, please contact our customer service/your doctor’s office.” Avoiding specific topics that aren’t toxic by nature but can potentially be harmful to the end-user is crucial when creating safe AI applications.
Word filters are used to configure filters to block undesirable words, phrases, and profanity. Such words can include offensive terms or undesirable outputs, like product or competitor information. You can add up to 10,000 items to the custom word filter to filter out topics you don’t want your AI application to produce or engage with.
Sensitive information filters are used to block or redact sensitive information such as personally identifiable information (PII) or your specified context-dependent sensitive information in user inputs and model outputs. This can be useful when you have requirements for sensitive data handling and user privacy. If the AI application doesn’t process PII information, your users and your organization are safer from accidental or intentional misuse or mishandling of PII. The filter is configured to block sensitive information requests; upon such detection, the guardrail will block content and display a preconfigured message. You can also choose to redact or mask sensitive information, which will either replace the data with an identifier or delete it completely.
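To make these guardrail policies concrete, the following is a minimal sketch using the create_guardrail call on the boto3 Amazon Bedrock control-plane client. The guardrail name, denied topic, word list, PII choices, filter strengths, and messages are placeholders for your own policy, and the exact parameters should be checked against the current Amazon Bedrock API reference.
```python
import boto3

bedrock = boto3.client("bedrock")  # control-plane client; region and credentials assumed

response = bedrock.create_guardrail(
    name="demo-app-guardrail",
    description="Blocks toxic content and medical advice, redacts PII",
    contentPolicyConfig={"filtersConfig": [
        {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        {"type": "VIOLENCE", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
        {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
    ]},
    topicPolicyConfig={"topicsConfig": [{
        "name": "Medical advice",
        "definition": "Guidance, advice, or diagnoses relating to medical "
                      "conditions, treatments, or medication.",
        "examples": ["Can I use medication A instead of medication B?"],
        "type": "DENY",
    }]},
    wordPolicyConfig={"wordsConfig": [{"text": "CompetitorProduct"}],
                      "managedWordListsConfig": [{"type": "PROFANITY"}]},
    sensitiveInformationPolicyConfig={"piiEntitiesConfig": [
        {"type": "EMAIL", "action": "ANONYMIZE"},
        {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
    ]},
    blockedInputMessaging="I am an AI bot and cannot assist you with this request.",
    blockedOutputsMessaging="I am an AI bot and cannot assist you with this request.",
)
print(response["guardrailId"], response["version"])
```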
Measuring model toxicity with Amazon Bedrock model evaluation
Amazon Bedrock provides a built-in capability for model evaluation. Model evaluation is used to compare different models’ outputs and select the most appropriate model for your use case. Model evaluation jobs support common use cases for large language models (LLMs) such as text generation, text classification, question answering, and text summarization. You can choose to create either an automatic model evaluation job or a model evaluation job that uses a human workforce. For automatic model evaluation jobs, you can either use built-in datasets across three predefined metrics (accuracy, robustness, toxicity) or bring your own datasets. For human-in-the-loop evaluation, which can be done by either AWS managed or customer managed teams, you must bring your own dataset.
If you are planning on using automated model evaluation for toxicity, start by defining what constitutes toxic content for your specific application. This may include offensive language, hate speech, and other forms of harmful communication. Automated evaluations come with curated datasets to choose from. For toxicity, you can use either RealToxicityPrompts or BOLD datasets, or both. If you bring your custom model to Amazon Bedrock, you can implement scheduled evaluations by integrating regular toxicity assessments into your development pipeline at key stages of model development, such as after major updates or retraining sessions. For early detection, implement custom testing scripts that run toxicity evaluations on new data and model outputs continuously.
Amazon Bedrock and its safety capabilities help developers create AI applications that prioritize safety and reliability, thereby fostering trust and enforcing ethical use of AI technology. You should experiment and iterate on your chosen safety approaches to achieve the desired performance. Diverse feedback is also important, so think about implementing human-in-the-loop testing to assess model responses for safety and fairness.
Controllability
Controllability focuses on having mechanisms to monitor and steer AI system behavior. It refers to the ability to manage, guide, and constrain AI systems to make sure they operate within desired parameters.
Guiding AI behavior with Amazon Bedrock Guardrails
To provide direct control over what content the AI application can produce or engage with, you can use Amazon Bedrock Guardrails, which we discussed under the safety dimension. This allows you to steer and manage the system’s outputs effectively.
You can use content filters to manage AI outputs by setting sensitivity levels for detecting harmful or toxic content. By controlling how strictly content is filtered, you can steer the AI’s behavior to help avoid undesirable responses. This allows you to guide the system’s interactions and outputs to align with your requirements. Defining and managing denied topics helps control the AI’s engagement with specific subjects. By blocking responses related to defined topics, you help AI systems remain within the boundaries set for its operation.
Amazon Bedrock Guardrails can also guide the system’s behavior for compliance with content policies and privacy standards. Custom word filters allow you to block specific words, phrases, and profanity, giving you direct control over the language the AI uses. And managing how sensitive information is handled, whether by blocking or redacting it, allows you to control the AI’s approach to data privacy and security.
Monitoring and adjusting performance with Amazon Bedrock model evaluation
To assess and adjust AI performance, you can look at Amazon Bedrock model evaluation. This helps systems operate within desired parameters and meet safety and ethical standards. You can explore both automatic and human-in-the-loop evaluation. These evaluation methods help you monitor and guide model performance by assessing how well models meet safety and ethical standards. Regular evaluations allow you to adjust and steer the AI’s behavior based on feedback and performance metrics.
Integrating scheduled toxicity assessments and custom testing scripts into your development pipeline helps you continuously monitor and adjust model behavior. This ongoing control helps AI systems to remain aligned with desired parameters and adapt to new data and scenarios effectively.
Fairness
The fairness dimension in responsible AI considers the impacts of AI on different groups of stakeholders. Achieving fairness requires ongoing monitoring, bias detection, and adjustment of AI systems to maintain impartiality and justice.
To help with fairness in AI applications that are built on top of Amazon Bedrock, application developers should explore model evaluation and human-in-the-loop validation for model outputs at different stages of the machine learning (ML) lifecycle. Measuring bias presence before and after model training as well as at model inference is the first step in mitigating bias. When developing an AI application, you should set fairness goals, metrics, and potential minimum acceptable thresholds to measure performance across different qualities and demographics applicable to the use case. On top of these, you should create remediation plans for potential inaccuracies and bias, which may include modifying datasets, finding and deleting the root cause for bias, introducing new data, and potentially retraining the model.
Amazon Bedrock provides a built-in capability for model evaluation, as we explored under the safety dimension. For general text generation evaluation for measuring model robustness and toxicity, you can use the built-in fairness dataset Bias in Open-ended Language Generation Dataset (BOLD), which focuses on five domains: profession, gender, race, religious ideologies, and political ideologies. To assess fairness for other domains or tasks, you must bring your own custom prompt datasets.
Transparency
The transparency dimension in generative AI focuses on understanding how AI systems make decisions, why they produce specific results, and what data they’re using. Maintaining transparency is critical for building trust in AI systems and fostering responsible AI practices.
To help meet the growing demand for transparency, AWS introduced AWS AI Service Cards, a dedicated resource aimed at enhancing customer understanding of our AI services. AI Service Cards serve as a cornerstone of responsible AI documentation, consolidating essential information in one place. They provide comprehensive insights into the intended use cases, limitations, responsible AI design principles, and best practices for deployment and performance optimization of our AI services. They are part of a comprehensive development process we undertake to build our services in a responsible way.
At the time of writing, we offer the following AI Service Cards for Amazon Bedrock models:

Amazon Titan Text Lite and Titan Text Express
Amazon Titan Text Premier

Service cards for other Amazon Bedrock models can be found directly on the provider’s website. Each card details the service’s specific use cases, the ML techniques employed, and crucial considerations for responsible deployment and use. These cards evolve iteratively based on customer feedback and ongoing service enhancements, so they remain relevant and informative.
An additional effort in providing transparency is the Amazon Titan Image Generator invisible watermark. Images generated by Amazon Titan come with this invisible watermark by default. This watermark detection mechanism enables you to identify images produced by Amazon Titan Image Generator, an FM designed to create realistic, studio-quality images in large volumes and at low cost using natural language prompts. By using watermark detection, you can enhance transparency around AI-generated content, mitigate the risks of harmful content generation, and reduce the spread of misinformation.
Content creators, news organizations, risk analysts, fraud detection teams, and more can use this feature to identify and authenticate images created by Amazon Titan Image Generator. The detection system also provides a confidence score, allowing you to assess the reliability of the detection even if the original image has been modified. Simply upload an image to the Amazon Bedrock console, and the API will detect watermarks embedded in images generated by the Amazon Titan model, including both the base model and customized versions. This tool not only supports responsible AI practices, but also fosters trust and reliability in the use of AI-generated content.
Veracity and robustness
The veracity and robustness dimension in responsible AI focuses on achieving correct system outputs, even with unexpected or adversarial inputs. The main focus of this dimension is to address possible model hallucinations. Model hallucinations occur when an AI system generates false or misleading information that appears to be plausible. Robustness in AI systems makes sure model outputs are consistent and reliable under various conditions, including unexpected or adverse situations. A robust AI model maintains its functionality and delivers consistent and accurate outputs even when faced with incomplete or incorrect input data.
Measuring accuracy and robustness with Amazon Bedrock model evaluation
As introduced in the AI safety and controllability dimensions, Amazon Bedrock provides tools for evaluating AI models in terms of toxicity, robustness, and accuracy. This makes sure the models don’t produce harmful, offensive, or inappropriate content and can withstand various inputs, including unexpected or adversarial scenarios.
Accuracy evaluation helps AI models produce reliable and correct outputs across various tasks and datasets. In the built-in evaluation, accuracy is measured against a TREX dataset and the algorithm calculates the degree to which the model’s predictions match the actual results. The actual metric for accuracy depends on the chosen use case; for example, in text generation, the built-in evaluation calculates a real-world knowledge score, which examines the model’s ability to encode factual knowledge about the real world. This evaluation is essential for maintaining the integrity, credibility, and effectiveness of AI applications.
Robustness evaluation makes sure the model maintains consistent performance across diverse and potentially challenging conditions. This includes handling unexpected inputs, adversarial manipulations, and varying data quality without significant degradation in performance.
Methods for achieving veracity and robustness in Amazon Bedrock applications
There are several techniques that you can consider when using LLMs in your applications to maximize veracity and robustness:

Prompt engineering – You can instruct the model to only engage in discussion about things it knows and not to generate any new information.
Chain-of-thought (CoT) – This technique involves the model generating intermediate reasoning steps that lead to the final answer, improving the model’s ability to solve complex problems by making its thought process transparent and logical. For example, you can ask the model to explain why it used certain information and created a certain output. This is a powerful method to reduce hallucinations. When you ask the model to explain the process it used to generate the output, the model has to identify the different steps taken and the information used, which itself reduces hallucination. To learn more about CoT and other prompt engineering techniques for Amazon Bedrock LLMs, see General guidelines for Amazon Bedrock LLM users.
Retrieval Augmented Generation (RAG) – This helps reduce hallucination by providing the right context and augmenting generated outputs with internal data to the models. With RAG, you can provide the context to the model and tell the model to only reply based on the provided context, which leads to fewer hallucinations. With Amazon Bedrock Knowledge Bases, you can implement the RAG workflow from ingestion to retrieval and prompt augmentation. The information retrieved from the knowledge bases is provided with citations to improve AI application transparency and minimize hallucinations.
Fine-tuning and pre-training – There are different techniques for improving model accuracy for specific context, like fine-tuning and continued pre-training. Instead of providing internal data through RAG, with these techniques, you add data straight to the model as part of its dataset. This way, you can customize several Amazon Bedrock FMs by pointing them to datasets that are saved in Amazon Simple Storage Service (Amazon S3) buckets. For fine-tuning, you can take anything between a few dozen and hundreds of labeled examples and train the model with them to improve performance on specific tasks. The model learns to associate certain types of outputs with certain types of inputs. You can also use continued pre-training, in which you provide the model with unlabeled data, familiarizing the model with certain inputs for it to associate and learn patterns. This includes, for example, data from a specific topic that the model doesn’t have enough domain knowledge of, thereby increasing the accuracy of the domain. Both of these customization options make it possible to create an accurate customized model without collecting large volumes of annotated data, resulting in reduced hallucination.
Inference parameters – You can also look into the inference parameters, which are values that you can adjust to modify the model response. There are multiple inference parameters that you can set, and they affect different capabilities of the model. For example, if you want the model to get creative with the responses or generate completely new information, such as in the context of storytelling, you can modify the temperature parameter. This affects how the model samples words across the probability distribution and selects words that are farther apart from each other in that space (see the sketch after this list for how inference parameters and a guardrail are passed at request time).
Contextual grounding – Lastly, you can use the contextual grounding check in Amazon Bedrock Guardrails. Amazon Bedrock Guardrails provides mechanisms within the Amazon Bedrock service that allow developers to set content filters and specify denied topics to control allowed text-based user inputs and model outputs. You can detect and filter hallucinations in model responses if they are not grounded (factually inaccurate or add new information) in the source information or are irrelevant to the user’s query. For example, you can block or flag responses in RAG applications if the model response deviates from the information in the retrieved passages or doesn’t answer the question by the user.
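
As a rough illustration of how inference parameters and a guardrail are applied at request time, the following sketch uses the Amazon Bedrock Converse API through boto3. The model ID, guardrail identifier and version, prompt, and parameter values are placeholders, not recommendations.

```python
import boto3

runtime = boto3.client("bedrock-runtime")   # region, credentials, and model access assumed

response = runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",          # placeholder model choice
    messages=[{"role": "user",
               "content": [{"text": "Summarize our return policy using only the "
                                    "provided context."}]}],
    inferenceConfig={"temperature": 0.2, "topP": 0.9, "maxTokens": 512},  # low temperature for factual answers
    guardrailConfig={"guardrailIdentifier": "your-guardrail-id",          # from create_guardrail
                     "guardrailVersion": "1"},
)
print(response["output"]["message"]["content"][0]["text"])
```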

Model providers and tuners might not mitigate these hallucinations, but can inform the user that they might occur. This could be done by adding some disclaimers about using AI applications at the user’s own risk. We currently also see advances in research in methods that estimate uncertainty based on the amount of variation (measured as entropy) between multiple outputs. These new methods have proved much better at spotting when a question was likely to be answered incorrectly than previous methods.
Explainability
The explainability dimension in responsible AI focuses on understanding and evaluating system outputs. By using an explainable AI framework, humans can examine the models to better understand how they produce their outputs. For the explainability of the output of a generative AI model, you can use techniques like training data attribution and CoT prompting, which we discussed under the veracity and robustness dimension.
For customers wanting to see attribution of information in completion, we recommend using RAG with an Amazon Bedrock knowledge base. Attribution works with RAG because the possible attribution sources are included in the prompt itself. Information retrieved from the knowledge base comes with source attribution to improve transparency and minimize hallucinations. Amazon Bedrock Knowledge Bases manages the end-to-end RAG workflow for you. When using the RetrieveAndGenerate API, the output includes the generated response, the source attribution, and the retrieved text chunks.
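For a sketch of how source attribution surfaces in a RAG response, the following assumes an existing Amazon Bedrock knowledge base and calls the RetrieveAndGenerate API through boto3; the knowledge base ID, model ARN, and query are placeholders.
```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")   # region and credentials assumed

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund window for online orders?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                        "anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])                 # generated answer
for citation in response["citations"]:            # source attribution per generated passage
    for ref in citation["retrievedReferences"]:
        print(ref["location"])                    # e.g., the S3 location of the cited chunk
```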
Security and privacy
If there is one thing that is absolutely critical to every organization using generative AI technologies, it is making sure everything you do is and remains private, and that your data is protected at all times. The security and privacy dimension in responsible AI focuses on making sure data and models are obtained, used, and protected appropriately.
Built-in security and privacy of Amazon Bedrock
With Amazon Bedrock, if we look from a data privacy and localization perspective, AWS does not store your data—if we don’t store it, it can’t leak, it can’t be seen by model vendors, and it can’t be used by AWS for any other purpose. The only data we store is operational metrics—for example, for accurate billing, AWS collects metrics on how many tokens you send to a specific Amazon Bedrock model and how many tokens you receive in a model output. And, of course, if you create a fine-tuned model, we need to store that in order for AWS to host it for you. Data used in your API requests remains in the AWS Region of your choosing—API requests to the Amazon Bedrock API to a specific Region will remain completely within that Region.
If we look at data security, a common adage is that if it moves, encrypt it. Communications to, from, and within Amazon Bedrock are encrypted in transit—Amazon Bedrock doesn’t have a non-TLS endpoint. Another adage is that if it doesn’t move, encrypt it. Your fine-tuning data and model will by default be encrypted using AWS managed AWS Key Management Service (AWS KMS) keys, but you have the option to use your own KMS keys.
When it comes to identity and access management, AWS Identity and Access Management (IAM) controls who is authorized to use Amazon Bedrock resources. For each model, you can explicitly allow or deny access to actions. For example, one team or account could be allowed to provision capacity for Amazon Titan Text, but not Anthropic models. You can be as broad or as granular as you need to be.
Looking at network data flows for Amazon Bedrock API access, it’s important to remember that traffic is encrypted at all times. If you’re using Amazon Virtual Private Cloud (Amazon VPC), you can use AWS PrivateLink to provide your VPC with private connectivity through the regional network direct to the frontend fleet of Amazon Bedrock, mitigating exposure of your VPC to internet traffic with an internet gateway. Similarly, from a corporate data center perspective, you can set up a VPN or AWS Direct Connect connection to privately connect to a VPC, and from there you can have that traffic sent to Amazon Bedrock over PrivateLink. This should negate the need for your on-premises systems to send Amazon Bedrock related traffic over the internet. Following AWS best practices, you secure PrivateLink endpoints using security groups and endpoint policies to control access to these endpoints following Zero Trust principles.
Let’s also look at network and data security for Amazon Bedrock model customization. The customization process will first load your requested baseline model, then securely read your customization training and validation data from an S3 bucket in your account. Connection to data can happen through a VPC using a gateway endpoint for Amazon S3. That means bucket policies that you have can still be applied, and you don’t have to open up wider access to that S3 bucket. A new model is built, which is then encrypted and delivered to the customized model bucket—at no time does a model vendor have access to or visibility of your training data or your customized model. At the end of the training job, we also deliver output metrics relating to the training job to an S3 bucket that you had specified in the original API request. As mentioned previously, both your training data and customized model can be encrypted using a customer managed KMS key.
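The following is a minimal sketch of launching such a customization job with boto3; every name, ARN, S3 URI, and hyperparameter value is a placeholder, and the supported hyperparameters depend on the base model you choose.
```python
import boto3

bedrock = boto3.client("bedrock")   # control-plane client; region and credentials assumed

# Fine-tuning job whose training data stays in your S3 bucket and whose output
# model is encrypted with a customer managed KMS key.
bedrock.create_model_customization_job(
    jobName="support-bot-finetune-001",
    customModelName="support-bot-v1",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://your-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://your-bucket/output/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
    customModelKmsKeyId="arn:aws:kms:us-east-1:111122223333:key/your-key-id",
)
```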
Best practices for privacy protection
The first thing to keep in mind when implementing a generative AI application is data encryption. As mentioned earlier, Amazon Bedrock uses encryption in transit and at rest. For encryption at rest, you have the option to choose your own customer managed KMS keys over the default AWS managed KMS keys. Depending on your company’s requirements, you might want to use a customer managed KMS key. For encryption in transit, we recommend using TLS 1.3 to connect to the Amazon Bedrock API.
For terms and conditions and data privacy, it’s important to read the model providers’ end-user license agreements (EULAs). Model providers are responsible for defining these terms and conditions, and you as a customer are responsible for evaluating them and deciding whether they’re appropriate for your application. Make sure you read, understand, and are comfortable with the terms before accepting them, including when you request model access in Amazon Bedrock, and make sure your test data has been approved by your legal team.
For privacy and copyright, it is the responsibility of the provider and the model tuner to make sure the data used for training and fine-tuning is legally available and can actually be used for that purpose. It is also the responsibility of the model provider to make sure the data they’re using is appropriate for their models. Publicly available data is not automatically licensed for commercial use, which means you can’t necessarily use it to fine-tune a model and expose the results to your customers.
To protect user privacy, you can use the sensitive information filters in Amazon Bedrock Guardrails, which we discussed under the safety and controllability dimensions.
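As an illustration, the following sketch creates a guardrail whose sensitive information filter anonymizes email addresses and blocks phone numbers; the guardrail name, messages, and exact configuration shape are assumptions to validate against the current create_guardrail API reference.

import boto3

bedrock = boto3.client("bedrock")

# Hypothetical guardrail configuration with a sensitive information filter
response = bedrock.create_guardrail(
    name="pii-protection-guardrail",
    description="Anonymize emails and block phone numbers in model interactions",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "BLOCK"}
        ]
    },
    blockedInputMessaging="Sorry, I can't process requests containing this information.",
    blockedOutputsMessaging="Sorry, the response contained sensitive information and was blocked."
)
print(response["guardrailId"])
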
Lastly, when automating with generative AI (for example, with Amazon Bedrock Agents), make sure you’re comfortable with the model making automated decisions, and consider the consequences of the application providing wrong information or taking wrong actions. Apply risk management accordingly.
Governance
The governance dimension makes sure AI systems are developed, deployed, and managed in a way that aligns with ethical standards, legal requirements, and societal values. Governance encompasses the frameworks, policies, and rules that direct AI development and use in a way that is safe, fair, and accountable. Setting and maintaining governance for AI allows stakeholders to make informed decisions around the use of AI applications. This includes transparency about how data is used, the decision-making processes of AI, and the potential impacts on users.
Robust governance is the foundation upon which responsible AI applications are built. AWS offers a range of services and tools that can empower you to establish and operationalize AI governance practices. AWS has also developed an AI governance framework that offers comprehensive guidance on best practices across vital areas such as data and model governance, AI application monitoring, auditing, and risk management.
When looking at auditability, Amazon Bedrock integrates with the AWS generative AI best practices framework v2 from AWS Audit Manager. With this framework, you can start auditing your generative AI usage within Amazon Bedrock by automating evidence collection. This provides a consistent approach for tracking AI model usage and permissions, flagging sensitive data, and alerting on issues. You can use collected evidence to assess your AI application across eight principles: responsibility, safety, fairness, sustainability, resilience, privacy, security, and accuracy.
For monitoring and auditing purposes, you can use Amazon Bedrock built-in integrations with Amazon CloudWatch and AWS CloudTrail. You can monitor Amazon Bedrock using CloudWatch, which collects raw data and processes it into readable, near real-time metrics. CloudWatch helps you track usage metrics such as model invocations and token count, and helps you build customized dashboards for audit purposes either across one or multiple FMs in one or multiple AWS accounts. CloudTrail is a centralized logging service that provides a record of user and API activities in Amazon Bedrock. CloudTrail collects API data into a trail, which needs to be created inside the service. A trail enables CloudTrail to deliver log files to an S3 bucket.
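For example, the following is a hedged sketch of pulling the hourly invocation count for a specific model from CloudWatch with boto3; the AWS/Bedrock namespace, Invocations metric, and ModelId dimension are assumptions to validate against the CloudWatch documentation for Amazon Bedrock.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumed namespace, metric, and dimension for Amazon Bedrock usage metrics
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-sonnet-20240229-v1:0"}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Sum"]
)
for datapoint in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
    print(datapoint["Timestamp"], datapoint["Sum"])
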
Amazon Bedrock also provides model invocation logging, which is used to collect model input data, prompts, model responses, and request IDs for all invocations in your AWS account used in Amazon Bedrock. This feature provides insights into how your models are being used and how they are performing, enabling you and your stakeholders to make data-driven and responsible decisions around the use of AI applications. Model invocation logging needs to be enabled, and you can decide whether to store the log data in an S3 bucket or in CloudWatch Logs.
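The following is a minimal sketch of enabling invocation logging to an S3 bucket with boto3; the bucket name is a placeholder, and the configuration shape should be confirmed in the current API reference.

import boto3

bedrock = boto3.client("bedrock")

# Hypothetical logging destination
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "s3Config": {
            "bucketName": "my-bedrock-invocation-logs",
            "keyPrefix": "bedrock/"
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False
    }
)
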
From a compliance perspective, Amazon Bedrock is in scope for common compliance standards, including ISO, SOC, FedRAMP moderate, PCI, ISMAP, and CSA STAR Level 2, and is Health Insurance Portability and Accountability Act (HIPAA) eligible. You can also use Amazon Bedrock in compliance with the General Data Protection Regulation (GDPR). Amazon Bedrock is included in the Cloud Infrastructure Service Providers in Europe Data Protection Code of Conduct (CISPE CODE) Public Register. This register provides independent verification that Amazon Bedrock can be used in compliance with the GDPR. For the most up-to-date information about whether Amazon Bedrock is within the scope of specific compliance programs, see AWS services in Scope by Compliance Program and choose the compliance program you’re interested in.
Implementing responsible AI in Amazon Bedrock applications
When building applications in Amazon Bedrock, consider your application context, needs, and the behaviors of your end users. Also look into your organization’s needs, legal and regulatory requirements, and the metrics you want or need to collect when implementing responsible AI, and take advantage of the managed and built-in features available. The following diagram outlines various measures you can implement to address the core dimensions of responsible AI. This is not an exhaustive list, but rather a proposition of how the measures mentioned in this post could be combined. These measures include:

Model evaluation – Use model evaluation to assess fairness, accuracy, toxicity, robustness, and other metrics to evaluate your chosen FM and its performance.
Amazon Bedrock Guardrails – Use Amazon Bedrock Guardrails to establish content filters, denied topics, word filters, sensitive information filters, and contextual grounding. With guardrails, you can guide model behavior by denying any unsafe or harmful topics or words and protect the safety of your end-users.
Prompt engineering – Utilize prompt engineering techniques, such as CoT, to improve explainability, veracity and robustness, and safety and controllability of your AI application. With prompt engineering, you can set a desired structure for the model response, including tone, scope, and length of responses. You can emphasize safety and controllability by adding denied topics to the prompt template.
Amazon Bedrock Knowledge Bases – Use Amazon Bedrock Knowledge Bases for end-to-end RAG implementation to decrease hallucinations and improve accuracy of the model for internal data use cases. Using RAG will improve veracity and robustness, safety and controllability, and explainability of your AI application.
Logging and monitoring – Maintain comprehensive logging and monitoring to enforce effective governance.

Diagram outlining the various measures you can implement to address the core dimensions of responsible AI.

Conclusion
Building responsible AI applications requires a deliberate and structured approach, iterative development, and continuous effort. Amazon Bedrock offers a robust suite of built-in capabilities that support the development and deployment of responsible AI applications. By providing customizable features and the ability to integrate your own datasets, Amazon Bedrock enables developers to tune AI solutions to their specific application contexts and align them with organizational requirements for responsible AI. This flexibility makes sure AI applications are not only effective, but also ethical and aligned with best practices for fairness, safety, transparency, and accountability.
Implementing AI by following the responsible AI dimensions is key for developing and using AI solutions transparently, and without bias. Responsible development of AI will also help with AI adoption across your organization and build reliability with end customers. The broader the use and impact of your application, the more important following the responsibility framework becomes. Therefore, consider and address the responsible use of AI early on in your AI journey and throughout its lifecycle.
To learn more about the responsible use of ML framework, refer to the following resources:

Responsible Use of ML
AWS generative AI best practices framework v2
Building Generative AI prompt chaining workflows with human in the loop
Foundation Model Evaluations Library

About the Authors
Laura Verghote is a senior solutions architect for public sector customers in EMEA. She works with customers to design and build solutions in the AWS Cloud, bridging the gap between complex business requirements and technical solutions. She joined AWS as a technical trainer and has wide experience delivering training content to developers, administrators, architects, and partners across EMEA.
Maria Lehtinen is a solutions architect for public sector customers in the Nordics. She works as a trusted cloud advisor to her customers, guiding them through cloud system development and implementation with strong emphasis on AI/ML workloads. She joined AWS through an early-career professional program and has previous work experience from cloud consultant position at one of AWS Advanced Consulting Partners.

From RAG to fabric: Lessons learned from building real-world RAGs at G …

In Part 1 of this series, we defined the Retrieval Augmented Generation (RAG) framework to augment large language models (LLMs) with a text-only knowledge base. We gave practical tips, based on hands-on experience with customer use cases, on how to improve text-only RAG solutions, from optimizing the retriever to mitigating and detecting hallucinations.
This post focuses on doing RAG on heterogeneous data formats. We first introduce routers and how they can help manage diverse data sources. We then give tips on how to handle tabular data and conclude with multimodal RAG, focusing specifically on solutions that handle both text and image data.
Overview of RAG use cases with heterogeneous data formats
After a first wave of text-only RAG, we saw an increase in customers wanting to use a variety of data for Q&A. The challenge here is to retrieve the relevant data source to answer the question and correctly extract information from that data source. Use cases we have worked on include:

Technical assistance for field engineers – We built a system that aggregates information about a company’s specific products and field expertise. This centralized system consolidates a wide range of data sources, including detailed reports, FAQs, and technical documents. The system integrates structured data, such as tables containing product properties and specifications, with unstructured text documents that provide in-depth product descriptions and usage guidelines. A chatbot enables field engineers to quickly access relevant information, troubleshoot issues more effectively, and share knowledge across the organization.
Oil and gas data analysis – Before beginning operations at a well, an oil and gas company will collect and process a diverse range of data to identify potential reservoirs, assess risks, and optimize drilling strategies. The data sources may include seismic surveys, well logs, core samples, geochemical analyses, and production histories, with some of it in industry-specific formats. Each category necessitates specialized generative AI-powered tools to generate insights. We built a chatbot that can answer questions across this complex data landscape, so that oil and gas companies can make faster and more informed decisions, improve exploration success rates, and decrease time to first oil.
Financial data analysis – The financial sector uses both unstructured and structured data for market analysis and decision-making. Unstructured data includes news articles, regulatory filings, and social media, providing qualitative insights. Structured data consists of stock prices, financial statements, and economic indicators. We built a RAG system that combines these diverse data types into a single knowledge base, allowing analysts to efficiently access and correlate information. This approach enables nuanced analysis by combining numerical trends with textual insights to identify opportunities, assess risks, and forecast market movements.
Industrial maintenance – We built a solution that combines maintenance logs, equipment manuals, and visual inspection data to optimize maintenance schedules and troubleshooting. This multimodal approach integrates written reports and procedures with images and diagrams of machinery, allowing maintenance technicians to quickly access both descriptive information and visual representations of equipment. For example, a technician could query the system about a specific machine part, receiving both textual maintenance history and annotated images showing wear patterns or common failure points, enhancing their ability to diagnose and resolve issues efficiently.
Ecommerce product search – We built several solutions to enhance the search capabilities on ecommerce websites to improve the shopping experience for customers. Traditional search engines rely mostly on text-based queries. By integrating multimodal (text and image) RAG, we aimed to create a more comprehensive search experience. The new system can handle both text and image inputs, allowing customers to upload photos of desired items and receive precise product matches.

Using a router to handle heterogeneous data sources
In RAG systems, a router is a component that directs incoming user queries to the appropriate processing pipeline based on the query’s nature and the required data type. This routing capability is crucial when dealing with heterogeneous data sources, because different data types often require distinct retrieval and processing strategies.
Consider a financial data analysis system. For a qualitative question like “What caused inflation in 2023?”, the router would direct the query to a text-based RAG that retrieves relevant documents and uses an LLM to generate an answer based on textual information. However, for a quantitative question such as “What was the average inflation in 2023?”, the router would direct the query to a different pipeline that fetches and analyzes the relevant dataset.
The router accomplishes this through intent detection, analyzing the query to determine the type of data and analysis required to answer it. In systems with heterogeneous data, this process makes sure each data type is processed appropriately, whether it’s unstructured text, structured tables, or multimodal content. For instance, analyzing large tables might require prompting the LLM to generate Python or SQL and running it, rather than passing the tabular data to the LLM. We give more details on that aspect later in this post.
In practice, the router module can be implemented with an initial LLM call. The following is an example prompt for a router, following the example of financial analysis with heterogeneous data. To avoid adding too much latency with the routing step, we recommend using a smaller model, such as Anthropic’s Claude Haiku on Amazon Bedrock.

router_template = """
You are a financial data assistant that can query different data sources
based on the user’s request. The available data sources are:

<data_sources>
<source>
<name>Stock Prices Database</name>
<description>Contains historical stock price data for publicly traded companies.</description>
</source>
<source>
<name>Analyst Notes Database</name>
<description>Knowledge base containing reports from analysts on their interpretation and analysis of economic events.</description>
</source>
<source>
<name>Economic Indicators Database</name>
<description>Holds macroeconomic data like GDP, inflation, unemployment rates, etc.</description>
</source>
<source>
<name>Regulatory Filings Database</name>
<description>Contains SEC filings, annual reports, and other regulatory documents for public companies.</description>
</source>
</data_sources>

<instructions>
When the user asks a query, analyze the intent and route it to the appropriate data source.
If the query is not related to any of the available data sources,
respond politely that you cannot assist with that request.
</instructions>

<example>
<query>What was the closing price of Amazon stock on January 1st, 2022?</query>
<data_source>Stock Prices Database</data_source>
<reason>The question is about a stock price.</reason>
</example>

<example>
<query>What caused inflation in 2021?</query>
<data_source>Analyst Notes Database</data_source>
<reason>This is asking for interpretation of an event, I will look in Analyst Notes.</reason>
</example>

<example>
<query>How has the US unemployment rate changed over the past 5 years?</query>
<data_source>Economic Indicators Database</data_source>
<reason>Unemployment rate is an Economic indicator.</reason>
</example>

<example>
<query>I need to see the latest 10-K filing for Amazon.</query>
<data_source>Regulatory Filings Database</data_source>
<reason>10-K filings are SEC documents, which are stored in the Regulatory Filings Database.</reason>
</example>

<example>
<query>What’s the best restaurant in town?</query>
<data_source>None</data_source>
<reason>Restaurant recommendations are not related to any data source.</reason>
</example>

Here is the user query
<query>
{user_query}
</query>

Output the data source in <data_source> tags and the explanation in <reason> tags.
"""

Prompting the LLM to explain the routing logic may help with accuracy, by forcing the LLM to “think” about its answer, and also for debugging purposes, to understand why a category might not be routed properly.
The prompt uses XML tags following Anthropic’s Claude best practices. Note that in this example prompt we used <data_source> tags but something similar such as <category> or <label> could also be used. Asking the LLM to also structure its response with XML tags allows us to parse out the category from the LLM answer, which can be done with the following code:

import re

# Parse out the data source from the XML tags in the LLM response
pattern = r"<data_source>(.*?)</data_source>"
data_source = re.findall(
    pattern, llm_response, re.DOTALL
)[0]

From a user’s perspective, if the LLM fails to provide the right routing category, the user can explicitly ask for the data source they want to use in the query. For instance, instead of saying “What caused inflation in 2023?”, the user could disambiguate by asking “What caused inflation in 2023 according to analysts?”, and instead of “What was the average inflation in 2023?”, the user could ask “What was the average inflation in 2023? Look at the indicators.”
Another option for a better user experience is to add an option to ask for clarifications in the router, if the LLM finds that the query is too ambiguous. We can add this as an additional “data source” in the router using the following code:

<source>
<name>Clarifications</name>
<description>If the query is too ambiguous, use this to ask the user for more
clarifications. Put your reply to the user in the reason tags</description>
</source>

We use an associated example:

<example>
<query>What can you tell me about Amazon stock?</query>
<data_source>Clarifications</data_source>
<reason>I’m not sure how to best answer your question,
do you want me to look into Stock Prices, Analyst Notes, Regulatory filings?</reason>
</example>

If in the LLM’s response, the data source is Clarifications, we can then directly return the content of the <reason> tags to the user for clarifications.
An alternative approach to routing is to use the native tool use capability (also known as function calling) available within the Bedrock Converse API. In this scenario, each category or data source would be defined as a ‘tool’ within the API, enabling the model to select and use these tools as needed. Refer to this documentation for a detailed example of tool use with the Bedrock Converse API.
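The following is a sketch of that alternative; the tool names and input schemas are simplified placeholders, and each data source from the router prompt becomes a tool definition whose selection by the model is the routing decision.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Each data source is declared as a tool; the tool the model picks is the route
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "stock_prices_database",
                "description": "Query historical stock price data for publicly traded companies.",
                "inputSchema": {"json": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}
            }
        },
        {
            "toolSpec": {
                "name": "analyst_notes_database",
                "description": "Search analyst reports interpreting economic events.",
                "inputSchema": {"json": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}
            }
        }
    ]
}

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "What caused inflation in 2021?"}]}],
    toolConfig=tool_config
)

# Inspect the response content for a toolUse block to find the chosen data source
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print("Routed to:", block["toolUse"]["name"])
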
Using LLM code generation abilities for RAG with structured data
Consider an oil and gas company analyzing a dataset of daily oil production. The analyst may ask questions such as “Show me all wells that produced oil on June 1st 2024,” “What well produced the most oil in June 2024?”, or “Plot the monthly oil production for well XZY for 2024.” Each question requires different treatment, with varying complexity. The first one involves filtering the dataset to return all wells with production data for that specific date. The second one requires computing the monthly production values from the daily data, then finding the maximum and returning the well ID. The third one requires computing the monthly average for well XYZ and then generating a plot.
LLMs don’t perform well at analyzing tabular data when it’s added directly to the prompt as raw text. A simple way to improve the LLM’s handling of tables is to add the table to the prompt in a more structured format, such as markdown or XML. However, this method only works if the question doesn’t require complex quantitative reasoning and the table is small enough. In other cases, we can’t reliably use an LLM to analyze tabular data, even when it’s provided in a structured format in the prompt.
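For small tables, one option is to serialize the dataframe as markdown before adding it to the prompt, as in the following sketch (pandas relies on the tabulate package for to_markdown; the column names are illustrative).

import pandas as pd

# Illustrative daily well production data
df = pd.DataFrame(
    {
        "well_id": ["XYZ-1", "XYZ-2"],
        "date": ["2024-06-01", "2024-06-01"],
        "oil_bbl": [120.5, 0.0],
    }
)

# Serialize the table as markdown so rows and columns stay aligned in the prompt
table_markdown = df.to_markdown(index=False)

prompt = f"""Here is the production table:

{table_markdown}

Question: Which wells produced oil on June 1st 2024?"""
print(prompt)
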
On the other hand, LLMs are notably good at code generation; for instance, Anthropic’s Claude Sonnet 3.5 has 92% accuracy on the HumanEval code benchmark. We can take advantage of that capability by asking the LLM to write Python (if the data is stored in a CSV, Excel, or Parquet file) or SQL (if the data is stored in a SQL database) code that performs the required analysis. Popular libraries Llama Index and LangChain both offer out-of-the-box solutions for text-to-SQL (Llama Index, LangChain) and text-to-Pandas (Llama Index, LangChain) pipelines for quick prototyping. However, for better control over prompts, code execution, and outputs, it might be worth writing your own pipeline. Out-of-the-box solutions will typically prompt the LLM to write Python or SQL code to answer the user’s question, then parse and run the code from the LLM’s response, and finally send the code output back to the LLM for a final answer.
Going back to the oil and gas data analysis use case, take the question “Show me all wells that produced oil on June 1st 2024.” There could be hundreds of entries in the dataframe. In that case, a custom pipeline that directly returns the code output to the UI (the filtered dataframe for the date of June 1st 2024, with oil production greater than 0) would be more efficient than sending it to the LLM for a final answer. If the filtered dataframe is large, the additional call might cause high latency and even risks causing hallucinations. Writing your own custom pipeline also allows you to perform sanity checks on the code, to verify, for instance, that the code generated by the LLM will not create issues (such as modifying existing files or databases).
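The following is a minimal sketch of such a sanity check, a simple denylist applied before running the generated code; in practice you would likely combine this with a sandboxed execution environment.

# Simple denylist of patterns that should not appear in generated analysis code
DISALLOWED_PATTERNS = [
    "import os", "import sys", "subprocess", "open(", "to_csv", "to_sql",
    "os.remove", "shutil", "eval(", "__import__",
]

def is_code_safe(code: str) -> bool:
    """Return False if the generated code contains obviously risky operations."""
    return not any(pattern in code for pattern in DISALLOWED_PATTERNS)

generated_code = "result = df[(df['date'] == '2024-06-01') & (df['oil_bbl'] > 0)]"
if is_code_safe(generated_code):
    print("Code passed the sanity check")
else:
    print("Code rejected")
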
The following is an example of a prompt that can be used to generate Pandas code for data analysis:

prompt_template = """
You are an AI assistant designed to answer questions from oil and gas analysts.
You have access to a Pandas dataframe df that contains daily production data for oil producing wells.

Here is a sample from df:
<df_sample>
{sample}
</df_sample>

Here is the analyst’s question:
<question>
{question}
</question>

<instructions>
 – Use <scratchpad> tags to think about what you are going to do.
 – Put your code in <code> tags.
 – The dataframes may contain nans, so make sure you account for those in your code.
 – In your code, the final variable should be named “result”.
</instructions>
"""

We can then parse the code out from the <code> tags in the LLM response and run it using exec in Python. The following code is a full example:

import json
import re

import boto3
import pandas as pd

# Import the CSV into a DataFrame (example file name with daily well production data)
df = pd.read_csv('well_production.csv')

# Create an Amazon Bedrock Runtime client
bedrock_client = boto3.client('bedrock-runtime')

# Define the prompt
user_query = "Show me all wells that produced oil on June 1st 2024"
prompt = prompt_template.format(sample=df.sample(5), question=user_query)

# Call Anthropic Claude Sonnet
request_body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    }
)
response = bedrock_client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=request_body
)
# Get the LLM's response
llm_response = json.loads(
    response['body'].read().decode('utf-8')
)['content'][0]['text']

# Extract code from the LLM response
code_pattern = r"<code>(.*?)</code>"
code_matches = re.findall(
    code_pattern, llm_response, re.DOTALL
)

# Use a dictionary to pass the dataframe to the exec environment
local_vars = {"df": df}
for match in code_matches:
    exec(
        match, local_vars
    )

# Variables created in the exec environment get stored in the local_vars dict
code_output = local_vars["result"]

# We can then return the code output directly, or send the code output
# to the LLM to get the final answer

# Call Anthropic Claude Sonnet with the code output
request_body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4000,
        "messages": [
            {
                "role": "user",
                "content": prompt
            },
            {
                "role": "assistant",
                "content": llm_response
            },
            {
                "role": "user",
                "content": f"This is the code output: {code_output}"
            }
        ]
    }
)
response = bedrock_client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=request_body
)

# Get the final LLM's response
final_llm_response = json.loads(
    response['body'].read().decode('utf-8')
)['content'][0]['text']
Because we explicitly prompt the LLM to store the final result in the result variable, we know it will be stored in the local_vars dictionary under that key, and we can retrieve it that way. We can then either directly return this result to the user, or send it back to the LLM to generate its final response. Sending the variable back to the user directly can be useful if the request requires filtering and returning a large dataframe, for instance. Directly returning the variable to the user removes the risk of hallucination that can occur with large inputs and outputs.
Multimodal RAG
An emerging trend in generative AI is multimodality, with models that can use text, images, audio, and video. In this post, we focus exclusively on mixing text and image data sources.
In an industrial maintenance use case, consider a technician facing an issue with a machine. To troubleshoot, they might need visual information about the machine, not just a textual guide.
In ecommerce, using multimodal RAG can enhance the shopping experience not only by allowing users to input images to find visually similar products, but also by providing more accurate and detailed product descriptions from visuals of the products.
We can categorize multimodal text and image RAG questions in three categories:

Image retrieval based on text input – For example:

“Show me a diagram to repair the compressor on the ice cream machine.”
“Show me red summer dresses with floral patterns.”

Text retrieval based on image input – For example:

A technician might take a picture of a specific part of the machine and ask, “Show me the manual section for this part.”

Image retrieval based on text and image input – For example:

A customer could upload an image of a dress and ask, “Show me similar dresses.” or “Show me items with a similar pattern.”

As with traditional RAG pipelines, the retrieval component is the basis of these solutions. Constructing a multimodal retriever requires having an embedding strategy that can handle this multimodality. There are two main options for this.
First, you could use a multimodal embedding model such as Amazon Titan Multimodal Embeddings, which can embed both images and text into a shared vector space. This allows for direct comparison and retrieval of text and images based on semantic similarity. This simple approach is effective for finding images that match a high-level description or for matching images of similar items. For instance, a query like “Show me summer dresses” would return a variety of images that fit that description. It’s also suitable for queries where the user uploads a picture and asks, “Show me dresses similar to that one.”
The following diagram shows the ingestion logic with a multimodal embedding. The images in the database are sent to a multimodal embedding model that returns vector representations of the images. The images and the corresponding vectors are paired up and stored in the vector database.

At retrieval time, the user query (which can be text or image) is passed to the multimodal embedding model, which returns a vectorized user query that is used by the retriever module to search for images that are close to the user query, in the embedding distance. The closest images are then returned.

Alternatively, you could use a multimodal foundation model (FM) such as Anthropic’s Claude 3 Haiku, Sonnet, or Opus, or Claude 3.5 Sonnet, all available on Amazon Bedrock, to generate a caption for each image, which is then used for retrieval. Specifically, the generated image description is embedded using a traditional text embedding model (for example, Amazon Titan Text Embeddings v2) and stored in a vector store along with the image as metadata.
Captions can capture finer details in images, and can be guided to focus on specific aspects such as color, fabric, pattern, shape, and more. This would be better suited for queries where the user uploads an image and looks for similar items but only in some aspects (such as uploading a picture of a dress, and asking for skirts in a similar style). This would also work better to capture the complexity of diagrams in industrial maintenance.
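As an illustration of guiding captions toward retrieval-relevant attributes, a captioning prompt along the following lines could be passed to the multimodal FM; the attribute list is an assumption you would tailor to your catalog or equipment.

# Prompt that guides the multimodal FM toward attributes useful for retrieval
caption_prompt = (
    "Describe this product image for a search index. "
    "Focus on color, fabric, pattern, shape, and style (for example, vintage or minimalist). "
    "Output only the description, in two to three sentences."
)

# This prompt can be used in place of the generic captioning instruction,
# for example with the call_multimodal_llm helper shown later in this post
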
The following figure shows the ingestion logic with a multimodal FM and text embedding. The images in the database are sent to a multimodal FM that returns image captions. The image captions are then sent to a text embedding model and converted to vectors. The images are paired up with the corresponding vectors and captions and stored in the vector database.

At retrieval time, the user query (text) is passed to the text embedding model, which returns a vectorized user query that is used by the retriever module to search for captions that are close to the user query, in the embedding distance. The images corresponding to the closest captions are then returned, optionally with the caption as well. If the user query contains an image, we need to use a multimodal LLM to describe that image similarly to the previous ingestion steps.

Example with a multimodal embedding model
The following is a code sample performing ingestion with Amazon Titan Multimodal Embeddings as described earlier. The embedded image is stored in an OpenSearch index with a k-nearest neighbors (k-NN) vector field.

# Helper functions (read_and_encode_image, get_embedding, get_open_search_client,
# create_opensearch_index) are assumed to be defined in utils
from utils import *

# Read and encode the image
file_name = 'image.png'
image_base64 = read_and_encode_image(file_name)

# Embed the image using Amazon Titan Multimodal Embeddings
multi_embedding_model = "amazon.titan-embed-image-v1"
image_embedding = get_embedding(input=image_base64, model=multi_embedding_model)

# Get OpenSearch client (assume this function is available)
open_search = get_open_search_client()

# Create index in OpenSearch for storing embeddings
create_opensearch_index(name='multimodal-image-index', client=open_search)

# Index the image and its embedding in OpenSearch
request = {
    "image": image_base64,
    "vector_field": image_embedding,
    "_op_type": "index",
    "source": file_name  # replace with a URL or S3 location if needed
}
result = open_search.index(index='multimodal-image-index', body=request)

The following is the code sample performing the retrieval with Amazon Titan Multimodal Embeddings:

# Use Amazon Titan Multimodal Embeddings to embed the user query
query_text = "Show me a diagram to repair the compressor on the ice cream machine."

# Embed the text query with the same multimodal embedding model
query_embedding = get_embedding(input=query_text, model=multi_embedding_model)

# Search for images that are close to that description in OpenSearch
search_query = {
    'query': {
        'bool': {
            'should': [
                {
                    'knn': {
                        'vector_field': {
                            'vector': query_embedding,
                            'k': 5
                        }
                    }
                }
            ]
        }
    }
}

response = open_search.search(index='multimodal-image-index', body=search_query)

In the response, we have the images that are closest to the user query in embedding space, thanks to the multimodal embedding.
Example with a multimodal FM
The following is a code sample performing the ingestion and retrieval described earlier. It uses Anthropic’s Claude 3 Sonnet to caption the image first, and then Amazon Titan Text Embeddings to embed the caption. You could also use another multimodal FM available on Amazon Bedrock, such as Anthropic’s Claude 3.5 Sonnet, Claude 3 Haiku, or Claude 3 Opus. The image, caption embedding, and caption are stored in an OpenSearch index. At retrieval time, we embed the user query using the same Amazon Titan Text Embeddings model and perform a k-NN search on the OpenSearch index to retrieve the relevant image.

# Read and encode the image
file_name = 'image.png'
image_base64 = read_and_encode_image(file_name)

# Use Anthropic Claude 3 Sonnet to caption the image
# (call_multimodal_llm is a helper assumed to wrap the Amazon Bedrock API call)
caption = call_multimodal_llm(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    text="Describe this image in detail. Only output the description, nothing else",
    image=image_base64
)

# Compute text embedding for the caption
text_embedding_model = "amazon.titan-embed-text-v2:0"
caption_embedding = get_embedding(input=caption, model=text_embedding_model)

# Create the index with a mapping that defines a knn vector field
# (the mapping definition is assumed to be created beforehand)
open_search.indices.create(index='image-caption-index', body=mapping)

# Index image in OpenSearch
open_search.index(
    index='image-caption-index',
    body={
        "image_base64": image_base64,
        "vector_field": caption_embedding,
        "caption": caption,
        "source": file_name
    }
)

The following is code to perform the retrieval step using text embeddings:

# Compute embedding for a natural language query with text embedding
user_query = "Show me a diagram to repair the compressor on the ice cream machine."
query_embedding = get_embedding(input=user_query, model=text_embedding_model)

# Search for images that match that query in OpenSearch
search_query = {
    'query': {
        'bool': {
            'should': [
                {
                    'knn': {
                        'vector_field': {
                            'vector': query_embedding,
                            'k': 5
                        }
                    }
                }
            ]
        }
    }
}

response = open_search.search(index='image-caption-index', body=search_query)

This returns the images whose captions are closest to the user query in the embedding space, thanks to the text embeddings. In the response, we get both the images and the corresponding captions for downstream use.
Comparative table of multimodal approaches
The following table provides a comparison between using multimodal embeddings and using a multimodal LLM for image captioning, across several key factors. Multimodal embeddings offer faster ingestion and are generally more cost-effective, making them suitable for large-scale applications where speed and efficiency are crucial. On the other hand, using a multimodal LLM for captions, though slower and less cost-effective, provides more detailed and customizable results, which is particularly useful for scenarios requiring precise image descriptions. Considerations such as latency for different input types, customization needs, and the level of detail required in the output should guide the decision-making process when selecting your approach.

Factor | Multimodal Embeddings | Multimodal LLM for Captions
Speed | Faster ingestion | Slower ingestion due to additional LLM call
Cost | More cost-effective | Less cost-effective
Detail | Basic comparison based on embeddings | Detailed captions highlighting specific features
Customization | Less customizable | Highly customizable with prompts
Text Input Latency | Same as multimodal LLM | Same as multimodal embeddings
Image Input Latency | Faster, no extra processing required | Slower, requires extra LLM call to generate image caption
Best Use Case | General use, quick and efficient data handling | Precise searches needing detailed image descriptions

Conclusion
Building real-world RAG systems with heterogeneous data formats presents unique challenges, but also unlocks powerful capabilities for enabling natural language interactions with complex data sources. By employing techniques like intent detection, code generation, and multimodal embeddings, you can create intelligent systems that can understand queries, retrieve relevant information from structured and unstructured data sources, and provide coherent responses. The key to success lies in breaking down the problem into modular components and using the strengths of FMs for each component. Intent detection helps route queries to the appropriate processing logic, and code generation enables quantitative reasoning and analysis on structured data sources. Multimodal embeddings and multimodal FMs enable you to bridge the gap between text and visual data, enabling seamless integration of images and other media into your knowledge bases.
Get started with FMs and embedding models in Amazon Bedrock to build RAG solutions that seamlessly integrate tabular, image, and text data for your organization’s unique needs.

About the Author
Aude Genevay is a Senior Applied Scientist at the Generative AI Innovation Center, where she helps customers tackle critical business challenges and create value using generative AI. She holds a PhD in theoretical machine learning and enjoys turning cutting-edge research into real-world solutions.

Cohere Embed multimodal embeddings model is now available on Amazon Sa …

The Cohere Embed multimodal embeddings model is now generally available on Amazon SageMaker JumpStart. This model is the newest Cohere Embed 3 model, which is now multimodal and capable of generating embeddings from both text and images, enabling enterprises to unlock real value from their vast amounts of data that exist in image form.
In this post, we discuss the benefits and capabilities of this new model with some examples.
Overview of multimodal embeddings and multimodal RAG architectures
Multimodal embeddings are mathematical representations that integrate information not only from text but from multiple data modalities—such as product images, graphs, and charts—into a unified vector space. This integration allows for seamless interaction and comparison between different types of data. As foundational models (FMs) advance, they increasingly require the ability to interpret and generate content across various modalities to better mimic human understanding and communication. This trend toward multimodality enhances the capabilities of AI systems in tasks like cross-modal retrieval, where a query in one modality (such as text) retrieves data in another modality (such as images or design files).
Multimodal embeddings can enable personalized recommendations by understanding user preferences and matching them with the most relevant assets. For instance, in ecommerce, product images are a critical factor influencing purchase decisions. Multimodal embeddings models can enhance personalization through visual similarity search, where users can upload an image or select a product they like, and the system finds visually similar items. In the case of retail and fashion, multimodal embeddings can capture stylistic elements, enabling the search system to recommend products that fit a particular aesthetic, such as “vintage,” “bohemian,” or “minimalist.”
Multimodal Retrieval Augmented Generation (MM-RAG) is emerging as a powerful evolution of traditional RAG systems, addressing limitations and expanding capabilities across diverse data types. Traditionally, RAG systems were text-centric, retrieving information from large text databases to provide relevant context for language models. However, as data becomes increasingly multimodal in nature, extending these systems to handle various data types is crucial to provide more comprehensive and contextually rich responses. MM-RAG systems that use multimodal embeddings models to encode both text and images into a shared vector space can simplify retrieval across modalities. MM-RAG systems can also enable enhanced customer service AI agents that can handle queries that involve both text and images, such as product defects or technical issues.
Cohere Multimodal Embed 3: Powering enterprise search across text and images
Cohere’s embeddings model, Embed 3, is an industry-leading AI search model that is designed to transform semantic search and generative AI applications. Cohere Embed 3 is now multimodal and capable of generating embeddings from both text and images. This enables enterprises to unlock real value from their vast amounts of data that exist in image form. Businesses can now build systems that accurately search important multimodal assets such as complex reports, ecommerce product catalogs, and design files to boost workforce productivity.
Cohere Embed 3 translates input data into long strings of numbers that represent the meaning of the data. These numerical representations are then compared to each other to determine similarities and differences. Cohere Embed 3 places both text and image embeddings in the same space for an integrated experience.
The following figure illustrates an example of this workflow. This figure is simplified for illustrative purposes. In practice, the numerical representations of data (seen in the output column) are far longer and the vector space that stores them has a higher number of dimensions.

This similarity comparison enables applications to retrieve enterprise data that is relevant to an end-user query. In addition to being a fundamental component of semantic search systems, Cohere Embed 3 is useful in RAG systems because it makes generative models like the Command R series have the most relevant context to inform their responses.
All businesses, across industry and size, can benefit from multimodal AI search. Specifically, customers are interested in the following real-world use cases:

Graphs and charts – Visual representations are key to understanding complex data. You can now effortlessly find the right diagrams to inform your business decisions. Simply describe a specific insight and Cohere Embed 3 will retrieve relevant graphs and charts, making data-driven decision-making more efficient for employees across teams.
Ecommerce product catalogs – Traditional search methods often limit you to finding products through text-based product descriptions. Cohere Embed 3 transforms this search experience. Retailers can build applications that surface products that visually match a shopper’s preferences, creating a differentiated shopping experience and improving conversion rates.
Design files and templates – Designers often work with vast libraries of assets, relying on memory or rigorous naming conventions to organize visuals. Cohere Embed 3 makes it simple to locate specific UI mockups, visual templates, and presentation slides based on a text description. This streamlines the creative process.

The following figure illustrates some examples of these use cases.

At a time when businesses are increasingly expected to use their data to drive outcomes, Cohere Embed 3 offers several advantages that accelerate productivity and improve customer experience.
The following chart compares Cohere Embed 3 with another embeddings model. All text-to-image benchmarks are evaluated using Recall@5; text-to-text benchmarks are evaluated using NDCG@10. Text-to-text benchmark accuracy is based on BEIR, a dataset focused on out-of-domain retrieval (14 datasets). Generic text-to-image benchmark accuracy is based on Flickr and CoCo. Graphs and charts benchmark accuracy is based on business reports and presentations constructed internally. Ecommerce benchmark accuracy is based on a mix of product catalog and fashion catalog datasets. Design files benchmark accuracy is based on a product design retrieval dataset constructed internally.

BEIR (Benchmarking IR) is a heterogeneous benchmark that uses a diverse collection of datasets and tasks designed for evaluating information retrieval (IR) models. It provides a common framework for assessing the performance of natural language processing (NLP)-based retrieval models, making it straightforward to compare different approaches. Recall@5 is a specific metric used in information retrieval evaluation, including in the BEIR benchmark. Recall@5 measures the proportion of relevant items retrieved within the top five results, compared to the total number of relevant items in the dataset.
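To make the metric concrete, the following is a small sketch of computing Recall@5 for a single query, given the ranked list of retrieved document IDs and the set of relevant IDs.

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant items that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(top_k & relevant) / len(relevant)

# Example: 2 of the 3 relevant documents appear in the top 5, so Recall@5 is about 0.67
print(recall_at_k(["d1", "d7", "d3", "d9", "d2", "d5"], ["d3", "d2", "d8"]))
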
Cohere’s latest Embed 3 model’s text and image encoders share a unified latent space. This approach has a few important benefits. First, it enables you to include both image and text features in a single database and therefore reduces complexity. Second, it means current customers can begin embedding images without re-indexing their existing text corpus. In addition to leading accuracy and ease of use, Embed 3 continues to deliver the same useful enterprise search capabilities as before. It can output compressed embeddings to save on database costs, it’s compatible with over 100 languages for multilingual search, and it maintains strong performance on noisy real-world data.
Solution overview
SageMaker JumpStart offers access to a broad selection of publicly available FMs. These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.
Amazon SageMaker is a comprehensive, fully managed machine learning (ML) platform that revolutionizes the entire ML workflow. It offers an unparalleled suite of tools that cater to every stage of the ML lifecycle, from data preparation to model deployment and monitoring. Data scientists and developers can use the SageMaker integrated development environment (IDE) to access a vast array of pre-built algorithms, customize their own models, and seamlessly scale their solutions. The platform’s strength lies in its ability to abstract away the complexities of infrastructure management, allowing you to focus on innovation rather than operational overhead.
You can access the Cohere Embed family of models using SageMaker JumpStart in Amazon SageMaker Studio.
For those new to SageMaker JumpStart, we walk through using SageMaker Studio to access models in SageMaker JumpStart.
Prerequisites
Make sure you meet the following prerequisites:

Make sure your SageMaker AWS Identity and Access Management (IAM) role has the AmazonSageMakerFullAccess permission policy attached.
To deploy Cohere multimodal embeddings successfully, confirm the following:

Your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:

aws-marketplace:ViewSubscriptions
aws-marketplace:Unsubscribe
aws-marketplace:Subscribe

Alternatively, confirm your AWS account has a subscription to the model. If so, skip to the next section in this post.

Deployment starts when you choose the Deploy option. You may be prompted to subscribe to this model through AWS Marketplace. If you’re already subscribed, then you can proceed and choose Deploy. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

Subscribe to the model package
To subscribe to the model package, complete the following steps:

Depending on the model you want to deploy, open the model package listing page for it.
On the AWS Marketplace listing, choose Continue to subscribe.
On the Subscribe to this software page, choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
Choose Continue to configuration and then choose an AWS Region.

You will see a product ARN displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

Subscribe to the Cohere embeddings model package on AWS Marketplace.
Choose the appropriate model package ARN for your Region. For example, the ARN for Cohere Embed Model v3 – English is: arn:aws:sagemaker:[REGION]:[ACCOUNT_ID]:model-package/cohere-embed-english-v3-7-6d097a095fdd314d90a8400a620cac54

Deploy the model using the SDK
To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

from cohere_aws import Client
import boto3
region = boto3.Session().region_name
model_package_arn = "Specify the model package ARN here"

Use the cohere_aws SDK to create a client and deploy the model:

co = Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-embed-english-v3", instance_type="ml.g5.xlarge", n_instances=1)

If the endpoint is already created using SageMaker Studio, you can simply connect to it:

co.connect_to_endpoint(endpoint_name="cohere-embed-english-v3")

Consider the following best practices:

Choose an appropriate instance type based on your performance and cost requirements. This example uses ml.g5.xlarge, but you might need to adjust this based on your specific needs.
Make sure your IAM role has the necessary permissions, including AmazonSageMakerFullAccess.
Monitor your endpoint’s performance and costs using Amazon CloudWatch.

Inference example with Cohere Embed 3 using the SageMaker SDK
The following code example illustrates how to perform real-time inference using Cohere Embed 3. We walk through a sample notebook to get started. You can also find the source code on the accompanying GitHub repo.
Pre-setup
Import all required packages using the following code:

import base64
import json
import mimetypes
import os

import boto3
import numpy as np
import requests
import tqdm
import tqdm.auto
from IPython.display import Image, display

# SageMaker Runtime client and the endpoint created earlier
# (adjust the endpoint name if you used a different one), used by the functions below
sagemaker_runtime = boto3.client("sagemaker-runtime")
endpoint_name = "cohere-embed-english-v3"

Create helper functions
Use the following code to create helper functions that determine whether the input document is text or image, and download images given a list of URLs:

def is_image(doc):
    return (doc.endswith(".jpg") or doc.endswith(".png")) and os.path.exists(doc)

def is_txt(doc):
    return doc.endswith(".txt") and os.path.exists(doc)

def download_images(image_urls):
    image_names = []

    # Download some example images we want to embed
    for url in image_urls:
        image_name = os.path.basename(url)
        image_names.append(image_name)

        if not os.path.exists(image_name):
            with open(image_name, "wb") as fOut:
                fOut.write(requests.get(url, stream=True).content)

    return image_names

Generate embeddings for text and image inputs
The following code shows a compute_embeddings() function we defined that will accept multimodal inputs to generate embeddings with Cohere Embed 3:

def compute_embeddings(docs):
    # Compute the embeddings
    embeddings = []
    for doc in tqdm.auto.tqdm(docs, desc="encoding"):
        if is_image(doc):
            print("Encode image:", doc)
            # Doc is an image, encode it as an image

            # Convert the image to base64
            with open(doc, "rb") as fIn:
                img_base64 = base64.b64encode(fIn.read()).decode("utf-8")

            # Get the mime type for the image
            mime_type = mimetypes.guess_type(doc)[0]

            payload = {
                "model": "embed-english-v3.0",
                "input_type": "image",
                "embedding_types": ["float"],
                "images": [f"data:{mime_type};base64,{img_base64}"]
            }

            response = sagemaker_runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="application/json",
                Body=json.dumps(payload)
            )

            response = json.loads(response["Body"].read().decode("utf-8"))
            response = response["embeddings"]["float"][0]
        elif is_txt(doc):
            # Doc is a text file, encode its content as a document
            with open(doc, "r") as fIn:
                text = fIn.read()

            print("Encode img desc:", doc, " - Content:", text[0:100] + "...")

            payload = {
                "texts": [text],
                "model": "embed-english-v3.0",
                "input_type": "search_document",
            }

            response = sagemaker_runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="application/json",
                Body=json.dumps(payload)
            )
            response = json.loads(response["Body"].read().decode("utf-8"))
            response = response["embeddings"][0]
        else:
            # Encode the string itself as a document
            payload = {
                "texts": [doc],
                "model": "embed-english-v3.0",
                "input_type": "search_document",
            }

            response = sagemaker_runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="application/json",
                Body=json.dumps(payload)
            )
            response = json.loads(response["Body"].read().decode("utf-8"))
            response = response["embeddings"][0]
        embeddings.append(response)
    return np.asarray(embeddings, dtype="float")

Find the most relevant embedding based on query
The search() function generates the query embedding and computes cosine similarities between the query and the document embeddings:

def search(query, embeddings, docs):
    # Get the query embedding
    payload = {
        "texts": [query],
        "model": "embed-english-v3.0",
        # Cohere recommends the search_query input type for queries
        # (search_document is used when indexing documents)
        "input_type": "search_query",
    }

    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload)
    )
    query_emb = json.loads(response["Body"].read().decode("utf-8"))
    query_emb = query_emb["embeddings"][0]

    # Compute L2 norms of the query vector and the matrix rows
    vector_norm = np.linalg.norm(query_emb)
    matrix_norms = np.linalg.norm(embeddings, axis=1)

    # Compute the dot product between the vector and each row of the matrix
    dot_products = np.dot(embeddings, query_emb)

    # Compute cosine similarities
    similarity = dot_products / (matrix_norms * vector_norm)

    # Sort decreasing, most to least similar
    top_hits = np.argsort(-similarity)

    print("Query:", query, "\n")
    print("Search results:")
    for rank, idx in enumerate(top_hits):
        print(f"#{rank+1}: ({similarity[idx]*100:.2f})")
        if is_image(docs[idx]):
            print(docs[idx])
            display(Image(filename=docs[idx], height=300))
        elif is_txt(docs[idx]):
            print(docs[idx] + " - Image description:")
            with open(docs[idx], "r") as fIn:
                print(fIn.read())
        else:
            print(docs[idx])
        print("--------")

Test the solution
Let’s assemble all the input documents; notice that there are both text and image inputs:

# Download images
image_urls = [
    "https://images-na.ssl-images-amazon.com/images/I/31KqpOznU1L.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/41RI4qgJLrL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/61NbJr9jthL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/31TW1NCtMZL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/51a6iOTpnwL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/31sa-c%2BfmpL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/41sKETcJYcL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/416GZ2RZEPL.jpg"
]
image_names = download_images(image_urls)

text_docs = [
    "Toy with 10 activities including a storybook, clock, gears; 13 double-sided alphabet blocks build fine motor skills and introduce letters, numbers, colors, and more.",
    "This is the perfect introduction to the world of scooters.",
    "2 -IN-1 RIDE-ON TOY- This convertible scooter is designed to grow with your child.",
    "Playful elephant toy makes real elephant sounds and fun music to inspire imaginative play."
]

docs = image_names + text_docs
print("Total docs:", len(docs))
print(docs)

Generate embeddings for the documents:

embeddings = compute_embeddings(docs)
print("Doc embeddings shape:", embeddings.shape)

The output is a matrix of 12 items (8 images and 4 text documents), each with 1,024 embedding dimensions.
Search for the most relevant documents given the query “Fun animal toy”

search("Fun animal toy", embeddings, docs)

The following output shows the search results.

Query: Fun animal toy

Search results:
#1: (54.28)
Playful elephant toy makes real elephant sounds and fun music to inspire imaginative play.
--------
#2: (52.48)
31TW1NCtMZL.jpg

--------
#3: (51.83)
31sa-c%2BfmpL.jpg

--------
#4: (50.33)
51a6iOTpnwL.jpg

--------
#5: (47.81)
31KqpOznU1L.jpg

--------
#6: (44.70)
61NbJr9jthL.jpg

#7: (44.36)
416GZ2RZEPL.jpg

--------
#8: (43.55)
41RI4qgJLrL.jpg

--------
#9: (41.40)
41sKETcJYcL.jpg

--------
#10: (37.69)
Learning toy with 10 activities including a storybook, clock, gears; 13 double-sided alphabet blocks build fine motor skills and introduce letters, numbers, colors, and more.
--------
#11: (35.50)
This is the perfect introduction to the world of scooters.
--------
#12: (33.14)
2 -IN-1 RIDE-ON TOY- This convertible scooter is designed to grow with your child.
--------

Try another query “Learning toy for a 6 year old”.

Query: Learning toy for a 6 year old

Search results:
#1: (47.59)
Playful elephant toy makes real elephant sounds and fun music to inspire imaginative play.
--------
#2: (41.86)
61NbJr9jthL.jpg

--------
#3: (41.66)
2 -IN-1 RIDE-ON TOY- This convertible scooter is designed to grow with your child.
--------
#4: (41.62)
Toy with 10 activities including a storybook, clock, gears; 13 double-sided alphabet blocks build fine motor skills and introduce letters, numbers, colors, and more.
--------
#5: (41.25)
This is the perfect introduction to the world of scooters.
--------
#6: (40.94)
31sa-c%2BfmpL.jpg

--------
#7: (40.11)
416GZ2RZEPL.jpg

--------
#8: (40.10)
41sKETcJYcL.jpg

--------
#9: (38.64)
41RI4qgJLrL.jpg

--------
#10: (36.47)
31KqpOznU1L.jpg

--------
#11: (35.27)
31TW1NCtMZL.jpg

--------
#12: (34.76)
51a6iOTpnwL.jpg
--------

As the results show, both images and text documents are returned based on the user's query, demonstrating the multimodal embedding capability of the new version of Cohere Embed 3.
Clean up
To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints using the following code snippets:

# Delete the endpoint
sagemaker.delete_endpoint(EndpointName='Endpoint-Cohere-Embed-Model-v3-English-1')
sagemaker.close()

Alternatively, to use the SageMaker console, complete the following steps:

On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
Search for the embedding and text generation endpoints.
On the endpoint details page, choose Delete.
Choose Delete again to confirm.

Conclusion
Cohere Embed 3 for multimodal embeddings is now available with SageMaker and SageMaker JumpStart. To get started, refer to SageMaker JumpStart pretrained models.
Interested in diving deeper? Check out the Cohere on AWS GitHub repo.

About the Authors
Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science (HCLS) customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing model adoption. Breanne is also on the Women@Amazon board as co-director of Allyship with the goal of fostering an inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from the University of Illinois at Urbana-Champaign.
Karan Singh is a Generative AI Specialist for third-party models at AWS, where he works with top-tier third-party foundation model (FM) providers to develop and execute joint go-to-market strategies, enabling customers to effectively train, deploy, and scale FMs to solve industry-specific challenges. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University, a Master of Science in Electrical Engineering from Northwestern University, and is currently an MBA candidate at the Haas School of Business at the University of California, Berkeley.
Yang Yang is an Independent Software Vendor (ISV) Solutions Architect at Amazon Web Services based in Seattle, where he supports customers in the financial services industry. Yang focuses on developing generative AI solutions to solve business and technical challenges and help drive faster time-to-market for ISV customers. Yang holds a Bachelor’s and Master’s degree in Computer Science from Texas A&M University.
Malhar Mane is an Enterprise Solutions Architect at AWS based in Seattle. He supports enterprise customers in the Digital Native Business (DNB) segment and specializes in generative AI and storage. Malhar is passionate about helping customers adopt generative AI to optimize their business. Malhar holds a Bachelor’s in Computer Science from University of California, Irvine.

Microsoft Released LLM2CLIP: A New AI Technique in which a LLM Acts as …

In today's world, CLIP is one of the most important multimodal foundational models. It combines visual and textual signals into a shared feature space using a simple contrastive learning loss on large-scale image-text pairs. As a retriever, CLIP supports many tasks, including zero-shot classification, detection, segmentation, and image-text retrieval. Also, as a feature extractor, it has become dominant in virtually all cross-modal representation tasks, such as image understanding, video understanding, and text-to-image/video generation. Its strength mainly comes from its ability to connect images with natural language and capture human knowledge, because it is trained on large-scale web data with detailed text descriptions, unlike vision encoders trained on images alone. As large language models (LLMs) develop rapidly, the boundaries of language comprehension and generation are continually being pushed. LLMs' strong text skills can help CLIP better handle long, complex captions, a weakness of the original CLIP, and their broad knowledge of large text corpora can make training more effective. However, although LLMs have strong understanding skills, their generative training objective keeps these abilities hidden: the output features of an LLM are not naturally discriminative, which makes them hard to use directly as text encoders.

Current developments have extended CLIP to handle other modalities, and its influence in the field is growing. New models like Llama3 have been used to extend CLIP's caption length and improve its performance by leveraging the open-world knowledge of LLMs. However, incorporating LLMs into CLIP is not straightforward: in multiple experiments, directly swapping an LLM in as the text encoder led to reduced performance. Several challenges therefore need to be overcome to unlock the potential benefits of bringing LLMs into CLIP.

Researchers at Tongji University and Microsoft conducted detailed research and proposed the LLM2CLIP approach for enhancing visual representation learning by integrating large language models (LLMs). The method takes a straightforward but bold step: it replaces the original CLIP text encoder with an LLM so that the extensive knowledge of the LLM can enhance the CLIP visual encoder. The work identifies the key obstacles associated with this idea and proposes a cost-effective fine-tuning strategy to overcome them.

The LLM2CLIP method effectively improved the CLIP model by integrating large language models (LLMs) like Llama. Initially, LLMs struggled as text encoders for CLIP due to their inability to clearly distinguish image captions. Researchers introduced the caption contrastive fine-tuning technique to address this, greatly improving the LLM’s ability to separate captions. This fine-tuning led to a substantial performance boost, surpassing existing state-of-the-art models. The LLM2CLIP framework combined the improved LLM with the pretrained CLIP visual encoder, creating a powerful cross-modal model. The method used large LLMs but remained computationally efficient with minimal added costs.
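To make the idea concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a caption contrastive objective: embeddings of two captions that describe the same image are pulled together, while the other captions in the batch act as negatives.

import numpy as np

def caption_contrastive_loss(emb_a, emb_b, temperature=0.05):
    """emb_a, emb_b: (batch, dim) embeddings of paired captions for the same images."""
    # L2-normalize so dot products become cosine similarities
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)

    logits = emb_a @ emb_b.T / temperature  # (batch, batch) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching captions sit on the diagonal; minimize their negative log-likelihood
    return -np.mean(np.diag(log_probs))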

The experiments mainly focused on fine-tuning models for better image-text matching using datasets like CC-3M. For LLM2CLIP fine-tuning, three dataset sizes were tested: small (CC-3M), medium (CC-3M and CC-12M), and large (CC-3M, CC-12M, YFCC-15M, and Recaption-1B). Training with augmented captions improved performance, while using an untrained language model for CLIP worsened it. Models trained with LLM2CLIP outperformed standard CLIP and EVA in tasks like image-to-text and text-to-image retrieval, highlighting the advantage of integrating large language models with image-text models. 

The method directly boosted the performance of the previous SOTA EVA02 model by 16.5% on both long-text and short-text retrieval tasks, transforming a CLIP model trained solely on English data into a state-of-the-art cross-lingual model. After integrating multimodal training with models like Llava 1.5, it performed better than CLIP on almost all benchmarks, showing significant overall improvements in performance.

In conclusion, the proposed method allows LLMs to assist in CLIP training. By adjusting properties such as data distribution, caption length, or categories, the LLM can be used to compensate for CLIP's limitations, acting as a more comprehensive teacher for various tasks. In the proposed work, the LLM gradients were frozen during fine-tuning to maintain a large batch size for CLIP training. In future work, LLM2CLIP could be trained from scratch on datasets like Laion-2B and Recaption-1B for better results and performance. This work can serve as a baseline for future research on CLIP training and its wide range of applications.

Check out the Paper, Code, and Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post Microsoft Released LLM2CLIP: A New AI Technique in which a LLM Acts as a Teacher for CLIP’s Visual Encoder appeared first on MarkTechPost.

This Machine Learning Paper Transforms Embodied AI Efficiency: New Sca …

Embodied artificial intelligence (AI) involves creating agents that function within physical or simulated environments, executing tasks autonomously based on pre-defined objectives. Often used in robotics and complex simulations, these agents leverage extensive datasets and sophisticated models to optimize behavior and decision-making. In contrast to more straightforward applications, embodied AI requires models capable of managing vast amounts of sensorimotor data and complex interactive dynamics. As such, the field has increasingly prioritized “scaling,” a process that adjusts model size, dataset volume, and computational power to achieve efficient and effective agent performance across diverse tasks.

The challenge with scaling embodied AI models lies in striking a balance between model size and dataset volume, a process necessary to ensure that these agents can operate optimally within constraints on computational resources. Different from language models, where scaling is well-established, the precise interplay of factors like dataset size, model parameters, and computation costs in embodied AI still needs to be explored. This lack of clarity limits researchers’ ability to construct large-scale models effectively, as it remains unclear how to distribute resources for tasks requiring behavioral and environmental adaptation optimally. For instance, while increasing model size improves performance, doing so without a proportional increase in data can lead to inefficiencies or even diminished returns, especially in tasks like behavior cloning and world modeling.

Language models have developed robust scaling laws that outline relationships between model size, data, and compute requirements. These laws enable researchers to make educated predictions about the necessary configurations for effective model training. However, embodied AI has not fully adopted these principles, partly because of the varied nature of its tasks. In response, researchers have been working on transferring scaling insights from language models to embodied AI, particularly by pre-training agents on large offline datasets that capture diverse environmental and behavioral data. The aim is to establish laws that help embodied agents achieve high performance in decision-making and interaction with their surroundings.

Researchers at Microsoft Research have recently developed scaling laws specifically for embodied AI, introducing a methodology that evaluates how changes in model parameters, dataset size, and computational limits impact the learning efficiency of AI agents. The team’s work focused on two major tasks within embodied AI: behavior cloning, where agents learn to replicate observed actions, and world modeling, where agents predict environmental changes based on prior actions and observations. They used transformer-based architectures, testing their models under various configurations to understand how tokenization strategies and model compression rates affect overall efficiency and accuracy. By systematically adjusting the number of parameters and tokens, the researchers observed distinct scaling patterns that could improve model performance and compute efficiency.

The methodology involved training transformers with different tokenization approaches to balance model and dataset sizes. For instance, the team implemented tokenized and CNN-based architectures in behavior cloning, allowing the model to operate under a continuous embedding framework rather than discrete tokens, reducing computational demands significantly. The study found that for world modeling, scaling laws demonstrated that an increase in token count per observation affected model sizing, with the optimal model size coefficient increasing from 0.49 to 0.62 as the tokens rose from 256 to 540 per image. However, for behavior cloning with tokenized observations, optimal model size coefficients were skewed towards larger datasets with smaller models, displaying a need for greater data volume rather than expanded parameters, an opposite trend to that seen in world modeling.

The study presented remarkable findings on how scaling principles from language models could be applied effectively to embodied AI. The optimal trade-off occurred for world modeling when both model and dataset size increased proportionally, matching findings in LLM scaling literature. Specifically, with a 256-token configuration, an optimal balance was achieved by scaling both model and dataset in similar proportions. In contrast, in the 540-token configuration, the emphasis shifted toward larger models, making size adjustments highly dependent on the compression rate of the tokenized observations.

Key results highlighted that model architecture influences the scaling balance, particularly for behavior cloning. In tasks where agents used tokenized observations, model coefficients indicated a preference for extensive data over larger model sizes, with an optimal size coefficient of 0.32 against a dataset coefficient of 0.68. In comparison, behavior cloning tasks based on CNN architectures favored increased model size, with an optimal size coefficient of 0.66. This demonstrated that embodied AI could achieve efficient scaling under specific conditions by tailoring model and dataset proportions based on task requirements.
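As a rough illustration of how such coefficients can be used (a sketch under the common power-law assumption that optimal model size and dataset size each scale as the compute budget raised to their respective coefficient; the constant factors are set to 1 and the CNN data coefficient is assumed to be 1 - 0.66, neither taken from the paper):

def optimal_allocation(compute_budget, size_coeff, data_coeff):
    """Split a compute budget into model size and dataset size, up to constant factors."""
    n_opt = compute_budget ** size_coeff
    d_opt = compute_budget ** data_coeff
    return n_opt, d_opt

# Tokenized behavior cloning reportedly favors data over parameters (0.32 vs. 0.68),
# while CNN-based behavior cloning favors larger models (0.66; data coefficient assumed here).
print(optimal_allocation(1e20, 0.32, 0.68))
print(optimal_allocation(1e20, 0.66, 0.34))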

In testing the accuracy of the derived scaling laws, the research team trained a world-modeling agent with a model size of 894 million parameters, significantly larger than those used in prior scaling analyses. The study found a strong alignment between predictions and actual results, with the loss value closely matching computed optimal loss levels even under substantially increased compute budgets. This validation step underscored the scaling laws’ reliability, suggesting that with appropriate hyperparameter tuning, scaling laws can predict model performance effectively in complex simulations and real-world scenarios.

Key Takeaways from the Research:

Balanced Scaling for World Modeling: For optimal performance in world modeling, both model and dataset sizes must increase proportionally.

Behavior Cloning Optimization: Optimal configurations for behavior cloning favor smaller models paired with extensive datasets when tokenized observations are used. An increase in model size is preferred for CNN-based cloning tasks.

Compression Rate Impact: Higher token compression rates skew scaling laws toward larger models in world modeling, indicating that tokenized data substantially affects optimal model sizes.

Extrapolation Validation: Testing with larger models confirmed the scaling laws’ predictability, supporting these laws as a basis for efficient model sizing in embodied AI.

Distinct Task Requirements: Scaling requirements vary significantly between behavior cloning and world modeling, highlighting the importance of customized scaling approaches for different AI tasks.

In conclusion, this study advances embodied AI by tailoring language model scaling insights to AI agent tasks. This allows researchers to predict and control resource needs more accurately. Establishing these tailored scaling laws supports the development of more efficient, capable agents in environments demanding high computational and data efficiency.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This Machine Learning Paper Transforms Embodied AI Efficiency: New Scaling Laws for Optimizing Model and Dataset Proportions in Behavior Cloning and World Modeling Tasks appeared first on MarkTechPost.

Exclusive Talk with Devvret Rishi, CEO and Cofounder at Predibase

Devvret Rishi is the CEO and Cofounder of Predibase. Prior to that, he was an ML product leader at Google working across products like Firebase, Google Research, and the Google Assistant as well as Vertex AI. While there, Dev was also the first product lead for Kaggle – a data science and machine learning community with over 8 million users worldwide. Dev's academic background is in computer science and statistics, and he holds a master's in computer science from Harvard University focused on ML.

Asif: What inspired you to found Predibase, and what gap in the market did you aim to address?

Devvret: We started Predibase in 2021 with the mission to democratize deep learning. At that time, we saw that leading tech companies like Google, Apple, and Uber—where my co-founders and I previously worked—were leveraging neural network models, especially large pre-trained ones, to build better systems for tasks like recommendation engines and working with unstructured data such as text and images. However, most companies were still relying on outdated methods like linear regression or tree-based models. Our goal was to democratize access to these advanced neural networks.

We built Predibase on top of an open-source project my co-founder Piero had started while at Uber. Initially, we believed the way to democratize deep learning would be through platforms like ours, but we were surprised by how quickly the field evolved. What really changed the game was the emergence of models with massive parameter counts, like transformers. When scaled up by 100x or 1000x, these models gained emergent generative properties. Suddenly, engineers could interact with them simply by prompting, without any initial training.

Our platform initially focused on fine-tuning models like BERT in 2021-2022, which were considered large at the time. But as generative AI evolved, we saw that engineers needed more than just pre-trained models—they needed a way to customize them efficiently. This reinforced our original vision. While we initially focused on democratizing deep learning through fine-tuning, we realized that the need for customization platforms like Predibase had only grown stronger.

Asif: Your results seem almost magical; how do you do it? 

Devvret: The core of our success comes from recognizing that machine learning has fundamentally changed. Five years ago, the way you trained models was by throwing a lot of data at them, training from scratch, and waiting hours or days for the process to converge. While training and fine-tuning aren’t going away, there has been a fundamental shift in how models are trained. The biggest trend driving this shift is the technical innovation behind Low-Rank Adaptation (LoRA). LoRA introduced the idea that you can modify only a small fraction of a model’s parameters—typically less than 1%—and still achieve the same level of performance as if you had fine-tuned all 7 billion parameters. This approach allows the model to behave and perform at a high level while being much more efficient.
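For readers unfamiliar with LoRA, here is a minimal illustrative sketch of the idea (not Predibase's implementation): the original weights are frozen and only a small low-rank update is trained.

import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (illustrative only)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # the update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))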

Many customers assume that training or fine-tuning models will take days and cost tens of thousands of dollars. In contrast, with Predibase, we can fine-tune most models in 30 minutes to an hour for as little as $5-$50. This efficiency empowers teams to experiment more freely and reduces the barriers to building custom models.

So I think the magic in our results is really threefold:

The first key insight we had was recognizing that the way models are trained would change significantly. We fully committed to parameter-efficient fine-tuning, enabling users to achieve high-quality results much faster and with a much smaller computational footprint.

The second step was integrating parameter-efficient training with parameter-efficient serving. We used LoRA-based training and LoRA-optimized serving through our open-source framework, LoRAX. LoRAX allows a single deployment to support multiple fine-tuned models, which means you can achieve excellent results by having many specialized fine-tunes—perhaps one per customer—without significantly increasing serving costs.

The final ingredient behind our success is a lot of hard work and benchmarking. We’ve fine-tuned hundreds of billions of tokens on our platform and tens of thousands of models ourselves. This hands-on experience has given us deep insights into which parameter combinations work best for different use cases. When a customer uploads a dataset and selects a model, we have prior knowledge of how to train that model most effectively—what LoRA rank to use, how large the model should be, and how long to train it. It all comes down to being empirical, and our extensive research, including the Predibase Fine-Tuning Leaderboard, has been baked into the platform to make this process seamless for users.

Asif: Where/when does your solution deliver the best results?

Devvret: Our platform delivers the best results for specialized tasks. As one of our customers put it, “Generalized intelligence might be great, but we don’t need our point-of-sale assistant to recite French poetry.”

We’ve seen this in our Fine-Tuning Leaderboard as well, which shows that fine-tuned models excel at handling specific, focused tasks. LoRA-based fine-tuning and serving are especially effective in these scenarios, enabling organizations to achieve high-quality results tailored to their needs. This approach ensures they get the precision they require without the unnecessary overhead of larger, general-purpose models.

Asif: How does your solution help address the huge cost of running LLMs?

Devvret: We’ve built over 50 optimizations into our fine-tuning stack, incorporating the latest findings from the research community. These optimizations allow you to fine-tune models with minimal resources while still achieving high-quality results. As a result, fine-tuning can typically be completed in minutes or hours–not days–for just $5 to $50, a fraction of what traditional methods would cost.

On the inference side–where a typical organization allocates most of their spend–we tackle costs with GPU autoscaling, so you only pay for the compute you use. Turbo LoRA ensures models are optimized for fast inference with low latency, and our LoRAX framework allows multiple fine-tuned models to run from a single GPU. This means you can efficiently serve fine-tuned models from fewer GPUs, helping keep your infrastructure costs low while supporting high-volume real-time workloads.

Asif: Large enterprises are very concerned about data security and IP, how do you address this?

Devvret: We get it—data security and IP protection are top priorities, especially for enterprises handling sensitive information. That’s why we offer the ability to deploy Predibase in your Virtual Private Cloud or in our cloud. This ensures that data stays under your control, with all the security policies you need, including SOC II Type II compliance. Whether you’re in finance, healthcare, or any other regulated industry, you can fine-tune and deploy models with the confidence that your data and IP are safe.

Asif: How easy/complicated is it to use Predibase?

Devvret: You can get started with Predibase in as few as ~10 lines of code. Whether you’re an engineer or a data scientist, our platform abstracts away the complexities of fine-tuning and deploying models. You can get started through our web interface or SDK, upload your dataset, select a model, and kick off training in no time. We’ve built Predibase to make fine-tuning as simple as possible, so teams can focus on outcomes instead of wrestling with infrastructure.

Asif: Inference speed is key in many use cases, how does Predibase help with that aspect?

Devvret: Predibase boosts inference speed with Turbo LoRA, which increases throughput by up to 4x, and FP8 quantization, which cuts the memory footprint in half for faster processing. On top of that, the LoRAX framework lets multiple fine-tuned models run on a single GPU, reducing costs and improving efficiency. With GPU autoscaling, the platform adjusts resources in real-time based on demand, ensuring fast responses during traffic spikes without overpaying for idle infrastructure. This combination guarantees fast, cost-effective model serving, whether for production workloads or high-volume AI applications.

Asif: How fast is the payback on the fine-tuning initial cost?

Devvret: The payback on fine-tuning with Predibase is incredibly fast because LoRA fine-tuning is remarkably cheap compared to full fine-tuning. Many people still assume that fine-tuning is expensive, imagining the high costs of full model retraining—but with LoRA, fine-tuning typically costs only $5 to $50 for a job, making it a low-risk, high-return investment. With Predibase, enterprises can fine-tune efficiently without running dozens of expensive, time-consuming experiments. This enables rapid deployment of specialized, high-performing models.

Asif: How are you different from other fine-tuning providers?

Devvret: Predibase stands out with a comprehensive fine-tuning platform that just works—no out-of-memory errors while training or unexpected drops in throughput while serving. We've built 50+ optimizations directly into our stack to ensure smooth, high-performance fine-tuning. Combined with LoRAX–which lets you efficiently serve hundreds of fine-tuned adapters on a single GPU–our Turbo LoRA, FP8 quantization, and GPU autoscaling make our model serving infrastructure industry-leading, delivering faster responses at lower costs.

We’ve seen too many teams get bogged down managing infrastructure, building data pipelines, and debugging fragmented open-source tools—leaving less time to actually build and productionize AI. That’s why we provide an end-to-end platform backed by a dedicated team of ML engineers to help you every step of the way. Whether you prefer the flexibility of SaaS in our cloud or full control with VPC deployments in yours, Predibase frees you from the operational burden, so you can focus on delivering impactful AI solutions.

Asif: What are some of the companies that you're working with and what problem are they solving with SLMs?

Devvret: Checkr leverages Predibase to improve the accuracy and efficiency of background checks. They process millions of checks monthly, but 2% of the data in one part of the background check workflow—often messy and unstructured—needed human review. With Predibase, Checkr fine-tuned a small language model, achieving 90%+ accuracy, outperforming GPT-4, and reducing inference costs by 5x. This enabled them to replace manual review with real-time automated decisions, meeting tight latency SLAs and improving customer experience.

Convirza, on the other hand, processes over a million phone calls per month to extract actionable insights that help coach call agents. Previously, managing infrastructure for their AI models was complex and often too much of a burden for their small AI team. With Predibase’s LoRAX multi-adapter serving, they’re able to consolidate 60 adapters into a single deployment, reducing overhead and allowing them to iterate on new models much faster. This efficiency lets them focus on building AI solutions, not infrastructure, unlocking new capabilities for their customers, like creating bespoke call performance indicators on the fly.

Both companies highlight how small language models fine-tuned on Predibase outperform larger models while cutting costs, improving response times, and streamlining operations.

Asif: How do you see the industry evolving?

Devvret: There are two big wars happening in generative AI infrastructure. The first is the competition between small, fine-tuned language models and large, general-purpose models. The second is the battle between open-source and commercial solutions.

The question that comes up a lot is: will the future be about small, task-specific, fine-tuned models, or large, general-purpose ones? I’m convinced it’s going to be more and more about small, fine-tuned models and we’ve already seen this shift starting. In 2023, the market’s focus was all about making models as big as possible, which worked well for quick prototyping. But as companies move into production, the focus shifts to cost, quality, and latency.

A lot of studies have pointed out that the economics of Gen AI haven’t always added up—too much spend, too little benefit. You can’t justify spending billions on infrastructure to solve relatively simple automation tasks. That’s where smaller, task-specific models come in. As teams graduate from prototyping into production, these models will grow in importance.

And if you look at organizations using Gen AI seriously at scale, almost all of them follow this path as they mature. It’s the same reason OpenAI felt the need to roll out something like GPT-4o-mini. I think this trend will continue, and it’s a good thing for the industry because it forces costs to align with ROI.

Talking about the second trend, my view is that the entire pie for both open-source and commercial models will grow very quickly, but the relative share of open-source is going to grow much faster than the commercial side. Based on an A16Z Generative AI survey from 2023, people were looking to spend a lot on LLMs, especially in the enterprise segment. But in 2023–the year of prototyping, as many people say–80 to 90% of the usage was estimated as closed source. However, two-thirds of AI leaders have expressed plans to increase their open-source usage, targeting a 50/50 split. 

Historically, most machine learning has been built on open-source architectures, so this shift aligns with the broader trajectory of the industry.

Asif: What problems are left unsolved and where do you see the greatest opportunity?

Devvret: I think the biggest unsolved problem—and one I find really exciting—is how to create a flywheel where models get better as they’re used. What I mean is introducing a real active learning process for LLMs. Right now, what I hear from organizations is that when they move to production, they can often get a model to 70% accuracy with prompt engineering alone. But as they try to push further, they only see marginal improvements—maybe going from 70% to 71%.

What they really want is a way to reach 80% or 90% accuracy, and they hope that by deploying the model, they can collect enough data to keep improving it. But that workflow isn’t solved yet. The way many companies handle it now is by releasing a model at 70%, collecting production data, manually reviewing it, and then fine-tuning the model based on annotated datasets. But this approach just doesn’t scale—there’s no way to manually review enough data, especially as LLMs handle millions of queries in production.

The real opportunity, in my opinion, lies in building a system where models can improve automatically over time. For example, if a model launches with 70% accuracy in a new domain, you need a way to leverage production data to fine-tune it iteratively. I think the key will be applying some of the breakthroughs we’re already seeing—like using LLMs as judges or generating synthetic data—to create that flywheel. With such a system, a model could launch at 50-70% accuracy, collect data from real use, and improve on its own.

This idea was partially realized in recommender systems, but it hasn’t yet been achieved with generative AI at scale. That’s where I think the industry is headed, and it’s where I see the most exciting potential for growth.

This Interview was originally published in Marktechpost Small Language Model SLM Magazine 2024.
The post Exclusive Talk with Devvret Rishi, CEO and Cofounder at Predibase appeared first on MarkTechPost.

Governing ML lifecycle at scale: Best practices to set up cost and usa …

Cloud costs can significantly impact your business operations. Gaining real-time visibility into infrastructure expenses, usage patterns, and cost drivers is essential. This insight enables agile decision-making and optimized scalability, and it maximizes the value derived from cloud investments, providing cost-effective and efficient cloud utilization for your organization's future growth. What makes cost visibility even more important for the cloud is that cloud usage is dynamic. This requires continuous cost reporting and monitoring to make sure costs don't exceed expectations and you only pay for the usage you need. Additionally, you can measure the value the cloud delivers to your organization by quantifying the associated cloud costs.
For a multi-account environment, you can track costs at an AWS account level to associate expenses. However, to allocate costs to cloud resources, a tagging strategy is essential. A combination of an AWS account and tags provides the best results. Implementing a cost allocation strategy early is critical for managing your expenses and future optimization activities that will reduce your spend.
This post outlines steps you can take to implement a comprehensive tagging governance strategy across accounts, using AWS tools and services that provide visibility and control. By setting up automated policy enforcement and checks, you can achieve cost optimization across your machine learning (ML) environment.
Implement a tagging strategy
A tag is a label you assign to an AWS resource. Tags consist of a customer-defined key and an optional value to help manage, search for, and filter resources. Tag keys and values are case sensitive; for example, the values Production and production are treated as distinct.
It’s important to define a tagging strategy for your resources as soon as possible when establishing your cloud foundation. Tagging is an effective scaling mechanism for implementing cloud management and governance strategies. When defining your tagging strategy, you need to determine the right tags that will gather all the necessary information in your environment. You can remove tags when they’re no longer needed and apply new tags whenever required.
Categories for designing tags
Some of the common categories used for designing tags are as follows:

Cost allocation tags – These help track costs by different attributes like department, environment, or application. This allows reporting and filtering costs in billing consoles based on tags.
Automation tags – These are used during resource creation or management workflows. For example, tagging resources with their environment allows automating tasks like stopping non-production instances after hours.
Access control tags – These enable restricting access and permissions based on tags. AWS Identity and Access Management (IAM) roles and policies can reference tags to control which users or services can access specific tagged resources.
Technical tags – These provide metadata about resources. For example, tags like environment or owner help identify technical attributes. The AWS reserved prefix aws: tags provide additional metadata tracked by AWS.
Compliance tags – These may be needed to adhere to regulatory requirements, such as tagging with classification levels or whether data is encrypted or not.
Business tags – These represent business-related attributes, not technical metadata, such as cost centers, business lines, and products. This helps track spending for cost allocation purposes.

A tagging strategy also defines a standardized convention and implementation of tags across all resource types.
When defining tags, use the following conventions:

Use all lowercase for consistency and to avoid confusion
Separate words with hyphens
Use a prefix to identify and separate AWS generated tags from third-party tool generated tags

Tagging dictionary
When defining a tagging dictionary, distinguish between mandatory and discretionary tags. Mandatory tags help identify resources and their metadata, regardless of purpose. Discretionary tags are the tags that your tagging strategy defines, and they should be made available to assign to resources as needed. The following table provides examples of a tagging dictionary used for tagging ML resources.

Tag Type | Tag Key | Purpose | Cost Allocation | Mandatory
Workload | anycompany:workload:application-id | Identifies disparate resources that are related to a specific application | Y | Y
Workload | anycompany:workload:environment | Distinguishes between dev, test, and production | Y | Y
Financial | anycompany:finance:owner | Indicates who is responsible for the resource, for example SecurityLead, SecOps, Workload-1-Development-team | Y | Y
Financial | anycompany:finance:business-unit | Identifies the business unit the resource belongs to, for example Finance, Retail, Sales, DevOps, Shared | Y | Y
Financial | anycompany:finance:cost-center | Indicates cost allocation and tracking, for example 5045, Sales-5045, HR-2045 | Y | Y
Security | anycompany:security:data-classification | Indicates data confidentiality that the resource supports | N | Y
Automation | anycompany:automation:encryption | Indicates if the resource needs to store encrypted data | N | N
Workload | anycompany:workload:name | Identifies an individual resource | N | N
Workload | anycompany:workload:cluster | Identifies resources that share a common configuration or perform a specific function for the application | N | N
Workload | anycompany:workload:version | Distinguishes between different versions of a resource or application component | N | N
Operations | anycompany:operations:backup | Identifies if the resource needs to be backed up based on the type of workload and the data that it manages | N | N
Regulatory | anycompany:regulatory:framework | Requirements for compliance to specific standards and frameworks, for example NIST, HIPAA, or GDPR | N | N

You need to define what resources require tagging and implement mechanisms to enforce mandatory tags on all necessary resources. For multiple accounts, assign mandatory tags to each one, identifying its purpose and the owner responsible. Avoid personally identifiable information (PII) when labeling resources because tags remain unencrypted and visible.
Tagging ML workloads on AWS
When running ML workloads on AWS, primary costs are incurred from compute resources required, such as Amazon Elastic Compute Cloud (Amazon EC2) instances for hosting notebooks, running training jobs, or deploying hosted models. You also incur storage costs for datasets, notebooks, models, and so on stored in Amazon Simple Storage Service (Amazon S3).
A reference architecture for the ML platform with various AWS services is shown in the following diagram. This framework considers multiple personas and services to govern the ML lifecycle at scale. For more information about the reference architecture in detail, see Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker.

The reference architecture includes a landing zone and multi-account landing zone accounts. These should be tagged to track costs for governance and shared services.
The key contributors towards recurring ML cost that should be tagged and tracked are as follows:

Amazon DataZone – Amazon DataZone allows you to catalog, discover, govern, share, and analyze data across various AWS services. Tags can be added at an Amazon DataZone domain and used for organizing data assets, users, and projects. Usage of data is tracked through the data consumers, such as Amazon Athena, Amazon Redshift, or Amazon SageMaker.
AWS Lake Formation – AWS Lake Formation helps manage data lakes and integrate them with other AWS analytics services. You can define metadata tags and assign them to resources like databases and tables. This identifies teams or cost centers responsible for those resources. Automating resource tags when creating databases or tables with the AWS Command Line Interface (AWS CLI) or SDKs provides consistent tagging. This enables accurate tracking of costs incurred by different teams.
Amazon SageMaker – Amazon SageMaker uses a domain to provide access to an environment and resources. When a domain is created, SageMaker automatically generates a tag with a DomainId key, and administrators can add a custom ProjectId tag. Together, these tags can be used for project-level resource isolation. Tags on a SageMaker domain are automatically propagated to any SageMaker resources created in the domain.
Amazon SageMaker Feature Store – Amazon SageMaker Feature Store allows you to tag your feature groups and search for feature groups using tags. You can add tags when creating a new feature group or edit the tags of an existing feature group.
Amazon SageMaker resources – When you tag SageMaker resources such as jobs or endpoints, you can track spending based on attributes like project, team, or environment. For example, you can specify tags when creating the SageMaker Estimator that launches a training job (see the sketch that follows this list).
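As an illustration, the following minimal sketch shows how tags from the dictionary above might be attached to a training job with the SageMaker Python SDK. The image URI, role, and tag values are placeholders, not values taken from this post.

from sagemaker.estimator import Estimator

# Placeholder image URI, role, and tag values
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    tags=[
        {"Key": "anycompany:workload:application-id", "Value": "demand-forecasting"},
        {"Key": "anycompany:workload:environment", "Value": "dev"},
        {"Key": "anycompany:finance:cost-center", "Value": "5045"},
    ],
)

# estimator.fit(...)  # the tags are applied to the training job this call launches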

Using tags allows you to attribute costs in a way that aligns with business needs. Monitoring expenses this way gives insight into how budgets are consumed.
Enforce a tagging strategy
An effective tagging strategy uses mandatory tags and applies them consistently and programmatically across AWS resources. You can use both reactive and proactive approaches for governing tags in your AWS environment.
Proactive governance uses tools such as AWS CloudFormation, AWS Service Catalog, tag policies in AWS Organizations, or IAM resource-level permissions to make sure you apply mandatory tags consistently at resource creation. For example, you can use the CloudFormation Resource Tags property to apply tags to resource types. In Service Catalog, you can add tags that automatically apply when you launch the service.
Reactive governance is for finding resources that lack proper tags using tools such as the AWS Resource Groups tagging API, AWS Config rules, and custom scripts. To find resources manually, you can use Tag Editor and detailed billing reports.
Proactive governance
Proactive governance uses the following tools:

Service Catalog – You can apply tags to all resources created when a product launches from the service catalog. The service catalog provides a TagOption library; use it to define the tag key-pairs to associate with the product.
CloudFormation Resource Tags – You can apply tags to resources using the AWS CloudFormation Resource Tags property. Tag only those resources that support tagging through AWS CloudFormation.
Tag policies – Tag policies standardize tags across your organization's account resources. Define tagging rules in a tag policy that apply when resources get tagged. For example, specify that a CostCenter tag attached to a resource must match the case and values the policy defines. For supported resource types, you can also enforce the policy so that noncompliant tagging requests are prevented from completing. The policy doesn't evaluate untagged resources or undefined tags for compliance. Tag policies involve working with multiple AWS services:

To enable the tag policies feature, use AWS Organizations. You can create tag policies and then attach those policies to organization entities to put the tagging rules into effect (see the sketch after this list).
Use AWS Resource Groups to find noncompliant tags on account resources. Correct the noncompliant tags in the AWS service where you created the resource.

Service Control Policies – You can restrict the creation of an AWS resource without proper tags. Use Service Control Policies (SCPs) to set guardrails around requests to create resources. SCPs allow you to enforce tagging policies on resource creation. To create an SCP, navigate to the AWS Organizations console, choose Policies in the navigation pane, then choose Service Control Policies.
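For example, the tag policy mechanism above can be driven programmatically. The following minimal sketch (with a placeholder policy body and organizational unit ID, and assuming it runs from the AWS Organizations management account) creates a tag policy and attaches it to an organizational unit using Boto3:

import json
import boto3

org = boto3.client("organizations")

# Placeholder policy content; define your own tag keys and enforcement rules,
# and confirm which resource types support enforcement in the tag policies documentation
tag_policy = {
    "tags": {
        "anycompany:finance:cost-center": {
            "tag_key": {"@@assign": "anycompany:finance:cost-center"},
            "enforced_for": {"@@assign": ["sagemaker:*"]},
        }
    }
}

response = org.create_policy(
    Name="ml-cost-center-tag-policy",
    Description="Standardize the cost-center tag for ML resources",
    Type="TAG_POLICY",
    Content=json.dumps(tag_policy),
)

org.attach_policy(
    PolicyId=response["Policy"]["PolicySummary"]["Id"],
    TargetId="<organizational-unit-id>",  # placeholder OU ID
)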

Reactive governance
Reactive governance uses the following tools:

AWS Config rules – Check resources regularly for improper tagging. The AWS Config rule required-tags examines resources to make sure they contain specified tags. You should take action when resources lack necessary tags.
AWS Resource Groups tagging API – The AWS Resource Groups Tagging API lets you tag or untag resources. It also enables searching for resources in a specified AWS Region or account using tag-based filters (see the sketch after this list). Additionally, you can search for existing tags in a Region or account, or find existing values for a key within a specific Region or account. To create a resource tag group, refer to Creating query-based groups in AWS Resource Groups.
Tag Editor – With Tag Editor, you build a query to find resources in one or more Regions that are available for tagging. To find resources to tag, see Finding resources to tag.
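To make the Resource Groups Tagging API concrete, the following minimal Boto3 sketch (the tag key and value are placeholders) lists SageMaker resources that carry a given application tag:

import boto3

tagging = boto3.client("resourcegroupstaggingapi")

# Placeholder tag filter: find SageMaker resources tagged for a specific application
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate(
    ResourceTypeFilters=["sagemaker"],
    TagFilters=[
        {"Key": "anycompany:workload:application-id", "Values": ["demand-forecasting"]}
    ],
):
    for resource in page["ResourceTagMappingList"]:
        print(resource["ResourceARN"], resource["Tags"])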

SageMaker tag propagation
Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps required to prepare data, as well as build, train, and deploy models. SageMaker Studio automatically copies and assigns tags to the SageMaker Studio notebooks created by users, so you can track and categorize the cost of SageMaker Studio notebooks.
Amazon SageMaker Pipelines allows you to create end-to-end workflows for managing and deploying SageMaker jobs. Each pipeline is composed of a sequence of steps that transform data into a trained model. Tags can be applied to pipelines similarly to how they are used for other SageMaker resources. When a pipeline is run, its tags can potentially propagate to the underlying jobs launched as part of the pipeline steps.
When models are registered in Amazon SageMaker Model Registry, tags can be propagated from model packages to other related resources like endpoints. Model packages in the registry can be tagged when registering a model version. These tags become associated with the model package. Tags on model packages can potentially propagate to other resources that reference the model, such as endpoints created using the model.
Tag policy quotas
The number of policies that you can attach to an entity (root, OU, and account) is subject to quotas for AWS Organizations. See Quotas and service limits for AWS Organizations for the number of tags that you can attach.
Monitor resources
To achieve financial success and accelerate business value realization in the cloud, you need complete, near real-time visibility of cost and usage information to make informed decisions.
Cost organization
You can apply meaningful metadata to your AWS usage with AWS cost allocation tags. Use AWS Cost Categories to create rules that logically group cost and usage information by account, tags, service, charge type, or other categories. Access the metadata and groupings in services like AWS Cost Explorer, AWS Cost and Usage Reports, and AWS Budgets to trace costs and usage back to specific teams, projects, and business initiatives.
Cost visualization
You can view and analyze your AWS costs and usage over the past 13 months using Cost Explorer. You can also forecast your likely spending for the next 12 months and receive recommendations for Reserved Instance purchases that may reduce your costs. Using Cost Explorer enables you to identify areas needing further inquiry and to view trends to understand your costs. For more detailed cost and usage data, use AWS Data Exports to create exports of your billing and cost management data by selecting SQL columns and rows to filter the data you want to receive. Data exports get delivered on a recurring basis to your S3 bucket for you to use with your business intelligence (BI) or data analytics solutions.
You can use AWS Budgets to set custom budgets that track cost and usage for simple or complex use cases. AWS Budgets also lets you enable email or Amazon Simple Notification Service (Amazon SNS) notifications when actual or forecasted cost and usage exceed your set budget threshold. In addition, AWS Budgets integrates with Cost Explorer.
Cost allocation
Cost Explorer enables you to view and analyze your costs and usage data over time, up to 13 months, through the AWS Management Console. It provides premade views that display quick information about your cost trends, and you can customize views to suit your needs. You can apply various available filters to view specific costs. Also, you can save any view as a report.
Monitoring in a multi-account setup
SageMaker supports cross-account lineage tracking. This allows you to associate and query lineage entities, like models and training jobs, owned by different accounts. It helps you track related resources and costs across accounts. Use the AWS Cost and Usage Report to track costs for SageMaker and other services across accounts. The report aggregates usage and costs based on tags, resources, and more so you can analyze spending per team, project, or other criteria spanning multiple accounts.
Cost Explorer allows you to visualize and analyze SageMaker costs from different accounts. You can filter costs by tags, resources, or other dimensions. You can also export the data to third-party BI tools for customized reporting.
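As a sketch of how such reporting can be automated, the following Boto3 call (the dates and tag key are placeholders, and the tag must already be activated as a cost allocation tag) groups monthly Amazon SageMaker costs by a cost allocation tag through the Cost Explorer API:

import boto3

ce = boto3.client("ce")  # Cost Explorer

# Placeholder time period and tag key
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-10-01", "End": "2024-11-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon SageMaker"]}},
    GroupBy=[{"Type": "TAG", "Key": "anycompany:workload:application-id"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])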
Conclusion
In this post, we discussed how to implement a comprehensive tagging strategy to track costs for ML workloads across multiple accounts. We discussed implementing tagging best practices by logically grouping resources and tracking costs by dimensions like environment, application, team, and more. We also looked at enforcing the tagging strategy using proactive and reactive approaches. Additionally, we explored the capabilities within SageMaker to apply tags. Lastly, we examined approaches to provide visibility of cost and usage for your ML workloads.
For more information about how to govern your ML lifecycle, see Part 1 and Part 2 of this series.

About the authors
Gunjan Jain, an AWS Solutions Architect based in Southern California, specializes in guiding large financial services companies through their cloud transformation journeys. He expertly facilitates cloud adoption, optimization, and implementation of Well-Architected best practices. Gunjan’s professional focus extends to machine learning and cloud resilience, areas where he demonstrates particular enthusiasm. Outside of his professional commitments, he finds balance by spending time in nature.
Ram Vittal is a Principal Generative AI Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, reliable, and scalable GenAI/ML systems to help enterprise customers improve their business outcomes. In his spare time, he rides his motorcycle and enjoys walking with his dogs!

Automate invoice processing with Streamlit and Amazon Bedrock

Invoice processing is a critical yet often cumbersome task for businesses of all sizes, especially for large enterprises dealing with invoices from multiple vendors with varying formats. The sheer volume of data, coupled with the need for accuracy and efficiency, can make invoice processing a significant challenge. Invoices can vary widely in format, structure, and content, making efficient processing at scale difficult. Traditional methods relying on manual data entry or custom scripts for each vendor’s format can not only lead to inefficiencies, but can also increase the potential for errors, resulting in financial discrepancies, operational bottlenecks, and backlogs.
To extract key details such as invoice numbers, dates, and amounts, we use Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
In this post, we provide a step-by-step guide with the building blocks needed for creating a Streamlit application to process and review invoices from multiple vendors. Streamlit is an open source framework for data scientists to efficiently create interactive web-based data applications in pure Python. We use Anthropic’s Claude 3 Sonnet model in Amazon Bedrock and Streamlit for building the application front-end.
Solution overview
This solution uses the Amazon Bedrock Knowledge Bases chat with document feature to analyze and extract key details from your invoices, without needing a knowledge base. The results are shown in a Streamlit app, with the invoices and extracted information displayed side-by-side for quick review. Importantly, your document and data are not stored after processing.
The storage layer uses Amazon Simple Storage Service (Amazon S3) to hold the invoices that business users upload. After uploading, you can set up a regular batch job to process these invoices, extract key information, and save the results in a JSON file. In this post, we save the data in JSON format, but you can also choose to store it in your preferred SQL or NoSQL database.
The application layer uses Streamlit to display the PDF invoices alongside the extracted data from Amazon Bedrock. For simplicity, we deploy the app locally, but you can also run it on Amazon SageMaker Studio, Amazon Elastic Compute Cloud (Amazon EC2), or Amazon Elastic Container Service (Amazon ECS) if needed.
Prerequisites
To implement this solution, complete the following prerequisites:

Create and activate an AWS account. Make sure your AWS credentials are configured correctly.
This tutorial assumes you have the necessary AWS Identity and Access Management (IAM) permissions.
Install AWS Command Line Interface (AWS CLI)
Configure AWS CLI and set the AWS Region to where you would like to run this invoice processor by following the Set up AWS temporary credentials and AWS Region for development documentation. The Region you choose must have Amazon Bedrock and Anthropic’s Claude 3 Sonnet model available.
Install Python 3.7 or later on your local machine.
Access to Anthropic’s Claude 3 Sonnet in Amazon Bedrock.

Install dependencies and clone the example
To get started, install the necessary packages on your local machine or on an EC2 instance. If you're new to Amazon EC2, refer to the Amazon EC2 User Guide. In this tutorial, we use the local machine for project setup.
To install dependencies and clone the example, follow these steps:

Clone the repository into a local folder:

git clone https://github.com/aws-samples/genai-invoice-processor.git

Install Python dependencies

Navigate to the project directory:

cd </path/to/your/folder>/genai-invoice-processor

Upgrade pip

python3 -m pip install --upgrade pip

(Optional) Create a virtual environment to isolate dependencies:

python3 -m venv venv

Activate the virtual environment:

Mac/Linux:

source venv/bin/activate

Windows:

venv\Scripts\activate

In the cloned directory, invoke the following to install the necessary Python packages:

pip install -r requirements.txt
This will install the necessary packages, including Boto3 (AWS SDK for Python), Streamlit, and other dependencies.
Update the region in the config.yaml file to the same Region set for your AWS CLI where Amazon Bedrock and Anthropic’s Claude 3 Sonnet model are available.

After completing these steps, the invoice processor code will be set up in your local environment and will be ready for the next stages to process invoices using Amazon Bedrock.
Process invoices using Amazon Bedrock
Now that the environment setup is done, you’re ready to start processing invoices and deploying the Streamlit app. To process invoices using Amazon Bedrock, follow these steps:
Store invoices in Amazon S3
Store invoices from different vendors in an S3 bucket. You can upload them directly using the console, API, or as part of your regular business process. Follow these steps to upload using the CLI:

Create an S3 bucket:

aws s3 mb s3://<your-bucket-name> --region <your-region>
Replace your-bucket-name with a unique name for your bucket and your-region with the Region set for your AWS CLI and in config.yaml (for example, us-east-1).
Upload invoices to the S3 bucket using one of the following commands:

To upload invoices to the root of the bucket:

aws s3 cp </path/to/your/folder> s3://<your-bucket-name>/ --recursive

To upload invoices to a specific folder (for example, invoices):

aws s3 cp </path/to/your/folder> s3://<your-bucket-name>/<prefix>/ --recursive

Validate the upload:

aws s3 ls s3://<your-bucket-name>/

Process invoices with Amazon Bedrock
In this section, you will process the invoices in Amazon S3 and store the results in a JSON file (processed_invoice_output.json). You will extract the key details from the invoices (such as invoice numbers, dates, and amounts) and generate summaries.
You can trigger the processing of these invoices using the AWS CLI or automate the process with an Amazon EventBridge rule or AWS Lambda trigger. For this walkthrough, we will use the AWS CLI to trigger the processing.
We packaged the processing logic in the Python script invoices_processor.py, which can be run as follows:

python invoices_processor.py --bucket_name=<your-bucket-name> --prefix=<your-folder>

The --prefix argument is optional. If omitted, all of the PDFs in the bucket will be processed. For example:

python invoices_processor.py --bucket_name='gen_ai_demo_bucket'

or

python invoices_processor.py --bucket_name='gen_ai_demo_bucket' --prefix='invoice'
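
If you would rather run this batch on a schedule than invoke the CLI manually, one option is to wrap the same logic in an AWS Lambda function and trigger it with an Amazon EventBridge rule. The following is a hypothetical sketch, not part of the sample repository; it assumes the repository code (including config.yaml and the dependencies in requirements.txt) is packaged with the function and that the batch is small enough to finish within the Lambda timeout:

# lambda_function.py (hypothetical wrapper around the sample's batch logic)
import os

from invoices_processor import initialize_aws_clients, batch_process_s3_bucket_invoices

def lambda_handler(event, context):
    # The bucket and optional prefix can come from the EventBridge rule input or environment variables
    bucket_name = event.get("bucket_name", os.environ.get("INVOICE_BUCKET"))
    prefix = event.get("prefix", "")

    s3_client, bedrock_client = initialize_aws_clients()
    processed = batch_process_s3_bucket_invoices(s3_client, bedrock_client, bucket_name, prefix)

    return {"processed_invoices": processed}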

Use the solution
This section examines the invoices_processor.py code. You can chat with your document either on the Amazon Bedrock console or by using the Amazon Bedrock RetrieveAndGenerate API (SDK). In this tutorial, we use the API approach.

Initialize the environment: The script imports the necessary libraries and initializes the Amazon Bedrock and Amazon S3 clients.

import boto3
import os
import json
import shutil
import argparse
import time
import datetime
import yaml
from typing import Dict, Any, Tuple
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock
from mypy_boto3_bedrock_runtime.client import BedrockRuntimeClient
from mypy_boto3_s3.client import S3Client

# Load configuration from YAML file
def load_config():
    """
    Load and return the configuration from the 'config.yaml' file.
    """
    with open('config.yaml', 'r') as file:
        return yaml.safe_load(file)

CONFIG = load_config()

write_lock = Lock()  # Lock for managing concurrent writes to the output file

def initialize_aws_clients() -> Tuple[S3Client, BedrockRuntimeClient]:
    """
    Initialize and return AWS S3 and Bedrock clients.

    Returns:
        Tuple[S3Client, BedrockRuntimeClient]
    """
    return (
        boto3.client('s3', region_name=CONFIG['aws']['region_name']),
        boto3.client(service_name='bedrock-agent-runtime',
                     region_name=CONFIG['aws']['region_name'])
    )

Configure the application: The config.yaml file specifies the model ID, Region, prompts for entity extraction, and the output file location for processing.

aws:
  region_name: us-west-2
  model_id: anthropic.claude-3-sonnet-20240229-v1:0
  prompts:
    full: Extract data from attached invoice in key-value format.
    structured: |
      Process the pdf invoice and list all metadata and values in json format for the variables with descriptions in <variables></variables> tags. The result should be returned as JSON as given in the <output></output> tags.

      <variables>
      Vendor: Name of the company or entity the invoice is from.
      InvoiceDate: Date the invoice was created.
      DueDate: Date the invoice is due and needs to be paid by.
      CurrencyCode: Currency code for the invoice amount based on the symbol and vendor details.
      TotalAmountDue: Total amount due for the invoice
      Description: a concise summary of the invoice description within 20 words
      </variables>

      Format your analysis as a JSON object in following structure:
      <output> {
        "Vendor": "<vendor name>",
        "InvoiceDate": "<DD-MM-YYYY>",
        "DueDate": "<DD-MM-YYYY>",
        "CurrencyCode": "<Currency code based on the symbol and vendor details>",
        "TotalAmountDue": "<100.90>", # should be a decimal number in string
        "Description": "<Concise summary of the invoice description within 20 words>"
      } </output>
      Please proceed with the analysis based on the above instructions. Please don't state "Based on the .."
    summary: Process the pdf invoice and summarize the invoice under 3 lines

processing:
  output_file: processed_invoice_output.json
  local_download_folder: invoices

Set up API calls: The RetrieveAndGenerate API fetches the invoice from Amazon S3 and processes it using the FM. It takes several parameters, such as prompt, source type (S3), model ID, AWS Region, and S3 URI of the invoice.

def retrieve_and_generate(bedrock_client: BedrockRuntimeClient, input_prompt: str, document_s3_uri: str) -> Dict[str, Any]:
    """
    Use AWS Bedrock to retrieve and generate invoice data based on the provided prompt and S3 document URI.

    Args:
        bedrock_client (BedrockRuntimeClient): AWS Bedrock client
        input_prompt (str): Prompt for the AI model
        document_s3_uri (str): S3 URI of the invoice document

    Returns:
        Dict[str, Any]: Generated data from Bedrock
    """
    model_arn = f'arn:aws:bedrock:{CONFIG["aws"]["region_name"]}::foundation-model/{CONFIG["aws"]["model_id"]}'
    return bedrock_client.retrieve_and_generate(
        input={'text': input_prompt},
        retrieveAndGenerateConfiguration={
            'type': 'EXTERNAL_SOURCES',
            'externalSourcesConfiguration': {
                'modelArn': model_arn,
                'sources': [
                    {
                        "sourceType": "S3",
                        "s3Location": {"uri": document_s3_uri}
                    }
                ]
            }
        }
    )

Batch processing: The batch_process_s3_bucket_invoices function batch processes the invoices in the specified S3 bucket in parallel and writes the results to the output file (processed_invoice_output.json, as specified by output_file in config.yaml). It relies on the process_invoice function, which calls the Amazon Bedrock RetrieveAndGenerate API for each invoice and prompt.

def process_invoice(s3_client: S3Client, bedrock_client: BedrockRuntimeClient, bucket_name: str, pdf_file_key: str) -> Dict[str, str]:
    """
    Process a single invoice by downloading it from S3 and using Bedrock to analyze it.

    Args:
        s3_client (S3Client): AWS S3 client
        bedrock_client (BedrockRuntimeClient): AWS Bedrock client
        bucket_name (str): Name of the S3 bucket
        pdf_file_key (str): S3 key of the PDF invoice

    Returns:
        Dict[str, Any]: Processed invoice data
    """
    document_uri = f"s3://{bucket_name}/{pdf_file_key}"
    local_file_path = os.path.join(CONFIG['processing']['local_download_folder'], pdf_file_key)

    # Ensure the local directory exists and download the invoice from S3
    os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
    s3_client.download_file(bucket_name, pdf_file_key, local_file_path)

    # Process invoice with different prompts
    results = {}
    for prompt_name in ["full", "structured", "summary"]:
        response = retrieve_and_generate(bedrock_client, CONFIG['aws']['prompts'][prompt_name], document_uri)
        results[prompt_name] = response['output']['text']

    return results

def batch_process_s3_bucket_invoices(s3_client: S3Client, bedrock_client: BedrockRuntimeClient, bucket_name: str, prefix: str = "") -> int:
    """
    Batch process all invoices in an S3 bucket or a specific prefix within the bucket.

    Args:
        s3_client (S3Client): AWS S3 client
        bedrock_client (BedrockRuntimeClient): AWS Bedrock client
        bucket_name (str): Name of the S3 bucket
        prefix (str, optional): S3 prefix to filter invoices. Defaults to "".

    Returns:
        int: Number of processed invoices
    """
    # Clear and recreate local download folder
    shutil.rmtree(CONFIG['processing']['local_download_folder'], ignore_errors=True)
    os.makedirs(CONFIG['processing']['local_download_folder'], exist_ok=True)

    # Prepare to iterate through all objects in the S3 bucket
    continuation_token = None  # Pagination handling
    pdf_file_keys = []

    while True:
        list_kwargs = {'Bucket': bucket_name, 'Prefix': prefix}
        if continuation_token:
            list_kwargs['ContinuationToken'] = continuation_token

        response = s3_client.list_objects_v2(**list_kwargs)

        for obj in response.get('Contents', []):
            pdf_file_key = obj['Key']
            if pdf_file_key.lower().endswith('.pdf'):  # Skip folders or non-PDF files
                pdf_file_keys.append(pdf_file_key)

        if not response.get('IsTruncated'):
            break
        continuation_token = response.get('NextContinuationToken')

    # Process invoices in parallel
    processed_count = 0
    with ThreadPoolExecutor() as executor:
        future_to_key = {
            executor.submit(process_invoice, s3_client, bedrock_client, bucket_name, pdf_file_key): pdf_file_key
            for pdf_file_key in pdf_file_keys
        }

        for future in as_completed(future_to_key):
            pdf_file_key = future_to_key[future]
            try:
                result = future.result()
                # Write result to the JSON output file as soon as it's available
                write_to_json_file(CONFIG['processing']['output_file'], {pdf_file_key: result})
                processed_count += 1
                print(f"Processed file: s3://{bucket_name}/{pdf_file_key}")
            except Exception as e:
                print(f"Failed to process s3://{bucket_name}/{pdf_file_key}: {str(e)}")

    return processed_count
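
The batch function above calls a small helper, write_to_json_file, whose implementation ships with invoices_processor.py in the repository and isn't reproduced here. As a rough sketch of what such a helper can look like, assuming it merges each new result into the output file under the write_lock defined earlier:

def write_to_json_file(output_file: str, new_entry: Dict[str, Any]) -> None:
    """Merge one processed invoice result into the JSON output file (thread-safe sketch)."""
    with write_lock:
        data = {}
        if os.path.exists(output_file):
            with open(output_file, 'r') as f:
                data = json.load(f)
        data.update(new_entry)
        with open(output_file, 'w') as f:
            json.dump(data, f, indent=2)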

Post-processing: The extracted data in processed_invoice_output.json can be further structured or customized to suit your needs.

This approach allows invoice handling from multiple vendors, each with its own unique format and structure. By using large language models (LLMs), it extracts important details such as invoice numbers, dates, amounts, and vendor information without requiring custom scripts for each vendor format.
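
As a simple post-processing example, the following sketch (assuming the pandas library is installed and that the structured prompt returned valid JSON for each invoice) flattens the output file into a tabular form that you could review, report on, or load into a database:

import json
import pandas as pd

with open("processed_invoice_output.json", "r") as f:
    results = json.load(f)

rows = []
for pdf_key, outputs in results.items():
    try:
        # The "structured" prompt asks the model to return a JSON object
        structured = json.loads(outputs["structured"])
    except (KeyError, json.JSONDecodeError):
        structured = {}
    rows.append({"invoice_file": pdf_key, "summary": outputs.get("summary", ""), **structured})

df = pd.DataFrame(rows)
print(df.head())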
Run the Streamlit demo
Now that you have the components in place and the invoices processed using Amazon Bedrock, it’s time to deploy the Streamlit application. You can launch the app by invoking the following command:

streamlit run review-invoice-data.py
or
python -m streamlit run review-invoice-data.py

When the app is up, it will open in your default web browser. From there, you can review the invoices and the extracted data side-by-side. Use the Previous and Next arrows to seamlessly navigate through the processed invoices so you can interact with and analyze the results efficiently. The following screenshot shows the UI.
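
The repository's review-invoice-data.py implements this UI; the following is only a stripped-down sketch of the navigation pattern (Streamlit session state plus two columns), not the actual application code, in case you want to adapt the idea to your own front end:

import json
import streamlit as st

st.set_page_config(layout="wide")

with open("processed_invoice_output.json", "r") as f:
    results = json.load(f)

keys = sorted(results.keys())
if "idx" not in st.session_state:
    st.session_state.idx = 0

col_prev, col_next = st.columns(2)
if col_prev.button("Previous") and st.session_state.idx > 0:
    st.session_state.idx -= 1
if col_next.button("Next") and st.session_state.idx < len(keys) - 1:
    st.session_state.idx += 1

current_key = keys[st.session_state.idx]
left, right = st.columns(2)
left.subheader(current_key)       # the repository app renders the PDF itself here
right.subheader("Extracted data")
right.json(results[current_key])  # full, structured, and summary outputs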

There are quotas for Amazon Bedrock (some of which are adjustable) that you need to consider when building at scale.
Cleanup
To clean up after running the demo, follow these steps:

Delete the S3 bucket containing your invoices using the following command:

aws s3 rb s3://<your-bucket-name> --force

If you set up a virtual environment, deactivate it by invoking deactivate
Remove any local files created during the process, including the cloned repository and output files
If you used any AWS resources such as an EC2 instance, terminate them to avoid unnecessary charges

Conclusion
In this post, we walked through a step-by-step guide to automating invoice processing using Streamlit and Amazon Bedrock, addressing the challenge of handling invoices from multiple vendors with different formats. We showed how to set up the environment, process invoices stored in Amazon S3, and deploy a user-friendly Streamlit application to review and interact with the processed data.
If you are looking to further enhance this solution, consider integrating additional features or deploying the app on scalable AWS services such as Amazon SageMaker, Amazon EC2, or Amazon ECS. With this flexibility, your invoice processing solution can evolve with your business, providing long-term value and efficiency.
We encourage you to learn more by exploring Amazon Bedrock, Access Amazon Bedrock foundation models, the RetrieveAndGenerate API, and Quotas for Amazon Bedrock, and by building a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, leave a comment.

About the Authors
Deepika Kumar is a Solution Architect at AWS. She has over 13 years of experience in the technology industry and has helped enterprises and SaaS organizations build and securely deploy their workloads on the cloud. She is passionate about using generative AI in a responsible manner, whether that is driving product innovation, boosting productivity, or enhancing customer experiences.
Jobandeep Singh is an Associate Solution Architect at AWS specializing in Machine Learning. He supports customers across a wide range of industries in leveraging AWS to drive innovation and efficiency in their operations. In his free time, he enjoys playing sports, with a particular love for hockey.
Ratan Kumar is a solutions architect based out of Auckland, New Zealand. He works with large enterprise customers, helping them design and build secure, cost-effective, and reliable internet scale applications using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.

Centralize model governance with SageMaker Model Registry Resource Acc …

We recently announced the general availability of cross-account sharing of Amazon SageMaker Model Registry using AWS Resource Access Manager (AWS RAM), making it easier to securely share and discover machine learning (ML) models across your AWS accounts.
Customers find it challenging to share and access ML models across AWS accounts because they have to set up complex AWS Identity and Access Management (IAM) policies and create custom integrations. With this launch, customers can now seamlessly share and access ML models registered in SageMaker Model Registry between different AWS accounts.
Customers can use the SageMaker Studio UI or APIs to specify the SageMaker Model Registry model to be shared and grant access to specific AWS accounts or to everyone in the organization. Authorized users can then quickly discover and use those shared models in their own AWS accounts. This streamlines the ML workflows, enables better visibility and governance, and accelerates the adoption of ML models across the organization.
In this post, we will show you how to use this new cross-account model sharing feature to build your own centralized model governance capability, which is often needed for centralized model approval, deployment, auditing, and monitoring workflows. Before we dive into the details of the architecture for sharing models, let’s review what use case and model governance are and why they’re needed.
Use case governance is essential to help ensure that AI systems are developed and used in ways that respect values, rights, and regulations. According to the EU AI Act, use case governance refers to the process of overseeing and managing the development, deployment, and use of AI systems in specific contexts or applications. This includes:

Risk assessment: Identifying and evaluating potential risks associated with AI systems.
Mitigation strategies: Implementing measures to minimize or eliminate risks.
Transparency and explainability: Making sure that AI systems are transparent, explainable, and accountable.
Human oversight: Including human involvement in AI decision-making processes.
Monitoring and evaluation: Continuously monitoring and evaluating AI systems to help ensure compliance with regulations and ethical standards.

Model governance involves overseeing the development, deployment, and maintenance of ML models to help ensure that they meet business objectives and are accurate, fair, and compliant with regulations. It includes processes for monitoring model performance, managing risks, ensuring data quality, and maintaining transparency and accountability throughout the model’s lifecycle. In AWS, these model lifecycle activities can be performed over multiple AWS accounts (for example, development, test, and production accounts) at the use case or business unit level. However, model governance functions in an organization are centralized and to perform those functions, teams need access to metadata about model lifecycle activities across those accounts for validation, approval, auditing, and monitoring to manage risk and compliance.
Use case and model governance plays a crucial role in implementing responsible AI and helps with the reliability, fairness, compliance, and risk management of ML models across use cases in the organization. It helps prevent biases, manage risks, protect against misuse, and maintain transparency. By establishing robust oversight, organizations can build trust, meet regulatory requirements, and help ensure ethical use of AI technologies.
Use case and model lifecycle governance overview
In the context of regulations such as the European Union’s Artificial Intelligence Act (EU AI Act), a use case refers to a specific application or scenario where AI is used to achieve a particular goal or solve a problem. The EU AI Act proposes to regulate AI systems based on their intended use cases, which are categorized into four levels of risk:

Unacceptable risk: Significant threat to safety, livelihoods, or rights
High risk: Significant impacts on lives (for example, use of AI in healthcare and transportation)
Limited risk: Minimal impacts (for example, chatbots and virtual assistants)
Minimal risk: Negligible risks (for example, entertainment and gaming)

An AI system is built to satisfy a use case such as credit risk, which can be composed of workflows orchestrated with one or more ML models, such as credit risk and fraud detection models. You can build a use case (or AI system) using existing models, newly built models, or a combination of both. Regardless of how the AI system is built, governance is applied at the AI system level, where use case decisions (for example, denying a loan application) are made. However, explaining why that decision was made requires next-level detailed reports from each affected model component of that AI system. Therefore, governance applies both at the use case and model level and is driven by each of their lifecycle stages.
Use case lifecycle stages
A use case has its own set of lifecycle stages from development through deployment to production, shown in the following figure. A use case typically starts with an experimentation or proof-of-concept (POC) stage where the idea is explored for feasibility. When the use case is determined to be feasible, it’s approved and moves to the next stage for development. The use case is then developed using various components including ML models and unit testing, and then moved to the next stage—quality assurance (QA)—after approval. Next, the use case is tested, validated, and approved to be moved to the pre-production stage where it’s A/B tested with production-like settings and approved for the next stage. Now, the use case is deployed and operational in production. When the use case is no longer needed for business, it’s retired and decommissioned. Even though these stages are depicted as linear in the diagram, they are frequently iterative.

Model lifecycle stages
When an ML model is developed it goes through a similar set of lifecycle stages as a use case. In the case of an ML model, shown in the following figure, the lifecycle starts with the development or candidate model. Prior to that stage, there would be several experiments performed to build the candidate model. From a governance perspective, tracking starts from the candidate or dev model stage. After approval in dev, the model moves into the QA stage where it’s validated and integration tested to make sure that it meets the use case requirements and then is approved for promotion to the next stage. The model is then A/B tested along with the use case in pre-production with production-like data settings and approved for deployment to the next stage. The model is finally deployed to production. When the model is no longer needed, it’s retired and removed from deployed endpoints.

Stage status types
In the preceding use case and model stages discussion, we mentioned approving the model to go to the next stage. However, there are two other possible states—pending and rejected, as depicted in the following figure. These stages are applicable to both use case and model stages. For example, a use case that’s been moved from the QA stage to pre-production could be rejected and sent back to the development stage for rework because of missing documentation related to meeting certain regulatory controls.

Multi-account architecture for sharing models
A multi-account strategy improves security, scalability, and reliability of your systems. It also helps achieve data, project, and team isolation while supporting software development lifecycle best practices. Cross-account model sharing supports a multi-account strategy, removing the overhead of assuming roles into multiple accounts. Furthermore, sharing model resources directly across multiple accounts helps improve ML model approval, deployment, and auditing.
The following diagram depicts an architecture for centralizing model governance using AWS RAM for sharing models using a SageMaker Model Group, a core construct within SageMaker Model Registry where you register your model version.

Figure 1:  Centralizing Model Governance using AWS RAM Share

In the architecture presented in the preceding figure, the use case stakeholder, data scientist (DS) and ML engineer (MLE) perform the following steps:

The use case stakeholder, that is, the DS team lead, receives a request from their line of business lead to build an AI use case such as credit risk.

The DS team lead records the credit risk use case in the POC stage in the stage governance table.
The MLE is notified to set up a model group for new model development. The MLE creates the necessary infrastructure pipeline to set up a new model group.

The MLE sets up the pipeline to share the model group with the necessary permissions (create and update the model version) to the ML project team’s development account. Optionally, this model group can also be shared with their test and production accounts if local account access to model versions is needed.
The DS uses SageMaker Training jobs to generate metrics captured by MLflow, selects a candidate model, and registers the model version inside the shared model group in their local model registry.
Because this is a shared model group, the actual model version will be recorded in the shared services account model registry and a link will be maintained in the development account. The Amazon S3 model artifacts associated with the model will be copied to the shared services account when the model is registered in the shared services model registry.
The model group and associated model version will be synced into the model stage governance Amazon DynamoDB table with attributes such as model group, model version, model stage (development, test, production, and so on), model status (pending, approved, or rejected), and model metrics (in JSON format). The ML admin sets up this table with the necessary attributes based on their central governance requirements.
The model version is approved for deployment into the test stage and is deployed into the test account along with the necessary infrastructure for invoking the model, such as an Amazon API Gateway endpoint and AWS Lambda functions.
The model is integration tested in the test environment, and model test metrics are updated in the model stage governance table.
Model test results are validated, and the model version is approved for deployment into the production stage and is deployed into the production account along with the necessary infrastructure for invoking the model such as an API gateway and Lambda functions.
The model is A/B tested or optionally shadow tested in the production environment and model production metrics are updated in the model stage governance table. When satisfactory production results are attained, the model version is rolled out in the production environment.
The model governance (compliance) officer uses the governance dashboard to act on model governance functions such as reviewing the model to validate compliance and monitoring for risk mitigation.

Building a central model registry using model group resource sharing
Model group resource sharing makes it possible to build a central model registry with a few clicks or API calls, without needing to write complex IAM policies. We will demonstrate how to set up a central model registry based on the architecture we described in the previous sections. We will start by using the SageMaker Studio UI and then by using APIs. In both cases, we will demonstrate how to create a model package group in the ML Shared Services account (Account A) and share it with the ML Dev account (Account B) so that any updates to model versions in Account B automatically update the corresponding model versions in Account A.
Prerequisites
You need to have the following prerequisites in place to implement the solution in this post.

Two AWS accounts: one for development and another for shared services
IAM Role with access to create, update and delete SageMaker resources
Optional: enable Resource Sharing within AWS Organizations

After you have the prerequisites set up, start by creating and sharing a model group across accounts. The basic steps are:

In Account A, create a model group.
In Account A, create a resource share for the model group, and then attach permissions and specify the target account to share the resource. Permissions can be standard or custom.
Account B should accept the resource sharing invitation to start using the shared resource from Account A.
Optionally, if Account A and Account B are part of the same organization in AWS Organizations, and resource sharing is enabled within AWS Organizations, then the resource sharing invitation is accepted automatically without any manual intervention.

Create and share a model group across accounts using SageMaker Studio
The following section shows how to use SageMaker Studio to share models in a multi-account environment to build a central model registry. The following are instructions for using the SageMaker Studio UI in the AWS Management Console to create a model package group in the shared services account and share it, with the necessary permissions, with the ML Dev account.
To use the console to create and share a model package group:

In the SageMaker Studio console, sign in to Account A and navigate to the model registry, select the model package group (in this example, the credit-risk-package-group-1724904598), and choose Share.
In Account A, select the appropriate permissions to share the model package group with Account B. If you need to allow custom policy, navigate to the AWS RAM console and create the policy.
After selecting the permission policy, specify Account B (and any other accounts) to share the resource, then choose Share.
In Account B, navigate to the model registry, choose Shared with me, and then choose View pending approvals to see the model shared from Account A.
Accept the model invitation from Account A to access the shared model package group and its versions. When accounts are set up in the same organization, invitations will be accepted without requiring user intervention.

Create and share the model group across accounts using APIs
The following section shows how to use APIs to share models in a multi-account environment to build a central model registry. Create a model package group in the ML Shared Services account (Account A) and share it with the ML Dev account (Account B).
The following are the steps to create and share a model package group across accounts by using APIs.

In Account A, create a model package group.
In Account A, if needed, create custom sharing permissions; otherwise use standard sharing permissions.
In Account A, create a resource share for the model package group, attach permissions, and specify the target account to share the resource.
In Account B, accept the resource sharing invitation to start using the resource.
If Account A and B are part of the same organization, then the resource sharing invitation can be accepted without any manual intervention.

Run the following code in the ML Shared Services account (Account A).

import json
import time
import os
import boto3

region = boto3.Session().region_name

sm_client = boto3.client('sagemaker', region_name=region)

# Replace model package group name as per use case
model_package_group_name = "model-group-" + str(round(time.time()))
model_package_group_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "ModelPackageGroupDescription": "Sample model package group"
}

# Create Model package group with SageMaker client
create_model_package_group_response = sm_client.create_model_package_group(**model_package_group_input_dict)
model_package_group_arn = create_model_package_group_response['ModelPackageGroupArn']
print('ModelPackageGroup Arn : {}'.format(model_package_group_arn))

ram_client = boto3.client('ram')

# # Use this code path to create a custom permission
# # Custom permission template resource policy string
# policy_template = '{\n\t"Effect": "Allow",\n\t"Action": [\n\t\t"sagemaker:DescribeModelPackageGroup"\n\t]\n}'
# permission = ram_client.create_permission(
#     name="custom-permission" + str(round(time.time())),
#     resourceType="sagemaker:ModelPackageGroup",
#     policyTemplate=policy_template
# )
# print('Created Permission: {}'.format(permission['permission']['arn']))
# permission = permission['permission']['arn']

# Use this code path to use a managed permission
# It can be one of:
# 1. arn:aws:ram::aws:permission/AWSRAMDefaultPermissionSageMakerModelPackageGroup
# 2. arn:aws:ram::aws:permission/AWSRAMPermissionSageMakerModelPackageGroupAllowDeploy
# 3. arn:aws:ram::aws:permission/AWSRAMPermissionSageMakerModelPackageGroupAllowRegister
# More details :
permission = 'arn:aws:ram::aws:permission/AWSRAMDefaultPermissionSageMakerModelPackageGroup'

# Principals can be IAM User, Role, Account or Organization ID. Ref: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ram/client/create_resource_share.html
response = ram_client.create_resource_share(
    name="model-group-resource-share",
    resourceArns=[create_model_package_group_response['ModelPackageGroupArn']],
    principals=['12345'],  # Replace with the target AWS account ID (Account B)
    permissionArns=[permission]
)

resource_share_arn = response['resourceShare']['resourceShareArn']
print('Resource Share Arn : {}'.format(resource_share_arn))

Run the following code in the ML Dev account (Account B).

import json
import os
import boto3
from time import gmtime, strftime

region = boto3.Session().region_name

ram_client = boto3.client('ram')
response = ram_client.get_resource_share_invitations()
pending_invitations = []
# Review all pending invitations
for i in response['resourceShareInvitations']:
    if i['status'] == "PENDING":
        pending_invitations.append(i)
print(pending_invitations, sep='\n')

# Accept the resource share invitation from the central account
# Replace with the intended invitation arn for acceptance from the central account
if pending_invitations:
    response = ram_client.accept_resource_share_invitation(resourceShareInvitationArn=pending_invitations[0]['resourceShareInvitationArn'])
    print(response)

sm_client = boto3.client('sagemaker', region_name=region)

response = sm_client.list_model_package_groups(CrossAccountFilterOption="CrossAccount")

MLflow experimentation with the shared model group
The following section shows you how to use Amazon SageMaker with MLflow to track your experiments in the development account and save candidate models in the shared model group while developing a credit risk model. It’s a binary classification problem where the goal is to predict whether a customer is a credit risk. If you want to run the code in your own environment, check out the notebook in this GitHub repository.
SageMaker with MLflow is a capability of SageMaker that you can use to create, manage, analyze, and compare your ML experiments. To get started with MLflow, you need to set up an MLflow tracking server to monitor your experiments and runs. You can set up the server programmatically or by using the SageMaker Studio UI. It can take up to 20 minutes for the setup to complete. The following code snippet shows how to create a tracking server.

import boto3
from time import gmtime, strftime

sagemaker_client = boto3.client("sagemaker")
timestamp = strftime('%d-%H-%M-%S', gmtime())
# domain_id, bucket_name, and sm_role are assumed to be defined earlier in the notebook
server_name = f"mlflow-{domain_id}-{timestamp}"
response = sagemaker_client.create_mlflow_tracking_server(
    TrackingServerName=server_name,
    ArtifactStoreUri=f"s3://{bucket_name}/mlflow/{timestamp}",
    RoleArn=sm_role,
    AutomaticModelRegistration=True,
)

mlflow_arn = response['TrackingServerArn']

To set up an MLflow tracking server in SageMaker Studio, choose the MLflow application icon. When your server is running, choose the ellipsis button and then choose Open MLflow to open the MLflow UI.

Now that your MLflow tracking server is running, you can start tracking your experiments. MLflow tracking allows you to programmatically track the inputs, parameters, configurations, and models of your iterations as experiments and runs.

Runs are executions of some piece of data science code and record metadata and generated artifacts.
An experiment collects multiple runs with the same objective.

The following code shows you how to set up an experiment and track your executions while developing the credit risk model.
Data preparation
For this example, you will use the open source South German Credit dataset. To use the dataset to train the model, you need to first do some pre-processing. You can run the pre-processing code in your JupyterLab application or on a SageMaker ephemeral cluster as a SageMaker Training job using the @remote decorator. In both cases, you can track your experiments using MLflow.
The following code demonstrates how to track your experiments when executing your code on a SageMaker ephemeral cluster using the @remote decorator. To get started, set up a name for your experiment.

from time import gmtime, strftime
experiment_suffix = strftime('%d-%H-%M-%S', gmtime())
experiment_name = f"credit-risk-model-experiment-{experiment_suffix}"

The processing script creates a new MLflow active experiment by calling the mlflow.set_experiment() method with the experiment name above. After that, it invokes mlflow.start_run() to launch an MLflow run under that experiment.

import mlflow
from sagemaker.remote_function import remote

@remote(s3_root_uri=f"s3://{bucket_name}/{prefix}", dependencies="requirements.txt", instance_type="ml.m5.large")
def preprocess(df, experiment_name, mlflow_arn, bucket_name, prefix, run_id=None):
    try:
        suffix = strftime('%d-%H-%M-%S', gmtime())
        mlflow.set_tracking_uri(mlflow_arn)
        mlflow.set_experiment(experiment_name=experiment_name if experiment_name else f"credit-risk-model-experiment-{suffix}")
        run = mlflow.start_run(run_id=run_id) if run_id else mlflow.start_run(run_name=f"remote-processing-{suffix}", nested=True)
        ...
    except Exception as e:
        print(f"Exception in processing script: {e}")
        raise e
    finally:
        mlflow.end_run()

You can also log the input dataset and the sklearn model used to fit the training set during pre-processing as part of the same script.

model_dataset = mlflow.data.from_pandas(df)
mlflow.log_input(model_dataset, context="model_dataset")

...

featurizer_model = transformer.fit(X)
features = featurizer_model.transform(X)
labels = LabelEncoder().fit_transform(y)

...

mlflow.sklearn.log_model(
    sk_model=featurizer_model,
    artifact_path="processing/model",
    registered_model_name="sk-learn-model",
)

In the MLflow UI, use the Experiments page to locate your experiment. Its name should start with “credit-risk-model-experiment”.

Click on the experiment name to reveal the table with the associated Runs, and then click on the Run whose name starts with “remote-processing”. You will see its details as shown in the following figure.

Click on the Artifacts tab to see the MLflow model that was generated.

Model training
You can continue experimenting with different feature engineering techniques in your JupyterLab environment and track your experiments in MLflow. After you have completed the data preparation step, it’s time to train the classification model. You can use the xgboost algorithm for this purpose and run your code either in your JupyterLab environment or as a SageMaker Training job. Again, you can track your experiments using MLflow in both cases. The following example shows how to use MLflow with a SageMaker Training job in your code. You can use the method mlflow.autolog() to log metrics, parameters, and models without the need for explicit log statements.

import xgboost
import pickle as pkl
import os
import mlflow
import tarfile

@remote(s3_root_uri=f"s3://{bucket_name}/{prefix}", dependencies="requirements.txt", instance_type="ml.m5.large")
def train(X, val_X, y, val_y, num_round, params, mlflow_arn, experiment_name, run_id=None):
    output_path = "/opt/ml/model"
    mlflow.set_tracking_uri(mlflow_arn)
    mlflow.autolog()

    suffix = strftime('%d-%H-%M-%S', gmtime())
    mlflow.set_experiment(experiment_name=experiment_name if experiment_name else f"credit-risk-model-experiment-{suffix}")
    run = mlflow.start_run(run_id=run_id) if run_id else mlflow.start_run(run_name=f"remote-training-{suffix}", nested=True)

    try:
        os.makedirs(output_path, exist_ok=True)
        print(f"Directory '{output_path}' created successfully.")
    except OSError as e:
        print(f"Error creating directory '{output_path}': {e}")

    dtrain = xgboost.DMatrix(X, label=y)
    dval = xgboost.DMatrix(val_X, label=val_y)

    watchlist = [(dtrain, "train"), (dval, "validation")]
    mlflow.log_params(params)

    print("Training the model")
    evaluation_results = {}
    bst = xgboost.train(
        params=params, dtrain=dtrain, evals=watchlist,
        num_boost_round=num_round, evals_result=evaluation_results
    )
    pkl.dump(bst, open(output_path + "/model.bin", "wb"))

    # Compress the model.bin artifact to a tar file
    tar_filename = f"{output_path}/model.tar.gz"
    with tarfile.open(tar_filename, "w:gz") as tar:
        tar.add(f"{output_path}/model.bin", arcname="model.bin")

    mlflow.log_artifact(local_path=tar_filename)

In addition, you can use the mlflow.log_artifact() method to save the model.tar.gz file in MLflow so that you can directly use it later when you register the model to the model registry.
Navigate back to the MLflow UI. Click on the name of your experiment at the top of your screen starting with “credit-risk-model-experiment” to see the updated Runs table. Click on the name of your remote-training Run to see the overview of the training run including the associated hyperparameters, model metrics, and generated model artifacts.
The following figure shows the overview of a training run.

Click on the Model metrics tab to view the metrics tracked during the training run. The figure below shows the metrics of a training run.

Click on the Artifacts tab to view the artifacts generated during the training run. The following figure shows an example of the generated artifacts.

Registering the model to the model registry
ML experimentation is an iterative process and you typically end up with a number of candidate models. With MLflow, you can compare these models to identify the one that you want to move to quality assurance for approval. The following is an example of how to retrieve the best candidate using the MLflow API based on a specific metric.

from mlflow.entities import ViewType

run_filter = """
attributes.run_name LIKE "%training%"
and attributes.status = 'FINISHED'
"""

runs_with_filter = mlflow.search_runs(
    experiment_names=[experiment_name],
    run_view_type=ViewType.ACTIVE_ONLY,
    filter_string=run_filter,
    order_by=["metrics.`validation-auc` DESC"],
)
best_run = runs_with_filter[:1]
artifact_uri = best_run['artifact_uri'][0]

After you have selected a model, you can register it to the shared model group in the shared services account. You can discover the model groups that are available to you either through the SageMaker Studio UI or programmatically.
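
For example, a short sketch of the programmatic route, reusing the cross-account filter shown earlier and assuming you can identify the shared group by its name, could look like the following:

import boto3

sagemaker_client = boto3.client("sagemaker")

# List the model package groups shared with this account through AWS RAM
shared_groups = sagemaker_client.list_model_package_groups(
    CrossAccountFilterOption="CrossAccount"
)["ModelPackageGroupSummaryList"]

# Pick the shared group to register new model versions into (the name filter is an example)
model_package_group_arn = next(
    group["ModelPackageGroupArn"]
    for group in shared_groups
    if "credit-risk-package-group" in group["ModelPackageGroupName"]
)
print(model_package_group_arn)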

The final step is to register the candidate model to the model group as a new model version.

modelpackage_inference_specification = {
    "InferenceSpecification": {
        "Containers": [
            {
                "Image": "885854791233.dkr.ecr.us-east-1.amazonaws.com/sagemaker-distribution-prod@sha256:9e7622bbe2f3ee9dd516797bfe3ed310983b96190eeefbdeeeea69519d3946fe",
                "ModelDataUrl": f"{artifact_uri}/model.tar.gz"
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    # Register the version into the shared model package group by using its ARN
    "ModelPackageGroupName": model_package_group_arn,
    "ModelPackageDescription": "Model to detect credit risk",
    "ModelApprovalStatus": "PendingManualApproval"
}

create_model_package_response = sagemaker_client.create_model_package(**modelpackage_inference_specification)
model_package_arn = create_model_package_response["ModelPackageArn"]
print('ModelPackage Version ARN : {}'.format(model_package_arn))

Design considerations for use case and model stage governance
A use case and model stage governance construct tracks governance information for a use case or model across the various stages of its journey to production. It also captures key model performance and drift metrics periodically so that they can be surfaced for governance functions.
There are several use case and model stage governance attributes that need to be tracked, such as the following:

Use case ID: Unique ID of the use case.
Use case name: Name of the use case.
Use case stage: Current stage of the use case. For example, proof of concept, development, QA, and so on.
Model group: SageMaker model group name.
Model version: SageMaker model version name.
Model owner: Person or entity who owns the model.
Model LoB: Model owner’s line of business.
Model project: Project or use case that the model is part of.
Model stage: Stage where the model version is deployed. For example, development, test, or production.
Model status: Status of the model version in a given stage. For example, pending or approved.
Model risk: Risk categorization of the model version. For example, high, medium, or low.
Model validation metrics: Model validation metrics in JSON format.
Model monitoring metrics: Model monitoring metrics in JSON format. This needs to include the endpoint from which these metrics were captured.
Model audit timestamp: Timestamp when this record was updated.
Model audit user: User who updated this record.

Create a use case or model stage governance construct with the preceding set of attributes and drive your deployment and governance workflows using this table. The following sketch illustrates one way to set up such a construct; after that, we describe the design considerations for the deployment and governance workflows.
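
This is only a minimal sketch: the table name, key schema, and attribute values here are assumptions for illustration, not part of any SageMaker or DynamoDB feature.

import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")

# One-time setup: a simple table keyed by use case ID and a model version/stage composite key
table = dynamodb.create_table(
    TableName="model-stage-governance",
    KeySchema=[
        {"AttributeName": "UseCaseId", "KeyType": "HASH"},
        {"AttributeName": "ModelVersionStage", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "UseCaseId", "AttributeType": "S"},
        {"AttributeName": "ModelVersionStage", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Record a development-stage entry for a model version
table.put_item(
    Item={
        "UseCaseId": "credit-risk-001",
        "ModelVersionStage": "credit-risk-package-group#1#development",
        "UseCaseName": "Credit risk",
        "UseCaseStage": "development",
        "ModelGroup": "credit-risk-package-group",
        "ModelVersion": "1",
        "ModelOwner": "ds-team-lead",
        "ModelStage": "development",
        "ModelStatus": "pending",
        "ModelRisk": "high",
        "ModelAuditTimestamp": datetime.now(timezone.utc).isoformat(),
        "ModelAuditUser": "ml-admin",
    }
)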
Design considerations for deployment and governance workflows
The following are the design considerations for the deployment and governance workflows:

The model version is built in the development account and registered with pending status in the central model registry or model group.
A sync process is triggered to capture the key model attributes, derive additional governance attributes, and create a development stage record in the model governance stage table. Model artifacts from the development account are synced into the central model registry account.
The model owner approves the model version in the development stage for deployment to the test stage in the central model registry.
A deployment pipeline is triggered and the model is deployed to the test environment and a new test stage record is created for that model version.
The model version is tested and validated in the test environment and validation metrics are captured in the test stage record in the model governance stage construct.
The governance officer verifies the model validation results and approves the model version for deployment to production. The production stage record is created for the model version in the model governance stage table.
A deployment pipeline is triggered and the model is deployed to the production environment and the production stage record model status is updated to deployed for that model version.
After the model monitoring jobs are set up, model inference metrics are periodically captured and aggregated, and the model metrics are updated in the model stage governance table.
The use case stage value is updated to the next stage when all models for that use case are approved in the previous stage.

Conclusion
In this post, we have discussed how to centralize your use case and model governance function in a multi-account environment using the new model group sharing feature of SageMaker Model Registry. We shared an architecture for setting up central use case and model governance and walked through the steps involved in building that architecture. We provided practical guidance for setting up cross-account model group sharing using SageMaker Studio and APIs. Finally, we discussed key design considerations for building the centralized use case and model governance functions to extend the native SageMaker capabilities. We encourage you to try this model-sharing feature along with centralizing your use case and model governance functions. You can leave feedback in the comments section.

About the authors
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 3-year-old Sheepadoodle.
Anastasia Tzeveleka is a Senior Generative AI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers across EMEA build foundation models and create scalable generative AI and machine learning solutions using AWS services.
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.
Madhubalasri B. is a Software Development Engineer at Amazon Web Services (AWS), focusing on the SageMaker Model Registry and machine learning governance domain. She has expertise in cross-account access and model sharing, ensuring secure, scalable, and compliant deployment of machine learning models. Madhubalasri is dedicated to driving innovation in ML governance and optimizing model management processes.
Saumitra Vikaram is a Senior Software Engineer at AWS. He is focused on AI/ML technology, ML model management, ML governance, and MLOps to improve overall organizational efficiency and productivity.
Keshav Chandak is a Software Engineer at AWS with a focus on the SageMaker Repository Service. He specializes in developing capabilities to enhance governance and management of ML models.

3 Meta Ads Strategies DTC Brands Are Using to Engage Audiences

When it comes to digital ads, there’s Meta and then there’s everyone else.

Meta ads have become the go-to platform for DTC brands looking to get right in front of their customers. But just because you’re on the Meta train doesn’t mean success is a guarantee.

With a crowded space, you need more than just a catchy headline or a cool photo to stand out. You need a Meta ads strategy that actually connects, converts, and keeps customers coming back for more. Sounds easy right?

Here’s the thing: most people aren’t on Meta to be sold to.

They’re there to scroll, laugh at memes, maybe find a dinner recipe, or catch up on some dog videos. So, if you want your ad to grab attention, it needs to be more than just an ad…it has to be an experience.

That’s where the magic of a strong Meta ads strategy comes in.

We’ve gathered insights from 101 top-performing DTC brands and broken down their winning formulas into three simple strategies. These are the same tactics that make you stop scrolling, take a second look, and maybe even hit that “Add to Cart” button.

So let’s dive in and see how these brands are making Meta ads work in their favor.

Spoiler alert: it’s all about knowing your audience, keeping it real, and telling a story that people actually care about.

Ready to level up your Meta ads game? Let’s do this!

Meta Ads Strategy 1: Audience-Centric Creativity

Let’s be real. Everyone likes to feel special, right? And that’s exactly why tailoring your Meta ads to fit specific audience groups is such a powerful move.

The days of casting a wide net and hoping for the best are over. Now, it’s all about creating Meta ads that feel custom-made for the people you want to reach. Because when an ad hits close to home, it stops being just another scroll and starts feeling like it was crafted just for them.

Connect with Your Audience Through Tailored Meta Ads

When it comes to Meta ads, a one-size-fits-all approach is a thing of the past.

The most effective ads speak directly to a specific group, hitting just the right notes that make people think, “Hey, that’s me!”

That’s the beauty of an audience-centric Meta ads strategy: the more you understand your audience, the easier it is to craft ads that feel personal, relevant, and irresistible.

Why Audience-Centric Ads Just Work

The brands that win are those that make their audience feel seen. Why?

Because people don’t connect with brands! They connect with messages, stories, and values that resonate with their lives.

When an ad addresses their unique needs, aspirations, or challenges, it stops being “just an ad” and becomes something worth paying attention to.

Take ThirdLove, for example. Instead of generic messages about their bras, they zero in on specific comfort issues, like support for active lifestyles or fit challenges for certain body types. Their ads speak directly to women looking for better-fitting bras, with visuals and messaging that say, “We get you.”

BRUNT Workwear is another example that nails audience-centric creativity by showing their rugged, waterproof sweatshirt in action with a simple but effective demo.

In one Meta ad, they spray a hose directly on the sweatshirt, instantly proving its durability. There’s no fluff or over-the-top production—just a straightforward visual that says, “We know what you need, and we’re here to deliver.”

Both of these ads resonate with their specific audience by using visuals that focus on needs and functionality.

Personalizing Your Meta Ads: What Works

Crafting Meta ads tailored for your audience means diving into their world and addressing the details that matter most.

Here are five ways top brands do it:

1. Speak Their Language

If your audience loves running, hiking, or hitting the gym, use language that matches that vibe. Talk about “support that moves with you” or “comfort that keeps up.” People will instantly know you’re speaking to them.

2. Address Their Pain Points

Great Meta ads don’t just show off products; they solve problems. True Classic tees, for example, run ads that hint at common frustrations with fit and quality—without naming names.

The message is clear: these tees give you quality without breaking the bank.

3. Show Your Product in Action

Ads that feature the product in real-life settings are game-changers. Think about Vuori, whose Meta ads often show people stretching, running, or simply relaxing in their clothing.

Each shot captures the comfort and versatility their audience is looking for.

4. Use Visuals that Resonate

Use images and settings that match your audience’s lifestyle. Whether it’s at the gym, in the kitchen, or on the go, let your visuals do the talking.

Kenny Flowers captures lifestyle-centric visuals by featuring their vibrant, tropical shirts in a TikTok-style Meta ad that feels modern and relatable. In one ad, they use casual “talking head” clips and beachy backdrops to show people living their best island life.

The ad captures the relaxed, fun vibe that their audience, often young travelers or beach lovers, craves.

Plus, this ad doesn’t just show the shirts. It shows the lifestyle that comes with wearing them.

5. Highlight Unique Benefits

Focus on the benefits that matter most to your audience. Maybe it’s comfort, durability, or price. Whatever it is, make it crystal clear why your product fits their needs.

When done right, an audience-centric Meta ads strategy isn’t just more effective. It’s the difference between an ad that blends in and an ad that actually connects.


Meta Ads Strategy 2: Authenticity and Realness

When it comes to Meta ads, brands that keep it real are the ones making the strongest connections.

Remember – customers are savvy. They want to know there’s a real story behind the ad, not just another polished sales pitch. That’s why authenticity is such a powerful tool in building loyalty. Showing the true, unfiltered side of your brand invites people in and creates trust that goes beyond a single purchase.

Why Authenticity Wins

An authentic Meta ads strategy can be the difference between scrolling by and stopping to engage.

Customers want to feel confident that they’re buying from a brand that’s transparent and relatable. When DTC brands use real customers, everyday scenarios, or behind-the-scenes content, it breaks down that barrier between the brand and the audience.

And that’s where loyalty starts.

Keeping It Real: How DTC Brands Do It

Some of the most effective Meta ads look more like a friend’s post than a typical ad. Here are a couple of examples of how DTC brands use realness to connect with their audience:

Show Real People, Real Reactions

OneBlade uses a video of people reacting to their razors in a giveaway ad.

The unscripted excitement and the casual, friendly vibe make it feel genuine and fun—more like a celebration than a sales pitch. It’s an ad that says, “We’re confident in our product, and you’ll love it too.”

Use Influencers for a Candid Feel

DRMTLGY teams up with influencers who casually talk about the products in everyday settings—no scripted lines, just genuine reactions. One ad even features an influencer in her bathroom, casually discussing the product benefits.

By showing the product as part of real routines, DRMTLGY gives off a vibe of transparency and trustworthiness.

Tips for Adding Realness to Your Meta Ads

Want to give your Meta ads a dose of realness? Here are three ways to keep things authentic:

1. Use User-Generated Content (UGC):

People trust real customers more than models or actors. Featuring UGC is like saying, “Don’t just take our word for it – see what our customers have to say.”

2. Share Behind-the-Scenes Moments:

Letting people peek behind the curtain is a great way to build trust. It could be a quick team video, a snapshot from the warehouse, or a “making of” clip that gives your brand a personal touch.

3. Focus on Everyday Scenarios:

Show how your product fits into daily life. Maybe it’s being used in the kitchen, the gym, or the morning commute. Keeping it grounded in real-life settings helps viewers relate instantly.

Look, a little authenticity can go a long way in Meta ads. When people feel like they’re seeing the real side of your brand, they’re more likely to stick around, engage, and ultimately trust you.

Meta Ads Strategy 3: Visual Storytelling

When you’re scrolling through Meta, a strong visual can do the talking before you even read a single word.

That’s the power of visual storytelling for Meta ads strategy: it captures attention instantly and gets your message across in seconds.

For DTC brands, it’s the perfect way to convey the “why” behind the product, showing off benefits that resonate right away.

Why Visual Storytelling Works

We all know people process visuals faster than text, so when an ad tells a quick, compelling story through imagery, it’s like a shortcut to engagement.

Visual storytelling is your chance to showcase transformations, highlight features, and give people a peek at how your product fits into their lives – all without needing a ton of words.

Done right, it makes your product feel real, relatable, and easy to imagine using.

How Top DTC Brands Nail Visual Storytelling

These brands are pros at showing—not just telling—why their product matters:

Show Transformation with Before-and-After Shots

Knix grabs attention by showing the difference between their modern period underwear and old-school pads in a straightforward visual comparison.

The message? Knix offers a fresh, practical alternative. It’s quick, powerful, and lets the visuals do the convincing.

Demonstrate Your Product in Action

Misen uses a carousel ad featuring their chef’s knife slicing through a variety of foods and materials, showing its strength and versatility.

With each slide, viewers get a clear, visual demonstration of the product’s quality without needing a single line of heavy text.

Tips for Bringing Visual Storytelling into Your Meta Ads

Want to make your visuals do the heavy lifting? Here are some simple but effective ways to weave storytelling into your Meta ads:

Use Before-and-After Comparisons

Whether it’s a skincare routine or a kitchen gadget, before-and-after shots can highlight the product’s impact. A powerful transformation can stop the scroll and make viewers think, “I want that result.”

Feature the Product in Real-Life Scenarios

Help people visualize the product in their daily life by showing it in action. From home use to outdoor adventures, real-life settings add context that makes your ad feel relevant.

Create a Narrative with Carousel or Video Formats

If you have a lot to show, try using a carousel or video. Each slide or scene can reveal a new benefit, feature, or transformation, pulling viewers into a mini-story that showcases your product’s full range.

With a Meta ads strategy built around visual storytelling, your ads become more than just promotions. They become experiences that draw people in and make them feel connected to your brand.

Key Takeaways: Building a Winning Meta Ads Strategy

When it comes down to it, a successful Meta ads strategy isn’t about following a one-size-fits-all formula; it’s about experimenting, refining, and discovering what works best for your unique audience.

By focusing on these three core strategies, you’re setting your brand up to stand out and connect on a deeper level:

Audience-Centric Creativity: Speak directly to your audience with tailored visuals and messaging that make them feel seen.

Authenticity and Realness: Build trust by keeping it real, whether that’s through user-generated content, behind-the-scenes shots, or candid moments.

Visual Storytelling: Use strong visuals to create an instant connection, showcasing your product’s value through before-and-afters, demos, or relatable settings.

Each of these strategies brings something powerful to the table, and when combined, they turn your Meta ads into more than just ads. They become part of a customer’s journey with your brand.

Don’t be afraid to experiment, switch things up, and find what resonates most with your audience.

Ready to dive deeper into what makes Meta ads truly work?

Download our Meta ads guide, 101 Meta Ads from Top DTC Brands to Inspire Your Next Campaign, for examples and insider strategies to help you craft scroll-stopping campaigns that engage and convert.

Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 500 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.

FAQs: Meta Ads Strategy for DTC Brands

1. What makes a good Meta ads strategy?
A good Meta ads strategy focuses on audience targeting, authentic messaging, and compelling visuals that connect with viewers. Understanding your audience and tailoring your ads to their needs is essential.

2. How can I ensure my Meta ads are reaching the right audience?
Use Meta’s advanced targeting options to segment by interests, behaviors, location, and demographics. Regularly monitor and adjust your strategy based on engagement metrics to keep your ads on track.

3. Why is visual storytelling important in a Meta ads strategy?
Visual storytelling captures attention faster than text and creates an emotional connection. This helps viewers quickly understand your brand’s value and feel more inclined to engage with the ad.

4. How often should I refresh my Meta ads strategy?
Refreshing your Meta ads strategy every few weeks is ideal to keep it aligned with audience trends and engagement data. Testing new visuals, messaging, and targeting can also improve ad performance over time.

5. What are some effective ways to build authenticity in Meta ads?
Use real customer testimonials, user-generated content, or behind-the-scenes visuals. Authenticity helps build trust and resonates more with viewers, making your ads feel relatable.

6. How can I tailor Meta ads for different audience segments?
Segment your audience based on key characteristics (e.g., age, interests, or purchasing behavior) and create specific ads that speak directly to each segment. Personalizing the message increases relevance and engagement.

7. What budget should I allocate for my Meta ads strategy?
Budgets vary depending on campaign goals, but start small, monitor performance, and adjust as needed. Consider testing different ad types and allocate more budget to the best-performing ones.

8. How can I measure the success of my Meta ads strategy?
Key metrics to track include click-through rate (CTR), engagement rate, conversion rate, and return on ad spend (ROAS). Analyzing these metrics helps you understand what’s working and where to adjust (see the quick calculation sketch after this FAQ list).

9. What role does ad copy play in a Meta ads strategy?
Ad copy is crucial as it provides context and adds emotional appeal. It should be clear, engaging, and aligned with the visuals to create a cohesive message that resonates with the audience.

10. Are carousel ads effective in a Meta ads strategy?
Yes, carousel ads allow you to showcase multiple products, features, or storytelling steps in a single ad. They’re especially effective for demonstrating product versatility or highlighting different benefits.

11. How can video ads enhance my Meta ads strategy?
Video ads are highly engaging and great for storytelling. They’re ideal for showcasing product demonstrations, sharing customer testimonials, or creating a narrative that holds viewers’ attention.

12. How does Meta ads strategy differ for DTC brands compared to other businesses?
DTC brands often focus more on brand storytelling, customer experience, and building direct relationships with customers. A Meta ads strategy for DTC brands emphasizes authenticity, engagement, and conversion.

13. How do I test different Meta ads strategies effectively?
Run A/B tests on various elements, such as visuals, copy, or CTA. Monitor results to see which performs best and apply those insights to future ads.

14. What’s the best way to incorporate UGC into my Meta ads strategy?
UGC adds social proof and authenticity. Feature real customer reviews, testimonials, or photos of them using your product to make the ad feel genuine and trustworthy.

15. How can I make my Meta ads stand out from the competition?
Focus on what makes your brand unique—whether it’s a specific product feature, brand story, or value proposition—and use visuals and messaging that are fresh and eye-catching.

16. What’s the importance of A/B testing in a Meta ads strategy?
A/B testing allows you to compare different ad variations and see what resonates most with your audience. It’s essential for refining your ads and improving overall performance.

17. How does frequency affect my Meta ads strategy?
If an ad’s frequency is too high, viewers may experience ad fatigue. Monitor frequency and refresh or adjust your ads to keep engagement high without overwhelming your audience.

18. Should I focus more on brand awareness or conversions in my Meta ads strategy?
It depends on your goals. For new customers, focus on brand awareness. For retargeting or established audiences, prioritize conversion-based ads to drive actions like purchases.

19. How can I improve my Meta ads’ CTR?
Optimize visuals and ad copy to make them more compelling, use clear CTAs, and ensure your ad targets the right audience. Experiment with different ad formats and styles to find what grabs attention.

20. Why is audience targeting so crucial in a Meta ads strategy?
Audience targeting ensures your ads are shown to the people most likely to be interested in your brand. When your ads reach the right audience, engagement and conversion rates increase significantly.
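As a companion to FAQ 8, here’s a minimal sketch of how those core metrics are calculated. The figures are made up purely for illustration, and the snippet doesn’t touch any Meta API; in practice you’d plug in the impressions, clicks, conversions, revenue, and spend reported by Meta Ads Manager or your analytics export.

```python
# Rough sketch: computing core Meta ads metrics from raw campaign numbers.
# All figures below are hypothetical; substitute your real campaign data.

impressions = 120_000    # times the ad was shown
clicks = 2_400           # link clicks on the ad
conversions = 96         # purchases attributed to the ad
revenue = 7_200.00       # revenue from those purchases ($)
ad_spend = 1_800.00      # total spend on the campaign ($)

ctr = clicks / impressions           # click-through rate
conversion_rate = conversions / clicks  # share of clicks that became purchases
roas = revenue / ad_spend            # return on ad spend

print(f"CTR:             {ctr:.2%}")              # 2.00%
print(f"Conversion rate: {conversion_rate:.2%}")  # 4.00%
print(f"ROAS:            {roas:.2f}x")            # 4.00x
```

The same raw counts are what you compare side by side when A/B testing (FAQs 13 and 16): run the calculation for each ad variation, then shift more budget toward the one that wins.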