Scaling Laws and Model Comparison: New Frontiers in Large-Scale Machine Learning

Large language models (LLMs) have gained significant attention in machine learning, shifting the focus from optimizing generalization on small datasets to reducing approximation error on massive text corpora. This paradigm shift presents researchers with new challenges in model development and training methodologies. The primary objective has evolved from preventing overfitting through regularization techniques to effectively scaling up models to consume vast amounts of data. Researchers now face the challenge of balancing computational constraints with the need for improved performance on downstream tasks. This shift necessitates a reevaluation of traditional approaches and the development of robust strategies to harness the power of large-scale language pretraining while addressing the limitations imposed by available computing resources.

The shift from a generalization-centric paradigm to a scaling-centric paradigm in machine learning has necessitated reevaluating traditional approaches. Google DeepMind researchers have identified key differences between these paradigms, focusing on minimizing approximation error through scaling rather than reducing generalization error through regularization. This shift challenges conventional wisdom, as practices that were effective in the generalization-centric paradigm may not yield optimal results in the scaling-centric approach. The phenomenon of “scaling law crossover” further complicates matters, as techniques that enhance performance at smaller scales may not translate effectively to larger ones. To mitigate these challenges, researchers propose developing new principles and methodologies to guide scaling efforts and effectively compare models at unprecedented scales where conducting multiple experiments is often infeasible.

Machine learning aims to develop functions capable of making accurate predictions on unseen data by understanding the underlying structure of the data. This process involves minimizing the test loss on unseen data while learning from a training set. The test error can be decomposed into the generalization gap and the approximation error (training error).
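
Spelled out in the article’s terms, this decomposition is just the identity below (the notation is added for illustration and is not taken from the paper):

```latex
% Test loss splits into the approximation (training) error plus the generalization gap.
\mathcal{L}_{\text{test}}
  = \underbrace{\mathcal{L}_{\text{train}}}_{\text{approximation error}}
  + \underbrace{\left(\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}\right)}_{\text{generalization gap}}
```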

Two distinct paradigms have emerged in machine learning, differentiated by the relative and absolute scales of data and models:

1. The generalization-centric paradigm, which operates with relatively small data scales, is further divided into two sub-paradigms:

   a) The classical bias-variance trade-off regime, where model capacity is intentionally constrained.

   b) The modern over-parameterized regime, where model scale significantly surpasses data scale.

2. The scaling-centric paradigm, characterized by large data and model scales, with data scale exceeding model scale.

These paradigms present different challenges and require distinct approaches to optimize model performance and achieve desired outcomes.

The proposed method employs a decoder-only transformer architecture trained on the C4 dataset, utilizing the NanoDO codebase. Key architectural features include Rotary Positional Embedding, QK-Norm for attention computation, and untied head and embedding weights. The model uses Gelu activation with F = 4D, where D is the model dimension and F is the hidden dimension of the MLP. Attention heads are configured with a head dimension of 64, and the sequence length is set to 512.

The model’s vocabulary size is 32,101, and the total parameter count is approximately 12D²L, where L is the number of transformer layers. Most models are trained to Chinchilla optimality, using 20 × (12D²L + DV) tokens, where V is the vocabulary size. Training compute is estimated with the standard approximation C ≈ 6ND floating-point operations, where N is the parameter count and D in this formula denotes the number of training tokens rather than the model dimension.
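
To make this bookkeeping concrete, here is a small back-of-the-envelope sketch of these formulas in Python; the model width and depth are illustrative values, not configurations reported in the paper.

```python
# Approximate sizing for a Chinchilla-optimal run using the formulas quoted above.
def chinchilla_budget(d_model: int, n_layers: int, vocab: int = 32_101):
    params = 12 * d_model**2 * n_layers        # approximate transformer parameter count (12D^2L)
    tokens = 20 * (params + d_model * vocab)   # Chinchilla-optimal token budget: 20 x (12D^2L + DV)
    flops = 6 * params * tokens                # standard C ~ 6 x params x tokens compute estimate
    return params, tokens, flops

params, tokens, flops = chinchilla_budget(d_model=1024, n_layers=12)  # illustrative width/depth
print(f"params ~ {params:.2e}, tokens ~ {tokens:.2e}, FLOPs ~ {flops:.2e}")
```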

For optimization, the method employs AdamW with β1 = 0.9, β2 = 0.95, ϵ = 1e-20, and a coupled weight decay λ = 0.1. This combination of architectural choices and optimization strategies aims to enhance the model’s performance in the scaling-centric paradigm.
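
For reference, a roughly equivalent optimizer configuration can be written with PyTorch’s AdamW as below. The paper’s NanoDO codebase is JAX-based, and the model and learning rate here are placeholders, so treat this as a sketch of the stated hyperparameters rather than the authors’ setup.

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder standing in for the transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                    # illustrative; the section does not state the peak learning rate
    betas=(0.9, 0.95),
    eps=1e-20,
    weight_decay=0.1,
)
```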

In the scaling-centric paradigm, traditional regularization techniques are being reevaluated for their effectiveness. Three popular regularization methods commonly used in the generalization-centric paradigm are explicit L2 regularization and the implicit regularization effects of large learning rates and small batch sizes. These techniques have been instrumental in mitigating overfitting and reducing the gap between training and test losses in smaller-scale models.

However, in the context of large language models and the scaling-centric paradigm, the necessity of these regularization techniques is being questioned. As models operate in a regime where overfitting is less of a concern due to the vast amount of training data, the traditional benefits of regularization may no longer apply. This shift prompts researchers to reconsider the role of regularization in model training and to explore alternative approaches that may be more suitable for the scaling-centric paradigm.

The scaling-centric paradigm presents unique challenges in model comparison as traditional validation set approaches become impractical at massive scales. The phenomenon of scaling law crossover further complicates matters, as performance rankings observed at smaller scales may not hold true for larger models. This raises the critical question of how to effectively compare models when training is feasible only once at scale.

In contrast, the generalization-centric paradigm relies heavily on regularization as a guiding principle. This approach has led to insights into hyperparameter choices, weight decay effects, and the benefits of over-parameterization. It also explains the effectiveness of techniques like weight sharing in CNNs, locality, and hierarchy in neural network architectures.

However, the scaling-centric paradigm may require new guiding principles. While regularization has been crucial for understanding and improving generalization in smaller models, its role and effectiveness in large-scale language models are being reevaluated. Researchers are now challenged to develop robust methodologies and principles that can guide the development and comparison of models in this new paradigm, where traditional approaches may no longer apply.

Ovis-1.6: An Open-Source Multimodal Large Language Model (MLLM) Architecture Designed to Structurally Align Visual and Textual Embeddings

Artificial intelligence (AI) is transforming rapidly, particularly in multimodal learning. Multimodal models aim to combine visual and textual information to enable machines to understand and generate content that requires inputs from both sources. This capability is vital for tasks such as image captioning, visual question answering, and content creation, where more than a single data mode is required. While many models have been developed to address these challenges, only some have effectively aligned the disparate representations of visual and textual data, leading to inefficiencies and suboptimal performance in real-world applications.

A significant challenge in multimodal learning arises from how text and image data are encoded and represented. Textual data are typically defined using embeddings derived from a lookup table, ensuring a structured and consistent format. In contrast, visual data are encoded using vision transformers, which produce unstructured continuous embeddings. This discrepancy in representation makes it difficult for existing multimodal models to fuse visual and textual data seamlessly. As a result, models struggle to interpret complex visual-textual relationships, limiting their capabilities in advanced AI applications that require coherent understanding across multiple data modalities.

Traditionally, researchers have attempted to mitigate this problem by using a connector, such as a multi-layer perceptron (MLP), to project visual embeddings into a space that can be aligned with textual embeddings. While effective in standard multimodal tasks, this architecture does not resolve the fundamental misalignment between visual and textual embeddings. Leading models like LLaVA and Mini-Gemini incorporate advanced methods like cross-attention mechanisms and dual vision encoders to improve performance. However, they still face limitations due to the inherent differences in tokenization and embedding strategies, highlighting the need for a novel approach that addresses these issues at a structural level.

A research team from Alibaba Group and Nanjing University introduced Ovis 1.6, a new multimodal large language model (MLLM) that structurally aligns visual and textual embeddings to address this challenge. Ovis employs a unique visual embedding look-up table, similar to the one used for textual embeddings, to create structured visual representations. This table enables the visual encoder to produce embeddings compatible with textual embeddings, resulting in more effective integration of visual and textual information. The model also utilizes probabilistic tokens for visual patches, which are mapped into the visual embedding table multiple times. This approach mirrors the structured representation used in textual data, facilitating a coherent combination of visual and textual inputs.

Ovis’s core innovation lies in using a visual embedding table that aligns visual tokens with their textual counterparts. A probabilistic token represents each image patch and indexes the visual embedding table multiple times to generate a final visual embedding. This process captures the rich semantics of each visual patch and results in embeddings structurally similar to textual tokens. In contrast to conventional methods, which rely on linear projections to map visual embeddings into a joint space, Ovis adopts a probabilistic approach to generate more meaningful visual embeddings. This method enables Ovis to overcome the limitations of connector-based architectures and achieve better performance in multimodal tasks.
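
A minimal sketch of this idea is shown below, assuming, as an illustration rather than the paper’s exact implementation, that each patch’s probabilistic token is a softmax over a visual vocabulary and its embedding is the probability-weighted mix of the table’s rows.

```python
import torch
import torch.nn as nn

class ProbabilisticVisualEmbedding(nn.Module):
    def __init__(self, patch_dim: int, visual_vocab: int, embed_dim: int):
        super().__init__()
        self.to_logits = nn.Linear(patch_dim, visual_vocab)  # patch features -> visual-token logits
        self.table = nn.Embedding(visual_vocab, embed_dim)   # visual embedding look-up table

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, patch_dim)
        probs = self.to_logits(patch_features).softmax(dim=-1)  # probabilistic visual tokens
        return probs @ self.table.weight                        # probability-weighted mix of table rows

# Example: 16 patches of 1024-d features mapped to text-compatible 4096-d embeddings (sizes are illustrative).
layer = ProbabilisticVisualEmbedding(patch_dim=1024, visual_vocab=8192, embed_dim=4096)
visual_embeds = layer(torch.randn(2, 16, 1024))  # shape: (2, 16, 4096)
```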

Empirical evaluations of Ovis demonstrate its superiority over other open-source MLLMs of similar sizes. For instance, in the MathVista-Mini benchmark, Ovis scored 1808, significantly higher than its competitors. Similarly, in the RealWorldQA benchmark, Ovis outperformed leading proprietary models such as GPT4V and Qwen-VL-Plus, scoring 2230, compared to GPT4V’s 2038. These results highlight Ovis’s strength in handling complex multimodal tasks, making it a promising candidate for future advancements in the field. The researchers also evaluated Ovis on a series of general multimodal benchmarks, including MMBench and MMStar, where it consistently surpassed models like Mini-Gemini-HD and Qwen-VL-Chat by a margin of 7.8% to 14.1%, depending on the specific benchmark.

Key Takeaways from the research:

Structural Alignment: Ovis introduces a novel visual embedding table that structurally aligns visual and textual embeddings, enhancing the model’s ability to process multimodal data.

Superior Performance: Ovis outperforms open-source models of similar sizes in various benchmarks, achieving a 14.1% improvement over connector-based architectures.

High-Resolution Capabilities: The model excels in tasks requiring visual understanding of high-resolution images, such as the RealWorldQA benchmark, where it scored 2230, surpassing GPT4V by 192 points.

Scalability: Ovis demonstrates consistent performance across different parameter tiers (7B, 14B), making it adaptable to various model sizes and computational resources.

Practical Applications: With its advanced multimodal capabilities, Ovis can be applied to complex and challenging real-world scenarios, including visual question answering and image captioning, where existing models struggle. 

In conclusion, the researchers have successfully addressed the longstanding misalignment between visual and textual embeddings. By introducing a structured visual embedding strategy, Ovis enables more effective multimodal data integration, improving performance across various tasks. The model’s ability to outperform open-source and proprietary models of similar parameter scales, such as Qwen-VL-Max, underscores its potential as a new standard in multimodal learning. The research team’s approach offers a significant step forward in developing MLLMs, providing new avenues for future research and application.

MassiveDS: A 1.4 Trillion-Token Datastore Enabling Language Models to Achieve Superior Efficiency and Accuracy in Knowledge-Intensive NLP Applications

Language models have become a cornerstone of modern NLP, enabling significant advancements in various applications, including text generation, machine translation, and question-answering systems. Recent research has focused on scaling these models in terms of the amount of training data and the number of parameters. These scaling laws have demonstrated that increasing data and model parameters yields substantial performance improvements. However, a new scaling dimension is now being explored: the size of external data stores available at inference time. Unlike traditional parametric models, which depend solely on the training data, retrieval-based language models can dynamically access a much larger knowledge base during inference, enhancing their ability to generate more accurate and contextually relevant responses. This novel approach of integrating vast datastores opens new possibilities for efficiently managing knowledge and improving the factual accuracy of LMs.

One major challenge in NLP is retaining and utilizing vast knowledge without incurring significant computational costs. Traditional language models are typically trained on large static datasets encoded into the model parameters. Once trained, these models cannot integrate new information dynamically and require costly retraining to update their knowledge base. This is particularly problematic for knowledge-intensive tasks, where models need to reference extensive external sources. The problem is exacerbated when these models are required to handle diverse domains such as general web data, scientific papers, and technical codes. The inability to adapt dynamically to new information and the computational burden associated with retraining limit the effectiveness of these models. Thus, a new paradigm is needed to enable language models to dynamically access and use external knowledge.

Existing approaches for enhancing language models’ capabilities include retrieval-based mechanisms that rely on external datastores. These models, known as retrieval-in-context language models (RIC-LMs), can access additional context during inference by querying an external datastore. This strategy contrasts with parametric models, which are constrained by the knowledge embedded within their parameters. Notable efforts include the use of Wikipedia-sized datastores with a few billion tokens. However, these datastores are often domain-specific and do not cover the full breadth of information required for complex downstream tasks. Additionally, previous retrieval-based models face computational feasibility and efficiency limitations, as large-scale datastores introduce challenges in maintaining retrieval speed and accuracy. Although some models like RETRO have used proprietary datastores, their results have not been fully replicable due to the closed nature of the datasets.

A research team from the University of Washington and the Allen Institute for AI constructed a new datastore called MassiveDS, which comprises 1.4 trillion tokens. This open-source datastore is the largest and most diverse available for retrieval-based LMs. It spans eight domains, including books, scientific papers, Wikipedia articles, GitHub repositories, and mathematical texts. MassiveDS was specifically designed to facilitate large-scale retrieval during inference, enabling language models to access and utilize more information than ever before. The researchers implemented an efficient pipeline that reduces the computational overhead associated with datastore scaling. This pipeline allows for systematic evaluation of datastore scaling trends by retrieving a subset of documents and applying operations such as indexing, filtering, and subsampling only to these subsets, making the construction and utilization of large datastores computationally accessible.
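
The sketch below illustrates that idea in Python: one retrieval pass produces a candidate pool, and quality filtering plus subsampling are applied only to that pool to emulate datastores of different sizes. The data structure and function are hypothetical stand-ins, not the MassiveDS codebase.

```python
import random
from dataclasses import dataclass

@dataclass
class ScoredDoc:
    text: str
    score: float          # retriever similarity score
    passes_filter: bool   # result of a data-quality filter

def simulate_datastore_scales(pool, scales=(0.1, 0.5, 1.0), k=3, seed=0):
    """Given one retrieval pass over the full datastore (the candidate pool),
    emulate smaller datastores by filtering and subsampling only that pool."""
    pool = [d for d in pool if d.passes_filter]
    out = {}
    for scale in scales:
        rng = random.Random(seed)                                # fixed seed for comparable subsamples
        sub = rng.sample(pool, max(1, int(len(pool) * scale)))   # pretend the datastore were `scale` as large
        sub.sort(key=lambda d: d.score, reverse=True)
        out[scale] = sub[:k]                                     # context handed to the retrieval-based LM
    return out

# Toy usage with fabricated documents standing in for retrieved passages.
pool = [ScoredDoc(f"doc {i}", score=random.random(), passes_filter=i % 4 != 0) for i in range(100)]
print({scale: [d.text for d in docs] for scale, docs in simulate_datastore_scales(pool).items()})
```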

The research demonstrated that MassiveDS significantly improves the performance of retrieval-based language models. For example, a smaller LM utilizing this datastore outperformed a larger parametric LM on multiple downstream tasks. Specifically, MassiveDS models achieved lower perplexity scores on general web and scientific data, indicating higher language modeling quality. Furthermore, in knowledge-intensive question-answering tasks such as TriviaQA and Natural Questions, the LMs using MassiveDS consistently outperformed their larger counterparts. On TriviaQA, models with access to less than 100 billion tokens from MassiveDS could surpass the performance of much larger language models that did not utilize external datastores. These findings suggest that increasing the datastore size allows models to perform better without improving their internal parameters, thereby reducing the overall training cost.

The researchers attribute these performance gains to MassiveDS’s ability to provide high-quality, domain-specific information during inference. Even for reasoning-heavy tasks such as MMLU and MedQA, retrieval-based LMs using MassiveDS showed notable improvements compared to parametric models. Using multiple data sources ensures the datastore can provide relevant context for various queries, making the language models more versatile and effective across different domains. The results highlight the importance of using data quality filters and optimized retrieval methods, further enhancing the benefits of datastore scaling.

In conclusion, this study demonstrates that retrieval-based language models equipped with a large datastore like MassiveDS can perform better at a lower computational cost than traditional parametric models. By leveraging an expansive 1.4 trillion-token datastore, these models can dynamically access diverse, high-quality information, significantly improving their ability to handle knowledge-intensive tasks. This represents a promising direction for future research, offering a scalable and efficient method to enhance language models’ performance without increasing the model size or training cost.

Salesforce AI Introduces SFR-Judge: A Family of Three Judge Models in 8B, 12B, and 70B Sizes, Built with Meta Llama 3 and Mistral NeMo

The advancement of large language models (LLMs) in natural language processing has significantly improved various domains. As more complex models are developed, evaluating their outputs accurately becomes essential. Traditionally, human evaluations have been the standard approach for assessing quality, but this process is time-consuming and cannot scale to the rapid pace of model development.

Salesforce AI Research introduces SFR-Judge, a family of three LLM-based judge models, to revolutionize how LLM outputs are evaluated. Built using Meta Llama 3 and Mistral NeMo, SFR-Judge comes in three sizes: 8 billion (8B), 12 billion (12B), and 70 billion (70B) parameters. Each model is designed to perform multiple evaluation tasks, such as pairwise comparisons, single ratings, and binary classification. These models were developed to support research teams in rapidly and effectively evaluating new LLMs.

One of the main limitations of using traditional LLMs as judges is their susceptibility to biases and inconsistencies. Many judge models, for instance, exhibit position bias, where their judgment is influenced by the order in which responses are presented. Others may show length bias, favoring longer responses that seem more complete even when shorter ones are more accurate. To address these issues, the SFR-Judge models are trained using Direct Preference Optimization (DPO), allowing the model to learn from positive and negative examples. This training methodology enables the model to develop a nuanced understanding of evaluation tasks, reducing biases and ensuring consistent judgments.
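
For readers unfamiliar with DPO, the sketch below shows the standard DPO objective over paired positive and negative judgments; it is illustrative only and not Salesforce’s training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Implicit rewards are log-probability ratios against a frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the preferred judgment's reward above the dispreferred one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy tensors standing in for summed per-example log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```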

The SFR-Judge models were tested on 13 benchmarks across three evaluation tasks, demonstrating superior performance to existing judge models, including proprietary models like GPT-4o. Notably, SFR-Judge achieved the best performance on 10 of the 13 benchmarks, setting a new standard in LLM-based evaluation. For example, on the RewardBench leaderboard, SFR-Judge attained an accuracy of 92.7%, marking the first and second times any generative judge model crossed the 90% threshold. These results highlight the effectiveness of SFR-Judge not only as an evaluation model but also as a reward model capable of guiding downstream models in reinforcement learning from human feedback (RLHF) scenarios.

SFR-Judge’s training approach involves three distinct data formats. The first, the Chain-of-Thought Critique, helps the model generate structured and detailed analyses of the evaluated responses. This critique enhances the model’s ability to reason about complex inputs and produce informed judgments. The second format, Standard Judgment, simplifies evaluations by removing the critique and providing more direct feedback on whether the responses meet the specified criteria. Finally, Response Deduction enables the model to deduce what a high-quality response looks like, reinforcing its judgment capabilities. These three data formats work in conjunction to strengthen the model’s capacity to produce well-rounded and accurate evaluations.

Extensive experiments revealed that SFR-Judge models are significantly less biased than competing models, as demonstrated by their performance on EvalBiasBench, a benchmark designed to test for six types of bias. The models exhibit high levels of pairwise order consistency across multiple benchmarks, indicating that their judgments remain stable even when the order of responses is altered. This robustness positions SFR-Judge as a reliable solution for automating the evaluation of LLMs, reducing the reliance on human annotators, and providing a scalable alternative for model assessment.

Key takeaways from the research:

High Accuracy: SFR-Judge achieved top scores on 10 of 13 benchmarks, including a 92.7% accuracy on RewardBench, outperforming many state-of-the-art judge models.

Bias Mitigation: The models demonstrated lower levels of bias, including length and position bias, compared to other judge models, as confirmed by their performance on EvalBiasBench.

Versatile Applications: SFR-Judge supports three main evaluation tasks – pairwise comparisons, single ratings, and binary classification, making it adaptable to various evaluation scenarios.

Structured Explanations: Unlike many judge models, SFR-Judge is trained to produce detailed explanations for its judgments, reducing the black-box nature of LLM-based evaluations.

Performance Boost in Downstream Models: The model’s explanations can improve downstream models’ outputs, making it an effective tool for RLHF scenarios.

In conclusion, the introduction of SFR-Judge by Salesforce AI Research marks a significant leap forward in the automated evaluation of large language models. By leveraging Direct Preference Optimization and a diverse set of training data, the research team has created a family of judge models that are both robust and reliable. These models can learn from diverse examples, provide detailed feedback, and reduce common biases, making them invaluable tools for evaluating and refining generative content. SFR-Judge sets a new benchmark in LLM-based evaluation and opens the door for further advancements in automated model assessment.

SELMA: A Novel AI Approach to Enhance Text-to-Image Generation Models Using Auto-Generated Data and Skill-Specific Learning Techniques

Text-to-image (T2I) models have seen rapid progress in recent years, allowing the generation of complex images based on natural language inputs. However, even state-of-the-art T2I models struggle to accurately capture and reflect all the semantics in given prompts, leading to images that may miss crucial details, such as multiple subjects or specific spatial relationships. For instance, generating a composition like “a cat with wings flying over a field of donuts” poses challenges due to the inherent complexity and specificity of the prompt. As these models attempt to understand and replicate the nuances of text descriptions, their limitations become apparent. Moreover, enhancing these models is often hindered by the need for high-quality, large-scale annotated datasets, making improvement both resource-intensive and laborious. The result is a bottleneck in achieving models that can generate consistently faithful and semantically accurate images across diverse scenarios.

A key problem addressed by the researchers is that T2I models often fail to create images that are truly faithful to complex textual descriptions. This misalignment often results in missing objects, incorrect spatial arrangements, or inconsistent rendering of multiple elements. For example, when asked to generate an image of a park scene featuring a bench, a bird, and a tree, T2I models might fail to maintain the correct spatial relationships between these entities, leading to unrealistic images. Current solutions attempt to improve this faithfulness through supervised fine-tuning with annotated data or re-captioned text prompts. Although these methods show improvement, they rely heavily on the availability of extensive human-annotated data, which introduces high training costs and complexity. Thus, there is a pressing need for a solution that can enhance image faithfulness without depending on manual data annotation, which is both costly and time-consuming.

Many existing solutions have attempted to address these challenges. One popular approach is supervised fine-tuning methods, where T2I models are trained using high-quality image-text pairs or manually curated datasets. Another line of research focuses on aligning T2I models with human preference data through reinforcement learning. This involves ranking and scoring images based on how well they match textual descriptions and using these scores to fine-tune the models further. Although these methods have shown promise in improving alignment, they depend on extensive manual annotations and high-quality data. Moreover, integrating additional components, such as bounding boxes or object layouts, to guide image generation has been explored. However, these techniques often require significant human effort and data curation, making them impractical at scale.

Researchers from the University of North Carolina at Chapel Hill have introduced SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data. SELMA presents a novel approach to enhance T2I models without relying on human-annotated data. This method leverages the capabilities of Large Language Models (LLMs) to generate skill-specific text prompts automatically. The T2I models then use these prompts to produce corresponding images, creating a rich dataset without human intervention. The researchers employ a method known as Low-Rank Adaptation (LoRA) to fine-tune the T2I models on these skill-specific datasets, resulting in multiple skill-specific expert models. By merging these expert models, SELMA creates a unified multi-skill T2I model that can generate high-quality images with improved faithfulness and semantic alignment.

SELMA operates through a four-stage pipeline. First, skill-specific prompts are generated using LLMs, which helps ensure diversity in the dataset. The second stage involves generating corresponding images based on these prompts using T2I models. Next, the model is fine-tuned using LoRA modules to specialize in each skill. Finally, these skill-specific experts are merged to produce a robust T2I model capable of handling diverse prompts. This merging process effectively reduces knowledge conflicts between different skills, resulting in a model that can generate more accurate images than traditional multi-skill models. On average, SELMA showed a +2.1% improvement in the TIFA text-image alignment benchmark and a +6.9% enhancement in the DSG benchmark, indicating its effectiveness in improving faithfulness.
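
The sketch below shows one simple way such a merge could work, assuming the skill-specific experts share a base model and their LoRA parameters are averaged; SELMA’s exact merging rule may differ, so treat this as illustrative.

```python
import torch

def merge_lora_experts(expert_state_dicts):
    """Average LoRA parameters (keys containing 'lora_') across skill-specific experts."""
    merged = {}
    for key in expert_state_dicts[0]:
        if "lora_" in key:
            merged[key] = torch.stack([sd[key] for sd in expert_state_dicts]).mean(dim=0)
        else:
            merged[key] = expert_state_dicts[0][key]  # base weights are shared across experts
    return merged

# Toy usage with two fabricated experts containing one LoRA matrix each.
expert_a = {"unet.lora_A": torch.randn(4, 8), "unet.base": torch.ones(8, 8)}
expert_b = {"unet.lora_A": torch.randn(4, 8), "unet.base": torch.ones(8, 8)}
merged = merge_lora_experts([expert_a, expert_b])
print(merged["unet.lora_A"].shape)
```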

The performance of SELMA was validated against state-of-the-art T2I models, such as Stable Diffusion v1.4, v2, and XL. Empirical results demonstrated that SELMA improved text faithfulness and human preference metrics across multiple benchmarks, including PickScore, ImageReward, and Human Preference Score (HPS). For example, fine-tuning with SELMA improved HPS by 3.7 points and raised human preference metrics by 0.4 on PickScore and 0.39 on ImageReward. Notably, fine-tuning with auto-generated datasets performed comparably to fine-tuning with ground-truth data, suggesting that SELMA is a cost-effective alternative that does not require extensive manual annotation. The researchers also found that fine-tuning a strong T2I model, such as SDXL, using images generated by a weaker model, such as SD v2, led to performance gains, suggesting the potential for weak-to-strong generalization in T2I models.

Key Takeaways from the SELMA Research:

Performance Improvement: SELMA enhanced T2I models by +2.1% on TIFA and +6.9% on DSG benchmarks.

Cost-Effective Data Generation: Auto-generated datasets achieved comparable performance to human-annotated datasets.

Human Preference Metrics: Improved HPS by 3.7 points and increased PickScore and ImageReward by 0.4 and 0.39, respectively.

Weak-to-Strong Generalization: Fine-tuning with images from a weaker model improved the performance of a stronger T2I model.

Reduced Dependency on Human Annotation: SELMA demonstrated that high-quality T2I models could be developed without extensive manual data annotation. 

In conclusion, SELMA offers a robust and efficient approach to enhance the faithfulness and semantic alignment of T2I models. By leveraging auto-generated data and a novel merging mechanism for skill-specific experts, SELMA eliminates the need for costly human-annotated data. This method addresses the key limitations of current T2I models and sets the stage for future advancements in text-to-image generation.

Multi-View and Multi-Scale Alignment (MaMA): Advancing Mammography with Contrastive Learning and Visual-Language Pre-training

Multi-View and Multi-Scale Alignment for Mammography Contrastive Learning: Contrastive Language-Image Pre-training (CLIP) has shown potential in medical imaging, but its application to mammography faces challenges due to limited labeled data, high-resolution images, and imbalanced datasets. This study introduces the first full adaptation of CLIP to mammography through a new framework called Multi-view and Multi-scale Alignment (MaMA). Mammography’s inherent complexities, such as multi-view images with small regions of interest, bilateral asymmetry, and ipsilateral correspondence, demand specialized approaches. MaMA addresses these issues by leveraging the multi-view nature of mammography and aligning image features at different scales. It also uses a symmetric local alignment module to focus on detailed features and a parameter-efficient fine-tuning approach to enhance pre-trained LLMs with medical knowledge. This allows the framework to overcome data scarcity and perform better on mammography tasks.

The MaMA model significantly outperforms existing state-of-the-art methods across multiple tasks on two large mammography datasets, EMBED and RSNA-Mammo, despite using only 52% of the model size compared to the largest baseline. By combining multi-view image alignment and text-image relationships, MaMA effectively learns detailed image representations while maintaining efficient resource usage. This method demonstrates its potential to enhance mammography interpretation through visual-language pre-training, improving cancer detection and diagnosis with fewer computational demands. The code is available for public use to promote further research in this area.

Medical Visual-Language Pre-training Methods: Existing medical Visual-Language Pre-training (VLP) models fall into two types. The first comprises general-purpose models trained on large-scale datasets spanning multiple anatomical sites, which show strong generalization but are often outperformed by modality-specific models. The second type focuses on chest X-rays due to the availability of extensive datasets, though these models face limitations such as pixel imbalance and report-alignment issues. Multi-view contrastive learning, which aligns images from different perspectives, has been applied in mammography but has yet to be fully integrated with CLIP to exploit multimodal supervision signals.

Method: The proposed MaMA framework introduces a method for constructing structured mammography reports from tabular data and incorporates a multi-view contrastive image-text pre-training approach. It utilizes template-based caption generation to enhance image understanding and prevent oversimplification. A multi-view contrastive learning framework improves the model’s capability by comparing mammogram views, while the Symmetric Local Alignment (SLA) module enables fine-grained correspondence between image patches and text. Additionally, parameter-efficient fine-tuning (PEFT) of a large pre-trained LLM is employed to improve text encoding, enhancing overall performance without increasing computational costs.

Model Performance on Mammography Datasets: The experiments utilized the Emory EMBED dataset, comprising over 72,000 multi-view mammograms from 23,356 patients, divided into training, validation, and test sets (70%/10%/20%). The model architecture featured DINOv2-ViT-B-14 as the image encoder and BioMedLM as the text encoder, with fine-tuning via LoRA for efficiency. Training was optimized using the AdamW optimizer with a 4e-5 learning rate, a cosine annealing scheduler, and the SLA loss. Hyperparameter tuning included a batch size of 144 across four GPUs, and the primary evaluation focused on BI-RADS assessment and breast density prediction, with metrics such as balanced accuracy (bACC) and AUC.
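
As a rough sketch of the stated optimization recipe (AdamW at 4e-5, cosine annealing, batch size 144), the snippet below wires those pieces together in PyTorch; the model and loss are placeholders, not the MaMA encoders or the SLA objective.

```python
import torch

model = torch.nn.Linear(768, 512)  # placeholder standing in for the image/text encoder pair
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)

for step in range(1_000):
    batch = torch.randn(144, 768)      # batch size of 144, as reported
    loss = model(batch).pow(2).mean()  # placeholder for the contrastive + SLA loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```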

MaMA, the proposed model, outperformed baselines such as CLIP, ConVIRT, and MM-MIL in zero-shot and full fine-tuning settings. It demonstrated a 4% improvement in balanced accuracy for BI-RADS and excelled in breast density prediction. MaMA’s robustness was further validated on the out-of-domain RSNA-Mammo dataset for cancer detection, where it achieved higher balanced accuracy and AUC scores compared to the baselines while maintaining adequate sensitivity and specificity. This highlights MaMA’s strong generalization capabilities even with limited training data.

RxEnvironments.jl: A Reactive Programming Approach to Complex Agent-Environment Simulations in the Julia Language

The Free Energy Principle (FEP) and its extension, Active Inference (AIF), present a unique approach to understanding self-organization in natural systems. These frameworks propose that agents use internal generative models to predict observations from unknown external processes, continuously updating their perceptive and control states to minimize prediction errors. While this unifying principle offers profound insights into agent-environment interactions, implementing it in practical scenarios poses significant challenges. Researchers require fine-grained control over agent-environment communication protocols, particularly when simulating proprioceptive feedback or multi-agent systems. Current solutions from reinforcement learning and control theory, such as Gymnasium, lack the flexibility needed for these complex simulations. The imperative programming style employed in existing frameworks restricts communication between agents and environments to predefined parameters, limiting the exploration of diverse interaction scenarios essential for advancing FEP and AIF research.

Existing attempts to address the challenges in simulating agent-environment interactions have primarily focused on reinforcement learning frameworks. Gymnasium has emerged as a standard for creating and sharing control environments, offering a step function to define transition functions and handle environmental simulations. Similar alternatives include Deepmind Control Suite for Python and ReinforcementLearning.jl for Julia. These packages provide high-level interfaces to environments, simplifying timekeeping for users. While designed for reinforcement learning, they have been adapted for Active Inference research. Other packages like PyMDP and SPM-DEM toolbox incorporate environment realization but prioritize agent creation. However, the lack of a standardized approach for defining Active Inference environments has led to inconsistent implementations, with some researchers using Gymnasium and others opting for specialized toolboxes. Reactive Programming, similar to the Actor Model, offers a promising alternative by allowing computations on static datasets and real-time asynchronous sensor observations, aligning more closely with the principles of Active Inference.

Researchers from the Eindhoven University of Technology and GN Hearing present RxEnvironments.jl, a Julia package that introduces Reactive Environments as a robust approach to modeling agent-environment interactions. This implementation utilizes Reactive Programming principles to create efficient and flexible simulations. The package addresses the limitations of existing frameworks by offering a versatile platform for designing complex, multi-agent environments. By adopting a reactive programming style, RxEnvironments.jl enables researchers to model sophisticated systems with interacting agents more effectively. The package’s design facilitates the exploration of various scenarios, from simple single-agent simulations to intricate multi-agent ecosystems. Through several case studies, RxEnvironments.jl demonstrates its capability to handle diverse and complex environmental setups, showcasing its potential as a powerful tool for advancing research in Active Inference and related fields.

RxEnvironments.jl adopts a reactive programming approach to environment design, addressing the limitations of imperative frameworks. This approach enables multi-sensor, multimodal interactions between agents and environments without strict communication constraints. The package offers detailed control over observations, allowing different sensory channels to operate at varying frequencies or triggers based on specific actions. This flexibility enables the implementation of complex real-world scenarios with fine-grained control over an agent’s perceptions. RxEnvironments.jl natively supports multi-agent environments, allowing multiple instances of the same agent type to coexist without additional coding. The reactive programming style ensures efficient computation, with environments emitting observations when prompted and idling when unnecessary. In addition to that, the package extends beyond simple agent-environment frameworks, supporting multi-entity complex environments for more sophisticated simulations.

The Mountain Car environment, a classic reinforcement learning scenario, is implemented in RxEnvironments.jl with a unique twist. This implementation showcases the package’s ability to handle complex agent-environment interactions. When an agent applies an action, such as setting the engine throttle, the environment responds with an observation containing the actual engine force applied. This approach aligns with current theories on proprioceptive feedback in biological systems. The environment is designed to trigger different implementations of the what_to_send function based on input stimuli. For throttle actions, it returns the applied throttle action, while position and velocity measurements are emitted at a regular 2 Hz frequency, simulating sensor behavior. This setup demonstrates RxEnvironments.jl’s capability to manage distinct types of observations – sensory and proprioceptive feedback – each with its own logic for acquisition and transmission.
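
To make this concrete without reproducing the package’s Julia API, here is a conceptual Python analogue of an environment with two observation channels: proprioceptive throttle feedback emitted only when an action arrives, and position/velocity readings emitted on a 2 Hz clock. Everything here, including the toy dynamics, is illustrative and is not the RxEnvironments.jl interface.

```python
class ReactiveMountainCar:
    def __init__(self):
        self.position, self.velocity, self.throttle = -0.5, 0.0, 0.0
        self._last_sensor_emit = 0.0

    def receive_action(self, throttle: float):
        """Proprioceptive channel: emitted only in response to an action."""
        self.throttle = throttle
        return {"proprioception": {"applied_throttle": throttle}}

    def tick(self, now: float):
        """Sensor channel: position/velocity emitted at 2 Hz; otherwise the environment idles."""
        self.velocity += 0.001 * self.throttle - 0.0025  # toy dynamics, not the real Mountain Car physics
        self.position += self.velocity
        if now - self._last_sensor_emit >= 0.5:
            self._last_sensor_emit = now
            return {"sensors": {"position": self.position, "velocity": self.velocity}}
        return None

env = ReactiveMountainCar()
print(env.receive_action(0.8))
for step in range(1, 21):
    obs = env.tick(now=step * 0.05)
    if obs:
        print(obs)
```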

RxEnvironments.jl demonstrates its versatility through the implementation of a complex football match simulation. This multi-agent environment involves 22 players, showcasing the package’s ability to handle intricate, real-world scenarios. The simulation is structured with a single Entity representing the world state, containing the ball and references to all 22 player bodies, and 22 separate Entities for individual players. This design allows for realistic collision detection and on-ball actions. Players subscribe to the World Entity but not to each other, streamlining the subscription graph. Agent-to-agent communication is facilitated through the world Entity, which forwards signals between players. The environment distinguishes between global and local states, with the world Entity managing physical interactions and player Entities maintaining their local states and receiving observations from the global state. This setup enables asynchronous command execution for individual players, as demonstrated in a supplementary video. While the simulation focuses on running and on-ball actions rather than comprehensive football rules, it effectively illustrates RxEnvironments.jl’s capacity to model complex, multi-agent systems with individualized observations and interactions.

RxEnvironments.jl further demonstrates its flexibility by modeling a sophisticated hearing aid system that incorporates active inference-based agents for noise reduction. This complex scenario involves multiple interacting entities: the hearing aid itself, the external acoustic environment, the user (patient), and an intelligent agent on the user’s phone. The package adeptly handles the unique challenges of this multi-entity system, where the hearing aid must continuously communicate with three distinct sources. It processes acoustic signals from the outside world, receives feedback from the user about perceived performance, and interacts with the intelligent agent on the phone for advanced computations. This implementation showcases RxEnvironments.jl’s capability to model real-world systems with distributed processing and multiple communication channels, addressing the constraints of limited computing power and battery capacity in hearing aids. The package’s reactive programming approach enables efficient management of these complex, asynchronous interactions, making it an ideal tool for simulating and developing advanced hearing aid technologies.

This study presents Reactive Environments and their implementation in RxEnvironments.jl, offering a versatile framework for modeling complex agent-environment interactions. This approach encompasses traditional reinforcement learning scenarios while enabling more sophisticated simulations, particularly for Active Inference. The case studies demonstrate the framework’s expressive power, accommodating diverse environmental setups from classic control problems to multi-agent systems and advanced hearing aid simulations. RxEnvironments.jl’s flexibility in handling complex communication protocols between agents and environments positions it as a valuable tool for researchers. Future work could explore agent classes that effectively utilize this communication protocol, further advancing the field of agent-environment simulations.

Voyage AI Introduces Voyage-3 and Voyage-3-Lite: A New Generation of Small Embedding Models that Outperforms OpenAI v3 Large by 7.55%

Voyage AI has announced the release of its new generation of embedding models, Voyage-3 and Voyage-3-Lite. The Voyage-3 and Voyage-3-Lite models are designed to outperform existing industry standards in various domains, including technology, law, finance, multilingual applications, and long-context understanding. According to Voyage AI’s evaluations, Voyage-3 outperforms OpenAI’s V3 large model by an average of 7.55% across all tested domains, which include technical documentation, code, law, finance, web content, multilingual datasets, long documents, and conversational data. Moreover, Voyage-3 achieves this with 2.2 times lower costs and a 3x smaller embedding dimension, translating to significantly reduced vector database (vectorDB) costs. Similarly, Voyage-3-Lite offers 3.82% better retrieval accuracy than OpenAI’s V3 large model, with 6x lower costs and a 6x smaller embedding dimension.

Cost Efficiency Without Compromising Quality

Cost efficiency is at the heart of the new Voyage-3 series models. With a context length of 32,000 tokens, four times more than OpenAI’s offering, Voyage-3 is a cost-effective solution for businesses requiring high-quality retrieval without breaking the bank. For example, Voyage-3 costs $0.06 per million tokens, making it 1.6x cheaper than Cohere English V3 and substantially more affordable than OpenAI’s large V3 model. Also, Voyage-3’s smaller embedding dimension (1024 vs. OpenAI’s 3072) results in lower vectorDB costs, enabling companies to scale their applications efficiently.

Voyage-3-Lite, the model’s lighter variant, is optimized for low-latency operations. At $0.02 per million tokens, it is 6.5x cheaper than OpenAI’s V3 large model and has a 6-8x smaller embedding dimension (512 vs. OpenAI’s 3072). This makes Voyage-3-Lite a viable option for organizations looking to maintain high retrieval quality at a fraction of the cost.

Versatility Across Multiple Domains

The success of the Voyage-3 series models extends beyond general-purpose embeddings. Over the past nine months, Voyage AI has released a suite of its Voyage-2 series embedding models, including domain-specific models like Voyage-Large-2, Voyage-Code-2, Voyage-Law-2, Voyage-Finance-2, and Voyage-Multilingual-2. These models have been extensively trained on data from their respective domains, demonstrating exceptional performance in specialized use cases.

For example, Voyage-Multilingual-2 delivers superior retrieval quality in French, German, Japanese, Spanish, and Korean while maintaining best-in-class performance in English. These achievements testify to Voyage AI’s commitment to developing robust models tailored to specific business needs.

Technical Specifications and Innovations

Several research innovations underpin the development of Voyage-3 and Voyage-3-Lite. The models feature an improved architecture, leveraging distillation from larger models and pre-training on over 2 trillion high-quality tokens. Additionally, retrieval result alignment is refined through human feedback, further enhancing the accuracy and relevance of the models.

Key technical specifications of the Voyage-3 series models include:

Voyage-3: 1024 embedding dimensions; 32,000-token context length; $0.06 per million tokens; retrieval quality (NDCG@10) of 76, outperforming OpenAI’s V3 large by 7.55%.

Voyage-3-Lite: 512 embedding dimensions; 32,000-token context length; $0.02 per million tokens; retrieval quality (NDCG@10) of 72, outperforming OpenAI’s V3 large by 3.82%.

The models’ ability to handle a 32,000-token context length, compared to OpenAI’s 8,000 tokens and Cohere’s 512 tokens, makes them suitable for applications requiring comprehensive understanding and retrieval of large documents, such as technical manuals, academic papers, and legal case summaries.

Applications and Use Cases

The Voyage-3 series models cater to a wide range of industries, enabling applications in domains like:

Technical Documentation: Providing accurate and context-aware retrieval from large technical manuals and programming guides.

Code: Offering an enhanced understanding of code snippets, docstrings, and programming logic, making it ideal for software development and code review.

Law: Supporting complex legal research by retrieving relevant court opinions, statutes, and legal arguments.

Finance: Streamlining the retrieval of financial statements, SEC filings, and market analysis reports.

Multilingual Applications: Facilitating multilingual search and retrieval across 26 languages, including French, German, Japanese, Spanish, and Korean.

Recommendations for Users

Voyage AI recommends that any general-purpose embedding users upgrade to Voyage-3 for enhanced retrieval quality at a low cost. Voyage-3-Lite offers an excellent balance between performance and affordability for those looking for further cost savings. Domain-specific use cases, such as code, law, and finance, can still benefit from Voyage-2 series models like Voyage-Code-2, Voyage-Law-2, and Voyage-Finance-2, although Voyage-3 provides highly competitive performance in these areas as well.

Future Developments

The Voyage AI team is continuously working to expand the capabilities of the Voyage-3 series models. In the coming weeks, the release of Voyage-3-Large is expected to set a new standard for large-scale general-purpose embeddings, further solidifying Voyage AI’s position as a leader in the field. For those interested in exploring the potential of the Voyage-3 series, the first 200 million tokens are free to try. Users can use these models immediately by specifying “voyage-3” or “voyage-3-lite” as the model parameter in Voyage API calls. Voyage AI’s release of Voyage-3 and Voyage-3-Lite represents a giant leap forward in embedding technology, offering a unique combination of high performance, low cost, and versatility. With these new models, Voyage AI continues to lead the way in creating state-of-the-art solutions for businesses and developers worldwide.
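
As a usage sketch, the snippet below calls the models through the voyageai Python client; the exact client surface and environment-variable handling are assumptions based on common usage, so consult Voyage AI’s documentation for the authoritative interface.

```python
import voyageai

vo = voyageai.Client()  # assumes VOYAGE_API_KEY is set in the environment

result = vo.embed(
    ["Termination clauses in SaaS agreements"],
    model="voyage-3",        # or "voyage-3-lite" for the cheaper, lower-latency variant
    input_type="document",   # use "query" when embedding search queries
)
print(len(result.embeddings[0]))  # expected: 1024 dimensions for voyage-3
```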

Microsoft Researchers Introduce Advanced Query Categorization System to Enhance Large Language Model Accuracy and Reduce Hallucinations in Specialized Fields

Large language models (LLMs) have revolutionized the field of AI with their ability to generate human-like text and perform complex reasoning. However, despite their capabilities, LLMs struggle with tasks requiring domain-specific knowledge, especially in healthcare, law, and finance. When trained on large general-purpose datasets, these models often miss critical information from specialized domains, leading to hallucinations or inaccurate responses. Enhancing LLMs with external data has been proposed as a solution to these limitations. By integrating relevant information, models become more precise and effective, significantly improving their performance. The Retrieval-Augmented Generation (RAG) technique is a prime example of this approach, allowing LLMs to retrieve necessary data during the generation process to provide more accurate and timely responses.

One of the most significant problems in deploying LLMs is their inability to handle queries that require specific and up-to-date information. While LLMs are highly capable when dealing with general knowledge, they falter when tasked with specialized or time-sensitive queries. This shortfall occurs because most models are trained on static data, so they can only update their knowledge through external input. For example, in healthcare, a model without access to current medical guidelines will struggle to offer accurate advice, potentially putting lives at risk. Similarly, legal and financial systems require constant updates to keep up with changing regulations and market conditions. The challenge, therefore, lies in developing a model that can dynamically pull in relevant data to meet the specific needs of these domains.

Current solutions, such as fine-tuning and RAG, have made strides in addressing these challenges. Fine-tuning allows a model to be retrained on domain-specific data, tailoring it for particular tasks. However, this approach is time-consuming and requires vast amounts of training data, which are not always available. Moreover, fine-tuning often results in overfitting, where the model becomes too specialized and struggles with general queries. On the other hand, RAG offers a more flexible approach. Instead of relying solely on pre-trained knowledge, RAG enables models to retrieve external data in real time, improving their accuracy and relevance. Despite its advantages, RAG still faces several challenges, such as the difficulty of processing unstructured data, which can come in various forms like text, images, and tables.

Researchers at Microsoft Research Asia introduced a novel method that categorizes user queries into four distinct levels based on the complexity and type of external data required. These levels are explicit facts, implicit facts, interpretable rationales, and hidden rationales. The categorization helps tailor the model’s approach to retrieving and processing data, ensuring it selects the most relevant information for a given task. For example, explicit fact queries involve straightforward questions, such as “What is the capital of France?” where the answer can be retrieved from external data. Implicit fact queries require more reasoning, such as combining multiple pieces of information to infer a conclusion. Interpretable rationale queries involve domain-specific guidelines, while hidden rationale queries require deep reasoning and often deal with abstract concepts.

The method proposed by Microsoft Research enables LLMs to differentiate between these query types and apply the appropriate level of reasoning. For instance, in the case of hidden rationale queries, where no clear answer exists, the model could infer patterns and use domain-specific reasoning methods to generate a response. By breaking down queries into these categories, the model becomes more efficient at retrieving the necessary information and providing accurate, context-driven responses. This categorization also helps reduce the computational load on the model, as it can now focus on retrieving only the data relevant to the query type rather than scanning vast amounts of unrelated information.
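The paper does not publish reference code, but the routing idea can be illustrated with a small sketch. The following hypothetical Python snippet asks an LLM to classify an incoming query into one of the four levels and then dispatches it to a level-specific retrieval strategy. The function names, prompt wording, and retriever methods are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of query-level routing; not the authors' implementation.
QUERY_LEVELS = ["explicit_fact", "implicit_fact", "interpretable_rationale", "hidden_rationale"]

CLASSIFIER_PROMPT = (
    "Classify the user query into exactly one of these levels: "
    + ", ".join(QUERY_LEVELS)
    + ".\nQuery: {query}\nLevel:"
)

def classify_query(llm, query: str) -> str:
    """Ask the LLM to label the query; fall back to the simplest level."""
    label = llm(CLASSIFIER_PROMPT.format(query=query)).strip().lower()
    return label if label in QUERY_LEVELS else "explicit_fact"

def answer(llm, retriever, reasoner, query: str) -> str:
    """Route retrieval according to the query level, then generate an answer."""
    level = classify_query(llm, query)
    if level == "explicit_fact":
        context = retriever.search(query, top_k=3)            # direct lookup
    elif level == "implicit_fact":
        context = retriever.multi_hop_search(query, hops=2)   # combine several pieces of evidence
    elif level == "interpretable_rationale":
        context = retriever.fetch_guidelines(query)           # domain-specific guidelines
    else:  # hidden_rationale
        context = reasoner.mine_patterns(query)               # deeper in-domain reasoning aids
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")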

The study also highlights the impressive results of this approach. The system significantly improved performance in specialized domains like healthcare and legal analysis. For instance, in healthcare applications, the model reduced the rate of hallucinations by up to 40%, providing more grounded and reliable responses. In legal systems, the model’s accuracy in processing complex documents and offering detailed analysis increased by 35%. Overall, the proposed method allowed for more accurate retrieval of relevant data, leading to better decision-making and more reliable outputs. The study found that RAG-based systems reduced hallucination incidents by grounding the model’s responses in verifiable data, improving accuracy in critical applications such as medical diagnostics and legal document processing.

In conclusion, this research provides a crucial solution to one of the fundamental problems in deploying LLMs in specialized domains. By introducing a system that categorizes queries based on complexity and type, the researchers at Microsoft Research have developed a method that enhances the accuracy and interpretability of LLM outputs. This framework enables LLMs to retrieve the most relevant external data and apply it effectively to domain-specific queries, reducing hallucinations and improving overall performance. The study demonstrated that using structured query categorization can improve results by up to 40%, making this a significant step forward in AI-powered systems. By addressing both the problem of data retrieval and the integration of external knowledge, this research paves the way for more reliable and robust LLM applications across various industries.


Architecture to AWS CloudFormation code using Anthropic’s Claude 3 o …

Anthropic’s Claude 3 family of models, available on Amazon Bedrock, offers multimodal capabilities that enable the processing of images and text. This capability opens up innovative avenues for image understanding, wherein Anthropic’s Claude 3 models can analyze visual information in conjunction with textual data, facilitating more comprehensive and contextual interpretations. By taking advantage of its multimodal prowess, we can ask the model questions like “What objects are in the image, and how are they positioned relative to each other?” We can also gain an understanding of data presented in charts and graphs by asking questions related to business intelligence (BI) tasks, such as “What is the sales trend for 2023 for company A in the enterprise market?” These are just some examples of the additional richness Anthropic’s Claude 3 brings to generative artificial intelligence (AI) interactions.
Architecting specific AWS Cloud solutions involves creating diagrams that show relationships and interactions between different services. Instead of building the code manually, you can use Anthropic’s Claude 3’s image analysis capabilities to generate AWS CloudFormation templates by passing an architecture diagram as input.
In this post, we explore some ways you can use Anthropic’s Claude 3 Sonnet’s vision capabilities to accelerate the process of moving from architecture to the prototype stage of a solution.
Use cases for architecture to code
The following are relevant use cases for this solution:

Converting whiteboarding sessions to AWS infrastructure – To quickly prototype your designs, you can take the architecture diagrams created during whiteboarding sessions and generate the first draft of a CloudFormation template. You can also iterate over the CloudFormation template to develop a well-architected solution that meets all your requirements.
Fast deployment of architecture diagrams – You can generate boilerplate CloudFormation templates by using architecture diagrams you find on the web. This allows you to experiment quickly with new designs.
Streamlined AWS infrastructure design through collaborative diagramming – You might draw architecture diagrams on a diagramming tool during an all-hands meeting. These raw diagrams can generate boilerplate CloudFormation templates, quickly leading to actionable steps while speeding up collaboration and increasing meeting value.

Solution overview
To demonstrate the solution, we use Streamlit to provide an interface for diagrams and prompts. Amazon Bedrock invokes the Anthropic’s Claude 3 Sonnet model, which provides multimodal capabilities. AWS Fargate is the compute engine for the web application. The following diagram illustrates the step-by-step process.

The workflow consists of the following steps:

The user uploads an architecture image (JPEG or PNG) on the Streamlit application, invoking the Amazon Bedrock API to generate a step-by-step explanation of the architecture using the Anthropic’s Claude 3 Sonnet model.
The Anthropic’s Claude 3 Sonnet model is invoked using a step-by-step explanation and few-shot learning examples to generate the initial CloudFormation code. The few-shot learning example consists of three CloudFormation templates; this helps the model understand writing practices associated with CloudFormation code.
The user manually provides instructions using the chat interface to update the initial CloudFormation code.

*Steps 1 and 2 are executed once when the architecture diagram is uploaded. To trigger changes to the AWS CloudFormation code (step 3), provide update instructions from the Streamlit app.
The CloudFormation templates generated by the web application are intended for inspiration purposes and not for production-level applications. It is the responsibility of a developer to test and verify the CloudFormation template according to security guidelines.
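Before moving on, here is a rough sketch of how a Streamlit front end might wire these steps together. The generate_template and update_template helpers are hypothetical placeholders for the Amazon Bedrock calls described in this post; this is not the actual application code from the repository.

# Minimal Streamlit sketch (assumed structure, not the repo's actual application code).
import streamlit as st

def generate_template(image_bytes: bytes) -> str:
    # Hypothetical helper: in the real app this wraps the Bedrock calls
    # (diagram explanation + few-shot code generation) described in this post.
    return "# CloudFormation template would be generated here"

def update_template(template: str, instruction: str) -> str:
    # Hypothetical helper: sends the current template plus the user's
    # instruction back to the model for revision.
    return template + f"\n# updated per instruction: {instruction}"

st.title("Architecture diagram to CloudFormation")

uploaded = st.file_uploader("Upload architecture diagram", type=["jpeg", "jpg", "png"])
if uploaded is not None and "template" not in st.session_state:
    st.session_state["template"] = generate_template(uploaded.read())  # steps 1 and 2
    st.code(st.session_state["template"], language="yaml")

instruction = st.chat_input("Describe an update to the CloudFormation template")
if instruction and "template" in st.session_state:
    st.session_state["template"] = update_template(st.session_state["template"], instruction)  # step 3
    st.code(st.session_state["template"], language="yaml")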
Few-shot Prompting
To help Anthropic’s Claude 3 Sonnet understand the practices of writing CloudFormation code, we use few-shot prompting by providing three CloudFormation templates as reference examples in the prompt. Exposing Anthropic’s Claude 3 Sonnet to multiple CloudFormation templates will allow it to analyze and learn from the structure, resource definitions, parameter configurations, and other essential elements consistently implemented across your organization’s templates. This enables Anthropic’s Claude 3 Sonnet to grasp your team’s coding conventions, naming conventions, and organizational patterns when generating CloudFormation templates. The following examples used for few-shot learning can be found in the GitHub repo.

Few-shot prompting example 1

Few-shot prompting example 2

Few-shot prompting example 3

Furthermore, Anthropic’s Claude 3 Sonnet can observe how different resources and services are configured and integrated within the CloudFormation templates through few-shot prompting. It will gain insights into how to automate the deployment and management of various AWS resources, such as Amazon Simple Storage Service (Amazon S3), AWS Lambda, Amazon DynamoDB, and AWS Step Functions.
Inference parameters are preset, but they can be changed from the web application if desired. We recommend experimenting with various combinations of these parameters. By default, we set the temperature to zero to reduce the variability of outputs and create focused, syntactically correct code.
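The exact prompts live in the GitHub repo, but the underlying Amazon Bedrock call can be sketched as follows. This assumed snippet sends the architecture image plus an instruction to Anthropic’s Claude 3 Sonnet through the Messages API, with the temperature set to zero as described above; the file name, prompt text, and token limit are illustrative.

# Assumed sketch of the Bedrock invocation behind the web application.
import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

with open("architecture-diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 4096,
    "temperature": 0,  # reduce output variability for focused, syntactically consistent code
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text", "text": "Explain this AWS architecture diagram step by step."},
            ],
        }
    ],
}

response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
explanation = json.loads(response["body"].read())["content"][0]["text"]
print(explanation)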
Prerequisites
To access the Anthropic’s Claude 3 Sonnet foundation model (FM), you must request access through the Amazon Bedrock console. For instructions, see Manage access to Amazon Bedrock foundation models. After requesting access to Anthropic’s Claude 3 Sonnet, you can deploy the following development.yaml CloudFormation template to provision the infrastructure for the demo. For instructions on how to deploy this sample, refer to the GitHub repo. Use the following table to launch the CloudFormation template to quickly deploy the sample in either us-east-1 or us-west-2.

Region | Stack
us-east-1 | development.yaml
us-west-2 | development.yaml

When deploying the template, you have the option to specify the Amazon Bedrock model ID you want to use for inference. This flexibility allows you to choose the model that best suits your needs. By default, the template uses the Anthropic’s Claude 3 Sonnet model, renowned for its exceptional performance. However, if you prefer to use a different model, you can seamlessly pass its Amazon Bedrock model ID as a parameter during deployment. Verify that you have requested access to the desired model beforehand and that the model possesses the necessary vision capabilities required for your specific use case.
After you launch the CloudFormation stack, navigate to the stack’s Outputs tab on the AWS CloudFormation console and collect the Amazon CloudFront URL. Enter the URL in your browser to view the web application.

In this post, we discuss CloudFormation template generation for three different samples. You can find the sample architecture diagrams in the GitHub repo. These samples are similar to the few-shot learning examples, which is intentional. As an enhancement to this architecture, you can employ a Retrieval Augmented Generation (RAG)-based approach to retrieve relevant CloudFormation templates from a knowledge base to dynamically augment the prompt.
Due to the non-deterministic behavior of the large language model (LLM), you might not get the same response as shown in this post.
Let’s generate CloudFormation templates for the following sample architecture diagram.

Uploading the preceding architecture diagram to the web application generates a step-by-step explanation of the diagram using Anthropic’s Claude 3 Sonnet’s vision capabilities.

Let’s analyze the step-by-step explanation. The generated response is divided into three parts:

The context explains what the architecture diagram depicts.
The architecture diagram’s flow gives the order in which AWS services are invoked and their relationship with each other.
We get a summary of the entire generated response.

In the following step-by-step explanation, we see a few highlighted errors.

The step-by-step explanation is augmented with few-shot learning examples to develop an initial CloudFormation template. Let’s analyze the initial CloudFormation template:

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  This CloudFormation stack sets up a serverless data processing pipeline triggered by file uploads to an S3 bucket.
  It uses AWS Lambda to process the uploaded files, and Amazon SNS to send notifications upon successful processing.
  This template is not production ready and should only be used for inspiration
Parameters:
  S3BucketName:
    Type: String
    Description: Name of the S3 bucket for file uploads
    AllowedPattern: ^[a-z0-9][a-z0-9-]*[a-z0-9]$
    MinLength: 1
    MaxLength: 63

  EmailAddress:
    Type: String
    Description: Email address to receive notifications
    AllowedPattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$

Resources:

  # S3 Bucket
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: 's3:ObjectCreated:*'
            Function: !GetAtt ProcessingLambda.Arn

  # SNS Topic
  SNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Endpoint: !Ref EmailAddress
          Protocol: email

  # Lambda Function
  ProcessingLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: ProcessingLambda
      Runtime: python3.9
      Handler: index.lambda_handler
      Role: !GetAtt LambdaRole.Arn
      Code:
        ZipFile: |
          import boto3

          def lambda_handler(event, context):
              s3 = boto3.client('s3')
              sns = boto3.client('sns')

              # Process the uploaded file
              for record in event['Records']:
                  bucket_name = record['s3']['bucket']['name']
                  object_key = record['s3']['object']['key']

                  # Process the file data
                  # ...

                  # Send notification upon successful processing
                  sns.publish(
                      TopicArn=!Ref SNSTopic,
                      Message='File processing completed successfully',
                      Subject='Data Processing Notification'
                  )

              return {
                  'statusCode': 200,
                  'body': 'File processing completed successfully'
              }

  # Lambda Role
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: S3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 's3:GetObject'
                Resource: !Join ['', ['arn:aws:s3:::', !Ref S3BucketName, '/*']]
        - PolicyName: SNSPublish
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'sns:Publish'
                Resource: !Ref SNSTopic

  # Lambda Permissions
  LambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !GetAtt ProcessingLambda.Arn
      Action: 'lambda:InvokeFunction'
      Principal: s3.amazonaws.com
      SourceAccount: !Ref AWS::AccountId
      SourceArn: !Join ['', ['arn:aws:s3:::', !Ref S3BucketName]]

Outputs:

  S3BucketName:
    Description: Name of the S3 bucket for file uploads
    Value: !Ref S3Bucket
    Export:
      Name: !Sub '${AWS::StackName}-S3BucketName'

  LambdaFunctionArn:
    Description: ARN of the Lambda function
    Value: !GetAtt ProcessingLambda.Arn
    Export:
      Name: !Sub '${AWS::StackName}-LambdaFunctionArn'

  SNSTopicArn:
    Description: ARN of the SNS topic for notifications
    Value: !Ref SNSTopic
    Export:
      Name: !Sub '${AWS::StackName}-SNSTopicArn'

After analyzing the CloudFormation template, we see that the Lambda code refers to an Amazon Simple Notification Service (Amazon SNS) topic using !Ref SNSTopic, which is not valid. We also want to add additional functionality to the template. First, we want to filter the Amazon S3 notification configuration to invoke Lambda only when *.csv files are uploaded. Second, we want to add metadata to the CloudFormation template. To do this, we use the chat interface to give the following update instructions to the web application:

Make the following updates:

Use environment variables for AWS Lambda to access SNS Topic ARN.

Add filter to S3 notification configuration to only invoke AWS lambda when *.csv files are uploaded

Add metadata to CloudFormation template

The updated CloudFormation template is as follows:

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  This CloudFormation stack sets up a serverless data processing pipeline triggered by file uploads to an S3 bucket.
  It uses AWS Lambda to process the uploaded files, and Amazon SNS to send notifications upon successful processing.
  This template is not production ready and should only be used for inspiration.
Metadata:
  'AWS::CloudFormation::Interface':
    ParameterGroups:
      - Label:
          default: 'S3 Bucket Configuration'
        Parameters:
          - S3BucketName
      - Label:
          default: 'Notification Configuration'
        Parameters:
          - EmailAddress

Parameters:
  S3BucketName:
    Type: String
    Description: Name of the S3 bucket for file uploads
    AllowedPattern: ^[a-z0-9][a-z0-9-]*[a-z0-9]$
    MinLength: 1
    MaxLength: 63

  EmailAddress:
    Type: String
    Description: Email address to receive notifications
    AllowedPattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$

Resources:

  # S3 Bucket
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref S3BucketName
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: 's3:ObjectCreated:*'
            Function: !GetAtt ProcessingLambda.Arn
            Filter:
              S3Key:
                Rules:
                  - Name: suffix
                    Value: .csv

  # SNS Topic
  SNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Endpoint: !Ref EmailAddress
          Protocol: email

  # Lambda Function
  ProcessingLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: ProcessingLambda
      Runtime: python3.9
      Handler: index.lambda_handler
      Role: !GetAtt LambdaRole.Arn
      Environment:
        Variables:
          SNS_TOPIC_ARN: !Ref SNSTopic
      Code:
        ZipFile: |
          import boto3
          import os

          def lambda_handler(event, context):
              s3 = boto3.client('s3')
              sns = boto3.client('sns')
              sns_topic_arn = os.environ['SNS_TOPIC_ARN']

              # Process the uploaded file
              for record in event['Records']:
                  bucket_name = record['s3']['bucket']['name']
                  object_key = record['s3']['object']['key']

                  # Process the file data
                  # ...

                  # Send notification upon successful processing
                  sns.publish(
                      TopicArn=sns_topic_arn,
                      Message='File processing completed successfully',
                      Subject='Data Processing Notification'
                  )

              return {
                  'statusCode': 200,
                  'body': 'File processing completed successfully'
              }

  # Lambda Role
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: S3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 's3:GetObject'
                Resource: !Join ['', ['arn:aws:s3:::', !Ref S3BucketName, '/*']]
        - PolicyName: SNSPublish
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'sns:Publish'
                Resource: !Ref SNSTopic

  # Lambda Permissions
  LambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !GetAtt ProcessingLambda.Arn
      Action: 'lambda:InvokeFunction'
      Principal: s3.amazonaws.com
      SourceAccount: !Ref AWS::AccountId
      SourceArn: !Join ['', ['arn:aws:s3:::', !Ref S3BucketName]]

Outputs:

  S3BucketName:
    Description: Name of the S3 bucket for file uploads
    Value: !Ref S3Bucket
    Export:
      Name: !Sub '${AWS::StackName}-S3BucketName'

  LambdaFunctionArn:
    Description: ARN of the Lambda function
    Value: !GetAtt ProcessingLambda.Arn
    Export:
      Name: !Sub '${AWS::StackName}-LambdaFunctionArn'

  SNSTopicArn:
    Description: ARN of the SNS topic for notifications
    Value: !Ref SNSTopic
    Export:
      Name: !Sub '${AWS::StackName}-SNSTopicArn'

Additional examples
We have provided two more sample diagrams, their associated CloudFormation code generated by Anthropic’s Claude 3 Sonnet, and the prompts used to create them. You can see how diagrams in various forms, from digital to hand-drawn, or some combination, can be used. The end-to-end analysis of these samples can be found at sample 2 and sample 3 on the GitHub repo.
Best practices for architecture to code
In the demonstrated use case, you can observe how well the Anthropic’s Claude 3 Sonnet model could pull details and relationships between services from an architecture image. The following are some ways you can improve the performance of Anthropic’s Claude in this use case:

Implement a multimodal RAG approach to enhance the application’s ability to handle a wider variety of complex architecture diagrams, because the current implementation is limited to diagrams similar to the provided static examples.
Enhance the architecture diagrams by incorporating visual cues and features, such as labeling services, indicating orchestration hierarchy levels, grouping related services at the same level, enclosing services within clear boxes, and labeling arrows to represent the flow between services. These additions will aid in better understanding and interpreting the diagrams.
If the application generates an invalid CloudFormation template, provide the error as update instructions. This will help the model understand the mistake and make a correction.
Use Anthropic’s Claude 3 Opus or Anthropic’s Claude 3.5 Sonnet for greater performance on long contexts in order to support near-perfect recall.
With careful design and management, orchestrate agentic workflows by using Agents for Amazon Bedrock. This enables you to incorporate self-reflection, tool use, and planning within your workflow to generate more relevant CloudFormation templates.
Use Amazon Bedrock Prompt Flows to accelerate the creation, testing, and deployment of workflows through an intuitive visual interface. This can reduce development effort and accelerate workflow testing.

Clean up
To clean up the resources used in this demo, complete the following steps:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select the deployed development.yaml stack and choose Delete.

Conclusion
With the pattern demonstrated with Anthropic’s Claude 3 Sonnet, developers can effortlessly translate their architectural visions into reality by simply sketching their desired cloud solutions. Anthropic’s Claude 3 Sonnet’s advanced image understanding capabilities will analyze these diagrams and generate boilerplate CloudFormation code, minimizing the need for initial complex coding tasks. This visually driven approach empowers developers from a variety of skill levels, fostering collaboration, rapid prototyping, and accelerated innovation.
You can investigate other patterns, such as including RAG and agentic workflows, to improve the accuracy of code generation. You can also explore customizing the LLM by fine-tuning it to write CloudFormation code with greater flexibility.
Now that you have seen Anthropic’s Claude 3 Sonnet in action, try designing your own architecture diagrams using some of the best practices to take your prototyping to the next level.
For additional resources, refer to the following:

Anthropic’s Claude 3 Vision
Anthropic’s Claude in Amazon Bedrock
AWS CloudFormation User Guide
GitHub repo

About the Authors
Eashan Kaushik is an Associate Solutions Architect at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focusing on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.

How Northpower used computer vision with AWS to automate safety inspec …

This post is co-written with Andreas Astrom from Northpower.
Northpower provides reliable and affordable electricity and fiber internet services to customers in the Northland region of New Zealand. As an electricity distributor, Northpower aims to improve access, opportunity, and prosperity for its communities by investing in infrastructure, developing new products and services, and giving back to shareholders. Additionally, Northpower is one of New Zealand’s largest infrastructure contractors, serving clients in transmission, distribution, generation, and telecommunications. With over 1,400 staff working across 14 locations, Northpower plays a crucial role in maintaining essential services for customers driven by a purpose of connecting communities and building futures for Northland.
The energy industry is at a critical turning point. There is a strong push from policymakers and the public to decarbonize the industry, while at the same time balancing energy resilience with health, safety, and environmental risk. Recent events including Tropical Cyclone Gabrielle have highlighted the susceptibility of the grid to extreme weather and emphasized the need for climate adaptation with resilient infrastructure. Electricity Distribution Businesses (EDBs) are also facing new demands with the integration of decentralized energy resources like rooftop solar as well as larger-scale renewable energy projects like solar and wind farms. These changes call for innovative solutions to ensure operational efficiency and continued resilience.
In this post, we share how Northpower has worked with their technology partner Sculpt to reduce the effort and carbon required to identify and remediate public safety risks. Specifically, we cover the computer vision and artificial intelligence (AI) techniques used to combine datasets into a list of prioritized tasks for field teams to investigate and mitigate. The resulting dashboard highlighted that 141 power pole assets required action, out of a network of 57,230 poles.
Northpower challenge
Utility poles have stay wires that anchor the pole to the ground for extra stability. These stay wires are meant to have an inline insulator to prevent the stay wire from becoming live, which would create a safety risk for any person or animal in the area.
Northpower faced a significant challenge in determining how many of their 57,230 power poles have stay wires without insulators. Without reliable historical data, manual inspection of such a vast and predominantly rural network is labor-intensive and costly. Alternatives like helicopter surveys or field technicians require access to private properties for safety inspections, and are expensive. Moreover, the travel required for technicians to physically visit each pole across such a large network posed a considerable logistical challenge, emphasizing the need for a more efficient solution.
Thankfully, some asset datasets were available in digital format, and historical paper-based inspection reports, dating back 20 years, were available in scanned format. This archive, along with 765,933 varied-quality inspection photographs, some over 15 years old, presented a significant data processing challenge. Processing these images and scanned documents is not a cost- or time-efficient task for humans, and requires highly performant infrastructure that can reduce the time to value.
Solution overview
Amazon SageMaker is a fully managed service that helps developers and data scientists build, train, and deploy machine learning (ML) models. In this solution, the team used Amazon SageMaker Studio to launch an object detection model available in Amazon SageMaker JumpStart using the PyTorch framework.
The following diagram illustrates the high-level workflow.

Northpower chose SageMaker for a number of reasons:

SageMaker Studio is a managed service with ready-to-go development environments, saving time otherwise used for setting up environments manually
SageMaker JumpStart took care of the setup and deployed the required ML jobs involved in the project with minimal configuration, further saving development time
The integrated labeling solution with Amazon SageMaker Ground Truth was suitable for large-scale image annotations and simplified the collaboration with a Northpower labeling workforce

In the following sections, we discuss the key components of the solution as illustrated in the preceding diagram.
Data preparation
SageMaker Ground Truth was used with a human workforce made up of Northpower volunteers to annotate a set of 10,000 images. The workforce created bounding boxes around stay wires and insulators, and the output was subsequently used to train an ML model.
Model training, validation, and storage
This component uses the following services:

SageMaker Studio is used to access and deploy a pre-trained object detection model and to develop code on managed Jupyter notebooks. The model is then fine-tuned with training data from the data preparation stage (a hedged sketch of this fine-tuning step follows this list). For a step-by-step guide to set up SageMaker Studio, refer to Amazon SageMaker simplifies the Amazon SageMaker Studio setup for individual users.
SageMaker Studio runs custom Python code to augment the training data and transform the metadata output from SageMaker Ground Truth into a format supported by the computer vision model training job. The model is then trained using a fully managed infrastructure, validated, and published to the Amazon SageMaker Model Registry.
Amazon Simple Storage Service (Amazon S3) stores the model artifacts and creates a data lake to host the inference output, document analysis output, and other datasets in CSV format.
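The post does not include the exact training code, but fine-tuning a JumpStart object detection model on the Ground Truth output generally follows the pattern below. The model ID, instance types, hyperparameters, and S3 URIs here are assumptions for illustration, using the SageMaker Python SDK's JumpStartEstimator; they are not Northpower's production configuration.

# Hedged sketch of fine-tuning a JumpStart object detection model.
# Model ID, hyperparameters, instance types, and S3 URIs are illustrative assumptions.
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id = "pytorch-od1-fasterrcnn-resnet50-fpn"  # example PyTorch detector available in JumpStart

estimator = JumpStartEstimator(
    model_id=model_id,
    instance_type="ml.g4dn.2xlarge",  # assumed GPU training instance
    instance_count=1,
)
estimator.set_hyperparameters(epochs="20", learning_rate="0.0001", batch_size="8")

# The training channel points at the augmented images and annotations produced
# from the SageMaker Ground Truth labeling job.
estimator.fit({"training": "s3://example-bucket/northpower/training-data/"})

# Deploy the fine-tuned model to a real-time endpoint for inference.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")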

Model deployment and inference
In this step, SageMaker hosts the ML model on an endpoint used to run inferences.
A SageMaker Studio notebook was used again post-inference to run custom Python code to simplify the datasets and render bounding boxes on objects based on criteria. This step also applied a custom scoring system, which was rendered onto the final image and allowed for an additional human QA step for low-confidence images.
Data analytics and visualization
This component includes the following services:

An AWS Glue crawler is used to understand the dataset structures stored in the data lake so that it can be queried by Amazon Athena
Athena allows the use of SQL to combine the inference output and asset datasets to find the highest-risk items
Amazon QuickSight was used as the tool for both the human QA process and for determining which assets needed a field technician to be sent for physical inspection

Document understanding
In the final step, Amazon Textract digitizes historical paper-based asset assessments and stores the output in CSV format.
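The post does not show the Textract call itself; a minimal version of that digitization step might look like the following sketch, assuming the scanned reports are stored in Amazon S3. The bucket, object key, and output file names are placeholders.

# Minimal sketch of digitizing a scanned inspection report with Amazon Textract.
# Bucket and object names are placeholders.
import csv
import boto3

textract = boto3.client("textract")

response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "example-inspection-archive", "Name": "reports/pole-12345.png"}}
)

# Keep only LINE blocks and write them to CSV for the data lake.
lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
with open("pole-12345.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for line in lines:
        writer.writerow([line])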
Results
The trained PyTorch object detection model enabled the detection of stay wires and insulators on utility poles, and a SageMaker postprocessing job calculated a risk score using an m5.24xlarge Amazon Elastic Compute Cloud (EC2) instance with 200 concurrent Python threads. This instance was also responsible for rendering the score information along with an object bounding box onto an output image, as shown in the following example.
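A simplified version of that postprocessing step is sketched below: bounding boxes and confidence scores from the detector are drawn onto each image with Pillow, and a thread pool fans the work out across images. The thresholds, field names, and job structure are illustrative assumptions rather than Northpower's actual code.

# Hedged sketch of the postprocessing/rendering step; field names and
# thresholds are assumptions, not Northpower's production code.
from concurrent.futures import ThreadPoolExecutor
from PIL import Image, ImageDraw

def render_detections(image_path, detections, output_path, qa_threshold=0.5):
    """Draw boxes and scores; flag low-confidence images for human QA."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    needs_qa = False
    for det in detections:  # e.g. {"label": "insulator", "score": 0.92, "box": (x1, y1, x2, y2)}
        if det["score"] < qa_threshold:
            needs_qa = True
        draw.rectangle(det["box"], outline="red", width=3)
        draw.text((det["box"][0], det["box"][1] - 12), f'{det["label"]} {det["score"]:.2f}', fill="red")
    image.save(output_path)
    return needs_qa

def process_all(jobs, max_workers=200):
    """jobs is an iterable of (image_path, detections, output_path) tuples."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: render_detections(*job), jobs))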

Writing the confidence scores into the S3 data lake alongside the historical inspection results allowed Northpower to run analytics using Athena to understand each classification of image. The sunburst graph below is a visualization of this classification.

Northpower categorized 1,853 poles as high priority risks, 3,922 as medium priority, 36,260 as low priority, and 15,195 as the lowest priority. These were viewable in the QuickSight dashboard and used as an input for humans to review the highest risk assets first.

At the conclusion of the analysis, Northpower found that 31 poles needed stay wire insulators installed and a further 110 poles needed investigation in the field. This significantly reduced the cost and carbon usage involved in manually checking every asset.
Conclusion
Remote asset inspection remains a challenge for regional EDBs, but using computer vision and AI to uncover new value from previously unused data was key to Northpower’s success in this project. SageMaker JumpStart provided deployable models that could be trained for object detection use cases with minimal data science knowledge and overhead.
Discover the publicly available foundation models offered by SageMaker JumpStart and fast-track your own ML project with the following step-by-step tutorial.

About the authors
Scott Patterson is a Senior Solutions Architect at AWS.
Andreas Astrom is the Head of Technology and Innovation at Northpower.

GenAI for Aerospace: Empowering the workforce with expert knowledge on …

Aerospace companies face a generational workforce challenge today. With the strong post-COVID recovery, manufacturers are committing to record production rates, requiring the sharing of highly specialized domain knowledge across more workers. At the same time, maintaining the headcount and experience level of the workforce is increasingly challenging, as a generation of subject matter experts (SMEs) retires and increased fluidity characterizes the post-COVID labor market. This domain knowledge is traditionally captured in reference manuals, service bulletins, quality ticketing systems, engineering drawings, and more, but the quantity and complexity of documents are growing and take time to learn. You simply can’t train new SMEs overnight. Without a mechanism to manage this knowledge transfer gap, productivity across all phases of the lifecycle might suffer from losing expert knowledge and repeating past mistakes.
Generative AI is a modern form of machine learning (ML) that has recently shown significant gains in reasoning, content comprehension, and human interaction. It can be a significant force multiplier to help the human workforce quickly digest, summarize, and answer complex questions from large technical document libraries, accelerating your workforce development. AWS is uniquely positioned to help you address these challenges through generative AI, with a broad and deep range of AI/ML services and over 20 years of experience in developing AI/ML technologies.
This post shows how aerospace customers can use AWS generative AI and ML-based services to address this document-based knowledge use case, using a Q&A chatbot to provide expert-level guidance to technical staff based on large libraries of technical documents. We focus on the use of two AWS services:

Amazon Q can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take actions using the data and expertise found in your company’s information repositories, code, and enterprise systems.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Although Amazon Q is a great way to get started with no code for business users, Amazon Bedrock Knowledge Bases offers more flexibility at the API level for generative AI developers; we explore both these solutions in the following sections. But first, let’s revisit some basic concepts around Retrieval Augmented Generation (RAG) applications.
Generative AI constraints and RAG
Although generative AI holds great promise for automating complex tasks, our aerospace customers often express concerns about the use of the technology in such a safety- and security-sensitive industry. They ask questions such as:

“How do I keep my generative AI applications secure?”
“How do I make sure my business-critical data isn’t used to train proprietary models?”
“How do I know that answers are accurate and only drawn from authoritative sources?” (Avoiding the well-known problem of hallucination.)
“How can I trace the reasoning of my model back to source documents to build user trust?”
“How do I keep my generative AI applications up to date with an ever-evolving knowledge base?”

In many generative AI applications built on proprietary technical document libraries, these concerns can be addressed by using the RAG architecture. RAG helps maintain the accuracy of responses, keeps up with the rapid pace of document updates, and provides traceable reasoning while keeping your proprietary data private and secure.
This architecture combines a general-purpose large language model (LLM) with a customer-specific document database, which is accessed through a semantic search engine. Rather than fine-tuning the LLM to the specific application, the document library is loaded with the relevant reference material for that application. In RAG, these knowledge sources are often referred to as a knowledge base.
A high-level RAG architecture is shown in the following figure. The workflow includes the following steps:

When the technician has a question, they enter it at the chat prompt.
The technician’s question is used to search the knowledge base.
The search results include a ranked list of most relevant source documentation.
Those documentation snippets are added to the original query as context, and sent to the LLM as a combined prompt.
The LLM returns the answer to the question, as synthesized from the source material in the prompt.

Because RAG uses a semantic search, it can find more relevant material in the database than just a keyword match alone. For more details on the operation of RAG systems, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart.
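To make the workflow concrete, the following sketch strings the numbered steps together: the question is embedded, the top-ranked snippets are retrieved from a vector index, and the combined prompt is sent to an LLM. The embed, search, and generate callables stand in for whatever semantic search engine and model you use; they are placeholders, not a specific AWS API.

# Generic RAG sketch; embed/search/generate are placeholders for your own
# embedding model, vector index, and LLM, not a specific AWS API.
def rag_answer(question, embed, search, generate, top_k=4):
    # Steps 1-2: embed the technician's question and search the knowledge base.
    query_vector = embed(question)
    snippets = search(query_vector, top_k=top_k)  # step 3: ranked list of relevant passages

    # Step 4: add the retrieved snippets to the original query as context.
    context = "\n\n".join(f"[{s['source']}] {s['text']}" for s in snippets)
    prompt = (
        "Answer the question using only the reference material below, "
        "and cite the sources you used.\n\n"
        f"References:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # Step 5: the LLM synthesizes an answer from the supplied source material.
    return generate(prompt)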

This architecture addresses the concerns listed earlier in a few key ways:

The underlying LLM doesn’t require custom training because the domain-specialized knowledge is contained in a separate knowledge base. As a result, the RAG-based system can be kept up to date, or retrained to completely new domains, simply by changing the documents in the knowledge base. This mitigates the significant cost typically associated with training custom LLMs.
Because of the document-based prompting, generative AI answers can be constrained to only come from trusted document sources, and provide direct attribution back to those source documents to verify.
RAG-based systems can securely manage access to different knowledge bases by role-based access control. Proprietary knowledge in generative AI remains private and protected in those knowledge bases.

AWS provides customers in aerospace and other high-tech domains the tools they need to rapidly build and securely deploy generative AI solutions at scale, with world-class security. Let’s look at how you can use Amazon Q and Amazon Bedrock to build RAG-based solutions in two different use cases.
Use case 1: Create a chatbot “expert” for technicians with Amazon Q
Aerospace is a high-touch industry, and technicians are the front line of that workforce. Technician work appears at every lifecycle stage for the aircraft (and its components): engineering prototype, qualification testing, manufacture, quality inspection, maintenance, and repair. Technician work is demanding and highly specialized; it requires detailed knowledge of highly technical documentation to make sure products meet safety, functional, and cost requirements. Knowledge management is a high priority for many companies, seeking to spread domain knowledge from experts to junior employees to offset attrition, scale production capacity, and improve quality.
Our customers frequently ask us how they can use customized chatbots built on customized generative AI models to automate access to this information and help technicians make better-informed decisions and accelerate their development. The RAG architecture shown in this post is an excellent solution to this use case because it allows companies to quickly deploy domain-specialized generative AI chatbots built securely on their own proprietary documentation. Amazon Q can deploy fully managed, scalable RAG systems tailored to address a wide range of business problems. It provides immediate, relevant information and advice to help streamline tasks, accelerate decision-making, and help spark creativity and innovation at work. It can automatically connect to over 40 different data sources, including Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, Atlassian Confluence, Slack, and Jira Cloud.
Let’s look at an example of how you can quickly deploy a generative AI-based chatbot “expert” using Amazon Q.

Sign in to the Amazon Q console.

If you haven’t used Amazon Q before, you might be greeted with a request for initial configuration.

Under Connect Amazon Q to IAM Identity Center, choose Create account instance to create a custom credential set for this demo.
Under Select a bundle to get started, under Amazon Q Business Lite, choose Subscribe in Q Business to create a test subscription.

If you have previously used Amazon Q in this account, you can simply reuse an existing user or subscription for this walkthrough.

After you create your AWS IAM Identity Center and Amazon Q subscription, choose Get started on the Amazon Q landing page.

Choose Create application.
For Application name, enter a name (for example, my-tech-assistant).
Under Service access, select Create and use a new service-linked role (SLR).
Choose Create.

This creates the application framework.

Under Retrievers, select Use native retriever.
Under Index provisioning, select Starter for a basic, low-cost retriever.
Choose Next.

Next, we need to configure a data source. For this example, we use Amazon S3 and assume that you have already created a bucket and uploaded documents to it (for more information, see Step 1: Create your first S3 bucket). For this example, we have uploaded some public domain documents from the Federal Aviation Administration (FAA) technical library relating to software, system standards, instrument flight rating, aircraft construction and maintenance, and more.

For Data sources, choose Amazon S3 to point our RAG assistant to this S3 bucket.

For Data source name, enter a name for your data source (independent of the S3 bucket name, such as my-faa-docs).
Under IAM role, choose Create new service role (Recommended).
Under Sync scope, choose the S3 bucket where you uploaded your documents.
Under Sync run schedule, choose Run on demand (or another option, if you want your documents to be re-indexed on a set schedule).
Choose Add data source.
Leave the remaining settings as default and choose Next to finish adding your Amazon S3 data source.

Finally, we need to create user access permissions to our chatbot.

Under Add groups and users, choose Add groups and users.
In the popup that appears, you can choose to either create new users or select existing ones. If you want to use an existing user, you can skip the following steps:

Select Add new users, then choose Next.
Enter the new user information, including a valid email address.

An email will be sent to that address with a link to validate that user.

Now that you have a user, select Assign existing users and groups and choose Next.
Choose your user, then choose Assign.

You should now have a user assigned to your new chatbot application.

Under Web experience service access, select Create and use a new service role.
Choose Create application.

You now have a new generative AI application! Before the chatbot can answer your questions, you have to run the indexer on your documents at least one time.

On the Applications page, choose your application.

Select your data source and choose Sync now.

The synchronization process takes a few minutes to complete.

When the sync is complete, on the Web experience settings tab, choose the link under Deployed URL.

If you haven’t yet, you will be prompted to log in using the user credentials you created; use the email address as the user name.
Your chatbot is now ready to answer technical questions on the large library of documents you provided. Try it out! You’ll notice that for each answer, the chatbot provides a Sources option that indicates the authoritative reference from which it drew its answer.

Our fully customized chatbot required no coding, no custom data schemas, and no managing of underlying infrastructure to scale! Amazon Q fully manages the infrastructure required to securely deploy your technician’s assistant at scale.
Use case 2: Use Amazon Bedrock Knowledge Bases
As we demonstrated in the previous use case, Amazon Q fully manages the end-to-end RAG workflow and allows business users to get started quickly. But what if you need more granular control of parameters related to the vector database, chunking, retrieval, and models used to generate final answers? Amazon Bedrock Knowledge Bases allows generative AI developers to build and interact with proprietary document libraries for accurate and efficient Q&A over documents. In this example, we use the same FAA documents as before, but this time we set up the RAG solution using Amazon Bedrock Knowledge Bases. We demonstrate how to do this using both APIs and the Amazon Bedrock console. The full notebook for following the API-based approach can be downloaded from the GitHub repo.
The following diagram illustrates the architecture of this solution.

Create your knowledge base using the API
To implement the solution using the API, complete the following steps:

Create a role with the necessary policies to access data from Amazon S3 and write embeddings to Amazon OpenSearch Serverless. This role will be used by the knowledge base to retrieve relevant chunks from OpenSearch based on the input query.

# Create security, network and data access policies within OSS
encryption_policy, network_policy, access_policy = create_policies_in_oss(
    vector_store_name=vector_store_name,
    aoss_client=aoss_client,
    bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn,
)

Create an empty OpenSearch Serverless index to store the document embeddings and metadata. OpenSearch Serverless is a fully managed option that allows you to run petabyte-scale workloads without managing clusters.

# Create the OpenSearch Serverless collection
collection = aoss_client.create_collection(name=vector_store_name, type='VECTORSEARCH')

# Create the index within the collection
response = oss_client.indices.create(index=index_name, body=json.dumps(body_json))
print('Creating index:')
pp.pprint(response)

With the OpenSearch Serverless index set up, you can now create the knowledge base and associate it with a data source containing our documents. For brevity, we haven’t included the full code; to run this example end-to-end, refer to the GitHub repo.

# Initialize OSS configuration for the Knowledge Base
opensearchServerlessConfiguration = { … }

# Set chunking strategy for how to split documents
chunkingStrategyConfiguration = { … }

# Configure S3 data source
s3Configuration = { … }

# Set embedding model ARN
embeddingModelArn = "arn:aws:bedrock:{region}::foundation-model/amazon.titan-embed-text-v2:0"

# Create the Knowledge Base
kb = create_knowledge_base_func()

# Create a data source and associate it with the KB
ds = bedrock_agent_client.create_data_source(…)

# Start ingestion job to load data into OSS
start_job_response = bedrock_agent_client.start_ingestion_job(
    knowledgeBaseId=kb['knowledgeBaseId'], dataSourceId=ds['dataSourceId'])

The ingestion job will fetch documents from the Amazon S3 data source, preprocess and chunk the text, create embeddings for each chunk, and store them in the OpenSearch Serverless index.
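Ingestion runs asynchronously, so in practice you typically poll its status before querying the knowledge base. A hedged example using the same bedrock_agent_client follows; the polling interval is arbitrary.

# Poll the ingestion job started above until it finishes (hedged example).
import time

job_id = start_job_response["ingestionJob"]["ingestionJobId"]
while True:
    job = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=kb["knowledgeBaseId"],
        dataSourceId=ds["dataSourceId"],
        ingestionJobId=job_id,
    )["ingestionJob"]
    if job["status"] in ("COMPLETE", "FAILED"):
        print("Ingestion finished with status:", job["status"])
        break
    time.sleep(30)  # wait before checking again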

With the knowledge base populated, you can now query it using the RetrieveAndGenerate API and get responses generated by LLMs like Anthropic’s Claude on Amazon Bedrock:

# Helper function to query the knowledge base
def ask_bedrock_llm_with_knowledge_base(query, model_arn, kb_id):
    response = bedrock_agent_runtime_client.retrieve_and_generate(…)
    generated_text = response['output']['text']
    return generated_text

# Example query
query = "How are namespaces registered with the FAA for service providers?"

# Query knowledge base with different Claude models
for model_id in claude_model_ids:
    model_arn = f'arn:aws:bedrock:{region}::foundation-model/{model_id[1]}'
    generated_text = ask_bedrock_llm_with_knowledge_base(query, model_arn, kb_id)
    print(f"Generated using {model_id[0]}:")
    pp.pprint(generated_text)
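The elided retrieve_and_generate call in the helper above generally takes the following shape; this is a hedged sketch of the API parameters rather than the notebook's exact code.

# Hedged sketch of the RetrieveAndGenerate call used inside the helper.
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={"text": query},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,   # knowledge base created earlier
            "modelArn": model_arn,      # LLM used to synthesize the final answer
        },
    },
)
generated_text = response["output"]["text"]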

The RetrieveAndGenerate API converts the query into an embedding, searches the knowledge base for relevant document chunks, and generates a response by providing the retrieved context to the specified language model. We asked the question “How are namespaces registered with the FAA for service providers?” Anthropic’s Claude 3 Sonnet uses the chunks retrieved from our OpenSearch vector index to answer as follows:
To register a namespace with the FAA as a service provider, you need to follow these steps:

Develop the namespaces metadata according to FAA-STD-063 and submit it for registration in the FAA Data Registry (FDR).
The FDR registrar will perform the namespace registration function. The specific process for developing and registering a namespace in the FDR involves:
Searching the FDR for an existing namespace that matches your business domain. If none exists, work with the FDR registrar to create a new one.
Create and document the new namespace according to FAA-STD-063, following the guidelines for organization, web service, or taxonomy namespaces.
Register the namespace in the FDR by either filling out a registration form and submitting it to the FDR registrar, or requesting access to enter the metadata directly into the FDR.

Create your knowledge base on the Amazon Bedrock console
If you prefer, you can build the same solution in Amazon Bedrock Knowledge Bases using the Amazon Bedrock console instead of the API-based implementation shown in the previous section. Complete the following steps:

Sign in to your AWS account.
On the Amazon Bedrock console, choose Get started.

As a first step, you need to set up your permissions to use the various LLMs in Amazon Bedrock.

Choose Model access in the navigation pane.
Choose Modify model access.

Select the LLMs to enable.
Choose Next, then choose Submit to complete your access request.

You should now have access to the models you requested.

Now you can set up your knowledge base.

Choose Knowledge bases under Builder tools in the navigation pane.
Choose Create knowledge base.

On the Provide knowledge base details page, keep the default settings and choose Next.
For Data source name, enter a name for your data source or keep the default.
For S3 URI, choose the S3 bucket where you uploaded your documents.
Choose Next.

Under Embeddings model, choose the embeddings LLM to use (for this post, we choose Titan Text Embeddings).
Under Vector database, select Quick create a new vector store.

This option uses OpenSearch Serverless as the vector store.

Choose Next.

Choose Create knowledge base to finish the process.

Your knowledge base is now set up! Before interacting with the chatbot, you need to index your documents. Make sure you have already loaded the desired source documents into your S3 bucket; for this walkthrough, we use the same public-domain FAA library referenced in the previous section.

Under Data source, select the data source you created, then choose Sync.
When the sync is complete, choose Select model in the Test knowledge base pane, and choose the model you want to try (for this post, we use Anthropic Claude 3 Sonnet, but Amazon Bedrock gives you the flexibility to experiment with many other models).

Your technician’s assistant is now set up! You can experiment with it using the chat window in the Test knowledge base pane. Experiment with different LLMs and see how they perform. Amazon Bedrock provides a simple API-based framework to experiment with different models and RAG components so you can tune them to help meet your requirements in production workloads.

Clean up
When you’re done experimenting with the assistant, complete the following steps to clean up your created resources to avoid ongoing charges to your account:

On the Amazon Q Business console, choose Applications in the navigation pane.
Select the application you created, and on the Actions menu, choose Delete.
On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
Select the knowledge base you created, then choose Delete.

Conclusion
This post showed how quickly you can launch generative AI-enabled expert chatbots, trained on your proprietary document sets, to empower your workforce across specific aerospace roles with Amazon Q and Amazon Bedrock. After you have taken these basic steps, more work will be needed to solidify these solutions for production. Future editions in this “GenAI for Aerospace” series will explore follow-up topics, such as creating additional security controls and tuning performance for different content.
Generative AI is changing the way companies address some of their largest challenges. For our aerospace customers, generative AI can help with many of the scaling challenges that come from ramping production rates and the skills of their workforce to match. This post showed how you can apply this technology to expert knowledge challenges in various functions of aerospace development today. The RAG architecture shown can help meet key requirements for aerospace customers: maintaining privacy of data and custom models, minimizing hallucinations, customizing models with private and authoritative reference documents, and direct attribution of answers back to those reference documents. There are many other aerospace applications where generative AI can be applied: non-conformance tracking, business forecasting, bid and proposal management, engineering design and simulation, and more. We examine some of these use cases in future posts.
AWS provides a broad range of AI/ML services to help you develop generative AI solutions for these use cases and more. This includes newly announced services like Amazon Q, which provides fast, relevant answers to pressing business questions drawn from enterprise data sources, with no coding required, and Amazon Bedrock, which provides quick API-level access to a wide range of LLMs, with knowledge base management for your proprietary document libraries and direct integration to external workflows through agents. AWS also offers competitive price-performance for AI workloads, running on purpose-built silicon—the AWS Trainium and AWS Inferentia processors—to run your generative AI services in the most cost-effective, scalable, simple-to-manage way. Get started on addressing your toughest business challenges with generative AI on AWS today!
For more information on working with generative AI and RAG on AWS, refer to Generative AI. For more details on building an aerospace technician’s assistant with AWS generative AI services, refer to Guidance for Aerospace Technician’s Assistant on AWS.

About the authors
Peter Bellows is a Principal Solutions Architect and Head of Technology for Commercial Aviation in the Worldwide Specialist Organization (WWSO) at Amazon Web Services (AWS). He leads technical development for solutions across aerospace domains, including manufacturing, engineering, operations, and security. Prior to AWS, he worked in aerospace engineering for 20+ years.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.
Priyanka Mahankali is a Senior Specialist Solutions Architect for Aerospace at AWS, bringing over 7 years of experience across the cloud and aerospace sectors. She is dedicated to streamlining the journey from innovative industry ideas to cloud-based implementations.

Scalable training platform with Amazon SageMaker HyperPod for innovati …

Video generation has become the latest frontier in AI research, following the success of text-to-image models. Luma AI’s recently launched Dream Machine represents a significant advancement in this field. This text-to-video API generates high-quality, realistic videos quickly from text and images. Trained on Amazon SageMaker HyperPod, Dream Machine excels in creating consistent characters, smooth motion, and dynamic camera movements.
To accelerate iteration and innovation in this field, sufficient computing resources and a scalable platform are essential. During the iterative research and development phase, data scientists and researchers need to run multiple experiments with different versions of algorithms and scale to larger models. Model parallel training becomes necessary when the total model footprint (model weights, gradients, and optimizer states) exceeds the memory of a single GPU. However, building large distributed training clusters is a complex and time-intensive process that requires in-depth expertise. Furthermore, as clusters scale to larger sizes (for example, more than 32 nodes), they require built-in resiliency mechanisms such as automated faulty node detection and replacement to improve cluster goodput and maintain efficient operations. These challenges underscore the importance of robust infrastructure and management systems in supporting advanced AI research and development.
Amazon SageMaker HyperPod, introduced during re:Invent 2023, is a purpose-built infrastructure designed to address the challenges of large-scale training. It removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training foundation models (FMs). SageMaker HyperPod offers a highly customizable user interface using Slurm, allowing users to select and install any required frameworks or tools. Clusters are provisioned with the instance type and count of your choice and can be retained across workloads. With these capabilities, customers are adopting SageMaker HyperPod as their innovation platform for more resilient and performant model training, enabling them to build state-of-the-art models faster.
In this post, we share an ML infrastructure architecture that uses SageMaker HyperPod to support research team innovation in video generation. We will discuss the advantages and pain points addressed by SageMaker HyperPod, provide a step-by-step setup guide, and demonstrate how to run a video generation algorithm on the cluster.
Training video generation algorithms on Amazon SageMaker HyperPod: background and architecture
Video generation is an exciting and rapidly evolving field that has seen significant advancements in recent years. While generative modeling has made tremendous progress in the domain of image generation, video generation still faces several challenges that require further improvement.
Algorithms architecture complexity with diffusion model family
Diffusion models have recently made significant strides in generating high-quality images, prompting researchers to explore their potential in video generation. By leveraging the architecture and pre-trained generative capabilities of diffusion models, scientists aim to create visually impressive videos. The process extends image generation techniques to the temporal domain. Starting with noisy frames, the model iteratively refines them, removing random elements while adding meaningful details guided by text or image prompts. This approach progressively transforms abstract patterns into coherent video sequences, effectively translating diffusion models’ success in static image creation to dynamic video synthesis.
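To make the iterative refinement concrete, the following is a minimal sketch of a video denoising loop, assuming diffusers-style unet and scheduler objects supplied by the caller; the function name, latent shape, and step count are illustrative assumptions, not details of any specific model.

import torch

def generate_video_latents(unet, scheduler, prompt_embeddings,
                           num_frames=16, num_steps=50, height=64, width=64):
    # Start every frame from pure Gaussian noise: (frames, latent_channels, H, W)
    latents = torch.randn(num_frames, 4, height, width)
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # The UNet (extended with temporal layers) predicts noise for all frames jointly,
        # conditioned on the text or image prompt embeddings
        noise_pred = unet(latents, t, encoder_hidden_states=prompt_embeddings).sample
        # The scheduler removes part of the predicted noise at this timestep
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents  # in practice, decoded to RGB frames by a VAE decoder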
However, the compute requirements for video generation using diffusion models increase substantially compared to image generation for several reasons:

Temporal dimension – Unlike image generation, video generation requires processing multiple frames simultaneously. This adds a temporal dimension to the original 2D UNet, significantly increasing the amount of data that needs to be processed in parallel.
Iterative denoising process – The diffusion process involves multiple iterations of denoising for each frame. When extended to videos, this iterative process must be applied to multiple frames, multiplying the computational load.
Increased parameter count – To handle the additional complexity of video data, models often require more parameters, leading to larger memory footprints and increased computational demands.
Higher resolution and longer sequences – Video generation often aims for higher resolution outputs and longer sequences compared to single image generation, further amplifying the computational requirements.

Due to these factors, the operational efficiency of diffusion models for video generation is lower and significantly more compute-intensive compared to image generation. This increased computational demand underscores the need for advanced hardware solutions and optimized model architectures to make video generation more practical and accessible.
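As a rough, back-of-the-envelope illustration (the numbers are assumptions, not measurements), the denoising cost alone scales with the number of frames times the number of denoising steps:

# Illustrative comparison of UNet evaluations per output (assumed values)
image_steps = 50                                  # denoising steps for one image
video_frames, video_steps = 16, 50                # a short clip at the same step count
image_unet_calls = image_steps                    # 50 evaluations per image
video_unet_calls = video_frames * video_steps     # 800 evaluations per clip
print(video_unet_calls / image_unet_calls)        # 16x, before counting larger models and resolutions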
Handling the increased computational requirements
The improvement in video generation quality necessitates a significant increase in the size of the models and training data. Researchers have concluded that scaling up the base model size leads to substantial enhancements in video generation performance. However, this growth comes with considerable challenges in terms of computing power and memory resources. Training larger models requires more computational power and memory, which can limit the accessibility and practical use of these models. As the model size increases, the computational requirements grow rapidly, making it difficult to train these models on a single GPU, or even in a single-node, multi-GPU environment. Moreover, storing and manipulating the large datasets required for training also pose significant challenges in terms of infrastructure and costs. High-quality video datasets tend to be massive, requiring substantial storage capacity and efficient data management systems. Transferring and processing these datasets can be time-consuming and resource-intensive, adding to the overall computational burden.
Maintaining temporal consistency and continuity
Maintaining temporal consistency and continuity becomes increasingly challenging as the length of the generated video increases. Temporal consistency refers to the continuity of visual elements, such as objects, characters, and scenes, across subsequent frames. Inconsistencies in appearance, movement, or lighting can lead to jarring visual artifacts and disrupt the overall viewing experience. To address this challenge, researchers have explored the use of multiframe inputs, which provide the model with information from multiple consecutive frames to better understand and model the relationships and dependencies across time. These techniques preserve high-resolution details in visual quality while simulating a continuous and smooth temporal motion process. However, they require more sophisticated modeling techniques and increased computational resources.
Algorithm overview
In the following sections, we illustrate how to run the Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation algorithm on Amazon SageMaker HyperPod for video generation. Animate Anyone is one of the methods for transforming character images into animated videos controlled by desired pose sequences. The key components of the architecture include:

ReferenceNet – A symmetrical UNet structure that captures spatial details of the reference image and integrates them into the denoising UNet using spatial-attention to preserve appearance consistency
Pose guider – A lightweight module that efficiently integrates pose control signals into the denoising process to ensure pose controllability
Temporal layer – Added to the denoising UNet to model relationships across multiple frames, preserving high-resolution details and ensuring temporal stability and continuity of the character’s motion

The model architecture is illustrated in the following image from its original research paper. The method is trained on a dataset of video clips and achieves state-of-the-art results on fashion video and human dance synthesis benchmarks, demonstrating its ability to animate arbitrary characters while maintaining appearance consistency and temporal stability. The implementation of AnimateAnyone can be found in this repository.

To address the challenges of large-scale training infrastructure required in video generation training process, we can use the power of Amazon SageMaker HyperPod. While many customers have adopted SageMaker HyperPod for large-scale training, such as Luma’s launch of Dream Machine and Stability AI’s work on FMs for image or video generation, we believe that the capabilities of SageMaker HyperPod can also benefit lighter ML workloads, including full fine-tuning.
Amazon SageMaker HyperPod concept and advantage
SageMaker HyperPod offers a comprehensive set of features that significantly enhance the efficiency and effectiveness of ML workflows. From purpose-built infrastructure for distributed training to customizable environments and seamless integration with tools like Slurm, SageMaker HyperPod empowers ML practitioners to focus on their core tasks while taking advantage of the power of distributed computing. With SageMaker HyperPod, you can accelerate your ML projects, handle larger datasets and models, and drive innovation in your organization. SageMaker HyperPod provides several key features and advantages in the scalable training architecture.

Purpose-built infrastructure – One of the primary advantages of SageMaker HyperPod is its purpose-built infrastructure for distributed training. It simplifies the setup and management of clusters, allowing you to easily configure the desired instance types and counts, which can be retained across workloads. As a result of this flexibility, you can adapt to various scenarios. For example, when working with a smaller backbone model like Stable Diffusion 1.5, you can run multiple experiments simultaneously on a single GPU to accelerate the iterative development process. As your dataset grows, you can seamlessly switch to data parallelism and distribute the workload across multiple GPUs, such as eight GPUs, to reduce compute time. Furthermore, when dealing with larger backbone models like Stable Diffusion XL, SageMaker HyperPod offers the flexibility to scale and use model parallelism.
Shared file system – SageMaker HyperPod supports the attachment of a shared file system, such as Amazon FSx for Lustre. This integration brings several benefits to your ML workflow. FSx for Lustre enables full bidirectional synchronization with Amazon Simple Storage Service (Amazon S3), including the synchronization of deleted files and objects. It also allows you to synchronize file systems with multiple S3 buckets or prefixes, providing a unified view across multiple datasets. In our case, this means that the installed libraries within the conda virtual environment will be synchronized across different worker nodes, even if the cluster is torn down and recreated. Additionally, input video data for training and inference results can be seamlessly synchronized with S3 buckets, enhancing the experience of validating inference results.
Customizable environment – SageMaker HyperPod offers the flexibility to customize your cluster environment using lifecycle scripts. These scripts allow you to install additional frameworks, debugging tools, and optimization libraries tailored to your specific needs. You can also split your training data and model across all nodes for parallel processing, fully using the cluster’s compute and network infrastructure. Moreover, you have full control over the execution environment, including the ability to easily install and customize virtual Python environments for each project. In our case, all the required libraries for running the training script are installed within a conda virtual environment, which is shared across all worker nodes, simplifying the process of distributed training on multi-node setups. We also installed MLflow Tracking on the controller node to monitor the training progress.
Job distribution with Slurm integration – SageMaker HyperPod seamlessly integrates with Slurm, a popular open source cluster management and job scheduling system. Slurm can be installed and set up through lifecycle scripts as part of the cluster creation process, providing a highly customizable user interface. With Slurm, you can efficiently schedule jobs across different GPU resources so you can run multiple experiments in parallel or use distributed training to train large models for improved performance. Customers can also customize the job queues, prioritization algorithms, and job preemption policies, ensuring optimal resource use and streamlining ML workflows. If you prefer a Kubernetes-based administration experience, SageMaker HyperPod recently introduced Amazon EKS support, so you can manage clusters through a Kubernetes-based interface.
Enhanced productivity – To further enhance productivity, SageMaker HyperPod supports connecting to the cluster using Visual Studio Code (VS Code) through a Secure Shell (SSH) connection. You can easily browse and modify code within an integrated development environment (IDE), execute Python scripts seamlessly as if in a local environment, and launch Jupyter notebooks for quick development and debugging. The Jupyter notebook application experience within VS Code provides a familiar and intuitive interface for iterative experimentation and analysis.
Set up SageMaker HyperPod and run video generation algorithms
In this walkthrough, we use the AnimateAnyone algorithm as an illustration for video generation. AnimateAnyone is a state-of-the-art algorithm that generates high-quality videos from input images or videos. Our walkthrough guidance code is available on GitHub.
Set up the cluster
To create the SageMaker HyperPod infrastructure, follow the detailed, step-by-step cluster setup guidance from the Amazon SageMaker HyperPod workshop studio.
The two things you need to prepare are a provisioning_parameters.json file required by HyperPod for setting up Slurm and a cluster-config.json file as the configuration file for creating the HyperPod cluster. Inside these configuration files, you need to specify the InstanceGroupName, InstanceType, and InstanceCount for the controller group and worker group, as well as the execution role attached to the group.
One practical setup is to set up bidirectional synchronization with Amazon FSx and Amazon S3. This can be done with the Amazon S3 integration for Amazon FSx for Lustre. It helps to establish a full bidirectional synchronization of your file systems with Amazon S3. In addition, it can synchronize your file systems with multiple S3 buckets or prefixes.
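For example, a data repository association between the file system and an S3 bucket can be created with the AWS CLI roughly as follows (a minimal sketch; the file system ID, path, and bucket name are placeholders, not values from this walkthrough):

aws fsx create-data-repository-association \
    --file-system-id fs-0123456789abcdef0 \
    --file-system-path /data \
    --data-repository-path s3://your-training-data-bucket/videogen \
    --s3 "AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}"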
In addition, if you prefer a local IDE such as VSCode, you can set up an SSH connection to the controller node within your IDE. In this way, the worker nodes can be used for running scripts within a conda environment and a Jupyter notebook server.
Run the AnimateAnyone algorithm
When the cluster is in service, you can connect to the controller node using SSH and then move to the worker nodes, where the GPU compute resources are available. You can follow the SSH Access to compute guide. We suggest installing the libraries on the worker nodes directly.
To create the conda environment, follow the instructions at Miniconda’s Quick command line install. You can then use the conda environment to install all required libraries.

source ~/miniconda3/bin/activate
conda create -n videogen
conda activate videogen
pip install -r requirements.txt

To run AnimateAnyone, clone the GitHub repo and follow the instructions.
To train AnimateAnyone, launch stage 1 for training the denoising UNet and ReferenceNet, which enables the model to generate high-quality animated images under the condition of a given reference image and target pose. The denoising UNet and ReferenceNet are initialized based on the pre-trained weights from Stable Diffusion.

accelerate launch train_stage_1.py --config configs/train/stage1.yaml

In stage 2, the objective is to train the temporal layer to capture the temporal dependencies among video frames.

accelerate launch train_stage_2.py --config configs/train/stage2.yaml

Once the training script runs as expected, use a Slurm job to schedule it on a single node. We provide a batch file to simulate the single-node training job; it can use a single GPU or a single node with multiple GPUs. If you want to know more, the documentation provides detailed instructions on running jobs on SageMaker HyperPod clusters.

sbatch submit-animateanyone-algo.sh

#!/bin/bash
#SBATCH --job-name=video-gen
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH -o video-gen-stage-1.out
export OMP_NUM_THREADS=1
# Activate the conda environment
source ~/miniconda3/bin/activate
conda activate videogen
srun accelerate launch train_stage_1.py --config configs/train/stage1.yaml
Check the job status using the following code snippet.

squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10 dev video-ge ubuntu R 0:16 1 ip-10-1-93-196

By using a small batch size and setting use_8bit_adam=True, you can achieve efficient training on a single GPU. When each experiment fits on a single GPU, a multi-GPU cluster lets you run several experiments in parallel.
The following code block is one example of running four jobs in parallel to test different hyperparameters. We provide the batch file here as well.

sbatch submit-hyperparameter-testing.sh
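The actual batch file is provided in the walkthrough repository; the following is only a hedged sketch of what such a submission could look like, using a Slurm job array (the per-experiment config file names are hypothetical):

#!/bin/bash
#SBATCH --job-name=video-gen
#SBATCH -N 1
#SBATCH --array=0-3
#SBATCH -o video-gen-hp-%A_%a.out
export OMP_NUM_THREADS=1
source ~/miniconda3/bin/activate
conda activate videogen
# Hypothetical per-experiment configs (stage1_hp0.yaml ... stage1_hp3.yaml), each holding
# one hyperparameter combination to test
srun accelerate launch train_stage_1.py --config configs/train/stage1_hp${SLURM_ARRAY_TASK_ID}.yaml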

squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4_0 dev video-ge ubuntu R 0:08 1 ip-10-1-17-56
4_1 dev video-ge ubuntu R 0:08 1 ip-10-1-33-49
4_2 dev video-ge ubuntu R 0:08 1 ip-10-1-37-152
4_3 dev video-ge ubuntu R 0:08 1 ip-10-1-83-68

The experiments can then be compared, and you can move forward with the best configuration. In our scenario, shown in the following screenshot, we use different datasets and video preprocessing strategies to validate the stage 1 training, then quickly draw conclusions about their impact on video quality from the stage 1 training results. For experiment tracking, besides installing MLflow on the controller node to monitor training progress, you can also use the fully managed MLflow capability on Amazon SageMaker, which makes it easy for data scientists to use MLflow for model training, registration, and deployment.
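If you use the managed capability, pointing the MLflow client at a SageMaker tracking server looks roughly like the following (a minimal sketch; the tracking server ARN, experiment name, and logged values are placeholders, and it assumes the mlflow and sagemaker-mlflow packages are installed in the training environment):

import mlflow

# Placeholder ARN of a SageMaker managed MLflow tracking server
mlflow.set_tracking_uri("arn:aws:sagemaker:us-east-1:111122223333:mlflow-tracking-server/videogen")
mlflow.set_experiment("animate-anyone-stage1")

with mlflow.start_run():
    # Record the settings and metrics you want to compare across experiments
    mlflow.log_param("dataset", "fashion-clips-v1")
    mlflow.log_metric("train_loss", 0.123, step=100)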

Scale to multi-node GPU setup
As model sizes grow, single GPU memory quickly becomes a bottleneck. Large models easily exhaust memory with pure data parallelism, and implementing model parallelism can be challenging. DeepSpeed addresses these issues, accelerating model development and training.
ZeRO
DeepSpeed is a deep learning optimization library that aims to make distributed training easy, efficient, and effective. DeepSpeed’s ZeRO removes memory redundancies across data-parallel processes by partitioning three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. This approach significantly boosts memory efficiency compared to classic data-parallelism while maintaining computational granularity and communication efficiency.
ZeRO offers three stages of optimization:

ZeRO Stage 1 – Partitions optimizer states across processes, with each process updating only its partition
ZeRO Stage 2 – Additionally partitions gradients, with each process retaining only the gradients corresponding to its optimizer state portion
ZeRO Stage 3 – Partitions model parameters across processes, automatically collecting and partitioning them during forward and backward passes

Each stage offers progressively higher memory efficiency at the cost of increased communication overhead. These techniques enable training of extremely large models that would otherwise be impossible. This is particularly useful when working with limited GPU memory or training very large models.
Accelerate
Accelerate is a library that enables running the same PyTorch code across any distributed configuration with minimal code changes. It handles the complexities of distributed setups, allowing developers to focus on their models rather than infrastructure. To put it briefly, Accelerate makes training and inference at scale straightforward, efficient, and adaptable.
Accelerate allows easy integration of DeepSpeed features through a configuration file. Users can supply a custom configuration file or use provided templates. The following is an example of how to use DeepSpeed with Accelerate.
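As a minimal sketch of such a configuration file (the values are illustrative and must match your cluster; this is not the exact file used in this walkthrough), an Accelerate config that enables ZeRO Stage 2 on a single node with four GPUs could look like the following, and you would pass it to accelerate launch with the --config_file flag:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: fp16
num_machines: 1
num_processes: 4
machine_rank: 0
main_training_function: main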
Single node with multiple GPUs job
To run a job on a single node with multiple GPUs, we tested this configuration on an instance with four GPUs (for example, ml.g5.24xlarge). For this instance type, adjust train_width: 768 and train_height: 768, and set use_8bit_adam: False in your configuration file. You’ll likely notice that the model can handle much larger images for generation with these settings.

sbatch submit-deepspeed-singlenode.sh

This Slurm job will:

Allocate a single node
Activate the training environment
Run accelerate launch train_stage_1.py --config configs/train/stage1.yaml

Multi-node with multiple GPUs job
To run a job across multiple nodes, each with multiple GPUs, we have tested this distribution with two ml.g5.24xlarge instances.

sbatch submit-deepspeed-multinode.sh

This Slurm job will:

Allocate the specified number of nodes
Activate the training environment on each node
Run accelerate launch --multi_gpu --num_processes <num_processes> --num_machines <num_machines> train_stage_1.py --config configs/train/stage1.yaml

When running a multi-node job, make sure that the num_processes and num_machines arguments are set correctly based on your cluster configuration.
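One way to keep these arguments in sync with the Slurm allocation is to derive them from Slurm environment variables inside the batch file. The following is a hedged sketch (the GPU count, port, and flag values are assumptions; the actual submit-deepspeed-multinode.sh in the repository may differ):

#!/bin/bash
#SBATCH --job-name=video-gen-multinode
#SBATCH -N 2
#SBATCH --exclusive
GPUS_PER_NODE=4                                   # ml.g5.24xlarge provides 4 GPUs per node
NUM_MACHINES=$SLURM_JOB_NUM_NODES
NUM_PROCESSES=$((GPUS_PER_NODE * NUM_MACHINES))
MAIN_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
source ~/miniconda3/bin/activate
conda activate videogen
srun accelerate launch --multi_gpu \
    --num_machines "$NUM_MACHINES" --num_processes "$NUM_PROCESSES" \
    --main_process_ip "$MAIN_NODE" --main_process_port 29500 \
    --machine_rank "$SLURM_NODEID" \
    train_stage_1.py --config configs/train/stage1.yaml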
For optimal performance, adjust the batch size and learning rate according to the number of GPUs and nodes being used. Consider using a learning rate scheduler to adapt the learning rate during training.
Additionally, monitor the GPU memory usage and adjust the model’s architecture or batch size if necessary to prevent out-of-memory issues.
By following these steps and configurations, you can efficiently train your models on single-node and multi-node setups with multiple GPUs, taking advantage of the power of distributed training.
Monitor cluster usage
To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, integrate the cluster with Amazon Managed Service for Prometheus and Amazon Managed Grafana. The integration with Amazon Managed Service for Prometheus makes it possible to export metrics related to your HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana makes it possible to visualize these metrics through various Grafana dashboards that offer intuitive interfaces for monitoring and analyzing the cluster’s behavior. You can follow the SageMaker documentation on Monitor SageMaker HyperPod cluster resources and the Workshop Studio Observability section to bootstrap your cluster monitoring with the metric exporter services. The following screenshot shows a Grafana dashboard.

Inference and results discussion
When the fine-tuned model is ready, you have two primary deployment options: using popular image and video generation GUIs like ComfyUI or deploying an inference endpoint with Amazon SageMaker. The SageMaker option offers several advantages, including easy integration of image generation APIs with video generation endpoints to create end-to-end pipelines. As a managed service with auto scaling, SageMaker enables parallel generation of multiple videos, using either the same reference image with different reference videos or the reverse. Furthermore, you can deploy various video generation model endpoints, such as MimicMotion and UniAnimate, and compare quality by generating videos in parallel with the same reference image and video. This approach not only provides flexibility and scalability but also accelerates production by generating a large number of videos quickly, ultimately streamlining the process of obtaining content that meets business requirements. The SageMaker option thus offers a powerful, efficient, and scalable solution for video generation workflows. The following diagram shows a basic version of the video generation pipeline. You can modify it based on your own specific business requirements.
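As a minimal sketch of the SageMaker option (the model data location, container choice, instance type, and inference handler are illustrative assumptions, not the exact deployment used here), an asynchronous endpoint, which suits long-running video generation requests, could be created roughly as follows:

import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.async_inference import AsyncInferenceConfig

role = sagemaker.get_execution_role()

# Placeholder S3 location of the fine-tuned model artifacts and a hypothetical handler script
model = PyTorchModel(
    model_data="s3://your-bucket/animate-anyone/model.tar.gz",
    role=role,
    framework_version="2.1",
    py_version="py310",
    entry_point="inference.py",
)

# Requests and responses are exchanged through Amazon S3, so long generations don't time out
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://your-bucket/animate-anyone/async-output/"
    ),
)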

Recent advancements in video generation have rapidly overcome limitations of earlier models like AnimateAnyone. Two notable research papers showcase significant progress in this domain.
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance enhances shape alignment and motion guidance. It demonstrates superior ability in generating high-quality human animations that accurately capture both pose and shape variations, with improved generalization on in-the-wild datasets.
UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation makes it possible to generate longer videos, up to one minute, compared to earlier models’ limited frame outputs. It introduces a unified noise input supporting both random noise input and first frame conditioned input, enhancing long-term video generation capabilities.
Cleanup
To avoid incurring future charges, delete the resources created as part of this post:

Delete the SageMaker HyperPod cluster using either the CLI or the console.
Once the SageMaker HyperPod cluster deletion is complete, delete the CloudFormation stack. For more details on cleanup, refer to the cleanup section in the Amazon SageMaker HyperPod workshop.

To delete the endpoints created during deployment, refer to the endpoint deletion section we provided in the Jupyter notebook. Then manually delete the SageMaker notebook.

Conclusion
In this post, we explored the exciting field of video generation and showcased how SageMaker HyperPod can be used to efficiently train video generation algorithms at scale. By using the AnimateAnyone algorithm as an example, we demonstrated the step-by-step process of setting up a SageMaker HyperPod cluster, running the algorithm, scaling it to multiple GPU nodes, and monitoring GPU usage during the training process.
SageMaker HyperPod offers several key advantages that make it an ideal platform for training large-scale ML models, particularly in the domain of video generation. Its purpose-built infrastructure allows for distributed training at scale so you can manage clusters with desired instance types and counts. The ability to attach a shared file system such as Amazon FSx for Lustre provides efficient data storage and retrieval, with full bidirectional synchronization with Amazon S3. Moreover, the SageMaker HyperPod customizable environment, integration with Slurm, and seamless connectivity with Visual Studio Code enhance productivity and simplify the management of distributed training jobs.
We encourage you to use SageMaker HyperPod for your ML training workloads, especially those involving video generation or other computationally intensive tasks. By harnessing the power of SageMaker HyperPod, you can accelerate your research and development efforts, iterate faster, and build state-of-the-art models more efficiently. Embrace the future of video generation and unlock new possibilities with SageMaker HyperPod. Start your journey today and experience the benefits of distributed training at scale.

About the author
Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.
Gordon Wang is a Senior Data Scientist at AWS. He helps customers imagine and scope the use cases that will create the greatest value for their businesses and define paths to navigate technical or business challenges. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.
Gary LO is a Solutions Architect at AWS based in Hong Kong. He is a highly passionate IT professional with over 10 years of experience in designing and implementing critical and complex solutions for distributed systems, web applications, and mobile platforms for startups and enterprise companies. Outside of the office, he enjoys cooking and sharing the latest technology trends and insights on his social media platforms with thousands of followers.

Control data access to Amazon S3 from Amazon SageMaker Studio with Ama …

Amazon SageMaker Studio provides a single web-based visual interface where different personas like data scientists, machine learning (ML) engineers, and developers can build, train, debug, deploy, and monitor their ML models. These personas rely on access to data in Amazon Simple Storage Service (Amazon S3) for tasks such as extracting data for model training, logging model training metrics, and storing model artifacts after training. For example, data scientists need access to datasets stored in Amazon S3 for tasks like data exploration and model training. ML engineers require access to intermediate model artifacts stored in Amazon S3 from past training jobs.
Traditionally, access to data in Amazon S3 from SageMaker Studio for these personas is provided through roles configured in SageMaker Studio—either at the domain level or user profile level. The SageMaker Studio domain role grants permissions for the SageMaker Studio domain to interact with other AWS services, providing access to data in Amazon S3 for all users of that domain. If no specific user profile roles are created, this role will apply to all user profiles, granting uniform access privileges across the domain. However, if different users of the domain have different access restrictions, then configuring individual user roles allows for more granular control. These roles define the specific actions and access each user profile can have within the environment, providing granular permissions.
Although this approach offers a degree of flexibility, it also entails frequent updates to the policies attached to these roles whenever access requirements change, which can add maintenance overhead. This is where Amazon S3 Access Grants can significantly streamline the process. S3 Access Grants enables you to manage access to Amazon S3 data more dynamically, without the need to constantly update AWS Identity and Access Management (IAM) roles. S3 Access Grants allows data owners or permission administrators to set permissions, such as read-only, write-only, or read/write access, at various levels of Amazon S3, such as at the bucket, prefix, or object level. The permissions can be granted to IAM principals or to users and groups from their corporate directory through integration with AWS IAM Identity Center.
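Programmatically, a grant like the ones used later in this post can be created through the s3control API (a hedged sketch; the account ID, location ID, role ARN, and prefix are placeholders):

import boto3

s3control = boto3.client("s3control", region_name="us-east-1")

# Placeholders: your account ID, the registered Access Grants location ID, and the grantee role
response = s3control.create_access_grant(
    AccountId="111122223333",
    AccessGrantsLocationId="default",
    AccessGrantsLocationConfiguration={"S3SubPrefix": "UserA/*"},
    Grantee={
        "GranteeType": "IAM",
        "GranteeIdentifier": "arn:aws:iam::111122223333:role/sagemaker-usera-role",
    },
    Permission="READWRITE",
)
print(response["AccessGrantArn"])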
In this post, we demonstrate how to simplify data access to Amazon S3 from SageMaker Studio using S3 Access Grants, specifically for different user personas using IAM principals.
Solution overview
Now that we’ve discussed the benefits of S3 Access Grants, let’s look at how grants can be applied with SageMaker Studio user roles and domain roles for granular access control.
Consider a scenario involving a product team with two members: User A and User B. They use an S3 bucket where the following access requirements are implemented:

All members of the team should have access to the folder named Product within the S3 bucket.
The folder named UserA should be accessible only by User A.
The folder named UserB should be accessible only by User B.
User A will be running an Amazon SageMaker Processing job that uses S3 Access Grants to get data from the S3 bucket. The processing job will access the required data from the S3 bucket using the temporary credentials provided by the access grants.

The following diagram illustrates the solution architecture and workflow.

Let’s start by creating a SageMaker Studio environment as needed for our scenario. This includes establishing a SageMaker Studio domain, setting up user profiles for User A and User B, configuring an S3 bucket with the necessary folders, and configuring S3 Access Grants.
Prerequisites
To set up the SageMaker Studio environment and configure S3 Access Grants as described in this post, you need administrative privileges for the AWS account you’ll be working with. If you don’t have administrative access, request assistance from someone who does. Throughout this post, we assume that you have the necessary permissions to create SageMaker Studio domains, create S3 buckets, and configure S3 Access Grants. If you don’t have these permissions, consult with your AWS administrator or account owner for guidance.
Deploy the solution resources using AWS CloudFormation
To provision the necessary resources and streamline the deployment process, we’ve provided an AWS CloudFormation template that automates the provisioning of required services. Deploying the CloudFormation stack in your account incurs AWS usage charges.
The CloudFormation stack creates the following resources:

Virtual private cloud (VPC) with private subnets with relevant route tables, NAT gateway, internet gateway, and security groups
IAM execution roles
S3 Access Grants instance
AWS Lambda function to load the Abalone dataset into Amazon S3
SageMaker domain
SageMaker Studio user profiles

Complete the following steps to deploy the stack:

Choose Launch Stack to launch the CloudFormation stack.
On the Create stack page, leave the default options and choose Next.
On the Specify stack details page, for Stack name, enter a name (for example, blog-sagemaker-s3-access-grants).
Under Parameters, provide the following information:

For PrivateSubnetCIDR, enter the IP address range in CIDR notation that should be allocated for the private subnet.
For ProjectName, enter sagemaker-blog.
For VpcCIDR, enter the desired IP address range in CIDR notation for the VPC being created.

Choose Next.
On the Configure stack options page, leave the default options and choose Next.
On the Review and create page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Review the template and choose Create stack.

After the successful deployment of stack, you can view the resources created on the stack’s Outputs tab on the AWS CloudFormation console.

Validate data in the S3 bucket
To validate access to the S3 bucket, we use the Abalone dataset. As part of the CloudFormation stack deployment process, a Lambda function is invoked to load the data into Amazon S3. After the Lambda function is complete, you should find the abalone.csv file in all three folders (Product, UserA, and UserB) within the S3 bucket.

Validate the SageMaker domain and associated user profiles
Complete the following steps to validate the SageMaker resources:

On the SageMaker console, choose Domains in the navigation pane.
Choose Product-Domain to be directed to the domain details page.
In the User profiles section, verify that the userA and userB profiles are present.
Choose a user profile name to be directed to the user profile details.
Validate that each user profile is associated with its corresponding IAM role: userA is associated with sagemaker-usera-role, and userB is associated with sagemaker-userb-role.

Validate S3 Access Grants setup
Complete the following steps to validate your configuration of S3 Access Grants:

On the Amazon S3 console, choose Access Grants in the navigation pane.
Choose View details to be directed to the details page of S3 Access Grants.
On the Locations tab, confirm that the URI of the S3 bucket created by the stack is registered with the S3 Access Grants instance for the location scope.
On the Grants tab, confirm the following:

sagemaker-usera-role has been given read/write permissions on the S3 prefixes Product/* and UserA/*
sagemaker-userb-role has been given read/write permissions on the S3 prefixes Product/* and UserB/*

Validate access from your SageMaker Studio environment
To validate the access grants we set up, we run a distributed data processing job on the Abalone dataset using SageMaker Processing jobs and PySpark.
To get started, complete the following steps:

On the SageMaker console, choose Domains in the navigation pane.
Choose the domain Product-Domain to be directed to the domain details page.
Choose userA under User profiles.
On the User Details page, choose Launch and choose Studio.
On the SageMaker Studio console, choose JupyterLab in the navigation pane.
Choose Create JupyterLab space.
For Name, enter usera-space.
For Sharing, select Private.
Choose Create space.
After the space is created, choose Run space.
When the status shows as Running, choose Open JupyterLab, which will redirect you to the SageMaker JupyterLab experience.
On the Launcher page, choose Python 3 under Notebook. This will open a new Python notebook, which we use to run the PySpark script. Let’s validate the access grants by running a distributed job using SageMaker Processing jobs to process data, because we often need to process data before it can be used for training ML models. SageMaker Processing jobs allow you to run distributed data processing workloads while using the access grants you set up earlier.
Copy the following PySpark script into a cell in your SageMaker Studio notebook. The %%writefile directive is used to save the script locally. The script generates temporary credentials using the access grant and configures Spark to use these credentials for accessing data in Amazon S3. It performs some basic feature engineering on the Abalone dataset, including string indexing, one-hot encoding, and vector assembly, and combines them into a pipeline. It then does an 80/20 split to produce training and validation datasets as outputs, and saves these datasets in Amazon S3. Make sure to replace region_name with the AWS Region you’re using in the script.

%%writefile ./preprocess.py
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
import argparse
import subprocess
import sys

def install_packages():
    subprocess.check_call([sys.executable, "-m", "pip", "install", "boto3==1.35.1", "botocore>=1.35.0"])

install_packages()
import boto3
print(f"logs: boto3 version in the processing job: {boto3.__version__}")
import botocore
print(f"logs: botocore version in the processing job: {botocore.__version__}")

def get_temporary_credentials(account_id, bucket_name, object_key_prefix):
    region_name = '<region>'
    s3control_client = boto3.client('s3control', region_name=region_name)
    response = s3control_client.get_data_access(
        AccountId=account_id,
        Target=f's3://{bucket_name}/{object_key_prefix}/',
        Permission='READWRITE'
    )
    return response['Credentials']

def configure_spark_with_s3a(credentials):
    spark = (
        SparkSession.builder
        .appName("PySparkApp")
        .config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId'])
        .config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey'])
        .config("spark.hadoop.fs.s3a.session.token", credentials['SessionToken'])
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        .getOrCreate()
    )

    spark.sparkContext._jsc.hadoopConfiguration().set(
        "mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter"
    )
    return spark

def csv_line(data):
    r = ",".join(str(d) for d in data[1])
    return str(data[0]) + "," + r

def main():
    parser = argparse.ArgumentParser(description="app inputs and outputs")
    parser.add_argument("--account_id", type=str, help="AWS account ID")
    parser.add_argument("--s3_input_bucket", type=str, help="s3 input bucket")
    parser.add_argument("--s3_input_key_prefix", type=str, help="s3 input key prefix")
    parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket")
    parser.add_argument("--s3_output_key_prefix", type=str, help="s3 output key prefix")
    args = parser.parse_args()

    # Get temporary credentials for both reading and writing
    credentials = get_temporary_credentials(args.account_id, args.s3_input_bucket, args.s3_input_key_prefix)
    spark = configure_spark_with_s3a(credentials)

    # Defining the schema corresponding to the input data
    schema = StructType([
        StructField("sex", StringType(), True),
        StructField("length", DoubleType(), True),
        StructField("diameter", DoubleType(), True),
        StructField("height", DoubleType(), True),
        StructField("whole_weight", DoubleType(), True),
        StructField("shucked_weight", DoubleType(), True),
        StructField("viscera_weight", DoubleType(), True),
        StructField("shell_weight", DoubleType(), True),
        StructField("rings", DoubleType(), True),
    ])

    # Reading data directly from S3 using s3a protocol
    total_df = spark.read.csv(
        f"s3a://{args.s3_input_bucket}/{args.s3_input_key_prefix}/abalone.csv",
        header=False,
        schema=schema
    )

    # Transformations and data processing
    sex_indexer = StringIndexer(inputCol="sex", outputCol="indexed_sex")
    sex_encoder = OneHotEncoder(inputCol="indexed_sex", outputCol="sex_vec")
    assembler = VectorAssembler(
        inputCols=[
            "sex_vec",
            "length",
            "diameter",
            "height",
            "whole_weight",
            "shucked_weight",
            "viscera_weight",
            "shell_weight",
        ],
        outputCol="features"
    )
    pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler])
    model = pipeline.fit(total_df)
    transformed_total_df = model.transform(total_df)
    (train_df, validation_df) = transformed_total_df.randomSplit([0.8, 0.2])

    # Saving transformed datasets to S3 using RDDs and s3a protocol
    train_rdd = train_df.rdd.map(lambda x: (x.rings, x.features))
    train_lines = train_rdd.map(csv_line)
    train_lines.saveAsTextFile(
        f"s3a://{args.s3_output_bucket}/{args.s3_output_key_prefix}/train"
    )

    validation_rdd = validation_df.rdd.map(lambda x: (x.rings, x.features))
    validation_lines = validation_rdd.map(csv_line)
    validation_lines.saveAsTextFile(
        f"s3a://{args.s3_output_bucket}/{args.s3_output_key_prefix}/validation"
    )

if __name__ == "__main__":
    main()
Run the cell to create the preprocess.py file locally.
Next, you use the PySparkProcessor class to define a Spark job and run it using SageMaker Processing. Copy the following code into a new cell in your SageMaker Studio notebook, and run the cell to invoke the SageMaker Processing job:

from sagemaker.spark.processing import PySparkProcessor
from time import gmtime, strftime
import boto3
import sagemaker
import logging

# Get region
region = boto3.Session().region_name

# Initialize Boto3 and SageMaker sessions
boto_session = boto3.Session(region_name=region)
sagemaker_session = sagemaker.Session(boto_session=boto_session)

# Get account id
def get_account_id():
    client = boto3.client("sts")
    return client.get_caller_identity()["Account"]
account_id = get_account_id()

bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()
sagemaker_logger = logging.getLogger("sagemaker")
sagemaker_logger.setLevel(logging.INFO)
sagemaker_logger.addHandler(logging.StreamHandler())

# Set up S3 bucket and paths
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
prefix = "Product/sagemaker/spark-preprocess-demo/{}".format(timestamp_prefix)

# Define the account ID and S3 bucket details
input_bucket = f'blog-access-grants-{account_id}-{region}'
input_key_prefix = 'UserA'
output_bucket = f'blog-access-grants-{account_id}-{region}'
output_key_prefix = 'UserA/output'

# Define the Spark processor
spark_processor = PySparkProcessor(
    framework_version="3.3",
    role=role,
    instance_count=2,
    instance_type="ml.m5.2xlarge",
    base_job_name="spark-preprocess-job",
    sagemaker_session=sagemaker_session
)

# Run the Spark processing job
spark_processor.run(
    submit_app="./preprocess.py",
    arguments=[
        "--account_id", account_id,
        "--s3_input_bucket", input_bucket,
        "--s3_input_key_prefix", input_key_prefix,
        "--s3_output_bucket", output_bucket,
        "--s3_output_key_prefix", output_key_prefix,
    ],
    spark_event_logs_s3_uri=f"s3://{output_bucket}/{prefix}/spark_event_logs",
    logs=False
)

A few things to note in the definition of the PySparkProcessor:

This is a multi-node job with two ml.m5.2xlarge instances (specified in the instance_count and instance_type parameters)
The Spark framework version is set to 3.3 using the framework_version parameter
The PySpark script is passed using the submit_app parameter
Command line arguments to the PySpark script (such as the account ID, input/output bucket names, and input/output key prefixes) are passed through the arguments parameter
Spark event logs will be offloaded to the Amazon S3 location specified in spark_event_logs_s3_uri and can be used to view the Spark UI while the job is in progress or after it’s complete.

After the job is complete, validate the output of the preprocessing job by looking at the first five rows of the output dataset using the following validation script:

import boto3
import pandas as pd
import io

# Initialize S3 client
s3 = boto3.client('s3')

# Get region
region = boto3.Session().region_name

# Get account id
def get_account_id():
    client = boto3.client("sts")
    return client.get_caller_identity()["Account"]
account_id = get_account_id()

# Replace with your bucket name and output key prefix
bucket_name = f'blog-access-grants-{account_id}-{region}'
output_key_prefix = 'UserA/output/train'

# Get temporary credentials for accessing S3 data using user profile role
s3control_client = boto3.client('s3control')
response = s3control_client.get_data_access(
    AccountId=account_id,
    Target=f's3://{bucket_name}/{output_key_prefix}',
    Permission='READ'
)
credentials = response['Credentials']

# Create an S3 client with the temporary credentials
s3_client = boto3.client(
    's3',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken']
)

objects = s3_client.list_objects(Bucket=bucket_name, Prefix=output_key_prefix)

# Read the first part file into a pandas DataFrame
first_part_key = f"{output_key_prefix}/part-00000"
obj = s3_client.get_object(Bucket=bucket_name, Key=first_part_key)
data = obj['Body'].read().decode('utf-8')
df = pd.read_csv(io.StringIO(data), header=None)

# Print the top 5 rows
print(f"Top 5 rows from s3://{bucket_name}/{first_part_key}")
print(df.head())

This script uses the access grants to obtain temporary credentials, reads the first part file (part-00000) from the output location into a pandas DataFrame, and prints the top five rows of the DataFrame. Because the User A role has access to the userA folder, the user can read the contents of the file part-00000, as shown in the following screenshot. Now, let’s validate access to the userA folder from the User B profile.
Repeat the earlier steps to launch a Python notebook under the User B profile.
Use the validation script to read the contents of the file part-00000, which is in the userA folder.

If User B tries to read the contents of the file part-00000, which is in the userA folder, their access will be denied, as shown in the following screenshot, because User B doesn’t have access to the userA folder.

Clean up
To avoid incurring future charges, delete the CloudFormation stack. This will delete resources such as the SageMaker Studio domain, S3 Access Grants instance, and S3 bucket you created.
Conclusion
In this post, you learned how to control data access to Amazon S3 from SageMaker Studio with S3 Access Grants. S3 Access Grants provides a more flexible and scalable mechanism to define access patterns at scale than IAM-based techniques. These grants not only support IAM principals but also allow direct granting of access to users and groups from a corporate directory that is synchronized with IAM Identity Center.
Take the next step in optimizing your data management workflow by integrating S3 Access Grants into your AWS environment alongside SageMaker Studio, a web-based visual interface for building, training, debugging, deploying, and monitoring ML models. Take advantage of the granular access control and scalability offered by S3 Access Grants to enable efficient collaboration, secure data access, and simplified access management for your team working in the SageMaker Studio environment. For more details, refer to Managing access with S3 Access Grants and Amazon SageMaker Studio.

About the authors
Koushik Konjeti is a Senior Solutions Architect at Amazon Web Services. He has a passion for aligning architectural guidance with customer goals, ensuring solutions are tailored to their unique requirements. Outside of work, he enjoys playing cricket and tennis.
Vijay Velpula is a Data Architect with AWS Professional Services. He helps customers implement Big Data and Analytics Solutions. Outside of work, he enjoys spending time with family, traveling, hiking and biking.
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey. In his spare time, he rides motorcycle and enjoys the nature with his family.

Build a multimodal social media content generator using Amazon Bedrock

In today’s digital age, social media has revolutionized the way brands interact with their consumers, creating a need for dynamic and engaging content that resonates with their target audience. There’s growing competition for consumer attention in this space; content creators and influencers face constant challenges to produce new, engaging, and brand-consistent content. The challenges come from three key factors: the need for rapid content production, the desire for personalized content that is both captivating and visually appealing and reflects the unique interests of the consumer, and the necessity for content that is consistent with a brand’s identity, messaging, aesthetics, and tone.
Traditionally, the content creation process has been a time-consuming task involving multiple steps such as ideation, research, writing, editing, design, and review. This slow cycle of creation does not fit the rapid pace of social media.
Generative AI offers new possibilities to address this challenge and can be used by content teams and influencers to enhance their creativity and engagement while maintaining brand consistency. More specifically, multimodal capabilities of large language models (LLMs) allow us to create rich, engaging content spanning text, images, audio, and video formats that are omnipresent in advertising, marketing, and social media. With recent advancements in vision LLMs, creators can use visual input, such as reference images, to start the content creation process. Image similarity search and text semantic search further enhance the process by quickly retrieving relevant content and context.
In this post, we walk you through a step-by-step process to create a social media content generator app using vision, language, and embedding models (Anthropic’s Claude 3, Amazon Titan Image Generator, and Amazon Titan Multimodal Embeddings) through the Amazon Bedrock API and Amazon OpenSearch Serverless. Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI companies through a single API. OpenSearch Serverless is a fully managed service that makes it easier to store vectors and other data types in an index and delivers sub-second query latency when searching billions of vectors and measuring semantic similarity.
Here’s how the proposed process for content creation works:

First, the user (content team or marketing team) uploads a product image with a simple background (such as a handbag). Then, they provide natural language descriptions of the scene and enhancements they wish to add to the image as a prompt (such as “Christmas holiday decorations”).
Next, Amazon Titan Image Generator creates the enhanced image based on the provided scenario.
Then, we generate rich and engaging text that describes the image while aligning with brand guidelines and tone using Claude 3.
After the draft (text and image) is created, our solution performs multimodal similarity searches against historical posts to find similar posts and gain inspiration and recommendations to enhance the draft post.
Finally, based on the generated recommendations, the post text is further refined and provided to the user on the webpage. The following diagram illustrates the end-to-end new content creation process.

Solution overview
In this solution, we start with data preparation, where the raw datasets can be stored in an Amazon Simple Storage Service (Amazon S3) bucket. We provide a Jupyter notebook to preprocess the raw data and use the Amazon Titan Multimodal Embeddings model to convert the image and text into embedding vectors. These vectors are then saved on OpenSearch Serverless as collections, as shown in the following figure.

Next is the content generation. The GUI webpage is hosted using a Streamlit application, where the user can provide an initial product image and a brief description of how they expect the enriched image to look. From the application, the user can also select the brand (which will link to a specific brand template later), choose the image style (such as photographic or cinematic), and select the tone for the post text (such as formal or casual).
After all the configurations are provided, the content creation process, shown in the following figure, is launched.
In stage 1, the solution retrieves the brand-specific template and guidelines from a CSV file. In a production environment, you could maintain the brand template table in Amazon DynamoDB for scalability, reliability, and maintenance. The user input is used to generate the enriched image with the Amazon Titan Image Generator. Together with all the other information, it’s fed into the Claude 3 model, which has vision capability, to generate the initial post text that closely aligns with the brand guidelines and the enriched image. At the end of this stage, the enriched image and initial post text are created and sent back to the GUI to display to users.
In stage 2, we combine the post text and image and use the Amazon Titan Multimodal Embeddings model to generate the embedding vector. Multimodal embedding models integrate information from different data types, such as text and images, into a unified representation. This enables searching for images using text descriptions, identifying similar images based on visual content, or combining both text and image inputs to refine search results. In this solution, the multimodal embedding vector is used to search and retrieve the top three similar historical posts from the OpenSearch vector store. The retrieved results are fed into the Anthropic’s Claude 3 model to generate a caption, provide insights on why these historical posts are engaging, and offer recommendations on how the user can improve their post.
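A minimal sketch of computing such a multimodal embedding through the Amazon Bedrock runtime API follows (the model ID is the Titan Multimodal Embeddings G1 identifier as we understand it, and the text and image inputs are placeholders):

import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder inputs: the generated post text and the enriched image
with open("enriched_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = json.dumps({
    "inputText": "Leather handbag surrounded by Christmas holiday decorations",
    "inputImage": image_b64,
})

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-image-v1",  # Titan Multimodal Embeddings G1
    body=body,
    accept="application/json",
    contentType="application/json",
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # vector dimension used for the OpenSearch index mapping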
In stage 3, based on the recommendations from stage 2, the solution automatically refines the post text and provides a final version to the user. The user has the flexibility to select the version they like and make changes before publishing. For the end-to-end content generation process, steps are orchestrated with the Streamlit application.
The whole process is shown in the following image:

Implementation steps
This solution has been tested in AWS Region us-east-1. However, it can also work in other Regions where the following services are available. Make sure you have the following set up before moving forward:

An AWS account
An Amazon SageMaker domain
A SageMaker domain user profile

We use Amazon SageMaker Studio to generate historical post embeddings and save those embedding vectors to OpenSearch Serverless. Additionally, you will run the Streamlit app from the SageMaker Studio terminal to visualize and test the solution. Testing the Streamlit app in a SageMaker environment is intended for a temporary demo. For production, we recommend deploying the Streamlit app on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS) services with proper security measures such as authentication and authorization.
We use the following models from Amazon Bedrock in the solution. Please see Model support by AWS Region and select the Region that supports all three models:

Amazon Titan Multimodal Embeddings Model
Amazon Titan Image Generator
Claude 3 Sonnet

Set up a JupyterLab space on SageMaker Studio
A JupyterLab space is a private or shared space within SageMaker Studio that manages the storage and compute resources needed to run the JupyterLab application.
To set up a JupyterLab space

Sign in to your AWS account and open the AWS Management Console. Go to SageMaker Studio.
Select your user profile and choose Open Studio.
From Applications in the top left, choose JupyterLab.
If you already have a JupyterLab space, choose Run. If you do not, choose Create JupyterLab Space to create one. Enter a name and choose Create Space.
Change the instance to t3.large and choose Run Space.
Within a minute, you should see that the JupyterLab space is ready. Choose Open JupyterLab.
In the JupyterLab launcher window, choose Terminal.
Run the following command in the terminal to download the sample code from GitHub:

git clone https://github.com/aws-samples/Build-a-multimodal-social-media-content-generator-using-Amazon-Bedrock.git

Generate sample posts and compute multimodal embeddings
In the code repository, we provide some sample product images (bag, car, perfume, and candle) that were created using the Amazon Titan Image Generator model. Next, you can generate some synthetic social media posts using the synthetic-data-generation.ipynb notebook by following these steps. The generated posts’ texts are saved in the metadata.jsonl file (if you prepared your own product images and post texts, you can skip this step). Then, compute multimodal embeddings for the pairs of images and generated texts. Finally, ingest the multimodal embeddings into a vector store on Amazon OpenSearch Serverless.
To generate sample posts

In JupyterLab, choose File Browser and navigate to the folder social-media-generator/embedding-generation.
Open the notebook synthetic-data-generation.ipynb.
Choose the default Python 3 kernel and Data Science 3.0 image, then follow the instructions in the notebook.
At this stage, the sample posts have been created and are available in data_mapping.csv.
Open the notebook multimodal_embedding_generation.ipynb. The notebook first creates the multimodal embeddings for the post-image pair. It then ingests the computed embeddings into a vector store on Amazon OpenSearch Serverless.
At the end of the notebook, you should be able to perform a simple query to the collection as shown in the following example:

query_prompt = "christmas tree, holiday, bags"
similar_items = find_similar_items_from_query(
    query_prompt=query_prompt, k=3, num_results=5,
    index_name=index_name, dataset=df,
    open_search_client=oss_client)
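
For reference, the embedding creation and ingestion performed earlier in the notebook boil down to one Amazon Titan Multimodal Embeddings call per image-text pair, followed by an index request to the OpenSearch Serverless collection. The following is a minimal sketch, assuming the helper is named get_titan_multimodal_embedding (matching the name used later in this post), that the Titan Multimodal Embeddings model ID is amazon.titan-embed-image-v1, and that oss_client and index_name are created as in the notebook; the repository code may differ in the details.

import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def get_titan_multimodal_embedding(image_bytes=None, description: str = None, dimension: int = 1024) -> dict:
    # The request can contain an image, a text description, or both
    body = {"embeddingConfig": {"outputEmbeddingLength": dimension}}
    if image_bytes is not None:
        # Accept either raw bytes or an already base64-encoded string
        if isinstance(image_bytes, bytes):
            image_bytes = base64.b64encode(image_bytes).decode()
        body["inputImage"] = image_bytes
    if description is not None:
        body["inputText"] = description

    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",  # assumed Titan Multimodal Embeddings model ID
        body=json.dumps(body),
        accept="application/json",
        contentType="application/json",
    )
    # The parsed response contains an "embedding" key with the vector
    return json.loads(response["body"].read())

# Example ingestion of one post (hypothetical field values); image_vector is a knn_vector field of dimension 1024
embedding = get_titan_multimodal_embedding(image_bytes=image_bytes, description=post_text)["embedding"]
oss_client.index(index=index_name, body={
    "image_vector": embedding,
    "file_name": "bag_01.png",  # hypothetical file name
    "post_text": post_text,
})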

The preparation steps are complete. If you want to try out the solution directly, you can skip to Run the solution with Streamlit App to quickly test the solution in your SageMaker environment. However, if you want a more detailed understanding of each step’s code and explanations, continue reading.
Generate a social media post (image and text) using FMs
In this solution, we use FMs through Amazon Bedrock for content creation. We start by enhancing the input product image using the Amazon Titan Image Generator model, which adds a dynamically relevant background around the target product.
The get_titan_ai_request_body function creates a JSON request body for the Titan Image Generator model, using its Outpainting feature. It accepts four parameters: outpaint_prompt (for example, “Christmas tree, holiday decoration” or “Mother’s Day, flowers, warm lights”), negative_prompt (elements to exclude from the generated image), mask_prompt (specifies areas to retain, such as “bag” or “car”), and image_str (the input image encoded as a base64 string).
The generate_image function requires model_id and body (the request body from get_titan_ai_request_body). It invokes the model using bedrock.invoke_model and returns the response containing the base64-encoded generated image.
Finally, the code snippet calls get_titan_ai_request_body with the provided prompts and input image string, then passes the request body to generate_image, resulting in the enhanced image.

import json
import random

def get_titan_ai_request_body(outpaint_prompt, negative_prompt, mask_prompt, image_str=None):
    # A random seed makes each run produce a different background variation
    seed = random.randint(0, 2147483647)
    body = {
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
            "text": outpaint_prompt,
            "negativeText": negative_prompt,
            "image": image_str,
            "maskPrompt": mask_prompt,
            "outPaintingMode": "PRECISE"  # or DEFAULT
        },
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "quality": "premium",
            "cfgScale": 8,
            "seed": seed,
        }
    }
    return json.dumps(body)

# bedrock (a boto3 "bedrock-runtime" client) and logger are created earlier in the notebook
def generate_image(model_id, body):
    """
    Invoke the image generation model through Amazon Bedrock.

    Args:
        model_id (str): The model ID to use.
        body (str): The JSON request body to use.
    Returns:
        response_body (dict): The parsed model response, containing the base64-encoded image(s).
    """
    logger.info("Generating image with model %s", model_id)

    accept = "application/json"
    content_type = "application/json"

    response = bedrock.invoke_model(
        body=body, modelId=model_id, accept=accept, contentType=content_type
    )
    response_body = json.loads(response.get("body").read())
    return response_body

body = get_titan_ai_request_body(outpaint_prompt, negative_prompt, mask_prompt, image_str=image_str)
response = generate_image(model_id=MODEL_IMAGE, body=body)
image_enhanced = base64_to_image(response["images"][0])
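
The base64_to_image helper used above is not shown in this excerpt; conceptually, it simply decodes the base64 string returned by the model into a PIL image, and a companion image_to_base64 helper does the reverse to prepare image_str for the outpainting request. A minimal sketch, assuming Pillow is installed (the helper names and signatures in the repository may differ):

import base64
from io import BytesIO

from PIL import Image

def base64_to_image(image_b64: str) -> Image.Image:
    # Decode the base64 string returned by the Titan Image Generator into a PIL image
    return Image.open(BytesIO(base64.b64decode(image_b64)))

def image_to_base64(image: Image.Image, fmt: str = "PNG") -> str:
    # Encode a PIL image as a base64 string for the "image" field of the request body
    buffer = BytesIO()
    image.save(buffer, format=fmt)
    return base64.b64encode(buffer.getvalue()).decode()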

The following images showcase the enhanced versions generated based on input prompts like “Christmas tree, holiday decoration, warm lights,” a selected position (such as bottom-middle), and a brand (“Luxury Brand”). These settings influence the output images. If the generated image is unsatisfactory, you can repeat the process until you achieve the desired outcome.

Next, generate the post text, taking into consideration the user inputs, brand guidelines (provided in the brand_guideline.csv file, which you can replace with your own data), and the enhanced image generated from the previous step.
The generate_text_with_claude function is the higher-level function that handles the image and text input, prepares the necessary data, and calls generate_vision_answer to interact with the Amazon Bedrock model (Claude 3 models) and receive the desired response. The generate_vision_answer function performs the core interaction with the Amazon Bedrock model, processes the model’s response, and returns it to the caller. Together, they enable generating text responses based on combined image and text inputs.
In the following code snippet, an initial post prompt is constructed using formatting placeholders for various elements such as role, product name, target brand, tone, hashtag, copywriting, and brand messaging. These elements are provided in the brand_guideline.csv file to make sure that the generated text aligns with the brand preferences and guidelines. This initial prompt is then passed to the generate_text_with_claude function, along with the enhanced image to generate the final post text.

def generate_vision_answer(bedrock: boto3.client, messages: list, model_id: str, claude_config: dict, system_prompt: str):
    """
    Generates a vision answer using the specified Claude model and configuration.
    """
    # Build the Anthropic Messages API request body from the message, model config, and system prompt
    body = {"messages": [messages], **claude_config, "system": system_prompt}

    # Reuse the Bedrock runtime client passed in by the caller instead of creating a new one
    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    response = json.loads(response["body"].read().decode("utf-8"))
    logger.info("Claude vision answer OK")

    formatted_response = post_process_answer(response["content"][0]["text"])
    return formatted_response

import base64
from io import BytesIO

def generate_text_with_claude(image, prompt):
    """
    Generate text with Claude for post generation and historical post analysis.
    """
    # Serialize the PIL image to PNG bytes before base64 encoding
    with BytesIO() as byte_io:
        image.save(byte_io, format="PNG")
        image_bytes = byte_io.getvalue()

    messages = {"role": "user", "content": [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",  # match the PNG format used above
                "data": base64.b64encode(image_bytes).decode(),
            },
        },
        {"type": "text",
         "text": prompt},
    ]}

    claude_text = generate_vision_answer(bedrock, messages, MODEL_TEXT, CLAUDE_CONFIG, SYSTEM_PROMPT)
    return claude_text

initial_post_prompt = PROMPT_TEXT.format(
    role=role, product_name=product_input, target_brand=brand,
    tone=tone, hashtag=hashtag, copywriting=copywriting,
    brand_messageing=brand_messageing)

post_text = generate_text_with_claude(
    image=image_enhanced,
    prompt=initial_post_prompt)
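
The prompt variables above (role, tone, hashtag, copywriting, and brand messaging) come from brand_guideline.csv. The exact column names in the repository may differ; a hypothetical sketch of loading them with pandas:

import pandas as pd

# Load the brand guidelines; the column names below are illustrative assumptions
guidelines = pd.read_csv("brand_guideline.csv")
row = guidelines[guidelines["brand"] == brand].iloc[0]

role = row["role"]                 # for example, "social media manager for a luxury brand"
tone = row["tone"]                 # for example, "elegant, aspirational"
hashtag = row["hashtag"]           # for example, "#LuxuryBrand #TimelessElegance"
copywriting = row["copywriting"]   # copywriting style guidance
brand_messageing = row["brand_messaging"]  # variable name kept to match the snippet above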

The following example shows the generated post text. It provides a detailed description of the product, aligns well with the brand guidelines, and incorporates elements from the image (such as the Christmas tree). Additionally, we instructed the model to include hashtags and emojis where appropriate, and the results demonstrate that it followed the prompt instructions effectively.

Post text: Elevate your style with Luxury Brand’s latest masterpiece. Crafted with timeless elegance and superior quality, this exquisite bag embodies unique craftsmanship. Indulge in the epitome of sophistication and let it be your constant companion for life’s grandest moments. #LuxuryBrand #TimelessElegance #ExclusiveCollection

Retrieve and analyze the top three relevant posts
The next step involves using the generated image and text to search for the top three similar historical posts from a vector database. We use the Amazon Titan Multimodal Embeddings model to create embedding vectors, which are stored in Amazon OpenSearch Serverless. The retrieved historical posts, typically ones with high engagement (many likes), are displayed on the application webpage to give users an idea of what successful social media posts look like. Additionally, we analyze these retrieved posts and provide actionable improvement recommendations for the user. The following code snippet shows the implementation of this step.
The code defines two functions: find_similar_items and process_images. find_similar_items performs semantic search using the k-nearest neighbors (kNN) algorithm on the input image prompt. It computes a multimodal embedding for the image and query prompt, constructs an OpenSearch kNN query, runs the search, and retrieves the top matching images and post texts. process_images analyzes a list of similar images in parallel using multiprocessing. It generates analysis texts for the images by calling generate_text_with_claude with an analysis prompt, running the calls in parallel, and collecting the results.
In the snippet, find_similar_items is called to retrieve the top three similar images and post texts based on the input image and a combined query prompt. process_images is then called to generate analysis texts for the first three similar images in parallel, displaying the results simultaneously.

def find_similar_items(image_bytes: str, query_prompt: str, k: int, num_results: int, index_name: str, dataset, open_search_client) -> tuple:
    """
    Main semantic search capability using kNN on the input image and text prompt.
    Args:
        k: number of nearest neighbors to consider in the kNN search
        num_results: number of top matches to return from the OpenSearch index
        index_name: index name in OpenSearch
    """
    query_emb = get_titan_multimodal_embedding(image_bytes=image_bytes, description=query_prompt, dimension=1024)["embedding"]

    body = {
        "size": num_results,
        "_source": {
            "exclude": ["image_vector"],
        },
        "query": {
            "knn": {
                "image_vector": {
                    "vector": query_emb,
                    "k": k,
                }
            }
        },
    }

    res = open_search_client.search(index=index_name, body=body)
    images = []
    texts = []

    for hit in res["hits"]["hits"]:
        file_name = hit["_source"]["file_name"]
        post_text = hit["_source"]["post_text"]
        image = get_image(file_name=file_name, dataset=dataset)

        # Attach the similarity score and file name for display purposes
        image.name_and_score = f'{hit["_score"]}:{hit["_source"]["file_name"]}'
        images.append(image)

        texts.append(f"Post Text: {post_text}")

    return images, texts

import multiprocessing

def process_images(_similar_items, PROMPT_ANALYSIS):
    # Create a pool of 3 worker processes to analyze the retrieved posts in parallel
    pool = multiprocessing.Pool(processes=3)
    args = [(image, PROMPT_ANALYSIS) for image in _similar_items[:3]]
    # starmap executes the Claude calls in parallel and blocks until all results are ready
    results = pool.starmap(generate_text_with_claude, args)
    # Close the pool and wait for the worker processes to finish
    pool.close()
    pool.join()
    # Unpack the three analysis texts
    analysis_text_0, analysis_text_1, analysis_text_2 = results
    return analysis_text_0, analysis_text_1, analysis_text_2

similar_images, post_texts = find_similar_items(
    image_bytes=image_enhanced_bytes, query_prompt=text_input + " " + post_text,
    k=5, num_results=3, index_name=index_name, dataset=mapping_table,
    open_search_client=oss_client)

analysis_text_0, analysis_text_1, analysis_text_2 = process_images(similar_images, PROMPT_ANALYSIS)

An example of historical post retrieval and analysis is shown in the following screenshot. Post images are listed on the left. On the right, the full text content of each post is retrieved and displayed. We then use an LLM to generate a comprehensive scene description for the post image, which can serve as a prompt to inspire image generation. Next, the LLM generates automatic recommendations for improvement. In this solution, we use the Claude 3 Sonnet model for text generation.
As the final step, the solution incorporates the recommendations and refines the post text to make it more appealing and likely to attract more attention from social media users.
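
This refinement step is not shown in the code excerpts above; conceptually, it passes the draft post text and the three analysis texts back to Claude with a refinement instruction. A hypothetical sketch reusing generate_text_with_claude (the prompt wording and variable names are assumptions, not the repository’s exact code):

# Hypothetical refinement prompt; the actual prompt in the repository may differ
REFINE_PROMPT = """You are a social media copywriter. Here is a draft post:

{draft}

Here are recommendations derived from three high-performing historical posts:

{recommendations}

Rewrite the draft so it follows the recommendations while keeping the brand voice,
hashtags, and emojis. Return only the refined post text."""

refine_prompt = REFINE_PROMPT.format(
    draft=post_text,
    recommendations="\n\n".join([analysis_text_0, analysis_text_1, analysis_text_2]),
)

# Reuse the image-plus-text helper so Claude can also see the enhanced image
final_post_text = generate_text_with_claude(image=image_enhanced, prompt=refine_prompt)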

Run the solution with Streamlit App
You can download the solution from this Git repository. Use the following steps to run the Streamlit application and quickly test out the solution in your SageMaker Studio environment.

In SageMaker Studio, choose SageMaker Classic, then start an instance under your user profile.
After you have the JupyterLab environment running, clone the code repository and navigate to the streamlit-app folder in a terminal:

cd streamlit-app/
sh setup.sh
sh run.sh

You will see a webpage link generated in the terminal, which will look similar to the following:

https://[USER-PROFILE-ID].studio.[REGION].sagemaker.aws/jupyter/default/proxy/8501/

To check the status of the Streamlit application, run sh status.sh in the terminal.
To shut down the application, run sh cleanup.sh.

With the Streamlit app running, you can begin by providing initial prompts and selecting the products you want to retain in the image. You have the option to upload an image from your local machine, plug in your camera to take an initial product picture on the fly, or quickly test the solution by selecting a pre-uploaded image example. You can then optionally adjust the product’s location in the image by setting its position. Next, select the brand for the product. In the demo, we use the luxury brand and the fast fashion brand, each with its own preferences and guidelines. Finally, choose the image style. Choose Submit to start the process.
The application will automatically handle post image and text generation, retrieve similar posts for analysis, and refine the final post. This end-to-end process can take approximately 30 seconds. If you aren’t satisfied with the result, you can repeat the process a few times. An end-to-end demo is shown below.

Inspiration from historical posts using image similarity search
If you find yourself lacking ideas for initial prompts to create the enhanced image, consider using a reverse search approach. During the retrieve and analyze posts step mentioned earlier, scene descriptions are also generated, which can serve as inspiration. You can modify these descriptions as needed and use them to generate new images and accompanying text. This method effectively uses existing content to stimulate creativity and enhance the application’s output.

In the preceding example, the top three similar images to our generated images show perfume pictures posted to social media by users. This insight helps brands understand their target audience and the environments in which their products are used. By using this information, brands can create dynamic and engaging content that resonates with their users. For instance, in the example provided, “a hand holding a glass perfume bottle in the foreground, with a scenic mountain landscape visible in the background,” is unique and visually more appealing than a dull picture of “a perfume bottle standing on a branch in a forest.” This illustrates how capturing the right scene and context can significantly enhance the attractiveness and impact of social media content.
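
To turn one of these scene descriptions into a new image, you can feed it back into the outpainting step shown earlier. A hypothetical sketch (the negative prompt, mask prompt, product_image variable, and the image_to_base64 helper are assumptions for illustration):

# Reuse a generated scene description as the outpainting prompt for a new round of image generation
scene_description = "a hand holding a glass perfume bottle in the foreground, with a scenic mountain landscape visible in the background"

body = get_titan_ai_request_body(
    outpaint_prompt=scene_description,
    negative_prompt="blurry, low quality",     # assumed negative prompt
    mask_prompt="perfume bottle",              # keep the product itself unchanged
    image_str=image_to_base64(product_image),  # original product image uploaded by the user (assumed variable)
)
response = generate_image(model_id=MODEL_IMAGE, body=body)
new_image = base64_to_image(response["images"][0])
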
Clean up
When you finish experimenting with this solution, use the following steps to clean up the AWS resources to avoid unnecessary costs:

Navigate to the Amazon S3 console and delete the S3 bucket and data created for this solution.
Navigate to the Amazon OpenSearch Service console, choose Serverless, and then select Collection. Delete the collection that was created for storing the historical post embedding vectors.
Navigate to the Amazon SageMaker console. Choose Admin configurations and select Domains. Select your user profile and delete the running application from Spaces and Apps.

Conclusion
In this blog post, we introduced a multimodal social media content generator solution that uses FMs from Amazon Bedrock, such as the Amazon Titan Image Generator, Claude 3, and Amazon Titan Multimodal Embeddings. The solution streamlines the content creation process, enabling brands and influencers to produce engaging and brand-consistent content rapidly. You can try out the solution using this code sample.
The solution involves enhancing product images with relevant backgrounds using the Amazon Titan Image Generator, generating brand-aligned text descriptions through Claude 3, and retrieving similar historical posts using Amazon Titan Multimodal Embeddings. It provides actionable recommendations to refine content for better audience resonance. This multimodal AI approach addresses challenges in rapid content production, personalization, and brand consistency, empowering creators to boost creativity and engagement while maintaining brand identity.
We encourage brands, influencers, and content teams to explore this solution and use the capabilities of FMs to streamline their content creation processes. Additionally, we invite developers and researchers to build upon this solution, experiment with different models and techniques, and contribute to the advancement of multimodal AI in the realm of social media content generation.
See this announcement blog post for information about the Amazon Titan Image Generator and Amazon Titan Multimodal Embeddings model. For more information, see Amazon Bedrock and Amazon Titan in Amazon Bedrock.

About the Authors
Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS, specialising in building GenAI applications with customers, including RAG and agent solutions. Her expertise spans GenAI, ASR, Computer Vision, NLP, and time series prediction models. Outside of work, she enjoys spending quality time with her family, getting lost in novels, and hiking in the UK’s national parks.
Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in GenAI, LLMs, NLP, CV, and GeoSpatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers’ journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.