µFormer: A Deep Learning Framework for Efficient Protein Fitness Prediction and Optimization

Protein engineering is essential for designing proteins with specific functions, but navigating the complex fitness landscape of protein mutations is a significant challenge, making it hard to find optimal sequences. Zero-shot approaches, which predict mutational effects without relying on homologs or multiple sequence alignments (MSAs), reduce some dependencies but fall short in predicting diverse protein properties. Learning-based models trained on deep mutational scanning (DMS) or MAVE data, alone or combined with MSAs or language models, have been used to predict fitness landscapes, but these data-driven models often struggle when experimental data is sparse.

Microsoft Research AI for Science researchers introduced µFormer, a deep learning framework that integrates a pre-trained protein language model with specialized scoring modules to predict protein mutational effects. µFormer predicts high-order mutants, models epistatic interactions, and handles insertions. Combined with reinforcement learning, µFormer efficiently explores vast mutant spaces to design enhanced protein variants. The model predicted mutants with a 2000-fold increase in bacterial growth rate, driven by improved enzymatic activity. µFormer’s success extends to challenging scenarios, including multi-point mutations, and its predictions were validated through wet-lab experiments, highlighting its potential for optimizing protein design.

The µFormer model is a deep learning approach designed to predict the fitness of mutated protein sequences. It operates in two stages: first, by pre-training a masked protein language model (PLM) on a large dataset of unlabeled protein sequences, and second, by predicting fitness scores using three scoring modules integrated into the pre-trained model. These modules—residual-level, motif-level, and sequence-level—capture different aspects of the protein sequence and combine their outputs to generate the final fitness score. The model is trained using known fitness data, minimizing errors between predicted and actual scores.
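The paper’s implementation is not reproduced here, but the core idea of three scoring heads whose outputs are combined into a single fitness score can be sketched in a few lines of PyTorch. Everything below (module names, shapes, the simple averaging and summation) is an illustrative assumption rather than µFormer’s actual architecture.

import torch
import torch.nn as nn

class FitnessHead(nn.Module):
    # Toy stand-in for the three scoring modules; residue_repr is a per-residue
    # embedding from a pre-trained protein language model, shape (batch, length, hidden).
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.residue_score = nn.Linear(hidden, 1)            # residue-level head
        self.motif_score = nn.Sequential(                     # motif-level head (local context)
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )
        self.sequence_score = nn.Linear(hidden, 1)            # sequence-level head

    def forward(self, residue_repr: torch.Tensor) -> torch.Tensor:
        res = self.residue_score(residue_repr).squeeze(-1).mean(dim=1)
        motif = self.motif_score(residue_repr.transpose(1, 2)).squeeze(1).mean(dim=1)
        seq = self.sequence_score(residue_repr.mean(dim=1)).squeeze(-1)
        return res + motif + seq                              # one fitness score per sequence

emb = torch.randn(2, 100, 768)                                # random stand-in for PLM output
print(FitnessHead()(emb).shape)                               # torch.Size([2])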

Additionally, µFormer is combined with a reinforcement learning (RL) strategy to explore the vast space of possible mutations efficiently. The protein engineering problem in this framework is modeled as a Markov Decision Process (MDP), with Proximal Policy Optimization (PPO) used to optimize mutation policies. Dirichlet noise is added during the mutation search process to ensure effective exploration and avoid local optima. Baseline comparisons were made against models such as ESM-1v and ECNet, evaluated on datasets such as FLIP and ProteinGym.
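As a rough illustration of that setup, the mutation search can be framed as an environment whose state is the current sequence, whose actions are single-residue substitutions, and whose reward is the change in predicted fitness. The sketch below uses a random policy and a dummy scoring function in place of a PPO-trained policy and µFormer’s predictor; it is not the authors’ code.

import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

class MutationEnv:
    # Toy Markov Decision Process over single-point mutations (illustrative only).
    def __init__(self, wild_type, score_fn, max_steps=5):
        self.wild_type = wild_type
        self.score_fn = score_fn      # stands in for a fitness predictor such as µFormer
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.seq = self.wild_type
        self.steps = 0
        return self.seq

    def step(self, position, new_aa):
        before = self.score_fn(self.seq)
        self.seq = self.seq[:position] + new_aa + self.seq[position + 1:]
        self.steps += 1
        reward = self.score_fn(self.seq) - before   # improvement in predicted fitness
        done = self.steps >= self.max_steps
        return self.seq, reward, done

env = MutationEnv("MKTAYIAKQR", score_fn=lambda s: sum(map(ord, s)) * 1e-3)
state, done = env.reset(), False
while not done:                                      # random policy as a PPO placeholder
    pos, aa = random.randrange(len(state)), random.choice(AMINO_ACIDS)
    state, reward, done = env.step(pos, aa)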

µFormer, a hybrid model combining a self-supervised protein language model with supervised scoring modules, predicts protein fitness scores efficiently. Pre-trained on 30 million protein sequences from UniRef50 and fine-tuned with three scoring modules, µFormer outperformed ten methods in the ProteinGym benchmark, achieving a mean Spearman correlation of 0.703. It predicts high-order mutations and epistasis, with strong correlations for multi-site mutations. In protein optimization, µFormer, paired with reinforcement learning, designed TEM-1 variants that significantly improved growth, with one double mutant outperforming a known quadruple mutant.

Previous studies have shown the potential of sequence-based protein language models in tasks like enzyme function prediction and antibody design. Building on this, µFormer, a sequence-based model with three scoring modules, was developed to generalize across diverse protein properties. It achieved state-of-the-art performance in fitness prediction tasks, including complex mutations and epistasis, and demonstrated its ability to optimize enzyme activity, notably in predicting TEM-1 variants active against cefotaxime. Despite these successes, further improvements could come from incorporating structural data, developing phenotype-aware models, and building models capable of handling longer protein sequences.


Genomics England uses Amazon SageMaker to predict cancer subtypes and …

This post is co-written with Francisco Azuaje from Genomics England.
Genomics England analyzes sequenced genomes for The National Health Service (NHS) in the United Kingdom, and then equips researchers to use data to advance biological research. As part of its goal to help people live longer, healthier lives, Genomics England is interested in facilitating more accurate identification of cancer subtypes and severity, using machine learning (ML). To explore whether such ML models can perform at higher accuracy when using multiple modalities, such as genomic and imaging data, Genomics England has launched a multi-modal program aimed at enhancing its dataset and has partnered with the AWS Global Health and Non-profit Go-to-Market (GHN-GTM) Data Science and AWS Professional Services teams to create an automatic cancer sub-typing and survival detection pipeline and explore its accuracy on publicly available data.
In this post, we detail our collaboration in creating two proof of concept (PoC) exercises around multi-modal machine learning for survival analysis and cancer sub-typing, using genomic (gene expression, mutation and copy number variant data) and imaging (histopathology slides) data. We provide insights on interpretability, robustness, and best practices of architecting complex ML workflows on AWS with Amazon SageMaker. These multi-modal pipelines are being used on the Genomics England cancer cohort to enhance our understanding of cancer biomarkers and biology.
1. Data
The PoCs have used the publicly available cancer research data from The Cancer Genome Atlas (TCGA), which contain paired high-throughput genome analysis and diagnostic whole slide images with ground-truth survival outcome and histologic grade labels. Specifically, the PoCs focus on whole slide histopathology images of tissue samples as well as gene expression, copy number variations, and the presence of deleterious genetic variants to perform analysis on two cancer types: Breast cancer (BRCA) and gastrointestinal cancer types (Pan-GI). Table 1 shows the sample sizes for each cancer type.
Table 1. Overview of input data sizes across the different cancer types investigated.
2. Multi-modal machine learning frameworks
The ML pipelines tackling multi-modal subtyping and survival prediction have been built in three phases throughout the PoC exercises. First, a state-of-the-art framework, namely Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE) (Chen et al., 2022) was implemented (Section 2.1). This was followed by the proposal, development, and implementation of a novel architecture based on Hierarchical Extremum Encoding (HEEC) (Section 2.2) by AWS, which aimed to mitigate the limitations of PORPOISE. The final phase improved on the results of HEEC and PORPOISE—both of which have been trained in a supervised fashion—using a foundation model trained in a self-supervised manner, namely Hierarchical Image Pyramid Transformer (HIPT) (Chen et al., 2023).
2.1 Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE)
PORPOISE (Chen et al., 2022) is a multi-modal ML framework that consists of three sub-network components (see Figure 1 at Chen et al., 2022):

CLAM component: an attention-based multiple-instance learning network trained on pre-processed whole slide image (WSI) inputs (CLAM, Lu et al., 2021). CLAM extracts features from image patches of size 256×256 using a pre-trained ResNet50.
A self-normalizing network component for extracting deep molecular features.
A multi-modal fusion layer for integrating the feature representations from 1) and 2) by modelling their pairwise interactions. The joint representations obtained from the fusion layer are then used for downstream tasks such as survival analysis and cancer sub-typing.

Despite being performant, PORPOISE was observed to yield lower multi-modal performance than the best single modality (imaging) alone when gene expression data was excluded from the genomic features for survival analysis on Pan-GI data (Figure 2). A possible explanation is that the model has difficulty handling the extremely high-dimensional, sparse genomic data without overfitting.
2.2. Hierarchical Extremum Encoding (HEEC): A novel supervised multi-modal ML framework
To mitigate the limitations of PORPOISE, AWS has developed a novel model structure, HEEC, which is based on three ideas:

Using tree ensembles (LightGBM) to mitigate the sparsity and overfitting issue observed when training PORPOISE (as observed by Grinsztajn et al., 2022, tree-based models tend to overfit less when confronted with high-dimensional data with many largely uninformative features).
Representation construction using a novel encoding scheme (extremum encoding) that preserves spatial relationships and thus interpretability.
Hierarchical learning to allow representations at multiple spatial scales.

Figure 1. Hierarchical Extremum Encoding (HEEC) of pathomic representations.
Figure 1 summarizes the HEEC architecture, starting from the bottom (and clockwise): Every input WSI is cut into patches of size 4096×4096 and 256×256 pixels in a hierarchical manner, and all stacks of patches are fed through ResNet50 to obtain embedding vectors. Additionally, nucleus-level representations (of size 64×64 pixels) are extracted by a graph neural network (GNN), allowing local nucleus neighborhoods and their spatial relationships to be taken into account. This is followed by filtering for redundancy: important patch embeddings are selected using positive-unlabeled learning, and GNN importance filtering is used for retaining the top nuclei features. The resulting hierarchical embeddings are encoded using extremum encoding: the maxima and minima across the embeddings are taken in each vector entry, resulting in a single vector of maxima and minima per WSI. This encoding scheme keeps exact track of spatial relationships for each entry in the resulting representation vectors, because the model can backtrack each vector entry to a specific patch, and thus to a specific coordinate in the image.
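As a toy illustration of the extremum encoding step (shapes and values are made up, not taken from the paper), the per-WSI representation can be built by taking the element-wise maximum and minimum over all patch embeddings while recording which patch produced each extremum:

import numpy as np

patch_embeddings = np.random.randn(500, 2048)        # 500 patch embeddings for one WSI (toy)

max_code = patch_embeddings.max(axis=0)              # per-dimension maxima
min_code = patch_embeddings.min(axis=0)              # per-dimension minima
wsi_vector = np.concatenate([max_code, min_code])    # one fixed-length vector per WSI

# Interpretability: each entry can be traced back to the patch that produced it,
# and therefore to a coordinate on the slide.
argmax_patch = patch_embeddings.argmax(axis=0)
print(wsi_vector.shape, argmax_patch[:5])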
On the genomics side, importance filtering is applied based on excluding features that don’t correlate with the prediction target. The remaining features are horizontally appended to the pathology features, and a gradient boosted decision tree classifier (LightGBM) is applied to achieve predictive analysis.
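A minimal sketch of this fusion step, using synthetic arrays in place of the real TCGA features and placeholder hyperparameters, might look as follows:

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
pathology = rng.normal(size=(200, 4096))             # extremum-encoded WSI vectors (toy)
genomics = rng.normal(size=(200, 300))               # importance-filtered genomic features (toy)
labels = rng.integers(0, 2, size=200)                # e.g. cancer subtype

features = np.hstack([pathology, genomics])          # horizontal concatenation of modalities
clf = lgb.LGBMClassifier(n_estimators=200, num_leaves=31)
clf.fit(features, labels)

# Feature importances allow backtracking to the most informative patches and genes.
top_features = np.argsort(clf.feature_importances_)[::-1][:10]
print(top_features)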
HEEC architecture is interpretable out of the box, because HEEC embeddings possess implicit spatial information and the LightGBM model supports feature importance, allowing the filtering of the most important features for accurate prediction and backtracking to their location of origin. This location can be visually highlighted on the histology slide to be presented to expert pathologists for verification. Table 2 and Figure 2 show performance results of PORPOISE and HEEC, which show that HEEC is the only algorithm that outperforms the results of the best-performing single modality by combining multiple modalities.
Table 2. Classification and survival prediction performance of the two implemented multi-modal ML models on TCGA data. *Although Chen et al., 2022 provide some interpretability, the proposed attention visualization heatmaps have been deemed difficult to interpret from the pathologist point of view by Genomics England domain experts.
Figure 2. Comparison of performance (AUC) across individual modalities for survival analysis, when excluding the gene expression data. This matches the setting encountered by Genomics England (GEL) in practice (GEL’s internal dataset has no gene expression data).
2.3. Improvements using foundation models
Despite yielding promising results, PORPOISE and HEEC algorithms use backbone architectures trained using supervised learning (for example, ImageNet pre-trained ResNet50). To further improve performance, a self-supervised learning-based approach, namely Hierarchical Image Pyramid Transformer (HIPT) (Chen et al., 2023), has been investigated in the final stage of the PoC exercises. Note that HIPT is currently limited to the hierarchical self-supervised learning of the imaging modality (WSIs) and further work includes expansion of self-supervised learning for the genomic modality.
HIPT starts by defining a hierarchy of patches composed of non-overlapping regions of size 16×16, 256×256, and 4096×4096 pixels (see Figure 2 at Chen et al., 2023). The lowest-layer features are extracted from the smallest patches (16×16) using a self-supervised learning algorithm based on DINO with a Vision Transformer (ViT) backbone. For each 256×256 region, the lowest-layer features are then aggregated using a global pooling layer. The aggregated features constitute the (new input) features for the middle-level in the hierarchy, where the process of self-supervised learning followed by global pooling is repeated and the aggregated output features form the input features belonging to the 4096×4096 region. These input features go through self-supervised learning one last time, and the final embeddings are obtained using global attention pooling. After pre-training is completed, fine-tuning is employed only on the final layer of the hierarchy (acting on 4096×4096 regions) using multiple instance learning.
Genomics England investigated whether using HIPT embeddings would be better than using the ImageNet-pretrained ResNet50 encoder, and initial experiments have shown a gain in Harrell’s C-index of approximately 0.05 per cancer type in survival analysis. The embeddings offer other benefits as well, such as being smaller, which means that models train faster and the features have a smaller storage footprint.
3. Architecture on AWS
As part of the PoCs, we built a reference architecture (illustrated in Figure 3) for multi-modal ML using SageMaker, a platform for building, training, and deploying ML models, with fully managed infrastructure, tools, and workflows. We aimed to demonstrate some general, reusable patterns that are independent of the specific algorithms:

Decouple data pre-processing and feature computation from model training: In our use case, we process the pathology images into numerical feature representations once, store the resulting feature vectors in Amazon Simple Storage Service (Amazon S3), and reuse them to train different models. Analogously, we have a second processing branch that processes and extracts features from the genomic data.
Decouple model training from inference: As we experiment with different model structures and hyperparameters, we keep track of model versions, hyperparameters, and metrics in SageMaker model registry. We refer to the registry to review our experiments and choose which models to deploy for inference.
Wrap long-running computations inside containers and delegate their execution to SageMaker: Any long-running computation benefits from this pattern, whether it’s for data processing, model training, or batch inference. In this way, there’s no need to manage the underlying compute resources for running the containers. Cost is reduced through a pay-as-you-go model (resources are destroyed after a container has finished running) and the architecture is easily scalable to run multiple jobs in parallel.
Orchestrate multiple containerized jobs into SageMaker pipelines: We build a pipeline once and run it multiple times with different parametrization. Hence, pipeline invocations can be referred to at a higher-level of abstraction, without having to constantly monitor the status of its long-running constituent jobs.
Delegate hyperparameter tuning to SageMaker using a hyperparameter tuning job: A tuning job is a family of related training jobs (all managed by SageMaker) that efficiently explores the hyperparameter space. These training jobs take the same input data for training and validation, but each one runs with different hyperparameters for the learning algorithm; SageMaker automatically chooses which hyperparameter values to explore at each iteration. A minimal sketch of launching such a tuning job follows this list.
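The following is an illustrative sketch of this pattern with the SageMaker Python SDK; the image URI, role, metric name, and hyperparameter ranges are placeholders, not the exact configuration used in the PoCs.

from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Containerized training job; the image and role below are placeholders.
estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.g5.2xlarge",
)

# The tuning job launches related training jobs with different hyperparameters
# and searches for the combination that maximizes the objective metric.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:c-index",
    objective_type="Maximize",
    metric_definitions=[{"Name": "validation:c-index", "Regex": "c-index=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
        "num_leaves": IntegerParameter(16, 256),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)

# The same pre-computed feature vectors stored in Amazon S3 are reused by every job.
tuner.fit({
    "train": "s3://<your-bucket>/features/train",
    "validation": "s3://<your-bucket>/features/validation",
})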

3.1 Separation between development and production environments
In general, we advise doing all development work outside of the production environment, because this minimizes the risk of leaking or corrupting sensitive production data and keeps the production environment free of intermediate data and software artifacts that obfuscate lineage tracking. If data scientists require access to production data during development, for tasks such as exploratory analysis and modelling work, there are numerous strategies that can be employed to minimize risk. One effective strategy is to employ data masking or synthetic data generation techniques in the testing environment to simulate real-world scenarios without compromising sensitive data. Furthermore, production-level data can be securely moved into an independent environment for analysis. Access controls and permissions can be implemented to restrict the flow of data between environments, maintaining separation and ensuring minimal access rights.
Genomics England has created two separate ML environments for testing and production level interaction with data. Each environment sits in its own isolated AWS account. The test environment mimics the production environment in its data storage strategy, but contains synthetic data void of personally identifiable information (PII) or protected health information (PHI), instead of production-level data. This test environment is used for developing essential infrastructure components and refining best practices in a controlled setting, which can be tested with synthetic data before deploying to production. Strict access controls, including role-based permissions employing principles of least privilege, are implemented in all environments to ensure that only authorized personnel can interact with sensitive data or modify deployed resources.
3.2 Automation with CI/CD pipelines
On a related note, we advise ML developers to use infrastructure-as-code to describe the resources that are deployed in their AWS accounts and use continuous integration and delivery (CI/CD) pipelines to automate code quality checks, unit testing, and the creation of artifacts, such as container images. Then, also configure the CI/CD pipelines to automatically deploy the created artifacts into the target AWS accounts, whether they’re for development or for production. These well-established automation techniques minimize errors related to manual deployments and maximize the reproducibility between development and production environments.
Genomics England has investigated the use of CI/CD pipelines for automated deployment of platform resources, as well as automated testing of code.
Figure 3. Overview of the AWS reference architecture employed for multi-modal ML in the cloud
4. Conclusion
Genomics England has a long history of working with genomics data, however the inclusion of imaging data adds additional complexity and potential. The two PoCs outlined in this post have been essential in launching Genomics England’s efforts in creating a multi-modal environment that facilitates ML development for the purpose of tackling cancer. The implementation of state-of-the-art models in Genomics England’s multi-modal environment and assistance in developing robust practices will ensure that users are maximally enabled in their research.

“At Genomics England, our mission is to realize the enormous potential of genomic and multi-modal information to further precision medicine and push the boundaries to realize the enormous potential of AWS cloud computing in its success”.
– Dr Prabhu Arumugam, Director of Clinical data and imaging, Genomics England

Acknowledgements
The results published in this blog post are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

About the Authors
Cemre Zor, PhD, is a senior healthcare data scientist at Amazon Web Services. Cemre holds a PhD in theoretical machine learning and postdoctoral experiences on machine learning for computer vision and healthcare. She works with healthcare and life sciences customers globally to support them with machine learning modelling and advanced analytics approaches while tackling real-world healthcare problems.
Tamas Madl, PhD, is a former senior healthcare data scientist and business development lead at Amazon Web Services, with academic as well as industry experience at the intersection between healthcare and machine learning. Tamas helped customers in the Healthcare and Life Science vertical to innovate through the adoption of Machine Learning. He received his PhD in Computer Science from the University of Manchester.
Epameinondas Fritzilas, PhD, is a senior consultant at Amazon Web Services. He works hands-on with customers to design and build solutions for data analytics and AI applications in healthcare. He holds a PhD in bioinformatics and fifteen years of industry experience in the biotech and healthcare sectors.
Lou Warnett is a healthcare data scientist at Amazon Web Services. He assists healthcare and life sciences customers from across the world in tackling some of their most pressing challenges using data science, machine learning and AI, with a particular emphasis more recently on generative AI. Prior to joining AWS, Lou received a master’s in Mathematics and Computing at Imperial College London.
Sam Price is a Professional Services consultant specializing in AI/ML and data analytics at Amazon Web Services. He works closely with public sector customers in healthcare and life sciences to solve challenging problems. When not doing this, Sam enjoys playing guitar and tennis, and seeing his favorite indie bands.
Shreya Ruparelia is a data & AI consultant at Amazon Web Services, specialising in data science and machine learning, with a focus on developing GenAI applications. She collaborates with public sector healthcare organisations to create innovative AI-driven solutions. In her free time, Shreya enjoys activities such as playing tennis, swimming, exploring new countries and taking walks with the family dog, Buddy.
Pablo Nicolas Nunez Polcher, MSc, is a senior solutions architect working for the Public Sector team with Amazon Web Services. Pablo focuses on helping healthcare public sector customers build new, innovative products on AWS in accordance with best practices. He received his M.Sc. in Biological Sciences from Universidad de Buenos Aires. In his spare time, he enjoys cycling and tinkering with ML-enabled embedded devices.
Matthew Howard is the head of Healthcare Data Science and part of the Global Health and Non-Profits team in Amazon Web Services. He focuses on how data, machine learning and artificial intelligence can transform health systems and improve patient outcomes. He leads a team of applied data scientists who work with customers to develop AI-based healthcare solutions. Matthew holds a PhD in Biological Sciences from Imperial College London.
Tom Dyer is a Senior Product Manager at Genomics England, and was previously an Applied Machine Learning Engineer working within the Multimodal squad. His work focussed on building multimodal learning frameworks that allow users to rapidly scale research in the cloud. He also works on developing ML tooling to organise pathology image datasets and explain model predictions at a cohort level.
Samuel Barnett is an applied machine learning engineer with Genomics England working on improving healthcare with machine learning. He is embedded with the Multimodal squad and is part of an ongoing effort to show the value of combining genomic, imaging, and text-based data in machine learning models.
Prabhu Arumugam is the former Director of Clinical Data Imaging at Genomics England. Having joined the organization in 2019, Prabhu trained in medicine at St. Bartholomew’s and the Royal London. He trained in Histopathology and completed his PhD at The Barts Cancer Institute on pancreatic pathology.
Francisco Azuaje, PhD, is the director of bioinformatics at Genomics England, where he provides cross-cutting leadership in strategy and research with a focus on data science and AI. With a career covering academia, the pharmaceutical industry, and the public sector, he has wide experience leading multidisciplinary teams in solving challenges involving diverse data sources and computational modelling approaches. With his expertise in bioinformatics and applied AI, Dr. Azuaje enables the translation of complex data into insights that can improve patient outcomes.

Apple Unveils iPhone 16 with On-Device AI and Apple Intelligence Prompts

Apple’s latest release, the iPhone 16, places a strong emphasis on on-device artificial intelligence (AI), powered by its new Apple Intelligence platform. Unlike cloud-reliant AI systems, Apple is focusing on maintaining privacy by processing AI functions directly on the device with the A18 Bionic chip. This enables faster, more personalized, and more secure interactions, marking a significant evolution in smartphone AI.

On-Device AI for Everyday Tasks

Apple Intelligence uses specific prompts, known as adapters, to perform key tasks efficiently. For instance, when a notification arrives, the AI can use a simple prompt like, “{{userContent}} Is this urgent?” to determine whether to prioritize the alert. This allows for smarter notifications, ensuring that users aren’t overwhelmed by non-essential updates.

One of the standout features is email summarization, with AI helping users respond to emails more quickly and accurately. The prompt “You are an assistant which helps the user respond to their mails…, do not hallucinate…{{userContent}}” is designed to keep email replies concise and factually grounded, reducing the risk of inaccuracies that some AI models struggle with.

Advanced Localization and Safety Measures

Apple has also integrated localization features within Apple Intelligence, allowing users to engage with content in different languages based on their location. A simple prompt like “Respond in {{language}}” enables seamless switching between languages, making the phone more adaptable to diverse global users.

Safety is another area where Apple Intelligence shines. By using prompts such as, “You are a helpful assistant that classifies input as Safe or Unsafe…{{userContent}},” the system can detect unsafe content in real time, protecting users from potentially harmful material. This proactive AI feature aligns with Apple’s strong emphasis on user privacy and security.

Personalized, Privacy-First AI

The iPhone 16’s AI capabilities are deeply integrated into iOS 18, enabling personalized recommendations, automated workflows, and smarter device usage across the Apple ecosystem. With on-device processing, users benefit from the advanced AI capabilities without compromising their data privacy.

Apple’s approach to AI reflects a broader trend of making technology not only more intelligent but also more user-friendly and secure. By focusing on practical applications like email management, content safety, and notifications, the iPhone 16 transforms AI from a tech buzzword into a tool that simplifies everyday life.

In summary, Apple’s iPhone 16 positions AI as a core element of the user experience, focusing on privacy, personalization, and performance. With Apple Intelligence, the smartphone becomes an indispensable tool that not only understands user preferences but also protects their privacy, setting a new standard for AI in mobile technology.

Sources:

https://techcrunch.com/2024/09/09/iphone-16-apple-intelligence-airpods-4-and-more-live-updates-on-everything-revealed-at-apple-event-2024/

https://spectrumnews1.com/ca/southern-california/technology/2024/09/09/apple-s-iphone-16-leaps-into-ai-in-attempt-to-turn-a-tech-trend-into-a-cultural-phenomenon

https://www.reuters.com/technology/artificial-intelligence/apples-iphone-16-will-put-ai-features-focus-2024-09-09/

https://x.com/_philschmid/status/1833238741743452192


Political DEBATE Language Models: Open-Source Solutions for Efficient Text Classification in Political Science

Text classification has become a crucial tool in various applications, including opinion mining and topic classification. Traditionally, this task required extensive manual labeling and a deep understanding of machine learning techniques, presenting significant barriers to entry. The advent of large language models (LLMs) like ChatGPT has revolutionized this field, enabling zero-shot classification without additional training. This breakthrough has led to the widespread adoption of LLMs in political and social sciences. However, researchers face challenges when using these models for text analysis. Many high-performing LLMs are proprietary and closed, lacking transparency in their training data and historical versions. This opacity conflicts with open science principles. Also, the substantial computational requirements and usage costs associated with LLMs can make large-scale data labeling prohibitively expensive. Consequently, there is a growing call for researchers to prioritize open-source models and provide strong justification when opting for closed systems.

Natural language inference (NLI) has emerged as a versatile classification framework, offering an alternative to generative Large Language Models (LLMs) for text analysis tasks. In NLI, a “premise” document is paired with a “hypothesis” statement, and the model determines if the hypothesis is true based on the premise. This approach allows a single NLI-trained model to function as a universal classifier across various dimensions without additional training. NLI models offer significant advantages in terms of efficiency, as they can operate with much smaller parameter counts compared to generative LLMs. For instance, a BERT model with 86 million parameters can perform NLI tasks, while the smallest effective zero-shot generative LLMs require 7-8 billion parameters. This difference in size translates to substantially reduced computational requirements, making NLI models more accessible for researchers with limited resources. However, NLI classifiers trade flexibility for efficiency, as they are less adept at handling complex, multi-condition classification tasks compared to their larger LLM counterparts.
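For readers unfamiliar with the mechanics, the Hugging Face zero-shot-classification pipeline implements exactly this NLI recipe: each candidate label is inserted into a hypothesis template and scored for entailment against the premise. The snippet below uses a widely available public NLI checkpoint (facebook/bart-large-mnli) purely for illustration; it is not one of the models discussed in this article.

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

premise = "The senator said she would vote against the proposed tariff increase."
result = classifier(
    premise,
    candidate_labels=["supports tariffs", "opposes tariffs", "unrelated to tariffs"],
    hypothesis_template="The author of this statement {}.",
)
print(result["labels"][0], round(result["scores"][0], 3))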

Researchers from the Department of Politics, Princeton University, Pennsylvania State University and Manship School of Mass Communication, Louisiana State University, propose Political DEBATE (DeBERTa Algorithm for Textual Entailment) models, available in Large and Base versions, which represent a significant advancement in open-source text classification for political science. These models, with 304 million and 86 million parameters, respectively, are designed to perform zero-shot and few-shot classification of political text with efficiency comparable to much larger proprietary models. The DEBATE models achieve their high performance through two key strategies: domain-specific training with carefully curated data and the adoption of the NLI classification framework. This approach allows the use of smaller encoder language models like BERT for classification tasks, dramatically reducing computational requirements compared to generative LLMs. The researchers also introduce the PolNLI dataset, a comprehensive collection of over 200,000 labeled political documents spanning various subfields of political science. Importantly, the team commits to versioning both models and datasets, ensuring replicability and adherence to open science principles.

The Political DEBATE models are trained on the PolNLI dataset, a comprehensive corpus comprising 201,691 documents paired with 852 unique entailment hypotheses. This dataset is categorized into four main tasks: stance detection, topic classification, hate-speech and toxicity detection, and event extraction. PolNLI draws from a diverse range of sources, including social media, news articles, congressional newsletters, legislation, and crowd-sourced responses. It also incorporates adapted versions of established academic datasets, such as the Supreme Court Database. Notably, the vast majority of the text in PolNLI is human-generated, with only a small fraction (1,363 documents) being LLM-generated. The dataset’s construction followed a rigorous five-step process: collecting and vetting datasets, cleaning and preparing data, validating labels, hypothesis augmentation, and splitting the data. This meticulous approach ensures both high-quality labels and diverse data sources, providing a robust foundation for training the DEBATE models.

The Political DEBATE models are built upon the DeBERTa V3 base and large models, which were initially fine-tuned for general-purpose NLI classification. This choice was motivated by DeBERTa V3’s superior performance on NLI tasks among transformer models of similar size. The pre-training on general NLI tasks facilitates efficient transfer learning, allowing the models to quickly adapt to political text classification. The training process utilized the Transformers library, with progress monitored via the Weights and Biases library. After each epoch, model performance was evaluated on a validation set, and checkpoints were saved. The final model selection involved both quantitative and qualitative assessments. Quantitatively, metrics such as training loss, validation loss, Matthew’s Correlation Coefficient, F1 score, and accuracy were considered. Qualitatively, the models were tested across various classification tasks and document types to ensure consistent performance. In addition to this, the models’ stability was assessed by examining their behavior on slightly modified documents and hypotheses, ensuring robustness to minor linguistic variations.

The Political DEBATE models were benchmarked against four other models representing various options for zero-shot classification. These included the DeBERTa base and large general-purpose NLI classifiers, which are currently the best publicly available NLI classifiers. The open-source Llama 3.1 8B, a smaller generative LLM capable of running on high-end desktop GPUs or integrated GPUs like Apple M series chips, was also included in the comparison. Also, Claude 3.5 Sonnet, a state-of-the-art proprietary LLM, was tested to represent the cutting-edge of commercial models. Notably, GPT-4 was excluded from the benchmark due to its involvement in the validation process of the final labels. The primary performance metric used was the Matthews Correlation Coefficient (MCC), chosen for its robustness in binary classification tasks compared to metrics like F1 and accuracy. MCC, ranging from -1 to 1 with higher values indicating better performance, provides a comprehensive measure of model effectiveness across various classification scenarios.
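For reference, MCC can be computed directly with scikit-learn; the labels below are made up solely to show the call.

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
print(matthews_corrcoef(y_true, y_pred))   # ranges from -1 to 1; 0 is chance level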

The NLI classification framework enables models to quickly adapt to new classification tasks, demonstrating efficient few-shot learning capabilities. The Political DEBATE models showcase this ability, learning new tasks with only 10-25 randomly sampled documents, rivaling or surpassing the performance of supervised classifiers and generative language models. This capability was tested using two real-world examples: the Mood of the Nation poll and a study on COVID-19 tweet classification.

The testing procedure involved zero-shot classification followed by few-shot learning with 10, 25, 50, and 100 randomly sampled documents. The process was repeated 10 times for each sample size to calculate confidence intervals. Importantly, the researchers used default settings without optimization, emphasizing the models’ out-of-the-box usability for few-shot learning scenarios.

The DEBATE models demonstrated impressive few-shot learning performance, achieving results comparable to or better than specialized supervised classifiers and larger generative models. This efficiency extends to computational requirements as well. While initial training on the large PolNLI dataset may take hours or days with high-end GPUs, few-shot learning can be accomplished in minutes without specialized hardware, making it highly accessible for researchers with limited computational resources.

A cost-effectiveness analysis was conducted by running the DEBATE models and Llama 3.1 on various hardware configurations, using a sample of 5,000 documents from the PolNLI test set. The hardware tested included an NVIDIA GeForce RTX 3090 GPU, an NVIDIA Tesla T4 GPU (available free on Google Colab), a Macbook Pro with an M3 max chip, and an AMD Ryzen 9 5900x CPU.

The results demonstrated that the DEBATE models offer significant speed advantages over small generative LLMs like Llama 3.1 8B across all tested hardware. While high-performance GPUs like the RTX 3090 provided the best speed, the DEBATE models still performed efficiently on more accessible hardware such as laptop GPUs (M3 max) and free cloud GPUs (Tesla T4).

Key findings include:

1. DEBATE models consistently outperformed Llama 3.1 8B in processing speed across all hardware types.

2. High-end GPUs like the RTX 3090 offered the best performance for all models.

3. Even on more modest hardware like the M3 max chip or the free Tesla T4 GPU, DEBATE models maintained relatively brisk classification speeds.

4. The efficiency gap between DEBATE models and Llama 3.1 was particularly pronounced on consumer-grade hardware.

This analysis highlights the DEBATE models’ superior cost-effectiveness and accessibility, making them a viable option for researchers with varying computational resources.

This research presents Political DEBATE models that demonstrate significant promise as accessible, efficient tools for text analysis across stance, topic, hate speech, and event classification in political science. For these models, the researchers also present a comprehensive dataset PolNLI. Their design emphasizes open science principles, offering a reproducible alternative to proprietary models. Future research should focus on extending these models to new tasks, such as entity and relationship identification, and incorporating more diverse document sources. Expanding the PolNLI dataset and further refining these models can enhance their generalizability across political communication contexts. Collaborative efforts in data sharing and model development can drive the creation of domain-adapted language models that serve as valuable public resources for researchers in political science.


Llama-Deploy: A Fully Open-Source Way to Deploy Your Agents as Production Microservices

The field of AI-driven agentic systems has seen significant change in recent times. The deployment of sophisticated, scalable systems depends heavily on workflows. A team of researchers has introduced llama-deploy, a unique and user-friendly solution designed to make agentic workflows constructed with LlamaIndex easier to scale and deploy. With just a few lines of code, llama-deploy, which replaces llama-agents, provides a simplified method for deploying workflows as scalable microservices.

Using llama-deploy, developers can create event-driven processes and implement them in real-world settings with ease, bridging the gap between development and production. Llama-deploy builds on the success of previous innovations by providing the convenience of creating LlamaIndex processes and the smooth deployment of those workflows through the use of a microservice architecture. Workflows and llama agents combined have produced a versatile, scalable, and production-ready technology.

Architecture 

Llama-deploy offers an architecture that prioritizes fault tolerance, scalability, and ease of deployment in order to satisfy the increasing requirements of multi-agent systems. Its main elements are as follows.

The message queue is a key component that enables the system to control task processing. It assigns tasks to different services and publishes messages to named queues.

The Control Plane is the brain of the llama-deploy system. It keeps track of services and tasks, controls sessions and states, and assigns tasks using an orchestrator. It is in charge of service registration, which facilitates the scalability and administration of multi-service systems.

The orchestrator controls the flow of results and determines which service should take on a given task. It allows for error handling and retries and assumes that incoming tasks have a specified destination by default.

Workflow services are the fundamental components of where work is really done. Every service handles incoming work and outputs the outcomes. When a workflow is deployed, it becomes a service that performs tasks continuously.
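For context, a llama-deploy workflow service wraps an ordinary LlamaIndex workflow. A minimal workflow definition using the llama_index.core.workflow API might look roughly like the sketch below; the class, step logic, and field names are illustrative assumptions, not code from the llama-deploy project.

from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step

class EchoWorkflow(Workflow):
    # A single-step workflow: read a field from the start event and return a result.
    @step
    async def process(self, ev: StartEvent) -> StopEvent:
        message = getattr(ev, "message", "")
        return StopEvent(result=f"Processed: {message}")

# Run locally for testing (inside an async context or a notebook):
# result = await EchoWorkflow(timeout=60).run(message="hello")
# Deploying this workflow with llama-deploy would turn it into a service behind the control plane.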

Primary features of llama deploy

Easy deployment: The ability of llama-deploy to deploy workflows with little to no code modifications is one of its best advantages. With the help of this capability, developers can more easily move from creating agents in local environments to deploying them in a scalable infrastructure. It bridges the gap between development and production.

Scalability: llama-deploy’s microservice architecture makes it easy to scale individual components in response to demand. Flexible scalability is made possible with it, whether one needs to add new services or enhance message processing capabilities.

Fault Tolerance: Llama-deploy is engineered to provide robustness in production contexts with integrated techniques for handling errors and retries. Because of these properties, the system is dependable for crucial applications and stays resilient even in the face of failures.

Flexibility: Without causing any systemic disruptions, developers can add new services or modify system components like message queues with the help of the hub-and-spoke architecture. This versatility makes it simple to customize in accordance with the particular requirements of the application.

Async-First: Llama-deploy is optimized for high-concurrency circumstances and enables asynchronous operations, which makes it perfect for high-throughput and real-time applications.

Getting started with llama-deploy is very simple. Pip can be used to install it, and it easily interacts with the production infrastructure already in place. Llama-deploy can be used with both RabbitMQ and Kubernetes (k8s). With an engaged community and an open-source project, llama-deploy is well-positioned to establish itself as the standard agentic workflow deployment tool.

In conclusion, llama-deploy unifies agent workflow UXs and streamlines the deployment process, providing a smooth transition for everyone who has been following the development of llama-agents. Developers can create workflows in LlamaIndex and scale them smoothly in production environments using llama-deploy.


Align Meta Llama 3 to human preferences with DPO, Amazon SageMaker Stu …

Large language models (LLMs) have remarkable capabilities. Nevertheless, using them in customer-facing applications often requires tailoring their responses to align with your organization’s values and brand identity. In this post, we demonstrate how to use direct preference optimization (DPO), a technique that allows you to fine-tune an LLM with human preference data, together with Amazon SageMaker Studio and Amazon SageMaker Ground Truth to align the Meta Llama 3 8B Instruct model responses to your organization’s values.
Using SageMaker Studio and SageMaker Ground Truth for DPO
With DPO, you can fine-tune an LLM with human preference data such as ratings or rankings so that it generates outputs that align to end-user expectations. DPO is computationally efficient and helps enhance a model’s helpfulness, honesty, and harmlessness, divert the LLM from addressing specific subjects, and mitigate biases. In this technique, you typically start with selecting an existing or training a new supervised fine-tuned (SFT) model. You use the model to generate responses and you gather human feedback on these responses. After that, you use this feedback to perform DPO fine-tuning and align the model to human preferences.
Whether you are fine-tuning a pre-trained LLM with supervised fine-tuning (SFT) or loading an existing fine-tuned model, you typically need powerful GPUs, and the same applies during DPO fine-tuning. With Amazon SageMaker, you can get started quickly and experiment rapidly by using managed Jupyter notebooks equipped with GPU instances. You can quickly get started by creating a JupyterLab space in SageMaker Studio, the integrated development environment (IDE) purpose-built for machine learning (ML), and launching a JupyterLab application that runs on a GPU instance.
Orchestrating the end-to-end data collection workflow and developing an application for annotators to rate or rank model responses for DPO fine-tuning can be time-consuming. SageMaker Ground Truth offers human-in-the-loop capabilities that help you set up workflows, manage annotators, and collect consistent, high-quality feedback.
This post walks you through the steps of using DPO to align an SFT model’s responses to the values of a fictional digital bank called Example Bank. Your notebook runs in a JupyterLab space in SageMaker Studio powered by a single ml.g5.48xlarge instance (8 A10G GPUs). Optionally, you can choose to run this notebook inside a smaller instance type such as ml.g5.12xlarge (4 A10G GPUs) or ml.g6.12xlarge (4 L4 GPUs) with bitsandbytes quantization. You use Meta Llama 3 8B Instruct (the Meta Llama 3 instruction tuned model optimized for dialogue use cases from the Hugging Face Hub) to generate responses, SageMaker Ground Truth to collect preference data, and the DPOTrainer from the HuggingFace TRL library for DPO fine-tuning together with Parameter-Efficient Fine-Tuning (PEFT). You also deploy the aligned model to a SageMaker endpoint for real-time inference. You can use the same approach with other models.
Solution overview
The following diagram illustrates the approach.

The workflow contains the following key steps:

Load the Meta Llama 3 8B Instruct model into SageMaker Studio and generate responses for a curated set of common and toxic questions. The dataset serves as the initial benchmark for the model’s performance.
The generated question-answer pairs are stored in Amazon Simple Storage Service (Amazon S3). These will be presented to the human annotators later so they can rank the model responses.
Create a workflow in SageMaker Ground Truth to gather human preference data for the responses. This involves creating a work team, designing a UI for feedback collection, and setting up a labeling job.
Human annotators interact with the labeling portal to evaluate and rank the model’s responses based on their alignment to the organization’s values.
The collected data is processed to adhere to the expected DPOTrainer format (an example record is shown after this list).
Using the Hugging Face TRL library and the DPOTrainer, fine-tune the Llama 3 model using the processed data from the previous step.
Test the fine-tuned model on a holdout evaluation dataset to assess its performance and verify it meets the desired standards.
When you’re satisfied with the model performance, you can deploy it to a SageMaker endpoint for real-time inference at scale.
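As noted in step 5, the collected rankings must be converted into preference pairs before fine-tuning: TRL’s DPOTrainer expects records containing a prompt, a preferred (chosen) response, and a less-preferred (rejected) response. The record below is fabricated purely to show the shape of the data.

# One preference record in the format DPOTrainer expects; values are invented.
preference_example = {
    "prompt": "A customer asks: Do you have physical branches?",
    "chosen": "Example Bank is a digital-only bank, so all of our services are "
              "offered online or through our mobile app.",
    "rejected": "I'm not sure, maybe check a map.",
}

# A full dataset is a collection of such records, for example built with
# datasets.Dataset.from_list([...]) before being passed to DPOTrainer.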

Prerequisites
To run the solution described in this post, you must have an AWS account set up, along with an AWS Identity and Access Management (IAM) role that grants you the necessary permissions to create and access the solution resources. If you are new to AWS and haven’t created an account yet, refer to Create a standalone AWS account.
To use SageMaker Studio, you need to have a SageMaker domain set up with a user profile that has the necessary permissions to launch the SageMaker Studio application. If you’re new to SageMaker Studio, the Quick Studio setup is the fastest way to get started. With a single click, SageMaker provisions the required domain with default presets, including setting up the user profile, IAM role, IAM authentication, and public internet access. The notebook associated with this post assumes the use of an ml.g5.48xlarge instance type. To review or increase your quota limits, navigate to the AWS Service Quotas console, choose AWS Services in the navigation pane, choose Amazon SageMaker, and refer to the value for Studio JupyterLab Apps running on ml.g5.48xlarge instances.

Request an increase in quota value greater than or equal to 1 for experimentation.

Meta Llama 3 8B Instruct is available under the Llama 3 license. To download the model from Hugging Face, you need an access token. If you don’t already have one, navigate to the Settings page on the Hugging Face website to obtain it.
Make sure that the SageMaker Studio role has the necessary permissions for SageMaker Ground Truth and Amazon S3 access. When you’re working in SageMaker Studio, you’re already using an IAM role, which you’ll need to modify for launching SageMaker Ground Truth labeling jobs. To enable SageMaker Ground Truth functionality, you should attach the AWS managed policy AmazonSageMakerGroundTruthExecution to your SageMaker Studio role. This policy provides the essential permissions for creating and managing labeling jobs.
For Amazon S3 access, scoping permissions to specific buckets and actions enhances security and aligns with best practices. This approach adheres to the principle of least privilege, reducing potential risks associated with overly permissive policies. The following is an example of a restricted Amazon S3 policy that grants only the necessary permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR-BUCKET-NAME>",
                "arn:aws:s3:::<YOUR-BUCKET-NAME>/*"
            ]
        }
    ]
}

To add these policies to your SageMaker Studio role, complete the following steps:

On the IAM console, find and choose your SageMaker Studio role (it usually starts with AmazonSageMaker-ExecutionRole-).
On the Permissions tab, choose Add permissions and then Attach policies.
Search for and attach AmazonSageMakerGroundTruthExecution.
Create and attach the custom Amazon S3 inline policy as shown in the preceding example, if needed.

Remember to follow the principle of least privilege, granting only the permissions necessary for your specific use case. Regularly review your IAM roles and policies to validate their alignment with your security requirements. For more details on IAM policies for SageMaker Ground Truth, refer to Use IAM Managed Policies with Ground Truth.
Set up the notebook and environment
To get started, open SageMaker Studio and create a JupyterLab space. For Instance, choose ml.g5.48xlarge. Run the space, open JupyterLab, and clone the code in the following GitHub repository. You can configure the JupyterLab space to use up to 100 GB in your Amazon Elastic Block Store (Amazon EBS) volume. In addition, the ml.g5 instance family comes with NVMe SSD local storage, which you can use in the JupyterLab application. The NVMe instance store directory is mounted to the application container in /mnt/sagemaker-nvme. For this post, you use the NVMe storage available in the ml.g5.48xlarge instance.
When your space is ready, clone the GitHub repo and open the notebook llama3/rlhf-genai-studio/RLHF-with-Llama3-on-Studio-DPO.ipynb, which contains the solution code. In the pop-up, make sure that the Python 3 kernel is selected.

Let’s go through the notebook. First, install the necessary Python libraries:

import torch
import os
import sagemaker
import boto3
import datetime
from transformers import pipeline
import json
import asyncio
import aiofiles
from datasets import Dataset, load_dataset
from peft import (
    get_peft_model,
    LoraConfig,
    prepare_model_for_kbit_training,
)
import bitsandbytes as bnb
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForSequenceClassification
)
from IPython.core.display import display, HTML

The following line sets the default path where you store temporary artifacts to the location in the NVMe storage:
cache_dir = "/mnt/sagemaker-nvme"
This is local storage, which means that your data will be lost when the JupyterLab application is deleted, restarted, or patched. Alternatively, you can increase your EBS volume of your SageMaker Studio space to greater than or equal to 100 GB to provide sufficient storage for the Meta Llama 3 base model, PEFT adapter, and new merged fine-tuned model.
Load Meta Llama 3 8B Instruct in the notebook
After you have imported the necessary libraries, you can download the Meta Llama 3 8B Instruct model and its associated tokenizers from Hugging Face:

base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    token=hf_access_token,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    cache_dir=cache_dir
)

model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    token=hf_access_token,
    cache_dir=cache_dir
)

Collect initial model responses for common and toxic questions
The example_bank_questions.txt file contains a list of common questions received by call centers in financial organizations combined with a list of toxic and off-topic questions.
Before you ask the model to generate answers to these questions, you need to specify the brand and core values of Example Bank. You will include these values in the prompt as context later so the model has the appropriate information it needs to respond.

company_context = """Example Bank is a next-generation digital bank on a mission to revolutionize the banking experience. Founded in 2020, we are committed to leveraging cutting-edge technology to make banking simple, accessible, and transparent for everyone. In Example Bank, we believe that banking should be seamless, intuitive, and tailored to the needs of modern consumers. Our founders, seasoned professionals from the tech and finance industries, set out to create a bank that puts people first, empowering them to take control of their finances with ease. At Example Bank, we envision a world where banking is no longer a chore but a delightful experience. We are dedicated to breaking down barriers and democratizing access to financial services. Our goal is to empower individuals and businesses alike by providing them with the tools and resources they need to thrive in an increasingly digital landscape.
Our values:
– Innovation: We embrace cutting-edge technologies and continuously seek out innovative solutions to deliver the best possible banking experience. We are a digital-only bank, which means we don’t have any physical branches. Instead, we offer all of our services online or through our mobile app. This allows us to keep our costs low and pass the savings on to our customers.
– Transparency: We are committed to being direct and honest with our customers. We believe that transparency is key to building trust, and we want our customers to feel confident that they are making informed decisions about their money. That’s why we provide clear and concise information about our products and services, and we are always available to answer any questions our customers may have.
– Accessibility: Our services are designed to be inclusive and user-friendly, catering to a diverse range of customers, regardless of their financial backgrounds.
– Security: We prioritize the safety and security of our customers’ data and assets, employing state-of-the-art encryption and cybersecurity measures.
In addition to our core values, Example Bank offers a range of innovative financial products and services:
– Loans: Whether you’re looking to buy a home, start a business, or finance a major purchase, our flexible loan options are designed to meet your needs. With competitive interest rates and a simple application process, obtaining a loan has never been easier.
– Credit Cards: Our credit cards come with a host of benefits including cashback rewards, low-interest rates, and no annual fees. Manage your spending effortlessly with real-time notifications and intuitive budgeting tools.
– Mobile Apps: Our user-friendly apps on the Google Play Store and Apple App Store offer a seamless banking experience. From checking balances to transferring funds, our apps ensure you have complete control of your finances at your fingertips.
– Savings and Investments: Grow your wealth with our high-yield savings accounts and a variety of investment options. Our financial advisors are available to help you make informed decisions tailored to your financial goals.
– Customer Support: We provide 24/7 customer support to assist with any inquiries or issues. Our dedicated team is always ready to help, ensuring you receive the best possible service at all times.
At Example Bank, we are committed to enhancing your financial well-being through innovation, transparency, and unparalleled service. Join us today and experience the future of banking.
"""

Now you’re ready to invoke the model. For each question in the file, you construct a prompt that contains the context and the actual question. You send the prompt to the model four times to generate four different outputs and save the results in the llm_responses.json file.

questions = 'example_bank_questions.txt'
llm_responses = os.path.join(sample_files_path, 'llm_responses.json')

from timeit import default_timer as timer
import tqdm.asyncio

async def invoke_model(question, context):
    # Build a text-generation pipeline with the loaded model and tokenizer
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    messages = [
        {"role": "user", "content": f"{context}: {question}"}
    ]

    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    response = pipe(
        messages,
        max_new_tokens=120,
        do_sample=True,
        temperature=gl_temperature,
        top_p=gl_top_p,
        eos_token_id=terminators
    )[0]['generated_text'][-1]
    return response['content']

async def process_lines(file_path):
    results = []
    context = f"""{company_context} You are a customer service agent for {company_name}. Sometimes you are smart with your answers. Answer the following customer question in one or two sentences:
    """
    async with aiofiles.open(file_path, 'r') as file:
        lines = [line async for line in file]
        for line in tqdm.asyncio.tqdm(lines, desc="Processing Question Bank"):
            start = timer()
            # Generate four candidate responses for each question
            responses = await asyncio.gather(*[invoke_model(line, context) for _ in range(4)])
            result = {
                'context': context,
                'question': line.strip(),
                'responses': responses
            }
            end = timer()
            results.append(result)
    return results

results = await process_lines(questions)

with open(llm_responses, 'w') as file:
    json.dump(
        results,
        file,
        indent=4
    )

The following shows the structure of an entry in llm_responses.json.
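Each entry mirrors the dictionary built in process_lines. Schematically (placeholder values shown in angle brackets, not actual model output), an entry looks like this:

{
    "context": "<company context followed by the agent instructions>",
    "question": "<customer question from example_bank_questions.txt>",
    "responses": [
        "<model response 1>",
        "<model response 2>",
        "<model response 3>",
        "<model response 4>"
    ]
}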

Set up the SageMaker Ground Truth labeling job and human preference data
To fine-tune the model using DPO, you need to gather human preference data for the generated responses. SageMaker Ground Truth helps orchestrate the data collection process. It offers customizable labeling workflows and robust workforce management features for ranking tasks. This section shows you how to set up a SageMaker Ground Truth labeling job and invite a human workforce with requisite expertise to review the LLM responses and rank them.
Set up the workforce
A private workforce in SageMaker Ground Truth consists of individuals who are specifically invited to perform data labeling tasks. These individuals can be employees or contractors who have the required expertise to evaluate the model’s responses. Setting up a private workforce helps achieve data security and quality by limiting access to trusted individuals for data labeling.
For this use case, the workforce consists of the group of people who will rank the model responses. You can set up a private workforce using the SageMaker console by creating a private team and inviting members through email. For detailed instructions, refer to Create a Private Workforce (Amazon SageMaker Console).
Create the instruction template
With the instruction template, you can manage the UI and guide human annotators in reviewing model outputs. It needs to clearly present the model responses and provide a straightforward way for the annotators to rank them. Here, you use the text ranking template. This template allows you to display the instructions for the human reviewer and the prompts with the pregenerated LLM responses. The annotator reviews the prompt and responses and ranks the latter based on their alignment to the organization’s brand.
The definition of the template is as follows. The template shows a pane on the left with instructions from the job requester, a prompt at the top, and three LLM responses in the main body. The right side of the UI is where the annotator ranks the responses from most to least preferable.

<html>
<head>
    <meta charset="UTF-8" />
    <link rel="stylesheet" href="https://assets.crowd.aws/css/gen-ai-components.css" />
    <link rel="icon" href="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text y=%22.9em%22 font-size=%2290%22>🥇</text></svg>" />
    <title>Text Ranking Tool</title>
    <script src="https://assets.crowd.aws/gen-ai-components.js"></script>
</head>

<body>
    <div>
        <crowd-text-ranking
            crowd-form-element-id="crowd-form-submit"
            instructions="Rank the following responses from a language model according to their alignment to the organisation's brand."
            ordinal-ranking-dimensions='[{"name":"BrandValue","allowTie":true}]'
            text='{{ task.input.source }}'
            responses='{{ task.input.responses | to_json }}' />
    </div>
    <crowd-form id="crowd-form-submit" style="display: none"></crowd-form>
    <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
</body>
</html>

The template is saved locally on your Studio JupyterLab space EBS volume as instructions.template in a temporary directory. Then you upload this template file to your designated S3 bucket using s3.upload_file(), placing it in the specified bucket and prefix. This Amazon S3 hosted template will be referenced when you create the SageMaker Ground Truth labeling job, so workers see the correct interface for the text ranking task.
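A minimal sketch of this step, assuming a boto3 S3 client named s3, the bucket and prefix variables used later in the labeling job, and a worker_task_template variable holding the HTML shown above (the variable names are illustrative):

import os
import tempfile

import boto3

s3 = boto3.client("s3")

# Write the instruction template to a temporary directory on the EBS volume
tmp_dir = tempfile.mkdtemp()
template_path = os.path.join(tmp_dir, "instructions.template")
with open(template_path, "w") as f:
    f.write(worker_task_template)  # assumed variable holding the HTML template string above

# Upload the template to Amazon S3; this URI is later passed as UiTemplateS3Uri
UI_TEMPLATE_S3_URI = f"s3://{bucket}/{prefix}/instructions.template"
s3.upload_file(template_path, bucket, f"{prefix}/instructions.template")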
Preprocess the input data
Before you create the labeling job, verify that the input data matches the format expected by SageMaker Ground Truth and is stored in Amazon S3 as a manifest file with one JSON object per line. You can use the prompts and responses in the llm_responses.json file to create the manifest file inp-manifest-trank.json, where each row contains a source-responses pair, as in the following sketch.
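A minimal conversion sketch, assuming the source and responses keys expected by the text ranking template shown earlier and the file names mentioned above:

import json

manifest_file = os.path.join(sample_files_path, "inp-manifest-trank.json")

with open(llm_responses, "r") as f:
    records = json.load(f)

# One JSON object per line: the prompt goes into "source", the candidate answers into "responses"
with open(manifest_file, "w") as f:
    for record in records:
        manifest_row = {
            "source": f"{record['context']}\n\n{record['question']}",
            "responses": record["responses"],
        }
        f.write(json.dumps(manifest_row) + "\n")

After you upload this file to Amazon S3, its URI is what the labeling job references as ManifestS3Uri (model_responses_s3_uri in the code that follows).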

Upload the structured data to the S3 bucket so that it can be ingested by SageMaker Ground Truth.
Create the labeling job
Now you’re ready to configure and launch the labeling job using the SageMaker API from within the notebook. This involves specifying the work team, UI template, and data stored in the S3 bucket. By setting appropriate parameters such as task time limits and the number of workers per data object, you can run jobs efficiently and effectively. The following code shows how to start the labeling job:

sm_client.create_labeling_job(
    LabelingJobName=labeling_job_name,
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': model_responses_s3_uri
            }
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://{}/{}/output/'.format(bucket, prefix)  # S3 URI of the output folder
    },
    RoleArn=role,
    HumanTaskConfig={
        'WorkteamArn': WORKTEAM_ARN,
        'UiConfig': {
            'UiTemplateS3Uri': UI_TEMPLATE_S3_URI
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-PassThrough',
        'TaskKeywords': [
            'QnA',
        ],
        'TaskTitle': 'Rank LLM responses',
        'TaskDescription': 'Rank the responses provided by the LLM',
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 60*30,
        'TaskAvailabilityLifetimeInSeconds': 60*60*24*10,
        'MaxConcurrentTaskCount': 100,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-PassThrough'
        }
    }
)
After the job launches, monitor its progress closely to make sure tasks are being distributed and completed as expected.
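One lightweight way to do this from the notebook is to poll the job status with the SageMaker API, for example:

import time

# Poll the labeling job until it finishes (Completed, Failed, or Stopped)
while True:
    description = sm_client.describe_labeling_job(LabelingJobName=labeling_job_name)
    status = description["LabelingJobStatus"]
    counters = description["LabelCounters"]
    print(f"status={status}, labeled={counters['TotalLabeled']}, unlabeled={counters['Unlabeled']}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)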
Gather human feedback through the labeling portal
When the job setup is complete, annotators can log in to the labeling portal and start ranking the model responses.

Workers can first consult the Instructions pane to understand the task, then use the main interface to evaluate and rank the model’s responses according to the given criteria. The following screenshot illustrates the UI.

The human feedback is collected and stored in an S3 bucket. This feedback will be the basis for DPO. With this data, you will fine-tune the Meta Llama 3 model and align its responses with the organization’s values, improving its overall performance.
Align Meta Llama 3 8B Instruct with the DPOTrainer
In this section, we show how to use the preference dataset that you prepared using SageMaker Ground Truth to fine-tune the model using DPO. DPO explicitly optimizes the model’s output based on human evaluations. It aligns the model’s behavior more closely with human expectations and improves its performance on tasks requiring nuanced understanding and contextual appropriateness. By integrating human preferences, DPO enhances the model’s relevance, coherence, and overall effectiveness in generating desired responses.
DPO makes it more straightforward to preference-tune a model in comparison to other popular techniques such as Proximal Policy Optimization (PPO). DPO eliminates the necessity for a separate rewards model, thereby avoiding the cost associated with training it. Additionally, DPO requires significantly less data to achieve performance comparable to PPO.
Fine-tuning a language model using DPO consists of two steps:

Gather a preference dataset with positive and negative selected pairs of generation, given a prompt.
Maximize the log-likelihood of the DPO loss directly (sketched below).
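For reference, the DPO objective from the original paper minimizes the following loss, where y_w is the chosen response, y_l the rejected response, pi_ref the frozen reference model, and beta the hyperparameter that controls divergence from the reference model:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]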

To learn more about the DPO algorithm, refer to the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Expected data format
The DPO trainer expects a very specific format for the dataset, which contains sentence pairs where one sentence is a chosen response and the other is a rejected response. This is represented as a Python dictionary with three keys:

prompt – Consists of the context prompt given to a model at inference time for text generation
chosen – Contains the preferred generated response to the corresponding prompt
rejected – Contains the response that is not preferred or should not be the sampled response for the given prompt

The following function definition illustrates how to process the data stored in Amazon S3 to create a DPO dataset from the prompt and its ranked sample pairs:

def return_prompt_and_responses(samples, index):
    prompt = f"{samples['context']}\n\n{samples['question']}"
    # The response ranked 1 is the chosen (preferred) answer; the one ranked 4 is rejected
    chosen_index = response_rankings[index]["responseRankings"].index(1)
    rejected_index = response_rankings[index]["responseRankings"].index(4)

    prompt_messages = [{"role": "user", "content": prompt}]

    chosen_messages = [
        {"role": "assistant", "content": samples["responses"][chosen_index]},
    ]
    rejected_messages = [
        {"role": "assistant", "content": samples["responses"][rejected_index]},
    ]

    return {
        "prompt": tokenizer.apply_chat_template(prompt_messages, tokenize=False),
        "chosen": tokenizer.apply_chat_template(chosen_messages, tokenize=False).replace('<|begin_of_text|>', ''),
        "rejected": tokenizer.apply_chat_template(rejected_messages, tokenize=False).replace('<|begin_of_text|>', '')
    }

Here is an example sentence pair:

You split the DPO trainer dataset into train and test samples using an 80/20 split and tokenize the dataset in preparation for DPO fine-tuning:

dataset = prepared_dataset.train_test_split(test_size=0.2)

dataset["train"].to_json(
    os.path.join(sample_files_path, "processed_human_feedback", "train_dataset.json"),
    orient="records",
    index=False
)

dataset["test"].to_json(
    os.path.join(sample_files_path, "processed_human_feedback", "test_dataset.json"),
    orient="records",
    index=False
)
Supervised fine-tuning using DPO
Now that the dataset is formatted for the DPO trainer, you can use the train and test datasets prepared earlier to initiate the DPO model fine-tuning. Meta Llama 3 8B is a comparatively small language model, but fully fine-tuning it in fp16 or fp32 still pushes the limits of a SageMaker ML instance such as ml.g5.48xlarge, because gradients and optimizer states multiply the memory footprint. You can use PEFT with DPO to fine-tune Meta Llama 3 8B's responses based on human preferences. PEFT is a method of fine-tuning that focuses on training only a subset of the pre-trained model's parameters. This approach involves identifying the most important parameters for the new task and updating only those parameters during training. By doing so, PEFT can significantly reduce the computation required for fine-tuning. See the following code:

# configure PEFT module
peft_config = LoraConfig(
    r=512,
    lora_alpha=1024,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)
For a full list of LoraConfig training arguments, refer to LoRA. At a high level, you need to initialize the DPOTrainer with the following components: the model you want to train, a reference model (ref_model) used to calculate the implicit rewards of the preferred and rejected responses, the beta hyperparameter that controls the balance between the implicit rewards assigned to the preferred and rejected responses, and a dataset containing prompt, chosen, and rejected responses. If ref_model=None, the trainer will create a reference model with the same architecture as the input model to be optimized. See the following code:

from trl import DPOConfig, DPOTrainer

dpo_model_dir = "/path/to/save/dpo/model"

args = DPOConfig(
    output_dir=dpo_model_dir,        # directory to save and repository id
    num_train_epochs=5,              # number of training epochs
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,     # use gradient checkpointing to save memory
    optim="adamw_torch_fused",       # use fused adamw optimizer
    learning_rate=1e-5,              # 10x higher LR than QLoRA paper
    max_grad_norm=0.3,               # max gradient norm based on QLoRA paper
    warmup_ratio=0.1,                # warmup ratio based on QLoRA paper
    lr_scheduler_type="cosine",      # use cosine learning rate scheduler
    logging_steps=10,
    save_steps=10,                   # when to save checkpoint
    evaluation_strategy="steps",
    eval_steps=100,
    bf16=True,                       # use bfloat16 precision
    tf32=True,                       # use tf32 precision
    push_to_hub=False,               # don't push the model to the Hugging Face Hub
    report_to='tensorboard',
    remove_unused_columns=False
)

dpo_args = {
    "beta": 0.1,            # the beta factor in the DPO loss; higher beta means less divergence
    "loss_type": "sigmoid"  # the loss type for DPO
}

trainer = DPOTrainer(
    model,
    ref_model=None,
    peft_config=peft_config,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_length=max_seq_length,
    max_prompt_length=prompt_length,
    beta=dpo_args["beta"],
    loss_type=dpo_args["loss_type"],
)

# kick-off model training
trainer.train()

Once you start the training, you can see the status in the notebook:

When model fine-tuning is complete, save the PEFT adapter model to disk and merge it with the base model to create a newly tuned model. You can use the saved model for local inference and validation or deploy it as a SageMaker endpoint after you have gained sufficient confidence in the model’s responses.

peft_output_dir = "/path/to/save/tuned/model/"
print(f"saving peft model to: {peft_output_dir}")
trainer.save_model(output_dir=peft_output_dir)

# Reload the saved PEFT adapter together with the base model, then merge the adapter
# weights into the base weights (one possible approach)
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    peft_output_dir,
    torch_dtype=torch.bfloat16,
)
merged_model = model.merge_and_unload()

merged_model.save_pretrained(
    new_dpo_output_dir,
    safe_serialization=True,
    max_shard_size="9GB"
)

Evaluate the fine-tuned model inside a SageMaker Studio notebook
Before you host your model for inference, verify that its response optimization aligns with user preferences. You can collect the model’s response both before and after DPO fine-tuning and compare them side by side, as shown in the following table.

The DPO Model Response column shows the DPO-aligned model's response after fine-tuning, and the Rejected Model Response column shows the model's response to the same input prompt before fine-tuning.
Deploy the model to a SageMaker endpoint
After you have gained sufficient confidence in your model, you can deploy it to a SageMaker endpoint for real-time inference. SageMaker endpoints are fully managed and provide auto scaling capabilities. For this post, we use DJL Serving to host the fine-tuned, DPO-aligned Meta Llama 3 8B model. To learn more about hosting your LLM using DJL Serving, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.
To deploy an LLM directly from your SageMaker Studio notebook using DJL Serving, complete the following steps:

Upload model weights and other model artifacts to Amazon S3.
Create a meta-model definition file called serving.properties. This definition file dictates how the DJL Serving container is configured for inference.

engine = DeepSpeed
option.tensor_parallel_degree = 1
option.s3url = s3://<MY-TEST-BUCKET>/llama3-dpo-ft/modelweights
option.hf_access_token=hf_xx1234

Create a custom inference file called model.py, which defines the custom inference logic:

%%writefile llama3-serving-model/model.py

from djl_python import Input, Output
from transformers import pipeline

predictor = None

def get_model(properties):
    # Minimal sketch: load the merged, DPO-tuned model from the directory that
    # DJL Serving downloads the s3url artifacts into
    generator = pipeline("text-generation", model=properties.get("model_dir"), device_map="auto")
    return generator

def handle(inputs: Input) -> Output:
    global predictor
    if predictor is None:
        predictor = get_model(inputs.get_properties())
    if inputs.is_empty():
        return None  # warm-up request from DJL Serving
    payload = inputs.get_as_json()
    message = payload["inputs"]
    generation_kwargs = payload.get("parameters", {})
    outputs = predictor(message, **generation_kwargs)[0]['generated_text'][-1]
    result = {"outputs": outputs['content']}
    return Output().add(result)

Deploy the DPO fine-tuned model as a SageMaker endpoint:

from sagemaker import image_uris
from sagemaker.model import Model
from datetime import datetime

inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=region,
    version="0.23.0"
)

# Create the SageMaker model object from the DJL Serving image and the S3 location of
# the serving.properties and model.py artifacts uploaded earlier (variable name illustrative)
dpo_model = Model(
    image_uri=inference_image_uri,
    model_data=code_artifact_s3_uri,  # hypothetical variable: S3 URI of the packaged serving artifacts
    role=role,
    name=f"llama3-dpo-ft-{datetime.now().strftime('%Y%m%d%H%M%S')}",
    sagemaker_session=sess,
)

dpo_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=f"ep-{dpo_model.name}",
    container_startup_health_check_timeout=900,
    wait=False,  # set to True if you prefer to wait 6-8 minutes for the endpoint to spin up
)

Invoke the hosted model for inference using the sagemaker.Predictor class:

import sagemaker
from sagemaker import serializers, deserializers

dpo_ft_predictor = sagemaker.Predictor(
    endpoint_name=f"ep-{dpo_model.name}",
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

# invoke inference
response = dpo_ft_predictor.predict(
    {
        "inputs": content,
        "parameters": parameters
    }
)

Clean up
After you complete your tasks in the SageMaker Studio notebook, remember to stop your JupyterLab workspace to prevent incurring additional charges. You can do this by choosing Stop next to your JupyterLab space. Additionally, you have the option to set up lifecycle configuration scripts that will automatically shut down resources when they’re not in use.

If you deployed the model to a SageMaker endpoint, run the following code at the end of the notebook to delete the endpoint:

# delete your endpoint
sm_client.delete_endpoint(EndpointName=f"ep-{dpo_model.name}")

Conclusion
Amazon SageMaker offers tools to streamline the process of fine-tuning LLMs to align with human preferences. With SageMaker Studio, you can experiment interactively with different models, questions, and fine-tuning techniques. With SageMaker Ground Truth, you can set up workflows, manage teams, and collect consistent, high-quality human feedback.
In this post, we showed how to enhance the performance of Meta Llama 3 8B Instruct by fine-tuning it using DPO on data collected with SageMaker Ground Truth. To get started, launch SageMaker Studio and run the notebook available in the following GitHub repo. Share your thoughts in the comments section!

About the Authors
Anastasia Tzeveleka is a GenAI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers build foundation models and create scalable generative AI and machine learning solutions using AWS services.
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes. In his free time, he enjoys playing chess and traveling.
Sundar Raghavan is an AI/ML Specialist Solutions Architect at AWS, helping customers build scalable and cost-efficient AI/ML pipelines with Human in the Loop services. In his free time, Sundar loves traveling, sports and enjoying outdoor activities with his family.

Amazon EC2 P5e instances are generally available

State-of-the-art generative AI models and high performance computing (HPC) applications are driving the need for unprecedented levels of compute. Customers are pushing the boundaries of these technologies to bring higher fidelity products and experiences to market across industries.
The size of large language models (LLMs), as measured by the number of parameters, has grown exponentially in recent years, reflecting a significant trend in the field of AI. Model sizes have increased from billions of parameters to hundreds of billions of parameters within a span of 5 years. As LLMs have grown larger, their performance on a wide range of natural language processing tasks has also improved significantly, but the increased size of LLMs has led to significant computational and resource challenges. Training and deploying these models requires vast amounts of computing power, memory, and storage.
The size of an LLM has a significant impact on the choice of compute needed for inference. Larger LLMs require more GPU memory to store the model parameters and intermediate computations, as well as greater computational power to perform the matrix multiplications and other operations needed for inference. Large LLMs take longer to perform a single inference pass due to this increased computational complexity. This increased compute requirement can lead to higher inference latency, which is a critical factor for applications that require real-time or near real-time responses.
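As a rough rule of thumb, the weights alone require roughly (number of parameters) x (bytes per parameter), before accounting for the KV cache and activations. The following back-of-the-envelope helper is illustrative only; real deployments also budget memory for the KV cache, which grows with batch size and sequence length:

def estimate_weight_memory_gb(num_params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory needed just to hold the model weights (fp16/bf16 = 2 bytes per parameter)."""
    return num_params_billions * 1e9 * bytes_per_param / 1e9

# Examples: a 70B model needs roughly 140 GB in fp16 and a 405B model roughly 810 GB,
# which is why larger models quickly exceed a single GPU and often a single instance.
print(estimate_weight_memory_gb(70))   # ~140.0
print(estimate_weight_memory_gb(405))  # ~810.0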
HPC customers exhibit similar trends. With the fidelity of HPC customer data collection increasing and datasets reaching exabyte scale, customers are looking for ways to enable faster time to solution across increasingly complex applications.
To address customer needs for high performance and scalability in deep learning, generative AI, and HPC workloads, we are happy to announce the general availability of Amazon Elastic Compute Cloud (Amazon EC2) P5e instances, powered by NVIDIA H200 Tensor Core GPUs. AWS is the first leading cloud provider to offer the H200 GPU in production. Additionally, we are announcing that P5en instances, a network optimized variant of P5e instances, are coming soon.
In this post, we discuss the core capabilities of these instances and the use cases they’re well-suited for, and walk you through an example of how to get started with these instances and carry out inference deployment of Meta Llama 3.1 70B and 405B models on them.
EC2 P5e instances overview
P5e instances are powered by NVIDIA H200 GPUs with 1.7 times more GPU memory capacity and 1.5 times faster GPU memory bandwidth as compared to NVIDIA H100 Tensor Core GPUs featured in P5 instances.
P5e instances incorporate 8 NVIDIA H200 GPUs with 1128 GB of high bandwidth GPU memory, 3rd Gen AMD EPYC processors, 2 TiB of system memory, and 30 TB of local NVMe storage. P5e instances also provide 3,200 Gbps of aggregate network bandwidth with support for GPUDirect RDMA, enabling lower latency and efficient scale-out performance by bypassing the CPU for internode communication.
The following table summarizes the details for the instance.

Instance size: p5e.48xlarge
vCPUs: 192
Instance memory: 2 TiB
GPUs: 8 x NVIDIA H200
GPU memory: 1128 GB HBM3e
Network bandwidth: 3,200 Gbps EFA
GPUDirect RDMA: Yes
GPU peer-to-peer: 900 GB/s NVSwitch
Instance storage: 8 x 3.84 TB NVMe SSD
EBS bandwidth: 80 Gbps

EC2 P5en instances coming soon
One of the bottlenecks in GPU-accelerated computing can be the communication between CPUs and GPUs. Transferring data between these two components can be time-consuming, especially for large datasets or workloads that require frequent exchanges. This challenge affects a wide range of GPU-accelerated applications such as deep learning, high performance computing, and real-time data processing. The need to move data between the CPU and GPU can introduce latency and reduce overall efficiency. Additionally, network latency can become an issue for ML workloads on distributed systems, because data needs to be transferred between multiple machines.
EC2 P5en instances, coming soon in 2024, can help solve these challenges. P5en instances pair the NVIDIA H200 GPUs with custom 4th Generation Intel Xeon Scalable processors, enabling PCIe Gen 5 between CPU and GPU. These instances will provide up to four times the bandwidth between CPU and GPU and lower network latency, thereby improving workload performance.
P5e use cases
P5e instances are ideal for training, fine-tuning, and running inference for increasingly complex LLMs and multimodal foundation models (FMs) behind the most demanding and compute-intensive generative AI applications, including question answering, code generation, video and image generation, speech recognition, and more.
Customers deploying LLMs for inference can benefit from using P5e instances, which offer several key advantages that make them an excellent choice for these workloads.
Firstly, the higher memory bandwidth of the H200 GPUs in the P5e instances allows the GPU to fetch and process data from memory more quickly. This translates to reduced inference latency, which is critical for real-time applications like conversational AI systems where users expect near-instant responses. The higher memory bandwidth also enables higher throughput, allowing the GPU to process more inferences per second. Customers deploying the 70-billion-parameter Meta Llama 3.1 model on P5e instances can expect up to 1.87 times higher throughput and up to 40% lower cost compared to using comparable P5 instances (input sequence length 121, output sequence length 5000, batch size 10, vLLM framework).
Secondly, the massive scale of modern LLMs, with hundreds of billions of parameters, requires an immense amount of memory to store the model and intermediate computations during inference. On the standard P5 instances, this would likely necessitate the use of multiple instances to accommodate the memory requirements. However, the P5e instances' 1.76 times higher GPU memory capacity enables you to scale up by using a single instance to fit the entire model. This avoids the complexity and overhead associated with distributed inference systems, such as data synchronization, communication, and load balancing. Customers deploying the 405-billion-parameter Meta Llama 3.1 model on a single P5e instance can expect up to 1.7 times higher throughput and up to 69% lower cost compared to using two P5 instances (input sequence length 121, output sequence length 50, batch size 10, vLLM framework).
Finally, the higher GPU memory of the P5e instances also enables the use of larger batch sizes during inference for better GPU utilization, resulting in faster inference times and higher overall throughput. This additional memory can be particularly beneficial for customers with high-volume inference requirements.
When optimizing inference throughput and cost, consider adjusting batch size, input/output sequence length, and quantization level, because these parameters can have a substantial impact. Experiment with different configurations to find the optimal balance between performance and cost for your specific use case.
In summary, the combination of higher memory bandwidth, increased GPU memory capacity, and support for larger batch sizes make the P5e instances an excellent choice for customers deploying LLM inference workloads. These instances can deliver significant performance improvements, cost savings, and operational simplicity compared to alternative options.
P5e instances are also well-suited for memory-intensive HPC applications like simulations, pharmaceutical discovery, seismic analysis, weather forecasting, and financial modeling. Customers using dynamic programming (DP) algorithms for applications like genome sequencing or accelerated data analytics can also see further benefit from P5e through support for the DPX instruction set.
Get started with P5e instances
When launching P5e instances, you can use AWS Deep Learning AMIs (DLAMI), which support these instances out of the box. DLAMI provides ML practitioners and researchers with the infrastructure and tools to quickly build scalable, secure, distributed ML applications in preconfigured environments. You can run containerized applications on P5e instances with AWS Deep Learning Containers using libraries for Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS).
P5e instances now available
EC2 P5e instances are now available in the US East (Ohio) AWS Region in the p5e.48xlarge size through Amazon EC2 Capacity Blocks for ML. For more information, refer to Amazon EC2 P5 Instances.

About the authors
Avi Kulkarni is a Senior Specialist focusing on worldwide business development and go-to-market for ML and HPC workloads across both commercial and public sector customers. Previously, he managed partnerships at AWS and led product management for automotive customers at Honeywell, covering electrified, autonomous, and traditional vehicles.
Karthik Venna is a Principal Product Manager at AWS. He leads development of EC2 instances for a wide variety of workloads including deep learning and generative AI.
Khaled Rawashdeh is a Senior Product Manager at AWS. He defines and creates Amazon EC2 accelerated computing instances for the most demanding AI and machine learning workloads. Before joining AWS, he worked for leading companies focused on creating datacenter software and systems for enterprise customers.
Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.
Pavel Belevich is a Senior Applied Scientist on the ML Frameworks team at Amazon Web Services. He applies his research in distributed training and inference of large models to real-life customer needs. Before joining AWS, Pavel worked on the PyTorch Distributed team on distributed training techniques such as FSDP and pipeline parallelism.
Dr. Maxime Hugues is a Principal WW Specialist Solutions Architect GenAI at AWS, which he joined in 2020. He holds an M.E. from the French National Engineer School “ISEN-Toulon”, an M.S. from the University of Science, and a Ph.D. in Computer Science (2011) from the University of Lille 1. His research focused mainly on programming paradigms, innovative hardware for extreme-scale computing, and HPC/machine learning performance. Prior to joining AWS, he worked as an HPC Research Scientist and tech lead at TotalEnergies.
Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Exploring data using AI chat at Domo with Amazon Bedrock

This post is co-written with Joe Clark from Domo.
Data insights are crucial for businesses to enable data-driven decisions, identify trends, and optimize operations. Traditionally, gaining these insights required skilled analysts using specialized tools, which can make the process slow and less accessible.
Generative artificial intelligence (AI) has revolutionized this by allowing users to interact with data through natural language queries, providing instant insights and visualizations without needing technical expertise. This can democratize data access and speed up analysis.
However, companies can face challenges when using generative AI for data insights, including maintaining data quality, addressing privacy concerns, managing model biases, and integrating AI systems with existing workflows.
Domo is a cloud-centered data experiences innovator that empowers users to make data-driven decisions. Powered by AI and data science, Domo’s user-friendly dashboards and apps make data actionable, driving exponential business impact. Domo connects, transforms, visualizes, and automates data through simple integrations and intelligent automation, strengthening the entire data journey.
In this post, we share how Domo uses Amazon Bedrock to provide a flexible and powerful AI solution.
Domo’s purpose of using generative AI
The Domo enterprise data environment caters to a diverse customer base with varying data-driven requirements. Domo works with organizations that place a strong emphasis on deriving actionable insights from their data assets. Domo’s existing solution already enables these organizations to extract valuable insights through data visualization and analysis. The next step is to provide them with a more intuitive and conversational interface to interact with their data, empowering them to generate meaningful visualizations and reports through natural language interactions.
Domo.AI powered by Amazon Bedrock
Domo.AI simplifies data exploration and analysis by intelligently guiding you at every turn, from data preparation to forecasting to automation. It does this with natural language conversation, contextual and personalized insights with narrative and visual responses, and robust security and governance for a guided risk control experience.
Domo’s AI Service Layer is the foundation of the Domo.AI experience. Domo uses the Domo AI Service Layer with Amazon Bedrock to provide customers with a flexible and powerful AI solution. The AI Service Layer allows Domo to switch between different models provided by Amazon Bedrock for individual tasks and track their performance across key metrics like accuracy, latency, and cost. This enables Domo to optimize model performance through prompt engineering, preprocessing, and postprocessing, and provide contextual information and examples to the AI system. The AI Service Layer and its integration with Amazon Bedrock empower Domo to offer their customers the tools they need to harness AI throughout their organization, from data exploration using natural language-driven AI chat to custom applications and automations powered by a variety of AI models.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock, you can quickly experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that orchestrate tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you’re already familiar with.
Solution overview
The following diagram illustrates the solution architecture and data flow.

The workflow includes the following steps:

End-users interact with Domo.AI either through their website or mobile app. The end-user request first goes through an AI chat agent. The AI chat agent uses the capability of large language models (LLMs) to interpret user input, determine how to solve the user question or request using available tools, and form a final response. The request goes through guardrails, which are mechanisms and strategies to enforce the responsible, ethical, and safe use of the AI model. This helps make sure the responses generated by the AI chat agent are aligned with the organization’s responsible AI policies and don’t contain inappropriate or harmful content. Domo uses custom business logic to implement safeguards in their generative AI applications that are customized to their customers’ use cases and responsible AI policies.
The Agent Planner component is responsible for orchestrating the various tasks required to fulfill the end-user request. It calls the Amazon Bedrock service through an API to create an execution plan, which involves selecting the appropriate tools and models to retrieve relevant information or perform custom actions. The tools refer to the various capabilities or actions that the AI chat agent can use to gather information and perform tasks. The tools provide the agent with access to data and functionality beyond what is available in the underlying LLM.
The Tool Execution component is the process of invoking the selected tools and integrating their outputs to generate the final response. This allows the agent to go beyond the knowledge contained in the LLM and incorporate up-to-date information or perform domain-specific operations.
As tools are run, user input is used to find semantically relevant information using vector search or to query private data from sources such as Amazon Redshift using Domo Cloud Amplifier, a native integration with cross-cloud systems that unlocks data products at the speed businesses need them. Vector search is a technique used to find semantically relevant information from unstructured data sources, such as knowledge base articles or other documents. By creating vector embeddings of the content, the AI chat agent can efficiently search for and retrieve information that is most relevant to the user's query, even if the exact phrasing isn't present in the source material (a minimal sketch of this idea follows the list).
Information such as search or query results from Step 3 is returned to the AI chat agent, where either the agent solver component can aggregate results to formulate a final response, or, in the case of more complex queries, the agent can run another tool.
Each of the components in the solution uses the Domo AI Service Layer with Amazon Bedrock for planning and reasoning capabilities, converting user questions into SQL queries, or creating embeddings for vector search, and returning results as natural language answers to the user's questions, grounded in private customer data.
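The following is a minimal, framework-agnostic sketch of the idea behind vector search: embed the documents and the query, then rank documents by cosine similarity. The embedding step is omitted here and stands in for whatever embedding model the system uses:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding: np.ndarray, doc_embeddings: list[np.ndarray], top_k: int = 3) -> list[int]:
    # Rank documents by semantic closeness to the query and return the indices of the top matches
    scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]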

The following video of Domo.AI provides a more detailed overview of the product’s key features and capabilities.

Why Domo chose Amazon Bedrock
Domo chose Amazon Bedrock for the following benefits and features:

Model choice – Amazon Bedrock provides Domo with access to a wide range of models, including best-in-class options and those from various providers such as Anthropic, AI21 Labs, Cohere, Meta, and Stability AI. This variety allows Domo to extensively test their services using different models, enabling them to select the most suitable option for each specific use case. As a result, Domo can accelerate their development process and deliver value to their customers more rapidly by taking advantage of this flexibility in model selection and experimentation.
Security, compliance, and global infrastructure – Amazon Bedrock addresses crucial security and compliance concerns for Domo and their customers. With Amazon Bedrock, Domo makes sure that data remains within the AWS hosting environment, helping prevent model providers from accessing or training on customer data. With encryption in transit and at rest, along with restricted access to model deployment accounts, Amazon Bedrock provides robust data protection. Additionally, Domo has implemented multiple guardrails with varied control combinations to suit different applications and use cases. Amazon Bedrock offers a single API for inference, which facilitates secure communication between users and the FM. Additionally, the global infrastructure and compliance features of Amazon Bedrock enable Domo to scale and deploy their generative AI applications worldwide while adhering to data privacy laws and best practices.
Cost – By using Amazon Bedrock, Domo has achieved significant cost savings, reporting a 50% reduction compared to similar models from other providers. The serverless access to high-quality LLMs eliminates the need for substantial upfront infrastructure investments typically associated with LLMs. This cost-effective approach allows Domo to experiment with and test various models without incurring the hefty expenses usually linked to LLM implementation and maintenance, thereby optimizing their resource allocation and improving overall operational efficiency.

In the following video, Joe Clark, Software Architect at Domo, shares how AWS has been instrumental for Domo in the generative AI space.

Getting started with Amazon Bedrock
With Amazon Bedrock, teams and individuals can immediately start using FMs without having to worry about provisioning infrastructure or setting up and configuring ML frameworks.
Before you get started, verify that your user or role has permission to create or modify Amazon Bedrock resources. For details, see Identity-based policy examples for Amazon Bedrock.
To access the models in Amazon Bedrock, on the Amazon Bedrock console, choose Model access in the navigation pane. Review the EULA and enable the FMs you’d like in your account.
You can start interacting with the FMs through the following methods:

Directly on the Amazon Bedrock console using the Amazon Bedrock playgrounds
Programmatically using the Amazon Bedrock API and SDKs (a minimal example follows this list)
From the command line using the AWS CLI
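For example, a minimal programmatic call with the AWS SDK for Python (boto3) using the Converse API might look like the following; the model ID and prompt are illustrative, so use any model you have enabled access to:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    messages=[
        {"role": "user", "content": [{"text": "Summarize last quarter's sales trends in two sentences."}]}
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])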

Conclusion
Amazon Bedrock has been instrumental in enhancing data insights and visualization capabilities at Domo through generative AI. By providing flexibility in FM selection, secure access, and a fully managed experience, Amazon Bedrock has enabled Domo to deliver more value to their customers while reducing costs. The service’s security and compliance features have also allowed Domo to serve customers in highly regulated industries. By using Amazon Bedrock, Domo has seen a 50% reduction in cost compared to a similarly performing model from another provider.
If you’re ready to start building your own FM innovation with Amazon Bedrock, refer to Getting started with Amazon Bedrock. To learn more about other intriguing Amazon Bedrock applications, see the Amazon Bedrock section of the AWS Machine Learning Blog.

About the Authors
Joe Clark is a software architect on the Domo Labs team and the lead architect for Domo's AI Service Layer, AI Chat, and Model Management. At Domo, Joe has also led development of features including Jupyter Workspaces, Sandbox, and Code Engine. With 15 years of professional software development experience, he previously worked on IoT and smart city initiatives.
Aman Tiwari is a General Solutions Architect working with independent software vendors in the data and generative AI vertical at AWS. He helps them design innovative, resilient, and cost-effective solutions using AWS services. He holds a master’s degree in Telecommunications Networks from Northeastern University. Outside of work, he enjoys playing lawn tennis and reading books.
Sindhu Jambunathan is a Senior Solutions Architect at AWS, specializing in supporting ISV customers in the data and generative AI vertical to build scalable, reliable, secure, and cost-effective solutions on AWS. With over 13 years of industry experience, she joined AWS in May 2021 after a successful tenure as a Senior Software Engineer at Microsoft. Sindhu’s diverse background includes engineering roles at Qualcomm and Rockwell Collins, complemented by a Master’s of Science in Computer Engineering from the University of Florida. Her technical expertise is balanced by a passion for culinary exploration, travel, and outdoor activities.
Mohammad Tahsin is an AI/ML Specialist Solutions Architect at Amazon Web Services. He lives for staying up to date with the latest technologies in AI/ML and helping guide customers to deploy bespoke solutions on AWS. Outside of work, he loves all things gaming, digital art, and cooking.

CogVLM2: Advancing Multimodal Visual Language Models for Enhanced Imag …

Large Language Models (LLMs), initially limited to text-based processing, faced significant challenges in comprehending visual data. This limitation led to the development of Visual Language Models (VLMs), which integrate visual understanding with language processing. Early models like VisualGLM, built on architectures such as BLIP-2 and ChatGLM-6B, represented initial efforts in multi-modal integration. However, these models often relied on shallow alignment techniques, restricting the depth of visual and linguistic integration, thereby highlighting the need for more advanced approaches.

Subsequent advancements in VLM architecture, exemplified by models like CogVLM, focused on achieving a deeper fusion of vision and language features, thereby enhancing natural language performance. The development of specialized datasets, such as the Synthetic OCR Dataset, played a crucial role in improving models’ OCR capabilities, enabling broader applications in document analysis, GUI comprehension, and video understanding. These innovations have significantly expanded the potential of LLMs, driving the evolution of visual language models.

This research paper from Zhipu AI and Tsinghua University introduces the CogVLM2 family, a new generation of visual language models designed for enhanced image and video understanding, including models such as CogVLM2, CogVLM2-Video, and GLM-4V. Advancements include a higher-resolution architecture for fine-grained image recognition, exploration of broader modalities like visual grounding and GUI agents, and innovative techniques like post-downsample for efficient image processing. The paper also emphasizes the commitment to open-sourcing these models, providing valuable resources for further research and development in visual language models. 

The CogVLM2 family integrates architectural innovations, including the Visual Expert and high-resolution cross-modules, to enhance the fusion of visual and linguistic features. The training process for CogVLM2-Video involves two stages: Instruction Tuning, using detailed caption data and question-answering datasets with a learning rate of 4e-6, and Temporal Grounding Tuning on the TQA Dataset with a learning rate of 1e-6. Video input processing employs 24 sequential frames, with a convolution layer added to the Vision Transformer model for efficient video feature compression.

CogVLM2’s methodology utilizes substantial datasets, including 330,000 video samples and an in-house video QA dataset, to enhance temporal understanding. The evaluation pipeline involves generating and evaluating video captions using GPT-4o to filter videos based on scene content changes. Two model variants, cogvlm2-video-llama3-base, and cogvlm2-video-llama3-chat, serve different application scenarios, with the latter fine-tuned for enhanced temporal grounding. The training process occurs on an 8-node NVIDIA A100 cluster, completed in approximately 8 hours.

CogVLM2, particularly the CogVLM2-Video model, achieves state-of-the-art performance across multiple video question-answering tasks, excelling in benchmarks like MVBench and VideoChatGPT-Bench. The models also outperform existing models, including larger ones, in image-related tasks, with notable success in OCR comprehension, chart and diagram understanding, and general question-answering. Comprehensive evaluation reveals the models’ versatility in tasks such as video generation and summarization, establishing CogVLM2 as a new standard for visual language models in both image and video understanding.

In conclusion, the CogVLM2 family marks a significant advancement in integrating visual and language modalities, addressing the limitations of traditional text-only models. The development of models capable of interpreting and generating content from images and videos broadens their application in fields such as document analysis, GUI comprehension, and video grounding. Architectural innovations, including the Visual Expert and high-resolution cross-modules, enhance performance in complex visual-language tasks. The CogVLM2 series sets a new benchmark for open-source visual language models, with detailed methodologies for dataset generation supporting its robust capabilities and future research opportunities.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Top Large Language Models (LLMs): A Comprehensive Ranking of AI Giants …

The competition to develop the most advanced Large Language Models (LLMs) has seen major advancements, with the four AI giants, OpenAI, Meta, Anthropic, and Google DeepMind, at the forefront. These LLMs are reshaping industries and significantly impacting the AI-powered applications we use daily, such as virtual assistants, customer support chatbots, and translation services. As competition heats up, these models are constantly evolving, becoming more efficient and capable in various domains, including multitask reasoning, coding, mathematical problem-solving, and performance in real-time applications.

The Rise of Large Language Models

LLMs are built using vast amounts of data and intricate neural networks, allowing them to understand and generate human-like text accurately. These models are the pillar for generative AI applications that range from simple text completion to more complex problem-solving, like generating high-quality programming code or even performing mathematical calculations.

As the demand for AI applications grows, so does the pressure on tech giants to produce more accurate, versatile, and efficient LLMs. In 2024, some of the most critical benchmarks for evaluating these models include Multitask Reasoning (MMLU), coding accuracy (HumanEval), mathematical proficiency (MATH), and latency (TTFT, or time to first token). Cost-efficiency and token context windows are also becoming critical as more companies seek scalable AI solutions.

Best in Multitask Reasoning (MMLU)

The MMLU (Massive Multitask Language Understanding) benchmark is a comprehensive test that evaluates an AI model’s ability to answer questions from various subjects, including science, humanities, and mathematics. The top performers in this category demonstrate the versatility required to handle diverse real-world tasks.

GPT-4o is the leader in multitask reasoning, with an impressive score of 88.7%. Built by OpenAI, it builds on the strengths of its predecessor, GPT-4, and is designed for general-purpose tasks, making it a versatile model for academic and professional applications.

Llama 3.1 405b, the next iteration of Meta’s Llama series, follows closely behind with 88.6%. Known for its lightweight architecture, Llama 3.1 is engineered to perform efficiently while maintaining competitive accuracy across various domains.

Claude 3.5 Sonnet from Anthropic rounds out the top three with 88.3%, proving its capabilities in natural language understanding and reinforcing its presence as a model designed with safety and ethical considerations at its core.

Best in Coding (HumanEval)

As programming continues to play a vital role in automation, AI’s ability to assist developers in writing correct and efficient code is more important than ever. The HumanEval benchmark evaluates a model’s ability to generate accurate code across multiple programming tasks.

Claude 3.5 Sonnet takes the crown here with a 92% accuracy rate, solidifying its reputation as a strong tool for developers looking to streamline their coding workflows. Claude's emphasis on generating ethical and robust solutions has made it particularly appealing in safety-critical environments, such as healthcare and finance.

Although GPT-4o is slightly behind in the coding race with 90.2%, it remains a strong contender, particularly with its ability to handle large-scale enterprise applications. Its coding capabilities are well-rounded, and it continues to support various programming languages and frameworks.

Llama 3.1 405b scores 89%, making it a reliable option for developers seeking cost-efficient models for real-time code generation tasks. Meta’s focus on improving code efficiency and minimizing latency has contributed to Llama’s steady rise in this category.

Best in Math (MATH)

The MATH benchmark tests an LLM’s ability to solve complex mathematical problems and understand numerical concepts. This skill is critical for finance, engineering, and scientific research applications.

GPT-4o again leads the pack with a 76.6% score, showcasing its mathematical prowess. OpenAI’s continuous updates have improved its ability to solve advanced mathematical equations and handle abstract numerical reasoning, making it the go-to model for industries that rely on precision.

Llama 3.1 405b comes in second with 73.8%, demonstrating its potential as a more lightweight yet effective alternative for mathematics-heavy industries. Meta has invested heavily in optimizing its architecture to perform well in tasks requiring logical deduction and numerical accuracy.

GPT-Turbo, another variant from OpenAI’s GPT family, holds its ground with a 72.6% score. While it may not be the top choice for solving the most complex math problems, it is still a solid option for those who need faster response times and cost-effective deployment.

Lowest Latency (TTFT)

Latency, which is how quickly a model generates a response, is critical for real-time applications like chatbots or virtual assistants. The Time to First Token (TTFT) benchmark measures the speed at which an AI model begins outputting a response after receiving a prompt.
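Conceptually, TTFT is measured by starting a timer when the prompt is submitted and stopping it when the first token arrives from a streaming response. The following minimal sketch uses a generic streaming generator; the stream_tokens function is a placeholder for whichever client or model API you use:

import time

def measure_ttft(stream_tokens, prompt: str) -> float:
    """Return seconds elapsed between sending the prompt and receiving the first streamed token."""
    start = time.perf_counter()
    for _ in stream_tokens(prompt):  # stream_tokens is a placeholder streaming generator
        return time.perf_counter() - start
    raise RuntimeError("The stream produced no tokens")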

Llama 3.1 8b excels with an incredible latency of 0.3 seconds, making it ideal for applications where response time is critical. This model is built to perform under pressure, ensuring minimal delay in real-time interactions.

GPT-3.5-T follows with a respectable 0.4 seconds, balancing speed and accuracy. It provides a competitive edge for developers who prioritize quick interactions without sacrificing too much comprehension or complexity.

Llama 3.1 70b also achieves a 0.4-second latency, making it a reliable option for large-scale deployments that require both speed and scalability. Meta’s investment in optimizing response times has paid off, particularly in customer-facing applications where milliseconds matter.

Cheapest Models

In the era of cost-conscious AI development, affordability is a key factor for enterprises looking to integrate LLMs into their operations. The models below offer some of the most competitive pricing in the market.

Llama 3.1 8b tops the affordability chart with a usage cost of $0.05 (input) / $0.08 (output), making it a lucrative option for small businesses and startups looking for high-performance AI at a fraction of the cost of other models.

Gemini 1.5 Flash is close behind, offering $0.07 (input) / $0.3 (output) rates. Known for its large context window (as we’ll explore further), this model is designed for enterprises that require detailed analysis and larger data processing capacities at a lower cost.

GPT-4o-mini offers a reasonable alternative with $0.15 (input) / $0.6 (output), targeting enterprises that need the power of OpenAI’s GPT family without the hefty price tag.

Largest Context Window

The context window of an LLM defines the amount of text it can consider at once when generating a response. Models with larger context windows are crucial for long-form generation applications, such as legal document analysis, academic research, and customer service.

Gemini 1.5 Flash is the current leader with an astounding 1,000,000 tokens. This capability allows users to feed in entire books, research papers, or extensive customer service logs without breaking the context, offering unprecedented utility for large-scale text generation tasks.

Claude 3/3.5 comes in second, handling 200,000 tokens. Anthropic’s focus on maintaining coherence across long conversations or documents makes this model a powerful tool in industries that rely on continuous dialogue or legal document reviews.

GPT-4 Turbo + GPT-4o family can process 128,000 tokens, which is still a significant leap compared to earlier models. These models are tailored for applications that demand substantial context retention while maintaining high accuracy and relevance.

Factual Accuracy

Factual accuracy has become a critical metric as LLMs are increasingly used in knowledge-driven tasks like medical diagnosis, legal document summarization, and academic research. The accuracy with which an AI model recalls factual information without introducing hallucinations directly impacts its reliability.

Claude 3.5 Sonnet performs exceptionally well, with accuracy rates around 92.5% on fact-checking tests. Anthropic has emphasized building models that are efficient and grounded in verified information, which is key for ethical AI applications.

GPT-4o follows with an accuracy of 90%. OpenAI’s vast dataset helps ensure that GPT-4o pulls from up-to-date and reliable sources of information, making it particularly useful in research-heavy tasks.

Llama 3.1 405b achieves an 88.8% accuracy rate, thanks to Meta’s continued investment in refining the dataset and improving model grounding. However, it is known to struggle with less popular or niche subjects.

Truthfulness and Alignment

The truthfulness metric evaluates how well models align their output with known facts. Alignment ensures that models behave according to predefined ethical guidelines, avoiding harmful, biased, or toxic outputs.

Claude 3.5 Sonnet again shines with a 91% truthfulness score thanks to Anthropic’s unique alignment research. Claude is designed with safety protocols in mind, ensuring its responses are factual and aligned with ethical standards.

GPT-4o scores 89.5% in truthfulness, showing that it mostly provides high-quality answers but occasionally may hallucinate or give speculative responses when faced with insufficient context.

Llama 3.1 405b earns 87.7% in this area, performing well in general tasks but struggling when pushed to its limits in controversial or highly complex issues. Meta continues to enhance its alignment capabilities.

Safety and Robustness Against Adversarial Prompts

In addition to alignment, LLMs must resist adversarial prompts, inputs designed to make the model generate harmful, biased, or nonsensical outputs.

Claude 3.5 Sonnet ranks highest with a 93% safety score, making it highly resistant to adversarial attacks. Its robust guardrails help prevent the model from providing harmful or toxic outputs, making it suitable for sensitive use cases in sectors like education and healthcare.

GPT-4o trails slightly at 90%, maintaining strong defenses but showing some vulnerability to more sophisticated adversarial inputs.

Llama 3.1 405b scores 88%, a respectable performance, but the model has been reported to exhibit occasional biases when presented with complex, adversarially framed queries. Meta is likely to improve in this area as the model evolves.

Robustness in Multilingual Performance

As more industries operate globally, LLMs must perform well across multiple languages. Multilingual performance metrics assess a model’s ability to generate coherent, accurate, and context-aware responses in non-English languages.

GPT-4o is the leader in multilingual capabilities, scoring 92% on the XGLUE benchmark (a multilingual extension of GLUE). OpenAI’s fine-tuning across various languages, dialects, and regional contexts ensures that GPT-4o can effectively serve users worldwide.

Claude 3.5 Sonnet follows with 89%, optimized primarily for Western and major Asian languages. However, its performance dips slightly in low-resource languages, which Anthropic is working to address.

Llama 3.1 405b has an 86% score, demonstrating strong performance in widely spoken languages like Spanish, Mandarin, and French but struggling in dialects or less-documented languages.

Knowledge Retention and Long-Form Generation

As the demand for large-scale content generation grows, LLMs’ knowledge retention and long-form generation abilities are tested by writing research papers, legal documents, and long conversations with continuous context.

Claude 3.5 Sonnet takes the top spot with a 95% knowledge retention score. It excels in long-form generation, where maintaining continuity and coherence over extended text is crucial. Its high token capacity (200,000 tokens) enables it to generate high-quality long-form content without losing context.

GPT-4o follows closely with 92%, performing exceptionally well when producing research papers or technical documentation. However, its context window (128,000 tokens), smaller than Claude’s, means it occasionally struggles with very large input texts.

Gemini 1.5 Flash performs admirably in knowledge retention, with a 91% score. It particularly benefits from its staggering 1,000,000 token capacity, making it ideal for tasks where extensive documents or large datasets must be analyzed in a single pass.

Zero-Shot and Few-Shot Learning

In real-world scenarios, LLMs are often tasked with generating responses without explicitly training on similar tasks (zero-shot) or with limited task-specific examples (few-shot).

GPT-4o remains the best performer in zero-shot learning, with an accuracy of 88.5%. OpenAI has optimized GPT-4o for general-purpose tasks, making it highly versatile across domains without additional fine-tuning.

Claude 3.5 Sonnet scores 86% in zero-shot learning, demonstrating its capacity to generalize well across a wide range of unseen tasks. However, it slightly lags in specific technical domains compared to GPT-4o.

Llama 3.1 405b achieves 84%, offering strong generalization abilities, though it sometimes struggles in few-shot scenarios, particularly in niche or highly specialized tasks.

Ethical Considerations and Bias Reduction

The ethical considerations of LLMs, particularly in minimizing bias and avoiding toxic outputs, are becoming increasingly important.

Claude 3.5 Sonnet is widely regarded as the most ethically aligned LLM, with a 93% score in bias reduction and safety against toxic outputs. Anthropic’s continuous focus on ethical AI has resulted in a model that performs well and adheres to ethical standards, reducing the likelihood of biased or harmful content.

GPT-4o has a 91% score, maintaining high ethical standards and ensuring its outputs are safe for a wide range of audiences, although some marginal biases still exist in certain scenarios.

Llama 3.1 405b scores 89%, showing substantial progress in bias reduction but still trailing behind Claude and GPT-4o. Meta continues to refine its bias mitigation techniques, particularly for sensitive topics.

Conclusion

Comparing the models across these metrics makes it clear that competition among the top LLMs is fierce and that each model excels in different areas. Claude 3.5 Sonnet leads in coding, safety, and long-form content generation, while GPT-4o remains the top choice for multitask reasoning, mathematical prowess, and multilingual performance. Llama 3.1 405b from Meta continues to impress with its cost-effectiveness, speed, and versatility, making it a solid choice for those looking to deploy AI solutions at scale without breaking the bank.

This AI Paper from Apple Introduces AdEMAMix: A Novel Optimization App …

Machine learning has made significant advancements, particularly through deep learning techniques. These advancements rely heavily on optimization algorithms to train large-scale models for various tasks, including language processing and image classification. At the core of this process lies the challenge of minimizing complex, non-convex loss functions. Optimization algorithms like Stochastic Gradient Descent (SGD) & its adaptive variants have become critical to this endeavor. Such methods aim to iteratively adjust model parameters to minimize errors during training, ensuring that models can generalize well on unseen data. However, while these optimization techniques have proven useful, there remains significant room for improvement in how they handle long-term gradient information.

A fundamental challenge in training large neural networks is the effective use of gradients, which provide the necessary updates for optimizing model parameters. Traditional optimizers like Adam and AdamW rely heavily on an Exponential Moving Average (EMA) of recent gradients, emphasizing the most current gradient information while discarding older gradients. This approach works well for models where recent changes hold more importance. However, this can be problematic for larger models and long training cycles, as older gradients often still contain valuable information. As a result, the optimization process may be less efficient, requiring longer training periods or failing to reach the best possible solutions.

In current optimization methods, particularly Adam and AdamW, using a single EMA for past gradients can limit the optimizer’s ability to capture a full spectrum of gradient history. These methods can adapt quickly to recent changes but often discard valuable information carried by older gradients. Researchers have explored several approaches to address this limitation, yet many optimizers still struggle to balance recent and past gradients effectively. This shortcoming can result in suboptimal convergence rates and poorer model performance, especially in large-scale training scenarios like language models or vision transformers.

Researchers from Apple and EPFL introduced a new approach to this problem with the AdEMAMix optimizer. Their method extends the traditional Adam optimizer by incorporating a mixture of two EMAs, one fast-changing and one slow-changing. This approach allows the optimizer to balance the need to respond to recent updates while retaining valuable older gradients often discarded by existing optimizers. This dual-EMA system, unique to AdEMAMix, enables more efficient training of large-scale models, reducing the total number of tokens needed for training while achieving comparable or better results.

The AdEMAMix optimizer introduces a second EMA to capture older gradients without losing the reactivity provided by the original EMA. Specifically, AdEMAMix maintains a fast-moving EMA that prioritizes recent gradients while tracking a slower-moving EMA that retains information from much earlier in the training process. For example, when training a 1.3 billion-parameter language model on the RedPajama dataset, the researchers found that AdEMAMix could match the performance of an AdamW model trained on 197 billion tokens using only 101 billion tokens, roughly half the token budget. This efficiency gain translates into faster convergence and often better minima, allowing models to reach superior performance with fewer computational resources.
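The update rule can be illustrated with a short NumPy sketch. This is a simplified reading of the dual-EMA idea described above, not the authors’ reference implementation; hyperparameter values such as `alpha` and the slow decay rate `beta3` are placeholders.

```python
import numpy as np

def ademamix_step(theta, grad, state, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One parameter update mixing a fast EMA (beta1) with a slow EMA (beta3).

    Illustrative sketch of the dual-EMA idea, not the paper's reference code;
    hyperparameter values are placeholders.
    """
    b1, b2, b3 = betas
    state["t"] += 1
    t = state["t"]
    state["m1"] = b1 * state["m1"] + (1 - b1) * grad        # fast EMA of gradients
    state["m2"] = b3 * state["m2"] + (1 - b3) * grad        # slow EMA with long memory
    state["v"]  = b2 * state["v"]  + (1 - b2) * grad**2     # second-moment EMA
    m1_hat = state["m1"] / (1 - b1**t)                      # bias correction (fast EMA only)
    v_hat  = state["v"]  / (1 - b2**t)
    update = (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)
    return theta - lr * (update + weight_decay * theta)     # decoupled weight decay

# Usage: keep one state dict per parameter tensor
state = {"m1": np.zeros(3), "m2": np.zeros(3), "v": np.zeros(3), "t": 0}
theta = np.ones(3)
theta = ademamix_step(theta, grad=np.array([0.1, -0.2, 0.05]), state=state)
```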

Performance evaluations of AdEMAMix have demonstrated substantial improvements in speed and accuracy over existing optimizers. In one key experiment, a 110 million-parameter model trained with AdEMAMix reached similar loss values as an AdamW model that required nearly twice the number of training iterations. Specifically, the AdEMAMix model, trained for 256,000 iterations, achieved the same results as an AdamW model trained for 500,000 iterations. For even larger models, such as the 1.3 billion-parameter language model, AdEMAMix delivered comparable results to an AdamW model trained for 1.5 million iterations but with 51% fewer tokens. The optimizer also demonstrated a slower rate of forgetting, which is a critical advantage in maintaining model accuracy over long training cycles.

The researchers also addressed some common challenges optimizers face, such as early training instabilities. To overcome these, they introduced warmup schedules for the slower of the two EMAs, progressively increasing its contribution over the course of training. This gradual ramp-up helps stabilize the model during the initial training phase, preventing the optimizer from prematurely relying too heavily on outdated gradients. By carefully scheduling the adjustments for the two EMAs, AdEMAMix ensures that the optimization process remains stable and efficient throughout training, even for very large models.
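A minimal stand-in for such a schedule is a linear warmup of the slow EMA’s mixing coefficient. The paper’s actual schedulers are more involved, so the function below is only illustrative, and the numbers in the example are placeholders.

```python
def linear_warmup(step: int, warmup_steps: int, final_value: float, start_value: float = 0.0) -> float:
    """Simple linear warmup, a stand-in for the schedules described above.

    The paper ramps up the slow EMA's influence over early training; its exact
    schedules differ from this sketch.
    """
    if step >= warmup_steps:
        return final_value
    frac = step / max(1, warmup_steps)
    return start_value + frac * (final_value - start_value)

# e.g. ramp the mixing coefficient from 0 to 5 over the first 100k steps (illustrative values)
alpha_t = linear_warmup(step=20_000, warmup_steps=100_000, final_value=5.0)
```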

In conclusion, the AdEMAMix optimizer presents a notable advancement in machine learning optimization. By incorporating two EMAs to leverage both recent and older gradients, it addresses a key limitation of traditional optimizers like Adam and AdamW. This dual-EMA approach allows models to achieve faster convergence with fewer tokens, reducing the computational burden of training large models; in trials, AdEMAMix consistently outperformed AdamW, demonstrating its potential to improve performance in language modeling and image classification tasks. The method’s ability to reduce model forgetting during training further underscores its value for large-scale, long-term ML projects, making it a powerful tool for researchers and industry.


DriveGenVLM: Advancing Autonomous Driving with Generated Videos and Vi …

Integrating advanced predictive models into autonomous driving systems has become crucial for enhancing safety and efficiency. Camera-based video prediction is a pivotal component, offering rich real-world data. AI-generated content is currently a leading area of study in computer vision, yet generating photo-realistic and coherent videos poses significant challenges due to limited memory and computation time. Moreover, predicting video from a front-facing camera is critical for advanced driver-assistance systems in autonomous vehicles.

Existing approaches include diffusion-based architectures, which have become popular for generating images and videos and perform well in tasks such as image generation, editing, and translation. Other methods, such as Generative Adversarial Networks (GANs), flow-based models, auto-regressive models, and Variational Autoencoders (VAEs), have also been used for video generation and prediction. Denoising Diffusion Probabilistic Models (DDPMs) outperform these traditional generative models, but generating long videos remains computationally demanding. Although autoregressive models like Phenaki tackle this issue, they often face challenges with unrealistic scene transitions and inconsistencies in longer sequences.

A team of researchers from Columbia University in New York has proposed the DriveGenVLM framework to generate driving videos and uses Vision Language Models (VLMs) to understand them. The framework utilizes a video generation approach based on denoising diffusion probabilistic models (DDPM) to predict real-world video sequences. A pre-trained model called Efficient In-context Learning on Egocentric Videos (EILEV) is utilized to evaluate the adequacy of generated videos for VLMs. EILEV also provides narrations for these generated videos, potentially enhancing traffic scene understanding, aiding navigation, and improving planning capabilities in autonomous driving.

The DriveGenVLM framework is validated using the Waymo Open Dataset, which provides diverse real-world driving scenarios from multiple cities. The dataset is split into 108 training videos, divided equally among the three cameras, and 30 testing videos (10 per camera). The framework uses the Frechet Video Distance (FVD) metric to evaluate the quality of generated videos; FVD measures the similarity between the distributions of generated and real videos. This metric is valuable for evaluating temporal coherence and visual quality, making it an effective tool for benchmarking video synthesis models in tasks such as video generation and future frame prediction.
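FVD is the Fréchet distance between Gaussian fits of video features, typically extracted with a pretrained I3D network. Assuming the features have already been extracted (the extraction step is not shown here), the distance itself reduces to a few lines:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two feature sets.

    real_feats / gen_feats: arrays of shape (num_videos, feature_dim), e.g.
    embeddings from a pretrained I3D video network (extracted elsewhere).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```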

The results for the DriveGenVLM framework on the Waymo Open Dataset across the three cameras reveal that the adaptive hierarchy-2 sampling method outperforms the other sampling schemes, yielding the lowest FVD scores. Prediction videos are generated for each camera using this sampling method, with each example conditioned on the first 40 frames and compared against the corresponding ground-truth frames. Moreover, the flexible diffusion model’s training on the Waymo dataset shows its capacity for generating coherent and photorealistic videos. However, it still faces challenges in accurately interpreting complex real-world driving scenarios, such as navigating traffic and pedestrians.

In conclusion, researchers from Columbia University have introduced the DriveGenVLM framework to generate driving videos. The DDPM trained on the Waymo dataset is proficient at generating coherent and lifelike images from front and side cameras. Moreover, the pre-trained EILEV model is used to generate action narrations for the videos. The DriveGenVLM framework highlights the potential of integrating generative models and VLMs for autonomous driving tasks. In the future, the generated descriptions of driving scenarios could be fed into large language models to offer driver assistance or support language-model-based algorithms.


IBM Research Open-Sources Docling: An AI Tool for High-Precision PDF D …

Document conversion, particularly from PDF to machine-processable formats, has long presented significant challenges due to PDF files’ diverse and often complex nature. These documents, widely used across various industries, frequently lack standardization, resulting in a loss of structural features when they are optimized for printing. This structural loss complicates the recovery process, as important elements such as tables, figures, and reading order can be misinterpreted or completely lost. As businesses and researchers increasingly rely on digital documents, the need for efficient and accurate conversion tools has become crucial. The advent of advanced AI-driven tools has provided a promising solution to these challenges, enabling better understanding, processing, and extracting content from complex documents.

A critical issue in document conversion is the reliable extraction of content from PDFs while preserving the document’s structural integrity. Traditional methods often falter due to the wide variability in PDF formats, leading to problems such as inaccurate table reconstruction, misplaced text, and lost metadata. This problem is technical and practical, as document conversion accuracy directly impacts downstream tasks such as data analysis, search functionality, and information retrieval. Given the growing reliance on digital documents for academic and industrial purposes, ensuring the fidelity of converted content is essential. The problem lies in developing tools that can handle these tasks with the precision required by modern applications, particularly when dealing with large-scale document collections.

Current tools for PDF conversion, both commercial and open-source, often fail to meet the necessary standards of performance and accuracy. Many existing solutions are limited by their dependence on proprietary algorithms and restrictive licenses, which hinder their adaptability and widespread use. Even popular methods struggle with specific tasks, such as accurate table recognition and layout analysis, critical components of high-quality document conversion. For instance, tools like PyPDFium and PyMuPDF have been noted for their shortcomings in processing complex document layouts, resulting in merged text cells or distorted table structures. The lack of an open-source, high-performance solution that can be easily extended and adapted has left a significant gap in the market, particularly for organizations that require reliable tools for large-scale document processing.

The AI4K Group at IBM Research introduced Docling, an open-source package designed specifically for PDF document conversion. Docling distinguishes itself by leveraging specialized AI models for layout analysis and table structure recognition. These models, including DocLayNet and TableFormer, have been trained on extensive datasets and can handle many document types and formats. Docling is efficient, running on commodity hardware, and versatile, offering configurations for batch processing and interactive use. The tool’s ability to operate with minimal resources while delivering high-quality results makes it an attractive option for academic researchers and commercial enterprises. By bridging the gap between commercial software and open-source tools, Docling provides a robust and adaptable solution for document conversion.

The core of Docling’s functionality lies in its processing pipeline, which operates through a series of linear steps to ensure accurate document conversion. Initially, the tool parses the PDF document, extracting text tokens and their geometric coordinates. This is followed by applying AI models that analyze the document’s layout, identify elements such as tables and figures, and reconstruct the original structure with high fidelity. For instance, Docling’s TableFormer model recognizes complex table structures, including those with partial or no borderlines, spanning multiple rows or columns, or containing empty cells. The results of these analyses are then aggregated and post-processed to enhance metadata, determine the document’s language, and correct reading order. This comprehensive approach ensures that the converted document retains its original integrity, whether it is output in JSON or Markdown format.
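In practice, this pipeline is exposed through a small Python API. The snippet below is a usage sketch; the class and method names reflect recent releases of the docling package and may differ from the version described in the report.

```python
# Minimal usage sketch; API names reflect recent docling releases and may vary across versions.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")        # parsing, layout analysis, TableFormer, post-processing
print(result.document.export_to_markdown())     # or export_to_dict() for a JSON-like structure
```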

Docling has demonstrated impressive capabilities across various hardware configurations. Tests conducted on a dataset of 225 pages revealed that Docling could process documents with sub-second latency per page on a single CPU. Specifically, on a MacBook Pro M3 Max with 16 cores, Docling processed 92 pages in just 103 seconds using 16 threads, achieving a throughput of 2.45 pages per second. Even on older hardware, such as an Intel Xeon E5-2690, Docling maintained respectable performance, processing 143 pages in 239 seconds with 16 threads. These results highlight Docling’s ability to deliver fast and accurate document conversion, making it a practical choice for environments with varying resource constraints.

In conclusion, Docling provides a reliable method for converting complex PDF documents into machine-processable formats by combining advanced AI models with a flexible, open-source platform. Its ability to maintain high performance on standard hardware while ensuring the integrity of converted content makes it an invaluable tool for researchers and commercial users.


Snowflake AI Research Introduces Arctic-SnowCoder-1.3B: A New 1.3B Mod …

Machine learning models, especially those designed for code generation, heavily depend on high-quality data during pretraining. This field has seen rapid advancement, with large language models (LLMs) trained on extensive datasets containing code from various sources. The challenge for researchers is to ensure that the data used is abundant and of high quality, as this significantly impacts the model’s ability to handle complex tasks. In code-related applications, well-structured, annotated, and clean data ensures that models can generate accurate, efficient, and reliable outputs for real-world programming tasks.

A significant issue in code model development is the lack of precise definitions of “high-quality” data. While vast amounts of code data are available, much contains noise, redundancy, or irrelevant information, which can degrade model performance. Relying on raw data, even after filtering, often leads to inefficiencies. This problem becomes evident when models trained on large datasets underperform on practical benchmarks. To address this, there has been an increased focus on not just acquiring large amounts of data but curating data that aligns well with downstream applications, improving the model’s predictive abilities and overall utility.

Historically, the pretraining of code models involved scraping large repositories such as GitHub and processing raw data through basic filtering and deduplication techniques. Researchers would then apply random forest classifiers or simple quality filters to identify educationally valuable code, as seen in models like Phi-1. While these methods improved data quality to an extent, they were not enough to achieve optimal performance on more challenging coding tasks. Newer approaches have adopted more sophisticated tools, such as BERT-based annotators, to classify code quality and select data that would more effectively contribute to the model’s success.

The research team from Snowflake AI Research, University of Illinois at Urbana-Champaign, and Seoul National University introduced Arctic-SnowCoder-1.3B, a novel approach to pretraining code models by progressively refining data quality over three distinct phases. This method combined general pretraining, continued pretraining with high-quality data, and final pretraining with synthetic data. The researchers leveraged existing datasets, such as The Stack v1 and GitHub crawls, and artificial data generated using Llama-3.1-70B to build a smaller, more efficient model. This process focused on optimizing the data used in each phase to ensure that the model could outperform its competitors.

In the first phase, Arctic-SnowCoder was trained on 500 billion code tokens derived from raw data sources such as The Stack v1 and GitHub. This data underwent basic preprocessing steps, including filtering and deduplication, resulting in approximately 400 billion unique tokens. During this phase, the model was trained without advanced quality filters, and the data was grouped by programming language and repository. This approach ensured a broad code knowledge base but required further refinement. In the second phase, the research team selected 50 billion tokens from this initial dataset, focusing on high-quality data. A BERT-based quality annotator was employed to rank code files, and the top 12.5 billion tokens were repeated four times to train the model further. This phase significantly improved the data quality, as the annotator was specifically trained to select tokens aligned with the model’s downstream applications.
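The phase-two selection step can be pictured as ranking files by a quality score and keeping the best until a token budget is reached. The sketch below is a simplification; `score_fn` stands in for the BERT-based annotator, and the field names are illustrative rather than taken from the actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CodeFile:
    path: str
    text: str
    num_tokens: int

def select_top_quality(files: List[CodeFile],
                       score_fn: Callable[[str], float],
                       token_budget: int) -> List[CodeFile]:
    """Rank code files by a quality score and keep the best until the budget is hit.

    score_fn stands in for the BERT-based quality annotator described above
    (hypothetical here); token_budget would be on the order of 12.5B tokens in phase two.
    """
    ranked = sorted(files, key=lambda f: score_fn(f.text), reverse=True)
    selected, used = [], 0
    for f in ranked:
        if used + f.num_tokens > token_budget:
            break
        selected.append(f)
        used += f.num_tokens
    return selected
```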

The final phase involved enhanced pretraining with 5 billion synthetic tokens generated by Llama-3.1-70B. These tokens were created using the top-quality data from phase two as seeds, transforming lower-quality data into synthetic high-quality documents. This phase further refined the model’s ability to generate precise code by ensuring the training data was relevant and representative of real-world coding tasks. The result was a model that had undergone progressively more rigorous training, with each phase contributing to its enhanced performance.

The effectiveness of this approach is evident in Arctic-SnowCoder-1.3B’s results. Despite being trained on only 555 billion tokens, it significantly outperformed other models of similar size, such as Phi-1.5-1.3B and StarCoderBase-3B, which were trained on over 1 trillion tokens. On the BigCodeBench benchmark, which focuses on practical and challenging programming tasks, Arctic-SnowCoder exceeded the performance of Phi-1.5-1.3B by 36%. It surpassed StarCoder2-3B, trained on over 3 trillion tokens, on HumanEval+, achieving a score of 28.0 compared to StarCoder2-3B’s 27.4. Despite being trained on fewer tokens, the model’s ability to perform well highlights the importance of data quality over quantity.

In conclusion, Arctic-SnowCoder-1.3B illustrates the critical role of progressively refined, high-quality data in the pretraining of code models. By adopting a three-phase approach, the researchers enhanced the model’s performance significantly compared to larger models trained on far more tokens. This method demonstrates the importance of aligning pretraining data with downstream tasks and provides practical guidelines for future model development. Arctic-SnowCoder’s success is a testament to the value of high-quality data, showing that careful data curation and synthetic data generation can lead to substantial improvements in code generation models.


How Vidmob is using generative AI to transform its creative data lands …

This post was co-written with Mickey Alon from Vidmob.
Generative artificial intelligence (AI) can be vital for marketing because it enables the creation of personalized content and optimizes ad targeting with predictive analytics. Specifically, such data analysis can result in predicting trends and public sentiment while also personalizing customer journeys, ultimately leading to more effective marketing and driving business. For example, insights from creative data (advertising analytics) using campaign performance can not only uncover which creative works best but also help you understand the reasons behind its success.
In this post, we illustrate how Vidmob, a creative data company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to uncover meaningful insights at scale within creative data using Amazon Bedrock. The collaboration involved the following steps:

Use natural language to analyze and generate insights on performance data through different channels (such as TikTok, Meta, and Pinterest)
Generate research information for context such as the value proposition, competitive differentiators, and brand identity of a specific client

Vidmob background
Vidmob is the Creative Data company that uses creative analytics and scoring software to make creative and media decisions for marketers and agencies as they strive to drive business results through improved creative effectiveness. Vidmob’s influence lies in its partnerships and native integrations across the digital ad landscape, its dozens of proprietary models, and operating a reinforcement learning with human feedback (RLHF) model for creativity.
Vidmob’s AI journey
Vidmob uses AI to not only enhance its creative data capabilities, but also pioneer advancements in the field of RLHF for creativity. By seamlessly integrating AI models such as Amazon Rekognition into its innovative stack, Vidmob has continually evolved to stay at the forefront of the creative data landscape.
This journey extends beyond the mere adoption of AI; Vidmob has consistently recognized the importance of curating a differentiated dataset to maximize the potential of its AI-driven solutions. Understanding the intrinsic value of data network effects, Vidmob constructed a product and operational system architecture designed to be the industry’s most comprehensive RLHF solution for marketing creatives.
Use case overview
Vidmob aims to revolutionize its analytics landscape with generative AI. The central goal is to empower customers to directly query and analyze their creative performance data through a chat interface. Over the past 8 years, Vidmob has amassed a wealth of data that provides deep insights into the value of creatives in ad campaigns and strategies for enhancing performance. Vidmob envisions making it effortless for customers to utilize this data to generate insights and make informed decisions about their creative strategies.
Currently, Vidmob and its customers rely on creative strategists to address these questions at the brand level, complemented by machine-generated normative insights at the industry or environment level. This process can take creative strategists many hours. To enhance the customer experience, Vidmob decided to partner with AWS GenAIIC to deliver these insights more quickly and automatically.
Vidmob partnered with AWS GenAIIC to analyze ad data to help Vidmob creative strategists understand the performance of customer ads. Vidmob’s ad data consists of tags created from Amazon Rekognition and other internal models. The chatbot built by AWS GenAIIC would take in this tag data and retrieve insights.
The following were key success criteria for the collaboration:

Analyze and generate insights in a natural language based on performance data and other metadata
Generate client company information to be used as initial research for a creative
Create a scalable solution using Amazon Bedrock that can be integrated with Vidmob’s performance data

However, there were a few challenges in achieving these goals:

Large language models (LLMs) are limited in the volume of data they can analyze to generate insights without hallucination. They are designed to predict and summarize text-based information and are less optimized for computing creative data at a terabyte scale.
LLMs don’t have straightforward automatic evaluation techniques. Therefore, human evaluation was required for insights generated by the LLM.
There are 50–100 creative questions that creative strategists would normally analyze, which means an asynchronous mechanism was needed that would queue up these prompts, aggregate them, and provide the top-most meaningful insights.

Solution overview
The AWS team worked with Vidmob to build a serverless architecture for handling incoming questions from customers. They used the following services in the solution:

Amazon Bedrock
Amazon DynamoDB
AWS Lambda
Amazon Simple Storage Service (Amazon S3)

The following diagram illustrates the high-level workflow of the current solution:

The workflow consists of the following steps:

The user navigates to Vidmob and asks a creative-related query.
DynamoDB stores the query and the session ID, which are then passed to a Lambda function as a DynamoDB event notification.
The Lambda function calls Amazon Bedrock, obtains an output from the user query, and sends it back to the Streamlit application for the user to view.
The Lambda function updates the status after it receives the completed output from Amazon Bedrock. (A simplified handler sketch covering steps 2 to 4 appears below.)
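The following sketch illustrates what such a stream-triggered handler might look like. Table, attribute, and model identifiers are illustrative assumptions rather than Vidmob’s actual configuration, and the DynamoDB stream is assumed to deliver new images on insert.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
table = boto3.resource("dynamodb").Table("vidmob-queries")   # table and attribute names are illustrative

def handler(event, context):
    """Triggered by a DynamoDB stream event; simplified sketch of steps 2 to 4."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":                  # ignore the status updates we write back
            continue
        image = record["dynamodb"]["NewImage"]               # assumes the stream includes new images
        session_id, query = image["session_id"]["S"], image["query"]["S"]

        # Claude v2 on Bedrock uses the legacy text-completion request format
        body = json.dumps({
            "prompt": f"\n\nHuman: {query}\n\nAssistant:",
            "max_tokens_to_sample": 512,
        })
        response = bedrock.invoke_model(modelId="anthropic.claude-v2", body=body)
        completion = json.loads(response["body"].read())["completion"]

        # Step 4: persist the completed output so the UI can pick it up
        table.update_item(
            Key={"session_id": session_id},
            UpdateExpression="SET answer = :a, #s = :done",
            ExpressionAttributeNames={"#s": "status"},       # "status" is a DynamoDB reserved word
            ExpressionAttributeValues={":a": completion, ":done": "COMPLETE"},
        )
```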
In the following sections, we explore the details of the workflow, the dataset, and the results Vidmob achieved.

Workflow details
After the user inputs a query, a prompt is automatically created and fed into a QA chatbot, which returns a response. The main components of the LLM prompt include the following (a template sketch follows the list):

Client description – Background information about the client. This includes the value proposition, brand identity, and competitive differentiators, which are generated by Anthropic’s Claude v2 on Amazon Bedrock.
Aperture – Important aspects to take into account for a given question. For example, for a branding question such as “What is the best way to incorporate branding for my Meta creative,” the aperture might identify elements such as a logo, tagline, and sincere tone.
Context – The filtered dataset of ad performance referenced by the QA bot.
Question – The user query.
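Putting these pieces together, the prompt can be assembled with a simple template. The wording and variable names below are assumptions for illustration, not Vidmob’s production prompt.

```python
# Illustrative prompt assembly; section wording and variable names are assumptions.
def build_prompt(client_description: str, aperture: str, context_table: str, question: str) -> str:
    return (
        "You are a creative strategist analyzing ad performance data.\n\n"
        f"Client background:\n{client_description}\n\n"
        f"Aspects to consider for this question:\n{aperture}\n\n"
        f"Performance data (filtered):\n{context_table}\n\n"
        f"Question: {question}\n"
        "Answer with the elements that best address the question, why each matters, "
        "and its percent lift."
    )
```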

The following screenshot shows the UI where the user can input the client and their ad-related question.

On the backend, a router is used to determine the context (ad-related dataset) as a reference to answer the question. This depends on the question and the client, which is done in the following steps:

Determine whether the question should reference the objective dataset (general for an entire channel like TikTok, Meta, Pinterest) or placement dataset (specific sub-channels like Facebook Reels). For example, “What is the best way to incorporate branding in my Meta creative” is objective-based, whereas “What is the best way to incorporate branding for Facebook News Feed” is placement-based because it references a specific part of the Meta creative.
Obtain the corresponding objective dataset for the client if the query is objective-based. If it’s placement-based, first filter the placement dataset to only columns that are relevant to the query and then pass in the resulting dataset.
Pass the completed prompt to Anthropic’s Claude v2 model on Amazon Bedrock and display the outputs. (A simplified routing sketch follows this list.)
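A stripped-down version of this routing logic might look like the following; the keyword list and column names are assumptions for illustration, not Vidmob’s actual schema.

```python
import pandas as pd

# Sub-channel keywords and column names are illustrative, not Vidmob's actual schema.
PLACEMENT_KEYWORDS = ("news feed", "reels", "instream", "explore", "stories")

def build_context(question: str,
                  objective_df: pd.DataFrame,
                  placement_df: pd.DataFrame) -> pd.DataFrame:
    """Return the dataset a query should reference: channel-level or placement-level."""
    q = question.lower()
    mentioned = [k for k in PLACEMENT_KEYWORDS if k in q]
    if mentioned:
        # Placement-based question: keep only rows for the sub-channel(s) mentioned
        mask = placement_df["placement"].str.lower().str.contains("|".join(mentioned))
        return placement_df[mask]
    # Objective-based question: general statistics for the whole channel
    return objective_df
```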

The outputs are displayed as shown in the following screenshot.

Specifically, the outputs include the elements that best answer the question, why each element may be important, and its corresponding percent lift for the creative.
Dataset
The dataset includes a set of ad-related data corresponding to a specific client. Specifically, Vidmob analyzes the client ad campaigns and extracts information related to the ads using various machine learning (ML) models and AWS services. The information about each campaign is collated into a single dataset (creative data). It notes how each element of a given creative performs under a certain metric; for example, how the CTA affects the view-through rate of the ad. The following two datasets were utilized:

Creative strategist filtered performance data for each question – The dataset given was filtered by Vidmob creative strategists for their analysis. The filtered datasets include an element (such as logo or bright colors for a creative) as well as its corresponding average, percent lift (of a particular metric such as view-through rate), creative count, and impressions for each sub-channel (Facebook Explore, Reels, and so on).
Unfiltered raw datasets – This dataset included objective-based and placement-based data for each client.

As we discussed earlier, there are two types of datasets for a particular client: objective-based and placement-based data. Objective data is used for answering generic user queries about ads for channels such as TikTok, Meta, or Pinterest, whereas placement data is used for answering specific questions about ads for sub-channels within Meta such as Facebook Reels, Instream, and News Feed. Therefore, questions such as “What are creative insights in my Meta creative” are more general and therefore reference the objective data, and questions such as “What are insights for Facebook News Feed” reference the News Feed statistics in the placement data.
The objective dataset includes elements and their corresponding average percent lift, creative count, p-values, and many more for an entire channel, whereas placement data includes these same statistics for each sub-channel.
Results
A set of questions was evaluated by Vidmob’s strategists, primarily on the following metrics:

Accuracy – How closely the overall answer matches what the strategists expect
Relevancy – How relevant the LLM-generated output is to the question (and, in this case, to the client’s background information)
Clarity – How clear and understandable the outputs and insights derived from the performance data are, and whether the LLM is fabricating content

The client background information for the prompt and a set of questions for the filtered and unfiltered data were evaluated.
Overall, the client background generated by Anthropic’s Claude captured the value proposition, brand identity, and competitive differentiators for a given client. Accuracy and clarity were perfect across samples, and relevancy was perfect for most; here, “perfect” means a score of 9/10 or 10/10 on the given metric from subject matter experts.
When answering the set of questions, the responses generally had high clarity, and AWS GenAIIC was able to improve the QA chatbot’s accuracy and relevancy by 10% and 5%, respectively, by adding extra tag information used to filter the data. Overall, Vidmob expects the time to generate insights for creative campaigns to drop from hours to minutes.
Conclusion
In this post, we shared how the AWS GenAIIC team used Anthropic’s Claude on Amazon Bedrock to extract and summarize insights from Vidmob’s performance data using zero-shot prompt engineering. With these services, creative strategists were able to understand client information through inherent knowledge of the LLM as well as answer user queries through added client background information and tag types such as messaging and branding. Such insights can be retrieved at scale and utilized for enhancing effective ad campaigns.
The success of this engagement allowed Vidmob an opportunity to use generative AI to create more valuable insights for customers in reduced time, allowing for a more scalable solution.
This is just one of the ways AWS enables builders to deliver generative AI-based solutions. You can get started with Amazon Bedrock and see how it can be integrated in example code bases today. If you’re interested in working with the AWS Generative AI Innovation Center, reach out to AWS GenAIIC.

About the Authors
Mickey Alon is a serial entrepreneur and co-author of ‘Mastering Product-Led Growth.’ He co-founded Gainsight PX (Vista) and Insightera (Adobe), a real-time personalization engine. He previously led the global product development team at Marketo (Adobe) and currently serves as the CPTO at Vidmob, a leading creative intelligence platform powered by GenAI.
Suren Gunturu is a Data Scientist working in the Generative AI Innovation Center, where he works with various AWS customers to solve high-value business problems. He specializes in building ML pipelines using Large Language Models, primarily through Amazon Bedrock and other AWS Cloud services.
Gaurav Rele is a Senior Data Scientist at the Generative AI Innovation Center, where he works with AWS customers across different verticals to accelerate their use of generative AI and AWS Cloud services to solve their business challenges.
Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.