Gmail Bulk Sender Rules: Preparing for June Updates & Beyond

Editor’s note: This article was updated on May 20, 2024, to reflect the updated timeline from Gmail and Yahoo Mail regarding enforcement of 2024 policies.

In October 2023, Google and Yahoo announced their new bulk sender rules. 

These rules, which effectively drew a line in the sand on spam complaint rates, marked a big change: it was the first time a major email provider had issued such specific directives.

The new rules?

Mandatory digital email signing (domain authentication)

0.3% spam complaint rate threshold

Easy unsubscribe and prompt unsubscribe processing

The updates began rolling out on February 1, 2024, and came as a surprise to many, presenting new challenges for email marketers – especially outbound email marketers.

In this article, we’ll outline the Gmail and Yahoo Mail requirements, the implications and updates email marketers need to make, and how Customers.ai helps brands ensure peak deliverability.

Jump to the sections with the links below or read on:

How to Prevent Spam Penalties

Average Spam Complaint Ranges

April 1st Domain Authentication Requirements

June 1st Unsubscribe Requirements

How to Prevent Spam Penalties

First, one of the main reasons emails get marked as spam is that the recipient has never heard of the sender. 

Our inboxes are already so crowded – don’t bother me with something I’ve never even heard of, right?

That’s where Customers.ai comes in. 

We identify people who are already on your site. 

People who are engaging with high-intent pages.

People who haven’t received an email or ad previously.

These are the people you want to target. 

Forget the cold email lists of the past you had to buy. These are the people who are going to open your emails, engage, improve your email deliverability, and drive up revenue!

Second, Customers.ai has a built-in email deliverability test.

After you’ve authenticated your email sender through Customers.ai, get your email deliverability score.

We’ll report back to you how many of your emails are landing in the inbox vs. the Promotions tab vs. the spam filter.

We’ll also provide recommendations you can use to improve your email deliverability, pointing out if you are missing the newly required digital email signing and domain authentication.

Average Spam Complaint Ranges

As we dug into just how these new rules would impact outbound marketers, we found that a lot of people were going to be in real trouble.

The thing is, since the announcement of the new guidelines, we haven’t seen much chatter about them. 

Until now. 

It seems this is more of a slow rollout than a big change that would have an immediate impact, starting with domain authentication. 

April 1st Domain Authentication Requirements

According to Gmail’s group product manager, Neil Kumaran, all bulk senders* will be required to authenticate their email beginning April 1, 2024. 

What’s key here is the word required.

While email authentication should certainly be the norm by now, the fact that Google is requiring it shows how serious they are. 

They are so serious that they are flat-out going to reject emails that don’t meet the requirements!

From Forbes:

“Starting from April 1, Google will reject emails from bulk senders unless they meet new authentication requirements. This strict rule is aimed at reducing the amount of spam that lands in Gmail inboxes and enhancing the security of Gmail users. By implementing these new requirements, Google is aiming to prevent malicious actors from using unauthenticated or compromised domains to deliver their dangerous payloads and reduce unwanted spam.”

As you can see, there are real repercussions here for those who choose not to adhere to the Google guidelines. 

Sounds like it’s time to get your emails in line if they aren’t already.

*Note: The current bulk sender rules apply to senders of 5,000 or more emails per day to Gmail addresses.

Staying in Line with New Google Guidelines

Authenticating your email account is simple and honestly, the majority of ESPs have this capability built in. 

Here are links to a few of our partner sites, followed by a quick way to sanity-check your own DNS records:

How to Authenticate Your Email with Sendgrid

How to Verify a Domain with Sendlane

What is DMARC and How Do I Set it Up on Klaviyo?
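
If you want to double-check the records yourself, here is a minimal sketch using the dnspython package (pip install dnspython). The domain and the DKIM selector below are placeholders; your ESP’s documentation lists the selector it actually uses.

# Minimal sketch: look up the TXT records used for SPF, DKIM, and DMARC.
# "example.com" and the "s1" DKIM selector are placeholders.
import dns.resolver

domain = "example.com"
selector = "s1"

for label, name in [
    ("SPF", domain),
    ("DKIM", f"{selector}._domainkey.{domain}"),
    ("DMARC", f"_dmarc.{domain}"),
]:
    try:
        answers = dns.resolver.resolve(name, "TXT")
        records = [b"".join(r.strings).decode() for r in answers]
        print(f"{label}: {records}")
    except Exception as exc:
        print(f"{label}: no record found ({exc})")

If the SPF, DKIM, or DMARC lookups come back empty, start with your ESP’s authentication guide before sending another campaign.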

Along with ensuring your email is authenticated, there are other parts of the spam rules that still need to be adhered to, including unsubscribe links. 

In our original deep dive, we noted that clear unsubscribes were crucial. Make it easy for people to unsubscribe and they are less likely to mark you as spam.

June 1st Unsubscribe Requirements

And come June 1st, bulk senders not only need an unsubscribe link, they need a one-click unsubscribe option, with requests processed within 48 hours.
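
For reference, one-click unsubscribe (RFC 8058) is signaled with two message headers. Here is a minimal sketch of setting them with Python’s standard email library; the unsubscribe URLs and addresses are placeholders, and most ESPs add these headers for you.

# Minimal sketch: RFC 8058 one-click unsubscribe headers on an outgoing message.
# The URLs and addresses are placeholders; ESPs normally generate these per recipient.
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Our monthly newsletter"
msg["From"] = "news@example.com"
msg["To"] = "subscriber@example.org"
msg["List-Unsubscribe"] = "<https://example.com/unsubscribe?id=123>, <mailto:unsubscribe@example.com>"
msg["List-Unsubscribe-Post"] = "List-Unsubscribe=One-Click"
msg.set_content("Hello! Here is what's new this month.")

print(msg.as_string())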

As for the spam complaint rate threshold?

I think we are going to see more about that as these updates continue to roll out. 

In the meantime, make sure you are creating an outbound email marketing program that focuses on warm leads, personalized messaging, and best practices.

Ready to see how we can help?

Sign up for free and start identifying visitors who actually want to hear from you and use state-of-the-art email deliverability testing to ensure your emails reach their intended recipients.

Unlock High-Intent Leads Hiding on Your Site

Book a demo of Customers.ai’s U.S. website visitor identification, customer journey insights and remarketing platform to skyrocket conversions and sales.

Book a Demo

Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 500 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.


TRANSMI: A Machine Learning Framework to Create Baseline Models Adapted for Transliterated Data from Existing Multilingual Pretrained Language Models (mPLMs) without Any Training

The increasing availability of digital text in diverse languages and scripts presents a significant challenge for natural language processing (NLP). Multilingual pre-trained language models (mPLMs) often struggle to handle transliterated data effectively, leading to performance degradation. Addressing this issue is crucial for improving cross-lingual transfer learning and ensuring accurate NLP applications across various languages and scripts, which is essential for global communication and information processing.

Current methods, including models like XLM-R and Glot500, perform well with text in their original scripts but struggle significantly with transliterated text due to ambiguities and tokenization issues. These limitations degrade their performance in cross-lingual tasks, making them less effective when handling text converted into a common script such as Latin. The inability of these models to accurately interpret transliterations poses a significant barrier to their utility in multilingual settings. 

Researchers from the Center for Information and Language Processing, LMU Munich, and Munich Center for Machine Learning (MCML) introduced TRANSMI, a framework designed to enhance mPLMs for transliterated data without requiring additional training. TRANSMI modifies existing mPLMs using three merge modes—Min-Merge, Average-Merge, and Max-Merge—to incorporate transliterated subwords into their vocabularies, thereby addressing transliteration ambiguities and improving cross-lingual task performance.

TRANSMI integrates new subwords tailored for transliterated data into the mPLMs’ vocabularies, particularly excelling in the Max-Merge mode for high-resource languages. The framework is tested using datasets that include transliterated versions of texts in scripts such as Cyrillic, Arabic, and Devanagari, showing that TRANSMI-modified models outperform their original versions in various tasks like sentence retrieval, text classification, and sequence labeling. This modification ensures that models retain their original capabilities while adapting to the nuances of transliterated text, thus enhancing their overall performance in multilingual NLP applications.
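
To make the vocabulary-extension idea concrete, the following is a minimal sketch of the Max-Merge mode under the assumption of a HuggingFace-style mPLM; the transliteration pairs are invented for illustration and are not from the paper.

# Sketch of Max-Merge: add Latin-transliterated subwords to an mPLM vocabulary and
# initialize each new embedding as the element-wise max over the embeddings of its
# original-script counterparts. The token pairs here are illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"  # stand-in for any mPLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Hypothetical mapping: transliterated subword -> original-script form(s)
translit_map = {"privet": ["привет"], "namaste": ["नमस्ते"]}

new_tokens = [t for t in translit_map if t not in tokenizer.get_vocab()]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        src_ids = []
        for original in translit_map[tok]:
            src_ids.extend(tokenizer(original, add_special_tokens=False)["input_ids"])
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Max-Merge: element-wise maximum over the source subword embeddings
        emb[new_id] = torch.stack([emb[i] for i in src_ids]).max(dim=0).values

Average-Merge and Min-Merge would simply swap the maximum for a mean or minimum over the same stacked embeddings.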

The datasets used to validate TRANSMI span a variety of scripts, providing a comprehensive assessment of its effectiveness. For example, the FURINA model using Max-Merge mode shows significant improvements in sequence labeling tasks, demonstrating TRANSMI’s capability to handle phonetic scripts and mitigate issues arising from transliteration ambiguities. This approach ensures that mPLMs can process a wide range of languages more accurately, enhancing their utility in multilingual contexts.

The results indicate that TRANSMI-modified models achieve higher accuracy compared to their unmodified counterparts. For instance, the FURINA model with Max-Merge mode demonstrates notable performance improvements in sequence labeling tasks across different languages and scripts, showcasing clear gains in key performance metrics. These improvements highlight TRANSMI’s potential as an effective tool for enhancing multilingual NLP models, ensuring better handling of transliterated data and leading to more accurate cross-lingual processing.

In conclusion, TRANSMI addresses the critical challenge of improving mPLMs’ performance on transliterated data by modifying existing models without additional training. This framework enhances mPLMs’ ability to process transliterations, leading to significant improvements in cross-lingual tasks. TRANSMI offers a practical and innovative solution to a complex problem, providing a strong foundation for further advancements in multilingual NLP and improving global communication and information processing.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

CinePile: A Novel Dataset and Benchmark Specifically Designed for Authentic Long-Form Video Understanding

Video understanding is one of the evolving areas of research in artificial intelligence (AI), focusing on enabling machines to comprehend and analyze visual content. Tasks like recognizing objects, understanding human actions, and interpreting events within a video come under this domain. Advancements in this domain find crucial applications in autonomous driving, surveillance, and entertainment industries. Enhancing the ability of AI to process and understand videos, researchers aim to improve the performance and reliability of various technologies that rely on visual data.

The main challenge in video understanding lies in the complexity of interpreting dynamic and multi-faceted visual information. Traditional models struggle to accurately analyze temporal aspects, object interactions, and plot progression within scenes. These limitations hinder the development of robust systems capable of comprehensive video comprehension. Addressing this challenge requires innovative approaches that can manage the intricate details and vast amounts of data present in video content, pushing the boundaries of current AI capabilities.

Current methods for video understanding often rely on large multi-modal models that integrate visual and textual information. These models typically use annotated datasets where human-written questions and answers are generated based on specific scenes. However, these approaches are labor-intensive and prone to errors, making them less scalable and less reliable. Existing benchmarks, like MovieQA and TVQA, offer some insights but fail to cover the full spectrum of video understanding, particularly in handling complex interactions and events within scenes.

Researchers from the University of Maryland and Weizmann Institute of Science have introduced a novel approach called CinePile, which was developed by a team that included members from Gemini and other companies. This method leverages automated question template generation to create a large-scale, long-video understanding benchmark. The system integrates visual and textual data to generate detailed and diverse questions about movie scenes. CinePile aims to bridge the gap between human performance and current AI models by providing a comprehensive dataset that challenges the models’ understanding and reasoning capabilities.

CinePile uses a multi-stage process to curate its dataset. Initially, raw video clips are collected and annotated with scene descriptions. A binary classification model distinguishes between dialogue and visual descriptions. These annotations are then used to generate question templates through a language model, which are applied to the video scenes to create comprehensive question-answer pairs. The process involves shot detection algorithms to pick and annotate important frames using the Gemini Vision API. The concatenated text descriptions produce a visual summary of each scene. This summary then generates long-form questions and answers, focusing on various aspects like character dynamics, plot analysis, thematic exploration, and technical details.
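
As a toy illustration of the template-driven step described above, scene annotations can be slotted into category-specific question templates; the scene dictionary and templates below are invented for illustration, and in CinePile the answers and distractors are likewise generated from the annotated summaries.

# Toy sketch: fill question templates with scene annotations to form QA prompts.
scene = {
    "character": "the detective",
    "action": "hiding the letter inside the piano",
    "location": "the study",
    "theme": "deception",
}

templates = {
    "Character dynamics": "What is {character} doing in {location}, and why?",
    "Thematic exploration": "How does {action} reinforce the theme of {theme}?",
}

qa_prompts = [
    {"category": category, "question": template.format(**scene)}
    for category, template in templates.items()
]

for item in qa_prompts:
    print(item["category"], "->", item["question"])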

The CinePile benchmark features approximately 300,000 questions in the training set and about 5,000 in the test split. The evaluation of current video-centric models, both open-source and proprietary, showed that even state-of-the-art systems lag well behind human performance. For example, the models often fail to adhere strictly to instructions, producing verbose responses instead of concise answers. The researchers noted that open-source models like LLaVA 1.5-13B, OtterHD, mPLUG-Owl, and MiniGPT-4 showed high fidelity in image captioning but struggled with hallucinations and unnecessary text snippets. This highlights the complexity and challenges inherent in video understanding tasks and underscores the need for more sophisticated models and evaluation methods.

In conclusion, the research team addressed a critical gap in video understanding by developing CinePile. This innovative approach enhances the ability to generate diverse and contextually rich questions about videos, paving the way for more advanced and scalable video comprehension models. The work underscores the importance of integrating multi-modal data and automated processes in advancing AI capabilities in video analysis. CinePile sets a new standard for evaluating video-centric AI models by providing a robust benchmark, driving future research and development in this vital field.

Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.

ALPINE: Autoregressive Learning for Planning in Networks

Large Language Models (LLMs) such as ChatGPT have attracted a lot of attention since they can perform a wide range of activities, including language processing, knowledge extraction, reasoning, planning, coding, and tool use. These abilities have sparked research into creating even more sophisticated AI models and hint at the possibility of Artificial General Intelligence (AGI). 

The Transformer neural network architecture, on which LLMs are based, uses autoregressive learning to anticipate the word that will appear next in a series. This architecture’s success in carrying out a wide range of intelligent activities raises the fundamental question of why predicting the next word in a sequence leads to such high levels of intelligence.

Researchers have been looking at a variety of topics to gain a deeper understanding of the power of LLMs. In particular, the planning ability of LLMs has been studied in a recent work; planning is an important part of human intelligence that is engaged in tasks such as project organization, travel planning, and mathematical theorem proving. Researchers want to bridge the gap between basic next-word prediction and more sophisticated intelligent behaviors by comprehending how LLMs perform planning tasks.

In a recent study, a team of researchers presented the findings of Project ALPINE, which stands for “Autoregressive Learning for Planning In NEtworks.” The research dives into how the autoregressive learning mechanisms of Transformer-based language models enable the development of planning capabilities. The team’s goal is to identify any possible shortcomings in the planning capabilities of these models.

The team has defined planning as a network path-finding task to explore this. Creating a legitimate path from a given source node to a selected target node is the objective in this case. The results have demonstrated that Transformers, by embedding adjacency and reachability matrices within their weights, are capable of path-finding tasks.

The team has theoretically investigated Transformers’ gradient-based learning dynamics. According to this, Transformers are capable of learning both a condensed version of the reachability matrix and the adjacency matrix. Experiments were conducted to validate these theoretical ideas, demonstrating that Transformers may learn both an incomplete reachability matrix and an adjacency matrix. The team also used Blocksworld, a real-world planning benchmark, to apply this methodology. The outcomes supported the primary conclusions, indicating the applicability of the methodology.

The study has highlighted a potential drawback of Transformers in path-finding: an inability to recognize reachability relationships established through transitivity. In other words, when producing a complete path requires concatenating sub-paths, Transformers may fail to generate the correct route, because doing so requires an awareness of connections that span several intermediate nodes.
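
A small worked example makes the transitivity point concrete (the three-node graph below is a toy, not from the paper): with edges A→B and B→C, the adjacency matrix alone never marks C as reachable from A, while the full reachability matrix requires composing paths.

# Toy example: one-hop adjacency vs. transitive reachability on a three-node graph.
import numpy as np

nodes = ["A", "B", "C"]
adj = np.array([
    [0, 1, 0],  # A -> B
    [0, 0, 1],  # B -> C
    [0, 0, 0],
])

# Transitive closure via repeated boolean matrix products
reach = adj.copy()
for _ in range(len(nodes)):
    reach = ((reach + reach @ adj) > 0).astype(int)

print("adjacency says A -> C:", bool(adj[0, 2]))       # False
print("reachability says A -> C:", bool(reach[0, 2]))  # True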

The team has summarized their primary contributions as follows,

A theoretical analysis of how autoregressive learning enables Transformers to perform path-planning tasks has been conducted. 

Transformers’ capacity to extract adjacency and partial reachability information and produce legitimate pathways has been empirically validated.

The Transformers’ inability to fully understand transitive reachability interactions has been highlighted.

In conclusion, this research sheds light on the fundamental workings of autoregressive learning, which facilitates network design. This study expands on the knowledge of Transformer models’ general planning capacities and can help in the creation of more sophisticated AI systems that can handle challenging planning jobs across a range of industries.

Check out the Paper. All credit for this research goes to the researchers of this project.

This AI Paper Introduces Rational Transfer Function: Advancing Sequence Modeling with FFT Techniques

State-space models (SSMs) are crucial in deep learning for sequence modeling. They represent systems where the output depends on both current and past inputs. SSMs are widely applied in signal processing, control systems, and natural language processing. The main challenge is the inefficiency of existing SSMs, particularly regarding memory and computational costs. Traditional SSMs incur greater complexity and resource usage as the state grows, limiting their scalability and performance in large-scale applications.

Existing research includes frameworks like S4 and S4D, which utilize diagonal state-space representations to manage complexity. Fast Fourier Transform (FFT)–based methods are used for efficient sequence parallelism. Transformers revolutionized sequence modeling with self-attention mechanisms, while Hyena incorporates convolutional filters for long-range dependencies. Liquid-S4 and Mamba optimize sequence modeling through selective state spaces and memory management. The Long Range Arena benchmark is standard for evaluating models’ performance on long sequences. These advancements enhance the efficiency and capability of sequence modeling.

In a collaborative effort, researchers from Liquid AI, the University of Tokyo, RIKEN, Stanford University, and MIT have introduced the Rational Transfer Function (RTF) approach, which leverages transfer functions for efficient sequence modeling. This method stands out due to its state-free design, eliminating the need for memory-intensive state-space representations. By utilizing the FFT, the RTF approach achieves parallel inference, significantly improving computational speed and scalability.

The methodology employs FFT to compute the convolutional kernel’s spectrum, allowing for efficient parallel inference. The model was tested using the Long Range Arena (LRA) benchmark, which includes ListOps for mathematical expressions, IMDB for sentiment analysis, and Pathfinder for visuospatial tasks. Synthetic tasks like Copying and Delay were used to assess memorization capabilities. The RTF model was integrated into the Hyena framework, improving performance in language modeling tasks. The datasets included 96,000 training sequences for ListOps, 160,000 for IMDB, and 160,000 for Pathfinder, ensuring comprehensive evaluation across different sequence lengths and complexities.
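
To illustrate the state-free idea, here is a minimal sketch (with toy coefficients, not the trained parameters) of filtering a whole sequence with a rational transfer function by evaluating its frequency response and applying it via the FFT, so no recurrent state is materialized.

# Sketch: state-free filtering with a rational transfer function B(z)/A(z) via FFT.
import numpy as np

def rtf_filter(u, b, a):
    """Filter sequence u with H(z) = B(z^-1)/A(z^-1), computed in parallel via FFT."""
    n = 2 * len(u)                                   # zero-pad to avoid circular wrap-around
    z_inv = np.exp(-2j * np.pi * np.arange(n) / n)   # e^{-j omega k} sample points
    H = np.polyval(b[::-1], z_inv) / np.polyval(a[::-1], z_inv)  # frequency response
    y = np.fft.ifft(H * np.fft.fft(u, n))[: len(u)].real
    return y

u = np.random.randn(1024)          # toy input sequence
b = np.array([0.5, 0.2])           # numerator coefficients (toy)
a = np.array([1.0, -0.9])          # denominator coefficients (toy, stable pole)
print(rtf_filter(u, b, a)[:5])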

The RTF model demonstrated significant improvements in multiple benchmarks. On the Long Range Arena, it achieved a 35% faster training speed than S4 and S4D. For the IMDB sentiment analysis, RTF improved classification accuracy by 3%. In the ListOps task, it recorded a 2% increase in accuracy. The Pathfinder task saw a 4% accuracy improvement. Furthermore, in synthetic tasks like Copying and Delay, RTF showed better memorization capabilities, reducing error rates by 15% and 20%, respectively. These results highlight the model’s efficiency and effectiveness across diverse datasets.

To conclude, the research introduced the RTF approach for SSMs, addressing inefficiencies in traditional methods. By leveraging FFT for parallel inference, RTF significantly improved training speed and accuracy across various benchmarks, including Long Range Arena and synthetic tasks. The results demonstrate RTF’s capability to handle long-range dependencies efficiently. This advancement is crucial for scalable and effective sequence modeling, offering a robust solution for diverse deep learning and signal processing applications.

Check out the Paper. All credit for this research goes to the researchers of this project.

Enhancing Graph Classification with Edge-Node Attention-based Differentiable Pooling and Multi-Distance Graph Neural Networks (GNNs)

Graph Neural Networks (GNNs) are advanced tools for graph classification, leveraging neighborhood aggregation to update node representations iteratively. This process captures local and global graph structure, facilitating node classification and link prediction tasks. Effective graph pooling is essential for downsizing and learning representations, categorized into global and hierarchical pooling. Hierarchical methods, such as TopK-based and cluster-based strategies, aim to retain structural features but face challenges like potential information loss and over-smoothing. Recent approaches incorporate self-attention mechanisms to address these issues, though challenges like computational expense and edge importance remain.

Researchers from Beijing Normal University, Central University of Finance and Economics, Zhejiang Normal University, and the University of York have developed a new hierarchical pooling method for GNNs called Edge-Node Attention-based Differentiable Pooling (ENADPool). Unlike traditional methods, ENADPool uses hard clustering and attention mechanisms to compress node features and edge strengths, addressing issues with uniform aggregation. Additionally, they introduced a Multi-distance GNN (MD-GNN) model to reduce over-smoothing by allowing nodes to receive information from neighbors at various distances. ENADPool’s design eliminates the need for separate attention computations, improving efficiency. Experiments show that the MD-GNN combined with ENADPool effectively enhances graph classification performance.

The study reviews existing works related to GNNs, including graph convolutional networks, pooling operations, and attention mechanisms. GNNs, classified into spectral-based and spatial-based, excel in graph data analysis. Spectral methods, like ChebNet, use the Laplacian matrix, while spatial methods, like GraphSAGE, aggregate local node information. Both face over-smoothing issues, addressed by models like MixHop and N-GCN. For graph-level classification, pooling operations, categorized into global and hierarchical methods, are crucial. Hierarchical pooling, like DiffPool, clusters nodes but has limitations addressed by ABDPool, which uses attention mechanisms. Graph attention, used in GAT and GaAN, assigns weights to nodes based on their importance.

ENADPool is a cluster-based hierarchical pooling method that assigns nodes to unique clusters, calculates node importance using attention mechanisms, and compresses node features and edge connectivity for subsequent layers. It involves three steps: hard node assignment, node-based attention, and edge-based attention, resulting in weighted compressed node features and adjacency matrices. The MD-GNN model mitigates over-smoothing by aggregating node information from different distances and reconstructing graph topology to capture comprehensive structural details. This approach enhances the effectiveness of ENADPool and improves graph representation.
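
A minimal sketch of the weighted compression step, assuming a hard one-hot assignment matrix and toy attention weights (ENADPool learns these from the graph), looks like this:

# Sketch: attention-weighted hierarchical pooling with a hard cluster assignment.
import torch

N, F, C = 6, 4, 2                                      # nodes, feature dim, clusters
X = torch.randn(N, F)                                  # node features
A = (torch.rand(N, N) > 0.5).float()                   # toy adjacency matrix
S = torch.eye(C)[torch.tensor([0, 0, 0, 1, 1, 1])]     # hard one-hot assignment (N x C)
a = torch.softmax(torch.randn(N), dim=0).unsqueeze(1)  # node importance weights
B = torch.softmax(torch.randn(N, N), dim=1)            # edge importance weights

X_pooled = S.t() @ (a * X)      # weighted node-feature compression (C x F)
A_pooled = S.t() @ (B * A) @ S  # weighted edge-connectivity compression (C x C)
print(X_pooled.shape, A_pooled.shape)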

The study compares the ENADPool and MD-GNN model against other graph deep learning methods using benchmark datasets like D&D, PROTEINS, NCI1/NCI109, FRANKENSTEIN, and REDDIT-B. Baselines include hierarchical methods (e.g., SAGPool(H), ASAPool, DiffPool, ABDPool) and global pooling methods (e.g., DGCNN, SAGPool(G), KerGNN, GCKN). Using 10-fold cross-validation, the researchers assess the models and report average accuracy and standard deviation. Their architecture employs two pooling layers with MD-GNNs for embeddings and node assignments, optimized with ReLU activation, dropout, and auxiliary classifiers during training. The method performs superior due to hard node assignment, attention-based importance for nodes and edges, MD-GNN integration, and effective feature representation.

In conclusion, ENADPool compresses node features and edge connectivity into hierarchical structures using attention mechanisms after each pooling step, effectively identifying the importance of nodes and edges. This approach addresses the shortcomings of traditional pooling methods that use unclear node assignments and uniform feature aggregation. Additionally, the MD-GNN model mitigates the over-smoothing problem by allowing nodes to receive information from neighbors at various distances. 

Check out the Paper. All credit for this research goes to the researchers of this project.

01.AI Introduces Yi-1.5-34B Model: An Upgraded Version of Yi with a High-Quality Corpus of 500B Tokens and Fine-Tuned on 3M Diverse Fine-Tuning Samples

The recent Yi-1.5-34B model introduced by 01.AI has brought about yet another advancement in the field of Artificial Intelligence. Positioned as a major improvement over its predecessors, this unique model bridges the gap between Llama 3 8B and 70B. It promises better performance in a number of areas, such as multimodal capability, code production, and logical reasoning. The complexities of the Yi-1.5-34B model, its creation, and its possible effects on the AI community have been explored in depth by the team of researchers.

The Yi-34B model served as the basis for the Yi-1.5-34B model’s development. Thanks to its improved training and optimization, the Yi-1.5-34B carries on the tradition of Yi-34B, which was recognized for its superior performance and functioned as an unofficial benchmark in the AI community. The model’s intense training regimen is reflected in the fact that it was pre-trained on a further 500 billion tokens, reaching 4.1 trillion tokens in total.

Yi-1.5-34B’s architecture is intended to be a well-balanced combination, providing the computational efficiency of Llama 3 8B-sized models and getting close to the broad capabilities of 70B-sized models. This equilibrium guarantees that the model can carry out intricate tasks without necessitating the enormous computational resources that are generally linked with large-scale models.

When compared against benchmarks, the Yi-1.5-34B model has shown remarkable performance. Its large vocabulary helps it solve logical puzzles with ease and grasp complex ideas in a subtle way. Its capacity to produce code snippets longer than those generated by GPT-4 is one of its most notable properties, demonstrating its usefulness in actual applications. The model’s speed and efficiency have been commended by users who have tested it through demos, making it an appealing option for a variety of AI-driven activities.

The Yi family encompasses multimodal and language models, going beyond text to include vision-language features. This is accomplished by aligning visual representations within the language model’s semantic space by combining a vision transformer encoder with the chat language model. Also, the Yi models are not limited to conventional settings. With lightweight ongoing pretraining, they have been extended to handle long contexts of up to 200,000 tokens. 

One of the main reasons for the Yi models’ effectiveness is the careful data engineering procedure that has been used in their creation. The models used 3.1 trillion tokens from Chinese and English corpora for pretraining. To ensure the best quality inputs, this data was carefully selected utilizing a cascaded deduplication and quality filtering pipeline.

The process of fine-tuning enhanced the model’s capabilities even further. Machine learning engineers iteratively refined and validated a small-scale instruction dataset with less than 10,000 instances. Thanks to this practical approach to data verification, the performance of the refined models is guaranteed to be precise and dependable.

With its combination of excellent performance and usefulness, the Yi-1.5-34B model is a great development in Artificial Intelligence. It is a flexible tool for both researchers and practitioners because of its capacity to perform complicated tasks like multimodal integration, code development, and logical reasoning. 

Check out the Model Card and Demo. All credit for this research goes to the researchers of this project.

This AI Research from Google DeepMind Explores the Performance Gap between Online and Offline Methods for AI Alignment

RLHF is the standard approach for aligning LLMs. However, recent advances in offline alignment methods, such as direct preference optimization (DPO) and its variants, challenge the necessity of on-policy sampling in RLHF. Offline methods, which align LLMs using pre-existing datasets without active online interaction, have shown practical efficiency and are simpler and cheaper to implement. This raises the question of whether online RL is essential for AI alignment. Comparing online and offline methods is complex due to their different computational demands, necessitating careful calibration of the budget spent to measure performance fairly.

Researchers from Google DeepMind demonstrated that online methods outperform offline methods in their initial experiments, prompting further investigation into this performance gap. Through controlled experiments, they found that factors like offline data coverage and quality cannot fully explain the discrepancy. Unlike online methods, offline methods excel in pairwise classification but struggle with generation. The gap persists regardless of loss function type and model scaling. This suggests that on-policy sampling is crucial for AI alignment, highlighting challenges in offline alignment. The study uses KL divergence from the supervised fine-tuned (SFT) policy to compare performance across algorithms and budgets, revealing persistent differences.

The study complements previous work on RLHF by comparing online and offline RLHF algorithms. The researchers identify a persistent performance gap between online and offline methods, even when using different loss functions and scaling policy networks. While previous studies noted challenges in offline RL, their findings emphasize that these challenges extend to RLHF. 

The study compares online and offline alignment methods using the IPO loss across various datasets, examining their performance under Goodhart’s law. The IPO loss regresses the preference margin of winning responses over losing ones, measured against a reference policy, toward a fixed target, with differences in the sampling process defining the online and offline methods. Online algorithms sample responses on policy, while offline algorithms use a fixed dataset. Experiments reveal that online algorithms achieve better trade-offs between KL divergence and performance, using the KL budget more efficiently and achieving higher peak performance. Several hypotheses are proposed to explain these discrepancies, such as data coverage diversity and sub-optimal offline datasets.
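
For concreteness, here is a minimal sketch of the IPO objective as it is commonly formulated (a toy batch, not the authors’ code): the log-likelihood margin of the preferred over the dispreferred response, measured against the reference policy, is regressed toward 1/(2τ).

# Sketch of the IPO loss on a batch of preference pairs; tau is the regularization strength.
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    # Margin of the winning response over the losing one, relative to the reference policy
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # IPO regresses this margin toward 1 / (2 * tau)
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()

logp_w, logp_l = torch.randn(8), torch.randn(8)          # policy log-probs (toy values)
ref_logp_w, ref_logp_l = torch.randn(8), torch.randn(8)  # reference (SFT) log-probs (toy values)
print(ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))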

The hypothesis posits that the performance discrepancy between online and offline algorithms can be partially attributed to the classification accuracy of the proxy preference model compared to the policy itself. Firstly, the proxy preference model tends to achieve higher classification accuracy than the policy when used as a classifier. Secondly, it proposes that this difference in classification accuracy contributes to the observed performance gap between online and offline algorithms. In essence, it suggests that better classification leads to better performance, but this hypothesis needs to be further examined and validated through empirical evidence.

In conclusion, the study highlights the critical role of on-policy sampling in effectively aligning LLMs and exposes the challenges associated with offline alignment approaches. The researchers debunked several commonly held beliefs about the performance gap between online and offline algorithms through rigorous experimentation and hypothesis testing. They emphasized the importance of on-policy data generation for enhancing policy learning efficiency. However, they also argue that offline algorithms can improve by adopting strategies that mimic online learning processes. This opens avenues for further exploration, such as hybrid approaches combining the strengths of both online and offline methods and deeper theoretical investigations into reinforcement learning for human feedback.

Check out the Paper. All credit for this research goes to the researchers of this project.

SpeechVerse: A Multimodal AI Framework that Enables LLMs to Follow Natural Language Instructions for Performing Diverse Speech-Processing Tasks

Large language models (LLMs) have excelled in natural language tasks and instruction following, yet they struggle with non-textual data like images and audio. Incorporating speech comprehension could vastly improve human-computer interaction. Current methods rely on automated speech recognition (ASR) followed by LLM processing, missing non-textual cues. A promising approach integrates textual LLMs with speech encoders in one training setup. This allows for a more comprehensive understanding of both speech and text, promising richer comprehension compared to text-only methods. Particularly, instruction-following multimodal audio-language models are gaining traction due to their ability to generalize across tasks. While previous works like SpeechT5, Whisper, VIOLA, SpeechGPT, and SLM show promise, they are constrained to a limited range of speech tasks.

Multi-task learning involves leveraging shared representations across diverse tasks to enhance generalization and efficiency. Models like T5 and SpeechNet employ this approach for text and speech tasks, achieving significant results. However, multimodal large language models integrating audio have garnered less attention. Recent efforts like SpeechGPT and Qwen-Audio aim to bridge this gap, showcasing capabilities in various audio tasks. SpeechVerse innovatively combines multi-task learning and instruction finetuning to achieve superior performance in audio-text tasks.

Amazon researchers introduce SpeechVerse, a multi-task framework with supervised instruction finetuning for diverse speech tasks. Unlike SpeechGPT, it utilizes continuous representations from pre-trained speech models for text-only output tasks. In comparison to Qwen-Audio, which requires hierarchical tagging and a large-scale audio encoder, SpeechVerse incorporates multi-task learning and finetuning without task-specific tagging, enabling generalization to unseen tasks through natural language instructions.

The multimodal model architecture of SpeechVerse comprises an audio encoder, a convolution downsampling module, and an LLM. The audio encoder extracts semantic features from audio using a pre-trained model, generating a unified representation. The downsampling module adjusts the audio features for compatibility with LLM token sequences. The LLM processes text and audio input, combining downsampled audio features with token embeddings. Curriculum learning with parameter-efficient finetuning optimizes training, freezing pre-trained components to efficiently handle diverse speech tasks.
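
A minimal sketch of that wiring, with toy dimensions standing in for the actual encoder and LLM sizes, shows how a 1-D convolutional downsampler maps frozen audio features into the LLM embedding space before they are concatenated with the text token embeddings:

# Sketch: convolutional downsampling of audio features, then fusion with token embeddings.
import torch
import torch.nn as nn

audio_dim, llm_dim, stride = 512, 1024, 4            # toy dimensions

downsampler = nn.Conv1d(audio_dim, llm_dim, kernel_size=stride, stride=stride)

audio_feats = torch.randn(1, 200, audio_dim)         # (batch, frames, audio_dim) from a frozen encoder
text_embeds = torch.randn(1, 16, llm_dim)            # embeddings of the instruction tokens

audio_embeds = downsampler(audio_feats.transpose(1, 2)).transpose(1, 2)  # (1, 50, llm_dim)
llm_inputs = torch.cat([audio_embeds, text_embeds], dim=1)               # sequence fed to the LLM
print(llm_inputs.shape)  # torch.Size([1, 66, 1024])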

The evaluation of end-to-end trained joint speech and language models (E2E-SLM) using the SpeechVerse framework covers 11 tasks spanning various domains and datasets. ASR benchmarks reveal the efficacy of SpeechVerse’s core speech understanding, with task-specific pre-trained ASR models showing promising results. For SLU tasks, end-to-end trained models outperform cascaded pipelines in most cases, demonstrating the effectiveness of SpeechVerse. SpeechVerse models also exhibit competitive or superior performance compared to state-of-the-art models across diverse tasks like ASR, ST, IC, SF, and ER.

To recapitulate, Amazon researchers introduced SpeechVerse, a multimodal framework enabling LLMs to execute diverse speech processing tasks through natural language instructions. Utilizing supervised instruction finetuning and combining representations from pre-trained speech and text models, SpeechVerse exhibits strong zero-shot generalization on unseen tasks. Comparative analysis against conventional baselines underscores SpeechVerse’s superior performance on 9 out of 11 tasks, showcasing its robust instruction-following capability. The model demonstrates resilience across out-of-domain datasets, unseen prompts, and novel tasks, highlighting the effectiveness of the proposed training approach in fostering generalizability.

Check out the Paper. All credit for this research goes to the researchers of this project.

Top AI Tools for Real Estate Agents

With AI’s support, the real estate business is seeing a revolutionary shift. As adoption becomes widespread, real estate agents have access to a suite of AI solutions that can transform their business and provide unparalleled service to clients. Some apps use artificial intelligence to help people choose their ideal homes, forecast real estate values, and even manage real estate agencies. 

Here are some of the top AI Tools for Real Estate Agents:

Styldod

Styldod is an AI-driven platform that provides numerous options for improving the visual appeal of real estate listings. Thanks to its virtual staging tool, potential buyers may picture themselves living in the house. The tool allows users to design empty rooms tastefully.

Compass 

With Compass, artificial intelligence has become the standard in CRM. Having an assistant who knows when to contact customers is like having your very own personal helper. Compass’s artificial intelligence system will point you in the right direction when contacts have been using real estate websites or are otherwise exhibiting behaviors indicative of home hunting. It can even pre-write emails to speed up communication with clients.

REimagineHome 

Users of the AI-powered interior design application REimagineHome can revamp their houses by utilizing personalized design suggestions and inspiration. To do away with time-consuming and error-prone manual design methods, generative AI produces design ideas in seconds. It’s easier than ever to create a lovely and distinctive living space with REimagineHome’s AI-powered design that lets customers rapidly and easily modify their houses.

CoreLogic

By using artificial intelligence to find the ideal houses for each buyer, CoreLogic’s OneHome platform reaches new heights. It’s as if you had a real estate matchmaker who guaranteed the greatest possible pairings. Artificial intelligence (AI) from CoreLogic streamlines mortgage origination by discovering new revenue streams and automating warnings for missing documents. Real estate in North America is being transformed by CoreLogic, which has over 1.2 million agents on board.

Reonomy 

Discover CRE prospects and make data-driven decisions with Reonomy, powered by AI and ML. With Reonomy’s industry-leading CRE property and ownership data, sourcing new deals and discovering off-market opportunities is a breeze.

Rentlytics 

With their platform, Rentlytics is working to make all of the world’s real estate data easily accessible. The world’s leading real estate investment management organizations rely on Rentlytics solutions. Rentlytics is relied upon to provide the data and resources needed to make long-term, profitable portfolio decisions in this ever-changing industry. An inclusive and energetic crew of techies, the Rentlytics Team is here to use AI to revolutionize the real estate investment management sector and meet the demands of today.

PropertyPen

With PropertyPen, an innovative AI-powered tool, real estate teams can easily and rapidly build professional listings. Using natural language processing (NLP) and an advanced language model, it can quickly and accurately describe properties in a way that is both compelling and free of grammar mistakes.

Ailliot

One tool that real estate agents and brokers can use to ease their content creation process is the Ailliot Real Estate AI Assistant. Thanks to this work automation, real estate agents may free up more time to focus on expanding their businesses.

Jude AI

Jude AI is an AI-powered platform for real estate agents and brokers. It provides several solutions for AI-powered real estate companies. With Jude AI, users can easily evaluate market data, create compelling emails, and generate engaging content. Jude AI offers crucial suggestions to help first-time homebuyers navigate the home-buying process.

Epique AI

Among the many real estate-related services offered by Epique AI—a tool driven by artificial intelligence—are the following: the development of real estate blog pieces, newsletters, lead generation ideas, and Instagram quotations for realtors. With Epique AI’s legal AI tool, you can get help with all the rules and laws of your state. Regarding broker advice, Epique AI has you covered with its AI function. The user-friendly chat interface of Epique AI allows users to pose targeted questions and obtain pertinent replies.

Mixtral 8x22B is now available in Amazon SageMaker JumpStart

Today, we are excited to announce the Mixtral-8x22B large language model (LLM), developed by Mistral AI, is available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Mixtral-8x22B model.
What is Mixtral 8x22B
Mixtral 8x22B is Mistral AI’s latest open-weights model and sets a new standard for performance and efficiency of available foundation models, as measured by Mistral AI across standard industry benchmarks. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39 billion active parameters out of 141 billion, offering cost-efficiency for its size. Continuing with Mistral AI’s belief in the power of publicly available models and broad distribution to promote innovation and collaboration, Mixtral 8x22B is released under Apache 2.0, making the model available for exploring, testing, and deploying. Mixtral 8x22B is an attractive option for customers selecting among publicly available models who prioritize quality, as well as for those who want higher quality than mid-sized models such as Mixtral 8x7B and GPT-3.5 Turbo while maintaining high throughput.
Mixtral 8x22B provides the following strengths:

Multilingual native capabilities in English, French, Italian, German, and Spanish languages
Strong mathematics and coding capabilities
Capable of function calling that enables application development and tech stack modernization at scale
64,000-token context window that allows precise information recall from large documents

About Mistral AI
Mistral AI is a Paris-based company founded by seasoned researchers from Meta and Google DeepMind. During his time at DeepMind, Arthur Mensch (Mistral CEO) was a lead contributor on key LLM projects such as Flamingo and Chinchilla, while Guillaume Lample (Mistral Chief Scientist) and Timothée Lacroix (Mistral CTO) led the development of LLaMa LLMs during their time at Meta. The trio are part of a new breed of founders who combine deep technical expertise and operating experience working on state-of-the-art ML technology at the largest research labs. Mistral AI has championed small foundational models with superior performance and commitment to model development. They continue to push the frontier of artificial intelligence (AI) and make it accessible to everyone with models that offer unmatched cost-efficiency for their respective sizes, delivering an attractive performance-to-cost ratio. Mixtral 8x22B is a natural continuation of Mistral AI’s family of publicly available models that include Mistral 7B and Mixtral 8x7B, also available on SageMaker JumpStart. More recently, Mistral launched commercial enterprise-grade models, with Mistral Large delivering top-tier performance and outperforming other popular models with native proficiency across multiple languages.
What is SageMaker JumpStart
With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment. You can now discover and deploy Mixtral-8x22B with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, providing data encryption at rest and in-transit.
SageMaker also adheres to standard security frameworks such as ISO 27001 and SOC 1/2/3, in addition to complying with various regulatory requirements. Compliance frameworks like the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), and Payment Card Industry Data Security Standard (PCI DSS) are supported to make sure data handling, storage, and processing meet stringent security standards.
SageMaker JumpStart availability is dependent on the model; Mixtral-8x22B v0.1 is currently supported in the US East (N. Virginia) and US West (Oregon) AWS Regions.
Discover models
You can access Mixtral-8x22B foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.
In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

From the SageMaker JumpStart landing page, you can search for “Mixtral” in the search box. You will see search results showing Mixtral 8x22B Instruct, various Mixtral 8x7B models, and Dolphin 2.5 and 2.7 models.

You can choose the model card to view details about the model such as license, data used to train, and how to use. You will also find the Deploy button, which you can use to deploy the model and create an endpoint.
SageMaker has seamless logging, monitoring, and auditing enabled for deployed models with native integrations with services like AWS CloudTrail for logging and monitoring to provide insights into API calls and Amazon CloudWatch to collect metrics, logs, and event data to provide information into the model’s resource utilization.

Deploy a model
Deployment starts when you choose Deploy. After deployment finishes, an endpoint has been created. You can test the endpoint by passing a sample inference request payload or selecting your testing option using the SDK. When you select the option to use the SDK, you will see example code that you can use in your preferred notebook editor in SageMaker Studio. This will require an AWS Identity and Access Management (IAM) role and policy attached to it to restrict model access. Additionally, if you choose to deploy the model endpoint within SageMaker Studio, you will be prompted to choose an instance type, initial instance count, and maximum instance count. The ml.p4d.24xlarge and ml.p4de.24xlarge instance types are the only instance types currently supported for Mixtral 8x22B Instruct v0.1.
To deploy using the SDK, we start by selecting the Mixtral-8x22b model, specified by the model_id with value huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1. You can deploy any of the selected models on SageMaker with the following code. Similarly, you can deploy Mixtral-8x22B instruct using its own model ID.

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-mistralai-mixtral-8x22B-instruct-v0-1")
predictor = model.deploy()

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel.
After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {"inputs": "Hello!"}
predictor.predict(payload)

Example prompts
You can interact with a Mixtral-8x22B model like any standard text generation model, where the model processes an input sequence and outputs predicted next words in the sequence. In this section, we provide example prompts.
Mixtral-8x22b Instruct
The instruction-tuned version of Mixtral-8x22B accepts formatted instructions where conversation roles must start with a user prompt and alternate between user instruction and assistant (model answer). The instruction format must be strictly respected, otherwise the model will generate sub-optimal outputs. The template used to build a prompt for the Instruct model is defined as follows:

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]

<s> and </s> are special tokens for beginning of string (BOS) and end of string (EOS), whereas [INST] and [/INST] are regular strings.
The following code shows how you can format the prompt in instruction format:

from typing import Dict, List

def format_instructions(instructions: List[Dict[str, str]]) -> List[str]:
    """Format instructions where conversation roles must alternate user/assistant/user/assistant/..."""
    prompt: List[str] = []
    for user, answer in zip(instructions[::2], instructions[1::2]):
        prompt.extend(["<s>", "[INST] ", (user["content"]).strip(), " [/INST] ", (answer["content"]).strip(), "</s>"])
    prompt.extend(["<s>", "[INST] ", (instructions[-1]["content"]).strip(), " [/INST] ", "</s>"])
    return "".join(prompt)

def print_instructions(prompt: str, response: str) -> None:
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}\n\n{bold}> Output{unbold}\n{response[0]['generated_text']}\n")

Summarization prompt
You can use the following code to get a response for a summarization:

instructions = [{"role": "user", "content": """Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression – To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.
"""}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 1500}
}
response = predictor.predict(payload)
print_instructions(prompt, response)

The following is an example of the expected output:

> Input
<s>[INST] Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression – To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction. [/INST] </s>
> Output
<s>[INST] Summarize the following information. Format your response in short paragraph.

Article:

Contextual compression – To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction. [/INST] </s>Contextual compression is a technique used to manage the issue of context overflow in information retrieval. This method involves compressing and filtering the retrieved documents to align with the query’s context, ensuring that only relevant information is processed. This is achieved through a two-step process: a base retriever fetches the initial set of documents, and a document compressor refines these documents by either reducing their content or excluding them entirely based on relevance.

The contextual compression retriever significantly improves the efficiency of Retrieval-Augmented Generation (RAG) applications by extracting and utilizing only the essential information from a large pool of data. This approach addresses the problem of information overload and irrelevant data processing, leading to improved response quality, cost-effective operations, and a smoother retrieval process. In essence, contextual compression acts as a filter that tailors the information to the specific query, making it an indispensable tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.

Multilingual translation prompt
You can use the following code to get a response for a multilingual translation:

Prompt

instructions = [{"role": "user", "content": """
<You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]

1. Customer: “I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?”
2. Customer: “I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop’s performance isn’t as expected. Can I have a technician look into it, or should I consider returning it?”
3. Customer: “My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?”
4. Customer: “I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven’t received it. Could you assist in determining where my package might be?”
5. Customer: “I’m trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?”
"""}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 2000, "do_sample": True}
}
response = predictor.predict(payload)
print_instructions(prompt, response)

The following is an example of the expected output:

> Input
<s>[INST] <You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]

1. Customer: “I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?”
2. Customer: “I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop’s performance isn’t as expected. Can I have a technician look into it, or should I consider returning it?”
3. Customer: “My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?”
4. Customer: “I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven’t received it. Could you assist in determining where my package might be?”
5. Customer: “I’m trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?” [/INST] </s>
> Output
<s>[INST] <You are a multilingual assistant. Translate the following sentences in the order in which they are presented into French, German, and Spanish. Make sure to label each section as French, German, and Spanish. [/INST]

1. Customer: “I recently ordered a set of wireless headphones, but I received a different model. What steps should I take to receive the correct product I ordered?”
2. Customer: “I purchased a customizable laptop last month and opted for specific upgrades. However, the laptop’s performance isn’t as expected. Can I have a technician look into it, or should I consider returning it?”
3. Customer: “My order for a designer handbag was supposed to include a matching wallet as part of a promotional deal, but the wallet was not in the package. How can this issue be resolved?”
4. Customer: “I see that the tracking information for my order of ceramic cookware shows it was delivered, but I haven’t received it. Could you assist in determining where my package might be?”
5. Customer: “I’m trying to buy an antique mirror from your vintage collection, but the website keeps giving me an error when I try to check out. Is there another way to complete my purchase?” [/INST] </s>

French:

1. Client : “J’ai récemment commandé un casque audio sans fil, mais j’ai reçu un modèle différent. Quelles sont les étapes à suivre pour recevoir le produit correct que j’ai commandé ?”
2. Client : “J’ai acheté un ordinateur portable personnalisable le mois dernier et j’ai opté pour des mises à niveau spécifiques. Cependant, les performances de l’ordinateur portable ne sont pas à la hauteur de mes attentes. Puis-je avoir un technicien qui vérifie cela, ou devrais-je envisager de le retourner ?”
3. Client : “Ma commande pour un sac à main de designer devait inclure un portefeuille assorti dans le cadre d’une offre promotionnelle, mais le portefeuille ne se trouvait pas dans le paquet. Comment puis-je résoudre ce problème ?”
4. Client : “Je vois que les informations de suivi de ma commande de batterie de cuisine en céramique indiquent qu’elle a été livrée, mais je ne l’ai pas reçue. Pourriez-vous m’aider à déterminer où se trouve mon colis ?”
5. Client : “J’essaie d’acheter un miroir antique de votre collection vintage, mais le site continue de me donner une erreur lorsque j’essaie de passer à la caisse. Existe-t-il un autre moyen de finaliser mon achat ?”

German:

1. Kunde: “Ich habe kürzlich ein Set kabelloser Kopfhörer bestellt, aber ich habe ein anderes Modell erhalten. Welche Schritte sollte ich unternehmen, um das richtige Produkt zu erhalten, das ich bestellt habe?”
2. Kunde: “Ich habe letzten Monat einen anpassbaren Laptop gekauft und habe mich für spezifische Upgrades entschieden. Allerdings entspricht die Leistung des Laptops nicht meinen Erwartungen. Kann ich einen Techniker hinzuziehen lassen oder sollte ich eine Rückgabe in Erwägung ziehen?”
3. Kunde: “Meine Bestellung für eine Designer-Handtasche sollte inklusive eines passenden Portemonnaies als Teil einer Werbeaktion sein, aber das Portemonnaie war nicht im Paket. Wie kann dieses Problem gelöst werden?”
4. Kunde: “Ich sehe, dass die Sendungsverfolgungsinformationen für meine Bestellung von Keramik-Kochgeschirr anzeigen, dass es geliefert wurde, aber ich habe es nicht erhalten. Könnten Sie mir dabei helfen, festzustellen, wo sich mein Paket befindet?”
5. Kunde: “Ich versuche ein Antikspiegel aus Ihrer Vintage-Kollektion zu kaufen, aber die Website gibt mir jedes Mal einen Fehler, wenn ich versuche, den Bestellvorgang abzuschließen. Gibt es einen anderen Weg, meinen Kauf abzuschließen?”

Spanish:

1. Cliente: “Recientemente ordené un conjunto de audífonos inalámbricos, pero recibí un modelo diferente. ¿Cuáles son los pasos que debo seguir para recibir el producto correcto que ordené?”
2. Cliente: “Compré una computadora personalizable el mes pasado y opté por actualizaciones específicas. Sin embargo, el rendimiento de la computadora no está a la altura de mis expectativas. ¿Puedo tener un técnico que revise esto, o debería considerar devolverla?”
3. Cliente: “Mi pedido de un bolso de diseñador debería haber incluido una billetera a juego como parte de una oferta promocional, pero la billetera no estaba en el paquete. ¿Cómo puedo resolver este problema?”
4. Cliente: “Veo que la información de seguimiento de mi pedido de utensilios de cocina de cerámica indica que ha sido entregado, pero aún no lo he recibido. ¿Podría ayudarme a determinar dónde se encuentra mi paquete?”
5. Cliente: “Estoy tratando de comprar un espejo antiguo de su colección de vintage, pero el sitio sigue dándome un error cada vez que trato de realizar el pago. ¿Hay otra forma de completar mi compra?”

Code generation
You can use the following code to get a response for code generation:

instructions = [{"role": "user", "content": "Write a program on Dijkstra's algorithm using a priority queue in python:"}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 1000}
}
response = predictor.predict(payload)
print_instructions(prompt, response)

You get the following output:

> Input
<s>[INST] Write a program on Dijkstra's algorithm using a priority queue in python: [/INST] </s>

> Output
<s>[INST] Write a program on Dijkstra's algorithm using a priority queue in python: [/INST] </s>Here is a Python implementation of Dijkstra's algorithm using a priority queue (heapq module):
```python
import heapq

def dijkstra(graph, start):
    # Initialize the priority queue with the starting vertex and its distance (0)
    priority_queue = [(0, start)]
    # Initialize the distances dictionary with infinite distances for all vertices except the starting vertex
    distances = {vertex: float('inf') for vertex in graph}
    distances[start] = 0
    # Initialize the previous dictionary to keep track of the shortest path
    previous = {vertex: None for vertex in graph}

    while priority_queue:
        # Get the vertex with the smallest distance from the priority queue
        current_distance, current_vertex = heapq.heappop(priority_queue)

        # If the current distance is greater than the stored distance, skip this vertex
        if current_distance > distances[current_vertex]:
            continue

        # Update the distances of the adjacent vertices
        for neighbor, weight in graph[current_vertex].items():
            distance = current_distance + weight
            # If the new distance is smaller than the stored distance, update the distance and previous vertex
            if distance < distances[neighbor]:
                distances[neighbor] = distance
                previous[neighbor] = current_vertex
                # Add the neighbor to the priority queue
                heapq.heappush(priority_queue, (distance, neighbor))

    return distances, previous

# Example usage:
graph = {
    'A': {'B': 1, 'C': 4},
    'B': {'A': 1, 'C': 2, 'D': 5},
    'C': {'A': 4, 'B': 2, 'D': 1},
    'D': {'B': 5, 'C': 1}
}

distances, previous = dijkstra(graph, 'A')
print(distances)  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```
In this implementation, the `graph` parameter is a dictionary where the keys are the vertices and the values are dictionaries that map adjacent vertices to their weights. The `start` parameter is the starting vertex for the algorithm. The function returns two dictionaries: `distances` and `previous`. The `distances` dictionary maps each vertex to its shortest distance from the starting vertex, and the `previous` dictionary maps each vertex to its previous vertex in the shortest path.

Reasoning and math
You can use the following code to get a response for reasoning and math:

instructions = [{"role": "user", "content": """Sarah went to a car shop to buy a car from Fred for $20,000 in 2024. She plans to sell the car but it depreciates by 2% each year after she has bought it. She went to a dealer in which that dealer told her that the car has only depreciated by 1.4% each year. After 7 years of using the car, Sarah decides to sell it directly to another person.

How much did Sarah sell the car for and what year is it? Explain the steps before answering. It’s ok to make some assumptions as you come to your answer.

"""}]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 2000, "do_sample": True}
}
response = predictor.predict(payload)
print_instructions(prompt, response)

You get the following output:

<s>[INST] Sarah went to a car shop to buy a car from Fred for $20,000 in 2024. She plans to sell the car but it depreciates by 2% each year after she has bought it. She went to a dealer in which that dealer told her that the car has only depreciated by 1.4% each year. After 7 years of using the car, Sarah decides to sell it directly to another person.

How much did Sarah sell the car for and what year is it? Explain the steps before answering. It’s ok to make some assumptions as you come to your answer. [/INST] </s>To solve this problem, we need to calculate the value of the car after 7 years, assuming two different annual depreciation rates: 2% and 1.4%.

First, let’s calculate the value of the car after 7 years assuming a 2% annual depreciation rate:

Year 1: $20,000 – ($20,000 * 2%) = $19,600
Year 2: $19,600 – ($19,600 * 2%) = $19,208
Year 3: $19,208 – ($19,208 * 2%) = $18,821.60
Year 4: $18,821.60 – ($18,821.60 * 2%) = $18,442.22
Year 5: $18,442.22 – ($18,442.22 * 2%) = $18,071.62
Year 6: $18,071.62 – ($18,071.62 * 2%) = $17,709.76
Year 7: $17,709.76 – ($17,709.76 * 2%) = $17,356.58

So, assuming a 2% annual depreciation rate, the value of the car after 7 years would be approximately $17,356.58.

Next, let’s calculate the value of the car after 7 years assuming a 1.4% annual depreciation rate:

Year 1: $20,000 – ($20,000 * 1.4%) = $19,728
Year 2: $19,728 – ($19,728 * 1.4%) = $19,463.23
Year 3: $19,463.23 – ($19,463.23 * 1.4%) = $19,205.80
Year 4: $19,205.80 – ($19,205.80 * 1.4%) = $18,955.66
Year 5: $18,955.66 – ($18,955.66 * 1.4%) = $18,712.73
Year 6: $18,712.73 – ($18,712.73 * 1.4%) = $18,476.95
Year 7: $18,476.95 – ($18,476.95 * 1.4%) = $18,248.33

So, assuming a 1.4% annual depreciation rate, the value of the car after 7 years would be approximately $18,248.33.

Now, since Sarah sells the car directly to another person, we can assume that she sold it for the higher of the two values. Therefore, she sold the car for $18,248.33.

As for the year in which she sold the car, since we know that Sarah purchased the car in 2024 and used it for 7 years, we can calculate the year as follows:

2024 + 7 = 2031

Therefore, Sarah sold the car in the year 2031.

So, Sarah sold the car for $18,248.33 in the year 2031.

Clean up
After you’re done running the notebook, delete all resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we showed you how to get started with Mixtral-8x22B in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.
Now that you are aware of Mistral AI and their Mixtral 8x22B models, we encourage you to deploy an endpoint on SageMaker to perform inference testing and try out responses for yourself. Refer to the following resources for more information:

Open and use JumpStart in Studio
Get Started with Amazon SageMaker JumpStart

About the Authors
Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyper-scale on AWS. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers and acquisitions. Marco is based in Seattle, WA, and enjoys writing, reading, exercising, and building applications in his free time.
Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.
June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last-mile delivery.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He earned his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP.
Shane Rai is a Principal GenAI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services including model offerings from top tier foundation model providers.
Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He earned his master's from the Courant Institute of Mathematical Sciences and his B.Tech from IIT Delhi. He has experience working on a diverse range of machine learning problems in natural language processing, computer vision, and time series analysis.

Building Generative AI prompt chaining workflows with human in the loop

Generative AI is a type of artificial intelligence (AI) that can be used to create new content, including conversations, stories, images, videos, and music. Like all AI, generative AI works by using machine learning models—very large models that are pretrained on vast amounts of data called foundation models (FMs). FMs are trained on a broad spectrum of generalized and unlabeled data. They’re capable of performing a wide variety of general tasks with a high degree of accuracy based on input prompts. Large language models (LLMs) are one class of FMs. LLMs are specifically focused on language-based tasks such as summarization, text generation, classification, open-ended conversation, and information extraction.
FMs and LLMs, even though they’re pre-trained, can continue to learn from data inputs or prompts during inference. This means that you can develop comprehensive outputs through carefully curated prompts. A prompt is the information you pass into an LLM to elicit a response. This includes task context, data that you pass to the model, conversation and action history, instructions, and even examples. The process of designing and refining prompts to get specific responses from these models is called prompt engineering.
While LLMs are good at following instructions in a prompt, as a task gets more complex they are known to drop subtasks or perform them at less than the desired accuracy. LLMs handle complex tasks better when you break them down into smaller subtasks. This technique of breaking down a complex task into subtasks is called prompt chaining. With prompt chaining, you construct a set of smaller subtasks as individual prompts. Together, these subtasks make up the overall complex task. To accomplish the overall task, your application feeds each subtask prompt to the LLM in a pre-defined order or according to a set of rules.
While Generative AI can create highly realistic content, including text, images, and videos, it can also generate outputs that appear plausible but are verifiably incorrect. Incorporating human judgment is crucial, especially in complex and high-risk decision-making scenarios. This involves building a human-in-the-loop process where humans play an active role in decision making alongside the AI system.
In this blog post, you will learn about prompt chaining, how to break a complex task into multiple tasks to use prompt chaining with an LLM in a specific order, and how to involve a human to review the response generated by the LLM.
Example overview
To illustrate this example, consider a retail company that allows purchasers to post product reviews on its website. By responding promptly to those reviews, the company demonstrates its commitment to customers and strengthens customer relationships.

Figure 1: Customer review and response
The example application in this post automates the process of responding to customer reviews. For most reviews, the system auto-generates a reply using an LLM. However, if the toxicity or tone of the review or of the LLM-generated response is uncertain, the system flags it for a human reviewer. The human reviewer then assesses the flagged content and makes the final decision about the toxicity or tone.
The application uses event-driven architecture (EDA), a powerful software design pattern that you can use to build decoupled systems by communicating through events. As soon as the product review is created, the review receiving system uses Amazon EventBridge to send an event that a product review is posted, along with the actual review content. The event starts an AWS Step Functions workflow. The workflow runs through a series of steps including generating content using an LLM and involving human decision making.

Figure 2: Review workflow
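The review receiving system needs only a few lines of code to publish this event. The following is a minimal sketch using boto3; the event bus name, source, and detail fields are illustrative placeholders rather than the sample application's actual values.

import json
import boto3

events = boto3.client("events")

def publish_review_posted(review_id: str, review_text: str) -> None:
    # Publish a "product review posted" event that starts the Step Functions workflow.
    events.put_events(
        Entries=[
            {
                "EventBusName": "product-reviews",   # assumed bus name
                "Source": "com.example.reviews",     # assumed source
                "DetailType": "NEW_REVIEW_POSTED",
                "Detail": json.dumps({"review_id": review_id, "review_text": review_text}),
            }
        ]
    )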
The process of generating a review response includes evaluating the toxicity of the review content, identifying sentiment, generating a response, and involving a human approver. This naturally fits a workflow-style application because it's a single process containing multiple sequential steps along with the need to manage state between steps. Hence, the example uses Step Functions for workflow orchestration. Here are the steps in the review response workflow; a minimal code sketch of the toxicity routing logic follows Figure 3.

Detect whether the review content contains any harmful information using the Amazon Comprehend DetectToxicContent API. The API responds with a toxicity score between 0 and 1 that represents the overall confidence of detection, with a score closer to 1 indicating higher toxicity.
If the toxicity of the review is in the range of 0.4–0.6, send the review to a human reviewer to make the decision.
If the toxicity of the review is greater than 0.6 or the reviewer finds the review harmful, publish a HARMFUL_CONTENT_DETECTED message.
If the toxicity of the review is less than 0.4 or the reviewer approves the review, first find the sentiment of the review and then generate a response to the review comment. Both tasks are achieved using a generative AI model.
Repeat the toxicity detection through the Comprehend API for the LLM-generated response.
If the toxicity of the LLM-generated response is in the range of 0.4–0.6, send the LLM-generated response to a human reviewer.
If the LLM-generated response is found to be non-toxic, publish a NEW_REVIEW_RESPONSE_CREATED event.
If the LLM-generated response is found to be toxic, publish a RESPONSE_GENERATION_FAILED event.

Figure 3: Product review evaluation and response workflow
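The following is a minimal sketch of the toxicity routing described in the preceding steps. The 0.4 and 0.6 thresholds come from the workflow above; the Amazon Comprehend call is simplified, and the exact response shape should be verified against the DetectToxicContent API documentation.

import boto3

comprehend = boto3.client("comprehend")

def overall_toxicity(text: str) -> float:
    # DetectToxicContent scores each submitted segment; this assumes a single
    # segment and an overall "Toxicity" score in each result.
    response = comprehend.detect_toxic_content(
        TextSegments=[{"Text": text}],
        LanguageCode="en",
    )
    return response["ResultList"][0]["Toxicity"]

def route_review(text: str) -> str:
    # Route the review according to the thresholds in the workflow steps.
    score = overall_toxicity(text)
    if score > 0.6:
        return "HARMFUL_CONTENT_DETECTED"
    if score >= 0.4:
        return "SEND_TO_HUMAN_REVIEWER"
    return "GENERATE_RESPONSE"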
Getting started
Use the instructions in the GitHub repository to deploy and run the application.
Prompt chaining
Prompt chaining simplifies the problem for the LLM by dividing single, detailed, and monolithic tasks into smaller, more manageable tasks. Some, but not all, LLMs are good at following all the instructions in a single prompt. The simplification results in writing focused prompts for the LLM, leading to a more consistent and accurate response. The following is a sample ineffective single prompt.
Read the below customer review, filter for harmful content and provide your thoughts on the overall sentiment in JSON format. Then construct an email response based on the sentiment you determine and enclose the email in JSON format. Based on the sentiment, write a report on how the product can be improved.
To make it more effective, you can split the prompt into multiple subtasks:

Filter for harmful content
Get the sentiment
Generate the email response
Write a report
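As a minimal sketch of this decomposition, the following chains the subtasks sequentially. call_llm is a placeholder for whichever model invocation you use (for example, a deployed endpoint's predict call); the prompts themselves are illustrative.

from typing import Callable

def respond_to_review(review: str, call_llm: Callable[[str], str]) -> dict:
    # Subtask 1: filter for harmful content.
    verdict = call_llm(f"Does the following review contain harmful content? Answer YES or NO.\n\n{review}")
    if verdict.strip().upper().startswith("YES"):
        return {"status": "flagged_for_human_review"}

    # Subtask 2: get the sentiment.
    sentiment = call_llm(f"Classify the sentiment of this review as positive, neutral, or negative.\n\n{review}").strip()

    # Subtask 3: generate the email response.
    email = call_llm(f"The customer review below has {sentiment} sentiment. Write a short, polite email response.\n\n{review}")

    # Subtask 4: write a report on how the product can be improved.
    report = call_llm(f"Based on this {sentiment} review, suggest how the product could be improved.\n\n{review}")

    return {"status": "ok", "sentiment": sentiment, "email": email, "report": report}

In the sample application, this orchestration is handled by Step Functions rather than application code, which adds state management and retries.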

You can even run some of the tasks in parallel. By breaking the work down into focused prompts, you achieve the following benefits:

You speed up the entire process. You can handle tasks in parallel, use different models for different tasks, and send responses back to the user rather than waiting for the model to process a larger prompt, which takes considerably longer.
Better prompts provide better output. With focused prompts, you can engineer each prompt by adding additional relevant context, thus improving the overall reliability of the output.
You spend less time developing. Prompt engineering is an iterative process. Both debugging LLM calls for a detailed prompt and refining the larger prompt for accuracy require significant time and effort. Smaller tasks enable you to experiment and refine through successive iterations.

Step Functions is a natural fit to build prompt chaining because it offers multiple different ways to chain prompts: sequentially, in parallel, and iteratively by passing the state data from one state to another. Consider the situation where you have built the product review response prompt chaining workflow and now want to evaluate the responses from different LLMs to find the best fit using an evaluation test suite. The evaluation test suite consists of hundreds of test product reviews, a reference response to the review, and a set of rules to evaluate the LLM response against the reference response. You can automate the evaluation activity using a Step Functions workflow. The first task in the workflow asks the LLM to generate a review response for the product review. The second task then asks the LLM to compare the generated response to the reference response using the rules and generate an evaluation score. Based on the evaluation score for each review, you can decide if the LLM passes your evaluation criteria or not. You can use the map state in Step Functions to run the evaluations for each review in your evaluation test suite in parallel. See this repository for more prompt chaining examples.
Human in the loop
Involving human decision making in the example allows you to improve the accuracy of the system when the toxicity of the content cannot be determined to be either safe or harmful. You can implement human review within the Step Functions workflow using Wait for a Callback with the Task Token integration. When you use this integration with any supported AWS SDK API, the workflow task generates a unique token and then pauses until the token is returned. You can use this integration to include human decision making, call a legacy on-premises system, wait for completion of long running tasks, and so on.

"Wait for human approval for product review": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "Parameters": {
    "FunctionName": "arn:aws:lambda:{region}:{account}:function:human-approval-helper-product-review-response-automation-stage",
    "Payload": {
      "review_text.$": "$$.Execution.Input.review_text",
      "token.$": "$$.Task.Token",
      "api_url": "https://{apiID}.execute-api.{region}.amazonaws.com/dev"
    }
  }
}
In the sample application, the send email for approval task includes a wait for the callback token. It invokes an AWS Lambda function with a token and waits for the token. The Lambda function builds an email message along with the link to an Amazon API Gateway URL. Lambda then uses Amazon Simple Notification Service (Amazon SNS) to send an email to a human reviewer. The reviewer reviews the content and either accepts or rejects the message by selecting the appropriate link in the email. This action invokes the Step Functions SendTaskSuccess API. The API sends back the task token and a status message of whether to accept or reject the review. Step Functions receives the token, resumes the send email for approval task and then passes control to the choice state. The choice state decides whether to go through acceptance or rejection of the review based on the status message.
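For reference, a minimal sketch of the approval handler that returns the task token might look like the following. The query string parameter names and the output fields are assumptions for illustration; the sample application's actual Lambda function may differ.

import json
import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # The email links call an API Gateway endpoint that passes the task token
    # and the reviewer's decision as query string parameters (assumed names).
    params = event.get("queryStringParameters") or {}
    token = params["token"]
    decision = params.get("decision", "reject")

    # Return the token to Step Functions along with the reviewer's decision,
    # so the workflow resumes and the choice state can branch on the status.
    sfn.send_task_success(
        taskToken=token,
        output=json.dumps({"review_status": "approved" if decision == "approve" else "rejected"}),
    )
    return {"statusCode": 200, "body": "Thank you, your decision has been recorded."}

A rejection could also be reported with the SendTaskFailure API; encoding the decision in the SendTaskSuccess output, as described above, lets the choice state handle both branches.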

Figure 4: Human-in-the-loop workflow
Event-driven architecture
EDA enables building extensible architectures. You can add consumers at any time by subscribing to the event. For example, consider moderating images and videos attached to a product review in addition to the text content. You also need to write code to delete the images and videos if they are found harmful. You can add a consumer, the image moderation system, to the NEW_REVIEW_POSTED event without making any code changes to the existing event consumers or producers. Development of the image moderation system and of the code in the review response system that deletes harmful images can proceed in parallel, which in turn improves development velocity.
When the image moderation workflow finds toxic content, it publishes a HARMFUL_CONTENT_DETECTED event. The event can be processed by a review response system that decides what to do with it. By decoupling systems through events, you gain many advantages, including improved development velocity, variable scaling, and fault tolerance.

Figure 5: Event-driven workflow
Cleanup
Use the instructions in the GitHub repository to delete the sample application.
Conclusion
In this blog post, you learned how to build a generative AI application with prompt chaining and a human-review process. You learned how both techniques improve the accuracy and safety of a generative AI application. You also learned how event-driven architectures along with workflows can integrate existing applications with generative AI applications.
Visit Serverless Land for more Step Functions workflows.

About the authors
Veda Raman is a Senior Specialist Solutions Architect for Generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon SageMaker.
Uma Ramadoss is a Principal Solutions Architect at Amazon Web Services, focused on Serverless and Integration Services. She is responsible for helping customers design and operate event-driven cloud-native applications using services like Lambda, API Gateway, EventBridge, Step Functions, and SQS. Uma has hands-on experience leading enterprise-scale serverless delivery projects and possesses strong working knowledge of event-driven, microservice, and cloud architectures.

Unveiling the Potential of Large Language Models: Enhancing Feedback Generation in Computing Education

Feedback is crucial for student success, especially in large computing classes facing increasing demand. Automated tools, incorporating analysis techniques and testing frameworks, are gaining popularity but often fall short of offering helpful suggestions. Recent advancements in large language models (LLMs) show promise in offering rapid, human-like feedback. However, concerns about the accuracy, reliability, and ethical implications of using proprietary LLMs persist, necessitating the exploration of open-source alternatives in computing education.

Automated feedback generation in computing education has been a persistent challenge, focusing mainly on identifying mistakes rather than offering constructive guidance. LLMs present a promising solution to this issue. Recent research has explored using LLMs for automated feedback generation but highlights limitations in their performance. While some studies show LLMs like GPT-3 and GPT-3.5 can identify issues in student code, they also exhibit inconsistencies and inaccuracies in feedback. Also, current state-of-the-art models struggle to match human performance when providing programming exercise feedback. The concept of using LLMs as judges to evaluate other LLMs’ output, termed LLMs-as-judges, has gained traction. This approach has shown promising results, with models like GPT-4 reaching high levels of agreement with human judgments.

Researchers from Aalto University, the University of Jyväskylä, and The University of Auckland provide a thorough study to assess the effectiveness of LLMs in providing feedback on student-written programs and to explore whether open-source LLMs can rival proprietary ones in this regard. The focus lies on feedback that detects errors in student code, such as compiler errors or test failures. Initially, evaluations compare programming feedback from GPT-4 with expert human ratings, establishing a baseline for assessing LLM-generated feedback quality. Subsequently, the study evaluates feedback quality from various open-source LLMs compared to proprietary models. To address these research questions, existing datasets and new feedback generated by open-source models are assessed using GPT-4 as a judge.

Data from an introductory programming course by Aalto University was utilized, consisting of student help requests and feedback generated by GPT-3.5. Evaluation criteria focused on feedback completeness, perceptivity, and selectivity. Feedback was assessed both qualitatively and automatically using GPT-4. Open-source LLMs were evaluated alongside proprietary ones, employing a rubric-based grading system. GPT-4 judged the quality of feedback generated by LLMs based on human annotations. Precision and F0.5-score were key metrics used to evaluate the judge’s performance.
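As a reminder of how these metrics are computed, the following is a minimal sketch of precision and the F0.5-score from raw counts; with beta = 0.5, precision is weighted more heavily than recall.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    p, r = precision(tp, fp), recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)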

The results show that while most feedback is perceptive, only a little over half is complete, and many contain misleading content. GPT-4 tends to grade feedback more positively compared to human annotators, indicating some positive bias. Classification performance results for GPT-4 show reasonably good performance in completeness classification and slightly lower performance in selectivity. Perceptivity classification scores higher, partially due to data skew. Kappa scores indicate moderate agreement, with GPT-4 maintaining high recall across all criteria while maintaining reasonable precision and accuracy.

To recapitulate, this study examined the effectiveness of GPT-4 in evaluating automatically generated programming feedback and assessed the performance of various large language models, including open-source ones, in generating feedback on student code. Results indicate that GPT-4 shows promise in reliably assessing the quality of automatically generated feedback. Also, open-source language models demonstrate the potential to generate programming feedback. This suggests that LLM-generated feedback could serve as a cost-effective and accessible resource in learning environments, allowing instructors and teaching assistants to focus on more challenging cases where LLMs may currently fall short in assisting students.


This AI Research from Stanford and UC Berkeley Discusses How ChatGPT's Behavior is Changing Over Time

Large Language Models (LLMs) like GPT-3.5 and GPT-4 have recently gained a lot of attention in the Artificial Intelligence (AI) community. These models are made to process enormous volumes of data, identify patterns, and produce human-like language in response to prompts. One of their primary characteristics is that they are updated over time, incorporating fresh information and user feedback to improve performance and flexibility.

However, because the update process is opaque, it is impossible to foresee how modifications to the model will affect its output. This makes it difficult to incorporate these models into intricate workflows: when an update abruptly alters an LLM's responses, it can interfere with downstream operations that depend on its output. And because users cannot consistently expect the same performance from the LLM over time, this lack of consistency also impedes the reproducibility of results.

In a recent study using versions released in March 2023 and June 2023, a team of researchers assessed the performance of GPT-3.5 and GPT-4 across a variety of tasks. The tasks covered a wide range, such as answering opinion surveys, responding to sensitive or risky inquiries, solving math problems, tackling hard, knowledge-intensive queries, writing code, passing tests for U.S. medical licenses, and performing visual reasoning.

The results showed that these models' behavior and performance varied significantly over the course of the evaluation. For example, GPT-4's accuracy in discriminating between prime and composite numbers decreased from 84% in March to 51% in June. A decline in GPT-4's responsiveness to prompts requiring step-by-step (chain-of-thought) reasoning was one reason for this drop. By June, however, GPT-3.5 showed a significant improvement on this specific task.

By June, compared to March, GPT-4 was less likely to respond to sensitive or opinion-based questions. Over that same period, it performed better on multi-hop, knowledge-intensive questions, while GPT-3.5's ability to handle multi-hop queries declined. Code generation was another problem area; by June, compared to March, the outputs from GPT-4 and GPT-3.5 showed more formatting problems.

The study's key finding was an apparent decline over time in GPT-4's ability to follow human instructions, which seemed to be a consistent mechanism behind the behavioral changes observed across tasks. These findings demonstrate how dynamic LLM behavior can be, even over quite short time intervals.

In conclusion, this study emphasizes how crucial it is to continuously monitor and assess LLMs in order to guarantee their dependability and effectiveness across a range of applications. To encourage further study in this field, the researchers have openly shared their collection of curated questions and answers from GPT-3.5 and GPT-4, and they have also made their analysis and visualization code available.


Guarding Integrated Speech and Large Language Models: Assessing Safety and Mitigating Adversarial Threats

Recently, there's been a surge in the adoption of Integrated Speech and Large Language Models (SLMs), which can understand spoken commands and generate relevant text responses. However, concerns linger regarding their safety and robustness. LLMs, with their extensive capabilities, raise the need to address potential harm and guard against misuse by malicious users. Although developers have started training models explicitly for "safety alignment," vulnerabilities persist. Adversarial attacks, such as perturbing prompts to bypass safety measures, have been observed, even extending to vision-language models (VLMs), where attacks target image inputs.

Researchers from AWS AI Labs at Amazon have investigated the susceptibility of SLMs to adversarial attacks, focusing on their safety measures. They’ve designed algorithms that generate adversarial examples to bypass SLM safety protocols in white-box and black-box settings without human intervention. Their study demonstrates the effectiveness of these attacks, with success rates as high as 90% on average. However, they’ve also proposed countermeasures to mitigate these vulnerabilities, achieving significant success in reducing the impact of such attacks. This work provides a comprehensive examination of SLM safety and utility, offering insights into potential weaknesses and strategies for improvement.

Concerns surrounding LLMs have led to discussions on aligning them with human values like helpfulness, honesty, and harmlessness. Safety training ensures adherence to these criteria, with examples crafted by dedicated teams to deter harmful responses. However, manual prompt crafting hinders scalability, which has prompted the exploration of automatic techniques such as adversarial attacks to jailbreak LLMs. Multi-modal LLMs are particularly vulnerable, with attacks on continuous signals like images and audio. Evaluation methods vary, with preference-based LLM judges emerging as a scalable approach. This study focuses on generating adversarial perturbations to speech inputs, assessing the vulnerability of SLMs to jailbreaking.

In the study on Spoken Question-Answering (QA) tasks using SLMs, the researchers investigate adversarial attacks and defenses. Following established techniques, they explore white-box and black-box attack scenarios, targeting SLMs with tailored responses. For white-box attacks, they use the projected gradient descent (PGD) algorithm to generate perturbations that aim to enforce harmful responses. Transfer attacks involve using surrogate models to generate perturbations, which are then applied to target models. To counter adversarial attacks, they propose Time-Domain Noise Flooding (TDNF), a simple pre-processing technique that adds white Gaussian noise to input speech signals, effectively mitigating perturbations. This approach offers a practical defense against attacks on SLMs.
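For context, the following is a generic projected gradient descent sketch for perturbing an audio waveform under an L-infinity budget. It is not the paper's implementation: model_loss is a placeholder for a differentiable loss that is low when the speech language model produces the attacker's target response, and the step size, budget, and iteration count are illustrative.

import torch

def pgd_attack(waveform: torch.Tensor, model_loss, eps: float = 0.002,
               alpha: float = 0.0005, steps: int = 100) -> torch.Tensor:
    x_orig = waveform.detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = model_loss(x_adv)                    # lower loss => closer to the target response
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()     # gradient descent step on the loss
            x_adv = torch.clamp(x_adv, x_orig - eps, x_orig + eps)  # project back into the eps-ball
            x_adv = torch.clamp(x_adv, -1.0, 1.0)   # keep a valid waveform range
    return x_adv.detach()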

In the experiments, the researchers evaluated the effectiveness of the defense technique called TDNF against adversarial attacks on SLMs. TDNF involves adding random noise to the audio inputs before feeding them into the models. They found that TDNF significantly reduced the success rate of adversarial attacks across different models and attack scenarios. Even when attackers were aware of the defense mechanism, they faced challenges in evading it, resulting in reduced attack success and increased perceptibility of the perturbations. Overall, TDNF proved to be a simple yet effective countermeasure against adversarial jailbreaking threats with minimal impact on model utility.
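As described, TDNF simply adds white Gaussian noise to the waveform before inference. The following is a minimal sketch; the target signal-to-noise ratio is an illustrative parameter, not a value from the paper.

import numpy as np

def tdnf(waveform: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    # Add white Gaussian noise scaled to reach the requested signal-to-noise ratio.
    x = waveform.astype(np.float64)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.default_rng().normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise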

In conclusion, the study investigates the safety alignment of SLMs in Spoken QA applications and their vulnerability to adversarial attacks. Results show that white-box attackers can exploit barely perceptible perturbations to bypass safety alignment and compromise model integrity. Moreover, attacks crafted on one model can successfully jailbreak others, highlighting varying levels of robustness. A noise-flooding defense is effective in mitigating attacks. However, limitations include reliance on a preference model for safety assessment and limited exploration of safety-aligned text-based SLMs. Concerns about misuse prevent dataset and model release, hindering replication by other researchers.
