Meet PythiaCHEM: A Machine Learning Toolkit Designed to Develop Data-Driven Predictive Models for Chemistry

Artificial Intelligence (AI) and Machine Learning (ML) have grown significantly over the past decade, making remarkable progress in almost every field. From natural language and mathematical reasoning to pharmaceuticals, ML now drives many of the most innovative solutions in these domains. Chemistry is no exception: ML has made notable inroads there, helping researchers with complex tasks such as drug discovery and molecular property prediction.

Despite this rapid rise in popularity, ML modeling platforms still fall short when it comes to tools tailored to problems involving small and sparse datasets. Most methods need large amounts of labeled data to achieve optimal results, which compact chemistry datasets rarely provide. To address this problem, the authors of this research paper introduce PythiaCHEM, an ML toolkit specifically designed to develop predictive ML models for chemistry.

PythiaCHEM is implemented in Python and organized within Jupyter Notebooks. It builds on open-source Python libraries such as Matplotlib, Pandas, and NumPy, and can be installed with pip, which streamlines setup. Additionally, because of its modular structure, it can be integrated with other toolkits without affecting its core functionality.

The toolkit offers ML algorithms such as Decision Trees, Support Vector Machines, Logistic Regression, and many others, with the flexibility to support additional algorithms based on the needs of the user. PythiaCHEM is organized into six user-friendly modules – fingerprints, classification metrics, molecules and structures, plots, scaling, and workflow functions.
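
To make the fingerprints idea concrete, the following minimal sketch uses RDKit directly (not PythiaCHEM's own API) to turn a SMILES string into a Morgan fingerprint bit vector of the kind such a module typically produces; the molecule and parameters are arbitrary examples.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

mol = Chem.MolFromSmiles("CCO")  # ethanol, used only as a toy example
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
features = np.array(fp)          # 2048-bit vector usable as ML model input
print(features.sum(), "bits set")
```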

To evaluate the capabilities and versatility of the toolkit, the researchers tested it on two distinct chemistry tasks.

Classifying the transmembrane chloride anion transport activity of synthetic anion transporters: They analyzed the performance of several classifiers and found that Gaussian Process (GP) and Extra Trees (ET) algorithms gave the best results compared to other classifiers, with both of them performing well in terms of precision and recall, i.e., they were able to classify both positive and negative class predictions accurately. Further analysis with SHAP highlighted that GP focuses on experimental conditions, whereas ET emphasizes specific molecular properties.
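
The snippet below is a hedged illustration of this kind of classifier comparison and SHAP analysis, written against scikit-learn and SHAP directly rather than PythiaCHEM's own notebooks; the descriptor matrix and labels are synthetic stand-ins for the transporter dataset.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                   # hypothetical molecular/experimental descriptors
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # hypothetical active/inactive labels

for name, clf in [("Extra Trees", ExtraTreesClassifier(n_estimators=200, random_state=0)),
                  ("Gaussian Process", GaussianProcessClassifier(random_state=0))]:
    f1 = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {f1.mean():.2f}")

# SHAP attribution for the tree-based model, mirroring the feature-importance analysis
import shap
et = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(et).shap_values(X)  # per-class, per-feature attributions
```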

Predicting the enantioselectivity in the Strecker synthesis of α-amino acids: The researchers assessed the predictions of different ML models for this task. As per their findings, the LASSOCV model performed best among all the models and revealed important electronic and steric descriptors, giving valuable insight into the factors that affect the selectivity of this reaction.

In conclusion, PythiaCHEM is an open-source ML toolkit specifically suited for chemistry tasks involving small datasets. It provides a high level of flexibility and automation through the use of Jupyter Notebooks, making it an invaluable resource for beginners and experts alike. The researchers illustrated the use of the toolkit on two different chemistry tasks, showcasing its capabilities. Through this platform, the authors of this research paper aim to foster a deeper understanding of ML models and facilitate the development of powerful applications for the field of chemistry.


Build a vaccination verification solution using the Queries feature in Amazon Textract

Amazon Textract is a machine learning (ML) service that enables automatic extraction of text, handwriting, and data from scanned documents, surpassing traditional optical character recognition (OCR). It can identify, understand, and extract data from tables and forms with remarkable accuracy. Presently, several companies rely on manual extraction methods or basic OCR software, which is tedious and time-consuming, and requires manual configuration that needs updating when the form changes. Amazon Textract helps solve these challenges by utilizing ML to automatically process different document types and accurately extract information with minimal manual intervention. This enables you to automate document processing and use the extracted data for different purposes, such as automating loans processing or gathering information from invoices and receipts.
As travel resumes post-pandemic, verifying a traveler’s vaccination status may be required in many cases. Hotels and travel agencies often need to review vaccination cards to gather important details like whether the traveler is fully vaccinated, vaccine dates, and the traveler’s name. Some agencies do this through manual verification of cards, which can be time-consuming for staff and leaves room for human error. Others have built custom solutions, but these can be costly and difficult to scale, and take significant time to implement. Moving forward, there may be opportunities to streamline the vaccination status verification process in a way that is efficient for businesses while respecting travelers’ privacy and convenience.
Amazon Textract Queries helps address these challenges. It allows you to specify and extract only the pieces of information you need from a document, and returns precise and accurate answers.
In this post, we walk you through a step-by-step implementation guide to build a vaccination status verification solution using Amazon Textract Queries. The solution showcases how to process vaccination cards using an Amazon Textract query, verify the vaccination status, and store the information for future use.
Solution overview
The following diagram illustrates the solution architecture.
The workflow includes the following steps:

The user takes a photo of a vaccination card.
The image is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
When the image gets saved in the S3 bucket, it invokes an AWS Step Functions workflow:
The Queries-Decider AWS Lambda function examines the document passed in and adds information about the mime type, the number of pages, and the number of queries to the Step Functions workflow (for our example, we have four queries).
NumberQueriesAndPagesChoice is a Choice state that adds conditional logic to a workflow. If there are between 15–31 queries and the number of pages is between 2–3,001, then Amazon Textract asynchronous processing is the only option, because synchronous APIs only support up to 15 queries and one-page documents. For all other cases, we route to the random selection of synchronous or asynchronous processing.
The TextractSync Lambda function sends a request to Amazon Textract to analyze the document based on the following Amazon Textract queries (a minimal boto3 sketch of such a request appears after this list):

What is Vaccination Status?
What is Name?
What is Date of Birth?
What is Document Number?

Amazon Textract analyzes the image and sends the answers of these queries back to the Lambda function.
The Lambda function verifies the customer’s vaccination status and stores the final result in CSV format in the same S3 bucket (demoqueries-textractxxx) in the csv-output folder.
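
For readers who want to see what the TextractSync request looks like outside the Step Functions workflow, here is a minimal boto3 sketch. It assumes the card image already sits in the solution's upload bucket; the bucket and key names below are placeholders.

```python
import boto3

textract = boto3.client("textract")

response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "demoqueries-textractxxx", "Name": "uploads/vac_card.jpg"}},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is Vaccination Status?"},
            {"Text": "What is Name?"},
            {"Text": "What is Date of Birth?"},
            {"Text": "What is Document Number?"},
        ]
    },
)

# QUERY_RESULT blocks carry the extracted answers and their confidence scores.
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block["Text"], block.get("Confidence"))
```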

Prerequisites
To complete this solution, you should have an AWS account and the appropriate permissions to create the resources required as part of the solution.
Download the deployment code and sample vaccination card from GitHub.
Use the Queries feature on the Amazon Textract console
Before you build the vaccination verification solution, let’s explore how you can use Amazon Textract Queries to extract vaccination status via the Amazon Textract console. You can use the vaccination card sample you downloaded from the GitHub repo.

On the Amazon Textract console, choose Analyze Document in the navigation pane.
Under Upload document, choose Choose document to upload the vaccination card from your local drive.
After you upload the document, select Queries in the Configure Document section.
You can then add queries in the form of natural language questions. Let’s add the following:

What is Vaccination Status?
What is Name?
What is Date of Birth?
What is Document Number?

After you add all your queries, choose Apply configuration.
Check the Queries tab to see the answers to the questions.

You can see Amazon Textract extracts the answer to your query from the document.
Deploy the vaccination verification solution
In this post, we use an AWS Cloud9 instance and install the necessary dependencies on the instance with the AWS Cloud Development Kit (AWS CDK) and Docker. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser.

In the terminal, choose Upload Local Files on the File menu.
Choose Select folder and choose the vaccination_verification_solution folder you downloaded from GitHub.
In the terminal, prepare your serverless application for subsequent steps in your development workflow in AWS Serverless Application Model (AWS SAM) using the following command:

$ cd vaccination_verification_solution/
$ pip install -r requirements.txt

Deploy the application using the cdk deploy command:

cdk deploy DemoQueries --outputs-file demo_queries.json --require-approval never
Wait for the AWS CDK to deploy the model and create the resources mentioned in the template.
When deployment is complete, you can check the deployed resources on the AWS CloudFormation console on the Resources tab of the stack details page.

Test the solution
Now it’s time to test the solution. To trigger the workflow, use aws s3 cp to upload the vac_card.jpg file from the docs folder to DemoQueries.DocumentUploadLocation:

aws s3 cp docs/vac_card.JPG $(aws cloudformation list-exports --query 'Exports[?Name==`DemoQueries-DocumentUploadLocation`].Value' --output text)

The vaccination certificate file automatically gets uploaded to the S3 bucket demoqueries-textractxxx in the uploads folder.
The Step Functions workflow is triggered via a Lambda function as soon as the vaccination certificate file is uploaded to the S3 bucket.
The Queries-Decider Lambda function examines the document and adds information about the mime type, the number of pages, and the number of queries to the Step Functions workflow (for this example, we use four queries—document number, customer name, date of birth, and vaccination status).
The TextractSync function sends the input queries to Amazon Textract and synchronously returns the full result as part of the response. It supports 1-page documents (TIFF, PDF, JPG, PNG) and up to 15 queries. The GenerateCsvTask function takes the JSON output from Amazon Textract and converts it to a CSV file.
The final output is stored in the same S3 bucket in the csv-output folder as a CSV file.
You can download the file to your local machine using the following command:

aws s3 cp <paste the S3 URL from TextractOutputCSVPath> .

The format of the result is timestamp, classification, filename, page number, key name, key_confidence, value, value_confidence, key_bb_top, key_bb_height, key_bb_width, key_bb_left, value_bb_top, value_bb_height, value_bb_width, value_bb_left.
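
As a hedged sketch, you can inspect the downloaded file with pandas using the column layout listed above; whether the CSV includes a header row is an assumption here, so adjust the header/names arguments as needed.

```python
import pandas as pd

columns = ["timestamp", "classification", "filename", "page_number",
           "key_name", "key_confidence", "value", "value_confidence",
           "key_bb_top", "key_bb_height", "key_bb_width", "key_bb_left",
           "value_bb_top", "value_bb_height", "value_bb_width", "value_bb_left"]

df = pd.read_csv("result.csv", header=None, names=columns)  # assumes no header row in the file
vaccination = df[df["key_name"].str.contains("Vaccination Status", case=False, na=False)]
print(vaccination[["value", "value_confidence"]])
```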
You can scale the solution to hundreds of vaccination certificate documents for multiple customers by uploading their vaccination certificates to DemoQueries.DocumentUploadLocation. This automatically triggers multiple runs of the Step Functions state machine, and the final result is stored in the same S3 bucket in the csv-output folder.
To change the initial set of queries that are fed into Amazon Textract, you can go to your AWS Cloud9 instance and open the start_execution.py file. In the file view in the left pane, navigate to lambda, start_queries, app, start_execution.py. This Lambda function is invoked when a file is uploaded to DemoQueries.DocumentUploadLocation. The queries sent to the workflow are defined in start_execution.py; you can change those by updating the code as shown in the following screenshot.
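
A hypothetical excerpt of the kind of query list defined in start_execution.py is shown below; the actual variable names and structure in the repository may differ, so treat this purely as an illustration of what you would edit.

```python
# Hypothetical query definition; Textract query objects accept a Text and an optional Alias.
QUERIES = [
    {"Text": "What is Vaccination Status?", "Alias": "VACCINATION_STATUS"},
    {"Text": "What is Name?", "Alias": "NAME"},
    {"Text": "What is Date of Birth?", "Alias": "DATE_OF_BIRTH"},
    {"Text": "What is Document Number?", "Alias": "DOCUMENT_NUMBER"},
]
```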
Clean up
To avoid incurring ongoing charges, delete the resources created in this post using the following command:

cdk destroy DemoQueries

Answer the question Are you sure you want to delete: DemoQueries (y/n)? with y.
Conclusion
In this post, we showed you how to use Amazon Textract Queries to build a vaccination verification solution for the travel industry. You can use Amazon Textract Queries to build solutions in other industries like finance and healthcare, and retrieve information from documents such as paystubs, mortgage notes, and insurance cards based on natural language questions.
For more information, see Analyzing Documents, or check out the Amazon Textract console and try out this feature.

About the Authors
Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.
Rishabh Yadav is a Partner Solutions architect at AWS with an extensive background in DevOps and Security offerings at AWS. He works with ASEAN partners to provide guidance on enterprise cloud adoption and architecture reviews along with building AWS practices through the implementation of the Well-Architected Framework. Outside of work, he likes to spend his time in the sports field and FPS gaming.

Facebook Retargeting Isn’t Dead. It Just Needs a Jump Start

Here at Customers.ai, we like to say our platform helps you remarket like it’s 2019 again.

The glory days of remarketing, when audiences were large, costs were small, and the Facebook ad pixel could identify ~50% of website visitors.

When remarketing was the pinnacle of Facebook ads and even a dollar per day could garner you results. 

Of course, we all know what happened – iOS 14 rolled out and remarketing became, well, not so great.

Then iOS 17 rolled out, Click IDs and cross-channel tracking were minimized, and our already shrinking audiences got even smaller. 

Where does that bring us now?

Sadly, there’s an 84% drop in efficiency and you’re lucky if the ad pixel catches 7% of site traffic today.

Advertisers have moved their budgets elsewhere and there is definitely a belief that retargeting is dead.

So, should we just say our goodbyes and move on with our lives?

Not a chance!

Remarketing isn’t dead. We just have to give it a jump start.

Expand Your Retargeting Audiences with First-Party Data

One of the main problems with retargeting is that audience sizes have become so small that you can’t reach the number of people you need to in a cost-effective way. 

Whereas Facebook used to be able to identify a large number of your website visitors and put them into your remarketing campaigns, it can’t do that anymore (we haven’t even talked about the Chrome cookie situation which will make it even worse).

This is where first-party data comes into play. 

If the Facebook pixel isn’t going to capture your website visitors, then you need to do it yourself. 

With the Customers.ai Website Visitor ID X-Ray pixel, you can capture the names, emails, locations, phone numbers, etc. of people coming to your site. 

Those people can then be put into workflows, including your Facebook retargeting audiences. 

What’s important about the direct Meta integration is that we not only ID users, encrypt the data, and send it securely to Meta’s ad servers; the integration also runs over a server-to-server connection, allowing user IDs to remain in place and tracking to remain functional.

The great thing about this is that you can grow your audiences, grow your reach, and target users who are genuinely interested in you with almost no effort.  

It’s silly how effective this is.


Inform & Train Advantage+ Audiences

One piece of feedback we’ve gotten from Facebook advertisers is that they have shifted away from retargeting and are focused on using Advantage+ audiences.

If you aren’t familiar, Advantage+ is Meta’s AI audience generator. It relies on past conversions, Pixel data, interactions with previous ads, and user behavior. 

Facebook’s “Advantage Plus” ads targeting can’t match the precision of actual first-party data.

However, you do have the option of providing targeting suggestions that Meta will initially prioritize.

You can add custom audiences, age ranges, genders, and even interest and behavior targets, to help point Meta’s AI in the right direction.

So if you aren’t running remarketing campaigns but are using Advantage+ audiences, you can take your first-party data and build custom audiences to inform and train Advantage+. 

When it comes to AI, the more data you can provide, the smarter it will be.

Your website visitors are your target audience. They are the people you want to reach. 

Use your website visitor data to inform and train Meta AI.

Build Better Lookalike Audiences

I’ll keep this part short but the bottom line is this – lookalike audiences are only as valuable as the data you provide.

If you are building really solid remarketing lists or using first-party data to inform your Advantage+ audiences, then you can use those same audiences to create better lookalike audiences.

Here’s a tip – let’s say you want to build a really solid lookalike audience.

Do you want to base it on everyone who has visited your site? 

No. 

You want to base it on your high-intent prospects and existing customers. 

Put X-ray on these ‘high-intent’ pages (think pricing pages or shopping carts) and start building your audience. 

This way, you’ll end up with a list of people who aren’t on your site to just read a blog post or window shop, but instead are truly interested in what you have to offer.

Remarket Like It’s 2019

Customers.ai is giving advertisers the ability to bring remarketing back from the dead and return it to its former days of glory.

Ad targeting is hard and with all of the privacy changes that continue to roll out, it’s only getting harder. 

Shouldn’t you be doing everything you can to ensure you are reaching your exact audience and driving the best ROAS possible?

The Customers.ai Website Visitor ID X-Ray pixel is free (and easy) to install. 

See for yourself how many visitors you can capture and just how fast and efficiently you can 3x your Facebook remarketing campaigns.

Start remarketing like it’s 2019 again!




This AI Paper from Johns Hopkins and Microsoft Revolutionizes Machine Translation with ALMA-R: A Smaller Sized LLM Model Outperforming GPT-4

Machine translation, a crucial aspect of natural language processing, has advanced significantly. Yet a primary challenge persists: producing translations that go beyond mere adequacy to reach near perfection. Traditional methods, while effective, are limited by their reliance on large datasets and supervised fine-tuning (SFT), which constrains the quality of the output.

Recent developments in the field have brought attention to moderate-sized large language models (LLMs), such as the ALMA models, which have shown promise in machine translation. However, the efficacy of these models is often constrained by the quality of reference data used in training. Researchers have recognized this issue and explored novel training methodologies to enhance translation performance.

The researchers introduce Contrastive Preference Optimization (CPO), a new approach to refining machine translation training. This method diverges from traditional supervised fine-tuning by going beyond aligning model outputs with gold-standard references: CPO trains models to distinguish between merely ‘adequate’ and ‘near-perfect’ translations, pushing the boundaries of translation quality.

The mechanics of CPO are intriguing. It employs a contrastive learning strategy that utilizes hard negative examples, a significant shift from the usual practice of minimizing cross-entropy loss. This approach allows the model to develop a preference for generating superior translations while learning to reject high-quality but not flawless ones.
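
A minimal PyTorch sketch of a contrastive preference loss in this spirit is shown below: it pushes the model to assign higher sequence log-probability to the preferred (near-perfect) translation than to the merely adequate one, plus a likelihood term on the preferred output. This illustrates the idea only and is not claimed to reproduce the paper's exact CPO objective.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_preferred: torch.Tensor,
                    logp_dispreferred: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """logp_*: summed token log-probabilities of each translation under the model."""
    margin = beta * (logp_preferred - logp_dispreferred)
    # -log sigmoid(margin): small when the preferred translation is clearly favored.
    contrastive = -F.logsigmoid(margin).mean()
    # Likelihood term keeps the model anchored to the preferred outputs.
    nll = -logp_preferred.mean()
    return contrastive + nll
```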

The results of implementing CPO have been nothing short of remarkable. The method has demonstrated a substantial leap in translation quality when applied to ALMA models. The enhanced model, referred to as ALMA-R, has showcased performance that matches or surpasses that of the leading models in the field, such as GPT-4. This improvement was achieved with minimal resource investment – a notable achievement in machine translation.

A detailed examination of the ALMA-R model’s performance reveals its superiority over existing methods. It excels in various test datasets, including those from the WMT competitions, setting new translation accuracy and quality standards. These results highlight the potential of CPO as a transformative tool in machine translation, offering a new direction away from traditional training methodologies that rely heavily on extensive datasets.

In conclusion, the introduction of Contrastive Preference Optimization marks a significant advancement in the field of neural machine translation. By focusing on the quality of translations rather than the quantity of training data, this novel methodology paves the way for more efficient and accurate language models. It challenges existing assumptions about machine translation, setting a new benchmark in the field and opening up possibilities for future research and development.


UCLA Researchers Introduce Group Preference Optimization (GPO): A Machine Learning-based Alignment Framework that Steers Language Models to Preferences of Individual Groups in a Few-Shot Manner

Large Language Models (LLMs) are increasingly employed for various domains, with use cases including creative writing, chatbots, and semantic search. Many of these applications are inherently subjective and require generations catering to different demographics, cultural and societal norms, or individual preferences. Through their large-scale training, current language models are exposed to diverse data that allows them to represent many such opinions. However, expressing these diverse opinions requires steering the LLM generations to user requirements.

The researchers at the University of California, Los Angeles (UCLA) introduced Group Preference Optimization (GPO), a pioneering approach to efficiently aligning large language models (LLMs) with the diverse preferences of user groups. This alignment is critical for applications involving subjective judgments across varied user demographics. GPO addresses the challenges of existing alignment algorithms, which are characterized by high costs and the need for extensive group-specific preference data and computational resources.

The GPO framework incorporates an independent transformer module, enhancing the base LLM. This module is trained to predict the preferences of specific user groups for LLM-generated content. The parameterization of this module as an in-context autoregressive transformer facilitates few-shot learning, and its training is accomplished through meta-learning on multiple user groups.

Key components of GPO include leveraging few-shot learning to enable the model to adapt to group preferences with minimal data and utilizing meta-learning to train the independent transformer module on diverse user groups, allowing rapid adaptation to new preferences.
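
The meta-learning recipe can be pictured with the generic sketch below: for each sampled user group, a few labeled preference examples act as an in-context "support" set, and the module is trained to predict held-out preferences for the same group. The preference_module interface, data layout, and MSE objective here are placeholders for illustration, not the paper's implementation.

```python
import random
import torch
import torch.nn.functional as F

def meta_train(preference_module, groups, optimizer, steps=1000, k_support=8, k_query=8):
    """groups: list of groups; each group is a list of dicts with 'response' and 'preference'."""
    for _ in range(steps):
        group = random.choice(groups)              # sample one user group per step
        random.shuffle(group)
        support = group[:k_support]                # in-context examples for this group
        query = group[k_support:k_support + k_query]
        optimizer.zero_grad()
        # The module conditions on the support examples in-context and scores the query responses.
        predictions = preference_module(support, [q["response"] for q in query])
        targets = torch.tensor([q["preference"] for q in query], dtype=torch.float32)
        loss = F.mse_loss(predictions, targets)
        loss.backward()
        optimizer.step()
```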

Empirical validation was conducted through rigorous evaluations using LLMs of varying sizes. Three human opinion adaptation tasks were considered: aligning with the preferences of US demographic groups, global countries, and individual users. GPO’s performance is compared with existing strategies like in-context steering and fine-tuning methods.

The findings demonstrate that GPO achieves more accurate alignment with group preferences and requires fewer group-specific preferences and reduced training and inference computing resources. This underscores GPO’s efficiency and effectiveness in comparison to existing approaches.

Overall, GPO presents a promising solution for efficiently aligning LLMs with the preferences of diverse user groups, making it particularly applicable to real-world scenarios where nuanced subjective judgments are essential. The emphasis on few-shot learning, meta-learning, and the incorporation of the independent transformer module distinguishes GPO from existing strategies.


ByteDance AI Research Unveils Reinforced Fine-Tuning (ReFT) Method to Enhance the Generalizability of Learning LLMs for Reasoning with Math Problem Solving as an Example

One effective method to improve the reasoning skills of LLMs is to employ supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations. However, this approach has limitations in terms of generalization because it heavily depends on the provided CoT data. In scenarios like math problem-solving, each question in the training data typically has only one annotated reasoning path. In the ideal case, it would be more beneficial for the algorithm to learn from multiple annotated reasoning paths associated with a given question, as this could enhance its overall performance and adaptability.

Researchers from ByteDance Research lab suggest a practical method known as Reinforced Fine-Tuning (ReFT) to improve the generalization capabilities of learning LLMs for reasoning, using math problem-solving as an illustrative example. The ReFT approach begins by initially warming the model through SFT. Subsequently, it leverages online reinforcement learning, specifically employing the Proximal Policy Optimization (PPO) algorithm. During this fine-tuning process, the model is exposed to various reasoning paths automatically sampled based on the given question. The rewards for reinforcement learning come naturally from the ground-truth answers, contributing to a more robust and adaptable LLM for enhanced reasoning abilities.
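
Because the rewards are derived directly from the ground-truth answers, the reward signal can be as simple as an exact-match check on the final answer extracted from a sampled chain of thought. The sketch below assumes a "The answer is ..." convention for the final answer, which is an illustrative assumption rather than the paper's exact parsing rule.

```python
import re

def answer_reward(sampled_cot: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer in the sampled chain of thought matches the ground truth."""
    match = re.search(r"The answer is\s*(-?[\d.,]+)", sampled_cot)
    if match is None:
        return 0.0                                   # no parsable final answer
    predicted = match.group(1).replace(",", "").rstrip(".")
    return 1.0 if predicted == ground_truth.strip() else 0.0
```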

Recent research efforts have focused on improving CoT prompt design and data engineering, aiming to make CoT comprehensive and fine-grained for step-by-step reasoning solutions. Some approaches have used Python programs as CoT prompts, demonstrating more accurate reasoning steps and significant improvements over natural language CoT. Another line of work focuses on improving the quality and quantity of CoT data, including efforts to increase the amount of CoT data from OpenAI’s ChatGPT. Reinforcement learning has been applied to fine-tuning paradigms to improve performance over conventional supervised fine-tuning, specifically for solving math problems. 

The study proposes ReFT to enhance the generalizability of learning LLMs for reasoning, specifically in math problem-solving. ReFT combines SFT with online reinforcement learning using the PPO algorithm. The model is first warmed with SFT and then fine-tuned using reinforcement learning, where multiple reasoning paths are automatically sampled given the question, and rewards are derived from ground-truth answers. In addition, inference-time strategies such as majority voting and re-ranking are combined with ReFT to boost performance further.

The ReFT method significantly outperforms SFT regarding reasoning capability and generalizability for LLMs in math problem-solving. Extensive experiments on GSM8K, MathQA, and SVAMP datasets demonstrate the better performance of ReFT over SFT. The performance of ReFT can be further boosted by combining inference-time strategies such as majority voting and re-ranking. They use Python programs as CoT prompts, showing more accurate reasoning steps and significant improvements over natural language CoT. Previous work on reinforcement learning and reranking has also demonstrated better performance over supervised fine-tuning and majority voting.

In conclusion, ReFT stands out as a fine-tuning method for enhancing models in solving math problems. Unlike SFT, ReFT optimizes a non-differentiable objective by exploring multiple CoT annotations rather than relying on a single one. Extensive experiments across three datasets using two foundational models have shown that ReFT surpasses SFT in performance and generalization. Models trained with ReFT are compatible with techniques like majority voting and reward-model reranking. ReFT outperforms several open-source models of similar size in math problem-solving, highlighting its effectiveness and practical value.


Fireworks AI Introduces FireAttention: A Custom CUDA Kernel Optimized for Multi-Query Attention Models

Mixture-of-Experts (MoE) is an architecture based on the “divide and conquer” principle to solve complex tasks. Multiple individual machine learning (ML) models (called experts) work individually based on their specializations to provide the most optimal results. To better understand their use cases, Mistral AI recently released Mixtral, an open-source high-quality MoE model that outperformed or matched GPT-3.5 on most standard benchmarks and was first hosted on Fireworks AI’s platform.

Although the platform demonstrated an impressive inference speed of up to 175 tokens/sec, the researchers at Fireworks AI have tried to improve the efficiency of serving MoE models further without significantly impacting quality. They introduce a large language model (LLM) serving stack with FP16- and FP8-based FireAttention, which delivers up to a four-times speed-up over other open-source software. FireAttention is a custom CUDA kernel optimized for multi-query attention models like Mixtral and for FP16 and FP8 support in hardware. 

Quantization methods like SmoothQuant and AWQ fell short of improving the model performance, especially during generation. The main reason for that is LLM activations have non-uniform distribution, which is challenging for integer methods. On the contrary, FP8 leverages hardware support, which makes it flexible to deal with such distributions.

For evaluation, the researchers have considered a very general setup of prompt length 1K and the number of generated tokens as 50, which covers long prompt and short generation use cases. Their quality and performance study is based on the Mixtral model. They focused on language understanding and used the MMLU metric for measuring the model quality. The MMLU metric consists of enough test data examples, and the Mixtral model also performs quite well on it, allowing for easy detection of any quantization error. For assessing the latency and throughput, they used the following two metrics: token generation latency for a given number of requests/second (RPS) and total request latency for a given RPS.
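
To make those two metrics concrete, here is a hedged sketch of a load generator that measures both quantities at a fixed request rate; `generate` is a placeholder for a streaming call to whichever serving endpoint is under test, and the sleep-based stub stands in for real token streaming.

```python
import asyncio, time

async def generate(prompt: str, max_tokens: int = 50):
    # Placeholder: yield one token at a time from the serving system under test.
    for _ in range(max_tokens):
        await asyncio.sleep(0.01)
        yield "tok"

async def one_request(prompt: str, stats: list):
    start = last = time.perf_counter()
    token_gaps = []
    async for _ in generate(prompt):
        now = time.perf_counter()
        token_gaps.append(now - last)                 # per-token generation latency
        last = now
    stats.append({"total_latency": last - start,
                  "mean_token_latency": sum(token_gaps) / len(token_gaps)})

async def run(rps: float = 4.0, duration_s: float = 5.0):
    stats, tasks = [], []
    prompt = "x" * 1000                               # stand-in for a ~1K-token prompt
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        tasks.append(asyncio.create_task(one_request(prompt, stats)))
        await asyncio.sleep(1.0 / rps)                # hold the target requests/second
    await asyncio.gather(*tasks)
    return stats

# stats = asyncio.run(run())
```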

The results show that the Fireworks FP16 Mixtral model implementation is superior to that of vLLM (a high-throughput and memory-efficient inference and serving engine for LLMs). Moreover, the FP8 implementation is significantly better than the already efficient FP16 one. Additionally, it reduces the model size by two times and, therefore, allows for a more efficient deployment. When it is combined with the memory bandwidth and FLOPs speed-ups, it leads to a considerable improvement (2x) of the effective requests/second. Lastly, as there is no one-size-fits-all approach regarding the performance of LLMs, different vLLM and Fireworks LLM service configurations show their strengths in different setups.

In conclusion, the FireAttention FP16 and FP8 implementations provide a remarkable accuracy-performance tradeoff for LLM serving. More specifically, FP8 shrinks the model size by a factor of two and improves the number of effective requests/second by roughly the same factor, highlighting its superiority over previous quantization methods. This research marks a significant step toward even more efficient serving of MoE models like Mixtral with negligible impact on quality.

Assessing Natural Language Generation (NLG) in the Age of Large Language Models: A Comprehensive Survey and Taxonomy

The Natural Language Generation (NLG) field stands at the intersection of linguistics and artificial intelligence. It focuses on the creation of human-like text by machines. Recent advancements in Large Language Models (LLMs) have revolutionized NLG, significantly enhancing the ability of systems to generate coherent and contextually relevant text. This evolving field necessitates robust evaluation methodologies to assess the quality of the generated content accurately.

The central challenge in NLG is ensuring that the generated text not only mimics human language in fluency and grammar but also aligns with the intended message and context. Traditional evaluation metrics like BLEU and ROUGE primarily assess surface-level text differences, falling short in evaluating semantic aspects. This limitation hinders progress in the field and can lead to misleading research conclusions. The emerging use of LLMs for evaluation promises a more nuanced and human-aligned assessment, addressing the need for more comprehensive methods.  

The researchers from WICT Peking University, Institute of Information Engineering CAS, UTS, Microsoft, and UCLA present a comprehensive study that can be broken into five sections:

Introduction

Formalization and Taxonomy

Generative Evaluation

Benchmarks and Tasks

Open Problems

1. Introduction:

The introduction sets the stage for the survey by presenting the significance of NLG in AI-driven communication. It highlights the evolution brought by LLMs like GPT-3 in generating text across various applications. The introduction stresses the need for robust evaluation methodologies to gauge generated content’s quality accurately. It critiques traditional NLG evaluation metrics for their limitations in assessing semantic aspects and the emergence of LLMs as a promising solution for a more nuanced evaluation.

2. Formalization and Taxonomy:

This survey provides a formalization of LLM-based NLG Evaluation tasks. It outlines a framework for assessing candidate generations across dimensions like fluency and consistency. The taxonomy categorizes NLG evaluation into dimensions: evaluation task, evaluation references, and evaluation function. Each dimension addresses various aspects of NLG tasks, offering insights into their strengths and limitations in distinct contexts. The approach classifies tasks like Machine Translation, Text Summarization, Dialogue Generation, Story Generation, Image Captioning, Data-to-Text generation, and General Generation.

3. Generative Evaluation:

The study explores the high-capacity generative abilities of LLMs in evaluating NLG text, distinguishing between prompt-based and tuning-based evaluations. It discusses different scoring protocols, including score-based, probability-based, Likert-style, pairwise comparison, ensemble, and advanced evaluation methods. The study provides a detailed exploration of these evaluation methods, accompanied by their respective evaluation protocols, and how they cater to diverse evaluation needs in NLG.
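
As an illustration of one of the surveyed protocols, the sketch below shows a pairwise-comparison "LLM as judge" evaluation; `ask_llm` is a placeholder for whatever LLM API serves as the evaluator, and the prompt wording is an assumption rather than a template from the survey.

```python
PAIRWISE_TEMPLATE = """You are evaluating two summaries of the same article.

Article:
{source}

Summary A:
{candidate_a}

Summary B:
{candidate_b}

Which summary is more faithful and fluent? Answer with exactly "A", "B", or "Tie"."""

def pairwise_judge(source: str, candidate_a: str, candidate_b: str, ask_llm) -> str:
    prompt = PAIRWISE_TEMPLATE.format(source=source, candidate_a=candidate_a,
                                      candidate_b=candidate_b)
    verdict = ask_llm(prompt).strip()
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"  # fall back on unparsable output
```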

4. Benchmarks and Tasks:

This study presents a comprehensive overview of various NLG tasks and the meta-evaluation benchmarks used to validate the effectiveness of LLM-based evaluators. It discusses benchmarks in Machine Translation, Text Summarization, Dialogue Generation, Image Captioning, Data-to-Text, Story Generation, and General Generation. It provides insights into how these benchmarks assess the agreement between automatic evaluators and human preferences.

5. Open Problems:

The research addresses the unresolved challenges in the field. It discusses the biases inherent in LLM-based evaluators, the robustness issues of these evaluators, and the complexities surrounding domain-specific evaluation. The study emphasizes the need for more flexible and comprehensive evaluation methods capable of adapting to complex instructions and real-world requirements, highlighting the gap between current evaluation methods and the evolving capabilities of LLMs.

In conclusion, the survey of LLM-based methods for NLG evaluation highlights a significant shift in assessing generated content. These methods offer a more sophisticated and human-aligned approach, addressing the limitations of traditional evaluation metrics. Using LLMs introduces a nuanced understanding of text quality, encompassing semantic coherence and creativity. This advancement marks a pivotal step towards more accurate and comprehensive evaluations in NLG, promising to enhance the reliability and effectiveness of these systems in real-world applications.


Parameter-Efficient Sparsity Crafting (PESC): A Novel AI Approach to Transition Dense Models to Sparse Models Using a Mixture-of-Experts (MoE) Architecture

The emergence of large language models (LLMs) like GPT, Claude, Gemini, LLaMA, Mistral, etc., has greatly accelerated recent advances in natural language processing (NLP). Instruction tuning is a well-known approach to training LLMs: it lets them refine their pre-trained representations to follow human instructions using large-scale, well-formatted instruction data. However, these tasks are complex in and of themselves, making fine-tuning difficult. For general tasks, models may not be able to minimize the losses from competing tasks, leading to poor performance.

Increasing the model’s capacity can enhance the efficacy of instruction tuning for general tasks. Most LLMs, however, are dense pre-trained models built on the transformer architecture, which severely restricts scalability during instruction tuning. Turning dense models into MoE models offers the chance to obtain outstanding performance on general tasks; to make this change, the MoE models’ expert layers are initially set up as duplicates of the original feedforward neural network (FFN) layers. Training such massive models is hindered by computational costs and GPU memory constraints, because the large parameter scale of existing LLMs requires updating the expert weights in every MoE layer. 

New research by the Shanghai Artificial Intelligence Laboratory and The Chinese University of Hong Kong presents Parameter-Efficient Sparsity Crafting (PESC), a method for transforming dense models into sparse ones using the MoE blueprint. By integrating adapters into sparse models’ MoE layers, PESC makes it possible to differentiate experts without changing their weights individually. This method drastically cuts down on GPU memory needs and computational expenses. Because adapters are integrated, the model capacity can be expanded with minimal increase in parameters.

To differentiate across experts without changing the weights of each expert in the MoE layers, PESC inserts adapters into the MoE layers of sparse models. The researchers also update other sparse model weights using the QLoRA methodology, a popular PEFT method. 
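
The core idea can be sketched in PyTorch as follows: every "expert" reuses the original (frozen) FFN weights and differs only through a small trainable adapter, so upcycling a dense layer into an MoE layer adds few new parameters. The shapes, router, and adapter design here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AdapterExpert(nn.Module):
    def __init__(self, shared_ffn: nn.Module, d_model: int, r: int = 16):
        super().__init__()
        self.shared_ffn = shared_ffn                   # frozen copy of the dense FFN
        self.down = nn.Linear(d_model, r, bias=False)  # small trainable adapter
        self.up = nn.Linear(r, d_model, bias=False)

    def forward(self, x):
        return self.shared_ffn(x) + self.up(torch.relu(self.down(x)))

class PESCStyleMoELayer(nn.Module):
    def __init__(self, dense_ffn: nn.Module, d_model: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        for p in dense_ffn.parameters():
            p.requires_grad = False                    # expert (FFN) weights stay frozen
        self.experts = nn.ModuleList(AdapterExpert(dense_ffn, d_model) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out
```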

The researchers simultaneously trained the sparse model with MoE layers on various skills, including coding, mathematics, and other general talents from many areas, to illustrate the model’s learning capabilities. For instruction tuning, this training integrated three separate datasets from different domains: SlimORCA, Magicoder, and MetaMathQA datasets. The final dataset included 520k instructions after filtering and sampling.

Furthermore, they have utilized the PESC method to create the Camelidae family of sparse models. Camelidae-8×34B outperforms GPT-3.5 overall and reaches SOTA performance among all open-source sparse models.


InstantX Team Unveils InstantID: A Groundbreaking AI Approach to Efficient, High-Fidelity Personalized Image Synthesis Using Just One Image

A crucial area of interest is generating images from text, particularly focusing on preserving human identity accurately. This task demands high detail and fidelity, especially when dealing with human faces involving complex and nuanced semantics. While existing models adeptly handle general styles and objects, they often fall short when producing images that must maintain the intricate identity details of human subjects.

The main challenge this research addresses is enhancing the controllability and fidelity of image generation from text, specifically for human subjects. Existing methods, which rely on detailed textual descriptions, often fail to achieve a strong semantic connection with the desired identity in the generated images. The objective is to create a method that effectively balances high fidelity to the reference image with the flexibility to create diverse images based on that identity without demanding extensive resources or multiple reference images.

Present approaches in personalized image generation can be broadly categorized into two types: methods requiring fine-tuning during testing and those that do not. While accurate, fine-tuning methods, like DreamBooth and Textual Inversion, are resource-heavy and impractical for scenarios with limited data. On the other hand, methods that bypass fine-tuning during inference often fall short in creating high-fidelity, customized images due to their reliance on CLIP’s image encoder, which generates only weakly aligned signals.

The researchers from the InstantX Team have developed InstantID, an innovative approach focusing on instant identity-preserving image synthesis. This method distinguishes itself by its simplicity, efficiency, and ability to handle image personalization in any style using just one facial image while maintaining high fidelity. InstantID employs a novel face encoder to retain intricate details by adding strong semantic and weak spatial conditions, incorporating facial images, landmark images, and textual prompts to guide the image generation process. The key aspects of InstantID are its plug-and-play nature, compatibility with pre-trained models and its tuning-free inference process.

InstantID’s performance is notable for its ability to preserve facial identity with remarkable fidelity using only a single reference image. It achieves this through a novel face encoder that captures detailed identity semantics. This highly economical and practical method makes it an ideal solution for various real-world applications. InstantID’s unique approach includes:

Innovative Face Encoder: Unlike previous methods relying on a CLIP image encoder, InstantID uses a face encoder for stronger semantic detail capture, ensuring high fidelity in ID preservation.

Efficient and Practical: It requires no fine-tuning during inference, making it highly economical and practical for real-world applications.

Superior Performance: Even with a single reference image, InstantID achieves state-of-the-art results, surpassing the performance of training-based methods that rely on multiple reference images.

In summary, InstantID represents a significant advancement in image generation. Its ability to maintain accuracy in identity with minimal resources marks it as an innovative solution in personalized image generation. Key takeaways from this research include:

Bridging Fidelity and Efficiency: InstantID effectively balances high fidelity and efficiency in identity-preserving image generation.

Plug-and-Play Module: Its compatibility with pre-trained models and the plug-and-play nature broadens its applicability without incurring extra costs.

Versatile Applications: The method opens possibilities in novel view synthesis, identity interpolation, and multi-identity synthesis.

However, challenges remain, such as decoupling facial attribute features for enhanced editing flexibility and addressing ethical concerns about using human faces in machine-learning models. The future of InstantID lies in exploring these avenues, potentially revolutionizing how we approach image generation in machine learning.


MIT Researchers Unveil InfoCORE: A Machine Learning Approach to Overcome Batch Effects in High-Throughput Drug Screening

Recent studies have shown that representation learning has become an important tool for drug discovery and biological system understanding. It is a fundamental component in the identification of drug mechanisms, the prediction of drug toxicity and activity, and the identification of chemical compounds linked to disease states.

The limitation arises in representing the complex interplay between a small molecule’s chemical structure and its physical or biological characteristics. Several molecular representation learning techniques currently in use solely encode a molecule’s chemical identification, leading to unimodal representations, which has drawbacks as molecules with comparable structures can have remarkably diverse functions within a biological setting.

Recent efforts have concentrated on training models that apply multimodal contrastive learning to map 2D chemical structures to high-content cell microscope pictures. In biotechnology, high-throughput drug screening is essential for assessing and understanding the relationship between a drug’s chemical structure and biological activity. This method uses gene expression measures or cell imaging to indicate drug effects. 

However, handling batch effects presents a major challenge when running large-scale screens, necessitating their division into many trials. The appropriate interpretation of results may be hampered by these batch effects, which can potentially incorporate systematic mistakes and non-biological connections into the data. 

To overcome this, a team of researchers has recently presented InfoCORE, an Information maximization strategy for COnfounder REmoval. Effectively managing batch effects and improving the caliber of molecular representations derived from high-throughput drug screening data are the main goals of InfoCORE. Given a batch identifier, the method sets a variational lower bound on the conditional mutual information of latent representations. It does this by adaptively reweighting samples to equalize their inferred batch distribution.

Extensive tests on drug screening data have shown that InfoCORE performs better than other algorithms on a variety of tasks, such as retrieving molecule-phenotype and predicting chemical properties. This implies that InfoCORE successfully reduces the influence of batch effects, resulting in better performance in tasks pertaining to molecular analysis and drug discovery.

The study also emphasizes how flexible InfoCORE is as a framework that can handle more complex issues. It shows how InfoCORE can manage shifts in the general distribution and data-fairness problems by reducing correlation with spurious features or eliminating sensitive attributes. This versatility makes InfoCORE a powerful tool for tackling a variety of challenges connected to data distribution and fairness, in addition to removing batch effects in drug screening.

The researchers have summarized their primary contributions as follows.

The InfoCORE approach aims to propose a multimodal molecular representation learning framework that can smoothly integrate chemical structures with a variety of high-content drug screens.

The research provides a strong theoretical foundation by demonstrating that InfoCORE maximizes the variational lower bound on the conditional mutual information of the representation given the batch identifier.

InfoCORE has demonstrated its efficiency in molecular property prediction and molecule-phenotype retrieval tasks by consistently outperforming several baseline models in real-world studies.

InfoCORE’s information maximization philosophy extends beyond the field of drug development. Empirical evidence supports its effectiveness in removing sensitive information for representation fairness, making it a flexible tool with wider uses.


Microsoft AI Research Unveils DeepSpeed-FastGen: Elevating LLM Serving Efficiency with Innovative Dynamic SplitFuse Technique

Large language models (LLMs) have revolutionized various AI-infused applications, from chat models to autonomous driving. This evolution has spurred the need for systems that can efficiently deploy and serve these models, especially under the increasing demand for handling long-prompt workloads. The major hurdle in this domain has been balancing high throughput and low latency in serving systems, a challenge that existing frameworks struggle to meet.

Traditional approaches to LLM serving, while adept at training models effectively, falter during inference, especially in tasks like open-ended text generation. This inefficiency stems from the interactive nature of these applications and the poor arithmetic intensity of such tasks, which bottleneck the inference throughput in existing systems. vLLM, powered by PagedAttention, and research systems like Orca have improved LLM inference performance. However, they still face challenges in maintaining a consistent quality of service, particularly for long-prompt workloads.

Historical advancements in LLM inference, such as blocked KV caching and dynamic batching, aimed to address memory efficiency and GPU utilization. Blocked KV caching, as implemented in vLLM’s Paged Attention, tackled memory fragmentation caused by large KV caches, increasing total system throughput. Despite its attempts to improve GPU utilization, dynamic batching often required padding inputs or stalling the system to construct larger batches. These methods, while innovative, still fail to fully resolve the challenges of efficiently serving LLMs, particularly under the constraints of long-prompt workloads.

Microsoft DeepSpeed researchers introduced DeepSpeed-FastGen, a revolutionary system utilizing the Dynamic SplitFuse technique in response to the abovementioned challenges. This system delivers up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower tail latency compared to state-of-the-art systems like vLLM. DeepSpeed-FastGen combines DeepSpeed-MII and DeepSpeed-Inference to create an efficient, user-friendly serving system for LLMs. It supports a range of models and offers both non-persistent and persistent deployment options, catering to various user scenarios.

The cornerstone of DeepSpeed-FastGen’s efficiency is the Dynamic SplitFuse strategy, which enhances continuous batching and system throughput. This novel token composition strategy for prompt processing and generation allows long prompts to be decomposed into smaller chunks across multiple forward passes. This method leads to better system responsiveness and higher efficiency as long prompts no longer necessitate extremely long forward passes. The approach also ensures consistent forward pass sizes, which is a primary determinant of performance, leading to more consistent latency than competing systems. This translates to significant reductions in generation latency, as evidenced in the performance evaluations.
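
The scheduling idea can be illustrated with the simplified sketch below: long prompts are split into fixed-size chunks and fused with ongoing generation tokens so that every forward pass has a roughly constant token budget. This is an illustration of the concept only, not DeepSpeed-FastGen's actual scheduler.

```python
from collections import deque

def schedule_forward_pass(pending_prompts: deque, active_generations: list, budget: int = 512):
    """Return a list of (request_id, token_count) pairs forming one forward pass."""
    batch = [(req_id, 1) for req_id in active_generations]  # one new token per active request
    remaining = budget - len(batch)
    while pending_prompts and remaining > 0:
        req_id, tokens_left = pending_prompts[0]
        chunk = min(tokens_left, remaining)                  # take only a chunk of a long prompt
        batch.append((req_id, chunk))
        remaining -= chunk
        if chunk == tokens_left:
            pending_prompts.popleft()                        # prompt fully processed
        else:
            pending_prompts[0] = (req_id, tokens_left - chunk)  # remainder goes to a later pass
    return batch
```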

DeepSpeed-FastGen’s performance was rigorously benchmarked and analyzed. The system was evaluated against vLLM on various models and hardware configurations. The evaluations demonstrated that DeepSpeed-FastGen achieves up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower tail latency compared to vLLM. These improvements are particularly notable in LLM serving, where both throughput and latency are crucial metrics.

To summarize the key takeaways from DeepSpeed-FastGen:

Revolutionary Strategy: Implements Dynamic SplitFuse, a novel token composition strategy.

Significant Performance Gains: Achieve up to 2.3x higher effective throughput and 2x lower latency on average.

Tail Latency Reduction: Offers up to 3.7x lower tail latency than vLLM.

Scalability and Versatility: Demonstrates near-perfect scalability and supports various hardware platforms.

Community Engagement: Encourages contribution and collaboration within the wider DeepSpeed ecosystem.

DeepSpeed-FastGen represents a major advancement in efficiently deploying and scaling large language models. By addressing the critical challenges of throughput and latency in LLM serving, DeepSpeed-FastGen is a notable contribution to the field, paving the way for more efficient and scalable AI applications.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our 36k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

The post Microsoft AI Research Unveils DeepSpeed-FastGen: Elevating LLM Serving Efficiency with Innovative Dynamic SplitFuse Technique appeared first on MarkTechPost.

Reduce inference time for BERT models using neural architecture search …

In this post, we demonstrate how to use neural architecture search (NAS) based structural pruning to compress a fine-tuned BERT model to improve model performance and reduce inference times. Pre-trained language models (PLMs) are undergoing rapid commercial and enterprise adoption in the areas of productivity tools, customer service, search and recommendations, business process automation, and content creation. Deploying PLM inference endpoints is typically associated with higher latency and higher infrastructure costs due to the compute requirements and the reduced computational efficiency that comes with the large number of parameters. Pruning a PLM reduces the size and complexity of the model while retaining its predictive capabilities. Pruned PLMs achieve a smaller memory footprint and lower latency. We demonstrate that by pruning a PLM and trading off parameter count against validation error for a specific target task, we are able to achieve faster response times compared to the base PLM model.
Multi-objective optimization is an area of decision-making that optimizes more than one objective function simultaneously, such as memory consumption, training time, and compute resources. Structural pruning is a technique to reduce the size and computational requirements of a PLM by pruning layers or neurons/nodes while attempting to preserve model accuracy. By removing layers, structural pruning achieves higher compression rates, which leads to hardware-friendly structured sparsity that reduces runtimes and response times. Applying a structural pruning technique to a PLM results in a lighter-weight model with a lower memory footprint that, when hosted as an inference endpoint in SageMaker, offers improved resource efficiency and reduced cost compared to the original fine-tuned PLM.
The concepts illustrated in this post can be applied to applications that use PLM features, such as recommendation systems, sentiment analysis, and search engines. Specifically, you can use this approach if you have dedicated machine learning (ML) and data science teams who fine-tune their own PLM models using domain-specific datasets and deploy a large number of inference endpoints using Amazon SageMaker. One example is an online retailer who deploys a large number of inference endpoints for text summarization, product catalog classification, and product feedback sentiment classification. Another example might be a healthcare provider who uses PLM inference endpoints for clinical document classification, named entity recognition from medical reports, medical chatbots, and patient risk stratification.
Solution overview
In this section, we present the overall workflow and explain the approach. First, we use an Amazon SageMaker Studio notebook to fine-tune a pre-trained BERT model on a target task using a domain-specific dataset. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model based on the transformer architecture used for natural language processing (NLP) tasks. Neural architecture search (NAS) is an approach for automating the design of artificial neural networks and is closely related to hyperparameter optimization, a widely used approach in the field of machine learning. The goal of NAS is to find the optimal architecture for a given problem by searching over a large set of candidate architectures using techniques such as gradient-free optimization or by optimizing the desired metrics. The performance of an architecture is typically measured using metrics such as validation loss. SageMaker Automatic Model Tuning (AMT) automates the tedious and complex process of finding the optimal combinations of hyperparameters of the ML model that yield the best model performance. AMT uses intelligent search algorithms and iterative evaluations over a range of hyperparameters that you specify. It chooses the hyperparameter values that create a model that performs the best, as measured by performance metrics such as accuracy and F1 score.
The fine-tuning approach described in this post is generic and can be applied to any text-based dataset. The task assigned to the BERT PLM can be a text-based task such as sentiment analysis, text classification, or Q&A. In this demo, the target task is a binary classification problem where BERT is used to identify, from a dataset that consists of a collection of pairs of text fragments, whether the meaning of one text fragment can be inferred from the other fragment. We use the Recognizing Textual Entailment dataset from the GLUE benchmarking suite. We perform a multi-objective search using SageMaker AMT to identify the sub-networks that offer optimal trade-offs between parameter count and prediction accuracy for the target task. When performing a multi-objective search, we start with defining the accuracy and parameter count as the objectives that we are aiming to optimize.
Within the BERT PLM network, there can be modular, self-contained sub-networks that allow the model to have specialized capabilities such as language understanding and knowledge representation. BERT PLM uses a multi-headed self-attention sub-network and a feed-forward sub-network. A multi-headed, self-attention layer allows BERT to relate different positions of a single sequence in order to compute a representation of the sequence by allowing multiple heads to attend to multiple context signals. The input is split into multiple subspaces and self-attention is applied to each of the subspaces separately. Multiple heads in a transformer PLM allow the model to jointly attend to information from different representation subspaces. A feed-forward sub-network is a simple neural network that takes the output from the multi-headed self-attention sub-network, processes the data, and returns the final encoder representations.
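As a concrete illustration of how the 768-dimensional hidden state is split across heads, the following minimal PyTorch sketch computes scaled dot-product attention over per-head subspaces and concatenates the heads back together. It is a generic illustration of multi-headed self-attention, not BERT’s exact implementation.

import torch
import torch.nn as nn

hidden_size, num_heads, seq_len = 768, 12, 16
head_dim = hidden_size // num_heads  # 64-dimensional subspace per head

x = torch.randn(1, seq_len, hidden_size)            # (batch, seq, hidden)
qkv = nn.Linear(hidden_size, 3 * hidden_size)(x)    # project to queries, keys, values
q, k, v = qkv.chunk(3, dim=-1)

def split_heads(t):
    return t.view(1, seq_len, num_heads, head_dim).transpose(1, 2)

q, k, v = map(split_heads, (q, k, v))               # (batch, heads, seq, head_dim)
scores = (q @ k.transpose(-2, -1)) / head_dim**0.5  # scaled dot-product per head
context = scores.softmax(dim=-1) @ v                # each head attends independently
out = context.transpose(1, 2).reshape(1, seq_len, hidden_size)  # concatenate heads
print(out.shape)                                    # torch.Size([1, 16, 768])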
The goal of random sub-network sampling is to train smaller BERT models that perform well enough on target tasks. We sample 100 random sub-networks from the fine-tuned base BERT model and evaluate 10 networks simultaneously. The trained sub-networks are evaluated for the objective metrics, and the final model is chosen based on the trade-offs found between the objective metrics. We visualize the Pareto front for the sampled sub-networks, which contains the pruned model that offers the optimal trade-off between model accuracy and model size. We select the candidate sub-network (the NAS-pruned BERT model) based on the model size and model accuracy that we are willing to trade off. Next, we use SageMaker to host two endpoints: one for the pre-trained BERT base model and one for the NAS-pruned BERT model. To perform load testing, we use Locust, an open source load testing tool that you can implement using Python. We run load testing on both endpoints using Locust and visualize the results using the Pareto front to illustrate the trade-off between response times and accuracy for both models. The following diagram provides an overview of the workflow explained in this post.

Prerequisites
For this post, the following prerequisites are required:

An AWS account with access to the AWS Management Console
A SageMaker domain, SageMaker user profile, and SageMaker Studio
An IAM execution role for the SageMaker Studio Domain user

You also need to increase the service quota to access at least three instances of ml.g4dn.xlarge in SageMaker. The ml.g4dn.xlarge instance type is a cost-efficient GPU instance that allows you to run PyTorch natively. To increase the service quota, complete the following steps:

On the console, navigate to Service Quotas.
For Manage quotas, choose Amazon SageMaker, then choose View quotas.

Search for “ml.g4dn.xlarge for training job usage” and select the quota item.
Choose Request increase at account-level.

For Increase quota value, enter a value of 5 or higher.
Choose Request.

The requested quota approval may take some time to complete depending on the account permissions.

Open SageMaker Studio from the SageMaker console.

Choose System terminal under Utilities and files.

Run the following command to clone the GitHub repo to the SageMaker Studio instance:

git clone https://github.com/aws/amazon-sagemaker-examples.git

Navigate to amazon-sagemaker-examples/hyperparameter_tuning/neural_architecture_search_llm.
Open the file nas_for_llm_with_amt.ipynb.
Set up the environment with an ml.g4dn.xlarge instance and choose Select.

Set up the pre-trained BERT model
In this section, we import the Recognizing Textual Entailment dataset from the Hugging Face datasets library and split the dataset into training and validation sets. This dataset consists of pairs of sentences. The task of the BERT PLM is to recognize, given two text fragments, whether the meaning of one text fragment can be inferred from the other fragment. In the following example, we can infer the meaning of the first phrase from the second phrase:

Phrase 1: A man with a beard, wearing a red shirt with gray sleeves and work gloves, pulling on a rope.
Phrase 2: A bearded man pulls a rope

We load the Recognizing Textual Entailment dataset from the GLUE benchmarking suite via the datasets library from Hugging Face within our training script (./training.py). We split the original GLUE training dataset into a training set and a validation set. In our approach, we fine-tune the base BERT model using the training set, then we perform a multi-objective search to identify the set of sub-networks that optimally balance the objective metrics. We use the training set exclusively for fine-tuning the BERT model, and we use the validation set for the multi-objective search by measuring accuracy on this holdout data. A sketch of the dataset loading and splitting is shown below.
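For reference, loading GLUE RTE and carving a validation set out of its training split with the Hugging Face datasets library looks roughly like the following; the split ratio and seed are assumptions, and the repo’s training.py remains the source of truth.

from datasets import load_dataset

raw = load_dataset("glue", "rte")                              # train / validation / test splits
split = raw["train"].train_test_split(test_size=0.1, seed=42)  # assumed ratio and seed
train_ds, valid_ds = split["train"], split["test"]
print(train_ds[0])  # {'sentence1': ..., 'sentence2': ..., 'label': 0 or 1, 'idx': ...}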
Fine-tune the BERT PLM using a domain-specific dataset
The typical use cases for a raw BERT model include next sentence prediction or masked language modeling. To use the base BERT model for downstream tasks such as recognizing textual entailment, we have to further fine-tune the model using a domain-specific dataset. You can use a fine-tuned BERT model for tasks such as sequence classification, question answering, and token classification. However, for the purposes of this demo, we use the fine-tuned model for binary classification. We fine-tune the pre-trained BERT model with the training dataset that we prepared previously, using the following hyperparameters:

hyperparameters = {}
hyperparameters["per_device_train_batch_size"] = 8
hyperparameters["per_device_eval_batch_size"] = 8
hyperparameters["learning_rate"] = 2e-05
hyperparameters["num_train_epochs"] = 5
hyperparameters["save_strategy"] = "epoch"
hyperparameters["is_regression"] = False  # set this to True if your dataset is a regression dataset, for example STSB

We save the checkpoint of the model training to an Amazon Simple Storage Service (Amazon S3) bucket, so that the model can be loaded during the NAS-based multi-objective search. Before we train the model, we define the metrics such as epoch, training loss, number of parameters, and validation error:

import os

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch
from sagemaker.session import Session

session = Session()
s3_bucket = session.default_bucket()
s3_bucket_prefix = "nas_amt/model_checkpoint"
s3_path = f"s3://{s3_bucket}/{s3_bucket_prefix}"

metric_definitions = [
    {"Name": "epoch", "Regex": "epoch: ([0-9\.]+)"},
    {"Name": "training-loss", "Regex": "training loss: ([0-9\.]+)"},
    {"Name": "num-parameters", "Regex": "number of parameters: ([0-9\.]+)"},
    {"Name": "validation-error", "Regex": "validation error: ([0-9\.]+)"},
]

sm_args = dict(
    entry_point="training.py",
    source_dir=os.path.abspath(""),
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    py_version="py39",
    framework_version="1.13",
    transformers_version="4.26",
    max_run=3600 * 72,
    role=get_execution_role(),
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters=hyperparameters,
    checkpoint_s3_uri=s3_path,
    metric_definitions=metric_definitions,
)
est = PyTorch(**sm_args)
est.fit()

After the fine-tuning process starts, the training job takes around 15 minutes to complete.
Perform a multi-objective search to select sub-networks and visualize the results
In the next step, we perform a multi-objective search on the fine-tuned base BERT model by sampling random sub-networks using SageMaker AMT. To access a sub-network within the super-network (the fine-tuned BERT model), we mask out all the components of the PLM that are not part of the sub-network. Masking a super-network to find sub-networks in a PLM is a technique used to isolate and identify patterns of the model’s behavior. Note that Hugging Face transformers needs the hidden size to be a multiple of the number of heads. The hidden size in a transformer PLM controls the size of the hidden state vector space, which impacts the model’s ability to learn complex representations and patterns in the data. In a BERT PLM, the hidden state vector is of a fixed size (768). We can’t change the hidden size, and therefore the number of heads has to be in [1, 3, 6, 12].
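A quick sanity check of that constraint: with the hidden size fixed at 768, only head counts that divide 768 evenly are admissible, and the search space used later in this post restricts itself to the subset [1, 3, 6, 12].

hidden_size = 768
admissible_heads = [h for h in range(1, 13) if hidden_size % h == 0]
print(admissible_heads)             # [1, 2, 3, 4, 6, 8, 12]
search_space_heads = [1, 3, 6, 12]  # subset used in the search space defined below
assert all(hidden_size % h == 0 for h in search_space_heads)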
In contrast to single-objective optimization, in the multi-objective setting we typically don’t have a single solution that simultaneously optimizes all objectives. Instead, we aim to collect a set of solutions that dominate all other solutions in at least one objective (such as validation error). Now we can start the multi-objective search through AMT by setting the metrics that we want to minimize (validation error and number of parameters). The number of random sub-networks to evaluate is defined by the parameter max_jobs, and the number of simultaneous jobs is defined by the parameter max_parallel_jobs. The code to load the model checkpoint and evaluate the sub-network is available in the evaluate_subnetwork.py script.

from datetime import datetime

from sagemaker.huggingface import HuggingFace
from sagemaker.tuner import CategoricalParameter, HyperparameterTuner, IntegerParameter

# Maximum number of sub-networks we will evaluate
max_jobs = 100
max_parallel_jobs = 5

# Entry point script to load the super-network and evaluate a sub-network
entry_point = "evaluate_subnetwork.py"

# Command line arguments for the entry point script
# (model_type and seed are defined earlier in the notebook)
hyperparameters = {"model_name_or_path": model_type, "output_dir": "./tmp", "task_name": "rte"}

# Define the metrics we want to minimize
metric_definitions = [
    {"Name": "num-parameters", "Regex": "number of parameters: ([0-9\.]+)"},
    {"Name": "validation-error", "Regex": "validation error: ([0-9\.]+)"},
]

# Define HuggingFace estimator
estimator = HuggingFace(
    entry_point=entry_point,
    source_dir="./",
    instance_type="ml.g4dn.xlarge",  # instance type for the SageMaker training jobs
    instance_count=1,
    py_version="py39",
    framework_version="1.13",
    pytorch_version="1.13",
    transformers_version="4.26",
    max_run=3600 * 72,
    role=get_execution_role(),
    volume_size=125,
    model_uri=s3_path,
    hyperparameters=hyperparameters,
)

current_time = datetime.now().strftime("%m-%d-%Y-%H-%M-%S")
tuning_job_name = f"nas-search-{current_time}"

# Search space to define sub-networks
hyperparameter_ranges = {
    "num_layers": IntegerParameter(0, 12),
    # To meet HuggingFace constraints, we can only set the number of heads to these values
    "num_heads": CategoricalParameter([1, 3, 6, 12]),
    "num_units": IntegerParameter(0, 3072),
}

# Define AMT Tuner object
my_tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation-error",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=max_jobs,
    strategy="Random",
    random_seed=seed,
    objective_type="Minimize",
    max_parallel_jobs=max_parallel_jobs,
)

# Start hyperparameter tuning job
my_tuner.fit(job_name=tuning_job_name)

The AMT tuning job takes approximately 2 hours and 20 minutes to run. After the AMT tuning job runs successfully, we parse the job’s history and collect the sub-network’s configurations, such as number of heads, number of layers, number of units, and the corresponding metrics such as validation error and number of parameters. The following screenshot shows the summary of a successful AMT tuner job.
Next, we visualize the results using a Pareto set (also known as Pareto frontier or Pareto optimal set), which helps us identify optimal sets of sub-networks that dominate all other sub-networks in the objective metric (validation error):

import numpy as np
import sagemaker

history = my_tuner.analytics().dataframe()
data = []
configs = []
for i, t in enumerate(my_tuner.analytics().training_job_summaries()):
    jn = t["TrainingJobName"]
    df = sagemaker.analytics.TrainingJobAnalytics(jn).dataframe()

    row = history[history["TrainingJobName"] == jn]
    config = {
        "num-heads": int(row["num_heads"].iloc[0].strip('"')),
        "num-layers": int(row["num_layers"]),
        "num-units": int(row["num_units"]),
    }
    configs.append(config)

    p = []
    for j, metric in enumerate(metric_definitions):
        metric_name = metric["Name"]
        if "metric_name" not in df.keys():
            continue
        y = float(df[df["metric_name"] == metric_name]["value"])
        p.append(y)
    if len(p) > 0:
        data.append(p)

data = np.array(data)

First, we collect the data from the AMT tuning job. Then we plot the Pareto set using matplotlib.pyplot, with the number of parameters on the x axis and the validation error on the y axis. This implies that when we move from one sub-network of the Pareto set to another, we must sacrifice either performance or model size while improving the other. Ultimately, the Pareto set gives us the flexibility to choose the sub-network that best suits our preferences. We can decide how much we want to reduce the size of our network and how much performance we are willing to sacrifice.

import matplotlib.pyplot as plt
from multi_objective import get_pareto_optimal

# get results of the un-pruned network
df = sagemaker.analytics.TrainingJobAnalytics(est.jobs[0].name).dataframe()
validation_error_unpruned_network = float(df[df["metric_name"] == "validation-error"].value.min())
params_unpruned_network = int(df[df["metric_name"] == "num-parameters"].value.min())
plt.scatter(
    params_unpruned_network,
    validation_error_unpruned_network,
    marker="o",
    s=80,
    facecolors="none",
    edgecolors="C3",
    linewidth=2,
    label="un-pruned super-network",
)
# get Pareto optimal points
idx = get_pareto_optimal(data)
x = data[idx, 0]
y = data[idx, 1]
plt.scatter(
    x,
    y,
    marker="o",
    s=80,
    facecolors="none",
    edgecolors="C0",
    linewidth=2,
    label="Pareto front (sub-networks)",
)
plt.xlabel("number of parameters")
plt.ylabel("validation error")
plt.legend()
plt.xscale("log")
plt.grid(linewidth="1", alpha=0.4, which="both")
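The get_pareto_optimal helper imported above comes from the multi_objective module in the cloned repo. For intuition, a minimal Pareto-dominance filter over two minimization objectives (parameter count and validation error) could look like the sketch below; it is an illustration, not the repo’s implementation.

import numpy as np

def pareto_optimal_mask(points: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking points that no other point dominates (minimization)."""
    n = points.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # A point dominates i if it is no worse in every objective and strictly better in one.
        dominated = np.all(points <= points[i], axis=1) & np.any(points < points[i], axis=1)
        if dominated.any():
            mask[i] = False
    return mask

demo = np.array([[1e8, 0.30], [6e7, 0.32], [6e7, 0.40], [3e7, 0.38]])
print(pareto_optimal_mask(demo))  # [ True  True False  True]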

Deploy the fine-tuned BERT model and the NAS-optimized sub-network model using SageMaker
Next, we deploy the largest model in our Pareto set, which leads to the smallest amount of performance degradation, to a SageMaker endpoint. The best model is the one that provides an optimal trade-off between the validation error and the number of parameters for our use case.

import json

import numpy as np
from sagemaker.huggingface.model import HuggingFaceModel

# Let's take the largest model in the Pareto set
indices = np.arange(len(configs))[idx]
pareto_optimal_sub_networks = [configs[i] for i in indices]
config_to_deploy = pareto_optimal_sub_networks[-1]

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_path + "/model.tar.gz",
    role=get_execution_role(),
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    entry_point="inference.py",
    source_dir="./",
    env={"SM_HPS": json.dumps(config_to_deploy)},
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

Model comparison
We took a pre-trained base BERT model, fine-tuned it using a domain-specific dataset, ran a NAS search to identify dominant sub-networks based on the objective metrics, and deployed the pruned model on a SageMaker endpoint. In addition, we took the pre-trained base BERT model and deployed the base model on a second SageMaker endpoint. Next, we ran load-testing using Locust on both inference endpoints and evaluated the performance in terms of response time.
First, we import the necessary Locust and Boto3 libraries. Then we construct the request metadata and record the start time to be used for load testing. The payload is then passed to the SageMaker endpoint invoke API via the Boto3 client to simulate real user requests. We use Locust to spawn multiple virtual users that send requests in parallel and measure the endpoint performance under load. Tests are run against each of the two endpoints while increasing the number of users. After the tests are completed, Locust outputs a request statistics CSV file for each of the deployed models.

def send(self):
    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()

    try:
        response = self.sagemaker_client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=self.payload,
            ContentType=self.content_type,
        )
        logging.info(response["Body"].read())
    except Exception as e:
        request_meta["exception"] = e

    request_meta["response_time"] = (
        time.perf_counter() - start_perf_counter
    ) * 1000

    events.request.fire(**request_meta)
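For context, the send() method above would live inside a Locust user class that sets up the Boto3 SageMaker runtime client, the endpoint name, and the payload. The sketch below shows one way to wire it up; the class name, endpoint name, and payload are placeholders, not the exact code used in this post.

import json
import logging
import time

import boto3
from locust import User, between, events, task

class SageMakerUser(User):
    wait_time = between(1, 2)                   # think time between requests
    endpoint_name = "nas-pruned-bert-endpoint"  # placeholder endpoint name
    content_type = "application/json"
    payload = json.dumps({"inputs": ["A bearded man pulls a rope"]})

    def on_start(self):
        self.sagemaker_client = boto3.client("sagemaker-runtime")

    @task
    def invoke_endpoint(self):
        # send() is the method shown in the previous snippet; add it to this class.
        self.send()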

Next, we generate the response time plots from the CSV files downloaded after running the tests with Locust. Plotting the response time vs. the number of users lets us analyze the load testing results by visualizing how load affects the response time of the model endpoints. In the following chart, we can see that the NAS-pruned model endpoint achieves a lower response time compared to the base BERT model endpoint.

In the second chart, which is an extension of the first chart, we observe that after around 70 users, SageMaker starts to throttle the base BERT model endpoint and throws an exception. However, for the NAS-pruned model endpoint, the throttling happens between 90–100 users and with a lower response time.

From the two charts, we observe that the pruned model has a faster response time and scales better when compared to the unpruned model. As we scale the number of inference endpoints, as is the case with users who deploy a large number of inference endpoints for their PLM applications, the cost benefits and performance improvement start to become quite substantial.
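The response-time charts discussed above can be reproduced from the Locust CSV output along these lines. The file names, and the “User Count” and “Total Average Response Time” columns from Locust’s stats-history CSV, are assumptions to adapt to your own runs.

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder file names; Locust writes <prefix>_stats_history.csv when run with --csv.
endpoints = {
    "base BERT endpoint": "base_bert_stats_history.csv",
    "NAS-pruned endpoint": "nas_pruned_bert_stats_history.csv",
}

for label, path in endpoints.items():
    df = pd.read_csv(path)
    df = df[df["Name"] == "SageMaker"]  # rows for the invoke_endpoint requests
    plt.plot(df["User Count"], df["Total Average Response Time"], label=label)

plt.xlabel("number of users")
plt.ylabel("average response time (ms)")
plt.legend()
plt.show()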
Clean up
To delete the SageMaker endpoints for the fine-tuned base BERT model and the NAS-pruned model, complete the following steps:

On the SageMaker console, choose Inference in the navigation pane, then choose Endpoints.
Select the endpoint and delete it.

Alternatively, from the SageMaker Studio notebook, run the following commands by providing the endpoint names:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we discussed how to use NAS to prune a fine-tuned BERT model. We first trained a base BERT model using domain-specific data and deployed it to a SageMaker endpoint. We performed a multi-objective search on the fine-tuned base BERT model using SageMaker AMT for a target task. We visualized the Pareto front and selected the Pareto optimal NAS-pruned BERT model and deployed the model to a second SageMaker endpoint. We performed load testing using Locust to simulate users querying both the endpoints, and measured and recorded the response times in a CSV file. We plotted the response time vs. the number of users for both the models.
We observed that the pruned BERT model performed significantly better in both response time and instance throttling threshold. We concluded that the NAS-pruned model was more resilient to an increased load on the endpoint, maintaining a lower response time even as more users stressed the system compared to the base BERT model. You can apply the NAS technique described in this post to any large language model to find a pruned model that can perform the target task with significantly lower response time. You can further optimize the approach by using latency as a parameter in addition to validation loss.
Although we use NAS in this post, quantization is another common approach used to optimize and compress PLM models. Quantization reduces the precision of the weights and activations in a trained network from 32-bit floating point to lower bit widths such as 8-bit or 16-bit integers, which results in a compressed model that generates faster inference. Quantization doesn’t reduce the number of parameters; instead it reduces the precision of the existing parameters to get a compressed model. NAS pruning removes redundant networks in a PLM, which creates a sparse model with fewer parameters. Typically, NAS pruning and quantization are used together to compress large PLMs to maintain model accuracy, reduce validation losses while improving performance, and reduce model size. The other commonly used techniques to reduce the size of PLMs include knowledge distillation, matrix factorization, and distillation cascades.
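For comparison with NAS pruning, post-training dynamic quantization of a BERT model takes only a few lines in PyTorch. The sketch below quantizes the Linear layers of a Hugging Face BERT checkpoint to 8-bit integers; it is a generic illustration of the technique, not part of this post’s workflow.

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # store Linear weights as int8
)
# The quantized model keeps the same parameter count but has a smaller footprint
# and typically faster CPU inference.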
The approach proposed in this post is suitable for teams that use SageMaker to train and fine-tune models using domain-specific data and deploy endpoints to generate inference. If you’re looking for a fully managed service that offers a choice of high-performing foundation models needed to build generative AI applications, consider using Amazon Bedrock. If you’re looking for pre-trained, open source models for a wide range of business use cases and want to access solution templates and example notebooks, consider using Amazon SageMaker JumpStart. A pre-trained version of the Hugging Face BERT base cased model that we used in this post is also available from SageMaker JumpStart.

About the Authors
Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He is a Cloud Architect with 24+ years of experience designing and developing enterprise, large-scale and distributed software systems. He specializes in Generative AI and Machine Learning Data Engineering. He is an aspiring marathon runner and his hobbies include hiking, bike riding and spending time with his wife and two boys.
Aaron Klein is a Sr Applied Scientist at AWS working on automated machine learning methods for deep neural networks.
Jacek Golebiowski is a Sr Applied Scientist at AWS.

Meet Puncc: An Open-Source Python Library for Predictive Uncertainty Q …

In machine learning, predicting outcomes accurately is crucial, but it’s equally important to understand the uncertainty associated with those predictions. Uncertainty helps us gauge our confidence in a model’s output. However, not all machine learning models provide this uncertainty information. This can lead to situations where decisions are made based on overly optimistic predictions, potentially causing problems. 

Some existing solutions address this issue but lack the flexibility and comprehensiveness needed for diverse machine-learning tasks. Meet Puncc, a Python library that integrates state-of-the-art conformal prediction algorithms seamlessly. These algorithms cover various machine-learning tasks such as regression, classification, and anomaly detection. Conformal prediction transforms point predictions into interval predictions, providing a measure of uncertainty vital for making informed decisions.

To use Puncc, one must first install the library, which requires a Python version higher than 3.8. Setting up Puncc in a virtual environment is recommended to avoid conflicts with other system dependencies. Installation is straightforward using the pip command: `pip install puncc`. The library has comprehensive online documentation, guiding users through installation, tutorials, and API usage.

Puncc’s strength lies in its ability to work with any predictive model, enhancing it with rigorous uncertainty estimates. The library employs conformal prediction methods, ensuring that the generated prediction sets cover the true outputs with a user-defined error rate. This capability is especially valuable in situations where confident decisions must be made despite uncertainty in the data.

In terms of metrics, Puncc provides a range of tools to evaluate and visualize the results of a conformal procedure. Users can explore metrics for prediction intervals and assess the model’s performance. The library also offers plotting capabilities to enhance the understanding of the generated predictions. For example, a 90% prediction interval built with the split conformal prediction method demonstrates how Puncc achieves high coverage, ensuring that the true outputs fall within the predicted intervals up to a user-defined error rate.
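To illustrate the split conformal idea behind such a 90% prediction interval, the sketch below uses plain scikit-learn rather than Puncc’s own API: hold out a calibration set, compute absolute residuals, and widen each point prediction by their (1 - alpha) conformal quantile. The data and model here are synthetic placeholders.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((2000, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(2000)

X_fit, X_calib, y_fit, y_calib = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_fit, y_fit)

alpha = 0.1  # target 90% coverage
residuals = np.abs(y_calib - model.predict(X_calib))
n = len(residuals)
q = np.quantile(residuals, np.ceil((1 - alpha) * (n + 1)) / n)  # conformal quantile

x_new = rng.random((5, 4))
pred = model.predict(x_new)
lower, upper = pred - q, pred + q  # interval predictions with ~90% coverage
print(np.c_[lower, pred, upper])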

In conclusion, Puncc addresses a significant challenge in machine learning by providing a versatile and effective solution for predictive uncertainty calibration and conformalization. It offers a practical way to transform point predictions into interval predictions with high coverage probabilities, enabling users to make more informed decisions in the face of uncertainty. The library’s straightforward installation, comprehensive documentation, and flexible API make it accessible to users looking to enhance the reliability of their predictive models.
The post Meet Puncc: An Open-Source Python Library for Predictive Uncertainty Quantification Using Conformal Prediction appeared first on MarkTechPost.