Gretel AI Open-Sourced Synthetic-GSM8K-Reflection-405B Dataset: Advancing AI Model Training with Multi-Step Reasoning, Reflection Techniques, and Real-World Problem-Solving Scenarios

As AI advances, the demand for high-quality datasets that can support the training and evaluation of models across various domains is increasing. One such milestone is the open-sourcing of the Synthetic-GSM8K-reflection-405B dataset by Gretel.ai, which holds significant promise for reasoning tasks, specifically those requiring multi-step problem-solving capabilities. This newly released dataset, hosted on Hugging Face, was synthetically generated using Gretel Navigator, with Meta-Llama-3.1-405B serving as the agent large language model (LLM). Its creation reflects advancements in leveraging synthetic data generation and AI reflections for developing robust AI models.

Synthetic Data Generation Using Reflection Techniques

One of the standout features of the synthetic-GSM8K-reflection-405B dataset is its reliance on synthetic data generation. Artificially generated rather than collected from real-world events, synthetic data is increasingly vital in training AI models. In this case, the dataset was created using Gretel Navigator, a sophisticated synthetic data generation tool. This unique dataset uses Meta-Llama-3.1-405B, an advanced LLM, as the generating agent.

The dataset draws inspiration from the popular GSM8K dataset but takes a step further by incorporating reflection techniques. These techniques allow the model to engage in step-by-step reflections during the question-and-answer stages of multi-step problems. The goal of using reflections is to mimic human-like reasoning, where the AI systematically breaks down complex questions into smaller, manageable steps, reflecting on each before moving forward. This approach enhances the model’s ability to understand and solve problems requiring logical thinking, making it an invaluable asset for reasoning tasks.

Diverse Real-World Contexts and Rigorous Validation

Another key feature of the synthetic-GSM8K-reflection-405B dataset is the diversity of its questions. The dataset’s design ensures that the problems are stratified by difficulty and topic, encompassing a wide range of real-world contexts. This diversity makes the dataset highly versatile and applicable to various domains, from academic challenges to industry-specific scenarios that require robust problem-solving skills. 

The dataset also stands out for its rigorously verified nature. All the calculations and problem-solving processes have been meticulously validated using Python's SymPy library. SymPy is a powerful tool for symbolic mathematics, ensuring that the calculations in the dataset are accurate and reliable. This rigorous validation adds a layer of credibility to the dataset, making it a useful tool for AI training and reliable for developing models that can handle complex reasoning tasks with precision.
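
As an illustration, the following is a minimal SymPy sketch (not Gretel's actual validation code) of the kind of check that could confirm a worked calculation matches a reported final answer; the helper and the example expression are hypothetical.

from sympy import sympify

def answer_is_consistent(expression, reported_answer):
    """Return True if the symbolic evaluation of the expression equals the reported answer."""
    return sympify(expression).equals(sympify(reported_answer))

# A toy check in the spirit of validating one step of a worked GSM8K-style solution
print(answer_is_consistent("3*12 + 5", "41"))   # True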

Train and Test Sets for Model Development

The synthetic-GSM8K-reflection-405B dataset is thoughtfully designed to support AI model development. It comes with both training and test sets, containing a total of 300 examples. These examples are categorized by difficulty levels: medium, hard, and very hard, ensuring that models trained on this dataset can handle a wide spectrum of reasoning challenges. The division into train and test sets is crucial for model evaluation. By providing separate sets for training and testing, the dataset allows developers to train their models on one portion of the data and evaluate their performance on a different portion. This separation helps assess how well the model generalizes to unseen data, a key indicator of the model’s robustness and effectiveness.
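
The dataset can be pulled directly from Hugging Face with the datasets library. The snippet below is a minimal sketch; the repository ID and the field names mentioned in the comments are assumptions inferred from the description above rather than verified values.

from datasets import load_dataset

# Assumed repository ID; check the Hugging Face page for the exact name.
ds = load_dataset("gretelai/synthetic-gsm8k-reflection-405b")

print(ds)               # expected: train and test splits, 300 examples in total
print(ds["train"][0])   # inspect one example (e.g., question, answer, and difficulty fields)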

Potential Applications and Impact

Gretel.ai's open-sourcing of synthetic-GSM8K-reflection-405B is poised to significantly impact the AI and machine learning community. Its focus on reasoning tasks makes it an ideal dataset for developing models that require step-by-step problem-solving capabilities. These models can be applied in many fields, such as education, where AI can assist in solving complex mathematical problems, or in industries like finance and engineering, where multi-step reasoning is crucial for decision-making processes.

One of the most exciting aspects of this dataset is its ability to enhance the development of AI models that can handle real-world scenarios. The dataset’s stratification by difficulty and topic covers various contexts, from everyday problems to highly specialized challenges. As a result, models trained on this dataset can be deployed in various applications, offering solutions to common and niche problems.

Moreover, the dataset’s reliance on reflection techniques aligns with the growing trend of developing AI systems that mimic human thought processes. By breaking down complex and challenging problems into smaller steps and reflecting on each, the models trained on this dataset are more likely to offer accurate and efficient solutions. This capability is particularly important in fields where accuracy and logical reasoning are paramount.


The Role of Hugging Face in Democratizing AI

The open-sourcing of synthetic-GSM8K-reflection-405B on Hugging Face is another step toward democratizing AI. Hugging Face has become a central hub for AI developers and researchers, offering access to many models and datasets. By making this dataset freely available, Gretel.ai contributes to the collaborative nature of AI development, where researchers and developers worldwide can access and build upon existing resources.

Hugging Face’s platform also ensures that the dataset reaches a wide audience, from AI researchers in academia to developers in the industry. The platform’s ease of access and robust model training and evaluation support make it an ideal venue for hosting this dataset. The synthetic-GSM8K-reflection-405B dataset’s open-source nature means that developers can use it to train their models, share their findings, and contribute to advancing AI reasoning capabilities.

‘Datasets like GSM8K are crucial for advancing AI reasoning, as these complex problems are challenging to produce at scale. By releasing an enhanced synthetic GSM8K dataset using Reflection techniques, we’re aiming to push the community beyond current benchmarks and teach AI systems to generate more thoughtful and explainable responses.’ – Alex Watson, Co-founder and CPO

Conclusion

The synthetic-GSM8K-reflection-405B dataset by Gretel.ai represents a significant advancement in AI and machine learning, particularly in reasoning tasks. Its use of synthetic data generation, reflection techniques, and rigorous validation ensures that it is a high-quality resource for training AI models that can handle complex, multi-step problems. By making this dataset open-source on Hugging Face, Gretel.ai democratizes AI development, allowing researchers and developers worldwide to access and utilize this valuable resource.

With its diverse real-world contexts and carefully stratified examples, the synthetic-GSM8K-reflection-405B dataset is set to play a crucial role in improving the reasoning capabilities of AI models. Whether used in academic research, industry applications, or model development for specific problem-solving tasks, this dataset holds great potential for advancing AI systems that can think and reason like humans.

Check out the HF Page. All credit for this research goes to the researchers of this project.


Build RAG-based generative AI applications in AWS using Amazon FSx for NetApp ONTAP with Amazon Bedrock

The post is co-written with Michael Shaul and Sasha Korman from NetApp.
Generative artificial intelligence (AI) applications are commonly built using a technique called Retrieval Augmented Generation (RAG) that provides foundation models (FMs) access to additional data they didn’t have during training. This data is used to enrich the generative AI prompt to deliver more context-specific and accurate responses without continuously retraining the FM, while also improving transparency and minimizing hallucinations.
In this post, we demonstrate a solution using Amazon FSx for NetApp ONTAP with Amazon Bedrock to provide a RAG experience for your generative AI applications on AWS by bringing company-specific, unstructured user file data to Amazon Bedrock in a straightforward, fast, and secure way.
Our solution uses an FSx for ONTAP file system as the source of unstructured data and continuously populates an Amazon OpenSearch Serverless vector database with the user’s existing files and folders and associated metadata. This enables a RAG scenario with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data retrieved from the OpenSearch Serverless vector database.
When developing generative AI applications such as a Q&A chatbot using RAG, customers are also concerned about keeping their data secure and preventing end-users from querying information from unauthorized data sources. Our solution also uses FSx for ONTAP to allow users to extend their current data security and access mechanisms to augment model responses from Amazon Bedrock. We use FSx for ONTAP as the source of associated metadata, specifically the user’s security access control list (ACL) configurations attached to their files and folders and populate that metadata into OpenSearch Serverless. By combining access control operations with file events that notify the RAG application of new and changed data on the file system, our solution demonstrates how FSx for ONTAP enables Amazon Bedrock to only use embeddings from authorized files for the specific users that connect to our generative AI application.
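
To make the ACL-aware indexing concrete, here is a conceptual Python sketch (not the solution's actual embeddings container code) of writing a document chunk, its embedding, and its allowed Windows SIDs into an OpenSearch Serverless index; the endpoint, index name, and field names are illustrative assumptions.

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")   # "aoss" = OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": "your-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# One chunk of a file from the FSx for ONTAP NFS share, with its ACL metadata
doc = {
    "text": "chunk of the source document...",
    "vector": [0.12, -0.03, 0.47],   # embedding from Amazon Titan Embeddings (truncated)
    "file_path": "/vol1/docs/bedrock-user-guide.pdf",
    "acl_sids": ["S-1-5-21-4037439088-1296877785-2872080499-1112"],   # Windows SIDs allowed to read
}
client.index(index="rag-embeddings", body=doc)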
AWS serverless services make it straightforward to focus on building generative AI applications by providing automatic scaling, built-in high availability, and a pay-for-use billing model. Event-driven compute with AWS Lambda is a good fit for compute-intensive, on-demand tasks such as document embedding and flexible large language model (LLM) orchestration, and Amazon API Gateway provides an API interface that allows for pluggable frontends and event-driven invocation of the LLMs. Our solution also demonstrates how to build a scalable, automated, API-driven serverless application layer on top of Amazon Bedrock and FSx for ONTAP using API Gateway and Lambda.
Solution overview
The solution provisions an FSx for ONTAP Multi-AZ file system with a storage virtual machine (SVM) joined to an AWS Managed Microsoft AD domain. An OpenSearch Serverless vector search collection provides a scalable and high-performance similarity search capability. We use an Amazon Elastic Compute Cloud (Amazon EC2) Windows server as an SMB/CIFS client to the FSx for ONTAP volume and configure data sharing and ACLs for the SMB shares in the volume. We use this data and ACLs to test permissions-based access to the embeddings in a RAG scenario with Amazon Bedrock.
The embeddings container component of our solution is deployed on an EC2 Linux server and mounted as an NFS client on the FSx for ONTAP volume. It periodically migrates existing files and folders along with their security ACL configurations to OpenSearch Serverless. It populates an index in the OpenSearch Serverless vector search collection with company-specific data (and associated metadata and ACLs) from the NFS share on the FSx for ONTAP file system.
The solution implements a RAG Retrieval Lambda function that allows RAG with Amazon Bedrock by enriching the generative AI prompt using Amazon Bedrock APIs with your company-specific data and associated metadata (including ACLs) retrieved from the OpenSearch Serverless index that was populated by the embeddings container component. The RAG Retrieval Lambda function stores conversation history for the user interaction in an Amazon DynamoDB table.
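
As a concrete illustration of the conversation history store, the following is a minimal boto3 sketch (not the actual Lambda code); the table name and attribute names are assumptions.

import time

import boto3

table = boto3.resource("dynamodb").Table("rag-conversation-history")
table.put_item(
    Item={
        "session_id": "1",              # partition key: one conversation per session
        "timestamp": int(time.time()),  # sort key: ordering of turns
        "prompt": "What is an FSxN ONTAP filesystem?",
        "response": "Amazon FSx for NetApp ONTAP is ...",
    }
)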
End-users interact with the solution by submitting a natural language prompt either through a chatbot application or directly through the API Gateway interface. The chatbot application container is built using Streamlit and fronted by an AWS Application Load Balancer (ALB). When a user submits a natural language prompt to the chatbot UI using the ALB, the chatbot container interacts with the API Gateway interface that then invokes the RAG Retrieval Lambda function to fetch the response for the user. The user can also directly submit prompt requests to API Gateway and obtain a response. We demonstrate permissions-based access to the RAG documents by explicitly retrieving the SID of a user and then using that SID in the chatbot or API Gateway request, where the RAG Retrieval Lambda function then matches the SID to the Windows ACLs configured for the document. As an additional authentication step in a production environment, you may want to also authenticate the user against an identity provider and then match the user against the permissions configured for the documents.
The following diagram illustrates the end-to-end flow for our solution. We start by configuring data sharing and ACLs with FSx for ONTAP, and then these are periodically scanned by the embeddings container. The embeddings container splits the documents into chunks and uses the Amazon Titan Embeddings model to create vector embeddings from these chunks. It then stores these vector embeddings with associated metadata in our vector database by populating an index in a vector collection in OpenSearch Serverless.

The following architecture diagram illustrates the various components of our solution.
Prerequisites
Complete the following prerequisite steps:

Make sure you have model access in Amazon Bedrock. In this solution, we use Anthropic Claude v3 Sonnet on Amazon Bedrock.
Install the AWS Command Line Interface (AWS CLI).
Install Docker.
Install Terraform.

Deploy the solution
The solution is available for download on this GitHub repo. Cloning the repository and using the Terraform template will provision all the components with their required configurations.

Clone the repository for this solution:

sudo yum install -y unzip
git clone https://github.com/aws-samples/genai-bedrock-fsxontap.git
cd genai-bedrock-fsxontap/terraform

From the terraform folder, deploy the entire solution using Terraform:

terraform init
terraform apply -auto-approve

This process can take 15–20 minutes to complete. When finished, the output of the terraform commands should look like the following:

api-invoke-url = "https://9ng1jjn8qi.execute-api.<region>.amazonaws.com/prod"
fsx-management-ip = toset([
  "198.19.255.230",
])
fsx-secret-id = "arn:aws:secretsmanager:<region>:<account-id>:secret:AmazonBedrock-FSx-NetAPP-ONTAP-a2fZEdIt-0fBcS9"
fsx-svm-smb-dns-name = "BRSVM.BEDROCK-01.COM"
lb-dns-name = "chat-load-balancer-2040177936.<region>.elb.amazonaws.com"

Load data and set permissions
To test the solution, we will use the EC2 Windows server (ad_host) mounted as an SMB/CIFS client to the FSx for ONTAP volume to share sample data and set user permissions that will then be used to populate the OpenSearch Serverless index by the solution’s embedding container component. Perform the following steps to mount your FSx for ONTAP SVM data volume as a network drive, upload data to this shared network drive, and set permissions based on Windows ACLs:

Obtain the ad_host instance DNS from the output of your Terraform template.
Navigate to AWS Systems Manager Fleet Manager on your AWS console, locate the ad_host instance, and follow the instructions here to log in with Remote Desktop. Use the domain admin user bedrock-01\Admin and obtain the password from AWS Secrets Manager. You can find the password using the Secrets Manager fsx-secret-id secret ID from the output of your Terraform template.
To mount an FSx for ONTAP data volume as a network drive, under This PC, choose (right-click) Network and then choose Map Network drive.
Choose the drive letter and use the FSx for ONTAP share path for the mount (\\<svm>.<domain>\c$\<volume-name>):
Upload the Amazon Bedrock User Guide to the shared network drive and set permissions to the admin user only (make sure that you disable inheritance under Advanced):
Upload the Amazon FSx for ONTAP User Guide to the shared drive and make sure permissions are set to Everyone:
On the ad_host server, open the command prompt and enter the following command to obtain the SID for the admin user:

wmic useraccount where name='Admin' get sid

Test permissions using the chatbot
To test permissions using the chatbot, obtain the lb-dns-name URL from the output of your Terraform template and access it through your web browser:

For the prompt query, ask any general question on the FSx for ONTAP user guide that is available for access to everyone. In our scenario, we asked “How can I create an FSx for ONTAP file system,” and the model replied back with detailed steps and source attribution in the chat window to create an FSx for ONTAP file system using the AWS Management Console, AWS CLI, or FSx API:

Now, let's ask a question about the Amazon Bedrock user guide that is available for admin access only. In our scenario, we asked "How do I use foundation models with Amazon Bedrock," and the model replied that it doesn't have enough information to provide a detailed answer to the question:
Use the admin SID on the user (SID) filter search in the chat UI and ask the same question in the prompt. This time, the model should reply with steps detailing how to use FMs with Amazon Bedrock and provide the source attribution used by the model for the response:
Test permissions using API Gateway
You can also query the model directly using API Gateway. Obtain the api-invoke-url parameter from the output of your Terraform template. First, invoke the API gateway with Everyone access for a query related to the FSx for ONTAP user guide by setting the value of the metadata parameter to NA to indicate Everyone access:

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1", "prompt": "What is an FSxN ONTAP filesystem?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "NA", "memory_window": 10}'

Then invoke the API gateway with admin access for a query related to the Amazon Bedrock user guide by setting the value of the metadata parameter to the admin user's SID:

curl -v '<api-invoke-url>/bedrock_rag_retreival' -X POST -H 'content-type: application/json' -d '{"session_id": "1", "prompt": "what is bedrock?", "bedrock_model_id": "anthropic.claude-3-sonnet-20240229-v1:0", "model_kwargs": {"temperature": 1.0, "top_p": 1.0, "top_k": 500}, "metadata": "S-1-5-21-4037439088-1296877785-2872080499-1112", "memory_window": 10}'

Cleanup
To avoid recurring charges, clean up your account after trying the solution. From the terraform folder, destroy the resources provisioned by the Terraform template:

terraform apply -destroy

Conclusion
In this post, we demonstrated a solution that uses FSx for ONTAP with Amazon Bedrock and uses FSx for ONTAP support for file ownership and ACLs to provide permissions-based access in a RAG scenario for generative AI applications. Our solution enables you to build generative AI applications with Amazon Bedrock where you can enrich the generative AI prompt in Amazon Bedrock with your company-specific, unstructured user file data from an FSx for ONTAP file system. This solution enables you to deliver more relevant, context-specific, and accurate responses while also making sure only authorized users have access to that data. Finally, the solution demonstrates the use of AWS serverless services with FSx for ONTAP and Amazon Bedrock that enable automatic scaling, event-driven compute, and API interfaces for your generative AI applications on AWS.
For more information about how to get started building with Amazon Bedrock and FSx for ONTAP, refer to the following resources:

Amazon Bedrock Workshop GitHub repo
Amazon FSx for NetApp ONTAP File Storage Workshop GitHub repo
NetApp helps customers unlock the full potential of GenAI with BlueXP workload factory and Amazon Bedrock

About the authors
Kanishk Mahajan is Principal, Solutions Architecture at AWS. He leads cloud transformation and solution architecture for ISV customers and partners at AWS. Kanishk specializes in containers, cloud operations, migrations and modernizations, AI/ML, resilience, and security and compliance. He is a Technical Field Community (TFC) member in each of those domains at AWS.
Michael Shaul is a Principal Architect at NetApp's office of the CTO. He has over 20 years of experience building data management systems, applications, and infrastructure solutions. He brings a unique, in-depth builder's perspective on cloud technologies and AI solutions.
Sasha Korman is a tech visionary leader of dynamic development and QA teams across Israel and India. With 14 years at NetApp that began as a programmer, his hands-on experience and leadership have been pivotal in steering complex projects to success, with a focus on innovation, scalability, and reliability.

Support for AWS DeepComposer ending soon

AWS DeepComposer was first introduced during AWS re:Invent 2019 as a fun way for developers to compose music by using generative AI. AWS DeepComposer was the world’s first machine learning (ML)-enabled keyboard for developers to get hands-on—literally—with a musical keyboard and the latest ML techniques to compose their own music.
After careful consideration, we have made the decision to end support for AWS DeepComposer, effective September 17, 2025. With your help and feedback, our portfolio of products and services has grown to include new tools for developers to get hands-on with AI and ML. Amazon PartyRock, for example, is a generative AI playground for intuitive, code-free help in building web applications.
If you have data stored on the AWS DeepComposer console, you will be able to use AWS DeepComposer as normal until September 17, 2025, when support for the service will end. After this date, you will no longer be able to use AWS DeepComposer through the AWS Management Console, manage AWS DeepComposer devices, or access any compositions or models you have created. Until then, you can continue to work on your compositions or models and export those you would like to keep by using the step-by-step guide in the AWS DeepComposer FAQs.
If you have additional questions, please read our FAQs or contact us.

About the author
Kanchan Jagannathan is a Sr. Program Manager in the AWS AI Devices team, where he helps launch AWS devices into sales channels and also oversees the Service Availability Change process for the team. He was a Program Manager for FC automation deployment and launches before joining AWS. Outside of work, he has bravely begun camping with his 5-year-old and 1-year-old kids and enjoys the moments he gets to spend with them.

Preserve access and explore alternatives for Amazon Lookout for Equipment

Amazon Lookout for Equipment, the AWS machine learning (ML) service designed for industrial equipment predictive maintenance, will no longer be open to new customers effective October 17, 2024. Existing customers will be able to use the service (both using the AWS Management Console and API) as normal and AWS will continue to invest in security, availability, and performance improvements for Lookout for Equipment, but we do not plan to introduce new features for this service.
This post discusses how you can maintain access to Lookout for Equipment after it is closed to new customers and some alternatives to Lookout for Equipment.
Maintaining access to Lookout for Equipment
You’re considered an existing customer if you use the service, either through cloud training or cloud inferencing, any time in the 30 days prior to October 17, 2024 (September 17, 2024, through October 16, 2024). To maintain access to the service after October 17, 2024, you should complete one of the following tasks from the account for which you intend to maintain access:

On the Lookout for Equipment console, start a new project and successfully complete a model training
On the Lookout for Equipment console, open an existing project, schedule an inference for a given model, and run at least one inference
Use Lookout for Equipment API calls CreateInferenceScheduler and StartInferenceScheduler (and StopInferenceScheduler when done); a minimal boto3 sketch of these calls follows this list
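
The following is a hedged boto3 sketch of those scheduler calls; the model, scheduler, role, and S3 locations are placeholders, and the input/output configurations are abbreviated, so consult the Lookout for Equipment API reference for the full request shapes.

import boto3

client = boto3.client("lookoutequipment")

client.create_inference_scheduler(
    ModelName="my-model",
    InferenceSchedulerName="my-scheduler",
    DataUploadFrequency="PT5M",   # how often new sensor data arrives
    DataInputConfiguration={"S3InputConfiguration": {"Bucket": "my-bucket", "Prefix": "input/"}},
    DataOutputConfiguration={"S3OutputConfiguration": {"Bucket": "my-bucket", "Prefix": "output/"}},
    RoleArn="arn:aws:iam::123456789012:role/LookoutEquipmentRole",
)
client.start_inference_scheduler(InferenceSchedulerName="my-scheduler")

# Stop the scheduler when you're done to avoid unnecessary inference runs
client.stop_inference_scheduler(InferenceSchedulerName="my-scheduler")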

For any questions or support needed, contact your assigned AWS Account Manager or Solutions Architect, or create a case from the AWS console.
Alternatives to Lookout for Equipment
If you’re interested in an alternative to Lookout for Equipment, AWS has options for both buyers and builders.
For an out-of-the-box solution, the AWS Partner Network offers solutions from multiple partners. You can browse solutions on the Asset Maintenance and Reliability page in the AWS Solutions Library. This approach provides a solution that addresses your use case without requiring you to have expertise in predictive maintenance, and typically provides the fastest time to value by using the specialized expertise of the AWS Partners.
If you prefer to build your own solution, AWS offers AI/ML tools and services to help you develop an AI-based predictive maintenance solution. Amazon SageMaker provides a set of tools to enable you to build, train, infer, and deploy ML models for your use case with fully managed infrastructure, tools, and workflows.
Summary
Although new customers will no longer have access to Lookout for Equipment after October 17, 2024, AWS offers a powerful set of AI/ML services and solutions in the form of SageMaker tools to build customer models, and also offers a range of solutions from partners through the AWS Partner Network. You should explore these options to determine what works best for your specific needs.
For more details, refer to the following resources:

Amazon Lookout for Equipment Developer Guide
Amazon SageMaker Developer Guide
AWS Solutions Library

About the author
Stuart Gillen is a Sr. Product Manager, Lookout for Equipment, at AWS. Stuart has held a variety of roles in engineering management, business development, product management, and consulting. Most of his career has been focused on industrial applications specifically in reliability practices, maintenance systems, and manufacturing. Stuart is the Product Manager for Lookout for Equipment at AWS where he utilizes his industrial and AI background in applications focusing on Predictive Maintenance and Condition Monitoring.

Rethinking LLM Training: The Promise of Inverse Reinforcement Learning Techniques

Large language models (LLMs) have gained significant attention in the field of artificial intelligence, primarily due to their ability to imitate human knowledge through extensive datasets. The current methodologies for training these models heavily rely on imitation learning, particularly next token prediction using maximum likelihood estimation (MLE) during pretraining and supervised fine-tuning phases. However, this approach faces several challenges, including compounding errors in autoregressive models, exposure bias, and distribution shifts during iterative model application. These issues become more pronounced with longer sequences, potentially leading to degraded performance and misalignment with human intent. As the field progresses, there is a growing need to address these challenges and develop more effective methods for training and aligning LLMs with human preferences and intentions.

Existing attempts to address the challenges in language model training have primarily focused on two main approaches: behavioral cloning (BC) and inverse reinforcement learning (IRL). BC, analogous to supervised fine-tuning via MLE, directly mimics expert demonstrations but suffers from compounding errors and requires extensive data coverage. IRL, on the other hand, jointly infers the policy and reward function, potentially overcoming BC’s limitations by utilizing additional environment interactions. Recent IRL methods have incorporated game-theoretic approaches, entropy regularization, and various optimization techniques to improve stability and scalability. In the context of language modeling, some researchers have explored adversarial training methods, such as SeqGAN, as alternatives to MLE. However, these approaches have shown limited success, working effectively only in specific temperature regimes. Despite these efforts, the field continues to seek more robust and scalable solutions for training and aligning large language models.

DeepMind researchers propose an in-depth investigation of RL-based optimization, particularly focusing on the distribution matching perspective of IRL, for fine-tuning large language models. This approach aims to provide an effective alternative to standard MLE. The study encompasses both adversarial and non-adversarial methods, as well as offline and online techniques. A key innovation is the extension of inverse soft Q-learning to establish a principled connection with classical behavior cloning or MLE. The research evaluates models ranging from 250M to 3B parameters, including encoder-decoder T5 and decoder-only PaLM2 architectures. By examining task performance and generation diversity, the study seeks to demonstrate the benefits of IRL over behavior cloning in imitation learning for language models. In addition to that, the research explores the potential of IRL-obtained reward functions to bridge the gap with later stages of RLHF.

The proposed methodology introduces a unique approach to language model fine-tuning by reformulating inverse soft Q-learning as a temporal difference regularized extension of MLE. This method bridges the gap between MLE and algorithms that exploit the sequential nature of language generation.

The approach models language generation as a sequential decision-making problem, where generating the next token is conditioned on the previously generated sequence. The researchers focus on minimizing the divergence between the γ-discounted state-action distribution of the policy and that of the expert policy, combined with a weighted causal entropy term.

The formulation uses the χ2-divergence and rescales the value function, resulting in the IQLearn objective:

(The IQLearn objective is presented as an equation image in the original post; image source: https://arxiv.org/pdf/2409.01369)

This objective consists of two main components:

1. A regularization term that couples the learned policy to a value function, favoring policies where the log probability of actions matches the difference in state values.

2. An MLE term that maintains the connection to traditional language model training.

Importantly, this formulation allows for annealing of the regularization term, providing flexibility in balancing between standard MLE (λ = 0) and stronger regularization. This approach enables offline training using only expert samples, potentially improving computational efficiency in large-scale language model fine-tuning.
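
A schematic way to write this objective, consistent with the description above but not copied verbatim from the paper (the exact weighting and value-function parameterization may differ), is

\mathcal{L}(\pi_\theta, V) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log \pi_\theta(a \mid s)\big] + \lambda\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(\log \pi_\theta(a \mid s) - (V(s) - \gamma V(s'))\big)^{2}\Big],

where the first term is the familiar MLE loss, the second is the temporal difference regularizer that couples the policy to a value function, and setting λ = 0 recovers standard MLE on expert samples.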

The researchers conducted extensive experiments to evaluate the effectiveness of IRL methods compared to MLE for fine-tuning large language models. Their results demonstrate several key findings:

1. Performance improvements: IRL methods, particularly IQLearn, showed small but notable gains in task performance across various benchmarks, including XSUM, GSM8k, TLDR, and WMT22. These improvements were especially pronounced for math and reasoning tasks.

2. Diversity enhancement: IQLearn consistently produced more diverse model generations compared to MLE, as measured by lower Self-BLEU scores. This indicates a better trade-off between task performance and output diversity.

3. Model scalability: The benefits of IRL methods were observed across different model sizes and architectures, including T5 (base, large, and xl) and PaLM2 models.

4. Temperature sensitivity: For PaLM2 models, IQLearn achieved higher performance in low-temperature sampling regimes across all tested tasks, suggesting improved stability in generation quality.

5. Reduced beam search dependency: IQLearn demonstrated the ability to reduce reliance on beam search during inference while maintaining performance, potentially offering computational efficiency gains.

6. GAIL performance: While stabilized for T5 models, GAIL proved challenging to implement effectively for PaLM2 models, highlighting the robustness of the IQLearn approach.

These results suggest that IRL methods, particularly IQLearn, provide a scalable and effective alternative to MLE for fine-tuning large language models, offering improvements in both task performance and generation diversity across a range of tasks and model architectures.

This paper investigates the potential of IRL algorithms for language model fine-tuning, focusing on performance, diversity, and computational efficiency. The researchers introduce a reformulated IQLearn algorithm, enabling a balanced approach between standard supervised fine-tuning and advanced IRL methods. Experiments reveal significant improvements in the trade-off between task performance and generation diversity using IRL. Notably, the study demonstrates that computationally efficient offline IRL achieves substantial performance gains over MLE-based optimization without requiring online sampling. In addition, the correlation analysis between IRL-extracted rewards and performance metrics suggests the potential for developing more accurate and robust reward functions in language modeling, paving the way for improved language model training and alignment.

Check out the Paper. All credit for this research goes to the researchers of this project.


Language Model Aware Speech Tokenization (LAST): A Unique AI Method that Integrates a Pre-Trained Text Language Model into the Speech Tokenization Process

Speech tokenization is a fundamental process that underpins the functioning of speech-language models, enabling these models to carry out a range of tasks, including text-to-speech (TTS), speech-to-text (STT), and spoken-language modeling. Tokenization offers the structure required by these models to efficiently analyze, process, and create speech by turning raw speech signals into discrete tokens. In many conventional methods, however, the tokenizer is trained separately from the language model itself. This division can result in a discrepancy between how the tokens are generated and how they are subsequently used in activities such as speech synthesis or recognition.

Conventional speech tokenizers rely on discrete representations of continuous speech signals created by quantization techniques and independent acoustic models. Frequently, these tokenizers are developed independently of the training of the language models they support. Consequently, the speech tokens produced during the tokenization phase may not match the way the language model interprets and utilizes them. This mismatch can limit the speech-language model's performance, because the tokenization process may not precisely match the learning objectives of the language model.

To overcome some of these issues, a team of researchers from the Hebrew University of Jerusalem has introduced Language Model Aware Speech Tokenization (LAST). With this approach, the speech tokenization procedure incorporates a pre-trained text language model (LM). There are three primary parts to LAST, which are as follows.

A contextualized speech representation is extracted via a pre-trained, frozen speech SSL model.

These representations are transformed into discrete tokens by an adapter-quantization module.

A pre-trained, frozen text language model directs the tokenization process, making it more appropriate for sequential modeling.

This technique seeks to provide discrete speech representations that are more appropriate for spoken language modeling and speech-to-text conversion by incorporating the goals of these text-based models into the tokenization process. By transforming the features acquired from a pre-trained speech model, this method creates a new feature space that is better suited for clustering and representation in a speech language model.
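
A conceptual PyTorch sketch of these three components is shown below; it is not the authors' released code, the dimensions, codebook size, and module choices are placeholders, and the training objective that ties the discrete tokens to the frozen text LM is omitted.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup that maps continuous features to discrete token ids."""
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x):                                   # x: (batch, time, dim)
        flat = x.reshape(-1, x.size(-1))                    # (batch * time, dim)
        dists = torch.cdist(flat, self.codebook.weight)     # (batch * time, num_codes)
        return dists.argmin(dim=-1).view(x.shape[:-1])      # (batch, time) token ids

class LASTSketch(nn.Module):
    def __init__(self, speech_encoder: nn.Module, text_lm: nn.Module,
                 num_codes: int = 500, speech_dim: int = 768, lm_dim: int = 768):
        super().__init__()
        self.speech_encoder = speech_encoder                # frozen speech SSL model
        self.text_lm = text_lm                              # frozen text LM that guides training
        for p in list(speech_encoder.parameters()) + list(text_lm.parameters()):
            p.requires_grad = False                         # only the adapter and quantizer train
        self.adapter = nn.Sequential(nn.Linear(speech_dim, lm_dim), nn.GELU(),
                                     nn.Linear(lm_dim, lm_dim))
        self.quantizer = VectorQuantizer(num_codes, lm_dim)

    def tokenize(self, speech_features):
        with torch.no_grad():
            feats = self.speech_encoder(speech_features)    # contextualized speech representations
        return self.quantizer(self.adapter(feats))          # discrete speech tokens for the text LM

# Dummy stand-ins so the sketch runs end to end; in practice these would be a pre-trained
# SSL speech encoder and a pre-trained text language model.
model = LASTSketch(speech_encoder=nn.Linear(80, 768), text_lm=nn.Linear(768, 768))
tokens = model.tokenize(torch.randn(1, 200, 80))            # tensor of shape (1, 200)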

There are various benefits to this alignment of the speech and textual models. First, it makes it possible for the speech tokenization process to be more influenced by the language's fundamental structure, allowing the tokens to represent linguistic elements pertinent to written and spoken communication. Aligning the tokenization with the LM's aims decreases the chance of mismatch, leading to more accurate and efficient performance across multiple speech tasks.

The work that presents this approach also examines the effects of important design decisions, such as the size of the text-based language model and the speech vocabulary. By experimenting with various setups, the researchers were able to determine how these variables affect the language model's overall performance and the efficiency of the tokenization process. According to their research, the integrated tokenization strategy performs better than conventional techniques in speech-to-text and spoken language modeling tasks.

One of this approach's most important results is the ability to interpret both speech and text inputs with a single pre-trained language model. This is a significant divergence from traditional approaches, which usually require distinct models for these different modalities. The suggested tokenization method improves efficiency and performance by streamlining the process with a single model that can handle both speech and text.

In conclusion, this approach to speech tokenization represents a major improvement over conventional methods by guaranteeing a greater alignment between the tokenization process and the goals of the language model. By incorporating the objectives of a pre-trained text language model, speech features are mapped into a new space that enables more efficient clustering and representation. As a result, a single model can be used for both speech and text inputs, leading to a more reliable and adaptable speech-language model that works better on a variety of tasks, including speech-to-text and spoken-language modeling.

Check out the Paper. All credit for this research goes to the researchers of this project.


Google DeepMind Researchers Propose Human-Centric Alignment for Vision Models to Boost AI Generalization and Interpretation

Deep learning has made significant strides in artificial intelligence, particularly in natural language processing and computer vision. However, even the most advanced systems often fail in ways that humans would not, highlighting a critical gap between artificial and human intelligence. This discrepancy has reignited debates about whether neural networks possess the essential components of human cognition. The challenge lies in developing systems that exhibit more human-like behavior, particularly regarding robustness and generalization. Unlike humans, who can adapt to environmental changes and generalize across diverse visual settings, AI models often struggle with shifted data distributions between training and test sets. This lack of robustness in visual representations poses significant challenges for downstream applications that require strong generalization capabilities.

Researchers from Google DeepMind, Machine Learning Group, Technische Universität Berlin, BIFOLD (Berlin Institute for the Foundations of Learning and Data), Max Planck Institute for Human Development, Anthropic, Department of Artificial Intelligence, Korea University, Seoul, and Max Planck Institute for Informatics propose a unique framework called AligNet to address the misalignment between human and machine visual representations. This approach aims to simulate large-scale human-like similarity judgment datasets for aligning neural network models with human perception. The methodology begins by using an affine transformation to align model representations with human semantic judgments in triplet odd-one-out tasks. This process incorporates uncertainty measures from human responses to improve model calibration. The aligned version of a state-of-the-art vision foundation model (VFM) then serves as a surrogate for generating human-like similarity judgments. By grouping representations into meaningful superordinate categories, the researchers sample semantically significant triplets and obtain odd-one-out responses from the surrogate model, resulting in a comprehensive dataset of human-like triplet judgments called AligNet.

The results demonstrate significant improvements in aligning machine representations with human judgments across multiple levels of abstraction. For global coarse-grained semantics, soft alignment substantially enhanced model performance, with accuracies increasing from 36.09-57.38% to 65.70-68.56%, surpassing the human-to-human reliability score of 61.92%. In local fine-grained semantics, alignment improved moderately, with accuracies rising from 46.04-57.72% to 58.93-62.92%. For class-boundary triplets, AligNet fine-tuning achieved remarkable alignment, with accuracies reaching 93.09-94.24%, exceeding the human noise ceiling of 89.21%. The effectiveness of alignment varied across abstraction levels, with different models showing strengths in different areas. Notably, AligNet fine-tuning generalized well to other human similarity judgment datasets, demonstrating substantial improvements in alignment across various object similarity tasks, including multi-arrangement and Likert-scale pairwise similarity ratings.

The AligNet methodology comprises several key steps to align machine representations with human visual perception. Initially, it uses the THINGS triplet odd-one-out dataset to learn an affine transformation into a global human object similarity space. This transformation is applied to a teacher model’s representations, creating a similarity matrix for object pairs. The process incorporates uncertainty measures about human responses using an approximate Bayesian inference method, replacing hard alignment with soft alignment.

The objective function for learning the uncertainty-distillation transformation combines soft alignment with a regularization term that preserves the local similarity structure. The transformed representations are then clustered into superordinate categories using k-means clustering. These clusters guide the generation of triplets from distinct ImageNet images, with odd-one-out choices determined by the surrogate teacher model.

Finally, a robust Kullback-Leibler divergence-based objective function facilitates the distillation of the teacher’s pairwise similarity structure into a student network. This AligNet objective is combined with regularization to preserve the pre-trained representation space, resulting in a fine-tuned student model that better aligns with human visual representations across multiple levels of abstraction.
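
For illustration, the following is a hedged PyTorch sketch (not DeepMind's code) of the kind of KL-divergence distillation described above: the student is trained to match the teacher's odd-one-out distribution over triplets, with a regularizer that keeps it close to its pre-trained representation space. The shapes, similarity measure, and regularizer form are assumptions.

import torch
import torch.nn.functional as F

def odd_one_out_logits(emb):                     # emb: (batch, 3, dim) triplet embeddings
    i, j, k = emb[:, 0], emb[:, 1], emb[:, 2]
    # The odd one out is the item NOT in the most similar pair, so the logit for
    # item i is the similarity of the remaining pair (j, k), and so on.
    return torch.stack([(j * k).sum(-1), (i * k).sum(-1), (i * j).sum(-1)], dim=-1)

def alignment_loss(student_emb, teacher_emb, pretrained_emb, lam=0.1):
    """KL(teacher || student) over odd-one-out choices, plus a term that preserves
    the student's pre-trained representation space."""
    p_teacher = F.softmax(odd_one_out_logits(teacher_emb), dim=-1)
    log_p_student = F.log_softmax(odd_one_out_logits(student_emb), dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    reg = F.mse_loss(student_emb, pretrained_emb)
    return kl + lam * reg

# Toy example with random embeddings for a batch of 8 triplets in a 64-d space
loss = alignment_loss(torch.randn(8, 3, 64, requires_grad=True),
                      torch.randn(8, 3, 64), torch.randn(8, 3, 64))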

This study addresses a critical deficiency in vision foundation models: their inability to adequately represent the multi-level conceptual structure of human semantic knowledge. By developing the AligNet framework, which aligns deep learning models with human similarity judgments, the research demonstrates significant improvements in model performance across various cognitive and machine learning tasks. The findings contribute to the ongoing debate about neural networks’ capacity to capture human-like intelligence, particularly in relational understanding and hierarchical knowledge organization. Ultimately, this work illustrates how representational alignment can enhance model generalization and robustness, bridging the gap between artificial and human visual perception.

Check out the Paper. All credit for this research goes to the researchers of this project.


CRISPR-Cas9 guide RNA efficiency prediction with efficiently tuned models in Amazon SageMaker

The clustered regularly interspaced short palindromic repeat (CRISPR) technology holds the promise to revolutionize gene editing technologies, which is transformative to the way we understand and treat diseases. This technique is based on a natural mechanism found in bacteria that allows a protein coupled to a single guide RNA (gRNA) strand to locate and make cuts in specific sites in the targeted genome. Being able to computationally predict the efficiency and specificity of gRNA is central to the success of gene editing.
Transcribed from DNA sequences, RNA is an important type of biological sequence of ribonucleotides (A, U, G, C), which folds into a 3D structure. Benefiting from recent advances in large language models (LLMs), a variety of computational biology tasks can be solved by fine-tuning biological LLMs pre-trained on billions of known biological sequences. Downstream tasks on RNA, however, remain relatively understudied.
In this post, we adopt a pre-trained genomic LLM for gRNA efficiency prediction. The idea is to treat a computer-designed gRNA as a sentence, and fine-tune the LLM to perform sentence-level regression tasks analogous to sentiment analysis. We used Parameter-Efficient Fine-Tuning methods to reduce the number of parameters and GPU usage for this task.
Solution overview
Large language models (LLMs) have gained a lot of interest for their ability to encode the syntax and semantics of natural languages. The neural architecture behind LLMs is the transformer, which consists of attention-based encoder-decoder blocks that generate an internal representation of the data they are trained on (encoder) and are able to generate sequences in the same latent space that resemble the original data (decoder). Due to their success in natural language, recent works have explored the use of LLMs for molecular biology information, which is sequential in nature.
DNABERT is a pre-trained transformer model with non-overlapping human DNA sequence data. The backbone is a BERT architecture made up of 12 encoding layers. The authors of this model report that DNABERT is able to capture a good feature representation of the human genome that enables state-of-the-art performance on downstream tasks like promoter prediction and splice/binding site identification. We decided to use this model as the foundation for our experiments.
Despite the success and popular adoption of LLMs, fine-tuning these models can be difficult because of the number of parameters and computation necessary for it. For this reason, Parameter-Efficient Fine-Tuning (PEFT) methods have been developed. In this post, we use one of these methods, called LoRA (Low-Rank Adaptation). We introduce the method in the following sections.
The following diagram is a representation of the Cas9 DNA target mechanism. The gRNA is the component that helps target the cleavage site.

The goal of this solution is to fine-tune a base DNABERT model to predict activity efficiency from different gRNA candidates. As such, our solution first takes gRNA data and processes it, as described later in this post. Then we use an Amazon SageMaker notebook and the Hugging Face PEFT library to fine-tune the DNABERT model with the processed RNA data. The label we want to predict is the efficiency score as it was calculated in experimental conditions testing with the actual RNA sequences in cell cultures. Those scores describe a balance between being able to edit the genome and not damage DNA that wasn’t targeted.
The following diagram illustrates the workflow of the proposed solution.

Prerequisites
For this solution, you need access to the following:

A SageMaker notebook instance (we trained the model on an ml.g4dn.8xlarge instance with a single NVIDIA T4 GPU)
transformers-4.34.1
peft-0.5.0
DNABERT 6

Dataset
For this post, we use the gRNA data released by researchers in a paper about gRNA prediction using deep learning. This dataset contains efficiency scores calculated for different gRNAs. In this section, we describe the process we followed to create the training and evaluation datasets for this task.
To train the model, you need a 30-mer gRNA sequence and its efficiency score. A k-mer is a contiguous sequence of k nucleotide bases extracted from a longer DNA or RNA sequence. For example, if you have the DNA sequence "ATCGATCG" and you choose k = 3, then the k-mers within this sequence would be "ATC," "TCG," "CGA," "GAT," "ATC," and "TCG."
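
For reference, the following small helper (not from the original post) extracts overlapping k-mers and reproduces the example above:

def kmers(sequence, k=3):
    """Return all overlapping k-mers of a DNA or RNA sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers("ATCGATCG", 3))
# ['ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']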
Efficiency score
Start with the Excel file 41467_2021_23576_MOESM4_ESM.xlsx from the CRISPRon paper in the Supplementary Data 1 section. In this file, the authors released the gRNA (20-mer) sequences and corresponding total_indel_eff scores. We specifically used the data from the sheet named spCas9_eff_D10+dox. We use the total_indel_eff column as the efficiency score.
Training and validation data
Given the 20-mers and the crispron scores (same as the total_indel_eff scores) from earlier, complete the following steps to put together the training and validation data:

Convert the sequences in the sheet “TRAP12K microarray oligos” into an .fa (fasta) file.
Run the script get_30mers_from_fa.py (from the CRISPRon GitHub repository) to obtain all possible 23-mers and 30-mers from the sequences obtained from Step 1.
Use the CRISPRspec_CRISPRoff_pipeline.py script (from the CRISPRon GitHub repository) to obtain the binding energy for the 23-mers obtained from Step 2. For more details on how to run this script, check out the code released by the authors of the CRISPRon paper (see the script CRISPRon.sh).
At this point, we have 23-mers along with the corresponding binding energy scores, and 20-mers along with the corresponding CRISPRon scores. Additionally, we have the 30-mers from Step 2.
Use the script prepare_train_dev_data.py (from our released code) to create training and validation splits. Running this script will create two files: train.csv and dev.csv.

The data looks something like the following:

id,rna,crisproff_score,crispron_score
seq2875_p_129,GTCCAGCCACCGAGACCCTGTGTATGGCAC,24.74484099890205,85.96491228
seq2972_p_129,AAAGGCGAAGCAGTATGTTCTAAAAGGAGG,17.216228493196073,94.81132075
. . .
. . .

Model architecture for gRNA encoding
To encode the gRNA sequence, we used the DNABERT encoder. DNABERT was pre-trained on human genomic data, so it’s a good model to encode gRNA sequences. DNABERT tokenizes the nucleotide sequence into overlapping k-mers, and each k-mer serves as a word in the DNABERT model’s vocabulary. The gRNA sequence is broken into a sequence of k-mers, and then each k-mer is replaced by an embedding for the k-mer at the input layer. Otherwise, the architecture of DNABERT is similar to that of BERT. After we encode the gRNA, we use the representation of the [CLS] token as the final encoding of the gRNA sequence. To predict the efficiency score, we use an additional regression layer. The MSE loss will be the training objective. The following is a code snippet of the DNABertForSequenceClassification model:

from typing import Optional, Tuple, Union

import torch
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers import BertModel, BertPreTrainedModel
from transformers.modeling_outputs import SequenceClassifierOutput


class DNABertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.bert = BertModel(config)
        classifier_dropout = (
            config.classifier_dropout
            if config.classifier_dropout is not None
            else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        # The pooled [CLS] representation serves as the encoding of the gRNA sequence
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (
                    labels.dtype == torch.long or labels.dtype == torch.int
                ):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)
        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

Fine-tuning and prompting genomic LLMs
Fine-tuning all the parameters of a model is expensive because pre-trained models have become much larger. LoRA is an innovative technique developed to address the challenge of fine-tuning extremely large language models. LoRA offers a solution by suggesting that the pre-trained model's weights remain fixed while introducing trainable layers (referred to as rank-decomposition matrices) within each transformer block. This approach significantly reduces the number of parameters that need to be trained and lowers the GPU memory requirements, because most model weights don't require gradient computations.
Therefore, we adopted LoRA as a PEFT method on the DNABERT model. LoRA is implemented in the Hugging Face PEFT library. When using PEFT to train a model with LoRA, the hyperparameters of the low rank adaptation process and the way to wrap base transformers models can be defined as follows:

from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    data_training_args.model_path,
    do_lower_case=False
)
# DNABertForSequenceClassification is a model class for the sequence classification task,
# built on top of the DNABert architecture.
model = DNABertForSequenceClassification.from_pretrained(
    data_training_args.model_path,
    config=config
)

# Define LoRA config
LORA_R = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
peft_config = LoraConfig(
    r=LORA_R,                   # the dimension of the low-rank matrices
    lora_alpha=LORA_ALPHA,      # scaling factor for the weight matrices
    lora_dropout=LORA_DROPOUT,  # dropout probability of the LoRA layers
    bias="none",
    task_type="SEQ_CLS",
)
model = get_peft_model(model, peft_config)
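
After wrapping the model, it is worth confirming how many parameters LoRA actually leaves trainable. The following snippet is a small sketch that uses PEFT's built-in print_trainable_parameters helper and, as a cross-check, counts trainable parameters directly with PyTorch; it assumes the model object created above.

# Report the trainable parameter count of the PEFT-wrapped model.
model.print_trainable_parameters()

# Equivalent manual count, useful for logging.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")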

Hold-out evaluation performances
We use RMSE, MSE, and MAE as evaluation metrics and tested LoRA with ranks 8 and 16. As a baseline, we also implemented a simple fine-tuning method that adds several dense layers on top of the DNABERT embeddings. The following table summarizes the results.

Method               RMSE     MSE       MAE
LoRA (rank = 8)      11.933   142.397   7.014
LoRA (rank = 16)     13.039   170.010   7.157
One dense layer      15.435   238.265   9.351
Three dense layers   15.435   238.241   9.505
CRISPRon             11.788   138.971   7.134

When rank=8, we have 296,450 trainable parameters, which is about 0.33% of the full model. The performance metrics are "rmse": 11.933, "mse": 142.397, "mae": 7.014.
When rank=16, we have 591,362 trainable parameters, which is about 0.66% of the full model. The performance metrics are "rmse": 13.039, "mse": 170.010, "mae": 7.157. There might be some overfitting under this setting.
We also compare what happens when adding a few dense layers:

After adding one dense layer, we have “rmse”: 15.435, “mse”: 238.265, “mae”: 9.351
After adding three dense layers, we have “rmse”: 15.435, “mse”: 238.241, “mae”: 9.505

Lastly, we compare with the existing CRISPRon method, a CNN-based deep learning model. Its performance metrics are "rmse": 11.788, "mse": 138.971, "mae": 7.134.
As expected, LoRA does much better than simply adding a few dense layers. Although LoRA performs slightly worse than CRISPRon here, a thorough hyperparameter search could likely close the gap or outperform it.
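
For reference, the three metrics reported above can be computed from hold-out predictions with a few lines of Python. This is a minimal sketch using NumPy and scikit-learn; the y_true and y_pred arrays are hypothetical placeholders for the hold-out labels and model predictions.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical hold-out labels and predictions; replace with real arrays.
y_true = np.array([62.1, 45.0, 78.3, 12.6])
y_pred = np.array([58.7, 49.2, 75.0, 15.1])

mse = mean_squared_error(y_true, y_pred)    # mean squared error
rmse = np.sqrt(mse)                         # root mean squared error
mae = mean_absolute_error(y_true, y_pred)   # mean absolute error
print({"rmse": rmse, "mse": mse, "mae": mae})
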
When using SageMaker notebooks, you have the flexibility to save the work and data produced during the training, turn off the instance, and turn it back on when you’re ready to continue the work, without losing any artifacts. Turning off the instance will keep you from incurring costs on compute you’re not using. We highly recommend only turning it on when you’re actively using it.
Conclusion
In this post, we showed how to use PEFT methods for fine-tuning DNA language models using SageMaker. We focused on predicting efficiency of CRISPR-Cas9 RNA sequences for their impact in current gene-editing technologies. We also provided code that can help you jumpstart your biology applications in AWS.
To learn more about the healthcare and life science space, refer to Run AlphaFold v2.0 on Amazon EC2 or Fine-tune and deploy the ProtBERT model for protein classification using Amazon SageMaker.

About the Authors
Siddharth Varia is an applied scientist in AWS Bedrock. He is broadly interested in natural language processing and has contributed to AWS products such as Amazon Comprehend. Outside of work, he enjoys exploring new places and reading. He got interested in this project after reading the book The Code Breaker.
Yudi Zhang is an Applied Scientist at AWS marketing. Her research interests are in the area of graph neural networks, natural language processing, and statistics.
Erika Pelaez Coyotl is a Sr Applied Scientist in Amazon Bedrock, where she’s currently helping develop the Amazon Titan large language model. Her background is in biomedical science, and she has helped several customers develop ML models in this vertical.
Zichen Wang is a Sr Applied Scientist in AWS AI Research & Education. He is interested in researching graph neural networks and applying AI to accelerate scientific discovery, specifically on molecules and simulations.
Rishita Anubhai is a Principal Applied Scientist in Amazon Bedrock. She has deep expertise in natural language processing and has contributed to AWS projects like Amazon Comprehend, Machine Learning Solutions Lab, and development of Amazon Titan models. She’s keenly interested in using machine learning research, specifically deep learning, to create tangible impact.

Improve RAG performance using Cohere Rerank

This post is co-written with Pradeep Prabhakaran from Cohere.
Retrieval Augmented Generation (RAG) is a powerful technique that can help enterprises develop generative artificial intelligence (AI) apps that integrate real-time data and enable rich, interactive conversations using proprietary data.
RAG allows these AI applications to tap into external, reliable sources of domain-specific knowledge, enriching the context for the language model as it answers user queries. However, the reliability and accuracy of the responses hinges on finding the right source materials. Therefore, honing the search process in RAG is crucial to boosting the trustworthiness of the generated responses.
RAG systems are important tools for building search and retrieval systems, but they often fall short of expectations due to suboptimal retrieval steps. The retrieval step can be enhanced with a reranking stage to improve search quality.
RAG is an approach that combines information retrieval techniques with natural language processing (NLP) to enhance the performance of text generation or language modeling tasks. This method involves retrieving relevant information from a large corpus of text data and using it to augment the generation process. The key idea is to incorporate external knowledge or context into the model to improve the accuracy, diversity, and relevance of the generated responses.
Workflow of RAG Orchestration
The RAG orchestration generally consists of two steps:

Retrieval – RAG fetches relevant documents from an external data source using the generated search queries. When presented with the search queries, the RAG-based application searches the data source for relevant documents or passages.
Grounded generation – Using the retrieved documents or passages, the generation model creates educated answers with inline citations using the fetched documents.

The following diagram shows the RAG workflow.

Document retrieval in RAG orchestration
One technique for retrieving documents in a RAG orchestration is dense retrieval, an approach to information retrieval that aims to understand the semantic meaning and intent behind user queries. Dense retrieval finds the documents closest to a user query in the embedding space, as shown in the following screenshot.

The goal of dense retrieval is to map both the user queries and documents (or passages) into a dense vector space. In this space, the similarity between the query and document vectors can be computed using standard distance metrics like cosine similarity or Euclidean distance. The documents closest in semantic meaning to the user query, based on the calculated distance metrics, are then presented back to the user.
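
To make the retrieval step concrete, the following sketch ranks documents by cosine similarity between a query embedding and precomputed document embeddings. The random vectors are placeholders; in practice both would come from the same embedding model.

import numpy as np

# Placeholder embeddings: 100 documents and one query, 1,024 dimensions each.
doc_embeddings = np.random.rand(100, 1024)
query_embedding = np.random.rand(1024)

# Cosine similarity between the query and every document.
scores = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Indices of the k closest documents, highest similarity first.
k = 5
top_k = np.argsort(-scores)[:k]
print(top_k, scores[top_k])
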
The quality of the final responses to search queries is significantly influenced by the relevance of the retrieved documents. While dense retrieval models are very efficient and can scale to large datasets, they struggle with more complex data and questions due to the simplicity of the method. Document vectors contain the meaning of text in a compressed representation, typically 768 to 1,536 dimensions. This often results in loss of information because the content is compressed into a single vector. When documents are retrieved during a vector search, the most relevant information is not always presented at the top of the results.
Boost search accuracy with Cohere Rerank
To address the challenges with accuracy, search engineers have used two-stage retrieval as a means of increasing search quality. In these two-stage systems, a first-stage model (an embedding model or retriever) retrieves a set of candidate documents from a larger dataset. Then, a second-stage model (the reranker) is used to rerank those documents retrieved by the first-stage model.
A reranking model, such as Cohere Rerank, is a type of model that will output a similarity score when given a query and document pair. This score can be used to reorder the documents that are most relevant to the search query. Among the reranking methodologies, the Cohere Rerank model stands out for its ability to significantly enhance search accuracy. The model diverges from traditional embedding models by employing deep learning to evaluate the alignment between each document and the query directly. Cohere Rerank outputs a relevance score by processing the query and document in tandem, which results in a more nuanced document selection process.
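
Conceptually, the two-stage setup looks like the sketch below: a fast first-stage retriever narrows the corpus to a candidate set, and a reranker rescores only those candidates. The embed_fn and rerank_fn callables are hypothetical stand-ins for an embedding model and a reranking model such as Cohere Rerank; they are not part of any specific SDK.

from typing import Callable, List
import numpy as np

def two_stage_search(
    query: str,
    corpus: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],  # hypothetical embedding model
    rerank_fn: Callable[[str, str], float],              # hypothetical reranker score
    first_stage_k: int = 100,
    final_k: int = 5,
) -> List[str]:
    # Stage 1: dense retrieval narrows the corpus to candidates by vector similarity.
    doc_vecs = np.array(embed_fn(corpus))
    q_vec = np.array(embed_fn([query]))[0]
    sims = doc_vecs @ q_vec
    candidates = [corpus[i] for i in np.argsort(-sims)[:first_stage_k]]

    # Stage 2: the reranker rescores only the candidate set against the query.
    scores = [rerank_fn(query, doc) for doc in candidates]
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order[:final_k]]
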
In the following example, the application was presented with a query: “When was the transformer paper coauthored by Aidan Gomez published?” The top-k with k = 6 returned the results shown in the image, in which the retrieved result set did contain the most accurate result, although it was at the bottom of the list. With k = 3, the most relevant document would not be included in the retrieved results.

Cohere Rerank aims to reassess and reorder the relevance of the retrieved documents based on additional criteria, such as semantic content, user intent, and contextual relevance, to output a similarity score. This score is then used to reorder the documents by relevance of the query. The following image shows reorder results using Rerank.

By applying Cohere Rerank after the first-stage retrieval, the RAG orchestration can gain the benefits of both approaches. While first-stage retrieval helps capture relevant items based on proximity matches within the vector space, reranking helps ensure that the most contextually relevant results are surfaced to the top. The following diagram demonstrates this improved efficiency.

The latest version of Cohere Rerank, Rerank 3, is purpose-built to enhance enterprise search and RAG systems. Rerank 3 offers state-of-the-art capabilities for enterprise search, including:

4k context length to significantly improve search quality for longer documents
Ability to search over multi-aspect and semi-structured data (such as emails, invoices, JSON documents, code, and tables)
Multilingual coverage of more than 100 languages
Improved latency and lower total cost of ownership (TCO)

The endpoint takes in a query and a list of documents, and it produces an ordered array with each document assigned a relevance score. This provides a powerful semantic boost to the search quality of any keyword or vector search system without requiring any overhaul or replacement.
Developers and businesses can access Rerank on Cohere’s hosted API and on Amazon SageMaker. This post offers a step-by-step walkthrough of consuming Cohere Rerank on Amazon SageMaker.
Solution overview
This solution follows these high-level steps:

Subscribe to the model package
Create an endpoint and perform real-time inference

Prerequisites
For this walkthrough, you must have the following prerequisites:

The cohere-aws notebook.

This is a reference notebook, and it cannot run unless you make changes suggested in the notebook. It contains elements that render correctly in the Jupyter interface, so you need to open it from an Amazon SageMaker notebook instance or in Amazon SageMaker Studio.

An AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess policy attached. To deploy this machine learning (ML) model successfully, choose one of the following options:

If your AWS account does not have a subscription to Cohere Rerank 3 Model – Multilingual, your IAM role needs to have the following three permissions, and you need to have the authority to make AWS Marketplace subscriptions in the AWS account used:

aws-marketplace:ViewSubscriptions
aws-marketplace:Unsubscribe
aws-marketplace:Subscribe

If your AWS account has a subscription to Cohere Rerank 3 Model – Multilingual, you can skip the instructions for subscribing to the model package.

Refrain from using full access in production environments. Security best practice is to opt for the principle of least privilege.
Implement Rerank 3 on Amazon SageMaker
To improve RAG performance using Cohere Rerank, use the instructions in the following sections.
Subscribe to the model package
To subscribe to the model package, follow these steps:

In AWS Marketplace, open the model package listing page Cohere Rerank 3 Model – Multilingual
Choose Continue to Subscribe.
On the Subscribe to this software page, review the End User License Agreement (EULA), pricing, and support terms and choose Accept Offer.
Choose Continue to configuration and then choose a Region. You will see a Product ARN displayed, as shown in the following screenshot. This is the model package Amazon Resource Name (ARN) that you need to specify while creating a deployable model using Boto3. Copy the ARN corresponding to your Region and enter it in the following cell.

The code snippets included in this post are sourced from the cohere-aws notebook. If you encounter any issues with this code, refer to the notebook for the most up-to-date version.

!pip install --upgrade cohere-aws
# if you upgrade the package, you need to restart the kernel

from cohere_aws import Client
import boto3

On the Configure for AWS CloudFormation page shown in the following screenshot, under Product Arn, make a note of the last part of the product ARN to use as the value in the variable cohere_package in the following code.

cohere_package = "cohere-rerank-multilingual-v3--13dba038aab73b11b3f0b17fbdb48ea0"

model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}

region = boto3.Session().region_name

if region not in model_package_map:
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

Create an endpoint and perform real-time inference
If you want to understand how real-time inference with Amazon SageMaker works, refer to the Amazon SageMaker Developer Guide.
Create an endpoint
To create an endpoint, use the following code.

co = Client(region_name=region)

co.create_endpoint(
    arn=model_package_arn,
    endpoint_name="cohere-rerank-multilingual-v3-0",
    instance_type="ml.g5.2xlarge",
    n_instances=1,
)

# If the endpoint is already created, you just need to connect to it:
# co.connect_to_endpoint(endpoint_name="cohere-rerank-multilingual-v3-0")

After the endpoint is created, you can perform real-time inference.
Create the input payload
To create the input payload, use the following code.

documents = [
    {"Title": "Contraseña incorrecta", "Content": "Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"Title": "Confirmation Email Missed", "Content": "Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"Title": "أسئلة حول سياسة الإرجاع", "Content": "مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"Title": "Customer Support is Busy", "Content": "Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"Title": "Falschen Artikel erhalten", "Content": "Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"Title": "Customer Service is Unavailable", "Content": "Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"Title": "Return Policy for Defective Product", "Content": "Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"Title": "收到错误物品", "Content": "早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"Title": "Return Defective Product", "Content": "Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
]

 
Perform real-time inference
To perform real-time inference, use the following code.

response = co.rerank(
    documents=documents,
    query="What emails have been about returning items?",
    rank_fields=["Title", "Content"],
    top_n=5,
)

Visualize output
To visualize output, use the following code.

print(f'Documents: {response}')

The following screenshot shows the output response.
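
The response can also be unpacked programmatically. The sketch below assumes the response exposes a results list whose items carry an index and a relevance_score, following Cohere's SDK conventions; verify the exact field names against the cohere-aws notebook.

# Assumed fields (results, index, relevance_score); check the notebook for the exact schema.
for result in response.results:
    doc = documents[result.index]
    print(f"{result.relevance_score:.4f}  {doc['Title']}")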

Cleanup
To avoid any recurring charges, use the following steps to clean up the resources created in this walkthrough.
Delete the model
Now that you have successfully performed a real-time inference, you do not need the endpoint anymore. You can terminate the endpoint to avoid being charged.

co.delete_endpoint()
co.close()

Unsubscribe from the listing (optional)
If you want to unsubscribe from the model package, follow these steps. Before you cancel the subscription, make sure that you don't have a deployable model created from the model package or using the algorithm. You can find this information by looking at the container name associated with the model.
Steps to unsubscribe from the product from AWS Marketplace:

On the Your Software subscriptions page, choose the Machine Learning tab
Locate the listing that you want to cancel the subscription for, and then choose Cancel Subscription

Summary
RAG is a capable technique for developing AI applications that integrate real-time data and enable interactive conversations using proprietary information. RAG enhances AI responses by tapping into external, domain-specific knowledge sources, but its effectiveness depends on finding the right source materials. This post focuses on improving search efficiency and accuracy in RAG systems using Cohere Rerank. RAG orchestration typically involves two steps: retrieval of relevant documents and generation of answers. While dense retrieval is efficient for large datasets, it can struggle with complex data and questions due to information compression. Cohere Rerank uses deep learning to evaluate the alignment between documents and queries, outputting a relevance score that enables more nuanced document selection.
Customers can find Cohere Rerank 3 and Cohere Rerank 3 Nimble on Amazon SageMaker JumpStart.

About the Authors
Shashi Raina is a Senior Partner Solutions Architect at Amazon Web Services (AWS), where he specializes in supporting generative AI (GenAI) startups. With close to 6 years of experience at AWS, Shashi has developed deep expertise across a range of domains, including DevOps, analytics, and generative AI.
Pradeep Prabhakaran is a Senior Manager – Solutions Architecture at Cohere. In his current role at Cohere, Pradeep acts as a trusted technical advisor to customers and partners, providing guidance and strategies to help them realize the full potential of Cohere’s cutting-edge Generative AI platform.

HNSW, Flat, or Inverted Index: Which Should You Choose for Your Search …

A significant challenge in information retrieval today is determining the most efficient method for nearest-neighbor vector search, especially with the growing complexity of dense and sparse retrieval models. Practitioners must navigate a wide range of options for indexing and retrieval methods, including HNSW (Hierarchical Navigable Small-World) graphs, flat indexes, and inverted indexes. These methods offer different trade-offs in terms of speed, scalability, and quality of retrieval results. As datasets become larger and more complex, the absence of clear operational guidance makes it difficult for practitioners to optimize their systems, particularly for applications requiring high performance, such as search engines and AI-driven applications like question-answering systems.

Traditionally, nearest-neighbor search is handled using three main approaches: HNSW indexes, flat indexes, and inverted indexes. HNSW indexes are commonly used for their efficiency and speed in large-scale retrieval tasks, particularly with dense vectors, but they are computationally intensive and require significant indexing time. Flat indexes, while exact in their retrieval results, become impractical for large datasets due to slower query performance. Sparse retrieval models, like BM25 or SPLADE++ ED, rely on inverted indexes and can be effective in specific scenarios but often lack the rich semantic understanding provided by dense retrieval models. The main limitation across these approaches is that none are universally applicable, with each method offering different strengths and weaknesses depending on the dataset size and retrieval requirements.

Researchers from the University of Waterloo introduce a thorough evaluation of the trade-offs between HNSW, flat, and inverted indexes for both dense and sparse retrieval models. This research provides a detailed analysis of the performance of these methods, measured by indexing time, query speed (QPS), and retrieval quality (nDCG@10), using the BEIR benchmark dataset. The researchers aim to give practical, data-driven advice on the optimal use of each method based on the dataset size and retrieval requirements. Their findings indicate that HNSW is highly efficient for large-scale datasets, while flat indexes are better suited for smaller datasets due to their simplicity and exact results. Additionally, the study explores the benefits of using quantization techniques to improve the scalability and speed of the retrieval process, offering a significant enhancement for practitioners working with large datasets.

The experimental setup utilizes the BEIR benchmark, a collection of 29 datasets designed to reflect real-world information retrieval challenges. The dense retrieval model used is BGE (BAAI General Embedding), with SPLADE++ ED and BM25 serving as the baselines for sparse retrieval. The evaluation focuses on two types of dense retrieval indexes: HNSW, which constructs graph-based structures for nearest-neighbor search, and flat indexes, which rely on brute-force search. Inverted indexes are used for sparse retrieval models. The evaluations are conducted using the Lucene search library, with specific configurations such as M=16 for HNSW. Performance is assessed using key metrics like nDCG@10 and QPS, with query performance evaluated under two conditions: cached queries (precomputed query encoding) and ONNX-based real-time query encoding.
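
To make the trade-off concrete, the sketch below builds a flat (exact) index and an HNSW index with M=16 over the same random vectors using the faiss library, then measures HNSW's recall against the exact results. The paper itself uses Lucene, so this is only an analogous illustration with placeholder data.

import numpy as np
import faiss

d, n = 768, 100_000                            # embedding dimension, corpus size
xb = np.random.rand(n, d).astype("float32")    # placeholder document vectors
xq = np.random.rand(10, d).astype("float32")   # placeholder query vectors

# Flat index: brute-force, exact nearest-neighbor search.
flat = faiss.IndexFlatIP(d)
flat.add(xb)

# HNSW index: graph-based approximate search with M = 16 links per node.
hnsw = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
hnsw.add(xb)

k = 10
_, exact_ids = flat.search(xq, k)
_, approx_ids = hnsw.search(xq, k)
# Recall@k of HNSW against the exact results approximates the quality gap.
recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(approx_ids, exact_ids)])
print(f"HNSW recall@{k} vs. flat: {recall:.3f}")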

The results reveal that for smaller datasets (under 100K documents), flat and HNSW indexes show comparable performance in terms of both query speed and retrieval quality. However, as dataset sizes increase, HNSW indexes begin to significantly outperform flat indexes, particularly in terms of query evaluation speed. For large datasets exceeding 1 million documents, HNSW indexes deliver far higher queries per second (QPS), with only a marginal decrease in retrieval quality (nDCG@10). When dealing with datasets of over 15 million documents, HNSW indexes demonstrate substantial improvements in speed while maintaining acceptable retrieval accuracy. Quantization techniques further boost performance, particularly in large datasets, offering notable increases in query speed without a significant reduction in quality. Overall, dense retrieval methods using HNSW prove to be far more effective and efficient than sparse retrieval models, particularly for large-scale applications requiring high performance.

This research offers essential guidance for practitioners in dense and sparse retrieval, providing a comprehensive evaluation of the trade-offs between HNSW, flat, and inverted indexes. The findings suggest that HNSW indexes are well-suited for large-scale retrieval tasks due to their efficiency in handling queries, while flat indexes are ideal for smaller datasets and rapid prototyping due to their simplicity and accuracy. By providing empirically-backed recommendations, this work significantly contributes to the understanding and optimization of modern information retrieval systems, helping practitioners make informed decisions for AI-driven search applications.

LLaMA-Omni: A Novel AI Model Architecture Designed for Low-Latency and …

Large language models (LLMs) have emerged as powerful general-purpose task solvers, capable of assisting people in various aspects of daily life through conversational interactions. However, the predominant reliance on text-based interactions has significantly limited their application in scenarios where text input and output are not optimal. While recent advancements, such as GPT-4o, have introduced speech interaction capabilities with extremely low latency, enhancing user experience, the open-source community still lacks a comprehensive exploration of building speech interaction models based on LLMs. The pressing challenge that researchers are striving to solve is how to achieve low-latency and high-quality speech interaction with LLMs, expanding their accessibility and applicability across diverse usage scenarios.

Several approaches have been attempted to enable speech interaction with LLMs, each with limitations. The simplest method involves a cascaded system using automatic speech recognition (ASR) and text-to-speech (TTS) models. However, this sequential approach results in higher latency due to the stepwise processing of transcribed text, text response, and speech response. Multimodal speech-language models have also been proposed, discretizing speech into tokens and expanding LLM vocabularies to support speech input and output. While these models theoretically allow direct speech-to-speech generation with low latency, practical implementation often involves generating intermediate text to maintain higher quality, sacrificing some response speed. Other attempts include training language models on semantic or acoustic tokens, joint training of speech tokens and text, and adding speech encoders to LLMs. However, these methods often require substantial data and computational resources or focus solely on speech understanding without generation capabilities.

Researchers from the University of Chinese Academy of Sciences introduced LLaMA-Omni, an innovative model architecture designed to overcome the challenge of achieving low-latency and high-quality speech interaction with LLMs. The approach integrates a speech encoder, speech adaptor, LLM, and streaming speech decoder to enable seamless speech-to-speech communication. The model processes speech input directly through the encoder and adaptor before feeding it into the LLM, bypassing the need for intermediate text transcription. A non-autoregressive streaming Transformer serves as the speech decoder, utilizing connectionist temporal classification to predict discrete units corresponding to the speech response. This architecture allows for the simultaneous generation of text and speech outputs, significantly reducing response latency. To support the development and evaluation of this model, researchers created the InstructS2S-200K dataset, tailored specifically for speech interaction scenarios.

LLaMA-Omni’s architecture consists of four main components: a speech encoder, a speech adaptor, an LLM, and a speech decoder. The speech encoder, based on Whisper-large-v3, extracts meaningful representations from the user’s speech input. These representations are then processed by the speech adaptor, which maps them into the LLM’s embedding space through downsampling and a two-layer perceptron. The LLM, based on Llama-3.1-8B-Instruct, generates text responses directly from the speech instruction. The speech decoder, a non-autoregressive streaming Transformer, takes the LLM’s output hidden states and uses connectionist temporal classification (CTC) to predict discrete units corresponding to the speech response.
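
The data flow described above can be summarized as a rough PyTorch skeleton. Everything below (module names, dimensions, the downsampling factor, layer counts) is a hypothetical sketch of the described architecture, not the authors' implementation.

import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Sketch: downsample speech-encoder features and map them into the LLM
    embedding space with a two-layer perceptron (dimensions are assumptions)."""
    def __init__(self, enc_dim=1280, llm_dim=4096, downsample=5):
        super().__init__()
        self.downsample = downsample
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * downsample, llm_dim),
            nn.ReLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):                 # feats: (batch, time, enc_dim)
        b, t, d = feats.shape
        t = t - t % self.downsample           # drop frames so time is divisible
        feats = feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.mlp(feats)                # (batch, time / downsample, llm_dim)

class CTCSpeechDecoder(nn.Module):
    """Sketch: non-autoregressive decoder that predicts discrete speech units
    from LLM hidden states through a CTC head."""
    def __init__(self, llm_dim=4096, n_units=1000, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(llm_dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.ctc_head = nn.Linear(llm_dim, n_units + 1)   # +1 for the CTC blank

    def forward(self, llm_hidden):            # (batch, seq, llm_dim)
        return self.ctc_head(self.blocks(llm_hidden))     # per-position unit logits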

The model employs a two-stage training strategy. In the first stage, it learns to generate text responses from speech instructions. The second stage focuses on generating speech responses, with only the speech decoder being trained. During inference, LLaMA-Omni simultaneously generates text and speech responses. As the LLM produces text, the speech decoder generates corresponding discrete units, which are then converted into speech waveforms in real-time. This approach enables extremely low-latency speech interaction, with users able to hear responses before the complete text is generated.

The InstructS2S-200K dataset was created to train LLaMA-Omni for speech interaction. It consists of 200,000 triplets of speech instructions, text responses, and speech responses. The construction process involved rewriting text instructions for speech using Llama-3-70B-Instruct, generating concise responses suitable for speech, and synthesizing speech using CosyVoice-300M-SFT for instructions and VITS for responses. The dataset combines 50,000 entries from Alpaca and 150,000 from UltraChat, covering diverse topics. This specialized dataset provides a robust foundation for training LLaMA-Omni in speech-based tasks, ensuring natural and efficient interactions.

LLaMA-Omni outperforms previous models in speech interaction tasks, as demonstrated by results on the InstructS2S-Eval benchmark. It excels in both content and style for speech-to-text and speech-to-speech instruction, achieving better alignment between speech and text responses. The model offers a trade-off between speech quality and response latency, with latency as low as 226ms. LLaMA-Omni’s simultaneous text and speech generation results in significantly faster decoding times compared to other models. Case studies show that LLaMA-Omni provides more concise, detailed, and helpful responses suitable for speech interaction scenarios, outperforming previous models in this context.

LLaMA-Omni, an innovative model architecture, has been developed to enable high-quality, low-latency speech interaction with LLMs. Built upon the Llama-3.1-8B-Instruct model, LLaMA-Omni incorporates a speech encoder for understanding and a streaming speech decoder for simultaneous text and speech response generation. The model's alignment with speech interaction scenarios was achieved through the creation of InstructS2S-200K, a dataset containing 200,000 speech instructions and responses. Experimental results demonstrate LLaMA-Omni's superior performance in both content and style compared to existing speech-language models, with a remarkably low response latency of 226ms. The model's efficient training process, requiring less than 3 days on 4 GPUs, facilitates the rapid development of speech interaction models based on cutting-edge LLMs.

SaRA: A Memory-Efficient Fine-Tuning Method for Enhancing Pre-Trained …

Recent advancements in diffusion models have significantly improved tasks like image, video, and 3D generation, with pre-trained models like Stable Diffusion being pivotal. However, adapting these models to new tasks efficiently remains a challenge. Existing fine-tuning approaches—Additive, Reparameterized, and Selective-based—have limitations, such as added latency, overfitting, or complex parameter selection. A proposed solution involves leveraging “temporarily ineffective” parameters—those with minimal current impact but the potential to learn new information—by reactivating them to enhance the model’s generative capabilities without the drawbacks of existing methods.

Researchers from Shanghai Jiao Tong University and Youtu Lab, Tencent, propose SaRA, a fine-tuning method for pre-trained diffusion models. Inspired by model pruning, SaRA reuses "temporarily ineffective" parameters with small absolute values by optimizing them using sparse matrices while preserving prior knowledge. They employ a nuclear-norm-based low-rank training scheme and a progressive parameter adjustment strategy to prevent overfitting. SaRA's memory-efficient nonstructural backpropagation reduces memory costs by 40% compared to LoRA. Experiments on Stable Diffusion models show SaRA's superior performance across various tasks, requiring only a single line of code modification for implementation.

Diffusion models, such as Stable Diffusion, excel in image generation tasks but are limited by their large parameter sizes, making full fine-tuning challenging. Methods like ControlNet, LoRA, and DreamBooth address this by adding external networks or fine-tuning to enable controlled generation or adaptation to new tasks. Parameter-efficient fine-tuning approaches like Additive Fine-Tuning (AFT) and Reparameterized Fine-Tuning (RFT) introduce low-rank matrices or adapters, while Selective Fine-Tuning (SFT) focuses on modifying specific parameters. SaRA improves on these methods by reusing ineffective parameters, maintaining the model architecture, reducing memory costs, and enhancing fine-tuning efficiency without additional inference latency.

In diffusion models, "ineffective" parameters, identified by their small absolute values, show minimal impact on performance when pruned. Experiments on Stable Diffusion models (v1.4, v1.5, v2.0, v3.0) revealed that setting parameters below a certain threshold to zero sometimes even improves generative tasks. The ineffectiveness stems from optimization randomness, not model structure, and fine-tuning can make these parameters effective again. SaRA leverages these temporarily ineffective parameters for fine-tuning, using low-rank constraints and progressive adjustment to prevent overfitting and enhance efficiency, significantly reducing memory and computation costs compared to existing methods like LoRA.

The proposed method was evaluated on tasks like backbone fine-tuning, image customization, and video generation using FID, CLIP score, and VLHI metrics. It outperformed existing fine-tuning approaches (LoRA, AdaptFormer, LT-SFT) across datasets, showing superior task-specific learning and prior preservation. Image and video generation achieved better consistency and avoided artifacts. The method also reduced memory usage and training time by over 45%. Ablation studies highlighted the importance of progressive parameter adjustment and low-rank constraints. Correlation analysis revealed more effective knowledge acquisition than other methods, enhancing task performance.
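
The core selection step can be illustrated in a few lines: mark the parameters whose absolute values fall below a threshold, then let gradient updates touch only those entries through a sparse mask. This is a simplified conceptual sketch of the idea described above, not the SaRA implementation (which adds the nuclear-norm low-rank loss and progressive parameter adjustment).

import torch

def build_ineffective_masks(model: torch.nn.Module, threshold: float = 1e-3):
    """Mark 'temporarily ineffective' parameters: entries with small absolute value."""
    return {
        name: (param.detach().abs() < threshold).float()
        for name, param in model.named_parameters()
    }

def masked_gradient_step(model: torch.nn.Module, masks: dict, lr: float = 1e-4):
    """Update only the masked (ineffective) entries, leaving the rest of the
    pre-trained weights untouched to preserve prior knowledge."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is not None:
                param -= lr * param.grad * masks[name]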

SaRA is a parameter-efficient fine-tuning method that leverages the least impactful parameters in pre-trained models. By utilizing a nuclear norm-based low-rank loss, SaRA prevents overfitting, while its progressive parameter adjustment enhances fine-tuning effectiveness. The unstructured backpropagation reduces memory costs, benefiting other selective fine-tuning methods. SaRA significantly improves generative capabilities in tasks like domain transfer and image editing, outperforming methods like LoRA. It requires only a one-line code modification for easy integration, demonstrating superior performance on models such as Stable Diffusion 1.5, 2.0, and 3.0 across multiple applications.

Assessing the Capacity of Large Language Models to Generate Innovative …

Research idea generation methods have evolved through techniques like iterative novelty boosting, multi-agent collaboration, and multi-module retrieval. These approaches aim to enhance idea quality and novelty in research contexts. Previous studies primarily focused on improving generation methods over basic prompting, without comparing results against human expert baselines. Large language models (LLMs) have been applied to various research tasks, including experiment execution, automatic review generation, and related work curation. However, these applications differ from the creative and open-ended task of research ideation addressed in this paper.

The field of computational creativity examines AI’s ability to produce novel and diverse outputs. Previous studies indicated that AI-generated writings tend to be less creative than those from professional writers. In contrast, this paper finds that LLM-generated ideas can be more novel than those from human experts in research ideation. Human evaluations have been conducted to assess the impact of AI exposure or human-AI collaboration on novelty and diversity, yielding mixed results. This study includes a human evaluation of idea novelty, focusing on comparing human experts and LLMs in the challenging task of research ideation.

Recent advancements in LLMs have sparked interest in developing research agents for autonomous idea generation. This study addresses the lack of comprehensive evaluations by rigorously assessing LLM capabilities in producing novel, expert-level research ideas. The experimental design compares an LLM ideation agent with expert NLP researchers, recruiting over 100 participants for idea generation and blind reviews. Findings reveal LLM-generated ideas as more novel but slightly less feasible than human-generated ones. The study identifies open problems in building and evaluating research agents, acknowledges challenges in human judgments of novelty, and proposes a comprehensive design for future research involving idea execution into full projects.

Researchers from Stanford University have introduced Quantum Superposition Prompting (QSP), a novel framework designed to explore and quantify uncertainty in language model outputs. QSP generates a ‘superposition’ of possible interpretations for a given query, assigning complex amplitudes to each interpretation. The method uses ‘measurement’ prompts to collapse this superposition along different bases, yielding probability distributions over outcomes. QSP’s effectiveness will be evaluated on tasks involving multiple valid perspectives or ambiguous interpretations, including ethical dilemmas, creative writing prompts, and open-ended analytical questions.

The study also presents Fractal Uncertainty Decomposition (FUD), a technique that recursively breaks down queries into hierarchical structures of sub-queries, assessing uncertainty at each level. FUD decomposes initial queries, estimates confidence for each sub-component, and recursively applies the process to low-confidence elements. The resulting tree of nested confidence estimates is aggregated using statistical methods and prompted meta-analysis. Evaluation metrics for these methods include diversity and coherence of generated superpositions, ability to capture human-judged ambiguities, and improvements in uncertainty calibration compared to classical methods.

The study reveals that LLMs can generate research ideas judged as more novel than those from human experts, with statistical significance (p < 0.05). However, LLM-generated ideas were rated slightly lower in feasibility. Over 100 NLP researchers participated in generating and blindly reviewing ideas from both sources. The evaluation used metrics including novelty, feasibility, and overall effectiveness. Open problems identified include LLM self-evaluation issues and lack of idea diversity. The research proposes an end-to-end study design for future work, involving the execution of generated ideas into full projects to assess the impact of novelty and feasibility judgments on research outcomes.

In conclusion, this study provides the first rigorous comparison between LLMs and expert NLP researchers in generating research ideas. LLM-generated ideas were judged more novel but slightly less feasible than human-generated ones. The research identifies open problems in LLM self-evaluation and idea diversity, highlighting challenges in developing effective research agents. Acknowledging the complexities of human judgments on novelty, the authors propose an end-to-end study design for future research. This approach involves executing generated ideas into full projects to investigate how differences in novelty and feasibility judgments translate into meaningful research outcomes, addressing the gap between idea generation and practical application.

gsplat: An Open-Source Python Library for Gaussian Splatting

Gaussian Splatting is a novel 3D rendering technique representing a scene as a collection of 3D Gaussian functions. These Gaussians are splatted, or projected, onto the image plane, enabling faster and more efficient rendering of complex scenes compared to traditional methods like neural radiance fields (NeRF). It is particularly effective at rendering dynamic and large-scale scenes with high visual quality. Currently, Gaussian Splatting methods like the original implementation and open-source projects such as GauStudio provide foundational tools for 3D reconstruction. However, these methods still face challenges in optimizing memory usage, training speed, and convergence times.

A team of researchers from UC Berkeley, Aalto University, ShanghaiTech University, SpectacularAI, Amazon, and Luma AI addressed these limitations by developing gsplat, an open-source Python library that integrates tightly with PyTorch and features optimized CUDA kernels to improve memory efficiency and training time. Unlike other methods, gsplat is designed for modularity, allowing developers to easily implement the latest Gaussian Splatting research advancements. It also introduces features such as pose optimization, depth rendering, and N-dimensional rasterization, which are missing in previous implementations.

The gsplat library includes several technological advancements and optimizations. For example, it implements advanced densification strategies such as Adaptive Density Control (ADC), the Absgrad method, and Markov Chain Monte Carlo (MCMC), which allow developers to control Gaussian pruning and densification more effectively. The library enables gradient flow to Gaussian parameters and camera view matrices for optimizing camera poses. This reduces pose uncertainty during 3D reconstruction. gsplat also introduces anti-aliasing techniques to mitigate aliasing effects in 3D scenes, using MipSplatting for enhanced visual quality. The library's back-end consists of highly optimized CUDA operations, resulting in faster training times and reduced memory consumption, as demonstrated in their experimental results.
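
As a rough illustration of what a splatting rasterizer does internally, the sketch below projects a single 3D Gaussian (mean and covariance) to a 2D image-plane Gaussian through a pinhole camera. This is a textbook EWA-style projection shown for intuition only; it is not gsplat's API or its CUDA implementation.

import numpy as np

def project_gaussian(mu, cov, R, t, fx, fy, cx, cy):
    """Project a 3D Gaussian (mu, cov) to a 2D mean and covariance for a pinhole
    camera with rotation R, translation t, and intrinsics fx, fy, cx, cy."""
    # Transform the mean into camera coordinates.
    m = R @ mu + t
    x, y, z = m
    # Pixel-space mean under perspective projection.
    mean2d = np.array([fx * x / z + cx, fy * y / z + cy])
    # Jacobian of the perspective projection, evaluated at the camera-space mean.
    J = np.array([
        [fx / z, 0.0, -fx * x / z**2],
        [0.0, fy / z, -fy * y / z**2],
    ])
    # Linearized 2D covariance of the splatted Gaussian.
    cov2d = J @ R @ cov @ R.T @ J.T
    return mean2d, cov2d

# Example: an isotropic Gaussian two meters in front of an identity camera.
mean2d, cov2d = project_gaussian(
    mu=np.array([0.0, 0.0, 2.0]), cov=0.01 * np.eye(3),
    R=np.eye(3), t=np.zeros(3), fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
print(mean2d, cov2d)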

gsplat outperforms the original implementation of Gaussian Splatting on several metrics. On the MipNeRF360 dataset, gsplat achieves the same rendering quality but reduces training time by 10% and memory consumption by up to 4×. It also supports advanced features, like the Absgrad and MCMC methods, which further improve performance in specific scenarios. For example, when combined with MCMC, gsplat reduces memory usage to 1.98 GB compared to the original 9 GB and decreases training time by over 40%. These improvements make gsplat suitable for large-scale training and hardware-constrained environments while promoting research by providing a flexible and modular interface.

In conclusion, the gsplat library successfully addresses the limitations of the original Gaussian Splatting methods by improving memory efficiency, reducing training time, and offering advanced features like pose optimization and anti-aliasing. It is designed to promote further research by providing a user-friendly, flexible API that integrates well with PyTorch.

Advancing Social Network Analysis: Integrating Stochastic Blockmodels, …

The use of relational data in social science has surged over the past two decades, driven by interest in network structures and their behavioral implications. However, the methods for analyzing such data are underdeveloped, leading to ad hoc, nonreplicable research and hindering the development of robust theories. Two emerging approaches, blockmodels and stochastic models for digraphs, offer promising solutions. Blockmodels clearly describe global structure and roles but lack explicit data variability models and formal fit tests. On the other hand, stochastic models handle data variability and fit testing but do not model global structure or relational ties. Combining these approaches could address their limitations and enhance relational data analysis.

Educational Testing Service and Carnegie-Mellon University researchers propose a stochastic model for social networks, partitioning actors into subgroups called blocks. This model extends traditional block models by incorporating stochastic elements and estimation techniques for single-relation networks with predefined blocks. It also introduces an extension to account for tendencies toward tie reciprocation, providing a one-degree-of-freedom test for model fit. The study discusses a merger of this approach with stochastic multigraphs and blockmodels, describes formal fit tests, and uses a numerical example from social network literature to illustrate the methods. The conclusion relates stochastic blockmodels to other blockmodel types.

A stochastic blockmodel is a framework used for analyzing sociometric data where a network is divided into subgroups or blocks, and the distribution of ties between nodes depends on these blocks. This model formalizes the deterministic blockmodel by introducing variability in the data. It is a specific type of probability distribution over adjacency arrays, where nodes are partitioned into blocks, and ties between nodes within the same block are modeled to be statistically equivalent. The model assumes that relations between nodes in the same block are distributed similarly and independently of ties between other pairs of nodes, formalizing the concept of “internal homogeneity” within blocks.

In practical applications, stochastic blockmodels analyze single-relation sociometric data with predefined blocks. The model simplifies estimation by focusing on block densities, that is, the probability of a tie between nodes in specific blocks. The estimation process involves calculating the likelihood function for observed data and deriving maximum likelihood estimates for block densities. This approach is particularly efficient as the likelihood function is tractable, and maximum likelihood estimates can be directly computed from observed block densities. This method allows for calculating measures such as the reciprocation of ties, providing insights into network structure beyond what deterministic models can offer.
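
For a network with known block assignments, the maximum likelihood estimate of each block density is simply the observed fraction of possible ties between (or within) the corresponding blocks. The sketch below computes these estimates from a binary adjacency matrix for a directed network without self-ties; variable names are illustrative.

import numpy as np

def block_density_mle(adj: np.ndarray, blocks: np.ndarray) -> np.ndarray:
    """MLE of block densities: adj is an n x n binary adjacency matrix,
    blocks is a length-n array of block labels."""
    labels = np.unique(blocks)
    densities = np.zeros((len(labels), len(labels)))
    for a, la in enumerate(labels):
        for b, lb in enumerate(labels):
            rows = np.where(blocks == la)[0]
            cols = np.where(blocks == lb)[0]
            ties = adj[np.ix_(rows, cols)].sum()
            # Ordered pairs available, excluding self-ties within a block.
            pairs = len(rows) * len(cols) - (len(rows) if la == lb else 0)
            densities[a, b] = ties / pairs if pairs > 0 else 0.0
    return densities

# Toy example: six actors split into two blocks, with a random directed network.
rng = np.random.default_rng(0)
adj = (rng.random((6, 6)) < 0.3).astype(int)
np.fill_diagonal(adj, 0)
blocks = np.array([0, 0, 0, 1, 1, 1])
print(block_density_mle(adj, blocks))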

The study explores advanced blockmodeling techniques to analyze reciprocity and pair-level structures in social networks. It discusses the concept of reciprocity, where mutual ties in relationships can exceed chance expectations, and introduces the Pair-Dependent Stochastic Blockmodel (PSB), which accounts for dependencies between relations. The Stochastic Blockmodel with Reciprocity (SBR) is a specific case of the PSB that includes parameters for mutual, asymmetric, and null ties. The text also covers estimation using Maximum Likelihood Estimation (MLE) and model fit testing. An empirical example from Sampson’s Monastery data illustrates the practical application of these models.

In conclusion, the paper addresses two key topics related to non-stochastic blockmodels. First, it discusses the closure challenge, noting that stochastic blockmodels are not closed under the binary product of adjacency matrices, complicating the understanding of indirect ties. Second, it explores the Bayesian approach to generating blocks, where blocks are not predetermined but discovered from data. This approach specifies the number of blocks, block size distributions, and density parameters for different block types. The Bayesian model allows for posterior probability estimation of block memberships, aiding more systematic analysis of relational data.
