As Large Language Models (LLMs) like ChatGPT, LLaMA, and Mistral continue to advance, concerns about their susceptibility to harmful queries have intensified, prompting the need for robust safeguards. Approaches such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) have been widely adopted to enhance the safety of LLMs, enabling them to reject harmful queries.
However, despite these advancements, aligned models may still be vulnerable to sophisticated attack prompts, raising questions about the precise modification of toxic regions within LLMs to achieve detoxification. Recent studies have demonstrated that previous approaches, such as DPO, may only suppress the activations of toxic parameters without effectively addressing underlying vulnerabilities, underscoring the importance of developing precise detoxification methods.
In response to these challenges, recent years have seen significant progress in knowledge editing methods tailored for LLMs, allowing for post-training adjustments without compromising overall performance. Leveraging knowledge editing to detoxify LLMs appears intuitive; however, existing datasets and evaluation metrics have focused on specific harmful issues, overlooking the threat posed by attack prompts and neglecting generalizability to various malicious inputs.
To address this gap, researchers at Zhejiang University have introduced SafeEdit, a comprehensive benchmark designed to evaluate detoxification tasks via knowledge editing. SafeEdit covers nine unsafe categories with powerful attack templates and extends evaluation metrics to include defense success, defense generalization, and general performance, providing a standardized framework for assessing detoxification methods.
Several knowledge editing approaches, including MEND and Ext-Sub, have been explored on LLaMA and Mistral models, demonstrating the potential to detoxify LLMs efficiently with minimal impact on general performance. However, existing methods primarily target factual knowledge and may need help identifying toxic regions in response to complex adversarial inputs spanning multiple sentences.
To address these challenges, researchers have proposed a novel knowledge editing baseline, Detoxifying with Intraoperative Neural Monitoring (DINM), which aims to diminish toxic regions within LLMs while minimizing side effects. Extensive experiments on LLaMA and Mistral models have shown that DINM outperforms traditional SFT and DPO methods in detoxifying LLMs, demonstrating stronger detoxification performance, efficiency, and the importance of accurately locating toxic regions.
In conclusion, the findings underscore the significant potential of knowledge editing for detoxifying LLMs, with the introduction of SafeEdit providing a standardized framework for evaluation. The efficient and effective DINM method represents a promising step towards addressing the challenge of detoxifying LLMs, shedding light on future applications of supervised fine-tuning, direct preference optimization, and knowledge editing in enhancing the safety and robustness of large language models.
Check out the Paper and Github. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our newsletter..
Don’t Forget to join our 39k+ ML SubReddit
The post This AI Paper Introduces SafeEdit: A New Benchmark to Investigate Detoxifying LLMs via Knowledge Editing appeared first on MarkTechPost.