Large Language Models (LLMs) have gained popularity for their ability to respond to user queries in a human-like manner, a capability typically refined through reinforcement learning. However, aligning LLMs with human preferences via reinforcement learning from human feedback (RLHF) can lead to a phenomenon known as reward hacking: the policy exploits flaws in the reward model (RM) to achieve high rewards without fulfilling the underlying objectives, as illustrated in Figure 1(b). Reward hacking raises concerns such as degraded performance, checkpoint-selection difficulties, amplified biases, and, most critically, safety risks.
The paper identifies two primary challenges in designing RMs that mitigate reward hacking: distribution shifts and inconsistent preferences in the preference dataset. Distribution shifts arise because the policy drifts during RL, deviating from the offline preference dataset on which the RM was trained. Inconsistent preferences stem from noisy binary labels, which lower inter-labeler agreement and undermine RM robustness. Existing approaches address these challenges with strategies such as KL regularization, active learning, and prediction ensembling (ENS), but these methods suffer from efficiency issues, reliability concerns, and difficulties with inconsistent preferences.
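As a concrete illustration of the KL-regularization baseline mentioned above, the RM score given to each sampled response can be penalized by how far the policy's log-probability drifts from a frozen reference model. This is a minimal sketch under common RLHF conventions; the function name and the scalar per-sequence KL estimate are illustrative, not the paper's exact formulation.

```python
def kl_regularized_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """Penalize the reward-model score by an estimate of the KL divergence
    between the current policy and a frozen reference model.

    logprob_policy - logprob_ref is a per-sequence estimate of the KL term;
    beta controls how strongly the policy is kept close to the reference.
    """
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate
```

The intuition: as the policy drifts toward reward-hacking outputs that the reference model finds unlikely, the KL penalty grows and offsets the inflated RM score.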
To tackle these challenges, the paper proposes Weight Averaged Reward Models (WARM), illustrated in Figure 1(a): a simple, efficient, and scalable strategy for obtaining a reliable and robust RM. WARM combines multiple RMs through linear interpolation in the weight space, which yields efficiency, improved reliability under distribution shifts, and enhanced robustness to label corruption. The diversity across the fine-tuned weights is a key contributor to WARM's effectiveness.
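The core operation of WARM, linear interpolation of fine-tuned weights, can be sketched as follows. This is a hypothetical helper operating on plain parameter dictionaries; a real implementation would average framework tensors (e.g. PyTorch state dicts), and the averaged checkpoints must share an architecture and a common pretrained initialization.

```python
def warm_average(state_dicts):
    """Uniformly average M reward-model checkpoints, parameter by parameter.

    All checkpoints are assumed to be fine-tuned from the same pretrained
    initialization, so that linear mode connectivity makes the averaged
    weights a well-behaved model rather than an arbitrary mixture.
    """
    m = len(state_dicts)
    return {key: sum(sd[key] for sd in state_dicts) / m
            for key in state_dicts[0]}
```

Because the averaging happens once, offline, the resulting RM is a single model with the same inference cost as any one of its constituents.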
Compared with prediction ensembling (ENS), WARM is efficient and practical: it requires a single model at inference time, eliminating the memory and inference overheads of maintaining an ensemble. Empirical results indicate that WARM matches ENS in variance reduction while outperforming it under distribution shifts. The paper identifies linear mode connectivity (LMC) as a key factor in WARM's success, showing that weight averaging memorizes less and generalizes better than ensembling predictions. The experiments support three observations, empirically validated in Figures 3 and 4:
Observation 1 (LMC): The accuracy of the interpolated model is at least as good as the interpolation of the individual accuracies.
Observation 2 (WA and ENS): Weight averaging and prediction ensembling perform similarly.
Observation 3 (WA and ENS): The accuracy gains of WA over ENS grow as data moves away from the training distribution.
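The contrast between the two strategies in the observations above can be sketched with two hypothetical helpers. For a toy linear reward head the two coincide exactly (as the test of Observation 2 would suggest); for nonlinear networks they diverge, which is where the generalization differences of Observation 3 appear. The helper names and the toy head are illustrative, not the paper's implementation.

```python
def ens_reward(models, x):
    """Prediction ensembling (ENS): run all M reward models at inference
    time and average their output scores."""
    return sum(m(x) for m in models) / len(models)

def wa_reward(weight_dicts, forward, x):
    """Weight averaging (WA / WARM): merge the parameters once, then run a
    single forward pass with the averaged weights."""
    m = len(weight_dicts)
    avg = {k: sum(w[k] for w in weight_dicts) / m
           for k in weight_dicts[0]}
    return forward(avg, x)

def linear_head(w, x):
    # Toy linear reward head; for linear models WA and ENS give
    # identical scores, so any gap only emerges with nonlinearity.
    return w["a"] * x + w["b"]
```

The practical upshot: ENS pays for M forward passes per query, while WA pays the averaging cost once and thereafter behaves like a single model.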
The benefits of WARM extend beyond its primary goals. It aligns with the updatable machine learning paradigm, allowing parallelization in federated learning scenarios. WARM could contribute to privacy and bias mitigation by reducing memorization of private preferences. The method shows potential for combining RMs trained on different datasets, supporting iterative and evolving preferences. Further exploration includes extending WARM to direct preference optimization strategies.
Despite its innovation, WARM has limitations relative to prediction ensembling: it requires the averaged models to share an architecture and a common pretrained initialization, and it offers less direct uncertainty estimation. WARM also does not entirely eliminate spurious correlations or biases in the preference data, so additional methods are needed for a comprehensive solution. Lastly, WARM focuses on enhancing reward modeling and should be considered within the broader context of responsible AI to address safety risks arising from misalignment.
In conclusion, Weight Averaged Reward Models (WARM) offer a promising solution to challenges in reward modeling, enhancing alignment in RLHF. The paper’s empirical results and theoretical insights position WARM as a valuable contribution toward creating more aligned, transparent, and effective AI systems.
The post Google DeepMind Researchers Propose WARM: A Novel Approach to Tackle Reward Hacking in Large Language Models Using Weight-Averaged Reward Models appeared first on MarkTechPost.