Self-Rewarding Language Models: Paving the Way to AGI?

Atufa Shireen
Feb 18, 2024



To get better responses from ChatGPT, we often structure questions with instructions, context, or examples. This technique, prompt engineering, relies on the instruction-following ability of large language models (LLMs). Further training after pre-training is what aligns an LLM's responses with what users want. Traditionally this is done with RLHF: an explicit reward model is trained on human preference data, and the LLM is then optimized against it with an RL algorithm such as PPO.

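The Self-Rewarding Language Models paper introduces a different approach. It starts from two problems: the quality of human preference data is capped by human performance, and an explicit reward model is frozen once trained, so it cannot improve alongside the LLM. The authors argue that superhuman agents will need superhuman feedback, so the method has the model produce its own feedback: it judges its own responses, and the resulting AI Feedback Training (AIFT) data drives iterative DPO training. Think of it as a two-part process: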

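1. Self-Instruction Creation:

  • The model generates several candidate responses for each new prompt.
  • Acting as its own judge (LLM-as-a-Judge prompting), it scores those responses, turning its own outputs into a new instruction-following training set.

2. Instruction-Following Training:

  • Preference pairs built from the scored responses are used to train the next version of the model with DPO.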
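The self-instruction creation step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `generate` and `judge` are hypothetical stand-ins for calls to the same LLM (one to sample a response, one to score it), and the choice of four candidates and a best-versus-worst pairing are assumptions made for the example.

```python
from typing import Callable, Optional, Tuple


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: samples one response from the model
    judge: Callable[[str, str], float],   # hypothetical: the same model scoring (prompt, response)
    num_candidates: int = 4,              # assumed value for this sketch
) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected), or None when every candidate ties."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    scores = [judge(prompt, response) for response in candidates]

    best = max(range(num_candidates), key=scores.__getitem__)
    worst = min(range(num_candidates), key=scores.__getitem__)

    # A pair with identical scores carries no preference signal, so skip it.
    if scores[best] == scores[worst]:
        return None
    return prompt, candidates[best], candidates[worst]


# Toy usage: canned responses, with response length standing in for the judge score.
canned = iter(["short", "a much longer answer", "mid answer", "ok"])
pair = build_preference_pair(
    "Explain DPO.",
    generate=lambda p: next(canned),
    judge=lambda p, r: float(len(r)),
)
print(pair)
```

In a real setup both callables would wrap the current model, so the quality of the pairs depends on how well that model can judge.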

Overall Self-Alignment Algorithm:

  • Training begins with the pre-trained Llama 2 70B model (M0).
  • The first iteration fine-tunes M0 on two seed sets: human-annotated instruction-following examples (IFT data) and LLM-as-a-Judge evaluation examples (EFT data). The result is a new model, M1.
  • Each subsequent iteration runs self-instruction creation with the current model and then instruction-following training with DPO on the resulting preference data, producing M2 from M1 and M3 from M2.
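The sequence of models described above can be written as a short loop. The sketch below is only a skeleton under stated assumptions: `sft`, `make_pairs`, and `dpo` are hypothetical callables standing in for supervised fine-tuning, self-instruction creation, and DPO training, and the list of prompts is passed in rather than generated.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

M = TypeVar("M")                       # whatever object represents a model
PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def self_rewarding_training(
    base_model: M,
    seed_data: Sequence[object],       # IFT + EFT seed examples
    prompts: Sequence[str],
    sft: Callable[[M, Sequence[object]], M],                      # hypothetical trainer
    make_pairs: Callable[[M, Sequence[str]], List[PreferencePair]],  # hypothetical
    dpo: Callable[[M, List[PreferencePair]], M],                  # hypothetical trainer
    num_dpo_iterations: int = 2,
) -> List[M]:
    """Return [M0, M1, M2, ...] following the iteration scheme in the text."""
    models = [base_model]                       # M0: pre-trained model
    models.append(sft(base_model, seed_data))   # M1: fine-tuned on seed data

    for _ in range(num_dpo_iterations):
        current = models[-1]
        # The current model generates and judges its own responses...
        pairs = make_pairs(current, prompts)
        # ...and the next model is trained on those preference pairs.
        models.append(dpo(current, pairs))
    return models
```

With `num_dpo_iterations=2` the returned list corresponds to M0, M1, M2, and M3. The point the loop makes explicit is that each DPO round uses data produced by the model from the previous round.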

But why can this method stand in for RLHF-style alignment? The answer lies in DPO, which avoids a separate reward model: it trains the LLM directly on preference pairs to raise the likelihood of preferred responses over rejected ones, while a penalty keeps it from drifting far from a reference model. The DPO authors show that this loss optimizes the same objective as RLHF in the paper titled ‘Direct Preference Optimization: Your Language Model is Secretly a Reward Model.’
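To make the DPO idea concrete, here is a minimal sketch of the loss for a single preference pair, following the formula in the DPO paper. The function and argument names are my own, the inputs are assumed to be summed log-probabilities of each full response, and `beta=0.1` is just an illustrative default.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # strength of the pull towards the reference model (assumed value)
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)) computed as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Favouring the chosen response more than the reference does lowers the loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere: the log-probability ratios against the reference model play that role, which is the sense in which the language model is "secretly a reward model".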


The paper evaluates the method across three key metrics:

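  1. Instruction-following ability: M1 wins about as often as a baseline fine-tuned on IFT data alone, while M2 and M3 each win clearly in head-to-head comparisons against the previous iteration.
  2. Reward modeling: The model's judgments agree more closely with human preference rankings at each iteration.
  3. Performance on the AlpacaEval 2.0 leaderboard: The final iteration (M3) outperforms many existing systems, including Claude 2, Gemini Pro, and GPT-4 0613, even though the method starts from a small seed dataset.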
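Numbers like these come down to simple counting. The sketch below shows two such counts as I understand them: a head-to-head win rate, and a pairwise accuracy that measures how often a judge's scores agree with a human preference. The function names, the label strings, and the treatment of ties are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Sequence, Tuple


def win_rate(outcomes: Sequence[str]) -> float:
    """Fraction of head-to-head comparisons labelled 'win' (others: 'tie', 'loss')."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    return sum(outcome == "win" for outcome in outcomes) / len(outcomes)


def pairwise_accuracy(scored_pairs: Sequence[Tuple[float, float]]) -> float:
    """Share of pairs where the judge scores the human-preferred response higher.

    Each item is (score_of_human_preferred, score_of_human_rejected).
    Ties count as disagreement here, which is an assumption of this sketch.
    """
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(preferred > rejected for preferred, rejected in scored_pairs)
    return correct / len(scored_pairs)


print(win_rate(["win", "tie", "loss", "win"]))                   # 0.5
print(pairwise_accuracy([(4, 2), (3, 3), (1, 5), (5, 0)]))       # 0.5
```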


Self-rewarding language models address a limitation of alignment methods that rely on static human data. Because the same model both follows instructions and evaluates responses, it produces a continuing supply of new training data from its own generations. The step is incremental but meaningful: it points towards artificial general intelligence and towards simpler training pipelines driven by cycles of autonomous self-learning.