Direct Preference Optimization

Bradley-Terry model

Suppose a bunch of people have various preferences for something, say, person A rates a cookie 5/10, person B rates it 7/10, and so on. Or suppose that we have a bunch of players in a game and each has earned a particular score. What's the probability that one person will rate the cookie higher than the other? Or what's the probability that one player will win against another? We can model this as follows

\[p(y_1 \succ y_2) = \frac{e^{\beta_1}}{e^{\beta_1} + e^{\beta_2}}\]

where \(\beta_1\) and \(\beta_2\) are the scores associated with \(y_1\) and \(y_2\) respectively. In the cookie case, we have \(p(A \succ B)\), i.e., the probability that the cookie will be preferred by A over B, and \(\beta_1\) and \(\beta_2\) are the scores assigned by A and B respectively. The denominator normalizes, i.e., ensures that the two probabilities \(p(y_1 \succ y_2)\) and \(p(y_2 \succ y_1)\) sum to 1.
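
For intuition, here's a tiny numerical example of the BT model (toy scores; the function name is just illustrative):

import math


def bt_preference_prob(beta_1: float, beta_2: float) -> float:
  """Bradley-Terry probability that the item/player with score beta_1
  is preferred over (beats) the one with score beta_2."""
  return math.exp(beta_1) / (math.exp(beta_1) + math.exp(beta_2))


# Toy scores from the cookie example: 5/10 vs 7/10.
print(bt_preference_prob(5.0, 7.0))  # ~0.12
print(bt_preference_prob(7.0, 5.0))  # ~0.88, the two probabilities sum to 1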

RLHF objective

We let an LLM generate completions \(y_1\) and \(y_2\) for a prompt \(x\), and human annotators rank the completions, i.e., they state which completion they prefer. Let's denote the ranking mechanism used by the annotators as \(r^*(x, y)\), which is the score assigned by them to a completion \(y\) for a prompt \(x\). Then, as per the BT model, we have

\[\begin{aligned} p(y_i \succ y_j) &= \frac{e^{r^*(x, y_i)}}{e^{r^*(x, y_i)} + e^{r^*(x, y_j)}} \\ &= \frac{1}{1 + e^{(r^*(x, y_j) - r^*(x, y_i))}} \tag{1} \end{aligned}\]

where we have divided the numerator and denominator by \(e^{r^*(x, y_i)}\). \(r^*\) is what we first aim to learn in classic RLHF [1], and we do this by training a reward model \(r_\phi\) parametrized by \(\phi\). We then use this reward model to optimize the LLM using an off-the-shelf online RL algorithm (like PPO). The objective for the RL phase is

\[\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) \right] - \beta D_{KL}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x)) \tag{2}\]

where the second term is the KL divergence penalty that prevents our model \(\pi_\theta\) from deviating too much from the reference pre-trained model \(\pi_{\text{ref}}\), and \(\beta\) controls the strength of the penalty. Now, RLHF is expensive because it's a two-step process: we first train a reward model, then generate trajectories with the model being trained and optimize it against \(r_\phi\). Using a reward model instead of the true reward also feels janky: the reward model is not perfect, and the LM has to adapt to this imperfect reward model, so errors compound.
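
For concreteness, here's a minimal sketch of that reward-modeling step: \(r_\phi\) is typically fit by maximizing the likelihood of the annotators' preferences under the BT model in \((1)\) (the function and tensor names below are mine, not from any reference implementation):

import torch.nn.functional as F


def reward_model_loss(r_w, r_l):
  """
  Negative log-likelihood of the preferences under the BT model in (1).
  r_w: scores r_phi(x, y_w) for the preferred completions, shape (B,)
  r_l: scores r_phi(x, y_l) for the dispreferred completions, shape (B,)
  """
  # p(y_w > y_l) = sigmoid(r_w - r_l); maximize its log, i.e., minimize the NLL.
  return -F.logsigmoid(r_w - r_l).mean()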

Is there a way to frame the objective such that it only depends on the output of the policy model \(\pi_\theta\) and the reference model \(\pi_{\text{ref}}\) and not the learned reward model \(r_\phi\), effectively eliminating the reward model training step? Enter DPO [2].

Reframing the RLHF objective

We can rewrite the reward objective in \((2)\) like so

\[\begin{aligned} & \max_{\pi_\theta} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_\theta(y\,|\,x)} \left[ r(x, y) \right] - \beta D_{KL}(\pi_\theta(y\,|\,x)\,\|\,\pi_{\text{ref}}(y\,|\,x)) \\ &= \max_{\pi_\theta} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_\theta(y\,|\,x)} \left[ r(x, y) - \beta \log \frac{\pi_\theta(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} \right] \\ &= \min_{\pi_\theta} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_\theta(y\,|\,x)} \left[ \beta \log \frac{\pi_\theta(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - r(x, y) \right] \\ &= \min_{\pi_\theta} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_\theta(y\,|\,x)} \left[ \log \frac{\pi_\theta(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - \frac{1}{\beta} r(x, y) \right] \\ &= \min_{\pi_\theta} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_\theta(y\,|\,x)} \left[ \log \frac{\pi_\theta(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - \log \exp\left(\frac{1}{\beta} r(x, y)\right) \right] \\ &= \min_{\pi_\theta} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_\theta(y\,|\,x)} \left[ \log \frac{\pi_\theta(y\,|\,x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y\,|\,x) \exp\left(\frac{1}{\beta} r(x, y)\right)} - \log Z(x) \right] \tag{3} \end{aligned}\]

where in the second step we have moved \(D_{KL}\) inside the expectation, since it is itself an expectation over \(y \sim \pi_\theta(y\,|\,x)\) of the log-ratio \(\log \frac{\pi_\theta(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)}\), and

\[Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)\]

is the partition function. We can show that the optimally trained model \(\pi_\theta^*\) is given by

\[\pi_\theta^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right) \tag{4}\]

From the definition of \(Z(x)\), one can see that \(\sum_y \pi_\theta^*(y\,|\,x) = 1\), and since \(\pi_\theta^*(y\,|\,x) \geq 0\), it is a valid probability distribution. Splitting the expectation in \((3)\) into an outer expectation over \(x\) and an inner one over \(y\) (note that \(\log Z(x)\) does not depend on \(y\)), we get the objective

\[\begin{aligned} &\min_{\pi_\theta} \mathbb{E}_{x\,\sim\,\mathcal{D}} \left[ \mathbb{E}_{y\,\sim\,\pi_\theta(y\,|\,x)} \left[ \log \frac{\pi_\theta(y\,|\,x)}{\pi_\theta^*(y\,|\,x)} \right] - \log Z(x) \right] \\ &= \min_{\pi_\theta} \mathbb{E}_{x\,\sim\,\mathcal{D}} \left[ D_{KL}(\pi_\theta(y\,|\,x)\,\|\,\pi_\theta^*(y\,|\,x)) - \log Z(x) \right] \end{aligned}\]

Since \(Z(x)\) does not depend on \(\pi_\theta\), this objective is minimized exactly when the KL divergence is zero, i.e., when \(\pi_\theta(y|x) = \pi_\theta^*(y|x)\), which shows that \(\pi_\theta^*\) is the optimal policy model. Now, we need to get rid of \(r\). If \(\pi_\theta^*\) is the optimal policy model, then the ground truth reward \(r^*\) is given by

\[r^*(x, y) = \beta \log \frac{\pi_\theta^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) \tag{5}\]

which you can get by taking the log of both sides of \((4)\) and solving for \(r^*\). What this says is the following: suppose we have an RL-tuned model \(\pi_\theta^*\) that perfectly satisfies the human annotators' preference function \(r^*\) (\(r^*\) is the underlying reward function that represents the human annotators, and it is what classic RLHF approximates with a learned reward model \(r_\phi\)). Then \(r^*\) is given by \((5)\), the score assigned to a completion \(y\) given a prompt \(x\). We can substitute this score into the Bradley-Terry preference model in \((1)\) to get the probability that completion \(y_1\) is preferred over completion \(y_2\) given prompt \(x\)

\[\begin{aligned} p^*(y_1 \succ y_2 | x) &= \frac{1}{1 + \exp\left(\beta \log \frac{\pi_\theta^*(y_2|x)}{\pi_{\text{ref}}(y_2|x)} - \beta \log \frac{\pi_\theta^*(y_1|x)}{\pi_{\text{ref}}(y_1|x)}\right)} \\ &= \sigma\left(\beta \log \frac{\pi_\theta^*(y_1|x)}{\pi_{\text{ref}}(y_1|x)} - \beta \log \frac{\pi_\theta^*(y_2|x)}{\pi_{\text{ref}}(y_2|x)}\right) \tag{6} \end{aligned}\]

where \(\sigma\) is the sigmoid function. It's convenient that the \(Z(x)\) terms cancel out, because \(Z(x)\) depends on \(r(x, y)\), which we're trying to remove from the objective (and the sum over all completions \(y\) is intractable anyway). If \(y_w\) denotes the preferred completion and \(y_l\) denotes the completion that sucks (as rated by the human annotator), substituting \(y_w\) and \(y_l\) for \(y_1\) and \(y_2\) in \((6)\) gives

\[p^*(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi_\theta^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \tag{7}\]

and the optimal policy \(\pi_\theta^*\) assigns the higher probability to the annotator-preferred completion, i.e., we have

\[p^*(y_w \succ y_l\,|\,x) \geq p^*(y_l \succ y_w\,|\,x)\]
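
As a quick numerical sanity check of \((6)\) and \((7)\) with made-up log-probabilities: if the policy has shifted probability mass toward \(y_w\) and away from \(y_l\) relative to the reference model, the preferred completion wins with probability above 0.5, and the two preference probabilities sum to 1.

import math


def sigmoid(z):
  return 1 / (1 + math.exp(-z))


beta = 0.1
# Made-up per-completion log-probs: log pi*(y|x) and log pi_ref(y|x).
logp_star_w, logp_ref_w = -12.0, -15.0   # policy upweights y_w vs the reference
logp_star_l, logp_ref_l = -18.0, -14.0   # policy downweights y_l vs the reference

margin = beta * ((logp_star_w - logp_ref_w) - (logp_star_l - logp_ref_l))
print(sigmoid(margin))   # p(y_w > y_l | x) ~ 0.67
print(sigmoid(-margin))  # p(y_l > y_w | x) ~ 0.33, the two sum to 1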

Equation \((7)\) is for the optimal model, and our goal then is to maximize

\[p(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\]

i.e., we maximize the log-likelihood of the preference \(y_w \succ y_l\) for a dataset \(\mathcal{D}\) with prompts \(x\), preferred completions \(y_w\), and bad completions \(y_l\). Our objective becomes minimizing the negative log-likelihood

\[\mathcal{L}_{\text{DPO}}(\pi_\theta ; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \right]\]

and as desired, we have taken the reward model out of the picture. The important thing to note is that this is not a surrogate objective but is equivalent to the vanilla RLHF objective. We've converted online learning (letting the model generate trajectories during training) into offline learning (using a fixed dataset of trajectories, i.e., the log-probs of preferred and rejected completions).

Pseudocode

To compute the loss in PyTorch, as per the paper [2], we do

import torch.nn.functional as F


def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
  """
  pi_logps: policy logprobs, shape (B,)
  ref_logps: reference model logprobs, shape (B,)
  yw_idxs: preferred completion indices in [0, B-1], shape (T,)
  yl_idxs: dispreferred completion indices in [0, B-1], shape (T,)
  beta: temperature controlling strength of KL penalty
  Each pair of (yw_idxs[i], yl_idxs[i]) represents the indices of a single preference pair.
  """

  pi_yw_logps,  pi_yl_logps  = pi_logps[yw_idxs],  pi_logps[yl_idxs]
  ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]

  # log [pi(y_w|x) / pi(y_l|x)] and log [pi_ref(y_w|x) / pi_ref(y_l|x)]
  pi_logratios  = pi_yw_logps - pi_yl_logps
  ref_logratios = ref_yw_logps - ref_yl_logps

  # Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))
  losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))

  # Implied rewards beta * log [pi(y|x) / pi_ref(y|x)], detached (useful for logging)
  rewards = beta * (pi_logps - ref_logps).detach()
  return losses, rewards

Check out the reference implementation here for more deets (such as how pi_logps and ref_logps are computed).
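
As a rough sketch of how pi_logps and ref_logps could be obtained (a simplifying assumption on my part: sum the per-token log-probs over the completion tokens; the reference implementation additionally handles prompt masking, padding, and the one-token shift between logits and labels):

import torch


def sequence_logprob(logits, labels):
  """
  Total log-prob of each completion: sum of the per-token log-probs of `labels`
  under `logits`. Simplified: assumes logits are already aligned with labels and
  that every position belongs to the completion.
  logits: (B, T, V) model outputs, labels: (B, T) completion token ids.
  """
  logps = torch.log_softmax(logits, dim=-1)
  token_logps = torch.gather(logps, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
  return token_logps.sum(dim=-1)  # shape (B,)


# Toy invocation of dpo_loss: 4 completions forming 2 preference pairs.
pi_logps, ref_logps = torch.randn(4), torch.randn(4)
yw_idxs, yl_idxs = torch.tensor([0, 2]), torch.tensor([1, 3])
losses, rewards = dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta=0.1)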

References

  1. Deep reinforcement learning from human preferences
  2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model