Direct Preference Optimization

Bradley-Terry model

Suppose a bunch of people have various preferences for something, say, person A rates a cookie 5/10, person B rates it 7/10, and so on. Or suppose that we have a bunch of players in a game and each earned a particular score. What's the probability that one rater will rate a particular item higher than another rater does? Or what's the probability that one player will win against another? We can model this as follows

$$p(y_1 \succ y_2) = \frac{e^{\beta_1}}{e^{\beta_1} + e^{\beta_2}}$$

where $\beta_1$ and $\beta_2$ are the scores associated with $y_1$ and $y_2$ respectively. In the cookie case, we have $p(A \succ B)$, i.e., the probability that the cookie will be preferred by A over B, and $\beta_1$ and $\beta_2$ are the scores assigned by A and B respectively. The denominator normalizes, i.e., it ensures that $p(y_1 \succ y_2) + p(y_2 \succ y_1) = 1$.
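As a quick illustration, here's a tiny Python sketch (my own toy helper, not from any paper) that computes this probability for a pair of scores:

import math

def bt_preference_prob(score_1: float, score_2: float) -> float:
  """Bradley-Terry probability that the item with score_1 beats the item with score_2."""
  # Softmax over the two scores; equivalently sigmoid(score_1 - score_2).
  return math.exp(score_1) / (math.exp(score_1) + math.exp(score_2))

# Toy example with the ratings 5 and 7 from above.
print(bt_preference_prob(5.0, 7.0))  # ~0.119: the lower-scored side rarely "wins"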

RLHF objective

We let LLMs generate completions $y_1$ and $y_2$ for a prompt $x$ and have human annotators rank the completions, i.e., state which completion they prefer. Let's denote the annotators' underlying scoring function as $r^*(x, y)$, which is the score assigned by them to a completion $y$ for a prompt $x$. Then, as per the BT model, we have

\begin{align*} p(y_i \succ y_j) &= \frac{e^{r^*(x,\,y_i)}}{e^{r^*(x,\,y_i)} + e^{r^*(x,\,y_j)}} \\ &= \frac{1}{1 + e^{(r^*(x,\,y_j) - r^*(x,\,y_i))}} \tag{1} \end{align*}

where we have divided the numerator and denominator by $e^{r^*(x,\,y_i)}$; equivalently, $p(y_i \succ y_j) = \sigma(r^*(x,\,y_i) - r^*(x,\,y_j))$, where $\sigma$ is the sigmoid function. $r^*$ is what we first aim to learn in classic RLHF [1], and we do this by training a reward model $r_{\phi}$ parametrized by $\phi$. We then use this reward model to optimize the LLM using an off-the-shelf online RL algorithm (like PPO). The objective for the RL phase is

$$\max_{\pi_{\theta}}\,\mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[r_{\phi}(x, y)\right] - \beta D_{KL}(\pi_{\theta}(y\,|\,x)\,||\,\pi_{\text{ref}}(y\,|\,x)) \tag{2}$$

where the second term is the KL divergence penalty that prevents our model $\pi_{\theta}$ from deviating too much from the reference pre-trained model $\pi_{\text{ref}}$, and $\beta$ controls the strength of the penalty. Now, RLHF is expensive because it's a two-step process: we first train a reward model, then run online RL, which means generating trajectories from the model being trained. Using a reward model instead of the true reward also feels janky: the reward model is not perfect, and the LM has to adapt to this imperfect reward model, so errors compound.
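To make $(2)$ concrete, here's a rough sketch (mine, not from either paper) of a Monte Carlo estimate of that objective, assuming we have a scalar reward from $r_{\phi}$ and sequence-level log-probabilities for each sampled completion:

import torch

def rlhf_objective_estimate(reward, policy_logp, ref_logp, beta):
  """
  reward:      r_phi(x, y) for each sampled completion, shape (B,)
  policy_logp: log pi_theta(y | x) summed over tokens, shape (B,)
  ref_logp:    log pi_ref(y | x) summed over tokens, shape (B,)
  beta:        strength of the KL penalty
  """
  # For y ~ pi_theta, averaging log pi_theta - log pi_ref over the batch
  # gives a sampled estimate of D_KL(pi_theta || pi_ref).
  kl_estimate = policy_logp - ref_logp
  return (reward - beta * kl_estimate).mean()  # quantity to maximize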

Is there a way to frame the objective such that it only depends on the output of the policy model $\pi_{\theta}$ and the reference model $\pi_{\text{ref}}$ and not the learned reward model $r_{\phi}$, effectively eliminating the reward model training step? Enter DPO [2].

Reframing the RLHF objective

We can rewrite the objective in $(2)$ (writing $r$ for a generic reward function) like so

\begin{align*} & \max_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[r(x, y)\right] - \beta D_{KL}(\pi_{\theta}(y\,|\,x)\,||\,\pi_{\text{ref}}(y\,|\,x)) \\ &= \max_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[r(x, y) - \beta \log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)}\right]\\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\beta \log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - r(x, y)\right] \\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - \frac{1}{\beta}r(x, y)\right] \\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - \log\exp\left(\frac{1}{\beta}r(x, y)\right)\right] \\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\log\frac{\pi_{\theta}(y\,|\,x)}{\frac{1}{Z(x)}\pi_{\text{ref}}(y\,|\,x)\exp\left(\frac{1}{\beta}r(x, y)\right)} - \log Z(x)\right] \tag{3} \end{align*}

where in the second step we have moved $D_{KL}$ inside the expectation, since the KL divergence is itself an expectation over $y \sim \pi_{\theta}(y\,|\,x)$ of the log-ratio $\log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)}$, and

$$Z(x) = \sum_y \pi_{\text{ref}}(y\,|\,x)\exp\left(\frac{1}{\beta}r(x, y)\right)$$

is the partition function (which we've newly introduced; the $\log Z(x)$ subtraction inside the expectation is there to maintain the equality). Note that $Z(x)$ depends only on $x$, $\pi_{\text{ref}}$, and $r$, not on $\pi_{\theta}$. We can show that the optimally trained model $\pi_{\theta}^*$ is given by

$$\pi_{\theta}^*(y\,|\,x) = \frac{1}{Z(x)}\pi_{\text{ref}}(y\,|\,x)\exp\left(\frac{1}{\beta}r(x, y)\right) \tag{4}$$
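As a sanity check, here's a toy numerical sketch (a small made-up set of completions and rewards, nothing from the paper) showing that $(4)$ is a properly normalized distribution and that it shifts probability mass toward high-reward completions:

import torch

# Toy setup: 4 possible completions for a single prompt x.
pi_ref = torch.tensor([0.4, 0.3, 0.2, 0.1])  # reference distribution over completions
reward = torch.tensor([1.0, 2.0, 0.5, 3.0])  # made-up r(x, y) for each completion
beta = 0.5

# Equation (4): pi_star = (1/Z) * pi_ref * exp(r / beta)
unnormalized = pi_ref * torch.exp(reward / beta)
Z = unnormalized.sum()         # partition function Z(x)
pi_star = unnormalized / Z

print(pi_star, pi_star.sum())  # sums to 1; mass moves toward high-reward completions
# A larger beta means a stronger KL penalty, so pi_star stays closer to pi_ref.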

From the definition of $Z(x)$, one can see that $\sum_y \pi_{\theta}^*(y\,|\,x) = 1$, and since $\pi_{\theta}^*(y\,|\,x) \geq 0$, it is a valid probability distribution. Splitting the expectation in $(3)$ into an outer expectation over $x$ and an inner expectation over $y$ (and pulling $\log Z(x)$, which doesn't depend on $y$, out of the inner one), we get the objective

\begin{align*} &\min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D}}\left[\mathbb{E}_{y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\log \frac{\pi_{\theta}(y\,|\,x)}{\pi_{\theta}^*(y\,|\,x)}\right] - \log Z(x)\right] \\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D}}\left[D_{KL}(\pi_{\theta}(y\,|\,x)\,||\,\pi_{\theta}^*(y\,|\,x)) - \log Z(x)\right] \end{align*}

Since $\log Z(x)$ does not depend on $\pi_{\theta}$, minimizing this objective amounts to minimizing the KL divergence, which is minimized (and equal to zero) when $\pi_{\theta}(y\,|\,x) = \pi_{\theta}^*(y\,|\,x)$. This shows that $\pi_{\theta}^*$ is the optimal policy. Now, we need to get rid of $r$. If $\pi^*_{\theta}$ is the optimal policy model, then the ground truth reward $r^*$ is given by

$$r^*(x, y) = \beta \log\frac{\pi_{\theta}^*(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} + \beta \log Z(x) \tag{5}$$

which you can get by solving for $r^*$ in $(4)$; the algebra is spelled out below. What this is saying is this: suppose that we have an RL-tuned model $\pi_{\theta}^*$ that perfectly satisfies the human annotators' preference function $r^*$ ($r^*$ is the underlying reward function that represents the human annotators, and it is what classic RLHF approximates with a learned reward model $r_{\phi}$). Then $r^*$ is given by $(5)$: the score assigned to a completion $y$ for a prompt $x$. We can substitute this score into the Bradley-Terry preference model in $(1)$ to get the probability that completion $y_1$ is preferred over completion $y_2$ given prompt $x$

\begin{align*} p^*(y_1 \succ y_2\,|\,x) &= \frac{1}{1 + \exp\left(\beta\log\frac{\pi_{\theta}^*(y_2\,|\,x)}{\pi_{\text{ref}}(y_2\,|\,x)} - \beta\log\frac{\pi_{\theta}^*(y_1\,|\,x)}{\pi_{\text{ref}}(y_1\,|\,x)}\right)} \\ &= \sigma\left(\beta\log\frac{\pi_{\theta}^*(y_1\,|\,x)}{\pi_{\text{ref}}(y_1\,|\,x)} - \beta\log\frac{\pi_{\theta}^*(y_2\,|\,x)}{\pi_{\text{ref}}(y_2\,|\,x)}\right) \tag{6} \end{align*}

where $\sigma$ is the sigmoid function. It's convenient that the $Z(x)$ terms cancel out, because $Z(x)$ depends on $r(x, y)$ and we're trying to get rid of it from the objective. If $y_w$ denotes the favorable completion and $y_l$ denotes the completion that sucks (as rated by the human annotator), substituting $y_w$ and $y_l$ for $y_1$ and $y_2$ in $(6)$ gives

\begin{align*} p^*(y_w \succ y_l\,|\,x) = \sigma\left(\beta\log\frac{\pi_{\theta}^*(y_w\,|\,x)}{\pi_{\text{ref}}(y_w\,|\,x)} - \beta\log\frac{\pi_{\theta}^*(y_l\,|\,x)}{\pi_{\text{ref}}(y_l\,|\,x)}\right) \tag{7} \end{align*}

and since $y_w$ is the completion the annotators prefer, this probability should come out in favor of $y_w$, i.e., we have

$$p^*(y_w \succ y_l\,|\,x) \geq p^*(y_l \succ y_w\,|\,x)$$
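For completeness, here's the algebra glossed over above: take logs in $(4)$ and rearrange to get $(5)$, and note that when the difference of two such rewards is plugged into $(1)$, the $\beta \log Z(x)$ terms cancel:

\begin{align*} \log \pi_{\theta}^*(y\,|\,x) &= \log \pi_{\text{ref}}(y\,|\,x) + \frac{1}{\beta}r^*(x, y) - \log Z(x) \\ \Rightarrow\; r^*(x, y) &= \beta \log\frac{\pi_{\theta}^*(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} + \beta \log Z(x) \\ r^*(x, y_2) - r^*(x, y_1) &= \beta\log\frac{\pi_{\theta}^*(y_2\,|\,x)}{\pi_{\text{ref}}(y_2\,|\,x)} - \beta\log\frac{\pi_{\theta}^*(y_1\,|\,x)}{\pi_{\text{ref}}(y_1\,|\,x)} \end{align*}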

Equation $(7)$ is for the optimal model, and our goal then is to maximize

\begin{align*} p(y_w \succ y_l\,|\,x) = \sigma\left(\beta\log\frac{\pi_{\theta}(y_w\,|\,x)}{\pi_{\text{ref}}(y_w\,|\,x)} - \beta\log\frac{\pi_{\theta}(y_l\,|\,x)}{\pi_{\text{ref}}(y_l\,|\,x)}\right) \end{align*}

i.e., we maximize the log-likelihood of the preference $y_w \succ y_l$ over a dataset $\mathcal{D}$ of prompts $x$, preferred completions $y_w$, and bad completions $y_l$. Our objective becomes minimizing the negative log-likelihood

$$\mathcal{L}_{\text{DPO}}(\pi_{\theta}\,;\,\pi_{\text{ref}}) = -\mathbb{E}_{(x,\,y_w,\,y_l)\,\sim\,\mathcal{D}}\left[\log \sigma\left(\beta\log\frac{\pi_{\theta}(y_w\,|\,x)}{\pi_{\text{ref}}(y_w\,|\,x)} - \beta\log\frac{\pi_{\theta}(y_l\,|\,x)}{\pi_{\text{ref}}(y_l\,|\,x)}\right)\right]$$

and as desired, we have taken the reward model out of the picture. The important thing to note is that this is not a surrogate objective but is equivalent to the vanilla RLHF objective. We're converting online learning (letting the model generate trajectories during training) into offline learning (using a fixed dataset of trajectories, i.e., the logprobs of preferred and rejected completions).

Pseudocode

To compute the loss using PyTorch, as per the paper [2], we do

import torch.nn.functional as F


def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
  """
  pi_logps: policy logprobs, shape (B,)
  ref_logps: reference model logprobs, shape (B,)
  yw_idxs: preferred completion indices in [0, B-1], shape (T,)
  yl_idxs: dispreferred completion indices in [0, B-1], shape (T,)
  beta: temperature controlling strength of KL penalty
  Each pair of (yw_idxs[i], yl_idxs[i]) represents the indices of a single preference pair.
  """

  pi_yw_logps,  pi_yl_logps =  pi_logps[yw_idxs],  pi_logps[yl_idxs]
  ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]
  pi_logratios  = pi_yw_logps - pi_yl_logps
  ref_logratios = ref_yw_logps - ref_yl_logps
  losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))  # DPO loss per preference pair
  rewards = beta * (pi_logps - ref_logps).detach()  # implicit reward beta * log(pi / pi_ref) per completion
  return losses, rewards
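And a hypothetical toy call (shapes and values made up; in practice pi_logps and ref_logps would be the sequence-level log-probabilities of each completion under the policy and reference models):

import torch

B = 4                           # 4 completions in the batch, i.e., 2 preference pairs
pi_logps = torch.randn(B)       # log pi_theta(y | x) for each completion
ref_logps = torch.randn(B)      # log pi_ref(y | x) for each completion
yw_idxs = torch.tensor([0, 2])  # indices of the preferred completions
yl_idxs = torch.tensor([1, 3])  # indices of the dispreferred completions

losses, rewards = dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta=0.1)
print(losses.mean(), rewards)   # losses.mean() is what you'd backprop through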

Check out the paper's reference implementation for more deets (such as how pi_logps and ref_logps are computed).

References

  1. Deep reinforcement learning from human preferences
  2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model