Suppose a bunch of people have various preferences for something, say, person A rates a cookie 5/10, person B rates it 7/10, and so on. Or suppose that we have a bunch of players in a game and each gained a particular score. What's the probability that the cookie will be rated higher by one person than by the other? Or what's the probability that one player will win against another? We can model this with the Bradley-Terry (BT) model as follows
$$p(y_1 \succ y_2) = \frac{e^{\beta_1}}{e^{\beta_1} + e^{\beta_2}}$$
where $\beta_1$ and $\beta_2$ are the scores assigned to $y_1$ and $y_2$ respectively by two score assigners. In the cookie case, we have $p(A \succ B)$, i.e., the probability that the cookie will be preferred by A over B, and $\beta_1$ and $\beta_2$ are the scores assigned by A and B respectively. The denominator normalizes, i.e., ensures that the probabilities sum to 1.
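To make this concrete, plugging in the cookie scores from above (A's 5 and B's 7) gives

$$p(A \succ B) = \frac{e^{5}}{e^{5} + e^{7}} = \frac{1}{1 + e^{2}} \approx 0.12$$

so it's unlikely that A's preference wins out over B's, which matches intuition since B gave the higher score.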
We let LLMs generate completions $y_1$ and $y_2$ for a prompt $x$ and have human annotators rank the completions, i.e., state their preferences. Let's denote the annotators' underlying scoring function as $r^*(x, y)$, the score they assign to a completion $y$ for a prompt $x$. Then, as per the BT model, we have
\begin{align*} p(y_i \succ y_j) &= \frac{e^{r^*(x,\,y_i)}}{e^{r^*(x,\,y_i)} + e^{r^*(x,\,y_j)}} \\ &= \frac{1}{1 + e^{(r^*(x,\,y_j) - r^*(x,\,y_i))}} \tag{1} \end{align*}
where we have divided the numerator and denominator by $e^{r^*(x,\,y_i)}$. $r^*$ is what we first aim to learn in classic RLHF [1], and we do this by training a reward model $r_{\phi}$ parametrized by $\phi$. We then use this reward model to optimize the LLM with an off-the-shelf online RL algorithm (like PPO). The objective for the RL phase is
$$\max_{\pi_{\theta}}\,\mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[r_{\phi}(x, y)\right] - \beta D_{KL}(\pi_{\theta}(y\,|\,x)\,||\,\pi_{\text{ref}}(y\,|\,x)) \tag{2}$$
where the second term is the KL divergence penalty that keeps our model $\pi_{\theta}$ from deviating too much from the reference pre-trained model $\pi_{\text{ref}}$, and $\beta$ is the strength of the penalty. Now, RLHF is expensive because it's a two-step process: first training a reward model, then generating trajectories with the model being trained. Using a reward model instead of the true reward also feels janky: the reward model is not perfect, and the LM has to learn to adapt to this imperfect reward model, so you're compounding errors.
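To make the RL-phase objective concrete, here's a minimal sketch (not the actual PPO machinery; all names are illustrative) of a single-sample Monte Carlo estimate of $(2)$, assuming we already have the reward model's scores and per-sequence log-probabilities from both models:

```python
import torch

def rlhf_objective_estimate(rewards, pi_logps, ref_logps, beta):
    """Single-sample Monte Carlo estimate of objective (2) for a batch.

    rewards:   r_phi(x, y) for sampled completions y ~ pi_theta, shape (B,)
    pi_logps:  log pi_theta(y | x) for those completions, shape (B,)
    ref_logps: log pi_ref(y | x) for the same completions, shape (B,)
    beta:      strength of the KL penalty
    """
    # log(pi_theta / pi_ref) evaluated at samples from pi_theta is a
    # one-sample estimate of the KL term in (2)
    kl_estimate = pi_logps - ref_logps
    return (rewards - beta * kl_estimate).mean()
```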
Is there a way to frame the objective such that it only depends on the output of the policy model $\pi_{\theta}$ and the reference model $\pi_{\text{ref}}$ and not the learned reward model $r_{\phi}$, effectively eliminating the reward model training step? Enter DPO [2].
We can rewrite the reward objective in $(2)$ like so
\begin{align*} & \max_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[r(x, y)\right] - \beta D_{KL}(\pi_{\theta}(y\,|\,x)\,||\,\pi_{\text{ref}}(y\,|\,x)) \\ &= \max_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[r(x, y) - \beta \log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)}\right]\\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\beta \log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - r(x, y)\right] \\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - \frac{1}{\beta}r(x, y)\right] \\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} - \log\exp\left(\frac{1}{\beta}r(x, y)\right)\right] \\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D},\,y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\log\frac{\pi_{\theta}(y\,|\,x)}{\frac{1}{Z(x)}\pi_{\text{ref}}(y\,|\,x)\exp\left(\frac{1}{\beta}r(x, y)\right)} - \log Z(x)\right] \tag{3} \end{align*}
where in the second step we have moved $D_{KL}$ inside the expectation, since the KL divergence is itself an expectation of the log-ratio $\log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)}$ under $y \sim \pi_{\theta}(y\,|\,x)$, in the fourth step we have divided by $\beta$ (which doesn't change the minimizer since $\beta > 0$), and
$$Z(x) = \sum_y \pi_{\text{ref}}(y\,|\,x)\exp\left(\frac{1}{\beta}r(x, y)\right)$$
is the partition function (which we've newly introduced). The $\log Z(x)$ subtraction inside the expectation is to maintain the equality. We can show that the optimally trained model $\pi_{\theta}^*$ is given by
$$\pi_{\theta}^*(y\,|\,x) = \frac{1}{Z(x)}\pi_{\text{ref}}(y\,|\,x)\exp\left(\frac{1}{\beta}r(x, y)\right) \tag{4}$$
From the definition of $Z(x)$, one can see that $\sum_y \pi_{\theta}^*(y\,|\,x) = 1$, and since $\pi_{\theta}^*(y\,|\,x) \geq 0$, it is a valid probability distribution. Splitting the expectation in $(3)$ into an outer expectation over $x$ and an inner one over $y$ (noting that $\log Z(x)$ doesn't depend on $y$), we get the objective
\begin{align*} &\min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D}}\left[\mathbb{E}_{y\,\sim\,\pi_{\theta}(y\,|\,x)}\left[\log \frac{\pi_{\theta}(y\,|\,x)}{\pi_{\theta}^*(y\,|\,x)}\right] - \log Z(x)\right] \\ &= \min_{\pi_{\theta}} \mathbb{E}_{x\,\sim\,\mathcal{D}}\left[D_{KL}(\pi_{\theta}(y\,|\,x)\,||\,\pi_{\theta}^*(y\,|\,x)) - \log Z(x)\right] \end{align*}
The KL divergence is minimized when $\pi_{\theta}(y\,|\,x) = \pi_{\theta}^*(y\,|\,x)$, and since the $\log Z(x)$ term doesn't depend on $\pi_{\theta}$, this shows that $\pi_{\theta}^*$ is the optimal policy. Now, we need to get rid of $r$. If $\pi^*_{\theta}$ is the optimal policy, then the ground truth reward $r^*$ is given by
$$r^*(x, y) = \beta \log\frac{\pi_{\theta}^*(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)} + \beta \log Z(x) \tag{5}$$
which you can get by solving for $r^*$ in $(4)$. What this is saying is this: suppose that we have an RL-tuned model $\pi_{\theta}^*$ that perfectly satisfies the human annotators' preference function $r^*$ ($r^*$ is basically the heuristic or underlying reward model that represents the human annotators, and it is what we learn in classic RLHF with a reward model $r_{\phi}$). Then $r^*$ is given by $(5)$. This is the score given to a completion $y$ for a prompt $x$. We can substitute this score into the Bradley-Terry preference model in $(1)$ to get the probability that completion $y_1$ is preferred over completion $y_2$ given prompt $x$
\begin{align*} p^*(y_1 \succ y_2\,|\,x) &= \frac{1}{1 + \exp\left(\beta\log\frac{\pi_{\theta}^*(y_2\,|\,x)}{\pi_{\text{ref}}(y_2\,|\,x)} - \beta\log\frac{\pi_{\theta}^*(y_1\,|\,x)}{\pi_{\text{ref}}(y_1\,|\,x)}\right)} \\ &= \sigma\left(\beta\log\frac{\pi_{\theta}^*(y_1\,|\,x)}{\pi_{\text{ref}}(y_1\,|\,x)} - \beta\log\frac{\pi_{\theta}^*(y_2\,|\,x)}{\pi_{\text{ref}}(y_2\,|\,x)}\right) \tag{6} \end{align*}
where $\sigma$ is the sigmoid function. It's convenient that the $Z(x)$s cancel out, because $Z(x)$ depends on $r(x, y)$, which is exactly what we're trying to remove from the objective. If $y_w$ denotes the favorable completion and $y_l$ denotes the completion that sucks (as rated by the human annotator), substituting $y_w$ and $y_l$ for $y_1$ and $y_2$ in $(6)$ gives
\begin{align*} p^*(y_w \succ y_l\,|\,x) = \sigma\left(\beta\log\frac{\pi_{\theta}^*(y_w\,|\,x)}{\pi_{\text{ref}}(y_w\,|\,x)} - \beta\log\frac{\pi_{\theta}^*(y_l\,|\,x)}{\pi_{\text{ref}}(y_l\,|\,x)}\right) \tag{7} \end{align*}
and under the optimal policy $\pi_{\theta}^*$, the preferred completion is indeed the likelier one to win, i.e., we have
$$p^*(y_w \succ y_l\,|\,x) \geq p^*(y_l \succ y_w\,|\,x)$$
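Before moving on, a toy numerical check (made-up numbers: one prompt, three possible completions) shows that $(4)$ really is a distribution and that the $Z(x)$ cancellation above holds, i.e., the BT probability computed from the reparametrized reward matches the one computed from the raw reward:

```python
import torch

# Toy setup: three possible completions for a single prompt,
# with made-up reference probabilities and rewards.
ref = torch.tensor([0.5, 0.3, 0.2])   # pi_ref(y | x)
r = torch.tensor([1.0, 2.0, 0.5])     # r(x, y), arbitrary scores
beta = 0.1

# Optimal policy from (4): pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y)/beta)
unnorm = ref * torch.exp(r / beta)
Z = unnorm.sum()                      # partition function Z(x)
pi_star = unnorm / Z
assert torch.isclose(pi_star.sum(), torch.tensor(1.0))  # valid distribution

# BT probability that completion 0 beats completion 1, two equivalent ways:
# directly from the rewards as in (1)...
p_direct = torch.sigmoid(r[0] - r[1])
# ...and from the reparametrized reward beta * log(pi*/pi_ref) as in (6);
# the beta * log Z(x) terms cancel, so the answer is the same
p_reparam = torch.sigmoid(beta * (torch.log(pi_star[0] / ref[0])
                                  - torch.log(pi_star[1] / ref[1])))
assert torch.isclose(p_direct, p_reparam)
```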
Equation $(7)$ is for the optimal model, and our goal then is to maximize
\begin{align*} p(y_w \succ y_l\,|\,x) = \sigma\left(\beta\log\frac{\pi_{\theta}(y_w\,|\,x)}{\pi_{\text{ref}}(y_w\,|\,x)} - \beta\log\frac{\pi_{\theta}(y_l\,|\,x)}{\pi_{\text{ref}}(y_l\,|\,x)}\right) \end{align*}
i.e., we maximize the log-likelihood of the preference $y_w \succ y_l$ for a dataset $\mathcal{D}$ with prompts $x$, preferred completions $y_w$, and bad completions $y_l$. Our objective becomes minimizing the negative log-likelihood
$$\mathcal{L}_{\text{DPO}}(\pi_{\theta}\,;\,\pi_{\text{ref}}) = -\mathbb{E}_{(x,\,y_w,\,y_l)\,\sim\,\mathcal{D}}\left[\log \sigma\left(\beta\log\frac{\pi_{\theta}(y_w\,|\,x)}{\pi_{\text{ref}}(y_w\,|\,x)} - \beta\log\frac{\pi_{\theta}(y_l\,|\,x)}{\pi_{\text{ref}}(y_l\,|\,x)}\right)\right]$$
and as desired, we have taken the reward model out of the picture. The important thing to note is that this is not a surrogate objective but is equivalent to the vanilla RLHF objective. We're converting online learning (letting the model generate trajectories during training) to offline learning (using a dataset of trajectories, i.e., using the logprobs of preferred and rejected completions).
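To see what this loss does mechanically, its gradient (this is the gradient analysis from the DPO paper [2]) is

$$\nabla_{\theta}\,\mathcal{L}_{\text{DPO}}(\pi_{\theta}\,;\,\pi_{\text{ref}}) = -\beta\,\mathbb{E}_{(x,\,y_w,\,y_l)\,\sim\,\mathcal{D}}\left[\sigma\left(\hat{r}_{\theta}(x, y_l) - \hat{r}_{\theta}(x, y_w)\right)\left(\nabla_{\theta}\log\pi_{\theta}(y_w\,|\,x) - \nabla_{\theta}\log\pi_{\theta}(y_l\,|\,x)\right)\right]$$

where $\hat{r}_{\theta}(x, y) = \beta\log\frac{\pi_{\theta}(y\,|\,x)}{\pi_{\text{ref}}(y\,|\,x)}$ is the implicit reward: the gradient increases the likelihood of $y_w$ and decreases the likelihood of $y_l$, weighted by how badly the implicit reward currently misranks the pair.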
To compute the loss using PyTorch, as per the paper [2], we do
import torch.nn.functional as F

def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
    """
    pi_logps: policy logprobs, shape (B,)
    ref_logps: reference model logprobs, shape (B,)
    yw_idxs: preferred completion indices in [0, B-1], shape (T,)
    yl_idxs: dispreferred completion indices in [0, B-1], shape (T,)
    beta: temperature controlling strength of KL penalty

    Each pair of (yw_idxs[i], yl_idxs[i]) represents the indices of a single preference pair.
    """
    # Per-sequence logprobs of the preferred and dispreferred completions
    pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]

    # log(pi(y_w|x) / pi(y_l|x)) for the policy and for the reference model
    pi_logratios = pi_yw_logps - pi_yl_logps
    ref_logratios = ref_yw_logps - ref_yl_logps

    # Negative log-likelihood of the preferences under the BT model
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Implicit rewards beta * log(pi_theta / pi_ref), detached for logging
    rewards = beta * (pi_logps - ref_logps).detach()

    return losses, rewards
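A hypothetical call with dummy numbers (the shapes are the assumptions here: a batch of 4 per-sequence logprobs forming 2 preference pairs) would look like

```python
import torch

# Dummy per-sequence logprobs for a batch of 4 completions (2 preference pairs):
# completions 0 and 2 are the preferred ones, 1 and 3 the dispreferred ones.
pi_logps = torch.tensor([-12.0, -15.0, -9.0, -14.0])
ref_logps = torch.tensor([-13.0, -14.0, -10.0, -13.0])
yw_idxs = torch.tensor([0, 2])
yl_idxs = torch.tensor([1, 3])

losses, rewards = dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta=0.1)
loss = losses.mean()  # the scalar you'd backprop through in practice
```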
Check out the reference implementation here for more deets (such as how pi_logps and ref_logps are computed).
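Roughly speaking, those per-sequence logprobs are just the sum of token-level logprobs of the completion given the prompt. A minimal sketch of one way to compute them (not the reference implementation; it assumes the logits have already been shifted so that position t predicts token t of labels, and that a mask marks which tokens belong to the completion):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    """
    logits: model outputs for the prompt+completion, shape (B, T, vocab_size),
            already shifted so that logits[:, t] predicts labels[:, t]
    labels: token ids, shape (B, T)
    mask:   1.0 for completion tokens, 0.0 for prompt/padding tokens, shape (B, T)
    """
    logps = F.log_softmax(logits, dim=-1)                                    # (B, T, V)
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)   # (B, T)
    # log pi(y | x) is the sum of per-token logprobs over the completion tokens
    return (token_logps * mask).sum(dim=-1)                                  # (B,)
```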