Denoising Diffusion Probabilistic Models (DDPM)


This is a PyTorch implementation/tutorial of the paper Denoising Diffusion Probabilistic Models.

In simple terms, we take an image from the data and add noise to it step by step. Then we train a model to predict that noise at each step, and use the model to generate images.

The following definitions and derivations show how this works. For details please refer to the paper.

Forward Process

The forward process adds noise to the data $x_0 \sim q(x_0)$, for $T$ timesteps.

$$q(x_t | x_{t-1}) = \mathcal{N}\big(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\big) \qquad q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})$$

where $\beta_1, \dots, \beta_T$ is the variance schedule.

We can sample $x_t$ at any timestep $t$ with,

$$q(x_t | x_0) = \mathcal{N}\big(x_t;\; \sqrt{\bar\alpha_t}\, x_0,\; (1 - \bar\alpha_t) \mathbf{I}\big)$$

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.
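To make that closed form concrete, here is a small standalone check (not part of the implementation that follows; it only reuses the same linear $\beta$ schedule): composing the single-step transitions reproduces the $\sqrt{\bar\alpha_t}$ scaling of $x_0$ and the $1 - \bar\alpha_t$ variance.

```python
import torch

# Compose x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps step by step and
# compare against the closed-form mean scale sqrt(alpha_bar_t) and variance
# 1 - alpha_bar_t. Uses the same linear schedule as the implementation below.
n_steps = 1000
beta = torch.linspace(0.0001, 0.02, n_steps, dtype=torch.float64)
alpha = 1. - beta
alpha_bar = torch.cumprod(alpha, dim=0)

mean_scale = torch.tensor(1., dtype=torch.float64)  # scale applied to x_0
var = torch.tensor(0., dtype=torch.float64)          # accumulated noise variance
for t in range(n_steps):
    mean_scale = mean_scale * (1. - beta[t]).sqrt()
    var = (1. - beta[t]) * var + beta[t]

assert torch.allclose(mean_scale, alpha_bar[-1].sqrt())
assert torch.allclose(var, 1. - alpha_bar[-1])
```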

Reverse Process

The reverse process removes noise starting at $p(x_T) = \mathcal{N}(x_T; \mathbf{0}, \mathbf{I})$ for $T$ time steps.

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}\big(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\big) \qquad p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t)$$

$\theta$ are the parameters we train.

Loss

We optimize the ELBO (from Jensen's inequality) on the negative log likelihood.

$$\mathbb{E}\big[-\log p_\theta(x_0)\big] \le \mathbb{E}_q \Big[ -\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} | x_0)} \Big] = L$$

The loss can be rewritten as follows.

$$L = \mathbb{E}_q \Big[ \underbrace{D_{KL}\big(q(x_T | x_0) \,\|\, p(x_T)\big)}_{L_T} + \sum_{t > 1} \underbrace{D_{KL}\big(q(x_{t-1} | x_t, x_0) \,\|\, p_\theta(x_{t-1} | x_t)\big)}_{L_{t-1}} \underbrace{- \log p_\theta(x_0 | x_1)}_{L_0} \Big]$$

$D_{KL}\big(q(x_T | x_0) \,\|\, p(x_T)\big)$ is constant since we keep $\beta_1, \dots, \beta_T$ constant.

Computing $L_{t-1} = D_{KL}\big(q(x_{t-1} | x_t, x_0) \,\|\, p_\theta(x_{t-1} | x_t)\big)$

The forward process posterior conditioned on $x_0$ is,

$$q(x_{t-1} | x_t, x_0) = \mathcal{N}\big(x_{t-1};\; \tilde\mu_t(x_t, x_0),\; \tilde\beta_t \mathbf{I}\big)$$
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t} x_t \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t} \beta_t$$

The paper sets $\Sigma_\theta(x_t, t) = \sigma_t^2 \mathbf{I}$ where $\sigma_t^2$ is set to the constants $\beta_t$ or $\tilde\beta_t$.

Then,

$$L_{t-1} = \mathbb{E}_q \Big[ \frac{1}{2\sigma_t^2} \big\| \tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t) \big\|^2 \Big] + C$$

For given noise $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, using $q(x_t | x_0)$,

$$x_t(x_0, \epsilon) = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$$

This gives,

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon} \Big[ \frac{1}{2\sigma_t^2} \Big\| \frac{1}{\sqrt{\alpha_t}} \Big( x_t(x_0, \epsilon) - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon \Big) - \mu_\theta\big(x_t(x_0, \epsilon), t\big) \Big\|^2 \Big]$$

Re-parameterizing with a model to predict noise,

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \Big)$$

where $\epsilon_\theta$ is a learned function that predicts $\epsilon$ given $(x_t, t)$.

This gives,

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon} \Big[ \frac{\beta_t^2}{2 \sigma_t^2\, \alpha_t (1 - \bar\alpha_t)} \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\, t\big) \big\|^2 \Big]$$

That is, we are training to predict the noise.

Simplified loss

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\, t\big) \big\|^2 \Big]$$

This minimizes $-\log p_\theta(x_0 | x_1)$ when $t = 1$ and $L_{t-1}$ for $t > 1$, discarding the weighting in $L_{t-1}$. Discarding the weights increases the weight given to higher $t$ (which have higher noise levels), therefore increasing the sample quality.
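As a quick numerical illustration of that last point (a standalone sketch, reusing the same linear $\beta$ schedule as the implementation below and taking $\sigma_t^2 = \beta_t$): the discarded coefficient $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar\alpha_t)}$ is largest at the smallest timesteps and much smaller at larger $t$, so dropping it shifts the relative emphasis toward the noisier timesteps.

```python
import torch

n_steps = 1000
beta = torch.linspace(0.0001, 0.02, n_steps)
alpha = 1. - beta
alpha_bar = torch.cumprod(alpha, dim=0)

# Coefficient on ||eps - eps_theta||^2 in the full L_{t-1}, with sigma_t^2 = beta_t
weight = beta ** 2 / (2 * beta * alpha * (1 - alpha_bar))
print(weight[0].item(), weight[n_steps // 2].item(), weight[-1].item())
# Largest at the first step (about 0.5), far smaller (roughly 0.005-0.01) at larger t;
# L_simple treats all timesteps equally instead.
```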

This file implements the loss calculation and a basic sampling method that we use to generate images during training.

Here is the UNet model that gives $\epsilon_\theta(x_t, t)$ and here is the training code. That file can generate samples and interpolations from a trained model.

from typing import Tuple, Optional

import torch
import torch.nn.functional as F
import torch.utils.data
from torch import nn

from labml_nn.diffusion.ddpm.utils import gather
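The `gather` utility imported from `labml_nn.diffusion.ddpm.utils` picks the schedule constant at each sample's timestep and reshapes it so it broadcasts over the image dimensions. Roughly, it behaves like this sketch (the exact implementation may differ):

```python
def gather(consts: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Select consts[t] for each sample in the batch and reshape to (batch, 1, 1, 1)
    # so it broadcasts against image tensors of shape (batch, channels, height, width).
    c = consts.gather(-1, t)
    return c.reshape(-1, 1, 1, 1)
```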

Denoise Diffusion

class DenoiseDiffusion:
• eps_model is the $\epsilon_\theta(x_t, t)$ model
• n_steps is $T$
• device is the device to place constants on
    def __init__(self, eps_model: nn.Module, n_steps: int, device: torch.device):
        super().__init__()
        self.eps_model = eps_model

Create linearly increasing variance schedule

        self.beta = torch.linspace(0.0001, 0.02, n_steps).to(device)

        # alpha_t = 1 - beta_t
        self.alpha = 1. - self.beta
        # alpha_bar_t = prod_{s=1}^{t} alpha_s
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)
        # T
        self.n_steps = n_steps
        # sigma_t^2 = beta_t
        self.sigma2 = self.beta

Get $q(x_t | x_0)$ distribution

    def q_xt_x0(self, x0: torch.Tensor, t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:

gather $\bar\alpha_t$ and compute the mean $\sqrt{\bar\alpha_t}\, x_0$ and variance $(1 - \bar\alpha_t) \mathbf{I}$

        # mean: sqrt(alpha_bar_t) * x_0
        mean = gather(self.alpha_bar, t) ** 0.5 * x0
        # variance: 1 - alpha_bar_t
        var = 1 - gather(self.alpha_bar, t)
        return mean, var

Sample from $q(x_t | x_0)$

    def q_sample(self, x0: torch.Tensor, t: torch.Tensor, eps: Optional[torch.Tensor] = None):

        if eps is None:
            eps = torch.randn_like(x0)

get the distribution $q(x_t | x_0)$

        mean, var = self.q_xt_x0(x0, t)

Sample from $q(x_t | x_0)$

        return mean + (var ** 0.5) * eps
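For instance, a hypothetical call that noises a batch of images at per-sample random timesteps could look like the sketch below. The stand-in module is only there to keep the example self-contained; the real $\epsilon_\theta$ model is the UNet defined elsewhere, and `q_sample` itself never calls it.

```python
class DummyEps(nn.Module):
    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Trivial stand-in for the eps_theta UNet (not used by q_sample)
        return torch.zeros_like(x)

diffusion = DenoiseDiffusion(eps_model=DummyEps(), n_steps=1000, device=torch.device('cpu'))
x0 = torch.randn(16, 3, 32, 32)                       # stand-in for a batch of images
t = torch.randint(0, 1000, (16,), dtype=torch.long)   # one timestep per sample
xt = diffusion.q_sample(x0, t)                        # noised batch, same shape as x0
```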

Sample from $p_\theta(x_{t-1} | x_t)$

    def p_sample(self, xt: torch.Tensor, t: torch.Tensor):

        eps_theta = self.eps_model(xt, t)

gather $\bar\alpha_t$, $\alpha_t$ and $\sigma_t^2$, then compute the mean $\frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \Big)$ and sample fresh noise $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$

        alpha_bar = gather(self.alpha_bar, t)
        alpha = gather(self.alpha, t)
        # (1 - alpha_t) / sqrt(1 - alpha_bar_t), the coefficient on eps_theta
        eps_coef = (1 - alpha) / (1 - alpha_bar) ** .5
        # mean of p_theta(x_{t-1} | x_t)
        mean = 1 / (alpha ** 0.5) * (xt - eps_coef * eps_theta)
        # sigma_t^2
        var = gather(self.sigma2, t)
        # fresh noise for the sampling step
        eps = torch.randn(xt.shape, device=xt.device)

Sample

        return mean + (var ** .5) * eps
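Repeatedly applying `p_sample` from pure noise down to $t = 0$ generates an image. A hedged sketch of such a generation loop (the image shape and sample count here are assumptions, not taken from this file):

```python
@torch.no_grad()
def sample_images(diffusion: DenoiseDiffusion, n_samples: int = 4,
                  image_shape=(3, 32, 32)) -> torch.Tensor:
    device = diffusion.beta.device
    # x_T ~ N(0, I)
    x = torch.randn(n_samples, *image_shape, device=device)
    # Denoise step by step for t = T-1, ..., 0
    for t_ in reversed(range(diffusion.n_steps)):
        t = x.new_full((n_samples,), t_, dtype=torch.long)
        x = diffusion.p_sample(x, t)
    return x
```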

Simplified Loss

    def loss(self, x0: torch.Tensor, noise: Optional[torch.Tensor] = None):

Get batch size

        batch_size = x0.shape[0]

Get random $t$ for each sample in the batch

        t = torch.randint(0, self.n_steps, (batch_size,), device=x0.device, dtype=torch.long)

        if noise is None:
            noise = torch.randn_like(x0)

Sample $x_t$ from $q(x_t | x_0)$

        xt = self.q_sample(x0, t, eps=noise)

Get $\epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\, t\big)$

        eps_theta = self.eps_model(xt, t)

MSE loss

        return F.mse_loss(noise, eps_theta)
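To tie it together, a hypothetical training step built around `DenoiseDiffusion.loss` might look like the sketch below; `MyUNet`, `data_loader` and the learning rate are placeholders, not taken from this file.

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
unet = MyUNet().to(device)                                  # placeholder eps_theta model
diffusion = DenoiseDiffusion(eps_model=unet, n_steps=1000, device=device)
optimizer = torch.optim.Adam(unet.parameters(), lr=2e-5)

for x0 in data_loader:                                      # placeholder image DataLoader
    x0 = x0.to(device)
    optimizer.zero_grad()
    loss = diffusion.loss(x0)   # simplified loss: MSE between true and predicted noise
    loss.backward()
    optimizer.step()
```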