Diffusion Models & Building a U-Net

Generative Models with Flow Matching

Project Overview

In Part A, we explore pre-trained diffusion models (DeepFloyd IF) to implement sampling loops, denoising techniques, and creative applications like inpainting and optical illusions. This part focuses on understanding how diffusion models work through hands-on experimentation with a state-of-the-art text-to-image model.

Key Concepts

  • Forward Process: Adding noise to images progressively
  • Denoising: Removing noise using trained UNet models
  • Sampling: Generating images from pure noise
  • Classifier-Free Guidance (CFG): Improving generation quality
  • Image-to-Image: Editing existing images via SDEdit
  • Inpainting: Filling masked regions with new content
  • Visual Anagrams: Images that change when flipped
  • Hybrid Images: Combining low/high frequencies from different prompts

Fun with Diffusion Models

Part 0: Setup and Text Prompts

We use the DeepFloyd IF diffusion model, a two-stage model that generates 64×64 images in the first stage and upsamples them to 256×256 in the second stage. After setting up access to DeepFloyd and generating prompt embeddings, we experiment with custom text prompts.
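Below is a minimal sketch of how the stage-1 pipeline might be loaded through Hugging Face diffusers. The exact setup here followed the course-provided notebook; the snippet assumes the model license has been accepted on the Hugging Face model page and an authenticated session.

```python
# Hedged sketch: loading the DeepFloyd IF stage-1 pipeline via diffusers.
# Assumes the license is accepted and `huggingface-cli login` has been run.
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# Precompute text embeddings once so the UNet can be driven directly later.
prompt_embeds, negative_embeds = stage_1.encode_prompt("a detailed butterfly")
```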

Setup Notes

  • Model: DeepFloyd/IF-I-XL-v1.0 from Hugging Face
  • Prompt embeddings generated via Hugging Face clusters
  • Random seed used consistently throughout: 777

Custom Text Prompts

I created several interesting text prompts and generated their embeddings. Here are 3 examples with their generated images at different inference steps:

20 Inference Steps

Butterfly 20 iterations
a detailed butterfly
Lantern 20 iterations
a lantern emanating light
Mural 20 iterations
a mural on the side of a building

100 Inference Steps

Butterfly 100 iterations
a detailed butterfly
Lantern 100 iterations
a lantern emanating light
Mural 100 iterations
a mural on the side of a building

Reflection: The generated images show good alignment with their text prompts. The butterfly prompt produces detailed wing patterns, the lantern captures the light emission effect, and the mural displays building-side artwork. Increasing inference steps from 20 to 100 improves image quality and detail, with 100 steps producing sharper, more coherent results. However, the improvement is incremental, suggesting diminishing returns beyond a certain number of steps.

Part 1.1: Implementing the Forward Process

The forward process takes a clean image $x_0$ and adds noise to it at timestep $t$, producing a noisy image $x_t$. This is defined by:

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ is random noise and $\bar{\alpha}_t$ controls the noise level. We test this on the Berkeley Campanile image at different noise levels.
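In code, the forward process is a one-liner. The sketch below assumes `alphas_cumprod` is the scheduler's precomputed $\bar{\alpha}$ table, indexed by timestep:

```python
# Sketch of the forward process q(x_t | x_0): scale the clean image down and
# mix in Gaussian noise according to alpha_bar at timestep t.
import torch

def forward(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)                            # eps ~ N(0, I)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
```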

Original Campanile
Original (t=0)
Noisy Campanile at t=250
t=250
Noisy Campanile at t=500
t=500
Noisy Campanile at t=750
t=750

Part 1.2: Classical Denoising

Before using the diffusion model, we attempt to denoise the images using classical Gaussian blur filtering. This serves as a baseline to compare against the learned denoising capabilities of the diffusion model.

Noisy Images (Forward Process)

Noisy at t=250
Noisy at t=250
Noisy at t=500
Noisy at t=500
Noisy at t=750
Noisy at t=750

Classical Denoising (Gaussian Blur)

Gaussian blur denoised at t=250
Gaussian Blur Denoised at t=250
Gaussian blur denoised at t=500
Gaussian Blur Denoised at t=500
Gaussian blur denoised at t=750
Gaussian Blur Denoised at t=750

As expected, classical Gaussian blur filtering struggles to remove noise effectively, especially at higher noise levels. The diffusion model will show significantly better results.

Part 1.3: One-Step Denoising

Now we use the pretrained UNet to denoise images. The UNet predicts the noise in a noisy image, which we can then remove to recover an estimate of the original image. This demonstrates the model's ability to project noisy images back onto the natural image manifold.

Noisy Images (Forward Process)

Noisy at t=250
Noisy at t=250
Noisy at t=500
Noisy at t=500
Noisy at t=750
Noisy at t=750

One-Step Denoised

One-step denoised at t=250
One-Step Denoised at t=250
One-step denoised at t=500
One-Step Denoised at t=500
One-step denoised at t=750
One-Step Denoised at t=750

The UNet does a much better job than Gaussian blur, successfully projecting images back onto the natural image manifold. However, quality degrades with more noise, which motivates iterative denoising.

Part 1.4: Iterative Denoising

Instead of denoising in a single step, we implement iterative denoising by taking multiple steps through the diffusion process. We use strided timesteps (starting at 990 with a stride of 30) to speed up the process while maintaining quality.

Mathematical Foundation of Iterative Denoising

The formula for iterative denoising from timestep $t$ to $t'$ (where $t' < t$) is:

$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_{\sigma}$$

where:

  • $x_t$ is the image at timestep $t$
  • $x_{t'}$ is the noisy image at timestep $t'$, where $t' < t$ (less noisy)
  • $\bar{\alpha}_t$ is the cumulative product of the $\alpha$ values, given by the scheduler's $\text{alphas\_cumprod}$
  • $\alpha_t = \frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}}$
  • $\beta_t = 1 - \alpha_t$
  • $x_0$ is the current estimate of the clean image, obtained via one-step denoising
  • $v_{\sigma}$ is random noise, which in the case of DeepFloyd is also predicted; the exact computation is not important, and the project supplies a function, $\text{add\_variance}$, to handle it

This formula allows us to iteratively denoise an image by moving from noisier timesteps ($t$) to less noisy timesteps ($t'$), effectively interpolating between the noisy signal and the clean image estimate.
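One update step might look like the sketch below. The helper names and the $\text{add\_variance}$ signature are assumptions standing in for the project code, and the channel split reflects that DeepFloyd's UNet also predicts variance channels:

```python
# Hedged sketch of one iterative-denoising update from timestep t to t' (t' < t).
import torch

@torch.no_grad()
def denoise_step(x_t, t, t_prime, unet, prompt_embeds, alphas_cumprod, add_variance):
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp              # alpha over the stride from t' to t
    beta_t = 1 - alpha_t

    out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps = out[:, :3]                        # first 3 channels: noise estimate

    # One-step estimate of the clean image from the predicted noise.
    x0_est = (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()

    mean = (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_est \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
    return add_variance(mean)               # adds the predicted v_sigma term
```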

t=90
t=90
t=240
t=240
t=390
t=390
t=540
t=540
t=690
t=690
Final Predicted
Final Predicted
One-Step Denoising
One-Step Denoising
Classic Gaussian Denoise
Classic Gaussian Denoise

Iterative denoising produces the best results, gradually refining the image through multiple steps. One-step denoising struggles with high noise levels, and Gaussian blur is clearly inferior.

Part 1.5: Diffusion Model Sampling

We can generate images from scratch by starting with pure noise and iteratively denoising it. This demonstrates the generative capabilities of the diffusion model.

Here are 5 sampled images using the prompt "a high quality photo":

Sample 1
Sample 1
Sample 2
Sample 2
Sample 3
Sample 3
Sample 4
Sample 4
Sample 5
Sample 5

The generated images show reasonable quality but could be improved. We'll use Classifier-Free Guidance (CFG) in the next section to enhance quality.

Part 1.6: Classifier-Free Guidance (CFG)

Classifier-Free Guidance improves image quality by combining conditional and unconditional noise estimates. The formula is:

$$\hat{\epsilon} = \epsilon_{uncond} + \gamma \times (\epsilon_{cond} - \epsilon_{uncond})$$

where $\gamma > 1$ amplifies the effect of the text prompt. We use $\gamma = 7.5$.
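In code, CFG is two UNet evaluations and an extrapolation. The sketch below assumes a `noise_estimate` wrapper around the UNet call and a null ("") prompt embedding for the unconditional pass:

```python
# Hedged sketch of classifier-free guidance: extrapolate from the unconditional
# noise estimate toward the conditional one with gamma > 1.
def cfg_estimate(noise_estimate, x_t, t, cond_embeds, uncond_embeds, gamma=7.5):
    eps_cond = noise_estimate(x_t, t, cond_embeds)      # conditional estimate
    eps_uncond = noise_estimate(x_t, t, uncond_embeds)  # unconditional estimate
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```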

Here are 5 images generated with CFG using the prompt "a high quality photo":

CFG Sample 1
CFG Sample 1
CFG Sample 2
CFG Sample 2
CFG Sample 3
CFG Sample 3
CFG Sample 4
CFG Sample 4
CFG Sample 5
CFG Sample 5

The images generated with CFG show significantly improved quality and better alignment with the text prompt compared to unconditional sampling.

Part 1.7: Image-to-Image Translation (SDEdit)

SDEdit allows us to edit existing images by adding noise and then denoising with the diffusion model. The amount of noise added controls how much the image changes. We test this on the Campanile and custom images at starting indices i_start ∈ {1, 3, 5, 7, 10, 20}; since the strided timesteps start at 990 and decrease, a smaller i_start means more added noise, so the result strays further from the original.
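In code, SDEdit reduces to noising the input to an intermediate timestep and resuming the usual CFG denoising loop from there. The helper names below (`forward`, `iterative_denoise_cfg`) are assumptions standing in for the functions built in Parts 1.1 and 1.6:

```python
# Hedged sketch of SDEdit: partially noise the input, then denoise as usual.
def sdedit(image, i_start, strided_timesteps, prompt_embeds):
    t = strided_timesteps[i_start]
    x_t = forward(image, t)                 # noise the image to timestep t
    return iterative_denoise_cfg(x_t, i_start, prompt_embeds)
```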

Campanile Original
Original
i_start=1
i_start=1
i_start=3
i_start=3
i_start=5
i_start=5
i_start=7
i_start=7
i_start=10
i_start=10
i_start=20
i_start=20
Christmas Hat Original
Original
Christmas Hat i_start=1
i_start=1
Christmas Hat i_start=3
i_start=3
Christmas Hat i_start=5
i_start=5
Christmas Hat i_start=7
i_start=7
Christmas Hat i_start=10
i_start=10
Christmas Hat i_start=20
i_start=20
Iguana Original
Original
Iguana i_start=1
i_start=1
Iguana i_start=3
i_start=3
Iguana i_start=5
i_start=5
Iguana i_start=7
i_start=7
Iguana i_start=10
i_start=10
Iguana i_start=20
i_start=20

1.7.1: Hand-Drawn and Web Images

SDEdit works particularly well with non-realistic images like sketches or paintings. Here are examples starting from hand-drawn images and web images:

Web image bear original
Original
Web image bear edit 1
i_start=1
Web image bear edit 2
i_start=3
Web image bear edit 3
i_start=5
Web image bear edit 4
i_start=7
Web image bear edit 5
i_start=10
Web image bear edit 6
i_start=20
Hand-drawn sun original
Original
Hand-drawn sun edit 1
i_start=1
Hand-drawn sun edit 2
i_start=3
Hand-drawn sun edit 3
i_start=5
Hand-drawn sun edit 4
i_start=7
Hand-drawn sun edit 5
i_start=10
Hand-drawn sun edit 6
i_start=20
Hand-drawn tree original
Original
Hand-drawn tree edit 1
i_start=1
Hand-drawn tree edit 2
i_start=3
Hand-drawn tree edit 3
i_start=5
Hand-drawn tree edit 4
i_start=7
Hand-drawn tree edit 5
i_start=10
Hand-drawn tree edit 6
i_start=20

1.7.2: Inpainting

Inpainting allows us to fill masked regions of an image with new content. We use the RePaint algorithm, which preserves unmasked regions while generating new content in masked areas.
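The key step, sketched below with the helper names used earlier, is to project back onto the known pixels after every denoising update: only the masked region (where $m = 1$) keeps the generated content, while everything else is reset to a correctly-noised copy of the original.

```python
# Hedged sketch of the inpainting loop (RePaint-style projection each step).
import torch

@torch.no_grad()
def inpaint(x_orig, m, strided_timesteps, denoise_step_fn, forward_fn):
    x = torch.randn_like(x_orig)                   # start from pure noise
    for i, t in enumerate(strided_timesteps[:-1]):
        t_prime = strided_timesteps[i + 1]
        x = denoise_step_fn(x, t, t_prime)         # one CFG denoising step
        x = m * x + (1 - m) * forward_fn(x_orig, t_prime)  # keep unmasked pixels
    return x
```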

Original Campanile
Original
Mask
Mask
Campanile inpainted no prompt
Inpainted (no prompt, scale=3)
Campanile inpainted with frog prompt
Inpainted (frog prompt, scale=9)
Bear Original
Original
Bear Mask
Mask
Bear To Replace
To Replace
Bear Inpainted
Inpainted
Hollow Knight Original
Original
Hollow Knight Mask
Mask
Hollow Knight To Replace
To Replace
Hollow Knight Inpainted
Inpainted

1.7.3: Text-Conditional Image-to-Image Translation

By using text prompts during SDEdit, we can guide the image transformation. I am using the prompt "a group of frogs playing in a band" for all the following text-guided edits:

Text-guided Campanile 1
i_start=1
Text-guided Campanile 2
i_start=3
Text-guided Campanile 3
i_start=5
Text-guided Campanile 4
i_start=7
Text-guided Campanile 5
i_start=10
Text-guided Campanile 6
i_start=20
Original Campanile
Original
Text-guided bear 1
i_start=1
Text-guided bear 2
i_start=3
Text-guided bear 3
i_start=5
Text-guided bear 4
i_start=7
Text-guided bear 5
i_start=10
Text-guided bear 6
i_start=20
Original Bear
Original
Text-guided balrog 1
i_start=1
Text-guided balrog 2
i_start=3
Text-guided balrog 3
i_start=5
Text-guided balrog 4
i_start=7
Text-guided balrog 5
i_start=10
Text-guided balrog 6
i_start=20
Original Balrog
Original

Part 1.8: Visual Anagrams

Visual anagrams create optical illusions where an image looks like one thing when viewed normally, but reveals a different image when flipped upside down. We achieve this by averaging noise estimates from two different prompts—one for the normal orientation and one for the flipped orientation.

$$\epsilon_1 = \text{CFG}(\text{UNet}(x_t, t, p_1))$$ $$\epsilon_2 = \text{flip}(\text{CFG}(\text{UNet}(\text{flip}(x_t), t, p_2)))$$ $$\epsilon = \frac{\epsilon_1 + \epsilon_2}{2}$$

where $p_1$ and $p_2$ are two different text prompts, and we average the noise estimates to create the final image.
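One denoising step then looks like the following sketch, where `cfg_unet` stands in for the CFG-wrapped noise estimator from Part 1.6 (name assumed):

```python
# Hedged sketch of the visual-anagram noise estimate: denoise upright under p1,
# denoise the vertically flipped image under p2, unflip, then average.
import torch

def anagram_noise_estimate(x_t, t, embeds_1, embeds_2, cfg_unet):
    eps_1 = cfg_unet(x_t, t, embeds_1)                        # upright view
    flipped = torch.flip(x_t, dims=[-2])                      # upside-down view
    eps_2 = torch.flip(cfg_unet(flipped, t, embeds_2), dims=[-2])
    return (eps_1 + eps_2) / 2
```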

Visual anagram: Puppy / Christmas Village
"A cute and happy puppy" / "a snowy christmas village"
Visual anagram: Purple Pipe / Music Group
"An orange striped cat smoking a purple pipe" / "A group of people listening to music"
Visual anagram: Frog Band / Lizard
"a group of frogs playing in a band" / "a photo of a lizard"
Visual anagram: Forest / Oil Painting
"A high quality scene of a forest" / "an oil painting of a person staring into the water"

Part 1.9: Hybrid Images

Hybrid images combine low-frequency information from one prompt with high-frequency information from another, similar to Project 2. We use Gaussian blur to separate frequencies and create composite noise estimates.

$$\epsilon_1 = \text{CFG}(\text{UNet}(x_t, t, p_1))$$ $$\epsilon_2 = \text{CFG}(\text{UNet}(x_t, t, p_2))$$ $$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$$

where UNet is the diffusion model UNet, $f_{\text{lowpass}}$ is a low pass function, $f_{\text{highpass}}$ is a high pass function, and $p_1$ and $p_2$ are two different text prompt embeddings. Our final noise estimate is $\epsilon$.
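A sketch of the composite estimate, using a Gaussian blur as the low-pass filter; the kernel size and $\sigma$ below are illustrative defaults, not necessarily the values used for these results:

```python
# Hedged sketch of the hybrid-image noise estimate: low-pass one prompt's
# estimate and add the high-pass residual of the other's.
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, embeds_low, embeds_high, cfg_unet,
                          kernel_size=33, sigma=2.0):
    eps_low = cfg_unet(x_t, t, embeds_low)
    eps_high = cfg_unet(x_t, t, embeds_high)
    low = TF.gaussian_blur(eps_low, kernel_size, sigma)               # keep low freqs
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)  # keep high freqs
    return low + high
```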

Hybrid coffee tophat
"a coffee cup" / "a top hat"
Hybrid skull forest
"a high quality scene of a forest" / "a skull"
Hybrid mountain face
"a mountain landscape" / "a close-up of a human face"
Hybrid blocks butterfly
"a detailed butterfly" / "abstract colorful blobs"

Hybrid images showcase the frequency-domain capabilities of diffusion models, allowing us to create images that appear different at different viewing distances or scales.

Building a U-Net

Project Overview

In Part B, we build and train our own flow matching model on MNIST using PyTorch. This involves implementing a UNet architecture from scratch, training it for single-step denoising, and then extending it to iterative flow matching with time and class conditioning.

Key Concepts

  • UNet Architecture: Encoder-decoder network with skip connections
  • Single-Step Denoising: Training a denoiser to map noisy images to clean images
  • Flow Matching: Learning the velocity field to iteratively denoise images
  • Time Conditioning: Injecting timestep information into the UNet
  • Class Conditioning: Conditioning generation on digit classes (0-9)
  • Classifier-Free Guidance: Improving generation quality through guidance

Part 1: Training a Single-Step Denoising UNet

We start by building a simple one-step denoiser. Given a noisy image, we aim to train a denoiser that maps it to a clean image.

1.1 Implementing the UNet

We implement a denoiser as a UNet that maps noisy images to clean images. The objective loss function for training the denoiser is:

$$\mathcal{L} = \mathbb{E}_{z,x}||D_\theta(z) - x||^2$$

where $D_\theta$ is the denoiser with parameters $\theta$, $z$ is a noisy image, and $x$ is the corresponding clean image. The expectation is taken over the distribution of noisy-clean image pairs.

The UNet consists of downsampling and upsampling blocks with skip connections. The architecture uses standard operations such as Conv2d, ConvTranspose2d, BatchNorm, and GELU, with AvgPool2d handling the flattening at the bottleneck. We use hidden dimension D = 128 for the single-step denoising UNet.

UNet Architecture Diagram
UNet architecture for denoising: encoder-decoder structure with skip connections. The network takes a noisy 1×28×28 grayscale image as input and outputs a clean 1×28×28 image.
Tensor Operations Blocks
Composed operations using simple tensor operations to make the network deeper. These blocks don't change the tensor's height, width, or number of channels, but add more learnable parameters.
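As one concrete example, a ConvBlock might look like the sketch below (consistent with the caption above; the exact block definitions follow the project spec):

```python
# Hedged sketch of a composed block: two 3x3 convs with BatchNorm and GELU.
# Preserves height, width, and channel count while adding learnable depth.
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```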

1.2 Using the UNet to Train a Denoiser

Recall from Section 1.1 that we aim to solve the following denoising problem: given a noisy image $z$, we train a denoiser $D_\theta$ that maps $z$ to a clean image $x$. To do so, we optimize the L2 loss:

$$\mathcal{L} = \mathbb{E}_{z,x}||D_\theta(z) - x||^2$$

To train our denoiser, we need to generate training data pairs of $(z, x)$, where each $x$ is a clean MNIST digit. For each training batch, we can generate $z$ from $x$ using the following noising process (Forward Process):

$$z = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $\sigma$ is a scalar noise level and $\epsilon$ is random noise drawn from a standard normal distribution.
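In code, the noising process is a single line; the sketch below generates the grid of noise levels visualized next (the random batch stands in for real MNIST digits):

```python
# Sketch of the MNIST noising process z = x + sigma * eps over several sigmas.
import torch

def add_noise(x: torch.Tensor, sigma: float) -> torch.Tensor:
    return x + sigma * torch.randn_like(x)

x = torch.rand(4, 1, 28, 28)   # stand-in for a batch of clean digits in [0, 1]
noisy = [add_noise(x, s) for s in (0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0)]
```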

We can visualize the different noising processes over $\sigma \in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$, assuming normalized $x \in [0, 1]$. Note that images become noisier as $\sigma$ increases.

Sigma 0.0
$\sigma = 0.0$
Sigma 0.2
$\sigma = 0.2$
Sigma 0.4
$\sigma = 0.4$
Sigma 0.5
$\sigma = 0.5$
Sigma 0.6
$\sigma = 0.6$
Sigma 0.8
$\sigma = 0.8$
Sigma 1.0
$\sigma = 1.0$

1.2.1 Training

We train the model on the MNIST training set with a batch size of 256 for 5 epochs. The model uses a UNet with hidden dimension D = 128, Adam optimizer with learning rate 1e-4, and noise level $\sigma = 0.5$.
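A minimal sketch of this training loop, assuming a `Unet` constructor and a standard MNIST `train_loader` (both names are placeholders, not the exact project code):

```python
# Hedged sketch of the single-step denoiser training loop
# (batch size 256, 5 epochs, Adam at lr 1e-4, sigma = 0.5, D = 128).
import torch
import torch.nn.functional as F

model = Unet(in_channels=1, num_hiddens=128).cuda()   # assumed constructor
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in train_loader:                         # labels unused here
        x = x.cuda()
        z = x + 0.5 * torch.randn_like(x)             # forward process, sigma = 0.5
        loss = F.mse_loss(model(z), x)                # L2 denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
```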

Let's sample some results on the test set at noise level $\sigma = 0.5$ after the 1st and 5th epochs. Each image shows three rows: the original image (top), the noisy image (middle), and the denoised result (bottom). The model clearly improves as training progresses, but it's important to note that it is trained on only one noise level ($\sigma = 0.5$) and may not generalize well to other noise levels.

Test samples epoch 1
After 1 epoch
Test samples epoch 5
After 5 epochs
Training Loss Curve
Training loss, plotted every few iterations over the full training run with $\sigma = 0.5$.

1.2.2 Out-of-Distribution Testing

Our denoiser was trained on MNIST digits noised with $\sigma = 0.5$. Let's see how it performs on different $\sigma$ values that it wasn't trained for.

Sigma 0.0
$\sigma = 0.0$
Sigma 0.2
$\sigma = 0.2$
Sigma 0.4
$\sigma = 0.4$
Sigma 0.5
$\sigma = 0.5$ (training level)
Sigma 0.6
$\sigma = 0.6$
Sigma 0.8
$\sigma = 0.8$
Sigma 1.0
$\sigma = 1.0$

The model performs poorly on out-of-distribution noise levels, especially higher ones. This is expected, as it was trained on only one noise level ($\sigma = 0.5$).

1.2.3 Denoising Pure Noise

We now turn denoising into a generative task by training the model to denoise pure random Gaussian noise. We can think of this as starting from a blank canvas $x_\sigma$ with $\sigma = 1$ and denoising it to obtain a clean image $x_0$. Once again, we display test samples after the 1st and 5th epochs to gauge model performance.

Pure noise denoising epoch 1
After 1 epoch
Pure noise denoising epoch 5
After 5 epochs
Training loss curve for pure noise denoising
Training loss curve for denoising pure noise

Key Observation

The generated outputs show blurry, averaged representations of digits. With an MSE loss, the model learns to predict the point that minimizes the sum of squared distances to all training examples—essentially the centroid of the digit distribution. This results in outputs that look like averaged versions of all digits rather than distinct digit samples.

Part 2: Training a Flow Matching Model

One-step denoising does not work well for generative tasks. Instead, we need to iteratively denoise the image using flow matching. We train a UNet model to predict the 'flow' from noisy data to clean data.

For iterative denoising, we define intermediate noisy samples by linearly interpolating between a noisy sample $x_1$ and a clean image $x_0$.

Mathematical Foundation of Flow Matching

For iterative denoising, we define intermediate samples and the flow:

$$x_t = (1-t)x_1 + tx_0, \quad u(x_t, t) = x_0 - x_1$$

where:

  • $x_0$ is a clean image sampled from distribution $p_0(x_0)$
  • $x_1$ is a noisy image sampled from distribution $p_1(x_1)$
  • $t \in [0, 1]$ is a timestep sampled uniformly from $U_{[0,1]}$
  • $x_t$ is the interpolated sample at timestep $t$
  • $u(x_t, t)$ is the true flow (velocity), i.e. the time derivative $\frac{dx_t}{dt} = x_0 - x_1$, which carries the sample from noise $x_1$ toward the clean image $x_0$

Our aim is to learn a UNet $u_\theta(x_t, t)$ which approximates this flow, giving us our learning objective:

$$\mathcal{L} = \mathbb{E}_{x_0 \sim p_0(x_0),\, x_1 \sim p_1(x_1),\, t \sim U_{[0,1]}} \| (x_0 - x_1) - u_\theta(x_t, t) \|^2$$

We can think of this as training a UNet to approximate the velocity field of an ODE. Once the velocity field is learned, we can iteratively denoise an image by integrating the ODE forward in time, from $t = 0$ (pure noise) to $t = 1$ (clean data).
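A minimal sketch of one training step under this objective, assuming the time-conditioned `unet(x_t, t)` built in the next section and 1×28×28 MNIST batches:

```python
# Hedged sketch of the flow matching loss: regress the UNet output onto the
# true velocity x0 - x1 at a uniformly sampled timestep.
import torch
import torch.nn.functional as F

def fm_loss(unet, x0: torch.Tensor) -> torch.Tensor:
    x1 = torch.randn_like(x0)                      # noise sample, p1 = N(0, I)
    t = torch.rand(x0.shape[0], device=x0.device)  # t ~ U[0, 1], one per sample
    tb = t.view(-1, 1, 1, 1)                       # broadcast over C, H, W
    x_t = (1 - tb) * x1 + tb * x0                  # linear interpolant
    return F.mse_loss(unet(x_t, t), x0 - x1)       # match the velocity field
```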

2.1 Adding Time Conditioning to UNet

To build a flow matching model, we need a way to incorporate the timestep $t$ into our UNet. We inject the scalar timestep using FCBlocks (fully-connected blocks): the embedded timestep is passed through two FCBlocks whose outputs are concatenated at two specific points in the network, after the Unflatten operation and after the middle UpBlock in the decoder path.
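A minimal FCBlock sketch follows; the exact injection points are shown in the diagram below, and the block definition here is an assumption consistent with the description:

```python
# Hedged sketch of an FCBlock: embeds a scalar timestep into a vector that can
# be broadcast over a (B, C, H, W) feature map at the injection points.
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B,) scalar timesteps -> (B, out_dim), later reshaped to
        # (B, out_dim, 1, 1) so it can combine with a feature map.
        return self.net(t.view(-1, 1))
```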

Time-Conditioned UNet Architecture Diagram
Time-conditioned UNet architecture for flow matching. The scalar timestep $t$ is processed through two FCBlocks, and their outputs are concatenated at two points in the decoder path (highlighted in red) to inject time-dependent information into the network.

Training Algorithm

The training algorithm for the time-conditioned UNet is as follows:

Training Algorithm for Time-Conditioned UNet
Training algorithm for the time-conditioned UNet.

2.2 Training the UNet

We train the time-conditioned UNet on MNIST with batch size 64. The model uses hidden dimension D = 64, Adam optimizer with initial learning rate 1e-2, and an exponential learning rate decay scheduler with gamma = 0.95.
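The optimization setup might look like the sketch below; the model constructor name is assumed, `train_one_epoch` is a placeholder for the loop in the training algorithm above, and the scheduler is stepped once per epoch:

```python
# Hedged sketch of the optimization setup for the time-conditioned UNet.
import torch

model = TimeConditionalUnet(in_channels=1, num_hiddens=64).cuda()  # assumed name
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)

for epoch in range(10):
    train_one_epoch(model, opt)    # loop over MNIST batches with fm_loss
    scheduler.step()               # decay the learning rate once per epoch
```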

Training loss curve for time-conditioned UNet
Training loss curve for the time-conditioned UNet

2.3 Sampling from the UNet

We can now use our UNet for iterative denoising. Starting from pure noise, we iteratively apply the learned flow to generate realistic digit images.
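Concretely, sampling integrates the learned ODE with Euler steps from $t = 0$ (noise) to $t = 1$ (data); the step count below is illustrative:

```python
# Hedged sketch of sampling: Euler-integrate the learned velocity field.
import torch

@torch.no_grad()
def sample(unet, num_steps=300, shape=(16, 1, 28, 28), device="cuda"):
    x = torch.randn(shape, device=device)          # pure noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * unet(x, t)                    # x_{t+dt} = x_t + dt * u_theta
    return x
```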

Sampling Algorithm

Sampling Algorithm for Time-Conditioned UNet
Sampling algorithm for the time-conditioned UNet.
Time-conditioned samples epoch 1
After 1 epoch
Time-conditioned samples epoch 5
After 5 epochs
Time-conditioned samples epoch 10
After 10 epochs

We can see that our UNet reaches its performance plateau fairly quickly, around 5 epochs. This is a good sign that the UNet is learning the flow model reasonably well, but there is still plenty of room for improvement. Since time conditioning helped, let's see if we can improve it further by adding class conditioning.

2.4 Adding Class-Conditioning to UNet

To improve results and give us more control for image generation, we condition our UNet on the class of the digit (0-9). We implement classifier-free guidance where 10% of the time we drop the class conditioning vector by setting it to 0.

2.5 Training the UNet

Training the class-conditioned UNet is similar to the time-only case, with the class-conditioning vector added as an input. As described in 2.4, we drop the class-conditioning vector with probability 0.1 (setting it to zero), which forces the model to also learn the unconditional flow needed for classifier-free guidance at sampling time; a sketch of this dropout is shown below.
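The dropout itself is a few lines inside the training step; the sketch below zeroes whole one-hot rows with probability 0.1:

```python
# Hedged sketch of class-conditioning dropout for classifier-free guidance.
import torch
import torch.nn.functional as F

def make_class_cond(y: torch.Tensor, num_classes: int = 10, p_uncond: float = 0.1):
    c = F.one_hot(y, num_classes).float()                      # (B, 10) one-hot
    keep = (torch.rand(y.shape[0], device=y.device) >= p_uncond).float()
    return c * keep.unsqueeze(1)                               # zeroed rows = unconditional
```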

Training Algorithm

Training Algorithm for Class-Conditioned UNet
Training algorithm for the class-conditioned UNet.
Training Loss Curve for Class-Conditioned UNet with AdamW
Training loss curve for the class-conditioned UNet with AdamW optimizer.

2.6 Sampling from the UNet

We sample with class-conditioning and use classifier-free guidance with guidance scale $\gamma = 1.5$.
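Each sampling step runs the UNet twice, once with the class vector zeroed, and extrapolates between the two velocities. A sketch, reusing the Euler update from Section 2.3:

```python
# Hedged sketch of one classifier-free-guided Euler step with gamma = 1.5.
import torch

@torch.no_grad()
def guided_step(unet, x, t, c, dt, gamma=1.5):
    u_uncond = unet(x, t, torch.zeros_like(c))     # null class vector
    u_cond = unet(x, t, c)                         # one-hot class vector
    u = u_uncond + gamma * (u_cond - u_uncond)     # CFG extrapolation
    return x + dt * u
```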

Sampling Algorithm

Sampling Algorithm for Class-Conditioned UNet
Sampling algorithm for the class-conditioned UNet with classifier-free guidance.

With Learning Rate Scheduler

Class-conditioned samples epoch 1 with LR schedule
After 1 epoch
Class-conditioned samples epoch 5 with LR schedule
After 5 epochs
Class-conditioned samples epoch 10 with LR schedule
After 10 epochs

Without Learning Rate Scheduler (AdamW Optimizer)

We removed the exponential learning rate scheduler and compensated by using the AdamW optimizer, which provides better weight decay and can help maintain performance without explicit learning rate decay.

Class-conditioned samples epoch 1 with AdamW
After 1 epoch
Class-conditioned samples epoch 5 with AdamW
After 5 epochs
Class-conditioned samples epoch 10 with AdamW
After 10 epochs

Compensation Strategy

To maintain performance without the learning rate scheduler, we switched from Adam to AdamW. AdamW decouples weight decay from the adaptive gradient update, giving more consistent regularization than Adam's L2 penalty. The results show that we can achieve comparable quality without explicit learning rate decay, simplifying the training setup while maintaining performance.

We can see that the class-conditioned UNet performs significantly better than the time-conditioned UNet, a good sign that class conditioning helps the model learn the flow. We also found that the learning rate scheduler was hindering performance in this setup, and that switching to AdamW achieved better results.

We have successfully trained a flow matching model that can denoise images and generate realistic digit images!