Diffusion Models & Building a U-Net

Generative Models with Flow Matching

Project Overview

In Part A, we explore pre-trained diffusion models (DeepFloyd IF) to implement sampling loops, denoising techniques, and creative applications like inpainting and optical illusions. This part focuses on understanding how diffusion models work through hands-on experimentation with a state-of-the-art text-to-image model.

Key Concepts

  • Forward Process: Adding noise to images progressively
  • Denoising: Removing noise using trained UNet models
  • Sampling: Generating images from pure noise
  • Classifier-Free Guidance (CFG): Improving generation quality
  • Image-to-Image: Editing existing images via SDEdit
  • Inpainting: Filling masked regions with new content
  • Visual Anagrams: Images that change when flipped
  • Hybrid Images: Combining low/high frequencies from different prompts

Fun with Diffusion Models

Part 0: Setup and Text Prompts

We use the DeepFloyd IF diffusion model, a two-stage model that generates 64×64 images in the first stage and upsamples them to 256×256 in the second stage. After setting up access to DeepFloyd and generating prompt embeddings, we experiment with custom text prompts.
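Below is a minimal sketch of how the stage-1 pipeline might be loaded through Hugging Face diffusers. The exact setup here followed the course-provided notebook; the snippet assumes the model license has been accepted on the Hugging Face model page and an authenticated session.

```python
# Hedged sketch: loading the DeepFloyd IF stage-1 pipeline via diffusers.
# Assumes the license is accepted and `huggingface-cli login` has been run.
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# Precompute text embeddings once so the UNet can be driven directly later.
prompt_embeds, negative_embeds = stage_1.encode_prompt("a detailed butterfly")
```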

Setup Notes

  • Model: DeepFloyd/IF-I-XL-v1.0 from Hugging Face
  • Prompt embeddings generated via Hugging Face clusters
  • Random seed used consistently throughout: 777

Custom Text Prompts

I created several interesting text prompts and generated their embeddings. Here are 3 examples with their generated images at different inference steps:

20 Inference Steps

Butterfly 20 iterations
a detailed butterfly
Lantern 20 iterations
a lantern emanating light
Mural 20 iterations
a mural on the side of a building

100 Inference Steps

Butterfly 100 iterations
a detailed butterfly
Lantern 100 iterations
a lantern emanating light
Mural 100 iterations
a mural on the side of a building

Reflection: The generated images show good alignment with their text prompts. The butterfly prompt produces detailed wing patterns, the lantern captures the light emission effect, and the mural displays building-side artwork. Increasing inference steps from 20 to 100 improves image quality and detail, with 100 steps producing sharper, more coherent results. However, the improvement is incremental, suggesting diminishing returns beyond a certain number of steps.

Part 1.1: Implementing the Forward Process

The forward process takes a clean image $x_0$ and adds noise to it at timestep $t$, producing a noisy image $x_t$. This is defined by:

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ is random noise and $\bar{\alpha}_t$ controls the noise level. We test this on the Berkeley Campanile image at different noise levels.
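In code, the forward process is a one-liner. The sketch below assumes `alphas_cumprod` is the scheduler's precomputed $\bar{\alpha}$ table, indexed by timestep:

```python
# Sketch of the forward process q(x_t | x_0): scale the clean image down and
# mix in Gaussian noise according to alpha_bar at timestep t.
import torch

def forward(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)                            # eps ~ N(0, I)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
```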

Original Campanile
Original (t=0)
Noisy Campanile at t=250
t=250
Noisy Campanile at t=500
t=500
Noisy Campanile at t=750
t=750

Part 1.2: Classical Denoising

Before using the diffusion model, we attempt to denoise the images using classical Gaussian blur filtering. This serves as a baseline to compare against the learned denoising capabilities of the diffusion model.

Noisy Images (Forward Process)

Noisy at t=250
Noisy at t=250
Noisy at t=500
Noisy at t=500
Noisy at t=750
Noisy at t=750

Classical Denoising (Gaussian Blur)

Gaussian blur denoised at t=250
Gaussian Blur Denoised at t=250
Gaussian blur denoised at t=500
Gaussian Blur Denoised at t=500
Gaussian blur denoised at t=750
Gaussian Blur Denoised at t=750

As expected, classical Gaussian blur filtering struggles to remove noise effectively, especially at higher noise levels. The diffusion model will show significantly better results.

Part 1.3: One-Step Denoising

Now we use the pretrained UNet to denoise images. The UNet predicts the noise in a noisy image, which we can then remove to recover an estimate of the original image. This demonstrates the model's ability to project noisy images back onto the natural image manifold.

Noisy Images (Forward Process)

Noisy at t=250
Noisy at t=250
Noisy at t=500
Noisy at t=500
Noisy at t=750
Noisy at t=750

One-Step Denoised

One-step denoised at t=250
One-Step Denoised at t=250
One-step denoised at t=500
One-Step Denoised at t=500
One-step denoised at t=750
One-Step Denoised at t=750

The UNet does a much better job than Gaussian blur, successfully projecting images back onto the natural image manifold. However, quality degrades with more noise, which motivates iterative denoising.

Part 1.4: Iterative Denoising

Instead of denoising in a single step, we implement iterative denoising by taking multiple steps through the diffusion process. We use strided timesteps (starting at 990 with a stride of 30) to speed up the process while maintaining quality.

Mathematical Foundation of Iterative Denoising

The formula for iterative denoising from timestep $t$ to $t'$ (where $t' < t$) is:

$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_{\sigma}$$

where:

  • $x_t$ is the image at timestep $t$
  • $x_{t'}$ is the noisy image at timestep $t'$, where $t' < t$ (less noisy)
  • $\bar{\alpha}_t$ is the cumulative product of the $\alpha$ values, given by the scheduler's $\text{alphas\_cumprod}$
  • $\alpha_t = \frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}}$
  • $\beta_t = 1 - \alpha_t$
  • $x_0$ is the current estimate of the clean image, obtained via one-step denoising
  • $v_{\sigma}$ is random noise, which in the case of DeepFloyd is also predicted; the exact computation is not important, and the project supplies a function, $\text{add\_variance}$, to handle it

This formula allows us to iteratively denoise an image by moving from noisier timesteps ($t$) to less noisy timesteps ($t'$), effectively interpolating between the noisy signal and the clean image estimate.
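One update step might look like the sketch below. The helper names and the $\text{add\_variance}$ signature are assumptions standing in for the project code, and the channel split reflects that DeepFloyd's UNet also predicts variance channels:

```python
# Hedged sketch of one iterative-denoising update from timestep t to t' (t' < t).
import torch

@torch.no_grad()
def denoise_step(x_t, t, t_prime, unet, prompt_embeds, alphas_cumprod, add_variance):
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp              # alpha over the stride from t' to t
    beta_t = 1 - alpha_t

    out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps = out[:, :3]                        # first 3 channels: noise estimate

    # One-step estimate of the clean image from the predicted noise.
    x0_est = (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()

    mean = (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_est \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
    return add_variance(mean)               # adds the predicted v_sigma term
```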

t=90
t=90
t=240
t=240
t=390
t=390
t=540
t=540
t=690
t=690
Final Predicted
Final Predicted
One-Step Denoising
One-Step Denoising
Classic Gaussian Denoise
Classic Gaussian Denoise

Iterative denoising produces the best results, gradually refining the image through multiple steps. One-step denoising struggles with high noise levels, and Gaussian blur is clearly inferior.

Part 1.5: Diffusion Model Sampling

We can generate images from scratch by starting with pure noise and iteratively denoising it. This demonstrates the generative capabilities of the diffusion model.

Here are 5 sampled images using the prompt "a high quality photo":

Sample 1
Sample 1
Sample 2
Sample 2
Sample 3
Sample 3
Sample 4
Sample 4
Sample 5
Sample 5

The generated images show reasonable quality but could be improved. We'll use Classifier-Free Guidance (CFG) in the next section to enhance quality.

Part 1.6: Classifier-Free Guidance (CFG)

Classifier-Free Guidance improves image quality by combining conditional and unconditional noise estimates. The formula is:

$$\hat{\epsilon} = \epsilon_{uncond} + \gamma \times (\epsilon_{cond} - \epsilon_{uncond})$$

where $\gamma > 1$ amplifies the effect of the text prompt. We use $\gamma = 7.5$.
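In code, CFG is two UNet evaluations and an extrapolation. The sketch below assumes a `noise_estimate` wrapper around the UNet call and a null ("") prompt embedding for the unconditional pass:

```python
# Hedged sketch of classifier-free guidance: extrapolate from the unconditional
# noise estimate toward the conditional one with gamma > 1.
def cfg_estimate(noise_estimate, x_t, t, cond_embeds, uncond_embeds, gamma=7.5):
    eps_cond = noise_estimate(x_t, t, cond_embeds)      # conditional estimate
    eps_uncond = noise_estimate(x_t, t, uncond_embeds)  # unconditional estimate
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```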

Here are 5 images generated with CFG using the prompt "a high quality photo":

CFG Sample 1
CFG Sample 1
CFG Sample 2
CFG Sample 2
CFG Sample 3
CFG Sample 3
CFG Sample 4
CFG Sample 4
CFG Sample 5
CFG Sample 5

The images generated with CFG show significantly improved quality and better alignment with the text prompt compared to unconditional sampling.

Part 1.7: Image-to-Image Translation (SDEdit)

SDEdit allows us to edit existing images by adding noise and then denoising with the diffusion model. The amount of noise added controls how much the image changes. We test this on the Campanile and custom images at starting indices i_start ∈ {1, 3, 5, 7, 10, 20}; since the strided timesteps start at 990 and decrease, a smaller i_start means more added noise, so the result strays further from the original.
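In code, SDEdit reduces to noising the input to an intermediate timestep and resuming the usual CFG denoising loop from there. The helper names below (`forward`, `iterative_denoise_cfg`) are assumptions standing in for the functions built in Parts 1.1 and 1.6:

```python
# Hedged sketch of SDEdit: partially noise the input, then denoise as usual.
def sdedit(image, i_start, strided_timesteps, prompt_embeds):
    t = strided_timesteps[i_start]
    x_t = forward(image, t)                 # noise the image to timestep t
    return iterative_denoise_cfg(x_t, i_start, prompt_embeds)
```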

Campanile Original
Original
i_start=1
i_start=1
i_start=3
i_start=3
i_start=5
i_start=5
i_start=7
i_start=7
i_start=10
i_start=10
i_start=20
i_start=20
Christmas Hat Original
Original
Christmas Hat i_start=1
i_start=1
Christmas Hat i_start=3
i_start=3
Christmas Hat i_start=5
i_start=5
Christmas Hat i_start=7
i_start=7
Christmas Hat i_start=10
i_start=10
Christmas Hat i_start=20
i_start=20
Iguana Original
Original
Iguana i_start=1
i_start=1
Iguana i_start=3
i_start=3
Iguana i_start=5
i_start=5
Iguana i_start=7
i_start=7
Iguana i_start=10
i_start=10
Iguana i_start=20
i_start=20

1.7.1: Hand-Drawn and Web Images

SDEdit works particularly well with non-realistic images like sketches or paintings. Here are examples starting from hand-drawn images and web images:

Web image bear original
Original
Web image bear edit 1
i_start=1
Web image bear edit 2
i_start=3
Web image bear edit 3
i_start=5
Web image bear edit 4
i_start=7
Web image bear edit 5
i_start=10
Web image bear edit 6
i_start=20
Hand-drawn sun original
Original
Hand-drawn sun edit 1
i_start=1
Hand-drawn sun edit 2
i_start=3
Hand-drawn sun edit 3
i_start=5
Hand-drawn sun edit 4
i_start=7
Hand-drawn sun edit 5
i_start=10
Hand-drawn sun edit 6
i_start=20
Hand-drawn tree original
Original
Hand-drawn tree edit 1
i_start=1
Hand-drawn tree edit 2
i_start=3
Hand-drawn tree edit 3
i_start=5
Hand-drawn tree edit 4
i_start=7
Hand-drawn tree edit 5
i_start=10
Hand-drawn tree edit 6
i_start=20

1.7.2: Inpainting

Inpainting allows us to fill masked regions of an image with new content. We use the RePaint algorithm, which preserves unmasked regions while generating new content in masked areas.
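The key step, sketched below with the helper names used earlier, is to project back onto the known pixels after every denoising update: only the masked region (where $m = 1$) keeps the generated content, while everything else is reset to a correctly-noised copy of the original.

```python
# Hedged sketch of the inpainting loop (RePaint-style projection each step).
import torch

@torch.no_grad()
def inpaint(x_orig, m, strided_timesteps, denoise_step_fn, forward_fn):
    x = torch.randn_like(x_orig)                   # start from pure noise
    for i, t in enumerate(strided_timesteps[:-1]):
        t_prime = strided_timesteps[i + 1]
        x = denoise_step_fn(x, t, t_prime)         # one CFG denoising step
        x = m * x + (1 - m) * forward_fn(x_orig, t_prime)  # keep unmasked pixels
    return x
```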

Original Campanile
Original
Mask
Mask
Campanile inpainted no prompt
Inpainted (no prompt, scale=3)
Campanile inpainted with frog prompt
Inpainted (frog prompt, scale=9)
Bear Original
Original
Bear Mask
Mask
Bear To Replace
To Replace
Bear Inpainted
Inpainted
Hollow Knight Original
Original
Hollow Knight Mask
Mask
Hollow Knight To Replace
To Replace
Hollow Knight Inpainted
Inpainted

1.7.3: Text-Conditional Image-to-Image Translation

By using text prompts during SDEdit, we can guide the image transformation. I am using the prompt "a group of frogs playing in a band" for all the following text-guided edits:

Text-guided Campanile 1
i_start=1
Text-guided Campanile 2
i_start=3
Text-guided Campanile 3
i_start=5
Text-guided Campanile 4
i_start=7
Text-guided Campanile 5
i_start=10
Text-guided Campanile 6
i_start=20
Original Campanile
Original
Text-guided bear 1
i_start=1
Text-guided bear 2
i_start=3
Text-guided bear 3
i_start=5
Text-guided bear 4
i_start=7
Text-guided bear 5
i_start=10
Text-guided bear 6
i_start=20
Original Bear
Original
Text-guided balrog 1
i_start=1
Text-guided balrog 2
i_start=3
Text-guided balrog 3
i_start=5
Text-guided balrog 4
i_start=7
Text-guided balrog 5
i_start=10
Text-guided balrog 6
i_start=20
Original Balrog
Original

Part 1.8: Visual Anagrams

Visual anagrams create optical illusions where an image looks like one thing when viewed normally, but reveals a different image when flipped upside down. We achieve this by averaging noise estimates from two different prompts—one for the normal orientation and one for the flipped orientation.

$$\epsilon_1 = \text{CFG}(\text{UNet}(x_t, t, p_1))$$ $$\epsilon_2 = \text{flip}(\text{CFG}(\text{UNet}(\text{flip}(x_t), t, p_2)))$$ $$\epsilon = \frac{\epsilon_1 + \epsilon_2}{2}$$

where $p_1$ and $p_2$ are two different text prompts, and we average the noise estimates to create the final image.
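One denoising step then looks like the following sketch, where `cfg_unet` stands in for the CFG-wrapped noise estimator from Part 1.6 (name assumed):

```python
# Hedged sketch of the visual-anagram noise estimate: denoise upright under p1,
# denoise the vertically flipped image under p2, unflip, then average.
import torch

def anagram_noise_estimate(x_t, t, embeds_1, embeds_2, cfg_unet):
    eps_1 = cfg_unet(x_t, t, embeds_1)                        # upright view
    flipped = torch.flip(x_t, dims=[-2])                      # upside-down view
    eps_2 = torch.flip(cfg_unet(flipped, t, embeds_2), dims=[-2])
    return (eps_1 + eps_2) / 2
```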

Visual anagram: Puppy / Christmas Village
"A cute and happy puppy" / "a snowy christmas village"
Visual anagram: Purple Pipe / Music Group
"An orange striped cat smoking a purple pipe" / "A group of people listening to music"
Visual anagram: Frog Band / Lizard
"a group of frogs playing in a band" / "a photo of a lizard"
Visual anagram: Forest / Oil Painting
"A high quality scene of a forest" / "an oil painting of a person staring into the water"

Part 1.9: Hybrid Images

Hybrid images combine low-frequency information from one prompt with high-frequency information from another, similar to Project 2. We use Gaussian blur to separate frequencies and create composite noise estimates.

$$\epsilon_1 = \text{CFG}(\text{UNet}(x_t, t, p_1))$$ $$\epsilon_2 = \text{CFG}(\text{UNet}(x_t, t, p_2))$$ $$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$$

where UNet is the diffusion model UNet, $f_{\text{lowpass}}$ is a low pass function, $f_{\text{highpass}}$ is a high pass function, and $p_1$ and $p_2$ are two different text prompt embeddings. Our final noise estimate is $\epsilon$.
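A sketch of the composite estimate, using a Gaussian blur as the low-pass filter; the kernel size and $\sigma$ below are illustrative defaults, not necessarily the values used for these results:

```python
# Hedged sketch of the hybrid-image noise estimate: low-pass one prompt's
# estimate and add the high-pass residual of the other's.
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, embeds_low, embeds_high, cfg_unet,
                          kernel_size=33, sigma=2.0):
    eps_low = cfg_unet(x_t, t, embeds_low)
    eps_high = cfg_unet(x_t, t, embeds_high)
    low = TF.gaussian_blur(eps_low, kernel_size, sigma)               # keep low freqs
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)  # keep high freqs
    return low + high
```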

Hybrid coffee tophat
"a coffee cup" / "a top hat"
Hybrid skull forest
"a high quality scene of a forest" / "a skull"
Hybrid mountain face
"a mountain landscape" / "a close-up of a human face"
Hybrid blocks butterfly
"a detailed butterfly" / "abstract colorful blobs"

Hybrid images showcase the frequency-domain capabilities of diffusion models, allowing us to create images that appear different at different viewing distances or scales.

Building a U-Net

Project Overview

In Part B, we build and train our own flow matching model on MNIST using PyTorch. This involves implementing a UNet architecture from scratch, training it for single-step denoising, and then extending it to iterative flow matching with time and class conditioning.

Key Concepts

  • UNet Architecture: Encoder-decoder network with skip connections
  • Single-Step Denoising: Training a denoiser to map noisy images to clean images
  • Flow Matching: Learning the velocity field to iteratively denoise images
  • Time Conditioning: Injecting timestep information into the UNet
  • Class Conditioning: Conditioning generation on digit classes (0-9)
  • Classifier-Free Guidance: Improving generation quality through guidance

Part 1: Training a Single-Step Denoising UNet

We start by building a simple one-step denoiser. Given a noisy image, we aim to train a denoiser that maps it to a clean image.

1.1 Implementing the UNet

We implement a denoiser as a UNet that maps noisy images to clean images. The objective loss function for training the denoiser is:

$$\mathcal{L} = \mathbb{E}_{z,x}||D_\theta(z) - x||^2$$

where $D_\theta$ is the denoiser with parameters $\theta$, $z$ is a noisy image, and $x$ is the corresponding clean image. The expectation is taken over the distribution of noisy-clean image pairs.

The UNet consists of downsampling and upsampling blocks with skip connections. The architecture uses standard operations such as Conv2d, ConvTranspose2d, BatchNorm, and GELU, with AvgPool2d handling the flattening at the bottleneck. We use hidden dimension D = 128 for the single-step denoising UNet.

UNet Architecture Diagram
UNet architecture for denoising: encoder-decoder structure with skip connections. The network takes a noisy 1×28×28 grayscale image as input and outputs a clean 1×28×28 image.
Tensor Operations Blocks
Composed operations using simple tensor operations to make the network deeper. These blocks don't change the tensor's height, width, or number of channels, but add more learnable parameters.
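As one concrete example, a ConvBlock might look like the sketch below (consistent with the caption above; the exact block definitions follow the project spec):

```python
# Hedged sketch of a composed block: two 3x3 convs with BatchNorm and GELU.
# Preserves height, width, and channel count while adding learnable depth.
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```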

1.2 Using the UNet to Train a Denoiser

Recall from Section 1.1 that we aim to solve the following denoising problem: given a noisy image $z$, we train a denoiser $D_\theta$ that maps $z$ to a clean image $x$. To do so, we optimize the L2 loss:

$$\mathcal{L} = \mathbb{E}_{z,x}||D_\theta(z) - x||^2$$

To train our denoiser, we need to generate training data pairs of $(z, x)$, where each $x$ is a clean MNIST digit. For each training batch, we can generate $z$ from $x$ using the following noising process (Forward Process):

$$z = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $\sigma$ is a scalar noise level and $\epsilon$ is random noise drawn from a standard normal distribution.
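In code, the noising process is a single line; the sketch below generates the grid of noise levels visualized next (the random batch stands in for real MNIST digits):

```python
# Sketch of the MNIST noising process z = x + sigma * eps over several sigmas.
import torch

def add_noise(x: torch.Tensor, sigma: float) -> torch.Tensor:
    return x + sigma * torch.randn_like(x)

x = torch.rand(4, 1, 28, 28)   # stand-in for a batch of clean digits in [0, 1]
noisy = [add_noise(x, s) for s in (0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0)]
```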

We can visualize the different noising processes over $\sigma \in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$, assuming normalized $x \in [0, 1]$. Note that images become noisier as $\sigma$ increases.

Sigma 0.0
$\sigma = 0.0$
Sigma 0.2
$\sigma = 0.2$
Sigma 0.4
$\sigma = 0.4$
Sigma 0.5
$\sigma = 0.5$
Sigma 0.6
$\sigma = 0.6$
Sigma 0.8
$\sigma = 0.8$
Sigma 1.0
$\sigma = 1.0$

1.2.1 Training

We train the model on the MNIST training set with a batch size of 256 for 5 epochs. The model uses a UNet with hidden dimension D = 128, Adam optimizer with learning rate 1e-4, and noise level $\sigma = 0.5$.
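A minimal sketch of this training loop, assuming a `Unet` constructor and a standard MNIST `train_loader` (both names are placeholders, not the exact project code):

```python
# Hedged sketch of the single-step denoiser training loop
# (batch size 256, 5 epochs, Adam at lr 1e-4, sigma = 0.5, D = 128).
import torch
import torch.nn.functional as F

model = Unet(in_channels=1, num_hiddens=128).cuda()   # assumed constructor
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in train_loader:                         # labels unused here
        x = x.cuda()
        z = x + 0.5 * torch.randn_like(x)             # forward process, sigma = 0.5
        loss = F.mse_loss(model(z), x)                # L2 denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
```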

Let's sample some results on the test set at noise level $\sigma = 0.5$ after the 1st and 5th epochs. Each image shows three rows: the original image (top), the noisy image (middle), and the denoised result (bottom). The model clearly improves as training progresses, but it's important to note that it is trained on only one noise level ($\sigma = 0.5$) and may not generalize well to other noise levels.

Test samples epoch 1
After 1 epoch
Test samples epoch 5
After 5 epochs
Training Loss Curve
Training loss, plotted every few iterations over the full training run with $\sigma = 0.5$.

1.2.2 Out-of-Distribution Testing

Our denoiser was trained on MNIST digits noised with $\sigma = 0.5$. Let's see how it performs on different $\sigma$ values that it wasn't trained for.

Sigma 0.0
$\sigma = 0.0$
Sigma 0.2
$\sigma = 0.2$
Sigma 0.4
$\sigma = 0.4$
Sigma 0.5
$\sigma = 0.5$ (training level)
Sigma 0.6
$\sigma = 0.6$
Sigma 0.8
$\sigma = 0.8$
Sigma 1.0
$\sigma = 1.0$

The model performs poorly on out-of-distribution noise levels, especially higher ones. This is expected, as it was trained on only one noise level ($\sigma = 0.5$).

1.2.3 Denoising Pure Noise

We now turn denoising into a generative task by training the model to denoise pure random Gaussian noise. We can think of this as starting from a blank canvas $x_\sigma$ with $\sigma = 1$ and denoising it to obtain a clean image $x_0$. Once again, we display test samples after the 1st and 5th epochs to gauge model performance.

Pure noise denoising epoch 1
After 1 epoch
Pure noise denoising epoch 5
After 5 epochs
Training loss curve for pure noise denoising
Training loss curve for denoising pure noise

Key Observation

The generated outputs show blurry, averaged representations of digits. With an MSE loss, the model learns to predict the point that minimizes the sum of squared distances to all training examples—essentially the centroid of the digit distribution. This results in outputs that look like averaged versions of all digits rather than distinct digit samples.

Part 2: Training a Flow Matching Model

One-step denoising does not work well for generative tasks. Instead, we need to iteratively denoise the image using flow matching. We train a UNet model to predict the 'flow' from noisy data to clean data.

For iterative denoising, we define intermediate noisy samples by linearly interpolating between a noisy sample $x_1$ and a clean image $x_0$.

Mathematical Foundation of Flow Matching

For iterative denoising, we define intermediate samples and the flow:

$$x_t = (1-t)x_1 + tx_0, \quad u(x_t, t) = x_0 - x_1$$

where:

  • $x_0$ is a clean image sampled from distribution $p_0(x_0)$
  • $x_1$ is a noisy image sampled from distribution $p_1(x_1)$
  • $t \in [0, 1]$ is a timestep sampled uniformly from $U_{[0,1]}$
  • $x_t$ is the interpolated sample at timestep $t$
  • $u(x_t, t)$ is the true flow (velocity), i.e. the time derivative $\frac{dx_t}{dt} = x_0 - x_1$, which carries the sample from noise $x_1$ toward the clean image $x_0$

Our aim is to learn a UNet $u_\theta(x_t, t)$ which approximates this flow, giving us our learning objective:

$$\mathcal{L} = \mathbb{E}_{x_0 \sim p_0(x_0),\, x_1 \sim p_1(x_1),\, t \sim U_{[0,1]}} \| (x_0 - x_1) - u_\theta(x_t, t) \|^2$$

We can think of this as training a UNet to approximate the velocity field of an ODE. Once the velocity field is learned, we can iteratively denoise an image by integrating the ODE forward in time, from $t = 0$ (pure noise) to $t = 1$ (clean data).
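A minimal sketch of one training step under this objective, assuming the time-conditioned `unet(x_t, t)` built in the next section and 1×28×28 MNIST batches:

```python
# Hedged sketch of the flow matching loss: regress the UNet output onto the
# true velocity x0 - x1 at a uniformly sampled timestep.
import torch
import torch.nn.functional as F

def fm_loss(unet, x0: torch.Tensor) -> torch.Tensor:
    x1 = torch.randn_like(x0)                      # noise sample, p1 = N(0, I)
    t = torch.rand(x0.shape[0], device=x0.device)  # t ~ U[0, 1], one per sample
    tb = t.view(-1, 1, 1, 1)                       # broadcast over C, H, W
    x_t = (1 - tb) * x1 + tb * x0                  # linear interpolant
    return F.mse_loss(unet(x_t, t), x0 - x1)       # match the velocity field
```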

2.1 Adding Time Conditioning to UNet

To build a flow matching model, we need a way to incorporate the timestep $t$ into our UNet. We inject the scalar timestep using FCBlocks (fully-connected blocks): the embedded timestep is passed through two FCBlocks whose outputs are concatenated at two specific points in the network, after the Unflatten operation and after the middle UpBlock in the decoder path.
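A minimal FCBlock sketch follows; the exact injection points are shown in the diagram below, and the block definition here is an assumption consistent with the description:

```python
# Hedged sketch of an FCBlock: embeds a scalar timestep into a vector that can
# be broadcast over a (B, C, H, W) feature map at the injection points.
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B,) scalar timesteps -> (B, out_dim), later reshaped to
        # (B, out_dim, 1, 1) so it can combine with a feature map.
        return self.net(t.view(-1, 1))
```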

Time-Conditioned UNet Architecture Diagram
Time-conditioned UNet architecture for flow matching. The scalar timestep $t$ is processed through two FCBlocks, and their outputs are concatenated at two points in the decoder path (highlighted in red) to inject time-dependent information into the network.

Training Algorithm

The training algorithm for the time-conditioned UNet is as follows:

Training Algorithm for Time-Conditioned UNet
Training algorithm for the time-conditioned UNet.

2.2 Training the UNet

We train the time-conditioned UNet on MNIST with batch size 64. The model uses hidden dimension D = 64, Adam optimizer with initial learning rate 1e-2, and an exponential learning rate decay scheduler with gamma = 0.95.
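The optimization setup might look like the sketch below; the model constructor name is assumed, `train_one_epoch` is a placeholder for the loop in the training algorithm above, and the scheduler is stepped once per epoch:

```python
# Hedged sketch of the optimization setup for the time-conditioned UNet.
import torch

model = TimeConditionalUnet(in_channels=1, num_hiddens=64).cuda()  # assumed name
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)

for epoch in range(10):
    train_one_epoch(model, opt)    # loop over MNIST batches with fm_loss
    scheduler.step()               # decay the learning rate once per epoch
```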

Training loss curve for time-conditioned UNet
Training loss curve for the time-conditioned UNet

2.3 Sampling from the UNet

We can now use our UNet for iterative denoising. Starting from pure noise, we iteratively apply the learned flow to generate realistic digit images.
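Concretely, sampling integrates the learned ODE with Euler steps from $t = 0$ (noise) to $t = 1$ (data); the step count below is illustrative:

```python
# Hedged sketch of sampling: Euler-integrate the learned velocity field.
import torch

@torch.no_grad()
def sample(unet, num_steps=300, shape=(16, 1, 28, 28), device="cuda"):
    x = torch.randn(shape, device=device)          # pure noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * unet(x, t)                    # x_{t+dt} = x_t + dt * u_theta
    return x
```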

Sampling Algorithm

Sampling Algorithm for Time-Conditioned UNet
Sampling algorithm for the time-conditioned UNet.
Time-conditioned samples epoch 1
After 1 epoch
Time-conditioned samples epoch 5
After 5 epochs
Time-conditioned samples epoch 10
After 10 epochs

We can see that our UNet reaches its performance plateau fairly quickly, around 5 epochs. This is a good sign that the UNet is learning the flow model reasonably well, but there is still plenty of room for improvement. Since time conditioning helped, let's see if we can improve it further by adding class conditioning.

2.4 Adding Class-Conditioning to UNet

To improve results and give us more control for image generation, we condition our UNet on the class of the digit (0-9). We implement classifier-free guidance where 10% of the time we drop the class conditioning vector by setting it to 0.

2.5 Training the UNet

Training the class-conditioned UNet is similar to the time-only case, with the class-conditioning vector added as an input. As described in 2.4, we drop the class-conditioning vector with probability 0.1 (setting it to zero), which forces the model to also learn the unconditional flow needed for classifier-free guidance at sampling time; a sketch of this dropout is shown below.
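The dropout itself is a few lines inside the training step; the sketch below zeroes whole one-hot rows with probability 0.1:

```python
# Hedged sketch of class-conditioning dropout for classifier-free guidance.
import torch
import torch.nn.functional as F

def make_class_cond(y: torch.Tensor, num_classes: int = 10, p_uncond: float = 0.1):
    c = F.one_hot(y, num_classes).float()                      # (B, 10) one-hot
    keep = (torch.rand(y.shape[0], device=y.device) >= p_uncond).float()
    return c * keep.unsqueeze(1)                               # zeroed rows = unconditional
```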

Training Algorithm

Training Algorithm for Class-Conditioned UNet
Training algorithm for the class-conditioned UNet.
Training Loss Curve for Class-Conditioned UNet with AdamW
Training loss curve for the class-conditioned UNet with AdamW optimizer.

2.6 Sampling from the UNet

We sample with class-conditioning and use classifier-free guidance with guidance scale $\gamma = 1.5$.
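Each sampling step runs the UNet twice, once with the class vector zeroed, and extrapolates between the two velocities. A sketch, reusing the Euler update from Section 2.3:

```python
# Hedged sketch of one classifier-free-guided Euler step with gamma = 1.5.
import torch

@torch.no_grad()
def guided_step(unet, x, t, c, dt, gamma=1.5):
    u_uncond = unet(x, t, torch.zeros_like(c))     # null class vector
    u_cond = unet(x, t, c)                         # one-hot class vector
    u = u_uncond + gamma * (u_cond - u_uncond)     # CFG extrapolation
    return x + dt * u
```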

Sampling Algorithm

Sampling Algorithm for Class-Conditioned UNet
Sampling algorithm for the class-conditioned UNet with classifier-free guidance.

With Learning Rate Scheduler

Class-conditioned samples epoch 1 with LR schedule
After 1 epoch
Class-conditioned samples epoch 5 with LR schedule
After 5 epochs
Class-conditioned samples epoch 10 with LR schedule
After 10 epochs

Without Learning Rate Scheduler (AdamW Optimizer)

We removed the exponential learning rate scheduler and compensated by using the AdamW optimizer, which provides better weight decay and can help maintain performance without explicit learning rate decay.

Class-conditioned samples epoch 1 with AdamW
After 1 epoch
Class-conditioned samples epoch 5 with AdamW
After 5 epochs
Class-conditioned samples epoch 10 with AdamW
After 10 epochs

Compensation Strategy

To maintain performance without the learning rate scheduler, we switched from Adam to AdamW. AdamW decouples weight decay from the adaptive gradient update, giving more consistent regularization than Adam's L2 penalty. The results show that we can achieve comparable quality without explicit learning rate decay, simplifying the training setup while maintaining performance.

We can see that the class-conditioned UNet performs significantly better than the time-conditioned UNet, a good sign that class conditioning helps the model learn the flow. We also found that the learning rate scheduler was hindering performance in this setup, and that switching to AdamW achieved better results.

We have successfully trained a flow matching model that can denoise images and generate realistic digit images!