What’s the deal with Stable Diffusion?

Deep Gan Team
3 min read · Feb 28, 2023

It’s nearly impossible to miss the current explosion of generative AI. You may have seen a Facebook friend sharing AI-generated artwork, or been impressed yourself when chatting with ChatGPT. Even for AI engineers in the field, these leaps and bounds in performance have been surprising. In case you somehow did miss all of this, check out an example of a text-to-image model. As an example, we gave the prompt “A spaceship lands in San Francisco” to the website: https://playgroundai.com/create

The high quality of generated images, along with easy-to-use text or image inputs, has opened doors in the creative field to artists and the general public alike. The algorithm behind these achievements is called Stable Diffusion. What is diffusion? How does it work? In this blog post, we will introduce the diffusion algorithm and how text information is embedded in image generative models.

What is diffusion?

Diffusion description

The diffusion algorithm operates in two phases. The first phase iteratively adds noise to the input image over timesteps 1 to n until the image consists entirely of noise. The second phase reverses this process, iteratively denoising the noisy image until a completely clear image is produced.
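A convenient property of the forward phase is that the noisy image at any timestep t can be sampled in closed form directly from the clean image, without looping through every intermediate step. Here is a minimal NumPy sketch of that idea, using a linear noise schedule; the function and variable names are our own illustrative choices, not taken from any particular library:

```python
import numpy as np

def linear_beta_schedule(n_timesteps, beta_start=1e-4, beta_end=0.02):
    """Noise variance added at each timestep, increasing linearly."""
    return np.linspace(beta_start, beta_end, n_timesteps)

def q_sample(x0, t, alphas_cumprod, rng):
    """Sample the noised image x_t directly from the clean image x0.

    Uses the closed form:
        x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise
    where a_bar_t is the cumulative product of (1 - beta) up to step t.
    """
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

betas = linear_beta_schedule(1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))                   # a toy "image"
x_early = q_sample(x0, 10, alphas_cumprod, rng)    # still mostly signal
x_late = q_sample(x0, 999, alphas_cumprod, rng)    # almost pure noise
```

Because `alphas_cumprod` shrinks toward zero as t grows, early samples stay close to the original image while late samples are dominated by noise.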

To demonstrate the diffusion network, we replicated the HuggingFace implementation of the diffusion algorithm; example code is at https://github.com/wileyw/VideoGAN/blob/master/diffusion_study.ipynb.

A few notable steps of the HuggingFace implementation are:

  • The U-Net backbone architecture shown above. A U-Net is used to preserve information at various scales.
  • The position is encoded by the function: `SinusoidalPositionEmbeddings`
  • The attention module is utilized.
  • `p_sample` and `q_sample` implement the backward (denoising) and forward (noising) processes, respectively.
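To give a feel for the timestep encoding mentioned above, here is a minimal NumPy sketch in the spirit of `SinusoidalPositionEmbeddings` (the actual HuggingFace version is a PyTorch module; this is our own simplified rendering of the same idea):

```python
import numpy as np

def sinusoidal_embedding(timesteps, dim):
    """Map integer timesteps to dim-dimensional sin/cos features.

    Each channel pair oscillates at a different frequency, so the network
    can distinguish timesteps at multiple scales.
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/10000.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / (half - 1))
    args = np.asarray(timesteps, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = sinusoidal_embedding([0, 10, 999], dim=32)   # shape (3, 32)
```

Feeding this embedding into the U-Net at every layer tells the network how noisy its input currently is, so one set of weights can handle all timesteps.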

What’s the difference between regular diffusion and stable diffusion?

Stable diffusion description

Stable diffusion builds on top of the diffusion algorithm by allowing users to condition the final image based on a user prompt. In other words, you can use plain English as an input, and the model can give you back an image to match it.

Latent Space

One of the techniques that stable diffusion uses is to operate on an input in a lower-dimensional latent space instead of the original pixel space. Processing in this lower-dimensional latent space allows for a smaller and faster diffusion model.
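To get a sense of the savings, Stable Diffusion v1 runs its diffusion on a 64×64×4 latent produced by an autoencoder (8× spatial downsampling with 4 latent channels) rather than on the 512×512×3 pixel grid. A quick back-of-the-envelope calculation:

```python
# Pixel space: a 512x512 RGB image.
pixel_values = 512 * 512 * 3

# Latent space: the autoencoder downsamples 8x spatially
# and uses 4 channels.
latent_values = 64 * 64 * 4

ratio = pixel_values / latent_values
print(ratio)  # 48.0 -> the diffusion U-Net processes ~48x fewer values per step
```

Since each denoising step runs the full U-Net, and sampling takes many steps, this factor compounds into a large speed and memory win.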

Conditioning

Conditioning a stable diffusion model is the process of providing the model with a prompt and asking it to modify its output based on that prompt. Importantly, this conditioning can use networks or models from different domains. These prompts can be text or images, but also more abstract inputs such as semantic maps and other learned representations.
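In Stable Diffusion, this conditioning is wired into the U-Net through cross-attention: image features act as queries while the prompt’s embeddings supply keys and values. Below is a minimal single-head NumPy sketch of that mechanism; the real implementation additionally applies learned projection matrices to the queries, keys, and values, which we omit here for clarity:

```python
import numpy as np

def cross_attention(image_feats, text_embeds):
    """Single-head cross-attention: image features attend to text tokens.

    image_feats: (n_pixels, d) queries from the U-Net
    text_embeds: (n_tokens, d) keys/values from the text encoder
    """
    d = image_feats.shape[-1]
    scores = image_feats @ text_embeds.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ text_embeds                    # (n_pixels, d)

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 8))   # 16 "pixel" features, dim 8
txt = rng.standard_normal((5, 8))    # 5 prompt-token embeddings, dim 8
out = cross_attention(img, txt)      # each pixel gets a text-weighted mix
```

Because the keys and values come from a separate encoder, the same mechanism works for any conditioning signal that can be embedded as a sequence of vectors, which is what makes text, image, and semantic-map prompts interchangeable at this interface.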

Brief overview of the evolution of generative models

In terms of deep learning-based models, the most popular techniques a few years ago were Generative Adversarial Networks (GANs) and autoencoder-based methods such as Variational Autoencoders (VAEs). GANs pit a generative model against a discriminative model that tries to tell generated images from real ones. VAEs and other autoencoder methods compress the image content into a vector with an encoding network and train a decoder to recreate the image from that same vector. VAEs and their variants additionally attempt to control the distribution of the compressed vector representation.
Compared to diffusion, these earlier approaches had weaknesses in their output. VAEs alone struggled to produce sharp images due to the nature of the variational constraint, and GANs were notoriously finicky to train because of the adversarial setup. Various fixes were tried to sharpen VAE outputs, but it is the combination of an autoencoder latent space with diffusion, as in Stable Diffusion, that yields the sharp images we see today.

Resources


Deep Gan Team

We’re a team of Machine Learning Engineers exploring and researching deep learning technologies