Making SinGAN Double

Deep Gan Team
6 min read · Apr 16, 2020

Wiley Wang and Mike Wang

Intro

Generative Adversarial Networks (GANs) are a class of deep learning models that fascinate even seasoned deep learning researchers and engineers. They draw mainstream attention through applications such as deepfake images and videos, aging apps, and beautification apps. Seeing completely generated images of humans is quite an astonishing experience. GANs have become so successful that their hard-to-distinguish synthetic images and voices are posing problems in computer ethics and security.

Most of these applications, though, require large amounts of training data, and several efforts have been made to reduce that dependency. A recent paper, which won the Best Paper Award at ICCV 2019, relies on only a single image to generate realistic synthesized images. The architecture of SinGAN is both elegant and simple. In this blog post, we will walk you through the key takeaways of SinGAN. At the end, as a fun diversion, we make some quick changes to show what happens if you opt to train on two images instead of one, a variant we call DoubleGAN.

SinGAN

How do you capture the essence of a single image? The architecture of SinGAN has to be rigid enough, for example, to keep a tree whole as a tree, yet flexible enough not to lock in the exact arrangement of its leaves. SinGAN achieves this by constructing a multi-scale architecture and keeping the network at each level simple and effective.

Multi-Scale Architecture

The SinGAN model consists of a pyramid of generators. At the base level, which is also the coarsest level, the network's effective patch size is close to ½ of the image size, making this level best suited to capture the structural information in the image. Every level above shares exactly the same network structure. This simple construction is also quite powerful.
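To make this concrete, here is a minimal sketch of how such an image pyramid can be built by repeatedly downsampling the training image. The helper name and its defaults are our own, not the official repository's; the paper's default scale factor is roughly 4/3.

import torch.nn.functional as F

def build_pyramid(image, num_scales, r=4/3):
    """Downsample `image` (a 1xCxHxW tensor) into a coarsest-first
    list of scales, each level smaller than the next by a factor of r."""
    _, _, h, w = image.shape
    pyramid = []
    for n in range(num_scales):
        s = r ** (num_scales - 1 - n)  # total downscale factor at level n
        size = (max(1, round(h / s)), max(1, round(w / s)))
        pyramid.append(F.interpolate(image, size=size,
                                     mode='bilinear', align_corners=False))
    return pyramid  # pyramid[0] is the coarsest level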

From one level to the next, the resolution is upsampled by a factor r. To learn the generator G_n, the upsampled result of the previous (coarser) generator is fed in and added back to the output, so that G_n only has to learn the residual details.
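In the paper's notation (levels run from N at the coarsest to 0 at the finest), the generated sample at level n is

x̃_n = G_n(z_n, (x̃_{n+1})↑r) = (x̃_{n+1})↑r + ψ_n(z_n + (x̃_{n+1})↑r)

where ↑r denotes upsampling by a factor of r, z_n is the noise map at level n, and ψ_n is the fully convolutional network at that level. The additive form makes it explicit that ψ_n only has to produce the residual detail.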

Repeating this upsampling process, the generators get finer and finer, filling in the final details of the image. Since the network is fully convolutional, each level roughly learns the patterns at its own effective patch size. The discriminators, in contrast, receive no information from their coarser levels; each one acts independently at its own scale.

Implementation Takeaways

In this blog post, we re-implemented some of the key parts of SinGAN. Nothing is more effective for learning code than writing it yourself. We'd like to share some of these key elements with you.

Training Loop

The SinGAN Generator accepts an input tensor (an image + noise) and outputs a new generated image. Adding noise to the input image allows the algorithm to modify the original input. The Discriminator learns to classify whether a generated image patch is real or fake. SinGAN trains the networks in coarse-to-fine order: the coarsest Generator and Discriminator are trained to completion before training on the next pyramid level begins.
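As a rough sketch, the schedule looks like the following. The helper names (make_generator, make_discriminator, train_single_scale) are placeholders of ours, not the official API, and build_pyramid is the sketch from earlier.

# Coarse-to-fine schedule: each scale is trained to completion,
# then frozen, before the next finer scale begins.
generators = []
pyramid = build_pyramid(real_image, num_scales)  # real_image: 1xCxHxW tensor

for n, real in enumerate(pyramid):
    G_n = make_generator()      # fresh networks for this scale
    D_n = make_discriminator()
    # The frozen coarser generators supply the upsampled image
    # that G_n refines with residual detail.
    train_single_scale(G_n, D_n, real, frozen_coarser=generators)
    for p in G_n.parameters():
        p.requires_grad_(False)  # freeze before moving up a level
    generators.append(G_n)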

Network Architecture

The neural networks used in SinGAN are simple and straightforward. They are fully convolutional, built from basic units called ConvBlocks that are used to construct both the Discriminator and the Generator networks.

ConvBlock

Because ConvBlocks are fully convolutional, the networks can process (and generate) images of arbitrary size. A ConvBlock consists of a 3x3 convolution → BatchNorm2d → LeakyReLU.
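Here is a minimal PyTorch sketch of such a block; the exact padding and the LeakyReLU slope are our assumptions, and the official code may differ slightly.

import torch.nn as nn

class ConvBlock(nn.Sequential):
    # 3x3 convolution -> BatchNorm2d -> LeakyReLU
    def __init__(self, in_channels, out_channels):
        super().__init__(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.2, inplace=True),
        )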

Generator

The Generator consists of 1 Head → 5 ConvBlocks → 1 Tail → 1 Tanh, as sketched below. Multiple generators form the multi-scale structure. The receptive field of each generator is carefully calculated; at the coarsest level, for example, the effective patch size is about half the size of the image, so that it captures the image's global structure.
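A sketch of one level's generator, following the layout above and reusing the ConvBlock sketch (the channel count is our assumption):

import torch.nn as nn

class Generator(nn.Module):
    # Head -> 5 ConvBlocks -> Tail -> Tanh, plus a residual
    # connection to the upsampled output of the coarser level.
    def __init__(self, channels=32):
        super().__init__()
        self.head = ConvBlock(3, channels)
        self.body = nn.Sequential(*[ConvBlock(channels, channels)
                                    for _ in range(5)])
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, noise, prev_upsampled):
        # noise and prev_upsampled are both 1x3xHxW; the network only
        # learns the residual, and the coarser image is added back.
        x = self.head(noise + prev_upsampled)
        x = self.body(x)
        x = self.tail(x)
        return x + prev_upsampled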

Discriminator

In contrast to the generators, the discriminators at different scales are not connected but act as individual networks. Each consists of 1 Head → 5 ConvBlocks → 1 Tail → 1 Mean layer, as sketched below.
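A matching sketch of one level's discriminator (again with an assumed channel count). The Mean layer is simply the mean over the output map, which we show in the Cool Trick section below, so the forward pass here returns the full map of patch scores:

import torch.nn as nn

class Discriminator(nn.Module):
    # Head -> 5 ConvBlocks -> Tail; outputs a per-patch score map.
    def __init__(self, channels=32):
        super().__init__()
        self.head = ConvBlock(3, channels)
        self.body = nn.Sequential(*[ConvBlock(channels, channels)
                                    for _ in range(5)])
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return self.tail(self.body(self.head(x)))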

Head and Tail layers

The Head and Tail layers are simply convolutions that expand or reduce the number of input feature channels to the expected number of output feature channels, as seen in the sketches above.

Cool Trick: Size-Agnostic Networks

Traditionally, a discriminator network produces a single value as its output, where 0.0 means the image is fake and 1.0 means it is real. The traditional way to produce this single value is to attach a fully-connected layer to the end of the network.

In contrast, the authors’ code simply aggregates the network output with the mean() function to produce a single value, e.g. “loss = output.mean()”. Backpropagation is then performed as usual with “loss.backward(retain_graph=True)”. Because mean() collapses an output map of any shape into one scalar, the Discriminator network can run on any input image size. This trick makes writing code for neural networks simpler and more modular.
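Using the discriminator sketch from above, the trick looks like this; note that the input size is arbitrary:

import torch

D_n = Discriminator()                    # the sketch above
fake_image = torch.randn(1, 3, 97, 131)  # any H x W works
output = D_n(fake_image)                 # 1 x 1 x h' x w' map of patch scores
loss = output.mean()                     # collapsed to a single scalar
loss.backward(retain_graph=True)         # backprop as usual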

Does Double the GAN Give Double the Fun?

While we were implementing SinGAN, a natural curiosity led us to wonder: what if we train the multi-scale networks on not one but two images? We modified the training process, and voila, here are some psychedelic results of combining two images. As you can see, the networks have trouble deciding on the global structure, but they take on textures and characteristics of both images. Maybe this is how the Colosseum would look in a volcano!
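One simple way to make such a modification, shown here as a sketch of the idea rather than a line-for-line account of our notebook, is to alternate which image's pyramid supplies the "real" examples at each training step:

import random

# Two pyramids, one per training image, using the placeholder
# helpers from the sketches above; n is the current pyramid level.
pyramids = [build_pyramid(image_a, num_scales),
            build_pyramid(image_b, num_scales)]

for step in range(num_steps):
    real = random.choice(pyramids)[n]  # level-n patches from either image
    train_step(G_n, D_n, real)         # hypothetical per-step update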

DoubleGAN Experimental Results

Figure 1A: As an example, we feed an image of a volcano and an image of the Colosseum to DoubleGAN.

Figure 1B: Here, we see the result of generating random samples after training on both the volcano and the Colosseum. The generated photos have components of both the volcano and the Colosseum mixed together.

Figure 2A: In another example, we feed two images of cats to DoubleGAN.

Figure 2B: Interestingly, we see that the GAN is unable to learn the higher-level structure of the images. It only captures lower-level details, such as the red coloring and patches of fur.

Colab Implementation

In the Colab notebook, we extracted the key functions to help readers understand the code. Our Colab implementation of DoubleGAN, which builds upon the original SinGAN code, is located here.

References

[1] Official SinGAN Repository

[2] Official SinGAN Paper

[3] Official SinGAN Supplemental Material

[4] Our Double GAN Colab Notebook

[5] Cat Image 1

[6] Cat Image 2


Deep Gan Team

We’re a team of Machine Learning Engineers exploring and researching deep learning technologies.