
StyleGAN Variants

What you will learn in this chapter:

  • What is missing in Vanilla GAN
  • StyleGAN1 components and benefits
  • Drawback of StyleGAN1 and the need for StyleGAN2
  • Drawback of StyleGAN2 and the need for StyleGAN3
  • Use cases of StyleGAN

What is missing in Vanilla GAN

Generative Adversarial Networks (GANs) are a class of generative models that produce realistic images, but it is very evident that you don’t have any control over how the images are generated. In a Vanilla GAN, you have two networks: (i) a Generator and (ii) a Discriminator. The Discriminator takes an image as input and returns whether it is a real image or one synthetically generated by the Generator. The Generator takes a noise vector (generally sampled from a multivariate Gaussian) and tries to produce images that look similar, but not identical, to the training samples. Initially it will produce junk images, but in the long run the aim of the Generator is to fool the Discriminator into believing that the images it generates are real.
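To make the two roles concrete, here is a minimal, hypothetical PyTorch sketch of the two networks; the layer sizes, activations, and flattened 64×64 image shape are purely illustrative and not the architecture of any of the papers discussed here.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a noise vector z to a (flattened) fake image."""
    def __init__(self, z_dim=100, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Maps an image to a single 'real vs. fake' score."""
    def __init__(self, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),  # higher score = "looks real"
        )

    def forward(self, x):
        return self.net(x)

z = torch.randn(16, 100)               # noise sampled from a Gaussian
fake_images = Generator()(z)           # (16, 64*64*3)
scores = Discriminator()(fake_images)  # (16, 1)
```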

Consider a trained GAN, and let z1 and z2 be two noise vectors sampled from a Gaussian distribution and sent to the generator to produce images. Let us assume z1 gets converted to an image containing a male wearing glasses and z2 gets converted to an image containing a female without glasses. What if you need an image of a female wearing glasses? This kind of explicit control can’t be intuitively achieved with Vanilla GANs because the features are entangled (more on this below). Let that sink in; you will understand it better when you see what StyleGAN achieves.

TL;DR: StyleGAN is a special modification made to the architecture of the Generator alone, while the Discriminator remains the same. This modified Generator gives the user freedom to generate images as desired, providing control over both high-level features (pose, facial expression) and stochastic, low-level details (skin pores, local placement of hair, etc.). Apart from such flexible image-generating capabilities, over the years StyleGAN has been used for several downstream tasks such as privacy preservation and image editing.

StyleGAN 1 components and benefits

Architecture

Let us dive into the special components introduced in StyleGAN that give it the power described above. Don’t get intimidated by the figure above; it is one of the simplest yet most powerful ideas, and you can easily understand it.

As I already said, StyleGAN only modifies the Generator and the Discriminator remains the same, hence it is not shown above. Diagram (a) corresponds to the structure of ProgressiveGAN. ProgressiveGAN is just a Vanilla GAN, but instead of generating images at a fixed resolution, it progressively generates images of higher and higher resolution with the aim of producing realistic high-resolution images; i.e., block 1 of the generator produces an image of resolution 4 by 4, block 2 produces an image of resolution 8 by 8, and so on. Diagram (b) is the proposed StyleGAN architecture. It has the following main components:

  1. A mapping network
  2. AdaIN (Adaptive Instance Normalisation)
  3. Concatenation of Noise vector

Let’s break it down one by one.

Mapping Network

Instead of passing the latent code (also known as the noise vector) z directly to the generator, as done in traditional GANs, it is first mapped to w by a series of 8 MLP layers. The produced latent code w is not just passed as input to the first layer of the Generator, as in ProgressiveGAN; rather, it is passed to each block of the Generator network (in StyleGAN terms, the Synthesis Network). There are two major ideas here:

  • Mapping the latent code from z to w disentangles the feature space. By disentanglement we mean that, in a latent code of dimension 512, if you change just one of its feature values (say, out of 512 values, you just increase or decrease the 4th value), then ideally, in a disentangled feature space, only one real-world feature should change. If the 4th feature value corresponds to the real-world feature ‘smile’, then changing the 4th value of the 512-dimensional latent code should generate images that are smiling/not smiling/something in between.
  • Passing the latent code to each layer has a profound effect on the kind of real-world features controlled. For instance, passing the latent code w to the lower blocks of the synthesis network controls high-level aspects such as pose, general hairstyle, face shape, and eyeglasses, while passing w to the higher-resolution blocks of the synthesis network controls smaller-scale facial features, finer hairstyle details, eyes open/closed, etc. A minimal sketch of the mapping network is shown after this list.
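Below is a minimal sketch of such a mapping network, assuming 512-dimensional z and w and 8 fully connected layers as described in the StyleGAN paper; the activation choice and the input normalization shown here are illustrative simplifications.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps the latent code z to the intermediate latent code w via 8 MLP layers."""
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers, in_dim = [], z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.mapping = nn.Sequential(*layers)

    def forward(self, z):
        # Normalize z, then map it; the resulting w is later broadcast
        # to every block of the synthesis network.
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()
        return self.mapping(z)

w = MappingNetwork()(torch.randn(4, 512))   # shape (4, 512)
```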

Adaptive instance normalisation (AdaIN)

Adaptive instance normalisation

AdaIN modifies Instance Normalization by allowing the normalization parameters (mean and standard deviation) to be dynamically adjusted based on style information from a separate source. This style information is often derived from the latent code w.

In StyleGAN, the latent code w is not passed directly to the synthesis network; instead, a learned affine transformation of w, denoted y, is passed to the different blocks. y is called the ‘style’ representation. AdaIN then normalizes each feature map $x_i$ and re-scales it with the style:

$$\text{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

Here, $y_{s,i}$ and $y_{b,i}$ are the per-channel scale and bias components of the style representation y, and $\mu(x_i)$ and $\sigma(x_i)$ are the mean and standard deviation of the feature map $x_i$.

AdaIN enables the generator to dynamically modulate its behavior during the generation process. This is particularly useful in scenarios where different parts of the generated output require different styles or characteristics.
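A minimal sketch of AdaIN is shown below, assuming the per-channel scale $y_s$ and bias $y_b$ come from a small learned affine layer applied to w; the module and parameter names here are made up for illustration.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Normalize each feature map, then re-scale and re-shift it with a style derived from w."""
    def __init__(self, w_dim, num_channels):
        super().__init__()
        self.instance_norm = nn.InstanceNorm2d(num_channels)
        self.to_style = nn.Linear(w_dim, 2 * num_channels)  # the learned affine transform

    def forward(self, x, w):
        style = self.to_style(w)               # (batch, 2 * channels)
        y_s, y_b = style.chunk(2, dim=1)       # per-channel scale and bias
        y_s = y_s[:, :, None, None]            # reshape to (batch, channels, 1, 1)
        y_b = y_b[:, :, None, None]
        return y_s * self.instance_norm(x) + y_b

x = torch.randn(4, 256, 16, 16)                  # feature maps from a synthesis block
w = torch.randn(4, 512)                          # intermediate latent code
out = AdaIN(w_dim=512, num_channels=256)(x, w)   # same shape as x
```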

Concatenation of Noise vector

In a traditional GAN, the generator has to learn stochastic features on its own. By stochastic features, I mean those minuscule yet important fine details, like the positions of hairs and skin pores, which should vary from one generated image to another and should not remain constant. Without any explicit structure for this, the traditional generator has a difficult task, because it needs to introduce this pixel-level randomness from one layer to another all on its own, and it often fails to produce a diverse set of such stochastic features.

Instead, in StyleGAN, the authors hypothesize that by adding a noise map to the feature map in each block of the synthesis network (also known as the generator), each layer can make use of this information to produce diverse stochastic detail without trying to do it all on its own, as in traditional GANs. This turned out to work well. A minimal sketch of this noise injection is shown below.
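The sketch below assumes a single learned per-channel scaling factor applied to a fresh per-pixel noise map; the names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds per-pixel Gaussian noise, scaled by a learned per-channel weight."""
    def __init__(self, num_channels):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x, noise=None):
        if noise is None:
            batch, _, height, width = x.shape
            # One noise map per image, broadcast across all channels
            noise = torch.randn(batch, 1, height, width, device=x.device)
        return x + self.weight * noise

x = torch.randn(4, 256, 16, 16)
out = NoiseInjection(256)(x)   # fresh noise => different stochastic detail every call
```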

Example for noise

Drawbacks of StyleGAN1 and the need for StyleGAN2

StyleGAN yields state-of-the-art results in data-driven unconditional generative image modeling. Still, a few issues remained with its architecture design, which are dealt with in the next version, StyleGAN2.

To keep this chapter readable, we avoid going into the details of the architecture and instead describe the characteristic artifacts found in the first version and how the quality was further improved.

There are two major artifacts addressed in this paper: (i) common blob-like artifacts, and (ii) a strong location-preference artifact arising from the progressive-growing architecture.

(i) Fixing the blob-like artifact

Blob artifact

You can see the blob structure in the image above, which the authors claim originates from the normalisation process of StyleGAN1. Hence, (d) below shows the proposed architecture that overcame the issue.

Demodulation
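For the curious, here is a rough sketch of the idea behind the fix shown in (d), weight demodulation: the style scales the convolution weights directly, and the weights are then rescaled so that each output channel has roughly unit variance, removing the per-feature-map normalization that caused the blobs. The shapes and function name below are illustrative, not the official implementation.

```python
import torch

def modulate_demodulate(weight, style, eps=1e-8):
    """weight: (out_ch, in_ch, k, k) conv weights; style: (batch, in_ch) per-channel scales."""
    # Modulate: scale the input channels of the weights with the style
    w = weight.unsqueeze(0) * style[:, None, :, None, None]   # (batch, out, in, k, k)
    # Demodulate: rescale each output channel to roughly unit standard deviation
    demod = torch.rsqrt(w.pow(2).sum(dim=[2, 3, 4], keepdim=True) + eps)
    return w * demod

weight = torch.randn(128, 64, 3, 3)
style = torch.randn(4, 64)
w_prime = modulate_demodulate(weight, style)   # (4, 128, 64, 3, 3)
```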

(ii) Fixing the strong location-preference artifact in the ProgressiveGAN structure

Phase Artifact

In the figure above, each image is obtained by interpolating the latent code w to modulate the pose. This leads to quite unrealistic images despite their high visual quality.

A skip generator and a residual discriminator were used to overcome this issue, without progressive growing.

There are also other changes introduced in StyleGAN2, but the above two are the most important to know first.

Drawbacks of StyleGAN2 and the need for StyleGAN3

The same authors of StyleGAN2 found that the synthesis network depends on absolute pixel coordinates in an unhealthy manner. This leads to a phenomenon called aliasing.

Animation of aliasing

Above, the animation is generated by interpolating the latent code w. You can clearly see that in the left image the texture pixels are, in a way, fixed to their locations and only the high-level attributes (face pose/expression) change. This exposes the artificiality when generating such animations. StyleGAN3 tackles this problem from the ground up, and you can see the results in the animation on the right.
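Such animations are typically produced by walking through the latent space, for example by linearly interpolating between two intermediate latent codes and generating one frame per step. A hedged sketch, assuming hypothetical `mapping` and `synthesis` networks stand in for a trained StyleGAN:

```python
import torch

def interpolate_w(w1, w2, num_frames=60):
    """Linearly interpolate between two intermediate latent codes w1 and w2."""
    alphas = torch.linspace(0.0, 1.0, num_frames)
    return [(1 - a) * w1 + a * w2 for a in alphas]

# Hypothetical usage with a trained model:
# w1 = mapping(torch.randn(1, 512))
# w2 = mapping(torch.randn(1, 512))
# frames = [synthesis(w) for w in interpolate_w(w1, w2)]
```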

Use Cases

StyleGAN’s ability to generate photorealistic images has opened doors for diverse applications, including image editing, preserving privacy, and even creative exploration.

Image Editing

  • Image inpainting: Filling in missing image regions in a seamless and realistic manner.
  • Image style transfer: Transferring the style of one image to another.

Privacy-preserving applications

  • Generating synthetic data: Replacing sensitive information with realistic synthetic data for training and testing purposes.
  • Anonymizing images: Blurring or altering identifiable features in images to protect individuals’ privacy.

Creative explorations

  • Generating fashion designs: StyleGAN can be used to generate realistic and diverse fashion designs.
  • Creating immersive experiences: StyleGAN can be used to create realistic virtual environments for gaming, education, and other applications. For instance, StyleNeRF: a style-based 3D-aware generator for high-resolution image synthesis.

This is a non-exhaustive list.

