This article is about StyleGAN, one of the best-known GANs to date, introduced in the paper A Style-Based Generator Architecture for Generative Adversarial Networks. We will build a clean, simple, and readable implementation of it using PyTorch and try to replicate the original paper as closely as possible, so if you have read the paper, the implementation should feel pretty much identical.
The dataset that we will use in this blog post is this dataset from Kaggle, which contains 16,240 images of women's upper-body clothing at 256x192 resolution.
Before you dive into working with StyleGAN using PyTorch, make sure you have the following prerequisites:
- Basic knowledge of deep learning: an understanding of convolutional neural networks (CNNs) and familiarity with Generative Adversarial Networks (GANs), including concepts like the generator, the discriminator, and the adversarial loss.
- Hardware requirements: a powerful GPU (NVIDIA recommended) for faster training and inference, with the CUDA toolkit installed for GPU acceleration (CUDA and cuDNN).
- Familiarity with StyleGAN: it's helpful to have read the original StyleGAN or StyleGAN2 papers to understand the architecture improvements and key concepts.
We first import torch, since we will use PyTorch, and from it we import nn, which will help us create and train the networks, as well as optim, a package that implements various optimization algorithms (e.g. SGD, Adam, ...). From torchvision we import datasets and transforms to prepare the data and apply some transforms.
We also import functional as F from torch.nn to upsample the images using interpolate, DataLoader from torch.utils.data to create mini-batches, save_image from torchvision.utils to save some fake samples, and log2 from math because we need the base-2 logarithm to implement the adaptive minibatch size that depends on the output resolution. Finally, we import NumPy for linear algebra, os to interact with the operating system, tqdm to show progress bars, and matplotlib.pyplot to show the results and compare them with the real ones.
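Putting that together, the import section looks like this:

import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.utils import save_image
from math import log2
import numpy as np
import os
from tqdm import tqdm
import matplotlib.pyplot as plt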
- Initialize DATASET with the path to the real images.
- Specify that training starts at an image size of 8x8.
- Initialize the device to CUDA if it is available and to the CPU otherwise, and the learning rate to 0.001.
- The batch size differs depending on the resolution of the images that we want to generate, so we initialize BATCH_SIZES with a list of numbers; you can change them depending on your VRAM.
- Initialize image_size to 128 and CHANNELS_IMG to 3, because we will generate 128x128 RGB images.
- In the original paper, Z_DIM, W_DIM, and IN_CHANNELS are all initialized to 512, but I initialize them to 256 instead for lower VRAM usage and faster training. We could perhaps even get better results if we doubled them.
- For StyleGAN we can use any GAN loss function we want, so I use WGAN-GP from the paper Improved Training of Wasserstein GANs. This loss contains a parameter named λ, and it is common to set λ = 10.
- Initialize PROGRESSIVE_EPOCHS to 30 epochs for each image size.
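Collecting these choices into one place, the configuration could look like the following sketch (the exact BATCH_SIZES values and the names START_TRAIN_AT_IMG_SIZE, LEARNING_RATE, and LAMBDA_GP are illustrative choices, not fixed by the paper):

DATASET = "path/to/dataset"                 # folder containing the real images
START_TRAIN_AT_IMG_SIZE = 8                 # we start training at 8x8
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LEARNING_RATE = 1e-3
BATCH_SIZES = [256, 256, 128, 64, 32, 16]   # per resolution, adjust to your VRAM
image_size = 128
CHANNELS_IMG = 3
Z_DIM = 256                                 # 512 in the original paper
W_DIM = 256
IN_CHANNELS = 256
LAMBDA_GP = 10                              # λ for the WGAN-GP gradient penalty
PROGRESSIVE_EPOCHS = [30] * len(BATCH_SIZES)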
Now let’s create a function get_loader to:
- Apply some transformations to the images (resize them to the resolution that we want, convert them to tensors, apply some augmentation, and finally normalize them so that all pixel values range from -1 to 1).
- Identify the current batch size using the list BATCH_SIZES, indexed by the integer value of log2(image_size / 4). This is how we implement the adaptive minibatch size that depends on the output resolution.
- Prepare the dataset using ImageFolder, because the data is already structured in a convenient way.
- Create mini-batches using DataLoader, which takes the dataset and the batch size and shuffles the data.
- Finally, return the loader and the dataset.
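A sketch of get_loader along these lines (the horizontal flip is just one possible augmentation):

def get_loader(image_size):
    transform = transforms.Compose(
        [
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
            transforms.RandomHorizontalFlip(p=0.5),   # simple augmentation choice
            transforms.Normalize(
                [0.5 for _ in range(CHANNELS_IMG)],
                [0.5 for _ in range(CHANNELS_IMG)],
            ),
        ]
    )
    # adaptive batch size: index BATCH_SIZES by log2(image_size / 4)
    batch_size = BATCH_SIZES[int(log2(image_size / 4))]
    dataset = datasets.ImageFolder(root=DATASET, transform=transform)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return loader, dataset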
Now let's implement the StyleGAN1 generator and discriminator (ProGAN and StyleGAN1 have the same discriminator architecture) with the key contributions from the paper. We will try to make the implementation compact but also keep it readable and understandable. Specifically, the key points are:
- Noise Mapping Network
- Adaptive Instance Normalization (AdaIN)
- Progressive growing
In this tutorial, we will just generate images with StyleGAN1, and not implement style mixing and stochastic variation, but it shouldn’t be hard to do so.
Let's define a variable named factors that contains the multipliers applied to IN_CHANNELS to get the number of channels that we want at each image resolution.
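One reasonable set of values, modeled on the ProGAN channel schedule (keeping the full channel count at low resolutions and halving it as the resolution grows), is:

factors = [1, 1, 1, 1, 1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 32]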
The noise mapping network takes z and puts it through eight fully connected layers separated by activation functions. And don't forget to equalize the learning rate, as the authors do in ProGAN (ProGAN and StyleGAN were written by the same researchers).
Let's first build a class named WSLinear (weighted scaled linear), which inherits from nn.Module.
- In the init part we send in_features and out_features. We create a linear layer, then define a scale equal to the square root of 2 divided by in_features. We copy the bias of the linear layer into a separate variable, because we don't want the bias to be scaled, and then remove it from the layer. Finally, we initialize the linear layer's weights.
- In the forward part, we send x, and all we do is multiply x by the scale, pass it through the linear layer, and add the bias back.
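A minimal sketch of WSLinear along those lines:

class WSLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super(WSLinear, self).__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.scale = (2 / in_features) ** 0.5      # sqrt(2 / fan_in)
        self.bias = self.linear.bias               # keep the bias unscaled
        self.linear.bias = None
        # initialize the linear layer
        nn.init.normal_(self.linear.weight)
        nn.init.zeros_(self.bias)

    def forward(self, x):
        return self.linear(x * self.scale) + self.bias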
Now let’s create the MappingNetwork class.
- In the init part we send z_dim and w_dim, and we define the mapping network, which first normalizes z, followed by eight WSLinear layers with ReLU as the activation function in between.
- In the forward part, we simply pass the input through the mapping network and return the result.
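A sketch of the mapping network (PixelNorm, which normalizes z, is defined later in the article):

class MappingNetwork(nn.Module):
    def __init__(self, z_dim, w_dim):
        super().__init__()
        self.mapping = nn.Sequential(
            PixelNorm(),                 # normalize z before the mapping network
            WSLinear(z_dim, w_dim),
            nn.ReLU(),
            WSLinear(w_dim, w_dim),
            nn.ReLU(),
            WSLinear(w_dim, w_dim),
            nn.ReLU(),
            WSLinear(w_dim, w_dim),
            nn.ReLU(),
            WSLinear(w_dim, w_dim),
            nn.ReLU(),
            WSLinear(w_dim, w_dim),
            nn.ReLU(),
            WSLinear(w_dim, w_dim),
            nn.ReLU(),
            WSLinear(w_dim, w_dim),
        )

    def forward(self, x):
        return self.mapping(x)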

Now let's create the AdaIN class.
- In the init part we send channels and w_dim, and we initialize instance_norm, which is the instance normalization part, as well as style_scale and style_bias, which are the adaptive parts: WSLinear layers that map the intermediate vector w from the noise mapping network into channels.
- In the forward part, we send x and w, apply instance normalization to x, and return style_scale * x + style_bias.
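A sketch of the AdaIN class:

class AdaIN(nn.Module):
    def __init__(self, channels, w_dim):
        super().__init__()
        self.instance_norm = nn.InstanceNorm2d(channels)
        self.style_scale = WSLinear(w_dim, channels)
        self.style_bias = WSLinear(w_dim, channels)

    def forward(self, x, w):
        x = self.instance_norm(x)
        style_scale = self.style_scale(w).unsqueeze(2).unsqueeze(3)   # (N, C, 1, 1)
        style_bias = self.style_bias(w).unsqueeze(2).unsqueeze(3)
        return style_scale * x + style_bias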

Now let's create the class InjectNoise to inject the noise into the generator.
- In the init part we send channels, and we initialize weight from a random normal distribution, using nn.Parameter so that these weights can be optimized.
- In the forward part, we send an image x and return it with random noise added.
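A sketch of InjectNoise (the per-channel weights are drawn from a standard normal distribution, as described above):

class InjectNoise(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # one learned scaling weight per channel, trainable via nn.Parameter
        self.weight = nn.Parameter(torch.randn(1, channels, 1, 1))

    def forward(self, x):
        noise = torch.randn((x.shape[0], 1, x.shape[2], x.shape[3]), device=x.device)
        return x + self.weight * noise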
The authors built StyleGAN upon the official implementation of ProGAN by Karras et al.; they use the same discriminator architecture, adaptive minibatch size, hyperparameters, etc. So a lot of classes stay the same as in the ProGAN implementation.
In this section, we will create the classes that do not change from the ProGAN architecture.
In the code snippet below you can find the class WSConv2d (weighted scaled convolutional layer), which applies the equalized learning rate to the conv layers.
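A sketch of such a layer, assuming the same sqrt(gain / fan_in) runtime scaling as in ProGAN:

class WSConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, gain=2):
        super(WSConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.scale = (gain / (in_channels * (kernel_size ** 2))) ** 0.5
        self.bias = self.conv.bias          # keep the bias unscaled
        self.conv.bias = None
        # initialize the conv layer
        nn.init.normal_(self.conv.weight)
        nn.init.zeros_(self.bias)

    def forward(self, x):
        return self.conv(x * self.scale) + self.bias.view(1, self.bias.shape[0], 1, 1)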
In the code snippet below you can find the class PixelNorm to normalize Z before the Noise Mapping Network.
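A sketch of PixelNorm:

class PixelNorm(nn.Module):
    def __init__(self):
        super(PixelNorm, self).__init__()
        self.epsilon = 1e-8

    def forward(self, x):
        # normalize each feature vector to unit length (plus epsilon for stability)
        return x / torch.sqrt(torch.mean(x ** 2, dim=1, keepdim=True) + self.epsilon)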
In the code snippet below you can find the class ConvBlock, which will help us build the discriminator.
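A sketch of ConvBlock, two weighted scaled conv layers with LeakyReLU activations:

class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(ConvBlock, self).__init__()
        self.conv1 = WSConv2d(in_channels, out_channels)
        self.conv2 = WSConv2d(out_channels, out_channels)
        self.leaky = nn.LeakyReLU(0.2)

    def forward(self, x):
        x = self.leaky(self.conv1(x))
        x = self.leaky(self.conv2(x))
        return x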
In the code snippet below you can find the class Discriminator, which is the same as in ProGAN.
class Discriminator(nn.Module):
    def __init__(self, in_channels, img_channels=3):
        super(Discriminator, self).__init__()
        self.prog_blocks, self.rgb_layers = nn.ModuleList([]), nn.ModuleList([])
        self.leaky = nn.LeakyReLU(0.2)

        for i in range(len(factors) - 1, 0, -1):
            conv_in = int(in_channels * factors[i])
            conv_out = int(in_channels * factors[i - 1])
            self.prog_blocks.append(ConvBlock(conv_in, conv_out))
            self.rgb_layers.append(
                WSConv2d(img_channels, conv_in, kernel_size=1, stride=1, padding=0)
            )

        self.initial_rgb = WSConv2d(
            img_channels, in_channels, kernel_size=1, stride=1, padding=0
        )
        self.rgb_layers.append(self.initial_rgb)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

        self.final_block = nn.Sequential(
            WSConv2d(in_channels + 1, in_channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            WSConv2d(in_channels, in_channels, kernel_size=4, padding=0, stride=1),
            nn.LeakyReLU(0.2),
            WSConv2d(in_channels, 1, kernel_size=1, padding=0, stride=1),
        )

    def fade_in(self, alpha, downscaled, out):
        """Used to fade in downscaled using avg pooling and output from CNN"""
        return alpha * out + (1 - alpha) * downscaled

    def minibatch_std(self, x):
        batch_statistics = (
            torch.std(x, dim=0).mean().repeat(x.shape[0], 1, x.shape[2], x.shape[3])
        )
        return torch.cat([x, batch_statistics], dim=1)

    def forward(self, x, alpha, steps):
        cur_step = len(self.prog_blocks) - steps
        out = self.leaky(self.rgb_layers[cur_step](x))

        if steps == 0:
            out = self.minibatch_std(out)
            return self.final_block(out).view(out.shape[0], -1)

        downscaled = self.leaky(self.rgb_layers[cur_step + 1](self.avg_pool(x)))
        out = self.avg_pool(self.prog_blocks[cur_step](out))
        out = self.fade_in(alpha, downscaled, out)

        for step in range(cur_step + 1, len(self.prog_blocks)):
            out = self.prog_blocks[step](out)
            out = self.avg_pool(out)

        out = self.minibatch_std(out)
        return self.final_block(out).view(out.shape[0], -1)
In the generator architecture, we have a pattern that repeats, so let's first create a class for it to keep our code as clean as possible; let's name the class GenBlock, which inherits from nn.Module.
- In the init part we send in_channels, out_channels, and w_dim, then we initialize conv1 with a WSConv2d that maps in_channels to out_channels, conv2 with a WSConv2d that maps out_channels to out_channels, leaky with LeakyReLU with a slope of 0.2 as they use in the paper, inject_noise1 and inject_noise2 with InjectNoise, and adain1 and adain2 with AdaIN.
- In the forward part, we send x and w; we pass x through conv1, then through inject_noise1 with leaky, and normalize it with adain1; then we pass that through conv2, inject_noise2 with leaky, and normalize it with adain2. Finally, we return x.
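A sketch of GenBlock following that description:

class GenBlock(nn.Module):
    def __init__(self, in_channels, out_channels, w_dim):
        super(GenBlock, self).__init__()
        self.conv1 = WSConv2d(in_channels, out_channels)
        self.conv2 = WSConv2d(out_channels, out_channels)
        self.leaky = nn.LeakyReLU(0.2, inplace=True)
        self.inject_noise1 = InjectNoise(out_channels)
        self.inject_noise2 = InjectNoise(out_channels)
        self.adain1 = AdaIN(out_channels, w_dim)
        self.adain2 = AdaIN(out_channels, w_dim)

    def forward(self, x, w):
        x = self.adain1(self.leaky(self.inject_noise1(self.conv1(x))), w)
        x = self.adain2(self.leaky(self.inject_noise2(self.conv2(x))), w)
        return x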
Now we have all that we need to create the generator.

- In the init part, we initialize starting_constant with a constant 4x4 (x 512 channels in the original paper, 256 in our case) tensor that is put through the first iteration of the generator; map with MappingNetwork; initial_adain1 and initial_adain2 with AdaIN; initial_noise1 and initial_noise2 with InjectNoise; initial_conv with a conv layer that maps in_channels to itself; leaky with LeakyReLU with a slope of 0.2; initial_rgb with a WSConv2d that maps in_channels to img_channels, which is 3 for RGB; prog_blocks with a ModuleList() that will contain all the progressive blocks (we obtain each block's input/output channels by multiplying in_channels, which is 512 in the paper and 256 in our case, by the entries of factors); and rgb_layers with a ModuleList() that will contain all the RGB layers.
- To fade in new layers (an original component of ProGAN), we add the fade_in method, to which we send alpha, upscaled, and generated, and we return tanh(alpha * generated + (1 - alpha) * upscaled). The reason we use tanh is that this is the output (the generated image), and we want its pixels to range between -1 and 1.
- In the forward part, we send the noise (of dimension Z_DIM), the alpha value that fades in slowly during training (alpha is between 0 and 1), and steps, the index of the current resolution that we are working with. We pass the noise through map to get the intermediate vector w, pass starting_constant through initial_noise1, apply initial_adain1 to it and w, then pass it into initial_conv, add initial_noise2 with leaky as the activation function, and apply initial_adain2 to it and w. Then we check whether steps == 0; if it is, all we want to do is run the result through the initial RGB layer and we are done. Otherwise, we loop over the number of steps, and in each iteration we upscale (upscaled) and run the result through the progressive block that corresponds to that resolution (out). In the end, we return fade_in, which takes alpha, final_out, and final_upscaled after mapping both to RGB.
class Generator(nn.Module):
    def __init__(self, z_dim, w_dim, in_channels, img_channels=3):
        super(Generator, self).__init__()
        self.starting_constant = nn.Parameter(torch.ones((1, in_channels, 4, 4)))
        self.map = MappingNetwork(z_dim, w_dim)
        self.initial_adain1 = AdaIN(in_channels, w_dim)
        self.initial_adain2 = AdaIN(in_channels, w_dim)
        self.initial_noise1 = InjectNoise(in_channels)
        self.initial_noise2 = InjectNoise(in_channels)
        self.initial_conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
        self.leaky = nn.LeakyReLU(0.2, inplace=True)

        self.initial_rgb = WSConv2d(
            in_channels, img_channels, kernel_size=1, stride=1, padding=0
        )
        self.prog_blocks, self.rgb_layers = (
            nn.ModuleList([]),
            nn.ModuleList([self.initial_rgb]),
        )

        for i in range(len(factors) - 1):
            conv_in_c = int(in_channels * factors[i])
            conv_out_c = int(in_channels * factors[i + 1])
            self.prog_blocks.append(GenBlock(conv_in_c, conv_out_c, w_dim))
            self.rgb_layers.append(
                WSConv2d(conv_out_c, img_channels, kernel_size=1, stride=1, padding=0)
            )

    def fade_in(self, alpha, upscaled, generated):
        return torch.tanh(alpha * generated + (1 - alpha) * upscaled)

    def forward(self, noise, alpha, steps):
        w = self.map(noise)
        x = self.initial_adain1(self.initial_noise1(self.starting_constant), w)
        x = self.initial_conv(x)
        out = self.initial_adain2(self.leaky(self.initial_noise2(x)), w)

        if steps == 0:
            return self.initial_rgb(x)

        for step in range(steps):
            upscaled = F.interpolate(out, scale_factor=2, mode="bilinear")
            out = self.prog_blocks[step](upscaled, w)

        final_upscaled = self.rgb_layers[steps - 1](upscaled)
        final_out = self.rgb_layers[steps](out)
        return self.fade_in(alpha, final_upscaled, final_out)
In the code snippet below you can find the generate_examples function, which takes the generator gen, the number of steps to identify the current resolution, and a number n=100. The goal of this function is to generate n fake images and save them to disk as a result.
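A sketch of generate_examples (the output directory name saved_examples is an arbitrary choice):

def generate_examples(gen, steps, n=100):
    gen.eval()
    alpha = 1.0
    for i in range(n):
        with torch.no_grad():
            noise = torch.randn(1, Z_DIM).to(DEVICE)
            img = gen(noise, alpha, steps)
            if not os.path.exists(f"saved_examples/step{steps}"):
                os.makedirs(f"saved_examples/step{steps}")
            # un-normalize from [-1, 1] back to [0, 1] before saving
            save_image(img * 0.5 + 0.5, f"saved_examples/step{steps}/img_{i}.png")
    gen.train()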
In the code snippet below you can find the gradient_penalty function for WGAN-GP loss.
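A sketch of gradient_penalty, following the standard WGAN-GP formulation:

def gradient_penalty(critic, real, fake, alpha, train_step, device="cpu"):
    BATCH_SIZE, C, H, W = real.shape
    # random interpolation factor per image in the batch
    beta = torch.rand((BATCH_SIZE, 1, 1, 1)).repeat(1, C, H, W).to(device)
    interpolated_images = real * beta + fake.detach() * (1 - beta)
    interpolated_images.requires_grad_(True)

    # critic scores on the interpolated images
    mixed_scores = critic(interpolated_images, alpha, train_step)

    # gradient of the scores with respect to the interpolated images
    gradient = torch.autograd.grad(
        inputs=interpolated_images,
        outputs=mixed_scores,
        grad_outputs=torch.ones_like(mixed_scores),
        create_graph=True,
        retain_graph=True,
    )[0]
    gradient = gradient.view(gradient.shape[0], -1)
    gradient_norm = gradient.norm(2, dim=1)
    return torch.mean((gradient_norm - 1) ** 2)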
For the train function, we send critic (which is the discriminator), gen (the generator), loader, dataset, step, alpha, and the optimizers for the generator and the critic.
We start by looping over all the mini-batches that we created with the DataLoader, and we take just the images because we don't need the labels.
Then we set up the training for the discriminator/critic, where we want to maximize E[critic(real)] - E[critic(fake)]. This quantity measures how well the critic can distinguish between real and fake images.
After that, we set up the training for the generator, where we want to maximize E[critic(fake)].
Finally, we update the progress bar and the alpha value for fade_in, make sure it stays between 0 and 1, and return it.
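A sketch of such a training function (the alpha schedule, which fades in over roughly half of the epochs at the current resolution, and the small drift term on the real scores are choices borrowed from the ProGAN training setup):

def train_fn(critic, gen, loader, dataset, step, alpha, opt_critic, opt_gen):
    loop = tqdm(loader, leave=True)
    for batch_idx, (real, _) in enumerate(loop):
        real = real.to(DEVICE)
        cur_batch_size = real.shape[0]

        # Train the critic: maximize E[critic(real)] - E[critic(fake)],
        # i.e. minimize the negative of it, plus the gradient penalty
        noise = torch.randn(cur_batch_size, Z_DIM).to(DEVICE)
        fake = gen(noise, alpha, step)
        critic_real = critic(real, alpha, step)
        critic_fake = critic(fake.detach(), alpha, step)
        gp = gradient_penalty(critic, real, fake, alpha, step, device=DEVICE)
        loss_critic = (
            -(torch.mean(critic_real) - torch.mean(critic_fake))
            + LAMBDA_GP * gp
            + (0.001 * torch.mean(critic_real ** 2))   # small drift term, as in ProGAN
        )
        opt_critic.zero_grad()
        loss_critic.backward()
        opt_critic.step()

        # Train the generator: maximize E[critic(fake)] <=> minimize -E[critic(fake)]
        gen_fake = critic(fake, alpha, step)
        loss_gen = -torch.mean(gen_fake)
        opt_gen.zero_grad()
        loss_gen.backward()
        opt_gen.step()

        # Update alpha for fade_in and make sure it stays in [0, 1]
        alpha += cur_batch_size / ((PROGRESSIVE_EPOCHS[step] * 0.5) * len(dataset))
        alpha = min(alpha, 1)

        loop.set_postfix(gp=gp.item(), loss_critic=loss_critic.item())
    return alpha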
Now that we have everything, let's put it all together to train our StyleGAN.
We start by initializing the generator, the discriminator/critic, and the optimizers, put the generator and the critic into training mode, and then loop over PROGRESSIVE_EPOCHS. In each iteration, we call the train function for the given number of epochs, generate some fake images and save them using the generate_examples function, and finally progress to the next image resolution.
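A training script consistent with the above could look like this (giving the mapping network a learning rate two orders of magnitude smaller, and using Adam with betas (0.0, 0.99), follows the paper; the rest is a reasonable default):

gen = Generator(Z_DIM, W_DIM, IN_CHANNELS, img_channels=CHANNELS_IMG).to(DEVICE)
critic = Discriminator(IN_CHANNELS, img_channels=CHANNELS_IMG).to(DEVICE)

# Adam with betas (0.0, 0.99); the mapping network gets a 100x smaller learning rate
opt_gen = optim.Adam(
    [
        {"params": [p for name, p in gen.named_parameters() if "map" not in name]},
        {"params": gen.map.parameters(), "lr": 1e-5},
    ],
    lr=LEARNING_RATE,
    betas=(0.0, 0.99),
)
opt_critic = optim.Adam(critic.parameters(), lr=LEARNING_RATE, betas=(0.0, 0.99))

gen.train()
critic.train()

# start at the step that corresponds to START_TRAIN_AT_IMG_SIZE
step = int(log2(START_TRAIN_AT_IMG_SIZE / 4))
for num_epochs in PROGRESSIVE_EPOCHS[step:]:
    alpha = 1e-5   # start with a very small alpha for the fade-in
    loader, dataset = get_loader(4 * 2 ** step)
    print(f"Current image size: {4 * 2 ** step}")

    for epoch in range(num_epochs):
        print(f"Epoch [{epoch + 1}/{num_epochs}]")
        alpha = train_fn(critic, gen, loader, dataset, step, alpha, opt_critic, opt_gen)

    generate_examples(gen, step)
    step += 1   # progress to the next image resolution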
Hopefully, you have been able to follow all of the steps and get a good understanding of how to implement StyleGAN the right way. Now let's check out the results that we obtained after training this model on this dataset at 128 x 128 resolution.

In this article, we made a clean, simple, and readable from-scratch implementation of StyleGAN1 using PyTorch. We replicated the original paper as closely as possible, so if you read the paper, the implementation should be pretty much identical.