Continuing the series on building classical convolutional neural networks that revolutionized the field of computer vision over the last one to two decades, we will next build VGG, a very deep convolutional neural network, from scratch using PyTorch. You can find the previous articles in the series on my profile, mainly LeNet5 and AlexNet.
As before, we will look at the architecture and intuition behind VGG and the results it achieved at the time. We will then explore our dataset, CIFAR100, and load it into our program using memory-efficient code. Then, we will implement VGG16 (the number refers to the number of weight layers; there are two main versions, VGG16 and VGG19) from scratch using PyTorch, train it on our dataset, and evaluate it on our test set to see how it performs on unseen data.
VGG
Building on the work of AlexNet, VGG focuses on another crucial aspect of Convolutional Neural Networks (CNNs): depth. It was developed by Simonyan and Zisserman. It normally consists of 16 weight layers (13 convolutional layers followed by 3 fully connected layers) but can be extended to 19 layers as well (hence the two versions, VGG-16 and VGG-19). All the convolutional layers use 3x3 filters. You can read more about the network in the original paper, "Very Deep Convolutional Networks for Large-Scale Image Recognition".
Data Loading
Dataset
Before building the model, one of the most important steps in any machine learning project is to load, analyze, and pre-process the dataset. In this article, we'll be using the CIFAR-100 dataset. This dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). We'll be using the "fine" label here.
Importing the libraries
We'll be working mainly with torch (used for building the model and training), torchvision (for data loading/processing, which contains datasets and methods for processing those datasets in computer vision), and numpy (for mathematical manipulation). We will also define a variable device so that the program can use the GPU if available.
Loading the Data
torchvision is a library that provides easy access to many computer vision datasets, along with methods to pre-process these datasets in an easy and intuitive manner.
- We define a function data_loader that returns either train/validation data or test data depending on the arguments
- We start by defining the variable normalize with the mean and standard deviation of each channel (red, green, and blue) in the dataset. These can be calculated manually, but are also available online. This is used in the transform variable, where we resize the data, convert it to tensors, and then normalize it
- If the test argument is true, we simply load the test split of the dataset and return it using data loaders (explained below)
- In case test is false (the default behaviour), we load the train split of the dataset and randomly split it into train and validation sets (0.9:0.1)
- Finally, we make use of data loaders. Skipping them might not affect performance for a small dataset like CIFAR100, but loading everything at once can really impede performance for large datasets, so using data loaders is generally considered good practice. Data loaders allow us to iterate through the data in batches, and the data is loaded while iterating rather than all at once into RAM at the start
VGG16 from Scratch
To build the model from scratch, we need to first understand how model definitions work in torch and the different types of layers that we'll be using here:
- Every custom model needs to inherit from the nn.Module class, as it provides some basic functionality that helps the model to train.
- Secondly, there are two main things that we need to do. First, define the different layers of our model inside the __init__ function; second, define the sequence in which these layers will be executed on the input inside the forward function
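To make the pattern concrete, here is a deliberately tiny model, assuming 32x32 RGB inputs. TinyNet is a hypothetical example used only to illustrate the __init__/forward split; it is not part of VGG.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical example, not part of VGG
    def __init__(self, num_classes=10):
        super().__init__()
        # Layers are declared once in __init__
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(8 * 32 * 32, num_classes)

    def forward(self, x):
        # forward defines the order in which the layers are applied
        out = self.relu(self.conv(x))
        out = out.reshape(out.size(0), -1)  # flatten before the linear layer
        return self.fc(out)


model = TinyNet()
logits = model(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```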
Let's now define the various types of layers that we are using here:
- nn.Conv2d: These are the convolutional layers that accept the number of input and output channels as arguments, along with the kernel size for the filter. They also accept strides or padding if you want to apply those
- nn.BatchNorm2d: This applies batch normalization to the output from the convolutional layer
- nn.ReLU: This is the activation applied to various outputs in the network
- nn.MaxPool2d: This applies max pooling to the output with the kernel size given
- nn.Dropout: This is used to apply dropout to the output with a given probability
- nn.Linear: This is basically a fully connected layer
- nn.Sequential: This is technically not a type of layer, but it helps in combining different operations that are part of the same step
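As a small illustration of how these layers combine, the sketch below groups a convolution, batch normalization, activation, and pooling into one step with nn.Sequential. The channel counts and the 224x224 input size are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

# One "step": 3x3 convolution, batch norm, ReLU, then 2x2 max pooling
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(2, 3, 224, 224)  # a batch of two RGB images
out = block(x)
print(out.shape)  # torch.Size([2, 64, 112, 112])
```

The padding of 1 keeps the spatial size unchanged through the 3x3 convolution, so only the pooling step halves the height and width.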
Using this knowledge, we can now build our VGG16 model using the architecture in the paper:
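One possible way to write the model is sketched below. It is not necessarily the exact code used for the results in this article: the configuration list, the hidden_dim parameter, and the dropout placement are assumptions, and batch normalization is included because it is among the layers listed above, although the original paper did not use it. The sketch assumes 224x224 inputs.

```python
import torch
import torch.nn as nn

# Output channels of the 13 conv layers; 'M' marks a 2x2 max-pooling step
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']


class VGG16(nn.Module):
    def __init__(self, num_classes=100, hidden_dim=4096):
        super().__init__()
        layers = []
        in_channels = 3
        for v in VGG16_CFG:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [
                    nn.Conv2d(in_channels, v, kernel_size=3, stride=1, padding=1),
                    nn.BatchNorm2d(v),
                    nn.ReLU(),
                ]
                in_channels = v
        self.features = nn.Sequential(*layers)
        # A 224x224 input is halved five times, leaving a 7x7x512 feature map
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(7 * 7 * 512, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        out = self.features(x)
        out = out.reshape(out.size(0), -1)  # flatten for the fully connected layers
        return self.classifier(out)
```

The 13 convolutional layers plus the 3 linear layers give the 16 weight layers the name refers to. Driving the layers from a configuration list keeps the definition short and makes it easier to try other depths later.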
Hyperparameters
One of the important parts of any machine learning or deep learning project is optimizing the hyperparameters. Here, we won't experiment with different values for them, but we do have to define them beforehand. These include the number of epochs, batch size, and learning rate, along with the loss function and the optimizer.
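A sketch of these definitions is shown below. Only the 20 epochs and the 100 classes come from this article; the batch size, learning rate, momentum, weight decay, and the choice of SGD with cross-entropy loss are assumptions for illustration. The nn.Linear model is a placeholder so that the snippet runs on its own.

```python
import torch
import torch.nn as nn

num_classes = 100      # CIFAR-100 has 100 fine labels
num_epochs = 20        # number of epochs used for the result reported in this article
batch_size = 16        # assumption
learning_rate = 0.005  # assumption

# Placeholder so the snippet runs on its own; use the VGG16 model in practice
model = nn.Linear(10, num_classes)

# Cross-entropy is a common loss for multi-class classification
criterion = nn.CrossEntropyLoss()
# SGD with momentum and weight decay (values are assumptions)
optimizer = torch.optim.SGD(
    model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=0.005
)
```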
Training
We are now ready to train our model. We'll first look into how we train our model in torch and then look at the code:
- For every epoch, we go through the images and labels inside our train_loader and move those images and labels to the GPU if available. The device variable defined earlier selects the GPU automatically
- We use our model to predict on the images (model(images)) and then calculate the loss between the predictions and the true labels using our loss function (criterion(outputs, labels))
- Then we use that loss to backpropagate (loss.backward()) and update the weights (optimizer.step()). But do remember to set the gradients to zero before every update. This is done using optimizer.zero_grad()
- Also, at the end of every epoch we use our validation set to calculate the accuracy of the model as well. In this case, we don't need gradients, so we use with torch.no_grad() for faster evaluation
Now, we combine all of this into the following code:
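A sketch of such a loop is shown below. Wrapping it in a train function and returning the per-epoch validation accuracy are choices made here for illustration; the function name and its parameters are assumptions.

```python
import torch


def train(model, train_loader, valid_loader, criterion, optimizer, device, num_epochs):
    model.to(device)
    history = []  # validation accuracy (in percent) after each epoch
    for epoch in range(num_epochs):
        model.train()
        for images, labels in train_loader:
            # Move the batch to the configured device
            images, labels = images.to(device), labels.to(device)

            # Forward pass and loss
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Reset gradients, backpropagate, and update the weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

        # Validation: no gradients needed
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in valid_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        accuracy = 100 * correct / total
        print(f'Validation accuracy: {accuracy:.2f} %')
        history.append(accuracy)
    return history
```

The printed loss is that of the last batch in each epoch, which is a rough indicator rather than an epoch average.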
The output of the above code shows that the model is actually learning, as the loss decreases with every epoch.
Testing
For testing, we use exactly the same code as validation but with the test_loader:
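A minimal sketch of that evaluation step, written as a reusable function, is shown below. The evaluate name and its signature are assumptions.

```python
import torch


def evaluate(model, loader, device):
    """Return the classification accuracy (in percent) of model over loader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            # The predicted class is the index of the largest logit
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100 * correct / total
```

Calling evaluate(model, test_loader, device) after training would then report the accuracy on the unseen test split.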
Using the above code and training the model for 20 epochs, we were able to achieve an accuracy of 75% on the test set.
Conclusion
Let's now summarize what we did in this article:
- We started by understanding the architecture and different kinds of layers in the VGG-16 model
- Next, we loaded and pre-processed the CIFAR100 dataset using torchvision
- Then, we used PyTorch to build our VGG-16 model from scratch, along with understanding the different types of layers available in torch
- Finally, we trained and tested our model on the CIFAR100 dataset, and the model seemed to perform well on the test dataset with 75% accuracy
Future Work
This article gives you a good introduction and some hands-on learning, but you'll learn much more if you extend it and see what else you can do:
- You can try using different datasets, such as CIFAR10 or a subset of the ImageNet dataset.
- You can experiment with different hyperparameters and find the best combination of them for the model.
- Finally, you can try adding or removing layers from the model to see their impact on its capability. Better yet, try to build the VGG-19 version of this model.