In the first part of this series, we took a look at the different angles from which the problem of neural architecture search is being approached.
With the foundation covered, we'll now see how to implement some of the important concepts we saw in the first article of the series. Specifically, we will look into designing a neural architecture search method for Multilayer Perceptrons. Our implementation will include three special features:
- One-shot architecture training
- An accuracy predictor in the controller
- REINFORCE gradient for training the controller
In this part, we will look at the search space design for MLPs, creating model architectures from sequences, and how to go about implementing one-shot architectures.
The full code can be found here.
Introduction
Multilayer perceptrons are the easiest deep learning architectures to implement. A few linear layers are stacked on top of each other; each takes an input from the previous layer, multiplies it by its weights, adds a bias vector, and passes the result through an activation function of choice to get the layer output. This feed-forward process goes on until we finally reach our classification or regression output from the final layer. This final output is compared with the ground truth classification or regression values, the loss is calculated using an appropriate loss function, and the weights of all layers are updated one by one using gradient descent.
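For reference, here is roughly what a small hand-written Keras MLP looks like; the layer sizes, activations, and input shape below are arbitrary placeholders rather than values from this project:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# a hand-designed two-hidden-layer MLP for a 10-class problem
# (layer sizes, activations, and input shape are arbitrary placeholders)
model = Sequential([
    Dense(64, activation='relu', input_shape=(784,)),  # hidden layer 1
    Dense(32, activation='relu'),                      # hidden layer 2
    Dense(10, activation='softmax'),                   # classification layer
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

MLPNAS automates exactly this kind of decision-making: how many layers, how many units, and which activations.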
Game Plan for MLPNAS
There are a few things to think about when trying to automate neural architecture creation. Before we dive in, let's take a look at a simplified view of our pipeline for multilayer perceptron neural architecture search (MLPNAS):
All the functions in the outer loop of this pipeline belong to the controller, and all the functions in the inner loop belong to the MLP generator. In this part of the series we will look at the inner loop. But before we can do that, we need some idea of how the controller generates architectures in the first place.
The controller we use is an LSTM architecture which generates sequences of numbers. These numbers are decoded to create architecture parameters which are then used to generate the architecture. How a controller goes about sequentially creating valid architectures is something we will look into in the coming articles. For now, we need to understand that each possible layer configuration needs to be encoded into a number, and we need a mechanism to decode said numbers into the corresponding number of neurons and activations in a layer.
Let's look at it in more detail.
Search Space
The first concern is designing the search space. Even with very few hidden layers, there are theoretically infinite possible configurations.
The number of neurons in each hidden layer can be any positive integer. There is also a good number of activation functions to choose from, and as mentioned above, they serve different purposes (e.g. softmax is rarely used outside the classification layer, and sigmoid is used in the classification layer only for binary classification problems).
To take care of all of this, we design a search space that loosely resembles how humans think of MLP architectures, while also providing a way to numerically encode layer configurations and decode those encodings.
Each hidden layer can be expressed with two parameters: the number of nodes and the activation function. So we create a dictionary to serve as the vocabulary of our sequence generator. We consider a discrete search space where the number of nodes can take specific values: 8, 16, 32, 64, 128, 256, and 512. The same goes for activation functions: sigmoid, tanh, relu, and elu. We represent each possible layer combination with a tuple, (number of nodes, activation). In the dictionary, the keys are the numerical codes and the values are these tuples of layer hyperparameters. We start the encoding from 1, since we will need to pad sequences later to train our controller and don't want 0 to create confusion.
After assigning a numerical code to each combination of the above-mentioned nodes and activations, we add another option for dropout. Finally, depending on the number of target classes, we also add a final layer. We keep the dropout parameter constant in this project to avoid over-complicating things.
If there are two target classes then we pick a single node sigmoid layer; otherwise, we choose a softmax layer with as many nodes as the number of classes.
There are also functions to encode a given tuple into its numerical counterpart or vice-versa.
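To make this concrete, here is a minimal sketch of what such a search space class might look like; the class, method, and attribute names follow the description above but are assumptions rather than the project's exact code:

```python
class MLPSearchSpace(object):
    """Discrete search space mapping integer IDs to layer configurations."""

    def __init__(self, target_classes):
        self.target_classes = target_classes
        self.vocab = self.vocab_dict()

    def vocab_dict(self):
        nodes = [8, 16, 32, 64, 128, 256, 512]
        activations = ['sigmoid', 'tanh', 'relu', 'elu']
        # every (nodes, activation) pair gets an integer ID, starting from 1
        layer_params = [(n, a) for n in nodes for a in activations]
        vocab = {i + 1: params for i, params in enumerate(layer_params)}
        # one extra token for a dropout layer (the dropout rate itself is a constant)
        vocab[len(vocab) + 1] = 'dropout'
        # the final layer depends on the number of target classes
        if self.target_classes == 2:
            vocab[len(vocab) + 1] = (1, 'sigmoid')
        else:
            vocab[len(vocab) + 1] = (self.target_classes, 'softmax')
        return vocab

    def encode_sequence(self, sequence):
        # list of layer configurations -> list of integer IDs
        inverse_vocab = {v: k for k, v in self.vocab.items()}
        return [inverse_vocab[layer] for layer in sequence]

    def decode_sequence(self, sequence):
        # list of integer IDs -> list of layer configurations
        return [self.vocab[key] for key in sequence]
```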
Now that we have defined our search space and a way to encode architectures, let's look into how we can generate neural network architectures given a sequence representing a valid architecture. We have added all the different configurations of layers we might need in the search space but we haven't written rules for which configuration is valid and which isn't. We will do that while writing the controller, which we will discuss in the coming articles.
Model Generator
How do humans go about designing an MLP? If you have any experience with deep learning, you know this job doesn't take more than a few minutes.
The things to consider are:
- How many neurons in each hidden layer: There are countless options, and we need to find which configuration will give us the best accuracy.
- Which activation functions to use for each hidden layer: There are several of these, and figuring out which one works best for a particular dataset requires experimentation that we intend to automate.
- Adding a dropout layer: Does it help the performance of my architecture, or hurt it?
- What's the final layer like: Is it a multi-class problem or a binary classification problem? This determines the number of nodes in our final layer, as well as the loss function we finally use while training these architectures.
- The dimensionality of your data: If it takes 2-D input, we might want to flatten it before adding the linear layers.
- How many hidden layers: Panchal et al. (2011) suggest that we rarely need more than two hidden layers in an MLP for optimum performance.
Many of these will be taken care of when we write our controller.
Generating MLPs
For now, we will assume that our controller is working well and is churning out valid sequences.
We need to write a generator that can take these sequences and convert them into models that can be trained and evaluated. The model generator will include functions for:
- Converting sequences to Keras models
- Compiling these models
We will look at saving weights in the one-shot architectures subsection, and training just after. This will include:
- Setting weights for Keras models
- Saving trained weights after training each model
The logging of accuracies will be looked at in the coming articles.
Our MLPGenerator class will inherit the MLPSearchSpace class defined above. We will also save several constants in another file called CONSTANTS.py. We have imported constants like:
- Target classes
- Optimizer used
- Learning rate
- Decay
- Momentum
- Dropout rate
- Loss function
And others from the file directly, using:
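The import itself might look something like this; the exact constant names are assumptions based on the list above:

```python
# CONSTANTS.py keeps all project-level settings in one place;
# the exact constant names below are assumptions
from CONSTANTS import (TARGET_CLASSES, MLP_OPTIMIZER, MLP_LEARNING_RATE,
                       MLP_DECAY, MLP_MOMENTUM, MLP_DROPOUT,
                       MLP_LOSS_FUNCTION, MLP_ONE_SHOT)
```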
These constants are initialized in the MLPGenerator class as shown below.
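A sketch of what this initialization might look like, using the assumed constant and attribute names from above:

```python
class MLPGenerator(MLPSearchSpace):

    def __init__(self):
        # training hyperparameters pulled from CONSTANTS.py
        self.target_classes = TARGET_CLASSES
        self.mlp_optimizer = MLP_OPTIMIZER
        self.mlp_lr = MLP_LEARNING_RATE
        self.mlp_decay = MLP_DECAY
        self.mlp_momentum = MLP_MOMENTUM
        self.mlp_dropout = MLP_DROPOUT
        self.mlp_loss_func = MLP_LOSS_FUNCTION
        self.mlp_one_shot = MLP_ONE_SHOT

        # initialize the search space (vocabulary) from the parent class
        super().__init__(TARGET_CLASSES)
```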
The MLP_ONE_SHOT constant is a boolean value telling the algorithm whether to use one-shot training or not.
Below is the function for creating a model given a valid sequence that encodes an architecture and input shape. We decode the sequence, create a sequential model, and add each layer in the sequence one by one. We also account for inputs being more than 2-dimensional, which will require us to flatten the input. We add conditions for Dropout as well.
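A sketch of what this function might look like, with layer naming and helper names that are assumptions consistent with the description above:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

# a sketch of the model-building method of MLPGenerator
def create_model(self, sequence, mlp_input_shape):
    # decode the integer sequence back into layer configurations
    layer_configs = self.decode_sequence(sequence)
    model = Sequential()
    # multi-dimensional inputs are flattened first; the layer name matters
    # later, when deciding which layers take part in one-shot weight transfer
    if len(mlp_input_shape) > 1:
        model.add(Flatten(name='flatten', input_shape=mlp_input_shape))
    for i, layer_conf in enumerate(layer_configs):
        if layer_conf == 'dropout':
            model.add(Dropout(self.mlp_dropout, name='dropout_' + str(i)))
        elif len(model.layers) == 0:
            # the first weighted layer needs to know the input shape
            model.add(Dense(units=layer_conf[0], activation=layer_conf[1],
                            input_shape=mlp_input_shape))
        else:
            model.add(Dense(units=layer_conf[0], activation=layer_conf[1]))
    return model
```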
Remember that it is important to name the flatten and dropout layers, because the names will come in handy for our one-shot weight setting and updating.
Now we define another function to compile our model, which will get an optimizer and loss function using the constants we defined in our init function and return a compiled model using the model.compile method.
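A sketch of this compile method might look like the following; learning-rate decay is left out of the optimizer calls here, since how it is passed depends on the Keras version:

```python
from tensorflow.keras.optimizers import SGD, Adam

# a sketch of the compile method of MLPGenerator
def compile_model(self, model):
    # build the optimizer from the constants loaded in __init__
    if self.mlp_optimizer == 'sgd':
        optimizer = SGD(learning_rate=self.mlp_lr, momentum=self.mlp_momentum)
    else:
        optimizer = Adam(learning_rate=self.mlp_lr)
    model.compile(loss=self.mlp_loss_func, optimizer=optimizer, metrics=['accuracy'])
    return model
```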
One Shot Architectures
Besides these, another interesting concept we will tackle is one-shot learning, or parameter sharing. Parameter sharing was introduced and popularized by Pham et al. (2018), where a controller discovers neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on a validation set.
What this means is that the entire search space is built into one big computational graph, and each new architecture is only a subgraph of this super-architecture. In essence, all the weights between all possible combinations of layers are transferable to each other. For example, if the first neural network generated has the architecture:
Its weights would be of shape:
16 X 32
32 X 10
Then, if a second network has the architecture:
Its weights would be of shape:
64 X 32
32 X 10
A one-shot architecture method, or parameter sharing, would transfer the trained weights between the second and the final layer from the first architecture to the second architecture before training the second architecture on the given data.
Hence, our algorithm requires us to maintain a mapping of different layer pairings and their corresponding weight matrices at all times. Before we train any new architectures, we need to see if a particular layer combination has turned up in the past. If yes, the weights are transferred. If no, the weights are initialized, the model is trained, and the new layer combination is logged into our mapping along with the weights.
The first thing to do here is to initialize a Pandas dataframe to store all our weights in. You could instead store NumPy arrays directly in separate .npz files, or use any other format you find convenient.
To do so, we add this snippet of code in the init function.
```python
if self.mlp_one_shot:
    # path to shared weights file
    self.weights_file = 'LOGS/shared_weights.pkl'
    # open an empty dataframe with columns for bigram IDs and weights
    self.shared_weights = pd.DataFrame({'bigram_id': [], 'weights': []})
    # pickle the dataframe
    if not os.path.exists(self.weights_file):
        print("Initializing shared weights dictionary...")
        self.shared_weights.to_pickle(self.weights_file)
```
One shot learning requires us to perform two tasks:
- Set weights of an architecture before we start training
- Update our dataframe with newly trained weights
We write two functions for these. Both take a model, extract the configuration of each layer, and convert consecutive layers into a bigram: a 32-node layer followed by a 64-node layer thus has a size of (32 x 64), whereas a 16-node layer followed by another 16-node layer gives (16 x 16). We remove Dropout from the configs, because dropout layers do not affect the weight sizes.
Once we have those, while setting weights, we look through all the available stored weights and see if we already have a weight that satisfies the criteria for weight transfer. If so, we transfer those weights; if not, we let Keras automatically initialize the weights.
While updating weights, we again look through all stored weights in the Pandas dataframe and see if we already have a weight of the same size and activation as in the model after training. If yes, we replace the weights in the dataframe with the new ones. Otherwise we add the new shape bigram along with the weights in a new row in the dataframe.
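Sketches of these two functions, following the description above, might look like this; the method names, the exact bigram key format, and the replace-row strategy for overwriting are simplifying assumptions rather than the project's exact implementation:

```python
import pandas as pd

# sketches of the weight-transfer methods of MLPGenerator

def set_model_weights(self, model):
    # before training: reuse stored weights wherever a layer of the same
    # shape and activation (the "bigram") has been trained before
    for layer in model.layers:
        weights = layer.get_weights()
        if 'flatten' in layer.name or 'dropout' in layer.name or not weights:
            continue  # no trainable weights to transfer
        # bigram key: (inputs, units) of the kernel, plus the activation
        bigram_id = str((weights[0].shape[0], weights[0].shape[1],
                         layer.get_config()['activation']))
        match = self.shared_weights.loc[
            self.shared_weights['bigram_id'] == bigram_id, 'weights']
        if len(match) > 0:
            layer.set_weights(match.iloc[0])

def update_weights(self, model):
    # after training: write each layer's weights back to the shared dataframe,
    # overwriting the row for an existing bigram or appending a new one
    for layer in model.layers:
        weights = layer.get_weights()
        if 'flatten' in layer.name or 'dropout' in layer.name or not weights:
            continue
        bigram_id = str((weights[0].shape[0], weights[0].shape[1],
                         layer.get_config()['activation']))
        # drop any stale row for this bigram, then append the fresh weights
        self.shared_weights = self.shared_weights[
            self.shared_weights['bigram_id'] != bigram_id]
        new_row = pd.DataFrame({'bigram_id': [bigram_id], 'weights': [weights]})
        self.shared_weights = pd.concat([self.shared_weights, new_row],
                                        ignore_index=True)
    self.shared_weights.to_pickle(self.weights_file)
```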
Once we have the weight transfer functions ready, we can finally write a function to train our models.
Training Generated Architectures
The training function will set model weights, train, and update model weights if one shot learning is enabled. Otherwise, it will simply train the model and track the metrics. It will take the input data, the Keras model, the number of epochs to train the model for, the train test split, and callbacks as input, and train the models accordingly.
In this implementation we haven't added functionality to automatically train the best models for many more epochs once the search phase is done, but accepting callbacks as an argument makes it easy to include, for example, early stopping in the final training.
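A simplified sketch of such a training method, with assumed argument names, might be:

```python
# a sketch of the training method of MLPGenerator
def train_model(self, model, x_data, y_data, nb_epochs,
                validation_split=0.1, callbacks=None):
    if self.mlp_one_shot:
        # transfer any weights trained by earlier architectures
        self.set_model_weights(model)
    history = model.fit(x_data, y_data,
                        epochs=nb_epochs,
                        validation_split=validation_split,
                        callbacks=callbacks,
                        verbose=0)
    if self.mlp_one_shot:
        # store the freshly trained weights for future architectures
        self.update_weights(model)
    return history
```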
And our MLP generator is ready.
Things To Remember
There are a few things to keep in mind when dealing with one-shot training, which raises some questions that aren't easily answered. For example:
- Are the models that get trained earlier in the search at an inherent disadvantage when ranked, because there were no pre-trained weights to transfer to them?
- Is it possible that the transfer of weights hurts a particular architecture's performance instead of improving it?
- How does the one-shot architecture methodology change the training of the controller?
Besides pondering the questions mentioned above, there's another implementation-specific detail to make a note of.
The one-shot training weights have to be saved in a way that can be stored, searched, and retrieved efficiently. I found that storing them in a Pandas dataframe makes weight transfer take a lot longer towards the later stages of the NAS, since the dataframe is already populated with a lot of weights and it takes longer to search through them to make the right transfer. If you have another strategy to store and retrieve weights faster, you should definitely test it in your own implementation. This becomes even more important when the search space you navigate becomes huge, or when you want deeper or more complex architectures (think CNNs or ResNets).
Conclusion
In this second part of the Neural Architecture Search series, we looked at the conversion of encoded sequences into Keras architectures automatically. We built a search space for our problem, functions to encode a tuple describing a layer's configuration, and a decoding function to convert the encoded values to layer configuration tuples.
We looked at transferring weights to each layer individually and also storing their weights for further use. We saw how to compile and train these models, and used Pandas dataframes to store layer weights according to the bigrams they created. We used the same bigrams to check whether there are weights that can be transferred in a new architecture.
Finally, we took these compiled models along with information like the loss function, optimizer, number of epochs, etc. to write a function for training the model.
In the next part, we will chalk out a controller that can create sequences of numbers that can be converted to valid architectures by the MLPGenerator. We will also look into how the controller itself is trained, and if we can play around with the controller architecture to get better results.
I hope you enjoyed the article.