In the first part of the series, we took a look at all the different angles from which the problem of neural architecture search is being approached.
With the foundation covered, we'll now see how to implement some of the important concepts we saw in the first article of the series. Specifically, we will look into designing a neural architecture search method for Multilayer Perceptrons. Our implementation will include three special features:
- One-shot architecture training
- An accuracy predictor in the controller
- REINFORCE gradient for training the controller
In this part, we will look at the search space design for MLPs, creating model architectures from sequences, and how to go about implementing one-shot architectures.
The full code can be found here.
Introduction
Multilayer perceptrons are the easiest deep learning architectures to implement. A few linear layers are stacked on top of each other; each takes the output of the previous layer, multiplies it by its weights, adds a vector of biases to the result, and passes this vector through an activation function of choice to get the layer output. This feed-forward process continues until we reach the classification or regression output of the final layer. That final output is compared with the ground-truth classification or regression values, the loss is calculated using an appropriate loss function, and the weights of all layers are updated using gradient descent.
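For reference, the kind of model we want to generate automatically looks something like the following when written by hand. This is a minimal sketch, assuming flattened 784-dimensional inputs and 10 output classes:

from keras.models import Sequential
from keras.layers import Dense

# a hand-written two-hidden-layer MLP of the kind MLPNAS will generate for us
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])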
Game Plan for MLPNAS
There are a few things to think about when trying to automate neural architecture creation. Before we dive in, here's a simplified view of our pipeline for multilayer perceptron neural architecture search (MLPNAS):
def search(self):
    # for the number of controller epochs
    for controller_epoch in range(controller_sampling_epochs):
        # sample a set number of architecture sequences
        sequences = sample_architecture_sequences(controller_model, samples_per_controller_epoch)
        # predict their accuracies using a hybrid controller
        pred_accuracies = get_predicted_accuracies(controller_model, sequences)
        # for each of these sequences
        for i, sequence in enumerate(sequences):
            # create and compile the model corresponding to the sequence
            model = create_architecture(sequence)
            # train said model
            history = train_architecture(model)
            # log the training metrics
            append_model_metrics(sequence, history, pred_accuracies[i])
        # use this data to train the controller
        xc, yc, val_acc_target = prepare_controller_data(sequences)
        train_controller(controller_model, xc, yc, val_acc_target)
As you might have noticed, all the functions for the outer loop belong to the controller and all the functions in the inner loop belong to the MLP generator. In this part of the series we will look at the inner loop. But before we're able to do that, we have to have some idea of how the controller generates architectures in the first place.
The controller we use is an LSTM architecture which generates sequences of numbers. These numbers are decoded to create architecture parameters which are then used to generate the architecture. How a controller goes about sequentially creating valid architectures is something we will look into in the coming articles. For now, we need to understand that each possible layer configuration needs to be encoded into a number, and we need a mechanism to decode said numbers into the corresponding number of neurons and activations in a layer.
Let's look at it in more detail.
Search Space
The first concern is designing the search space. Even if we're dealing with very few hidden layers, there are theoretically infinitely many possible configurations.
The number of neurons in each hidden layer can be any positive integer. There is also a good number of activation functions, and as mentioned above, they serve different purposes (e.g. you will rarely use softmax outside the classification layer, and you will only use sigmoid in the classification layer if it's a binary classification problem).
To take care of all of this, we design a search space that loosely resembles how humans think about MLP architectures, and that also gives us a way to numerically encode layer configurations and decode those encodings.
Each hidden layer can be expressed with two parameters: the number of nodes and the activation function. So we create a dictionary that acts as the vocabulary of our sequence generator. We consider a discrete search space where the number of nodes can take one of the values 8, 16, 32, 64, 128, 256, and 512, and the activation function is one of sigmoid, tanh, relu, and elu. We represent each possible layer configuration with a tuple, (number of nodes, activation). In the dictionary, the keys are the numerical codes and the values are these tuples of layer hyperparameters. We start the encoding from 1 rather than 0, since we will need to pad sequences later to train our controller and don't want 0 to cause confusion.
After assigning a numerical code to each combination of the above-mentioned nodes and activations, we add another option for dropout. Finally, depending on the number of target classes, we also add a final layer. We keep the dropout rate constant in this project, to avoid over-complicating things.
If there are two target classes, we pick a single-node sigmoid layer; otherwise, we choose a softmax layer with as many nodes as there are classes.
There are also functions to encode a sequence of configuration tuples into its numerical counterpart and to decode it back.
class MLPSearchSpace(object):

    def __init__(self, target_classes):
        self.target_classes = target_classes
        self.vocab = self.vocab_dict()

    def vocab_dict(self):
        # define the allowed nodes and activation functions
        nodes = [8, 16, 32, 64, 128, 256, 512]
        act_funcs = ['sigmoid', 'tanh', 'relu', 'elu']
        # initialize lists for keys and values of the vocabulary
        layer_params = []
        layer_id = []
        # for all activation functions for each node count
        for i in range(len(nodes)):
            for j in range(len(act_funcs)):
                # create an id and a configuration tuple (node, activation)
                layer_params.append((nodes[i], act_funcs[j]))
                layer_id.append(len(act_funcs) * i + j + 1)
        # zip the ids and configurations into a dictionary
        vocab = dict(zip(layer_id, layer_params))
        # add dropout to the vocabulary
        vocab[len(vocab) + 1] = 'dropout'
        # add the final softmax/sigmoid layer to the vocabulary
        if self.target_classes == 2:
            vocab[len(vocab) + 1] = (self.target_classes - 1, 'sigmoid')
        else:
            vocab[len(vocab) + 1] = (self.target_classes, 'softmax')
        return vocab

    # function to encode a sequence of configuration tuples
    def encode_sequence(self, sequence):
        keys = list(self.vocab.keys())
        values = list(self.vocab.values())
        encoded_sequence = []
        for value in sequence:
            encoded_sequence.append(keys[values.index(value)])
        return encoded_sequence

    # function to decode a sequence back to configuration tuples
    def decode_sequence(self, sequence):
        keys = list(self.vocab.keys())
        values = list(self.vocab.values())
        decoded_sequence = []
        for key in sequence:
            decoded_sequence.append(values[keys.index(key)])
        return decoded_sequence
Now that we have defined our search space and a way to encode architectures, let's look into how we can generate neural network architectures given a sequence representing a valid architecture. We have added all the different configurations of layers we might need in the search space but we haven't written rules for which configuration is valid and which isn't. We will do that while writing the controller, which we will discuss in the coming articles.
Model Generator
How do humans go about designing an MLP? If you have any experience with deep learning, you know this job doesn't take more than a few minutes.
The things to consider are:
- How many neurons in each hidden layer: There are countless options, and we need to find which configuration will give us the best accuracy.
- Which activation functions to use for each hidden layer: There are several of these, and figuring out which one works best for a particular dataset requires experimentation that we intend to automate.
- Adding a dropout layer: Does it help the performance of my architecture, or hurt it?
- What's the final layer like: Is it a multi-class problem or a binary classification problem? This determines the number of nodes in our final layer, as well as the loss function we finally use while training these architectures.
- The dimensionality of your data: If it takes 2-D input, we might want to flatten it before adding the linear layers.
- How many hidden layers: Panchal et al. (2011) suggest that we rarely need more than two hidden layers in an MLP for optimum performance.
Many of these will be taken care of when we write our controller.
Generating MLPs
For now, we will assume that our controller is working well and is churning out valid sequences.
We need to write a generator that can take these sequences and convert them into models that can be trained and evaluated. The model generator will include functions for:
- Converting sequences to Keras models
- Compiling these models
We will look at saving weights in the one-shot architectures subsection, and training just after. This will include:
- Setting weights for Keras models
- Saving trained weights after training each model
The logging of accuracies will be looked at in the coming articles.
Our MLPGenerator class will inherit from the MLPSearchSpace class defined above. We will also keep several constants in a separate file called CONSTANTS.py. From this file we import constants like:
- Target classes
- Optimizer used
- Learning rate
- Decay
- Momentum
- Dropout rate
- Loss function
And others from the file directly, using:
from CONSTANTS import *
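For reference, a minimal CONSTANTS.py might look like the sketch below. The exact values here are illustrative assumptions, not the settings used in the project:

# CONSTANTS.py (illustrative values only)
TARGET_CLASSES = 10                             # number of output classes
MLP_OPTIMIZER = 'Adam'                          # 'sgd' or any Keras optimizer class name
MLP_LEARNING_RATE = 0.01                        # learning rate for the MLP optimizer
MLP_DECAY = 0.0                                 # learning rate decay
MLP_MOMENTUM = 0.0                              # momentum (only used with SGD)
MLP_DROPOUT = 0.2                               # dropout rate for dropout layers
MLP_LOSS_FUNCTION = 'categorical_crossentropy'  # loss used when compiling models
MLP_ONE_SHOT = True                             # enable one-shot weight sharing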
These constants are initialized in the MLPGenerator class as shown below.
import os
import warnings
import pandas as pd
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from CONSTANTS import *


class MLPGenerator(MLPSearchSpace):

    def __init__(self):
        self.target_classes = TARGET_CLASSES
        self.mlp_optimizer = MLP_OPTIMIZER
        self.mlp_lr = MLP_LEARNING_RATE
        self.mlp_decay = MLP_DECAY
        self.mlp_momentum = MLP_MOMENTUM
        self.mlp_dropout = MLP_DROPOUT
        self.mlp_loss_func = MLP_LOSS_FUNCTION
        self.mlp_one_shot = MLP_ONE_SHOT
        self.metrics = ['accuracy']
        super().__init__(TARGET_CLASSES)
The MLP_ONE_SHOT constant is a boolean value telling the algorithm whether to use one-shot training or not.
Below is the function for creating a model given a valid sequence that encodes an architecture and input shape. We decode the sequence, create a sequential model, and add each layer in the sequence one by one. We also account for inputs being more than 2-dimensional, which will require us to flatten the input. We add conditions for Dropout as well.
# function to create a keras model given a sequence and input data shape
def create_model(self, sequence, mlp_input_shape):
    # decode sequence to get nodes and activations of each layer
    layer_configs = self.decode_sequence(sequence)
    # create a sequential model
    model = Sequential()
    # add a flatten layer if each input sample has more than one dimension
    if len(mlp_input_shape) > 1:
        model.add(Flatten(name='flatten', input_shape=mlp_input_shape))
        # for each element in the decoded sequence
        for i, layer_conf in enumerate(layer_configs):
            # add a model layer (Dense or Dropout)
            if layer_conf == 'dropout':
                model.add(Dropout(self.mlp_dropout, name='dropout'))
            else:
                model.add(Dense(units=layer_conf[0], activation=layer_conf[1]))
    else:
        # for flat (1D) input samples
        for i, layer_conf in enumerate(layer_configs):
            # the first layer requires the input shape parameter
            if i == 0:
                model.add(Dense(units=layer_conf[0], activation=layer_conf[1], input_shape=mlp_input_shape))
            # add subsequent layers (Dense or Dropout)
            elif layer_conf == 'dropout':
                model.add(Dropout(self.mlp_dropout, name='dropout'))
            else:
                model.add(Dense(units=layer_conf[0], activation=layer_conf[1]))
    # return the keras model
    return model
Remember that it is important to name the flatten and dropout layers, because the names will come in handy for our one-shot weight setting and updating.
Now we define another function to compile our model, which will get an optimizer and loss function using the constants we defined in our init function and return a compiled model using the model.compile method.
# function to compile the model with the appropriate optimizer and loss function
def compile_model(self, model):
    # get the optimizer
    if self.mlp_optimizer == 'sgd':
        optim = optimizers.SGD(lr=self.mlp_lr, decay=self.mlp_decay, momentum=self.mlp_momentum)
    else:
        optim = getattr(optimizers, self.mlp_optimizer)(lr=self.mlp_lr, decay=self.mlp_decay)
    # compile the model
    model.compile(loss=self.mlp_loss_func, optimizer=optim, metrics=self.metrics)
    # return the compiled keras model
    return model
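As a quick sanity check, a sequence can be turned into a compiled model like this. This is a sketch; the sequence is hand-picked, and it assumes the illustrative constants above (10 target classes) along with the LOGS directory used by the one-shot setup:

import os
os.makedirs('LOGS', exist_ok=True)

generator = MLPGenerator()

# a hypothetical sequence: (128, 'relu') -> dropout -> (10, 'softmax')
sequence = generator.encode_sequence([(128, 'relu'), 'dropout', (10, 'softmax')])

# build and compile a model for 28 x 28 inputs (e.g. MNIST-like images)
model = generator.create_model(sequence, mlp_input_shape=(28, 28))
model = generator.compile_model(model)
model.summary()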
One Shot Architectures
Besides these, another interesting concept we will tackle is that of one-shot learning, or parameter sharing. The concept of parameter sharing was introduced and popularized by Pham et al. (2018), where a controller discovers neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on a validation set.
What this means is that the entire search space is built into one big computational graph, and each new architecture is only a subgraph of this super-architecture. In essence, all the weights between all possible combinations of layers are transferable to each other. For example, if the first neural network generated has the architecture:
[(16, 'relu'), (32, 'relu'), (10, 'softmax')]
Its weights would be of shape:
16 X 32
32 X 10
Then, if a second network has the architecture:
[(64, 'relu'), (32, 'relu'), (10, 'softmax')]
Its weights would be of shape:
64 X 32
32 X 10
A one-shot architecture method, or parameter sharing, would transfer the trained weights between the second and the final layer (the 32 x 10 matrix) from the first architecture to the second before training the second architecture on the given data.
Hence, our algorithm requires us to maintain a mapping of different layer pairings and their corresponding weight matrices at all times. Before we train any new architectures, we need to see if a particular layer combination has turned up in the past. If yes, the weights are transferred. If no, the weights are initialized, the model is trained, and the new layer combination is logged into our mapping along with the weights.
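To make the bookkeeping concrete, here is a small standalone sketch (not part of the MLPGenerator class) showing which bigram the two example architectures above would share:

# layer configurations of the two example models, with an explicit 'input' marker
model_1 = ['input', (16, 'relu'), (32, 'relu'), (10, 'softmax')]
model_2 = ['input', (64, 'relu'), (32, 'relu'), (10, 'softmax')]

def layer_bigrams(layer_configs):
    # consecutive layer pairs identify each weight matrix in the network
    return [(layer_configs[i - 1], layer_configs[i]) for i in range(1, len(layer_configs))]

# bigrams that appear in both models can share weights
shared = [b for b in layer_bigrams(model_1) if b in layer_bigrams(model_2)]
print(shared)  # [((32, 'relu'), (10, 'softmax'))]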
The first thing to do here is to initialize a Pandas dataframe to store all our weights in. You could instead store NumPy arrays directly in .npz files, or use any other format you find convenient.
To do so, we add this snippet of code to the init function.
if self.mlp_one_shot:
    # path to the shared weights file
    self.weights_file = 'LOGS/shared_weights.pkl'
    # open an empty dataframe with columns for bigram IDs and weights
    self.shared_weights = pd.DataFrame({'bigram_id': [], 'weights': []})
    # pickle the dataframe if it doesn't already exist on disk
    if not os.path.exists(self.weights_file):
        print("Initializing shared weights dictionary...")
        self.shared_weights.to_pickle(self.weights_file)
One shot learning requires us to perform two tasks:
- Set weights of an architecture before we start training
- Update our dataframe with newly trained weights
We write two functions for these. Both take a model, extract the configuration of each layer, and convert consecutive layers into bigrams: a 32-node layer followed by a 64-node layer corresponds to a weight matrix of size (32 x 64), whereas a 16-node layer followed by another 16-node layer corresponds to (16 x 16). We drop Dropout from the configurations, because dropout does not affect the weight sizes.
Once we have these bigrams, when setting weights we look through all the stored weights to see if we already have a set that satisfies the criteria for weight transfer. If so, we transfer those weights; if not, we let Keras initialize the weights automatically.
def set_model_weights(self, model):
    # get nodes and activations for each layer
    layer_configs = ['input']
    for layer in model.layers:
        # add flatten since it affects the size of the weights
        if 'flatten' in layer.name:
            layer_configs.append(('flatten'))
        # don't add dropout since it doesn't affect weight sizes or activations
        elif 'dropout' not in layer.name:
            layer_configs.append((layer.get_config()['units'], layer.get_config()['activation']))
    # get bigrams of relevant layers for weight transfer
    config_ids = []
    for i in range(1, len(layer_configs)):
        config_ids.append((layer_configs[i - 1], layer_configs[i]))
    # for all layers
    j = 0
    for i, layer in enumerate(model.layers):
        if 'dropout' not in layer.name:
            warnings.simplefilter(action='ignore', category=FutureWarning)
            # get all bigram values we already have weights for
            bigram_ids = self.shared_weights['bigram_id'].values
            # check if the bigram already exists in the dataframe
            search_index = []
            for k in range(len(bigram_ids)):
                if config_ids[j] == bigram_ids[k]:
                    search_index.append(k)
            # set the layer weights if there is a bigram match in the dataframe
            if len(search_index) > 0:
                print("Transferring weights for layer:", config_ids[j])
                layer.set_weights(self.shared_weights['weights'].values[search_index[0]])
            j += 1
While updating weights, we again look through all the stored weights in the Pandas dataframe and check whether we already have weights of the same size and activation as the model's after training. If yes, we replace the weights in the dataframe with the new ones; otherwise, we add the new bigram along with its weights as a new row in the dataframe.
def update_weights(self, model):
    # get nodes and activations for each layer
    layer_configs = ['input']
    for layer in model.layers:
        # add flatten since it affects the size of the weights
        if 'flatten' in layer.name:
            layer_configs.append(('flatten'))
        # don't add dropout since it doesn't affect weight sizes or activations
        elif 'dropout' not in layer.name:
            layer_configs.append((layer.get_config()['units'], layer.get_config()['activation']))
    # get bigrams of relevant layers for weight transfer
    config_ids = []
    for i in range(1, len(layer_configs)):
        config_ids.append((layer_configs[i - 1], layer_configs[i]))
    # for all layers
    j = 0
    for i, layer in enumerate(model.layers):
        if 'dropout' not in layer.name:
            warnings.simplefilter(action='ignore', category=FutureWarning)
            # get all bigram values we already have weights for
            bigram_ids = self.shared_weights['bigram_id'].values
            # check if the bigram already exists in the dataframe
            search_index = []
            for k in range(len(bigram_ids)):
                if config_ids[j] == bigram_ids[k]:
                    search_index.append(k)
            # add the weights to the dataframe in a new row if they aren't already available
            if len(search_index) == 0:
                self.shared_weights = self.shared_weights.append({'bigram_id': config_ids[j],
                                                                  'weights': layer.get_weights()},
                                                                 ignore_index=True)
            # else update the existing weights
            else:
                self.shared_weights.at[search_index[0], 'weights'] = layer.get_weights()
            j += 1
    self.shared_weights.to_pickle(self.weights_file)
Once we have the weight transfer functions ready, we can finally write a function to train our models.
Training Generated Architectures
The training function will set the model weights, train, and update the stored weights if one-shot learning is enabled. Otherwise, it will simply train the model and track the metrics. It takes the input data, the Keras model, the number of epochs to train for, the validation split, and callbacks as input, and trains the model accordingly.
In this implementation we haven't added the functionality to automatically train the best models for many more epochs once the search phase is done, but accepting callbacks as an argument lets us easily include, for example, early stopping in our final training.
def train_model(self, model, x_data, y_data, nb_epochs, validation_split=0.1, callbacks=None):
    if self.mlp_one_shot:
        # transfer any previously stored shared weights into the model
        self.set_model_weights(model)
        history = model.fit(x_data,
                            y_data,
                            epochs=nb_epochs,
                            validation_split=validation_split,
                            callbacks=callbacks,
                            verbose=0)
        # write the freshly trained weights back into the shared-weights store
        self.update_weights(model)
    else:
        history = model.fit(x_data,
                            y_data,
                            epochs=nb_epochs,
                            validation_split=validation_split,
                            callbacks=callbacks,
                            verbose=0)
    return history
And our MLP generator is ready.
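Putting the pieces together, a single architecture can be generated, compiled, and trained end to end roughly as shown below. This is a sketch using MNIST; the sequence, epoch count, and early-stopping callback are arbitrary choices for illustration, and the illustrative constants from earlier (10 target classes, categorical cross-entropy) are assumed:

import os
from keras.callbacks import EarlyStopping
from keras.datasets import mnist
from keras.utils import to_categorical

# load and prepare MNIST
(x_train, y_train), _ = mnist.load_data()
x_train = x_train.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)

# the shared-weights pickle lives under LOGS/ (see the init snippet above)
os.makedirs('LOGS', exist_ok=True)
generator = MLPGenerator()

# a hand-picked sequence standing in for one sampled by the controller
sequence = generator.encode_sequence([(128, 'relu'), 'dropout', (10, 'softmax')])

model = generator.create_model(sequence, mlp_input_shape=x_train.shape[1:])
model = generator.compile_model(model)

# train for a few epochs; history.history holds the validation accuracy curve
history = generator.train_model(model, x_train, y_train, nb_epochs=5,
                                callbacks=[EarlyStopping(patience=2)])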
Things To Remember
There are a few things to keep in mind while dealing with one-shot training, and it raises a few questions that aren't easily answered. For example:
- Are the models that get trained sooner at an inherent disadvantage while getting ranked, because there are no pre-trained weights that get transferred?
- Is it possible that the transfer of weights hurts a particular architecture's performance instead of improving it?
- How does the one-shot architecture methodology change the training of the controller?
Besides pondering the questions mentioned above, there's another implementation-specific detail worth noting.
The shared weights have to be stored in a way that can be searched and retrieved efficiently. I found that storing them in a Pandas dataframe makes the weight transfer take a lot longer towards the later stages of the NAS, since the dataframe is by then populated with many weights and searching through it for the right transfer takes longer. If you have another strategy to store and retrieve weights faster, you should definitely test it in your own implementation. This becomes even more important when the search space you navigate is huge, or when you want deeper or more complex architectures (think CNNs or ResNets).
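One simple alternative, sketched below, is a plain dictionary keyed by the bigram and pickled between runs, which gives constant-time lookups instead of a linear scan over dataframe rows. This is an assumption about what would be faster rather than something benchmarked in this project:

import os
import pickle

class SharedWeightStore:
    # dictionary-backed store mapping layer bigrams to lists of weight arrays
    def __init__(self, path='LOGS/shared_weights_dict.pkl'):
        self.path = path
        if os.path.exists(path):
            with open(path, 'rb') as f:
                self.store = pickle.load(f)
        else:
            self.store = {}

    def get(self, bigram):
        # O(1) lookup instead of scanning every stored row
        return self.store.get(bigram)

    def put(self, bigram, weights):
        self.store[bigram] = weights

    def save(self):
        with open(self.path, 'wb') as f:
            pickle.dump(self.store, f)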
Conclusion
In this second part of the Neural Architecture Search series, we looked at automatically converting encoded sequences into Keras architectures. We built a search space for our problem, a function to encode tuples describing layer configurations, and a decoding function to convert the encoded values back into layer configuration tuples.
We looked at transferring weights to each layer individually and also storing their weights for further use. We saw how to compile and train these models, and used Pandas dataframes to store layer weights according to the bigrams they created. We used the same bigrams to check whether there are weights that can be transferred in a new architecture.
Finally, we took these compiled models along with information like the loss function, optimizer, number of epochs, etc. to write a function for training the model.
In the next part, we will chalk out a controller that can create sequences of numbers that can be converted into valid architectures by the MLPGenerator. We will also look into how the controller itself is trained, and whether we can play around with the controller architecture to get better results.
I hope you enjoyed the article.