In the first part of this series we saw an overview of neural architecture search, including a review of the state of the art in the literature. In Part 2 we saw how to turn our encoded sequences into MLP models. We also looked at training these models, transferring weights layer by layer for one-shot learning, and saving those weights.

To get to these encoded sequences, we need another mechanism that will generate the sequences in a way that corresponds to valid architectures of MLPs. We also want to make sure we are not training the same architecture twice. That's what we'll cover here.

Specifically, we'll go over the following topics in this article:

  • Role of Controllers in NAS
  • Creating the Controller
  • Controller Architecture
  • Accuracy Predictors
  • Training the Controller
  • Sampling Architectures
  • Getting Predicted Accuracies
  • Conclusion

Role of Controllers in NAS

The way to get these encoded sequences is by using a recurrent network that will continuously generate the sequences for us. Each sequence is a part of the search space that we navigate in a directed fashion. The direction of our search for the best architecture is determined by how we train the controller itself.

We'll break the code down into sections, but the complete code can also be found here.

In Part 2 we saw that the general pipeline for our NAS project would look something like this:

def search(self):
	# for number of controller epochs 
	for controller_epoch in range(controller_sampling_epochs):

		# sample a set number of architecture sequences
		sequences = sample_architecture_sequences(controller_model, samples_per_controller_epoch)

		# predict their accuracies using a hybrid controller
		pred_accuracies = get_predicted_accuracies(controller_model, sequences)

		# for each of these sequences
		for i, sequence in enumerate(sequences):

			# create and compile the model corresponding to the sequence
			model = create_architecture(sequence)

			# train said model
			history = train_architecture(model)

			# log the training metrics
			append_model_metrics(sequence, history, pred_accuracies[i])

		# use this data to train the controller
		xc, yc, val_acc_target = prepare_controller_data(sequences)
		train_controller(controller_model, xc, yc, val_acc_target)

The inner loop of the search is mostly the task of the model generator, whereas the outer loop is that of the controller. There are functions in both the inner and outer loops that involve preparing data and storing model metrics that are necessary for our generator and controller to work smoothly. These are part of the main MLPNAS class, and we will look at these in the next (and final) part of the series.

Now we'll focus on the controller, in particular:

  • How it is designed, and different alternatives to the controller design
  • How to generate valid sequences that can be passed to the MLP generator to create and train architectures
  • How to train the controller itself

The controller architecture can be designed to incorporate an accuracy predictor. This can be done in multiple ways, for example by sharing the LSTM weights between the controller and the accuracy predictor. There are some drawbacks to using an accuracy predictor, which we will also discuss.

Creating the Controller

The controller is a recurrent system which generates the encoded sequences according to the mapping we designed in the search space (in the second part of the series). This controller will be a model that can be iteratively trained on the sequences it generates. It starts out generating sequences without any knowledge of what a well-performing architecture looks like. We sample a few sequences, train the architectures they encode, evaluate them, and build a dataset from the results to train our controller. In essence, in every controller epoch, a new dataset is created for the controller to learn from.

To be able to do any of this, we'll need a few constants initialized in the controller class. The Controller will inherit from the MLPSearchSpace class we created in the second part of the series.

These constants include:

  • Dimension (number of hidden units) of the controller LSTM layer
  • Optimizer to be used for training
  • Learning rate to be used for training
  • Decay to be used for training
  • Momentum for training (in case of SGD optimizer)
  • Whether to use an accuracy predictor or not
  • The maximum length of an architecture

class Controller(MLPSearchSpace):

    def __init__(self):
		
        # defining training and sequence creation related parameters
        self.max_len = MAX_ARCHITECTURE_LENGTH
        self.controller_lstm_dim = CONTROLLER_LSTM_DIM
        self.controller_optimizer = CONTROLLER_OPTIMIZER
        self.controller_lr = CONTROLLER_LEARNING_RATE
        self.controller_decay = CONTROLLER_DECAY
        self.controller_momentum = CONTROLLER_MOMENTUM
        self.use_predictor = CONTROLLER_USE_PREDICTOR
        
        # file path of controller weights to be stored at
        self.controller_weights = 'LOGS/controller_weights.h5'

        # initializing a list for all the sequences created
        self.seq_data = []

        # inheriting from the search space
        super().__init__(TARGET_CLASSES)

        # number of classes for the controller (+ 1 for padding)
        self.controller_classes = len(self.vocab) + 1

We also initialize an empty list to store all the encoded architectures our controller has already created and tested. This will prevent our controller from sampling the same sequences over and over again. We need this list to be initialized at the start (as opposed to inside a function) because it needs to persist through multiple controller epochs, whereas the sampling function is called several times.
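
The duplicate check can be sketched in isolation. The snippet below is a minimal illustration rather than the code used in this series: `SequenceRegistry` and `add_if_new` are hypothetical names, and the sequences are stored as tuples in a set (instead of the plain list used by the controller) so that membership checks stay cheap as the history grows.

```python
class SequenceRegistry:
    """Hypothetical helper: remembers every architecture sequence seen so far."""

    def __init__(self):
        # created once, so it persists across controller epochs
        # (the same role self.seq_data plays in the Controller)
        self._seen = set()

    def add_if_new(self, sequence):
        """Store the sequence and return True only if it was not seen before."""
        key = tuple(sequence)  # lists are unhashable, tuples are not
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


registry = SequenceRegistry()
candidates = [[3, 7, 12], [3, 7, 12], [5, 12]]

# keep only the sequences that have not been sampled before
fresh = [seq for seq in candidates if registry.add_if_new(seq)]
print(fresh)  # [[3, 7, 12], [5, 12]]
```

Whether a set is worth it depends on how many architectures you expect to sample; for a few hundred sequences a list works just as well.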

In cases where you are not just designing MLPs but deep CNN-based architectures (similar to Inception or ResNets), you might want to consider not storing these architectures in a list but saving them in a temporary file to free up memory.

Controller Architecture

The controller can be designed in several ways, and there's no real limit to the amount of experimentation one can do. Fundamentally we need a sequential output that can be extracted from the controller and decoded into real MLP architectures. RNNs and LSTMs sound like great options for this.

Trying different optimization techniques for the controller's learning mostly requires us to deal with different optimizers or building custom loss functions.

A simple LSTM controller can be seen below.

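    def control_model(self, controller_input_shape, controller_batch_size):
        # input layer: only the per-sample shape is passed, because newer Keras versions
        # reject `shape` combined with `batch_shape`; controller_batch_size stays in the
        # signature for existing callers but is not needed to build the model
        main_input = Input(shape=controller_input_shape, name='main_input')
        # LSTM layer that returns an output for every timestep
        x = LSTM(self.controller_lstm_dim, return_sequences=True)(main_input)
        # softmax over the vocabulary ids (plus padding)
        main_output = Dense(self.controller_classes, activation='softmax', name='main_output')(x)
        model = Model(inputs=[main_input], outputs=[main_output])
        return model

The architecture shown above is pretty straightforward. There's:

  • An input layer, its size dependent on the input shape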
  • An LSTM layer, with user-specified dimensions
  • A dense layer, with nodes dependent on the size of the vocabulary

This is a sequential architecture, and can be trained using the optimizer and loss functions of our choosing.
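
The size of that dense layer follows directly from the vocabulary. As a minimal sketch, the toy vocabulary below is an assumption made up for illustration (the real one comes from the search space in Part 2); it only shows why the output layer needs one class more than there are vocabulary entries.

```python
# Toy vocabulary (hypothetical): ids start at 1 and map to (nodes, activation);
# the last two ids stand for the dropout layer and the final layer.
toy_vocab = {
    1: (8, "relu"),
    2: (8, "tanh"),
    3: (16, "relu"),
    4: (16, "tanh"),
    5: "dropout",
    6: (2, "softmax"),
}

# id 0 is reserved for padding, so the softmax layer needs one extra class
controller_classes = len(toy_vocab) + 1

# the ids the controller chooses from, in the order of the softmax outputs
vocab_idx = [0] + list(toy_vocab.keys())

print(controller_classes)  # 7
print(vocab_idx)           # [0, 1, 2, 3, 4, 5, 6]
```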

There are other ways to design these controllers. Starting from the architecture shown above, we could:

  • Vary the LSTM dimensions of the LSTM layer in our architecture
  • Add more LSTM layers and vary their dimensions
  • Add dense layers, and vary the number of nodes and activation functions

Other methods include building models that can output two things at once.

Accuracy Predictors

This simple LSTM architecture can be turned into a hybrid, multi-output model that accounts not just for optimization based on a loss function for the sequences, but also for a parallel output that acts as an accuracy predictor. The accuracy predictor will share weights with the LSTM layer of our sequence generator and help us create architectures that generalize better.

def hybrid_control_model(self, controller_input_shape, controller_batch_size):
    # input layer initialized with the per-sample input shape only; newer Keras versions
    # reject `shape` combined with `batch_shape`, so controller_batch_size is left unused
    main_input = Input(shape=controller_input_shape, name='main_input')
    
    # LSTM layer
    x = LSTM(self.controller_lstm_dim, return_sequences=True)(main_input)
    
    # two layers take the same LSTM layer as the input, 
    # the accuracy predictor as well as the sequence generation classification layer
    predictor_output = Dense(1, activation='sigmoid', name='predictor_output')(x)
    main_output = Dense(self.controller_classes, activation='softmax', name='main_output')(x)
    
    # finally the Keras Model class is used to create a multi-output model
    model = Model(inputs=[main_input], outputs=[main_output, predictor_output])
    return model

The predictor output will be a single-neuron dense layer with a sigmoid activation function. The output of this layer will serve as a proxy for an architecture's validation accuracy.
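
To see why a sigmoid output can stand in for an accuracy, it helps to look at the function on its own. The sketch below uses only the standard library, and the raw scores are made-up values rather than outputs of the actual controller: whatever the neuron produces before the activation, the result lands strictly between 0 and 1, the same range as a validation accuracy.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# hypothetical pre-activation values coming out of the single predictor neuron
raw_scores = [-4.0, -1.0, 0.0, 1.0, 4.0]
predicted_accuracies = [sigmoid(z) for z in raw_scores]

for z, acc in zip(raw_scores, predicted_accuracies):
    print(f"raw score {z:+.1f} -> predicted accuracy {acc:.3f}")
```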

We could also separate the accuracy predictor LSTM from the main output LSTM, as is done below.

def hybrid_control_model(self, controller_input_shape, controller_batch_size):
    # input layer initialized with the per-sample input shape only; newer Keras versions
    # reject `shape` combined with `batch_shape`, so controller_batch_size is left unused
    main_input = Input(shape=controller_input_shape, name='main_input')
    
    # LSTM layer
    x1 = LSTM(self.controller_lstm_dim, return_sequences=True)(main_input)
    # output for the sequence generator network
    main_output = Dense(self.controller_classes, activation='softmax', name='main_output')(x1)

    # LSTM layer
    x2 = LSTM(self.controller_lstm_dim, return_sequences=True)(main_input)
    # single neuron sigmoid layer for accuracy prediction
    predictor_output = Dense(1, activation='sigmoid', name='predictor_output')(x2)
    
    # finally the Keras Model class is used to create a multi-output model
    model = Model(inputs=[main_input], outputs=[main_output, predictor_output])
    return model

That way we won't affect our sequence predictions due to the accuracy predictor, but we still have a network learning to predict accuracies of architectures without training them.

In the next and final part we'll see how our loss function already accounts for the validation accuracy of each architecture when training the controller, by applying the REINFORCE gradient. In fact, the accuracy predictor's meddling leads the controller to create architectures that don't reach as high an accuracy as the ones generated using only one-shot learning. We mention the accuracy predictor here for the sake of thoroughness.

Training the Controller

Once we have the model ready, we write a function to train it. As input the function will take the loss function, data, batch size, and number of epochs. This way we can use a custom loss function to train our controller.

The training code below is for the simple controller mentioned above. It doesn't include the accuracy predictor.

def train_control_model(self, model, x_data, y_data, loss_func, controller_batch_size, nb_epochs):
    # get the optimizer required for training
    if self.controller_optimizer == 'sgd':
        optim = optimizers.SGD(lr=self.controller_lr,
                               decay=self.controller_decay,
                               momentum=self.controller_momentum)
    else:
        optim = getattr(optimizers, self.controller_optimizer)(lr=self.controller_lr, 
                                                   decay=self.controller_decay)
                                                   
    # compile model depending on loss function and optimizer provided
    model.compile(optimizer=optim, loss={'main_output': loss_func})
    
    # load controller weights
    if os.path.exists(self.controller_weights):
        model.load_weights(self.controller_weights)
        
    # train the controller
    print("TRAINING CONTROLLER...")
    model.fit({'main_input': x_data},
              {'main_output': y_data.reshape(len(y_data), 1, self.controller_classes)},
              epochs=nb_epochs,
              batch_size=controller_batch_size,
              verbose=0)
    
    # save controller weights
    model.save_weights(self.controller_weights)

To train the model with an accuracy predictor, the function above is modified in two places: the compile step and the fit step. The compile step needs to include the two losses used for the two outputs, and the weight given to each loss. Similarly, the fit call needs to include the second output in the output dictionary, along with the predictor targets (`pred_target`). For the predictor we use mean squared error as the loss function.
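
When a Keras model is compiled with several losses and `loss_weights`, the quantity it minimizes is the weighted sum of the individual losses. The sketch below works through that sum by hand. It is only an illustration: cross-entropy is used as a stand-in for the controller's custom loss function, and all numbers are invented.

```python
import math


def categorical_crossentropy(target_one_hot, predicted_probs):
    """Cross-entropy for one sample (stand-in for the controller's custom loss)."""
    eps = 1e-7  # avoids log(0)
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(target_one_hot, predicted_probs))


def mean_squared_error(targets, predictions):
    """MSE between target accuracies and predicted accuracies."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# hypothetical values for one training sample
main_loss = categorical_crossentropy([0, 0, 1, 0], [0.1, 0.2, 0.6, 0.1])
predictor_loss = mean_squared_error([0.72], [0.65])

# same structure as the loss_weights dictionary passed to model.compile
loss_weights = {"main_output": 1.0, "predictor_output": 1.0}

total_loss = (loss_weights["main_output"] * main_loss
              + loss_weights["predictor_output"] * predictor_loss)
print(round(main_loss, 4), round(predictor_loss, 4), round(total_loss, 4))
```

With equal weights, the term with the larger magnitude dominates the gradient, so the weights are a natural knob if one output should matter more than the other.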

def train_control_model(self, model, x_data, y_data, pred_target, loss_func, controller_batch_size, nb_epochs):
    # get the optimizer required for training
    if self.controller_optimizer == 'sgd':
        optim = optimizers.SGD(lr=self.controller_lr,
                               decay=self.controller_decay,
                               momentum=self.controller_momentum)
    else:
        optim = getattr(optimizers, self.controller_optimizer)(lr=self.controller_lr, 
                                                   decay=self.controller_decay)
                                                   
    # compile model depending on loss function and optimizer provided
    model.compile(optimizer=optim,
                  loss={'main_output': loss_func, 'predictor_output': 'mse'},
                  loss_weights={'main_output': 1, 'predictor_output': 1})

    # load controller weights
    if os.path.exists(self.controller_weights):
        model.load_weights(self.controller_weights)
        
    # train the controller
    print("TRAINING CONTROLLER...")
    model.fit({'main_input': x_data},
              {'main_output': y_data.reshape(len(y_data), 1, self.controller_classes),
              'predictor_output': np.array(pred_target).reshape(len(pred_target), 1, 1)},
              epochs=nb_epochs,
              batch_size=controller_batch_size,
              verbose=0)
    
    # save controller weights
    model.save_weights(self.controller_weights)

Sampling Architectures

Once the model architecture and training functions are done, we need to finally use these models to predict architecture sequences. If you're using an accuracy predictor you'll need a function to acquire the predicted accuracies as well.

The sampling process requires us to encode some rules about how MLP architectures are designed in order to avoid invalid architectures. We also don't want the same architecture created again and again. Some other things to consider are:

  • When and where the dropout layer appears
  • The maximum length of architectures
  • The minimum length of architectures
  • How many architectures to sample in every controller epoch

These concerns are taken into account in the function below for sampling architecture sequences. We run a nested loop: the outer loop continues until we have sampled the number of sequences we require. The inner loop uses the controller model to predict the next element in each architecture sequence, starting with an empty sequence and ending with an architecture that has one or more hidden layers. Other constraints are that dropout can't be the first layer, and the final layer can appear only once, at the end.

While generating the next element in the sequence, we randomly sample it given the probability distribution for all possible elements. This allows us to utilize the softmax distribution acquired from the controller to navigate the search space. Probabilistic sampling aids efficient exploration of the search space, while not deviating too much from what the controller model dictates.
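
The difference between sampling from the distribution and always taking the most likely element can be shown with a small standalone sketch. It uses Python's `random` module instead of NumPy, and the probabilities are a made-up softmax output; `sample_next_element` and `greedy_next_element` are hypothetical helper names.

```python
import random


def sample_next_element(vocab_idx, probabilities, rng):
    """Draw one id at random, weighted by the (softmax) probabilities."""
    return rng.choices(vocab_idx, weights=probabilities, k=1)[0]


def greedy_next_element(vocab_idx, probabilities):
    """Always pick the id with the highest probability (no exploration)."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return vocab_idx[best]


vocab_idx = [0, 1, 2, 3]
probabilities = [0.05, 0.6, 0.25, 0.1]  # hypothetical softmax output

rng = random.Random(0)  # seeded so the run is repeatable
draws = [sample_next_element(vocab_idx, probabilities, rng) for _ in range(1000)]
counts = {idx: draws.count(idx) for idx in vocab_idx}

print("greedy choice:", greedy_next_element(vocab_idx, probabilities))
print("sampled counts:", counts)
```

The greedy choice would return the same element every time, so every sampled architecture would start the same way. Sampling still favors the likeliest element, but less likely ones get picked often enough to be explored.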

    def sample_architecture_sequences(self, model, number_of_samples):
        # define values needed for sampling 
        final_layer_id = len(self.vocab)
        dropout_id = final_layer_id - 1
        vocab_idx = [0] + list(self.vocab.keys())
        
        # initialize list for architecture samples
        samples = []
        print("GENERATING ARCHITECTURE SAMPLES...")
        print('------------------------------------------------------')
        
        # while number of architectures sampled is less than required
        while len(samples) < number_of_samples:
            
            # initialise the empty list for architecture sequence
            seed = []
            
            # while len of generated sequence is less than maximum architecture length
            while len(seed) < self.max_len:
                
                # pad sequence for correctly shaped input for controller
                sequence = pad_sequences([seed], maxlen=self.max_len - 1, padding='post')
                sequence = sequence.reshape(1, 1, self.max_len - 1)
                
                # given the previous elements, get softmax distribution for the next element
                if self.use_predictor:
                    (probab, _) = model.predict(sequence)
                else:
                    probab = model.predict(sequence)
                probab = probab[0][0]
                
                # sample the next element randomly given the probability of next elements (the softmax distribution)
                next_element = np.random.choice(vocab_idx, size=1, p=probab)[0]
                
                # first layer isn't dropout
                if next_element == dropout_id and len(seed) == 0:
                    continue
                # first layer is not final layer
                if next_element == final_layer_id and len(seed) == 0:
                    continue
                # if final layer, break out of inner loop
                if next_element == final_layer_id:
                    seed.append(next_element)
                    break
                # if sequence length is 1 less than maximum, add final
                # layer and break out of inner loop
                if len(seed) == self.max_len - 1:
                    seed.append(final_layer_id)
                    break
                # ignore padding
                if next_element != 0:
                    seed.append(next_element)
            
            # check if the generated sequence has been generated before.
            # if not, add it to the sequence data. 
            if seed not in self.seq_data:
                samples.append(seed)
                self.seq_data.append(seed)
        return samples
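
The rules enforced inside the sampling loop can also be written as a standalone check, which is handy for testing sampled sequences. This is a sketch under the same conventions as the function above (id 0 is padding, the final layer has the highest id, dropout the one before it); `is_valid_sequence` is a hypothetical helper, not part of the series' code.

```python
def is_valid_sequence(sequence, dropout_id, final_layer_id, max_len):
    """Return True if the sequence obeys the sampling rules described above."""
    # at least one hidden layer plus the final layer, at most max_len elements
    if not 2 <= len(sequence) <= max_len:
        return False
    # the first layer can be neither dropout nor the final layer
    if sequence[0] in (dropout_id, final_layer_id):
        return False
    # the sequence must end with the final layer ...
    if sequence[-1] != final_layer_id:
        return False
    # ... and the final layer must not appear anywhere else
    if final_layer_id in sequence[:-1]:
        return False
    # padding (id 0) is never part of a sampled sequence
    if 0 in sequence:
        return False
    return True


# toy ids: 1-4 are hidden layers, 5 is dropout, 6 is the final layer
print(is_valid_sequence([1, 5, 6], dropout_id=5, final_layer_id=6, max_len=4))  # True
print(is_valid_sequence([5, 1, 6], dropout_id=5, final_layer_id=6, max_len=4))  # False
```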

Getting Predicted Accuracies

Another part of getting predictions is acquiring the predicted accuracy for each model. The input to this function will be the controller model and the sequences generated. The output will be, for each sequence, a number between 0 and 1: the predicted validation accuracy of the corresponding architecture.

    def get_predicted_accuracies_hybrid_model(self, model, seqs):
        pred_accuracies = []        
        for seq in seqs:
            # pad each sequence
            control_sequences = pad_sequences([seq], maxlen=self.max_len, padding='post')
            xc = control_sequences[:, :-1].reshape(len(control_sequences), 1, self.max_len - 1)
            # get predicted accuracies
            (_, pred_accuracy) = [x[0][0] for x in model.predict(xc)]
            pred_accuracies.append(pred_accuracy[0])
        return pred_accuracies
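
The padding and slicing steps in this function are easier to follow on a concrete sequence. The sketch below uses a small pure-Python stand-in for post-padding (`pad_post` is a hypothetical helper, and it assumes the sequence is no longer than `maxlen`, which holds for sequences produced by the sampler) to show what the controller actually receives as input.

```python
def pad_post(sequence, maxlen, value=0):
    """Stand-in for post-padding: append `value` until the list has `maxlen` items.

    Assumes len(sequence) <= maxlen.
    """
    return list(sequence) + [value] * (maxlen - len(sequence))


max_len = 5
seq = [3, 1, 6]  # toy encoded architecture

padded = pad_post(seq, max_len)   # zeros are added at the end
controller_input = padded[:-1]    # drop the last position, as in [:, :-1]
batch = [[controller_input]]      # nested to shape (1, 1, max_len - 1)

print(padded)            # [3, 1, 6, 0, 0]
print(controller_input)  # [3, 1, 6, 0]
```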

And we're done with building our controller.

Conclusion

Previously we saw how to write our model generator, which would implement one-shot learning as an optional feature. The model generator is only useful if we already have encoded sequences of the architectures, and the search space that maps these encodings. We therefore defined the search space along with the generator in the second part of the series.

In this part we learned how we can use an LSTM-based architecture to sample the encoded sequences of architectures. We looked at different approaches to designing the controller, and how to utilize an accuracy predictor. The accuracy predictor doesn't necessarily create superior architectures, but it does help in creating models that generalize better. We also discussed sharing weights between the accuracy predictor and the sequence generator.

After that we learned how to train these controller models, depending on whether they are single-output or multi-output models. We looked at the architecture-encoding generator itself, which takes into account different constraints to create architectures that are valid in terms of the order of layers used, the maximum length of these architectures, etc. We finished our controller with a tiny function for getting the predicted accuracies given a model and the sequences for which we need predictions.

The next and final part of this series will wrap up the complete workflow of MLPNAS. We will finally integrate the model generator from the second part with the controller from this part to automate the process of neural architecture search. We will also learn about the REINFORCE gradient as an optimization method for our controller.

Stay tuned.