Ensembling Neural Network Models With Tensorflow

Boosting the performance and generalization of models by ensembling multiple neural network models.

21 days ago   •   14 min read

By Samuel Ozechi
Photo by Alina Grubnyak / Unsplash

Bring this project to life

Model ensembling is a common practice in machine learning to improve the performance and generalizability of models.  It can be simply described as the technique of combining multiple models in diverse ways to improve performance on a single problem.

The major merits of model ensembling include its ability to improve the performance, robustness and generalization of machine learning models on unseen data.

Tree-based algorithms are particularly known to perform better for certain tasks due to their ability to utilize the ensembling of multiple trees to improve the model's overall performance.

The ensemble members ( i.e models that are combined to a single model or prediction) are combined using diverse aggregation techniques such as simple or weighted averaging to other advanced techniques such as bagging, stacking or boosting.

In contrast to classical algorithms, it's not a common for multiple neural network models to be ensembled for a single task. This is mostly because single models are often enough to map the relationship between features and targets and ensembling multiple neural networks might complicate the model for smaller tasks and lead to overfitting of the model. It's most common for practitioners to increase the size of a single neural network model to properly fit the data than ensemble multiple neural network models into one.

However, for larger tasks, ensembling multiple neural networks, especially ones trained on similar problems, could prove quite useful and improve the performance of the final model and its ability to generalize on broader use cases.

Oftentimes, practitioners try out multiple models with different configurations to select the best-performing model for a problem. Neural network ensembling offers the option of utilizing the information of the different models to develop a more balanced and efficient model.

For example, combining multiple network models that were trained on a particular dataset (such as the Imagenet dataset) could prove useful to improve the performance of a model being trained on a dataset with similar classes. The different information learned by the models could be combined through model ensembling to improve the model's overall performance and robustness of an aggregated model. Before ensembling such models, it is important to train and fine-tune the ensemble members to ensure that each model contributes relevant information to the final model.

The Tensorflow API provides certain techniques for aggregating multiple network models using built-in network layers or by building custom layers.

In this tutorial, we ensemble custom and pre-trained network models that are trained on similar datasets using custom and built-in Tensorflow layers to approach an image classification problem. We would ensemble our models using concatenation, average and weighted average ensembling methods.

This aims to serve as a template of different techniques for combining multiple network models in your projects.

The Dataset

Our use case is an image classification of natural images. We would be using the Natural Images Dataset which contains 6899 images of natural images representing 8 classes. The represented classes in the dataset are aeroplane, car, cat, dog, fruit, motorbike and person.

Our aim is to develop a neural network model that is capable of identifying images belonging to the different classes and accurately classifying new images of each class.


We would develop three different models for our task and then ensemble using different methods. It is important to configure our models to provide relevant information to our final model.

Seed the environment for reproducibility and preview some of the images in the dataset representing each class.

# Seed environment 
seed_value = 1 # seed value 

# Set`PYTHONHASHSEED` environment variable at a fixed value 
import os os.environ['PYTHONHASHSEED']=str(seed_value) 

# Set the `python` built-in pseudo-random generator at a fixed value 
import random 
random.seed = seed_value 

# Set the `numpy` pseudo-random generator at a fixed value 
import numpy as np 
np.random.seed = seed_value 

# Set the `tensorflow` pseudo-random generator at a fixed value import tensorflow as tf tf.seed = seed_value

Set the data configurations and get the classes

# set configs 
base_path = './natural_images' 
target_size = (224,224,3) 

# define shape for all images # get classes classes = os.listdir(base_path) print(classes)
Image classes

Plot Sample images of the dataset

# plot sample images 

import matplotlib.pyplot as plt
import cv2

f, axes = plt.subplots(2, 4, sharex=True, sharey=True, figsize = (16,7))

for ax, label in zip(axes.ravel(), classes):
    img = np.random.choice(os.listdir(os.path.join(base_path, label)))
    img = cv2.imread(os.path.join(base_path, label, img))
    img = cv2.resize(img, target_size[:2])
    ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGRA2RGB))
Sample Image for each class

Load the images.

We load the images in batches using the Keras Image Generator and specify a validation split to split the dataset into training and validation splits.

We also apply random image augmentation to expand the training dataset in order to improve the performance of the model and its ability to generalize.

from keras.preprocessing.image import ImageDataGenerator

batch_size = 32

datagen = ImageDataGenerator(rescale=1./255,
                                   width_shift_range = 0.2,
                                   height_shift_range = 0.2,
                                   vertical_flip = True,
train_gen = datagen.flow_from_directory(base_path,
val_gen =  datagen.flow_from_directory(base_path,
Train and Validation Split

1. Build and train a custom model.

For our first model, we build a custom model using the Tensorflow functional method

# Build model 
input = Input(shape= target_size) 

x = Conv2D(filters=32, kernel_size=(3,3), activation='relu')(input) 
x = MaxPool2D(2,2)(x) 

x = Conv2D(filters=64, kernel_size=(3,3), activation='relu')(x) 
x = MaxPool2D(2,2)(x) 

x = Conv2D(filters=128, kernel_size=(3,3), activation='relu')(x) 
x = MaxPool2D(2,2)(x) 

x = Conv2D(filters=256, kernel_size=(3,3), activation='relu')(x) 
x = MaxPool2D(2,2)(x) 

x = Dropout(0.25)(x) 
x = Flatten()(x) 
x = Dense(units=128, activation='relu')(x) 
x = Dense(units=64, activation='relu')(x) 
output = Dense(units=8, activation='softmax')(x) 

custom_model  = Model(input, output, name= 'Custom_Model')

Next, we compile the custom model and initialize the callbacks to monitor and control training. For our callback, we reduce the learning rate when there is no improvement for a few epochs, we initialize EarlyStopping to stop the training process if the model does not further improve and also checkpoint the training to save our best model at each epoch. These are useful standard practices for training deep learning models.

# compile model
custom_model.compile(loss= 'categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']) 

# initialize callbacks 
reduceLR = ReduceLROnPlateau(monitor='val_loss', patience= 3, verbose= 1,                                  mode='min', factor=  0.2, min_lr = 1e-6) 

early_stopping = EarlyStopping(monitor='val_loss', patience = 5 , verbose=1,                                  mode='min', restore_best_weights= True) 

checkpoint = ModelCheckpoint('CustomModel.weights.hdf5', monitor='val_loss',                               verbose=1,save_best_only=True, mode= 'min') 

callbacks= [reduceLR, early_stopping,checkpoint]

Now we can define our training configurations and train our custom model.

# define training config 
TRAIN_STEPS = 5177 // batch_size 
VAL_STEPS = 1722 //batch_size 
epochs = 80 

# train model
custom_model.fit(train_gen, steps_per_epoch= TRAIN_STEPS, validation_data=val_gen, validation_steps=VAL_STEPS, epochs= epochs, callbacks= callbacks)

After training our custom model we can evaluate the model's performance. I will be evaluating the model on the validation set. In practice, it is preferable to set aside a separate test set for model evaluation.

# Evaluate the model 
Validation loss and accuracy

‌We can also retrieve the prediction and actual validation labels so as to check the classification report and confusion matrix evaluations for the custom model.

# get validation labels
val_labels = [] 
for i in range(VAL_STEPS + 1):

val_labels = np.argmax(val_labels, axis=1)
# show classification report 

from sklearn.metrics import classification_report 

print(classification_report(val_labels, predicted_labels, target_names=classes))
Classification Report
# function to plot confusion matrix

import itertools

def plot_confusion_matrix(actual, predicted):

    cm = confusion_matrix(actual, predicted)
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title('Confusion matrix', fontsize=25)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90, fontsize=15)
    plt.yticks(tick_marks, classes, fontsize=15)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], '.2f'),
        color="white" if cm[i, j] > thresh else "black", fontsize = 14)

    plt.ylabel('True label', fontsize=20)
    plt.xlabel('Predicted label', fontsize=20)
# plot confusion matrix
plot_confusion_matrix(val_labels, predicted_labels)
Confusion Matrix using our custom model

From the model evaluation, we notice that the custom model could not properly identify the images of cats and flowers. It also gives sub-optimal precision for identifying aeroplanes.

Next, we would try out a different model-building technique, by using weights learned by networks pre-trained on the Imagenet dataset to boost our model's performance in the classes with lower accuracy scores.

Bring this project to life

2. Using a pre-trained VGG16 model.

for our second model ,we would be using the VGG16 model pre-trained on the Imagenet dataset to train a model for our use case. This allows us to inherit the weights learned from training on a similar dataset to boost our model's performance.

Initializing and fine-tuning the VGG16 model.

# Import the VGG16 pretrained model 
from tensorflow.keras.applications import VGG16 

# initialize the model vgg16 = VGG16(input_shape=(224,224,3), weights='imagenet', include_top=False) 

# Freeze all but the last 3 layers for layer in vgg16.layers[:-3]: layer.trainable = False 

# build model 
input = vgg16.layers[-1].output # input is the last output from vgg16 

x = Dropout(0.25)(input) 
x = Flatten()(x) 
output = Dense(8, activation='softmax')(x) 

# create the model 
vgg16_model = Model(vgg16.input, output, name='VGG16_Model')

We initialize the pre-trained model and freeze all but the last three layers so we can utilize the weights of the models and learn new information in the last 3 layers that are specific to our use case. We also add a dropout layer to regularize the model from overfitting on our dataset before adding our final output layer.

We can then train our VGG16 fine-tuned model using similar training configurations to our custom model.

# compile the model 
vgg16_model.compile(optimizer= SGD(learning_rate=1e-3), loss= 'categorical_crossentropy', metrics= ['accuracy']) 

# reinitialize callbacks 
checkpoint = ModelCheckpoint('VggModel.weights.hdf5', monitor='val_loss', verbose=1,save_best_only=True, mode= 'min') 

callbacks= [reduceLR, early_stopping,checkpoint] 

# Train model 
vgg16_model.fit(train_gen, steps_per_epoch= TRAIN_STEPS, validation_data=val_gen, validation_steps=VAL_STEPS, epochs= epochs, callbacks= callbacks)

After training, we can evaluate the model's performance using similar scripts from the last model evaluation.

# Evaluate the model 

# get the model predictions 
predicted_labels = np.argmax(vgg16_model.predict(val_gen), axis=1) 

# show classification report 
print(classification_report(val_labels, predicted_labels, target_names=classes)) 

# plot confusion matrix 
plot_confusion_matrix(val_labels, predicted_labels)
Classification report using the pre-trained VGG model 
Confusion Matrix report using the pre-trained VGG model 

Using a pre-trained VGG16 model produced better performance, especially for the classes with lower scores.

Finally, before ensembling, we would use another pre-trained model with much lesser parameters than the VGG16. The aim is to utilize different architectures so we can benefit from their varying properties to improve the quality of our final model. We would be using the Mobilenet model pre-trained on the same Imagenet dataset.

3. Using a pre-trained Mobilenet model.

For our final model, we would be finetuning the mobilenet model

# initializing the mobilenet model
mobilenet = MobileNet(input_shape=(224,224,3), weights='imagenet', include_top=False)

# freezing all but the last 5 layers
for layer in mobilenet.layers[:-5]:
  layer.trainable = False

# add few mor layers
x = mobilenet.layers[-1].output
x = Dropout(0.5)(x)
x = Flatten()(x) 
x = Dense(32, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(16, activation='relu')(x)
output = Dense(8, activation='softmax')(x)

# Create the model
mobilenet_model = Model(mobilenet.input, output, name= "Mobilenet_Model")

For training the model, we maintain some of the previous configurations from our previous training.

# compile the model 
mobilenet_model.compile(optimizer= SGD(learning_rate=1e-3), loss= 'categorical_crossentropy', metrics= ['accuracy']) 

# reinitialize callbacks 
checkpoint = ModelCheckpoint('MobilenetModel.weights.hdf5', monitor='val_loss', verbose=1,save_best_only=True, mode= 'min') 

callbacks= [reduceLR, early_stopping,checkpoint] 

# model training 
mobilenet_model.fit(train_gen, steps_per_epoch= TRAIN_STEPS, validation_data=val_gen, validation_steps=VAL_STEPS, epochs=epochs, callbacks= callbacks)

After training our fine-tuned Mobilenet model, we can then evaluate its performance.

# Evaluate the model 

# get the model's predictions 
predicted_labels = np.argmax(mobilenet_model.predict(val_gen), axis=1) 

# show the classification report 
print(classification_report(val_labels, predicted_labels, target_names=classes)) 

# plot the confusion matrix 
plot_confusion_matrix(val_labels, predicted_labels)

The fine-tuned Mobilenet model performs better than the previous models and notably improves the identification of dogs and flowers.

While it would be fine to use the trained Mobilenet model or any of our models as our final choice, combining the structure and weights of the three models might prove better in developing a single, more generalizable model in production.

Ensembling The Models

To ensemble the models, we would be using three different ensembling methods as earlier stated: Concatenation, Average and Weighted Average.

The methods have their pros and cons which are often considered to determine the choice for a particular use case.

Concatenation Ensemble

This involves merging the parameters of multiple models into a single model. This is made possible in Tensorflow by the concatenation layer which receives inputs of tensors of the same shape (except for the concatenation axis) and merges them into one. This layer can therefore be used to merge all the information learnt by the different models side by side on a single axis into a single model. See here for more information about the Tensorflow concatenation layer.

To merge our models, we would simply input our models into the concatenation layer.

# concatenate the models

# import concatenate layer
from tensorflow.keras.layers import Concatenate

# get list of models
models = [custom_model, vgg16_model, mobilenet_model] 

input = Input(shape=(224, 224, 3), name='input') # input layer

# get output for each model input
outputs = [model(input) for model in models]

# contenate the ouputs
x = Concatenate()(outputs) 

# add further layers
x = Dropout(0.5)(x) 
output = Dense(8, activation='softmax', name='output')(x) # output layer

# create concatenated model
conc_model = Model(input, output, name= 'Concatenated_Model')

After concatenation, we added further layers as deemed appropriate for our task. In this example, I have added a dropout layer and a final output layer with the appropriate output dimensions of our use case.

Let's check the structure of our concatenated model.

# show model structure 
from tensorflow.keras.utils import plot_model 
Structure of the concatenated model

We can see how the three different functional models we've built are merged into one with further dropout and output layers.

This could prove useful for certain cases as we are utilizing all the information gained by the different models. The problem with this method is that the dimension of our final model is a concatenation of the dimensions of all the ensemble members which could lead to a dimensionality explosion, the inclusion of irrelevant information and consequently, overfitting.

Average Ensemble

Rather than risk exploding the model's dimension and overfitting for simpler tasks by concatenating multiple models, we could simply make our final model the average of all our models. This way, we are able to maintain a reasonable average dimension for our final model while utilizing the information of all our models.

This time we will use the Tensorflow Average layer which receives inputs and outputs their average. Therefore our final model is an average of the parameters of our input models. See here for more information about the average layer.

To ensemble our model by average,  we simply combine our models at the average layer.

# average ensemble model 

# import Average layer
from tensorflow.keras.layers import Average 

input = Input(shape=(224, 224, 3), name='input')  # input layer

# get output for each input model
outputs = [model(input) for model in models] 

# take average of the outputs
x = Average()(outputs) 

x = Dense(16, activation='relu')(x) 
x = Dropout(0.3)(x) 
output = Dense(8, activation='softmax', name='output')(x) # output layer

# create average ensembled model
avg_model = Model(input, output)

As previously seen,  further layers can be added to the network as deemed fit. We can also see the structure of the Average ensembled model.

# show model structure 

This approach to ensembling is particularly useful because by taking an average, the lower performance for certain classes is boosted by models with higher performance for such classes, thereby improving the model's overall performance and generalization.

The downside to this approach is that by taking the average, certain information from the models would not be included in the final model. Another is that because we are taking a simple average, the superiority and relevance of certain models are lost. As in our use case, the superiority in performance of the Mobilenet Model over other models is lost.

This could be resolved by taking a weighted average of the models when ensembling, such that it reflects the importance of each model to the final decision.

Weighted Average Ensemble

In weighted average ensembling, the outputs of the models are multiplied by an assigned weight and an average is taken to get a final model.  By assigning weights to the models when ensembling this way, we preserve their importance and contribution to the final decision.

While TensorFlow does not have a built-in layer for weighted average ensembling, we could utilize attributes of the built-in Layer class to implement a custom weighted average ensemble layer. For a weighted average ensemble, it is necessary that the weights sum up to one, we can ensure this by applying a softmax function on the weights before ensembling. You can see more about creating custom layers in TensorFlow here.

Implementing a custom weighted average layer

First, we define a custom function to set our weights.

# function for setting weights

import numpy as np

def weight_init(shape =(1,1,3), weights=[1,2,3], dtype=tf.float32):
    return tf.constant(np.array(weights).reshape(shape), dtype=dtype)

Here we have set a weight of 1,2 and 3 for our models respectively. We can then build our custom weighted average layer.

# implement custom weighted average layer

import tensorflow as tf
from tensorflow.keras.layers import Layer, Concatenate

class WeightedAverage(Layer):

    def __init__(self):
        super(WeightedAverage, self).__init__()
    def build(self, input_shape):
        self.W = self.add_weight(
    def call(self, inputs):
        inputs = [tf.expand_dims(i, -1) for i in inputs]
        inputs = Concatenate(axis=-1)(inputs) 
        weights = tf.nn.softmax(self.W, axis=-1)

        return tf.reduce_mean(weights*inputs, axis=-1)

To build our custom weighted average layer, we inherited attributes of the built-in Layer class and added weights using our weights initializing function. We apply a softmax to our initialized weights so that they sum up to one before multiplying it with the models.  Finally, we get the mean reduction of the weighted inputs to get our final model.

We can then ensemble our model using the custom weighted average layer.

input = Input(shape=(224, 224, 3), name='input')  # input layer

# get output for each input model
outputs = [model(input) for model in models] 

# get weighted average of outputs
x = WeightedAverage()(outputs)

output = Dense(8, activation='softmax')(x) # output layer

weighted_avg_model = Model(input, output, name= 'Weighted_AVerage_Model')

Let's check the structure of our weighted average model.

# plot model
Structure of the Weighted Average Ensemble Model 

As in the previous methods, our models are ensembled, but this time, as a weighted average of the three models. The output dimension of the ensembled model is also compatible with the required output dimension for our task without adding any further layers.

Also, rather than manually set our weights using a custom weight initializing function, we could also utilize Tensorflow built-in weight initializers to initialize weights that are optimized during training. See here for the available built-in weight initializers in Tensorflow.


Model ensembling provides methods of combining multiple models to boost the performance and generalization of machine learning models. Neural networks can be combined using Tensorflow by concatenation, average or custom weighted average methods. All the information from the ensemble members is preserved using the concatenation technique at the risk of dimension explosion. Using the average method maintains a reasonable dimension for our model but loses certain information and does not account for the importance of each contributing model. This can be resolved by using a custom weighted average with weights that are optimized during the training process. There is also the possibility of extending  Neural Network ensembling using other operations.

Add speed and simplicity to your Machine Learning workflow today

Get startedTalk to an expert

Spread the word

Keep reading