While working on a machine learning problem some months back, I realized that I needed a better way to save the best version of a model during training. This led me to find a way to save a model when specific conditions are satisfied. In this article, I will be discussing what checkpointing is and how it can be implemented using TensorFlow.
Prerequisites
- Familiarity with Python programming
- Understanding of basic machine learning terminologies
- Basic familiarity with TensorFlow
Checkpointing
Checkpointing generally involves saving the state of a system so that at a later time it can be restored for use. In this way, information that would have been otherwise lost is retained and reloaded for use later. Yes, it is quite similar to just saving a file but as a term is more frequently used when computational systems are saved under naturally complicated conditions.
In machine learning, checkpointing involves saving the current state of a model (architecture, weight, optimizer and so on) so that it can be reloaded for use later on.
Why Checkpointing?
- Interrupted Training Loops: When a training sequence gets intentionally or unintentionally terminated, the entire progress can be lost. This can be costly in terms of the cost of computing and development time. Checkpointing helps to store the model for later use when the training can be continued.
- Saving optimal models for production: During training, it is usually the case that the final model is not the best in terms of the desired metric. This might lead to one trying to identify the best epoch and then training for only that number of epochs. This however is an important time that will be wasted given the stochastic nature of these models. Checkpointing helps with saving the model when the desired condition is satisfied, making the training and saving of models a more straightforward process.
Tensorflow
Tensorflow is an open-source platform for end-to-end machine learning. Though it has support for multiple languages, the most popular option for machine learning is Python and that is what we will be using in this article. Tensorflow has a checkpointing method under tf.train.checkpoint
but in this piece, we will be using the ModelCheckpoint
callback method under the Keras module.
ModelCheckpoint and its Arguments
Callbacks are pieces of programs or scripts that are executed when a particular condition is satisfied. In our case, the condition is to save the model after some training epochs. The keras.callbacks.ModelCheckpoint
method implements the checkpointing but requires us to specify some arguments. We will discuss these briefly.
filepath
: This is the only required parameter and it refers to the location of the stored checkpoint. Required here implies that once this is specified, the checkpointing callback can be used with the other arguments already set to their default values.
monitor
: This parameter is a string that determines the metric to be monitored by the checkpointing callback during training. It is set to val_loss
by default but can be tweaked to be whatever best fits the scenario. To monitor training metrics, the val_
prefix is excluded. Also, if the desired metric's name is user-defined, that should be passed in as the parameter.
verbose
: this argument is an integer that is used to set the logging level and is either 0
or 1
representing none or all. It has a default value of 0
.
save_only_best_model
: this argument takes in a boolean value to specify that only the best model should be saved. This is based on whatever metric it was set to monitor. The default for this is False
and so if not changed, the model will be saved at every instance when it should be to the same file path. If the goal is to save the multiple models individually, care has to be taken as the model will overwrite the specified file at filename with every epoch if the filename is not formatted to reflect the epochs and differentiate the files.
mode
: The mode parameter is a string that can be set to one of {'auto', 'min', 'max'}
. 'min'
and 'max'
are respectively set for metrics that need to be minimised (eg. loss) and maximized (e.g. accuracy). When set to 'auto'
which is its default, metrics that represent accuracy or F-score are maximized and all other ones are minimized.
save_weights_only
: this parameter accepts a boolean value that determines whether just the model’s weights should be saved or not. If True
, the model.save_weights
method will be called instead of the model.save
function.
save_freq
: this parameter accepts the string 'epoch'
to save the model after every epoch or an integer to represent the frequency of the saving in terms of the number of batches. The default for this is 'epoch'
since saving by batches could make the monitored metric less reliable.
options
: This is an optional tf.train.CheckpointOptions
object that is used when save_weights_only
is True
. When it is false, the tf.saved_model.SaveOptions
object is used. These objects provide additional options for checkpointing or saving a model that is useful in distributed settings.
Now we know what checkpointing is, why it is necessary and the parameters required to define a checkpoint in TensorFlow. But how is it really used in practice? To answer this, we will be building a model to classify IMDB movie reviews into positive and negative classes.
Installing and importing required libraries
$pip install tensorflow tensorflow_datasets
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
tfds.disable_progress_bar()
In the code snippet above, we first install tensorflow
and tensorflow_datasets
with pip
via the command line in case they are not already installed in your development environment. Next, we import numpy
and tensorflow
. We also import keras
from tensorflow
as we will be making several calls from that specific module. Finally, we import tensorflow_datasets
and disable its progress bar.
Creating a visualization function
import matplotlib.pyplot as plt
def plot_graphs(history, metric):
plt.plot(history.history[metric])
plt.plot(history.history['val_'+metric], '')
plt.xlabel("Epochs")
plt.ylabel(metric)
plt.legend([metric, 'val_'+metric])
Next, we import matplotlib.pyplot
to create a function that will aid visualization later on. The function takes in a history object and a desired metric and plots the train and validation data for that metric against the number of epochs. Labels are also added for the x and y axes.
Bring this project to life
Load IMDB reviews dataset
dataset, info = tfds.load('imdb_reviews', with_info=True,
as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
print(info.description)
print(info.citation)
print(f"\n{train_dataset.element_spec}")
Here, we pass in the name of the dataset as a string and two parameters to determine the nature of the information returned. with_info
is a boolean that determines whether metadata about the dataset gets returned and as_supervised
is a boolean to determine whether the dataset is loaded for supervised learning applications or not.
Next, we separate the dataset into its train and test portions by passing in 'train'
and 'test'
as keys to the dataset variable. To conclude, we display the description and citation information from the information collected. We also view the element specifications of the train_dataset
variable to see the datatypes of its constituents. We can see that it contains two Tensors with tf.string
and tf.int64
data types respectively.
Configuring the data pipeline
print(info.splits)
{'test': <tfds.core.SplitInfo num_examples=25000>,
'train': <tfds.core.SplitInfo num_examples=25000>,
'unsupervised': <tfds.core.SplitInfo num_examples=50000>}
First, we use the splits attribute to see the size of the train and test data.This is given as 25000 each.
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
for review, label in train_dataset.take(1):
print('reviews: ', review.numpy()[:3])
print('\nlabels: ', label.numpy()[:3])
reviews: [b'This was a big disappointment for me. I think this is the worst Mastroianni-movie ever made. Cosmatos tries too hard to make this movie a masterpiece and that makes this movie a typical "art"-movie. I give 4/10 for this movie.'
b"This picture for me scores very highly as it is a hugely enjoyable ... These type of films often deserve a cult following:<br /><br />8/10."
b'Almost too well done... "John Carpenter\'s Vampires" was entertaining, ... "Vampires: Los Muertos" is almost too well done. (I give it 7 of 10)']
labels: [0 1 1]
In the code snippet above, we set a batch size and a buffer size. The batch size determines the number of samples that are processed at a time. It is passed into the batch method. The buffer size on the other hand is passed in as a parameter to the shuffle method where it determines the scope from which the method initially shuffles. By setting it to 10000, the method initially selects random samples from just 10000 out of the 25000 training samples.
Then, it replaces the selected samples with new samples from the remaining 15000 samples. Doing this iteratively, the entire dataset is divided into batches of the set batch size. It is worthy to note that the shuffle method is used for just the train dataset since the order does not really matter in the test dataset.
Finally, we display a portion of the dataset by taking a batch from the train_dataset
and printing the first 3 reviews and labels.
Creating the text encoder
So far, we still have the input data as raw text but it will need to be encoded into numbers for a model to train on it.
VOCAB_SIZE = 1000
encoder = keras.layers.TextVectorization(
max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))
First, we declare a vocabulary size as 1000
. This means that we are only taking care of encodings for a thousand words while the others are left as unknown words.
Then, we pass this integer into the max_tokens
method of the keras.layers.TextVectorization
method. Then, we adapt the encoder to our training data. We also use the map
method to extract just the text from the train_dataset
by adding a lambda
function for just that.
vocab = np.array(encoder.get_vocabulary())
vocab[:20]
array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
dtype='<U14')
encoded_example = encoder(review)[:3].numpy()
print(encoded_example)
array([[ 11, 14, 4, ..., 0, 0, 0],
[ 11, 433, 16, ..., 0, 0, 0],
[210, 100, 74, ..., 1, 5, 302]])
Here, we output 20 of the elements in the vocabulary of the encoder. Notice the '[UNK]'
string that represents all unknown words. We also output the encoding of three reviews. With the default setting that remained unchanged in the TextVectorization
method, this process will not be reversible. However, since we are building a classifier, that will not be a problem.
Building the model
model = keras.Sequential([
encoder,
keras.layers.Embedding(
input_dim=len(encoder.get_vocabulary()),
output_dim=64,
),
keras.layers.Bidirectional(keras.layers.LSTM(64)),
keras.layers.Dense(64, activation='relu'),
keras.layers.Dense(1)
])
The model is defined as a keras.Sequential
object. The first layer is the encoder we just built so that input encoding is the first thing that happens. Next is an embedding layer. This layer converts the sequences of word indices into sequences of trainable vectors. Post training, words with similar meanings have the same embeddings.
Next, we add a Long Short Term Memory (LSTM) layer wrapped in a keras.layers.Bidirectional
layer. The LSTM processes sequences by iterating through the elements and passing the results from one timestep to the next. Then the bidirectional wrapper propagates the input forward and backwards through the LSTM layer and concatenates the output.
When the LSTM has transformed the sequence to a single vector, the two keras.Layers.Dense
layers do final processing to convert the vector to a single logit as the classification output. If this value is greater than 0
, the classification is positive; else, it is negative.
All hidden layers were set to have just 64 neurons. This can of course be changed if it improves performance.
To be sure that this works, we parse a text to the untrained model via the model.predict function and print the result as displayed below.
sample_text = ('The movie was cool. The animation and the graphics '
'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print("Positive" if predictions[0]>0 else "Negative")
Negative
As expected, the model wrongly classifies the review.
Compiling the Model
We have come a long way but it was indeed important to see checkpointing used in an actual problem so that its utility can be appreciated. Next, we compile the model and define the checkpointing and early stopping callbacks.
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=True),
optimizer=keras.optimizers.Adam(1e-4),
metrics=['accuracy'])
checkpoint = keras.callbacks.ModelCheckpoint("best_model",
save_best_only=True)
stop_early = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
In the above code snippet, we compile the model with a BinaryCrossEntropy
loss, an Adam optimizer with a learning rate set as 1e-4
and 'accuracy'
as a metric.
Next, the checkpoint method is created by instantiating a keras.callbacks.ModelCheckpoint
object and passing in a string 'best_model'
as the file path to save to as well as setting save_best_only
to True
. All other parameters are left with their earlier stated default values.
Next, we define the early stopping callback and parse in parameters that make it monitor the validation loss and have a patience value of 5
epochs. This means that the training loop will stop when the validation loss has not increased after 5
epochs.
Training the model
Here we call the fit function parsing in the training data, validation data, number of epochs as 10
, validation steps and callbacks. The validation_steps
parameter limits the number of test batches we validate to 30
. This can be excluded in order to run the validation loop on the entire test set but is a costly process. The callback parameter is what does the trick by taking in both the defined early stopping and checkpointing callbacks as elements in a list. Running this cell trains the model and saves it every time a new best validation loss is obtained.
The results of this training are stored in the history variable and visualized as shown below focusing on accuracy and loss.
history = model.fit(train_dataset,
validation_data=test_dataset,
epochs=10,
validation_steps=30,
callbacks=[stop_early, checkpoint],
)
Epoch 1/10
391/391 [==============================] - 94s 229ms/step - loss: 0.6162 - accuracy: 0.5985 - val_loss: 0.8097 - val_accuracy: 0.5661
INFO:tensorflow:Assets written to: best_model/assets
INFO:tensorflow:Assets written to: best_model/assets
Epoch 2/10
391/391 [==============================] - 88s 223ms/step - loss: 0.4062 - accuracy: 0.8184 - val_loss: 0.3691 - val_accuracy: 0.8339
INFO:tensorflow:Assets written to: best_model/assets
INFO:tensorflow:Assets written to: best_model/assets
Epoch 3/10
391/391 [==============================] - 89s 225ms/step - loss: 0.3470 - accuracy: 0.8510 - val_loss: 0.3729 - val_accuracy: 0.8547
Epoch 4/10
391/391 [==============================] - 88s 223ms/step - loss: 0.3334 - accuracy: 0.8568 - val_loss: 0.3491 - val_accuracy: 0.8380
INFO:tensorflow:Assets written to: best_model/assets
INFO:tensorflow:Assets written to: best_model/assets
Epoch 5/10
391/391 [==============================] - 89s 226ms/step - loss: 0.3245 - accuracy: 0.8619 - val_loss: 0.3371 - val_accuracy: 0.8479
Epoch 6/10
391/391 [==============================] - 90s 226ms/step - loss: 0.3180 - accuracy: 0.8645 - val_loss: 0.3372 - val_accuracy: 0.8526
Epoch 7/10
391/391 [==============================] - 90s 228ms/step - loss: 0.3174 - accuracy: 0.8658 - val_loss: 0.3275 - val_accuracy: 0.8604
INFO:tensorflow:Assets written to: best_model/assets
INFO:tensorflow:Assets written to: best_model/assets
Epoch 8/10
391/391 [==============================] - 90s 227ms/step - loss: 0.3120 - accuracy: 0.8664 - val_loss: 0.3359 - val_accuracy: 0.8609
Epoch 9/10
391/391 [==============================] - 89s 225ms/step - loss: 0.3111 - accuracy: 0.8681 - val_loss: 0.3378 - val_accuracy: 0.8552
Epoch 10/10
391/391 [==============================] - 90s 226ms/step - loss: 0.3077 - accuracy: 0.8698 - val_loss: 0.3285 - val_accuracy: 0.8562
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
Load Best Model
loaded_model = keras.models.load_model('best_model')
sample_text = ('The movie was nice. The animation and the graphics '
'were out of this world. I would recommend this movie.')
predictions = loaded_model.predict(np.array([sample_text]))
print("Positive" if predictions[0][0]>0 else "Negative")
Positive
To load the saved best model, we simply parse the file path to the keras.models.load_model
method. Next, we predict by parsing a sample text to the model.predict
method. We get a positive result that corresponds with our interpretation.
Continue Training with checkpoints
To continue training a loaded model with checkpoints, we simply rerun the model.fit
function with the callback still parsed. This however overwrites the currently saved best model, so make sure to change the checkpoint file path if this is undesired.
loaded_model = keras.models.load_model('best_model')
new_history = loaded_model.fit(train_dataset, epochs=20,
validation_data=test_dataset,
validation_steps=30,
callbacks=[stop_early, checkpoint],
)
Conclusion
In this tutorial, you have learnt what checkpointing is, why it is important, the basics of natural language processing in the form of sentiment analysis and how to implement checkpointing in a practical scenario. I hope you found this useful.