How to maximize GPU utilization by finding the right batch size

In this article, we look at how to use various tools to maximize GPU utilization by finding the right batch size for model training in Gradient Notebooks.

16 days ago   •   8 min read

Bring this project to life

One of the questions new data scientists and ML engineers ask most often is whether their deep learning training processes are running optimally. In this guide, we will learn how to diagnose and fix deep learning performance issues, whether we are working on one machine or many. This will help us make practical and effective use of Paperspace's wide variety of GPUs.

We will start by understanding what GPU utilization is, and we'll finish by discussing the optimal batch size for maximum GPU utilization.

Note: This guide assumes we have a basic understanding of the Linux operating system and the Python programming language. Ubuntu and most other recent Linux distributions ship with Python pre-installed, so we only need to install pip and conda, which we will use here.

What is GPU Utilization?

In machine learning and deep learning training sessions, GPU utilization is one of the most important metrics to observe. It is reported by both built-in and third-party GPU monitoring tools.

We can define GPU utilization as the percentage of time over the last sample period (typically the last second) during which one or more GPU kernels were running. In other words, it tells us how heavily a deep learning program is actually using the GPU.
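As a rough illustration, the utilization figure described above can be read programmatically from nvidia-smi. The sketch below assumes an NVIDIA driver with nvidia-smi available on the PATH. The helper names parse_gpu_stats and query_gpu_stats are hypothetical and not part of any library, and the parser assumes every field is numeric (some GPUs report a field as not available, which this sketch does not handle).

```python
import subprocess

# Fields requested from nvidia-smi: utilization in percent, memory in MiB.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        util, mem_used, mem_total = (int(field.strip()) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": mem_used, "mem_total_mib": mem_total}
        )
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return one dict per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


# Example usage on a machine with an NVIDIA GPU:
#     for index, gpu in enumerate(query_gpu_stats()):
#         print(index, gpu)
```

Calling query_gpu_stats() repeatedly during training (for example once per second) gives a simple view of whether the GPU is kept busy or spends much of its time idle.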

How do you know you need more GPU compute?

Let us look at a real scenario.

On a typical day, a data scientist gets two GPUs to work with, which “should” be sufficient. On most days during the build phase, the short interactive GPU cycles are no problem and the workflow is smooth. Then the training phase kicks in, and suddenly the workflow demands additional GPU compute that is not readily available.

This means that more compute resources are required to do any significant work. In particular, the following tasks become impossible once all GPU memory is allocated:

• Run more experiments
• Run multi-GPU training to speed up training, experiment with larger batch sizes, and achieve higher model accuracy
• Work on a new model while another model trains independently

Benefits of GPU Utilization

In general, these improvements can translate into roughly double the hardware utilization and a 100% increase in model training speed.

• Better GPU utilization lets us manage resource allocations more efficiently, which ultimately reduces GPU idle time and increases cluster utilization
• From the point of view of a deep learning specialist, having more GPU compute available leaves room to run more experiments, which improves both productivity and model quality
• Additionally, IT administrators can run distributed training across multiple GPUs, such as the NVLink multi-GPU machines offered by Paperspace, which shortens training times

The optimal batch size for GPU utilization

Choosing a batch size is often confusing because there is no single “best” batch size for a given dataset and model architecture. A larger batch size will train faster and consume more memory, but it might show lower accuracy in the end. First, let us understand what a batch size is and why we need it.

What is a batch size?

It is important to specify a batch size when training a model such as a deep neural network. Put simply, the batch size is the number of samples that are passed through the network at one time.

Batch size in an example

Let's say we want to train our network to recognize different cat breeds using 1000 photos of cats, and that we have chosen a batch size of 10. This means the network receives 10 photographs of cats at a time as a group, or batch.
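To make the arithmetic concrete: with 1000 photos and a batch size of 10, one epoch consists of 100 batches. The following is a minimal sketch of that calculation. The function name is illustrative, and it rounds up so that a final, smaller batch is still counted.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once.

    The last batch may be smaller than batch_size when the dataset size
    is not an exact multiple of it.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_samples / batch_size)


print(batches_per_epoch(1000, 10))  # 100
print(batches_per_epoch(1000, 32))  # 32 (31 full batches plus one of 8 samples)
```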

Cool, we have the idea of batch size now, but what’s the point? We could just pass each data element individually to our model rather than putting the data in batches. We've explained why we need them in the section below.

Why use batches?

We mentioned earlier that a larger batch size helps a model complete each epoch of training more quickly. This is because, depending on the computational resources available, a machine may be able to process many more than one sample at a time.

However, even if our machine can handle very large batches, the quality of the final model may degrade as the batch size grows, which may ultimately limit the model's ability to generalize to new data.

We can now conclude that the batch size is another hyperparameter we need to assess and tweak depending on how a particular model behaves during training. It also needs to be examined to see how well our machine utilizes the GPU at different batch sizes.

For instance, if we set our batch size to a rather high value, say 100, our machine may not have enough capacity to process all 100 images simultaneously. This would indicate that we need to reduce the batch size.

Now that we have a general idea of what a batch size is, let's see how we can find the right batch size in code using PyTorch and Keras.

Find the right batch size using PyTorch

In this section we will walk through finding the right batch size for a ResNet18 model. We will use the PyTorch profiler to measure the training performance and GPU utilization of the model. You can follow along with this tutorial within a Gradient Notebook by clicking the "Run on Gradient" link at the top of this page.


To show how TensorBoard can be used with PyTorch to monitor model performance, we will use the PyTorch profiler in this code with some extra options turned on.

Setup and preparation of data and model

Run the following command to install torch, torchvision, and the PyTorch profiler plugin for TensorBoard.

pip3 install torch torchvision torch-tb-profiler

The following code downloads the CIFAR10 dataset. Next, we use transfer learning with the pre-trained resnet18 model and train it.

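# import all the necessary libraries
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

# prepare input data and transform it
transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
# the root directory './data' is an illustrative download location
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

# use dataloader to launch each batch; batch_size is the value we want to tune
# (32 is an illustrative starting point, not a value given in the text)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Create a ResNet model, loss function, and optimizer objects. To run on GPU, move model and loss to a GPU device
device = torch.device("cuda:0")
# note: newer torchvision releases deprecate pretrained=True in favor of the weights argument
model = torchvision.models.resnet18(pretrained=True).cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# define the training step for each batch of input data
def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()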

We have successfully set up our basic model. Now we are going to enable the optional features of the profiler to record more information during the training process. Let's include the following parameters:

• schedule - a callable that takes the step (int) as its single parameter and returns the profiler action to perform at each step.
• record_shapes - records the shapes of the operator inputs.
• profile_memory - tracks tensor memory allocation and deallocation. Setting it to True may add profiling overhead.
• with_stack - records source information (file and line number) for the operations in the trace.

Now that we understand these terms, we can return to the code:

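# train_loader is the DataLoader that yields batches of the CIFAR10 training set
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
        # write traces for the TensorBoard plugin; the log directory name is illustrative
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        # (wait + warmup + active) steps per cycle, repeated twice
        if step >= (1 + 1 + 3) * 2:
            break
        train(batch_data)
        prof.step()  # Call this at the end of each step to notify the profiler of the step boundary.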
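One common way to look for the largest batch size that fits in GPU memory is to keep doubling it until a training step fails. The sketch below is a generic version of that idea and is not taken from the original tutorial. try_step is a hypothetical callable, which we would write ourselves, that runs one training step at the given batch size. The sketch assumes that running out of memory surfaces as a RuntimeError (PyTorch reports CUDA out-of-memory conditions as a RuntimeError or a subclass of it).

```python
def find_max_batch_size(try_step, start=8, limit=1024):
    """Double the batch size until try_step fails or limit is exceeded.

    try_step(batch_size) is a user-supplied (hypothetical) callable that runs
    one training step and raises RuntimeError when memory runs out.
    Returns the largest batch size that succeeded, or None if even the
    starting size failed.
    """
    best = None
    batch_size = start
    while batch_size <= limit:
        try:
            try_step(batch_size)
        except RuntimeError:
            # Assumption: any RuntimeError here means "too large". A real
            # implementation may want to inspect the error message first.
            break
        best = batch_size
        batch_size *= 2
    return best


# Stand-in for a real training step: pretend anything above 100 samples
# does not fit in memory.
def fake_step(batch_size):
    if batch_size > 100:
        raise RuntimeError("out of memory (simulated)")


print(find_max_batch_size(fake_step))  # 64
```

The largest size that fits is only an upper bound. Each candidate below it can then be profiled to compare GPU utilization, and checked against validation accuracy, since larger batches may generalize worse.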

Find the right batch size using Keras

We are going to use an arbitrary sequential model in this case:

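# imports needed by the model definition
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential([
    Dense(units=16, input_shape=(1,), activation='relu'),
    Dense(units=32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    Dense(units=2, activation='sigmoid')
])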

Let's concentrate on where we call model.fit(). This is the function we call to train our model, and it is where the neural network learns.

model.fit(
    x=scaled_train_samples,
    y=train_labels,
    validation_data=valid_set,
    batch_size=10,
    epochs=20,
    shuffle=True,
    verbose=2
)

The fit() function above accepts a parameter called batch_size. This is where we specify the batch size. In this model, we have set the value to 10. Therefore, during training, we pass in 10 samples at a time until the whole epoch is complete. After that, the process starts over for the next epoch.

Important things to pay attention to

When performing multi-GPU training, pay close attention to the batch size, as it affects speed and memory use as well as the convergence of the model. If we're not careful, it can even corrupt our model weights!

Speed and memory - Training and prediction generally run faster with larger batches. Small batches incur more overhead from repeatedly moving data to and from the GPUs, although some studies indicate that training with a small batch size can yield better final scores for the model. On the other hand, larger batches require more GPU memory. A large batch size can cause out-of-memory errors because the inputs to each layer are retained in memory, especially during training, when they are needed for the back-propagation step.

Convergence - If you train your model with stochastic gradient descent (SGD) or one of its variants, you should be aware that the batch size can affect how well your network converges and generalizes. In many computer vision problems, batch sizes typically range from 32 to 512 samples.

Corrupting the model weights - This irritating technical detail can have disastrous effects. When performing multi-GPU training, it's crucial that every GPU receives data. The final batch of an epoch may contain fewer samples than expected (because the size of our dataset cannot be divided exactly by the batch size).

As a result, some GPUs may not get any data during the final step. Sadly, some Keras layers, most notably the Batch Normalization layer, can't handle that, which causes NaN values to appear in the weights (the running mean and variance in the BN layer).

To make matters worse, the issue goes unnoticed during training (when the learning phase is 1), because the layer uses the batch's own mean and variance in its estimates. During prediction (learning phase set to 0), however, the running mean and variance are used, and in our scenario these can become NaN, leading to poor results.

Therefore, while performing multi-GPU training, we should always make sure that the batch size is fixed. Rejecting batches that don't fit the predefined size or repeating the entries in the batch until it does are two straightforward approaches to accomplish this. Last but not least, remember that in a multi-GPU configuration, the batch size should be more than the total number of GPUs on your system.
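The two approaches just mentioned (rejecting an undersized final batch, or repeating entries until it is full) can be sketched in plain Python as follows. The function name and the mode argument are illustrative only and do not come from Keras or PyTorch.

```python
def fixed_size_batches(samples, batch_size, mode="drop"):
    """Split samples into batches that all contain exactly batch_size items.

    mode="drop":   discard a final batch that is too small.
    mode="repeat": pad the final batch by cycling through its own entries.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if mode not in ("drop", "repeat"):
        raise ValueError("mode must be 'drop' or 'repeat'")

    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    if batches and len(batches[-1]) < batch_size:
        last = batches.pop()
        if mode == "repeat":
            # Repeat the entries of the short batch until it is full.
            batches.append([last[i % len(last)] for i in range(batch_size)])
    return batches


data = list(range(10))
print(fixed_size_batches(data, 4, mode="drop"))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
print(fixed_size_batches(data, 4, mode="repeat"))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 8, 9]]
```

In practice the data pipeline can usually do this for us. For example, PyTorch's DataLoader accepts drop_last=True, and tf.data's batch() accepts drop_remainder=True, both of which correspond to the "drop" approach.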

To wrap up

In this article, we saw how to use various tools to maximize GPU utilization by finding the right batch size. As long as you set a reasonable batch size (16 or more) and keep the number of iterations and epochs the same, the batch size has little impact on model performance. Training time will be affected, though. For multi-GPU training, we should select the smallest batch size that still lets each GPU train at full capacity. 16 per GPU is a good starting point.