Computing GPU memory bandwidth with Deep Learning Benchmarks

In this article, we look at GPUs in depth to learn about memory bandwidth and how it affects the processing speed of the accelerator unit for deep learning and other pertinent computational tasks.

13 days ago   •   11 min read

By David Clinton

In this post, we will first go over what a GPU is and what GPU memory bandwidth means. Then we will discuss how GPU memory bandwidth is determined under the hood. We will also talk about the principles of GPU computing, variations across GPU generations, and an overview of current high-end Deep Learning GPU benchmarks.

This blog post covers the functionalities and capabilities of GPUs in general. If you would like to read about benchmarking the Paperspace GPUs for deep learning, please refer to our latest round of benchmarks.

Introduction

The idea here is to explore the history of GPUs, how and by whom they were developed, how GPU memory bandwidth is determined, relevant business models, and the substantive changes made in GPUs across the generations. This will provide us with enough information to dive into Deep Learning benchmarks focusing specifically on a GPU model and microarchitecture.

There are multiple technologies and programming languages to choose from when developing GPU-accelerated applications, but in this article we will use CUDA (a framework) and Python as our examples.

NOTE: This post assumes basic knowledge of Linux and of the Python programming language. We use Python because it is prevalent in research, engineering, data analytics, and deep learning, each of which relies significantly on parallel computing.

What are GPUs?

The abbreviation GPU stands for Graphics Processing Unit, a specialized processor created to speed up graphics performance. You can think of a GPU as a device that developers use to solve some of the most difficult tasks in the computing world. GPUs can handle large amounts of data in real time, making them suitable for deep learning, video production, and video game development.

Today, GPUs are arguably most recognized for their usage in producing the fluid images that modern video and video game viewers anticipate.

While the terms GPU and video card (or graphics card) are often used interchangeably, there is a small difference between the two. The GPU is the processor itself, while a graphics card is the board that carries the GPU and links it to the rest of the computer system. The card's memory is separate from system memory, so writing data to the card and reading the output back are two independent operations that might cause "bottlenecks" on your system.

Instead of relying on a GPU integrated into the motherboard or CPU, many computer systems use a dedicated graphics card with its own GPU for enhanced performance.

A bit of History of GPUs

GPUs were first introduced to the market under that name by Nvidia in 1999. The first GPU by Nvidia was the GeForce 256, and it was designed to process at least 10 million polygons per second. The acronym "GPU" was not in common use prior to the release of the GeForce 256, which allowed Nvidia to popularize the term and market the card as the "world's first GPU." Before the GeForce 256, there were other kinds of graphics accelerators.

Figure 1: GeForce 256, Image credit, Konstantine Lanzet, Wikipedia

The GeForce 256 is outdated and hopelessly underpowered in contrast to newer GPUs, yet it is nevertheless a significant card in the evolution of 3D graphics. It was built on a 220 nm process and was equipped to handle 50 gigaflops of floating point computations.

GPU Memory Bandwidth

GPU memory bandwidth refers to the maximum rate at which data can be moved across the bus between the GPU and its memory, and it plays a role in deciding how quickly a GPU can retrieve and use its framebuffer. Memory bandwidth is one of the most widely publicized metrics for each new GPU, with numerous models capable of hundreds of gigabytes per second of transfer.

Does GPU Memory Bandwidth Matter?

In general, the higher your memory bandwidth, the better. For instance, a video card with more memory bandwidth can move frame and texture data faster and therefore draw graphics more quickly.

Bandwidth counters can be used to tell whether a program is limited by memory access. When you wish to copy data as rapidly as possible, or when your program executes few arithmetic operations in comparison to the amount of memory it requests, performance depends mainly on how much memory bandwidth is available.

Calculation of GPU Memory Bandwidth

To determine GPU memory bandwidth, two fundamental ideas must first be understood (both are applied in the calculation later on):

  1. Bits and bytes are two different things.
  2. Memory comes with different data rates.

Difference between Bits and Bytes | Data Rates of Memory

Bits Versus Bytes
The major contrast between bits and bytes is that a bit is the smallest unit of information in computing, capable of storing only two different values, whereas a byte, made up of eight bits, can store up to 256 different values.

Bits are denoted in computing with a lowercase 'b', e.g. Gbps

Bytes are denoted with an uppercase 'B', e.g. GB/s
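The relationship between bits and bytes can be sketched in a few lines of Python. This is only an illustration; the function names are our own and do not come from any library.

```python
BITS_PER_BYTE = 8


def distinct_values(num_bits: int) -> int:
    """Number of different values that num_bits bits can represent."""
    return 2 ** num_bits


def gbps_to_gigabytes_per_second(gigabits_per_second: float) -> float:
    """Convert a rate in gigabits per second (Gbps) to gigabytes per second (GB/s)."""
    return gigabits_per_second / BITS_PER_BYTE


print(distinct_values(1))               # 2   (one bit)
print(distinct_values(BITS_PER_BYTE))   # 256 (one byte)
print(gbps_to_gigabytes_per_second(8))  # 1.0 (8 Gbps is 1 GB/s)
```

The division by 8 is the same one that appears in the memory bandwidth calculation later in this post.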

Data Rates of Memory

The data rate is the number of bits that a module can send in a given amount of time. We have three Data Rates of Random Access Memory:

  • Single Data Rate (SDR)
  • Double Data Rate (DDR)
  • Quad Data Rate (QDR)

The most common type is DDR, for example DDR4 RAM. Each clock cycle has a rise and a fall, and DDR sends one signal on each, for two signals per clock cycle.

QDR therefore sends four signals per clock cycle.

The attained sustained memory bandwidth may be calculated as the ratio of bytes transferred to kernel execution time.
A typical contemporary GPU may deliver a STREAM result that is equal to or more than 80% of its peak memory bandwidth.

GPU-STREAM is a tool that informs an application developer about the performance of a memory-bandwidth-constrained kernel. GPU-STREAM is open source and may be found in its repository on GitHub.
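As a rough sketch of the ratio described above, sustained bandwidth is the number of bytes moved divided by the kernel execution time. The helper names and the sample numbers below are hypothetical, chosen only to illustrate the arithmetic; they are not measurements and are not part of GPU-STREAM.

```python
def sustained_bandwidth_gbs(bytes_read: float, bytes_written: float,
                            kernel_seconds: float) -> float:
    """Sustained bandwidth in GB/s: bytes transferred / kernel execution time."""
    if kernel_seconds <= 0:
        raise ValueError("kernel_seconds must be positive")
    return (bytes_read + bytes_written) / kernel_seconds / 1e9


def fraction_of_peak(sustained_gbs: float, peak_gbs: float) -> float:
    """How much of the theoretical peak bandwidth was actually achieved."""
    return sustained_gbs / peak_gbs


# Hypothetical copy kernel: reads 1 GB and writes 1 GB in 12.5 ms
sustained = sustained_bandwidth_gbs(1e9, 1e9, 0.0125)
print(f"{sustained:.1f} GB/s sustained")
print(f"{fraction_of_peak(sustained, 192.0):.0%} of a 192 GB/s peak")
```

With these made-up numbers the kernel sustains roughly 160 GB/s, a little over 80% of a 192 GB/s peak, which is in the range the text describes as typical.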

EXAMPLE: Let us calculate the GPU memory bandwidth of the Nvidia GeForce GTX 1660 (2019), which uses GDDR5 memory. (Memory Clock = 2001 MHz, Effective Memory Clock = 8 Gbps, Memory Bus Width = 192 bit)

SOLUTION:
Step 1

Calculate Effective Memory Clock
Effective Memory Clock is given by Memory Clock * 2 (rise and fall) * Data Rate Type

Mathematical Expression:
Effective Memory Clock = Memory Clock * 2 (rise and fall) * (Data Rate Type)

Therefore:
2001 MHz * 2 (rise and fall) * 2 (DDR) = 8004 Mbps
8004 Mbps ≈ 8 Gbps

Step 2

Calculate Memory Bandwidth
Memory Bandwidth = Effective Memory Clock * Memory Bus Width / 8

Memory Bandwidth:
8004 Mbps * 192 / 8 = 192,096 MB/s ≈ 192 GB/s
Final Output

The GPU Memory Bandwidth is 192GB/s
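The two steps above can be sketched in Python. The function names are our own, and the default multiplier of 2 simply mirrors the DDR factor used in this worked example; other memory types may need a different value.

```python
def effective_memory_clock_mbps(memory_clock_mhz, data_rate_multiplier=2):
    """Step 1: memory clock * 2 (rise and fall) * data rate type.

    Returns the effective per-pin data rate in Mbps.
    """
    return memory_clock_mhz * 2 * data_rate_multiplier


def memory_bandwidth_gbs(effective_clock_mbps, bus_width_bits):
    """Step 2: effective clock * bus width / 8, converted from MB/s to GB/s."""
    megabytes_per_second = effective_clock_mbps * bus_width_bits / 8
    return megabytes_per_second / 1000


# GeForce GTX 1660 figures from the example: 2001 MHz memory clock, 192-bit bus
effective = effective_memory_clock_mbps(2001)
bandwidth = memory_bandwidth_gbs(effective, 192)

print(effective)            # 8004
print(round(bandwidth, 3))  # 192.096
```

Swapping in the memory clock and bus width from another GPU's specification sheet gives its theoretical peak in the same way.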

Looking Out for Memory Bandwidth Across GPU generations?

Understanding when and how to use every type of memory makes a big difference toward maximizing the speed of your application. It is often preferable to use shared memory, since threads within the same block can use it to communicate with one another. Combined with its high performance, shared memory is a fantastic 'all-around' choice when used correctly. However, in some instances, it may be preferable to employ the other types of available memory.

There are four main forms of GPU according to memory bandwidth, namely:

  • Dedicated Graphics Card
  • Integrated Graphics Processing Unit
  • Hybrid Graphics Processing
  • Stream Processing and General Purpose GPUs

In the low-end desktop and laptop sectors, the newest class of GPUs (Hybrid Graphics Processing) competes with integrated graphics. ATI's HyperMemory and Nvidia's TurboCache are the most common implementations of this. Hybrid graphics cards cost slightly more than integrated graphics but far less than standalone graphics cards. Both share memory with the host system and feature a small dedicated memory cache to compensate for the high latency of system RAM.

Most GPUs are built for a specific purpose, with memory bandwidth and other characteristics matched to the intended workload:

  1. Deep Learning and Artificial Intelligence: Nvidia Tesla/Data Center, AMD Radeon Instinct.
  2. Video Production and Gaming: GeForce GTX/RTX, Nvidia Titan, Radeon VII, Radeon and Navi Series
  3. Small-scale Workstation: Nvidia Quadro, Nvidia RTX, AMD FirePro, AMD Radeon Pro
  4. Cloud Workstation: Nvidia Tesla/Data Center, AMD FireStream
  5. Robotics: Nvidia Drive PX

Now that you've learned a little about the various types of memory available in GPU applications, you're prepared to learn how and when to use them effectively.


Language Solutions for GPU Programming - CUDA for Python

GPU programming is a technique for executing highly parallel general-purpose calculations using GPU accelerators. While GPUs were originally created for 3D graphics, they are now widely used for general-purpose computation.

GPU-powered parallel computing is being utilized for deep learning and other parallelization-intensive tasks in addition to graphics rendering.

There are multiple technologies and programming languages to get started with when developing GPU-accelerated applications, but in this article we will use CUDA and Python as our examples.

Let's go over how to use CUDA in Python, starting with installation of CUDA on your machine.

Installation of CUDA

To get started, make sure your machine has the following:

  • A CUDA-capable GPU
  • A supported version of Linux (for example, Ubuntu 20.04)
  • GCC installed on your system
  • The correct kernel headers and development packages installed

Then install CUDA. The commands below install CUDA using the local installer:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.4.2/local_installers/cuda-repo-ubuntu2004-11-4-local_11.4.2-470.57.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-4-local_11.4.2-470.57.02-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-4-local/7fa2af80.pub
sudo apt-get update  
sudo apt-get -y install cuda

You can remove this hassle by swapping your local machine for a more powerful cloud machine like a Gradient Notebook or Core Machine. Every Paperspace machine, unless specified otherwise, comes preinstalled with CUDA and CuPy to facilitate any deep learning needs a user may have.

Install CuPy Library

Secondly, because NumPy is the foundational library of the Python Data Science environment, we shall utilize it for this session.

The simplest way to run NumPy-style code on a GPU is to use CuPy, a largely drop-in replacement library that implements much of the NumPy API on a GPU.

Pip may be used to install the stable release version of the CuPy source package:

pip install cupy

Confirm CuPy is properly Installed

This step confirms that CuPy is working and shows how much of a speedup it gives on your system. We will do this by writing a simple Python script in a .py file.

Note: This step requires you to know the basic file structure of a Python program.

The script below will import the NumPy and CuPy libraries, as well as the time library, which will be used to benchmark processing units.

import numpy as np
import cupy as cp
from time import time

Benchmarking

Let us now define a function that will be utilized to compare GPU and CPU.

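def benchmark_speed(arr, func, argument):
    start_time = time()
    func(arr, argument)  # the scalar argument is broadcast across the matrix
    # GPU kernels launch asynchronously, so wait for them before stopping the clock
    cp.cuda.Stream.null.synchronize()
    elapsed_time = time() - start_time
    return elapsed_time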

Then you must create two matrices: one for the CPU and one for the GPU. For our matrices, we will choose a shape of 9999 by 9999.

# store a matrix in host (CPU) memory
array_cpu = np.random.randint(0, 255, size=(9999, 9999))

# store the same matrix to GPU memory
array_gpu = cp.asarray(array_cpu)

Finally, we'll use a basic addition function to test the efficiency of CPU and GPU processors.

# benchmark matrix addition on CPU by using a NumPy addition function
cpu_time = benchmark_speed(array_cpu, np.add, 999)

# you need to run a pilot iteration on a GPU first to compile and cache the function kernel on a GPU
benchmark_speed(array_gpu, cp.add, 1)

# benchmark matrix addition on GPU by using CuPy addition function
gpu_time = benchmark_speed(array_gpu, cp.add, 999)

# Compare GPU and CPU speed (the result is positive when the GPU is faster)
faster_speed = (cpu_time - gpu_time) / gpu_time * 100

Let us print our result to the console:

print(f"CPU time: {cpu_time} seconds\nGPU time: {gpu_time} seconds.\nGPU was {faster_speed} percent faster")

We have confirmed that integer addition is much quicker on a GPU. If you work with significant amounts of data that can be handled in parallel, it's generally worthwhile to learn more about GPU programming. As you've seen, employing GPU computation for big matrices increases performance significantly.
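A single timed run can be noisy, so one possible refinement is to repeat the measurement and keep the best time. The sketch below uses only the standard library; `best_time` and `speedup` are hypothetical helpers, not part of NumPy or CuPy. The optional `sync` callback is where a GPU synchronization call would go. With CuPy, `cp.cuda.Stream.null.synchronize` is one option, since GPU work is queued asynchronously.

```python
from time import perf_counter


def best_time(func, *args, repeats=5, sync=None):
    """Return the fastest of `repeats` timed calls to func(*args).

    `sync`, if given, is called after each run so that asynchronous work
    (for example queued GPU kernels) finishes before the clock is read.
    """
    func(*args)  # warm-up run, not timed
    if sync is not None:
        sync()
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        func(*args)
        if sync is not None:
            sync()
        timings.append(perf_counter() - start)
    return min(timings)


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


# CPU-only demonstration with a built-in function
elapsed = best_time(sum, range(100_000))
print(f"best of 5 runs: {elapsed:.6f} s")
print(speedup(2.0, 0.5))  # 4.0
```

Taking the minimum rather than the mean is a design choice: it filters out runs slowed down by unrelated activity on the machine.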

If you do not have a GPU, you can run this code in a Gradient Notebook to see for yourself how Paperspace GPUs perform on this benchmarking task. You can run this code in any cell in a Gradient Notebook you already have, or you can spin up a new machine with any GPU as its machine type.


The connection between Deep Learning and GPUs

Artificial intelligence (AI) is quickly changing, with new neural network models, methodologies, and application cases appearing on a regular basis. Because no one technology is optimal for all machine learning and deep learning applications, GPUs can provide unique benefits over other different hardware platforms in particular use scenarios.

Many of today's deep learning solutions rely on GPUs collaborating with CPUs. Because GPUs have a tremendous amount of processing capacity, they may provide great acceleration in workloads that benefit from GPUs' parallel computing design, such as image recognition. AI and deep learning are two of the most interesting uses for GPU technology.

A summary of the benchmarked GPUs

This summary covers NVIDIA's newest products, including the Ampere GPU generation. The functionality of multi-GPU configurations, such as a quad RTX 3090 arrangement, is also assessed.

This section covers some options for local GPUs that are currently some of the best suited for deep learning training and development owing to their compute and memory performance and connectivity with current deep learning frameworks.

GPU Name | Description
--- | ---
GTX 1080 Ti | NVIDIA's classic GPU for Deep Learning was introduced in 2017 and was geared for computing tasks, featuring 11 GB of GDDR5X memory and 3584 CUDA cores. It has been out of production for some time and is included only as a reference point.
RTX 2080 Ti | The RTX 2080 Ti was introduced in the fourth quarter of 2018. It has 4352 CUDA cores and 544 NVIDIA Turing mixed-precision Tensor Cores with 107 Tensor TFLOPS of AI capability, plus 11 GB of ultra-fast GDDR6 memory. This GPU was discontinued in September 2020 and is no longer available.
Titan RTX | The Titan RTX is powered by the most powerful Turing architecture. With 576 Tensor Cores and 24 GB of ultra-fast GDDR6 memory, the Titan RTX provides 130 Tensor TFLOPS of acceleration.
Quadro RTX 6000 | The Quadro RTX 6000 is the server variant of the famous Titan RTX, with enhanced multi-GPU blower ventilation, expanded virtualization functionality, and ECC memory. It uses the same Turing core as the Titan RTX, which has 576 Tensor Cores and delivers 130 Tensor TFLOPS of performance as well as 24 GB of ultra-fast GDDR6 ECC memory.
Quadro RTX 8000 | The Quadro RTX 8000 is the RTX 6000's bigger sibling, with the same GPU processing unit but double the GPU memory (48 GB GDDR6 ECC). In fact, it is presently the GPU with the largest available GPU memory, making it ideal for the most memory-intensive activities.
RTX 3080 | One of the first GPU models to use the NVIDIA Ampere architecture, with improved RT and Tensor Cores and new streaming multiprocessors. The RTX 3080 has 10 GB of ultra-fast GDDR6X memory and 8704 CUDA cores.
RTX 3080 Ti | The RTX 3080's bigger brother, featuring 12 GB of ultra-fast GDDR6X memory and 10240 CUDA cores.
RTX 3090 | The GeForce RTX 3090 belongs to the TITAN class of NVIDIA's Ampere GPU generation. It is powered by 10496 CUDA cores, 328 third-generation Tensor Cores, and new streaming multiprocessors. Like the Titan RTX, it has 24 GB of memory, in this case GDDR6X.
NVIDIA RTX A6000 | The NVIDIA RTX A6000 is the Quadro RTX 6000's Ampere-based update. It has the same GPU processor (GA-102) as the RTX 3090, but with all processor cores enabled. As a result, there are 10752 CUDA cores and 336 third-generation Tensor Cores. Furthermore, it has twice as much GPU memory as an RTX 3090: 48 GB GDDR6 ECC.

Using a local GPU has a number of limitations, however. First, you are limited to the capability of your purchase. It is either impossible or prohibitively expensive for most GPU owners to switch to a different GPU. Second, these are largely GeForce 30 and workstation series GPUs from Nvidia. These are not designed for processing Big Data like Tesla/Data Center GPUs are.

You can access a full list of the GPUs available on Paperspace here: https://docs.paperspace.com/gradient/machines/

You can also access a full benchmarking analysis of Paperspace GPUs here: https://blog.paperspace.com/best-gpu-paperspace-2022/

To Wrap Up

The optimal GPU for your project will be determined by the level of maturity of your AI operation, the size at which you work, and the specific algorithms and models you use. Many factors have been offered in the preceding sections to assist you in selecting the optimal GPU or group of GPUs for your purposes.

Paperspace lets you switch between different GPUs as needed for any and all deep learning tasks, so try their service before purchasing an expensive GPU.
