Divide and Conquer: The Role of Warps in Parallel Processing

In this review, we look at the role warps play in parallel processing on GPUs, to better understand what our machines are doing under the hood when training AI models.

a month ago • 5 min read

By Melani Maheswaran

Introduction

GPUs are described as parallel processors because they execute work in parallel: a task is divided into smaller sub-tasks, which are executed simultaneously by multiple processing units and then combined to produce the final result. The units of work (threads, warps, thread blocks) and the hardware that runs them (cores, multiprocessors) share resources such as memory, which facilitates collaboration between them and enhances overall GPU efficiency.

Note: it may be helpful to read this "CUDA refresher" before proceeding

One unit in particular, the warp, is a cornerstone of parallel processing. By grouping threads together into a single execution unit, warps simplify thread management, allow threads to share data and resources, and mask memory latency through effective scheduling.

"*The term warp originates from weaving, the first parallel-thread technology*"

In this article, we will outline how warps are useful for optimizing the performance of GPU-accelerated applications. By building an intuition around warps, developers can achieve significant gains in computational speed and efficiency.

Warps Unraveled

Thread blocks are partitioned into warps of 32 threads each. All threads in a warp run on the same Streaming Multiprocessor. Figure from an NVIDIA presentation on GPGPU and Accelerator Trends

When a Streaming Multiprocessor (SM) is assigned thread blocks for execution, it subdivides the threads into warps. Modern GPU architectures typically have a warp size of 32 threads.

The number of warps in a thread block depends on the thread block size configured by the CUDA programmer. For example, if the thread block size is 96 threads and the warp size is 32 threads, the number of warps per thread block is 96 threads / 32 threads per warp = 3 warps per thread block.
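The same arithmetic can be sketched in a few lines of plain Python. This is a host-side illustration rather than GPU code, and the function names are hypothetical. It also covers a case the example above does not: when the block size is not a multiple of 32, the block still occupies a whole number of warps, so the count is rounded up and the last warp is only partly filled.

```python
WARP_SIZE = 32  # warp size stated in the text


def warps_per_block(threads_per_block: int, warp_size: int = WARP_SIZE) -> int:
    """Number of warps needed to hold one thread block, rounded up."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    # Integer ceiling division: a partly filled warp still counts as a warp.
    return (threads_per_block + warp_size - 1) // warp_size


def idle_lanes_in_last_warp(threads_per_block: int, warp_size: int = WARP_SIZE) -> int:
    """Slots in the final warp that have no thread assigned to them."""
    return warps_per_block(threads_per_block, warp_size) * warp_size - threads_per_block


print(warps_per_block(96))           # 3, matching the example above
print(warps_per_block(100))          # 4, the last warp is only partly filled
print(idle_lanes_in_last_warp(100))  # 28
```

Choosing a block size that is a multiple of 32 avoids the partly filled warp in the second case.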

In this figure, 3 thread blocks are assigned to the SM. The thread blocks are comprised of 3 warps each. A warp contains 32 consecutive threads. Figure from Medium article

Note how, in the figure, the threads are indexed, starting at 0 and continuing between the warps in the thread block. The first warp is made of the first 32 threads (0-31), the subsequent warp has the next 32 threads (32-63), and so forth.
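As a rough sketch of this indexing, the snippet below maps a thread's index within its block to the warp it belongs to and its position inside that warp. It assumes a one-dimensional thread block; the function name is hypothetical, and "lane" is used here as a label for a thread's position within its warp.

```python
from typing import Tuple

WARP_SIZE = 32  # warp size stated in the text


def warp_and_lane(thread_index: int, warp_size: int = WARP_SIZE) -> Tuple[int, int]:
    """Map a thread's index within its block to (warp index, lane index)."""
    if thread_index < 0:
        raise ValueError("thread_index must be non-negative")
    # Quotient selects the warp, remainder selects the position inside it.
    return divmod(thread_index, warp_size)


for t in (0, 31, 32, 63, 64, 95):
    warp, lane = warp_and_lane(t)
    print(f"thread {t:2d} -> warp {warp}, lane {lane:2d}")
```

For the 96-thread block in the figure, threads 0-31 map to warp 0, threads 32-63 to warp 1, and threads 64-95 to warp 2.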

Now that we've defined warps, let's take a step back and look at Flynn's Taxonomy, focusing on how this categorization scheme applies to GPUs and warp-level thread management.

GPUs: SIMD or SIMT?

**Flynn's Taxonomy** defines 4 classes: **SISD** (Single Instruction Single Data), **SIMD** (Single Instruction Multiple Data), **MISD** (Multiple Instruction Single Data), and **MIMD** (Multiple Instruction Multiple Data). Figure taken from CERN's PEP root6 workshop

Flynn's Taxonomy is a classification system based on the number of instruction streams and data streams in a computer architecture. GPUs are often described as Single Instruction Multiple Data (SIMD), meaning they simultaneously perform the same operation on multiple data operands. Single Instruction Multiple Thread (SIMT), a term coined by NVIDIA, extends Flynn's Taxonomy to better describe the thread-level parallelism NVIDIA GPUs exhibit. In an SIMT architecture, multiple threads execute the same instruction on different data. The combined effort of the CUDA compiler and the GPU allows the threads of a warp to synchronize and execute identical instructions in unison as frequently as possible, optimizing performance.

While both SIMD and SIMT exploit data-level parallelism, they differ in approach. SIMD excels at uniform data processing, applying one instruction across the elements of a vector, whereas SIMT offers more flexibility: each thread keeps its own state and may follow its own path through conditional code.

Warp Scheduling Hides Latency

In the context of warps, latency is the number of clock cycles it takes for a warp to finish executing an instruction and become available to process the next one.

W denotes warp and T denotes thread. GPUs leverage warp scheduling to hide latency, whereas CPUs execute sequentially with context switching. Figure from Lecture 6 of Caltech's CS179

Maximum utilization is attained when every warp scheduler has an instruction to issue at every clock cycle. Thus, the number of resident warps (warps that are being executed on the SM at a given moment) directly affects utilization. In other words, there need to be warps that are ready for the warp schedulers to issue instructions to. With multiple resident warps, the SM can switch to another warp while one is waiting, hiding latency and maximizing throughput.
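A toy model can make the link between resident warps and utilization concrete. The sketch below is plain Python, not a description of real hardware: it assumes a single warp scheduler that issues at most one instruction per cycle, and a fixed, made-up latency of 4 cycles during which a warp that has just issued cannot issue again. The function name and all numbers are hypothetical.

```python
def scheduler_utilization(resident_warps: int, latency: int, cycles: int = 1000) -> float:
    """Fraction of cycles in which a single scheduler finds a ready warp."""
    ready_at = [0] * resident_warps  # cycle at which each warp can issue again
    issued = 0
    for cycle in range(cycles):
        for w in range(resident_warps):
            if ready_at[w] <= cycle:
                # The warp issues now and is busy until its instruction completes.
                ready_at[w] = cycle + latency
                issued += 1
                break  # at most one instruction is issued per cycle
    return issued / cycles


for n in (1, 2, 4, 8):
    print(n, scheduler_utilization(n, latency=4))
# 1 0.25
# 2 0.5
# 4 1.0
# 8 1.0
```

In this model the scheduler idles whenever every resident warp is still waiting, so utilization grows with the number of resident warps until there are enough of them to cover the latency. Beyond that point, extra warps no longer help.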

Program Counters

A program counter holds the address of the next instruction to fetch and advances as instructions are executed, guiding the flow of the program's execution. Notably, while threads in a warp share a common starting program address, on Volta and later architectures they maintain separate program counters, allowing for autonomous execution and branching of the individual threads.

Pre-Volta GPUs had a single program counter for a 32-thread warp. Following the introduction of the Volta micro-architecture, each thread has its own program counter. As Stephen Jones puts it during his GTC'17 talk: "*so now all these threads are wholly independent - they still work better if you gang them together... but you're no longer dead in the water if you split them up.*" Figure from Inside Volta GPUs (GTC'17).

Branching

Separate program counters support branching, as in an if-then-else structure. When the threads of a warp take different paths, the warp executes each path in turn, and instructions on a path are processed only for the threads that are active on it. Since optimal performance is attained when a warp's 32 threads converge on one instruction, programmers are advised to write code that minimizes instances where threads within a warp take divergent paths.
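The cost of divergence can be illustrated with a small plain-Python model of one warp running an if/else. It is a simplified sketch, not GPU code: each path is executed as a separate pass, and a per-path list of booleans (called a mask here) marks which threads are active. The function name and the arithmetic inside each branch are hypothetical.

```python
from typing import List, Tuple

WARP_SIZE = 32  # warp size stated in the text


def run_if_else(values: List[int]) -> Tuple[List[int], int]:
    """Run `v * 2 if v is even else v + 1` for one warp, one path at a time.

    Returns the per-thread results and the number of passes the warp needed.
    """
    results = list(values)
    if_mask = [v % 2 == 0 for v in values]  # threads active on the "if" path
    else_mask = [not m for m in if_mask]    # threads active on the "else" path
    passes = 0
    if any(if_mask):  # a path is executed only if some thread takes it
        passes += 1
        for lane, active in enumerate(if_mask):
            if active:
                results[lane] = values[lane] * 2
    if any(else_mask):
        passes += 1
        for lane, active in enumerate(else_mask):
            if active:
                results[lane] = values[lane] + 1
    return results, passes


uniform = [2 * i for i in range(WARP_SIZE)]  # all even: every thread takes the "if" path
divergent = list(range(WARP_SIZE))           # even and odd values are mixed

print(run_if_else(uniform)[1])    # 1
print(run_if_else(divergent)[1])  # 2
```

In this model a warp whose threads all agree on the condition needs one pass, while a warp whose threads disagree needs two, with part of the warp inactive during each. A condition that evaluates the same way for all 32 threads of a warp keeps the pass count at one.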

Conclusion: Tying Up Loose Threads

Warps play an important role in GPU programming. This 32-thread unit leverages SIMT to increase the efficiency of parallel processing. Effective warp scheduling hides latency and maximizes throughput, allowing for the streamlined execution of complex workloads. Additionally, program counters and branching facilitate flexible thread management. Despite this flexibility, programmers are advised to avoid long sequences of diverged execution for threads in the same warp.
