Let’s take a look at how a GPU is organized at a high level. By understanding how a GPU is composed, we can better understand how to write code that fully utilizes its compute potential.

A graphics card consists of a bunch of Streaming Multiprocessors (SMs). For example, my card has 30 SMs.


+--------------------------+
|      GRAPHICS CARD       |
|                          |
|   +------+     +------+  |
|   |  SM  | ... |  SM  |  |  My card has 30 SMs total
|   |      |     |      |  |
|   +------+     +------+  |
|                          |
+--------------------------+



Each SM has a bunch of schedulers and ALUs. Each scheduler can be assigned a bunch of warps.


+------------------------------------+
|              SM                    |
|                                    |
|   +-----------+    +-----------+   |
|   | Scheduler |... | Scheduler |   |  In my card:
|   |           |    |           |   |    - each SM has 4 schedulers
|   | Warp 1    |    | Warp 1    |   |    - each scheduler can be assigned 8 warps
|   | Warp 2    |    | Warp 2    |   |    - each warp is 32 threads
|   | etc       |    | etc       |   |
|   +-----------+    +-----------+   |
|                                    |
|   +----+----+----+ ... +----+----+ |
|   |ALU |ALU |ALU |     |ALU |ALU | |  Each of my SMs has 64 ALUs
|   +----+----+----+ ... +----+----+ |
|                                    |
+------------------------------------+
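Plugging in the numbers above, we can do a quick sanity check on how many warps and threads this card can have resident at once (a back-of-the-envelope Python sketch using the figures stated for my card):

```python
# Figures for my card, as given above
num_sms = 30             # streaming multiprocessors on the card
schedulers_per_sm = 4    # warp schedulers per SM
warps_per_scheduler = 8  # warps that can be assigned to each scheduler
threads_per_warp = 32    # threads per warp

warps_per_sm = schedulers_per_sm * warps_per_scheduler  # 32 warps per SM
threads_per_sm = warps_per_sm * threads_per_warp        # 1024 threads per SM
total_threads = num_sms * threads_per_sm                # 30720 threads on the card

print(warps_per_sm, threads_per_sm, total_threads)      # 32 1024 30720
```

So even a modest card can keep tens of thousands of threads in flight at once, which is exactly what the schedulers exploit below.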



As noted, each scheduler is assigned a bunch of warps (e.g. 8 warps in my case). Each clock cycle, each scheduler gets a chance to “issue” (not execute) the next instruction of one (or more) of its assigned warps:

• the instructions are actually executed by the execution units of an SM (an execution unit is just an ALU, either a floating-point ALU or an integer ALU)
• if, during a cycle, none of a scheduler’s assigned warps are “ready” (“eligible” is the technical word) to have their next instruction issued, that scheduler issues nothing for that cycle, and some of the SM’s compute potential goes unused
• a warp is considered “not ready” if, say, its previous instruction is waiting on a memory fetch

During a given cycle, for a given scheduler that is assigned n warps, if all n of those warps are waiting on memory fetches, that scheduler cannot issue any instruction for that cycle, and you are wasting a little bit of compute potential.
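To make this concrete, here is a toy Python simulation of the issue rule described above (this is my own simplified model, not a description of real hardware): each cycle, a scheduler issues from the first eligible warp, and a warp that starts a long-latency operation (like a memory fetch) becomes ineligible for a few cycles. The stall probability and stall length are made-up illustrative numbers.

```python
import random

def issue_rate(num_warps, cycles=10_000, stall_prob=0.3, stall_len=5, seed=0):
    """Fraction of cycles in which the scheduler manages to issue an instruction.

    Toy model: each cycle, issue from the first eligible warp; an issued
    instruction stalls its warp for stall_len cycles with probability stall_prob
    (think: waiting on a memory fetch).
    """
    rng = random.Random(seed)
    ready_at = [0] * num_warps  # cycle at which each warp becomes eligible again
    issued = 0
    for cycle in range(cycles):
        eligible = [w for w in range(num_warps) if ready_at[w] <= cycle]
        if eligible:
            issued += 1
            if rng.random() < stall_prob:  # long-latency op, e.g. memory fetch
                ready_at[eligible[0]] = cycle + stall_len
    return issued / cycles

# With one warp, every stall leaves the scheduler idle; with 8 warps it
# almost always finds an eligible warp (this is latency hiding).
print(issue_rate(1), issue_rate(8))
```

The single-warp scheduler sits idle during every stall, while with 8 warps assigned it issues on nearly every cycle, which is why GPUs want many more warps resident than they can execute at once.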

So the solution here would be to maximize cache hits and memory coalescing (i.e. make sure that the threads in a warp access memory in a coalesced manner). If all 32 threads of a warp access adjacent memory locations, the accesses are coalesced. By maximizing cache hits and memory coalescing, you minimize the chance that a warp is not ready because it is waiting on a memory operation.
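One way to see the difference is to count how many distinct cache lines a warp’s 32 loads touch. This sketch assumes 128-byte cache lines and 4-byte elements (common values, but not taken from the text above); fewer lines touched means fewer memory transactions per warp:

```python
LINE_SIZE = 128  # assumed cache-line size in bytes
WARP_SIZE = 32   # threads per warp
ELEM_SIZE = 4    # assumed 4-byte elements (e.g. float)

def lines_touched(addresses, line_size=LINE_SIZE):
    """Number of distinct cache lines covered by a warp's loads."""
    return len({addr // line_size for addr in addresses})

# Coalesced: thread i loads element i -> 32 adjacent 4-byte loads = 128 bytes
coalesced = [i * ELEM_SIZE for i in range(WARP_SIZE)]

# Strided: thread i loads element i * 32 (e.g. walking down a column of a
# 32-wide row-major matrix) -> every thread hits a different cache line
strided = [i * 32 * ELEM_SIZE for i in range(WARP_SIZE)]

print(lines_touched(coalesced))  # 1  -> one memory transaction for the warp
print(lines_touched(strided))    # 32 -> thirty-two transactions
```

Same 32 loads either way, but the strided pattern needs 32x the memory traffic, which is 32x more opportunity for warps to sit in the “not ready” state.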