- CUDA Driver vs Architecture vs Toolkit (aka SDK)
- Adding to an Array From Multiple Threads - Lock Free
- How to Structure a CUDA C++ Project
- Numbers to Know (GPU v CPU)
- GPU Architecture
- How does a graphic card actually draw stuff?
Let’s go over some performance related numbers for GPUs and CPUs, and in the process we will gain a better understanding of how to optimize for a GPU.
For hardware (both CPU and GPU) I’m assuming typical, “mid range” 2021 specs.
- a typical GPU consists of roughly 2000 cores, each operating at ~1 GHz. That works out to 2000 billion (or 2 trillion) cycles per second, spread across all cores. If the GPU does 1 instruction per cycle, that's 2 trillion instructions per second. Most GPUs can do 2 instructions per cycle (a fused multiply-add counts as two FLOPs) and have a bit more than 2000 cores, so we can safely settle on about 5 trillion instructions per second, or 5 teraFLOPS.
- a typical CPU consists of 8 cores each operating at 3 GHz. This means (8 × 3) 24 billion cycles per second. A CPU can do about 4 instructions per cycle, so that's about 100 billion instructions per second, or ~0.1 teraFLOPS.
- a typical GPU can access its own memory (GPU memory) at around 300 GB/s (gigabytes per second). This assumes the cores access memory in a way that maximizes memory throughput (memory coalescing, preferring shared memory, locality, etc).
- a typical CPU can access main memory at around 25 GB/s.
- main memory to GPU memory transfers (and vice versa) go over PCIe at about 50 GB/s each way (so up to 100 GB/s if you saturate both directions).
- it takes a few microseconds to launch a kernel on the GPU. In other words, the “overhead” of a kernel launch is a few microseconds.
What the above facts tell us:
- to take max advantage of a GPU’s extremely high computational throughput, make sure each core (each thread, rather) does a lot of computation on the data it is operating on. The more time a thread spends doing computation (adding/subtracting/multiplying/dividing/etc), the better, because the max computational throughput of a GPU is typically 5-10 tera (trillion) FLOPS.
- memory bandwidth on a GPU is still great (300 GB/s), but it is far below the max computational throughput, so, again, try to make each GPU core/thread do as much computation as possible on the data it loads
- notice the relatively slow transfer rate between main memory and GPU memory (and vice versa). Try to minimize memory transfers between these two regions.
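To make the first two points concrete, here is a hypothetical CUDA C++ sketch (both kernel names are made up for illustration): a plain copy kernel moves bytes while doing essentially no math, so it is memory bound; a kernel that keeps its value in a register and does many fused multiply-adds on it has much higher arithmetic intensity and can approach compute-bound throughput.

```cuda
// Illustrative sketch, not a real workload.
// ~0 FLOPs per 8 bytes of memory traffic: firmly memory bound.
__global__ void copy_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// One load and one store, but many FLOPs in between, all in registers:
// much higher arithmetic intensity per byte of memory traffic.
__global__ void compute_heavy_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];                  // one 4-byte load ...
        for (int k = 0; k < 64; ++k)      // ... then 64 fused multiply-adds
            x = x * 1.0001f + 0.5f;       //     with no memory traffic at all
        out[i] = x;                       // ... and one 4-byte store
    }
}
```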
Another thing to keep in mind is that CPU to GPU (and vice versa) communication has high latency. So don’t send commands to the GPU super frequently; make sure the GPU does quite a bit of work for each command. This generally means transferring memory to the GPU in large chunks, and then letting the GPU operate on it.
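A minimal host-side sketch of that pattern, assuming a hypothetical device kernel `process` defined elsewhere and host buffers `h_in`/`h_out`: one large host-to-device copy, one launch over the whole buffer, one copy back, rather than many small round trips.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, assumed to be defined elsewhere.
__global__ void process(const float* in, float* out, int n);

// Illustrative sketch: batch the work into one transfer + one launch.
void run_in_one_big_chunk(const float* h_in, float* h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // One large host->device transfer amortizes PCIe/driver latency...
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // ...and one launch over the whole buffer spreads the few-microsecond
    // launch overhead across all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    process<<<blocks, threads>>>(d_in, d_out, n);

    // One large device->host transfer for the results.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}
```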