• a CUDA C++ project is a C/C++ project that runs some of its code on a CPU and some on a GPU
  • code that is meant to be run on the GPU has to be marked as such, so that it gets compiled for the GPU
  • add the __global__ qualifier to functions you want running on a GPU (these are invoked/started via CPU, but run on the GPU)
  • add the __device__ qualifier to GPU helper functions (functions you want invoked and executed by GPU code)
  • add the __device__ __host__ qualifier (both the device and the host qualifiers) to functions that you want 2 copies of (2 compiled copies, one source copy). One version is compiled for the CPU, the other is compiled for the GPU. When you call this function from the CPU, the CPU version is called. When you call it from the GPU, the GPU version is called. These are useful for little helper functions like square(x) that are handy for both CPU and GPU code (see the quick sketch after this list).
  • put declaration of your kernels in a .cuh file and put the definition in a .cu file
  • for every kernel, you will generally have 2 functions
    • one that is marked with __global__ and is the kernel itself
    • one that isn’t marked with anything, and is the host function that invokes the kernel, using the <<<...>>> syntax
    • you generally name these functions the same, but add a kernel suffix to the kernel function (or something similar)
    • the key is that in the .cuh file, you want to use no CUDA-specific keywords, just regular C++ declarations
      • this is because the .cuh file is included by both the corresponding .cu file and the regular C++ code (a .cpp file) that invokes the functions declared in the .cuh file
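
To make the three qualifiers concrete, here’s a quick sketch (the function names are made up for illustration):

```cpp
// Kernel: runs on the GPU, launched from CPU code via the <<<...>>> syntax.
__global__ void addOneKernel(float* data, int n);

// GPU-only helper: callable only from other GPU code (kernels or __device__ functions).
__device__ float clampToUnit(float x);

// Compiled twice (one CPU copy, one GPU copy); callable from either side.
__host__ __device__ float square(float x) { return x * x; }
```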

First, read this page to learn how to structure a regular C++ project. Structuring a CUDA C++ project is just like structuring a regular C++ project, with some additional details to keep in mind, which we cover here.

Try to keep your “regular C++” (aka “host” or “CPU”) code separate from the GPU or “device” code.

As you may know, CUDA allows some of your C/C++ program’s logic to run on a GPU. It allows you to mark certain functions as “GPU functions”. These functions are called kernels. When you want to execute a kernel (i.e. run a function on the GPU), you:

  • move some data from main memory (system memory) to GPU memory. This is the data that the function you are about to execute on the GPU will operate on
  • you tell the GPU to start executing said function, and you tell it how many threads should execute this function simultaneously
    • you also specify arguments (basically GPU memory addresses) that the function will operate on

The GPU will then asynchronously execute the function using the specified number of threads. If you want your CPU to wait for the GPU to finish, you call cudaDeviceSynchronize().
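
Here’s a rough sketch of that flow, using a hypothetical addOneKernel and the CUDA runtime API (this code would live in a .cu file, more on that below):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel; its definition is shown further down this page.
__global__ void addOneKernel(float* data, int n);

void runAddOne(std::vector<float>& host_data) {
    int n = static_cast<int>(host_data.size());
    size_t bytes = n * sizeof(float);

    // 1. Move the data from system memory to GPU memory.
    float* d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, host_data.data(), bytes, cudaMemcpyHostToDevice);

    // 2. Launch the kernel: 256 threads per block, enough blocks to cover n elements.
    //    The arguments are (mostly) GPU memory addresses.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addOneKernel<<<blocks, threadsPerBlock>>>(d_data, n);

    // 3. The launch is asynchronous; make the CPU wait for the GPU to finish.
    cudaDeviceSynchronize();

    // 4. Copy the results back to system memory and free the GPU memory.
    cudaMemcpy(host_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```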

When working on a CUDA project, clearly some of your code runs on the CPU, and some on the GPU. The code that runs on the GPU is just functions that have a certain “qualifier” added in front of them. For example, if you add __global__ before your function declaration/definition, it means this function runs on the GPU (device) and is callable from the CPU (host). The CUDA SDK (in particular the CUDA compiler) will compile these functions into GPU code, and the CUDA runtime will load the resulting binary code into GPU memory.
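
For example, a toy __global__ kernel might look like this (note that kernels must return void):

```cpp
// Runs on the GPU, but is launched from CPU code.
__global__ void addOneKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread handles one element
    if (i < n) {
        data[i] += 1.0f;
    }
}
```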

Sometimes, you need little helper functions that other GPU functions (kernels) can use, but you don’t want these functions to be callable from the host (for whatever reason). You can mark such functions with __device__.
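
For example (again, a made-up helper):

```cpp
// Callable only from GPU code (kernels or other __device__ functions), not from the host.
__device__ float clampToUnit(float x) {
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

__global__ void clampKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = clampToUnit(data[i]);  // fine: GPU code calling a __device__ helper
    }
}
```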

Other times, you need helper functions that can be called/executed on both host and device. For example, a square(x) function. This is a little helper function that can be useful to both the CPU and GPU. You would mark such functions as __host__ __device__ (you mark them as both __host__ and __device__). This will actually result in 2 different compiled functions. One that runs on the CPU and one that runs on the GPU. When the CPU calls the function, it will execute the CPU copy (stored in main memory). When a GPU function calls it, it will call the copy stored in GPU memory.
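
Sticking with the square(x) example, it could look something like this:

```cpp
// Compiled twice: one CPU copy and one GPU copy of the same source.
__host__ __device__ float square(float x) {
    return x * x;
}

__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = square(data[i]);  // calls the GPU copy
    }
}

void squareOnCpu(float* data, int n) {
    for (int i = 0; i < n; ++i) {
        data[i] = square(data[i]);  // calls the CPU copy
    }
}
```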

So, now, in a CUDA C++ project, you have some functions that will only execute on the CPU, some that will execute on the GPU but are called (invoked) by the CPU, and some that are invoked and executed by other GPU functions. You can have all these types of functions in your source files. How do you organize things? In general, keep your CPU code separate from your GPU code. Put your GPU kernel declarations in a .cuh file and definitions in a .cu file. You generally want to put your .cuh files in a specific directory inside your include directory. Similarly, you want to keep your .cu files in a specific directory inside your src directory. So you’d have a structure like so:

  • include
    • host (.h files here)
    • cuda (.cuh files here)
  • src
    • host (.cpp files here)
    • cuda (.cu files here)

One final caveat is that for each kernel you want 2 functions. One of the functions has the __global__ qualifier, and is the actual kernel. The other function doesn’t have any qualifiers and is the host function that uses the <<< >>> syntax to invoke the kernel. You generally name these functions the same, but add a kernel suffix to the kernel function (or something similar). The key is that in the .cuh file, you want to use no CUDA-specific keywords/syntax, just regular C++ declarations. This is because the .cuh file is included by both the corresponding .cu file and the regular C++ code (a .cpp file) that invokes the functions declared in the .cuh file. Since the .cuh file is included in regular .cpp code, it cannot have any CUDA-specific keywords/syntax, otherwise the regular C++ compiler will complain. Remember, only .cu files are compiled by the CUDA compiler; .cpp files are compiled by the regular C++ compiler.
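
To make this concrete, here’s a minimal sketch of the whole pattern (file names, paths, and function names are just examples):

```cpp
// include/cuda/add_one.cuh — a plain C++ declaration with no CUDA keywords,
// so it can safely be included from both .cu and .cpp files.
#pragma once
void addOne(float* hostData, int n);

// src/cuda/add_one.cu — compiled by the CUDA compiler, so CUDA syntax is fine here.
#include "cuda/add_one.cuh"
#include <cuda_runtime.h>

__global__ void addOneKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void addOne(float* hostData, int n) {
    float* d_data = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, hostData, bytes, cudaMemcpyHostToDevice);
    addOneKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaMemcpy(hostData, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}

// src/host/main.cpp — compiled by the regular C++ compiler; it only ever sees
// the plain declaration in the .cuh file.
#include "cuda/add_one.cuh"
#include <vector>

int main() {
    std::vector<float> data(1024, 1.0f);
    addOne(data.data(), static_cast<int>(data.size()));
    return 0;
}
```

Note how add_one.cuh contains nothing the regular C++ compiler can’t digest: the __global__ kernel and the <<< >>> launch are tucked away inside the .cu file.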

Not too shabby eh? Hope you have an awesome day!