I recently experimented with writing my own renderer using OpenGL, and for certain cases it ended up being faster than the built-in rendering provided by popular, industry-standard frameworks like Godot and SFML.

Setup

I tested rendering lots of circles in Godot, SFML, and my own rendering implementation (which uses OpenGL).

Results

My own renderer can render 300,000 circles at 30 fps. SFML manages only 20,000, and Godot 50,000. At the same 30 fps, my renderer draws 6 times as many circles as Godot and 15 times as many as SFML.

Why

The reason my renderer is faster (for this particular case) is that it uses batching and instancing. Batching minimizes draw() calls (a single draw() call draws all the circles, instead of one draw() call per circle), and instancing minimizes the amount of data sent to the GPU (by consolidating the attributes the circles share, i.e. the circle geometry/vertices).

Batching

Communication from the CPU to the GPU takes a long time, so you don’t want to be constantly sending messages to the GPU. For example, you don’t want to:

  • send it the vertices of a certain circle
  • tell it to draw
  • send it vertices of a different circle
  • tell it to draw
  • etc

Instead, you want to:

  • send it the vertices for a ton of circles
  • tell it to draw all the circles

That is batching.

Batching reduces the number of messages you send to the GPU.
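To make the message counts concrete, here is a toy model in Python. It is only a sketch and not my renderer's code: FakeGpu, draw_unbatched and draw_batched are hypothetical names, and no real OpenGL calls are made. The fake GPU simply counts how many uploads and draw commands it receives.

```python
class FakeGpu:
    """Toy stand-in for a GPU connection; it only counts CPU-to-GPU messages."""

    def __init__(self):
        self.uploads = 0         # buffer uploads received
        self.draw_calls = 0      # draw commands received
        self.vertices_drawn = 0  # total vertices covered by draw commands
        self.pending = []        # vertices from the most recent upload

    def upload(self, vertices):
        self.uploads += 1
        self.pending = list(vertices)

    def draw(self):
        self.draw_calls += 1
        self.vertices_drawn += len(self.pending)


def draw_unbatched(gpu, circles):
    # One upload and one draw per circle: 2 * len(circles) messages.
    for vertices in circles:
        gpu.upload(vertices)
        gpu.draw()


def draw_batched(gpu, circles):
    # Concatenate every circle's vertices, then upload once and draw once.
    all_vertices = [v for vertices in circles for v in vertices]
    gpu.upload(all_vertices)
    gpu.draw()


# 1,000 circles with 100 (x, y) vertices each; the coordinates are placeholders.
circles = [[(0.0, 0.0)] * 100 for _ in range(1000)]

unbatched, batched = FakeGpu(), FakeGpu()
draw_unbatched(unbatched, circles)
draw_batched(batched, circles)

print(unbatched.uploads + unbatched.draw_calls)  # messages without batching
print(batched.uploads + batched.draw_calls)      # messages with batching
```

Both versions draw the same 100,000 vertices. The only difference is how many separate messages it takes to get there.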

Instancing

Now, on to instancing. CPU-to-GPU communication is also slow with respect to throughput. This means you want to minimize not only how frequently you send messages/data to the GPU, but also how much data you send in total.

So instead of sending the following to the GPU:

  • x,y for all vertices of circle1 (8 bytes per vertex, so if you have 100 vertices per circle, 800 bytes per circle!)
  • color for circle1 (RGBA at 8 bits per channel, thus 32 bits, i.e. 4 bytes)
  • x,y for all vertices of circle2 (another 800 bytes)
  • color for circle2 (RGBA, another 4 bytes)
  • now for circle3 (another 800 + 4 bytes)
  • now for circle4 (another 800 + 4 bytes)
  • etc

Send this:

  • one set of vertices that all circles will use (each will simply apply a different transformation/color when drawing them)
    • this is again 800 bytes, but in total for all the circles, not per circle!
  • for each circle
    • color (RGBA, 4 bytes)
    • radius (4 bytes)

See how much less data you are sending to the GPU? You are avoiding roughly 800 * num-circles bytes!
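Here is the same arithmetic as a small Python sketch. The sizes are assumptions for illustration: x,y as two 4-byte floats, RGBA as four 8-bit channels (4 bytes), and radius as one 4-byte float. The function names are made up for this example.

```python
import struct

VERTICES_PER_CIRCLE = 100              # figure used in the text
VERTEX_BYTES = struct.calcsize("<2f")  # x, y as two 4-byte floats = 8
COLOR_BYTES = struct.calcsize("<4B")   # assumption: RGBA as four 8-bit channels = 4
RADIUS_BYTES = struct.calcsize("<f")   # one 4-byte float = 4


def bytes_without_instancing(num_circles):
    # Every circle carries its own copy of the geometry plus its color.
    per_circle = VERTICES_PER_CIRCLE * VERTEX_BYTES + COLOR_BYTES
    return num_circles * per_circle


def bytes_with_instancing(num_circles):
    # The geometry is sent once; each circle only adds color and radius.
    shared_geometry = VERTICES_PER_CIRCLE * VERTEX_BYTES
    per_circle = COLOR_BYTES + RADIUS_BYTES
    return shared_geometry + num_circles * per_circle


n = 300_000
print(bytes_without_instancing(n))  # 241200000
print(bytes_with_instancing(n))     # 2400800
```

Under these assumptions, 300,000 circles go from roughly 241 MB to roughly 2.4 MB of data.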

That is instancing: consolidating the data that your geometry shares.

Remember though: even after minimizing the amount of data, if you still have to send a lot of it, send it in as few and as large chunks as possible.
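One way to get a single large chunk is to pack all per-circle data into one contiguous buffer on the CPU first. The sketch below assumes a hypothetical per-instance layout of four 8-bit color channels followed by a float radius; the layout and the name pack_instances are mine for illustration, and the actual upload call is left out.

```python
import random
import struct

# Hypothetical per-instance layout: r, g, b, a as bytes, then radius as a float.
INSTANCE_FORMAT = "<4Bf"
INSTANCE_BYTES = struct.calcsize(INSTANCE_FORMAT)  # 8 bytes, no padding


def pack_instances(instances):
    """Pack (r, g, b, a, radius) tuples into one contiguous buffer."""
    buffer = bytearray(len(instances) * INSTANCE_BYTES)
    for i, instance in enumerate(instances):
        struct.pack_into(INSTANCE_FORMAT, buffer, i * INSTANCE_BYTES, *instance)
    return buffer


rng = random.Random(0)
instances = [
    (rng.randrange(256), rng.randrange(256), rng.randrange(256), 255,
     rng.uniform(1.0, 5.0))  # radius of 1-5 pixels
    for _ in range(20_000)
]
payload = pack_instances(instances)
print(len(payload))  # 160000 bytes, ready to hand over in a single upload
```

The whole payload can then go to the GPU in one message instead of 20,000 small ones.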

Lesson

The lesson here is that custom solutions (“reinventing the wheel”) can sometimes be beneficial not just for learning, but for actual performance. Libraries are often generalized to provide decent performance for a wide variety of cases, which hurts their performance in specialized cases.

Additionally, batching and instancing are easy ways to significantly speed things up. This isn’t only true for rendering/GPU code; the same ideas apply to code in general. Add batching/instancing to your toolbox, which should already include things like caching, using efficient data structures, etc.

Pictures

Let’s compare the visual output of each renderer. The background color (clear color) is black for all of them. So the less black you see, the more circles are being drawn. The circles are 1-5 pixels in radius.

Here is SFML rendering 20,000 circles at ~30 fps:

Here is Godot rendering 50,000 circles at ~30 fps (the fps counter in my Godot implementation is printed to stdout; look at the very bottom-left of the image):

Here is my renderer rendering 300,000 circles at ~30 fps:

You see almost no black because the huge number of circles covers the entire screen, in fact several times over.

Just so you don’t think I’m cheating, here is the output of my renderer drawing only 20,000 circles (same as SFML’s max). Look at the fps!

And here is my renderer drawing 50,000 circles (Godot’s max), again, check out the fps!

Animations

Next, I added some code to randomly move the circles (2d random walk).

In the first version, I tried to move the circles one circle at a time. In other words, I:

  • modified circle1’s x,y data on the GPU memory using glBufferSubData()
  • modified circle2’s x,y data on the GPU memory using glBufferSubData()
  • etc

As you can see, I am sending one message to the GPU per circle, and as we covered earlier, this is slow! Here is the result: only 5 fps!

In the next version, I determine a new position for each circle on the CPU, and then send this data to the GPU in bulk. In other words, I minimize the frequency of CPU->GPU messages, i.e. batching. This version runs at about 30 fps!
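The bulk version can be sketched like this in Python. It is a CPU-only illustration, not the renderer's code: UploadCounter is a hypothetical stand-in for the GPU buffer, and the step size and screen bounds are made up. In a real OpenGL program, the single upload could be one glBufferSubData() call covering the whole position range instead of one call per circle.

```python
import random
from array import array


def random_walk_step(positions, rng, step=1.0):
    """Move every circle by a random offset, in place, on the CPU.

    positions is a flat array of floats: x0, y0, x1, y1, ...
    """
    for i in range(len(positions)):
        positions[i] += rng.uniform(-step, step)


class UploadCounter:
    """Hypothetical stand-in for the GPU buffer; it only counts uploads."""

    def __init__(self):
        self.uploads = 0
        self.bytes_received = 0

    def upload(self, data):
        self.uploads += 1
        self.bytes_received += len(data)


rng = random.Random(0)
num_circles = 10_000
positions = array("f", (rng.uniform(0.0, 800.0) for _ in range(2 * num_circles)))

gpu = UploadCounter()
random_walk_step(positions, rng)  # compute all new positions first
gpu.upload(positions.tobytes())   # then one message carrying every position

print(gpu.uploads, gpu.bytes_received)
```

The per-circle version would call upload() 10,000 times per frame for the same data; here it is called once.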