vercidium

Multi-threaded Multi-buffered UI Rendering

When people think of optimisation they usually think about fast 3D rendering, quick loading screens, powerful physics engines and small download sizes.

But one day I was profiling Sector's Edge and found that the user interface took the same amount of time to render as the 3D world. I was shocked - the UI was so simple but so slow!

It turns out that UI is still a form of rendering and requires the same optimisations as 3D rendering. OpenGL makes it easy to apply these optimisations to buffers (3D models, like in my YouTube video here), but it's trickier to apply them to textures.

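To achieve a smooth, stutter-free UI rendering system, we need to paint bitmaps on background threads and transfer them to GPU textures without stalling the CPU or the GPU.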

Overview

The UI you see in games starts out as a bitmap on the CPU. A bitmap is a grid of pixels, and we can change the colour of each pixel like so:
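As a minimal sketch in plain Python (not the SkiaSharp API this demo uses), a bitmap can be modelled as a flat array of RGBA bytes. The `Bitmap` class and its `set_pixel` / `get_pixel` methods are illustrative names, not part of any library:

```python
class Bitmap:
    """A grid of RGBA pixels stored row by row in one flat byte array."""

    BYTES_PER_PIXEL = 4  # red, green, blue, alpha

    def __init__(self, width, height):
        self.width = width
        self.height = height
        # Allocate zeroed memory: every pixel starts as transparent black.
        self.pixels = bytearray(width * height * self.BYTES_PER_PIXEL)

    def _offset(self, x, y):
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise IndexError(f"pixel ({x}, {y}) is outside the bitmap")
        return (y * self.width + x) * self.BYTES_PER_PIXEL

    def set_pixel(self, x, y, rgba):
        if len(rgba) != self.BYTES_PER_PIXEL:
            raise ValueError("expected 4 colour components (r, g, b, a)")
        i = self._offset(x, y)
        self.pixels[i:i + self.BYTES_PER_PIXEL] = bytes(rgba)

    def get_pixel(self, x, y):
        i = self._offset(x, y)
        return tuple(self.pixels[i:i + self.BYTES_PER_PIXEL])


bitmap = Bitmap(4, 2)
bitmap.set_pixel(1, 1, (0, 255, 0, 255))  # make one pixel opaque green
```

The flat layout matters later: a contiguous block of bytes is what gets copied to the GPU.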

To draw this bitmap on your screen, we need to first copy it to a texture on the GPU. We'll create a texture of the same size, and then copy the bitmap data to it.

Then we can use this texture in a shader to draw it on the screen.

Sounds simple! But there are crucial optimisations we need to implement at each stage.

Setup

For this demo we're using SkiaSharp to create and paint bitmaps, and RichTextKit to paint the text.

Multithreading

Our first goal is to allocate and paint bitmaps on background threads. This is considered heavy lifting because allocating memory for large bitmaps is slow, and text is notoriously slow to paint.

'Creating' refers to allocating memory for the pixels

'Painting' refers to modifying the pixels in a bitmap, e.g. text, vector rasterisation, shapes

'Rendering' refers to displaying the texture on your screen

Let's say we're making a first person shooter game and we want to show the player's ammo in the bottom right of the screen. This would consist of 3 elements:

We could paint these 3 elements on the main thread and the game would still run quickly, because it only takes a fraction of a millisecond to paint them. But games usually have larger, complex UI elements that update often. Painting these on the main thread every frame would slow the game down.

So rather than painting these bitmaps on the main thread, we'll store the parameters of each paint function in a list of PaintCommand objects. This means we've saved a copy of all the information required to paint this bitmap, and can send it to a background thread for painting.

For example, the DrawIcon function takes these parameters:

Paint Commands

There are many types of PaintCommands, such as:

Each of these commands shares 3 common functions, but each has its own implementation:

For example, here's a PaintCommandBitmap, which draws a bitmap at the specified position.
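The idea can be sketched in Python rather than the C# and SkiaSharp used in this demo. This is an illustration under assumptions, not the actual class: the two methods shown (`bounds` and `paint`) are hypothetical stand-ins for the common functions, and the canvas is a plain list of rows instead of a real bitmap.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class PaintCommand(ABC):
    """Base class: every command captures its own parameters up front."""

    @abstractmethod
    def bounds(self):
        """Return (left, top, right, bottom) of the area this command touches."""

    @abstractmethod
    def paint(self, canvas):
        """Apply the command to a canvas (a list of rows of pixel values)."""


@dataclass(frozen=True)
class PaintCommandBitmap(PaintCommand):
    """Draws a source bitmap with its top-left corner at (x, y)."""

    source: tuple  # tuple of rows, each row a tuple of pixel values
    x: int
    y: int

    def bounds(self):
        height = len(self.source)
        width = len(self.source[0]) if height else 0
        return (self.x, self.y, self.x + width, self.y + height)

    def paint(self, canvas):
        for row_index, row in enumerate(self.source):
            for col_index, pixel in enumerate(row):
                canvas[self.y + row_index][self.x + col_index] = pixel


# Record the command first, then replay it later (possibly on another thread).
icon = ((1, 1), (1, 1))
commands = [PaintCommandBitmap(icon, x=1, y=0)]

canvas = [[0] * 4 for _ in range(2)]
for command in commands:
    command.paint(canvas)
```

Making the command immutable (`frozen=True`) means the thread that recorded it cannot change it while another thread is painting.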

Painting on Background Threads

Let's say you have a friend in real life who can paint really nice pictures. You describe to them what you want the painting to look like, and then they go off and paint it. A few days later they give you a beautifully framed painting exactly like you asked for.

This is what we want to achieve with our bitmap painting. We have a class called Painter that we can give a list of commands to. Then, on another thread, the Painter will figure out how big the bitmap needs to be, allocate memory for it, and execute each command.

Since the command list contains all the information the Painter needs, the Painter can safely perform its work on background threads.

Before asking the Painter to draw something, we need to figure out the size of the final bitmap. We'll do this by looping over the commands and combining their bounding boxes:
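A minimal Python sketch of that loop follows. The `Rect` type, the `combined_bounds` helper and the three screen-space rectangles are made up for illustration; they were chosen so the result matches the 122 x 54 size used in this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top

    def union(self, other):
        """Smallest rectangle that contains both self and other."""
        return Rect(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


def combined_bounds(rects):
    """Fold a sequence of bounding boxes into one enclosing box."""
    rects = list(rects)
    if not rects:
        raise ValueError("need at least one bounding box")
    total = rects[0]
    for rect in rects[1:]:
        total = total.union(rect)
    return total


# Hypothetical bounding boxes of three commands, in screen coordinates.
command_bounds = [
    Rect(1780, 1021, 1812, 1053),
    Rect(1820, 1010, 1902, 1050),
    Rect(1780, 1056, 1902, 1064),
]
total = combined_bounds(command_bounds)
```

The top-left corner of `total` is also where the finished bitmap will be drawn on the screen.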

Now we'll ask the Painter to allocate a bitmap that's 122px wide and 54px tall, and paint our list of commands to it.
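The hand-off could look roughly like this in Python. It is a sketch of the pattern only: the `Painter` class, `paint_async` method and `FillRect` command are hypothetical, and a thread pool stands in for whatever threading the real engine uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class FillRect:
    """Immutable command: fill a bitmap-relative rectangle with one value."""

    x: int
    y: int
    width: int
    height: int
    colour: int


def paint_bitmap(width, height, commands):
    """Runs on a worker thread: allocate the bitmap, then replay the commands."""
    bitmap = [[0] * width for _ in range(height)]
    for command in commands:
        for row in range(command.y, command.y + command.height):
            for col in range(command.x, command.x + command.width):
                bitmap[row][col] = command.colour
    return bitmap


class Painter:
    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def paint_async(self, width, height, commands):
        # Snapshot the list so later edits by the caller cannot race the worker.
        return self._pool.submit(paint_bitmap, width, height, tuple(commands))

    def shutdown(self):
        self._pool.shutdown(wait=True)


painter = Painter()
future = painter.paint_async(
    122, 54, [FillRect(0, 0, 122, 54, 1), FillRect(8, 11, 32, 32, 2)]
)
# The main thread is free to keep rendering here; it collects the result later.
bitmap = future.result()
painter.shutdown()
```

In CPython the global interpreter lock limits how much pure-Python painting actually runs in parallel, so treat this as a picture of the hand-off, not a performance demonstration.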

Coordinate System

To paint a command, the bitmap needs to first convert the command's position from screen-coordinates to bitmap-coordinates. For example, if we have a very basic bitmap that's just a green rectangle, we'd store a command with data:

Since our bitmap is 50x50px in size, if we paint a rectangle within the bitmap at position (275, 80), we'll be drawing way outside the bitmap. What we should do is paint the rectangle at position (0, 0) within the bitmap, and then draw the final bitmap at (275, 80) on the screen.

Each Paint command will perform this conversion before painting, e.g. in PaintCommandBitmap the position is converted to a bitmap-relative netPosition:
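In Python the conversion is a subtraction of the bitmap's screen-space origin. The function name `to_bitmap_coordinates` is an assumption made for this sketch:

```python
def to_bitmap_coordinates(position, bitmap_origin):
    """Convert a screen-space point to a point relative to the bitmap's top-left."""
    return (position[0] - bitmap_origin[0], position[1] - bitmap_origin[1])


# The 50x50 bitmap from the example is drawn at (275, 80) on the screen,
# so that point is the bitmap's origin.
bitmap_origin = (275, 80)

net_position = to_bitmap_coordinates((275, 80), bitmap_origin)
inner_position = to_bitmap_coordinates((285, 100), bitmap_origin)
```

Here `net_position` is `(0, 0)`, and a hypothetical element 10px right and 20px down from the corner lands at `(10, 20)` inside the bitmap.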

GPU Optimisations

All CPU optimisations are now applied, and it's time to move on to the GPU. There isn't as much code here, but the concepts are important to understand.

Let's start with the reason why texture transfers cause stutters.

Let's say our game is running at 60 FPS, which means the GPU takes about 16ms (1000 / 60 ≈ 16.7ms) to render one frame. Each frame consists of many OpenGL calls, mainly these 4 over and over again:

If the CPU only takes 3ms to enqueue these commands, the GPU will get further and further behind the CPU:

From my research, NVIDIA allows its GPUs to be at most 3 frames behind the CPU, and AMD allows its GPUs to be at most 5 frames behind. In the above scenario, the GPU already has 3 frames of work to process, so NVIDIA will make the CPU wait for the GPU to finish rendering Frame 1. This causes a stall on the CPU:
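The timing can be approximated with a small Python simulation. The model is a deliberate simplification and an assumption on my part, not a description of real driver behaviour: the CPU needs a fixed time to enqueue a frame, the GPU renders frames one after another, and the CPU must wait whenever `max_queued` frames are still unfinished.

```python
def simulate_frames(frame_count, cpu_ms, gpu_ms, max_queued):
    """Return how long the CPU stalls (in ms) before enqueuing each frame."""
    gpu_done = []  # time at which the GPU finishes each frame
    stalls = []
    cpu_time = 0.0
    for frame in range(frame_count):
        stall = 0.0
        if frame >= max_queued:
            # The queue is full: wait for the oldest queued frame to finish.
            oldest_done = gpu_done[frame - max_queued]
            stall = max(0.0, oldest_done - cpu_time)
            cpu_time += stall
        cpu_time += cpu_ms  # enqueue this frame's commands
        # The GPU starts once the frame is enqueued and the previous one is done.
        gpu_start = max(cpu_time, gpu_done[-1] if gpu_done else 0.0)
        gpu_done.append(gpu_start + gpu_ms)
        stalls.append(stall)
    return stalls


stalls = simulate_frames(frame_count=6, cpu_ms=3, gpu_ms=16, max_queued=3)
```

Under this model the first three frames are enqueued without waiting, the fourth stalls until the first has finished, and from then on the CPU waits roughly `gpu_ms - cpu_ms` every frame.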

These stalls can be mitigated by either:

Did You Know - in the diagram above, the gap between when the CPU issues Frame 4 and when the GPU finishes processing Frame 4 is the cause of Input Lag. There's a 30ms delay between when the CPU processes your mouse movements and when they are reflected on your screen.

CPUs can also cause a stutter by updating a texture that's currently in use. Let's say the CPU has enqueued 2 frames of commands, and then during the 3rd frame it updates a texture.

If this texture was used during frame 1 or 2, the OpenGL Driver will say "Hey, the GPU hasn't finished rendering that texture yet. You'll have to wait". This causes the CPU to stall until that texture is no longer being rendered:

Now the CPU can copy the new UI bitmap data to the texture. This will cause a stall on the GPU, because the GPU can't start rendering the texture until the memory transfer completes:

The player will only perceive the 5ms stutter on the GPU timeline, but we've still lost valuable time on the CPU.

Solving the Stutters

The workaround to this is to create another texture that we'll store the new bitmap data in, rather than attempting to update a texture that's being used by the GPU. This removes all stalls on the CPU, but the GPU is still slowed down by the time it takes to transfer the texture from the CPU to the GPU (orange Stall bar in the diagram above). The CPU is also transferring the texture data on its main thread, which slows the CPU down.

To solve this, we'll use multithreading on both the CPU and GPU to transfer the texture asynchronously, in the background. OpenGL has excellent support for updating buffers on background threads (Persistent-Mapped Buffers), but it doesn't have the same functionality for textures.

To work around this, OpenGL has a special kind of buffer called a Pixel Buffer Object (PBO). This buffer is special because its contents can be copied directly into a texture. This means we can:

The expensive memory copy is now performed on a background thread, and when the flush completes our bitmap data is now stored on the PBO on the GPU!

The last step is to copy the data from the PBO to the texture. This is much faster than copying from our CPU to the texture, and since both PBO and texture live on the GPU, it won't stall the CPU or the GPU.

However, when copying from a PBO to a texture, NVIDIA GPUs will throw this warning:

Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.

Although GPUs have thousands of cores, they can only process one OpenGL command at a time. The GPU will stop rendering, copy the texture data, and then resume rendering. This means our texture transfer is slowing the game down.

NVIDIA solved this by adding Copy Engines to their GPUs, which can copy texture data around at the same time that the Compute Engine performs rendering. AMD has a similar feature, but I'm not sure what it's called.

Copy Engines

To utilise these copy engines, we need to understand how OpenGL contexts work.

If you're running two games at once on your computer, they will each have their own OpenGL context. By default, each context stores its own textures, buffers, shaders, framebuffers, and has its own queue of frames to process. This means both games are kept completely separate, and prevents Game A from rendering a texture that's owned by Game B.

Because the OpenGL Driver knows that both games don't share resources, it allows Game A to transfer textures while Game B is rendering.

To utilise this functionality within one game, we can create two OpenGL contexts when the game starts up. The first OpenGL context will only be used on the main thread and performs all our typical OpenGL calls. The second OpenGL context will only be used on a background thread and will manage all texture copies.

Since both contexts were created within the same process, OpenGL allows us to share some objects - e.g. buffers and textures - between the two.

All of our code stays the same, except the glTexSubImage2D call to copy data from the PBO to the texture will be executed on the 2nd OpenGL context. Since this context runs on a background thread, its glTexSubImage2D calls will be processed by NVIDIA's Copy Engine. The 1st context will continue to render commands on the main thread unimpeded.

However, since we are sharing textures across multiple threads, the main thread needs to know when the texture transfer has completed and it's safe to render the texture. If it doesn't, the main thread will attempt to render the texture while it's half-updated, which may display a corrupt texture on your screen or cause the GPU to stall while it waits for the transfer to finish.

To do this, we'll create a fence using glFenceSync after the glTexSubImage2D call on the background thread. The background thread will poll the status of this fence, and when it has signaled it will tell the main thread that it's safe to render the texture.
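Here is a rough Python sketch of that polling loop. It uses the standard OpenGL sync calls (`glFenceSync`, `glClientWaitSync`, `glDeleteSync`) through a `gl` object passed in as a parameter. The assumption is that in a real program `gl` would be an OpenGL binding with the second context current on that thread; `FakeGL` below is only a stand-in so the sketch runs without a GPU, and `wait_for_upload` is a hypothetical name.

```python
import threading
import time


def wait_for_upload(gl, ready, poll_interval_s=0.001):
    """Call on the upload thread right after glTexSubImage2D has been issued."""
    fence = gl.glFenceSync(gl.GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
    # Flush so the fence is actually submitted; otherwise polling may never end.
    gl.glFlush()
    try:
        while True:
            status = gl.glClientWaitSync(fence, 0, 0)  # timeout 0: poll only
            if status in (gl.GL_ALREADY_SIGNALED, gl.GL_CONDITION_SATISFIED):
                break
            if status == gl.GL_WAIT_FAILED:
                raise RuntimeError("glClientWaitSync failed")
            time.sleep(poll_interval_s)
    finally:
        gl.glDeleteSync(fence)
    ready.set()  # tell the main thread the texture is safe to render


class FakeGL:
    """Stand-in for a GL binding: the fence signals after a few polls."""

    GL_SYNC_GPU_COMMANDS_COMPLETE = 0x9117
    GL_ALREADY_SIGNALED = 0x911A
    GL_TIMEOUT_EXPIRED = 0x911B
    GL_CONDITION_SATISFIED = 0x911C
    GL_WAIT_FAILED = 0x911D

    def __init__(self, polls_until_signaled=3):
        self.polls_left = polls_until_signaled
        self.deleted = False

    def glFenceSync(self, condition, flags):
        return object()

    def glFlush(self):
        pass

    def glClientWaitSync(self, sync, flags, timeout):
        self.polls_left -= 1
        if self.polls_left <= 0:
            return self.GL_ALREADY_SIGNALED
        return self.GL_TIMEOUT_EXPIRED

    def glDeleteSync(self, sync):
        self.deleted = True


gl = FakeGL()
texture_ready = threading.Event()
worker = threading.Thread(target=wait_for_upload, args=(gl, texture_ready))
worker.start()
worker.join()
```

The main thread would check `texture_ready.is_set()` once per frame and keep drawing the old texture until it returns true, so it never blocks on the transfer.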

Code

In a few hours I'm flying to Japan for a holiday. When I'm back, I will create a public GitHub repo containing the code for the CPU and GPU optimisations covered in this post.

If you are a paid Patreon member, you can access this code on the lagcomp branch on the ve repository. I still have much to improve and clean, but the main files are:

