Making an easily extendable GPU Driven Renderer in Vulkan

Looping over thousands of entities and generating draw calls every frame turns out to be pretty expensive, what if the GPU did most of the work instead?

In the past, draw call generation was typically a major bottleneck for game renderers, and is done on the CPU where ideally you want as much frame-time budget as possible for gameplay systems and things like physics. The obvious solution here is multithreading draw command generation, but this is still not efficient and creates a slew of other problems.

What are draw calls?

Draw calls are how you submit work to the GPU using a graphics API. The easy way to set up a game renderer, is to have the renderer loop every entity in the scene, grab the mesh and material, and send draw calls to the GPU every frame. This obviously scales horribly with large scenes.

Thankfully, modern graphics APIs introduced MultiDrawIndirect commands, this lets us simply send a huge buffer of draw commands to the GPU, and then tell the GPU to execute all of these draw commands. Still, this provides no performance benefit if we are just looping every entity in the scene, every frame, and sending them to the GPU.

I decided to implement a system similar to what Remedy employed in their engine for Alan Wake 2.

Instead of looping over every entity in the scene every frame, we generate draw calls for a mesh when it is added to the scene, these are stored in multiple arrays, called 'buckets'. We have different buckets to separate meshes by shader, and draw order, for instance we have an opaque mesh bucket, a transparent mesh bucket, and an alpha tested mesh bucket.

struct MeshletDrawCommand {
    uint64_t vertexBufferAddress;
    uint64_t indexBufferAddress;
    uint64_t meshletDataBufferAddress;
    uint64_t meshletBoundsBufferAddress;
    uint64_t meshletVerticesBufferAddress;
    uint64_t meshletTrianglesBufferAddress;
    uint32_t entityIndex;
    uint32_t meshletDataIndex;
    uint32_t meshletVerticesOffset;
    uint32_t globalVerticesOffset;
    uint32_t globalTriangleOffset;
    VkAccelerationStructureKHR blasHandle;
};

Every frame, we just upload all the data from these arrays over to the GPU, and call vkCmdDrawIndirect. This makes the time spent on the CPU every frame nearly consistent regardless of scene size. In my renderer, the CPU time is less than 1ms even in a scene with thousands of meshes and it runs on just a single thread.

This brings up the question of how to do culling? If we want to have the CPU do as little work as possible, then the GPU needs to be sent the entire array of meshes in our scene, even if most of them aren't visible. This leads us to compute and mesh shaders.

There are multiple ways we can do GPU side culling in a compute shader, for instance each draw commands can contain a bit for whether it is culled or not, and the compute shader can set the cull value based on coarse frustum and occlusion culling done in the compute shader, or the compute shader can just populate its own draw command arrays. Personally I am not using a coarse compute culling pass in my renderer yet, I am just using mesh shaders to cull at the meshlet level.

Mesh Shaders

Mesh shaders are a relatively new feature of modern GPU hardware, mesh shaders completely replace the standard rendering pipeline, composed of many different shader stages like Vertex, Geometry, Pixel shaders etc. with just Task shaders, Mesh shaders, and Fragment shaders.

Task and Mesh shaders work very similarly to compute shaders, but they have the ability to output completely custom geometry to the fragment shader, this means they can do everything Vertex, Geometry, and Tessellation shaders could do all from one shader.

This also means we can very easily just do frustum and occlusion culling from inside the mesh shader.

Mesh shaders are used with meshlets. The idea being that you split a mesh into many little pieces when it is loaded into the engine, and each mesh shader invocation is responsible for handling one meshlet, this allows us to do very granular culling, and GPU side Level of Detail computations. This can greatly improve performance in scenes with billions of triangles.

Now that is the GPU driven rendering covered, but especially in vulkan managing all this is still quite verbose, so I decided to implement a custom render graph to drive my renderer.

A render graph is a directed acyclic graph that orders how a scene should be rendered

My render graph is still quite primitive, it reads a bunch of JSON files from a subdirectory in the game folder and organizes them into a graph, based on their inputs and outputs

{
    "m_prefabType": 3,
    "name": "main_pass",
    "disabled": false,
    "input_count": 1,
    "inputs": [
        {
            "clear_color_w": 0.0,
            "clear_color_x": 0.0,
            "clear_color_y": 0.0,
            "clear_color_z": 0.0,
            "clear_depth_depth": 1.0,
            "clear_depth_stencil": 0,
            "depth": 1,
            "depth_pyramid": true,
            "flags": 32,
            "format": "DEPTH",
            "height": 0,
            "isDepth": true,
            "load_op": 0,
            "mips": 0,
            "name": "depth_image",
            "shading_rate_image": false,
            "store_op": 0,
            "type": 1,
            "width": 0
        }
    ],
    "output_count": 5,
    "outputs": [
        {
            "clear_color_w": 0.0,
            "clear_color_x": 0.0,
            "clear_color_y": 0.0,
            "clear_color_z": 0.0,
            "clear_depth_depth": 0.0,
            "clear_depth_stencil": 0,
            "depth": 1,
            "depth_pyramid": false,
            "flags": 29,
            "format": "VK_FORMAT_B8G8R8A8_UNORM",
            "height": 0,
            "isDepth": false,
            "load_op": 1,
            "mips": 1,
            "name": "gbuffer_color",
            "shading_rate_image": false,
            "store_op": 0,
            "type": 1,
            "width": 0
        },
        {
            "clear_color_w": 0.0,
            "clear_color_x": 0.0,
            "clear_color_y": 0.0,
            "clear_color_z": 0.0,
            "clear_depth_depth": 0.0,
            "clear_depth_stencil": 0,
            "depth": 1,
            "depth_pyramid": false,
            "flags": 29,
            "format": "VK_FORMAT_R16G16_SNORM",
            "height": 0,
            "isDepth": false,
            "load_op": 1,
            "mips": 1,
            "name": "gbuffer_normal",
            "shading_rate_image": false,
            "store_op": 0,
            "type": 1,
            "width": 0
        },
        {
            "clear_color_w": 0.0,
            "clear_color_x": 0.0,
            "clear_color_y": 0.0,
            "clear_color_z": 0.0,
            "clear_depth_depth": 0.0,
            "clear_depth_stencil": 0,
            "depth": 1,
            "depth_pyramid": false,
            "flags": 29,
            "format": "VK_FORMAT_B8G8R8A8_UNORM",
            "height": 0,
            "isDepth": false,
            "load_op": 1,
            "mips": 1,
            "name": "gbuffer_orm",
            "shading_rate_image": false,
            "store_op": 0,
            "type": 1,
            "width": 0
        },
        {
            "clear_color_w": 0.0,
            "clear_color_x": 0.0,
            "clear_color_y": 0.0,
            "clear_color_z": 0.0,
            "clear_depth_depth": 0.0,
            "clear_depth_stencil": 0,
            "depth": 1,
            "depth_pyramid": false,
            "flags": 29,
            "format": "VK_FORMAT_R8G8B8A8_UNORM",
            "height": 0,
            "isDepth": false,
            "load_op": 1,
            "mips": 1,
            "name": "gbuffer_emissive",
            "shading_rate_image": false,
            "store_op": 0,
            "type": 1,
            "width": 0
        },
        {
            "clear_color_w": 0.0,
            "clear_color_x": 0.0,
            "clear_color_y": 0.0,
            "clear_color_z": 0.0,
            "clear_depth_depth": 0.0,
            "clear_depth_stencil": 0,
            "depth": 1,
            "depth_pyramid": false,
            "flags": 30,
            "format": "VK_FORMAT_R32G32_SFLOAT",
            "height": 0,
            "isDepth": false,
            "load_op": 1,
            "mips": 1,
            "name": "frustum_motion_vector_image",
            "shading_rate_image": false,
            "store_op": 0,
            "type": 1,
            "width": 0
        }
    ]
}

It then does a depth first search over the graph which gives us proper scene order. This allows the render graph to set dynamic states needed for rendering each pass, without having to hardcode a bunch of different functions for each render pass. Still we do need custom code sometimes so the rendergraph is setup to use std::function, and we can set custom PreRender, Render, and PostRender functions for each pass when necessary

renderGraph->SetNodePreRender("depth_pyramid_pass", std::bind(&VulkanRenderer::DepthPyramidPreRender, this, std::placeholders::_1, std::placeholders::_2));
renderGraph->SetNodeRender("depth_pyramid_pass", std::bind(&VulkanRenderer::DepthPyramidRender, this, std::placeholders::_1, std::placeholders::_2));
renderGraph->SetNodePostRender("depth_pyramid_pass", std::bind(&VulkanRenderer::EmptyRenderFunc, this, std::placeholders::_1, std::placeholders::_2));

This also allows us to easily keep track of what resources each pass use, since the rendergraph handles resource creation

My render graph still doesn't handle barriers or resource aliasing, but I will definitely need to add that eventually. This is good enough for now so I can work on making pretty pixels show on the screen.

Oct 2024

Designing a data reflection system for my game engine, inspired by Unreal Header Tool

Oct 2024

Designing the input system for my game engine

Jan 2024

Reverse engineering Star Citizen's serialization system to have it reverse engineer the rest of the game for you

Jun 2023

Utilizing the memflow memory introspection library to spy on BattlEye

Making an easily extendable GPU Driven Renderer in Vulkan

More Posts