Generating vertex data in compute shader

Hi there,

I am working on a 3D game engine in Swift and Metal. Currently I dynamically generate vertex buffers for terrain "chunks" on the CPU, pass all models to the GPU via an argument buffer, and make indirect draw calls.

Calculating where vertices should be is costly and I would like to offload the work to a compute shader. Setting up the shader was straightforward, and I can see that it (at least as an empty function) is being executed by the command buffer.

However, I run into this problem: since I do not know ahead of time how many vertices a chunk of terrain will have, I cannot create a correctly-sized MTLBuffer to pass into the compute function to be populated for later use in a draw call.

The only solution I could think of is something like the following (a rough code sketch follows after the list):

  • For each chunk model, allocate a VertexBuffer and IndexBuffer that will accommodate the maximum possible number of vertices for a chunk.
  • Pass the empty, too-large buffers to the compute function.
  • Populate the too-large buffers and set the actual vertex count and index count on the relevant argument buffer.
  • On the CPU, before the render encoder executes commands in the indirect command buffer, do the following:
  1. for each chunk argument buffer, create new buffers that fit the actual vertex count and index count
  2. blit copy the populated sections of memory from the original too-large buffers to the new correctly-sized buffers
  3. replace the buffers on each chunk model and update the argument buffers for the draw kernel function
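
Something like this on the host side, as a very rough sketch (Vertex, the buffer names, and the count fields are placeholders, and the actual counts are assumed to have already been read back from the compute pass into CPU-visible memory):

// 1. Worst-case buffers that the compute kernel fills.
let maxVertexBytes = maxVerticesPerChunk * MemoryLayout<Vertex>.stride
let maxIndexBytes  = maxIndicesPerChunk  * MemoryLayout<UInt32>.stride
let scratchVertices = device.makeBuffer(length: maxVertexBytes, options: .storageModePrivate)!
let scratchIndices  = device.makeBuffer(length: maxIndexBytes,  options: .storageModePrivate)!

// ... dispatch the compute kernel, which also writes the actual counts ...

// 2. Right-sized buffers, filled with a blit copy of just the populated ranges.
let vertexBytes = Int(chunk.vertexCount) * MemoryLayout<Vertex>.stride
let indexBytes  = Int(chunk.indexCount)  * MemoryLayout<UInt32>.stride
let packedVertices = device.makeBuffer(length: vertexBytes, options: .storageModePrivate)!
let packedIndices  = device.makeBuffer(length: indexBytes,  options: .storageModePrivate)!

let blit = commandBuffer.makeBlitCommandEncoder()!
blit.copy(from: scratchVertices, sourceOffset: 0, to: packedVertices, destinationOffset: 0, size: vertexBytes)
blit.copy(from: scratchIndices,  sourceOffset: 0, to: packedIndices,  destinationOffset: 0, size: indexBytes)
blit.endEncoding()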

But I am still a Metal novice and would like to know if there is any more straightforward or optimal way to accomplish something like this.

Firstly, did you profile why the vertices are expensive to compute before going for this solution?

Also, it's unclear how you're computing the vertices since you haven't provided code or an algorithm for that part, so it's hard to tell if you're doing the compute step optimally. Successfully using compute relies heavily on taking advantage of parallelism, so make sure it makes sense to use a compute kernel.

Roughly, I imagine you could allocate one gigantic buffer; no need for multiple. Conceptually split the buffer into fixed-size sections (X vertices each), each handled by some specified number of threads. You can tune this size.
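
For instance, the bookkeeping could be as simple as something like this (all sizes and names here are made up and would need tuning):

// One big buffer, conceptually split into equal fixed-size sections, one per chunk slot.
let sectionVertexCapacity = 6 * 1024                                   // "X vertices" per section
let sectionBytes = sectionVertexCapacity * MemoryLayout<Vertex>.stride
let maxChunkSlots = 4096

let sharedVertexBuffer = device.makeBuffer(length: maxChunkSlots * sectionBytes,
                                           options: .storageModePrivate)!

// A chunk's slot index gives both the kernel's write offset and the draw call's
// vertex buffer offset.
func vertexByteOffset(forSlot slot: Int) -> Int {
    slot * sectionBytes
}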

Beyond that, it's tricky to help, but maybe with more specific info, it'll be easier.

Reusing one big buffer instead of many seems like a great idea. The fewer moving parts, the better here.

Why the vertices are expensive to compute is mostly a question of scale. For instance, with indirect draw calls I can handle 4000+ chunks of terrain and 25 million+ vertices at 60 fps. At much lower terrain distances the meshing on the CPU can keep up fine, but when I push it there is a lot of frame dropping when moving over the terrain and generating meshes for dozens or hundreds of chunks. Part of this is that my meshing is single-threaded on the CPU, but my experience with Swift multi-threading has so far been hit and miss for this kind of thing.

As such, parallelism is the entire point of trying to offload this to a compute shader.

The process of vertex generation is pretty standard for block/Minecraft-style terrain. We loop through all the blocks in a chunk and, if a block is solid, check whether any of its neighbours are air and therefore whether we should emit a quad for that face. Roughly:

for y in 0..<CHUNK_SIZE_Y {
  for z in 0..<CHUNK_SIZE {
    for x in 0..<CHUNK_SIZE {
      let index = blockIndex(x, y, z)
      let block = chunk.block(at: index)
      if block.type != BLOCK_TYPE_AIR {
        // Check each of the six faces against its neighbouring block.
        for faceOffset in blockFaceOffsets {
          let neighbourPosition = blockPosition(x, y, z) + faceOffset
          if shouldDrawFace(at: neighbourPosition, in: chunk) {
            // add interleaved vertices plus indices for the face
          }
        }
      }
    }
  }
}

There is no doubt room for optimization in what I am doing, but there is also the limitation of not doing this in parallel. I think a compute kernel is appropriate, since each block's result does not depend on any other block's result. I'll experiment with using a single buffer for the compute shader results plus offsets.
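
Roughly, I'm picturing a host-side dispatch along these lines (every name here, e.g. meshingPipeline, sharedVertexBuffer, faceCounters, chunkSlot, is a placeholder rather than real code from my engine):

let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(meshingPipeline)

// The chunk slot tells the kernel which fixed-size section of the shared
// buffers it is allowed to write into.
var slot = UInt32(chunkSlot)
encoder.setBytes(&slot, length: MemoryLayout<UInt32>.stride, index: 0)
encoder.setBuffer(blockBuffer, offset: 0, index: 1)          // block types
encoder.setBuffer(sharedVertexBuffer, offset: 0, index: 2)   // one big vertex buffer
encoder.setBuffer(sharedIndexBuffer, offset: 0, index: 3)    // one big index buffer
encoder.setBuffer(faceCounters, offset: 0, index: 4)         // atomic face count per chunk

// One thread per block; each thread only reads its neighbours, so no thread
// depends on another thread's output (the atomic counter hands out face slots).
let grid = MTLSize(width: CHUNK_SIZE, height: CHUNK_SIZE_Y, depth: CHUNK_SIZE)
let threadsPerThreadgroup = MTLSize(width: 4, height: 4, depth: 4)   // tune this
encoder.dispatchThreads(grid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()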

Also curious if there are any other obvious approaches I'm missing to take advantage of the parallelism of GPU shaders!

Consider looking into mesh shaders; they are designed for this kind of thing.

Mesh shaders are fascinating but appear to be about generating (or modifying existing) geometry each frame, whereas I only need to generate the geometry when new terrain is loaded in, and to keep it in buffer memory to avoid repeating the work. It sounds like a compute shader and temporary buffers are the way to go.

I would still consider mesh shaders even if regenerating the geometry every frame seems wasteful. You will likely end up with considerably simpler code, and unless your terrain generation is extremely complicated I kind of doubt that you will see any performance degradation (don't forget that the mesh shader also performs the function of the vertex shader, with the benefit of having access to neighbouring vertex data "for free"). Tessellation, for example, also regenerates tons of geometry every frame, and yet it's a popular technique for improving performance (because storing and moving all that geometry in memory ends up being more expensive than regenerating it on the fly).

Follow-up on this:

Mesh shaders were not viable for my use case here as it involves a triple-nested for loop over 10,000+ terrain blocks for each thread, ideally for thousands of chunks of terrain.

Using a compute shader to generate vertices is working great. The biggest difference from the initial approach I described is cutting out the step where I synchronize with the CPU and copy vertex and index data back into another correctly-sized buffer.

That step is unnecessary: if you can afford a bit of extra memory, just specify the size when you call set_vertex_buffer in the indirect command encoder. This way, the generated vertex and index buffers never have to be touched by the CPU at all and can be set to storage mode private. I would think about combining them all into one buffer, but there doesn't seem to be a set_vertex_buffer variant usable from the kernel that allows specifying an offset.
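
For reference, the CPU-side setup ends up looking roughly like this (buffer sizes and names are placeholders, not my actual code):

// Generated geometry lives in private storage; the CPU never touches it.
let chunkVertexBuffer = device.makeBuffer(length: maxVertexBytesPerChunk,
                                          options: .storageModePrivate)!
let chunkIndexBuffer  = device.makeBuffer(length: maxIndexBytesPerChunk,
                                          options: .storageModePrivate)!

// Indirect command buffer whose draws are encoded on the GPU by the kernel.
let icbDescriptor = MTLIndirectCommandBufferDescriptor()
icbDescriptor.commandTypes = [.drawIndexed]
icbDescriptor.inheritPipelineState = true
icbDescriptor.maxVertexBufferBindCount = 2   // e.g. vertices + per-chunk uniforms
icbDescriptor.maxFragmentBufferBindCount = 0

let indirectCommandBuffer = device.makeIndirectCommandBuffer(descriptor: icbDescriptor,
                                                             maxCommandCount: maxChunks,
                                                             options: .storageModePrivate)!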

All in all, this seems like a fine approach for me. Also: Tier 2 Argument Buffers are amazing! A game changer. Once it clicks, it's very natural and makes development simpler.
