rendering thousands of small meshes

I have on the order of 50k small meshes (~64 vertices each), each with different connectivity. Some subset of them changes each frame (generated by a compute kernel). Can I render those in a performant way with Metal?

I'm assuming 50k separate draw calls would be too slow. I have a few ideas:

  1. encode those draw calls on the GPU
  2. or lay out the meshes linearly in blocks, with some maximum size, and use a single draw call, but wasting vertex shader threads on the blocks that aren't full
  3. or use another kernel to combine the little meshes into a big mesh

thanks!

Answered by Graphics and Games Engineer in 686787022

Good question!

The answer depends on a few factors. Perhaps the most important question is: do the meshes share pipeline state objects (PSOs) and bindings? Since point 3 above talks about combining all the little meshes into one big mesh, I'd assume that all the input meshes share the same PSO. If that is the case, the best solution is likely the indirect drawing API. As you mentioned, you can have a compute kernel encode a unified index buffer. The kernel should also write the size of the produced index buffer into a second buffer, which can then be used with the Metal API below:

- (void)drawIndexedPrimitives:(MTLPrimitiveType)primitiveType 
                    indexType:(MTLIndexType)indexType 
                  indexBuffer:(id<MTLBuffer>)indexBuffer 			// <-- index buffer built by the kernel
            indexBufferOffset:(NSUInteger)indexBufferOffset 		// <-- offset into the index buffer built by the kernel
               indirectBuffer:(id<MTLBuffer>)indirectBuffer 		// <-- buffer containing the number of valid indices produced by the kernel (index count)
         indirectBufferOffset:(NSUInteger)indirectBufferOffset;		// <-- offset of index count in the indirect buffer

If more than one PSO is needed, I'd recommend using the same approach and creating one unified index buffer per PSO. There are also other indirect draw methods if they suit your use case better.
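
To make the unified index buffer idea more concrete, here is a minimal CPU-side reference sketch in plain C (it also compiles as Objective-C) of what such a compute kernel might produce. It is not the engineer's implementation. The `SmallMesh` record and the `build_unified_indices` function are hypothetical names invented for this illustration; only the field order of `DrawIndexedIndirectArgs` is taken from Metal, where it mirrors `MTLDrawIndexedPrimitivesIndirectArguments` (index count first). A real kernel would run one thread or threadgroup per mesh and would need an atomic counter or a prefix sum to pick each mesh's output offset; the sequential loop here only shows the data that ends up in the two buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as Metal's MTLDrawIndexedPrimitivesIndirectArguments. */
typedef struct {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t indexStart;
    int32_t  baseVertex;
    uint32_t baseInstance;
} DrawIndexedIndirectArgs;

/* Hypothetical per-mesh record (an assumption of this sketch). */
typedef struct {
    uint32_t firstVertex; /* where the mesh's vertices start in the shared vertex buffer */
    uint32_t firstIndex;  /* where the mesh's local indices start in localIndices */
    uint32_t indexCount;  /* 0 means the mesh is not drawn this frame */
} SmallMesh;

/* Appends each mesh's local indices to `unified`, rebased onto the shared
 * vertex buffer, and fills `args` for a single indirect indexed draw.
 * Returns 0 on success, -1 if `unified` is too small. */
static int build_unified_indices(const SmallMesh *meshes, size_t meshCount,
                                 const uint16_t *localIndices,
                                 uint32_t *unified, size_t unifiedCapacity,
                                 DrawIndexedIndirectArgs *args)
{
    assert(unified != NULL && args != NULL);

    uint32_t written = 0;
    for (size_t m = 0; m < meshCount; ++m) {
        const SmallMesh *mesh = &meshes[m];
        if ((size_t)written + mesh->indexCount > unifiedCapacity) {
            return -1;
        }
        for (uint32_t i = 0; i < mesh->indexCount; ++i) {
            /* Rebase the mesh-local index to a global vertex index. */
            unified[written++] =
                mesh->firstVertex + localIndices[mesh->firstIndex + i];
        }
    }

    args->indexCount    = written; /* number of valid indices produced */
    args->instanceCount = 1;
    args->indexStart    = 0;
    args->baseVertex    = 0;       /* indices are already global */
    args->baseInstance  = 0;
    return 0;
}
```

The unified indices are 32-bit in this sketch because 50k meshes of about 64 vertices each add up to roughly 3.2 million vertices, which does not fit in 16-bit indices; the draw would then use MTLIndexTypeUInt32.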

DrawIndirect doesn't work well on iOS or macOS, since you can only submit one draw at a time referencing the buffer and offset above. There's also no stride for storing additional instance data, and no equivalent of Vulkan's drawIndirectCount, where the GPU supplies the count of things to draw. So it doesn't save much over issuing the draw calls yourself.

If you can target A9, which is where DrawIndirect started, look into IndirectCommandBuffer, which can supply a range of draw calls. It is the only way to submit a batch of commands as one submission.
