Viewport-based culling for Metal shaders

I have a similar geometry culling question to this, but my situation is more about culling geometries outside the viewport.

In my case I have a custom geometry; for simplicity, let's say it's a 2D polyline with fixed points. (In reality it's a bit more complex than this: the vertices require some computation, and sometimes there are faces, but they rarely overlap, so depth-based culling is of limited help.)

Code Block
#include <simd/simd.h>

// One polyline instance: four fixed control points.
typedef struct PolyLine {
    simd_float2 start;
    simd_float2 a;
    simd_float2 b;
    simd_float2 end;
} PolyLine;


I have these in a device PolyLine *buffer and encode a draw call with

Code Block
_commandEncoder.drawPrimitives(type: .lineStrip, vertexStart: 0, vertexCount: 4, instanceCount: 50000)


If you're zoomed in very closely on this scene, most of the geometry may be well outside the viewing area. In some cases an instance is only partially visible (for example, start->a is on screen, but a->b and b->end are not). For geometries with more vertices, it is more likely that only a small fraction of each instance is visible.


Often, I don't cull this at all. Other times, if the shader is more expensive, I do some kind of bailout check on the whole instance as a prelude in the vertex shader. If the instance is invisible, I output a constant vertex position outside the viewport.

What is the best practice for avoiding unnecessary work here? Should I be indirectly encoding draws for each visible instance, or does that introduce more overhead? Is there a best-practice way to tell Metal that a vertex (or an instance) can be discarded or is picking some faraway position ok?
Answered by Graphics and Games Engineer in 614348022
Metal provides a lot of flexibility to optimize the vertex processing stage. The best approach will depend on your scene's characteristics.

Let Metal System Trace and Metal Performance Counters guide your efforts: https://developer.apple.com/documentation/metal/render_pipelines/optimizing_performance_with_gpu_counters

If your app generates a lot of geometry, then hierarchical frustum culling on the CPU or coarse-grained culling in a compute pre-pass could be a good choice. If the vertex shader is heavyweight, then a fine-grained culling pre-pass on the GPU is worth exploring. GPU culling will generate indirect draw calls or an indirect command buffer; the samples listed at the end of this answer use this approach. However, splitting an instanced draw into many non-instanced indirect draws will incur overhead, so you may need to find a way to keep the indirect draws instanced. For example, this may be easy if you cull in batches of instances. In other cases, you may need to generate index and instance buffers in the culling pre-pass, and whether this is a performance win will depend on your scene's ratio of visible to culled primitives.
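As a rough illustration of the last option above (generating an instance buffer in a culling pre-pass), here is a CPU-side C++ sketch of the logic a compute kernel might perform. It is not taken from the linked samples. `Float2`, `Rect`, `DrawArgs`, `boundsOverlap`, and `cullAndCompact` are hypothetical names; `DrawArgs` is assumed to mirror the field order of `MTLDrawPrimitivesIndirectArguments`, and the view rectangle is assumed to be in the same coordinate space as the polyline points.

```cpp
#include <cstdint>
#include <vector>

// Same four control points as the PolyLine struct in the question,
// written with plain floats so the sketch compiles off-device.
struct Float2 { float x, y; };
struct PolyLine { Float2 start, a, b, end; };

// Visible region, assumed to share a coordinate space with the points.
struct Rect { float minX, minY, maxX, maxY; };

// Hypothetical mirror of MTLDrawPrimitivesIndirectArguments (assumed field order).
struct DrawArgs {
    uint32_t vertexCount;
    uint32_t instanceCount;
    uint32_t vertexStart;
    uint32_t baseInstance;
};

// Conservative test: true when the polyline's bounding box overlaps the view.
// It may keep an instance whose segments all miss the view, but it never
// rejects an instance that is actually visible.
bool boundsOverlap(const PolyLine& p, const Rect& view) {
    const Float2 pts[4] = {p.start, p.a, p.b, p.end};
    float minX = pts[0].x, maxX = pts[0].x;
    float minY = pts[0].y, maxY = pts[0].y;
    for (int i = 1; i < 4; ++i) {
        if (pts[i].x < minX) minX = pts[i].x;
        if (pts[i].x > maxX) maxX = pts[i].x;
        if (pts[i].y < minY) minY = pts[i].y;
        if (pts[i].y > maxY) maxY = pts[i].y;
    }
    return maxX >= view.minX && minX <= view.maxX &&
           maxY >= view.minY && minY <= view.maxY;
}

// Collects the indices of visible instances into visibleIds and returns the
// arguments for a single instanced draw that covers only those instances.
DrawArgs cullAndCompact(const std::vector<PolyLine>& lines,
                        const Rect& view,
                        std::vector<uint32_t>& visibleIds) {
    visibleIds.clear();
    for (uint32_t i = 0; i < lines.size(); ++i) {
        if (boundsOverlap(lines[i], view)) {
            visibleIds.push_back(i);
        }
    }
    const uint32_t visibleCount = static_cast<uint32_t>(visibleIds.size());
    return DrawArgs{4u, visibleCount, 0u, 0u};
}
```

On the GPU, each thread would typically test one instance and append its index using an atomic counter, and the vertex shader would then use its instance ID to look up the real polyline index in the compacted list. The draw stays a single instanced draw, only with a smaller instance count.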

For many apps, a coarse frustum cull and then letting the GPU's fixed function hardware take care of the fine clipping and culling is the right balance.
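A minimal sketch of such a coarse cull, assuming the instances are stored so that each batch is a contiguous, spatially coherent range with precomputed bounds. It is written as CPU-side C++ and is not from the linked samples; `Batch`, `DrawArgs`, `overlaps`, and `drawsForVisibleBatches` are hypothetical names, and `DrawArgs` is assumed to mirror the field order of `MTLDrawPrimitivesIndirectArguments`.

```cpp
#include <cstdint>
#include <vector>

struct Rect { float minX, minY, maxX, maxY; };

// A contiguous run of instances plus the union of their bounds.
struct Batch {
    Rect bounds;
    uint32_t firstInstance;
    uint32_t instanceCount;
};

// Hypothetical mirror of MTLDrawPrimitivesIndirectArguments (assumed field order).
struct DrawArgs {
    uint32_t vertexCount;
    uint32_t instanceCount;
    uint32_t vertexStart;
    uint32_t baseInstance;
};

// True when the two rectangles share any area or touch at an edge.
bool overlaps(const Rect& a, const Rect& b) {
    return a.maxX >= b.minX && a.minX <= b.maxX &&
           a.maxY >= b.minY && a.minY <= b.maxY;
}

// Emits one instanced draw per batch whose bounds touch the view.
// Instances inside a visible batch are not tested individually; anything
// off-screen within the batch is left to the GPU's clipping hardware.
std::vector<DrawArgs> drawsForVisibleBatches(const std::vector<Batch>& batches,
                                             const Rect& view,
                                             uint32_t verticesPerInstance) {
    std::vector<DrawArgs> draws;
    for (const Batch& batch : batches) {
        if (overlaps(batch.bounds, view)) {
            draws.push_back(DrawArgs{verticesPerInstance, batch.instanceCount,
                                     0u, batch.firstInstance});
        }
    }
    return draws;
}
```

Each surviving draw is still instanced, so the per-draw overhead scales with the number of batches rather than the number of instances. The batch size is a tuning choice: larger batches mean fewer draws but coarser culling.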

For the question about discarding a vertex: there is no significant overhead in discarding a vertex by moving it off-screen, but you will need to take care that the algorithm can handle clipped primitives if one of the vertices remains inside the view frustum.
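One way to avoid the partially visible case is to make the decision per instance rather than per vertex, so that either every vertex of an instance keeps its real position or every vertex is moved to the same off-screen point. The following CPU-side C++ sketch shows that logic as it might appear in a vertex function. It assumes the points are already in normalized device coordinates (visible range [-1, 1] on both axes); `Float2`, `Float4`, `instanceVisible`, and `vertexPosition` are hypothetical names.

```cpp
#include <cstdint>

struct Float2 { float x, y; };
struct Float4 { float x, y, z, w; };
struct PolyLine { Float2 start, a, b, end; };

// True when the instance's bounding box overlaps the [-1, 1] square.
// Assumes the points are already in normalized device coordinates.
bool instanceVisible(const PolyLine& p) {
    const Float2 pts[4] = {p.start, p.a, p.b, p.end};
    float minX = pts[0].x, maxX = pts[0].x;
    float minY = pts[0].y, maxY = pts[0].y;
    for (int i = 1; i < 4; ++i) {
        if (pts[i].x < minX) minX = pts[i].x;
        if (pts[i].x > maxX) maxX = pts[i].x;
        if (pts[i].y < minY) minY = pts[i].y;
        if (pts[i].y > maxY) maxY = pts[i].y;
    }
    return maxX >= -1.0f && minX <= 1.0f && maxY >= -1.0f && minY <= 1.0f;
}

// Returns the clip-space position for one vertex of one instance.
// vertexId is expected to be in the range 0...3.
Float4 vertexPosition(const PolyLine& p, uint32_t vertexId) {
    if (!instanceVisible(p)) {
        // Every vertex of a hidden instance collapses to one point outside
        // the clip volume (x > w), so no segment can cross the viewport.
        return Float4{2.0f, 2.0f, 0.0f, 1.0f};
    }
    // A partially visible instance keeps all of its real positions and
    // relies on the fixed-function clipper to trim the off-screen parts.
    const Float2 pts[4] = {p.start, p.a, p.b, p.end};
    const Float2 v = pts[vertexId];
    return Float4{v.x, v.y, 0.0f, 1.0f};
}
```

Because the bailout is all-or-nothing per instance, a line segment never has one real endpoint and one relocated endpoint, which is the situation that would produce stray lines across the screen.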

Samples:
https://developer.apple.com/documentation/metal/indirect_command_buffers/encoding_indirect_command_buffers_on_the_gpu
https://developer.apple.com/documentation/metal/modern_rendering_with_metal 