Followup to https://developer.apple.com/forums/thread/722047
After experimenting a bit more with mesh shaders on M1, I have come to the theory (which I can't really prove, as there is no profiler for them) that culling is broken in Metal 3. In my content, culling is fairly simple:
- The first 16 invocations probe the HiZ pyramid and vote.
- a) If all of them vote non-visible, the shader sets the primitive count to zero and exits.
- b) If visible, each thread processes one vertex (the usual geometry processing) and writes a valid meshlet.
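To make the flow above concrete, here is a minimal CPU-side C++ model of what one mesh threadgroup does. It is only a sketch, not the actual shader: `runMeshThreadgroup`, `MeshletOutput`, and `kVoters` are made-up names for illustration, and the HiZ probe results are passed in as plain booleans instead of being sampled from a pyramid.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Number of invocations that take part in the HiZ vote (16 in my case).
constexpr std::size_t kVoters = 16;

// Hypothetical summary of what a threadgroup emits.
struct MeshletOutput {
    std::uint32_t vertexCount = 0;
    std::uint32_t primitiveCount = 0;
};

// CPU-side model of one mesh threadgroup.
// hizVisible[i] stands in for the result of invocation i's HiZ probe.
MeshletOutput runMeshThreadgroup(const std::array<bool, kVoters>& hizVisible,
                                 std::uint32_t meshletVertices,
                                 std::uint32_t meshletPrimitives) {
    // Vote: the meshlet counts as visible if any probe says so.
    bool anyVisible = false;
    for (bool visible : hizVisible) {
        anyVisible = anyVisible || visible;
    }

    MeshletOutput out;
    if (!anyVisible) {
        // Case a: everyone voted non-visible, so emit zero primitives and exit.
        return out;
    }

    // Case b: one vertex per thread, then write the valid meshlet.
    out.vertexCount = meshletVertices;
    out.primitiveCount = meshletPrimitives;
    return out;
}
```

In this model, case a does no per-vertex work at all, which is why I expected the culled path to be noticeably cheaper on the GPU as well.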
Yet if the HiZ test is ignored and the mesh is processed anyway, performance is close to the same. I also noticed that culling in the mesh shader is never mentioned in any official materials (as opposed to culling in the object shader).
Here I'm reading between the lines a bit: maybe the driver assumes only object-shader-based culling, and a mesh threadgroup always allocates resources for the worst possible case?
My questions at this point:
- What is the cost of an empty meshlet?
- Is there any upfront cost of launching a mesh threadgrid, as there is with iOS compute shaders?
- Are there any issues with large (1024+) workgroup sizes?
Thanks in advance!