Mesh-shader culling is broken?

Followup to https://developer.apple.com/forums/thread/722047

After experimenting a bit more with mesh shaders on M1, I have come to the theory (which I can't really prove, as there is no profiler for them) that culling is broken in Metal 3. In my content, culling is fairly simple:

  1. The first 16 invocations probe the HiZ pyramid and vote.
  2. a) If all of them vote non-visible, the shader sets the primitive count to zero and exits.
     b) If the meshlet is visible, each thread processes one vertex (the usual geometry processing) and writes a valid meshlet.
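To make the flow above concrete, here is a minimal CPU-side model of it in C++. This is only a sketch, not my actual shader: every name in it (`kVoters`, `MeshletOutput`, `laneVotesVisible`, `processMeshlet`) is hypothetical, and it assumes a depth convention where smaller values are closer and each HiZ sample stores the farthest occluder depth for its region.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU-side model of the per-meshlet flow; not real shader code.
constexpr std::size_t kVoters = 16;  // lanes that sample the HiZ pyramid

struct MeshletOutput {
    std::uint32_t primitiveCount = 0;   // zero means "empty meshlet"
    std::uint32_t verticesWritten = 0;
};

// One lane's vote. Assumption: smaller depth is closer, and the HiZ sample
// holds the farthest occluder depth, so the meshlet may be visible when its
// nearest depth is not behind that value.
inline bool laneVotesVisible(float meshletNearestDepth, float hizFarthestDepth) {
    return meshletNearestDepth <= hizFarthestDepth;
}

MeshletOutput processMeshlet(const std::array<float, kVoters>& hizSamples,
                             float meshletNearestDepth,
                             std::uint32_t vertexCount,
                             std::uint32_t primitiveCount) {
    assert(vertexCount > 0 && primitiveCount > 0);

    // Step 1: the first 16 invocations sample HiZ and vote.
    const bool anyVisible = std::any_of(
        hizSamples.begin(), hizSamples.end(),
        [&](float hiz) { return laneVotesVisible(meshletNearestDepth, hiz); });

    MeshletOutput out;
    if (!anyVisible) {
        // Step 2a: everyone voted non-visible; leave the primitive count at zero.
        return out;
    }

    // Step 2b: one vertex per thread, then a valid meshlet is written.
    out.verticesWritten = vertexCount;
    out.primitiveCount = primitiveCount;
    return out;
}
```

Calling `processMeshlet` with all 16 samples in front of the meshlet returns a zero primitive count, which is the case I would expect to be nearly free on the GPU.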

Yet if the HiZ test is ignored and the mesh is processed anyway, performance is nearly the same. I also noticed that culling in the mesh shader is never mentioned in any official materials (as opposed to culling in the object shader).
Here I'm reading between the lines a bit: maybe the driver assumes only object-shader-based culling, and a mesh threadgroup always allocates resources for the worst possible case?

My questions at this point:

  • What is the cost of an empty meshlet?
  • Is there any upfront cost to launching a mesh threadgrid, as there is with compute shaders on iOS?
  • Are there any issues with large (1024+) workgroup sizes?

Thanks in advance!
