Bad mesh shader performance

Hello!

I'm working on an open-source game project that runs on all three major graphics APIs, including Metal.

The mesh shader is a key component of my GPU-driven workflow. The implementation uses GL_EXT_mesh_shader on PC, and for Metal I'm cross-compiling with my fork of SPIRV-Cross.

Unfortunately, the Metal version appears to be super slow, showing a 2x regression (Apple M1) compared to the draw-call-based version. This is quite surprising; on an RTX 3070 the numbers are the opposite (a 1.5x speedup).

Note that the mesh shader does culling and index-buffer compression, unlike the draw-call-based version.

Shader source (GLSL):

https://github.com/Try/OpenGothic/blob/master/shader/materials/main.mesh

Cross-compiled MSL:

https://shader-playground.timjones.io/0f60082c67e30fbb8ad9015b48405628
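
For readers who don't want to open the links, the cross-compiled Metal side has roughly this shape (a minimal sketch with illustrative names and sizes, not the actual shader):

```metal
#include <metal_stdlib>
using namespace metal;

struct VertexOut {
    float4 position [[position]];
    float2 uv;
};

struct PrimitiveOut {
    float3 normal;  // per-primitive output (illustrative)
};

// Up to 64 vertices and 126 triangles per meshlet.
using MeshletT = mesh<VertexOut, PrimitiveOut, 64, 126, topology::triangle>;

[[mesh]] void meshMain(MeshletT output,
                       uint tid [[thread_index_in_threadgroup]],
                       uint gid [[threadgroup_position_in_grid]])
{
    // The real shader: 1) fetches the meshlet description for this
    // threadgroup (gid), 2) culls it, 3) decodes the compressed index
    // buffer and emits the surviving geometry.
    if (tid == 0)
        output.set_primitive_count(126);   // visible triangle count

    VertexOut v;
    v.position = float4(0.0);
    v.uv       = float2(0.0);
    output.set_vertex(tid, v);             // one vertex per thread

    // set_index()/set_primitive() calls for the decoded indices elided.
}
```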

My question: what are the possible causes of this performance regression? Any general performance recommendations? Is there any rule similar to Vulkan's prefersLocalInvocationVertexOutput? And how expensive is threadgroup memory?
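
For reference, the Vulkan rule I mean is the hint exposed by VK_EXT_mesh_shader; roughly (assuming a valid physicalDevice handle):

```cpp
#include <vulkan/vulkan.h>

// Query VK_EXT_mesh_shader's output-placement hints (Vulkan, for comparison).
VkPhysicalDeviceMeshShaderPropertiesEXT meshProps{};
meshProps.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MESH_SHADER_PROPERTIES_EXT;

VkPhysicalDeviceProperties2 props2{};
props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
props2.pNext = &meshProps;

vkGetPhysicalDeviceProperties2(physicalDevice, &props2);

// VK_TRUE if the GPU prefers each invocation to write the vertex whose
// index matches gl_LocalInvocationID - exactly what my shader does.
VkBool32 preferLocal = meshProps.prefersLocalInvocationVertexOutput;
```

I'm asking whether Metal exposes an equivalent hint.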

Note 1: The cross-compilation flow on my side is not perfect - it emits all varyings as shared-memory arrays. This is hard to work around.

Note 2: No task (aka object) shader - that one has bad performance on PC (NVIDIA).

Note 3: C++ / macOS 13.0.1 (22A400) / Apple M1

Thanks in advance!

This is probably working as expected with the M1 GPU. Mesh shaders on M1 are intended to enable use cases that cannot be expressed as draws (such as dynamic geometry expansion/culling). If draws are faster, then you should probably use that path instead. However, each GPU has a different performance profile, so you should try both paths on different GPUs.

intended to enable use cases that cannot be expressed as draws (such as dynamic geometry expansion/culling)

There is culling.

Today I tested a compilation variant where a local variable is used instead of the shared-memory array. This happens to be valid for my shaders, since there is a 1-to-1 match between gl_LocalInvocationID and the output vertex.

Shader example: https://shader-playground.timjones.io/641b24c9f6700a03eb9f69414ebbf22b
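
Boiled down, the difference between the two variants looks like this (illustrative MSL, not the exact generated code; compute_vertex is a stand-in for the real per-vertex work):

```metal
#include <metal_stdlib>
using namespace metal;

struct VertexOut    { float4 position [[position]]; };
struct PrimitiveOut { float3 normal; };
using MeshletT = mesh<VertexOut, PrimitiveOut, 64, 126, topology::triangle>;

static VertexOut compute_vertex(uint i) {
    VertexOut v;
    v.position = float4(float(i));  // stand-in for the real transform
    return v;
}

// Variant 1 (what my SPIRV-Cross fork emits today): varyings are staged
// in a threadgroup array, then copied into the mesh object:
//
//     threadgroup VertexOut stage[64];
//     stage[tid] = compute_vertex(tid);
//     threadgroup_barrier(mem_flags::mem_threadgroup);
//     output.set_vertex(tid, stage[tid]);
//
// Variant 2 (tested above): thread i only ever writes vertex i, so the
// staging array collapses into a per-thread local variable:
[[mesh]] void meshMain(MeshletT output,
                       uint tid [[thread_index_in_threadgroup]])
{
    VertexOut v = compute_vertex(tid);  // stays in registers
    output.set_vertex(tid, v);
    // set_primitive_count()/set_index() setup elided for brevity.
}
```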

FPS is still roughly as bad as it was. So far it doesn't look like mesh shaders work well on M1. Could it be some sort of driver bug? I mean, I could understand something like a 5-10% performance regression since it's a new feature, but not 200%.

According to xcrun metal-opt, some Metal-supported devices implement mesh shaders through emulation, while others support them natively. I'm not sure whether the M1 is one of those GPUs.

Metal-supported devices implement mesh shaders through emulation

Very interesting, many thanks @philipturner! Unfortunately, I can't really test on anything other than the M1.

Meanwhile, I have a few new questions for @Apple on the performance side:

Is there any capability/property an application can check at runtime to know that mesh shaders are emulated? (I don't really want to blacklist devices; see the heuristic sketch after these questions.)

AFAIK at least one mobile vendor does not run the vertex shader natively, splitting it into a position shader plus a varying shader instead. Is it the same on M1 or not?

Would it make sense to reduce the varying count and rely on [[barycentric_coord]] as much as possible?
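
On the first question: I'm not aware of any direct "is emulated" query, so the closest I can think of is bucketing by GPU family. A heuristic sketch with metal-cpp; the family cutoffs here are assumptions on my part, not documented facts:

```cpp
#include <Metal/Metal.hpp>  // metal-cpp

// Heuristic only: Metal exposes no "mesh shaders are emulated" property
// that I know of, so the best an app seems able to do is bucket by GPU
// family. Which families are native is an assumption here.
bool meshShadersLikelyNative(MTL::Device* device) {
    if (!device->supportsFamily(MTL::GPUFamilyMetal3))
        return false;  // mesh shaders unavailable at all
    // Assumption: newer Apple families run mesh shaders natively,
    // while older ones may emulate them.
    return device->supportsFamily(MTL::GPUFamilyApple8);
}
```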

Thanks!
