Hello!
I'm working on an open-source game project that runs on all three major graphics APIs, including Metal.
Mesh shaders are a key component of my GPU-driven workflow. The implementation uses GL_EXT_mesh_shader on PC, and for Metal I'm cross-compiling with my fork of spirv-cross.
Unfortunately, the Metal version appears to be very slow, showing a 2x regression (Apple M1) compared to the draw-call-based version. This is quite surprising: on an RTX 3070 the numbers are the opposite (1.5x speedup).
Note that the mesh shader does culling and IBO compression, unlike the draw-call-based version.
Shader source GLSL:
https://github.com/Try/OpenGothic/blob/master/shader/materials/main.mesh
Cross-compiled MSL:
https://shader-playground.timjones.io/0f60082c67e30fbb8ad9015b48405628
My question is: what are the possible causes of the performance regression?
Any general performance recommendations? Is there any rule similar to prefersLocalInvocationVertexOutput from Vulkan?
How expensive is threadgroup memory?
Note1:
The cross-compiling flow on my side is not perfect: it emits all varyings as shared-memory arrays. This is hard to work around.
Note2:
No task (aka object) shader: that one has bad performance on PC (NVIDIA).
Note3: C++ / macOS 13.0.1 (22A400) / Apple M1
Thanks in advance!
Followup to https://developer.apple.com/forums/thread/722047
After experimenting a bit more with mesh shaders on M1, I came to a theory (which I can't really prove, as there is no profiler for them) that culling is broken in Metal 3.
In my case culling is fairly simple:
The first 16 invocations probe the HiZ pyramid and vote.
a) If all vote non-visible, the shader sets the primitive count to zero and exits.
b) If visible, each thread processes one vertex (usual geometry processing) and writes a valid meshlet.
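To make the control flow above concrete, here is a rough CPU-side model of it in C++. It is only a sketch: the HiZ test results are passed in as a plain array, and names such as kVoters, MeshletResult and processMeshlet are made up for illustration and are not taken from the actual shader.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical names; this models the control flow only, not the real shader.
constexpr std::uint32_t kVoters = 16; // the first 16 invocations probe HiZ

struct MeshletResult {
    std::uint32_t vertexCount    = 0;
    std::uint32_t primitiveCount = 0;
};

// votes[i] is true when invocation i found its HiZ sample visible.
inline bool anyVisible(const std::array<bool, kVoters>& votes) {
    for (bool v : votes)
        if (v)
            return true;
    return false;
}

inline MeshletResult processMeshlet(const std::array<bool, kVoters>& votes,
                                    std::uint32_t vertexCount,
                                    std::uint32_t primitiveCount) {
    // Case (a): every voter says "not visible", so emit an empty meshlet.
    if (!anyVisible(votes))
        return MeshletResult{};
    // Case (b): at least one voter sees the meshlet, so emit it in full.
    MeshletResult r;
    r.vertexCount    = vertexCount;
    r.primitiveCount = primitiveCount;
    return r;
}
```

In this model case (a) does no per-vertex work at all, which is why I would expect culled meshlets to be much cheaper than visible ones.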
Yet if the HiZ test is ignored and the mesh is processed anyway, performance is nearly the same. I also noticed that culling in a mesh shader is never mentioned in any official materials (as opposed to the object shader).
Here I'm reading between the lines a bit: maybe the driver assumes only object-shader-based culling, and a mesh threadgroup always allocates resources for the worst possible case?
My questions at this point:
What is the cost of an empty meshlet?
Is there any upfront cost to launching a mesh threadgrid, as there is with compute shaders on iOS?
Are there any issues with large (1024+) workgroup sizes?
Thanks in advance!
Hi, I'm working on integrating ray queries into my game engine. Vulkan/DX12 work fine on PC, but Metal (on Mac) doesn't:
Here is a screenshot of how the rendering looks:
And a similar spot in the Xcode debugger shows:
The ship, cannons and items are there; the TLAS looks as it should.
Note: in the game screenshot above there are no shadows, but that is not always the case:
Here only some parts of the object cast a shadow.
Fragment shader:
https://shader-playground.timjones.io/44de178b7b8a715ea235c7f12cd0aabc
// relevant part
bool isShadow(...)
{
    ...
    uint flags = 4u;
    flags |= 128u;
    rayQuery.reset(ray(rayOrigin, rayDirection, tMin, rayDistance), topLevelAS, spvMakeIntersectionParams(flags));
    for (;;) // spirv-cross output is not pretty here :(
    {
        bool _116 = rayQuery.next();
        if (_116)
        {
            continue;
        }
        else
        {
            break;
        }
    }
    uint _120 = uint(rayQuery.get_committed_intersection_type());
    if (_120 == 0u)
    {
        return false;
    }
    return true;
}
----
intersection_params spvMakeIntersectionParams(uint flags)
{
    // hacked this part while debugging: set up for the simplest any-hit query
    // (the flags argument is ignored here)
    intersection_params ip;
    ip.force_opacity(forced_opacity::opaque);
    ip.accept_any_intersection(true);
    return ip;
}
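For readers who don't have the ray-flag values memorized, the constants 4u and 128u above can be decoded as in the C++ sketch below. The bit values are the gl_RayFlags*EXT values from GLSL_EXT_ray_query; the enum and helper names (RayFlagBits, hasFlag, shadowRayFlags) are made up for illustration and do not exist in the shader or in spirv-cross.

```cpp
#include <cassert>
#include <cstdint>

// Bit values as defined for gl_RayFlags*EXT in GLSL_EXT_ray_query.
// The C++ names are illustrative only.
enum RayFlagBits : std::uint32_t {
    RayFlagsOpaque               = 0x01,
    RayFlagsNoOpaque             = 0x02,
    RayFlagsTerminateOnFirstHit  = 0x04,
    RayFlagsSkipClosestHitShader = 0x08,
    RayFlagsCullBackFacing       = 0x10,
    RayFlagsCullFrontFacing      = 0x20,
    RayFlagsCullOpaque           = 0x40,
    RayFlagsCullNoOpaque         = 0x80,
};

inline bool hasFlag(std::uint32_t flags, RayFlagBits bit) {
    return (flags & bit) != 0;
}

// Same value the shader builds: 4u | 128u.
inline std::uint32_t shadowRayFlags() {
    return RayFlagsTerminateOnFirstHit | RayFlagsCullNoOpaque;
}
```

So the original query asks for terminate-on-first-hit plus culling of non-opaque geometry, while the hacked spvMakeIntersectionParams ignores these bits and always forces opaque, accept-any-intersection behaviour.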
After verifying the TLAS and the ray-query loop, I conclude that this is most likely a driver bug, or the generated shader code is wrong (but it looks correct to me!).
PS:
One more small thing about Metal RT:
The Metal documentation for MTL::AccelerationStructureTriangleGeometryDescriptor::setIndexBufferOffset says:
"Specify an offset that is a multiple of the index data type size and a multiple of the platform’s buffer offset alignment."
The buffer offset alignment (32 bytes in the worst case) is very hard to deal with for multi-material meshes. No other API requires this, and there is no good workaround.
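To show what this requirement costs in practice, here is a small C++ sketch of the padding it forces on each sub-mesh's index range. It assumes the 32-byte worst case mentioned above; alignUp and paddingFor are made-up helper names, not part of the Metal API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: rounds `offset` up to the next multiple of `alignment`.
// `alignment` must be a non-zero power of two.
inline std::size_t alignUp(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Bytes wasted when a sub-mesh's index range has to start at an aligned offset.
inline std::size_t paddingFor(std::size_t offset, std::size_t alignment) {
    return alignUp(offset, alignment) - offset;
}
```

For example, a sub-mesh whose 16-bit indices would naturally start at byte 300 has to be moved to byte 320, so every material boundary in a shared index buffer can cost up to 31 bytes of padding and a re-layout of data that the other APIs accept as is.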
Hello!
I ran into what seems to be a compiler issue. The shader source given to Metal is: https://shader-playground.timjones.io/1bcf3ffbb313878ccd594ddbb27b746e
This shader is generated by spirv-cross from GLSL source, so for readability here is the original source: https://github.com/Try/OpenGothic/blob/master/shader/hiz/hiz_mip.comp
(the shader variant uses an SSBO counter, not an atomic image)
Here is the relevant part of the application log:
2024-04-21 16:27:13.621218+0200 Gothic2Notr[23992:2003969] Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
2024-04-21 16:27:13.656559+0200 Gothic2Notr[23992:2003969] Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
2024-04-21 16:27:13.701323+0200 Gothic2Notr[23992:2003969] Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
2024-04-21 16:27:13.701477+0200 Gothic2Notr[23992:2003969] MTLCompiler: Compilation failed with XPC_ERROR_CONNECTION_INTERRUPTED on 3 try
2024-04-21 16:27:13.701817+0200 Gothic2Notr[23992:2003969] Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
iOS version: 15.8.2
MTL::CompileOptions::languageVersion: 2.4 (also tested other versions, same result)
Offending part of the shader:
void store(int mip, ivec2 uv, float z) {
    // NOTE: replacing this function with a NOP avoids the crash
    // NOTE2: this switch-case is a crude emulation of bindless storage images
    switch(mip) {
        case 1:
            imageStore(mip1, uv, vec4(z));
            break;
        case 2:
            imageStore(mip2, uv, vec4(z));
            break;
        case 3:
            imageStore(mip3, uv, vec4(z));
            break;
        case 4:
            imageStore(mip4, uv, vec4(z));
            break;
        case 5:
            imageStore(mip5, uv, vec4(z));
            break;
        case 6:
            imageStore(mip6, uv, vec4(z));
            break;
        case 7:
            imageStore(mip7, uv, vec4(z));
            break;
        case 8:
            imageStore(mip8, uv, vec4(z));
            break;
    }
}
Some extra info:
The shader is a simplified single-pass mip-map generator.
The same shader is known to work on an M1 Mac laptop without any issues.
Please have a look; I'm looking forward to a driver fix. Thanks!