Hi PYNing,
It helps to understand how SIMD instructions work.
A SIMD-group instruction must be executed by all threads in the SIMD group, unless the API specifies otherwise. mmmetal already clarified that in your case only 1 thread will be active and will execute the code. For correct behaviour, however, all threads in the SIMD group should execute these instructions.
On Apple GPUs the SIMD size is fixed at 32. So on M1, all 32 threads must follow the same code path to produce a correct result.
Be careful on the CPU side as well: you should probably dispatch work in multiples of the SIMD group size.
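
To illustrate the CPU-side point, here is a minimal sketch of rounding a dispatch size up to a whole number of SIMD groups. It is plain C++ arithmetic, not Metal API code. The names `kSimdWidth`, `paddedThreadCount` and `simdGroupCount` are made up for this example, and the width of 32 is simply the value quoted above; in a real host program you would read it from the pipeline state's `threadExecutionWidth` instead of hard-coding it.

```cpp
#include <cassert>
#include <cstddef>

// Assumed SIMD-group width (32, as stated above for Apple GPUs).
// Hypothetical constant for this sketch; query threadExecutionWidth in real code.
constexpr std::size_t kSimdWidth = 32;

// Number of SIMD groups needed to cover `workItems` threads.
inline std::size_t simdGroupCount(std::size_t workItems,
                                  std::size_t simdWidth = kSimdWidth) {
    assert(simdWidth > 0);
    // Integer ceiling division: a partially filled group still counts as one.
    return (workItems + simdWidth - 1) / simdWidth;
}

// Thread count rounded up to the next multiple of the SIMD width,
// so that every SIMD group in the dispatch is fully populated.
inline std::size_t paddedThreadCount(std::size_t workItems,
                                     std::size_t simdWidth = kSimdWidth) {
    return simdGroupCount(workItems, simdWidth) * simdWidth;
}
```

With these helpers, a request for 1 thread would be dispatched as 32, and a request for 33 as 64. The extra threads presumably still need a bounds check in the kernel around any memory access, placed so that they still reach the SIMD instruction itself.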
Hi, have you tried MPSMatrixMultiplication? It should use these features when possible, and it supports fp16/fp32 precision.