[Metal] What's wrong with simdgroup_load or simdgroup_store?

OS: MacOS 12.2.1

Hardwear: MacBook Pro 2020, M1

Metal: 2.4

Xcode: 13.2.1

Here is my test computer kernel,which read input buffer with simdgroup_load adn write output buffer with simdgroup_store

kernel void fun(
    const device half * Src                 [[ buffer(0) ]],
    constant uint4 & SrcShape               [[ buffer(1) ]],
    device half * Dst                       [[ buffer(2) ]],
    constant uint4 & DstShape               [[ buffer(3) ]],
    const device half * Weight              [[ buffer(4) ]],
    ushort3 threadgroup_position_in_grid    [[ threadgroup_position_in_grid ]],
    ushort3 thread_position_in_threadgroup  [[ thread_position_in_threadgroup ]],
    ushort3 threads_per_threadgroup         [[ threads_per_threadgroup ]],
    ushort3 thread_position_in_grid         [[ thread_position_in_grid ]])
{

    const int SrcSlices = (int)SrcShape[0];
    const int SrcHeight = (int)SrcShape[1];
    const int SrcWidth  = (int)SrcShape[2];
    const int DstSlices = (int)DstShape[0];
    const int DstHeight = (int)DstShape[1];
    const int DstWidth  = (int)DstShape[2];
    const int Kernel_X = 3;
    const int KernelElemNum = 3 * 3;
    const int N_Pack = 8;

    // test only 1 thread
   if(thread_position_in_grid.z != 0 || thread_position_in_grid.y != 0 || thread_position_in_grid.x * N_Pack != 0) return;

    simdgroup_half8x8 sgMatY;
    simdgroup_load(sgMatY, Src);

    simdgroup_store(sgMatY, Dst);

}

It's a simple shader, however output buffer only save the first 2 values from input buffer, the other 62 values are ALL ZERO

Here is the result from Xcode Metal Capture

How can I debug or fix it?

Hi PYNing, it seems likely your kernel is working as expected, let me explain:

  • simdgroup_half8x8 is a 64-wide operation (8 x 8 = 64).
  • M1 uses 32 threads per threadgroup, this can be determined via threadExecutionWidth as explained here.
  • Using 32 threads, each thread performs 2 of the 64 operations.

This if statement:

// test only 1 thread
if(thread_position_in_grid.z != 0 || thread_position_in_grid.y != 0 || thread_position_in_grid.x * N_Pack != 0) return;

is testing whether thread_position_in_grid.xyz != ushort3(0, 0, 0) and thus every thread except thread 0 is masked out (inactive after the if statement). If your input data contains 0 and 1 at indices 0 and 1, then thread 0 writes out exactly these 2 values and your output is as expected.

You mentioned you already captured a GPU frame, I'd recommend having a look at Metal Shader Debugging and Profiling to learn how to debug shaders in Xcode.

Hi PYNing,

You should understand how SIMD instructions works.

SIMD instruction must be the same for all threads in SIMD group, unless it specified in API. And the same SIMD instruction must be executed by all threads in SIMD group. mmmetal already clarified that in your case only 1 thread will be active and will execute the code. But for correct behaviour all threads in SIMD group should execute this instructions.

For Apple GPUs SIMD size is fixed and equal to 32. So in case of M1 all 32 threads must execute the same code path to produce correct result.

Be careful also from CPU side, you probably should run tasks in SIMD group size quantities.

Thanks for mmmetal and roserg! I did misunderstanding the SIMDgroup usage, you are right.

By the way, Anyone who intent to use this feature can refer in the implementation in TF-Lite: https://github.com/alpa-projects/tensorflow-alpa/blob/ee8f6612b515ada4509fa53491c5ba5b3ef8524a/tensorflow/lite/delegates/gpu/common/tasks/conv_metal_simd.cc

[Metal] What's wrong with simdgroup_load or simdgroup_store?
 
 
Q