OS: macOS 12.2.1
Hardware: MacBook Pro 2020, M1
Metal: 2.4
Xcode: 13.2.1
Here is my test compute kernel, which reads the input buffer with simdgroup_load and writes the output buffer with simdgroup_store:
kernel void fun(
    const device half * Src [[ buffer(0) ]],
    constant uint4 & SrcShape [[ buffer(1) ]],
    device half * Dst [[ buffer(2) ]],
    constant uint4 & DstShape [[ buffer(3) ]],
    const device half * Weight [[ buffer(4) ]],
    ushort3 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
    ushort3 thread_position_in_threadgroup [[ thread_position_in_threadgroup ]],
    ushort3 threads_per_threadgroup [[ threads_per_threadgroup ]],
    ushort3 thread_position_in_grid [[ thread_position_in_grid ]])
{
    const int SrcSlices = (int)SrcShape[0];
    const int SrcHeight = (int)SrcShape[1];
    const int SrcWidth = (int)SrcShape[2];
    const int DstSlices = (int)DstShape[0];
    const int DstHeight = (int)DstShape[1];
    const int DstWidth = (int)DstShape[2];
    const int Kernel_X = 3;
    const int KernelElemNum = 3 * 3;
    const int N_Pack = 8;
    // test only 1 thread
    if(thread_position_in_grid.z != 0 || thread_position_in_grid.y != 0 || thread_position_in_grid.x * N_Pack != 0) return;
    simdgroup_half8x8 sgMatY;
    simdgroup_load(sgMatY, Src);
    simdgroup_store(sgMatY, Dst);
}
It's a simple shader; however, the output buffer only contains the first 2 values from the input buffer. The other 62 values are all zero.
Here is the result from Xcode Metal Capture
How can I debug or fix it?
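
One thing that may be worth checking, offered as a possible explanation rather than a confirmed diagnosis: the SIMD-group matrix functions are cooperative, so every thread in the SIMD-group is expected to take part in simdgroup_load and simdgroup_store. With the early return above, only one thread reaches them. An 8x8 matrix spread over 32 threads is 2 elements per thread, which would be consistent with only 2 values being written. The sketch below keeps a whole SIMD-group active instead. The kernel name fun_whole_simdgroup is hypothetical, and it assumes the dispatch launches at least one full SIMD-group (32 threads on M1) per threadgroup and that Src and Dst each hold one row-major 8x8 tile.

```metal
#include <metal_stdlib>
#include <metal_simdgroup_matrix>
using namespace metal;

// Sketch only (hypothetical kernel name). Assumes a threadgroup size that is
// a multiple of the SIMD width, so SIMD-group 0 is fully populated.
kernel void fun_whole_simdgroup(
    const device half * Src [[ buffer(0) ]],
    device half * Dst [[ buffer(2) ]],
    ushort3 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
    ushort simdgroup_index [[ simdgroup_index_in_threadgroup ]])
{
    // Exit per SIMD-group rather than per thread, so all lanes of the first
    // SIMD-group of the first threadgroup reach the cooperative calls.
    if (threadgroup_position_in_grid.x != 0 ||
        threadgroup_position_in_grid.y != 0 ||
        threadgroup_position_in_grid.z != 0 ||
        simdgroup_index != 0) return;

    simdgroup_half8x8 sgMatY;
    simdgroup_load(sgMatY, Src);   // elements_per_row defaults to 8 columns
    simdgroup_store(sgMatY, Dst);
}
```

If all 64 values show up in Dst with this variant, the per-thread early return in the original kernel is the likely cause.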