I have the following Metal kernel that simply adds two arrays (size Nx*Ny) in parallel:
kernel void add_arrays(device const float* inA,
                       device const float* inB,
                       device float* result,
                       uint tid [[thread_position_in_grid]])
{
    const unsigned int Nx = 1024 * 5;
    const unsigned int Ny = 128 * 5;

    /* Contiguous memory access within a SIMD-group */
    uint x = tid % Nx;
    uint y = tid / Nx;
    uint index = y * Nx + x;

    /* Strided memory access within a SIMD-group */
    /* uint x = tid / Ny;
       uint y = tid % Ny;
       uint index = x * Ny + y; */

    result[index] = inA[index] + inB[index];
}
As a newcomer to Metal coming from CUDA, I expected the strided memory access to perform much worse than the contiguous access. However, both versions run in about the same time on average (~80 µs) on my M1 Max. Does that make sense?

I also have another, more complicated kernel in which strided access is actually 30% faster than contiguous access. I would have expected SIMD-groups to prefer accessing contiguous memory locations, analogous to coalesced access by warps in CUDA. Am I missing something about how SIMD-groups are organized on Apple silicon?
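To be clear about what I mean by contiguous vs. strided, here is the thread-id-to-index arithmetic sketched in Python (illustrative only; the function names and the stride-Nx variant are mine, not taken from the kernel above): with contiguous access, neighboring threads in a SIMD-group read neighboring elements, while with strided access they read elements Nx apart.

```python
# Illustrative sketch of the two thread-id -> array-index mappings
# for an Nx x Ny row-major array (names are mine, not from the kernel).
Nx, Ny = 1024 * 5, 128 * 5

def contiguous_index(tid):
    # Neighboring thread ids map to neighboring elements (stride 1).
    x, y = tid % Nx, tid // Nx
    return y * Nx + x

def strided_index(tid):
    # Column-major walk of a row-major array: neighboring thread ids
    # map to elements a full row (Nx elements) apart.
    x, y = tid // Ny, tid % Ny
    return y * Nx + x

print([contiguous_index(t) for t in range(4)])  # -> [0, 1, 2, 3]
print([strided_index(t) for t in range(4)])     # -> [0, 5120, 10240, 15360]
```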