I have the following Metal kernel that simply adds two arrays (size Nx*Ny) in parallel:
kernel void add_arrays(device const float* inA,
                       device const float* inB,
                       device float* result,
                       uint tid [[thread_position_in_grid]])
{
    const unsigned int Nx = 1024 * 5;
    const unsigned int Ny = 128 * 5;

    /* Contiguous memory access within a SIMD-group */
    uint x = tid % Nx;
    uint y = tid / Nx;
    uint index = y * Nx + x;

    /* Strided memory access within a SIMD-group */
    /* uint x = tid / Ny;
       uint y = tid % Ny;
       uint index = x * Ny + y; */

    result[index] = inA[index] + inB[index];
}
As a newcomer to Metal coming from CUDA, I expected the strided memory access to perform much worse than the contiguous access. However, both versions run in about the same time on average (~80 µs) on my M1 Max. Does that make sense?

I also have another, more complicated kernel in which strided access is actually 30% faster than contiguous access. I would have expected SIMD-groups to prefer accessing contiguous memory locations, analogous to coalesced access by warps in CUDA. Am I missing something about how SIMD-groups are organized on Apple silicon?
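To be clear about what I mean by contiguous vs. strided, here is the thread-id-to-index arithmetic sketched in Python (illustrative only; the function names and the stride-Nx variant are mine, not taken from the kernel above): with contiguous access, neighboring threads in a SIMD-group read neighboring elements, while with strided access they read elements Nx apart.

```python
# Illustrative sketch of the two thread-id -> array-index mappings
# for an Nx x Ny row-major array (names are mine, not from the kernel).
Nx, Ny = 1024 * 5, 128 * 5

def contiguous_index(tid):
    # Neighboring thread ids map to neighboring elements (stride 1).
    x, y = tid % Nx, tid // Nx
    return y * Nx + x

def strided_index(tid):
    # Column-major walk of a row-major array: neighboring thread ids
    # map to elements a full row (Nx elements) apart.
    x, y = tid // Ny, tid % Ny
    return y * Nx + x

print([contiguous_index(t) for t in range(4)])  # -> [0, 1, 2, 3]
print([strided_index(t) for t in range(4)])     # -> [0, 5120, 10240, 15360]
```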