Does strided memory access affect performance in a Metal compute kernel when running on Apple silicon?

I have the following Metal kernel that simply adds two arrays (size Nx*Ny) in parallel:

kernel void add_arrays(device const float* inA,
                       device const float* inB,
                       device float* result,
                       uint tid [[thread_position_in_grid]])
{
    const unsigned int Nx = 1024*5;
    const unsigned int Ny = 128*5;
        
    /* Contiguous memory access in SIMD-group */
    uint x = tid % Nx;
    uint y = tid / Nx;
    uint index = y * Nx + x;
    
    /* Strided memory access in SIMD-group */
    /* uint x = tid / Ny;
       uint y = tid % Ny;
       uint index = x * Ny + y; */
    
    result[index] = inA[index] + inB[index];
}

As a newcomer to Metal coming from CUDA, I expected much worse performance from the strided memory access when compared to contiguous access. However, both versions run in about the same time on average (~80 usec) on my M1 Max. Does this make sense? I have another more complicated kernel where strided access is actually 30% faster than contiguous access. I'd have expected that it would be preferable for SIMD-groups to access contiguous memory locations. Am I missing something as to how the SIMD-group is organized on Apple silicon?