I have a Metal kernel function that takes a huge array of data as input, stored in device memory, and I'm basically using one element per thread for further processing.
device Element *elements [[ buffer(0) ]],
I'm wondering which of these is better in terms of performance:
Make a copy of the array element in thread-local memory:
Element element = elements[thread_id];
Or use a pointer to that element in device memory:
device Element *element = &elements[thread_id];
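To make the comparison concrete, here is a minimal sketch of the two variants as complete kernels. The Element fields, the kernel names (process_copy, process_pointer), and the position update are placeholders made up for illustration; only the two ways of accessing elements[thread_id] come from my actual code.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical element layout; the real struct is not shown here.
struct Element {
    float4 position;
    float4 velocity;
};

// Variant 1: copy the element into the thread address space,
// work on the copy, then store it back explicitly.
kernel void process_copy(device Element *elements [[ buffer(0) ]],
                         uint thread_id [[ thread_position_in_grid ]])
{
    Element element = elements[thread_id];   // load from device memory into a thread-local value
    element.position += element.velocity;    // placeholder processing on the local copy
    elements[thread_id] = element;           // changes are only visible after this store
}

// Variant 2: keep a pointer into device memory and
// read and write the element through it.
kernel void process_pointer(device Element *elements [[ buffer(0) ]],
                            uint thread_id [[ thread_position_in_grid ]])
{
    device Element *element = &elements[thread_id];
    element->position += element->velocity;  // placeholder processing directly on device memory
}
```

Note that with the copy, any modification has to be written back to the buffer explicitly, while the pointer version modifies the buffer in place.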