Metal Shader : Pointer vs Local Copy Performance

I have a Metal kernel function that takes a huge array of data as input, stored in device memory, and each thread processes exactly one element of it.


device Element *elements [[ buffer(0) ]],


I'm wondering which of these is better in terms of performance:


Make a copy of the array element in thread-local memory:

Element element = elements[thread_id];


Or use a pointer to that element:

device Element *element = &elements[thread_id];
Answered by Graphics and Games Engineer in 616488022
In most cases, regardless of which approach you take, the Metal compiler will produce optimized code, reducing the number of memory operations and the number of hardware registers used. It is reasonable to expect a very similar performance profile from both.
However, the performance of your code is determined not only by how you read values from the input buffers, but also by how those values are used in the shader. If you have any reason to believe you are leaving performance on the table, we recommend profiling your app using GPU counters. This will give you a deep understanding of the code Metal generated for your shader and let you optimize for your specific case.
The counters you may want to check first are the limiter counters, which show whether your app is ALU limited or Buffer Read limited. For more information on how to use them, please watch this presentation:
https://developer.apple.com/videos/play/wwdc2020/10603/
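
For comparison while profiling, the two variants from the question could be written as complete kernels, roughly as sketched below. This is only an illustration: the layout of `Element`, the kernel names `process_copy` and `process_pointer`, and the "add velocity to position" work are assumptions, since the question does not show the real struct or the real processing. Whether the two compile to the same memory operations on a given GPU is something to confirm with the counters rather than assume.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical element layout; the real struct is not shown in the question.
struct Element {
    float4 position;
    float4 velocity;
};

// Variant 1: copy the element into a thread-local value and work on the copy.
kernel void process_copy(device Element *elements [[ buffer(0) ]],
                         uint thread_id [[ thread_position_in_grid ]])
{
    Element element = elements[thread_id];          // read from device memory
    element.position += element.velocity;           // placeholder work on the copy
    elements[thread_id].position = element.position; // write back only what changed
}

// Variant 2: keep a pointer into device memory and work through it.
kernel void process_pointer(device Element *elements [[ buffer(0) ]],
                            uint thread_id [[ thread_position_in_grid ]])
{
    device Element *element = &elements[thread_id];
    element->position += element->velocity;         // read and write via the pointer
}
```

Building both kernels into the same app and capturing a GPU trace for each makes it possible to compare their limiter counters side by side. One difference worth keeping in mind: with the local copy, changes reach device memory only when they are explicitly written back, while writes through the pointer go to the buffer directly.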