I've read up on using Metal for compute and written a few tests to try to build a better mental model of the GPU/Metal. There are threads and thread_groups. To optimize a thread_group, you determine the threadExecutionWidth and hand off multiples of that number to each thread_group.
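Roughly, the pattern I'm describing looks like this (a simplified sketch; "myKernel" stands in for my actual compute function):

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "myKernel")!)

print(pipeline.threadExecutionWidth)           // 32 on my iPhone 6s+
print(pipeline.maxTotalThreadsPerThreadgroup)  // 512 on my iPhone 6s+

let queue = device.makeCommandQueue()!
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
// e.g. 16 thread_groups of 32 threads each = 512 threads total
encoder.dispatchThreadgroups(
    MTLSize(width: 16, height: 1, depth: 1),
    threadsPerThreadgroup: MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
```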
So my mental model of a thread_group is that it's like a processor core that can operate on 1 to threadExecutionWidth "stacks" at a time. In my testing, I see no time difference between 1 and threadExecutionWidth threads in a thread_group.
There is a maximum number of threads that one thread_group can have (maxTotalThreadsPerThreadgroup). What I observe is that as you add threadExecutionWidth worth of threads, the compute time goes up linearly (per threadExecutionWidth worth of threads).
However, I can find no documented way to determine the maximum number of thread_groups. In my testing, it appears that once the total number of threads exceeds maxTotalThreadsPerThreadgroup (by having n thread_groups, each with threadExecutionWidth number of threads), compute time jumps.
So again my mental model is that there are n processing cores, and once you hit that, work gets queued. I tested on an iPhone 6s+, and from what I read its GPU has 6 cores. But my testing used a thread group size of 32 (the threadExecutionWidth) and 16 thread_groups. [maxTotalThreadsPerThreadgroup is 512 on my iPhone.]
Am I now consuming one of the six cores? If I were to create threads and hand off more work to the GPU, would I get another 512 threads running at one time?
OK, let me try, although my English is far from perfect.
What your GPU has is 6 "cores" (those would be "streaming multiprocessors" in nVidia parlance, I believe). Each has one instruction decoding unit (and so can execute only one program at a time), but it executes that single instruction stream on multiple data. Hence SIMD (Single Instruction, Multiple Data). Now, threadExecutionWidth is some number of threads the SIMD unit "likes" to execute "at once", for example 16 or 32 (which doesn't necessarily mean the SIMD unit is exactly 16 or 32 lanes wide, but usually something like that). Use fewer than that, and you're wasting SIMD lanes. You can use more, but only integer multiples make sense. So that's why threadExecutionWidth. Why the upper limit?
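For example, a common way to apply this is to round your work size up to an integer multiple of threadExecutionWidth when picking the threadgroup size. A small sketch (the function and names are mine, just to show the arithmetic):

```swift
import Metal

// Sketch: pick a threadgroup size that is a multiple of threadExecutionWidth,
// and enough threadgroups to cover `workItemCount` items. The kernel itself
// should still guard against the padding threads (gid >= workItemCount).
func dispatchSizes(for pipeline: MTLComputePipelineState,
                   workItemCount: Int) -> (threadgroups: MTLSize, threadsPerThreadgroup: MTLSize) {
    let width = pipeline.threadExecutionWidth                             // e.g. 32
    let cap = (pipeline.maxTotalThreadsPerThreadgroup / width) * width    // largest usable multiple
    let threadsPerGroup = min(cap, ((workItemCount + width - 1) / width) * width)
    let groups = (workItemCount + threadsPerGroup - 1) / threadsPerGroup
    return (MTLSize(width: groups, height: 1, depth: 1),
            MTLSize(width: threadsPerGroup, height: 1, depth: 1))
}
```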
Well, an execution unit "sees" several types of memory (there is a small kernel sketch after this list showing where each one lives):
- "thread" - this is the fastest; it is private to the thread (a thread can only access its own)
- "threadgroup" - sometimes called shared or local memory (all threads in a threadgroup can access it, and it can be used for communication between threads, for example in a parallel reduction)
- "constant" - memory dedicated to things that don't change during execution of the program
- "device" - this is where the GPU's textures and buffers live.
Of those, "thread" and "threadgroup" memory is most limited. So your threadgroup cannot grow beyond what is practical from memory point of view. There are strict limits in the standard, for example "each thread must have access to at least X of threadgroup memory". This is why the limits. So in your case:
threadExecutionWidth is 32 because SIMD can't do less work, and group size limit is 512 because there is not enough resources for more than 512 threads in execution unit. You can't have more _in single thread group_!
Now, if the group size is bigger than threadExecutionWidth, the execution unit will execute the threads in batches. This is why "as you add threadExecutionWidth worth of threads, the compute time goes up linearly (per threadExecutionWidth worth of threads)". There is no magic: if you have threadExecutionWidth threads, the execution unit can do everything in one pass, and all threads will be "in flight". If you have 2 * threadExecutionWidth threads, it takes two passes (in the first pass, say, 32 are "in flight" and the other 32 suspended, then the other way round), and so on. Note that the exact way the execution unit does this is not specified. If you want some particular behavior (for example because you want all threads of a thread group to synchronise and exchange results at some stage), you need to force it by calling synchronization functions.
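As a back-of-the-envelope illustration with the numbers from your device (only a sketch; the exact scheduling is up to the hardware):

```swift
// Sketch: how many passes one execution unit needs for a single threadgroup,
// assuming it advances threadExecutionWidth threads per pass.
let threadExecutionWidth = 32
let threadsPerThreadgroup = 512   // maxTotalThreadsPerThreadgroup on your iPhone
let passes = (threadsPerThreadgroup + threadExecutionWidth - 1) / threadExecutionWidth
print(passes)                     // 16, so roughly 16x the time of a single 32-thread pass
```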
Now, on to the number of groups. Each execution unit (core, whatever) executes only one thread group at a time. That means that for a number of thread groups <= the number of execution units, you will probably have all thread groups "in flight" at the same time. So if you have 6 cores, up to 6 thread groups can be executed at once, and 1, 2, 3, 4, 5 or 6 thread groups will take the same time - you're using more and more execution units, but they all start at the same moment. Of course you can have more groups - they will most probably be executed in batches of 6. And separate groups (unlike threads of the same group) allow for NO communication other than via global memory, so there is no significant cost associated with extra thread groups - you can have plenty of them, because they'll simply wait for their turn to be executed, much like tasks in a queue.
To sum things up:
- 32 threads running _at the same time_ _in a single execution unit_
- 6 physical execution units
- 32 * 6 = 192 - you can have up to 192 threads "in flight"
But of course you can have thousands of those, for example by scheduling 60 thread groups with 32 threads each. That gives 1920 threads to execute, but at any moment in time at most 6 thread groups will be working, each with at most 32 threads, so at most 192 threads "in flight".
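The same thing as arithmetic (a sketch; the 6 cores is my assumption about your GPU):

```swift
// Sketch: total scheduled threads vs. threads actually "in flight" at once.
let executionUnits = 6        // assumed number of GPU cores
let threadsPerGroup = 32      // threadExecutionWidth
let threadgroups = 60

let totalThreads = threadgroups * threadsPerGroup                        // 1920 threads to execute
let inFlight = min(threadgroups, executionUnits) * threadsPerGroup       // at most 192 at once
let groupBatches = (threadgroups + executionUnits - 1) / executionUnits  // ~10 batches of groups
```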
Hope that helps. If you have some specific question, shoot.
Michal
PS. Apart from the language problems, I use "probably" and similar words above because these things are in constant flux and depend on your particular device, API, etc.
PPS. From what I wrote above, it should be fairly obvious what a GPU programmer's biggest problem is: you can't really "branch" a SIMD device. Well, you can write ifs in the code all right, but they are usually executed by taking _both_ execution paths and then picking the right answer per lane by masking. Funny stuff.
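A tiny made-up example (not from your code) of what I mean - the if compiles and works fine, but when lanes of the same SIMD group disagree on the condition, both sides typically get evaluated and the per-lane result is selected by masking, so you pay for both:

```swift
// Made-up MSL fragment, kept as a source string to stay in Swift.
let divergentKernel = """
#include <metal_stdlib>
using namespace metal;

kernel void branchy(device const float *x  [[buffer(0)]],
                    device float *y        [[buffer(1)]],
                    uint gid               [[thread_position_in_grid]])
{
    if (x[gid] > 0.0f) { y[gid] = sin(x[gid]); }   // divergent lanes mean the unit
    else               { y[gid] = cos(x[gid]); }   // usually runs both of these
}
"""
```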