How to max computing throughput?

I've read up on the use of Metal for computing - and written a few tests to try and better create a mental model of GPU/Metal. There are threads and thread_groups. To optimize a thread_group, you determine the threadExecutionWidth and hand of multiples of that number to each thread_group.


So my mental model of a thread_group is that its like a processor core that can operate on 1 to threadExecutionWidths of "stacks". In my testing, I see no time difference between 1 and threadExecutionWidth number of threads in a thread_group.


There is a maximum number of threads that one thread_group can have (maxTotalThreadsPerThreadgroup) - what I observe is that as you add threadExecutionWidth worth of threads, the compute times goes up linearly (per threadExecutionWidth worth of threads).


However, I can find no documented way to determine the maximum number of thread_groups. In my testing, it appears that once the total number of threads exceeds maxTotalThreadsPerThreadgroup (by having n thread_groups, each with threadExecutionWidth number of threads), compute time jumps.


So again my mental model is that there are n number of processing cores, and once you hit that work gets queued. I tested on an iPhone 6s+, and from what I read the GPU has 6 cores. But my testing used a thread group size of 32 (the threadExecutionWidth), and 16 thread_groups. [maxTotalThreadsPerThreadgroup is 512 on my iPhone].


Am I now consuming one of the six cores? If I were to create threads and hand off more work to the GPU, would I get another 512 threads running at one time?

Answered by MikeAlpha in 231210022

OK, let me try, although my English is far from perfect.


What your GPU has, is 6 "cores" (that would be "streaming multiprocessors" in nVidia parlance, I believe). These have one instruction decoding unit (and so, can execute only one program at the time) but they execute single instruction stream on multiple data. Hence SIMD (Single Instruction, Multiple Data). Now threadExecutionWidth will be some number of threads that SIMD unit "likes" to execute "at once". For example 16 or 32 (doesn't necessarily mean that SIMD is exactly 16 or 32 wide, but usually something like this). Use less than that, and you're wasting some SIMD lanes. You can use more, but only integer multiples make sense. So that's why "threadExecutionWidth". Why the upper limit?


Well, execution unit "contain"/"sees" several types of memory:

- "thread" - this is fastest, private to thread (thread can only access it's own)

- "threadgroup" - sometimes called shared, or local memory (all threads in threadgroup can access it, and it can be used for communication between threads, for example in parallel reduction)

- "constant" - memory dedicated to stuff that isn't changing during execution of program

- "device" - that is where textures and buffers of the GPU live in.


Of those, "thread" and "threadgroup" memory is most limited. So your threadgroup cannot grow beyond what is practical from memory point of view. There are strict limits in the standard, for example "each thread must have access to at least X of threadgroup memory". This is why the limits. So in your case:

threadExecutionWidth is 32 because SIMD can't do less work, and group size limit is 512 because there is not enough resources for more than 512 threads in execution unit. You can't have more _in single thread group_!


Now, if group size is bigger than threadExecutionWidth, execution unit will execute these in batches. This is why "what I observe is that as you add threadExecutionWidth worth of threads, the compute times goes up linearly (per threadExecutionWidth worth of threads).". There is no magic, if you have threadExecutionWidth of threads, execution unit can do everything in one pass. All threads will be "in flight". If you have 2 * threadExecutionWidth, two passes (in first pass say 32 will be "in flight", and other 32 suspended, then other way round), and so on. Note that the exact way execution unit does this is not specified. If you want some particular behavior (for example because you want all threads from thread group to synchronise and exchange results at some stage) you need to force that by calling synchronization functions.


Now, to number of groups. Each execution unit (core, whatever) executes only one thread group at once. Meaning that for thread group number <= number of execution units you will probably have all thread groups "in flight" at the same time. So if you have 6 cores, up to 6 thread groups can be executed at once. So 1, 2, 3, 4, 5, 6 thread groups will take same time - you're using more and more execution units, but all will be started in same moment. Of course you can have more groups - they will most probably be executed in batches of 6. And separate groups (unlike threads of same group) allow for NO other communication than via global memory, so there is no important cost associated with work groups - you can have plenty of these, because they'll simply wait for their turn to be executed, much like tasks in queue.


To sum things up:

32 threads running _at the same time_ _in single execution unit_

6 physical execution units

32 * 6 = 192 - you can have up to 192 threads "in flight"

But of course you can have a thousands of those, for example by scheduling 60 thread groups with 32 threads each. That will give 1920 threads to execute, but at any moment of time there are going to be at most 6 thread groups working, each at most 32 threads, so at most 192 threads "in flight".


Hope that helps. If you have some specific question, shoot.

Michal

PS. Apart from language problems, I use "probably" and similar above because these things are in constant flux, and depend on your particular device, API, etc.

PPS. From what I wrote above, it should be fairly obvious what is GPU programmer biggest problem - you can't really "branch" a SIMD device. Well, you can do ifs in the code all right, but this is usually done by code taking _both_ execution paths, and then right answer being picked by masking. Funny stuff.

i think you have confusion between core and thread , you can see this : https://streamcomputing.eu/blog/2017-01-24/many-threads-can-run-gpu/

Thaks for the link but still confused. I see from googling "A9's GPU: Imagination PowerVR GT7600" and choosing Anand link (doing this to avoid URL approval delay) that it should be possible to have thousands of threads scheduled at the same time - but I can only get 512.


# of SIMDs is 6 - this aligns with the "6 GPU cores"

# of MADs (unexplained acronym) 32 - this appears to align with the "threadExecutionWidth" property


But then the total number of MADs is 192, but I can schedule 512 threads total (in 16 thread groups) without seeing any additional delay from just 32 threads. Perhaps my 6s+ has 16 cores?


What I have been hoping is the ability to schedule thousands of threads, not just a few hundred.


I know I'm not thinking about this properly - your article helped - but still don't understand it all.

Accepted Answer

OK, let me try, although my English is far from perfect.


What your GPU has, is 6 "cores" (that would be "streaming multiprocessors" in nVidia parlance, I believe). These have one instruction decoding unit (and so, can execute only one program at the time) but they execute single instruction stream on multiple data. Hence SIMD (Single Instruction, Multiple Data). Now threadExecutionWidth will be some number of threads that SIMD unit "likes" to execute "at once". For example 16 or 32 (doesn't necessarily mean that SIMD is exactly 16 or 32 wide, but usually something like this). Use less than that, and you're wasting some SIMD lanes. You can use more, but only integer multiples make sense. So that's why "threadExecutionWidth". Why the upper limit?


Well, execution unit "contain"/"sees" several types of memory:

- "thread" - this is fastest, private to thread (thread can only access it's own)

- "threadgroup" - sometimes called shared, or local memory (all threads in threadgroup can access it, and it can be used for communication between threads, for example in parallel reduction)

- "constant" - memory dedicated to stuff that isn't changing during execution of program

- "device" - that is where textures and buffers of the GPU live in.


Of those, "thread" and "threadgroup" memory is most limited. So your threadgroup cannot grow beyond what is practical from memory point of view. There are strict limits in the standard, for example "each thread must have access to at least X of threadgroup memory". This is why the limits. So in your case:

threadExecutionWidth is 32 because SIMD can't do less work, and group size limit is 512 because there is not enough resources for more than 512 threads in execution unit. You can't have more _in single thread group_!


Now, if group size is bigger than threadExecutionWidth, execution unit will execute these in batches. This is why "what I observe is that as you add threadExecutionWidth worth of threads, the compute times goes up linearly (per threadExecutionWidth worth of threads).". There is no magic, if you have threadExecutionWidth of threads, execution unit can do everything in one pass. All threads will be "in flight". If you have 2 * threadExecutionWidth, two passes (in first pass say 32 will be "in flight", and other 32 suspended, then other way round), and so on. Note that the exact way execution unit does this is not specified. If you want some particular behavior (for example because you want all threads from thread group to synchronise and exchange results at some stage) you need to force that by calling synchronization functions.


Now, to number of groups. Each execution unit (core, whatever) executes only one thread group at once. Meaning that for thread group number <= number of execution units you will probably have all thread groups "in flight" at the same time. So if you have 6 cores, up to 6 thread groups can be executed at once. So 1, 2, 3, 4, 5, 6 thread groups will take same time - you're using more and more execution units, but all will be started in same moment. Of course you can have more groups - they will most probably be executed in batches of 6. And separate groups (unlike threads of same group) allow for NO other communication than via global memory, so there is no important cost associated with work groups - you can have plenty of these, because they'll simply wait for their turn to be executed, much like tasks in queue.


To sum things up:

32 threads running _at the same time_ _in single execution unit_

6 physical execution units

32 * 6 = 192 - you can have up to 192 threads "in flight"

But of course you can have a thousands of those, for example by scheduling 60 thread groups with 32 threads each. That will give 1920 threads to execute, but at any moment of time there are going to be at most 6 thread groups working, each at most 32 threads, so at most 192 threads "in flight".


Hope that helps. If you have some specific question, shoot.

Michal

PS. Apart from language problems, I use "probably" and similar above because these things are in constant flux, and depend on your particular device, API, etc.

PPS. From what I wrote above, it should be fairly obvious what is GPU programmer biggest problem - you can't really "branch" a SIMD device. Well, you can do ifs in the code all right, but this is usually done by code taking _both_ execution paths, and then right answer being picked by masking. Funny stuff.

Thanks so much! And your written English is excellent - 10,000x better than my German :-)


David

Hey David


Glad to hear that. I just noticed that your original question was "how to max computing throughput" and realised that there is another very important thing to know about that. More of a trick exploiting how the things work, and it goes like this: it is often desirable to do more than "one thing" in one thread. For example, if you're doing some kind of image processing, it is only natural to map one image pixel to one thread. And then adjust thread group size/number of thread groups to the image in question. And that will work. But if you do, say, two or four pixels per thread (and change your thread group sizes accordingly), you'll find more often than not that the whole computing will execute faster. This is because there is some constant cost "per thread", part of it induced by how things work, part because how the kernel is written (say, you'll calculate "index" of pixel first, then do actual operations - so every thread has to devote "thread" memory to this "index" variable). If you have same amount of threads, but doing twice the work, costs of processing rise twice (you still have to process same amount of pixels) but these additional costs don't. Depending on scenario, this may give nice speedup. If interested, google "Better performance at lower occupancy" and/or other papers by Vasily Volkov. Unfortunately these are in nVidia CUDA parlance, but general idea applies to many, if not all GPUs. And it certainly worked for me in Metal, too.


Regards

Michal

Again, thanks for the tips and reference! In my case, say calculating the variance of a large array, I can break it up into pieces and do enough work in each thread to essentially have all cores working at once to calculate one value. Suppose I have n number of these - how best to schedule them so when one gets done the other starts up? Should I have multiple CPU threads doing the work, or just loop on these:


// Loop on ...
computeCommandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

This is perfect example how confusing is terminology here. I was refering to Metal "thread", not OS one! So for example something like this:

kernel void variance(device float * array [ [ buffer( 0 ) ] ], constant float mean [ [ buffer( 1 ) ] ],
                     uint2 coords [ [ thread_position_in_grid ] ], uint2 dims [ [ threads_per_grid ] ] )
{
     uint index = coords.y * dims.x + coords.x;
     // do one piece of work per kernel "thread"
     array[ index ] = abs( array[ index ] - mean );     // or whatever you want
}

Is OK, and will work just fine, but you can probably get more performance out of something like this:

kernel void variance2(device float * array [ [ buffer( 0 ) ] ], constant float mean [ [ buffer( 1 ) ] ],
                     uint2 coords [ [ thread_position_in_grid ] ], uint2 dims [ [ threads_per_grid ] ] )
{
     uint index = (coords.y * dims.x + coords.x) * 2;
     // do two pieces of work per kernel "thread"
     array[ index ] = abs( array[ index ] - mean );
     array[ index + 1 ] = abs( array[ index + 1 ] - mean );
}

Now you run the second version with X resolution halved, so you have half the execution threads (again, Metal "threads", not CPU ones!). It should

give you same results (assuming that you can halve X resolution and all). But it would probably be faster. Hope it is clear now.


Michal

PS. Please understand that these are just sketches, not actual working code. I just wrote them there to illustrate general idea.

PPS. As for CPU-side Metal threading, I know that it is possible (there are threaded command encoders) but I never actually used them, and from what I understand these are for special cases, like rendering a lot of objects with different properties. Something that will happen more often in computer game area, I guess, less in actual numerical computations.

How to max computing throughput?
 
 
Q