I am designing an algorithm for which I need to know how many threads can run simultaneously on the GPU. The app runs on iOS, but I am also running test code via Mac Catalyst on macOS. By timing it, I found that on macOS the maximum number of simultaneous threads for the "AMD Radeon Pro Vega 64" GPU is 16384. Is there a reliable way to find out this number via the Metal API? Or do I need to run some dummy tasks and time them to find it out?
Number of simultaneous Metal threads
It seems that the answer to this is more complicated than I thought. It depends not only on the specific machine it is running on, but also on the code itself, specifically on how much thread-local memory the compute shader uses.
After experimenting and measuring, I've found the following numbers:
I've measured these numbers with a dummy compute shader that uses about 4 KB of thread-local memory. What it means is that, for example, on my iPhone XS Max (Apple A12 GPU), running the compute shader once takes as much time as running it up to 4096 times; running it more times than that takes at least double that run time. Not sure how reliable these numbers are, so YMMV ...
```swift
static func maxNumberOfSimultaneousThreads(device: MTLDevice) -> Int? {
    switch device.name {
    case "AMD Radeon Pro Vega 64": return 8192
    case "AMD Radeon RX Vega 56":  return 4096
    case "AMD Radeon Pro Vega 20": return 4096
    case "AMD Radeon R9 M370X":    return 2048
    case "Apple A8 GPU":           return 512
    case "Apple A9 GPU":           return 1024
    case "Apple A10 GPU":          return 1024
    case "Apple A11 GPU":          return 1024
    case "Apple A12 GPU":          return 4096
    case "Apple A12X GPU":         return 2048
    default:                       return nil
    }
}
```
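For reference, the timing approach described above could be sketched roughly like this. This is a minimal sketch, not the exact code I used: it assumes you have already built a `MTLComputePipelineState` for some dummy kernel, and it times a dispatch of a given number of threads using the command buffer's `gpuStartTime`/`gpuEndTime` properties. You would then double the thread count until the measured time roughly doubles; the last count before the jump approximates the number of simultaneously running threads.

```swift
import Metal

/// Times one dispatch of `threadCount` threads of the given compute pipeline.
/// Sketch only: assumes `pipeline` was created from a dummy compute kernel.
func timeDispatch(threadCount: Int,
                  pipeline: MTLComputePipelineState,
                  queue: MTLCommandQueue) -> CFTimeInterval {
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)

    // Dispatch enough threadgroups of width `threadExecutionWidth`
    // to cover `threadCount` threads.
    let width = pipeline.threadExecutionWidth
    let groups = (threadCount + width - 1) / width
    encoder.dispatchThreadgroups(
        MTLSize(width: groups, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
    encoder.endEncoding()

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // GPU-side timestamps (available on recent OS versions).
    return commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
}
```

The doubling loop itself is then just a matter of calling `timeDispatch` with 1, 2, 4, ... threads and watching for the first count where the time jumps.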
Did you check https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes ?
Especially the part with
« You calculate the number of threads per threadgroup based on two MTLComputePipelineState properties. One property is maxTotalThreadsPerThreadgroup (the maximum number of threads that can be in a single threadgroup). The other is threadExecutionWidth (the number of threads scheduled to execute in parallel on the GPU). »
Looks like these properties would help.
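For what it's worth, querying those two properties is straightforward once you have a pipeline state. This is a sketch; `"myKernel"` is a placeholder for whatever compute function you actually load:

```swift
import Metal

// Query the per-threadgroup limits for a compute pipeline.
// "myKernel" is a placeholder name for your compute function.
func printThreadLimits() throws {
    guard let device = MTLCreateSystemDefaultDevice(),
          let library = device.makeDefaultLibrary(),
          let function = library.makeFunction(name: "myKernel") else { return }

    let pipeline = try device.makeComputePipelineState(function: function)

    // Maximum threads allowed in a single threadgroup for this kernel
    // (depends on the device and on the kernel's resource usage).
    print(pipeline.maxTotalThreadsPerThreadgroup)
    // Number of threads executed together in lockstep (SIMD width).
    print(pipeline.threadExecutionWidth)
}
```

Note that both values describe a single threadgroup, not the whole GPU, so on their own they don't directly give the total number of simultaneously running threads that the timing experiment measures.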