I'm trying to use the SIMD group reduction/prefix functions in a series of reasonably complex compute kernels in a Mac app. I need to allocate some threadgroup memory for coordinating between the SIMD groups in the same threadgroup. This array's capacity therefore depends on [[simdgroups_per_threadgroup]], but that's not a compile-time value, so it can't be used as an array dimension.
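For concreteness, here's a stripped-down sketch of the pattern (a two-stage sum reduction; the kernel and all names are illustrative, not my actual code). The threadgroup array arrives as a pointer bound at encode time, since its capacity can't be a compile-time constant:

```metal
#include <metal_stdlib>
using namespace metal;

// `partials` is sized at encode time via setThreadgroupMemoryLength:atIndex:,
// one float per SIMD group in the threadgroup.
kernel void sum_reduce(device const float *input   [[buffer(0)]],
                       device float *output        [[buffer(1)]],
                       threadgroup float *partials [[threadgroup(0)]],
                       uint gid        [[thread_position_in_grid]],
                       uint tid        [[thread_index_in_threadgroup]],
                       uint simdIndex  [[simdgroup_index_in_threadgroup]],
                       uint lane       [[thread_index_in_simdgroup]],
                       uint simdGroups [[simdgroups_per_threadgroup]],
                       uint group      [[threadgroup_position_in_grid]])
{
    // Stage 1: reduce within each SIMD group.
    float v = simd_sum(input[gid]);
    if (lane == 0)
        partials[simdIndex] = v;    // one slot per SIMD group

    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Stage 2: the first SIMD group reduces the per-group partial sums
    // (assumes simdGroups <= threads_per_simdgroup).
    if (simdIndex == 0) {
        float p = (lane < simdGroups) ? partials[lane] : 0.0f;
        p = simd_sum(p);
        if (tid == 0)
            output[group] = p;
    }
}
```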
Now, according to various WWDC session videos (e.g. WWDC 2022's "Scale compute workloads across Apple GPUs"), threadExecutionWidth on the pipeline object should return the SIMD group size, with which I could then allocate an appropriate amount of memory using setThreadgroupMemoryLength:atIndex: on the compute encoder.
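In Swift, that encode-time sizing looks roughly like this (a minimal sketch; the function name and the one-float-per-SIMD-group element size are just for illustration):

```swift
import Metal

// Size the threadgroup array from threadExecutionWidth, per the WWDC guidance:
// one float per SIMD group in a full-size threadgroup.
func bindPartials(pipeline: MTLComputePipelineState,
                  encoder: MTLComputeCommandEncoder) {
    let threadsPerGroup = pipeline.maxTotalThreadsPerThreadgroup
    let simdWidth = pipeline.threadExecutionWidth   // presumed SIMD group size
    let simdGroups = (threadsPerGroup + simdWidth - 1) / simdWidth

    // setThreadgroupMemoryLength requires a multiple of 16 bytes, so round up.
    let length = (simdGroups * MemoryLayout<Float>.stride + 15) & ~15
    encoder.setThreadgroupMemoryLength(length, index: 0)
}
```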
This works consistently on some hardware (e.g. on Apple M1, threadExecutionWidth always seems to report 32), but I'm hitting configurations where threadExecutionWidth does not match the apparent SIMD group size, causing runtime errors due to out-of-bounds access (e.g. on Intel UHD Graphics 630, threadExecutionWidth = 16 for some complex kernels, although the SIMD group size seems to be 32).
Will the SIMD group size always be the same for all kernels on a device, so that I should trust threadExecutionWidth only from the most trivial of kernels? Or should I submit a trivial kernel to the GPU that returns [[threads_per_simdgroup]]?
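If the probe-kernel route is the way to go, I'd expect it to look something like this (an untested sketch, error handling elided; the trivial kernel just writes [[threads_per_simdgroup]] to a buffer — though whether a trivial kernel reports the same width as my complex ones is exactly what I'm unsure about):

```swift
import Metal

// Compile and run a trivial kernel that reports its own SIMD group size.
func probeSIMDGroupSize(device: MTLDevice) throws -> UInt32 {
    let source = """
    #include <metal_stdlib>
    using namespace metal;
    kernel void probe(device uint *out [[buffer(0)]],
                      uint simdSize [[threads_per_simdgroup]]) {
        out[0] = simdSize;
    }
    """
    let library = try device.makeLibrary(source: source, options: nil)
    let pipeline = try device.makeComputePipelineState(
        function: library.makeFunction(name: "probe")!)

    let result = device.makeBuffer(length: MemoryLayout<UInt32>.stride,
                                   options: .storageModeShared)!
    let commands = device.makeCommandQueue()!.makeCommandBuffer()!
    let encoder = commands.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(result, offset: 0, index: 0)
    // A single thread suffices; the attribute reports the hardware width.
    encoder.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: 1, height: 1, depth: 1))
    encoder.endEncoding()
    commands.commit()
    commands.waitUntilCompleted()
    return result.contents().load(as: UInt32.self)
}
```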
I suspect the problem might occur in kernels where Metal reports an "odd" (non-power-of-2) maximum threadgroup size due to exhaustion of some resource (registers?). In the case I'm encountering, though, the maximum threadgroup size is reported as 896, which is an integer multiple of 32 (28 × 32), so it's not as if threadExecutionWidth were simply the greatest common divisor of the maximum threadgroup size and the SIMD group size.