In the "Discover advances in Metal for A15 Bionic" Tech Talk, right around the 20:00 mark, the presenter (Katelyn Hinson) says:
The output image is split into a set of SIMD groups, where each SIMD group is a 4-by-8 chunk, [with] each thread writing to a single output.
Supposing that we know the simdgroup will contain 32 threads (which they mention in the talk is true for Apple Silicon), is the only way to ensure that the threads in each simdgroup will be arranged into a 4 x 8 chunk to perform a dispatch with threadgroups that have a width dividing the number of threads per simdgroup? I can't think of another way to control the shape of a simdgroup directly within threadgroups since there is no explicit API to do so.
For example, if we perform a dispatchThreadgroups(_:threadsPerThreadgroup:) with a threadgroup size of 8 x 8 to attempt to recreate the visuals in the presentation, wouldn't the resulting simdgroup shape be an 8 x 4 region and not a 4 x 8 region?
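To make the 8 x 8 case concrete, here is a minimal CPU-side sketch in plain Swift (no Metal import). It rests on an assumption that the talk does not confirm: threads are numbered in row-major order within a threadgroup, and each consecutive run of simdSize threads forms one simdgroup. The Extent type and firstSimdgroupExtent function are hypothetical names used only for illustration.

```swift
// Assumption (not confirmed by the talk): threads are numbered row-major
// within a threadgroup, and each consecutive run of `simdSize` threads
// forms one simdgroup.
struct Extent: Equatable {
    var width: Int
    var height: Int
}

/// Bounding box, in threads, covered by the first simdgroup of a threadgroup.
func firstSimdgroupExtent(threadgroupWidth: Int,
                          threadgroupHeight: Int,
                          simdSize: Int) -> Extent {
    var maxX = 0
    var maxY = 0
    let threadCount = min(simdSize, threadgroupWidth * threadgroupHeight)
    for linearIndex in 0..<threadCount {
        // Row-major numbering: x varies fastest, y advances once per row.
        maxX = max(maxX, linearIndex % threadgroupWidth)
        maxY = max(maxY, linearIndex / threadgroupWidth)
    }
    return Extent(width: maxX + 1, height: maxY + 1)
}

let square = firstSimdgroupExtent(threadgroupWidth: 8, threadgroupHeight: 8, simdSize: 32)
let narrow = firstSimdgroupExtent(threadgroupWidth: 4, threadgroupHeight: 16, simdSize: 32)
print(square.width, square.height)
print(narrow.width, narrow.height)
```

Under that assumption, the 8 x 8 threadgroup gives a simdgroup 8 threads wide and 4 tall, while a 4 x 16 threadgroup gives one 4 wide and 8 tall.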
The assumptions made in the video about where to sample the source texture and which shuffle functions to use are heavily influenced by the shape of the simdgroup. I'm trying to implement a similar reduction but I'm currently figuring out how to shape each simdgroup.
If we don't know whether the simdgroup has 32 threads (I believe it's possible for simdgroups to have 64 threads?), what would be a reliable way to control the structure of the simdgroups? I believe that if we always ensure that the width of the threadgroup divides the number of threads in the simdgroup, we should get the behavior we want, but I'm looking to confirm this logic.
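The divisibility idea can be sketched in plain Swift as follows. It again assumes row-major packing of threads into simdgroups, which the talk does not confirm. The simdgroupShape function is a hypothetical helper; in a real app the simdSize argument would presumably come from the pipeline state's threadExecutionWidth property.

```swift
/// Shape of each simdgroup under the row-major assumption, or nil when
/// `threadgroupWidth` does not divide `simdSize`.
func simdgroupShape(threadgroupWidth: Int, simdSize: Int) -> (width: Int, height: Int)? {
    guard threadgroupWidth > 0, simdSize % threadgroupWidth == 0 else {
        return nil
    }
    // Each simdgroup covers whole rows of the threadgroup.
    return (width: threadgroupWidth, height: simdSize / threadgroupWidth)
}

for simdSize in [32, 64] {
    if let shape = simdgroupShape(threadgroupWidth: 4, simdSize: simdSize) {
        print("simdSize \(simdSize): \(shape.width) x \(shape.height)")
    }
}
```

Note that even when the width divides the simdgroup size, the resulting height still depends on that size: a width of 4 gives 4 x 8 with 32 threads but 4 x 16 with 64, so a kernel written for a fixed 4 x 8 layout would need the threadgroup width adjusted to match.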
IIRC, simdgroups will always have a multiple of 8 threads (maybe it was only 4?), so perhaps a width of 8 (or 4) would always suffice for the threadgroup, and you could specify a height of computePipelineState.maxTotalThreadsPerThreadgroup / 4, for example.
Finally, must we only use uniform threadgroups (i.e., we couldn't use dispatchThreads(_:threadsPerThreadgroup:)) for reliable results? I'm thinking that non-uniform threadgroups would again violate our assumptions about the simdgroup shape.
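As a rough illustration of the uniform versus non-uniform concern, the plain-Swift sketch below compares the two along one grid dimension. Both function names and the grid width of 1003 are hypothetical. How Metal actually packs simdgroups inside a clipped edge threadgroup is not stated in the talk, so this only shows that the edge threadgroup's width can differ from the width chosen to shape the simdgroups.

```swift
/// Threadgroups needed along one dimension for a uniform dispatch,
/// rounded up so the whole grid is covered.
func threadgroupCount(gridExtent: Int, threadgroupExtent: Int) -> Int {
    return (gridExtent + threadgroupExtent - 1) / threadgroupExtent
}

/// Extent of the last threadgroup along one dimension if edge threadgroups
/// are clipped to the grid, as in a non-uniform dispatch.
func clippedEdgeExtent(gridExtent: Int, threadgroupExtent: Int) -> Int {
    let remainder = gridExtent % threadgroupExtent
    return remainder == 0 ? threadgroupExtent : remainder
}

let gridWidth = 1_003        // hypothetical output width in threads
let threadgroupWidth = 4     // width chosen to shape the simdgroups

print(threadgroupCount(gridExtent: gridWidth, threadgroupExtent: threadgroupWidth))
print(clippedEdgeExtent(gridExtent: gridWidth, threadgroupExtent: threadgroupWidth))
```

With these numbers, a uniform dispatch needs 251 threadgroups across, and the kernel would have to bounds-check the one extra column of threads. A clipped edge threadgroup would be only 3 threads wide, which no longer divides 32, so the width-based reasoning above would not carry over to it.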