When is a `simdgroup_barrier()` required?

Metal offers both threadgroup_barrier() and simdgroup_barrier(). I understand the need for threadgroup barriers: without them it would not be possible to rely on well-defined cooperation between threads in a threadgroup, since different threads can execute on different SIMD partitions at different times. But I don't really get simdgroup_barrier(). It was my impression that all threads in a simdgroup execute in lockstep, and thus if one thread in a simdgroup makes progress, all other active threads in the simdgroup are also guaranteed to make progress. If this were not the case, we'd need to insert a simdgroup barrier pretty much any time we read or write any storage or perform SIMD-scoped operations. Apple doesn't seem to use simdgroup_barrier() in any of their sample code. In fact, it seems to be a no-op on current Apple Silicon hardware.

Is there a situation when I need to use simdgroup barriers or is this a superfluous operation?

P.S. It seems that Apple engineers are as confused by this as I am, see https://github.com/ml-explore/mlx/blame/1f6ab6a556045961c639735efceebbee7cce814d/mlx/backend/metal/kernels/scan.metal#L355

Replies

I found some related material in WWDC sessions.

According to https://developer.apple.com/wwdc16/606?time=869 and https://developer.apple.com/wwdc20/10631?time=1481, simdgroup_barrier() is used in place of threadgroup_barrier() when the threadgroup fits in a single SIMD group, because threadgroup_barrier() is more expensive than simdgroup_barrier().

The simdgroup size is normally 32 (afaik 64 on some AMD GPUs), so I think simdgroup_barrier() can be used when the threadgroup size is no larger than the simdgroup size on the target GPU (32 or 64).
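To make the pattern concrete, here is a minimal sketch of a threadgroup-memory reduction where the whole threadgroup is one simdgroup, so simdgroup_barrier() stands in for threadgroup_barrier(). It is not taken from the WWDC sessions. The kernel name `sum32`, the `scratch` array, and the buffer layout are made up for illustration, and it assumes the kernel is dispatched with exactly 32 threads per threadgroup on a GPU whose simdgroup width is 32.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical kernel: each threadgroup sums 32 consecutive floats.
// ASSUMPTION: dispatched with exactly 32 threads per threadgroup and a
// simdgroup width of 32, so the threadgroup is a single simdgroup.
kernel void sum32(device const float *input  [[buffer(0)]],
                  device float       *output [[buffer(1)]],
                  uint lane  [[thread_index_in_threadgroup]],
                  uint group [[threadgroup_position_in_grid]])
{
    threadgroup float scratch[32];

    // Each thread stages one element in threadgroup memory.
    scratch[lane] = input[group * 32 + lane];

    // Order the writes above before the reads below. Only one simdgroup
    // touches scratch, so the simdgroup-scoped barrier is used here.
    simdgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction: halve the number of active lanes each step.
    for (uint stride = 16; stride > 0; stride >>= 1) {
        if (lane < stride) {
            scratch[lane] += scratch[lane + stride];
        }
        // Reached by every thread, outside the divergent branch.
        simdgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Lane 0 now holds the sum for this threadgroup.
    if (lane == 0) {
        output[group] = scratch[0];
    }
}
```

If the threadgroup could ever span more than one simdgroup (for example, a larger dispatch size or a different execution width), these barriers would have to be threadgroup_barrier(mem_flags::mem_threadgroup) instead.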

  • Thank you! Based on these examples, it seems that access to shared memory can be non-uniform even within a single simdgroup, which is why an extra synchronization point is required. I wish the official documentation were clearer about this. It would also be great to have an official comment from the Apple GPU team. P.S. The second video is odd... why are they using a SIMD width of 64 if Apple Silicon uses 32-wide SIMD?
