Threadgroup memory write then read without barrier

I posted this question to StackOverflow. Perhaps it is better suited here where Apple developers are more likely to see it.

I was looking through the project linked on the page "Selecting Device Objects for Compute Processing" in the Metal documentation (linked here). There, I noticed a clever use of threadgroup memory that I hope to adopt in my own particle simulator. However, before I do, I need to understand a particular aspect of threadgroup memory and what the developers are doing in this scenario.

The code contains a segment like so:
Code Block metal
// In AAPLKernels.metal

// Parameter of the kernel
threadgroup float4* sharedPosition [[threadgroup(0)]]

// Body
...
// For each particle / body
for (i = 0; i < params.numBodies; i += numThreadsInGroup)
{
    // Because sharedPosition uses the threadgroup address space, 'numThreadsInGroup' elements
    // of sharedPosition will be initialized at once (not just one element at lid as it
    // may look like)
    sharedPosition[threadInGroup] = oldPosition[sourcePosition];

    j = 0;
    while (j < numThreadsInGroup)
    {
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    } // while

    sourcePosition += numThreadsInGroup;
} // for


In particular, I found the comment just before the assignment to sharedPosition (the one starting with "Because...") confusing. I haven't read anywhere that threadgroup memory writes happen on all threads in the same threadgroup simultaneously; in fact, I thought a barrier would be needed before reading from the shared memory pool again to avoid undefined behavior, since *each* thread subsequently reads from the entire pool of threadgroup memory after the assignment (the assignment being a write, of course). Why is a barrier unnecessary here?
Accepted Answer
Hi Maxwell,

This looks like a bug in the sample and there should be barriers there.

Right now the sample relies on undefined behavior: because numThreadsInGroup is 32 and the warp/wavefront/SIMD-group size is greater than or equal to 32 on all macOS devices, sharedPosition happens to get updated in lockstep across the whole threadgroup, but this is not guaranteed on all GPUs.
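For context, here is a minimal probe kernel, not part of the sample and with hypothetical names, that reports the SIMD-group width via the [[threads_per_simdgroup]] attribute (Metal 2 and later), so you can check on a given GPU whether a threadgroup of 32 actually fits in one SIMD-group:

Code Block metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical probe: writes out the SIMD-group width and the threadgroup size
// so the host can verify the assumption the sample leans on
// (threads per threadgroup <= threads per SIMD-group).
kernel void probeSimdWidth(device uint* result                [[buffer(0)]],
                           uint         simdWidth             [[threads_per_simdgroup]],
                           uint         threadsPerThreadgroup [[threads_per_threadgroup]],
                           uint         lid                   [[thread_position_in_threadgroup]])
{
    if (lid == 0)
    {
        result[0] = simdWidth;
        result[1] = threadsPerThreadgroup;
    }
}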

The correct code should be:

Code Block metal
sharedPosition[threadInGroup] = oldPosition[sourcePosition];

// Make this iteration's writes visible to every thread in the group before any reads.
threadgroup_barrier(metal::mem_flags::mem_threadgroup);

j = 0;
while (j < numThreadsInGroup)
{
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
} // while

// Keep faster threads from overwriting sharedPosition on the next outer-loop
// iteration while slower threads are still reading this one.
threadgroup_barrier(metal::mem_flags::mem_threadgroup);


If you'd like to do this without barriers, you can still do so using SIMD-group functions.
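For example, here is a minimal sketch of that approach, assuming the threadgroup is exactly one SIMD-group. tileAccumulate is a hypothetical helper, computeAcceleration is the sample's function, and simd_shuffle (Metal 2 and later) reads another lane's register directly, so no threadgroup memory or barrier is involved:

Code Block metal
#include <metal_stdlib>
using namespace metal;

// From the sample (AAPLKernels.metal).
float4 computeAcceleration(float4 sharedPos, float4 currentPos, float softeningSqr);

// Hypothetical sketch: accumulate accelerations over one tile of bodies using
// simd_shuffle instead of threadgroup memory. Each lane holds one body's
// position for the tile; every lane then reads every other lane's value
// straight from registers, which stays coherent without a barrier as long as
// the threadgroup is a single SIMD-group.
static float4 tileAccumulate(float4 currentPosition,
                             float4 myTilePosition,   // this lane's body for the tile
                             ushort simdWidth,        // from [[threads_per_simdgroup]]
                             float  softeningSqr)
{
    float4 acceleration = 0.0f;
    for (ushort lane = 0; lane < simdWidth; ++lane)
    {
        // Broadcast lane's position to every thread in the SIMD-group.
        float4 sharedPos = simd_shuffle(myTilePosition, lane);
        acceleration += computeAcceleration(sharedPos, currentPosition, softeningSqr);
    }
    return acceleration;
}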