Threadgroup memory write then read without barrier

I posted this question to StackOverflow. Perhaps it is better suited here where Apple developers are more likely to see it.

I was looking through the project linked on the page "Selecting Device Objects for Compute Processing" in the Metal documentation (linked here). There, I noticed a clever use of threadgroup memory that I hope to adopt in my own particle simulator. However, before I do, I need to understand a particular aspect of threadgroup memory and what the developers are doing in this scenario.

The code contains a segment like so:
Code Block metal
// In AAPLKernels.metal

// Parameter of the kernel
threadgroup float4* sharedPosition [[threadgroup(0)]]

// Body
...
// For each particle / body
for (i = 0; i < params.numBodies; i += numThreadsInGroup)
{
    // Because sharedPosition uses the threadgroup address space, 'numThreadsInGroup' elements
    // of sharedPosition will be initialized at once (not just one element at lid as it
    // may look like)
    sharedPosition[threadInGroup] = oldPosition[sourcePosition];

    j = 0;
    while (j < numThreadsInGroup)
    {
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    } // while

    sourcePosition += numThreadsInGroup;
} // for


In particular, I found the comment just before the assignment to sharedPosition (the one starting with "Because...") confusing. I haven't read anywhere that threadgroup memory writes happen on all threads in the same threadgroup simultaneously; in fact, I thought a barrier would be needed before reading from the shared memory pool again to avoid undefined behavior, since *each* thread subsequently reads from the entire pool of threadgroup memory after the assignment (the assignment being a write, of course). Why is a barrier unnecessary here?
Accepted Answer
Hi Maxwell,

This looks like a bug in the sample and there should be barriers there.

Right now the sample relies on undefined behavior: because numThreadsInGroup is 32 and the warp/wavefront/SIMD-group size is greater than or equal to 32 on all macOS devices, sharedPosition happens to get updated in lockstep across the whole threadgroup, but this is not guaranteed on all GPUs.
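For context, here is a minimal probe kernel, not part of the sample and with hypothetical names, that reports the SIMD-group width via the [[threads_per_simdgroup]] attribute (Metal 2 and later), so you can check on a given GPU whether a threadgroup of 32 actually fits in one SIMD-group:

Code Block metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical probe: writes out the SIMD-group width and the threadgroup size
// so the host can verify the assumption the sample leans on
// (threads per threadgroup <= threads per SIMD-group).
kernel void probeSimdWidth(device uint* result                [[buffer(0)]],
                           uint         simdWidth             [[threads_per_simdgroup]],
                           uint         threadsPerThreadgroup [[threads_per_threadgroup]],
                           uint         lid                   [[thread_position_in_threadgroup]])
{
    if (lid == 0)
    {
        result[0] = simdWidth;
        result[1] = threadsPerThreadgroup;
    }
}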

The correct code should be:

Code Block metal
sharedPosition[threadInGroup] = oldPosition[sourcePosition];

// Make this iteration's writes visible to every thread in the group before any reads.
threadgroup_barrier(metal::mem_flags::mem_threadgroup);

j = 0;
while (j < numThreadsInGroup)
{
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
} // while

// Keep faster threads from overwriting sharedPosition on the next outer-loop
// iteration while slower threads are still reading this one.
threadgroup_barrier(metal::mem_flags::mem_threadgroup);


If you'd like to do this without barriers, you can still do so using SIMD-group functions.
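For example, here is a minimal sketch of that approach, assuming the threadgroup is exactly one SIMD-group. tileAccumulate is a hypothetical helper, computeAcceleration is the sample's function, and simd_shuffle (Metal 2 and later) reads another lane's register directly, so no threadgroup memory or barrier is involved:

Code Block metal
#include <metal_stdlib>
using namespace metal;

// From the sample (AAPLKernels.metal).
float4 computeAcceleration(float4 sharedPos, float4 currentPos, float softeningSqr);

// Hypothetical sketch: accumulate accelerations over one tile of bodies using
// simd_shuffle instead of threadgroup memory. Each lane holds one body's
// position for the tile; every lane then reads every other lane's value
// straight from registers, which stays coherent without a barrier as long as
// the threadgroup is a single SIMD-group.
static float4 tileAccumulate(float4 currentPosition,
                             float4 myTilePosition,   // this lane's body for the tile
                             ushort simdWidth,        // from [[threads_per_simdgroup]]
                             float  softeningSqr)
{
    float4 acceleration = 0.0f;
    for (ushort lane = 0; lane < simdWidth; ++lane)
    {
        // Broadcast lane's position to every thread in the SIMD-group.
        float4 sharedPos = simd_shuffle(myTilePosition, lane);
        acceleration += computeAcceleration(sharedPos, currentPosition, softeningSqr);
    }
    return acceleration;
}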