I posted this question to StackOverflow. Perhaps it is better suited here where Apple developers are more likely to see it.
I was looking through the project linked on the page "Selecting Device Objects for Compute Processing" in the Metal documentation (linked here). There, I noticed a clever use of threadgroup memory that I am hoping to adopt in my own particle simulator. However, before I do so I need to understand a particular aspect of threadgroup memory and what the developers are doing in this scenario.
The code contains a segment like so:
Code Block metal
// In AAPLKernels.metal

// Parameter of the kernel
threadgroup float4* sharedPosition [[threadgroup(0)]]

// Body
...

// For each particle / body
for(i = 0; i < params.numBodies; i += numThreadsInGroup)
{
    // Because sharedPosition uses the threadgroup address space, 'numThreadsInGroup' elements
    // of sharedPosition will be initialized at once (not just one element at lid as it
    // may look like)
    sharedPosition[threadInGroup] = oldPosition[sourcePosition];

    j = 0;

    while(j < numThreadsInGroup)
    {
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    } // while

    sourcePosition += numThreadsInGroup;
} // for
In particular, I found the comment just before the assignment to sharedPosition (the one beginning "Because...") confusing. I haven't read anywhere that threadgroup memory writes happen on all threads in the same threadgroup simultaneously; in fact, I thought a barrier would be needed before reading from the shared memory pool again to avoid undefined behavior, since *each* thread subsequently reads from the entire pool of threadgroup memory after the assignment (the assignment being a write, of course). Why is a barrier unnecessary here?
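To make my expectation concrete, here is a sketch of where I would have guessed barriers were required, using Metal's threadgroup_barrier. This is my own modification of the sample for illustration, not Apple's code, and the inner loop is shown rolled up rather than unrolled:

Code Block metal
for(i = 0; i < params.numBodies; i += numThreadsInGroup)
{
    // Each thread writes one element of the shared pool.
    sharedPosition[threadInGroup] = oldPosition[sourcePosition];

    // Barrier I expected here: make every thread's write to
    // sharedPosition visible before any thread reads the whole pool.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    for(j = 0; j < numThreadsInGroup; ++j)
    {
        acceleration += computeAcceleration(sharedPosition[j], currentPosition, softeningSqr);
    }

    // Second barrier I expected here: stop a fast thread from starting the
    // next iteration and overwriting elements a slower thread is still reading.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    sourcePosition += numThreadsInGroup;
}

Is the sample safe without these barriers only because of some execution-width guarantee (e.g. the threadgroup not exceeding the SIMD-group width), or is there a more general rule about threadgroup memory that I'm missing?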