Data race-free memory-access-pattern in compute-kernels?

Hi,


my assumption was that it should be possible to read / write to an array without the need of atomics by creating threadgroups of the size (threadExecutionWidth, 1, 1) and using the thread_index_in_threadgroup as index of the array.


A simple kernel to demonstrate what I mean:

kernel void test(volatile device uint *counter [[buffer(0)]],
                 ushort tid [[thread_index_in_threadgroup]]) {
    counter[tid] += 1;
}


The test shows that my assumption is wrong (for grids with a width of threadExecutionWidth, the value stays at 1 for all elements up to a grid-height of 100 and only gets bigger when the height gets way higher).


I hoped to avoid atomic operations / synchronisations by using this pattern.


So, I have two questions:

1. Is a thread_index_in_threadgroup unique across all active threadgroups at a time? (Otherwise, the pattern obviously fails because of data-races)

2. Is there a cheap way to force the counter-array be synchronised across the threads? Seeing a value of 1 doesn't make any sense and must mean that the threads get a cached value.


Thanks,

Hendrik

I think I found my mental misconception of the threads: A thread won't work on a single function until the function is finished, but may interrupt the current execution of the function and move to a different function (or another instance of the same function) - basically like a normal thread on the CPU.

Data race-free memory-access-pattern in compute-kernels?
 
 
Q