Hi,
my assumption was that it should be possible to read / write to an array without the need of atomics by creating threadgroups of the size (threadExecutionWidth, 1, 1) and using the thread_index_in_threadgroup as index of the array.
A simple kernel to demonstrate what I mean:
kernel void test(volatile device uint *counter [[buffer(0)]],
ushort tid [[thread_index_in_threadgroup]]) {
counter[tid] += 1;
}
The test shows that my assumption is wrong (for grids with a width of threadExecutionWidth, the value stays at 1 for all elements up to a grid-height of 100 and only gets bigger when the height gets way higher).
I hoped to avoid atomic operations / synchronisations by using this pattern.
So, I have two questions:
1. Is a thread_index_in_threadgroup unique across all active threadgroups at a time? (Otherwise, the pattern obviously fails because of data-races)
2. Is there a cheap way to force the counter-array be synchronised across the threads? Seeing a value of 1 doesn't make any sense and must mean that the threads get a cached value.
Thanks,
Hendrik