OK, I solved it myself. I knew posting it would help somehow :-)
Here is some help for future developers who might get confused in the same way. There is a difference in the code:
computeEncoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize)
vs
[computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize];
I hadn't taken care to fully understand the difference between the two functions. They look similar, but their arguments, although all of type MTLSize, have very different meanings.
I wrongly thought the only difference was whether the threadgroups align neatly with the grid or may extend beyond the grid boundary when the grid size isn't a multiple of the threadgroup size. That is one difference, but not the only one. (You should use dispatchThreads only if your device supports non-uniform threadgroup sizes.)
The crucial difference is that what I called gridSize doesn't mean the same thing in both cases.
In the former it's the number of threadgroups in the grid, and in the latter it's the number of threads in the grid.
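A minimal sketch of how the two meanings relate, written in plain Swift so it runs without a GPU. Size3 is a hypothetical stand-in for MTLSize (same width/height/depth fields), and threadgroupsNeeded is an illustrative helper name, not a Metal API. It converts a grid measured in threads (what dispatchThreads expects) into a count of threadgroups (what dispatchThreadgroups expects), rounding up:

```swift
// Hypothetical stand-in for MTLSize so the sketch runs without Metal.
struct Size3: Equatable {
    var width: Int
    var height: Int
    var depth: Int
}

// Illustrative helper (not a Metal API): number of threadgroups needed to
// cover `threads` when each threadgroup holds `perGroup` threads.
// Rounds up in each dimension so a partial group at the edge is still counted.
func threadgroupsNeeded(threads: Size3, perGroup: Size3) -> Size3 {
    func ceilDiv(_ a: Int, _ b: Int) -> Int { (a + b - 1) / b }
    return Size3(width: ceilDiv(threads.width, perGroup.width),
                 height: ceilDiv(threads.height, perGroup.height),
                 depth: ceilDiv(threads.depth, perGroup.depth))
}

let perGroup = Size3(width: 1024, height: 1, depth: 1)

// 1024 threads fit in exactly one threadgroup of 1024.
let exact = threadgroupsNeeded(threads: Size3(width: 1024, height: 1, depth: 1),
                               perGroup: perGroup)

// 1025 threads need two threadgroups; the second one is mostly empty.
let ragged = threadgroupsNeeded(threads: Size3(width: 1025, height: 1, depth: 1),
                                perGroup: perGroup)

print(exact.width, ragged.width)
```

When the count is rounded up like this for dispatchThreadgroups, the kernel usually needs its own bounds check so the extra threads in the last group don't touch memory outside the grid.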
In the former case, on my GPU, where maxTotalThreadsPerThreadgroup = 1024, I had
gridSize = MTLSize(width: 1024, height: 1, depth: 1)
threadsPerThreadgroup = MTLSize(width: 1024, height: 1, depth: 1)
So it seems I failed to get the parallelism I expected, perhaps because the grid was divided into too many threadgroups.
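The arithmetic for those values can be sketched in plain Swift (no Metal needed). It assumes the same two sizes are passed to both calls, which the post does not state outright, and the variable names are illustrative only:

```swift
// Values from the post, x dimension only (height and depth are 1).
let gridSize = 1024
let threadsPerThreadgroup = 1024

// dispatchThreadgroups: gridSize counts threadgroups,
// so the total is groups * threads-per-group.
let threadsViaDispatchThreadgroups = gridSize * threadsPerThreadgroup

// dispatchThreads: gridSize already counts threads.
let threadsViaDispatchThreads = gridSize

print(threadsViaDispatchThreadgroups) // 1048576
print(threadsViaDispatchThreads)      // 1024
```

Under that assumption the first call launches 1024 times as many threads as the second. Whether that fully accounts for the utilisation difference is not something this sketch can show.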
I still don't fully understand why I can't get full GPU utilisation this way, but I'm clearly getting better GPU utilisation with dispatchThreads.
I checked the results in the array with some extra code and got the same results in all cases; the difference is only in efficiency, not in what work is done.
Happy coding!
See https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/2866532-dispatchthreads?language=objc
and https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443138-dispatchthreadgroups?language=objc
Well spotted. That was a silly difference I meant to fix. The Objective-C version uses a global. It shouldn't matter, but sorry for the distraction.