How to Debug psort_r(3)/dispatch_group Stall on MacStudio?

We've recently noticed an issue on our new MacStudios where calls to psort_r(3) stall forever. We haven't changed our HPC (particle simulation) code at all, and sampling the app shows psort_r is stuck in dispatch_group_wait(3).

Taking the code from Libc 1439.141.1, we've assembled our own implementation, which lets us pass in a dispatch_queue_t and a dispatch_group_t and specify a wait time. After 10 seconds (a very long time in our case) the call returns a non-zero value (i.e. the wait timed out), and the debug output shows the group and queue are in agreement: four additional blocks are waiting to run, but never do. It still takes more than two hours of simulation to reproduce, during which the earlier calls to psort_r must be succeeding for the simulation to make forward progress.

Prior to this code change we've seen dispatch_group_wait stuck for hours. What else could we do to diagnose/debug this? We only see it on our M1 Ultra MacStudios, and the comparator passed to psort_r is simple C code (constant time). FB10893202

Is it possible this is caused by the Ultra putting certain cores to sleep while the queue has work assigned to those cores? This seems to happen only with smaller problems (core usage is <= 4 of 20), which opens the possibility that parts of the chip are asleep. Another possibility, I suppose, is a cache-coherency problem between the chips, such that the call to "wake" the queue and finish the "in-flight" work goes missing.

GROUP_FAIL
<OS_dispatch_queue_concurrent: QUEUE_NAME[ADDR] = { xref = 1, ref = 9, sref = 1, target = com.apple.root.default-qos[ADDR], width = 0xffe, state = 0x00000c1000000000, in-flight = 4}>
<OS_dispatch_group: group[ADDR] = { xref = 1, ref = 2, count = 4, gen = 0, waiters = 1, notifs = 0 }>
NORMAL_OPERATION
<OS_dispatch_queue_concurrent: QUEUE_NAME[ADDR] = { xref = 1, ref = 1, sref = 1, target = com.apple.root.default-qos[ADDR], width = 0xffe, state = 0x0000041000000000, in-flight = 0}>
<OS_dispatch_group: group[ADDR] = { xref = 1, ref = 1, count = 0, gen = 0, waiters = 0, notifs = 0 }>

It looks like the internal state of the queue is smashed when this happens. If dispatch_group_wait is allowed to exit early and dispatch_barrier_sync is then called on the same queue, that call waits forever. It may be related to low overall CPU usage, but I'm not sure how. It always seems to take a while (a couple of hours of continuous use) before the issue occurs. We've never seen it on Intel or single-chip Apple Silicon.

After replacing psort_r with a dispatch_apply-based version we're calling "wpsort", it looks like the stall can still occur, but in other ways. A completely new dispatch_group running on a global queue fails to notify, and it looks like the whole app effectively goes single-threaded, as far as GCD is concerned.

It's not the old days with +[NSThread isMultithreaded], but I don't know what could cause GCD to fail to spin up new worker threads suddenly, after more than two hours of computation. It seems to be related to low CPU utilization before the threads become limited, and somehow to the M1 Ultra MacStudio, but this is unusual.

As a kind of postscript, we've adjusted or rewritten all algorithms as necessary to avoid stalls (like using dispatch_group_wait), and the issue seems to affect all clients of GCD in-process. Relaunching the app temporarily resolves the issue, but the root cause is still unknown. Making our calculations "stall resistant" isn't ideal, but we haven't heard back at all. It looks like we have to set a watchdog on the app and, if it triggers, pause, save, relaunch, load, and continue, until there's something more to work with.
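
For anyone wanting to try the same workaround, here is a minimal sketch of what such a watchdog could look like. It is an illustration only: every name in it (watchdog_t, watchdog_start, watchdog_beat, watchdog_stop, the on_stall callback) is hypothetical and not taken from our code or from any Apple API. It assumes the simulation loop can call watchdog_beat() whenever it makes forward progress. It deliberately runs on a plain pthread rather than a dispatch queue or timer source, on the assumption that once GCD stops bringing up worker threads, a GCD-based watchdog might never fire either.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical watchdog state; none of these names come from GCD or Libc. */
typedef struct {
    _Atomic int64_t last_beat_ns;  /* monotonic time of the last heartbeat */
    int64_t timeout_ns;            /* silence longer than this counts as a stall */
    int64_t poll_ns;               /* how often the watchdog thread checks */
    atomic_bool stop;              /* set by watchdog_stop() */
    atomic_bool tripped;           /* set once a stall has been detected */
    void (*on_stall)(void *ctx);   /* optional: e.g. request pause/save/relaunch */
    void *ctx;
    pthread_t thread;
} watchdog_t;

static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Call from the simulation whenever it makes forward progress. */
void watchdog_beat(watchdog_t *wd) {
    atomic_store(&wd->last_beat_ns, now_ns());
}

static void *watchdog_main(void *arg) {
    watchdog_t *wd = arg;
    for (;;) {
        struct timespec nap = { (time_t)(wd->poll_ns / 1000000000LL),
                                (long)(wd->poll_ns % 1000000000LL) };
        nanosleep(&nap, NULL);
        if (atomic_load(&wd->stop)) break;
        if (now_ns() - atomic_load(&wd->last_beat_ns) > wd->timeout_ns) {
            atomic_store(&wd->tripped, true);
            if (wd->on_stall) wd->on_stall(wd->ctx);
            break;  /* report a stall once, then exit the thread */
        }
    }
    return NULL;
}

/* Returns 0 on success, or the pthread_create error code. */
int watchdog_start(watchdog_t *wd, int64_t timeout_ns, int64_t poll_ns,
                   void (*on_stall)(void *), void *ctx) {
    assert(wd != NULL && timeout_ns > 0 && poll_ns > 0);
    wd->timeout_ns = timeout_ns;
    wd->poll_ns = poll_ns;
    wd->on_stall = on_stall;
    wd->ctx = ctx;
    atomic_init(&wd->last_beat_ns, now_ns());
    atomic_init(&wd->stop, false);
    atomic_init(&wd->tripped, false);
    return pthread_create(&wd->thread, NULL, watchdog_main, wd);
}

/* Stops the watchdog thread and reports whether a stall was detected. */
bool watchdog_stop(watchdog_t *wd) {
    atomic_store(&wd->stop, true);
    pthread_join(wd->thread, NULL);
    return atomic_load(&wd->tripped);
}
```

In this sketch the simulation would call watchdog_start() once with a timeout comfortably longer than a normal step (10 seconds would be very long in our case), call watchdog_beat() after each step, and have on_stall set a flag that starts the pause/save/relaunch sequence. The callback runs on the watchdog thread, so it should do as little as possible and avoid touching GCD.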
