I'm trying to hint to the scheduler that some threads should be scheduled together, using the thread_policy_set API with THREAD_AFFINITY_POLICY (given that there is no "real" thread-to-core affinity API).
All the examples mention setting the policy after a thread is created but before it starts executing. Unfortunately, I'm not the one creating these threads (OpenMP is), and when I try to use the API on an already running thread, I get a return value of KERN_INVALID_ARGUMENT (= 4):
#include <mach/mach.h>
#include <mach/thread_policy.h>
thread_affinity_policy_data_t policy = { 1 };
auto r = thread_policy_set(mach_task_self(), THREAD_AFFINITY_POLICY, (thread_policy_t)&policy, THREAD_AFFINITY_POLICY_COUNT);
When I replace mach_task_self() with pthread_mach_thread_np(pthread_self()), I get a KERN_NOT_SUPPORTED error instead (= 46, "Empty thread activation (No thread linked to it)").
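For completeness, here's a minimal self-contained version of that second attempt (the thread-port variant); the reporting via mach_error_string is just for illustration:

```cpp
#include <mach/mach.h>
#include <mach/mach_error.h>
#include <mach/thread_policy.h>
#include <pthread.h>
#include <cstdio>

int main() {
    // Tag the calling thread with affinity set 1; threads sharing a tag
    // are hinted to run on cores that share a cache.
    thread_affinity_policy_data_t policy = { 1 };
    kern_return_t r = thread_policy_set(
        pthread_mach_thread_np(pthread_self()),  // thread port, not the task port
        THREAD_AFFINITY_POLICY,
        (thread_policy_t)&policy,
        THREAD_AFFINITY_POLICY_COUNT);
    // On my M1 Ultra this prints 46 (KERN_NOT_SUPPORTED).
    std::printf("thread_policy_set returned %d (%s)\n", r, mach_error_string(r));
    return r == KERN_SUCCESS ? 0 : 1;
}
```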
Has anyone used these APIs successfully on an already running thread?
Background: The code I'm working on divides a problem set into a small number of roughly equal-sized pieces (e.g. 8 or 16; this is an input parameter derived from the number of cores to be utilized). These pieces are not entirely independent, but need to be processed in lock-step, as data from neighboring pieces is occasionally accessed.
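Schematically, the structure looks like this (compute_step and the exact pragmas are illustrative placeholders, not the actual code):

```cpp
#include <omp.h>

void compute_step(int piece, int step);  // hypothetical per-piece kernel

void run(int num_pieces, int num_steps) {
    #pragma omp parallel num_threads(num_pieces)
    {
        const int piece = omp_get_thread_num();  // one piece per thread
        for (int step = 0; step < num_steps; ++step) {
            compute_step(piece, step);  // may read neighbors' data from step - 1
            #pragma omp barrier         // keep all pieces in lock-step
        }
    }
}
```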
Sometimes, when a neighboring piece isn't ready for a fairly long time, we call std::this_thread::yield(), which unfortunately seems to indicate to the scheduler that this thread should move to the efficiency cores. That wreaks havoc with the assumption that each computation over a piece requires roughly the same amount of time, which is what lets all threads remain in lock-step. :(
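Boiled down (the progress counters and the fixed piece count are made up for illustration), the problematic wait looks like this:

```cpp
#include <atomic>
#include <thread>

// Hypothetical per-piece progress counters, bumped by each worker as it
// finishes a step (zero-initialized as statics).
std::atomic<int> progress[16];

void wait_for_neighbor(int neighbor, int step) {
    while (progress[neighbor].load(std::memory_order_acquire) < step) {
        // Politely give up the rest of the time slice. On macOS/arm64 this
        // also appears to invite the scheduler to migrate the waiter to an
        // E-core, which is exactly what breaks the lock-step balance.
        std::this_thread::yield();
    }
}
```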
A similar (?) problem seems to happen with OpenMP barriers, which have terrible performance, at least on the M1 Ultra, unless KMP_USE_YIELD=0 is set (for the LLVM OpenMP runtime). Can this automatic migration (note: not the relinquishing of the remaining time slice) be prevented?
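For the barrier case, the same knob can presumably also be set programmatically, assuming it happens before the OpenMP runtime reads its configuration (i.e. before the first parallel region):

```cpp
#include <cstdlib>

int main() {
    // Must run before the LLVM OpenMP runtime initializes, i.e. before
    // the first OpenMP construct is executed.
    setenv("KMP_USE_YIELD", "0", /*overwrite=*/1);
    // ... start the OpenMP computation ...
    return 0;
}
```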
I got a nice explanation from a person in DTS, which I'll briefly summarize here for posterity:
- The mach_task_self() variant shouldn't work at all and is wrong: it passes a task port where thread_policy_set() expects a thread port, hence the KERN_INVALID_ARGUMENT. (I got the idea from https://codereview.chromium.org/276043002/ where the two are used interchangeably; see the snippet after this list.)
- The other call makes it to the right place, but thread affinity is not implemented/supported on Apple Silicon. (There the argument was made that "all the cores are basically sharing a single unified cache", which doesn't quite match up with the video describing the arrangement of four P-cores per shared L2 cache.)
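In code terms (my paraphrase of the first point, not DTS's wording):

```cpp
#include <mach/mach.h>
#include <pthread.h>

// Both are mach_port_t values, but they name very different things:
mach_port_t task_port   = mach_task_self();                       // the whole task (process)
mach_port_t thread_port = pthread_mach_thread_np(pthread_self()); // just this thread

// thread_policy_set() wants the latter; handing it the former is what
// produced the KERN_INVALID_ARGUMENT above.
```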
And because I always have trouble following XNU's dispatching of function calls (especially once the Mach layer gets involved), here's a walk-through of the dispatches:
- The main entry point is thread_policy_set(...) in https://github.com/apple-oss-distributions/xnu/blob/xnu-8019.80.24/osfmk/kern/thread_policy.c
- It forwards to thread_policy_set_internal(...),
- which asks thread_affinity_is_supported(), defined in https://github.com/apple-oss-distributions/xnu/blob/e6231be02a03711ca404e5121a151b24afbff733/osfmk/kern/affinity.c
- That check boils down to ml_get_max_affinity_sets() != 0 (an architecture-specific function),
- which for ARM says "no sets supported": https://github.com/apple-oss-distributions/xnu/blob/bb611c8fecc755a0d8e56e2fa51513527c5b7a0e/osfmk/arm/cpu_affinity.h
- And voilà: KERN_NOT_SUPPORTED
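Which suggests a runtime probe if you want to keep affinity hints on Intel Macs but skip them on Apple Silicon. This is my own sketch, not an official API; it relies on thread_policy_get rejecting the flavor the same way:

```cpp
#include <mach/mach.h>
#include <mach/thread_policy.h>
#include <pthread.h>

// Ask the kernel for the current thread's default affinity policy; on
// Apple Silicon the flavor itself is rejected with KERN_NOT_SUPPORTED,
// while on Intel Macs the call should succeed.
static bool thread_affinity_is_usable() {
    thread_affinity_policy_data_t policy = { THREAD_AFFINITY_TAG_NULL };
    mach_msg_type_number_t count = THREAD_AFFINITY_POLICY_COUNT;
    boolean_t get_default = TRUE;
    kern_return_t r = thread_policy_get(pthread_mach_thread_np(pthread_self()),
                                        THREAD_AFFINITY_POLICY,
                                        (thread_policy_t)&policy,
                                        &count, &get_default);
    return r != KERN_NOT_SUPPORTED;
}
```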