M1 Pro / Max / Ultra Thread Affinity (e.g. in OpenMP) and scheduler core migration

I'm trying to hint to the task scheduler that some threads should be scheduled together, using the thread_policy_set API with THREAD_AFFINITY_POLICY (since there is no "real" thread-to-core affinity API).

All the examples set the policy after thread creation but before the thread starts executing. Unfortunately, I'm not creating these threads (OpenMP is), and when I try to use the API on an already running thread, I get a return value of KERN_INVALID_ARGUMENT (= 4):

thread_affinity_policy_data_t policy = { 1 };
auto r = thread_policy_set(mach_task_self(), THREAD_AFFINITY_POLICY, (thread_policy_t)&policy, THREAD_AFFINITY_POLICY_COUNT);

When I replace mach_task_self() with pthread_mach_thread_np(pthread_self()), I get a KERN_NOT_SUPPORTED error instead (= 46, "Empty thread activation (No thread linked to it)").
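For reference, the creation-time pattern those examples use looks roughly like this (a sketch paraphrased from Apple's Thread Affinity API release notes; worker and spawn_tagged are placeholder names of mine):

#include <mach/mach_init.h>
#include <mach/thread_policy.h>
#include <mach/thread_act.h>
#include <pthread.h>

static void* worker(void*) { return nullptr; } // placeholder workload

static pthread_t spawn_tagged(integer_t tag)
{
  pthread_t thread;
  // create the thread suspended, so the policy can be set before it runs
  pthread_create_suspended_np(&thread, nullptr, worker, nullptr);
  mach_port_t mach_thread = pthread_mach_thread_np(thread);
  thread_affinity_policy_data_t policy = { tag };
  thread_policy_set(mach_thread, THREAD_AFFINITY_POLICY,
    (thread_policy_t)&policy, THREAD_AFFINITY_POLICY_COUNT);
  thread_resume(mach_thread); // the thread only starts executing here
  return thread;
}

int main()
{
  pthread_join(spawn_tagged(1), nullptr);
  return 0;
}

(As it turns out below, on Apple Silicon even this creation-time pattern fails with KERN_NOT_SUPPORTED.)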

Has anyone used these APIs successfully on an already running thread?

Background: The code I'm working on divides a problem set into a small number of roughly equal-sized pieces (e.g. 8 or 16; this is an input parameter derived from the number of cores to be utilized). These pieces are not entirely independent and need to be processed in lock-step (as data from neighboring pieces is occasionally accessed).

Sometimes, when a neighboring piece isn't ready for a fairly long time, we call std::this_thread::yield(), which unfortunately seems to signal to the scheduler that this thread should move to the efficiency cores. That wreaks havoc with the assumption that computing each piece takes roughly the same amount of time, which is what keeps all threads in lock-step. :( (A minimal sketch of the wait pattern follows below.)
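To make that concrete, here is a minimal sketch of the wait pattern, with kPieces, step_done and wait_for_neighbor invented for illustration (the real code is of course more involved):

#include <atomic>
#include <cstdio>
#include <thread>

// Illustration only: each worker publishes the step it has finished and
// spins until its neighbor has caught up before starting the next step.
constexpr int kPieces = 2;
std::atomic<int> step_done[kPieces];

void wait_for_neighbor(int neighbor, int step)
{
  while (step_done[neighbor].load(std::memory_order_acquire) < step) {
    // The problematic call: it appears to invite the scheduler to migrate
    // the waiting thread to an efficiency core.
    std::this_thread::yield();
  }
}

int main()
{
  std::thread a([] {
    for (int s = 1; s <= 100; ++s) {
      step_done[0].store(s, std::memory_order_release); // publish our step
      wait_for_neighbor(1, s);                          // stay in lock-step
    }
  });
  std::thread b([] {
    for (int s = 1; s <= 100; ++s) {
      step_done[1].store(s, std::memory_order_release);
      wait_for_neighbor(0, s);
    }
  });
  a.join();
  b.join();
  std::puts("done");
  return 0;
}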

A similar (?) problem seems to happen with OpenMP barriers, which have terrible performance, on the M1 Ultra at least, unless KMP_USE_YIELD=0 is set (for the OpenMP runtime from LLVM). Can this automatic migration (note: not the relinquishing of the remaining time slice) be prevented?
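For LLVM's OpenMP runtime that means launching with the variable set in the environment, e.g. KMP_USE_YIELD=0 ./a.out (assuming the default a.out output name from the compile commands further down).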


No, it can't be prevented.

I don’t have direct answers to your questions — if you’d like me, or more likely one of my colleagues, to research your specific questions in depth, please open a DTS tech support incident — but I wanted to post a link to the Tune CPU job scheduling for Apple silicon games techtalk from Dec 2021. It is very enlightening.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

I've filed a DTS incident (Case ID: 797192903); the repro looks as follows:

#include <pthread.h> // for pthread_self() and pthread_mach_thread_np()
#include <mach/mach_init.h>
#include <mach/thread_policy.h>
#include <mach/thread_act.h>

#include <iostream>

int main (int argc, char const *argv[])
{
#ifdef _OPENMP
#pragma omp parallel
#endif
  {
    thread_affinity_policy_data_t policy = { 1 }; // non-zero affinity tag
    // note: neither mach_task_self() nor pthread_mach_thread_np() returns
    // a new port right here, so there is nothing to release
    auto r1 = thread_policy_set(mach_task_self(), THREAD_AFFINITY_POLICY,
      (thread_policy_t)&policy, THREAD_AFFINITY_POLICY_COUNT); // 4 = KERN_INVALID_ARGUMENT (task port, not a thread port)
    auto r2 = thread_policy_set(pthread_mach_thread_np(pthread_self()), THREAD_AFFINITY_POLICY,
      (thread_policy_t)&policy, THREAD_AFFINITY_POLICY_COUNT); // 46 = KERN_NOT_SUPPORTED on Apple Silicon
#ifdef _OPENMP
#pragma omp critical
#endif
    std::cout << "r1 = " << r1 << " r2 = " << r2 << std::endl;
  }
  
  return 0;
}

For the non-OpenMP version, compile with clang++ -std=c++11 main.cpp; for the OpenMP version, use something like /opt/homebrew/opt/llvm/bin/clang++ -fopenmp -std=c++11 main.cpp -L /opt/homebrew/opt/llvm/lib.
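Either way, on Apple Silicon each thread prints r1 = 4 r2 = 46, matching the two comments in the code above.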

Accepted Answer

I got a nice explanation from a person in DTS, which I'll briefly summarize here for posterity:

  • The mach_task_self() call shouldn't work at all and is wrong (I got the idea from https://codereview.chromium.org/276043002/, where the two are used interchangeably).
  • The other call makes it to the right place, but thread affinity is not implemented/supported on Apple Silicon.

(There, the argument was made that "all the cores are basically sharing a single unified cache", which doesn't quite match up with the video describing the arrangement of four P-cores sharing an L2 cache.)

And because I always have trouble following XNU's dispatching of function calls (especially once the Mach layer gets involved), here's the walk-through of the dispatches:
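What follows is my condensed model of that path, based on reading the open-source XNU (osfmk/kern/thread_policy.c, osfmk/kern/affinity.c and the arm64 machine routines). It is a simplified paraphrase that just reproduces the two return codes, not the actual kernel code:

#include <cstdio>

using kern_return_t = int;
constexpr kern_return_t KERN_SUCCESS          = 0;
constexpr kern_return_t KERN_INVALID_ARGUMENT = 4;
constexpr kern_return_t KERN_NOT_SUPPORTED    = 46;

struct thread_t { bool valid; };

// Step 1: MIG converts the port you pass into a thread reference. A task
// port (mach_task_self()) is not a thread port, so conversion yields nothing.
thread_t convert_port_to_thread(bool is_thread_port)
{
  return thread_t{ is_thread_port };
}

// Step 3: on arm64 the kernel reports zero affinity sets (on Intel this is
// derived from the cache topology)...
int ml_get_max_affinity_sets() { return 0; }

// ...so thread_affinity_is_supported() is always false there.
bool thread_affinity_is_supported() { return ml_get_max_affinity_sets() != 0; }

// Step 2: the policy dispatcher rejects a null thread, and for
// THREAD_AFFINITY_POLICY bails out before consulting any topology.
kern_return_t thread_policy_set_model(bool is_thread_port)
{
  thread_t thread = convert_port_to_thread(is_thread_port);
  if (!thread.valid) {
    return KERN_INVALID_ARGUMENT;   // the mach_task_self() case: r1 = 4
  }
  if (!thread_affinity_is_supported()) {
    return KERN_NOT_SUPPORTED;      // the thread-port case: r2 = 46
  }
  return KERN_SUCCESS;              // never reached on Apple Silicon
}

int main()
{
  std::printf("task port   -> %d\n", thread_policy_set_model(false)); // 4
  std::printf("thread port -> %d\n", thread_policy_set_model(true));  // 46
  return 0;
}

So the thread-port call is rejected generically because arm64 reports zero affinity sets; the M1's actual cluster topology is never even consulted.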
