M1 Ultra vs. old iMac: why is retain/release on M1 42 times slower?

I have a Swift project that operates on pixels (image generation) in parallel. I ran the exact same code with the same settings (81M pixels) on my 2017 iMac and my M1 Ultra (release builds, macOS). The M1 Ultra takes nearly 2 minutes and the iMac takes 10.3 seconds. The main difference in the profiler is the calls to retain/release, which total 84 seconds on the M1 Ultra and 2 seconds on the iMac i7. How is this even possible? It makes this $4K Mac Studio a boat anchor as far as my need for it goes. Despite running 20 threads vs. 8, the Mac Studio is horrifically slow solely due to the retain/release molasses.

This is a command-line app, not a user app. Instruments seems hopeless at telling me exactly what is even calling retain/release. I can see it's ARM vs. Intel code generation, and otherwise the rest of the code seems equivalent.

If someone from Apple would comment, I'd be willing to burn a support ticket (I used to work at DTS long ago). Otherwise I will post it to Medium to see if anyone can explain this.

This is the whole reason I spent so much on the Mac Studio; now it seems like a waste of money.

This would make sense if the Objective-C runtime (or ARC) is still Intel code, thus requiring a switch to Rosetta. If so, people should be made aware that any capture of self could cause a slowdown on M1. Odd if true, but I can't see how Apple could otherwise code such a simple concept so badly if it's native ARM.

I rewrote it to pass a struct instead of referencing class properties, and now it is massively faster. It would still be nice to know for sure whether my conjecture about the runtime is true.
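
For anyone who lands here with the same symptom, here is a minimal sketch of the kind of rewrite I mean. It is not my actual project code: `RenderSettings`, `shade`, and `render` are made-up names for illustration. The idea is that the parallel loop only touches a small struct and a raw pixel pointer, so the inner loop never goes through class properties on a shared object. My guess (not confirmed by anyone at Apple) is that many threads hitting the reference count of the same object was what showed up as retain/release time.

```swift
import Foundation

// Hypothetical settings type. It holds only plain values (no class references),
// so copying it into each worker involves no reference counting.
struct RenderSettings {
    let width: Int
    let height: Int
    let gain: Float
}

// Hypothetical per-pixel function. It takes value types only and never
// touches a shared class instance.
func shade(x: Int, y: Int, settings: RenderSettings) -> Float {
    return Float(x + y) * settings.gain
}

// Renders one row per iteration in parallel. The closure captures a struct
// and a raw pointer rather than `self` or other class instances.
func render(settings: RenderSettings) -> [Float] {
    var pixels = [Float](repeating: 0, count: settings.width * settings.height)
    pixels.withUnsafeMutableBufferPointer { buffer in
        // An empty buffer may have no base address, so bail out early.
        guard let base = buffer.baseAddress else { return }
        DispatchQueue.concurrentPerform(iterations: settings.height) { y in
            let rowStart = y * settings.width
            for x in 0..<settings.width {
                // Each row writes to its own disjoint range of the buffer.
                base[rowStart + x] = shade(x: x, y: y, settings: settings)
            }
        }
    }
    return pixels
}
```

Calling `render(settings: RenderSettings(width: 4, height: 3, gain: 0.5))` returns a 12-element array. The version that was slow for me read the equivalent of `settings` through properties on a class from inside the parallel closure.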

