dlopen taking huge amount of time in macOS Mojave and Catalina

Hello, we have encountered delays of ~35 seconds when loading some of our internal libraries (*.dylib files) using dlopen. The libraries are codesigned. Only the first load of the library is taking 35 seconds. The next dlopen calls (for the same library) are taking a few ms.

We encountered the issue on macOS Mojave 10.14.6 and macOS Catalina 10.15. We are calling dlopen in these two forms:

dlopen(library, RTLD_NOW)
dlopen(library, RTLD_LAZY | RTLD_GLOBAL)
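
To put numbers on the difference between the first and later loads, a small timing wrapper can help. This is only a minimal sketch: `timed_dlopen` and `LoadResult` are hypothetical names, not part of any Apple API, and the only real calls used are `dlopen` and `std::chrono::steady_clock`.

```cpp
#include <dlfcn.h>

#include <chrono>

// Hypothetical helper type: the handle returned by dlopen plus the wall-clock
// time the call took, in milliseconds.
struct LoadResult {
    void* handle;
    double millis;
};

// Wraps a single dlopen call with a monotonic-clock measurement.
// `flags` is passed through unchanged, e.g. RTLD_NOW or RTLD_LAZY | RTLD_GLOBAL.
LoadResult timed_dlopen(const char* path, int flags) {
    const auto start = std::chrono::steady_clock::now();
    void* handle = dlopen(path, flags);
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = end - start;
    return {handle, elapsed.count()};
}
```

Calling `timed_dlopen` twice with the same library path and logging both `millis` values should show whether only the first call is slow. If `handle` comes back null, `dlerror()` gives the reason.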

Please let us know if you experienced the issue and if there is a solution to this.

Many thanks in advance

I’d like to clarify what you mean by “first load”. Does the delay come back when you:
  • Quit the loading process?

  • Rebuild the library?

  • Restart the Mac?

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@apple.com"
By "first load" I mean when dlopen is called for the first time (when the application starts) for the respective library. After a few seconds dlopen is called again for the same library but this time is returning immediately. This is normal because the library is already loaded in the process memory.

The delay comes back if I restart the machine, rebuild the library and/or quit the loading process.
Basically, every time the application starts, the delay comes back.
If you disconnect your Mac from the network, do you still see the delay?

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@apple.com"
Thanks for your help, we will ask our customer to perform the test.
The issue reproduces only in the customer's environment; we couldn't reproduce it on our own.
One thing to mention is that the dynamic libraries are signed and notarized with a different certificate than the Application that is loading them. Hardened Runtime is enabled. Hope this is helpful information.
Thanks again!

LATER UPDATE: we are also seeing delays of about 4-7 seconds when checking the signature of dynamic libraries.
It seems our customer is not using an internet connection at all when our application calls dlopen. Do you think this could be the cause?

Do you think this could be the cause?

It’s certainly worth considering.

The reason I asked about networking is that I’ve seen similar scenarios (not related to code, but other subsystems) where long delays like this are related to network timeouts. I can think of two places where running code might hit the network:
  • Doing trust evaluation on the code’s signing certificate might hit the network for various reasons (revocation checking, fetching intermediates, and so on).

  • Gatekeeper might be trying to fetch the code’s notarisation ticket.

Normally such requests fail quickly if the network is unavailable but you can see long delays if the network is on but broken in some way. Hence my earlier recommendation to test with the network off.

There’s a couple of ways you could continue here:
  • You could try setting up a broken network in your office in an attempt to reproduce the problem. One good option here is the Network Link Conditioner.

  • You could get a sysdiagnose from your customer and then rummage through the system log to see if you can spot the cause of the delay.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@apple.com"
Hi, sorry for this late reply.
Thanks for your answer, I will try to reproduce it using Network Link Conditioner.

Thanks again!
Hi, we're experiencing the same issue here running Catalina 10.15.7, even when internet connectivity is off. We saw that during the 30-second delay the process CPU was at 100%, and when we took a few core dumps during that delay we found several stack traces of Apple's ImageLoader code running for the entire duration of the hang.

example backtrace (most of them were stuck in trieWalk):
* frame #0: 0x0000000110e15889 dyld`ImageLoader::trieWalk(unsigned char const*, unsigned char const*, char const*) + 177
frame #1: 0x0000000110e1f400 dyld`ImageLoaderMachOCompressed::findShallowExportedSymbol(char const*, ImageLoader const**) const + 112
frame #2: 0x0000000110e19075 dyld`ImageLoaderMachO::findExportedSymbol(char const*, bool, char const*, ImageLoader const**) const + 37
frame #3: 0x0000000110e145c3 dyld`ImageLoader::weakBindOld(ImageLoader::LinkContext const&) + 1485
frame #4: 0x0000000110e1228f dyld`ImageLoader::link(ImageLoader::LinkContext const&, bool, bool, bool, ImageLoader::RPathChain const&, char const*) + 333
frame #5: 0x0000000110e04a01 dyld`dyld::link(ImageLoader*, bool, bool, ImageLoader::RPathChain const&, unsigned int) + 161
frame #6: 0x0000000110e0ee0b dyld`dlopen_internal + 477

After further investigation we've found that all of the symbols from our .so files are exported. About 2/3 of them are bound successfully by ImageLoader::recursiveBind() almost instantaneously, but the last 1/3 is handled by ImageLoader::weakBindOld(), which searches for them, fails to "strong bind" them and logs them as "found weak". I'm still not sure what the difference is between regular binding and weak binding, so some context on what we should change in our build process would really help to resolve this.

I'm still not sure what's the difference between regular binding and a weak binding

In dyld the term weak can refer to two different concepts:
  • A library can import a symbol (or an entire library) weakly. In that case the library will load even if the symbol (or entire library) is missing. Historically this was used for availability checking.

  • The C++ One Definition Rule requires dyld to merge definitions across multiple Mach-O images. If you rely on this a lot, it can radically slow down library loading.

I don’t know the dyld implementation well enough to tell you which of these you’re hitting based on the backtraces you posted, but I suspect it’s the latter. Do you make heavy use of C++?
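
For readers unfamiliar with the second case, here is a minimal sketch of the kind of C++ code that typically produces coalescable (weak) definitions. The names `Counter`, `IdTag` and `next_id` are hypothetical; the assumption is that a header like this is included by several dylibs, each of which then emits and exports its own copy.

```cpp
// Hypothetical header content, imagined as shared by several dylibs.
template <typename Tag>
struct Counter {
    // A member function of a class template with a function-local static.
    // Each image that instantiates Counter<Tag> typically emits its own weak
    // definition of this function and of the static, and the One Definition
    // Rule requires them to be merged into one at load time.
    static int& value() {
        static int v = 0;
        return v;
    }
};

struct IdTag {};

// An inline function: also typically emitted as a weak definition in every
// image that uses it.
inline int next_id() { return ++Counter<IdTag>::value(); }
```

Heavily templated libraries generate very large numbers of such definitions, which is why heavy C++ use is the usual suspect for symbol coalescing cost.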

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@apple.com"
From what I read in ImageLoader::weakBindOld, it looks like this is the process of binding undefined symbols in one library to their implementations in dependent libraries.
The function is going through all the images that were just loaded that need coalescing, goes through every undefined symbol and iterates through all other loaded images to find the implementation for it using a trie search according to the symbol's name.
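
To illustrate why this shape of loop gets expensive, here is a toy cost model of the description above. It is not dyld's code: `Image` and `count_lookups` are hypothetical, and a hash set stands in for the export trie, so it only counts how many per-image lookups the nested iteration performs.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a loaded image: the symbols it needs coalesced
// and the symbols it exports (a hash set here, not a real export trie).
struct Image {
    std::vector<std::string> weakSymbols;
    std::unordered_set<std::string> exports;
};

// For every weak symbol of every image, scan the loaded images in order until
// one of them exports the symbol. Returns the total number of per-image
// lookups, which grows roughly as images x symbols x images.
std::size_t count_lookups(const std::vector<Image>& images) {
    std::size_t lookups = 0;
    for (const Image& importer : images) {
        for (const std::string& name : importer.weakSymbols) {
            for (const Image& candidate : images) {
                ++lookups;
                if (candidate.exports.count(name) != 0) {
                    break;  // found a definition, stop scanning
                }
            }
        }
    }
    return lookups;
}
```

With hundreds of thousands of symbols and dozens of images, the lookup count in a model like this quickly reaches tens of millions, and each real lookup is a trie walk over the symbol name.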
We compile these .so/.dylib files for Linux and Mac, and it seems that Linux does this binding process in less than 1 second.
Where can I get some guidance to figure out what we should try to change in our compilation/linkage process to mitigate this?

We compile these .so/.dylib files for Linux and Mac

Do your libraries make heavy use of C++? That’s by far the most common source of symbol coalescing.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@apple.com"
Yes, these libraries are C++ libraries, they include libraries that have a huge amount of symbols like boost and folly, and since everything is built into dynamic libraries a lot of load time coalescing is expected. What bothers me is the Apple specific implementation in dyld. Our code is fully portable and this issue doesn't happen on Linux, even though it's about the same amount of symbols. This looks like a bug in weakBind.
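
One direction that is commonly tried for this kind of problem, though not confirmed anywhere in this thread, is to stop exporting everything and mark only the intended API as visible. A minimal sketch, assuming Clang or GCC; the macros `MYLIB_EXPORT` and `MYLIB_HIDDEN` and the function names are hypothetical:

```cpp
// Hypothetical visibility macros for a library called "mylib".
#define MYLIB_EXPORT __attribute__((visibility("default")))
#define MYLIB_HIDDEN __attribute__((visibility("hidden")))

namespace detail {
// Internal helper: hidden, so it is not exported from the dylib and should
// not take part in cross-image symbol lookup.
MYLIB_HIDDEN inline int scale(int v) { return v * 2; }
}  // namespace detail

// Public entry point: the only symbol here that is meant to be exported.
extern "C" MYLIB_EXPORT int mylib_compute(int v) {
    return detail::scale(v) + 1;
}
```

The compiler flags `-fvisibility=hidden` and `-fvisibility-inlines-hidden` apply the same idea to a whole build without annotating each declaration. Whether this removes enough weak exports to matter here is an assumption that would need to be measured.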

Yes, these libraries are C++ libraries

OK. I’m going to recommend that you open a DTS tech support incident and talk to our tools specialist about this.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@apple.com"