Hi All,
We have an Endpoint Security system extension. We are facing an issue on macOS Sonoma only: the open() API never returns when we try to open files, and the OS then kills/crashes the extension.
While streaming the log we found the following lines for our extension:
error 12:50:51.093673+0530 tccd Failed to create LSApplicationRecord for file:///Library/SystemExtensions/3378971F-D41D-4230-A887-E0DC0F61E98D/com.*.sysextcontainer.onlineext.systemextension/: 'The operation couldn’t be completed. (OSStatus error -10811.)'
It seems some access is removed internally by Apple at boot; however, we can still see that our extension has Full Disk Access in System Settings.
We installed the new macOS Sequoia public beta (24A5289h) and the issue is not observed there. It is also not seen on earlier releases (Big Sur, Monterey, Ventura); it occurs only on Sonoma.
We have already filed a feedback report: FB13806349
...
Thanks & Regards,
Mohmad Vasim
We have an Endpoint Security system extension. We are facing an issue on macOS Sonoma only: the open() API never returns when we try to open files, and the OS then kills/crashes the extension.
To start with, I wrote about auth request response times in an earlier forum post, which I would recommend reviewing. At the simplest level, you cannot call "open" as part of handling any auth request while still ensuring the <100ms response time I recommended.
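To make that concrete, here is a minimal sketch of the shape I'd expect (my own illustration, not your code; the event choices, the blanket "allow" verdicts, and the queue name are assumptions): the handler block only delivers a verdict and returns, and anything that might block, including open, runs on a separate queue against a retained copy of the message.

#include <EndpointSecurity/EndpointSecurity.h>
#include <dispatch/dispatch.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Anything slow runs here, never inside the handler block. A concurrent
    // queue also gives you the processing "width" that issue #2 below is about.
    dispatch_queue_t workQueue =
        dispatch_queue_create("com.example.es.work", DISPATCH_QUEUE_CONCURRENT);

    es_client_t *client = NULL;
    es_new_client_result_t res = es_new_client(&client,
        ^(es_client_t *c, const es_message_t *msg) {
            // Respond first, before doing anything that could block.
            switch (msg->event_type) {
                case ES_EVENT_TYPE_AUTH_OPEN:
                    // AUTH_OPEN is answered with flags, not an auth result.
                    es_respond_flags_result(c, msg, UINT32_MAX, false);
                    break;
                case ES_EVENT_TYPE_AUTH_EXEC:
                    es_respond_auth_result(c, msg, ES_AUTH_RESULT_ALLOW, false);
                    break;
                default:
                    break;
            }
            // File I/O, scanning, logging, etc. happen against a retained
            // copy of the message, off the handler block.
            es_retain_message(msg);
            dispatch_async(workQueue, ^{
                // ... expensive work goes here ...
                es_release_message(msg);
            });
        });
    if (res != ES_NEW_CLIENT_RESULT_SUCCESS) {
        fprintf(stderr, "es_new_client failed: %d\n", res);
        return EXIT_FAILURE;
    }

    es_event_type_t events[] = { ES_EVENT_TYPE_AUTH_OPEN, ES_EVENT_TYPE_AUTH_EXEC };
    if (es_subscribe(client, events, sizeof(events) / sizeof(events[0])) != ES_RETURN_SUCCESS) {
        fprintf(stderr, "es_subscribe failed\n");
        return EXIT_FAILURE;
    }

    dispatch_main();
}

In a real product the verdict obviously has to come out of your analysis rather than being a blanket allow, but the structural point stands: the verdict has to go back quickly, and any work it depends on has to be bounded against the message's deadline (see the deadline sketch further down) rather than performed inline in the handler.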
However, open in particular raises many additional issues, namely:
1. It's a critical security bottleneck that requires additional security verification by the system and is very likely to be monitored by other ES clients. This means that any syscall you make is VERY likely to generate additional auth calls into your ES client. At best this creates additional delays; at worst it creates a recursive auth loop (see the muting sketch after this list).
2. It's a very high volume syscall, so your ES client is often going to be processing a large number of open calls at any given time. This means that any issues introduced by #1 can then be magnified and expanded by the complexity created by the additional call volume. For example, it's possible for a nested auth call cycle that works fine at relatively low call volumes to stall/deadlock at higher call volumes because your client doesn't have enough threads/processing "width" to process the dependent auth calls.
3. The only reason to open the input file is to perform some kind of I/O on it, typically some form of file scanning. Unfortunately, that also provides a clear vector an attacker can use to break or interfere with your ES client. The problem here is that, particularly for read/write calls, the attacker has a great deal of control over how long any I/O call will actually take. The attacker may have truly direct control (for example, by controlling the response time of a target server or by attaching an intentionally slow mass storage device), or they may simply slow overall response time by drowning the target in I/O requests. Either way can easily slow things enough that the system will terminate you (see the deadline sketch below).
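On the recursion point in #1, one partial mitigation (a sketch on my part, not a complete answer, and the function name is made up) is to mute your own process so that the I/O your product performs doesn't come back into your own client as more events. Note that this does nothing about other vendors' ES clients watching you:

#include <EndpointSecurity/EndpointSecurity.h>
#include <mach/mach.h>
#include <stdio.h>

// Sketch: mute the current process so the file I/O it performs while
// scanning doesn't generate further events into this same ES client.
static void mute_self(es_client_t *client) {
    audit_token_t self_token;
    mach_msg_type_number_t count = TASK_AUDIT_TOKEN_COUNT;
    kern_return_t kr = task_info(mach_task_self(), TASK_AUDIT_TOKEN,
                                 (task_info_t)&self_token, &count);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "task_info failed: %d\n", kr);
        return;
    }
    if (es_mute_process(client, &self_token) != ES_RETURN_SUCCESS) {
        fprintf(stderr, "es_mute_process failed\n");
    }
}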
Basically, you need to be VERY aware that any system call your ES client makes is potentially dangerous, ESPECIALLY I/O related calls like "open".
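Related to #3, the one hard backstop you do have is the deadline the system attaches to every message. A sketch of how you might use it (the 5 second safety margin is my own arbitrary choice, not a documented value): work out how much time you can safely spend, and if a scan can't finish inside that budget, respond with your fail-open or fail-closed default instead of letting the system SIGKILL you.

#include <EndpointSecurity/EndpointSecurity.h>
#include <dispatch/dispatch.h>
#include <mach/mach_time.h>

// Sketch: how long can we safely spend on this message before responding?
// Leaves a generous margin ahead of the kernel's deadline (msg->deadline is
// expressed in Mach absolute time units).
static uint64_t usable_nanoseconds(const es_message_t *msg) {
    static mach_timebase_info_data_t timebase;
    if (timebase.denom == 0) {
        mach_timebase_info(&timebase);
    }
    uint64_t now = mach_absolute_time();
    if (msg->deadline <= now) {
        return 0;
    }
    uint64_t remaining_ns =
        (msg->deadline - now) * timebase.numer / timebase.denom;
    const uint64_t margin_ns = 5 * NSEC_PER_SEC;  // arbitrary safety margin
    return (remaining_ns > margin_ns) ? remaining_ns - margin_ns : 0;
}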
In terms of the specific error you mentioned here:
error 12:50:51.093673+0530 tccd Failed to create LSApplicationRecord for file:///Library/SystemExtensions/3378971F-D41D-4230-A887-E0DC0F61E98D/com.*.sysextcontainer.onlineext.systemextension/: 'The operation couldn’t be completed. (OSStatus error -10811.)'
...that error is not meaningful. If you look at the specific messages logged for your extension, I believe you'll find that all of them are returned as "Allowed", so tccd didn't actually block anything. I'll also note that I didn't see any connection between tccd activity and the ES client crashes in the log.
In terms of why you're crashing, I don't really know. You didn't include any crash logs from your component so I don't know what was actually occurring there and the system log can only really provide hints.
However, I did spend some time looking at the log and did notice a few interesting things.
A) Process 395 is part of your app; it registered as an ES client and then unregistered at a later point:
2024-05-22 12:09:29.511+0530 kernel Client connected (pid 395 cdhash 45f1ae0fba8338785855e33d2029c005135e7e19)
...
2024-05-22 12:12:01.932+0530 kernel Client disconnected (pid 395, com.quickheal.sy)
However, 395 did NOT exit and was in fact still alive ~1 hour later when the spindump was collected by the sysdiagnose. I don't know why that occurred, but it was odd enough that it seemed worth noting.
B) I'm not sure how it connects to the underlying failure, but I believe exec auth requests are what's causing the specific failure. If you filter the data down enough, there's a very clear sequence that breaks down as:
The client connects to ES:
2024-05-22 13:27:19.500+0530 kernel Client connected (pid 15803 cdhash 1c441b3d4b9184a3b10f6dfd708f9b6e1a03f99a)
Launchd tries to spawn a process:
2024-05-22 13:27:19.551+0530 launchd[1:129500] [] internal event: WILL_SPAWN, code = 0
ES client is terminated for failing to respond:
2024-05-22 13:27:34.502+0530 kernel Client did not respond in appropriate amount of time (client pid: 15803), sent SIGKILL
2024-05-22 13:27:34.502+0530 kernel Client disconnected (pid 15803...)
Launchd's exec completes:
2024-05-22 13:27:34.504+0530 launchd[1:129500] [] xpcproxy spawned with pid 15804
What's particularly interesting is that this is ALSO true of the FIRST crash that starts the crash cycle:
2024-05-22 12:09:29.324+0530 kernel Client connected (pid 403 cdhash 1c441b3d4b9184a3b10f6dfd708f9b6e1a03f99a)
...
2024-05-22 13:26:13.802+0530 launchd[1:126618] [] internal event: WILL_SPAWN, code = 0
2024-05-22 13:26:28.084+0530 kernel Client did not respond in appropriate amount of time (client pid: 403), sent SIGKILL
2024-05-22 13:26:28.084+0530 kernel Client disconnected (pid 403 )
2024-05-22 13:26:28.087+0530 launchd[1:126618] [] xpcproxy spawned with pid 15740
Basically, your "original" client ran fine for ~90 min, processing ~1600 WILL_SPAWN events without issue. It was then terminated EXACTLY 15s after the "WILL_SPAWN" event above. After that point, as far as I can tell, your ES client never processed a WILL_SPAWN event again. It's possible that this is coincidental, but the pattern occurred often enough that it seemed interesting, particularly since it also occurred with the initial failure of pid 403.
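If you want to confirm that pattern from inside your own client while you investigate, one option (again a sketch; the function name and the 1 second threshold are mine) is to time every exec auth response yourself so that slow verdicts show up in your own logging long before the kernel resorts to SIGKILL:

#include <EndpointSecurity/EndpointSecurity.h>
#include <dispatch/dispatch.h>
#include <mach/mach_time.h>
#include <os/log.h>

// Sketch: respond to an AUTH_EXEC message and log how long the verdict took.
// Capture mach_absolute_time() at the top of your handler block and pass it
// in as handler_entry.
static void respond_exec_and_log(es_client_t *client, const es_message_t *msg,
                                 es_auth_result_t verdict, uint64_t handler_entry) {
    static mach_timebase_info_data_t timebase;
    if (timebase.denom == 0) {
        mach_timebase_info(&timebase);
    }
    es_respond_auth_result(client, msg, verdict, false);

    uint64_t elapsed_ns =
        (mach_absolute_time() - handler_entry) * timebase.numer / timebase.denom;
    if (elapsed_ns > NSEC_PER_SEC) {  // already far slower than you want
        os_log_error(OS_LOG_DEFAULT, "Slow exec auth response: %llu ms",
                     elapsed_ns / NSEC_PER_MSEC);
    }
}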
__
Kevin Elliott
DTS Engineering