This is a follow-up to feedback FB9144718, which we also discussed at a WWDC21 "Performance, power, and stability" lab session.
Issue Summary
The actual issue we are facing is that our XPC service is not running as fast as we would expect, especially on Intel machines; it is somewhat better on M1 machines but still not really good.
After a lot of profiling with Instruments, it finally turned out that our processing regularly stalls because our processing thread is preempted and put on hold, sometimes for a tremendous amount of time (over 32 ms has been observed). Even though it is preempted for just a couple of ms most of the time, this is still a lot considering that the actual work it would otherwise perform is only in the range of microseconds.
This is probably happening because we don't use the XPC service just for processing application messages through the XPC protocol (which we do as well); it also receives requests through a mach port from another process.
This causes our thread priority to drop to 4 (see highlighted log line), and that's why we get preempted for so long. It is not equally dramatic on M1 because we are not preempted there; instead we are forced to run on the high-efficiency cores rather than the high-performance ones.
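For anyone who wants to spot-check this kind of stall without Instruments, here is a rough sketch in C. It is not the code we profiled with; the names `gap_stats` and `measure_gaps` are made up for illustration. The idea is to read the monotonic clock in a tight loop and record the longest interval between two consecutive reads. A large interval suggests the thread was off-CPU, although interrupts and the cost of the clock read itself also contribute.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Result of one measurement run (hypothetical helper type). */
typedef struct {
    uint64_t max_gap_ns;  /* longest interval between two consecutive samples */
    uint64_t gaps_over;   /* number of intervals longer than the threshold */
} gap_stats;

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Spin for `iterations` clock reads and record how long the thread
   went without making progress between consecutive reads. */
static gap_stats measure_gaps(uint64_t iterations, uint64_t threshold_ns) {
    gap_stats s = {0, 0};
    uint64_t prev = now_ns();
    for (uint64_t i = 0; i < iterations; i++) {
        uint64_t t = now_ns();
        uint64_t gap = t - prev;
        if (gap > s.max_gap_ns) s.max_gap_ns = gap;
        if (gap > threshold_ns) s.gaps_over++;
        prev = t;
    }
    return s;
}
```

Calling `measure_gaps(10000000, 1000000)` from the processing thread would count every interval longer than 1 ms; both numbers are arbitrary and should be tuned to the workload.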
Ideas from the Lab
Completely restructuring our entire implementation is going to happen eventually for Big Sur and newer anyway, but we have to maintain the current structure as long as we also need to support pre-Big Sur macOS versions, so it would be great to have a less dramatic fix.
Two suggestions were made at the lab:
Change the RunLoopType in the XPC plist from dispatch_main to NSRunLoop. We tried that, but it didn't make any difference.
Add a key ProcessType with the value Interactive to the XPC plist. This key is not documented for XPC services, only for launchd daemons, but we were told it should actually work for XPC services as well. We tried that too, both at the top level and inside the XPC sub-key, but that didn't make a difference either.
Another Idea That Didn't Work
Now that second suggestion made me look up that key in the man page for launchd.plist and what I found there was pretty interesting. Apparently there is a ProcessType value documented as
Adaptive
Adaptive jobs move between the Background and Interactive classifications based on activity over XPC connections. See xpc_transaction_begin(3) for details.
This seems to be our problem. Our XPC service is considered inactive when it processes messages over the mach port. Looking up the documentation of xpc_transaction_begin(3) tells me:
Services may extend the default behavior using xpc_transaction_begin() and xpc_transaction_end(), which increment and decrement the transaction count respectively. This may be necessary for services that send periodic messages to their clients, not in direct reply to a received message.
Using these two functions also frees us from the requirement to enable/disable sudden termination on our own, as it will automatically be controlled by these two functions as well. Yet even using these two functions to indicate activity doesn't prevent us from being preempted at regular intervals, as our priority still drops to priority level 4 while we are in the middle of processing (we haven't called xpc_transaction_end() yet). We seem to use them correctly, though: sudden termination is disabled on our behalf as long as our XPC service remains in the active state (it will only receive mach messages for processing while in that state) and is re-enabled when we leave the active state again.
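To make clear what "using these two functions" means here, a minimal sketch of the bracketing follows. It is not our actual code: `service_enter_active`, `service_leave_active` and the `service_active` flag are hypothetical names. `xpc_transaction_begin()` and `xpc_transaction_end()` are the real functions from `<xpc/xpc.h>`; the non-Apple branch only contains stand-ins so that the sketch compiles elsewhere.

```c
#include <assert.h>
#include <stdbool.h>

#ifdef __APPLE__
#include <xpc/xpc.h>
#else
/* Stand-ins so the sketch compiles off macOS; they are NOT the real API. */
static int stub_open_transactions = 0;
static void xpc_transaction_begin(void) { stub_open_transactions++; }
static void xpc_transaction_end(void) {
    assert(stub_open_transactions > 0);  /* catches an unbalanced end */
    stub_open_transactions--;
}
#endif

/* Hypothetical flag: true while the service accepts mach messages. */
static bool service_active = false;

/* Called when the service enters the active state. Opening a transaction
   tells the XPC runtime the service is busy, even though the work arrives
   over a mach port rather than over the XPC connection. */
static void service_enter_active(void) {
    if (service_active) return;  /* keep begin/end strictly paired */
    xpc_transaction_begin();
    service_active = true;
}

/* Called when the service leaves the active state again. */
static void service_leave_active(void) {
    if (!service_active) return;
    service_active = false;
    xpc_transaction_end();
}
```

The guard on `service_active` is a design choice of the sketch: the transaction count is incremented and decremented by the two calls, so they have to stay balanced.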
Final Thoughts
The man page of xpc_transaction_begin() also says:
The XPC runtime will also automatically manage the service's priority based on where a message came from. If an app sends a message to the service, the act of sending that message will boost the destination service's priority and resource limits so that it can more quickly fill the request. If, however, a service gets a message from a background process, the service stays at a lower priority so as not to interfere with work initiated as a direct result of user interaction.
It looks like this is not working the way we use the XPC service at the moment. Our mach port messages either come from a System Extension (Big Sur and up) or from a root daemon started by launchd (Catalina and below; ProcessType is Interactive and the nice value is -10), but apparently these messages cannot boost our XPC service, so it stays at low priority.
When trying to activate my System Extension of type Network Extension, the delegate receives OSSystemExtensionErrorValidationFailed as error. However, when I remove the NEMachServiceName entry that Xcode created in the Info.plist file (and where I replaced the ID with the real ID of the System Extension), the activation succeeds.
Without that key I can even create a connection and start it, yet nothing seems to happen when I do so. System Preferences shows the created connection and that it is in state connecting, but I see no process getting spawned and it doesn't seem as if the class set for the key com.apple.networkextension.packet-tunnel is ever created either. There's no error reported anywhere and nothing seems to happen until I stop the connection again.
I wonder if there is a general problem with the validity of that System Extension and removing the NEMachServiceName doesn't really solve that problem, it just prevents that problem from being detected. Or is the key NEMachServiceName even a requirement for a System Network Extension, and is it expected behavior that it cannot be launched if that key isn't present? Signing and profile are managed by Xcode; Xcode says everything is okay, and the entitlements should be okay as well.
When installing without that key, I can also see that the extension has been installed using "systemextensionsctl list". However, every time I activate the same system extension again from my app, it seems as if the installed one is uninstalled and the same version is then reinstalled. Not sure if that is an indicator of a problem or just because I start my app from Xcode and the system extension gets a new build ID on every run.
We have a System Extension that fetches packets for various VPN protocols that our app supports and then hands them off to various XPC services (started and maintained by our app) that implement the actual protocols (e.g. IPSec, OpenVPN, SSL…). This design allows us to easily use the actual protocol implementation with versions that don't use a System Extension.
On older macOS releases, instead of a System Extension, we used a KEXT and root process to fetch packets, but also handed these off to the same XPC services.
The issue
With the new System Extension → XPC design, we're seeing a significant throughput performance hit once network speeds exceed a certain threshold, which we need to address.
Questions
Our old design would drop outgoing packets if they weren't being processed fast enough (as we must grab them from the kernel either way). It's unclear what the SysExt does in this case: are they dropped, is there a buffer, could this be the cause of the delay?
We're using Mach messages to pass packets from the SysExt to our XPC service (we are using exactly the same kind of messaging and code to pass from the root process to the XPC service). There's no other processing being done in the SysExt itself. Is there a faster IPC call we should be using to talk to the SysExt? Other options we've considered are passing a reference to a socket for direct communication or using a shared memory approach (mmap).
Are there any other common optimizations that we might need to investigate for the System Extension?
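To make the shared memory option and the drop-when-too-slow behavior from the questions above concrete, here is a minimal sketch in C of a single-producer/single-consumer ring buffer that lives in a shared mapping and drops packets when it is full. It is an illustration under assumptions, not our implementation: the names (`ring`, `ring_push`, `ring_pop`), the slot count and the slot size are all made up. The anonymous mapping used here is only visible to forked children; sharing between a SysExt and an XPC service would need a named or otherwise transferable memory object instead.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define SLOT_COUNT 8u     /* tiny on purpose; must be a power of two */
#define SLOT_BYTES 2048u  /* assumed upper bound for one packet */

static_assert((SLOT_COUNT & (SLOT_COUNT - 1u)) == 0, "SLOT_COUNT must be a power of two");

typedef struct {
    uint32_t len;
    uint8_t  data[SLOT_BYTES];
} slot;

typedef struct {
    _Atomic uint32_t head;     /* next slot to write; advanced only by the producer */
    _Atomic uint32_t tail;     /* next slot to read; advanced only by the consumer */
    _Atomic uint64_t dropped;  /* packets rejected because the ring was full */
    slot slots[SLOT_COUNT];
} ring;

/* Map a shared region and initialize the indices. */
static ring *ring_create(void) {
    void *p = mmap(NULL, sizeof(ring), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANON, -1, 0);
    if (p == MAP_FAILED) return NULL;
    ring *r = (ring *)p;
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
    atomic_init(&r->dropped, 0);
    return r;
}

/* Producer side: never blocks. Returns false if the packet is too large
   or if the ring is full; only the latter is counted as a drop. */
static bool ring_push(ring *r, const void *pkt, uint32_t len) {
    if (len > SLOT_BYTES) return false;
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == SLOT_COUNT) {
        atomic_fetch_add_explicit(&r->dropped, 1, memory_order_relaxed);
        return false;
    }
    slot *s = &r->slots[head & (SLOT_COUNT - 1u)];
    s->len = len;
    memcpy(s->data, pkt, len);
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty.
   `out` must have room for SLOT_BYTES bytes. */
static bool ring_pop(ring *r, void *out, uint32_t *len) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;
    const slot *s = &r->slots[tail & (SLOT_COUNT - 1u)];
    *len = s->len;
    memcpy(out, s->data, s->len);
    /* Hand the slot back to the producer only after copying it out. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

With this layout the hot path needs no system call per packet, which is the main attraction over one Mach message per packet. The sketch deliberately leaves out how the consumer gets woken up when the ring goes from empty to non-empty.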
All targets can see all my header files in the project if I just import them with #import "header.h", except my unit tests.
When editing the unit test the editor says file not found and marks the line in red.
However, when running the unit tests, it compiles, it runs, and it succeeds, which would be impossible if the header wasn't found. So the header is found when building the test but it isn't found when editing the test. How can I fix that?
Explicitly listing all directories with header files as User Header Search Paths solves the problem, but why do I have to do that? I don't have to do that with other targets and apparently Xcode can find the header on its own when building the unit tests.
It seems like Xcode is not using header map files when editing unit tests, only when building them. Yet it uses header map files everywhere else in the project and according to build settings it should use them for unit tests, too (settings have the same values as for all other targets).
I'm using Xcode 12 and the same issue exists in Xcode 12.2 Beta 2.
Most of the time we can use our network system extension (packet tunnel provider) as desired: The main app creates a new manager, saves it, loads it again (otherwise it won't have the correct config for some reason) and then starts the connection bound to that manager.
But after several start/stop operations, all of a sudden it stops working. The only two error messages I see in console when trying to start a connection as described above are:
neagent NEAgentSession: failed to create the delegate
nesessionmanager <our-main-app-id>[476]: Tearing down XPC connection due to setup error: Error Domain=NEAgentErrorDomain Code=2 "(null)"
And that's it. Since I have not found any reference for the NEAgentErrorDomain, I have no idea what error 2 is supposed to tell me. Nor do I have any idea why this is happening at all.
This can only be fixed by running systemextensionsctl reset and re-installing the system extension. Then it will work again for some time until the problem repeats.
We are trying to develop a packet tunnel system extension. When we try to start our main application from within Xcode, it crashes immediately with
EXC_CRASH (Code Signature Invalid)
Looking at the Console, it says
Unsatisfied entitlements: com.apple.developer.networking.networkextension
Running
codesign -d --entitlements - ${PATH_TO_OUR_APP}
says
<dict>
<key>com.apple.application-identifier</key>
<string>${OUR_TEAM_ID}.${OUR_APP_ID}</string>
<key>com.apple.developer.networking.networkextension</key>
<array>
<string>packet-tunnel-provider-systemextension</string>
</array>
<key>com.apple.developer.system-extension.install</key>
<true/>
<key>com.apple.developer.team-identifier</key>
<string>${OUR_TEAM_ID}</string>
<key>com.apple.security.application-groups</key>
<array>
<string>${OUR_TEAM_ID}.${OUR_APP_ID}</string>
</array>
<key>com.apple.security.get-task-allow</key>
<true/>
</dict>
Which looks reasonable to us. And running
security cms -D -i ${PATH_TO_OUR_APP}/Contents/embedded.provisionprofile
says
<key>Entitlements</key>
<dict>
<key>com.apple.developer.system-extension.install</key>
<true/>
<key>com.apple.application-identifier</key>
<string>${OUR_TEAM_ID}.${OUR_APP_ID}</string>
<key>com.apple.developer.networking.networkextension</key>
<array>
<string>app-proxy-provider</string>
<string>content-filter-provider</string>
<string>packet-tunnel-provider</string>
<string>dns-proxy</string>
</array>
<key>keychain-access-groups</key>
<array>
<string>${OUR_TEAM_ID}.*</string>
</array>
<key>com.apple.developer.team-identifier</key>
<string>${OUR_TEAM_ID}</string>
<key>com.apple.developer.aps-environment</key>
<string>development</string>
<key>com.apple.developer.networking.vpn.api</key>
<array>
<string>allow-vpn</string>
</array>
</dict>
As for the system, it is running 10.15.5 (19F101) and
# csrutil status
System Integrity Protection status: disabled.
# systemextensionsctl developer
Developer mode is on
Any ideas what could be wrong?
One thing I noticed is that Xcode offers no way to select the -systemextension values for com.apple.developer.networking.networkextension; we had to manually edit the entitlements file, and now in Xcode the section "Network Extensions" has no checkbox set anymore. Also, when looking at the entitlement plist, the value reads "packet-tunnel-provider-systemextension" whereas the non-systemextension values are displayed as nice strings. We think that is because Xcode doesn't have any real support for these values yet; then again, system extensions were introduced almost a year ago, so maybe there is something wrong with our project setup?
Another thing we noticed is that the embedded provisioning profile doesn't seem to list the -systemextension variants, yet we don't know how to change that. On the developer web site we edited our profile to include "System Extensions" and "Network Extensions", and the ℹ-box says "Developer ID" distribution.
If we remove com.apple.developer.networking.networkextension from the entitlements file, the app starts okay and it can even install our system extension without any problem, but when we try to create a connection based on the system extension, this fails, as we may not interact with the Network Extension framework without the appropriate entitlement.
Finally, everything works fine if we use packet-tunnel-provider instead of packet-tunnel-provider-systemextension, but then we cannot make a Developer ID build, as Network Extensions that aren't System Extensions cannot be deployed using Developer ID. In that case it only works when starting a dev build from within Xcode, but we get the same issue when trying to start a Developer ID build on another machine. Strangely enough, notarization did work for that build.
Our System Extension implements a Network Extension by subclassing NEPacketTunnelProvider. We override the two methods startTunnelWithOptions:completionHandler: and stopTunnelWithReason:completionHandler:.
When starting the connection from the network preferences using the Connect button and then stopping it again using the Disconnect button, the start/stop methods are called as expected.
But when we do the same from our main application using startVPNTunnel() and stopVPNTunnel(), the start method is called as expected, but the stop method is never called, although the connection is actually stopped (it shows disconnected in the network preferences after that call).
What could be the reason for the stop method never being called? We need to perform cleanup in that method, and if that isn't performed, the connection cannot be started again afterwards, so this is quite a showstopper at the moment.