I have the filtering app installed on two test machines. Installation and IPC work fine, and the flow analysis block/allow logic is working really well, which is the core of the app's value proposition.
The problem I'm encountering is this: after the network filter has been installed and running on a machine for 1-3 days, it begins to intermittently hang or fail after the test machine is woken from sleep. To be more precise, the filter works very well, even across multiple sessions, sleep/wake cycles, shutdowns, and restarts. But then, on both test machines, it occasionally fails completely. Every time this has happened, it has been after the test machine was idle or sleeping for at least a few hours (usually overnight). Then, when the machine is woken back up, all network traffic flows fail. Some relevant (I think) facts:
when the failure happens, the network extension is still running: I can find the pid with ps, and it shows green and running in the Network pane of System Preferences
I can still see the filter system extension logging things to console via os_log(). I don't seem to be able to see the failing requests in real time, but I see a lot of activity in console.log, which suggests the process itself isn't stuck or hung, as far as I can tell
removing the filter extension from the network system prefs panel and re-installing/activating it from the containing app always seems to restore correct behavior
I tried attaching to the pid with lldb, which I was able to do, but I was in over my head on what to do next. I'm going to be doing some research on live debugging of running processes with lldb, because I have zero experience with this.
the containing app is not showing any unusual memory or CPU consumption in Xcode or Activity Monitor, and I don't know (yet) how to check similar vital stats for the root system extension, so I can't speak to that
the network connections that fail seem to eventually just time out, based on watching os logs in console. I can see some requests that I initiate eventually (30 seconds later or so) showing up as errors in console, with references to timeout exceptions etc. But I don't see logging of those timed-out requests in my content-filter system extension -- it's like they're hanging/timing out without ever being handed to my extension
I do an os_log for every request that I block, but these requests that just never complete also never show up in my extension's logging.
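For context, here is a minimal sketch (not my actual code) of logging every flow the extension is handed, rather than only the blocked ones, so that after a wake it's possible to tell whether handleNewFlow is still being called at all. FlowTally, shouldBlock, the FilterDataProvider class name, and the subsystem/category strings are all placeholders I made up for illustration.

```swift
import Foundation
#if canImport(NetworkExtension)
import NetworkExtension
import os.log
#endif

/// Hypothetical helper: a lock-protected running tally of flows.
final class FlowTally {
    private let lock = NSLock()
    private var seen = 0
    private var dropped = 0

    /// Records one flow and returns a summary line suitable for logging.
    func record(dropped wasDropped: Bool) -> String {
        lock.lock()
        defer { lock.unlock() }
        seen += 1
        if wasDropped { dropped += 1 }
        return "flows seen=\(seen) dropped=\(dropped)"
    }
}

#if canImport(NetworkExtension)
final class FilterDataProvider: NEFilterDataProvider {
    // Placeholder subsystem and category strings.
    private let log = OSLog(subsystem: "com.example.filter", category: "flows")
    private let tally = FlowTally()

    /// Placeholder for the real block/allow decision.
    private func shouldBlock(_ flow: NEFilterFlow) -> Bool { false }

    override func handleNewFlow(_ flow: NEFilterFlow) -> NEFilterNewFlowVerdict {
        let block = shouldBlock(flow)
        let summary = tally.record(dropped: block)
        // Log every flow on entry, allowed or dropped, with the running tally.
        os_log("%{public}@ | %{public}@", log: log, type: .default,
               summary, flow.description)
        return block ? .drop() : .allow()
    }
}
#endif
```

If the tally stops advancing after a wake while requests are timing out, that would suggest the flows are never reaching the extension, as opposed to the extension receiving them and stalling.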
what would be your first guess as far as what to troubleshoot/test given the description above?
what would be your gut feeling of what I should do first, to investigate this and hopefully get to the bottom?
could it be a memory issue? if so, can someone point me in the right direction of how to determine this and address it?
has anyone had a similar problem with long-running sysex processes like this? Is it possible there's a known issue, or some undocumented footgun I'm running into having to do with the system extension API, and not so much my own code?
I know these APIs are relatively new -- do people have production apps in the wild using the content-filter extension point? Does anyone have any experience with these processes running for a long time on host machines? Or are there not many real deployed use-cases out there yet?
the area of the app I feel the least confident in is the code around my inter-process communication. Is it possible that some error of mine in overusing or wrongly managing the IPC could manifest as the problem described above?
what happens with the content-filter sysex when the machine is asleep? Is it possible that requests queue up in some way that overwhelms the system after waking a long time later?
or is there something else about the sleep/wake cycle that would point to the likely cause of the issue?