problem with long-running content-filter network system extension

I'm still working on building a network content-filtering app. At this point, I've got the extension lifecycle activation issues working, have got some custom IPC communication working well between the containing app and the extension, and finally got test machine installation figured out, and now I'm struggling with a different sort of issue.

I have the filtering app installed on two test machines. The installation and IPC communication works fine, and the flow analysis block/allow logic is working really well, which is the core of the app's value proposition.

The problem I'm encountering is this: after the network filter has been installed and running on a machine for 1-3 days, it begins to intermittently hang/fail when after the test machine is woken up from a sleep. To be more precise, the filter works very well, even after multiple sessions, waking/sleeping, shutdown and restart. But then, on both test machines, it seems to occasionally just completely fail. Every time this has happened, it has been after the test machine was idle, or sleeping for at least a few hours (usually overnight). Then, when the machine is woken back up, all network traffic flows fail. Some relevent (I think) facts:
  • when the failure happens, the network extension still is running, I can find the pid with ps, and it shows green and running in the Network portion of System preferences

  • I can still see the filter system extension logging things to console via os_log(), although I don't seem to be able to see in realtime the requests that are failing, but I see a lot of activity in console.log, indicating that the process itself doesn't seem to be stuck or hung up, as far as I can tell...

  • removing the filter extension from the network system prefs panel and re-installing/activating it from the containing app always seems to restore correct behavior

  • I tried attaching to the pid with lldb, which I was able to do, but I was over my head on what to do next at that point, I’m going to be be doing some research on live debugging running processes with lldb, because I have zero experience with this.

  • the containing app is not showing any unusual memory or cpu consumption in xcode or activity monitor, and I don't know (yet) how to check similar vital stats for the root system extension, so I can't speak to that

  • the network connections that fail seem to eventually just time-out, based on watching os logs in console. I can see some requests that I initiate eventually (30 seconds later or so) showiong up as errors in console, with references to timeout exceptions etc. But I don't see logging of those timed out requests in my content-filter system extension -- it's like they're hanging/timing out without ever being allowed to be handled by my extension

  • i do an os_log for every request that I block, but these requests that just never complete also never show up in my extensions logging.

I guess what I'm hoping is that the description of the problem I'm having triggers some idea in someone more experienced than I am in this domain. I'm struggling to know even where to start trying to debug the issue, especially since it usually takes a day or two to manifest, and then, only after sitting asleep overnight, or for several hours at a time, so it's difficult to repro and troubleshoot. Here are some of the questions I'd love any response to...
  1. what would be your first guess as far as what to troubleshoot/test given the description above?

  2. what would be your gut feeling of what I should do first, to investigate this and hopefully get to the bottom?

  3. could it be a memory issue? if so, can someone point me in the right direction of how to determine this and address it?

  4. has anyone had a similar problem with long-running sysex processes like this? Is it possible there's a known issue, or some undocumented footgun I'm running into having to do with the system extension API, and not so much my own code?

  5. I know these api's are relatively new -- do people have production apps in the wild using the content-filter extension point? Does anyone have any experience with these processes running for a long time on host machines? Or are there not many real deployed use-cases yet out there?

  6. the area of the app I feel the least confident in is the code around my inter-process-communication. Is it possible that some error of mine with overusing or wrongly managing the IPC communication could manifest as the problem above described?

  7. what happens with the content-filter sysex when the machine is asleep? Is it possible that requests queue up in some way that overwhelms the system after waking a long time later?

  8. or is there something else about the sleep/wake cycle that would point to the likely cause of the issue?

Sorry for the super long writeup, I wanted to provide as much context to help trigger any helpful ideas from whomever might read this. I truly appreciate any input on this, thanks so much.


So, no one has replied to my plaintive cry for help yet, so I'll add a few thoughts here, for the sake of anyone else who might be struggling with similar stuff, or if someone would still be so kind as to offer any thoughts, I would greatly appreciate it.

I forgot to mention in my first writeup that in in my NEFilterDataProvider class I asked to intercept ALL traffic, not just .TCP or .UDP, I set it to .any. I mention this, because after a lot of poking through system logs, it seems that the problem had to do with DNS requests not resolving.

DNS seems like a tricky thing if you want to filter all traffic. I don't honestly care about stopping DNS requests, but I do care about some other UDP traffic, so in my provider I tried to allow all DNS traffic through by testing if the request was a UDP request to port 53 -- I let all of those pass through unblocked.

And again, the problem only manifests after waking from sleep after running well for a number of hours, so DNS things do work for a while at least. But it makes me wonder if there is something having to do with DNS cache, or some other issue I can't quite put my finger on...

I also did some more research to see if the Content Filter system extension seemed fully stable for Catalina, trying to answer my question about whether there were known problems with the framework, or if people were successfully deploying production apps with this filter. I found an article detailing that Apple had put in a set of exclusions into Catalina -- apps that were not subject to filtering. I found that interesting because I had noticed some odd that seemed like my filter never got a chance to see them. I definitely believe the responses from Apple quoted in the article saying that it wasn't some ill-intended plot to circumvent the filter secretly, but rather that there were problems getting the feature ready in time for Catalina, and so the exclusion list was a temporary workaround. But I mention it because this seems to provide fairly strong evidence that (in Catalina at least, which is the only OS version I've tested against so far), the content filter system extension might not be 100% stable or reliable. I can't post the link to the article here, but just google "apple content filter exclusion security researcher big sur catalina" and you'll find it.

I'd still be interested in any (even off-the-cuff) ideas about what to troubleshoot, or whether any of my thoughts outlined above prompt any ideas. I'm considering doing some testing on a recent version of Big Sur to see if things work better.

I'd still definitely be interested if anyone could talk about how many production apps are using this framework, especially in Catalina. Or if anyone could shed light on or had any experience with filtering ALL requests, including (necessarily) DNS requests.

I'd be happy to communicate directly with anyone working on a similar app, if anyone wants to form an informal support-group/griefshare you can reach me at jared [at] netrivet [dot] com. The app I'm working on is still experimental and for various reasons I don't feel much commercial pressure at all about it, so I'm happy to share lessons learned, and would definitely appreciate some feedback or insight from anyone else working on something similar.

So, no one has replied to my plaintive cry for help yet

Keep in mind that DevForums is an informal support channel. Matt and I do what we can to answer all the networking questions that show up here, but we have a lot of other commitments and there are limits to how much time we can spend on that.

If you need formal support, open a DTS tech support incident.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

what would be your first guess as far as what to troubleshoot/test given the description above?
what would be your gut feeling of what I should do first, to investigate this and hopefully get to the bottom?

The long running tests that you have been conducting are the right way to go about debugging something like this. Some things I would recommend:
  1. Try running these tests with isolated and combined network interfaces to see if it makes a difference. For example, try running with only Wi-Fi active, only ethernet active, and with the both combine to see if it makes a difference.

  2. Try installing these debug profiles on your macOS machine(s) (VPN - Network Extension, Network Diagnostics, Net-Diagnose, and Wi-Fi.), then reproduce the issue and trigger a sysdiagnose. Take a look at the logarchive file at the time you reproduced the issue. There should be some sort of hint or clue on failures that took place when you reproduced your issue.

  3. Regarding your mention of, " removing the filter extension from the network system prefs panel and re-installing/activating it from the containing app always seems to restore correct behavior." I worked on a very similar case to which you are describing, and one of the ways to workaround that was concluded was to deactivate the NEFilterPacketProvider on sleep in the container app and keep the NEFilterDataProvider activated still. If this fits your architecture then this could be something you could try as well.

could it be a memory issue? if so, can someone point me in the right direction of how to determine this and address it?

I am operating under the assumption that this is macOS, and if I am correct, then no, I do not think this is a memory issue.


has anyone had a similar problem with long-running sysex processes like this? Is it possible there's a known issue, or some undocumented footgun I'm running into having to do with the system extension API, and not so much my own code?

I would get a bug report down for this with the sysdiagnose that I mentioned in (2) above. That way this issue can be evaluated in closer detail. There was a bug that caused issues specifically for NEFilterPacketProvider in Catalina while connected to Wi-Fi going through sleep/wake cycles (r. 59247350), but I believe this is now resolved in Big Sur. Also, this did cause a panic and not a failure so this may not be the same bug here.

what happens with the content-filter sysex when the machine is asleep? Is it possible that requests queue up in some way that overwhelms the system after waking a long time later?
or is there something else about the sleep/wake cycle that would point to the likely cause of the issue?

Since I am assuming that you are working with Network System Extensions the system keep the extension alive during these transitions. One thing I would recommend is if your application needs to prepare for any state transitions during this time, as there is extremely limited functionality available on the system, I would recommend that you do it in the sleep / wake methods of your provider.


Matt Eaton
DTS Engineering, CoreOS
meaton3@apple.com

@meaton That all sounds really helpful, exactly the kind of hints I was hoping for. Truly appreciate it. I'll try some of these things and report back any findings that could be useful to others. 👍

@eskimo -- yup, totally understand that, which is why I was trying to partially answer my question for future googlers. thanks for taking a look.
problem with long-running content-filter network system extension
 
 
Q