How to debug com.apple.flow-divert being closed?

In response to my feedback submission, apple says that our transparent network proxy is stopping because, somehow, the file descriptor for com.apple.flow-divert is being closed. Only, they haven't (yet?) given any advice on how to debug that -- the extension is written in Swift, and by itself does not close any file descriptor. So I have no idea how I'd go about trying to debug that, let alone fix it.

Anyone have any thoughts about this?

In response to my feedback submission

What was your bug number?

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

FB11837553 -- the analysis (which, now that I know what to look for, makes sense) was that, for some reason, my extension does not have the com.apple.flow-divert socket(?) open, which of course would make it difficult for it to do any network proxying, transparent or otherwise ;).

FB11837553

Ah, OK, looking at that bug it seems that our best guess as to what’s going wrong here is that something within your process is closing a flow divert file descriptor. This is a pretty common failure mode with the Unix file system API. It generally goes down something like this:

  1. Subsystem A opens a file, getting file descriptor X.

  2. Subsystem A closes file descriptor X.

  3. Subsystem B opens a file descriptor and gets the first unused file descriptor, which just happens to be X.

  4. Subsystem A accidentally closes file descriptor X again.

At this point subsystem B fails because its file descriptor was closed out from underneath it.

Debugging problems like this is a pain. Your best option is to put a guard on file descriptor X, which causes a trap if the person closing the file descriptor doesn’t remove the guard first. This is, for example, what Core Data does [1].

However, the guard facility isn’t API and you don’t own the file descriptor, both of which make this technique challenging.

One thing you could do is run fs_usage to watch the descriptor opens and closes in your process.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] ’cause they got sick of fielding bug reports against Core Data that were caused by folks accidentally closing file descriptors out from underneath it.

The problem with that is that the extension

  • is written in Swift
  • is a network system extension
  • doesn't close anything other than flows using the close*WithError methods
  • has no idea what the flow divert file descriptor is

I can tell, using lsof, that the last one is fd 5, but that is all I can tell. Unless I missed something, there is no way to tell, in the network system extension, which file descriptor is the control one, since all of the communication is done via methods.

For those following along at home, I’ll be helping kithrup with this issue in a different context.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

As usual, Quinn was immensely helpful in the TSI (and, I assume, various other engineers -- please thank them for me, especially if I know them 😄), and determined the root cause: a flow could not have more than 128k of data pre-macOS 12.3. This doesn't seem to have been documented, and instead of returning an error in the .write method, it closed the file descriptor. ooops. In 12.3 and later, it breaks the data into chunks to keep it working; I had, based on Quinn's analysis, added code to do the same (but at 48k chunks, since I didn't know what the limit was at that point). (And I just changed the code to use #available(macos 12.3, *) to set the limit, and it's going through QA now.)

I believe the fix @kithrup mentioned has problems in some cases. The behavior I see is:

  1. All working fine and I have my TransparentProxy provider handling the all network connections (it is setup with a catch all rule).
  2. I see errors(see below) in unified log from kernel.
  3. All connections die in TransparentProxy, I see "write error: Flow is not connected" for all existing connections it was maintaining.
  4. No new "handleNewFlow" method calls, my networking is down and I can't connect to anything.
  5. I "sudo kill -9 <myNetworkExtensionPid>" and after about 10 seconds my networking is working again.

The errors from the kernel are:

2024-05-12 13:11:03.745695+0300 0x3e63b3   Error       0x0                  0      0    kernel: (250387962): mbuf_split failed: 12
2024-05-12 13:11:03.746038+0300 0x3e63b3   Error       0x0                  30596  0    com.my.networkextension: (libnetworkextension.dylib) [com.apple.networkextension:] (0): data: write error: Cannot allocate memory
2024-05-12 13:11:03.746176+0300 0x3e63b3   Default     0x0                  30596  0    com.my.networkextension: (libnetworkextension.dylib) [com.apple.networkextension:] (0): Aborting the director

I found that this "mbuf_split failed:" is from flow divert: https://opensource.apple.com/source/xnu/xnu-4570.1.46/bsd/netinet/flow_divert.c.auto.html flow_divert_send_app_data() function.

It looks like @kithrup mentioned that it tries to split the buffer but the split fails and then it kills something inside the extension(closes divert socket?) rendering networking completely dead until extension restart.

I cannot reproduce this consistently but it happens quite a lot in our product and we are scratching our heads trying to find a solution.

How to debug com.apple.flow-divert being closed?
 
 
Q