DispatchIO crashes, tips on debugging?

BUG IN CLIENT OF LIBDISPATCH: Unexpected EV_VANISHED (do not destroy random mach ports or file descriptors)

Which, ok, clear: somehow a file descriptor is being closed before DispatchIO.close() is called, yes?

Only I can't figure out where it is being closed. I am currently using change_fdguard_np() to prevent closes anywhere else, and every single place where I call Darwin.close() is preceded by another call to change_fdguard_np() and then DispatchIO.close(), e.g.:

            self.unguardSocket()
            self.readDispatcher?.close()
            Darwin.close(self.socket)
            self.socket = -1
            self.completion(self)

Can you attach crash logs? I'd like to get a better picture of what's actually going on before I offer any suggestions.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

The crashed thread is

Thread 5 Crashed:
0   libdispatch.dylib             	    0x7ff80af41e5f _dispatch_source_merge_evt.cold.1 + 24
1   libdispatch.dylib             	    0x7ff80af25bb1 _dispatch_source_merge_evt + 149
2   libdispatch.dylib             	    0x7ff80af2f9b6 _dispatch_event_loop_merge + 112
3   libdispatch.dylib             	    0x7ff80af21db3 _dispatch_workloop_worker_thread + 438
4   libsystem_pthread.dylib       	    0x7ff80b0c5fd0 _pthread_wqthread + 326
5   libsystem_pthread.dylib       	    0x7ff80b0c4f57 start_wqthread + 15

I've attached a trimmed & edited copy of the full crash log (but not the entire .ips file).

So, yay, I can reproduce it in my test code. I used fdguard (with CLOSE, not DUP), never unguard the file descriptors, and it still gets the EV_VANISHED crash.

The basic flow of the code is:

  • Create a socketpair
  • Create a DispatchIO using socket[0]
  • Create a FileHandle using socket[1]
  • Do work
  • When done, DispatchIO.close() and FileHandle.close()
  • Also close the sockets using Darwin.close().
  • CRASH!

The problematical one seems to be the file descriptor for the DispatchIO.

Accepted Answer

  • Create a socketpair
  • Create a DispatchIO using socket[0]
  • Create a FileHandle using socket[1]
  • Do work
  • When done, DispatchIO.close() and FileHandle.close()
  • Also close the sockets using Darwin.close().

Did you immediately issue the "Darwin.close"? If so, then I believe that's a programming error. This isn't very clear in the Swift documentation, but "close" doesn't mean that the system is done with your fd. That doesn't occur until your cleanup handler is called:

"If the existing channel is associated with a file descriptor, the system maintains control over the file descriptor until the new channel is also closed, an error occurs on the file descriptor, or all references to channels tied to that file descriptor are released. When the file descriptor is released, the cleanup_handler block is enqueued on the specified queue and the system relinquishes control over the file descriptor."

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
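
For readers following along, here is a minimal sketch of the ordering described above: the descriptor is closed inside the cleanup handler rather than right after DispatchIO.close(). It assumes macOS and a plain command-line main.swift; the queue label and variable names are made up for illustration and are not from the original code.

```swift
import Foundation

// A connected socket pair: fds[0] goes to DispatchIO, fds[1] is the peer.
var fds: [Int32] = [-1, -1]
precondition(socketpair(AF_UNIX, SOCK_STREAM, 0, &fds) == 0, "socketpair failed")
let ioFD = fds[0]
let peerFD = fds[1]

let queue = DispatchQueue(label: "example.dispatchio")  // hypothetical label
let cleanedUp = DispatchSemaphore(value: 0)
var cleanupError: Int32 = -1

// The cleanup handler is enqueued once the system has relinquished the
// descriptor, so this is where the descriptor gets closed.
let channel = DispatchIO(type: .stream, fileDescriptor: ioFD, queue: queue) { error in
    cleanupError = error
    close(ioFD)
    cleanedUp.signal()
}

// ... do work with the channel here ...

// Ask for the channel to be closed. The descriptor is NOT ours again yet,
// so there is deliberately no close(ioFD) on the next line.
channel.close()

// Wait for the cleanup handler before tearing down anything else.
cleanedUp.wait()
close(peerFD)
print("cleanup handler ran with error \(cleanupError)")
```

Blocking on a semaphore is only there to keep the sketch short; in real code the follow-up work would more likely be scheduled from the cleanup handler itself.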

Ok, I am going to say that is horribly unclear then -- I read it completely differently, and "the file descriptor should not be closed by the application until the cleanup handler is called" would have made all the difference... 😄

And now I can't get it to call the cleanup handler at all, sigh.

Ok, using that information, I was able to reproduce it in a standalone program, which helped me track it down. (And then I had to deal with the cleanupHandler not being called, but I found that out and got it done.)

Now I'm sometimes getting a crash:

fileHandle?.readabilityHandler = { fh in
    // availableData does not throw; it raises an Objective-C exception instead
    let data = fh.availableData
    print("Got \(data.count) bytes")
}

*** Terminating app due to uncaught exception 'NSFileHandleOperationException', reason: '*** -[NSConcreteFileHandle availableData]: Bad file descriptor'

This only happens sometimes, in my tests.

Now I'm sometimes getting a crash:

Sigh... I suspected you might run into issues here next, but I didn't have time to write up the full set of them.

So, the first question I have here is "What are you actually trying to do?". Frankly, choosing to use both FileHandle and DispatchIO for "parallel" operation is a bit odd, and using both correctly is MUCH trickier than it might appear. The issue here is that while the two APIs provide similar functionality, they were designed and built at totally different times (>15 years apart) with totally different expectations (for example, run loops vs. GCD queues). While it's possible to use both of them, using both of them correctly in the same context would require a fairly deep knowledge of the exact implementation of both.

Finally, and this is probably the biggest concern, why are you "bothering" with both? Using both APIs means that you're creating two completely different implementations to solve the same underlying issue, opening yourself up to a lot more bugs without a lot of "upside".

In terms of the specific issue here:

*** Terminating app due to uncaught exception 'NSFileHandleOperationException', reason: '*** -[NSConcreteFileHandle availableData]: Bad file descriptor'

NSFileHandle has the same requirement as DispatchIO ("you can't close the fd until the API is completely done with it"), except its implementation makes that trickier to satisfy. In particular:

- The API is inherently "racy". You can't coordinate your own close with FileHandle internals, so it's entirely possible to call "close" while your readabilityHandler is being called (or is about to be).

- You can tell FileHandle not to close the descriptor on dealloc, but that doesn't actually make things "safer". If YOU close the fd before FileHandle is completely "done", then you can create exactly this crash. Practically speaking, I think using this option would only really make sense if there was some other architecture in place that meant you weren't going to close the fd until LONG after FileHandle was "gone".

- Note that "close after the FileHandle is gone" can be much harder than it sounds, particularly if GCD is added into the mix. In particular, Foundation (and AppKit/UIKit) routinely autorelease objects, and that means they regularly "leak" object references into GCD queues, meaning you no longer control the object's lifetime.

In concrete terms, with code like this:

queue.async {
	let handle = FileHandle(fileDescriptor: fd)

}

...the assumption that "handle" will be destroyed at the end of the block is NOT necessarily true. FileHandle autoreleases, which means it won't actually be destroyed until some "later" point when the queue drains its own autorelease pool.

If you're going to use FileHandle, then my recommendation would be that you pass ownership of the fd over to it and let it close it, at which point all you need to do is be sure that the FileHandle object itself isn't leaked. However, my actual recommendation is that you pick which API you like better (presumably, DispatchIO) and ONLY use that API.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
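
A small sketch of the "hand ownership to FileHandle" recommendation above, assuming macOS and a command-line main.swift. Wrapping the block body in autoreleasepool is an addition for illustration, not something the reply prescribes; it is one common way to bound the lifetime of autoreleased references on a queue. All names are hypothetical.

```swift
import Foundation

var fds: [Int32] = [-1, -1]
precondition(socketpair(AF_UNIX, SOCK_STREAM, 0, &fds) == 0, "socketpair failed")
let readFD = fds[0]
let handleFD = fds[1]

let queue = DispatchQueue(label: "example.filehandle")  // hypothetical label
queue.sync {
    // Drain autoreleased references here instead of whenever the queue
    // drains its own pool.
    autoreleasepool {
        // closeOnDealloc: true hands ownership of handleFD to FileHandle.
        // From this point on, nothing else in the program closes handleFD.
        let handle = FileHandle(fileDescriptor: handleFD, closeOnDealloc: true)
        handle.write(Data("ping".utf8))
    }
}

// The bytes written through the handle are readable on the other socket.
var buffer = [UInt8](repeating: 0, count: 4)
let count = read(readFD, &buffer, buffer.count)
close(readFD)   // readFD was never given to FileHandle, so it is closed here
print("read \(count) bytes")
```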

Most of this particular session has been in order to come up with a way to have tests for our transparent proxy provider that can be run without using the main extension -- so I rolled the networking code out, wrote a mimic of NEAppProxyFlow, and then wrote tests for it. It uses socket I/O to communicate with each side ("app" and "TPP"). Internally, it uses DispatchIO to handle data sent from the "app", while it exposes the "TPP" side as a FileHandle (and it does that so that I could, if I needed to, pass it over XPC).

This isn't the only way I could have implemented it, admittedly, but it grew out of my first approach, which was to use kqueue.

(Oh, and I got the FileHandle issue solved -- the class behaves super badly if it gets closed more than once, so I isolated that code into a function, turned the object into an optional, and set it to nil after closing it.)

Since I don't think I'll end up using XPC to pass the FileHandle around, I can probably turn it all into DispatchIO. I hadn't primarily because I kept getting crashes, and/or leaked file descriptors, but now that I can get the DispatchIO part working without crashes, it's time to revisit it.
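
The close-once fix mentioned above (isolate the close in one function, keep the handle as an optional, set it to nil afterwards) might look roughly like this. It is a sketch only: the class and method names are invented, it assumes macOS 10.15 or later for FileHandle.close(), and it assumes all calls come from a single thread or serial queue.

```swift
import Foundation

// Hypothetical wrapper: the only place in the program that closes the handle.
final class PeerEndpoint {
    private var handle: FileHandle?
    private(set) var closeCount = 0

    init(fileDescriptor: Int32) {
        // FileHandle owns the descriptor, so nothing else should close it.
        handle = FileHandle(fileDescriptor: fileDescriptor, closeOnDealloc: true)
    }

    // Safe to call any number of times; only the first call does anything.
    func closeIfNeeded() {
        guard let current = handle else { return }
        handle = nil                       // later calls become no-ops
        current.readabilityHandler = nil   // stop callbacks before closing
        try? current.close()
        closeCount += 1
    }
}

var fds: [Int32] = [-1, -1]
precondition(socketpair(AF_UNIX, SOCK_STREAM, 0, &fds) == 0, "socketpair failed")

let endpoint = PeerEndpoint(fileDescriptor: fds[1])
endpoint.closeIfNeeded()
endpoint.closeIfNeeded()   // second call is ignored
close(fds[0])
print("handle closed \(endpoint.closeCount) time(s)")
```

This does not remove the race with an in-flight readabilityHandler described earlier in the thread; it only guards against closing the same handle twice.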

Most of this particular session has been in order to come up with a way to have tests for our transparent proxy provider that can be run without using the main extension -- so I rolled the networking code out, wrote a mimic of NEAppProxyFlow, and then wrote tests for it.

OK, that makes me feel a lot better about all of this. Most of the issues you'd have here (beyond, "it's crashing") are caused by complicated differences and interactions between dispatch queues and run loops. Those tend to cause issues with extended/unexpected object lifetimes and/or leaks, which can be a huge problem in long running apps, but much less of an issue in shorter running tests that you're not going to sell to anyone.

It uses socket I/O to communicate with each side ("app" and "TPP"). Internally, it uses DispatchIO to handle data sent from the "app", while it exposes the "TPP" side as a FileHandle (and it does that so that I could, if I needed to, pass it over XPC).

FYI, you can also pass dispatch objects across XPC but it would be more complicated than using FileHandle/NSXPCConnection integration. This is another case of the kind of shortcut that's reasonable in a test tool but probably a bad idea in shipping code.

(Oh, and I got the FileHandle issue solved -- the class behaves super badly if it gets closed more than once, so I isolated that code into a function, turned the object into an optional, and set it to nil after closing it.)

Yep, I can see that. Again, keep in mind that FileHandle is ANCIENT: it came from NeXTSTEP, so it's older than Mac OS X. A lot of things have changed since then, but it's also the kind of class where changing any of its implementation details can be risky.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Yeah, based on all of this, I think I'm going to take a stab at refactoring all of it to just use DispatchIO. If I end up choosing to pass it over XPC, I could just use the FileHandle and turn it into a DispatchIO on the other side.
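
For anyone wanting to try the "receive a FileHandle, turn it into a DispatchIO" idea, here is one possible sketch. It assumes macOS and a command-line main.swift, and the FileHandle is created locally as a stand-in for one that arrived over XPC. Duplicating the descriptor with dup() is an assumption made here so that the handle and the channel never share ownership of the same fd; all names are hypothetical.

```swift
import Foundation

var fds: [Int32] = [-1, -1]
precondition(socketpair(AF_UNIX, SOCK_STREAM, 0, &fds) == 0, "socketpair failed")
let senderFD = fds[0]

// Stand-in for a FileHandle received from somewhere else (e.g. over XPC).
let received = FileHandle(fileDescriptor: fds[1], closeOnDealloc: true)

// Give the channel its own descriptor, then let the handle close its copy.
let channelFD = dup(received.fileDescriptor)
precondition(channelFD >= 0, "dup failed")
try? received.close()

let queue = DispatchQueue(label: "example.convert")  // hypothetical label
let finished = DispatchSemaphore(value: 0)
var bytes: [UInt8] = []

// The channel's descriptor is closed only in the cleanup handler.
let channel = DispatchIO(type: .stream, fileDescriptor: channelFD, queue: queue) { _ in
    close(channelFD)
    finished.signal()
}

let message = Array("hello".utf8)
precondition(write(senderFD, message, message.count) == message.count, "write failed")

// The handler may run several times; collect chunks until done is true.
channel.read(offset: 0, length: message.count, queue: queue) { done, data, _ in
    if let data = data { bytes.append(contentsOf: data) }
    if done { channel.close() }
}

finished.wait()
close(senderFD)
print(String(decoding: bytes, as: UTF8.self))
```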
