NetworkExtension: UserEventAgent crash on network change

Question

Created Aug ’20

Replies 7

Boosts 0

Views 2.1k

Participants 3

On rare occasions when changing network on a machine with network extension installed (e.g. when disconnecting laptop from Docking station and connecting to Wi-Fi), Network freezes, and a few seconds later machine crashes.

On console we can see that there was a crash in UserEventAgent:

Code Block languageTermination Reason:    WATCHDOG, [0x1] monitoring timed out for service
Termination Details:   WATCHDOG, remoted connection down, inducing a crash

And on system.log we can see:

Code Block languageUserEventAgent[9151]: assertion failed: 19G2021: com.apple.bonjour + 6922 [54D5EA10-98BA-3AD4-94C3-182BB3700419]: 0xfffffffffffeffe0

(Which if I understand correctly, is kDNSServiceErr_Timeout).

I would have thought it indicated some issue with our Network Extension, but when I got to sample the extension when network was frozen, I could see the following stack:

Code Block languageCall graph:
    2833 Thread_159027: Main Thread   DispatchQueue_<multiple>
    + 2833 _dispatch_sig_thread  (in libdispatch.dylib) + 53  [0x7fff71ec6476]
    +   2833 _dispatch_sigsuspend  (in libdispatch.dylib) + 36  [0x7fff71ec649a]
    +     2833 sigsuspend_nocancel  (in libsystem_kernel.dylib) + 10  [0x7fff72056502]
    2833 Thread_170723
    + 2833 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff7210fb8b]
    +   2833 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff72114109]
    +     2833 Poco::ThreadImpl::runnableEntry(void*)  (in libPocoFoundation.60.dylib) + 94  [0x10327b2fe]
    +       2833 Poco::ActiveDispatcher::run()  (in libPocoFoundation.60.dylib) + 72  [0x10327f598]
    +         2833 Poco::NotificationQueue::waitDequeueNotification()  (in libPocoFoundation.60.dylib) + 392  [0x103243e48]
    +           2833 Poco::EventImpl::waitImpl()  (in libPocoFoundation.60.dylib) + 70  [0x10320c8a6]
    +             2833 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff72114425]
    +               2833 psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff72053882]
    2833 Thread_180423
    + 2833 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff7210fb8b]
    +   2833 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff72114109]
    +     2833 Poco::ThreadImpl::runnableEntry(void*)  (in libPocoFoundation.60.dylib) + 94  [0x10327b2fe]
    +       2833 Poco::ActiveDispatcher::run()  (in libPocoFoundation.60.dylib) + 72  [0x10327f598]
    +         2833 Poco::NotificationQueue::waitDequeueNotification()  (in libPocoFoundation.60.dylib) + 392  [0x103243e48]
    +           2833 Poco::EventImpl::waitImpl()  (in libPocoFoundation.60.dylib) + 70  [0x10320c8a6]
    +             2833 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff72114425]
    +               2833 psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff72053882]
    2833 Thread_209573   DispatchQueue_59: NEFilterExtensionProviderContext queue  (serial)
    + 2833 start_wqthread  (in libsystem_pthread.dylib) + 15  [0x7fff7210fb77]
    +   2833 _pthread_wqthread  (in libsystem_pthread.dylib) + 290  [0x7fff72110a3d]
    +     2833 _dispatch_workloop_worker_thread  (in libdispatch.dylib) + 596  [0x7fff71ec5c09]
    +       2833 _dispatch_lane_invoke  (in libdispatch.dylib) + 363  [0x7fff71ebc5d6]
    +         2833 _dispatch_lane_serial_drain  (in libdispatch.dylib) + 263  [0x7fff71ebbaf6]
    +           2833 _dispatch_source_invoke  (in libdispatch.dylib) + 2084  [0x7fff71ec84be]
    +             2833 _dispatch_continuation_pop  (in libdispatch.dylib) + 414  [0x7fff71eb8818]
    +               2833 _dispatch_client_callout  (in libdispatch.dylib) + 8  [0x7fff71eb6658]
    +                 2833 -[NEFilterDataExtensionProviderContext handleSocketSourceEventWithSocket:]  (in NetworkExtension) + 1115  [0x7fff3e92e818]
    +                   2833 -[NEFilterDataExtensionProviderContext handleDataCompleteForFlow:direction:reply:controlSocket:completionHandler:]  (in NetworkExtension) + 381  [0x7fff3e92bf09]
    +                     2833 -[NEFilterDataSavedMessageHandler enqueueWithFlow:context:]  (in NetworkExtension) + 183  [0x7fff3e93085c]
    +                       2833 -[NEFilterDataSavedMessageHandler executeWithFlow:context:]  (in NetworkExtension) + 294  [0x7fff3e9309a4]
    +                         2833 -[NEFilterDataSavedMessageHandler executeVerdictHandlerWithFlow:verdict:context:]  (in NetworkExtension) + 305  [0x7fff3e930b83]
    +                           2833 114-[NEFilterDataExtensionProviderContext handleDataCompleteForFlow:direction:reply:controlSocket:completionHandler:]_block_invoke.312  (in NetworkExtension) + 133  [0x7fff3e92c03d]
    +                             2833 -[NEFilterSocketFlow createDataCompleteReply:controlSocket:direction:verdict:context:]  (in NetworkExtension) + 174  [0x7fff3e9362dc]
    +                               2833 -[NEFilterSocketFlow createDataReply:controlSocket:direction:verdict:context:]  (in NetworkExtension) + 109  [0x7fff3e936215]
    +                                 2833 -[NEFilterSocketFlow writeCurrentVerdictWithMessage:controlSocket:]  (in NetworkExtension) + 366  [0x7fff3e936479]
    +                                   2833 +[NEFilterSocketFlow writeMessageWithControlSocket:drop:socketID:inboundPassOffset:inboundPeekOffset:outboundPassOffset:outboundPeekOffset:statsReportFrequency:]  (in NetworkExtension) + 158  [0x7fff3e9368eb]
    +                                     2833 write  (in libsystem_kernel.dylib) + 10  [0x7fff72053bf6]
    2833 Thread_210045
      2833 start_wqthread  (in libsystem_pthread.dylib) + 15  [0x7fff7210fb77]
        2833 _pthread_wqthread  (in libsystem_pthread.dylib) + 390  [0x7fff72110aa1]
          2833 __workq_kernreturn  (in libsystem_kernel.dylib) + 10  [0x7fff720524ce]

Which if I understand correctly, indicates extension callbacks weren't even called in the time of the sample.

I also see that memory for network extension grows through time, which might indicate some kind of leak.

My question is - is that a known issue for network extension? How can I debug network extension with instruments? It crashes immediately when I try to attach to it.

Thanks!

Boost

Answer 1

Systems Engineer OP

Apple

Aug ’20

Based on your spindump it looks like you may be seeing the rise in memory due to the bytes that could be getting built up in your write buffer while waiting for an available path to flush this buffer out again. For example:

Code Block 2833 -[NEFilterSocketFlow writeCurrentVerdictWithMessage:controlSocket:]  (in NetworkExtension) + 366  [0x7fff3e936479]
  2833 +[NEFilterSocketFlow writeMessageWithControlSocket:drop:socketID:inboundPassOffset:inboundPeekOffset:outboundPassOffset:outboundPeekOffset:statsReportFrequency:]  (in NetworkExtension) + 158  [0x7fff3e9368eb]
    2833 write  (in libsystem_kernel.dylib) + 10  [0x7fff72053bf6]

Typically you could attach Instruments to the active process for the Network Extension, but if the process dies this will terminate your profiling session. In terms of debugging memory issues, Quinn has an excellent post on the Standard Memory Debugging tools at your disposal.

Matt Eaton
DTS Engineering, CoreOS
meaton3@apple.com

0

Answer 2

obenshalom OP

Aug ’20

So what might cause the crash, if network extension is not even in the stack trace?

Also, we have been seeing a lot of issues when trying to attach to signed processes with Instruments (even when signed for debug), is there a way around it?

0

Answer 3

Systems Engineer OP

Apple

Sep ’20

So what might cause the crash, if network extension is not even in the stack trace?

The Network Extension is in the stack trace, it looks like you are using a Network Extension Content Filter to filter sockets, or a NEFilterDataProvider. The cause of the crash looks to be a WATCHDOG termination.

Code Block Termination Reason:    WATCHDOG, [0x1] monitoring timed out for service
Termination Details:   WATCHDOG, remoted connection down, inducing a crash

The termination details indicate that a remote XPC connection went down somewhere between macOS and possibly bridgeOS which caused a timeout, then followed by this termination.

Also, we have been seeing a lot of issues when trying to attach to signed processes with Instruments (even when signed for debug), is there a way around it?

One way I have found to do this is by starting the Network System Extension, then from Xcode opening up Instruments and then trying to attach to the active process for the Network Extension. If this does not work, you can also attempt to log stream any further logs that may be helpfule. For example:
% log stream --level debug --predicate 'process == "myprocessname"'

Matt Eaton
DTS Engineering, CoreOS
meaton3@apple.com

0

Answer 4

obenshalom OP

Sep ’20

Hi Matt,
This crash doesn't sound Extension related, and yet it happens only when extension is installed.

Also - I am struggling with attaching instruments to Network Extension, As I keep getting:

Code Block Unable to acquire required task port

I am compiling the extension in debug mode with get-task-allow entitlement, and disabled hardened-runtime. It is signed with development provisioning profile. is there anything else I am missing?

0

Answer 5

Systems Engineer OP

Apple

Sep ’20

Also - I am struggling with attaching instruments to Network Extension

Yeah, this can be difficult. If you are running the Network System Extension from the /Applications folder the best way to do this is to start Instruments and look for the Network Extension process in the running or system processes lists and attach to it from there.

If that does not work you could try using log stream or attaching to the container app in instruments for more clues. If all else fails you could try turning developer mode on and running this from Xcode and see if that allows Instruments to find your Network System Extension process.

Matt Eaton
DTS Engineering, CoreOS
meaton3@apple.com

0

Answer 6

obenshalom OP

Nov ’20

Hi,

Those crashes still occur, repeatedly. It does not matter if we reduce functionality for FilterDataProvider, or even not collect statistics for UserEventAgent (in this case we receive crashes on remoted daemon, which causes network to halt, and in a few minutes the machine to crash).

With the attached spindump from the moment of the crash, it seems very likely this is an OS issue (kernel deadlock). Is this a known issue? Is there a workaround we can manage?
We have several hundreds of machines with Network Extension installed, that repeatedly crash.

Attached is a spindump from the moment of the crash:

spindump

Panic log:

panic.log

Just before remoted crashed I can see in system.log:

Code Block bashNov 26 15:19:21 TLVMAC62Y1JGH6 Console[37400]: BUG in libdispatch client: vnode, monitored resource vanished before the source cancel handler was invoked { 0x600000fc2c00[source], ident: 6 / 0x6, handler: 0x7fff6f54dc0c }

Also opened ticket: FB8906238, But solution I was given (to upgrade to 11.1 beta version) is of course not a possible solution for customers.

0

Answer 7

Shay39 OP

Dec ’20

Hi Obenshalom,

I also run into some similar kernel hang/panic. When inspecting the spindump captured before panic, I found 50+ processes were blocked in the "sosend". I resolved these symbols to find that, the kernel might had exhausted mbuf(s).

I have no idea how to reproduce this issue, and I didn't collect memory footprint this time.

Could you check my spindump to see if there are any similarities?

Thanks
Shay

spindump.txt

0