We have been seeing a mysterious crash in our media server app that I've never encountered before. After fixing a number of other rare thread-safety crashes relating to Metal buffers, we are left with this rare crash, which happens on Metal's com.Metal.CompletionQueueDispatch queue.
I have no clue what is happening here. The backtrace ends in std::terminate() and abort(), called directly from a libdispatch callout, so it looks to me like something on Metal's queue is deliberately aborting the process.
All of the other threads in the crash log appear to be in a normal state.
Thread 70 Crashed:: updateAllMedia Dispatch queue: com.Metal.CompletionQueueDispatch
0 libsystem_kernel.dylib 0x1af572d38 __pthread_kill + 8
1 libsystem_pthread.dylib 0x1af5a7ee0 pthread_kill + 288
2 libsystem_c.dylib 0x1af4e2330 abort + 168
3 libc++abi.dylib 0x1af562b18 abort_message + 132
4 libc++abi.dylib 0x1af552a3c demangling_terminate_handler() + 312
5 libobjc.A.dylib 0x1af4481c8 _objc_terminate() + 160
6 libc++abi.dylib 0x1af561eb4 std::__terminate(void (*)()) + 20
7 libc++abi.dylib 0x1af561e50 std::terminate() + 64
8 libdispatch.dylib 0x1af3e4288 _dispatch_client_callout4 + 40
9 libdispatch.dylib 0x1af40053c _dispatch_mach_msg_invoke + 464
10 libdispatch.dylib 0x1af3eb784 _dispatch_lane_serial_drain + 376
11 libdispatch.dylib 0x1af40125c _dispatch_mach_invoke + 456
12 libdispatch.dylib 0x1af3eb784 _dispatch_lane_serial_drain + 376
13 libdispatch.dylib 0x1af3ec438 _dispatch_lane_invoke + 444
14 libdispatch.dylib 0x1af3eb784 _dispatch_lane_serial_drain + 376
15 libdispatch.dylib 0x1af3ec404 _dispatch_lane_invoke + 392
16 libdispatch.dylib 0x1af3f6c98 _dispatch_workloop_worker_thread + 648
17 libsystem_pthread.dylib 0x1af5a4360 _pthread_wqthread + 288
18 libsystem_pthread.dylib 0x1af5a3080 start_wqthread + 8
Note that the thread name "updateAllMedia" is a misnomer because this thread appears to be a general Metal dispatch queue. I wish there was a debugging option in Metal that called "setThreadName" to name its internal threads.
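Frames 5 through 8 (_objc_terminate and std::terminate directly under _dispatch_client_callout4) would be consistent with an exception escaping a block that runs on this queue, though the log alone does not prove that. A minimal sketch of one way to test that theory, assuming the blocks in question are our own command buffer handlers: wrap each handler body so an Objective-C exception gets logged instead of propagating. The helper name AddGuardedCompletedHandler is hypothetical, and this does not catch C++ exceptions.

```objc
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

// Hypothetical helper (not a Metal API): attaches a completed handler whose
// body cannot let an Objective-C exception escape onto Metal's queue.
static void AddGuardedCompletedHandler(id<MTLCommandBuffer> cmdbuf,
                                       void (^body)(id<MTLCommandBuffer> cb))
{
    [cmdbuf addCompletedHandler:^(id<MTLCommandBuffer> cb) {
        @try {
            body(cb);
        } @catch (NSException *e) {
            // Log enough to identify the offending handler, then swallow the
            // exception so it never reaches the libdispatch callout.
            NSLog(@"Completed handler threw %@: %@\n%@",
                  e.name, e.reason, e.callStackSymbols);
        }
    }];
}
```

If the crash stops and the log line starts appearing, the abort was coming from our own handler code rather than from Metal itself.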
We have a production Metal app with a complex multithreaded Metal pipeline.
When everything is operating smoothly, it works great.
Even when extremely overloaded, it does not crash for days at a time.
This isn't good enough for our users.
Unfortunately, since I have zero visibility into an id, I have no way of knowing when Metal is "done" with it.
When overloaded, stale Metal render passes need to be "aborted", which can result in Metal callbacks not being called.
For example, these callbacks may not be called after an aborted pass:
id<MTLCommandBuffer> m_cmdbuf;
[m_cmdbuf addScheduledHandler:^(id<MTLCommandBuffer> cb) {
    cpr->scheduled = MachAbsoluteTime();
}];
[m_cmdbuf addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    cpr->completed = MachAbsoluteTime();
}];
For the moment, our workaround is a system which waits a few seconds after we "think" a rendering pass should be done with all its (aborted) resources before releasing buffers. This is not ideal, to say the least.
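A minimal sketch of that kind of grace-period release, assuming ARC and assuming the resources can be held as id<MTLResource> (which MTLBuffer and MTLTexture both conform to). The function name and the choice of a utility-QoS global queue are hypothetical, not our actual implementation.

```objc
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

// Hypothetical helper: keeps `resources` alive for `graceSeconds` after a
// pass has been aborted, then lets ARC release them.
static void ReleaseAfterGracePeriod(NSArray<id<MTLResource>> *resources,
                                    NSTimeInterval graceSeconds)
{
    dispatch_time_t when =
        dispatch_time(DISPATCH_TIME_NOW, (int64_t)(graceSeconds * NSEC_PER_SEC));
    dispatch_after(when, dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
        // Referencing `resources` makes the block capture it strongly. The
        // array (and the resources in it) is released when the block is
        // destroyed after running, provided the caller has already dropped
        // its own references.
        (void)resources;
    });
}
```

The weakness is the same as described above: the delay is a guess, not a guarantee that Metal has finished with the resources.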
So, in summary, my question is: is there a way to "query" an id to know when Metal is done with it, so that we know it's safe to release it along with our own internal resources?
Is there any such (undocumented) mechanism? I have exhaustively read all existing Metal documentation many times.
An idea that I've been toying with... it would be nice to have something akin to Zombie detection running all the time for id only.
In OpenGL, it was OK to use a released texture: you might display a bad frame, but you would not CRASH! Is there any similar option for id?