FB9108925
FB10408005
Since the move to Apple Silicon we've seen a lot of WebDAV instability in macOS 11.x, 12.x and now 13.x that doesn't occur on x86 Macs. Some issues were fixed in earlier minor OS updates (e.g. webdavfs-387.100.1, which added a missing mutex init), but WebDAV is still highly unreliable. The purpose of this post is to put more focus on the bug, see if there is anything else we can do to help solve it, and hear about potential workarounds from other people experiencing the same problems.
I've got a reproducible case, described below, that triggers a deadlock in VFS every time and requires a hard reboot to fully recover. Before rebooting I captured this stack trace showing the WebDAV/VFS/UBC/VM layers getting tangled up (macOS 13.2 Build 22D49 running on Macmini9,1):
Thread 0x16358 1001 samples (1-1001) priority 46 (base 31)
1001 thread_start + 8 (libsystem_pthread.dylib + 7724) [0x18d2e0e2c]
1001 _pthread_start + 148 (libsystem_pthread.dylib + 28780) [0x18d2e606c]
1001 ??? (diskarbitrationd + 99400) [0x100d5c448]
1001 unmount + 8 (libsystem_kernel.dylib + 55056) [0x18d2b2710]
*1001 ??? (kernel.release.t8103 + 30712) [0xfffffe00083437f8]
*1001 ??? (kernel.release.t8103 + 1775524) [0xfffffe00084ed7a4]
*1001 ??? (kernel.release.t8103 + 7081508) [0xfffffe00089fce24]
*1001 ??? (kernel.release.t8103 + 2522264) [0xfffffe00085a3c98]
*1001 ??? (kernel.release.t8103 + 2523168) [0xfffffe00085a4020]
*1001 vnode_iterate + 728 (kernel.release.t8103 + 2410988) [0xfffffe00085889ec]
*1001 ??? (kernel.release.t8103 + 6095404) [0xfffffe000890c22c]
*1001 ??? (kernel.release.t8103 + 1097172) [0xfffffe0008447dd4]
*1001 ??? (kernel.release.t8103 + 1100852) [0xfffffe0008448c34]
*1001 ??? (kernel.release.t8103 + 1024804) [0xfffffe0008436324]
*1001 ??? (kernel.release.t8103 + 1025092) [0xfffffe0008436444]
*1001 ??? (kernel.release.t8103 + 6497736) [0xfffffe000896e5c8]
*1001 ??? (kernel.release.t8103 + 2705840) [0xfffffe00085d09b0]
*1001 webdav_vnop_pageout + 432 (com.apple.filesystems.webdav + 16920) [0xfffffe000b2db7b8]
*1001 webdav_vnop_close + 64 (com.apple.filesystems.webdav + 9492) [0xfffffe000b2d9ab4]
*1001 webdav_vnop_close_locked + 96 (com.apple.filesystems.webdav + 19708) [0xfffffe000b2dc29c]
*1001 webdav_close_mnomap + 264 (com.apple.filesystems.webdav + 20004) [0xfffffe000b2dc3c4]
*1001 webdav_fsync + 404 (com.apple.filesystems.webdav + 20484) [0xfffffe000b2dc5a4]
*1001 ubc_msync + 184 (kernel.release.t8103 + 6096856) [0xfffffe000890c7d8]
*1001 ??? (kernel.release.t8103 + 1097172) [0xfffffe0008447dd4]
*1001 ??? (kernel.release.t8103 + 1100728) [0xfffffe0008448bb8]
*1001 lck_rw_sleep + 136 (kernel.release.t8103 + 505804) [0xfffffe00083b77cc]
*1001 ??? (kernel.release.t8103 + 607656) [0xfffffe00083d05a8]
*1001 ??? (kernel.release.t8103 + 613952) [0xfffffe00083d1e40]
I've spent countless hours reading the xnu-8792.81.2 and webdavfs-392 sources trying to understand what happens. Symbols mapped back to the source code tell me it's trying to flush a dirty mmap'ed file back to the WebDAV host when the volume is about to get unmounted, but I suspect the pageout request is triggered recursively, perhaps because the mmap'ed file has shrunk and pages need to be released?
The test case:
1. Use Finder to connect to a WebDAV volume which holds a fairly large image (200 MB Photoshop file in my case).
2. Navigate to this file in column mode so Finder renders a preview (using a QuickLook process). I believe this mmap's the file, but that alone isn't sufficient, so I think the Finder tries to write an updated thumbnail back to the volume as well.
3. Click the Eject icon in the Finder to unmount the volume, which now deadlocks that file system.
In the end, something remains unreleased in the file system, since the unmount request never completes. Whether that's a vnode lock, an open-file refcount or something else, I don't know.
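For anyone who wants to probe the "dirty mmap'ed page at unmount" theory without going through Finder/QuickLook, here is a minimal sketch in Python. It is only my assumption of what the preview step leaves behind (a shared, writable mapping with an unflushed modification); I have not confirmed that it triggers the same deadlock. The function name `dirty_mapped_byte` is made up for this post.

```python
import mmap
import os


def dirty_mapped_byte(path, offset=0):
    """Map `path` read/write (shared) and flip one byte without flushing.

    Returns the open mmap object. The modified page stays dirty in the cache
    until the kernel writes it back or the caller calls flush().
    NOTE: this changes the file contents, so point it at a scratch copy.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        if not 0 <= offset < size:
            raise ValueError("offset must lie inside a non-empty file")
        # ACCESS_WRITE gives a shared mapping whose writes go to the file.
        mapping = mmap.mmap(fd, size, access=mmap.ACCESS_WRITE)
    finally:
        os.close(fd)  # on Unix the mapping keeps its own duplicate descriptor
    mapping[offset] ^= 0xFF  # dirty exactly one page, deliberately no flush()
    return mapping
```

The idea would be to call this on a scratch file on the mounted WebDAV volume, keep the returned mapping alive, and then eject the volume from Finder. Whether that is equivalent to what QuickLook does is a guess on my part.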
Now, why this deadlock is only seen on Apple Silicon is a mystery. Is Finder/QuickLook executing different code paths for generating or storing the thumbnail? Or are there yet more cases of uninitialized mutexes/locks that happen to be accidentally functional on x86 but expose a problem on Apple Silicon? I've been through a lot of kernel source code trying to find any but have come up short. But since the above is easily reproduced, I'm hoping someone with filesystem/kernel debug capability can succeed in pinpointing the bug. It's at least positive that the overall architecture works on x86, so I'm hoping it is a simple fix in the end.
The reason I'm debugging this is we've got a lot of customers running WebDAV on M1/M2 and they find Finder file copying highly unreliable (i.e. writing many files to the WebDAV server, possibly overwriting existing files; some users have reported a need to reboot 20 times a day). I'm really looking for a bug that's common to all of these tasks, not just the mmap + unmount problem which is a minimal test case that I've cooked up in the lab. The few spindumps I've seen from end users have also included the combination of webdav_vnop_pageout + webdav_fsync + ubc_msync + lck_rw_sleep even if unmounting wasn't the initial op that forced the deadlock.
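If you want to check whether your own spindumps show the same combination, a rough Python sketch like the following can scan a saved report for threads containing all four frames. It assumes the report is plain text with "Thread 0x..." header lines and frame lines shaped like the trace above; the function name and regular expressions are my own and may need adjusting for other report layouts.

```python
import re

# The frame combination seen in the affected threads.
SIGNATURE = frozenset(
    {"webdav_vnop_pageout", "webdav_fsync", "ubc_msync", "lck_rw_sleep"}
)
THREAD_HEADER = re.compile(r"^\s*Thread 0x[0-9a-fA-F]+")
# Matches e.g. "*1001 webdav_fsync + 404 (com.apple.filesystems.webdav + 20484) [...]"
# Unsymbolicated "???" frames do not match and are ignored.
FRAME = re.compile(r"^\s*\*?\d+\s+(\S+) \+ \d+ \(")


def threads_with_signature(report_text, signature=SIGNATURE):
    """Return the header line of every thread whose frames include all of `signature`."""
    signature = frozenset(signature)
    hits = []
    header, seen = None, set()
    for line in report_text.splitlines():
        if THREAD_HEADER.match(line):
            # A new thread starts: evaluate the one we just finished.
            if header is not None and signature <= seen:
                hits.append(header)
            header, seen = line.strip(), set()
            continue
        frame = FRAME.match(line)
        if frame and header is not None:
            seen.add(frame.group(1))
    if header is not None and signature <= seen:
        hits.append(header)
    return hits
```

Feeding it the contents of a report file (for example via `open(path).read()`) returns the matching thread headers, or an empty list if no thread shows the full combination. It only checks that all four symbols are present in one thread, not their order.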
This problem has been reproduced with different WebDAV server vendors, and there is a test account on a server running Apache provided in FB10408005 (though please select the PSD file, not just the tiny JPG).
Thanks in advance!