Description of current OS X watchdog infrastructure? (watchdogd, AppleOSXWatchdog.kext)

Can anyone point me to a good description of exactly how Apple's current OS X "watchdog" infrastructure works? "man watchdogd" gives some minimal information, saying that it will trigger a reboot if the kernel or user space hangs, but it doesn't describe at all *how* that's done. AFAICS, there are probably multiple parts to the current watchdog service, but one aspect I'm particularly interested in is whether Apple still retains any sort of *hardware*-based failsafe reboot service, like what was available several years ago in the separate OS X Server product. In that OS, a "watchdog" command could initialize a countdown timer in the PMU hardware, and the machine would reboot if the timer ever ran down to zero. In healthy systems, that timeout was typically avoided by a daemon, "watchdogtimerd", periodically putting more time back on the timer. If the kernel hung, though, the daemon wouldn't be able to update the PMU timer, and the timer would soon run out and cause a reboot. So does Apple's current "watchdog" infrastructure retain any of that hardware-based functionality? If not, is it any poorer for it? Notably, can it reboot automatically if the kernel hangs? (Not panics - panics are handled by an event-handling system that assumes that the kernel is still running.)


Thanks,


-- Jonathan

I have found part of the answer I was looking for. It appears that the PMU-timer-based reboot functionality is now accessible in OS X via the IOWatchDogTimer kernel class. The source code for the class is available on Apple OpenSource here (for 15C50/xnu-3248.20.55):

http://www.opensource.apple.com/source/xnu/xnu-3248.20.55/iokit/IOKit/system_management/IOWatchDogTimer.h

http://www.opensource.apple.com/source/xnu/xnu-3248.20.55/iokit/Families/IOSystemManagement/IOWatchDogTimer.cpp


An example of how to use the class can be found in the source code examples from the O'Reilly book "Hacking and Securing iOS Applications":

http://examples.oreilly.com/0636920023234/ch03/watchdog.c


In that example, the author is *disabling* the timer with a value of 0, but providing another value (in seconds) for the countdown does work, with two caveats: (1) the program needs to be run as "root" (or via "sudo"), and (2) Apple's "watchdogd" daemon needs to be disabled beforehand (via "launchctl unload") for the reboot to occur reliably. Caveat (2) suggests to me that "watchdogd" is indeed providing something of the same sort of protection from kernel hangs that "watchdogtimerd" used to provide for OS X Server: periodically putting more time on the PMU timer while the system is running normally, so that in the event of a kernel hang, the timer runs out and reboots the system. One interesting point is that if you disable watchdogd via launchctl, the system does *not* then reboot shortly afterward, so that apparently doesn't look like a kernel hang to the system. Maybe "launchctl unload" causes a bit of clean-up to be done which disables the timer - I really don't know. Another interesting point is that the timer seems to get canceled if the system is put to sleep.
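For anyone who wants to try this, here is a minimal sketch of the approach described above, modeled on the linked watchdog.c but taking the countdown as a parameter. The function names parse_seconds and set_watchdog_seconds are my own (hypothetical), and I'm assuming, based on my reading of IOWatchDogTimer.cpp, that the class accepts a plain number of seconds as its property. The IOKit part is guarded so the file still compiles on other systems, where it simply reports failure. Treat it as an untested-on-your-machine sketch, not a definitive implementation.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

#if defined(__APPLE__)
#include <CoreFoundation/CoreFoundation.h>
#include <IOKit/IOKitLib.h>
#endif

/* Hypothetical helper: parse a non-negative countdown in seconds.
 * Returns 0 on success, -1 if the text is not a plain decimal number
 * that fits in 32 bits. */
int parse_seconds(const char *text, uint32_t *out)
{
    char *end = NULL;
    unsigned long value;

    /* Require a leading digit so "", "-1", "+5" and " 5" are rejected. */
    if (text == NULL || *text < '0' || *text > '9')
        return -1;

    errno = 0;
    value = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0' || value > UINT32_MAX)
        return -1;

    *out = (uint32_t)value;
    return 0;
}

/* Hypothetical helper: hand `seconds` to the IOWatchDogTimer service.
 * Assumption: 0 disables the timer (as in the linked example) and any
 * other value starts a countdown to a forced reboot.
 * Returns 0 on success, -1 on failure. Must be run as root. */
int set_watchdog_seconds(uint32_t seconds)
{
#if defined(__APPLE__)
    io_service_t service;
    CFNumberRef number;
    kern_return_t kr;

    /* IOServiceGetMatchingService consumes the matching dictionary. */
    service = IOServiceGetMatchingService(kIOMasterPortDefault,
                                          IOServiceMatching("IOWatchDogTimer"));
    if (service == IO_OBJECT_NULL) {
        fprintf(stderr, "IOWatchDogTimer service not found\n");
        return -1;
    }

    number = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &seconds);
    if (number == NULL) {
        IOObjectRelease(service);
        return -1;
    }

    /* The number itself is passed as the "properties" object. */
    kr = IORegistryEntrySetCFProperties(service, number);
    CFRelease(number);
    IOObjectRelease(service);

    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "setting the timer failed (0x%x); not root?\n",
                (unsigned)kr);
        return -1;
    }
    return 0;
#else
    (void)seconds;
    fprintf(stderr, "IOWatchDogTimer is only available on OS X\n");
    return -1;
#endif
}
```

To use it, call parse_seconds on a command-line argument and pass the result to set_watchdog_seconds from a small main(), build with "cc file.c -framework CoreFoundation -framework IOKit", and run the result via sudo after unloading watchdogd. Be careful: if it works as described above, any non-zero value really will reboot the machine when the countdown expires.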


The findings above seem to be sufficient to implement a user-controlled fail-safe reboot - i.e., "reboot my machine in x seconds, no matter what happens in between", which is potentially useful for certain usage scenarios (e.g., testing). However, of course, you do lose other potential benefits of the Apple "watchdogd" daemon in this case, as you need to disable it. I'd still like to know more about these and other benefits of the Apple watchdog infrastructure, to know whether that's an acceptable trade-off.

Thanks for the info. I've had regular kernel panics from watchdogd as well, with no solution apparent until now.
