For participants only. Not for public distribution.

Note #39
Booting, Software and Hardware Watchdogs

 

Khian Hao Lim
Last revised December 7, 2003.

Objective

Unexpected circumstances might cause programs to reach illegal states or abort. Our software architecture is designed to detect and handle such software failures, using the software and hardware watchdogs present on the machines. The first cut at correctness is to detect that a program has become non-responsive and, in response, cause a global reset of all the machines (computers, sensors and possibly actuators).

Booting

One computer, gcrear1, is designated to run the watchdog program, and its boot sequence has been modified accordingly. After running its usual QNX boot scripts, it asks the user whether it should enter normal mode (no watchdog). If the user input times out, it proceeds into vehicle mode: it switches to the vehicle user and starts the watchdog program with the default startfile, which spawns all the programs needed for vehicle operation.
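The prompt-with-timeout step can be sketched as follows. This is only an illustration in Python, not the actual boot script. The function name, the accepted answers and the default timeout are assumptions, and it relies on select() working on the input stream, which holds on Unix-like systems. The note does not say what happens on an explicit "no", so the sketch treats anything other than a yes as vehicle mode.

```python
import select
import sys


def choose_boot_mode(stream=sys.stdin, timeout_s=5.0):
    """Return "normal" if the user asks for it in time, else "vehicle".

    Hypothetical helper: the name, the accepted answers and the timeout
    value are assumptions, not taken from the real boot sequence.
    """
    # Wait until the stream has input or the timeout expires.
    ready, _, _ = select.select([stream], [], [], timeout_s)
    if ready and stream.readline().strip().lower() in ("y", "yes"):
        return "normal"   # user explicitly asked for normal mode (no watchdog)
    return "vehicle"      # timeout or any other answer: start the watchdog
```

In vehicle mode the caller would then switch user and launch the watchdog with its default startfile.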

Detecting Non-Responsiveness of Programs and Machines

Each program spawned by the watchdog is responsible for periodically sending a heartbeat to let the watchdog know that it is working fine. If a program fails to check in on time (whether intentionally, because it aborted, or because it is stuck in an infinite loop), the watchdog assumes that the program has entered an illegal state. If communication between the machines is broken, or a machine is held up in computation, the watchdog likewise notices that some programs are not checking in, with the same outcome.
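The bookkeeping this implies on the watchdog side might look like the sketch below. It is a minimal illustration, not the real watchdog: the class and method names are invented here, and a single timeout for all programs is an assumption. The clock is injectable so the logic can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last check-in time of each registered program.

    Hypothetical sketch: names and the single shared timeout are assumptions.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}

    def register(self, name):
        # A freshly spawned program gets one full timeout to check in.
        self.last_seen[name] = self.clock()

    def heartbeat(self, name):
        # Called whenever a program checks in.
        self.last_seen[name] = self.clock()

    def overdue(self):
        # Programs whose last check-in is older than the timeout.
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)
```

Any name returned by overdue() would be treated as a program that has entered an illegal state, whatever the underlying cause.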

Rebooting

As the list of all programs shows, a monitoring program called watchpuppy runs on each of the machines and talks directly to the hardware watchdog on that machine. This small program starts by initializing the hardware watchdog with a suitable timeout. It then periodically sends heartbeats to the software watchdog. Whenever a heartbeat goes through, watchpuppy "tickles" the hardware watchdog, ensuring that the machine does not reboot. If watchpuppy fails to perform this "tickle" on time, the hardware watchdog reboots the machine.
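One cycle of this loop can be sketched as below. The hardware watchdog is simulated, since the real one is reached through machine-specific hardware access that this note does not describe. All names here are hypothetical, and treating a communication error (OSError) as a failed heartbeat is an assumption.

```python
class SimulatedHardwareWatchdog:
    """Stand-in for a hardware watchdog timer (assumption, for illustration)."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s   # initialised with a suitable timeout

    def tickle(self, now):
        # Push the reboot deadline one full timeout into the future.
        self.deadline = now + self.timeout_s

    def would_reboot(self, now):
        return now >= self.deadline


def watchpuppy_step(send_heartbeat, hw_watchdog, now):
    """One cycle: report to the software watchdog, tickle only on success."""
    try:
        ok = bool(send_heartbeat())
    except OSError:
        ok = False   # broken communication counts as a failed heartbeat
    if ok:
        hw_watchdog.tickle(now)
    return ok
```

The key design point is that the tickle is conditional on the heartbeat reply, so the hardware timer expires whenever the software watchdog is unreachable or unhappy.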

To reboot all machines, the watchdog only needs to return error messages in response to the heartbeats from each watchpuppy. The watchpuppies on each of the machines then stop "tickling" their hardware watchdogs, which leads to a reset of all machines. If communication between the machines is broken, or a machine is held up in computation, one or more machines will eventually reset, which may in turn lead to a reset of all the machines.
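The global-reset decision can be sketched as a latch inside the software watchdog: once any client is overdue, every later heartbeat is answered with an error. This is an illustrative sketch only. The class name, the "OK"/"ERROR" reply strings and the explicit timestamps are assumptions, not the actual message format.

```python
class SoftwareWatchdog:
    """Answers heartbeats; latches into reset mode once any client is overdue.

    Hypothetical sketch: reply strings and names are assumptions.
    """

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}
        self.reset_requested = False

    def handle_heartbeat(self, name, now):
        # Check every known client before recording this check-in, so a
        # late heartbeat from the sender itself also triggers the reset.
        if any(now - t > self.timeout_s for t in self.last_seen.values()):
            self.reset_requested = True
        self.last_seen[name] = now
        # Once latched, every watchpuppy gets an error and stops tickling.
        return "ERROR" if self.reset_requested else "OK"
```

Because the flag never clears, all watchpuppies see errors on their next heartbeat and every hardware watchdog is left to expire.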

Possible Future Improvements

Resetting all the machines is very time-consuming. We should instead be able to restart only the process that died, in the hope that it resumes normal operation.