Note #11: Restart and recovery

	For participants only. Not for public distribution.
	John Nagle. Last revised August 3, 2003.

How to deal with software and hardware failures - a first cut.

The watchdog process

At system startup on each machine, a watchdog process will be started. This functions much like "inetd"; it has a list of programs which it starts and monitors.

The watchdog process is a server, in the QNX sense. The service it provides to worker processes is lookup of other worker processes. This allows worker processes to set up interprocess communication.

Worker processes each have a watchdog time limit, enforced by the watchdog process. Each worker process must send a message to the watchdog process periodically to indicate that it is alive and functioning properly. The watchdog process itself will be timed by a hardware stall timer, which will reset and reboot the entire machine if necessary.

The watchdog process will kill any worker that fails to reset its watchdog timer within its time limit. As soon as that process is completely dead, the worker will be restarted. While the kill is in progress, the watchdog process will not reset the hardware stall timer, so if a process hangs in a half-dead state, unable to respond to a kill signal, the whole machine will be rebooted.
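
As a rough illustration of the bookkeeping described above, here is a minimal sketch in portable C++. It is not the planned implementation: the class name WatchdogTable and all of its member functions are hypothetical, and the real watchdog would receive check-ins as QNX messages and reset an actual hardware timer. The sketch only tracks per-worker deadlines and withholds the hardware timer reset while any kill is still in progress.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical bookkeeping for the watchdog process. All names are illustrative.
class WatchdogTable {
private:
    struct Entry {
        std::chrono::milliseconds limit{};  // time limit between check-ins
        Clock::time_point deadline{};       // next check-in is due by this time
        bool terminating = false;           // kill sent, process not yet dead
    };
    std::map<std::string, Entry> workers_;

public:
    // Register a worker and start its clock at `now`.
    void add(const std::string& id, std::chrono::milliseconds limit, Clock::time_point now) {
        workers_[id] = Entry{limit, now + limit, false};
    }

    // Called when a worker's "alive" message arrives.
    // Returns false for unknown workers and for workers being killed.
    bool checkIn(const std::string& id, Clock::time_point now) {
        auto it = workers_.find(id);
        if (it == workers_.end() || it->second.terminating) return false;
        it->second.deadline = now + it->second.limit;
        return true;
    }

    // IDs of running workers whose deadline has passed; these should be killed.
    std::vector<std::string> expired(Clock::time_point now) const {
        std::vector<std::string> out;
        for (const auto& kv : workers_)
            if (!kv.second.terminating && now > kv.second.deadline) out.push_back(kv.first);
        return out;
    }

    // Mark a worker as being killed. Throws std::out_of_range for unknown IDs.
    void beginKill(const std::string& id) { workers_.at(id).terminating = true; }

    // The worker is completely dead and has been relaunched; restart its clock.
    void confirmDead(const std::string& id, Clock::time_point now) {
        Entry& e = workers_.at(id);
        e.terminating = false;
        e.deadline = now + e.limit;
    }

    // The hardware stall timer may be reset only while no kill is in progress.
    bool mayResetHardwareTimer() const {
        for (const auto& kv : workers_)
            if (kv.second.terminating) return false;
        return true;
    }
};
```

The watchdog's main loop would call expired() periodically, call beginKill() for each overdue worker, and reset the hardware timer only when mayResetHardwareTimer() returns true. A worker stuck in a half-dead state then starves the hardware timer and forces the reboot.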

The watchdog startup file

The watchdog program reads a file whose lines look like UNIX command lines, although the watchdog is not a standard shell.

# Watchdog input file for sensor node
ID="SONAR" PRI=12 MAXPRI=14 SCHED=FIFO MAXMEM=10000000 MAXWATCH=0.1 FAIL=RELAUNCH sonar -v

In typical UNIX style, environment variables can be specified. The following variables have special meaning.

Variable	Type	Notes
ID	string	Process name, used for interprocess communication and messages
USER	integer	User ID (UID). Default is the NOBODY UID. Use USER=0 if root privileges are needed to access hardware directly.
PRI	integer	Initial process priority. Default is 10, which is low and suitable only for interactive or background programs.
MAXPRI	integer	Maximum allowed process priority. Default is the same as PRI.
MAXMEM	integer	Maximum allowed process memory consumption (bytes). Limits damage from memory leaks. Default is no limit.
SCHED	string	QNX scheduling policy. One of "FIFO", "RR", or "SPORADIC". Default is FIFO if a priority above the default is specified, else RR.
MAXWATCH	float	Max real time between watchdog calls in seconds. If a process does not check in within the time limit, it has failed and the watchdog takes action. Default is infinite (no watchdog). Required if PRI is specified.
INITWATCH	float	Max real time in seconds between program launch and first watchdog call. Allows extra time for a program to start up. Default is the same as MAXWATCH.
FAIL	string	Action to be taken if program fails. One of "REBOOT", "STOP", or "RELAUNCH". "REBOOT" causes an E-stop and system reboot. "STOP" causes a controlled vehicle stop and restart. "RELAUNCH" simply restarts the program. Default is "STOP".

The default values are suitable for a background program. Hard real-time programs require more specific settings.
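
To make the file format concrete, here is a minimal sketch of how one line might be split into variables and a command. This is an assumption about the parsing rules, not the actual watchdog code: the names WorkerSpec, parseLine, valueOr and missingRequiredWatch are hypothetical, and the sketch does not handle quoted values that contain spaces.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of one line of the watchdog startup file.
struct WorkerSpec {
    std::map<std::string, std::string> vars;  // e.g. ID, PRI, MAXWATCH
    std::vector<std::string> command;         // program name and its arguments
};

// Strip one pair of surrounding double quotes, if present.
inline std::string unquote(const std::string& s) {
    if (s.size() >= 2 && s.front() == '"' && s.back() == '"')
        return s.substr(1, s.size() - 2);
    return s;
}

// Parse one line. Returns false for blank lines, '#' comments,
// and lines that contain no command.
inline bool parseLine(const std::string& line, WorkerSpec& spec) {
    spec = WorkerSpec{};
    const auto first = line.find_first_not_of(" \t");
    if (first == std::string::npos || line[first] == '#') return false;
    std::istringstream in(line);
    std::string tok;
    while (in >> tok) {
        const auto eq = tok.find('=');
        // Leading NAME=VALUE tokens are variables; the first token that is
        // not of that form starts the command, and everything after it
        // belongs to the command.
        if (spec.command.empty() && eq != std::string::npos && eq > 0)
            spec.vars[tok.substr(0, eq)] = unquote(tok.substr(eq + 1));
        else
            spec.command.push_back(tok);
    }
    return !spec.command.empty();
}

// Look up a variable, falling back to the documented default.
inline std::string valueOr(const WorkerSpec& s, const std::string& name,
                           const std::string& dflt) {
    const auto it = s.vars.find(name);
    return it == s.vars.end() ? dflt : it->second;
}

// MAXWATCH is required whenever PRI is specified.
inline bool missingRequiredWatch(const WorkerSpec& s) {
    return s.vars.count("PRI") != 0 && s.vars.count("MAXWATCH") == 0;
}
```

For the sample line above, parseLine would yield the variables ID, PRI, MAXPRI, SCHED, MAXMEM, MAXWATCH and FAIL, with the command "sonar -v". A caller would then apply the table's defaults, for example valueOr(spec, "FAIL", "STOP").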

More to come.

Kill and restart

When a worker program is restarted, if its previous incarnation ran for less than some reasonable period of time (say, one minute), a flag will be set on the program's command line indicating this situation. The process may then want to start up in a "safe mode", making conservative assumptions about what is going on.
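
A minimal sketch of the restart-flag rule, assuming the one-minute threshold suggested above. The flag name "--recent-failure" and the function restartArgs are hypothetical; the note does not yet define what the flag will be called.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Assumed threshold: "say, one minute".
constexpr std::chrono::seconds kMinHealthyUptime{60};

// Build the command line for a relaunch. If the previous incarnation ran for
// less than the threshold, append a (hypothetical) flag so the worker can
// choose to start in a conservative "safe mode".
inline std::vector<std::string> restartArgs(std::vector<std::string> args,
                                            std::chrono::seconds previousUptime) {
    if (previousUptime < kMinHealthyUptime) args.push_back("--recent-failure");
    return args;
}
```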

Killed processes will result in core dumps, as usual.

Cascading worker failures

When all worker processes have been up for a reasonable period of time (the one minute mentioned above), the system is considered to be in a good state, and worker processes that fail will then be restarted individually. If more than one worker process fails within one minute, all worker processes are killed and restarted.
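
One possible reading of this rule, sketched in C++: a failure that occurs within one minute of the previous failure is treated as a cascade and triggers a full restart; any other failure restarts only the failed worker. The names CascadeDetector and RestartScope are hypothetical, and the one-minute window is the assumed value from above.

```cpp
#include <chrono>
#include <optional>

enum class RestartScope { SingleWorker, AllWorkers };

// Hypothetical policy object: decides how much to restart after a failure.
class CascadeDetector {
public:
    explicit CascadeDetector(std::chrono::seconds window = std::chrono::seconds(60))
        : window_(window) {}

    // Record a worker failure at time `now` (seconds since an arbitrary epoch)
    // and return what should be restarted.
    RestartScope onFailure(std::chrono::seconds now) {
        const bool cascade = last_.has_value() && (now - *last_ < window_);
        last_ = now;  // a failure right after a full restart also counts as a cascade
        return cascade ? RestartScope::AllWorkers : RestartScope::SingleWorker;
    }

private:
    std::chrono::seconds window_;
    std::optional<std::chrono::seconds> last_;  // time of the most recent failure
};
```

Because the most recent failure time is kept even after a full restart, the system only returns to individual restarts once a full minute has passed without any failure.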

Rules for worker processes

Aborting is permitted

It's OK for a worker program to abort. If a program gets in trouble, it's better to abort than to continue in a messed up state.

Restartability is required

Worker programs must be able to start up regardless of the state of any transient files. In general, there shouldn't be any transient files in any of the low-level components. Lock files and process ID files are not allowed; there are better ways to do locking under QNX.

After a system crash and restart, the QNX file system may mark a file open for writing as unusable. Workers must be able to deal with that.

Server processes may go away

Processes must be prepared for the possibility that any server process they use may die. When this happens, the client will see a "send" or "receive" call fail. It must then ask the watchdog process again for the process ID of the worker it needs and establish communication with the new instance of the worker. Because QNX interprocess communication is process-ID oriented, when a process dies, all of its interprocess communication channels die with it, immediately. So there's no danger of accidentally reestablishing communication with a new instance of a worker without knowing about it.

Note, though, that a worker may be partway through a restart, in which case asking for its process ID may result in a no-find, or may block until the restart completes.

The mechanisms for this will probably be encapsulated in a C++ class for convenience.
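
A minimal sketch of what such a class might look like. Everything here is an assumption made for illustration: the class name WorkerLink is hypothetical, and the lookup and send operations are injected as callables rather than written against the real QNX messaging calls, so the retry logic can be shown (and tested) on its own.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical wrapper that hides "worker died, look it up again" from callers.
class WorkerLink {
public:
    // Asks the watchdog for a worker's process ID; empty result means no-find.
    using Lookup = std::function<std::optional<int>(const std::string& id)>;
    // Sends a message to a process; returns false if the send fails.
    using Send = std::function<bool(int pid, const std::string& msg)>;

    WorkerLink(std::string id, Lookup lookup, Send send)
        : id_(std::move(id)), lookup_(std::move(lookup)), send_(std::move(send)) {}

    // Try to deliver msg. On a failed send, forget the cached process ID,
    // ask the watchdog again, and retry, up to maxAttempts times in total.
    bool send(const std::string& msg, int maxAttempts = 3) {
        for (int attempt = 0; attempt < maxAttempts; ++attempt) {
            if (!pid_) pid_ = lookup_(id_);  // may be a no-find during a restart
            if (pid_ && send_(*pid_, msg)) return true;
            pid_.reset();                    // the old instance is gone
        }
        return false;
    }

private:
    std::string id_;
    Lookup lookup_;
    Send send_;
    std::optional<int> pid_;  // cached process ID of the current instance
};
```

A real version would probably wait briefly between attempts, since a no-find during a restart is expected to clear up on its own.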

Avoid locking up a server or client so that it can't terminate

There are ways in QNX for a server to prevent a client from terminating until the server releases it. Don't do that. Doing so may lock up other processes and force the watchdog process to try to restart all workers.

Inter-computer restart interaction

(to be supplied)

Interaction with the emergency stop system

(to be supplied)

Testing

We'll test this with a process that randomly kills a worker process every minute or two. We should be able to operate in that situation.
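
A minimal sketch of the scheduling part of such a test process, assuming "every minute or two" means a uniformly random delay of 60 to 120 seconds and a uniformly random victim. The names KillPlan and nextKill are hypothetical; the actual kill would be issued by the test process once the delay expires.

```cpp
#include <chrono>
#include <cstddef>
#include <random>

// Hypothetical plan for one test kill: which worker, and how long to wait first.
struct KillPlan {
    std::size_t victim;          // index into the list of worker processes
    std::chrono::seconds delay;  // wait this long before killing it
};

// Pick the next victim and delay. workerCount must be greater than zero.
inline KillPlan nextKill(std::mt19937& rng, std::size_t workerCount) {
    std::uniform_int_distribution<std::size_t> pick(0, workerCount - 1);
    std::uniform_int_distribution<int> wait(60, 120);  // assumed range, in seconds
    return KillPlan{pick(rng), std::chrono::seconds(wait(rng))};
}
```

Seeding the generator with a fixed value makes a failing test run repeatable.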