How to deal with software and hardware failures - a first cut.
The watchdog process
At system startup on each machine, a watchdog process will be started. This functions much like "inetd"; it has a list of programs which it starts and monitors.
The watchdog process is a server, in the QNX sense. The service it provides to worker processes is lookup of other worker processes. This allows worker processes to set up interprocess communication.
Worker processes each have a watchdog time limit, enforced by the watchdog process. Each worker process must send a message to the watchdog process periodically to indicate that it is alive and functioning properly. The watchdog process itself will be timed by a hardware stall timer, which will reset and reboot the entire machine if necessary.
The watchdog process will kill any worker that fails to reset its watchdog timer, and will restart that worker as soon as the old process is completely dead. During this termination period, the watchdog process will not reset its own hardware stall timer, so if a process hangs in a half-dead state, unable to respond to a kill signal, the whole machine will be rebooted.
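The bookkeeping described above might look roughly like the following. This is only a sketch in portable C++, not QNX code: the class name, the method names, and the idea of passing the current time in explicitly are all assumptions made for illustration. The real watchdog would receive check-ins as QNX messages and would actually signal processes.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical bookkeeping table for the watchdog process.
class WatchdogTable {
public:
    // Register a worker with its time limit; its timer starts at `now`.
    void add(const std::string& name, std::chrono::milliseconds limit,
             Clock::time_point now) {
        workers_[name] = Entry{limit, now};
    }

    // Called when a worker's "I am alive" message arrives.
    // Returns false if no such worker is registered.
    bool checkIn(const std::string& name, Clock::time_point now) {
        auto it = workers_.find(name);
        if (it == workers_.end()) return false;
        it->second.lastCheckIn = now;
        return true;
    }

    // Names of workers whose time limit has expired and should be killed.
    std::vector<std::string> overdue(Clock::time_point now) const {
        std::vector<std::string> out;
        for (const auto& kv : workers_)
            if (now - kv.second.lastCheckIn > kv.second.limit)
                out.push_back(kv.first);
        return out;
    }

    // Track kills that have been issued but not yet confirmed complete.
    void killIssued() { ++killsPending_; }
    void deathConfirmed() { if (killsPending_ > 0) --killsPending_; }

    // The hardware stall timer is reset only while no kill is outstanding,
    // so a worker stuck half-dead eventually reboots the machine.
    bool mayResetHardwareTimer() const { return killsPending_ == 0; }

private:
    struct Entry {
        std::chrono::milliseconds limit;
        Clock::time_point lastCheckIn;
    };
    std::map<std::string, Entry> workers_;
    int killsPending_ = 0;
};
```

Passing the time in as a parameter, rather than reading the clock inside the class, keeps the logic easy to test without waiting for real timeouts.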
The watchdog startup file
The watchdog program reads a file whose lines look like UNIX command lines, although the watchdog is not a standard shell.
In typical UNIX style, environment variables can be specified. The following variables have special meaning.
The default values are suitable for a background program. Hard real-time programs require more specific settings.
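As a rough illustration of reading one line of the startup file, here is a sketch that splits a line into leading NAME=value settings followed by the program and its arguments. The exact file syntax is not pinned down yet, so this assumes plain whitespace-separated tokens with no quoting, and the names StartupEntry and parseStartupLine are made up for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of one startup-file line.
struct StartupEntry {
    std::map<std::string, std::string> env;  // leading NAME=value settings
    std::vector<std::string> argv;           // program followed by its arguments
};

// True if tok has the form NAME=value, where NAME is a valid identifier.
static bool isAssignment(const std::string& tok) {
    std::string::size_type eq = tok.find('=');
    if (eq == std::string::npos || eq == 0) return false;
    for (std::size_t i = 0; i < eq; ++i) {
        unsigned char c = static_cast<unsigned char>(tok[i]);
        bool ok = std::isalpha(c) || c == '_' || (i > 0 && std::isdigit(c));
        if (!ok) return false;
    }
    return true;
}

// Split a line on whitespace (assumption: no quoting or escapes).
// Assignments count as environment settings only before the program name.
StartupEntry parseStartupLine(const std::string& line) {
    StartupEntry e;
    std::istringstream in(line);
    std::string tok;
    while (in >> tok) {
        if (e.argv.empty() && isAssignment(tok)) {
            std::string::size_type eq = tok.find('=');
            e.env[tok.substr(0, eq)] = tok.substr(eq + 1);
        } else {
            e.argv.push_back(tok);
        }
    }
    return e;
}
```

Once the program name has been seen, later tokens containing "=" are treated as ordinary arguments, which matches the usual UNIX convention.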
More to come.
Kill and restart
When a worker program is restarted, if its previous incarnation ran for less than some reasonable period of time (say, one minute), a flag will be set on the program's command line indicating this situation. The process may then want to start up in a "safe mode", making conservative assumptions about what is going on.
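A minimal sketch of the quick-restart flag follows. The flag spelling "--restarted-quickly", its position at the end of the command line, and the function names are all placeholders; only the one-minute threshold comes from the text above.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

// Hypothetical flag text; the real spelling has not been decided.
const char* const kQuickRestartFlag = "--restarted-quickly";

// "Reasonable period" from the text: one minute.
const std::chrono::seconds kMinHealthyRun(60);

// Watchdog side: build the command line for a restart, appending the flag
// if the previous incarnation ran for less than the minimum healthy time.
std::vector<std::string> restartArgv(std::vector<std::string> argv,
                                     std::chrono::seconds previousRun) {
    if (previousRun < kMinHealthyRun) argv.push_back(kQuickRestartFlag);
    return argv;
}

// Worker side: decide whether to start up in "safe mode".
bool shouldUseSafeMode(const std::vector<std::string>& argv) {
    for (const auto& arg : argv)
        if (arg == kQuickRestartFlag) return true;
    return false;
}
```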
Killed processes will result in core dumps, as usual.
Cascading worker failures
When all worker processes have been up for a reasonable period of time (the one minute mentioned above), the system is considered to be in a good state, and worker processes that fail will then be restarted individually. If more than one worker process fails within one minute, all worker processes are killed and restarted.
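The restart-one versus restart-all decision could be sketched as below. This is one possible reading of the rule, with assumptions called out: time is measured in seconds since the watchdog started, a failure exactly one window old no longer counts, and the failure history is cleared after a full restart. The names CascadeDetector and RestartAction are invented for the example.

```cpp
#include <cassert>
#include <chrono>
#include <deque>

enum class RestartAction { RestartOne, RestartAll };

// Hypothetical helper that decides how to react to each worker failure.
class CascadeDetector {
public:
    explicit CascadeDetector(
        std::chrono::seconds window = std::chrono::seconds(60))
        : window_(window) {}

    // `now` is the time of the failure, in seconds since watchdog start.
    RestartAction onWorkerFailure(std::chrono::seconds now) {
        // Forget failures that have aged out of the window.
        while (!recent_.empty() && now - recent_.front() >= window_)
            recent_.pop_front();
        recent_.push_back(now);

        // More than one failure inside the window: restart everything.
        if (recent_.size() > 1) {
            recent_.clear();  // assumption: start counting afresh afterwards
            return RestartAction::RestartAll;
        }
        return RestartAction::RestartOne;
    }

private:
    std::chrono::seconds window_;
    std::deque<std::chrono::seconds> recent_;
};
```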
Rules for worker processes
Aborting is permitted
It's OK for a worker program to abort. If a program gets into trouble, it's better to abort than to continue in a messed-up state.
Restartability is required
Worker programs must be able to start up regardless of the state of any transient files. In general, there shouldn't be any transient files in any of the low-level components. Lock files and process ID files are not allowed; there are better ways to do locking under QNX.
After a system crash and restart, the QNX file system may mark a file that was open for writing at the time of the crash as unusable. Workers must be able to deal with that.
Server processes may go away
Processes must be prepared for the possibility that any server process they use may die. When this happens, the client sees a "send" or "receive" call fail. It must then ask the watchdog process again for the process ID of the worker it needs and establish communication with the new instance. Because QNX interprocess communication is addressed by process ID, all of a process's communication channels die with it, immediately. There is therefore no danger of unknowingly talking to a new instance of a worker.
Note, though, that the worker may still be restarting when its process ID is requested. The lookup may then report that the worker was not found, or block until the restart completes.
The mechanisms for this will probably be encapsulated in a C++ class for convenience.
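One possible shape for such a class is sketched here. It is not tied to the real QNX calls: the watchdog lookup and the message send are passed in as function objects, and the names WorkerConnection, Pid, and kNoPid are invented for the example. A real version would wrap the actual QNX primitives and would probably pause between attempts.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

using Pid = int;
const Pid kNoPid = -1;  // hypothetical "not found" value

// Hypothetical client-side wrapper that re-resolves a worker after it dies.
class WorkerConnection {
public:
    // Asks the watchdog for a worker's process ID; kNoPid means not found.
    using Lookup = std::function<Pid(const std::string&)>;
    // Sends a message to a process; false means the send failed.
    using Send = std::function<bool(Pid, const std::string&)>;

    WorkerConnection(std::string name, Lookup lookup, Send send)
        : name_(std::move(name)),
          lookup_(std::move(lookup)),
          send_(std::move(send)) {}

    // Try to deliver msg, looking the worker up again after any failure.
    bool send(const std::string& msg, int maxAttempts = 3) {
        for (int attempt = 0; attempt < maxAttempts; ++attempt) {
            if (pid_ == kNoPid) pid_ = lookup_(name_);
            if (pid_ == kNoPid) continue;  // worker may be mid-restart
            if (send_(pid_, msg)) return true;
            pid_ = kNoPid;  // old instance is gone; forget its process ID
        }
        return false;
    }

    Pid pid() const { return pid_; }

private:
    std::string name_;
    Lookup lookup_;
    Send send_;
    Pid pid_ = kNoPid;
};
```

Note that blindly resending after a failure can repeat a request the old instance had already acted on before it died, so callers with non-repeatable requests would need to handle the failure themselves.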
Avoid locking up a server or client so that it can't terminate
There are ways in QNX for a server to prevent a client from terminating until the server releases it. Don't do that. Doing so may lock up other processes and force the watchdog process to try to restart all workers.
Inter-computer restart interaction
(to be supplied)
Interaction with the emergency stop system
(to be supplied)
We'll test the failure handling described here with a process that randomly kills a worker process every minute or two. The system should be able to keep operating in that situation.
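The scheduling part of such a test process might be as simple as the sketch below, which picks a random victim and a random delay of 60 to 120 seconds. The names KillPlan and nextKill are hypothetical, and actually delivering the kill is left out because it depends on how the test process learns the workers' process IDs.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>

// Hypothetical description of the next kill the test process will perform.
struct KillPlan {
    std::size_t victim;          // index into the list of worker processes
    std::chrono::seconds delay;  // how long to wait before killing it
};

// Pick a victim uniformly at random and a delay of one to two minutes.
// Precondition: workerCount must be at least 1.
KillPlan nextKill(std::mt19937& rng, std::size_t workerCount) {
    assert(workerCount > 0);
    std::uniform_int_distribution<std::size_t> pickVictim(0, workerCount - 1);
    std::uniform_int_distribution<int> pickDelay(60, 120);
    return KillPlan{pickVictim(rng), std::chrono::seconds(pickDelay(rng))};
}
```

Seeding the generator with a fixed value makes a failing test run repeatable.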