
If You Can’t Fix It, Restart It!


So what do system administrators and others do with vital software that
doesn’t properly handle errors, bad data, and bad operating conditions?
Well, if it runs OK for a short period of time, you can make it run for a long
period of time by periodically restarting it. The solution isn’t very reliable,
nor scalable, but it is good enough to keep Unix creaking along.


Here’s an example of this type of workaround, which was put in place to
keep mail service running in the face of an unreliable named program:


Date: 14 May 91 05:43:35 GMT
From: [email protected] (Theodore Ts’o)^4
Subject: Re: DNS performance metering: a wish list for bind 4.8.4
Newsgroups: comp.protocols.tcp-ip.domains

This is what we do now to solve this problem: I’ve written a program
called “ninit” that starts named in nofork mode and waits for it to
exit. When it exits, ninit restarts a new named. In addition, every
5 minutes, ninit wakes up and sends a SIGIOT to named. This causes
named to dump statistical information to /usr/tmp/named.stats. Every
60 seconds, ninit tries to do a name resolution using the local named.
If it fails to get an answer back in some short amount of time, it
kills the existing named and starts a new one.

We are running this on the MIT nameservers and our mailhub. We find
that it is extremely useful in catching nameds that die mysteriously
or that get hung for some unknown reason. It’s especially useful on
our mailhub, since our mail queue will explode if we lose name
resolution even for a short time.
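
The loop described in the message above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original ninit: the function name `supervise`, its parameters, and the caller-supplied `probe` callable are all hypothetical, and the command line that keeps named in the foreground is left to the caller.

```python
import signal
import subprocess
import time

# SIGIOT is an alias for SIGABRT on most modern systems; fall back if absent.
STATS_SIGNAL = getattr(signal, "SIGIOT", signal.SIGABRT)


def supervise(cmd, probe, probe_interval=60.0, stats_interval=300.0,
              poll=1.0, max_restarts=None):
    """Keep `cmd` running, ninit-style (hypothetical sketch).

    cmd            -- argv list for a server that stays in the foreground
    probe          -- callable returning True if the server answers in time
    probe_interval -- seconds between health checks (60 in the message)
    stats_interval -- seconds between stats signals (5 minutes in the message)
    max_restarts   -- stop after this many child exits; None means run forever
    Returns the number of times the child exited or was killed.
    """
    restarts = 0
    while True:
        child = subprocess.Popen(cmd)
        now = time.monotonic()
        next_probe = now + probe_interval
        next_stats = now + stats_interval

        # Watch the child until it dies on its own or fails a health check.
        while child.poll() is None:
            time.sleep(poll)
            now = time.monotonic()
            if now >= next_stats:
                # Ask the server to dump statistics. A program that does not
                # handle this signal will simply die and be restarted.
                child.send_signal(STATS_SIGNAL)
                next_stats = now + stats_interval
            if now >= next_probe:
                next_probe = now + probe_interval
                if not probe():
                    child.kill()  # hung server: kill it and start a new one
                    break

        child.wait()  # reap the child so it does not linger as a zombie
        restarts += 1
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
```

A caller would pass the server's real command line as `cmd` and a `probe` that attempts a lookup against the local server with a short timeout. Nothing here watches `supervise` itself.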

Of course, such a solution leaves open an obvious question: how to handle
a buggy ninit program? Write another program to fork ninits when they
die for “unknown reasons”? But how do you keep that program running?


Such an attitude toward errant software is not unique. The following man
page recently crossed our desk. We still haven’t figured out whether it’s a
joke or not. The BUGS section is revealing, as the bugs it lists are the usual
bugs that Unix programmers never seem to be able to expunge from their
server code:


NANNY(8) Unix Programmer's Manual NANNY(8)

(^4) Forwarded to UNIX-HATERS by Henry Minsky.
