bytes of memory leaked away in the kernel. This added up to megabytes lost per day—
enough to bring the entry-level workstation that we give presidents to its knees after two or
three days.
We corrected the off-by-one bug in the vnode reference count algorithm, and the regular
kernel memory free routine kicked in just as it was supposed to. We also changed printtool
to use a smarter algorithm than just continually yammering at the printer every few seconds.
The memory leak was plugged, the programmers breathed a sigh of relief, the engineering
manager started to smile again, and the president went back to using printtool.
[1] Actually, this is quite a good idea. Having the president run early release software and
participate in the internal testing process keeps everyone on their toes. It ensures that upper
management has a good understanding of the evolving product and how fast it is improving.
And it provides product engineering with both the motivation and resources to shake out the
last few bugs.
The operating system kernel also manages its memory use dynamically. Many tables of data in the
kernel are dynamically allocated, so that no fixed limit is set in advance. If a kernel programming
error causes a memory leak, the machine slows down; in the limiting case the machine hangs or even
panics. When kernel routines ask for memory they usually wait until it becomes available. If memory
is leaking away, eventually there is none available, and everyone ends up waiting—the machine is
hung. Memory leaks in the kernel usually show up rapidly, as most paths through the kernel are pretty
well travelled. We also have specialized software tools to test for and exercise kernel memory
management.
Bus Error, Take the Train
When I first started programming on UNIX in the late 1970s, like many people I quickly ran into two
common runtime errors:
bus error (core dumped)
and
segmentation fault (core dumped)
At the time these errors were very frustrating: there was no simple explanation of the kind of source
errors that caused them, the messages gave no clue where to look in the code, and the difference
between them wasn't at all clear. And it's still the same today.
Most of the problem lies in the fact that the errors represent an anomaly the operating system has
detected, and the anomaly is reported in terms most convenient to the operating system. The precise
causes of a bus error and a segmentation fault will thus vary among different versions of the operating
system. Here, we describe what they mean on SunOS running on the SPARC architecture, and what
causes them.
Both errors occur when hardware tells the OS about a problematic memory reference. The OS
communicates this to the faulting process by sending it a signal. A signal is an event notification or a
software-generated interrupt, much used in UNIX systems programming and hardly ever used in