1.6 Floating-point representation
Extended precision
It is sometimes necessary to use extended precision, which employs 128-bit
(16-byte) words. When only 80 of these bits carry data, as in the x87 extended
format, a number can be described to roughly nineteen or twenty significant
figures; a full 128-bit quadruple-precision format resolves about thirty-four.
This high level of resolution is needed for solving a certain class of highly
sensitive, nearly ill-posed mathematical problems.
C++ allows us to implement arbitrary precision by dividing a number into
pieces and storing them in separate memory slots.
Round-off error
An arbitrary real number that has a finite number of digits in the decimal
system generally requires an infinite number of bits in the binary system. In
fact, only the numbers $\pm n \, 2^m$ are represented exactly in the single-precision
floating-point representation, where $0 \le n < 2^{23}$ and $-127 \le m \le 126$, with
$m$ and $n$ being two integers. An ideal computing machine would be able to
register the number and carry out additions and multiplications with infinite
precision, yielding the exact result to all figures. In real life, one must deal with
non-ideal machines that work with only a finite number of bits and thus incur
round-off error.
Some computers round a real number to the closest number they can describe,
with an equal probability of positive or negative error. Other computers
simply chop off the extra digits in a guillotine-like fashion.
When two real numbers (non-integers) are added in the floating-point
representation, the significant digits of the number with the smaller exponent
are shifted to align the decimal point, and this causes the loss of significant
digits. Floating-point normalization of the resulting number incurs additional
losses. Consequently, arithmetic operations between real variables increase
the magnitude of the round-off error. Unless only integers are involved,
identities that are precise in exact arithmetic become approximate in computer
arithmetic.
The accumulation of the round-off error in the course of a computation
may range from negligible, to observable, to significant, to disastrous.
Depending on the nature of the problem and the sequence of computations, the
round-off error may be amplified until it becomes comparable to, or even
exceeds, the magnitude of the actual variables.
In certain simple cases, the damaging effect of the round-off error can be
predicted and thus minimized or controlled. As a general rule, one should avoid
subtracting two nearly equal numbers, because the leading digits cancel and the
difference retains few significant figures. In computing the sum of a sequence
of numbers, we should add the numbers with the smallest magnitudes first and
those with the largest magnitudes last.