and
\[
\cos(0.007) \approx 0.99998.
\]
The first expression for $f(x)$ results in
\[
f(x) = \frac{1 - 0.99998}{0.59999 \times 10^{-2}} = \frac{0.2 \times 10^{-4}}{0.59999 \times 10^{-2}} = 0.33334 \times 10^{-2},
\]
while the second expression results in
\[
f(x) = \frac{0.59999 \times 10^{-2}}{1 + 0.99998} = \frac{0.59999 \times 10^{-2}}{1.99998} = 0.30000 \times 10^{-2},
\]
which is also the exact result. In the first expression, due to our choice of precision, we
have only one relevant digit in the numerator after the subtraction. This leads to a loss of
precision and a wrong result due to a cancellation of two nearly equal numbers. If we had
chosen a precision of six leading digits, both expressions would yield the same answer. If we were
to evaluate $x \sim \pi$, then the second expression for $f(x)$ could lose precision instead,
because $1 + \cos x$ in the denominator is then a cancellation of two nearly equal numbers.
This simple example demonstrates the loss of numerical precision due to roundoff errors,
where leading digits are lost in the subtraction of two nearly equal numbers. The
lesson to be drawn is that we cannot blindly compute a function. We will always need to
carefully analyze our algorithm in the search for potential pitfalls. There is no magic recipe,
however; the only guideline is an understanding of the fact that a machine cannot represent
all numbers correctly.
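The same effect can be reproduced on a real machine. The following is a minimal sketch, not part of the original example: it evaluates both expressions for $f(x)$ in single precision and compares them with a double-precision reference. The function names (`first_form`, `second_form`, `reference`, `relative_error`) are hypothetical, and the reference relies on the identity $(1 - \cos x)/\sin x = \tan(x/2)$, which is an assumption added here for the comparison.

```cpp
#include <cmath>

// First form, (1 - cos x) / sin x.
// The numerator cancels when x is close to 0, since cos x is then close to 1.
float first_form(float x) {
    float c = std::cos(x);
    float s = std::sin(x);
    return (1.0f - c) / s;
}

// Second form, sin x / (1 + cos x).
// The denominator cancels when x is close to pi, since cos x is then close to -1.
float second_form(float x) {
    float c = std::cos(x);
    float s = std::sin(x);
    return s / (1.0f + c);
}

// Double-precision reference for the same single-precision argument,
// using the identity f(x) = tan(x/2) (an assumption of this sketch).
double reference(float x) {
    return std::tan(0.5 * static_cast<double>(x));
}

// Relative error of a single-precision result against the reference.
double relative_error(float value, float x) {
    double ref = reference(x);
    return std::fabs((static_cast<double>(value) - ref) / ref);
}
```

Calling `relative_error(first_form(x), x)` and `relative_error(second_form(x), x)` for an argument close to zero, for instance `x = 1.0e-4f`, and for one close to $\pi$, for instance `x = 3.1415f`, should show the two forms trading places: on a typical IEEE 754 platform the first form is expected to lose essentially all of its digits near zero, and the second near $\pi$. Both arguments are illustrative choices only.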
2.3.1 Representation of real numbers.
Real numbers are stored with a decimal precision (or mantissa) and the decimal exponent
range. The mantissa contains the significant figures of the number (and thereby the precision
of the number). A number like $(9.90625)_{10}$ in the decimal representation is given in a binary
representation by
\[
(1001.11101)_2 = 1 \times 2^{3} + 0 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3} + 0 \times 2^{-4} + 1 \times 2^{-5},
\]
and it has an exact machine number representation, since only a finite number of bits is needed to
represent it. This representation is, however, not very practical. Rather, we prefer
to use scientific notation. In the decimal system we would write a number like 9.90625 in
what is called normalized scientific notation. This simply means that the decimal point is
shifted and appropriate powers of 10 are supplied. Our number could then be written as
\[
9.90625 = 0.990625 \times 10^{1},
\]
and a real non-zero number could be generalized as
\[
x = \pm r \times 10^{n},
\]
with $r$ a number in the range $1/10 \le r < 1$. In a similar way we can represent a binary number
in scientific notation as
\[
x = \pm q \times 2^{m},
\]
with $q$ a number in the range $1/2 \le q < 1$. This means that the mantissa of a binary number
would be represented by the general formula