Computational Physics - Department of Physics

(Axel Boer) #1

2.3 Real Numbers and Numerical Precision


Instead of using $x_-$ and $x_+$ as the machine numbers closest to $x$, we introduce the relative
error
\[
\frac{|x - \overline{x}|}{|x|} \le 2^{-24},
\]
with $\overline{x}$ being the machine number closest to $x$. Defining
\[
\epsilon_x = \frac{\overline{x} - x}{x},
\]
we can write the previous inequality as
\[
fl(x) = x(1 + \epsilon_x),
\]

where $|\epsilon_x| \le \epsilon_M = 2^{-24}$ for variables of length 32 bits. The notation $fl(x)$ stands for the machine
approximation of the number $x$. The number $\epsilon_M$ is given by the specified machine precision,
approximately $10^{-7}$ for single and $10^{-16}$ for double precision, respectively.
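
The bound on $\epsilon_x$ can be checked numerically. The following is a minimal sketch, assuming IEEE 754 arithmetic with round-to-nearest; the helper names `fl32` and `relative_rounding_error` are hypothetical and introduced only for this illustration. A Python float (a double) plays the role of the "exact" number $x$, and its single-precision rounding plays the role of $fl(x)$.

```python
import struct
import sys
from fractions import Fraction

# Assumption: IEEE 754 single and double precision with round-to-nearest.
EPS_M_SINGLE = 2.0 ** -24                   # bound quoted in the text for 32-bit variables
EPS_M_DOUBLE = sys.float_info.epsilon / 2   # 2**-53, the corresponding bound for doubles


def fl32(x: float) -> float:
    """Round a double to the nearest single-precision machine number."""
    return struct.unpack("f", struct.pack("f", x))[0]


def relative_rounding_error(x: float) -> float:
    """Return eps_x = (fl(x) - x) / x, evaluated exactly with rationals."""
    exact = Fraction(x)
    return float((Fraction(fl32(x)) - exact) / exact)


print(f"eps_M (single) = {EPS_M_SINGLE:.3e}")
print(f"eps_M (double) = {EPS_M_DOUBLE:.3e}")

for x in (0.1, 1.0 / 3.0, 2.0 / 3.0, 1.0e10 / 7.0):
    eps_x = relative_rounding_error(x)
    # fl(x) = x (1 + eps_x) with |eps_x| <= eps_M
    print(f"x = {x:.10e}  eps_x = {eps_x:+.3e}  within bound: {abs(eps_x) <= EPS_M_SINGLE}")
```

The two printed values of $\epsilon_M$ are of order $10^{-7}$ and $10^{-16}$, in line with the estimates quoted above.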
There are several mathematical operations in which a loss of precision may appear.
Subtraction, which is especially important in the definition of the numerical derivatives discussed
in chapter 3, is one such operation. In the computation of derivatives we end up
subtracting two nearly equal quantities. In the case of such a subtraction $a = b - c$, we have


\[
fl(a) = fl(b) - fl(c) = a(1 + \epsilon_a),
\]
or
\[
fl(a) = b(1 + \epsilon_b) - c(1 + \epsilon_c),
\]
meaning that
\[
\frac{fl(a)}{a} = 1 + \epsilon_b \frac{b}{a} - \epsilon_c \frac{c}{a},
\]
and if $b \approx c$ we see that there is a potential for an increased error in $fl(a)$, the machine
representation of $a$. This is because we are subtracting two numbers of almost equal size, and what remains
is only the least significant part of these numbers. This part is prone to roundoff errors, and if
$a$ is small we see that (with $b \approx c$)


\[
\epsilon_a \approx \frac{b}{a}(\epsilon_b - \epsilon_c),
\]

can become very large. The latter equation represents the relative error of this calculation.
To see this, we define first the absolute error as


\[
|fl(a) - a|,
\]

whereas the relative error is
\[
\frac{|fl(a) - a|}{a} \le \epsilon_a.
\]
The above subtraction thus gives
\[
\frac{|fl(a) - a|}{a} = \frac{|fl(b) - fl(c) - (b - c)|}{a},
\]

yielding
\[
\frac{|fl(a) - a|}{a} = \frac{|b\epsilon_b - c\epsilon_c|}{a}.
\]
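
The amplification of the relative error can be made concrete with a short numerical check. The sketch below assumes IEEE 754 single precision with round-to-nearest and follows the model of the text, in which only the inputs $b$ and $c$ are rounded and the difference $fl(b) - fl(c)$ is formed without further error. The values of `b` and `c` and the helper `fl32` are hypothetical choices made for illustration; exact rational arithmetic is used so that the identity $fl(a) - a = b\epsilon_b - c\epsilon_c$ can be tested without additional roundoff.

```python
import struct
from fractions import Fraction


def fl32(x: float) -> float:
    """Round a double to the nearest IEEE 754 single-precision number."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical inputs with b > c > 0 and b close to c.
b = Fraction(1.0 / 3.0)
c = Fraction(0.3333333)
a = b - c  # exact difference of the two input values

# Rounded inputs and their relative errors, fl(x) = x (1 + eps_x).
fl_b = Fraction(fl32(float(b)))
fl_c = Fraction(fl32(float(c)))
eps_b = (fl_b - b) / b
eps_c = (fl_c - c) / c

# Model from the text: fl(a) = fl(b) - fl(c), evaluated exactly here.
fl_a = fl_b - fl_c
eps_a = (fl_a - a) / a

# Right-hand side of the final formula, (b eps_b - c eps_c) / a.
predicted = (b * eps_b - c * eps_c) / a

print(f"eps_b     = {float(eps_b):+.3e}")
print(f"eps_c     = {float(eps_c):+.3e}")
print(f"eps_a     = {float(eps_a):+.3e}")
print(f"predicted = {float(predicted):+.3e}")
print(f"b / a     = {float(b / a):.3e}")
```

Both input errors stay below $2^{-24}$, while the relative error of the difference is larger by a factor of order $b/a$, as the formula for $\epsilon_a$ suggests.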

An interesting question is then how many significant binary bits are lost in a subtraction
$a = b - c$ when we have $b \approx c$. The loss of precision theorem for a subtraction $a = b - c$ states
that [23]: if $b$ and $c$ are positive normalized floating-point binary machine numbers with $b > c$
and
