Computational Physics - Department of Physics

(Axel Boer) #1

22 2 Introduction to C++ and Fortran


1 .(a− 1 a− 2 ...a−n) 2

is always truncated at some stagendue to its limited number of bits, there is only a limited
number of real binary numbers. The spacing between every real binary number is given by
the chosen machine precision. For a 32 bit words this number is approximatelyεM∼ 10 −^7 and
for double precision (64 bits) we haveεM∼ 10 −^16 , or in terms of a binary base as 2 −^23 and 2 −^52
for single and double precision, respectively.


2.3.2 Machine numbers


To understand that a given floating point number can be written as in Eq. (2.1), we assume
for the sake of simplicity that we work with real numbers withwords of length 32 bits, or four
bytes. Then a given numberxin the binary representation can be represented as


x= ( 1 .a− 1 a− 2 ...a− 23 a− 24 a− 25 ...) 2 × 2 n,

or in a more compact form
x=r× 2 n,


with 1 ≤r< 2 and− 126 ≤n≤ 127 since our exponent is defined by eight bits.
In most cases there will not be an exact machine representation of the numberx. Our
number will be placed between two exact 32 bits machine numbersx−andx+. Following the
discussion of Kincaid and Cheney [23] these numbers are given by


x−= ( 1 .a− 1 a− 2 ...a− 23 ) 2 × 2 n,

and
x+=


(

( 1 .a− 1 a− 2 ...a− 23 )) 2 + 2 −^23

)

× 2 n.

If we assume that our numberxis closer tox−we have that the absolute error is constrained
by the relation


|x−x−|≤

1

2 |x+−x−|=

1

2 ×^2

n− (^23) = 2 n− (^24).
A similar expression can be obtained ifxis closer tox+. The absolute error conveys one
type of information. However, we may have cases where two equal absolute errors arise from
rather different numbers. Consider for example the decimalnumbersa= 2 anda= 2. 001. The
absolute error between these two numbers is 0. 001. In a similar way, the two decimal numbers
b= 2000 andb= 2000. 001 give exactly the same absolute error. We note here thatb= 2000. 001
has more leading digits thanb.
If we compare the relative errors
|a−a|
|a|


= 1. 0 × 10 −^3 ,

|b−b|
|b|

= 1. 0 × 10 −^6 ,

we see that the relative error inbis much smaller than the relative error ina. We will see
below that the relative error is intimately connected with the number of leading digits in the
way we approximate a real number. The relative error is therefore the quantity of interest in
scientific work. Information about the absolute error is normally of little use in the absence
of the magnitude of the quantity being measured.
We define then the relative error forxas
|x−x−|
|x|



2 n−^24
r× 2 n

=

1

q

× 2 −^24 ≤ 2 −^24.
Free download pdf