Loss of Precision

A floating-point number $x$, labelled $fl(x)$, will therefore always be represented as
\begin{equation} fl(x) = x(1\pm \epsilon_x), \tag{6} \end{equation}
with $x$ the exact number and the error $|\epsilon_x| \le |\epsilon_M|$, where $\epsilon_M$ is the chosen machine precision. A number like $1/10$ has no exact binary representation in either single or double precision. Since the mantissa $\left(1.a_{-1}a_{-2}\dots a_{-n}\right)_2$ is always truncated at some stage $n$ due to the limited number of bits, only a finite set of real numbers can be represented in binary form. The spacing between neighbouring representable numbers is set by the machine precision: for a 32-bit word this number is approximately $\epsilon_M \sim 10^{-7}$, while for double precision (64 bits) we have $\epsilon_M \sim 10^{-16}$, or in terms of a binary base, $2^{-23}$ and $2^{-52}$ for single and double precision, respectively.
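
The values quoted above are easy to verify. The following is a minimal sketch, assuming Python with NumPy is available (neither is prescribed by the text), that prints $\epsilon_M$ for single and double precision and shows that $fl(0.1)$ differs from the exact $1/10$ by a relative error bounded by $\epsilon_M$.

```python
# Sketch: machine precision and the representation error of 1/10.
import sys
from decimal import Decimal

import numpy as np

# Machine precision eps_M: spacing between 1.0 and the next representable number.
eps_double = sys.float_info.epsilon        # 2**-52, roughly 2.2e-16
eps_single = np.finfo(np.float32).eps      # 2**-23, roughly 1.2e-7
print(f"double precision: eps_M = {eps_double:.3e}  (2**-52 = {2.0**-52:.3e})")
print(f"single precision: eps_M = {eps_single:.3e}  (2**-23 = {2.0**-23:.3e})")

# 1/10 is stored as the nearest representable binary number fl(0.1).
x = 0.1
print("fl(0.1) =", Decimal(x))             # the exact value actually stored

# Relative error |eps_x| of fl(0.1) with respect to the exact 1/10.
exact = Decimal(1) / Decimal(10)
rel_err = abs(Decimal(x) - exact) / exact
print(f"relative error of fl(0.1): {float(rel_err):.3e}  (bounded by eps_M)")
```

Running the sketch shows $fl(0.1) = 0.1000000000000000055511\dots$, with a relative error of order $10^{-17}$, consistent with the double-precision bound $|\epsilon_x| \le \epsilon_M \sim 10^{-16}$.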