In binary there are two ways to represent fractional numbers:
- Fixed-point notation (where a fixed, known number of bits come after the point)
- Floating-point notation (where the point is not fixed and can move, using an exponent)
Fixed-Point Representation
We have an integer part and a fractional part.
Let's give it a true mathematical definition: with $i$ integer bits and $f$ fractional bits $b_k$, the represented value is
$$x = \sum_{k=-f}^{i-1} b_k \cdot 2^k$$
The point is fixed at a set position between the integer and fractional bits → Hence "fixed point"
If the point is not shown, we assume it is to the right of the least significant digit
This principle applies to binary just as well as to decimal.
Precision
The precision is the total number of bits that can be used. It is given by $n = i + f$
Resolution
The resolution is the smallest possible distance between two consecutive numbers. It is given by $\varepsilon = 2^{-f}$
Range
The range is the maximum distance between the most positive and the most negative number that is representable. For an unsigned format it is given by $r = (2^n - 1) \cdot 2^{-f} = 2^{i} - 2^{-f}$
Accuracy
The accuracy is the magnitude of the largest error between the value we're trying to represent and the closest representable value. The worst case scenario (with rounding) is half the resolution: $a = 2^{-(f+1)}$
Dynamic range
The dynamic range is the ratio of the maximum absolute value that is representable to the minimum positive absolute value that is representable. It is given by $\mathrm{DR} = \dfrac{(2^n - 1)\cdot 2^{-f}}{2^{-f}} = 2^n - 1$
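The four metrics above can be sketched for an unsigned format; the function name and the parameters `i`/`f` are my own, matching the symbols used here:

```python
# Sketch: precision, resolution, range and dynamic range of an
# unsigned fixed-point format with i integer and f fractional bits.

def fixed_point_metrics(i, f):
    n = i + f                         # precision: total number of bits
    resolution = 2 ** -f              # gap between consecutive numbers
    rng = 2 ** i - 2 ** -f            # distance from smallest (0) to largest value
    dynamic_range = rng / resolution  # equals 2**n - 1
    return n, resolution, rng, dynamic_range

# Example: an unsigned 8.8 format
n, res, rng, dr = fixed_point_metrics(8, 8)
print(n, res, rng, dr)  # 16 0.00390625 255.99609375 65535.0
```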
Floating Point Representation
Floating point representation consists of two components:
- The mantissa
- The signed exponent
It is similar to scientific notation: $x = \pm m \cdot 2^{e}$
Significand
It is written in Sign and Magnitude.
However this system is redundant! → To avoid redundancy, we normalize the mantissa between $1$ (included) and $2$ (excluded)
Therefore there is a leading bit that is always on, and we hide it
The corresponding significand value becomes $m = 1.f_1 f_2 \ldots$ (i.e. $1$ plus the stored fraction)
Exponent
The exponent is signed using “Biased Representation”
Biased Representation
It is basically unsigned, but shifted so that its lowest stored value corresponds to the most negative exponent. This allows for simple comparisons when checking which number is bigger than another. We recover the true exponent by removing the bias $B$, which for a $k$-bit exponent field is given by $B = 2^{k-1} - 1$
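A minimal sketch of biased encoding, assuming the IEEE-style bias $B = 2^{k-1} - 1$ stated above (function names are mine):

```python
# Sketch: biased exponent encoding/decoding for a k-bit exponent field.

def encode_exponent(e, k):
    bias = 2 ** (k - 1) - 1
    stored = e + bias              # the unsigned value actually stored
    assert 0 <= stored < 2 ** k    # must fit in the field
    return stored

def decode_exponent(stored, k):
    bias = 2 ** (k - 1) - 1
    return stored - bias

# With an 8-bit field the bias is 127:
print(encode_exponent(0, 8))     # 127
print(encode_exponent(-126, 8))  # 1
print(decode_exponent(254, 8))   # 127
```

Because stored exponents are unsigned, comparing two of them is a plain unsigned comparison, which is the point made above.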
Rounding
The result of a floating point operation is a real number that might require an infinite number of digits
We have to round the number to be able to represent it
Rounding Modes
- Nearest (to even when there is a tie)
- Towards $-\infty$ (floor-ish)
- Towards the closest infinity, i.e. away from zero (roof-ish)
IEEE Standard 754
- $(f{+}1)$-bit significand (stored fraction plus the hidden bit)
- Sign and magnitude
- Normalized, and one hidden bit
- $k$-bit exponent
- Biased by $2^{k-1} - 1$
| f32 | f64 |
|---|---|
| Single precision | Double precision |
| Sign: 1 bit | Sign: 1 bit |
| Exponent: 8 bits | Exponent: 11 bits |
| Fraction: 23 bits | Fraction: 52 bits |
Special values
Floating point zero:
- Exponent is all zeros
- Fraction is all zeros
Infinity (positive/negative):
- Exponent is all ones
- Fraction is all zeros
NaN (not a number):
- Sign is $0$ or $1$
- Exponent is all ones
- Fraction is non-zero
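These bit patterns can be checked on real f32 values with the standard library (the helper `f32_fields` is my own):

```python
# Sketch: extract the sign, exponent and fraction fields of an f32.
import struct

def f32_fields(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits
    fraction = bits & 0x7FFFFF         # 23 fraction bits
    return sign, exponent, fraction

print(f32_fields(0.0))                 # (0, 0, 0): everything zero
print(f32_fields(float("inf")))        # (0, 255, 0): exponent all ones, fraction zero
_, e, f = f32_fields(float("nan"))
print(e == 255 and f != 0)             # True: exponent all ones, fraction non-zero
```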
Exceptions
- The value can overflow (set the result to infinity)
- The value can underflow
- Division by 0
- Inexact result
- Invalid result
Example -- Conversion
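As a small worked conversion (the value $-6.25$ is my choice for illustration), encoding an f32 by hand and checking it against the machine representation:

```python
# Sketch: encode -6.25 as an f32 by hand.
# -6.25 = -1.5625 * 2**2, so sign = 1, exponent e = 2, significand = 1.5625
import struct

sign = 1
stored_exponent = 2 + 127                      # biased: 129
significand = 1.5625
fraction = round((significand - 1) * 2 ** 23)  # drop the hidden bit, keep 23 bits

bits = (sign << 31) | (stored_exponent << 23) | fraction
machine = struct.unpack(">I", struct.pack(">f", -6.25))[0]
assert bits == machine
print(hex(bits))  # 0xc0c80000
```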
Arithmetic
We are interested in how to manipulate fractional numbers in binary
Fixed-Point Arithmetic
Addition, subtraction, and multiplication are done the same way as for integers
Multiplication of fixed point numbers needs care: the raw product of two numbers with $f$ fractional bits carries $2f$ fractional bits, so we have to shift the result back into what the fixed point format requires, discarding (and thereby rounding away) the extra bits
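A minimal sketch of this reformatting step, assuming an unsigned format with `F` fractional bits (names `F`, `to_fixed`, `fixed_mul` are mine):

```python
# Sketch: fixed-point multiplication. The raw integer product has 2F
# fractional bits, so we shift right by F to return to the format.

F = 8  # fractional bits

def to_fixed(x):
    return round(x * 2 ** F)

def fixed_mul(a, b):
    return (a * b) >> F   # discard the extra F fractional bits (truncates)

a = to_fixed(1.5)         # 384
b = to_fixed(2.25)        # 576
p = fixed_mul(a, b)
print(p / 2 ** F)         # 3.375 (exact here; in general the shift rounds)
```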
Floating-Point Arithmetic
Because we don’t want numbers to be redundant, we have to normalize the significands.
Addition/Subtraction
We format the numbers $x$ and $y$ as $m_x \cdot 2^{e_x}$ and $m_y \cdot 2^{e_y}$
The result is $z = x \pm y$, formatted as $m_z \cdot 2^{e_z}$. The result is also normalized
Algorithm
Four main steps to compute $z = x \pm y$ and produce the result
- We add/subtract the significands (mantissas), and set the exponent
  - The mantissa of the number with the smaller exponent has to be multiplied by $2^{e_{\text{small}} - e_{\text{large}}}$ (we call this alignment; it can be simplified to a simple right shift)
  - We then add/subtract this value to/from the mantissa of the other number
- We normalize the result, and adjust the exponent (if required)
- Round the result, then normalize and adjust the exponent again (if required)
- We set the value to one of the special values if required
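The steps above can be sketched on toy numbers. This is an assumption-laden simplification: significands are stored as integers with `FRAC` fraction bits plus the leading 1, both operands are positive, and rounding is plain truncation rather than round-to-nearest-even:

```python
# Sketch of floating-point addition: align, add, normalize.
# A normalized significand m satisfies 2**FRAC <= m < 2**(FRAC + 1).
FRAC = 23

def fp_add(mx, ex, my, ey):
    # Step 1: align to the larger exponent, then add the significands.
    if ex < ey:
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    my >>= (ex - ey)               # alignment = right shift of the smaller operand
    m, e = mx + my, ex
    # Step 2: normalize (adding two normalized significands can overflow).
    if m >= 2 ** (FRAC + 1):
        m >>= 1                    # truncating round folded into the shift
        e += 1
    return m, e

# 1.5 * 2**0 + 1.0 * 2**1 = 3.5 = 1.75 * 2**1
m, e = fp_add(3 * 2 ** 22, 0, 2 ** 23, 1)
print(m / 2 ** FRAC, e)            # 1.75 1
```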
Alignment
Definition
Shifting one operand's significand so that both operands end up with the same exponent. Three approaches:
- Align the operands to an arbitrary common exponent
- Align to the minimum of the two exponents
- Align to the maximum of the two exponents
We prefer the third approach, as it is cheaper and produces less error: we only remove the least significant bits of the smaller operand, so the result differs by a tiny amount
Normalization
When we normalize a number, various situations may occur
- The result is already normalized, so we do nothing
- The significand might overflow
- We shift the result by one position
- We increment the exponent by one
- When subtracting, the result might have leading 0s
- We shift the result to the left until there are no more leading 0s
- Each time we shift, we decrement the exponent
Rounding
The result might not be representable, as it can be situated in between two values. We perform rounding towards the nearest value, and to the even number when there is a tie
We prefer to tie to the nearest even number, as it leads to smaller errors when the result is divided by two
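Python's built-in `round()` happens to use the same rule (round to nearest, ties to even), so it can illustrate the tie-breaking described above:

```python
# Sketch: round-to-nearest, ties-to-even. On a tie, the even
# neighbour wins; otherwise we go to the nearest value as usual.
print(round(0.5))   # 0
print(round(1.5))   # 2
print(round(2.5))   # 2
print(round(3.5))   # 4
```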
Max Round-off Error
If a number is really big, the absolute error can be huge. In fact, the maximum rounding error happens when the exponent is at its maximum value and the exact value falls exactly halfway between two representable numbers (a tie); it is then half a ULP at that exponent, $\tfrac{1}{2} \cdot 2^{e_{\max} - f}$
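This can be seen with the standard library: `math.ulp` gives the gap to the next representable float, and the worst rounding error under round-to-nearest is half of it, growing with the exponent:

```python
# Sketch: the ULP (and thus the max rounding error, ulp/2) grows
# with the magnitude of the number.
import math

for x in (1.0, 2.0 ** 52, 2.0 ** 1023):
    ulp = math.ulp(x)
    print(x, ulp, ulp / 2)   # worst-case absolute rounding error is ulp/2
```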