
In binary there are two ways to represent fractional numbers:

  • Fixed point notation (there is a fixed, known number of bits after the point)
  • Floating point notation (the point is not fixed and can move, using an exponent)

Fixed point representation

A number has an integer part and a fractional part:

Let's give it a proper mathematical definition. With i bits before the point and f bits after it, the bits b_(i-1) … b_1 b_0 . b_(-1) … b_(-f) represent the value

    sum of b_k · 2^k for k from -f to i-1

The point is fixed after a given bit position, hence "fixed point".

If the point is not shown, we assume it is to the right of the least significant digit.

This principle applies to binary just as well as to decimal.
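As a sketch of the idea (in Python, with a hypothetical choice of 4 fractional bits): a fixed point value is just an integer equal to the real value scaled by 2^f.

```python
# Minimal fixed-point sketch: f fractional bits (unsigned format).
# The stored integer is the real value scaled by 2**f.

def to_fixed(value, f):
    """Encode a real value as an integer with f fractional bits."""
    return round(value * (1 << f))

def from_fixed(raw, f):
    """Decode the stored integer back to a real value."""
    return raw / (1 << f)

# 2.75 with f = 4 fractional bits: 2.75 * 16 = 44 (0b101100, i.e. 10.1100)
raw = to_fixed(2.75, 4)
```

Encoding and decoding are exact here because 2.75 needs only two fractional bits; values that need more than f bits would be rounded.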

Precision

The precision is the maximum number of bits that can be used. It is given by n = i + f, the total of the integer and fractional bit counts.

Resolution

The resolution is the smallest possible distance between two consecutive numbers. It is given by 2^(-f).

Range

The range is the maximum distance between the most positive and the most negative representable number. For an unsigned format it is given by (2^n - 1) · 2^(-f).

Accuracy

The accuracy is the magnitude of the largest difference between the value we're trying to represent and the closest representable value. The worst case scenario (with rounding) is half the resolution, 2^(-(f+1)).

Dynamic range

The dynamic range is the ratio of the maximum absolute value that is representable to the minimum positive absolute value that is representable. It is given by ((2^n - 1) · 2^(-f)) / 2^(-f) = 2^n - 1.
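The metrics above can be gathered in one small helper; this is a sketch for an unsigned format, and the example parameters (4 integer bits, 4 fractional bits) are an assumption for illustration.

```python
# Sketch: the fixed-point metrics for an unsigned format with
# n = i + f bits (i integer bits, f fractional bits).

def fixed_point_metrics(i, f):
    n = i + f                        # precision: total number of bits
    resolution = 2 ** -f             # gap between consecutive values
    largest = (2 ** n - 1) * 2 ** -f # most positive representable value
    return {
        "precision": n,
        "resolution": resolution,
        "range": largest,            # distance from 0 to the largest value
        "accuracy": resolution / 2,  # worst-case rounding error
        "dynamic_range": 2 ** n - 1, # largest / smallest positive value
    }

m = fixed_point_metrics(4, 4)        # hypothetical 8-bit "4.4" format
```

For the 4.4 format this gives a resolution of 1/16 and a largest value of 15.9375.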

Floating Point Representation

Floating point representation consists of two components:

  • The mantissa
  • The signed exponent

It is similar to scientific notation.

Significand

It is written in Sign and Magnitude.

However, this system is redundant! (For example, 0.1 · 2^1 and 1.0 · 2^0 encode the same value.) To avoid redundancy, we normalize the mantissa between 1 (included) and 2 (excluded).

Therefore there is a bit (the leading 1) that is always on, and we hide it.

The corresponding significand value becomes 1.f, where f stands for the stored fraction bits.
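A minimal sketch of recovering the significand value from the stored fraction bits (the 4-bit fraction here is just an example width):

```python
# Sketch: with a normalized significand in [1, 2), the leading 1 is
# implicit. Stored fraction bits "0110" decode as 1.0110 in binary.

def significand_value(fraction_bits):
    """fraction_bits: string of bits that come after the hidden leading 1."""
    value = 1.0                       # the hidden bit
    for k, bit in enumerate(fraction_bits, start=1):
        value += int(bit) * 2 ** -k   # bit k weighs 2^(-k)
    return value

significand_value("0110")  # 1 + 0.25 + 0.125 = 1.375
```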

Exponent

The exponent is signed using “Biased Representation”

Biased Representation

It is basically unsigned: the all-zeros pattern stands for the lowest (most negative) value. This allows for simple unsigned comparisons for checking which number is bigger than another. We recover the signed value by subtracting the bias, which for an e-bit exponent field (in IEEE 754) is given by 2^(e-1) - 1.
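A small sketch of biased encoding and decoding, assuming the IEEE 754 choice of bias 2^(e-1) - 1:

```python
# Biased representation sketch for an e-bit exponent field.
# IEEE 754 uses bias = 2**(e-1) - 1 (127 for f32, 1023 for f64).

def bias(e_bits):
    return 2 ** (e_bits - 1) - 1

def encode_exponent(exp, e_bits):
    """Store a signed exponent as an unsigned field."""
    return exp + bias(e_bits)

def decode_exponent(field, e_bits):
    """Recover the signed exponent from the stored field."""
    return field - bias(e_bits)

encode_exponent(-1, 8)   # -1 + 127 = 126
```

Because larger exponents always get larger unsigned fields, two biased exponents can be compared with plain unsigned comparison hardware.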

Rounding

The result of a floating point operation is a real number that might require an infinite number of digits.

We have to round the number to be able to represent it.

Rounding Modes

  • To the nearest value (to the even one when there is a tie)
  • Towards 0 (floor-ish: truncation)
  • Towards the closest infinity (roof-ish: away from 0)

IEEE Standard 754

  • 24-bit significand (f32) or 53-bit (f64)
    • Sign and magnitude
    • Normalized, and one hidden bit
  • 8-bit exponent (f32) or 11-bit (f64)
    • Biased by 127 (f32) or 1023 (f64)
             f32                 f64
             Single precision    Double precision
Sign         1 bit               1 bit
Exponent     8 bits              11 bits
Fraction     23 bits             52 bits
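The field widths can be checked by pulling apart a real float32 with Python's standard struct module (a sketch, not part of the original notes):

```python
import struct

# Sketch: extract the sign / exponent / fraction fields of a float32.
def f32_fields(x):
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32 bits
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF       # 8-bit biased exponent
    fraction = bits & 0x7FFFFF           # 23-bit fraction
    return sign, exponent, fraction

f32_fields(1.0)   # (0, 127, 0): true exponent 0 is stored as 0 + 127
```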

Special values

Floating point zero:

  • The exponent and the fraction are both all zeros

Infinity (pos/neg):

  • The exponent is all ones, the fraction is all zeros

NaN (not a number):

  • Sign is 0 or 1
  • The exponent is all ones, the fraction is non-zero
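These bit patterns can be built directly (a sketch using Python's struct module):

```python
import math
import struct

# Sketch: the special values as float32 bit patterns.
def bits_to_f32(bits):
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

zero    = bits_to_f32(0x00000000)   # sign 0, exponent 0, fraction 0
inf     = bits_to_f32(0x7F800000)   # exponent all ones, fraction 0
neg_inf = bits_to_f32(0xFF800000)   # same, with the sign bit set
nan     = bits_to_f32(0x7FC00000)   # exponent all ones, fraction != 0
```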

Exceptions

  • The value can overflow (the result is set to infinity)
  • The value can underflow (the result is too close to zero to be represented)
  • Division by 0
  • Inexact result
  • Invalid result

Example -- Conversion
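As a worked example (a sketch; the value 6.625 is our own choice), converting 6.625 to single precision:

```python
import struct

# 6.625 = 110.101 in binary = 1.10101 * 2^2  (normalized)
#   sign     = 0
#   exponent = 2 + 127 = 129                 (biased by 127)
#   fraction = 10101 followed by 18 zeros    (hidden leading 1 dropped)

sign, exponent, fraction = 0, 2 + 127, 0b10101 << 18

# Assemble the 32-bit word and reinterpret it as a float to check.
bits = (sign << 31) | (exponent << 23) | fraction
(value,) = struct.unpack(">f", struct.pack(">I", bits))
```

Reinterpreting the assembled word gives back exactly 6.625, since the value fits in the 23 fraction bits with room to spare.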

Arithmetic

We are interested in how to manipulate fractional numbers in binary

Fixed-Point Arithmetic

Addition, subtraction and multiplication are done the same way as for integers.

Multiplication of fixed point numbers needs care: multiplying two numbers with f fractional bits produces 2f fractional bits, so we have to shift the result back into the required fixed point format, possibly discarding low-order bits.
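A sketch of that rescaling step (the 4.4 format is an assumption for illustration):

```python
# Sketch: multiplying two fixed-point numbers with f fractional bits
# yields 2f fractional bits; shift right by f to return to the format
# (the dropped low bits are where the error comes from).

F = 4  # assumed number of fractional bits

def fx_mul(a, b, f=F):
    return (a * b) >> f   # truncating shift back to f fractional bits

# 2.5 * 1.5 = 3.75; in the 4.4 format: 40 * 24 = 960, then 960 >> 4 = 60
product = fx_mul(40, 24)
```

Here 60 / 16 = 3.75, so no precision is lost; a product needing more than 4 fractional bits would be truncated.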

Floating-Point Arithmetic

Because we don’t want numbers to be redundant, we have to normalize the significands.

Addition/Subtraction

We write the numbers x and y as x = m_x · 2^(e_x) and y = m_y · 2^(e_y).

The result z = x ± y is formatted as m_z · 2^(e_z). The result is also normalized.

Algorithm

Four main steps to compute and produce the result:

  1. We add/subtract the significands (mantissas), and set the exponent

  • The mantissa of the number with the smaller exponent has to be multiplied by two to the power of the difference between the exponents (we call this alignment; it simplifies to a simple bit shift).

  • We then add/subtract this value to/from the mantissa of the other number

  2. We normalize the result, and adjust the exponent (if required)
  3. We round the result, then normalize and adjust the exponent again (if required)
  4. We set the result to one of the Special values if required
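The steps above can be sketched on unsigned (mantissa, exponent) pairs. This is a simplification that truncates instead of rounding to nearest even and ignores signs and special values:

```python
# Sketch: value = mantissa * 2**exponent, where a normalized mantissa
# has exactly PREC bits (leading bit set).

PREC = 8  # assumed significand width

def fp_add(m1, e1, m2, e2):
    # Step 1: align the smaller-exponent operand, then add.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)                  # alignment: shift right, drop low bits
    m, e = m1 + m2, e1
    # Step 2: normalize (the sum may overflow into PREC + 1 bits).
    while m >= (1 << PREC):
        m >>= 1                       # truncating "round" (step 3 simplified)
        e += 1
    return m, e

# 128 * 2^0 + 128 * 2^0 = 256 = 128 * 2^1
fp_add(128, 0, 128, 0)   # -> (128, 1)
```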

Alignment

Definition

Alignment means bringing both operands to the same exponent so that their significands can be added. Three approaches:

  • Align the operands to some common exponent
  • Align to the minimum of the two exponents
  • Align to the maximum of the two exponents

We prefer the third approach, as it is cheaper and produces less error: the smaller operand is shifted right, so we only drop the least significant bits and the result differs by a tiny amount.

Normalization

When we normalize a number, various situations may occur

  1. The result is already normalized, so we do nothing
  2. The significand might overflow
    • We shift the result right by one position
    • We increment the exponent by one
  3. When subtracting, the result might have leading 0s
    • We shift the result to the left until there are no more leading 0s
    • Each time we shift, we decrement the exponent

Rounding

The result might not be representable, as it can be situated in between two values. We perform rounding towards the nearest value, and to the even number when there is a tie

We prefer to break the tie towards the even number, as it leaves a 0 in the last bit, which leads to smaller errors when the result is later divided by two.
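Python's built-in round() happens to use this tie-to-even rule ("banker's rounding"), which makes the behavior easy to observe:

```python
# Ties go to the even neighbor, not always upward.
results = [round(0.5), round(1.5), round(2.5), round(3.5)]
# -> [0, 2, 2, 4]
```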

Max Round-off Error

If a number is really big, the rounding error can be huge: the maximum round-off error occurs when the exponent is at its maximum value and the exact value falls exactly on a tie, i.e. half the gap between consecutive representable numbers at that exponent.