In binary there are two ways to represent fractional numbers:
- Fixed-point notation (where a fixed, known number of bits come after the point)
- Floating-point notation (where the point is not fixed and can move, using an exponent)
Fixed-Point Representation
We have an integer part and a fractional part.
Let's give it a true mathematical definition: with $i$ integer bits and $f$ fractional bits $b_k$, the represented value is
$$x = \sum_{k=-f}^{i-1} b_k \cdot 2^k$$
The point is fixed at a set position between the integer and fractional bits → Hence "fixed point"
If the point is not shown, we assume it is to the right of the least significant digit
This principle applies to binary just as well as to decimal.
Precision
The precision is the total number of bits that can be used. It is given by $n = i + f$
Resolution
The resolution is the smallest possible distance between two consecutive numbers. It is given by $\varepsilon = 2^{-f}$
Range
The range is the maximum distance between the most positive and the most negative number that is representable. For an unsigned format it is given by $r = (2^n - 1) \cdot 2^{-f} = 2^{i} - 2^{-f}$
Accuracy
The accuracy is the magnitude of the largest error between the value we're trying to represent and the closest representable value. The worst case scenario (with rounding) is half the resolution: $a = 2^{-(f+1)}$
Dynamic range
The dynamic range is the ratio of the maximum absolute value that is representable to the minimum positive absolute value that is representable. It is given by $\mathrm{DR} = \dfrac{(2^n - 1)\cdot 2^{-f}}{2^{-f}} = 2^n - 1$
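The four metrics above can be sketched for an unsigned format; the function name and the parameters `i`/`f` are my own, matching the symbols used here:

```python
# Sketch: precision, resolution, range and dynamic range of an
# unsigned fixed-point format with i integer and f fractional bits.

def fixed_point_metrics(i, f):
    n = i + f                         # precision: total number of bits
    resolution = 2 ** -f              # gap between consecutive numbers
    rng = 2 ** i - 2 ** -f            # distance from smallest (0) to largest value
    dynamic_range = rng / resolution  # equals 2**n - 1
    return n, resolution, rng, dynamic_range

# Example: an unsigned 8.8 format
n, res, rng, dr = fixed_point_metrics(8, 8)
print(n, res, rng, dr)  # 16 0.00390625 255.99609375 65535.0
```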
Floating Point Representation
Floating point representation consists of two components:
- The mantissa
- The signed exponent
It is similar to scientific notation: $x = \pm m \cdot 2^{e}$
Significand
It is written in Sign and Magnitude.
However this system is redundant! → To avoid redundancy, we normalize the mantissa between $1$ (included) and $2$ (excluded)
Therefore there is a leading bit that is always on, and we hide it
The corresponding significand value becomes $m = 1.f_1 f_2 \ldots$ (i.e. $1$ plus the stored fraction)
Exponent
The exponent is signed using “Biased Representation”
Biased Representation
It is basically unsigned, but shifted so that its lowest stored value corresponds to the most negative exponent. This allows for simple comparisons when checking which number is bigger than another. We recover the true exponent by removing the bias $B$, which for a $k$-bit exponent field is given by $B = 2^{k-1} - 1$
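A minimal sketch of biased encoding, assuming the IEEE-style bias $B = 2^{k-1} - 1$ stated above (function names are mine):

```python
# Sketch: biased exponent encoding/decoding for a k-bit exponent field.

def encode_exponent(e, k):
    bias = 2 ** (k - 1) - 1
    stored = e + bias              # the unsigned value actually stored
    assert 0 <= stored < 2 ** k    # must fit in the field
    return stored

def decode_exponent(stored, k):
    bias = 2 ** (k - 1) - 1
    return stored - bias

# With an 8-bit field the bias is 127:
print(encode_exponent(0, 8))     # 127
print(encode_exponent(-126, 8))  # 1
print(decode_exponent(254, 8))   # 127
```

Because stored exponents are unsigned, comparing two of them is a plain unsigned comparison, which is the point made above.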
Rounding
The result of a floating point operation is a real number that might require an infinite number of digits
We have to round the number to be able to represent it
Rounding Modes
- Nearest (to even when there is a tie)
- Towards $-\infty$ (floor-ish)
- Towards the closest infinity, i.e. away from zero (roof-ish)
IEEE Standard 754
- $(f{+}1)$-bit significand (stored fraction plus the hidden bit)
- Sign and magnitude
- Normalized, and one hidden bit
- $k$-bit exponent
- Biased by $2^{k-1} - 1$
| f32 | f64 |
|---|---|
| Single precision | Double precision |
| Sign: 1 bit | Sign: 1 bit |
| Exponent: 8 bits | Exponent: 11 bits |
| Fraction: 23 bits | Fraction: 52 bits |
Special values
Floating point zero:
- Exponent is all zeros
- Fraction is all zeros
Infinity (positive/negative):
- Exponent is all ones
- Fraction is all zeros
NaN (not a number):
- Sign is $0$ or $1$
- Exponent is all ones
- Fraction is non-zero
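These bit patterns can be checked on real f32 values with the standard library (the helper `f32_fields` is my own):

```python
# Sketch: extract the sign, exponent and fraction fields of an f32.
import struct

def f32_fields(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits
    fraction = bits & 0x7FFFFF         # 23 fraction bits
    return sign, exponent, fraction

print(f32_fields(0.0))                 # (0, 0, 0): everything zero
print(f32_fields(float("inf")))        # (0, 255, 0): exponent all ones, fraction zero
_, e, f = f32_fields(float("nan"))
print(e == 255 and f != 0)             # True: exponent all ones, fraction non-zero
```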
Exceptions
- The value can overflow (set the result to infinity)
- The value can underflow
- Division by 0
- Inexact result
- Invalid result
Example -- Conversion
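As a small worked conversion (the value $-6.25$ is my choice for illustration), encoding an f32 by hand and checking it against the machine representation:

```python
# Sketch: encode -6.25 as an f32 by hand.
# -6.25 = -1.5625 * 2**2, so sign = 1, exponent e = 2, significand = 1.5625
import struct

sign = 1
stored_exponent = 2 + 127                      # biased: 129
significand = 1.5625
fraction = round((significand - 1) * 2 ** 23)  # drop the hidden bit, keep 23 bits

bits = (sign << 31) | (stored_exponent << 23) | fraction
machine = struct.unpack(">I", struct.pack(">f", -6.25))[0]
assert bits == machine
print(hex(bits))  # 0xc0c80000
```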
Arithmetic
We are interested in how to manipulate fractional numbers in binary
Fixed-Point Arithmetic
Addition, subtraction, and multiplication are done the same way as for integers
Multiplication of fixed point numbers needs care: the raw product of two numbers with $f$ fractional bits carries $2f$ fractional bits, so we have to shift the result back into what the fixed point format requires, discarding (and thereby rounding away) the extra bits
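A minimal sketch of this reformatting step, assuming an unsigned format with `F` fractional bits (names `F`, `to_fixed`, `fixed_mul` are mine):

```python
# Sketch: fixed-point multiplication. The raw integer product has 2F
# fractional bits, so we shift right by F to return to the format.

F = 8  # fractional bits

def to_fixed(x):
    return round(x * 2 ** F)

def fixed_mul(a, b):
    return (a * b) >> F   # discard the extra F fractional bits (truncates)

a = to_fixed(1.5)         # 384
b = to_fixed(2.25)        # 576
p = fixed_mul(a, b)
print(p / 2 ** F)         # 3.375 (exact here; in general the shift rounds)
```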
Floating-Point Arithmetic
Because we don’t want numbers to be redundant, we have to normalize the significands.
Addition/Subtraction
We format the numbers $x$ and $y$ as $m_x \cdot 2^{e_x}$ and $m_y \cdot 2^{e_y}$
The result is $z = x \pm y$, formatted as $m_z \cdot 2^{e_z}$. The result is also normalized
Algorithm
Four main steps to compute $z = x \pm y$ and produce the result
- We add/subtract the significands (mantissas), and set the exponent
  - The mantissa of the number with the smaller exponent has to be multiplied by $2^{e_{\text{small}} - e_{\text{large}}}$ (we call this alignment; it can be simplified to a simple right shift)
  - We then add/subtract this value to/from the mantissa of the other number
- We normalize the result, and adjust the exponent (if required)
- Round the result, then normalize and adjust the exponent again (if required)
- We set the value to one of the special values if required
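The steps above can be sketched on toy numbers. This is an assumption-laden simplification: significands are stored as integers with `FRAC` fraction bits plus the leading 1, both operands are positive, and rounding is plain truncation rather than round-to-nearest-even:

```python
# Sketch of floating-point addition: align, add, normalize.
# A normalized significand m satisfies 2**FRAC <= m < 2**(FRAC + 1).
FRAC = 23

def fp_add(mx, ex, my, ey):
    # Step 1: align to the larger exponent, then add the significands.
    if ex < ey:
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    my >>= (ex - ey)               # alignment = right shift of the smaller operand
    m, e = mx + my, ex
    # Step 2: normalize (adding two normalized significands can overflow).
    if m >= 2 ** (FRAC + 1):
        m >>= 1                    # truncating round folded into the shift
        e += 1
    return m, e

# 1.5 * 2**0 + 1.0 * 2**1 = 3.5 = 1.75 * 2**1
m, e = fp_add(3 * 2 ** 22, 0, 2 ** 23, 1)
print(m / 2 ** FRAC, e)            # 1.75 1
```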
Alignment
Definition
Shifting one operand's significand so that both operands end up with the same exponent. Three approaches:
- Align the operands to an arbitrary common exponent
- Align to the minimum of the two exponents
- Align to the maximum of the two exponents
We prefer the third approach, as it is cheaper and produces less error: we only remove the least significant bits of the smaller operand, so the result differs by a tiny amount
Normalization
When we normalize a number, various situations may occur
- The result is already normalized, so we do nothing
- The significand might overflow
- We shift the result by one position
- We increment the exponent by one
- When subtracting, the result might have leading 0s
- We shift the result to the left until there are no more leading 0s
- Each time we shift, we decrement the exponent
Rounding
The result might not be representable, as it can be situated in between two values. We perform rounding towards the nearest value, and to the even number when there is a tie
We prefer to tie to the nearest even number, as it leads to smaller errors when the result is divided by two
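Python's built-in `round()` happens to use the same rule (round to nearest, ties to even), so it can illustrate the tie-breaking described above:

```python
# Sketch: round-to-nearest, ties-to-even. On a tie, the even
# neighbour wins; otherwise we go to the nearest value as usual.
print(round(0.5))   # 0
print(round(1.5))   # 2
print(round(2.5))   # 2
print(round(3.5))   # 4
```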
Max Round-off Error
If a number is really big, the absolute error can be huge. In fact, the maximum rounding error happens when the exponent is at its maximum value and the exact value falls exactly halfway between two representable numbers (a tie); it is then half a ULP at that exponent, $\tfrac{1}{2} \cdot 2^{e_{\max} - f}$
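This can be seen with the standard library: `math.ulp` gives the gap to the next representable float, and the worst rounding error under round-to-nearest is half of it, growing with the exponent:

```python
# Sketch: the ULP (and thus the max rounding error, ulp/2) grows
# with the magnitude of the number.
import math

for x in (1.0, 2.0 ** 52, 2.0 ** 1023):
    ulp = math.ulp(x)
    print(x, ulp, ulp / 2)   # worst-case absolute rounding error is ulp/2
```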