Source coding is a way to compress the source, i.e., to describe the source output efficiently.
The source is specified by its alphabet $\mathcal{X}$ and by its statistics, i.e., the probability distribution $p_X$ of its output.
```mermaid
flowchart LR
    a["`Source`"]
    b@{ shape: text, label: "Blabla" }
    c["`Encoder`"]
    d@{ shape: text, label: "426c61626c61" }
    a --> b
    b --> c
    c --> d
```
The encoder is specified by:
- The input alphabet $\mathcal{X}$ (the source alphabet)
- the output alphabet $\mathcal{D} = \{0, 1, \dots, D-1\}$ (the code is then called $D$-ary)
- the codebook $\mathcal{C}$, which consists of finite sequences over the output alphabet
- the one-to-one encoding map $\Gamma : \mathcal{X} \to \mathcal{C}$
Example
But we want to avoid codes in which decoding can be ambiguous
Definition
We say that a code is uniquely decodable if every concatenation of codewords has a unique parsing into a sequence of codewords (i.e., the extension of the encoding map to sequences is one-to-one)
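Unique decodability can be tested mechanically. Below is a sketch of the Sardinas–Patterson test (not covered in these notes; the function name is my own): the code is uniquely decodable if and only if no "dangling suffix" is itself a codeword.

```python
def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test: a code is uniquely decodable iff no
    dangling suffix is itself a codeword."""
    cw = set(codewords)
    # Initial dangling suffixes: leftovers when one codeword is a
    # proper prefix of another.
    suffixes = {b[len(a):] for a in cw for b in cw if a != b and b.startswith(a)}
    seen = set()
    while suffixes - seen:
        s = (suffixes - seen).pop()
        seen.add(s)
        if s in cw:
            return False  # an ambiguous parsing exists
        for c in cw:
            if c.startswith(s):
                suffixes.add(c[len(s):])  # suffix is a prefix of a codeword
            if s.startswith(c):
                suffixes.add(s[len(c):])  # codeword is a prefix of the suffix
    return True
```

For instance, {0, 01, 10} is not uniquely decodable (010 parses as 0·10 or 01·0), while {0, 01, 11} is.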
Example
The code is not uniquely decodable
Example
This code is uniquely decodable, because it is of fixed length.
Theorem
A fixed-length code is always uniquely decodable
Definition
A code is prefix-free if no codeword is a prefix of another codeword
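The prefix-free property is easy to check directly; here is a small sketch (the function name is my own):

```python
def is_prefix_free(codewords):
    """True iff no codeword is a prefix of another one."""
    words = sorted(codewords)
    # After lexicographic sorting, any prefix relation shows up
    # between adjacent entries.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))
```

For example, {0, 10, 11} is prefix-free, while {0, 01, 11} is not (0 is a prefix of 01).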
Theorem
- A prefix-free code is always uniquely decodable
- A uniquely decodable code is not necessarily prefix-free
Definition
A prefix-free code is also called an instantaneous code
Theorem, Kraft-McMillan's inequality
If a $D$-ary code is uniquely decodable, then its codeword lengths $l_1, \dots, l_n$ satisfy
$$\sum_{i=1}^{n} D^{-l_i} \le 1$$
So if Kraft’s inequality is not fulfilled, there is no uniquely decodable code with these codeword lengths (contrapositive)
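The left-hand side of the inequality is a one-liner (the function name is my own):

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft-McMillan inequality."""
    return sum(D ** -l for l in lengths)
```

Binary lengths (1, 2, 2) give a sum of exactly 1 (a full binary tree), while (1, 1, 2) give 1.25 > 1, so no uniquely decodable binary code with lengths (1, 1, 2) exists.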
If the positive integers $l_1, \dots, l_n$ satisfy Kraft's inequality for some integer $D \ge 2$, then there exists a $D$-ary prefix-free code (hence decodable) with these codeword lengths
This implies that any uniquely decodable code can be replaced by a prefix-free code with the same codeword lengths
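The converse direction of the theorem is constructive: processing the lengths in increasing order, the $i$-th codeword can be taken as the first $l_i$ $D$-ary digits of the cumulative Kraft sum $\sum_{j<i} D^{-l_j}$. A sketch (the function name is my own), assuming the lengths satisfy Kraft's inequality:

```python
from fractions import Fraction

def prefix_free_from_lengths(lengths, D=2):
    """Greedy Kraft construction: codeword i = first l_i D-ary digits
    of sum_{j < i} D**-l_j (lengths processed in increasing order).
    Fraction keeps the cumulative sum exact."""
    code, cum = [], Fraction(0)
    for l in sorted(lengths):
        x, digits = cum, []
        for _ in range(l):
            x *= D
            d = int(x)            # next D-ary digit of the expansion
            digits.append(str(d))
            x -= d
        code.append("".join(digits))
        cum += Fraction(1, D ** l)
    return code
```

For lengths (1, 2, 2) and $D = 2$ this yields the prefix-free code {0, 10, 11}.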
We will focus on prefix-free codes
- No loss of optimality
- A prefix-free codeword is recognized as soon as its last digit is seen, so there is less decoding delay
Definition
The average codeword-length is given by
$$\bar{L} = \sum_{x \in \mathcal{X}} p_X(x)\, l(x)$$
where $l(x)$ denotes the length of the codeword assigned to $x$.
Theorem
Let $\Gamma$ be the encoding map of a base-$D$ code for the random variable $X$. If the code is uniquely decodable, then the average codeword-length is lower bounded by the entropy of $X$:
$$\bar{L} \ge H_D(X)$$
Theorem
For every random variable $X$, and for every integer $D \ge 2$, there exists a prefix-free $D$-ary code for $X$ such that
$$l(x) = \left\lceil \log_D \frac{1}{p_X(x)} \right\rceil \quad \text{for every } x \in \mathcal{X}$$
We call those codes $D$-ary Shannon-Fano codes
But these codes seem complicated for nothing! Well no, they are actually rather good:
Theorem
The average codeword-length of a $D$-ary Shannon-Fano code for the random variable $X$ fulfills
$$H_D(X) \le \bar{L} < H_D(X) + 1$$
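A sketch of the Shannon–Fano length assignment together with a numerical check of the bound (the function names and the example distribution are my own):

```python
import math

def shannon_fano_lengths(probs, D=2):
    """l(x) = ceil(-log_D p(x)); these lengths always satisfy Kraft."""
    # A tiny epsilon guards against log() landing just above an
    # integer through floating-point error.
    return [math.ceil(-math.log(p, D) - 1e-12) for p in probs]

def entropy(probs, D=2):
    """H_D(X): the entropy of X measured in base D."""
    return -sum(p * math.log(p, D) for p in probs)

probs = [0.4, 0.3, 0.3]
lengths = shannon_fano_lengths(probs)              # [2, 2, 2]
avg = sum(p * l for p, l in zip(probs, lengths))   # 2.0
# The theorem's bound holds: entropy(probs) is about 1.571,
# and 1.571 <= 2.0 < 2.571.
```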
Sadly they are not optimal: examples can be constructed where another code has a smaller average length
Example
Let’s say we have a letter that has an absurdly small probability. Its codeword will be very long, and could probably be truncated, because the code tree has many empty nodes
Optimal solution: Huffman code
We start with all the leaves of the tree, weighted by their probabilities. We order them by weight and join the two nodes with the lowest weights into a new node whose weight is their sum. We keep going until a single root remains.
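The merging procedure above can be sketched with a heap (binary case; the function name is my own):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman construction: repeatedly merge the two
    lowest-weight nodes until one root remains, prepending a bit
    to every codeword in the merged subtrees."""
    tiebreak = count()  # unique counter: dicts are never compared on ties
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)  # lowest weight
        w1, _, c1 = heapq.heappop(heap)  # second lowest
        merged = {s: "0" + cw for s, cw in c0.items()}
        merged.update({s: "1" + cw for s, cw in c1.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]  # symbol index -> codeword
```

For probabilities (1/2, 1/4, 1/4) this produces codeword lengths (1, 2, 2), matching the entropy of 1.5 bits exactly.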
Lemma
The average path length of a tree with leaf probabilities is the sum of the probabilities of the intermediate nodes (including the root)
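A quick check of the lemma on a three-leaf tree (the probabilities are my own example): leaves 1/2, 1/4, 1/4 sit at depths 1, 2, 2, and the intermediate nodes are the root (weight 1) and the node merging the two 1/4-leaves (weight 1/2).

```python
probs, depths = [0.5, 0.25, 0.25], [1, 2, 2]
avg_path_length = sum(p * d for p, d in zip(probs, depths))  # 1.5
intermediate_node_sum = 1.0 + 0.5  # root + node joining the 1/4-leaves
assert avg_path_length == intermediate_node_sum
```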
Theorem
A Huffman code is guaranteed to be optimal: no uniquely decodable code for the same source has a smaller average codeword-length.