The idea is to pack many symbols into supersymbols (blocks of $n$ source symbols) and encode those as units.
Theorem
The average codeword length $\bar{\ell}$ of a uniquely decodable code for the block $(X_1, \dots, X_n)$ must satisfy:
$$\bar{\ell} \ge H(X_1, \dots, X_n)$$
And there exists a uniquely decodable code satisfying
$$\bar{\ell} < H(X_1, \dots, X_n) + 1$$
(Same bounds as in the single-symbol case, but now with the joint entropy.)
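Dividing both bounds by the block length $n$ (a step the notes imply) shows why supersymbols help:
$$\frac{H(X_1, \dots, X_n)}{n} \;\le\; \frac{\bar{\ell}}{n} \;<\; \frac{H(X_1, \dots, X_n)}{n} + \frac{1}{n}$$
The $+1$ overhead is amortized over $n$ symbols, so the per-symbol rate gets within $1/n$ of the entropy per symbol.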
Let’s understand $H(X_1, \dots, X_n)$; for that we need to
- understand conditional entropy
- understand how to model many random variables
Conditional Probability
Definition (coin flip source)
The source models a sequence $X_1, X_2, \dots$ of i.i.d. random variables. So $X_i \in \{\mathrm{H}, \mathrm{T}\}$, with $X_i = \mathrm{H}$ if flip $i$ is heads and $X_i = \mathrm{T}$ if it is tails, for all $i$. Hence $P(X_i = \mathrm{H}) = P(X_i = \mathrm{T}) = \frac{1}{2}$.
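A tiny sampler for this source, as a sanity check (the function name and the H/T encoding are my choice):

```python
import random

def coin_flip_source(n, seed=None):
    """Sample n i.i.d. fair coin flips, 'H' or 'T' each with probability 1/2."""
    rng = random.Random(seed)
    return "".join(rng.choice("HT") for _ in range(n))

print(coin_flip_source(10, seed=0))  # output depends on the seed
```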
Definition (Sunny-Rainy Source)
The source models a sequence $X_1, X_2, \dots$. So $X_i \in \{\mathrm{S}, \mathrm{R}\}$, with $X_i = \mathrm{S}$ being sunny weather and $X_i = \mathrm{R}$ being rainy weather. The weather on day $1$ is uniformly distributed in $\{\mathrm{S}, \mathrm{R}\}$. For all other days, the weather stays the same as on the previous day with probability $p$.
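Same idea for this Markov source; the stay-probability is left symbolic above, so `p_stay = 0.9` below is just a placeholder:

```python
import random

def sunny_rainy_source(n, p_stay=0.9, seed=None):
    """Sample n days: day 1 uniform over {'S','R'}; afterwards the weather
    stays the same as the previous day with probability p_stay."""
    rng = random.Random(seed)
    days = [rng.choice("SR")]
    for _ in range(n - 1):
        if rng.random() < p_stay:
            days.append(days[-1])                         # stays the same
        else:
            days.append("R" if days[-1] == "S" else "S")  # flips
    return "".join(days)

print(sunny_rainy_source(20, seed=0))  # long runs of the same letter
```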
Since we have already seen conditional probability, I won’t re-explain it.
Conditional Expectation of $X$ given $Y = y$
Definition
The conditional expectation of $X$ given $Y = y$ is defined as
$$E[X \mid Y = y] = \sum_{x} x \, P(X = x \mid Y = y)$$
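A quick numerical sketch, representing the joint pmf as a dict from $(x, y)$ pairs to probabilities (the representation and the toy numbers are mine):

```python
def cond_expectation(p_xy, y):
    """E[X | Y = y] = sum_x x * P(X = x | Y = y), from a joint pmf dict."""
    p_y = sum(p for (x, yy), p in p_xy.items() if yy == y)
    return sum(x * p / p_y for (x, yy), p in p_xy.items() if yy == y)

# Toy joint pmf of (X, Y) over {0,1} x {0,1}:
p_xy = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}
print(cond_expectation(p_xy, y=1))  # 0.4 / 0.6 ~ 0.667
```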
To every probability distribution we can associate an entropy, so conditional probabilities give us conditional entropies.
Definition
The conditional entropy of $X$ given $Y = y$ is defined as
$$H(X \mid Y = y) = -\sum_{x} P(x \mid y) \log_2 P(x \mid y)$$
Theorem
The conditional entropy of a discrete random variable $X$ knowing $Y = y$ satisfies
$$0 \le H(X \mid Y = y) \le \log_2 |\mathcal{X}|$$
with equality on the left iff $P(x \mid y) = 1$ for some $x$, and with equality on the right iff $P(x \mid y) = \frac{1}{|\mathcal{X}|}$ for all $x$.
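With the same dict representation, $H(X \mid Y = y)$ is short to compute and check against these bounds (toy numbers again mine):

```python
from math import log2

def cond_entropy_given_y(p_xy, y):
    """H(X | Y = y) = -sum_x P(x|y) * log2 P(x|y), from a joint pmf dict."""
    p_y = sum(p for (x, yy), p in p_xy.items() if yy == y)
    return -sum((p / p_y) * log2(p / p_y)
                for (x, yy), p in p_xy.items() if yy == y and p > 0)

p_xy = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}
print(cond_entropy_given_y(p_xy, y=1))  # h(1/3) ~ 0.918, between 0 and log2(2) = 1
```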
However, $H(X \mid Y = y)$ is not bounded by the entropy of its random variable: it can happen that $H(X \mid Y = y) > H(X)$.
You might find a hundred papers in a machine learning course claiming that $H(X \mid Y = y) \le H(X)$ always holds, but that’s because they don’t like doing real work.
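A concrete counterexample, with made-up numbers: $X$ is nearly deterministic, but conditioned on the rare event $Y = 1$ it becomes uniform, so $H(X \mid Y = 1) > H(X)$:

```python
from math import log2

def entropy(p):
    return -sum(q * log2(q) for q in p.values() if q > 0)

# X is almost surely 0 ... but given the rare event Y = 1, X is uniform.
p_xy = {(0, 0): 0.98, (0, 1): 0.01, (1, 1): 0.01}
p_x = {0: 0.99, 1: 0.01}
h_x = entropy(p_x)                        # ~0.081 bits
h_x_given_y1 = entropy({0: 0.5, 1: 0.5})  # exactly 1 bit
print(h_x_given_y1 > h_x)                 # True: H(X | Y = 1) > H(X)
```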
Definition
The conditional entropy of $X$ given $Y$ is defined as
$$H(X \mid Y) = \sum_{y} P(y)\, H(X \mid Y = y) = -\sum_{x, y} P(x, y) \log_2 P(x \mid y)$$
Theorem
The conditional entropy of a discrete random variable $X$ given $Y$ satisfies
$$0 \le H(X \mid Y) \le \log_2 |\mathcal{X}|$$
with equality on the left iff for every $y$ with $P(y) > 0$ there is some $x$ with $P(x \mid y) = 1$, and with equality on the right iff $P(x \mid y) = \frac{1}{|\mathcal{X}|}$ for all $x$ and all $y$ with $P(y) > 0$.
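Computing $H(X \mid Y)$ on the running toy joint pmf (numbers mine), which indeed lands between the two bounds:

```python
from math import log2

def cond_entropy(p_xy):
    """H(X | Y) = -sum_{x,y} P(x,y) * log2 P(x|y), from a joint pmf dict."""
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

p_xy = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}
print(cond_entropy(p_xy))  # ~0.875 bits, between 0 and log2(2) = 1
```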
Conditioning reduces entropy
Theorem
For any two discrete random variables $X$ and $Y$,
$$H(X \mid Y) \le H(X)$$
with equality $\iff$ $X$ and $Y$ are independent.
(Makes sense: if $Y$ gives no information about the state of $X$, why would it affect its entropy?) On average, the uncertainty about $X$ can only become smaller if we know $Y$.
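A quick numerical check of the inequality on the toy example (where $X$ and $Y$ are dependent, so the inequality is strict):

```python
from math import log2

def entropy(p):
    return -sum(q * log2(q) for q in p.values() if q > 0)

p_xy = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p
h_x_given_y = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items())
print(h_x_given_y <= entropy(p_x))  # True: ~0.875 <= 1.0
```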
Theorem
For any three discrete random variables $X$, $Y$ and $Z$,
$$H(X \mid Y, Z) \le H(X \mid Z)$$
with equality $\iff$ $X$ and $Y$ are conditionally independent given $Z$.
Chain Rule for Entropy
We can rewrite $H(X, Y)$ as
$$H(X, Y) = H(X) + H(Y \mid X)$$
Or, in English: to find the joint entropy of two random variables, first calculate the entropy of one of the two, then add the conditional entropy of the second given the first.
→ We can chain this a bunch of times (hence the name)
Theorem
Let $X_1, \dots, X_n$ be discrete random variables. Then
$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$$
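A numerical sanity check of the two-variable case $H(X, Y) = H(X) + H(Y \mid X)$, again on the toy joint pmf:

```python
from math import log2

def entropy(p):
    return -sum(q * log2(q) for q in p.values() if q > 0)

p_xy = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
h_y_given_x = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items())
# H(X, Y) == H(X) + H(Y | X), up to floating point:
print(abs(entropy(p_xy) - (entropy(p_x) + h_y_given_x)) < 1e-12)  # True
```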
Infinite alphabet prefix-free code
Let’s do an example with the positive integers.
The binary expansion of a number $n$ is not prefix-free. Therefore it is not desirable as a code by itself.
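To see the problem concretely (a minimal sketch; `bin_expansion` is my name for it):

```python
def bin_expansion(n):
    """Binary expansion of n >= 1, without Python's '0b' prefix."""
    return bin(n)[2:]

# The codeword for 1 is a prefix of the codewords for 2 and 3, so a received
# string like '11' is ambiguous: it could be the pair (1, 1) or the single 3.
print(bin_expansion(1), bin_expansion(2), bin_expansion(3))  # 1 10 11
```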
Elias Code
What if we prefix the binary expansion of $n$ with $\lfloor \log_2 n \rfloor$ zeros? The code is prefix-free, but it is rather long: $2\lfloor \log_2 n \rfloor + 1 \approx 2 \log_2 n$ bits. → This is not optimal.
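A minimal sketch of this zero-padded code, usually called the Elias gamma code (function name mine):

```python
def elias_gamma(n):
    """Prefix the binary expansion of n >= 1 with floor(log2 n) zeros."""
    assert n >= 1
    b = bin(n)[2:]               # binary expansion, always starts with '1'
    return "0" * (len(b) - 1) + b

for n in [1, 2, 5, 17]:
    print(n, elias_gamma(n))     # 1 -> 1, 2 -> 010, 5 -> 00101, 17 -> 000010001
```

A decoder counts the leading zeros to learn the length of the expansion, which is what makes the code prefix-free.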
Elias Code 2
We prefix the binary expansion with the Elias code of its length.
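A sketch of this construction, usually called the Elias delta code; I follow the common variant that drops the leading '1' of the expansion, since the encoded length already implies it:

```python
def elias_gamma(n):
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(n):
    """Gamma-code the length of the binary expansion of n, then append the
    expansion without its leading '1' (the length already implies it)."""
    assert n >= 1
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]

for n in [1, 2, 5, 17]:
    print(n, elias_delta(n))     # 1 -> 1, 2 -> 0100, 5 -> 01101, 17 -> 001010001
```

The codeword length is now roughly $\log_2 n + 2 \log_2 \log_2 n$ bits, much closer to the $\log_2 n$ bits of the bare expansion than the gamma code's $2 \log_2 n$.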