The idea is to pack many source symbols into supersymbols and encode each block as a whole.

Theorem

The average codeword length $\bar{L}$ of a uniquely decodable code for the block $(X_1, \dots, X_n)$ must satisfy

$$\bar{L} \ge H(X_1, \dots, X_n).$$

And there exists a uniquely decodable code satisfying

$$\bar{L} < H(X_1, \dots, X_n) + 1.$$

(Here $H(X_1, \dots, X_n)$ is the joint entropy of the block.)
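Dividing by $n$ shows why supersymbols help: for a block of $n$ fair coin flips (the source defined below), $H(X_1, \dots, X_n) = n$ bits, so the per-symbol rate is squeezed into

$$1 \le \frac{\bar{L}}{n} < 1 + \frac{1}{n},$$

and the $+1$ overhead of the code is amortized over the whole block.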

Let’s understand $H(X_1, \dots, X_n)$. For that we need to

  • understand conditional entropy
  • understand how to model many random variables

Conditional Probability

Definition (coin flip source)

The source models a sequence $X_1, X_2, \dots$ of independent fair coin flips. So $\mathcal{X} = \{\mathrm{H}, \mathrm{T}\}$, with $X_i = \mathrm{H}$ meaning flip $i$ is heads and $X_i = \mathrm{T}$ meaning it is tails, for all $i$. Hence $P(X_i = \mathrm{H}) = P(X_i = \mathrm{T}) = \frac{1}{2}$.

Definition (Sunny-Rainy Source)

The source models a sequence $X_1, X_2, \dots$, with $X_i = \mathrm{s}$ being sunny weather and $X_i = \mathrm{r}$ being rainy weather on day $i$. The weather is uniformly distributed in $\{\mathrm{s}, \mathrm{r}\}$ on the first day. For all other days, the weather stays the same as on the previous day with probability $p$.
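To make the model concrete, here is a minimal simulation sketch of the Sunny-Rainy source; the stay-probability $p = 0.9$ is an arbitrary illustrative value, not one fixed above.

```python
import random

def sunny_rainy(n, p=0.9, seed=0):
    """Sample n days of the Sunny-Rainy source.

    Day 1 is uniform over {'s', 'r'}; every later day keeps the
    previous day's weather with probability p, else it flips.
    """
    rng = random.Random(seed)
    days = [rng.choice("sr")]
    for _ in range(n - 1):
        if rng.random() < p:
            days.append(days[-1])                         # stays the same
        else:
            days.append("r" if days[-1] == "s" else "s")  # flips
    return "".join(days)

print(sunny_rainy(30))  # long runs of the same weather
```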

Since we have already seen conditional probability, I won’t re-explain it here.

Conditional Expectation of $X$ given $Y = y$

Definition

The conditional expectation of $X$ given $Y = y$ is defined as

$$E[X \mid Y = y] = \sum_{x \in \mathcal{X}} x \, P(X = x \mid Y = y).$$
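As a quick sanity check, here is a sketch that computes $E[X \mid Y = y]$ from a small joint pmf; the table values are made up for illustration.

```python
# Joint pmf P(X=x, Y=y) as a dict; the values are illustrative only.
joint = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}

def cond_expectation(joint, y):
    """E[X | Y=y] = sum over x of x * P(X=x | Y=y)."""
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(x * p / p_y for (x, yy), p in joint.items() if yy == y)

print(cond_expectation(joint, 0))  # 0.1 / 0.4 = 0.25
```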

Every probability distribution has an associated entropy, so the conditional distribution $P(X \mid Y = y)$ has one too.

Definition

The conditional entropy of $X$ given $Y = y$ is defined as

$$H(X \mid Y = y) = -\sum_{x \in \mathcal{X}} P(X = x \mid Y = y) \log P(X = x \mid Y = y).$$

Theorem

The conditional entropy of a discrete random variable $X$ knowing $Y = y$ satisfies

$$0 \le H(X \mid Y = y) \le \log |\mathcal{X}|,$$

with equality on the left iff $P(X = x \mid Y = y) = 1$ for some $x$, and with equality on the right iff $P(X = x \mid Y = y) = \frac{1}{|\mathcal{X}|}$ for all $x$.

However, conditional entropy given a single outcome is not bounded by the entropy of its random variable: $H(X \mid Y = y) \le H(X)$ does not hold in general.

You might find a hundred papers in a machine learning course that say this inequality holds, but that's because they don't like doing real work.
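Here is a minimal counterexample sketch; the joint distribution is made up for illustration. Observing an unlikely outcome of $Y$ can increase the uncertainty about $X$.

```python
from math import log2

def H(dist):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(p * log2(p) for p in dist if p > 0)

# Illustrative joint: Y=0 ("normal day", prob 0.98) forces X=0;
# the rare Y=1 ("weird day", prob 0.02) makes X a fair coin.
p_x        = [0.99, 0.01]   # marginal of X
p_x_given1 = [0.5, 0.5]     # P(X | Y=1)

print(H(p_x))         # ~0.081 bits
print(H(p_x_given1))  # 1.0 bit > H(X)
```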

Definition

The conditional entropy of $X$ given $Y$ is defined as

$$H(X \mid Y) = \sum_{y \in \mathcal{Y}} P(Y = y)\, H(X \mid Y = y).$$

Theorem

The conditional entropy of a discrete random variable $X$ given $Y$ satisfies

$$0 \le H(X \mid Y) \le \log |\mathcal{X}|,$$

with equality on the left iff for every $y$ there is some $x$ with $P(X = x \mid Y = y) = 1$, and with equality on the right iff $P(X = x \mid Y = y) = \frac{1}{|\mathcal{X}|}$ for all $x$ and all $y$.
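A sketch that computes $H(X \mid Y)$ as the weighted average of the per-outcome entropies, reusing the same made-up joint pmf as before:

```python
from collections import defaultdict
from math import log2

def H(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

def cond_entropy(joint):
    """H(X|Y) = sum over y of P(y) * H(X|Y=y), from a joint pmf {(x, y): p}."""
    p_y = defaultdict(float)
    for (x, y), p in joint.items():
        p_y[y] += p
    return sum(py * H([p / py for (x, yy), p in joint.items() if yy == y])
               for y, py in p_y.items())

joint = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}
print(cond_entropy(joint))  # ~0.876 bits
```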

Conditioning reduces entropy

Theorem

For any two discrete random variables $X$ and $Y$,

$$H(X \mid Y) \le H(X),$$

with equality if and only if $X$ and $Y$ are independent.

(This makes sense: if $Y$ gives no information about the state of $X$, why would it affect its entropy?) On average, the uncertainty about $X$ can only become smaller if we know $Y$.
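Reusing the made-up "weird day" distribution from above: even though $H(X \mid Y = 1)$ exceeds $H(X)$, the average $H(X \mid Y)$ stays below $H(X)$, as the theorem requires.

```python
from math import log2

def H(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

# P(Y=0)=0.98 with X=0 for sure; P(Y=1)=0.02 with X a fair coin.
h_x = H([0.99, 0.01])                                 # ~0.081 bits
h_x_given_y = 0.98 * H([1.0]) + 0.02 * H([0.5, 0.5])  # 0.02 bits
print(h_x_given_y <= h_x)                             # True
```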

Theorem

For any three discrete random variables $X$, $Y$ and $Z$,

$$H(X \mid Y, Z) \le H(X \mid Z),$$

with equality if and only if $X$ and $Y$ are conditionally independent given $Z$.

Chain Rule for Entropy

We can rewrite $H(X, Y)$ as

$$H(X, Y) = H(X) + H(Y \mid X).$$

Or, in English: to find the joint entropy of two random variables, first calculate the entropy of one of the two, then add the conditional entropy of the second given the first.

We can chain this a bunch of times (hence the name).

Theorem

Let $X_1, \dots, X_n$ be discrete random variables. Then

$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}).$$
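For example, unrolled for three variables:

$$H(X_1, X_2, X_3) = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2).$$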

Infinite alphabet prefix-free code

Let’s do an example with the positive integers as the alphabet.

The binary expansions of the integers do not form a prefix-free code: for example, the codeword $\mathtt{1}$ for $1$ is a prefix of the codeword $\mathtt{10}$ for $2$. Therefore it is not desirable.
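A quick sketch makes the prefix collisions explicit:

```python
# Plain binary expansions of 1..8; the pairs below violate prefix-freeness.
codes = {n: format(n, "b") for n in range(1, 9)}
clashes = [(a, b) for a in codes for b in codes
           if a != b and codes[b].startswith(codes[a])]
print(clashes)  # [(1, 2), (1, 3), (1, 4), ...] -- every code clashes with '1'
```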

Elias Code

What if we prefix the binary expansion with zeros? Concretely, we prepend $\lfloor \log_2 n \rfloor$ zeros to the binary expansion of $n$. The code is prefix-free, but it is rather long: the codeword for $n$ takes $2\lfloor \log_2 n \rfloor + 1 \approx 2 \log_2 n$ bits. This is not optimal.
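A minimal sketch of this construction (the Elias gamma code), assuming positive integers $n \ge 1$:

```python
def elias_gamma(n):
    """floor(log2 n) zeros, then the binary expansion of n."""
    assert n >= 1
    b = format(n, "b")            # binary expansion, e.g. 9 -> '1001'
    return "0" * (len(b) - 1) + b

for n in [1, 2, 9]:
    print(n, elias_gamma(n))      # 1 -> '1', 2 -> '010', 9 -> '0001001'
```

A decoder counts the leading zeros, say $z$ of them, and then reads the next $z + 1$ bits; that is what makes the code prefix-free.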

Elias Code 2

We prefix the binary expansion with the Elias code of the length of the binary expansion. This brings the codeword length down to roughly $\log_2 n + 2 \log_2 \log_2 n$ bits.
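A sketch of this second construction (essentially the Elias delta code). One detail is assumed here because the notes don't spell it out: the leading $\mathtt{1}$ of the binary expansion is dropped, since the decoder can restore it once it knows the length.

```python
def elias_gamma(n):
    assert n >= 1
    b = format(n, "b")
    return "0" * (len(b) - 1) + b

def elias_delta(n):
    """Gamma-code the length of the binary expansion, then append the
    expansion without its leading 1 (implied by the length)."""
    assert n >= 1
    b = format(n, "b")
    return elias_gamma(len(b)) + b[1:]

for n in [1, 2, 9]:
    print(n, elias_delta(n))  # 1 -> '1', 2 -> '0100', 9 -> '00100001'
```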