We explore the role of entropy in prediction and learning problems.

Example — Classify Images

For every image $x$, our neural network outputs a probability distribution $q(\cdot \mid x)$ over all possible labels $y \in \mathcal{Y}$.

For every image $x$, the true label distribution is $p(\cdot \mid x)$ (in practice, a one-hot distribution concentrated on the correct label).

Ideally, we want $q(y \mid x) = p(y \mid x)$ for every image-label pair $(x, y)$.

This will never happen!

Instead, we consider the cross-entropy loss.

Concretely, for every image $x$, we wish to minimize the loss $H\big(p(\cdot \mid x),\, q(\cdot \mid x)\big)$.
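A minimal numpy sketch of this setup, using the loss defined formally below. The 4-label problem, the `logits` array, and the true class are made up for illustration; the softmax step is one standard way a network turns raw scores into a distribution $q$.

```python
import numpy as np

# Hypothetical example: 4 possible labels, one image.
# `logits` stands in for the raw scores our network produces;
# the true label is class 2, so the target distribution p is one-hot.
logits = np.array([1.2, 0.3, 2.5, -0.8])
p = np.array([0.0, 0.0, 1.0, 0.0])   # true label distribution

# Softmax turns the scores into the network's distribution q over labels.
q = np.exp(logits - logits.max())
q /= q.sum()

# Cross-entropy loss for this image: -sum_y p(y) log q(y).
# With a one-hot p this reduces to -log q(true label).
loss = -np.sum(p * np.log(q))
print(loss)   # equals -np.log(q[2])
```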

Definition

The cross-entropy loss between two distributions $p$ and $q$ over the same label set $\mathcal{Y}$ is
$$H(p, q) = -\sum_{y \in \mathcal{Y}} p(y) \log q(y).$$
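One way to implement this definition directly; the `cross_entropy` helper and the example distributions are hypothetical. Terms with $p(y) = 0$ are skipped, following the convention $0 \log 0 = 0$.

```python
import numpy as np

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """H(p, q) = -sum_y p(y) log q(y), in nats.

    Terms where p(y) == 0 contribute nothing, so we mask them out
    to avoid 0 * log(0) warnings.
    """
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

# Two hypothetical distributions over three labels.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(cross_entropy(p, q))   # H(p, q)
print(cross_entropy(p, p))   # H(p, p) = H(p), the entropy of p
```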

Theorem

For a fixed probability distribution $p$, the minimum
$$\min_{q} H(p, q)$$
is attained iff we select $q = p$, and in this case the minimum is given by the entropy of $p$:
$$H(p, p) = H(p) = -\sum_{y} p(y) \log p(y).$$
(This follows from the decomposition $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, since the KL divergence is nonnegative and vanishes iff $q = p$.)
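A quick numerical sanity check of the theorem, not a proof: sample many random candidate distributions $q$ and verify that $H(p, q)$ never drops below $H(p)$. The fixed $p$ and the Dirichlet sampling scheme are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])

def H(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

entropy_p = H(p, p)   # H(p) = -sum_y p(y) log p(y)

# Sample random candidate distributions q over the 3 labels and check
# that H(p, q) >= H(p), the lower bound attained at q = p.
for _ in range(10000):
    q = rng.dirichlet(np.ones(3))
    assert H(p, q) >= entropy_p - 1e-12

print(f"H(p) = {entropy_p:.4f}  (lower bound attained at q = p)")
```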