We explore the role of entropy in prediction and learning problems.
Example — Classify Images
For every image $x$, our neural network outputs a probability distribution $p(\cdot \mid x)$ over all possible labels.
For every image $x$, the true label distribution is $q(\cdot \mid x)$.
Ideally, we want $p(y \mid x) = q(y \mid x)$ for every image-label pair $(x, y)$.
→ This will never happen in practice!
Instead, we consider the cross-entropy loss.
Concretely, for every image $x$, we wish to minimize the loss $H\big(q(\cdot \mid x),\, p(\cdot \mid x)\big)$.
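This per-image objective is what standard deep learning libraries implement directly. Here is a minimal PyTorch sketch (the toy model, image shapes, and labels are illustrative assumptions, not from the original); note that `nn.CrossEntropyLoss` takes raw logits and applies the softmax internally:

```python
import torch
import torch.nn as nn

# A tiny stand-in classifier: 28x28 grayscale images, 10 labels
# (the architecture and sizes are assumptions for illustration).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

images = torch.randn(4, 1, 28, 28)   # a batch of 4 random "images"
labels = torch.tensor([3, 7, 0, 1])  # true class indices

logits = model(images)               # unnormalized scores, shape (4, 10)

# Cross-entropy between the one-hot label distribution q and the
# network's softmax distribution p, averaged over the batch.
loss = nn.CrossEntropyLoss()(logits, labels)
print(loss.item())
```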
Definition
The cross-entropy loss between two distributions $q$ and $p$ is
$$H(q, p) = -\sum_{y} q(y) \log p(y).$$
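A direct NumPy sketch of this definition (the distributions $q$ and $p$ below are made-up numbers for illustration):

```python
import numpy as np

def cross_entropy(q, p, eps=1e-12):
    """H(q, p) = -sum_y q(y) * log p(y); eps guards against log(0)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return -np.sum(q * np.log(p + eps))

q = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot true label distribution
p = np.array([0.1, 0.2, 0.6, 0.1])  # network's predicted distribution

print(cross_entropy(q, p))  # = -log(0.6) ≈ 0.511
```

When $q$ is one-hot, as in supervised classification, the sum collapses to $-\log p(y_{\text{true}})$, the negative log-likelihood of the correct label.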
Theorem
For a fixed probability distribution $q$, the minimum
$$\min_{p} \, H(q, p)$$
is attained iff we select $p = q$, and in this case the minimum is the entropy of $q$:
$$H(q) = H(q, q) = -\sum_{y} q(y) \log q(y).$$
This follows from the decomposition $H(q, p) = H(q) + D_{\mathrm{KL}}(q \,\|\, p)$ together with $D_{\mathrm{KL}}(q \,\|\, p) \ge 0$, with equality iff $p = q$ (Gibbs' inequality).
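A quick numerical check of the theorem (the binary distribution $q = (0.3, 0.7)$ is an arbitrary assumption): scanning candidate predictions $p = (t, 1 - t)$ shows the loss is minimized at $p = q$, where it equals $H(q)$:

```python
import numpy as np

def cross_entropy(q, p, eps=1e-12):
    return -np.sum(q * np.log(p + eps))

q = np.array([0.3, 0.7])          # fixed "true" distribution
entropy_q = cross_entropy(q, q)   # H(q) = H(q, q)

# Evaluate H(q, p) on a grid of candidate distributions p = (t, 1 - t)
ts = np.linspace(0.01, 0.99, 99)
losses = [cross_entropy(q, np.array([t, 1 - t])) for t in ts]
best = ts[int(np.argmin(losses))]

print(f"argmin ≈ ({best:.2f}, {1 - best:.2f})")                  # (0.30, 0.70)
print(f"min loss ≈ {min(losses):.4f}, H(q) ≈ {entropy_q:.4f}")   # both ≈ 0.6109
```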