Entropy can be understood as the minimum amount of data needed to be
transmitted in order to communicate a piece of information. Concretely,
this is the average number of bits needed to encode a message. For
example, imagine a spaceship sends a status code every minute to
indicate if it has found any alien civilization. The code is any letter
from the English alphabet, with A
meaning “nothing new”, and some other letter describing the alien. For
example F means the alien is friendly,
and B means the alien is blue, etc. For
simplicity assume every code is only 1-letter long. Then we can simply
encode this status code with \log(26)
bits, where A = 0 \dots 0, B = 0\dots 1,
\dots. However, we can be a little clever because we know most of
the time the spaceship won’t be finding new civilizations. In that case,
the satellite can remain silent to indicate “nothing new”; otherwise it
sends the status code with our original encoding ^{1}.

Then we only need to send on average little more than 0 bit per
minute. In general, we can save some bits if we know certain messages
occur with high/low probability. In other words, the minimum commucation
cost depends on the probability distribution of the data. Entropy
precisely formalizes this intuition. Formally, if X is a random variable with outcomes x_1, \dots, x_N each of probabilities p_1, \dots, p_N, then its
**entropy** is defined as:

`H(X) = \sum_i p_i \log \frac{1}{p_i}`

This matches our intuition: when X is uniform and |X| = N, H(X)=N(1/N \log N)=\log N; when X is almost always a certain message, say A, then H(X)= p_A \log \frac{1}{p_A} + \sum_{i \not = A} p_i \log \frac{1}{p_i} = 0.99999 \log \frac{1}{0.99999} + \delta \approx 0. For a more general case, suppose message A occurs half of the time, B one quarter of the time, C one eighth and so on. Then we can use one bit 1 to indiate that A occurs; otherwise we first send one bit 0 to indicate it’s not A, then send one bit 1 if B occurs and 0 otherwise, and so on. On average we need p_A \times 1 + p_B \times 2 + \dots = p_A \times \log(\frac{1}{p_A}) + p_B \times \log(\frac{1}{p_B}) + \dots = H(X) bits.