Entropy can be understood as the minimum amount of data that must be
transmitted in order to communicate a piece of information. Concretely,
this is the minimum average number of bits needed to encode a message. For
example, imagine a spaceship sends a status code every minute to
indicate if it has found any alien civilization. The code is any letter
from the English alphabet, with
$A$
meaning “nothing new”, and some other letter describing the alien. For
example
$F$
means the alien is friendly, and
$B$
means the alien is blue, etc. For simplicity, assume every code is exactly
one letter long. Then we can simply encode this status code with
$\lceil \log_2 26 \rceil = 5$
bits, where
$A = 0 \dots 0, B = 0\dots 1, \dots$.
However, we can be a little clever, because we know that most of the time the
spaceship won’t be finding new civilizations. In that case, the
spaceship can remain silent to indicate “nothing new”; otherwise it
sends the status code with our original encoding ^{1}.

Then we only need to send, on average, a little more than 0 bits per
minute. In general, we can save some bits if we know certain messages
occur with high or low probability. In other words, the minimum communication
cost depends on the probability distribution of the data. Entropy
precisely formalizes this intuition. Formally, if
$X$
is a random variable with outcomes
$x_1, \dots, x_N$
with probabilities
$p_1, \dots, p_N$,
then its **entropy** (measured in bits when the logarithm is taken base 2) is defined as:

$$H(X) = \sum_i p_i \log \frac{1}{p_i}$$
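
As a minimal sketch, the definition can be evaluated directly in Python. The function name `entropy`, the base-2 logarithm, and the two example distributions are assumptions made for illustration; in particular, the way the remaining probability mass is spread over the other 25 letters in `skewed` is hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p = 0 are skipped: p * log(1/p) tends to 0 as p -> 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Every letter equally likely (26 outcomes).
uniform = [1 / 26] * 26

# "A" almost always; the other 25 letters share the rest equally (assumed split).
skewed = [0.99999] + [0.00001 / 25] * 25

print(entropy(uniform))  # equals log2(26), a bit under 5 bits
print(entropy(skewed))   # very close to 0 bits
```

The uniform case costs nearly the full 5 bits of the fixed-length encoding, while the skewed case costs only a tiny fraction of a bit per message on average.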

This matches our intuition. When $X$ is uniform over $N$ outcomes, $H(X) = N \cdot \frac{1}{N} \log N = \log N$. When $X$ is almost always a certain message, say $A$, then $H(X) = p_A \log \frac{1}{p_A} + \sum_{i \neq A} p_i \log \frac{1}{p_i} = 0.99999 \log \frac{1}{0.99999} + \delta \approx 0$, where $\delta$ is the small contribution of the remaining messages.

For a more general case, suppose message $A$ occurs half of the time, $B$ one quarter of the time, $C$ one eighth of the time, and so on. Then we can send the single bit $1$ to indicate that $A$ occurs; otherwise we first send the bit $0$ to indicate it’s not $A$, then send $1$ if $B$ occurs and $0$ otherwise, and so on. Message $A$ thus costs 1 bit, $B$ costs 2 bits, $C$ costs 3 bits, and in each case the cost is exactly $\log_2 \frac{1}{p}$ for that message's probability $p$. On average we need $p_A \times 1 + p_B \times 2 + \dots = p_A \log \frac{1}{p_A} + p_B \log \frac{1}{p_B} + \dots = H(X)$ bits.
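
The halving scheme can be checked numerically. The sketch below assumes a hypothetical four-message source in which the last two messages, $C$ and $D$, each take one eighth of the probability, so that the distribution sums to 1. The codewords follow the scheme described above, and the names `probs`, `code`, and `is_prefix_free` are illustrative.

```python
import math

# Hypothetical source: probabilities halve, and the last two messages
# split the remaining mass so the distribution sums to 1.
probs = {"A": 1 / 2, "B": 1 / 4, "C": 1 / 8, "D": 1 / 8}

# Codewords from the scheme above: "1" means "this message",
# a leading "0" means "not this one, keep reading".
code = {"A": "1", "B": "01", "C": "001", "D": "000"}

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    words = list(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)

# Expected number of bits sent per message under this code.
average_length = sum(probs[m] * len(code[m]) for m in probs)

# Entropy of the source in bits.
entropy_bits = sum(p * math.log2(1 / p) for p in probs.values())

print(is_prefix_free(code.values()))  # True
print(average_length, entropy_bits)   # 1.75 1.75
```

Because no codeword is a prefix of another, the receiver can decode a stream of bits without separators, and for this distribution the average code length equals the entropy exactly.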