Entropy can be understood as the minimum amount of data needed to be transmitted in order to communicate a piece of information. Concretely, this is the average number of bits needed to encode a message. For example, imagine a spaceship sends a status code every minute to indicate if it has found any alien civilization. The code is any letter from the English alphabet, with A meaning “nothing new”, and some other letter describing the alien. For example F means the alien is friendly, and B means the alien is blue, etc. For simplicity assume every code is only 1-letter long. Then we can simply encode this status code with \log(26) bits, where A = 0 \dots 0, B = 0\dots 1, \dots. However, we can be a little clever because we know most of the time the spaceship won’t be finding new civilizations. In that case, the satellite can remain silent to indicate “nothing new”; otherwise it sends the status code with our original encoding 1.

Then we only need to send on average little more than 0 bit per minute. In general, we can save some bits if we know certain messages occur with high/low probability. In other words, the minimum commucation cost depends on the probability distribution of the data. Entropy precisely formalizes this intuition. Formally, if X is a random variable with outcomes x_1, \dots, x_N each of probabilities p_1, \dots, p_N, then its entropy is defined as:

H(X) = \sum_i p_i \log \frac{1}{p_i}

This matches our intuition: when X is uniform and |X| = N, H(X)=N(1/N \log N)=\log N; when X is almost always a certain message, say A, then H(X)= p_A \log \frac{1}{p_A} + \sum_{i \not = A} p_i \log \frac{1}{p_i} = 0.99999 \log \frac{1}{0.99999} + \delta \approx 0. For a more general case, suppose message A occurs half of the time, B one quarter of the time, C one eighth and so on. Then we can use one bit 1 to indiate that A occurs; otherwise we first send one bit 0 to indicate it’s not A, then send one bit 1 if B occurs and 0 otherwise, and so on. On average we need p_A \times 1 + p_B \times 2 + \dots = p_A \times \log(\frac{1}{p_A}) + p_B \times \log(\frac{1}{p_B}) + \dots = H(X) bits.