Entropy can be understood as the minimum amount of data needed to be transmitted in order to communicate a piece of information. Concretely, this is the average number of bits needed to encode a message. For example, imagine a spaceship sends a status code every minute to indicate if it has found any alien civilization. The code is any letter from the English alphabet, with AA meaning “nothing new”, and some other letter describing the alien. For example FF means the alien is friendly, and BB means the alien is blue, etc. For simplicity assume every code is only 1-letter long. Then we can simply encode this status code with log(26)\log(26) bits, where A=00,B=01,A = 0 \dots 0, B = 0\dots 1, \dots. However, we can be a little clever because we know most of the time the spaceship won’t be finding new civilizations. In that case, the satellite can remain silent to indicate “nothing new”; otherwise it sends the status code with our original encoding 1.

Then we only need to send on average little more than 0 bit per minute. In general, we can save some bits if we know certain messages occur with high/low probability. In other words, the minimum commucation cost depends on the probability distribution of the data. Entropy precisely formalizes this intuition. Formally, if XX is a random variable with outcomes x1,,xNx_1, \dots, x_N each of probabilities p1,,pNp_1, \dots, p_N, then its entropy is defined as:

H(X) = \sum_i p_i \log \frac{1}{p_i}

This matches our intuition: when XX is uniform and |X|=N|X| = N, H(X)=N(1/NlogN)=logNH(X)=N(1/N \log N)=\log N; when XX is almost always a certain message, say AA, then H(X)=pAlog1pA+iApilog1pi=0.99999log10.99999+δ0H(X)= p_A \log \frac{1}{p_A} + \sum_{i \not = A} p_i \log \frac{1}{p_i} = 0.99999 \log \frac{1}{0.99999} + \delta \approx 0. For a more general case, suppose message AA occurs half of the time, BB one quarter of the time, CC one eighth and so on. Then we can use one bit 11 to indiate that AA occurs; otherwise we first send one bit 00 to indicate it’s not AA, then send one bit 11 if BB occurs and 00 otherwise, and so on. On average we need pA×1+pB×2+=pA×log(1pA)+pB×log(1pB)+=H(X)p_A \times 1 + p_B \times 2 + \dots = p_A \times \log(\frac{1}{p_A}) + p_B \times \log(\frac{1}{p_B}) + \dots = H(X) bits.