Entropy can be understood as the minimum amount of data that needs to be transmitted in order to communicate a piece of information. Concretely, it is the average number of bits needed to encode a message. For example, imagine a spaceship sends a status code every minute to indicate whether it has found any alien civilization. The code is a letter from the English alphabet: one designated letter means “nothing new”, and each other letter describes the alien, e.g. one letter means the alien is friendly, another means the alien is blue, etc. For simplicity assume every code is only 1 letter long. Then we can simply encode this status code with $n$ bits, where $n = \lceil \log_2 26 \rceil = 5$. However, we can be a little clever, because we know that most of the time the spaceship won’t be finding new civilizations. In that case, the spaceship can remain silent to indicate “nothing new”; otherwise it sends the status code with our original encoding.
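As a quick sanity check, here is a minimal Python sketch of the two schemes. The 26-letter alphabet is from the example above; the probability of a discovery in any given minute is an assumed placeholder, not a number from the text.

```python
import math

# Baseline: 26 possible one-letter codes need ceil(log2(26)) = 5 bits each.
alphabet_size = 26
fixed_bits = math.ceil(math.log2(alphabet_size))
print(fixed_bits)  # 5

# "Silent unless something new" scheme, with an assumed probability of
# discovering a civilization in any given minute (placeholder value).
p_discovery = 1e-4
expected_bits_per_minute = p_discovery * fixed_bits  # 0 bits when silent
print(expected_bits_per_minute)  # 0.0005
```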
Then we only need to send, on average, a little more than 0 bits per minute. In general, we can save some bits if we know certain messages occur with high or low probability. In other words, the minimum communication cost depends on the probability distribution of the data. Entropy precisely formalizes this intuition. Formally, if $X$ is a random variable with outcomes $x_1, \dots, x_n$ occurring with probabilities $p_1, \dots, p_n$, then its entropy is defined as:
$$
H(X) = \sum_i p_i \log \frac{1}{p_i}
$$
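As an illustration, here is a minimal Python sketch of this formula (entropy measured in bits, i.e. logarithm base 2). The skewed distribution is an assumed stand-in for the spaceship scenario, where “nothing new” dominates; its exact probabilities are not taken from the text.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = sum_i p_i * log2(1 / p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Uniform over 26 letters: entropy equals log2(26).
uniform = [1 / 26] * 26
print(entropy(uniform))  # ~4.70

# Assumed spaceship-like distribution: "nothing new" with probability
# 0.9999, the remaining mass spread evenly over the other 25 letters.
skewed = [0.9999] + [0.0001 / 25] * 25
print(entropy(skewed))  # ~0.002 bits per message (close to 0)
```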
This matches our intuition: when $X$ is uniform and $p_i = 1/n$, $H(X) = \log n$; when $X$ is almost always a certain message, say $p_1 \approx 1$, then $H(X) \approx 0$. For a more general case, suppose message $x_1$ occurs half of the time, $x_2$ one quarter of the time, $x_3$ one eighth, and so on. Then we can use one bit to indicate that $x_1$ occurs; otherwise we first send one bit to indicate it’s not $x_1$, then send one bit if $x_2$ occurs and another otherwise, and so on. On average we need $\sum_i i \cdot 2^{-i} = 2$ bits, which matches the entropy $H(X) = \sum_i 2^{-i} \log 2^i = 2$.
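A short numerical check of this example, as a sketch: the distribution $p_i = 2^{-i}$ is truncated at an arbitrary depth, with the leftover probability given to the last symbol so the total is 1, and the code lengths follow the scheme just described (one bit for $x_1$, two for $x_2$, and so on).

```python
import math

# Truncate p_i = 2^{-i} at depth n (arbitrary cutoff); the last symbol
# absorbs the leftover probability so the distribution sums to 1.
n = 30
probs = [2.0 ** -i for i in range(1, n)] + [2.0 ** -(n - 1)]
assert abs(sum(probs) - 1.0) < 1e-9

# Code lengths: "1" for x1, "01" for x2, "001" for x3, ... (i bits for x_i);
# the final symbol gets the all-zeros codeword of length n - 1.
lengths = list(range(1, n)) + [n - 1]

h = sum(p * math.log2(1.0 / p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(h, avg_len)  # both ~2.0 bits
```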