Fundamental Entropic Bounds

When my advisor Dan Suciu taught me entropy, he said everyone should know 3 inequalities: the “entropy range”, monotonicity, and submodularity. Luckily I don’t have to memorize the bounds as each inequality has a very simple intuition. First, the “entropy range” simply bounds the value of any entropy function:

0 \leq H(X) \leq \log(|X|)

On one hand, the entropy is 0 if X takes a certain value A with probability 1. In that case, we know X’s value (A) without communicating a single bit. On the other hand, we can always simply use a full binary encoding with \log(|X|) bits to encode X, ignoring the probability distribution.
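As a quick sanity check, here is a small Python sketch (the `entropy` helper is just for illustration) that computes Shannon entropy and confirms both ends of the range: a point mass gives 0, and a uniform distribution hits \log(|X|).

```python
import math

def entropy(dist):
    """Shannon entropy in bits; dist maps outcomes to probabilities."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

# A certain value: entropy is 0, the lower bound.
print(entropy({"A": 1.0}))                  # 0.0

# A uniform distribution over 4 values hits the upper bound log2(4) = 2.
print(entropy({v: 0.25 for v in "ABCD"}))   # 2.0

# Anything in between stays within [0, log2(|X|)].
biased = {"A": 0.7, "B": 0.2, "C": 0.1}
assert 0 <= entropy(biased) <= math.log2(len(biased))
```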

Monotonicity says that the entropy of a string of random variables is no less than the entropy of any substring:

H(X) \leq H(XY)

Here XY simply “concatenates” X and Y, in that a value of XY concatenates a value of X with a value of Y. The entropy H(XY) is the number of bits necessary to transmit a string in XY. With this in mind, monotonicity simply says that transmitting more information requires more bits.
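We can check this numerically: take any joint distribution over the pair (X, Y), marginalize out Y, and compare. This is a minimal sketch with a made-up joint distribution, not anything from the text above.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits; dist maps outcomes to probabilities."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

# An arbitrary joint distribution over the pair (X, Y).
joint = {("a", "u"): 0.5, ("a", "v"): 0.25, ("b", "u"): 0.25}

# Marginalize out Y to get the distribution of X alone.
marginal_x = defaultdict(float)
for (x, _y), p in joint.items():
    marginal_x[x] += p

# Monotonicity: transmitting the pair XY takes at least as many bits as X alone.
assert entropy(marginal_x) <= entropy(joint)
```

Here H(X) ≈ 0.81 bits while H(XY) = 1.5 bits, as monotonicity demands.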

Finally, our last inequality, submodularity, conveys the intuition of “diminishing returns”: 10 dollars matter less to a millionaire than to a PhD student. More concretely, suppose we have a function f from wealth to quality of life. Then f is submodular because f(x + \delta) - f(x) gets smaller and smaller as x increases. In the context of information theory, H is submodular because adding additional information to a long message takes little effort. For example, suppose a submarine needs to send reports describing the fish it finds, where each description includes weight, length, and species. If a report says the fish is 80 feet long, you’ll know it’s a blue whale without looking at the species field. In general, we can save some bits by inferring facts from the correlation of the data; if all variables are independent, we can save nothing. With this intuition, let’s look at the formal statement of submodularity:

H(X) + H(Y) \geq H(X \cup Y) + H(X \cap Y)

Rearranging, we get H(X \cup Y) - H(X) \leq H(Y) - H(X \cap Y). Note (X \cup Y) - X = Y - (X \cap Y), and if we define \delta = Y - (X \cap Y), the inequality becomes H(X + \delta) - H(X) \leq H((X \cap Y) + \delta) - H(X \cap Y), which states precisely “the law of diminishing returns” because X \supseteq (X \cap Y)!
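Submodularity can also be verified numerically. In the sketch below (the joint distribution is invented for illustration), the variables are three correlated bits A, B, C, with X = {A, B} and Y = {B, C}, so X ∪ Y = {A, B, C} and X ∩ Y = {B}.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(dist):
    """Shannon entropy in bits; dist maps outcomes to probabilities."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

def marginal(joint, indices):
    """Marginal distribution of the variables at the given positions."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in indices)] += p
    return out

# A correlated joint distribution over three bits (A, B, C):
# B tends to copy A, and C tends to copy B.
joint = {}
for a, b, c in product([0, 1], repeat=3):
    joint[(a, b, c)] = (0.4 if a == b else 0.1) * (0.8 if b == c else 0.2)

# X = {A, B}, Y = {B, C}, X ∪ Y = {A, B, C}, X ∩ Y = {B}.
h_x   = entropy(marginal(joint, [0, 1]))   # H(X)
h_y   = entropy(marginal(joint, [1, 2]))   # H(Y)
h_uni = entropy(joint)                     # H(X ∪ Y)
h_int = entropy(marginal(joint, [1]))      # H(X ∩ Y)

# Submodularity: H(X) + H(Y) >= H(X ∪ Y) + H(X ∩ Y).
assert h_x + h_y >= h_uni + h_int - 1e-12
```

The gap between the two sides measures how much the correlation between X and Y lets us save; for independent variables the inequality holds with equality on the overlap alone.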