When my advisor Dan Suciu taught me entropy, he said everyone should know 3 inequalities: the “entropy range”, monotonicity, and submodularity. Luckily I don’t have to memorize the bounds as each inequality has a very simple intuition. First, the “entropy range” simply bounds the value of any entropy function:

`0 \leq H(X) \leq \log(|X|)`

On one hand, the entropy is 0 if $X$ takes a certain value $A$ with probabilty $1$. In that case, we know $X$’s value ($A$) without communicating a single bit. On the other hand, we can always simply use full binary encoding with $\log(|X|)$ bits to encode $X$, ignoring the probability distribution.

**Monotonicity** says that the entropy of a string of
random variables is no less than the entropy of any substring:

`H(X) \leq H(XY)`

Here $XY$ simply “concatenates” $X$ and $Y$, in that a value of $XY$ concatenates a value of $X$ with a value of $Y$. The entropy $H(XY)$ is the number of bits necessary to transmit a string in $XY$. With this in mind, monotonicity simpy says that transmitting more information requires more bits.

Finally, our last inequality, submodularity, conveys the intuition of
“diminishing returns”: 10 dollars matter less to a millionaire than to a
PhD student. More concretely, suppose we have a function
$f$
from wealth to quality of life. Then
$f$
is submodular because
$f(x + \delta) - f(x)$
gets smaller and smaller as
$x$
increases. In the context of information theory,
$H$
is submodular because adding additional information to a long message
takes little effort. For example, suppose a submarine needs to send
reports describing the fish it finds, and the description includes
weight, length and species. Then if it says the fish is 80 feet long,
you’ll know it’s a blue whale without looking at the species field. In
general, we can save some bits by inferring facts from the correlation
of data; if all variables are independent we can save nothing. With this
intuition, let’s look at the formal statement of
**submodularity**:

`H(X) + H(Y) \geq H(X \cup Y) + H(X \cap Y)`

Rearranging, we get $H(X \cup Y) - H(X) \leq H(Y) - H(X \cap Y)$. Note $(X \cup Y) - X = Y - (X\cap Y)$, and if we define $\delta = Y - (X\cap Y)$, the inequality becomes $H(X + \delta) - H(X) \leq H((X\cap Y) + \delta) - H(X\cap Y)$, which states precisly “the law of diminishing returns” because $X \geq (X\cap Y)$!