14.1 Entropy and optimal coding
The easiest choice is to use $\lceil \log_2(N) \rceil$ bits no matter the value of $X$. This simple
code is sometimes effective, but is not entirely satisfactory if $X$ is far from uniform.
To understand why, suppose that $N$ is extremely large, $P(X = 1) = 0.99$ and
the remaining probability mass is uniform over $[N] \setminus \{1\}$. Then it seems preferable
to have a short code for 1 and slightly longer codes for the alternatives. With
this in mind, a natural objective is to find a code that minimizes the expected
code length. That is
\[
c^* = \operatorname{argmin}_c \sum_{i=1}^N p_i \ell(c(i)), \tag{14.1}
\]
where the argmin is taken over valid codes and $\ell(\cdot)$ is a function that returns
the length of a code. The optimization problem in (14.1) can be solved using
Huffman coding and the optimal value satisfies
\[
H_2(P) \leq \sum_{i=1}^N p_i \ell(c^*(i)) \leq H_2(P) + 1, \tag{14.2}
\]
where $H_2(P)$ is the entropy of $P$,
\[
H_2(P) = \sum_{i \in [N] : p_i > 0} p_i \log_2\left(\frac{1}{p_i}\right).
\]
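As an illustration of (14.1) and (14.2), here is a minimal Python sketch that builds Huffman code lengths with the standard heapq module and compares the expected code length to the entropy. The helper names entropy2 and huffman_lengths, and the choice of $N = 8$ for the example above, are illustrative only.

```python
import heapq
import math

def entropy2(p):
    # H_2(P): sum only over outcomes with non-zero probability.
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def huffman_lengths(p):
    # Code lengths of a Huffman code for a distribution p with all p_i > 0.
    # Heap entries are (probability, tie-breaker, symbols in the subtree).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        q1, _, s1 = heapq.heappop(heap)
        q2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to the code of every symbol they contain.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (q1 + q2, counter, s1 + s2))
        counter += 1
    return lengths

# The example above: P(X = 1) = 0.99 with the rest uniform, here with N = 8.
p = [0.99] + [0.01 / 7] * 7
expected = sum(pi * li for pi, li in zip(p, huffman_lengths(p)))
print(entropy2(p), expected)  # H_2(P) <= expected length <= H_2(P) + 1
```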
When $p_i = 1/N$ is uniform, the naive idea of using a code of uniform length is
recovered, but for non-uniform distributions the code adapts to assign shorter
codes to symbols with larger probability. It is worth pointing out that the sum
is only over outcomes that occur with non-zero probability, which is motivated
by observing that $\lim_{x \to 0^+} x \log(1/x) = 0$, or by thinking of the entropy as the
expectation of $\log(1/p_X)$ with respect to $P$ and noting that expectations do not
change when the random variable is modified on a set of measure zero.
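Both observations are easy to verify numerically. A small sketch, with arbitrary example distributions:

```python
import math

# Uniform distribution: the entropy is exactly log2(N) bits.
N = 16
uniform = [1 / N] * N
h_uniform = sum(pi * math.log2(1 / pi) for pi in uniform if pi > 0)
print(h_uniform, math.log2(N))  # both are 4.0

# Outcomes with zero probability are skipped and contribute nothing.
p = [0.5, 0.5, 0.0]
print(sum(pi * math.log2(1 / pi) for pi in p if pi > 0))  # 1.0
```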
It turns out that $H_2(P)$ is not just an approximation to the expected length of
the Huffman code, but is itself a fundamental quantity. Imagine that Alice wants
to transmit a long string of symbols sampled from $P$. She could use a Huffman
code to send Bob each symbol one at a time, but by (14.2) this can waste up to
one bit per symbol, a rounding loss that accumulates as the message length grows.
There is another scheme called arithmetic coding for which the average number of
bits per symbol approaches $H_2(P)$, and the source coding theorem says that this is unimprovable.
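Arithmetic coding itself is not needed to see the effect. A simpler illustration, sketched below, applies Huffman coding to blocks of $k$ independent symbols: using that the entropy of $k$ independent symbols is $k H_2(P)$, the bound (14.2) gives an expected block length of at most $k H_2(P) + 1$, so the cost per symbol is at most $H_2(P) + 1/k$ and approaches $H_2(P)$ as $k$ grows. The helper names and the example distribution are illustrative only.

```python
import heapq
import itertools
import math

def entropy2(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def expected_huffman_length(p):
    # Expected code length of a Huffman code for p (same method as in the earlier sketch).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        q1, _, s1 = heapq.heappop(heap)
        q2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (q1 + q2, counter, s1 + s2))
        counter += 1
    return sum(pi * li for pi, li in zip(p, lengths))

p = [0.9, 0.1]
print("entropy:", entropy2(p))
for k in (1, 2, 4, 8):
    # Probability of each block of k independent symbols drawn from p.
    blocks = [math.prod(b) for b in itertools.product(p, repeat=k)]
    # Bits per symbol when Huffman coding whole blocks; at most H_2(P) + 1/k.
    print(k, expected_huffman_length(blocks) / k)
```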
The definition of entropy using base 2 makes sense from the perspective of
sending binary messages. Mathematically, however, it is more convenient to define
the entropy using the natural logarithm:
\[
H(P) = \sum_{i \in [N] : p_i > 0} p_i \log\left(\frac{1}{p_i}\right). \tag{14.3}
\]
This is nothing more than a scaling of $H_2$: since $\log(x) = \log(2) \log_2(x)$, we have
$H(P) = \log(2) H_2(P)$. Information measured using base 2 logarithms has units of
bits, while for the natural logarithm the unit is nats.
By slightly abusing terminology, we will also call $H(P)$ the entropy of $P$.
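A quick numerical check of this scaling, with an arbitrary example distribution:

```python
import math

p = [0.2, 0.3, 0.5]  # arbitrary example distribution
h_bits = sum(pi * math.log2(1 / pi) for pi in p if pi > 0)  # H_2(P) in bits
h_nats = sum(pi * math.log(1 / pi) for pi in p if pi > 0)   # H(P) in nats
print(h_nats, math.log(2) * h_bits)  # the two agree: H(P) = log(2) * H_2(P)
```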