# 17.11. Information Theory¶

The universe is overflowing with information. Information provides a common language across disciplinary rifts: from Shakespeare’s Sonnet to researchers’ paper on Cornell ArXiv, from Van Gogh’s printing Starry Night to Beethoven’s music Symphony No. 5, from the first programming language Plankalkül to the state-of-the-art machine learning algorithms. Everything must follow the rules of information theory, no matter the format. With information theory, we can measure and compare how much information is present in different signals. In this section, we will investigate the fundamental concepts of information theory and applications of information theory in machine learning.

Before we get started, let’s outline the relationship between machine learning and information theory. Machine learning aims to extract interesting signals from data and make critical predictions. On the other hand, information theory studies encoding, decoding, transmitting, and manipulating information. As a result, information theory provides fundamental language for discussing the information processing in machine learned systems. For example, many machine learning applications use the cross entropy loss as described in Section 3.4. This loss can be directly derived from information theoretic considerations.

## 17.11.1. Information¶

Let’s start with the “soul” of information theory: information.
*Information* can be encoded in anything with a particular sequence of
one or more encoding formats. Suppose that we task ourselves with trying
to define a notion of information. What could be are starting point?

Consider the following thought experiment. We have a friend with a deck of cards. They will shuffle the deck, flip over some cards, and tell us statements about the cards. We will try to assess the information content of each statement.

First, they flip over a card and tell us, “I see a card.” This provides us with no information at all. We were already certain that this was the case so we hope the information should be zero.

Next, they flip over a card and say, “I see a heart.” This provides us some information, but in reality there are only \(4\) different suits that were possible, each equally likely, so we are not surprised by this outcome. We hope that whatever the measure of information, this event should have low information content.

Next, they flip over a card and say, “This is the \(3\) of spades.” This is more information. Indeed there were \(52\) equally likely possible outcomes, and our friend told us which one it was. This should be a medium amount of information.

Let’s take this to the logical extreme. Suppose that finally they flip over every card from the deck and read off the entire sequence of the shuffled deck. There are \(52!\) different orders to the deck, again all equally likely, so we need a lot of information to know which one it is.

Any notion of information we develop must conform to this intuition. Indeed, in the next sections we will learn how to compute that these events have \(0\text{ bits}\), \(2\text{ bits}\), \(~5.7\text{ bits}\), and \(~225.6\text{ bits}\) of information respectively.

If we read through these thought experiments, we see a natural idea. As a starting point, rather than caring about the knowledge, we may build off the idea that information represents the degree of surprise or the abstract possibility of the event. For example, if we want to describe an unusual event, we need a lot information. For a common event, we may not need much information.

In 1948, Claude E. Shannon published *A Mathematical Theory of
Communication* [Shannon, 1948] establishing the theory of
information. In his book, Shannon introduced the concept of information
entropy for the first time. We will begin our journey here.

### 17.11.1.1. Self-information¶

Since information embodies the abstract possibility of an event, how do
we map the possibility to the number of bits? Shannon introduced the
terminology *bit* as the unit of information, which was originally
created by John Tukey. So what is a “bit” and why do we use it to
measure information? Historically, an antique transmitter can only send
or receive two types of code: \(0\) and \(1\). Indeed, binary
encoding is still in common use on all modern digital computers. In this
way, any information is encoded by a series of \(0\) and \(1\).
And hence, a series of binary digits of length \(n\) contains
\(n\) bits of information.

Now, suppose that for any series of codes, each \(0\) or \(1\)
occurs with a probability of \(\frac{1}{2}\). Hence, an event
\(X\) with a series of codes of length \(n\), occurs with a
probability of \(\frac{1}{2^n}\). At the same time, as we mentioned
before, this series contains \(n\) bits of information. So, can we
generalize to a math function which can transfer the probability
\(p\) to the number of bits? Shannon gave the answer by defining
*self-information*

as the *bits* of information we have received for this event \(X\).
Note that we will always use base-2 logarithms in this section. For the
sake of simplicity, the rest of this section will omit the subscript 2
in the logarithm notation, i.e., \(\log(.)\) always refers to
\(\log_2(.)\). For example, the code “0010” has a self-information

We can calculate self information in MXNet as shown below. Before that, let’s first import all the necessary packages in this section.

```
from mxnet import np
from mxnet.metric import NegativeLogLikelihood
from mxnet.ndarray import nansum
import random
def self_information(p):
return -np.log2(p)
self_information(1/64)
```

```
6.0
```

## 17.11.2. Entropy¶

As self-information only measures the information of a single discrete event, we need a more generalized measure for any random variable of either discrete or continuous distribution.

### 17.11.2.1. Motivating Entropy¶

Let’s try to get specific about what we want. This will be an informal
statement of what are known as the *axioms of Shannon entropy*. It will
turn out that the following collection of common-sense statements force
us to a unique definition of information. A formal version of these
axioms, along with several others may be found in
[Csiszar, 2008].

The information we gain by observing a random variable does not depend on what we call the elements, or the presence of additional elements which have probability zero.

The information we gain by observing two random variables is no more than the sum of the information we gain by observing them separately. If they are independent, then it is exactly the sum.

The information gained when observing (nearly) certain events is (nearly) zero.

While proving this fact is beyond the scope of our text, it is important to know that this uniquely determines the form that entropy must take. The only ambiguity that these allow is in the choice of fundamental units, which is most often normalized by making the choice we saw before that the information provided by a single fair coin flip is one bit.

### 17.11.2.2. Definition¶

For any random variable \(X\) that follows a probability
distribution \(P\) with a probability density function (p.d.f.) or a
probability mass function (p.m.f.) \(p(x)\), we measure the expected
amount of information through *entropy* (or *Shannon entropy*)

To be specific, if \(X\) is discrete,

Otherwise, if \(X\) is continuous, we also refer entropy as
*differential entropy*

In MXNet, we can define entropy as below.

```
def entropy(p):
entropy = - p * np.log2(p)
# nansum will sum up the non-nan number
out = nansum(entropy.as_nd_ndarray())
return out
entropy(np.array([0.1, 0.5, 0.1, 0.3]))
```

```
[1.6854753]
<NDArray 1 @cpu(0)>
```

### 17.11.2.3. Interpretations¶

You may be curious: in the entropy definition (17.11.3), why do we use an expectation of a negative logarithm? Here are some intuitions.

First, why do we use a *logarithm* function \(\log\)? Suppose that
\(p(x) = f_1(x) f_2(x) \ldots, f_n(x)\), where each component
function \(f_i(x)\) is independent from each other. This means that
each \(f_i(x)\) contributes independently to the total information
obtained from \(p(x)\). As discussed above, we want the entropy
formula to be additive over independent random variables. Luckily,
\(\log\) can naturally turn a product of probability distributions
to a summation of the individual terms.

Next, why do we use a *negative* \(\log\)? Intuitively, more
frequent events should contain less information than less common events,
since we often gain more information from an unusual case than from an
ordinary one. However, \(\log\) is monotonically increasing with the
probabilities, and indeed negative for all values in \([0, 1]\). We
need to construct a monotonically decreasing relationship between the
probability of events and their entropy, which will ideally be always
positive (for nothing we observe should force us to forget what we have
known). Hence, we add a negative sign in front of \(\log\) function.

Last, where does the *expectation* function come from? Consider a random
variable \(X\). We can interpret the self-information
(\(-\log(p)\)) as the amount of *surprise* we have at seeing a
particular outcome. Indeed, as the probability approaches zero, the
surprise becomes infinite. Similarly, we can interpret The entropy as
the average amount of surprise from observing \(X\). For example,
imagine that a slot machine system emits statistical independently
symbols \({s_1, \ldots, s_k}\) with probabilities
\({p_1, \ldots, p_k}\) respectively. Then the entropy of this system
equals to the average self-information from observing each output, i.e.,

### 17.11.2.4. Properties of Entropy¶

By the above examples and interpretations, we can derive the following properties of entropy (17.11.3). Here, we refer to X as an event and P as the probability distribution of X.

Entropy is non-negative, i.e., \(H(X) \geq 0, \forall X\).

If \(X \sim P\) with a p.d.f. or a p.m.f. \(p(x)\), and we try to estimate \(P\) by a new probability distribution \(Q\) with a p.d.f. or a p.m.f. \(q(x)\), then

(17.11.7)¶\[H(X) = - E_{x \sim P} [\log p(x)] \leq - E_{x \sim P} [\log q(x)], \text{ with equality if and only if } P = Q.\]Alternatively, \(H(X)\) gives a lower bound of the average number of bits needed to encode symbols drawn from \(P\).

If \(X \sim P\), then \(x\) conveys the maximum amount of information if it spreads evenly among all possible outcomes. Specifically, if the probability distribution \(P\) is discrete with \(k\)-class \(\{p_1, \ldots, p_k \}\), then

(17.11.8)¶\[H(X) \leq \log(k), \text{ with equality if and only if } p_i = \frac{1}{k}, \forall x_i.\]If \(P\) is a continuous random variable, then the story becomes much more complicated. However, if we additionally impose that \(P\) is supported on a finite interval (with all values between \(0\) and \(1\)), then \(P\) has the highest entropy if it is the uniform distribution on that interval.

## 17.11.3. Mutual Information¶

Previously we defined entropy of a single random variable \(X\), how about the entropy of a pair random variables \((X, Y)\)? We can think of these techniques as trying to answer the following type of question, “What information is contained in \(X\) and \(Y\) together compared to each separately? Is there redundant information, or is it all unique?”

For the following discussion, we always use \((X, Y)\) as a pair of random variables that follows a joint probability distribution \(P\) with a p.d.f. or a p.m.f. \(p_{X, Y}(x, y)\), while \(X\) and \(Y\) follow probability distribution \(p_X(x)\) and \(p_Y(y)\), respectively.

### 17.11.3.1. Joint Entropy¶

Similar to entropy of a single random variable (17.11.3), we
define the *joint entropy* \(H(X, Y)\) of a pair random variables
\((X, Y)\) as

Precisely, on the one hand, if \((X, Y)\) is a pair of discrete random variables, then

On the other hand, if \((X, Y)\) is a pair of continuous random
variables, then we define the *differential joint entropy* as

We can think of (17.11.9) as telling us the total randomness in the pair of random variables. As a pair of extremes, if \(X = Y\) are two identical random variables, then the information in the pair is exactly the information in one and we have \(H(X, Y) = H(X) = H(Y)\). On the other extreme, if \(X\) and \(Y\) are independent then \(H(X, Y) = H(X) + H(Y)\). Indeed we will always have that the information contained in a pair of random variables is no smaller than the entropy of either random variable and no more than the sum of both.

Let’s implement joint entropy from scratch in MXNet.

```
def joint_entropy(p_xy):
joint_ent = -p_xy * np.log2(p_xy)
# nansum will sum up the non-nan number
out = nansum(joint_ent.as_nd_ndarray())
return out
joint_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]))
```

```
[1.6854753]
<NDArray 1 @cpu(0)>
```

Notice that this is the same *code* as before, but now we interpret it
differently as working on the joint distribution of the two random
variables.

### 17.11.3.2. Conditional Entropy¶

The joint entropy defined above the amount of information contained in a pair of random variables. This is useful, but often times it is not what we care about. Consider the setting of machine learning. Let’s take \(X\) to be the random variable (or vector of random variables) that describes the pixel values of an image, and \(Y\) to be the random variable which is the class label. \(X\) should contain substantial information—a natural image is a complex thing. However, the information contained in \(Y\) once the image has been show should be low. Indeed, the image of a digit should already contain the information about what digit it is unless the digit is illegible. Thus, to continue to extend our vocabulary of information theory, we need to be able to reason about the information content in a random variable conditional on another.

In the probability theory, we saw the definition of the *conditional
probability* to measure the relationship between variables. We now want
to analogously define the *conditional entropy* \(H(Y \mid X)\). We
can write this as

where \(p(y \mid x) = \frac{p_{X, Y}(x, y)}{p_X(x)}\) is the conditional probability. Specifically, if \((X, Y)\) is a pair of discrete random variables, then

If \((X, Y)\) is a pair of continuous random variables, then the
*differential joint entropy* is similarly defined as

It is now natural to ask, how does the *conditional entropy*
\(H(Y \mid X)\) relate to the entropy \(H(X)\) and the joint
entropy \(H(X, Y)\)? Using the definitions above, we can express
this cleanly:

This has an intuitive interpretation: the information in \(Y\) given \(X\) (\(H(Y \mid X)\)) is the same as the information in both \(X\) and \(Y\) together (\(H(X, Y)\)) minus the information already contained in \(X\). This gives us the information in \(Y\) which is not also represented in \(X\).

Now, let’s implement conditional entropy (17.11.13) from scratch in MXNet.

```
def conditional_entropy(p_xy, p_x):
p_y_given_x = p_xy/p_x
cond_ent = -p_xy * np.log2(p_y_given_x)
# nansum will sum up the non-nan number
out = nansum(cond_ent.as_nd_ndarray())
return out
conditional_entropy(np.array([[0.1, 0.5], [0.2, 0.3]]), np.array([0.2, 0.8]))
```

```
[0.8635472]
<NDArray 1 @cpu(0)>
```

### 17.11.3.3. Mutual Information¶

Given the previous setting of random variables \((X, Y)\), you may
wonder: “Now that we know how much information is contained in \(Y\)
but not in \(X\), can we similarly ask how much information is
shared between \(X\) and \(Y\)?” The answer will be the *mutual
information* of \((X, Y)\), which we will write as \(I(X, Y)\).

Rather than diving straight into the formal definition, let’s practice our intuition by first trying to derive an expression for the mutual information entirely based on terms we have constructed before. We wish to find the information shared between two random variables. One way we could try to do this is to start with all the information contained in both \(X\) and \(Y\) together, and then we take off the parts that are not shared. The information contained in both \(X\) and \(Y\) together is written as \(H(X, Y)\). We want to subtract from this the information contained in \(X\) but not in \(Y\), and the information contained in \(Y\) but not in \(X\). As we saw in the previous section, this is given by \(H(X \mid Y)\) and \(H(Y \mid X)\) respectively. Thus, we have that the mutual information should be

Indeed, this is a valid definition for the mutual information. If we expand out the definitions of these terms and combine them, a little algebra shows that this is the same as

We can summarize all of these relationships in image Fig. 17.11.1. It is an excellent test of intuition to see why the following statements are all also equivalent to \(I(X, Y)\).

\(H(X) − H(X \mid Y)\)

\(H(Y) − H(Y \mid X)\)

\(H(X) + H(Y) − H(X, Y)\)

In many ways we can think of the mutual information (17.11.18) as principled extension of correlation coefficient we saw in Section 17.6. This allows us to ask not only for linear relationships between variables, but for the maximum information shared between the two random variables of any kind.

Now, let’s implement mutual information from scratch.

```
def mutual_information(p_xy, p_x, p_y):
p = p_xy / (p_x * p_y)
mutual = -p_xy * np.log2(p)
# nansum will sum up the non-nan number
out = nansum(mutual.as_nd_ndarray())
return out
mutual_information(np.array([[0.1, 0.5], [0.1, 0.3]]),
np.array([0.2, 0.8]),
np.array([[0.75, 0.25]]))
```

```
[-0.71946025]
<NDArray 1 @cpu(0)>
```

### 17.11.3.4. Properties of Mutual Information¶

Rather than memorizing the definition of mutual information (17.11.18), you only need to keep in mind its notable properties:

Mutual information is symmetric, i.e., \(I(X, Y) = I(Y, X)\).

Mutual information is non-negative, i.e., \(I(X, Y) \geq 0\).

\(I(X, Y) = 0\) if and only if \(X\) and \(Y\) are independent. For example, if \(X\) and \(Y\) are independent, then knowing \(Y\) does not give any information about \(X\) and vice versa, so their mutual information is zero.

Alternatively, if \(X\) is an invertible function of \(Y\), then \(Y\) and \(X\) share all information and

(17.11.19)¶\[I(X, Y) = H(Y) = H(X).\]

### 17.11.3.5. Pointwise Mutual Information¶

When we worked with entropy at the beginning of this chapter, we were
able to provide an interpretation of \(-\log(p_X(x))\) as how
*surprised* we were with the particular outcome. We may give a similar
interpretation to the logarithmic term in the mutual information, which
is often referred to as the *pointwise mutual information*:

We can think of (17.11.20) as measuring how much more or less
likely the specific combination of outcomes \(x\) and \(y\) are
compared to what we would expect for independent random outcomes. If it
is large and positive, then these two specific outcomes occur much more
frequently than they would compared to random chance (*note*: the
denominator is \(p_X(x) p_Y(y)\) which is the probability of the two
outcomes were independent), whereas if it is large and negative it
represents the two outcomes happening far less than we would expect by
random chance.

This allows us to interpret the mutual information (17.11.18) as the average amount that we were surprised to see two outcomes occurring together compared to what we would expect if they were independent.

### 17.11.3.6. Applications of Mutual Information¶

Mutual information may be a little abstract in it pure definition, so
how does it related to machine learning? In natural language processing,
one of the most difficult problems is the *ambiguity resolution*, or the
issue of the meaning of a word being unclear from context. For example,
recently a headline in the news reported that “Amazon is on fire”. You
may wonder whether the company Amazon has a building on fire, or the
Amazon rain forest is on fire.

In this case, mutual information can help us resolve this ambiguity. We first find the group of words that each has a relatively large mutual information with the company Amazon, such as e-commerce, technology, and online. Second, we find another group of words that each has a relatively large mutual information with the Amazon rain forest, such as rain, forest, and tropical. When we need to disambiguate “Amazon”, we can compare which group has more occurrence in the context of the word Amazon. In this case the article would go on to describe the forest, and make the context clear.

## 17.11.4. Kullback–Leibler Divergence¶

As what we have discussed in Section 2.3, we can use
norms to measure distance between two points in space of any
dimensionality. We would like to be able to do a similar task with
probability distributions. There are many ways to go about this, but
information theory provides one of the nicest. We now explore the
*Kullback–Leibler (KL) divergence*, which provides a way to measure if
two distributions are close together or not.

### 17.11.4.1. Definition¶

Given a random variable \(X\) that follows the probability
distribution \(P\) with a p.d.f. or a p.m.f. \(p(x)\), and we
estimate \(P\) by another probability distribution \(Q\) with a
p.d.f. or a p.m.f. \(q(x)\). Then the *Kullback–Leibler (KL)
divergence* (or *relative entropy*) between \(P\) and \(Q\) is

As with the pointwise mutual information (17.11.20), we can
again provide an interpretation of the logarithmic term:
\(-\log \frac{q(x)}{p(x)} = -\log(q(x)) - (-\log(p(x)))\) will be
large and positive if we see \(x\) far more often under \(P\)
than we would expect for \(Q\), and large and negative if we see the
outcome far less than expected. In this way, we can interpret it as our
*relative* surprise at observing the outcome compared to how surprised
we would be observing it from our reference distribution.

In MXNet, let’s implement the KL divergence from Scratch.

```
def kl_divergence(p, q):
kl = p * np.log2(p / q)
out = nansum(kl.as_nd_ndarray())
return out.abs().asscalar()
```

### 17.11.4.2. KL Divergence Properties¶

Let’s take a look at some properties of the KL divergence (17.11.21).

KL divergence is non-symmetric, i.e.,

(17.11.22)¶\[D_{\mathrm{KL}}(P\|Q) \neq D_{\mathrm{KL}}(Q\|P), \text{ if } P \neq Q.\]KL divergence is non-negative, i.e.,

(17.11.23)¶\[D_{\mathrm{KL}}(P\|Q) \geq 0.\]Note that the equality holds only when \(P = Q\).

If there exists an \(x\) such that \(p(x) > 0\) and \(q(x) = 0\), then \(D_{\mathrm{KL}}(P\|Q) = \infty\).

There is a close relationship between KL divergence and mutual information. Besides the relationship shown in Fig. 17.11.1, \(I(X, Y)\) is also numerically equivalent with the following terms:

\(D_{\mathrm{KL}}(P(X, Y) \ \| \ P(X)P(Y))\);

\(E_Y \{ D_{\mathrm{KL}}(P(X \mid Y) \ \| \ P(X)) \}\);

\(E_X \{ D_{\mathrm{KL}}(P(Y \mid X) \ \| \ P(Y)) \}\).

For the first term, we interpret mutual information as the KL divergence between \(P(X, Y)\) and the product of \(P(X)\) and \(P(Y)\), and thus is a measure of how different the joint distribution is from the distribution if they were independent. For the second term, mutual information tells us the average reduction in uncertainty about \(Y\) that results from learning the value of the \(X\)’s distribution. Similarly to the third term.

### 17.11.4.3. Example¶

Let’s go through a toy example to see the non-symmetry explicitly.

First, let’s generate and sort three `ndarray`

s of length
\(10,000\): an objective `ndarray`

\(p\) which follows a
normal distribution \(N(0, 1)\), and two candidate `ndarray`

s
\(q_1\) and \(q_2\) which follow normal distributions
\(N(-1, 1)\) and \(N(1, 1)\) respectively.

```
random.seed(1)
nd_length = 10000
p = np.random.normal(loc=0, scale=1, size=(nd_length, ))
q1 = np.random.normal(loc=-1, scale=1, size=(nd_length, ))
q2 = np.random.normal(loc=1, scale=1, size=(nd_length, ))
p = np.array(sorted(p.asnumpy()))
q1 = np.array(sorted(q1.asnumpy()))
q2 = np.array(sorted(q2.asnumpy()))
```

Since \(q_1\) and \(q_2\) are symmetric with respect to the y-axis (i.e., \(x=0\)), we expect a similar value of KL divergence between \(D_{\mathrm{KL}}(p\|q_1)\) and \(D_{\mathrm{KL}}(p\|q_2)\). As you can see below, there is only a 1% off between \(D_{\mathrm{KL}}(p\|q_1)\) and \(D_{\mathrm{KL}}(p\|q_2)\).

```
kl_pq1 = kl_divergence(p, q1)
kl_pq2 = kl_divergence(p, q2)
similar_percentage = abs(kl_pq1 - kl_pq2) / ((kl_pq1 + kl_pq2) / 2) * 100
kl_pq1, kl_pq2, similar_percentage
```

```
(8470.638, 8664.999, 2.268504302642314)
```

In contrast, you may find that \(D_{\mathrm{KL}}(q_2 \|p)\) and \(D_{\mathrm{KL}}(p \| q_2)\) are off a lot, with around 40% off as shown below.

```
kl_q2p = kl_divergence(q2, p)
differ_percentage = abs(kl_q2p - kl_pq2) / ((kl_q2p + kl_pq2) / 2) * 100
kl_q2p, differ_percentage
```

```
(13536.835, 43.88678828000115)
```

## 17.11.5. Cross Entropy¶

If you are curious about applications of information theory in deep learning, here is a quick example. We define the true distribution \(P\) with probability distribution \(p(x)\), and the estimated distribution \(Q\) with probability distribution \(q(x)\), and we will use them in the rest of this section.

Say we need to solve a binary classification problem based on given \(n\) data points {\(x_1, \ldots, x_n\)}. Assume that we encode \(1\) and \(0\) as the positive and negative class label \(y_i\) respectively, and our neural network is parameterized by \(\theta\). If we aim to find a best \(\theta\) so that \(\hat{y}_i= p_{\theta}(y_i \mid x_i)\), it is natural to apply the maximum log-likelihood approach as was seen in Section 17.7. To be specific, for true labels \(y_i\) and predictions \(\hat{y}_i= p_{\theta}(y_i \mid x_i)\), the probability to be classified as positive is \(\pi_i= p_{\theta}(y_i = 1 \mid x_i)\). Hence, the log-likelihood function would be

Maximizing the log-likelihood function \(l(\theta)\) is identical to
minimizing \(- l(\theta)\), and hence we can find the best
\(\theta\) from here. To generalize the above loss to any
distributions, we also called \(-l(\theta)\) the *cross entropy
loss* \(\mathrm{CE}(y, \hat{y})\), where \(y\) follows the true
distribution \(P\) and \(\hat{y}\) follows the estimated
distribution \(Q\).

This was all derived by working from the maximum likelihood point of view. However, if we look closely we can see that terms like \(\log(\pi_i)\) have entered into our computation which is a solid indication that we can understand the expression from an information theoretic point of view.

### 17.11.5.1. Formal Definition¶

Like KL divergence, for a random variable \(X\), we can also measure
the divergence between the estimating distribution \(Q\) and the
true distribution \(P\) via *cross entropy*,

By using properties of entropy discussed above, we can also interpret it as the summation of the entropy \(H(P)\) and the KL divergence between \(P\) and \(Q\), i.e.,

In MXNet, we can implement the cross entropy loss as below.

```
def cross_entropy(y_hat, y):
ce = -np.log(y_hat[range(len(y_hat)), y])
return ce.mean()
```

Now define two `ndarray`

s for the labels and predictions, and
calculate the cross entropy loss of them.

```
labels = np.array([0, 2])
preds = np.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])
cross_entropy(preds, labels)
```

```
array(0.94856)
```

### 17.11.5.2. Properties¶

As alluded in the beginning of this section, cross entropy (17.11.25) can be used to define a loss function in the optimization problem. It turns out that the following are equivalent:

Maximizing predictive probability of \(Q\) for distribution \(P\), (i.e., \(E_{x \sim P} [\log (q(x))]\));

Minimizing cross entropy \(\mathrm{CE} (P, Q)\);

Minimizing the KL divergence \(D_{\mathrm{KL}}(P\|Q)\).

The definition of cross entropy indirectly proves the equivalent relationship between objective 2 and objective 3, as long as the entropy of true data \(H(P)\) is constant.

### 17.11.5.3. Cross Entropy as An Objective Function of Multi-class Classification¶

If we dive deep into the classification objective function with cross entropy loss \(\mathrm{CE}\), we will find minimizing \(\mathrm{CE}\) is equivalent to maximizing the log-likelihood function \(L\).

To begin with, suppose that we are given a dataset with \(n\)
samples, and it can be classified into \(k\)-classes. For each data
point \(i\), we represent any \(k\)-class label
\(\mathbf{y}_i = (y_{i1}, \ldots, y_{ik})\) by *one-hot encoding*.
To be specific, if the data point \(i\) belongs to class \(j\),
then we set the \(j\)-th entry to \(1\), and all other
components to \(0\), i.e.,

For instance, if a multi-class classification problem contains three classes \(A\), \(B\), and \(C\), then the labels \(\mathbf{y}_i\) can be encoded in {\(A: (1, 0, 0); B: (0, 1, 0); C: (0, 0, 1)\)}.

Assume that our neural network is parameterized by \(\theta\). For true label vectors \(\mathbf{y}_i\) and predictions

Hence, the *cross entropy loss* would be

On the other side, we can also approach the problem through maximum
likelihood estimation. To begin with, let’s quickly introduce a
\(k\)-class multinoulli distribution. It is an extension of the
Bernoulli distribution from binary class to multi-class. If a random
variable \(\mathbf{z} = (z_{1}, \ldots, z_{k})\) follows a
\(k\)-class *multinoulli distribution* with probabilities
\(\mathbf{p} =\) (\(p_{1}, \ldots, p_{k}\)), i.e.,

then the joint probability mass function(p.m.f.) of \(\mathbf{z}\) is

It can be seen that each data point, \(\mathbf{y}_i\), is following a \(k\)-class multinoulli distribution with probabilities \(\boldsymbol{\pi} =\) (\(\pi_{1}, \ldots, \pi_{k}\)). Therefore, the joint p.m.f. of each data point \(\mathbf{y}_i\) is \(\mathbf{\pi}^{\mathbf{y}_i} = \prod_{j=1}^k \pi_{j}^{y_{ij}}.\) Hence, the log-likelihood function would be

Since in maximum likelihood estimation, we maximizing the objective function \(l(\theta)\) by having \(\pi_{j} = p_{\theta} (y_{ij} \mid \mathbf{x}_i)\). Therefore, for any multi-class classification, maximizing the above log-likelihood function \(l(\theta)\) is equivalent to minimizing the CE loss \(\mathrm{CE}(y, \hat{y})\).

To test the above proof, let’s apply the built-in measure
`NegativeLogLikelihood`

in MXNet. Using the same `labels`

and
`preds`

as in the earlier example, we will get the same numerical loss
as the previous example up to the 5 decimal place.

```
nll_loss = NegativeLogLikelihood()
nll_loss.update(labels.as_nd_ndarray(), preds.as_nd_ndarray())
nll_loss.get()
```

```
('nll-loss', 0.9485599994659424)
```

## 17.11.6. Summary¶

Information theory is a field of study about encoding, decoding, transmitting, and manipulating information.

Entropy is the unit to measure how much information is presented in different signals.

KL divergence can also measure the divergence between two distributions.

Cross Entropy can be viewed as an objective function of multi-class classification. Minimizing cross entropy loss is equivalent to maximizing the log-likelihood function.

## 17.11.7. Exercises¶

Verify that the card examples from the first section indeed have the claimed entropy.

Let’s compute the entropy from a few data sources:

Assume that you are watching the output generated by a monkey at a typewriter. The monkey presses any of the \(44\) keys of the typewriter at random (you can assume that it has not discovered any special keys or the shift key yet). How many bits of randomness per character do you observe?

Being unhappy with the monkey, you replaced it by a drunk typesetter. It is able to generate words, albeit not coherently. Instead, it picks a random word out of a vocabulary of \(2,000\) words. Moreover, assume that the average length of a word is \(4.5\) letters in English. How many bits of randomness do you observe now?

Still being unhappy with the result, you replace the typesetter by a high quality language model. These can currently obtain perplexity numbers as low as \(15\) points per character. The perplexity is defined as a length normalized probability, i.e.,

(17.11.33)¶\[PPL(x) = \left[p(x)\right]^{1 / \text{length(x)} }.\]How many bits of randomness do you observe now?

Explain intuitively why \(I(X, Y) = H(X) - H(X|Y)\). Then, show this is true by expressing both sides as an expectation with respect to the joint distribution.

What is the KL Divergence between the two Gaussian distributions \(\mathcal{N}(\mu_1, \sigma_1^2)\) and \(\mathcal{N}(\mu_2, \sigma_2^2)\)?