22.9. Naive Bayes
Throughout the previous sections, we learned about the theory of probability and random variables. To put this theory to work, let's introduce the naive Bayes classifier, which uses nothing but probabilistic fundamentals to let us classify digits.
Learning is all about making assumptions. If we want to classify a new data example that we have never seen before, we have to make some assumptions about which data examples are similar to each other. The naive Bayes classifier, a popular and remarkably clear algorithm, assumes that all features are independent of each other given the label in order to simplify the computation. In this section, we will apply this model to recognize characters in images.
22.9.1. Optical Character Recognition
MNIST (LeCun et al., 1998) is one of the most widely used datasets. It contains 60,000 images for training and 10,000 images for validation. Each image contains a handwritten digit from 0 to 9. The task is to classify each image into the corresponding digit.
Gluon provides an MNIST class in the data.vision module to automatically retrieve the dataset from the Internet. Subsequently, Gluon will use the already-downloaded local copy. We specify whether we are requesting the training set or the test set by setting the value of the parameter train to True or False, respectively. Each image is a grayscale image with both width and height of $28$ pixels.
import torch
import torchvision

data_transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    # Binarize: ToTensor scales pixels to [0, 1]; floor(x * 255 / 128) maps
    # original values >= 128 to 1 and the rest to 0, and squeeze drops the
    # channel dimension so each image is a (28, 28) tensor
    lambda x: torch.floor(x * 255 / 128).squeeze(dim=0)
])

mnist_train = torchvision.datasets.MNIST(
    root='./temp', train=True, transform=data_transform, download=True)
mnist_test = torchvision.datasets.MNIST(
    root='./temp', train=False, transform=data_transform, download=True)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./temp/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 115752065.81it/s]
Extracting ./temp/MNIST/raw/train-images-idx3-ubyte.gz to ./temp/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./temp/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 5234904.66it/s]
Extracting ./temp/MNIST/raw/train-labels-idx1-ubyte.gz to ./temp/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./temp/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 43715298.68it/s]
Extracting ./temp/MNIST/raw/t10k-images-idx3-ubyte.gz to ./temp/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./temp/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 21501725.47it/s]
Extracting ./temp/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./temp/MNIST/raw
import tensorflow as tf

((train_images, train_labels), (
    test_images, test_labels)) = tf.keras.datasets.mnist.load_data()

# Original pixel values of MNIST range from 0-255 (as the digits are stored as
# uint8). For this section, pixel values that are greater than or equal to 128
# (in the original image) are converted to 1 and values below 128 are
# converted to 0. See Section 22.9.2 and Section 22.9.3 for why
train_images = tf.floor(tf.constant(train_images / 128, dtype=tf.float32))
test_images = tf.floor(tf.constant(test_images / 128, dtype=tf.float32))

train_labels = tf.constant(train_labels, dtype=tf.int32)
test_labels = tf.constant(test_labels, dtype=tf.int32)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
We can access a particular example, which contains the image and the corresponding label.
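The code cell for this step was not preserved here; the following is a minimal PyTorch sketch, assuming mnist_train as defined above (the index 2 is an arbitrary, hypothetical choice for illustration):

image, label = mnist_train[2]  # one (image, label) pair; index chosen arbitrarily
image.shape, label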
Our example, stored here in the variable image, corresponds to an image with a height and width of $28$ pixels.
Our code stores the label of each image as a scalar. Its type is a $32$-bit integer.
We can also access multiple examples at the same time.
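Again a sketch rather than the original cell: stacking examples 10 through 37, assuming mnist_train from above, yields the shapes printed below.

images = torch.stack([mnist_train[i][0] for i in range(10, 38)], dim=0)
labels = torch.tensor([mnist_train[i][1] for i in range(10, 38)])
images.shape, labels.shape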
(torch.Size([28, 28, 28]), torch.Size([28]))
Let’s visualize these examples.
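The visualization cell is likewise not preserved; a sketch using the d2l plotting helper (assuming the standard book import shown below):

from d2l import torch as d2l

d2l.show_images(images, 2, 9);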
22.9.2. The Probabilistic Model for Classification
In a classification task, we map an example into a category. Here an example is a grayscale $28 \times 28$ image, and a category is a digit. One natural way to express the task probabilistically is to ask: what is the most likely label given the features, i.e., the image pixels? Denote by $\mathbf{x} \in \mathbb{R}^d$ the features of the example and by $y$ its label; here the features are the $d = 28^2 = 784$ pixel values and the labels are the ten digits. If we could compute $p(y \mid \mathbf{x})$ for $y = 0, \ldots, 9$, the classifier would output the prediction $\hat{y} = \operatorname*{argmax}_y \, p(y \mid \mathbf{x})$.
Unfortunately, this requires that we estimate $p(y \mid \mathbf{x})$ for every value of $\mathbf{x} = x_1, \ldots, x_d$. Even with binary pixels there are $2^d$ possible images, so we would essentially have to see, and memorize the label of, every possible input.
Moreover, where is the learning? If we need to see every single possible example in order to predict the corresponding label then we are not really learning a pattern but just memorizing the dataset.
22.9.3. The Naive Bayes Classifier
Fortunately, by making some assumptions about conditional independence, we can introduce some inductive bias and build a model capable of generalizing from a comparatively modest selection of training examples. To begin, let's use Bayes' theorem to express the classifier as

$$\hat{y} = \operatorname*{argmax}_y \, p(y \mid \mathbf{x}) = \operatorname*{argmax}_y \, \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})}.$$

Note that the denominator is the normalizing term $p(\mathbf{x})$, which does not depend on the value of the label $y$. As a result, we only need to worry about comparing the numerator across different values of $y$.
Now, let's focus on $p(\mathbf{x} \mid y)$. Using the chain rule of probability, we can express this term as

$$p(x_1 \mid y) \cdot p(x_2 \mid x_1, y) \cdots p(x_d \mid x_1, \ldots, x_{d-1}, y).$$

By itself, this expression does not get us any further. We still must estimate roughly $2^d$ parameters. However, if we assume that the features are conditionally independent of each other given the label, then the term simplifies to $\prod_i p(x_i \mid y)$, giving us the predictor

$$\hat{y} = \operatorname*{argmax}_y \, \prod_{i=1}^d p(x_i \mid y)\, p(y).$$

If we can estimate $p(x_i = 1 \mid y)$ for every $i$ and $y$, and save its value in $P_{xy}[i, y]$ (here $P_{xy}$ is a $d \times n$ matrix with $n$ being the number of classes), then we can also use it to estimate $p(x_i = 0 \mid y) = 1 - P_{xy}[i, y]$, i.e.,

$$p(x_i \mid y) = P_{xy}[i, y]^{x_i} \big(1 - P_{xy}[i, y]\big)^{1 - x_i}.$$

In addition, we estimate $p(y)$ for every $y$ and save it in $P_y[y]$, with $P_y$ a vector of length $n$. Then, for any new example $\mathbf{x}$, we can compute

$$\hat{y} = \operatorname*{argmax}_y \, P_y[y] \prod_{i=1}^d P_{xy}[i, y]^{x_i} \big(1 - P_{xy}[i, y]\big)^{1 - x_i} \qquad (22.9.6)$$

for any $y$.
22.9.4. Training
The problem now is that we do not know $P_{xy}$ and $P_y$, so we need to estimate their values from some training data first. This is training the model. Estimating $P_y$ is not too hard: since we are only dealing with $10$ classes, we can count the number of occurrences $n_y$ of each digit and divide it by the total amount of data $n$, giving the estimate $P_y[y] = n_y / n$.
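The estimation cell itself is not preserved here; a minimal PyTorch sketch, assuming mnist_train from above, produces class frequencies like the ones shown below:

X = torch.stack([mnist_train[i][0] for i in range(len(mnist_train))], dim=0)
Y = torch.tensor([mnist_train[i][1] for i in range(len(mnist_train))])

n_y = torch.zeros(10)
for y in range(10):
    # Count how many training images carry label y
    n_y[y] = (Y == y).sum()
P_y = n_y / n_y.sum()
P_y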
tensor([0.0987, 0.1124, 0.0993, 0.1022, 0.0974, 0.0904, 0.0986, 0.1044, 0.0975,
0.0992])
array([0.09871667, 0.11236667, 0.0993 , 0.10218333, 0.09736667,
0.09035 , 0.09863333, 0.10441667, 0.09751666, 0.09915 ])
<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([0.09871667, 0.11236667, 0.0993 , 0.10218333, 0.09736667,
0.09035 , 0.09863333, 0.10441667, 0.09751666, 0.09915 ],
dtype=float32)>
Now on to slightly more difficult things, $P_{xy}$. Since we picked black-and-white images, $p(x_i \mid y)$ denotes the probability that pixel $i$ is switched on for class $y$. Just like before, we can count the number of times $n_{iy}$ this happens and divide it by the total number of occurrences of $y$, i.e., $n_y$. But there is something slightly troubling: certain pixels may never be switched on in the training data, which would yield a probability estimate of exactly zero. A convenient way to deal with this is to add pseudo counts to all occurrences: we use $n_{iy} + 1$ instead of $n_{iy}$ and $n_y + 2$ instead of $n_y$ (since each pixel can take two values). This is known as Laplace smoothing.
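Continuing the sketch above (X, Y, and n_y as defined there), one way to estimate $P_{xy}$ with Laplace smoothing is:

n_x = torch.zeros((10, 28, 28))
for y in range(10):
    # For each pixel, count how often it is switched on among images of class y
    n_x[y] = X[Y == y].sum(dim=0)
# Laplace smoothing: add 1 to each count and 2 to each total
P_xy = (n_x + 1) / (n_y + 2).reshape(10, 1, 1)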
By visualizing these $10 \times 28 \times 28$ probabilities (one probability per pixel for each class) we could get some mean-looking digits.
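The plotting cell is not shown here; a sketch using the same d2l helper as before:

d2l.show_images(P_xy, 2, 5);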
Now we can use (22.9.6) to predict a new image. Given $\mathbf{x}$, the following function computes $p(\mathbf{x} \mid y)\, p(y)$ for every $y$.
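The original cell is not preserved; below is a sketch of this (numerically naive) predictor, applied to the image variable from the earlier sketch:

def bayes_pred(x):
    x = x.unsqueeze(0)  # (28, 28) -> (1, 28, 28)
    # p(x_i|y) for every pixel, then multiply over all 784 pixels
    p_xy = P_xy * x + (1 - P_xy) * (1 - x)
    p_xy = p_xy.reshape(10, -1).prod(dim=1)  # p(x|y)
    return p_xy * P_y

bayes_pred(image)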
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>
This went horribly wrong! To find out why, let's look at the per pixel probabilities. They are typically numbers between $0.001$ and $1$, and we are multiplying $784$ of them together. Because these numbers are computed in floating point with a fixed range for the exponent, the product underflows, i.e., it is rounded down to zero.
We fix this by using the fact that $\log ab = \log a + \log b$, i.e., we switch to summing logarithms. Even if both $a$ and $b$ are small numbers, their logarithms stay in a manageable range.
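The demonstration cell is missing here; a small sketch that reproduces the printout below:

import math

a = 0.1
print('underflow:', a**784)
print('logarithm is normal:', 784 * math.log(a))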
underflow: 0.0
logarithm is normal: -1805.2267129073316
Since the logarithm is an increasing function, we can rewrite (22.9.6) as

$$\hat{y} = \operatorname*{argmax}_y \, \log P_y[y] + \sum_{i=1}^d \Big( x_i \log P_{xy}[i, y] + (1 - x_i) \log\big(1 - P_{xy}[i, y]\big) \Big).$$
We can implement the following stable version:
log_P_xy = torch.log(P_xy)
log_P_xy_neg = torch.log(1 - P_xy)
log_P_y = torch.log(P_y)
def bayes_pred_stable(x):
x = x.unsqueeze(0) # (28, 28) -> (1, 28, 28)
p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
p_xy = p_xy.reshape(10, -1).sum(axis=1) # p(x|y)
return p_xy + log_P_y
py = bayes_pred_stable(image)
py
tensor([-268.9725, -301.7044, -245.1951, -218.8738, -193.4570, -206.0909,
-292.5226, -114.6257, -220.3313, -163.1784])
log_P_xy = np.log(P_xy)
log_P_xy_neg = np.log(1 - P_xy)
log_P_y = np.log(P_y)
def bayes_pred_stable(x):
x = np.expand_dims(x, axis=0) # (28, 28) -> (1, 28, 28)
p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
p_xy = p_xy.reshape(10, -1).sum(axis=1) # p(x|y)
return p_xy + log_P_y
py = bayes_pred_stable(image)
py
array([-268.97253, -301.7044 , -245.19514, -218.87384, -193.45703,
-206.09088, -292.52264, -114.62566, -220.33133, -163.17842])
log_P_xy = tf.math.log(P_xy)
log_P_xy_neg = tf.math.log(1 - P_xy)
log_P_y = tf.math.log(P_y)
def bayes_pred_stable(x):
x = tf.expand_dims(x, axis=0) # (28, 28) -> (1, 28, 28)
p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
p_xy = tf.math.reduce_sum(tf.reshape(p_xy, (10, -1)), axis=1) # p(x|y)
return p_xy + log_P_y
py = bayes_pred_stable(image)
py
<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([-266.65015, -320.79282, -261.50186, -202.62967, -295.57236,
-199.49718, -318.66226, -266.01474, -224.59566, -271.9329 ],
dtype=float32)>
We may now check if the prediction is correct.
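The checking cell is not shown; a one-line sketch against the PyTorch result py and the label variable from the earlier sketch:

py.argmax(dim=0) == label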
array(True)
If we now predict a few examples, we can see that the naive Bayes classifier works pretty well.
def predict(X):
return [tf.argmax(
bayes_pred_stable(x), axis=0, output_type = tf.int32).numpy()
for x in X]
X = tf.stack([train_images[i] for i in range(10, 38)], axis=0)
y = tf.constant([train_labels[i].numpy() for i in range(10, 38)])
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);
Finally, let’s compute the overall accuracy of the classifier.
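The accuracy cell is not preserved; a PyTorch sketch over the held-out set, assuming mnist_test and bayes_pred_stable from above:

X_test = torch.stack([mnist_test[i][0] for i in range(len(mnist_test))], dim=0)
y_test = torch.tensor([mnist_test[i][1] for i in range(len(mnist_test))])
# Pick the class with the largest log-probability for each image
preds = torch.tensor([int(bayes_pred_stable(x).argmax(dim=0)) for x in X_test])
float((preds == y_test).sum()) / len(y_test)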
0.8427
0.8427
Modern deep networks achieve error rates of less than $1\%$ on this task. Our much worse accuracy stems from the incorrect statistical assumption we made in the model: we assumed that each pixel is generated independently, depending only on the label. This is clearly not how humans write digits, and this wrong assumption led to the downfall of our overly naive (Bayes) classifier.
22.9.5. Summary
Using Bayes' rule, a classifier can be made by assuming all observed features are independent given the label.
This classifier can be trained on a dataset by counting the number of occurrences of combinations of labels and pixel values.
This classifier was the gold standard for decades for tasks such as spam detection.
22.9.6. Exercises
1. Consider the dataset $[[0,0], [0,1], [1,0], [1,1]]$ with labels given by the XOR of the two elements, i.e., $[0, 1, 1, 0]$. What are the probabilities for a naive Bayes classifier built on this dataset? Does it successfully classify our points? If not, what assumptions are violated?
2. Suppose that we did not use Laplace smoothing when estimating probabilities and a data example arrived at testing time which contained a value never observed in training. What would the model output?
3. The naive Bayes classifier is a specific example of a Bayesian network, where the dependencies between random variables are encoded with a graph structure. While the full theory is beyond the scope of this section (see Koller and Friedman (2009) for full details), explain why allowing explicit dependence between the two input variables in the XOR model allows for the creation of a successful classifier.