22.10. Statistics¶
Undoubtedly, to be a top deep learning practitioner, the ability to train state-of-the-art, highly accurate models is crucial. However, it is often unclear when improvements are significant, or when they are merely the result of random fluctuations in the training process. To be able to discuss uncertainty in estimated values, we must learn some statistics.
The earliest reference to statistics can be traced back to the Arab scholar Al-Kindi in the 9th century, who gave a detailed description of how to use statistics and frequency analysis to decipher encrypted messages.
More specifically, statistics can be divided into descriptive statistics and statistical inference. The former focuses on summarizing and illustrating the features of a collection of observed data, which is referred to as a sample. The sample is drawn from a population, which denotes the total set of similar individuals, items, or events of interest in our experiment. Contrary to descriptive statistics, statistical inference further deduces the characteristics of a population from the given samples, based on the assumption that the sample distribution can replicate the population distribution to some degree.
You may wonder: “What is the essential difference between machine learning and statistics?” Fundamentally speaking, statistics focuses on the inference problem. This type of problem includes modeling the relationship between the variables, such as causal inference, and testing the statistical significance of model parameters, such as A/B testing. In contrast, machine learning emphasizes making accurate predictions, without explicitly programming and understanding each parameter’s functionality.
In this section, we will introduce three types of statistical inference methods: evaluating and comparing estimators, conducting hypothesis tests, and constructing confidence intervals. These methods can help us infer the characteristics of a given population, i.e., the true parameter $\theta$. For brevity, we assume that the true parameter $\theta$ of a given population is a scalar value.
22.10.1. Evaluating and Comparing Estimators¶
In statistics, an estimator is a function of given samples used to estimate the true parameter $\theta$. We will write $\hat{\theta}_n = \hat{f}(x_1, \ldots, x_n)$ for the estimate of $\theta$ after observing the samples $\{x_1, x_2, \ldots, x_n\}$.
We have seen simple examples of estimators before in Section 22.7. If you have a number of samples from a Bernoulli random variable, then the maximum likelihood estimate for the probability that the random variable is one can be obtained by counting the number of ones observed and dividing by the total number of samples. Similarly, an exercise asked you to show that the maximum likelihood estimate of the mean of a Gaussian given a number of samples is the average value of all the samples. These estimators will almost never give the true value of the parameter, but ideally for a large number of samples the estimate will be close.
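For instance, here is a minimal PyTorch sketch of these two estimators; the sample sizes and the true parameter values below are illustrative choices of ours, not from the text.

import torch

# Bernoulli: estimate p by the fraction of ones among the samples
p_true = 0.3
bernoulli_samples = (torch.rand(1000) < p_true).float()
p_hat = bernoulli_samples.mean()  # maximum likelihood estimate of p

# Gaussian: estimate the mean by the sample average
mu_true = 2.0
gaussian_samples = torch.normal(mu_true, 1.0, size=(1000,))
mu_hat = gaussian_samples.mean()  # maximum likelihood estimate of the mean

print(p_hat, mu_hat)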
As an example, we show below the true density of a Gaussian random variable with mean zero and variance one, along with a collection of samples from that Gaussian. We constructed the $y$ coordinate of each sample so that every point is visible and its relationship to the original density is clearer.
import torch
from d2l import torch as d2l
torch.pi = torch.acos(torch.zeros(1)) * 2 #define pi in torch
# Sample datapoints and create y coordinate
epsilon = 0.1
torch.manual_seed(8675309)
xs = torch.randn(size=(300,))
ys = torch.tensor(
    [torch.sum(torch.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2))
               / torch.sqrt(2 * torch.pi * epsilon**2)) / len(xs)
     for i in range(len(xs))])
# Compute true density
xd = torch.arange(torch.min(xs), torch.max(xs), 0.01)
yd = torch.exp(-xd**2/2) / torch.sqrt(2 * torch.pi)
# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=torch.mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(torch.mean(xs).item()):.2f}')
d2l.plt.show()
import random
from mxnet import np, npx
from d2l import mxnet as d2l
npx.set_np()
# Sample datapoints and create y coordinate
epsilon = 0.1
random.seed(8675309)
xs = np.random.normal(loc=0, scale=1, size=(300,))
ys = [np.sum(np.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2))
             / np.sqrt(2 * np.pi * epsilon**2)) / len(xs)
      for i in range(len(xs))]
# Compute true density
xd = np.arange(np.min(xs), np.max(xs), 0.01)
yd = np.exp(-xd**2/2) / np.sqrt(2 * np.pi)
# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=np.mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(np.mean(xs)):.2f}')
d2l.plt.show()
import tensorflow as tf
from d2l import tensorflow as d2l
tf.pi = tf.acos(tf.zeros(1)) * 2 # define pi in TensorFlow
# Sample datapoints and create y coordinate
epsilon = 0.1
xs = tf.random.normal((300,))
ys = tf.constant(
    [(tf.reduce_sum(tf.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2))
                    / tf.sqrt(2 * tf.pi * epsilon**2)) / tf.cast(
        tf.size(xs), dtype=tf.float32)).numpy()
     for i in range(tf.size(xs))])
# Compute true density
xd = tf.range(tf.reduce_min(xs), tf.reduce_max(xs), 0.01)
yd = tf.exp(-xd**2/2) / tf.sqrt(2 * tf.pi)
# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=tf.reduce_mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(tf.reduce_mean(xs).numpy()):.2f}')
d2l.plt.show()
There can be many ways to compute an estimator $\hat{\theta}_n$ of a parameter $\theta$. In this section, we introduce three common methods to evaluate and compare estimators: the mean squared error, the standard deviation, and the statistical bias.
22.10.1.1. Mean Squared Error¶
Perhaps the simplest metric used to evaluate estimators is the mean squared error (MSE) (or $l_2$ loss), which can be defined as

$$\mathrm{MSE}(\hat{\theta}_n, \theta) = E[(\hat{\theta}_n - \theta)^2]. \tag{22.10.1}$$
This allows us to quantify the average squared deviation from the true value. MSE is always non-negative. If you have read Section 3.1, you will recognize it as the most commonly used regression loss function. As a measure to evaluate an estimator, the closer its value is to zero, the closer the estimator is to the true parameter $\theta$.
22.10.1.2. Statistical Bias¶
The MSE provides a natural metric, but we can easily imagine multiple different phenomena that might make it large. Two fundamentally important ones are fluctuation in the estimator due to randomness in the dataset, and systematic error in the estimator due to the estimation procedure.
First, let’s measure the systematic error. For an estimator $\hat{\theta}_n$, the statistical bias can be defined as

$$\mathrm{bias}(\hat{\theta}_n) = E(\hat{\theta}_n - \theta) = E(\hat{\theta}_n) - \theta. \tag{22.10.2}$$

Note that when $\mathrm{bias}(\hat{\theta}_n) = 0$, the expectation of the estimator $\hat{\theta}_n$ is equal to the true value of the parameter. In this case, we say $\hat{\theta}_n$ is an unbiased estimator. In general, an unbiased estimator is better than a biased estimator since its expectation is the same as the true parameter.
It is worth being aware, however, that biased estimators are frequently used in practice. There are cases where unbiased estimators do not exist without further assumptions, or are intractable to compute. This may seem like a significant flaw in an estimator; however, the majority of estimators encountered in practice are at least asymptotically unbiased in the sense that the bias tends to zero as the number of available samples tends to infinity:

$$\lim_{n \rightarrow \infty} \mathrm{bias}(\hat{\theta}_n) = 0.$$
22.10.1.3. Variance and Standard Deviation¶
Second, let’s measure the randomness in the estimator. Recall from Section 22.6 that the standard deviation (or standard error) is defined as the square root of the variance. We may measure the degree of fluctuation of an estimator by measuring the standard deviation or variance of that estimator:

$$\sigma_{\hat{\theta}_n} = \sqrt{\mathrm{Var}(\hat{\theta}_n)} = \sqrt{E\left[(\hat{\theta}_n - E(\hat{\theta}_n))^2\right]}. \tag{22.10.3}$$
It is important to compare (22.10.3) to (22.10.1). In this equation we do not compare to the true population value $\theta$, but instead to $E(\hat{\theta}_n)$, the expected value of the estimator itself. Thus we are not measuring how far the estimator tends to be from the true value, but instead measuring the fluctuation of the estimator itself.
22.10.1.4. The Bias-Variance Trade-off¶
It is intuitively clear that these two main components contribute to the mean squared error. What is somewhat shocking is that we can show that this is actually a decomposition of the mean squared error into these two contributions plus a third one. That is to say, we can write the mean squared error as the sum of the square of the bias, the variance, and the irreducible error:

$$\mathrm{MSE}(\hat{\theta}_n, \theta) = (\mathrm{bias}[\hat{\theta}_n])^2 + \mathrm{Var}(\hat{\theta}_n) + \mathrm{Var}[\theta].$$
We refer to the above formula as the bias-variance trade-off. The mean squared error can be divided into three sources of error: the error from high bias, the error from high variance, and the irreducible error. The bias error is commonly seen in a simple model (such as a linear regression model), which cannot extract high-dimensional relations between the features and the outputs. If a model suffers from high bias error, we often say it is underfitting or lacks flexibility as introduced in (Section 3.6). High variance usually results from a model that is too complex, which overfits the training data. As a result, an overfitting model is sensitive to small fluctuations in the data. If a model suffers from high variance, we often say it is overfitting and lacks generalization as introduced in (Section 3.6). The irreducible error is the result of noise in $\theta$ itself.
22.10.1.5. Evaluating Estimators in Code¶
Since the standard deviation of an estimator can be computed by simply calling a.std() for a tensor a, we will skip it, but implement the statistical bias and the mean squared error.
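A minimal PyTorch sketch of these two quantities follows; the helper names stat_bias and mse are our own choices.

# Statistical bias: difference between the estimator's expectation and the true value
def stat_bias(true_theta, est_theta):
    return torch.mean(est_theta) - true_theta

# Mean squared error: average squared deviation from the true value
def mse(data, true_theta):
    return torch.mean(torch.square(data - true_theta))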
To illustrate the equation of the bias-variance trade-off, let’s simulate a normal distribution $\mathcal{N}(\theta, \sigma^2)$ with $10{,}000$ samples. Here, we use $\theta = 1$ and $\sigma = 4$. As the estimator is a function of the given samples, we use the mean of the samples as an estimator of the true $\theta$ in this normal distribution.
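A sketch of this simulation in PyTorch, assuming the parameter choices just described ($\theta = 1$, $\sigma = 4$, $10{,}000$ samples):

theta_true = 1.0
sigma = 4.0
sample_len = 10000
# Draw the samples and use the sample mean as the estimator of theta
samples = torch.normal(theta_true, sigma, size=(sample_len,))
theta_est = torch.mean(samples)
theta_est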
tensor(1.0170)
array(0.9503336)
Let’s validate the trade-off equation by calculating the sum of the squared bias and the variance of our estimator. First, we calculate the MSE of our estimator.
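A one-line sketch, assuming the mse helper and the samples defined above:

mse(samples, theta_true)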
Next, we calculate $\mathrm{Var}(\hat{\theta}_n) + [\mathrm{bias}(\hat{\theta}_n)]^2$ as below. As you can see, the two values agree up to numerical precision.
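Again a sketch using the helpers above; the population variance of the samples (unbiased=False) is used here so that the identity holds exactly.

bias = stat_bias(theta_true, theta_est)
torch.square(samples.std(unbiased=False)) + torch.square(bias)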
tensor(16.0298)
22.10.2. Conducting Hypothesis Tests¶
The most commonly encountered topic in statistical inference is hypothesis testing. While hypothesis testing was popularized in the early 20th century, the first use can be traced back to John Arbuthnot in the 1700s. Arbuthnot tracked 80 years of birth records in London and concluded that more men were born than women each year. Following that, modern significance testing owes its intellectual heritage to Karl Pearson, who invented the $p$-value and Pearson's chi-squared test, William Gosset, the father of Student's t-distribution, and Ronald Fisher, who initialized the null hypothesis and the significance test.
A hypothesis test is a way of evaluating some evidence against the default statement about a population. We refer to the default statement as the null hypothesis $H_0$, which we try to reject using the observed data. Here, we use $H_0$ as the starting point for statistical significance testing. The alternative hypothesis $H_A$ (or $H_1$) is a statement that is contrary to the null hypothesis.
Imagine you are a chemist. After spending thousands of hours in the lab, you develop a new medicine which can dramatically improve one’s ability to understand math. To show its magic power, you need to test it. Naturally, you may need some volunteers to take the medicine and see whether it can help them learn mathematics better. How do you get started?
First, you will need to carefully select two groups of volunteers at random, so that there is no difference between their mathematical understanding ability measured by some metrics. The two groups are commonly referred to as the test group and the control group. The test group (or treatment group) is the group of individuals who will experience the medicine, while the control group represents the group of users who are set aside as a benchmark, i.e., identical environment setups except for taking this medicine. In this way, the influence of all the variables is minimized, except the impact of the independent variable in the treatment.
Second, after a period of taking the medicine, you will need to measure the two groups’ mathematical understanding by the same metrics, such as letting the volunteers do the same tests after learning a new mathematical formula. Then, you can collect their performance and compare the results. In this case, our null hypothesis will be that there is no difference between the two groups, and our alternative will be that there is.
This is still not fully formal. There are many details you have to think through carefully. For example, what are suitable metrics to test their mathematical understanding ability? How many volunteers do you need so that you can confidently claim the effectiveness of your medicine? How long should you run the test? How do you decide whether there is a difference between the two groups? Do you care about the average performance only, or also about the range of variation of the scores? And so on.
In this way, hypothesis testing provides a framework for experimental design and reasoning about certainty in observed results. If we can now show that the null hypothesis is very unlikely to be true, we may reject it with confidence.
To complete the story of how to work with hypothesis testing, we now need to introduce some additional terminology and make some of our concepts above formal.
22.10.2.1. Statistical Significance¶
The statistical significance measures the probability of erroneously rejecting the null hypothesis, $H_0$, when it should not be rejected, i.e.,

$$\textrm{statistical significance} = 1 - \alpha = 1 - P(\textrm{reject } H_0 \mid H_0 \textrm{ is true}).$$

It is also referred to as the type I error or false positive. Here $\alpha$ is called the statistical significance level, and a commonly used value is $5\%$, i.e., $1-\alpha = 95\%$. The statistical significance level can be explained as the level of risk that we are willing to take when we reject a true null hypothesis.

Fig. 22.10.1 shows the observations’ values and the probability of a given normal distribution in a two-sample hypothesis test. If an observed data example is located outside the $95\%$ threshold, it will be a very unlikely observation under the null hypothesis assumption. Hence, there might be something wrong with the null hypothesis and we will reject it.
Fig. 22.10.1 Statistical significance.¶
22.10.2.2. Statistical Power¶
The statistical power (or sensitivity) measures the probability of rejecting the null hypothesis, $H_0$, when it should be rejected, i.e.,

$$\textrm{statistical power} = 1 - \beta = 1 - P(\textrm{fail to reject } H_0 \mid H_0 \textrm{ is false}).$$

Recall that a type I error is an error caused by rejecting the null hypothesis when it is true, whereas a type II error results from failing to reject the null hypothesis when it is false. A type II error is usually denoted as $\beta$, and hence the corresponding statistical power is $1 - \beta$.
Intuitively, statistical power can be interpreted as how likely our test
will detect a real discrepancy of some minimum magnitude at a desired
statistical significance level.
One of the most common uses of statistical power is in determining the number of samples needed. The probability that you reject the null hypothesis when it is false depends on the degree to which it is false (known as the effect size) and the number of samples you have. As you might expect, small effect sizes will require a very large number of samples to be detectable with high probability. While it is beyond the scope of this brief appendix to derive in detail, as an example, if we want to be able to reject a null hypothesis that our sample came from a mean zero variance one Gaussian, and we believe that our sample’s mean is actually close to one, we can do so with acceptable error rates with a sample size of only $8$. However, if we think our sample population’s true mean were close to $0.01$, then we would need a sample size of nearly $80000$ to detect the difference.
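As a rough illustration of this kind of power calculation, the sketch below uses the standard normal approximation $n \approx (z_{\alpha/2} + z_{\beta})^2 \sigma^2 / \delta^2$ for a two-sided test; the significance level $0.05$ and power $0.8$ are assumed choices of ours for this example, and the code assumes the PyTorch setup above.

normal = torch.distributions.Normal(0., 1.)

def required_sample_size(effect_size, sigma=1.0, alpha=0.05, power=0.8):
    # Normal-approximation sample size for a two-sided test of the mean
    z_alpha = normal.icdf(torch.tensor(1 - alpha / 2))
    z_beta = normal.icdf(torch.tensor(power))
    n = ((z_alpha + z_beta) * sigma / effect_size) ** 2
    return int(torch.ceil(n))

# Large effect (mean near 1) vs. tiny effect (mean near 0.01)
required_sample_size(1.0), required_sample_size(0.01)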
We can imagine the power as a water filter. In this analogy, a high power hypothesis test is like a high quality water filtration system that will reduce harmful substances in the water as much as possible. On the other hand, a lower power hypothesis test is like a low quality water filter, where some relatively small substances may easily escape through the gaps. Similarly, if the statistical power is not high enough, the test may not catch a smaller discrepancy.
22.10.2.3. Test Statistic¶
A test statistic $T(x)$ is a scalar which summarizes some characteristic of the sample data. The goal of defining such a statistic is that it should allow us to distinguish between different distributions and conduct our hypothesis test. Thinking back to our chemist example, if we wish to show that one population performs better than the other, it could be reasonable to take the mean as the test statistic. Different choices of test statistic can lead to statistical tests with drastically different statistical power.

Often, $T(X)$, the distribution of the test statistic under our null hypothesis, will follow, at least approximately, a common probability distribution such as a normal distribution. If we can derive such a distribution explicitly, and then measure our test statistic on our dataset, we can safely reject the null hypothesis if our statistic is far outside the range that we would expect. Making this quantitative leads us to the notion of $p$-values.
22.10.2.4. p-value¶
The $p$-value (or the probability value) is the probability that $T(X)$ is at least as extreme as the observed test statistic $T(x)$ assuming that the null hypothesis is true, i.e.,

$$p\textrm{-value} = P_{H_0}(T(X) \geq T(x)).$$

If the $p$-value is smaller than or equal to a predefined and fixed statistical significance level $\alpha$, we may reject the null hypothesis. Otherwise, we conclude that we lack evidence to reject the null hypothesis. For a given population distribution, the rejection region is the interval containing all the points whose $p$-value is smaller than the statistical significance level $\alpha$.
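For example, here is a hedged sketch of how a two-sided $p$-value could be computed for a one-sample $z$-test of the mean; the toy data and the choice of test are our own illustration, assuming the PyTorch setup above.

# Toy data: do these samples plausibly come from a mean-zero Gaussian?
data = torch.normal(0.2, 1.0, size=(50,))

# z statistic for the sample mean under H_0: mu = 0
z = torch.mean(data) / (data.std() / torch.sqrt(torch.tensor(50.)))

# Two-sided p-value from the standard normal c.d.f.
normal = torch.distributions.Normal(0., 1.)
p_value = 2 * (1 - normal.cdf(torch.abs(z)))
p_value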
22.10.2.5. One-sided Test and Two-sided Test¶
Normally there are two kinds of significance tests: the one-sided test and the two-sided test. The one-sided test (or one-tailed test) is applicable when the null hypothesis and the alternative hypothesis only have one direction. For example, the null hypothesis may state that the true parameter $\theta$ is less than or equal to a value $c$, while the alternative hypothesis states that $\theta$ is greater than $c$. That is, the rejection region is on only one side of the sampling distribution. Contrary to the one-sided test, the two-sided test (or two-tailed test) is applicable when the alternative hypothesis states that the true parameter $\theta$ is not equal to the value $c$ in the null hypothesis. The rejection region is then on both sides of the sampling distribution.
22.10.2.6. General Steps of Hypothesis Testing¶
After getting familiar with the above concepts, let’s go through the general steps of hypothesis testing.
1. State the question and establish a null hypothesis $H_0$.
2. Set the statistical significance level $\alpha$ and a statistical power ($1 - \beta$).
3. Obtain samples through experiments. The number of samples needed will depend on the statistical power and the expected effect size.
4. Calculate the test statistic and the $p$-value.
5. Make the decision to keep or reject the null hypothesis based on the $p$-value and the statistical significance level $\alpha$.
To conduct a hypothesis test, we start by defining a null hypothesis and a level of risk that we are willing to take. Then we calculate the test statistic of the sample, taking an extreme value of the test statistic as evidence against the null hypothesis. If the test statistic falls within the rejection region, we may reject the null hypothesis in favor of the alternative.
Hypothesis testing is applicable in a variety of scenarios such as clinical trials and A/B testing.
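To make the steps above concrete, here is a hedged sketch of a two-sample test in the spirit of the chemist example, using the difference of sample means as the test statistic and a normal approximation for its null distribution; the group sizes, scores, and significance level are synthetic choices of ours, assuming the PyTorch setup above.

# Step 1: H_0 -- the test and control groups have the same mean score
# Step 2: significance level (an assumed, common choice)
alpha = 0.05

# Step 3: synthetic scores for the two groups
test_scores = torch.normal(75.0, 10.0, size=(100,))
control_scores = torch.normal(70.0, 10.0, size=(100,))

# Step 4: test statistic (difference of means over its standard error) and p-value
se = torch.sqrt(test_scores.var() / 100 + control_scores.var() / 100)
z = (torch.mean(test_scores) - torch.mean(control_scores)) / se
normal = torch.distributions.Normal(0., 1.)
p_value = 2 * (1 - normal.cdf(torch.abs(z)))

# Step 5: reject H_0 if the p-value is below the significance level
print(p_value, p_value <= alpha)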
22.10.3. Constructing Confidence Intervals¶
When estimating the value of a parameter $\theta$, point estimators like $\hat{\theta}$ are of limited utility since they contain no notion of uncertainty. Rather, it would be far better if we could produce an interval that would contain the true parameter $\theta$ with high probability.
To be useful, a confidence interval should be as small as possible for a given degree of certainty. Let’s see how to derive it.
22.10.3.1. Definition¶
Mathematically, a confidence interval for the true parameter $\theta$ is an interval $C_n$ that is computed from the sample data such that

$$P_{\theta}(C_n \ni \theta) \geq 1 - \alpha, \ \forall \theta. \tag{22.10.8}$$

Here $\alpha \in (0, 1)$, and $1 - \alpha$ is called the coverage of the interval. This is the same $\alpha$ as the significance level we discussed above.
Note that (22.10.8) is a statement about the random variable $C_n$, not about the fixed $\theta$.
22.10.3.2. Interpretation¶
It is very tempting to interpret a $95\%$ confidence interval as an interval where you can be $95\%$ sure the true parameter lies; however, this is sadly not true. The true parameter is fixed, and it is the interval that is random. Thus a better interpretation would be to say that if you generated a large number of confidence intervals by this procedure, $95\%$ of the generated intervals would contain the true parameter.
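This interpretation can be checked empirically. The sketch below (our own illustration, assuming the PyTorch setup above) repeatedly draws small Gaussian samples, builds an interval of the form mean ± 1.96·std/√n, and counts how often the true mean is covered.

torch.manual_seed(0)
n, trials, covered = 30, 1000, 0
for _ in range(trials):
    draw = torch.normal(0.0, 1.0, size=(n,))
    half_width = 1.96 * draw.std() / torch.sqrt(torch.tensor(float(n)))
    low, high = draw.mean() - half_width, draw.mean() + half_width
    covered += int((low <= 0.0) and (0.0 <= high))
# Fraction of intervals containing the true mean 0; close to 0.95
# (slightly less, since 1.96 is the asymptotic value and n is small)
print(covered / trials)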
This may seem pedantic, but it can have real implications for the interpretation of the results. In particular, we may satisfy (22.10.8) by constructing intervals that we are almost certain do not contain the true value, as long as we only do so rarely enough. We close this section by providing three tempting but false statements. An in-depth discussion of these points can be found in Morey et al. (2016).
Fallacy 1. Narrow confidence intervals mean we can estimate the parameter precisely.
Fallacy 2. The values inside the confidence interval are more likely to be the true value than those outside the interval.
Fallacy 3. The probability that a particular observed $95\%$ confidence interval contains the true value is $95\%$.
Suffice it to say, confidence intervals are subtle objects. However, if you keep the interpretation clear, they can be powerful tools.
22.10.3.3. A Gaussian Example¶
Let’s discuss the most classical example, the confidence interval for the mean of a Gaussian of unknown mean and variance. Suppose we collect $n$ samples $x_i$ from our Gaussian $\mathcal{N}(\mu, \sigma^2)$. We can compute estimators for the mean and variance by taking

$$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n x_i \quad \textrm{and} \quad \hat{\sigma}^2_n = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_n)^2.$$
If we now consider the random variable

$$T = \frac{\hat{\mu}_n - \mu}{\hat{\sigma}_n / \sqrt{n}},$$

we obtain a random variable following a well-known distribution called the Student's t-distribution on $n-1$ degrees of freedom.
This distribution is very well studied, and it is known, for instance, that as $n \rightarrow \infty$, it is approximately a standard Gaussian, and thus by looking up values of the Gaussian c.d.f. in a table, we may conclude that the value of $T$ is in the interval $[-1.96, 1.96]$ at least $95\%$ of the time. For finite values of $n$, the interval needs to be somewhat larger, but these values are well known and precomputed in tables.

Thus, we may conclude that for large $n$,

$$P\left(\frac{\hat{\mu}_n - \mu}{\hat{\sigma}_n / \sqrt{n}} \in [-1.96, 1.96]\right) \geq 0.95.$$
Rearranging this by multiplying both sides by $\hat{\sigma}_n/\sqrt{n}$ and then adding $\hat{\mu}_n$, we obtain

$$P\left(\mu \in \left[\hat{\mu}_n - 1.96\frac{\hat{\sigma}_n}{\sqrt{n}}, \hat{\mu}_n + 1.96\frac{\hat{\sigma}_n}{\sqrt{n}}\right]\right) \geq 0.95.$$

Thus we know that we have found our $95\%$ confidence interval:

$$\left[\hat{\mu}_n - 1.96\frac{\hat{\sigma}_n}{\sqrt{n}}, \hat{\mu}_n + 1.96\frac{\hat{\sigma}_n}{\sqrt{n}}\right]. \tag{22.10.13}$$
It is safe to say that (22.10.13) is one of the most used formulas in statistics. Let’s close our discussion of statistics by implementing it. For simplicity, we assume we are in the asymptotic regime. For small sample sizes, the appropriate value of t_star should be obtained either programmatically or from a $t$-table.
# PyTorch uses Bessel's correction by default, which means the use of ddof=1
# instead of default ddof=0 in numpy. We can use unbiased=False to imitate
# ddof=0.
# Number of samples
N = 1000
# Sample dataset
samples = torch.normal(0, 1, size=(N,))
# Look up Student's t-distribution c.d.f.
t_star = 1.96
# Construct interval
mu_hat = torch.mean(samples)
sigma_hat = samples.std(unbiased=True)
(mu_hat - t_star*sigma_hat/torch.sqrt(torch.tensor(N, dtype=torch.float32)),\
mu_hat + t_star*sigma_hat/torch.sqrt(torch.tensor(N, dtype=torch.float32)))
(tensor(-0.0568), tensor(0.0704))
# Number of samples
N = 1000
# Sample dataset
samples = np.random.normal(loc=0, scale=1, size=(N,))
# Look up Student's t-distribution c.d.f.
t_star = 1.96
# Construct interval
mu_hat = np.mean(samples)
sigma_hat = samples.std(ddof=1)
(mu_hat - t_star*sigma_hat/np.sqrt(N), mu_hat + t_star*sigma_hat/np.sqrt(N))
(array(-0.07853346), array(0.04412608))
# Number of samples
N = 1000
# Sample dataset
samples = tf.random.normal((N,), 0, 1)
# Look up Student's t-distribution c.d.f.
t_star = 1.96
# Construct interval
mu_hat = tf.reduce_mean(samples)
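# Note: tf.math.reduce_std computes the biased (ddof=0) standard deviation;
# for N=1000 the difference from the unbiased estimate is negligible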
sigma_hat = tf.math.reduce_std(samples)
(mu_hat - t_star*sigma_hat/tf.sqrt(tf.constant(N, dtype=tf.float32)), \
mu_hat + t_star*sigma_hat/tf.sqrt(tf.constant(N, dtype=tf.float32)))
(<tf.Tensor: shape=(), dtype=float32, numpy=-0.029904943>,
<tf.Tensor: shape=(), dtype=float32, numpy=0.09493986>)
22.10.4. Summary¶
Statistics focuses on inference problems, whereas deep learning emphasizes making accurate predictions without explicitly programming and understanding each parameter's functionality.
There are three common statistical inference methods: evaluating and comparing estimators, conducting hypothesis tests, and constructing confidence intervals.
There are three common metrics for evaluating estimators: the statistical bias, the standard deviation, and the mean squared error.
A confidence interval is an estimated range of a true population parameter that we can construct given the samples.
Hypothesis testing is a way of evaluating some evidence against the default statement about a population.
22.10.5. Exercises¶
Let $X_1, X_2, \ldots, X_n$ be iid samples, where “iid” stands for independent and identically distributed. Consider the estimators $\hat{\theta}$ and $\tilde{\theta}$ of $\theta$ given in (22.10.14) and (22.10.15).
Find the statistical bias, standard deviation, and mean squared error of $\hat{\theta}$.
Find the statistical bias, standard deviation, and mean squared error of $\tilde{\theta}$.
Which estimator is better?
For our chemist example in the introduction, can you derive the 5 steps to conduct a two-sided hypothesis test, given the statistical significance level $\alpha$ and the statistical power $1 - \beta$?
Run the confidence interval code with $N = 2$ and $\alpha = 0.5$ for many independently generated datasets, and plot the resulting intervals (in this case t_star = 1.0). You will see several very short intervals which are very far from containing the true mean $0$. Does this contradict the interpretation of the confidence interval? Do you feel comfortable using short intervals to indicate high precision estimates?