2.4. Calculus¶
For a long time, how to calculate the area of a circle remained a
mystery. Then, in Ancient Greece, the mathematician Archimedes came up
with the clever idea to inscribe a series of polygons with increasing
numbers of vertices on the inside of a circle
(Fig. 2.4.1). For a polygon with $n$ vertices we obtain $n$ triangles, and as $n$ grows each triangle's height approaches the radius while the bases together approach the circumference, so the total area of the polygon approaches the area of the circle.
Fig. 2.4.1 Finding the area of a circle as a limit procedure.¶
This limiting procedure is at the root of both differential calculus and integral calculus. The former can tell us how to increase or decrease a function’s value by manipulating its arguments. This comes in handy for the optimization problems that we face in deep learning, where we repeatedly update our parameters in order to decrease the loss function. Optimization addresses how to fit our models to training data, and calculus is its key prerequisite. However, do not forget that our ultimate goal is to perform well on previously unseen data. That problem is called generalization and will be a key focus of other chapters.
2.4.1. Derivatives and Differentiation¶
Put simply, a derivative is the rate of change in a function with
respect to changes in its arguments. Derivatives can tell us how rapidly
a loss function would increase or decrease were we to increase or
decrease each parameter by an infinitesimally small amount. Formally,
for functions $f: \mathbb{R} \rightarrow \mathbb{R}$, which map from real numbers to real numbers, the derivative of $f$ at a point $x$ is defined as

$$f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}.$$
This term on the right hand side is called a limit and it tells us
what happens to the value of an expression as a specified variable
approaches a particular value. This limit tells us what the ratio
between a perturbation $h$ and the change in the function value $f(x+h) - f(x)$ converges to as we shrink the perturbation to zero.
When $f'(x)$ exists, $f$ is said to be differentiable at $x$; and when $f'(x)$ exists for all $x$ on a set, e.g., the interval $[a, b]$, we say that $f$ is differentiable on this set.
We can interpret the derivative $f'(x)$ as the instantaneous rate of change of $f(x)$ with respect to $x$.
To develop some intuition, define $u = f(x) = 3x^2 - 4x$. Setting $x = 1$, we see that $\frac{f(x+h) - f(x)}{h}$ approaches $2$ as $h$ approaches $0$. While this experiment lacks the rigor of a mathematical proof, we can quickly see that indeed $f'(1) = 2$.
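The code cell that produced the output below is not reproduced here. A minimal sketch of the numerical check, assuming plain NumPy rather than any particular deep learning framework and the example function $f(x) = 3x^2 - 4x$ from above, might look like this:

import numpy as np

def f(x):
    return 3 * x ** 2 - 4 * x

# Shrink the perturbation h and watch the difference quotient approach f'(1) = 2
for h in 10.0 ** np.arange(-1, -6, -1):
    print(f'h={h:.5f}, numerical limit={(f(1 + h) - f(1)) / h:.5f}')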
h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003
There are several equivalent notational conventions for derivatives.
Given $y = f(x)$, the following expressions are equivalent:

$$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x),$$

where the symbols $\frac{d}{dx}$ and $D$ are differentiation operators.
Functions composed from differentiable functions are often themselves
differentiable. The following rules come in handy for working with
compositions of any differentiable functions $f$ and $g$, and constant $C$:

$$\frac{d}{dx} [Cf(x)] = C \frac{d}{dx} f(x) \quad \textrm{(constant multiple rule)}$$
$$\frac{d}{dx} [f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x) \quad \textrm{(sum rule)}$$
$$\frac{d}{dx} [f(x) g(x)] = f(x) \frac{d}{dx} g(x) + g(x) \frac{d}{dx} f(x) \quad \textrm{(product rule)}$$
$$\frac{d}{dx} \frac{f(x)}{g(x)} = \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g^2(x)} \quad \textrm{(quotient rule)}$$
Using this, we can apply the rules to find the derivative of $3x^2 - 4x$ via

$$\frac{d}{dx} [3x^2 - 4x] = 3 \frac{d}{dx} x^2 - 4 \frac{d}{dx} x = 6x - 4.$$

Plugging in $x = 1$ shows that, indeed, the derivative equals $2$. Note that derivatives tell us the slope of a function at a particular location.
2.4.2. Visualization Utilities¶
We can visualize the slopes of functions using the matplotlib library. We need to define a few functions. As its name indicates, use_svg_display tells matplotlib to output graphics in SVG format for crisper images. The comment #@save is a special modifier that allows us to save any function, class, or other code block to the d2l package so that we can invoke it later without repeating the code, e.g., via d2l.use_svg_display().
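The definition itself is omitted above; a minimal sketch of what use_svg_display might look like, assuming the matplotlib_inline backend that ships with Jupyter, is:

from matplotlib_inline import backend_inline

def use_svg_display():  #@save
    """Use the svg format to display a plot in Jupyter."""
    backend_inline.set_matplotlib_formats('svg')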
Conveniently, we can set figure sizes with set_figsize. Since the import statement from matplotlib import pyplot as plt was marked via #@save in the d2l package, we can call d2l.plt.
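Again, the body is not shown above; a plausible sketch, assuming the use_svg_display sketch from the previous cell and access to matplotlib's rcParams via d2l.plt as described here, is:

def set_figsize(figsize=(3.5, 2.5)):  #@save
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize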
The set_axes
function can associate axes with properties, including
labels, ranges, and scales.
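Its definition is likewise omitted; one hedged sketch, using only standard matplotlib Axes methods, is:

#@save
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
    axes.set_xscale(xscale), axes.set_yscale(yscale)
    axes.set_xlim(xlim), axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()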
With these three functions, we can define a plot
function to overlay
multiple curves. Much of the code here is just ensuring that the sizes
and shapes of inputs match.
#@save
def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""

    def has_one_axis(X):  # True if X (tensor or list) has 1 axis
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X): X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)

    set_figsize(figsize)
    if axes is None:
        axes = d2l.plt.gca()
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        axes.plot(x, y, fmt) if len(x) else axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
Now we can plot the function $u = f(x)$ and its tangent line $y = 2x - 3$ at $x = 1$, where the coefficient $2$ is the slope of the tangent line.
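The plotting cell is not shown; a sketch, assuming NumPy arrays together with the f and plot functions defined above, might be:

x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])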
2.4.3. Partial Derivatives and Gradients¶
Thus far, we have been differentiating functions of just one variable. In deep learning, we also need to work with functions of many variables. We briefly introduce notions of the derivative that apply to such multivariate functions.
Let $y = f(x_1, x_2, \ldots, x_n)$ be a function with $n$ variables. The partial derivative of $y$ with respect to its $i^\textrm{th}$ parameter $x_i$ is

$$\frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}.$$

To calculate $\frac{\partial y}{\partial x_i}$, we can treat $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants and calculate the derivative of $y$ with respect to $x_i$. The following notational conventions for partial derivatives are all common and all mean the same thing:

$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = \partial_{x_i} f = \partial_i f = f_{x_i} = f_i = D_i f = D_{x_i} f.$$
We can concatenate partial derivatives of a multivariate function with
respect to all its variables to obtain a vector that is called the
gradient of the function. Suppose that the input of function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an $n$-dimensional vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$ and the output is a scalar. The gradient of the function $f$ with respect to $\mathbf{x}$ is a vector of $n$ partial derivatives:

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left[\partial_{x_1} f(\mathbf{x}), \partial_{x_2} f(\mathbf{x}), \ldots, \partial_{x_n} f(\mathbf{x})\right]^\top.$$

When there is no ambiguity, $\nabla_{\mathbf{x}} f(\mathbf{x})$ is typically replaced by $\nabla f(\mathbf{x})$. The following rules come in handy for differentiating multivariate functions:
- For all $\mathbf{A} \in \mathbb{R}^{m \times n}$ we have $\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top$ and $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}$.
- For square matrices $\mathbf{A} \in \mathbb{R}^{n \times n}$ we have that $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ (a numerical sanity check of this rule is sketched after this list) and in particular $\nabla_{\mathbf{x}} \|\mathbf{x}\|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}$.
- Similarly, for any matrix $\mathbf{X}$, we have $\nabla_{\mathbf{X}} \|\mathbf{X}\|_\textrm{F}^2 = 2\mathbf{X}$.
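The following is not part of the original text; it is a small NumPy sanity check of the quadratic-form rule above (the matrix, vector, and tolerance are arbitrary choices of this example):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

def quad(x):
    return x @ A @ x  # the quadratic form x^T A x

# Central finite differences along each coordinate direction
h = 1e-6
numeric = np.array([(quad(x + h * e) - quad(x - h * e)) / (2 * h) for e in np.eye(4)])
analytic = (A + A.T) @ x  # the gradient predicted by the rule
print(np.allclose(numeric, analytic, atol=1e-4))  # True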
2.4.4. Chain Rule¶
In deep learning, the gradients of concern are often difficult to
calculate because we are working with deeply nested functions (of
functions (of functions…)). Fortunately, the chain rule takes care of
this. Returning to functions of a single variable, suppose that $y = f(g(x))$ and that the underlying functions $y = f(u)$ and $u = g(x)$ are both differentiable. The chain rule states that

$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.$$

Turning back to multivariate functions, suppose that $y = f(\mathbf{u})$ has variables $u_1, u_2, \ldots, u_m$, where each $u_i = g_i(\mathbf{x})$ has variables $x_1, x_2, \ldots, x_n$, i.e., $\mathbf{u} = g(\mathbf{x})$. Then the chain rule states that

$$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \cdots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i} \ \textrm{ and so } \ \nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y,$$

where $\mathbf{A} \in \mathbb{R}^{n \times m}$ is a matrix that contains the derivative of the vector $\mathbf{u}$ with respect to the vector $\mathbf{x}$. Thus, evaluating the gradient requires computing a vector-matrix product. A short numerical illustration of the single-variable chain rule follows below.
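As a brief illustration (again not from the original text), here is a NumPy check of the single-variable chain rule for the arbitrarily chosen composition $y = \sin(x^2)$:

import numpy as np

# y = f(u) = sin(u) with u = g(x) = x**2, so dy/dx = cos(x**2) * 2x by the chain rule
x, h = 0.7, 1e-6
numeric = (np.sin((x + h) ** 2) - np.sin((x - h) ** 2)) / (2 * h)
analytic = np.cos(x ** 2) * 2 * x
print(numeric, analytic)  # the two values agree to several decimal places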
2.4.5. Discussion¶
While we have just scratched the surface of a deep topic, a number of concepts already come into focus: first, the composition rules for differentiation can be applied routinely, enabling us to compute gradients automatically. This task requires no creativity and thus we can focus our cognitive powers elsewhere. Second, computing the derivatives of vector-valued functions requires us to multiply matrices as we trace the dependency graph of variables from output to input. In particular, this graph is traversed in a forward direction when we evaluate a function and in a backwards direction when we compute gradients. Later chapters will formally introduce backpropagation, a computational procedure for applying the chain rule.
From the viewpoint of optimization, gradients allow us to determine how to move the parameters of a model in order to lower the loss, and each step of the optimization algorithms used throughout this book will require calculating the gradient.
2.4.6. Exercises¶
1. So far we took the rules for derivatives for granted. Using the definition and limits prove the properties for (i) $f(x) = c$, (ii) $f(x) = x^n$, (iii) $f(x) = e^x$ and (iv) $f(x) = \log x$.
2. In the same vein, prove the product, sum, and quotient rule from first principles.
3. Prove that the constant multiple rule follows as a special case of the product rule.
4. Calculate the derivative of $f(x) = x^x$.
5. What does it mean that $f'(x) = 0$ for some $x$? Give an example of a function $f$ and a location $x$ for which this might hold.
6. Plot the function $y = f(x) = x^3 - \frac{1}{x}$ and plot its tangent line at $x = 1$.
7. Find the gradient of the function $f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}$.
8. What is the gradient of the function $f(\mathbf{x}) = \|\mathbf{x}\|_2$? What happens for $\mathbf{x} = \mathbf{0}$?
9. Can you write out the chain rule for the case where $u = f(x, y, z)$ and $x = x(a, b)$, $y = y(a, b)$, and $z = z(a, b)$?
10. Given a function $f(x)$ that is invertible, compute the derivative of its inverse $f^{-1}(x)$. Here we have that $f^{-1}(f(x)) = x$ and conversely $f(f^{-1}(y)) = y$. Hint: use these properties in your derivation.