.. _chapter_linear_algebra:
Linear Algebra
==============
Now that you can store and manipulate data, let’s briefly review the
subset of basic linear algebra that you will need to understand most of
the models. We will introduce all the basic concepts, the corresponding
mathematical notation, and their realization in code all in one place.
If you are already confident in your basic linear algebra, feel free to
skim through or skip this chapter.
.. code:: python
from mxnet import nd
Scalars
-------
If you have never studied linear algebra or machine learning, you are
probably used to working with one number at a time, and you know how to
do basic things like adding or multiplying them. For example, in
Palo Alto, the temperature is :math:`52` degrees Fahrenheit. Formally,
we call these values *scalars*. If you wanted to convert this
value to Celsius (the metric system’s more sensible unit of
temperature measurement), you would evaluate the expression
:math:`c = (f - 32) \cdot 5/9`, setting :math:`f` to :math:`52`. In this
equation, each of the terms :math:`32`, :math:`5`, and :math:`9` is a
scalar value. The placeholders :math:`c` and :math:`f` are
called variables, and they represent unknown scalar values.
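As a quick check of the conversion formula, here is a minimal plain-Python sketch. The function name ``fahrenheit_to_celsius`` is our own choice for illustration and is not part of MXNet.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


c = fahrenheit_to_celsius(52)
print(round(c, 2))  # 11.11
```

Here ``f`` and ``c`` play the role of the variables in the formula, while ``32``, ``5`` and ``9`` are the fixed scalar values.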
In mathematical notation, we represent scalars with ordinary lower-cased
letters (:math:`x`, :math:`y`, :math:`z`). We also denote the space of
all real-valued scalars as :math:`\mathbb{R}`. For expedience, we are
going to punt a bit on what precisely a space is, but for now, remember
that if you want to say that :math:`x` is a scalar, you can simply say
:math:`x \in \mathbb{R}`. The symbol :math:`\in` can be pronounced “in”
and just denotes membership in a set.
In MXNet, we work with scalars by creating NDArrays with just one
element. In this snippet, we instantiate two scalars and perform some
familiar arithmetic operations with them, such as addition,
multiplication, division and exponentiation.
.. code:: python
x = nd.array([3.0])
y = nd.array([2.0])
print('x + y = ', x + y)
print('x * y = ', x * y)
print('x / y = ', x / y)
print('x ** y = ', nd.power(x,y))
.. parsed-literal::
:class: output
x + y =
[5.]
x * y =
[6.]
x / y =
[1.5]
x ** y =
[9.]
We can convert any one-element NDArray to a Python float by calling its
``asscalar`` method. Note that this is typically a bad idea. While you
are doing this, NDArray has to stop doing anything else in order to hand
the result and the process control back to Python, and unfortunately
Python is not very good at doing things in parallel. So avoid sprinkling
this operation liberally throughout your code, or your networks will
take a long time to train.
.. code:: python
x.asscalar()
.. parsed-literal::
:class: output
3.0
Vectors
-------
You can think of a vector as simply a list of numbers, for example
``[1.0, 3.0, 4.0, 2.0]``. Each of the numbers in the vector is a
single scalar value. We call these values the *entries* or *components*
of the vector. Often, we are interested in vectors whose values hold
some real-world significance. For example, if we are studying the risk
that loans default, we might associate each applicant with a vector
whose components correspond to their income, length of employment,
number of previous defaults, etc. If we were studying the risk of heart
attacks hospital patients potentially face, we might represent each
patient with a vector whose components capture their most recent vital
signs, cholesterol levels, minutes of exercise per day, etc. In math
notation, we will usually denote vectors as bold-faced, lower-cased
letters (:math:`\mathbf{u}`, :math:`\mathbf{v}`, :math:`\mathbf{w}`). In
MXNet, we work with vectors via 1D NDArrays with an arbitrary number of
components.
.. code:: python
x = nd.arange(4)
print('x = ', x)
.. parsed-literal::
:class: output
x =
[0. 1. 2. 3.]
We can refer to any element of a vector by using a subscript. For
example, we can refer to the :math:`4`\ th element of :math:`\mathbf{u}`
by :math:`u_4`. Note that the element :math:`u_4` is a scalar, so we do
not bold-face the font when referring to it. In code, we access any
element by indexing into the ``NDArray``. Because Python indices start
at zero, the fourth element sits at index ``3``.
.. code:: python
x[3]
.. parsed-literal::
:class: output
[3.]
Length, dimensionality and shape
--------------------------------
Let’s revisit some concepts from the previous section. A vector is just
an array of numbers. And just as every array has a length, so does every
vector. In math notation, if we want to say that a vector
:math:`\mathbf{x}` consists of :math:`n` real-valued scalars, we can
express this as :math:`\mathbf{x} \in \mathbb{R}^n`. The length of a
vector is commonly called its *dimension*. As with an ordinary
Python array, we can access the length of an NDArray by calling Python’s
built-in ``len()`` function.
We can also access a vector’s length via its ``.shape`` attribute. The
shape is a tuple that lists the dimensionality of the NDArray along each
of its axes. Because a vector can only be indexed along one axis, its
shape has just one element.
.. code:: python
x.shape
.. parsed-literal::
:class: output
(4,)
Note that the word *dimension* is overloaded, and this tends to confuse
people. Some use the *dimensionality* of a vector to refer to its length
(the number of components). Others use the word *dimensionality*
to refer to the number of axes that an array has. In this second sense,
a scalar *would have* :math:`0` dimensions and a vector *would have*
:math:`1` dimension.
**To avoid confusion, when we say 2D array or 3D array, we mean an array
with 2 or 3 axes respectively. But if we say :math:`n`-dimensional
vector, we mean a vector of length :math:`n`.**
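To make the two meanings of "dimension" concrete, here is a small plain-Python sketch that uses nested lists as a stand-in for NDArrays. The helper ``shape`` is hypothetical (written only for this illustration) and assumes the nested lists are regular, that is, not ragged.

```python
def shape(array):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        if not array:  # an empty list has no further axes to inspect
            break
        array = array[0]
    return tuple(dims)


scalar = 3.0
vector = [1.0, 2.0, 3.0]
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    s = shape(value)
    print(name, "shape:", s, "axes:", len(s))
```

In this sketch ``vector`` is a 3-dimensional vector (its length is 3) stored in a 1D array (its shape has a single axis).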
.. code:: python
a = 2
x = nd.array([1,2,3])
y = nd.array([10,20,30])
print(a * x)
print(a * x + y)
.. parsed-literal::
:class: output
[2. 4. 6.]
[12. 24. 36.]
Matrices
--------
Just as vectors generalize scalars from order :math:`0` to order
:math:`1`, matrices generalize vectors from order :math:`1` to order
:math:`2`. Matrices, which we’ll typically denote with capital letters
(:math:`A`, :math:`B`, :math:`C`), are represented in code as arrays
with 2 axes.
Visually, we can draw a matrix as a table, where each entry
:math:`a_{ij}` belongs to the :math:`i`-th row and :math:`j`-th column.
.. math::
A=\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1m} \\
a_{21} & a_{22} & \cdots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nm} \\
\end{pmatrix}
We can create a matrix with :math:`n` rows and :math:`m` columns in
MXNet by specifying a shape with two components ``(n, m)`` when calling
any of our favorite functions for instantiating an ``NDArray``, such as
``ones`` or ``zeros``. Alternatively, we can reshape an existing array
into the desired shape, as we do here.
.. code:: python
A = nd.arange(20).reshape((5,4))
print(A)
.. parsed-literal::
:class: output
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[12. 13. 14. 15.]
[16. 17. 18. 19.]]
Matrices are useful data structures: they allow us to organize data that
has different modalities of variation. For example, rows in our matrix
might correspond to different patients, while columns might correspond
to different attributes.
We can access the scalar elements :math:`a_{ij}` of a matrix :math:`A`
by specifying the indices for the row (:math:`i`) and column (:math:`j`)
respectively. Leaving an index blank via a ``:`` takes all elements
along the respective axis (as seen in the previous section).

We can transpose the matrix via its ``T`` attribute. That is, if
:math:`B = A^T`, then :math:`b_{ij} = a_{ji}` for any :math:`i` and
:math:`j`.
.. code:: python
print(A.T)
.. parsed-literal::
:class: output
[[ 0. 4. 8. 12. 16.]
[ 1. 5. 9. 13. 17.]
[ 2. 6. 10. 14. 18.]
[ 3. 7. 11. 15. 19.]]
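The indexing and transposition rules above can be spelled out with plain Python nested lists. This is only a sketch: a list of rows stands in for the NDArray, so the indexing syntax (``A[i][j]``) is that of Python lists rather than of MXNet, and all indices are zero-based.

```python
# A 5 x 4 matrix with the same entries as ``A`` above, stored as a list of rows.
A = [[float(4 * i + j) for j in range(4)] for i in range(5)]

entry = A[1][2]                  # row 1, column 2 (zero-based)
row = A[1]                       # all entries of row 1
column = [r[2] for r in A]       # all entries of column 2

# Transpose: entry (i, j) of B is entry (j, i) of A.
B = [[A[j][i] for j in range(5)] for i in range(4)]

print(entry)               # 6.0
print(row)                 # [4.0, 5.0, 6.0, 7.0]
print(column)              # [2.0, 6.0, 10.0, 14.0, 18.0]
print(B[2][1] == A[1][2])  # True
```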
Tensors
-------
Just as vectors generalize scalars, and matrices generalize vectors, we
can actually build data structures with even more axes. Tensors give us
a generic way of discussing arrays with an arbitrary number of axes.
Vectors, for example, are first-order tensors, and matrices are
second-order tensors.
Using tensors will become more important when we start working with
images, which arrive as 3D data structures, with axes corresponding to
the height, width, and the three (RGB) color channels. But in this
chapter, we’re going to skip this part and make sure you know the
basics.
.. code:: python
X = nd.arange(24).reshape((2, 3, 4))
print('X.shape =', X.shape)
print('X =', X)
.. parsed-literal::
:class: output
X.shape = (2, 3, 4)
X =
[[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
[[12. 13. 14. 15.]
[16. 17. 18. 19.]
[20. 21. 22. 23.]]]
Basic properties of tensor arithmetic
-------------------------------------
Scalars, vectors, matrices, and tensors of any order have some nice
properties that we will often rely on. For example, as you might have
noticed from the definition of an element-wise operation, given operands
with the same shape, the result of any element-wise operation is a
tensor of that same shape. Another convenient property is that for all
tensors, multiplication by a scalar produces a tensor of the same shape.
In math, given two tensors :math:`X` and :math:`Y` with the same shape,
:math:`\alpha X + Y` has the same shape (numerical mathematicians call
this the AXPY operation).
.. code:: python
a = 2
x = nd.ones(3)
y = nd.zeros(3)
print(x.shape)
print(y.shape)
print((a * x).shape)
print((a * x + y).shape)
.. parsed-literal::
:class: output
(3,)
(3,)
(3,)
(3,)
Shape is not the only property preserved under addition and
multiplication by a scalar. These operations also preserve membership in
a vector space. But we will postpone this discussion for the second half
of this chapter because it is not critical to getting your first models
up and running.
Sums and means
--------------
The next, slightly more sophisticated thing we can do with arbitrary
tensors is to calculate the sum of their elements. In mathematical
notation, we express sums using the :math:`\sum` symbol. To express the
sum of the elements in a vector :math:`\mathbf{u}` of length :math:`d`,
we can write :math:`\sum_{i=1}^d u_i`. In code, we can just call
``nd.sum()``.
.. code:: python
print(x)
print(nd.sum(x))
.. parsed-literal::
:class: output
[1. 1. 1.]
[3.]
We can similarly express sums over the elements of tensors of arbitrary
shape. For example, the sum of the elements of an :math:`m \times n`
matrix :math:`A` could be written
:math:`\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}`.
.. code:: python
print(A)
print(nd.sum(A))
.. parsed-literal::
:class: output
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[12. 13. 14. 15.]
[16. 17. 18. 19.]]
[190.]
A related quantity is the *mean*, which is also called the *average*. We
calculate the mean by dividing the sum by the total number of elements.
With mathematical notation, we could write the average over a vector
:math:`\mathbf{u}` as :math:`\frac{1}{d} \sum_{i=1}^{d} u_i` and the
average over a matrix :math:`A` as
:math:`\frac{1}{n \cdot m} \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}`. In
code, we could just call ``nd.mean()`` on tensors of arbitrary shape:
.. code:: python
print(nd.mean(A))
print(nd.sum(A) / A.size)
.. parsed-literal::
:class: output
[9.5]
[9.5]
Dot products
------------
So far, we have only performed element-wise operations, sums, and
averages. If this were all we could do, linear algebra probably would
not deserve its own chapter. However, one of the most fundamental
operations is the dot product. Given two vectors :math:`\mathbf{u}` and
:math:`\mathbf{v}` of length :math:`d`, the dot product
:math:`\mathbf{u}^T \mathbf{v}` is a sum over the products of the
corresponding elements:
:math:`\mathbf{u}^T \mathbf{v} = \sum_{i=1}^{d} u_i \cdot v_i`.
.. code:: python
x = nd.arange(4)
y = nd.ones(4)
print(x, y, nd.dot(x, y))
.. parsed-literal::
:class: output
[0. 1. 2. 3.]
[1. 1. 1. 1.]
[6.]
Note that we can express the dot product of two vectors ``nd.dot(x, y)``
equivalently by performing an element-wise multiplication and then a
sum:
.. code:: python
nd.sum(x * y)
.. parsed-literal::
:class: output
[6.]
Dot products are useful in a wide range of contexts. For example, given
a set of weights :math:`\mathbf{w}`, the weighted sum of some values
:math:`\mathbf{u}` could be expressed as the dot product
:math:`\mathbf{u}^T \mathbf{w}`. When the weights are non-negative and
sum to one :math:`\left(\sum_{i=1}^{d} {w_i} = 1\right)`, the dot
product expresses a *weighted average*. When two vectors each have
length one (we will discuss what *length* means below in the section on
norms), their dot product equals the cosine of the angle between
them.
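Both uses can be illustrated with a few lines of plain Python. This is a sketch under our own assumptions: the helper ``dot`` and the sample values are made up for illustration, and lists stand in for NDArrays.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(ui * vi for ui, vi in zip(u, v))


# Weighted average: the weights are non-negative and sum to one.
values = [2.0, 4.0, 6.0]
weights = [0.5, 0.25, 0.25]
print(dot(values, weights))  # 3.5

# Cosine of the angle between two vectors: normalize each to length one first.
u = [1.0, 0.0]
v = [1.0, 1.0]
u_unit = [ui / math.sqrt(dot(u, u)) for ui in u]
v_unit = [vi / math.sqrt(dot(v, v)) for vi in v]
cosine = dot(u_unit, v_unit)
print(math.degrees(math.acos(cosine)))  # approximately 45 degrees
```

Dividing each vector by the square root of its dot product with itself is what gives it length one, which is the condition under which the dot product can be read as a cosine.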
Matrix-vector products
----------------------
Now that we know how to calculate dot products we can begin to
understand matrix-vector products. Let’s start off by visualizing a
matrix :math:`A` and a column vector :math:`\mathbf{x}`.
.. math::
A=\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1m} \\
a_{21} & a_{22} & \cdots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nm} \\
\end{pmatrix},\quad\mathbf{x}=\begin{pmatrix}
x_{1} \\
x_{2} \\
\vdots\\
x_{m}\\
\end{pmatrix}
We can visualize the matrix in terms of its row vectors
.. math::
A=
\begin{pmatrix}
\mathbf{a}^T_{1} \\
\mathbf{a}^T_{2} \\
\vdots \\
\mathbf{a}^T_n \\
\end{pmatrix},
where each :math:`\mathbf{a}^T_{i} \in \mathbb{R}^{m}` is a row vector
representing the :math:`i`-th row of the matrix :math:`A`.
Then the matrix-vector product :math:`\mathbf{y} = A\mathbf{x}` is
simply a column vector :math:`\mathbf{y} \in \mathbb{R}^n` where each
entry :math:`y_i` is the dot product :math:`\mathbf{a}^T_i \mathbf{x}`.
.. math::
A\mathbf{x}=
\begin{pmatrix}
\mathbf{a}^T_{1} \\
\mathbf{a}^T_{2} \\
\vdots \\
\mathbf{a}^T_n \\
\end{pmatrix}
\begin{pmatrix}
x_{1} \\
x_{2} \\
\vdots\\
x_{m}\\
\end{pmatrix}
= \begin{pmatrix}
\mathbf{a}^T_{1} \mathbf{x} \\
\mathbf{a}^T_{2} \mathbf{x} \\
\vdots\\
\mathbf{a}^T_{n} \mathbf{x}\\
\end{pmatrix}
So you can think of multiplication by a matrix
:math:`A\in \mathbb{R}^{n \times m}` as a transformation that maps
vectors from :math:`\mathbb{R}^{m}` to :math:`\mathbb{R}^{n}`.
These transformations turn out to be remarkably useful. For example, we
can represent rotations as multiplications by a square matrix. As we
will see in subsequent chapters, we can also use matrix-vector products
to describe the calculations of each layer in a neural network.
To express matrix-vector products in code with ``NDArray``, we use the
same ``nd.dot()`` function as for dot products. When we call
``nd.dot(A, x)`` with a matrix ``A`` and a vector ``x``, MXNet knows to
perform a matrix-vector product. Note that the number of columns of
``A`` must be the same as the length of ``x``.
.. code:: python
nd.dot(A, x)
.. parsed-literal::
:class: output
[ 14. 38. 62. 86. 110.]
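To see the row-by-row definition at work, here is a plain-Python sketch that computes each entry :math:`y_i` as the dot product of row :math:`i` with :math:`\mathbf{x}`. The helpers ``dot`` and ``matvec`` are hypothetical names for this illustration, and nested lists stand in for NDArrays.

```python
def dot(u, v):
    """Dot product of two equal-length sequences of numbers."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, x):
    """Matrix-vector product: entry i is the dot product of row i with x."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("each row of A must have the same length as x")
    return [dot(row, x) for row in A]


# Same values as the NDArray example: A is 5 x 4, x has length 4.
A = [[float(4 * i + j) for j in range(4)] for i in range(5)]
x = [0.0, 1.0, 2.0, 3.0]
y = matvec(A, x)
print(y)  # [14.0, 38.0, 62.0, 86.0, 110.0]
```

The result has one entry per row of ``A``, matching the output of ``nd.dot(A, x)`` shown above.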
Matrix-matrix multiplication
----------------------------
If you have gotten the hang of dot products and matrix-vector
multiplication, then matrix-matrix multiplications should be pretty
straightforward.
Say we have two matrices, :math:`A \in \mathbb{R}^{n \times k}` and
:math:`B \in \mathbb{R}^{k \times m}`:
.. math::
A=\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1k} \\
a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nk} \\
\end{pmatrix},\quad
B=\begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1m} \\
b_{21} & b_{22} & \cdots & b_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
b_{k1} & b_{k2} & \cdots & b_{km} \\
\end{pmatrix}
To produce the matrix product :math:`C = AB`, it’s easiest to think of
:math:`A` in terms of its row vectors and :math:`B` in terms of its
column vectors:
.. math::
A=
\begin{pmatrix}
\mathbf{a}^T_{1} \\
\mathbf{a}^T_{2} \\
\vdots \\
\mathbf{a}^T_n \\
\end{pmatrix},
\quad B=\begin{pmatrix}
\mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
\end{pmatrix}.
Note here that each row vector :math:`\mathbf{a}^T_{i}` lies in
:math:`\mathbb{R}^k` and that each column vector :math:`\mathbf{b}_j`
also lies in :math:`\mathbb{R}^k`.
Then to produce the matrix product :math:`C \in \mathbb{R}^{n \times m}`
we simply compute each entry :math:`c_{ij}` as the dot product
:math:`\mathbf{a}^T_i \mathbf{b}_j`.
.. math::
C = AB = \begin{pmatrix}
\mathbf{a}^T_{1} \\
\mathbf{a}^T_{2} \\
\vdots \\
\mathbf{a}^T_n \\
\end{pmatrix}
\begin{pmatrix}
\mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
\end{pmatrix}
= \begin{pmatrix}
\mathbf{a}^T_{1} \mathbf{b}_1 & \mathbf{a}^T_{1}\mathbf{b}_2& \cdots & \mathbf{a}^T_{1} \mathbf{b}_m \\
\mathbf{a}^T_{2}\mathbf{b}_1 & \mathbf{a}^T_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^T_{2} \mathbf{b}_m \\
\vdots & \vdots & \ddots &\vdots\\
\mathbf{a}^T_{n} \mathbf{b}_1 & \mathbf{a}^T_{n}\mathbf{b}_2& \cdots& \mathbf{a}^T_{n} \mathbf{b}_m
\end{pmatrix}
You can think of the matrix-matrix multiplication :math:`AB` as simply
performing :math:`m` matrix-vector products and stitching the results
together to form an :math:`n \times m` matrix. Just as with ordinary dot
products and matrix-vector products, we can compute matrix-matrix
products in MXNet by using ``nd.dot()``.
.. code:: python
B = nd.ones(shape=(4, 3))
nd.dot(A, B)
.. parsed-literal::
:class: output
[[ 6. 6. 6.]
[22. 22. 22.]
[38. 38. 38.]
[54. 54. 54.]
[70. 70. 70.]]
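The "stitching" view of matrix-matrix multiplication can be sketched in plain Python: compute one matrix-vector product per column of :math:`B`, then assemble the resulting columns. The function names are our own, and nested lists stand in for NDArrays.

```python
def matvec(A, x):
    """Matrix-vector product: entry i is the dot product of row i with x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def matmul(A, B):
    """Matrix product built from one matrix-vector product per column of B."""
    if any(len(row) != len(B) for row in A):
        raise ValueError("columns of A must match rows of B")
    num_cols = len(B[0])
    # Column j of the result is A times column j of B.
    result_columns = [matvec(A, [row[j] for row in B]) for j in range(num_cols)]
    # Stitch the columns together into a list of rows.
    return [list(row) for row in zip(*result_columns)]


A = [[float(4 * i + j) for j in range(4)] for i in range(5)]   # 5 x 4
B = [[1.0] * 3 for _ in range(4)]                              # 4 x 3
C = matmul(A, B)
print(C)
```

Because every column of ``B`` is a vector of ones, each column of ``C`` holds the row sums of ``A``, which agrees with the ``nd.dot(A, B)`` output above.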
Norms
-----
Before we can start implementing models, there is one last concept we
are going to introduce. Some of the most useful operators in linear
algebra are norms. Informally, they tell us how big a vector or matrix
is. We represent norms with the notation :math:`\|\cdot\|`. The
:math:`\cdot` in this expression is just a placeholder. For example, we
would represent the norm of a vector :math:`\mathbf{x}` or matrix
:math:`A` as :math:`\|\mathbf{x}\|` or :math:`\|A\|`, respectively.
All norms must satisfy a handful of properties:

1. :math:`\|\alpha A\| = |\alpha| \|A\|`
2. :math:`\|A + B\| \leq \|A\| + \|B\|`
3. :math:`\|A\| \geq 0`
4. :math:`\|A\|=0` if and only if :math:`\forall {i,j}, a_{ij} = 0`

To put it in words, the first rule says that if we scale all the
components of a matrix or vector by a constant factor :math:`\alpha`,
its norm also scales by the *absolute value* of the same constant
factor. The second rule is the familiar triangle inequality. The third
rule simply says that the norm must be non-negative. That makes sense:
in most contexts the smallest *size* for anything is 0. The final rule
says that the smallest norm is achieved by a matrix or vector
consisting of all zeros, and only by it. A function that satisfies the
first three rules but assigns zero to some nonzero matrices is called a
*seminorm* rather than a norm. That may seem like a mouthful, but if you
digest it, then you have probably grasped the important concepts here.
If you remember Euclidean distances (think Pythagoras’ theorem) from
grade school, then non-negativity and the triangle inequality might ring
a bell. You might notice that norms sound a lot like measures of
distance.
In fact, the Euclidean distance from the origin,
:math:`\sqrt{x_1^2 + \cdots + x_n^2}`, is a norm. Specifically, it is
the :math:`\ell_2` norm. An analogous computation, performed over the
entries of a matrix, e.g. :math:`\sqrt{\sum_{i,j} a_{ij}^2}`, is called
the Frobenius norm. More often, in machine learning we work with the
squared :math:`\ell_2` norm (notated :math:`\ell_2^2`). We also commonly
work with the :math:`\ell_1` norm. The :math:`\ell_1` norm is simply the
sum of the absolute values. Compared to the :math:`\ell_2` norm, it has
the convenient property of placing less emphasis on outliers.
To calculate the :math:`\ell_2` norm, we can just call ``nd.norm()``.
.. code:: python
nd.norm(x)
.. parsed-literal::
:class: output
[3.7416573]
To calculate the :math:`\ell_1` norm, we can simply take the absolute
value of each element and then sum over the elements.
.. code:: python
nd.sum(nd.abs(x))
.. parsed-literal::
:class: output
[6.]
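The norm properties listed earlier can be checked numerically on a concrete example. The following plain-Python sketch defines the :math:`\ell_1` and :math:`\ell_2` norms by hand (the function names and the sample vectors are our own assumptions) and asserts scaling, the triangle inequality, and non-negativity for one choice of inputs. Passing these checks illustrates the properties; it does not prove them.

```python
import math


def l1_norm(x):
    """Sum of the absolute values of the entries."""
    return sum(abs(xi) for xi in x)


def l2_norm(x):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(xi * xi for xi in x))


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, -2.0, 0.5, 4.0]
alpha = -3.0

print(l1_norm(x))  # 6.0
print(l2_norm(x))  # sqrt(14), about 3.7417

for norm in (l1_norm, l2_norm):
    scaled = [alpha * xi for xi in x]
    summed = [xi + yi for xi, yi in zip(x, y)]
    # Absolute homogeneity: scaling the vector scales the norm by |alpha|.
    assert math.isclose(norm(scaled), abs(alpha) * norm(x))
    # Triangle inequality (small tolerance for floating-point rounding).
    assert norm(summed) <= norm(x) + norm(y) + 1e-12
    # Non-negativity, and the zero vector has norm zero.
    assert norm(x) >= 0
    assert norm([0.0] * 4) == 0
```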
Norms and objectives
--------------------
While we do not want to get too far ahead of ourselves, we do want you
to anticipate why these concepts are useful. In machine learning we are
often trying to solve optimization problems: *Maximize* the probability
assigned to observed data. *Minimize* the distance between predictions
and the ground-truth observations. Assign vector representations to
items (like words, products, or news articles) such that the distance
between similar items is minimized, and the distance between dissimilar
items is maximized. Oftentimes, these objectives, perhaps the most
important component of a machine learning algorithm (besides the data
itself), are expressed as norms.
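As one small illustration of an objective expressed through a norm, the sketch below measures the squared :math:`\ell_2` distance between predictions and ground-truth values. The function name and the numbers are made up for this example; it is not meant as the loss used by any particular model in this book.

```python
def squared_l2_distance(predictions, targets):
    """Squared l2 norm of the difference between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("inputs must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


targets = [1.0, 2.0, 3.0]
good_predictions = [1.0, 2.5, 3.0]
poor_predictions = [3.0, 0.0, 1.0]

print(squared_l2_distance(good_predictions, targets))  # 0.25
print(squared_l2_distance(poor_predictions, targets))  # 12.0
```

Predictions that sit closer to the targets yield a smaller value, so minimizing this quantity pushes predictions toward the observations.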
Intermediate linear algebra
---------------------------
If you have made it this far, and understand everything that we have
covered, then honestly, you *are* ready to begin modeling. If you are
feeling antsy, this is a perfectly reasonable place to move on. You
already know nearly all of the linear algebra required to implement
many practically useful models, and you can always circle back
when you want to learn more.
But there is a lot more to linear algebra, even as concerns machine
learning. At some point, if you plan to make a career in machine
learning, you will need to know more than what we have covered so far.
In the rest of this chapter, we introduce some useful, more advanced
concepts.
Basic vector properties
~~~~~~~~~~~~~~~~~~~~~~~
Vectors are useful beyond being data structures to carry numbers. In
addition to reading and writing values to the components of a vector,
and performing some useful mathematical operations, we can analyze
vectors in some interesting ways.
One important concept is the notion of a vector space. Here are the
conditions that make a vector space:
- **Additive axioms** (we assume that x,y,z are all vectors):
:math:`x+y = y+x` and :math:`(x+y)+z = x+(y+z)` and
:math:`0+x = x+0 = x` and :math:`(-x) + x = x + (-x) = 0`.
- **Multiplicative axioms** (we assume that x is a vector and a, b are
scalars): :math:`0 \cdot x = 0` and :math:`1 \cdot x = x` and
:math:`(a b) x = a (b x)`.
- **Distributive axioms** (we assume that x and y are vectors and a, b
are scalars): :math:`a(x+y) = ax + ay` and :math:`(a+b)x = ax +bx`.
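The axioms above can be spot-checked for ordinary lists of numbers with element-wise addition and scalar multiplication. The helpers ``add`` and ``scale`` are hypothetical names for this sketch, and integer entries are used so that the equality checks are exact. A passing check on one example illustrates the axioms rather than proving them.

```python
def add(x, y):
    """Element-wise sum of two equal-length vectors."""
    return [xi + yi for xi, yi in zip(x, y)]


def scale(a, x):
    """Multiply every component of x by the scalar a."""
    return [a * xi for xi in x]


x, y, z = [1, -2, 3], [4, 0, -1], [2, 2, 2]
a, b = 3, -5
zero = [0, 0, 0]

# Additive axioms
assert add(x, y) == add(y, x)
assert add(add(x, y), z) == add(x, add(y, z))
assert add(zero, x) == x
assert add(scale(-1, x), x) == zero

# Multiplicative axioms
assert scale(0, x) == zero
assert scale(1, x) == x
assert scale(a * b, x) == scale(a, scale(b, x))

# Distributive axioms
assert scale(a, add(x, y)) == add(scale(a, x), scale(a, y))
assert scale(a + b, x) == add(scale(a, x), scale(b, x))
print("all axioms hold for this example")
```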
Special matrices
~~~~~~~~~~~~~~~~
There are a number of special matrices that we will use throughout this
tutorial. Let’s look at them in a bit of detail:
- **Symmetric Matrix** These are matrices where the entries below and
  above the diagonal are the same. In other words, we have that
  :math:`M^\top = M`. Examples of such matrices are those that
  describe pairwise distances, i.e. :math:`M_{ij} = \|x_i - x_j\|`.
  Likewise, the Facebook friendship graph can be written as a symmetric
  matrix where :math:`M_{ij} = 1` if :math:`i` and :math:`j` are
  friends and :math:`M_{ij} = 0` if they are not. Note that the
  *Twitter* graph is asymmetric: :math:`M_{ij} = 1` (:math:`i`
  follows :math:`j`) does not imply that :math:`M_{ji} = 1`
  (:math:`j` follows :math:`i`).
- **Antisymmetric Matrix** These matrices satisfy :math:`M^\top = -M`.
  Note that any square matrix can always be decomposed into the sum of
  a symmetric and an antisymmetric matrix by using
  :math:`M = \frac{1}{2}(M + M^\top) + \frac{1}{2}(M - M^\top)`.
- **Diagonally Dominant Matrix** These are matrices where the
  off-diagonal elements are small relative to the main diagonal
  elements. In particular we have that
  :math:`|M_{ii}| \geq \sum_{j \neq i} |M_{ij}|` and
  :math:`|M_{ii}| \geq \sum_{j \neq i} |M_{ji}|`. If a matrix has this
  property, we can often approximate :math:`M` by its diagonal, which
  is often written as :math:`\mathrm{diag}(M)`.
- **Positive Definite Matrix** These are symmetric matrices that have
  the nice property that :math:`x^\top M x > 0` whenever
  :math:`x \neq 0`. Intuitively, they are a generalization of the
  squared norm of a vector :math:`\|x\|^2 = x^\top x`. It is easy to
  check that this holds whenever :math:`M = A^\top A` for a matrix
  :math:`A` with linearly independent columns, since then
  :math:`x^\top M x = x^\top A^\top A x = \|A x\|^2`, which is positive
  because :math:`Ax \neq 0` for :math:`x \neq 0`. There is a
  somewhat more profound theorem which states that all positive
  definite matrices can be written in this form.
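Two of the identities from this list can be verified on small examples. The sketch below uses plain Python nested lists as a stand-in for NDArrays, and the helper names and the sample matrices are our own choices. It splits a square matrix into its symmetric and antisymmetric parts, then checks that :math:`x^\top (A^\top A) x` equals :math:`\|Ax\|^2` for one vector.

```python
def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


M = [[1.0, 2.0], [4.0, 3.0]]
Mt = transpose(M)

# Symmetric and antisymmetric parts of M.
S = [[(M[i][j] + Mt[i][j]) / 2 for j in range(2)] for i in range(2)]
K = [[(M[i][j] - Mt[i][j]) / 2 for j in range(2)] for i in range(2)]
assert S == transpose(S)                                  # S is symmetric
assert K == [[-k for k in row] for row in transpose(K)]   # K is antisymmetric
assert [[S[i][j] + K[i][j] for j in range(2)] for i in range(2)] == M

# For P = A^T A, the quadratic form x^T P x equals the squared norm of A x.
A = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
P = matmul(transpose(A), A)
x = [1.0, -2.0]
Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
Px = [sum(p * xi for p, xi in zip(row, x)) for row in P]
quadratic_form = sum(xi * pi for xi, pi in zip(x, Px))
print(quadratic_form, sum(v * v for v in Ax))  # 14.0 14.0
```

The columns of this particular ``A`` are linearly independent, so ``Ax`` is nonzero for every nonzero ``x`` and the quadratic form is strictly positive.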
Summary
-------
In just a few pages (or one Jupyter notebook) we have taught you all the
linear algebra you will need to understand a good chunk of neural
networks. Of course there is a *lot* more to linear algebra. And a lot
of that math *is* useful for machine learning. For example, matrices can
be decomposed into factors, and these decompositions can reveal
low-dimensional structure in real-world datasets. There are entire
subfields of machine learning that focus on using matrix decompositions
and their generalizations to high-order tensors to discover structure in
datasets and solve prediction problems. But this book focuses on deep
learning. And we believe you will be much more inclined to learn more
mathematics once you have gotten your hands dirty deploying useful
machine learning models on real datasets. So while we reserve the right
to introduce more math much later on, we will wrap up this chapter here.
If you are eager to learn more about linear algebra, here are some of
our favorite resources on the topic:

- For a solid primer on basics, check out Gilbert Strang’s book
  *Introduction to Linear Algebra*
- Zico Kolter’s *Linear Algebra Review and Reference*