.. _sec_eigendecompositions:
Eigendecompositions
===================
Eigenvalues are often one of the most useful notions we will encounter
when studying linear algebra, however, as a beginner, it is easy to
overlook their importance. Below, we introduce eigendecomposition and
try to convey some sense of just why it is so important.
Suppose that we have a matrix :math:`A` with the following entries:
.. math::
\mathbf{A} = \begin{bmatrix}
2 & 0 \\
0 & -1
\end{bmatrix}.
If we apply :math:`A` to any vector :math:`\mathbf{v} = [x, y]^\top`, we
obtain a vector :math:`\mathbf{v}A = [2x, -y]^\top`. This has an
intuitive interpretation: stretch the vector to be twice as wide in the
:math:`x`-direction, and then flip it in the :math:`y`-direction.
However, there are *some* vectors for which something remains unchanged.
Namely :math:`[1, 0]^\top` gets sent to :math:`[2, 0]^\top` and
:math:`[0, 1]^\top` gets sent to :math:`[0, -1]^\top`. These vectors are
still in the same line, and the only modification is that the matrix
stretches them by a factor of :math:`2` and :math:`-1` respectively. We
call such vectors *eigenvectors* and the factor they are stretched by
*eigenvalues*.
In general, if we can find a number :math:`\lambda` and a vector
:math:`\mathbf{v}` such that
.. math::
\mathbf{A}\mathbf{v} = \lambda \mathbf{v}.
We say that :math:`\mathbf{v}` is an eigenvector for :math:`A` and
:math:`\lambda` is an eigenvalue.
Finding Eigenvalues
-------------------
| Let’s figure out how to find them.
| By subtracting off the :math:`\lambda \vec v` from both sides, and
then factoring out the vector, we see the above is equivalent to:
.. math:: (\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = 0.
:label: eq_eigvalue_der
For :eq:`eq_eigvalue_der` to happen, we see that
:math:`(\mathbf{A} - \lambda \mathbf{I})` must compress some direction
down to zero, hence it is not invertible, and thus the determinant is
zero. Thus, we can find the *eigenvalues* by finding for what
:math:`\lambda` is :math:`\det(\mathbf{A}-\lambda \mathbf{I}) = 0`. Once
we find the eigenvalues, we can solve
:math:`\mathbf{A}\mathbf{v} = \lambda \mathbf{v}` to find the associated
*eigenvector(s)*.
An Example
~~~~~~~~~~
Let’s see this with a more challenging matrix
.. math::
\mathbf{A} = \begin{bmatrix}
2 & 1\\
2 & 3
\end{bmatrix}.
If we consider :math:`\det(\mathbf{A}-\lambda \mathbf{I}) = 0`, we see
this is equivalent to the polynomial equation
:math:`0 = (2-\lambda)(3-\lambda)-2 = (4-\lambda)(1-\lambda)`. Thus, two
eigenvalues are :math:`4` and :math:`1`. To find the associated vectors,
we then need to solve
.. math::
\begin{bmatrix}
2 & 1\\
2 & 3
\end{bmatrix}\begin{bmatrix}x \\ y\end{bmatrix} = \begin{bmatrix}x \\ y\end{bmatrix} \; \text{and} \;
\begin{bmatrix}
2 & 2\\
1 & 3
\end{bmatrix}\begin{bmatrix}x \\ y\end{bmatrix} = \begin{bmatrix}4x \\ 4y\end{bmatrix} .
We can solve this with the vectors :math:`[1, -1]^\top` and
:math:`[1, 2]^\top` respectively.
We can check this in code using the built-in ``numpy.linalg.eig``
routine.
.. code:: python
%matplotlib inline
import d2l
from IPython import display
import numpy as np
np.linalg.eig(np.array([[2, 1], [2, 3]]))
.. parsed-literal::
:class: output
(array([1., 4.]),
array([[-0.70710678, -0.4472136 ],
[ 0.70710678, -0.89442719]]))
Note that ``numpy`` normalizes the eigenvectors to be of length one,
whereas we took ours to be of arbitrary length. Additionally, the choice
of sign is arbitrary. However, the vectors computed are parallel to the
ones we found by hand with the same eigenvalues.
Decomposing Matrices
--------------------
Let’s continue the previous example one step further. Let
.. math::
\mathbf{W} = \begin{bmatrix}
1 & 1 \\
-1 & 2
\end{bmatrix},
be the matrix where the columns are the eigenvectors of the matrix
:math:`\mathbf{A}`. Let
.. math::
\boldsymbol{\Sigma} = \begin{bmatrix}
1 & 0 \\
0 & 4
\end{bmatrix},
be the matrix with the associated eigenvalues on the diagonal. Then the
definition of eigenvalues and eigenvectors tells us that
.. math::
\mathbf{A}\mathbf{W} =\mathbf{W} \boldsymbol{\Sigma} .
The matrix :math:`W` is invertible, so we may multiply both sides by
:math:`W^{-1}` on the right, we see that we may write
.. math:: \mathbf{A} = \mathbf{W} \boldsymbol{\Sigma} \mathbf{W}^{-1}.
:label: eq_eig_decomp
In the next section we will see some nice consequences of this, but for
now we need only know that such a decomposition will exist as long as we
can find a full collection of linearly independent eigenvectors (so that
:math:`W` is invertible).
Operations on Eigendecompositions
---------------------------------
| One nice thing about eigendecompositions :eq:`eq_eig_decomp` is
that we can write many operations we usually encounter cleanly in
terms of the eigendecomposition.
| As a first example, consider:
.. math::
\mathbf{A}^n = \overbrace{\mathbf{A}\cdots \mathbf{A}}^{\text{$n$ times}} = \overbrace{(\mathbf{W}\boldsymbol{\Sigma} \mathbf{W}^{-1})\cdots(\mathbf{W}\boldsymbol{\Sigma} \mathbf{W}^{-1})}^{\text{$n$ times}} = \mathbf{W}\overbrace{\boldsymbol{\Sigma}\cdots\boldsymbol{\Sigma}}^{\text{$n$ times}}\mathbf{W}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^n \mathbf{W}^{-1}.
This tells us that for any positive power of a matrix, the
eigendecomposition is obtained by just raising the eigenvalues to the
same power. The same can be shown for negative powers, so if we want to
invert a matrix we need only consider
.. math::
\mathbf{A}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^{-1} \mathbf{W}^{-1},
or in other words, just invert each eigenvalue. This will work as long
as each eigenvalue is non-zero, so we see that invertible is the same as
having no zero eigenvalues.
Indeed, additional work can show that if
:math:`\lambda_1, \ldots, \lambda_n` are the eigenvalues of a matrix,
then the determinant of that matrix is
.. math::
\det(\mathbf{A}) = \lambda_1 \cdots \lambda_n,
or the product of all the eigenvalues. This makes sense intuitively
because whatever stretching :math:`\mathbf{W}` does, :math:`W^{-1}`
undoes it, so in the end the only stretching that happens is by
multiplication by the diagonal matrix :math:`\boldsymbol{\Sigma}`, which
stretches volumes by the product of the diagonal elements.
Finally, recall that the rank was the maximum number of linearly
independent columns of your matrix. By examining the eigendecomposition
closely, we can see that the rank is the same as the number of non-zero
eigenvalues of :math:`\mathbf{A}`.
The examples could continue, but hopefully the point is clear:
eigendecompositions can simplify many linear-algebraic computations and
are a fundamental operation underlying many numerical algorithms and
much of the analysis that we do in linear algebra.
Eigendecompositions of Symmetric Matrices
-----------------------------------------
It is not always possible to find enough linearly independent
eigenvectors for the above process to work. For instance the matrix
.. math::
\mathbf{A} = \begin{bmatrix}
1 & 1 \\
0 & 1
\end{bmatrix},
has only a single eigenvector, namely :math:`(0, 1)`. To handle such
matrices, we require more advanced techniques than we can cover (such as
the Jordan Normal Form, or Singular Value Decomposition). We will often
need to restrict our attention to those matrices where we can guarantee
the existence of a full set of eigenvectors.
| The most commonly encountered family are the *symmetric matrices*,
which are those matrices where :math:`\mathbf{A} = \mathbf{A}^\top`.
In this case, we may take :math:`W` to be an *orthogonal matrix*—a
matrix whose columns are all length one vectors that are at right
angles to one another, where
:math:`\mathbf{W}^\top = \mathbf{W}^{-1}`—and all the eigenvalues will
be real.
| Thus, in this special case, we can write :eq:`eq_eig_decomp` as
.. math::
\mathbf{A} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^\top .
Gershgorin Circle Theorem
-------------------------
Eigenvalues are often difficult to reason with intuitively. If presented
an arbitrary matrix, there is little that can be said about what the
eigenvalues are without computing them. There is, however, one theorem
that can make it easy to approximate well if the largest values are on
the diagonal.
Let :math:`\mathbf{A} = (a_{ij})` be any square matrix
(:math:`n\times n`). We will define
:math:`r_i = \sum_{j \neq i} |a_{ij}|`. Let :math:`\mathcal{D}_i`
represent the disc in the complex plane with center :math:`a_{ii}`
radius :math:`r_i`. Then, every eigenvalue of :math:`\mathbf{A}` is
contained in one of the :math:`\mathcal{D}_i`.
| This can be a bit to unpack, so let’s look at an example.
| Consider the matrix:
.. math::
\mathbf{A} = \begin{bmatrix}
1.0 & 0.1 & 0.1 & 0.1 \\
0.1 & 3.0 & 0.2 & 0.3 \\
0.1 & 0.2 & 5.0 & 0.5 \\
0.1 & 0.3 & 0.5 & 9.0
\end{bmatrix}.
We have :math:`r_1 = 0.3`, :math:`r_2 = 0.6`, :math:`r_3 = 0.8` and
:math:`r_4 = 0.9`. The matrix is symmetric, so all eigenvalues are real.
This means that all of our eigenvalues will be in one of the ranges of
.. math:: [a_{11}-r_1, a_{11}+r_1] = [0.7, 1.3],
.. math:: [a_{22}-r_2, a_{22}+r_2] = [2.4, 3.6],
.. math:: [a_{33}-r_3, a_{33}+r_3] = [4.2, 5.8],
.. math:: [a_{44}-r_4, a_{44}+r_4] = [8.1, 9.9].
Performing the numerical computation shows that the eigenvalues are
approximately :math:`0.99`, :math:`2.97`, :math:`4.95`, :math:`9.08`,
all comfortably inside the ranges provided.
.. code:: python
A = np.array([[1.0, 0.1, 0.1, 0.1],
[0.1, 3.0, 0.2, 0.3],
[0.1, 0.2, 5.0, 0.5],
[0.1, 0.3, 0.5, 9.0]])
v, _ = np.linalg.eig(A)
v
.. parsed-literal::
:class: output
array([9.08033648, 0.99228545, 4.95394089, 2.97343718])
In this way, eigenvalues can be approximated, and the approximations
will be fairly accurate in the case that the diagonal is significantly
larger than all the other elements.
It is a small thing, but with a complex and subtle topic like
eigendecomposition, it is good to get any intuitive grasp we can.
A Useful Application: The Growth of Iterated Maps
-------------------------------------------------
Now that we understand what eigenvectors are in principle, let’s see how
they can be used to provide a deep understanding of a problem central to
neural network behavior: proper weight initialization.
Eigenvectors as Long Term Behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The full mathematical investigation of the initialization of deep neural
networks is beyond the scope of the text, but we can see a toy version
here to understand how eigenvalues can help us see how these models
work. As we know, neural networks operate by interspersing layers of
linear transformations with non-linear operations. For simplicity here,
we will assume that there is no non-linearity, and that the
transformation is a single repeated matrix operation :math:`A`, so that
the output of our model is
.. math::
\mathbf{v}_{out} = \mathbf{A}\cdot \mathbf{A}\cdots \mathbf{A} \mathbf{v}_{in} = \mathbf{A}^N \mathbf{v}_{in}.
When these models are initialized, :math:`A` is taken to be a random
matrix with Gaussian entries, so let’s make one of those. To be
concrete, we start with a mean zero, variance one Gaussian distributed
:math:`5 \times 5` matrix.
.. code:: python
np.random.seed(8675309)
k = 5
A = np.random.randn(k, k)
A
.. parsed-literal::
:class: output
array([[ 0.58902366, 0.73311856, -1.1621888 , -0.55681601, -0.77248843],
[-0.16822143, -0.41650391, -1.37843129, 0.74925588, 0.17888446],
[ 0.69401121, -1.9780535 , -0.83381434, 0.56437344, 0.31201299],
[-0.87334496, 0.15601291, -0.38710108, -0.23920821, 0.88850104],
[ 1.29385371, -0.76774106, 0.20131613, 0.91800842, 0.38974115]])
Behavior on Random Data
~~~~~~~~~~~~~~~~~~~~~~~
For simplicity in our toy model, we will assume that the data vector we
feed in :math:`\mathbf{v}_{in}` is a random five dimensional Gaussian
vector. Let’s think about what we want to have happen. For context, lets
think of a generic ML problem, where we are trying to turn input data,
like an image, into a prediction, like the probability the image is a
picture of a cat. If repeated application of :math:`\mathbf{A}`
stretches a random vector out to be very long, then small changes in
input will be amplified into large changes in output—tiny modifications
of the input image would lead to vastly different predictions. This does
not seem right!
On the flip side, if :math:`\mathbf{A}` shrinks random vectors to be
shorter, then after running through many layers, the vector will
essentially shrink to nothing, and the output will not depend on the
input. This is also clearly not right either!
We need to walk the narrow line between growth and decay to make sure
that our output changes depending on our input, but not much!
Let’s see what happens when we repeatedly multiply our matrix
:math:`\mathbf{A}` against a random input vector, and keep track of the
norm.
.. code:: python
# Calculate the sequence of norms after repeatedly applying A
v_in = np.random.randn(k, 1)
norm_list = [np.linalg.norm(v_in)]
for i in range(1, 100):
v_in = A.dot(v_in)
norm_list.append(np.linalg.norm(v_in))
d2l.plot(np.arange(0, 100), norm_list, 'Iteration', 'Value')
.. figure:: output_eigendecomposition_89942e_7_0.svg
The norm is growing uncontrollably! Indeed if we take the list of
quotients, we will see a pattern.
.. code:: python
# Compute the scaling factor of the norms
norm_ratio_list = []
for i in range(1, 100):
norm_ratio_list.append(norm_list[i]/norm_list[i - 1])
d2l.plot(np.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
.. figure:: output_eigendecomposition_89942e_9_0.svg
If we look at the last portion of the above computation, we see that the
random vector is stretched by a factor of ``1.974459321485[...]``, where
the portion at the end shifts a little, but the stretching factor is
stable.
Relating Back to Eigenvectors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen that eigenvectors and eigenvalues correspond to the amount
something is stretched, but that was for specific vectors, and specific
stretches. Let’s take a look at what they are for :math:`\mathbf{A}`. A
bit of a caveat here: it turns out that to see them all, we will need to
go to complex numbers. You can think of these as stretches and
rotations. By taking the norm of the complex number (square root of the
sums of squares of real and imaginary parts) we can measure that
stretching factor. Let’s also sort them.
.. code:: python
# Compute the eigenvalues
eigs = np.linalg.eigvals(A).tolist()
norm_eigs = [np.absolute(x) for x in eigs]
norm_eigs.sort()
"Norms of eigenvalues: {}".format(norm_eigs)
.. parsed-literal::
:class: output
'Norms of eigenvalues: [0.8786205280381857, 1.2757952665062624, 1.4983381517710659, 1.4983381517710659, 1.974459321485074]'
An Observation
~~~~~~~~~~~~~~
We see something a bit unexpected happening here: that number we
identified before for the long term stretching of our matrix
:math:`\mathbf{A}` applied to a random vector is *exactly* (accurate to
thirteen decimal places!) the largest eigenvalue of :math:`\mathbf{A}`.
This is clearly not a coincidence!
But, if we now think about what is happening geometrically, this starts
to make sense. Consider a random vector. This random vector points a
little in every direction, so in particular, it points at least a little
bit in the same direction as the eigenvector of :math:`\mathbf{A}`
associated with the largest eigenvalue. This is so important that it is
called the *principle eigenvalue* and *principle eigenvector*. After
applying :math:`\mathbf{A}`, our random vector gets stretched in every
possible direction, as is associated with every possible eigenvector,
but it is stretched most of all in the direction associated with this
principle eigenvector. What this means is that after apply in :math:`A`,
our random vector is longer, and points in a direction closer to being
aligned with the principle eigenvector. After applying the matrix many
times, the alignment with the principle eigenvector becomes closer and
closer until, for all practical purposes, our random vector has been
transformed into the principle eigenvector! Indeed this algorithm is the
basis for what is known as the *power iteration* for finding the largest
eigenvalue and eigenvector of a matrix. For details see, for example,
:cite:`Van-Loan.Golub.1983`.
Fixing the Normalization
~~~~~~~~~~~~~~~~~~~~~~~~
Now, from above discussions, we concluded that we do not want a random
vector to be stretched or squished at all, we would like random vectors
to stay about the same size throughout the entire process. To do so, we
now rescale our matrix by this principle eigenvalue so that the largest
eigenvalue is instead now just one. Let’s see what happens in this case.
.. code:: python
# Rescale the matrix A
A /= norm_eigs[-1]
# Do the same experiment again
v_in = np.random.randn(k, 1)
norm_list = [np.linalg.norm(v_in)]
for i in range(1, 100):
v_in = A.dot(v_in)
norm_list.append(np.linalg.norm(v_in))
d2l.plot(np.arange(0, 100), norm_list, 'Iteration', 'Value')
.. figure:: output_eigendecomposition_89942e_13_0.svg
We can also plot the ration between consecutive norms as before and see
that indeed it stabilizes.
.. code:: python
# Also plot the ratio
norm_ratio_list = []
for i in range(1, 100):
norm_ratio_list.append(norm_list[i]/norm_list[i-1])
d2l.plot(np.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
.. figure:: output_eigendecomposition_89942e_15_0.svg
Conclusions
-----------
We now see exactly what we hoped for! After normalizing the matrices by
the principle eigenvalue, we see that the random data does not explode
as before, but rather eventually equilibrates to a specific value. It
would be nice to be able to do these things from first principles, and
it turns out that if we look deeply at the mathematics of it, we can see
that the largest eigenvalue of a large random matrix with independent
mean zero, variance one Gaussian entries is on average about
:math:`\sqrt{n}`, or in our case :math:`\sqrt{5} \approx 2.2`, due to a
fascinating fact known as the *circular law* :cite:`Ginibre.1965`. The
relationship between the eigenvalues (and a related object called
singular values) of random matrices has been shown to have deep
connections to proper initialization of neural networks as was discussed
in :cite:`Pennington.Schoenholz.Ganguli.2017` and subsequent works.
Summary
-------
- Eigenvectors are vectors which are stretched by a matrix without
changing direction.
- Eigenvalues are the amount that the eigenvectors are stretched by the
application of the matrix.
- The eigendecomposition of a matrix can allow for many operations to
be reduced to operations on the eigenvalues.
- The Gershgorin Circle Theorem can provide approximate values for the
eigenvalues of a matrix.
- The behavior of iterated matrix powers depends primarily on the size
of the largest eigenvalue. This understanding has many applications
in the theory of neural network initialization.
Exercises
---------
1. What are the eigenvalues and eigenvectors of
.. math::
\mathbf{A} = \begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}?
2. What are the eigenvalues and eigenvectors of the following matrix,
and what is strange about this example compared to the previous one?
.. math::
\mathbf{A} = \begin{bmatrix}
2 & 1 \\
0 & 2
\end{bmatrix}.
3. Without computing the eigenvalues, is it possible that the smallest
eigenvalue of the following matrix is less that :math:`0.5`? *Note*:
this problem can be done in your head.
.. math::
\mathbf{A} = \begin{bmatrix}
3.0 & 0.1 & 0.3 & 1.0 \\
0.1 & 1.0 & 0.1 & 0.2 \\
0.3 & 0.1 & 5.0 & 0.0 \\
1.0 & 0.2 & 0.0 & 1.8
\end{bmatrix}.
`Discussions `__
-------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_eigendecomposition.svg