# 11.2. Convexity¶

Convexity plays a vital role in the design of optimization algorithms. This is largely due to the fact that it is much easier to analyze and test algorithms in this context. In other words, if the algorithm performs poorly even in the convex setting we should not hope to see great results otherwise. Furthermore, even though the optimization problems in deep learning are generally nonconvex, they often exhibit some properties of convex ones near local minima. This can lead to exciting new optimization variants such as [Izmailov et al., 2018].

## 11.2.1. Basics¶

Let us begin with the basics.

### 11.2.1.1. Sets¶

Sets are the basis of convexity. Simply put, a set \(X\) in a vector space is convex if for any \(a, b \in X\) the line segment connecting \(a\) and \(b\) is also in \(X\). In mathematical terms this means that for all \(\lambda \in [0, 1]\) we have

This sounds a bit abstract. Consider the picture Fig. 11.2.1. The first set is not convex since there are line segments that are not contained in it. The other two sets suffer no such problem.

Definitions on their own are not particularly useful unless you can do something with them. In this case we can look at unions and intersections as shown in Fig. 11.2.2. Assume that \(X\) and \(Y\) are convex sets. Then \(X \cap Y\) is also convex. To see this, consider any \(a, b \in X \cap Y\). Since \(X\) and \(Y\) are convex, the line segments connecting \(a\) and \(b\) are contained in both \(X\) and \(Y\). Given that, they also need to be contained in \(X \cap Y\), thus proving our first theorem.

We can strengthen this result with little effort: given convex sets \(X_i\), their intersection \(\cap_{i} X_i\) is convex. To see that the converse is not true, consider two disjoint sets \(X \cap Y = \emptyset\). Now pick \(a \in X\) and \(b \in Y\). The line segment in Fig. 11.2.3 connecting \(a\) and \(b\) needs to contain some part that is neither in \(X\) nor \(Y\), since we assumed that \(X \cap Y = \emptyset\). Hence the line segment is not in \(X \cup Y\) either, thus proving that in general unions of convex sets need not be convex.

Typically the problems in deep learning are defined on convex domains. For instance \(\mathbb{R}^d\) is a convex set (after all, the line between any two points in \(\mathbb{R}^d\) remains in \(\mathbb{R}^d\)). In some cases we work with variables of bounded length, such as balls of radius \(r\) as defined by \(\{\mathbf{x} | \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\|_2 \leq r\}\).

### 11.2.1.2. Functions¶

Now that we have convex sets we can introduce convex functions \(f\). Given a convex set \(X\) a function defined on it \(f: X \to \mathbb{R}\) is convex if for all \(x, x' \in X\) and for all \(\lambda \in [0, 1]\) we have

To illustrate this let us plot a few functions and check which ones satisfy the requirement. We need to import a few libraries.

```
%matplotlib inline
from d2l import mxnet as d2l
from mpl_toolkits import mplot3d
from mxnet import np, npx
npx.set_np()
```

```
%matplotlib inline
from d2l import torch as d2l
import numpy as np
from mpl_toolkits import mplot3d
import torch
```

```
%matplotlib inline
from d2l import tensorflow as d2l
import numpy as np
from mpl_toolkits import mplot3d
import tensorflow as tf
```

Let us define a few functions, both convex and nonconvex.

```
f = lambda x: 0.5 * x**2 # Convex
g = lambda x: np.cos(np.pi * x) # Nonconvex
h = lambda x: np.exp(0.5 * x) # Convex
x, segment = np.arange(-2, 2, 0.01), np.array([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
d2l.plot([x, segment], [func(x), func(segment)], axes=ax)
```

```
f = lambda x: 0.5 * x**2 # Convex
g = lambda x: torch.cos(np.pi * x) # Nonconvex
h = lambda x: torch.exp(0.5 * x) # Convex
x, segment = torch.arange(-2, 2, 0.01), torch.tensor([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
d2l.plot([x, segment], [func(x), func(segment)], axes=ax)
```

```
f = lambda x: 0.5 * x**2 # Convex
g = lambda x: tf.cos(np.pi * x) # Nonconvex
h = lambda x: tf.exp(0.5 * x) # Convex
x, segment = tf.range(-2, 2, 0.01), tf.constant([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
d2l.plot([x, segment], [func(x), func(segment)], axes=ax)
```

As expected, the cosine function is nonconvex, whereas the parabola and the exponential function are. Note that the requirement that \(X\) is a convex set is necessary for the condition to make sense. Otherwise the outcome of \(f(\lambda x + (1-\lambda) x')\) might not be well defined. Convex functions have a number of desirable properties.

### 11.2.1.3. Jensen’s Inequality¶

One of the most useful tools is Jensen’s inequality. It amounts to a generalization of the definition of convexity:

where \(\alpha_i\) are nonnegative real numbers such that \(\sum_i \alpha_i = 1\). In other words, the expectation of a convex function is larger than the convex function of an expectation. To prove the first inequality we repeatedly apply the definition of convexity to one term in the sum at a time. The expectation can be proven by taking the limit over finite segments.

One of the common applications of Jensen’s inequality is with regard to the log-likelihood of partially observed random variables. That is, we use

This follows since \(\int P(y) P(x \mid y) dy = P(x)\). This is used in variational methods. Here \(y\) is typically the unobserved random variable, \(P(y)\) is the best guess of how it might be distributed and \(P(x)\) is the distribution with \(y\) integrated out. For instance, in clustering \(y\) might be the cluster labels and \(P(x \mid y)\) is the generative model when applying cluster labels.

## 11.2.2. Properties¶

Convex functions have a few useful properties. We describe them as follows.

### 11.2.2.1. No Local Minima¶

In particular, convex functions do not have local minima. Let us assume
the contrary and prove it wrong. If \(x \in X\) is a local minimum
there exists some neighborhood of \(x\) for which \(f(x)\) is
the smallest value. Since \(x\) is only a local minimum there has to
be another \(x' \in X\) for which \(f(x') < f(x)\). However, by
convexity the function values on the entire *line*
\(\lambda x + (1-\lambda) x'\) have to be less than \(f(x)\)
since for \(\lambda \in [0, 1)\)

This contradicts the assumption that \(f(x)\) is a local minimum. For instance, the function \(f(x) = (x+1) (x-1)^2\) has a local minimum for \(x=1\). However, it is not a global minimum.

```
f = lambda x: (x-1)**2 * (x+1)
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')
```

```
f = lambda x: (x-1)**2 * (x+1)
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')
```

```
f = lambda x: (x-1)**2 * (x+1)
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')
```

The fact that convex functions have no local minima is very convenient. It means that if we minimize functions we cannot “get stuck”. Note, though, that this does not mean that there cannot be more than one global minimum or that there might even exist one. For instance, the function \(f(x) = \mathrm{max}(|x|-1, 0)\) attains its minimum value over the interval \([-1, 1]\). Conversely, the function \(f(x) = \exp(x)\) does not attain a minimum value on \(\mathbb{R}\). For \(x \to -\infty\) it asymptotes to \(0\), however there is no \(x\) for which \(f(x) = 0\).

### 11.2.2.2. Convex Functions and Sets¶

Convex functions define convex sets as *below-sets*. They are defined as

Such sets are convex. Let us prove this quickly. Remember that for any \(x, x' \in S_b\) we need to show that \(\lambda x + (1-\lambda) x' \in S_b\) as long as \(\lambda \in [0, 1]\). But this follows directly from the definition of convexity since \(f(\lambda x + (1-\lambda) x') \leq \lambda f(x) + (1-\lambda) f(x') \leq b\).

Have a look at the function \(f(x, y) = 0.5 x^2 + \cos(2 \pi y)\) below. It is clearly nonconvex. The level sets are correspondingly nonconvex. In fact, they are typically composed of disjoint sets.

```
x, y = np.meshgrid(
np.linspace(-1.0, 1.0, 101), np.linspace(-1.0, 1.0, 101))
z = x**2 + 0.5 * np.cos(2 * np.pi * y)
# Plot the 3D surface
d2l.set_figsize((6, 4))
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.contour(x, y, z, offset=-1)
ax.set_zlim(-1, 1.5)
# Adjust labels
for func in [d2l.plt.xticks, d2l.plt.yticks, ax.set_zticks]:
func([-1, 0, 1])
```

```
/var/lib/jenkins/miniconda3/envs/d2l-en-release-0/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order, subok=True)
```

```
x, y = torch.meshgrid(
torch.linspace(-1.0, 1.0, 101), torch.linspace(-1.0, 1.0, 101))
z = x**2 + 0.5 * torch.cos(2 * np.pi * y)
# Plot the 3D surface
d2l.set_figsize((6, 4))
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.contour(x, y, z, offset=-1)
ax.set_zlim(-1, 1.5)
# Adjust labels
for func in [d2l.plt.xticks, d2l.plt.yticks, ax.set_zticks]:
func([-1, 0, 1])
```

```
/var/lib/jenkins/miniconda3/envs/d2l-en-release-0/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order, subok=True)
```

```
x, y = tf.meshgrid(
tf.linspace(-1.0, 1.0, 101), tf.linspace(-1.0, 1.0, 101))
z = x**2 + 0.5 * tf.cos(2 * np.pi * y)
# Plot the 3D surface
d2l.set_figsize((6, 4))
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.contour(x, y, z, offset=-1)
ax.set_zlim(-1, 1.5)
# Adjust labels
for func in [d2l.plt.xticks, d2l.plt.yticks, ax.set_zticks]:
func([-1, 0, 1])
```

```
/var/lib/jenkins/miniconda3/envs/d2l-en-release-0/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order, subok=True)
```

### 11.2.2.3. Derivatives and Convexity¶

Whenever the second derivative of a function exists it is very easy to check for convexity. All we need to do is check whether \(\partial_x^2 f(x) \succeq 0\), i.e., whether all of its eigenvalues are nonnegative. For instance, the function \(f(\mathbf{x}) = \frac{1}{2} \|\mathbf{x}\|^2_2\) is convex since \(\partial_{\mathbf{x}}^2 f = \mathbf{1}\), i.e., its derivative is the identity matrix.

The first thing to realize is that we only need to prove this property for one-dimensional functions. After all, in general we can always define some function \(g(z) = f(\mathbf{x} + z \cdot \mathbf{v})\). This function has the first and second derivatives \(g' = (\partial_{\mathbf{x}} f)^\top \mathbf{v}\) and \(g'' = \mathbf{v}^\top (\partial^2_{\mathbf{x}} f) \mathbf{v}\) respectively. In particular, \(g'' \geq 0\) for all \(\mathbf{v}\) whenever the Hessian of \(f\) is positive semidefinite, i.e., whenever all of its eigenvalues are greater equal than zero. Hence back to the scalar case.

To see that \(f''(x) \geq 0\) for convex functions we use the fact that

Since the second derivative is given by the limit over finite differences it follows that

To see that the converse is true we use the fact that \(f'' \geq 0\) implies that \(f'\) is a monotonically increasing function. Let \(a < x < b\) be three points in \(\mathbb{R}\). We use the mean value theorem to express

By monotonicity \(f'(\beta) \geq f'(\alpha)\), hence

By geometry it follows that \(f(x)\) is below the line connecting \(f(a)\) and \(f(b)\), thus proving convexity. We omit a more formal derivation in favor of a graph below.

```
f = lambda x: 0.5 * x**2
x = np.arange(-2, 2, 0.01)
axb, ab = np.array([-1.5, -0.5, 1]), np.array([-1.5, 1])
d2l.set_figsize()
d2l.plot([x, axb, ab], [f(x) for x in [x, axb, ab]], 'x', 'f(x)')
d2l.annotate('a', (-1.5, f(-1.5)), (-1.5, 1.5))
d2l.annotate('b', (1, f(1)), (1, 1.5))
d2l.annotate('x', (-0.5, f(-0.5)), (-1.5, f(-0.5)))
```

```
f = lambda x: 0.5 * x**2
x = torch.arange(-2, 2, 0.01)
axb, ab = torch.tensor([-1.5, -0.5, 1]), torch.tensor([-1.5, 1])
d2l.set_figsize()
d2l.plot([x, axb, ab], [f(x) for x in [x, axb, ab]], 'x', 'f(x)')
d2l.annotate('a', (-1.5, f(-1.5)), (-1.5, 1.5))
d2l.annotate('b', (1, f(1)), (1, 1.5))
d2l.annotate('x', (-0.5, f(-0.5)), (-1.5, f(-0.5)))
```

```
f = lambda x: 0.5 * x**2
x = tf.range(-2, 2, 0.01)
axb, ab = tf.constant([-1.5, -0.5, 1]), tf.constant([-1.5, 1])
d2l.set_figsize()
d2l.plot([x, axb, ab], [f(x) for x in [x, axb, ab]], 'x', 'f(x)')
d2l.annotate('a', (-1.5, f(-1.5)), (-1.5, 1.5))
d2l.annotate('b', (1, f(1)), (1, 1.5))
d2l.annotate('x', (-0.5, f(-0.5)), (-1.5, f(-0.5)))
```

## 11.2.3. Constraints¶

One of the nice properties of convex optimization is that it allows us to handle constraints efficiently. That is, it allows us to solve problems of the form:

Here \(f\) is the objective and the functions \(c_i\) are constraint functions. To see what this does consider the case where \(c_1(\mathbf{x}) = \|\mathbf{x}\|_2 - 1\). In this case the parameters \(\mathbf{x}\) are constrained to the unit ball. If a second constraint is \(c_2(\mathbf{x}) = \mathbf{v}^\top \mathbf{x} + b\), then this corresponds to all \(\mathbf{x}\) lying on a halfspace. Satisfying both constraints simultaneously amounts to selecting a slice of a ball as the constraint set.

### 11.2.3.1. Lagrange Function¶

In general, solving a constrained optimization problem is difficult. One way of addressing it stems from physics with a rather simple intuition. Imagine a ball inside a box. The ball will roll to the place that is lowest and the forces of gravity will be balanced out with the forces that the sides of the box can impose on the ball. In short, the gradient of the objective function (i.e., gravity) will be offset by the gradient of the constraint function (need to remain inside the box by virtue of the walls “pushing back”). Note that any constraint that is not active (i.e., the ball does not touch the wall) will not be able to exert any force on the ball.

Skipping over the derivation of the Lagrange function \(L\) (see e.g., the book by Boyd and Vandenberghe for details [Boyd & Vandenberghe, 2004]) the above reasoning can be expressed via the following saddlepoint optimization problem:

Here the variables \(\alpha_i\) are the so-called *Lagrange
Multipliers* that ensure that a constraint is properly enforced. They
are chosen just large enough to ensure that
\(c_i(\mathbf{x}) \leq 0\) for all \(i\). For instance, for any
\(\mathbf{x}\) for which \(c_i(\mathbf{x}) < 0\) naturally, we’d
end up picking \(\alpha_i = 0\). Moreover, this is a *saddlepoint*
optimization problem where one wants to *maximize* \(L\) with
respect to \(\alpha\) and simultaneously *minimize* it with respect
to \(\mathbf{x}\). There is a rich body of literature explaining how
to arrive at the function \(L(\mathbf{x}, \alpha)\). For our
purposes it is sufficient to know that the saddlepoint of \(L\) is
where the original constrained optimization problem is solved optimally.

### 11.2.3.2. Penalties¶

One way of satisfying constrained optimization problems at least approximately is to adapt the Lagrange function \(L\). Rather than satisfying \(c_i(\mathbf{x}) \leq 0\) we simply add \(\alpha_i c_i(\mathbf{x})\) to the objective function \(f(x)\). This ensures that the constraints will not be violated too badly.

In fact, we have been using this trick all along. Consider weight decay in Section 4.5. In it we add \(\frac{\lambda}{2} \|\mathbf{w}\|^2\) to the objective function to ensure that \(\mathbf{w}\) does not grow too large. Using the constrained optimization point of view we can see that this will ensure that \(\|\mathbf{w}\|^2 - r^2 \leq 0\) for some radius \(r\). Adjusting the value of \(\lambda\) allows us to vary the size of \(\mathbf{w}\).

In general, adding penalties is a good way of ensuring approximate constraint satisfaction. In practice this turns out to be much more robust than exact satisfaction. Furthermore, for nonconvex problems many of the properties that make the exact approach so appealing in the convex case (e.g., optimality) no longer hold.

### 11.2.3.3. Projections¶

An alternative strategy for satisfying constraints are projections. Again, we encountered them before, e.g., when dealing with gradient clipping in Section 8.5. There we ensured that a gradient has length bounded by \(c\) via

This turns out to be a *projection* of \(g\) onto the ball of radius
\(c\). More generally, a projection on a (convex) set \(X\) is
defined as

It is thus the closest point in \(X\) to \(\mathbf{x}\). This sounds a bit abstract. Fig. 11.2.4 explains it somewhat more clearly. In it we have two convex sets, a circle and a diamond. Points inside the set (yellow) remain unchanged. Points outside the set (black) are mapped to the closest point inside the set (red). While for \(L_2\) balls this leaves the direction unchanged, this need not be the case in general, as can be seen in the case of the diamond.

One of the uses for convex projections is to compute sparse weight vectors. In this case we project \(\mathbf{w}\) onto an \(L_1\) ball (the latter is a generalized version of the diamond in the picture above).

## 11.2.4. Summary¶

In the context of deep learning the main purpose of convex functions is to motivate optimization algorithms and help us understand them in detail. In the following we will see how gradient descent and stochastic gradient descent can be derived accordingly.

Intersections of convex sets are convex. Unions are not.

The expectation of a convex function is larger than the convex function of an expectation (Jensen’s inequality).

A twice-differentiable function is convex if and only if its second derivative has only nonnegative eigenvalues throughout.

Convex constraints can be added via the Lagrange function. In practice simply add them with a penalty to the objective function.

Projections map to points in the (convex) set closest to the original point.

## 11.2.5. Exercises¶

Assume that we want to verify convexity of a set by drawing all lines between points within the set and checking whether the lines are contained.

Prove that it is sufficient to check only the points on the boundary.

Prove that it is sufficient to check only the vertices of the set.

Denote by \(B_p[r] := \{\mathbf{x} | \mathbf{x} \in \mathbb{R}^d \text{ and } \|\mathbf{x}\|_p \leq r\}\) the ball of radius \(r\) using the \(p\)-norm. Prove that \(B_p[r]\) is convex for all \(p \geq 1\).

Given convex functions \(f\) and \(g\) show that \(\mathrm{max}(f, g)\) is convex, too. Prove that \(\mathrm{min}(f, g)\) is not convex.

Prove that the normalization of the softmax function is convex. More specifically prove the convexity of \(f(x) = \log \sum_i \exp(x_i)\).

Prove that linear subspaces are convex sets, i.e., \(X = \{\mathbf{x} | \mathbf{W} \mathbf{x} = \mathbf{b}\}\).

Prove that in the case of linear subspaces with \(\mathbf{b} = 0\) the projection \(\mathrm{Proj}_X\) can be written as \(\mathbf{M} \mathbf{x}\) for some matrix \(\mathbf{M}\).

Show that for convex twice differentiable functions \(f\) we can write \(f(x + \epsilon) = f(x) + \epsilon f'(x) + \frac{1}{2} \epsilon^2 f''(x + \xi)\) for some \(\xi \in [0, \epsilon]\).

Given a vector \(\mathbf{w} \in \mathbb{R}^d\) with \(\|\mathbf{w}\|_1 > 1\) compute the projection on the \(\ell_1\) unit ball.

As intermediate step write out the penalized objective \(\|\mathbf{w} - \mathbf{w}'\|_2^2 + \lambda \|\mathbf{w}'\|_1\) and compute the solution for a given \(\lambda > 0\).

Can you find the ‘right’ value of \(\lambda\) without a lot of trial and error?

Given a convex set \(X\) and two vectors \(\mathbf{x}\) and \(\mathbf{y}\) prove that projections never increase distances, i.e., \(\|\mathbf{x} - \mathbf{y}\| \geq \|\mathrm{Proj}_X(\mathbf{x}) - \mathrm{Proj}_X(\mathbf{y})\|\).