
13.1 Unconstrained Optimization

Dept. of Electrical and Systems Engineering
University of Pennsylvania

Lecture notes

1 Reading

Material related to this page, as well as additional exercises, can be found in Chapter 12 of the ROB101 textbook by Jesse Grizzle.

2 Learning Objectives

By the end of this page, you should know

  • what a mathematical optimization problem is
  • the difference between constrained and unconstrained optimization
  • an example of an optimization problem: least squares
  • what local and global minima are
  • what convex functions are, and some intuition on how to minimize them

3 A Brief Introduction to Optimization

There were a few times during the semester where we tried to find the “best” vector (or matrix) among a collection of many such vectors or matrices. For example, in least squares, we looked for the vector $\vv x$ that minimized the expression $\|A\vv x - \vv b\|^2$. In low-rank approximation, we looked for the matrix $\hat{M}$ of rank $k$ (or less) that minimized the expression $\|\hat{M} - M\|_F^2$.
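The least squares instance can be checked numerically. The sketch below (with a small made-up $A$ and $\vv b$) uses NumPy's `lstsq` to find the minimizing $\vv x$ and confirms that another candidate vector does no better:

```python
import numpy as np

# A hypothetical overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 2.0])

# np.linalg.lstsq returns (among other things) the x minimizing ||Ax - b||^2.
x_star, _, _, _ = np.linalg.lstsq(A, b, rcond=None)

cost = lambda x: np.linalg.norm(A @ x - b) ** 2

# Any other choice of x gives a cost at least as large.
x_other = np.array([0.5, 0.5])
print(cost(x_star) <= cost(x_other))  # True
```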

These were both specific instances of what is called a mathematical optimization problem. We will focus on unconstrained problems. You will learn a lot more about optimization problems in ESE 3040, and this lecture is only meant to give you a small taste, and to show how essential linear algebra is in finding solutions.

Both are instances of the generic problem

$$\min_{\vv x \in \mathbb{R}^n} f(\vv x), \tag{1}$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is the cost function and $\vv x$ is the decision variable. Optimization problem (1) is called unconstrained because we are free to pick any $\vv x \in \mathbb{R}^n$ we like to minimize $f(\vv x)$. A constrained optimization problem has the additional requirement that $\vv x$ must satisfy some added conditions, e.g., live in the solution set of $A\vv x = \vv b$. We will not consider such problems in this lecture, but you’ll see many in ESE 3040.

The goal of optimization is to find a special decision variable $\vv x^*$ for which the cost function $f$ is as small as possible, i.e., such that

$$f(\vv x^*) \leq f(\vv x) \quad \text{for all } \vv x \in \mathbb{R}^n. \tag{2}$$

Such an $\vv x^*$ is called an optimal solution to problem (1), and is defined as the $\arg\min$ of $f$:

$$\vv x^* = \arg\min_{\vv x \in \mathbb{R}^n} f(\vv x). \tag{3}$$

Equation (3) simply says in math that $\vv x^*$ satisfies the definition (2) of an optimal point.[1]

Despite how simple and innocuous problem (1) looks, it can be used to encode very rich and very challenging problems. Even for $\vv x \in \mathbb{R}$, we can get ourselves into trouble. Consider the following two functions that we wish to minimize:

Figure: optimizing scalars, $f_1(x)$ (left) and $f_2(x)$ (right).

Which do you think is easier to optimize? The left figure, with function $f_1(x)$, is “bowl-shaped” and is smallest at $x^* = 2$. What’s “nice” about $f_1$ is that there’s also an obvious algorithm for finding $x^* = 2$: if you imagine yourself as an ant standing on the function $f_1(x)$, all you need to do is “walk downhill” until you eventually find the bottom of the bowl.

In contrast, the right figure with function $f_2(x)$ has many hills and valleys. The optimal value $x^* = 3$ is the one for which $f_2(x)$ is smallest. But now if we again imagine ourselves as an ant standing on the function $f_2(x)$, our strategy of walking downhill will not always work! For example, if we were to start at $x = 1.5$, then walking downhill would bring us to the bottom of the first valley at $x = 1$. Now $x = 1$ is not an optimal point, since $f_2(3) < f_2(1)$, but if we look around at nearby points, then $f_2(1)$ is indeed the smallest; such a point is called a local minimum.
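The ant's predicament can be seen numerically. Here is a small sketch (using a made-up two-valley function, not the exact $f_2$ from the figure) showing that plain "walk downhill" iterations end up in different valleys depending on where they start:

```python
import numpy as np

# A hypothetical nonconvex function with two valleys:
# f(x) = (x^2 - 1)^2 + 0.2 x.  The tilt term 0.2 x makes the
# left valley (near x = -1) slightly deeper than the right one.
f = lambda x: (x**2 - 1)**2 + 0.2 * x
grad = lambda x: 4 * x * (x**2 - 1) + 0.2

def walk_downhill(x0, step=0.05, iters=500):
    """Repeatedly take a small step in the downhill direction -grad."""
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

left = walk_downhill(-1.5)   # lands in the deeper (global) valley, x near -1.02
right = walk_downhill(1.5)   # gets stuck in the shallower valley, x near 0.97
print(left, right, f(left) < f(right))
```

Both runs stop at points where the slope vanishes, but only the left one is the global minimum: walking downhill finds *a* valley, not necessarily the deepest one.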

As you may have guessed, we really like “bowl-shaped” functions, for which our walking downhill strategy finds a global minimum. Such functions satisfy a geometric property called convexity: a function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $\vv x, \vv y \in \mathbb{R}^n$ and all $\lambda \in [0, 1]$,

$$f(\lambda \vv x + (1-\lambda)\vv y) \leq \lambda f(\vv x) + (1-\lambda) f(\vv y). \tag{CVX}$$

To understand what (CVX) is saying, it is best to draw what it means for a scalar function $f: \mathbb{R} \to \mathbb{R}$.

Figure: a convex function; the line segment between any two points on the graph lies above the graph.

(CVX) says that if I pick any two points $(x, f(x))$ and $(y, f(y))$ on the graph, and draw a line segment between these two points, then this line lies above the graph. It turns out that this is exactly the right way to mathematically characterize “bowl-shaped” functions, even when $\vv x \in \mathbb{R}^n$. The important feature of convex functions is that “walking downhill” will always bring us to a global minimum. We won’t say much more about convex functions, but you’ll see them again in ESE 3040, and there is a graduate-level course, ESE 6050, which focuses entirely on convex optimization problems.
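The chord condition is easy to test numerically. This sketch samples random points and mixing weights and checks (CVX) for the convex function $f(\vv x) = \|\vv x\|^2$:

```python
import numpy as np

# Numerically check the chord condition (CVX) for f(x) = ||x||^2:
# f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) for t in [0, 1].
f = lambda x: np.dot(x, x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    t = rng.uniform()                       # a random point on the segment
    chord = t * f(x) + (1 - t) * f(y)       # height of the line segment
    assert f(t * x + (1 - t) * y) <= chord + 1e-12
print("chord lies above the graph in all samples")
```

Of course, sampling is evidence rather than proof; for $\|\vv x\|^2$ the inequality can also be verified algebraically.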

A more formal way of making the argument in Example 3 is to show that the Hessian $\nabla^2 f(\vv x)$ of $f$ is positive semidefinite. Here $\nabla^2 f(\vv x) = 2A^TA$, which is indeed positive semidefinite. This is the matrix/vector equivalent of saying that $f(x) = ax^2 + bx + c$ is an upward-pointing bowl if and only if $a \geq 0$.

If you don’t remember what the Hessian $\nabla^2 f(\vv x)$ of a function is, or how we computed $\nabla^2 f(\vv x) = 2A^TA$ in the above, don’t worry, we’ll review this in a bit.

4 Which way is down?

Let’s assume that we are either in a “nice” setting where our objective function is convex (bowl-shaped), or that we’re happy to settle for a local minimum. How can we figure out which way is down, so that we can tell our little ant friend which way they should walk? To get some intuition, we will start with functions with $\vv x \in \mathbb{R}^2$ and look at some cost function contour plots.

Let’s start with two familiar examples:

  1. $f(\vv x) = \|\vv x\|^2 = x_1^2 + x_2^2$, which is nothing but the Euclidean norm squared of $\vv x$, and
  2. $f(\vv x) = \|\vv x - \vv b\|^2 = (x_1 - b_1)^2 + (x_2 - b_2)^2$, which is a very simple least squares objective with $A = I$.

We have plotted both of these functions, and their contour plots, below:

Figure: surface and contour (level set) plots of $\|\vv x\|^2$ and $\|\vv x - \vv b\|^2$.

The contour plots show the level sets of $f(\vv x)$, which we’ve seen before. These are the sets $C_\alpha = \{\vv x \in \mathbb{R}^2 \mid f(\vv x) = \alpha\}$ for some constant $\alpha$. Here, these end up being circles, since for example $C_1$ is the set of $x_1, x_2$ such that $x_1^2 + x_2^2 = 1$ or $(x_1 - b_1)^2 + (x_2 - b_2)^2 = 1$.

Can we use these contour plots to identify which way is down if our ant is currently sitting at point $\vv w \in \mathbb{R}^2$?

To answer this question, we’ll need the gradient $\nabla f(\vv x)$ of our function. Recall from Math 1410 that for a function $f: \mathbb{R}^n \to \mathbb{R}$, its gradient $\nabla f(\vv x)$ is an $n$-vector of the partial derivatives of $f$:

$$\nabla f(\vv x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\vv x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\vv x) \end{bmatrix} \in \mathbb{R}^n$$

For our functions on $\mathbb{R}^2$, this reduces to $\nabla f(\vv x) = \left(\frac{\partial f}{\partial x_1}(x_1, x_2), \frac{\partial f}{\partial x_2}(x_1, x_2)\right) \in \mathbb{R}^2$.
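A gradient formula can always be sanity-checked against finite differences: each partial derivative is approximated by a small symmetric step along one coordinate. A sketch for $f(\vv x) = \|\vv x\|^2$, whose gradient is $2\vv x$:

```python
import numpy as np

# Finite-difference check of a gradient formula, here for f(x) = ||x||^2.
f = lambda x: np.dot(x, x)

def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative by a central difference."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h                             # perturb only coordinate i
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0])
print(numerical_gradient(f, x))  # close to the analytic gradient 2x = [2, -4]
```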

In the figure below, we show contour plots for $f_1(\vv x) = \|\vv x - \vv b\|^2 = (x_1 - 1)^2 + (x_2 - 2)^2$ and $f_2(\vv x) = \|A\vv x - \vv b\|^2$, along with the gradients $\nabla f(\vv x)$ (green arrows) at various points.

Figure: contour plots of $f_1$ and $f_2$ with gradients (green arrows) at various points; red dots mark the minima.

In both cases, the gradients are pointing in the direction of maximal increase: if we instead go in the opposite direction $-\nabla f(\vv x)$, we will move in the direction of maximal decrease!

Let’s flex our calculus muscles a bit and compute the gradients of $f_1(\vv x)$ and $f_2(\vv x)$[2]:

$$\nabla f_1(\vv x) = \begin{bmatrix} \frac{\partial}{\partial x_1}\left((x_1-1)^2 + (x_2-2)^2\right) \\ \frac{\partial}{\partial x_2}\left((x_1-1)^2 + (x_2-2)^2\right) \end{bmatrix} = 2\begin{bmatrix} x_1 - 1 \\ x_2 - 2 \end{bmatrix}$$

and

$$\nabla f_2(\vv x) = \nabla\left[(A\vv x - \vv b)^T(A\vv x - \vv b)\right] = 2A^T(A\vv x - \vv b)$$
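The formula $\nabla f_2(\vv x) = 2A^T(A\vv x - \vv b)$ can be verified against central finite differences, here with an arbitrary randomly generated $A$ and $\vv b$:

```python
import numpy as np

# Compare the analytic gradient 2 A^T (Ax - b) of f_2(x) = ||Ax - b||^2
# with a central finite-difference approximation.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
f2 = lambda x: np.linalg.norm(A @ x - b) ** 2

x = rng.normal(size=3)
analytic = 2 * A.T @ (A @ x - b)

h = 1e-6
numeric = np.array([
    (f2(x + h * e) - f2(x - h * e)) / (2 * h)   # partial derivative along e
    for e in np.eye(3)                           # standard basis directions
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```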

One thing you might notice in the plots above is that the red dots, which we placed at the function minima, have no green arrows. This is because the gradient at these points is zero. In fact, this is true in general: a point $\vv x$ is a local minimum only if $\nabla f(\vv x) = \vv 0$.

We won’t prove this, but intuitively, this makes sense: if $\nabla f(\vv x) \neq \vv 0$, then there is a direction $-\nabla f(\vv x)$ in which I could walk a little bit downhill to decrease $f(\vv x)$, so $\nabla f(\vv x)$ must be zero if we’re at a local minimum. Let’s check that this holds for $\nabla f_1(\vv x)$ and $\nabla f_2(\vv x)$ above:

For $f_1(\vv x)$, $\vv x^* = (1, 2)$, and $\nabla f_1(\vv x^*) = 2\begin{bmatrix} x_1^* - 1 \\ x_2^* - 2 \end{bmatrix} = \vv 0$, and so that checks out. And for $f_2(\vv x)$, $\vv x^* = (A^TA)^{-1}A^T\vv b$, and

$$\nabla f_2(\vv x^*) = 2(A^TA\vv x^* - A^T\vv b) = 2\big(\underbrace{(A^TA)(A^TA)^{-1}}_{I}A^T\vv b - A^T\vv b\big) = 2(A^T\vv b - A^T\vv b) = \vv 0.$$
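The same check runs numerically: with an arbitrary randomly generated $A$ and $\vv b$ (assumed tall and full rank so $A^TA$ is invertible), the gradient at the least-squares solution vanishes up to round-off:

```python
import numpy as np

# Verify that x* = (A^T A)^{-1} A^T b has zero gradient: 2 A^T (A x* - b) = 0.
rng = np.random.default_rng(3)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

x_star = np.linalg.solve(A.T @ A, A.T @ b)   # solve the normal equations
grad_at_star = 2 * A.T @ (A @ x_star - b)
print(np.linalg.norm(grad_at_star) < 1e-10)  # True (up to round-off)
```

Note that we solve the normal equations with `np.linalg.solve` rather than forming $(A^TA)^{-1}$ explicitly, which is the standard numerically preferable choice.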

In general, while $\nabla f(\vv x^*) = \vv 0$ is necessary for $\vv x^*$ to be a local minimum, it is not sufficient, i.e., there may be certain points with $\nabla f(\vv x^*) = \vv 0$ that are not local minima. For example, all red dots in the plot below have vanishing gradients, but only two are minima:

Figure: a function with several points of vanishing gradient (red dots); only two of them are minima.

However, if our function $f$ is convex (bowl-shaped), we have the following theorem, which tells us that walking downhill will always find a global minimum:

Theorem: if $f: \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, then any point $\vv x^*$ with $\nabla f(\vv x^*) = \vv 0$ is a global minimum of $f$.

Next, we’ll use our newfound insights to define gradient descent, a widely used iterative algorithm for finding local minima of optimization problem (1).


Footnotes
  1. Note that if there are multiple optimal points, one instead writes $\vv x^* \in \arg\min_{\vv x \in \mathbb{R}^n} f(\vv x)$ to indicate that $\vv x^*$ belongs to the set of optimal points.

  2. If you don’t know how to compute this gradient, you may wish to review the Math 1410 notes, but don’t worry, we won’t ask you to calculate gradients on the homework & exams.