
13.2 Gradient Descent

Dept. of Electrical and Systems Engineering
University of Pennsylvania



1 Reading

Material related to this page, as well as additional exercises, can be found in Chapter 12 of the ROB101 textbook by Jesse Grizzle.

2 Learning Objectives

By the end of this page, you should know:

  • what the gradient descent algorithm is
  • examples of implementing the gradient descent algorithm
  • what exact line search is
  • when gradient descent performs poorly and how to address it

3 Introduction

Our intuition so far is that we should try to “walk downhill” and that the negative gradient $-\nabla f(\vv x)$ tells us the steepest direction of descent at a point $\vv x$. Can we turn this into an algorithm for minimizing (at least locally) a cost function $f(\vv x)$?

This intuition is precisely the motivation behind the gradient descent algorithm, which, starting with an initial guess $\vv x^{(0)}$, iteratively updates the current best guess $\vv x^{(k)}$ of $\vv x^*$ according to:

$$\vv x^{(k+1)} = \vv x^{(k)} - s\nabla f(\vv x^{(k)}), \quad \text{for } k=0,1,2,\ldots \quad (\text{GD})$$

where $s > 0$ is called the step size. The update rule (GD) moves you in the direction of a local minimum, but be careful: if $s$ is too large, you can overshoot (we’ll see more about this later).

Because we know that $\nabla f(\vv x^*) = \vv 0$ if $\vv x^*$ is a local minimum, we can use the norm of the gradient as a stopping criterion, i.e., if $\|\nabla f(\vv x^{(k)})\| \leq \epsilon$ for some small $\epsilon > 0$, we stop updating our iterate because $\vv x^{(k)}$ is “close enough” to $\vv x^*$ (typical choices of $\epsilon$ are $10^{-4}$ or $10^{-6}$, depending on how precise a solution is required).
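To make (GD) and this stopping criterion concrete, here is a minimal sketch in Python (the quadratic test function below is our own illustrative choice, not one from the notes):

import numpy as np

def gradient_descent(grad_f, x0, s=0.1, eps=1e-6, max_iters=10000):
    """Minimal gradient descent: repeat x <- x - s * grad_f(x) until
    the gradient norm falls below eps (the stopping criterion above)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # ||grad f(x^(k))|| <= eps: close enough
            break
        x = x - s * g                  # the (GD) update
    return x

# Example: f(x) = ||x - c||^2 has gradient 2(x - c) and minimizer c
c = np.array([1.0, -2.0])
print(gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2)))  # ~[1, -2]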

Before looking at some examples of (GD) in action, let’s try to get some intuition as to why it might work. Suppose we are currently at $\vv x^{(k)}$: let’s form a linear approximation of $f(\vv x)$ near $\vv x^{(k)}$ using its first-order Taylor series approximation:

$$f(\vv x) \approx \underbrace{f(\vv x^{(k)})}_{\text{base point}} + \underbrace{\nabla f(\vv x^{(k)})^T}_{\text{how } f \text{ changes at } \vv x^{(k)}}\underbrace{(\vv x-\vv x^{(k)})}_{\text{direction of change}} \quad (\text{TS})$$
(Figure: the first-order Taylor approximation (TS) of $f$ near $\vv x^{(k)}$.)

As you can see from the figure, (TS) is a very good approximation of $f(\vv x)$ when $\vv x$ is not too far from $\vv x^{(k)}$, but it gets worse as we move further away.

Let’s use (TS) to define our next point $\vv x^{(k+1)}$ so that $f(\vv x^{(k+1)}) < f(\vv x^{(k)})$. If we define $\Delta \vv x^{(k)} = \vv x^{(k+1)} - \vv x^{(k)}$, then evaluating (TS) at the point $\vv x^{(k+1)}$ gives

$$f(\vv x^{(k+1)}) - f(\vv x^{(k)}) \approx \nabla f(\vv x^{(k)})^T \Delta \vv x^{(k)} \left(= \langle \nabla f(\vv x^{(k)}), \Delta \vv x^{(k)} \rangle\right)$$

so that if we want $f(\vv x^{(k+1)}) < f(\vv x^{(k)})$, then we should find a nearby $\vv x^{(k+1)}$ such that

$$\nabla f(\vv x^{(k)})^T \Delta \vv x^{(k)} < 0.$$

Now, assuming that $\nabla f(\vv x^{(k)}) \neq \vv 0$ (so we’re not at a local extremum), a clear choice for $\Delta \vv x^{(k)}$ is $-s\nabla f(\vv x^{(k)})$, with $s > 0$ a step size chosen small enough that (TS) is a good approximation. In that case, we have

$$\nabla f(\vv x^{(k)})^T \Delta \vv x^{(k)} = -s\|\nabla f(\vv x^{(k)})\|^2 < 0.$$

In general though, any choice of $\Delta \vv x^{(k)}$ satisfying the inequality above is a valid descent direction. Geometrically, this is illustrated in the picture below:

(Figure: any direction making a negative inner product with $\nabla f(\vv x^{(k)})$ is a valid descent direction.)
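As a quick numerical sanity check (a sketch on an arbitrary test function of our own choosing), any direction making a negative inner product with the gradient decreases $f$ for a small enough step:

import numpy as np

# Test function f(x) = x1^2 + 3*x2^2 with gradient (2*x1, 6*x2)
f = lambda x: x[0]**2 + 3 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 1.0])
g = grad_f(x)                     # gradient at x: (2, 6)
d = np.array([-1.0, -0.5])        # a descent direction that is NOT -grad f
print(g @ d)                      # -5.0 < 0, so d satisfies the inequality
print(f(x + 0.01 * d) < f(x))     # True: a small step along d decreases f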

3.1 Python Break!

In this section, we’ll take a look at how the step size affects the convergence of the gradient descent algorithm through a few Python examples. We’ll consider a less trivial least-squares objective, where we randomly generate a (rectangular) matrix $A$ and vector $\vv b$ and minimize the least-squares objective $\|A\vv x - \vv b\|^2$ with respect to $\vv x$. We’ll consider several choices of step size, and plot the convergence of the objective value against the number of gradient descent iterations:

  • Larger step sizes of $\eta = 0.0036$ and $\eta = 0.0035$

  • Smaller step sizes of $\eta = 0.001$, $\eta = 0.0005$, and $\eta = 0.0001$

The convergence is shown on a log-suboptimality plot, which plots $\log_{10}(f(\vv x^{(t)}) - f^*)$ as a function of $t$, where $f^*$ is the minimum objective value. Faster-converging step sizes show up as more steeply decreasing curves on this plot.

import numpy as np
import matplotlib.pyplot as plt

# Generate random matrix A and vector b
np.random.seed(42)
A = np.random.randn(100, 50)
b = np.random.randn(100)

# Objective function: ||Ax - b||^2
def objective_function(x):
    return np.linalg.norm(A @ x - b) ** 2

# Gradient of the objective function
def gradient(x):
    return 2 * A.T @ (A @ x - b)

# Gradient descent function
def gradient_descent(A, b, eta, num_iterations):
    x = np.zeros(A.shape[1])  # Initialize x to zero
    obj_values = []
    f_star = objective_function(np.linalg.pinv(A) @ b)  # Best achievable value (for log suboptimality)

    for _ in range(num_iterations):
        obj_values.append(objective_function(x))
        grad = gradient(x)
        x -= eta * grad  # Update x using gradient descent

    # Compute log suboptimality
    log_suboptimality = np.log10(np.array(obj_values) - f_star + 1e-10)  # Small epsilon to avoid log(0)
    
    return log_suboptimality

# Set number of iterations
num_iterations = 100

# Define a range of step sizes to try
etas = [0.0036, 0.0035, 0.001, 0.0005, 0.0001]

# Plot the results for each step size
plt.figure(figsize=(12, 6))

for eta in etas:
    log_suboptimality = gradient_descent(A, b, eta, num_iterations)
    plt.plot(log_suboptimality, label=f'eta={eta}')

plt.xlabel('Iteration')
plt.ylabel('Log Suboptimality')
plt.title('Convergence of Gradient Descent with Different Step Sizes')
plt.legend()
plt.grid(True)
plt.show()

print('1 / maximum eigenvalue of A.T @ A:', 1 / np.linalg.eigvalsh(A.T @ A).max())
1 / maximum eigenvalue of A.T @ A: 0.0035395895131759892

There are a few observations we can make from the graph. First, when the step size η is very small, gradient descent converges to the minimum, but at a slow rate, as indicated by the purple curve. Second, by gradually increasing η (up to a point), we can achieve faster convergence, as indicated by the purple, red, and green curves. Third, however, once we increase η too much, convergence starts to get slower again, and if we increase η enough, gradient descent actually diverges, as indicated by the orange (very slow) and blue (divergent) curves.

As a cool and optional final note, we also printed the inverse of the maximum eigenvalue of A.T @ A, which is $\approx 0.003539$. Note that it lies between the step sizes 0.0035 and 0.0036, which bracket the boundary where gradient descent starts to diverge; this is no coincidence! For those who are familiar with Lipschitz continuity, the objective function $f(\vv x) = \|A\vv x - \vv b\|^2$ has Hessian $2A^\top A$, meaning that its first derivative $\nabla f(\vv x)$ is Lipschitz-continuous in $\vv x$ with Lipschitz constant $L = 2\lambda_{\max}(A^\top A)$ (twice the maximum eigenvalue of $A^\top A$). It turns out that when the step size is chosen such that $\eta < \frac{2}{L} = \frac{1}{\lambda_{\max}(A^\top A)}$, gradient descent is guaranteed to converge, which is exactly the boundary we observe! A proof of this fact can be found here.
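To see this boundary empirically, here is a small sketch (reusing A, objective_function, and gradient from the code above) that runs gradient descent with step sizes just below and just above $1/\lambda_{\max}(A^\top A)$:

# Step sizes just below and above the divergence boundary 1 / lambda_max(A.T A)
lam_max = np.linalg.eigvalsh(A.T @ A).max()
for eta in [0.95 / lam_max, 1.05 / lam_max]:
    x = np.zeros(A.shape[1])
    for _ in range(200):
        x -= eta * gradient(x)
    print(f'eta = {eta:.6f}: objective after 200 steps = {objective_function(x):.3e}')
# The first step size converges; the second blows up.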

4 Zig-zags and What to do About Them

Let’s consider a very simple optimization over $\vv x \in \mathbb{R}^2$ with cost function

$$f(x, y) = \frac{1}{2}(x^2 + by^2)$$

where we’ll let $0 < b \leq 1$ vary. The optimal solution is obviously $(x^*, y^*) = (0, 0)$, but we’ll use this example to illustrate how gradient descent can sometimes get you into trouble.

Suppose we run gradient descent on $f$, and we further allow ourselves to pick the best possible step size $s_k$ at each iteration, i.e., we choose step size

$$s_k = \argmin_{s > 0} f(\vv x^{(k)} - s\nabla f(\vv x^{(k)}))$$

and then update $\vv x^{(k+1)} = \vv x^{(k)} - s_k \nabla f(\vv x^{(k)})$.

This is called exact line search for selecting the step size, and is widely used in practice. If we use this choice of step size $s_k$, then it is possible to write an explicit formula for our iterates $(x^{(k)}, y^{(k)})$ as we progress down the bowl. Starting at $(x^{(0)}, y^{(0)}) = (b, 1)$, we get

$$x^{(k)} = b\left(\frac{b-1}{b+1}\right)^k, \qquad y^{(k)} = \left(\frac{1-b}{1+b}\right)^k.$$

If $b = 1$, which corresponds to a function with level sets that are perfect circles, we succeed immediately in one step ($x^{(1)} = y^{(1)} = 0$). This is because the gradient always points directly at the optimal point $(0, 0)$:

(Figure: for $b = 1$, the level sets are circles and the negative gradient points straight at the origin.)

The real purpose of this example is seen when $b$ is small. The crucial ratio in the formula above is

$$r = \frac{b-1}{b+1}.$$

If $|r|$ is small, $(x^{(k)}, y^{(k)})$ converges to $(0, 0)$ very quickly. However, if $|r|$ is close to 1, then this convergence is very slow. For example, if $b = \frac{1}{10}$, then $|r| = \frac{9}{11}$; for $b = \frac{1}{100}$, $|r| = \frac{99}{101}$.

This means we need to take many more gradient steps to get close to $\vv x^* = (0, 0)$. The picture below highlights what’s going wrong: for small $b$, the level sets become elongated ellipses, so that following gradients leads to us zig-zagging our way to the origin instead of taking a straight path. It is this zig-zagging that causes slow convergence.

(Figure: for small $b$, gradient descent with exact line search zig-zags across the elongated elliptical level sets on its way to the origin.)
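To see exact line search in code, here is a short sketch that runs it on $f(x, y) = \frac{1}{2}(x^2 + by^2)$; for a quadratic $f(\vv x) = \frac{1}{2}\vv x^T K \vv x$, the exact line search step has the closed form $s_k = \frac{\vv g^T \vv g}{\vv g^T K \vv g}$ with $\vv g = \nabla f(\vv x^{(k)})$, which is what the code uses:

import numpy as np

def exact_line_search_gd(b, num_iters):
    """Gradient descent with exact line search on f(x, y) = 0.5*(x^2 + b*y^2)."""
    K = np.diag([1.0, b])
    z = np.array([b, 1.0])             # starting point (b, 1)
    for _ in range(num_iters):
        g = K @ z                      # gradient of 0.5 * z^T K z
        if g @ g < 1e-30:              # already at the minimum
            break
        s = (g @ g) / (g @ (K @ g))    # optimal step size along -g
        z = z - s * g
    return z

print(exact_line_search_gd(b=1.0, num_iters=20))   # [0, 0]: one step suffices
print(exact_line_search_gd(b=0.1, num_iters=20))   # still nonzero: |r| = 9/11

Running this for different values of $b$ makes the $|r|^k$ convergence rate from the formula above easy to verify.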

So what’s going wrong here? If we write $f(\vv x)$ as a quadratic form, it is:

$$f(\vv x) = \frac{1}{2}\begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & b \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \frac{1}{2}\vv x^T K \vv x$$

Notice that the condition number of the matrix KK is precisely 1b\frac{1}{b}, which is large for small bb. This means that one direction (in this case the x-axis) is penalized much more than the other (the y-axis). This leads to stretched out level sets, which leads to zigzags and slow convergence. How can we fix this?
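As a quick check (a small sketch for one value of $b$), NumPy’s np.linalg.cond confirms this:

import numpy as np

b = 0.01
K = np.diag([1.0, b])
print(np.linalg.cond(K))   # 100.0, i.e., exactly 1/b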

5 Newton’s Method (optional)

To derive the gradient descent method (GD), we used a first-order approximation of $f(\vv x)$ near $\vv x^{(k)}$ to figure out which direction we should move in. The last example showed that when the gradient changes quickly (look at the direction of the gradients in the zig-zag plot; they are all over the place!), things can go wrong. This suggests that we should also account for how quickly the gradient changes: we need to compute the “gradient of the gradient,” a.k.a. the Hessian $\nabla^2 f(\vv x)$ of $f$. The Hessian of a function $f: \mathbb{R}^n \to \mathbb{R}$ is an $n \times n$ symmetric matrix with entries given by the second-order partial derivatives of $f$:

$$[\nabla^2 f(\vv x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(\vv x)$$

The Hessian tells us how quickly the gradient changes, in the same way $f''(x)$ tells us how quickly $f'(x)$ changes for a scalar function $f(x)$. The Hessian lets us make a second-order Taylor series approximation to our function $f(\vv x)$ near our current guess $\vv x^{(k)}$:

$$f(\vv x) \approx f(\vv x^{(k)}) + \nabla f(\vv x^{(k)})^T(\vv x- \vv x^{(k)}) + \frac{1}{2}(\vv x-\vv x^{(k)})^T \nabla^2 f(\vv x^{(k)})(\vv x- \vv x^{(k)}), \quad (\text{TS2})$$

which provides a local quadratic approximation to $f(\vv x)$ near $\vv x^{(k)}$:

(Figure: the second-order Taylor approximation (TS2) is a quadratic bowl that hugs $f$ near $\vv x^{(k)}$.)

As before, if we let $\Delta \vv x^{(k)} = \vv x^{(k+1)} - \vv x^{(k)}$, we can rewrite (TS2) as

$$f(\vv x^{(k+1)}) - f(\vv x^{(k)}) \approx \frac{1}{2}(\Delta \vv x^{(k)})^T \nabla^2 f(\vv x^{(k)}) \Delta \vv x^{(k)} + \nabla f(\vv x^{(k)})^T \Delta \vv x^{(k)}.$$

Since we want to make $f(\vv x^{(k+1)}) - f(\vv x^{(k)})$ as small (i.e., as negative) as possible, it makes sense to pick $\Delta \vv x^{(k)}$ to minimize the right-hand side of this approximation, which is another minimization problem!

We’ll focus on the setting where $\nabla^2 f(\vv x^{(k)})$ is positive definite: this corresponds to settings where our function is convex. In this case, the right-hand side above is a positive definite quadratic function of $\Delta \vv x^{(k)}$, which is minimized at:

$$\Delta \vv x^{(k)} = - \nabla^2 f(\vv x^{(k)})^{-1} \nabla f(\vv x^{(k)}).$$

Using this descent direction instead of $-\nabla f(\vv x^{(k)})$ yields Newton’s Method:

$$\vv x^{(k+1)} = \vv x^{(k)} - \nabla^2 f(\vv x^{(k)})^{-1} \nabla f(\vv x^{(k)}).$$

The idea behind Newton’s Method is to “unstretch” the stretched-out directions, so that to our algorithm, the level sets of a function are locally circles. If we look at our test example above, note that:

$$\nabla^2 f(\vv x) = \begin{bmatrix} 1 & \\ & b \end{bmatrix} \Rightarrow \nabla^2 f(\vv x)^{-1} = \begin{bmatrix} 1 & \\ & \frac{1}{b} \end{bmatrix}$$

so that for $\vv x^{(0)} = (b, 1)$, we have:

$$\begin{align*} \vv x^{(1)} &= \vv x^{(0)} - \nabla^2 f(\vv x^{(0)})^{-1} \nabla f(\vv x^{(0)}) \\ &= \begin{bmatrix} b \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & \\ & \frac{1}{b} \end{bmatrix} \begin{bmatrix} b \\ b \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \end{align*}$$

i.e., we converge in one step no matter what the choice of $b$ is in $f(x, y) = \frac{1}{2}(x^2 + by^2)$!
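The following sketch verifies this numerically; it computes a single Newton step by solving the linear system $\nabla^2 f \, \Delta \vv x = -\nabla f$ with np.linalg.solve rather than forming the inverse explicitly (the particular value of $b$ is just for illustration):

import numpy as np

b = 0.01                            # any 0 < b <= 1 gives the same result
H = np.diag([1.0, b])               # Hessian of f(x, y) = 0.5*(x^2 + b*y^2)
z = np.array([b, 1.0])              # starting point (b, 1)
g = H @ z                           # gradient at z: (b, b)
delta = np.linalg.solve(H, -g)      # Newton step: solve H @ delta = -g
print(z + delta)                    # [0. 0.] -- converged in one step

Using np.linalg.solve here mirrors the linear system discussed next: it is cheaper and more numerically stable than computing $\nabla^2 f(\vv x^{(k)})^{-1}$ explicitly.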

The cost of this fast convergence, though, is that at each update we need to solve a linear system of equations of the form

$$\nabla^2 f(\vv x^{(k)}) \Delta \vv x^{(k)} = -\nabla f(\vv x^{(k)}),$$

which may be expensive if $\vv x$ is a high-dimensional vector. It is for this reason that gradient descent based methods are the predominant methods used in machine learning, where oftentimes the dimensionality of $\vv x \in \mathbb{R}^n$ can be on the order of millions or billions.
