
13.3 Backpropagation

Dept. of Electrical and Systems Engineering
University of Pennsylvania



1 Reading

Material related to this page, as well as additional exercises, can be found in LLA Chapter 9.2. Reviewing CalcBLUE 2 Chapter 5 on the chain rule is recommended.

2 Learning Objectives

By the end of this page, you should know:

  • how the least-squares data fitting problem can be solved using gradient descent
  • how to use the multivariate chain rule to compute gradients of loss functions
  • how gradients are computed efficiently using backpropagation for deep learning models

3 Motivation

Last class, we studied the unconstrained optimization problem

$$\text{minimize } f(\vv x) \tag{1}$$

over $\vv x \in \mathbb{R}^n$, where we look for the $\vv x \in \mathbb{R}^n$ that makes the value of the cost function $f: \mathbb{R}^n \to \mathbb{R}$ as small as possible. We saw that one way to find either a local or global minimum $\vv x^*$ is gradient descent. Starting at an initial guess $\vv x^{(0)}$, we iteratively update our guess via

$$\vv x^{(k+1)} = \vv x^{(k)} - s \nabla f(\vv x^{(k)}), \quad k = 0, 1, 2, \ldots \quad \text{(GD)}$$

where $\nabla f(\vv x^{(k)}) \in \mathbb{R}^n$ is the gradient of $f$ evaluated at the current guess, and $s > 0$ is a step size chosen large enough to make progress towards $\vv x^*$, but not so big as to overshoot.
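As a concrete reference point, here is a minimal NumPy sketch of the (GD) iteration. The quadratic cost `f` and its gradient `grad_f` are illustrative placeholders, not anything specified in these notes.

```python
import numpy as np

# Illustrative cost f(x) = 0.5 * ||x||^2 and its gradient grad f(x) = x.
def f(x):
    return 0.5 * np.sum(x**2)

def grad_f(x):
    return x

def gradient_descent(x0, step=0.1, iters=100):
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad_f(x)   # x^{(k+1)} = x^{(k)} - s * grad f(x^{(k)})
    return x

x_star = gradient_descent(np.array([3.0, -2.0]))
print(x_star)  # converges toward the minimizer [0, 0]
```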

Today, we’ll focus our attention on optimization problems (1) for which the cost function takes the following special form

$$f(\vv x) = \sum_{i=1}^N f_i(\vv x), \tag{3}$$

i.e., cost functions $f$ that decompose into a sum of $N$ “sub-costs” $f_i$. Problems with cost functions of the form (3) are particularly common in machine learning.

For example, a typical problem setup in machine learning is as follows (we saw an example of this when we studied least squares for data fitting). We are given a set of training data $\{(\vv z_i, \vv y_i)\}, i=1, \ldots, N$, consisting of “inputs” $\vv z_i \in \mathbb{R}^p$ and “outputs” $\vv y_i \in \mathbb{R}^p$. Our goal is to find a set of weights $\vv x \in \mathbb{R}^n$ that parametrize a model $m$ such that $m(\vv z_i; \vv x) \approx \vv y_i$ on our training data. A common way of doing this is to minimize a loss function of the form

$$\text{loss}((\vv z_i, \vv y_i); \vv x) = \frac{1}{N} \sum_{i=1}^N \ell(m(\vv z_i; \vv x) - \vv y_i), \tag{4}$$

where each term $\ell(m(\vv z_i; \vv x) - \vv y_i)$ penalizes the difference between our model prediction $m(\vv z_i; \vv x)$ on input $\vv z_i$ and the observed output $\vv y_i$. In this setting, the loss function (4) takes the form (3), with $f_i = \frac{1}{N} \ell(m(\vv z_i; \vv x) - \vv y_i)$ measuring the error between our prediction $\hat{\vv y}_i = m(\vv z_i; \vv x)$ and the true output $\vv y_i$.

A common choice for the “sub-loss” function is $\ell(\vv e) = \|\vv e\|^2$, leading to a least-squares regression problem, but note that most other choices of loss function are compatible with the following discussion.

Now suppose that we want to implement gradient descent (GD) on the loss function (4). Our first step is to compute the gradient $\nabla_{\vv x} \text{loss}((\vv z_i, \vv y_i); \vv x)$. Because of the sum structure of (4), we have that:

$$\nabla_{\vv x} \text{loss}((\vv z_i,\vv y_i);\vv x) = \frac{1}{N} \sum_{i=1}^N \nabla_{\vv x} \ell(m(\vv z_i; \vv x) - \vv y_i),$$

i.e., the gradient of the loss function is the average of the gradients of the “sub-losses” on each of the $i=1,\ldots,N$ data points.
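To make this concrete, here is a minimal NumPy sketch of the loss (4) and its gradient for the least-squares sub-loss $\ell(e) = e^2$, using an illustrative scalar-output linear model $m(\vv z; \vv x) = \vv z^\top \vv x$ and randomly generated data; none of these specific choices come from the notes.

```python
import numpy as np

# Sketch of (4) with l(e) = e^2 and an illustrative linear model m(z; x) = z @ x.
rng = np.random.default_rng(0)
N, n = 50, 3
Z = rng.normal(size=(N, n))              # rows are the inputs z_1, ..., z_N
y = rng.normal(size=N)                   # scalar outputs y_1, ..., y_N

def loss(x):
    e = Z @ x - y                        # residuals m(z_i; x) - y_i
    return np.mean(e**2)                 # (1/N) * sum_i l(e_i)

def grad_loss(x):
    e = Z @ x - y
    per_example = 2.0 * e[:, None] * Z   # row i is grad_x l(m(z_i; x) - y_i)
    return per_example.mean(axis=0)      # average of the per-example gradients

# Sanity check of one gradient coordinate against a central finite difference.
x0, eps = np.ones(n), 1e-6
e0 = np.zeros(n); e0[0] = eps
print(grad_loss(x0)[0], (loss(x0 + e0) - loss(x0 - e0)) / (2 * eps))
```

With `loss` and `grad_loss` in hand, the (GD) update from the earlier sketch can be applied directly.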

Our task now is therefore to compute the gradient $\nabla_{\vv x} \ell(m(\vv z_i; \vv x)- \vv y_i)$. This requires the multivariate chain rule, as $f_i(\vv x) = \ell(m(\vv z_i;\vv x)-\vv y_i)$ is a composition of the functions $\ell(\vv e)$, $\vv e = \vv w - \vv y_i$, and $\vv w = m(\vv z_i;\vv x)$.

4 The Multivariate Chain Rule (CalcBLUE 2 Ch. 5)

We begin with a reminder of the chain rule for scalar functions. Let $f:\mathbb{R}\to\mathbb{R}$ and $g:\mathbb{R}\to\mathbb{R}$ be differentiable functions. Then for $h(x) = g(f(x))$, we have that:

$$h'(x) = g'(f(x)) f'(x). \tag{6}$$

If we define $g = g(f)$ and $f = f(x)$, then we can rewrite (6) as $\frac{dh}{dx} = \frac{dg}{df}\cdot\frac{df}{dx}$. This is a useful way of writing things, as we can “cancel” $df$ on the RHS to check that our formula is correct.
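As a quick sanity check of (6), the snippet below compares $g'(f(x)) f'(x)$ against a central finite difference of $h = g \circ f$. The specific choices $f(x) = x^2$ and $g(u) = \sin(u)$ are just illustrative examples.

```python
import numpy as np

# Numerical check of the scalar chain rule h'(x) = g'(f(x)) * f'(x).
f  = lambda x: x**2
fp = lambda x: 2 * x
g  = lambda u: np.sin(u)
gp = lambda u: np.cos(u)

x0 = 1.3
chain = gp(f(x0)) * fp(x0)                          # g'(f(x)) * f'(x)
eps = 1e-6
fd = (g(f(x0 + eps)) - g(f(x0 - eps))) / (2 * eps)  # central difference of h = g o f
print(chain, fd)                                    # the two values should agree closely
```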

Generalizing slightly, suppose now that $f:\mathbb{R}^n\to\mathbb{R}$ maps a vector $\vv x\in\mathbb{R}^n$ to $f(\vv x)\in\mathbb{R}$. Then for $h(\vv x) = g(f(\vv x))$, we have:

$$\nabla_{\vv x} h(\vv x) = g'(f(\vv x)) \nabla f(\vv x), \tag{7}$$

which we see is a natural generalization of equation (6). It will be convenient for us later to define $\frac{df}{d \vv x} = \nabla_{\vv x} f(\vv x)^T$ and $\frac{dh}{d \vv x} = \nabla_{\vv x} h(\vv x)^T$. Again defining $g = g(f)$ and $f = f(\vv x)$, we can rewrite (7) as $\frac{dh}{d \vv x} = \frac{dg}{df} \cdot \frac{df}{d \vv x}$, which looks exactly the same as before!

Now, let’s apply these ideas to computing the gradient of $h(\vv x) = \ell(m(\vv z_i; \vv x)- y_i)$, where we’ll assume for now that $m(\vv z_i;\vv x), y_i \in \mathbb{R}$. Applying (7), we get

$$\nabla_{\vv x} h(\vv x) = \ell'(m(\vv z_i;\vv x)-y_i) \cdot \nabla_{\vv x} (m(\vv z_i;\vv x) - y_i) = \ell'(m(\vv z_i;\vv x)-y_i) \cdot \nabla_{\vv x} m(\vv z_i; \vv x)$$

where we use that $\nabla_{\vv x} y_i = 0$ (since it’s a constant). Without knowing more about the functions $\ell$ and $m$, this is all we can say.
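Here is a hedged numerical check of the formula above for one data point, using the illustrative choices $\ell(e) = e^2$ and $m(\vv z; \vv x) = \mathrm{sigmoid}(\vv z^\top \vv x)$; neither choice comes from the notes, they just make $\ell'$ and $\nabla_{\vv x} m$ easy to write down.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

z = np.array([0.5, -1.0, 2.0])   # one input z_i
y = 0.7                          # its observed output y_i
x = np.array([0.1, 0.4, -0.3])   # current weights

def h(x):
    return (sigmoid(z @ x) - y) ** 2          # l(m(z; x) - y)

# Gradient from the chain rule: l'(e) * grad_x m(z; x).
m = sigmoid(z @ x)
grad = 2.0 * (m - y) * (m * (1 - m)) * z      # sigmoid'(t) = sigmoid(t) * (1 - sigmoid(t))

# Finite-difference comparison, one coordinate at a time.
eps = 1e-6
fd = np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(grad, fd)
```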

Generalizing further, suppose now that both maps are vector-valued: $\vv f:\mathbb{R}^n\to\mathbb{R}^p$, $\vv g:\mathbb{R}^p\to\mathbb{R}^m$, and $\vv h(\vv x) = \vv g(\vv f(\vv x))$, with Jacobians $\frac{d \vv g}{d \vv f} \in \mathbb{R}^{m \times p}$ and $\frac{d \vv f}{d \vv x} \in \mathbb{R}^{p \times n}$ whose $(i,j)$ entries are $\frac{\partial g_i}{\partial f_j}$ and $\frac{\partial f_i}{\partial x_j}$. We’ll use our same intuition of “cancelling” to derive the expression:

$$\frac{d \vv h}{d \vv x} = \frac{d \vv g}{d \vv f} \cdot \frac{d \vv f}{d \vv x} \tag{13}$$

Note that (13) is defined by a matrix-matrix multiplication of an $m\times p$ and a $p\times n$ matrix, meaning $\frac{d \vv h}{d \vv x} \in \mathbb{R}^{m\times n}$. The claim is that the $(i, j)^{th}$ entry of $\frac{d \vv h}{d \vv x}$ is the rate of change of $h_i(\vv x) = g_i(\vv f(\vv x))$ with respect to $x_j$. From (12) and (13), we have

$$\left(\frac{d \vv h}{d \vv x}\right)_{i,j} = \underbrace{\frac{dg_i}{d \vv f}}_{i^{th} \text{ row of } \frac{d \vv g}{d \vv f}} \cdot \underbrace{\begin{bmatrix} \frac{\partial f_1}{\partial x_j} \\ \vdots \\ \frac{\partial f_p}{\partial x_j} \end{bmatrix}}_{j^{th} \text{ column of } \frac{d \vv f}{d \vv x}} = \frac{\partial g_i}{\partial f_1} \cdot \frac{\partial f_1}{\partial x_j} + \cdots + \frac{\partial g_i}{\partial f_p} \cdot \frac{\partial f_p}{\partial x_j},$$

which is precisely the expression we were looking for. The “cancellation rule” tells us that the $k^{th}$ term in the sum is the contribution to $\frac{\partial h_i}{\partial x_j}$ that flows through the “$f_k$” channel.

We can apply this formula recursively to our layered function class (10), i.e., models of the form $\vv m(\vv z; \vv x) = \vv m_L(\vv m_{L-1}(\cdots \vv m_1(\vv z; \vv x_1) \cdots; \vv x_{L-1}); \vv x_L)$, to obtain the formula:

$$\frac{d \vv m}{d \vv x} = \frac{d \vv m_L}{d \vv m_{L-1}} \cdot \frac{d \vv m_{L-1}}{d \vv m_{L-2}} \cdots \frac{d \vv m_2}{d \vv m_1} \cdot \frac{d \vv m_1}{d \vv x} \tag{15}$$

which is a fully general matrix chain rule. We’ll use (15) next to explore the central idea behind backpropagation, which has been a key technical enabler of contemporary deep learning.
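Below is a small numerical check of the matrix chain rule (13) for a composition $\vv h = \vv g \circ \vv f$ of two vector-valued maps. The particular maps, their hand-written Jacobians, and the finite-difference comparison are all illustrative.

```python
import numpy as np

def f(x):                                   # f: R^2 -> R^3
    return np.array([x[0]**2, x[0]*x[1], np.sin(x[1])])

def Jf(x):                                  # Jacobian df/dx, 3 x 2
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    np.cos(x[1])]])

def g(u):                                   # g: R^3 -> R^2
    return np.array([u[0] + u[1]*u[2], np.exp(u[0])])

def Jg(u):                                  # Jacobian dg/df, 2 x 3
    return np.array([[1.0, u[2], u[1]],
                     [np.exp(u[0]), 0.0, 0.0]])

x0 = np.array([0.7, -0.4])
J_chain = Jg(f(x0)) @ Jf(x0)                # 2 x 2 Jacobian via the chain rule

# Finite-difference Jacobian of h = g o f for comparison.
eps = 1e-6
J_fd = np.column_stack([(g(f(x0 + eps*e)) - g(f(x0 - eps*e))) / (2*eps)
                        for e in np.eye(2)])
print(np.max(np.abs(J_chain - J_fd)))       # should be tiny (1e-8 or smaller)
```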

5 Computing the Gradients

We are going to work out how to efficiently compute the gradient of

$$\ell(\vv m(\vv z_i;\vv x)-\vv y_i)$$

when $\vv m$ takes the form in (11). We’ll furthermore assume, as is often the case in deep learning, that each layer function $\vv m_i$ takes the following form:

$$\vv m_i(\vv O_{i-1}; \vv x_i) = \sigma\left(X_i \begin{bmatrix} \vv O_{i-1} \\ 1 \end{bmatrix}\right)$$

where $X_i \in \mathbb{R}^{p_i \times (n_{i}+1)}$ is a matrix with entries given by $\vv x_i \in \mathbb{R}^{p_i(n_{i}+1)}$, and $\sigma$ is a pointwise nonlinearity $\sigma(\vv x) = (\sigma(x_1),\ldots,\sigma(x_n))$ called an activation function (more on these next lecture).
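In code, one such layer is essentially a one-liner. The sketch below uses tanh as the activation $\sigma$ and random weights; both are illustrative choices, not something fixed by the notes.

```python
import numpy as np

def layer(O_prev, X):
    """One layer m_i(O_{i-1}; x_i) = sigma(X_i [O_{i-1}; 1]).

    O_prev: output of the previous layer (length n_i); X: p_i x (n_i + 1) weight matrix.
    """
    z = X @ np.append(O_prev, 1.0)   # append the constant 1 to carry the bias term
    return np.tanh(z)                 # pointwise nonlinearity sigma (tanh as an example)

# Example: a layer mapping R^3 -> R^4 with randomly chosen weights.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(4, 3 + 1))
print(layer(np.array([0.2, -1.0, 0.5]), X1))
```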

Applying our matrix chain rule to $\ell(\vv m(\vv x)-\vv y_i)$ (we won’t write $\vv z_i$ to save space), we get the expression

$$\frac{\partial \ell}{\partial \vv x} = \frac{\partial \ell}{\partial \vv m} \frac{\partial \vv m}{\partial \vv x} = \frac{\partial \ell}{\partial \vv m_L} \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \cdots \frac{\partial \vv m_2}{\partial \vv m_1} \frac{\partial \vv m_1}{\partial \vv x}.$$

Here, $\frac{\partial \ell}{\partial \vv m}$ is a $p_L$-dimensional row vector, and $\frac{\partial \vv m_i}{\partial \vv m_{i-1}}$ is a $p_i \times p_{i-1}$ matrix.

In modern architectures, the layer dimensions $p_i$, also called layer widths, can be very large (on the order of hundreds of thousands or even millions), meaning the $\frac{\partial \vv m_i}{\partial \vv m_{i-1}}$ matrices are very large, often too large to store in memory.

Fortunately, since $\frac{\partial \ell}{\partial \vv m}$ is a row vector, we can build $\frac{\partial \ell}{\partial \vv x}$ by sequentially computing inner products. For example, if $\frac{\partial \vv m_L}{\partial \vv m_{L-1}} = \begin{bmatrix} \vv a_1 & \cdots & \vv a_{p_{L-1}} \end{bmatrix}$ (written in terms of its columns $\vv a_j$), then

$$\begin{align*} \frac{\partial \ell}{\partial \vv m_L} \frac{\partial \vv m_L}{\partial \vv m_{L-1}} &= \underbrace{\frac{\partial \ell}{\partial \vv m_L}}_{1 \times p_L} \underbrace{\begin{bmatrix} \vv a_1 & \cdots & \vv a_{p_{L-1}} \end{bmatrix}}_{p_L \times p_{L-1}} \\ &= \begin{bmatrix} \frac{\partial \ell}{\partial \vv m_L} \vv a_1 & \cdots & \frac{\partial \ell}{\partial \vv m_L} \vv a_{p_{L-1}} \end{bmatrix}, \end{align*}$$

meaning we only ever need to store $\frac{\partial \ell}{\partial \vv m_L}$ and $\vv a_i$ in memory at any given time, which is only $2p_L$ numbers, as opposed to $p_L \times p_{L-1}$ numbers! Then once we’ve computed $\frac{\partial \ell}{\partial \vv m_L} \frac{\partial \vv m_L}{\partial \vv m_{L-1}}$, which is now a $p_{L-1}$-dimensional row vector, we can continue our way down the chain.
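The snippet below illustrates this left-to-right accumulation: each step multiplies the current row vector by one Jacobian, one column (inner product) at a time, so a large matrix-matrix product is never formed. The Jacobians here are small random stand-ins created explicitly just for illustration; in a real implementation they too would be applied on the fly rather than materialized.

```python
import numpy as np

# Toy illustration of accumulating (dl/dm_L)(dm_L/dm_{L-1})(dm_{L-1}/dm_{L-2})...
rng = np.random.default_rng(0)
dims = [5, 8, 6, 4]                       # p_L, p_{L-1}, p_{L-2}, p_{L-3}
v = rng.normal(size=dims[0])              # current row vector, starts as dl/dm_L

for p_out, p_in in zip(dims[:-1], dims[1:]):
    J = rng.normal(size=(p_out, p_in))    # stand-in for one layer Jacobian
    # One inner product per column: only v and one column a_j are needed at a time.
    v = np.array([v @ J[:, j] for j in range(p_in)])

print(v.shape)                            # (4,), i.e. dl/dm_{L-3} as a row vector
```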

What’s left to do is compute the partial derivatives! Let’s break down $\frac{\partial \ell}{\partial \vv x}$ into partial derivatives with respect to each layer’s parameters $\vv x_i$. For layer $L$, we have:

$$\frac{\partial \ell}{\partial \vv x_L} = \frac{\partial \ell}{\partial \vv m_L} \frac{\partial \vv m_L}{\partial \vv x_L} + \frac{\partial \ell}{\partial \vv m_L} \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \frac{\partial \vv m_{L-1}}{\partial \vv x_L} = \frac{\partial \ell}{\partial \vv m_L} \frac{\partial \vv m_L}{\partial \vv x_L}.$$

Since $\vv x_L$ appears in the last layer, it shows up right away in the first term above, which is the derivative of $\vv m_L(\vv m_{L-1};\vv x_L)$ with respect to $\vv x_L$ (its second argument). The second term,

$$\frac{\partial \ell}{\partial \vv m_L} \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \frac{\partial \vv m_{L-1}}{\partial \vv x_L} = 0,$$

which measures how $\vv m_L$ changes with respect to changes in $\vv m_{L-1}$ caused by changes in $\vv x_L$, is zero because $\vv m_{L-1}$ does not depend on $\vv x_L$ at all! This is a key observation in the backpropagation algorithm!

Let’s proceed to compute the derivative with respect to the parameter $\vv x_{L-1}$:

$$\begin{align*} \frac{\partial \ell}{\partial \vv x_{L-1}} &= \frac{\partial \ell}{\partial \vv m_L} \cdot \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \cdot \left( \frac{\partial \vv m_{L-1}}{\partial \vv x_{L-1}} + \frac{\partial \vv m_{L-1}}{\partial \vv m_{L-2}} \cdot \frac{\partial \vv m_{L-2}}{\partial \vv x_{L-1}} \right) \quad \text{where}\ \frac{\partial \vv m_{L-2}}{\partial \vv x_{L-1}} = 0 \\ &= \frac{\partial \ell}{\partial \vv m_L} \cdot \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \cdot \frac{\partial \vv m_{L-1}}{\partial \vv x_{L-1}} \end{align*}$$

We see again that we can “stop” once we hit the layer that depends explicitly on $\vv x_{L-1}$. Formally, we have:

$$\begin{align*} \frac{\partial \ell}{\partial \vv x_L} &= \frac{\partial \ell}{\partial \vv m_L} \cdot \frac{\partial \vv m_L}{\partial \vv x_L} \\ \frac{\partial \ell}{\partial \vv x_{L-1}} &= \frac{\partial \ell}{\partial \vv m_L} \cdot \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \cdot \frac{\partial \vv m_{L-1}}{\partial \vv x_{L-1}} \quad \left( \frac{\partial \ell}{\partial \vv m_{L-1}} = \frac{\partial \ell}{\partial \vv m_L} \cdot \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \right) \\ \frac{\partial \ell}{\partial \vv x_{L-2}} &= \frac{\partial \ell}{\partial \vv m_L} \cdot \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \cdot \frac{\partial \vv m_{L-1}}{\partial \vv m_{L-2}} \cdot \frac{\partial \vv m_{L-2}}{\partial \vv x_{L-2}} \quad \left( \frac{\partial \ell}{\partial \vv m_{L-2}} = \frac{\partial \ell}{\partial \vv m_{L-1}} \cdot \frac{\partial \vv m_{L-1}}{\partial \vv m_{L-2}} \right) \\ & \vdots \\ \frac{\partial \ell}{\partial \vv x_j} &= \frac{\partial \ell}{\partial \vv m_L} \cdot \frac{\partial \vv m_L}{\partial \vv m_{L-1}} \cdot \frac{\partial \vv m_{L-1}}{\partial \vv m_{L-2}} \cdots \frac{\partial \vv m_{j+1}}{\partial \vv m_j} \cdot \frac{\partial \vv m_j}{\partial \vv x_j} \quad \left( \frac{\partial \ell}{\partial \vv m_j} = \frac{\partial \ell}{\partial \vv m_{j+1}} \cdot \frac{\partial \vv m_{j+1}}{\partial \vv m_j} \right) \end{align*}$$

Notice that there is a lot of reuse of expressions, which means we don’t have to recompute things over and over. In particular

$$\frac{\partial \ell}{\partial \vv m_{L-1}} = \frac{\partial \ell}{\partial \vv m_L} \cdot \frac{\partial \vv m_L}{\partial \vv m_{L-1}}, \quad \frac{\partial \ell}{\partial \vv m_{L-2}} = \frac{\partial \ell}{\partial \vv m_{L-1}} \cdot \frac{\partial \vv m_{L-1}}{\partial \vv m_{L-2}},$$

and in general

$$\frac{\partial \ell}{\partial \vv m_{j-1}} = \frac{\partial \ell}{\partial \vv m_j} \cdot \frac{\partial \vv m_j}{\partial \vv m_{j-1}},$$

where $\frac{\partial \ell}{\partial \vv m_j}$ will have been computed at the layer above. This is another key piece of backpropagation!
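Putting the two key pieces together (stop at the layer that owns $\vv x_j$, and reuse $\frac{\partial \ell}{\partial \vv m_j}$ from the layer above) gives a backpropagation loop. Below is a minimal sketch for the layered model with $\sigma = \tanh$ and the least-squares sub-loss $\ell(\vv e) = \|\vv e\|^2$; the layer sizes, data, and finite-difference check are all illustrative choices, not something fixed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                      # input dim, two hidden widths, output dim
Xs = [rng.normal(size=(sizes[i+1], sizes[i] + 1)) * 0.5 for i in range(len(sizes) - 1)]

def forward(z, Xs):
    """Return the outputs O_0 = z, O_1, ..., O_L of every layer."""
    Os = [z]
    for X in Xs:
        Os.append(np.tanh(X @ np.append(Os[-1], 1.0)))
    return Os

def backprop(z, y, Xs):
    """Return the loss and the gradients dl/dX_j for every layer."""
    Os = forward(z, Xs)
    e = Os[-1] - y
    loss = e @ e                                  # l(e) = ||e||^2
    delta = 2.0 * e                               # dl/dm_L as a row vector
    grads = [None] * len(Xs)
    for j in reversed(range(len(Xs))):
        d_w = delta * (1.0 - Os[j+1]**2)          # chain through sigma' (tanh' = 1 - tanh^2)
        a = np.append(Os[j], 1.0)                 # layer input with the bias 1 appended
        grads[j] = np.outer(d_w, a)               # dl/dX_j, same shape as X_j
        delta = d_w @ Xs[j][:, :-1]               # dl/dm_{j-1}: reuse of dl/dm_j (backprop!)
    return loss, grads

# Finite-difference check of one weight entry (sanity test of the sketch).
z, y = rng.normal(size=sizes[0]), rng.normal(size=sizes[-1])
loss0, grads = backprop(z, y, Xs)
eps = 1e-6
Xp = [X.copy() for X in Xs]; Xp[1][0, 0] += eps
Xm = [X.copy() for X in Xs]; Xm[1][0, 0] -= eps
fd = (backprop(z, y, Xp)[0] - backprop(z, y, Xm)[0]) / (2 * eps)
print(grads[1][0, 0], fd)                         # should agree to several decimal places
```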

The only thing left to compute is $\frac{\partial \vv m_j}{\partial \vv x_j}$. This is now just an exercise in calculus, so we won’t work it out by hand; please refer to Backpropagation#Finding the derivative of the error for further information if you are interested.

6 Optional

We apply our chain rule $\left( \text{with } \vv w = X_j \begin{bmatrix} \vv O_{j-1} \\ 1 \end{bmatrix}\right)$ to get

$$\frac{\partial \vv m_j}{\partial \vv x_j} = \frac{\partial}{\partial \vv x_j} \sigma\left(X_j \begin{bmatrix} \vv O_{j-1} \\ 1 \end{bmatrix}\right) = \frac{\partial \vv \sigma}{\partial \vv w} \cdot \frac{\partial \vv w}{\partial \vv x_j}.$$

Now, for $\vv \sigma(\vv w) = \begin{bmatrix} \sigma(w_1) \\ \vdots \\ \sigma(w_{p_j}) \end{bmatrix}$, we have $\frac{\partial \vv \sigma}{\partial \vv w} = \begin{bmatrix} \sigma'(w_1) & & \\ & \ddots & \\ & & \sigma'(w_{p_j}) \end{bmatrix}$. Next, we need to find $\frac{\partial \vv w}{\partial \vv x_j} = \frac{\partial}{\partial \vv x_j} \left(X_j \begin{bmatrix} \vv O_{j-1} \\ 1 \end{bmatrix} \right)$. This can be computed using multilinear algebra (tensors). We won’t work it out here, but note that it can be found efficiently.
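As a hedged illustration of why the tensor $\frac{\partial \vv w}{\partial \vv x_j}$ is rarely formed explicitly in practice: once the upstream row vector $\frac{\partial \ell}{\partial \vv m_j}$ is available, the gradient of $\ell$ with respect to the weight matrix $X_j$ collapses to an outer product. The choices below (tanh for $\sigma$, random stand-ins for the upstream vector and the layer input) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
p_j, n_j = 4, 3
X = rng.normal(size=(p_j, n_j + 1))        # layer weight matrix X_j
O_prev = rng.normal(size=n_j)              # previous layer output O_{j-1}
delta = rng.normal(size=p_j)               # stand-in for the upstream row vector dl/dm_j

a = np.append(O_prev, 1.0)                 # [O_{j-1}; 1]
w = X @ a
grad_X = np.outer(delta * (1.0 - np.tanh(w)**2), a)   # dl/dX_j = (delta . sigma'(w)) a^T

# Finite-difference check on one entry, treating l(X) = delta @ tanh(X a).
def l_of(Xmat):
    return delta @ np.tanh(Xmat @ a)

eps = 1e-6
Xp, Xm = X.copy(), X.copy()
Xp[2, 1] += eps
Xm[2, 1] -= eps
print(grad_X[2, 1], (l_of(Xp) - l_of(Xm)) / (2 * eps))
```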
