Material related to this page, as well as additional exercises, can be found in LLA Chapter 9.2. Reviewing CalcBLUE2 Chapter 5 on the chain rule is recommended.
over $x \in \mathbb{R}^n$, where we look for the $x \in \mathbb{R}^n$ that makes the value of the cost function $f : \mathbb{R}^n \to \mathbb{R}$ as small as possible. We saw that one way to find either a local or global minimum $x^*$ is gradient descent. Starting at an initial guess $x^{(0)}$, we iteratively update our guess via
where $\nabla f(x^{(k)}) \in \mathbb{R}^n$ is the gradient of $f$ evaluated at the current guess, and $s > 0$ is a step size chosen large enough to make progress towards $x^*$, but not so big as to overshoot.
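To make this concrete, here is a minimal NumPy sketch of the iteration, assuming the standard update $x^{(k+1)} = x^{(k)} - s\,\nabla f(x^{(k)})$; the quadratic cost in the usage example is an illustrative choice, not one from these notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, s=0.1, num_iters=100):
    """Iterate x^(k+1) = x^(k) - s * grad_f(x^(k)) starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - s * grad_f(x)
    return x

# Illustrative usage: f(x) = ||x - b||^2 has gradient 2(x - b), so GD should approach b.
b = np.array([1.0, -2.0, 3.0])
print(gradient_descent(lambda x: 2 * (x - b), x0=np.zeros(3)))  # close to [1, -2, 3]
```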
Today, we’ll focus our attention on optimization problems (1) for which the cost function takes the following special form
i.e., cost functions $f$ that decompose into a sum of $N$ “sub-costs” $f_i$. Problems with cost functions of the form (3) are particularly common in machine learning.
For example, a typical problem setup in machine learning is as follows (we saw an example of this when we studied least squares for data-fitting). We are given a set of training data $\{(z_i, y_i)\}$, $i = 1, \ldots, N$, comprised of “inputs” $z_i \in \mathbb{R}^p$ and “outputs” $y_i \in \mathbb{R}^p$. Our goal is to find a set of weights $x \in \mathbb{R}^n$ which parametrize a model such that $m(z_i; x) \approx y_i$ on our training data. A common way of doing this is to minimize a loss function of the form
where each term $\ell(m(z_i; x) - y_i)$ penalizes the difference between our model prediction $m(z_i; x)$ on input $z_i$ and the observed output $y_i$. In this setting, the loss function (4) takes the form (3), with $f_i = \frac{1}{N}\ell(m(z_i; x) - y_i)$ measuring the error between our prediction $\hat{y}_i = m(z_i; x)$ and the true output $y_i$.
A common choice for the “sub-loss” function is $\ell(e) = \|e\|^2$, leading to a least-squares regression problem, but note that most other choices of loss function are compatible with the following discussion.
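As a quick sketch of what a cost of the form (4) looks like in code, here is the squared-error case for a hypothetical linear model $m(z; x) = z^\top x$ with scalar outputs (both choices are illustrative assumptions, not fixed by the notes above):

```python
import numpy as np

def total_loss(x, Z, Y):
    """loss(x) = (1/N) * sum_i l(m(z_i; x) - y_i), with l(e) = e**2 and m(z; x) = z @ x."""
    N = len(Z)
    return sum((z_i @ x - y_i) ** 2 for z_i, y_i in zip(Z, Y)) / N
```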
Now suppose that we want to implement gradient descent (GD) on the loss function (4). Our first step is to compute the gradient $\nabla_x \operatorname{loss}((z_i, y_i); x)$. Because of the sum structure of (4), we have that:
i.e., the gradient of the loss function is the sum of the gradients of the “sub-losses” on each of the $i = 1, \ldots, N$ data points.
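Continuing the illustrative least-squares sketch from above, the full gradient can be assembled one data point at a time; the per-sample formula $\nabla_x \frac{1}{N}(z_i^\top x - y_i)^2 = \frac{2}{N}(z_i^\top x - y_i)\, z_i$ is exactly what the chain rule developed below will produce.

```python
import numpy as np

def total_gradient(x, Z, Y):
    """Sum of the per-sample gradients (2/N) * (z_i @ x - y_i) * z_i for the sketch above."""
    N = len(Z)
    return sum(2.0 * (z_i @ x - y_i) * z_i for z_i, y_i in zip(Z, Y)) / N
```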
Our task now is therefore to compute the gradient $\nabla_x \ell(m(z_i; x) - y_i)$. This requires the multivariate chain rule, as $f_i(x) = \ell(m(z_i; x) - y_i)$ is a composition of the functions $\ell(e)$, $e = w - y_i$, and $w = m(z_i; x)$.
If we define $g = g(f)$ and $f = f(x)$, then we can rewrite (6) as $\frac{dh}{dx} = \frac{dg}{df} \cdot \frac{df}{dx}$. This is a useful way of writing things, as we can “cancel” $df$ on the RHS to check that our formula is correct.
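For instance, with the illustrative choices $g(f) = \sin f$ and $f(x) = x^2$ (not examples from these notes), the rule gives
$$\frac{dh}{dx} = \frac{dg}{df} \cdot \frac{df}{dx} = \cos(f(x)) \cdot 2x = 2x \cos(x^2),$$
which matches differentiating $h(x) = \sin(x^2)$ directly.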
Generalizing slightly, suppose now that $f : \mathbb{R}^n \to \mathbb{R}$ maps a vector $x \in \mathbb{R}^n$ to $f(x) \in \mathbb{R}$. Then for $h(x) = g(f(x))$, we have:
which we see is a natural generalization of equation (6). It will be convenient for us later to define $\frac{df}{dx} = \nabla_x f(x)^T$ and $\frac{dh}{dx} = \nabla_x h(x)^T$. Again defining $g = g(f)$ and $f = f(x)$, we can rewrite (7) as $\frac{dh}{dx} = \frac{dg}{df} \cdot \frac{df}{dx}$, which looks exactly the same as before!
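As a quick illustrative example (again, not one from these notes), take $g(f) = f^2$ and $f(x) = a^\top x$ for a fixed $a \in \mathbb{R}^n$. Then
$$\frac{dh}{dx} = \frac{dg}{df} \cdot \frac{df}{dx} = 2 f(x) \cdot a^\top = 2 (a^\top x)\, a^\top, \qquad \text{i.e.,} \qquad \nabla_x h(x) = 2 (a^\top x)\, a,$$
which agrees with expanding $h(x) = (a^\top x)^2$ and differentiating directly.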
Now, let's apply these ideas to computing the gradient of $h(x) = \ell(m(z_i; x) - y_i)$, where we'll assume for now that $m(z_i; x), y_i \in \mathbb{R}$. Applying (7), we get
Note that (13) is defined by a matrix-matrix multiplication of an $m \times p$ and a $p \times n$ matrix, meaning $\frac{dh}{dx} \in \mathbb{R}^{m \times n}$. The claim is that the $(i,j)$th entry of $\frac{dh}{dx}$ is the rate of change of $h_i(x) = g_i(f(x))$ with respect to $x_j$. From (12) and (13), we have
$$\left(\frac{dh}{dx}\right)_{i,j} = \underbrace{\frac{dg_i}{df}}_{i\text{th row of } \frac{dg}{df}} \cdot \underbrace{\begin{bmatrix} \frac{\partial f_1}{\partial x_j} \\ \vdots \\ \frac{\partial f_p}{\partial x_j} \end{bmatrix}}_{j\text{th column of } \frac{df}{dx}} = \frac{\partial g_i}{\partial f_1} \cdot \frac{\partial f_1}{\partial x_j} + \cdots + \frac{\partial g_i}{\partial f_p} \cdot \frac{\partial f_p}{\partial x_j},$$
which is precisely the expression we were looking for. The “cancellation rule” tells us that each term in the sum computes the contribution to $\frac{\partial g_i}{\partial x_j}$ flowing through the corresponding “$f_k$” channel.
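Here is a small NumPy sanity check of this matrix chain rule against a finite-difference estimate; the particular $f$ and $g$ below are made-up smooth maps chosen only for illustration.

```python
import numpy as np

n, p, m = 4, 3, 2
A = np.random.randn(p, n)   # f(x) = tanh(A x),        f: R^n -> R^p
B = np.random.randn(m, p)   # g(u) = B (u^2 elementwise), g: R^p -> R^m

f = lambda x: np.tanh(A @ x)
g = lambda u: B @ (u ** 2)
h = lambda x: g(f(x))

x = np.random.randn(n)
J_f = (1 - np.tanh(A @ x) ** 2)[:, None] * A   # df/dx, a p x n matrix
J_g = B * (2 * f(x))                           # dg/df, an m x p matrix
J_chain = J_g @ J_f                            # dh/dx via the matrix chain rule

# Finite-difference estimate of dh/dx, column by column.
eps = 1e-6
J_fd = np.column_stack([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                        for e in np.eye(n)])
print(np.max(np.abs(J_chain - J_fd)))  # tiny: only finite-difference error remains
```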
We can apply this formula recursively to our function class (10) to obtain the formula:
which is a fully general matrix chain rule. We'll use (15) next to explore the central idea behind backpropagation, which has been a key technical enabler of contemporary deep learning.
where $X_i$ is an $\mathbb{R}^{p_i \times (n_i + 1)}$ matrix with entries given by $x_i \in \mathbb{R}^{p_i (n_i + 1)}$, and $\sigma$ is a pointwise nonlinearity $\sigma(x) = (\sigma(x_1), \ldots, \sigma(x_n))$ called an activation function (more on these next lecture).
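A minimal NumPy sketch of this layer map, assuming the composition $m(z; x) = m_L(\cdots m_1(z; X_1) \cdots; X_L)$ and using $\sigma = \tanh$ purely as a placeholder activation (the notes defer the choice of $\sigma$ to next lecture):

```python
import numpy as np

def layer(X_i, m_prev, sigma=np.tanh):
    """One layer m_i = sigma(X_i @ [m_{i-1}; 1]); the appended 1 carries the bias column of X_i."""
    return sigma(X_i @ np.append(m_prev, 1.0))

def forward(layers, z, sigma=np.tanh):
    """Compose the layers to evaluate m(z; x) = m_L( ... m_1(z; X_1) ... ; X_L)."""
    m = z
    for X_i in layers:
        m = layer(X_i, m, sigma)
    return m
```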
Applying our matrix chain rule to $\ell(m(x) - y_i)$ (we won't write $z_i$ to save space), we get the expression
Here, $\frac{\partial \ell}{\partial m}$ is a $p_L$-dimensional row vector, and $\frac{\partial m_i}{\partial m_{i-1}}$ is a $p_i \times p_{i-1}$ matrix.
In modern architectures, the layer dimensions $p_i$, also called layer widths, can be very large (on the order of hundreds of thousands or even millions), meaning the $\frac{\partial m_i}{\partial m_{i-1}}$ matrices are very, very large! Too large, in fact, to store in memory.
Fortunately, since $\frac{\partial \ell}{\partial m}$ is a row vector, we can build $\frac{\partial \ell}{\partial x}$ by sequentially computing inner products. For example, if $\frac{\partial m_L}{\partial m_{L-1}} = \begin{bmatrix} a_1 & \cdots & a_{p_{L-1}} \end{bmatrix}$,
meaning we only ever need to store $\frac{\partial \ell}{\partial m_L}$ and $a_i$ in memory at any given time, which is only $2 p_L$ numbers, as opposed to $p_L \times p_{L-1}$ numbers! Then once we've computed $\frac{\partial \ell}{\partial m_L} \frac{\partial m_L}{\partial m_{L-1}}$, which is now a $p_{L-1}$-dimensional row vector, we can continue our way down the chain.
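In code, the idea might look like the following sketch, where `column(i)` is a hypothetical callback that produces the $i$th column $a_i$ of $\frac{\partial m_L}{\partial m_{L-1}}$ on demand:

```python
import numpy as np

def row_times_jacobian(dl_dmL, column, num_cols):
    """Compute dl/dm_{L-1} = (dl/dm_L) @ (dm_L/dm_{L-1}) one inner product at a time.
    `column(i)` is a hypothetical callback returning the i-th column a_i of dm_L/dm_{L-1},
    so the full p_L x p_{L-1} Jacobian is never held in memory all at once."""
    return np.array([dl_dmL @ column(i) for i in range(num_cols)])
```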
What's left to do is compute the partial derivatives! Let's break down $\frac{\partial \ell}{\partial x}$ into partial derivatives with respect to a layer's parameters $x_i$. For layer $L$, we have:
Since $x_L$ appears in the last layer, it shows up right away in the first term above, which is the derivative of $m_L(m_{L-1}; x_L)$ with respect to $x_L$ (the 2nd argument). The second term,
which measures how $m_L$ changes with respect to changes in $m_{L-1}$ caused by changes in $x_L$, is zero because $m_{L-1}$ does not depend on $x_L$ at all! This is a key observation in the backpropagation algorithm!
Let's proceed to compute the derivative with respect to the parameter $x_{L-1}$:
where $\frac{\partial \ell}{\partial m_j}$ will have been computed at the layer above. This is another key piece of backpropagation!
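Putting these two observations together, here is a compact sketch of the resulting backward pass for the tanh layers sketched earlier; the squared-error loss and the $\tanh$ activation are illustrative assumptions, not requirements of the derivation.

```python
import numpy as np

def backprop(layers, z, y, sigma=np.tanh, dsigma=lambda w: 1 - np.tanh(w) ** 2):
    """Return [dl/dX_1, ..., dl/dX_L] for l = ||m(z; x) - y||^2 with m_i = sigma(X_i @ [m_{i-1}; 1])."""
    # Forward pass: store each layer's (bias-augmented) input and pre-activation w_i.
    inputs, preacts, m = [], [], z
    for X_i in layers:
        m_aug = np.append(m, 1.0)
        w = X_i @ m_aug
        inputs.append(m_aug)
        preacts.append(w)
        m = sigma(w)

    grads = [None] * len(layers)
    dl_dm = 2 * (m - y)                        # dl/dm_L for the squared-error loss
    for i in reversed(range(len(layers))):
        dl_dw = dl_dm * dsigma(preacts[i])     # row vector times diag(sigma'(w_i))
        grads[i] = np.outer(dl_dw, inputs[i])  # dl/dX_i, same shape as X_i
        dl_dm = (layers[i].T @ dl_dw)[:-1]     # propagate to m_{i-1}, dropping the bias entry
    return grads
```

Note how `dl_dm` is exactly the running row vector $\frac{\partial \ell}{\partial m_j}$ handed down from the layer above, and how each $X_j$ only ever receives gradient through its own layer, since the layers below it do not depend on $x_j$.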
The only thing left to compute is $\frac{\partial m_j}{\partial x_j}$ --- this is now just an exercise in calculus, so we'll not work it out fully by hand. Please refer to Backpropagation#Finding the derivative of the error for further information if you are interested.
We apply our chain rule (with $w = X_j \begin{bmatrix} O_{j-1} \\ 1 \end{bmatrix}$) to get
$$\frac{\partial m_j}{\partial x_j} = \frac{\partial}{\partial x_j} \sigma\left(X_j \begin{bmatrix} O_{j-1} \\ 1 \end{bmatrix}\right) = \frac{\partial \sigma}{\partial w} \cdot \frac{\partial w}{\partial x_j}.$$
Now, for $\sigma(w) = \begin{bmatrix} \sigma(w_1) \\ \vdots \\ \sigma(w_{p_j}) \end{bmatrix}$, we have $\frac{\partial \sigma}{\partial w} = \begin{bmatrix} \sigma'(w_1) & & \\ & \ddots & \\ & & \sigma'(w_{p_j}) \end{bmatrix}$. Next, we need to find $\frac{\partial w}{\partial x_j} = \frac{\partial}{\partial x_j}\left(X_j \begin{bmatrix} O_{j-1} \\ 1 \end{bmatrix}\right)$. This can be computed using multilinear algebra (tensors). We won't work it out, but note that it can be found efficiently.
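One practical consequence worth noting: because $\frac{\partial \sigma}{\partial w}$ is diagonal, multiplying a row vector by it is just an elementwise product, so the diagonal matrix never needs to be formed. A short NumPy check (again with $\tanh$ as the placeholder activation):

```python
import numpy as np

w = np.random.randn(5)
v = np.random.randn(5)                   # a row vector, e.g. dl/dm_j
full = v @ np.diag(1 - np.tanh(w) ** 2)  # explicit diagonal Jacobian of sigma
cheap = v * (1 - np.tanh(w) ** 2)        # elementwise product, no matrix formed
print(np.allclose(full, cheap))          # True
```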