13.4 Stochastic Gradient Descent

1Learning Objectives¶

By the end of this page, you should know:

why naive backpropagation is computationally inefficient
the Stochastic Gradient Descent Algorithm
what is Perceptron and how it is used for linear classification
single and multilayer perceptrons, and the importance of activiation functions

2Naive Backpropagation¶

In the previous lecture, motivated by machine learning applications, we discussed how to apply gradient descent to solve the following optimization problem:

\text{minimize } \text{loss}((\vv z_i,\vv y_i); \vv x) = \text{minimize } \sum_{i=1}^N\ell(\vv m(\vv z_i;\vv x)-\vv y_i)

(1)

over the parameters $\vv x$ of a model $\vv m$ so that $\vv m(\vv z_i;\vv x) \approx \vv y_i$ for all training data input/output pairs $(\vv z_i,\vv y_i)$ , $i=1,\ldots,N$ .

The bulk of our effort was spent on understanding how to compute the gradient of $\ell(\vv m(\vv z_i;\vv x)-\vv y_i)$ with respect to the model parameters $\vv x$ , with a particular focus on models $\vv m$ that can be written as the following composition of models:

\begin{align*} \vv O_0 &= \vv z_i \\ \vv O_1 &= \vv m_1(\vv O_0;\vv x_1), \quad \vv O_1 \in \mathbb{R}^{p_1}, \vv O_0 \in \mathbb{R}^{p_0} \\ \vv O_2 &= \vv m_2(\vv O_1;\vv x_2), \quad \vv O_2 \in \mathbb{R}^{p_2}, \vv O_1 \in \mathbb{R}^{p_1} \\ &\vdots \\ \vv O_L &= \vv m_L(\vv O_{L-1};\vv x_L), \quad \vv O_L \in \mathbb{R}^{p_L}, \vv O_{L-1} \in \mathbb{R}^{p_{L-1}} \end{align*}

(2)

which is the structure of contemporary models used in machine learning called deep neural networks (we’ll talk about these much more in this lecture). We made two key observations about the model structure that allowed us to effectively apply the matrix chain rule to compute gradients $\frac{\partial \ell}{\partial \vv x_1}, \ldots, \frac{\partial \ell}{\partial \vv x_L}$ :

$\frac{\partial \ell}{\partial \vv x_j}$ only needs to compute partial derivatives of $\ell$ and layers $j, j+1, \ldots, L$ ;
If we start from layer $L$ $\left(\frac{\partial \ell}{\partial \vv x_L}\right)$ and work our way backwards we can (i) reuse previously computed partial derivatives, and (ii) save space on memory by exploiting that $\frac{\partial \ell}{\partial \vv m_L}$ is a row-vector.

The resulting algorithm is called backpropagation, and is a key enabling technology in modern machine learning. You will learn more about this in ESE 5460.

Now, despite all of this cleverness, when the model parameter vectors $\vv x_1, \ldots, \vv x_L$ and layer outputs $\vv O_1, \ldots, \vv O_L$ are very high dimensional (it is not uncommon for each $\vv x_i$ to have 100s of thousands or even millions of components) computing the gradient $\nabla_{\vv x} \ell(\vv m(\vv z_i;\vv x)-\vv y_i)$ of a single term in the sum (1) can be quite costly. Add to that the fact that the number of data points $N$ is often very large (order of millions in many settings), and we quickly run into some serious computational bottlenecks. And remember, this all just so we can run a single iteration of gradient descent. This may seem hopeless, but luckily, there is a very simple trick that lets us work around this problem: stochastic gradient descent.

3Stochastic Gradient Descent (SGD) Algorithm¶

Stochastic Gradient Descent (SGD) is the work horse algorithm of modern machine learning and has been rediscovered by various communities over the past 70 years, although it is usually credited to Robbins and Monro for a paper they wrote in 1951.

Key Idea

Since our loss function can be written as a sum over examples, i.e.

\text{loss}((\vv z_i,\vv y_i),\vv x) = \frac{1}{N} \sum_{i=1}^N \ell(\vv m(\vv z_i;\vv x)-\vv y_i) \quad (\text{loss}(\vv x) = \sum_i \ell_i(\vv x))

(3)

then the gradient is also a sum: $\nabla_{\vv x} \text{loss} = \frac{1}{N} \sum_{i=1}^N \nabla_{\vv x} \ell_i$ . Therefore we expect each individual gradient $\nabla_{\vv x} \ell_i$ to have some useful information in it. SGD minimizes (3) by following the gradient of a single randomly selected example (or a small batch of $B$ randomly selected samples).

The SGD algorithm can be summarized as follows: Start with an initial guess $\vv x^{(0)}$ , and at each iteration $k=0,1,2,\ldots$ , do:

(i) Select an index $i \in \{1,\ldots,N\}$ at random (ii) Update

\vv x^{(k+1)} = \vv x^{(k)} - s^{(k)} \nabla_{\vv x} \ell_i(\vv x^{(k)}) \quad (\text{SGD})

(4)

Using the gradient of only the $i^{th}$ loss term $\ell_i(\vv x) = \ell(\vv m(\vv z_i;\vv x)-\vv y_i)$ .

As before, $s^{(k)} > 0$ is a step-size that can change as a function of the current iterate.

This method works shockingly well in practice and is computationally tractable as at each iteration, the gradient of only the $i^{th}$ loss term needs to be computed. Modern versions of this algorithm replace step (i) with a mini-batch, i.e., by selecting $B$ indices at random, and step (ii) replaces $\nabla_{\vv x} \ell_i(\vv x^{(k)})$ with the average gradient:

\frac{1}{B} \sum_{b=1}^B \nabla_{\vv x} \ell_b(\vv x)

(5)

The overall idea behind why SGD works (take ESE 6050 if you want to see a rigorous proof) is that while each individual update (SGD) may not be an accurate gradient for the overall loss function loss $(\vv x)$ , we are still following $\nabla_{\vv x} \text{loss}(\vv x)$ “on average”. This also explains why you may want to use a mini-batch $B$ to compute a better gradient estimate (5), as having more loss terms leads to a better approximation of the true gradient. The tradeoff here is that as $B$ becomes larger, computing (5) is more computationally demanding.

4Linear Classification and the Perceptron¶

An important problem in machine learning is that of binary classification. In one of the case studies, you saw how to use least squares to solve this problem. Here, we offer an alternate perspective that will lead us to one important historical reason for the emergence of deep neural networks.

The problem set up for linear binary classification is as follows. We are given a set of $N$ vectors $\vv z_1,\ldots,\vv z_N \in \mathbb{R}^n$ with associated binary labels $y_1,\ldots,y_N \in \{-1,+1\}$ . The objective in linear classification is to find an affine function $\vv x^T\vv z+v$ , defined by unknown parameters $\vv x \in \mathbb{R}^n$ and $v \in \mathbb{R}$ , that strictly separates the two classes. We can pose this as finding a feasible solution to the following linear inequalities:

\begin{cases} \vv x^T\vv z_i + v > 0 & \text{if } y_i = +1 \\ \vv x^T\vv z_i + v < 0 & \text{if } y_i = -1 \end{cases}

(6)

The geometry of this problem is illustrated below.

There are three key components:

The separating hyperplane $H = \{\vv z \in \mathbb{R}^n : \vv x^T\vv z+v = 0\}$ . This is the set of vectors $\vv z \in \mathbb{R}^n$ that lie on the subspace $H$ , which is the solution set to the linear equation

\vv x^T\vv z = -v.

(7)

The coefficient matrix here is $\vv x^T \in \mathbb{R}^{1\times n}$ , and so rank Col $(\vv x^T) = 1$ . This tells us that $\dim \text{Null}(\vv x^T) = \dim H = n-1$ . In $\mathbb{R}^2$ , this is the equation of a line:

\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + v = x_1z_1 + x_2z_2 + v = 0 \Rightarrow z_2 = -\frac{x_1}{x_2}z_1 - \frac{v}{x_2}

(8)

In $\mathbb{R}^3$ , this is the equation of a plane with normal vector $\vv x$ going through point $v$ and in $\mathbb{R}^n$ is called a hyperplane. A key feature of a hyperplane is that it splits $\mathbb{R}^n$ into two half-spaces, i.e., the subsets of $\mathbb{R}^n$ on either side.

The half-space $H^+ = \{\vv z \in \mathbb{R}^n : \vv x^T\vv z+v > 0\}$ , which is the “half” of $\mathbb{R}^n$ for which $\vv x^T\vv z+v > 0$ . We want all of our positive $\textcolor{blue}{(+)}$ examples to live here.
The half-space $H^- = \{\vv z \in \mathbb{R}^n : \vv x^T\vv z+v < 0\}$ , which is the “half” of $\mathbb{R}^n$ for which $\vv x^T\vv z+v < 0$ . We want all $\textcolor{red}{(-)}$ examples to live here.

The problem of finding the parameters $(\vv x,v)$ defining the classifier in (6) can be solved using linear programming, a kind of optimization algorithm that you’ll learn about in ESE 3040 and ESE 6050. It can also be solved using SGD as applied to a special loss function called the hinge loss:

\text{loss}((\vv z_i,y_i);(\vv x,v)) = \frac{1}{N} \sum_{i=1}^N \max\{1-y_i(\vv x^T\vv z+v), 0\},

(9)

which is a commonly used loss function for classification (you’ll learn why in your Machine Learning classes).

The reason we are taking this little digression is that applying SGD to the hinge loss gives us The Perceptron Algorithm:

Initialize initial guess $(\vv x^{(0)},\vv v^{(0)})$
For each iteration $k=0,1,2,\ldots$ , do:
- Draw a random index $i \in \{1,\ldots,N\}$
- If $y_i\left(\left(\vv x^{(k)}\right)^T \vv z_i+v^{(k)}\right) < 1$ : Update
$(\vv x^{(k+1)},v^{(k+1)}) = (\vv x^{(k)},v^{(k)}) + y_i\begin{bmatrix} \vv z_i \\ 1 \end{bmatrix}$
(10)
Else, if $y_i(\vv x^{(k)})^T\vv z_i+v^{(k)}) \geq 1$ , do not update $(\vv x^{(k+1)},v^{(k+1)}) = (\vv x^{(k)},v^{(k)})$ .

This algorithm goes through the examples $(\vv z_i,y_i)$ one at a time, and updates the classifier only when it makes a mistake (10). The intuition is it “nudges” the classifier to be “less” wrong by $\|\vv z_i\|^2+1$ on any example $(\vv z_i,y_i)$ it currently misclassifies.

This incremental update works, and you can show that if there exists a solution to (6), the perceptron algorithm will find it. People got REALLY EXCITED ABOUT THIS! See below for a NYT article about the perceptron algorithm, which in hindsight seems a little silly given that we now know it’s just SGD applied to a particular loss function. But then again, so is most of today’s AI!

5Single and Multi Layer Perceptrons¶

Given the excitement about the Perceptron, why do we not use them anymore? It turns out, it is very easy to stump! Consider the following set of positive and negative examples:

These define an XOR function: the positive examples are in quadrants where $\text{sign}(z_1) \neq \text{sign}(z_2)$ and the negative ones are in quadrants for which $\text{sign}(z_1) = \text{sign}(z_2)$ . These can’t be separated by a linear classifier!

But suppose we were allowed to have two classifiers, and then combine them using a nonlinearity?

In the image above, we define a pink classifier that returns $f_1(\vv z) = z_1$ , a purple classifier that returns $f_2(\vv z) = -z_2$ . If we define our output to be

f(\vv z) = f_1(\vv z) f_2(\vv z) = -z_1 z_2,

(11)

then we see that this ‘works’:

Table 1:Nonlinear classifier

	1	2	3	4
Sign $f_1(\vv z)$	+1	-1	-1	+1
Sign $f_2(\vv z)$	-1	-1	+1	+1
Sign $f(\vv z)$	-1	+1	-1	+1

This ‘worked’! The two key ingredients here are:

Having intermediate computation, called hidden layers
Allowing for some nonlinearity.

These two ingredients are combined to define the Multilayer Perceptron (MLP).

Definition 1 (Multilayer Perceptron (MLP))

A single hidden layer MLP is defined by the equations:

\begin{align*} \vv h &= \sigma(W_1 \vv z + \vv b_1) \\ \vv o &= W_2 \vv h + \vv b_2 \end{align*}

(12)

The key features of (12) are:

An element-wise nonlinearity σ, called an activation function.
The input is $\vv z \in \mathbb{R}^n$ .
The hidden layer is defined by a weight matrix $W_1 \in \mathbb{R}^{p_1 \times n}$ and a bias vector $\vv b_1 \in \mathbb{R}^{p_1}$
The output layer is defined by a weight matrix $W_2 \in \mathbb{R}^{p \times p_1}$ and bias vector $\vv b_2 \in \mathbb{R}^{p}$
The overall map maps input $\vv z \in \mathbb{R}^n$ to output $\vv o \in \mathbb{R}^p$ .

In practice, (12) is trained to find $W_1, W_2, \vv b_1, \vv b_2$ using SGD and backpropagation.

Some common activation functions include:

The Sigmoid function:

The Rectified Linear Unit (ReLU):

Which activation function to use is a bit of an art, but there are generally accepted tricks of the trade that you’ll learn about in ESE 5460. There are also many more than these two, with new ones being invented.

6Deep MLPs¶

There is nothing preventing us from adding more hidden layers. The $L$ -hidden-layer MLP is defined as:

\begin{align*} \vv h_1 &= \sigma(W_1 \vv x + \vv b_1) \\ \vv h_2 &= \sigma(W_2 \vv h_1 + \vv b_2) \\ &\vdots \\ \vv h_L &= \sigma(W_L \vv h_{L-1} + \vv b_L) \\ \vv o &= W_{L+1} \vv h_L + \vv b_{L+1} \end{align*}

(14)

Shown below is an example with 3 hidden layers.

The important thing to notice is these functions are compatible with our discussion on backpropagation, meaning computing gradients with respect to the parameters $W_1,\ldots,W_{L+1}, \vv b_1,\ldots,\vv b_{L+1}$ can be done efficiently!

6.1Python Break: Autodifferentiation in PyTorch!¶

In this example, we’ll show how to take advantage of autodifferentiation in the PyTorch machine learning library in order to efficiently train machine learning models in code.

For our first example, we will take a look at logistic regression. In logistic regression, we have the following:

A dataset $D$ of pairs of the form $(\vv x, y)$ , where $\vv x \in \mathbb{R}^n$ is a vector with real-valued entries, and $y \in \{0, 1\}$ is a boolean-valued label.
A class of models $f_{\vv a, b}(\vv x)$ of “logistic functions”, where $a\in \mathbb{R}^n$ , $b\in \mathbb{R}$ , and $f_{\vv a, b}(\vv x) = \frac{1}{1 + \exp(-\vv a^\top \vv x + b)}$ . A few things to note; the range of any logistic function is $(0, 1)$ , i.e., these logistic functions only output values between 0 and 1; this is good when we want to learn boolean-valued labelings of our data. Also, try to convince yourself that for any value $v \in (0, 1)$ , the set of all $\vv x$ such that $f_{\vv a, b}(\vv x) = v$ is the solution set to an inhomogenous linear system. Furthermore, convince yourself that the set of all $\vv x$ such that $f_{\vv a, b}(\vv x) = 0.5$ is the solution to a homogeneous linear system; we will call this set the decision boundary.
A binary cross-entropy loss (or log-loss) function $l$ which measures the loss of our model $f_{\vv a, b}(\vv x)$ on the dataset $D = \{(\vv x_i, y)\}_{i = 1, \dots, M}$ , defined by:

\text{Loss}(f_{\vv a, b}(\vv x), D) = -\sum_{i = 1}^{M}{ y_i \cdot \log f_{\vv a, b}(\vv x) + (1 - y_i) \cdot \log (1 - f_{\vv a, b}(\vv x))}

(15)

To give some high-level intuition, the binary cross-entropy loss is smaller when, on average, the values of $f_{\vv a, b}(\vv x_i)$ are close to $y_i$ . This has a very nice interpretation in terms of probabilities, but for now we will focus on the optimization aspects of logistic regression.

In many machine learning scenarios, the objective is to choose the model $f_{\vv a, b}(\vv x)$ with the lowest loss on the dataset $D$ (this is known as empirical risk minimization). In our scenario, we can formulate this as an optimization problem:

\begin{align*} &\text{Minimize (over $\vv a, b$): $\text{Loss}(f_{\vv a, b}(\vv x), D)$} \end{align*}

(16)

In the case of logistic regression with log-loss, this is actually a convex loss which means that gradient descent will find the global minimum (if done correctly)! This is exactly what the code snippet below is doing; note that we labeled each section of the code; we will explain what each section does in a bit.

# We need to install the required software packages for deep learning: this will take about 10min
%pip install torch torchvision

Requirement already satisfied: torch in /opt/anaconda3/envs/books/lib/python3.12/site-packages (2.4.0)
Requirement already satisfied: torchvision in /opt/anaconda3/envs/books/lib/python3.12/site-packages (0.19.0)
Requirement already satisfied: filelock in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torch) (3.15.4)
Requirement already satisfied: typing-extensions>=4.8.0 in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torch) (4.11.0)
Requirement already satisfied: sympy in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torch) (1.13.0)
Requirement already satisfied: networkx in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torch) (3.3)
Requirement already satisfied: jinja2 in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torch) (3.1.4)
Requirement already satisfied: fsspec in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torch) (2024.6.1)
Requirement already satisfied: setuptools in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torch) (70.0.0)
Requirement already satisfied: numpy in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torchvision) (1.26.4)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from torchvision) (10.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from jinja2->torch) (2.1.5)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/anaconda3/envs/books/lib/python3.12/site-packages (from sympy->torch) (1.3.0)
Note: you may need to restart the kernel to use updated packages.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
torch.manual_seed(541)

# (A) Generate a random dataset
X = torch.rand(1000, 2)
Y = torch.round(0.25 * X[:, 0] + 0.45 * X[:, 1] + 0.3 * torch.rand(1000))

# (B) Define a logistic regression model using a single linear layer
class LogisticRegressionModel(nn.Module):
    def __init__(self):
        super(LogisticRegressionModel, self).__init__()
        self.a = nn.Parameter(torch.zeros(2))  
        self.b = nn.Parameter(torch.tensor(0.0))  
    
    def forward(self, x):
        return torch.sigmoid(torch.matmul(x, self.a.T) + self.b)

# (C) Instantiate the model
model = LogisticRegressionModel()

# (D) Define the loss function and optimizer
criterion = nn.BCELoss() 
optimizer = optim.SGD(model.parameters(), lr=0.01)

# (E) Train the model
num_epochs = 10000
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, Y)
    
    # Backward pass and optimization
    optimizer.zero_grad() 
    loss.backward()      
    optimizer.step()     
    
    # Print loss every 1000 epochs
    if (epoch+1) % 1000 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# (F) Testing the model
with torch.no_grad():
    predicted = model(X)
    predicted_labels = predicted.round()
    accuracy = (predicted_labels.eq(Y).sum().item()) / Y.size(0)
    print(f'Accuracy: {accuracy * 100:.2f}%')

# (G) Plotting the dataset, labels, and decision boundary
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=Y.squeeze(), cmap='coolwarm', marker='o', edgecolor='k', label='Data points')
x1_min, x1_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
x2_min, x2_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 100), np.linspace(x2_min, x2_max, 100))
grid_tensor = torch.tensor(np.c_[xx1.ravel(), xx2.ravel()], dtype=torch.float32)
with torch.no_grad():
    probs = model(grid_tensor).reshape(xx1.shape)
plt.contourf(xx1, xx2, probs, levels=[0, 0.5, 1], alpha=0.2, colors=['blue', 'red'])
plt.colorbar()
plt.title('Logistic Regression Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(loc='upper right')
plt.show()

Epoch [1000/10000], Loss: 0.6212
Epoch [2000/10000], Loss: 0.5708
Epoch [3000/10000], Loss: 0.5320
Epoch [4000/10000], Loss: 0.5018
Epoch [5000/10000], Loss: 0.4778
Epoch [6000/10000], Loss: 0.4583
Epoch [7000/10000], Loss: 0.4424
Epoch [8000/10000], Loss: 0.4291
Epoch [9000/10000], Loss: 0.4180
Epoch [10000/10000], Loss: 0.4085
Accuracy: 83.70%

Here, we’ll explain what each labeled section does and connect these functions to ideas we have already learned:

(A) This is just generating a random dataset.
(B) Here, we define a model class. Logistic models, which we have denoted by $f_{\vv a, \vv b}(\vv x)$ , are parameterized by two weights: $\vv a$ and $b$ . We indicate that this are weights that the optimizer should optimize over by declaring them as nn.Parameter. (The torch.tensor class is an analog to np.array.) The forward function evaluates our function on a batch of data, i.e., evaluates its output on a bunch of points at once; this is done for computational efficiency reasons.
(C) Here, we are generating an instance of our model with initial weights set to zero.
(D) Here, we indicate that we want to use the gradient descent algorithm to fit $\vv a$ and $\vv b$ such as to minimizer the binary cross entropy loss on our dataset. One small note: while the name of the optimizer is SGD, the way we use it (by running the model on the entire dataset and the computing the gradient with respect to the weights) is actually more akin to gradient descent, which uses the entire dataset.
(E) Here, we actually train the model using SGD. The lines optimizer.zero_grad() and loss.backward() calculate the gradient of the loss function with respect to the weights. This is done completely via autodifferentiation; during the forward pass of the model (where we call outputs = model(X), which uses the forward function we defined), PyTorch automatically calculates the gradient of the model outputs with respect to the weights! In the line loss = criterion(outputs, Y), PyTorch then finds the gradient of the loss with respect to the model output! Then, the line optimizer.step() updates the weights of the model (using the multivariate chain rule) with the learning rate of 0.01 specified earlier. We repeat this 10000 times, and see that the loss approximately decreases with each iteration.
(F) & (G) Here, we test the model and visualize the learned decision boundary compared to the true labelings of the dataset. Looks like we did pretty good! One note: you might notice the with torch.no_grad() line; usually, PyTorch computes gradients of the model outputs with respect to the inputs, but even with the optimized autodifferentiation package, this can be a costly procedure. Luckily, we only need to do this during the training phase! During test time, when we aren’t actively updating the weights of our model, we can tell PyTorch to not compute gradients using the with torch.no_grad() specifier.

As you can see, we have already learned many of these ideas! We just translated them to code.

For our second example, we’ll see how we can extend these ideas to training multilayer perceptrons (also known as neural networks). Just as in the example above, a neural network is just a parameterized function $f_{\vv w}(\vv x)$ which takes in a scalar/vector/matrix/tensor as input and produces a scalar/vector/matrix/tensor as output! Below, we have a Python snippet where we train a special type of multilayer perceptron known as a convolutional neural network to classify handwritten digits:

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np

# (A) Data Preparation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)

testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
testloader = DataLoader(testset, batch_size=64, shuffle=False)

# (B) Define the CNN Model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = torch.relu(torch.max_pool2d(self.conv1(x), 2))
        x = torch.relu(torch.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 320)
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return torch.log_softmax(x, dim=1)

# (C) Initialize the model, loss function, and optimizer
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# (D) Training the Model
num_epochs = 5

for epoch in range(num_epochs):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(trainloader, 0):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    print(f'Epoch {epoch + 1}: Loss {running_loss / (i + 1):.3f}')

print('Finished Training')

# (E) Testing the Model and Collecting Predictions
correct = 0
total = 0
all_images = []
all_labels = []
all_preds = []

with torch.no_grad():
    for inputs, labels in testloader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        
        # Collect images and predictions for visualization
        all_images.append(inputs)
        all_labels.append(labels)
        all_preds.append(predicted)

print(f'Accuracy on the 10,000 test images: {100 * correct / total:.2f}%')

# (F) Visualize a Few Images with Predictions

# Convert the list of tensors to one large tensor
all_images = torch.cat(all_images)
all_labels = torch.cat(all_labels)
all_preds = torch.cat(all_preds)

# Define a helper function to unnormalize and plot images
def imshow(img, label, pred, ax):
    img = img / 2 + 0.5  # Unnormalize
    npimg = img.numpy()
    ax.imshow(npimg, cmap='gray')
    ax.set_title(f'True: {label}, Pred: {pred}')
    ax.axis('off')

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(8, 8))
axes = axes.flatten()

# Plot some correctly classified images
correct_indices = (all_labels == all_preds).nonzero(as_tuple=True)[0]
incorrect_indices = (all_labels != all_preds).nonzero(as_tuple=True)[0]

# Plot 3 correct and 3 incorrect classifications
for i, ax in enumerate(axes[:3]):
    idx = correct_indices[i].item()
    imshow(all_images[idx][0], all_labels[idx].item(), all_preds[idx].item(), ax)

for i, ax in enumerate(axes[3:6]):
    idx = incorrect_indices[i].item()
    imshow(all_images[idx][0], all_labels[idx].item(), all_preds[idx].item(), ax)

plt.tight_layout()
plt.show()

Epoch 1: Loss 0.423
Epoch 2: Loss 0.147
Epoch 3: Loss 0.112
Epoch 4: Loss 0.095
Epoch 5: Loss 0.085
Finished Training
Accuracy on the 10,000 test images: 97.59%

Here, we see that our model achieves 97% accuracy on previously unseen data, which is pretty good! Above, we have a few examples of images that the model correctly and incorrectly classified.

To conclude, let’s appreciate how easy PyTorch and autodifferentiation makes it to train deep neural networks. All we need to do is specify the free parameters of our model, and PyTorch automatically computes gradients!