To make our notation a little bit cleaner, we'll assume that our observations $x_j \in \mathbb{R}^p$ and observation matrix $X = [\,x_1 \ \cdots \ x_N\,] \in \mathbb{R}^{p \times N}$ (whose columns are the observations) have already been centered, so that $X = \hat{X}$ and $m = 0$.
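As a quick illustration of this convention, here is a minimal NumPy sketch of the centering step; the data, sizes, and variable names are made up for illustration and are not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 3, 100                           # illustrative sizes, not from the text
X_raw = rng.normal(size=(p, N))         # columns are the observations x_1, ..., x_N
m = X_raw.mean(axis=1, keepdims=True)   # sample mean of the observations
X = X_raw - m                           # centered observation matrix, so X = X_hat and m = 0
print(np.allclose(X.mean(axis=1), 0))   # True: the centered data has zero mean
```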
The goal of PCA is to find a new orthogonal basis for $\mathbb{R}^p$, defined by the orthogonal $p \times p$ matrix $P = [\,u_1 \ \cdots \ u_p\,]$ for $u_1, \dots, u_p$ an orthonormal basis of $\mathbb{R}^p$, via the change of variables
$$x = Py = y_1 u_1 + \cdots + y_p u_p,$$
with the property that the new coordinates $y_1, \dots, y_p$ are uncorrelated (i.e., have covariance $0$) and are arranged in decreasing order of variance. The matrix $P$ is orthogonal, and hence $y = P^{-1}x = P^T x$ provides a decomposition of the measurement $x$ along the directions $u_1, \dots, u_p$, where most of the variance in $x$ is found in the direction $u_1$, the second most in the direction $u_2$, and so on.
What does the covariance matrix of the new variables $y$ look like? Note that if $y_j = P^T x_j$, then $Y = P^T X$ is the observation matrix in our new basis, so let[1]
$$S_y = \frac{1}{N-1} Y Y^T = \frac{1}{N-1} P^T X X^T P = P^T S_x P.$$
If the change of basis $x = Py$ is such that the $y_i$ are uncorrelated, then $S_y$, the covariance matrix of the observations $y$, should be diagonal. Our goal is therefore to pick an orthonormal basis $u_1, \dots, u_p$ so that, for $P = [\,u_1 \ \cdots \ u_p\,]$,
$$S_y = P^T S_x P$$
is diagonal. But we already know how to do this! $S_x$ is a symmetric matrix, and thus admits a spectral decomposition $S_x = Q \Lambda Q^T$, where $Q = [\,u_1 \ \cdots \ u_p\,]$ is an orthogonal matrix composed of the orthonormal eigenvectors of $S_x$, and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$ is a diagonal matrix of its eigenvalues, ordered so that $\lambda_1 \geq \cdots \geq \lambda_p$. Hence, setting $P = Q$, we have
$$S_y = Q^T S_x Q = Q^T (Q \Lambda Q^T) Q = \Lambda.$$
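To make this concrete, here is a minimal NumPy sketch of the diagonalization step, assuming (as above) that the centered observations are the columns of `X` and that the sample covariance uses the $1/(N-1)$ normalization; the toy data and variable names are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 3, 500
X = rng.normal(size=(p, N)) * np.array([[3.0], [1.0], [0.2]])  # toy data, columns = observations
X -= X.mean(axis=1, keepdims=True)                             # center

S_x = X @ X.T / (N - 1)                 # p x p sample covariance matrix

# Spectral decomposition S_x = Q Lambda Q^T.  np.linalg.eigh returns eigenvalues
# in ascending order, so reverse to get lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(S_x)
order = np.argsort(eigvals)[::-1]
Lam, P = eigvals[order], eigvecs[:, order]

S_y = P.T @ S_x @ P                     # covariance of the new variables y = P^T x
print(np.allclose(S_y, np.diag(Lam)))   # True: S_y is diagonal, so the y_i are uncorrelated
```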
The first principal component $u_1$ determines the new variable $y_1$ by projecting the original observation $x$ along $u_1$. In particular, since $u_1$ is the first column of $P$, $u_1^T$ is the first row of $P^T$, and hence
$$y_1 = u_1^T x = u_1 \cdot x,$$
which is the projection of $x$ along the unit vector $u_1$. In a similar fashion, $u_2$ determines $y_2$, and so on. The direction $u_1$ is aligned with the direction of maximal variance in the data, so we expect "most of" each observation $x_j$ to be aligned with $u_1$.
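For instance, in the same hypothetical setup as the sketch above, each new coordinate is just a dot product of an observation with the corresponding unit vector:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
X -= X.mean(axis=1, keepdims=True)        # toy centered data, columns = observations
S_x = X @ X.T / (X.shape[1] - 1)
lam, Q = np.linalg.eigh(S_x)
P = Q[:, np.argsort(lam)[::-1]]           # columns u_1, ..., u_p by decreasing variance

x = X[:, 0]                               # a single observation
y = P.T @ x                               # its coordinates in the new basis
print(np.allclose(y[0], P[:, 0] @ x))     # True: y_1 is just the dot product u_1 . x
```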
We’ll see how to use this observation to compress multivariate data via dimensionality reduction next, but first, let’s take a look at a simple example:
PCA is very useful for applications in which most of the variation of the data lies in a low-dimensional subspace, i.e., can be explained by only a few of the new variables $y_1, \dots, y_p$.
The total variance in the original data can be measured by summing together the variances of the original observation variables, i.e., by computing the sum of the diagonal entries of $S_x$:
$$\operatorname{TVar}(X) = \operatorname{tr}(S_x) = \sum_{j=1}^{p} (S_x)_{jj}.$$
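As a small sanity check in code (again with toy data and illustrative names), the total variance is just the trace of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 300))
X -= X.mean(axis=1, keepdims=True)          # toy centered data, columns = observations
S_x = X @ X.T / (X.shape[1] - 1)

total_var = np.trace(S_x)                   # sum of the diagonal entries of S_x
print(np.isclose(total_var, X.var(axis=1, ddof=1).sum()))  # True: equals the sum of the variances
```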
Our first observation is that total variance is preserved by the change of variables $y = P^T x$. We leave showing this as an exercise, but the intuition is that since the change of basis matrix $P$ is orthogonal, it only rotates/flips vectors, and does not affect lengths or angles. This means that
$$\operatorname{TVar}(Y) = \operatorname{tr}(S_y) = \lambda_1 + \cdots + \lambda_p = \operatorname{TVar}(X),$$
where $\lambda_j$ is the variance in the $y_j$ coordinate. Thus the ratio $\lambda_j / \operatorname{TVar}(Y)$ measures the fraction of the total variance "explained" by $y_j$.
This suggests that if we are interested in compressing the original data, a strategy could be to discard the directions $y_j$ for which $\lambda_j / \operatorname{TVar}(Y)$ is very small, as these directions do not capture much of the variation in the data.
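The sketch below illustrates this strategy on toy data; the 95% threshold and all variable names are made-up choices for illustration, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data whose variance is concentrated in the first two directions.
X = rng.normal(size=(5, 400)) * np.array([[5.0], [2.0], [0.1], [0.05], [0.01]])
X -= X.mean(axis=1, keepdims=True)
S_x = X @ X.T / (X.shape[1] - 1)

lam, Q = np.linalg.eigh(S_x)
order = np.argsort(lam)[::-1]
lam, P = lam[order], Q[:, order]

explained = lam / lam.sum()                                # lambda_j / TVar(Y) for each y_j
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1   # smallest k explaining ~95% of the variance

Y_k = P[:, :k].T @ X                                       # keep only the first k new coordinates
print(explained.round(3), k, Y_k.shape)
```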
Define $A = \frac{1}{\sqrt{N-1}} X^T \in \mathbb{R}^{N \times p}$, and assume that $\operatorname{rank}(A) = p$, i.e., that our $p$-dimensional measurements $x_1, \dots, x_N \in \mathbb{R}^p$ span $\mathbb{R}^p$.
Then $A^T A = S_x$ is the covariance matrix of our data. Now, let $A = U \Sigma V^T$ be a singular value decomposition of $A$: since $\operatorname{rank}(A) = p$, $\Sigma, V \in \mathbb{R}^{p \times p}$ and $U \in \mathbb{R}^{N \times p}$. Then
$$A^T A = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T = S_x,$$
i.e., $V \in \mathbb{R}^{p \times p}$ orthogonally diagonalizes the symmetric matrix $S_x$ and defines a spectral factorization of $S_x$, with eigenvalues given by the squared singular values of $A$. This allows us to immediately conclude the following.
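Here is a minimal NumPy sketch of this route (toy data again; `full_matrices=False` requests the reduced SVD with $U \in \mathbb{R}^{N \times p}$), checking that the right singular vectors and squared singular values of $A$ recover the spectral factorization of $S_x$:

```python
import numpy as np

rng = np.random.default_rng(4)
p, N = 4, 300
X = rng.normal(size=(p, N)) * np.array([[4.0], [2.0], [0.5], [0.1]])
X -= X.mean(axis=1, keepdims=True)                 # centered observations as columns

S_x = X @ X.T / (N - 1)
A = X.T / np.sqrt(N - 1)                           # N x p matrix with A^T A = S_x

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD: A = U Sigma V^T
lam = np.sort(np.linalg.eigvalsh(S_x))[::-1]       # eigenvalues of S_x, in decreasing order

print(np.allclose(s**2, lam))                      # True: squared singular values are the eigenvalues
print(np.allclose(Vt.T @ np.diag(s**2) @ Vt, S_x)) # True: V Sigma^2 V^T = S_x
```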
In practice, computing an SVD of $A$ is both faster and more accurate than computing an eigendecomposition of $S_x$; this is particularly true when the dimension $p$ of the observation vectors is large.