
11.2 Computing the Principal Components

Dept. of Electrical and Systems Engineering
University of Pennsylvania


1 Reading

Material related to this page, as well as additional exercises, can be found in LAA 7.5 and ALA 8.8.

2 Learning Objectives

By the end of this page, you should know:

  • the definition of the principal components of a data set, with an example
  • how to compress the data via dimensionality reduction
  • how to compute the principal components using the Singular Value Decomposition

3 Principal Component Analysis

To make our notation a little bit cleaner, we'll assume that our observations $\vv x_j \in \mathbb{R}^p$ and observation matrix $X \in \mathbb{R}^{p \times N}$ (whose columns are the observations $\vv x_j$) have already been centered, so that $X = \hat{X}$ and $\vv m = \vv 0$.

The goal of PCA is to find a new orthogonal basis for $\mathbb{R}^p$ defined by the orthogonal $p \times p$ matrix $P = \begin{bmatrix} \vv u_1 & \ldots & \vv u_p \end{bmatrix}$, for $\vv u_1, \ldots, \vv u_p$ an orthonormal basis of $\mathbb{R}^p$:

$$\vv x = P\vv y = y_1\vv u_1 + y_2\vv u_2 + \cdots + y_p\vv u_p$$

with the property that the new coordinates $y_1, \ldots, y_p$ are uncorrelated (i.e., have covariance 0) and arranged in decreasing order of variance. The matrix $P$ is orthogonal, and hence $\vv y = P^{-1}\vv x = P^T \vv x$ provides a decomposition of the measurement $\vv x$ along the directions $\vv u_1, \ldots, \vv u_p$, where most of the variance in $\vv x$ can be found in the direction $\vv u_1$, the second most in the direction $\vv u_2$, and so on.

What does the covariance matrix of the new variables $\vv y$ look like? Note that if $\vv y_j = P^T \vv x_j$, then $Y = P^T X$ is the observation matrix in our new basis, so let[1]

$$S_y = \frac{1}{N} YY^T = \frac{1}{N} P^T X X^T P = P^T S_x P.$$

If the change of basis $\vv x = P\vv y$ is such that the $y_i$ are uncorrelated, then $S_y$, the covariance matrix of the observations $\vv y$, should be diagonal. Our goal is therefore to pick an orthonormal basis $\vv u_1, \ldots, \vv u_p$ so that for $P = \begin{bmatrix} \vv u_1 & \ldots & \vv u_p \end{bmatrix}$,

$$S_y = P^T S_x P$$

is diagonal. But we already know how to do this! $S_x$ is a symmetric matrix, and thus admits a spectral decomposition $S_x = Q \Lambda Q^T$, where $Q = \begin{bmatrix} \vv u_1 & \ldots & \vv u_p \end{bmatrix}$ is an orthogonal matrix composed of the orthonormal eigenvectors of $S_x$, and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)$ is a diagonal matrix of its eigenvalues. Hence, setting $P = Q$, we have

$$S_y = Q^T S_x Q = Q^T Q \Lambda Q^T Q = \Lambda,$$

which is diagonal, as we wanted!
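
To see this concretely, here is a minimal NumPy sketch (synthetic data and variable names of our own choosing, not part of the original notes): it builds a centered observation matrix $X$, forms $S_x$, and checks that the covariance of the new coordinates $\vv y = Q^T \vv x$ is indeed the diagonal matrix $\Lambda$.

```python
# Sketch: diagonalizing the covariance with the eigenvectors of S_x.
import numpy as np

rng = np.random.default_rng(0)
p, N = 3, 500

# Correlated synthetic data with observations as the columns of X (p x N), then centered.
X = rng.normal(size=(p, p)) @ rng.normal(size=(p, N))
X = X - X.mean(axis=1, keepdims=True)

S_x = X @ X.T / N                       # covariance matrix in the original coordinates

# Spectral decomposition S_x = Q Lambda Q^T (eigh returns eigenvalues in ascending order).
lam, Q = np.linalg.eigh(S_x)
lam, Q = lam[::-1], Q[:, ::-1]          # reorder so variances are decreasing

Y = Q.T @ X                             # new coordinates y_j = Q^T x_j
S_y = Y @ Y.T / N                       # covariance in the new basis

print(np.round(S_y, 6))                 # off-diagonal entries are numerically zero
print(np.allclose(S_y, np.diag(lam)))   # True: S_y = Lambda
```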

The first principal component $\vv u_1$ determines the new variable $y_1$ by projecting the original observation $\vv x$ along $\vv u_1$. In particular, since $\vv u_1$ is the first column of $P$, $\vv u_1^T$ is the first row of $P^T$, and hence

$$y_1 = \vv u_1^T \vv x,$$

which is a projection of $\vv x$ along the unit vector $\vv u_1$. In a similar fashion, $\vv u_2$ determines $y_2$, and so on. The direction $\vv u_1$ is aligned with the direction of maximal variance in the data, so we expect “most of” each observation $\vv x_j$ to be aligned with $\vv u_1$. We'll see how to use this observation to compress multivariate data via dimensionality reduction next, but first, let's take a look at a simple example:
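
The sketch below stands in for such an example (the numbers are made up purely for illustration): a small 2D data set whose points lie roughly along a line, its first principal component $\vv u_1$, and the new coordinate $y_1 = \vv u_1^T \vv x_j$ for each observation.

```python
# Sketch of a simple 2D example: most of the variance lies along one direction.
import numpy as np

# Observations as columns of X (p = 2 variables, N = 5 observations); values are illustrative.
X = np.array([[2.0, 1.0, 0.0, -1.0, -2.0],
              [1.9, 1.1, 0.1, -0.8, -2.3]])
X = X - X.mean(axis=1, keepdims=True)   # center the data

S_x = X @ X.T / X.shape[1]              # 2 x 2 covariance matrix

lam, Q = np.linalg.eigh(S_x)
lam, Q = lam[::-1], Q[:, ::-1]          # sort by decreasing variance
u1 = Q[:, 0]                            # first principal component

y1 = u1 @ X                             # y_1 = u_1^T x_j for every observation j
print("u_1 =", np.round(u1, 3))         # roughly (0.69, 0.72), up to sign: the diagonal direction
print("y_1 =", np.round(y1, 3))
print("fraction of total variance along u_1:", round(lam[0] / lam.sum(), 4))  # close to 1
```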

4 Compression via Dimensionality Reduction

PCA is very useful for applications in which most of the variation of the data lies in a low-dimensional subspace, i.e., can be explained by only a few of the new variables $y_1, \ldots, y_p$.

The total variance in the original data can be measured by summing together the variances of the original observation variables, i.e., by computing the sum of the diagonal entries of SxS_x:

$$\text{TVar}(X) = s_{11} + s_{22} + \cdots + s_{pp}.$$

Our first observation is that the total variance is preserved by the change of variables $\vv y = P^T \vv x$. We leave showing this as an exercise, but the intuition is that since the change of basis matrix $P$ is orthogonal, it only rotates/flips vectors, and does not affect lengths or angles. This means that

$$\text{TVar}(Y) = \lambda_1 + \cdots + \lambda_p = s_{11} + \cdots + s_{pp} = \text{TVar}(X),$$

where $\lambda_j$ is the variance in the $y_j$ coordinate. Thus the ratio $\frac{\lambda_j}{\text{TVar}(Y)}$ measures the fraction of the total variance “explained” by $y_j$.

This suggests that if we are interested in compressing the original data, a natural strategy is to discard the coordinates $y_j$ for which $\frac{\lambda_j}{\text{TVar}(Y)}$ is very small, as the corresponding directions $\vv u_j$ do not capture much of the variation in the data.
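
The following sketch shows one way this strategy might look in code (synthetic data, and the 95% threshold is an arbitrary illustrative choice): keep only enough principal components to explain most of the total variance, and reconstruct the data from those coordinates alone.

```python
# Sketch: compression by keeping the components that explain ~95% of the variance.
import numpy as np

rng = np.random.default_rng(1)
p, N = 10, 200

# Synthetic data that lies close to a 3-dimensional subspace of R^10.
B = rng.normal(size=(p, 3))
X = B @ rng.normal(size=(3, N)) + 0.05 * rng.normal(size=(p, N))
X = X - X.mean(axis=1, keepdims=True)

S_x = X @ X.T / N
lam, Q = np.linalg.eigh(S_x)
lam, Q = lam[::-1], Q[:, ::-1]

explained = lam / lam.sum()                            # fraction of total variance per component
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
print("explained variance ratios:", np.round(explained, 4))
print("components needed for 95%:", k)                 # at most 3 here: the data is essentially 3-dimensional

# Compress: keep only the first k new coordinates, then map back to R^p.
Y_k = Q[:, :k].T @ X                                   # k x N compressed representation
X_approx = Q[:, :k] @ Y_k                              # reconstruction from k components
print("relative reconstruction error:", np.linalg.norm(X - X_approx) / np.linalg.norm(X))
```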

5 Computing Principal Components and the SVD

The singular value decomposition is the main tool for performing PCA in practice. Suppose $X$ is our centered observation matrix, and define

$$A = \frac{\hat{X}^T}{\sqrt{N}} \in \mathbb{R}^{N \times p}.$$

We assume that $\text{rank}(A) = p$, i.e., our $p$-dimensional measurements $\vv x_1, \ldots, \vv x_N \in \mathbb{R}^p$ span $\mathbb{R}^p$.

Then $A^T A = S_x$ is the covariance matrix of our data. Now, let $A = U \Sigma V^T$ be a singular value decomposition for $A$: since $\text{rank}(A) = p$, $\Sigma, V \in \mathbb{R}^{p \times p}$ and $U \in \mathbb{R}^{N \times p}$. Then

$$S_x = A^T A = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T \quad (= Q \Lambda Q^T),$$

i.e., $V \in \mathbb{R}^{p \times p}$ orthogonally diagonalizes the symmetric matrix $S_x$, and defines a spectral factorization of $S_x$. This allows us to immediately conclude the following: the principal components of the data are the right singular vectors $\vv v_1, \ldots, \vv v_p$ of $A$ (the columns of $V$), and the variance along $\vv v_i$ is $\lambda_i = \sigma_i^2$, the square of the corresponding singular value.

In practice, computing an SVD of $A$ is both faster and more accurate than computing an eigendecomposition of $S_x$; this is particularly true when the observation vector dimension $p$ is large.
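
For concreteness, here is a small sketch of the SVD route on synthetic data (using NumPy's `np.linalg.svd` and `np.linalg.eigh`): it forms $A = \hat{X}^T / \sqrt{N}$, reads the principal components off the columns of $V$ and the variances off $\sigma_i^2$, and compares the result with the eigendecomposition of $S_x$.

```python
# Sketch: PCA via the SVD of A = X_hat^T / sqrt(N).
import numpy as np

rng = np.random.default_rng(2)
p, N = 4, 300
X = rng.normal(size=(p, p)) @ rng.normal(size=(p, N))
X = X - X.mean(axis=1, keepdims=True)      # centered observation matrix (p x N)

A = X.T / np.sqrt(N)                       # N x p
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                                   # columns of V are the principal components
variances = sigma**2                       # lambda_i = sigma_i^2

# Cross-check against the eigendecomposition of S_x.
S_x = X @ X.T / N
lam, Q = np.linalg.eigh(S_x)
lam, Q = lam[::-1], Q[:, ::-1]             # descending, to match the SVD ordering
print(np.allclose(variances, lam))                      # True
print(np.allclose(np.abs(V), np.abs(Q), atol=1e-6))     # True: same directions up to column signs
```

The accuracy advantage of the SVD comes from never forming $A^T A$ explicitly, which would square the condition number of $A$.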


Footnotes
  1. We add the subscripts $x$ and $y$ to $S_x$ and $S_y$ to highlight which observations are used.