To make our notation a little bit cleaner, we'll assume that our observations $x_j \in \mathbb{R}^p$ and observation matrix $X = [\,x_1 \ \cdots \ x_N\,] \in \mathbb{R}^{p \times N}$ (whose columns are the observations) have already been centered, so that $X = \hat{X}$ and $m = 0$.
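As a quick illustration of this convention, here is a minimal NumPy sketch of the centering step; the data, sizes, and variable names are made up for illustration and are not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 3, 100                           # illustrative sizes, not from the text
X_raw = rng.normal(size=(p, N))         # columns are the observations x_1, ..., x_N
m = X_raw.mean(axis=1, keepdims=True)   # sample mean of the observations
X = X_raw - m                           # centered observation matrix, so X = X_hat and m = 0
print(np.allclose(X.mean(axis=1), 0))   # True: the centered data has zero mean
```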
The goal of PCA is to find a new orthogonal basis for $\mathbb{R}^p$, defined by the orthogonal $p \times p$ matrix $P = [\,u_1 \ \cdots \ u_p\,]$ for $u_1, \dots, u_p$ an orthonormal basis of $\mathbb{R}^p$, via the change of variables
$$x = Py = y_1 u_1 + \cdots + y_p u_p,$$
with the property that the new coordinates $y_1, \dots, y_p$ are uncorrelated (i.e., have covariance $0$) and are arranged in decreasing order of variance. The matrix $P$ is orthogonal, and hence $y = P^{-1}x = P^T x$ provides a decomposition of the measurement $x$ along the directions $u_1, \dots, u_p$, where most of the variance in $x$ is found in the direction $u_1$, the second most in the direction $u_2$, and so on.
What does the covariance matrix of the new variables $y$ look like? Note that if $y_j = P^T x_j$, then $Y = P^T X$ is the observation matrix in our new basis, so let[1]
$$S_y = \frac{1}{N-1} Y Y^T = \frac{1}{N-1} P^T X X^T P = P^T S_x P.$$
If the change of basis $x = Py$ is such that the $y_i$ are uncorrelated, then $S_y$, the covariance matrix of the observations $y$, should be diagonal. Our goal is therefore to pick an orthonormal basis $u_1, \dots, u_p$ so that, for $P = [\,u_1 \ \cdots \ u_p\,]$,
$$S_y = P^T S_x P$$
is diagonal. But we already know how to do this! $S_x$ is a symmetric matrix, and thus admits a spectral decomposition $S_x = Q \Lambda Q^T$, where $Q = [\,u_1 \ \cdots \ u_p\,]$ is an orthogonal matrix composed of the orthonormal eigenvectors of $S_x$, and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$ is a diagonal matrix of its eigenvalues, ordered so that $\lambda_1 \geq \cdots \geq \lambda_p$. Hence, setting $P = Q$, we have
$$S_y = Q^T S_x Q = Q^T (Q \Lambda Q^T) Q = \Lambda.$$
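To make this concrete, here is a minimal NumPy sketch of the diagonalization step, assuming (as above) that the centered observations are the columns of `X` and that the sample covariance uses the $1/(N-1)$ normalization; the toy data and variable names are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 3, 500
X = rng.normal(size=(p, N)) * np.array([[3.0], [1.0], [0.2]])  # toy data, columns = observations
X -= X.mean(axis=1, keepdims=True)                             # center

S_x = X @ X.T / (N - 1)                 # p x p sample covariance matrix

# Spectral decomposition S_x = Q Lambda Q^T.  np.linalg.eigh returns eigenvalues
# in ascending order, so reverse to get lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(S_x)
order = np.argsort(eigvals)[::-1]
Lam, P = eigvals[order], eigvecs[:, order]

S_y = P.T @ S_x @ P                     # covariance of the new variables y = P^T x
print(np.allclose(S_y, np.diag(Lam)))   # True: S_y is diagonal, so the y_i are uncorrelated
```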
The first principal component $u_1$ determines the new variable $y_1$ by projecting the original observation $x$ along $u_1$. In particular, since $u_1$ is the first column of $P$, $u_1^T$ is the first row of $P^T$, and hence
$$y_1 = u_1^T x = u_1 \cdot x,$$
which is the projection of $x$ along the unit vector $u_1$. In a similar fashion, $u_2$ determines $y_2$, and so on. The direction $u_1$ is aligned with the direction of maximal variance in the data, so we expect "most of" each observation $x_j$ to be aligned with $u_1$.
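For instance, in the same hypothetical setup as the sketch above, each new coordinate is just a dot product of an observation with the corresponding unit vector:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
X -= X.mean(axis=1, keepdims=True)        # toy centered data, columns = observations
S_x = X @ X.T / (X.shape[1] - 1)
lam, Q = np.linalg.eigh(S_x)
P = Q[:, np.argsort(lam)[::-1]]           # columns u_1, ..., u_p by decreasing variance

x = X[:, 0]                               # a single observation
y = P.T @ x                               # its coordinates in the new basis
print(np.allclose(y[0], P[:, 0] @ x))     # True: y_1 is just the dot product u_1 . x
```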
We’ll see how to use this observation to compress multivariate data via dimensionality reduction next, but first, let’s take a look at a simple example:
PCA is very useful for applications in which most of the variation of the data lies in a low-dimensional subspace, i.e., can be explained by only a few of the new variables $y_1, \dots, y_p$.
The total variance in the original data can be measured by summing together the variances of the original observation variables, i.e., by computing the sum of the diagonal entries of $S_x$:
$$\operatorname{TVar}(X) = \operatorname{tr}(S_x) = \sum_{j=1}^{p} (S_x)_{jj}.$$
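As a small sanity check in code (again with toy data and illustrative names), the total variance is just the trace of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 300))
X -= X.mean(axis=1, keepdims=True)          # toy centered data, columns = observations
S_x = X @ X.T / (X.shape[1] - 1)

total_var = np.trace(S_x)                   # sum of the diagonal entries of S_x
print(np.isclose(total_var, X.var(axis=1, ddof=1).sum()))  # True: equals the sum of the variances
```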
Our first observation is that total variance is preserved by the change of variables $y = P^T x$. We leave showing this as an exercise, but the intuition is that since the change of basis matrix $P$ is orthogonal, it only rotates/flips vectors, and does not affect lengths or angles. This means that
$$\operatorname{TVar}(Y) = \operatorname{tr}(S_y) = \lambda_1 + \cdots + \lambda_p = \operatorname{TVar}(X),$$
where $\lambda_j$ is the variance in the $y_j$ coordinate. Thus the ratio $\lambda_j / \operatorname{TVar}(Y)$ measures the fraction of the total variance "explained" by $y_j$.
This suggests that if we are interested in compressing the original data, a strategy could be to discard the directions $y_j$ for which $\lambda_j / \operatorname{TVar}(Y)$ is very small, as these directions do not capture much of the variation in the data.
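The sketch below illustrates this strategy on toy data; the 95% threshold and all variable names are made-up choices for illustration, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data whose variance is concentrated in the first two directions.
X = rng.normal(size=(5, 400)) * np.array([[5.0], [2.0], [0.1], [0.05], [0.01]])
X -= X.mean(axis=1, keepdims=True)
S_x = X @ X.T / (X.shape[1] - 1)

lam, Q = np.linalg.eigh(S_x)
order = np.argsort(lam)[::-1]
lam, P = lam[order], Q[:, order]

explained = lam / lam.sum()                                # lambda_j / TVar(Y) for each y_j
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1   # smallest k explaining ~95% of the variance

Y_k = P[:, :k].T @ X                                       # keep only the first k new coordinates
print(explained.round(3), k, Y_k.shape)
```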
Define $A = \frac{1}{\sqrt{N-1}} X^T \in \mathbb{R}^{N \times p}$, and assume that $\operatorname{rank}(A) = p$, i.e., that our $p$-dimensional measurements $x_1, \dots, x_N \in \mathbb{R}^p$ span $\mathbb{R}^p$.
Then $A^T A = S_x$ is the covariance matrix of our data. Now, let $A = U \Sigma V^T$ be a singular value decomposition of $A$: since $\operatorname{rank}(A) = p$, $\Sigma, V \in \mathbb{R}^{p \times p}$ and $U \in \mathbb{R}^{N \times p}$. Then
$$A^T A = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T = S_x,$$
i.e., $V \in \mathbb{R}^{p \times p}$ orthogonally diagonalizes the symmetric matrix $S_x$ and defines a spectral factorization of $S_x$, with eigenvalues given by the squared singular values of $A$. This allows us to immediately conclude the following.
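Here is a minimal NumPy sketch of this route (toy data again; `full_matrices=False` requests the reduced SVD with $U \in \mathbb{R}^{N \times p}$), checking that the right singular vectors and squared singular values of $A$ recover the spectral factorization of $S_x$:

```python
import numpy as np

rng = np.random.default_rng(4)
p, N = 4, 300
X = rng.normal(size=(p, N)) * np.array([[4.0], [2.0], [0.5], [0.1]])
X -= X.mean(axis=1, keepdims=True)                 # centered observations as columns

S_x = X @ X.T / (N - 1)
A = X.T / np.sqrt(N - 1)                           # N x p matrix with A^T A = S_x

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD: A = U Sigma V^T
lam = np.sort(np.linalg.eigvalsh(S_x))[::-1]       # eigenvalues of S_x, in decreasing order

print(np.allclose(s**2, lam))                      # True: squared singular values are the eigenvalues
print(np.allclose(Vt.T @ np.diag(s**2) @ Vt, S_x)) # True: V Sigma^2 V^T = S_x
```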
In practice, computing an SVD of $A$ is both faster and more accurate than computing an eigendecomposition of $S_x$; this is particularly true when the dimension $p$ of the observation vectors is large.