We start with a motivating application from satellite imagery analysis. The Landsat satellites are a pair of Earth-imaging satellites that record images of terrain and coastlines, covering almost every square mile of the Earth’s surface every 16 days.
Satellite sensors acquire seven simultaneous images of any given region, with each sensor recording energy from a separate wavelength band: three in the visible light spectrum and four in the infrared and thermal bands.
Each image is digitized and stored as a rectangular array of numbers, with each number representing the signal intensity at the corresponding pixel. Each of the seven images is one channel of a multichannel or multispectral image.
The seven Landsat images of a given region typically contain a lot of redundant information, as some features will appear across most channels. However, other features, because of their color or temperature, may only appear in one or two channels. A goal of multispectral image processing is to view the data in a way that extracts information better than studying each image separately.
One approach, called Principal Component Analysis (PCA), seeks special linear combinations of the data that condense the weighted information in all seven images into just one or two composite images. Importantly, we want these one or two composite images, or principal components, to capture as much of the scene variance (features) as possible; in particular, features should be more visible in the composite images than in any of the original individual ones.
This idea, which we’ll explore in detail today, is illustrated with some Landsat imagery taken over Railroad Valley, Nevada.
Images from three Landsat spectral bands are shown in (a)-(c); the total information in these images is “rearranged” into the three principal components in (d)-(f). The first component, (d), “explains” 93.5% of the scene features (or variance) found in the original data. In this way, we could compress all of the original data to the single image (d) with only a 6.5% loss of scene variance.
PCA can in general be applied to any data set consisting of lists of measurements made on a collection of objects or individuals, and it is used throughout data mining, machine learning, image processing, speech recognition, facial recognition, and health informatics. As we’ll see next, these “special combinations” of measurements are computed via the singular vectors of an observation matrix.
Let $x_j \in \mathbb{R}^p$ denote an observation vector obtained from measurement $j$, and suppose that $N$ measurements, indexed by $j = 1, \ldots, N$, are obtained. The observation matrix $X \in \mathbb{R}^{p \times N}$ is a $p \times N$ matrix whose $j$th column is the $j$th measurement vector $x_j$:
$$ X = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}. $$
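As a concrete (if hypothetical) illustration, here is a minimal NumPy sketch that stacks a handful of made-up measurement vectors into a $p \times N$ observation matrix; the numbers are invented and not related to the Landsat data:

```python
import numpy as np

# Hypothetical toy data (not the Landsat images): N = 3 measurement vectors,
# each living in R^p with p = 2.
measurements = [
    np.array([65.0, 160.0]),
    np.array([72.0, 180.0]),
    np.array([68.0, 170.0]),
]

# Observation matrix X in R^{p x N}: the j-th *column* is the j-th measurement.
X = np.column_stack(measurements)

print(X.shape)  # (2, 3): p = 2 rows, N = 3 columns
```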
To understand PCA, we need to understand some basic concepts from statistics. We will review the mean and covariance of a set of observations $x_1, \ldots, x_N$. For our purposes, these will simply be quantities we can compute from the data, but you should be aware that they are well-motivated from a statistical perspective: you will learn more about this in ESE 3010, STAT 4300, or ESE 4020.
Let’s start with an observation matrix $X \in \mathbb{R}^{p \times N}$, with columns $x_1, \ldots, x_N \in \mathbb{R}^p$. The (sample) mean of the observations is
$$ m = \frac{1}{N} \sum_{j=1}^{N} x_j \in \mathbb{R}^p. $$
Since PCA is interested in directions of (maximal) variation in our data, it makes sense to subtract off the mean $m$, as it captures the average behavior of our data set. To that end, define the centered observations to be
$$ \hat{x}_j = x_j - m, \qquad j = 1, \ldots, N, $$
and collect them as the columns of the centered observation matrix $\hat{X} = \begin{bmatrix} \hat{x}_1 & \cdots & \hat{x}_N \end{bmatrix} \in \mathbb{R}^{p \times N}$.
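Here is a minimal sketch of this centering step, again on made-up numbers rather than any data set from the notes:

```python
import numpy as np

# Same hypothetical p x N observation matrix as before (p = 2, N = 3).
X = np.column_stack([
    np.array([65.0, 160.0]),
    np.array([72.0, 180.0]),
    np.array([68.0, 170.0]),
])

# Sample mean m = (1/N) * (x_1 + ... + x_N), one entry per row of X.
m = X.mean(axis=1, keepdims=True)   # shape (p, 1)

# Centered observations: subtract m from every column of X.
X_hat = X - m                       # shape (p, N)

print(m.ravel())           # the mean of each measurement
print(X_hat.mean(axis=1))  # approximately zero in every row
```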
For example, Fig. 3 below shows a centered version of the weight/height data illustrated in Fig. 1:
The (sample) covariance matrix of the data is then
$$ S = \frac{1}{N} \hat{X} \hat{X}^T \in \mathbb{R}^{p \times p}. $$
Since any matrix of the form $AA^T$ is positive semidefinite (can you see why?), so is $S$. Note that sometimes $\frac{1}{N-1}$ is used as the normalization; this is motivated by statistical considerations beyond the scope of this course (it leads to $S$ being an unbiased estimator of the “true” covariance of the data). We will just use $\frac{1}{N}$.
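Continuing the sketch on the same made-up data, we can form $S$ with the $\frac{1}{N}$ normalization and check positive semidefiniteness numerically; the comparison with `np.cov` is just a sanity check on the normalization:

```python
import numpy as np

# Same hypothetical data as before (p = 2, N = 3).
X = np.column_stack([
    np.array([65.0, 160.0]),
    np.array([72.0, 180.0]),
    np.array([68.0, 170.0]),
])
N = X.shape[1]
X_hat = X - X.mean(axis=1, keepdims=True)   # centered observation matrix

# Sample covariance with the 1/N normalization used in these notes.
S = (X_hat @ X_hat.T) / N                   # shape (p, p), symmetric

# Positive semidefiniteness: all eigenvalues are >= 0 (up to round-off).
print(np.linalg.eigvalsh(S))

# Sanity check: np.cov defaults to the 1/(N-1) normalization;
# bias=True switches it to 1/N, which should reproduce S.
print(np.allclose(S, np.cov(X, bias=True)))
```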
You might be wondering what the entries $s_{ij}$ of the covariance matrix $S$ mean. Let’s take a closer look. We’ll consider a small example where the observations $x_i \in \mathbb{R}^2$ are two dimensional, and assume we have $N = 3$ observations. Let the first measurement be $a \in \mathbb{R}$ and the second be $b \in \mathbb{R}$, so that $x_i = (a_i, b_i) \in \mathbb{R}^2$ and the centered observation is $\hat{x}_i = (\hat{a}_i, \hat{b}_i) \in \mathbb{R}^2$. Our centered observation matrix is then
$$ \hat{X} = \begin{bmatrix} \hat{x}_1 & \hat{x}_2 & \hat{x}_3 \end{bmatrix} = \begin{bmatrix} \hat{a}_1 & \hat{a}_2 & \hat{a}_3 \\ \hat{b}_1 & \hat{b}_2 & \hat{b}_3 \end{bmatrix} = \begin{bmatrix} \hat{a}^T \\ \hat{b}^T \end{bmatrix}, $$
where we defined $\hat{a} = (\hat{a}_1, \hat{a}_2, \hat{a}_3) \in \mathbb{R}^3$ and $\hat{b} = (\hat{b}_1, \hat{b}_2, \hat{b}_3) \in \mathbb{R}^3$ as the vectors containing all of the centered first and second measurements, respectively.
Then, we can write our sample covariance matrix as
$$ S = \frac{1}{3} \hat{X} \hat{X}^T = \frac{1}{3} \begin{bmatrix} \hat{a}^T\hat{a} & \hat{a}^T\hat{b} \\ \hat{b}^T\hat{a} & \hat{b}^T\hat{b} \end{bmatrix} = \frac{1}{3} \begin{bmatrix} \|\hat{a}\|^2 & \hat{a}^T\hat{b} \\ \hat{a}^T\hat{b} & \|\hat{b}\|^2 \end{bmatrix}. $$
Looking at the first diagonal entry,
$$ s_{11} = \frac{1}{3}\|\hat{a}\|^2 = \frac{1}{3} \sum_{i=1}^{3} (a_i - m_1)^2, $$
we see that $s_{11}$ captures how much the first measurement $a_i$ deviates from its mean value $m_1$, on average, i.e., it measures how much $a_i$ varies relative to its mean. Similarly, $s_{22} = \frac{1}{3}\|\hat{b}\|^2$ is the variance of measurement 2.
Now let’s look at the off-diagonal term $s_{12} = s_{21} = \frac{1}{3}\hat{a}^T\hat{b}$. Recall from our work on inner products that $\hat{a}^T\hat{b} = \|\hat{a}\|\,\|\hat{b}\|\cos\theta$, where $\theta$ is the angle between $\hat{a}$ and $\hat{b}$. We can view
$$ \cos\theta = \frac{\hat{a}^T\hat{b}}{\|\hat{a}\|\,\|\hat{b}\|} $$
as a measure of how well aligned, or correlated, $\hat{a}$ and $\hat{b}$ are: if $\hat{a}$ and $\hat{b}$ are parallel, $\cos\theta = 1$ or $-1$, and if $\hat{a}$ and $\hat{b}$ are perpendicular, $\cos\theta = 0$. This lets us interpret $s_{12} = \frac{1}{3}\hat{a}^T\hat{b}$, which is proportional to $\cos\theta$, as a measure of how similarly $\hat{a}$ and $\hat{b}$ deviate from their means: if $\hat{a}^T\hat{b}$ is positive, $\hat{a}$ and $\hat{b}$ tend to move up or down together; if it is negative, they tend to move in opposite directions; and if it is small (or zero), $\hat{a}$ and $\hat{b}$ tend to move independently of each other. Since $s_{12}$ captures how the 1st and 2nd measurements vary with each other, it is called their covariance.
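A small numerical sketch of this $p = 2$, $N = 3$ example (with made-up numbers) confirms that the diagonal entries of $S$ are the variances of $\hat{a}$ and $\hat{b}$, that the off-diagonal entry is $\frac{1}{3}\hat{a}^T\hat{b}$, and shows the corresponding value of $\cos\theta$:

```python
import numpy as np

# Made-up first and second measurements for N = 3 observations.
a = np.array([1.0, 2.0, 6.0])
b = np.array([2.0, 4.0, 7.0])

# Center each measurement.
a_hat = a - a.mean()
b_hat = b - b.mean()

# Covariance matrix of the centered 2 x 3 data with the 1/N normalization.
X_hat = np.vstack([a_hat, b_hat])
S = (X_hat @ X_hat.T) / 3

print(np.isclose(S[0, 0], a_hat @ a_hat / 3))   # s11 = ||a_hat||^2 / 3
print(np.isclose(S[1, 1], b_hat @ b_hat / 3))   # s22 = ||b_hat||^2 / 3
print(np.isclose(S[0, 1], a_hat @ b_hat / 3))   # s12 = (1/3) a_hat^T b_hat

# cos(theta) between a_hat and b_hat: close to +1 here, since these made-up
# measurements tend to move up and down together.
cos_theta = (a_hat @ b_hat) / (np.linalg.norm(a_hat) * np.linalg.norm(b_hat))
print(cos_theta)
```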
Finally, although we worked out these concepts for $x_i \in \mathbb{R}^2$ and $i = 1, 2, 3$, they extend naturally to the general setting:
$s_{ii}$ = variance of measurement $i$ across observations $j = 1, \ldots, N$;
$s_{kl}$ = covariance of measurements $k$ and $l$ across observations $j = 1, \ldots, N$.
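Putting the pieces together, here is a hedged sketch of how the principal components promised at the start of the section can be computed from the singular vectors of the centered observation matrix; the data are randomly generated stand-ins (not the Landsat images), and the “explained variance” printed at the end plays the role of the 93.5% figure quoted earlier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data (not the Landsat images): N = 500 observations of p = 7
# correlated "channels", generated from two underlying factors plus noise.
p, N = 7, 500
factors = rng.normal(size=(2, N))
mixing = rng.normal(size=(p, 2))
X = mixing @ factors + 0.1 * rng.normal(size=(p, N))

# Center the observations.
X_hat = X - X.mean(axis=1, keepdims=True)

# The left singular vectors of X_hat are the principal directions, and the
# squared singular values divided by N are the variances along them
# (equivalently, the eigenvalues of S = X_hat X_hat^T / N).
U, sigma, Vt = np.linalg.svd(X_hat, full_matrices=False)
variances = sigma**2 / N

# Fraction of the total variance "explained" by the first principal component.
print(variances[0] / variances.sum())

# First principal component: project each centered observation onto u_1.
pc1 = U[:, 0] @ X_hat    # shape (N,)
```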