
11.1 Basics of Statistics

Dept. of Electrical and Systems Engineering
University of Pennsylvania


Lecture notes

1 Reading

Material related to this page, as well as additional exercises, can be found in LAA 7.5 and ALA 8.8.

2 Learning Objectives

By the end of this page, you should know:

  • the idea of studying only the important features of an image
  • what an observation matrix is
  • the sample mean and covariance of a set of observations, with some examples

3 Motivation: Satellite Imagery

We start with a motivating application from satellite imagery analysis. The Landsat satellites are a pair of imaging satellites that record images of terrain and coastlines. These satellites cover almost every square mile of the Earth’s surface every 16 days.

Satellite sensors acquire seven simultaneous images of any given region, with each sensor recording energy from separate wavelength bands: three in the visible light spectrum and four in the infrared and thermal bands.

Each image is digitized and stored as a rectangular array of numbers, with each number representing the signal intensity at the corresponding pixel. Each of the seven images is one channel of a multichannel or multispectral image.

The seven Landsat images of a given region typically contain a lot of redundant information, as some features will appear across most channels. However, other features, because of their color or temperature, may only appear in one or two channels. A goal of multispectral image processing is to view the data in a way that extracts information better than studying each image separately.

One approach, called Principal Component Analysis (PCA), seeks special linear combinations of the data that condense a weighted combination of all seven images into just one or two images. Importantly, we want these one or two composite images, or principal components, to capture as much of the scene variance (features) as possible; in particular, features should be more visible in the composite images than in any of the original individual ones.

This idea, which we’ll explore in detail today, is illustrated with some Landsat imagery taken over Railroad Valley, Nevada.

Figure: Railroad Valley satellite imagery

Images from three Landsat spectral bands are shown in (a)-(c); the total information in these images is “rearranged” into the three principal components in (d)-(f). The first component, (d), “explains” 93.5% of the scene features (or variance) found in the original data. In this way, we could compress all of the original data to the single image (d) with only a 6.5% loss of scene variance.

PCA can in general be applied to any data set consisting of lists of measurements made on a collection of objects or individuals, with applications in data mining, machine learning, image processing, speech recognition, facial recognition, and health informatics. As we’ll see next, these “special combinations” of measurements are computed via the singular vectors of an observation matrix.

4 Observation Matrix

Let $\mathbf{x}_j \in \mathbb{R}^p$ denote an observation vector obtained from measurement $j$, and suppose that $j = 1, \ldots, N$ measurements are obtained. The observation matrix $X \in \mathbb{R}^{p \times N}$ is a $p \times N$ matrix with $j^{th}$ column equal to the $j^{th}$ measurement vector $\mathbf{x}_j$:

$$X = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N \end{bmatrix} \in \mathbb{R}^{p \times N}.$$
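To make the observation matrix concrete, here is a minimal numpy sketch; the measurement vectors `x1, ..., x4` and their values are made up purely for illustration:

```python
import numpy as np

# Hypothetical measurement vectors x_j in R^p (p = 3), one per observation.
x1 = np.array([1.0, 2.0, 0.5])
x2 = np.array([1.5, 1.8, 0.7])
x3 = np.array([0.9, 2.2, 0.4])
x4 = np.array([1.2, 2.1, 0.6])

# Observation matrix X in R^{p x N}: the j-th column is the j-th measurement.
X = np.column_stack([x1, x2, x3, x4])
print(X.shape)  # (3, 4), i.e., (p, N)
```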

5 Mean and Covariance

To understand PCA, we need some basic concepts from statistics. We will review the mean and covariance of a set of observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$. For our purposes, these will simply be quantities we can compute from the data, but you should be aware that they are well motivated from a statistical perspective: you will learn more about this in ESE 3010, STAT 4300, or ESE 4020.

Let’s start with an observation matrix $X \in \mathbb{R}^{p \times N}$, with columns $\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathbb{R}^p$.

Since PCA is interested in directions of (maximal) variation in our data, it makes sense to subtract off the sample mean $\mathbf{m} = \frac{1}{N}\sum_{j=1}^{N} \mathbf{x}_j$, as it captures the average behavior of our data set. To that end, define the centered observations to be

$$\hat{\mathbf{x}}_j = \mathbf{x}_j - \mathbf{m}, \quad j = 1, \ldots, N,$$

and the centered or de-meaned observation matrix

$$\hat{X} = \begin{bmatrix} \hat{\mathbf{x}}_1 & \hat{\mathbf{x}}_2 & \cdots & \hat{\mathbf{x}}_N \end{bmatrix}.$$
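As a sketch of how centering looks in code, continuing with the same hypothetical observation matrix as above:

```python
import numpy as np

# Hypothetical observation matrix X in R^{p x N} (p = 3, N = 4).
X = np.array([[1.0, 1.5, 0.9, 1.2],
              [2.0, 1.8, 2.2, 2.1],
              [0.5, 0.7, 0.4, 0.6]])

# Sample mean m = (1/N) * sum_j x_j, computed across the columns of X.
m = X.mean(axis=1, keepdims=True)   # shape (p, 1)

# Centered (de-meaned) observation matrix: subtract m from every column.
X_hat = X - m
print(X_hat.mean(axis=1))           # each row now averages to (numerically) zero
```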

For example, Fig. 3 below shows a centered version of the weight/height data illustrated in Fig. 1:

Figure 3: centered scatter plot of the weight/height data

The sample covariance matrix of the observations is then defined to be

$$S = \frac{1}{N} \hat{X}\hat{X}^T \in \mathbb{R}^{p \times p}.$$

Since any matrix of the form $AA^T$ is positive semidefinite (can you see why?), so is $S$. Note that sometimes $\frac{1}{N-1}$ is used as the normalization; this is motivated by statistical considerations beyond the scope of this course (it leads to $S$ being an unbiased estimator of the “true” covariance of the data). We will just use $\frac{1}{N}$.
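A minimal numpy sketch of this covariance computation and of the positive semidefiniteness claim, again on the hypothetical data from above:

```python
import numpy as np

# Hypothetical observation matrix (p = 3, N = 4).
X = np.array([[1.0, 1.5, 0.9, 1.2],
              [2.0, 1.8, 2.2, 2.1],
              [0.5, 0.7, 0.4, 0.6]])
N = X.shape[1]

X_hat = X - X.mean(axis=1, keepdims=True)

# Sample covariance with the 1/N normalization used in these notes.
S = (X_hat @ X_hat.T) / N

# S = (1/N) * X_hat X_hat^T is symmetric positive semidefinite:
# all of its eigenvalues are >= 0 (up to floating-point roundoff).
eigvals = np.linalg.eigvalsh(S)
print(np.all(eigvals >= -1e-12))    # True
```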

You might be wondering what the entries $s_{ij}$ of the covariance matrix $S$ mean. Let’s take a bit of a closer look. We’ll consider a small example where the observations $\mathbf{x}_j \in \mathbb{R}^2$ are two dimensional, and assume we have $N = 3$ observations. Let the first measurement be $a \in \mathbb{R}$ and the second $b \in \mathbb{R}$, so that $\mathbf{x}_i = (a_i, b_i) \in \mathbb{R}^2$ and the centered observation is $\hat{\mathbf{x}}_i = (\hat{a}_i, \hat{b}_i) \in \mathbb{R}^2$. Our centered observation matrix is then

$$\hat{X} = \begin{bmatrix} \hat{a}_1 & \hat{a}_2 & \hat{a}_3 \\ \hat{b}_1 & \hat{b}_2 & \hat{b}_3 \end{bmatrix} = \begin{bmatrix} \hat{\mathbf{a}}^T \\ \hat{\mathbf{b}}^T \end{bmatrix},$$

where we defined $\hat{\mathbf{a}} = (\hat{a}_1, \hat{a}_2, \hat{a}_3) \in \mathbb{R}^3$ and $\hat{\mathbf{b}} = (\hat{b}_1, \hat{b}_2, \hat{b}_3)$ as the vectors in $\mathbb{R}^3$ containing all of the centered first and second measurements, respectively.

Then, we can write our sample covariance matrix as:

$$S = \frac{1}{3}\hat{X}\hat{X}^T = \frac{1}{3}\begin{bmatrix} \hat{\mathbf{a}}^T \\ \hat{\mathbf{b}}^T \end{bmatrix}\begin{bmatrix} \hat{\mathbf{a}} & \hat{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \frac{\|\hat{\mathbf{a}}\|^2}{3} & \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3} \\ \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3} & \frac{\|\hat{\mathbf{b}}\|^2}{3} \end{bmatrix}.$$

The diagonal entry $s_{11} = \frac{\|\hat{\mathbf{a}}\|^2}{3}$ is called the variance of measurement 1.

Expanding it out:

$$s_{11} = \frac{\|\hat{\mathbf{a}}\|^2}{3} = \frac{1}{3}\left(\hat{a}_1^2 + \hat{a}_2^2 + \hat{a}_3^2\right) = \frac{1}{3}\left((a_1 - m_1)^2 + (a_2 - m_1)^2 + (a_3 - m_1)^2\right),$$

where $m_1$ is the first entry of the mean vector $\mathbf{m}$, i.e., the average of $a_1, a_2, a_3$. From this expansion,

we see that $s_{11}$ captures how much the first measurement $a_i$ deviates from its mean value $m_1$, on average, i.e., it measures how much $a_i$ varies relative to its mean. Similarly, $s_{22} = \frac{\|\hat{\mathbf{b}}\|^2}{3}$ is the variance of measurement 2.
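Here is a small numerical sketch of the variance entries for this $2 \times 3$ example; the values of `a` and `b` are made up for illustration:

```python
import numpy as np

# Hypothetical measurements: a = first measurement, b = second, N = 3 observations.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 14.0, 12.0])

a_hat = a - a.mean()
b_hat = b - b.mean()

X_hat = np.vstack([a_hat, b_hat])   # rows are a_hat^T and b_hat^T
S = (X_hat @ X_hat.T) / 3

# Diagonal entries are the variances s11 = ||a_hat||^2 / 3 and s22 = ||b_hat||^2 / 3.
print(S[0, 0], np.dot(a_hat, a_hat) / 3)
print(S[1, 1], np.dot(b_hat, b_hat) / 3)
```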

Now let’s look at the off-diagonal term $s_{12} = s_{21} = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3}$. Recall from our work on inner products that $\hat{\mathbf{a}}^T\hat{\mathbf{b}} = \|\hat{\mathbf{a}}\|\,\|\hat{\mathbf{b}}\|\cos\theta$, where $\theta$ is the angle between $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$. We can view

$$\cos\theta = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{\|\hat{\mathbf{a}}\|\,\|\hat{\mathbf{b}}\|}$$

as a measure of how well aligned, or correlated, $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are: if $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are parallel, $\cos\theta = 1$ or $-1$, and if $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ are perpendicular, $\cos\theta = 0$. This lets us interpret $s_{12} = \frac{\hat{\mathbf{a}}^T\hat{\mathbf{b}}}{3}$, which is proportional to $\cos\theta$, as a measure of how similarly $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ deviate from their means: if $\hat{\mathbf{a}}^T\hat{\mathbf{b}}$ is positive, $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ tend to move up or down together; if it is negative, they tend to move in opposite directions; and if it is small (or zero), they tend to move independently of each other. Since $s_{12}$ captures how the 1st and 2nd measurements vary with each other, it is called their covariance.
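A quick sketch of this sign interpretation on made-up data; the helper `cov12` below is purely illustrative and not part of the notes:

```python
import numpy as np

def cov12(a, b):
    """Off-diagonal sample covariance s12 = a_hat^T b_hat / N (1/N normalization)."""
    a_hat, b_hat = a - a.mean(), b - b.mean()
    return np.dot(a_hat, b_hat) / len(a)

a = np.array([1.0, 2.0, 3.0])
print(cov12(a,  2 * a))   # positive: b tends to move up when a does
print(cov12(a, -2 * a))   # negative: b tends to move down when a moves up
```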

Finally, although we worked out these concepts for $\mathbf{x}_j \in \mathbb{R}^2$ and $j = 1, 2, 3$, they extend naturally to the general setting (a numerical check is sketched after the list below):

  • $S_{ii}$ = variance of measurement $i$ across observations $j = 1, \ldots, N$
  • $S_{kl}$ = covariance of measurements $k$ and $l$ across observations $j = 1, \ldots, N$.
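As promised above, here is a sketch that checks the general $p \times N$ computation against numpy’s built-in `np.cov` (with `bias=True` so it matches the $\frac{1}{N}$ normalization used here); the random observation matrix is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 5, 100
X = rng.normal(size=(p, N))                   # hypothetical observation matrix

X_hat = X - X.mean(axis=1, keepdims=True)
S = (X_hat @ X_hat.T) / N

# np.cov treats each row of X as one measurement and each column as one observation.
print(np.allclose(S, np.cov(X, bias=True)))   # True
```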
