
4.5 Least Squares

approximate linear system

Dept. of Electrical and Systems Engineering
University of Pennsylvania


Lecture notes

1 Reading

Material related to this page, as well as additional exercises, can be found in LAA 6.5 and VMLS 12.1.

2 Learning Objectives

By the end of this page, you should know:

  • the least squares problem and how to solve it
  • how the least squares problem relates to solving an approximate linear system

3 Introduction: Inconsistent Linear Equations

Suppose we are presented with an inconsistent set of linear equations $A \vv x \approx \vv b$. This typically coincides with $A \in \mathbb{R}^{m \times n}$ being a “tall matrix”, i.e., $m > n$, corresponding to an overdetermined system of $m$ linear equations in $n$ unknowns. A typical setting in which this arises is data fitting: we are given feature variables $\vv a_i \in \mathbb{R}^n$ and response variables $b_i \in \mathbb{R}$, and we believe that $\vv a_i^{\top} \vv x \approx b_i$ for measurements $i=1,\ldots, m$, where $\vv x \in \mathbb{R}^n$ are our model parameters. We will revisit this application in detail later.

The question then becomes: if no $\vv x \in \mathbb{R}^n$ exists such that $A \vv x = \vv b$, what should we do? A natural idea is to select an $\vv x$ that makes the error, or residual, $\vv r = A\vv x - \vv b$ as small as possible, i.e., to find the $\vv x$ that minimizes $\|\vv r\| = \|A\vv x - \vv b\|$. Since minimizing the residual norm or its square gives the same answer, we may as well minimize

$$\|A\vv x - \vv b\|^2 = \|\vv r\|^2 = r_1^2 + \cdots + r_m^2,$$

the sum of squares of the residuals.
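As a quick numerical illustration (the small system below is made up for this page and is not one of the notes' examples), the following snippet evaluates the sum of squared residuals for two candidate choices of $\vv x$; the least squares problem asks for the choice that makes this quantity as small as possible.

import numpy as np

# A made-up overdetermined system: 3 equations, 2 unknowns, no exact solution
A = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
b = np.array([1., 1., 0.])

def sum_of_squared_residuals(x):
    r = A @ x - b          # residual vector r = Ax - b
    return np.sum(r**2)    # r_1^2 + ... + r_m^2

print(sum_of_squared_residuals(np.array([1., 1.])))      # 4.0
print(sum_of_squared_residuals(np.array([1/3, 1/3])))    # 4/3, the smallest achievable value here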

4 The Least Squares Problem

4.1 Solving by Orthogonal Projection

There are many ways of deriving the solution to (LS): you may have seen a vector calculus-based derivation in Math 1410. Here, we will use our new understanding of orthogonal projections to provide an intuitive and elegant geometric derivation.

Our starting point is a column interpretation of the least squares objective: let $\vv a_1, \ldots, \vv a_n \in \mathbb{R}^m$ be the columns of $A$. Then the least squares (LS) problem is the problem of finding a linear combination of the columns that is closest to the vector $\vv b \in \mathbb{R}^m$, with coefficients specified by $\vv x$:

$$\|A\vv x - \vv b\|^2 = \|(x_1 \vv a_1 + \cdots + x_n \vv a_n) - \vv b\|^2. \quad (\textrm{LS})$$
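The identity $A\vv x = x_1 \vv a_1 + \cdots + x_n \vv a_n$ is easy to check numerically; the short snippet below (with an arbitrary matrix chosen just for illustration) confirms that A @ x and the explicit column combination agree.

import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
x = np.array([2., -1.])

# A @ x is exactly the linear combination x_1 * a_1 + x_2 * a_2 of the columns of A
column_combination = x[0] * A[:, 0] + x[1] * A[:, 1]
print(np.allclose(A @ x, column_combination))   # True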

To prove the above geometrically intuitive fact (see Figure 1), we need to decompose $\vv b$ into its orthogonal projection onto $\text{Col}(A)$, which we denote by $\hat{\vv b}$, and its component in the orthogonal complement $\text{Col}(A)^{\perp}$, which we denote by $\vv e$. Recall that $\vv b, \hat{\vv b}, \vv e \in \mathbb{R}^m$ and $\text{Col}(A) \subset \mathbb{R}^m$.

We then have that

$$\vv r = A \vv x - \vv b = \left(A \vv x - \hat{\vv b}\right) - \vv e.$$

Since $A \vv x, \hat{\vv b} \in \text{Col}(A)$, so is $A \vv x - \hat{\vv b}$ (why?), and thus we have decomposed $\vv r$ into components lying in $\text{Col}(A)$ and $\text{Col}(A)^{\perp}$. Using our generalized Pythagorean theorem, it then follows that

$$\|A \vv x - \vv b\|^2 = \|\vv r\|^2 = \|A \vv x - \hat{\vv b}\|^2 + \|\vv e\|^2.$$

The above expression can be made as small as possible by choosing $\hat{\vv x}$ such that $A\hat{\vv x} = \hat{\vv b}$, which always has a solution (why?), leaving the residual error $\|\vv e\|^2 = \|\vv b - \hat{\vv b}\|^2$, i.e., the component of $\vv b$ that is orthogonal to $\text{Col}(A)$.
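The following sketch makes this decomposition concrete. The matrix and vectors are made up for illustration, and the projection $\hat{\vv b}$ is computed from an orthonormal basis of $\text{Col}(A)$ obtained via np.linalg.qr (a tool not introduced in these notes, used here only as a convenient way to project).

import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])   # tall matrix with linearly independent columns (illustrative)
b = np.array([6., 0., 0.])

# Orthonormal basis for Col(A) via reduced QR; b_hat is the orthogonal projection of b onto Col(A)
Q, R = np.linalg.qr(A)
b_hat = Q @ (Q.T @ b)
e = b - b_hat                       # component of b orthogonal to Col(A)

# A x_hat = b_hat is consistent by construction, since b_hat lies in Col(A)
x_hat = np.linalg.solve(R, Q.T @ b_hat)

# Verify the Pythagorean identity for an arbitrary candidate x
x = np.array([1., 1.])
lhs = np.linalg.norm(A @ x - b)**2
rhs = np.linalg.norm(A @ x - b_hat)**2 + np.linalg.norm(e)**2
print(lhs, rhs)                     # both equal 38 (up to rounding): the identity holds
print(np.linalg.norm(A @ x_hat - b)**2, np.linalg.norm(e)**2)   # both equal 6: x_hat attains the minimum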

This gives us a nice geometric interpretation of the least squares solution $\hat{\vv x}$, but how should we compute it? We now recall from here that $\text{Col}(A)^{\perp} = \text{Null}(A^{\top})$, and therefore $\vv e \in \text{Null}(A^{\top})$. This means that

$$A^{\top} \vv e = A^{\top} \left(\vv b - \hat{\vv b}\right) = A^{\top} \left(\vv b - A \hat{\vv x}\right) = 0,$$

or, equivalently, that

$$A^{\top} A \hat{\vv x} = A^{\top} \vv b. \quad (\textrm{NE})$$

The above equations are the normal equations associated with the least squares problem specified by $A$ and $\vv b$. We have just informally argued that the set of least squares solutions $\hat{\vv x}$ coincides with the set of solutions to the normal equations (NE): this is in fact true, and can be proven (we won't do that here).

Thus, we have reduced solving a least squares problem to our favorite problem, solving a system of linear equations! One question you might have is: when do the normal equations (NE) have a unique solution? The answer, perhaps unsurprisingly, is when the columns of $A$ are linearly independent, and hence form a basis for $\text{Col}(A)$. A useful summary of our discussion thus far: a vector $\hat{\vv x} \in \mathbb{R}^n$ solves the least squares problem (LS) if and only if it satisfies the normal equations (NE); moreover, if the columns of $A$ are linearly independent, then $A^{\top}A$ is invertible and the unique least squares solution is $\hat{\vv x} = (A^{\top}A)^{-1}A^{\top}\vv b$.
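As a concrete check of the normal equations, consider the third system from the Python break below (the arithmetic here is mine, worked by hand so you can compare with the code's printed output):

$$A = \begin{bmatrix} 4 & 0 \\ 0 & 2 \\ 1 & 1 \end{bmatrix}, \quad \vv b = \begin{bmatrix} 2 \\ 0 \\ 11 \end{bmatrix}, \quad A^{\top}A = \begin{bmatrix} 17 & 1 \\ 1 & 5 \end{bmatrix}, \quad A^{\top}\vv b = \begin{bmatrix} 19 \\ 11 \end{bmatrix}.$$

Solving $17\hat{x}_1 + \hat{x}_2 = 19$ and $\hat{x}_1 + 5\hat{x}_2 = 11$ gives $\hat{\vv x} = (1, 2)$, with residual $\|A\hat{\vv x} - \vv b\|^2 = \|(2, 4, -8)\|^2 = 84$, matching the output below.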

Python break!

In the following code, we show how to use np.linalg.lstsq in Python to solve the least squares problem, and also how to obtain the solution by solving a linear system (np.linalg.solve) as illustrated in Example 1. If there is more than one solution to the least squares problem, then the two strategies (np.linalg.lstsq and np.linalg.solve) may return different solutions $\hat{\vv x}$, because each NumPy function uses a different numerical strategy to obtain $\hat{\vv x}$.

# Least squares

import numpy as np

def least_squares_linalg(A, b):

    print("\nA: \n", A, "\nb: ", b)

    print("\nlstsq function\n")
    
    x, residual, rank, sing_val = np.linalg.lstsq(A, b, rcond=None)
    # lstsq returns an empty residual array if rank(A) < n or m <= n (see the first two examples below)
    print("Solution (x): \n", x, "\nResidual: ", residual)

def least_squares(A, b):
    print("\nsolving a linear system\n")

    # solve the normal equations A^T A x = A^T b
    x = np.linalg.solve(A.T @ A, A.T @ b)
    
    residual = np.linalg.norm(A@x- b)**2
    
    print("Solution (x): \n", x, "\nResidual: ", residual)

A = np.array([[1, -2],
              [5, 3]])
b = np.array([8, 1])

least_squares_linalg(A, b)
least_squares(A, b)

A1 = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1 , 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1]])
b1 = np.array([-3, -1, 0, 2, 5, 1])

# Notice the difference in both the solutions
least_squares_linalg(A1, b1)
least_squares(A1, b1)

A2 = np.array([[4, 0], [0, 2],[1, 1]])
b2 = np.array([2, 0, 11])

least_squares_linalg(A2, b2)
least_squares(A2, b2)

A: 
 [[ 1 -2]
 [ 5  3]] 
b:  [8 1]

lstsq function

Solution (x): 
 [ 2. -3.] 
Residual:  []

solving a linear system

Solution (x): 
 [ 2. -3.] 
Residual:  0.0

A: 
 [[1 1 0 0]
 [1 1 0 0]
 [1 0 1 0]
 [1 0 1 0]
 [1 0 0 1]
 [1 0 0 1]] 
b:  [-3 -1  0  2  5  1]

lstsq function

Solution (x): 
 [ 0.5 -2.5  0.5  2.5] 
Residual:  []

solving a linear system

Solution (x): 
 [-6.  4.  7.  9.] 
Residual:  11.999999999999998

A: 
 [[4 0]
 [0 2]
 [1 1]] 
b:  [ 2  0 11]

lstsq function

Solution (x): 
 [1. 2.] 
Residual:  [84.]

solving a linear system

Solution (x): 
 [1. 2.] 
Residual:  84.0
