
4.6 Least Squares and Data Fitting


Dept. of Electrical and Systems Engineering
University of Pennsylvania


Lecture notes

1 Reading

Material related to this page, as well as additional exercises, can be found in VMLS 13.

2 Learning Objectives

By the end of this page, you should know:

  • what a data fitting problem is and how it relates to least squares
  • model parameterization and examples of different parameterizations
  • the Minimum Mean Squared Error (MMSE)
  • the regression model and one of its applications (the auto-regressive model for time-series modeling)
  • the standard tricks for model selection: generalization and validation

3 Introduction

We will introduce one of the most important applications of least squares methods: fitting a mathematical model to a relationship between variables given some observed data.

A typical data fitting problem takes the following form: There is some underlying feature vector or independent variable $\vv x \in \mathbb{R}^m$ and a scalar outcome or response variable $y \in \mathbb{R}$ that we believe are (approximately) related by some function $f: \mathbb{R}^m \to \mathbb{R}$ such that

$$
y \approx f(x). \qquad \text{(M)}
$$

3.1 Data

Our goal is to fit (or learn) a model $f$ given some data:

$$
(\vv x^{(1)}, y^{(1)}), (\vv x^{(2)}, y^{(2)}), \ldots, (\vv x^{(N)}, y^{(N)}).
$$

These data pairs $(\vv x^{(i)}, y^{(i)})$ are sometimes also called observations, examples, samples, or measurements, depending on context.

3.2 Model Parameterization

Our goal is to choose a model $\hat{f}: \mathbb{R}^m \to \mathbb{R}$ that approximates the true function $f$ well, that is, $y \approx \hat{f}(\vv x)$. The hat notation is traditionally used to highlight that $\hat{f}$ is an approximation to $f$. Specifically, we will write $\hat{y} = \hat{f}(\vv x)$ to highlight that $\hat{y}$ is an approximate prediction of the outcome $y$.

In order to efficiently search over candidate model functions $\hat{f}$, we need to parameterize a model class $\mathcal{F}$ that is easy to work with. A powerful and commonly used model class is the set of linear in the parameters models of the form

$$
\hat{f}(\vv x) = \theta_1 f_1(\vv x) + \theta_2 f_2(\vv x) + \cdots + \theta_p f_p(\vv x). \qquad \text{(LP)}
$$

In (LP), the functions $f_i: \mathbb{R}^m \to \mathbb{R}$ are basis functions or features that we choose beforehand.

When we solve the data fitting problem, we will look for the parameters $\theta_i$ that, among other things, make the model prediction $\hat{y}^{(i)} = \hat{f}(\vv x^{(i)})$ consistent with the observed data, i.e., we want $\hat{y}^{(i)} \approx y^{(i)}$.

3.3 Data Fitting

For the $i$-th observation $y^{(i)}$ and the $i$-th prediction $\hat{y}^{(i)}$, we define the prediction error or residual $r^{(i)} = y^{(i)} - \hat{y}^{(i)}$.

The least squares data fitting problem chooses the model parameters $\theta_i$ that minimize the (average of the) sum of the squares of the prediction errors on the data set:

$$
\frac{(r^{(1)})^2 + \cdots + (r^{(N)})^2}{N}.
$$

This quantity is the mean squared error (MSE) of the model on the data set; its smallest achievable value is the Minimum Mean Squared Error (MMSE).

Next we’ll show that this problem can be cast as a least squares problem over the model parameters $\theta_i$. Before doing that though, we want to highlight the conceptual shift we are making: the unknowns are now the model parameters $\theta_i$, while the data pairs $(\vv x^{(i)}, y^{(i)})$ enter as fixed problem data.

4 Data Fitting as Least Squares

We start by stacking the outcomes $y^{(i)}$, predictions $\hat{y}^{(i)}$, and residuals $r^{(i)}$ as vectors in $\mathbb{R}^N$:

$$
\vv y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}, \quad \hat{\vv y} = \begin{bmatrix} \hat{y}^{(1)} \\ \hat{y}^{(2)} \\ \vdots \\ \hat{y}^{(N)} \end{bmatrix}, \quad \vv r = \begin{bmatrix} r^{(1)} \\ r^{(2)} \\ \vdots \\ r^{(N)} \end{bmatrix} = \begin{bmatrix} y^{(1)} - \hat{y}^{(1)} \\ y^{(2)} - \hat{y}^{(2)} \\ \vdots \\ y^{(N)} - \hat{y}^{(N)} \end{bmatrix}
$$

Then we can compactly write the squared prediction error as $\|\vv r\|_2^2$. Next, we compile our model parameters into a vector $\vv\theta \in \mathbb{R}^p$, and build our feature matrix or measurement matrix $A \in \mathbb{R}^{N \times p}$ by setting

$$
A_{ij} = f_j(\vv x^{(i)}), \quad i = 1, \ldots, N, \quad j = 1, \ldots, p.
$$

The $j$-th column of the matrix $A$ is composed of the $j$-th basis function evaluated on each of the data points $\vv x^{(1)}, \ldots, \vv x^{(N)}$:

$$
\vv f_1(\vv x) = \begin{bmatrix} f_1(\vv x^{(1)}) \\ f_1(\vv x^{(2)}) \\ \vdots \\ f_1(\vv x^{(N)}) \end{bmatrix}, \quad \cdots, \quad \vv f_p(\vv x) = \begin{bmatrix} f_p(\vv x^{(1)}) \\ f_p(\vv x^{(2)}) \\ \vdots \\ f_p(\vv x^{(N)}) \end{bmatrix}
$$

and $A = \begin{bmatrix} \vv f_1(\vv x) & \cdots & \vv f_p(\vv x) \end{bmatrix}$. In matrix-vector notation, we then have

$$
\hat{\vv y} = A\vv\theta = \theta_1 \vv f_1(\vv x) + \cdots + \theta_p \vv f_p(\vv x).
$$

The least squares data fitting problem is then to

$$
\text{minimize } \|\vv r\|^2 \;\Rightarrow\; \text{minimize } \|\vv y - A\vv\theta\|^2
$$

over the model parameters $\vv\theta$, which we recognize as a least squares problem! Assuming we have chosen basis functions $f_i$ such that the columns of $A$ are linearly independent (what would it mean if this wasn’t true?), we have that the least squares solution is

$$
\hat{\vv\theta} = (A^TA)^{-1}A^T\vv y.
$$
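To make this concrete, here is a minimal sketch (not from the notes; the basis functions and synthetic data below are illustrative assumptions) of forming $A$ from a chosen list of basis functions and solving for $\hat{\vv\theta}$ with np.linalg.lstsq, which avoids forming $(A^TA)^{-1}$ explicitly.

import numpy as np

# Illustrative basis functions f_1, f_2, f_3 (an assumption for this sketch)
basis = [lambda x: 1.0, lambda x: x[0], lambda x: x[0] * x[1]]

# Synthetic data: N = 50 feature vectors x^(i) in R^2 and outcomes y^(i)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=50)

# Build the N x p feature matrix A with A_ij = f_j(x^(i))
A = np.column_stack([[f(x_i) for x_i in X] for f in basis])

# Solve: minimize ||y - A theta||^2
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta_hat)  # should be close to [1, 2, -0.5]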

5 Warm-up: Fitting a Constant Model

We start with the simplest possible model and set the number of features $p = 1$ and $f_1(\vv x) = 1$, so that our (admittedly boring) model becomes $\hat{f}(\vv x) = \theta_1$.

First, we construct $A \in \mathbb{R}^{N \times 1}$ by setting $A_{i1} = f_1(\vv x^{(i)}) = 1$. Therefore $A$ is the $N$-dimensional all ones vector $\mathbf{1}_N$. We plug this into our formula for $\hat{\vv\theta}$:

$$
\hat{\vv\theta} = \hat{\theta}_1 = (\vv 1^T\vv 1)^{-1}\vv 1^T\vv y = \frac{1}{N}\sum_{i=1}^N y^{(i)} = \text{average}(\vv y).
$$

We have just shown that the mean or average of the outcomes $y^{(1)}, \ldots, y^{(N)}$ is the best least squares fit of a constant model. In this case, the MMSE is

$$
\frac{1}{N}\sum_{i=1}^N \left(\text{average}(\vv y) - y^{(i)}\right)^2,
$$

which is called the variance of $\vv y$, and measures how “wiggly” $\vv y$ is.
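As a quick sanity check (a minimal sketch with made-up numbers, not part of the lecture data), the least squares fit with $A = \mathbf{1}_N$ indeed recovers the mean, and the resulting MMSE is the variance of $\vv y$.

import numpy as np

y = np.array([1.0, 3.0, 2.0, 6.0])   # illustrative outcomes
A = np.ones((y.size, 1))             # A = 1_N for the constant model

theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta_hat[0], y.mean())                    # both equal average(y)
print(np.mean((y.mean() - y) ** 2), np.var(y))   # MMSE equals the variance of y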

6 Univariate Function: Straight-Line Fit

We start by considering the univariate function setting where our feature vector $\vv x = x \in \mathbb{R}$ is a scalar, and hence we are looking to approximate a function $f: \mathbb{R} \to \mathbb{R}$. This is a nice way to get intuition because it is easy to plot the data $(x^{(i)}, y^{(i)})$ and the model function $\hat{y} = \hat{f}(x)$.

We’ll start with a straight-line fit model: we set $p = 2$, with $f_1(x) = 1$ and $f_2(x) = x$. In this case, our model class is composed of models of the form

$$
\hat{f}(x) = \theta_1 + \theta_2 x.
$$

Here, we can easily interpret $\theta_1$ as the $y$-intercept and $\theta_2$ as the slope of the straight-line model we are searching for.

In this case, the matrix $A \in \mathbb{R}^{N \times 2}$ takes the form

$$
A = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(N)} \end{bmatrix}
$$

Although we can work out formulas for $\hat{\theta}_1$ and $\hat{\theta}_2$, they are not particularly interesting or informative. Instead, we’ll focus on some examples of how to use these ideas. A straight-line fit to 50 data points is given below.

Figure: Straight-line fit to 50 data points.
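A minimal sketch of how such a figure can be generated (using synthetic data, which is an assumption here, not the data behind the plot above):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: 50 noisy samples of a line
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=x.size)

# Columns of A: f_1(x) = 1 and f_2(x) = x
A = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)  # [intercept, slope]

plt.scatter(x, y, color='green', label='Data')
plt.plot(x, A @ theta_hat, label='Straight-line fit')
plt.legend()
plt.show()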

7 Univariate Function: Polynomial Fit

A simple extension beyond the straight-line fit is a polynomial fit, where we set the $j^{th}$ feature to be

$$
f_j(x) = x^{j-1}
$$

for $j = 1, \ldots, p$. This leads to a model class composed of polynomials of degree at most $p-1$:

$$
\hat{f}(x) = \theta_1 + \theta_2 x + \theta_3 x^2 + \cdots + \theta_p x^{p-1}
$$

In this case, our matrix $A \in \mathbb{R}^{N \times p}$ takes the form

$$
A = \begin{bmatrix} 1 & x^{(1)} & \cdots & (x^{(1)})^{p-1} \\ 1 & x^{(2)} & \cdots & (x^{(2)})^{p-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(N)} & \cdots & (x^{(N)})^{p-1} \end{bmatrix}
$$

which you might recognize as a Vandermonde matrix, encountered earlier in the class when discussing polynomial interpolation. An important property of such matrices is that their columns are linearly independent provided that the numbers $x^{(1)}, \ldots, x^{(N)}$ include at least $p$ different values. The figures below show examples of least squares fits of polynomials of degree 2, 6, 10, and 15 to a set of 100 data points.

Figure: Least squares polynomial fits of degree 2, 6, 10, and 15 to 100 data points.
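One way to compute fits like these (a sketch with synthetic data; the data shown in the figures is not reproduced here) is to build the Vandermonde matrix with np.vander and solve the resulting least squares problem:

import numpy as np

# Synthetic data
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 100)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)

p = 7                                   # model: polynomial of degree p - 1 = 6
A = np.vander(x, N=p, increasing=True)  # columns 1, x, x^2, ..., x^(p-1)
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ theta_hat
print("RMS error:", np.sqrt(np.mean((y - y_hat) ** 2)))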

8 Regression Models

We now consider the setting of vector-valued independent variables $\vv x \in \mathbb{R}^n$. The analog of a straight-line fit here is a linear regression model of the form:

$$
\hat{y} = \hat{f}(\vv x) = \boldsymbol\beta^{\top} \vv x + v,
$$

where $\boldsymbol\beta \in \mathbb{R}^{n}$ and $v \in \mathbb{R}$. If we set $\vv\theta = \begin{bmatrix} v \\ \boldsymbol\beta \end{bmatrix}$, then the model becomes:

$$
\hat{y} = \theta_1 + \theta_2 x_1 + \cdots + \theta_{n+1} x_n.
$$

We can view this as fitting within our general linear in the parameters model by setting $f_1(\vv x) = 1$ and $f_i(\vv x) = x_{i-1}$ for $i = 2, \ldots, n+1$, so that $p = n+1$.

We are of course not obliged to use these features. Instead, suppose that we have $p-1$ features $f_2(\vv x), \ldots, f_p(\vv x)$, and assume we have set $f_1(\vv x) = 1$, as is commonly done. If we define:

$$
\tilde{\vv x} = \begin{bmatrix} f_2(\vv x) \\ \vdots \\ f_p(\vv x) \end{bmatrix} \in \mathbb{R}^{p-1}
$$

we can write a linear regression model in the new feature vector $\tilde{\vv x}$:

$$
\hat{y} = \theta_1 f_1(\vv x) + \cdots + \theta_p f_p(\vv x) = \boldsymbol\beta^{\top} \tilde{\vv x} + v
$$

where:

  1. $\tilde{\vv x} = (f_2(\vv x), \ldots, f_p(\vv x))$ are the transformed features
  2. $v = \theta_1$ is called the affine term
  3. $\boldsymbol\beta = (\theta_2, \theta_3, \ldots, \theta_p)$ is the linear term
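For instance (a hypothetical sketch; the particular features below are assumptions, not from the notes), with transformed features $\tilde{\vv x} = (x_1, x_2, x_1 x_2)$ the regression model is fit by stacking a column of ones next to $\tilde{\vv x}$ and solving for $\vv\theta = (v, \boldsymbol\beta)$:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))   # raw features x in R^2

# Transformed features x_tilde = (x_1, x_2, x_1 * x_2)
X_tilde = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
y = 0.5 + X_tilde @ np.array([1.0, -2.0, 0.3]) + 0.05 * rng.normal(size=200)

# Stack f_1(x) = 1 next to the transformed features and solve for theta = (v, beta)
A = np.column_stack([np.ones(len(y)), X_tilde])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
v, beta = theta_hat[0], theta_hat[1:]   # affine term and linear term
print(v, beta)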

8.1 Application: Auto-Regressive Time Series Modeling

Here is a very widely used application of the above ideas in the context of time-series forecasting. Our goal here is to fit a model that predicts elements of a time series $z_1, z_2, \ldots$, where $z_t \in \mathbb{R}$ is a scalar quantity of interest.

A standard approach is to use an auto-regressive (AR) prediction model:

$$
\hat{z}_{t+1} = \theta_1 z_t + \theta_2 z_{t-1} + \cdots + \theta_M z_{t-M+1}, \quad t = M, M+1, \ldots \qquad \text{(AR)}
$$

In equation (AR), the parameter $M$ is the memory of the model, and $\hat{z}_{t+1}$ is the prediction of the next value based on the previous $M$ observations. Given observed values $z_1, \ldots, z_T$, we will choose $\vv\theta \in \mathbb{R}^M$ to minimize

$$
(\hat{z}_{M+1} - z_{M+1})^2 + \cdots + (\hat{z}_{T} - z_{T})^2.
$$

We can fit this within our regression model framework by setting

$$
y^{(i)} = z_{M+i}, \quad \vv x^{(i)} = \begin{bmatrix} z_{M+i-1} \\ z_{M+i-2} \\ \vdots \\ z_{i} \end{bmatrix} \in \mathbb{R}^M, \quad i = 1, \ldots, T-M.
$$

A little bit of bookkeeping allows us to conclude that we have $N = T - M$ examples and $p = M$ features.

Python break!

In the following code, we show how to fit an auto-regressive model to an online temperature data set stored in a spreadsheet (.csv file). Typically, such large data sets are loaded as a pandas DataFrame in Python. We convert the pandas DataFrame to a NumPy array and create our data for the auto-regressive model for a given memory. Then, we use np.linalg.lstsq to solve the least squares problem as we did before.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data source: https://www.ncei.noaa.gov/cdo-web/search
# Read CSV file into pandas DataFrame
df = pd.read_csv('LGA_temp.csv') # Temperature at LGA airport for 2010 (12 months)
temp = df['HLY-TEMP-NORMAL'].to_numpy()

memory = 8
T = temp.size
N = T - memory

# Function to create the dataset for AR model
def create_data(data, M):
    X, y = [], []
    for i in range(M, len(data)):
        X.append(data[i-M:i])
        y.append(data[i])
    return np.array(X), np.array(y)

# Create lagged dataset
X_data, y_data = create_data(temp, memory)

theta, residual, rank, sing_val = np.linalg.lstsq(X_data, y_data, rcond=None)
y_pred = X_data @ theta

print("RMS error: ", np.sqrt(residual/N))
# Plot predictions
plot_hours = 24*3 # first 3 days of 2010
plt.figure(figsize=(10, 5))
plt.scatter(np.arange(plot_hours), y_data[:plot_hours], color='green', label='Original Data')
plt.plot(y_pred[:plot_hours], label='Predictions')
plt.xlabel('Time (hours)')
plt.ylabel('Temperature (Fahrenheit)')
plt.title('Autoregressive Modeling')
plt.legend()
plt.show()
RMS error:  [0.34101808]

9 Model Selection, Generalization, and Validation

This section is entirely practical: these are standard “tricks of the trade” that you will revisit in more detail in more advanced classes on statistics and machine learning.

Our starting point is a philosophical question: what is the goal of a learned model? Perhaps surprisingly, it is NOT TO PREDICT OUTCOMES FOR THE GIVEN DATA; after all, we already have this data! Instead, we want to predict the outcome on new, unseen data.

If a model makes reasonable predictions on new unseen data, it is said to generalize. On the other hand, a model that makes poor predictions on new unseen data, but predicts the given data well, is said to be over-fit.

A standard way to check for generalization, called validation, is to split the available data into a training set and a test set. We then use only the training data to fit (or “train”) our model, and evaluate the model’s performance on the test set. If the prediction errors on the training and test sets are similar, then we guess the model will generalize. This is rarely guaranteed, but such a comparison is often predictive of a model’s generalization properties.

Validation is often used for model selection, i.e., to choose among different candidate models. For example, by comparing train/test errors, we can select between:

  1. Polynomial models of different degrees.
  2. Regression models with different sets of features.
  3. AR models with different memories.
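As a small illustration (a sketch with synthetic data and an assumed 80/20 split, not part of the notes), we can compare train and test RMS errors for polynomial models of several degrees; a model whose test error grows while its train error keeps shrinking is over-fit.

import numpy as np

# Synthetic data
rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 120)
y = np.cos(2 * x) + 0.15 * rng.normal(size=x.size)

# Random 80/20 train/test split
idx = rng.permutation(x.size)
train, test = idx[:96], idx[96:]

for degree in [1, 3, 8, 15]:
    # Vandermonde feature matrices for the train and test sets
    A_train = np.vander(x[train], N=degree + 1, increasing=True)
    A_test = np.vander(x[test], N=degree + 1, increasing=True)
    theta_hat, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    rms_train = np.sqrt(np.mean((y[train] - A_train @ theta_hat) ** 2))
    rms_test = np.sqrt(np.mean((y[test] - A_test @ theta_hat) ** 2))
    print(f"degree {degree}: train RMS {rms_train:.3f}, test RMS {rms_test:.3f}")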

Python break!

In the following code, we use the Polynomial.fit function from NumPy's polynomial module to fit a polynomial model by solving the corresponding least squares problem. We compare the RMS error and the quality of fit for varying degrees of polynomial using synthetic data.

import numpy as np
import matplotlib.pyplot as plt
from numpy.polynomial.polynomial import Polynomial

# Generate synthetic data
np.random.seed(334)
x = np.linspace(-2, 2, 100)
y = -0.5*x**13 + x**10 + 2*x**9 - x**7  + 0.3*x**6 - 3*x**4 + 2 * x**3 - 5 * x**2 + 3 * x + np.random.normal(0, 10, size=x.shape)*7
y_data = ((y - min(y))/(max(y) - min(y)))*10 - 5 # scaling to [-5, 5]
# Function to fit and plot polynomials of varying degrees
def fit_and_plot_polynomial(x, y, degrees):
    for degree in degrees:
        # Fit polynomial of given degree
        p, results = Polynomial.fit(x, y, degree, full=True) # results[0] is residuals
        y_fit = p(x)

        print(f'RMS for Degree {degree}: ', np.sqrt(results[0]/y.size))
        plt.figure(figsize=(12, 8))
        plt.scatter(x, y, label='Data', color='black')
        # Plot the polynomial fit
        plt.plot(x, y_fit, label=f'Degree {degree} fit')
    
        plt.xlabel('x')
        plt.ylabel('y')
        plt.title(f'Least Squares Data Fitting with Polynomial Degree {degree}')
        plt.legend()
        plt.show()

# Degrees of polynomials to fit
degrees = [2, 4, 8, 10]

# Fit and plot polynomials
fit_and_plot_polynomial(x, y_data, degrees)
RMS for Degree 2:  [0.89961455]
RMS for Degree 4:  [0.53118696]
RMS for Degree 8:  [0.12656145]
RMS for Degree 10:  [0.10145153]
