
12.1 Motivation and SVD

Dept. of Electrical and Systems Engineering
University of Pennsylvania


Lecture notes

1 Reading

Material related to this page can be found in Lecture 9 of the Stanford CS168 course.

2 Learning Objectives

By the end of this page, you should know:

  • what matrix completion is
  • the motivating applications for low-rank matrix approximation
  • how the SVD of a matrix is used to find a low-rank approximation

3 What are the Missing Entries?

Suppose that I run a movie streaming service for three of my friends: Amy, Bob, and Carol. It’s a very specialized service, with only five movie options: The Matrix, Inception, Star Wars: Episode 1, Moana, and Inside Out. After a month, I ask Amy, Bob, and Carol to rate the movies they’ve watched from one to five. We collect their ratings in the table below (unrated movies are marked with ?):

Table 1: Movie Ratings

         The Matrix   Inception   Star Wars: Ep. 1   Moana   Inside Out
Amy      2            ?           ?                  ?       5
Bob      ?            3           4                  ?       2
Carol    ?            ?           2                  1       ?

We are asked to provide recommendations to Amy, Bob, and Carol as to which movie they should watch next. Said another way, we are asked to fill in the unknown ? ratings in the table above.

This seems a bit unfair! Each of the unknown entries could be any value from 1 to 5, after all! But what if I told you an additional hint: Amy, Bob, and Carol have the same relative preferences for each movie. For example, Amy likes Inside Out \frac{5}{2} times as much as Bob does, and this ratio is the same across all movies. Mathematically, we are assuming that every row of the table above is a multiple of every other row, or equivalently, that all columns of the table are multiples of each other.

Thus we can conclude that Bob's rating of The Matrix is \frac{2}{5} \cdot (\text{Amy's rating}) = \frac{2}{5} \cdot 2 = \frac{4}{5}. Similarly, since Carol's rating of Star Wars is half of Bob's, Carol's rating of Inception is \frac{1}{2} \cdot (\text{Bob's rating}) = 1.5, Carol's rating of Inside Out is \frac{1}{2} \cdot (\text{Bob's rating}) = 1, and so on. Here’s the completed matrix:

M = \begin{bmatrix} 2 & 7.5 & 10 & 5 & 5 \\ 0.8 & 3 & 4 & 2 & 2 \\ 0.4 & 1.5 & 2 & 1 & 1 \end{bmatrix}
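
As a quick sanity check, here is a minimal NumPy sketch (not part of the original derivation; the scaling factors are simply the ratios inferred above) that confirms the completed matrix has rank 1 and rebuilds it from Amy's row:

```python
import numpy as np

# Completed ratings matrix: rows = Amy, Bob, Carol; columns = the five movies.
M = np.array([
    [2.0, 7.5, 10.0, 5.0, 5.0],   # Amy
    [0.8, 3.0,  4.0, 2.0, 2.0],   # Bob
    [0.4, 1.5,  2.0, 1.0, 1.0],   # Carol
])

# Every row is a multiple of Amy's row, so M has rank 1.
print(np.linalg.matrix_rank(M))      # 1

# Rebuild M as an outer product: Bob = (2/5) * Amy, Carol = (1/2) * Bob = (1/5) * Amy.
scales = np.array([1.0, 2/5, 1/5])
M_rebuilt = np.outer(scales, M[0])
print(np.allclose(M, M_rebuilt))     # True
```

The outer product of a column of per-person scales with a single row of ratings is exactly a rank-1 matrix, which is the structure we assumed.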

The point of this example is that when you know something about the structure of a partially known matrix, it is sometimes possible to intelligently fill in the missing entries. In this example, the assumption that every column is a multiple of every other column means that rank M = 1 (since \dim \text{col}(M) = 1), which is a pretty extreme structural assumption! A natural and useful relaxation is to assume that the matrix M is low-rank. What rank counts as “low” is application dependent, but it typically means that for a matrix M \in \mathbb{R}^{m \times n}, rank M = r \ll \min\{m,n\}.

This lecture will explore how we can use this idea of structure to solve the matrix completion problem by finding the best low-rank approximation to a partially known matrix. The SVD will of course be our main tool.

4 Low-Rank Matrix Approximations: Motivation

Before diving into the math, let’s highlight some applications of low-rank matrix approximation:

  1. Compression: We saw this idea last class, but it’s worth revisiting through the lens of low-rank approximations. If the original matrix M \in \mathbb{R}^{m \times n} is described by mn numbers, then a rank k approximation requires k(m+n) numbers. To see this, recall that if M has rank k, then we can write its SVD as:

    \begin{align*} M &= \bm U \em_{m \times k} \bm \Sigma \em_{k \times k} \bm V^T \em_{k \times n} \quad \left(\Sigma^{\frac{1}{2}} = \text{diag}(\sigma_1^{\frac{1}{2}}, \ldots, \sigma_k^{\frac{1}{2}})\right) \\ &= \bm U\Sigma^{\frac{1}{2}} = Y \em_{m \times k} \bm \Sigma^{\frac{1}{2}}V^T = Z^T \em_{k \times n} \end{align*}

    that is, as a product \hat{M} = YZ^T where Y \in \mathbb{R}^{m \times k} and Z \in \mathbb{R}^{n \times k}. For example, if M represents a grayscale image (with entries equal to pixel intensities), m and n are typically in the hundreds (or thousands for HD images), and a modest value of k (roughly 100 to 150) is usually enough to give a good approximation of the original image.

  2. Updating Huge AI Models: A modern application of low-rank matrix approximation is “fine-tuning” huge AI models. In the setting of large language models (LLMs) like ChatGPT, we are typically given some huge off-the-shelf model with billions (or more) of parameters. Given this large model, which has been trained on an enormous but generic corpus of text from the web, one often performs “fine-tuning”: a second round of training, typically using a much smaller domain-specific dataset (for example, the lecture notes for this class could be used to fine-tune a “LinearAlgebraGPT”). The difficulty is that because these models are so big, computing and storing such updates is extremely expensive. The 2021 paper LoRA: Low-Rank Adaptation of Large Language Models argued that fine-tuning updates are generally approximately low-rank and that one can learn these updates in their factored YZ^T form, allowing model fine-tuning with 1000x-10000x fewer parameters.

  3. Denoising: If M is a noisy version of some “true” matrix that is approximately low-rank, then finding a low-rank approximation to M will typically remove a lot of noise (and maybe some signal), resulting in a matrix that is actually more informative than the original.

  4. Matrix Completion: Low-rank approximations offer a way of solving the matrix completion problem we introduced above. Given a matrix M with missing entries, the first step is to obtain a full matrix \hat{M} by filling in the missing entries with “default” values: choosing these default values often requires trial and error, but natural things to try include 0, or the average of the known entries in the same column, the same row, or the entire matrix. The second step is then to find a rank k approximation to \hat{M}, as sketched in the code below. This approach works well when the unknown matrix is close to a rank k matrix and there aren’t too many missing entries.
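
To make the two-step recipe in item 4 concrete, here is a minimal NumPy sketch applied to the movie-ratings example, assuming we fill missing entries with the overall mean of the known entries (one of several reasonable defaults) and then truncate to rank k = 1:

```python
import numpy as np

# Partially observed ratings (np.nan marks a missing entry).
# Rows: Amy, Bob, Carol; columns: the five movies.
M_obs = np.array([
    [2.0,    np.nan, np.nan, np.nan, 5.0],
    [np.nan, 3.0,    4.0,    np.nan, 2.0],
    [np.nan, np.nan, 2.0,    1.0,    np.nan],
])

# Step 1: fill missing entries with a "default" value
# (here the mean of all known entries; per-row or per-column means are other natural choices).
fill_value = np.nanmean(M_obs)
M_filled = np.where(np.isnan(M_obs), fill_value, M_obs)

# Step 2: best rank-k approximation of the filled matrix via the truncated SVD.
k = 1
U, s, Vt = np.linalg.svd(M_filled, full_matrices=False)
M_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(M_hat, 2))
```

Swapping in per-row or per-column means only changes the fill step; the truncation step is unchanged.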

With this motivation in mind, let’s see how the SVD can help us find a good rank r approximation of a matrix M. Once we’ve described our procedure and seen some examples of it in action, we’ll make precise the sense in which our method actually produces the “best” possible rank r approximation.

5 Low-Rank Approximations from the SVD

Let M \in \mathbb{R}^{m \times n} be an m \times n matrix, which we’ll assume has rank r. Then the SVD of M is given by

M = U \Sigma V^T = \sum_{i=1}^r \sigma_i \vv u_i \vv v_i^T \quad \text{(SVD)}

for U = \bm \vv u_1 \cdots \vv u_r \em \in \mathbb{R}^{m \times r}, V = \bm \vv v_1 \cdots \vv v_r \em \in \mathbb{R}^{n \times r}, and \Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r) the matrices of left singular vectors, right singular vectors, and singular values, respectively.

The right-most form of (SVD) is particularly convenient for our purposes: it expresses M as a sum of rank-1 matrices \sigma_i \vv u_i \vv v_i^T with mutually orthogonal column and row spaces.

This sum suggests a very natural way of forming a rank k approximation to M: simply truncate the sum to the top k terms, as measured by the singular values \sigma_i:

\hat{M}_k = \sum_{i=1}^k \sigma_i \vv u_i \vv v_i^T = U_k \Sigma_k V_k^T \quad \text{(SVD-k)}

where the right-most expression is defined in terms of the truncated matrices:

U_k = \bm \vv u_1 \cdots \vv u_k \em \in \mathbb{R}^{m \times k}, \quad V_k = \bm \vv v_1 \cdots \vv v_k \em \in \mathbb{R}^{n \times k}, \quad \Sigma_k = \text{diag}(\sigma_1, \ldots, \sigma_k) \in \mathbb{R}^{k \times k}

Before analyzing the properties of \hat{M}_k = U_k \Sigma_k V_k^T, let’s examine whether \hat{M}_k could plausibly address our motivating applications. Storing the matrices U_k, V_k, and \Sigma_k requires storing km + kn + k^2 \approx k(m+n) numbers when k \ll \min\{m, n\}, which is much less than the mn numbers needed to store M \in \mathbb{R}^{m \times n} when m and n are relatively large.
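
To see the storage count in action, here is a minimal NumPy sketch on a synthetic, approximately rank-k matrix; the sizes m, n, k and the noise level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 300, 20

# A synthetic matrix that is approximately rank k: low-rank signal plus small noise.
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))

# Truncated SVD: keep only the top-k singular triples.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
M_hat_k = U_k @ S_k @ Vt_k

# Storage comparison: k(m + n) + k^2 numbers versus mn for the full matrix.
print("full matrix entries:  ", m * n)                  # 150000
print("rank-k factor entries:", k * (m + n) + k**2)     # 16400
print("relative error:", np.linalg.norm(M - M_hat_k) / np.linalg.norm(M))
```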

It is also natural to interpret (SVD-k) as approximating the raw data M in terms of k “concepts” (e.g., “sci-fi”, “romcom”, “drama”, “classic”). The singular values \sigma_1, \ldots, \sigma_k express the “prominence” of the concepts; the rows of V^T and the columns of U express the “typical row/column” associated with each concept (e.g., a viewer who likes only sci-fi movies, or a movie liked only by romcom viewers); and the rows of U (or columns of V^T) approximately express each row (or column) of M as a linear combination (scaled by \sigma_1, \ldots, \sigma_k) of the “typical rows” (or “typical columns”).

This method of producing a low-rank approximation is beautiful: we interpret the SVD of a matrix M as a list of “ingredients” ordered by “importance”, and we retain only the k most important ingredients. But is this elegant procedure any “good”?
