
12.1 Motivation and SVD

Dept. of Electrical and Systems Engineering
University of Pennsylvania


Lecture notes

1 Reading

Material related to this page can be found in Lecture 9 of the Stanford CS168 course.

2 Learning Objectives

By the end of this page, you should know:

  • what matrix completion is
  • the motivating applications for low-rank matrix approximation
  • how the SVD of a matrix is used to find a low-rank approximation

3 What are the Missing Entries?

Suppose that I run a movie streaming service for three of my friends: Amy, Bob, and Carol. It’s a very specialized service, with only five movie options: The Matrix, Inception, Star Wars: Episode 1, Moana, and Inside Out. After a month, I ask Amy, Bob, and Carol to rate the movies they’ve watched from one to five. We collect their ratings in the table below (unrated movies are marked with ?):

Table 1: Movie Ratings

         The Matrix   Inception   Star Wars: Ep. 1   Moana   Inside Out
Amy      2            ?           ?                  ?       5
Bob      ?            3           4                  ?       2
Carol    ?            ?           2                  1       ?

We are asked to provide recommendations to Amy, Bob, and Carol as to which movie they should watch next. Said another way, we are asked to fill in the unknown ? ratings in the table above.

This seems a bit unfair! Each of the unknown entries could be any value from 1 to 5, after all! But what if I told you an additional hint: Amy, Bob, and Carol have the same relative preferences for each movie. For example, Amy likes Inside Out \frac{5}{2} times as much as Bob does, and this ratio is the same across all movies. Mathematically, we are assuming that every row of the table above is a multiple of every other row, or equivalently, that all columns of the table are multiples of each other.

Thus we can conclude that Bob's rating of The Matrix is \frac{2}{5} \cdot (\text{Amy's rating}) = \frac{2}{5} \cdot 2 = \frac{4}{5}. Similarly, since Carol's rating of Star Wars is half of Bob's, Carol's rating of Inception is \frac{1}{2} \cdot (\text{Bob's rating}) = 1.5, Carol's rating of Inside Out is \frac{1}{2} \cdot (\text{Bob's rating}) = 1, and so on. Here’s the completed matrix:

M = \begin{bmatrix} 2 & 7.5 & 10 & 5 & 5 \\ 0.8 & 3 & 4 & 2 & 2 \\ 0.4 & 1.5 & 2 & 1 & 1 \end{bmatrix}
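
As a quick sanity check, here is a minimal NumPy sketch (not part of the original derivation; the scaling factors are simply the ratios inferred above) that confirms the completed matrix has rank 1 and rebuilds it from Amy's row:

```python
import numpy as np

# Completed ratings matrix: rows = Amy, Bob, Carol; columns = the five movies.
M = np.array([
    [2.0, 7.5, 10.0, 5.0, 5.0],   # Amy
    [0.8, 3.0,  4.0, 2.0, 2.0],   # Bob
    [0.4, 1.5,  2.0, 1.0, 1.0],   # Carol
])

# Every row is a multiple of Amy's row, so M has rank 1.
print(np.linalg.matrix_rank(M))      # 1

# Rebuild M as an outer product: Bob = (2/5) * Amy, Carol = (1/2) * Bob = (1/5) * Amy.
scales = np.array([1.0, 2/5, 1/5])
M_rebuilt = np.outer(scales, M[0])
print(np.allclose(M, M_rebuilt))     # True
```

The outer product of a column of per-person scales with a single row of ratings is exactly a rank-1 matrix, which is the structure we assumed.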

The point of this example is that when you know something about the structure of a partially known matrix, it is sometimes possible to intelligently fill in the missing entries. In this example, the assumption that every column is a multiple of every other column means that rank M = 1 (since \dim \text{col}(M) = 1), which is a pretty extreme structural assumption! A natural and useful relaxation is to assume that the matrix M is low-rank. What rank counts as “low” is application dependent, but it typically means that for a matrix M \in \mathbb{R}^{m \times n}, rank M = r \ll \min\{m,n\}.

This lecture will explore how we can use this idea of structure to solve the matrix completion problem by finding the best low-rank approximation to a partially known matrix. The SVD will of course be our main tool.

4 Low-Rank Matrix Approximations: Motivation

Before diving into the math, let’s highlight some applications of low-rank matrix approximation:

  1. Compression: We saw this idea last class, but it’s worth revisiting through the lens of low-rank approximations. If the original matrix M \in \mathbb{R}^{m \times n} is described by mn numbers, then a rank k approximation requires k(m+n) numbers. To see this, recall that if M has rank k, then we can write its SVD as:

    \begin{align*} M &= \bm U \em_{m \times k} \bm \Sigma \em_{k \times k} \bm V^T \em_{k \times n} \quad \left(\Sigma^{\frac{1}{2}} = \text{diag}(\sigma_1^{\frac{1}{2}}, \ldots, \sigma_k^{\frac{1}{2}})\right) \\ &= \bm U\Sigma^{\frac{1}{2}} = Y \em_{m \times k} \bm \Sigma^{\frac{1}{2}}V^T = Z^T \em_{k \times n} \end{align*}

    that is, as a product \hat{M} = YZ^T where Y \in \mathbb{R}^{m \times k} and Z \in \mathbb{R}^{n \times k}. For example, if M represents a grayscale image (with entries equal to pixel intensities), m and n are typically in the hundreds (or thousands for HD images), and a modest value of k (roughly 100 to 150) is usually enough to give a good approximation of the original image.

  2. Updating Huge AI Models: A modern application of low-rank matrix approximation is “fine-tuning” huge AI models. In the setting of large language models (LLMs) like ChatGPT, we are typically given some huge off-the-shelf model with billions (or more) of parameters. Given this large model, which has been trained on an enormous but generic corpus of text from the web, one often performs “fine-tuning”: a second round of training, typically using a much smaller domain-specific dataset (for example, the lecture notes for this class could be used to fine-tune a “LinearAlgebraGPT”). The difficulty is that because these models are so big, computing and storing such updates is extremely expensive. The 2021 paper LoRA: Low-Rank Adaptation of Large Language Models argued that fine-tuning updates are generally approximately low-rank and that one can learn these updates in their factored YZ^T form, allowing model fine-tuning with 1000x-10000x fewer parameters.

  3. Denoising: If M is a noisy version of some “true” matrix that is approximately low-rank, then finding a low-rank approximation to M will typically remove a lot of noise (and maybe some signal), resulting in a matrix that is actually more informative than the original.

  4. Matrix Completion: Low-rank approximations offer a way of solving the matrix completion problem we introduced above. Given a matrix M with missing entries, the first step is to obtain a full matrix \hat{M} by filling in the missing entries with “default” values: choosing these default values often requires trial and error, but natural things to try include 0, or the average of the known entries in the same column, the same row, or the entire matrix. The second step is then to find a rank k approximation to \hat{M}, as sketched in the code below. This approach works well when the unknown matrix is close to a rank k matrix and there aren’t too many missing entries.
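
To make the two-step recipe in item 4 concrete, here is a minimal NumPy sketch applied to the movie-ratings example, assuming we fill missing entries with the overall mean of the known entries (one of several reasonable defaults) and then truncate to rank k = 1:

```python
import numpy as np

# Partially observed ratings (np.nan marks a missing entry).
# Rows: Amy, Bob, Carol; columns: the five movies.
M_obs = np.array([
    [2.0,    np.nan, np.nan, np.nan, 5.0],
    [np.nan, 3.0,    4.0,    np.nan, 2.0],
    [np.nan, np.nan, 2.0,    1.0,    np.nan],
])

# Step 1: fill missing entries with a "default" value
# (here the mean of all known entries; per-row or per-column means are other natural choices).
fill_value = np.nanmean(M_obs)
M_filled = np.where(np.isnan(M_obs), fill_value, M_obs)

# Step 2: best rank-k approximation of the filled matrix via the truncated SVD.
k = 1
U, s, Vt = np.linalg.svd(M_filled, full_matrices=False)
M_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(M_hat, 2))
```

Swapping in per-row or per-column means only changes the fill step; the truncation step is unchanged.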

With this motivation in mind, let’s see how the SVD can help us find a good rank r approximation of a matrix M. Once we’ve described our procedure and seen some examples of it in action, we’ll make precise the sense in which our method actually produces the “best” possible rank r approximation.

5 Low-Rank Approximations from the SVD

Let M \in \mathbb{R}^{m \times n} be an m \times n matrix, which we’ll assume has rank r. Then the SVD of M is given by

M = U \Sigma V^T = \sum_{i=1}^r \sigma_i \vv u_i \vv v_i^T \quad \text{(SVD)}

for U = \bm \vv u_1 \cdots \vv u_r \em \in \mathbb{R}^{m \times r}, V = \bm \vv v_1 \cdots \vv v_r \em \in \mathbb{R}^{n \times r}, and \Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r) the matrices of left singular vectors, right singular vectors, and singular values, respectively.

The right-most form of (SVD) is particularly convenient for our purposes: it expresses M as a sum of rank-1 matrices \sigma_i \vv u_i \vv v_i^T with mutually orthogonal column and row spaces.

This sum suggests a very natural way of forming a rank k approximation to M: simply truncate the sum to the top k terms, as measured by the singular values \sigma_i:

\hat{M}_k = \sum_{i=1}^k \sigma_i \vv u_i \vv v_i^T = U_k \Sigma_k V_k^T \quad \text{(SVD-k)}

where the right-most expression is defined in terms of the truncated matrices:

U_k = \bm \vv u_1 \cdots \vv u_k \em \in \mathbb{R}^{m \times k}, \quad V_k = \bm \vv v_1 \cdots \vv v_k \em \in \mathbb{R}^{n \times k}, \quad \Sigma_k = \text{diag}(\sigma_1, \ldots, \sigma_k) \in \mathbb{R}^{k \times k}

Before analyzing the properties of \hat{M}_k = U_k \Sigma_k V_k^T, let’s examine whether \hat{M}_k could plausibly address our motivating applications. Storing the matrices U_k, V_k, and \Sigma_k requires storing km + kn + k^2 \approx k(m+n) numbers when k \ll \min\{m, n\}, which is much less than the mn numbers needed to store M \in \mathbb{R}^{m \times n} when m and n are relatively large.
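
To see the storage count in action, here is a minimal NumPy sketch on a synthetic, approximately rank-k matrix; the sizes m, n, k and the noise level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 300, 20

# A synthetic matrix that is approximately rank k: low-rank signal plus small noise.
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))

# Truncated SVD: keep only the top-k singular triples.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
M_hat_k = U_k @ S_k @ Vt_k

# Storage comparison: k(m + n) + k^2 numbers versus mn for the full matrix.
print("full matrix entries:  ", m * n)                  # 150000
print("rank-k factor entries:", k * (m + n) + k**2)     # 16400
print("relative error:", np.linalg.norm(M - M_hat_k) / np.linalg.norm(M))
```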

It is also natural to interpret (SVD-k) as approximating the raw data M in terms of k “concepts” (e.g., “sci-fi”, “romcom”, “drama”, “classic”). The singular values \sigma_1, \ldots, \sigma_k express the “prominence” of the concepts; the rows of V^T and the columns of U express the “typical row/column” associated with each concept (e.g., a viewer who likes only sci-fi movies, or a movie liked only by romcom viewers); and the rows of U (or columns of V^T) approximately express each row (or column) of M as a linear combination (scaled by \sigma_1, \ldots, \sigma_k) of the “typical rows” (or “typical columns”).

This method of producing a low-rank approximation is beautiful: we interpret the SVD of a matrix M as a list of “ingredients” ordered by “importance”, and we retain only the k most important ingredients. But is this elegant procedure any “good”?
