12.2 Matrix Norm
1 Reading
Material related to this page can be found in Lecture 9 of the Stanford CS168 course.
2 Learning Objectives
By the end of this page, you should know:
- what a matrix norm is
- the definition of the Frobenius norm and examples of computing it
- some important properties of the Frobenius norm
- how the Frobenius norm is used to measure the approximation error of a low-rank approximation of a matrix
3 A Matrix Norm
For an $m \times n$ matrix $M$, let $\hat{M}$ be a low-rank approximation of $M$, and define the approximation error as $E = M - \hat{M}$. Intuitively, a “good” approximation will lead to a “small” error $E$. But we need to quantify the “size” of $E$. We know that for vectors $v \in \mathbb{R}^n$, the right way to quantify the size of $v$ is through its norm $\|v\|$, where $\|\cdot\|$ is a function that needs to satisfy the axioms of a norm.
- $\|\alpha v\| = |\alpha| \, \|v\|$ for all $\alpha \in \mathbb{R}$ and $v \in \mathbb{R}^n$,
- $\|v\| \geq 0$ for all $v$, with $\|v\| = 0$ if and only if $v = 0$,
- $\|v + w\| \leq \|v\| + \|w\|$ for all $v, w$.
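As a quick sanity check, here is a minimal numpy sketch (my addition, not part of the original notes) that verifies these three axioms numerically for the Euclidean norm; the vectors and the scalar `alpha` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
v, w = rng.standard_normal(5), rng.standard_normal(5)
alpha = -3.7

# Homogeneity: ||alpha v|| = |alpha| ||v||
assert np.isclose(np.linalg.norm(alpha * v), abs(alpha) * np.linalg.norm(v))
# Non-negativity, and ||v|| = 0 exactly when v = 0
assert np.linalg.norm(v) > 0 and np.linalg.norm(np.zeros(5)) == 0
# Triangle inequality: ||v + w|| <= ||v|| + ||w||
assert np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w)
```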
It turns out we can define functions on the vector space of matrices that satisfy these same properties: these are called matrix norms. We’ll introduce one of them here that is particularly relevant to low-rank matrix approximations, but be aware that, just as for vectors, there are many kinds of matrix norms.
4 The Frobenius Norm
The Frobenius norm of an $m \times n$ matrix $A$ is defined as

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2}. \tag{F}$$

For example, if $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $\|A\|_F = \sqrt{1 + 4 + 9 + 16} = \sqrt{30} \approx 5.48$. We need a couple of properties of the Frobenius norm before we can connect the SVD to low-rank matrix approximation.
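To make the definition concrete, the following short numpy sketch (my addition, not from the lecture) computes $\|A\|_F$ entrywise as in (F) and checks it against numpy’s built-in Frobenius norm; the matrix is an arbitrary example.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Definition (F): square every entry, sum, take the square root.
frob_from_definition = np.sqrt(np.sum(A ** 2))
frob_from_numpy = np.linalg.norm(A, "fro")

print(frob_from_definition, frob_from_numpy)  # both ~9.5394 (= sqrt(91))
assert np.isclose(frob_from_definition, frob_from_numpy)
```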
- Property 1: For a square matrix $A$, $\|A\|_F = \|A^\top\|_F$.
This isn’t too hard to check from the definition (F): taking the transpose just swaps the roles of $i$ and $j$ in the sum, but you still end up adding together the squares of all entries of $A^\top$, which are the same as the squares of all of the entries of $A$.
- Property 2: If $Q$ is an orthogonal matrix and $A$ is a square matrix, then $\|QA\|_F = \|AQ\|_F = \|A\|_F$, i.e., the Frobenius norm of a matrix is unchanged by left or right multiplication by an orthogonal matrix.
To see why this is true, recall that if $a_1, \dots, a_n$ are the columns of $A$, then $QA = \begin{bmatrix} Qa_1 & \cdots & Qa_n \end{bmatrix}$. Then, since we can write the Frobenius norm squared of a matrix as the sum of the squared Euclidean norms of its columns, we have:

$$\|QA\|_F^2 = \sum_{i=1}^{n} \|Qa_i\|^2 = \sum_{i=1}^{n} \|a_i\|^2 = \|A\|_F^2.$$

Here, the second equality holds because multiplying a vector by an orthogonal matrix does not change its Euclidean norm. Finally, we use this and Property 1 to conclude:

$$\|AQ\|_F = \|(AQ)^\top\|_F = \|Q^\top A^\top\|_F = \|A^\top\|_F = \|A\|_F,$$

where the third equality holds because $Q^\top$ is also orthogonal.
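The following sketch (my addition) checks Properties 1 and 2 numerically: the Frobenius norm is unchanged by transposition and by multiplication with an orthogonal matrix. The orthogonal matrix $Q$ is obtained from a QR factorization of a random matrix, which is just one convenient way to produce an orthogonal matrix for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # Q is orthogonal

fro = lambda X: np.linalg.norm(X, "fro")

# Property 1: ||A||_F = ||A^T||_F
assert np.isclose(fro(A), fro(A.T))
# Property 2: ||QA||_F = ||AQ||_F = ||A||_F
assert np.isclose(fro(Q @ A), fro(A))
assert np.isclose(fro(A @ Q), fro(A))
```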
We will measure the quality of our rank-$k$ approximation $\hat{M}_k$ from (SVD-k) to $M$ in terms of the Frobenius norm of their difference, $\|M - \hat{M}_k\|_F$.
The following theorem tells us that the SVD-based approximation (SVD-k) is optimal with respect to the Frobenius norm of the approximation error!

**Theorem 1** (Eckart–Young): Let $\hat{M}_k$ be the rank-$k$ approximation of $M$ defined in (SVD-k). Then for every matrix $\hat{M}$ of rank at most $k$,

$$\|M - \hat{M}_k\|_F \leq \|M - \hat{M}\|_F.$$
We won’t formally prove this theorem, but let’s get some intuition as to why this is true.
4.1 Understanding Theorem 1
To keep things simple, we’ll assume $M$ is square and full rank, i.e., $M \in \mathbb{R}^{n \times n}$ with rank $n$. Nearly the exact same argument works for general $M$, but we would have to use the non-compact SVD of $M$ (which keeps zero singular values around).
Our goal is to find a rank-$k$ matrix $\hat{M}$ which minimizes $\|M - \hat{M}\|_F$. Let $M = U \Sigma V^\top$ be the SVD of $M$, where $U, \Sigma, V \in \mathbb{R}^{n \times n}$ since rank$(M) = n$. By Property 2 of the Frobenius norm, we then have the following sequence of equalities:

$$\|M - \hat{M}\|_F = \|U^\top (M - \hat{M}) V\|_F = \|U^\top M V - U^\top \hat{M} V\|_F = \|\Sigma - U^\top \hat{M} V\|_F.$$
Now notice that since $\Sigma$ is a diagonal matrix, any non-diagonal entry in $U^\top \hat{M} V$ adds to our approximation error, so $U^\top \hat{M} V$ should be diagonal. Let $U^\top \hat{M} V = D$ for some diagonal matrix $D = \mathrm{diag}(d_1, \dots, d_n)$. Then

$$\|M - \hat{M}\|_F^2 = \|\Sigma - D\|_F^2 = \sum_{i=1}^{n} (\sigma_i - d_i)^2. \tag{6}$$
Therefore, we want to pick the diagonal entries $d_i$ of $D$ to minimize the right-most expression in (6). If there were no rank restriction on $\hat{M}$, we would simply set $d_i = \sigma_i$ for all $i$. However, notice that $\hat{M} = U D V^\top$ is an SVD of $\hat{M}$! Therefore, for $\hat{M}$ to be rank $k$, only $k$ of the $d_i$ can be nonzero: if we can only knock off $k$ of the terms in (6), we should pick the top $k$, i.e., set $d_i = \sigma_i$ for $i = 1, \dots, k$ and $d_i = 0$ for $i = k+1, \dots, n$.
Then,

$$\hat{M} = U D V^\top = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$$

is exactly the expression in (SVD-k), and the squared approximation error it incurs is

$$\|M - \hat{M}\|_F^2 = \sum_{i=k+1}^{n} \sigma_i^2,$$

i.e., the sum of the squares of the “tail” singular values of $M$.
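The sketch below (my addition, using a random test matrix as an assumed stand-in for $M$) builds the rank-$k$ truncated-SVD approximation, checks that its squared Frobenius error equals the sum of the squared tail singular values as derived above, and compares it against a different rank-$k$ matrix as a crude illustration of the optimality claimed by Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
M = rng.standard_normal((n, n))

U, s, Vt = np.linalg.svd(M)                    # M = U @ diag(s) @ Vt
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # keep only the top k singular values

# Squared error equals the sum of the squared "tail" singular values.
err_sq = np.linalg.norm(M - M_k, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)
assert np.isclose(err_sq, tail_sq)

# Any other rank-k matrix should do no better (Eckart-Young); here we try one
# built from random factors as a simple comparison.
B = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
assert np.linalg.norm(M - M_k, "fro") <= np.linalg.norm(M - B, "fro")
```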
Finally, we address an obvious question when applying these ideas in practice: how should we pick the rank of our approximation?
In a perfect world, the singular values of the original data matrix will give strong guidance: if the top few singular values are much larger than the rest, then the obvious solution is to take $k$ to be the number of big values. This was the case in the handset example from the previous lecture: the first singular value was significantly larger than the others, suggesting that a rank-1 approximation would be a good choice (which was image (d)).
In less clear settings, the rule of thumb is to take $k$ as small as possible while still providing a “useful” approximation of the original data. For example, it is common to choose $k$ so that the sum of the top $k$ singular values is at least $c$ times larger than the sum of the remaining singular values. The ratio $c$ is typically a domain-dependent constant picked based on the application.
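A minimal sketch of this rule of thumb (my addition; the helper name `choose_rank`, the threshold $c$, and the example spectrum are all illustrative choices, not from the lecture):

```python
import numpy as np

def choose_rank(singular_values, c=10.0):
    """Smallest k with sum of top-k singular values >= c * sum of the rest."""
    s = np.asarray(singular_values)
    for k in range(1, len(s) + 1):
        if s[:k].sum() >= c * s[k:].sum():
            return k
    return len(s)

# Example: a sharply decaying spectrum suggests a small k.
s = np.array([100.0, 40.0, 2.0, 1.0, 0.5])
print(choose_rank(s, c=10.0))  # -> 2, since 100 + 40 >= 10 * (2 + 1 + 0.5)
```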