
12.2 Matrix Norm

Dept. of Electrical and Systems Engineering
University of Pennsylvania


Lecture notes

1 Reading

Material related to this page can be found in Lecture 9 of the Stanford CS168 course.

2 Learning Objectives

By the end of this page, you should know:

  • what a matrix norm is
  • the definition of the Frobenius norm and examples of computing it
  • some important properties of the Frobenius norm
  • how the Frobenius norm is used to measure the approximation error of a low-rank approximation of a matrix

3 A Matrix Norm

For an $m \times n$ matrix $M \in \mathbb{R}^{m \times n}$, let $\hat{M}$ be a low-rank approximation of $M$, and define the approximation error as $E = M - \hat{M}$. Intuitively, a “good” approximation will lead to “small” error $E$. But we need to quantify the “size” of $E \in \mathbb{R}^{m \times n}$. We know that for vectors $\vv x \in \mathbb{R}^n$, the right way to quantify the size of $\vv x$ was through its norm $\|\vv x\|$, where $\|\cdot\|$ is a function that needs to satisfy the axioms of a norm.

  1. $\|a \vv x\| = |a| \|\vv x\|$ for all $\vv x \in \mathbb{R}^n$, $a \in \mathbb{R}$
  2. $\|\vv x\| \geq 0$ for all $\vv x \in \mathbb{R}^n$, with $\|\vv x\| = 0$ if and only if $\vv x = 0$
  3. $\|\vv x + \vv y\| \leq \|\vv x\| + \|\vv y\|$ for all $\vv x, \vv y \in \mathbb{R}^n$
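
These axioms are easy to check numerically for the familiar Euclidean norm. Below is a minimal NumPy sketch; the random vectors and the scalar `a` are arbitrary choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
a = -3.7

# Axiom 1: absolute homogeneity, ||a x|| = |a| ||x||
print(np.isclose(np.linalg.norm(a * x), abs(a) * np.linalg.norm(x)))

# Axiom 2: non-negativity, ||x|| >= 0, with ||x|| = 0 only for x = 0
print(np.linalg.norm(x) >= 0, np.linalg.norm(np.zeros(5)) == 0)

# Axiom 3: triangle inequality, ||x + y|| <= ||x|| + ||y||
print(np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y))
```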

It turns out we can define functions on the vector space of m×nm\times n matrices that satisfy these same properties: these are called matrix norms. We’ll introduce one of them here that is particularly relevant to low-rank matrix approximations, but be aware that just as for vectors there are many kinds of matrix norms.

4 The Frobenius Norm

We need a couple of properties of the Frobenius norm before we can connect the SVD to low-rank matrix approximation. Recall that the Frobenius norm of a matrix $A \in \mathbb{R}^{m \times n}$ is the square root of the sum of the squares of all of its entries:

$$\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2}. \tag{F}$$
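
As a quick sanity check, the sum-of-squares formula (F) can be compared against NumPy’s built-in Frobenius norm; this is a minimal sketch with an arbitrarily chosen example matrix.

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])

# Frobenius norm straight from the definition (F): sqrt of the sum of squared entries
fro_from_def = np.sqrt(np.sum(A**2))

# NumPy's built-in Frobenius norm
fro_builtin = np.linalg.norm(A, 'fro')

print(fro_from_def, fro_builtin)  # both equal sqrt(91) ≈ 9.539
```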

  1. Property 1: For $A \in \mathbb{R}^{n \times n}$ a square matrix, $\|A\|_F = \|A^T\|_F$.

This isn’t too hard to check from the definition (F): taking the transpose just swaps the roles of $i$ and $j$ in the sum, but you still end up adding together the squares of all entries of $A$, which are the same as the squares of all entries of $A^T$.

  2. Property 2: If $Q \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $A \in \mathbb{R}^{n \times n}$ is a square matrix, then $\|QA\|_F = \|AQ\|_F = \|A\|_F$, i.e., the Frobenius norm of a matrix $A$ is unchanged by left or right multiplication by an orthogonal matrix.

To see why this is true, recall that if $A = \bm \vv a_1 \cdots \vv a_n \em$ has columns $\vv a_1, \ldots, \vv a_n$, then $QA = \bm Q\vv a_1 \cdots Q\vv a_n \em$. Since we can write the squared Frobenius norm of a matrix as the sum of the squared Euclidean norms of its columns, we have:

$$\begin{align*} \|QA\|_F^2 &= \|Q\vv a_1\|^2 + \cdots + \|Q\vv a_n\|^2 \\ &= \|\vv a_1\|^2 + \cdots + \|\vv a_n\|^2 = \|A\|_F^2. \end{align*}$$

Here, the second equality holds because multiplying a vector by an orthogonal matrix does not change its Euclidean norm. Finally we use this and Property 1 to conclude:

$$\|AQ\|_F = \|Q^T A^T\|_F \underbrace{=}_{\text{since } Q^T \text{ is also orthogonal}} \|A^T\|_F \underbrace{=}_{\text{Property 1}} \|A\|_F.$$
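
Both properties are easy to verify numerically. The sketch below uses a random square matrix and an orthogonal matrix obtained from a QR factorization; both are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

# Build a random orthogonal matrix Q from the QR factorization of a random matrix
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

fro = lambda X: np.linalg.norm(X, 'fro')

# Property 1: ||A||_F = ||A^T||_F
print(np.isclose(fro(A), fro(A.T)))

# Property 2: ||QA||_F = ||AQ||_F = ||A||_F for orthogonal Q
print(np.isclose(fro(Q @ A), fro(A)), np.isclose(fro(A @ Q), fro(A)))
```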

We will measure the quality of our rank-$k$ approximation (SVD-k) $\hat{M}$ to $M$ in terms of the Frobenius norm of their difference.

The following theorem tells us that the SVD-based approximation (SVD-k) is optimal with respect to the Frobenius norm of the approximation error!

Theorem 1: Let $M \in \mathbb{R}^{m \times n}$ have singular values $\sigma_1 \geq \sigma_2 \geq \cdots$, and let $\hat{M} = \sum_{i=1}^k \sigma_i \vv u_i \vv v_i^T$ be the rank-$k$ approximation (SVD-k). Then for any matrix $B \in \mathbb{R}^{m \times n}$ with $\text{rank } B \leq k$, we have $\|M - \hat{M}\|_F \leq \|M - B\|_F$.

We won’t formally prove this theorem, but let’s get some intuition as to why this is true.

4.1 Understanding Theorem 1

To keep things simple, we’ll assume $M$ is square and full rank, i.e., $M \in \mathbb{R}^{n \times n}$ with $\text{rank } M = n$. Nearly the exact same argument works for general $M$, but we have to use the non-compact SVD of $M$ (which keeps zero singular values around).

Our goal is to find a rank-$k$ matrix $\hat{M}$ which minimizes $\|\hat{M} - M\|_F^2$. Let $M = U \Sigma V^T$ be the SVD of $M$, where $U, \Sigma, V \in \mathbb{R}^{n \times n}$ since $\text{rank } M = n$. By Property 2 of the Frobenius norm, we then have the following sequence of equalities:

$$\begin{align*} \|\hat{M} - M\|_F^2 &= \|\hat{M} - U \Sigma V^T\|_F^2 \\ &= \|U^T(\hat{M} - U \Sigma V^T)\|_F^2 \quad (\|QA\|_F = \|A\|_F, \text{ with } Q = U^T) \\ &= \|U^T\hat{M} - \Sigma V^T\|_F^2 \quad (U^T U = I) \\ &= \|(U^T\hat{M} - \Sigma V^T)V\|_F^2 \quad (\|A\|_F = \|AQ\|_F, \text{ with } Q = V) \\ &= \|U^T\hat{M} V - \Sigma\|_F^2 \quad (V^T V = I) \end{align*}$$

Now notice that, since $\Sigma$ is a diagonal matrix, any nonzero off-diagonal entry in $U^T \hat{M} V$ only adds to our approximation error, so $U^T \hat{M} V$ should be diagonal. Equivalently, let $\hat{M} = U D V^T$ for some diagonal matrix $D$. Then

$$\|\hat{M} - M\|_F^2 = \|U^T(U D V^T)V - \Sigma\|_F^2 = \|D - \Sigma\|_F^2 = \sum_{i=1}^n (d_{ii} - \sigma_i)^2. \tag{6}$$

Therefore, we want to pick the diagonal entries $d_{ii}$ of $D$ to minimize the right-most expression in (6). If there were no rank restriction on $\hat{M}$, we would simply set $d_{ii} = \sigma_i$. However, notice that $\hat{M} = U D V^T$ is an SVD of $\hat{M}$! Therefore, for $\hat{M}$ to be rank $k$, only $k$ of the $d_{ii}$ can be nonzero: if we can only knock out $k$ of the $(d_{ii} - \sigma_i)^2$ terms in (6), we should pick the top $k$, i.e., $d_{ii} = \sigma_i$ for $i = 1, \ldots, k$ and $d_{ii} = 0$ for $i = k+1, \ldots, n$.

Then,

$$\hat{M} = \bm \vv u_1 \cdots \vv u_k \ \vv u_{k+1} \cdots \vv u_n \em \begin{bmatrix} \sigma_1 & & & & & \\ & \ddots & & & & \\ & & \sigma_k & & & \\ & & & 0 & & \\ & & & & \ddots & \\ & & & & & 0 \end{bmatrix} \begin{bmatrix} \vv v_1^T \\ \vdots \\ \vv v_k^T \\ \vv v_{k+1}^T \\ \vdots \\ \vv v_n^T \end{bmatrix} = \sum_{i=1}^k \sigma_i \vv u_i \vv v_i^T = U_k \Sigma_k V_k^T$$

is exactly the expression in (SVD-k), and the squared approximation error it incurs is

$$\|\hat{E}\|_F^2 = \|\hat{M} - M\|_F^2 = \sum_{i=k+1}^n \sigma_i^2,$$

i.e., the sum of the squares of the “tail” singular values of $M$.
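
This relationship is easy to confirm numerically: truncate the SVD after the top $k$ terms and compare the squared Frobenius error against the sum of the squared tail singular values. A minimal NumPy sketch follows; the random matrix and the choice $k = 2$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
k = 2

# Full SVD of M: M = U @ diag(s) @ Vt, singular values sorted in decreasing order
U, s, Vt = np.linalg.svd(M)

# Rank-k truncation: keep only the top k singular triplets
M_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Squared Frobenius error equals the sum of the squared "tail" singular values
err_sq = np.linalg.norm(M - M_hat, 'fro')**2
print(np.isclose(err_sq, np.sum(s[k:]**2)))
```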

Finally, we address an obvious question when applying these ideas in practice: how should we pick the rank $k$ of our approximation?

In a perfect world, the singular values of the original data matrix give strong guidance: if the top few singular values are much larger than the rest, then the obvious solution is to take $k$ to be the number of large singular values. This was the case in the handset example from the previous lecture: the first singular value was significantly larger than the others, suggesting that a rank-1 approximation would be a good choice (which was image (d)).

In less clear-cut settings, the rule of thumb is to take $k$ as small as possible while still providing a “useful” approximation of the original data. For example, it is common to choose $k$ so that the sum of the top $k$ singular values is at least $c$ times larger than the sum of the remaining singular values. The ratio $c$ is typically a domain-dependent constant picked based on the application.
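
As a rough illustration of this rule of thumb, here is a minimal sketch; the helper `choose_rank` and the threshold `c = 10` are hypothetical choices for illustration, not prescriptions from these notes.

```python
import numpy as np

def choose_rank(s, c=10.0):
    """Smallest k such that the top-k singular values sum to at least
    c times the sum of the remaining singular values.
    (Both this helper and the default c are hypothetical choices.)"""
    total = np.sum(s)
    partial = np.cumsum(s)
    for k in range(1, len(s) + 1):
        tail = total - partial[k - 1]
        if tail == 0 or partial[k - 1] >= c * tail:
            return k
    return len(s)

# Example: singular values with a clear drop-off after the first two
s = np.array([50.0, 30.0, 2.0, 1.0, 0.5])
print(choose_rank(s, c=10.0))  # -> 2, since 50 + 30 = 80 >= 10 * (2 + 1 + 0.5)
```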
