12.2 Matrix Norm
1 Reading
Material related to this page can be found in Lecture 9 of the Stanford CS168 course.
2 Learning Objectives
By the end of this page, you should know:
- what a matrix norm is
- the definition of the Frobenius norm and examples of computing it
- some important properties of the Frobenius norm
- how the Frobenius norm is used to measure the approximation error of a low-rank approximation of a matrix
3 A Matrix Norm
For an $m \times n$ matrix $M$, let $\hat{M}$ be a low-rank approximation of $M$, and define the approximation error as $E = M - \hat{M}$. Intuitively, a “good” approximation will lead to a “small” error $E$. But we need to quantify the “size” of $E$. We know that for vectors $v \in \mathbb{R}^n$, the right way to quantify the size of $v$ is through its norm $\|v\|$, where $\|\cdot\|$ is a function that needs to satisfy the axioms of a norm.
- $\|\alpha v\| = |\alpha| \, \|v\|$ for all $\alpha \in \mathbb{R}$ and $v \in \mathbb{R}^n$,
- $\|v\| \geq 0$ for all $v$, with $\|v\| = 0$ if and only if $v = 0$,
- $\|v + w\| \leq \|v\| + \|w\|$ for all $v, w$.
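As a quick sanity check, here is a minimal numpy sketch (my addition, not part of the original notes) that verifies these three axioms numerically for the Euclidean norm; the vectors and the scalar `alpha` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
v, w = rng.standard_normal(5), rng.standard_normal(5)
alpha = -3.7

# Homogeneity: ||alpha v|| = |alpha| ||v||
assert np.isclose(np.linalg.norm(alpha * v), abs(alpha) * np.linalg.norm(v))
# Non-negativity, and ||v|| = 0 exactly when v = 0
assert np.linalg.norm(v) > 0 and np.linalg.norm(np.zeros(5)) == 0
# Triangle inequality: ||v + w|| <= ||v|| + ||w||
assert np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w)
```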
It turns out we can define functions on the vector space of matrices that satisfy these same properties: these are called matrix norms. We’ll introduce one of them here that is particularly relevant to low-rank matrix approximations, but be aware that, just as for vectors, there are many kinds of matrix norms.
4 The Frobenius Norm
The Frobenius norm of an $m \times n$ matrix $A$ is defined as

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2}. \tag{F}$$

For example, if $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, then $\|A\|_F = \sqrt{1 + 4 + 9 + 16} = \sqrt{30} \approx 5.48$. We need a couple of properties of the Frobenius norm before we can connect the SVD to low-rank matrix approximation.
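To make the definition concrete, the following short numpy sketch (my addition, not from the lecture) computes $\|A\|_F$ entrywise as in (F) and checks it against numpy’s built-in Frobenius norm; the matrix is an arbitrary example.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Definition (F): square every entry, sum, take the square root.
frob_from_definition = np.sqrt(np.sum(A ** 2))
frob_from_numpy = np.linalg.norm(A, "fro")

print(frob_from_definition, frob_from_numpy)  # both ~9.5394 (= sqrt(91))
assert np.isclose(frob_from_definition, frob_from_numpy)
```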
- Property 1: For a square matrix $A$, $\|A\|_F = \|A^\top\|_F$.
This isn’t too hard to check from the definition (F): taking the transpose just swaps the roles of $i$ and $j$ in the sum, but you still end up adding together the squares of all entries of $A^\top$, which are the same as the squares of all of the entries of $A$.
- Property 2: If $Q$ is an orthogonal matrix and $A$ is a square matrix, then $\|QA\|_F = \|AQ\|_F = \|A\|_F$, i.e., the Frobenius norm of a matrix is unchanged by left or right multiplication by an orthogonal matrix.
To see why this is true, recall that if $a_1, \dots, a_n$ are the columns of $A$, then $QA = \begin{bmatrix} Qa_1 & \cdots & Qa_n \end{bmatrix}$. Then, since we can write the Frobenius norm squared of a matrix as the sum of the squared Euclidean norms of its columns, we have:

$$\|QA\|_F^2 = \sum_{i=1}^{n} \|Qa_i\|^2 = \sum_{i=1}^{n} \|a_i\|^2 = \|A\|_F^2.$$

Here, the second equality holds because multiplying a vector by an orthogonal matrix does not change its Euclidean norm. Finally, we use this and Property 1 to conclude:

$$\|AQ\|_F = \|(AQ)^\top\|_F = \|Q^\top A^\top\|_F = \|A^\top\|_F = \|A\|_F,$$

where the third equality holds because $Q^\top$ is also orthogonal.
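The following sketch (my addition) checks Properties 1 and 2 numerically: the Frobenius norm is unchanged by transposition and by multiplication with an orthogonal matrix. The orthogonal matrix $Q$ is obtained from a QR factorization of a random matrix, which is just one convenient way to produce an orthogonal matrix for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # Q is orthogonal

fro = lambda X: np.linalg.norm(X, "fro")

# Property 1: ||A||_F = ||A^T||_F
assert np.isclose(fro(A), fro(A.T))
# Property 2: ||QA||_F = ||AQ||_F = ||A||_F
assert np.isclose(fro(Q @ A), fro(A))
assert np.isclose(fro(A @ Q), fro(A))
```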
We will measure the quality of our rank-$k$ approximation $\hat{M}_k$ from (SVD-k) to $M$ in terms of the Frobenius norm of their difference, $\|M - \hat{M}_k\|_F$.
The following theorem tells us that the SVD-based approximation (SVD-k) is optimal with respect to the Frobenius norm of the approximation error!

**Theorem 1** (Eckart–Young): Let $\hat{M}_k$ be the rank-$k$ approximation of $M$ defined in (SVD-k). Then for every matrix $\hat{M}$ of rank at most $k$,

$$\|M - \hat{M}_k\|_F \leq \|M - \hat{M}\|_F.$$
We won’t formally prove this theorem, but let’s get some intuition as to why this is true.
4.1 Understanding Theorem 1
To keep things simple, we’ll assume $M$ is square and full rank, i.e., $M \in \mathbb{R}^{n \times n}$ with rank $n$. Nearly the exact same argument works for general $M$, but we would have to use the non-compact SVD of $M$ (which keeps zero singular values around).
Our goal is to find a rank-$k$ matrix $\hat{M}$ which minimizes $\|M - \hat{M}\|_F$. Let $M = U \Sigma V^\top$ be the SVD of $M$, where $U, \Sigma, V \in \mathbb{R}^{n \times n}$ since rank$(M) = n$. By Property 2 of the Frobenius norm, we then have the following sequence of equalities:

$$\|M - \hat{M}\|_F = \|U^\top (M - \hat{M}) V\|_F = \|U^\top M V - U^\top \hat{M} V\|_F = \|\Sigma - U^\top \hat{M} V\|_F.$$
Now notice that since $\Sigma$ is a diagonal matrix, any non-diagonal entry in $U^\top \hat{M} V$ adds to our approximation error, so $U^\top \hat{M} V$ should be diagonal. Let $U^\top \hat{M} V = D$ for some diagonal matrix $D = \mathrm{diag}(d_1, \dots, d_n)$. Then

$$\|M - \hat{M}\|_F^2 = \|\Sigma - D\|_F^2 = \sum_{i=1}^{n} (\sigma_i - d_i)^2. \tag{6}$$
Therefore, we want to pick the diagonal entries $d_i$ of $D$ to minimize the right-most expression in (6). If there were no rank restriction on $\hat{M}$, we would simply set $d_i = \sigma_i$ for all $i$. However, notice that $\hat{M} = U D V^\top$ is an SVD of $\hat{M}$! Therefore, for $\hat{M}$ to be rank $k$, only $k$ of the $d_i$ can be nonzero: if we can only knock off $k$ of the terms in (6), we should pick the top $k$, i.e., set $d_i = \sigma_i$ for $i = 1, \dots, k$ and $d_i = 0$ for $i = k+1, \dots, n$.
Then,

$$\hat{M} = U D V^\top = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$$

is exactly the expression in (SVD-k), and the squared approximation error it incurs is

$$\|M - \hat{M}\|_F^2 = \sum_{i=k+1}^{n} \sigma_i^2,$$

i.e., the sum of the squares of the “tail” singular values of $M$.
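The sketch below (my addition, using a random test matrix as an assumed stand-in for $M$) builds the rank-$k$ truncated-SVD approximation, checks that its squared Frobenius error equals the sum of the squared tail singular values as derived above, and compares it against a different rank-$k$ matrix as a crude illustration of the optimality claimed by Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
M = rng.standard_normal((n, n))

U, s, Vt = np.linalg.svd(M)                    # M = U @ diag(s) @ Vt
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # keep only the top k singular values

# Squared error equals the sum of the squared "tail" singular values.
err_sq = np.linalg.norm(M - M_k, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)
assert np.isclose(err_sq, tail_sq)

# Any other rank-k matrix should do no better (Eckart-Young); here we try one
# built from random factors as a simple comparison.
B = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
assert np.linalg.norm(M - M_k, "fro") <= np.linalg.norm(M - B, "fro")
```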
Finally, we address an obvious question when applying these ideas in practice: how should we pick the rank of our approximation?
In a perfect world, the singular values of the original data matrix will give strong guidance: if the top few singular values are much larger than the rest, then the obvious solution is to take $k$ to be the number of big values. This was the case in the handset example from the previous lecture: the first singular value was significantly larger than the others, suggesting that a rank-1 approximation would be a good choice (which was image (d)).
In less clear settings, the rule of thumb is to take $k$ as small as possible while still providing a “useful” approximation of the original data. For example, it is common to choose $k$ so that the sum of the top $k$ singular values is at least $c$ times larger than the sum of the remaining singular values. The ratio $c$ is typically a domain-dependent constant picked based on the application.
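A minimal sketch of this rule of thumb (my addition; the helper name `choose_rank`, the threshold $c$, and the example spectrum are all illustrative choices, not from the lecture):

```python
import numpy as np

def choose_rank(singular_values, c=10.0):
    """Smallest k with sum of top-k singular values >= c * sum of the rest."""
    s = np.asarray(singular_values)
    for k in range(1, len(s) + 1):
        if s[:k].sum() >= c * s[k:].sum():
            return k
    return len(s)

# Example: a sharply decaying spectrum suggests a small k.
s = np.array([100.0, 40.0, 2.0, 1.0, 0.5])
print(choose_rank(s, c=10.0))  # -> 2, since 100 + 40 >= 10 * (2 + 1 + 0.5)
```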