
STA 141C – Big Data & High Performance Statistical Computing Spring 2022
Week 7-1: Power method
Lecturer: Bo Y.-C. Ning May 10, 2022
Disclaimer: My notes may contain errors; distribution outside this class is allowed only with the permission
of the Instructor.
Last time
PCA and SVD
Today
Power method
1 Review of singular value decomposition (SVD)
For a rectangular matrix A ∈ Rm×n, let p = min{m, n}. Then we have the SVD
A = UΣV′,
where U = (u1, . . . , um) ∈ Rm×m and V = (v1, . . . , vn) ∈ Rn×n are orthogonal matrices and Σ ∈ Rm×n carries
σ1, . . . , σp on its diagonal (and zeros elsewhere), with σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0. The σi's are called the singular
values, the ui's the left singular vectors, and the vi's the right singular vectors.
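A quick illustration in R (a minimal sketch; the matrix and its dimensions are arbitrary choices, not part of the notes): svd() returns these factors in their "thin" form (defined next), and we can check the factorization and the orthogonality of the computed singular vectors.

    set.seed(141)
    m <- 6; n <- 4                      # m > n, so p = min(m, n) = n
    A <- matrix(rnorm(m * n), m, n)

    s <- svd(A)                          # s$u is m x n, s$d has length n, s$v is n x n
    max(abs(A - s$u %*% diag(s$d) %*% t(s$v)))   # ~ 0: A = U Sigma V'
    max(abs(crossprod(s$u) - diag(n)))           # ~ 0: U'U = I
    max(abs(crossprod(s$v) - diag(n)))           # ~ 0: V'V = I
    all(diff(s$d) <= 0)                          # TRUE: sigma_1 >= ... >= sigma_p >= 0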
Since Σ is in general not a square matrix, one can also define the thin SVD, which factorizes A as
A = UnΣnV′ = ∑_{i=1}^n σi ui vi′,
where Un ∈ Rm×n, Un′Un = In, and Σn = diag(σ1, . . . , σn). This is written for m > n; if m < n, we instead keep
only the first m columns of V, i.e. Vm ∈ Rn×m with Vm′Vm = Im.
The following properties are useful. For σ(A) = (σ1, . . . , σp)′:
the rank of A is the number of nonzero singular values, denoted ‖σ(A)‖0;
the Frobenius norm of A is ‖A‖F = (∑_{i=1}^p σi²)^{1/2} = ‖σ(A)‖2;
the spectral norm of A is ‖A‖2 = σ1 = ‖σ(A)‖∞.
Using the fact that U and V are both orthogonal matrices,
A′A = VΣU′UΣV′ = VΣ²V′,  AA′ = UΣV′VΣU′ = UΣ²U′.
Last, the eigen-decomposition of a real symmetric matrix is B = WΛW′, where Λ = diag(λ1, . . . , λn); when the
λi are nonnegative, this is also the SVD of B.
2 Power method
To start, let us assume A ∈ Rn×n is a symmetric and p.s.d. matrix. The power method for obtaining the largest
eigenvalue is given as:
1) Choose an initial guess q(0) (non-zero);
2) Repeat for k = 1, . . . , K:
z(k) = Aq(k−1),  q(k) = z(k)/‖z(k)‖2;
3) Output: λ1 ← q(K)′Aq(K).
3 Why the power method works
Let's understand how the power method works. Before that, we need to recall a few facts:
the eigenvector vi attached to the i-th eigenvalue λi satisfies Avi = λivi;
given A, A = ∑_{i=1}^n λi vi vi′, where λ1 ≥ · · · ≥ λn ≥ 0 and 〈vi, vj〉 = 0 for i ≠ j;
A^k = ∑_{i=1}^n λi^k vi vi′ (why?).
By inspecting the algorithm, we have
q(k) = A^k q(0) / ‖A^k q(0)‖2.
Now given an initial guess q(0) of unit Euclidean norm, it is possible to express
q(0) = α1v1 + α2v2 + · · · + αnvn,
where α1, . . . , αn are scalars. By the relation Avi = λivi,
A^k q(0) = α1λ1^k ( v1 + ∑_{j=2}^n (αjλj^k)/(α1λ1^k) vj ).
For simplicity, let's denote y(k) = ∑_{j=2}^n (αjλj^k)/(α1λ1^k) vj; note that y(k) → 0 as k → ∞ as long as
λ1 > λ2 ≥ · · · ≥ λn (and α1 ≠ 0).
Then
q(k) = A^k q(0) / ‖A^k q(0)‖2 = α1λ1^k (v1 + y(k)) / ‖α1λ1^k (v1 + y(k))‖2 → v1, as k → ∞.
In practice, k never goes to ∞; the algorithm stops at some K when min{‖q(K) − q(K−1)‖2, ‖q(K) + q(K−1)‖2} ≤ ε
for some small ε > 0.
The output q(K) is a close approximation of v1, the leading eigenvector. How do we obtain the leading eigenvalue
λ1? (Hint: use Av1 = λ1v1.)
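As a concrete illustration, here is a minimal R sketch of the algorithm above (the function name power_method, the test matrix, the tolerance, and the iteration cap are my own arbitrary choices, not part of the notes). It uses the sign-insensitive stopping rule just described and recovers λ1 via the Rayleigh quotient q(K)′Aq(K).

    power_method <- function(A, tol = 1e-10, max_iter = 1000) {
      q <- rnorm(ncol(A))               # 1) non-zero initial guess
      q <- q / sqrt(sum(q^2))
      for (k in seq_len(max_iter)) {    # 2) repeat
        z <- A %*% q
        q_new <- as.vector(z) / sqrt(sum(z^2))
        # stop once q stabilizes up to a sign flip
        if (min(sqrt(sum((q_new - q)^2)), sqrt(sum((q_new + q)^2))) <= tol) {
          q <- q_new
          break
        }
        q <- q_new
      }
      lambda1 <- as.numeric(crossprod(q, A %*% q))   # 3) output: q'Aq
      list(value = lambda1, vector = q, iterations = k)
    }

    # example on a symmetric p.s.d. matrix
    set.seed(141)
    X <- matrix(rnorm(50 * 5), 50, 5)
    A <- crossprod(X)                   # X'X is symmetric p.s.d.
    out <- power_method(A)
    c(out$value, eigen(A)$values[1])    # the two leading-eigenvalue estimates agree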
A few comments:
Figure 6.1: Convergence speed of the power method.
[Source: https://web.mit.edu/18.06/www/Spring17/Power-Method.pdf]
The power method works well if λ1 > λ2. It converges slowly if λ1/λ2 ≈ 1.
The approximation error after k iterations is proportional to (λ2/λ1)^k, so the convergence speed is governed
by the ratio between λ2 and λ1 (see Figure 6.1).
For a general matrix A ∈ Rn×p, we can apply the power method to A′A or AA′ instead. The output is then σ1²,
whose square root is the largest singular value of A (for a symmetric A, this equals |λ1|); see the sketch below.
How do we get λ2, · · ·, λn? (One standard approach, deflation, is sketched below.)
Eigen-decomposition is implemented in LAPACK; see eigen() in R and np.linalg.eig in numpy.
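The following minimal sketch addresses the two comments above. The deflation step and the test matrix are my own additions (not spelled out in the notes), and it reuses the hypothetical power_method() function defined earlier.

    set.seed(7)
    B <- matrix(rnorm(30 * 6), 30, 6)    # a general rectangular matrix

    # Power method on B'B: its leading eigenvalue is sigma_1(B)^2
    out1 <- power_method(crossprod(B))
    c(sqrt(out1$value), svd(B)$d[1])      # both give the largest singular value of B

    # Deflation for lambda_2: subtract the leading term lambda_1 v_1 v_1' and rerun
    A  <- crossprod(B)                    # symmetric p.s.d.
    e1 <- power_method(A)
    A2 <- A - e1$value * tcrossprod(e1$vector)
    e2 <- power_method(A2)                # converges to (lambda_2, v_2) when lambda_2 > lambda_3
    c(e1$value, e2$value)
    eigen(A)$values[1:2]                  # LAPACK answer via eigen(), for comparison

Repeating the deflation step in the same way gives λ3, λ4, and so on, although in practice eigen() or svd() is the preferred tool.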
The power method is the most basic algorithm for SVD. There are other methods, such as:
Inverse power method, for finding the eigenvalue of smallest absolute value (replace A with A⁻¹ in the power
method);
QR algorithm for symmetric eigen-decomposition (takes 4n³/3 flops for eigenvalues and 8n³/3 for eigenvectors);
"Golub-Kahan-Reinsch" algorithm (Section 8.6 of Golub and Van Loan), used in the svd function in R
(4m²n + 8mn² + 9n³ flops for an m > n matrix);
Jacobi methods (Section 8.5 of Golub and Van Loan), suitable for parallel computing.
Concluding remarks on numerical linear algebra:
Numerical linear algebra forms the building blocks of most computation we do. Most lines of our code
are numerical linear algebra.
Be flop and memory aware! The form of a mathematical expression and the way the expression should
be evaluated in actual practice may be quite different.
Be alert to problem structure and make educated choice of software/algorithm — the structure should
be exploited whenever solving a problem.
Do not write your own matrix computation routines unless you have a good reason. Utilize BLAS and LAPACK
as much as possible!
In contrast, for optimization, often we need to devise problem specific optimization routines, or even
“mix and match” them.
4 Ridge regression by SVD
In ridge regression, we minimize
‖y − Xβ‖2² + λ‖β‖2².
If we obtain the SVD of X, X = UΣV′, then the normal equations (X′X + λIp)β = X′y become
(Σ² + λIp)V′β = ΣU′y.
We get
βλ = ∑_{i=1}^r (σi ui′y)/(σi² + λ) vi,   r = rank(X).
It is clear that βλ → βOLS as λ → 0, and that ‖βλ‖2 is monotonically decreasing in λ (going to 0 as λ → ∞).
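A minimal R sketch of this closed form (the simulated data and the λ values are arbitrary illustrations, not from the notes): it checks the SVD formula against a direct solve of the normal equations and shows the shrinking norm of the solution path.

    set.seed(141)
    n <- 100; p <- 5
    X <- matrix(rnorm(n * p), n, p)
    y <- X %*% c(2, -1, 0, 1, 0.5) + rnorm(n)

    ridge_svd <- function(X, y, lambda) {
      s <- svd(X)                                  # X = U Sigma V'
      d <- s$d
      # beta_lambda = sum_i (sigma_i u_i'y) / (sigma_i^2 + lambda) * v_i
      s$v %*% ((d * crossprod(s$u, y)) / (d^2 + lambda))
    }

    b1 <- ridge_svd(X, y, lambda = 1)
    b2 <- solve(crossprod(X) + 1 * diag(p), crossprod(X, y))   # direct normal equations
    max(abs(b1 - b2))                              # ~ 0

    sapply(c(0, 1, 10, 100), function(l) sqrt(sum(ridge_svd(X, y, l)^2)))
    # norms decrease as lambda grows; lambda = 0 recovers the OLS fit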