Understanding Machine Learning: From Theory to Algorithms


Dimensionality Reduction


It is not hard to verify (see Exercise 2) that the right-hand side equals $\sum_{j=1}^n D_{j,j}$. We have therefore shown that for every matrix $U \in \mathbb{R}^{d,n}$ with orthonormal columns it holds that $\operatorname{trace}(U^\top A U) \le \sum_{j=1}^n D_{j,j}$. On the other hand, if we set $U$ to be the matrix whose columns are the $n$ leading eigenvectors of $A$, we obtain that $\operatorname{trace}(U^\top A U) = \sum_{j=1}^n D_{j,j}$, and this concludes our proof.
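
For concreteness, the following small numpy check of this argument (illustrative code; the variable names are ours, not the book's) confirms that the $n$ leading eigenvectors attain the bound while a random orthonormal $U$ does not exceed it:

```python
# Illustrative check of the trace argument above.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 50, 3
X = rng.standard_normal((m, d))   # rows are the examples x_1, ..., x_m
A = X.T @ X                       # A = sum_i x_i x_i^T

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
U_star = eigvecs[:, -n:]               # the n leading eigenvectors of A
best = np.trace(U_star.T @ A @ U_star)
assert np.isclose(best, eigvals[-n:].sum())   # attains sum of n leading eigenvalues

# any other orthonormal U does no better
U_rand, _ = np.linalg.qr(rng.standard_normal((d, n)))  # random orthonormal columns
assert np.trace(U_rand.T @ A @ U_rand) <= best + 1e-9
```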

Remark 23.1 The proof of Theorem 23.2 also tells us that the value of the objective of Equation (23.4) is $\sum_{i=1}^n D_{i,i}$. Combining this with Equation (23.3) and noting that $\sum_{i=1}^m \|x_i\|^2 = \operatorname{trace}(A) = \sum_{i=1}^d D_{i,i}$, we obtain that the optimal objective value of Equation (23.1) is $\sum_{i=n+1}^d D_{i,i}$.
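
This identity can also be verified numerically; the sketch below (again illustrative numpy code) checks that the optimal reconstruction error equals the sum of the $d-n$ trailing eigenvalues:

```python
# Illustrative check of Remark 23.1: the PCA reconstruction error
# sum_i ||x_i - U U^T x_i||^2 equals the sum of the trailing eigenvalues of A.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 50, 3
X = rng.standard_normal((m, d))
A = X.T @ X

eigvals, eigvecs = np.linalg.eigh(A)      # ascending order
U = eigvecs[:, -n:]                       # n leading eigenvectors
recon_err = ((X - X @ U @ U.T) ** 2).sum()
assert np.isclose(recon_err, eigvals[:-n].sum())   # = sum_{i=n+1}^d D_{i,i}
```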
Remark 23.2 It is a common practice to “center” the examples before applying PCA. That is, we first calculate $\mu = \frac{1}{m}\sum_{i=1}^m x_i$ and then apply PCA on the vectors $(x_1 - \mu),\ldots,(x_m - \mu)$. This is also related to the interpretation of PCA as variance maximization (see Exercise 4).
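
In numpy terms, centering is a one-line preprocessing step (illustrative):

```python
# Center the examples before PCA: subtract mu = (1/m) sum_i x_i from each row.
import numpy as np

X = np.random.default_rng(0).standard_normal((50, 10))  # rows x_1, ..., x_m
mu = X.mean(axis=0)        # empirical mean
X_centered = X - mu        # rows are (x_i - mu); run PCA on these
```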

23.1.1 A More Efficient Solution for the Case $d \gg m$


In some situations the original dimensionality of the data is much larger than the number of examples $m$. The computational complexity of calculating the PCA solution as described previously is $O(d^3)$ (for calculating the eigenvalues of $A$) plus $O(md^2)$ (for constructing the matrix $A$). We now show a simple trick that enables us to calculate the PCA solution more efficiently when $d \gg m$.
Recall that the matrix $A$ is defined to be $\sum_{i=1}^m x_i x_i^\top$. It is convenient to rewrite $A = X^\top X$, where $X \in \mathbb{R}^{m,d}$ is a matrix whose $i$th row is $x_i^\top$. Consider the matrix $B = X X^\top$. That is, $B \in \mathbb{R}^{m,m}$ is the matrix whose $i,j$ element equals $\langle x_i, x_j \rangle$. Suppose that $u$ is an eigenvector of $B$: that is, $Bu = \lambda u$ for some $\lambda \in \mathbb{R}$. Multiplying this equality by $X^\top$ and using the definition of $B$, we obtain $X^\top X X^\top u = \lambda X^\top u$. But, using the definition of $A$, we get that $A(X^\top u) = \lambda (X^\top u)$. Thus, $\frac{X^\top u}{\|X^\top u\|}$ is an eigenvector of $A$ with eigenvalue $\lambda$.
We can therefore calculate the PCA solution by calculating the eigenvalues of $B$ instead of $A$. The complexity is $O(m^3)$ (for calculating the eigenvalues of $B$) plus $O(m^2 d)$ (for constructing the matrix $B$).
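
A numpy sketch of this trick (illustrative code; variable names are ours) recovers the leading eigenvectors of $A$ from the much smaller matrix $B$ and checks them against the direct computation:

```python
# The d >> m trick: eigendecompose the m x m matrix B = X X^T instead of
# the d x d matrix A = X^T X, then map eigenvectors back via X^T u / ||X^T u||.
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 1000, 20, 3
X = rng.standard_normal((m, d))

B = X @ X.T                        # B[i, j] = <x_i, x_j>; costs O(m^2 d)
lam, V = np.linalg.eigh(B)         # O(m^3) rather than O(d^3)
U = X.T @ V[:, -n:]                # X^T u for each of the n leading eigenvectors
U /= np.linalg.norm(U, axis=0)     # normalize: X^T u / ||X^T u||

# sanity check against the direct (expensive) route
A = X.T @ X
lamA, W = np.linalg.eigh(A)
assert np.allclose(lam[-n:], lamA[-n:])        # same leading eigenvalues
for k in range(n):                             # same directions, up to sign
    assert np.isclose(abs(U[:, k] @ W[:, -n + k]), 1.0)
```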

Remark 23.3 The previous discussion also implies that to calculate the PCA solution we only need to know how to calculate inner products between vectors. This enables us to calculate PCA implicitly even when $d$ is very large (or even infinite) using kernels, which yields the kernel PCA algorithm.
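
As a minimal sketch of this idea (not the book's algorithm; the RBF kernel and the standard feature-space centering step are assumptions we make here for illustration), note that every quantity below is computed from inner products alone:

```python
# Minimal kernel PCA sketch: everything is expressed through the kernel
# matrix K[i, j] = k(x_i, x_j); the feature vectors themselves never appear.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))
K = rbf_kernel(X)

m = K.shape[0]
J = np.eye(m) - np.full((m, m), 1.0 / m)   # centering in feature space
Kc = J @ K @ J

lam, V = np.linalg.eigh(Kc)                # ascending eigenvalues
n = 2
# coordinates of the training points along the n leading kernel principal
# components (scaling makes the implicit feature-space directions unit norm)
coords = V[:, -n:] * np.sqrt(np.maximum(lam[-n:], 0.0))
```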

23.1.2 Implementation and Demonstration


A pseudocode of PCA is given in the following.
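
Rendered in Python (a sketch consistent with the preceding discussion, not the book's verbatim pseudocode), the procedure switches between the two routes according to whether $d \le m$:

```python
# PCA returning the compression matrix U whose columns are the n leading
# principal directions; uses the B = X X^T trick when d > m.
import numpy as np

def pca(X, n):
    m, d = X.shape
    if d <= m:
        A = X.T @ X                       # d x d matrix A = sum_i x_i x_i^T
        _, eigvecs = np.linalg.eigh(A)    # ascending eigenvalues
        U = eigvecs[:, -n:]               # n leading eigenvectors
    else:
        B = X @ X.T                       # m x m Gram matrix, cheaper when d > m
        _, V = np.linalg.eigh(B)
        U = X.T @ V[:, -n:]               # map eigenvectors of B back to R^d
        U /= np.linalg.norm(U, axis=0)    # X^T u / ||X^T u||
    return U

# usage: compress via y_i = U^T x_i, decompress via U y_i
X = np.random.default_rng(3).standard_normal((100, 7))
U = pca(X, 2)
Y = X @ U          # rows are the compressed representations
X_hat = Y @ U.T    # approximate reconstruction in R^d
```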