Understanding Machine Learning: From Theory to Algorithms


Dimensionality Reduction


It is not hard to verify (see Exercise 2) that the right-hand side equals $\sum_{j=1}^n D_{j,j}$. We have therefore shown that for every matrix $U \in \mathbb{R}^{d,n}$ with orthonormal columns it holds that $\operatorname{trace}(U^\top A U) \le \sum_{j=1}^n D_{j,j}$. On the other hand, if we set $U$ to be the matrix whose columns are the $n$ leading eigenvectors of $A$, we obtain that $\operatorname{trace}(U^\top A U) = \sum_{j=1}^n D_{j,j}$, and this concludes our proof.
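
For concreteness, the following small numpy check of this argument (illustrative code; the variable names are ours, not the book's) confirms that the $n$ leading eigenvectors attain the bound while a random orthonormal $U$ does not exceed it:

```python
# Illustrative check of the trace argument above.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 50, 3
X = rng.standard_normal((m, d))   # rows are the examples x_1, ..., x_m
A = X.T @ X                       # A = sum_i x_i x_i^T

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
U_star = eigvecs[:, -n:]               # the n leading eigenvectors of A
best = np.trace(U_star.T @ A @ U_star)
assert np.isclose(best, eigvals[-n:].sum())   # attains sum of n leading eigenvalues

# any other orthonormal U does no better
U_rand, _ = np.linalg.qr(rng.standard_normal((d, n)))  # random orthonormal columns
assert np.trace(U_rand.T @ A @ U_rand) <= best + 1e-9
```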

Remark 23.1 The proof of Theorem 23.2 also tells us that the value of the objective of Equation (23.4) is $\sum_{i=1}^n D_{i,i}$. Combining this with Equation (23.3) and noting that $\sum_{i=1}^m \|x_i\|^2 = \operatorname{trace}(A) = \sum_{i=1}^d D_{i,i}$, we obtain that the optimal objective value of Equation (23.1) is $\sum_{i=n+1}^d D_{i,i}$.
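
This identity can also be verified numerically; the sketch below (again illustrative numpy code) checks that the optimal reconstruction error equals the sum of the $d-n$ trailing eigenvalues:

```python
# Illustrative check of Remark 23.1: the PCA reconstruction error
# sum_i ||x_i - U U^T x_i||^2 equals the sum of the trailing eigenvalues of A.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 50, 3
X = rng.standard_normal((m, d))
A = X.T @ X

eigvals, eigvecs = np.linalg.eigh(A)      # ascending order
U = eigvecs[:, -n:]                       # n leading eigenvectors
recon_err = ((X - X @ U @ U.T) ** 2).sum()
assert np.isclose(recon_err, eigvals[:-n].sum())   # = sum_{i=n+1}^d D_{i,i}
```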
Remark 23.2 It is a common practice to “center” the examples before applying PCA. That is, we first calculate $\mu = \frac{1}{m}\sum_{i=1}^m x_i$ and then apply PCA on the vectors $(x_1 - \mu),\ldots,(x_m - \mu)$. This is also related to the interpretation of PCA as variance maximization (see Exercise 4).
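
In numpy terms, centering is a one-line preprocessing step (illustrative):

```python
# Center the examples before PCA: subtract mu = (1/m) sum_i x_i from each row.
import numpy as np

X = np.random.default_rng(0).standard_normal((50, 10))  # rows x_1, ..., x_m
mu = X.mean(axis=0)        # empirical mean
X_centered = X - mu        # rows are (x_i - mu); run PCA on these
```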

23.1.1 A More Efficient Solution for the Case $d \gg m$


In some situations the original dimensionality of the data is much larger than the number of examples $m$. The computational complexity of calculating the PCA solution as described previously is $O(d^3)$ (for calculating the eigenvalues of $A$) plus $O(md^2)$ (for constructing the matrix $A$). We now show a simple trick that enables us to calculate the PCA solution more efficiently when $d \gg m$.
Recall that the matrix $A$ is defined to be $\sum_{i=1}^m x_i x_i^\top$. It is convenient to rewrite $A = X^\top X$, where $X \in \mathbb{R}^{m,d}$ is a matrix whose $i$th row is $x_i^\top$. Consider the matrix $B = X X^\top$. That is, $B \in \mathbb{R}^{m,m}$ is the matrix whose $i,j$ element equals $\langle x_i, x_j \rangle$. Suppose that $u$ is an eigenvector of $B$: that is, $Bu = \lambda u$ for some $\lambda \in \mathbb{R}$. Multiplying this equality by $X^\top$ and using the definition of $B$, we obtain $X^\top X X^\top u = \lambda X^\top u$. But, using the definition of $A$, we get that $A(X^\top u) = \lambda (X^\top u)$. Thus, $\frac{X^\top u}{\|X^\top u\|}$ is an eigenvector of $A$ with eigenvalue $\lambda$.
We can therefore calculate the PCA solution by calculating the eigenvalues of $B$ instead of $A$. The complexity is $O(m^3)$ (for calculating the eigenvalues of $B$) plus $O(m^2 d)$ (for constructing the matrix $B$).
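
A numpy sketch of this trick (illustrative code; variable names are ours) recovers the leading eigenvectors of $A$ from the much smaller matrix $B$ and checks them against the direct computation:

```python
# The d >> m trick: eigendecompose the m x m matrix B = X X^T instead of
# the d x d matrix A = X^T X, then map eigenvectors back via X^T u / ||X^T u||.
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 1000, 20, 3
X = rng.standard_normal((m, d))

B = X @ X.T                        # B[i, j] = <x_i, x_j>; costs O(m^2 d)
lam, V = np.linalg.eigh(B)         # O(m^3) rather than O(d^3)
U = X.T @ V[:, -n:]                # X^T u for each of the n leading eigenvectors
U /= np.linalg.norm(U, axis=0)     # normalize: X^T u / ||X^T u||

# sanity check against the direct (expensive) route
A = X.T @ X
lamA, W = np.linalg.eigh(A)
assert np.allclose(lam[-n:], lamA[-n:])        # same leading eigenvalues
for k in range(n):                             # same directions, up to sign
    assert np.isclose(abs(U[:, k] @ W[:, -n + k]), 1.0)
```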

Remark 23.3 The previous discussion also implies that to calculate the PCA solution we only need to know how to calculate inner products between vectors. This enables us to calculate PCA implicitly even when $d$ is very large (or even infinite) using kernels, which yields the kernel PCA algorithm.
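
As a minimal sketch of this idea (not the book's algorithm; the RBF kernel and the standard feature-space centering step are assumptions we make here for illustration), note that every quantity below is computed from inner products alone:

```python
# Minimal kernel PCA sketch: everything is expressed through the kernel
# matrix K[i, j] = k(x_i, x_j); the feature vectors themselves never appear.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))
K = rbf_kernel(X)

m = K.shape[0]
J = np.eye(m) - np.full((m, m), 1.0 / m)   # centering in feature space
Kc = J @ K @ J

lam, V = np.linalg.eigh(Kc)                # ascending eigenvalues
n = 2
# coordinates of the training points along the n leading kernel principal
# components (scaling makes the implicit feature-space directions unit norm)
coords = V[:, -n:] * np.sqrt(np.maximum(lam[-n:], 0.0))
```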

23.1.2 Implementation and Demonstration


A pseudocode of PCA is given in the following.
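
Rendered in Python (a sketch consistent with the preceding discussion, not the book's verbatim pseudocode), the procedure switches between the two routes according to whether $d \le m$:

```python
# PCA returning the compression matrix U whose columns are the n leading
# principal directions; uses the B = X X^T trick when d > m.
import numpy as np

def pca(X, n):
    m, d = X.shape
    if d <= m:
        A = X.T @ X                       # d x d matrix A = sum_i x_i x_i^T
        _, eigvecs = np.linalg.eigh(A)    # ascending eigenvalues
        U = eigvecs[:, -n:]               # n leading eigenvectors
    else:
        B = X @ X.T                       # m x m Gram matrix, cheaper when d > m
        _, V = np.linalg.eigh(B)
        U = X.T @ V[:, -n:]               # map eigenvectors of B back to R^d
        U /= np.linalg.norm(U, axis=0)    # X^T u / ||X^T u||
    return U

# usage: compress via y_i = U^T x_i, decompress via U y_i
X = np.random.default_rng(3).standard_normal((100, 7))
U = pca(X, 2)
Y = X @ U          # rows are the compressed representations
X_hat = Y @ U.T    # approximate reconstruction in R^d
```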