9.2 Linear Regression

Or, in matrix form:


\[
A =
\begin{pmatrix}
x_1 & \cdots & x_m
\end{pmatrix}
\begin{pmatrix}
x_1 & \cdots & x_m
\end{pmatrix}^{\top},
\tag{9.7}
\]

\[
b =
\begin{pmatrix}
x_1 & \cdots & x_m
\end{pmatrix}
\begin{pmatrix}
y_1 \\ \vdots \\ y_m
\end{pmatrix}.
\tag{9.8}
\]

If $A$ is invertible then the solution to the ERM problem is

\[
w = A^{-1} b.
\]
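For concreteness, the following is a minimal NumPy sketch of this computation on synthetic data; the variable names (X, y, A, b, w) and the generated data are purely illustrative.

```python
import numpy as np

# Minimal sketch: build A and b as in (9.7)-(9.8) for synthetic data
# and solve A w = b in the case where A is invertible.
rng = np.random.default_rng(0)
d, m = 3, 10
X = rng.normal(size=(d, m))   # columns of X are the instances x_1, ..., x_m
y = rng.normal(size=m)        # targets y_1, ..., y_m

A = X @ X.T                   # A = (x_1 ... x_m)(x_1 ... x_m)^T
b = X @ y                     # b = (x_1 ... x_m)(y_1, ..., y_m)^T

# With m >= d instances in general position, A is invertible; solving the
# linear system directly is numerically preferable to forming A^{-1}.
w = np.linalg.solve(A, b)
assert np.allclose(A @ w, b)
```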

The case in which $A$ is not invertible requires a few standard tools from linear algebra, which are available in Appendix C. It can be easily shown that if the training instances do not span the entire space of $\mathbb{R}^d$ then $A$ is not invertible. Nevertheless, we can always find a solution to the system $Aw = b$ because $b$ is in the range of $A$. Indeed, since $A$ is symmetric we can write it using its eigenvalue decomposition as $A = V D V^\top$, where $D$ is a diagonal matrix and $V$ is an orthonormal matrix (that is, $V^\top V$ is the identity $d \times d$ matrix). Define $D^+$ to be the diagonal matrix such that $D^+_{i,i} = 0$ if $D_{i,i} = 0$ and otherwise $D^+_{i,i} = 1/D_{i,i}$. Now, define

\[
A^+ = V D^+ V^\top \quad \text{and} \quad \hat{w} = A^+ b.
\]

Let $v_i$ denote the $i$'th column of $V$. Then, we have

\[
A\hat{w} = A A^+ b = V D V^\top V D^+ V^\top b = V D D^+ V^\top b = \sum_{i : D_{i,i} \neq 0} v_i v_i^\top b.
\]

That is, $A\hat{w}$ is the projection of $b$ onto the span of those vectors $v_i$ for which $D_{i,i} \neq 0$. Since the linear span of $x_1, \ldots, x_m$ is the same as the linear span of those $v_i$, and $b$ is in the linear span of the $x_i$, we obtain that $A\hat{w} = b$, which concludes our argument.
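The same construction can be sketched numerically. The following minimal NumPy example assumes a rank-deficient $A$, obtained from synthetic instances that do not span $\mathbb{R}^d$; all variable names are illustrative.

```python
import numpy as np

# Minimal sketch of the generalized inverse A^+ = V D^+ V^T and w_hat = A^+ b.
rng = np.random.default_rng(0)
d, m = 4, 10
X = rng.normal(size=(d, m))
X[3, :] = 0.0                 # the instances do not span R^d, so A is singular
y = rng.normal(size=m)
A, b = X @ X.T, X @ y

# Eigenvalue decomposition A = V D V^T (A is symmetric, so eigh applies).
eigvals, V = np.linalg.eigh(A)

# D^+_{i,i} = 1/D_{i,i} when D_{i,i} != 0, and 0 otherwise (up to a tolerance).
inv_vals = np.zeros_like(eigvals)
nonzero = np.abs(eigvals) > 1e-10
inv_vals[nonzero] = 1.0 / eigvals[nonzero]

A_plus = V @ np.diag(inv_vals) @ V.T
w_hat = A_plus @ b

# A w_hat is the projection of b onto span{v_i : D_{i,i} != 0}; since b lies in
# that span, we indeed recover A w_hat = b.
assert np.allclose(A @ w_hat, b)
```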


9.2.2 Linear Regression for Polynomial Regression Tasks


Some learning tasks call for nonlinear predictors, such as polynomial predictors.
Take, for instance, a one dimensional polynomial function of degree $n$, that is,

\[
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n,
\]

where $(a_0, \ldots, a_n)$ is a vector of coefficients of size $n+1$. In the following we depict a training set that is better fitted using a 3rd degree polynomial predictor than using a linear predictor.
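Such a polynomial predictor can be fitted with the linear regression machinery above by mapping each instance $x$ to the feature vector $(1, x, x^2, \ldots, x^n)$ and performing linear regression over these features. The following minimal NumPy sketch illustrates this reduction on synthetic data; the names (psi, Phi) and the generated data are purely illustrative.

```python
import numpy as np

# Minimal sketch: degree-n polynomial regression as linear regression over
# the features (1, x, x^2, ..., x^n).
rng = np.random.default_rng(0)
n, m = 3, 30
x = rng.uniform(-1.0, 1.0, size=m)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=m)   # cubic trend + noise

def psi(x, n):
    """Map each scalar x to the feature vector (1, x, x^2, ..., x^n)."""
    return np.vander(x, N=n + 1, increasing=True)            # shape (m, n+1)

# Least-squares fit of the coefficients a_0, ..., a_n over the mapped features.
Phi = psi(x, n)
a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(a, 2))   # should roughly recover the coefficients (1, -2, 0, 0.5)
```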
