Understanding Machine Learning: From Theory to Algorithms

9.3 Logistic Regression

The hypothesis class is therefore (where for simplicity we are using homogeneous
linear functions):


\[
H_{\mathrm{sig}} = \phi_{\mathrm{sig}} \circ L_d = \left\{ x \mapsto \phi_{\mathrm{sig}}(\langle w, x \rangle) : w \in \mathbb{R}^d \right\}.
\]

Note that when $\langle w,x\rangle$ is very large then $\phi_{\mathrm{sig}}(\langle w,x\rangle)$ is close to 1, whereas if
$\langle w,x\rangle$ is very small then $\phi_{\mathrm{sig}}(\langle w,x\rangle)$ is close to 0. Recall that the prediction of the
halfspace corresponding to a vector $w$ is $\mathrm{sign}(\langle w,x\rangle)$. Therefore, the predictions
of the halfspace hypothesis and the logistic hypothesis are very similar whenever
$|\langle w,x\rangle|$ is large. However, when $|\langle w,x\rangle|$ is close to 0 we have that $\phi_{\mathrm{sig}}(\langle w,x\rangle) \approx \tfrac{1}{2}$.
Intuitively, the logistic hypothesis is not sure about the value of the label, so it
guesses that the label is $\mathrm{sign}(\langle w,x\rangle)$ with probability slightly larger than 50%.
In contrast, the halfspace hypothesis always outputs a deterministic prediction
of either $1$ or $-1$, even if $|\langle w,x\rangle|$ is very close to 0.
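
The following minimal Python sketch illustrates this contrast; the function names (phi_sig, logistic_hypothesis, halfspace_hypothesis) and the example vectors are illustrative choices, not part of the text:

```python
import numpy as np

def phi_sig(z):
    """Sigmoid function: phi_sig(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hypothesis(w, x):
    """Probability that the label is +1 under the logistic hypothesis h_w."""
    return phi_sig(np.dot(w, x))

def halfspace_hypothesis(w, x):
    """Deterministic prediction of the corresponding halfspace."""
    return np.sign(np.dot(w, x))

w = np.array([2.0, -1.0])
for x in [np.array([3.0, -2.0]), np.array([0.1, 0.15])]:
    # Large |<w,x>|: the logistic output is close to 0 or 1 and agrees with the halfspace.
    # |<w,x>| close to 0: the logistic output is close to 1/2, i.e. "not sure".
    print(np.dot(w, x), logistic_hypothesis(w, x), halfspace_hypothesis(w, x))
```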
Next, we need to specify a loss function. That is, we should define how bad it
is to predict some $h_w(x) \in [0,1]$ given that the true label is $y \in \{\pm 1\}$. Clearly,
we would like $h_w(x)$ to be large if $y = 1$ and $1 - h_w(x)$ (i.e., the
probability of predicting $-1$) to be large if $y = -1$. Note that


\[
1 - h_w(x) \;=\; 1 - \frac{1}{1 + \exp(-\langle w,x\rangle)} \;=\; \frac{\exp(-\langle w,x\rangle)}{1 + \exp(-\langle w,x\rangle)} \;=\; \frac{1}{1 + \exp(\langle w,x\rangle)}.
\]

Hence, for both values of $y$, the probability that $h_w$ assigns to the incorrect label is $\frac{1}{1+\exp(y\langle w,x\rangle)}$.
Therefore, any reasonable loss function would increase monotonically with $\frac{1}{1+\exp(y\langle w,x\rangle)}$,
or equivalently, would increase monotonically with $1 + \exp(-y\langle w,x\rangle)$. The
logistic loss function used in logistic regression penalizes $h_w$ based on the log of
$1 + \exp(-y\langle w,x\rangle)$ (recall that log is a monotonic function). That is,


\[
\ell(h_w, (x, y)) = \log\left(1 + \exp(-y\langle w,x\rangle)\right).
\]
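
For concreteness, the pointwise loss can be written as a short Python function; this is only an illustrative sketch, where np.logaddexp is used merely as a numerically stable way to evaluate $\log(1 + \exp(\cdot))$:

```python
import numpy as np

def logistic_loss(w, x, y):
    """Logistic loss log(1 + exp(-y * <w, x>)) for a single example (x, y), y in {+1, -1}."""
    # np.logaddexp(0, z) computes log(exp(0) + exp(z)) = log(1 + exp(z)) without overflow.
    return np.logaddexp(0.0, -y * np.dot(w, x))
```

Note that when $y\langle w,x\rangle$ is large and positive the loss is close to 0, whereas when it is large and negative the loss grows roughly linearly in $|\langle w,x\rangle|$.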

Therefore, given a training set $S = (x_1, y_1), \ldots, (x_m, y_m)$, the ERM problem
associated with logistic regression is


\[
\operatorname*{argmin}_{w \in \mathbb{R}^d} \; \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp(-y_i \langle w, x_i \rangle)\right). \tag{9.10}
\]

The advantage of the logistic loss function is that it is a convex function with
respect to $w$; hence the ERM problem can be solved efficiently using standard
methods. We will study how to learn with convex functions, and in particular
specify a simple algorithm for minimizing convex functions, in later chapters.
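
As a preview of those chapters, a minimal gradient-descent sketch for Equation (9.10) could look as follows; the fixed step size, iteration count, and function name are illustrative assumptions rather than the book's algorithm:

```python
import numpy as np

def erm_logistic(X, y, step_size=0.1, num_iters=1000):
    """Approximately solve Equation (9.10) by gradient descent.

    X : (m, d) array of instances, y : (m,) array of labels in {+1, -1}.
    """
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        margins = y * (X @ w)                  # y_i <w, x_i> for every example
        # The gradient of (1/m) sum_i log(1 + exp(-y_i <w, x_i>)) is
        # (1/m) sum_i -y_i x_i * sigmoid(-y_i <w, x_i>).
        coeffs = -y / (1.0 + np.exp(margins))  # -y_i * sigmoid(-y_i <w, x_i>)
        grad = (X.T @ coeffs) / m
        w -= step_size * grad
    return w

# Illustrative usage on a toy separable data set:
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = erm_logistic(X, y)
print(w_hat, np.sign(X @ w_hat))
```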
The ERM problem associated with logistic regression (Equation (9.10)) is iden-
tical to the problem of finding a Maximum Likelihood Estimator, a well-known
statistical approach for finding the parameters that maximize the joint probabil-
ity of a given data set assuming a specific parametric probability function. We
will study the Maximum Likelihood approach in Chapter 24.
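
To verify this identity briefly (anticipating the treatment in Chapter 24), interpret $h_w(x)$ as the probability of the label $1$ given $x$, so that $\mathbb{P}[Y = y \mid X = x] = \frac{1}{1+\exp(-y\langle w,x\rangle)}$ for $y \in \{\pm 1\}$. Then
\[
\operatorname*{argmax}_{w \in \mathbb{R}^d} \prod_{i=1}^{m} \mathbb{P}[Y = y_i \mid X = x_i]
\;=\; \operatorname*{argmax}_{w \in \mathbb{R}^d} \sum_{i=1}^{m} \log \frac{1}{1 + \exp(-y_i\langle w, x_i\rangle)}
\;=\; \operatorname*{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^{m} \log\left(1 + \exp(-y_i\langle w, x_i\rangle)\right),
\]
which agrees with Equation (9.10) up to the factor $\frac{1}{m}$, which does not affect the minimizer.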
