Understanding Machine Learning: From Theory to Algorithms


Regularization and Stability


2. Let B be the unit ball of R^d, let H = B, let Z = B × {0,1}^d, and let
ℓ : Z × H → R be defined as follows:

$$\ell(w, (x, \alpha)) = \sum_{i=1}^{d} \alpha_i (x_i - w_i)^2.$$

This problem corresponds to an unsupervised learning task, meaning that we
do not try to predict the label of x. Instead, what we try to do is to find the
“center of mass” of the distribution over B. However, there is a twist, modeled
by the vectors α. Each example is a pair (x, α), where x is the instance and
α indicates which features of x are “active” and which are “turned off.” A
hypothesis is a vector w representing the center of mass of the distribution,
and the loss function is the squared Euclidean distance between x and w, but
only with respect to the “active” elements of x.
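
To make the definition concrete, here is a minimal Python sketch of this loss (the function name `partial_squared_loss` and the example values are ours, purely illustrative):

```python
import numpy as np

def partial_squared_loss(w, x, alpha):
    """Squared Euclidean distance between x and w, restricted to the
    coordinates that alpha marks as active: sum_i alpha_i * (x_i - w_i)^2."""
    return float(np.sum(alpha * (x - w) ** 2))

# Example in d = 2: only the first coordinate is active, so the
# second coordinate of x is ignored entirely.
w = np.array([0.0, 0.0])          # hypothesis (candidate center of mass)
x = np.array([0.6, 0.8])          # instance in the unit ball of R^2
alpha = np.array([1.0, 0.0])      # feature mask
print(partial_squared_loss(w, x, alpha))   # 0.36
```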


  • Show that this problem is learnable using the RLM rule with a sample
    complexity that does not depend on d.

  • Consider a distribution D over Z as follows: x is fixed to be some x_0,
    and each element of α is sampled to be either 1 or 0 with equal
    probability. Show that the rate of uniform convergence of this problem
    grows with d.
    Hint: Let m be a training set size. Show that if d ≫ 2^m, then there is
    a high probability of sampling a set of examples such that there exists
    some j ∈ [d] for which α_j = 1 for all the examples in the training set
    (see the simulation sketch after this list). Show that such a sample
    cannot be ε-representative. Conclude that the sample complexity of
    uniform convergence must grow with log(d).

  • Conclude that if we take d to infinity we obtain a problem that is
    learnable but for which the uniform convergence property does not hold.
    Compare to the fundamental theorem of statistical learning.
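
The hint can be checked empirically. The sketch below (our own, not from the book) draws m masks α uniformly from {0,1}^d and estimates the probability that some coordinate j is active in every example; analytically this probability is 1 − (1 − 2^{−m})^d, which tends to 1 once d ≫ 2^m:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_some_coordinate_always_active(m, d, trials=2000):
    """Estimate P(exists j in [d] with alpha_j = 1 in all m examples)
    when each mask alpha is uniform over {0,1}^d."""
    hits = 0
    for _ in range(trials):
        alphas = rng.integers(0, 2, size=(m, d))  # m random masks
        # A column whose minimum is 1 is active in every example.
        if np.any(alphas.min(axis=0) == 1):
            hits += 1
    return hits / trials

m = 5
for d in (2**m, 2**(m + 3), 2**(m + 6)):
    est = prob_some_coordinate_always_active(m, d)
    exact = 1 - (1 - 2.0**-m) ** d
    print(f"d = {d:5d}: estimated {est:.3f}, exact {exact:.3f}")
```

For m = 5 the probability climbs from about 0.63 at d = 2^m to essentially 1 at d = 2^{m+6}, which is the regime the hint asks you to exhibit.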
3. Stability and Asymptotic ERM Are Sufficient for Learnability:
We say that a learning rule A is an AERM (Asymptotic Empirical Risk
Minimizer) with rate ε(m) if for every distribution D it holds that


$$\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[ L_S(A(S)) - \min_{h \in \mathcal{H}} L_S(h) \right] \le \epsilon(m).$$

We say that a learning rule A learns a class H with rate ε(m) if for every
distribution D it holds that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) \right] \le \epsilon(m).$$
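
To see how these two quantities differ, here is a small Monte Carlo sketch (a toy instantiation of our own choosing, not the book's): scalar hypotheses under squared loss, with the ERM rule over a finite class, which is trivially an AERM with rate 0; the learning rate is estimated against a large held-out sample standing in for L_D:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting (ours, not the book's): scalar hypotheses, squared loss.
H = np.linspace(0.0, 1.0, 11)            # a small finite hypothesis class

def sample(n):
    return rng.normal(0.3, 0.2, size=n)  # i.i.d. draws z ~ D

def risk(h, zs):
    return float(np.mean((zs - h) ** 2)) # L_S(h), or approximately L_D(h)

def A(S):
    """ERM over H -- an AERM with rate epsilon(m) = 0."""
    return H[np.argmin([risk(h, S) for h in H])]

m, trials = 20, 300
Z = sample(100_000)                      # large sample as a proxy for D
LD_star = min(risk(h, Z) for h in H)     # approximates min_h L_D(h)

aerm_gaps, learn_gaps = [], []
for _ in range(trials):
    S = sample(m)
    h = A(S)
    aerm_gaps.append(risk(h, S) - min(risk(g, S) for g in H))
    learn_gaps.append(risk(h, Z) - LD_star)

print("AERM gap :", np.mean(aerm_gaps))   # exactly 0 for ERM
print("learn gap:", np.mean(learn_gaps))  # small, shrinks as m grows
```

The first printed gap is identically 0 for ERM, while the second is a strictly positive quantity that decreases with m; the exercise asks you to relate the two in general.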

Prove the following:

Theorem 13.12 If a learning algorithm A is on-average-replace-one-stable
with rate ε_1(m) and is an AERM with rate ε_2(m), then it learns H with rate
ε_1(m) + ε_2(m).
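
A natural way to begin (a sketch of one standard route, not necessarily the book's exact argument) is to decompose the expected excess risk into a generalization term and an optimization term:

$$\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}(A(S)) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) \right] \le \underbrace{\mathbb{E}_S\!\left[ L_{\mathcal{D}}(A(S)) - L_S(A(S)) \right]}_{\text{stability:}\ \le\, \epsilon_1(m)} + \underbrace{\mathbb{E}_S\!\left[ L_S(A(S)) - \min_{h \in \mathcal{H}} L_S(h) \right]}_{\text{AERM:}\ \le\, \epsilon_2(m)}.$$

The inequality holds because E_S[min_h L_S(h)] ≤ min_h E_S[L_S(h)] = min_h L_D(h); the first bracket is controlled by the chapter's identity equating on-average-replace-one stability with the expected generalization gap, and the second is exactly the AERM condition.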