Understanding Machine Learning: From Theory to Algorithms

29.4 On Good and Bad ERMs

learnable by some ERM, but other ERMs will fail to learn it. Clearly, this also
implies that the class is learnable but it does not have the uniform convergence
property. For simplicity, we consider only the realizable case.
The class we consider is defined as follows. The instance space X will be any
finite or countable set. Let Pf(X) be the collection of all finite and cofinite
subsets of X (that is, for each A ∈ Pf(X), either A or X \ A must be finite).
Instead of [k], the label set is Y = Pf(X) ∪ {*}, where * is some special label.
For every A ∈ Pf(X) define h_A : X → Y by

    h_A(x) = A  if x ∈ A,
             *  if x ∉ A.

Finally, the hypothesis class we take is

    H = {h_A : A ∈ Pf(X)}.

Let A be some ERM algorithm for H. Assume that A operates on a sample
labeled by h_A ∈ H. Since h_A is the only hypothesis in H that might return
the label A, if A observes the label A, it "knows" that the learned hypothesis
is h_A, and, as an ERM, must return it (note that in this case the error of the
returned hypothesis is 0). Therefore, to specify an ERM, we need only specify
the hypothesis it returns upon receiving a sample of the form

    S = {(x_1, *), ..., (x_m, *)}.

We consider two ERMs: the first, A_good, is defined by

    A_good(S) = h_∅;

that is, it outputs the hypothesis which predicts '*' for every x ∈ X. The second
ERM, A_bad, is defined by

    A_bad(S) = h_{{x_1, ..., x_m}^c}.
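As a concrete sketch (hypothetical Python, not from the text), both ERMs can be written down directly for a finite instance set X. Here each hypothesis h_A is represented by its set A, with h_A(x) = A when x ∈ A and '*' otherwise; the names `h`, `A_good`, and `A_bad` mirror the notation above:

```python
def h(A):
    """The hypothesis h_A: returns the label A on points of A, '*' elsewhere."""
    A = frozenset(A)
    return lambda x: A if x in A else '*'

def A_good(sample):
    """ERM returning h_emptyset on an all-'*' sample (predicts '*' everywhere)."""
    for x, y in sample:
        if y != '*':          # a label A reveals the target h_A exactly
            return h(y)
    return h(frozenset())

def A_bad(X, sample):
    """ERM returning h_{{x_1,...,x_m}^c} on an all-'*' sample."""
    for x, y in sample:
        if y != '*':
            return h(y)
    seen = {x for x, _ in sample}
    return h(set(X) - seen)   # complement of the sample points

# Both outputs have zero empirical error on an all-'*' sample,
# so both algorithms are legitimate ERMs for H.
X = range(10)
S = [(0, '*'), (3, '*'), (7, '*')]
hg, hb = A_good(S), A_bad(X, S)
assert all(hg(x) == y for x, y in S)
assert all(hb(x) == y for x, y in S)
```

Note that A_bad's hypothesis agrees with the sample only because the sample points were removed from its set; on every unseen point it outputs the complement set rather than '*'.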

The following claim shows that the sample complexity of A_bad is about |X| times
larger than the sample complexity of A_good. This establishes a gap between
different ERMs. If X is infinite, we even obtain a learnable class that is not
learnable by every ERM.
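A small simulation (hypothetical, not from the text) makes the gap tangible: take the target h_A with A = ∅ and the uniform distribution over a finite X, so every example is labeled '*'. Then A_good's hypothesis never errs, while A_bad's hypothesis errs on every point it has not seen:

```python
import random

random.seed(0)
N = 1000                      # |X|; assumption: X = {0, ..., N-1}, D uniform
m = 50                        # sample size
sample = [(random.randrange(N), '*') for _ in range(m)]
seen = {x for x, _ in sample}

# A_good returns h_emptyset: it predicts '*' everywhere, matching the
# target h_emptyset, so its true error is 0.
err_good = 0.0

# A_bad returns h_C with C = X \ {x_1, ..., x_m}: it predicts the (wrong)
# label C on every unseen point, so its true error is |X \ seen| / |X|.
err_bad = (N - len(seen)) / N

print(err_good, err_bad)      # err_bad is at least (N - m) / N = 0.95 here
```

With m fixed and N growing, err_bad approaches 1, which is the |X|-dependence the claim below quantifies.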


Claim 29.9



  1. Let ε, δ > 0, D a distribution over X, and h_A ∈ H. Let S be an i.i.d. sample
     consisting of m ≥ (1/ε) log(1/δ) examples, sampled according to D and labeled by
     h_A. Then, with probability of at least 1 − δ, the hypothesis returned by A_good
     will have an error of at most ε.

  2. There exists a constant a > 0 such that for every 0 < ε < a there exists a
     distribution D over X and h_A ∈ H such that the following holds. The hypothesis
     returned by A_bad upon receiving a sample of size m ≤ (|X| − 1)/(6ε), sampled
     according to D and labeled by h_A, will have error ≥ ε with probability ≥ e^{−1/6}.
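The first item follows from a standard argument, sketched here (the text does not spell it out): A_good fails only if it returns h_∅, which disagrees with the target h_A exactly on A, so failure requires that D(A) > ε and yet no sample point landed in A.

```latex
% If D(A) <= eps the claim is immediate, since L_D(h_emptyset) = D(A);
% and if some x_i lands in A, A_good recovers h_A exactly. Otherwise:
\Pr\bigl[L_{\mathcal D}(A_{\mathrm{good}}(S)) > \epsilon\bigr]
  \le \Pr[\forall i,\; x_i \notin A]
  = (1 - \mathcal D(A))^m
  < (1 - \epsilon)^m
  \le e^{-\epsilon m}
  \le \delta
\quad\text{whenever } m \ge \tfrac{1}{\epsilon}\log\tfrac{1}{\delta}.
```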
