Understanding Machine Learning: From Theory to Algorithms


A Formal Learning Model


Hint: Use the geometric-arithmetic mean inequality.


  6. Let H be a hypothesis class of binary classifiers. Show that if H is
     agnostic PAC learnable, then H is PAC learnable as well. Furthermore, if A
     is a successful agnostic PAC learner for H, then A is also a successful
     PAC learner for H.

  7. (*) The Bayes optimal predictor: Show that for every probability
     distribution D, the Bayes optimal predictor f_D is optimal, in the sense
     that for every classifier g from X to {0,1}, L_D(f_D) ≤ L_D(g).
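
Exercise 7 refers to the Bayes optimal predictor defined earlier in the chapter: f_D(x) = 1 if P[y = 1 | x] ≥ 1/2, and f_D(x) = 0 otherwise. The following minimal Python sketch (a hand-picked toy distribution, not part of the text) checks the claimed inequality L_D(f_D) ≤ L_D(g) by brute force on a three-point domain:

```python
import itertools

# Toy joint distribution D over X x {0,1}, with X = {0, 1, 2}.
# D[(x, y)] is the probability mass of the pair (x, y); the numbers are
# hypothetical and chosen only for illustration.
D = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.05,
    (2, 0): 0.12, (2, 1): 0.18,
}
X = sorted({x for (x, _) in D})

def risk(h):
    """True 0-1 risk L_D(h) = P[h(x) != y] under the toy distribution."""
    return sum(p for (x, y), p in D.items() if h[x] != y)

def p1(x):
    """Conditional probability P[y = 1 | x]."""
    return D[(x, 1)] / (D[(x, 0)] + D[(x, 1)])

# Bayes optimal predictor: predict 1 iff P[y = 1 | x] >= 1/2.
f_D = {x: 1 if p1(x) >= 0.5 else 0 for x in X}

# Compare against every deterministic classifier g : X -> {0,1}.
best = min(risk(dict(zip(X, labels)))
           for labels in itertools.product([0, 1], repeat=len(X)))
print(f"L_D(f_D) = {risk(f_D):.2f}, min over all g = {best:.2f}")
```

On this toy distribution both values come out to 0.27, consistent with the inequality the exercise asks to prove in general.
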
  8. (*) We say that a learning algorithm A is better than B with respect to
     some probability distribution, D, if

         L_D(A(S)) ≤ L_D(B(S))

     for all samples S ∈ (X × {0,1})^m. We say that a learning algorithm A is
     better than B, if it is better than B with respect to all probability
     distributions D over X × {0,1}.


     1. A probabilistic label predictor is a function that assigns to every
        domain point x a probability value, h(x) ∈ [0,1], that determines the
        probability of predicting the label 1. That is, given such an h and an
        input, x, the label for x is predicted by tossing a coin with bias h(x)
        toward Heads and predicting 1 iff the coin comes up Heads. Formally, we
        define a probabilistic label predictor as a function, h : X → [0,1].
        The loss of such an h on an example (x,y) is defined to be |h(x) − y|,
        which is exactly the probability that the prediction of h will not be
        equal to y. Note that if h is deterministic, that is, returns values in
        {0,1}, then |h(x) − y| = 1_[h(x) ≠ y]. (A sketch illustrating this loss
        appears after this list.)
        Prove that for every data-generating distribution D over X × {0,1},
        the Bayes optimal predictor has the smallest risk (w.r.t. the loss
        function ℓ(h,(x,y)) = |h(x) − y|, among all possible label predictors,
        including probabilistic ones).



     2. Let X be a domain and {0,1} be a set of labels. Prove that for every
        distribution D over X × {0,1}, there exists a learning algorithm A_D
        that is better than any other learning algorithm with respect to D.

     3. Prove that for every learning algorithm A there exist a probability
        distribution, D, and a learning algorithm B such that A is not better
        than B w.r.t. D.
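
The following minimal Python sketch (a single hypothetical example with illustrative numbers only) spells out part 1 of exercise 8 above: how a probabilistic label predictor turns h(x) into a prediction, why |h(x) − y| is exactly its probability of error, and what the expected loss at one point looks like for a probabilistic choice of h(x) versus the Bayes optimal (deterministic) choice:

```python
import random

random.seed(0)

def predict(h_x):
    """Toss a coin with bias h_x toward Heads; predict 1 iff it comes up Heads."""
    return 1 if random.random() < h_x else 0

# The loss |h(x) - y| is exactly the probability that the randomized prediction
# disagrees with the true label.  Empirical check on one (hypothetical) example
# with h(x) = 0.3 and y = 1:
h_x, y = 0.3, 1
trials = 100_000
err = sum(predict(h_x) != y for _ in range(trials)) / trials
print(f"empirical mismatch rate {err:.3f}  vs  |h(x) - y| = {abs(h_x - y):.3f}")

# Expected loss at a single point x with p = P[y = 1 | x]:
#   E|h(x) - y| = p * (1 - h(x)) + (1 - p) * h(x).
def expected_loss(p, h_x):
    return p * (1 - h_x) + (1 - p) * h_x

p = 0.7  # hypothetical conditional probability of the label 1 at this x
print(f"probabilistic h(x) = 0.7: expected loss {expected_loss(p, 0.7):.2f}")
print(f"Bayes choice  h(x) = 1.0: expected loss {expected_loss(p, 1.0):.2f}")
```

The comparison at the end is only an illustration at a single point with made-up numbers; the exercise asks for the statement over all distributions and all probabilistic predictors.
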

  9. Consider a variant of the PAC model in which there are two example
     oracles: one that generates positive examples and one that generates
     negative examples, both according to the underlying distribution D on X.
     Formally, given a target function f : X → {0,1}, let D+ be the
     distribution over X+ = {x ∈ X : f(x) = 1} defined by
     D+(A) = D(A)/D(X+), for every A ⊂ X+. Similarly, D− is the distribution
     over X− induced by D. (A sketch of how such oracles can be simulated
     appears after this exercise.)
     The definition of PAC learnability in the two-oracle model is the same as
     the standard definition of PAC learnability except that here the learner
     has access to m_H^+(ε, δ) i.i.d. examples from D+ and m_H^−(ε, δ) i.i.d.
     examples from D−. The learner's goal is to output h s.t. with probability
     at least 1 − δ (over the choice

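The oracles D+ and D− in exercise 9 are just D conditioned on the label assigned by f. Below is a minimal Python sketch of how the two example oracles could be simulated by rejection sampling, assuming a hypothetical uniform D over ten points and a hypothetical target f (neither is part of the text):

```python
import random

random.seed(0)

# Hypothetical setup, for illustration only: X = {0,...,9}, D uniform on X,
# and a target function f labeling the even points as positive.
X = list(range(10))

def f(x):
    return 1 if x % 2 == 0 else 0

def sample_D():
    """One draw x ~ D (here: uniform over X)."""
    return random.choice(X)

def sample_D_plus():
    """One draw from D+, i.e. D conditioned on f(x) = 1 (rejection sampling)."""
    while True:
        x = sample_D()
        if f(x) == 1:
            return x

def sample_D_minus():
    """One draw from D-, i.e. D conditioned on f(x) = 0 (rejection sampling)."""
    while True:
        x = sample_D()
        if f(x) == 0:
            return x

# In the two-oracle model the learner would request m_H^+(eps, delta) examples
# from D+ and m_H^-(eps, delta) examples from D-; here we draw a few of each
# just to show the interface.
S_plus  = [(sample_D_plus(), 1) for _ in range(5)]
S_minus = [(sample_D_minus(), 0) for _ in range(5)]
print(S_plus)
print(S_minus)
```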