52 A Formal Learning Model
Hint: Use the geometric-arithmetic mean inequality.
- Let H be a hypothesis class of binary classifiers. Show that if H is agnostic PAC learnable, then H is PAC learnable as well. Furthermore, if A is a successful agnostic PAC learner for H, then A is also a successful PAC learner for H.
7. () The Bayes optimal predictor: Show that for every probability distribution D, the Bayes optimal predictor f_D is optimal, in the sense that for every classifier g from X to {0, 1}, L_D(f_D) ≤ L_D(g).
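As a sanity check for this exercise, the claim can be verified by brute force on a small finite domain. The following Python sketch (the marginal P and conditional eta below are illustrative toy numbers, not taken from the text) computes the risk of every deterministic classifier and confirms that the Bayes predictor, which outputs 1 iff P(y = 1 | x) ≥ 1/2, attains the minimum.

```python
from itertools import product

# Hypothetical toy setup: a finite domain with marginal P(x) and
# conditional eta(x) = P(y = 1 | x); these numbers are illustrative.
P = {0: 0.2, 1: 0.5, 2: 0.3}
eta = {0: 0.9, 1: 0.3, 2: 0.6}

def risk(g):
    # L_D(g) = P[g(x) != y] = sum_x P(x) * P[g(x) != y | x]
    return sum(P[x] * ((1 - eta[x]) if g[x] == 1 else eta[x]) for x in P)

# Bayes optimal predictor: predict 1 iff P(y = 1 | x) >= 1/2.
f_bayes = {x: 1 if eta[x] >= 0.5 else 0 for x in P}

# Check optimality by enumerating all 2^|X| deterministic classifiers.
all_risks = [risk(dict(zip(P, bits))) for bits in product([0, 1], repeat=len(P))]
assert abs(risk(f_bayes) - min(all_risks)) < 1e-12
```

This is only a finite-domain check, of course; the exercise asks for a proof valid for arbitrary D, which follows by comparing the conditional risks pointwise.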
8. () We say that a learning algorithm A is better than B with respect to some probability distribution, D, if
L_D(A(S)) ≤ L_D(B(S))
for all samples S ∈ (X × {0, 1})^m. We say that a learning algorithm A is better than B if it is better than B with respect to all probability distributions D over X × {0, 1}.
- A probabilistic label predictor is a function that assigns to every domain point x a probability value, h(x) ∈ [0, 1], that determines the probability of predicting the label 1. That is, given such an h and an input, x, the label for x is predicted by tossing a coin with bias h(x) toward Heads and predicting 1 iff the coin comes up Heads. Formally, we define a probabilistic label predictor as a function, h : X → [0, 1]. The loss of such an h on an example (x, y) is defined to be |h(x) − y|, which is exactly the probability that the prediction of h will not be equal to y. Note that if h is deterministic, that is, returns values in {0, 1}, then |h(x) − y| = 1_[h(x) ≠ y].
Prove that for every data-generating distribution D over X × {0, 1}, the Bayes optimal predictor has the smallest risk (w.r.t. the loss function ℓ(h, (x, y)) = |h(x) − y|) among all possible label predictors, including probabilistic ones.
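The key observation behind this part is that the risk is linear in each value h(x), so it is minimized at an endpoint of [0, 1] and randomization never helps. A small Python sketch of that argument (the toy distribution below is illustrative, not from the exercise):

```python
import random

# Toy distribution (illustrative numbers, not from the exercise):
# P(x) is the marginal over a finite domain, eta(x) = P(y = 1 | x).
P = {0: 0.3, 1: 0.4, 2: 0.3}
eta = {0: 0.8, 1: 0.4, 2: 0.55}

def risk(h):
    # Risk under the loss |h(x) - y|:
    # E[|h(x) - y|] = sum_x P(x) * (eta(x)*(1 - h(x)) + (1 - eta(x))*h(x)).
    return sum(P[x] * (eta[x] * (1 - h[x]) + (1 - eta[x]) * h[x]) for x in P)

# Bayes optimal predictor, viewed as a (degenerate) probabilistic predictor.
f_bayes = {x: 1.0 if eta[x] >= 0.5 else 0.0 for x in P}

# Since the risk is linear in each h(x), no randomized predictor can
# do better than the deterministic Bayes predictor.
random.seed(0)
for _ in range(1000):
    h = {x: random.random() for x in P}
    assert risk(h) >= risk(f_bayes) - 1e-12
```

The random search is only evidence, not a proof; the exercise asks you to make the pointwise linearity argument rigorous for arbitrary D.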
- Let X be a domain and {0, 1} be a set of labels. Prove that for every distribution D over X × {0, 1}, there exists a learning algorithm A_D that is better than any other learning algorithm with respect to D.
- Prove that for every learning algorithm A there exist a probability distribution, D, and a learning algorithm B such that A is not better than B w.r.t. D.
- Consider a variant of the PAC model in which there are two example oracles: one that generates positive examples and one that generates negative examples, both according to the underlying distribution D on X. Formally, given a target function f : X → {0, 1}, let D+ be the distribution over X+ = {x ∈ X : f(x) = 1} defined by D+(A) = D(A)/D(X+), for every A ⊂ X+. Similarly, D− is the distribution over X− induced by D.
The definition of PAC learnability in the two-oracle model is the same as the standard definition of PAC learnability except that here the learner has access to m+_H(ε, δ) i.i.d. examples from D+ and m−(ε, δ) i.i.d. examples from D−. The learner's goal is to output h s.t. with probability at least 1 − δ (over the choice