Understanding Machine Learning: From Theory to Algorithms


In the context of modern learning theory, the use of stability can be traced back at least to the work of Rogers & Wagner (1978), which noted that the sensitivity of a learning algorithm with regard to small changes in the sample controls the variance of the leave-one-out estimate. The authors used this observation to obtain generalization bounds for the k-nearest neighbor algorithm (see Chapter 19). These results were later extended to other “local” learning algorithms (see Devroye, Györfi & Lugosi (1996) and references therein). In addition, practical methods have been developed to introduce stability into learning algorithms, in particular the Bagging technique introduced by Breiman (1996).

Over the last decade, stability has been studied as a generic condition for learnability; see (Kearns & Ron 1999, Bousquet & Elisseeff 2002, Kutin & Niyogi 2002, Rakhlin, Mukherjee & Poggio 2005, Mukherjee, Niyogi, Poggio & Rifkin 2006). Our presentation follows the work of Shalev-Shwartz, Shamir, Srebro & Sridharan (2010), who showed that stability is both necessary and sufficient for learning. They also showed that all convex-Lipschitz-bounded learning problems are learnable using RLM, even though for some convex-Lipschitz-bounded learning problems uniform convergence does not hold in a strong sense.

13.7 Exercises


1. From Bounded Expected Risk to Agnostic PAC Learning: Let $A$ be an algorithm that guarantees the following: if $m \ge m_{\mathcal{H}}(\epsilon)$, then for every distribution $\mathcal{D}$ it holds that

   \[
     \mathbb{E}_{S\sim\mathcal{D}^m}\bigl[L_{\mathcal{D}}(A(S))\bigr] \;\le\; \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h) + \epsilon .
   \]


  • Show that for every $\delta\in(0,1)$, if $m \ge m_{\mathcal{H}}(\epsilon\,\delta)$ then with probability of at
    least $1-\delta$ it holds that $L_{\mathcal{D}}(A(S)) \le \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$. (A worked version of
    this step is sketched after the exercise.)
    Hint: Observe that the random variable $L_{\mathcal{D}}(A(S)) - \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h)$ is
    nonnegative and rely on Markov’s inequality.

  • For every $\delta\in(0,1)$ let

    \[
      m_{\mathcal{H}}(\epsilon,\delta) \;=\; m_{\mathcal{H}}(\epsilon/2)\,\bigl\lceil \log_2(1/\delta) \bigr\rceil
      \;+\; \frac{\log(4/\delta) + \log\bigl(\lceil \log_2(1/\delta) \rceil\bigr)}{\epsilon^2}\,.
    \]

    Suggest a procedure that agnostic PAC learns the problem with sample
    complexity of $m_{\mathcal{H}}(\epsilon,\delta)$, assuming that the loss function is bounded by 1.
    Hint: Let $k = \lceil \log_2(1/\delta) \rceil$. Divide the data into $k+1$ chunks, where each
    of the first $k$ chunks is of size $m_{\mathcal{H}}(\epsilon/2)$ examples. Train $A$ on each of the first
    $k$ chunks. On the basis of the previous question, argue that the probability
    that $L_{\mathcal{D}}(A(S_i)) > \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$ holds for every one of these chunks
    $S_1,\dots,S_k$ is at most $2^{-k} \le \delta/2$. Finally, use the last chunk as a validation
    set. (A code sketch of this procedure appears after the exercise.)
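
The Markov step in the first bullet can be carried out as follows (a sketch only; the shorthand $X$ is introduced here and is not part of the exercise statement). If $m \ge m_{\mathcal{H}}(\epsilon\,\delta)$, the assumed guarantee gives $\mathbb{E}[X] \le \epsilon\delta$ for the nonnegative random variable $X = L_{\mathcal{D}}(A(S)) - \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h)$, so Markov’s inequality yields

\[
  \Pr_{S\sim\mathcal{D}^m}\bigl[ L_{\mathcal{D}}(A(S)) > \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h) + \epsilon \bigr]
  \;=\; \Pr[X > \epsilon]
  \;\le\; \frac{\mathbb{E}[X]}{\epsilon}
  \;\le\; \frac{\epsilon\delta}{\epsilon} \;=\; \delta .
\]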
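
The chunk-and-validate procedure from the second hint can be written out in Python (a minimal sketch under assumed interfaces, not code from the book: A(chunk) is taken to return a hypothesis, loss(h, z) to be a loss bounded by 1, and m_eps_half to equal $m_{\mathcal{H}}(\epsilon/2)$):

import math

def chunk_and_validate(A, loss, S, delta, m_eps_half):
    # Hypothetical interface: A(chunk) -> hypothesis, loss(h, z) in [0, 1],
    # m_eps_half = m_H(eps/2), the size of each training chunk.
    k = math.ceil(math.log2(1.0 / delta))  # number of training chunks
    # The first k chunks, each of size m_H(eps/2), are used for training.
    train_chunks = [S[i * m_eps_half:(i + 1) * m_eps_half] for i in range(k)]
    validation = S[k * m_eps_half:]        # the remaining examples form the validation set
    # Train A independently on each chunk; by the previous bullet and Markov's inequality,
    # except with probability at most 2^{-k}, at least one candidate has excess risk <= eps.
    candidates = [A(chunk) for chunk in train_chunks]
    # Return the candidate with the smallest empirical loss on the validation set;
    # since the loss is bounded by 1, Hoeffding's inequality plus a union bound over
    # the k candidates controls the error of these validation estimates.
    def validation_loss(h):
        return sum(loss(h, z) for z in validation) / len(validation)
    return min(candidates, key=validation_loss)

The exercise then asks you to verify that, with $|S| = m_{\mathcal{H}}(\epsilon,\delta)$ examples, this procedure agnostic PAC learns the problem.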
2. Learnability without Uniform Convergence: Let $B$ be the unit ball of