Understanding Machine Learning: From Theory to Algorithms
In the context of modern learning theory, the use of stability can be traced back at least to the work of Rog ...
$\mathbb{R}^d$, let $H = B$, let $Z = B \times \{0,1\}^d$, and let $\ell : Z \times H \to \mathbb{R}$ be defined as follows: $\ell(w,(x,\alpha)) = \sum_{i=1}^{d} \alpha_i (x_i$ ...
4. Strong Convexity with Respect to General Norms: Throughout the section we used the $\ell_2$ norm. In this exercis ...
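Since the exercise text is cut off, the following is only the standard generalization of the book's $\ell_2$-based notion of strong convexity, stated here for reference: a function $f$ is $\lambda$-strongly convex with respect to a norm $\|\cdot\|$ if for every $w, u$ and every $\alpha \in (0,1)$,
\[
f(\alpha w + (1-\alpha)u) \;\le\; \alpha f(w) + (1-\alpha) f(u) - \frac{\lambda}{2}\,\alpha(1-\alpha)\,\|w - u\|^2 .
\]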
14 Stochastic Gradient Descent
Recall that the goal of learning is to minimize the risk function, $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h,z)]$. We cannot ...
the Stochastic Gradient Descent algorithm, along with several useful variants. We show that SGD enjoys ...
Figure 14.1 An illustration of the gradient descent algorithm. The function to be minimized is 1. ...
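To make the gradient descent update concrete, here is a minimal sketch in Python/NumPy of the plain GD rule $w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)})$ started at $w^{(1)} = 0$ with an averaged output; the quadratic objective and the parameter values below are illustrative assumptions, not taken from the text.

import numpy as np

def gradient_descent(grad, dim, eta, T):
    # Run T steps of w <- w - eta * grad(w) from w = 0 and return the average iterate.
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())
        w = w - eta * grad(w)
    return np.mean(iterates, axis=0)

# Illustrative convex objective f(w) = ||w - c||^2 with minimizer c = (-6, 8).
c = np.array([-6.0, 8.0])
grad_f = lambda w: 2.0 * (w - c)
w_bar = gradient_descent(grad_f, dim=2, eta=0.1, T=1000)
print(w_bar)  # close to the minimizer (-6, 8)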
Lemma 14.1 Let $v_1,\ldots,v_T$ be an arbitrary sequence of vectors. Any algorithm with an initialization $w^{(1)} =$ ...
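For orientation while the statement is truncated: a standard form of the bound this type of lemma provides, for the update rule $w^{(t+1)} = w^{(t)} - \eta v_t$ with $w^{(1)} = 0$ and any comparison point $w^\star$, is
\[
\sum_{t=1}^{T} \langle w^{(t)} - w^\star,\, v_t \rangle \;\le\; \frac{\|w^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 ,
\]
which, under the assumptions $\|w^\star\| \le B$ and $\|v_t\| \le \rho$ for all $t$, and with $\eta = \sqrt{B^2/(\rho^2 T)}$, makes the right-hand side equal to $B\rho\sqrt{T}$.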
Lemma 14.1 applies to the GD algorithm with $v_t = \nabla f(w^{(t)})$. As we will show later in Lemma 14.7, if $f$ ...
Figure 14.2 Left: The right-hand side of Equation (14.7) is the tang ...
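As a concrete instance of the subgradient inequality $f(u) \ge f(w) + \langle u - w, v \rangle$, the short check below verifies numerically that every $v \in [-1, 1]$ is a subgradient of $f(x) = |x|$ at $w = 0$; the function and the grid of test points are illustrative choices, not taken from the text.

import numpy as np

f = np.abs                                   # f(x) = |x|, not differentiable at 0
w = 0.0
us = np.linspace(-5.0, 5.0, 101)             # test points u
for v in np.linspace(-1.0, 1.0, 11):         # candidate subgradients at w = 0
    assert np.all(f(us) >= f(w) + v * (us - w))   # subgradient inequality holds for every u
print("every v in [-1, 1] is a subgradient of |x| at 0")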
Example 14.2 (A Subgradient of the Hinge Loss) Recall the hinge loss function from Section 12.3, $f$ ...
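Following this example, a subgradient of the hinge loss $f(w) = \max\{0,\, 1 - y\langle w, x\rangle\}$ can be computed with a single case split. A minimal NumPy sketch (the function name and signature are my own):

import numpy as np

def hinge_loss_subgradient(w, x, y):
    # Returns a subgradient of f(w) = max(0, 1 - y*<w, x>) at w.
    if 1.0 - y * np.dot(w, x) <= 0.0:
        return np.zeros_like(w)   # loss is zero; 0 is a valid (sub)gradient here
    return -y * x                 # gradient of the active linear piece 1 - y*<w, x>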
Figure 14.3 An illustration of the gradient descent algorithm (left) and the stochasti ...
subgradient of $f$ at $w^{(t)}$, we can still derive a similar bound on the expected output of stochastic ...
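A minimal sketch of the SGD procedure being analyzed (fixed step size, averaged output, random subgradient supplied by an oracle); the interface below is my own, and the oracle must satisfy the requirement that its conditional expectation lies in $\partial f(w^{(t)})$.

import numpy as np

def sgd(subgradient_oracle, dim, eta, T, rng=None):
    # subgradient_oracle(w, rng) must return a random v with E[v | w] in the subdifferential of f at w.
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())
        v = subgradient_oracle(w, rng)
        w = w - eta * v
    return np.mean(iterates, axis=0)   # averaged output w_bar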
Since $w^{(t)}$ only depends on $v_{1:t-1}$ and SGD requires that $\mathbb{E}_{v_t}[v_t \mid w^{(t)}] \in \partial f(w^{(t)})$, we obtain that $\mathbb{E}_{v_t}[v_t \mid v_{1:t-1}] \in \partial f(w^{(t)})$ ...
Then, for every $u \in \mathcal{H}$, $\|w - u\|^2 - \|v - u\|^2 \ge 0$. Proof By the convexity of $\mathcal{H}$, for every $\alpha \in (0,1)$ we have t ...
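For the common case in which $\mathcal{H}$ is a Euclidean ball of radius $B$ centered at the origin, the projection used in this variant has a simple closed form; the sketch below (the radius $B$ is an assumed parameter) shows the projection and how a projected step would use it.

import numpy as np

def project_onto_ball(w, B):
    # Euclidean projection of w onto the set {u : ||u||_2 <= B}.
    norm = np.linalg.norm(w)
    return w if norm <= B else (B / norm) * w

# A projected-SGD step: take the gradient step first, then project back into H.
# w_half = w - eta * v
# w_next = project_onto_ball(w_half, B)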
14.4.3 Other Averaging Techniques
We have set the output vector to be $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$. There are alternativ ...
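One alternative, given here only as an illustration of such averaging schemes (not necessarily the specific one the text goes on to analyze), is to average only a suffix of the iterates, for example the last half:

import numpy as np

def suffix_average(iterates, alpha=0.5):
    # Average the last ceil(alpha * T) iterates instead of all T of them.
    T = len(iterates)
    k = max(1, int(np.ceil(alpha * T)))
    return np.mean(iterates[T - k:], axis=0)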
Since $w^{(t+1)}$ is the projection of $w^{(t+\frac{1}{2})}$ onto $\mathcal{H}$, and $w^\star \in \mathcal{H}$, we have that $\|w^{(t+\frac{1}{2})} - w^\star\|^2 \ge \|w^{(t+1)} - w^\star\|^2$ ...
$L_{\mathcal{D}}(w)$, that is, a random vector whose conditional expected value is $\nabla L_{\mathcal{D}}(w^{(t)})$. We shall now see how su ...
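The point being developed is that one fresh example yields an unbiased (sub)gradient estimate of the risk: if $z \sim \mathcal{D}$ and $v \in \partial \ell(w, z)$, then $\mathbb{E}[v \mid w] \in \partial L_{\mathcal{D}}(w)$. Below is a sketch of such an oracle for the hinge loss, where sample_example is an assumed callback (not from the book) that draws a fresh pair $(x, y)$ from $\mathcal{D}$.

import numpy as np

def risk_subgradient_oracle(w, sample_example, rng):
    # Unbiased subgradient estimate of L_D(w) from a single fresh example z = (x, y),
    # using the hinge loss ell(w, (x, y)) = max(0, 1 - y*<w, x>).
    x, y = sample_example(rng)             # draw z ~ D
    if 1.0 - y * np.dot(w, x) <= 0.0:
        return np.zeros_like(w)            # zero loss: 0 is a subgradient
    return -y * x                          # otherwise -y*x is a subgradient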
$L_{\mathcal{D}}(w)$ with a number of iterations (i.e., number of examples) $T \ge \frac{B^2 \rho^2}{\epsilon^2}$ and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$ ...
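Read as a recipe, this says how many fresh examples and which step size suffice for a target accuracy $\epsilon$; the helper below simply evaluates the two formulas as reconstructed above, $T \ge B^2\rho^2/\epsilon^2$ and $\eta = \sqrt{B^2/(\rho^2 T)}$.

import numpy as np

def sgd_parameters(B, rho, epsilon):
    # Smallest integer T satisfying the bound, and the corresponding step size.
    T = int(np.ceil((B ** 2) * (rho ** 2) / (epsilon ** 2)))
    eta = np.sqrt(B ** 2 / (rho ** 2 * T))
    return T, eta

# e.g. sgd_parameters(B=1.0, rho=1.0, epsilon=0.1) returns (10000, 0.01)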
to $z_1,\ldots,z_T$. Clearly, $\mathbb{E}[f_t(w^\star)] = L_{\mathcal{D}}(w^\star)$. In addition, using the same argument as in the proof of The ...
that $\mathcal{H} = \mathbb{R}^d$ and therefore the projection step does not matter) as follows: $w^{(t+1)} = w^{(t)} - \frac{1}{\lambda t}\left(\lambda w^{(t)} + v_t\right)$ ...
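A minimal sketch of this update for a regularized hinge-loss objective on a fixed training set, drawing one example uniformly at random per step; the regularizer is written as $\frac{\lambda}{2}\|w\|^2$ so that its gradient is $\lambda w$, matching the update above, and the data interface and parameters are my own assumptions.

import numpy as np

def sgd_regularized_hinge(X, y, lam, T, rng=None):
    # SGD for min_w (lam/2)*||w||^2 + (1/m) * sum_i max(0, 1 - y_i * <w, x_i>),
    # with step size 1/(lam*t) and update w <- w - (1/(lam*t)) * (lam*w + v_t).
    if rng is None:
        rng = np.random.default_rng(0)
    m, d = X.shape
    w = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w.copy())
        i = rng.integers(m)                                               # sample one training example
        v = np.zeros(d) if y[i] * (X[i] @ w) >= 1.0 else -y[i] * X[i]     # hinge-loss subgradient
        w = w - (1.0 / (lam * t)) * (lam * w + v)
    return np.mean(iterates, axis=0)                                      # averaged output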