Understanding Machine Learning: From Theory to Algorithms

Convex Learning Problems


12.1.3 Smoothness


The definition of a smooth function relies on the notion of gradient. Recall that
the gradient of a differentiable function f : R^d → R at w, denoted ∇f(w), is the
vector of partial derivatives of f, namely,

    ∇f(w) = (∂f(w)/∂w_1, ..., ∂f(w)/∂w_d).

Definition 12.8 (Smoothness) A differentiable function f : R^d → R is β-smooth
if its gradient is β-Lipschitz; namely, for all v, w we have
‖∇f(v) − ∇f(w)‖ ≤ β‖v − w‖.

It is possible to show that smoothness implies that for all v, w we have

    f(v) ≤ f(w) + 〈∇f(w), v − w〉 + (β/2)‖v − w‖^2.    (12.5)
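
The text notes only that Equation (12.5) can be shown. One standard argument (a proof sketch added here as an aside, not reproduced from the book) writes the increment of f as an integral of the gradient along the segment from w to v and then applies the β-Lipschitz property of ∇f:

```latex
% Proof sketch of Equation (12.5), assuming f is differentiable with a beta-Lipschitz gradient.
\begin{align*}
f(v) - f(w)
  &= \int_0^1 \big\langle \nabla f\big(w + t(v-w)\big),\, v-w \big\rangle \, dt \\
  &= \langle \nabla f(w),\, v-w \rangle
     + \int_0^1 \big\langle \nabla f\big(w + t(v-w)\big) - \nabla f(w),\, v-w \big\rangle \, dt \\
  &\le \langle \nabla f(w),\, v-w \rangle + \int_0^1 \beta t \,\|v-w\|^2 \, dt
   \;=\; \langle \nabla f(w),\, v-w \rangle + \frac{\beta}{2}\,\|v-w\|^2 ,
\end{align*}
% where the inequality combines Cauchy-Schwarz with the Lipschitz bound on the
% deviation of the gradient along the segment.
```
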
Recall that convexity of f implies that f(v) ≥ f(w) + 〈∇f(w), v − w〉. Therefore,
when a function is both convex and smooth, we have both upper and lower
bounds on the difference between the function and its first order approximation.
Setting v = w − (1/β)∇f(w) in the right-hand side of Equation (12.5) and rearranging
terms, we obtain

    (1/(2β))‖∇f(w)‖^2 ≤ f(w) − f(v).
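
For completeness, the substitution can be spelled out (a short worked step, not written out in the text): plugging v = w − (1/β)∇f(w) into the right-hand side of Equation (12.5) gives

```latex
% Worked step behind the displayed inequality.
\begin{align*}
f(v) &\le f(w) + \Big\langle \nabla f(w),\, -\tfrac{1}{\beta}\nabla f(w) \Big\rangle
        + \frac{\beta}{2}\,\Big\|\tfrac{1}{\beta}\nabla f(w)\Big\|^2 \\
     &= f(w) - \frac{1}{\beta}\,\|\nabla f(w)\|^2 + \frac{1}{2\beta}\,\|\nabla f(w)\|^2
      \;=\; f(w) - \frac{1}{2\beta}\,\|\nabla f(w)\|^2 ,
\end{align*}
```

and rearranging the last expression yields the inequality above.
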
If we further assume that f(v) ≥ 0 for all v, we conclude that smoothness implies
the following:

    ‖∇f(w)‖^2 ≤ 2β f(w).    (12.6)
A function that satisfies this property is also called a self-bounded function.
Example 12.5

  • The function f(x) = x^2 is 2-smooth. This follows directly from the fact that
    f′(x) = 2x. Note that for this particular function Equation (12.5) and
    Equation (12.6) hold with equality.

  • The function f(x) = log(1 + exp(x)) is (1/4)-smooth. Indeed, since
    f′(x) = 1/(1 + exp(−x)), we have that

      |f′′(x)| = exp(−x)/(1 + exp(−x))^2 = 1/((1 + exp(−x))(1 + exp(x))) ≤ 1/4.

    Hence, f′ is (1/4)-Lipschitz. Since this function is nonnegative, Equation (12.6)
    holds as well.
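
As an illustration, the two items of Example 12.5 can be checked numerically on a grid of points. The following is a small sketch (added here, not part of the text); it only confirms the inequalities at sampled points and is no substitute for the analytic argument.

```python
import numpy as np

# Numerical sanity check for Example 12.5 (illustrative sketch only).
xs = np.linspace(-10.0, 10.0, 2001)

# (i) f(x) = x^2 with beta = 2: Equation (12.6) holds with equality,
#     i.e. f'(x)^2 == 2 * beta * f(x).
f1, g1, beta1 = xs**2, 2.0 * xs, 2.0
assert np.allclose(g1**2, 2.0 * beta1 * f1)

# (ii) f(x) = log(1 + exp(x)) with beta = 1/4.
f2 = np.logaddexp(0.0, xs)                                  # log(1 + exp(x)), numerically stable
g2 = 1.0 / (1.0 + np.exp(-xs))                              # f'(x)
h2 = np.exp(-np.abs(xs)) / (1.0 + np.exp(-np.abs(xs)))**2   # f''(x), written in a stable, symmetric form
beta2 = 0.25
assert np.all(np.abs(h2) <= beta2 + 1e-12)                  # |f''(x)| <= 1/4, so f' is (1/4)-Lipschitz
assert np.all(g2**2 <= 2.0 * beta2 * f2 + 1e-12)            # self-bounding property, Equation (12.6)
print("all checks passed")
```
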

The following claim shows that a composition of a smooth scalar function over
a linear function preserves smoothness.

Claim 12.9 Let f(w) = g(〈w,x〉 + b), where g : R → R is a β-smooth function,
x ∈ R^d, and b ∈ R. Then, f is (β‖x‖^2)-smooth.
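
A proof sketch of Claim 12.9 (added as an aside; the argument is the standard chain-rule computation and is not quoted from the text): the gradient of f is ∇f(w) = g′(〈w,x〉 + b) x, so for any v, w,

```latex
% Proof sketch of Claim 12.9.
\begin{align*}
\|\nabla f(v) - \nabla f(w)\|
  &= \big| g'(\langle v, x\rangle + b) - g'(\langle w, x\rangle + b) \big|\,\|x\| \\
  &\le \beta\,\big|\langle v - w,\, x\rangle\big|\,\|x\|
      && \text{(since $g'$ is $\beta$-Lipschitz)} \\
  &\le \beta\,\|x\|^2\,\|v - w\|
      && \text{(Cauchy--Schwarz)},
\end{align*}
```

which is exactly the statement that f is (β‖x‖^2)-smooth.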