Pattern Recognition and Machine Learning

10.5. Local Variational Methods 495

Now, instead of fixingλand varyingx, we can consider a particularxand then adjustλuntil the tangent plane is tangent at that particularx. Because theyvalue of the tangent line at a particularxis maximized when that value coincides with its contact point, we have f(x) = max λ

{λx−g(λ)}. (10.130)

We see that the functionsf(x)andg(λ)play a dual role, and are related through (10.129) and (10.130). Let us apply these duality relations to our simple examplef(x)=exp(−x). From (10.129) we see that the maximizing value ofxis given byξ=−ln(−λ), and back-substituting we obtain the conjugate functiong(λ)in the form

g(λ)=λ−λln(−λ) (10.131)

as obtained previously. The functionλξ−g(λ)is shown, forξ=1in the right-hand plot in Figure 10.10. As a check, we can substitute (10.131) into (10.130), which gives the maximizing value ofλ=−exp(−x), and back-substituting then recovers the original functionf(x)=exp(−x). For concave functions, we can follow a similar argument to obtain upper bounds, in whichmax’ is replaced with ‘min’, so that

f(x) = min λ

{λx−g(λ)} (10.132)

g(λ) = min x {λx−f(x)}. (10.133)

If the function of interest is not convex (or concave), then we cannot directly apply the method above to obtain a bound. However, we can first seek invertible transformations either of the function or of its argument which change it into a convex form. We then calculate the conjugate function and then transform back to the original variables. An important example, which arises frequently in pattern recognition, is the logistic sigmoid function defined by

σ(x)=

1

1+e−x

. (10.134)

As it stands this function is neither convex nor concave. However, if we take the
logarithm we obtain a function which is concave, as is easily verified by finding the
Exercise 10.30 second derivative. From (10.133) the corresponding conjugate function then takes
the form

g(λ)=min x

{λx−f(x)}=−λlnλ−(1−λ)ln(1−λ) (10.135)

which we recognize as the binary entropy function for a variable whose probability
Appendix B of having the value 1 isλ. Using (10.132), we then obtain an upper bound on the log
sigmoid
lnσ(x)λx−g(λ) (10.136)

Pattern Recognition and Machine Learning

1

. (10.134)

Get our desktop app

Company

Features

Documentation

Resources