
where we have used (5.148). Because a standard network trained by least squares
is approximating the conditional mean, we see that a mixture density network can
reproduce the conventional least-squares result as a special case. Of course, as we
have already noted, for a multimodal distribution the conditional mean is of limited
value.
We can similarly evaluate the variance of the density function about the conditional average (Exercise 5.37), to give


$$
\begin{aligned}
s^2(\mathbf{x}) &= \mathbb{E}\big[\,\|\mathbf{t} - \mathbb{E}[\mathbf{t}\,|\,\mathbf{x}]\|^2 \,\big|\, \mathbf{x}\big]
&& (5.159) \\
&= \sum_{k=1}^{K} \pi_k(\mathbf{x})\left\{ \sigma_k^2(\mathbf{x})
   + \Bigl\| \boldsymbol{\mu}_k(\mathbf{x}) - \sum_{l=1}^{K} \pi_l(\mathbf{x})\,\boldsymbol{\mu}_l(\mathbf{x}) \Bigr\|^2 \right\}
&& (5.160)
\end{aligned}
$$

where we have used (5.148) and (5.158). This is more general than the corresponding
least-squares result because the variance is a function of x.
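As an illustration of (5.158) and (5.160), the following is a minimal sketch (not taken from the text) of how the conditional mean and variance can be computed from the mixing coefficients, means, and variances produced by a mixture density network at a single input value; the array names `pi`, `mu`, and `sigma2`, and the function name, are assumptions introduced for the example.

```python
import numpy as np

def mixture_mean_and_variance(pi, mu, sigma2):
    """Conditional mean E[t|x] and variance s^2(x) of a mixture density
    network output at a single input x.

    pi     : (K,)   mixing coefficients pi_k(x), summing to 1
    mu     : (K, D) component means mu_k(x)
    sigma2 : (K,)   component variances sigma_k^2(x)
    """
    # Conditional mean, equation (5.158): E[t|x] = sum_k pi_k(x) mu_k(x)
    mean = np.sum(pi[:, None] * mu, axis=0)

    # Conditional variance, equation (5.160):
    # s^2(x) = sum_k pi_k(x) { sigma_k^2(x) + ||mu_k(x) - E[t|x]||^2 }
    var = np.sum(pi * (sigma2 + np.sum((mu - mean) ** 2, axis=1)))
    return mean, var

# Toy example with K = 2 components and a 1-dimensional target
pi = np.array([0.7, 0.3])
mu = np.array([[1.0], [-1.0]])
sigma2 = np.array([0.1, 0.2])
print(mixture_mean_and_variance(pi, mu, sigma2))
```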
We have seen that for multimodal distributions, the conditional mean can give
a poor representation of the data. For instance, in controlling the simple robot arm
shown in Figure 5.18, we need to pick one of the two possible joint angle settings
in order to achieve the desired end-effector location, whereas the average of the two
solutions is not itself a solution. In such cases, the conditional mode may be of
more value. Because the conditional mode for the mixture density network does not
have a simple analytical solution, this would require numerical iteration. A simple
alternative is to take the mean of the most probable component (i.e., the one with the
largest mixing coefficient) at each value of x. This is shown for the toy data set in
Figure 5.21(d).
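A minimal sketch of this heuristic (reusing the illustrative `pi` and `mu` arrays from the example above) is the following; it simply selects, at each input, the mean of the component whose mixing coefficient is largest.

```python
import numpy as np

def most_probable_component_mean(pi, mu):
    """Return mu_k(x) for the component k with the largest mixing
    coefficient pi_k(x), as an approximation to the conditional mode.

    pi : (K,)   mixing coefficients at a given x
    mu : (K, D) component means at the same x
    """
    k = np.argmax(pi)      # index of the most probable component
    return mu[k]

# With the toy values above, component 0 (pi = 0.7) is selected
print(most_probable_component_mean(np.array([0.7, 0.3]),
                                   np.array([[1.0], [-1.0]])))
```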

5.7 Bayesian Neural Networks


So far, our discussion of neural networks has focussed on the use of maximum like-
lihood to determine the network parameters (weights and biases). Regularized max-
imum likelihood can be interpreted as a MAP (maximum posterior) approach in
which the regularizer can be viewed as the logarithm of a prior parameter distribu-
tion. However, in a Bayesian treatment we need to marginalize over the distribution
of parameters in order to make predictions.
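To make the marginalization explicit, and writing p(w|D) for the posterior distribution over the weights w given a data set D, the predictive distribution for a new input x takes the standard form

$$
p(t \,|\, \mathbf{x}, \mathcal{D}) = \int p(t \,|\, \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \,|\, \mathcal{D})\, \mathrm{d}\mathbf{w}.
$$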
In Section 3.3, we developed a Bayesian solution for a simple linear regression
model under the assumption of Gaussian noise. We saw that the posterior distribu-
tion, which is Gaussian, could be evaluated exactly and that the predictive distribu-
tion could also be found in closed form. In the case of a multilayered network, the
highly nonlinear dependence of the network function on the parameter values means
that an exact Bayesian treatment can no longer be found. In fact, the log of the pos-
terior distribution will be nonconvex, corresponding to the multiple local minima in
the error function.
The technique of variational inference, to be discussed in Chapter 10, has been
applied to Bayesian neural networks using a factorized Gaussian approximation to
the posterior distribution.
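As an informal illustration of what a factorized Gaussian approximation over the weights can look like in practice (this sketch is not the treatment developed in this chapter; the toy network, parameter names, and Monte Carlo averaging are all assumptions made for the example), one can assign each weight its own mean and variance, draw weight vectors from the resulting factorized Gaussian, and average the network outputs over the samples to approximate the predictive distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def network_output(x, w):
    """A toy one-hidden-layer network with tanh units; the architecture
    is purely illustrative."""
    w1, b1, w2, b2 = w[:5].reshape(5, 1), w[5:10], w[10:15], w[15]
    h = np.tanh(w1 * x + b1[:, None])          # hidden-unit activations
    return w2 @ h + b2                         # scalar output per input

# Factorized Gaussian q(w) = prod_i N(w_i | m_i, s_i^2): one mean and one
# variance per weight (16 weights in this toy network).
m = rng.normal(0.0, 0.1, size=16)              # variational means
s = np.full(16, 0.2)                           # variational std deviations

def predictive_mean(x, n_samples=200):
    """Monte Carlo approximation to the predictive mean, averaging the
    network output over weight vectors drawn from q(w)."""
    outputs = [network_output(x, m + s * rng.normal(size=16))
               for _ in range(n_samples)]
    return np.mean(outputs, axis=0)

print(predictive_mean(np.array([0.0, 0.5, 1.0])))
```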