
Recall that the simple weight decay regularizer, given in (5.112), can be viewed
as the negative log of a Gaussian prior distribution over the weights. We can
encourage the weight values to form several groups, rather than just one group,
by considering instead a probability distribution that is a mixture of Gaussians
(Section 2.3.9). The centres and variances of the Gaussian components, as well as
the mixing coefficients, will be considered as adjustable parameters to be
determined as part of the learning process. Thus, we have a probability density
of the form


p(\mathbf{w}) = \prod_i p(w_i)    (5.136)

where

p(w_i) = \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2)    (5.137)

and the π_j are the mixing coefficients. Taking the negative logarithm then
leads to a regularization function of the form

\Omega(\mathbf{w}) = -\sum_i \ln\left( \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2) \right).    (5.138)
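
To make the regularizer concrete, (5.136)-(5.138) can be evaluated directly with a few lines of array code. The following NumPy sketch is illustrative (the names omega and gaussian_pdf are not from the text); it assumes the network weights have been flattened into a single vector:

```python
import numpy as np

def gaussian_pdf(w, mu, sigma2):
    """Univariate Gaussian density N(w | mu, sigma2)."""
    return np.exp(-0.5 * (w - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def omega(w, pi, mu, sigma2):
    """Soft weight-sharing regularizer Omega(w) of (5.138).

    w      : (W,) array of network weights
    pi     : (M,) mixing coefficients, summing to one
    mu     : (M,) component means
    sigma2 : (M,) component variances
    """
    # dens[i, j] = pi_j * N(w_i | mu_j, sigma2_j), via broadcasting
    dens = pi * gaussian_pdf(w[:, None], mu, sigma2)
    return -np.sum(np.log(dens.sum(axis=1)))

# Weights clustered near 0 and 1 incur only a small penalty under a
# two-component mixture centred on those values.
w = np.array([0.01, -0.02, 0.98, 1.03])
print(omega(w, pi=np.array([0.5, 0.5]),
            mu=np.array([0.0, 1.0]),
            sigma2=np.array([0.01, 0.01])))
```

Weights lying close to one of the component centres contribute little to the penalty, which is precisely what encourages the weights to form groups.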


The total error function is then given by

\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \, \Omega(\mathbf{w})    (5.139)

where λ is the regularization coefficient. This error is minimized both with
respect to the weights w_i and with respect to the parameters {π_j, μ_j, σ_j} of
the mixture model. If the weights were constant, then the parameters of the
mixture model could be determined by using the EM algorithm discussed in
Chapter 9. However, the distribution of weights is itself evolving during the
learning process, and so to avoid numerical instability, a joint optimization is
performed simultaneously over the weights and the mixture-model parameters.
This can be done using a standard optimization algorithm such as conjugate
gradients or quasi-Newton methods.
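
As an illustration of such a joint optimization, the sketch below minimizes (5.139) with SciPy's conjugate-gradient routine. It is a toy: the data term E(w) is replaced by a simple quadratic, and the constraints on the mixture parameters are handled by reparameterizing the mixing coefficients through a softmax and the variances through an exponential, so that the optimization is unconstrained. All names and the choice of reparameterization are assumptions of this sketch, not prescriptions from the text.

```python
import numpy as np
from scipy.optimize import minimize

M, W = 2, 4  # number of mixture components and of weights (toy sizes)
w_star = np.array([0.0, 0.05, 1.0, 0.95])  # toy targets defining E(w)

def unpack(theta):
    """Split the flat parameter vector into weights and mixture parameters."""
    w = theta[:W]
    logits = theta[W:W + M]        # pi = softmax(logits), stays normalized
    mu = theta[W + M:W + 2 * M]
    xi = theta[W + 2 * M:]         # sigma2 = exp(xi), stays positive
    pi = np.exp(logits - logits.max())
    return w, pi / pi.sum(), mu, np.exp(xi)

def total_error(theta, lam):
    """E(w) + lam * Omega(w), as in (5.139), with a toy quadratic E(w)."""
    w, pi, mu, sigma2 = unpack(theta)
    E = 0.5 * np.sum((w - w_star) ** 2)
    dens = pi * np.exp(-0.5 * (w[:, None] - mu) ** 2 / sigma2) \
           / np.sqrt(2.0 * np.pi * sigma2)
    return E + lam * -np.sum(np.log(dens.sum(axis=1)))

theta0 = np.concatenate([np.random.randn(W) * 0.1,  # weights
                         np.zeros(M),               # mixing logits
                         np.array([-0.5, 0.5]),     # initial means
                         np.full(M, np.log(0.1))])  # log-variances
res = minimize(total_error, theta0, args=(0.1,), method='CG')
w_opt, pi_opt, mu_opt, sigma2_opt = unpack(res.x)
print(w_opt, mu_opt)
```

The reparameterization is the essential trick here: optimizing π_j and σ_j^2 directly would require a constrained method to keep the mixing coefficients normalized and the variances positive.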
In order to minimize the total error function, it is necessary to be able to
evaluate its derivatives with respect to the various adjustable parameters. To
do this it is convenient to regard the {π_j} as prior probabilities and to
introduce the corresponding posterior probabilities which, following (2.192),
are given by Bayes' theorem in the form
\gamma_j(w) = \frac{\pi_j \, \mathcal{N}(w \mid \mu_j, \sigma_j^2)}{\sum_k \pi_k \, \mathcal{N}(w \mid \mu_k, \sigma_k^2)}.    (5.140)
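
A direct, vectorized transcription of (5.140) might look as follows (a sketch; the function name responsibilities is illustrative):

```python
import numpy as np

def responsibilities(w, pi, mu, sigma2):
    """Posterior probabilities gamma_j(w_i) from (5.140).

    Returns an array of shape (len(w), M) whose rows sum to one.
    """
    dens = pi * np.exp(-0.5 * (w[:, None] - mu) ** 2 / sigma2) \
           / np.sqrt(2.0 * np.pi * sigma2)
    return dens / dens.sum(axis=1, keepdims=True)

gamma = responsibilities(np.array([0.02, 0.97]),
                         pi=np.array([0.5, 0.5]),
                         mu=np.array([0.0, 1.0]),
                         sigma2=np.array([0.01, 0.01]))
print(gamma)  # each weight is assigned almost entirely to its nearest component
```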

The derivatives of the total error function with respect to the weights are
then given by (Exercise 5.29)

\frac{\partial \tilde{E}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda \sum_j \gamma_j(w_i) \, \frac{w_i - \mu_j}{\sigma_j^2}.    (5.141)
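
In code, (5.141) adds a responsibility-weighted pull of each weight towards the component centres. A minimal sketch under the same conventions as above, with dE_dw assumed to be supplied by backpropagation:

```python
import numpy as np

def weight_gradient(dE_dw, w, pi, mu, sigma2, lam):
    """Gradient (5.141) of the total error with respect to the weights.

    dE_dw : (W,) gradient of the unregularized error E(w)
    lam   : regularization coefficient lambda
    """
    dens = pi * np.exp(-0.5 * (w[:, None] - mu) ** 2 / sigma2) \
           / np.sqrt(2.0 * np.pi * sigma2)
    gamma = dens / dens.sum(axis=1, keepdims=True)  # gamma_j(w_i)
    # Each component pulls w_i towards mu_j in proportion to gamma_j(w_i).
    return dE_dw + lam * np.sum(gamma * (w[:, None] - mu) / sigma2, axis=1)
```

A finite-difference comparison against the regularizer code above makes a useful sanity check: perturbing a single weight w_i by a small ε should change λΩ(w) by approximately the second term of (5.141) times ε.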