
Recall that the simple weight decay regularizer, given in (5.112), can be viewed
as the negative log of a Gaussian prior distribution over the weights. We can
encourage the weight values to form several groups, rather than just one group,
by considering instead a probability distribution that is a mixture of Gaussians
(Section 2.3.9). The centres and variances of the Gaussian components, as well as
the mixing coefficients, will be considered as adjustable parameters to be
determined as part of the learning process. Thus, we have a probability density
of the form


p(\mathbf{w}) = \prod_i p(w_i)    (5.136)

where

p(w_i) = \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2)    (5.137)

and the π_j are the mixing coefficients. Taking the negative logarithm then
leads to a regularization function of the form

\Omega(\mathbf{w}) = -\sum_i \ln\left( \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2) \right).    (5.138)
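
To make the regularizer concrete, (5.136)-(5.138) can be evaluated directly with a few lines of array code. The following NumPy sketch is illustrative (the names omega and gaussian_pdf are not from the text); it assumes the network weights have been flattened into a single vector:

```python
import numpy as np

def gaussian_pdf(w, mu, sigma2):
    """Univariate Gaussian density N(w | mu, sigma2)."""
    return np.exp(-0.5 * (w - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def omega(w, pi, mu, sigma2):
    """Soft weight-sharing regularizer Omega(w) of (5.138).

    w      : (W,) array of network weights
    pi     : (M,) mixing coefficients, summing to one
    mu     : (M,) component means
    sigma2 : (M,) component variances
    """
    # dens[i, j] = pi_j * N(w_i | mu_j, sigma2_j), via broadcasting
    dens = pi * gaussian_pdf(w[:, None], mu, sigma2)
    return -np.sum(np.log(dens.sum(axis=1)))

# Weights clustered near 0 and 1 incur only a small penalty under a
# two-component mixture centred on those values.
w = np.array([0.01, -0.02, 0.98, 1.03])
print(omega(w, pi=np.array([0.5, 0.5]),
            mu=np.array([0.0, 1.0]),
            sigma2=np.array([0.01, 0.01])))
```

Weights lying close to one of the component centres contribute little to the penalty, which is precisely what encourages the weights to form groups.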


The total error function is then given by

\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \, \Omega(\mathbf{w})    (5.139)

where λ is the regularization coefficient. This error is minimized both with
respect to the weights w_i and with respect to the parameters {π_j, μ_j, σ_j} of
the mixture model. If the weights were constant, then the parameters of the
mixture model could be determined by using the EM algorithm discussed in
Chapter 9. However, the distribution of weights is itself evolving during the
learning process, and so to avoid numerical instability, a joint optimization is
performed simultaneously over the weights and the mixture-model parameters.
This can be done using a standard optimization algorithm such as conjugate
gradients or quasi-Newton methods.
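
As an illustration of such a joint optimization, the sketch below minimizes (5.139) with SciPy's conjugate-gradient routine. It is a toy: the data term E(w) is replaced by a simple quadratic, and the constraints on the mixture parameters are handled by reparameterizing the mixing coefficients through a softmax and the variances through an exponential, so that the optimization is unconstrained. All names and the choice of reparameterization are assumptions of this sketch, not prescriptions from the text.

```python
import numpy as np
from scipy.optimize import minimize

M, W = 2, 4  # number of mixture components and of weights (toy sizes)
w_star = np.array([0.0, 0.05, 1.0, 0.95])  # toy targets defining E(w)

def unpack(theta):
    """Split the flat parameter vector into weights and mixture parameters."""
    w = theta[:W]
    logits = theta[W:W + M]        # pi = softmax(logits), stays normalized
    mu = theta[W + M:W + 2 * M]
    xi = theta[W + 2 * M:]         # sigma2 = exp(xi), stays positive
    pi = np.exp(logits - logits.max())
    return w, pi / pi.sum(), mu, np.exp(xi)

def total_error(theta, lam):
    """E(w) + lam * Omega(w), as in (5.139), with a toy quadratic E(w)."""
    w, pi, mu, sigma2 = unpack(theta)
    E = 0.5 * np.sum((w - w_star) ** 2)
    dens = pi * np.exp(-0.5 * (w[:, None] - mu) ** 2 / sigma2) \
           / np.sqrt(2.0 * np.pi * sigma2)
    return E + lam * -np.sum(np.log(dens.sum(axis=1)))

theta0 = np.concatenate([np.random.randn(W) * 0.1,  # weights
                         np.zeros(M),               # mixing logits
                         np.array([-0.5, 0.5]),     # initial means
                         np.full(M, np.log(0.1))])  # log-variances
res = minimize(total_error, theta0, args=(0.1,), method='CG')
w_opt, pi_opt, mu_opt, sigma2_opt = unpack(res.x)
print(w_opt, mu_opt)
```

The reparameterization is the essential trick here: optimizing π_j and σ_j^2 directly would require a constrained method to keep the mixing coefficients normalized and the variances positive.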
In order to minimize the total error function, it is necessary to be able to
evaluate its derivatives with respect to the various adjustable parameters. To
do this it is convenient to regard the {π_j} as prior probabilities and to
introduce the corresponding posterior probabilities which, following (2.192),
are given by Bayes' theorem in the form
\gamma_j(w) = \frac{\pi_j \, \mathcal{N}(w \mid \mu_j, \sigma_j^2)}{\sum_k \pi_k \, \mathcal{N}(w \mid \mu_k, \sigma_k^2)}.    (5.140)
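
A direct, vectorized transcription of (5.140) might look as follows (a sketch; the function name responsibilities is illustrative):

```python
import numpy as np

def responsibilities(w, pi, mu, sigma2):
    """Posterior probabilities gamma_j(w_i) from (5.140).

    Returns an array of shape (len(w), M) whose rows sum to one.
    """
    dens = pi * np.exp(-0.5 * (w[:, None] - mu) ** 2 / sigma2) \
           / np.sqrt(2.0 * np.pi * sigma2)
    return dens / dens.sum(axis=1, keepdims=True)

gamma = responsibilities(np.array([0.02, 0.97]),
                         pi=np.array([0.5, 0.5]),
                         mu=np.array([0.0, 1.0]),
                         sigma2=np.array([0.01, 0.01]))
print(gamma)  # each weight is assigned almost entirely to its nearest component
```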

The derivatives of the total error function with respect to the weights are
then given by (Exercise 5.29)

\frac{\partial \tilde{E}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda \sum_j \gamma_j(w_i) \, \frac{w_i - \mu_j}{\sigma_j^2}.    (5.141)
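
In code, (5.141) adds a responsibility-weighted pull of each weight towards the component centres. A minimal sketch under the same conventions as above, with dE_dw assumed to be supplied by backpropagation:

```python
import numpy as np

def weight_gradient(dE_dw, w, pi, mu, sigma2, lam):
    """Gradient (5.141) of the total error with respect to the weights.

    dE_dw : (W,) gradient of the unregularized error E(w)
    lam   : regularization coefficient lambda
    """
    dens = pi * np.exp(-0.5 * (w[:, None] - mu) ** 2 / sigma2) \
           / np.sqrt(2.0 * np.pi * sigma2)
    gamma = dens / dens.sum(axis=1, keepdims=True)  # gamma_j(w_i)
    # Each component pulls w_i towards mu_j in proportion to gamma_j(w_i).
    return dE_dw + lam * np.sum(gamma * (w[:, None] - mu) / sigma2, axis=1)
```

A finite-difference comparison against the regularizer code above makes a useful sanity check: perturbing a single weight w_i by a small ε should change λΩ(w) by approximately the second term of (5.141) times ε.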