266 5. NEURAL NETWORKS
in which the parameter $\xi$ is drawn from a distribution $p(\xi)$, then the error function
defined over this expanded data set can be written as
$$
\widetilde{E} = \frac{1}{2} \iiint \{y(s(\mathbf{x},\xi)) - t\}^2\, p(t|\mathbf{x})\, p(\mathbf{x})\, p(\xi)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t\, \mathrm{d}\xi. \tag{5.130}
$$
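The expanded-data-set error (5.130) can be estimated by Monte Carlo sampling: draw values of $\xi$ from $p(\xi)$, transform each input, and average the squared error over the transformed data. A minimal sketch, assuming a rotation of 2-D inputs as the transformation $s$ and a toy quadratic model $y$ (both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def s(x, xi):
    """Transformation: rotate a 2-D input x by a small angle xi (assumed example)."""
    c, sn = np.cos(xi), np.sin(xi)
    return np.array([c * x[0] - sn * x[1], sn * x[0] + c * x[1]])

def y(x):
    """Toy model function, invented for illustration."""
    return x[0] ** 2 + 0.5 * x[1]

# A toy data set of (x, t) pairs and draws from a zero-mean, small-variance p(xi).
data = [(np.array([1.0, 2.0]), 1.8), (np.array([-0.5, 1.0]), 0.9)]
xis = rng.normal(0.0, 0.05, size=2000)

# Monte Carlo estimate of (5.130): average squared error over transformed inputs.
E_tilde = 0.5 * np.mean([(y(s(x, xi)) - t) ** 2 for x, t in data for xi in xis])
```

Because the transformations are small, `E_tilde` stays close to the untransformed sum-of-squares error, differing by the $O(\xi^2)$ regularization term derived below.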
We now assume that the distribution $p(\xi)$ has zero mean with small variance, so that
we are only considering small transformations of the original input vectors. We can
then expand the transformation function as a Taylor series in powers of $\xi$ to give
$$
\begin{aligned}
s(\mathbf{x},\xi) &= s(\mathbf{x},0) + \xi \left.\frac{\partial}{\partial \xi} s(\mathbf{x},\xi)\right|_{\xi=0} + \frac{\xi^2}{2} \left.\frac{\partial^2}{\partial \xi^2} s(\mathbf{x},\xi)\right|_{\xi=0} + O(\xi^3) \\
&= \mathbf{x} + \xi \boldsymbol{\tau} + \frac{1}{2} \xi^2 \boldsymbol{\tau}' + O(\xi^3)
\end{aligned}
$$
where $\boldsymbol{\tau}'$ denotes the second derivative of $s(\mathbf{x},\xi)$ with respect to $\xi$ evaluated at $\xi = 0$.
This allows us to expand the model function to give
$$
y(s(\mathbf{x},\xi)) = y(\mathbf{x}) + \xi\, \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) + \frac{\xi^2}{2} \left[ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x})\, \boldsymbol{\tau} \right] + O(\xi^3).
$$
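This expansion can be checked numerically by comparing $y(s(\mathbf{x},\xi))$ against its first-order approximation $y(\mathbf{x}) + \xi\, \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x})$; the residual should be $O(\xi^2)$. A minimal sketch, assuming rotation of a 2-D input as the transformation and a toy model (both invented for illustration):

```python
import numpy as np

def s(x, xi):
    """Transformation: rotate a 2-D input x by angle xi (assumed example)."""
    c, sn = np.cos(xi), np.sin(xi)
    return np.array([c * x[0] - sn * x[1], sn * x[0] + c * x[1]])

def y(x):
    """Toy model function, invented for illustration."""
    return x[0] ** 2 + 0.5 * x[1]

x = np.array([1.0, 2.0])
xi = 1e-3

# Tangent vector tau = ds(x, xi)/dxi at xi = 0; for a rotation this is (-x2, x1).
tau = np.array([-x[1], x[0]])
grad_y = np.array([2.0 * x[0], 0.5])     # gradient of the toy model at x

exact = y(s(x, xi))
first_order = y(x) + xi * tau @ grad_y   # truncated expansion, error O(xi^2)
```

With `xi = 1e-3` the residual `exact - first_order` is of order $10^{-6}$, consistent with the quadratic remainder.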
Substituting into the mean error function (5.130) and expanding, we then have
$$
\begin{aligned}
\widetilde{E} = {} & \frac{1}{2} \iint \{y(\mathbf{x}) - t\}^2\, p(t|\mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}\, \mathrm{d}t \\
& + \mathbb{E}[\xi] \iint \{y(\mathbf{x}) - t\}\, \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x})\, p(t|\mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}\, \mathrm{d}t \\
& + \mathbb{E}[\xi^2] \iint \left[ \{y(\mathbf{x}) - t\} \frac{1}{2} \left\{ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x})\, \boldsymbol{\tau} \right\} + \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^2 \right] p(t|\mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}\, \mathrm{d}t + O(\xi^3).
\end{aligned}
$$
Because the distribution of transformations has zero mean we have $\mathbb{E}[\xi] = 0$. Also,
we shall denote $\mathbb{E}[\xi^2]$ by $\lambda$. Omitting terms of $O(\xi^3)$, the average error function then
becomes
$$
\widetilde{E} = E + \lambda \Omega \tag{5.131}
$$
where $E$ is the original sum-of-squares error, and the regularization term $\Omega$ takes the
form
$$
\Omega = \int \left[ \{y(\mathbf{x}) - \mathbb{E}[t|\mathbf{x}]\} \frac{1}{2} \left\{ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x})\, \boldsymbol{\tau} \right\} + \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^2 \right] p(\mathbf{x})\, \mathrm{d}\mathbf{x} \tag{5.132}
$$
in which we have performed the integration over $t$.
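As a numerical illustration of (5.132): when the model output equals the conditional average $\mathbb{E}[t|\mathbf{x}]$, the first term in the integrand vanishes and $\Omega$ reduces to the average of $(\boldsymbol{\tau}^{\mathrm{T}} \nabla y)^2$ over $p(\mathbf{x})$. A minimal sketch, assuming rotation as the transformation (so $\boldsymbol{\tau} = (-x_2, x_1)$), a toy model, and an empirical distribution standing in for $p(\mathbf{x})$ (all invented for illustration):

```python
import numpy as np

def grad_y(x):
    """Gradient of the toy model y(x) = x1^2 + 0.5 * x2 (assumed example)."""
    return np.array([2.0 * x[0], 0.5])

# Sample inputs standing in for draws from p(x).
xs = [np.array([1.0, 2.0]), np.array([-0.5, 1.0]), np.array([0.3, -0.7])]

# Estimate of Omega in the case y(x) = E[t|x]: average of (tau^T grad y)^2,
# with the rotation tangent vector tau(x) = (-x2, x1).
omega = np.mean([(np.array([-x[1], x[0]]) @ grad_y(x)) ** 2 for x in xs])
```

Penalizing this quantity drives the model's directional derivative along the transformation to zero at the data points, which is exactly the sense in which (5.131) encourages invariance under the transformation.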