
This is equivalent, up to an overall multiplicative constant, to a mixture distribution
in which the components factorize, with the index i playing the role of a 'latent'
variable (see Section 9.2). Two inputs x and x′ will give a large value for the kernel
function, and hence appear similar, if they have significant probability under a range
of different components. Taking the limit of an infinite sum, we can also consider
kernels of the form

    k(x, x′) = ∫ p(x|z) p(x′|z) p(z) dz        (6.30)

where z is a continuous latent variable.
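
As a concrete illustration, here is a minimal sketch (not from the text) of the finite mixture kernel (6.29), using Gaussian components; the mixture parameters, test points, and the function name mixture_kernel are all illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical mixture parameters: weights p(i), component means and covariances.
weights = np.array([0.5, 0.3, 0.2])
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def mixture_kernel(x, x_prime):
    # k(x, x′) = Σ_i p(x|i) p(x′|i) p(i), as in (6.29).
    return sum(w * multivariate_normal.pdf(x, m, c)
                 * multivariate_normal.pdf(x_prime, m, c)
               for w, m, c in zip(weights, means, covs))

x, x_prime = np.array([0.1, -0.2]), np.array([0.3, 0.1])
print(mixture_kernel(x, x_prime))
```

The kernel is large only when some component assigns appreciable density to both inputs, matching the intuition above.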
Now suppose that our data consists of ordered sequences of length L, so that
an observation is given by X = {x_1, ..., x_L}. A popular generative model for
sequences is the hidden Markov model (see Section 13.2), which expresses the
distribution p(X) as a marginalization over a corresponding sequence of hidden
states Z = {z_1, ..., z_L}. We can use this approach to define a kernel function
measuring the similarity of two sequences X and X′ by extending the mixture
representation (6.29) to give

    k(X, X′) = ∑_Z p(X|Z) p(X′|Z) p(Z)        (6.31)

so that both observed sequences are generated by the same hidden sequence Z. This
model can easily be extended to allow sequences of differing length to be compared.
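
Although (6.31) sums over exponentially many hidden sequences, it factorizes in the same way as an ordinary HMM likelihood, since p(X|Z) p(X′|Z) is itself a product of per-step emission terms. The following is a minimal forward-recursion sketch under the assumption of a discrete-emission HMM and equal-length sequences; the parameter values are hypothetical:

```python
import numpy as np

# Hypothetical HMM: K hidden states, discrete emissions over 3 symbols.
K = 3
pi = np.full(K, 1.0 / K)                   # initial state distribution p(z_1)
A = np.array([[0.8, 0.1, 0.1],             # transition matrix p(z_t | z_{t-1})
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
B = np.array([[0.7, 0.2, 0.1],             # emission matrix p(x | z)
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])

def hmm_kernel(X, X_prime):
    # k(X, X′) = Σ_Z p(X|Z) p(X′|Z) p(Z): a forward recursion in which
    # each hidden state z_t emits both x_t and x′_t.
    emit = B[:, X] * B[:, X_prime]         # shape (K, L), entries p(x_t|z) p(x′_t|z)
    alpha = pi * emit[:, 0]
    for t in range(1, len(X)):
        alpha = (A.T @ alpha) * emit[:, t]
    return alpha.sum()

print(hmm_kernel([0, 1, 1, 2], [0, 1, 2, 2]))
```

The recursion costs O(LK²), the same as a single forward pass of the standard HMM.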
An alternative technique for using generative models to define kernel functions
is known as the Fisher kernel (Jaakkola and Haussler, 1999). Consider a parametric
generative model p(x|θ) where θ denotes the vector of parameters. The goal is to
find a kernel that measures the similarity of two input vectors x and x′ induced by
the generative model. Jaakkola and Haussler (1999) consider the gradient with
respect to θ, which defines a vector in a 'feature' space having the same
dimensionality as θ. In particular, they consider the Fisher score

    g(θ, x) = ∇_θ ln p(x|θ)        (6.32)

from which the Fisher kernel is defined by

    k(x, x′) = g(θ, x)ᵀ F⁻¹ g(θ, x′).        (6.33)
Here F is the Fisher information matrix, given by

    F = E_x[ g(θ, x) g(θ, x)ᵀ ]        (6.34)

where the expectation is with respect to x under the distribution p(x|θ). This can
be motivated from the perspective of information geometry (Amari, 1998), which
considers the differential geometry of the space of model parameters. Here we
simply note that the presence of the Fisher information matrix causes this kernel to
be invariant under a nonlinear re-parameterization of the density model θ → ψ(θ)
(Exercise 6.13).
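As a simple worked case (not in the text), take p(x|θ) = N(x|μ, σ²) with θ = μ and
σ² held fixed. The Fisher score is g(μ, x) = ∇_μ ln p(x|μ) = (x − μ)/σ², the Fisher
information is F = E_x[g(μ, x)²] = 1/σ², and the Fisher kernel (6.33) reduces to
k(x, x′) = (x − μ)(x′ − μ)/σ², a linear kernel on the mean-centred inputs.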
In practice, it is often infeasible to evaluate the Fisher information matrix. One
approach is simply to replace the expectation in the definition of the Fisher
information with the sample average, giving

    F ≃ (1/N) ∑_{n=1}^{N} g(θ, x_n) g(θ, x_n)ᵀ.        (6.35)
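
Below is a minimal sketch of this sample-based approximation, reusing the Gaussian-mean example above so that the score has a closed form; the data, parameter values, and function names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.0, 1.0                      # parameters of p(x|θ) = N(x | mu, sigma2)
X = rng.normal(mu, np.sqrt(sigma2), size=1000)

def score(x):
    # Fisher score g(θ, x) = ∇_θ ln p(x|θ), here with θ = mu (a 1-vector).
    return np.atleast_1d((x - mu) / sigma2)

# Sample-average approximation (6.35): F ≃ (1/N) Σ_n g(θ, x_n) g(θ, x_n)ᵀ.
G = np.stack([score(x) for x in X])        # shape (N, dim θ)
F_hat = G.T @ G / len(X)

def fisher_kernel(x, x_prime):
    # k(x, x′) = g(θ, x)ᵀ F⁻¹ g(θ, x′), as in (6.33), with the empirical F.
    return score(x) @ np.linalg.solve(F_hat, score(x_prime))

print(F_hat)                               # close to the exact F = 1/sigma2 = 1
print(fisher_kernel(0.5, -0.3))            # ≈ (0.5 − mu)(−0.3 − mu)/sigma2 = −0.15
```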