
This is equivalent, up to an overall multiplicative constant, to a mixture distribution
in which the components factorize, with the index i playing the role of a 'latent'
variable (see Section 9.2). Two inputs x and x′ will give a large value for the kernel
function, and hence appear similar, if they have significant probability under a range
of different components. Taking the limit of an infinite sum, we can also consider
kernels of the form

    k(x, x′) = ∫ p(x|z) p(x′|z) p(z) dz        (6.30)

where z is a continuous latent variable.
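
As a concrete illustration, here is a minimal sketch (not from the text) of the finite mixture kernel (6.29), using Gaussian components; the mixture parameters, test points, and the function name mixture_kernel are all illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical mixture parameters: weights p(i), component means and covariances.
weights = np.array([0.5, 0.3, 0.2])
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]

def mixture_kernel(x, x_prime):
    # k(x, x′) = Σ_i p(x|i) p(x′|i) p(i), as in (6.29).
    return sum(w * multivariate_normal.pdf(x, m, c)
                 * multivariate_normal.pdf(x_prime, m, c)
               for w, m, c in zip(weights, means, covs))

x, x_prime = np.array([0.1, -0.2]), np.array([0.3, 0.1])
print(mixture_kernel(x, x_prime))
```

The kernel is large only when some component assigns appreciable density to both inputs, matching the intuition above.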
Now suppose that our data consists of ordered sequences of length L, so that
an observation is given by X = {x_1, ..., x_L}. A popular generative model for
sequences is the hidden Markov model (see Section 13.2), which expresses the
distribution p(X) as a marginalization over a corresponding sequence of hidden
states Z = {z_1, ..., z_L}. We can use this approach to define a kernel function
measuring the similarity of two sequences X and X′ by extending the mixture
representation (6.29) to give

    k(X, X′) = ∑_Z p(X|Z) p(X′|Z) p(Z)        (6.31)

so that both observed sequences are generated by the same hidden sequence Z. This
model can easily be extended to allow sequences of differing length to be compared.
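
Although (6.31) sums over exponentially many hidden sequences, it factorizes in the same way as an ordinary HMM likelihood, since p(X|Z) p(X′|Z) is itself a product of per-step emission terms. The following is a minimal forward-recursion sketch under the assumption of a discrete-emission HMM and equal-length sequences; the parameter values are hypothetical:

```python
import numpy as np

# Hypothetical HMM: K hidden states, discrete emissions over 3 symbols.
K = 3
pi = np.full(K, 1.0 / K)                   # initial state distribution p(z_1)
A = np.array([[0.8, 0.1, 0.1],             # transition matrix p(z_t | z_{t-1})
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
B = np.array([[0.7, 0.2, 0.1],             # emission matrix p(x | z)
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])

def hmm_kernel(X, X_prime):
    # k(X, X′) = Σ_Z p(X|Z) p(X′|Z) p(Z): a forward recursion in which
    # each hidden state z_t emits both x_t and x′_t.
    emit = B[:, X] * B[:, X_prime]         # shape (K, L), entries p(x_t|z) p(x′_t|z)
    alpha = pi * emit[:, 0]
    for t in range(1, len(X)):
        alpha = (A.T @ alpha) * emit[:, t]
    return alpha.sum()

print(hmm_kernel([0, 1, 1, 2], [0, 1, 2, 2]))
```

The recursion costs O(LK²), the same as a single forward pass of the standard HMM.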
An alternative technique for using generative models to define kernel functions
is known as the Fisher kernel (Jaakkola and Haussler, 1999). Consider a parametric
generative model p(x|θ) where θ denotes the vector of parameters. The goal is to
find a kernel that measures the similarity of two input vectors x and x′ induced by
the generative model. Jaakkola and Haussler (1999) consider the gradient with
respect to θ, which defines a vector in a 'feature' space having the same
dimensionality as θ. In particular, they consider the Fisher score

    g(θ, x) = ∇_θ ln p(x|θ)        (6.32)

from which the Fisher kernel is defined by

    k(x, x′) = g(θ, x)ᵀ F⁻¹ g(θ, x′).        (6.33)
Here F is the Fisher information matrix, given by

    F = E_x[ g(θ, x) g(θ, x)ᵀ ]        (6.34)

where the expectation is with respect to x under the distribution p(x|θ). This can
be motivated from the perspective of information geometry (Amari, 1998), which
considers the differential geometry of the space of model parameters. Here we
simply note that the presence of the Fisher information matrix causes this kernel to
be invariant under a nonlinear re-parameterization of the density model θ → ψ(θ)
(Exercise 6.13).
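As a simple worked case (not in the text), take p(x|θ) = N(x|μ, σ²) with θ = μ and
σ² held fixed. The Fisher score is g(μ, x) = ∇_μ ln p(x|μ) = (x − μ)/σ², the Fisher
information is F = E_x[g(μ, x)²] = 1/σ², and the Fisher kernel (6.33) reduces to
k(x, x′) = (x − μ)(x′ − μ)/σ², a linear kernel on the mean-centred inputs.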
In practice, it is often infeasible to evaluate the Fisher information matrix. One
approach is simply to replace the expectation in the definition of the Fisher
information with the sample average, giving

    F ≃ (1/N) ∑_{n=1}^{N} g(θ, x_n) g(θ, x_n)ᵀ.        (6.35)
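
Below is a minimal sketch of this sample-based approximation, reusing the Gaussian-mean example above so that the score has a closed form; the data, parameter values, and function names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.0, 1.0                      # parameters of p(x|θ) = N(x | mu, sigma2)
X = rng.normal(mu, np.sqrt(sigma2), size=1000)

def score(x):
    # Fisher score g(θ, x) = ∇_θ ln p(x|θ), here with θ = mu (a 1-vector).
    return np.atleast_1d((x - mu) / sigma2)

# Sample-average approximation (6.35): F ≃ (1/N) Σ_n g(θ, x_n) g(θ, x_n)ᵀ.
G = np.stack([score(x) for x in X])        # shape (N, dim θ)
F_hat = G.T @ G / len(X)

def fisher_kernel(x, x_prime):
    # k(x, x′) = g(θ, x)ᵀ F⁻¹ g(θ, x′), as in (6.33), with the empirical F.
    return score(x) @ np.linalg.solve(F_hat, score(x_prime))

print(F_hat)                               # close to the exact F = 1/sigma2 = 1
print(fisher_kernel(0.5, -0.3))            # ≈ (0.5 − mu)(−0.3 − mu)/sigma2 = −0.15
```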