12.4. Nonlinear Latent Variable Models
It has the advantage of not being limited to linear transformations, although it
contains standard principal component analysis as a special case. However, training
the network now involves a nonlinear optimization problem, since the error function
(12.91) is no longer a quadratic function of the network parameters. Computationally
intensive nonlinear optimization techniques must be used, and there is the risk of
finding a suboptimal local minimum of the error function. Also, the dimensionality
of the subspace must be specified before training the network.
12.4.3 Modelling nonlinear manifolds
As we have already noted, many natural sources of data correspond to
low-dimensional, possibly noisy, nonlinear manifolds embedded within the higher
dimensional observed data space. Capturing this property explicitly can lead to
improved density modelling compared with more general methods. Here we consider
briefly a range of techniques that attempt to do this.
One way to model the nonlinear structure is through a combination of linear
models, so that we make a piece-wise linear approximation to the manifold. This can
be obtained, for instance, by using a clustering technique such as K-means based on
Euclidean distance to partition the data set into local groups with standard PCA
applied to each group. A better approach is to use the reconstruction error for cluster
assignment (Kambhatla and Leen, 1997; Hinton et al., 1997) as then a common cost
function is being optimized in each stage. However, these approaches still suffer
from limitations due to the absence of an overall density model. By using
probabilistic PCA it is straightforward to define a fully probabilistic model simply by
considering a mixture distribution in which the components are probabilistic PCA
models (Tipping and Bishop, 1999a). Such a model has both discrete latent
variables, corresponding to the discrete mixture, as well as continuous latent variables,
and the likelihood function can be maximized using the EM algorithm. A fully
Bayesian treatment, based on variational inference (Bishop and Winn, 2000), allows
the number of components in the mixture, as well as the effective dimensionalities
of the individual models, to be inferred from the data. There are many variants of
this model in which parameters such as the W matrix or the noise variances are tied
across components in the mixture, or in which the isotropic noise distributions are
replaced by diagonal ones, giving rise to a mixture of factor analysers (Ghahramani
and Hinton, 1996a; Ghahramani and Beal, 2000). The mixture of probabilistic PCA
models can also be extended hierarchically to produce an interactive data
visualization algorithm (Bishop and Tipping, 1998).
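The piece-wise linear idea described above can be illustrated with a minimal NumPy sketch. It is not the algorithm of any of the cited papers: it assumes hard cluster assignments, a plain K-means initialization, and a fixed number of clusters K and subspace dimensionality M, and the function names (fit_local_pca, reconstruction_errors, kmeans, local_pca_by_reconstruction) are hypothetical. After the Euclidean K-means start, each point is reassigned to the local PCA model that reconstructs it best, so that both steps reduce the same cost, namely the mean squared reconstruction error.

```python
import numpy as np

def fit_local_pca(X, labels, K, M, previous=None):
    """Fit an M-dimensional PCA model (mean, orthonormal basis) to each cluster."""
    models = []
    for k in range(K):
        Xk = X[labels == k]
        if len(Xk) <= M:
            # Too few points to define the subspace: keep the old model if any.
            if previous is None:
                raise ValueError(f"cluster {k} has too few points")
            models.append(previous[k])
            continue
        mu = Xk.mean(axis=0)
        # Rows of Vt are principal directions, ordered by decreasing variance.
        _, _, Vt = np.linalg.svd(Xk - mu, full_matrices=False)
        models.append((mu, Vt[:M].T))  # basis has shape (D, M)
    return models

def reconstruction_errors(X, models):
    """Squared distance from every point to every local subspace, shape (N, K)."""
    errs = []
    for mu, U in models:
        R = X - mu
        residual = R - (R @ U) @ U.T  # component orthogonal to the subspace
        errs.append((residual ** 2).sum(axis=1))
    return np.stack(errs, axis=1)

def kmeans(X, K, n_iter, rng):
    """Plain K-means with Euclidean distance, used only for initialization."""
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centres[k] = X[labels == k].mean(axis=0)
    return labels

def local_pca_by_reconstruction(X, K, M, n_iter=20, seed=0):
    """Alternate between reassigning points by reconstruction error and refitting."""
    rng = np.random.default_rng(seed)
    labels = kmeans(X, K, 10, rng)
    models = fit_local_pca(X, labels, K, M)
    history = []  # mean squared reconstruction error after each reassignment
    for _ in range(n_iter):
        errs = reconstruction_errors(X, models)
        labels = errs.argmin(axis=1)
        history.append(errs.min(axis=1).mean())
        models = fit_local_pca(X, labels, K, M, previous=models)
    return labels, models, history

# Usage on a synthetic noisy half-circle (a one-dimensional manifold in 2-D).
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, np.pi, size=400)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.02 * rng.normal(size=(400, 2))
labels, models, history = local_pca_by_reconstruction(X, K=4, M=1)
print("mean squared reconstruction error:", history[-1])
```

Because each PCA refit is the least-squares optimal affine subspace for its cluster, and each reassignment picks the best subspace for every point, the recorded error cannot increase from one iteration to the next. As the text notes, a scheme of this kind still defines no density over the data space, which is what the mixture of probabilistic PCA models supplies.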
An alternative to considering a mixture of linear models is to consider a single
nonlinear model. Recall that conventional PCA finds a linear subspace that passes
close to the data in a least-squares sense. This concept can be extended to
one-dimensional nonlinear surfaces in the form of principal curves (Hastie and Stuetzle,
1989). We can describe a curve in a D-dimensional data space using a vector-valued
function f(λ), which is a vector each of whose elements is a function of the scalar λ.
There are many possible ways to parameterize the curve, of which a natural choice
is the arc length along the curve. For any given point x in data space, we can find
the point on the curve that is closest in Euclidean distance. We denote this point by