
Here, we give a brief mathematical introduction to kernel-based statistical
learning methods, followed by a review of the literature applying these methods
to biological data integration.
Kernel methods work by embedding data items (genes, proteins, diseases, etc.)
into a vector space, called the feature space, which in many cases is a
Hilbert space (see Yu et al. (2011)). This embedding is implicitly represented by a
kernel function, $K(x_1, x_2)$, defined as the inner product between the embedding
representations, $\varphi(x_1)$ and $\varphi(x_2)$, of two data items, $x_1$ and $x_2$:


$$K(x_1, x_2) = \varphi(x_1)^{T}\,\varphi(x_2) \qquad (3)$$

Computing this product for all pairs of data items yields a symmetric, positive
semi-definite matrix, $K$, called the kernel matrix. This representation has two great
advantages. The first is that many nonvector data types, such
as symbolic data, strings, and trees (which would be very difficult to translate into
probability distributions in the BN framework), can be readily represented by this
mathematical embedding into a Hilbert space. The second is that the
mathematical form of kernel matrices lends itself to
algebraic manipulation, such as addition or multiplication, thereby enabling the com-
bination (integration) of heterogeneous data within this formalism.
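
To illustrate this second advantage, the following minimal sketch builds two kernel matrices over the same set of items and integrates them by a weighted sum, which again yields a valid kernel matrix. The choice of a linear and an RBF kernel, the two hypothetical data sources (`X_expr`, `X_inter`), and the equal mixing weights are our own illustrative assumptions, not prescribed by the text:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

# Toy data: 5 items described by two hypothetical data sources,
# e.g., expression profiles and interaction profiles.
rng = np.random.default_rng(0)
X_expr = rng.normal(size=(5, 10))   # data source 1
X_inter = rng.normal(size=(5, 20))  # data source 2

# One kernel matrix per data source.
K1 = linear_kernel(X_expr)           # K1[i, j] = <x_i, x_j>
K2 = rbf_kernel(X_inter, gamma=0.1)  # Gaussian similarity

# Integration by algebraic combination: a convex combination of
# valid kernel matrices is itself a symmetric PSD kernel matrix.
K = 0.5 * K1 + 0.5 * K2

# Sanity checks: symmetry and positive semi-definiteness.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-9
```

In practice, the mixing weights themselves can be learned from data (as in multiple kernel learning), rather than fixed by hand as in this sketch.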
Support vector machines (SVMs) are supervised learning models that are mostly used
for classification and regression tasks in machine learning. They can be defined in
the following way: given a training set of data items, $\{x_1, \dots, x_N\}$, each labeled as
belonging to a positive or a negative class, $y_k \in \{-1, +1\}$, an SVM training algorithm
builds a model that assigns new, unlabeled examples to one of the classes. In
machine learning, this problem is called linear classification. SVM models can
readily be generalized to non-linear classification problems by replacing $x_i$ with
the embedding $\varphi(x_i)$ (see Yu et al. (2011)). Using geometric intuition, this can be
explained in the following way: the SVM aims to construct a hyperplane in a high-
dimensional feature space that separates class $-1$ and class $+1$ in the best possible
way. The best performance of an SVM classifier is achieved when the hyperplane lies at
the largest distance (margin) from the nearest training data points of either class. Therefore, we
can mathematically formulate this optimization problem in the following way:


$$\min_{w,\,b} \;\; \frac{1}{2}\, w^{T} w \qquad (4)$$

subject to the following:

$$y_k \left( w^{T} \varphi(x_k) + b \right) \ge 1, \quad k = 1, \dots, N$$

where $w$ is the normal vector of the hyperplane, $b$ is a bias parameter, and the whole term
in the brackets is called the linear discriminant function, $f(x) = w^{T}\varphi(x) + b$, the sign
of which ($\mathrm{sign}(f(x)) = \pm 1$) determines the class assignment. This problem is known
as a convex optimization problem, and its solution, $w$ and $b$, can be obtained by
standard convex (quadratic) optimization techniques.
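
Tying the two ideas together, here is a minimal sketch of training such a classifier directly on a precomputed kernel matrix with scikit-learn's `SVC`. Note that `SVC` solves the soft-margin variant of problem (4), and the toy data, labels, and train/test split below are our own illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy binary classification data with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

# Precompute the kernel matrix K(x_i, x_j) over the training items;
# this is where an integrated (combined) kernel could be plugged in.
K_train = rbf_kernel(X_train, X_train)

# SVC with a precomputed kernel solves the (soft-margin) dual of the
# margin-maximization problem formulated above.
clf = SVC(kernel="precomputed").fit(K_train, y_train)

# New items are classified via sign(f(x)) = sign(w^T phi(x) + b),
# evaluated through kernel values against the training items.
K_test = rbf_kernel(X_test, X_train)
pred = clf.predict(K_test)
print("accuracy:", np.mean(pred == y_test))
```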

