Personalized_Medicine_A_New_Medical_and_Social_Challenge


The kernel-based data integration framework was originally proposed by
Lanckriet et al. (2004a,b). They demonstrated the feasibility of this framework
by classifying yeast proteins into two groups, ribosomal proteins and membrane
proteins. They trained an SVM classifier by solving an optimization problem
based on the combined kernel matrix and the protein labels. The combined kernel
matrix is obtained by a parameterized linear combination of the kernel matrices
constructed from three different types of data: protein sequence, PPI, and gene
expression data. The performance of the SVM classifier improved significantly
as data sources were added, even in the presence of noise. The authors also
examined different choices of embeddings (i.e., different forms of kernel
functions) and their effect on the classification problem.
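
The linear combination of kernel matrices described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper names `linear_kernel` and `combine_kernels`, the toy feature vectors, the choice of a linear kernel, and the hand-picked weights (the text does not state here how the combination parameters are chosen). Plain Python lists are used to keep the example dependency-free.

```python
def linear_kernel(rows):
    """Gram matrix of dot products between all pairs of feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def combine_kernels(kernels, weights):
    """Weighted sum K = sum_i w_i * K_i of equally sized kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need one weight per kernel matrix")
    if any(w < 0 for w in weights):
        # Non-negative weights keep the sum positive semidefinite.
        raise ValueError("weights must be non-negative")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for K, w in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * K[i][j]
    return combined


# Toy data (hypothetical): three proteins described by two data sources.
sequence_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expression_features = [[2.0], [1.0], [0.0]]

K_seq = linear_kernel(sequence_features)
K_expr = linear_kernel(expression_features)

# Uniform weights reduce the combination to a simple average.
K_avg = combine_kernels([K_seq, K_expr], [0.5, 0.5])

# Unequal weights let one data source dominate the combined similarity.
K_weighted = combine_kernels([K_seq, K_expr], [0.75, 0.25])
```

The combined matrix has the same shape as each input, one row and one column per protein, so it can be passed to any classifier that accepts a precomputed kernel.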
A similar methodology was applied to protein function prediction (see
Lanckriet et al. 2004a,b). KB data integration was used to classify
yeast proteins into 13 broad functional categories. Fusion of data derived from
amino acid sequences, protein complexes, gene expression data, and PPI networks
significantly increased the classification performance compared to the performance
of classification done on any single data type.
Napolitano et al. (2013) used KB data integration for drug repositioning. They
trained a multiclass SVM classifier to predict new targets (proteins) for existing
drugs. Kernel matrices were constructed for three different data sets: gene
expression under the influence of drugs, chemical structure, and molecular targets.
The joint kernel matrix was defined as a simple average of all three matrices. The
authors reported high performance (AUC = 0.78) for their classifier.


4.3 Matrix Factorization-Based Data Integration


Data integration by matrix factorization is a recently proposed approach with many
advantages over the previous two methods.^119 It is based on penalized nonnegative
matrix tri-factorization (PNMTF), which was originally designed to co-cluster
heterogeneous relational data.^120 The approach is flexible, i.e., it can take any
number of data sources. Unlike kernel-based approaches, matrix factorization
approaches require minimal (or no) data transformation. Namely, they take data in
its original network representation, which is the most natural representation
for most biological data (see Sect. 3), and create binary matrices depending
on the relations between biological components (genes, proteins, drugs, diseases,
etc.). Another great advantage of this approach lies in the fact that most of the
networks are sparse, and therefore network matrices can be stored in a sparse matrix
representation, which drastically reduces memory storage and computational


(^119) Žitnik et al. (2013).
(^120) Wang et al. (2011).
Computational Methods for Integration of Biological Data 165
