Computational Systems Biology Methods and Protocols.7z

thousands of samples is possible nowadays because large amounts
of expression data are publicly available; however, extracting
information from the correlation data is not straightforward because
the expression data are generated by different laboratories from
different cell types under different biological conditions [73]. To address
these batch-effect issues, many computational approaches
have been proposed. “Surrogate variable analysis” (SVA) was
introduced to recover the effects of important unmodeled variables
and essentially produce an analysis as if all relevant variables had been
included, and it has been shown to improve biological accuracy and
reproducibility [71]. ComBat removes batch
effects with an empirical Bayes framework; it centers the data
on the overall grand mean of all samples, so the adjusted
data do not coincide with the location of any original batch
[74]. A modified version of ComBat (M-ComBat) instead
shifts samples to the mean and variance of a “gold standard”
reference batch rather than to the grand mean and pooled variance
[75]. An extension of PCA known as guided PCA (gPCA) has
been proposed to quantify batch effects, together with a new
test statistic that uses gPCA to test whether a batch
effect exists in high-throughput data [76]. Further, the software
pipeline BatchQC uses interactive visualizations
and statistics to evaluate the impact of batch effects in a genomic
dataset; it can also apply existing adjustment tools and lets
researchers evaluate their benefits interactively [72].
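The difference between the grand-mean target of ComBat and the reference-batch target of M-ComBat can be illustrated with a minimal location-scale sketch. This is not the published ComBat or M-ComBat procedure: it omits the empirical Bayes shrinkage and any covariates, and the function name `location_scale_adjust`, its parameters, and the toy data are assumptions made only for illustration. It also assumes every batch has at least two samples and every gene varies within each batch.

```python
import numpy as np

def location_scale_adjust(X, batches, reference=None):
    """Align each batch's per-gene mean and SD to a common target.

    X         : (n_samples, n_genes) expression matrix
    batches   : length-n_samples array of batch labels
    reference : None  -> target is the grand mean and pooled within-batch SD
                label -> target is the mean and SD of that reference batch
    Simplified illustration only: no empirical Bayes shrinkage, no covariates.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)

    # Per-batch, per-gene location and scale
    means = {b: X[batches == b].mean(axis=0) for b in labels}
    sds = {b: X[batches == b].std(axis=0, ddof=1) for b in labels}

    if reference is None:
        target_mean = X.mean(axis=0)
        # Pooled within-batch variance, weighted by degrees of freedom
        dof = sum((batches == b).sum() - 1 for b in labels)
        pooled_var = sum(((batches == b).sum() - 1) * sds[b] ** 2
                         for b in labels) / dof
        target_sd = np.sqrt(pooled_var)
    else:
        target_mean, target_sd = means[reference], sds[reference]

    adjusted = np.empty_like(X)
    for b in labels:
        mask = batches == b
        z = (X[mask] - means[b]) / sds[b]          # standardize within batch
        adjusted[mask] = z * target_sd + target_mean  # rescale to the target
    return adjusted

# Toy data (hypothetical): batch "b" is shifted and rescaled relative to "a"
rng = np.random.default_rng(0)
batches = np.array(["a"] * 20 + ["b"] * 20)
X = rng.normal(size=(40, 5))
X[batches == "b"] = X[batches == "b"] * 2.0 + 3.0

grand = location_scale_adjust(X, batches)               # grand-mean target
ref = location_scale_adjust(X, batches, reference="a")  # reference-batch target
```

With the grand-mean target, every batch is moved to a location that matches none of the original batches; with `reference="a"`, the samples of batch "a" are left unchanged and only batch "b" is moved.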
As an initial integrative application related to batch effect
removal, the conventional horizontal data ensemble integrates
the same type of data from different studies. For example, an
integrative pre-screening step reduces the dimensionality
of multiple cancer genomic datasets and can be coupled with
existing analysis methods to identify cancer markers [77]. In
addition, an analysis of the accrued gene expression data in the
TCGA pan-cancer (PANCAN) collection suggests that paired
normal samples are in general more informative about patient
survival than tumors, which supports the importance of
collecting and profiling matched normal tissues to gain more
insight into disease etiology and patient progression [78].

3.2 Bottom-Up Integration

Depending on which types of high-throughput data are combined,
the “bottom-up integration” approaches use many specific
analysis frameworks, as summarized in Table 3.
Generally, both mutation and transcriptome information are
considered; in particular, mRNA expression is used in almost every
analysis (see Note 2). Below, the integrative methods are introduced
and discussed separately according to whether or not they use
mutation data.
On the one hand, mutation-centered integration mainly tries
to identify the genetic determinants of a phenotype and its change,

116 Xiang-Tian Yu and Tao Zeng
