two-way identification of most possible TF-gene interactions: on
the basis of ENCODE ChIP-Seq binding evidence or JASPAR
prediction, and on co-expression according to the data of the
largest cancer omics resource [112].
4 Notes
This review has given a comprehensive summary of the data resources,
data analysis methods, and data visualization tools that support the
integration of big biological data. Finally, we list several notes on
this review:
- Conventional big data from society typically has a large number
of samples, each with only a few features/attributes. By
contrast, big biological data supplies a modest but sufficient
number of samples and measures tens of thousands of features for
each sample simultaneously. Such small-sample, high-dimensional
data requires new analytic approaches, including data
integration.
- The “bottom-up integration” mode with follow-up manual
integration is always a hypothesis-driven approach to extract
the significantly enriched or observed biological knowledge in the
data. The key to these methods is that the experiments and outcome
data must carry clear and suitable biological hints, so that the
data combination can extract the biological signals in each type
of data and explain the same preset biological hypothesis in a
single analysis framework. Although integrative analysis
frameworks already exist for particular combinations of data
types, a more general and flexible scheme that handles both
existing and potential new data types is still lacking. It is
urgently required to design quantitative evaluations of the
confidence in the driver hypothesis ahead of data analysis, and
of the contribution of each data type to the biological
hypothesis.
- Meanwhile, the “top-down integration” mode with follow-up in
silico integration is usually a data-driven approach to extract
the most probable feature signals or sample patterns in the data.
The key to these methods is efficient correction that reduces the
noise and bias in the different types of data, so that the data
fusion can identify coordinated data distributions or data
correlations across multiple types of data in a unified
mathematical model. Many techniques are available; however, they
impose hard constraints on the coordination of the combined data,
which limits their application to diverse biological systems.
Thus, relaxed alternatives, e.g., soft-constraint-based
approaches, will expand the power of data fusion in biological
studies and help detect unseen biological patterns.
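
The contrast between a hard and a soft constraint in the last note
can be made concrete with a toy sketch. It assumes that two data
types each give one score per sample, and it uses a quadratic
agreement penalty chosen here purely for illustration. The function
`fuse`, the weight `lam`, and the example values are hypothetical and
are not taken from the methods reviewed in this chapter.

```python
# Toy illustration only: the quadratic penalty and all names here are
# assumptions made for this sketch, not a method from the chapter.

def fuse(x1, x2, lam):
    """Minimise (a - x1)^2 + (b - x2)^2 + lam * (a - b)^2 per sample.

    x1, x2: scores for the same samples from two data types.
    lam = 0 leaves each data type untouched; a very large lam drives
    a and b together, which mimics a hard constraint (one shared value).
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    fused1, fused2 = [], []
    for v1, v2 in zip(x1, x2):
        mean = (v1 + v2) / 2.0
        # Closed-form minimiser: the gap between the two estimates
        # shrinks by a factor of 1 / (1 + 2 * lam).
        half_gap = (v1 - v2) / (2.0 * (1.0 + 2.0 * lam))
        fused1.append(mean + half_gap)
        fused2.append(mean - half_gap)
    return fused1, fused2


# Hypothetical per-sample scores from two data types.
type_one = [1.0, 2.0, 6.0]
type_two = [1.2, 1.8, 2.0]

soft = fuse(type_one, type_two, lam=1.0)   # partial agreement
hard = fuse(type_one, type_two, lam=1e9)   # near-identical values
```

With `lam = 0` each data type keeps its own value; as `lam` grows, the
two values are pulled toward their mean and approach the
hard-constraint case. At intermediate weights, a sample on which the
data types disagree (the third one here) keeps part of that
disagreement, which a hard constraint would average away.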
126 Xiang-Tian Yu and Tao Zeng