Systems Biology (Methods in Molecular Biology)

(Tina Sui) #1
workflow involve performing identical transformations on the data
sets for individual species, and thus, for those steps I show the
example R commands only for the dog mRNA-seq data set (the R
commands for the human data set are identical except for replacing
"dog" with "human" in the commands).


  1. For each species, compute (for each gene and each sample
    group) the geometric mean of the normalized log 2 expression
    levels of the gene for all the samples in the sample group. Then,
    for each gene, compute the maximum sample-group-averaged
    expression level. Using the reshape2 R package, these steps can
    be performed for a given species using a single R command.
    For the dog data set, the command would be (example output
    also shown):


library(reshape2)
rsc_maxexp_dog <- setNames(aggregate(value~gene,
data=aggregate(value~gene+condition,
data=merge(setNames(melt(log2(1+as.matrix(rsc_norm_dog))),
c("gene","sample_name","value")),
col_data_dog, by.x="sample_name",
by.y=0),
FUN=mean),
FUN=max),
c("gene","max_exp"))

>head(rsc_maxexp_dog)
gene maxexp
1 ENSCAFG00000014413 68.60036224
2 ENSCAFG00000014412 133.12610293
3 ENSCAFG00000014410 56.05202922
4 ENSCAFG00000014417 0.00000000
5 ENSCAFG00000014416 404.80579062
6 ENSCAFG00000014415 76.53825143
>dim(rsc_maxexp_dog)
[1] 24580 2
In the above R command, the function "melt" converts a
data frame from a wide format to a narrow "melted" format
[28], the function "merge" combines sample information with
the melted data frame of normalized log 2 counts, and the outer
and inner calls to the function "aggregate" compute the inter-
sample-group maximum and intra-sample-group geometric
mean (respectively) of the log 2 counts on a gene-level basis.
Usually, the distribution (over genes) of the sample-group-
maximum normalized expression levels is bimodal, with the
lower mode corresponding to genes with either zero or
extremely low transcript abundances in any sample group;
however, the distribution can differ between experimental
data sets and between species (Fig.1). It is usually convenient
to filter out low-expressed genes under the premise that their
mRNA-seq counts are likely noise-dominated (this step also

296 Stephen A. Ramsey

Free download pdf