workflow involve performing identical transformations on the data
sets for individual species, and thus, for those steps I show the
example R commands only for the dog mRNA-seq data set (the R
commands for the human data set are identical except for replacing
"dog" with "human" in the commands).
- For each species, compute (for each gene and each sample
group) the geometric mean of the normalized log 2 expression
levels of the gene for all the samples in the sample group. Then,
for each gene, compute the maximum sample-group-averaged
expression level. Using the reshape2 R package, these steps can
be performed for a given species using a single R command.
For the dog data set, the command would be (example output
also shown):
library(reshape2)
rsc_maxexp_dog <- setNames(aggregate(value~gene,
data=aggregate(value~gene+condition,
data=merge(setNames(melt(log2(1+as.matrix(rsc_norm_dog))),
c("gene","sample_name","value")),
col_data_dog, by.x="sample_name",
by.y=0),
FUN=mean),
FUN=max),
c("gene","max_exp"))
>head(rsc_maxexp_dog)
gene maxexp
1 ENSCAFG00000014413 68.60036224
2 ENSCAFG00000014412 133.12610293
3 ENSCAFG00000014410 56.05202922
4 ENSCAFG00000014417 0.00000000
5 ENSCAFG00000014416 404.80579062
6 ENSCAFG00000014415 76.53825143
>dim(rsc_maxexp_dog)
[1] 24580 2
In the above R command, the function "melt" converts a
data frame from a wide format to a narrow "melted" format
[28], the function "merge" combines sample information with
the melted data frame of normalized log 2 counts, and the outer
and inner calls to the function "aggregate" compute the inter-
sample-group maximum and intra-sample-group geometric
mean (respectively) of the log 2 counts on a gene-level basis.
Usually, the distribution (over genes) of the sample-group-
maximum normalized expression levels is bimodal, with the
lower mode corresponding to genes with either zero or
extremely low transcript abundances in any sample group;
however, the distribution can differ between experimental
data sets and between species (Fig.1). It is usually convenient
to filter out low-expressed genes under the premise that their
mRNA-seq counts are likely noise-dominated (this step also
296 Stephen A. Ramsey