- Sample group information for the mRNA-seq datasets for each
species. For each species, the sample group information should
be contained in a single-column data frame in which the row
names are unique sample names. A portion of the data frame
for the example human-dog analysis vignette (along with the
dimensions of the data frame) is shown here:
> head(dog_sample_info)
    external_name
s01         TCC.1
s02         TCC.2
s03         TCC.3
s05      normal.1
s06         TCC.4
s34         TCC.5
> dim(dog_sample_info)
[1] 10  1
Sample information for the other species (in this vignette,
human) should be stored in a similar data frame (in this
vignette, we will assume the data frame is named
"human_sample_info").
- Ortholog mappings between the two species, in the form of a
two-column data frame whose first column contains Ensembl
gene identifiers for the second species (in this example vignette,
human) and whose second column contains the Ensembl gene
identifier of the ortholog (if any) of that gene in the first species
(in this example vignette, dog). Such a mapping can be
obtained using Ensembl BioMart. A portion of the data
frame for the example human-dog analysis vignette (along
with the dimensions of the data frame) is shown here (see
Note 2).
> head(human_dog_ensg)
  Ensembl.Gene.ID Dog.Ensembl.Gene.ID
1 ENSG00000261657
2 ENSG00000223116
3 ENSG00000233440
4 ENSG00000207157
5 ENSG00000229483
6 ENSG00000252952  ENSCAFG00000025776
> dim(human_dog_ensg)
[1] 65999     2
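As a minimal sketch (not part of the original vignette), the two inputs described above can be sanity-checked in base R before any analysis is run. The helper names check_sample_info and has_ortholog are hypothetical, the toy data frames contain only a few of the rows shown above, and the code assumes that a missing ortholog is stored as an empty string or as NA.

```r
# Toy stand-ins for the two inputs; only a few of the rows printed above.
dog_sample_info <- data.frame(
  external_name = c("TCC.1", "TCC.2", "normal.1"),
  row.names = c("s01", "s02", "s05"),
  stringsAsFactors = FALSE
)

human_dog_ensg <- data.frame(
  Ensembl.Gene.ID = c("ENSG00000261657", "ENSG00000223116",
                      "ENSG00000252952"),
  Dog.Ensembl.Gene.ID = c("", "", "ENSCAFG00000025776"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: confirm the single-column layout with unique
# sample names as row names.
check_sample_info <- function(df) {
  stopifnot(
    is.data.frame(df),
    ncol(df) == 1,
    anyDuplicated(rownames(df)) == 0
  )
  invisible(TRUE)
}

# Hypothetical helper: keep only the rows of the two-column mapping
# for which an ortholog identifier is present.
# Assumption: a missing ortholog is encoded as "" or NA.
has_ortholog <- function(map) {
  stopifnot(is.data.frame(map), ncol(map) == 2)
  ortholog_id <- as.character(map[[2]])
  map[!is.na(ortholog_id) & nzchar(ortholog_id), , drop = FALSE]
}

check_sample_info(dog_sample_info)
mapped <- has_ortholog(human_dog_ensg)
```

Here, mapped retains only the human genes that have a dog ortholog listed. Whether genes with several orthologs need further handling depends on the downstream steps and is not addressed by this sketch.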
3 Methods
Below, I outline the steps required to carry out an unsupervised and
a supervised comparison of mRNA-seq data sets from two species,
using as an example mRNA-seq data sets from a cross-species (dog
and human) study of bladder cancer. The first five steps of the