
10.1 Text Transformations 231


can be hashes as well as arrays. Furthermore, data structures in general can
mix arrays, hashes, and scalars in a single “dimension.” Thus it is possible to
have a data structure consisting of an array some of whose items are hashes,
some are arrays, and the rest are scalars. This kind of mixing is necessary
for representing XML documents as Perl data structures, as developed in
subsection 10.2.3. However, one should avoid mixing arrays and hashes in a
completely arbitrary fashion, as this can get very confusing. One technique
that helps keep the program simple is to use only hashes and scalars; in
other words, avoid arrays. As we saw in subsection 10.1.1, one can use a
hash instead of an array.
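As a minimal sketch of using a hash where an array might otherwise appear, the fragment below stores the same made-up values both ways. The variable names and values are illustrative assumptions and are not taken from the programs in this chapter.

```perl
use strict;
use warnings;

# An ordinary array: items are indexed 0, 1, 2.
my @count_array = (5, 7, 9);

# The same data as a hash whose keys are the positions.
my %count_hash = (0 => 5, 1 => 7, 2 => 9);

# A hash has no inherent order, so sort the keys numerically
# whenever the positional order matters.
my @in_order = map { $count_hash{$_} } sort { $a <=> $b } keys %count_hash;

print "@in_order\n";
```

The cost of the hash is the explicit numeric sort; the benefit is that every level of a larger data structure can then be accessed in the same way.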
Consider the task of representing a DNA motif. A motif is a sequence of
probability distributions, so it should be represented as an array. Each item of
this array is a probability distribution. This probability distribution assigns
a number to each of the DNA bases. Such an assignment is most naturally
represented using a hash. Thus a motif is an array of hashes. A motif-finding
program produces several motifs, each with a label. The most natural way to
label the motifs is to use a hash. So the result of a motif-finding program is a
hash of arrays of hashes. However, to avoid mixing hashes and arrays, motifs
will be represented using a 3D hash. Program 10.13 extracts the probability
distributions from the output produced by CONSENSUS.
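The shape of such a 3D hash can be sketched as follows. This is not Program 10.13; the label, positions, and numbers are invented for illustration, and the sketch assumes that the third dimension is keyed by the DNA base.

```perl
use strict;
use warnings;

# motif label -> position -> base -> value
my %motifs;

# Hypothetical distribution for position 0 of the motif labeled "1".
$motifs{"1"}{0}{A} = 0.7;
$motifs{"1"}{0}{C} = 0.1;
$motifs{"1"}{0}{G} = 0.1;
$motifs{"1"}{0}{T} = 0.1;

# Hypothetical distribution for position 1 of the same motif.
$motifs{"1"}{1}{$_} = 0.25 for qw(A C G T);

# Walk all three dimensions. Labels are sorted as text,
# positions as numbers.
for my $label (sort keys %motifs) {
    for my $position (sort { $a <=> $b } keys %{ $motifs{$label} }) {
        for my $base (sort keys %{ $motifs{$label}{$position} }) {
            print "$label $position $base $motifs{$label}{$position}{$base}\n";
        }
    }
}
```

Perl creates the intermediate hashes automatically on first assignment, so no level has to be declared in advance.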
The program extracts information by using Perl patterns. The label of the
motif is indicated by a line that starts with MATRIX and is followed by a
number. Note the use of the dollar sign to specify that the line has nothing
else on it. The motif number is obtained from the pattern by putting
parentheses around the subpattern for the number. The number of sequences is
obtained in a similar fashion. Adding 0 to the number of sequences tells Perl
that this is a number. The motif label, by contrast, may look like a number,
but it is treated as just text.
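The pattern matching described above might look roughly like the following sketch. The two input lines are hypothetical stand-ins; in particular, the wording of the line carrying the number of sequences is an assumption, and the actual CONSENSUS output may differ.

```perl
use strict;
use warnings;

# Hypothetical input lines, not real CONSENSUS output.
my @lines = ("MATRIX 2", "number of sequences = 12");

my ($label, $sequence_count);
for my $line (@lines) {
    if ($line =~ /^MATRIX\s+(\d+)$/) {
        # ^ and $ anchor the pattern to the whole line; the
        # parentheses capture the digits into $1.
        $label = $1;                 # kept as text
    }
    elsif ($line =~ /^number of sequences = (\d+)$/) {
        $sequence_count = $1 + 0;    # adding 0 makes this a number
    }
}

print "motif $label has $sequence_count sequences\n";
```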
The most complicated part of the program is the part that extracts the
probability distributions. The frequencies for one DNA base are on a line that
begins with the name of the base, followed by a vertical bar. The rest of the
line consists of frequencies. The frequencies are obtained by splitting the
line and looping over the fields, starting with the third field (i.e.,
starting with index 2, because array indexes start at 0).
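The splitting step can be sketched as follows. The input line and the motif label are hypothetical, and numbering the motif positions from 0 is an assumption of this sketch, not necessarily the convention used in Program 10.13.

```perl
use strict;
use warnings;

my %motifs;
my $label = "1";               # hypothetical motif label
my $line  = "A | 3 0 12 1";    # hypothetical frequency line

# The line must begin with a base name followed by a vertical bar.
if ($line =~ /^([ACGT])\s*\|/) {
    my $base = $1;

    # Splitting on whitespace gives ("A", "|", "3", "0", "12", "1"),
    # so the frequencies begin at index 2.
    my @fields = split ' ', $line;

    for my $i (2 .. $#fields) {
        my $position = $i - 2;    # assumed: positions numbered from 0
        $motifs{$label}{$position}{$base} = $fields[$i] + 0;
    }
}

print "position 2, base A: $motifs{$label}{2}{A}\n";
```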
The data structure being constructed is called motifs. It is a 3D hash. An
item in the first dimension is a single motif and is determined by the motif
label. An item in the second dimension is one position in the motif. One
advantage of using a hash instead of an array is that the DNA positions need
not start at 0, and they need not be contiguous. In this case, the frequency of
