
10.1 Text Transformations


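# Read the input one line at a time (files named on the command line, or STDIN).
while (<>) {
    chomp;                  # remove the trailing newline from $_
    if (/Motif #1:/) {      # does this line contain the first motif's title?
        print "The first motif has been found!\n";
    }
}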


Program 10.9 Using pattern matching to find one piece of data in a file

et al. 2000; Roth et al. 1998), CONSENSUS (Stormo and Hartzell III 1989;
Hertz et al. 1990; Hertz and Stormo 1999), and Gibbs sampler (Lawrence et al.
1993; Liu et al. 1995), and all of them use their own output formats. No
doubt many more formats already exist for motifs, and many more will be
used in the future. A similar situation exists for virtually every other kind of
bioinformatics information. Many tools are available for similar tasks, and
each one uses its own input and output formats.
To process information such as the BioProspector file above, we make use
of the pattern-matching features of Perl. Pattern matching is one of the most
powerful features of Perl, and it is one of the reasons why Perl has become
so popular.
Consider the task of extracting just the information about the first motif,
where a motif is defined as a sequence of probability distributions on the
four DNA bases. We will do this in a series of steps. First we need to read
the BioProspector file and find where the information about the desired
motif is located, as shown in program 10.9.
Each motif description begins with a title containing “Motif #” followed
by a number and ending with a colon. The condition /Motif #1:/ is
responsible for detecting such a title. The text between the forward slashes
is the pattern to be matched. A pattern can be as simple as just some text
that is to be matched, as in this case.
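As a minimal sketch of how such a literal pattern behaves, the following applies the same condition to a few lines. The sample lines and the helper name has_first_motif_title are made up for illustration; they are not taken from a real BioProspector file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the line contains the literal
# text "Motif #1:" anywhere, and 0 otherwise.
sub has_first_motif_title {
    my ($line) = @_;
    return $line =~ /Motif #1:/ ? 1 : 0;
}

# Made-up sample lines (assumptions, not real BioProspector output).
my @lines = (
    "Motif #1:\n",
    "Summary for Motif #1: (hypothetical wording)\n",
    "Motif #2:\n",
);

for my $line (@lines) {
    my $shown = $line;
    chomp $shown;    # drop the newline only for tidy printing
    my $verdict = has_first_motif_title($line) ? "match" : "no match";
    print "$shown => $verdict\n";
}
```

The first two sample lines match because the text occurs somewhere within them; the third does not. Inside the slashes the # character is ordinary text here, since it would only start a comment under the /x modifier.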
If one wanted the line that contained exactly this text, one would use the
condition $_ eq "Motif #1:\n". Note that string comparison uses eq, not the
equals sign (= is assignment, and == compares numbers). Also note that every
line read from a file ends with the newline character unless it has been
removed; after the chomp in program 10.9, the condition would have to be
$_ eq "Motif #1:" instead. In practice it is usually easier to use a pattern
match condition than a test for equality. The pattern match will handle more
cases, and one does not have to worry about whether or not the newline
character might be in the line.
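The difference between the two kinds of test can be sketched as follows. The helper names is_exact_title and contains_title and the three input strings are hypothetical, chosen only to show how the newline and stray whitespace affect each condition.

```perl
use strict;
use warnings;

# Exact test: true only when the whole line is the title plus a newline.
sub is_exact_title {
    my ($line) = @_;
    return $line eq "Motif #1:\n" ? 1 : 0;
}

# Pattern test: true when the title occurs anywhere in the line.
sub contains_title {
    my ($line) = @_;
    return $line =~ /Motif #1:/ ? 1 : 0;
}

# Hypothetical inputs (assumptions for illustration only).
my @cases = (
    "Motif #1:\n",     # a line as read from a file
    "Motif #1:",       # the same line after chomp
    "Motif #1: \n",    # a trailing space before the newline
);

for my $line (@cases) {
    (my $shown = $line) =~ s/\n/\\n/;    # make the newline visible
    printf "%-14s eq: %d  match: %d\n",
        "\"$shown\"", is_exact_title($line), contains_title($line);
}
```

Only the first input passes the equality test, while all three pass the pattern match, which is why the pattern form is usually the more forgiving choice.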
