232 10 Transforming with Traditional Programming Languages
while (<>) {
    chomp;
    if (/^MATRIX ([0-9]+)$/) {
        # Start of a new motif: remember its label.
        $label = $1;
    } elsif (/^number of sequences = ([0-9]+)$/) {
        $numberOfSequences = $1 + 0;
    } elsif (/^[ACGT] [|]/) {
        # Fields: the base, a "|" separator, then one frequency per position.
        @record = split;
        for ($i = 2; $i < scalar(@record); $i++) {
            $motifs{$label}{$i-2}{$record[0]} =
                $record[$i] / $numberOfSequences;
        }
    }
}
foreach $label (sort(keys(%motifs))) {
    print "Probability distributions for motif $label\n";
    %motif = %{ $motifs{$label} };
    # Sort positions numerically so that position 10 follows 9, not 1.
    foreach $position (sort { $a <=> $b } keys(%motif)) {
        foreach $base ("A", "C", "T", "G") {
            print("$base $motif{$position}{$base} ");
        }
        print("\n");
    }
}
Program 10.13 Extracting data structures from a file using pattern matching
the first DNA base is the third field on the line, the second frequency is the
fourth field, and so on. So it is necessary to subtract 2 from the field position
to get the DNA base position. Finally, an item in the third dimension is the
probability for one of the four DNA bases. This is obtained by dividing the
frequency by the number of sequences.
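The index arithmetic described above can be sketched on a single line of input. The sample line, the count of 10 sequences, and the %probability hash are made up for illustration; they are not taken from the file the program actually reads.

```perl
use strict;
use warnings;

# Hypothetical input line: the base, a "|" separator, then one
# frequency per motif position.
my $line = "A | 2 5 10";
my $numberOfSequences = 10;    # assumed value for this example

my %probability;               # position => probability of this base
my @record = split ' ', $line; # ("A", "|", "2", "5", "10")
for (my $i = 2; $i < scalar(@record); $i++) {
    # The third field (index 2) belongs to position 0, the fourth
    # field (index 3) to position 1, and so on.
    $probability{$i - 2} = $record[$i] / $numberOfSequences;
}

foreach my $position (sort { $a <=> $b } keys %probability) {
    printf "position %d: %s = %s\n",
        $position, $record[0], $probability{$position};
}
```

Each frequency is divided by the number of sequences, so the three frequencies 2, 5, and 10 become the probabilities 0.2, 0.5, and 1 for positions 0, 1, and 2.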
Having extracted the motifs, the next step is to print them. Since the motifs
are in a 3D data structure, the most natural way to use the structure is with
three nested loops. The first loop processes the motifs. The labels are the
keys of the motifs hash, and it is customary to sort the keys of a hash so
that they are printed in a reasonable order.
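As a minimal sketch of this traversal, the following code walks a small, made-up three-level hash with three nested loops. One detail worth keeping in mind is that Perl's default sort compares keys as strings, so the sketch uses a numeric comparison block, on the assumption that labels and positions are numbers. The data values and the names @byString and @byNumber are hypothetical.

```perl
use strict;
use warnings;

# Made-up data: motif label => position => base => probability.
my %motifs = (
    10 => { 0 => { A => 0.25, C => 0.25, G => 0.25, T => 0.25 } },
    2  => { 0 => { A => 1,    C => 0,    G => 0,    T => 0 },
            1 => { A => 0,    C => 0.5,  G => 0.5,  T => 0 } },
);

# Default sort is a string comparison, so "10" comes before "2".
my @byString = sort keys %motifs;
# A numeric comparison block gives the order 2, 10.
my @byNumber = sort { $a <=> $b } keys %motifs;

foreach my $label (@byNumber) {                  # loop 1: motifs
    print "Probability distributions for motif $label\n";
    my %motif = %{ $motifs{$label} };
    foreach my $position (sort { $a <=> $b } keys %motif) {  # loop 2: positions
        foreach my $base (qw(A C G T)) {         # loop 3: bases
            print "$base $motif{$position}{$base} ";
        }
        print "\n";
    }
}
```

Without any sorting, the keys of a hash come back in no particular order, which is why some explicit ordering is applied at each level that has more than a handful of keys.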