Analysis of Algorithms : An Active Learning Approach

(Ron) #1
5.2 APPROXIMATE STRING MATCHING 137

position. (Add an r and change the “ad” to an “ea”.) The second position has a
2-approximate match (change the “ad” to an “ea”) and a 1-approximate match
(add an e to the front).
Notice that there can be a lot of possibilities and they build very quickly. If
the first few characters matched, but then we hit a sequence that didn’t, we
might find a better match if we changed some characters or put some extra
characters into the substring, into the text, or into both. How can we consider
the possibilities and still do this with a reasonable algorithm and data structure?
If the algorithm has to test all possibilities, it will be too complex. Therefore,
we make the algorithm simple but have a larger data structure.
We will solve this problem by creating a matrix that we will call diffs to
hold the information that we have gathered so far. Each row of this matrix will
be associated with one of the characters in the substring, and each column
will be associated with one of the characters in the text. The values in the
matrix will give us an idea of how well the matching process is going at that
point. So, if the value in row 5 column 27 is a 4, in matching the first five
characters of the substring with the portion of the text ending at location 27,
we have found four differences.
The number of differences for any location will be based on the three possi-
ble values that are immediately above, to the left, and to the left and diagonally
up. If we use the value above, we are implying that the text is missing a charac-
ter of the substring. If we use the value to the left, we are implying that the
substring is missing a character of the text. Use of the diagonal value is related
to the match or mismatch of the characters. More specifically, for any value of
diffs[i,j], we will look at the minimum of three values.^3


  1. diffs[i 1, j 1] if substringi = textj,otherwise diffs[i 1, j 1] + 1

  2. diffs[i 1, j] + 1 (substringiis missing from the text)

  3. diffs[i,j 1] + 1 (textjis missing from the substring)


To get this process started, if we refer to any location above the matrix (in
other words, i= 0), that location will be considered to have a zero stored in it.
If we refer to any location to the left of the matrix (in other words, j = 0), that
location will be considered to have the corresponding value of i stored in it. A

(^3) Notice that diffs[i 1, j 1] represents the value diagonally up and to the left,
diffs[i 1, j] represents the value above, and diffs[i,j 1] represents the value to the
left.

Free download pdf