124 MATCHING ALGORITHMS
if (subLoc > length(substring))
return textStart // found a match
else
return 0 // indicates no match found
end if
It should be obvious that the important task is to compare characters, and
that is what we will count. In the worst case, each time we compare the sub-
string we match all of the characters but fail on the last one. How many times
could this happen? It could happen once for each character in the text. If S is
the length of the substring and T is the length of the text, the worst case would
seem to take S* (TS + 1) comparisons.^1 We have to consider whether this
arrangement of characters is at all possible. Consider a substring of
“XX...XXY” and text of “XX...XXXXX,” where the substring is a set of
S 1 Xs followed by one Y and the text is a set of T Xs. This set of characters
will cause this worst-case behavior. It should be noted that natural language is
not usually like this, and so it can be expected that if this algorithm is used
with actual words, it will perform much better. In fact, studies have shown that
this algorithm averages a little over T comparisons on a natural language text.^2
The problem with the standard algorithm is that it can waste a lot of effort.
If we have matched the beginning part of the substring, we can use that infor-
mation to tell us how far to move in the text to start the next match. For
example, if we look at pass 1 in Fig. 5.1, we see that the mismatch occurred
with the fourth character of the substring. That means the first three matched.
When we examine the substring, we see that the third symbol doesn’t appear
anywhere else, so we could have skipped past the first three symbols and had
our second pass start with the fourth symbol of the text instead of the second.
The following techniques take advantage of this fact.
■ 5.1.1 Finite Automata
The area of theory of computation shows that finite automata are used to
decide whether a word is in a given language. A finite automaton (the singular
(^1) The TS + 1 is because we can stop if there are fewer characters left in the text
than there are in the substring.
(^2) Because of the limited number of characters and the uneven distribution of their
occurrence (i.e.; in English, e occurs much more frequently than z or q), performance
on natural language will always be better than our general measures.