122 MATCHING ALGORITHMS
text, you should trace the straightforward algorithm and the Knuth-Morris-
Pratt algorithm using the pattern “abcabc” and trace the Boyer-Moore algo-
rithm using the pattern “abcbccabc.” You should also try to answer any ques-
tions before reading on. A hint or the answer is in the sentences following the
question.
ooking for a substring in a longer piece of text is an important utility in
text editors and word processors. This chapter begins with an examina-
tion of four ways this can be done. Presentation of matching techniques
will be done from the perspective of character strings. These techniques could,
however, be used to search for any string of bits or bytes in a binary file. Virus
checking is an example of a binary-based use that searches for the known pat-
tern of bytes that appear in a computer virus.
Word processing programs typically have spelling checkers that will not only
identify words that appear to be misspelled but also suggest possible correct
spellings for the word. One process for spell checkers is to produce a sorted list
of words in the document. This list is then compared to the words stored in
both the system dictionary and the user’s dictionary, and words that do not
appear are flagged as potentially incorrect. The process of identifying suggested
alternative spellings can involve approximate string matching.
The discussion of approximate string matches will be based on looking for a
substring in a given piece of text. This technique can, however, also be applied
to looking for approximate matches with a dictionary.
5.1 String Matching
Our problem is to find the first occurrence of a substring within a larger piece
of text. Finding later occurrences can use the same techniques by just changing
the starting point in the text. This problem is complex because the entire sub-
string has to match in order. In the standard algorithm, we begin by compar-
ing the first character of the text with the first character of the substring. If
they match, we move to the next character of each. This process continues
until the entire substring matches the text or the next characters do not match.
L