Analysis of Algorithms : An Active Learning Approach

124 MATCHING ALGORITHMS

if (subLoc > length(substring)) return textStart // found a match else return 0 // indicates no match found end if It should be obvious that the important task is to compare characters, and that is what we will count. In the worst case, each time we compare the substring we match all of the characters but fail on the last one. How many times could this happen? It could happen once for each character in the text. If S is the length of the substring and T is the length of the text, the worst case would seem to take S* (TS + 1) comparisons.^1 We have to consider whether this arrangement of characters is at all possible. Consider a substring of “XX...XXY” and text of “XX...XXXXX,” where the substring is a set of S 1 Xs followed by one Y and the text is a set of T Xs. This set of characters will cause this worst-case behavior. It should be noted that natural language is not usually like this, and so it can be expected that if this algorithm is used with actual words, it will perform much better. In fact, studies have shown that this algorithm averages a little over T comparisons on a natural language text.^2 The problem with the standard algorithm is that it can waste a lot of effort. If we have matched the beginning part of the substring, we can use that infor- mation to tell us how far to move in the text to start the next match. For example, if we look at pass 1 in Fig. 5.1, we see that the mismatch occurred with the fourth character of the substring. That means the first three matched. When we examine the substring, we see that the third symbol doesn’t appear anywhere else, so we could have skipped past the first three symbols and had our second pass start with the fourth symbol of the text instead of the second. The following techniques take advantage of this fact.

■ 5.1.1 Finite Automata The area of theory of computation shows that finite automata are used to decide whether a word is in a given language. A finite automaton (the singular

(^1) The TS + 1 is because we can stop if there are fewer characters left in the text
than there are in the substring.
(^2) Because of the limited number of characters and the uneven distribution of their
occurrence (i.e.; in English, e occurs much more frequently than z or q), performance
on natural language will always be better than our general measures.

Analysis of Algorithms : An Active Learning Approach

Get our desktop app

Company

Features

Documentation

Resources