Hash-based Search
All HASH-BASED SEARCH algorithms share the first two components; different
behaviors come about when variations to collision handling are employed.
The cost of computing the hash value must be bounded by a fixed, constant upper
bound. In the example in this section, computing the hash value was
proportional to the length of the string. For any finite collection of words, there is
a longest string with length k. If t_k is the time it takes to compute the hash value
for the longest string, then computing any hash value requires at most t_k.
Computing the hash value is therefore considered to be a constant time operation.
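This bounded-cost argument can be made concrete with a short sketch. The multiplier 31 and the names below are illustrative assumptions, not necessarily the scheme of Example 5-6:

```java
public class StringHash {
    // One pass over the characters: the cost is proportional to the
    // string's length, and so bounded by t_k, the cost for the longest
    // word in any fixed dictionary.
    static int hash(String s, int numBins) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);      // constant work per character
        }
        return (h & 0x7fffffff) % numBins; // clear sign bit, reduce to a bin index
    }

    public static void main(String[] args) {
        int b = 1045875;                   // table size used later in this section
        System.out.println(hash("algorithm", b));
    }
}
```

Because the loop body does constant work, the hash of any word in the collection costs at most one loop iteration per character of the longest word.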
The second part of the algorithm also performs in constant time. If the table is
stored on secondary storage, there may be a variation that depends upon the posi-
tion of the element and the time required to position the device, but this has a
constant upper bound.
If we can show that the third part of the computation also has a constant upper
bound, then we can easily prove that the time performance of hash-based
searching is constant. When we use chaining, we use the load factor, α = n/b, where
b is the number of bins in the hash table and n is the number of elements stored in
the hash table. The load factor gives the average number of elements in each bin's
linked list.
The worst-case performance for finding an element by chaining is O(n), which
occurs when all elements hash to the same bin in the table. The average number
of elements to examine in a bin is α. For a detailed analysis, see Chapter 11 of
(Cormen et al., 2001).
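As a hedged sketch of the chaining structure this analysis describes (the class and method names here are illustrative, not the book's Example 5-8):

```java
import java.util.LinkedList;

// Minimal chained hash table. search() walks a single bin's linked list,
// so the average number of elements examined tracks the load factor
// α = n/b; the worst case, all n elements in one bin, degrades to O(n).
public class ChainedTable {
    private final LinkedList<String>[] bins;
    private int n = 0;                        // number of stored elements

    @SuppressWarnings("unchecked")
    ChainedTable(int b) {
        bins = new LinkedList[b];
        for (int i = 0; i < b; i++) bins[i] = new LinkedList<>();
    }

    private int bin(String key) {
        return (key.hashCode() & 0x7fffffff) % bins.length;
    }

    void add(String key) {
        bins[bin(key)].add(key);
        n++;
    }

    boolean search(String key) {
        for (String s : bins[bin(key)])       // list length averages α = n/b
            if (s.equals(key)) return true;
        return false;
    }

    double loadFactor() { return (double) n / bins.length; }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(8);
        t.add("apple");
        t.add("banana");
        System.out.println(t.search("apple") + " " + t.loadFactor());
    }
}
```

Keeping α small (by choosing b large relative to n) is what makes the average search cost effectively constant.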
Table 5-5 compares the performance of the code from Example 5-8 with the
existing JDK class java.util.Hashtable on hash tables of different sizes. For the
tests labeled p = 1.0, each of the 213,557 words is used as the target item to ensure
that the word exists in the hash table. For the tests labeled p = 0.0, each of these
words has its last character replaced with a * to ensure that the word does not
exist in the hash table. Note also that we keep the size of the search words the
same in both cases to ensure that the cost of computing the hash is identical.
We ran each test 100 times and discarded the best- and worst-performing
trials. The average of the remaining 98 trials is shown in Table 5-5. To under-
stand these results, we produce statistics on the hash tables we create, shown in
Table 5-6. As the load factor goes down, the average length of each element linked
list also goes down, leading to improved performance.
As the size of the hash table approximately doubles, the time to find an item
decreases, because the linked lists of elements are shorter. Indeed, by the time
b = 1,045,875, no linked list contains more than five elements. Because a hash table
can typically grow large enough to ensure that all linked lists of elements are
small, its search performance is considered to be O(1). However, this is contin-
gent (as always) on having sufficient memory and a suitable hash function to
disperse the elements throughout the bins of the hash table. About 81% of the
213,557 words are mapped to unique hash table bins when b = 1,045,875 using the
hash function from Example 5-6.
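A dispersion statistic like that 81% figure could be gathered along the following lines. This is a hedged sketch: the hash and word list below are placeholders, not the book's Example 5-6 function or its 213,557-word corpus.

```java
// Count the fraction of words that end up alone in their bin, a simple
// measure of how well a hash function disperses a word list over b bins.
public class Dispersion {
    static int bin(String s, int b) {
        return (s.hashCode() & 0x7fffffff) % b;   // placeholder hash, not Example 5-6
    }

    static double uniqueFraction(String[] words, int b) {
        int[] counts = new int[b];
        for (String w : words) counts[bin(w, b)]++;
        int unique = 0;
        for (String w : words)
            if (counts[bin(w, b)] == 1) unique++; // word is alone in its bin
        return (double) unique / words.length;
    }

    public static void main(String[] args) {
        String[] words = {"a", "b", "c"};
        System.out.println(uniqueFraction(words, 1000));
    }
}
```

Applied to the full word list with b = 1,045,875 and the book's hash function, this kind of count is what yields a figure like 81%.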
Algorithms in a Nutshell, by Gary Pollice, George T. Heineman, and Stanley Selkow. O'Reilly Media, Inc. ISBN 9780596516246.
Print Publication Date: 2008/10/21