Hash-based Search | 121
Searching
For our first try at this problem, we choose a primary arrayAthat will hold
b=2^18 –1=262,143 elements. Our word list contains 213,557 words. If our hash func-
tion perfectly distributes the strings, there will be no collisions and there will be only
about 40,000 open slots. This, however, is not the case. Table 5-4 shows the distribu-
tion of the hash values*for the JavaStringclass on our word list with a table of
262,143 slots. As you can see, no slot contains more than seven strings; for non-
empty slots, the average number of strings per slot is approximately 1.46. Each row
shows the number of slots used and how many strings hash to those slots. Almost
half of the table slots (116,186) have no strings that hash to them. So, this hashing
function wastes about 500KB of memory—assuming that the size of a pointer is four
bytes and that we don’t fill up the empty entries with our collision-handling strategy.
You may be surprised that this is quite a good hashing function and finding one with
better distribution will require a more complex scheme. For the record, there were
only five pairs of strings with identical key values (for example, both “hypoplankton”
and “unheavenly” have a computed key value of 427,589,249)!
Finally, we need to decide on a strategy for handling collisions. One approach is
to store a pointer to a list of entries in each slot of the primary hash table, rather
than a single object. This approach, shown in Figure 5-6, is calledchaining.
The overhead with this approach is that you have either a list or the valuenil
(representing “no list”) in each slot. When a slot has only a single search item, it
must be accessed using the list capability. For our first approximation of a solution,
we start with this approach and refine it if the performance becomes an issue.
Forces
Choosing the hashing function is just the first decision that must be made when
implementing HASH-BASEDSEARCH. Hashing has been studied for decades, and
there are numerous papers that describe effective hash functions, but they are
used in much more than searching. For example, special types of hash functions
are essential for cryptography. For searching, a hash function should have a good
distribution and should be quick to compute with respect to machine cycles.
* In this chapter, thehashCodemethod associated with each Java class represents thekeyfunction
described earlier. Recall thathash(s)=key(s)%b.
Table 5-4. Hash distribution using Java String.hashCode( ) method as key with b=262,143
Number of hits Number of slots
0 116,186
1 94,319
2 38,637
3 10,517
4 2,066
5 362
653
73
Algorithms in a Nutshell
Algorithms in a Nutshell By Gary Pollice, George T. Heineman, Stanley Selkow ISBN:
9780596516246 Publisher: O'Reilly Media, Inc.
Prepared for Ming Yi, Safari ID: [email protected]
Licensed by Ming Yi
Print Publication Date: 2008/10/21 User number: 594243
© 2009 Safari Books Online, LLC. This PDF is made available for personal use only during the relevant subscription term, subject to the Safari Terms of Service. Any other use