Algorithms in a Nutshell

Hash-based Search | 121

Searching

For our first try at this problem, we choose a primary arrayAthat will hold b=2^18 –1=262,143 elements. Our word list contains 213,557 words. If our hash function perfectly distributes the strings, there will be no collisions and there will be only about 40,000 open slots. This, however, is not the case. Table 5-4 shows the distribution of the hash values*for the JavaStringclass on our word list with a table of 262,143 slots. As you can see, no slot contains more than seven strings; for non- empty slots, the average number of strings per slot is approximately 1.46. Each row shows the number of slots used and how many strings hash to those slots. Almost half of the table slots (116,186) have no strings that hash to them. So, this hashing function wastes about 500KB of memory—assuming that the size of a pointer is four bytes and that we don’t fill up the empty entries with our collision-handling strategy. You may be surprised that this is quite a good hashing function and finding one with better distribution will require a more complex scheme. For the record, there were only five pairs of strings with identical key values (for example, both “hypoplankton” and “unheavenly” have a computed key value of 427,589,249)!

Finally, we need to decide on a strategy for handling collisions. One approach is to store a pointer to a list of entries in each slot of the primary hash table, rather than a single object. This approach, shown in Figure 5-6, is calledchaining. The overhead with this approach is that you have either a list or the valuenil (representing “no list”) in each slot. When a slot has only a single search item, it must be accessed using the list capability. For our first approximation of a solution, we start with this approach and refine it if the performance becomes an issue.

Forces

Choosing the hashing function is just the first decision that must be made when implementing HASH-BASEDSEARCH. Hashing has been studied for decades, and there are numerous papers that describe effective hash functions, but they are used in much more than searching. For example, special types of hash functions are essential for cryptography. For searching, a hash function should have a good distribution and should be quick to compute with respect to machine cycles.

* In this chapter, thehashCodemethod associated with each Java class represents thekeyfunction described earlier. Recall thathash(s)=key(s)%b.

Table 5-4. Hash distribution using Java String.hashCode( ) method as key with b=262,143

Number of hits Number of slots 0 116,186 1 94,319 2 38,637 3 10,517 4 2,066 5 362 653 73

Algorithms in a Nutshell
Algorithms in a Nutshell By Gary Pollice, George T. Heineman, Stanley Selkow ISBN:
9780596516246 Publisher: O'Reilly Media, Inc.

Prepared for Ming Yi, Safari ID: [email protected]
Licensed by Ming Yi
Print Publication Date: 2008/10/21 User number: 594243
© 2009 Safari Books Online, LLC. This PDF is made available for personal use only during the relevant subscription term, subject to the Safari Terms of Service. Any other use

Algorithms in a Nutshell

Get our desktop app

Company

Features

Documentation

Resources