Lesson AB32 - Hash-Coded Data Storage

Lesson AB32 - Hash-Coded Data Storage

Title Page >

Summary >

Lesson A1 >

Lesson A2 >

Lesson A3 >

Lesson A4 >

Lesson A5 >

Lesson A6 >

Lesson A7 >

Lesson A8 >

Lesson A9 >

Lesson A10 >

Lesson A11 >

Lesson A12 >

Lesson A13 >

Lesson A14 >

Lesson A15 >

Lesson A16 >

Lesson A17 >

Lesson A18 >

Lesson A19 >

Lesson A20 >

Lesson A21 >

Lesson A22 >

Lesson AB23 >

Lesson AB24 >

Lesson AB25 >

Lesson AB26 >

Lesson AB27 >

Lesson AB28 >

Lesson AB29 >

Lesson AB30 >

Lesson AB31 >

Lesson AB32 >

Lesson AB33 >

Vocabulary >

page 3 of 8

The example of distributing student schedules illustrates a natural means of hashing. The process can be simplified even further by organizing the schedules into piles by the first letter of the last name.
A hashing scheme involves converting a key piece of information into a specific location, thus reducing the amount of searching in the data structure. Instead of working with the entire list or tree of data, the hashing scheme tells you exactly where to store that data or search for it.
One important bar code system used by retail stores is the UPC A code¹, which involves a sequence of 10 digits. This system provides for 10 billion different possible products, 0 - 9,999,999,999 (which equals 10¹⁰ - 1). For quick access, an array of 10 billion locations would be nice, but wasteful in terms of computer memory. Since it is unlikely that a store would carry such a huge number of items, a system is needed to store a list of products in a reasonably sized array.
A cash register using a bar code scanner needs a very quick response time when an item is scanned. The 10-digit bar code is read, the item is searched for in the store's database, and the price is returned to that register. While searching algorithms of the order (log2N) are relatively fast, we may want an even faster algorithm.
Suppose hypothetically that a store maintains a database of 10,000 bar codes out of the possible 10 billion different values. The values are stored in an array called a hash table. Because an array has direct random access to every cell, using a hashing scheme will give much faster access to the desired item. The hash table is usually sized about 1.5 to 2.0 times as big as the maximum number of values stored. (The reason for this sizing will be readily apparent.) Therefore, the store will need an array with about 15,000 locations.
The hashing scheme tells us where item XXXXX XXXXX is stored. A hashing algorithm is a sequence of steps, usually mathematical in nature that converts a key field of information into a location in the hash table.
These “key-to-address transformations” are called hashing functions or hashing algorithms. When the key is composed of character data, a numerical equivalent such as the ASCII code is used so that the mathematical processing can take place. Bar codes are numbers, so conversion is not necessary. Some common hash functions:
1. Division. The key is subject to integer modulus (often a prime) equal to or slightly smaller than the desired size of the array. The result of the division determines which short list to work with in the hash table.
2. Midsquare. The key is squared and the digits in the middle are retained for the address. This probably would not work well with bar codes because they are such large numbers.
3. Folding. The key is divided into several parts, each of which is combined and processed to give an address. For example:
  
  If the bar code = 70662 11001
  
  1) group into pairs: 70 66 21 10 01
  
  2) multiply the first three numbers together:
  
  70 x 66 x 21 = 97020
  
  3) add this number to the last two numbers:
  
  97020 + 10 + 01 = 97031
  
  4) find the remainder of modulo division by 14983 (the largest prime less than 15000):
  
  97031 % 14983 = 7133
  
  5) address 7133 is the location to store bar code 70662 11001
  
  6) In the address 7133 will be stored all the fields related to this item, such as price and name of the item.
It is important to develop a good hashing function that avoids collisions in the hash table. Even when using a prime number for the divisor, it is possible for two bar codes to result in the same address, e.g. 7133. To reduce the chances of such a “collision,” the hash table is sized about 1.5-2.0 times the number of expected items. If the hash table in our example were sized at 10,000 (the number of items in the database) the likelihood of collisions would be increased. Try to balance the need for decreasing the number of collisions against memory limitations, hence the recommended sizing.
This advance sizing of the hash table affects the mathematics of the hashing algorithm; therefore the programmer must have a very clear idea of the number of items to be stored. The number of items must be known in advance and this number must be fairly constant during the life of the program. This limits the use of hashing to certain situations. If the number of items is unknown or varies greatly, hashing is inappropriate.

¹ Samples of bar code graphics and an explanation of how to read bar codes may be found at http://www.adams1.com/pub/russadam/upccode.html

	Contact

	© ICT 2006, All Rights Reserved.