
All too often, during the data manipulation stage of a database extraction process, we find that there is an array of data, which has to be scanned repeatedly for every step in an outer loop. When the amount of data is small, it doesn’t really matter. But if we’re comparing each of twenty million telephone subscribers against an array of a thousand products, it can take a long time.
By Mark Sitkowski C.Eng, M.I.E.E Design Simulation Systems Ltd
Simplistically speaking, a hash table is a random access matrix of variables and values, where the variables are also the index into the matrix of values.
Instead of saying ‘for(I = 0; I < 1000; I++)….’ , and waiting for the sought after value to fly by, we can simply ask the hash table for the value corresponding to the given variable.
Unix has four hash table manipulation functions:
• hcreate(length)
Allocates space for a hash table of size ‘length’ elements
• hsearch(key, ENTER)
Makes a hash table entry, for variable ‘key’
• hsearch(key, FIND)
Retrieves an entry, described by ‘key’, from the hash table
• hdestroy(void)
Deletes a hash table
The data type of ‘key’ is defined by the typedef ENTRY, in , which must be included if we want to use the hash table functions.
struct entry { char *key; char *data; };
Since hash tables can be used with arrays of complex data structures, the pointers to char are an indication of how the hash table is implemented.
A hash table only stores hash values and pointers to the original data. This data must persist, with unchanged keys and addresses, throughout the life of the table. The data itself can change and the pointer will still correctly retrieve it. But its address must be constant.
For obvious reasons, the key must be unique, so the same criteria must be applied to its choice as is applied when choosing the primary key to a database table. If the target array comprises the rows of a table, which has been extracted to memory, the primary key of the table is an obvious choice. Though, there is one restriction, on the choice of a key. The hash table comparison function is strcmp(), which means that the key has to be an ASCII string, and numbers have to be represented by their ASCII values. However, the overhead of the extra sprintf() is negligible compared with the saving of time, especially when scanning a huge array.
By way of an example, suppose that we have an array of 1000 data structures, each of which describes a product, as per:
struct product { char product_code[10]; char colour[25]; float length; int style; }; struct product products[1000];
ENTRY hhash;
Early in our code, we create the 1000 element hash table, like this:
if(hcreate(1000) == 0){ printf(“Can’t allocate memory for hash table\n”); exit(-1); }
As we create or load the array of product structures, we add an extra step to make the hash table entries:
for(I = 0; I < 1000; I++){ get_product_data(products[I]); hhash.key = (void *)product[I].product_code; hhash.data = (void *)&products[I]; if(hsearch(hhash, ENTER) == NULL){ printf(“Hash table full\n”); } /* do other stuff… */ }
Sometime later, we do some processing in another loop, and need to find the color corresponding to a product code:
char * pc; struct product temp; for(j = 0; j < 20000000; j++){ /* * process a lot of customer data */ strcpy(pc, customer_product[j[.product_code); /* Now, we need the colour */ hhash.key = (void *)pc if((temp = hsearch(hentry, FIND)) == NULL) { printf(“No such entry\n”); } else { strcpy(customer_product[I].colour, ((struct product *)temp->data)->colour). } }
The alternative to using a hash table would have been an inner loop, making up to 1000 iterations for every iteration of the outer loop.
0 responses on "The Design of Hash Tables for Fast Retrieval of Data"