CMSC 12200 - Hash Tables - Februrary 2, 2011

CMSC 12200 - Hash Tables

Goals

Today we will build and analyze hash tables. This is great because

Hash tables are a magical data structure
You have to build them for your homework anyway

Hash Tables

Abstractly a hash table is a data structure that can do the following

Add an element in constant time (i.e. fast.)
Lookup an element from a key in constant time
Grow Dynamically

This means that it can grow (like a linked list) but can access any element in it semi-instantly (like an array) without traversing through a list (which is slow). Hash tables somehow have the best of both worlds and are appropriately used everywhere. Today we're going to learn how to build them. You've already used hash tables in Python. They were the data-structure behind the dict object. You could give a dict a key and a value to store. Later you could ask for the value of a specific key. The dict then retreived that value very very quickly (although we didn't talk about speed at this point in the class). Conceptually for the homework we'll want to be able to do something like the following:


ht = mkHashTable()
addToHashTable(ht, 'MSFT', microsoftStockObject)
addToHashTable(ht, 'GOOG', googleStockObject)
msStock = getFromHashTable(ht, 'MSFT')

And, because stock trades happen extremely quickly these operations will need to be fast.

How Hash Tables Work

Concretely a Hash Table is an array of very short unsorted linked lists (like the image on the right) combined with a hash function.

What is a Hash Function?

A hash function takes a key (in our case a string) and turns it into a number. A very simple hash function on strings might take the last letter of the string and use its ASCII value (i.e. hash("hello")==111 because 'o'==111). This isn't a very good hash function because it has a very limited range (0-255) and only uses a little piece of the information to determine the number. Another pretty standard hash function on strings is written in pythonesque pseudo-code below:


def hash(string, maxVal):
  total = 0
  prime = 31 # a prime number of my choice
  for letter in string: #go down the letters in the string
    total = total * prime + int(letter) #multiply the total by the prime, add the next letter value
  total = total % maxVal #clip the number at some maximum size
  return total

How to Add and Lookup an Element

The main two operations of a Hash table are Add and Lookup. You could add other functionality like length/numberOfElements, isEmpty, etc....

To Add an Element: Hash the key to obtain an integer. Use this integer as an index to the array. Add the element to the linked list stored at that index.

To Lookup a key: Hash the key to obtain an integer. Use this integer to find the appropriate linked list in the array. Search down this (short) linked list to find the element which has the correct key.

Why do Hash Tables Work so Well?

In an ideal Hash Table the linked lists are very short (1-10 entries) and only a few are unused and empty (null). When we add or lookup an element we depend on the Array's fast lookup time to find the right linked list. Behind each index of the array we have a "Bucket" or unsorted-linked-list of possible elements to search through. We depend on these small linked lists to make our data structure dynamic (growable) while still being easy/fast to look through. To maintain speed we must balance the size of the array.

If the array is too small then the linked lists will be very long and lookups will start to take a long time as they traverse these long linked lists
If the array is too large then the hash table might become to big to fit in memory. It is full of wasted empty linked lists.

Sophisticated Hash Tables can resize themselves if they detect that they have too many or two few entries for their current size. We won't deal with this feature for this assignment.

In the ideal situation each "bucket" or entry in the Array would have one or two elements. In the worst situation one bucket has all the elements and the other buckets are null.

Write Some Code

Make a new source file, hashTable.c (or whatever name you like). You'll build how hash tables work in this file. You may want to look at stock.c for an idea of how things are usually organized.

For your assignment you need to write a Hash Table to store stocks. You will use the stock's ticker symbol as the key. I.e. we should be able to ask the hash table to return the stock with ticker symbol (key) 'MSFT'.

You will need to write a very simple node struct to string stock_t's together.

You will need to write a simple-ish HashTable struct which is, as discussed above, simply an array of linked lists.

You will need to write a hash function from strings to numbers (an example is above but several others exist.) Make sure your function is relatively balanced between any range of numbers (we'll need to fit this range to the size of the array so that the indices match up). For the purposes of this assignment use a fixed-size hash table. Something in the tens of thousands is good for our purposes.

Then you'll need to write the logic to add and lookup stocks.

Write some very basic testing code in a separate file (see testorder.c or stockTest.c for examples). A basic test might be to make a few stocks by hand, throw them into the hash table and then make sure you can extract them. You can also test your hashTable code with a main() directly in the original .c file. The drawback is that when you try to use hashTable.c in some larger program there will be two main() functions which will confuse the computer.

Organizing Larger Projects: .h Files and Makefiles

As projects grow it's a good idea to separate work into smaller tasks. For large projects organization is probably the most important part of getting things done. Personally I (matt) hate organization, and as a result I've wasted a lot of my time tracking down very frustrating problems.

Header (.h) Files

In your hw3 folder you'll notice that there are both .c and .h files. The .c files contain the logic of how things work. The .h files contain the information someone else needs in order to use your code. The definitions and functions contained in the .h file are like the public methods in Java. Other people shouldn't have to look through and understand all of your code, they should only have to read the .h file. It is more important to document the .h file than the .c file (although of course you should document both).

After you construct a working hash table decide what should go into the .h file. As an example the hash function probably should not but the ability to make and add to hash tables probably should.

There is some annoying syntax involved in starting a .h file. I almost always copy from an existing .h file. Be sure to get the #ifndef stuff at the beginning and the #endif stuff at the end.

Makefiles

There are lots of files in this weeks project. Some of the code depends on other parts of the code (for example stock.c requires that order.c works) and there are lots of potential executables that could be made (for example each of stockTest.c and testOrder.c have a main() function for testing purposes).

We can organize all of the dependencies and potential executables in a Makefile and use the unix command make to compile and run them. Look at the Makefile in your hw3 directory. The following is two generic lines and then two example lines from the Makefile:


target: dependencyOne dependencyTwo
  unix command to run
master: order.c util.c book.c stock.c venue.c master.c
  gcc order.c util.c book.c stock.c venue.c master.c -o master

(Note: you must have a tab before the unix command on the second line. The need for tab was a bad design choice, but that is how it works.) When we type "make master" into the command line it checks that all of the dependencies "order.c, util.c, book.c, stock.c, venue.c and master.c" are available and then it runs the command on the second line,

gcc order.c util.c book.c stock.c venue.c master.c -o master

which compiles the code into an executable. So far this is just a complicated way of replacing

gcc order.c util.c book.c stock.c venue.c master.c -o master

with

 make master

So far this is convenient but not really worth all the trouble to set up. Make becomes important when the dependencies go beyond just source files. For example the following two lines also appear in the Makefile


t2: master
  ./master < data/t2.txt

When we type "make t2" it first makes master (because it is listed as a dependency) and then runs the command

./master < data/t2.txt

As projects grow in complexity handling this organization becomes important.

Using Makefiles

Trivial Task: Use make to perform a trivial unix command. For example change your Makefile so that typing "make list" executes the unix "ls" command. There should be no dependencies.

Substantial Task: Edit the Makefile to compile a hashTest program. This will be similar to testOrder and stockTest. Make this target and ensure that things work.