Chapter 1
Introduction
At the lowest level, computers represent data as sequences of bits (0 or 1). The normal way to represent a message as a sequence of bits is to use a table that associates bit patterns with characters and then translate each letter the message according to the table. The standard table that many computers use is called the ASCII table, which represents every character on the keyboard (and a few more besides) as a sequence of exactly 8 bits. Here is one portion of the ASCII table:
|
In ASCII, the message MISSISSIPPI would be represented like this:

If you save the word MISSISSIPPI in DrScheme, that sequence of bits is how it will be written out in the saved file.
There are a number of advantages to representing messages with ASCII, but it is not particularly good for generating short encodings of particular messages. In situations where we really need messages to be short (maybe because we want to transmit a message quickly across a network, or save it on a disk that doesn't have much space left) we can often do dramatically better. The message MISSISSIPPI, for instance, doesn't use most of the letters of the alphabet at all, so an encoding scheme that didn't let us write those letters down at all would be fine. Furthermore, it uses I and S four times each, but P only twice and M only once: for that reason, it would be a good trade to use an encoding table that had short representations for I and S and longer representations of P and M. The following alternative encoding produces a much shorter encoding for the message MISSISSIPPI:
|

While ASCII needs 88 bits, the alternative encoding needs just 21.
The goal of this lab is to implement an algorithm called Huffman coding that determines the best encoding table for a particular message, and then encodes or decodes messages according to that table. As a demonstration of the technique's practical application, you will use it to write a program that compresses and decompresses files.
For this lab, you will need to use the following teachpack:
Huffman coding is named after its inventor, David Huffman (1925-1999). He invented it in 1951 as a final project for a class he was taking -- his instructor listed it as a possible paper topic without mentioning that it was a major unsolved problem at the time!