CMSC 15100 - Autumn 2005
Lab 6

Save a copy of this lab for yourself

0. Huffman Codes

Donald Knuth has called huffman coding one of the fundamental ideas in Computer Science for its ubiquitous role in coding and compression algorithms used in data communication and data storage. Many of the most popular compression methods, such as ZIP, JPEG and MP3 make use of huffman coding at some point. Huffman coding was discovered by David Huffman, as a graduate student at MIT in 1951. He and his classmates were given the choice of a term paper or a final exam. The term paper was on the problem of finding the most efficient binary code. Huffman found himself unable to show any of the known codes were most efficient, and was just about to give-up when he had an epiphany and the solution to the problem came to him

"It was the most singular moment of my life," Huffman says. "There was the absolute lightning of sudden realization."

A binary code is a way of representing data (called a message), such as a picture or text or music, as a sequence of 0s and 1s (bits.) For example, ASCII code represents text characters using seven bits. Pictures can be represented by lists of positions and colors (the color of each pixel), and colors can be represented using 24 bits to represent the three color components: red, green and blue.

For example, suppose we are coding messages using only the letters A through H. We can choose a code using three bits for each of the eight possible characters:

A 000 C 010 E 100 G 110
B 001 D 011 F 101 H 111


With this code, the message

BACADAEAFABBAAAGAH

is encoded as a sequence of 54 bits

001000010000011000100000101000001001000000000110000111

Codes such as ASCII and the A-through-H code above are known as fixed-length codes, because they represent each symbol in the message with the same number of bits.

David Huffman's basic insight was to use variable-length codes, in which different symbols may be represented by different numbers of bits. If some symbols appear more frequently in the message, then we can encode the data using fewer bits if we assign shorter codes to these symbols. In the above message the letter A occurs 8 times, B occurs 3 times and all other letters occur once. Using this information the huffman coding procedure assigns the following codes to each letter:

A 0 C 1010 E 1100 G 1110
B 100 D 1011 F 1101 H 1111


Notice that A has the shortest code, B next shortest, and all other letters have the same length code. With this code the same message as above is encoded by the sequence

100010100101101100011010100100000111001111

This string contains 42 bits, so it saves more than 20% in space in comparison with the fixed-length code A-through-H.

1. Code Trees

This section will introduce you the data structure, a code-tree, that we will use to create codes for each symbol occuring in the example message above. We will use these trees to encode and decode messages using these symbols.

1a. Code Tree Data Structure



;; A leaf is a
;; (make-leaf symbol number)
(define-struct leaf (symbol weight))

;; A code-tree is either a
;; --leaf, or
; (make-code-tree (listof symbols) number
;          code-tree
;          code-tree)
(define-struct code-tree (symbols weight left right))

;; and where the following two invariants hold
;;       (a) For each node, the sum of the weights of the left subtree
;; plus the sum of the weights of the right subtree is
;; the weight of the node.
;;       (b) For each node, the symbols in the list of symbols of the node are
;; all of the symbols occurring in leaves of the left subtree
;; and all of the symbols occurring in leaves of the right subtree.


(This is similar to the use of invariants in the data definition of binary-search-tree.) For example, the following is a code-tree

(make-code-tree '(A B) 2
    (make-leaf 'A 1)
    (make-leaf 'B 1))


1b. Basic Algorithm for Building Code Trees

The basic algorithm for building a code-tree starts with a list of leaves and outputs a code-tree created from these leaves. The leaves are created from our data about the symbol frequency in the message, but for now we will concentrate on building the code-tree from the list of leaves. We will use this algorithm to write a Scheme function for building code-trees in Part 4.

  1. Create a list of leaves.
  2. Merge code-trees on your list (as follows) until there is one code-tree remaining on the list:
    1. Choose two code-trees with smallest weight and remove from the list.
    2. Create a new code-tree as follows:
      • Let the left and right subtrees be the two code-trees you removed from the list.
      • Let weight be the sum of the weights of the left and right subtrees.
      • Let symbols be the list of symbols you get by appending the list of symbols in the left subtree with the list of symbols in the right subtree.
    3. Insert this new code-tree on your list (Notice that your list of code-trees has shrunk by one.)
  3. Return the remaining code-tree on your list.

In step 1 we start-off with a list of leaves, which are code-trees. In step 2b we produce a code-tree provided the two items chosen in 2a were also code-trees. Together, these imply that the remaining item on our list in step 3 really is a code-tree.

1c. A fixed-length code-tree

Follow the algorithm from Part 1b and create a code-tree with the following leaves:

(list
(make-leaf 'A 1)
(make-leaf 'B 1)
(make-leaf 'C 1)
(make-leaf 'D 1)
(make-leaf 'E 1)
(make-leaf 'F 1)
(make-leaf 'G 1)
(make-leaf 'H 1))


Notice that in the code-tree you create, the depth of each leaf (the number of make-code-trees that have that leaf as a subtree) is always the same, 3. I have included the code-tree tree on your template, tree-1.

1d. A variable-length code-tree

Follow the algorithm from Part 1 and create a code-tree with the following leaves:

(list
(make-leaf 'A 8)
(make-leaf 'B 3)
(make-leaf 'C 1)
(make-leaf 'D 1)
(make-leaf 'E 1)
(make-leaf 'F 1)
(make-leaf 'G 1)
(make-leaf 'H 1))


The weight of each leaf is the frequency that the symbol occurs in our original message:

BACADAEAFABBAAAGAH

from Part 0. Notice that the depth of the leaf containing the symbol 'A (the number of make-code-trees that have the leaf with this symbol as a subtree) is only one, the depth of 'B is three, and the depth of each other symbol is four. I have included the code-tree tree on your template, tree-2.

2. Structural vs. Generative Recursion

Several of the functions you will write will be based upon generative recursion.

Functions written using structural recursion are data-directed, your analysis of the data the function has as input drives the organization of your function through templates. The recursive structure of the data determines what new subproblems need to be solved and what the immediately solvable subproblems are.

Functions written using generative recursion are algorithm-directed, you must start with an algorithm for generating solutions to determine what new subproblems need to be solved and which subproblems are immediately solvable. The examples you give play a crucial role in your writing the function body: by working through simple examples by hand you are led to an understanding of the of how new subproblems are generated and combined to produce a solution to your original problem.

There are several questions you need to ask yourself as you set out to write the body of your function

Do I have recursive data elements in the input of my function?
If you do, you may need to use structural or generative recursion to write your function. In this case you need to decide which kind of recursion will guide the construction of your function
Do I solve a given problem by solving subproblems generated by the recursive structure of some of the input data elements? If this is so, then you are likely going to use structural recursion. You need to decide which input elements drive the recursion, and produce an appropriate template for the body of your function.
If the new subproblems you have generated are not generated by the recursive structure of some of the input data elements, then you will have to use generative recursion. In this case, there are three questions you need to answer:
  1. What are the immediately solvable cases of the problem at hand?
  2. How can I break the current problem into smaller subproblems whose solutions I can use to solve the current problem?
  3. How do I combine the solution to the smaller subproblems to provide a solution to the current problem?
As you answer these questions you will have at hand more details for creating the body of your function. HTDT provides the following guide .

3. Encoding and Decoding Messages

We can encode and decode messages using the code-trees we construct using the algorithm in Part 1b. Here is a pictorial representation of the code-tree we generated in Part 1d, tree-2:


3a. Encoding

Encoding is the process of going from the message to a binary coding of the message. Because of invariant (b), you can use the list of symbols, symbols, to follow the branches of the code-tree until you arrive at a leaf on which the symbol occurs. Each time you go down the left child, add a 0 to your code; each time you go down the right child add a 1 to your code. For example, given the pictured tree representation of tree-2 the coding of each symbol is

'A 0 'C 1010 'E 1100 'G 1110
'B 100 'D 1011 'F 1101 'H 1111


The depth of the leaf with a given symbol corresponds to the length of the code assigned to that symbol.

Write the following function

;; A bit is either 0 or 1
;; encode: (list-of symbols) code-tree -> (list-of bits)
;; encode msg using ctree
;; (define (encode msg ctree) ... )


Hint: You can write this function using structural recursion.

Test encode using your tree-1 and tree-2 from Part 1.

3b. Decoding

Decoding is the process of retrieving a message given a binary coding of the message. To decode we start with a (list-of bits) and our code-tree. We are trying to find the next symbol using the bits. If our code-tree has subtrees, then we use the next bit to select:

If our code-tree is a leaf, we have our next symbol. If there are remaining bits on the list we search for our next symbol.

Write the following function

;; A bit is either 0 or 1
;; decode: (list-of bits) code-tree -> (list-of symbols)
;; decode string (of bits) using ctree
;; (define (encode string ctree) ... )


Hint: You can write this function using structural recursion on several input data elements.

Test decode using your code-trees from Part 1.

4. Building Code Trees

In Part 1b I gave an algorithm for building a code-tree from a list of leaves. Now, you will write a Scheme function for implementing this algorithm and actually use the computer to build the code-trees from Part 1c and 1d.

4a. Frequency Lists

We need to build a (list-of leaves) from information about the frequency of each symbol.

;; An fpair is a
;; (make-fpair symbol weight)
(define-struct fpair (symbol weight))


Write the following function

;; create-leaves: (list-of fpair) -> (list-of leaf)
;; Create a list of leaves from list of fpairs
;; (define (create-leaves fpairs) ...)


You can test your function on freq-1 and freq-2 given on the template.

4b. build-code-tree

You will write a Scheme function for implementing the algorithm in Part 1b. The first step, creating the list of leaves from data about the frequency of each symbol is performed by create-leaves from Part 3a.

;; build-code-tree: (list-of s-w-p) -> code-tree
;; return a code-tree based on fpairs giving the frequency of each symbol.
;; fpairs is sorted in increasing order of weight
;; (define (build-code-tree fpairs) ... )


You can test build-code-tree on freq-1 and freq-2 found in the template. Your output from (build-code-tree freq-1) and (build-code-tree freq-2) may not exactly match the examples tree-1 and tree-2. But the code generated from each symbol in your output trees must have the same length as these examples. The depth of the leaf for a given symbol must be the same in your code-tree as the depth in the example.

Hint: Writing a helper function for step 2 is the heart of the problem.

;; merge-trees: (list-of code-trees) -> (list-of code-trees)
;; Merge alocts as in step 2 of the algorithm in Part 1b
;; until alocts contains only one code-tree
;; (define (merge-trees alocts)...)


Will this require structural or generative recursion?