Huffman Coding

To gain more insight about greedy algorithms, let's look at another problem that is solvable by a greedy algorithm.

Huffman codes represent a way of compressing data effectively. Data is considered to be a sequence of characters. Huffman's greedy algorithm uses a table with the frequency of each character to build up an optimal way of representing each character as a binary string. 

To illustrate this, imagine we have a 1,00,000 character data file that we wish to store in a compressed fashion. The frequency of each character in the data file is given by the following table:

Character a b c d e f
Frequency 45,000 13,000 12,000 16,000 9,000 5,000
Table 4.3: Frequency of each character in a data file

There are various ways to represent this information. For the purpose of this problem, let's say that we want to design a binary character code in which each character is represented by a unique binary string (which we shall call a code word). One option is to use a fixed-length code (for example, each character is represented by a code word of the same size). If we opt for that, we need three bits to represent each the six characters, as shown in the following table:

Character a b c d e f
Frequency 45,000 13,000 12,000 16,000 9,000 5,000
Code Word 000 001 010 011 100 101
Table 4.4: Code word for each character

Using this method, we need 3,00,000 bits to code the entire sequence of characters. Can we do better?

A variable-length code can do a lot better than a fixed-length code. Since we want to minimize the size of the compressed sequence of bits, we want to give short code words to frequent characters and long code words to infrequent characters. A possible code for this character sequence is shown in the following table:

Character a b c d e f
Frequency 45,000 13,000 12,000 16,000 9,000 5,000
Code Word 0 101 100 111 1101 1100
Table 4.5: Possible code for character sequence

Instead of 3,00,000 bits, this code requires only 2,24,000 bits to represent the character sequence. Using this code, we save around 28% of space. The code we have presented is also an optimal character code for this sequence, as we shall see.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.24.30