Huffman coding is one of the oldest methods of compressing data and is based on creating a Huffman tree, which is used to both encode and decode data. Huffman coding can represent data content in a more compact form by exploiting the fact that some data (for instance certain characters of the alphabet) appears more frequently in a data stream. By using encodings of different length (shorter for the most frequent characters and longer for the least frequent ones), the data consumes less space.
Now, let's learn a few terminologies related to Huffman coding:
- Coding: Coding in the context of data represents the method of representing data from one form to another. We would like the resultant form to be concise.
- Codeword: A particular character in encoded form is called a codeword.
- Fixed-length coding: This is when each encoded character, that is, codeword, uses the same number of bits.
- Variable-length coding: Codewords can use a different number of bits.
- Evaluation of code: This is the expected number of bits per codeword.
- Prefix free codes: This means that no codeword is a prefix of any other codeword.
- Decoding: This means that a variable-length code must be free from any prefix.
For understanding the last two terms, you need to have a look at this table first:
character | frequency | fixed length code | variable length code |
L | .45 | 000 | 0 |
M | .13 | 001 | 101 |
N | .12 | 010 | 100 |
X | .16 | 011 | 111 |
Y | .09 | 100 | 1101 |
Z | .05 | 101 | 1100 |
Now, we can infer the following:
- Fixed length code: The fixed-length code for this table is 3.
- Variable length code: The variable-length code for this table is 45(1) + .13(3) + .12(3) + .16(3) + .09(4) + .05(4) = 2.24.
The following diagram shows the Huffman tree created from the preceding example:
Note that Huffman encoding is about converting data into a Huffman tree that enables compression. Decoding or decompression brings the data back to the original format.