In 1948, Claude Shannon founded the field of information theory with his paper, “A Mathematical Theory of Communication” [5]. In it, he defined the entropy H of a message as

$$H = -\sum_i p_i \log_2 p_i$$
where $p_i$ is the probability of the $i$th character occurring in the message. This probability is easily calculated by counting the number of times each character appears:

$$p_i = \frac{\text{count}_i}{n}$$

where $\text{count}_i$ is the number of occurrences of the $i$th character and $n$ is the total number of characters in the message.
When entropy H is calculated as above, using log base two, it measures the average number of bits¹ per character required to communicate the given message.
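The definition translates directly into a few lines of Python. This is a sketch; the function name `entropy` is my own, built from `collections.Counter` and the `math` module’s `log()`:

```python
from collections import Counter
from math import log

def entropy(message):
    """Average number of bits per character needed to encode message."""
    n = len(message)
    counts = Counter(message)            # occurrences of each distinct character
    # H = -sum of p_i * log2(p_i) over every distinct character
    return -sum((c / n) * log(c / n, 2) for c in counts.values())
```

For example, `entropy("ee")` is 0: a message containing only one distinct character carries no information at all.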
Intuitively, in a string of 26 e’s alternated with 26 f’s, neither character carries very much information, because they act so similarly. To calculate the entropy of this string, we first calculate the probabilities of each character. The probability of an “e,” $p_e$, is:

$$p_e = \frac{26}{52} = \frac{1}{2}$$
and the probability of an “f” is exactly the same. Thus, the entropy is:

$$H = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$$
which says that each character of this message carries an average of one bit’s worth of information.
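The one-bit answer is easy to check numerically. This sketch recomputes the two probabilities and the sum by hand:

```python
from math import log

message = "ef" * 26                       # 26 e's alternated with 26 f's
p_e = message.count("e") / len(message)   # 26/52 = 0.5
p_f = message.count("f") / len(message)   # likewise 0.5
H = -(p_e * log(p_e, 2) + p_f * log(p_f, 2))
print(H)                                  # 1.0
```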
If we now calculate the entropy of the string “abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,” each character has probability:

$$p_i = \frac{2}{52} = \frac{1}{26}$$
and the entropy is:

$$H = -\sum_{i=1}^{26}\frac{1}{26}\log_2\frac{1}{26} = \log_2 26 \approx 4.70$$
which is much higher than in the previous example. In this case, encoding each character requires an average of at least 4.7 bits.
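The 4.7-bit figure can be verified with a short calculation. Since all 26 letters occur with the same probability, the sum collapses to $\log_2 26$:

```python
from math import log

message = "abcdefghijklmnopqrstuvwxyz" * 2
p = 2 / len(message)                      # every letter appears twice: 1/26
H = -sum(p * log(p, 2) for _ in range(26))
# H equals log2(26), roughly 4.70 bits per character
```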
Shannon entropy gives a lower limit on how much a file can be compressed: if each character requires, as in the last example, an average of 4.7 bits, then the text cannot be compressed to fewer than approximately 4.7 × 52 ≈ 244.4 bits.
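Multiplying the per-character entropy by the message length gives that lower bound directly (a sketch of the arithmetic above):

```python
from math import log

n = 52                        # characters in the message
H = log(26, 2)                # entropy per character, about 4.7 bits
minimum_bits = H * n          # Shannon's limit for any lossless encoding
# roughly 244.4 bits, or about 31 bytes
```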
The log() function from the math module allows the base to be specified as an optional second argument:
log(x[, b])    Logarithm of x to base b. The default base is e.
¹Interesting fact: Shannon’s paper introduced the term “bit.”