Project: Shannon Entropy

In 1948, Claude Shannon founded the field of information theory with his paper, “A Mathematical Theory of Communication” [5]. In it, he defined the entropy H of a message as

H = -\sum_i p_i \log_2(p_i)

where p_i is the probability of the ith character occurring in the message. This probability can be easily calculated if we count the number of times each character appears in the message:

p_i = \frac{\text{number of times } i\text{th character appears}}{\text{length of message}}

When entropy H is calculated as above, using the base-two logarithm, it measures the average number of bits¹ per character required to communicate the given message.
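As a concrete illustration, here is a minimal Python sketch of this calculation; the function name shannon_entropy and the use of collections.Counter are choices made here, not part of the original text:

    from collections import Counter
    from math import log

    def shannon_entropy(message):
        """Return the Shannon entropy of message, in bits per character."""
        counts = Counter(message)   # occurrences of each distinct character
        length = len(message)
        # H = -sum over i of p_i * log2(p_i)
        return -sum((n / length) * log(n / length, 2) for n in counts.values())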

Example: Low Entropy

Intuitively, in a string of 26 e’s alternated with 26 f’s, neither character carries very much information, because they act so similarly. To calculate the entropy of this string, we first calculate the probability of each character. The probability of an “e,” p_e, is:

p_e = \frac{\text{count of e's}}{\text{length of message}} = \frac{26}{52} = 0.5

and the probability of an “f” is exactly the same. Thus, the entropy is:

H = -0.5\log_2(0.5) - 0.5\log_2(0.5) = 1

which says that each character of this message carries an average of one bit’s worth of information.
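Using the shannon_entropy sketch above (our own helper, not part of the text), this result can be checked directly:

    >>> shannon_entropy("ef" * 26)    # 26 e's alternated with 26 f's
    1.0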

Example: Higher Entropy

If we now calculate the entropy of the string “abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,” each character has probability:

p = \frac{2}{52} = \frac{1}{26}

and the entropy is:

H = -26\left(\frac{1}{26}\log_2\frac{1}{26}\right) \approx 4.7

which is much higher than in the previous example. In this case, each character requires an average of at least 4.7 bits to encode it.
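The same sketch reproduces this value (rounded here, since the exact entropy is log2(26) ≈ 4.7004):

    >>> from string import ascii_lowercase
    >>> round(shannon_entropy(ascii_lowercase * 2), 2)
    4.7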

Shannon entropy gives a limit on how much compression is possible for a file: if each character requires, as in the last example, an average of 4.7 bits, then that text cannot be compressed to fewer than approximately 4.7 × 52 = 244.4 bits.

Logs

The log() function from the math module allows the base to be specified as an optional second argument:

log(x[, b])

Return the logarithm of x to base b. If b is omitted, the base defaults to e (the natural logarithm).
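For example (a short interactive check, not from the text):

    >>> from math import log
    >>> log(0.5, 2)    # log base 2 of 1/2, as used in the entropy formula
    -1.0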

Exercise

  1. Write a program entropy.py to calculate the Shannon entropy of a text file. Compare the required average number of bits per character to the number of bits per character actually used if the file is stored as ASCII.
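One possible starting point for entropy.py is sketched below; it is not the book’s solution, and it assumes the file name is given on the command line and that ASCII storage uses 8 bits per character:

    # entropy.py -- one possible sketch, assuming the file name is passed
    # on the command line and that ASCII storage uses 8 bits per character.
    import sys
    from collections import Counter
    from math import log

    def shannon_entropy(text):
        counts = Counter(text)
        length = len(text)
        return -sum((n / length) * log(n / length, 2) for n in counts.values())

    if __name__ == "__main__":
        with open(sys.argv[1]) as f:
            text = f.read()
        print("Entropy: %.3f bits per character" % shannon_entropy(text))
        print("ASCII storage: 8 bits per character")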

¹ Interesting fact: Shannon’s paper introduced the term “bit.”
