In 1948, Claude Shannon founded the field of information theory with his paper, “A Mathematical Theory of Communication” [5]. In it, he defined the entropy H of a message as

$$H = -\sum_i p_i \log_2 p_i$$
where $p_i$ is the probability of the $i$th character occurring in the message. This probability is easily calculated by counting the number of times each character appears:

$$p_i = \frac{\text{count}_i}{n}$$

where $\text{count}_i$ is the number of occurrences of the $i$th character and $n$ is the total number of characters in the message.
When entropy H is calculated as above, using log base two, it measures the average number of bits¹ per character required to communicate the given message.
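The definition translates directly into a few lines of Python. This is a sketch; the function name `entropy` is my own, built from `collections.Counter` and the `math` module’s `log()`:

```python
from collections import Counter
from math import log

def entropy(message):
    """Average number of bits per character needed to encode message."""
    n = len(message)
    counts = Counter(message)            # occurrences of each distinct character
    # H = -sum of p_i * log2(p_i) over every distinct character
    return -sum((c / n) * log(c / n, 2) for c in counts.values())
```

For example, `entropy("ee")` is 0: a message containing only one distinct character carries no information at all.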
Intuitively, in a string of 26 e’s alternated with 26 f’s, neither character carries very much information, because they act so similarly. To calculate the entropy of this string, we first calculate the probabilities of each character. The probability of an “e,” $p_e$, is:

$$p_e = \frac{26}{52} = \frac{1}{2}$$
and the probability of an “f” is exactly the same. Thus, the entropy is:

$$H = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$$
which says that each character of this message carries an average of one bit’s worth of information.
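The one-bit answer is easy to check numerically. This sketch recomputes the two probabilities and the sum by hand:

```python
from math import log

message = "ef" * 26                       # 26 e's alternated with 26 f's
p_e = message.count("e") / len(message)   # 26/52 = 0.5
p_f = message.count("f") / len(message)   # likewise 0.5
H = -(p_e * log(p_e, 2) + p_f * log(p_f, 2))
print(H)                                  # 1.0
```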
If we now calculate the entropy of the string “abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,” each character has probability:

$$p_i = \frac{2}{52} = \frac{1}{26}$$
and the entropy is:

$$H = -\sum_{i=1}^{26}\frac{1}{26}\log_2\frac{1}{26} = \log_2 26 \approx 4.70$$
which is much higher than in the previous example. In this case, encoding each character requires an average of at least 4.7 bits.
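The 4.7-bit figure can be verified with a short calculation. Since all 26 letters occur with the same probability, the sum collapses to $\log_2 26$:

```python
from math import log

message = "abcdefghijklmnopqrstuvwxyz" * 2
p = 2 / len(message)                      # every letter appears twice: 1/26
H = -sum(p * log(p, 2) for _ in range(26))
# H equals log2(26), roughly 4.70 bits per character
```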
Shannon entropy gives a lower limit on how much a file can be compressed: if each character requires, as in the last example, an average of 4.7 bits, then the text cannot be compressed to fewer than approximately 4.7 × 52 ≈ 244.4 bits.
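Multiplying the per-character entropy by the message length gives that lower bound directly (a sketch of the arithmetic above):

```python
from math import log

n = 52                        # characters in the message
H = log(26, 2)                # entropy per character, about 4.7 bits
minimum_bits = H * n          # Shannon's limit for any lossless encoding
# roughly 244.4 bits, or about 31 bytes
```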
The log() function from the math module allows the base to be specified as an optional second argument:
log(x[, b])    Logarithm of x to base b. The default base is e.
¹Interesting fact: Shannon’s paper introduced the term “bit.”