© Marius Iulian Mihailescu and Stefania Loredana Nita 2021
M. I. Mihailescu, S. L. NitaPro Cryptography and Cryptanalysis https://doi.org/10.1007/978-1-4842-6367-9_23

23. Text Characterization

Marius Iulian Mihailescu1   and Stefania Loredana Nita1
(1)
Bucharest, Romania
 

In this chapter, we will analyze two important metrics for cipher and plaintext analysis: the chi-squared statistic and searching for patterns (monograms, bigrams, and trigrams). When working with classic and modern cryptography, text characterization as technique is a very important part of the cryptanalysis bag of tricks.

Chi-Squared Statistic

The chi-squared statistic is an important metric that computes the similarity percent between two probability distributions. There are two situations when the results of the chi-squared statistic is equal with 0: it means that the two distributions are similar, and if the distributions are very different, a higher number will be outputted.

The chi-squared statistic is defined by the following formula:
$$ {chi}^2left(C,E
ight)=sum limits_{i=A}^{i=Z}frac{{left({C}_i-{E}_i
ight)}^2}{E_i} $$
In Listing 23-1, we compute an example of a chi-squared distribution.
using System
namespace ComputeChiSquaredStatistics
{
   class ComputeChiSquaredStatistics
   {
      static void Main(string[] args)
      {
         int number_of_experiments=10000;
         int number_of_stars_distribution=100;
         Random theGenerator = new Random();
            double theDistribution(6.0);
         int[] probability = new int[10];
         for (int counter=0; counter
             <number_of_experiments; ++counter)
         {
               double no =
                   theDistribution(theGenerator);
              if ((no>=0.0)&&(no<10.0))
                 ++ probability [int(no)];
        }
           Console.Writeline("The Chi-Squared
            Distribution (6.0):");
            for (int index = 0; index < 10; ++index)
            {
                Console.WriteLine("index {0} ", index, " -- {1}: ",  (index + 1), ":");
                Console.WriteLine("{0}", probability[index] * number_of_stars_distribution / number_of_experiments);
            }
            Console.ReadKey();
        }
   }
}
Listing 23-1

ChiSquaredDistribution Source Code

The output of the above implementation is shown in Figure 23-1.
../images/493660_1_En_23_Chapter/493660_1_En_23_Fig1_HTML.jpg
Figure 23-1

Chi-distribution output

How does the chi-squared distribution example help us in cryptanalysis and cryptography?

The first step that has to be made is to compute the frequency of the characters within the ciphertext. The second step is to compare the frequency distribution of the assumed language used for encryption (e.g. English) with shifting the two frequency distributions related to one another. In this way, we can find the shift that was used during the encryption process. This procedure is a standard and simple procedure that can be used on ciphers, such as a Caesar cipher. This take place when the frequency of the English characters is lined up with the frequency of the ciphertext. We are aware of the probabilities of the occurrences for English characters

As an example, let’s consider the following example encrypted with a Caesar cipher, which has 46 characters (see Figure 23-2 for the letter frequency). The example from Figure 23-2 is computed and shown using CrypTool1 to compute and verify the implementation and its correctness for letter frequency on encrypted text:
     ZHOFRPHWRDSUHVVWKLVLVHQFUBSWHGZLWKFDHVDUFLSKHU
It is very important to understand that the chi-squared statistic is based on counts and not on probabilities. For example, if we have the letter E with an occurrence probability of 0.127, the expectation is that the occurance will will be 12.7 times within 100 characters.
../images/493660_1_En_23_Chapter/493660_1_En_23_Fig2_HTML.jpg
Figure 23-2

Letter frequency for encrypted text

To compute the count expected, the length of the ciphertext has to be multiplied by the probability. The cipher from above has a total of 46 characters. Following the statistic with E from above, our expectation is for the E letter to occur 46 · 0.127 = 5.842 times.

In order to solve the Caesar cipher, we need to use each of the possible 25 keys, using the letter or the position of the letter within the alphabet. For this, it’s very important how the count starts, from 0 or from 1. The chi-squared has to be computed for each of the keys. The process consists of comparing the count number of a letter with what we can expect the count to be if the text is in English.

For computing the chi-squared statistic for our ciphertext, we count each letter and we see that the letter H occurs 7 times. If the language used is English, it should appear 46 · 0.082 = 3.772 times. Based on the output we can compute the following:
$$ frac{{left(7-3.772
ight)}^2}{3.772}=frac{3.228^2}{3.772}=frac{10.420}{3.772}=2.762 $$

This procedure is done for the rest of the letters and making addition between all the probabilities (see Figure 23-3).

Once the ciphertext is decrypted, the plaintext should be
          WELCOMETOAPRESSTHISISENCRYPTEDWITHCAESARCIPHER
../images/493660_1_En_23_Chapter/493660_1_En_23_Fig3_HTML.jpg
Figure 23-3

Encryption letter frequency (%)2

Cryptanalysis Using Monogram, Bigram, and Trigram Frequency Counts

Frequency analysis is one of the best practices for finding the number of ciphertext characters occurrence with the goal of breaking the cipher. Pattern analysis can be used to measure and count the characters as bigrams (or digraphs), a method for measuring pairs of characters that occurs within the text. There is also trigram frequency analysis, which measures the occurrence of combinations formed out of three letters.

In this section, we will focus on text characterization with bigrams and trigrams as opposed to resolving ciphers, such as Playfair.

Counting Monograms

Counting monograms represents one of the most effective methods used in substitution ciphers, such as Caesar ciphers, Polybius squares, etc. The method works very well because the English language has a specific frequency distribution. This means also that it’s not hidden by the substitution ciphers. The distribution will look as depicted in Figure 23-4 and Listing 23-2.
../images/493660_1_En_23_Chapter/493660_1_En_23_Fig4_HTML.jpg
Figure 23-4

Letter frequency for the English language

using System;
using System.IO;
class Program
{
    static void Main()
    {
        //** we use the array to store the frequencies
        int[] frequency = new int[(int)char.MaxValue];
        //** look at the content of the text file
        string s = File.ReadAllText("TheText.txt");
        //** go through each of the characters
        foreach (char t in s)
        {
            //** store the frequencies as a table
            frequency [(int)t]++;
        }
        //** write all letters that have been found
        for (int letterPos = 0; letterPos <
                                (int)char.MaxValue; letterPos++)
        {
            if (c[letterPos] > 0 &&
                char.IsLetterOrDigit((char)letterPos))
            {
                Console.WriteLine("The Letter: {0}
                                        has the frequency: {1}",
                    (char)letterPos,
                    freq [letterPos]);
            }
        }
    }
}
}
Listing 23-2

Counting Monograms

The output is shown in Figure 23-5.
../images/493660_1_En_23_Chapter/493660_1_En_23_Fig5_HTML.jpg
Figure 23-5

Output of counting monograms

An advanced example of doing the above counting is shown in Listing 23-3. It is an advanced example using LINQ and lambda expression. Figure 23-6 shows the output.
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace MonogramsCounting_LINQ{
    class Program{
        static void Main(){
            var frequencies = from c in
   File.ReadAllText("TheText.txt")
              group c by c into groupCharactersFrequencies
              select groupCharactersFrequencies;
            foreach (var c in frequencies)
                Console.WriteLine($"The character: {c.Key} has
the frequency: {c.Count()} times");
            Console.ReadKey();}}}
Listing 23-3

Lambda LINQ Expression for Counting Letter Frequencies

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig6_HTML.jpg
Figure 23-6

Character frequencies using LINQ and lambda expressions

Counting Bigrams

The method for counting bigrams is based on the same idea as counting monograms. Instead of counting the occurrence of single characters, counting bigrams means counting the occurrence frequency for pairs of characters.

Figure 23-7 lists some of the common bigrams found during the cryptanalysis process. In Listing 23-4, we have implemented a solution that deals with counting the occurrences of bigrams. Figure 23-8 shows the output.
../images/493660_1_En_23_Chapter/493660_1_En_23_Fig7_HTML.jpg
Figure 23-7

Examples of bigrams

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig8_HTML.jpg
Figure 23-8

The output from Listing 23-4

class Program
    {
        static void Main(string[] args)
        {
            String text = "Welcome to Apress! This book is
            about cryptography and C#.";
            String bigramPattern = "to";
            Console.WriteLine("The number of occurrences of
                  "" + bigramPattern + "" in "" + text +
                  "" is: " +
                 countFrequenciesBigrams(bigramPattern,
                                       text).ToString());
        }
        static int countFrequenciesBigrams(String
                                bigramPattern, String text)
        {
            int bigramPatternLength = bigramPattern.Length;
            int textLength = text.Length;
            int occurrences = 0;
            for (int idx = 0; idx <= textLength –
                               bigramPatternLength; idx++)
            {
                int jIdx;
                for (jIdx = 0; jIdx < bigramPatternLength;
                                                   jIdx++)
                {
                    if (text[idx + jIdx] !=
                                    bigramPattern[jIdx])
                    {
                        break;
                    }
                }
                if (jIdx == bigramPatternLength)
                {
                    occurrences++;
                    jIdx = 0;
                }
            }
            return occurrences;
        }
    }
Listing 23-4

Computing Bigrams

Counting Trigrams

Trigram counting uses the same principle as bigram counting. The difference consists in counting triple characters.

Figure 23-9 lists some common bigrams that are experienced during the cryptanalysis process. In Listing 23-5, we have implemented a solution for finding and counting the occurrences of trigrams within a text. The solution is different from the one from Listing 23-4. The output is shown in Figure 23-10.
../images/493660_1_En_23_Chapter/493660_1_En_23_Fig9_HTML.jpg
Figure 23-9

Examples of trigrams

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig10_HTML.jpg
Figure 23-10

The output from Listing 23-5

class Program
    {
        static void Main(string[] args)
        {
            String fullText = "Welcome to Apress! This book is
            about cryptography and C#.";
            String trigramPattern = "Apr";
            Console.WriteLine("Full text: " + fullText);
            Console.WriteLine("Trigram: " +
                         trigramPattern + " ");
            Console.WriteLine("The number of occurrences of
"" + trigramPattern + "" in "" +
fullText + "" is: " +
countFrequenciesTrigrams(trigramPattern,
fullText).ToString());
        }
        static int countFrequenciesTrigrams(String
                        trigramPattern, String fullText)
        {
            int trigramPatternLength = trigramPattern.Length;
            int fullTextLength = fullText.Length;
            int noOfOccurrence = 0;
            for (int index = 0; index <= fullTextLength –
                           trigramPatternLength; index++)
            {
                int jIndex;
                for (jIndex = 0; jIndex <
                          trigramPatternLength; jIndex++)
                {
                    if (fullText[index + jIndex] !=
                                  trigramPattern[jIndex])
                    {
                        break;
                    }
                }
                if (jIndex == trigramPatternLength)
                {
                    noOfOccurrence++;
                    jIndex = 0;
                }
            }
            return noOfOccurrence;
        }
    }
Listing 23-5

Counting Trigrams

Generate Letter Frequency

The array wikiFrequencies stores the frequencies (which are a relative percentage of the letters) as listed on Wikipedia3 (see Listing 23-6). The provided example interprets the number as a percentage. As an example, letter “A” appears 8.167% of the time.

The program in Figure 23-11 declares a Random object. The defining value of the letter (e.g. “A”) is as an integer. We will use this convention for later purposes. The event handler, Load, makes the sum of the entire values from the wikiFrequencies.
../images/493660_1_En_23_Chapter/493660_1_En_23_Fig11_HTML.jpg
Figure 23-11

Generation of the letter frequencies

using System;
using System.Linq;
namespace LetterFrequency
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "";
            VerifyFreq();
            Console.Write("The number of letters: ");
            input = Console.ReadLine();
            Compute(input);
        }
        //** Store the letter frequencies.
        //** For more details and the values
        //** stored below, see the link:
        //** http://en.wikipedia.org/wiki/Letter_frequency
        private static float[] wikiFrequencies =
        {
            8.167f, 1.492f, 2.782f, 4.253f, 12.702f,
            2.228f, 2.015f, 6.094f, 6.966f, 0.153f,
            0.772f, 4.025f, 2.406f, 6.749f, 7.507f,
            1.929f, 0.095f, 5.987f, 6.327f, 9.056f,
            2.758f, 0.978f, 2.360f, 0.150f, 1.974f,
            0.074f
        };
        //** create a instance of a number
        //** generator using Random class
        private static Random randomNumber = new Random();
        //** compute the ASCII value of letter A
        private static int int_AsciiA = (int)'A';
        //** verify that the frequencies are adding up to 100
        private static void VerifyFreq()
        {
            //** compute the difference to E
            float totalAmount = wikiFrequencies.Sum();
            float differenceComputation = 100f - totalAmount;
            wikiFrequencies[(int)'E' - int_AsciiA] +=
                         differenceComputation;
        }
        //** based on the frequencies
        //** generate randomly the letters
        private static void Compute(string txtNumLetters)
        {
            //** monitor and track each letter
            //** that has been generated
            int[] countGeneratedLetters = new int[26];
            //** randomly generate the letters
            int theNumberOfLetters = int.Parse(txtNumLetters);
            string result = "";
            for (int k = 0; k < theNumberOfLetters; k++)
            {
                //** randomly generate a number
                //** between 0 and 100
                double randomlyNumber = 100.0 *
                randomNumber.NextDouble();
                //** select the letter that
                //** this will represents
                for (int numberOfLetter = 0; ;
                                    numberOfLetter++)
                {
                    //** extract the frequency of the
                    //** letter from the number
                    randomlyNumber -=
                       wikiFrequencies[numberOfLetter];
                    //** if the randomly number is
                    //** less and equal than 0
                    //** it means that we have the letter
                    if ((randomlyNumber <= 0) ||
                              (numberOfLetter == 25))
                    {
                        char character = (char)(int_AsciiA +
                                           numberOfLetter);
                        result += character.ToString() + ' ';
                      countGeneratedLetters[numberOfLetter]++;
                        break;
                    }
                }
            }
            Console.WriteLine(result + " ");
            //** show the frequencies
            for (int i = 0; i < countGeneratedLetters.Length;
                                                  i++)
            {
                char ch = (char)(int_AsciiA + i);
                float frequency =
(float)countGeneratedLetters[i] /
theNumberOfLetters * 100;
                string str =
string.Format("{0} {1,6} {2,6}
{3,6} ,ch.ToString(),
frequency.ToString("0.000"),
wikiFrequencies[i].ToString("0.000"),
(frequency –
wikiFrequencies[i]).ToString("0.000"));
                Console.WriteLine(str);
            }
        }
    }
}
Listing 23-6

Randomly Generating Letter Frequencies

Conclusion

The chapter covered the concept of text characterization and showed its importance in the cryptanalysis process. You can now deal with chi-squared statistics and working with monograms, digrams, and trigrams for decrypting substitution ciphertexts. As a summary, you learned about
  • The concept of text characterization

  • Working with monograms, digrams, and trigrams

  • The implementation of chi-squared statistic

  • Monogram, digram, and trigram implementations

Bibliography

  1. [1]

    Simon Singh, The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography. ISBN 0-385-49532-3. 2000.

     
  2. [2]

    Helen F. Gaines, Cryptanalysis: A Study of Ciphers and Their Solutions. 1989.

     
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.220.191.247