Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

M. I. Mihailescu, S. L. NitaPro Cryptography and Cryptanalysis https://doi.org/10.1007/978-1-4842-6367-9_23

23. Text Characterization

Marius Iulian Mihailescu¹ and Stefania Loredana Nita¹

(1)

Bucharest, Romania

In this chapter, we will analyze two important metrics for cipher and plaintext analysis: the chi-squared statistic and searching for patterns (monograms, bigrams, and trigrams). When working with classic and modern cryptography, text characterization as technique is a very important part of the cryptanalysis bag of tricks.

Chi-Squared Statistic

The chi-squared statistic is an important metric that computes the similarity percent between two probability distributions. There are two situations when the results of the chi-squared statistic is equal with 0: it means that the two distributions are similar, and if the distributions are very different, a higher number will be outputted.

The chi-squared statistic is defined by the following formula:

${chi}^2left(C,E ight)=sum limits_{i=A}^{i=Z}frac{{left({C}_i-{E}_i ight)}^2}{E_i}$

In Listing 23-1, we compute an example of a chi-squared distribution.

using System

namespace ComputeChiSquaredStatistics

{

class ComputeChiSquaredStatistics

{

static void Main(string[] args)

{

int number_of_experiments=10000;

int number_of_stars_distribution=100;

Random theGenerator = new Random();

double theDistribution(6.0);

int[] probability = new int[10];

for (int counter=0; counter

<number_of_experiments; ++counter)

{

double no =

theDistribution(theGenerator);

if ((no>=0.0)&&(no<10.0))

++ probability [int(no)];

}

Console.Writeline("The Chi-Squared

Distribution (6.0):");

for (int index = 0; index < 10; ++index)

{

Console.WriteLine("index {0} ", index, " -- {1}: ", (index + 1), ":");

Console.WriteLine("{0}", probability[index] * number_of_stars_distribution / number_of_experiments);

}

Console.ReadKey();

}

Listing 23-1

ChiSquaredDistribution Source Code

The output of the above implementation is shown in Figure 23-1.

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig1_HTML.jpg — Figure 23-1
Chi-distribution output

How does the chi-squared distribution example help us in cryptanalysis and cryptography?

The first step that has to be made is to compute the frequency of the characters within the ciphertext. The second step is to compare the frequency distribution of the assumed language used for encryption (e.g. English) with shifting the two frequency distributions related to one another. In this way, we can find the shift that was used during the encryption process. This procedure is a standard and simple procedure that can be used on ciphers, such as a Caesar cipher. This take place when the frequency of the English characters is lined up with the frequency of the ciphertext. We are aware of the probabilities of the occurrences for English characters

As an example, let’s consider the following example encrypted with a Caesar cipher, which has 46 characters (see Figure 23-2 for the letter frequency). The example from Figure 23-2 is computed and shown using CrypTool¹ to compute and verify the implementation and its correctness for letter frequency on encrypted text:

ZHOFRPHWRDSUHVVWKLVLVHQFUBSWHGZLWKFDHVDUFLSKHU

It is very important to understand that the chi-squared statistic is based on counts and not on probabilities. For example, if we have the letter E with an occurrence probability of 0.127, the expectation is that the occurance will will be 12.7 times within 100 characters.

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig2_HTML.jpg — Figure 23-2
Letter frequency for encrypted text

To compute the count expected, the length of the ciphertext has to be multiplied by the probability. The cipher from above has a total of 46 characters. Following the statistic with E from above, our expectation is for the E letter to occur 46 · 0.127 = 5.842 times.

In order to solve the Caesar cipher, we need to use each of the possible 25 keys, using the letter or the position of the letter within the alphabet. For this, it’s very important how the count starts, from 0 or from 1. The chi-squared has to be computed for each of the keys. The process consists of comparing the count number of a letter with what we can expect the count to be if the text is in English.

For computing the chi-squared statistic for our ciphertext, we count each letter and we see that the letter H occurs 7 times. If the language used is English, it should appear 46 · 0.082 = 3.772 times. Based on the output we can compute the following:

$frac{{left(7-3.772 ight)}^2}{3.772}=frac{3.228^2}{3.772}=frac{10.420}{3.772}=2.762$

This procedure is done for the rest of the letters and making addition between all the probabilities (see Figure 23-3).

Once the ciphertext is decrypted, the plaintext should be

WELCOMETOAPRESSTHISISENCRYPTEDWITHCAESARCIPHER

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig3_HTML.jpg — Figure 23-3
Encryption letter frequency (%)²

Cryptanalysis Using Monogram, Bigram, and Trigram Frequency Counts

Frequency analysis is one of the best practices for finding the number of ciphertext characters occurrence with the goal of breaking the cipher. Pattern analysis can be used to measure and count the characters as bigrams (or digraphs), a method for measuring pairs of characters that occurs within the text. There is also trigram frequency analysis, which measures the occurrence of combinations formed out of three letters.

In this section, we will focus on text characterization with bigrams and trigrams as opposed to resolving ciphers, such as Playfair.

Counting Monograms

Counting monograms represents one of the most effective methods used in substitution ciphers, such as Caesar ciphers, Polybius squares, etc. The method works very well because the English language has a specific frequency distribution. This means also that it’s not hidden by the substitution ciphers. The distribution will look as depicted in Figure 23-4 and Listing 23-2.

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig4_HTML.jpg — Figure 23-4
Letter frequency for the English language

using System;

using System.IO;

class Program

{

static void Main()

{

//** we use the array to store the frequencies

int[] frequency = new int[(int)char.MaxValue];

//** look at the content of the text file

string s = File.ReadAllText("TheText.txt");

//** go through each of the characters

foreach (char t in s)

{

//** store the frequencies as a table

frequency [(int)t]++;

}

//** write all letters that have been found

for (int letterPos = 0; letterPos <

(int)char.MaxValue; letterPos++)

{

if (c[letterPos] > 0 &&

char.IsLetterOrDigit((char)letterPos))

{

Console.WriteLine("The Letter: {0}

has the frequency: {1}",

(char)letterPos,

freq [letterPos]);

}

Listing 23-2

Counting Monograms

The output is shown in Figure 23-5.

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig5_HTML.jpg — Figure 23-5
Output of counting monograms

An advanced example of doing the above counting is shown in Listing 23-3. It is an advanced example using LINQ and lambda expression. Figure 23-6 shows the output.

using System;

using System.Collections;

using System.Collections.Generic;

using System.IO;

using System.Linq;

using System.Text;

using System.Threading.Tasks;

namespace MonogramsCounting_LINQ{

class Program{

static void Main(){

var frequencies = from c in

File.ReadAllText("TheText.txt")

group c by c into groupCharactersFrequencies

select groupCharactersFrequencies;

foreach (var c in frequencies)

Console.WriteLine($"The character: {c.Key} has

the frequency: {c.Count()} times");

Console.ReadKey();}}}

Listing 23-3

Lambda LINQ Expression for Counting Letter Frequencies

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig6_HTML.jpg — Figure 23-6
Character frequencies using LINQ and lambda expressions

Counting Bigrams

The method for counting bigrams is based on the same idea as counting monograms. Instead of counting the occurrence of single characters, counting bigrams means counting the occurrence frequency for pairs of characters.

Figure 23-7 lists some of the common bigrams found during the cryptanalysis process. In Listing 23-4, we have implemented a solution that deals with counting the occurrences of bigrams. Figure 23-8 shows the output.

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig7_HTML.jpg — Figure 23-7
Examples of bigrams

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig8_HTML.jpg — Figure 23-8
The output from Listing 23-4

class Program

{

static void Main(string[] args)

{

String text = "Welcome to Apress! This book is

about cryptography and C#.";

String bigramPattern = "to";

Console.WriteLine("The number of occurrences of

"" + bigramPattern + "" in "" + text +

"" is: " +

countFrequenciesBigrams(bigramPattern,

text).ToString());

}

static int countFrequenciesBigrams(String

bigramPattern, String text)

{

int bigramPatternLength = bigramPattern.Length;

int textLength = text.Length;

int occurrences = 0;

for (int idx = 0; idx <= textLength –

bigramPatternLength; idx++)

{

int jIdx;

for (jIdx = 0; jIdx < bigramPatternLength;

jIdx++)

{

if (text[idx + jIdx] !=

bigramPattern[jIdx])

{

break;

}

if (jIdx == bigramPatternLength)

{

occurrences++;

jIdx = 0;

}

return occurrences;

}

Listing 23-4

Computing Bigrams

Counting Trigrams

Trigram counting uses the same principle as bigram counting. The difference consists in counting triple characters.

Figure 23-9 lists some common bigrams that are experienced during the cryptanalysis process. In Listing 23-5, we have implemented a solution for finding and counting the occurrences of trigrams within a text. The solution is different from the one from Listing 23-4. The output is shown in Figure 23-10.

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig9_HTML.jpg — Figure 23-9
Examples of trigrams

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig10_HTML.jpg — Figure 23-10
The output from Listing 23-5

class Program

{

static void Main(string[] args)

{

String fullText = "Welcome to Apress! This book is

about cryptography and C#.";

String trigramPattern = "Apr";

Console.WriteLine("Full text: " + fullText);

Console.WriteLine("Trigram: " +

trigramPattern + " ");

Console.WriteLine("The number of occurrences of

"" + trigramPattern + "" in "" +

fullText + "" is: " +

countFrequenciesTrigrams(trigramPattern,

fullText).ToString());

}

static int countFrequenciesTrigrams(String

trigramPattern, String fullText)

{

int trigramPatternLength = trigramPattern.Length;

int fullTextLength = fullText.Length;

int noOfOccurrence = 0;

for (int index = 0; index <= fullTextLength –

trigramPatternLength; index++)

{

int jIndex;

for (jIndex = 0; jIndex <

trigramPatternLength; jIndex++)

{

if (fullText[index + jIndex] !=

trigramPattern[jIndex])

{

break;

}

if (jIndex == trigramPatternLength)

{

noOfOccurrence++;

jIndex = 0;

}

return noOfOccurrence;

}

Listing 23-5

Counting Trigrams

Generate Letter Frequency

The array wikiFrequencies stores the frequencies (which are a relative percentage of the letters) as listed on Wikipedia³ (see Listing 23-6). The provided example interprets the number as a percentage. As an example, letter “A” appears 8.167% of the time.

The program in Figure 23-11 declares a Random object. The defining value of the letter (e.g. “A”) is as an integer. We will use this convention for later purposes. The event handler, Load, makes the sum of the entire values from the wikiFrequencies.

../images/493660_1_En_23_Chapter/493660_1_En_23_Fig11_HTML.jpg — Figure 23-11
Generation of the letter frequencies

using System;

using System.Linq;

namespace LetterFrequency

{

class Program

{

static void Main(string[] args)

{

string input = "";

VerifyFreq();

Console.Write("The number of letters: ");

input = Console.ReadLine();

Compute(input);

}

//** Store the letter frequencies.

//** For more details and the values

//** stored below, see the link:

//** http://en.wikipedia.org/wiki/Letter_frequency

private static float[] wikiFrequencies =

{

8.167f, 1.492f, 2.782f, 4.253f, 12.702f,

2.228f, 2.015f, 6.094f, 6.966f, 0.153f,

0.772f, 4.025f, 2.406f, 6.749f, 7.507f,

1.929f, 0.095f, 5.987f, 6.327f, 9.056f,

2.758f, 0.978f, 2.360f, 0.150f, 1.974f,

0.074f

};

//** create a instance of a number

//** generator using Random class

private static Random randomNumber = new Random();

//** compute the ASCII value of letter A

private static int int_AsciiA = (int)'A';

//** verify that the frequencies are adding up to 100

private static void VerifyFreq()

{

//** compute the difference to E

float totalAmount = wikiFrequencies.Sum();

float differenceComputation = 100f - totalAmount;

wikiFrequencies[(int)'E' - int_AsciiA] +=

differenceComputation;

}

//** based on the frequencies

//** generate randomly the letters

private static void Compute(string txtNumLetters)

{

//** monitor and track each letter

//** that has been generated

int[] countGeneratedLetters = new int[26];

//** randomly generate the letters

int theNumberOfLetters = int.Parse(txtNumLetters);

string result = "";

for (int k = 0; k < theNumberOfLetters; k++)

{

//** randomly generate a number

//** between 0 and 100

double randomlyNumber = 100.0 *

randomNumber.NextDouble();

//** select the letter that

//** this will represents

for (int numberOfLetter = 0; ;

numberOfLetter++)

{

//** extract the frequency of the

//** letter from the number

randomlyNumber -=

wikiFrequencies[numberOfLetter];

//** if the randomly number is

//** less and equal than 0

//** it means that we have the letter

if ((randomlyNumber <= 0) ||

(numberOfLetter == 25))

{

char character = (char)(int_AsciiA +

numberOfLetter);

result += character.ToString() + ' ';

countGeneratedLetters[numberOfLetter]++;

break;

}

Console.WriteLine(result + " ");

//** show the frequencies

for (int i = 0; i < countGeneratedLetters.Length;

i++)

{

char ch = (char)(int_AsciiA + i);

float frequency =

(float)countGeneratedLetters[i] /

theNumberOfLetters * 100;

string str =

string.Format("{0} {1,6} {2,6}

{3,6} ,ch.ToString(),

frequency.ToString("0.000"),

wikiFrequencies[i].ToString("0.000"),

(frequency –

wikiFrequencies[i]).ToString("0.000"));

Console.WriteLine(str);

}

Listing 23-6

Randomly Generating Letter Frequencies

Conclusion

The chapter covered the concept of text characterization and showed its importance in the cryptanalysis process. You can now deal with chi-squared statistics and working with monograms, digrams, and trigrams for decrypting substitution ciphertexts. As a summary, you learned about

The concept of text characterization
Working with monograms, digrams, and trigrams
The implementation of chi-squared statistic
Monogram, digram, and trigram implementations

Bibliography

[1]
Simon Singh, The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography. ISBN 0-385-49532-3. 2000.
[2]
Helen F. Gaines, Cryptanalysis: A Study of Ciphers and Their Solutions. 1989.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 23. Text Characterization

Create new playlist

Sign In

Sign Up

23. Text Characterization

Chi-Squared Statistic

Cryptanalysis Using Monogram, Bigram, and Trigram Frequency Counts

Counting Monograms

Counting Bigrams

Counting Trigrams

Generate Letter Frequency

Conclusion

Bibliography

Table of Contents for
23. Text Characterization