Chapter 19. Implementing a Checksum to Protect Data Integrity

 

Where is the information?

Lost in data.

Where is the data?

Lost in the #@%!& database!

 
 --Joe Celko

Nearly all software applications handle the manipulation of data through a transmission medium. A transmission medium could be the registry, memory, disk files, or a database, to name a few. Each medium handles and stores data in a different way, but every transmission medium is unreliable and has the potential to fail. It is for this reason that the CRC-32 (Cyclic Redundancy Check) algorithm came to be, which is used to verify that there has been no corruption or errors in a data transmission. This algorithm is given arbitrary data of arbitrary length, and computes a 32-bit checksum number representing the contents of the supplied data, which is transmitted along with the data through the transmission medium that is used. Once the data arrives at its destination, a new checksum is recalculated on the data that was received, and it is compared to the checksum calculated before the transmission. If the values match, the transmission most likely was successful, but if the values do not match, you know that the transmission encountered an error of some sort and the data received is incomplete, modified, or corrupted.

This algorithm has also been used to detect any tampering done to data files for games (or any software application for that matter). Checksum values are typically created for all files before a game ships; those pre-calculated values are checked against the runtime calculated version when the game launches and detect whether data has been modified from its original state.

This chapter covers an implementation of the CRC-32 checksum algorithm in C# and later goes on to show an alternative algorithm provided by Microsoft. The mathematical proofs and reasoning behind this algorithm will not be covered.

Implementation

The implementation for the CRC-32 algorithm is fairly straightforward and exists in a few flavors. The implementation provided in this chapter precalculates a lookup table using a specified polynomial value, and the calculation is based on the algebra of polynomials over the values (mod 2) using the cached lookup table.

The code is as follows:

public class Crc32
{
    private static uint[] _lookupTable;

    public uint Calculate(System.IO.Stream stream)
    {
        unchecked
        {
            uint result = 0xFFFFFFFF;

            byte[] buffer = new byte[1024];

            int byteCount = stream.Read(buffer, 0, 1024);

            while (byteCount > 0)
            {
                for (int byteIndex = 0; byteIndex < byteCount; byteIndex++)
                {
                   result = ((result) >> 8) ^
                   _lookupTable[(buffer[byteIndex]) ^
                   ((result) & 0x000000FF)];
                }
            byteCount = stream.Read(buffer, 0, 1024);
         }

         return ~result;
      }
}

public uint Calculate(byte[] buffer)
{
    unchecked
    {
        uint result = 0xFFFFFFFF;

        for (int byteIndex = 0; byteIndex < buffer.Length; byteIndex++)
        {
            result = ((result) >> 8) ^
            _lookupTable[(buffer[byteIndex]) ^
            ((result) & 0x000000FF)];
        }

        return ~result;
    }
 }

// This static constructor pregenerates the lookup table that our crc32
// algorithm will use to compute more efficiently.
 static Crc32()
 {
     unchecked
     {
         uint polynomial = 0xADB11320;
         uint iterationIndex;
         uint bitIndex;
         uint crc32Value;

         _lookupTable = new uint[256];

         for (iterationIndex = 0; iterationIndex < 256; iterationIndex++)
         {
             crc32Value = iterationIndex;

             for (bitIndex = 8; bitIndex > 0; bitIndex—)
             {
                    if ((crc32Value & 1) == 1)
                    {
                         crc32Value = (crc32Value >> 1) ^ polynomial;
                    }
                    else
                    {
                         crc32Value >>= 1;
                    }
                }
                _lookupTable[iterationIndex] = crc32Value;
             }
         }
     }
}

You will notice the _lookupTable array variable and the static constructor; the implementation precalculates the checksum values using the provided polynomial and stores them in a lookup table to improve and speed up calculation performance.

After instantiation and the precalculation of the lookup table, you can call either signature for the Calculate method. One version accepts a byte array containing the data to generate the checksum for, and the other version accepts a System.IO.Stream instead.

Usage

Using the functionality defined in the implementation class is fairly straightforward. The class will calculate the internal lookup table the first time you instantiate it, and all you have to worry about is calling the Calculate() method. The Calculate method is overloaded to accept either a byte array or a System.IO.Stream object.

The following code shows the proper way to use this class with a byte array:

byte[] data = new byte[DATA_SIZE];
Crc32 crc = new Crc32();
byte[] result = crc.Calculate(data);

The following code shows the proper way to use this class with a System.IO.Stream:

byte[] data = new byte[DATA_SIZE];
using (System.IO.MemoryStream stream = new System.IO.MemoryStream(data))
{
    Crc32 crc = new Crc32();
    byte[] result = crc.Calculate(stream);
}

Using the data “This is a test” will result in a 32-bit checksum value of 2042881507.

The result will be a 32-bit (4-octet) checksum of the data that was provided to the CRC-32 algorithm, and will subsequently be compared against a future checksum calculation.

Alternative

There is one potential problem with the CRC-32 Checksum algorithm in regards to malicious security attacks. Generally, these issues are not important for verification of simple data integrity, but it may be advisable to seek an alternative algorithm in environments where security is a concern; verifying the integrity of packets in a multiplayer environment, for example.

The problem with the CRC-32 (Cyclic Redundancy Checksum) algorithm is that it is not collision-proof, meaning that it is possible to generate two checksum values that are identical. This is not an extremely common occurrence, but it introduces enough exploitability that a malicious plain-text attack could be used to spoof an integrity check. The probability that two different blocks of data will have the same checksum value in an N-bit checksum is 1/2N. The larger the value represented by N, the lower the probability that two different blocks of data will have the same checksum value. So the probability that our CRC-32 implementation will generate an identical checksum for two different blocks of data is 1/232, a percentage that is reasonable enough for most situations.

As an alternative, you can utilize the built-in MD5 algorithm from the System.Security.Cryptography namespace. This algorithm is known so far to be collision-proof, and may be used in place of CRC-32 for better security and credibility with a bit of increased overhead.

Implementing the built-in functionality from Microsoft is very easy. Reference the System.Security.Cryptography namespace and use the following code:

byte[] data = new byte[DATA_SIZE];
MD5 md5 = new MD5CryptoServiceProvider();
byte[] result = md5.ComputeHash(data);

The result will be a 128-bit (16-octet) checksum hash of the data that was provided to the MD5 algorithm, and will subsequently be compared against a future checksum calculation.

Conclusion

This chapter covered two ways of generating a checksum value that can be used to verify data integrity. Each method has different pros and cons, which can be evaluated on a per-project basis. The CRC-32 algorithm can be used in situations where you are basically testing for data corruption, and also in situations where speed is important. The MD5 algorithm has some added overhead, but its usage offers more credible and relatively secure checksums.

Regardless of the algorithm you choose to implement, verifying data integrity using checksums is a popular and low overhead way to ensure that you are always processing complete and unmodified data. Reliability of tools is very important, and using checksums offers a quick way to verify that the data that users are creating is valid, rather than finding out after the application throws an error when it tries to process the data at a later stage.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.137.7