Chapter Twelve. Digital Data Formats and Their Effects

In digital signal processing, there are many ways to represent numerical data in computing hardware. These representations, known as data formats, have a profound effect on the accuracy and ease of implementation of any given signal processing algorithm. The simpler data formats enable uncomplicated hardware designs to be used at the expense of a restricted range of number representation and susceptibility to arithmetic errors. The more elaborate data formats are somewhat difficult to implement in hardware, but they allow us to manipulate very large and very small numbers while providing immunity to many problems associated with digital arithmetic. The data format chosen for any given application can mean the difference between processing success and failure—it's where our algorithmic rubber meets the road.

In this chapter, we'll introduce the most common types of fixed-point digital data formats and show why and when they're used. Next, we'll use analog-to-digital (A/D) converter operation to establish the precision and dynamic range afforded by these fixed-point formats along with the inherent errors encountered with their use. Finally, we'll cover the interesting subject of floating-point binary formats.

FIXED-POINT BINARY FORMATS

Within digital hardware, numbers are represented by binary digits known as bits—in fact, the term bit originated from the words Binary digIT. A single bit can be in only one of two possible states: either a one or a zero.[†] A six-bit binary number could, for example, take the form 101101, with the leftmost bit known as the most significant bit (msb), while the rightmost bit is called the least significant bit (lsb). The number of bits in a binary number is known as the word length—hence 101101 has a word length of six. Like the decimal number system so familiar to us, the binary number system assumes a weight associated with each digit in the number. That weight is the base of the system (two for binary numbers and ten for decimal numbers) raised to an integral power. To illustrate this with a simple example, the decimal number 4631 is

Equation 12-1.

4631 = (4 · 10³) + (6 · 10²) + (3 · 10¹) + (1 · 10⁰)

The factors 10³, 10², 10¹, and 10⁰ are the digit weights in Eq. (12-1). Similarly, the six-bit binary number 101101 is equal to decimal 45 as shown by

Equation 12-2.

101101₂ = (1 · 2⁵) + (0 · 2⁴) + (1 · 2³) + (1 · 2²) + (0 · 2¹) + (1 · 2⁰) = 32 + 8 + 4 + 1 = 45₁₀

Using subscripts to signify the base of a number, we can write Eq. (12-2) as 101101₂ = 45₁₀. Equation (12-2) shows us that, like decimal numbers, binary numbers use the place value system where the position of a digit signifies its weight. If we use B to denote a number system's base, the place value representation of the four-digit number a₃a₂a₁a₀ is

Equation 12-3.

a₃a₂a₁a₀ = (a₃ · B³) + (a₂ · B²) + (a₁ · B¹) + (a₀ · B⁰)

In Eq. (12-3), Bⁿ is the weight multiplier for the digit aₙ, where 0 ≤ aₙ ≤ B–1. (This place value system of representing numbers is very old—so old, in fact, that its origin is obscure. However, with its inherent positioning of the decimal or binary point, this number system is so convenient and powerful that its importance has been compared to that of the alphabet [1].)

Octal Numbers

As the use of minicomputers and microprocessors rapidly expanded in the 1960s, people grew tired of manipulating long strings of ones and zeros on paper and began to use more convenient ways to represent binary numbers. One way to express a binary number is an octal format, with its base of eight. Converting from binary to octal is as simple as separating the binary number into three-bit groups starting from the right. For example, the binary number 10101001₂ can be converted to octal format as

10101001₂ → 10 101 001 → 2 5 1 → 251₈


Each of the three groups of bits above is easily converted from its binary format to a single octal digit because, for three-bit words, the octal and decimal formats are the same. That is, starting with the left group of bits, 10₂ = 2₁₀ = 2₈, 101₂ = 5₁₀ = 5₈, and 001₂ = 1₁₀ = 1₈. The octal format also uses the place value system meaning that 251₈ = (2 · 8² + 5 · 8¹ + 1 · 8⁰). Octal format enables us to represent the eight-digit 10101001₂ with the three-digit 251₈. Of course, the only valid digits in the octal format are 0 to 7—the digits 8 and 9 have no meaning in octal representation.

Hexadecimal Numbers

Another popular binary format is the hexadecimal number format using 16 as its base. Converting from binary to hexadecimal is done, this time, by separating the binary number into four-bit groups starting from the right. The binary number 10101001₂ is converted to hexadecimal format as

10101001₂ → 1010 1001 → A 9 → A9₁₆


If you haven't seen the hexadecimal format used before, don't let the A9 digits confuse you. In this format, the characters A, B, C, D, E, and F represent the digits whose decimal values are 10, 11, 12, 13, 14, and 15 respectively. We convert the two groups of bits above to two hexadecimal digits by starting with the left group of bits, 1010₂ = 10₁₀ = A₁₆, and 1001₂ = 9₁₀ = 9₁₆. Hexadecimal format numbers also use the place value system, meaning that A9₁₆ = (A · 16¹ + 9 · 16⁰). For convenience, then, we can represent the eight-digit 10101001₂ with the two-digit number A9₁₆. Table 12-1 lists the permissible digit representations in the number systems discussed thus far.
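As a quick check, a few lines of Python reproduce these conversions (a minimal sketch relying on Python's built-in base handling; the variable name n is ours):

```python
# Binary 10101001 expressed in the other bases discussed so far.
n = 0b10101001

print(oct(n))         # 0o251 -> octal 251
print(hex(n))         # 0xa9  -> hexadecimal A9
print(n)              # 169   -> decimal equivalent

# Converting back from octal and hexadecimal strings:
print(int("251", 8))  # 169
print(int("A9", 16))  # 169
```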

Fractional Binary Numbers

Fractions (numbers whose magnitudes are greater than zero and less than one) can also be represented by binary numbers if we use a binary point identical in function to our familiar decimal point. In the binary numbers we've discussed so far, the binary point is assumed to be fixed just to the right of the rightmost digit. Using the symbol ◊ to denote the binary point, the six-bit binary number 11◊0101 is equal to decimal 3.3125 as shown by

Equation 12-4.

11◊0101₂ = (1 · 2¹) + (1 · 2⁰) + (0 · 2⁻¹) + (1 · 2⁻²) + (0 · 2⁻³) + (1 · 2⁻⁴) = 2 + 1 + 0.25 + 0.0625 = 3.3125₁₀

Table 12-1. Allowable Digit Representations vs. Number System Base

Binary   Octal   Decimal   Hexadecimal   Decimal equivalent
  0        0        0           0                 0
  1        1        1           1                 1
           2        2           2                 2
           3        3           3                 3
           4        4           4                 4
           5        5           5                 5
           6        6           6                 6
           7        7           7                 7
                    8           8                 8
                    9           9                 9
                                A                10
                                B                11
                                C                12
                                D                13
                                E                14
                                F                15

For the example in Eq. (12-4), the binary point is set between the second and third most significant bits. Having a stationary position for the binary point is why binary numbers are often called fixed-point binary.

For some binary number formats (like the floating-point formats that we'll cover shortly), the binary point is fixed just to the left of the most significant bit. This forces the number values to be restricted to the range between zero and one. In this format, the largest and smallest values possible for a b-bit fractional word are 1–2⁻ᵇ and 2⁻ᵇ, respectively. Taking a six-bit binary fraction, for example, the largest value is ◊111111₂, or

Equation 12-5.

◊111111₂ = 2⁻¹ + 2⁻² + 2⁻³ + 2⁻⁴ + 2⁻⁵ + 2⁻⁶ = 63/64 = 0.984375

which is 0.984375 in decimal. The smallest nonzero value is ◊000001₂, equaling a decimal 2⁻⁶, or 0.015625.
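To make the fractional place value arithmetic concrete, here's a minimal Python sketch (the helper name frac_bin_to_dec is our own, hypothetical choice):

```python
# Evaluate a binary fraction whose binary point sits just left of the
# most significant bit, by summing the digit weights 2^-1, 2^-2, ...
def frac_bin_to_dec(bits: str) -> float:
    return sum(int(b) * 2**-(i + 1) for i, b in enumerate(bits))

print(frac_bin_to_dec("111111"))  # 0.984375, the largest six-bit fraction
print(frac_bin_to_dec("000001"))  # 0.015625, the smallest nonzero value
```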

Sign-Magnitude Binary Format

For binary numbers to be at all useful in practice, they must be able to represent negative values. Binary numbers do this by dedicating one of the bits in a binary word to indicate the sign of a number. Let's consider a popular binary format known as sign-magnitude. Here, we assume that a binary word's leftmost bit is a sign bit and the remaining bits represent the magnitude of a number which is always positive. For example, we can say that the four-bit number 0011₂ is +3₁₀ and the binary number 1011₂ is equal to –3₁₀, or

+3₁₀ = 0011₂   (sign bit = 0, magnitude 011₂ = 3)
–3₁₀ = 1011₂   (sign bit = 1, magnitude 011₂ = 3)

Of course, using one of the bits as a sign bit reduces the magnitude of the numbers we can represent. If an unsigned binary number's word length is b bits, the number of different values that can be represented is 2ᵇ. An eight-bit word, for example, can represent 2⁸ = 256 different integral values. With zero being one of the values we have to express, a b-bit unsigned binary word can represent integers from 0 to 2ᵇ–1. The largest value represented by an unsigned eight-bit word is 2⁸–1 = 255₁₀ = 11111111₂. In the sign-magnitude binary format a b-bit word can represent only magnitudes no greater than 2ᵇ⁻¹–1, so the largest positive or negative value we can represent by an eight-bit sign-magnitude word is ±(2⁷–1) = ±127.

Two's Complement Format

Another common binary number scheme, known as the two's complement format, also uses the leftmost bit as a sign bit. The two's complement format is the most convenient numbering scheme from a hardware design standpoint, and has been used for decades. It enables computers to perform both addition and subtraction using the same hardware adder logic. To get the negative version of a positive two's complement number, we merely complement (change a one to a zero and change a zero to a one) each bit and add a one to the complemented word. For example, with 0011₂ representing a decimal 3 in two's complement format, we obtain a negative decimal 3 through the following steps:

+3₁₀ = 0011₂ → complement each bit → 1100₂ → add one → 1101₂ = –3₁₀

In the two's complement format, a b-bit word can represent positive amplitudes as great as 2ᵇ⁻¹–1, and negative amplitudes as large as –2ᵇ⁻¹. Table 12-2 shows four-bit word examples of sign-magnitude and two's complement binary formats.
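The complement-and-add-one rule is easy to sketch in Python (the function name and the four-bit default are our choices for illustration):

```python
# Negate a two's complement word: complement each bit, then add one.
def twos_complement_negate(x: int, bits: int = 4) -> int:
    mask = (1 << bits) - 1
    return ((x ^ mask) + 1) & mask

plus3 = 0b0011
minus3 = twos_complement_negate(plus3)
print(format(minus3, "04b"))   # 1101, the -3 entry in Table 12-2
```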

While using two's complement numbers, we have to be careful when adding two numbers of different word lengths. Consider the case where a four-bit number is added to an eight-bit number:

   00001111₂   (+15)
 +     0011₂   ( +3)
 -----------
   00010010₂   (+18)

Table 12-2. Binary Number Formats

Decimal equivalent   Sign-magnitude   Two's complement   Offset binary
         7                0111              0111              1111
         6                0110              0110              1110
         5                0101              0101              1101
         4                0100              0100              1100
         3                0011              0011              1011
         2                0010              0010              1010
         1                0001              0001              1001
        +0                0000              0000              1000
        –0                1000
        –1                1001              1111              0111
        –2                1010              1110              0110
        –3                1011              1101              0101
        –4                1100              1100              0100
        –5                1101              1011              0011
        –6                1110              1010              0010
        –7                1111              1001              0001
        –8                                  1000              0000

No problem so far. The trouble occurs when our four-bit number is negative. Instead of adding a +3 to the +15, let's try to add a –3 to the +15:

   00001111₂   (+15)
 +     1101₂   ( –3)
 -----------
   00011100₂   (+28, an incorrect result)

The above arithmetic error can be avoided by performing what's called a sign-extend operation on the four-bit number. This process, typically performed automatically in hardware, extends the sign bit of the four-bit negative number to the left, making it an eight-bit negative number. If we sign-extend the –3 and then perform the addition, we'll get the correct answer:

   00001111₂   (+15)
 + 11111101₂   ( –3, sign-extended)
 -----------
 (1)00001100₂ → 00001100₂ = +12   (the carry out of the msb is ignored)
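A small Python sketch of sign extension (the helper sign_extend is a hypothetical name; real hardware performs this step automatically):

```python
# Replicate a word's sign bit leftward to widen it without changing
# its two's complement value.
def sign_extend(x: int, from_bits: int, to_bits: int) -> int:
    if x & (1 << (from_bits - 1)):                          # negative?
        x |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return x

minus3 = sign_extend(0b1101, 4, 8)
print(format(minus3, "08b"))               # 11111101

# The corrected addition from the text: +15 plus the sign-extended -3.
print((0b00001111 + minus3) & 0xFF)        # 12
```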

Offset Binary Format

Another useful binary number scheme is known as the offset binary format. While this format is not as common as two's complement, it still shows up in some hardware devices. Table 12-2 shows offset binary format examples for four-bit words. Offset binary represents negative numbers by subtracting 2ᵇ⁻¹ from an unsigned binary value. For example, in the second row of Table 12-2, the offset binary number is 1110₂. When this number is treated as an unsigned binary number, it's equivalent to 14₁₀. For four-bit words b = 4 and 2ᵇ⁻¹ = 8, so 14₁₀ – 8₁₀ = 6₁₀, which is the decimal equivalent of 1110₂ in offset binary. The difference between the unsigned binary equivalent and the actual decimal equivalent of the offset binary numbers in Table 12-2 is always –8. This kind of offset is sometimes referred to as a bias when the offset binary format is used. (It may interest the reader that we can convert back and forth between the two's complement and offset binary formats merely by complementing a word's most significant bit.)
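The msb-complement conversion mentioned in parentheses above is a one-line operation; here's a sketch using the –3 entries from Table 12-2:

```python
# Toggle the most significant bit of a 4-bit word to convert between
# two's complement and offset binary representations.
def toggle_msb(word: int, bits: int = 4) -> int:
    return word ^ (1 << (bits - 1))

print(format(toggle_msb(0b1101), "04b"))  # 0101: -3 in offset binary
print(format(toggle_msb(0b0101), "04b"))  # 1101: back to two's complement
```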

The history, arithmetic, and utility of the many available number formats is a very broad field of study. A thorough and very readable discussion of the subject is given by Knuth in Reference [2].

BINARY NUMBER PRECISION AND DYNAMIC RANGE

As we implied earlier, for any binary number format, the number of bits in a data word is a key consideration. The more bits used in the word, the better the resolution of the number, and the larger the maximum value that can be represented.[†] Assuming that a binary word represents the amplitude of a signal, digital signal processing practitioners find it useful to quantify the dynamic range of various binary number schemes. For a signed integer binary word length of b+1 bits (one sign bit and b magnitude bits), the dynamic range in decibels is defined by

Equation 12-6.

dynamic range (dB) = 20 · log₁₀(largest positive word value / smallest positive word value) = 20 · log₁₀((2ᵇ – 1)/1) = 20 · log₁₀(2ᵇ – 1)

When 2ᵇ is much larger than 1, we can ignore the –1 in Eq. (12-6) and state that

Equation 12-6'.

dynamic range (dB) ≈ 20 · log₁₀(2ᵇ) = 20 · b · log₁₀(2) = 6.02 · b

Equation (12-6'), dimensioned in dB, tells us that the dynamic range of our number system is directly proportional to the word length. Thus, an eight-bit two's complement word, with seven bits available to represent signal magnitude, has a dynamic range of 6.02 · 7 = 42.14 dB. Most people simplify Eq. (12-6') by using the rule of thumb that the dynamic range is equal to “six dB per bit.”
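A quick numerical comparison of the exact expression in Eq. (12-6) with the six-dB-per-bit rule of Eq. (12-6') shows how little is lost by ignoring the –1:

```python
# Exact dynamic range, 20*log10(2**b - 1), versus the 6.02*b rule.
import math

for b in (7, 15, 23):
    exact = 20 * math.log10(2**b - 1)
    print(f"b = {b:2d}: exact {exact:7.2f} dB, rule of thumb {6.02 * b:7.2f} dB")
```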

EFFECTS OF FINITE FIXED-POINT BINARY WORD LENGTH

The effects of finite binary word lengths touch all aspects of digital signal processing. Using finite word lengths prevents us from representing values with infinite precision, increases the background noise in our spectral estimation techniques, creates nonideal digital filter responses, induces noise in analog-to-digital (A/D) converter outputs, and can (if we're not careful) lead to wildly inaccurate arithmetic results. The smaller the word lengths, the greater these problems will be. Fortunately, these finite word-length effects are rather well understood. We can predict their consequences and take steps to minimize any unpleasant surprises. The first finite word-length effect we'll cover is the errors that occur during the A/D conversion process.

A/D Converter Quantization Errors

Practical A/D converters are constrained to have binary output words of finite length. Commercial A/D converters are categorized by their output word lengths, which are normally in the range from 8 to 16 bits. A typical A/D converter input analog voltage range is from –1 to +1 volt. If we used such an A/D converter having eight-bit output words, the least significant bit would represent

Equation 12-7.

1 lsb = 2 volts / 2⁸ = 7.81 millivolts

What this means is that we can represent continuous (analog) voltages perfectly as long as they're integral multiples of 7.81 millivolts—any intermediate input voltage will cause the A/D converter to output a best estimate digital data value. The inaccuracies in this process are called quantization errors because an A/D output least significant bit is an indivisible quantity. We illustrate this situation in Figure 12-1(a), where the continuous waveform is being digitized by an eight-bit A/D converter whose output is in the sign-magnitude format. When we start sampling at time t = 0, the continuous waveform happens to have a value of 31.25 millivolts (mV), and our A/D output data word will be exactly correct for sample x(0). At time T when we get the second A/D output word for sample x(1), the continuous voltage is between 0 and –7.81 mV. In this case, the A/D converter outputs a sample value of 10000001 representing –7.81 mV, even though the continuous input was not quite as negative as –7.81 mV. The 10000001 A/D output word contains some quantization error. Each successive sample contains quantization error because the A/D's digitized output values must lie on a horizontal dashed line in Figure 12-1(a). The difference between the actual continuous input voltage and the A/D converter's representation of the input is shown as the quantization error in Figure 12-1(b). For an ideal A/D converter, the quantization error, a kind of roundoff noise, can never be greater than ±1/2 an lsb, or ±3.905 mV.


Figure 12-1. Quantization errors: (a) digitized x(n) values of a continuous signal; (b) quantization error between the actual analog signal values and the digitized signal values.

While Figure 12-1(b) shows A/D quantization noise in the time domain, we can also illustrate this noise in the frequency domain. Figure 12-2(a) depicts a continuous sinewave of one cycle over the sample interval shown as the dotted line and a quantized version of the time-domain samples of that wave as the dots. Notice how the quantized version of the wave is constrained to have only integral values, giving it a stair step effect oscillating above and below the true unquantized sinewave. The quantization here is 4 bits, meaning that we have a sign bit and three bits to represent the magnitude of the wave. With three bits, the maximum peak values for the wave are ±7. Figure 12-2(b) shows the discrete Fourier transform (DFT) of a discrete version of the sinewave whose time-domain sample values are not forced to be integers, but have high precision. Notice in this case that the DFT has a nonzero value only at m = 1. On the other hand, Figure 12-2(c) shows the spectrum of the 4-bit quantized samples in Figure 12-2(a), where quantization effects have induced noise components across the entire spectral band. If the quantization noise depictions in Figures 12-1(b) and 12-2(c) look random, that's because they are. As it turns out, even though A/D quantization noise is random, we can still quantify its effects in a useful way.

In the field of communications, people often use the notion of output signal-to-noise ratio, or SNR = (signal power)/(noise power), to judge the usefulness of a process or device. We can do likewise and obtain an important expression for the output SNR of an ideal A/D converter, SNR_A/D, accounting for finite word-length quantization effects. Because quantization noise is random, we can't explicitly represent its power level, but we can use its statistical equivalent of variance to define SNR_A/D measured in decibels as


Figure 12-2. Quantization noise effects: (a) input sinewave applied to a 64-point DFT; (b) theoretical DFT magnitude of high-precision sinewave samples; (c) DFT magnitude of a sinewave quantized to 4 bits.

Equation 12-8.

SNR_A/D = 10 · log₁₀(σ²_signal / σ²_A/D noise)

Next, we'll determine an A/D converter's quantization noise variance relative to the converter's maximum input peak voltage Vp. If the full scale (–Vp to +Vp volts) continuous input range of a b-bit A/D converter is 2Vp, a single quantization level q is that voltage range divided by the number of possible A/D output binary values, or q = 2Vp/2ᵇ. (In Figure 12-1, for example, the quantization level q is the lsb value of 7.81 mV.) A depiction of the likelihood of encountering any given quantization error value, called the probability density function p(e) of the quantization error, is shown in Figure 12-3.


Figure 12-3. Probability density function of A/D conversion roundoff error (noise).

This simple rectangular function has much to tell us. It indicates that there's an equal chance that any error value between –q/2 and +q/2 can occur. By definition, because probability density functions have an area of unity (i.e., the probability is 100 percent that the error will be somewhere under the curve), the amplitude of the p(e) density function must be the area divided by the width, or p(e) = 1/q. From Figure D-4 and Eq. (D-12) in Appendix D, the variance of our uniform p(e) is

Equation 12-9.

σ²_A/D noise = q²/12

We can now express the A/D noise error variance in terms of A/D parameters by replacing q in Eq. (12-9) with q = 2Vp/2ᵇ to get

Equation 12-10.

σ²_A/D noise = (2Vp/2ᵇ)²/12 = Vp²/(3 · 2²ᵇ)

OK, we're halfway to our goal—with Eq. (12-10) giving us the denominator of Eq. (12-8), we need the numerator. To arrive at a general result, let's express the input signal in terms of its root mean square (rms), the A/D converter's peak voltage, and a loading factor LF defined as

Equation 12-11.

LF = (rms of the input signal)/Vp

With the loading factor defined as the input rms voltage over the A/D converter's peak input voltage, we square and rearrange Eq. (12-11) to show the signal variance as

Equation 12-12.

σ²_signal = (LF · Vp)² = LF² · Vp²

Substituting Eqs. (12-10) and (12-12) in Eq. (12-8),

Equation 12-13.

SNR_A/D = 10 · log₁₀(LF² · Vp² / (Vp²/(3 · 2²ᵇ))) = 10 · log₁₀(3 · 2²ᵇ · LF²) = 6.02 · b + 4.77 + 20 · log₁₀(LF) dB

Eq. (12-13) gives us the SNR_A/D of an ideal b-bit A/D converter in terms of the loading factor and the number of bits b. Figure 12-4 plots Eq. (12-13) for various A/D word lengths as a function of the loading factor. Notice that the loading factor in Figure 12-4 is never greater than –3 dB, because the maximum continuous A/D input peak value must not be greater than Vp volts. Thus, for a sinusoid input, its rms value must not be greater than Vp/√2 volts (3 dB below Vp).


Figure 12-4. SNR_A/D of ideal A/D converters as a function of loading factor in dB.

† Recall from Appendix D, Section D.2 that, although the variance σ² is associated with the power of a signal, the standard deviation is associated with the rms value of a signal.

When the input sinewave's peak amplitude is equal to the A/D converter's full-scale voltage Vp, the full-scale LF is

Equation 12-14.

LF = (Vp/√2)/Vp = 1/√2, or 20 · log₁₀(1/√2) = –3.01 dB

Under this condition, the maximum A/D output SNR from Eq. (12-13) is

Equation 12-15.

SNR_A/D-max = 6.02 · b + 4.77 – 3.01 = 6.02 · b + 1.76 dB
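We can check Eq. (12-15) numerically. The following sketch is our own simulation, using an idealized b-bit rounding quantizer and an input frequency deliberately unrelated to the sample rate:

```python
import numpy as np

b = 8                                  # A/D converter word length
n = np.arange(100_000)
x = np.sin(2 * np.pi * 0.012345 * n)   # full-scale sinewave, Vp = 1
q = 2.0 / 2**b                         # quantization level, Eq. (12-7)
xq = np.round(x / q) * q               # idealized rounding quantizer

noise = xq - x
snr_measured = 10 * np.log10(np.var(x) / np.var(noise))
print(f"measured: {snr_measured:.2f} dB")        # close to...
print(f"Eq. (12-15): {6.02 * b + 1.76:.2f} dB")  # ...49.92 dB for b = 8
```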

This discussion of SNR relative to A/D converters means three important things to us:

  1. An ideal A/D converter will have an SNR_A/D defined by Eq. (12-13), so any continuous signal we're trying to digitize with a b-bit A/D converter will never have an SNR_A/D greater than Eq. (12-13) after A/D conversion. For example, let's say we want to digitize a continuous signal whose SNR is 55 dB. Using an ideal eight-bit A/D converter with its full-scale SNR_A/D of 6.02 · 8 + 1.76 = 49.9 dB from Eq. (12-15), the quantization noise will contaminate the digitized values, and the resultant digital signal's SNR can be no better than 49.9 dB. We'll have lost signal SNR through the A/D conversion process. (A ten-bit A/D, with its ideal SNR_A/D ≈ 62 dB, could be used to digitize a 55 dB SNR continuous signal to reduce the SNR degradation caused by quantization noise.) Equations (12-13) and (12-15) apply to ideal A/D converters and don't take into account such additional A/D noise sources as aperture jitter error, missing output bit patterns, and other nonlinearities. So actual A/D converters are likely to have SNRs that are lower than that indicated by theoretical Eq. (12-13). To be safe in practice, it's sensible to assume that SNR_A/D-max is 3 to 6 dB lower than indicated by Eq. (12-15).

  2. Equation (12-15) is often expressed in the literature, but it can be a little misleading because it's imprudent to force an A/D converter's input to full scale. It's wise to drive an A/D converter to some level below full scale because inadvertent overdriving will lead to signal clipping and will induce distortion in the A/D's output. So Eq. (12-15) is overly optimistic, and, in practice, A/D converter SNRs will be less than indicated by Eq. (12-15). The best approximation for an A/D's SNR is to determine the input signal's rms value that will never (or rarely) overdrive the converter input, and plug that value into Eq. (12-11) to get the loading factor value for use in Eq. (12-13).[†] Again, using an A/D converter with a wider word length will alleviate this problem by increasing the available SNR_A/D.

  3. Remember now, real-world continuous signals always have their own inherent continuous SNR, so using an A/D converter whose SNR_A/D is a great deal larger than the continuous signal's SNR serves no purpose. In this case, we'd be using the A/D converter's extra bits to digitize the continuous signal's noise to a greater degree of accuracy.

A word of caution is appropriate here concerning our analysis of A/D converter quantization errors. The derivations of Eqs. (12-13) and (12-15) are based upon three assumptions:

  1. The cause of A/D quantization errors is a stationary random process; that is, the performance of the A/D converter does not change over time. Given the same continuous input voltage, we always expect an A/D converter to provide exactly the same output binary code.

  2. The probability density function of the A/D quantization error is uniform. We're assuming that the A/D converter is ideal in its operation and all possible errors between –q/2 and +q/2 are equally likely. An A/D converter having stuck bits or missing output codes would violate this assumption. High-quality A/D converters being driven by continuous signals that cross many quantization levels will result in our desired uniform quantization noise probability density function.

  3. The A/D quantization errors are uncorrelated with the continuous input signal. If we were to digitize a single continuous sinewave whose frequency was harmonically related to the A/D sample rate, we'd end up sampling the same input voltage repeatedly and the quantization error sequence would not be random. The quantization error would be predictable and repetitive, and our quantization noise variance derivation would be invalid. In practice, complicated continuous signals such as music or speech, with their rich spectral content, avoid this problem.

To conclude our discussion of A/D converters, let's consider one last topic. In the literature the reader is likely to encounter the expression

Equation 12-16.

b_eff = (SNR_measured – 1.76)/6.02

Equation (12-16) is used by test equipment manufacturers to specify the sensitivity of test instruments using a b_eff parameter known as the number of effective bits, or effective number of bits (ENOB) [3–8]. Equation (12-16) is merely Eq. (12-15) solved for b. Test equipment manufacturers measure the actual SNR of their product indicating its ability to capture continuous input signals relative to the instrument's inherent noise characteristics. Given this true SNR, they use Eq. (12-16) to determine the b_eff value for advertisement in their product literature. The larger b_eff, the greater the continuous voltage that can be accurately digitized relative to the equipment's intrinsic quantization noise.
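In code, Eq. (12-16) is a one-line computation (the 55.2 dB figure below is just an assumed example of a measured SNR):

```python
# Effective number of bits from a measured SNR, per Eq. (12-16).
def enob(snr_db: float) -> float:
    return (snr_db - 1.76) / 6.02

print(enob(55.2))   # about 8.9 effective bits
```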

Data Overflow

The next finite, word-length effect we'll consider is called overflow. Overflow is what happens when the result of an arithmetic operation has too many bits, or digits, to be represented in the hardware registers designed to contain that result. We can demonstrate this situation to ourselves rather easily using a simple four-function, eight-digit pocket calculator. The sum of a decimal 9.9999999 plus 1.0 is 10.9999999, but on an eight-digit calculator the sum is 10.999999 as

    9.9999999
 +  1.0
 -----------
   10.999999   (the ninth significant digit is lost)

The hardware registers, which contain the arithmetic result and drive the calculator's display, can hold only eight decimal digits; so the least significant digit is discarded (of course). Although the above error is less than one part in ten million, overflow effects can be striking when we work with large numbers. If we use our calculator to add 99,999,999 plus 1, instead of getting the correct result of 100 million, we'll get a result of 1. Now that's an authentic overflow error!

Let's illustrate overflow effects with examples more closely related to our discussion of binary number formats. First, adding two unsigned binary numbers is as straightforward as adding two decimal numbers. The sum of 42 plus 39 is 81, or

    101010₂   (42)
 +  100111₂   (39)
 ----------
   1010001₂   (81)

In this case, two 6-bit binary numbers required 7 bits to represent the results. The general rule is the sum of m individual b-bit binary numbers can require as many as [b + log₂(m)] bits to represent the results. So, for example, a 24-bit result register (accumulator) is needed to accumulate the sum of sixteen 20-bit binary numbers, or 20 + log₂(16) = 24. The sum of 256 eight-bit words requires an accumulator whose word length is [8 + log₂(256)], or 16 bits, to ensure that no overflow errors occur.

In the preceding example, if our accumulator word length was 6 bits, an overflow error occurs as

    101010₂   (42)
 +  100111₂   (39)
 ----------
 (1)010001₂ → 010001₂ = 17, not 81

Here, the most significant bit of the result overflowed the 6-bit accumulator, and an error occurred.
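The accumulator-sizing rule above is simple to encode (a sketch; the ceil() call extends the rule to values of m that aren't powers of two):

```python
import math

# Bits needed to accumulate m b-bit words without overflow: b + log2(m).
def accumulator_bits(b: int, m: int) -> int:
    return b + math.ceil(math.log2(m))

print(accumulator_bits(20, 16))   # 24, as in the text
print(accumulator_bits(8, 256))   # 16, as in the text
```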

With regard to overflow errors, the two's complement binary format has two interesting characteristics. First, under certain conditions, overflow during the summation of two numbers causes no error. Second, with multiple summations, intermediate overflow errors cause no problems if the final magnitude of the sum of the b-bit two's complement numbers is less than 2ᵇ⁻¹. Let's illustrate these properties by considering the four-bit two's complement format in Figure 12-5, whose binary values are taken from Table 12-2.


Figure 12-5. Four-bit two's complement binary numbers.

The first property of two's complement overflow, which sometimes causes no errors, can be shown by the following examples:

   1100₂   (–4)
 + 1101₂   (–3)
 -------
(1)1001₂ → 1001₂ = –7, correct (the overflow bit is discarded)

   0110₂   (+6)
 + 1101₂   (–3)
 -------
(1)0011₂ → 0011₂ = +3, correct (the overflow bit is discarded)

Then again, the following examples show how two's complement overflow sometimes does cause errors:

   0110₂   (+6)
 + 0111₂   (+7)
 -------
   1101₂ = –3, incorrect (+13 exceeds the four-bit range)

   1010₂   (–6)
 + 1001₂   (–7)
 -------
(1)0011₂ → 0011₂ = +3, incorrect (–13 exceeds the four-bit range)

The rule with two's complement addition is if the carry bit into the sign bit is the same as the overflow bit out of the sign bit, the overflow bit can be ignored, causing no errors; if the carry bit into the sign bit is different from the overflow bit out of the sign bit, the result is invalid. An even more interesting property of two's complement numbers is that a series of b-bit word summations can be performed where intermediate sums are invalid, but the final sum will be correct if its magnitude is less than 2ᵇ⁻¹. We show this by the following example. If we add a +6 to a +7, and then add a –7, we'll encounter an intermediate overflow error but our final sum will be correct as

   0110₂   (+6)
 + 0111₂   (+7)
 -------
   1101₂   (an invalid intermediate sum, read as –3)
 + 1001₂   (–7)
 -------
(1)0110₂ → 0110₂ = +6, the correct final sum

The magnitude of the sum of the three four-bit numbers was less than 2⁴⁻¹ = 8, so our result was valid. If we add a +6 to a +7, and next add a –5, we'll encounter an intermediate overflow error, and our final sum will also be in error because its magnitude is not less than 8.

   0110₂   (+6)
 + 0111₂   (+7)
 -------
   1101₂   (an invalid intermediate sum, read as –3)
 + 1011₂   (–5)
 -------
(1)1000₂ → 1000₂ = –8, an incorrect final sum (the true sum, +8, is not less than 8)
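Both outcomes follow from the fact that two's complement addition is arithmetic modulo 2ᵇ. A short Python sketch (helper names are ours) mimics a four-bit accumulator and reproduces the two series above:

```python
# Wrap a Python integer into a 4-bit two's complement word, and back.
def to_twos(x: int, bits: int = 4) -> int:
    return x & ((1 << bits) - 1)

def from_twos(w: int, bits: int = 4) -> int:
    return w - (1 << bits) if w & (1 << (bits - 1)) else w

acc = to_twos(to_twos(6) + to_twos(7))         # 1101, invalid as +13
print(from_twos(to_twos(acc + to_twos(-7))))   # 6: correct final sum
print(from_twos(to_twos(acc + to_twos(-5))))   # -8: invalid, since +8
                                               # exceeds the 4-bit range
```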

Another situation where overflow problems are conspicuous is during the calculation of the fast Fourier transform (FFT). It's difficult at first to imagine that multiplying complex numbers by sines and cosines can lead to excessive data word growth—particularly because sines and cosines are never greater than unity. Well, we can show how FFT data word growth occurs by considering a decimation-in-time FFT butterfly from Figure 4-14(c) repeated here as Figure 12-6, and grinding through a little algebra. The expression for the x' output of this FFT butterfly, from Eq. (4-26), is

Equation 12-17.

x' = x + W_N^k · y


Figure 12-6. Single decimation-in-time FFT butterfly.

Breaking up the butterfly's x and y inputs into their real and imaginary parts and remembering that W_N^k = e^(–j2πk/N), we can express Eq. (12-17) as

Equation 12-18.

x'_real + jx'_imag = x_real + jx_imag + e^(–j2πk/N) · (y_real + jy_imag)

If we let α be the twiddle factor angle of 2πk/N, and recall that e^(–jα) = cos(α) – jsin(α), we can simplify Eq. (12-18) as

Equation 12-19.

x'_real + jx'_imag = x_real + jx_imag + [cos(α) – jsin(α)] · (y_real + jy_imag)
                   = x_real + y_real·cos(α) + y_imag·sin(α) + j[x_imag + y_imag·cos(α) – y_real·sin(α)]

If we look, for example, at just the real part of the x' output, x'_real, it comprises the three terms

Equation 12-20.

x'_real = x_real + y_real·cos(α) + y_imag·sin(α)

If x_real, y_real, and y_imag are of unity value when they enter the butterfly and the twiddle factor angle α = 2πk/N happens to be π/4 = 45°, then x'_real can be greater than 2 as

Equation 12-21.

x'_real = 1 + 1·cos(45°) + 1·sin(45°) = 1 + 0.707 + 0.707 = 2.414

So we see that the real part of a complex number can more than double in magnitude in a single stage of an FFT. The imaginary part of a complex number is equally likely to more than double in magnitude in a single FFT stage. Without mitigating this word growth problem, overflow errors could render an FFT algorithm useless.
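A quick arithmetic check of Eq. (12-21):

```python
import math

# Worst-case butterfly output growth with unity inputs and a 45-degree
# twiddle angle, per Eq. (12-21).
alpha = math.pi / 4
x_real, y_real, y_imag = 1.0, 1.0, 1.0
print(x_real + y_real * math.cos(alpha) + y_imag * math.sin(alpha))
# 2.414..., i.e. 1 + sqrt(2)
```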

OK, overflow problems are handled in one of two ways—by truncation or rounding—each inducing its own individual kind of quantization errors, as we shall see.

Truncation

Truncation is the process where a data value is represented by the largest quantization level that is less than or equal to that data value. If we're quantizing to integral values, for example, the real value 1.2 would be quantized to 1. An example of truncation to integral values is shown in Figure 12-7(a), where all values of x in the range of 0 ≤ x < 1 are set equal to 0, values of x in the range of 1 ≤ x < 2 are set equal to 1, x values in the range of 2 ≤ x < 3 are set equal to 2, and so on.


Figure 12-7. Truncation: (a) quantization nonlinearities; (b) error probability density function.

As we did with A/D converter quantization errors, we can call upon the concept of probability density functions to characterize the errors induced by truncation. The probability density function of truncation errors, in terms of the quantization level, is shown in Figure 12-7(b). In Figure 12-7(a) the quantization level q is 1, so, in this case, we can have truncation errors as great as –1. Drawing upon our results from Eqs. (D-11) and (D-12) in Appendix D, the mean and variance of our uniform truncation probability density function are expressed as

Equation 12-22.

μ_truncation = –q/2

and

Equation 12-23.

σ²_truncation = q²/12

In a sense, truncation error is the price we pay for the privilege of using integer binary arithmetic. One aspect of this is the error introduced when we use truncation to implement division by some integral power of 2. A quick way of dividing a binary value by 2ᵀ is to shift that binary word T bits to the right; that is, we're truncating the data value (not the data word) by lopping off the rightmost T bits after the shift. For example, let's say we have the value 31 represented by the five-bit binary number 11111₂, and we want to divide it by 16 through shifting the bits T = 4 places to the right and ignoring (truncating) those shifted bits. After the right shift and truncation, we'd have a binary quotient of 31/16 = 00001₂. Well, we see the significance of the problem because our quick division gave us an answer of one instead of the correct 31/16 = 1.9375. Our division-by-truncation error here is almost 50 percent of the correct answer. Had our original dividend been a 63 represented by the six-bit binary number 111111₂, dividing it by 16 through a four-bit shift would give us an answer of binary 000011₂, or decimal three. The correct answer, of course, is 63/16 = 3.9375. In this case the percentage error is 0.9375/3.9375, or about 23.8 percent. So, the larger the dividend, the lower the truncation error.

If we study these kinds of errors, we'll find that the resulting truncation error depends on three things: the number of value bits shifted and truncated, the values of the truncated bits (were those dropped bits ones or zeros), and the magnitude of the binary number left over after truncation. Although a complete analysis of these truncation errors is beyond the scope of this book, we can quantify the maximum error that can occur with our division-by-truncation scheme when using binary integer arithmetic. The worst case scenario is when all of the T bits to be truncated are ones. For binary integral numbers the value of T bits, all ones, to the right of the binary point is 1 – 2⁻ᵀ. If the resulting quotient N is small, the significance of those truncated ones aggravates the error. We can normalize the maximum division error by comparing it to the correct quotient using a percentage. So the maximum percentage error of the binary quotient N after truncation for a T-bit right shift, when the truncated bits are all ones, is

Equation 12-24.

maximum percentage error = 100 · (1 – 2⁻ᵀ) / (N + 1 – 2⁻ᵀ)

Let's plug a few numbers into Eq. (12-24) to understand its significance. In the first example above, 31/16, where T = 4 and the resulting binary quotient was N = 1, the percentage error is 100 · 0.9375/1.9375, or about 48.4 percent.

Plotting Eq. (12-24) for three different shift values as functions of the quotient, N, after truncation results in Figure 12-8.


Figure 12-8. Maximum error when data shifting and truncation are used to implement division by 2ᵀ. T = 1 is division by 2; T = 2 is division by 4; and T = 8 is division by 256.

So, to minimize this type of division error, we need to keep the resulting binary quotient N, after the division, as large as possible. Remember, Eq. (12-24) was a worst case condition where all of the truncated bits were ones. On the other hand, if all of the truncated bits happened to be zeros, the error will be zero. In practice, the errors will be somewhere between these two extremes. A practical example of how division-by-truncation can cause serious numerical errors is given in Reference [9].
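Here's a short sketch of division by right-shift truncation and the worst-case error of Eq. (12-24), reproducing the 31/16 and 63/16 examples:

```python
# Worst-case percentage error of Eq. (12-24) for quotient N after a
# T-bit right shift, when the truncated bits are all ones.
def max_trunc_error_pct(N: int, T: int) -> float:
    frac = 1 - 2**-T          # value of the T truncated one-bits
    return 100 * frac / (N + frac)

print(31 >> 4, 31 / 16)             # 1 versus the true 1.9375
print(max_trunc_error_pct(1, 4))    # about 48.4 percent
print(max_trunc_error_pct(3, 4))    # about 23.8 percent
```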

Data Rounding

Rounding is an alternate process of dealing with overflow errors where a data value is represented by, or rounded off to, its nearest quantization level. If we're quantizing to integral values, the number 1.2 would be quantized to 1, and the number 1.6 would be quantized to 2. This is shown in Figure 12-9(a), where all values of x in the range of –0.5 ≤ x < 0.5 are set equal to 0, values of x in the range of 0.5 ≤ x < 1.5 are set equal to 1, x values in the range of 1.5 ≤ x < 2.5 are set equal to 2, etc. The probability density function of the error induced by rounding, in terms of the quantization level, is shown in Figure 12-9(b). In Figure 12-9(a), the quantization level q is 1, so, in this case, we can have roundoff error magnitudes no greater than q/2, or 1/2. Again, using our Eqs. (D-11) and (D-12) results from Appendix D, the mean and variance of our uniform roundoff probability density function are expressed as


Figure 12-9. Roundoff: (a) quantization nonlinearities; (b) error probability density function.

Equation 12-25.

μ_roundoff = 0

and

Equation 12-26.

σ²_roundoff = q²/12

Because the mean (average) and maximum error induced by roundoff are less than those of truncation, rounding is generally preferred, in practice, to minimize overflow errors.
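A quick simulation confirms these error statistics for q = 1 (our own check, using floor() as the truncation operator of Figure 12-7):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-100, 100, 1_000_000)

trunc_err = np.floor(x) - x
round_err = np.round(x) - x
print(trunc_err.mean(), trunc_err.var())  # near -0.5 and 1/12 = 0.0833
print(round_err.mean(), round_err.var())  # near  0.0 and 1/12 = 0.0833
```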

In digital signal processing, statistical analysis of quantization error effects is exceedingly complicated. Analytical results depend on the types of quantization errors, the magnitude of the data being represented, the numerical format used, and which of the many FFT or digital filter structures we happen to use. Be that as it may, digital signal processing experts have developed simplified error models whose analysis has proved useful. Although discussion of these analysis techniques and their results is beyond the scope of this introductory text, many references are available for the energetic reader [10–18]. (Reference [11] has an extensive reference list of its own on the topic of quantization error analysis.)

Again, the overflow problems using fixed-point binary formats—that we try to alleviate with truncation or roundoff—arise because so many digital signal processing algorithms comprise large numbers of additions or multiplications. This obstacle, particularly in hardware implementations of digital filters and the FFT, is avoided by hardware designers through the use of floating-point binary formats.

FLOATING-POINT BINARY FORMATS

Floating-point binary formats allow us to overcome most of the limitations of precision and dynamic range mandated by fixed-point binary formats, particularly in reducing the ill effects of overflow [19]. Floating-point formats segment a data word into two parts: a mantissa m and an exponent e. Using these parts, the value of a binary floating-point number n is evaluated as shown in Eq. (12-27); that is, the number's value is the product of the mantissa and 2 raised to the power of the exponent. (Mantissa is a somewhat unfortunate choice of terms because it has a meaning here very different from that in the mathematics of logarithms. Mantissa originally meant the decimal fraction of a logarithm.[†] However, due to its abundance in the literature we'll continue using the term mantissa here.) Of course, both the mantissa and the exponent in Eq. (12-27) can be either positive or negative numbers.

Equation 12-27.

n = m · 2ᵉ

Let's assume that a b-bit floating-point number will use be bits for the fixed-point signed exponent and bm bits for the fixed-point signed mantissa. The greater the number of be bits used, the larger the dynamic range of the number. The more bits used for bm, the better the resolution, or precision, of the number. Early computer simulations conducted by the developers of b-bit floating-point formats indicated that the best trade-off occurred with be ≈ b/4 and bm ≈ 3b/4. We'll see that, for typical 32-bit floating-point formats used today, be ≈ 8 bits and bm ≈ 24 bits. To take advantage of a mantissa's full dynamic range, most implementations of floating-point numbers treat the mantissa as a fractional fixed-point binary number, shift the mantissa bits to the right or left, so that its most significant bit is a one, and adjust the exponent accordingly. This convention is called normalization. When normalized, the mantissa bits are typically called the fraction of the floating-point number, instead of the mantissa. For example, the decimal value 3.6875₁₀ can be represented by the fractional binary number 11.1011₂. If we use a two-bit exponent with a six-bit fraction floating-point word, we can just as well represent 11.1011₂ by shifting it to the right two places and setting the exponent to two as

Equation 12-28.

11.1011₂ = .111011₂ · 2²   (exponent = 10₂, fraction = 111011₂)

The floating-point word above can be evaluated to retrieve our decimal number again as

Equation 12-29.

.111011₂ · 2² = (2⁻¹ + 2⁻² + 2⁻³ + 2⁻⁵ + 2⁻⁶) · 4 = 0.921875 · 4 = 3.6875₁₀

After some experience using floating-point normalization, users soon realized that always having a one in the most significant bit of the fraction was wasteful. That redundant one was taking up a single bit position in all data words and serving no purpose. So practical implementations of floating-point formats discard that one, assume its existence, and increase the useful number of fraction bits by one. This is why the term hidden bit is used to describe some floating-point formats. While increasing the fraction's precision, this scheme uses less memory because the hidden bit is merely accounted for in the hardware arithmetic logic. Using a hidden bit, the fraction in Eq. (12-28)'s floating point number is shifted to the left one place and would now be

Equation 12-30.

fraction = .110110₂   (with the hidden bit and an exponent of 01₂, the word is read as 1.110110₂ · 2¹ = 3.6875₁₀)

Recall that the exponent and mantissa bits were fixed-point signed binary numbers, and we've discussed several formats for representing signed binary numbers, i.e., sign magnitude, two's complement, and offset binary. As it turns out, all three signed binary formats are used in industry-standard floating-point formats. The most common floating-point formats, all using 32-bit words, are listed in Table 12-3.

The IEEE P754 floating-point format is the most popular because so many manufacturers of floating-point integrated circuits comply with this standard [8, 20–22]. Its exponent e is offset binary (biased exponent), and its fraction is a sign-magnitude binary number with a hidden bit that's assumed to be 2⁰. The decimal value of a normalized IEEE P754 floating-point number is evaluated as

Equation 12-31.

value = (–1)ˢ · 1.f · 2^(e–127)
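A short Python sketch unpacks the three IEEE P754 fields and applies Eq. (12-31) by hand (struct is used only to obtain the 32 bits of 3.6875, the value from our earlier example):

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", 3.6875))[0]
s = bits >> 31             # sign bit
e = (bits >> 23) & 0xFF    # 8-bit biased exponent
f = bits & 0x7FFFFF        # 23-bit fraction

value = (-1)**s * (1 + f / 2**23) * 2.0**(e - 127)   # Eq. (12-31)
print(s, e, f, value)      # 0 128 7077888 3.6875
```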

The IBM floating-point format differs somewhat from the other floating-point formats because it uses a base of 16 rather than 2. Its exponent is offset binary, and its fraction is sign magnitude with no hidden bit. The decimal value of a normalized IBM floating-point number is evaluated as

Equation 12-32.

value = (–1)ˢ · 0.f · 16^(e–64)

Table 12-3. Floating-Point Number Formats

IEEE Standard P754 Format

Bit:      31 | 30   29   28   27   26   25   24   23 | 22    21    20   . . .   2     1     0
Weight:    S | 2⁷   2⁶   2⁵   2⁴   2³   2²   2¹   2⁰ | 2⁻¹   2⁻²   2⁻³  . . .  2⁻²¹  2⁻²²  2⁻²³
    Sign (s) |           ← Exponent (e) →            |            ← Fraction (f) →

IBM Format

Bit:      31 | 30   29   28   27   26   25   24 | 23    22    21    20   . . .   2     1     0
Weight:    S | 2⁶   2⁵   2⁴   2³   2²   2¹   2⁰ | 2⁻¹   2⁻²   2⁻³   2⁻⁴  . . .  2⁻²²  2⁻²³  2⁻²⁴
    Sign (s) |        ← Exponent (e) →          |               ← Fraction (f) →

DEC (Digital Equipment Corp.) Format

Bit:      31 | 30   29   28   27   26   25   24   23 | 22    21    20   . . .   2     1     0
Weight:    S | 2⁷   2⁶   2⁵   2⁴   2³   2²   2¹   2⁰ | 2⁻²   2⁻³   2⁻⁴  . . .  2⁻²²  2⁻²³  2⁻²⁴
    Sign (s) |           ← Exponent (e) →            |            ← Fraction (f) →

MIL-STD 1750A Format

Bit:      31   30    29   . . .   11     10     9      8  | 7    6    5    4    3    2    1    0
Weight:   2⁰   2⁻¹   2⁻²  . . .  2⁻²⁰   2⁻²¹   2⁻²²   2⁻²³ | 2⁷   2⁶   2⁵   2⁴   2³   2²   2¹   2⁰
                   ← Fraction (f) →                        |        ← Exponent (e) →

The DEC floating-point format uses an offset binary exponent, and its fraction is sign magnitude with a hidden bit that's assumed to be 2⁻¹. The decimal value of a normalized DEC floating-point number is evaluated as

Equation 12-33.

value = (–1)ˢ · 0.1f · 2^(e–128)

MIL-STD 1750A is a United States Military Airborne floating-point standard. Its exponent e is a two's complement binary number residing in the least significant eight bits. MIL-STD 1750A's fraction is also a two's complement number (with no hidden bit), and that's why no sign bit is specifically indicated in Table 12-3. The decimal value of a MIL-STD 1750A floating-point number is evaluated as

Equation 12-34.

value = f · 2ᵉ

Notice how the floating-point formats in Table 12-3 all have word lengths of 32 bits. This was not accidental. Using 32-bit words makes these formats easier to handle using 8-, 16-, and 32-bit hardware processors. That fact notwithstanding, and given the advantages afforded by floating-point number formats, these formats do require a significant amount of logical comparisons and branching to correctly perform arithmetic operations. Reference [23] provides useful flow charts showing what procedural steps must be taken when floating-point numbers are added and multiplied.

Floating-Point Dynamic Range

Attempting to determine the dynamic range of an arbitrary floating-point number format is a challenging exercise. We start by repeating the expression for a number system's dynamic range from Eq. (12-6) as

Equation 12-35.

dynamic range (dB) = 20 · log₁₀(largest possible word value / smallest possible word value)

When we attempt to determine the largest and smallest possible values for a floating-point number format, we quickly see that they depend on such factors as

  • the position of the binary point

  • whether a hidden bit is used or not (If used, its position relative to the binary point is important.)

  • the base value of the floating-point number format

  • the signed binary format used for the exponent and the fraction (For example, recall from Table 12-2 that the binary two's complement format can represent larger negative numbers than the sign-magnitude format.)

  • how unnormalized fractions are handled, if at all. (Unnormalized, also called gradual underflow, means a nonzero number that's less than the minimum normalized format but can still be represented when the exponent and hidden bit are both zero.)

  • how exponents are handled when they're either all ones or all zeros. (For example, the IEEE P754 format treats a number having an all ones exponent and a nonzero fraction as an invalid number, whereas the DEC format handles a number having a sign = 1 and a zero exponent as a special instruction instead of a valid number.)

Trying to develop a dynamic range expression that accounts for all the possible combinations of the above factors is impractical. What we can do is derive a rule of thumb expression for dynamic range that's often used in practice [8,22,24].

Let's assume the following for our derivation: the exponent is a be-bit offset binary number, the fraction is a normalized sign-magnitude number having a sign bit and bm magnitude bits, and a hidden bit is used just left of the binary point. Our hypothetical floating-point word takes the following form:

Bit:         | bm+be–1    bm+be–2   . . .   bm+1   bm | bm–1   bm–2   . . .   1   0
Weight:    S | 2^(be–1)   2^(be–2)  . . .   2¹     2⁰ | 2⁻¹    2⁻²    . . .   2^(–bm+1)   2^(–bm)
    Sign (s) |           ← Exponent (e) →             |           ← Fraction (f) →

First we'll determine what the largest value can be for our floating-point word. The largest fraction is a one in the hidden bit, and the remaining bm fraction bits are all ones; this makes the fraction equal to 1 + (1 – 2^(–bm)). The first 1 in this expression is the hidden bit to the left of the binary point, and the value in parentheses is all bm bits equal to ones to the right of the binary point. The greatest positive value we can have for the be-bit offset binary exponent is 2^(be–1). So the largest value that can be represented with the floating-point number is the largest fraction times 2 raised to the largest positive exponent, or

Equation 12-36.

largest possible value = [1 + (1 – 2^(–bm))] · 2^(2^(be–1)) = (2 – 2^(–bm)) · 2^(2^(be–1))

The smallest value we can represent with our floating-point word is a one in the hidden bit times two raised to the exponent's most negative value, –2^(be–1), or

Equation 12-37.

smallest possible value = 1 · 2^(–2^(be–1)) = 2^(–2^(be–1))

Plugging Eqs. (12-36) and (12-37) into Eq. (12-35),

Equation 12-38.

dynamic range (dB) = 20 · log₁₀[(2 – 2^(–bm)) · 2^(2^(be–1)) / 2^(–2^(be–1))] = 20 · log₁₀[(2 – 2^(–bm)) · 2^(2^be)]

Now here's where the thumb comes in—when bm is large, say over seven, the value 2^(–bm) approaches zero; that is, as bm increases, the all ones fraction value 2 – 2^(–bm) in the numerator approaches 2. Assuming this, Eq. (12-38) becomes

Equation 12-39.

dynamic range (dB) ≈ 20 · log₁₀[2 · 2^(2^be)] = 20 · log₁₀[2^(2^be + 1)] = 6.02 · (2^be + 1)

Using Eq. (12-39) we can estimate, for example, the dynamic range of the single-precision IEEE P754 standard floating-point format with its eight-bit exponent:

Equation 12-40.

dynamic range (dB) ≈ 6.02 · (2⁸ + 1) = 6.02 · 257 ≈ 1547 dB
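In code, Eq. (12-39)'s rule of thumb is a one-liner:

```python
# Rule-of-thumb floating-point dynamic range for a be-bit exponent.
def float_dynamic_range_db(be: int) -> float:
    return 6.02 * (2**be + 1)

print(float_dynamic_range_db(8))   # about 1547 dB, per Eq. (12-40)
```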

Although we've introduced the major features of the most common floating-point formats, there are still more details to learn about floating-point numbers. For the interested reader, the references given in this section provide a good place to start.

BLOCK FLOATING-POINT BINARY FORMAT

A marriage of fixed-point and floating-point binary formats is known as block floating point. This scheme is used, particularly in dedicated FFT integrated circuits, when large arrays, or blocks, of associated data are to be manipulated mathematically. Block floating-point schemes begin by examining all the words in a block of data, normalizing the largest-valued word's fraction, and establishing the correct exponent. This normalization takes advantage of the fraction's full dynamic range. Next, the fractions of the remaining data words are shifted appropriately, so that they can use the exponent of the largest word. In this way, all of the data words use the same exponent value to conserve hardware memory.

In FFT implementations, the arithmetic is performed treating the block normalized data values as fixed-point binary. However, when an addition causes an overflow condition, all of the data words are shifted one bit to the right (division by two), and the exponent is incremented by one. As the reader may have guessed, block floating-point formats have increased dynamic range and avoid the overflow problems inherent in fixed-point formats, but do not reach the performance level of true floating-point formats [8,25,26].
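Here's a minimal block floating-point sketch (our own illustration, assuming a block of nonzero values and a 15-bit fraction):

```python
import math

# Choose one exponent for the whole block from its largest magnitude,
# then store every value as a fixed-point fraction of that exponent.
def block_float(block, frac_bits=15):
    exp = max(1, math.ceil(math.log2(max(abs(v) for v in block))))
    scale = 2**frac_bits
    words = [round(v / 2**exp * scale) for v in block]
    return words, exp

def unblock(words, exp, frac_bits=15):
    return [w / 2**frac_bits * 2**exp for w in words]

words, exp = block_float([0.9, -7.3, 2.5])
print(exp, unblock(words, exp))   # 3, values close to the originals
```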

REFERENCES



[†] Binary numbers are used because early electronic computer pioneers quickly realized that it was much more practical and reliable to use electrical devices (relays, vacuum tubes, transistors, etc.) that had only two states, on or off. Thus, the on/off state of a device could represent a single binary digit.

[†] Some computers use 64-bit words. Now, 2⁶⁴ is approximately equal to 1.8 · 10¹⁹—that's a pretty large number. So large, in fact, that if we started incrementing a 64-bit counter once per second at the beginning of the universe (≈20 billion years ago), the most significant four bits of this counter would still be all zeros today.

[†] By the way, some folks use the term crest factor to describe how hard an A/D converter's input is being driven. The crest factor is the reciprocal of the loading factor, or CF = Vp/(rms of the input signal).

[†] For example, the common logarithm (log to the base 10) of 256 is 2.4082. The 2 to the left of the decimal point is called the characteristic of the logarithm and the 4082 digits are called the mantissa. The 2 in 2.4082 does not mean that we multiply .4082 by 10². The 2 means that we take the antilog of .4082 to get 2.56 and multiply that by 10² to get 256.
