Overflow and Scaling


5. FIXED-POINT VS. FLOATING-POINT

-0.25, respectively. The result is -0.1875, which falls within the fractional range. Notice that the hardware generates the same 1s and 0s; what is different is the interpretation of the bits.

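    Binary (integer)                Fractional

        0110         6                  0.110        0.75
      * 1110      * -2                * 1.110     * -0.25
      ------                          -------
        0000                             0000
       0110                             0110
      0110                             0110
     1010                             1010
    --------                        ---------
    11110100       -12              11.110100     -0.1875

In the fractional product 11.110100, the leftmost bit is the extended sign bit and the next bit is the sign bit. The best approximation in 4-bit memory is 1.110, that is, -0.25.

Note that since the MSB of the multiplier is a sign bit, the corresponding partial product (1010) is the 2's complement of the multiplicand.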

Figure 5.4: Binary and fractional multiplication.
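As a minimal sketch of the point that the bits are identical and only their interpretation differs, the following Python snippet multiplies the two 4-bit operands of Figure 5.4 and reads the 8-bit product both as an integer and as a fraction. The helper name `to_signed` is hypothetical and is introduced only for this illustration.

```python
def to_signed(bits: int, width: int) -> int:
    """Read the low `width` bits of `bits` as a two's-complement integer."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits & (1 << (width - 1)) else bits


a_bits, b_bits = 0b0110, 0b1110                     # operands from Figure 5.4
product = to_signed(a_bits, 4) * to_signed(b_bits, 4)
product_bits = product & 0xFF                       # 8-bit pattern 11110100

as_integer = to_signed(product_bits, 8)             # integer view of the bits
as_fraction = to_signed(product_bits, 8) / 2**6     # 3 + 3 = 6 fractional bits

# Drop the extended sign bit (1-bit left shift), then keep the top 4 bits.
stored_bits = ((product_bits << 1) & 0xFF) >> 4     # pattern 1110
stored_value = to_signed(stored_bits, 4) / 2**3     # 3 fractional bits

print(f"{product_bits:08b}", as_integer, as_fraction, stored_value)
```

The same pattern 11110100 reads as -12 or as -0.1875, and truncating it to four bits gives -0.25, the "best approximation" noted in the figure.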

When multiplying QN numbers, remember that the result consists of 2N fractional bits, one sign bit, and one or more extended sign bits. Depending on the data type used, the result has to be shifted accordingly. If two Q15 numbers are multiplied, the result is 32 bits wide, with the MSB being the extended sign bit, followed by the sign bit. The imaginary binary point lies after the 30th bit, counting from the LSB. After discarding the extended sign bit with a 1-bit left shift, a right shift of 16 is required to store the result in a 16-bit memory location as a Q15 number. Some precision is lost, of course, as a result of discarding the lower-order fractional bits. Since only 16 bits can be stored, the shifting retains the most significant fractional bits. If 32-bit storage is available, a left shift of 1 can be performed to remove the extended sign bit and store the result as a Q31 number.
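The shift sequence described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a processor's actual instruction sequence: the function names are hypothetical, Python integers do not wrap at 32 bits, and the corner case -1 x -1 (whose result does not fit in Q15) is not handled.

```python
def float_to_q15(x: float) -> int:
    """Quantize x in [-1, 1) to a 16-bit Q15 integer (round to nearest)."""
    return max(-32768, min(32767, round(x * 32768)))


def q15_to_float(q: int) -> float:
    """Convert a Q15 integer back to a Python float."""
    return q / 32768


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 integers and return a Q15 integer."""
    product = a * b        # 30 fractional bits, sign bit, extended sign bit
    q31 = product << 1     # 1-bit left shift: the result is now in Q31 format
    return q31 >> 16       # keep the upper 16 bits: Q15, low bits truncated


result = q15_mul(float_to_q15(0.75), float_to_q15(-0.25))
print(result, q15_to_float(result))
```

Here 0.75 x -0.25 is exactly representable, so no precision is lost; for operands with more fractional detail, the final right shift discards the low-order bits of the product.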

To further understand the possible precision loss when manipulating Q-format numbers, consider another example in which two Q3.12 numbers corresponding to 7.5 and 7.25 are multiplied and the available memory space is 16 bits wide. As can be seen from Figure 5.5, the resulting product might be left shifted by 4 bits to store all the fractional bits corresponding to the Q3.12 format. However, doing so results in a product value of 6.375, which differs from the correct value of 54.375 because the upper integer bits are shifted out. If the product is instead stored in a lower-precision Q-format, say Q6.9, then the correct product value can be stored.
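The numbers in this example can be checked with a short Python sketch. It assumes a 32-bit product register from which the upper 16 bits are stored; the helper `to_int16` and the variable names are hypothetical.

```python
def to_int16(x: int) -> int:
    """Keep the low 16 bits of x and read them as two's complement."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


a = int(7.5 * 2**12)     # 0x7800, Q3.12
b = int(7.25 * 2**12)    # 0x7400, Q3.12
product = a * b          # 0x36600000, 24 fractional bits

# Left shift by 4, then store the upper 16 bits of the 32-bit register:
# all 12 fractional bits of Q3.12 are kept, but integer bits are lost.
as_q3_12 = to_int16(((product << 4) & 0xFFFFFFFF) >> 16)

# Left shift by 1 instead: only 9 fractional bits are kept (Q6.9),
# which leaves room for the 6 integer bits of the product.
as_q6_9 = to_int16(((product << 1) & 0xFFFFFFFF) >> 16)

print(as_q3_12 / 2**12, as_q6_9 / 2**9)
```

The first value comes out as 6.375 and the second as 54.375, matching Figure 5.5: the product needs more integer bits than Q3.12 provides.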

Although the Q-format solves the problem of overflow in multiplication, addition and subtraction still pose a problem. When adding two Q15 numbers, the sum can exceed the range of the Q15 representation. To solve this problem, the scaling approach, discussed later in the chapter, needs to be employed.
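The overflow can be illustrated with two Q15 values of 0.75, whose true sum of 1.5 lies outside the Q15 range. The sketch below assumes 16-bit wraparound arithmetic and, purely for illustration, a scaling factor of 1/2 applied to the inputs; the scaling approach developed later in the chapter may differ, and the helper `wrap16` is hypothetical.

```python
def wrap16(x: int) -> int:
    """Keep the low 16 bits of x and read them as two's complement."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


a = b = round(0.75 * 2**15)             # 0.75 in Q15 (24576)

wrapped = wrap16(a + b)                 # 16-bit sum wraps around
scaled = wrap16((a >> 1) + (b >> 1))    # inputs halved first: (a + b) / 2

print(wrapped / 2**15, scaled / 2**15)
```

Without scaling, the stored sum reads as -0.5; with the inputs halved, the stored value 0.75 correctly represents half of the true sum.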


    Q3.12     7.5         0111.1000 0000 0000
    Q3.12     7.25      * 0111.0100 0000 0000
    ----------------------------------------------------
    Q6.24     54.375      0011 0110.0110 0000 0000 0000 ...

    Stored in 16 bits as Q3.12:    6.375
    Stored in 16 bits as Q6.9:    54.375

Figure 5.5: Q-format precision loss example.

5.2 FLOATING-POINT NUMBER REPRESENTATION

Due to the relatively limited dynamic range of fixed-point processors, one should be concerned with the scaling issue when using them, that is, with how big the numbers get in the manipulation of a signal. Scaling is not an issue when using floating-point processors, since the floating-point hardware provides a much wider dynamic range. There are two floating-point data representations commonly in use: single-precision (SP) and double-precision (DP). In the SP format, a value is expressed as

 1

 2

.exp127/

 1  frac; (5.3)

where s denotes the sign bit (bit 31), exp the exponent bits (bits 23 through 30), and frac the

fractional or mantissa bits (bits 0 through 22); see Figure 5.6.

    bit:     31 | 30 ... 23 | 22 ... 0
    field:   s  |    exp    |   frac

Figure 5.6: Floating point data representation.
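To make the bit layout concrete, the following Python sketch extracts the three fields of a single-precision value with the standard `struct` module and rebuilds the value from them. The function names are hypothetical, and the reconstruction covers only normalized numbers (exponent field neither 0 nor 255), which is an assumption taken from the IEEE 754 convention rather than from the text.

```python
import struct


def decode_single(x: float) -> tuple[int, int, int]:
    """Return (s, exp, frac) of x stored as a 32-bit float."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    s = word >> 31                # bit 31
    exp = (word >> 23) & 0xFF     # bits 23 through 30
    frac = word & 0x7FFFFF        # bits 0 through 22
    return s, exp, frac


def single_value(s: int, exp: int, frac: int) -> float:
    """Evaluate (-1)^s * 2^(exp - 127) * 1.frac for a normalized number."""
    return (-1) ** s * 2.0 ** (exp - 127) * (1 + frac / 2**23)


s, exp, frac = decode_single(-6.5)
print(s, exp, hex(frac), single_value(s, exp, frac))
```

For -6.5, which equals -1.625 x 2^2, the fields are s = 1, exp = 129, and frac = 0x500000 (binary 0.101 = 0.625).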

Consequently, numbers as big as $3.4 \times 10^{38}$ and as small as $1.175 \times 10^{-38}$ can be processed.
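These limits follow from the SP value expression. As a hedged check, the sketch below evaluates it at the extreme field values, assuming (per the IEEE 754 convention, not stated in the text) that exponent fields 0 and 255 are reserved, so normalized values use exponent fields 1 through 254.

```python
FRAC_BITS = 23   # width of the frac field
BIAS = 127       # exponent bias in the SP format

# Largest normalized magnitude: exp = 254, frac all ones (1.frac = 2 - 2^-23).
largest = 2.0 ** (254 - BIAS) * (2 - 2.0 ** -FRAC_BITS)

# Smallest normalized magnitude: exp = 1, frac = 0 (1.frac = 1).
smallest = 2.0 ** (1 - BIAS) * 1.0

print(f"{largest:.4e}  {smallest:.4e}")
```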

In the double-precision format, more fractional and exponent bits are used as indicated:

 1

 2

.exp1023/

 1  frac; (5.4)

where the exponent bits are bits 20 through 30 of one word, and the fractional bits are bits 0 through 19 of that word together with all the bits of the other word; see Figure 5.7. In this manner, numbers as big as $1.7 \times 10^{308}$ and as small as $2.2 \times 10^{-308}$ can be handled.
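The two-word layout can be sketched in Python by unpacking a double-precision value into an upper and a lower 32-bit word. The names `hi`, `lo`, `decode_double`, and `double_value` are hypothetical, the sign bit is assumed to be bit 31 of the upper word, and only normalized numbers are covered.

```python
import struct


def decode_double(x: float) -> tuple[int, int, int]:
    """Return (s, exp, frac) of x stored as two 32-bit words."""
    hi, lo = struct.unpack(">II", struct.pack(">d", x))
    s = hi >> 31                          # bit 31 of the upper word
    exp = (hi >> 20) & 0x7FF              # bits 20 through 30 of the upper word
    frac = ((hi & 0xFFFFF) << 32) | lo    # bits 0-19 of upper word + lower word
    return s, exp, frac


def double_value(s: int, exp: int, frac: int) -> float:
    """Evaluate (-1)^s * 2^(exp - 1023) * 1.frac for a normalized number."""
    return (-1) ** s * 2.0 ** (exp - 1023) * (1 + frac / 2**52)


s, exp, frac = decode_double(-6.5)
print(s, exp, hex(frac), double_value(s, exp, frac))
```

Compared with the single-precision case, the exponent field grows from 8 to 11 bits and the fraction from 23 to 52 bits, which accounts for the wider range and finer resolution.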

When using a floating-point processor, all the steps needed to perform floating-point arithmetic are carried out by the floating-point CPU hardware. For example, consider adding two floating-point numbers represented by

$$a = a_{\mathrm{frac}} \times 2^{a_{\mathrm{exp}}}, \qquad b = b_{\mathrm{frac}} \times 2^{b_{\mathrm{exp}}} \qquad (5.5)$$
