84 5. FIXED-POINT VS. FLOATING-POINT
0:25, respectively. e result is 0:1875, which falls within the fractional range. Notice that
the hardware generates the same 1’s and 0’s; what is different is the interpretation of the bits.
0110 6 0.110 0.75
Q3
* 1110 * -2 * 1.110 * -0.25
Q3
0000 0000
0110 0110
0110 0110
1010 1010
11110100 -12 11.110100 -0.1875
Q6
sign bit
best approximation
in 4-biy memory
1.110
-0.25
Note that since the
MSB is a sign bit,
the corresponding
partial product is
the 2’s complement
of the multiplicand
extended
sign bit
sign bit
Figure 5.4: Binary and fractional multiplication.
When multiplying QN numbers, it should be remembered that the result will consist of
2N fractional bits, one sign bit, and one or more extended sign bits. Based on the data type used,
the result has to be shifted accordingly. If two Q15 numbers are multiplied, the result will be
32-bits wide, with the MSB being the extended sign bit followed by the sign bit. e imaginary
decimal point will be after the 30th bit. After discarding the extended sign bit with a 1-bit left
shift, a right shift of 16 is required to store the result in a 16-bit memory location as a Q15
number. It should be realized that some precision is lost, of course, as a result of discarding the
smaller fractional bits. Since only 16 bits can be stored, the shifting allows one to retain the
higher precision fractional bits. If a 32-bit storage capability is available, a left shift of 1 can be
performed to remove the extended sign bit and store the result as a Q31 number.
To further understand a possible precision loss when manipulating Q-format numbers,
let us consider another example where two Q3.12 numbers corresponding to 7.5 and 7.25 are
multiplied and that the available memory space is 16-bit wide. As can be seen from Figure 5.5,
the resulting product might be left shifted by 4 bits to store all the fractional bits corresponding
to Q3.12 format. However, doing so results in a product value of 6.375, which is different than
the correct value of 54.375. If the fractional product is stored in a lower precision Q-format—say,
in Q6.9 format—then the correct product value can be stored.
Although Q-format solves the problem of overflow in multiplication, addition, and sub-
traction still pose a problem. When adding two Q15 numbers, the sum exceeds the range of
Q15 representation. To solve this problem, the scaling approach, discussed later in the chapter,
needs to be employed.
5.2. FLOATING-POINT NUMBER REPRESENTATION 85
Q3. 12
7.5 0111. 1000 0000 0000
Q3. 12
7.25 * 0111. 0100 0000 0000
Q6. 24
54.375 0011 0110. 0110 0000 0000 0000...
Q 3. 12
6.375
Q
6.9
54.375
Figure 5.5: Q-format precision loss example.
5.2 FLOATING-POINT NUMBER REPRESENTATION
Due to relatively limited dynamic ranges of fixed-point processors, when using such processors,
one should be concerned with the scaling issue, or how big the numbers get in the manipulation
of a signal. Scaling is not an issue when using floating-point processors, since the floating-point
hardware provides a much wider dynamic range. ere are two floating-point data representa-
tions commonly in use: single-precision (SP) and double-precision (DP). In the SP format, a
value is expressed as
1
s
2
.exp127/
1 frac; (5.3)
where s denotes the sign bit (bit 31), exp the exponent bits (bits 23 through 30), and frac the
fractional or mantissa bits (bits 0 through 22); see Figure 5.6.
31 30
s exp
23
22
frac
0
Figure 5.6: Floating point data representation.
Consequently, numbers as big as 3:4 10
38
and as small as 1:175 10
38
can be processed.
In the double-precision format, more fractional and exponent bits are used as indicated:
1
s
2
.exp1023/
1 frac; (5.4)
where the exponent bits are from bits 20 through 30 and the fractional bits are all the bits of
one word and bits 0 through 19 of the other word; see Figure 5.7. In this manner, numbers as
big as 1:7 10
308
and as small as 2:2 10
308
can be handled.
When using a floating-point processor, all the steps needed to perform floating-point
arithmetic are done by a floating-point CPU hardware. For example, consider adding two
floating-point numbers represented by
a D a
frac
2
a
exp
b D b
frac
2
b
exp
:
(5.5)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.138.141.202