Floating-Point Number Representation


CHAPTER 5

Fixed-Point vs. Floating-Point

One important feature that distinguishes different processors is whether their CPUs perform fixed-point or floating-point arithmetic. In a fixed-point processor, numbers are represented and manipulated in integer format. A floating-point processor can handle floating-point arithmetic in addition to integer arithmetic. This means that numbers are represented by the combination of a mantissa (or fractional part) and an exponent part, and the CPU possesses the necessary hardware for manipulating both of these parts. As a result, floating-point operations in general involve more logic elements (a larger ALU) and more cycles (more time) than integer operations.

In a fixed-point processor, one needs to be concerned with the dynamic range of numbers, since a much narrower range of numbers can be represented in integer format than in floating-point format. For most applications, this concern can be virtually ignored when using a floating-point processor. Consequently, fixed-point processors usually demand more coding effort than floating-point processors do.

5.1 Q-FORMAT NUMBER REPRESENTATION

The decimal value of a 2's-complement number $B = b_{N-1} b_{N-2} \ldots b_1 b_0$, $b_i \in \{0, 1\}$, is given by

$$D(B) = -b_{N-1} 2^{N-1} + b_{N-2} 2^{N-2} + \cdots + b_1 2^{1} + b_0 2^{0}. \tag{5.1}$$

The 2's-complement representation allows a processor to perform integer addition and subtraction by using the same hardware. When using unsigned integer representation, the sign bit is treated as an extra magnitude bit, so only nonnegative numbers can be represented this way.
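As a quick check on Eq. (5.1), the weighted sum can be evaluated directly. The following is a minimal sketch, not code from the book; the function name `twos_complement_value` and the choice of an MSB-first bit string as input are assumptions made for illustration.

```python
def twos_complement_value(bits: str) -> int:
    """Evaluate Eq. (5.1) for a bit string written MSB first, e.g. '1100'."""
    n = len(bits)
    # The sign bit b_{N-1} carries the negative weight -2^(N-1).
    value = -int(bits[0]) * 2 ** (n - 1)
    # The remaining bits b_{N-2} ... b_0 carry positive weights 2^(N-2) ... 2^0.
    for i, bit in enumerate(bits[1:], start=1):
        value += int(bit) * 2 ** (n - 1 - i)
    return value


print(twos_complement_value("0110"))  # 6
print(twos_complement_value("1100"))  # -4
print(twos_complement_value("1000"))  # -8
```

The same function applied to a 16-bit pattern with only the sign bit set returns -32,768, the most negative 16-bit value.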

There is a limitation to the dynamic range of the foregoing integer representation scheme. For example, in a 16-bit system it is not possible to represent numbers larger than $2^{15} - 1 = 32{,}767$ or smaller than $-2^{15} = -32{,}768$. To cope with this limitation, numbers are normalized between -1 and 1. In other words, they are represented as fractions. This normalization is achieved by the programmer moving the implied or imaginary binary point (note that there is no physical memory allocated to this point), as indicated in Figure 5.1. This way, the fractional value is given by

F .B/ D b

N 1

C b

N 2



C    C b



C b



: (5.2)

This representation scheme is referred to as Q-format or fractional representation. The programmer needs to keep track of the implied binary point when manipulating Q-format numbers. For instance, let us consider two Q15 format numbers and a 16-bit wide memory. Each number consists of 1 sign bit plus 15 fractional bits. When these numbers are multiplied, a Q30 format number is obtained (the product of two fractions is still a fraction), with bit 31 being the sign bit and bit 32 another sign bit (called the extended sign bit), counting bits from 1. If only 16 of the 32 bits can be stored, it makes sense to store the most significant bits. This translates into storing the upper portion of the 32-bit product register, minus the extended sign bit, by doing a 1-bit left shift followed by a 16-bit right shift. In this manner, the product is stored in Q15 format (see Figure 5.2). The notation for Q-format numbers is QM.N, where M represents the number of bits corresponding to the whole-number part and N the number of bits corresponding to the fractional-number part.

[Figure: bit layouts (bit positions N-1, N-2, ...) for the integer representation and the fractional representation, each marked with its implied binary point.]

Figure 5.1: Integer vs. fractional representation.
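The Q15 multiply-and-store step described above can be sketched in a few lines. This is an illustrative model only: Python integers are unbounded, so the 32-bit product register is emulated with a mask, and the helper names `float_to_q15`, `q15_to_float`, and `q15_mul` are assumptions rather than names taken from the book or from any particular DSP library.

```python
Q15_ONE = 1 << 15  # scale factor 2^15 between a fraction and its Q15 integer


def float_to_q15(x: float) -> int:
    """Quantize x to a signed 16-bit Q15 integer, saturating at the range limits."""
    q = int(round(x * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, q))


def q15_to_float(q: int) -> float:
    """Interpret a signed 16-bit Q15 integer as a fraction."""
    return q / Q15_ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values and keep the upper 16 bits of the product as Q15."""
    product = a * b                        # Q30 value with two sign bits
    shifted = (product << 1) & 0xFFFFFFFF  # 1-bit left shift drops the extended sign bit
    upper = shifted >> 16                  # 16-bit right shift keeps the upper half
    # Reinterpret the 16-bit pattern as a signed number.
    return upper - 0x10000 if upper & 0x8000 else upper


half = float_to_q15(0.5)
quarter = float_to_q15(0.25)
print(q15_to_float(q15_mul(half, quarter)))   # 0.125
print(q15_to_float(q15_mul(-half, quarter)))  # -0.125
```

Note that this sketch truncates rather than rounds, and that multiplying -1 by -1 overflows in it: `q15_mul(-32768, -32768)` returns -32768, because +1 is not representable in Q15.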

Based on 2's-complement representation, a dynamic range of $-2^{N-1} \le D(B) \le 2^{N-1} - 1$ can be achieved, where N denotes the number of bits. For an easy illustration, let us consider a 4-bit system where the most negative number is -8 and the most positive number is 7. The decimal representations of the numbers are shown in Figure 5.3. Notice how the numbers change from most positive to most negative with the sign bit. Since only the integers falling within the limits -8 and 7 can be represented, it is easy to see that any multiplication or addition resulting in a number larger than 7 or smaller than -8 will cause overflow. For example, when 6 is multiplied by 2, the number 12 is obtained. This result exceeds the representation limits and is wrapped around the circle to 1100, which is -4.
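The wraparound in this example can be reproduced with a short sketch. The helper `wrap_to_n_bits` is a hypothetical name introduced here; it models hardware that keeps only the N least significant bits of a result, with no saturation.

```python
def wrap_to_n_bits(value: int, n: int = 4) -> int:
    """Reduce an integer to an n-bit 2's-complement value (no saturation)."""
    mask = (1 << n) - 1
    raw = value & mask  # keep only the n least significant bits
    # Reinterpret the n-bit pattern as a signed number.
    return raw - (1 << n) if raw & (1 << (n - 1)) else raw


product = 6 * 2
print(format(product & 0xF, "04b"), wrap_to_n_bits(product))  # 1100 -4
```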

Q-format representation addresses this problem by normalizing the dynamic range between -1 and 1. Any resulting product falls within the limits of this dynamic range, the one exception being (-1) x (-1) = 1, which lies just above the largest representable value. Using Q-format representation, the dynamic range is divided into $2^{N}$ sections, where $2^{-(N-1)}$ is the size of a section. The most negative number is always -1 and the most positive number is $1 - 2^{-(N-1)}$.


[Figure: rounding annotation for the stored product. Add 1 to the ? bit, then truncate: if ? = 0, there is no effect (i.e., rounded down); if ? = 1, the result is rounded up.]

Figure 5.2: Multiplying and storing Q15 numbers.
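The rounding rule in the figure annotation can be sketched as follows. The figure itself is not available here, so this assumes that the "? bit" is the most significant of the discarded bits; the function name `q15_mul_rounded` is likewise an assumption. Overflow of the rounded result is not handled in this sketch.

```python
def q15_mul_rounded(a: int, b: int) -> int:
    """Multiply two Q15 integers and round the Q30 product back to Q15."""
    product = a * b                # Q30 product
    rounded = product + (1 << 14)  # add 1 at the highest discarded bit
    return rounded >> 15           # truncate the 15 discarded fractional bits


# 3 * 0.5 in Q15 units is 1.5: truncation alone would give 1, rounding gives 2.
print((3 * 16384) >> 15, q15_mul_rounded(3, 16384))  # 1 2
```

Shifting right by 15 here is equivalent to the 1-bit left shift followed by a 16-bit right shift described in the text.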

[Figure: number circle pairing each 4-bit pattern with its decimal value, running from 0000 (0) up to 0111 (7), then wrapping to 1000 (-8) and continuing up to 1111 (-1).]

Figure 5.3: Four-bit binary representation.

The following example helps one to see the difference between the two representation schemes. As shown in Figure 5.4, the multiplication of 0110 by 1110 in binary is the equivalent of multiplying 6 by -2 in decimal, giving an outcome of -12, a number exceeding the dynamic range of the 4-bit system. Based on the Q3 representation, these numbers correspond to 0.75 and
