The Internal Operations of the CPU

As noted earlier, the CPU is also called a microprocessor because all of its components—at least all those needed to carry out calculations—are on a single silicon chip (see Chapter 4 for a discussion of how chips are made). We'll begin our overview of the microprocessor with a simple description of its functions. The classical CPU carries out a four-step sequence of operations (see Figure 1.4).

  1. Fetch

  2. Decode

  3. Execute

  4. Store

Fetch

First, instructions and data are fetched from outside the chip (usually this means from DRAM). An example of an instruction is one that contains a simple mathematical operation such as "add." In some cases, the data used by an instruction are included with it; in other instances, the instruction references the locations where the data are held. These locations are called addresses. So, an "add" instruction might carry the numbers to be added with it, or it might state that the value in address x is to be added to the value in address y. An instruction actually contains two parts: an opcode—the action to be performed—and the operand—the data it works on. So, "add" is an opcode, and address x and address y are operands.
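To make this concrete, here is a minimal sketch in Python of an instruction as an opcode plus operands. The memory layout, opcode name, and addresses are invented for illustration:

```python
# A toy memory: each slot is an address holding a data value
# (the layout and values here are made up for illustration).
memory = [0, 0, 7, 0, 0, 35, 0, 0]

# An instruction pairs an opcode with its operands. Here the operands
# are addresses: "add the value at address 2 to the value at address 5."
instruction = ("ADD", 2, 5)

opcode = instruction[0]       # the action to be performed
operands = instruction[1:]    # where the data are held

print(opcode, operands)       # ADD (2, 5)
```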

Decode (Analyze)

Once the CPU has received the instruction, it is turned over to an area of the chip that decodes (analyzes) it in order to determine which of the chip's circuits should be used for processing. This analysis stage can also include other functions. For example, some chips will look into the stream of incoming instructions to reorder them so they can be completed in the most efficient way possible. Also, in the event that the instruction does not include the actual data that will be used, but just their addresses, this is the point at which the CPU will retrieve the data.
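In a real chip, the opcode and operand addresses are packed into bit fields of a single instruction word, and part of decoding is pulling those fields apart. Here is a sketch using an invented 16-bit format (4 bits of opcode, two 6-bit addresses):

```python
# A hypothetical 16-bit instruction word: the top 4 bits hold the
# opcode, the next 6 bits one address, and the low 6 bits another.
word = 0b0001_000010_000101        # opcode 1, addresses 2 and 5

opcode = (word >> 12) & 0xF        # isolate bits 12-15
addr_x = (word >> 6) & 0x3F        # isolate bits 6-11
addr_y = word & 0x3F               # isolate bits 0-5

OPCODES = {1: "ADD", 2: "SUB", 3: "CMP"}   # invented opcode table
print(OPCODES[opcode], addr_x, addr_y)     # ADD 2 5
```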

Execute

If the CPU is the brains of the computer, then the ALU (arithmetic/logic unit) is the part where the actual thinking (execution of the instruction) takes place. The ALU includes groups of transistors, known as logic gates, that are organized to carry out basic mathematical and logical operations. An appropriately arranged collection of logic gates can then execute a complete mathematical instruction (such as "add" or "divide" two numbers) or a logical instruction (such as "compare" two values). The instructions that a specific CPU can execute are known as its instruction set. CPUs from different vendors (e.g., Intel, Sun) have different instruction sets. Problems of system compatibility begin at this level. Software written to the Intel instruction set won't work (at all) on other types of processors such as Sun's SPARC family or the IBM/Motorola PowerPC series. Like different languages, instruction sets vary in both their vocabulary and their grammar (for example, they use different ways of organizing instructions).
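To show how gates add up, literally, to arithmetic, here is a minimal Python sketch of a one-bit full adder built from AND, OR, and XOR gates, then chained into a multi-bit "add" circuit. The gate functions are stand-ins for transistor circuits:

```python
# Each "gate" is a function of bits (0 or 1), mimicking transistor logic.
def AND(a, b): return a & b
def OR(a, b):  return a | b
def XOR(a, b): return a ^ b

# A full adder: add two bits plus a carry-in, producing a sum bit
# and a carry-out. Chaining full adders gives a multi-bit adder.
def full_adder(a, b, carry_in):
    partial = XOR(a, b)
    sum_bit = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return sum_bit, carry_out

def add(x, y, bits=8):
    carry, result = 0, 0
    for i in range(bits):     # ripple the carry from bit to bit
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

print(add(7, 35))   # 42
```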

Tech Talk

Logic gate: A logic gate is a small group of transistors connected in a way that allows it to carry out a basic logical operation such as AND, OR, or NOT. Logic gates are grouped into electrical circuits that execute the CPU's instructions, such as to "add" two numbers or "compare" two values.


Register Here

Registers provide a good example of the connection between hardware and software. Intel's original 16-bit CPU, the 8086/8088, had only eight general purpose registers that programmers could use to hold data and instructions. As more room became available on the chip, it would have been logical to increase this number, since in most cases more registers means better performance. But Intel couldn't do this because it would have made newer chips incompatible with older "legacy" software. The number of general purpose registers finally increased—to 48—when Intel completely rearranged the architecture of its chips with the sixth generation (Pentium Pro) in 1995. Significantly, these CPUs included a mechanism for translating old register instructions to new ones.


Store

Instructions have to tell the CPU not only what operation to perform, but where to put the result. There are a number of options. If the instruction is iterative, for example adding two numbers and then adding another to the result, the instruction will tell the CPU to place the result of the first addition in a special short-term on-chip storage area, known as a register, until it is needed. Because the registers are interwoven with the ALU's circuits, they allow very fast retrieval. Alternatively, if the result is not expected to be used again right away, it will be sent off-chip to memory (fast retrieval) or to disk storage (a lot slower—see Figure 1.5 for an overview of these parts).

Tech Talk

Register: A register is an on-CPU storage space where instructions and data can be transferred and held temporarily for fast retrieval.


Figure 1.5. The flow of instructions in a CPU. Note that this illustration shows bus widths.


Recent changes in computer architecture have simplified the way in which the CPU manages instructions. In the old style, it was common for a programmer to tell the computer to "add the value at memory location xx to the value in memory location yy and place it in memory location zz." The problem with this approach, often called fetch/execute, is that operations that need to access memory are slow. The CPU has to wait for many clock cycles while the information is looked up and retrieved. Current system strategy limits memory accesses to load and store operations. In this approach, the programmer will first give the computer instructions to load the appropriate data from memory to a register or registers. Then, an add instruction would be something like this: "add the value in register a to the value in register b and place the result in register c." Subsequently, another instruction could tell the CPU to "store the value in register c in memory location zz."

The value of the load/store approach is that not every operation has to begin with a load, since some or all of the data to be manipulated are already in registers. In considering this strategy, you will appreciate that it is important to have a lot of registers. With a large register file, you can use one load operation to bring an entire data series onto the CPU and do all your additions or whatever without having to access memory again. It's easy to see that adding registers is an important way of making a CPU more productive.
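Here is a sketch of the load/store style in Python: a tiny interpreter with a few registers and invented LOAD/ADD/STORE mnemonics, running the zz = xx + yy example from above:

```python
# A toy load/store machine: a small memory, a few registers, and
# an interpreter for invented mnemonics (LOAD, ADD, STORE).
memory = {"xx": 7, "yy": 35, "zz": 0}
registers = {"a": 0, "b": 0, "c": 0}

def run(program):
    for op, *args in program:
        if op == "LOAD":             # memory -> register
            reg, addr = args
            registers[reg] = memory[addr]
        elif op == "ADD":            # register + register -> register
            dst, src1, src2 = args
            registers[dst] = registers[src1] + registers[src2]
        elif op == "STORE":          # register -> memory
            reg, addr = args
            memory[addr] = registers[reg]

run([("LOAD", "a", "xx"),
     ("LOAD", "b", "yy"),
     ("ADD", "c", "a", "b"),     # only registers are touched here
     ("STORE", "c", "zz")])
print(memory["zz"])   # 42
```

Notice that once "xx" and "yy" are in registers a and b, any number of further calculations can use them without another trip to memory; that is the whole point of the approach.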

Tech Talk

Instruction set: The instruction set refers to both the instructions that a CPU can execute and the way in which they are organized. CPUs from different vendors have different instruction sets unless one is a clone of the other.


The Clock

Correctly executing instructions in a CPU depends on perfect synchronization. If instructions somehow got out of logical sequence, more of the results would be erroneous than accurate, and even worse, it would be impossible to tell which was right and which was wrong. Since even one error is intolerable, something in the system has to make sure that actions within the CPU are coherent. This management function is performed by the clock. As described earlier, the clock's circuitry is based on a quartz crystal system like that used in watches. At precisely timed intervals, the clock sends out pulses of electricity that cause bits to move from place to place within or between logic gates or between logic gates and registers. Simple instructions, such as add, can often be executed in just one clock cycle, while complex ones, such as divide, will require a number of smaller steps, each using one cycle. Chip speed is measured in cycles per second, which was once referred to by the abbreviation cps, but which is now known by the word hertz (abbreviated Hz and named for the German physicist Heinrich Hertz). So, a 700 million cycle per second CPU is described as functioning at 700 MHz.
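Since the period of a clock is just the reciprocal of its frequency, a quick calculation (sketched in Python) shows how little time one cycle allows:

```python
# One clock cycle lasts 1/frequency seconds.
for label, hz in [("4.77 MHz (original IBM PC)", 4.77e6),
                  ("700 MHz", 700e6),
                  ("1.5 GHz", 1.5e9)]:
    print(f"{label}: {1 / hz * 1e9:.2f} nanoseconds per cycle")
# About 210 ns, 1.43 ns, and 0.67 ns respectively.
```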

Interrupts

If the CPU couldn't be interrupted until it had completed a task, its usefulness would be greatly reduced. In effect, the CPU would be like the primitive batch processing computers of the earliest days of computing; they could work on only one program at a time and were unable to do anything else until that program was completed.

To make interrupts possible, CPUs have lines (wires) that connect them to an external interrupt controller chip (part of the chipset), which contains a small database of what are known as interrupt vectors. When an interrupt signal comes onto the chip, the CPU saves what it is doing and goes to the interrupt vector table (which is just a fancy name for a numerical lookup table) to find the address of the instruction that the interrupt is telling it to execute instead. When finished, it goes back to the previous task. The CPU keeps track of where it was by writing the address of the interrupted task to a special storage area known as a stack. Interrupts have various priority levels that are interpreted differently according to the task the CPU is engaged in. For example, a busy CPU might temporarily ignore low priority interrupts, such as those coming from the keyboard, but would respond immediately to a high priority one that carried something like a disk error message. Similarly, a software program can be designed to mask lower priority interrupts if it can't safely be stopped in midstream.
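The mechanism is easy to model. Below is a Python sketch of a vector table and a stack of return addresses; the handler names and addresses are invented:

```python
# A toy interrupt mechanism: the vector table maps an interrupt
# number to the handler to run; a stack remembers where we left off.
def keyboard_handler():   print("handling keystroke")
def disk_error_handler(): print("handling disk error")

vector_table = {1: keyboard_handler, 2: disk_error_handler}
stack = []

def interrupt(number, current_address):
    stack.append(current_address)   # save our place
    vector_table[number]()          # run the handler's code instead
    return stack.pop()              # then resume where we left off

resume_at = interrupt(2, current_address=0x4F20)
print(hex(resume_at))   # 0x4f20 -- back to the interrupted task
```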

Designing a Faster CPU

As computers become more powerful, defined as being able to do a greater amount of work in a given unit of time, they become more useful. How do you get more power? This section describes the most important strategies.

Faster Clock Speeds

It's obvious that a faster clock will allow more to be done in a given unit of time. The original IBM PC's clock ticked just under 5 million times a second (actually, 4.77 MHz). As this is written (1999), standard PCs are beginning to exceed 700 million clock cycles per second. This hundredfold-plus improvement has taken less than two decades. And the beat goes on. Expect speeds of desktop machines to double—to a billion and a half cycles per second (1.5 GHz)—by 2001.

A chip's maximum clock speed, the fastest it can go without producing errors, is a function of how it is made. Specifically, higher speeds require that distances within the CPU be shorter. Thus, for the hundredfold improvement we have seen so far, the connecting lines between transistors in logic gates have had to become thinner and closer together, and these in turn have had to move closer to other logic gates, registers, and other circuits. Faster chips necessarily have greater density; that is to say, they have more transistors in a given area. Figure 1.6 illustrates one type of transistor.

Tech Talk

Clock speed: CPUs (and other devices) are controlled by quartz crystal clocks. The consistent timing provided by the clocks helps to keep operations synchronized. Clock speed is usually measured in millions of cycles per second, abbreviated MHz.


Figure 1.6. One type of transistor. In the top drawing, the gate is neutral and there is no movement of electrical current. If a positive charge is applied to the gate, as in the lower drawing, electrons are attracted and allow current to flow from the source to the drain. If wires are attached to the source and drain, the transistor is a switch. If current flows, it is a 1; if not, it is a 0. The narrower the gate, the faster the chip can cycle, or turn a circuit on and off. The transistor is naturally a binary device. There isn't much you can do with one transistor, but put them together in groups, logic gates, and you have something really powerful.


There are several reasons why shorter distances are essential to faster chips. One is that the speed of electricity in a wire is constant. Our speedier clock is not actually making bits move faster; it is just making them move more frequently. Since electrons won't go faster, a higher speed circuit must assume shorter distances for each clock cycle. Distance is an issue even within transistors. If the parts of a transistor are closer together, it can cycle faster. A bonus is that chips with shorter internal paths also require less power.
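A back-of-envelope calculation makes the point. Even at the speed of light, which is faster than signals actually travel in wires, a signal can't get far in one cycle:

```python
# Upper bound: how far can any signal travel in one clock cycle?
# (Signals in real wires move at a fraction of the speed of light.)
c = 3.0e8   # speed of light in meters per second
for hz in [4.77e6, 700e6, 1.5e9]:
    print(f"{hz / 1e6:>8.2f} MHz: at most {c / hz * 100:.1f} cm per cycle")
# 4.77 MHz: ~6289 cm; 700 MHz: ~42.9 cm; 1.5 GHz: ~20 cm
```

At 700 MHz, everything a signal must reach within one cycle has to lie within about 43 centimeters even in theory, and in practice far closer than that.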

Timeless Chips

Clock-driven CPUs work well, but this synchronous approach is energy inefficient. Every time the clock ticks, electricity pulses through the chip, whether any computation is going on or not. Laptops and other devices that need to save power do so in part by slowing or even shutting down the CPU when not in use. The penalty in this method, of course, is that the CPU is slow to reawaken when needed. Asynchronous (non-clock-driven) chips exist, and have been shown to have much lower power consumption, but the required control circuitry uses lots of transistors that could be employed for computation. However, as shrinking parts allow more transistors on a chip, expect asynchronous designs to be used for special purposes where both low power and fast reaction are essential.


Making distances shorter in a CPU (or any other chip) means that the fundamental features of the circuits need to be smaller. Among the most important of these elements are the traces, microscopic equivalents of wires that make the paths between logic gates. Feature size, in the form of trace widths and transistor design rules, has gone from about 3 microns (a micron is one millionth of a meter, or about one twenty-five thousandth of an inch) in the original PC to about 0.18 microns in the 700 million-plus cycles per second screamer of today; 0.15-micron design rules should be standard, and 0.13 in testing, by the time you read this. These successive stages of miniaturization are the result of some incredible improvements in process engineering.

You would expect to be told next that these advances don't come cheap. Well, they don't; the cost of chip factories, or fabs, goes up by huge amounts with each new generation. New ones today are in the several billion dollar range. Amazingly though, each of these fabs is radically more productive. This means that once processes are perfected, smaller, faster CPUs are also cheaper than their bigger, slower predecessors. This is the reason for "Moore's Law," which states that the number of transistors on a chip doubles every eighteen months, while the cost falls by about fifty percent. By the way, since there was no Bureau of Computer Laws where Moore could register his thought, there are a lot of variations of his law floating around.
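Moore's Law is simple enough to state as a formula. A sketch in Python, using transistor counts quoted later in this chapter, shows both its power and why the looser variants exist:

```python
# Moore's Law, 18-month-doubling variant:
# transistors(t) = starting_count * 2 ** (years / 1.5)
def moores_law(start_count, years, doubling_period=1.5):
    return start_count * 2 ** (years / doubling_period)

# Projecting 18 years forward from the 8088's 29,000 transistors:
print(f"{moores_law(29_000, 18):,.0f}")   # about 119 million
# The Pentium III's actual count (~7.5 million) is closer to a
# two-year doubling, one reason so many variants of the law circulate.
```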

Tech Talk

Trace: The wires that connect devices within a chip are so tiny that they are called traces.


Tech Talk

Moore's Law: Intel pioneer Gordon Moore stated that the number of transistors on a chip would double every eighteen months, and that their cost would fall by fifty percent during the same time.


Table 1.3. Impact of Feature Size on Chip Dimensions and Clock Speed. The PowerPC 604e, which stayed with the same 5.1 million transistor design over three process generations, provides an excellent example of the impact of feature size on CPU area and speed.

  Year   Feature size (microns)   Die size (mm2)   Speed (MHz)
  1996   0.5                      148              225
  1997   0.35                     96               233
  1997   0.25                     47               375

Table 1.3 provides an example of the impact of smaller design rules on a chip's size and speed. A CPU made by IBM and Motorola, the PowerPC 604e, was unusual because it kept the same design through three process generations. Fortunately, this anomaly provides an excellent illustration. You'll observe that as its internal elements got smaller, the chip shrank in size while its speed increased. Most important, the smaller and faster version was also less power-hungry and a lot cheaper to manufacture than its bigger, slower, energy-gulping predecessors. The manufacturing dimension is discussed in Chapter 4, but note for the moment that it's likely that the 47 mm2 size yielded four times as many good chips from one wafer as the 148 mm2 version.
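A rough sketch in Python shows where the "four times as many good chips" estimate comes from: a smaller die not only fits on the wafer more times, it is also less likely to sit on a defect. The wafer size and defect density below are invented for illustration, and the formulas are standard textbook approximations, not IBM's actual numbers:

```python
import math

WAFER_DIAMETER_MM = 200     # assumed wafer size
DEFECTS_PER_MM2 = 0.002     # assumed defect density

def gross_dies(die_area, wafer_d=WAFER_DIAMETER_MM):
    # Classic approximation: wafer area / die area, less edge waste.
    r = wafer_d / 2
    return (math.pi * r**2 / die_area
            - math.pi * wafer_d / math.sqrt(2 * die_area))

def yield_fraction(die_area, d0=DEFECTS_PER_MM2):
    # Simple Poisson yield model: bigger dies catch more defects.
    return math.exp(-d0 * die_area)

for area in (148, 47):
    good = gross_dies(area) * yield_fraction(area)
    print(f"{area} mm2 die: about {good:.0f} good dies per wafer")
# With these assumptions: ~131 vs ~549, roughly four to one.
```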

Wider Paths

In addition to a faster clock, a way of speeding up the CPU is to have it process more bits on every cycle. The first generation of microcomputers moved 8 bits at a time, the first IBM PC was 16 bits wide, the current generation of microprocessors is 32 bits, and the next (standard by about 2002) will be 64 bits. The number of bits is described as a function of width because down there inside the chip, that's what it means. Compared to its 16-bit predecessor, a 32-bit computer has twice as many traces connecting logic gates, the logic gates themselves are about twice as wide, and so on. You will have observed right away that wider is the enemy of denser. That's true. A 32-bit chip will be bigger than a 16-bit one with the same number and size of logic gates. A 32-bit register will be bigger than a 16-bit register, etc. Thus, part of the benefit of smaller feature sizes has been employed not just to make chips smaller, but also to give them wider circuits. The tradeoff is well worth it, though. For a variety of reasons, including that the wider chips can use the same control structures to manage twice as much data, they are much more efficient. Instructions can be more complex, more data can be processed at once, and so on.
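One way to see what width buys is to look at the range of numbers a single word can hold; the range doubles with every added bit, so it grows enormously with word length:

```python
# The largest value an n-bit word can hold is 2**n - 1.
for bits in (8, 16, 32, 64):
    print(f"{bits:>2}-bit word: values 0 to {2**bits - 1:,}")
# 8-bit: 255; 16-bit: 65,535; 32-bit: ~4.3 billion; 64-bit: ~1.8e19
```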

Tech Talk

Word: A computer's word length is the amount of information that it can process at one time. Current desktop computers use 32-bit words; 64-bit systems will be commonplace soon.


Doing More than One Thing at a Time

Higher clock speeds aren't the only way to increase productivity within a CPU. An obvious additional strategy is to do more than one task during each clock cycle. A popular technique of this type is called pipelining. In a pipelined CPU, while one part of the chip is fetching a new instruction, another is analyzing, yet another is executing, and another is storing results. Complex CPUs can have lots of pipeline stages. Intel's sixth-generation CPUs (Pentium Pro, Pentium III) have 14 stages in their pipelines vs. only five for the Pentium. Modern CPUs also fetch several instructions at once, placing them in the first stage of the long pipeline, the prefetch queue. This helps to ensure that the pipeline is always full. Today's processors, with their longer pipelines, are, of course, described as superpipelined. Figure 1.7 illustrates pipelining.

Figure 1.7. Pipelining. A pipelined design can complete one instruction for each clock cycle as compared to one every four cycles for a pipe that handles just one instruction at a time.


Figure 1.8. Superscalar. In a superscalar design, Instruction 2 enters its pipeline one clock cycle after Instruction 1 and each flows down its corresponding pipeline in parallel. This is the simplest approach to parallel execution.


Tech Talk

Pipeline: Engineers describe the path that instructions follow through a CPU—fetch, decode, execute, store, and variations—as its pipeline. In a pipelined processor, every stage of the pipeline is doing work on every clock cycle.
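The payoff is easy to count. In the sketch below, a four-stage pipe (fetch, decode, execute, store) needs four cycles per instruction when handling one instruction at a time, but only one cycle per instruction, after a brief fill-up period, when pipelined:

```python
# Cycle counts for a 4-stage pipe: fetch, decode, execute, store.
STAGES = 4

def cycles_unpipelined(n):
    return n * STAGES            # one instruction at a time

def cycles_pipelined(n):
    return STAGES + (n - 1)      # fill the pipe, then one per cycle

for n in (1, 10, 1000):
    print(f"{n:>5} instructions: {cycles_unpipelined(n):>5} vs "
          f"{cycles_pipelined(n)} cycles")
# 1: 4 vs 4;  10: 40 vs 13;  1000: 4000 vs 1003
```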


A second technique, following logically from the first, is to have more than one pipeline. In this approach, instruction 1 in pipeline A is executed at about the same time as instruction 2 in pipeline B, even though instruction 2 followed instruction 1 into the CPU. This is a lot more efficient as long as 2 doesn't depend on the results of 1. Fortunately, programmers can often organize the flow of instructions in a way that avoids dependencies. CPUs with multiple pipelines are called superscalar (see Figure 1.8). Most contemporary microcomputer CPUs have this characteristic. Most also have different kinds of pipelines. The standard pipeline, one present in all CPUs, performs integer mathematics (whole numbers). Most CPUs also have pipelines that are optimized to do floating point math (numbers with decimal points). Integer pipelines can be made to do floating point calculations, but for certain kinds of applications, such as scientific analysis, CPUs with one or more special floating point pipelines can be dramatically faster.

Tech Talk

Superscalar: A CPU that is superscalar has more than one pipeline.


More exotic ways of doing more than one thing at a time include such techniques as branch prediction and speculative execution. Branch prediction circuitry is designed to deal with the fact that typical computer programs have lots of decision points (branches).

To illustrate, a branch occurs when an instruction tells the CPU to perform a certain operation and, if the result meets a prescribed test, to jump to a different stream of instructions. As an example, the computer system in a portable pocket organizer might want to determine if the next appointment time has arrived. To do this, it would subtract the appointment time from the current time and then compare the result to zero (a result of zero means the appointment time equals the current time). If the result is zero, the system would take the branch—in this case, sounding an alarm. But, if the test is not met (the result is not zero), the system would skip the branch for now and go about its business.
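Written out as code, the organizer's test is a subtraction followed by a conditional branch; the times and messages below are invented:

```python
# The organizer example: a compare followed by a conditional branch.
def check_appointment(current_time, appointment_time):
    result = appointment_time - current_time   # the subtraction
    if result == 0:                            # the conditional branch
        print("BEEP! Appointment time.")       # branch taken: alarm
    print("...back to regular work")           # either way, carry on

check_appointment(current_time=540, appointment_time=540)  # alarm sounds
check_appointment(current_time=500, appointment_time=540)  # no alarm
```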

Tech Talk

Branch: A point in a program where the CPU may have to switch to a different stream of instructions is called a branch. A conditional branch is where the stream chosen depends on the result of some computation.


Branches can waste a lot of clock cycles while the CPU dumps everything in the pipeline and finds the new instruction. Branch prediction circuits deal with this by looking ahead in the pipeline (before the ALU) and predicting the result of a branch. If the CPU thinks the branch will be taken (usually based on past performance), it will then go ahead and load the alternative instruction and possibly also new data. This can be done while the ALU is still working on earlier calculations, thereby greatly reducing the penalty imposed if a branch has to be taken. Speculative execution means that the CPU will use the equivalent of a separate pipeline to follow a possible branch. If it turns out that the branch is taken, the CPU will be ready with the result. If not, it's pitched.
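"Based on past performance" can be as simple as remembering what each branch did last time. Here is a sketch of a one-bit predictor; real CPUs use more elaborate schemes, such as two-bit counters and history tables:

```python
# A one-bit branch predictor: predict that each branch will do
# whatever it did the last time it was seen.
last_outcome = {}   # branch address -> True (taken) / False (not)

def predict(branch_addr):
    return last_outcome.get(branch_addr, False)  # first guess: not taken

def update(branch_addr, taken):
    last_outcome[branch_addr] = taken

# A loop branch taken nine times, then finally not taken:
hits = 0
for actual in [True] * 9 + [False]:
    hits += predict(0x1000) == actual
    update(0x1000, actual)
print(f"correct predictions: {hits}/10")   # 8/10 for this pattern
```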

Tech Talk

Speculative execution: A CPU that has circuits designed for speculative execution will execute the instructions after a branch just in case the program needs to go that way.


Techniques like pipelining and superscalar organization, not to mention branch prediction and speculative execution, require that a CPU have lots of extra transistors. That's OK, since manufacturers have been using smaller and smaller design rules that allow more and more traces and transistors. The Intel 8088 that powered the first IBM PC had 29,000 transistors. The Pentium III of today, which is superscalar and superpipelined, supports branch prediction, speculative execution, and a bunch of other clever stuff, has around 7.5 million transistors. Some high-end CPUs, like the HP PA-8500, have in the vicinity of 140 million transistors (though much of that is in onboard memory, discussed in Chapter 2). Figure 1.9 provides an illustration of a pipelined, superscalar CPU.

Figure 1.9. A pipelined, superscalar CPU. This diagram shows the flow of operations within the CPU, as well as the most important external connections. Needless to say, this is highly simplified.

