7.5. System Design with Diamond 108Mini Processor Cores

Figure 7.5 shows a system built with four Diamond 108Mini processor cores. The processor cores can communicate with global memory over the shared 32-bit PIF bus and with each other’s local memories using the Diamond processor cores’ inbound-PIF feature. A bus arbiter controls access to the PIF bus. Local/Global address-translation blocks attached to each processor’s PIF bus perform the critical function of mapping the attached processor’s local address space into one unified global address map.

Figure 7.5. This 4-processor system design allows the master Diamond 108Mini processor (shown on the left) to control the operation of the other three processors through their Reset, Run/Stall, and NMI inputs.


Without these address-translation blocks, the four Diamond 108Mini processors shown in Figure 7.5 could not communicate over the PIF because their local address maps would overlap and conflict. For example, if the master Diamond 108Mini processor attempted to write to the local data memory of slave processor #1, it would use a destination address (say 0x3FFE0000) that would result in the master processor writing to its own local data memory instead of the intended location in slave processor #1's local data-memory address space. With the address-translation blocks, the master processor writes to an address in the PIF's global address space and that address is then mapped to the target processor's local address space.
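As a hedged illustration of the translation blocks' job, the mapping amounts to adding a per-processor offset that relocates each core's local memory into a private, non-overlapping window of the global address map. All of the base addresses below, except the 0x3FFE0000 example taken from the text, are invented for this sketch; they are not the Diamond memory map.

    /* Sketch of the local-to-global mapping performed by an
       address-translation block. Base addresses are hypothetical. */
    #include <stdint.h>

    #define LOCAL_DMEM_BASE 0x3FFE0000u  /* local data-RAM base (example from the text) */

    /* Assumed non-overlapping global windows, one per processor. */
    static const uint32_t global_dmem_base[4] = {
        0x60000000u,   /* master   */
        0x60100000u,   /* slave #1 */
        0x60200000u,   /* slave #2 */
        0x60300000u,   /* slave #3 */
    };

    /* Translate processor 'id's local data-memory address into the
       unified global (PIF) address space. */
    uint32_t local_to_global(unsigned id, uint32_t local_addr)
    {
        return global_dmem_base[id] + (local_addr - LOCAL_DMEM_BASE);
    }

Under this assumed map, the master targets slave #1's data RAM by writing into the 0x60100000 window on the PIF; slave #1's translation block then steers the inbound-PIF write to the corresponding local address near 0x3FFE0000.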

The processor on the left of the figure is configured as the system master and the other three processors are slaves. Three wires from the master processor’s direct output port are connected to the Reset inputs of the three slave processors; three more output-port wires are connected to the slave processors’ Run/Stall inputs; and yet another three output-port wires are connected to the slave processors’ NMI inputs.

This system configuration allows the master processor to independently reset and halt all three slave processors. In fact, the three slave processors are automatically reset when the master processor is reset because the reset-initialized state for each of the pins associated with the master processor’s output port is zero. Inverters attached to the master processor’s output-port wires used for resetting the slave processors will therefore assert the slave processors’ Reset inputs when the master processor is reset. After initializing itself, the master processor can assert the Run/Stall input to each slave processor and then remove the reset signal to each slave processor. While each slave processor is stalled, the master processor can transfer program code from the large global memory (or from its own local memory) to each slave processor’s local instruction memory using inbound-PIF write operations—which are permitted even while the slave processors are stalled.
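A hedged sketch of this start-up sequence, as the master's firmware might express it. The output-port address, the bit assignments, and the slave_iram() helper are all assumptions made for illustration; they are not the Diamond core's documented interface.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical memory-mapped view of the master's output port.
       Bit assignments are invented for this sketch. */
    #define OUTPUT_PORT  (*(volatile uint32_t *)0x70000000u)
    #define RESET_BIT(n) (1u << (n))        /* bits 0-2: slave Reset, via inverters */
    #define STALL_BIT(n) (1u << ((n) + 3))  /* bits 3-5: slave Run/Stall */

    /* Assumed helper returning the global-address window onto
       slave n's local instruction RAM. */
    extern volatile uint32_t *slave_iram(unsigned n);

    void load_and_start_slave(unsigned n, const uint32_t *image, size_t words)
    {
        /* 1. Assert Run/Stall so the slave cannot fetch instructions. */
        OUTPUT_PORT |= STALL_BIT(n);

        /* 2. Drive the reset wire high; the external inverter then
              deasserts the slave's Reset input. */
        OUTPUT_PORT |= RESET_BIT(n);

        /* 3. Copy the program image into the slave's local instruction
              memory with inbound-PIF writes, which are permitted while
              the slave is stalled. */
        volatile uint32_t *dst = slave_iram(n);
        for (size_t i = 0; i < words; i++)
            dst[i] = image[i];

        /* 4. Release Run/Stall; the slave starts fetching at its
              reset vector. */
        OUTPUT_PORT &= ~STALL_BIT(n);
    }

Step 4 corresponds to the release described in the next paragraph.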

When the master processor releases a slave processor's Run/Stall input, the slave processor will commence program execution starting at its reset-vector address. Because the Diamond 108Mini processor's reset vector is set to 0x50000000, each slave processor's first instruction fetch will be directed to the PIF-connected global memory. That first instruction, however, can set up a jump to a location in the slave processor's local instruction memory, which will then isolate the slave's operations from the PIF bus.
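A hedged sketch of such a first-fetch stub, written here in C for consistency with the other examples (in a real system this would more likely be a few Xtensa assembly instructions placed at the reset vector). The local entry address is an assumption.

    /* Minimal stub located at the reset vector (0x50000000) in the
       PIF-connected global memory. Its only job is to transfer control
       into the slave's local instruction RAM; the entry address below
       is hypothetical. */
    #define LOCAL_IRAM_ENTRY 0x40000000u  /* assumed local instruction-RAM entry */

    void reset_stub(void)
    {
        void (*local_entry)(void) = (void (*)(void))LOCAL_IRAM_ENTRY;
        local_entry();  /* jump into local memory; no further PIF instruction fetches */
    }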

This scheme allows two of the slave processors to execute code concurrently from their own local memories while the master processor programs or reprograms the third slave processor's instruction memory. Although this design is based on four processors that together consume less than 2 mm² on the SOC die, the resulting multiprocessor system is capable of quite sophisticated behavior and very good processing performance.

There are two noteworthy, performance-related reasons why the four Diamond 108Mini processor cores shown in Figure 7.5 should not execute code directly out of the global memory attached to the PIF bus. First, as mentioned above, the local memories provide faster access times than PIF-connected memory, so operating almost exclusively from the local memories gives the master and slave processors much better performance. Memory transactions on local-memory buses are about five times faster than the same transactions conducted over the PIF bus.

Note that this speed imbalance between the local-memory buses and the PIF bus is not due to poor PIF design. It exists because the processor's main bus must hide speculative activity caused by the processor's 5-stage RISC pipeline. At any time, there are as many as five instructions in this pipeline, each at a different stage of completion. When the processor completes the execution of a branch instruction, it discards all of the upstream instructions in its pipeline, including load instructions that may already have generated read operations. The processor fetched these instructions and started to execute them speculatively, before discovering the branch. This speculative, pipeline-oriented operation is one of the reasons behind the RISC processor's good performance. However, extra cycles are required to resolve speculative operations in the processor's pipeline before load or store transactions appear on the PIF bus.

Local-memory buses do not hide speculative activity, so they are inherently faster than the PIF bus. Any devices connected to the local-memory buses must be immune to read side effects (the main problem caused by allowing speculative activity to be visible on a processor's external bus). Simple memories—static and dynamic RAMs and ROMs—are immune to read side effects. Peripheral devices and FIFO memories generally aren't immune unless they have been specially designed for speculative operation. System designers who attach devices with read side effects to a RISC processor's local-memory buses will encounter unexpected and incorrect system behavior. Avoid such designs.
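A hedged illustration of a read side effect, using an invented FIFO register address: a single load both returns data and changes the device's state, which is exactly what a speculatively issued (and later discarded) read must never be allowed to do.

    #include <stdint.h>

    /* Hypothetical memory-mapped FIFO data register. */
    #define FIFO_DATA (*(volatile uint32_t *)0x80000000u)

    uint32_t fifo_pop(void)
    {
        /* Reading this register returns the head entry AND advances the
           FIFO's read pointer. If the processor issued this load
           speculatively and then discarded it after resolving a branch,
           the entry would be lost: a read side effect. */
        return FIFO_DATA;
    }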

All four processors share one 32-bit PIF bus in the system design shown in Figure 7.5, which leads to the second reason for minimizing each processor’s PIF activity. Under ideal conditions—and system conditions are rarely ideal—each processor would get a bit less than 25% of the PIF bus’ bandwidth (assuming all processors need equal bus bandwidth) when multimaster bus overhead is accounted for. Looked at another way, the PIF bus must operate at a speed that delivers somewhat more than the sum of all four processors’ required bus bandwidth. Otherwise, the processors will all starve for instructions and data and overall system performance will suffer.
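A back-of-the-envelope example with assumed numbers: if each of the four processors needs 50 Mbytes/sec of instruction and data traffic, the PIF must carry 4 × 50 = 200 Mbytes/sec of payload plus multimaster arbitration overhead of perhaps 10 to 20%, so the shared bus must sustain roughly 220 to 240 Mbytes/sec to keep all four processors fed. These figures are illustrative only; the point is that the bus budget must exceed the simple sum of the processors' demands.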

RISC processors often employ instruction and data caches to avoid such starvation problems. However, cache memories are larger than simple local memories because they require additional RAM bits to hold the cache tags. In addition, integral cache controllers increase a processor’s gate count. Consequently, the Diamond 108Mini processor core, which was created to be as small as possible, was designed as a cacheless core with local instruction- and data-memory interfaces to minimize processor gate count.

To get good performance from such a processor core, the processor's code should be stored in local instruction memory and its data should be stored in local data memory. Local-memory access is as fast as cache-memory access, so a cacheless processor experiences no performance loss as long as it can perform its work using just its local memories. This use model explains why the Diamond 108Mini processor core was configured with two local data memories: so that it can hold larger data sets (as large as 256 Kbytes) entirely in local memory and still deliver good performance.
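As a hedged sketch of how code and data are steered into the local memories, toolchains typically do this with named linker sections; the section names below are assumptions for illustration, not the documented names from the Diamond toolchain's linker-support packages.

    #include <stdint.h>

    /* Place a large table in local data RAM and a hot routine in local
       instruction RAM via (assumed) section names; the real names come
       from the core's linker-support package. */
    static int16_t table[16384]
        __attribute__((section(".local_dram0.data")));   /* 32 Kbytes */

    __attribute__((section(".local_iram.text")))
    int32_t table_sum(void)
    {
        int32_t acc = 0;
        for (int i = 0; i < 16384; i++)
            acc += table[i];   /* every access hits fast local data RAM */
        return acc;
    }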

Note: Other members of the Diamond Standard Series of processor cores include cache-memory controllers to take on even larger tasks.
