4.2 CACHE COHERENCE AND MEMORY CONSISTENCY

Attaching private caches to processors speeds up program execution by reducing the effective memory latency to match the processor speed. Read/write operations then take about the same time as arithmetic and logic unit (ALU) operations. Table 4.1 summarizes the terminology used to describe cache coherence. A cache is useful because most tasks or applications display temporal locality and spatial locality. Temporal locality refers to reusing the same data in the near future. Spatial locality refers to using data located near the current data in the near future. For this reason, data load/store operations between the shared memory and the caches take place in units of blocks. Figure 4.2 shows the relation between the blocks stored in the shared memory and their copies in the cache of a certain processor. The cache stores some of the blocks, each identified by a tag that records the address of the block in the shared memory. Each block stored in the cache occupies a row called a line. A line contains the following components:

1. Valid bit (V) to indicate whether the data in the line are coherent with the block in the shared memory or not

2. Index, which is the address of the line in the cache

3. Tag, which refers to the address of the block in the shared memory

4. Data, which comprise the data stored in the block
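
To make these components concrete, the following C sketch shows one way to represent a line and to look up a word in a direct-mapped cache. The block size, cache size, and address split are illustrative assumptions, not values taken from the text.

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_WORDS 8     /* words per block (assumed) */
#define NUM_LINES   256   /* lines in the cache (assumed) */

/* One cache line: valid bit, tag, and the block's data. The index is
 * not stored in the line; it is the line's position in the array. */
typedef struct {
    bool     valid;                /* V: line holds a coherent copy  */
    uint32_t tag;                  /* block address in shared memory */
    uint32_t data[BLOCK_WORDS];    /* the data words of the block    */
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Split a word address into (tag, index, offset) and report a hit. */
bool cache_lookup(uint32_t addr, uint32_t *word_out)
{
    uint32_t offset = addr % BLOCK_WORDS;                 /* word in block */
    uint32_t index  = (addr / BLOCK_WORDS) % NUM_LINES;   /* line address  */
    uint32_t tag    = addr / (BLOCK_WORDS * NUM_LINES);   /* rest of addr  */

    if (cache[index].valid && cache[index].tag == tag) {
        *word_out = cache[index].data[offset];
        return true;   /* hit: locality pays off */
    }
    return false;      /* miss: fetch the whole block from shared memory */
}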

Table 4.1 Terminology Used to Describe Cache Coherence

Term: Meaning
Block: Group of contiguous words or data stored in the shared memory
Broadcast: Information is sent to all caches.
Cache: A small, high-speed memory implemented on the same chip as the processor to reduce memory access time and to reduce shared-memory collisions between processors
Cache coherence: The contents of a block in the shared memory and in the different caches are the same.
Cache coherence protocol: Policy implemented to maintain cache coherence
Coherent system: Every read of a block from the shared memory finds the data produced by the most recent previous write by any processor.
Global data: Data stored in the shared memory
Line: A block stored in a cache along with its tag and valid bit
Local data: Data stored in the cache
Modified block: The data of a block in a cache have not yet been updated in the shared memory.
Multicast: Information is sent to some, but not all, caches.
Replacement: Removing a block from the cache to make room for a new block
Spatial locality: Data near the current data (i.e., in the same block) will be used over a short period of time.
Temporal locality: The same data word in a block will be used again over a short period of time.
Unicast: Information is sent to only one cache.
Valid: Block contents are up to date with the shared memory.
Write-back: A block in the shared memory is updated only when the corresponding block in a cache is replaced.
Write-through: A block in the shared memory is updated whenever the corresponding block in a cache is modified.
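
The write-back and write-through entries above differ only in when the shared memory is updated. The following C sketch contrasts the two policies for a single cache line; the one-word blocks, the memory size, and all names are simplifying assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

/* One cache line holding a single word, to keep the sketch short. */
typedef struct {
    bool     valid;
    bool     dirty;   /* used by write-back only: modified since load */
    uint32_t addr;    /* address of the block in shared memory */
    uint32_t data;
} line_t;

static uint32_t shared_mem[1024];   /* shared memory (size assumed) */

/* Write-through: the shared memory is updated on every cache write. */
void write_through(line_t *line, uint32_t value)
{
    line->data = value;
    shared_mem[line->addr] = value;   /* memory stays consistent */
}

/* Write-back: the cache line is marked dirty; the shared memory is
 * updated only when the line is replaced. */
void write_back(line_t *line, uint32_t value)
{
    line->data = value;
    line->dirty = true;               /* memory is now stale */
}

/* Replacement: a dirty line must be flushed before it is evicted. */
void replace_line(line_t *line)
{
    if (line->dirty)
        shared_mem[line->addr] = line->data;
    line->valid = false;
    line->dirty = false;
}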

Figure 4.2 Relation between the blocks stored in the shared memory and their copies in the cache of a certain processor.


For shared memory systems, caches also help eliminate memory contention when two or more processors attempt to access the same memory module [27].

Copies of the data stored in the shared memory must match those copies stored in the local caches. This is referred to as cache coherence. The copies of a shared variable are coherent if they are all equal [35]. Effectively, the caches are coherent if every read operation by a processor finds a value produced by a previous write [27]. Cache coherence is important to guarantee correct program execution and to ensure high system performance. Assume two processors, P1 and P2, use the same shared variable stored in their two separate caches. If P1 modifies its local value, we must make sure that P2 is aware of the change.
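
The hazard described above can be illustrated with a few lines of C; the variable names and values here are made up for the illustration. Without a coherence protocol, nothing tells P2 about P1's write, and P2 keeps reading the stale value.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t shared_b = 5;        /* block b in shared memory     */
    uint32_t cache_p1 = shared_b; /* P1 loads b into its cache    */
    uint32_t cache_p2 = shared_b; /* P2 loads b into its cache    */

    cache_p1 = 7;                 /* P1 modifies its local copy   */

    /* P2 is unaware of the change and still sees the old value. */
    printf("P1 sees b = %u, P2 sees b = %u\n", cache_p1, cache_p2);
    return 0;
}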

Consider the case of a shared memory system with four processors and one shared memory module. Table 4.2 illustrates the problems that arise when a block in the shared memory is loaded by the processors and then gets modified by one or more processors. A write-through policy is assumed.

Table 4.2 Example of Cache Coherence Problem with Write-Through Policy


Table 4.3 illustrates the problems that arise when a block in the shared memory is loaded by the processors and then gets modified by one or more processors, this time assuming a write-back policy. The cache contents at the different time instances, according to the events in Tables 4.2 and 4.3, are explained below.

Table 4.3 Example of Cache Coherence Problem with Write-Back Policy


Time 0. The cache C0 in processor P0 loads block b for use during its processing.

Time 1. Both caches C1 and C3 also load the same block from the shared memory. Now we have three copies of b.

Time 2. Processor P3 updates its copy of b in C3. At this time, the data in the shared memory and the caches are not consistent.

Time 3. P3 performs a write-through operation to make the data in C3 and in the shared memory consistent.

Time 4. Which processor should update the shared memory? With the write-back policy, this is determined by whichever processor performs a cache replacement. In this case, it happens to be P1.

Time 5. P2 loads block b, and the central controller informs all processors of the new value b1. What should P0 and P3 do? Should they replace their data, or should they inform the shared memory to use their own versions of b?

It is clear from the two situations in Tables 4.2 and 4.3 that the cache coherence problem is serious, especially for multiprocessor systems. The correct value of the memory contents should not be determined simply by which processor performed the cache store (write) into memory first. For example, in Table 4.2, we see that P3 performed the first change in block b, followed by P0 and then P1. This might not have been the correct sequence for updating block b in the shared memory. The reason is that the processors are not synchronized with each other, and the scheduling of tasks and threads in one processor is not aligned with that in the other processors. So what is the correct sequence for updating b? There are two ways to arrive at the answer:

1. Correct update of b based on sequential execution of the application

2. Correct update of b based on data dependencies

Sequential execution of the program means running the program on a single-processor sequential machine, which has no multiprocessing or multithreading capabilities. The order of accessing and updating a variable in that case is the correct order as designed by the application developer. This sequence of variable accesses and updates is the one that should be followed when implementing the application on a multiprocessor system, and it determines how the shared variable is updated in memory and in all the caches. Cache/memory access is correct if the results obtained by the parallel machine are always identical to the results obtained by the sequential machine.

Correct update of b based on data dependencies is a byproduct of implementing a timing sequence for the application tasks. The data scheduling strategies developed in Chapters 8–11 all identify the correct sequence for updating the algorithm variables. This serves as a guideline for determining the cache update sequencing.

Updating the values of shared variables by the processors is expected in a shared memory system. A cache coherence protocol must be used to ensure that the contents of the cache memories are consistent with the contents of the shared memory. There are two main cache coherence protocols:

1. Directory protocols

2. Snoopy protocols

4.2.1 Cache Coherence Using Directory Protocols

The main components for maintaining cache coherence using directory protocols are shown in Fig. 4.3. The local caches associated with the processors have local cache controllers to coordinate updating the copies of the shared variables stored in the local caches. The central controller is responsible for maintaining cache coherence for the system. Part of the shared memory is a directory that stores entries denoting the state of each shared block. The structure of each entry in the directory depends on the implementation details of the directory protocol used. The central controller handles local cache requests and is responsible for informing the local cache controllers of any changes in the states of the shared variables. The interconnection network enables communication between the controllers and between the caches and the shared memory.

Figure 4.3 System components for cache coherence using directory protocols.


Figure 4.4 shows the details of the full-map directory protocol. Each entry contains n + 2 bits, where n is the number of processors; the figure assumes n = 8. The bit labeled D indicates whether the data are valid (0) or modified (1). The bit labeled X indicates whether the update information should be broadcast (B) to the processors or not broadcast (NB). The remaining n bits indicate which caches hold a copy of the block. We see from the figure that if the block corresponding to the shown entry is modified, then only the caches in processors 1 and 4 will have to be informed of that change.

Figure 4.4 Full-map directory-based scheme.


The full-map directory scheme knows exactly which caches hold copies of each shared block, and only the caches holding those copies are involved in coherence actions for a given block. However, the scheme is inflexible since all coherence transactions must be routed through the central controller, which could prove to be a bottleneck. Also, the size of each entry is directly proportional to the number of processors and must be changed when the number of processors changes.
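
As a rough sketch of the full-map entry just described, the following C fragment packs the D bit, the X bit, and one presence bit per processor into a structure, and shows how a central controller could multicast an update notice to just the caches whose presence bit is set. The structure layout and function names are assumptions for illustration, not the protocol's prescribed encoding.

#include <stdint.h>
#include <stdio.h>

#define NPROC 8   /* n = 8 processors, as assumed in Fig. 4.4 */

/* Full-map directory entry: n + 2 bits per shared block. */
typedef struct {
    unsigned d : 1;     /* D: 0 = valid, 1 = modified                  */
    unsigned x : 1;     /* X: 1 = broadcast (B), 0 = no-broadcast (NB) */
    uint8_t  present;   /* one presence bit per processor (n = 8)      */
} dir_entry_t;

/* Central controller action: when the block changes, inform only the
 * caches whose presence bit is set (a multicast, not a broadcast). */
void notify_sharers(const dir_entry_t *e)
{
    for (int p = 0; p < NPROC; p++)
        if (e->present & (1u << p))
            printf("inform cache of P%d of the change to the block\n", p);
}

int main(void)
{
    /* The entry shown in Fig. 4.4: block cached by P1 and P4 only. */
    dir_entry_t e = { .d = 1, .x = 0, .present = (1u << 1) | (1u << 4) };
    notify_sharers(&e);   /* only P1 and P4 are informed */
    return 0;
}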

4.2.2 Cache Coherence Using Snoopy Protocols

Figure 4.5 shows the main components for cache coherence using snoopy protocols. Unlike directory protocols, snoopy protocols use neither a directory in the shared memory nor a central controller. The coherence actions associated with a block are communicated between a local cache and the shared memory, and these transactions are monitored by all the other local caches. The interconnection network must support broadcasting the data transmissions so that every processor is able to monitor all the network activities. A shared bus is suited to this broadcast mode since each bus transaction can easily be sensed by all processors connected to the bus. The shared bus, however, has a limited bandwidth that allows only one transaction to take place at any given time.

Figure 4.5 System components for cache coherence using snoopy protocols.


When a memory write operation is performed by any processor, all other processors decide whether this operation is relevant to them. The write operation by processor Pi is relevant to processor Pj if Pj has a copy of the block b being accessed by Pi. There are two options for Pj based on its cache update policy. In the case of an invalidation-based policy, Pj invalidates its own copy of b and reloads b from the shared memory the next time it needs the data. In the case of an update-based policy, Pj updates its copy of b using the data available on the bus, either while the shared memory is being updated or at some time thereafter.
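
A minimal C sketch of this snooping decision follows, assuming each cache runs a handler for every write it observes on the bus; the types and names are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

typedef enum { POLICY_INVALIDATE, POLICY_UPDATE } policy_t;

typedef struct {
    bool     valid;
    uint32_t tag;      /* block address in shared memory */
    uint32_t data;
} line_t;

/* Snoop handler run by every other cache Pj when it observes a write
 * to block `tag` with value `value` on the shared bus. */
void snoop_write(line_t *my_copy, uint32_t tag, uint32_t value,
                 policy_t policy)
{
    if (!my_copy->valid || my_copy->tag != tag)
        return;                    /* the write is not relevant to Pj */

    if (policy == POLICY_INVALIDATE)
        my_copy->valid = false;    /* reload from memory on next use  */
    else
        my_copy->data = value;     /* take the new value off the bus  */
}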
