On a read from Processor 1, a BusRd is posted, the main memory responds with the block, and
Processor 1 picks up the block and stores it in its cache in the exclusive state. On a write from Processor 1,
in contrast to MSI, the cache state transitions to modified without incurring a bus transaction, since
we know that in the exclusive state, no other cached copies exist.
When Processor 3 makes a read request, a BusRd is posted on the bus. Processor 1’s snooper
picks it up, checks its cache tag, and finds that it has the block in a modified state. This means
that Processor 1 has the latest (and the only valid) copy, and flushes it in response to the snooped
request. The block state then transitions to shared. In the meantime, the memory controller
also attempts to fetch the block from the main memory because it does not know if eventually a
cache will supply the data or not. Processor 3 snoops the flush, and by matching the address of the
block being flushed with its outstanding read transaction, it knows that the flushed block should be
treated as the reply to its read request. The block is therefore picked up and stored in its cache in the shared state. The memory controller, which has been trying to fetch the block from the main memory, also snoops the flushed block, picks it up, cancels its memory fetch, and overwrites the stale copy of the block in memory.
Next, Processor 3 has a write request. It posts a BusUpgr on the bus to invalidate copies in other
caches. Processor 1’s coherence controller responds by invalidating its copy. Processor 3’s cache
block state transitions to modified.
When Processor 1 attempts to read the block, it suffers a cache miss as a result of the earlier
invalidation that it received. Processor 1 posts a BusRd and Processor 3 responds by flushing its
cache block and transitioning its state to shared. The flushed block also updates the main memory
copy, so as a result the cache blocks are now clean.
When Processor 3 reads the block, it finds it in state shared in its cache. Since it has a valid
copy, it has a cache hit and does not generate a bus request.
Finally, Processor 2 attempts to read the block and posts a BusRd on the bus. Unlike in MSI, which does not employ cache-to-cache transfers, in MESI both Processor 1's and Processor 3's cache controllers attempt to supply the block by flushing it on the bus through FlushOpt. One of them wins and supplies the block, which Processor 2 picks up. The memory controller cancels its fetch upon snooping the FlushOpt block.
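To make the example concrete, the following C++ sketch (illustrative only; names such as read, write, and copiesExist are not from the protocol specification) replays the same access sequence and prints the bus transactions and state transitions that MESI produces. The copiesExist function stands in for the COPIES-EXIST bus line used to choose between the exclusive and shared states.

#include <cstdio>

// MESI states for the single block tracked in this example.
enum State { I, S, E, M };
const char* name(State s) { static const char* n[] = {"I", "S", "E", "M"}; return n[s]; }

State st[3] = {I, I, I};   // block state in the caches of Processors 1, 2, and 3

// Models the COPIES-EXIST ("C") bus line: true if another cache holds a valid copy.
bool copiesExist(int self) {
    for (int p = 0; p < 3; ++p)
        if (p != self && st[p] != I) return true;
    return false;
}

void read(int p) {
    if (st[p] != I) { printf("P%d read hit (%s), no bus transaction\n", p + 1, name(st[p])); return; }
    printf("P%d read miss: BusRd", p + 1);
    bool c = copiesExist(p);
    for (int q = 0; q < 3; ++q) {                         // the other snoopers react
        if (q == p) continue;
        if (st[q] == M)      { printf(", P%d Flush (memory updated)", q + 1); st[q] = S; }
        else if (st[q] != I) { printf(", P%d attempts FlushOpt", q + 1); st[q] = S; }  // one winner supplies
    }
    st[p] = c ? S : E;                                    // C asserted -> shared, otherwise exclusive
    printf(" -> P%d in %s\n", p + 1, name(st[p]));
}

void write(int p) {
    if (st[p] == M) { printf("P%d write hit (M), no bus transaction\n", p + 1); return; }
    if (st[p] == E) { st[p] = M; printf("P%d write hit: E -> M, no bus transaction\n", p + 1); return; }
    printf("P%d write: %s", p + 1, st[p] == S ? "BusUpgr" : "BusRdX");
    for (int q = 0; q < 3; ++q) {
        if (q == p || st[q] == I) continue;
        if (st[q] == M) printf(", P%d Flush", q + 1);     // a dirty copy is supplied before invalidation
        printf(", P%d invalidates", q + 1);
        st[q] = I;
    }
    st[p] = M;
    printf(" -> P%d in M\n", p + 1);
}

int main() {
    read(0);  write(0);   // Processor 1: exclusive, then a silent E -> M transition
    read(2);  write(2);   // Processor 3: Flush from P1, then BusUpgr
    read(0);              // Processor 1: Flush from P3, both copies now clean
    read(2);              // Processor 3: cache hit, no bus request
    read(1);              // Processor 2: cache-to-cache transfer via FlushOpt
    return 0;
}

The printed trace follows the narrative above, ending with Processor 2's read being satisfied by a FlushOpt from one of the two sharers.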
While the MESI protocol has improved various performance aspects of the MSI protocol, it still
suffers from a remaining problem that is potentially quite serious. When a cache block is read and
written successively by multiple processors, each read incurs an intervention that requires the
owner to flush the cache block. While the flushed block must be picked up by the requestor as a
means to ensure write propagation, the flushed block updating the copy in the main memory is not
a correctness requirement for write propagation. Unfortunately, the definition of the shared state is
that the cache block is clean, i.e., the value of the block is the same as that in the main memory. Therefore, in order to preserve the meaning of the shared state, the main memory has no choice but to update its copy. This is referred to as clean sharing: when a block is shared by multiple caches, it has to be clean. Note also that clean sharing implies that evicting a shared cache block can
be performed silently, i.e., the block is simply discarded. Unfortunately, by keeping clean sharing,
the main memory is updated too many times. In some systems, the bandwidth to the main memory
is already restricted, so updating the main memory on each cache flush uses an excessive amount of
bandwidth. For example, if multiple cores in a multicore architecture maintain coherence at the L2
cache level, the L2 caches can communicate with each other using an on-chip interconnection, but
updating the main memory must be performed by going off-chip. Off-chip bandwidth is severely
restricted in a multicore architecture because of the limited availability of pins and slow off-chip
interconnection. Thus, it would be desirable if a cache flush did not need to update the main memory, which can be achieved by allowing a dirty block to be shared by multiple caches. Dirty sharing is supported through an additional state in the MOESI protocol, which is described in the next section.
7.2.4 MOESI Protocol with Write Back Caches
As mentioned earlier, the bandwidth to the main memory can be reduced by allowing dirty sharing.
The MOESI protocol allows dirty sharing. The MESI protocol is used by Intel processors such as
the Xeon processor while the MOESI protocol is used by processors such as the AMD Opteron [4].
In the MOESI protocol, as in the MSI protocol, the processor requests to the cache include:
1. PrRd: processor-side request to read from a cache block
2. PrWr: processor-side request to write to a cache block
Bus-side requests include:
1. BusRd: snooped request that indicates there is a read request to a cache block made by
another processor.
2. BusRdX: snooped request that indicates there is a read exclusive (write) request to a cache
block made by another processor which does not already have the block.
3. BusUpgr: snooped request that indicates that there is a write request to a cache block that
another processor already has in its cache.
4. Flush: snooped request that indicates that an entire cache block is placed on the bus by a
processor to facilitate a transfer to another processor’s cache.
5. FlushOpt: snooped request that indicates that an entire cache block is posted on the bus
in order to supply it to another processor. We refer to it as FlushOpt because unlike Flush
which is needed for write propagation correctness, FlushOpt is implemented as a performance
enhancing feature that can be removed without impacting correctness.
6. FlushWB: snooped request that indicates that an entire cache block is written back to the main
memory by another processor, and it is not meant as a transfer from one cache to another.
Each cache block has an associated state which can have one of the following values:
1. Modified (M): the cache block is valid in only one cache, and the value is (likely) different
than the one in the main memory. This state has the same meaning as the dirty state in a write
back cache for a single system, except that now it also implies exclusive ownership.
2. Exclusive (E): the cache block is valid, clean, and only resides in one cache.
3. Owned (O): the cache block is valid, possibly dirty, and may reside in multiple caches. How-
ever, when there are multiple cached copies, there can only be one cache that has the block in the owned state; the other caches must have the block in the shared state.
4. Shared (S): the cache block is valid, possibly dirty, and may reside in multiple caches.
5. Invalid (I): the cache block is invalid.
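As a compact reference, the request types and states just listed can be written down as a small C++ sketch. The helper predicates (mayBeDirty, othersMayCache, and mustSupplyOnSnoop are illustrative names, not protocol terminology) encode the properties the text attaches to each state; in particular, unlike in MESI, the shared state no longer implies a clean block.

#include <cstdio>

// Bus-side request types of the MOESI protocol, as listed above.
enum BusRequest { BusRd, BusRdX, BusUpgr, Flush, FlushOpt, FlushWB };

// Cache block states and the properties the text associates with each of them.
enum BlockState { Modified, Owned, Exclusive, Shared, Invalid };

bool isValid(BlockState s)        { return s != Invalid; }
bool mayBeDirty(BlockState s)     { return s == Modified || s == Owned || s == Shared; }
bool othersMayCache(BlockState s) { return s == Owned || s == Shared; }
// Flushing on a snooped read is a correctness requirement only from M and O;
// a block in E may optionally be supplied via FlushOpt as a performance enhancement.
bool mustSupplyOnSnoop(BlockState s) { return s == Modified || s == Owned; }

int main() {
    static const char* n[] = {"M", "O", "E", "S", "I"};
    printf("state  valid  may-be-dirty  others-may-cache  must-supply-on-snoop\n");
    for (int s = Modified; s <= Invalid; ++s)
        printf("  %s      %d          %d              %d                  %d\n",
               n[s], isValid(BlockState(s)), mayBeDirty(BlockState(s)),
               othersMayCache(BlockState(s)), mustSupplyOnSnoop(BlockState(s)));
    return 0;
}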
The idea behind the owned state is that when a cache block is shared across caches, its value
is allowed to differ from that in the main memory. One cache is assigned as the owner and caches
the block in state “O” or owned, while others cache it in the shared state. The existence of the
owner simplifies how data is supplied in a cache-to-cache transfer. For example, when a BusRd is
snooped, we can let the owner supply the data by flushing the block, while the other controllers take no
action. The main memory does not need to pick up a Flush or FlushOpt to update the block in main
memory. In addition, we can also assign the owner to be responsible for writing back the block
to the main memory when the block is evicted. Hence, when a cache block in the shared state is
evicted, regardless of whether it is clean or dirty, it can be discarded. Only when the evicted cache block is in the owned state is it written back to update the main memory. To indicate that
a block in the owned state is evicted and needs to update the main memory, a different bus request
type is needed, which we refer to as FlushWB.
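The resulting division of labor on eviction can be summarized in a short sketch (again with illustrative names; the modified case is included under the usual write back assumption for a dirty sole copy):

#include <cstdio>

enum BlockState { Modified, Owned, Exclusive, Shared, Invalid };

// Bus request (if any) that a cache must issue when it evicts a block.
const char* onEvict(BlockState s) {
    switch (s) {
    case Owned:                            // the owner is responsible for updating memory
    case Modified:  return "FlushWB";      // the memory copy may be stale, so write it back
    case Exclusive:
    case Shared:    return "silent drop";  // the owner, if any, still holds the dirty data
    default:        return "nothing (block not cached)";
    }
}

int main() {
    printf("evict S: %s\n", onEvict(Shared));   // clean or dirty-shared copy, simply discarded
    printf("evict O: %s\n", onEvict(Owned));    // converts dirty sharing back to clean sharing
    return 0;
}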
Who should be the owner of a cache block? To answer this question, consider that when there
is dirty sharing, a block in the shared state can be replaced silently, but a block in the owned state
must be written back to the main memory. Bus bandwidth can be conserved if the frequency of
write backs is minimized. To reduce the frequency of write backs, the cache that will hold the
block the longest should be selected as the owner. Although predicting which cache will hold a
particular shared block the longest is difficult, good heuristics can often help. Since applications
tend to exhibit temporal locality, a good heuristic for predicting such a cache is selecting the cache
that last wrote to or read from the block as the owner. However, reads to valid blocks do not
incur any bus transactions, so it is inconvenient (and expensive) to change the ownership when a
processor reads from a shared block in the cache. Thus, one heuristic that can be used (implemented
in AMD Opteron systems) is to select the last cache that wrote to the block as the owner. More
specifically, the cache that has the block in the modified state, when it receives an intervention
request, downgrades the block state to owned – in effect becoming the owner of the block.
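A small sketch of this last-writer heuristic: the cache holding the block in the modified state downgrades it to owned when it snoops another processor's read, and thereby becomes the block's supplier. The function name is illustrative.

#include <cassert>

enum BlockState { Modified, Owned, Exclusive, Shared, Invalid };

// State of a snooping cache after it observes a BusRd for a block it holds.
BlockState afterSnoopedBusRd(BlockState s) {
    switch (s) {
    case Modified:  return Owned;    // the last writer keeps ownership of the dirty block
    case Exclusive: return Shared;   // the copy is clean, so no ownership is needed
    default:        return s;        // O, S, and I are unchanged by a snooped BusRd
    }
}

int main() {
    assert(afterSnoopedBusRd(Modified)  == Owned);   // the downgrade described in the text
    assert(afterSnoopedBusRd(Exclusive) == Shared);
    assert(afterSnoopedBusRd(Shared)    == Shared);
    return 0;
}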
We assume that the caches use write allocate and write invalidate cache coherence policies. The
finite state machine corresponding to the MOESI coherence protocol for write back caches is shown
in Figure 7.7. In the figure, the response to processor-side requests is shown on the top part, while
the response to the snooper-side requests is shown on the bottom part.
As before, the I state represents two cases: a case in which the block is not cached, or when
the block is cached but its state is invalid. Let us consider the top part of the figure that shows a
reaction to a processor read or write request. First, consider when the cache block is in “I” (invalid)
state. When there is a processor read request, it suffers a cache miss. To load the data into the
cache, a BusRd is posted on the bus, and the memory controller responds to the BusRd by fetching
the block from the main memory. Other snoopers will snoop the request and check their caches to
determine if any of them has a copy. If a copy is found, the cache asserts the COPIES-EXIST bus
line (indicated as “C” in the figure). In that case, the fetched block is placed in the requestor’s cache
in the shared state. If, on the other hand, the COPIES-EXIST bus line is not asserted (indicated as
the “!C” in the figure), the fetched block is placed in the requestor’s cache in the exclusive state.
When there is a processor write request, the cache must allocate a valid copy of the cache block,
and to do that, it posts a BusRdX request on the bus. Other caches will respond by invalidating their
Figure 7.7: State transition diagram for the MOESI coherence protocol.
cached copies, while the memory responds by supplying the requested block. When the requestor
gets the block, it is placed in the cache in the “M” or modified state.
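This miss handling from the invalid state can be sketched as follows (onReadMiss and onWriteMiss are illustrative names; the boolean argument stands for the COPIES-EXIST line):

#include <cstdio>

enum BlockState { Modified, Owned, Exclusive, Shared, Invalid };

// Processor read from a block in the I state: post a BusRd and choose the new
// state based on whether the COPIES-EXIST ("C") line was asserted.
BlockState onReadMiss(bool copiesExist) {
    printf("post BusRd -> %s\n", copiesExist ? "S (C asserted)" : "E (!C)");
    return copiesExist ? Shared : Exclusive;
}

// Processor write to a block in the I state: post a BusRdX, which makes other
// caches invalidate their copies, and install the block in the M state.
BlockState onWriteMiss() {
    printf("post BusRdX -> M\n");
    return Modified;
}

int main() {
    BlockState a = onReadMiss(true);    // some other cache held a copy
    BlockState b = onReadMiss(false);   // no other cached copy
    BlockState c = onWriteMiss();
    return (a == Shared && b == Exclusive && c == Modified) ? 0 : 1;
}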
Suppose now the cache already has the block in the exclusive state. Any read to the block is a
cache hit and proceeds without generating a bus transaction. A write to the block, as in the MESI
protocol, does not generate a bus transaction because a block in the exclusive state implies that it is
the only cached copy in the system. So the write can proceed after the state transitions to modified.
Suppose now the cache already has the block in the shared state. On a processor read, the block
is found in the cache and data is returned to the processor. This does not incur a bus transaction
since it is a cache hit, and the state remains unchanged. On the other hand, on a processor write,
there may be other cached copies that need to be invalidated, so a BusUpgr is posted on the bus, and
the state transitions to modified.
If the cache block is present in the cache in the modified state, reads or writes by the processor
do not change the state, and no bus transaction is generated since the cache can be sure that no other cached
copies exist: an earlier invalidation has ensured that only one modified copy can exist in the system.
If the cache block is present in the cache in the owned state, this means that it is dirty and may also reside in other caches in the shared state. A processor read can simply obtain the value from the cache. A
processor write must invalidate other cached copies by posting a BusUpgr transaction.
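The hit cases discussed in the last few paragraphs can be collected into one processor-side transition function. The sketch below uses illustrative names; it returns the new state and prints the bus request, if any, that the cache posts.

#include <cstdio>

enum BlockState { Modified, Owned, Exclusive, Shared, Invalid };
const char* name(BlockState s) { static const char* n[] = {"M", "O", "E", "S", "I"}; return n[s]; }

// Processor-side transition for a block that is already valid in the cache;
// isWrite selects between PrRd and PrWr.
BlockState onProcessorHit(BlockState s, bool isWrite) {
    if (!isWrite) return s;                   // read hits never change the state or use the bus
    switch (s) {
    case Exclusive: return Modified;          // silent upgrade, no bus transaction (as in MESI)
    case Shared:
    case Owned:     printf("post BusUpgr\n"); // other cached copies must be invalidated
                    return Modified;
    case Modified:  return Modified;          // already the only valid copy
    default:        return s;                 // I is handled by the miss path, not here
    }
}

int main() {
    printf("E + PrWr -> %s\n", name(onProcessorHit(Exclusive, true)));
    printf("O + PrWr -> %s\n", name(onProcessorHit(Owned, true)));
    printf("S + PrRd -> %s\n", name(onProcessorHit(Shared, false)));
    return 0;
}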
Now let us look at how the finite state machine reacts to snooped bus transactions. If the cache
does not have the block, or has it in the invalid state, a snooped BusRd, BusRdX, or BusUpgr does not affect it, so the request is ignored.
If the cache block is in the exclusive state, when a BusRd request is snooped, the block is supplied on the bus using FlushOpt, and the state transitions to shared. When a BusRdX request is snooped,
the block is flushed using FlushOpt and the state transitions to invalid.
If the cache block is in a shared state, when a BusRd transaction is snooped, that means another
processor suffered a read miss and is trying to fetch the block. Therefore, the state remains shared,
and since only the owner is responsible for flushing the block, the local cache does not flush the
block. Note that there may be an owner (in case of dirty sharing) or there may not be an owner
(clean sharing). In the case that there is no owner (clean sharing), the main memory can supply
the block, although using a MESI-like FlushOpt is also possible here. If a BusRdX or BusUpgr is
snooped, the block’s state transitions to invalid. Again, the owner (if there is one) is responsible for
flushing the block, so, in contrast to MESI, the non-owner caches do not need to flush their block
copies.
If the cache block is in the modified state, the copy in the cache is the only valid copy in the
entire system (no other cached copies exist and the value in the main memory is stale). Therefore,
when a BusRd transaction is snooped, the block must be flushed to ensure write propagation, and
the state transitions to owned. The reason for transitioning to the owned state is based on a heuristic
that has been discussed earlier. Note that by transitioning to the owned state, the local cache has
become the supplier of the block and is responsible for flushing it when required.
If the cache block is in the owned state, it indicates that there is dirty sharing, with the local
cache being responsible as the supplier of the block. Hence, when a BusRd is snooped, it flushes
the block and remains in the owned state (i.e., remains the owner). If a BusRdX is snooped, it
supplies the block by flushing it, and transitions into the invalid state. If a BusUpgr is snooped, it
transitions into the invalid state without flushing the block. Note that because of dirty sharing, the flushes from this state are a correctness requirement rather than a performance enhancement, because potentially nobody else in the system has a valid copy of the block (some sharers that obtained the
block from the owner may have a valid copy, but there is no guarantee that there are sharers).
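The snooper-side reactions described above can likewise be collected into one transition function. The sketch below (illustrative names again) prints the block transfer, Flush or FlushOpt, that the snooping cache places on the bus, if any, and returns the new state.

#include <cstdio>

enum BlockState { Modified, Owned, Exclusive, Shared, Invalid };
enum BusRequest { BusRd, BusRdX, BusUpgr };
const char* name(BlockState s) { static const char* n[] = {"M", "O", "E", "S", "I"}; return n[s]; }

// Reaction of a snooping cache to a bus request for a given block.
BlockState onSnoop(BlockState s, BusRequest r) {
    switch (s) {
    case Invalid:
        return Invalid;                           // not cached: the request is ignored
    case Exclusive:
        printf("FlushOpt\n");                     // optional cache-to-cache supply of a clean block
        return (r == BusRd) ? Shared : Invalid;   // (BusUpgr cannot occur for a sole copy)
    case Shared:
        return (r == BusRd) ? Shared : Invalid;   // no flush: the owner (or memory) supplies the block
    case Modified:
        printf("Flush\n");                        // required for write propagation
        return (r == BusRd) ? Owned : Invalid;    // downgrade to owned on a snooped read
    case Owned:
        if (r == BusUpgr) return Invalid;         // the requestor already has the data
        printf("Flush\n");                        // the owner must supply the dirty block
        return (r == BusRd) ? Owned : Invalid;
    }
    return s;
}

int main() {
    printf("M + BusRd   -> %s\n", name(onSnoop(Modified, BusRd)));
    printf("O + BusRdX  -> %s\n", name(onSnoop(Owned, BusRdX)));
    printf("S + BusUpgr -> %s\n", name(onSnoop(Shared, BusUpgr)));
    return 0;
}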
When a block in the owned state is evicted from the cache, the ownership disappears, since the remaining caches hold the copy only in the shared state. Hence, at this point dirty sharing must be converted
into clean sharing. The owner is responsible for flushing the block to the memory so that the memory
can update its copy. This is achieved by posting a FlushWB request on the bus. In contrast to Flush
or FlushOpt requests that are ignored by the memory controller, a FlushWB request is picked up by
the memory controller to update the value in the main memory.
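The memory controller's side of this arrangement amounts to a simple filter on the snooped transfer types, sketched below with illustrative names: only FlushWB causes the main memory copy to be updated.

#include <cstdio>

enum SnoopedTransfer { Flush, FlushOpt, FlushWB };

// Whether the memory controller picks up a snooped block transfer to update
// the main memory copy (in MOESI, dirty sharing keeps memory out of the loop).
bool memoryPicksUp(SnoopedTransfer t) {
    switch (t) {
    case FlushWB:  return true;     // owner eviction: memory must be brought up to date
    case Flush:                     // cache-to-cache transfer; the owner retains responsibility
    case FlushOpt: return false;    // optional supply of a block to another cache
    }
    return false;
}

int main() {
    printf("Flush    -> memory update: %d\n", memoryPicksUp(Flush));
    printf("FlushOpt -> memory update: %d\n", memoryPicksUp(FlushOpt));
    printf("FlushWB  -> memory update: %d\n", memoryPicksUp(FlushWB));
    return 0;
}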
Two mechanisms ensure write propagation. First, by invalidating other copies on a write to
a cache block, other caches are forced to reload the block through cache misses. Secondly, with
dirty sharing, a cache acts as an owner and flushes the block when it snoops a BusRd or BusRdX,
ensuring the correct block value is passed on. With clean sharing, the memory supplies the block.