block. This problem can be mitigated with the addition of an “owner” state as in AMD’s MOESI (for dirty sharing), a “forward” state as in Intel’s MESIF (for clean sharing), or both.
Finally, nodes A and B can send a notice of completion, the Done message, to the sequencer S.
Part (f) of the figure shows a case where node B sends the Done message earlier than node A. This
is fine because both read transactions deal with the same data value, hence they can be overlapped
and their completions reordered. However, the sequencer S must know how many Done messages it has to receive in order to determine that all the outstanding read transactions have been completed, for example, to determine when it can process a future read exclusive request.
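As a rough sketch of this bookkeeping (the names below, such as Sequencer and expected_done, are illustrative and not taken from any particular implementation), the sequencer can count outstanding Done messages per block address and defer a conflicting read exclusive request until the count drains to zero:

    # Minimal sketch (hypothetical names) of the sequencer's bookkeeping:
    # it counts how many Done messages are still expected per block address
    # and defers a conflicting read exclusive request until the count is zero.
    class Sequencer:
        def __init__(self):
            self.expected_done = {}   # block address -> Done messages still awaited
            self.pending_rdx = {}     # block address -> deferred read exclusive requests

        def on_read(self, addr):
            # Each accepted read transaction will later produce one Done message.
            self.expected_done[addr] = self.expected_done.get(addr, 0) + 1

        def on_done(self, addr):
            self.expected_done[addr] -= 1
            if self.expected_done[addr] == 0:
                del self.expected_done[addr]
                # All overlapped reads to this block have completed; any deferred
                # read exclusive request can now be processed safely.
                for requestor in self.pending_rdx.pop(addr, []):
                    print("processing RdX for", addr, "from", requestor)

        def on_read_exclusive(self, addr, requestor):
            if addr in self.expected_done:
                self.pending_rdx.setdefault(addr, []).append(requestor)  # defer
            else:
                print("processing RdX for", addr, "from", requestor)

    seq = Sequencer()
    seq.on_read("X"); seq.on_read("X")     # reads from two nodes are outstanding
    seq.on_read_exclusive("X", "C")        # deferred: Done messages still expected
    seq.on_done("X"); seq.on_done("X")     # second Done releases the deferred RdX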
The illustrations of the protocol so far have assumed the case in which data is found on some of
the nodes. The case where data is not found in any of the nodes and must be fetched from the main
memory is more complicated to handle, because we have to deal with the issue of when to issue a
memory fetch. In general, there are two approaches to deciding when to issue a memory fetch. The first is a late memory fetch approach, in which the system does not issue a memory fetch until it determines that no copy of the block is currently cached in any of the nodes. The other is an early memory fetch approach, in which the system issues a memory fetch before it can be determined whether a copy of the block exists in any of the nodes; the fetch is thus speculative. The speculative fetch reduces the cache miss latency if the block turns out not to be cached in any of the nodes, but it is useless if the data can be supplied by some node. Moreover, the fetch has to be cancelled and the fetched block discarded if the data turns out to be in a dirty state in some node, as the fetched block is then stale.
Figure 7.12 illustrates the two approaches. The request arrives at the sequencer (a). In the late
fetch approach, the sequencer sends out intervention messages to the rest of the nodes (b). As a
response to receiving the intervention messages, each node sends a NoData response message to the requestor A, indicating that it does not have the block (c). The rest of the processing is shown in part (d),
in three consecutive steps. The requestor, knowing the block is not found on any of the nodes, issues
a request to fetch from the main memory (Step 1). The sequencer fetches the block from the main
memory (Step 2), and supplies it to A (Step 3). Contrast this with the early fetch approach, where
the memory fetch is issued speculatively, immediately upon receiving the read request (part (e)).
The block is fetched from the main memory without knowing whether the block is cached in any node, or whether a cached copy is in a clean or dirty state. Suppose that node S finds the data block in its cache and supplies it to the requestor, while other nodes do not find the data block and send NoData response messages to the requestor (f). If the requestor (node A) receives multiple data replies, it is responsible for determining which version of the data to keep and which to discard. For example, if it receives a data reply from a node that caches the block in a dirty state, the block fetched from memory is stale and is the one that should be discarded.
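A minimal sketch of this reply-selection rule at the requestor is shown below; the tuple format and the select_block helper are hypothetical, and the point is only that a cached copy, clean or dirty, takes precedence over the speculatively fetched memory copy:

    # Sketch of reply selection at the requestor under the early fetch approach.
    # Each reply is (source, has_data, dirty); the format is illustrative.
    def select_block(replies):
        cache_reply = None
        memory_reply = None
        for source, has_data, dirty in replies:
            if source == "memory":
                memory_reply = (source, dirty)
            elif has_data:
                cache_reply = (source, dirty)   # some node caches the block
        if cache_reply is not None:
            # A cached copy takes precedence; if it is dirty, the speculatively
            # fetched memory copy is stale and is the one discarded.
            return cache_reply
        return memory_reply                      # no node had the block

    replies = [("nodeB", False, False),   # NoData from node B
               ("nodeS", True, True),     # node S supplies a dirty copy
               ("memory", True, False)]   # speculative memory fetch returned
    print(select_block(replies))          # -> ('nodeS', True): keep the dirty copy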
Comparing the two approaches, the late fetch approach suffers from a high memory access latency, whereas the early fetch approach achieves a lower one. However, the early fetch approach comes with significant costs: it performs fetches speculatively even when they may not be needed, and these useless fetches increase power consumption and off-chip bandwidth demand.
Due to the need for the sequencer to issue memory fetches, it makes sense to place the sequencer close to the memory controller, or to assign the memory controller the sequencer role, in order to reduce the fetch latency. As an example, in the AMD Opteron-based Magny Cours system, the memory controllers act as the sequencers for the addresses assigned to them.
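As a simplified sketch (the controller count and interleaving granularity below are illustrative, not those of an actual Magny Cours system), the home memory controller that acts as the sequencer for an address can be selected by interleaving the physical address space across the controllers:

    # Simplified sketch: map a physical address to the home memory controller
    # that acts as its sequencer. Controller count and interleaving granularity
    # are illustrative, not those of an actual system.
    NUM_CONTROLLERS = 4
    INTERLEAVE_BITS = 12                  # interleave at 4 KB granularity

    def home_sequencer(phys_addr):
        return (phys_addr >> INTERLEAVE_BITS) % NUM_CONTROLLERS

    for addr in (0x0000, 0x1000, 0x2000, 0x5000):
        print(hex(addr), "-> memory controller / sequencer", home_sequencer(addr))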
[Figure omitted: panels (a)-(d) illustrate the Late Fetch Approach; panels (e)-(f) illustrate the Early Fetch Approach.]
Figure 7.12: Illustration of a broadcast coherence protocol over point-to-point interconnection with a global sequencer.
Did you know?
Let us look at how the AMD Magny Cours system deals with the early vs. late fetch
dilemma. The memory controller is augmented with a structure referred to as a probe
filter, which tracks the state of blocks that are cached. When a request arrives at the
sequencer (the memory controller), following an early fetch approach, it immediately
issues a fetch to the main memory, in parallel with looking up the directory. If a block
is cached in a shared state or is uncached, the sequencer supplies the block, after
the fetch from the main memory has returned. This policy avoids the possibility of
multiple caches supplying the block, which may happen under an alternative approach
of broadcasting an intervention message. If a block is in a modified or exclusive state, however, at most one node has the block and can supply it. In this case, the directory maintains precise information about which cache has the block and forwards the request to that cache, which then supplies the block to the requestor.
The early fetch approach is more attractive when the directory lookup latency is
high, the memory access latency is low compared to the directory lookup, and mem-
ory bandwidth and power are relatively abundant. In the Magny Cours, the directory
is kept in the L3 cache and has a relatively high access latency.
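The decision made by the sequencer can be summarized with the small sketch below; the state names and the handle_request function are illustrative and do not reproduce AMD’s actual implementation:

    # Illustrative sketch of the decision at the memory controller / sequencer,
    # assuming the memory fetch has already been issued speculatively in parallel
    # with the probe filter (directory) lookup. State names are illustrative.
    def handle_request(probe_filter_state, owner_node=None):
        if probe_filter_state in ("uncached", "shared"):
            # Memory holds a valid copy: supply the speculatively fetched block
            # and avoid broadcasting interventions to all caches.
            return ("supply_from_memory", None)
        if probe_filter_state in ("exclusive", "modified"):
            # At most one cache holds the block: forward the request to the
            # owner, which supplies the block to the requestor.
            return ("forward_to_owner", owner_node)
        raise ValueError("unknown probe filter state")

    print(handle_request("shared"))               # ('supply_from_memory', None)
    print(handle_request("modified", "node3"))    # ('forward_to_owner', 'node3')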
Global Sequencing with Overlapped Request Processing
The protocol discussed in the previous section suffers from not being able to overlap request pro-
cessing completely, except between multiple read transactions. This may cause a high coherence
transaction latency due to the high waiting time as requests from different nodes for a single address
collide at the sequencer.
One way to allow overlapped request processing while still giving the appearance that requests
are not overlapped is to use an ordered network. Ordered networks have been used to overlap request processing in some distributed shared memory systems [19], and the technique can be applied to a multicore system as well. The basic idea is to impose or exploit a property of the network that ensures all nodes observe transactions in the same order as the sequencer. For example, we can define a deterministic path for messages sent from the sequencer to every node. Different destination nodes may share a path or have their own paths, but all messages sent by the sequencer to a given node must follow that node’s predetermined path. Hence, messages sent by the sequencer are guaranteed to arrive at the destination node in the same order in which the sequencer sent them.
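This ordering property can be modeled, purely as an illustration, by a FIFO channel per destination node: the sequencer enqueues messages in the order it processes requests, and each node dequeues them in exactly that order.

    from collections import deque

    # Illustration only: one FIFO channel per destination models a deterministic,
    # ordered path from the sequencer. Delivery order at each node matches the
    # order in which the sequencer sent the messages.
    class OrderedNetwork:
        def __init__(self, nodes):
            self.channels = {n: deque() for n in nodes}

        def send(self, dest, msg):
            self.channels[dest].append(msg)        # enqueue at the sequencer side

        def deliver(self, dest):
            return self.channels[dest].popleft()   # dequeue at the node, FIFO order

    net = OrderedNetwork(["A", "B", "C"])
    net.send("C", "Inv(X)")    # sent first, e.g., for A's upgrade request
    net.send("C", "Int(X)")    # sent second, e.g., for B's read request
    print(net.deliver("C"))    # node C receives Inv(X) first ...
    print(net.deliver("C"))    # ... and Int(X) second, matching the send order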
[Figure omitted.]
Figure 7.13: Overlapping request processing using an ordered network.
Figure 7.13 illustrates how the ordered network property can be used to overlap request process-
ing. Suppose node A sends an upgrade request to S and, in the meantime, B sends a read request to S. Suppose the upgrade request from A arrives at the sequencer before the request from B arrives (a). Upon receiving the upgrade request, the sequencer sends out invalidation messages to all
other nodes, while sending an acknowledgment message to the requestor node A, indicating that
the request has been accepted by the sequencer (b). The messages may use different paths as shown
in the figure, but for each destination node, only one path is allowed. While this is still going on,
the sequencer assumes that the transaction is already completed and serves a new request. When
it receives the read request from node B, it forwards intervention messages to all other nodes, and
sends an acknowledgment to node B indicating that the request has been accepted by the sequencer
(c). Because the network message delivery is ordered, it is guaranteed that each node will first re-
ceive the invalidation message before it receives the intervention message. Each node also needs to
respond to messages in the order of their arrival. Thus, each node will have invalidated its own copy
of the block by the time it receives or responds to the intervention message (d). In this case, they respond indicating that they do not find the block in their caches. This guarantees that it is node A that will supply the data to B.
Note that A will receive the acknowledgment message from the sequencer before the interven-
tion message. Hence, A knows that its request was accepted by the sequencer earlier than the read request from node B. However, node A may receive the intervention message before it has collected all invalidation acknowledgments. It would be incorrect to supply data to B before all the invalidation acknowledgments have been collected and its write has completed. Therefore, node A must wait until its own acknowledged transaction is complete before it reacts to other requests. In the meantime,
node A can buffer the intervention message or negatively acknowledge it until it has collected all
invalidation acknowledgments and has performed the write to the data. After that occurs, it can sup-
ply the newly written data directly to node B (e). Similarly, node B must wait until it receives data
before it can service other requests that it receives after the arrival of the acknowledgment message
from the sequencer.
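Node A’s behavior can be sketched as follows (the class and method names are hypothetical): the intervention is buffered until all invalidation acknowledgments have arrived and the write has been performed, and only then is the newly written data supplied to node B.

    # Hypothetical sketch of node A deferring an intervention until its own
    # acknowledged upgrade transaction has completed.
    class NodeA:
        def __init__(self, num_other_nodes):
            self.acks_needed = num_other_nodes   # invalidation acks still awaited
            self.deferred = []                   # buffered interventions
            self.block_value = None

        def on_inv_ack(self):
            self.acks_needed -= 1
            if self.acks_needed == 0:
                self.block_value = "newly written value"   # the write completes now
                for requestor in self.deferred:
                    self.supply_data(requestor)            # service buffered requests
                self.deferred.clear()

        def on_intervention(self, requestor):
            if self.acks_needed > 0:
                self.deferred.append(requestor)  # buffer (or NACK) until the write completes
            else:
                self.supply_data(requestor)

        def supply_data(self, requestor):
            print("supplying", repr(self.block_value), "to", requestor)

    a = NodeA(num_other_nodes=2)
    a.on_intervention("B")     # arrives early: buffered, not serviced yet
    a.on_inv_ack()
    a.on_inv_ack()             # last ack: write completes, then B is supplied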
Compared to the non-overlapped processing approach, overlapped request processing may require additional mechanisms and buffering at each node. However, since transactions can be
processed at a lower latency and higher bandwidth, the overlapped processing approach is likely
more attractive. Requiring ordered message delivery in the network often means restricting the paths that can be used to deliver messages. On a larger network, this results in reduced path diversity,
which in turn leads to more pathological cases exhibiting network performance problems. However,
for a small network, there is not much path diversity to start with, and hence the effect of the reduced
path diversity is less significant.
Distributed Sequencing
It is also possible to rely on a distributed protocol to determine the order of transactions affecting a single block address, instead of relying on just one fixed sequencer. The idea here is that requests
can be broadcast, and any node that can supply the requested data block is a good candidate to act as
the sequencer. However, the remaining nodes must observe the same order of transactions as deter-
mined by the sequencer. The order observed by other nodes is inferred through the order in which
messages are sent out by the sequencer. Thus, distributed sequencing requires the interconnect to
route messages in a certain way to guarantee global ordering of message arrivals. This goes
beyond what an ordered network can provide. It requires the nodes to be fully ordered with respect to one another, with an ordering that is statically fixed, whenever messages involved in a transaction are sent. An example of a network that satisfies such a requirement is a ring. Each link in a ring
connects two routers in a single direction (if the link is unidirectional), and together the routers and
links form a closed loop. In a ring, messages sent from a node travel along the same statically fixed path, thereby satisfying the requirement.
A ring provides several benefits compared to a more general interconnect topology. Each router
in a ring can have a simple design, because it only has to choose whether to forward a message to
the next link or not, hence it can be implemented with low routing latency and a modest amount of
buffering. However, a ring’s network distance increases linearly with the number of nodes, hence
a ring becomes less attractive as the size of the system grows. In this section, we will look at an
example of a cache coherence protocol that uses distributed sequencing exploiting the property of a
ring interconnect.
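As a toy illustration of the statically fixed path property (the four-node ring below is hypothetical), a request injected by a node is forwarded hop by hop along the single next link until it has traveled the full circle back to the requestor:

    # Toy illustration of a unidirectional ring with a statically fixed path.
    # A request injected by a node is forwarded hop by hop until it has gone
    # the full circle back to the requestor.
    RING = ["A", "B", "C", "D"]                    # fixed ring order

    def next_hop(node):
        return RING[(RING.index(node) + 1) % len(RING)]

    def path_around_ring(source):
        hops, current = [], next_hop(source)
        while current != source:                   # visit every other node once
            hops.append(current)
            current = next_hop(current)
        hops.append(source)                        # the message returns to the requestor
        return hops

    print(path_around_ring("A"))                   # ['B', 'C', 'D', 'A']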
In a ring protocol, the data supplier is typically assigned the role of a sequencer for transactions
affecting the block. All requests are placed on the ring and travel around the ring until they complete the full circle back to the requestor. Let us define the transaction window as the time from when a node places a request on the ring until its response has returned to the node. In some cases, a request can be responded to by just one node, such as when a read request is satisfied by a node that caches the block and supplies it. In other cases, a request must be responded to by all nodes, such as when an upgrade request requires all nodes to invalidate their copies of the block.
One important property that needs to be guaranteed in a ring is that the data supplier responds with messages (containing data) in the same order in which the requests arrive at it. Such a property is important in ensuring that there is a single order of transactions that is seen consistently by all nodes. For example, if two requests arrive at the supplier, the request that reaches the supplier earlier will also be responded to earlier than the other request. This
ensures that a requesting node, if its request reaches the supplier earlier than anybody else, will see
a response earlier than anybody else. It also ensures that a requesting node, if its request reaches the
supplier later than another requesting node, will see the other requesting node’s response before its
own response. The latter is a condition where a ring collision between two requests is detected.
More precisely, if during the transaction window a node receives a response to a request other
than its own, it has detected a collision with that other request. On the other hand, if no other
responses are received in the transaction window, no collision has been detected. In the absence of a collision, the request has been satisfied as if it were the only request in the system. A collision is benign if it involves two overlapping read transactions. A benign collision can be allowed to occur without causing a coherence problem, since the colliding transactions only involve a block with a single value. A
collision between two write transactions or between a write and a read transaction is not benign, as
it has a potential to cause coherence problems, because they involve multiple data values. Due to
this distinction, a benign collision may be ignored, but a non-benign collision cannot be ignored. A
non-benign collision may be dealt with by canceling both transactions involved in the collision, or
canceling one of the transactions by prioritizing one over the other.
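The detection and classification rule can be sketched as follows (an illustrative helper, not a full protocol): within its transaction window, a node records any response it observes for another node’s request; two overlapping reads form a benign collision, while any combination involving a write (RdX or Upgr) does not.

    # Illustrative collision detection during a node's transaction window.
    # A response observed for another node's request is a collision; it is
    # benign only if both colliding transactions are reads.
    def classify_collisions(own_request_type, observed_responses):
        result = []
        for other_requestor, other_type in observed_responses:
            benign = (own_request_type == "Rd" and other_type == "Rd")
            result.append((other_requestor, other_type, benign))
        return result

    # Node A issued a read; during its window it sees a response to B's upgrade.
    print(classify_collisions("Rd", [("B", "Upgr")]))   # non-benign: must be handled
    # Two overlapping reads collide benignly and can be ignored.
    print(classify_collisions("Rd", [("B", "Rd")]))     # benign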
Let us look at an illustration of how a basic ring coherence protocol works. Figure 7.14 shows
four nodes connected in a ring. Part (a) of the figure shows nodes A and B simultaneously sending a read request for the same data block. Each message is shown in the following format: the command type (Rd, RdX, or Upgr), a dot, the requestor, a comma, and then the response in parentheses. The response