block. This problem can be mitigated with the addition of an “owner” state as in AMD’s MOESI (for dirty sharing), a “forward” state as in Intel’s MESIF (for clean sharing), or both.
Finally, nodes A and B can send a notice of completion, the Done message, to the sequencer S.
Part (f) of the figure shows a case where node B sends the Done message earlier than node A. This
is fine because both read transactions deal with the same data value, hence they can be overlapped
and their completions reordered. However, the sequencer S must know how many Done messages it has to receive in order to determine that all the outstanding read transactions have been completed, for example, to determine when it can process a future read exclusive request.
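As a rough sketch of this bookkeeping (the names below, such as Sequencer and expected_done, are illustrative and not taken from any particular implementation), the sequencer can count outstanding Done messages per block address and defer a conflicting read exclusive request until the count drains to zero:

    # Minimal sketch (hypothetical names) of the sequencer's bookkeeping:
    # it counts how many Done messages are still expected per block address
    # and defers a conflicting read exclusive request until the count is zero.
    class Sequencer:
        def __init__(self):
            self.expected_done = {}   # block address -> Done messages still awaited
            self.pending_rdx = {}     # block address -> deferred read exclusive requests

        def on_read(self, addr):
            # Each accepted read transaction will later produce one Done message.
            self.expected_done[addr] = self.expected_done.get(addr, 0) + 1

        def on_done(self, addr):
            self.expected_done[addr] -= 1
            if self.expected_done[addr] == 0:
                del self.expected_done[addr]
                # All overlapped reads to this block have completed; any deferred
                # read exclusive request can now be processed safely.
                for requestor in self.pending_rdx.pop(addr, []):
                    print("processing RdX for", addr, "from", requestor)

        def on_read_exclusive(self, addr, requestor):
            if addr in self.expected_done:
                self.pending_rdx.setdefault(addr, []).append(requestor)  # defer
            else:
                print("processing RdX for", addr, "from", requestor)

    seq = Sequencer()
    seq.on_read("X"); seq.on_read("X")     # reads from two nodes are outstanding
    seq.on_read_exclusive("X", "C")        # deferred: Done messages still expected
    seq.on_done("X"); seq.on_done("X")     # second Done releases the deferred RdX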
The illustrations of the protocol so far have assumed the case in which data is found on some of
the nodes. The case where data is not found in any of the nodes and must be fetched from the main
memory is more complicated to handle, because we have to deal with the issue of when to issue a
memory fetch. In general, there are two approaches to deciding when to issue a memory fetch. The first is a late memory fetch approach, in which the system does not issue a memory fetch until it determines that no copy of the block is currently cached in any of the nodes. The other is an early memory fetch approach, in which the system issues a memory fetch before it can be determined whether a copy of the block exists in any of the nodes; the fetch is thus speculative. The speculative fetch reduces the cache miss latency if the block turns out not to be cached in any of the nodes, but it is useless if the data can be supplied by some node. Moreover, the fetch has to be cancelled and the fetched block discarded if the data turns out to be in a dirty state in some node, as the fetched block is then stale.
Figure 7.12 illustrates the two approaches. The request arrives at the sequencer (a). In the late
fetch approach, the sequencer sends out intervention messages to the rest of the nodes (b). As a
response to receiving the intervention messages, each node sends a NoData response message to the requestor A, indicating that it does not have the block (c). The rest of the processing is shown in part (d),
in three consecutive steps. The requestor, knowing the block is not found on any of the nodes, issues
a request to fetch from the main memory (Step 1). The sequencer fetches the block from the main
memory (Step 2), and supplies it to A (Step 3). Contrast this with the early fetch approach, where
the memory fetch is issued speculatively, immediately upon receiving the read request (part (e)).
The block is fetched from the main memory without knowing whether the block is cached in any node, or whether a cached copy is in a clean or dirty state. Suppose that node S finds the data block in its cache and supplies it to the requestor, while other nodes do not find the data block and send NoData response messages to the requestor (f). If the requestor (node A) receives multiple data replies, it is responsible for determining which version of the data to keep and which to discard. For example, if it receives a data reply from a node that caches the block in a dirty state, the block fetched from memory is stale and is the one that should be discarded.
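A minimal sketch of this reply-selection rule at the requestor is shown below; the tuple format and the select_block helper are hypothetical, and the point is only that a cached copy, clean or dirty, takes precedence over the speculatively fetched memory copy:

    # Sketch of reply selection at the requestor under the early fetch approach.
    # Each reply is (source, has_data, dirty); the format is illustrative.
    def select_block(replies):
        cache_reply = None
        memory_reply = None
        for source, has_data, dirty in replies:
            if source == "memory":
                memory_reply = (source, dirty)
            elif has_data:
                cache_reply = (source, dirty)   # some node caches the block
        if cache_reply is not None:
            # A cached copy takes precedence; if it is dirty, the speculatively
            # fetched memory copy is stale and is the one discarded.
            return cache_reply
        return memory_reply                      # no node had the block

    replies = [("nodeB", False, False),   # NoData from node B
               ("nodeS", True, True),     # node S supplies a dirty copy
               ("memory", True, False)]   # speculative memory fetch returned
    print(select_block(replies))          # -> ('nodeS', True): keep the dirty copy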
Comparing the two approaches, the late fetch approach suffers from a high memory access latency, whereas the early fetch approach achieves a lower one. However, the early fetch approach comes with significant costs: it performs fetches speculatively even when they may not be needed, and these useless fetches increase power consumption and off-chip bandwidth demand.
Due to the need for the sequencer to issue memory fetches, it makes sense to place the sequencer close to the memory controller, or to assign the memory controller the sequencer role, in order to reduce the fetch latency. As an example, in the AMD Opteron-based Magny Cours system, the memory controllers act as the sequencers for the addresses assigned to them.
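As a simplified sketch (the controller count and interleaving granularity below are illustrative, not those of an actual Magny Cours system), the home memory controller that acts as the sequencer for an address can be selected by interleaving the physical address space across the controllers:

    # Simplified sketch: map a physical address to the home memory controller
    # that acts as its sequencer. Controller count and interleaving granularity
    # are illustrative, not those of an actual system.
    NUM_CONTROLLERS = 4
    INTERLEAVE_BITS = 12                  # interleave at 4 KB granularity

    def home_sequencer(phys_addr):
        return (phys_addr >> INTERLEAVE_BITS) % NUM_CONTROLLERS

    for addr in (0x0000, 0x1000, 0x2000, 0x5000):
        print(hex(addr), "-> memory controller / sequencer", home_sequencer(addr))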
[Figure omitted: panels (a)-(d) illustrate the Late Fetch Approach; panels (e)-(f) illustrate the Early Fetch Approach.]
Figure 7.12: Illustration of a broadcast coherence protocol over point-to-point interconnection with a global sequencer.
Did you know?
Let us look at how the AMD Magny Cours system deals with the early vs. late fetch
dilemma. The memory controller is augmented with a structure referred to as a probe
filter, which tracks the state of blocks that are cached. When a request arrives at the
sequencer (the memory controller), following an early fetch approach, it immediately
issues a fetch to the main memory, in parallel with looking up the directory. If a block
is cached in a shared state or is uncached, the sequencer supplies the block, after
the fetch from the main memory has returned. This policy avoids the possibility of
multiple caches supplying the block, which may happen under an alternative approach
of broadcasting an intervention message. If a block is in a modified or exclusive state, however, at most one node has the block and can supply it. In this case, the directory maintains precise information about which cache has the block and forwards the request to that cache, which then supplies the block to the requestor.
The early fetch approach is more attractive when the directory lookup latency is
high, the memory access latency is low compared to the directory lookup, and mem-
ory bandwidth and power are relatively abundant. In the Magny Cours, the directory
is kept in the L3 cache and has a relatively high access latency.
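The decision made by the sequencer can be summarized with the small sketch below; the state names and the handle_request function are illustrative and do not reproduce AMD’s actual implementation:

    # Illustrative sketch of the decision at the memory controller / sequencer,
    # assuming the memory fetch has already been issued speculatively in parallel
    # with the probe filter (directory) lookup. State names are illustrative.
    def handle_request(probe_filter_state, owner_node=None):
        if probe_filter_state in ("uncached", "shared"):
            # Memory holds a valid copy: supply the speculatively fetched block
            # and avoid broadcasting interventions to all caches.
            return ("supply_from_memory", None)
        if probe_filter_state in ("exclusive", "modified"):
            # At most one cache holds the block: forward the request to the
            # owner, which supplies the block to the requestor.
            return ("forward_to_owner", owner_node)
        raise ValueError("unknown probe filter state")

    print(handle_request("shared"))               # ('supply_from_memory', None)
    print(handle_request("modified", "node3"))    # ('forward_to_owner', 'node3')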
Global Sequencing with Overlapped Request Processing
The protocol discussed in the previous section suffers from not being able to overlap request pro-
cessing completely, except between multiple read transactions. This may cause a high coherence
transaction latency due to the high waiting time as requests from different nodes for a single address
collide at the sequencer.
One way to allow overlapped request processing while still giving the appearance that requests
are not overlapped is to use an ordered network. Ordered networks have been used to overlap request processing in some distributed shared memory systems [19], and the technique can be applied to a multicore system as well. The basic idea is to impose or exploit a property of the network that ensures all nodes observe transactions in the same order as the sequencer. For example, we can define a deterministic path for messages sent from the sequencer to every node. Different destination nodes may share a path or have their own paths, but all messages sent by the sequencer to a given node must follow that node’s predetermined path. Hence, messages sent by the sequencer are guaranteed to arrive at the destination node in the same order in which the sequencer sent them.
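This ordering property can be modeled, purely as an illustration, by a FIFO channel per destination node: the sequencer enqueues messages in the order it processes requests, and each node dequeues them in exactly that order.

    from collections import deque

    # Illustration only: one FIFO channel per destination models a deterministic,
    # ordered path from the sequencer. Delivery order at each node matches the
    # order in which the sequencer sent the messages.
    class OrderedNetwork:
        def __init__(self, nodes):
            self.channels = {n: deque() for n in nodes}

        def send(self, dest, msg):
            self.channels[dest].append(msg)        # enqueue at the sequencer side

        def deliver(self, dest):
            return self.channels[dest].popleft()   # dequeue at the node, FIFO order

    net = OrderedNetwork(["A", "B", "C"])
    net.send("C", "Inv(X)")    # sent first, e.g., for A's upgrade request
    net.send("C", "Int(X)")    # sent second, e.g., for B's read request
    print(net.deliver("C"))    # node C receives Inv(X) first ...
    print(net.deliver("C"))    # ... and Int(X) second, matching the send order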
[Figure omitted.]
Figure 7.13: Overlapping request processing using an ordered network.
Figure 7.13 illustrates how the ordered network property can be used to overlap request process-
ing. Suppose node A sends an upgrade request to S and, in the meantime, B sends a read request to S. Suppose the upgrade request from A arrives at the sequencer before the request from B arrives (a). Upon receiving the upgrade request, the sequencer sends out invalidation messages to all
other nodes, while sending an acknowledgment message to the requestor node A, indicating that
the request has been accepted by the sequencer (b). The messages may use different paths as shown
in the figure, but for each destination node, only one path is allowed. While this is still going on,
the sequencer assumes that the transaction is already completed and serves a new request. When
it receives the read request from node B, it forwards intervention messages to all other nodes, and
sends an acknowledgment to node B indicating that the request has been accepted by the sequencer
(c). Because the network message delivery is ordered, it is guaranteed that each node will first re-
ceive the invalidation message before it receives the intervention message. Each node also needs to
respond to messages in the order of their arrival. Thus, each node will have invalidated its own copy
of the block by the time it receives or responds to the intervention message (d). In this case, they respond indicating that they do not find the block in their caches. This guarantees that it is node A that will supply the data to B.
Note that A will receive the acknowledgment message from the sequencer before the interven-
tion message. Hence, A knows that its request was accepted by the sequencer earlier than the read request from node B. However, node A may receive the intervention message before it has collected all invalidation acknowledgments. It would be incorrect to supply data to B before all the invalidation acknowledgments have been collected and its write has completed. Therefore, node A must wait until its own acknowledged transaction is complete before it reacts to other requests. In the meantime,
node A can buffer the intervention message or negatively acknowledge it until it has collected all
invalidation acknowledgments and has performed the write to the data. After that occurs, it can sup-
ply the newly written data directly to node B (e). Similarly, node B must wait until it receives data
before it can service other requests that it receives after the arrival of the acknowledgment message
from the sequencer.
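Node A’s behavior can be sketched as follows (the class and method names are hypothetical): the intervention is buffered until all invalidation acknowledgments have arrived and the write has been performed, and only then is the newly written data supplied to node B.

    # Hypothetical sketch of node A deferring an intervention until its own
    # acknowledged upgrade transaction has completed.
    class NodeA:
        def __init__(self, num_other_nodes):
            self.acks_needed = num_other_nodes   # invalidation acks still awaited
            self.deferred = []                   # buffered interventions
            self.block_value = None

        def on_inv_ack(self):
            self.acks_needed -= 1
            if self.acks_needed == 0:
                self.block_value = "newly written value"   # the write completes now
                for requestor in self.deferred:
                    self.supply_data(requestor)            # service buffered requests
                self.deferred.clear()

        def on_intervention(self, requestor):
            if self.acks_needed > 0:
                self.deferred.append(requestor)  # buffer (or NACK) until the write completes
            else:
                self.supply_data(requestor)

        def supply_data(self, requestor):
            print("supplying", repr(self.block_value), "to", requestor)

    a = NodeA(num_other_nodes=2)
    a.on_intervention("B")     # arrives early: buffered, not serviced yet
    a.on_inv_ack()
    a.on_inv_ack()             # last ack: write completes, then B is supplied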
Compared to the non-overlapped processing approach, overlapped request processing may require additional mechanisms and buffering at each node. However, since transactions can be
processed at a lower latency and higher bandwidth, the overlapped processing approach is likely
more attractive. Requiring ordered message delivery in the network often means restricting the paths that can be used to deliver messages. On a larger network, this results in reduced path diversity,
which in turn leads to more pathological cases exhibiting network performance problems. However,
for a small network, there is not much path diversity to start with, and hence the effect of the reduced
path diversity is less significant.
Distributed Sequencing
It is also possible to rely on a distributed protocol to determine the order of transactions affecting a single block address, instead of relying on just one fixed sequencer. The idea here is that requests
can be broadcast, and any node that can supply the requested data block is a good candidate to act as
the sequencer. However, the remaining nodes must observe the same order of transactions as deter-
mined by the sequencer. The order observed by other nodes is inferred through the order in which
messages are sent out by the sequencer. Thus, distributed sequencing requires the interconnect to
route messages in a certain way to guarantee global ordering of message arrivals. This goes
beyond what an ordered network can provide. It requires the nodes to be fully ordered with respect to one another, with an ordering that is statically fixed, whenever messages involved in a transaction are sent. An example of a network that satisfies such a requirement is a ring. Each link in a ring
connects two routers in a single direction (if the link is unidirectional), and together the routers and
links form a closed loop. In a ring, messages sent from a node travel along the same statically fixed path, thereby satisfying the requirement.
A ring provides several benefits compared to a more general interconnect topology. Each router
in a ring can have a simple design, because it only has to choose whether to forward a message to
the next link or not, hence it can be implemented with low routing latency and a modest amount of
buffering. However, a ring’s network distance increases linearly with the number of nodes, hence
a ring becomes less attractive as the size of the system grows. In this section, we will look at an
example of a cache coherence protocol that uses distributed sequencing exploiting the property of a
ring interconnect.
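As a toy illustration of the statically fixed path property (the four-node ring below is hypothetical), a request injected by a node is forwarded hop by hop along the single next link until it has traveled the full circle back to the requestor:

    # Toy illustration of a unidirectional ring with a statically fixed path.
    # A request injected by a node is forwarded hop by hop until it has gone
    # the full circle back to the requestor.
    RING = ["A", "B", "C", "D"]                    # fixed ring order

    def next_hop(node):
        return RING[(RING.index(node) + 1) % len(RING)]

    def path_around_ring(source):
        hops, current = [], next_hop(source)
        while current != source:                   # visit every other node once
            hops.append(current)
            current = next_hop(current)
        hops.append(source)                        # the message returns to the requestor
        return hops

    print(path_around_ring("A"))                   # ['B', 'C', 'D', 'A']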
In a ring protocol, the data supplier is typically assigned the role of a sequencer for transactions
affecting the block. All requests are placed on the ring and travel around the ring until they complete the full circle back to the requestor. Let us define the transaction window as the time from when a node places a request on the ring until its response has returned to the node. In some cases, a request can be responded to by just one node, such as when a read request is satisfied by a node that caches the block and supplies it. In other cases, a request must be responded to by all nodes, such as when an upgrade request requires all nodes to invalidate their copies of the block.
One important property that needs to be guaranteed in a ring is that the data supplier responds with messages (containing data) in the same order in which the requests arrive at it. Such a property is important in ensuring that there is a single order of transactions that is seen consistently by all nodes. For example, if two requests arrive at the supplier, the request that reaches the supplier earlier will also be responded to earlier than the other request. This
ensures that a requesting node, if its request reaches the supplier earlier than anybody else, will see
a response earlier than anybody else. It also ensures that a requesting node, if its request reaches the
supplier later than another requesting node, will see the other requesting node’s response before its
own response. The latter is a condition where a ring collision between two requests is detected.
More precisely, if during the transaction window a node receives a response to a request other
than its own, it has detected a collision with that other request. On the other hand, if no other
responses are received in the transaction window, no collision has been detected. In the absence of a collision, the request has been satisfied as if it were the only request in the system. A collision is benign if it involves two overlapping read transactions. A benign collision can be allowed to occur without causing a coherence problem, since the colliding transactions only involve a block with a single value. A
collision between two write transactions or between a write and a read transaction is not benign, as
it has a potential to cause coherence problems, because they involve multiple data values. Due to
this distinction, a benign collision may be ignored, but a non-benign collision cannot be ignored. A
non-benign collision may be dealt with by canceling both transactions involved in the collision, or
canceling one of the transactions by prioritizing one over the other.
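The detection and classification rule can be sketched as follows (an illustrative helper, not a full protocol): within its transaction window, a node records any response it observes for another node’s request; two overlapping reads form a benign collision, while any combination involving a write (RdX or Upgr) does not.

    # Illustrative collision detection during a node's transaction window.
    # A response observed for another node's request is a collision; it is
    # benign only if both colliding transactions are reads.
    def classify_collisions(own_request_type, observed_responses):
        result = []
        for other_requestor, other_type in observed_responses:
            benign = (own_request_type == "Rd" and other_type == "Rd")
            result.append((other_requestor, other_type, benign))
        return result

    # Node A issued a read; during its window it sees a response to B's upgrade.
    print(classify_collisions("Rd", [("B", "Upgr")]))   # non-benign: must be handled
    # Two overlapping reads collide benignly and can be ignored.
    print(classify_collisions("Rd", [("B", "Rd")]))     # benign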
Let us look at an illustration of how a basic ring coherence protocol works. Figure 7.14 shows
four nodes connected in a ring. Part (a) of the figure shows nodes A and B simultaneously sending a read request for the same data block. Each message is shown in the following format: the command type (Rd, RdX, or Upgr), a dot, the requestor, a comma, and then the response in parentheses. The response