i
i
i
i
i
i
i
i
6.3. SYNCHRONIZATION PROBLEM 201
Compare this against Code 6.3, which only tests one variable lockvar in order to acquire a lock.
Of course, Code 6.3 is incorrect unless it is used in conjunction with an atomic instruction supported
by the hardware. This points out that it is important to have hardware support for synchronization
in order to reduce synchronization overheads and enable scalable synchronization.
Finally, it is important to note that the correctness of a synchronization primitive is related to
the memory consistency model provided by the processor. By nature, synchronization operation is
used to impose ordering between memory accesses performed by different threads. However, the
ordering of the memory accesses performed by a synchronization operation and those performed
outside the synchronization cannot be guaranteed unless the hardware provides either a strict mem-
ory consistency model, or an explicit mechanism that can order memory accesses. This will be
discussed further in Chapter 8.
i
i
i
i
i
i
i
i
202 CHAPTER 6. INTRODUCTION TO SHARED MEMORY MULTIPROCESSORS
6.4 Exercises
Worked Problems
1. Register allocation. Suppose that you are writing a compiler algorithm that decides what
variable can be allocated in a register. A register allocation allows the variable to be accessed
in the register without involving a load or store to read from or write to it. In order not to create
cache coherence problem, which of the following types of variables can be register allocated
vs. cannot be register allocated: read-only, read/write non-conflicting, and r/w conflicting?
Explain. (see Section 3.6 for what each variable type refers to).
Answer:
Variable type Can be register allocated?
Read-only Yes
R/W Non-Conflicting Yes
R/W Conflicting No
R/W conflicting variables are written and read by different threads. Hence, they cannot be
register allocated because a write by a processor to its registers is not visible to other threads,
hence write propagation no longer occurs.
2. Cache coherence. Assume that we have a two-processor system (P1 and P2). These pro-
cessors run different threads (T1 and T2, respectively) that access a data block containing
variable sum, in the following sequence in time:
T1: prefetch the block containing sum from &sum (i.e., the location of sum)
T2: read the value of sum from &sum
T2: add 7 to sum and write into &sum
T1: read the value of sum from &sum
T1: add 3 to sum and write into &sum
T1: read sum to print its value
Each access may hit or miss in the cache, depending on the cache configuration that will be
specified. For each situation below, determine if there is a correctness issue. Also give the
final values in all caches and in the main memory.
(a) The two processors P1 and P2 share an only single level cache (the L1 cache) that uses
a write back policy.
(b) Each processor has its own single-level write through cache. Does your answer depend
on whether the cache has a write allocate or write no-allocate policy?
(c) Each processor has its own L1 cache that uses evict-on-write policy, i.e., a block is
replaced immediately after it is written to. There are no other levels of caches in the
system.
(d) Each processor has a cache, but the block containing sum is not allowed to be cached.
i
i
i
i
i
i
i
i
6.4. EXERCISES 203
Answer:
(a) No correctness problem because data block containing ”sum” is only cached at a single
location (the shared L1 cache).
L1 cache 10, dirty
Memory 0
(b) Correctness problem exists because a block may be kept in several L1 caches. A write
through write allocate L1 cache only guarantees that a write will be propagated to the
outer level memory hierarchy but does not guarantee write propagation to other L1
caches.
P1’s cache 3
P2’s cache 7
Memory 3
Furthermore, the problem persists even with a write no-allocate policy. This is because
the block was brought into the cache through a prefetch instruction, not a write instruc-
tion.
(c) Correctness problem exists: although evict-on-write ensures that after a write, the writ-
ten value is propagated to the outer memory hierarchy, it does not ensure that future
reads in other caches will see that new value.
P1’s cache 3
P2’s cache -
Memory 3
(d) No correctness problem since sum is uncached. There is only one copy of sum in the
system.
P1’s cache -
P2’s cache -
Memory 10
Homework Problems
1. Register allocation. Suppose that you are writing a compiler algorithm that decides what
variable can be allocated in a register. A register allocation allows the variable to be accessed
in the register without involving a load or store to read from or write to it. In order not to
create cache coherence problem, which type of variable can be register allocated vs. cannot
be register allocated?
Variable type Can be register allocated?
Read-only
R/W Non-Conflicting
R/W Conflicting
If there are variable types that you think cannot be safely allocated in registers, give explana-
tion of why it is so (see Section 3.6 for what each variable type refers to).
2. Cache coherence. Assume that we have a two-processor system (P1 and P2). These pro-
cessors run different threads (T1 and T2, respectively) that access a data block containing
variable sum, in the following sequence in time:
i
i
i
i
i
i
i
i
204 CHAPTER 6. INTRODUCTION TO SHARED MEMORY MULTIPROCESSORS
T1: prefetch the block containing sum from &sum (i.e., the location of sum)
T2: read the value of sum from &sum
T2: add 7 to sum and write into &sum
T1: read the value of sum from &sum
T1: add 3 to sum and write into &sum
T1: read sum to print its value
Each access may hit or miss in the cache, depending on the cache configuration that will be
specified. For each situation below, determine if there is a correctness issue. Also give the
final values in all caches and in the main memory.
(a) Each processor has its own write through L1 cache, and both processors share a single
write back L2 cache.
(b) Each processor has its own write back L1 cache, however, each block can be cached in
at most one cache. More specifically, if a block is cached at Px, and Py requests the
block on a read or a write, the block is removed from Px’s cache and placed in Py’s
cache. Assume a block is not evicted from the cache unless when it is moved to another
processor’s cache.
(c) Each processor has its own write back L1 cache. Suppose we run the two threads T1
and T2 in only one processor (either P1 or P2, but not both) in a time-sharing manner.
(d) Each processor has its own write back L1 cache. Suppose that instead of executing with
two threads, only one thread executes. Thus, in the sequence of events shown above, all
the events occur due to T1’s execution while T2 does not exist. Furthermore, to achieve
balanced processor utilization, the operating system migrates T1 from P1 to P2 after the
first three events, i.e. right after adding ‘7’ to sum but before reading from sum for the
second time.
3. Peterson’s algorithm. Modify Peterson’s algorithm for lock acquisition and release so that
it can be used for four threads that compete for a lock.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.180.118