Sequences

I’ll spend the rest of this chapter on a more concrete example of the problems you can run into, and the way you have to look at RAC, by picking a single, simple piece of functionality and examining the anomalies that start to appear when RAC gets thrown into the equation. There’s no specific reason why I picked sequences as the focus of this section—they just happen to offer an easy way of producing results that make the type of generic points I want to raise. We’ll start by clearing up a couple of common misconceptions about sequences, then examine the way that the internal implementation of sequences is subject to a couple of problems in RAC, then we’ll see how one of the commonest methods of using sequences needs to be reconsidered in the light of RAC.

Although Oracle sequences can be used in several different ways, the most common use is as a source of unique, consecutive whole numbers—which is basically what you get if you use the default syntax to create a sequence, as follows:

create sequence test_sequence;

If you use the dbms_metadata package to read the full definition of this sequence from the data dictionary, you’ll see the following important features that need some careful thought when you move to RAC:

select
        dbms_metadata.get_ddl('SEQUENCE','TEST_SEQUENCE',user) defn
from dual;

DEFN
--------------------------------------------------
CREATE SEQUENCE  "TEST_USER"."TEST_SEQUENCE"
MINVALUE 1 MAXVALUE 999999999999999999999999999
INCREMENT BY 1 START WITH 1 CACHE 20 NOORDER
NOCYCLE

The two points I want to pick up on are the CACHE 20 clause (where the 20 could be replaced with any positive whole number or the entire clause could be replaced with NOCACHE) and the NOORDER clause. The cache size can make a big difference to single-instance and to RAC, and if you switch from NOORDER to ORDER it can make a catastrophic difference to RAC.
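For reference, both clauses can be stated explicitly when you create the sequence; the following sketch uses an arbitrary cache value purely for illustration:

```sql
create sequence test_sequence
        cache 1000      -- a target gap between seq$ updates, not a preallocated list of values
        noorder         -- the default; ORDER has serious side effects in RAC
;
```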

Caching Sequences

There are two misconceptions about sequences and their caches that need to be corrected. The first is about what the cache is: the “cache” isn’t, as many people think, like a traditional cache—it’s just a target. If you set a cache of 100 for a sequence, Oracle doesn’t have to create 100 numbers and store them somewhere in anticipation of their being used—which means a very large cache value costs you nothing.

The second erroneous belief concerns where sequences are cached—and the answer is that they’re cached in the SGA. Some people think that sequences are cached by sessions, which would mean that if two sessions were using a sequence with a cache of 1,000 then one session would (for example) be generating numbers between 1 and 1,000 while the other session would be generating numbers from 1,001 to 2,000. It’s possible that this belief came into existence because you can request the currval (current value) as well as the nextval (next value) from a sequence—and this does carry a faint implication that the sequence may be cached locally to the sessions. In fact it’s just the session remembering currval as the result from the last time it called nextval; it’s only one number, not “the whole cache size.”
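A quick way to convince yourself that currval is just the session’s memory of its own last nextval (rather than a pointer into some session-private range of values) is roughly as follows—the actual numbers returned will, of course, depend on the state of the sequence:

```sql
-- Session 1
select test_sequence.nextval from dual;   -- say this returns 1

-- Session 2
select test_sequence.nextval from dual;   -- returns 2, drawn from the same SGA cache

-- Session 1 again
select test_sequence.currval from dual;   -- still 1: the single remembered value
```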

Sequence Internals

Oracle sequences really don’t do anything particularly clever, and it’s very easy to find out almost everything about how they work by simply tracing a couple of sessions and examining the resulting trace files. Essentially they work as follows:

  • A sequence is defined internally as a single row in a table called seq$.
  • When you first use the sequence, the starting value of the sequence is incremented by the cache size and written to the row.
  • Oracle then tracks two things in memory—the current value (starting from the original value on the row) and the target value.
  • Every time anyone calls for nextval, Oracle increments the current value and checks it against the target value before supplying it.
  • If the current value and the target value match, Oracle updates the row in the table by adding the cache size to the value that was the target value, producing a new in-memory target.

It’s easy to demonstrate the principle. The following example is coded for 11g, where the syntax to assign a sequence value to a variable becomes legal in PL/SQL (even though the internal implementation still runs a select {sequence}.nextval from dual):

create sequence test_sequence;

declare
        m_n        number;
begin
        for i in 1..1000 loop
                m_n := test_sequence.nextval;
        end loop;
end;
/

If you run this code with sql_trace enabled, and then process the trace file through tkprof, you will find the following results (and some other bits that I’ve deleted):

insert into seq$(obj#,increment$,minvalue,maxvalue,cycle#,order$,cache,
  highwater,audit$,flags)
values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10)

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          1          4           1
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.00       0.00          0          1          4           1

update seq$ set increment$=:2,minvalue=:3,maxvalue=:4,cycle#=:5,order$=:6,
  cache=:7,highwater=:8,audit$=:9,flags=:10
where obj#=:1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       50      0.00       0.00          0          0          0           0
Execute     50      0.00       0.00          1         50        102          50
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      100      0.00       0.00          1         50        102          50

The first statement is Oracle creating our sequence by inserting a row into the seq$ table with a highwater of 1 and a cache of 20. The first time we select nextval (which is 1) this row is updated to get a highwater of 21, and on every twentieth call thereafter it gets updated again—which is why we can see the update taking place 50 (= 1000/20) times. There are two important consequences to this mechanism.

  • If you make a lot of calls to nextval and don’t change the default cache size, you will be generating a lot of updates and redo. That’s bad for your session and database anyway—whether RAC or single instance. You may also see waits for the SQ lock, which is the lock a session takes while it updates the highwater mark, and on the latch covering the dictionary cache line for dc_sequences.
  • In a RAC system, things are worse because the block from seq$ holding the row that you’re updating could be constantly traveling round the cluster as different instances make use of the same sequence and have to update the same row in the table. You also have the problem that a single instance may need to read the row back from the table before updating it—because other instances have been updating it more or less concurrently.

In fact, things can be much worse in RAC because the seq$ table can hold 80 or 90 rows per block—so if you’ve got a few popular sequences you could have a huge amount of cross-instance traffic and extra reads on the table even when users on different instances are using different sequences. Even in single-instance Oracle this can lead to buffer busy waits; in RAC it leads to gc buffer busy waits. (Many years ago Oracle even had a note on Metalink about recreating the data dictionary table seq$ with a very large pctfree to limit the damage this behavior could cause in what was then OPS.)
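The cheap mitigation for the update and redo overhead is simply a larger cache—remembering that the cache is only a target, so a big value costs nothing up front. The figure here is arbitrary, chosen only to show the scale of the effect:

```sql
-- Cut the seq$ updates (and their redo) by a factor of 250
-- compared with the default cache of 20
alter sequence test_sequence cache 5000;
```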

Note: In the tkprof output we saw 50 updates to the seq$ table when we called for nextval 1,000 times from a sequence with a cache size of 20. The number 50 would be repeated in v$enqueue_stat with 50 gets for enqueues of type TX (transaction), TM (table), and SQ (sequence). Moreover, we would also see 50 gets, modifications, and flushes on the dc_sequences parameter in v$rowcache—every time we update the table we first update the row cache entry, then copy it to the database.
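If you want to check these figures on your own system, the relevant queries are roughly as follows—take a snapshot before and after the test and difference the results:

```sql
select  eq_type, total_req#
from    v$enqueue_stat
where   eq_type in ('TX', 'TM', 'SQ');

select  parameter, gets, modifications, flushes
from    v$rowcache
where   parameter = 'dc_sequences';
```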

Ordered Sequences

I’ve already pointed out that the sequence is cached at the instance level, but in RAC there are multiple instances, and that leads to another interesting issue: by default each instance will generate its own set of values. If you have four nodes and a cache size of 20, you will see things like one instance using values 101 to 120, the next using 121 to 140, and so on. This means that the order in which the values are generated and used is unlikely to be purely sequential, and this can make people unhappy. This arrangement also means that if you shut the system down, you will lose some sequence numbers. The instance aiming for the current highwater value on the table can update the seq$ table to say it stopped prematurely, but there’s no way for the other instances to record the numbers they will lose on shutdown.

SEQUENCES, ORDER, AND AUDIT

A common response to the problem of “losing” values is either to remove the cache entirely (nocache)—which causes a dramatic volume of cache transfers as the seq$ block moves from instance to instance in a RAC system—or to leave the sequence with a non-zero (preferably large) cache but apply the order clause—which is nearly as bad for traffic, though less nasty in terms of generating undo and redo.

The order option is specifically aimed at RAC. Instead of allowing each instance to have its own cache for the sequence there is just one highwater value and one memory location with the current value; however, that memory location is controlled by a special global lock (an SV lock) and the current value of the sequence is passed around in the negotiations for the lock. So the values are generated in order because there is only one live value at any one moment, and only one instance can increment it—which means you’ve serialized sequence generation across your instances (but you still have no guarantee that a session or instance crash, or a rollback, won’t lose you the occasional value).
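Syntactically the option is trivial—it’s the run-time cost that isn’t. A sketch, with an arbitrary cache size:

```sql
-- Values arrive in order across all instances, at the price of a
-- global SV lock negotiation (seen as "DFS lock handle" waits) on
-- each transfer of the live value between instances
create sequence ordered_sequence cache 1000 order;
```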

The SV lock is a little strange, and highlights another of the oddities that appear when you start looking closely at implementation details in RAC. If you look in v$lock_type you will find that there is an SV lock type with the description Lock to ensure ordered sequence allocation in RAC mode, and its first parameter (the id1 of view v$lock—although you won’t see it appearing there) is the object id of the sequence. However, when a process tries to acquire the SV lock you don’t see any sign of it in v$lock, and you won’t see a wait for an enqueue as you would, for example, in a trace file or in v$session_wait. Instead you’ll see a wait for DFS lock handle (one of the many waits that is subsequently summarized in v$session_event into the blanket events in wait class Other), and it’s only by decoding the value of the first parameter that you’ll discover it’s an SV lock.

There are still more surprises to come, though. You might expect the SV lock to be held in mode 6 (exclusive) when one instance wants to acquire it to generate the next sequence value—but when you decode the first parameter value of the DFS lock handle wait event you’ll find that the lock mode is 5. Believe it or not, this is an exclusive lock in RAC! Don’t ask why, but instead of the six lock modes we saw for non-RAC resources in Chapter 4, RAC uses five values, which I’ve listed in Table 8-1.

Note: You will find various documents telling you how to decode the parameters that go with wait events; in many cases you need only query v$event_name to get meaningful descriptions. For example, many of the enqueue waits have the text name|mode as the description of their first parameter. The pipe symbol (“|”) is a clue to convert the value to hexadecimal for ease of translation. In my case I got a value of 1398145029 for my DFS lock handle, which converts to 0x53560005. The 5 on the end is the lock mode, and if you know your ASCII you’ll recognize 0x53 as ‘S’ and 0x56 as ‘V’.
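The decoding can be done in SQL. For the value quoted above:

```sql
select
        to_char(1398145029, 'fmXXXXXXXX')       hex_value,  -- 53560005
        chr(to_number('53', 'XX')) ||
        chr(to_number('56', 'XX'))              lock_type,  -- SV
        to_number('0005', 'XXXX')               lock_mode   -- 5
from    dual;
```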

Table 8-1 (table image not reproduced)

The other events you’re likely to see appearing if you use the order option for a popular sequence are the latch: ges resource hash list and the row cache lock wait. The former is just telling us that RAC-related resources use the same hash table and linked list strategies that we’ve seen so many times before. The latter is reminding us that sequences are closely linked to their entries in v$rowcache and the work (and time) needed to modify and flush an entry in the row cache is higher in RAC than it is in single-instance Oracle.

Sequences and Indexes

There’s still more that we can learn about RAC by looking closely at sequences. The next problem comes with the commonest use of sequences—the “meaningless primary key.” If you are using a sequence to generate consecutive numbers and then using those numbers as a primary or unique key, you will have an index on the column holding those numbers—and everyone, from every instance, who inserts a row into the table will want to insert the latest high value into that index. In other words, you’ve created a cross-instance hot spot—guaranteed to lead to gc buffer busy waits. This problem isn’t restricted to sequences, of course; all it requires is that the values inserted are always increasing—an index on a simple “current timestamp” column would have the same problem.

INDEX ITL ISSUES

Prior to the introduction of ASSM (automatic segment space management) it was possible to define multiple free lists and (sometimes more importantly) multiple free list groups to reduce the problem of concurrent inserts on tables. ASSM, which allows different RAC instances to claim different bitmap blocks from a segment thereby isolating inserts from each other, made it virtually unnecessary for DBAs to think about the problem of buffer busy waits on tables. But you can’t do anything to spread index hot spots; an index entry can’t be randomly placed, it has to go in the right location.

There have been several suggestions about how to deal with this hot-spot problem. One of them is to reconstruct the index as a reverse-key index.

Note: In a reverse-key index the bytes of each individual column (except the rowid, in the case of a non-unique index with its embedded rowid) are reversed before the entry is stored in the index. This has the effect of making values that initially look similar end up looking very different from each other. For example, comparing 329,002 and 329,003 before and after reversal, we would see (c3, 21, 5b, 3) and (c3, 21, 5b, 4) turning into (3, 5b, 21, c3) and (4, 5b, 21, c3).
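Switching to a reverse-key index needs nothing more than the reverse keyword; in this sketch the table and index names are hypothetical:

```sql
-- Rebuild an existing index as reverse-key
alter index t1_pk rebuild reverse;

-- or create it that way in the first place
create unique index t1_pk on t1(id) reverse;
```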

At first sight this looks like a good idea—but there is a side effect. If most of your queries are about “recent” data, you may need to cache only a tiny percentage of the total index to keep the relevant part of a “hot” primary key in memory; but if you reverse the index, the intent is to spread the hot spot across a larger area of the index, and the effect is often that you have to cache most (or even all) of the index to get the same caching benefit. Figure 8-5 is a graphic demonstration of the point.


Figure 8-5. The caching impact of reverse key indexes

There are two alternative strategies. The simplest one is to ensure that the sequence you use has a very large cache size—5,000 to 50,000 would not be unreasonable (remember, we are talking about high rates of insert, so we expect to get through a cache of that size fairly quickly). If you do this, each instance will be inserting rows with values that differ by thousands, so each instance will spend most of its time inserting into its own isolated area of the index. The drawback to this plan is that most leaf block splits will be 50/50 leaf block splits (except for the instance currently inserting the highest values)—in fact, given the concurrency problem mentioned in the sidebar, you could see space usage running as low as 25 percent. As a variant on the idea of introducing a large difference between the values used by different nodes, Oracle has, in the past, suggested adding 10^N times the instance id to each sequence value, where N is large enough that the sequence won’t get near 10^N for a few years.
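One way of coding the 10^N variant—all object names here are hypothetical, and N = 12 assumes the sequence stays well below 10^12 for the life of the system:

```sql
insert into orders (order_id /* , other columns */)
values (
        -- instance id shifted well clear of the sequence range
        to_number(sys_context('userenv', 'instance')) * power(10, 12)
                + order_seq.nextval
        /* , other values */
);
```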

The other strategy—if you’ve paid for the partitioning option—is to hash partition the index. Split the index into (say) 16 different pieces by hashing on the sequence number and you’ve turned the one high-value hot spot into 16 (the number of hash partitions should always be a power of two for best effect) warm spots. You will still get contention on the highest block on each partition, but the reduction in contention may be enough that you can live with the result.
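A sketch of the hash-partitioned index (table and index names hypothetical); note the global keyword—the index partitioning is independent of any partitioning of the table itself:

```sql
create unique index orders_pk on orders (order_id)
global partition by hash (order_id)
partitions 16;
```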

Whatever strategy you choose to address this problem, remember to keep the scale of the problem in perspective. Buffer busy waits are not good—but unless they really waste a lot of critical processing time it’s possible to live with them. Be careful that your solution to buffer busy waits doesn’t introduce a worse problem. Tragically, I have known sites that introduced RAC because they thought it would get them out of trouble—and then found that it made their troubles worse because the nature of their problems (such as buffer busy waits) is exaggerated by the way that RAC works.
