Chapter 8. Performance Counters

This chapter introduces tools to examine CPU cache activity (cpustat, cputrack) and bus activity (busstat).

Introducing CPU Caches

Figure 8.1 depicts typical caches that a CPU can use.

CPU Caches

Figure 8.1. CPU Caches

Caches include the following:

  • I-cache. Level 1 instruction cache

  • D-cache. Level 1 data cache

  • P-cache. Prefetch cache

  • W-cache. Write cache

  • E-cache. Level 2 external or embedded cache

These are the typical caches for the content of main memory, depending on the processor. Another framework for caching page translations as part of the Memory Management Unit (MMU) includes the Translation Lookaside Buffer (TLB) and Translation Storage Buffers (TSBs). These translation facilities are discussed in detail in Chapter 12 in Solaris Internals.

Of particular interest are the I-cache, D-cache, and E-cache, which are often listed as key specifications for a CPU type. Details of interest are their size, their cache line size, and their set-associativity. A greater size improves cache hit ratio, and a larger cache line size can improve throughput. A higher set-associativity improves the effect of the Least Recently Used policy, which can avoid hot spots where the cache would otherwise have flushed frequently accessed data.

Experiencing a low cache hit ratio and a large number of cache misses for the I-, D-, or E-cache is likely to degrade application performance. Section 8.2 demonstrates the monitoring of different event statistics, many of which can be used to determine cache performance.

It is important to stress that each processor type is different and can have a different arrangement, type, and number of caches. For example, the UltraSPARC IV+ has a Level 3 cache of 32 Mbytes, in addition to its Level 1 and 2 caches.

To highlight this further, the following describes the caches for three recent SPARC processors:

  • UltraSPARC III Cu. The Level 2 cache is an external cache of either 1, 4, or 8 Mbytes in size, providing either 64-, 256-, or 512-byte cache lines connected by a dedicated bus. It is unified, write-back, allocating, and either one-way or two-way set-associative. It is physically indexed, physically tagged (PIPT).

  • UltraSPARC IIIi. The Level 2 cache is an embedded cache of 1 Mbyte in size, providing a 64-byte cache line and is on the CPU itself. It is unified, write-back, write-allocate, and four-way set-associative. It is physically indexed, physically tagged (PIPT).

  • UltraSPARC T1. Sun’s UltraSPARC T1 is a chip-level multiprocessor. Its CMT hardware architecture has eight cores, or individual execution pipelines, per chip, each with four strands, or active thread contexts, that share a pipeline in each core. Each cycle, a different hardware strand is scheduled on the pipeline in round-robin order. There are 32 threads in total per UltraSPARC T1 processor.

The cores are connected by a high-speed, low-latency crossbar in silicon. An UltraSPARC T1 processor can be considered SMP on a chip. Each core has an instruction cache, a data cache, an instruction translation-lookaside buffer (iTLB), and a data TLB (dTLB) shared by the four strands. A twelve-way associative unified Level 2 (L2) on-chip cache is shared by all 32 hardware threads. Memory latency is uniform across all cores—uniform memory access (UMA), not non-uniform memory access (NUMA).

Figure 8.2 illustrates the structure of the UltraSPARC T1 processor.

UltraSPARC T1 Caches

Figure 8.2. UltraSPARC T1 Caches

For a reference on UltraSPARC caches, see the UltraSPARC Processors Documentation Web site at

http://www.sun.com/processors/documentation.html

This Web site lists the processor user manuals, which are referred to by the cpustat command in the next section. Other CPU brands have similar documentation that can be found online.

cpustat Command

The cpustat command monitors the CPU Performance Counters (CPCs), which provide performance details for the CPU hardware caches. These types of hardware counters are known as Performance Instrumentation Counters, or PICs, which also exist on other devices. The PICs are programmable and record statistics for different events (event is a deliberate term). For example, they can be programmed to track statistics for CPU cache events.

A typical UltraSPARC system might provide two PICs, each of which can be programmed to monitor one event from a list of around twenty. An example of an event is an E-cache hit, the number of which could be counted by a PIC.

Which CPU caches can be measured depends on the type of CPU. Different CPU types not only can have different caches but also can have different available events that the PICs can monitor. It is possible that a CPU could contain a cache with no events associated with it—leaving us with no way to measure cache performance.

The following example demonstrates the use of cpustat to measure E-cache (Level 2 cache) events on an UltraSPARC IIi CPU.

# cpustat -c pic0=EC_ref,pic1=EC_hit 1 5
   time cpu event      pic0      pic1
  1.005   0  tick     66931     52598
  2.005   0  tick     67871     52569
  3.005   0  tick     65003     50907
  4.005   0  tick     64793     50958
  5.005   0  tick     64574     50904
  5.005   1 total    329172    257936

The cpustat command has a -c eventspec option to configure which events the PICs should monitor. We set pic0 to monitor EC_ref, which is E-cache references; and we set pic1 to monitor EC_hit, which is E-cache hits.

Cache Hit Ratio, Cache Misses

If both the cache references and hits are available, as with the UltraSPARC IIi CPU in the previous example, you can calculate the cache hit ratio. For that calculation you could also use cache misses and hits, which some CPU types provide. The calculations are fairly straightforward:

cache hit ratio = cache hits / cache references

cache hit ratio = cache hits / (cache hits + cache misses)

From the first sample in the previous cpustat output, for example, the ratio is 52598 / 66931, or about 0.79 (a 79% hit ratio).

A higher cache hit ratio improves application performance because it avoids the latency incurred when main memory is accessed through the memory buses. The cache hit ratio may also indicate the pattern of activity; a low cache hit ratio may indicate a hot spot, where frequently accessed memory locations map to the same cache location and frequently used data is repeatedly flushed.

Since satisfying each cache miss incurs a time cost, the volume of cache misses may be of more interest than the cache hit ratio. The number of misses relates more directly to application performance than a percentage hit ratio does, because the total time penalty is proportional to the number of misses.

Both cache hit ratios and cache misses can be calculated with a little awk, as the following script, called ecache, demonstrates.[1]

#!/usr/bin/sh
#
# ecache - print E$ misses and hit ratio for UltraSPARC IIi CPUs.
#
# USAGE: ecache [interval [count]]      # by default, interval is 1 sec

cpustat -c pic0=EC_ref,pic1=EC_hit ${1-1} $2 | awk '
        BEGIN { pagesize = 20; lines = pagesize }
        lines >= pagesize {
           lines = 0
           printf("%8s %3s %5s %9s %9s %9s %7s
",
              "E$  time", "cpu", "event", "total", "hits", "miss", "%hit")
        }
        $1 !~ /time/ {
           total = $4
           hits = $5
           miss = total - hits
           ratio = 100 * hits / total
           printf("%8s %3s %5s %9s %9s %9s %7.2f
",
              $1, $2, $3, total, hits, miss, ratio)
           lines++
        }
'

This script is verbose to illustrate the calculations performed, in particular, using extra named variables.[2] nawk or perl would also be suitable for postprocessing the output of cpustat, which itself reads the PICs by using the libcpc library, and binding a thread to each CPU.

The following example demonstrates the extra columns that ecache prints.

# ecache 1 5
E$  time cpu event     total      hits      miss    %hit
   1.013   0  tick     65856     51684     14172   78.48
   2.013   0  tick     71511     55793     15718   78.02
   3.013   0  tick     69051     54203     14848   78.50
   4.013   0  tick     69878     55082     14796   78.83
   5.013   0  tick     68665     53873     14792   78.46
   5.013   1 total    344961    270635     74326   78.45

This tool measures the volume of cache misses (miss) and the cache hit ratio (%hit) achieved for UltraSPARC IIi CPUs.
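Since each miss costs roughly a fixed number of cycles, the same style of postprocessing can estimate the time lost to misses. The following one-liner is only a sketch: the miss penalty (50 cycles) and clock rate (500 MHz) are assumed values and should be replaced with figures for your processor.

# cpustat -nc pic0=EC_ref,pic1=EC_hit 1 5 | awk '
        # estimated stall time = misses * penalty cycles / clock rate (both assumed)
        { miss = $4 - $5; printf "%s %9d miss %8.3f ms stall (est.)\n", $0, miss, miss * 50 / 500000 }'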

Listing PICs and Events

The -h option to cpustat lists the available events for a CPU type and the PICs that can monitor them.

# cpustat -h
Usage:
        cpustat [-c events] [-p period] [-nstD] [interval [count]]

        -c events specify processor events to be monitored
        -n        suppress titles
        -p period cycle through event list periodically
        -s        run user soaker thread for system-only events
        -t        include %tick register
        -D        enable debug mode
        -h        print extended usage information

        Use cputrack(1) to monitor per-process statistics.

        CPU performance counter interface:  UltraSPARC I&II

        event specification syntax:
        [picn=]<eventn>[,attr[n][=<val>]][,[picn=]<eventn>[,attr[n][=<val>]],...]

        event0:  Cycle_cnt Instr_cnt Dispatch0_IC_miss IC_ref DC_rd DC_wr
                 EC_ref EC_snoop_inv Dispatch0_storeBuf Load_use
                 EC_write_hit_RDO EC_rd_hit

        event1:  Cycle_cnt Instr_cnt Dispatch0_mispred EC_wb EC_snoop_cb
                 Dispatch0_FP_use IC_hit DC_rd_hit DC_wr_hit Load_use_RAW
                 EC_hit EC_ic_hit

        attributes: nouser sys

        See the "UltraSPARC I/II User's Manual" (Part No. 802-7220-02) for
        descriptions of these events. Documentation for Sun processors can
        be found at: http://www.sun.com/processors/manuals

The -h output lists the events that can be monitored and finishes by referring to the reference manual for this CPU. These invaluable manuals discuss the CPU caches in detail and explain what the events really mean.

In this example of cpustat -h, the event specification syntax shows that you can set picn to measure events from eventn. For example, you can set pic0 to IC_ref and pic1 to IC_hit; but not the other way around. The output also indicates that this CPU type provides only two PICs and so can measure only two events at the same time.
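To make this concrete, the first invocation below is valid on this CPU type because IC_ref appears in the event0 list and IC_hit in the event1 list; the second, with the events swapped, names each event from the wrong list and would be rejected by cpustat.

# cpustat -c pic0=IC_ref,pic1=IC_hit 1 5
# cpustat -c pic0=IC_hit,pic1=IC_ref 1 5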

PIC Examples: UltraSPARC IIi

We chose the UltraSPARC IIi CPU for the preceding examples because it provides a small collection of fairly straightforward PICs. Understanding this CPU type is a good starting point before we move on to more difficult CPUs. For a full reference for this CPU type, see Appendix B of the UltraSPARC I/II User’s Manual.[3]

The UltraSPARC IIi provides two 32-bit PICs, which are accessed together as a single 64-bit register. The 32-bit counters can wrap around, especially over longer sample intervals (a counter that increments every cycle on a CPU clocked at a few hundred megahertz wraps 2^32 in well under a minute). The 64-bit Performance Control Register (PCR) configures which events (statistics) the two PICs count. Only one invocation of cpustat (or cputrack) is possible at a time, since there is only one set of PICs to share.

The available events for measuring CPU cache activity are listed in Table 8.1. This is from the User’s Manual, where you can find a listing for all events.

Table 8.1. UltraSPARC IIi CPU Cache Events

  • IC_ref (PIC0). I-cache references; I-cache references are fetches of up to four instructions from an aligned block of eight instructions. I-cache references are generally prefetches and do not correspond exactly to the instructions executed.

  • IC_hit (PIC1). I-cache hits.

  • DC_rd (PIC0). D-cache read references (including accesses that subsequently trap); non-D-cacheable accesses are not counted. Atomic, block load, “internal” and “external” bad ASIs, quad precision LDD, and MEMBAR instructions also fall into this class.

  • DC_rd_hit (PIC1). D-cache read hits are counted in one of two places:

      1. When they access the D-cache tags and do not enter the load buffer (because it is already empty)

      2. When they exit the load buffer (because of a D-cache miss or a nonempty load buffer)

  • DC_wr (PIC0). D-cache write references (including accesses that subsequently trap); non-D-cacheable accesses are not counted.

  • DC_wr_hit (PIC1). D-cache write hits.

  • EC_ref (PIC0). Total E-cache references; noncacheable accesses are not counted.

  • EC_hit (PIC1). Total E-cache hits.

  • EC_write_hit_RDO (PIC0). E-cache hits that do a read for ownership of a UPA transaction.

  • EC_wb (PIC1). E-cache misses that do writebacks.

  • EC_snoop_inv (PIC0). E-cache invalidates from the following UPA transactions: S_INV_REQ, S_CPI_REQ.

  • EC_snoop_cb (PIC1). E-cache snoop copybacks from the following UPA transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ, S_CPB_MSI_REQ.

  • EC_rd_hit (PIC0). E-cache read hits from D-cache misses.

  • EC_ic_hit (PIC1). E-cache read hits from I-cache misses.

Reading through the descriptions will reveal many subtleties you need to consider to understand these events. For example, some activity is not cacheable and so does not show up in event statistics for that cache. This includes block loads and block stores, which are not sent to the E-cache since it is likely that this data will be touched only once. You should consider such a point if an application experienced memory latency not explained by the E-cache miss statistics alone.

PIC Examples: The UltraSPARC T1 Processor

Each of the 32 UltraSPARC T1 strands has a set of hardware performance counters that can be monitored using the cpustat(1M) command. cpustat can collect two counters in parallel, the second always being the instruction count. For example, to collect iTLB misses and instruction counts for every strand on the chip, type the following:

# /usr/sbin/cpustat -c pic0=ITLB_miss,pic1=Instr_cnt,sys 1 10
   time cpu event      pic0        pic1
  2.019   0  tick         6   186595695  # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys
  2.089   1  tick         7   192407038  # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys
  2.039   2  tick        49   192237411  # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys
  2.049   3  tick        15   190609811  # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys
......

Both a pic0 and a pic1 register must be specified. ITLB_miss is used for pic0 in the preceding example, although only the instruction counts are of interest in this instance.

The performance counters indicate that each strand is executing about 190 million instructions per second. To determine how many instructions are executing per core, aggregate counts from four strands. Strands zero, one, two, and three are in the first core, strands four, five, six, and seven are in the second core, and so on. The preceding example indicates that the system is executing about 760 million instructions per core per second. If the processor is executing at 1.2 Gigahertz, each core can execute a maximum of 1200 million instructions per second, yielding an efficiency rating of 0.63. To achieve maximum throughput, maximize the number of instructions per second on each core and ultimately on the chip.
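The aggregation into per-core counts can be scripted. The following sketch uses the strand-to-core mapping just described (strand IDs 0 through 3 on the first core, 4 through 7 on the second, and so on) and sums the per-strand instruction counts from the same counters used above.

# /usr/sbin/cpustat -nc pic0=ITLB_miss,pic1=Instr_cnt,sys 1 1 | awk '
        # $2 is the strand (cpu) ID, $5 is pic1 (Instr_cnt); four strands per core
        $3 == "tick" { instr[int($2 / 4)] += $5 }
        END { for (core in instr) printf "core %d: %d instructions/sec\n", core, instr[core] }'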

Other useful cpustat counters for assessing performance on an UltraSPARC T1 processor-based system are detailed in Table 8.2. All counters are per second, per thread. Rather than deal with raw misses, accumulate the counters and express them as a percentage miss rate of instructions. For example, if the system executes 200 million instructions per second on a strand and IC_miss indicates 14 million instruction cache misses per second, then the instruction cache miss rate is seven percent.
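The following one-liner is a sketch of that calculation, dividing instruction cache misses by instructions for each strand; IC_miss is one of the counters listed in Table 8.2, and the event name should be checked against cpustat -h for your processor.

# /usr/sbin/cpustat -nc pic0=IC_miss,pic1=Instr_cnt,sys 1 1 | awk '
        # instruction cache miss rate = IC_miss / Instr_cnt, per strand
        $3 == "tick" && $5 > 0 { printf "cpu %3d  %5.2f%% I-cache miss rate\n", $2, 100 * $4 / $5 }'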

Table 8.2. UltraSPARC-T1 Performance Counters

  • IC_miss. Number of instruction cache misses. High value: >7%. Impact: small, as the latency can be hidden by strands. Potential remedy: compiler flag options to compact the binary; see the compiler section.

  • DC_miss. Number of data cache misses. High value: >11%. Impact: small, as the latency can be hidden by strands. Potential remedy: compact data structures to align on 64-byte boundaries.

  • ITLB_miss. Number of instruction TLB misses. High value: >.001%. Impact: potentially severe, from TLB thrashing. Potential remedy: make sure text is on large pages; see the TLB section.

  • DTLB_miss. Number of data TLB misses. High value: >.005%. Impact: potentially severe, from TLB thrashing. Potential remedy: make sure data segments are on large pages; see the TLB section.

  • L1_imiss. Instruction cache misses that also miss L2. High value: >2%. Impact: medium; potential for all threads to stall. Potential remedy: reduce conflict with data cache misses if possible.

  • L1_dmiss_ld. Data cache misses that also miss L2. High value: >2%. Impact: medium; potential for all threads to stall. Potential remedy: potential alignment issues; offset data structures.

Event Multiplexing

Since some CPUs have only two PICs, only two events can be measured at the same time. If you are looking at a specific CPU component like the I-cache, this situation may be fine. However, sometimes you want to monitor more events than just the PIC count. In that case, you can use the -c option more than once, and the cpustat command will alternate between them. For example,

# cpustat -c pic0=IC_ref,pic1=IC_hit -c pic0=DC_rd,pic1=DC_rd_hit -c 
pic0=DC_wr,pic1=DC_wr_hit -c pic0=EC_ref,pic1=EC_hit -p 1 0.25 5
   time cpu event      pic0      pic1
  0.267   0  tick    221423    197095  # pic0=IC_ref,pic1=IC_hit
  0.513   0  tick       105        65  # pic0=DC_rd,pic1=DC_rd_hit
  0.763   0  tick        37        21  # pic0=DC_wr,pic1=DC_wr_hit
  1.013   0  tick       282       148  # pic0=EC_ref,pic1=EC_hit
  1.267   0  tick    213558    190520  # pic0=IC_ref,pic1=IC_hit
  1.513   0  tick       109        62  # pic0=DC_rd,pic1=DC_rd_hit
  1.763   0  tick        37        21  # pic0=DC_wr,pic1=DC_wr_hit
  2.013   0  tick       276       149  # pic0=EC_ref,pic1=EC_hit
  2.264   0  tick    217713    194040  # pic0=IC_ref,pic1=IC_hit
...

We specified four different PIC configurations (-c eventspec), and cpustat cycled between sampling each of them. We set the interval to 0.25 seconds and set a period (-p) to 1 second so that the final value of 5 is a cycle count, not a sample count. An extra commented field lists the events the columns represent, which helps a postprocessing script such as awk to identify what the values represent.
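For instance, the following sketch uses that trailing comment (field seven of each line) to accumulate each event pair separately and print a hit ratio per cache; the event specifications are the same as in the example above.

# cpustat -nc pic0=IC_ref,pic1=IC_hit -c pic0=DC_rd,pic1=DC_rd_hit -c \
pic0=DC_wr,pic1=DC_wr_hit -c pic0=EC_ref,pic1=EC_hit -p 1 0.25 5 | awk '
        # $7 is the pic0=...,pic1=... comment; $4 and $5 are the reference and hit counts
        $3 == "tick" { ref[$7] += $4; hit[$7] += $5 }
        END {
           for (e in ref)
              if (ref[e] > 0)
                 printf "%-35s %6.2f%% hit\n", e, 100 * hit[e] / ref[e]
        }'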

Some CPU types provide many PICs (more than eight), usually removing the need for event multiplexing as used in the previous example.

Using cpustat with Multiple CPUs

Each example output of cpustat has contained a column for the CPU ID (cpu). Each CPU has its own set of PICs, so when cpustat runs on a multi-CPU system, it must collect PIC values from every CPU. cpustat does this by creating a thread for each CPU and binding it onto that CPU. Each sample then produces a line for each CPU and prints it in the order received, so some slight shuffling of the output lines occurs.

The following example demonstrates cpustat on a server with four UltraSPARC IV CPUs, each of which has two cores.

# cpustat -c pic0=DC_rd,pic1=DC_rd_miss 5 1
   time cpu event      pic0      pic1
  5.008 513  tick    355670     25132
  5.008   3  tick   8824184     34366
  5.008 512  tick        11         1
  5.008   2  tick      1127       123
  5.008 514  tick     55337      3908
  5.008   0  tick        10         3
  5.008   1  tick     19833       854
  5.008 515  tick   7360753     36567
  5.008   8 total  16616925     100954

For the last line (total), the cpu column shows the number of CPUs sampled rather than a CPU ID; here, eight.
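If that shuffling is a nuisance for a postprocessing script, the output can simply be re-sorted by time and then CPU ID, since those are the first two columns. A minimal sketch with the same counters:

# cpustat -nc pic0=DC_rd,pic1=DC_rd_miss 5 1 | sort -k1,1n -k2,2n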

Cycles per Instruction

The CPC events can monitor more than just the CPU caches. The following example demonstrates the use of the cycle count and instruction count on an UltraSPARC IIi to calculate the average number of cycles per instruction, printed last.

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | 
awk '{ printf "%s %.2f cpi
",$0,$4/$5; }'
 10.034   0  tick 3554903403 3279712368  1.08 cpi
 10.034   1 total 3554903403 3279712368  1.08 cpi

This single 10-second sample averaged 1.08 cycles per instruction. During this test, the CPU was busy running an infinite loop program. Since the same simple instructions run over and over, the instructions and data are found in the Level 1 cache, so they execute quickly.

Now the same test is performed while the CPU is busy with heavy random memory access:

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | 
awk '{ printf "%s %.2f cpi
",$0,$4/$5; }'
 10.036   0  tick 205607856  34023849  6.04 cpi
 10.036   1 total 205607856  34023849  6.04 cpi

Since accessing main memory is much slower, the cycles per instruction have increased to an average of 6.04.

PIC Examples: UltraSPARC IV

The UltraSPARC IV processor provides a greater number of events that can be monitored. The following example is the output from cpustat -h, which lists these events.

# cpustat -h
...
Use cputrack(1) to monitor per-process statistics.

        CPU performance counter interface:  UltraSPARC III+ & IV

        events  pic0=<event0>,pic1=<event1>[,sys][,nouser]

        event0: Cycle_cnt Instr_cnt Dispatch0_IC_miss IC_ref DC_rd DC_wr
                EC_ref EC_snoop_inv Dispatch0_br_target Dispatch0_2nd_br
                Rstall_storeQ Rstall_IU_use EC_write_hit_RTO EC_rd_miss
                PC_port0_rd SI_snoop SI_ciq_flow SI_owned SW_count_0
                IU_Stat_Br_miss_taken IU_Stat_Br_count_taken
                Dispatch_rs_mispred FA_pipe_completion MC_reads_0
                MC_reads_1 MC_reads_2 MC_reads_3 MC_stalls_0 MC_stalls_2
                EC_wb_remote EC_miss_local EC_miss_mtag_remote

        event1: Cycle_cnt Instr_cnt Dispatch0_mispred EC_wb EC_snoop_cb
                IC_miss_cancelled Re_FPU_bypass Re_DC_miss Re_EC_miss
                IC_miss DC_rd_miss DC_wr_miss Rstall_FP_use EC_misses
                EC_ic_miss Re_PC_miss ITLB_miss DTLB_miss WC_miss
                WC_snoop_cb WC_scrubbed WC_wb_wo_read PC_soft_hit
                PC_snoop_inv PC_hard_hit PC_port1_rd SW_count_1
                IU_Stat_Br_miss_untaken IU_Stat_Br_count_untaken
                PC_MS_misses Re_RAW_miss FM_pipe_completion MC_writes_0
                MC_writes_1 MC_writes_2 MC_writes_3 MC_stalls_1 MC_stalls_3
                Re_DC_missovhd EC_miss_mtag_remote EC_miss_remote

        See the "SPARC V9 JPS1 Implementation Supplement: Sun
        UltraSPARC-III+"

Some of these are similar to the UltraSPARC IIi CPU, but many are additional. The extra events allow memory controller and pipeline activity to be measured.

cputrack Command

While the cpustat command monitors activity for the entire system, the cputrack command allows the same counters to be measured for a single process. This can be useful for focusing on particular applications and determining whether only one process is the cause of performance issues.

The event specification for cputrack is the same as cpustat, except that instead of an interval and a count, cputrack takes either a command or -p PID.

# cputrack
Usage:
        cputrack [-T secs] [-N count] [-Defhnv] [-o file]
                -c events [command [args] | -p pid]

        -T secs   seconds between samples, default 1
        -N count  number of samples, default unlimited
        -D        enable debug mode
        -e        follow exec(2), and execve(2)
        -f        follow fork(2), fork1(2), and vfork(2)
        -h        print extended usage information
        -n        suppress titles
        -t        include virtualized %tick register
        -v        verbose mode
        -o file   write cpu statistics to this file
        -c events specify processor events to be monitored
        -p pid    pid of existing process to capture

        Use cpustat(1M) to monitor system-wide statistics.

The usage message for cputrack ends with a reminder to use cpustat for systemwide statistics.

The following example demonstrates cputrack monitoring the instructions and cycles for a sleep command.

# cputrack -c pic0=Instr_cnt,pic1=Cycle_cnt sleep 5
   time lwp      event      pic0      pic1
  1.024   1       tick    188134    629987
  2.023   1       tick         0         0
  3.023   1       tick         0         0
  4.023   1       tick         0         0
  5.023   1       tick         0         0
  5.034   1       exit    196623    682808

In the first second, the sleep command initializes and executes 188,134 instructions. Then the sleep command sleeps, reporting zero counts in the output; this shows that cputrack is monitoring our sleep command only and is not reporting on other system activity. The sleep command wakes after five seconds and executes its final instructions, finishing with a total on exit of 196,623 instructions.

As another example, we use cputrack to monitor the D-cache activity of PID 19849, which has multiple threads. The number of samples is limited to 20 (-N).

$ cputrack -N 20 -c pic0=DC_access,pic1=DC_miss -p 19849
   time lwp      event      pic0      pic1
  1.007   1       tick  34543793    824363
  1.007   2       tick         0         0
  1.007   3       tick 1001797338   5153245
  1.015   4       tick 976864106   5536858
  1.007   5       tick 1002880440   5217810
  1.017   6       tick 948543113   3731144
  2.007   1       tick  15425817    745468
  2.007   2       tick         0         0
  2.014   3       tick 1002035102   5110169
  2.017   4       tick 976879154   5542155
  2.030   5       tick 1018802136   5283137
  2.033   6       tick 1013933228   4072636
......

This CPU type provides D-cache misses for pic1, a useful statistic inasmuch as cache misses incur a certain time cost. Here, lwp 2 appears to be idle, while lwps 3, 4, 5, and 6 are causing many D-cache events. With a little awk, we could add another column for D-cache hit ratio.
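As a sketch of that, the following pipes the same cputrack invocation into awk and appends a hit ratio column, computed as (accesses - misses) / accesses; lines with zero accesses (such as the idle lwp 2) are skipped.

$ cputrack -N 20 -c pic0=DC_access,pic1=DC_miss -p 19849 | awk '
        # $4 is DC_access (pic0), $5 is DC_miss (pic1)
        $3 == "tick" && $4 > 0 { printf "%s %7.2f %%hit\n", $0, 100 * ($4 - $5) / $4 }'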

For additional information on cputrack, see cputrack(1).

busstat Command

The busstat command monitors bus statistics for systems that contain instrumented buses. Such buses contain Performance Instrumentation Counters (PICs), which in some ways are similar to the CPU PICs.

Listing Supported Buses

busstat -l lists instrumented buses that busstat can monitor.

# busstat -l
busstat: No devices available in system.

If you see the “No devices available” message, then you won’t get any further. Find another system (usually a larger system) that responds by listing instance names. The following is from a Sun Enterprise E4500.

# busstat -l
Busstat Device(s):
sbus1 ac0 ac1 ac2 ac3 ac4 sbus0 sbus2 sbus3 sbus4

The output of busstat -l has now listed ten devices that provide PICs for us to use. sbus is for SBus, the interconnect bus for devices including peripherals; ac is for Address Controller.

Listing Bus Events

The -e switch for busstat lists events that a bus device can monitor. Here we list events for ac0.

# busstat -e ac0
pic0
mem_bank0_rds
mem_bank0_wrs
mem_bank0_stall
mem_bank1_rds
mem_bank1_wrs
mem_bank1_stall
clock_cycles
...

pic1
mem_bank0_rds
mem_bank0_wrs
mem_bank0_stall
mem_bank1_rds
mem_bank1_wrs
mem_bank1_stall
clock_cycles
...

The list of events for each PIC is very long; we truncated it so that this example doesn’t fill an entire page.

It can help to use the pr command to rework the output into columns. The following example does this for the sbus0.

# busstat -e sbus0 | pr -t2
pic0                                pic1
dvma_stream_rd                      dvma_stream_rd
dvma_stream_wr                      dvma_stream_wr
dvma_const_rd                       dvma_const_rd
dvma_const_wr                       dvma_const_wr
dvma_tlb_misses                     dvma_tlb_misses
dvma_stream_buf_mis                 dvma_stream_buf_mis
dvma_cycles                         dvma_cycles
dvma_bytes_xfr                      dvma_bytes_xfr
interrupts                          interrupts
upa_inter_nack                      upa_inter_nack
pio_reads                           pio_reads
pio_writes                          pio_writes
sbus_reruns                         sbus_reruns
pio_cycles                          pio_cycles
#

The first column lists events for pic0; the second are events for pic1.

Unlike cpustat, busstat does not finish by listing a reference manual for these events. There is currently little public documentation for bus events[4]; most Internet searches match only the man page for busstat and the event names in the OpenSolaris source. Fortunately, many of the event names are self-evident (for example, mem_bank0_rds is probably memory bank 0 reads), and some of the terms are similar to those used for CPU PICs, as documented in the CPU manuals.

Monitoring Bus Events

Monitoring bus events is similar to monitoring CPU events, except that we must specify which bus instance or instances to examine.

The following example examines ac1 for memory bank stalls, printing a column for each memory bank. We specified an interval of 1 second and a count of 5.

# busstat -w ac1,pic0=mem_bank0_stall,pic1=mem_bank1_stall 1 5
time dev    event0               pic0        event1               pic1
1    ac1    mem_bank0_stall      2653        mem_bank1_stall      0
2    ac1    mem_bank0_stall      2039        mem_bank1_stall      0
3    ac1    mem_bank0_stall      3614        mem_bank1_stall      0
4    ac1    mem_bank0_stall      3213        mem_bank1_stall      0
5    ac1    mem_bank0_stall      2380        mem_bank1_stall      0

The second bank is empty, so pic1 measured no events for it. Memory stall events are interesting—they signify latency suffered when a memory bank is already busy with a previous request.

There are some differences between busstat and cpustat: There is no total line with busstat, and intervals less than one second are not accepted. busstat uses a -w option to indicate that devices are written to, thereby configuring them so that their PICs will monitor the specified events, whereas cpustat itself writes to each CPU’s PCR.

By specifying ac instead of ac1, we now monitor these events across all address controllers.

# busstat -w ac,pic0=mem_bank0_stall,pic1=mem_bank1_stall 1 5
time dev    event0               pic0        event1               pic1
1    ac0    mem_bank0_stall      2641        mem_bank1_stall      0
1    ac1    mem_bank0_stall      2766        mem_bank1_stall      0
1    ac2    mem_bank0_stall      0           mem_bank1_stall      0
1    ac3    mem_bank0_stall      0           mem_bank1_stall      0
1    ac4    mem_bank0_stall      0           mem_bank1_stall      0
2    ac0    mem_bank0_stall      2374        mem_bank1_stall      0
2    ac1    mem_bank0_stall      2545        mem_bank1_stall      0
2    ac2    mem_bank0_stall      0           mem_bank1_stall      0
2    ac3    mem_bank0_stall      0           mem_bank1_stall      0
2    ac4    mem_bank0_stall      0           mem_bank1_stall      0
3    ac0    mem_bank0_stall      2133        mem_bank1_stall      0

We would study the dev column to see which device the line of statistics belongs to.

busstat also provides a -r option, to read PICs without changing the configured events. This means that we monitor whatever was previously set by -w. Here’s an example of using -r after the previous -w example.

# busstat -r ac0 1 5
time dev    event0               pic0        event1               pic1
1    ac0    mem_bank0_stall      2039        mem_bank1_stall      0
2    ac0    mem_bank0_stall      1822        mem_bank1_stall      0
3    ac0    mem_bank0_stall      1868        mem_bank1_stall      0
4    ac0    mem_bank0_stall      2109        mem_bank1_stall      0
5    ac0    mem_bank0_stall      1779        mem_bank1_stall      0

Event Multiplexing

As with using cpustat for a limited number of PICs (see Section 8.2.5), you can specify multiple events for busstat so that more events than PICs can be monitored. The multiple-event specifications are measured alternately.

The following example demonstrates the use of busstat to measure many bus events.

# busstat -w ac0,pic0=mem_bank0_rds,pic1=mem_bank0_wrs -w 
ac0,pic0=addr_pkts,pic1=data_pkts -w ac0,pic0=ac_addr_pkts,pic1=ac_data_pkts 1 9
time dev    event0               pic0        event1               pic1
1    ac0    mem_bank0_rds        47692       mem_bank0_wrs        1785
2    ac0    addr_pkts            87753       data_pkts            112209
3    ac0    ac_addr_pkts         126718      ac_data_pkts         141410
4    ac0    mem_bank0_rds        40187       mem_bank0_wrs        4860
5    ac0    addr_pkts            92343       data_pkts            119899
6    ac0    ac_addr_pkts         55964       ac_data_pkts         69573
7    ac0    mem_bank0_rds        39518       mem_bank0_wrs        3050
8    ac0    addr_pkts            84103       data_pkts            108542
9    ac0    ac_addr_pkts         256737      ac_data_pkts         317145
#

We specified three pairs of events, with an interval of one second and a count of nine. Each event pair was measured three times, for one second. We would study the event0 and event1 columns to see what the pic values represent.

For additional information on busstat, see busstat(1M).

Example: UltraSPARC T1

UltraSPARC T1 processors also have a number of DRAM performance counters, the most important of which are read and write operations to each of the four memory banks. The tool to display DRAM counters is the busstat command. Be sure to type the command on a single line.

# busstat -w dram0,pic0=mem_reads,pic1=mem_writes -w dram1,pic0=mem_reads,pic1=mem_writes 
-w dram2,pic0=mem_reads,pic1=mem_writes -w dram3,pic0=mem_reads,pic1=mem_writes
time dev    event0               pic0        event1               pic1
1    dram0  mem_reads            16104       mem_writes           8086
1    dram1  mem_reads            15953       mem_writes           8032
1    dram2  mem_reads            15957       mem_writes           8069
1    dram3  mem_reads            15973       mem_writes           8001

The counts are of 64-byte lines read from or written to memory; to get the total bandwidth, add all four counters together. In the preceding example, the system is reading roughly 4 * 16000 * 64 = 4,096,000 bytes per second (about 3.9 Mbytes/sec) and writing roughly 4 * 8000 * 64 = 2,048,000 bytes per second (about 1.95 Mbytes/sec).
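That arithmetic is easy to automate. The following sketch feeds the same busstat invocation (typed on one line, as noted above) into awk, sums the four DRAM controllers per interval, and converts 64-byte lines to approximate megabytes per second; it assumes dram3 is the last device reported in each interval.

# busstat -w dram0,pic0=mem_reads,pic1=mem_writes -w dram1,pic0=mem_reads,pic1=mem_writes \
  -w dram2,pic0=mem_reads,pic1=mem_writes -w dram3,pic0=mem_reads,pic1=mem_writes 1 | awk '
        # columns: time dev event0 pic0 event1 pic1; each count is a 64-byte line
        $3 == "mem_reads" { rd[$1] += $4; wr[$1] += $6 }
        $3 == "mem_reads" && $2 == "dram3" {
           printf "t=%s  read %.1f Mbytes/sec  write %.1f Mbytes/sec\n",
              $1, rd[$1] * 64 / 1048576, wr[$1] * 64 / 1048576
        }'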



[1] This script is based on E-cache from the freeware CacheKit (Brendan Gregg). See the CacheKit for scripts that support other CPU types and scripts that measure I- and D-cache activity.

[2] A one-liner version to add just the %hit column is as follows:

# cpustat -nc pic0=EC_ref,pic1=EC_hit 1 5 | awk '{ printf "%s %.2f\n",$0,$5*100/$4 }'

[4] Probably because no one has asked! busstat is not in common use by customers; the main users have been engineers within Sun.
