The Performance Monitoring Facility

Performance Monitoring Is Not Architecturally Defined

Whether a processor implements a Performance Monitoring facility and, if so, how it is implemented are design-specific. The facility is not part of the IA32 processor architecture specification. The implementations on the Pentium®, P6 and Pentium® 4 processor families are not compatible with each other.

Author's Note

The author had a considerable amount of trouble achieving a detailed understanding of every aspect of this feature. The Intel® documentation of this feature is somewhat confusing in some areas.

An Overview

The Pentium® and P6 processor families implemented two Performance Counters, permitting the simultaneous measurement of two event types during a given period of time. The Pentium® 4 processor family expanded the number of counters to 18, and all of the registers associated with the facility are MSRs:

  • There are 18 Performance Counters, each of which is 40 bits wide.

  • There is one Counter Configuration Control register (CCCR) associated with each of the counters. A counter's CCCR is used to set up its associated counter for a specific method or style of counting.

  • There are 45 Event Selection Control registers (ESCRs) used to select the type of event (or events) to be measured by each counter.

In addition, the processor has the ability to store certain types of event records in a special memory buffer referred to as the Debug Store (DS) save area:

  • The IA32_DS_AREA MSR is programmed with the location of the DS save area in memory.

  • The programmer can determine whether or not a processor supports the DS mechanism by executing a CPUID request type 1 and verifying that EDX[21] = 1.
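
The DS support test described above amounts to checking a single bit. The following is a minimal sketch, assuming the EDX value returned by CPUID request type 1 has already been obtained by some other means (Python cannot execute CPUID directly); the function name `supports_debug_store` is hypothetical.

```python
# Bit position taken from the text: CPUID request type 1, EDX[21].
DS_FEATURE_BIT = 21


def supports_debug_store(cpuid1_edx: int) -> bool:
    """Return True if the supplied CPUID(1) EDX value has the DS bit set.

    The caller is assumed to have captured EDX elsewhere (for example,
    from a small native helper); this function only tests the bit.
    """
    return bool((cpuid1_edx >> DS_FEATURE_BIT) & 1)
```

A caller would pass in the raw 32-bit EDX value and branch on the result before programming IA32_DS_AREA.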

The IA32_MISC_ENABLE MSR (see Figure 56-21 on page 1373) contains two bits associated with the Performance Monitoring facility:

  • The Performance Monitoring Available bit is a read-only bit that indicates whether or not the processor supports the Performance Monitoring facility.

  • The Precise Event-Based Sampling Unavailable bit is a read-only bit that indicates whether or not the processor supports the form of event counting referred to as Precise Event-Based Sampling (PEBS).

There Are Two Event Categories

Each event type that can be counted falls into one of two categories:

  • A non-retirement event is an event that occurs any time during instruction execution (e.g., FSB transactions or cache accesses).

  • An at-retirement event is an event that is counted when a μop is retired. When a μop experiences a specified event type during its execution, it can be tagged with an identifier. This permits the Performance Monitoring logic to distinguish events that occurred on a correctly predicted execution path from those that occurred on a mispredicted path (whose instruction results are therefore never committed to the processor's register set). Intel® documentation sometimes refers to these as non-bogus versus bogus events.

There Are Three Sampling Methods

A counter can be programmed to use one of the following three methods of counting:

  • A running event count. A counter is set up to count one or more event types. Software periodically reads the counter to determine how many of the selected event type(s) have occurred since the last time the counter was read.

  • Interrupt on counter overflow. Intel® documentation refers to this as Non-precise event-based sampling (NPEBS, if you will):

    - A counter is set up to count one or more event types.

    - It is preset with an initial count.

    - It is enabled to generate a Performance Counter interrupt when the counter has an overflow condition.

    - When the counter overflows, the handler is called.

    - The interrupt handler records the Return Instruction Pointer (RIP), resets the count to its initial programmed value, restarts the counter, and returns to the interrupted program.

  • Automatic state save on counter overflow. Intel® documentation refers to this as Precise Event-Based Sampling (PEBS):

    - A counter is set up to count one or more event types.

    - It is preset with an initial count.

    - It is enabled to store an event record in the Debug Store (DS) memory area when the counter has an overflow condition.

    - When the counter overflows, the processor automatically copies the contents of the GPRs, the EFlags register and EIP into an event record in the DS memory area.

    - The processor then automatically resets the count to its initial programmed value and resumes counting.

    - The processor then resumes execution of the program.

    - When the DS save area is approaching a full condition, a Performance Counter interrupt is generated, and the event records currently in the DS save area can be saved to non-volatile memory (e.g., to disk). A circular DS save buffer is not supported for event records.
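
The counter arithmetic behind these methods can be sketched as follows. This is an illustration only: it assumes that a counter counts upward and that "overflow" means wrapping past the top of its 40-bit range, and the helper names are hypothetical.

```python
COUNTER_BITS = 40                      # counter width stated in the overview
COUNTER_MODULUS = 1 << COUNTER_BITS    # values wrap modulo 2**40


def preset_for_overflow_after(events: int) -> int:
    """Initial count that makes the counter overflow after `events` events.

    Assumes an up-counting counter that overflows when it wraps to zero.
    """
    if not 0 < events <= COUNTER_MODULUS:
        raise ValueError("events must be between 1 and 2**40")
    return (COUNTER_MODULUS - events) % COUNTER_MODULUS


def events_since(previous: int, current: int) -> int:
    """Events counted between two reads (running event count method).

    Modular subtraction tolerates a single wrap between the two reads.
    """
    return (current - previous) % COUNTER_MODULUS
```

For the two overflow-based methods, the same preset value would be reloaded each time the counter overflows (by the interrupt handler in the non-precise case, by the processor itself under PEBS).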

Relationship of a Counter, Its CCCR and the ESCRs

Refer to Figure 56-22 on page 1375. There is a 1-to-1 hardwired relationship between each of the 18 counters and each of the 18 CCCRs. A counter's CCCR is used to set up and enable the counter for a specific method or style of counting.

Figure 56-22. Relationship of a Counter, Its CCCR and the ESCRs


Regarding the ESCRs, each counter is associated with a group of ESCRs, and the number of ESCRs in the group is counter-specific (up to eight ESCRs can be related to a counter). When a counter's CCCR is configured, one of the items programmed into it is a 3-bit number identifying which of the counter's related ESCRs defines the event(s) to be counted. Note also that an ESCR can be related to more than one counter.

Table 56-5 on page 1376 defines the relationship of each counter to its CCCR and to its related ESCRs. The following items are listed in each column:

  • The first column contains the counter's name.

  • The second column contains the counter's number.

  • The third column contains the counter's MSR address.

  • The fourth column contains the name of the counter's CCCR.

  • The fifth column contains the CCCR's MSR address.

  • The sixth column contains the names of the ESCRs that can feed event counts to the counter.

  • The seventh column contains the 3-bit ID of the ESCR. In order to connect the ESCR to a particular counter, the CCCR associated with the counter must be programmed with this ID.

  • The eighth and final column contains the ESCR's MSR address.

Table 56-5. Counter/CCCR/ESCR Relationship
Counter                        CCCR                   ESCRs Associated With the Counter
Name                No.  MSR   Name             MSR   Name             No.  MSR
MSR_BPU_COUNTER0    0    300h  MSR_BPU_CCCR0    360h  MSR_BSU_ESCR0    7    3A0h
                                                      MSR_FSB_ESCR0    6    3A2h
                                                      MSR_MOB_ESCR0    2    3AAh
                                                      MSR_PMH_ESCR0    4    3ACh
                                                      MSR_BPU_ESCR0    0    3B2h
                                                      MSR_IS_ESCR0     1    3B4h
                                                      MSR_ITLB_ESCR0   3    3B6h
                                                      MSR_IX_ESCR0     5    3C8h
MSR_BPU_COUNTER1    1    301h  MSR_BPU_CCCR1    361h  MSR_BSU_ESCR0    7    3A0h
                                                      MSR_FSB_ESCR0    6    3A2h
                                                      MSR_MOB_ESCR0    2    3AAh
                                                      MSR_PMH_ESCR0    4    3ACh
                                                      MSR_BPU_ESCR0    0    3B2h
                                                      MSR_IS_ESCR0     1    3B4h
                                                      MSR_ITLB_ESCR0   3    3B6h
                                                      MSR_IX_ESCR0     5    3C8h
MSR_BPU_COUNTER2    2    302h  MSR_BPU_CCCR2    362h  MSR_BSU_ESCR1    7    3A1h
                                                      MSR_FSB_ESCR1    6    3A3h
                                                      MSR_MOB_ESCR1    2    3ABh
                                                      MSR_PMH_ESCR1    4    3ADh
                                                      MSR_BPU_ESCR1    0    3B3h
                                                      MSR_IS_ESCR1     1    3B5h
                                                      MSR_ITLB_ESCR1   3    3B7h
                                                      MSR_IX_ESCR1     5    3C9h
MSR_BPU_COUNTER3    3    303h  MSR_BPU_CCCR3    363h  MSR_BSU_ESCR1    7    3A1h
                                                      MSR_FSB_ESCR1    6    3A3h
                                                      MSR_MOB_ESCR1    2    3ABh
                                                      MSR_PMH_ESCR1    4    3ADh
                                                      MSR_BPU_ESCR1    0    3B3h
                                                      MSR_IS_ESCR1     1    3B5h
                                                      MSR_ITLB_ESCR1   3    3B7h
                                                      MSR_IX_ESCR1     5    3C9h
MSR_MS_COUNTER0     4    304h  MSR_MS_CCCR0     364h  MSR_MS_ESCR0     0    3C0h
                                                      MSR_TBPU_ESCR0   2    3C2h
                                                      MSR_TC_ESCR0     1    3C4h
MSR_MS_COUNTER1     5    305h  MSR_MS_CCCR1     365h  MSR_MS_ESCR0     0    3C0h
                                                      MSR_TBPU_ESCR0   2    3C2h
                                                      MSR_TC_ESCR0     1    3C4h
MSR_MS_COUNTER2     6    306h  MSR_MS_CCCR2     366h  MSR_MS_ESCR1     0    3C1h
                                                      MSR_TBPU_ESCR1   2    3C3h
                                                      MSR_TC_ESCR1     1    3C5h
MSR_MS_COUNTER3     7    307h  MSR_MS_CCCR3     367h  MSR_MS_ESCR1     0    3C1h
                                                      MSR_TBPU_ESCR1   2    3C3h
                                                      MSR_TC_ESCR1     1    3C5h
MSR_FLAME_COUNTER0  8    308h  MSR_FLAME_CCCR0  368h  MSR_FIRM_ESCR0   1    3A4h
                                                      MSR_FLAME_ESCR0  0    3A6h
                                                      MSR_DAC_ESCR0    5    3A8h
                                                      MSR_SAAT_ESCR0   2    3AEh
                                                      MSR_U2L_ESCR0    3    3B0h
MSR_FLAME_COUNTER1  9    309h  MSR_FLAME_CCCR1  369h  MSR_FIRM_ESCR0   1    3A4h
                                                      MSR_FLAME_ESCR0  0    3A6h
                                                      MSR_DAC_ESCR0    5    3A8h
                                                      MSR_SAAT_ESCR0   2    3AEh
                                                      MSR_U2L_ESCR0    3    3B0h
MSR_FLAME_COUNTER2  10   30Ah  MSR_FLAME_CCCR2  36Ah  MSR_FIRM_ESCR1   1    3A5h
                                                      MSR_FLAME_ESCR1  0    3A7h
                                                      MSR_DAC_ESCR1    5    3A9h
                                                      MSR_SAAT_ESCR1   2    3AFh
                                                      MSR_U2L_ESCR1    3    3B1h
MSR_FLAME_COUNTER3  11   30Bh  MSR_FLAME_CCCR3  36Bh  MSR_FIRM_ESCR1   1    3A5h
                                                      MSR_FLAME_ESCR1  0    3A7h
                                                      MSR_DAC_ESCR1    5    3A9h
                                                      MSR_SAAT_ESCR1   2    3AFh
                                                      MSR_U2L_ESCR1    3    3B1h
MSR_IQ_COUNTER0     12   30Ch  MSR_IQ_CCCR0     36Ch  MSR_CRU_ESCR0    4    3B8h
                                                      MSR_CRU_ESCR2    5    3CCh
                                                      MSR_CRU_ESCR4    6    3E0h
                                                      MSR_IQ_ESCR0     0    3BAh
                                                      MSR_RAT_ESCR0    2    3BCh
                                                      MSR_SSU_ESCR0    3    3BEh
                                                      MSR_ALF_ESCR0    1    3CAh
MSR_IQ_COUNTER1     13   30Dh  MSR_IQ_CCCR1     36Dh  MSR_CRU_ESCR0    4    3B8h
                                                      MSR_CRU_ESCR2    5    3CCh
                                                      MSR_CRU_ESCR4    6    3E0h
                                                      MSR_IQ_ESCR0     0    3BAh
                                                      MSR_RAT_ESCR0    2    3BCh
                                                      MSR_SSU_ESCR0    3    3BEh
                                                      MSR_ALF_ESCR0    1    3CAh
MSR_IQ_COUNTER2     14   30Eh  MSR_IQ_CCCR2     36Eh  MSR_CRU_ESCR1    4    3B9h
                                                      MSR_CRU_ESCR3    5    3CDh
                                                      MSR_CRU_ESCR5    6    3E1h
                                                      MSR_IQ_ESCR1     0    3BBh
                                                      MSR_RAT_ESCR1    2    3BDh
                                                      MSR_ALF_ESCR1    1    3CBh
MSR_IQ_COUNTER3     15   30Fh  MSR_IQ_CCCR3     36Fh  MSR_CRU_ESCR1    4    3B9h
                                                      MSR_CRU_ESCR3    5    3CDh
                                                      MSR_CRU_ESCR5    6    3E1h
                                                      MSR_IQ_ESCR1     0    3BBh
                                                      MSR_RAT_ESCR1    2    3BDh
                                                      MSR_ALF_ESCR1    1    3CBh
MSR_IQ_COUNTER4     16   310h  MSR_IQ_CCCR4     370h  MSR_CRU_ESCR0    4    3B8h
                                                      MSR_CRU_ESCR2    5    3CCh
                                                      MSR_CRU_ESCR4    6    3E0h
                                                      MSR_IQ_ESCR0     0    3BAh
                                                      MSR_RAT_ESCR0    2    3BCh
                                                      MSR_SSU_ESCR0    3    3BEh
                                                      MSR_ALF_ESCR0    1    3CAh
MSR_IQ_COUNTER5     17   311h  MSR_IQ_CCCR5     371h  MSR_CRU_ESCR1    4    3B9h
                                                      MSR_CRU_ESCR3    5    3CDh
                                                      MSR_CRU_ESCR5    6    3E1h
                                                      MSR_IQ_ESCR1     0    3BBh
                                                      MSR_RAT_ESCR1    2    3BDh
                                                      MSR_ALF_ESCR1    1    3CBh
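
One way software might use Table 56-5 is as a lookup structure. The sketch below is illustrative only: it transcribes just two of the eighteen counters, relies on the observation that the table's counter and CCCR MSR addresses run consecutively from 300h and 360h, and uses hypothetical names (`ESCRS_BY_COUNTER`, `escr_select_id`) that do not come from Intel® documentation.

```python
# Base addresses inferred from Table 56-5 (counter n at 300h + n, CCCR n at 360h + n).
COUNTER_BASE_MSR = 0x300
CCCR_BASE_MSR = 0x360
NUM_COUNTERS = 18

# Partial transcription of Table 56-5:
# counter number -> {ESCR name: (3-bit ESCR ID, ESCR MSR address)}
ESCRS_BY_COUNTER = {
    0: {  # MSR_BPU_COUNTER0
        "MSR_BSU_ESCR0": (7, 0x3A0), "MSR_FSB_ESCR0": (6, 0x3A2),
        "MSR_MOB_ESCR0": (2, 0x3AA), "MSR_PMH_ESCR0": (4, 0x3AC),
        "MSR_BPU_ESCR0": (0, 0x3B2), "MSR_IS_ESCR0": (1, 0x3B4),
        "MSR_ITLB_ESCR0": (3, 0x3B6), "MSR_IX_ESCR0": (5, 0x3C8),
    },
    4: {  # MSR_MS_COUNTER0
        "MSR_MS_ESCR0": (0, 0x3C0), "MSR_TBPU_ESCR0": (2, 0x3C2),
        "MSR_TC_ESCR0": (1, 0x3C4),
    },
}


def _check_counter(number: int) -> None:
    if not 0 <= number < NUM_COUNTERS:
        raise ValueError("counter number must be in the range 0..17")


def counter_msr(number: int) -> int:
    """MSR address of the given counter."""
    _check_counter(number)
    return COUNTER_BASE_MSR + number


def cccr_msr(number: int) -> int:
    """MSR address of the CCCR hardwired to the given counter."""
    _check_counter(number)
    return CCCR_BASE_MSR + number


def escr_select_id(counter_number: int, escr_name: str) -> int:
    """3-bit ID to program into the counter's CCCR to select this ESCR."""
    try:
        return ESCRS_BY_COUNTER[counter_number][escr_name][0]
    except KeyError:
        raise ValueError(f"{escr_name} is not listed for counter {counter_number}")
```

Keeping the ESCR ID next to the ESCR's MSR address mirrors the table's last three columns, so a single lookup yields both the value for the CCCR and the register that must be programmed with the event selection.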

The Event Select Control Registers

The ESCR selected to be associated with a counter (by programming a 3-bit value into the CCCR's ESCR Select field) contains two fields (see Figure 56-23 on page 1384) that are used to select the event or events to be counted by that counter:

  • The 6-bit Event Select field identifies a class of events.

  • The 16 bits in the Event Mask field select which specific events within the selected class are to be counted.

Figure 56-23. The ESCR Format


Table 56-7 on page 1385 describes the Event Classes and corresponding Event Mask bit fields that are currently defined for Non-Retirement Event Counting.

Table 56-7. Event Classes and Event Mask Bits for Non-Retirement Counting
Event Class | Description | ESCRs that can be used | Counters that must be used
06h

Retired Branches.

Mask Bits:
  • 0: MMNP. Branch Not-taken Predicted.

  • 1: MMNM. Branch Not-taken Mispredicted.

  • 2: MMTP. Branch Taken Predicted.

  • 3: MMTM. Branch Taken Mispredicted.

CRU_ESCR2 CRU_ESCR3

Note: These ESCR names begin with “MSR_”

ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
03h

Retired mispredicted branches.

Mask Bits:
  • 0: NBOGUS. The retired instruction is not bogus.

CRU_ESCR0 CRU_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 12, 13, 16 ESCR1: 14, 15, 17
01h

TC_deliver_mode. This event counts the duration (in clock cycles) of the operating modes of the Trace Cache and decode engine in the processor.

  • Deliver Mode means that there was a hit on the TC and the requested μops are being delivered from the TC.

  • Build Mode means that there was a TC miss and the μops being delivered from the decoder are being assembled into trace lines of 6 μops each.

Mask Bits:
  • 0: DD. Both logical processors are in deliver mode.

  • 1: DB. Logical processor 0 is in deliver mode and logical processor 1 is in build mode.

  • 2: DI. Logical processor 0 is in deliver mode and logical processor 1 is either halted, under a Machine Clear condition, or transitioning to a long microcode flow.

  • 3: BD. Logical processor 0 is in build mode and logical processor 1 is in deliver mode.

  • 4: BB. Both logical processors are in build mode.

  • 5: BI. Logical processor 0 is in build mode and logical processor 1 is either halted, under a Machine Clear condition, or transitioning to a long microcode flow.

  • 6: ID. Logical processor 0 is either halted, under a Machine Clear condition, or transitioning to a long microcode flow. Logical processor 1 is in deliver mode.

  • 7: IB. Logical processor 0 is either halted, under a Machine Clear condition, or transitioning to a long microcode flow. Logical processor 1 is in build mode.

TC_ESCR0 TC_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 4, 5 ESCR1: 6, 7
03h

BPU_fetch_request (BPU = Branch Prediction Unit). This event counts instruction fetch requests of a specified request type by the Branch Prediction unit.

Mask Bits:
  • 0: TCMISS. Trace cache miss.

BPU_ESCR0 BPU_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 0, 1 ESCR1: 2, 3
18h

ITLB_reference. This event counts translations using the Instruction Translation Lookaside Buffer (ITLB).

Mask Bits:
  • 0: HIT. ITLB hit.

  • 1: MISS. ITLB miss.

  • 2: HIT_UC. Uncacheable ITLB hit.

ITLB_ESCR0 ITLB_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 0, 1 ESCR1: 2, 3
02h

memory_cancel. This event counts the canceling of various types of requests in the Data Cache Address Control unit (DAC).

Mask Bits:
  • 2: ST_RB_FULL. Replayed because no Store Buffer is available.

  • 3: 64K_CONF. Conflicts due to 64K aliasing.

DAC_ESCR0 DAC_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
08h

memory_complete. This event counts the completion of a load split, store split, uncacheable (UC) split, or a UC load.

Mask Bits:
  • 0: LSC. Load split completed, excluding UC/WC loads.

  • 1: SSC. Any split stores completed.

SAAT_ESCR0 SAAT_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
04h

load_port_replay. This event counts replayed events at the load port.

Mask Bits:
  • 1: SPLIT_LD. Split load.

SAAT_ESCR0 SAAT_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
05h

store_port_replay. This event counts replayed events at the store port.

Mask Bits:
  • 1: SPLIT_ST. Split store.

SAAT_ESCR0 SAAT_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
03h

MOB_load_replay. This event triggers if the Memory Order Buffer (MOB) caused a load operation to be replayed.

Mask Bits:
  • 1: NO_STA. Replayed because of an unknown store address.

  • 3: NO_STD. Replayed because of unknown store data.

  • 4: PARTIAL_DATA. Replayed because of a partially overlapped data access between the load and store operations.

  • 5: UNALGN_ADDR. Replayed because the lower four bits of the load and store linear addresses do not match.

MOB_ESCR0 MOB_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 0, 1 ESCR1: 2, 3
01h

page_walk_type. This event counts various types of page walks that the Page Miss Handler (PMH) performs.

Mask Bits:
  • 0: DTMISS.

  • 1: ITMISS.

PMH_CR_ESCR0 PMH_CR_ESCR1

ESCR0: 0, 1 ESCR1: 2, 3
0Ch

BSQ_cache_reference. This event counts cache references to the L2 or L3 cache as seen by the bus unit. Specify one or more mask bits to select an access according to the access type (read types include both loads and RFOs; write types include writebacks and evictions) and the access result (hit, miss).

Mask Bits:
  • 0: RD_2ndL_HITS. Read L2 cache hit Shared (includes load and RFO).

  • 1: RD_2ndL_HITE. Read L2 cache hit Exclusive (includes load and RFO).

  • 2: RD_2ndL_HITM. Read L2 cache hit Modified (includes load and RFO).

  • 3: RD_3rdL_HITS. Read L3 cache hit Shared (includes load and RFO).

  • 4: RD_3rdL_HITE. Read L3 cache hit Exclusive (includes load and RFO).

  • 5: RD_3rdL_HITM. Read L3 cache hit Modified (includes load and RFO).

  • 8: RD_2ndL_MISS. Read L2 cache miss (includes load and RFO).

  • 9: RD_3rdL_MISS. Read L3 cache miss (includes load and RFO).

  • 10: WR_2ndL_MISS. A Writeback lookup from the DAC misses the L2 cache (unlikely to happen).

BSU_CR_ESCR0 BSU_CR_ESCR1

ESCR0: 0, 1 ESCR1: 2, 3
03h

IOQ_allocation. This event counts the various types of transactions on the bus. A count is generated each time a transaction is allocated into the IOQ that matches the specified mask bits. An allocated entry can be a sector (64 bytes) or chunks of 8 bytes. Note that requests are counted once per retry. The event is triggered by evaluating the logical expression: (((Request type) OR Bit 5 OR Bit 6) OR (Memory type)) AND (Source agent).

Mask Bits:
  • [0:4]. Bus request type (use 00001 for invalid or default).

  • 5: ALL_READ. Count read entries.

  • 6: ALL_WRITE. Count write entries.

  • 7: MEM_UC. Count UC memory access entries.

  • 8: MEM_WC. Count WC memory access entries.

  • 9: MEM_WT. Count WT memory access entries.

  • 10: MEM_WP. Count WP memory access entries.

  • 11: MEM_WB. Count WB memory access entries.

  • 13: OWN. Count all store requests (RFOs) driven by the processor, as opposed to other processors or device adapters.

  • 14: OTHER. Count all requests driven by other processors or by device adapters.

  • 15: PREFETCH. Include hardware and software prefetch requests in the count.

MSR_FSB_ESCR0

ESCR0: 0, 1
1Ah

IOQ_active_entries. This event counts the number of entries (clipped at 15) in the IOQ that are active. An allocated entry can be a sector (64 bytes) or chunks of 8 bytes. This event must be programmed in conjunction with the IOQ_allocation event.

Mask Bits:
  • [0:4]. Bus request type (use 00001 for invalid or default).

  • 5: ALL_READ. Count read entries.

  • 6: ALL_WRITE. Count write entries.

  • 7: MEM_UC. Count UC memory access entries.

  • 8: MEM_WC. Count WC memory access entries.

  • 9: MEM_WT. Count WT memory access entries.

  • 10: MEM_WP. Count WP memory access entries.

  • 11: MEM_WB. Count WB memory access entries.

  • 13: OWN. Count all store requests (RFOs) driven by the processor, as opposed to other processors or device adapters.

  • 14: OTHER. Count all requests driven by other processors or by device adapters.

  • 15: PREFETCH. Include hardware and software prefetch requests in the count.

MSR_FSB_ESCR1

ESCR1: 2, 3
17h

FSB_data_activity. This event increments once for each DRDY or DBSY assertion that occurs on the FSB. The event allows selection of a specific DRDY or DBSY event.

Mask Bits:
  • 0: DRDY_DRV. Count when this processor drives data onto the FSB (this includes writes and implicit writebacks). Asserted one BCLK cycle for partial writes and four BCLKs (usually in consecutive bus clocks) for full line writes.

  • 1: DRDY_OWN. Count when this processor reads data from the FSB (this includes loads and some PIC transactions; PIC is the interrupt controller). Asserted one BCLK cycle for partial reads and four BCLKs (usually in consecutive bus clocks) for full line reads.

  • 2: DRDY_OTHER. Count when data is on the FSB but not being sampled by the processor. It may or may not be driven by this processor. Asserted one BCLK cycle for partial transactions and four BCLKs (usually in consecutive bus clocks) for full line transactions.

  • 3: DBSY_DRV. Count when this processor reserves the FSB for use in the next transaction in order to drive data. Asserted for two BCLK cycles for full line writes and not at all for partial line writes. May be asserted multiple times (in consecutive BCLKs) if we stall the bus waiting for a cache lock to complete.

  • 4: DBSY_OWN. Count when some agent reserves the FSB for use in the next transaction to drive data that this processor will sample. Asserted for two BCLK cycles for full line writes and not at all for partial line writes. May be asserted multiple times (each one BCLK apart) if we stall the FSB for some reason.

  • 5: DBSY_OTHER. Count when some agent reserves the FSB for use in the next transaction to drive data that this processor will NOT sample. It may or may not be driven by this processor. Asserted two BCLK cycles for partial transactions and four BCLKs (usually in consecutive BCLKs) for full line transactions.

MSR_FSB_ESCR0 MSR_FSB_ESCR1

ESCR0: 0, 1 ESCR1: 2, 3
05h

BSQ_allocation. This event counts allocations in the Bus Sequence Unit (BSQ) according to the specified mask bit encoding.

Mask Bits:
  • [1:0]. REQ_TYPE. Request type encoding:

    - 0 = Read (excludes Memory Read and Invalidate).

    - 1 = Memory Read and Invalidate.

    - 2 = Write (other than writebacks).

    - 3 = Writeback (evicted from cache).

  • [3:2]. REQ_LEN. Request length encoding:

    - 0 = 0 chunks.

    - 1 = 1 chunk.

    - 3 = 8 chunks.

  • 5: REQ_IO_TYPE. Request type is an I/O Read or Write.

  • 6: REQ_LOCK_TYPE. Request type is a locked Read/Modify/Write.

  • 7: REQ_CACHE_TYPE. Request type is cacheable.

  • 8: REQ_SPLIT_TYPE. Request type is an 8-byte chunk split across a qword address boundary.

  • 9: REQ_DEM_TYPE. Request type:

    - 1 = Demand. A read resulted in a sector miss and the sector is being fetched.

    - 0 = A sector fetch due to the hardware prefetch mechanism or a software PREFETCH instruction.

  • 10: REQ_ORD_TYPE. Request is an ordered type (the author isn't sure what this refers to).

  • [13:11]. MEM_TYPE. Memory type encoding:

    - 0 = UC.

    - 1 = USWC.

    - 4 = WT.

    - 5 = WP.

    - 6 = WB.

BSU_ESCR0

Note: This ESCR name begins with “MSR_”

ESCR0: 0, 1
06h

bsq_active_entries. This event represents the number of BSQ entries (clipped at 15) currently active (valid) which meet the mask criteria during allocation in the BSQ. Active request entries are allocated on the BSQ until de-allocated. Deallocation of an entry does not necessarily imply the request is filled. This event must be programmed in conjunction with the BSQ_allocation event.

Mask Bits:
  • [1:0]. REQ_TYPE. Request type encoding:

    - 0 = Read (excludes Memory Read and Invalidate).

    - 1 = Memory Read and Invalidate.

    - 2 = Write (other than writebacks).

    - 3 = Writeback (evicted from cache).

  • [3:2]. REQ_LEN. Request length encoding:

    - 0 = 0 chunks.

    - 1 = 1 chunk.

    - 3 = 8 chunks.

  • 5: REQ_IO_TYPE. Request type is an I/O Read or Write.

  • 6: REQ_LOCK_TYPE. Request type is a locked Read/Modify/Write.

  • 7: REQ_CACHE_TYPE. Request type is cacheable.

  • 8: REQ_SPLIT_TYPE. Request type is an 8-byte chunk split across a qword address boundary.

  • 9: REQ_DEM_TYPE. Request type:

    - 1 = Demand. A read resulted in a sector miss and the sector is being fetched.

    - 0 = A sector fetch due to the hardware prefetch mechanism or a software PREFETCH instruction.

  • 10: REQ_ORD_TYPE. Request is an ordered type (the author isn't sure what this refers to).

  • [13:11]. MEM_TYPE. Memory type encoding:

    - 0 = UC.

    - 1 = USWC.

    - 4 = WT.

    - 5 = WP.

    - 6 = WB.

ESCR1

Note: The Intel® documentation just refers to ESCR1, but this is not a complete ESCR name (which ESCR1?).

ESCR1: 2, 3
03h

x87_assist. This event counts the retirement of x87 instructions that required special handling.

Mask Bits:
  • 0: FPSU. Handle FP stack underflow.

  • 1: FPSO. Handle FP stack overflow.

  • 2: POAO. Handle x87 output overflow.

  • 3: POAU. Handle x87 output underflow.

  • 4: PREA. Handle x87 input assist.

CRU_ESCR2 CRU_ESCR3

Note: These ESCR names begin with “MSR_”

ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
34h

SSE_input_assist. This event counts the number of times an assist is requested to handle problems with input operands for SSE and SSE2 operations, most notably denormal source operands when the DAZ bit is not set. Set bit 15 of the event mask to use this event.

Mask Bits:
  • 15: ALL. Count assists for all SSE and SSE2 μops.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
08h

packed_SP_μop. This event increments for each packed SP FP μop.

Mask Bits:
  • 15: ALL. Count all μops operating on packed SP FP operands.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
0Ch

packed_DP_μop. This event increments for each packed DP FP μop.

Mask Bits:
  • 15: ALL. Count all μops operating on packed DP FP operands.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
0Ah

scalar_SP_μop. This event increments for each scalar SP FP μop.

Mask Bits:
  • 15: ALL. Count all μops operating on scalar SP FP operands.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
0Eh

scalar_DP_μop. This event increments for each scalar DP FP μop.

Mask Bits:
  • 15: ALL. Count all μops operating on scalar DP FP operands.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
02h

64bit_MMX_μop. This event increments for each MMX instruction which operates on 64-bit SIMD operands.

Mask Bits:
  • 15: ALL. Count all μops operating on 64-bit SIMD integer operands in memory or MMX registers.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
1Ah

128bit_MMX_μop. This event increments for each integer SIMD SSE2 instruction which operates on 128-bit SIMD operands.

Mask Bits:
  • 15: ALL. Count all μops operating on 128-bit SIMD integer operands in memory or XMM registers.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
04h

x87_FP_μop. This event increments for each x87 FP μop.

Mask Bits:
  • 15: ALL. Count all x87 FP μops.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
2Eh

x87_SIMD_moves_μop. This event increments for each x87 FP, MMX, SSE or SSE2 μop related to load data, store data, or register-to-register moves. These μops are dispatched to port 0 or port 2 at runtime.

Mask Bits:
  • 3: ALLP0. Count all x87/SIMD store/moves μops.

  • 4: ALLP2. Count all x87/SIMD load μops.

FIRM_ESCR0 FIRM_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 8, 9 ESCR1: 10, 11
02h

machine_clear. This event increments while the entire pipeline of the machine is cleared.

Mask Bits:
  • 0: CLEAR. Counts clock cycles while the machine is cleared for any cause. To just count the incident versus the duration of the incident, use Edge triggering (via the CCCR).

  • 2: MOCLEAR. Increments each time the machine is cleared due to memory ordering issues.

  • 3: SMCLEAR. Increments each time the machine is cleared due to Self Modifying Code (SMC) issues.

CRU_ESCR2 CRU_ESCR3

Note: These ESCR names begin with “MSR_”

ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
05h

global_power_events. This event measures the time during which a processor is not stopped.

Mask Bits:
  • 0: Running. The processor is active (includes the handling of HLT, STPCLK and throttling).

FSB_ESCR0 FSB_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 0, 1 ESCR1: 2, 3
05h

tc_ms_xfer. This event counts the number of times that μop delivery changed from the TC to the Microcode Store ROM.

Mask Bits:
  • 0: CISC. A TC to MS transfer occurred.

MS_ESCR0 MS_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 4, 5 ESCR1: 6, 7
09h

μop_queue_writes. This event counts the number of valid μops written to the μop queue.

Mask Bits:
  • 0: FROM_TC_BUILD. The μops being written to the μop Queue are from TC Build Mode.

  • 1: FROM_TC_DELIVER. The μops being written to the μop Queue are from TC Deliver Mode.

  • 2: FROM_ROM. The μops being written to the μop Queue are from the MS ROM.

MS_ESCR0 MS_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 4, 5 ESCR1: 6, 7
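
Some entries in Table 56-7 pack multi-bit encodings into the Event Mask rather than one flag per bit. The sketch below composes a mask for the BSQ_allocation event using the field positions and encodings listed in that entry; the dictionary keys and the `bsq_event_mask` helper are hypothetical names chosen for illustration.

```python
# Encodings transcribed from the BSQ_allocation entry of Table 56-7.
REQ_TYPE = {"read": 0, "read_invalidate": 1, "write": 2, "writeback": 3}   # bits [1:0]
REQ_LEN = {"0_chunks": 0, "1_chunk": 1, "8_chunks": 3}                     # bits [3:2]
MEM_TYPE = {"UC": 0, "USWC": 1, "WT": 4, "WP": 5, "WB": 6}                 # bits [13:11]
FLAG_BITS = {
    "REQ_IO_TYPE": 5, "REQ_LOCK_TYPE": 6, "REQ_CACHE_TYPE": 7,
    "REQ_SPLIT_TYPE": 8, "REQ_DEM_TYPE": 9, "REQ_ORD_TYPE": 10,
}


def bsq_event_mask(req_type, req_len, mem_type, flags=()):
    """Build the 16-bit Event Mask for BSQ_allocation from named settings."""
    mask = REQ_TYPE[req_type]
    mask |= REQ_LEN[req_len] << 2
    mask |= MEM_TYPE[mem_type] << 11
    for name in flags:
        mask |= 1 << FLAG_BITS[name]
    return mask


# Example: cacheable demand reads of 8 chunks from WB memory.
wb_demand_reads = bsq_event_mask(
    "read", "8_chunks", "WB", ["REQ_CACHE_TYPE", "REQ_DEM_TYPE"]
)
```

The resulting mask (328Ch in this example) would occupy the Event Mask field of the ESCR selected for the counter.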

Table 56-8 on page 1395 describes the Event Classes and corresponding Event Mask bit fields that are currently defined for At-Retirement Event Counting.

Table 56-8. Event Classes and Event Mask Bits for At-Retirement Counting
Event Class | Description | ESCRs that can be used | Counters that must be used
08h

front_end_event. This event counts the retirement of tagged μops (specified by the front-end tagging mechanism).

Mask Bits:
  • 0: NBOGUS. The marked μops are not bogus.

  • 1: BOGUS. The marked μops are bogus.

CRU_ESCR2, CRU_ESCR3

Note: These ESCR names begin with “MSR_”

ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
0Ch

execution_event. This event counts the retirement of tagged μops (specified through the execution tagging mechanism). The event mask allows from one to four types of μops to be specified as either bogus or non-bogus μops to be tagged.

Mask Bits:
  • 0: NBOGUS0. The marked μops are not bogus.

  • 1: NBOGUS1. The marked μops are not bogus.

  • 2: NBOGUS2. The marked μops are not bogus.

  • 3: NBOGUS3. The marked μops are not bogus.

  • 4: BOGUS0. The marked μops are bogus.

  • 5: BOGUS1. The marked μops are bogus.

  • 6: BOGUS2. The marked μops are bogus.

  • 7: BOGUS3. The marked μops are bogus.

For more information, refer to “Execution Tagging” on page 1412.
CRU_ESCR2 CRU_ESCR3

Note: These ESCR names begin with “MSR_”

ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
09h

replay_event. This event counts the retirement of tagged μops (specified through the replay tagging mechanism).

Mask Bits:
  • 0: NBOGUS. The marked μops are not bogus.

  • 1: BOGUS. The marked μops are bogus.

CRU_ESCR2 CRU_ESCR3

Note: These ESCR names begin with “MSR_”

ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
02h

instr_retired. This event counts the instructions that are retired during a clock cycle. Mask bits specify bogus or non-bogus (and whether they are tagged via the front-end tagging mechanism). The event count may vary depending on the microarchitecture state of the processor when the event is enabled. The event may count more than once for some IA32 instructions with complex μop flows that were interrupted before retirement.

Mask Bits:
  • 0: NBOGUSNTAG. Non-bogus instructions that are not tagged.

  • 1: NBOGUSTAG. Non-bogus instructions that are tagged.

  • 2: BOGUSNTAG. Bogus instructions that are not tagged.

  • 3: BOGUSTAG. Bogus instructions that are tagged.

CRU_ESCR0 CRU_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 12, 13, 16 ESCR1: 14, 15, 17
01h

μops_retired. This event counts the μops retired during a clock cycle.

Mask Bits:
  • 0: NBOGUS. The marked μops are not bogus.

  • 1: BOGUS. The marked μops are bogus.

CRU_ESCR0 CRU_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 12, 13, 16 ESCR1: 14, 15, 17
02h

μop_type. This event is used in conjunction with the front-end at-retirement mechanism to tag load and store μops.

Mask Bits:
  • 1: TAGLOADS. The μop is a load operation.

  • 2: TAGSTORES. The μop is a store operation.

RAT_ESCR0 RAT_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 12, 13, 16 ESCR1: 14, 15, 17
05h

retired_mispred_branch_type. This event counts retiring mispredicted branches by type.

Mask Bits:
  • 1: CONDITIONAL. Conditional jumps.

  • 2: CALL. Indirect call branches.

  • 3: RETURN. Return branches.

  • 4: INDIRECT. Returns, indirect calls, or indirect jumps.

TBPU_ESCR0 TBPU_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 4, 5 ESCR1: 6, 7
04h

retired_branch_type. This event counts retiring branches by type.

Mask Bits:
  • 1: CONDITIONAL. Conditional jumps.

  • 2: CALL. Direct or indirect calls.

  • 3: RETURN. Return branches.

  • 4: INDIRECT. Returns, indirect calls, or indirect jumps.

TBPU_ESCR0 TBPU_ESCR1

Note: These ESCR names begin with “MSR_”

ESCR0: 4, 5 ESCR1: 6, 7
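
For at-retirement events, the mask bits typically separate bogus from non-bogus (and tagged from untagged) activity. A minimal sketch of selecting mask bits by name, using the instr_retired entry of Table 56-8, might look like this; `event_mask` is a hypothetical helper.

```python
# Mask bit numbers transcribed from the instr_retired entry of Table 56-8.
INSTR_RETIRED_MASK_BITS = {
    "NBOGUSNTAG": 0,  # non-bogus, not tagged
    "NBOGUSTAG": 1,   # non-bogus, tagged
    "BOGUSNTAG": 2,   # bogus, not tagged
    "BOGUSTAG": 3,    # bogus, tagged
}


def event_mask(bit_numbers, selected):
    """OR together the mask bits named in `selected`."""
    mask = 0
    for name in selected:
        mask |= 1 << bit_numbers[name]
    return mask


# Count only instructions on the correctly predicted path, tagged or not.
non_bogus = event_mask(INSTR_RETIRED_MASK_BITS, ["NBOGUSNTAG", "NBOGUSTAG"])
```

Selecting both non-bogus bits gives a count of committed instructions, while adding the two bogus bits would also include work on mispredicted paths.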

The remaining ESCR bit fields are defined in Table 56-6 on page 1382.

Table 56-6. ESCR Bit Field Definitions
Bit Field | Description
The OS bit | This bit and its description apply only to a processor that is not capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when the processor is executing privilege level 0 code. When both the OS and USR bits are set, the selected events are counted at all privilege levels. If neither the OS nor the USR bit is set, none of the selected events are counted.
The USR bit | This bit and its description apply only to a processor that is not capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when the processor is executing privilege level 1, 2 or 3 code. When both the OS and USR bits are set, the selected events are counted at all privilege levels. If neither the OS nor the USR bit is set, none of the selected events are counted.
T0OS | This bit and its description apply only to a processor that is capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when logical processor 0 is executing privilege level 0 code. When both the T0OS and T0USR bits are set, the selected events on logical processor 0 are counted at all privilege levels. If neither the T0OS nor the T0USR bit is set, none of the selected events are counted on logical processor 0.
T0USR | This bit and its description apply only to a processor that is capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when logical processor 0 is executing privilege level 1, 2 or 3 code. When both the T0OS and T0USR bits are set, the selected events on logical processor 0 are counted at all privilege levels. If neither the T0OS nor the T0USR bit is set, none of the selected events are counted on logical processor 0.
T1OS | This bit and its description apply only to a processor that is capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when logical processor 1 is executing privilege level 0 code. When both the T1OS and T1USR bits are set, the selected events on logical processor 1 are counted at all privilege levels. If neither the T1OS nor the T1USR bit is set, none of the selected events are counted on logical processor 1.
T1USR | This bit and its description apply only to a processor that is capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when logical processor 1 is executing privilege level 1, 2 or 3 code. When both the T1OS and T1USR bits are set, the selected events on logical processor 1 are counted at all privilege levels. If neither the T1OS nor the T1USR bit is set, none of the selected events are counted on logical processor 1.
The Tag Value field | Selects a 4-bit tag value to associate with a μop for use in At-Retirement Event Counting.
The Tag Enable bit | When set to one, enables the tagging of μops for use in At-Retirement Event Counting; when cleared to zero, tagging is disabled.
Event Select field | The 6-bit Event Select field identifies a class of events.
Event Mask field | The 16 bits in the Event Mask field select the specific events within the selected class that are to be counted.

The Counter Configuration Control Registers

General

As mentioned earlier, each of the 18 counters is associated with a specific CCCR. The CCCR controls event counting, event filtering and the interrupt-on-overflow capability.

Each CCCR has the format shown in Figure 56-24 on page 1398, and each of its bit fields is explained in Table 56-9 on page 1399.

Table 56-9. The CCCR Bit Field Definitions
Bit Field | Description
Enable
  • 0 = Counter disabled.

  • 1 = Counter enabled.

This bit is cleared to 0 on reset.
ESCR Select | Selects the ESCR that defines the event(s) to be counted by the counter that is associated with the CCCR.
Bits[17:16]
  • This field is reserved and must be set to 11b in a processor that is not capable of Hyper-Threading.

  • In a Hyper-Threading capable processor, this is the Active Thread field.

    - 00 = Count when neither logical processor is active.

    - 01 = Count when only one logical processor is active.

    - 10 = Count only when both logical processors are active.

    - 11 = Count when either logical processor is active.

A logical processor that is halted or that is in the “wait for SIPI” state is considered inactive.
Compare
  • 0 = Filtering disabled.

  • 1 = Filtering enabled.

The filtering method is selected by the Threshold, Complement, and Edge bit fields. Refer to “The Event Filtering Mechanism” on page 1408 for a detailed description.
Complement | Selects how the incoming event count is compared with the threshold value. This bit is “don't care” if the Compare bit = 0.
  • 0 = The counter is incremented when the per-clock event count is > the specified Threshold value.

  • 1 = The counter is incremented when the per-clock event count is <= the specified Threshold value.

Refer to “The Event Filtering Mechanism” on page 1408 for a detailed description.
Threshold | Specifies the threshold value to be used for comparisons. This field only has meaning if the Compare bit = 1. The processor uses the setting of the Complement bit to determine the type of threshold comparison. The range of values that can be specified depends on the event type. Refer to “The Event Filtering Mechanism” on page 1408 for a detailed description.
Edge
  • 0 = Rising-edge (false-to-true) detection of the threshold comparison output for filtering event counts is disabled.

  • 1 = Rising-edge (false-to-true) detection of the threshold comparison output for filtering event counts is enabled.

This bit is only meaningful when the Compare bit = 1.
Force Overflow
  • 0 = A counter overflow condition occurs only when the counter actually wraps around.

  • 1 = A counter overflow condition is forced each time that the counter is incremented.

PMI bit | This bit only applies to a processor that does not support Hyper-Threading. It is the Interrupt On Overflow enable bit:
  • 0 = The processor will not generate a Performance Monitor interrupt on a counter overflow.

  • 1 = The processor generates a Performance Monitor interrupt each time that the counter overflows.

The interrupt is generated on the next increment after the counter has overflowed.
OVF_PMI_T0 | This bit only applies to a processor that supports Hyper-Threading. It is the Interrupt On Overflow enable bit for logical processor 0:
  • 0 = The processor will not generate a Performance Monitor interrupt on a counter overflow associated with logical processor 0.

  • 1 = The processor generates a Performance Monitor interrupt on a counter overflow associated with logical processor 0.

The interrupt is generated on the next increment after the counter has overflowed.
OVF_PMI_T1 | This bit only applies to a processor that supports Hyper-Threading. It is the Interrupt On Overflow enable bit for logical processor 1:
  • 0 = The processor will not generate a Performance Monitor interrupt on a counter overflow associated with logical processor 1.

  • 1 = The processor generates a Performance Monitor interrupt on a counter overflow associated with logical processor 1.

The interrupt is generated on the next increment after the counter has overflowed.
Cascade
  • 0 = Counter cascading is disabled.

  • 1 = Enables counting on this counter when the counter that is its cascade partner (see Table 56-10 on page 1403) overflows. See “Counter Cascading” on page 1401 for more information. As an example, to have an overflow of counter 0 automatically enable counter 2 to start counting, the programmer sets the Cascade bit = 1 in the CCCR associated with counter 2.

Extended Cascade Enable | This bit is only implemented in the CCCRs associated with Counters 12, 15, 16 and 17, and is only available in Pentium® 4 processors with Family = Fh and a Model number >= 2. Refer to “Extended Cascading” on page 1406.
Overflow | This is the overflow history bit.
  • 0 = The counter has not overflowed.

  • 1 = The counter has overflowed since the last time this bit was cleared by software.


Figure 56-24. The CCCR Format


Table 56-10. The Four Counter Groups
Counter Group | Consists of These Counters | Counter Numbers
BPU group | The Branch Prediction Unit group consists of two pairs of counters:
MSR_BPU_COUNTER0 and MSR_BPU_COUNTER1 | 0 and 1
MSR_BPU_COUNTER2 and MSR_BPU_COUNTER3 | 2 and 3
Counters 0 and 2 are a cascade pair, as are 1 and 3.
MS group | The Microcode Store ROM group consists of:
MSR_MS_COUNTER0 and MSR_MS_COUNTER1 | 4 and 5
MSR_MS_COUNTER2 and MSR_MS_COUNTER3 | 6 and 7
Counters 4 and 6 are a cascade pair, as are 5 and 7.
FLAME group | The FLAME group (the author has no idea what FLAME stands for) consists of:
MSR_FLAME_COUNTER0 and MSR_FLAME_COUNTER1 | 8 and 9
MSR_FLAME_COUNTER2 and MSR_FLAME_COUNTER3 | 10 and 11
Counters 8 and 10 are a cascade pair, as are 9 and 11.
IQ group | The Instruction Queue group consists of:
MSR_IQ_COUNTER0 and MSR_IQ_COUNTER1 | 12 and 13
MSR_IQ_COUNTER2 and MSR_IQ_COUNTER3 | 14 and 15
MSR_IQ_COUNTER4 and MSR_IQ_COUNTER5 | 16 and 17
Because this group has a third counter pair, cascading is handled a little differently:
  • 12 and 14 are a cascade pair, as are 13 and 15.

  • 14 can be cascaded to 16 (but 16 cannot be cascaded to 14).

  • 15 can be cascaded to 17 (but 17 cannot be cascaded to 15).


The CCCRs are all cleared to zero on reset. The events that an enabled counter actually counts are selected and filtered by the following bit fields in the ESCR and CCCR (in the qualification order shown):

  1. First, ESCR[Event Select] and ESCR[Event Mask] select the Event Class and one or more events within the class, respectively.

  2. Then, ESCR[OS] and ESCR[USR] (in an HTT-capable processor, the ESCR T0OS, T0USR, T1OS and T1USR bits) select the privilege levels at which events will be counted.

  3. CCCR[ESCR Select] selects the ESCR that pipes counts to the CCCR for events that have passed steps 1 and 2.

  4. CCCR[Compare], CCCR[Complement] and CCCR[Threshold] then select the optional threshold to be used in qualifying the event count.

  5. CCCR[Edge] allows events to be counted only on rising-edge (false-to-true) transitions.
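
The qualification order above maps onto two register values. The following C fragment is a minimal sketch of how those values might be composed for a processor that is not capable of Hyper-Threading. The bit positions are assumptions taken from Intel's published ESCR and CCCR layouts and should be verified against the ESCR figure and Figure 56-24; the function names make_escr and make_cccr are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; verify against the ESCR and CCCR figures. */
#define ESCR_USR            (1u << 2)
#define ESCR_OS             (1u << 3)
#define ESCR_EVENT_MASK_LSB 9           /* 16-bit Event Mask field  */
#define ESCR_EVENT_SEL_LSB  25          /* 6-bit Event Select field */

#define CCCR_ESCR_SEL_LSB   13          /* 3-bit ESCR Select field  */
#define CCCR_RSVD_11B       (3u << 16)  /* bits 17:16 = 11b without HT */
#define CCCR_COMPARE        (1u << 18)
#define CCCR_COMPLEMENT     (1u << 19)
#define CCCR_THRESHOLD_LSB  20          /* 4-bit Threshold field    */
#define CCCR_EDGE           (1u << 24)

/* Steps 1 and 2: event class, events within the class, privilege levels. */
static uint32_t make_escr(unsigned event_select, unsigned event_mask,
                          int count_os, int count_usr)
{
    assert(event_select < 64 && event_mask <= 0xFFFFu);
    uint32_t v = ((uint32_t)event_select << ESCR_EVENT_SEL_LSB) |
                 ((uint32_t)event_mask << ESCR_EVENT_MASK_LSB);
    if (count_os)
        v |= ESCR_OS;
    if (count_usr)
        v |= ESCR_USR;
    return v;
}

/* Steps 3 to 5: ESCR selection plus optional threshold and edge filters.
 * The Enable bit is deliberately left clear. */
static uint32_t make_cccr(unsigned escr_select, int compare, int complement,
                          unsigned threshold, int edge)
{
    assert(escr_select < 8 && threshold < 16);
    uint32_t v = CCCR_RSVD_11B | ((uint32_t)escr_select << CCCR_ESCR_SEL_LSB);
    if (compare) {
        v |= CCCR_COMPARE | ((uint32_t)threshold << CCCR_THRESHOLD_LSB);
        if (complement)
            v |= CCCR_COMPLEMENT;
        if (edge)
            v |= CCCR_EDGE;
    }
    return v;
}
```

Under these assumptions, make_escr(0x04, 1u << 1, 1, 1) would select Event Class 04h with Event Mask bit 1 at all privilege levels. The filter bits are only applied when Compare is requested, which mirrors the table's statement that Complement, Threshold and Edge are meaningful only when the Compare bit = 1.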

Counter Cascading

The 18 performance counters are organized into nine pairs (see Table 56-10 on page 1403) and each pair of performance counters is associated with a particular subset of events and ESCRs (see Table 56-5 on page 1376).

While the ability to count a particular event type is valuable, there are more complex tests or software tuning scenarios that counter cascading can address quite handily:

  • A counter could be set up to detect when a particular event occurs and then automatically enable another counter to start counting a different event type.

  • A counter could be set up to count a specified number of a particular event type, and then automatically enable another counter to start counting a different event type.

  • Each counter is 40 bits wide. Two counters may be initialized to a count of 0 and both set up to count the same event type. One of the counters is then enabled and starts counting the selected event type. When it overflows, the overflow condition automatically enables the other counter, which then continues counting the same event type.

Referring to Table 56-10 on page 1403, each counter can be cascaded to a counter in another pair within the same group, but not to the other counter in its own pair. As an example, counters 0 and 2 can be cascaded in any order (i.e., 0 to 2, or 2 to 0), as can counters 1 and 3.

As an example, to have an overflow of counter 0 automatically enable counter 2 to start counting, the programmer sets CCCR[Cascade] = 1 (see Figure 56-24 on page 1398) in the CCCR associated with counter 2.
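
The pairing rules in Table 56-10 can be captured in a small lookup. The sketch below returns, for a given counter, the counter whose overflow would enable it when CCCR[Cascade] is set. It reads “14 can be cascaded to 16” as “an overflow of counter 14 enables counter 16”, which is an interpretation of the table rather than something the table states outright. The function name cascade_source is hypothetical, and extended cascading is not modeled.

```c
#include <assert.h>

/* For the counter whose CCCR[Cascade] bit is set, return the counter
 * whose overflow enables it (basic cascading only, per Table 56-10). */
static int cascade_source(int counter)
{
    assert(counter >= 0 && counter < 18);
    if (counter < 16)
        return counter ^ 2;   /* 0<->2, 1<->3, 4<->6, ..., 12<->14, 13<->15 */
    return counter - 2;       /* 16 is fed by 14, 17 is fed by 15 */
}
```

For the example in the text, cascade_source(2) evaluates to 0: the Cascade bit is set in counter 2's CCCR and counter 0 supplies the overflow.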

Interrupt on Overflow

Figure 56-25 on page 1405 illustrates an example wherein the overflow of the second counter then results in the generation of a Performance Monitor interrupt. A counter is enabled to do so by setting the PMI bit = 1 in its CCCR. Refer to “The Performance Counter Overflow Interrupt” on page 1547 for a detailed description of this interrupt.

Figure 56-25. Cascading Counter Example


It should be noted that Pentium® 4 processors with a Model number of 0 or 1 and a Stepping > 09h, as well as processors with a Model number of 2, have an erratum that prevents them from generating a Performance Monitor interrupt if cascading or extended cascading (refer to “Extended Cascading” on page 1406) is enabled.

Extended Cascading

This feature is model-specific. It is only implemented in the CCCRs associated with Counters 12, 15, 16 and 17, and is only available in Pentium® 4 processors with Family = Fh and a Model number >= 2 (the Family and Model are obtained by executing a CPUID request type 1).

  • If bit 11 = 1 in the CCCR associated with counter 12, then counter 16 can cascade to counter 12.

  • If bit 11 = 1 in the CCCR associated with counter 15, then counter 17 can cascade to counter 15.

  • If bit 11 = 1 in the CCCR associated with counter 16, then counter 17 can cascade to counter 16.

  • If bit 11 = 1 in the CCCR associated with counter 17, then counter 16 can cascade to counter 17.

If extended cascading is to be used, the programmer sets CCCR bit 11 to one rather than the Cascade bit.

Accessing the Performance Counters

In the Pentium® 4 processor, the RDPMC (Read Performance Counter) instruction (see “Performance Monitoring” on page 505) was enhanced to allow either the full 40-bit counter or just its lower 32 bits to be read. The 32-bit read is faster and can be used when the count is small enough to be contained in 32 bits.

The CR4[PCE] bit permits the OS to limit access to the Performance Counters:

  • CR4[PCE] = 0. Only privilege level 0 code (i.e., the OS kernel) can execute the RDPMC instruction to read a performance counter's contents. An exception is generated whenever a less-privileged program attempts execution of the RDPMC instruction.

  • CR4[PCE] = 1. Any program can execute the RDPMC instruction to read a performance counter's contents.

Just like the RDTSC instruction (see “RDTSC Is Not a Serializing Instruction” on page 499), the RDPMC instruction is not a serializing instruction (see “Serializing Instructions” on page 1079). It can be executed out-of-order by the processor core and therefore may yield an inaccurate reading. It can be used in conjunction with the CPUID instruction (which is a serializing instruction) to obtain an accurate count.
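
As a rough sketch of the points above, the following C fragment builds the counter index passed to RDPMC in ECX and reassembles the 40-bit result from EDX:EAX. The use of ECX bit 31 to request the fast 32-bit read is an assumption based on Intel's instruction description and is not stated in this text; all function names are hypothetical. The guarded routine issues CPUID immediately before RDPMC as a serializing step. It is only compiled for GCC-compatible x86 compilers, and it faults if executed at a privilege level other than 0 while CR4[PCE] = 0.

```c
#include <assert.h>
#include <stdint.h>

#define RDPMC_FAST_READ (UINT32_C(1) << 31)  /* assumed: ECX[31] = 32-bit read */

/* Build the ECX value for RDPMC from a counter number (0 to 17). */
static uint32_t rdpmc_ecx(unsigned counter, int fast_32bit)
{
    assert(counter < 18);
    return (uint32_t)counter | (fast_32bit ? RDPMC_FAST_READ : 0u);
}

/* Join EDX:EAX into one 40-bit count; only EDX[7:0] is significant. */
static uint64_t combine_rdpmc(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)(edx & 0xFFu) << 32) | eax;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Serialize with CPUID, then read the counter. */
static inline uint64_t read_counter_serialized(uint32_t ecx_index)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    __asm__ volatile("rdpmc" : "=a"(eax), "=d"(edx) : "c"(ecx_index));
    (void)ebx;
    return combine_rdpmc(eax, edx);
}
#endif
```

A caller would typically use read_counter_serialized(rdpmc_ecx(16, 0)) before and after the code fragment being measured and subtract the two readings.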

In some cases, the programmer may want to preload a count into a counter prior to enabling the counter (so the counter will generate an overflow when that count has been achieved). To do so, enter the start count as a twos-complement negative integer in the counter. The counter then counts from the start value up to -1 and then overflows. A counter is written to using the WRMSR instruction and all 40 bits are written simultaneously.
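
The preload arithmetic can be illustrated with a short helper. This is a sketch only: it computes the 40-bit twos-complement value for a desired number of events and does not perform the WRMSR itself. The name preload_value is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define COUNTER_BITS 40
#define COUNTER_MASK ((UINT64_C(1) << COUNTER_BITS) - 1)

/* Value to write into a 40-bit counter so that it wraps to zero after
 * `events` increments: the twos-complement negation of `events`,
 * truncated to 40 bits. */
static uint64_t preload_value(uint64_t events)
{
    assert(events >= 1 && events <= COUNTER_MASK);
    return (0 - events) & COUNTER_MASK;
}
```

For example, preload_value(1000) yields FFFFFFFC18h; adding 1000 to that value and truncating to 40 bits gives zero, which is the wrap-around point.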

Halting Event Counting

After being started, a counter continues counting. If and when the counter overflows, it wraps around and continues counting. When the counter wraps around, it sets the CCCR[OVF] bit to indicate that the counter has overflowed. To halt counting, clear CCCR[ENABLE] to 0 (see Figure 56-24 on page 1398).

To halt a cascaded counter (i.e., a counter that was automatically enabled when another counter overflowed), take one of the following actions:

  • Clear CCCR[Cascade] to 0 in the cascaded counter's CCCR.

  • Clear CCCR[OVF] to 0 in the other counter's CCCR.

Non-Retirement Event Counting

Introduction

While At-Retirement Counting (see “At-Retirement Event Counting” on page 1409) only counts events related to μops along the correctly predicted path, Non-Retirement Event Counting counts all events of the specified type (even if the event is associated with a μop that resides on a mispredicted path and is therefore not ultimately retired).

Table 56-7 on page 1385 provides a listing of the Non-Retirement events by Event Class and Event Mask. These events can be counted using either of the following two counting methods:

  • A running event count. A counter is set up to count one or more event types. Software periodically reads the counter to determine how many of the selected event type(s) have occurred since the last time the counter was read.

  • Interrupt on counter overflow. Intel® documentation refers to this as Non-Precise Event-based Sampling (NPEBS if you will):

    - A counter is set up to count one or more event types.

    - It is preset with an initial count.

    - It is enabled to generate a Performance Counter interrupt when the counter has an overflow condition.

    - When the counter overflows, the handler is called.

    - The interrupt handler records the Return Instruction Pointer (RIP), resets the count to its initial programmed value, restarts the counter, and returns to the interrupted program.

Non-Retirement events may not be counted using the PEBS mechanism.

The Set Up

In order to program the Performance Monitoring logic to count one or more Non-Retirement events, the programmer first decides which event or events are to be counted (see Table 56-7 on page 1385). For each desired event type, the following general series of steps are then accomplished:

  1. In the “ESCR” column of Table 56-7 on page 1385, the programmer selects an ESCR that supports the desired event type.

  2. The programmer finds the CCCR and Counter associated with the selected ESCR using Table 56-5 on page 1376.

  3. The programmer sets up the ESCR for the specific event or events to be counted and the privilege levels they are to be counted at.

  4. The programmer sets up the CCCR:

    - Selects the ESCR.

    - Optionally selects the desired event filters (see “The Event Filtering Mechanism” on page 1408).

    - Optionally enables counter cascading (see “Counter Cascading” on page 1401).

    - Optionally enables the generation of the Performance Monitor interrupt (PMI) when the counter overflows. If this feature is enabled, the Local APIC's PMI LVT entry (see “The Performance Counter Overflow Interrupt” on page 1547) must also be set up and a PMI handler must be in place.

  5. In the CCCR, the programmer then enables the counter to begin counting.
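
The ordering of the last three steps can be sketched as a sequence of MSR writes. In the fragment below, wrmsr_stub is a hypothetical stand-in that records each write instead of executing the privileged WRMSR instruction, so the sequence can be inspected in an ordinary program. The MSR addresses are parameters because they depend on the ESCR and CCCR chosen in steps 1 and 2, and the Enable bit position (bit 12) is an assumption to be checked against Figure 56-24.

```c
#include <assert.h>
#include <stdint.h>

#define CCCR_ENABLE (UINT64_C(1) << 12)   /* assumed Enable bit position */

/* Hypothetical WRMSR stand-in: logs writes for inspection. */
struct msr_write { uint32_t msr; uint64_t value; };
static struct msr_write msr_log[8];
static int msr_log_len;

static void wrmsr_stub(uint32_t msr, uint64_t value)
{
    assert(msr_log_len < 8);
    msr_log[msr_log_len].msr = msr;
    msr_log[msr_log_len].value = value;
    msr_log_len++;
}

/* Step 3: program the ESCR. Step 4: program the CCCR with the counter
 * still disabled. Step 5: set Enable as the final write. */
static void start_counting(uint32_t escr_msr, uint64_t escr_value,
                           uint32_t cccr_msr, uint64_t cccr_value)
{
    wrmsr_stub(escr_msr, escr_value);
    wrmsr_stub(cccr_msr, cccr_value & ~CCCR_ENABLE);
    wrmsr_stub(cccr_msr, cccr_value | CCCR_ENABLE);
}
```

Writing the CCCR twice keeps the counter idle until the event selection and filters are fully in place, so no events are counted against a half-programmed configuration.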

The Event Filtering Mechanism
Introduction

During a given clock cycle, the Performance Monitoring logic may detect multiple instances of an event selected for counting. When the event filtering mechanism is disabled, the actual event count in a given clock cycle is added to the counter that has been set up to count the respective event.

It may be, however, that the programmer only wants to increment the counter (by one) when the per clock count is > or <= a threshold value that has been specified in the CCCR associated with the counter.

Threshold Comparison

To utilize this capability, the programmer uses the following CCCR fields:

- The CCCR[Compare] bit must be set to one.

- CCCR[Threshold] is set to the threshold count value. It is a 4-bit field, so the Threshold value can be between 0 and 15d.

- The CCCR[Complement] bit is set to the appropriate value:

  - Complement = 0: A per clock event count > the Threshold value results in the counter being incremented by one (NOT by the actual event count detected in the clock cycle).

  - Complement = 1: A per clock event count <= the Threshold value results in the counter being incremented by one (NOT by the actual event count detected in the clock cycle).

As an example, if Complement = 0 and Threshold = 6, a count of 7 or greater in a clock cycle causes the counter to be incremented by one, while a count less than 7 results in the counter not being incremented in that clock cycle. If Complement = 1, a count value in a given clock cycle from 0 to 6 causes the counter to be incremented by one, while any count value from 7 to 15 results in the counter not being incremented in that clock cycle.
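
The comparison described above can be modeled in a few lines of C. This is a behavioral sketch of the filter as the text describes it, not a description of the hardware implementation; the function name filtered_increment is hypothetical.

```c
#include <assert.h>

/* Amount added to the counter in one clock cycle. */
static unsigned filtered_increment(unsigned events_this_clock, int compare,
                                   int complement, unsigned threshold)
{
    assert(threshold < 16);
    if (!compare)
        return events_this_clock;                /* filtering disabled */
    if (complement)
        return events_this_clock <= threshold;   /* 1 when count <= Threshold */
    return events_this_clock > threshold;        /* 1 when count > Threshold  */
}
```

With Threshold = 6 and Complement = 0, a per clock count of 7 yields an increment of one and a count of 6 yields none; with Complement = 1 the two outcomes are reversed.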

The Threshold Condition Transition Filter

It may be that the programmer only wants to increment a counter by one when the condition being measured by the Threshold comparison is false in one clock and then true in the following clock. This capability is enabled by setting CCCR[Edge] = 1.
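
The transition filter needs one bit of state: whether the threshold condition was true in the previous clock. The sketch below models that behavior under the assumption that the condition starts out false; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct edge_filter { int prev_true; };   /* condition in the previous clock */

/* Returns 1 only when the threshold condition goes from false to true. */
static unsigned edge_increment(struct edge_filter *f,
                               unsigned events_this_clock,
                               int complement, unsigned threshold)
{
    assert(f != NULL && threshold < 16);
    int now_true = complement ? (events_this_clock <= threshold)
                              : (events_this_clock > threshold);
    unsigned inc = (now_true && !f->prev_true) ? 1u : 0u;
    f->prev_true = now_true;
    return inc;
}
```

Feeding per clock counts of 0, 7, 8, 2 and 9 through this model with Threshold = 6 and Complement = 0 produces increments of 0, 1, 0, 0 and 1: the counter records how many times the condition became true, not how many clocks it stayed true.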

At-Retirement Event Counting

First, Some Terminology
Bogus, Non-Bogus, Retire

In Table 56-8 on page 1395, the term “bogus” refers to IA32 instructions or μops that are canceled because they are on a program path taken due to a mispredicted branch. The terms “retired” and “non-bogus” refer to IA32 instructions or μops along a correctly-predicted path.

Tagging

The same event can happen to a μop more than once while it is being executed. Just counting the number of times the event occurred would not indicate how many μops experienced that event. A μop can be tagged once during its lifetime and counted once at retirement. In the Intel® documentation, the “retired” suffix is appended to performance metrics that increment a count once per μop, rather than once per event.

As an example, a μop could experience a cache miss more than once while it is being executed, but a “Miss Retired” event (counts the number of retired μops that experienced a cache miss) only increments once for that μop. This can be used to measure the performance of the cache hierarchy for a code fragment.

Replay

To achieve the optimum performance when dealing with commonly-encountered cases, the schedulers sometimes schedule μops for execution before all of the conditions necessary for correct execution are guaranteed to be satisfied. The hope is that by the time the μop is actually dispatched for execution, the condition(s) will have been met.

If they have not, the μop must be re-issued and this is referred to as Replay. Some causes of replays are:

- Cache misses.

- Dependence violations (e.g., store forwarding problems; see “Store-to-Load Forwarding” on page 1070 for more information).

- Unforeseen resource constraints.

The processor will always experience some replays, but too many replays indicate that the code in question should be tuned.

Assist

When the hardware needs the assistance of microcode (supplied from the MS ROM) to deal with an event, the machine is said to “take an assist”. An example would be an underflow condition in the input operands of a FP operation. In this case, the processor must modify the operand format to perform the computation. Assists clear the entire machine of μops before they begin and therefore cause a severe dip in performance.

General

At-Retirement Counting only counts events related to instructions along the correctly predicted path. If a performance counter had been set up to count all executed instructions, the count would also include events related to instructions that were executed along a mispredicted path.

Using the tagging mechanism, a counter can be set up to count only the events related to instructions along the correctly predicted path. This is referred to as At-Retirement Counting.

Table 56-8 on page 1395 lists the At-Retirement Event Classes as well as the events within each class.

The Tagging Mechanisms
Introduction

A counter can count each incident of an event, or the number of μops that experienced the event. A μop may be tagged when it encounters some of the events listed in Table 56-8 on page 1395. The tagging mechanisms can be used in Non-Precise Event-Based Sampling (NPEBS; see “There Are Three Sampling Methods” on page 1374), and some of them can also be used in Precise Event-Based Sampling (PEBS; see “There Are Three Sampling Methods” on page 1374). There are four mechanisms, the last of which does not actually use tags:

- Front-End tagging. This mechanism tags μops that experience front-end-related events (Trace Cache and μop decode-related events). They are counted using the “Front_end_event” event.

- Execution tagging. This mechanism tags μops that experience execution-related events (e.g., instruction types). They are counted using the “Execution_Event” event.

- Replay tagging. This mechanism tags μops that must be replayed (e.g., a cache miss) as well as mispredicted branches. They are counted using the “Replay_event” event.

- No tags. This mechanism does not use tags. Rather, it counts retired IA32 instructions using the “Instr_retired” event, and retired μops using the “μops_retired” event.

Multi-Tagging

The tagging mechanisms are independent of each other. A μop tagged using one mechanism is not detected by another mechanism's tagged-μop detector. As an example, a μop tagged by the Front-End tagging mechanism is not counted by the “Execution_Event” unless it was also tagged by the Execution tagging mechanism. It should be noted that execution tags allow up to four different types of μops to be counted at retirement.

PEBS and Multi-Tagging

When using PEBS, however, only one tagging mechanism can be used at a time.

Some μops Cannot Be Tagged

The following μops cannot be tagged: IO, uncacheable accesses, locked accesses, Return μops, far jumps, and far calls.

Front-End Tagging

The Front_end_event counts μops with tags that indicate they have experienced any of the following events:

- μop Decode events. Tagging a μop when it encounters a μop decode event requires specifying bits in the ESCR associated with the μop_type event class.

- Trace Cache events. Tagging a μop when it encounters a Trace Cache event is accomplished by a bit in the ESCR associated with the μop_type event class.

The Front_end_event is defined in Table 56-8 on page 1395. None of the events currently supported requires the use of the MSR_TC_Precise_Event MSR, but some may in the future.

Execution Tagging

The Execution_event is defined in Table 56-8 on page 1395.

The execution tagging mechanism uses two ESCRs:

- One upstream ESCR specifies the event to detect and assigns a 4-bit Tag (in the ESCR's Tag field) to identify that event. This ESCR must have its Tag Enable bit = 1. The 4-bit Tag is actually a mask that specifies which tag bit(s) should be set for a particular μop. The Tag mask must match the Event Mask bit setting in the downstream ESCR. For example, if the Tag value in the upstream ESCR is 1h (a mask value of 0001b), then the Event Mask field in the downstream ESCR should be set as follows (see the Execution_event class in Table 56-8 on page 1395):

  - Bit 0, the NBOGUS0 bit, = 1b.

  - Bit 1, the NBOGUS1 bit, = 0b.

  - Bit 2, the NBOGUS2 bit, = 0b.

  - Bit 3, the NBOGUS3 bit, = 0b.

- The second, downstream ESCR is used to detect μops with that Tag value using the Execution_event class in the ESCR's Event Select field. This ESCR's Event Mask bits specify which tag bits accompanying a μop to count. If any of the tag bits that accompany a μop select a mask bit that = 1, the related counter is incremented by one. If more than one mask bit is selected by the bits in the μop's tag, the counter is incremented once for each matching bit. The Tag Enable and Tag value in the downstream ESCR are “don't care”.
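
The matching rule between a μop's tag bits and the downstream Event Mask can be modeled as a bitwise AND followed by a count of the surviving bits. The sketch below follows the description above (one increment per matching bit) and considers only the four NBOGUS mask bits; the function name is hypothetical.

```c
#include <assert.h>

/* Counter increment for one retiring uop: one count for each tag bit
 * that is also set in the downstream ESCR's NBOGUS0..NBOGUS3 mask bits. */
static unsigned execution_tag_increment(unsigned uop_tag, unsigned nbogus_mask)
{
    assert(uop_tag < 16 && nbogus_mask < 16);
    unsigned match = uop_tag & nbogus_mask;
    unsigned n = 0;
    while (match) {
        n += match & 1u;
        match >>= 1;
    }
    return n;
}
```

A μop tagged 0001b is counted once when NBOGUS0 is set and not at all when only NBOGUS1 is set, which is why the upstream Tag value and the downstream mask have to agree.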

The author is puzzled by the fact that eight Event Mask bits (rather than four) are shown in the Execution_Event class entry of Table 56-8 on page 1395.

Replay Tagging

This mechanism tags μops that must be replayed (e.g., a cache miss) as well as mispredicted branches. They are counted using the “Replay_event” event. The Replay_event is defined in Table 56-8 on page 1395. Replay tagging is enabled with the μop_Tag bit (bit 24) in the IA32_PEBS_ENABLE MSR.

The Replay tagging mechanism requires selecting:

- The type of μop that may experience the replay in the MSR_PEBS_MATRIX_VERT MSR.

- The type of event in the IA32_PEBS_ENABLE MSR.

Table 56-11 on page 1413 lists the information used to set up a counter to count Replay events. The setup information in this table enables Precise Event-Based Sampling (see “There Are Three Sampling Methods” on page 1374). Non-Precise Event-Based Sampling can be used by not setting bits 24 or 25 in the IA32_PEBS_ENABLE MSR (see Figure 56-26 on page 1417).

Table 56-11. Replay Tagging Setup
Replay Event | IA32_PEBS_ENABLE bits to set | MSR_PEBS_MATRIX_VERT bits to set | Additional Setup Info | Event Mask Value
1stL_cache_load_miss_retired | 0, 24 and 25 | 0 | None | NBOGUS
2ndL_cache_load_miss_retired | 1, 24 and 25 | 0 | None | NBOGUS
DTLB_load_miss_retired | 2, 24 and 25 | 0 | None | NBOGUS
DTLB_store_miss_retired | 2, 24 and 25 | 1 | None | NBOGUS
DTLB_all_miss_retired | 2, 24 and 25 | 0 and 1 | None | NBOGUS
MOB_load_replay_retired | 9, 24 and 25 | 0 | In the ESCR, select the MOB_load_replay event and set the PARTIAL_DATA and UNALGN_ADDR Event Mask bits. | NBOGUS
split_load_retired | 10, 24 and 25 | 0 | In MSR_SAAT_ESCR1, select the load_port_replay event and set the SPLIT_LD mask bit. | NBOGUS
split_store_retired | 10, 24 and 25 | 1 | In MSR_SAAT_ESCR0, select the store_port_replay event and set the SPLIT_ST mask bit. | NBOGUS
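
Each row of Table 56-11 names bit numbers rather than register values. As a small illustration, the helper below turns a list of bit numbers into the value that would be written to the corresponding MSR. The helper name bits_to_value is hypothetical, and the bit lists are taken directly from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 64-bit register value from a list of bit numbers. */
static uint64_t bits_to_value(const int *bits, int count)
{
    uint64_t v = 0;
    for (int i = 0; i < count; i++) {
        assert(bits[i] >= 0 && bits[i] < 64);
        v |= UINT64_C(1) << bits[i];
    }
    return v;
}
```

For the 1stL_cache_load_miss_retired row, bits 0, 24 and 25 give an IA32_PEBS_ENABLE value of 03000001h and bit 0 gives an MSR_PEBS_MATRIX_VERT value of 1h. For DTLB_all_miss_retired, bits 0 and 1 give an MSR_PEBS_MATRIX_VERT value of 3h.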

Precise Event-Based Sampling

General

The Debug Store (DS) mechanism was introduced in the Pentium® 4 processor. A complete description can be found in “The Debug Store (DS) Mechanism” on page 1366. The processor can be set up to automatically store both PEBS records and BTS records in the DS save area in memory.

What the Intel® documentation refers to as Precise Event-Based Sampling (PEBS) could more aptly be named Automatic state save on counter overflow. It works as follows:

  • A counter is set up to count one or more event types within the same Event Class.

  • It is preset with an initial count.

  • It is enabled to store an event record in the Debug Store (DS) memory area each time that the counter has an overflow condition.

  • When the counter overflows, the processor automatically copies the contents of the GPRs, the EFlags register and EIP into an event record in the DS memory area.

  • The processor then automatically resets the count to its initial programmed value and resumes counting.

  • The processor then resumes execution of the program.

  • When the DS save area is approaching a full condition, a Performance Counter interrupt is generated and the event records currently in the DS save area can be saved to non-volatile memory (e.g., to disk). A circular DS save buffer is not supported for event records.

Automatic state save on counter overflow (i.e., PEBS) is only supported for the following Event Classes within the At-Retirement Event category:

  • The Execution_event.

  • The Front_end_event.

  • The Replay_event.

Limited To a Single Counter

PEBS can only be performed using Counter 16.

Detecting the PEBS Capability

The programmer can determine that a processor supports PEBS by executing a CPUID request type 1 and verifying that EDX[21] = 1. This indicates that the processor supports the DS feature. The programmer then verifies that IA32_MISC_ENABLE[PEBS_UNAVAILABLE] = 0 (see Figure 56-21 on page 1373).
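
The two-part check can be expressed as a pure function over values that have already been read. In this sketch the caller is assumed to have obtained EDX from a CPUID request type 1 and the contents of IA32_MISC_ENABLE from a RDMSR executed at privilege level 0. The position of the PEBS_UNAVAILABLE bit (bit 12) is an assumption that should be confirmed against Figure 56-21; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPUID1_EDX_DS                (UINT32_C(1) << 21)
#define MISC_ENABLE_PEBS_UNAVAILABLE (UINT64_C(1) << 12)  /* assumed position */

struct pebs_probe {
    uint32_t cpuid1_edx;        /* EDX returned by CPUID request type 1 */
    uint64_t ia32_misc_enable;  /* contents of the IA32_MISC_ENABLE MSR */
};

/* Returns 1 if the DS feature is present and PEBS is not marked
 * unavailable, 0 otherwise. */
static int pebs_supported(const struct pebs_probe *p)
{
    assert(p != NULL);
    if (!(p->cpuid1_edx & CPUID1_EDX_DS))
        return 0;
    return (p->ia32_misc_enable & MISC_ENABLE_PEBS_UNAVAILABLE) == 0;
}
```

Separating the decision from the privileged reads keeps the logic testable in user mode; only the code that fills in the struct needs to run in the kernel.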

Enabling PEBS

Setting IA32_PEBS_ENABLE[24] (see Figure 56-26 on page 1417) enables the processor's PEBS capability. The reader should also note that the DS capability must have been configured (see “The Debug Store (DS) Mechanism” on page 1366).

The PEBS Interrupt Handler

As mentioned earlier, the processor will generate an interrupt when the PEBS memory buffer in the DS save area is approaching a full condition. See “The Debug Store (DS) Mechanism” on page 1366 for more information.

Sometimes, the DS Feature Is Disabled

The processor automatically disables the DS feature under the following circumstances:

  • On transitioning to SMM.

  • When a Machine Check Exception occurs.

  • When RESET# or INIT# is asserted.

PEBS and Hyper-Threading

In a processor that supports Hyper-Threading, PEBS is enabled and qualified using the following two bits in the IA32_PEBS_ENABLE MSR (this register is replicated for each of the logical processors; see Figure 56-26 on page 1417):

  • Bit 25. ENABLE_PEBS_MY_THR and

  • Bit 26. ENABLE_PEBS_OTH_THR.

Software executing on a logical processor uses these two bits to enable PEBS for subsequent threads of execution:

  • On the same logical processor on which the software is running (“my thread”) or

  • For the other logical processor in the physical package (“other thread”).

PEBS can be used only with two performance counters:

  • Counter 16 (MSR_IQ_CCCR4 MSR) for logical processor 0.

  • Counter 17 (MSR_IQ_CCCR5 MSR) for logical processor 1.

Additional information regarding PEBS on a Hyper-Threading capable processor can be found in section 15.10.4, Performance Monitoring Events, in Intel®'s IA32 Intel® Architecture Software Developer's Manual, Volume 3: System Programming Guide.

Counting Clocks

Introduction

Processor clock cycles are referred to as clockticks and can be used to measure how long a program takes to execute, as well as to derive efficiency measurements such as Cycles Per Instruction (CPI).

There are three processor clock cycle measurements:

  • Non-Halted Clockticks. This measurement counts the clock cycles during which the specified logical processor is not halted and is not in any power-saving state. If Hyper-Threading is enabled, this measurement can be performed on a per logical processor basis. This measurement is taken using a Performance Counter and can be set up to cause an interrupt upon overflow. The processor clock is stopped under the following circumstances:

    - When the processor enters the Sleep power conservation state (see “The Sleep State” on page 692).

    - When the processor enters the Deep Sleep power conservation state (see “The Deep Sleep State” on page 693).

    See “The Non-Halted Clockticks Measurement” on page 1418 for more information.

  • Non-Sleep Clockticks. This measurement counts the clock cycles during which the physical processor is not in the Sleep mode. This measurement cannot be taken on a per logical processor basis. This measurement is taken using a Performance Counter and can be set up to cause an interrupt upon overflow. See “The Non-Sleep Clockticks Measurement” on page 1419 for more information.

  • Time Stamp Counter. This measurement counts the clock cycles during which the physical processor is not in Deep Sleep state. These ticks cannot be measured on a per logical processor basis.

For applications wherein the processor is halted during some periods, there are two ratios of interest:

  • Non-Halted CPI: The Non-Halted Clockticks Per Instructions Retired ratio measures the CPI only during non-halted periods of time (i.e., the processor is actually executing code). This ratio can be measured on a per logical processor basis when Hyper-Threading Technology is enabled.

  • Nominal CPI: The TSC Ticks Per Instructions Retired ratio measures the CPI over the full period of a program, including those periods of time while the processor is halted.
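The two ratios differ only in which clock count is used as the numerator. The sketch below makes that explicit; the function name is hypothetical and the counter readings are made-up numbers chosen purely for illustration, not measurements.

```python
def cycles_per_instruction(clockticks, instructions_retired):
    """Return clockticks divided by instructions retired (CPI)."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


# Illustrative (invented) readings for one program run.
instructions_retired = 2_000_000
non_halted_clockticks = 3_000_000  # cycles while the processor was executing code
tsc_ticks = 5_000_000              # cycles over the full run, including halted periods

non_halted_cpi = cycles_per_instruction(non_halted_clockticks, instructions_retired)
nominal_cpi = cycles_per_instruction(tsc_ticks, instructions_retired)
```

With these sample numbers the Nominal CPI is larger than the Non-Halted CPI, because the TSC count also covers the cycles during which the processor was halted.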

The Non-Halted Clockticks Measurement

As mentioned earlier, this measurement is taken using a Performance Counter in the following manner:

  1. In an ESCR, select the global_power_events Event Class, set the RUNNING Event Mask bit, and also set the appropriate mask bits (T0_OS, T0_USR, T1_OS, T1_USR) for the targeted processor.

  2. Set the ESCR Select field in a CCCR to select that ESCR.

  3. Enable counting in the CCCR for that counter by setting the Enable bit.

If Hyper-Threading is enabled, the count may include some portion of the clock cycles for that logical processor to complete a transition to a halted state.

If Hyper-Threading is enabled and both logical processors execute the HLT instruction, the physical processor enters the AutoHalt Power Down power conservation state (see “The AutoHalt Power Down State” on page 686).
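The three setup steps in this section can be sketched as two register values, one for the ESCR and one for the CCCR. Every field position below is an assumption that should be checked against the ESCR and CCCR figures before use, as are the example encodings passed in at the bottom (0x13 for the global_power_events Event Select and 6 for the matching ESCR Select). The function names are hypothetical, and all fields not named in the steps are simply left at zero.

```python
# Assumed ESCR field layout (verify against the ESCR figure).
ESCR_T1_USR = 1 << 0
ESCR_T1_OS = 1 << 1
ESCR_T0_USR = 1 << 2
ESCR_T0_OS = 1 << 3
ESCR_EVENT_MASK_SHIFT = 9      # assumed start of the Event Mask field
ESCR_EVENT_SELECT_SHIFT = 25   # assumed start of the 6-bit Event Select field
RUNNING = 1 << 0               # assumed Event Mask bit for RUNNING

# Assumed CCCR field layout (verify against the CCCR figure).
CCCR_ENABLE = 1 << 12
CCCR_ESCR_SELECT_SHIFT = 13    # assumed start of the 3-bit ESCR Select field


def non_halted_escr(event_select, logical_processor):
    """Step 1: select the event class, set RUNNING, and set the
    OS and USR mask bits for the targeted logical processor (0 or 1)."""
    if not 0 <= event_select < 64:
        raise ValueError("event_select must fit in 6 bits")
    if logical_processor == 0:
        thread_bits = ESCR_T0_OS | ESCR_T0_USR
    elif logical_processor == 1:
        thread_bits = ESCR_T1_OS | ESCR_T1_USR
    else:
        raise ValueError("logical_processor must be 0 or 1")
    return ((event_select << ESCR_EVENT_SELECT_SHIFT)
            | (RUNNING << ESCR_EVENT_MASK_SHIFT)
            | thread_bits)


def non_halted_cccr(escr_select):
    """Steps 2 and 3: point the CCCR at the ESCR and set the Enable bit."""
    if not 0 <= escr_select < 8:
        raise ValueError("escr_select must fit in 3 bits")
    return (escr_select << CCCR_ESCR_SELECT_SHIFT) | CCCR_ENABLE


# Example encodings are assumptions, not values taken from this section.
escr_value = non_halted_escr(event_select=0x13, logical_processor=0)
cccr_value = non_halted_cccr(escr_select=6)
```

Keeping the encodings as parameters rather than constants makes it easy to substitute the values from the event tables once they have been confirmed.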

The Non-Sleep Clockticks Measurement

As mentioned earlier, this measurement is taken using a Performance Counter in the following manner:

  1. Choose a counter to use for the measurement.

  2. Choose an ESCR associated with that counter.

  3. Set that ESCR's Event Select field to any Event Class other than the “no_event” Event Class.

  4. Set CCCR[Compare] = 1.

  5. Set CCCR[Threshold] = 15d.

  6. Set CCCR[Complement] = 1.

  7. Set CCCR[Enable] = 1 to enable the counter.

The settings in steps 4 through 6 cause the counter to count every cycle. Note that this overrides any other qualifications (e.g., by CPL) that may be specified in the ESCR.

The counter continues to increment as long as at least one logical processor is still running.
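The following sketch illustrates why the Compare, Threshold and Complement settings make the counter advance on every cycle. It rests on two assumptions that should be verified against the CCCR figure: the bit positions used to build the register value, and the comparison rule that, with Complement set, the counter increments when the number of events in a cycle is less than or equal to the threshold. The function names are hypothetical.

```python
# Assumed CCCR field layout (verify against the CCCR figure).
CCCR_ENABLE = 1 << 12
CCCR_ESCR_SELECT_SHIFT = 13   # assumed start of the 3-bit ESCR Select field
CCCR_COMPARE = 1 << 18
CCCR_COMPLEMENT = 1 << 19
CCCR_THRESHOLD_SHIFT = 20     # assumed start of the 4-bit Threshold field


def non_sleep_cccr(escr_select):
    """Build a CCCR value with Compare = 1, Threshold = 15,
    Complement = 1 and Enable = 1."""
    if not 0 <= escr_select < 8:
        raise ValueError("escr_select must fit in 3 bits")
    return ((escr_select << CCCR_ESCR_SELECT_SHIFT)
            | CCCR_COMPARE
            | CCCR_COMPLEMENT
            | (15 << CCCR_THRESHOLD_SHIFT)
            | CCCR_ENABLE)


def counter_increments(events_this_cycle, threshold, complement):
    """Model of the assumed threshold comparison for one clock cycle."""
    if complement:
        return events_this_cycle <= threshold
    return events_this_cycle > threshold


# A 4-bit threshold of 15 is never exceeded by a per-cycle count of 0..15,
# so under the assumed rule the condition holds on every cycle.
counts_every_cycle = all(counter_increments(n, 15, True) for n in range(16))
```

Under this model the selected event no longer matters, which is consistent with the requirement that the ESCR select any event class other than no_event.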

The Time Stamp Counter

The TSC continues to increment unless one of the following is true:

  • RESET# is asserted.

  • The processor enters the Sleep power conservation state.

  • The processor enters the Deep Sleep power conservation state.

The counter can be read by executing the RDTSC instruction (please also refer to “Time Stamp Counter” on page 498). Computing the difference in values between two reads (modulo 2^64) yields the number of processor clocks between reads.
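The modular subtraction can be sketched as follows. The function name is hypothetical and the two readings are assumed to have been obtained elsewhere (for example, from two RDTSC executions); the sketch only shows the arithmetic, which remains correct when the 64-bit counter wraps between the two reads.

```python
TSC_MODULUS = 1 << 64  # the TSC is treated as a 64-bit counter


def tsc_delta(first_read, second_read):
    """Return the number of ticks between two TSC readings, modulo 2^64."""
    return (second_read - first_read) % TSC_MODULUS


# Ordinary case: no wraparound between the two reads.
elapsed = tsc_delta(100, 250)

# Wraparound case: the counter rolled over from 2^64 - 1 to 0 between reads.
elapsed_wrapped = tsc_delta(TSC_MODULUS - 10, 5)
```

The result is meaningful only if fewer than 2^64 ticks elapsed between the two reads.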
