The Performance Monitoring Facility


Performance Monitoring Is Not Architecturally Defined

Whether or not a processor implements a Performance Monitoring facility and, if so, how it is implemented are design-specific. The facility is not part of the IA32 processor architecture specification. The implementations on the Pentium®, P6 and Pentium® 4 processor families are not compatible with each other.

Author's Note

The author had a considerable amount of trouble achieving a detailed understanding of every aspect of this feature. The Intel® documentation of this feature is somewhat confusing in some areas.

An Overview

The Pentium® and P6 processor families implemented two Performance Counters, permitting the simultaneous measurement of two event types during a given period of time. The Pentium® 4 processor family expanded the number of counters to 18, and all of the registers associated with the facility are MSRs:

There are 18 Performance Counters, each of which is 40 bits wide.
There is one Counter Configuration Control register (CCCR) associated with each of the counters. A counter's CCCR is used to set up its associated counter for a specific method or style of counting.
There are 45 Event Selection Control registers (ESCRs) used to select the type of event (or events) to be measured by each counter.

In addition, the processor has the ability to store certain types of event records in a special memory buffer referred to as the Debug Store (DS) save area:

The IA32_DS_AREA MSR is programmed with the location of the DS save area in memory.
The programmer can determine whether or not a processor supports the DS mechanism by executing a CPUID request type 1 and verifying that EDX[21] = 1.
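
The EDX[21] test described above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and it assumes software has already executed a CPUID request type 1 and holds the returned EDX value as an integer.

```python
DS_FEATURE_BIT = 21  # EDX[21] from CPUID request type 1, as stated in the text


def supports_debug_store(cpuid1_edx: int) -> bool:
    """Return True if the EDX value from CPUID request type 1 reports DS support.

    Hypothetical helper: obtaining EDX from the processor is outside this sketch.
    """
    return bool((cpuid1_edx >> DS_FEATURE_BIT) & 1)


# Illustrative EDX values only; real values come from the CPUID instruction.
print(supports_debug_store(1 << 21))
print(supports_debug_store(0))
```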

The IA32_MISC_ENABLE MSR (see Figure 56-21 on page 1373) contains two bits associated with the Performance Monitoring facility:

The Performance Monitoring Available bit is a read-only bit that indicates whether or not the processor supports the Performance Monitoring facility.
The Precise Event-Based Sampling Unavailable bit is a read-only bit that indicates whether or not the processor supports the form of event counting referred to as Precise Event-Based Sampling (PEBS).
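
Because the second bit is an "unavailable" flag, its sense is inverted relative to the first. The sketch below decodes both bits from a value already read from IA32_MISC_ENABLE. The bit positions (7 and 12) are assumptions taken from Intel's published register layout and should be checked against Figure 56-21; the function name is hypothetical.

```python
PERF_MON_AVAILABLE_BIT = 7   # assumed position; verify against Figure 56-21
PEBS_UNAVAILABLE_BIT = 12    # assumed position; verify against Figure 56-21


def decode_misc_enable(value: int) -> dict:
    """Decode the two Performance Monitoring bits of IA32_MISC_ENABLE.

    'value' is the MSR content as an integer; reading the MSR itself
    (a privileged RDMSR) is outside this sketch.
    """
    return {
        # Set means the facility is supported.
        "perf_mon_supported": bool((value >> PERF_MON_AVAILABLE_BIT) & 1),
        # Set means PEBS is NOT supported, so the result is inverted.
        "pebs_supported": not ((value >> PEBS_UNAVAILABLE_BIT) & 1),
    }
```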

There Are Two Event Categories

Each event type that can be counted falls into one of two categories:

A non-retirement event is an event that occurs any time during instruction execution (e.g., FSB transactions or cache accesses).
An at-retirement event is an event that is counted when a μop is retired. When a μop experiences a specified event type during its execution, it can be tagged with an identifier. This permits the Performance Monitoring logic to distinguish events that occurred on a correctly predicted execution path from those that occurred on a mispredicted path (whose instruction results are therefore not committed to the processor's register set). Intel® documentation sometimes refers to these as non-bogus versus bogus events.

There Are Three Sampling Methods

A counter can be programmed to use one of the following three methods of counting:

A running event count. A counter is set up to count one or more event types. Software periodically reads the counter to determine how many of the selected event type(s) have occurred since the last time the counter was read.
Interrupt on counter overflow. Intel® documentation refers to this as Non-precise event-based sampling (NPEBS, if you will):
- A counter is set up to count one or more event types.
- It is preset with an initial count.
- It is enabled to generate a Performance Counter interrupt when the counter has an overflow condition.
- When the counter overflows, the handler is called.
- The interrupt handler records the Return Instruction Pointer (RIP), resets the count to its initial programmed value, restarts the counter, and returns to the interrupted program.
Automatic state save on counter overflow. Intel® documentation refers to this as Precise Event-Based Sampling (PEBS):
- A counter is set up to count one or more event types.
- It is preset with an initial count.
- It is enabled to store an event record in the Debug Store (DS) memory area when the counter has an overflow condition.
- When the counter overflows, the processor automatically copies the contents of the GPRs, the EFlags register and EIP into an event record in the DS memory area.
- The processor then automatically resets the count to its initial programmed value and resumes counting.
- The processor then resumes execution of the program.
- When the DS save area is approaching a full condition, a Performance Counter interrupt is generated, and the event records currently in the DS save area can be saved to non-volatile memory (e.g., to disk). A circular DS save buffer is not supported for event records.
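
The arithmetic behind "preset with an initial count" and "reads the counter periodically" can be sketched as follows. This assumes the 40-bit counter counts upward and overflows when it wraps past its maximum value, and that at most one wrap occurs between two reads; both function names are hypothetical.

```python
COUNTER_BITS = 40                    # each Performance Counter is 40 bits wide
COUNTER_MODULUS = 1 << COUNTER_BITS  # the counter wraps at 2**40


def preset_for_sample_interval(events_per_sample: int) -> int:
    """Initial count that makes the counter overflow after N events.

    Assumes an up-counting counter: starting at 2**40 - N, the Nth
    event wraps the counter and raises the overflow condition.
    """
    if not 0 < events_per_sample < COUNTER_MODULUS:
        raise ValueError("interval must fit in a 40-bit counter")
    return COUNTER_MODULUS - events_per_sample


def events_since(previous: int, current: int) -> int:
    """Events counted between two reads of a running count.

    Modulo arithmetic handles a single wrap between the two reads.
    """
    return (current - previous) % COUNTER_MODULUS
```

For example, sampling once every 100,000 events would preset the counter to 2**40 - 100000, and the handler (or, under PEBS, the processor) would restore that same value after each overflow.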

Relationship of a Counter, Its CCCR and the ESCRs

Refer to Figure 56-22 on page 1375. There is a 1-to-1 hardwired relationship between each of the 18 counters and each of the 18 CCCRs. A counter's CCCR is used to set up and enable the counter for a specific method or style of counting.

Figure 56-22. Relationship of a Counter, Its CCCR and the ESCRs

Regarding the ESCRs, each counter is associated with a group of ESCRs, and the number of ESCRs associated with a counter is counter-specific (there can be up to eight ESCRs related to a counter). When a counter's CCCR is configured, one of the items it is configured with is a 3-bit number identifying which of the counter's related ESCRs defines the event(s) that will be counted. It should also be noted that an ESCR can be related to more than one counter.

Table 56-5 on page 1376 defines the relationship of each counter to its CCCR and to its related ESCRs. The following items are listed in each column:

The first column contains the counter's name.
The second column contains the counter's number.
The third column contains the counter's MSR address.
The fourth column contains the name of the counter's CCCR.
The fifth column contains the CCCR's MSR address.
The sixth column contains the names of the ESCRs that can feed event counts to the counter.
The seventh column contains the 3-bit ID of the ESCR. In order to connect the ESCR to a particular counter, the CCCR associated with the counter must be programmed with this ID.
The eighth and final column contains the ESCR's MSR address.

Table 56-5. Counter/CCCR/ESCR Relationship
Counter Name	Counter No.	Counter MSR	CCCR Name	CCCR MSR	ESCR Name	ESCR No.	ESCR MSR
MSR_BPU_COUNTER0	0	300h	MSR_BPU_CCCR0	360h	MSR_BSU_ESCR0	7	3A0h
					MSR_FSB_ESCR0	6	3A2h
					MSR_MOB_ESCR0	2	3AAh
					MSR_PMH_ESCR0	4	3ACh
					MSR_BPU_ESCR0	0	3B2h
					MSR_IS_ESCR0	1	3B4h
					MSR_ITLB_ESCR0	3	3B6h
					MSR_IX_ESCR0	5	3C8h
MSR_BPU_COUNTER1	1	301h	MSR_BPU_CCCR1	361h	MSR_BSU_ESCR0	7	3A0h
					MSR_FSB_ESCR0	6	3A2h
					MSR_MOB_ESCR0	2	3AAh
					MSR_PMH_ESCR0	4	3ACh
					MSR_BPU_ESCR0	0	3B2h
					MSR_IS_ESCR0	1	3B4h
					MSR_ITLB_ESCR0	3	3B6h
					MSR_IX_ESCR0	5	3C8h
MSR_BPU_COUNTER2	2	302h	MSR_BPU_CCCR2	362h	MSR_BSU_ESCR1	7	3A1h
					MSR_FSB_ESCR1	6	3A3h
					MSR_MOB_ESCR1	2	3ABh
					MSR_PMH_ESCR1	4	3ADh
					MSR_BPU_ESCR1	0	3B3h
					MSR_IS_ESCR1	1	3B5h
					MSR_ITLB_ESCR1	3	3B7h
					MSR_IX_ESCR1	5	3C9h
MSR_BPU_COUNTER3	3	303h	MSR_BPU_CCCR3	363h	MSR_BSU_ESCR1	7	3A1h
					MSR_FSB_ESCR1	6	3A3h
					MSR_MOB_ESCR1	2	3ABh
					MSR_PMH_ESCR1	4	3ADh
					MSR_BPU_ESCR1	0	3B3h
					MSR_IS_ESCR1	1	3B5h
					MSR_ITLB_ESCR1	3	3B7h
					MSR_IX_ESCR1	5	3C9h
MSR_MS_COUNTER0	4	304h	MSR_MS_CCCR0	364h	MSR_MS_ESCR0	0	3C0h
					MSR_TBPU_ESCR0	2	3C2h
					MSR_TC_ESCR0	1	3C4h
MSR_MS_COUNTER1	5	305h	MSR_MS_CCCR1	365h	MSR_MS_ESCR0	0	3C0h
					MSR_TBPU_ESCR0	2	3C2h
					MSR_TC_ESCR0	1	3C4h
MSR_MS_COUNTER2	6	306h	MSR_MS_CCCR2	366h	MSR_MS_ESCR1	0	3C1h
					MSR_TBPU_ESCR1	2	3C3h
					MSR_TC_ESCR1	1	3C5h
MSR_MS_COUNTER3	7	307h	MSR_MS_CCCR3	367h	MSR_MS_ESCR1	0	3C1h
					MSR_TBPU_ESCR1	2	3C3h
					MSR_TC_ESCR1	1	3C5h
MSR_FLAME_COUNTER0	8	308h	MSR_FLAME_CCCR0	368h	MSR_FIRM_ESCR0	1	3A4h
					MSR_FLAME_ESCR0	0	3A6h
					MSR_DAC_ESCR0	5	3A8h
					MSR_SAAT_ESCR0	2	3AEh
					MSR_U2L_ESCR0	3	3B0h
MSR_FLAME_COUNTER1	9	309h	MSR_FLAME_CCCR1	369h	MSR_FIRM_ESCR0	1	3A4h
					MSR_FLAME_ESCR0	0	3A6h
					MSR_DAC_ESCR0	5	3A8h
					MSR_SAAT_ESCR0	2	3AEh
					MSR_U2L_ESCR0	3	3B0h
MSR_FLAME_COUNTER2	10	30Ah	MSR_FLAME_CCCR2	36Ah	MSR_FIRM_ESCR1	1	3A5h
					MSR_FLAME_ESCR1	0	3A7h
					MSR_DAC_ESCR1	5	3A9h
					MSR_SAAT_ESCR1	2	3AFh
					MSR_U2L_ESCR1	3	3B1h
MSR_FLAME_COUNTER3	11	30Bh	MSR_FLAME_CCCR3	36Bh	MSR_FIRM_ESCR1	1	3A5h
					MSR_FLAME_ESCR1	0	3A7h
					MSR_DAC_ESCR1	5	3A9h
					MSR_SAAT_ESCR1	2	3AFh
					MSR_U2L_ESCR1	3	3B1h
MSR_IQ_COUNTER0	12	30Ch	MSR_IQ_CCCR0	36Ch	MSR_CRU_ESCR0	4	3B8h
					MSR_CRU_ESCR2	5	3CCh
					MSR_CRU_ESCR4	6	3E0h
					MSR_IQ_ESCR0	0	3BAh
					MSR_RAT_ESCR0	2	3BCh
					MSR_SSU_ESCR0	3	3BEh
					MSR_ALF_ESCR0	1	3CAh
MSR_IQ_COUNTER1	13	30Dh	MSR_IQ_CCCR1	36Dh	MSR_CRU_ESCR0	4	3B8h
					MSR_CRU_ESCR2	5	3CCh
					MSR_CRU_ESCR4	6	3E0h
					MSR_IQ_ESCR0	0	3BAh
					MSR_RAT_ESCR0	2	3BCh
					MSR_SSU_ESCR0	3	3BEh
					MSR_ALF_ESCR0	1	3CAh
MSR_IQ_COUNTER2	14	30Eh	MSR_IQ_CCCR2	36Eh	MSR_CRU_ESCR1	4	3B9h
					MSR_CRU_ESCR3	5	3CDh
					MSR_CRU_ESCR5	6	3E1h
					MSR_IQ_ESCR1	0	3BBh
					MSR_RAT_ESCR1	2	3BDh
					MSR_ALF_ESCR1	1	3CBh
MSR_IQ_COUNTER3	15	30Fh	MSR_IQ_CCCR3	36Fh	MSR_CRU_ESCR1	4	3B9h
					MSR_CRU_ESCR3	5	3CDh
					MSR_CRU_ESCR5	6	3E1h
					MSR_IQ_ESCR1	0	3BBh
					MSR_RAT_ESCR1	2	3BDh
					MSR_ALF_ESCR1	1	3CBh
MSR_IQ_COUNTER4	16	310h	MSR_IQ_CCCR4	370h	MSR_CRU_ESCR0	4	3B8h
					MSR_CRU_ESCR2	5	3CCh
					MSR_CRU_ESCR4	6	3E0h
					MSR_IQ_ESCR0	0	3BAh
					MSR_RAT_ESCR0	2	3BCh
					MSR_SSU_ESCR0	3	3BEh
					MSR_ALF_ESCR0	1	3CAh
MSR_IQ_COUNTER5	17	311h	MSR_IQ_CCCR5	371h	MSR_CRU_ESCR1	4	3B9h
					MSR_CRU_ESCR3	5	3CDh
					MSR_CRU_ESCR5	6	3E1h
					MSR_IQ_ESCR1	0	3BBh
					MSR_RAT_ESCR1	2	3BDh
					MSR_ALF_ESCR1	1	3CBh
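
As an illustration of how Table 56-5 is used, the sketch below holds an excerpt of the table (three counters only) and looks up the 3-bit ESCR number that must be programmed into a counter's CCCR. The dictionary layout and function names are hypothetical; the names and numbers are copied from the table.

```python
# Excerpt of Table 56-5: counter name -> {ESCR name: 3-bit ESCR number}.
BPU_GROUP_0 = {
    "MSR_BSU_ESCR0": 7, "MSR_FSB_ESCR0": 6, "MSR_MOB_ESCR0": 2,
    "MSR_PMH_ESCR0": 4, "MSR_BPU_ESCR0": 0, "MSR_IS_ESCR0": 1,
    "MSR_ITLB_ESCR0": 3, "MSR_IX_ESCR0": 5,
}
ESCR_SELECT = {
    "MSR_BPU_COUNTER0": BPU_GROUP_0,
    "MSR_BPU_COUNTER1": BPU_GROUP_0,  # the same ESCRs feed counters 0 and 1
    "MSR_MS_COUNTER0": {"MSR_MS_ESCR0": 0, "MSR_TBPU_ESCR0": 2, "MSR_TC_ESCR0": 1},
}


def escr_select_id(counter: str, escr: str) -> int:
    """Return the ESCR number to program into the counter's CCCR."""
    try:
        return ESCR_SELECT[counter][escr]
    except KeyError:
        raise ValueError(f"{escr} cannot feed {counter}") from None


def counters_fed_by(escr: str) -> list:
    """List the counters (within this excerpt) that the ESCR can feed."""
    return sorted(c for c, escrs in ESCR_SELECT.items() if escr in escrs)
```

Rejecting an unrelated pair (for example, MSR_ITLB_ESCR0 with MSR_MS_COUNTER0) mirrors the hardware constraint that a counter can only be connected to an ESCR in its own group.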

The Event Select Control Registers

The ESCR selected to be associated with a counter (by programming a 3-bit value into the CCCR's ESCR Select field) contains two fields (see Figure 56-23 on page 1384) that are used to select the event or events to be counted by that counter:

The 6-bit Event Select field identifies a class of events.
The 16 bits in the Event Mask field select which specific events within the selected class are to be counted.
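
A minimal sketch of how an Event Mask value is formed: each event within a class owns one mask bit, and selecting several events means OR-ing their bits together. The example uses the ITLB_reference class (18h) and its mask bit numbers from Table 56-7; the helper name is hypothetical.

```python
EVENT_MASK_BITS = 16  # width of the ESCR Event Mask field

# Mask bit numbers for the ITLB_reference event class (18h), from Table 56-7.
ITLB_REFERENCE_CLASS = 0x18
ITLB_REFERENCE = {"HIT": 0, "MISS": 1, "HIT_UC": 2}


def build_event_mask(selected, bit_numbers) -> int:
    """OR together one mask bit for each selected event name."""
    mask = 0
    for name in selected:
        bit = bit_numbers[name]
        if not 0 <= bit < EVENT_MASK_BITS:
            raise ValueError(f"mask bit {bit} does not fit in a 16-bit field")
        mask |= 1 << bit
    return mask


# Count both ITLB hits and ITLB misses with a single counter.
print(bin(build_event_mask(["HIT", "MISS"], ITLB_REFERENCE)))  # 0b11
```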

Figure 56-23. The ESCR Format

Table 56-7 on page 1385 describes the Event Classes and corresponding Event Mask bit fields that are currently defined for Non-Retirement Event Counting.

Table 56-7. Event Classes and Event Mask Bits for Non-Retirement Counting
Event Class	Description	ESCRs that can be used	Counters that must be used
06h	Retired Branches. Mask Bits: 0: MMNP. Branch Not-taken Predicted. 1: MMNM. Branch Not-taken Mispredicted. 2: MMTP. Branch Taken Predicted. 3: MMTM. Branch Taken Mispredicted.	CRU_ESCR2 CRU_ESCR3 Note: These ESCR names begin with “MSR_”	ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
03h	Retired mispredicted branches. Mask Bits: 0: NBOGUS. The retired instruction is not bogus.	CRU_ESCR0 CRU_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 12, 13, 16 ESCR1: 14, 15, 17
01h	TC_deliver_mode. This event counts the duration (in clock cycles) of the operating modes of the Trace Cache and decode engine in the processor. Deliver Mode means that there was a hit on the TC and the requested μops are being delivered from the TC. Build Mode means that there was a TC miss and the μops being delivered from the decoder are being assembled into trace lines of 6 μops each. Mask Bits: 0: DD. Both logical processors are in deliver mode. 1: DB. Logical processor 0 is in deliver mode and logical processor 1 is in build mode. 2: DI. Logical processor 0 is in deliver mode and logical processor 1 is either halted, under a Machine Clear condition, or transitioning to a long microcode flow. 3: BD. Logical processor 0 is in build mode and logical processor 1 is in deliver mode. 4: BB. Both logical processors are in build mode. 5: BI. Logical processor 0 is in build mode and logical processor 1 is either halted, under a Machine Clear condition, or transitioning to a long microcode flow. 6: ID. Logical processor 0 is either halted, under a Machine Clear condition, or transitioning to a long microcode flow. Logical processor 1 is in deliver mode. 7: IB. Logical processor 0 is either halted, under a Machine Clear condition, or transitioning to a long microcode flow. Logical processor 1 is in build mode.	TC_ESCR0 TC_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 4, 5 ESCR1: 6, 7
03h	BPU_fetch_request (BPU = Branch Prediction Unit). This event counts instruction fetch requests of a specified request type by the Branch Prediction unit. Mask Bits: 0: TCMISS. Trace cache miss.	BPU_ESCR0 BPU_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 0, 1 ESCR1: 2, 3
18h	ITLB_reference. This event counts translations using the Instruction Translation Lookaside Buffer (ITLB). Mask Bits: 0: HIT. ITLB hit. 1: MISS. ITLB miss. 2: HIT_UC. Uncacheable ITLB hit.	ITLB_ESCR0 ITLB_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 0, 1 ESCR1: 2, 3
02h	memory_cancel. This event counts the canceling of various types of requests in the Data Cache Address Control unit (DAC). Mask Bits: 2: ST_RB_FULL. Replayed because no Store Buffer is available. 3: 64K_CONF. Conflicts due to 64K aliasing.	DAC_ESCR0 DAC_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
08h	memory_complete. This event counts the completion of a load split, store split, uncacheable (UC) split, or a UC load. Mask Bits: 0: LSC. Load split completed, excluding UC/WC loads. 1: SSC. Any split stores completed.	SAAT_ESCR0 SAAT_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
04h	load_port_replay. This event counts replayed events at the load port. Mask Bits: 1: SPLIT_LD. Split load.	SAAT_ESCR0 SAAT_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
05h	store_port_replay. This event counts replayed events at the store port. Mask Bits: 1: SPLIT_ST. Split store.	SAAT_ESCR0 SAAT_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
03h	MOB_load_replay. This event triggers if the Memory Order Buffer (MOB) caused a load operation to be replayed. Mask Bits: 1: NO_STA. Replayed because of an unknown store address. 3: NO_STD. Replayed because of an unknown store data. 4: PARTIAL_DATA. Replayed because of a partially overlapped data access between the load and store operations. 5: UNALGN_ADDR. Replayed because the lower four bits of the load and store linear addresses do not match.	MOB_ESCR0 MOB_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 0, 1 ESCR1: 2, 3
01h	page_walk_type. This event counts various types of page walks that the Page Miss Handler (PMH) performs. Mask Bits: 0: DTMISS. 1: ITMISS.	PMH_CR_ESCR0 PMH_CR_ESCR1	ESCR0: 0, 1 ESCR1: 2, 3
0Ch	BSQ_cache_reference. This event counts cache references to the L2 or L3 cache as seen by the bus unit. Specify one or more mask bits to select an access according to the access type (read types include both loads and RFOs; write types include writebacks and evictions) and the access result (hit, miss). Mask Bits: 0: RD_2ndL_HITS. Read L2 cache hit Shared (includes load and RFO). 1: RD_2ndL_HITE. Read L2 cache hit Exclusive (includes load and RFO). 2: RD_2ndL_HITM. Read L2 cache hit Modified (includes load and RFO). 3: RD_3rdL_HITS. Read L3 cache hit Shared (includes load and RFO). 4: RD_3rdL_HITE. Read L3 cache hit Exclusive (includes load and RFO). 5: RD_3rdL_HITM. Read L3 cache hit Modified (includes load and RFO). 8: RD_2ndL_MISS. Read L2 cache miss (includes load and RFO). 9: RD_3rdL_MISS. Read L3 cache miss (includes load and RFO). 10: WR_2ndL_MISS. A Writeback lookup from the DAC misses the L2 cache (unlikely to happen).	BSU_CR_ESCR0 BSU_CR_ESCR1	ESCR0: 0, 1 ESCR1: 2, 3
03h	IOQ_allocation. This event counts the various types of transactions on the bus. A count is generated each time a transaction is allocated into the IOQ that matches the specified mask bits. An allocated entry can be a sector (64 bytes) or chunks of 8 bytes. Note that requests are counted once per retry. The event is triggered by evaluating the logical expression: (((Request type) OR Bit 5 OR Bit 6) OR (Memory type)) AND (Source agent). Mask Bits: [0:4]. Bus request type (use 00001 for invalid or default). 5: ALL_READ. Count read entries. 6: ALL_WRITE. Count write entries. 7: MEM_UC. Count UC memory access entries. 8: MEM_WC. Count WC memory access entries. 9: MEM_WT. Count WT memory access entries. 10: MEM_WP. Count WP memory access entries. 11: MEM_WB. Count WB memory access entries. 13: OWN. Count all store requests (RFOs) driven by the processor, as opposed to other processors or device adapters. 14: OTHER. Count all requests driven by other processors or by device adapters. 15: PREFETCH. Include hardware and software prefetch requests in the count.	MSR_FSB_ESCR0	ESCR0: 0, 1
1Ah	IOQ_active_entries. This event counts the number of entries (clipped at 15) in the IOQ that are active. An allocated entry can be a sector (64 bytes) or chunks of 8 bytes. This event must be programmed in conjunction with the IOQ_allocation event. Mask Bits: [0:4]. Bus request type (use 00001 for invalid or default). 5: ALL_READ. Count read entries. 6: ALL_WRITE. Count write entries. 7: MEM_UC. Count UC memory access entries. 8: MEM_WC. Count WC memory access entries. 9: MEM_WT. Count WT memory access entries. 10: MEM_WP. Count WP memory access entries. 11: MEM_WB. Count WB memory access entries. 13: OWN. Count all store requests (RFOs) driven by the processor, as opposed to other processors or device adapters. 14: OTHER. Count all requests driven by other processors or by device adapters. 15: PREFETCH. Include hardware and software prefetch requests in the count.	MSR_FSB_ESCR1	ESCR1: 2, 3
17h	FSB_data_activity. This event increments once for each DRDY or DBSY assertion that occurs on the FSB. The event allows selection of a specific DRDY or DBSY event. Mask Bits: 0: DRDY_DRV. Count when this processor drives data onto the FSB (this includes writes and implicit writebacks). Asserted one BCLK cycle for partial writes and four BCLKs (usually in consecutive bus clocks) for full line writes. 1: DRDY_OWN. Count when this processor reads data from the FSB (this includes loads and some PIC transactions; PIC is the interrupt controller). Asserted one BCLK cycle for partial reads and four BCLKs (usually in consecutive bus clocks) for full line reads. 2: DRDY_OTHER. Count when data is on the FSB but not being sampled by the processor. It may or may not be driven by this processor. Asserted one BCLK cycle for partial transactions and four BCLKs (usually in consecutive bus clocks) for full line transactions. 3: DBSY_DRV. Count when this processor reserves the FSB for use in the next transaction in order to drive data. Asserted for two BCLK cycles for full line writes and not at all for partial line writes. May be asserted multiple times (in consecutive BCLKs) if we stall the bus waiting for a cache lock to complete. 4: DBSY_OWN. Count when some agent reserves the FSB for use in the next transaction to drive data that this processor will sample. Asserted for two BCLK cycles for full line writes and not at all for partial line writes. May be asserted multiple times (each one BCLK apart) if we stall the FSB for some reason. 5: DBSY_OTHER. Count when some agent reserves the FSB for use in the next transaction to drive data that this processor will NOT sample. It may or may not be driven by this processor. Asserted two BCLK cycles for partial transactions and four BCLKs (usually in consecutive BCLKs) for full line transactions.	MSR_FSB_ESCR0 MSR_FSB_ESCR1	ESCR0: 0, 1 ESCR1: 2, 3
05h	BSQ_allocation. This event counts allocations in the Bus Sequence Unit (BSQ) according to the specified mask bit encoding. Mask Bits: [1:0]. REQ_TYPE. Request type encoding: - 0 = Read (excludes Memory Read and Invalidate). - 1 = Memory Read and Invalidate. - 2 = Write (other than writebacks). - 3 = Writeback (evicted from cache). [3:2]. REQ_LEN. Request length encoding: - 0 = 0 chunks. - 1 = 1 chunk. - 3 = 8 chunks. 5: REQ_IO_TYPE. Request type is an IO Read or Write. 6: REQ_LOCK_TYPE. Request type is a locked Read/Modify/Write. 7: REQ_CACHE_TYPE. Request type is cacheable. 8: REQ_SPLIT_TYPE. Request type is an 8-byte chunk split across a qword address boundary. 9: REQ_DEM_TYPE. Request type: - 1 = Demand. A read resulted in a sector miss and the sector is being fetched. - 0 = A sector fetch due to the hardware prefetch mechanism or a software PREFETCH instruction. 10: REQ_ORD_TYPE. Request is an ordered type (the author isn't sure what this refers to). [13:11]. MEM_TYPE. Memory type encoding: - 0 = UC. - 1 = USWC. - 4 = WT. - 5 = WP. - 6 = WB.	BSU_ESCR0 Note: This ESCR name begins with “MSR_”	ESCR0: 0, 1
06h	bsq_active_entries. This event represents the number of BSQ entries (clipped at 15) currently active (valid) which meet the mask criteria during allocation in the BSQ. Active request entries are allocated on the BSQ until de-allocated. Deallocation of an entry does not necessarily imply the request is filled. This event must be programmed in conjunction with the BSQ_allocation event. Mask Bits: [1:0]. REQ_TYPE. Request type encoding: - 0 = Read (excludes Memory Read and Invalidate). - 1 = Memory Read and Invalidate. - 2 = Write (other than writebacks). - 3 = Writeback (evicted from cache). [3:2]. REQ_LEN. Request length encoding: - 0 = 0 chunks. - 1 = 1 chunk. - 3 = 8 chunks. 5: REQ_IO_TYPE. Request type is an IO Read or Write. 6: REQ_LOCK_TYPE. Request type is a locked Read/Modify/Write. 7: REQ_CACHE_TYPE. Request type is cacheable. 8: REQ_SPLIT_TYPE. Request type is an 8-byte chunk split across a qword address boundary. 9: REQ_DEM_TYPE. Request type: - 1 = Demand. A read resulted in a sector miss and the sector is being fetched. - 0 = A sector fetch due to the hardware prefetch mechanism or a software PREFETCH instruction. 10: REQ_ORD_TYPE. Request is an ordered type (the author isn't sure what this refers to). [13:11]. MEM_TYPE. Memory type encoding: - 0 = UC. - 1 = USWC. - 4 = WT. - 5 = WP. - 6 = WB.	ESCR1 Note: The Intel® documentation just refers to ESCR1, but this is not a complete ESCR name (which ESCR1?).	ESCR1: 2, 3
03h	x87_assist. This event counts the retirement of x87 instructions that required special handling. Mask Bits: 0: FPSU. Handle FP stack underflow. 1: FPSO. Handle FP stack overflow. 2: POAO. Handle x87 output overflow. 3: POAU. Handle x87 output underflow. 4: PREA. Handle x87 input assist.	CRU_ESCR2 CRU_ESCR3 Note: These ESCR names begin with “MSR_”	ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
34h	SSE_input_assist. This event counts the number of times an assist is requested to handle problems with input operands for SSE and SSE2 operations, most notably denormal source operands when the DAZ bit is not set. Set bit 15 of the event mask to use this event. Mask Bits: 15: ALL. Count assists for all SSE and SSE2 μops.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
08h	packed_SP_μop. This event increments for each packed SP FP μop. Mask Bits: 15: ALL. Count all μops operating on packed SP FP operands.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
0Ch	packed_DP_μop. This event increments for each packed DP FP μop. Mask Bits: 15: ALL. Count all μops operating on packed DP FP operands.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
0Ah	scalar_SP_μop. This event increments for each scalar SP FP μop. Mask Bits: 15: ALL. Count all μops operating on scalar SP FP operands.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
0Eh	scalar_DP_μop. This event increments for each scalar DP FP μop. Mask Bits: 15: ALL. Count all μops operating on scalar DP FP operands.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
02h	64bit_MMX_μop. This event increments for each MMX instruction which operates on 64-bit SIMD operands. Mask Bits: 15: ALL. Count all μops operating on 64-bit SIMD integer operands in memory or MMX registers.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
1Ah	128bit_MMX_μop. This event increments for each integer SIMD SSE2 instruction which operates on 128-bit SIMD operands. Mask Bits: 15: ALL. Count all μops operating on 128-bit SIMD integer operands in memory or XMM registers.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
04h	x87_FP_μop. This event increments for each x87 FP μop. Mask Bits: 15: ALL. Count all x87 FP μops.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
2Eh	x87_SIMD_moves_μop. This event increments for each x87 FP, MMX, SSE or SSE2 μop related to load data, store data, or register-to-register moves. These μops are dispatched to port 0 or port 2 at runtime. Mask Bits: 3: ALLP0. Count all x87/SIMD store/moves μops. 4: ALLP2. Count all x87/SIMD load μops.	FIRM_ESCR0 FIRM_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 8, 9 ESCR1: 10, 11
02h	machine_clear. This event increments while the entire pipeline of the machine is cleared. Mask Bits: 0: CLEAR. Counts clock cycles while the machine is cleared for any cause. To just count the incident versus the duration of the incident, use Edge triggering (via the CCCR). 2: MOCLEAR. Increments each time the machine is cleared due to memory ordering issues. 3: SMCLEAR. Increments each time the machine is cleared due to Self Modifying Code (SMC) issues.	CRU_ESCR2 CRU_ESCR3 Note: These ESCR names begin with “MSR_”	ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
05h	global_power_events. This event measures the time during which a processor is not stopped. Mask Bits: 0: Running. The processor is active (includes the handling of HLT, STPCLK and throttling).	FSB_ESCR0 FSB_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 0, 1 ESCR1: 2, 3
05h	tc_ms_xfer. This event counts the number of times that μop delivery changed from the TC to the Microcode Store ROM. Mask Bits: 0: CISC. A TC to MS transfer occurred.	MS_ESCR0 MS_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 4, 5 ESCR1: 6, 7
09h	μop_queue_writes. This event counts the number of valid μops written to the μop queue. Mask Bits: 0: FROM_TC_BUILD. The μops being written to the μop Queue are from TC Build Mode. 1: FROM_TC_DELIVER. The μops being written to the μop Queue are from TC Deliver Mode. 2: FROM_ROM. The μops being written to the μop Queue are from the MS ROM.	MS_ESCR0 MS_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 4, 5 ESCR1: 6, 7

Table 56-8 on page 1395 describes the Event Classes and corresponding Event Mask bit fields that are currently defined for At-Retirement Event Counting.

Table 56-8. Event Classes and Event Mask Bits for At-Retirement Counting
Event Class	Description	ESCRs that can be used	Counters that must be used
08h	front_end_event. This event counts the retirement of tagged μops (specified by the front-end tagging mechanism). Mask Bits: 0: NBOGUS. The marked μops are not bogus. 1: BOGUS. The marked μops are bogus.	CRU_ESCR2, CRU_ESCR3 Note: These ESCR names begin with “MSR_”	ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
0Ch	execution_event. This event counts the retirement of tagged μops (specified through the execution tagging mechanism). The event mask allows from one to four types of μops to be specified as either bogus or non-bogus μops to be tagged. Mask Bits: 0: NBOGUS0. The marked μops are not bogus. 1: NBOGUS1. The marked μops are not bogus. 2: NBOGUS2. The marked μops are not bogus. 3: NBOGUS3. The marked μops are not bogus. 4: BOGUS0. The marked μops are bogus. 5: BOGUS1. The marked μops are bogus. 6: BOGUS2. The marked μops are bogus. 7: BOGUS3. The marked μops are bogus. For more information, refer to “Execution Tagging” on page 1412.	CRU_ESCR2 CRU_ESCR3 Note: These ESCR names begin with “MSR_”	ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
09h	replay_event. This event counts the retirement of tagged μops (specified through the replay tagging mechanism). Mask Bits: 0: NBOGUS. The marked μops are not bogus. 1: BOGUS. The marked μops are bogus.	CRU_ESCR2 CRU_ESCR3 Note: These ESCR names begin with “MSR_”	ESCR2: 12, 13, 16 ESCR3: 14, 15, 17
02h	instr_retired. This event counts the instructions that are retired during a clock cycle. Mask bits specify bogus or non-bogus (and whether they are tagged via the front-end tagging mechanism). The event count may vary depending on the microarchitecture state of the processor when the event is enabled. The event may count more than once for some IA32 instructions with complex μop flows that were interrupted before retirement. Mask Bits: 0: NBOGUSNTAG. Non-bogus instructions that are not tagged. 1: NBOGUSTAG. Non-bogus instructions that are tagged. 2: BOGUSNTAG. Bogus instructions that are not tagged. 3: BOGUSTAG. Bogus instructions that are tagged.	CRU_ESCR0 CRU_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 12, 13, 16 ESCR1: 14, 15, 17
01h	μops_retired. This event counts the μops retired during a clock cycle. Mask Bits: 0: NBOGUS. The marked μops are not bogus. 1: BOGUS. The marked μops are bogus.	CRU_ESCR0 CRU_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 12, 13, 16 ESCR1: 14, 15, 17
02h	μop_type. This event is used in conjunction with the front-end at-retirement mechanism to tag load and store μops. Mask Bits: 1: TAGLOADS. The μop is a load operation. 2: TAGSTORES. The μop is a store operation.	RAT_ESCR0 RAT_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 12, 13, 16 ESCR1: 14, 15, 17
05h	retired_mispred_branch_type. This event counts retiring mispredicted branches by type. Mask Bits: 1: CONDITIONAL. Conditional jumps. 2: CALL. Indirect call branches. 3: RETURN. Return branches. 4: INDIRECT. Returns, indirect calls, or indirect jumps.	TBPU_ESCR0 TBPU_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 4, 5 ESCR1: 6, 7
04h	retired_branch_type. This event counts retiring branches by type. Mask Bits: 1: CONDITIONAL. Conditional jumps. 2: CALL. Direct or indirect calls. 3: RETURN. Return branches. 4: INDIRECT. Returns, indirect calls, or indirect jumps.	TBPU_ESCR0 TBPU_ESCR1 Note: These ESCR names begin with “MSR_”	ESCR0: 4, 5 ESCR1: 6, 7

The remaining ESCR bit fields are defined in Table 56-6 on page 1382.

Table 56-6. ESCR Bit Field Definitions
Bit Field	Description
The OS bit	This bit and its description apply only to a processor that is not capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when the processor is executing privilege level 0 code. When both the OS and USR bits are set, the selected events are counted at all privilege levels. If neither the OS nor the USR bit is set, none of the selected events are counted.
The USR bit	This bit and its description apply only to a processor that is not capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when the processor is executing privilege level 1, 2 or 3 code. When both the OS and USR bits are set, the selected events are counted at all privilege levels. If neither the OS nor the USR bit is set, none of the selected events are counted.
T0OS	This bit and its description apply only to a processor that is capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when logical processor 0 is executing privilege level 0 code. When both the T0OS and T0USR bits are set, the selected events on logical processor 0 are counted at all privilege levels. If neither the T0OS nor the T0USR bit is set, none of the selected events are counted on logical processor 0.
T0USR	This bit and its description apply only to a processor that is capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when logical processor 0 is executing privilege level 1, 2 or 3 code. When both the T0OS and T0USR bits are set, the selected events on logical processor 0 are counted at all privilege levels. If neither the T0OS nor the T0USR bit is set, none of the selected events are counted on logical processor 0.
T1OS	This bit and its description apply only to a processor that is capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when logical processor 1 is executing privilege level 0 code. When both the T1OS and T1USR bits are set, the selected events on logical processor 1 are counted at all privilege levels. If neither the T1OS nor the T1USR bit is set, none of the selected events are counted on logical processor 1.
T1USR	This bit and its description apply only to a processor that is capable of Hyper-Threading. When set to one, the events selected by this register's Event Select and Event Mask fields are counted when logical processor 1 is executing privilege level 1, 2 or 3 code. When both the T1OS and T1USR bits are set, the selected events on logical processor 1 are counted at all privilege levels. If neither the T1OS nor the T1USR bit is set, none of the selected events are counted on logical processor 1.
The Tag Value field	Selects a 4-bit tag value to associate with a μop to use in At-Retirement Event Counting.
The Tag Enable bit	When set to one, enables the tagging of μops to use in At Retirement Event Counting; when cleared to zero, tagging is disabled.
Event Select field	The 6-bit Event Select field identifies a class of events.
Event Mask field	The 16 bits in the Event Mask field are used to select the specific events within the selected Class that are to be counted.
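The fields in Table 56-6 can be packed into a 32-bit ESCR image as sketched below. The bit positions used (USR = bit 2, OS = bit 3, Tag Enable = bit 4, Tag Value = bits 8:5, Event Mask = bits 24:9, Event Select = bits 30:25) are assumptions taken from Intel's published ESCR layout for a processor that is not capable of Hyper-Threading; they are not stated in the table and should be checked against the ESCR format figure. The function name encode_escr is hypothetical.

```python
def encode_escr(event_select, event_mask, count_usr=False, count_os=False,
                tag_enable=False, tag_value=0):
    """Pack ESCR fields into a 32-bit value (assumed non-HT bit layout)."""
    if not 0 <= event_select < (1 << 6):
        raise ValueError("Event Select is a 6-bit field")
    if not 0 <= event_mask < (1 << 16):
        raise ValueError("Event Mask is a 16-bit field")
    if not 0 <= tag_value < (1 << 4):
        raise ValueError("Tag Value is a 4-bit field")

    value = 0
    value |= int(count_usr) << 2    # USR: count at privilege levels 1-3
    value |= int(count_os) << 3     # OS: count at privilege level 0
    value |= int(tag_enable) << 4   # Tag Enable
    value |= tag_value << 5         # Tag Value, bits 8:5
    value |= event_mask << 9        # Event Mask, bits 24:9
    value |= event_select << 25     # Event Select, bits 30:25
    return value


# Example: retired_branch_type (Event Select 04h), all four branch types
# (Event Mask bits 1 through 4), counted at every privilege level.
ALL_BRANCH_TYPES = 0b11110
escr = encode_escr(0x04, ALL_BRANCH_TYPES, count_usr=True, count_os=True)
```

The resulting value would be written to one of the ESCRs listed for the event (TBPU_ESCR0 or TBPU_ESCR1 in this example) using WRMSR.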

The Counter Configuration Control Registers

General

As mentioned earlier, each of the 18 counters is associated with a specific CCCR. The CCCR controls event counting, event filtering and the interrupt-on-overflow capability.

Each CCCR has the format shown in Figure 56-24 on page 1398 and each of the bit fields is explained in Table 56-9 on page 1399.

Table 56-9. The CCCR Bit Field Definitions
Bit Field	Description
Enable	0 = Counter disabled. 1 = Counter enabled. This bit is cleared to 0 on reset.
ESCR Select	Selects the ESCR that defines the event(s) to be counted by the counter that is associated with the CCCR.
Bits[17:16]	This field is reserved and must be set to 11b in a processor that is not capable of Hyper-Threading. In a Hyper-Threading capable processor, this is the Active Thread field. - 00 = Count when neither logical processor is active. - 01 = Count when only one logical processor is active. - 10 = Count only when both logical processors are active. - 11 = Count when either logical processor is active. A logical processor that is halted or that is in the “wait for SIPI” state is considered inactive.
Compare	0 = Filtering disabled. 1 = Filtering enabled. The filtering method is selected by the Threshold, Complement, and Edge bit fields. Refer to “The Event Filtering Mechanism” on page 1408 for a detailed description.
Complement	Selects how the incoming per-clock event count is compared with the Threshold value. 0 = The counter is incremented when the event count is > the specified Threshold value. 1 = The counter is incremented when the event count is <= the specified Threshold value. This bit is “don't care” if the Compare bit = 0. Refer to “The Event Filtering Mechanism” on page 1408 for a detailed description.
Threshold	Specifies the threshold value to be used for comparisons. This field only has meaning if the Compare bit = 1. The processor uses the setting of the Complement bit to determine the type of threshold comparison. The range of values that can be specified depends on the event type. Refer to “The Event Filtering Mechanism” on page 1408 for a detailed description.
Edge	0 = Rising-edge (false-to-true) detection of the threshold comparison output for filtering event counts is disabled. 1 = Rising-edge (false-to-true) detection of the threshold comparison output for filtering event counts is enabled. This bit is only meaningful when the Compare bit = 1.
Force Overflow	0 = A counter overflow condition only occurs when the counter overflows. 1 = A counter overflow condition occurs each time that the counter is incremented.
PMI bit	This bit only applies to a processor that does not support Hyper-Threading. It is the Interrupt On Overflow enable bit: 0 = The processor will not generate a Performance Monitor interrupt on a counter overflow. 1 = The processor generates a Performance Monitor interrupt each time that the counter overflows. The interrupt is generated on the next increment after the counter has overflowed.
OVF_PMI_T0	This bit only applies to a processor that supports Hyper-Threading. It is the Interrupt On Overflow enable bit for logical processor 0: 0 = The processor will not generate a Performance Monitor interrupt on a counter overflow associated with logical processor 0. 1 = The processor generates a Performance Monitor interrupt on a counter overflow associated with logical processor 0. The interrupt is generated on the next increment after the counter has overflowed.
OVF_PMI_T1	This bit only applies to a processor that supports Hyper-Threading. It is the Interrupt On Overflow enable bit for logical processor 1: 0 = The processor will not generate a Performance Monitor interrupt on a counter overflow associated with logical processor 1. 1 = The processor generates a Performance Monitor interrupt on a counter overflow associated with logical processor 1. The interrupt is generated on the next increment after the counter has overflowed.
Cascade	0 = counter cascading is disabled. 1 = Enables counting on this counter when the counter that is its cascade partner (see Table 56-10 on page 1403) overflows. See “Counter Cascading” on page 1401 for more information. As an example, to have a counter 0 overflow automatically enable counter 2 to start counting, the programmer sets the Cascade bit = 1 in the CCCR associated with counter 2.
Extended Cascade Enable	This bit is only implemented in the CCCRs associated with Counters 12, 15, 16 and 17, and is only available in Pentium® 4 processors with Family = Fh and a Model number >= 2. Refer to “Extended Cascading” on page 1406.
Overflow	This is the overflow history bit. 0 = The counter has not overflowed. 1 = The counter has overflowed since the last time this bit was cleared by software.
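As a rough illustration of how the fields in Table 56-9 combine, the sketch below packs them into a 32-bit CCCR image for a processor that is not capable of Hyper-Threading. The bit positions (Enable = bit 12, ESCR Select = bits 15:13, Compare = bit 18, Complement = bit 19, Threshold = bits 23:20, Edge = bit 24, Force Overflow = bit 25, PMI = bit 26, Cascade = bit 30) are assumptions taken from Intel's published CCCR layout and should be verified against Figure 56-24. The function name encode_cccr is hypothetical.

```python
def encode_cccr(escr_select, enable=True, compare=False, complement=False,
                threshold=0, edge=False, force_ovf=False, pmi=False,
                cascade=False):
    """Pack CCCR fields into a 32-bit value (assumed non-HT bit layout)."""
    if not 0 <= escr_select < (1 << 3):
        raise ValueError("ESCR Select is a 3-bit field")
    if not 0 <= threshold < (1 << 4):
        raise ValueError("Threshold is a 4-bit field")

    value = 0b11 << 16              # bits 17:16 must be 11b without HT
    value |= int(enable) << 12      # Enable
    value |= escr_select << 13      # ESCR Select, bits 15:13
    value |= int(compare) << 18     # Compare (filtering on/off)
    value |= int(complement) << 19  # Complement
    value |= threshold << 20        # Threshold, bits 23:20
    value |= int(edge) << 24        # Edge
    value |= int(force_ovf) << 25   # Force Overflow
    value |= int(pmi) << 26         # interrupt on overflow
    value |= int(cascade) << 30     # Cascade
    return value


# Example: enable a counter fed by ESCR Select value 2, no filtering.
plain = encode_cccr(escr_select=2)
# Same counter, but raising a Performance Monitor interrupt on overflow.
with_pmi = encode_cccr(escr_select=2, pmi=True)
```

The Overflow history bit is left at zero here because software normally clears it rather than sets it.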

Figure 56-24. The CCCR Format

Table 56-10. The Four Counter Groups
Counter Group	Consists of These Counters	Counter Numbers
BPU group	The Branch Prediction Unit group consists of two pairs of counters:
	MSR_BPU_COUNTER0 and MSR_BPU_COUNTER1	0 and 1
	MSR_BPU_COUNTER2 and MSR_BPU_COUNTER3.	2 and 3
	Counters 0 and 2 are a cascade pair, as are 1 and 3.
MS group	The Microcode Store ROM group consists of:
	MSR_MS_COUNTER0 and MSR_MS_COUNTER1.	4 and 5
	MSR_MS_COUNTER2 and MSR_MS_COUNTER3.	6 and 7
	Counters 4 and 6 are a cascade pair, as are 5 and 7.
FLAME group	The FLAME group (the author has no idea what FLAME stands for) consists of:
	MSR_FLAME_COUNTER0 and MSR_FLAME_COUNTER1.	8 and 9
	MSR_FLAME_COUNTER2 and MSR_FLAME_COUNTER3.	10 and 11
	Counters 8 and 10 are a cascade pair, as are 9 and 11.
IQ group	The Instruction Queue group consists of:
	MSR_IQ_COUNTER0 and MSR_IQ_COUNTER1.	12 and 13
	MSR_IQ_COUNTER2 and MSR_IQ_COUNTER3.	14 and 15
	MSR_IQ_COUNTER4 and MSR_IQ_COUNTER5.	16 and 17
	Because this group has a third counter pair, cascading is handled a little differently: 12 and 14 are a cascade pair, as are 13 and 15. 14 can be cascaded to 16 (but 16 cannot be cascaded to 14). 15 can be cascaded to 17 (but 17 cannot be cascaded to 15).

The CCCRs are all cleared to zero on reset. The events that an enabled counter actually counts are selected and filtered by the following bit fields in the ESCR and CCCR (in the qualification order shown):

1.	First, ESCR[Event Select] and ESCR[Event Mask] select the Event Class and one or more events within the class, respectively.
2.	Then, ESCR[OS] and ESCR[USR] (in an HTT-capable processor, the ESCR T0OS, T0USR, T1OS and T1USR bits) select the privilege levels at which events will be counted.
3.	CCCR[ESCR Select] selects the ESCR that pipes counts to the CCCR for events that have passed steps 1 and 2.
4.	CCCR[Compare], CCCR[Complement] and CCCR[Threshold] then select the optional threshold to be used in qualifying the event count.
5.	CCCR[Edge] allows events to be counted only on rising-edge (false-to-true) transitions.

Counter Cascading

The 18 performance counters are organized into nine pairs (see Table 56-10 on page 1403) and each pair of performance counters is associated with a particular subset of events and ESCRs (see Table 56-5 on page 1376).

While the ability to count a particular event type is valuable, there are more complex tests or software tuning scenarios that counter cascading can address quite handily:

A counter could be set up to detect when a particular event occurs and then automatically enable another counter to start counting a different event type.
A counter could be set up to count a specified number of a particular event type, and then automatically enable another counter to start counting a different event type.
Each counter is 40 bits wide. Two counters may be initialized to a count of 0 and both set up to count the same event type. One of the counters is then enabled and starts counting the selected event type. When it overflows, the overflow condition automatically enables the other counter, which then continues counting the same event type.

Referring to Table 56-10 on page 1403, each counter can be cascaded to a second counter located in another pair (not in the same pair, however) within the same group. As an example, counters 0 and 2 can be cascaded in any order (i.e., 0 to 2, or 2 to 0), as can counters 1 and 3.

As an example, to have a counter 0 overflow automatically enable counter 2 to start counting, the programmer sets CCCR[Cascade] = 1 (see Figure 56-24 on page 1398) in the CCCR associated with counter 2.
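The cascade relationships in Table 56-10 can be captured in a small lookup table, as in the sketch below. The dictionary is transcribed from that table; the names CASCADE_TARGETS and can_cascade are hypothetical.

```python
# For each counter, the set of counters that its overflow can enable.
# Transcribed from Table 56-10.
CASCADE_TARGETS = {
    0: {2}, 2: {0}, 1: {3}, 3: {1},        # BPU group
    4: {6}, 6: {4}, 5: {7}, 7: {5},        # MS group
    8: {10}, 10: {8}, 9: {11}, 11: {9},    # FLAME group
    12: {14}, 13: {15},                    # IQ group
    14: {12, 16}, 15: {13, 17},            # IQ: 14 -> 16 and 15 -> 17 also allowed
    16: set(), 17: set(),                  # IQ: 16 and 17 cannot cascade back
}


def can_cascade(source, target):
    """True if an overflow of `source` can enable `target` via CCCR[Cascade]."""
    return target in CASCADE_TARGETS[source]
```

To use a permitted pairing, the Cascade bit is set in the CCCR of the target counter, not the source.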

Interrupt on Overflow

Figure 56-25 on page 1405 illustrates an example wherein the overflow of the second counter then results in the generation of a Performance Monitor interrupt. A counter is enabled to do so by setting the PMI bit = 1 in its CCCR. Refer to “The Performance Counter Overflow Interrupt” on page 1547 for a detailed description of this interrupt.

Figure 56-25. Cascading Counter Example

It should be noted that Pentium® 4 processors with a Model number of 0 or 1 and a Stepping > 09h, as well as processors with a Model of 2 have an erratum that prevents them from generating a Performance Monitor interrupt if cascading or extended cascading (refer to “Extended Cascading” on page 1406) is enabled.

Extended Cascading

This feature is model-specific. It is only implemented in the CCCRs associated with Counters 12, 15, 16 and 17, and is only available in Pentium® 4 processors with Family = Fh and a Model number >= 2 (obtained by executing a CPUID request type 1, which returns the family, model and stepping in EAX).

If bit 11 = 1 in the CCCR associated with counter 12, then counter 16 can cascade to counter 12.
If bit 11 = 1 in the CCCR associated with counter 15, then counter 17 can cascade to counter 15.
If bit 11 = 1 in the CCCR associated with counter 16, then counter 17 can cascade to counter 16.
If bit 11 = 1 in the CCCR associated with counter 17, then counter 16 can cascade to counter 17.

If extended cascading is to be used, the programmer sets CCCR bit 11 to one rather than the Cascade bit.
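The availability check and the four permitted extended cascades can be sketched as follows. The decoding of the family and model fields from the EAX value returned by CPUID request type 1 (base family in bits 11:8, base model in bits 7:4, extended model in bits 19:16, extended family in bits 27:20) follows Intel's documented signature format; the function and table names are hypothetical, and the EAX value is assumed to have been obtained elsewhere.

```python
# Target counter -> the counter whose overflow enables it when
# bit 11 is set in the target's CCCR.
EXTENDED_CASCADE_SOURCE = {12: 16, 15: 17, 16: 17, 17: 16}


def family_model(cpuid1_eax):
    """Decode (family, model) from the EAX value of CPUID request type 1."""
    base_family = (cpuid1_eax >> 8) & 0xF
    family = base_family
    model = (cpuid1_eax >> 4) & 0xF
    if base_family == 0xF:
        family += (cpuid1_eax >> 20) & 0xFF         # add extended family
    if base_family in (0x6, 0xF):
        model |= ((cpuid1_eax >> 16) & 0xF) << 4    # prepend extended model
    return family, model


def supports_extended_cascade(cpuid1_eax):
    """Family must be Fh and the model number must be 2 or greater."""
    family, model = family_model(cpuid1_eax)
    return family == 0xF and model >= 2


def extended_cascade_source(target):
    """Return the counter that can cascade into `target`, or None."""
    return EXTENDED_CASCADE_SOURCE.get(target)
```

For example, a signature of 00000F27h decodes to family Fh, model 2, so bit 11 could be set in the CCCR of counter 12 to let counter 16 cascade into it.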

Accessing the Performance Counters

In the Pentium® 4 processor, the RDPMC (Read Performance Counter) instruction (see “Performance Monitoring” on page 505) was enhanced to allow the full 40-bit counter to be read, or just its lower 32 bits (which is faster; this can be used when the count is small enough to be contained in 32 bits).

The CR4[PCE] bit permits the OS to limit access to the Performance Counters:

CR4[PCE] = 0. Only privilege level 0 code (i.e., the OS kernel) can execute the RDPMC instruction to read a performance counter's contents. An exception is generated whenever a less-privileged program attempts execution of the RDPMC instruction.
CR4[PCE] = 1. Any program can execute the RDPMC instruction to read a performance counter's contents.

Just like the RDTSC instruction (see “RDTSC Is Not a Serializing Instruction” on page 499), the RDPMC instruction is not a serializing instruction (see “Serializing Instructions” on page 1079). It can be executed out-of-order by the processor core and therefore may yield an inaccurate reading. It can be used in conjunction with the CPUID instruction (which is a serializing instruction) to obtain an accurate count.

In some cases, the programmer may want to preload a count into a counter prior to enabling the counter (so that the counter generates an overflow when that count has been reached). To do so, write the start count into the counter as a twos-complement negative integer. The counter then counts from the start value up to -1 and then overflows. A counter is written to using the WRMSR instruction and all 40 bits are written simultaneously.
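The preload value is simply the negated event count truncated to the 40-bit counter width. The sketch below computes it and splits it into the EDX:EAX halves that WRMSR takes; the function names are hypothetical.

```python
COUNTER_BITS = 40
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def preload_value(events_until_overflow):
    """40-bit twos-complement start count that overflows after N events."""
    if not 0 < events_until_overflow <= COUNTER_MASK:
        raise ValueError("count must fit in the 40-bit counter")
    return (-events_until_overflow) & COUNTER_MASK


def split_for_wrmsr(value):
    """Return (EDX, EAX): the upper 8 bits and lower 32 bits of the count."""
    return (value >> 32) & 0xFF, value & 0xFFFFFFFF


# Example: overflow after 1000 events.
start = preload_value(1000)
edx, eax = split_for_wrmsr(start)
```

Counting up 1000 times from this start value takes the counter through -1 (all ones) and on to zero, which is the overflow.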

Halting Event Counting

After being started, a counter continues counting. If and when the counter overflows, it wraps around and continues counting. When the counter wraps around, it sets the CCCR[OVF] bit to indicate that the counter has overflowed. To halt counting, clear CCCR[ENABLE] to 0 (see Figure 56-24 on page 1398).

To halt a cascaded counter (i.e., a counter that was automatically enabled when another counter overflowed), take one of the following actions:

Clear CCCR[Cascade] to 0 in the cascaded counter's CCCR.
Clear CCCR[OVF] to 0 in the other counter's CCCR.

Non-Retirement Event Counting

Introduction

While At-Retirement Counting (see “At-Retirement Event Counting” on page 1409) only counts events related to μops along the correctly predicted path, Non-Retirement Event Counting counts all events of the specified type (even if the event is associated with a μop that resides on a mispredicted path and is therefore not ultimately retired).

Table 56-7 on page 1385 provides a listing of the Non-Retirement events by Event Class and Event Mask. These events can be counted using either of the following two counting methods:

A running event count. A counter is set up to count one or more event types. Software periodically reads the counter to determine how many of the selected event type(s) have occurred since the last time the counter was read.
Interrupt on counter overflow. Intel® documentation refers to this as Non-Precise Event-based Sampling (NPEBS if you will):
- A counter is set up to count one or more event types.
- It is preset with an initial count.
- It is enabled to generate a Performance Counter interrupt when the counter has an overflow condition.
- When the counter overflows, the handler is called.
- The interrupt handler records the Return Instruction Pointer (RIP), resets the count to its initial programmed value, restarts the counter, and returns to the interrupted program.

Non-Retirement events may not be counted using the PEBS mechanism.
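The interrupt-on-overflow method described above can be modelled in a few lines. This is an idealised simulation, not driver code: each element of the input stands for one counted event and carries the instruction pointer at which it occurred, the interrupt handler is represented by a list append, and the delay between the overflow and the delivery of the interrupt is ignored. All names are hypothetical.

```python
COUNTER_MASK = (1 << 40) - 1   # counters are 40 bits wide


def sample_on_overflow(event_ips, period):
    """Return the instruction pointers a simulated handler would record.

    event_ips: iterable of instruction pointers, one per counted event.
    period:    number of events between samples.
    """
    preset = (-period) & COUNTER_MASK   # initial count, twos-complement
    counter = preset
    samples = []
    for ip in event_ips:
        counter = (counter + 1) & COUNTER_MASK
        if counter == 0:                # wrapped past -1: overflow
            samples.append(ip)          # handler records the pointer...
            counter = preset            # ...and re-arms the counter
    return samples


# Ten events at instruction pointers 0..9, sampling every third event.
recorded = sample_on_overflow(range(10), period=3)
```

A histogram of the recorded pointers then indicates which regions of code generate the most events.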

The Set Up

In order to program the Performance Monitoring logic to count one or more Non-Retirement events, the programmer first decides which event or events are to be counted (see Table 56-7 on page 1385). For each desired event type, the following general series of steps are then accomplished:

1.	In the “ESCR” column of Table 56-7 on page 1385, the programmer selects an ESCR that supports the desired event type.
2.	The programmer finds the CCCR and Counter associated with the selected ESCR using Table 56-5 on page 1376.
3.	The programmer sets up the ESCR for the specific event or events to be counted and the privilege levels they are to be counted at.
4.	The programmer sets up the CCCR: - Selects the ESCR. - Optionally selects the desired event filters (see “The Event Filtering Mechanism” on page 1408). - Optionally enables counter cascading (see “Counter Cascading” on page 1401). - Optionally enables the generation of the Performance Monitor interrupt (PMI) when the counter overflows. If this feature is enabled, the Local APIC's PMI LVT entry (see “The Performance Counter Overflow Interrupt” on page 1547) must also be set up and a PMI handler must be in place.
5.	In the CCCR, the programmer then enables the counter to begin counting.

The Event Filtering Mechanism

Introduction

During a given clock cycle, the Performance Monitoring logic may detect multiple instances of an event selected for counting. When the event filtering mechanism is disabled, the actual event count in a given clock cycle is added to the counter that has been set up to count the respective event.

It may be, however, that the programmer only wants to increment the counter (by one) when the per clock count is > or <= a threshold value that has been specified in the CCCR associated with the counter.

Threshold Comparison

To utilize this capability, the programmer uses the following CCCR fields:

- The CCCR[Compare] bit must be set to one.

- The CCCR[Threshold] is set to the threshold count value. It is a 4-bit field, so the Threshold value can be between 0 and 15d.

- The CCCR[Complement] bit is set to the appropriate value:

- Complement = 0: A per clock event count > the Threshold value results in the counter being incremented by one (NOT by the actual event count detected in the clock cycle).

- Complement = 1: A per clock event count <= the Threshold value results in the counter being incremented by one (NOT by the actual event count detected in the clock cycle).

As an example, if Complement = 0 and Threshold = 6, a count of 7 or greater in a clock cycle causes the counter to be incremented by one, while a count less than 7 results in the counter not being incremented in that clock cycle. If Complement = 1, a count value in a given clock cycle from 0 to 6 causes the counter to be incremented by one, while any count value from 7 to 15 results in the counter not being incremented in that clock cycle.
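The comparison rule can be written as a one-line predicate. The sketch below returns the amount added to the counter in one clock; the function name counter_increment is hypothetical.

```python
def counter_increment(per_clock_count, compare, threshold=0, complement=False):
    """Amount added to the counter for one clock cycle.

    With filtering disabled (compare False) the raw per-clock event count
    is added. With filtering enabled the counter advances by at most one.
    """
    if not compare:
        return per_clock_count
    if complement:
        passed = per_clock_count <= threshold
    else:
        passed = per_clock_count > threshold
    return 1 if passed else 0


# Per-clock event counts over five clocks, Threshold = 6.
clocks = [0, 6, 7, 15, 3]
above = sum(counter_increment(c, True, 6, complement=False) for c in clocks)
at_or_below = sum(counter_increment(c, True, 6, complement=True) for c in clocks)
unfiltered = sum(counter_increment(c, False) for c in clocks)
```

In this example two clocks exceed the threshold and three do not, while the unfiltered counter would have advanced by the total number of events.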

The Threshold Condition Transition Filter

It may be that the programmer only wants to increment a counter by one when the condition being measured by the Threshold comparison is false in one clock and then true in the following clock. This capability is enabled by setting CCCR[Edge] = 1.
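The effect of the Edge bit can be illustrated by tracking the threshold comparison result from one clock to the next. In the sketch below, the comparison is assumed to be false before the first clock; that starting state and the function name are assumptions made for the example.

```python
def count_rising_edges(per_clock_counts, threshold, complement=False):
    """Count false-to-true transitions of the threshold comparison."""
    previous = False   # assumed state before the first clock
    total = 0
    for count in per_clock_counts:
        if complement:
            current = count <= threshold
        else:
            current = count > threshold
        if current and not previous:
            total += 1                 # comparison just became true
        previous = current
    return total


# Threshold = 6: the comparison is true in clocks 1-2 and again in 4-5.
edges = count_rising_edges([0, 7, 8, 0, 9, 9, 0], threshold=6)
```

Without edge detection the same sequence would advance the counter four times (once per clock in which the comparison is true); with it, the counter advances once per burst.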

At-Retirement Event Counting

First, Some Terminology

Bogus, Non-Bogus, Retire

In Table 56-8 on page 1395, the term “bogus” refers to IA32 instructions or μops that are canceled because they are on a program path taken due to a mispredicted branch. The terms “retired” and “non-bogus” refer to IA32 instructions or μops along a correctly-predicted path.

Tagging

The same event can happen to a μop more than once while it is being executed. Just counting the number of times the event occurred would not indicate how many μops experienced that event. A μop can be tagged once during its lifetime and counted once at retirement. In the Intel® documentation, the “retired” suffix is appended to performance metrics that increment a count once per μop, rather than once per event.

As an example, a μop could experience a cache miss more than once while it is being executed, but a “Miss Retired” event (counts the number of retired μops that experienced a cache miss) only increments once for that μop. This can be used to measure the performance of the cache hierarchy for a code fragment.

Replay

To achieve the optimum performance when dealing with commonly-encountered cases, the schedulers sometimes schedule μops for execution before all of the conditions necessary for correct execution are guaranteed to be satisfied. Obviously, the hope is that by the time the μop is actually dispatched for execution, the condition(s) will have been met.

If they have not, the μop must be re-issued and this is referred to as Replay. Some causes of replays are:

- Cache misses.

- Dependence violations (e.g., store forwarding problems; see “Store-to-Load Forwarding” on page 1070 for more information).

- Unforeseen resource constraints.

The processor will always experience some replays, but too many replays indicate that the code in question should be tuned.

Assist

When the hardware needs the assistance of microcode (supplied from the MS ROM) to deal with an event, the machine is said to “take an assist”. An example would be an underflow condition in the input operands of a FP operation. In this case, the processor must modify the operand format to perform the computation. Assists clear the entire machine of μops before they begin and therefore cause a severe dip in performance.

General

At-Retirement Counting only counts events related to instructions along the correctly predicted path. If a performance counter had been set up to count all executed instructions, the count would also include events related to instructions that were executed along a mispredicted path.

Using the tagging mechanism, a counter can be set up to count only the events related to instructions along the correctly predicted path. This is referred to as At-Retirement Counting.

Table 56-8 on page 1395 lists the At-Retirement Event Classes as well as the events within each class.

The Tagging Mechanisms

Introduction

A counter can count each incident of an event, or the number of μops that experienced the event. A μop may be tagged when it encounters some of the events listed in Table 56-8 on page 1395. The tagging mechanisms can be used in Non-Precise Event-Based Sampling (NPEBS; see “There Are Three Sampling Methods” on page 1374), and some of the mechanisms can be used in PEBS (Precise Event-Based Sampling; see “There Are Three Sampling Methods” on page 1374). There are four tagging mechanisms:

- Front-End tagging. This mechanism tags μops that experience front-end-related events (Trace Cache and μop decode-related events). They are counted using the “Front_end_event” event.

- Execution tagging. This mechanism tags μops that experience execution-related events (e.g., instruction types). They are counted using the “Execution_Event” event.

- Replay tagging. This mechanism tags μops that must be replayed (e.g., a cache miss) as well as mispredicted branches. They are counted using the “Replay_event” event.

- No tags. This mechanism does not use tags. Rather, it counts retired IA32 instructions using the “Instr_retired” event, and retired μops using the “μops_retired” event.

Multi-Tagging

The tagging mechanisms are independent of each other. A μop tagged using one mechanism is not detected by another mechanism's tagged-μop detector. As an example, a μop tagged by the Front-End tagging mechanism is not counted by the “Execution_Event” unless it was also tagged by the Execution tagging mechanism. It should be noted that execution tags allow up to four different types of μops to be counted at retirement.

PEBS and Multi-Tagging

When using PEBS, however, only one tagging mechanism can be used at a time.

Some μops Cannot Be Tagged

The following μops cannot be tagged: IO, uncacheable accesses, locked accesses, Return μops, far jumps, and far calls.

Front-End Tagging

The Front_end_event counts μops with tags that indicate they have experienced any of the following events:

- μop Decode events. Tagging a μop when it encounters a μop decode event requires specifying bits in the ESCR associated with the μop_type event class.

- Trace Cache events. Tagging a μop when it encounters a Trace Cache event is accomplished by a bit in the ESCR associated with the μop_type event class.

The Front_end_event is defined in Table 56-8 on page 1395. None of the events currently supported requires the use of the MSR_TC_Precise_Event MSR, but some may in the future.

Execution Tagging

The Execution_event is defined in Table 56-8 on page 1395.

The execution tagging mechanism uses two ESCRs:

- One upstream ESCR specifies the event to detect and assigns a 4-bit Tag (in the ESCR's Tag Value field) to identify that event. This ESCR must have its Tag Enable bit = 1. The 4-bit Tag is actually a mask that specifies which tag bit(s) should be set for a particular μop. The Tag mask must match the Event Mask bit setting in the downstream ESCR. As an example, if the Tag value in the upstream ESCR is 1h (a mask value of 0001b), then the Event Mask field in the downstream ESCR should be set as follows (see the Execution_event class in Table 56-8 on page 1395):

- Bit 0, the NBOGUS0 bit, = 1b.

- Bit 1, the NBOGUS1 bit, = 0b.

- Bit 2, the NBOGUS2 bit, = 0b.

- Bit 3, the NBOGUS3 bit, = 0b.

- The second, downstream ESCR is used to detect μops with that Tag value using the Execution_event class in the ESCR's Event Select field. This ESCR's Event Mask bits specify which tag bits accompanying a μop to count. If any of the tag bits that accompany a μop select a mask bit that = 1, the related counter is incremented by one. If more than one mask bit is selected by the bits in the μop's tag, the counter is incremented once for each matching bit. The Tag Enable and Tag value in the downstream ESCR are “don't care”.
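The matching rule between a μop's tag bits and the downstream Event Mask can be sketched as a bitwise AND followed by a count of the surviving bits. The sketch only models retired (non-bogus) μops and the NBOGUS0 through NBOGUS3 mask bits described above; the function name is hypothetical.

```python
def execution_event_increment(uop_tag, nbogus_mask):
    """Counter increment for one retired μop.

    uop_tag:     4-bit tag accumulated by the μop from upstream ESCRs.
    nbogus_mask: NBOGUS0..NBOGUS3 bits (bits 0-3) of the downstream
                 ESCR's Event Mask.
    One increment is produced for each tag bit whose mask bit is set.
    """
    matching = uop_tag & nbogus_mask & 0xF
    return bin(matching).count("1")


# Upstream Tag value 0001b, downstream NBOGUS0 = 1: the μop is counted.
hit = execution_event_increment(0b0001, 0b0001)
# A μop tagged 0010b by a different upstream ESCR is ignored.
miss = execution_event_increment(0b0010, 0b0001)
```

Because each upstream ESCR can use a different tag bit, up to four kinds of μops can be tagged concurrently and separated again at retirement by the choice of mask bits.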

The author is puzzled by the fact that eight Event Mask bits (rather than four) are shown in the Execution_Event class entry of Table 56-8 on page 1395.

Replay Tagging

This mechanism tags μops that must be replayed (e.g., a cache miss) as well as mispredicted branches. They are counted using the “Replay_event” event. The Replay_event is defined in Table 56-8 on page 1395. Replay tagging is enabled with the μop_Tag bit (bit 24) in the IA32_PEBS_ENABLE MSR.

The Replay tagging mechanism requires selecting:

- The type of μop that may experience the replay in the MSR_PEBS_MATRIX_VERT MSR.

- The type of event in the IA32_PEBS_ENABLE MSR.

Table 56-11 on page 1413 lists the information used to set up a counter to count Replay events. The setup information in this table enables Precise Event-Based Sampling (see “There Are Three Sampling Methods” on page 1374). Non-Precise Event-Based Sampling can be used by not setting bits 24 or 25 in the IA32_PEBS_ENABLE MSR (see Figure 56-26 on page 1417).

Table 56-11. Replay Tagging Setup
Replay Event	IA32_PEBS_ENABLE bits to set	MSR_PEBS_MATRIX_VERT bits to set	Additional Setup Info	Event Mask Value
1stL_cache_load_miss_retired	0, 24 and 25	0	None	NBOGUS
2ndL_cache_load_miss_retired	1, 24 and 25	0	None	NBOGUS
DTLB_load_miss_retired	2, 24 and 25	0	None	NBOGUS
DTLB_store_miss_retired	2, 24 and 25	1	None	NBOGUS
DTLB_all_miss_retired	2, 24 and 25	0 and 1	None	NBOGUS
MOB_load_replay_retired	9, 24 and 25	0	In the ESCR, select the MOB_load_replay event and set the PARTIAL_DATA and UNALGN_ADDR Event Mask bits.	NBOGUS
split_load_retired	10, 24 and 25	0	In MSR_SAAT_ESCR1, select the load_port_replay event and set the SPLIT_LD mask bit.	NBOGUS
split_store_retired	10, 24 and 25	1	In MSR_SAAT_ESCR0, select the store_port_replay event and set the SPLIT_ST mask bit.	NBOGUS
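The bit lists in Table 56-11 translate directly into the values written to the two MSRs. The sketch below transcribes the table into a dictionary and converts each bit list into a register value; the names REPLAY_SETUP, bits_to_value and replay_msr_values are hypothetical, and the additional ESCR setup listed in the table is not modelled.

```python
# event name: (IA32_PEBS_ENABLE bits, MSR_PEBS_MATRIX_VERT bits),
# transcribed from Table 56-11.
REPLAY_SETUP = {
    "1stL_cache_load_miss_retired": ((0, 24, 25), (0,)),
    "2ndL_cache_load_miss_retired": ((1, 24, 25), (0,)),
    "DTLB_load_miss_retired":       ((2, 24, 25), (0,)),
    "DTLB_store_miss_retired":      ((2, 24, 25), (1,)),
    "DTLB_all_miss_retired":        ((2, 24, 25), (0, 1)),
    "MOB_load_replay_retired":      ((9, 24, 25), (0,)),
    "split_load_retired":           ((10, 24, 25), (0,)),
    "split_store_retired":          ((10, 24, 25), (1,)),
}


def bits_to_value(bits):
    """OR together one set bit per listed bit position."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    return value


def replay_msr_values(event):
    """Return (IA32_PEBS_ENABLE value, MSR_PEBS_MATRIX_VERT value)."""
    pebs_bits, vert_bits = REPLAY_SETUP[event]
    return bits_to_value(pebs_bits), bits_to_value(vert_bits)
```

For example, replay_msr_values("DTLB_all_miss_retired") yields 03000004h for IA32_PEBS_ENABLE and 3h for MSR_PEBS_MATRIX_VERT.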

Precise Event-Based Sampling

General

The Debug Store (DS) mechanism was introduced in the Pentium® 4 processor. A complete description can be found in “The Debug Store (DS) Mechanism” on page 1366. The processor can be set up to automatically store both PEBS records and BTS records in the DS save area in memory.

What the Intel® documentation refers to as Precise Event-Based Sampling (PEBS) could more aptly be named Automatic state save on counter overflow. It works as follows:

A counter is set up to count one or more event types within the same Event Class.
It is preset with an initial count.
It is enabled to store an event record in the Debug Store (DS) memory area each time that the counter has an overflow condition.
When the counter overflows, the processor automatically copies the contents of the GPRs, the EFlags register and EIP into an event record in the DS memory area.
The processor then automatically resets the count to its initial programmed value and resumes counting.
The processor then resumes execution of the program.
When the DS save area is approaching a full condition, a Performance Counter interrupt is generated and the event records currently in the DS save area can be saved to non-volatile memory (e.g., to disk). A circular DS save buffer is not supported for event records.

Automatic state save on counter overflow (i.e., PEBS) is only supported for the following Event Classes within the At-Retirement Event category:

The Execution_event.
The Front_end_event.
The Replay_event.

Limited To a Single Counter

PEBS can only be performed using Counter 16.

Detecting the PEBS Capability

The programmer can determine that a processor supports PEBS by executing a CPUID request type 1 and verifying that EDX[21] = 1. This indicates that the processor supports the DS feature. The programmer then verifies that IA32_MISC_ENABLE[PEBS_UNAVAILABLE] = 0 (see Figure 56-21 on page 1373).

Enabling PEBS

Setting IA32_PEBS_ENABLE[24] (see Figure 56-26 on page 1417) enables the processor's PEBS capability. The reader should also note that the DS capability must have been configured (see “The Debug Store (DS) Mechanism” on page 1366).

The PEBS Interrupt Handler

As mentioned earlier, the processor will generate an interrupt when the PEBS memory buffer in the DS save area is approaching a full condition. See “The Debug Store (DS) Mechanism” on page 1366 for more information.

Sometimes, the DS Feature Is Disabled

The processor automatically disables the DS feature under the following circumstances:

On transitioning to SMM.
When a Machine Check Exception occurs.
When RESET# or INIT# is asserted.

PEBS and Hyper-Threading

In a processor that supports Hyper-Threading, PEBS is enabled and qualified using the following two bits in the IA32_PEBS_ENABLE MSR (this register is replicated for each of the logical processors; see Figure 56-26 on page 1417):

Bit 25. ENABLE_PEBS_MY_THR and
Bit 26. ENABLE_PEBS_OTH_THR.

Software executing on a logical processor uses these two bits to enable PEBS for subsequent threads of execution:

On the same logical processor on which the software is running (“my thread”) or
For the other logical processor in the physical package (“other thread”).

PEBS can be used only with two performance counters:

Counter 16 (MSR_IQ_CCCR4 MSR) for logical processor 0.
Counter 17 (MSR_IQ_CCCR5 MSR) for logical processor 1.

Additional information regarding PEBS on a Hyper-Threading capable processor can be found in section 15.10.4, Performance Monitoring Events, in Intel®'s IA32 Intel® Architecture Software Developer's Manual, Volume 3: System Programming Guide.

Counting Clocks

Introduction

Processor clock cycles are referred to as clockticks and can be used to measure how long a program takes to execute, as well as to derive efficiency measurements such as Cycles Per Instruction (CPI).

There are three processor clock cycle measurements:

Non-Halted Clockticks. This measurement counts the clock cycles during which the specified logical processor is not halted and is not in any power-saving state. If Hyper-Threading is enabled, this measurement can be performed on a per logical processor basis. This measurement is taken using a Performance Counter and can be set up to cause an interrupt upon overflow. The processor clock is stopped under the following circumstances:
- When the processor enters the Sleep power conservation state (see “The Sleep State” on page 692).
- When the processor enters the Deep Sleep power conservation state (see “The Deep Sleep State” on page 693).
See “The Non-Halted Clockticks Measurement” on page 1418 for more information.
Non-Sleep Clockticks. This measurement counts the clock cycles during which the physical processor is not in the Sleep mode. This measurement cannot be taken on a per logical processor basis. This measurement is taken using a Performance Counter and can be set up to cause an interrupt upon overflow. See “The Non-Sleep Clockticks Measurement” on page 1419 for more information.
Time Stamp Counter. This measurement counts the clock cycles during which the physical processor is not in Deep Sleep state. These ticks cannot be measured on a per logical processor basis.

For applications wherein the processor is halted during some periods, there are two ratios of interest:

Non-Halted CPI: The Non-Halted Clockticks Per Instructions Retired ratio measures the CPI only during non-halted periods of time (i.e., while the processor is actually executing code). This ratio can be measured on a per logical processor basis when Hyper-Threading Technology is enabled.
Nominal CPI: The TSC Ticks Per Instructions Retired ratio measures the CPI over the full period of a program, including those periods of time while the processor is halted.
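The two ratios can be illustrated with a minimal sketch in C. The structure and function names below are hypothetical and chosen only for illustration; the sketch assumes the three counts were captured over the same interval, and it reports 0 when no instructions were retired because the ratio is undefined in that case.

```c
#include <stdint.h>

/* Hypothetical container for the raw counts gathered over one interval.
   The field names are illustrative, not taken from Intel's documentation. */
struct clock_sample {
    uint64_t non_halted_ticks;      /* Non-Halted Clockticks counted */
    uint64_t tsc_ticks;             /* Time Stamp Counter ticks elapsed */
    uint64_t instructions_retired;  /* instructions retired in the interval */
};

/* Non-Halted CPI: cycles per instruction while the processor was running. */
double non_halted_cpi(const struct clock_sample *s)
{
    if (s->instructions_retired == 0)
        return 0.0;  /* ratio undefined; this sketch reports 0 */
    return (double)s->non_halted_ticks / (double)s->instructions_retired;
}

/* Nominal CPI: cycles per instruction over the whole interval,
   halted periods included. */
double nominal_cpi(const struct clock_sample *s)
{
    if (s->instructions_retired == 0)
        return 0.0;  /* ratio undefined; this sketch reports 0 */
    return (double)s->tsc_ticks / (double)s->instructions_retired;
}
```

A Nominal CPI that is much larger than the Non-Halted CPI suggests that the program spent a large share of the interval halted rather than executing inefficiently.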

The Non-Halted Clockticks Measurement

As mentioned earlier, this measurement is taken using a Performance Counter in the following manner:

In an ESCR, select the global_power_events Event Class, set the RUNNING Event Mask bit, and set the appropriate thread-qualification bits (T0_OS, T0_USR, T1_OS, T1_USR) for the targeted logical processor.
Set the ESCR Select field in a CCCR to select that ESCR.
Enable counting in the CCCR for that counter by setting the Enable bit.

If Hyper-Threading is enabled, the count may include some of the clock cycles that the logical processor needs to complete its transition to a halted state.

If Hyper-Threading is enabled and both logical processors execute the HLT instruction, the physical processor enters the AutoHalt Powerdown power conservation state (see “The AutoHalt Power Down State” on page 686).
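The three setup steps for the Non-Halted Clockticks measurement can be sketched in C as two functions that only compute the register values. Everything numeric here is an assumption that should be checked against the ESCR and CCCR format figures and Intel's event tables: the bit positions, the global_power_events Event Select code (13h), its RUNNING mask bit (bit 0), and the ESCR Select code (06h). The function names are hypothetical, and writing the values to the MSRs (which requires privilege level 0) is not shown. Other CCCR fields are left at zero and may need to be set on real hardware.

```c
#include <stdint.h>

/* Assumed ESCR bit positions (Hyper-Threading layout). */
#define ESCR_T1_USR              (1u << 0)
#define ESCR_T1_OS               (1u << 1)
#define ESCR_T0_USR              (1u << 2)
#define ESCR_T0_OS               (1u << 3)
#define ESCR_EVENT_MASK_SHIFT    9
#define ESCR_EVENT_SELECT_SHIFT  25

/* Assumed CCCR bit positions. */
#define CCCR_ENABLE              (1u << 12)
#define CCCR_ESCR_SELECT_SHIFT   13

/* Assumed encodings for the global_power_events Event Class. */
#define GPE_EVENT_SELECT         0x13u
#define GPE_ESCR_SELECT          0x06u
#define GPE_MASK_RUNNING         (1u << 0)

/* Step 1: ESCR value that selects global_power_events with RUNNING set,
   counting both OS and user mode on the given logical processor (0 or 1). */
uint32_t non_halted_escr(int logical_cpu)
{
    uint32_t thread_bits = (logical_cpu == 0)
        ? (ESCR_T0_OS | ESCR_T0_USR)
        : (ESCR_T1_OS | ESCR_T1_USR);

    return (GPE_EVENT_SELECT << ESCR_EVENT_SELECT_SHIFT)
         | (GPE_MASK_RUNNING << ESCR_EVENT_MASK_SHIFT)
         | thread_bits;
}

/* Steps 2 and 3: CCCR value that points at that ESCR and enables counting. */
uint32_t non_halted_cccr(void)
{
    return (GPE_ESCR_SELECT << CCCR_ESCR_SELECT_SHIFT) | CCCR_ENABLE;
}
```

Keeping the value computation separate from the privileged MSR write makes the encoding easy to inspect and test in user mode.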

The Non-Sleep Clockticks Measurement

As mentioned earlier, this measurement is taken using a Performance Counter in the following manner:

Choose a counter to use for the measurement.
Choose an ESCR associated with that counter.
Set that ESCR's Event Select field to any Event Class other than the “no_event” Event Class.
Set CCCR[Compare] = 1.
Set CCCR[Threshold] = 15d.
Set CCCR[Complement] = 1.
This setup causes the counter to count every cycle. Note that this overrides any other qualifications (e.g., by CPL) that may be specified in the ESCR.
Set CCCR[Enable] = 1 to enable the counter.

The counter continues to increment as long as at least one logical processor is still running.
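A minimal C sketch of the CCCR setup for the Non-Sleep Clockticks measurement follows, together with a small model of why it counts every cycle. The bit positions are assumptions to be verified against the CCCR format figure, the function names are hypothetical, and the model assumes the usual threshold rule: with Complement set, a cycle is counted when the number of events in that cycle is less than or equal to the threshold; with Complement clear, when it is greater.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed CCCR bit positions. */
#define CCCR_ENABLE             (1u << 12)
#define CCCR_ESCR_SELECT_SHIFT  13
#define CCCR_COMPARE            (1u << 18)
#define CCCR_COMPLEMENT         (1u << 19)
#define CCCR_THRESHOLD_SHIFT    20
#define CCCR_THRESHOLD_MAX      15u   /* the Threshold field is 4 bits wide */

/* CCCR value for the Non-Sleep Clockticks setup. escr_select is the
   3-bit code of whichever ESCR was chosen for the counter. */
uint32_t non_sleep_cccr(uint32_t escr_select)
{
    return ((escr_select & 0x7u) << CCCR_ESCR_SELECT_SHIFT)
         | CCCR_COMPARE
         | CCCR_COMPLEMENT
         | (CCCR_THRESHOLD_MAX << CCCR_THRESHOLD_SHIFT)
         | CCCR_ENABLE;
}

/* Model of the threshold filter for a single clock cycle. */
bool threshold_passes(unsigned events_this_cycle, unsigned threshold,
                      bool complement)
{
    return complement ? (events_this_cycle <= threshold)
                      : (events_this_cycle > threshold);
}
```

Because a 4-bit comparison can never see more than 15 events in one cycle, "less than or equal to 15" is true on every cycle, including cycles with zero events. That is why the selected event and its ESCR qualifications no longer matter.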

The Time Stamp Counter

The TSC continues to increment unless one of the following is true:

RESET# is asserted.
The processor enters the Sleep power conservation state.
The processor enters the Deep Sleep power conservation state.

The counter can be read by executing the RDTSC instruction (please also refer to “Time Stamp Counter” on page 498). Computing the difference in values between two reads (modulo 2⁶⁴) yields the number of processor clocks between reads.
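The modulo 2⁶⁴ difference falls out of unsigned 64-bit arithmetic in C, as the following sketch shows. The function name is hypothetical, and the two arguments are assumed to be values already obtained from RDTSC; reading the counter itself is not shown.

```c
#include <stdint.h>

/* Number of processor clocks between two TSC reads. Unsigned subtraction
   wraps modulo 2^64, so the result is still correct if the counter rolled
   over once between the reads. */
uint64_t tsc_delta(uint64_t first_read, uint64_t second_read)
{
    return second_read - first_read;
}
```

For example, a first read of 2⁶⁴ - 10 followed by a second read of 5 yields a delta of 15 clocks.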

