Whether or not a processor implements a Performance Monitoring facility and, if so, the method of implementation are design-specific. The facility is not part of the IA32 processor architecture specification. The implementations on the Pentium®, P6 and Pentium® 4 processor families are not compatible with each other.
The author had a considerable amount of trouble achieving a detailed understanding of every aspect of this feature. The Intel® documentation of this feature is somewhat confusing in some areas.
The Pentium® and P6 processor families implemented two Performance Counters, permitting the simultaneous measurement of two event types during a given period of time. The Pentium® 4 processor family expanded the number of counters to 18, and all of the registers associated with the facility are MSRs:
There are 18 Performance Counters, each of which is 40 bits wide.
There is one Counter Configuration Control register (CCCR) associated with each of the counters. A counter's CCCR is used to set up its associated counter for a specific method or style of counting.
There are 45 Event Selection Control registers (ESCRs) used to select the type of event (or events) to be measured by each counter.
In addition, the processor has the ability to store certain types of event records in a special memory buffer referred to as the Debug Store (DS) save area:
The IA32_DS_AREA MSR is programmed with the location of the DS save area in memory.
The programmer can determine whether or not a processor supports the DS mechanism by executing a CPUID request type 1 and verifying that EDX[21] = 1.
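The DS support check can be sketched in C as below. This is a minimal illustration, not a definitive implementation: the function name `ds_supported` is hypothetical, and the caller is assumed to have already obtained the EDX value returned by a CPUID request type 1.

```c
#include <assert.h>
#include <stdint.h>

/* Per the text: after CPUID request type 1, EDX[21] = 1 means the
 * Debug Store (DS) mechanism is supported. */
#define CPUID1_EDX_DS_BIT 21u

/* Hypothetical helper: returns 1 if the supplied EDX value advertises DS,
 * 0 otherwise. Obtaining EDX from CPUID is left to the caller. */
int ds_supported(uint32_t cpuid1_edx)
{
    return (int)((cpuid1_edx >> CPUID1_EDX_DS_BIT) & 1u);
}

/* Usage example with made-up EDX values. */
void example_usage(void)
{
    assert(ds_supported(1u << 21) == 1);  /* only bit 21 set   */
    assert(ds_supported(0u) == 0);        /* no feature bits   */
}
```

Keeping the bit test separate from the CPUID instruction itself lets the check be exercised without executing CPUID.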
The IA32_MISC_ENABLE MSR (see Figure 56-21 on page 1373) contains two bits associated with the Performance Monitoring facility:
The Performance Monitoring Available bit is a read-only bit that indicates whether or not the processor supports the Performance Monitoring facility.
The Precise Event-Based Sampling Unavailable bit is a read-only bit that indicates whether or not the processor supports the form of event counting referred to as Precise Event-Based Sampling (PEBS).
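As a rough sketch, the two IA32_MISC_ENABLE bits can be tested as follows. The bit positions used here (bit 7 for Performance Monitoring Available, bit 12 for PEBS Unavailable) are assumptions not stated in the text above and should be checked against the referenced figure; the function names are hypothetical, and the 64-bit MSR value is assumed to have been read by privileged code.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED bit positions; the text names the bits but not their positions. */
#define MISC_ENABLE_PERFMON_AVAILABLE (1ull << 7)
#define MISC_ENABLE_PEBS_UNAVAILABLE  (1ull << 12)

/* Returns 1 if the Performance Monitoring Available bit is set. */
int perfmon_available(uint64_t misc_enable)
{
    return (misc_enable & MISC_ENABLE_PERFMON_AVAILABLE) != 0;
}

/* Returns 1 if performance monitoring is available and the
 * PEBS Unavailable bit is clear. */
int pebs_available(uint64_t misc_enable)
{
    return perfmon_available(misc_enable) &&
           (misc_enable & MISC_ENABLE_PEBS_UNAVAILABLE) == 0;
}

/* Usage example with made-up MSR values. */
void example_usage(void)
{
    assert(pebs_available(1ull << 7) == 1);
    assert(pebs_available((1ull << 7) | (1ull << 12)) == 0);
}
```

Note that the PEBS bit has inverted sense: a 1 means the feature is unavailable.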
Each event type that can be counted falls into one of two categories:
A non-retirement event is an event that occurs any time during instruction execution (e.g., FSB transactions or cache accesses).
An at-retirement event is an event that is counted when a μop is retired. When a μop experienced a specified event type during its execution, it can be tagged with an identifier. This permits the Performance Monitoring logic to sort which events occurred on an execution path that was correctly predicted versus a path that was mispredicted (and the results of those instructions are therefore not committed to the processor's register set). Intel® documentation sometimes refers to these as non-bogus versus bogus events.
A counter can be programmed to use one of the following three methods of counting:
A running event count. A counter is set up to count one or more event types. Software periodically reads the counter to determine how many of the selected event type(s) have occurred since the last time the counter was read.
Interrupt on counter overflow. Intel® documentation refers to this as Non-precise event-based sampling (NPEBS, if you will):
- A counter is set up to count one or more event types.
- It is preset with an initial count.
- It is enabled to generate a Performance Counter interrupt when the counter has an overflow condition.
- When the counter overflows, the handler is called.
- The interrupt handler records the Return Instruction Pointer (RIP), resets the count to its initial programmed value, restarts the counter, and returns to the interrupted program.
Automatic state save on counter overflow. Intel® documentation refers to this as Precise Event-Based Sampling (PEBS):
- A counter is set up to count one or more event types.
- It is preset with an initial count.
- It is enabled to store an event record in the Debug Store (DS) memory area when the counter has an overflow condition.
- When the counter overflows, the processor automatically copies the contents of the GPRs, the EFlags register and EIP into an event record in the DS memory area.
- The processor then automatically resets the count to its initial programmed value and resumes counting.
- The processor then resumes execution of the program.
- When the DS save area is approaching a full condition, a Performance Counter interrupt is generated, and the event records currently in the DS save area can be saved to non-volatile memory (e.g., to disk). A circular DS save buffer is not supported for event records.
Refer to Figure 56-22 on page 1375. There is a 1-to-1 hardwired relationship between each of the 18 counters and each of the 18 CCCRs. A counter's CCCR is used to set up and enable the counter for a specific method or style of counting.
Regarding the ESCRs, each counter is associated with a group of ESCRs, and the number of ESCRs associated with a counter is counter-specific (there can be up to eight ESCRs related to a counter). When a counter's CCCR is configured, one of the items it is configured with is a 3-bit number identifying which of the counter's related ESCRs defines the event(s) that will be counted. It should also be noted that an ESCR can be related to more than one counter.
Table 56-5 on page 1376 defines the relationship of each counter to its CCCR and to its related ESCRs. The following items are listed in each column:
The first column contains the counter's name.
The second column contains the counter's number.
The third column contains the counter's MSR address.
The fourth column contains the name of the counter's CCCR.
The fifth column contains the CCCR's MSR address.
The sixth column contains the names of the ESCRs that can feed event counts to the counter.
The seventh column contains the 3-bit ID of the ESCR. In order to connect the ESCR to a particular counter, the CCCR associated with the counter must be programmed with this ID.
The eighth and final column contains the ESCR's MSR address.
Counter | CCCR | ESCRs Associated With the Counter | |||||
---|---|---|---|---|---|---|---|
Name | No. | MSR | Name | MSR | Name | No. | MSR |
MSR_BSU_ESCR0 | 7 | 3A0h | |||||
MSR_FSB_ESCR0 | 6 | 3A2h | |||||
MSR_MOB_ESCR0 | 2 | 3AAh | |||||
MSR_BPU_COUNTER0 | 0 | 300h | MSR_BPU_CCCR0 | 360h | MSR_PMH_ESCR0 | 4 | 3ACh |
MSR_BPU_ESCR0 | 0 | 3B2h | |||||
MSR_IS_ESCR0 | 1 | 3B4h | |||||
MSR_ITLB_ESCR0 | 3 | 3B6h | |||||
MSR_IX_ESCR0 | 5 | 3C8h | |||||
MSR_BSU_ESCR0 | 7 | 3A0h | |||||
MSR_FSB_ESCR0 | 6 | 3A2h | |||||
MSR_MOB_ESCR0 | 2 | 3AAh | |||||
MSR_BPU_COUNTER1 | 1 | 301h | MSR_BPU_CCCR1 | 361h | MSR_PMH_ESCR0 | 4 | 3ACh |
MSR_BPU_ESCR0 | 0 | 3B2h | |||||
MSR_IS_ESCR0 | 1 | 3B4h | |||||
MSR_ITLB_ESCR0 | 3 | 3B6h | |||||
MSR_IX_ESCR0 | 5 | 3C8h | |||||
MSR_BSU_ESCR1 | 7 | 3A1h | |||||
MSR_FSB_ESCR1 | 6 | 3A3h | |||||
MSR_MOB_ESCR1 | 2 | 3ABh | |||||
MSR_PMH_ESCR1 | 4 | 3ADh | |||||
MSR_BPU_COUNTER2 | 2 | 302h | MSR_BPU_CCCR2 | 362h | MSR_BPU_ESCR1 | 0 | 3B3h |
MSR_IS_ESCR1 | 1 | 3B5h | |||||
MSR_ITLB_ESCR1 | 3 | 3B7h | |||||
MSR_IX_ESCR1 | 5 | 3C9h | |||||
MSR_BSU_ESCR1 | 7 | 3A1h | |||||
MSR_FSB_ESCR1 | 6 | 3A3h | |||||
MSR_MOB_ESCR1 | 2 | 3ABh | |||||
MSR_BPU_COUNTER3 | 3 | 303h | MSR_BPU_CCCR3 | 363h | MSR_PMH_ESCR1 | 4 | 3ADh |
MSR_BPU_ESCR1 | 0 | 3B3h | |||||
MSR_IS_ESCR1 | 1 | 3B5h | |||||
MSR_ITLB_ESCR1 | 3 | 3B7h | |||||
MSR_IX_ESCR1 | 5 | 3C9h | |||||
MSR_MS_ESCR0 | 0 | 3C0h | |||||
MSR_MS_COUNTER0 | 4 | 304h | MSR_MS_CCCR0 | 364h | MSR_TBPU_ESCR0 | 2 | 3C2h |
MSR_TC_ESCR0 | 1 | 3C4h | |||||
MSR_MS_ESCR0 | 0 | 3C0h | |||||
MSR_MS_COUNTER1 | 5 | 305h | MSR_MS_CCCR1 | 365h | MSR_TBPU_ESCR0 | 2 | 3C2h |
MSR_TC_ESCR0 | 1 | 3C4h | |||||
MSR_MS_ESCR1 | 0 | 3C1h | |||||
MSR_MS_COUNTER2 | 6 | 306h | MSR_MS_CCCR2 | 366h | MSR_TBPU_ESCR1 | 2 | 3C3h |
MSR_TC_ESCR1 | 1 | 3C5h | |||||
MSR_MS_ESCR1 | 0 | 3C1h | |||||
MSR_MS_COUNTER3 | 7 | 307h | MSR_MS_CCCR3 | 367h | MSR_TBPU_ESCR1 | 2 | 3C3h |
MSR_TC_ESCR1 | 1 | 3C5h | |||||
MSR_FIRM_ESCR0 | 1 | 3A4h | |||||
MSR_FLAME_ESCR0 | 0 | 3A6h | |||||
MSR_FLAME_COUNTER0 | 8 | 308h | MSR_FLAME_CCCR0 | 368h | MSR_DAC_ESCR0 | 5 | 3A8h |
MSR_SAAT_ESCR0 | 2 | 3AEh | |||||
MSR_U2L_ESCR0 | 3 | 3B0h | |||||
MSR_FIRM_ESCR0 | 1 | 3A4h | |||||
MSR_FLAME_ESCR0 | 0 | 3A6h | |||||
MSR_FLAME_COUNTER1 | 9 | 309h | MSR_FLAME_CCCR1 | 369h | MSR_DAC_ESCR0 | 5 | 3A8h |
MSR_SAAT_ESCR0 | 2 | 3AEh | |||||
MSR_U2L_ESCR0 | 3 | 3B0h | |||||
MSR_FIRM_ESCR1 | 1 | 3A5H | |||||
MSR_FLAME_ESCR1 | 0 | 3A7h | |||||
MSR_FLAME_COUNTER 2 | 10 | 30Ah | MSR_FLAME_CCCR2 | 36Ah | MSR_DAC_ESCR1 | 5 | 3A9h |
MSR_SAAT_ESCR1 | 2 | 3AFh | |||||
MSR_U2L_ESCR1 | 3 | 3B1h | |||||
MSR_FIRM_ESCR1 | 1 | 3A5H | |||||
MSR_FLAME_ESCR1 | 0 | 3A7h | |||||
MSR_FLAME_COUNTER 3 | 11 | 30Bh | MSR_FLAME_CCCR3 | 36Bh | MSR_DAC_ESCR1 | 5 | 3A9h |
MSR_SAAT_ESCR1 | 2 | 3AFh | |||||
MSR_U2L_ESCR1 | 3 | 3B1h | |||||
MSR_CRU_ESCR0 | 4 | 3B8h | |||||
MSR_CRU_ESCR2 | 5 | 3CCh | |||||
MSR_CRU_ESCR4 | 6 | 3E0h | |||||
MSR_IQ_COUNTER0 | 12 | 30Ch | MSR_IQ_CCCR0 | 36Ch | MSR_IQ_ESCR0 | 0 | 3BAh |
MSR_RAT_ESCR0 | 2 | 3BCh | |||||
MSR_SSU_ESCR0 | 3 | 3BEh | |||||
MSR_ALF_ESCR0 | 1 | 3CAh | |||||
MSR_CRU_ESCR0 | 4 | 3B8h | |||||
MSR_CRU_ESCR2 | 5 | 3CCh | |||||
MSR_CRU_ESCR4 | 6 | 3E0h | |||||
MSR_IQ_COUNTER1 | 13 | 30Dh | MSR_IQ_CCCR1 | 36Dh | MSR_IQ_ESCR0 | 0 | 3BAh |
MSR_RAT_ESCR0 | 2 | 3BCh | |||||
MSR_SSU_ESCR0 | 3 | 3BEh | |||||
MSR_ALF_ESCR0 | 1 | 3CAh | |||||
MSR_CRU_ESCR1 | 4 | 3B9h | |||||
MSR_CRU_ESCR3 | 5 | 3CDh | |||||
MSR_IQ_COUNTER2 | 14 | 30Eh | MSR_IQ_CCCR2 | 36Eh | MSR_CRU_ESCR5 | 6 | 3E1h |
MSR_IQ_ESCR1 | 0 | 3BBh | |||||
MSR_RAT_ESCR1 | 2 | 3BDh | |||||
MSR_ALF_ESCR1 | 1 | 3CBh | |||||
MSR_CRU_ESCR1 | 4 | 3B9h | |||||
MSR_CRU_ESCR3 | 5 | 3CDh | |||||
MSR_IQ_COUNTER3 | 15 | 30Fh | MSR_IQ_CCCR3 | 36Fh | MSR_CRU_ESCR5 | 6 | 3E1h |
MSR_IQ_ESCR1 | 0 | 3BBh | |||||
MSR_RAT_ESCR1 | 2 | 3BDh | |||||
MSR_ALF_ESCR1 | 1 | 3CBh | |||||
MSR_CRU_ESCR0 | 4 | 3B8h | |||||
MSR_CRU_ESCR2 | 5 | 3CCh | |||||
MSR_CRU_ESCR4 | 6 | 3E0h | |||||
MSR_IQ_COUNTER4 | 16 | 310h | MSR_IQ_CCCR4 | 370h | MSR_IQ_ESCR0 | 0 | 3BAh |
MSR_RAT_ESCR0 | 2 | 3BCh | |||||
MSR_SSU_ESCR0 | 3 | 3BEh | |||||
MSR_ALF_ESCR0 | 1 | 3CAh | |||||
MSR_CRU_ESCR1 | 4 | 3B9h | |||||
MSR_CRU_ESCR3 | 5 | 3CDh | |||||
MSR_IQ_COUNTER5 | 17 | 311h | MSR_IQ_CCCR5 | 371h | MSR_CRU_ESCR5 | 6 | 3E1h |
MSR_IQ_ESCR1 | 0 | 3BBh | |||||
MSR_RAT_ESCR1 | 2 | 3BDh | |||||
MSR_ALF_ESCR1 | 1 | 3CBh |
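The MSR addresses in the table follow a regular pattern: counter n is at 300h + n and its CCCR is at 360h + n. The sketch below encodes that observation; the function names are hypothetical, and the pattern is inferred from the table rows rather than stated as a rule by the documentation.

```c
#include <assert.h>
#include <stdint.h>

#define P4_NUM_COUNTERS     18u
#define P4_COUNTER_MSR_BASE 0x300u  /* MSR_BPU_COUNTER0 in the table */
#define P4_CCCR_MSR_BASE    0x360u  /* MSR_BPU_CCCR0 in the table    */

/* Returns the MSR address of counter number counter_no (0..17),
 * or 0 if the number is out of range. */
uint32_t counter_msr(unsigned counter_no)
{
    return counter_no < P4_NUM_COUNTERS ? P4_COUNTER_MSR_BASE + counter_no : 0u;
}

/* Returns the MSR address of the CCCR hardwired to counter counter_no,
 * or 0 if the number is out of range. */
uint32_t cccr_msr(unsigned counter_no)
{
    return counter_no < P4_NUM_COUNTERS ? P4_CCCR_MSR_BASE + counter_no : 0u;
}

/* Usage example, checked against rows of the table. */
void example_usage(void)
{
    assert(counter_msr(17) == 0x311u);  /* MSR_IQ_COUNTER5 */
    assert(cccr_msr(12) == 0x36Cu);     /* MSR_IQ_CCCR0    */
}
```

The ESCR addresses do not follow a single linear pattern, so a lookup table built from the rows above is the safer choice for them.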
The ESCR selected to be associated with a counter (by programming a 3-bit value into the CCCR's ESCR Select field) contains two fields (see Figure 56-23 on page 1384) that are used to select the event or events to be counted by that counter:
The 6-bit Event Select field identifies a class of events.
The 16 bits of the Event Mask field select which specific events within the selected class are to be counted.
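A minimal sketch of packing these two fields into an ESCR value is shown below. The field widths (6 and 16 bits) come from the text; the bit positions (Event Select in bits 30:25, Event Mask in bits 24:9) are assumptions that should be verified against the referenced figure, and `escr_encode` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED field positions; only the field widths are given in the text. */
#define ESCR_EVENT_SELECT_SHIFT 25u
#define ESCR_EVENT_MASK_SHIFT   9u

/* Packs a 6-bit event class and a 16-bit event mask into an ESCR value.
 * All other ESCR fields are left at zero. */
uint32_t escr_encode(uint32_t event_select, uint32_t event_mask)
{
    assert(event_select <= 0x3Fu);   /* 6-bit field  */
    assert(event_mask <= 0xFFFFu);   /* 16-bit field */
    return (event_select << ESCR_EVENT_SELECT_SHIFT) |
           (event_mask << ESCR_EVENT_MASK_SHIFT);
}

/* Usage example: event class 06h with only mask bit 0 selected. */
void example_usage(void)
{
    assert(escr_encode(0x06u, 0x0001u) == 0x0C000200u);
}
```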
Table 56-7 on page 1385 describes the Event Classes and corresponding Event Mask bit fields that are currently defined for Non-Retirement Event Counting.
Event Class | Description | ESCRs That Can Be Used | Counters That Must Be Used |
---|---|---|---|
06h | Retired Branches. Mask Bits: | CRU_ESCR2, CRU_ESCR3. Note: These ESCR names begin with “MSR_” | ESCR2: 12, 13, 16; ESCR3: 14, 15, 17 |
03h | Retired mispredicted branches. Mask Bits: | CRU_ESCR0, CRU_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 12, 13, 16; ESCR1: 14, 15, 17 |
01h | TC_deliver_mode. This event counts the duration (in clock cycles) of the operating modes of the Trace Cache and decode engine in the processor. | TC_ESCR0, TC_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 4, 5; ESCR1: 6, 7 |
03h | BPU_fetch_request (BPU = Branch Prediction Unit). This event counts instruction fetch requests of a specified request type by the Branch Prediction unit. Mask Bits: | BPU_ESCR0, BPU_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 0, 1; ESCR1: 2, 3 |
18h | ITLB_reference. This event counts translations using the Instruction Translation Lookaside Buffer (ITLB). Mask Bits: | ITLB_ESCR0, ITLB_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 0, 1; ESCR1: 2, 3 |
02h | memory_cancel. This event counts the canceling of various types of requests in the Data Cache Address Control unit (DAC). Mask Bits: | DAC_ESCR0, DAC_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
08h | memory_complete. This event counts the completion of a load split, store split, uncacheable (UC) split, or a UC load. Mask Bits: | SAAT_ESCR0, SAAT_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
04h | load_port_replay. This event counts replayed events at the load port. Mask Bits: | SAAT_ESCR0, SAAT_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
05h | store_port_replay. This event counts replayed events at the store port. Mask Bits: | SAAT_ESCR0, SAAT_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
03h | MOB_load_replay. This event triggers if the Memory Order Buffer (MOB) caused a load operation to be replayed. Mask Bits: | MOB_ESCR0, MOB_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 0, 1; ESCR1: 2, 3 |
01h | page_walk_type. This event counts various types of page walks that the Page Miss Handler (PMH) performs. Mask Bits: | PMH_CR_ESCR0, PMH_CR_ESCR1 | ESCR0: 0, 1; ESCR1: 2, 3 |
0Ch | BSQ_cache_reference. This event counts cache references to the L2 or L3 cache as seen by the bus unit. Specify one or more mask bits to select an access according to the access type (read types include both loads and RFOs; write types include writebacks and evictions) and the access result (hit, miss). Mask Bits: | BSU_CR_ESCR0, BSU_CR_ESCR1 | ESCR0: 0, 1; ESCR1: 2, 3 |
03h | IOQ_allocation. This event counts the various types of transactions on the bus. A count is generated each time a transaction is allocated into the IOQ that matches the specified mask bits. An allocated entry can be a sector (64 bytes) or chunks of 8 bytes. Note that requests are counted once per retry. The event is triggered by evaluating the logical expression: (((Request type) OR Bit 5 OR Bit 6) OR (Memory type)) AND (Source agent). Mask Bits: | MSR_FSB_ESCR0 | ESCR0: 0, 1 |
1Ah | IOQ_active_entries. This event counts the number of entries (clipped at 15) in the IOQ that are active. An allocated entry can be a sector (64 bytes) or chunks of 8 bytes. This event must be programmed in conjunction with the IOQ_allocation event. Mask Bits: | MSR_FSB_ESCR1 | ESCR1: 2, 3 |
17h | FSB_data_activity. This event increments once for each DRDY or DBSY assertion that occurs on the FSB. The event allows selection of a specific DRDY or DBSY event. Mask Bits: | MSR_FSB_ESCR0, MSR_FSB_ESCR1 | ESCR0: 0, 1; ESCR1: 2, 3 |
05h | BSQ_allocation. This event counts allocations in the Bus Sequence Unit (BSQ) according to the specified mask bit encoding. Mask Bits: | BSU_ESCR0. Note: This ESCR name begins with “MSR_” | ESCR0: 0, 1 |
06h | bsq_active_entries. This event represents the number of BSQ entries (clipped at 15) currently active (valid) which meet the mask criteria during allocation in the BSQ. Active request entries are allocated on the BSQ until de-allocated. Deallocation of an entry does not necessarily imply the request is filled. This event must be programmed in conjunction with the BSQ_allocation event. Mask Bits: | ESCR1. Note: The Intel® documentation just refers to ESCR1, but this is not a complete ESCR name (which ESCR1?). | ESCR1: 2, 3 |
03h | x87_assist. This event counts the retirement of x87 instructions that required special handling. Mask Bits: | CRU_ESCR2, CRU_ESCR3. Note: These ESCR names begin with “MSR_” | ESCR2: 12, 13, 16; ESCR3: 14, 15, 17 |
34h | SSE_input_assist. This event counts the number of times an assist is requested to handle problems with input operands for SSE and SSE2 operations, most notably denormal source operands when the DAZ bit is not set. Set bit 15 of the event mask to use this event. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
08h | packed_SP_μop. This event increments for each packed SP FP μop. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
0Ch | packed_DP_μop. This event increments for each packed DP FP μop. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
0Ah | scalar_SP_μop. This event increments for each scalar SP FP μop. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
0Eh | scalar_DP_μop. This event increments for each scalar DP FP μop. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
02h | 64bit_MMX_μop. This event increments for each MMX instruction which operates on 64-bit SIMD operands. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
1Ah | 128bit_MMX_μop. This event increments for each integer SIMD SSE2 instruction which operates on 128-bit SIMD operands. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
04h | x87_FP_μop. This event increments for each x87 FP μop. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
2Eh | x87_SIMD_moves_μop. This event increments for each x87 FP, MMX, SSE or SSE2 μop related to load data, store data, or register-to-register moves. These μops are dispatched to port 0 or port 2 at runtime. Mask Bits: | FIRM_ESCR0, FIRM_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 8, 9; ESCR1: 10, 11 |
02h | machine_clear. This event increments while the entire pipeline of the machine is cleared. Mask Bits: | CRU_ESCR2, CRU_ESCR3. Note: These ESCR names begin with “MSR_” | ESCR2: 12, 13, 16; ESCR3: 14, 15, 17 |
05h | global_power_events. This event measures the time during which a processor is not stopped. Mask Bits: | FSB_ESCR0, FSB_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 0, 1; ESCR1: 2, 3 |
05h | tc_ms_xfer. This event counts the number of times that μop delivery changed from the TC to the Microcode Store ROM. Mask Bits: | MS_ESCR0, MS_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 4, 5; ESCR1: 6, 7 |
09h | μop_queue_writes. This event counts the number of valid μops written to the μop queue. Mask Bits: | MS_ESCR0, MS_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 4, 5; ESCR1: 6, 7 |
Table 56-8 on page 1395 describes the Event Classes and corresponding Event Mask bit fields that are currently defined for At-Retirement Event Counting.
Event Class | Description | ESCRs That Can Be Used | Counters That Must Be Used |
---|---|---|---|
08h | front_end_event. This event counts the retirement of tagged μops (specified by the front-end tagging mechanism). Mask Bits: | CRU_ESCR2, CRU_ESCR3. Note: These ESCR names begin with “MSR_” | ESCR2: 12, 13, 16; ESCR3: 14, 15, 17 |
0Ch | execution_event. This event counts the retirement of tagged μops (specified through the execution tagging mechanism). The event mask allows from one to four types of μops to be specified as either bogus or non-bogus μops to be tagged. Mask Bits: | CRU_ESCR2, CRU_ESCR3. Note: These ESCR names begin with “MSR_” | ESCR2: 12, 13, 16; ESCR3: 14, 15, 17 |
09h | replay_event. This event counts the retirement of tagged μops (specified through the replay tagging mechanism). Mask Bits: | CRU_ESCR2, CRU_ESCR3. Note: These ESCR names begin with “MSR_” | ESCR2: 12, 13, 16; ESCR3: 14, 15, 17 |
02h | instr_retired. This event counts the instructions that are retired during a clock cycle. Mask bits specify bogus or non-bogus (and whether they are tagged via the front-end tagging mechanism). The event count may vary depending on the microarchitecture state of the processor when the event is enabled. The event may count more than once for some IA32 instructions with complex μop flows that were interrupted before retirement. Mask Bits: | CRU_ESCR0, CRU_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 12, 13, 16; ESCR1: 14, 15, 17 |
01h | μops_retired. This event counts the μops retired during a clock cycle. Mask Bits: | CRU_ESCR0, CRU_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 12, 13, 16; ESCR1: 14, 15, 17 |
02h | μop_type. This event is used in conjunction with the front-end at-retirement mechanism to tag load and store μops. Mask Bits: | RAT_ESCR0, RAT_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 12, 13, 16; ESCR1: 14, 15, 17 |
05h | retired_mispred_branch_type. This event counts retiring mispredicted branches by type. Mask Bits: | TBPU_ESCR0, TBPU_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 4, 5; ESCR1: 6, 7 |
04h | retired_branch_type. This event counts retiring branches by type. Mask Bits: | TBPU_ESCR0, TBPU_ESCR1. Note: These ESCR names begin with “MSR_” | ESCR0: 4, 5; ESCR1: 6, 7 |
The remaining ESCR bit fields are defined in Table 56-6 on page 1382.
As mentioned earlier, each of the 18 counters is associated with a specific CCCR. The CCCR controls event counting, event filtering and the interrupt-on-overflow capability.
Each CCCR has the format shown in Figure 56-24 on page 1398, and each of the bit fields is explained in Table 56-9 on page 1399.
Bit Field | Description |
---|---|
Enable | This bit is cleared to 0 on reset. |
ESCR Select | Selects the ESCR that defines the event(s) to be counted by the counter that is associated with the CCCR. |
Bits[17:16] | |
Compare | |
Complement | Selects how the incoming event count is compared with the threshold value. |
Threshold | Specifies the threshold value to be used for comparisons. This field only has meaning if the Compare bit = 1. The processor uses the setting of the Complement bit to determine the type of threshold comparison. The range of values that can be specified depends on the event type. Refer to “The Event Filtering Mechanism” on page 1408 for a detailed description. |
Edge | |
Force Overflow | |
PMI bit | This bit only applies to a processor that does not support Hyper-Threading. It is the Interrupt On Overflow enable bit: |
OVF_PMI_T0 | This bit only applies to a processor that supports Hyper-Threading. It is the Interrupt On Overflow enable bit for logical processor 0: |
OVF_PMI_T1 | This bit only applies to a processor that supports Hyper-Threading. It is the Interrupt On Overflow enable bit for logical processor 1: |
Cascade | |
Extended Cascade Enable | This bit is only implemented in the CCCRs associated with Counters 12, 15, 16 and 17, and is only available in Pentium® 4 processors with Family = Fh and a Model number >= 2. Refer to “Extended Cascading” on page 1406. |
Overflow | This is the overflow history bit. |
The CCCRs are all cleared to zero on reset. The events that an enabled counter actually counts are selected and filtered by the following bit fields in the ESCR and CCCR (in the qualification order shown):
1. | |
2. | |
3. | |
4. | |
5. |
The 18 performance counters are organized into nine pairs (see Table 56-10 on page 1403), and each pair of performance counters is associated with a particular subset of events and ESCRs (see Table 56-5 on page 1376).
While the ability to count a particular event type is valuable, there are more complex tests or software tuning scenarios that counter cascading can address quite handily:
A counter could be set up to detect when a particular event occurs and then automatically enable another counter to start counting a different event type.
A counter could be set up to count a specified number of a particular event type, and then automatically enable another counter to start counting a different event type.
Each counter is 40 bits wide. Two counters may be initialized to a count of 0 and both set up to count the same event type. One of the counters is then enabled and starts counting the selected event type. When it overflows, the overflow condition automatically enables the other counter, which then continues counting the same event type.
Referring to Table 56-10 on page 1403, each counter can be cascaded to a second counter located in another pair (not in the same pair, however) within the same group. As an example, counters 0 and 2 can be cascaded in any order (i.e., 0 to 2, or 2 to 0), as can counters 1 and 3.
As an example, to have an overflow of counter 0 automatically enable counter 2 to start counting, the programmer sets CCCR[Cascade] = 1 (see Figure 56-24 on page 1398) in the CCCR associated with counter 2.
Figure 56-25 on page 1405 illustrates an example wherein the overflow of the second counter then results in the generation of a Performance Monitor interrupt. A counter is enabled to do so by setting the PMI bit = 1 in its CCCR. Refer to “The Performance Counter Overflow Interrupt” on page 1547 for a detailed description of this interrupt.
It should be noted that Pentium® 4 processors with a Model number of 0 or 1 and a Stepping > 09h, as well as processors with a Model of 2, have an erratum that prevents them from generating a Performance Monitor interrupt if cascading or extended cascading (refer to “Extended Cascading” on page 1406) is enabled.
Extended cascading is model-specific: it is implemented only in the CCCRs associated with Counters 12, 15, 16 and 17, and is available only in Pentium® 4 processors with Family = Fh and a Model number >= 2 (obtained by executing a CPUID request type 1).
If bit 11 = 1 in the CCCR associated with counter 12, then counter 16 can cascade to counter 12.
If bit 11 = 1 in the CCCR associated with counter 15, then counter 17 can cascade to counter 15.
If bit 11 = 1 in the CCCR associated with counter 16, then counter 17 can cascade to counter 16.
If bit 11 = 1 in the CCCR associated with counter 17, then counter 16 can cascade to counter 17.
If extended cascading is to be used, the programmer sets CCCR bit 11 to one rather than the Cascade bit.
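The four extended cascading relationships listed above can be captured in a small lookup, sketched here. The function names are hypothetical; the counter pairs and the use of CCCR bit 11 come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Per the text, extended cascading is enabled with CCCR bit 11. */
#define CCCR_EXTENDED_CASCADE (1u << 11)

/* Returns the counter that can cascade INTO dest_counter when bit 11 is
 * set in dest_counter's CCCR, or -1 if dest_counter's CCCR does not
 * implement extended cascading. */
int extended_cascade_source(unsigned dest_counter)
{
    switch (dest_counter) {
    case 12: return 16;
    case 15: return 17;
    case 16: return 17;
    case 17: return 16;
    default: return -1;
    }
}

/* Returns the given CCCR value with the extended cascade bit set. */
uint32_t cccr_enable_extended_cascade(uint32_t cccr)
{
    return cccr | CCCR_EXTENDED_CASCADE;
}

/* Usage example. */
void example_usage(void)
{
    assert(extended_cascade_source(12) == 16);
    assert(extended_cascade_source(13) == -1);
    assert(cccr_enable_extended_cascade(0u) == 0x800u);
}
```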
In the Pentium® 4 processor, the RDPMC (Read Performance Counter) instruction (see “Performance Monitoring” on page 505) was enhanced to allow either the full 40-bit counter to be read, or just its lower 32 bits (which is faster; this can be used when the count is small enough to be contained in 32 bits).
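The two read modes can be illustrated with a pair of helpers, sketched below without executing RDPMC itself. Two details are assumptions not given in the text: that the fast 32-bit read is selected by setting ECX bit 31 alongside the counter number, and that a full read returns the upper 8 bits of the count in EDX and the lower 32 bits in EAX. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED: ECX bit 31 selects the faster 32-bit-only read. */
#define RDPMC_FAST_READ (1u << 31)

/* Builds the ECX value to load before RDPMC for the given counter number. */
uint32_t rdpmc_selector(unsigned counter_no, int low32_only)
{
    return (uint32_t)counter_no | (low32_only ? RDPMC_FAST_READ : 0u);
}

/* Combines the EDX:EAX pair into a 40-bit count. Only the low 8 bits
 * of EDX are kept, since the counter is 40 bits wide. */
uint64_t rdpmc_combine(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)(edx & 0xFFu) << 32) | eax;
}

/* Usage example with made-up register values. */
void example_usage(void)
{
    assert(rdpmc_selector(12, 1) == 0x8000000Cu);
    assert(rdpmc_combine(0xFFFFFFFFu, 0xFFu) == 0xFFFFFFFFFFull);
}
```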
The CR4[PCE] bit permits the OS to limit access to the Performance Counters:
CR4[PCE] = 0. Only privilege level 0 code (i.e., the OS kernel) can execute the RDPMC instruction to read a performance counter's contents. An exception is generated whenever a less-privileged program attempts execution of the RDPMC instruction.
CR4[PCE] = 1. Any program can execute the RDPMC instruction to read a performance counter's contents.
Just like the RDTSC instruction (see “RDTSC Is Not a Serializing Instruction” on page 499), the RDPMC instruction is not a serializing instruction (see “Serializing Instructions” on page 1079). It can be executed out-of-order by the processor core and therefore may yield an inaccurate reading. It can be used in conjunction with the CPUID instruction (which is a serializing instruction) to obtain an accurate count.
In some cases, the programmer may want to preload a count into a counter prior to enabling the counter (so that the counter will generate an overflow when that count has been reached). To do so, enter the start count as a twos-complement negative integer in the counter. The counter then counts from the start value up to -1 and then overflows. A counter is written to using the WRMSR instruction, and all 40 bits are written simultaneously.
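The preload value is simply the negated event count truncated to 40 bits. A minimal sketch, with a hypothetical function name and the assumption that the requested count lies between 1 and 2^40:

```c
#include <assert.h>
#include <stdint.h>

/* Each counter is 40 bits wide. */
#define P4_COUNTER_MASK ((1ull << 40) - 1)

/* Returns the 40-bit twos-complement value -n_events. A counter preloaded
 * with this value overflows on the n_events-th counted event.
 * n_events is assumed to be in the range 1..2^40. */
uint64_t preload_for_overflow_after(uint64_t n_events)
{
    assert(n_events >= 1 && n_events <= (1ull << 40));
    return ((uint64_t)0 - n_events) & P4_COUNTER_MASK;
}

/* Usage example: overflow after 1 event and after 1000 events. */
void example_usage(void)
{
    assert(preload_for_overflow_after(1) == 0xFFFFFFFFFFull);     /* -1    */
    assert(preload_for_overflow_after(1000) == 0xFFFFFFFC18ull);  /* -1000 */
}
```

For example, a preload of FFFFFFFC18h (that is, -1000) reaches -1 after 999 events and overflows on the 1000th.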
After being started, a counter continues counting. If and when the counter overflows, it wraps around and continues counting. When the counter wraps around, it sets the CCCR[OVF] bit to indicate that the counter has overflowed. To halt counting, clear CCCR[ENABLE] to 0 (see Figure 56-24 on page 1398).
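Checking for overflow and halting a counter are plain bit operations on the CCCR value, sketched below. The bit positions (ENABLE at bit 12, OVF at bit 31) are assumptions that should be checked against the referenced figure, and the helper names are hypothetical. The returned values would be written back with WRMSR by privileged code.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED bit positions within the CCCR. */
#define CCCR_ENABLE (1u << 12)
#define CCCR_OVF    (1u << 31)

/* Returns 1 if the overflow history bit is set in the CCCR value. */
int cccr_overflowed(uint32_t cccr)
{
    return (cccr & CCCR_OVF) != 0;
}

/* Returns the CCCR value with ENABLE cleared (halts counting). */
uint32_t cccr_halt(uint32_t cccr)
{
    return cccr & ~CCCR_ENABLE;
}

/* Returns the CCCR value with the overflow history bit cleared. */
uint32_t cccr_clear_overflow(uint32_t cccr)
{
    return cccr & ~CCCR_OVF;
}

/* Usage example with made-up CCCR values. */
void example_usage(void)
{
    assert(cccr_overflowed(0x80001000u) == 1);
    assert(cccr_halt(0x00033000u) == 0x00032000u);
    assert(cccr_clear_overflow(0x80001000u) == 0x00001000u);
}
```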
To halt a cascaded counter (i.e., a counter that was automatically enabled when another counter overflowed), take one of the following actions:
Clear CCCR[Cascade] to 0 in the cascaded counter's CCCR.
Clear CCCR[OVF] to 0 in the other counter's CCCR.
While At-Retirement Counting (see “At-Retirement Event Counting” on page 1409) only counts events related to μops along the correctly predicted path, Non-Retirement Event Counting counts all events of the specified type (even if the event is associated with a μop that resides on a mispredicted path and is therefore not ultimately retired).
Table 56-7 on page 1385 provides a listing of the Non-Retirement events by Event Class and Event Mask. These events can be counted using either of the following two counting methods:
A running event count. A counter is set up to count one or more event types. Software periodically reads the counter to determine how many of the selected event type(s) have occurred since the last time the counter was read.
Interrupt on counter overflow. Intel® documentation refers to this as Non-Precise Event-based Sampling (NPEBS if you will):
- A counter is set up to count one or more event types.
- It is preset with an initial count.
- It is enabled to generate a Performance Counter interrupt when the counter has an overflow condition.
- When the counter overflows, the handler is called.
- The interrupt handler records the Return Instruction Pointer (RIP), resets the count to its initial programmed value, restarts the counter, and returns to the interrupted program.
Non-Retirement events may not be counted using the PEBS mechanism.
In order to program the Performance Monitoring logic to count one or more Non-Retirement events, the programmer first decides which event or events are to be counted (see Table 56-7 on page 1385). For each desired event type, the following general series of steps is then performed:
1. | In the “ESCR” column of Table 56-7 on page 1385, the programmer selects an ESCR that supports the desired event type. |
2. | The programmer finds the CCCR and Counter associated with the selected ESCR using Table 56-5 on page 1376. |
3. | The programmer sets up the ESCR for the specific event or events to be counted and the privilege levels they are to be counted at. |
4. | The programmer sets up the CCCR:
|
5. | In the CCCR, the programmer then enables the counter to begin counting. |
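The five steps above can be sketched for one concrete case: counting the TC_deliver_mode event (class 01h) on counter 4 through MSR_TC_ESCR0. The MSR addresses and the ESCR number come from Table 56-5. Everything else is an assumption made for illustration: the field positions in the ESCR and CCCR, setting bits 17:16 of the CCCR to 11b, the caller-supplied event mask, and `wrmsr_stub`, a hypothetical stand-in that records writes instead of executing the privileged WRMSR instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for WRMSR: logs each write for inspection. */
struct msr_write { uint32_t msr; uint64_t value; };
#define MSR_LOG_MAX 8
struct msr_write msr_log[MSR_LOG_MAX];
size_t msr_log_len = 0;

void wrmsr_stub(uint32_t msr, uint64_t value)
{
    if (msr_log_len < MSR_LOG_MAX) {
        msr_log[msr_log_len].msr = msr;
        msr_log[msr_log_len].value = value;
        msr_log_len++;
    }
}

/* From Table 56-5. */
#define MSR_MS_COUNTER0 0x304u  /* counter 4                      */
#define MSR_MS_CCCR0    0x364u  /* CCCR hardwired to counter 4    */
#define MSR_TC_ESCR0    0x3C4u
#define TC_ESCR0_NO     1u      /* 3-bit ESCR number for the CCCR */
#define EVENT_TC_DELIVER_MODE 0x01u

/* ASSUMED field positions (not given in the text). */
#define ESCR_EVENT_SELECT_SHIFT 25u
#define ESCR_EVENT_MASK_SHIFT   9u
#define ESCR_OS   (1u << 3)     /* count at privilege level 0     */
#define ESCR_USR  (1u << 2)     /* count at privilege levels 1-3  */
#define CCCR_ENABLE            (1u << 12)
#define CCCR_ESCR_SELECT_SHIFT 13u
#define CCCR_BITS_17_16        (3u << 16)

void setup_tc_deliver_mode(uint32_t event_mask)
{
    uint32_t escr, cccr;

    /* Step 3: event class, event mask and privilege levels in the ESCR. */
    escr = (EVENT_TC_DELIVER_MODE << ESCR_EVENT_SELECT_SHIFT) |
           ((event_mask & 0xFFFFu) << ESCR_EVENT_MASK_SHIFT) |
           ESCR_OS | ESCR_USR;
    wrmsr_stub(MSR_TC_ESCR0, escr);

    /* Start the running count from zero. */
    wrmsr_stub(MSR_MS_COUNTER0, 0);

    /* Step 4: point the CCCR at the chosen ESCR, counter still disabled. */
    cccr = (TC_ESCR0_NO << CCCR_ESCR_SELECT_SHIFT) | CCCR_BITS_17_16;
    wrmsr_stub(MSR_MS_CCCR0, cccr);

    /* Step 5: enable the counter. */
    wrmsr_stub(MSR_MS_CCCR0, cccr | CCCR_ENABLE);
}
```

Writing the ESCR and the counter before setting the enable bit keeps the counter from running with a half-configured event selection.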
During a given clock cycle, the Performance Monitoring logic may detect multiple instances of an event selected for counting. When the event filtering mechanism is disabled, the actual event count in a given clock cycle is added to the counter that has been set up to count the respective event.
It may be, however, that the programmer only wants to increment the counter (by one) when the per clock count is > or <= a threshold value that has been specified in the CCCR associated with the counter.
To utilize this capability, the programmer uses the following CCCR fields:
- The CCCR[Threshold] is set to the threshold count value. It is a 4-bit field, so the Threshold value can be between 0 and 15d.
- The CCCR[Complement] bit is set to the appropriate value:
- Complement = 0: A per clock event count > the Threshold value results in the counter being incremented by one (NOT by the actual event count detected in the clock cycle).
- Complement = 1: A per clock event count <= the Threshold value results in the counter being incremented by one (NOT by the actual event count detected in the clock cycle).
As an example, if Complement = 0 and Threshold = 6, a count of 7 or greater in a clock cycle causes the counter to be incremented by one, while a count less than 7 results in the counter not being incremented in that clock cycle. If Complement = 1, a count value in a given clock cycle from 0 to 6 causes the counter to be incremented by one, while any count value from 7 to 15 results in the counter not being incremented in that clock cycle.
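The threshold comparison just described can be modeled in a few lines. This is a software model of the behavior stated in the text, not hardware code, and the function name is hypothetical.

```c
#include <assert.h>

/* Returns the amount (0 or 1) added to the counter in one clock cycle
 * when the threshold filter is in use.
 *   complement == 0: increment when per_clock_count >  threshold
 *   complement != 0: increment when per_clock_count <= threshold */
unsigned filtered_increment(unsigned per_clock_count, unsigned threshold,
                            int complement)
{
    if (complement)
        return per_clock_count <= threshold ? 1u : 0u;
    return per_clock_count > threshold ? 1u : 0u;
}

/* Usage example: the Threshold = 6 case from the text. */
void example_usage(void)
{
    assert(filtered_increment(7, 6, 0) == 1);  /* 7 > 6            */
    assert(filtered_increment(6, 6, 0) == 0);  /* 6 is not > 6     */
    assert(filtered_increment(6, 6, 1) == 1);  /* 6 <= 6           */
    assert(filtered_increment(7, 6, 1) == 0);  /* 7 is not <= 6    */
}
```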
It may be that the programmer only wants to increment a counter by one when the condition being measured by the Threshold comparison is false in one clock and then true in the following clock. This capability is enabled by setting CCCR[Edge] = 1.
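Edge detection can be modeled by tracking the threshold condition from one clock to the next, as sketched below. This is a software model with hypothetical names; it assumes the first sampled clock cannot produce an edge because no earlier clock has been observed.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold comparison as described in the text:
 * complement == 0 tests count > threshold,
 * complement != 0 tests count <= threshold. */
int threshold_condition(unsigned count, unsigned threshold, int complement)
{
    return complement ? (count <= threshold) : (count > threshold);
}

/* Given per-clock event counts, returns how many times the counter would
 * increment with Edge = 1: once for each clock where the condition is true
 * and was false in the preceding clock. */
unsigned edge_filtered_total(const unsigned *counts, size_t n,
                             unsigned threshold, int complement)
{
    unsigned total = 0;
    size_t i;
    int prev, cur;

    if (n == 0)
        return 0;
    prev = threshold_condition(counts[0], threshold, complement);
    for (i = 1; i < n; i++) {
        cur = threshold_condition(counts[i], threshold, complement);
        if (!prev && cur)
            total++;
        prev = cur;
    }
    return total;
}

/* Usage example: with Threshold = 6 and Complement = 0, the condition
 * goes false, true, true, false, true, giving two rising edges. */
void example_usage(void)
{
    const unsigned counts[] = {0, 7, 8, 0, 9};
    assert(edge_filtered_total(counts, 5, 6, 0) == 2);
}
```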
In Table 56-8 on page 1395, the term “bogus” refers to IA32 instructions or μops that are canceled because they are on a program path taken due to a mispredicted branch. The terms “retired” and “non-bogus” refer to IA32 instructions or μops along a correctly-predicted path.
The same event can happen to a μop more than once while it is being executed. Just counting the number of times the event occurred would not indicate how many μops experienced that event. A μop can be tagged once during its lifetime and counted once at retirement. In the Intel® documentation, the “retired” suffix is appended to performance metrics that increment a count once per μop, rather than once per event.
As an example, a μop could experience a cache miss more than once while it is being executed, but a “Miss Retired” event (counts the number of retired μops that experienced a cache miss) only increments once for that μop. This can be used to measure the performance of the cache hierarchy for a code fragment.
To achieve optimum performance when dealing with commonly-encountered cases, the schedulers sometimes schedule μops for execution before all of the conditions necessary for correct execution are guaranteed to be satisfied. The hope is that by the time the μop is actually dispatched for execution, the condition(s) will have been met.
If they have not, the μop must be re-issued and this is referred to as Replay. Some causes of replays are:
- Cache misses.
- Dependence violations (e.g., store forwarding problems; see “Store-to-Load Forwarding” on page 1070 for more information).
- Unforeseen resource constraints.
The processor will always experience some replays, but too many replays indicate that the code in question should be tuned.
When the hardware needs the assistance of microcode (supplied from the MS ROM) to deal with an event, the machine is said to “take an assist”. An example would be an underflow condition in the input operands of a FP operation. In this case, the processor must modify the operand format to perform the computation. Assists clear the entire machine of μops before they begin and therefore cause a severe dip in performance.
At-Retirement Counting only counts events related to instructions along the correctly predicted path. If a performance counter had been set up to count all executed instructions, the count would also include events related to instructions that were executed along a mispredicted path.
Using the tagging mechanism, a counter can be set up to count only the events related to instructions along the correctly predicted path. This is referred to as At-Retirement Counting.
Table 56-8 on page 1395 lists the At-Retirement Event Classes as well as the events within each class.
A counter can count each incident of an event, or the number of μops that experienced the event. A μop may be tagged when it encounters some of the events listed in Table 56-8 on page 1395. The tagging mechanisms (there are four types of tagging) can be used in Non-Precise Event Based Sampling (NPEBS; see “There Are Three Sampling Methods” on page 1374), and some of the mechanisms can be used in PEBS (Precise Event-Based Sampling; see “There Are Three Sampling Methods” on page 1374). There are four tagging mechanisms:
The tagging mechanisms are independent of each other. A μop tagged using one mechanism is not detected by another mechanism's tagged-μop detector. As an example, a μop tagged by the Front-End tagging mechanism is not counted by the “Execution_event” unless it was also tagged by the Execution tagging mechanism. Note that execution tags allow up to four different types of μops to be counted at retirement.
When using PEBS, however, only one tagging mechanism can be used at a time.
The following μops cannot be tagged: IO, uncacheable accesses, locked accesses, Return μops, far jumps, and far calls.
The Front_end_event counts μops with tags that indicate they have experienced any of the following events:
The Front_end_event is defined in Table 56-8 on page 1395. None of the events currently supported requires the use of the MSR_TC_Precise_Event MSR, but some may in the future.
The Execution_event is defined in Table 56-8 on page 1395.
The execution tagging mechanism uses two ESCRs:
- One upstream ESCR specifies the event to detect and assigns a 4-bit Tag (in the ESCR's Tag field) to identify that event. This ESCR must have its Tag Enable bit = 1. The 4-bit Tag is actually a mask that specifies which tag bit(s) should be set for a particular μop. The Tag mask must match the Event Mask bit setting in the downstream ESCR. For example, if the Tag value in the upstream ESCR is 1h (a mask value of 0001b), then the Event Mask field in the downstream ESCR should be set as follows (see the Execution_event class in Table 56-8 on page 1395):
  - Bit 0, the NBOGUS0 bit, = 1b.
  - Bit 1, the NBOGUS1 bit, = 0b.
  - Bit 2, the NBOGUS2 bit, = 0b.
  - Bit 3, the NBOGUS3 bit, = 0b.
- The second, downstream ESCR is used to detect μops with that Tag value using the Execution_event class in the ESCR's Event Select field. This ESCR's Event Mask bits specify which tag bits accompanying a μop to count. If any of the tag bits that accompany a μop select a mask bit that = 1, the related counter is incremented by one. If more than one mask bit is selected by the bits in the μop's tag, the counter is incremented once for each matching bit. The Tag Enable and Tag value in the downstream ESCR are “don't care”.
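The matching rule between a μop's tag bits and the downstream Event Mask can be sketched as a bitwise AND followed by a bit count. This is a simplified model that considers only the four NBOGUS mask bits (bits 0 through 3); the function name is illustrative.

```python
def execution_event_increment(uop_tag, event_mask):
    """Counter increments caused by one retiring uop.

    Each tag bit that selects a set mask bit contributes one increment,
    so the result is the number of 1 bits in (tag AND mask), limited to
    the low four bits (NBOGUS0..NBOGUS3 in this simplified model).
    """
    matching = uop_tag & event_mask & 0xF
    return bin(matching).count("1")


print(execution_event_increment(0b0001, 0b0001))  # 1: tag bit 0 matches NBOGUS0
print(execution_event_increment(0b0011, 0b0011))  # 2: two matching bits, two increments
print(execution_event_increment(0b0010, 0b0001))  # 0: tag bit 1 is not selected by the mask
```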
The author is puzzled by the fact that eight Event Mask bits (rather than four) are shown in the Execution_Event class entry of Table 56-8 on page 1395.
This mechanism tags μops that must be replayed (e.g., a cache miss) as well as mispredicted branches. They are counted using the “Replay_event” event. The Replay_event is defined in Table 56-8 on page 1395. Replay tagging is enabled with the μop_Tag bit (bit 24) in the IA32_PEBS_ENABLE MSR.
The Replay tagging mechanism requires selecting:
Table 56-11 on page 1413 lists the information used to set up a counter to count Replay events. The setup information in this table enables Precise Event-Based Sampling (see “There Are Three Sampling Methods” on page 1374). Non-Precise Event-Based Sampling can be used by not setting bits 24 or 25 in the IA32_PEBS_ENABLE MSR (see Figure 56-26 on page 1417).
Replay Event | IA32_PEBS_ENABLE bits to set | MSR_PEBS_MATRIX_VERT bits to set | Additional Setup Info | Event Mask Value |
---|---|---|---|---|
1stL_cache_load_miss_retired | 0, 24 and 25 | 0 | None | NBOGUS |
2ndL_cache_load_miss_retired | 1, 24 and 25 | 0 | None | NBOGUS |
DTLB_load_miss_retired | 2, 24 and 25 | 0 | None | NBOGUS |
DTLB_store_miss_retired | 2, 24 and 25 | 1 | None | NBOGUS |
DTLB_all_miss_retired | 2, 24 and 25 | 0 and 1 | None | NBOGUS |
MOB_load_replay_retired | 9, 24 and 25 | 0 | In the ESCR, select the MOB_load_replay event and set the PARTIAL_DATA and UNALGN_ADDR Event Mask bits. | NBOGUS |
split_load_retired | 10, 24 and 25 | 0 | In MSR_SAAT_ESCR1, select the load_port_replay event and set the SPLIT_LD mask bit. | NBOGUS |
split_store_retired | 10, 24 and 25 | 1 | In MSR_SAAT_ESCR0, select the store_port_replay event and set the SPLIT_ST mask bit. | NBOGUS |
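The bit lists in the table translate directly into register values. The sketch below simply ORs together the listed bit positions. It does not write any MSR, and the dictionary and function names are invented for this example.

```python
# (IA32_PEBS_ENABLE bits, MSR_PEBS_MATRIX_VERT bits), copied from the table above.
REPLAY_SETUP = {
    "1stL_cache_load_miss_retired": ((0, 24, 25), (0,)),
    "2ndL_cache_load_miss_retired": ((1, 24, 25), (0,)),
    "DTLB_load_miss_retired": ((2, 24, 25), (0,)),
    "DTLB_store_miss_retired": ((2, 24, 25), (1,)),
    "DTLB_all_miss_retired": ((2, 24, 25), (0, 1)),
    "MOB_load_replay_retired": ((9, 24, 25), (0,)),
    "split_load_retired": ((10, 24, 25), (0,)),
    "split_store_retired": ((10, 24, 25), (1,)),
}


def bits_to_value(bits):
    """OR together the given bit positions into one integer."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    return value


def replay_msr_values(event):
    """Return (IA32_PEBS_ENABLE value, MSR_PEBS_MATRIX_VERT value) for an event."""
    enable_bits, vert_bits = REPLAY_SETUP[event]
    return bits_to_value(enable_bits), bits_to_value(vert_bits)


enable, vert = replay_msr_values("DTLB_all_miss_retired")
print(hex(enable), hex(vert))  # 0x3000004 0x3
```

The additional ESCR setup that the table lists for the last three events is not covered by this sketch.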
The Debug Store (DS) mechanism was introduced in the Pentium® 4 processor. A complete description can be found in “The Debug Store (DS) Mechanism” on page 1366. The processor can be set up to automatically store both PEBS records and BTS records in the DS save area in memory.
What the Intel® documentation refers to as Precise Event-Based Sampling (PEBS) could more aptly be named Automatic state save on counter overflow. It works as follows:
A counter is set up to count one or more event types within the same Event Class.
It is preset with an initial count.
It is enabled to store an event record in the Debug Store (DS) memory area each time that the counter has an overflow condition.
When the counter overflows, the processor automatically copies the contents of the GPRs, the EFlags register and EIP into an event record in the DS memory area.
The processor then automatically resets the count to its initial programmed value and resumes counting.
The processor then resumes execution of the program.
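The sequence above can be illustrated with a small simulation. It assumes that a 40-bit counter overflows when it wraps from its maximum value to zero, so a sampling period of N events corresponds to a preset of 2^40 - N. The event index stored in each record merely stands in for the saved GPRs, EFlags and EIP, and all names are hypothetical.

```python
COUNTER_BITS = 40                      # counter width stated earlier in the chapter
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def preset_for_period(period):
    """Initial count that makes the counter overflow after `period` events."""
    return ((1 << COUNTER_BITS) - period) & COUNTER_MASK


def simulate_state_save(event_count, period):
    """Return the indices of the events at which a record would be stored."""
    preset = preset_for_period(period)
    counter = preset
    records = []
    for index in range(event_count):
        counter = (counter + 1) & COUNTER_MASK
        if counter == 0:               # assumed overflow condition: wrap to zero
            records.append(index)      # stand-in for the stored register state
            counter = preset           # reset to the initial programmed value
    return records


print(simulate_state_save(10, 3))  # [2, 5, 8]
```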
When the DS save area is approaching a full condition, a Performance Counter interrupt is generated and the event records currently in the DS save area can be saved to non-volatile memory (e.g., to disk). A circular DS save buffer is not supported for event records.
Automatic state save on counter overflow (i.e., PEBS) is only supported for the following Event Classes within the At-Retirement Event category:
The Execution_event.
The Front_end_event.
The Replay_event.
PEBS can only be performed using Counter 16.
The programmer can determine that a processor supports PEBS by executing a CPUID request type 1 and verifying that EDX[21] = 1. This indicates that the processor supports the DS feature. The programmer then verifies that IA32_MISC_ENABLE[PEBS_UNAVAILABLE] = 0 (see Figure 56-21 on page 1373).
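The two-step check can be expressed as a small predicate over register values that have already been read (for example, by a driver). The position of the PEBS Unavailable bit (bit 12) is an assumption not stated in this text and should be confirmed against the IA32_MISC_ENABLE figure; the function name is illustrative.

```python
DS_FEATURE_BIT = 21          # CPUID type 1, EDX[21], as stated in the text
PEBS_UNAVAILABLE_BIT = 12    # ASSUMED bit position in IA32_MISC_ENABLE; verify before use


def supports_pebs(cpuid1_edx, ia32_misc_enable):
    """True when DS is reported by CPUID and PEBS is not flagged as unavailable."""
    has_ds = bool((cpuid1_edx >> DS_FEATURE_BIT) & 1)
    pebs_unavailable = bool((ia32_misc_enable >> PEBS_UNAVAILABLE_BIT) & 1)
    return has_ds and not pebs_unavailable


print(supports_pebs(1 << 21, 0))        # True
print(supports_pebs(1 << 21, 1 << 12))  # False: PEBS flagged unavailable
print(supports_pebs(0, 0))              # False: no DS support
```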
Setting IA32_PEBS_ENABLE[24] (see Figure 56-26 on page 1417) enables the processor's PEBS capability. The reader should also note that the DS capability must have been configured (see “The Debug Store (DS) Mechanism” on page 1366).
As mentioned earlier, the processor will generate an interrupt when the PEBS memory buffer in the DS save area is approaching a full condition. See “The Debug Store (DS) Mechanism” on page 1366 for more information.
The processor automatically disables the DS feature under the following circumstances:
In a processor that supports Hyper-Threading, PEBS is enabled and qualified using the following two bits in the IA32_PEBS_ENABLE MSR (this register is replicated for each of the logical processors; see Figure 56-26 on page 1417):
Software executing on a logical processor uses these two bits to enable PEBS for subsequent threads of execution:
On the same logical processor on which the software is running (“my thread”) or
For the other logical processor in the physical package (“other thread”).
PEBS can be used only with two performance counters:
Additional information regarding PEBS on a Hyper-Threading capable processor can be found in section 15.10.4, Performance Monitoring Events, in Intel®'s IA32 Intel® Architecture Software Developer's Manual, Volume 3: System Programming Guide.
Processor clock cycles are referred to as clockticks and can be used to measure how long a program takes to execute, as well as to derive efficiency measurements such as Cycles Per Instruction (CPI).
There are three processor clock cycle measurements:
Non-Halted Clockticks. This measurement counts the clock cycles during which the specified logical processor is not halted and is not in any power-saving state. If Hyper-Threading is enabled, this measurement can be performed on a per logical processor basis. This measurement is taken using a Performance Counter and can be set up to cause an interrupt upon overflow. The processor clock is stopped under the following circumstances:
- When the processor enters the Sleep power conservation state (see “The Sleep State” on page 692).
- When the processor enters the Deep Sleep power conservation state (see “The Deep Sleep State” on page 693).
See “The Non-Halted Clockticks Measurement” on page 1418 for more information.
Non-Sleep Clockticks. This measurement counts the clock cycles during which the physical processor is not in the Sleep mode. This measurement cannot be taken on a per logical processor basis. This measurement is taken using a Performance Counter and can be set up to cause an interrupt upon overflow. See “The Non-Sleep Clockticks Measurement” on page 1419 for more information.
Time Stamp Counter. This measurement counts the clock cycles during which the physical processor is not in Deep Sleep state. These ticks cannot be measured on a per logical processor basis.
For applications wherein the processor is halted during some periods, there are two ratios of interest:
Non-Halted CPI: The Non-Halted Clockticks Per Instructions Retired ratio measures the CPI only during non-halted periods of time (i.e., the processor is actually executing code). This ratio can be measured on a per logical processor basis when Hyper-Threading Technology is enabled.
Nominal CPI: The TSC Ticks Per Instructions Retired ratio measures the CPI over the full period of a program, including those periods of time while the processor is halted.
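Both ratios are simple divisions once the three raw counts are available. The numbers below are invented purely to show how the two ratios diverge when the processor spends time halted.

```python
def cycles_per_instruction(clockticks, instructions_retired):
    """Clockticks divided by instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


# Hypothetical measurements for one run of a program.
non_halted_clockticks = 2_000_000
tsc_ticks = 5_000_000
instructions_retired = 1_000_000

non_halted_cpi = cycles_per_instruction(non_halted_clockticks, instructions_retired)
nominal_cpi = cycles_per_instruction(tsc_ticks, instructions_retired)
print(non_halted_cpi, nominal_cpi)  # 2.0 5.0
```

In this made-up case the Nominal CPI is 2.5 times the Non-Halted CPI because the processor was halted for 60% of the elapsed TSC ticks.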
As mentioned earlier, this measurement is taken using a Performance Counter in the following manner:
In an ESCR, select the global_power_events Event Class, set the RUNNING Event Mask bit, and also set the appropriate mask bits (T0_OS, T0_USR, T1_OS, T1_USR) for the targeted processor.
Set the ESCR Select field in a CCCR to select that ESCR.
Enable counting in the CCCR for that counter by setting the Enable bit.
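As a sketch of what the three steps produce, the code below packs the ESCR and CCCR values. Every bit position and encoding in it (the thread-qualifier bits, the field shifts, the global_power_events Event Select value 13h, its ESCR Select value 06h, and RUNNING as mask bit 0) is an assumption drawn from Intel's published register layouts rather than from this text, and should be checked against the ESCR and CCCR figures before use. CCCR fields that the steps do not mention are left at zero.

```python
# ASSUMED ESCR layout (not given in this text).
ESCR_T1_USR = 1 << 0
ESCR_T1_OS = 1 << 1
ESCR_T0_USR = 1 << 2
ESCR_T0_OS = 1 << 3
ESCR_EVENT_MASK_SHIFT = 9
ESCR_EVENT_SELECT_SHIFT = 25

# ASSUMED CCCR layout (not given in this text).
CCCR_ENABLE = 1 << 12
CCCR_ESCR_SELECT_SHIFT = 13

# ASSUMED encodings for the global_power_events Event Class.
GLOBAL_POWER_EVENTS_SELECT = 0x13
GLOBAL_POWER_EVENTS_ESCR_SELECT = 0x06
RUNNING_MASK_BIT = 0


def non_halted_escr(logical_processor=0):
    """ESCR value: global_power_events, RUNNING, OS and USR for one logical processor."""
    if logical_processor == 0:
        thread_bits = ESCR_T0_OS | ESCR_T0_USR
    else:
        thread_bits = ESCR_T1_OS | ESCR_T1_USR
    return ((GLOBAL_POWER_EVENTS_SELECT << ESCR_EVENT_SELECT_SHIFT)
            | ((1 << RUNNING_MASK_BIT) << ESCR_EVENT_MASK_SHIFT)
            | thread_bits)


def non_halted_cccr():
    """CCCR value: select the ESCR above and set the Enable bit."""
    return (GLOBAL_POWER_EVENTS_ESCR_SELECT << CCCR_ESCR_SELECT_SHIFT) | CCCR_ENABLE


print(hex(non_halted_escr(0)), hex(non_halted_cccr()))  # 0x2600020c 0xd000
```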
If Hyper-Threading is enabled, the count may include some portion of the clock cycles required for that logical processor to complete a transition to a halted state.
If Hyper-Threading is enabled and both logical processors execute the HLT instruction, the physical processor enters the AutoHalt Powerdown power conservation state (see “The AutoHalt Power Down State” on page 686).
As mentioned earlier, this measurement is taken using a Performance Counter in the following manner:
Choose a counter to use for the measurement.
Choose an ESCR associated with that counter.
Set that ESCR's Event Select field to any Event Class other than the “no_event” Event Class.
Set CCCR[Compare] = 1.
Set CCCR[Threshold] = 15d.
Set CCCR[Complement] = 1.
This setup causes the counter to count every cycle. Note that this overrides any other qualifications (e.g., by CPL) that may be specified in the ESCR.
Set CCCR[Enable] = 1 to enable the counter.
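The reason this setup counts every cycle is that, with Complement = 1 and Threshold = 15, the comparison "per clock count <= 15" can never be false for a 4-bit threshold range. The sketch below checks that and packs the CCCR fields named in the steps. The bit positions (Enable at bit 12, Compare at bit 18, Complement at bit 19, Threshold at bits 20 through 23) are assumptions not stated in this text, and the ESCR Select field is left to the caller.

```python
# ASSUMED CCCR bit positions; verify against the CCCR figure before use.
CCCR_ENABLE = 1 << 12
CCCR_COMPARE = 1 << 18
CCCR_COMPLEMENT = 1 << 19
CCCR_THRESHOLD_SHIFT = 20


def counts_this_clock(per_clock_count, threshold=15):
    """Complement = 1 comparison: true when the per clock count <= Threshold."""
    return per_clock_count <= threshold


def non_sleep_cccr(escr_select_bits=0):
    """CCCR value with Compare = 1, Threshold = 15, Complement = 1, Enable = 1."""
    return (escr_select_bits
            | CCCR_COMPARE
            | CCCR_COMPLEMENT
            | (15 << CCCR_THRESHOLD_SHIFT)
            | CCCR_ENABLE)


# Every per clock count from 0 to 15 satisfies the comparison.
print(all(counts_this_clock(n) for n in range(16)))  # True
print(hex(non_sleep_cccr()))                         # 0xfc1000
```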
This measurement tool continues to increment as long as one logical processor is still running.
The TSC continues to increment unless one of the following is true:
RESET# is asserted.
The processor enters the Sleep power conservation state.
The processor enters the Deep Sleep power conservation state.
The counter can be read by executing the RDTSC instruction (please also refer to “Time Stamp Counter” on page 498). Computing the difference in values between two reads (modulo 2^64) yields the number of processor clocks between reads.
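The modulo arithmetic matters only if the counter wraps between the two reads, but it costs nothing to apply it always. A minimal sketch, using made-up readings in place of real RDTSC results:

```python
TSC_MODULUS = 1 << 64  # the Time Stamp Counter is treated as a 64-bit value


def tsc_delta(first_read, second_read):
    """Processor clocks elapsed between two TSC reads, modulo 2^64."""
    return (second_read - first_read) % TSC_MODULUS


print(tsc_delta(100, 250))          # 150
print(tsc_delta(2**64 - 10, 5))     # 15: the counter wrapped between the reads
```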