This chapter, the last in the three-chapter introduction to Intel® Architecture, covers the modern Intel® Core™ microarchitectures. The chapter begins by highlighting the significance of Intel’s tick-tock cadence. The first processor covered, the Intel® Pentium® M, is used to discuss ACPI power states, including C, P, and T states. The second, the Second Generation Intel® Core™ processor family, is used to discuss RAPL, Intel® Turbo Boost, and other modern power-saving techniques.
In the previous chapter, the Intel® Pentium® lineup of processors was introduced. At that time, mobility wasn’t as pervasive as it is today. Of course, there were laptops and mobile users who cared about battery life and power consumption; however, for the majority of computer users, raw computational performance was their primary concern. This trend was reflected in the design of the Pentium processors. While they did introduce new features for reducing power consumption, and there were mobile versions of some Pentiums, more time was spent optimizing their performance.
Once again, the technological landscape was changing, with a new emphasis on reduced power consumption and smaller form factors. At the same time, these mobility needs were compounded by a still increasing demand for raw computational performance. Whereas the Pentium lineup needed to handle intense 3D games, the Intel® Core™ processor family needed to handle blistering 3D games that were being played on battery power while waiting in line at the grocery store.
To accommodate these needs, Intel focused on architecting a series of new processors designed to improve performance while simultaneously reducing power consumption. The Intel® Xeon® processors are architected to meet the needs of data centers, workstations, HPC supercomputers, and enterprise infrastructure. The Intel® Core™ processors are architected to meet the needs of the consumer and business markets, including both desktop and mobile systems. These chips carry the logo of either Intel® Core™ i3, Intel® Core™ i5, or Intel® Core™ i7. The Intel® Atom™ processors are architected to meet the needs of smaller form factors, such as phones, tablets, netbooks, and microservers. The Intel® Quark™ microcontrollers are architected to meet the needs of the Internet of Things.
Additionally, Intel adopted a tick-tock model for processor design. Each tock, which is followed by a tick, introduces a new microarchitecture. Each tick, which is followed by a tock, moves the previous microarchitecture to an improved manufacturing process technology. Improvements to the manufacturing process reduce the processor’s power consumption and thermal output. For example, the Second Generation Intel® Core™ processor family was a tock, introducing a new microarchitecture manufactured on a 32-nm process. The Third Generation Intel® Core™ processor family was a tick, shrinking the Second Generation microarchitecture to a 22-nm process. The Fourth Generation Intel® Core™ processor family is a tock, introducing a new microarchitecture manufactured on the same 22-nm process.
This chapter focuses on the power management functionality added to the Intel® Architecture.
The Intel® Pentium® M processor family was introduced in 2003. This family consisted of mobile processors, hence the “M,” and exemplifies many of the fundamental power management techniques available for Intel Architecture.
The first revision of the Advanced Configuration and Power Interface (ACPI) specification was released in 1996. While ACPI is most commonly associated with power management, the majority of the ACPI specification actually deals with device enumeration and configuration. At boot, UEFI or the legacy BIOS is responsible for creating a series of tables that describe the devices available in the system and their supported configurations.
Additionally, through these ACPI tables, devices expose functions, written in the ACPI Machine Language (AML). The Linux kernel implements an AML interpreter that can parse and execute these functions. This allows for device specific code to be abstracted behind a standardized device interface, and thus for the kernel to control certain devices without requiring a specialized driver.
ACPI also defines a series of processor, device, and system states. The most user-visible of these states are the system S and G states, which include states that most laptop users know as sleep and hibernation. There are three important ACPI processor states to understand.
The first, thermal states, which are often referred to as T states, are responsible for automatically throttling the processor in the case of a thermal trip. In other words, these states are built into the processor to automatically prevent damage caused by overheating. Since a misconfiguration of the T states could result in physical damage, these states are completely independent of software. Obviously, tripping these states should be avoided by ensuring that the processor is cooled by a solution capable of handling its TDP. It is also important to be aware of the fact that thermal throttling will drastically reduce performance until temperatures have returned to safe levels. As a result, this is something to be cautious of when running long benchmarks or tests.
The rest of this section focuses on the other two processor states.
The ACPI specification defines the concept of Device and Processor performance states. Performance states, or simply P states, are designed to improve energy efficiency by allowing the processor to reduce performance, and therefore power consumption, when maximum performance is not required. As of the latest ACPI specification, revision 5.1, the performance states are defined as:
P0 “While a device or processor is in this state, it uses its maximum performance capability and may consume maximum power” (Unified EFI, Inc, 2014).
P1 “In this performance power state, the performance capability of a device or processor is limited below its maximum and consumes less than maximum power” (Unified EFI, Inc, 2014).
Pn “In this performance state, the performance capability of a device or processor is at its minimum level and consumes minimal power while remaining in an active state. State n is a maximum number and is processor or device dependent. Processors and devices may define support for an arbitrary number of performance states not to exceed 16” (Unified EFI, Inc, 2014).
Notice that the lower P state numbers correspond to higher power consumption, while the higher P state numbers correspond to lower power consumption. The specification uses P0 and P1 to establish the pattern between P states, and then gives each hardware vendor the flexibility to define up to sixteen performance states in total. As a result, the number of supported P states for a given system varies depending on the exact processor configuration, and must be detected at runtime through the ACPI tables.
On Intel Architecture, these states are implemented via Enhanced Intel SpeedStep® Technology (EIST). EIST provides an interface for controlling the processor’s operating frequency and voltage (Intel Corporation, 2004). This interface consists of various model-specific registers: one for requesting a state transition, IA32_PERF_CTL, and two for obtaining hardware feedback on the performance actually delivered, IA32_MPERF and IA32_APERF.
It is important to note that these states are requested, as opposed to being set, by software. Many complex factors contribute to the actual P state the processor is put into, such as the states requested by the other cores, the system’s thermal state, and so on. Additionally, these MSRs are duplicated per Hyper-Thread, yet prior to the Fourth Generation Intel® Core™ processor family, P states were governed at the granularity of the processor package.
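Since IA32_MPERF counts at a fixed rate while the core is active, and IA32_APERF counts at the rate the core is actually running, the ratio of their deltas reflects the average frequency delivered relative to the base frequency. The following minimal sketch samples both counters over a one-second window through the Linux msr driver; it assumes root privileges and that the msr module is loaded, and the one-second window is simply an illustrative choice.

/*
 * Minimal sketch: sample IA32_MPERF (0xE7) and IA32_APERF (0xE8) through
 * /dev/cpu/0/msr and report the average frequency ratio over one second.
 * Requires root and the msr kernel module; error handling is minimal.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_MPERF 0xE7
#define IA32_APERF 0xE8

static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t val = 0;

    if (pread(fd, &val, sizeof(val), reg) != sizeof(val))
        perror("rdmsr");
    return val;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/cpu/0/msr");
        return 1;
    }

    uint64_t m0 = rdmsr(fd, IA32_MPERF), a0 = rdmsr(fd, IA32_APERF);
    sleep(1);
    uint64_t m1 = rdmsr(fd, IA32_MPERF), a1 = rdmsr(fd, IA32_APERF);
    close(fd);

    /* A ratio above 1.0 means the core ran above its base frequency
     * (for example, in turbo); below 1.0 means it ran slower. */
    printf("average frequency ratio: %.2f\n",
           (double)(a1 - a0) / (double)(m1 - m0));
    return 0;
}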
User space applications do not need to worry about explicitly controlling the P state, unless they specifically want to. Within the Linux kernel, the CPU frequency scaling driver, CONFIG_CPU_FREQ, has traditionally been responsible for controlling the system’s P state transitions. This driver supports multiple governors for controlling the desired management policy. At the time of this writing, there are five standard governors:
1. Performance (CONFIG_CPU_FREQ_GOV_PERFORMANCE)
2. Powersave (CONFIG_CPU_FREQ_GOV_POWERSAVE)
3. Userspace (CONFIG_CPU_FREQ_GOV_USERSPACE)
4. Ondemand (CONFIG_CPU_FREQ_GOV_ONDEMAND)
5. Conservative (CONFIG_CPU_FREQ_GOV_CONSERVATIVE)
The performance governor maintains the highest processor frequency available. The powersave governor maintains the lowest processor frequency available. The userspace governor defers the policy to user space, allowing for an application or daemon to manually set the frequency. The ondemand governor monitors the processor utilization and adjusts the frequency accordingly. Finally, the conservative governor is similar to the ondemand governor but additionally attempts to avoid frequent state changes. Due to the concept of race-to-idle, as introduced in the Introduction chapter, the ondemand governor has typically provided the best power savings on Intel Architectures.
One of the challenges with this driver has been tuning the algorithm for specific hardware generations. In order to remedy this, a new and more efficient P state driver, the Intel® P state driver, was written by Dirk Brandewie. Unlike the previous driver, this one checks the processor generation and then tunes accordingly.
The current frequency driver can be determined by checking the /sys/devices/system/cpu/ directory. If the older cpufreq driver is active, a cpufreq/ directory will be present, whereas if the Intel P state driver is active, an intel_pstate directory will be present. Within these sysfs directories are the tunables exposed to user space by these drivers.
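For a quick check from a program rather than the shell, the per-CPU cpufreq attributes can be read directly. The following is a minimal sketch, assuming the standard scaling_driver and scaling_governor sysfs attributes are present for cpu0; the exact paths and available attributes vary with the kernel version and the driver in use.

/*
 * Minimal sketch: print the active CPU frequency scaling driver and
 * governor for cpu0 via the cpufreq sysfs attributes.
 */
#include <stdio.h>

static void print_attr(const char *label, const char *path)
{
    char buf[64] = "unavailable\n";
    FILE *f = fopen(path, "r");

    if (f) {
        if (!fgets(buf, sizeof(buf), f))
            snprintf(buf, sizeof(buf), "unreadable\n");
        fclose(f);
    }
    printf("%s: %s", label, buf);
}

int main(void)
{
    print_attr("scaling driver  ",
               "/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver");
    print_attr("scaling governor",
               "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    return 0;
}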
While lowering the core’s frequency reduces power consumption, dropping the frequency to zero provides significantly greater savings. C states are the result of the observation that the processor cores in client systems sit idle most of the time. Therefore, while running the core at its lowest possible frequency is better than running it at its highest, it’s even better to simply turn it off.
The ACPI specification also defines the concept of processor power states. As of the latest ACPI specification, the states are defined as:
C0 “While the processor is in this state, it executes instructions” (Unified EFI, Inc, 2014).
C1 “This processor power state has the lowest latency. The hardware latency in this state must be low enough that the operating software does not consider the latency aspect of the state when deciding whether to use it. Aside from putting the processor in a non-executing power state, this state has no other software-visible effects” (Unified EFI, Inc, 2014).
C2 “The C2 state offers improved power savings over the C1 state. The worst-case hardware latency for this state is provided via the ACPI system firmware and the operating software can use this information to determine when the C1 state should be used instead of the C2 state. Aside from putting the processor in a non-executing power state, this state has no other software-visible effects” (Unified EFI, Inc, 2014).
C3 “The C3 state offers improved power savings over the C1 and C2 states. The worst-case hardware latency for this state is provided via the ACPI system firmware and the operating system can use this information to determine when the C2 state should be used instead of the C3 state. While in the C3 state, the processor’s caches maintain state but ignore any snoops. The operating system is responsible for ensuring that the caches maintain coherency” (Unified EFI, Inc, 2014).
Notice that similar to P states, a lower C state number corresponds to higher power consumption, while a higher C state number corresponds to lower power consumption. While only four C states are defined in the specification, processors are capable of supporting deeper sleep states, such as C6 and C7. Also notice that each sleep state has an entrance and exit latency, which is provided to the operating system through an ACPI table.
In any P state, the core is still actively executing instructions. With C states, on the other hand, once the core exits C0 it is halted, that is, it completely stops executing instructions. As a result, a core’s P state is only meaningful while the core is in C0; otherwise, the frequency is effectively zero.
When all of a processor’s cores are in the same, or deeper, sleep states, additional uncore resources can also be powered off, providing even greater savings. These processor-wide states are referred to as package C states, since they are a C state for the entire processor package.
Similar to P states, user space applications don’t need to worry about requesting C states, as this is handled by the kernel. The Linux kernel has both a generic ACPI idle driver and also a more advanced Intel® Idle driver, CONFIG_INTEL_IDLE, specifically designed and tuned for Intel® processors. One of the advantages of the Intel Idle driver is that it includes the estimated entrance and exit latencies for each C state for each supported hardware generation. This is especially beneficial for users whose firmware provides an incorrect or incomplete ACPI table for C state latencies, since the driver will default to using the correct values.
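The C states known to the active idle driver, and how long each core has spent in them, can be inspected from user space through the cpuidle sysfs interface. Below is a minimal sketch that walks the states advertised for cpu0; the name and time attributes are standard cpuidle attributes, and the number and names of the states depend on the processor and the driver in use.

/*
 * Minimal sketch: list the C states exposed by the active cpuidle driver
 * for cpu0, along with the total residency in each state.
 */
#include <stdio.h>

int main(void)
{
    for (int state = 0; ; state++) {
        char path[128], name[32] = "?";
        unsigned long long usec = 0;
        FILE *f;

        /* Human-readable state name, for example "POLL", "C1", or "C6". */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", state);
        f = fopen(path, "r");
        if (!f)
            break;              /* no more states */
        fscanf(f, "%31s", name);
        fclose(f);

        /* Cumulative time spent in this state, in microseconds. */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/time", state);
        f = fopen(path, "r");
        if (f) {
            fscanf(f, "%llu", &usec);
            fclose(f);
        }

        printf("%-8s %llu us\n", name, usec);
    }
    return 0;
}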
As deeper sleep states are entered, additional resources are powered down when the core enters the C state and then must be powered up when the core exits the C state. In other words, the lower C states provide smaller power savings, but are quicker to enter and exit, while the deeper C states provide more power savings, but are slower to enter and exit. As a result, when selecting a C state, the kernel considers how long the processor can sleep, and then chooses the deepest C state that meets that latency requirement. In other words, while user space applications aren’t responsible for manually selecting C states, their behavior must be tuned in order to allow the core to stay in deep sleep for as long as possible.
Launched in early 2011, the Second Generation Intel® Core™ processor family, shown in Figure 3.1, introduced significant improvements to a wide range of components. For example, the Front End was updated with a new cache for storing decoded μops. The instruction set was updated with the introduction of Intel® Advanced Vector Extensions (Intel® AVX), which extended the SIMD register width.
Prior to the Second Generation Intel Core processor family, Intel® integrated graphics were not part of the processor, but were instead located on the motherboard. These graphics solutions were designated Intel® Graphics Media Accelerators (Intel® GMA) and were designed to accommodate users whose graphical needs weren’t intensive enough to justify the expense of dedicated graphics hardware. These types of tasks included surfing the Internet or editing documents and spreadsheets. As a result, Intel GMA hardware has long been considered inadequate for graphically intensive tasks, such as gaming or computer-aided design.
Starting with the Second Generation Intel Core processor family, Intel integrated graphics are now part of the processor, meaning that the GPU is now an uncore resource. While the term “integrated graphics” has typically had a negative connotation for performance, it actually has many practical advantages over discrete graphics hardware. For example, the GPU can now share the Last Level Cache (LLC) with the processor’s cores. Obviously, this has a drastic impact on tasks where the GPU and CPU need to collaborate on data processing.
In many ways, the GPU now acts like another core with regard to power management. For example, starting with the Second Generation Intel Core processor family, the GPU has a sleep state, RC6. Additional GPU sleep states, such as RC6P, were added in the Third and Fourth Generation Intel Core processors. This also means that for the processor to enter a package C state, not only must all the CPU cores enter deep sleep, but so must the GPU. Additionally, the GPU is now part of the processor package’s TDP, meaning that the GPU can utilize the power budget of the cores and that the cores can utilize the power budget of the GPU. In other words, the GPU now plays a significant role in the processor’s power management.
Unfortunately, while processor technology has been advancing rapidly, memory speeds have not kept pace. One technique for improving memory throughput has been the addition of parallel channels. Since these channels are independent, the memory controller can operate in an interleaved mode, where addresses can alternate between the channels. This technique effectively multiplies the potential memory bandwidth by the number of available channels. At the time of this writing, dual-channel memory is the most prevalent, although triple-channel and quadruple-channel systems are available.
In order to utilize interleaved mode, three prerequisites need to be met. First, the amount of memory in each channel must be equal. Second, the size of the memory modules installed in each of the paired memory slots must be identical. Finally, the memory modules need to be installed in the correct channel slots on the motherboard, which are typically color-coded so that alternating pairs share the same color. Failure to meet any of these requirements would result in the memory controller falling back to asymmetric mode. As a result, two 2-GB memory modules, correctly configured, will be faster than one 4-GB memory module.
Whereas the choice between symmetric, that is, interleaved, mode and asymmetric mode has traditionally been all or nothing, Intel Flex Memory Technology allows for finer-grained control over the access mode. In other words, memory accesses to regions where both channels have physical memory are performed in interleaved mode, while memory accesses to regions where only one channel has physical memory are performed in asymmetric mode. This provides a performance boost even when the memory configuration is less than ideal.
For example, consider the situation where one 4-GB and one 2-GB memory module are installed. For the bottom 4 GB, memory accesses are interleaved between the bottom 2 GB of the 4-GB module and the 2-GB module. On the other hand, higher memory accesses occur in asymmetric mode, since all requests must access the top 2 GB of the 4-GB module. This type of memory configuration is often referred to as an L-shaped memory configuration, because the interleaved portion resembles the lower part of the letter L, while the asymmetric portion resembles the upper part of the letter.
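To make the example concrete, the following sketch classifies a physical address in the 4-GB plus 2-GB configuration above as belonging to the interleaved or the asymmetric region. It assumes a simplified, contiguous physical mapping with the interleaved region at the bottom of memory; the actual mapping is determined by the memory controller and chipset.

/*
 * Illustrative sketch only: classify a physical address in the 4-GB + 2-GB
 * example as interleaved (dual-channel) or asymmetric (single-channel),
 * under a simplified linear mapping.
 */
#include <stdint.h>
#include <stdio.h>

#define GB (1ULL << 30)

static const char *region(uint64_t phys_addr)
{
    /* Both channels can serve the bottom 4 GB (2 GB from each module);
     * only the larger module's top 2 GB serves addresses above that. */
    return phys_addr < 4 * GB ? "interleaved" : "asymmetric";
}

int main(void)
{
    printf("1 GB -> %s\n", region(1 * GB));
    printf("5 GB -> %s\n", region(5 * GB));
    return 0;
}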
The result of an L-shaped memory configuration is similar to a Non-Uniform Memory Access (NUMA) configuration, typically seen in the server space, because some memory accesses will be faster than others. While it may be tempting to use the NUMA infrastructure to inform the kernel about these discrepancies in performance, the existing techniques for handling NUMA are too heavy-handed for this kind of situation.
While Intel Flex Memory Technology reduces the performance impact of L-shaped memory configurations, they should still be avoided. Since Intel integrated graphics, except for the Iris™ Pro brand, have no dedicated memory, and therefore utilize the system’s memory, L-shaped memory configurations can also adversely affect graphics performance.
One of the challenges involved in optimizing a processor architecture is balancing the tradeoffs in performance between optimizing for serial and parallel workloads. While providing excellent parallel performance is a priority to Intel, neglecting serial workloads would be significantly detrimental to many common workloads.
While modern processors provide multiple cores on the package, outside of parallel workloads, most of the cores in client systems are often idle, either in deep sleep states or operating at a reduced frequency in order to conserve power. At the same time, the processor package must be designed to handle the thermal and power requirements of all of the cores, and the GPU, operating under sustained heavy load. As a result, the processor is often operating well below the power and thermal levels for which it was designed. The amount of power that the cooling solution must be able to dissipate in order to prevent the processor package from overheating is referred to as the Thermal Design Power (TDP). The TDP is also sometimes referred to as the processor’s power budget. It is important to note that the TDP is not the maximum amount of power the processor can draw.
Intel Turbo Boost technology leverages this observation in order to significantly boost single-threaded performance. When the processor detects that it is operating below the TDP, it uses the extra, currently unutilized, power to increase the frequency of one or more of the active cores, or GPU, beyond its normally available frequency. This continues, typically for a very short period of time, until the conditions for boosting are no longer met, at which point the core, or cores, return to normal frequencies.
One common misconception is that Intel Turbo Boost is a form of automated overclocking. It’s important to emphasize that Turbo Boost does not push the processor beyond its designed limitations.
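Whether Turbo Boost is currently permitted can be checked from user space. The sketch below assumes the Intel P state driver is active and reads its no_turbo sysfs attribute; on systems using a different frequency driver this file will not exist.

/*
 * Minimal sketch: report whether Turbo Boost is currently allowed, using
 * the intel_pstate driver's no_turbo knob (0 = turbo allowed, 1 = disabled).
 */
#include <stdio.h>

int main(void)
{
    int no_turbo;
    FILE *f = fopen("/sys/devices/system/cpu/intel_pstate/no_turbo", "r");

    if (!f || fscanf(f, "%d", &no_turbo) != 1) {
        fprintf(stderr, "intel_pstate sysfs interface not available\n");
        return 1;
    }
    fclose(f);

    printf("Turbo Boost: %s\n", no_turbo ? "disabled" : "enabled");
    return 0;
}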
The Running Average Power Limit (RAPL) interface provides the ability to deterministically monitor and enforce power management policies. This interface provides the administrator or operating system detailed control over how much power can be consumed by the specified components. As a result, RAPL can be utilized in order to avoid thermal throttling, conserve battery life, or lower the electricity utility bill.
The RAPL interface partitions the system into hierarchical domains. There are currently four supported domains, although not all processors support every domain. These four are:
Package (PKG): the processor package (core and uncore)
Power Plane 0 (PP0): processor core resources
Power Plane 1 (PP1): processor uncore resources
DRAM (DRAM): memory
For each supported domain, a series of two to five model-specific registers (MSRs) is available. These MSRs, along with the RAPL concept, are nonarchitectural. This means that their presence and behavior is not guaranteed to remain consistent throughout all future processor generations. At the time of this writing, the Second, Third, and Fourth Generation Intel Core processor families support RAPL.
Each of the MSRs corresponds to a RAPL domain and capability, and therefore follows the naming convention of MSR_x_y, where x is the domain name, as shown in parentheses above, and y is the capability name. For example, the registers exposing the energy status capability of the package and PP0 domains are MSR_PKG_ENERGY_STATUS and MSR_PP0_ENERGY_STATUS, respectively.
The power limit capability, controlled through the registers named with the MSR_x_POWER_LIMIT pattern, allows for the average power consumption over a specified time window to be capped for the relevant domain. Note that this is the average power usage for the component, not the maximum power usage. Some domains, such as the processor package, support two independent power caps, each with their own average power limit and designated time window. This is often useful for specifying one short term goal for controlling the component’s temperature, and one long term goal for controlling battery life.
The energy status capability, controlled through the registers named with the MSR_x_ENERGY_STATUS pattern, provides the total energy consumption, in joules.
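As an illustration of how these registers fit together, the following minimal sketch estimates average package power by sampling MSR_PKG_ENERGY_STATUS (0x611) over a one-second window and scaling the delta by the energy unit reported in MSR_RAPL_POWER_UNIT (0x606). It assumes root privileges, the Linux msr module, and a processor that implements these nonarchitectural MSRs; the one-second window is an illustrative choice.

/*
 * Minimal sketch: estimate average package power from the RAPL energy
 * counter.  Both MSRs are nonarchitectural; error handling is minimal.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_RAPL_POWER_UNIT   0x606
#define MSR_PKG_ENERGY_STATUS 0x611

static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t val = 0;

    if (pread(fd, &val, sizeof(val), reg) != sizeof(val))
        perror("rdmsr");
    return val;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/cpu/0/msr");
        return 1;
    }

    /* Bits 12:8 give the energy status unit as 1/2^ESU joules. */
    unsigned esu = (rdmsr(fd, MSR_RAPL_POWER_UNIT) >> 8) & 0x1f;
    double joules_per_count = 1.0 / (double)(1u << esu);

    uint64_t e0 = rdmsr(fd, MSR_PKG_ENERGY_STATUS) & 0xffffffff;
    sleep(1);
    uint64_t e1 = rdmsr(fd, MSR_PKG_ENERGY_STATUS) & 0xffffffff;
    close(fd);

    /* The counter is 32 bits wide and wraps; wraparound is ignored here. */
    printf("package power: %.2f W\n", (double)(e1 - e0) * joules_per_count);
    return 0;
}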
Because the RAPL interface is exposed through MSRs, any application with sufficient privileges can utilize this interface without any kernel support. For example, the PowerTOP tool, described in further detail in Section 11.3, manually accesses the power consumption data from RAPL for some of its power estimates. Aside from manually accessing the MSRs, the Intel® RAPL powercap Linux kernel driver, CONFIG_INTEL_RAPL, exposes this functionality through the sysfs filesystem. This sysfs interface is utilized by the Linux Thermal Daemon, thermald. By default, the thermald daemon monitors the system’s temperatures and uses RAPL power capping, along with the Intel P state driver, CONFIG_X86_INTEL_PSTATE, in order to prevent thermal throttling via T states. More information on the Linux Thermal Daemon can be found at https://01.org/linux-thermal-daemon.
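For comparison, reading the same package energy counter through the powercap driver’s sysfs interface requires no direct MSR access at all. The sketch below assumes the common intel-rapl:0 package domain; the domain numbering under /sys/class/powercap/ varies between systems.

/*
 * Minimal sketch: read the cumulative package energy counter, in
 * microjoules, exposed by the Intel RAPL powercap driver.
 */
#include <stdio.h>

int main(void)
{
    unsigned long long uj;
    FILE *f = fopen("/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj",
                    "r");

    if (!f || fscanf(f, "%llu", &uj) != 1) {
        fprintf(stderr, "RAPL powercap interface not available\n");
        return 1;
    }
    fclose(f);

    printf("package energy: %llu uJ\n", uj);
    return 0;
}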