The Microarchitecture Behind Meltdown

Since the recent (Jan. 2018) disclosure of the Meltdown vulnerability, there has been a lot of interest, speculation, and hysteria, but not a particularly good understanding of the processor microarchitecture feature responsible for it. Understanding the root cause of the vulnerability explains why only some microarchitectures are affected, and allows reliably testing for the existence (or, even harder, the non-existence) of the vulnerability on various processors, instead of relying solely on vendor self-reporting (or worse, speculation…).

This article first defines the microarchitectural mechanism that allows Meltdown to work, then develops a microbenchmark to specifically test for this behaviour on multiple microarchitectures.

. . . → Read More: The Microarchitecture Behind Meltdown
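
The article develops its microbenchmark step by step, but the measurement primitive underneath any Meltdown-style test is a cache-timing probe. Here is a minimal Flush+Reload sketch (illustrative only, not the article's code; the threshold separating hit from miss is machine-specific):

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

    /* Time one load of *p in cycles: a cached line reads back in a few
       cycles, a flushed line costs a trip to memory (~100+ cycles). */
    static uint64_t time_load(volatile uint8_t *p)
    {
        unsigned aux;
        _mm_mfence();
        uint64_t t0 = __rdtscp(&aux);
        (void)*p;                         /* the probed load */
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }

    int main(void)
    {
        static uint8_t probe[4096];

        probe[0] = 1;                     /* warm the line ...  */
        uint64_t hot = time_load(probe);  /* ... and time a hit */

        _mm_clflush(probe);               /* evict the line ...  */
        _mm_mfence();
        uint64_t cold = time_load(probe); /* ... and time a miss */

        printf("cached: %llu cycles, flushed: %llu cycles\n",
               (unsigned long long)hot, (unsigned long long)cold);
        return 0;
    }

A Meltdown-style test flushes a probe line, lets a faulting speculative load run, then uses the reload time to tell whether the speculative code brought the line into the cache.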

Microbenchmarking Return Address Branch Prediction

Modern processors use branch predictors to predict a program’s control flow in order to execute further ahead in the instruction stream. Function return instructions use a specialized branch predictor, variously called a Return Address Stack (RAS), Return Stack Buffer (RSB), or return stack. This article presents a series of increasingly complex microbenchmarks to measure the behaviour of the RAS found in several Intel and AMD processor microarchitectures.

. . . → Read More: Microbenchmarking Return Address Branch Prediction
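
To give a flavour of the approach (a simplified sketch, not one of the article's benchmarks): time a nested call/return chain of varying depth. Returns predict perfectly while the depth fits in the RAS; once it exceeds the capacity, the deepest returns mispredict and the per-return cost jumps.

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    /* noinline keeps a real call; the empty asm after the recursive call
       prevents the compiler from turning it into a tail call (a jump). */
    __attribute__((noinline)) static void chain(int n)
    {
        if (n > 0) {
            chain(n - 1);
            asm volatile("" ::: "memory");
        }
    }

    int main(void)
    {
        enum { REPS = 100000 };
        for (int depth = 1; depth <= 64; depth++) {
            unsigned aux;
            uint64_t t0 = __rdtscp(&aux);
            for (int i = 0; i < REPS; i++)
                chain(depth);
            uint64_t t1 = __rdtscp(&aux);
            printf("depth %2d: %5.1f cycles per call+return\n",
                   depth, (double)(t1 - t0) / ((double)REPS * depth));
        }
        return 0;
    }

On most current processors the cycles-per-return figure should step upward somewhere in the 16–32 depth range, roughly where the return stack capacity sits.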

TLB and Pagewalk Coherence in x86 Processors

Processors that support paging use TLBs to cache translations. On x86, translation caches are not coherent, and software is required to explicitly invalidate a TLB entry after updating a page table entry. Similarly, pagewalks are not guaranteed to be coherent, so modifying a page table entry must be followed by an invalidation even if the entry is not cached in the TLB.

Real processor implementations do not provide TLB coherence, but it turns out many (though not all) processors actually do provide pagewalk coherence. Most do so by detecting when a page table entry update conflicts with a pagewalk’s memory accesses, but some instead disallow speculative pagewalks, at some performance cost. I show a microbenchmark that can test for TLB and pagewalk coherence, and for whether speculative pagewalks are used.

. . . → Read More: TLB and Pagewalk Coherence in x86 Processors
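
For context, the invalidation the x86 architecture requires looks like this (a kernel-context sketch with names of my choosing; invlpg is a privileged instruction):

    #include <stdint.h>

    /* Kernel context assumed: invlpg is privileged. */
    static inline void set_pte_and_invalidate(uint64_t *pte, uint64_t new_entry,
                                              void *va)
    {
        *pte = new_entry;              /* update the page table entry */
        /* Neither the TLB nor an in-flight pagewalk is guaranteed to be
           coherent with the store above, so the translation for va must
           be explicitly invalidated before the new mapping is used: */
        asm volatile("invlpg (%0)" :: "r"(va) : "memory");
    }

The microbenchmark in the article probes what the hardware actually does when this invalidation is omitted or races with a pagewalk.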

Windows 9x TLB Invalidation Bug

In processor architectures that support paging, there are usually one or more TLBs or pagewalk caches to cache address translations. On x86, these translation caches are not coherent with memory accesses that modify the page tables, and need invalidating after a page table entry is modified.

The Windows 9x kernel contains code that modifies a page table entry, then immediately uses it without an invalidation. This causes crashes if the processor strictly follows the instruction set specification and does not provide pagewalk coherence.

. . . → Read More: Windows 9x TLB Invalidation Bug
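
In sketch form, the failing pattern looks like this (hypothetical names; the real code is in the Windows 9x memory manager, and 9x page table entries are 32-bit):

    #include <stdint.h>

    /* Kernel context assumed. */
    static uint8_t modify_then_use(uint32_t *pte, uint32_t new_entry,
                                   volatile uint8_t *va)
    {
        *pte = new_entry;   /* modify the page table entry ...            */
        return *va;         /* ... and use the mapping immediately. With no
                               invlpg in between, a pagewalk may use the old
                               entry (or be in flight across the update), so
                               a processor without pagewalk coherence can
                               translate va with stale data and crash.     */
    }

Inserting an invlpg between the page table write and the access makes the sequence architecturally correct.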

Store-to-Load Forwarding and Memory Disambiguation in x86 Processors

In pipelined processors, instructions are executed speculatively and are not permitted to modify system state until instruction commit. For stores to memory, speculative stores write into a store queue at execution time and only write into the cache after the store instructions have committed. Out-of-order memory execution requires hardware that learns the dependencies between stores and loads, as well as the ability to forward stored values from the store queue to loads that depend on them. I describe two variations of a microbenchmark that can measure some aspects of store-to-load forwarding and the memory execution hardware. These showed that AMD’s Bulldozer and Piledriver processors likely do not use a dynamic memory dependence predictor. They were also used to generate interesting 2D charts that can reveal some details about how the memory execution hardware might be designed.

. . . → Read More: Store-to-Load Forwarding and Memory Disambiguation in x86 Processors
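
A stripped-down version of the measurement idea (a sketch of the technique, not the article's benchmark): serialize a store→load→store chain through a single memory location so the forwarding latency is directly visible.

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    int main(void)
    {
        volatile uint64_t slot = 0;   /* the location every store/load hits */
        uint64_t v = 0;
        unsigned aux;
        enum { N = 10000000 };

        uint64_t t0 = __rdtscp(&aux);
        for (long i = 0; i < N; i++) {
            slot = v;          /* store ...                                */
            v = slot + 1;      /* ... and a dependent load: the value must
                                  be forwarded from the store queue if the
                                  store has not yet written to the cache.  */
        }
        uint64_t t1 = __rdtscp(&aux);
        printf("%.1f cycles per store-to-load round trip\n",
               (double)(t1 - t0) / N);
        return 0;
    }

Varying the relative size and alignment of the store and the load is what produces the 2D charts: some combinations forward from the store queue, while others fall back to a much slower path.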

AMD Bulldozer/Piledriver Modules and Hyper-Threading

Ever since Intel’s Hyper-Threading and AMD’s Bulldozer modules, there has been much debate over what qualifies as a real CPU “core”. Unfortunately, I don’t think “core” is easy to define, so marketing tends to name things for its own benefit. In the end, it’s the performance that matters, not the name. Two-way Hyper-Threading gives around a 23% improvement over one thread, while two-way multithreading within a “module” gives 54%. This is still quite far from the >90% improvement that replicating the entire CPU core would achieve.

. . . → Read More: AMD Bulldozer/Piledriver Modules and Hyper-Threading
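
Numbers like these come from a simple throughput comparison: run the same kernel on one thread, then on two threads pinned to sibling logical CPUs, and compare aggregate throughput. A rough sketch (the CPU numbers, and the assumption that CPUs 0 and 1 share a core, are machine-specific; the speedup also depends heavily on the workload):

    /* Compile: gcc -O2 -pthread smt.c */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <pthread.h>
    #include <sched.h>
    #include <time.h>

    #define ITERS 200000000UL

    static void *work(void *arg)
    {
        int cpu = *(int *)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        volatile unsigned long acc = 1;   /* dependent chain keeps the thread busy */
        for (unsigned long i = 0; i < ITERS; i++)
            acc = acc * 3 + 1;
        return NULL;
    }

    static double run(int ncpu, int cpus[])
    {
        struct timespec a, b;
        pthread_t t[2];
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < ncpu; i++)
            pthread_create(&t[i], NULL, work, &cpus[i]);
        for (int i = 0; i < ncpu; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
        int one[] = {0}, pair[] = {0, 1};   /* assumes 0 and 1 are siblings */
        double t1 = run(1, one);
        double t2 = run(2, pair);
        /* two threads do twice the work, so speedup = 2 * t1 / t2 */
        printf("speedup over one thread: %.2fx\n", 2.0 * t1 / t2);
        return 0;
    }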

Measuring Reorder Buffer Capacity

On conventional out-of-order processors, instructions are not necessarily executed in “program order”, although the processor must give the same results as though execution had occurred in program order. The instruction window holds the small set of in-flight instructions that are allowed to execute out of order, before they are committed in program order as they leave the window. This article describes a microbenchmark that can measure the size of the instruction window, demonstrated on several x86 microarchitectures, then extends the microbenchmark to measure the size of the speculative register file.

. . . → Read More: Measuring Reorder Buffer Capacity
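
The core idea, in sketch form: issue two independent cache-miss loads separated by cheap filler instructions. While both misses fit in the instruction window they overlap, and time per iteration stays near one miss latency; once the filler count exceeds the window size, the misses serialize and the time roughly doubles. (The article generates the filler count programmatically; this sketch fixes it at compile time, so the filler must be varied across builds.)

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    #define CHAIN_LEN (1 << 21)   /* 16 MB of pointers: misses past L3 on most parts */

    /* Build a randomly permuted cyclic pointer chain so every load is a
       cache miss that the prefetchers cannot predict. */
    static uintptr_t make_chain(void)
    {
        uintptr_t *buf = malloc(CHAIN_LEN * sizeof *buf);
        size_t *perm = malloc(CHAIN_LEN * sizeof *perm);
        for (size_t i = 0; i < CHAIN_LEN; i++)
            perm[i] = i;
        for (size_t i = CHAIN_LEN - 1; i > 0; i--) {      /* Fisher-Yates */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < CHAIN_LEN; i++)
            buf[perm[i]] = (uintptr_t)&buf[perm[(i + 1) % CHAIN_LEN]];
        free(perm);
        return (uintptr_t)buf;
    }

    #define NOP4   asm volatile("nop; nop; nop; nop")
    #define FILLER do { NOP4; NOP4; NOP4; NOP4; } while (0)   /* 16 nops */

    int main(void)
    {
        uintptr_t p1 = make_chain(), p2 = make_chain();
        enum { N = 1 << 16 };
        unsigned aux;

        uint64_t t0 = __rdtscp(&aux);
        for (long i = 0; i < N; i++) {
            p1 = *(volatile uintptr_t *)p1;   /* long-latency miss, chain 1 */
            FILLER;
            p2 = *(volatile uintptr_t *)p2;   /* independent miss, chain 2  */
            FILLER;
        }
        uint64_t t1 = __rdtscp(&aux);
        printf("%.1f cycles/iteration with 16 filler nops per load\n",
               (double)(t1 - t0) / N);
        return 0;
    }

Swapping the nop filler for instructions that consume registers is what extends the technique to measuring the speculative register file instead of the reorder buffer.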

Intel Ivy Bridge Cache Replacement Policy

Caches are used to store a subset of a larger memory space in a smaller, faster memory, in the hope that future memory accesses will find their data in the cache. Traditionally, caches have used (approximations of) the least-recently-used (LRU) replacement policy, but LRU performs poorly on cyclic access patterns whose working sets are larger than the cache. Intel Ivy Bridge’s L3 cache uses an improved adaptive replacement policy and is no longer purely pseudo-LRU.

. . . → Read More: Intel Ivy Bridge Cache Replacement Policy
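
The distinguishing access pattern is easy to sketch: sweep cyclically over a working set of W bytes and time the accesses. Under pseudo-LRU, a W slightly larger than the cache yields near-zero hits; an adaptive policy protects part of the working set and keeps the average access time lower. (A simplification: a linear sweep also engages the prefetchers, which the article's methodology has to account for.)

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    #define LINE 64   /* cache line size in bytes */

    /* Sweep cyclically over the first w bytes of buf, touching one byte
       per cache line, and return the average cycles per access. */
    static double sweep(uint8_t *buf, size_t w, int reps)
    {
        volatile uint8_t sink;
        unsigned aux;
        double accesses = (double)(w / LINE) * reps;

        uint64_t t0 = __rdtscp(&aux);
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < w; i += LINE)
                sink = buf[i];
        uint64_t t1 = __rdtscp(&aux);
        (void)sink;
        return (double)(t1 - t0) / accesses;
    }

    int main(void)
    {
        size_t max = (size_t)64 << 20;    /* sweep working sets up to 64 MB */
        uint8_t *buf = malloc(max);
        for (size_t i = 0; i < max; i++)
            buf[i] = 1;                   /* fault the pages in */

        for (size_t w = (size_t)1 << 20; w <= max; w *= 2)
            printf("%6zu KB: %5.1f cycles/line\n", w >> 10, sweep(buf, w, 16));
        return 0;
    }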

A Comparison of Intel’s 32nm and 22nm Core i5 CPUs: Power, Voltage, Temperature, and Frequency

Each new generation of CMOS manufacturing process brings a new set of trade-offs. Intel’s recent tradition of manufacturing the same processor microarchitecture on two processes provides an opportunity to measure some of the voltage-delay-power scaling trends. The 22nm Ivy Bridge significantly improves on static (leakage) power over the 32nm Sandy Bridge, but shows only small reductions in dynamic power. Ivy Bridge also requires larger voltage increases for the same frequency increase, and its thermal resistance increased over Sandy Bridge’s, likely due to the change from solder to a polymer thermal interface material.

. . . → Read More: A Comparison of Intel’s 32nm and 22nm Core i5 CPUs: Power, Voltage, Temperature, and Frequency
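
For reference, the separation into static and dynamic components follows the standard CMOS power model (textbook form, not taken from the article):

    P \approx \underbrace{\alpha C V^{2} f}_{\text{dynamic}}
      + \underbrace{V\, I_{\mathrm{leak}}}_{\text{static}},
    \qquad
    T_{\mathrm{junction}} = T_{\mathrm{ambient}} + P\,\theta_{\mathrm{JA}}

where α is the activity factor, C the switched capacitance, and θ_JA the die-to-ambient thermal resistance. Since sustaining a higher f also requires a higher V, dynamic power grows faster than linearly with frequency, and the θ_JA term is exactly what the solder-to-polymer interface change affects.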

Ivy Bridge Power Consumption

This is a preliminary attempt at characterizing the power consumption of Ivy Bridge at various clock frequencies and loads. I present plots of CPU power consumption (at the 12V connector) at varying frequency, voltage, and number of cores utilized, including the power impact of Hyper-Threading.

. . . → Read More: Ivy Bridge Power Consumption
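
For Sandy Bridge and later, a software-only alternative to instrumenting the 12V connector is the processor's own RAPL energy counters. A minimal sketch reading the Linux powercap interface (the path and zone numbering are machine-specific, reading may require root, and counter wraparound is ignored):

    #include <stdio.h>
    #include <unistd.h>

    /* Read the cumulative package energy counter, in microjoules. */
    static long long read_energy_uj(void)
    {
        long long uj = -1;
        FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
        if (f) {
            if (fscanf(f, "%lld", &uj) != 1)
                uj = -1;
            fclose(f);
        }
        return uj;
    }

    int main(void)
    {
        long long e0 = read_energy_uj();
        sleep(1);                    /* sample interval: one second */
        long long e1 = read_energy_uj();
        printf("package power: %.2f W\n", (double)(e1 - e0) / 1e6);
        return 0;
    }

Note that RAPL reports power as estimated by the processor itself, so it will not exactly match power measured at the connector.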