TLB and Pagewalk Coherence in x86 Processors

Henry Wong

TLB and Pagewalk Coherence in x86 Processors

By Henry, on August 10th, 2015

Since the 386, x86 processors have supported paging, which uses a page table to map virtual address pages to physical address pages. This mapping is controlled by the operating system, which gives user applications a contiguous virtual memory space, and isolates the memory spaces of different processes.

Page tables are located in main memory, so a cache (TLB: Translation Lookaside Buffer) is needed for acceptable performance. When doing virtual to physical address translations, the TLB maps virtual pages to physical pages, and is typically looked up in parallel with the L1 cache. For x86, the processor “walks” the page tables in memory if there is a TLB miss. Some other architectures throw an exception and ask the OS to load the required entry into the TLB.

The x86 architecture specifies that the TLB is not coherent or ordered with memory accesses (i.e., page tables), and requires that the relevant TLB entry (or the entire TLB) be flushed after any changes to the page tables. Failing to invalidate would cause the processor to use the stale entry in the TLB if the entry is in the TLB, or a page table walk could non-deterministically see either the old or new page table entry. With out-of-order processors, relaxing coherence requirements allows the processor to more easily reorder operations for more performance.

But do real processor implementations really behave this way, or do some processors provide more coherence guarantees in practice? One particular interesting case concerns what happens when a page table entry that is known not to be cached in the TLB is changed, then immediately used for a translation (via a pagewalk) without any invalidations. Are real processors’ pagewalks more coherent than required by the specification (and does any software rely on this)? And if pagewalks are coherent with memory, what mechanism is used?

In the rest of this post, I’ll attempt answer some of the above questions, and measure the pagewalking behaviour of a bunch of x86 processors.

Basics of Paging
TLB and Pagewalk Coherence
Design Space
Results Summary
Methodology
Detailed Test Descriptions and Results
- Test 1: Coherence
- Test 2: Pagewalk Speculation
Summary and Conclusions
Source code

Basics of Paging

The basic task of paging is to map virtual addresses to physical addresses. This occurs at the granularity of a “page” (default 4 KB on x86). This means that for a 32-bit address, the upper 20 bits are translated (“virtual page number” mapped to “physical page number”), while the lower 12 are unchanged (“page offset”). Figure 1 shows an example where the virtual address 0x12345678 is mapped to 0x00abc678, doing two memory accesses. PAE and 64-bit extensions allow larger address spaces by increasing the number of levels of page tables, to 3 and 4, respectively.

Figure 1: Two-level pagewalks in 32-bit x86.

Of course, if every memory access became three accesses, the performance penalty would not be acceptable. A TLB (Figure 2) is a cache of translations that stores the result of previous pagewalks, which directly maps virtual page numbers to physical page numbers without reading the page tables on a TLB hit.

Figure 2: TLB caches translations to avoid page table walks

TLB and Pagewalk Coherence

Whenever there are caches, there is the problem of cache coherence. If a page table entry is changed, would the corresponding entry in the TLB be automatically updated or invalidated, or would software need to explicitly invalidate or flush the TLB? In x86 (and all other architectures I can think of), the TLB is not coherent. The behaviour of non-coherent TLBs is fairly intuitive: if there is a stale entry cached in the TLB, a translation would use the stale entry in the TLB. But if there is no stale entry cached in the TLB, a translation requires a page table walk. Is this page table walk coherent with memory accesses? Here is what Intel’s and AMD’s manuals say about it:

From Section 4.10.2.3 of Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide (June 2015):

The processor may cache translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path.

From Section 7.3.1 of AMD64 Architecture Programmer’s Manual Volume 2: System Programming (Rev 3.23):

…it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated in Step 2 [to page N] … Similar behavior can occur when instructions are prefetched from beyond the page table update instruction. To prevent this problem, software must use an INVLPG or MOV CR3 instruction immediately after the page-table update…

Both Intel and AMD manuals state that without invalidation, even if the old entry is not cached in the TLB, the pagewalk may still see the old entry. That is, in the following pseudocode, the second instruction (load) can non-deterministically use either the old mapping or new mapping, and that there must be an invalidation or TLB flush in between to guarantee the new page table entry is visible by the second instruction.
mov [page table], new_mapping mov eax, [linear address using updated mapping]

Interestingly, non-coherent pagewalk behaviour is mentioned in the Pentium 4 specification update as an erratum (N94, NoFix) even though the supposedly-erroneous behaviour is still “correct”.

In this post, I use “TLB coherence” to refer to coherence between page tables in memory and the TLB contents (affecting TLB hits), and “pagewalk coherence” for the coherence between pagewalks and page table modifications (affecting TLB misses).

Who cares about pagewalk coherence?

Even though the instruction set specification does not guarantee pagewalk coherence, there is OS code that relies on it for correctness. There is kernel code in Windows 95 through Me that modifies page table mappings without invalidation, and this causes random crashes if pagewalks are not coherent.

This leads me to think that perhaps real processors do provide pagewalk coherence (or that Windows 95 through Me were crash-prone, hmm…)

Design Space of TLB and Pagewalk Coherence

The first design decision is whether TLB contents are coherent with memory accesses (page table modifications). This is not required, and real processors do not provide it.

The second design decision is whether page table walks are coherent with memory accesses. The instruction set specifications do not require it, but many processors choose to provide coherence anyway.

For processors that do provide coherence, there are two classes of mechanisms to provide it: All pagewalks could be performed non-speculatively, after all earlier memory instructions are completed. Or, for more performance, pagewalks could be performed speculatively, with a mechanism to detect and retry the pagewalk if the processor later detects that the pagewalk conflicted with an earlier store to memory.

The design space of pagewalk behaviours can be classified into three categories: Non-coherent pagewalks, coherent pagewalks done non-speculatively, and coherent pagewalks done speculatively with detect-and-retry.

Results Summary

Pagewalk Behaviour	Processors (Family, Model, Stepping)
Coherent TLB and pagewalks	VirtualBox without nested paging
Coherent, Speculative	Intel Pentium Pro (6, 1, 9) Intel Pentium 4 Northwood (15, 2, 9) Intel Sandy Bridge (6, 42, 7) Intel Ivy Bridge (6, 58, 9) Intel Haswell (6, 60, 3) AMD K7 Thunderbird (6, 4, 2) AMD K7 Thoroughbred (6, 8, 1) AMD Phenom (16, 2, 3) AMD Phenom II (16, 10, 0)
Coherent?, Speculative	Intel Core 2 45nm (6, 23, 10) Intel Lynnfield (6, 30, 5) These two very rarely uses the old mapping. I can’t tell whether there is some form of TLB prefetching causing the old mapping to end up in the TLB, or whether the pagewalk coherence mechanism occasionally fails.
Coherent, Non-speculative	Intel Atom Silvermont (6, 55, 3) Via C3 “Samuel 2” (CentaurHauls 6, 7, 3)
Non-coherent	AMD Bulldozer (21, 1, 2) AMD Piledriver (21, 2, 0) AMD Zen 2 (23, 113, 0) Ryzen 5 3600X

Coherence within VirtualBox

VirtualBox with shadow page tables behaves as a coherent TLB (not merely coherent pagewalks). Updates to page tables are always immediately visible without TLB invalidation. But the cost of providing this coherence is around 60,000 clock cycles per page table update, compared to one or a few for a simple store instruction.

Virtual machines have two levels of translation: Guest virtual to guest physical (which is also the host virtual address), then the usual host virtual to host physical translation, with two sets of page tables, one managed by the guest OS, one managed by the host OS. Virtual machines without nested paging usually implement shadow page tables, where the CPU uses “shadow” page tables that directly map guest virtual addresses to host physical addresses. The VM monitors changes in the guest OS’s page tables and updates the shadow page tables in response. Often the guest page tables are marked as read-only so all writes immediately trap into the VM to be handled.

Methodology

Testing for TLB and pagewalk coherence requires changing page table mappings and observing their behaviour and timing. The test needed kernel-mode code needed to gain access to the page tables. It also needed to create a mechanism for modifying the page tables quickly from user mode.

There were two variations of the test, one to test for TLB and pagewalk coherence (functionality), and another timing-based test to test whether pagewalks can occur speculatively and whether there is a penalty for misspeculations. These tests were then run on multiple systems.

Getting access to the page tables

Page tables are located in physical address space, is controlled by the OS, and normally inaccessible from user mode. Gaining access to the page tables requires that we know where the page tables are located in (physical) memory, and a method to read and write to physical memory.

The page tables used by the current process is stored in control register CR3. Directly reading from CR3 (using a MOV instruction) is privileged, so can’t be done in user mode. Linux does not appear to provide an alternative mechanism to read a process’s CR3, so I wrote a short kernel module that simply reports the current CR3 value (presumably valid for the current process).

Accessing physical memory is easier, as Linux already provides a /dev/mem character device that can read and write to physical memory (as long as the kernel was not compiled with STRICT_DEVMEM).
Knowing the location of the page tables and having a mechanism to read/write arbitrary physical memory allows me to write pagewalking routines that can traverse and modify the page tables. There are different routines depending on the type of paging in use (two-level 32-bit x86, three-level 32-bit PAE, and four-level 48-bit x86-64).

However, since /dev/mem is accessed with file read/write operations and requires a switch into kernel mode, it is much too slow. A microbenchmark needs a method to update a page table entry quickly with just one instruction.

Fast Page Table Updates

Fast page table updates is done by using /dev/mem to set up the page tables so that a region of the virtual memory space maps to the page tables for a different array. Something similar is often used by OSes as well. This allows user-mode code (with paging enabled) to directly update page tables with a single instruction, without switching to kernel mode, and without disabling paging. Figure 3 shows the page table setup. The page tables that control the mappings for the pt_window array are changed (using slow /dev/mem) so that accesses to pt_array actually access the page tables for bigbuf, allowing fast modification of bigbuf‘s page table mappings.

Linux checks the consistency of a process’s page tables when the process exits, so the microbenchmark backs up any of the page tables entries it modifies and restores them before exiting.

Figure 3: Page table mappings to allow fast modification of page tables. Two-level 32-bit x86 page tables shown.

Detailed Test Descriptions and Results

Test 1: TLB and Pagewalk Coherence

(This test is opts.mode=0 in the code.)

TLB and pagewalk coherence concerns the processor’s behaviour when a page table mapping is modified and then used without an intervening invalidation. We can measure this by changing the mappings to point to one of two pages with known contents, so the result of a load indicates which page mapping was actually used. Pointing all of the mappings to just two pages has the additional advantage of using only 8 KB of (VIPT or PIPT) L1 cache footprint, allowing better timing measurements with less interference from data cache misses.

The test accesses pages in a random cyclic ordering, to defeat prefetching and generate TLB misses with high probability. All pages point to a page containing the number 1 (“page 1”) by default. For each page, the mapping is changed to point to a different page containing the number 2 (“page 2”), then immediately used by a load. This is immediately changed back to point to page 1, and another load performed, so the page tables point to page 2 only very briefly.

*p_thispte = map_pte2;         // Change mapping
n = *(unsigned int*)p_array;   // First access: n is always 1 or 2.
lcounts[n+1]++;                // Count occurrences of n

*p_thispte = map_pte1;         // Change mapping back
n = *(unsigned int*)p_array;   // Second access
lcounts[n-1]++;

In a fully-coherent system, the first load will read a 2 from the newly-changed mapping, and the second load will read a 1 right after the mapping is reverted. If the TLB is not coherent, the second load will still see 2 because the first load has already placed the mapping in the TLB which the second load reuses. If the pagewalks are not coherent, the first load can (but doesn’t necessarily) see 1, because the pagewalk speculatively occurred before the page table update.

First access	Second access	Explanation
2	1	Both TLB and pagewalk are coherent. This doesn’t happen in real processors, but does with shadow paging in a VM.
2	2	Coherent pagewalk, non-coherent TLB. Second access’s TLB hit uses mapping from first access’s pagewalk.
1	1	Non-coherent TLB and pagewalk. Pagewalk occurs early, before the remapping.
1	2	This doesn’t happen. It would require both accesses to do pagewalks, and both speculative.

Table 2: Testing for TLB and Pagewalk Coherence

This test only measures whether TLBs and pagewalks are coherent. For systems with coherent pagewalks, a timing-based test is needed to determine the mechanism to provide it: Non-speculative pagewalks, or speculative pagewalks with a recovery mechanism.

Test 2: Speculative or Non-Speculative Pagewalks?

(This test is opts.mode=1 in the code.)

To distinguish between speculative and non-speculative pagewalks, we need to construct a test case where a pagewalk result can be produced and used earlier than the completion of the previous instruction, then proving whether this reordering actually occurred. To detect this, I use an inner loop containing a chain of data-dependent arithmetic instructions (to create a known latency), followed by a store that optionally remaps a page table entry, and a pagewalk (a load). The remapping and pagewalk are similar what’s used in Test 1. The data dependencies between these components of the inner loop can be selectively enabled. Figure 4 shows three unrolled iterations of the test’s inner loop and the four selectively-enabled data dependencies.

When there are no data dependencies between the arithmetic and memory operations, two independent chains of arithmetic operations from different loop iterations can be overlapped. For example, 16 dependent IMUL instructions have a latency of 48 cycles on Intel Haswell, but the loop only takes 26 cycles per iteration with no data dependencies, 49 cycles per iteration if the IMULs between loop iterations are dependent (C), and 76 cycles if the pagewalk and IMULs are also dependent on each other (A and B). Selectively adding data dependencies between components allows observing whether there are other implicit dependencies for microarchitectural reasons. For example, non-speculative pagewalks behaves similarly to a dependence of the pagewalk on the preceding instruction (A). IMUL was chosen because it is a pipelineable high latency arithmetic instruction with few micro-ops.

Figure 4 also shows schematics of the execution of the IMUL chain (blue) and pagewalk (green) for all 8 cases (0 to 7). In an out-of-order processor, loop iterations can be overlapped. Adding one dependency while keeping the loop iterations independent (A or B) usually has little impact on execution time, but adding both dependencies has a large impact (A and B). Dependency C serializes the IMUL chain, but the load is free to execute earlier or later, unless constrained by data dependencies A or B or for microarchitectural reasons.

Figure 4: Testing for speculative pagewalks by selectively enabling data dependencies (A through D).

Since a non-speculative pagewalk behaves similarly to enabling dependence A (the pagewalk can’t proceed until the previous IMUL is done), I can test for its presence by enabling only dependence B, then observing whether the loop iterations are still overlapped (indicating speculative pagewalks) or whether it serializes every loop iteration (similar to having both A and B enabled). Table 3 shows results for a few of the processors. Speculative pagewalks are indicated by case 2 (B only) having similar execution time per iteration to case 0 (no dependencies), while non-speculative pagewalks are indicated by case 2 having runtime similar to case 3 (A and B). To be certain the pagewalk is speculative, the runtime for case 2 needs to be smaller than the latency of just the IMULs (case 4), as overlapping the execution of the IMUL chains requires that a pagewalk must be completed before the previous IMUL chain completes. This affects the Phenom II: case 2 is faster than case 3, but isn’t faster than case 4, so it is likely speculative, but not fast enough to be completely certain.

There is no remapping (--dont_remap) to avoid triggering any coherence recovery mechanisms, as we only want to observe pagewalks for Table 3.

	0	1	2	3	4	5	6	7	Case
	N				Y				C: IMUL to IMUL
	N		Y		N		Y		B: Load to IMUL
	N	Y	N	Y	N	Y	N	Y	A: IMUL to load
Haswell	26.0	26.1	26.2	75.6	49.1	49.1	49.2	75.5	Speculative pagewalks. 3-cycle IMUL.
P4 Northwood	205.4	205.5	206.1	284.2	242.5	242.4	242.7	286.0	Speculative pagewalks. ~15-cycle IMUL.
AMD Phenom II	42.5	45.3	71.7	92.6	49.3	49.4	71.7	92.5	Probably speculative pagewalks. IMULs don’t overlap enough to be certain (Case 2 is not smaller than case 4).
AMD Piledriver	33.6	40.3	54.9	136.4	65.7	65.4	66.8	144.5	Speculative pagewalks.
AMD Zen 2	21.1	21.1	46.3	93.2	49.1	49.1	50.8	93.2	Speculative pagewalks.
Atom Silvermont	41.7	41.7	77.6	77.6	50.4	50.4	77.6	77.6	Non-speculative pagewalks.

Table 3: Runtime for varying data dependencies, without page table remapping.

A coherent pagewalk requires that the load that uses the new mapping must occur after the mapping was modified (Green dependence in Figure 4). Unlike in Table 3, Table 4 shows results with a page table remapping immediately before using the new mapping (without --dont_remap), and observes whether execution time is affected. For non-coherent systems, the remapping and pagewalk are not ordered, so we expect no change compared to Table 3. Non-speculative systems should behave the same regardless of whether a remapping occurred. Speculative coherent systems should be slower when a remapping occurs because it needs to retry the pagewalk once it discovers the remapping.

	0	1	2	3	4	5	6	7	Case
	N				Y				C: IMUL to IMUL
	N		Y		N		Y		B: Load to IMUL
	N	Y	N	Y	N	Y	N	Y	A: IMUL to load
Haswell	40.1	40.1	84.2	70.2	49.1	49.1	84.2	70.2	Misspeculation recovery makes case 2 slower than case 3.
P4 Northwood	206.1	207.5	489.5	364.6	243.4	244.3	375.4	364.2	Misspeculation recovery hurts case 2
AMD Phenom II	124.9	128.4	158.3	146.2	131.0	131.1	158.3	146.3	Much slower than Table 3 suggests a penalty for misspeculation recovery.
AMD Piledriver	33.7	40.5	55.1	137.5	65.7	65.4	66.8	157.2	Non-coherent pagewalk: Remapping has no impact vs. Table 3
AMD Zen 2	21.2	21.1	47.1	95.3	49.1	49.1	50.9	95.2	Non-coherent pagewalk: Remapping has no impact vs. Table 3
Atom Silvermont	41.3	41.4	77.1	77.2	50.3	50.3	77.3	77.2	No change from Table 3: Non-speculative pagewalks has no misspeculations.

Table 4: Runtime for varying data dependencies, with page table remapping.

Both Piledriver (non-coherent) and Silvermont (non-speculative) behave identically whether a remapping occurred or not. Those with speculative coherent pagewalks have a penalty for recovering from the misspeculation, and are slower in Table 4 than Table 3. Although Table 3 was unable to prove that the Phenom II had speculative pagewalks, Table 4 provides more evidence: The Phenom II got slower with remapping, suggesting that it does do pagewalks speculatively, with a penalty for misspeculation.

In systems with speculative coherent pagewalks, it is possible for cases 3 and 7 to not cause misspeculations because the pagewalk isn’t started until after the remapping store is executed, though this can vary depending on the mechanism used to detect misspeculations. Haswell (and other Intel P6-derived processors) seems to benefit from a remapping in these cases. Performance counters suggests that the reason is fewer L1 data cache misses when the remapping store modifies page tables rather than an unrelated location in memory. Perhaps this suggests that memory accesses from pagewalks are cached in the L1 cache (i.e., pagewalks compete for L1 cache space)?

Misspeculation detection mechanism?

The above results did not enable dependence D from Figure 4, which allowed the remapping store to be executed early. Enabling it constrains the store so it must execute after the IMUL chain completes. With page remapping disabled, enabling dependence D should only affect memory systems that perform speculative pagewalks and do not have memory dependence speculation: Dependence speculation allows loads to pass stores with unknown store addresses, so dependence D behaves similarly to dependence A on systems without dependence prediction. Unexpectedly, Intel Core 2 and newer systems behaved as though a pagewalk coherence misspeculation had occurred even though there were no page table modifications. These systems have memory dependence prediction, so the load should have speculatively executed much earlier than the store and broken the data dependence chain.

	0	1	2	3	4	5	6	7	Case
	N				Y				C: IMUL to IMUL
	N		Y		N		Y		B: Load to IMUL
	N	Y	N	Y	N	Y	N	Y	A: IMUL to load
Haswell	26.0	26.3	81.6	75.9	49.1	49.1	82.0	75.7	Misspeculation penalty even though there is no remapping…
P4 Northwood	205.5	205.5	285.2	285.5	242.7	242.7	284.5	286.6	No memory dependence speculation, so cases 2 and 6 perform like 3 and 7.
AMD Phenom II	45.5	45.5	92.8	92.8	49.6	49.7	92.8	92.8	No memory dependence speculation
AMD Piledriver	34.2	41.2	50.4	138.9	65.5	65.6	66.3	144.0	No impact
AMD Zen 2	21.1	21.2	46.4	93.6	49.2	49.2	50.9	93.5	No impact
Atom Silvermont	42.5	41.3	77.3	77.3	50.3	50.3	77.2	77.3	No impact, since pagewalk was non-speculative anyway.

Table 5: Runtime for varying data dependencies, without page table remapping, with delayed store.

It turns out it is precisely the early-executing load that is responsible for the incorrectly-detected misspeculation. This gives a hint on how coherence violations may be detected: by comparing pagewalks to known older store addresses (in the store queue?), and assuming a coherence violation if there is an older store with a conflict or an unknown address. By adjusting the relative execution times between the store and load, I found that the misspeculation penalty applies if the load executes too far in advance relative to the store. On Haswell and Ivy Bridge, the misspeculation penalty occurs if the load begins execution more than 9 cycles before the store (10 cycles on Sandy Bridge).

Summary and Conclusions

When dealing with caching of translations from a page table, there are two distinct cases of coherence to consider: TLB coherence, and pagewalk coherence. The x86 instruction set specification requires neither. No x86 processor I’ve seen has TLB coherence, but many do have pagewalk coherence. Windows 95 to Me apparently relies on pagewalk coherence for correctness. There are two classes of methods to provide pagewalk coherence: Either non-speculatively perform pagewalks (stall all pagewalks until earlier instructions are done), or speculatively perform pagewalks and detect and correct for coherence violations.

Using a few microbenchmarks observing the behaviour and timing of pagewalks and page table modifications, I found examples of processors in all categories: Non-coherent pagewalks, coherent non-speculative pagewalks, and speculative coherent pagewalks with detect and recover. Interestingly, a virtual machine with shadow paging behaved like a coherent TLB.

In addition, unexpected behaviour of recent Intel processors gave enough information to speculate on the mechanism used by those processors to detect pagewalk coherence violations: By querying the store queue during pagewalks to check for older stores with conflicting store addresses.

The technique of manipulating page tables in a microbenchmark allows flexibility not normally available to user-mode benchmarks. Manipulating page tables allows decoupling the measurement of data cache and data TLB accesses using aliases, and enables a wide variety of future microbenchmarks that can target the paging and caching systems separately. The microbenchmark used here demonstrated the ability to measure the pagewalk system with a minimum of data cache interference.

Updates

2020/06/07: Add data for AMD Zen 2, and update pgmod to build on newer kernels (tested on 5.6.8) and disable STRICT_DEVMEM when module is loaded.

Source Code

paging.cc: User-mode test program
pgmod.c: Kernel module to access CR3 and do TLB flushes. Creates /proc/pgmod that prints current CR3 and CR4 when read.
Makefile

Cite

H. Wong, TLB and Pagewalk Coherence in x86 Processors, Aug., 2015. [Online]. Available: http://blog.stuffedcow.net/2015/08/pagewalk-coherence/
[Bibtex]

@misc{pagewalk-coherence,
author={Henry Wong},
title={{TLB} and Pagewalk Coherence in {x86} Processors},
month={aug},
year=2015,
url={http://blog.stuffedcow.net/2015/08/pagewalk-coherence/}
}

Simon Farnsworth

2016/02/01 at 4:22 pm · Reply

A quick note on the P4 erratum; Intel has multiple versions of the specification update (public and partners get different visibilities). It sounds like erratum N94 was put in place when Intel had confirmed the crash, but not yet proven that the reporter’s code was at fault. Once Intel confirmed that there was no fault in the processor, it was stuck – it had published this erratum to partners, and didn’t want to leave a gap in the public record.

This is partly Intel learning from the Pentium FDIV bug – by publishing a questionable erratum while it’s under investigation, Intel encourages other partners (not just the reporter) to report in if they see the same issue; this, in turn, means that Intel gets a wider variety of test cases when it’s hunting for the root cause (so can see that – e.g. – DuckOS sees this behaviour, CowOS does too, but CowOS is obviously correct against the SDM specification where DuckOS has some questionable performance optimizations that obscure the problem).

Mark Rutland

2016/02/07 at 9:26 am · Reply

It wasn’t clear to me from the article, but I assume that other than the VIrtualBox case the test results are all derived from bare-metal runs?

In table 2, the 1-2 case (two speculative page table walks) is described as not happening in practice. I suspect this would be much more likely when run under a hypervisor using nested paging, if the guest OS is preempted immediately before the second page table update. In this case the TLB contents may be lost during execution of the hypervisor, forcing a new (potentially speculative) walk when the guest context is restored.

Henry

2016/02/11 at 10:57 am · Reply

They’re all run on Linux, which might be bare metal depending on your definition. (It’s still subject to Linux’s memory management and thread scheduling)

Yes, I think you’re right that the 1-2 case can occur on a processor that does speculative page walks if there is a context switch in the middle of the routine (whether that’s to a hypervisor or just an OS). But this is highly unlikely on an unloaded machine, and I don’t recall observing it happening in practice. Also, it’s technically no longer running the microbenchmark: the processor is executing a large number of unrelated instructions right inside the inner loop.

Maynard Handley

2018/04/21 at 2:33 pm · Reply

If you’re not aware of it, you may find this paper interesting:
https://www.cs.rice.edu/CS/Architecture/docs/barr-isca10.pdf
Translation Caching: Skip, Don’t Walk (the Page Table) ; Thomas W. Barr, Alan L. Cox, Scott Rixner 2010

It discusses the design space of x86 page caches, how the different possibilities perform, and what Intel and AMD actually use.

Morten Andersen

2020/06/03 at 4:25 am · Reply

In hindsight, isn’t this article a strong precursor to the Meltdown/Spectre attacks? I am thinking particularly of the way it distinguishes between speculative and non-speculative coherent pagewalks, which takes advantage of code that is speculatively executed (common to all the attacks in the class) as well as ofc the investigation of page table caching and coherence, which is specific to meltdown. I am wondering if maybe some of the discoverers of meltdown read this article and this gave them inspiration that there’s an issue – almost all the information to determine there is a security issue is included here!

Henry

2020/06/05 at 4:31 am · Reply

I suppose it’s not impossible, but it would be hard to know. The timing and functional effects of speculation have been known for a long time, so in retrospect, a lot of CPU microarchitecture information might look like it could have inspired Meltdown/Spectre. Designers of every CPU already need to understand how every speculation mechanism works, what code sequences trigger which kind of roll-back, the timing impact, and guarantee that every possible program is still functionally correct under all possible cases.

I think the key insight of these new side channel attacks wasn’t that speculation had data-value-dependent timing effects, but the realization that “data-value-dependent” means one could use them in nefarious ways.

That said, the team at Google that initially disclosed some of the side channel attacks (https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html) did cite my article on measuring the Intel Ivy Bridge L3 cache replacement policy, so I guess they thought it was at least somewhat relevant.

Travis Downs

2021/10/23 at 1:56 am · Reply

You might be interested in this Twitter thread and this SO answer which I wrote in response.

The pattern in question is interleaved loads and stores where each store address depends on the previous load result. Per your discovery regarding the “detection mechanism” this causes page walks for later loads to be cancelled effectively serializing each load, with a 3x performance impact in the hash benchmark (and a > 10x impact in the microbenchmark targeting specifically this behavior).

Interestingly, I didn’t find a similar effect on Ice Lake so maybe the detection mechanism has improved, or perhaps coherence has been abandoned. People report that Zen 2 doesn’t suffer the slowdown, consistent with your finding that there is no page-table coherence on Zen 2.

Travis Downs

2021/10/23 at 1:58 am · Reply

Ooops, here’s the corrected link to the Twitter thread.

My name

2023/07/13 at 10:53 am · Reply

One thing I don’t understand is why VirtualBox (without nested paging) offers such a coherent strategy i.e. trapping on all updates to the page tables (through them being marked as read only). Another strategy could simply be to let the guest freely modify the page tables as normal RAM, and only consult the tables when an access was made to something not previously mapped or when the guest explicitly invalidated the tables. Effectively it would exploit the same mechanism/contract as a real TLB. Supposedly this could be faster than incurring a 60K cost on handling every update. Maybe in the big scheme of things it doesn’t matter too much (not that many page table updates) and they did this to avoid the issues caused by buggy software.

Henry

2023/07/24 at 5:21 am · Reply

I don’t know the reasoning behind their page table update strategy, but I can think of some scenarios where not trapping on page table updates would perform poorly.

When software “invalidates the page tables”, it does not necessarily tell you which part of the page tables it modified, if it used a MOV to CR3. This could happen for older OSes (pre-486) or even in newer OSes where enough page table entries are updated that they think dropping the entire TLB is faster than multiple INVLPGs.

If you let the guest OS update the page table without tracking exactly what got changed, a MOV to CR3 would require discarding and rebuilding the entire shadow page table, which is likely more expensive than trapping on each modification.

I think a similar scenario happens with context switches. When the guest OS does a context switch (and changes CR3), can the hypervisor just swap in the shadow page tables associated with the new CR3 (because the guest OS can’t modify page tables without trapping and notifying the hypervisor), or does every context switch mean I need to rebuild the shadow page tables from scratch because I don’t know which entries could have been silently updated?

Kyle Hagan

2024/09/20 at 12:44 am · Reply

You do such incredible research, I’ve learned loads from your work and greatly appreciate the effort and depth. I hope you continue to put out research like this for the latest setups with P-core/E-core architecture and such. I wanted to comment and let you know that you have inspired a lot of my investigation into the latest microarchitectures and attempting to understand them, however, I don’t yet believe I fully understand how to determine when something is dependent on something else, and when delays are caused by some optimization or interrupt. An example of which is when I had a chain of IMUL followed by an adjustment to stack and then ret to measure some return latency when multiple returns followed one another, but the times were “expected” except on random runs. If I changed the outer for loop… surprise! The strange latency is gone!

I’m hoping as I read your other articles I am better equipped to design sequences that are useful in microbenchmarking. If you have any tips or resources for identifying those types of things, or what mental methodology you follow to construct these chains, I would be so grateful.

Thank you again for sharing your research!

Blog

TLB and Pagewalk Coherence in x86 Processors

Basics of Paging

TLB and Pagewalk Coherence

Who cares about pagewalk coherence?

Design Space of TLB and Pagewalk Coherence

Results Summary

Coherence within VirtualBox

Methodology

Getting access to the page tables

Fast Page Table Updates

Detailed Test Descriptions and Results

Test 1: TLB and Pagewalk Coherence

Test 2: Speculative or Non-Speculative Pagewalks?

Misspeculation detection mechanism?

Summary and Conclusions

Updates

Source Code

Cite

11 comments to TLB and Pagewalk Coherence in x86 Processors

Leave a Reply Cancel reply

Recent Posts

Meta