Since the 386, x86 processors have supported paging, which uses a page table to map virtual address pages to physical address pages. This mapping is controlled by the operating system, which gives user applications a contiguous virtual memory space, and isolates the memory spaces of different processes.
Page tables are located in main memory, so a cache (TLB: Translation Lookaside Buffer) is needed for acceptable performance. When doing virtual to physical address translations, the TLB maps virtual pages to physical pages, and is typically looked up in parallel with the L1 cache. For x86, the processor “walks” the page tables in memory if there is a TLB miss. Some other architectures throw an exception and ask the OS to load the required entry into the TLB.
The x86 architecture specifies that the TLB is not coherent or ordered with memory accesses (i.e., page tables), and requires that the relevant TLB entry (or the entire TLB) be flushed after any changes to the page tables. Failing to invalidate would cause the processor to use the stale entry in the TLB if the entry is in the TLB, or a page table walk could non-deterministically see either the old or new page table entry. With out-of-order processors, relaxing coherence requirements allows the processor to more easily reorder operations for more performance.
But do real processor implementations really behave this way, or do some processors provide more coherence guarantees in practice? One particularly interesting case concerns what happens when a page table entry that is known not to be cached in the TLB is changed, then immediately used for a translation (via a pagewalk) without any invalidations. Are real processors’ pagewalks more coherent than required by the specification (and does any software rely on this)? And if pagewalks are coherent with memory, what mechanism is used?
In the rest of this post, I’ll attempt to answer some of the above questions and measure the pagewalk behaviour of a range of x86 processors.
- Basics of Paging
- TLB and Pagewalk Coherence
- Design Space
- Results Summary
- Detailed Test Descriptions and Results
- Summary and Conclusions
- Source code
Basics of Paging
The basic task of paging is to map virtual addresses to physical addresses. This occurs at the granularity of a “page” (default 4 KB on x86). This means that for a 32-bit address, the upper 20 bits are translated (“virtual page number” mapped to “physical page number”), while the lower 12 are unchanged (“page offset”). Figure 1 shows an example where the virtual address 0x12345678 is mapped to 0x00abc678, doing two memory accesses. PAE and 64-bit extensions allow larger address spaces by increasing the number of levels of page tables, to 3 and 4, respectively.
Of course, if every memory access became three accesses, the performance penalty would not be acceptable. A TLB (Figure 2) is a cache of translations that stores the result of previous pagewalks, which directly maps virtual page numbers to physical page numbers without reading the page tables on a TLB hit.
TLB and Pagewalk Coherence
Whenever there are caches, there is the problem of cache coherence. If a page table entry is changed, would the corresponding entry in the TLB be automatically updated or invalidated, or would software need to explicitly invalidate or flush the TLB? In x86 (and all other architectures I can think of), the TLB is not coherent. The behaviour of non-coherent TLBs is fairly intuitive: if there is a stale entry cached in the TLB, a translation would use the stale entry in the TLB. But if there is no stale entry cached in the TLB, a translation requires a page table walk. Is this page table walk coherent with memory accesses? Here is what Intel’s and AMD’s manuals say about it:
From Section 4.10.2.3 of Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide (June 2015):
The processor may cache translations required for prefetches and for accesses that are a result of speculative execution that would never actually occur in the executed code path.
From Section 7.3.1 of AMD64 Architecture Programmer’s Manual Volume 2: System Programming (Rev 3.23):
…it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated in Step 2 [to page N] … Similar behavior can occur when instructions are prefetched from beyond the page table update instruction. To prevent this problem, software must use an INVLPG or MOV CR3 instruction immediately after the page-table update…
Both Intel and AMD manuals state that without invalidation, even if the old entry is not cached in the TLB, the pagewalk may still see the old entry. That is, in the following pseudocode, the second instruction (the load) can non-deterministically use either the old or the new mapping, and there must be an invalidation or TLB flush in between to guarantee that the new page table entry is visible to the second instruction.
mov [page table], new_mapping
mov eax, [linear address using updated mapping]
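Under the specification, correct code must therefore insert an invalidation between the store and the load. A sketch of the fixed sequence (INVLPG is a privileged instruction, so this runs in kernel mode):

```
mov [page table], new_mapping
invlpg [linear address using updated mapping]   ; discard any (possibly stale) TLB entry
mov eax, [linear address using updated mapping]
```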
Interestingly, non-coherent pagewalk behaviour is mentioned in the Pentium 4 specification update as an erratum (N94, NoFix) even though the supposedly-erroneous behaviour is still “correct”.
In this post, I use “TLB coherence” to refer to coherence between page tables in memory and the TLB contents (affecting TLB hits), and “pagewalk coherence” for the coherence between pagewalks and page table modifications (affecting TLB misses).
Who cares about pagewalk coherence?
Even though the instruction set specification does not guarantee pagewalk coherence, there is OS code that relies on it for correctness. There is kernel code in Windows 95 through Me that modifies page table mappings without invalidation, and this causes random crashes if pagewalks are not coherent.
This leads me to think that perhaps real processors do provide pagewalk coherence (or that Windows 95 through Me were crash-prone, hmm…)
Design Space of TLB and Pagewalk Coherence
The first design decision is whether TLB contents are coherent with memory accesses (page table modifications). This is not required, and real processors do not provide it.
The second design decision is whether page table walks are coherent with memory accesses. The instruction set specifications do not require it, but many processors choose to provide coherence anyway.
For processors that do provide coherence, there are two classes of mechanisms to provide it: All pagewalks could be performed non-speculatively, after all earlier memory instructions are completed. Or, for more performance, pagewalks could be performed speculatively, with a mechanism to detect and retry the pagewalk if the processor later detects that the pagewalk conflicted with an earlier store to memory.
The design space of pagewalk behaviours can be classified into three categories: Non-coherent pagewalks, coherent pagewalks done non-speculatively, and coherent pagewalks done speculatively with detect-and-retry.
| Pagewalk Behaviour | Processors (Family, Model, Stepping) |
|---|---|
| Coherent TLB and pagewalks | VirtualBox without nested paging |
| Coherent, Speculative | Intel Pentium Pro (6, 1, 9), Intel Pentium 4 Northwood (15, 2, 9), Intel Sandy Bridge (6, 42, 7), Intel Ivy Bridge (6, 58, 9), Intel Haswell (6, 60, 3), AMD K7 Thunderbird (6, 4, 2), AMD K7 Thoroughbred (6, 8, 1), AMD Phenom (16, 2, 3), AMD Phenom II (16, 10, 0), Intel Core 2 45nm (6, 23, 10)\*, Intel Lynnfield (6, 30, 5)\* |
| Coherent, Non-speculative | Intel Atom Silvermont (6, 55, 3), Via C3 “Samuel 2” (CentaurHauls 6, 7, 3) |
| Non-coherent | AMD Bulldozer (21, 1, 2), AMD Piledriver (21, 2, 0) |

\* Core 2 45nm and Lynnfield very rarely use the old mapping. I can’t tell whether some form of TLB prefetching causes the old mapping to end up in the TLB, or whether the pagewalk coherence mechanism occasionally fails.
Coherence within VirtualBox
VirtualBox with shadow page tables behaves as a coherent TLB (not merely coherent pagewalks). Updates to page tables are always immediately visible without TLB invalidation. But the cost of providing this coherence is around 60,000 clock cycles per page table update, compared to one or a few for a simple store instruction.
Virtual machines have two levels of translation: Guest virtual to guest physical (which is also the host virtual address), then the usual host virtual to host physical translation, with two sets of page tables, one managed by the guest OS, one managed by the host OS. Virtual machines without nested paging usually implement shadow page tables, where the CPU uses “shadow” page tables that directly map guest virtual addresses to host physical addresses. The VM monitors changes in the guest OS’s page tables and updates the shadow page tables in response. Often the guest page tables are marked as read-only so all writes immediately trap into the VM to be handled.
Testing for TLB and pagewalk coherence requires changing page table mappings and observing the resulting behaviour and timing. The test needed kernel-mode code to gain access to the page tables, and a mechanism for modifying the page tables quickly from user mode.
There were two variations of the test, one to test for TLB and pagewalk coherence (functionality), and another timing-based test to test whether pagewalks can occur speculatively and whether there is a penalty for misspeculations. These tests were then run on multiple systems.
Getting access to the page tables
Page tables are located in physical address space, are controlled by the OS, and are normally inaccessible from user mode. Gaining access to the page tables requires knowing where they are located in (physical) memory, and a method to read and write physical memory.
The physical address of the page tables used by the current process is stored in control register CR3. Directly reading CR3 (using a MOV instruction) is privileged, so it can’t be done in user mode. Linux does not appear to provide an alternative mechanism to read a process’s CR3, so I wrote a short kernel module that simply reports the current CR3 value (presumably valid for the current process).
Accessing physical memory is easier, as Linux already provides a /dev/mem character device that can read and write to physical memory (as long as the kernel was not compiled with STRICT_DEVMEM).
Knowing the location of the page tables and having a mechanism to read/write arbitrary physical memory allows me to write pagewalking routines that can traverse and modify the page tables. There are different routines depending on the type of paging in use (two-level 32-bit x86, three-level 32-bit PAE, and four-level 48-bit x86-64).
However, since /dev/mem is accessed with file read/write operations and requires a switch into kernel mode, it is much too slow. A microbenchmark needs a method to update a page table entry quickly with just one instruction.
Fast Page Table Updates
Fast page table updates are done by using /dev/mem to set up the page tables so that a region of the virtual memory space maps to the page tables for a different array. Something similar is often used by OSes as well. This allows user-mode code (with paging enabled) to directly update page tables with a single instruction, without switching to kernel mode and without disabling paging. Figure 3 shows the page table setup. The page tables that control the mappings for the pt_window array are changed (using slow /dev/mem) so that accesses to pt_array actually access the page tables for bigbuf, allowing fast modification of bigbuf’s page table mappings.
Linux checks the consistency of a process’s page tables when the process exits, so the microbenchmark backs up any of the page tables entries it modifies and restores them before exiting.
Detailed Test Descriptions and Results
Test 1: TLB and Pagewalk Coherence
(This test is opts.mode=0 in the code.)
TLB and pagewalk coherence concerns the processor’s behaviour when a page table mapping is modified and then used without an intervening invalidation. We can measure this by changing the mappings to point to one of two pages with known contents, so the result of a load indicates which page mapping was actually used. Pointing all of the mappings to just two pages has the additional advantage of using only 8 KB of (VIPT or PIPT) L1 cache footprint, allowing better timing measurements with less interference from data cache misses.
The test accesses pages in a random cyclic ordering, to defeat prefetching and generate TLB misses with high probability. All pages point to a page containing the number 1 (“page 1”) by default. For each page, the mapping is changed to point to a different page containing the number 2 (“page 2”), then immediately used by a load. This is immediately changed back to point to page 1, and another load performed, so the page tables point to page 2 only very briefly.
*p_thispte = map_pte2;          // Change mapping
n = *(unsigned int*)p_array;    // First access: n is always 1 or 2.
lcounts[n+1]++;                 // Count occurrences of n
*p_thispte = map_pte1;          // Change mapping back
n = *(unsigned int*)p_array;    // Second access
lcounts[n-1]++;
In a fully-coherent system, the first load will read a 2 from the newly-changed mapping, and the second load will read a 1 right after the mapping is reverted. If the TLB is not coherent, the second load will still see 2 because the first load has already placed the mapping in the TLB which the second load reuses. If the pagewalks are not coherent, the first load can (but doesn’t necessarily) see 1, because the pagewalk speculatively occurred before the page table update.
This test only measures whether TLBs and pagewalks are coherent. For systems with coherent pagewalks, a timing-based test is needed to determine the mechanism to provide it: Non-speculative pagewalks, or speculative pagewalks with a recovery mechanism.
Test 2: Speculative or Non-Speculative Pagewalks?
(This test is opts.mode=1 in the code.)
To distinguish between speculative and non-speculative pagewalks, we need to construct a test case where a pagewalk result can be produced and used earlier than the completion of the previous instruction, then prove whether this reordering actually occurred. To detect this, I use an inner loop containing a chain of data-dependent arithmetic instructions (to create a known latency), followed by a store that optionally remaps a page table entry, and a pagewalk (a load). The remapping and pagewalk are similar to those used in Test 1. The data dependencies between these components of the inner loop can be selectively enabled. Figure 4 shows three unrolled iterations of the test’s inner loop and the four selectively-enabled data dependencies.
When there are no data dependencies between the arithmetic and memory operations, two independent chains of arithmetic operations from different loop iterations can be overlapped. For example, 16 dependent IMUL instructions have a latency of 48 cycles on Intel Haswell, but the loop only takes 26 cycles per iteration with no data dependencies, 49 cycles per iteration if the IMULs between loop iterations are dependent (C), and 76 cycles if the pagewalk and IMULs are also dependent on each other (A and B). Selectively adding data dependencies between components allows observing whether there are other implicit dependencies for microarchitectural reasons. For example, non-speculative pagewalks behave similarly to a dependence of the pagewalk on the preceding instruction (A). IMUL was chosen because it is a pipelineable high-latency arithmetic instruction with few micro-ops.
Figure 4 also shows schematics of the execution of the IMUL chain (blue) and pagewalk (green) for all 8 cases (0 to 7). In an out-of-order processor, loop iterations can be overlapped. Adding one dependency while keeping the loop iterations independent (A or B) usually has little impact on execution time, but adding both dependencies has a large impact (A and B). Dependency C serializes the IMUL chain, but the load is free to execute earlier or later, unless constrained by data dependencies A or B or for microarchitectural reasons.
Since a non-speculative pagewalk behaves similarly to enabling dependence A (the pagewalk can’t proceed until the previous IMUL is done), I can test for its presence by enabling only dependence B, then observing whether the loop iterations are still overlapped (indicating speculative pagewalks) or whether it serializes every loop iteration (similar to having both A and B enabled). Table 3 shows results for a few of the processors. Speculative pagewalks are indicated by case 2 (B only) having similar execution time per iteration to case 0 (no dependencies), while non-speculative pagewalks are indicated by case 2 having runtime similar to case 3 (A and B). To be certain the pagewalk is speculative, the runtime for case 2 needs to be smaller than the latency of just the IMULs (case 4), as overlapping the execution of the IMUL chains requires that a pagewalk must be completed before the previous IMUL chain completes. This affects the Phenom II: case 2 is faster than case 3, but isn’t faster than case 4, so it is likely speculative, but not fast enough to be completely certain.
For Table 3, remapping is disabled (--dont_remap) to avoid triggering any coherence recovery mechanisms, as we only want to observe the pagewalks themselves.
A coherent pagewalk requires that the load that uses the new mapping must occur after the mapping was modified (Green dependence in Figure 4). Unlike in Table 3, Table 4 shows results with a page table remapping immediately before using the new mapping (without --dont_remap), and observes whether execution time is affected. For non-coherent systems, the remapping and pagewalk are not ordered, so we expect no change compared to Table 3. Non-speculative systems should behave the same regardless of whether a remapping occurred. Speculative coherent systems should be slower when a remapping occurs because it needs to retry the pagewalk once it discovers the remapping.
Both Piledriver (non-coherent) and Silvermont (non-speculative) behave identically whether a remapping occurred or not. Those with speculative coherent pagewalks have a penalty for recovering from the misspeculation, and are slower in Table 4 than Table 3. Although Table 3 was unable to prove that the Phenom II had speculative pagewalks, Table 4 provides more evidence: The Phenom II got slower with remapping, suggesting that it does do pagewalks speculatively, with a penalty for misspeculation.
In systems with speculative coherent pagewalks, it is possible for cases 3 and 7 to not cause misspeculations because the pagewalk isn’t started until after the remapping store is executed, though this can vary depending on the mechanism used to detect misspeculations. Haswell (and other Intel P6-derived processors) seems to benefit from a remapping in these cases. Performance counters suggest that the reason is fewer L1 data cache misses when the remapping store modifies page tables rather than an unrelated location in memory. Perhaps this suggests that memory accesses from pagewalks are cached in the L1 cache (i.e., pagewalks compete for L1 cache space)?
Misspeculation detection mechanism?
The above results did not enable dependence D from Figure 4, which allowed the remapping store to be executed early. Enabling it constrains the store so it must execute after the IMUL chain completes. With page remapping disabled, enabling dependence D should only affect memory systems that perform speculative pagewalks and do not have memory dependence speculation: Dependence speculation allows loads to pass stores with unknown store addresses, so dependence D behaves similarly to dependence A on systems without dependence prediction. Unexpectedly, Intel Core 2 and newer systems behaved as though a pagewalk coherence misspeculation had occurred even though there were no page table modifications. These systems have memory dependence prediction, so the load should have speculatively executed much earlier than the store and broken the data dependence chain.
It turns out it is precisely the early-executing load that is responsible for the incorrectly-detected misspeculation. This gives a hint on how coherence violations may be detected: by comparing pagewalks to known older store addresses (in the store queue?), and assuming a coherence violation if there is an older store with a conflict or an unknown address. By adjusting the relative execution times between the store and load, I found that the misspeculation penalty applies if the load executes too far in advance relative to the store. On Haswell and Ivy Bridge, the misspeculation penalty occurs if the load begins execution more than 9 cycles before the store (10 cycles on Sandy Bridge).
Summary and Conclusions
When dealing with caching of translations from a page table, there are two distinct cases of coherence to consider: TLB coherence, and pagewalk coherence. The x86 instruction set specification requires neither. No x86 processor I’ve seen has TLB coherence, but many do have pagewalk coherence. Windows 95 to Me apparently relies on pagewalk coherence for correctness. There are two classes of methods to provide pagewalk coherence: Either non-speculatively perform pagewalks (stall all pagewalks until earlier instructions are done), or speculatively perform pagewalks and detect and correct for coherence violations.
Using a few microbenchmarks observing the behaviour and timing of pagewalks and page table modifications, I found examples of processors in all categories: Non-coherent pagewalks, coherent non-speculative pagewalks, and speculative coherent pagewalks with detect and recover. Interestingly, a virtual machine with shadow paging behaved like a coherent TLB.
In addition, unexpected behaviour of recent Intel processors gave enough information to speculate on the mechanism used by those processors to detect pagewalk coherence violations: By querying the store queue during pagewalks to check for older stores with conflicting store addresses.
The technique of manipulating page tables in a microbenchmark allows flexibility not normally available to user-mode benchmarks. Manipulating page tables allows decoupling the measurement of data cache and data TLB accesses using aliases, and enables a wide variety of future microbenchmarks that can target the paging and caching systems separately. The microbenchmark used here demonstrated the ability to measure the pagewalk system with a minimum of data cache interference.