Caches are used to store a subset of a larger memory space in a smaller, faster memory, with the hope that future memory accesses will find their data in the cache rather than needing to access slower memory. A cache replacement policy decides which cache lines should be replaced when a new cache line needs to be stored. Ideally, data that will be accessed in the near future should be preserved, but real systems cannot know the future. Traditionally, caches have used (approximations of) the least-recently used (LRU) replacement policy, where the next cache line to be evicted is the one that has been least recently used.
Assuming data that has been recently accessed will likely be accessed again soon usually works well. However, an access pattern that repeatedly cycles through a working set larger than the cache results in 100% cache miss: The most recently used cache line won’t be reused for a long time. Adaptive Insertion Policies for High Performance Caching (ISCA 2007) and a follow-on paper High performance cache replacement using re-reference interval prediction (RRIP) (ISCA 2010) describe similar cache replacement policies aimed at improving this problem. The L3 cache on Intel’s Ivy Bridge appears to use an adaptive policy resembling these, and is no longer pseudo-LRU.
Measuring Cache Sizes
The behaviour of LRU replacement policies with cyclic access patterns is useful for measuring cache sizes and latencies. The access pattern used to generate Figure 1 is a random cyclic permutation, where each cache line (64 bytes) in an array is accessed exactly once in a random order before the sequence repeats. Each access is data-dependent so this measures the full access latency (not bandwidth) of the cache. Using a cyclic pattern results in sharp changes in latency between cache levels.
Figure 1 clearly shows two levels of cache for the Yorkfield Core 2 (32 KB and 6 MB) and three levels for the other three. All of these transitions are fairly sharp, except for the L3-to-memory transition for Ivy Bridge (“3rd Generation Core i5″). There is new behaviour in Ivy Bridge’s L3 cache compared to the very similar Sandy Bridge. Curiosity strikes again.
Ivy Bridge vs. Sandy Bridge
Figure 2 shows a comparison of Sandy Bridge and Ivy Bridge, for varying stride. The stride parameter affects which bytes in the array are accessed by the random cyclic permutation. For example, a stride of 64 bytes will only access pointers spaced 64 bytes apart, accessing only the first 4 bytes of each cache line, and accessing each cache line exactly once in the cyclic sequence. A stride less than the cache line size results in accessing each cache line more than once in the random sequence, leading to some cache hits and transitions between cache levels that are not sharp. Figure 2 shows that Sandy Bridge and Ivy Bridge have the same behaviour except for strides at least as big as the cache line size for Ivy Bridge’s L3.
There are several hypotheses that can explain an improvement in L3 cache miss rates. Only a changed cache replacement policy agrees with observations:
- Prefetching: An improved prefetcher capable of prefetching near-random accesses would benefit accesses of any stride. Figure 2 shows no improvement over Sandy Bridge for strides smaller than a cache line.
- Changed hash function: This would show a curve with a strange shape, as some parts of the array see a smaller cache size while some other parts see a bigger size. This is not observed.
- Changed replacement policy: Should show apparent cache size unchanged, but transitions between cache levels may not show the sharp transition seen with LRU policies. This agrees with observations.
Figure 3 shows two plots similar to Figure 2 for larger stride values (512 bytes to 128 MB). The curved shape of the plots for Ivy Bridge is clearly visible for many stride values.
Adaptive replacement policy?
An interesting paper from UT Austin and Intel from ISCA 2007 (Adaptive Insertion Policies for High Performance Caching) discussed improvements to the LRU replacement policy for cyclic access patterns that don’t fit in the cache. The adaptive policy tries to learn whether the access pattern reuses the cache lines before eviction and chooses an appropriate replacement policy (LRU vs. Bimodal Insertion Policy, BIP). BIP places new cache lines most of the time in the LRU position, the opposite behaviour of LRU.
Testing for an adaptive policy can be done by attempting to defeat it. The idea is to trick the cache into thinking that cached data is reused, by modifying the access pattern to reuse each cache line before eviction. Instead of a single pointer chase by repeating p = *(void**)p, the inner loop was changed to do two pointer chases p = *(void**)p; q = *(void**)q;, with one pointer lagging behind the other by some number of iterations, designed to touch each line fetched into the L3 cache exactly twice before eviction. Figure 4 plots the same parameters as Figure 3 but with the dual pointer chase access pattern. The Ivy Bridge plots closely resemble Sandy Bridge, showing that the replacement policy is adaptive, and has been mostly defeated.
Since Ivy Bridge uses an adaptive replacement policy, it is likely that the replacement policy is closely related to the one proposed in the paper. Before we probe for more detail, we need to look a little more closely at the L3 cache.
Ivy Bridge L3 Cache
The L3 cache (also known as LLC, last level cache) is organized the same way for both Sandy Bridge and Ivy Bridge. The cache line size is 64 bytes. The cache is organized as four slices on a ring bus. Each slice is either 2048 sets * 16-way (2 MB for Core i7) or 2048 sets * 12-way (1.5 MB for Core i5). Physical addresses are hashed and statically mapped to each slice. (See http://www.realworldtech.com/sandy-bridge/8/) Thus, the way size within each slice is 128 KB, and a stride of 128 KB should access the same set of all four slices using traditional cache hash functions.
Figure 3 reveals some information about the hash function used. Here are two observations:
- Do bits [16:0] (128KB) of the physical address map directly to sets? I think so. If not, the transition at 6 MB would be spread out somewhat, with latency increasing over several steps, with the increase starting below 6 MB.
- Is the cache slice chosen by exactly bits [18:17] of the physical address? No. Higher-order address bits are also used to select the cache slice. In Figure 3, the apparent cache size with 256 KB stride has doubled to 12 MB. It continues to double at 512, 1024, and 2048 KB strides. This behaviour resembles a 48-way cache. This can happen if the hash function equally distributes 256 through 2048 KB strides over all four slices. Thus, some physical address bits higher than bit 20 are used to select the slice, not just bits [20:17]. Paging with 2 MB pages limits my ability to test physical address strides greater than 2 MB.
Choosing a Policy using Set Dueling
Adaptive cache replacement policies choose between two policies depending on which one is appropriate at the moment. Set dueling proposes to dedicate a small portion of the cache sets to each policy to detect which policy is performing better (dedicated sets), and the remainder of the cache as follower sets that follow the better policy. A single saturating counter compares the number of cache misses occurring in the two dedicated sets.
This test attempts to show set-dueling behaviour by finding which sets in the cache are dedicated and which are follower sets. This test uses a 256 KB stride (128 KB works equally well), which maps all accesses onto one set in each of the four cache slices (due to the hash function). By default, this would mean all accesses only touch the first set in each slice if the low address bits [16:0] are zero. Thus, I introduce a new parameter, offset, which adds a constant to each address so that I can map all accesses to the second, third, etc. sets in each slice instead of always the first. This parameter is swept both up and down, and the cache replacement policy used is observed.
Since the adaptive policy chooses between two different replacement policies, I chose to distinguish between the two by observing the average latency at one specific array size, 4/3 of the L3 cache size (e.g., compare Sandy Bridge vs. Ivy Bridge at 8 MB in Figure 2). A high latency indicates the use of an LRU-like replacement policy, while a lower latency indicates the use of a thrashing-friendlier BIP/MRU type policy. Figure 5 plots the results.
It appears that most of the cache sets can use both cache replacement policies, except for two 4 KB regions at 32-36 and 48-52 KB (4 KB = 64 cache sets). These two regions always use LRU and BIP/MRU policies, respectively. The plot is periodic every 128 KB because there are 2048 sets per cache slice.
The global-counter learning behaviour is seen in Figure 5 by observing the different policies used while sweeping offset ascending vs. descending. Whenever the offset causes memory accesses to land on a dedicated set, it accumulates cache misses, causing the rest of the cache to flip to using the other policy. Cache misses on follower sets do not influence the policy, so the policy does not change until offset reaches the next dedicated set.
DIP or DRRIP?
The later paper from Intel (ISCA 2010) proposes a replacement policy that is improved over DIP-SD by also being resistant to scans. DIP and DRRIP are similar in that both use set dueling to choose between two policies (SRRIP vs. BRRIP in DRRIP, LRU vs. BIP in DIP).
One characteristic of RRIP is that it proposes four levels of re-reference prediction. True LRU has as many levels as the associativity, while NRU (not recently used) has two levels (recently used, or not). Intel’s presentation at Hot Chips 2012 hinted at using four-level RRIP (“Quad-Age LRU” on slide 46). Thus, I would like to measure whether Ivy Bridge uses DIP or DRRIP.
Scans are short bursts of cache accesses that are not reused. SRRIP (but not LRU) attempts to be scan tolerant by predicting new cache lines to have a “long” re-reference interval and allowing recently-used existing cache lines to age slowly before being evicted, preferentially evicting the newly-loaded but not reused cache lines. This behaviour provides an opportunity to detect scan resistance by pointer-chasing through two arrays, one of which is much larger than the cache (scanning), one of which fits in the cache (working set).
Figure 6 plots the average access time for accesses that alternate between a huge array (cache miss) and a small array (possible cache hit) whose size is on the x-axis. The size of the working set that will fit in the cache in the presence of scanning is related to scan resistance. Memory accesses are serialized using lfence. Replacement policy is selected by choosing a stride and offset that coincides with a dedicated set (See Figure 5).
With half of the accesses hitting each array, an LRU replacement policy splits the cache capacity evenly between the two arrays, causing L3 misses when the small array size exceeds ~half the cache. BIP (or BRRIP) is scan-resistant and keeps most of the cache for the frequently-reused small array. The LRU/SRRIP policy in Ivy Bridge behaves very similar to Sandy Bridge and does not seem to be scan-resistant, thus it is likely not SRRIP as proposed. It could, however, be a variant on SRRIP crafted to behave like LRU.
The four-level RRIP family of replacement policies are an extension of NRU, using two bits per cache line to encode how soon each cache line should be evicted (RRPV). On eviction, one cache line with RRPV=3 (LRU position) is chosen to be evicted. If no cache lines have RRPV=3, all RRPV values are incremented until a cache line can be evicted. Whenever a cache line is accessed (cache hit), its RRPV value is set to 0 (MRU position). The various insertion policies as proposed in the paper are as follows:
- BRRIP: New cache lines are inserted with RRPV=3 with high probability, RRPV=2 otherwise.
- SRRIP: New cache lines are inserted with RRPV=2
I made some modifications for my simulations:
- BRRIP: Modified the low-probability case to insert with RRPV=0 instead of RRPV=2
- LRU-like RRIP: New cache lines are inserted with RRPV=0. This is intended to approximate LRU.
Figure 7 shows L3-only cache simulation results of running the same access pattern through simulations of several replacement policies. The simulated cache is configured to match the Core i5 (four 1.5 MB slices, 12-way). The two policies used by Ivy Bridge (Figure 6) match very closely to LRU and BRRIP. Ivy Bridge matches less closely to a modified four-level RRIP configured to behave like LRU, and also matches slightly less well to BIP (not shown) than BRRIP. The simulation of SRRIP verifies its scan-resistance properties, which are not observed in the experimental measurement of Ivy Bridge.
I am not able to measure whether Ivy Bridge has really changed from the pseudo-LRU that was used in Sandy Bridge to a four-level RRIP. Given Intel’s hint in the slide and that Ivy Bridge behaves slightly differently from Sandy Bridge, I am inclined to believe that Ivy Bridge uses RRIP, despite the experimental measurements matching more closely to LRU than LRU-like four-level RRIP. However, it is fairly clear that Ivy Bridge lacks the scan-resistance property proposed in the ISCA 2010 paper.
Although the cache organization between Sandy Bridge and Ivy Bridge are essentially identical, Ivy Bridge’s L3 cache has an improved replacement policy. The policy appears to be similar to a hybrid between “Dynamic Insertion Policy — Set Duel” (DIP-SD) and “Dynamic Re-Reference Interval Prediction” (DRRIP), using four-level re-reference predictions and set dueling, but without scan resistance. For each 2048-set cache slice, 64 sets are dedicated to each of the LRU-like and BRRIP policies, with the remaining 1920 cache sets being follower sets that follow the better policy.
Caution: Not well documented.
- lat.cc: Measure cyclic permutation load latency
- lat2.cc: Touch elements twice before eviction
- lat3.cc: Cycle through two arrays of different size to detect scan resistance
Many thanks to my friend Wilson for pointing me to the two ISCA papers on DIP and DRRIP.