<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog</title>
	<atom:link href="http://blog.stuffedcow.net/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.stuffedcow.net</link>
	<description>Random stuff...</description>
	<lastBuildDate>Sun, 19 May 2013 18:45:33 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.2</generator>
		<item>
		<title>Measuring Reorder Buffer Capacity</title>
		<link>http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/</link>
		<comments>http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/#comments</comments>
		<pubDate>Wed, 15 May 2013 01:33:25 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Measuring Stuff]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[microarchitecture]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=1233</guid>
		<description><![CDATA[On conventional out of order processors, instructions are not necessarily executed in "program order", although the processor must give the same results as though execution occurred in program order. The instruction window contains a small window of instructions that are allowed to execute out of order, before the instructions are committed in program order as they leave the instruction window. This article describes a microbenchmark that can measure the size of the instruction window, demonstrated on several x86 microarchitectures, then extends the microbenchmark to measure the speculative register file size <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">Measuring Reorder Buffer Capacity</a></span>]]></description>
			<content:encoded><![CDATA[<p>On conventional out-of-order processors, instructions are not necessarily executed in &#8220;program order&#8221;, although the processor must give the same results as though execution occurred in program order. This is achieved by preserving an in-order front-end (fetch, decode, rename) and in-order instruction commit, but allowing the processor to search a limited region of the instruction stream to find independent instructions to work on, before rearranging the instruction results back into their original order and committing them.</p>
<p>The reorder buffer (ROB) is a queue of instructions in program order that tracks the status of the instructions currently in the instruction window. Instructions are enqueued in program order after register renaming. These instructions can be in various states of execution as the instructions in the window are selected for execution. Completed instructions leave the reorder buffer in program order when committed. The capacity of the reorder buffer limits how far ahead in the instruction stream the processor can search for independent instructions. This capacity has steadily increased over many processor generations.</p>
<p>The instruction window can also be limited by exhausting resources on the processor other than reorder buffer entries. In physical register file microarchitectures (Pentium 4, Intel Sandy Bridge and newer), the effective instruction window can also be limited by running out of physical registers for renaming.</p>
<p>In this article, I describe a microbenchmark to measure the size of the instruction window and present results for several processor microarchitectures. I also use the same microbenchmark to measure the number of available rename registers on PRF microarchitectures, and show some special cases where certain instructions are handled without consuming physical rename registers.</p>
<h2>Measuring Instruction Window Size</h2>
<div id="attachment_1246" class="wp-caption alignright" style="width: 241px"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/05/rob_code.png" alt="Example microbenchmark for measuring ROB size." title="ROB Size Microbenchmark Example" width="231" height="400" class="size-full wp-image-1246" /><p class="wp-caption-text">Fig. 1: Microbenchmark example showing two iterations of the microbenchmark inner loop for <acronym title="Reorder buffer">ROB</acronym> size of 4 instructions, with <acronym title="mov, nop, nop">three instructions</acronym> per load (left) and <acronym title="mov, nop, nop, nop">four instructions</acronym> per load (right). Loop latency nearly doubles when enough nops are inserted so that the next load does not fit in the ROB.</p></div>
<p>The instruction window constrains how many instructions ahead of the most recent unfinished instruction a processor can look at to find instructions to execute in parallel. Thus, I can create a microbenchmark that measures the size of the instruction window if it has these two characteristics: It uses a suitable long-latency instruction to block instruction commit, and it is able to observe how many instructions ahead the processor is able to execute once instruction commit is stalled.</p>
<p>A good long-latency instruction to use is a load that misses the cache. A typical cache miss latency is greater than 200 clock cycles, enough time to fill the reorder buffer with other instructions. One way to get a single load to consistently miss in the cache is to do pointer chasing in an appropriately-initialized array, like the method used to measure <a href="http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/">cache system behaviours</a>.</p>
<p>The method I used to detect how far ahead of a cache miss a processor can examine is to place another cache miss some number of instructions after a first cache miss. When the first cache miss is blocking instruction commit, if the second cache miss is within the instruction window, then the two cache misses are overlapped (memory-level parallelism), but if it is outside the instruction window, the two memory accesses are serialized. NOP is a good filler instruction in between the memory loads because it executes quickly, has no data dependencies, and occupies a reorder buffer entry.</p>
<p>Figure 1 shows an example of the behaviour of two inner-loop iterations of this microbenchmark for a reorder buffer size of 4. When there are only three instructions per load (<code><b>mov</b>, nop, nop, <b>mov</b>, nop, nop, ...</code>), the processor is able to search ahead to the next <code>mov</code> while stalled on the first, so that two cache misses can be partially overlapped, leading to executing one iteration of the loop for every cache miss time (plus a few nops). When enough nops are inserted so that the next cache miss cannot fit into the instruction window (<code><b>mov</b>, nop, nop, nop, <b>mov</b>, nop, nop, nop, ...</code>), the two cache misses are serialized and one loop iteration takes nearly twice as long, executing one iteration in the time of <i>two</i> cache misses (plus a few nops).</p>
<h2>Reorder Buffer Capacity</h2>
<div id="attachment_1259" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/05/rob_graph1.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/05/rob_graph1-360x221.png" alt="Ivy Bridge (168) and Lynnfield (128) Reorder Buffer Size" title="Reorder Buffer Capacity" width="360" height="221" class="size-medium wp-image-1259" /></a><p class="wp-caption-text">Fig 2: Reorder buffer size for Ivy Bridge (168 entries) and Lynnfield (128 entries).</p></div>
<p>Figure 2 shows the results of using the microbenchmark to measure the reorder buffer size of Intel&#8217;s Ivy Bridge and Lynnfield (similar to Nehalem) microarchitectures. The measured values (168 and 128, respectively) agree with the published numbers. The stepwise increase in execution time caused by serializing the two cache misses is clearly visible in the graphs. The slight slope upwards reflects the execution throughput of NOPs.</p>
<p>Figure 3 shows results for other microarchitectures, with the curves cropped to the region of interest, for clarity. Again, the measurements agree with published numbers for all of the microarchitectures.<br />
<br style="clear: right;"></p>
<div id="attachment_1260" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/05/rob_graph2.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/05/rob_graph2-360x179.png" alt="" title="Reorder Buffer Capacity" width="360" height="179" class="size-medium wp-image-1260" /></a><p class="wp-caption-text">Fig 3: Reorder buffer capacities for several other microarchitectures: Ivy Bridge (168), Sandy Bridge (168), Lynnfield (128), Northwood (126), Yorkfield (96), Palermo (72), and Coppermine (40).</p></div>
<h3>Hyper-Threading</h3>
<p>On processors with Hyper-Threading, the reorder buffer is partitioned between two threads. It is known that this partitioning is static (and it makes sense to partition statically anyway), so each thread gets half of the ROB entries. One interesting question is whether this static partitioning occurs only when both thread contexts are active, or whether it&#8217;s permanently partitioned whenever Hyper-Threading is enabled.</p>
<p>Tests on Lynnfield (1st generation Core i7) and Northwood (Pentium 4) show that this partitioning occurs only when both thread contexts are active. Two copies of the microbenchmark pinned to two thread contexts on the same core show half the ROB capacity when both thread contexts are active, but full capacity when one thread context is idle. Good.</p>
<p><br style="clear:right;"></p>
<h2>Physical Register File Size</h2>
<div id="attachment_1270" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/05/prf_size.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/05/prf_size-360x240.png" alt="" title="Physical Register File Size" width="360" height="240" class="size-medium wp-image-1270" /></a><p class="wp-caption-text">Fig. 4: Measuring the size of the speculative portion of the physical register file. Northwood (around 115), Sandy Bridge and Ivy Bridge (around 130-135).</p></div>
<p>Measuring the reorder buffer size uses nop instructions to separate the memory loads. This is ideal because nop instructions consume reorder buffer entries, but not destination registers. On microarchitectures that rename registers using a physical register file (PRF), there are often fewer registers than reorder buffer entries (because some instructions don&#8217;t produce a register result, such as branches). If the filler instruction is changed from nop to one that produces a register result (<code>ADD</code>), I can measure the size of the speculative portion of the PRF. </p>
<p>Note that since the PRF holds both speculative and committed register values, I cannot measure the size of the entire PRF, only the portion holding speculative values. However, if we assume that published PRF sizes are correct, the difference would indicate the number of registers used for architectural state, which includes both x86 ISA-visible state as well as some extra architectural registers for internal use.</p>
<p>Figure 4 shows the results for microarchitectures that use a physical register file. The Pentium 4 Northwood appears to have 115 speculative PRF entries. Since the PRF size is expected to be 128 entries, that means 13 registers are used for non-speculative state. Sandy Bridge and Ivy Bridge appear to have 131. The noisy region between 118 and 131 registers hints at some non-ideal behaviour in reclaiming unused registers (in exchange for simpler hardware?), resulting in fewer registers being available for use some but not all of the time. Assuming the expected 160-entry PRF, that leaves 29 registers of non-speculative state. This seems somewhat high even considering that SNB/IVB are 64-bit (16 architectural integer registers) and Northwood is 32-bit (8 architectural integer registers).<br style="clear:right;"><br />
<div id="attachment_1276" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/05/prf_size_xmm.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/05/prf_size_xmm-360x240.png" alt="" title="FP/SSE Physical Register File Size" width="360" height="240" class="size-medium wp-image-1276" /></a><p class="wp-caption-text">Fig. 5: FP/SSE Physical Register File Size. Sandy Bridge and Ivy Bridge have 120 available speculative entries (32-bit mode). Northwood has 87.</p></div></p>
<p>The same experiment (Figure 5) can be repeated with a 128-bit SSE (or 256-bit AVX) instruction to measure the size of the floating-point and SSE PRF. As before, there are more registers used for non-speculative state than expected. Sandy Bridge and Ivy Bridge should have 144 registers (120 available for speculative state), and Northwood should have 128 registers (87 available for speculative state). Interestingly, there are 9 fewer speculative registers (111 vs. 120) in 64-bit mode than in 32-bit mode, probably for YMM8-15 that are inaccessible in 32-bit mode. I can&#8217;t explain why so many registers are needed for non-speculative state (24 to 41), since the ISA-visible state mainly consists of 8 x87 floating-point registers, and 8 or 16 XMM or YMM registers.</p>
<p><br style="clear:right;"></p>
<h2>x86 Zeroing Idiom and MOV Optimization</h2>
<div id="attachment_1284" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/05/prf_mov.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/05/prf_mov-360x240.png" alt="" title="Ivy Bridge MOV Optimization" width="360" height="240" class="size-medium wp-image-1284" /></a><p class="wp-caption-text">Fig. 6a: Ivy Bridge executes MOV instructions without consuming destination registers (limited by 168-entry <acronym title="Reorder buffer">ROB</acronym>), while Sandy Bridge and Northwood do (limited by speculative portion of PRF).</p></div>
<p>A common x86 idiom for zeroing registers is to XOR a register with itself. Processors usually recognize this sequence and break the dependence on the source register value. In PRF microarchitectures, it may be possible to optimize this even further and point the renamer entry for that register to a special &#8220;zero&#8221; register, or augment the renamer with a special value indicating the register holds a value of zero, and not require any execution resources for the instruction. I can test for the zeroing idiom by repeating the above test using <code>XOR reg, reg</code> instructions and observing whether those instructions consume destination registers. Both Sandy Bridge and Ivy Bridge can zero registers without consuming destination registers, while Northwood does consume destination registers. (Graphs not shown)</p>
<p><br style="clear:right;"><div id="attachment_1291" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/05/prf_mov_xmm.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/05/prf_mov_xmm-360x240.png" alt="" title="Ivy Bridge MOV Optimization: AVX" width="360" height="240" class="size-medium wp-image-1291" /></a><p class="wp-caption-text">Fig. 6b: Ivy Bridge MOV optimization for 256-bit MOVDQA instructions (128-bit MOVDQA for Northwood). Fewer registers are used, but there is something other than the ROB size that limits the instruction window.</p></div><br />
I can measure some other microarchitecture features by changing the instruction used, similar to the previous section.</p>
<p>Repeating the test with MOV instructions can show whether register-to-register moves are executed by manipulating pointers in the register renamer rather than copying values. This is a known new feature in Ivy Bridge, and Figure 6a confirms it: Ivy Bridge is limited only by its 168-entry <acronym title="Reorder buffer">ROB</acronym>, while Sandy Bridge and Northwood are still limited by the speculative portion of the PRF. Oddly, when the source and destination registers are the same, this optimization does not happen, nor is the operation treated as a <code>NOP</code>. Figure 6b shows the results for using 256-bit AVX MOVDQA. These look strange. It is also strange that Ivy Bridge&#8217;s instruction window is lower than the <acronym title="Reorder buffer">ROB</acronym> size of 168 entries, as though registers were still consumed, though fewer.</p>
<p><br style="clear:right;"></p>
<h2>Unresolved Puzzle</h2>
<div id="attachment_1307" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/05/snb_avx_strangeness.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/05/snb_avx_strangeness-360x240.png" alt="" title="Sandy Bridge AVX strangeness" width="360" height="240" class="size-medium wp-image-1307" /></a><p class="wp-caption-text">Fig. 7: Sandy Bridge executing interleaved 256-bit AVX and 64-bit addition instructions is limited to a 147-instruction window, which is greater than both <acronym title="Physical Register File. Integer=131, AVX=111">PRFs</acronym>, but smaller than the <acronym title="Reorder buffer. 168 entries">ROB</acronym></p></div>
<p>We know that in both Sandy/Ivy Bridge and Northwood, there are separate floating-point/SSE/AVX and integer register files. If I use instructions that do not produce a destination register (<code>nop</code>), the instruction window is limited by the ROB size. Similarly, if I use instructions that produce an integer or AVX/SSE result, the instruction window becomes limited by the integer or FP/SSE register file instead. If I alternate between integer and SSE/AVX instructions, I would expect that the instruction window becomes ROB-size limited again, since the demand for registers is spread across two register files.</p>
<p>Figure 7 shows the results of this experiment. Ivy Bridge and Northwood behave as expected: AVX or SSE interleaved with an integer addition are limited by the <acronym title="Reorder buffer: Northwood=126, SNB/IVB=168">ROB</acronym> size. Sandy Bridge AVX or SSE interleaved with integer instructions seems to be limited to looking ahead ~147 instructions by something other than the ROB. Having tried other combinations (e.g., varying the ordering and proportion of AVX vs. integer instructions, inserting some NOPs into the mix), it seems as though both SSE/AVX and integer instructions consume registers from some form of shared pool, as the instruction window is always limited to around 147 regardless of how many of each type of instruction are used, as long as neither type exhausts its own PRF supply on its own. That is, when AVX instructions < ~110 and integer instructions < ~130, the instruction window becomes limited to ~147, regardless of how many of each type are in the instruction window. I can't think of any reason why SSE/AVX and integer register allocations should be linked. This strange behaviour appears to be fixed in Ivy Bridge, but Figure 6b shows Ivy Bridge still has some strangeness in its AVX register allocation...</p>
<h2>Conclusion</h2>
<p>Using two chains of pointer chasing with filler instructions is a powerful tool to find out certain details about a processor&#8217;s microarchitecture. I&#8217;ve shown how to use it to measure the reorder buffer size, measure the number of available speculative registers in physical register files, and detect several related optimizations. This is not an <acronym title="e.g., &#010; which registers share a PRF, &#010; whether partial registers are renamed separately, &#010; whether condition codes are renamed with general-purpose registers, &#010; whether immediate values consume PRF space, &#010; whether PRFs are banked, &#010; ...">exhaustive list</acronym>, but I need to get back to doing real work now&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Intel Ivy Bridge Cache Replacement Policy</title>
		<link>http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/</link>
		<comments>http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/#comments</comments>
		<pubDate>Fri, 25 Jan 2013 19:52:48 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Measuring Stuff]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[microarchitecture]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=1051</guid>
		<description><![CDATA[Caches are used to store a subset of a larger memory space in a smaller, faster memory, with the hope that future memory accesses will find their data in the cache. Traditionally, caches have used (approximations of) the least-recently used (LRU) replacement policy, but LRU performs poorly with cyclic access patterns with working sets larger than the cache. Intel Ivy Bridge's L3 cache uses an improved adaptive replacement policy, and is no longer purely pseudo-LRU <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/">Intel Ivy Bridge Cache Replacement Policy</a></span>]]></description>
			<content:encoded><![CDATA[<p>Caches are used to store a subset of a larger memory space in a smaller, faster memory, with the hope that future memory accesses will find their data in the cache rather than needing to access slower memory. A cache replacement policy decides which cache lines should be replaced when a new cache line needs to be stored. Ideally, data that will be accessed in the near future should be preserved, but real systems cannot know the future. Traditionally, caches have used (approximations of) the least-recently used (LRU) replacement policy, where the next cache line to be evicted is the one that has been least recently used. </p>
<p>Assuming data that has been recently accessed will likely be accessed again soon usually works well. However, an access pattern that repeatedly cycles through a working set larger than the cache results in a 100% cache miss rate: the most recently used cache line won&#8217;t be reused for a long time, so every line is evicted before its next use. <a href="http://dl.acm.org/citation.cfm?id=1250709">Adaptive Insertion Policies for High Performance Caching</a> (<abbr title="34th International Symposium on Computer Architecture">ISCA 2007</abbr>) and a follow-on paper <a href="http://dl.acm.org/citation.cfm?id=1815971">High performance cache replacement using re-reference interval prediction (RRIP)</a> (<abbr title="37th International Symposium on Computer Architecture">ISCA 2010</abbr>) describe similar cache replacement policies aimed at mitigating this problem. The L3 cache on Intel&#8217;s Ivy Bridge appears to use an adaptive policy resembling these, and is no longer pseudo-LRU.</p>
<h2>Measuring Cache Sizes</h2>
<div id="attachment_1063" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/latency.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/latency-360x250.png" alt="Cache Latency plot for four Intel microarchitectures" title="Cache Latency" width="360" height="250" class="size-medium wp-image-1063" /></a><p class="wp-caption-text">Figure 1: Cache access latencies for four Intel microarchitectures (Stride = 64 bytes) <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/latency.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/latency.pdf">[pdf]</a></p></div>
<p>The behaviour of LRU replacement policies with cyclic access patterns is useful for measuring cache sizes and latencies. The access pattern used to generate Figure 1 is a random cyclic permutation, where each cache line (64 bytes) in an array is accessed exactly once in a random order before the sequence repeats. Each access is data-dependent so this measures the full access latency (not bandwidth) of the cache. Using a cyclic pattern results in sharp changes in latency between cache levels.</p>
<p>Figure 1 clearly shows two levels of cache for the Yorkfield Core 2 (32 KB and 6 MB) and three levels for the other three. All of these transitions are fairly sharp, except for the L3-to-memory transition for Ivy Bridge (&#8220;3rd Generation Core i5&#8221;). There is new behaviour in Ivy Bridge&#8217;s L3 cache compared to the very similar Sandy Bridge. Curiosity strikes again.</p>
<p><br style="clear: both;"></p>
<h2>Ivy Bridge vs. Sandy Bridge</h2>
<div id="attachment_1061" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb_stride.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb_stride-360x255.png" alt="" title="Ivy Bridge vs. Sandy Bridge" width="360" height="255" class="size-medium wp-image-1061" /></a><p class="wp-caption-text">Figure 2: Varying the stride of the random cyclic permutation access pattern for Ivy Bridge and Sandy Bridge. Sharp corners appear when stride >= cache line size, except for Ivy Bridge L3. <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb_stride.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb_stride.pdf">[pdf]</a></p></div>
<p>Figure 2 shows a comparison of Sandy Bridge and Ivy Bridge, for varying stride. The stride parameter affects which bytes in the array are accessed by the random cyclic permutation. For example, a stride of 64 bytes will only access pointers spaced 64 bytes apart, accessing only the first 4 bytes of each cache line, and accessing each cache line exactly once in the cyclic sequence. A stride less than the cache line size results in accessing each cache line more than once in the random sequence, leading to some cache hits and transitions between cache levels that are not sharp. Figure 2 shows that Sandy Bridge and Ivy Bridge have the same behaviour except for strides at least as big as the cache line size for Ivy Bridge&#8217;s L3.</p>
<p>There are several hypotheses that can explain an improvement in L3 cache miss rates. Only a changed cache replacement policy agrees with observations:</p>
<ul>
<li><b>Prefetching</b>: An improved prefetcher capable of prefetching near-random accesses would benefit accesses of any stride. Figure 2 shows no improvement over Sandy Bridge for strides smaller than a cache line.</li>
<li><b>Changed hash function</b>: This would show a curve with a strange shape, as some parts of the array see a smaller cache size while some other parts see a bigger size. This is not observed.</li>
<li><b>Changed replacement policy</b>: Should show apparent cache size unchanged, but transitions between cache levels may not show the sharp transition seen with LRU policies. This agrees with observations.</li>
</ul>
<p>Figure 3 shows two plots similar to Figure 2 for larger stride values (512 bytes to 128 MB). The curved shape of the plots for Ivy Bridge is clearly visible for many stride values.</p>
<div style="width:770px;">
<div style="display:inline-block;">
<div id="attachment_1069" class="wp-caption alignnone" style="width: 380px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/snb1.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/snb1-360x255.png" alt="" title="Sandy Bridge, pointer chase." width="360" height="255" class="size-medium wp-image-1069" /></a><p class="wp-caption-text">Figure 3a: Sandy Bridge, larger strides <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/snb1.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/snb1.pdf">[pdf]</a></p></div></div>
<div style="display:inline-block;"><div id="attachment_1059" class="wp-caption alignnone" style="width: 380px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb1.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb1-360x255.png" alt="" title="Ivy Bridge, pointer chase" width="360" height="255" class="size-medium wp-image-1059" /></a><p class="wp-caption-text">Figure 3b: Ivy Bridge <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb1.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb1.pdf">[pdf]</a></p></div></div>
</div>
<p><br style="clear: both;"></p>
<h2>Adaptive replacement policy?</h2>
<p>An interesting paper from UT Austin and Intel from ISCA 2007 (<a href="http://dl.acm.org/citation.cfm?id=1250709">Adaptive Insertion Policies for High Performance Caching</a>) discussed improvements to the LRU replacement policy for cyclic access patterns that don&#8217;t fit in the cache. The adaptive policy tries to learn whether the access pattern reuses the cache lines before eviction and chooses an appropriate replacement policy (LRU vs. Bimodal Insertion Policy, BIP). BIP places new cache lines in the LRU position most of the time, the opposite of LRU, which always inserts new lines in the most-recently-used (MRU) position.</p>
<p>Testing for an adaptive policy can be done by attempting to defeat it. The idea is to trick the cache into thinking that cached data is reused, by modifying the access pattern to reuse each cache line before eviction. Instead of a single pointer chase that repeats <tt>p = *(void**)p</tt>, the inner loop was changed to do two pointer chases, <tt>p = *(void**)p; q = *(void**)q;</tt>, with one pointer lagging behind the other by some number of iterations, designed to touch each line fetched into the L3 cache exactly twice before eviction. Figure 4 plots the same parameters as Figure 3 but with the dual pointer chase access pattern. The Ivy Bridge plots closely resemble Sandy Bridge, showing that the replacement policy <i>is</i> adaptive, and has been mostly defeated.</p>
<div style="width:770px;">
<div style="display:inline-block;">
<div id="attachment_1069" class="wp-caption alignnone" style="width: 380px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/snb2.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/snb2-360x255.png" alt="" title="Sandy Bridge, dual pointer chase." width="360" height="255" class="size-medium wp-image-1069" /></a><p class="wp-caption-text">Figure 4a: Sandy Bridge, dual pointer chase <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/snb2.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/snb2.pdf">[pdf]</a></p></div></div>
<div style="display:inline-block;"><div id="attachment_1059" class="wp-caption alignnone" style="width: 380px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb2.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb2-360x255.png" alt="" title="Ivy Bridge, dual pointer chase" width="360" height="255" class="size-medium wp-image-1059" /></a><p class="wp-caption-text">Figure 4b: Ivy Bridge, dual pointer chase <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb2.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/ivb2.pdf">[pdf]</a></p></div></div>
</div>
<p>Since Ivy Bridge uses an adaptive replacement policy, it is likely that the replacement policy is closely related to the one proposed in the paper. Before we probe for more detail, we need to look a little more closely at the L3 cache.</p>
<h2>Ivy Bridge L3 Cache</h2>
<p>The L3 cache (also known as LLC, last level cache) is organized the same way for both Sandy Bridge and Ivy Bridge. The cache line size is 64 bytes. The cache is organized as four slices on a ring bus. Each slice is either 2048 sets * 16-way (2 MB for Core i7) or 2048 sets * 12-way (1.5 MB for Core i5). Physical addresses are hashed and statically mapped to each slice. (See <a href="http://www.realworldtech.com/sandy-bridge/8/">http://www.realworldtech.com/sandy-bridge/8/</a>) Thus, the way size within each slice is 128 KB, and a stride of 128 KB should access the same set of all four slices using traditional cache hash functions.</p>
<p>Figure 3 reveals some information about the hash function used. Here are two observations:</p>
<ul>
<li><b>Do bits [16:0] (128KB) of the physical address map directly to sets?</b> I think so. If not, the transition at 6 MB would be spread out somewhat, with latency increasing over several steps, with the increase starting below 6 MB.</li>
<li><b>Is the cache slice chosen by exactly bits [18:17] of the physical address?</b> No. Higher-order address bits are also used to select the cache slice. In Figure 3, the apparent cache size with 256 KB stride has doubled to 12 MB. It continues to double at 512, 1024, and 2048 KB strides. This behaviour resembles a 48-way cache. This can happen if the hash function equally distributes 256 through 2048 KB strides over all four slices. Thus, some physical address bits higher than bit 20 <i>are</i> used to select the slice, not just bits [20:17]. Paging with 2&nbsp;MB pages limits my ability to test physical address strides greater than 2 MB.</li>
</ul>
<h2>Choosing a Policy using Set Dueling</h2>
<div id="attachment_1065" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/setduel.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/setduel-360x255.png" alt="" title="Dynamic Insertion Policy, Set Duel" width="360" height="255" class="size-medium wp-image-1065" /></a><p class="wp-caption-text">Figure 5: Dynamic Insertion Policy, Set Duel <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/setduel.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/setduel.pdf">[pdf]</a></p></div>
<p>Adaptive cache replacement policies choose between two policies depending on which one is appropriate at the moment. Set dueling proposes to dedicate a small portion of the cache sets to each policy to detect which policy is performing better (<i>dedicated sets</i>), and the remainder of the cache as <i>follower sets</i> that follow the better policy. A single saturating counter compares the number of cache misses occurring in the two dedicated sets.</p>
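<p>The set-dueling mechanism can be sketched in a few lines of Python. This is only an illustration of the idea: the 10-bit counter width, the midpoint threshold, and all names are assumptions, not measured properties of Ivy Bridge.</p>

```python
class SetDuel:
    """Hypothetical set-dueling selector built around one saturating counter."""

    def __init__(self, bits=10):
        # Counter width is an assumption made for illustration.
        self.top = 2 ** bits - 1
        self.psel = self.top // 2          # start at the midpoint

    def record_miss(self, set_kind):
        # A miss in a set dedicated to policy A counts against A, and vice versa.
        if set_kind == "dedicated_a":
            self.psel = min(self.psel + 1, self.top)
        elif set_kind == "dedicated_b":
            self.psel = max(self.psel - 1, 0)
        # Misses in follower sets leave the counter untouched.

    def follower_policy(self):
        # A high counter means policy A has been missing more, so followers use B.
        return "B" if self.psel > self.top // 2 else "A"
```

<p>Because only the dedicated sets move the counter, a workload that touches only follower sets never changes the policy the followers use.</p>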
<p>This test attempts to show set-dueling behaviour by finding which sets in the cache are <i>dedicated</i> and which are <i>follower</i> sets. This test uses a 256 KB stride (128 KB works equally well), which maps all accesses onto one set in each of the four cache slices (due to the hash function). By default, this would mean all accesses only touch the <i>first</i> set in each slice if the low address bits [16:0] are zero. Thus, I introduce a new parameter, <tt>offset</tt>, which adds a constant to each address so that I can map all accesses to the second, third, etc. sets in each slice instead of always the first. This parameter is swept both up and down, and the cache replacement policy used is observed. </p>
<p>Since the adaptive policy chooses between two different replacement policies, I chose to distinguish between the two by observing the average latency at one specific array size, 4/3 of the L3 cache size (e.g., compare Sandy Bridge vs. Ivy Bridge at 8 MB in Figure 2). A high latency indicates the use of an LRU-like replacement policy, while a lower latency indicates the use of a thrashing-friendlier BIP/MRU type policy. Figure 5 plots the results.</p>
<p>It appears that most of the cache sets can use both cache replacement policies, except for two 4 KB regions at 32-36 and 48-52 KB (4 KB = 64 cache sets). These two regions always use LRU and BIP/MRU policies, respectively. The plot is periodic every 128 KB because there are 2048 sets per cache slice.</p>
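<p>As a small sketch of how an <tt>offset</tt> maps onto these regions, assuming the set index comes from address bits [16:6] and using the region boundaries read off Figure 5 (the function names are illustrative):</p>

```python
LINE_BYTES = 64
KB = 1024
WAY_BYTES = 128 * KB        # 2048 sets * 64 bytes; the pattern repeats with this period

def set_index(offset_bytes):
    """Set index within a slice, assuming bits [16:6] select the set."""
    return (offset_bytes % WAY_BYTES) // LINE_BYTES

def region(offset_bytes):
    """Classify an integer byte offset using the two dedicated regions in Figure 5."""
    o = offset_bytes % WAY_BYTES
    if o in range(32 * KB, 36 * KB):
        return "dedicated LRU-like"
    if o in range(48 * KB, 52 * KB):
        return "dedicated BIP/MRU-like"
    return "follower"

dedicated = sum(1 for s in range(2048) if region(s * LINE_BYTES) != "follower")
print(dedicated, "dedicated sets per slice,", 2048 - dedicated, "follower sets")
```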
<p>The global-counter learning behaviour is seen in Figure 5 by observing the different policies used while sweeping <tt>offset</tt> ascending vs. descending. Whenever the <tt>offset</tt> causes memory accesses to land on a dedicated set, it accumulates cache misses, causing the rest of the cache to flip to using the other policy. Cache misses on follower sets do not influence the policy, so the policy does not change until <tt>offset</tt> reaches the next dedicated set. </p>
<h2>DIP or DRRIP?</h2>
<p>The later paper from Intel (ISCA 2010) proposes a replacement policy that is improved over DIP-SD by also being resistant to scans. DIP and DRRIP are similar in that both use set dueling to choose between two policies (<abbr title="Static RRIP">SRRIP</abbr> vs. <abbr title="Bimodal RRIP">BRRIP</abbr> in <abbr title="Dynamic RRIP (ISCA 2010)">DRRIP</abbr>, <abbr title="Least Recently Used">LRU</abbr> vs. <abbr title="Bimodal Insertion Policy">BIP</abbr> in <abbr title="Dynamic Insertion Policy (ISCA 2007)">DIP</abbr>).</p>
<div id="attachment_1196" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/scan.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/scan-360x270.png" alt="" title="Scan resistance" width="360" height="270" class="size-medium wp-image-1196" /></a><p class="wp-caption-text">Figure 6: Scan resistance of Sandy Bridge LRU, Ivy Bridge &#8220;LRU-like&#8221;, and Ivy Bridge &#8220;BIP-like&#8221;. Pointer chasing through one huge array, and one array whose size is swept (x-axis). Replacement policy is scan-resistant if it preferentially keeps the small array in the cache rather than split 50-50. <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/scan.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/scan.pdf">[pdf]</a></p></div>
<p>One characteristic of RRIP is that it proposes four levels of re-reference prediction. True LRU has as many levels as the associativity, while NRU (not recently used) has two levels (recently used, or not). Intel&#8217;s presentation at Hot Chips 2012 hinted at using four-level RRIP (<a href="http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf">&#8220;Quad-Age LRU&#8221; on slide 46</a>). Thus, I would like to measure whether Ivy Bridge uses DIP or DRRIP.</p>
<p>Scans are short bursts of cache accesses that are not reused. SRRIP (but not LRU) attempts to be scan tolerant by predicting new cache lines to have a &#8220;long&#8221; re-reference interval and allowing recently-used existing cache lines to age slowly before being evicted, preferentially evicting the newly-loaded but not reused cache lines. This behaviour provides an opportunity to detect scan resistance by pointer-chasing through two arrays, one of which is much larger than the cache (scanning), one of which fits in the cache (working set).</p>
<p>Figure 6 plots the average access time for accesses that alternate between a huge array (cache miss) and a small array (possible cache hit) whose size is on the x-axis. The size of the working set that will fit in the cache in the presence of scanning is related to scan resistance. Memory accesses are serialized using <tt>lfence</tt>. The replacement policy is selected by choosing a stride and offset that coincide with a dedicated set (see Figure 5).</p>
<p>With half of the accesses hitting each array, an LRU replacement policy splits the cache capacity evenly between the two arrays, causing L3 misses when the small array size exceeds roughly half the cache. BIP (or BRRIP) is scan-resistant and keeps most of the cache for the frequently-reused small array. The LRU/SRRIP policy in Ivy Bridge behaves very similarly to Sandy Bridge and does not seem to be scan-resistant, so it is likely not SRRIP as proposed. It could, however, be a variant on SRRIP crafted to behave like LRU.</p>
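<p>The 50-50 split under LRU can be reproduced with a toy model of a single true-LRU set. This is a simplified sketch, not the simulator used for the figures; the function name and the one-set model are assumptions.</p>

```python
from collections import OrderedDict

def lru_hit_rate(ways, working_lines, rounds=200):
    """Hit rate of working-set accesses in one LRU set, alternating with a scan."""
    cache = OrderedDict()               # keys ordered from LRU to MRU
    hits = 0
    total = 0
    scan_id = 0

    def access(tag):
        if tag in cache:
            cache.move_to_end(tag)      # hit: move to the MRU position
            return True
        if len(cache) == ways:
            cache.popitem(last=False)   # evict the LRU line
        cache[tag] = None
        return False

    for r in range(rounds):
        for w in range(working_lines):
            access(("scan", scan_id))   # scan line, never reused
            scan_id += 1
            hit = access(("work", w))
            if r > 0:                   # skip the cold first pass
                total += 1
                hits += hit
    return hits / total

for lines in (5, 6, 7, 8):
    print(lines, "working lines in a 12-way set:", lru_hit_rate(12, lines))
```

<p>In this model the working set stays resident only while it occupies at most half of the ways, since every reuse is separated by as many scan lines as working-set lines.</p>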
<p>The four-level RRIP family of replacement policies is an extension of NRU, using two bits per cache line to encode how soon each cache line should be evicted (<abbr title="Re-Reference Prediction Value: LRU (evict me) = 3, MRU (keep me in cache) = 0">RRPV</abbr>). On eviction, one cache line with RRPV=3 (LRU position) is chosen to be evicted. If no cache lines have RRPV=3, all RRPV values are incremented until a cache line can be evicted. Whenever a cache line is accessed (cache hit), its RRPV is set to 0 (MRU position). The various insertion policies as proposed in the paper are as follows:</p>
<ul>
<li><b>BRRIP</b>: New cache lines are inserted with RRPV=3 with high probability, RRPV=2 otherwise.</li>
<li><b>SRRIP</b>: New cache lines are inserted with RRPV=2.</li>
</ul>
<p>I made some modifications for my simulations:</p>
<ul>
<li><b>BRRIP</b>: Modified the low-probability case to insert with RRPV=0 instead of RRPV=2</li>
<li><b>LRU-like RRIP</b>: New cache lines are inserted with RRPV=0. This is intended to approximate LRU.</li>
</ul>
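<p>The eviction, promotion, and insertion rules listed above can be sketched as a single-set model in Python. This is not the simulator behind Figure 7: the 1/32 probability for the rare BRRIP case, the victim tie-break, the warm-up, and all names are assumptions made for illustration.</p>

```python
import random

class RripSet:
    """One cache set with 2-bit RRPVs, following the rules described above."""

    def __init__(self, ways, policy, rng=None, brrip_rare=1 / 32):
        self.ways = ways
        self.policy = policy            # "srrip", "brrip", "brrip_mod" or "lru_like"
        self.rng = rng if rng is not None else random.Random(0)
        self.brrip_rare = brrip_rare    # assumed probability of the rare insertion
        self.lines = {}                 # tag -> RRPV

    def insert_rrpv(self):
        if self.policy == "srrip":
            return 2
        if self.policy == "lru_like":
            return 0
        rare = self.brrip_rare > self.rng.random()
        if self.policy == "brrip":
            return 2 if rare else 3
        if self.policy == "brrip_mod":  # modified low-probability case
            return 0 if rare else 3
        raise ValueError(self.policy)

    def access(self, tag):
        if tag in self.lines:
            self.lines[tag] = 0         # hit: promote to the MRU position
            return True
        if len(self.lines) == self.ways:
            # Age every line until at least one reaches RRPV=3, then evict one.
            while 3 not in self.lines.values():
                for t in self.lines:
                    self.lines[t] += 1
            victim = next(t for t, v in self.lines.items() if v == 3)
            del self.lines[victim]
        self.lines[tag] = self.insert_rrpv()
        return False

def working_set_hit_rate(policy, ways=12, working_lines=8, rounds=50):
    """Alternate scan and working-set accesses after warming up the working set."""
    s = RripSet(ways, policy)
    for _ in range(2):                  # touch the working set twice so it has been reused
        for w in range(working_lines):
            s.access(("work", w))
    hits = 0
    for r in range(rounds):
        for w in range(working_lines):
            s.access(("scan", r, w))    # scan line, never reused
            hits += s.access(("work", w))
    return hits / (rounds * working_lines)

print("SRRIP, 8 working lines in 12 ways:", working_set_hit_rate("srrip"))
```

<p>In this toy setup SRRIP keeps a warmed-up 8-line working set resident in a 12-way set, which is more than the half-of-the-ways limit of true LRU under the same access pattern.</p>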
<div id="attachment_1214" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/scan_sim.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2013/01/scan_sim-360x270.png" alt="" title="Cache replacement simulations" width="360" height="270" class="size-medium wp-image-1214" /></a><p class="wp-caption-text">Figure 7: Cache simulations of scan resistance of several replacement policies. Ivy Bridge (Figure 6) matches very well with LRU and BRRIP policies, but does not show the scan-resistance property of SRRIP. <a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/scan_sim.png">[png]</a><a href="http://blog.stuffedcow.net/wp-content/uploads/2013/01/scan_sim.pdf">[pdf]</a></p></div>
<p>Figure 7 shows L3-only cache simulation results of running the same access pattern through simulations of several replacement policies. The simulated cache is configured to match the Core i5 (four 1.5 MB slices, 12-way). The two policies used by Ivy Bridge (Figure 6) match very closely to LRU and BRRIP. Ivy Bridge matches less closely to a modified four-level RRIP configured to behave like LRU, and also matches slightly less well to BIP (not shown) than BRRIP. The simulation of SRRIP verifies its scan-resistance properties, which are not observed in the experimental measurement of Ivy Bridge.</p>
<p>I am not able to measure whether Ivy Bridge has really changed from the pseudo-LRU that was used in Sandy Bridge to a four-level RRIP. Given Intel&#8217;s hint in the slide and that Ivy Bridge behaves slightly differently from Sandy Bridge, I am inclined to believe that Ivy Bridge uses RRIP, despite the experimental measurements matching more closely to LRU than LRU-like four-level RRIP. However, it is fairly clear that Ivy Bridge lacks the scan-resistance property proposed in the ISCA 2010 paper.</p>
<p><br style="clear: both;"></p>
<h2>Conclusion</h2>
<p>Although the cache organization of Sandy Bridge and Ivy Bridge is essentially identical, Ivy Bridge&#8217;s L3 cache has an improved replacement policy. The policy appears to be similar to a hybrid between &#8220;Dynamic Insertion Policy &#8212; Set Duel&#8221; (DIP-SD) and &#8220;Dynamic Re-Reference Interval Prediction&#8221; (DRRIP), using four-level re-reference predictions and set dueling, but without scan resistance. For each 2048-set cache slice, 64 sets are dedicated to each of the LRU-like and BRRIP policies, with the remaining 1920 cache sets being follower sets that follow the better policy.</p>
<h2>Acknowledgements</h2>
<p>Many thanks to my friend <a href="http://www.ece.ubc.ca/~wwlfung/">Wilson</a> for pointing me to the two ISCA papers on DIP and DRRIP.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A Comparison of Intel&#8217;s 32nm and 22nm Core i5 CPUs: Power, Voltage, Temperature, and Frequency</title>
		<link>http://blog.stuffedcow.net/2012/10/intel32nm-22nm-core-i5-comparison/</link>
		<comments>http://blog.stuffedcow.net/2012/10/intel32nm-22nm-core-i5-comparison/#comments</comments>
		<pubDate>Wed, 31 Oct 2012 05:35:53 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Measuring Stuff]]></category>
		<category><![CDATA[cpu]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=926</guid>
		<description><![CDATA[Each new generation of CMOS manufacturing processes brings about a new set of trade-offs. Intel's recent tradition of manufacturing the same processor microarchitecture across two processes provides an opportunity to measure some of the voltage-delay-power scaling trends. The 22nm Ivy Bridge significantly improves on static (leakage) power over 32nm Sandy Bridge, but only shows small reductions in dynamic power. Ivy Bridge also requires higher voltage increases for the same frequency increase. Also, thermal resistance of Ivy Bridge increased over Sandy Bridge, likely due to the change from solder to polymer thermal interface material. <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2012/10/intel32nm-22nm-core-i5-comparison/">A Comparison of Intel&#8217;s 32nm and 22nm Core i5 CPUs: Power, Voltage, Temperature, and Frequency</a></span>]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>Each new generation of CMOS manufacturing processes brings about a new set of trade-offs. Intel&#8217;s recent tradition of manufacturing the same processor microarchitecture across two processes provides an opportunity to look at some of the voltage-delay-power scaling trends. Intel&#8217;s new Ivy Bridge processor is manufactured on a 22nm tri-gate CMOS process, a significant change from the planar transistors used in previous processes. Intel&#8217;s previous-generation Sandy Bridge processor, made on their 32nm planar CMOS process, uses a similar architecture and can be used as a point of comparison.</p>
<p>In a complete system, a processor&#8217;s power consumption, voltage, temperature, and operating frequency can be observed, while the latter three can be controlled. Using those tools, we can measure static and dynamic power as a function of temperature, frequency, and voltage, create shmoo plots (voltage vs. operating frequency), and compare overall thermal resistance.</p>
<p>There have been some rumblings that Ivy Bridge does not overclock as well as Sandy Bridge. On the other hand, Intel claims the 22nm process improves performance over 32nm. Another difference between the two processors is the switch from using solder thermal interface material (STIM) to polymer (PTIM), resulting in increased thermal resistance and higher junction temperatures on Ivy Bridge for the same power. A comparison of the measurements across Sandy Bridge and Ivy Bridge can quantify some of these observations.</p>
<p>&nbsp;</p>
<h2>Methodology</h2>
<ul>
<li>Core i5-2500K (32nm Sandy Bridge) and i5-3570K (22nm Ivy Bridge)</li>
<li>Biostar TZ77MXE motherboard</li>
</ul>
<p>The TZ77MXE motherboard allows adjustment of processor frequency (multiplier) and voltage, although it does not allow manual adjustment of voltage below 1.0V, or negative voltage offsets.</p>
<p>Power consumption is measured using multimeters on the 12V power connector to measure current and voltage. On modern ATX motherboards, CPU and GPU power regulators are powered by the 12V power connector, which gives a convenient place to measure the current and voltage consumed by the processor package excluding the rest of the system. Power is measured before the processor&#8217;s voltage converter. DC-DC converters are typically efficient, so I make no attempt to compensate for it. Voltage is measured at the connector after the ammeter&#8217;s voltage drop to reduce the skew caused by the ammeter&#8217;s resistance. Switching converters are generally tolerant of varying input voltages (10.9V to 12.2V observed) with minimal impact on efficiency.</p>
<p>The processor&#8217;s operating voltage is measured using the on-board IT8728F chip. Temperature is measured using coretemp (CPU on-die temperature sensors), reporting the temperature of the hottest core (usually core 2, cores numbered 0-3).</p>
<p>To control processor operating frequency, I changed only the multiplier while leaving BCLK at 100 MHz. Core voltage is controlled by setting a fixed voltage in the BIOS. I rely on the measured voltage rather than the voltage setting because the actual voltage can vary based on the load line (a mechanism that lowers supply voltage under high load to reduce the peak voltage swing) or &#8220;load line calibration&#8221; (a mechanism to defeat the load line). Processor temperature is controlled by lowering the cooling fan speed to raise the temperature.</p>
<p>Power consumption (switching activity?) depends strongly on the choice of workload. Power and temperature measurements are made when all four cores of the processor are active running the Prime95 torture test. Prime95 is able to sanity-check its own calculations, so it is also used to check for processor stability when generating a shmoo plot.</p>
<h2>Results</h2>
<h3>Power and Temperature</h3>
<div style="float:right; margin:5px;">
<div id="attachment_943" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snb_temp.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snb_temp-360x242.png" alt="" title="SNB Power vs. Temperature" width="360" height="242" class="size-medium wp-image-943" /></a><p class="wp-caption-text">Fig. 1a: SNB Power vs. Temperature<br />1.26 V, 1.6 GHz and 2.4 GHz</p></div><br />
<div id="attachment_942" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/ivb_temp.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/ivb_temp-360x240.png" alt="" title="IVB Power vs. Temperature" width="360" height="240" class="size-medium wp-image-942" /></a><p class="wp-caption-text">Fig. 1b: IVB Power vs. Temperature<br />1.26 V, 1.6 GHz and 2.4 GHz</p></div>
</div>
<p>I measure power consumption vs. temperature first, since its results can be used to compensate for varying temperature in later measurements. For both processors, I measure total power at 1.26V at both 1.6 GHz and 2.4 GHz. Total power can be broken down into two components: Static power that does not vary with switching frequency, and dynamic power that varies with switching frequency. Assuming dynamic power scales linearly with frequency, measuring at two frequencies allows extrapolating power consumption down to 0 Hz to separate out dynamic power and static power.</p>
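<p>The two-frequency extrapolation amounts to fitting a straight line through two points and reading off its value at 0 Hz. A minimal Python sketch (the function name and the example wattages are made up for illustration, not measured values):</p>

```python
def split_power(p_low, p_high, f_low=1.6, f_high=2.4):
    """Return (static_watts, watts_per_ghz) assuming P = static + k * f."""
    k = (p_high - p_low) / (f_high - f_low)   # slope: dynamic power per GHz
    static = p_low - k * f_low                # extrapolate the line down to 0 Hz
    return static, k

# Hypothetical readings at one temperature: 18 W at 1.6 GHz, 22 W at 2.4 GHz.
static, k = split_power(18.0, 22.0)
print("static:", round(static, 2), "W; dynamic:", round(k, 2), "W per GHz")
```

<p>With 1.6 GHz and 2.4 GHz as the two points, the static power is the 1.6 GHz reading minus twice the difference between the readings, or equivalently the 2.4 GHz reading minus three times the difference.</p>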
<p>Figures 1a and 1b show power vs. temperature for Sandy Bridge (SNB) and Ivy Bridge (IVB), respectively. Total power is plotted, as well as the extrapolated static power. Figure 1b plots both Ivy Bridge&#8217;s and Sandy Bridge&#8217;s static power for comparison. Dynamic power does not depend on temperature, since the 1.6 GHz and 2.4 GHz curves are parallel. The extrapolated static power curve includes data points from both curves, translated downwards by twice (1.6 GHz) and three times (2.4 GHz) the difference in power between the two curves. The extrapolated static power data points fit an exponential function very well, which agrees with the theory that leakage power typically grows exponentially with temperature. Ivy Bridge shows a significant improvement in static (leakage) power. One of the claimed benefits of multi-gate transistors is better channel control resulting in a better subthreshold slope and lower subthreshold leakage, and this measurement agrees.<br />
<br clear=both></p>
<h3>Dynamic Power vs. Frequency</h3>
<div style="float:right; margin:5px;">
<div id="attachment_974" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snb_freq.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snb_freq-360x239.png" alt="" title="SNB Power vs. Frequency" width="360" height="239" class="size-medium wp-image-974" /></a><p class="wp-caption-text">Fig. 2a: SNB Power vs. Frequency<br />1.26 V, variable temperature</p></div><br />
<div id="attachment_975" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/ivb_freq.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/ivb_freq-360x238.png" alt="" title="IVB Power vs. Frequency" width="360" height="238" class="size-medium wp-image-975" /></a><p class="wp-caption-text">Fig. 2b: Power vs. Frequency<br />1.26 V, variable temperature</p></div>
</div>
<p>Another classic textbook result is that dynamic power scales linearly with frequency. Figures 2a and 2b show measurements of total power and dynamic power for Sandy Bridge and Ivy Bridge, at 1.26V. Total power consumption is measured, while dynamic power is calculated by subtracting out the temperature-dependent static power found in the previous section.</p>
<p>The dynamic power curve fits a linear trendline very well. The intercept of the dynamic power trendline is expected to be zero (no dynamic power when there is no switching activity). A non-zero intercept for the trendline indicates some amount of experimental error, around half a watt in these plots. The red curves of total power have a slight upwards curve because total power (static power, but not dynamic power) increases with temperature.</p>
<p>Figure 2b includes the dynamic power curves for both processors for comparison. At 1.26V (an arbitrary voltage somewhat higher than the typical operating point), dynamic power for Ivy Bridge is only slightly lower (~6%). The main objective of this graph was to show that dynamic power increases linearly with frequency. The next section shows how dynamic power scales with processor supply voltage.</p>
<p><br clear=both></p>
<h3>Power vs. Supply Voltage</h3>
<div style="float:right; margin:5px;">
<div id="attachment_981" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snb_voltage.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snb_voltage-360x206.png" alt="" title="SNB  Power vs. Voltage" width="360" height="206" class="size-medium wp-image-981" /></a><p class="wp-caption-text">Fig. 3a: SNB Power vs. Voltage<br />1.6 GHz and 2.4 GHz, 90&deg;C</p></div><br />
<div id="attachment_980" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/ivb_voltage.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/ivb_voltage-360x208.png" alt="" title="IVB Power vs. Voltage" width="360" height="208" class="size-medium wp-image-980" /></a><p class="wp-caption-text">Fig. 3b: IVB Power vs. Voltage<br />1.6 GHz and 2.4 GHz, 90&deg;C</p></div><br />
<div id="attachment_991" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snbivb_voltage.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snbivb_voltage-360x248.png" alt="" title="Power vs. Voltage Comparison" width="360" height="248" class="size-medium wp-image-991" /></a><p class="wp-caption-text">Fig. 3c: Power vs. Voltage Comparison<br />2.4 GHz, 90&deg;C</p></div>
</div>
<p>The textbook formula says that dynamic power should be proportional to the square of the supply voltage. This section tests that relationship: I vary the processor supply voltage while keeping frequency and temperature constant. As before, dynamic and static power are separated by measuring power consumption at 1.6 and 2.4 GHz. I keep temperature constant at 90&deg;C because it is easy to raise the operating temperature by slowing down the cooling fan, but very difficult to lower it. The resulting measurements will show how dynamic power scales with supply voltage and how static power scales with supply voltage at a fixed 90&deg;C temperature.</p>
<p>Figures 3a and 3b show the results of these measurements for Sandy Bridge and Ivy Bridge, respectively.</p>
<p>The top two curves in each figure are direct measurements of total processor power at 1.6 and 2.4 GHz. Since total power includes both static and dynamic power, we need to break total power into static and dynamic components before curve fitting. Because temperature is kept constant, each pair of data points at a given voltage has the same static power, so static power can be computed as above, by extrapolating from the difference between total power at 1.6 and 2.4 GHz, independently for each voltage, giving the green static power curve. Dynamic power is then computed by subtracting static power from the total power.</p>
<p>For Sandy Bridge (Fig. 3a), the dynamic power fits a power curve well, and comes surprisingly close to the expected quadratic relation, P<sub>dynamic</sub> &prop; V<sup>2</sup>. Static power also fits a power curve (although I&#8217;m not aware of theory that requires it), where static power increases roughly as the cube of the voltage.</p>
<p>On Ivy Bridge (Fig. 3b), the curve fits are somewhat unexpected. Static power grows much more slowly than on Sandy Bridge (roughly P<sub>static</sub> &prop; V<sup>1.85</sup> instead of V<sup>3</sup>), but dynamic power grows slightly more quickly with voltage (P<sub>dynamic</sub> &prop; V<sup>2.3</sup> compared to V<sup>2</sup>). A comparison of just the 2.4 GHz dynamic power and static power is plotted in Fig. 3c. Dynamic power on Ivy Bridge is lower for all practical voltages (the curve fit suggests Ivy Bridge dynamic power will exceed Sandy Bridge above 1.9V).</p>
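<p>One way to obtain an exponent like those quoted above is a least-squares line fit on log-log axes. The sketch below assumes that method (the original fits may have been produced differently, for example by a spreadsheet trendline) and uses synthetic data rather than the measured points:</p>

```python
import math

def fit_power_law(voltages, powers):
    """Least-squares fit of P = a * V**n on log-log axes; returns (a, n)."""
    xs = [math.log(v) for v in voltages]
    ys = [math.log(p) for p in powers]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Slope of the regression line in log-log space is the exponent n.
    n = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - n * mx)
    return a, n

# Synthetic example: an exactly quadratic curve should give an exponent of 2.
volts = [1.0, 1.1, 1.2, 1.3]
watts = [20.0 * v ** 2 for v in volts]
a, n = fit_power_law(volts, watts)
print("a =", round(a, 3), "n =", round(n, 3))
```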
<p>I speculate that these differences (slower static power increase, but slightly higher dynamic power increase with voltage) are properties of tri-gate processes, but I don&#8217;t know enough about the differences between planar and tri-gate to know whether these observations match with theory.</p>
<p><br clear=both></p>
<h3>Voltage-Frequency Shmoo Plot</h3>
<div style="float:right; margin:5px;">
<div id="attachment_994" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snb_shmoo.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snb_shmoo-360x237.png" alt="" title="SNB Voltage-Frequency Shmoo" width="360" height="237" class="size-medium wp-image-994" /></a><p class="wp-caption-text">Fig 4a: SNB Voltage-Frequency Shmoo</p></div><br />
<div id="attachment_995" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/ivb_shmoo.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/ivb_shmoo-360x244.png" alt="" title="IVB Voltage-Frequency Shmoo" width="360" height="244" class="size-medium wp-image-995" /></a><p class="wp-caption-text">Fig 4b: IVB Voltage-Frequency Shmoo</p></div><br />
<div id="attachment_996" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snbivb_shmoo.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/snbivb_shmoo-360x247.png" alt="" title="Voltage-Frequency Shmoo Comparison" width="360" height="247" class="size-medium wp-image-996" /></a><p class="wp-caption-text">Fig 4c: Voltage-Frequency Shmoo Comparison</p></div>
</div>
<p>The primary knob for increasing the frequency of a processor is increasing its operating voltage. A shmoo plot characterizes the voltage-frequency relationship by testing a processor at various voltages and frequencies and recording which points function correctly (&#8220;pass&#8221;) and which do not (&#8220;fail&#8221;). The boundary between the pass and fail points indicates the lowest voltage at a given frequency (or, alternatively, the highest frequency at a given voltage) at which the processor can still operate, which correlates with how easily one can overclock the processor.</p>
<p>Unlike the rest of the measurements, the shmoo plots are made while only using one processor core with three cores idle. Prime95 was run on the slowest of the four cores, and a particular voltage and frequency is considered &#8220;pass&#8221; if Prime95 runs for around 10 minutes without error. The shmoo plots are slightly optimistic: A real-world usage scenario with four active cores instead of one usually requires higher voltage and causes higher temperatures, further reducing achievable frequency. Although running just one active core reduces the effect of temperature (by reducing the temperature change), I do not measure or compensate for the impact of temperature on maximum frequency.</p>
<p>Figures 4a and 4b show the shmoo plots for Sandy Bridge and Ivy Bridge, respectively. Additionally, a line was drawn that connects the lowest voltage that passes at each frequency, which approximates the boundary between the &#8220;pass&#8221; and &#8220;fail&#8221; points. Figure 4c shows a comparison of Sandy Bridge and Ivy Bridge. The two boundary lines from Figures 4a and 4b are plotted in Figure 4c. It is interesting that the slope of the Ivy Bridge curve (blue) is higher than for Sandy Bridge. Although Ivy Bridge is significantly faster than Sandy Bridge at low voltages, increasing the operating frequency requires a larger voltage increase on Ivy Bridge, such that the two chips require the same voltage (1.32V) to run at 4.5 GHz. This would suggest that overclocking Ivy Bridge beyond this point is somewhat more difficult, even though Ivy Bridge is faster/lower voltage at the lower non-overclocked frequency (below 3.8-3.9 GHz).</p>
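<p>The boundary line is simply the lowest passing voltage at each tested frequency. A minimal Python sketch of that reduction (the function name and the sample points are made up for illustration, not the measured data):</p>

```python
def shmoo_boundary(points):
    """points: iterable of (freq_ghz, volts, passed). Returns {freq: lowest passing volts}."""
    boundary = {}
    for freq, volts, passed in points:
        if passed:
            boundary[freq] = min(volts, boundary.get(freq, volts))
    return dict(sorted(boundary.items()))

# Hypothetical test points: (frequency in GHz, core voltage, Prime95 passed?)
sample = [
    (4.0, 1.20, False), (4.0, 1.25, True), (4.0, 1.30, True),
    (4.5, 1.30, False), (4.5, 1.32, True),
]
print(shmoo_boundary(sample))
```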
<p><br clear=both></p>
<div style="float:right; margin:5px;">
<div id="attachment_1003" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/schmoo_transistor.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/schmoo_transistor-360x206.png" alt="" title="Overlaying schmoo plot on Intel&#039;s chart" width="360" height="206" class="size-medium wp-image-1003" /></a><p class="wp-caption-text">Fig. 5: 22nm Performance improvements, from <a href=http://download.intel.com/newsroom/kits/22nm/pdfs/22nm-Details_Presentation.pdf>Intel&#039;s presentations</a>, extended.</p></div>
</div>
<p>One might recall Intel&#8217;s initial presentations on their 22nm process, which included charts showing performance and/or voltage improvements over their 32nm process. One such graph is reproduced in the left half of Figure 5. Intel&#8217;s chart is interesting: The performance and voltage gains claimed are indeed impressive, but the gain decreases at higher voltages (37% faster at 0.7V, 18% faster at 1.0V), and the typical operating point for the desktop processors is beyond the right edge of the chart (even before overclocking). Is there something unpleasant about the higher (typical!) voltages that Intel didn&#8217;t want to mention?</p>
<p>Subject to a few important caveats, Intel&#8217;s chart of voltage vs. gate delay is equivalent to a shmoo plot. One caveat is that Intel&#8217;s chart shows low-level transistor delays, while a shmoo plot shows the delay of a more complex circuit. In addition, the delay of a complex circuit consists of both transistor delay and interconnect delay, so it is expected that performance gains seen at the transistor level will be smaller when applied to a whole processor because interconnect delays are expected to become worse with each process shrink.</p>
<p>Given the above caveats, I have attempted to transform the shmoo plot (by plotting delay instead of its reciprocal, frequency) and overlay that onto Intel&#8217;s chart in Figure 5. Notice that the voltage range I was able to test is actually entirely off the right edge of Intel&#8217;s chart. My shmoo plot seems to match up reasonably well with Intel&#8217;s plot. Although performance improvements at low voltage are high, the improvement shrinks to around 5 percent at typical operating voltages, and the improvement even turns into a performance loss at the higher voltages seen when overclocking.</p>
<p><br clear=both></p>
<h3>Thermal Resistance</h3>
<div style="float:right; margin:5px;">
<div id="attachment_1014" class="wp-caption alignright" style="width: 370px"><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/10/thermal.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/10/thermal-360x243.png" alt="" title="Thermal Resistance" width="360" height="243" class="size-medium wp-image-1014" /></a><p class="wp-caption-text">Fig. 6: Thermal Resistance</p></div>
</div>
<p>The ability to cool a processor is determined by its thermal resistance. Power is dissipated at the bottom side of the chip, but most of the heat leaves through the top side: it must pass through the silicon die, heat spreader, and heatsink, then out to air, with some form of thermal interface material between each pair of those layers. The overall thermal resistance (in &deg;C/W) can be found by measuring the power dissipation and the total temperature difference between the on-die temperature sensors and ambient air, then dividing the temperature difference by the power.</p>
<p>There are two main reasons why Sandy Bridge and Ivy Bridge may have different thermal resistance. First, as chips are scaled smaller, power dissipation does not scale as much, leading to higher power density. Ivy Bridge&#8217;s die size (160 mm<sup>2</sup>) is 26% smaller than Sandy Bridge&#8217;s (216 mm<sup>2</sup>), reducing the contact surface area between the die and heat spreader. Second, Intel has switched from using solder between the die and heat spreader (solder thermal interface material, <a href="http://www.intel.com/technology/itj/2008/v12i1/1-materials/5-solder.htm">STIM, ~87 W/mK</a>) to a polymer material (<a href="http://www.intel.com/technology/itj/2008/v12i1/1-materials/5-solder.htm">PTIM, 3-4 W/mK</a>), presumably because Ivy Bridge&#8217;s reduced power dissipation is now comfortably within the range suitable for using PTIM (<a href="http://www.intel.com/technology/itj/2008/v12i1/1-materials/figures/Figure_16_lg.gif">See Figure 16</a>).</p>
<p>Thermal resistance is measured with all four cores active (fewer active cores result in a hot spot). The stock thermal paste, heatsink, and cooling fan are used on both processors. The cooling fan is kept at its maximum constant speed (around 2050 RPM), and power dissipation is varied by changing the CPU supply voltage.</p>
<p>Figure 6 shows a measurement of the thermal resistance on both processors. On both processors, thermal resistance improves somewhat at higher power. The thermal resistance of Ivy Bridge is around 0.15 &deg;C/W worse than Sandy Bridge. Although it&#8217;s not possible to break down the contribution of the two reasons, it seems likely that most of the increase in thermal resistance is due to the change in TIM. An increase of 0.15 &deg;C/W roughly corresponds to the bulk thermal resistance of a ~90 &mu;m layer of PTIM over the die area of 160 mm<sup>2</sup>.</p>
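<p>As a rough cross-check of the ~90 &mu;m figure, the bulk resistance of a uniform slab is thickness / (conductivity &times; area). The sketch below is in Python; the function and constant names are made up for illustration, and contact resistance at the two surfaces is ignored. It evaluates this for the 3-4 W/mK PTIM range quoted earlier:</p>
<pre>
# Bulk (conduction-only) thermal resistance of a uniform slab: R = t / (k * A).
# Contact resistance at the two surfaces is ignored, so this is only a rough
# cross-check, not a model of the real package.

def slab_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    """Return the thermal resistance of a flat slab in degrees C per watt."""
    return thickness_m / (conductivity_w_per_mk * area_m2)

DIE_AREA_M2 = 160e-6      # 160 mm^2 Ivy Bridge die
THICKNESS_M = 90e-6       # ~90 um PTIM layer

for k in (3.0, 4.0):      # PTIM conductivity range, W/mK
    r = slab_resistance(THICKNESS_M, k, DIE_AREA_M2)
    print("k = %.1f W/mK: %.3f C/W" % (k, r))
</pre>
<p>For this range the result falls between roughly 0.14 and 0.19 &deg;C/W, which brackets the measured 0.15 &deg;C/W difference.</p>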
<h2>Summary</h2>
<p>The above measurements attempt to characterize some of the changes when moving from Intel&#8217;s 32nm planar to 22nm Tri-Gate process. The 22nm Ivy Bridge significantly improves on static (leakage) power over 32nm Sandy Bridge, but only shows small reductions in dynamic power. Ivy Bridge also requires higher voltage increases for the same frequency increase, leading to more difficult overclocking but power savings at lower (standard) speeds.</p>
<p>In addition to the CMOS process changes, the thermal resistance of Ivy Bridge increased over Sandy Bridge, likely due to the change from solder to polymer thermal interface material between the die and heat spreader.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2012/10/intel32nm-22nm-core-i5-comparison/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>JMicron JMB363 Add-on Card AHCI mode</title>
		<link>http://blog.stuffedcow.net/2012/08/jmicron-jmb36x-add-on-card-ahci-mode/</link>
		<comments>http://blog.stuffedcow.net/2012/08/jmicron-jmb36x-add-on-card-ahci-mode/#comments</comments>
		<pubDate>Mon, 27 Aug 2012 03:05:11 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Building Stuff]]></category>
		<category><![CDATA[hackintosh]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=881</guid>
		<description><![CDATA[The JMicron JMB363 is a 2-port SATA + 1-port PATA controller chip often found embedded in motherboards and in low-cost add-on cards. The chip supports operating in IDE, AHCI, and RAID controller modes. Motherboard BIOSes allow choosing the operating mode, but add-on cards are stuck in RAID mode. I attempt to solve this problem by hacking the JMB363 option ROM to put the card into AHCI mode <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2012/08/jmicron-jmb36x-add-on-card-ahci-mode/">JMicron JMB363 Add-on Card AHCI mode</a></span>]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.jmicron.com/Product_JMB363.htm">JMicron JMB363</a> is a 2-port SATA + 1-port PATA controller chip often found embedded in motherboards and in low-cost add-on cards. The chip supports operating in IDE, AHCI, and RAID controller modes. Motherboard BIOSes allow choosing the operating mode, but add-on cards are stuck in RAID mode. </p>
<p>The problem with RAID mode is that standard AHCI drivers cannot be used. A JMicron-specific driver is available only for Windows. Some OSes (Linux, drivers/pci/quirks.c) set a PCI configuration register to put the controller into AHCI+IDE mode after booting and use standard drivers. Other OSes (Mac OS X) do nothing and can&#8217;t use the card.</p>
<p>I attempt to solve this problem by hacking the JMB363 option ROM to put the card into AHCI mode, with partial success. The two SATA ports work in OS X, including hot-plugging, but I could not get the parallel ATA enabled, and I had to bypass loading the option ROM (thus, can&#8217;t boot from the SATA ports anymore). The patched ROM might also be useful for other OSes that don&#8217;t have Linux-style PCI quirks.</p>
<h3>Option ROMs</h3>
<ul>
<li><a href="/wp-content/uploads/2012/08/jmb363_1.07.24.bin">Original Option ROM 1.07.24</a></li>
<li>Patch three uses of PCI configuration register 0xdf[1:0] to read 2&#8242;b10. This seems to put the controller in AHCI mode.
<ul>
<li><a href="/wp-content/uploads/2012/08/jmb363_1.07.24_ahci.bin">Patched Option ROM</a></li>
</ul>
</li>
<li>Also patch two uses of 0xdf[6] to read 1&#8242;b1 and write 0xa1 to configuration register 0x41. This bypasses the option ROM and keeps the SATA ports enabled.
<ul>
<li><a href="/wp-content/uploads/2012/08/jmb363_1.07.24_ahci_skiprom.bin">Patched Option ROM</a></li>
</ul>
</li>
</ul>
<h2>Hacking the Option ROM</h2>
<p>I used the newest version of the option ROM (1.07.24) posted at JMicron&#8217;s <a href="ftp://driver.jmicron.com.tw/SATA_Controller/Option_ROM/">FTP site</a>. Some hints about the configuration of the JMB363 can be found in JMicron&#8217;s <a href="ftp://driver.jmicron.com.tw/SATA_Controller/Option_ROM/release%20note.txt">release notes</a>, particularly about PCI configuration register 0xdf. The release notes hint at the existence of a newer version, 1.07.28, but it wasn&#8217;t posted. Linux&#8217;s drivers/pci/quirks.c gave some hints about the meaning of PCI configuration registers 0x40-43.</p>
<h3>Configuration Register 0xDF</h3>
<p>PCI configuration register 0xdf seems to be intended for the main BIOS to communicate its settings to the option ROM, rather than having a direct effect on the hardware. Bits [1:0] seem to indicate controller mode. Bits [7:6] and [5:4] look like they control two instances of the same thing because the code that parses them is structured the same way. The release notes say bit [6] is used to put the chip into &#8220;multi-function&#8221; mode with the option ROM disabled, which feels a little bit like a hack to me. Bits [3:2] don&#8217;t appear to be used.</p>
<table>
<tr>
<th>df[6]</th>
<th>df[1:0]</th>
<th>Result</th>
</tr>
<tr>
<td>0</td>
<td>00</td>
<td>Default, RAID mode</td>
</tr>
<tr>
<td>0</td>
<td>01</td>
<td>IDE device (class 0101)</td>
</tr>
<tr>
<td>0</td>
<td>10</td>
<td>AHCI device (class 0106)</td>
</tr>
<tr>
<td>0</td>
<td>11</td>
<td>IDE device (class 0101)</td>
</tr>
<tr>
<td>1</td>
<td>00</td>
<td>RAID device (class 0104), no option ROM</td>
</tr>
<tr>
<td>1</td>
<td>01</td>
<td>AHCI device (class 0106), no option ROM</td>
</tr>
<tr>
<td>1</td>
<td>10</td>
<td>AHCI device (class 0106), no option ROM</td>
</tr>
<tr>
<td>1</td>
<td>11</td>
<td>AHCI device (class 0106), no option ROM</td>
</tr>
</table>
<p>It seems like bits [1:0] select the card mode (RAID, IDE, AHCI, IDE?), with the value 2&#8242;b11 probably not intended to be used (the option ROM code only distinguishes between 2&#8242;b00, 2&#8242;b01, and other).</p>
<p>Since setting df[6] causes &#8220;jump out of Option ROM, do nothing&#8221; (See release notes for v1.03), it is no longer possible to boot from any disks attached to the chip.</p>
<p>In no case did setting 0xdf[6] cause the JMB363 to become a multi-function device. I speculate its purpose is to disable the option ROM, allowing the main BIOS to set up the rest of the device configuration without interference.</p>
<ul>
<li>Register df[6] is used by the option ROM code at offset 0x3517.</li>
<li>Registers df[7:6] and df[5:4] are used at offsets 0x372d and 0x3756, respectively. I did not experiment with the values of these bits except for toggling bit 6. They seem important, causing PCI config register bits 0xed[5:2] and 0xcd[5:2] to be set when df[7:6] and df[5:4], respectively, are 2&#8242;b01 or 2&#8242;b10.</li>
<li>Register df[1:0] is used in the option ROM at offsets 0x3553, 0x3574, and 0x37c1.</li>
</ul>
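<p>The observations in the table above can be written down as a small lookup. This is only a sketch in Python: <code>OBSERVED</code> and <code>decode_df</code> are hypothetical names, and the table records what was seen on this card rather than any documented behaviour.</p>
<pre>
# Hypothetical decoder for the observed behaviour of register 0xdf.
# Keys are (df[6], df[1:0]); values are the results from the table.
OBSERVED = {
    (0, 0b00): "Default, RAID mode",
    (0, 0b01): "IDE device (class 0101)",
    (0, 0b10): "AHCI device (class 0106)",
    (0, 0b11): "IDE device (class 0101)",
    (1, 0b00): "RAID device (class 0104), no option ROM",
    (1, 0b01): "AHCI device (class 0106), no option ROM",
    (1, 0b10): "AHCI device (class 0106), no option ROM",
    (1, 0b11): "AHCI device (class 0106), no option ROM",
}

def decode_df(value):
    """Split an 8-bit 0xdf value into df[6] and df[1:0] and look up the result."""
    bit6 = (value // 64) % 2      # df[6]
    mode = value % 4              # df[1:0]
    return bit6, mode, OBSERVED[(bit6, mode)]

print(decode_df(0x02))   # df[1:0] = 2'b10 with df[6] clear
print(decode_df(0x42))   # same mode bits with df[6] set
</pre>
<p>Bits [7], [5:4] and [3:2] are ignored here because their effect was not explored.</p>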
<h3>Configuration Register 0x40-43</h3>
<p>These configuration registers seem to control the hardware directly. The option ROM sets them but doesn&#8217;t read them.</p>
<table>
<tr>
<th>Register</th>
<th>Function</th>
</tr>
<tr>
<td>43</td>
<td>Defaults to 0x80. Unknown purpose.</td>
</tr>
<tr>
<td>42</td>
<td>Seems to take values of 0xc2 or 0x82. Bits [1:0] depend on 0xdf[1:0], with a special case when 0xdf[6] is set. Unknown purpose.</td>
</tr>
<tr>
<td>41</td>
<td>Bits [7:4] affect the SATA ports with [7:6] controlling SATA1, [5:4] controlling SATA0. Bits [3:0] unknown. Option ROM sets this to either 0xf1 or 0x51.</td>
</tr>
<tr>
<td>40</td>
<td>Bit 1 appears to control whether the second PCI function is enabled. Other bits unknown.</td>
</tr>
</table>
<p>Linux (drivers/pci/quirks.c) modifies this configuration register, which results in 0x80c2a1bf on my card. I tried modifying the option ROM to also configure register 0x40-43 to this value, but had many problems booting. Either I&#8217;m getting some of the other configuration bytes wrong or this value is intended only to be used after booting the system.</p>
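<p>To relate the 32-bit value to the per-byte table above, recall that PCI configuration space is little-endian, so the low byte of the value read at offset 0x40 is register 0x40. A minimal sketch in Python (<code>split_dword</code> is a made-up helper name):</p>
<pre>
# Split the 32-bit value read at offset 0x40 into the four byte registers.
# PCI configuration space is little-endian: the low byte is register 0x40.
def split_dword(base, value):
    """Return {register offset: byte value} for a 32-bit little-endian dword."""
    return {base + i: (value // (256 ** i)) % 256 for i in range(4)}

for reg, byte in sorted(split_dword(0x40, 0x80C2A1BF).items(), reverse=True):
    print("0x%02x: 0x%02x" % (reg, byte))
</pre>
<p>This gives 0x80, 0xc2, 0xa1 and 0xbf for registers 0x43 down to 0x40.</p>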
<h2>Problems</h2>
<p>Only modifying register df[1:0] to value 2&#8242;b10 seems to put the controller into AHCI mode (with no IDE PCI function, but PATA drives are still detected by the option ROM). The option ROM loads, correctly reports connected drives, and allows booting from them. However, OS X does not detect hard drives that were detected by the option ROM at boot time. If SATA drives are not attached until after the option ROM loads, they are detected by the OS and can be hot-plugged. Therefore, I chose to set register df[6] to cause the option ROM to quit without detecting drives. A side-effect is that without the option ROM, the computer can&#8217;t boot from the disks.</p>
<p>The issues with configuration registers 0x40-43 are more troublesome.</p>
<p>Linux sets register 0x41 to 0xa1, while the option ROM will set it to 0xf1 or 0x51. 0x51 appears to disable hot-plugging the disks on the SATA channels entirely. 0xa1 causes a hang for several minutes at boot, seemingly trying and failing to find a disk, while using 0xf1 boots normally. Trying with 0xb1 and 0xe1 shows the same behaviour as for 0xa1, but hangs only if a disk is not present on SATA0 or SATA1, respectively. Thus it seems like setting 0x41[7:6] or 0x41[5:4] to 2&#8242;b10 causes the option ROM to assert that a disk must be present at boot time for that channel. The hang does not occur if the option ROM is skipped by setting 0xdf[6], so I use 0xa1 to be closer to what Linux does, although I notice no other differences between 0xf1 and 0xa1.</p>
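<p>It can help to see the two 2-bit fields of register 0x41 for each value tried. The Python sketch below only extracts the bit fields; <code>fields_0x41</code> is a hypothetical helper, and no meaning is assigned to the field values beyond what is described above.</p>
<pre>
# Show bits [7:6] and [5:4] of register 0x41 for each value tried.
def fields_0x41(value):
    """Return (bits [7:6], bits [5:4]) of an 8-bit register value."""
    upper = (value // 64) % 4     # bits [7:6]
    lower = (value // 16) % 4     # bits [5:4]
    return upper, lower

for value in (0xF1, 0x51, 0xA1, 0xB1, 0xE1):
    upper, lower = fields_0x41(value)
    print("0x%02x: [7:6]=%s [5:4]=%s"
          % (value, format(upper, "02b"), format(lower, "02b")))
</pre>
<p>Only 0xa1, 0xb1 and 0xe1 contain a field equal to 2&#8242;b10, which matches the set of values that produced the hang.</p>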
<p>Linux sets register 40 bit 2 (to enable the IDE port?). Trying to do that in the option ROM causes a several-minute hang during boot when loading the option ROM (again, seemingly waiting for a disk and giving up), even when a PATA disk is present. With df[6] set, the option ROM does not detect any disks, despite spending several minutes trying.</p>
<h2>Final configuration</h2>
<ul>
<li>PATA not enabled because turning it on causes a hang during boot.</li>
<li>Option ROM not enabled by setting df[6], so the disks are not bootable.</li>
<li>SATA drives working in OS X with the standard AHCI driver, including hot-plugging.</li>
</ul>
<pre>
PCICFG v1.24  (c) Copyright 1997,1998 Ralf Brown
Modified by Datapath based on V1.19

-----------------------------------------------------------
PCI bus 04 device 00 function 00:  Header Type 'non-bridge' (single-func)
Vendor:	197B	???                                               
Device:	2363	???                                               
Class:	  01	disk                	Revision:	10
SubClass: 06	???                 	ProgramI/F:	01
CommandReg:   0007 = I/O-on mem-on busmstr
Status Reg:  0010 = CapList (fast)
CacheLine:      08	Latency:	00	BIST:	     00
SubsysVendor:    197B	SubsysDevice: 2363
Base Addresses:
	(0) 00009001 = I/O base=00009000 len=8
	(1) 00009401 = I/O base=00009400 len=4
	(2) 00009801 = I/O base=00009800 len=8
	(3) 00009C01 = I/O base=00009C00 len=4
	(4) 0000A001 = I/O base=0000A000 len=16
	(5) D9000000 = mem base=D9000000 len=512
CardBus:     00000000	ExpansionROM: 00000000 (64K,disabled)
INTline:	   0A	INTpin:       01
MinGrant:	   00	MaxLatency:   00
Device-Specific Data:
 40: 80C2A1BD  E4FF0808  40F00060  00000000  00110010  00008000 
 58: 000A2000  00036C11  10110000  00000000  00000000  00000000 
 70: 00000000  00000000  00000000  00000000  10000000  00000000 
 88: 00000000  40035001  00000000  00000000  00000000  00000000 
 A0: 00000000  00000000  00000000  00000000  00000000  00000000 
 B8: 00000000  80000000  00000000  00000000  00000000  00000000 
 D0: 80000018  1000001C  00EB0000  00000000  00000000  00000000 
 E8: 00000000  00000000  00100058  00000000  00000000  00100000 
Capabilities List:
	ID @8C = 01 PCI Power Management
	 PMC    =  PME#-D3hot
		DynClk = 0, PCI_PM version = 3
	 PMCSR  = 0000, data-select=0 unknown/unimplemented
		state=D0   
	 PMCSRX = -- -- -- --
	 Data   = 00
	ID @50 = 10 (unknown)

</pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2012/08/jmicron-jmb36x-add-on-card-ahci-mode/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
		</item>
		<item>
		<title>Compiling a Contrived Chunk of Code</title>
		<link>http://blog.stuffedcow.net/2012/07/compiling-a-contrived-chunk-of-code/</link>
		<comments>http://blog.stuffedcow.net/2012/07/compiling-a-contrived-chunk-of-code/#comments</comments>
		<pubDate>Thu, 26 Jul 2012 01:02:05 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Measuring Stuff]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=843</guid>
		<description><![CDATA[While crafting some C code to stress integer ALU bandwidth, I decided I would compile the code through various compilers to see what would come out. The code is a hand-unrolled loop with 5 independent chains of dependent ALU operations (add, and) designed to provide many independent ALU instructions for the integer core to execute. Even for this simple repetitive code, the best compiler (Intel C Compiler) produces code that runs 26% faster than the worst (llvm-clang). <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2012/07/compiling-a-contrived-chunk-of-code/">Compiling a Contrived Chunk of Code</a></span>]]></description>
			<content:encoded><![CDATA[<p>While crafting some C code to stress integer ALU bandwidth, I decided I would compile the code through various compilers to see what would come out. The code is a hand-unrolled loop with 5 independent chains of dependent ALU operations (add, and) designed to provide many independent ALU instructions for the integer core to execute.</p>
<p>I tested the same C code through the following compilers, both 32- and 64-bit, on the same Core i7-3770K system:</p>
<ul>
<li>OS X Apple clang version 3.1 (tags/Apple/clang-318.0.61) (based on LLVM 3.1svn)</li>
<li>OS X gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.9.00)</li>
<li>Linux gcc version 4.7.1 (GCC)</li>
<li>Linux Intel C Compiler Version 12.0.3</li>
</ul>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/loop_speed.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/loop_speed.png" alt="" title="Execution Speed" width="606" height="380" class="aligncenter size-full wp-image-849" /></a></p>
<p>The execution speed is normalized to the geometric mean of all the results. </p>
<p>The two sets of bars show the single-threaded performance as well as the performance when running eight copies of the routine (i.e., with Hyper-Threading). The performance gap between compilers narrows when Hyper-Threading is used, because a second thread can absorb the effects of poor instruction scheduling.</p>
<p>Even for a simple loop, there is a surprising difference in the execution speed of the compiled code. Because the loop is so simple and repetitive, a human can figure out roughly what the optimal output is and compare that to what the compilers generate.</p>
<h2>C Source Code</h2>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ccode.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ccode.png" alt="" title="Source Code" width="413" height="427" class="aligncenter size-full wp-image-844" /></a></p>
<p>In the above code, <code><b>i</b></code> is the loop counter and also used as a non-changing integer value in the computations. There are 5 independent chains of ALU operations, alternating between addition and logical AND. <code><b>counter</b></code> is a volatile integer variable that is incremented once per iteration. This routine should be able to provide enough data-independent instructions to execute 5 instructions per clock, which is more than sufficient to saturate today&#8217;s processors. The highest value I measured is 2.94 instructions per clock including loop and counting overhead. It&#8217;s 3.00 instructions per clock once the loop overhead of 2 (fused) instructions every iteration of 100 ALU operations is removed.</p>
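<p>The two throughput figures are related by the ratio of 102 to 100 instructions per iteration. The Python sketch below assumes (this is an interpretation, not something stated explicitly) that the 2.94 figure counts the 100 ALU operations per iteration and that the two fused overhead instructions take issue slots of their own:</p>
<pre>
# Relate the measured ALU throughput to the total instruction throughput.
# Assumption: 2.94 is ALU operations per clock with the overhead present.
ALU_OPS_PER_ITER = 100      # ALU operations in one loop iteration
OVERHEAD_PER_ITER = 2       # (fused) loop and counter instructions
measured_alu_ipc = 2.94     # highest measured value

cycles_per_iter = ALU_OPS_PER_ITER / measured_alu_ipc
total_ipc = (ALU_OPS_PER_ITER + OVERHEAD_PER_ITER) / cycles_per_iter
print("cycles per iteration: %.2f" % cycles_per_iter)
print("IPC counting overhead instructions: %.2f" % total_ipc)
</pre>
<p>Under that assumption one iteration takes about 34 cycles, and 102 instructions in 34 cycles is 3.00 per clock.</p>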
<p>The optimal assembly code after compilation should roughly follow the structure of the C code. It would alternate between 5 (independent) ADD operations and 5 AND, with a few loop-counter instructions inserted somewhere within the loop. The general principle is that data-dependent operations should be spaced out to make it easier for the processor to find instruction-level parallelism.</p>
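<p>The spacing principle can be illustrated with a toy model, sketched here in Python. It is not the benchmark code and says nothing about real pipeline timing: each instruction is a hypothetical (chain, step) pair that depends only on the previous step of its own chain, and the script compares an interleaved ordering with a grouped one.</p>
<pre>
# Each instruction is (chain, step) and depends on the previous step of the
# same chain. Compare two orderings of 5 chains with 4 steps each.
CHAINS, STEPS = 5, 4

interleaved = [(c, s) for s in range(STEPS) for c in range(CHAINS)]
grouped = [(c, s) for c in range(CHAINS) for s in range(STEPS)]

def min_dependency_distance(order):
    """Smallest gap (in instructions) between an op and the op it depends on."""
    position = {op: i for i, op in enumerate(order)}
    gaps = [position[(c, s)] - position[(c, s - 1)]
            for c in range(CHAINS) for s in range(1, STEPS)]
    return min(gaps)

print("interleaved:", min_dependency_distance(interleaved))
print("grouped:", min_dependency_distance(grouped))
</pre>
<p>In the interleaved order every dependent pair is 5 instructions apart, while in the grouped order each instruction immediately follows the one it depends on.</p>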
<h2>Mac OS X Compilers</h2>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/osx_full.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/osx_half.png" alt="" title="OS X LLVM-Clang and LLVM-GCC" width="750" height="791" class="aligncenter size-full wp-image-847" /></a></p>
<p>Here is the disassembly of the loop as compiled by the Mac OS X compilers. The disassembly is in AT&#038;T syntax, so the destination operand is the one on the right. The instructions operating on each &#8220;chain&#8221; are colour-coded the same as in the C source code.</p>
<p>It appears that for both Clang and GCC, the Apple compilers use LLVM for optimization and code generation, with Clang and GCC only being front-ends. This might explain the similarity in instruction scheduling strategies between the two compilers: There is a tendency to group <i>dependent</i> chains of instructions together, making it hard for the processor to extract ILP. Only the 64-bit LLVM-Clang has some amount of interleaving of instructions, which leads to a significant performance improvement compared to the other three compilers.</p>
<p>Given the poor instruction scheduling of 32-bit LLVM-Clang and both 32- and 64-bit LLVM-GCC, it&#8217;s no surprise these three perform worst. So what distinguishes between these three?</p>
<p>In 32-bit LLVM-Clang, the first 6 instructions in the loop have many long-latency data dependencies. The value of <code>eax</code> is modified at the bottom of the loop (loop-carried dependency at instruction #105), then stored to memory (instruction #1), loaded again (#4), and is immediately consumed (#6). There wasn&#8217;t even an attempt to space out the long-latency store-load-use operations, and this hurts performance. </p>
<p>32-bit LLVM-GCC generates nearly the same code with the same problems as 32-bit LLVM-Clang, but backing up the register value (instruction #1 in the LLVM-Clang code) is done in the middle of the loop (instruction #24).</p>
<p>64-bit LLVM-GCC&#8217;s output is significantly better. There are more registers in x86-64 so register spilling is no longer necessary, and the long-latency memory operations are no longer data-dependent with the ALU operations. It still suffers from poor instruction scheduling.</p>
<h2>Linux Compilers</h2>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/linux_full.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/linux_half.png" alt="" title="Linux GCC and ICC" width="750" height="732" class="aligncenter size-full wp-image-848" /></a></p>
<p>The Linux compilers perform quite well, and the disassembly clearly shows that instruction scheduling is much improved compared to the LLVM-generated code.</p>
<p>32-bit GCC (real GCC this time, not an LLVM backend) shows some irregularity in the instruction interleaving. It spills one register to memory, and does that fairly intelligently. The register is spilled at instructions #87-88 and not consumed until instruction #105. Likewise, the same register is spilled again at instruction #1 but not consumed until instruction #11. Although GCC needs to spill a register to memory, the performance impact is fairly small.</p>
<p>64-bit code allows GCC to have enough registers to avoid spilling to memory&#8230;maybe too many. In instructions #1 and #92, it needlessly switches between using <code><b>edx</b></code> and <code><b>r15d</b></code> to hold the same variable. Like 32-bit GCC, 64-bit GCC also unnecessarily splits up the counter increment into three instructions (load #87, add #97, store #104). Because it does not need to spill registers to memory, GCC&#8217;s instruction scheduling spaces out dependent instructions well, although seemingly with some randomness.</p>
<p>Both the 32-bit and 64-bit Intel compilers generate nearly identical code, except for register assignment, and therefore, code size. Instructions are perfectly interleaved, with maximal spacing between dependent instructions. Incrementing the volatile counter uses a pointer stored in a register (instruction #97) with the destination operand in memory, with no register needed to temporarily store the counter value. This saves one register, making 32-bit ICC the only 32-bit compiler that does not spill any registers to memory for this routine. The use of dec-jne (instructions #102-103) to terminate the loop also allows macro-op fusion to work.</p>
<h2>Register Spilling</h2>
<p>LLVM-Clang and LLVM-GCC seem to interpret the <code><b>volatile</b></code> qualifier differently from the GCC and Intel C compilers. The <code>counter</code> variable was declared as <code><b>volatile unsigned int &#038;counter</b></code>. LLVM-Clang and LLVM-GCC interpret this as meaning that the pointer itself is volatile, as well as the value to which it points. This leads to code which loads the value of the pointer from memory, then increments the integer located in memory using that pointer. GCC and Intel&#8217;s compiler interpret the volatile declaration as meaning only the final integer is volatile. They keep the pointer in a register, and simply increment the integer located where the pointer points. This reduces register usage by one. GCC wastes one register by splitting up <code>counter++</code> into three instructions instead of using a destination operand in memory, so the 32-bit Intel compiler is the only compiler that generates 32-bit code that does not spill any registers to memory.</p>
<h2>Instruction Scheduling</h2>
<p>It seems like this version of LLVM gets it wrong (it tends to group dependent operations together), while GCC and ICC do it right. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2012/07/compiling-a-contrived-chunk-of-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>OS X Process Scheduling</title>
		<link>http://blog.stuffedcow.net/2012/07/os-x-process-scheduling/</link>
		<comments>http://blog.stuffedcow.net/2012/07/os-x-process-scheduling/#comments</comments>
		<pubDate>Wed, 25 Jul 2012 04:54:20 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Measuring Stuff]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=834</guid>
		<description><![CDATA[<p>Earlier, I wrote about the SMT-awareness of the CFQ and BFS schedulers on Linux. Here, I do a similar test on the Mac OS X process scheduler.</p> System Core i7-3770K 3.5 GHz, 4 cores, 8 threads (2-way SMT) Mageia Linux, kernel 3.4.4, CFQ scheduler Mac OS X 10.7 (Update: Also 10.8) Workload: Independent ALU instructions <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2012/07/os-x-process-scheduling/">OS X Process Scheduling</a></span>]]></description>
			<content:encoded><![CDATA[<p>Earlier, I wrote about the SMT-awareness of the <a href="/2011/08/linux-smt-aware-process-scheduling/">CFQ and BFS schedulers on Linux</a>. Here, I do a similar test on the Mac OS X process scheduler.</p>
<h2>System</h2>
<ul>
<li>Core i7-3770K 3.5 GHz, 4 cores, 8 threads (2-way SMT)</li>
<li>Mageia Linux, kernel 3.4.4, CFQ scheduler</li>
<li>Mac OS X 10.7 (Update: Also 10.8)</li>
<li>Workload: Independent ALU instructions (does not improve with SMT)</li>
</ul>
<h3>Update</h3>
<p>The thread scheduler on Mac OS X 10.8 behaves identically.</p>
<h2>Test Result</h2>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/osx_scheduler.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/osx_scheduler.png" alt="" title="OS X Process Scheduler" width="577" height="385" class="alignnone size-full wp-image-835" /></a></p>
<p>As more concurrent processes are run, the total throughput increases up to 4 threads on 4 cores. Due to the nature of the workload, SMT offers no additional throughput improvement. A good SMT-aware process scheduler would first assign threads to separate cores before doubling up threads on the same core. The CFQ scheduler is able to do that, but the OS X scheduler does not. It makes sub-optimal scheduling decisions when there are between 4 and 6 threads, choosing to assign two threads to the same core while leaving another core idle, causing a performance loss (11% loss at 4 threads).</p>
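<p>The &#8220;separate cores first&#8221; policy can be sketched in a few lines of Python. This is an illustrative placement rule only, not the actual algorithm of either scheduler, and <code>spread_first</code> and <code>busy_cores</code> are made-up names:</p>
<pre>
CORES, WAYS = 4, 2   # 4 cores, 2-way SMT, as on the test system

def spread_first(num_threads):
    """Place threads one per core first, then double up. Returns threads per core."""
    load = [0] * CORES
    for t in range(min(num_threads, CORES * WAYS)):
        load[t % CORES] += 1
    return load

def busy_cores(load):
    """Count cores that have at least one thread assigned."""
    return sum(1 for n in load if n != 0)

for n in (3, 4, 5, 8):
    load = spread_first(n)
    print(n, "threads:", load, "busy cores:", busy_cores(load))
</pre>
<p>For a workload that gains nothing from SMT, throughput tracks the number of busy cores, so a placement such as [2, 1, 1, 0] for 4 threads keeps only 3 cores busy where spread-first placement keeps all 4 busy.</p>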
<p>Is the OS X process scheduler SMT-aware? Yes. Does it make good scheduling decisions? Not really, but still <a href="/2011/08/linux-smt-aware-process-scheduling/">better than the BFS scheduler</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2012/07/os-x-process-scheduling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Replacing VIA HD Audio Codec Chip</title>
		<link>http://blog.stuffedcow.net/2012/07/replacing-via-hd-audio-codec-chip/</link>
		<comments>http://blog.stuffedcow.net/2012/07/replacing-via-hd-audio-codec-chip/#comments</comments>
		<pubDate>Mon, 23 Jul 2012 00:53:48 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Building Stuff]]></category>
		<category><![CDATA[hackintosh]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=810</guid>
		<description><![CDATA[Gigabyte's new UEFI BIOS is particularly well-suited for building Hackintoshes. However, many of Gigabyte's recent motherboards, including all of the MicroATX Z77 and H77 boards, use the VIA VT2021 HD Audio codec chip, which is not well-supported. Since I'm building a Hackintosh with a GA-Z77M-D3H with VIA VT2021 chip, I decided to work around the audio issues by swapping the VT2021 with a Realtek ALC885 chip. <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2012/07/replacing-via-hd-audio-codec-chip/">Replacing VIA HD Audio Codec Chip</a></span>]]></description>
			<content:encoded><![CDATA[<p>Gigabyte&#8217;s new UEFI BIOS is particularly <!-- a href="http://www.tonymacx86.com/viewtopic.php?f=22&#038;t=56617"--><a href="http://www.tonymacx86.com/showthread.php/49446">well-suited for building Hackintoshes</a>. However, many of Gigabyte&#8217;s recent motherboards, including all of the MicroATX Z77 and H77 boards, use the VIA VT2021 HD Audio codec chip, which is <!--a href="http://www.tonymacx86.com/viewtopic.php?f=169&#038;t=67429"--><a href="http://www.tonymacx86.com/showthread.php/58811">not well-supported</a>. In contrast, the Realtek line of HD Audio codecs are generally well-supported. The ALC885 and ALC889A in particular don&#8217;t even need editing AppleHDA.kext for the device ID.</p>
<p>The <a href="http://www.intel.com/content/www/us/en/standards/high-definition-audio-specification.html">Intel High Definition Audio Specification</a> defines a standard 48-pin package for the codec chip, although it is &#8220;not a compliance requirement for all codecs&#8221;. It appears that both Realtek and VIA codecs usually follow the standard package, so it&#8217;s likely they&#8217;re (almost?) pin-compatible. <b>Update:</b> The pin-outs of the ALC885 and VT2021 are the same.</p>
<p>Datasheets:</p>
<ul>
<li><a href="http://realtek.info/pdf/ALC885_1-1.pdf">ALC885</a></li>
<li><a href="http://www.tonymacx86.com/audio/65981-data-sheet-via-vt2021-codec.html">VT2021</a> (Thanks, David!)</li>
</ul>
<p>Since I&#8217;m building a Hackintosh with a GA-Z77M-D3H with a VIA VT2021 chip, I decided to work around the audio issues by swapping the VT2021 with a Realtek ALC885 chip.</p>
<h2>Procedure</h2>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/dj3I8YKbZdw?fs=1&#038;feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<ol>
<li>Desolder the existing codec from the motherboard</li>
<li>Clean off any bridged pads on the motherboard (oops)</li>
<li>Add flux and solder new chip</li>
<li>Inspect, then power on</li>
</ol>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/vt2021.jpg"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/vt2021-360x360.jpg" alt="" title="VIA VT2021" width="360" height="360" class="alignnone size-medium wp-image-819" /></a><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/alc885.jpg"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/alc885-360x360.jpg" alt="" title="Realtek ALC885" width="360" height="360" class="alignnone size-medium wp-image-818" /></a></p>
<p>Here are close-up photographs of the removed VT2021 chip and newly-installed ALC885.</p>
<h2>Results</h2>
<p>Surprisingly, the replaced chip appears to work fine. It was auto-detected and configured automatically in Linux, and also works in OS X with the usual amount of headache. I have not rigorously tested whether all of the output ports work. Those that I tested (rear microphone and rear line out) work fine.</p>
<p>The codec chip and its accompanying 12V-to-5V linear regulator (AS78L05, left of the codec) get suspiciously hot (around 50 C?). I do not know if it&#8217;s normal for the ALC885 to dissipate enough power to reach that temperature, nor did I think of checking the VT2021&#8242;s temperature before replacing it. Since the audio output is working, I currently have no proof that this is abnormal&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2012/07/replacing-via-hd-audio-codec-chip/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Intel HD4000 QE/CI Acceleration</title>
		<link>http://blog.stuffedcow.net/2012/07/intel-hd4000-qeci-acceleration/</link>
		<comments>http://blog.stuffedcow.net/2012/07/intel-hd4000-qeci-acceleration/#comments</comments>
		<pubDate>Sat, 14 Jul 2012 03:39:12 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Building Stuff]]></category>
		<category><![CDATA[hackintosh]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=732</guid>
		<description><![CDATA[Graphics acceleration (Core Image, Quartz Extreme) for Intel HD Graphics 4000 (on Ivy Bridge processors) works in Mac OS X! Setting the AAPL,ig-platform-id device property is required to get the drivers to load. <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2012/07/intel-hd4000-qeci-acceleration/">Intel HD4000 QE/CI Acceleration</a></span>]]></description>
			<content:encoded><![CDATA[<p>Graphics acceleration (Core Image, Quartz Extreme) for Intel HD Graphics 4000 (on Ivy Bridge processors) works in Mac OS X! The drivers are included in the <a href="http://support.apple.com/kb/DL1542">MacBook Pro (Mid 2012) Software Update 1.0</a>, Lion 10.7.5, or Mountain Lion 10.8. (Also found in <a href="http://tonymacx86.blogspot.ca/2012/06/bridgehelper-50-native-ivy-bridge.html">BridgeHelper 5.0</a>). </p>
<ul>
<li>AppleIntelGraphicsHD4000.kext</li>
<li>AppleIntelGraphicsFramebufferCapri.kext</li>
<li>&#8230;and maybe some others I don&#8217;t know about</li>
</ul>
<p>As pointed out by <a href="http://www.tonymacx86.com/viewtopic.php?f=169&#038;t=65592">ElNono</a> and <a href="http://www.insanelymac.com/forum/index.php?showtopic=280372">proteinshake</a>, the critical bit missing to get the HD4000 drivers to load is the <b>AAPL,ig-platform-id</b> device property for the graphics device. Of course, on anything other than Apple hardware, this property wouldn&#8217;t exist, and would need to be added.</p>
<p>The AAPL,ig-platform-id property is a 32-bit number that must be one of the values listed in the following table. Thanks to <a href="http://www.tonymacx86.com/viewtopic.php?f=169&#038;t=65592">ElNono</a> for figuring it out.</p>
<table style="margin:auto;">
<tr>
<th>AAPL,ig-platform-id</th>
<th>Memory (MB)</th>
<th>Pipes</th>
<th>Ports</th>
<th>Comment</th>
</tr>
<tr>
<td>01660000</td>
<td>96</td>
<td>3</td>
<td>4</td>
<td></td>
</tr>
<tr>
<td>01660001</td>
<td>96</td>
<td>3</td>
<td>4</td>
<td></td>
</tr>
<tr>
<td>01660002</td>
<td>64</td>
<td>3</td>
<td>1</td>
<td>No DVI</td>
</tr>
<tr>
<td>01660003</td>
<td>64</td>
<td>2</td>
<td>2</td>
<td></td>
</tr>
<tr>
<td>01660004</td>
<td>32</td>
<td>3</td>
<td>1</td>
<td>No DVI</td>
</tr>
<tr>
<td>016<b>2</b>0005</td>
<td>32</td>
<td>2</td>
<td>3</td>
<td></td>
</tr>
<tr>
<td>016<b>2</b>0006</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>No display</td>
</tr>
<tr>
<td>016<b>2</b>0007</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>No display</td>
</tr>
<tr>
<td>01660008</td>
<td>64</td>
<td>3</td>
<td>3</td>
<td></td>
</tr>
<tr>
<td>01660009</td>
<td>64</td>
<td>3</td>
<td>3</td>
<td></td>
</tr>
<tr>
<td>0166000a</td>
<td>32</td>
<td>2</td>
<td>3</td>
<td></td>
</tr>
<tr>
<td>0166000b</td>
<td>32</td>
<td>2</td>
<td>3</td>
<td></td>
</tr>
</table>
<p>The value of AAPL,ig-platform-id selects which graphics configuration the driver will use (er, blindly assume). Setting AAPL,ig-platform-id to any of the 12 values will load the HD4000 driver, but there are some other constraints.</p>
<p>First, the setting affects which ports are enabled. Configurations with zero ports should be avoided (no output?). Configurations with just one port should probably be avoided because it&#8217;s probably not the port you&#8217;re looking for. On my GA-Z77M-D3H, for configurations with one port, the enabled port is not the DVI port. For configurations with two or more ports, DVI was available. The VGA port is not enabled for any of the configurations. I did not test the HDMI port.</p>
<p>Also, the graphics memory size for the configuration <b><i>must</i></b> match the setting in BIOS. If they don&#8217;t match, the driver may crash (kernel panic at <code>gen7_memory.cpp:721</code>), or the display may be corrupted. For example, using configuration <code>0x01660000</code> (96 MB, 3 pipes, 4 ports), set the Graphics Memory Size to 96 MB to match. It appears that OS X doesn&#8217;t obey the DVMT Total Memory Size setting (always 512 MB with 8 GB system RAM), so I just left it as MAX to match.<br />
<a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/hd4000_bios.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/hd4000_bios.png" alt="" title="HD4000 Graphics Memory Setting" width="610" height="167" class="aligncenter size-full wp-image-746" /></a></p>
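<p>The two constraints above (avoid configurations with zero or one port, and match the BIOS graphics memory size) can be sketched as a lookup over the table. This is only an illustration: the names <code>PLATFORM_IDS</code> and <code>candidate_ids</code> are made up for this example, and the port filter reflects what I saw on the GA-Z77M-D3H, which may not hold on other boards.</p>

```python
# AAPL,ig-platform-id table from this post: id -> (memory MB, pipes, ports).
PLATFORM_IDS = {
    0x01660000: (96, 3, 4),
    0x01660001: (96, 3, 4),
    0x01660002: (64, 3, 1),
    0x01660003: (64, 2, 2),
    0x01660004: (32, 3, 1),
    0x01620005: (32, 2, 3),
    0x01620006: (0, 0, 0),
    0x01620007: (0, 0, 0),
    0x01660008: (64, 3, 3),
    0x01660009: (64, 3, 3),
    0x0166000A: (32, 2, 3),
    0x0166000B: (32, 2, 3),
}


def candidate_ids(bios_memory_mb):
    """Return IDs whose memory size matches the BIOS setting and that
    enable two or more ports (hypothetical helper, not part of any tool)."""
    return [
        "%08x" % pid
        for pid, (memory_mb, _pipes, ports) in sorted(PLATFORM_IDS.items())
        if memory_mb == bios_memory_mb and ports not in (0, 1)
    ]


for size_mb in (96, 64, 32):
    print(size_mb, candidate_ids(size_mb))
```

<p>With the BIOS set to 96 MB, this leaves only 01660000 and 01660001, which is consistent with the configuration I ended up using.</p>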
<h2>Results</h2>
<div style="border: #cccccc 1px solid;">
<code style="font-size:8pt;"><br />
    | +-o GFX0@2  <class IOPCIDevice, id 0x1000001e8, registered, matched, active, busy 0 (262 ms), retain 26><br />
    | | | {<br />
    | | |   "assigned-addresses" = <1010008200000000000080f70000000000004000181000c20f000000000000100000000000000010201000810000000000f000000000000040000000><br />
    | | |   "IOInterruptSpecifiers" = (<1000000007000000>,<0400000000000100>)<br />
    | | |   "class-code" = <00000300><br />
    | | |   "IODeviceMemory" = (({"address"=4152360960,"length"=4194304}),({"address"=64692944896,"length"=268435456}),"IOSubMemoryDescriptor is not serializable")<br />
    | | |   "AAPL,gray-page" = <01000000><br />
    | | |   "IOHibernateState" = <00000000><br />
    | | |   "IOPowerManagement" = {"MaxPowerState"=2,"ChildrenPowerState"=2,"CurrentPowerState"=2}<br />
    | | |   "subsystem-vendor-id" = <58140000><br />
    | | |   "built-in" = <00><br />
    | | |   "acpi-device" = "IOACPIPlatformDevice is not serializable"<br />
    | | |   "IOPCIMSIMode" = Yes<br />
    | | |   "IOInterruptControllers" = ("io-apic-0","IOPCIMessagedInterruptController")<br />
    | | |   "name" = "display"<br />
    | | |   "vendor-id" = <86800000><br />
    | | |   "device-id" = <62010000><br />
    | | |   "IOPCIResourced" = Yes<br />
    | | |   "compatible" = <"pci1458,d000","pci8086,162","pciclass,030000"><br />
    | | |   "AAPL,iokit-ndrv" = <a0d7d2807fffffff><br />
    | | |   "acpi-path" = "IOACPIPlane:/_SB/PCI0@0/GFX0@20000"<br />
    | | |   "model" = <"Intel HD Graphics 4000"><br />
    | | |   "subsystem-id" = <00d00000><br />
    | | |   "revision-id" = <09000000><br />
    | | |   <span style="background-color:#ff0;">"AAPL,ig-platform-id" = <00006601></span><br />
    | | |   "AAPL,gray-value" = <c38c6400><br />
    | | |   "pcidebug" = "0:2:0"<br />
    | | |   "IOName" = "display"<br />
    | | |   "device_type" = <"display"><br />
    | | |   "reg" = <0010000000000000000000000000000000000000101000020000000000000000000000000000400018100042000000000000000000000000000000102010000100000000000000000000000040000000><br />
    | | | }</p>
<p></code></div>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/osx_hd4000.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/osx_hd4000.png" alt="" title="OS X Graphics Properties" width="592" height="436" class="aligncenter size-full wp-image-751" /></a></p>
<p><code>ioreg -l -p IODeviceTree</code> can be used to verify that the device property was correctly added to the list of properties for the graphics device. The end result is that the AAPL,ig-platform-id property shows up in <code>ioreg</code> output. A correct value causes <code>AppleIntelGraphicsHD4000.kext</code> to load.</p>
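<p>Note that the <code>ioreg</code> output above shows <code>00006601</code> for configuration 01660000, and <code>62010000</code> for device-id 0162. That is consistent with the 32-bit values being stored as little-endian bytes. A small sketch of the conversion (the function names are invented for this example):</p>

```python
def platform_id_to_bytes(platform_id):
    """Encode a 32-bit value as 4 little-endian bytes, as ioreg displays it."""
    return platform_id.to_bytes(4, "little")


def bytes_to_platform_id(raw):
    """Decode 4 little-endian bytes back into the 32-bit value."""
    return int.from_bytes(raw, "little")


# Table entry 01660000 as it appears in the ioreg dump.
print(platform_id_to_bytes(0x01660000).hex())

# device-id bytes from the ioreg dump, decoded back to the PCI device ID.
print("%04x" % bytes_to_platform_id(bytes.fromhex("62010000")))
```

<p>This matters when injecting the property as raw bytes: the bytes need to be in the order <code>ioreg</code> shows, not the order the value is written in the table.</p>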
<h2>Setting AAPL,ig-platform-id</h2>
<p>The goal is to inject a device property named <code>AAPL,ig-platform-id</code> with a 32-bit value that is one of the table entries (e.g., 01660000). There are many ways to inject a device property, and I know very few of them. The method I used was to modify the Chameleon bootloader&#8217;s GraphicsEnabler code to add the property whenever GraphicsEnabler=y (gma.c). Here are a few options I&#8217;ve heard of:</p>
<ul>
<li>Modify the bootloader&#8217;s GraphicsEnabler code. I modified Chimera 1.10 (rev 1999, <code>i386/libsaio/gma.c</code>) to add a new device property value.</li>
<li>Device property injection by the bootloader using &#8220;device-properties&#8221; in org.chameleon.Boot.plist. I don&#8217;t know how to generate such a string, but there are a few examples posted in <a href="http://www.insanelymac.com/forum/index.php?showtopic=280372">this thread</a>. Something about gfxutil.</li>
<li>Editing the DSDT to add the property for the graphics device.</li>
</ul>
<h3>Modifying the Chameleon Boot Loader</h3>
<p>I chose to modify the GraphicsEnabler code from the Chameleon/Chimera bootloader. The patches below can be applied to the source code from svn (http://forge.voodooprojects.org/svn/chameleon). I&#8217;ve also compiled patched versions of Chameleon (rev. 2012) and Chimera (rev. 1999) for convenience.</p>
<ul>
<li>Chameleon svn r2012
<ul>
<li><a href="/wp-content/uploads/2012/07/chameleon_r1956_hd4000.patch">Patch for gma.c</a></li>
<li>Patched <a href="/wp-content/uploads/2012/07/Chameleon-2.1svn-r2012.pkg">Chameleon r2012 [pkg]</a> <a href="/wp-content/uploads/2012/07/Chameleon-2.1svn-r2012.pkg.md5">[md5]</a></li>
</ul>
</li>
<li>Update July 20, 2012: Chimera 1.11
<ul>
<li><a href="http://tonymacx86.blogspot.ca/2012/07/chimera-111-update-ivy-bridge-hd-4000.html">Chimera 1.11</a> has been updated with something similar to what&#8217;s described here (AAPL,ig-platform-id set to 01660009 and device-id set to 0166).</li>
</ul>
</li>
<li>Chimera 1.10.0
<ul>
<li><a href="/wp-content/uploads/2012/07/chimera_r1999_hd4000.patch">Patch for gma.c</a></li>
<li>Patched <a href="/wp-content/uploads/2012/07/Chimera-1.10.0-r1999.pkg">Chimera 1.10.0 [pkg]</a> <a href="/wp-content/uploads/2012/07/Chimera-1.10.0-r1999.pkg.md5">[md5]</a></li>
</ul>
</li>
</ul>
<h3>Settings that don&#8217;t matter</h3>
<p>It appears that AAPL,ig-platform-id is the only critical setting to enable hardware acceleration. I&#8217;ve tried changing a few other things and none of them seem to matter.</p>
<p><span style="font-size: 8pt;"><b>device-id and revision</b></span>. The driver doesn&#8217;t care whether this is mobile (0166) or desktop (0162). I suspect it needs to be one of the two in order to load the kext, but it&#8217;s not important to match it with the high-order 16 bits of AAPL,ig-platform-id. I left mine unchanged at 0162:0009 using configuration 01660000.</p>
<p><span style="font-size: 8pt;"><b>GFX0 vs. IGPU in DSDT</b></span>. This doesn&#8217;t seem to matter either. Both work the same, so I left mine as GFX0 without issues.</p>
<p><span style="font-size: 8pt;"><b>SMBios version/productname</b></span>. I tried a few, and also tried without smbios.plist. The driver loaded fine in all cases. (Mac Pro, iMac, MacBook Pro were all ok)</p>
<h3>Settings that <i>do</i> matter</h3>
<p><span style="font-size: 8pt;"><b>AAPL,ig-platform-id</b></span>. Yes, it matters.</p>
<p><span style="font-size: 8pt;"><b>Graphics memory size in BIOS</b></span>. This must match the amount of graphics memory for the chosen configuration.</p>
<p><span style="font-size: 8pt;"><b>Connectors</b></span>. Not all output connectors are enabled. A garbled display could be a symptom that the connector in use is disabled. Changing the value of AAPL,ig-platform-id might change which connectors are enabled or disabled.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2012/07/intel-hd4000-qeci-acceleration/feed/</wfw:commentRss>
		<slash:comments>57</slash:comments>
		</item>
		<item>
		<title>Ivy Bridge Power Consumption</title>
		<link>http://blog.stuffedcow.net/2012/07/ivy-bridge-power-consumption/</link>
		<comments>http://blog.stuffedcow.net/2012/07/ivy-bridge-power-consumption/#comments</comments>
		<pubDate>Thu, 12 Jul 2012 23:01:57 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Measuring Stuff]]></category>
		<category><![CDATA[cpu]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=711</guid>
		<description><![CDATA[This is a preliminary attempt at characterizing the power consumption of Ivy Bridge at various clock frequencies and loads. I present plots of CPU power consumption (at the 12V connector) at varying frequency, voltage, and number of cores utilized, including the power impact of Hyper-Threading. <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2012/07/ivy-bridge-power-consumption/">Ivy Bridge Power Consumption</a></span>]]></description>
			<content:encoded><![CDATA[<p>This is a preliminary attempt at characterizing the power consumption of Ivy Bridge at various clock frequencies and loads.</p>
<h2>Test Setup</h2>
<table>
<tr>
<th>CPU</th>
<td>Core i7-3770K</td>
</tr>
<tr>
<th>Motherboard</th>
<td>Gigabyte GA-Z77M-D3H</td>
</tr>
<tr>
<th>Workload</th>
<td>Prime95 v.27.7, 64-bit Linux</td>
</tr>
</table>
<p>Current was measured by multimeter at the 12V connector. Power is calculated assuming voltage is 12V. Measuring power at the 12V CPU power connector isolates the power to the CPU, GFX, and power converter losses, without measuring the power consumed by the rest of the system. I don&#8217;t attempt to adjust for power conversion losses.</p>
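<p>The conversion from measured current to power is just a multiplication by the assumed rail voltage. A minimal sketch, where <code>SUPPLY_VOLTS</code> and the function names are chosen for this example and the 12 V figure is the assumption stated above rather than a measurement:</p>

```python
SUPPLY_VOLTS = 12.0  # assumed nominal rail voltage, not measured


def power_watts(current_amps, volts=SUPPLY_VOLTS):
    """Power drawn through the 12V connector for a measured current."""
    return current_amps * volts


def current_amps(power_w, volts=SUPPLY_VOLTS):
    """Inverse: the current that corresponds to a given power figure."""
    return power_w / volts


# The 4.4 W idle figure reported in the results corresponds to this current.
print(round(current_amps(4.4), 3), "A")
```

<p>If the real rail voltage differs from 12 V, every power figure is off by the same relative amount.</p>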
<p>CPU core voltage was measured by the sensor on the motherboard. The GA-Z77M-D3H does not have core voltage adjustments, so I&#8217;m limited to observing what Intel&#8217;s voltage algorithm decides (which is interesting too).</p>
<p>In addition to frequency and voltage, the number of active cores was also varied. This is accomplished by setting processor affinity to constrain the test workload to run on fewer cores. The data presented here involve running Prime95 on both thread contexts of an active core, except the series testing four active cores without Hyper-Threading. Leaving one of each pair of thread contexts idle measures no-Hyper-Threading power, and leaving both thread contexts idle results in an idle processor core.</p>
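<p>As an illustration of the affinity scheme, here is a sketch that picks which logical CPUs to use for a given number of active cores. It assumes that logical CPU <i>n</i> and <i>n</i>&nbsp;+&nbsp;4 are the two thread contexts of core <i>n</i>, which is a common Linux enumeration but not guaranteed (check <code>/sys/devices/system/cpu/cpu0/topology/thread_siblings_list</code>). The helper names are made up, and this is not necessarily the tool I used for the measurements.</p>

```python
import os

CORES = 4  # Core i7-3770K: 4 cores, 2 thread contexts each


def cpus_for(active_cores, hyper_threading=True, cores=CORES):
    """Logical CPUs to run on, ASSUMING CPU n and n + cores share core n."""
    cpus = set(range(active_cores))
    if hyper_threading:
        # Also use the second thread context of each active core.
        cpus |= {n + cores for n in range(active_cores)}
    return cpus


def pin_current_process(cpus):
    """Restrict the calling process (pid 0) to the given CPUs. Linux only."""
    os.sched_setaffinity(0, cpus)


print(sorted(cpus_for(2)))                         # 2 active cores, both contexts
print(sorted(cpus_for(4, hyper_threading=False)))  # 4 cores, one context each
```

<p>Child processes inherit the affinity mask, so pinning a launcher process before starting the workload is enough to constrain it.</p>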
<h3>Limitations</h3>
<p>The most obvious limitation is that I don&#8217;t have control over VCore, so I can&#8217;t characterize the power consumption of the processor at all operating points, nor can I produce a frequency-voltage shmoo plot.</p>
<p>Temperature affects power consumption (higher temperature causes higher leakage power). In these tests, I used Intel&#8217;s stock cooler with PWM set to default, so the core temperature isn&#8217;t constant. The stock cooler also limited the maximum frequency I could test without thermal throttling. I did let temperature stabilize before taking voltage and power measurements. I haven&#8217;t characterized the dependence of CPU power on core temperature (at the same voltage and frequency).</p>
<h2>Results</h2>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_power.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_power.png" alt="" title="Ivy Bridge Power Consumption" width="769" height="577" class="alignnone size-full wp-image-721" /></a></p>
<h3>Power vs. Frequency</h3>
<p>Intel&#8217;s voltage control algorithm seems fairly simple: </p>
<ul>
<li>Below 3 GHz: 0.96 V</li>
<li>3 GHz to 4 GHz: Linear increase</li>
<li>Above 4 GHz: 1.20 V</li>
</ul>
<p>The voltage is slightly higher when more cores are utilized. In the scatterplot of VCore points, the highest voltage for each frequency is for 4 loaded cores, while the bottom voltage is for 1 loaded core (3 idle cores).</p>
<p>This voltage control algorithm is interesting. Below 3 GHz, VCore is set at 0.96 V, suggesting that the processor isn&#8217;t able to run reliably below that voltage (with some safety margin). This is in contrast to Intel&#8217;s <a href="http://download.intel.com/newsroom/kits/22nm/pdfs/22nm-Details_Presentation.pdf">PR campaign</a> for their 22&nbsp;nm process which touted leakage reductions at 0.7 V (-37%) through 1.0 V (-18%) compared to their 32&nbsp;nm process. Ivy Bridge&#8217;s actual operating voltage range (0.96 &#8211; 1.2 V) is almost entirely outside (and above) the range of voltages presented in Intel&#8217;s data. Suspicious.</p>
<p>A second interesting feature is that the voltage stops increasing at 4 GHz, slightly above the highest turbo frequency bin. Clearly, Intel doesn&#8217;t cater to people who overclock beyond the highest turbo bin. Ideally, the voltage-frequency curve should approximate the shmoo plot, which typically shows the required voltage increasing linearly with frequency. This stepped curve is sub-optimal when overclocking with a voltage offset (which shifts the whole curve up or down); a more desirable curve would extend the voltage (super?-)linearly beyond 4 GHz. Of course, using a fixed voltage (flat curve) is even less optimal.</p>
<p>Below 3 GHz, where VCore is roughly constant, power consumption increases fairly linearly with frequency, as expected. Between 3 and 4 GHz, power increases roughly cubically with frequency because the voltage is also increasing linearly, and the power increase returns to ~linear past 4 GHz, once the voltage stops increasing.</p>
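<p>The shape of the power curve follows from the usual approximation that dynamic power is proportional to V<sup>2</sup>f. The sketch below models the three voltage regions listed earlier and computes relative dynamic power. It is a rough model, not a fit to my data: it ignores leakage, temperature, and the small voltage differences between core counts, and the function names are invented for this example.</p>

```python
def vcore(f_ghz):
    """Approximate VCore: 0.96 V up to 3 GHz, linear ramp, 1.20 V from 4 GHz."""
    ramp = min(max(f_ghz - 3.0, 0.0), 1.0)  # 0 at or below 3 GHz, 1 at or above 4 GHz
    return 0.96 + ramp * (1.20 - 0.96)


def relative_dynamic_power(f_ghz, ref_ghz=3.0):
    """Dynamic power (proportional to V^2 * f) relative to the power at ref_ghz."""
    def v2f(f):
        return vcore(f) ** 2 * f
    return v2f(f_ghz) / v2f(ref_ghz)


for f in (2.0, 3.0, 3.5, 4.0, 4.5):
    print(f, "GHz", round(vcore(f), 2), "V", round(relative_dynamic_power(f), 2))
```

<p>In this model, going from 3 GHz to 4 GHz costs about 2.1x the dynamic power for 1.33x the frequency, while outside that range power scales in proportion to frequency.</p>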
<p>Idle power is an impressive 4.4 W (including voltage converter losses).</p>
<h3>Hyper-Threading Power</h3>
<p>One of the data series in the plot involves loading all 4 cores with one thread each. This measures the power consumption impact of Hyper-Threading. Comparing 4 active cores with one thread vs. two threads each, Hyper-Threading consumes an extra 8% of power. However, at an average of <a href="/2012/07/ivy-bridge-benchmarks/#hyperthreading">22% performance improvement</a>, Hyper-Threading is a power-efficient method of improving performance, provided the workload has enough threads to use it. </p>
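<p>To put a number on &#8220;power-efficient&#8221;: the gain in performance per watt is the ratio of the two factors, not their difference. A quick sketch using the 22% and 8% figures above (the function name is made up for this example):</p>

```python
def perf_per_watt_gain(perf_gain, power_gain):
    """Relative change in performance per watt, with gains given as fractions."""
    return (1.0 + perf_gain) / (1.0 + power_gain) - 1.0


# 22% more throughput for 8% more power.
print(round(perf_per_watt_gain(0.22, 0.08) * 100, 1), "% better performance per watt")
```

<p>That works out to roughly 13% better performance per watt with Hyper-Threading on these workloads.</p>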
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2012/07/ivy-bridge-power-consumption/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Ivy Bridge Benchmarks</title>
		<link>http://blog.stuffedcow.net/2012/07/ivy-bridge-benchmarks/</link>
		<comments>http://blog.stuffedcow.net/2012/07/ivy-bridge-benchmarks/#comments</comments>
		<pubDate>Thu, 12 Jul 2012 07:16:56 +0000</pubDate>
		<dc:creator>Henry</dc:creator>
				<category><![CDATA[Measuring Stuff]]></category>
		<category><![CDATA[cpu]]></category>

		<guid isPermaLink="false">http://blog.stuffedcow.net/?p=660</guid>
		<description><![CDATA[So I got myself a new Core i7-3770K, using the stock heatsink/fan, and a motherboard that doesn't have VCore adjustments. I re-ran a bunch of benchmarks used in my earlier posts to measure Ivy Bridge's performance, and Hyper-threading scaling, in comparison to earlier processors. The workloads were used in my earlier tests: <ul> <li><a href="/2010/11/fpga-cad-benchmarks/">Core2, Nehalem, FPGA CAD</a></li> <li><a href="/2011/08/hyperthreading-performance/">Hyper-Threading Performance</a></li> </ul> <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.stuffedcow.net/2012/07/ivy-bridge-benchmarks/">Ivy Bridge Benchmarks</a></span>]]></description>
			<content:encoded><![CDATA[<p>So I got myself a new Core i7-3770K, using the stock heatsink/fan, and a motherboard that doesn&#8217;t have VCore adjustments. Therefore, not much overclocking, unfortunately.</p>
<p>I will run the workloads that I ran on various older systems and compare them with the new processor. Since I don&#8217;t have a Sandy Bridge processor, the comparisons will be against the previous microarchitecture (Lynnfield/Gulftown). I expect Sandy Bridge would be very similar to (but slightly slower than) Ivy Bridge.</p>
<p><b>See also:</b></p>
<ul>
<li><a href="/2010/11/fpga-cad-benchmarks/">Core2, Nehalem, FPGA CAD</a></li>
<li><a href="/2011/08/hyperthreading-performance/">Hyper-Threading Performance</a></li>
</ul>
<h1>FPGA CAD</h1>
<p>The first set of benchmarks is an extension of the earlier <a href="/2010/11/fpga-cad-benchmarks/">FPGA CAD benchmarks</a>. Most of the results have already been presented there, but will be copied here for convenience.</p>
<h2>Hardware</h2>
<table>
<tr>
<th>System
<th>CPU
<th>Memory</tr>
<tr>
<td>Pentium 4 2800
<td>130 nm Pentium 4 2.8 GHz (Northwood)
<td>2-channel DDR-400, Intel 875P</tr>
<tr>
<td>Xeon 3000
<td>65 nm (Core 2) Xeon 5160 x 2 (Woodcrest)
<td>4-channel DDR2 FB-DIMM, Intel 5000X</tr>
<tr>
<td>C2Q 2660
<td>65 nm Core 2 Quad Q6700 (Kentsfield)
<td>2-channel DDR2, Intel Q35</tr>
<tr>
<td>C2Q 3500
<td>45 nm Core 2 Quad Q9550 (Yorkfield)
<td>2-channel DDR2-824 4-4-4-12, Intel P965</tr>
<tr>
<td>i7 3300
<td>45 nm Core i7-860 (Lynnfield)
<td>2-channel DDR3-1580 9-9-9-24, Intel P55</tr>
<tr>
<td>i7 4215
<td>32 nm Core i7-980X (Gulftown)
<td>3-channel DDR3-1690</tr>
<tr>
<td>IVB 4300
<td>22 nm Core i7-3770K (Ivy Bridge)
<td>2-channel DDR3-1600 9-9-8-24-1T, Intel Z77</tr>
</table>
<h2>Workloads</h2>
<table>
<tr>
<th>Test
<th>Description</tr>
<tr>
<td>Memory Latency 128M
<td>Read latency while randomly accessing a 128 MB array, using 4 KB pages. Includes the impact of TLB misses.</tr>
<tr>
<td>Memory Bandwidth
<td><a href="http://www.streambench.org/">STREAM</a> benchmark, copy bandwidth. Compiled with 64-bit gcc 4.4.3</tr>
<tr>
<td>Quartus 32-bit
<td>Quartus 10.0 SP1 32-bit on 64-bit Linux, doing a full compile of OpenSPARC T1 for Stratix III (87 KALUTs utilization). Quartus tests are single-threaded with parallel compile disabled.</tr>
<tr>
<td>Quartus 64-bit
<td>Quartus 10.0 SP1 64-bit on 64-bit Linux, same as above.</tr>
<tr>
<td>VPR 5.0 64-bit
<td>Modified VPR 5.0 compiled with gcc 4.4.3, compiling a ~9000-block circuit (mkDelayWorker32B.mem_size14.blif)</tr>
</table>
<h2>Results</h2>
<h3>Memory Latency and Bandwidth</h3>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_memory.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_memory.png" alt="" title="Memory Latency and Bandwidth" width="625" height="403" class="alignnone size-full wp-image-666" /></a></p>
<p>The memory latency on Ivy Bridge is essentially unchanged from the previous microarchitecture (Core i7 Lynnfield/Gulftown), but bandwidth has increased significantly, despite using the same memory (DDR3 ~1600).</p>
<h3>FPGA CAD</h3>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_vpr.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_vpr-360x221.png" alt="" title="VPR 5.0" width="360" height="221" class="alignnone size-medium wp-image-669" /></a><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_vpr_cpi.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_vpr_cpi-360x221.png" alt="" title="VPR 5.0 CPI" width="360" height="221" class="alignnone size-medium wp-image-670" /></a><br />
<a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_q32.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_q32-360x221.png" alt="" title="Quartus 32-bit" width="360" height="221" class="alignnone size-medium wp-image-667" /></a><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_q32_cpi.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_q32_cpi-360x221.png" alt="" title="Quartus 32-bit CPI" width="360" height="221" class="alignnone size-medium wp-image-672" /></a><br />
<a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_q64.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_q64-360x221.png" alt="" title="Quartus 64-bit" width="360" height="221" class="alignnone size-medium wp-image-668" /></a><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_q64_cpi.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_q64_cpi-360x221.png" alt="" title="Quartus 64-bit CPI" width="360" height="221" class="alignnone size-medium wp-image-671" /></a></p>
<p>All around per-clock performance improvements of nearly 15% in Ivy Bridge. Stock clock speeds are up about 15% vs. the top-binned Lynnfield (Core i7-880) too. It&#8217;s strange how VPR seems to behave differently from Quartus. Quartus placement improves more than clustering and routing over several processor generations, but VPR placement improves less.</p>
<h3>Other Benchmarks</h3>
<p>These tests were run on the same systems as before, but clock speeds are lower. These are the same benchmarks used in the next section on simultaneous multithreading.</p>
<table>
<tr>
<th>System
<th>CPU</tr>
<tr>
<td>C2Q 2833
<td>45 nm Core 2 Quad Q9550 (Yorkfield)</tr>
<tr>
<td>i7 2800
<td>45 nm Core i7-860 (Lynnfield)</tr>
<tr>
<td>IVB 3900
<td>22 nm Core i7-3770K (Ivy Bridge)</tr>
</table>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_runtime.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_runtime-360x239.png" alt="" title="Runtime" width="360" height="239" class="alignnone size-medium wp-image-673" /></a><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_cpi.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_cpi-360x239.png" alt="" title="Per-clock Performance" width="360" height="239" class="alignnone size-medium wp-image-674" /></a></p>
<p>The per-clock performance of Ivy Bridge is 20% better than both Lynnfield and Yorkfield. Surprisingly, Lynnfield doesn&#8217;t seem much better than its preceding generation Yorkfield (2% CPI) on these workloads.</p>
<p>The runtime comparisons include the impact of the processors&#8217; clock speeds. The Core 2 (Yorkfield) and Core i7 (Lynnfield) systems are at stock speed, while Ivy Bridge is overclocked by 11%. The graph shows Ivy Bridge a good 60% faster than both earlier chips, which would still be near 50% if Ivy Bridge were also at its stock speed. At stock clock speeds, roughly 20% comes from microarchitectural improvements and 25% from increased clock speeds; the two gains multiply (1.20 &times; 1.25 = 1.50).</p>
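<p>The per-clock and clock-speed gains combine multiplicatively. A small sketch of the arithmetic against Lynnfield, using the clock speeds from the table (2800 and 3900 MHz) and assuming 3500 MHz as the stock Ivy Bridge clock, which is what an 11% overclock to 3900 MHz implies. The function name is made up for this example.</p>

```python
def combined_speedup(per_clock_gain, new_mhz, old_mhz):
    """Overall speedup from a per-clock gain (as a fraction) and a clock ratio."""
    return (1.0 + per_clock_gain) * (new_mhz / old_mhz)


# Ivy Bridge vs. Lynnfield at 2800 MHz, with a 20% per-clock advantage.
stock = combined_speedup(0.20, 3500, 2800)        # assumed stock clock
overclocked = combined_speedup(0.20, 3900, 2800)  # as tested

print(round(stock, 2), round(overclocked, 2))
```

<p>The stock case gives 1.5x, and the overclocked case lands somewhat above the &#8220;good 60%&#8221; read off the graph, which is reasonable given that the 20% per-clock figure is itself an approximation.</p>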
<p><a name="hyperthreading"></a></p>
<h1>Hyper-Threading Performance</h1>
<p>This section repeats the tests done earlier for Lynnfield&#8217;s Hyper-Threading: <a href="/2011/08/hyperthreading-performance/">Hyper-Threading Performance</a>. The Lynnfield results are taken from the measurements made for the earlier tests.</p>
<h2>Hardware</h2>
<table>
<tr>
<th>System
<th>CPU
<th>Memory</tr>
<tr>
<td>i7 3300
<td>45 nm Core i7-860 (Lynnfield)
<td>2-channel DDR3-1580 9-9-9-24, Intel P55</tr>
<tr>
<td>IVB 3900
<td>22 nm Core i7-3770K (Ivy Bridge)
<td>2-channel DDR3-1600 9-9-8-24-1T, Intel Z77</tr>
</table>
<h2>Workloads</h2>
<table>
<tr>
<th>Workload
<th>Description</tr>
<tr>
<td>Dhrystone
<td>Version 2.1. A synthetic integer benchmark. Compiled with Intel C Compiler 11.1</tr>
<tr>
<td><a href="http://www.coremark.org/">CoreMark</a>
<td>Version 1.0. Another integer CPU core benchmark, intended as a replacement for Dhrystone. Compiled with Intel C Compiler 12.0.3</tr>
<tr>
<td>Kernel Compile
<td>Compile kernel-tmb-2.6.34.8 using GCC 4.4.3/4.6.3</tr>
<tr>
<td><a href="http://www.eecg.utoronto.ca/vpr/">VPR</a>
<td>Academic FPGA packing, placement, and routing tool from the University of Toronto. Modified version 5.0. Intel C Compiler 11.1</tr>
<tr>
<td><a href="http://www.altera.com/products/software/sfw-index.jsp">Quartus</a>
<td>Commercial FPGA design software for Altera FPGAs. Compile a 6,000-LUT circuit for the Stratix III FPGA. Includes logic synthesis and optimization (quartus_map), packing, placement, and routing (quartus_fit), and timing analysis (quartus_sta). Version 10.0, 64-bit.</tr>
<tr>
<td><a href="http://bochs.sourceforge.net/">Bochs</a>
<td>Instruction set (functional) simulator of an x86 PC system. This benchmark runs the first ~4 billion timesteps of a simulation. Modified version 2.4.6. GCC 4.4.3</tr>
<tr>
<td><a href="http://www.simplescalar.com">SimpleScalar</a>
<td>Processor microarchitecture simulator. This test runs sim-outorder (a cycle-accurate simulation of a dynamically-scheduled RISC processor), simulating 100M instructions. Version 3.0. Compiled with GCC 4.4.3</tr>
<tr>
<td><a href="http://www.gpgpu-sim.org/">GPGPU-Sim</a>
<td>Cycle-level simulator of contemporary GPU microarchitectures running CUDA and OpenCL workloads. Version 3.0.9924.</tr>
</table>
<h2>Throughput Scaling with Multiple Threads</h2>
<p>With the exception of the kernel compile workload, all of these tests start multiple instances of the same task and measure the total throughput of the processor (number of tasks / average runtime per task).</p>
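<p>For concreteness, here is the throughput metric as a short sketch. The runtimes below are hypothetical values picked only to show the calculation; they are not measurements from these tests.</p>

```python
def throughput(runtimes_s):
    """Tasks per second: number of tasks / average runtime per task."""
    average = sum(runtimes_s) / len(runtimes_s)
    return len(runtimes_s) / average


# Hypothetical runtimes in seconds, for illustration only (not measured data).
one_instance = throughput([100.0])
four_instances = throughput([104.0, 105.0, 103.0, 104.0])

# Scaling is the ratio of the two throughputs.
print(round(four_instances / one_instance, 2))
```

<p>Running four instances that each take slightly longer than a single instance still gives nearly 4x the throughput, which is the near-linear region of the scaling plot.</p>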
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_htscaling.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_htscaling.png" alt="" title="Ivy Bridge Multiprocess Scaling" width="577" height="385" class="aligncenter size-full wp-image-675" /></a></p>
<p>As expected, total throughput increases nearly linearly with the number of cores used up to 4 (cores are relatively independent). Throughput increases slowly between 4 and 8 thread contexts (Hyper-Threading thread contexts are not equivalent to full processors), and is roughly flat beyond 8 thread contexts (time-slicing by the OS does not improve throughput).</p>
<h2>Hyper-Threading Throughput Scaling</h2>
<p><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_htspeedup.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_htspeedup-360x239.png" alt="" title="Hyper-Threading Speedup" width="360" height="239" class="alignnone size-medium wp-image-676" /></a><a href="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_corescaling.png"><img src="http://blog.stuffedcow.net/wp-content/uploads/2012/07/ivb_corescaling-360x239.png" alt="" title="Multicore Scaling" width="360" height="239" class="alignnone size-medium wp-image-677" /></a></p>
<p>The first chart focuses on comparing the throughput at 8 threads vs. 4 threads for the different workloads. The median geometric mean improvement for HT is 23%. The pathological Dhrystone workload has improved: Although Dhrystone still does not benefit from Hyper-threading, it is no longer slower. It seems like Ivy Bridge gains slightly less from Hyper-threading than Lynnfield. This is not necessarily a bad thing: It could be a symptom that Ivy Bridge is doing a better job utilizing the pipeline with just one thread, reducing the performance gain available for two threads.</p>
<p>The second chart compares the throughput at 4 threads vs. 1 thread. Ivy Bridge seems to be noticeably worse at this than Lynnfield. </p>
<p>There seems to be no correlation between workloads that scale well on real cores and those that scale well under Hyper-threading. The correlation between the two microarchitectures is higher: Workloads that scale well on Lynnfield tend to also scale well on Ivy Bridge, and vice versa. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stuffedcow.net/2012/07/ivy-bridge-benchmarks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
