Ivy Bridge Benchmarks

So I got myself a new Core i7-3770K, using the stock heatsink/fan, and a motherboard that doesn’t have VCore adjustments. Therefore, not much overclocking, unfortunately.

I will run the workloads that I ran on various older systems and compare them with the new processor. Since I don’t have a Sandy Bridge processor, the comparisons will be against the previous microarchitecture (Lynnfield/Gulftown). I expect Sandy Bridge would be very similar to (but slightly slower than) Ivy Bridge.

FPGA CAD

The first set of benchmarks is an extension of the earlier FPGA CAD benchmarks. Most of the results were already presented there, but are copied here for convenience.

Hardware

| System | CPU | Memory |
|---|---|---|
| Pentium 4 2800 | 130 nm Pentium 4 2.8 GHz (Northwood) | 2-channel DDR-400, Intel 875P |
| Xeon 3000 | 65 nm (Core 2) Xeon 5160 x 2 (Woodcrest) | 4-channel DDR2 FB-DIMM, Intel 5000X |
| C2Q 2660 | 65 nm Core 2 Quad Q6700 (Kentsfield) | 2-channel DDR2, Intel Q35 |
| C2Q 3500 | 45 nm Core 2 Quad Q9550 (Yorkfield) | 2-channel DDR2-824 4-4-4-12, Intel P965 |
| i7 3300 | 45 nm Core i7-860 (Lynnfield) | 2-channel DDR3-1580 9-9-9-24, Intel P55 |
| i7 4215 | 32 nm Core i7-980X (Gulftown) | 3-channel DDR3-1690 |
| IVB 4300 | 22 nm Core i7-3770K (Ivy Bridge) | 2-channel DDR3-1600 9-9-8-24-1T, Intel Z77 |

Workloads

| Test | Description |
|---|---|
| Memory Latency 128M | Read latency while randomly accessing a 128 MB array, using 4 KB pages. Includes the impact of TLB misses. |
| Memory Bandwidth | STREAM benchmark, copy bandwidth. Compiled with 64-bit gcc 4.4.3. |
| Quartus 32-bit | Quartus 10.0 SP1 32-bit on 64-bit Linux, doing a full compile of OpenSPARC T1 for Stratix III (87K ALUTs utilization). Quartus tests are single-threaded with parallel compile disabled. |
| Quartus 64-bit | Quartus 10.0 SP1 64-bit on 64-bit Linux, same as above. |
| VPR 5.0 64-bit | Modified VPR 5.0 compiled with gcc 4.4.3, compiling a ~9000-block circuit (mkDelayWorker32B.mem_size14.blif). |
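For reference, here is a minimal sketch of how the two memory microbenchmarks work. This is a hypothetical reimplementation, not the actual code used for these results: the latency test is assumed to use the standard pointer-chasing approach (chase pointers through a randomly permuted 128 MB array so every load depends on the previous one), and the bandwidth test is a STREAM-style copy loop.

```c
/* membench.c (hypothetical name): sketch of the latency and copy-bandwidth
 * tests. Build: gcc -O2 membench.c -o membench
 * Assumes a 64-bit build (sizeof(void *) == 8), matching the systems above. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES (128UL << 20)             /* 128 MB */
#define N  (ARRAY_BYTES / sizeof(void *))     /* elements in the pointer chain */
#define ND (ARRAY_BYTES / sizeof(double))     /* elements in the copy arrays */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t i, j;

    /* Build one big cycle through the array in random order (Fisher-Yates
       shuffle of the visit order), so hardware prefetchers can't help. */
    void **chain = malloc(ARRAY_BYTES);
    size_t *perm = malloc(N * sizeof(size_t));
    for (i = 0; i < N; i++) perm[i] = i;
    for (i = N - 1; i > 0; i--) {
        j = rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (i = 0; i + 1 < N; i++) chain[perm[i]] = &chain[perm[i + 1]];
    chain[perm[N - 1]] = &chain[perm[0]];     /* close the cycle */

    /* Latency: serially dependent loads; report average time per load.
       Printing p keeps the compiler from discarding the loop. */
    void **p = &chain[0];
    double t0 = seconds();
    for (i = 0; i < N; i++) p = (void **)*p;
    double t1 = seconds();
    printf("latency: %.1f ns/load (%p)\n", (t1 - t0) / N * 1e9, (void *)p);

    /* Bandwidth: STREAM-style copy. Touch both arrays first so page faults
       are not timed; count bytes read plus bytes written. */
    double *a = malloc(ARRAY_BYTES), *b = malloc(ARRAY_BYTES);
    for (i = 0; i < ND; i++) { a[i] = 1.0; b[i] = 0.0; }
    t0 = seconds();
    for (i = 0; i < ND; i++) b[i] = a[i];
    t1 = seconds();
    printf("copy: %.2f GB/s (%f)\n",
           2.0 * ARRAY_BYTES / (t1 - t0) / 1e9, b[ND - 1]);

    free(chain); free(perm); free(a); free(b);
    return 0;
}
```

The random visit order is what drags TLB misses into the latency measurement: with 4 KB pages, nearly every access in a 128 MB array lands on a different page.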

Results

Memory Latency and Bandwidth

The memory latency on Ivy Bridge is essentially unchanged from the previous microarchitecture (Core i7 Lynnfield/Gulftown), but bandwidth has increased significantly, despite using the same memory (DDR3 ~1600).

FPGA CAD



Ivy Bridge shows all-around per-clock performance improvements of nearly 15%. Stock clock speeds are also up about 15% vs. the top-binned Lynnfield (Core i7-880). It’s strange how VPR behaves differently from Quartus: over several processor generations, Quartus placement improves more than clustering and routing, while VPR placement improves less.

Other Benchmarks

These tests were run on the same systems as before, but clock speeds are lower. These are the same benchmarks used in the next section on simultaneous multithreading.

| System | CPU |
|---|---|
| C2Q 2833 | 45 nm Core 2 Quad Q9550 (Yorkfield) |
| i7 2800 | 45 nm Core i7-860 (Lynnfield) |
| IVB 3900 | 22 nm Core i7-3770K (Ivy Bridge) |

The per-clock performance of Ivy Bridge is 20% better than both Lynnfield and Yorkfield. Surprisingly, Lynnfield doesn’t seem much better than the preceding-generation Yorkfield (about 2% by CPI) on these workloads.

The runtime comparisons include the impact of the processors’ clock speeds. The Core 2 (Yorkfield) and Core i7 (Lynnfield) systems are at stock speed, while Ivy Bridge is overclocked by 11%. The graph shows Ivy Bridge a good 60% faster than both earlier chips, which would still be near 50% if Ivy Bridge were also at stock speed. That stock-speed gain breaks down into 20% from microarchitectural improvements and 25% from increased clock speed (1.20 × 1.25 ≈ 1.50).

Hyper-Threading Performance

This section repeats the tests done earlier for Lynnfield’s Hyper-Threading: Hyper-Threading Performance. The Lynnfield results are taken from the measurements made for the earlier tests.

Hardware

| System | CPU | Memory |
|---|---|---|
| i7 3300 | 45 nm Core i7-860 (Lynnfield) | 2-channel DDR3-1580 9-9-9-24, Intel P55 |
| IVB 3900 | 22 nm Core i7-3770K (Ivy Bridge) | 2-channel DDR3-1600 9-9-8-24-1T, Intel Z77 |

Workloads

| Workload | Description |
|---|---|
| Dhrystone | Version 2.1. A synthetic integer benchmark. Compiled with Intel C Compiler 11.1. |
| CoreMark | Version 1.0. Another integer CPU core benchmark, intended as a replacement for Dhrystone. Compiled with Intel C Compiler 12.0.3. |
| Kernel Compile | Compile kernel-tmb-2.6.34.8 using GCC 4.4.3/4.6.3. |
| VPR | Academic FPGA packing, placement, and routing tool from the University of Toronto. Modified version 5.0. Intel C Compiler 11.1. |
| Quartus | Commercial FPGA design software for Altera FPGAs. Compiles a 6,000-LUT circuit for the Stratix III FPGA. Includes logic synthesis and optimization (quartus_map), packing, placement, and routing (quartus_fit), and timing analysis (quartus_sta). Version 10.0, 64-bit. |
| Bochs | Instruction set (functional) simulator of an x86 PC system. This benchmark runs the first ~4 billion timesteps of a simulation. Modified version 2.4.6. GCC 4.4.3. |
| SimpleScalar | Processor microarchitecture simulator. This test runs sim-outorder (a cycle-accurate simulation of a dynamically-scheduled RISC processor), simulating 100M instructions. Version 3.0. Compiled with GCC 4.4.3. |
| GPGPU-Sim | Cycle-level simulator of contemporary GPU microarchitectures running CUDA and OpenCL workloads. Version 3.0.9924. |

Throughput Scaling with Multiple Threads

With the exception of the kernel compile workload, all of these tests start multiple instances of the same task and measure the total throughput of the processor (number of tasks divided by the average runtime per task). A sketch of this methodology is shown below.
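Here is a minimal sketch of that methodology (a hypothetical harness, not the scripts actually used): fork N copies of the same command, wait for them all, and report N divided by the elapsed wall time, which approximates tasks per average runtime when all instances finish at about the same time.

```c
/* throughput.c (hypothetical name): run N instances of a command and
 * report aggregate throughput. Build: gcc -O2 throughput.c -o throughput
 * Usage: ./throughput 8 ./some_benchmark args... */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <instances> <command> [args...]\n", argv[0]);
        return 1;
    }
    int n = atoi(argv[1]);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Start n identical instances of the workload. */
    for (int i = 0; i < n; i++) {
        if (fork() == 0) {
            execvp(argv[2], &argv[2]);   /* child: run the benchmark */
            perror("execvp");
            _exit(127);
        }
    }
    /* Wait for every instance to finish. */
    for (int i = 0; i < n; i++)
        wait(NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    /* Throughput = tasks completed per unit time. */
    printf("%d tasks in %.2f s -> %.3f tasks/s\n", n, elapsed, n / elapsed);
    return 0;
}
```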

As expected, total throughput increases nearly linearly with the number of cores used up to 4 (cores are relatively independent), increases slowly between 4 and 8 thread contexts (Hyper-Threading thread contexts are not equivalent to full processors), and is roughly flat beyond 8 thread contexts (time-slicing by the OS does not improve throughput).

Hyper-Threading Throughput Scaling

The first chart focuses on comparing the throughput at 8 threads vs. 4 threads for the different workloads. The geometric mean improvement from HT is 23%. The pathological Dhrystone workload has improved: although Dhrystone still does not benefit from Hyper-Threading, it is no longer slower. It seems like Ivy Bridge gains slightly less from Hyper-Threading than Lynnfield. This is not necessarily a bad thing: it could be a symptom that Ivy Bridge is doing a better job of utilizing the pipeline with just one thread, reducing the performance gain available for two threads.

The second chart compares the throughput at 4 threads vs. 1 thread. Ivy Bridge seems to be noticeably worse at this than Lynnfield.

There seems to be no correlation between workloads that scale well on real cores and those that scale well under Hyper-threading. The correlation between the two microarchitectures is higher: Workloads that scale well on Lynnfield tend to also scale well on Ivy Bridge, and vice versa.
