See also: Ivy Bridge Benchmarks
Here are some FPGA CAD benchmarks across a few relatively modern machines. The original motivation was to figure out why VPR ran much slower on a Core 2 Xeon 5160 system than on a desktop-class Core 2 Quad Q9550. A secondary goal was to measure the Core i7-980X @ 4215 MHz. I added some Pentium 4 results for fun, to show how far processor microarchitecture has progressed since then.
The test workloads and systems are listed in the tables, followed by results. All runtime results are normalized to the Core i7-980X @ 4215 MHz.
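As a minimal sketch of the normalization described above: each runtime is divided by the runtime of the reference system, so the Core i7-980X always reads 1.0 and larger values mean slower. The function name `normalize` and the runtimes in `example` are hypothetical placeholders, not measurements from this page.

```python
REFERENCE = "i7 4215"  # system label used in the tables on this page


def normalize(runtimes_s, reference=REFERENCE):
    """Divide every runtime by the reference system's runtime."""
    ref = runtimes_s[reference]
    return {name: t / ref for name, t in runtimes_s.items()}


# Placeholder runtimes in seconds, chosen only to illustrate the arithmetic.
example = {"i7 4215": 100.0, "i7 3300 (box2)": 130.0, "Xeon 3000": 230.0}
normalized = normalize(example)

for name, value in normalized.items():
    print(f"{name}: {value:.2f}x")
```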
|Test||Description|
|Memory Latency 128M||Read latency while randomly accessing a 128 MB array, using 4 KB pages|
|Memory Bandwidth||STREAM benchmark, copy bandwidth. Compiled with 64-bit gcc 4.4.3|
|Quartus 32-bit||Quartus 10.0 SP1 32-bit on 64-bit Linux, doing a full compile of OpenSPARC T1 for Stratix III (87 KALUTs utilization). Quartus tests are single-threaded with parallel compile disabled.|
|Quartus 64-bit||Quartus 10.0 SP1 64-bit on 64-bit Linux, same as above.|
|VPR 5.0 64-bit||Modified VPR 5.0 compiled with gcc 4.4.3, compiling a ~9000-block circuit (mkDelayWorker32B.mem_size14.blif)|
|System||Processor||Memory and chipset|
|Pentium 4 2800||130 nm Pentium 4 Northwood 2.8 GHz||2-channel DDR-400 on Intel 875P chipset|
|Xeon 3000||65 nm (Core 2) Xeon 5160 x 2-CPU||4-channel DDR2 FB-DIMM on Intel 5000X chipset|
|C2Q 2660||65 nm Core 2 Quad Q6700||2-channel 4 GB DDR2 on Intel Q35 chipset|
|C2Q 3500 (box)||45 nm Core 2 Quad Q9550 at 3.5 GHz||2-channel 4 GB DDR2-824 4-4-4-12 on Intel P965 chipset|
|i7 3300 (box2)||45 nm Core i7-860 @ 3.3 GHz||2-channel 8 GB DDR3-1580 9-9-9-24|
|i7 4215||32 nm Core i7-980X @ 4.215 GHz||3-channel 6 GB DDR3-1690|
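The Memory Latency 128M test reads from random locations in a large array. One common way to build such a test (an assumption here, since the actual benchmark code is not shown) is pointer chasing: the array holds a random permutation with a single cycle, and each load's address depends on the previous load, so the hardware cannot overlap or prefetch the accesses. The sketch below shows only the access pattern. In Python the interpreter overhead swamps memory latency, so a real measurement would use the same structure in C with an array of the full 128 MB. The names `build_chain` and `chase` are hypothetical.

```python
import random


def build_chain(n, seed=0):
    """Return a random permutation of range(n) forming one cycle of length n.

    Uses Sattolo's algorithm, so following nxt[i] repeatedly visits every
    slot exactly once before returning to the start.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the final index."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]  # the next address depends on this load's result
    return idx


# Small size for illustration; the test on this page used a 128 MB array.
chain = build_chain(1000)
end = chase(chain, len(chain))
```

Timing the `chase` loop and dividing by the number of steps gives the average latency per dependent load in a compiled implementation.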
Memory Latency and Bandwidth
Nehalem’s memory system is significantly better than Core 2’s, with lower latency and STREAM copy bandwidth more than 2.5x higher than on the Core 2 systems. The Core i7-980X system has 18% higher bandwidth than the i7-860, due to a combination of 3 DDR3 channels vs. 2, higher-clocked memory (DDR3-1690 vs. DDR3-1580), and higher CPU clock speed. The bandwidth difference narrows to 12% between those two systems at the same CPU clock frequency (not shown in charts). Interestingly, the Xeon FB-DIMM system has poor memory performance compared to the desktop-class Core 2 systems.
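For context on the 18% figure, here is a rough calculation of theoretical peak memory bandwidth for the two Nehalem systems. It assumes "DDR3-1690" and "DDR3-1580" mean 1690 and 1580 million transfers per second per channel, and that each DDR3 channel is 64 bits (8 bytes) wide. The function name is hypothetical.

```python
def peak_bandwidth_mb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak bandwidth: channels x transfer rate x bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes


i7_980x_peak = peak_bandwidth_mb_s(3, 1690)  # 3-channel DDR3-1690
i7_860_peak = peak_bandwidth_mb_s(2, 1580)   # 2-channel DDR3-1580
peak_ratio = i7_980x_peak / i7_860_peak

print(f"i7-980X peak: {i7_980x_peak} MB/s")
print(f"i7-860 peak:  {i7_860_peak} MB/s")
print(f"ratio: {peak_ratio:.2f}")
```

Under these assumptions the i7-980X has about 60% more theoretical peak bandwidth, well above the measured 18% gap, which suggests the STREAM copy result here is not limited by raw channel bandwidth alone.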
32-bit Quartus compile time for OpenSPARC T1, normalized to the Core i7-980X (lower is better).
Same test with 64-bit Quartus. The speed difference between the Core 2 and Nehalem systems widens slightly, especially for placement, probably because 64-bit pointers increase the memory working set and the Core 2 memory systems are slower.
In VPR, roughly 2/3 of the runtime is spent in routing and 1/3 in packing (clustering), with placement taking ~1.6% of runtime.
The two slowest Core 2 systems show very poor routing performance, 2.3x slower than the overclocked 980X. I speculate this has something to do with poor memory performance, but the impact on routing appears even worse than the low-level benchmarks would suggest. Also odd is that placement is faster per clock on the Core 2 microarchitecture.
Clock Speed-Normalized Results
The Core i7-980X at 4.21 GHz scales roughly linearly with clock speed compared to the Core i7-860 at 3.3 GHz. Normally clock speed scaling is slightly below linear because memory and I/O speeds don’t increase as much, but the i7-980X is helped by a 12 MB L3 cache compared to 8 MB on the i7-860.
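A minimal sketch of clock-speed normalization, assuming it is computed by scaling each system's relative runtime by its clock frequency relative to the 4215 MHz reference. This is my reading of the section title, not a confirmed formula, and the function name is hypothetical. A result of 1.0 means the same number of clock cycles as the i7-980X; higher means more cycles for the same work.

```python
REF_CLOCK_MHZ = 4215  # Core i7-980X clock used as the reference on this page


def clock_normalized(relative_runtime, clock_mhz, ref_clock_mhz=REF_CLOCK_MHZ):
    """Convert runtime relative to the reference into relative clock cycles."""
    return relative_runtime * clock_mhz / ref_clock_mhz


# Perfectly linear scaling: a 3300 MHz part would take 4215/3300 times as long,
# which normalizes to exactly the same cycle count as the reference.
linear_runtime = REF_CLOCK_MHZ / 3300
linear_cycles = clock_normalized(linear_runtime, 3300)

# Illustrative input only: a 3000 MHz system that is 2.3x slower in wall time.
slow_cycles = clock_normalized(2.3, 3000)

print(f"linear scaling runtime ratio: {linear_runtime:.3f}")
print(f"linear scaling, per clock:    {linear_cycles:.3f}")
print(f"2.3x slower at 3000 MHz:      {slow_cycles:.3f}")
```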
Notice the ~40% difference in CPI between the Xeon 3 GHz system and the similar Core 2 Quad 3.5 GHz system for VPR routing. The Core 2 Quad 2.66 GHz system falls in between the Xeon and the C2Q 3.5. I speculate this is due to memory system performance differences. I’m not aware of any changes between the 65 nm and 45 nm Core 2 that would significantly change performance, but I can’t rule out that the difference is due to a microarchitectural change.
The Core 2 microarchitecture has higher per-clock performance than Nehalem in VPR placement.