Core 2, Nehalem, FPGA CAD

By Henry, on November 21st, 2010

Here are some FPGA CAD benchmarks across a few relatively modern machines. The original motivation was to figure out why VPR ran much slower on a Core 2 Xeon 5160 system than on a desktop-class Core 2 Quad Q9550. A secondary goal was to measure the Core i7-980X @ 4215 MHz. I added some Pentium 4 results for fun, to show how far processor microarchitecture has progressed since then.

The test workloads and systems are listed in the tables, followed by results. All runtime results are normalized to the Core i7-980X @ 4215 MHz.

Test	Description
Memory Latency 128M	Read latency while randomly accessing a 128 MB array, using 4 KB pages
Memory Bandwidth	STREAM benchmark, copy bandwidth. Compiled with 64-bit gcc 4.4.3
Quartus 32-bit	Quartus 10.0 SP1 32-bit on 64-bit Linux, doing a full compile of OpenSPARC T1 for Stratix III (87 KALUTs utilization). Quartus tests are single-threaded with parallel compile disabled.
Quartus 64-bit	Quartus 10.0 SP1 64-bit on 64-bit Linux, same as above.
VPR 5.0 64-bit	Modified VPR 5.0 compiled with gcc 4.4.3, compiling a ~9000-block circuit (mkDelayWorker32B.mem_size14.blif)
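For readers unfamiliar with how a random-access latency test like the one in the table is usually built, here is a minimal sketch of the pointer-chasing idea. It is not the benchmark used for these results; the function names `build_random_cycle` and `chase` are made up for illustration. In Python the interpreter overhead swamps the DRAM latency, so a real measurement would use the same structure in C over a 128 MB array.

```python
import random
import time

def build_random_cycle(n, seed=0):
    """Build nxt so that following nxt[i] from any slot visits all n slots
    in one cycle (Sattolo's algorithm), defeating hardware prefetchers."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, which guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps):
    """Do `steps` dependent loads; return (final index, seconds per step)."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = nxt[idx]  # each load depends on the previous one
    elapsed = time.perf_counter() - start
    return idx, elapsed / steps

if __name__ == "__main__":
    table = build_random_cycle(1 << 16)  # small size, for illustration only
    _, seconds_per_step = chase(table, 1 << 16)
    print(f"{seconds_per_step * 1e9:.1f} ns per step (mostly interpreter overhead)")
```

Because every load depends on the result of the previous one, the time per step approximates the full access latency rather than throughput.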

System	CPU	Memory
Pentium 4 2800	130 nm Pentium 4 Northwood 2.8 GHz	2-channel DDR-400 on Intel 875P chipset
Xeon 3000	2 x 65 nm (Core 2) Xeon 5160 at 3.0 GHz	4-channel DDR2 FB-DIMM on Intel 5000X chipset
C2Q 2660	65 nm Core 2 Quad Q6700 at 2.66 GHz	2-channel 4 GB DDR2 on Intel Q35 chipset
C2Q 3500 (box)	45 nm Core 2 Quad Q9550 at 3.5 GHz	2-channel 4 GB DDR2-824 4-4-4-12 on Intel P965 chipset
i7 3300 (box2)	45 nm Core i7-860 @ 3.3 GHz	2-channel 8 GB DDR3-1580 9-9-9-24
i7 4215	32 nm Core i7-980X @ 4.215 GHz	3-channel 6 GB DDR3-1690

Memory Latency and Bandwidth

Nehalem’s memory system is significantly better than Core 2’s, with lower latency and STREAM copy bandwidth more than 2.5x higher than on the Core 2 systems. The Core i7-980X system has 18% higher bandwidth than the i7-860, due to a combination of 3 DDR3 channels vs. 2, faster memory (DDR3-1690 vs. DDR3-1580), and higher CPU clock speed. The bandwidth difference narrows to 12% when the two systems run at the same CPU clock frequency (not shown in charts). Interestingly, the Xeon FB-DIMM system has poor memory performance compared to the desktop-class Core 2 systems.
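As a rough sanity check on the two Nehalem systems, the theoretical peak DRAM bandwidth can be worked out from the channel count and data rate in the system table. This sketch assumes standard 64-bit (8-byte) DDR3 channels; the helper name `peak_bandwidth_mb_s` is made up for illustration.

```python
def peak_bandwidth_mb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak bandwidth in MB/s, assuming 64-bit channels."""
    return channels * mega_transfers_per_s * bytes_per_transfer

i7_980x = peak_bandwidth_mb_s(3, 1690)  # 40560 MB/s
i7_860 = peak_bandwidth_mb_s(2, 1580)   # 25280 MB/s
ratio = i7_980x / i7_860                # about 1.60

print(f"peak ratio: {ratio:.2f}x")
```

The theoretical peak differs by about 60%, while the measured STREAM copy difference is 18%, which suggests the copy test is not limited by raw channel bandwidth alone.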

Quartus 32-bit

Quartus 32 bit Compile Time

Compile time for OpenSPARC T1 normalized to Core i7-980X (lower is better).

Quartus 64-bit

Quartus 64 bit Compile Time

Same test with 64-bit Quartus. The speed gap between the Core 2 and Nehalem systems widens slightly, especially for placement, probably because 64-bit pointers increase the memory working set and the Core 2 memory systems are slower.

VPR 5.0

VPR 5.0 Compile Time

Roughly 2/3 of the time is spent in routing, and 1/3 in packing (clustering), with placement taking ~1.6% of runtime.
The two slowest Core 2 systems show very poor routing performance, 2.3x slower than the overclocked 980X. I speculate this has something to do with poor memory performance, but the impact on routing appears even larger than the low-level memory benchmarks would suggest. Also odd is that placement is faster per clock on the Core 2 microarchitecture.

Clock Speed-Normalized Results

Quartus 32 bit Compile Time

Quartus 64 bit Compile Time

VPR 5.0 Compile Time

The Core i7-980X at 4.21 GHz scales roughly linearly with clock speed compared to the Core i7-860 at 3.3 GHz. Normally clock speed scaling is slightly below linear because memory and I/O speeds don’t increase as much, but the i7-980X is helped by a 12 MB L3 cache compared to 8 MB on the i7-860.
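To make "roughly linear" concrete, here is a small sketch of how scaling efficiency could be computed from the two clock speeds in the system table. The function names are made up for illustration, and no measured speedup is filled in.

```python
def clock_ratio(fast_mhz, slow_mhz):
    """Speedup expected if performance scaled perfectly with clock speed."""
    return fast_mhz / slow_mhz

def scaling_efficiency(observed_speedup, fast_mhz, slow_mhz):
    """1.0 means perfectly linear scaling; below 1.0 means sub-linear."""
    return observed_speedup / clock_ratio(fast_mhz, slow_mhz)

ideal = clock_ratio(4215, 3300)  # about 1.28
print(f"ideal speedup from clock alone: {ideal:.2f}x")
```

A measured speedup close to 1.28x between the two systems would correspond to an efficiency near 1.0.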

Notice the ~40% difference in CPI between the Xeon 3 GHz system and the similar Core 2 Quad 3.5 GHz system for VPR routing. The Core 2 Quad 2.66 GHz system falls in between the two. I speculate this is due to memory system performance differences. I’m not aware of any changes between the 65 nm and 45 nm Core 2 that would significantly change performance, but I can’t rule out a microarchitectural change as the cause.
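The clock-normalized numbers can be derived from the runtime-normalized ones by multiplying by the clock ratio. Since every system runs the same binary on the same input, the instruction count is essentially the same, so the ratio of cycles is also the ratio of CPI. A minimal sketch, using a made-up helper name `relative_cycles` and the approximate 2.3x routing slowdown quoted in the VPR section for the Xeon:

```python
REF_CLOCK_MHZ = 4215  # Core i7-980X, the normalization baseline

def relative_cycles(relative_runtime, clock_mhz, ref_clock_mhz=REF_CLOCK_MHZ):
    """Convert runtime relative to the baseline into cycles relative to it.

    cycles = runtime * frequency, so the ratio picks up the clock ratio.
    """
    return relative_runtime * clock_mhz / ref_clock_mhz

# Xeon 5160 at 3000 MHz, routing roughly 2.3x slower than the 980X (approximate)
xeon_routing = relative_cycles(2.3, 3000)
print(f"Xeon routing cycles relative to 980X: {xeon_routing:.2f}x")
```

With those inputs the Xeon needs about 1.64x as many cycles as the 980X for routing, so most of the 2.3x gap is not explained by clock speed.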

The Core 2 microarchitecture has higher per-clock performance than Nehalem in VPR placement.
