Ever since Intel’s Hyper-Threading and AMD’s Bulldozer modules, there has been much debate on what qualifies as a real CPU “core”. Unfortunately, I don’t think “core” is easy to define, so marketing tends to name things for their own benefit. In the end, it’s the performance that matters, not the name.
What is a “core”?
It used to be simple to define “core”: Fetch, decode, ALUs, cache, attached over a memory bus of some sort, executing a single thread. Multiprocessor systems would replicate this on a board (SMP), package (MCM), or die (CMP). However, when considering two threads, there is a continuum of design points that range from time-slicing threads on an unmodified CPU core (1.0× throughput) to two full cores with all hardware replicated (>1.9× throughput on two CPU-bound tasks). The two extreme points are traditionally called “one core” and “two cores”, but we now have to come up with names for intermediate design points where only some hardware is replicated to service two hardware thread contexts.
Thus, marketing chooses whatever term suits its interests. GPU manufacturers also abuse the words “core” and “processor” to describe something closer to an ALU or execution unit (Nvidia: “streaming processor” or “CUDA core”; AMD: “stream processor”; but not Intel).
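To make the two endpoints of that continuum concrete, here is a minimal sketch (Python, Linux-only; the spin() workload and the CPU numbers are my own illustrative choices, not from my actual tests) that runs two CPU-bound processes either time-sliced on one logical CPU or pinned to two separate cores:

```python
# Two endpoints of the design continuum: two CPU-bound processes
# time-sliced on one logical CPU vs. pinned to two physical cores.
# Linux-only (os.sched_setaffinity); workload and CPU numbering are
# illustrative assumptions.
import os
import time
from multiprocessing import Process

def spin(iterations=50_000_000):
    # Purely integer CPU-bound work: no floating-point, no memory pressure.
    x = 0
    for i in range(iterations):
        x ^= i

def run_pair(cpus):
    # Pin both workers to the given set of logical CPUs and time them.
    def worker():
        os.sched_setaffinity(0, cpus)  # restrict this process to `cpus`
        spin()
    procs = [Process(target=worker) for _ in range(2)]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    # CPU numbering is machine-specific; here I assume logical CPUs 0 and 2
    # sit on different physical cores.
    t_shared = run_pair({0})      # time-slicing: both on one logical CPU
    t_split = run_pair({0, 2})    # one process per physical core
    print(f"time-sliced: {t_shared:.2f}s  two cores: {t_split:.2f}s")
    print(f"throughput ratio: {t_shared / t_split:.2f}x")
```

A throughput ratio near 1.0 means time-slicing bought nothing; near 2.0 means the two hardware contexts behave like two full cores. Everything in between is the territory this article is about.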
A Performance Metric
The essential characteristic of having two threads is that the software developer must parallelize a task into two mostly-independent threads (this is a hard problem!). The system then “consumes” this two-way parallelism and turns it into some amount of performance improvement (from ~1× for time-slicing a “single-core” to 2× for a perfect “dual-core”), at some hardware cost (from 1× for time-slicing to somewhat over 2× for two cores with interconnect). I will ignore the issue of hardware cost: higher performance usually implies higher cost, and whether this trade-off is “good” is far too complicated to have a single answer.
This performance metric is quite easy to measure, and answers the question: If I gave the system two independent threads rather than one, how much performance would it give me in return for my effort? We are already familiar with the two extreme points of this design space: 1.0 is called “single core”, and near-2.0 is called “dual core”. Both Intel Hyper-Threading and AMD’s “modules” lie somewhere in between.
All of the above really just boils down to re-running my earlier Hyper-Threading Performance tests on AMD systems. This involved taking some workloads, running multiple independent instances of each, and measuring total throughput as the number of parallel instances increases. The workloads use very little floating-point, avoiding a bottleneck on AMD chips caused by not having replicated FPUs within a module.
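A minimal sketch of that methodology, with a stand-in work() function in place of the real workloads (which I haven’t reproduced here):

```python
# Run N independent instances of a CPU-bound workload and report total
# throughput, normalized to a single instance. work() is an illustrative
# stand-in; like the real workloads, it avoids floating-point so the
# shared FPU within an AMD module is not the bottleneck.
import time
from multiprocessing import Process

def work():
    x = 0
    for i in range(30_000_000):
        x = (x * 31 + i) & 0xFFFFFFFF  # integer-only busy loop

def throughput(n):
    # Total throughput = instances completed per second with n in flight.
    procs = [Process(target=work) for _ in range(n)]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return n / (time.perf_counter() - start)

if __name__ == "__main__":
    base = throughput(1)
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} instances: {throughput(n) / base:.2f}x throughput")
```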
I tested four microarchitectures:
| Microarchitecture | CPU | Clock speed |
|---|---|---|
| Intel Lynnfield | Core i7-860 | 3300 MHz |
| Intel Ivy Bridge | Core i7-3770K | 3900 MHz |
| AMD Bulldozer | FX-8120 | 3400 MHz |
| AMD Piledriver | FX-8320 | 3400 MHz |
This plot shows how throughput scales as more instances of a workload are run concurrently on an AMD FX-8320, normalized to four threads. There is linear scaling until a clear bend in the curve at 4 threads, which indicates that using two threads in a module does not perform as well as two full cores. Running two threads per module results in 38-79% increased throughput (geomean 54%). Throughput does not increase beyond 8 threads because the OS is time-slicing threads onto 8 hardware thread contexts, and time-slicing offers no performance improvement to CPU-bound tasks.
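For reference, the “geomean 54%” figure is the geometric mean of the per-workload two-threads-per-module speedups; a quick sketch of the computation (the values below are illustrative placeholders, not my measured results):

```python
# Geometric mean of per-workload speedups; speedup values are made-up
# placeholders within the measured 1.38-1.79 range.
import math

speedups = [1.38, 1.45, 1.62, 1.79]
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"geomean: {geomean:.2f}x  (+{(geomean - 1) * 100:.0f}%)")
```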
The following two graphs compare the throughput of 8 vs. 4 threads (ideal speedup = 2), and 4 vs. 1 thread (ideal speedup = 4), respectively.
The two-way multithreading speedup shows that AMD’s greater replication of hardware within a module results in a greater performance improvement (1.54-1.57) than Intel’s minimal hardware replication for Hyper-Threading (1.23-1.24). In my opinion, neither of these is close enough to 1.9-2.0 to deserve being called a “core”. However, AMD’s two-thread performance gains are sufficiently higher than Intel’s that it wouldn’t be entirely fair to say “it’s just Hyper-Threading” either.
The quad-core speedups show that “real” cores do in fact scale very close to ideal for most workloads, at least up to four cores.
But all of this may be a moot point. In a thermally-constrained environment (densely-packed cardboard boxes) where the high power consumption of an AMD chip causes it to lose its clock speed advantage over Intel, the AMD chips perform rather poorly regardless of how many threads are available. As configured (see the table above), the AMD chips already consume more power.
AMD’s marketing department did a good job convincing most of the world that a thread context that delivers only 54% more throughput deserves to be called a “core”.