art with code


Gazing at the CPU-GPU crystal ball

Reading AnandTech's Core i7 980X review got me thinking. CPU single-thread performance has roughly doubled over the past four years. And we have six cores instead of just two, for a total speedup in the 5-7x range. In the last two years, GPU performance has quadrupled.

The current top-of-the-line CPU (Core i7 980X) does around 100 GFLOPS at double-precision. That's for parallelized and vectorized code, mind you. Single-threaded scalar code fares far worse. Now, even the 100 GFLOPS number is close to a rounding error compared to today's top-of-the-line GPU (Radeon HD 5970) with its 928 GFLOPS at double-precision and 4640 GFLOPS at single-precision. Comparing GFLOPS per dollar, the Core i7 980X costs $999 and gets roughly 0.1 GFLOPS/$, whereas the HD 5970 costs $599 and gets 1.5 GFLOPS/$ at double precision and 7.7 GFLOPS/$ at single precision.

The GFLOPS numbers are a bit quirky. The HD 5970 number is based on 160 processors running at 725 MHz where each processor can execute five instructions per cycle. And if each is executing a four-element multiply-add, that's 8 FLOPS per instruction. To put it all together, 160 processors * 0.725 GHz * 5 instructions * 8 FLOPS = 4640 GFLOPS. For doubles, each processor can do four scalar multiply-adds per cycle: 160 * 0.725 * 4 * 2 = 928 GFLOPS. If you're not doing multiply-adds, halve the numbers.

The Core i7 GFLOPS number is apparently based on 5 ops per cycle, as 6 cores * 3.33 GHz * 5 FLOPS = 100 GFLOPS. I don't know how you achieve five double ops per cycle. To get four double ops per cycle, you can issue a 2-wide SSE mul and add and they are executed in parallel. Maybe there's a third scalar op executed in parallel with that? If that's the case, you could maybe get 180 GFLOPS with floats: two 4-wide SSE ops and a scalar op for 9 FLOPS per cycle. The GFLOPS/$ for 180 GFLOPS would be 0.18. For a non-multiply-add workload, the numbers are halved to 40 GFLOPS for doubles and 80 GFLOPS for floats.

If we look at normal software (i.e. single-threaded, not vectorized, gets no ops in parallel), the Core i7 does 3.33 GHz * 1 ops = 3.3 GFLOPS. That's a good 30x worse than peak performance, so you better optimize your code. If you're silly enough to run single-threaded scalar programs on the GPU, the HD 5970 would do 0.725 GHz * 2 op mul-add = 1.45 GFLOPS. Again, halve the number for non-mul-add workloads.

Anyhow, looking at number-crunching price-performance, the HD 5970 is 15x better value for doubles and 43x better value for floats compared to the 100 GFLOPS and 180 GFLOPS numbers. If you want dramatic performance numbers to wow your boss with, port some single-threaded non-vectorized 3D math to the GPU: the difference in speed should be around 700x. If you've also strategically written the code in, say, Ruby, a performance boost of four orders of magnitude is not a dream!

Add in the performance growth numbers from the top and you arrive at 1.6x yearly growth for CPUs and 2x yearly growth for GPUs. Also consider that Nvidia's Fermi architecture is reducing the cost of double math to a CPU-style halving of performance. Assuming that these growth trends hold for the next two years and that GPUs move to CPU-style double performance, you'll be seeing 250 GFLOPS CPUs going against 9200 GFLOPS GPUs. The three and four year extrapolations are 409/18600 and 655/37100, respectively. The GPU/CPU performance ratios would be 37x, 45x and 57x for the two-to-four-year scenarios. The corresponding price-performance ratios would be 62x, 75x and 95x.

With regard to performance-per-watt, the Core i7 980x uses 100W under load, compared to the 300W load consumption of the HD 5970. The 980x gets 1 GFLOPS/W for doubles and 1.8 GFLOPS/W for floats. The HD 5970 does 3.1 GFLOPS/W for doubles and 15.5 GFLOPS/W for floats.

If number crunching performance was all that mattered in a CPU price, top-of-the-line CPUs would be priced at $50 today. And $10 in four years...

The CPU-GPU performance difference creates an interesting dynamic. For Intel, the horror story is there being no perceivable difference between a $20 CPU with a $100 GPU vs. a $500 CPU with a $100 GPU. That would blow away the discrete CPU market and you'd end up with cheap CPUs integrated on the motherboard with the GPU as the differentiating factor. Much like the Atom boxes, come think of it. The plain Atom can't play video or play 3D games. An Atom with ION can [edit] play videos that have GPU-accelerated codecs and light 3D games [/edit].

Intel strategy?

To slow down the migration towards GPUs, you need to make targeting GPUs an unattractive proposition to software developers. A software developer does a cost-benefit analysis based on the size of the market and the cost of entering it. To keep them away from a market, reduce the its size and increase the cost. When it comes to retarding GPU computing, the goal is to make fewer computers ship with a capable GPU and to make developing for GPUs harder. If you're a GPU manufacturer, your goals are the exact opposite.

To make fewer computers ship with a capable GPU, they must ship with a graphics-only GPU. Competition in the GPU market is based on performance, so vendors are unlikely to buy a slow GPU. To make vendors buy a slow GPU, you must integrate it to the package you're selling, and thus lower the probability that they will buy a superfluous extra GPU. You also want to make it harder for vendors to buy a motherboard with a capable GPU integrated. These actions should restrict the market of capable GPUs to enthusiast machines.

This is still only a stop-gap measure, as it's pretty difficult to beat a 50x performance gap. In the long term, you want to switch the graphics-only integrated GPU to a home-built capable GPU and then proceed with commoditizing the CPU market. Or if that fails, maybe go for a monopoly on the chipset market: use IP to keep out competition, hoist prices.

Of course, this kind of maneuvering tends to lead to lawsuits.
Post a Comment

Blog Archive

About Me

My photo

Built art installations, web sites, graphics libraries, web browsers, mobile apps, desktop apps, media player themes, many nutty prototypes