art with code

2010-03-17

Gazing at the CPU-GPU crystal ball

Reading AnandTech's Core i7 980X review got me thinking. CPU single-thread performance has roughly doubled over the past four years. And we have six cores instead of just two, for a total speedup in the 5-7x range. In the last two years, GPU performance has quadrupled.

The current top-of-the-line CPU (Core i7 980X) does around 100 GFLOPS at double-precision. That's for parallelized and vectorized code, mind you. Single-threaded scalar code fares far worse. Now, even the 100 GFLOPS number is close to a rounding error compared to today's top-of-the-line GPU (Radeon HD 5970) with its 928 GFLOPS at double-precision and 4640 GFLOPS at single-precision. Comparing GFLOPS per dollar, the Core i7 980X costs $999 and gets roughly 0.1 GFLOPS/$, whereas the HD 5970 costs $599 and gets 1.5 GFLOPS/$ at double precision and 7.7 GFLOPS/$ at single precision.

The GFLOPS numbers are a bit quirky. The HD 5970 number is based on 160 processors running at 725 MHz where each processor can execute five instructions per cycle. And if each is executing a four-element multiply-add, that's 8 FLOPS per instruction. To put it all together, 160 processors * 0.725 GHz * 5 instructions * 8 FLOPS = 4640 GFLOPS. For doubles, each processor can do four scalar multiply-adds per cycle: 160 * 0.725 * 4 * 2 = 928 GFLOPS. If you're not doing multiply-adds, halve the numbers.
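
To make the arithmetic easy to poke at, here's the same back-of-the-envelope peak-rate calculation as a throwaway C snippet. Nothing in it is measured; the processor count, clock and per-cycle rates are just the numbers quoted above.

```c
#include <stdio.h>

int main(void) {
    /* HD 5970 back-of-the-envelope peak rates, using the numbers above. */
    double processors = 160.0;   /* processors, as counted above */
    double clock_ghz  = 0.725;   /* core clock in GHz            */

    /* Single precision: 5 instruction slots, each a 4-wide multiply-add
       = 4 muls + 4 adds = 8 FLOPS per slot per cycle. */
    double sp_gflops = processors * clock_ghz * 5.0 * 8.0;

    /* Double precision: 4 scalar multiply-adds per processor per cycle
       = 4 * 2 FLOPS. */
    double dp_gflops = processors * clock_ghz * 4.0 * 2.0;

    printf("HD 5970 peak: %.0f SP GFLOPS, %.0f DP GFLOPS\n",
           sp_gflops, dp_gflops);   /* prints 4640 and 928 */
    return 0;
}
```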

The Core i7 GFLOPS number is apparently based on 5 ops per cycle, as 6 cores * 3.33 GHz * 5 FLOPS = 100 GFLOPS. I don't know how you achieve five double ops per cycle. To get four double ops per cycle, you can issue a 2-wide SSE mul and add and they are executed in parallel. Maybe there's a third scalar op executed in parallel with that? If that's the case, you could maybe get 180 GFLOPS with floats: two 4-wide SSE ops and a scalar op for 9 FLOPS per cycle. The GFLOPS/$ for 180 GFLOPS would be 0.18. For a non-multiply-add workload, the numbers are halved to 40 GFLOPS for doubles and 80 GFLOPS for floats.
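
For the four-double-ops-per-cycle case, the idea looks roughly like this with SSE2 intrinsics. This is a sketch, not a benchmark: it assumes a core with separate packed-double multiply and add units, so the two independent operations below can issue in the same cycle.

```c
#include <stdio.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

int main(void) {
    __m128d a = _mm_set_pd(1.0, 2.0), b = _mm_set_pd(3.0, 4.0);
    __m128d c = _mm_set_pd(5.0, 6.0), d = _mm_set_pd(7.0, 8.0);

    /* A packed-double multiply and an independent packed-double add.
       Because there is no data dependency between them, a core with
       separate FP multiply and FP add units can issue both in the same
       cycle: 2 + 2 = 4 double-precision FLOPS per cycle per core. */
    __m128d mul = _mm_mul_pd(a, b);  /* 2 double multiplies */
    __m128d add = _mm_add_pd(c, d);  /* 2 double adds       */

    double out[4];
    _mm_storeu_pd(out,     mul);
    _mm_storeu_pd(out + 2, add);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```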

If we look at normal software (i.e. single-threaded, not vectorized, extracting no parallel ops), the Core i7 does 3.33 GHz * 1 FLOP = 3.3 GFLOPS. That's a good 30x worse than peak performance, so you'd better optimize your code. If you're silly enough to run single-threaded scalar programs on the GPU, the HD 5970 would do 0.725 GHz * 2 FLOPS (multiply-add) = 1.45 GFLOPS. Again, halve the number for non-mul-add workloads.

Anyhow, looking at number-crunching price-performance, the HD 5970 is 15x better value for doubles and 43x better value for floats compared to the 100 GFLOPS and 180 GFLOPS numbers. If you want dramatic performance numbers to wow your boss with, port some single-threaded non-vectorized 3D math to the GPU: the difference in speed should be around 700x. If you've also strategically written the code in, say, Ruby, a performance boost of four orders of magnitude is not a dream!

Add in the performance growth numbers from the top and you arrive at 1.6x yearly growth for CPUs and 2x yearly growth for GPUs. Also consider that Nvidia's Fermi architecture is reducing the cost of double math to a CPU-style halving of performance. Assuming that these growth trends hold for the next two years and that GPUs move to CPU-style double performance, you'll be seeing 250 GFLOPS CPUs going against 9200 GFLOPS GPUs. The three and four year extrapolations are 409/18600 and 655/37100, respectively. The GPU/CPU performance ratios would be 37x, 45x and 57x for the two-to-four-year scenarios. The corresponding price-performance ratios would be 62x, 75x and 95x.
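
The extrapolation is plain compound growth; here it is as a small C program if you want to play with the assumptions (the 1.6x and 2x yearly factors and the Fermi-style halving of the single-precision rate are the assumptions from the text, not predictions of actual products):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double cpu_dp = 100.0;     /* Core i7 980X, double-precision GFLOPS */
    double gpu_sp = 4640.0;    /* HD 5970, single-precision GFLOPS      */
    double cpu_growth = 1.6;   /* assumed yearly CPU growth             */
    double gpu_growth = 2.0;   /* assumed yearly GPU growth             */

    for (int years = 2; years <= 4; years++) {
        double cpu = cpu_dp * pow(cpu_growth, years);
        /* Assume future GPUs do doubles at half the single-precision
           rate, Fermi-style: start from the SP number and halve it. */
        double gpu = gpu_sp * pow(gpu_growth, years) / 2.0;
        printf("%d years: CPU %.0f GFLOPS, GPU %.0f GFLOPS, ratio %.0fx\n",
               years, cpu, gpu, gpu / cpu);
    }
    return 0;
}
```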

With regard to performance-per-watt, the Core i7 980x uses 100W under load, compared to the 300W load consumption of the HD 5970. The 980x gets 1 GFLOPS/W for doubles and 1.8 GFLOPS/W for floats. The HD 5970 does 3.1 GFLOPS/W for doubles and 15.5 GFLOPS/W for floats.

If number-crunching performance were all that mattered in CPU pricing, top-of-the-line CPUs would be priced at $50 today. And $10 in four years...

The CPU-GPU performance difference creates an interesting dynamic. For Intel, the horror story is that there's no perceivable difference between a $20 CPU with a $100 GPU and a $500 CPU with a $100 GPU. That would blow away the market for expensive discrete CPUs, and you'd end up with cheap CPUs integrated on the motherboard, with the GPU as the differentiating factor. Much like the Atom boxes, come to think of it. The plain Atom can't handle HD video or 3D games. An Atom with ION can [edit] play videos that have GPU-accelerated codecs and run light 3D games [/edit].

Intel strategy?


To slow down the migration towards GPUs, you need to make targeting GPUs an unattractive proposition for software developers. A software developer does a cost-benefit analysis based on the size of the market and the cost of entering it. To keep developers away from a market, reduce its size and increase the cost of entry. When it comes to retarding GPU computing, the goal is to make fewer computers ship with a capable GPU and to make developing for GPUs harder. If you're a GPU manufacturer, your goals are the exact opposite.

To make fewer computers ship with a capable GPU, they must ship with a graphics-only GPU instead. Competition in the GPU market is based on performance, so vendors are unlikely to buy a slow GPU by choice. To make vendors buy a slow GPU, you must integrate it into the package you're selling, and thus lower the probability that they will buy a superfluous extra GPU. You also want to make it harder for vendors to buy a motherboard with a capable GPU integrated. These actions should restrict the market for capable GPUs to enthusiast machines.

This is still only a stop-gap measure, as it's pretty difficult to beat a 50x performance gap. In the long term, you want to switch the graphics-only integrated GPU to a home-built capable GPU and then proceed with commoditizing the CPU market. Or if that fails, maybe go for a monopoly on the chipset market: use IP to keep out competition, hoist prices.

Of course, this kind of maneuvering tends to lead to lawsuits.

11 comments:

Anonymous said...

In reality, the Ion can't play games, either. It's very very VERY slow, and being coupled with a brain-dead CPU doesn't help. What it can do more than "plain Atom" is play 720p video, if all the planets align and you actually hit the hardware accelerated path.

Chris said...

Nice article, and you've summed up the performance benefits of GPUs over CPUs nicely. However, I'm guessing Intel doesn't have to worry (yet) about these issues because of two factors:

1. It is still hard to write parallel apps that can take advantage of the GPU speedups (and very few commercial apps do this now)

2. OS scheduling for GPUs is nonexistent. If you want an application to get scheduled on a GPU, you need to not only write it with that GPU's API (e.g. CUDA or OpenCL), but you also have to schedule it yourself in another application. I.e., multitasking isn't possible in the same sense as normal applications.

Ilmari Heikkinen said...

@Chris: Thanks, and good points. As you said, the software side of GPGPU isn't really quite there yet. Especially the scheduling; I'd hoped we'd left co-operative multitasking in the 90s.

The software-dev dream scenario would be an OpenCL/CUDA/DirectCompute-like cross-platform, cross-GPU, cross-CPU API that would produce fast CPU code (parallelized and vectorized) and take advantage of GPUs. That way you could write code using that API even if you're targeting only CPUs. On GPUs it'd just go faster.
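
As a rough sketch of what that would feel like today: with OpenCL you can already point the same kernel source at either device class just by flipping the device type flag. The kernel below is a made-up example and error handling is omitted.

```c
#include <stdio.h>
#include <CL/cl.h>

/* A made-up kernel: multiply a float array by two. The same source string
   is compiled for whatever device we pick below. */
static const char *src =
    "__kernel void scale(__global float *v) {"
    "  int i = get_global_id(0);"
    "  v[i] = v[i] * 2.0f;"
    "}";

int main(void) {
    /* Flip this to CL_DEVICE_TYPE_CPU to run the same kernel on the CPU. */
    cl_device_type want = CL_DEVICE_TYPE_GPU;

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, want, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

    float data[1024];
    for (int i = 0; i < 1024; i++) data[i] = (float)i;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, NULL);
    clSetKernelArg(kernel, 0, sizeof(buf), &buf);

    size_t global = 1024;
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("data[1] = %f\n", data[1]);  /* 2.0 if the kernel ran */
    return 0;
}
```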

If Intel is planning its strategy based on a worst-case analysis -- "assume that they execute flawlessly" -- the five-year strategy would probably be aimed against something like the above. "Microsoft DirectCompute 3.0" or such.

If Intel fails at a home-built GPU, they'd at least try to secure some hold of the profits from shipped PCs. I wouldn't be surprised if they started integrating 30GB SSDs on motherboards in a few years. Put it on PCI-E, market as QuickDrive, eat market share from other storage vendors.

Anonymous said...

The spectacular difference in the architecture of CPUs and GPUs means that the CPU will probably remain a vital part of running most software in the future. Comparing the GFLOPS between the two architectures serves only to motivate the use of GPUs where they can be well utilised, i.e. graphics processing, matrix multiplication, etc. For the majority of software applications, farming out tasks to GPUs through CUDA means a massive performance loss and an awkward development environment.

If your application is not spectacularly data-parallel it is unlikely that you will get much out of a GPU.

Ilmari Heikkinen said...

@Anonymous: Yes, if your application doesn't do a lot of computation it doesn't need a lot of computational power.

Bear in mind that a CPU isn't going to be fast on those kinds of apps either. Or be getting faster for that matter. CPUs seem to be dropping clock speeds and increasing core counts, see e.g. the Opteron 6174 with 12 cores @ 2.2GHz. Similar single-threaded performance to a P4 from five years ago, but around 12x multi-threaded performance.

Fast code is parallelized, vectorized, has flat streaming memory access patterns and does lots of computation per byte of memory access. Both on the CPU and the GPU.

If you do single-threaded random access pointer-chasing, it's going to be slow on any modern architecture. And will keep getting slower in comparison to peak performance.

Ilmari Heikkinen said...

Or rather, my prediction is that the CPU is either turning into a CPU-GPU-hybrid or getting integrated into the motherboard, much like today's audio chips. The pure-CPU speed race is probably going to die off as performance is plateauing.

I might be spectacularly wrong here, of course. But how the heck are the CPUs going to catch up with that 30x-and-increasing performance gap?

Anonymous said...

GPUs lose an order of magnitude or more when they need to go across the bus to slow DDR instead of fast, expensive GDDR. Many enterprise workloads require working sets in the tens or hundreds of gigabytes already. If the next GPUs plug directly into HT/QP (and memory speeds come up enough) they might get more interesting.

Anonymous said...

@Chris

"Intel doesn't have to worry (yet) about these issues because of two factors:

1. It is still hard to write parallel apps that can take advantage of the GPU speedup (and very few commercial apps do this now)"

CUDA makes it very easy to write parallel apps as long as your app has sufficient parallelism. If you can't find enough parallelism in your app then there's no hope to begin with.

However, data parallelism is actually very common. Of course, the best-known commercial software that successfully uses GPUs is graphics. The graphics market alone is enough to sustain GPU development. And research is showing that CUDA can have a dramatic performance impact on common commercial applications such as databases and data analysis algorithms (e.g. network packet analysis, virus detection, etc.).

Databases are the first thing that comes to mind with the term "commercial enterprise software." SQL databases are actually massively parallel by design, and many types of queries can be offloaded to GPUs. Data mining and OLAP algorithms can also see huge performance increases on GPUs. OLAP support in databases has been around for a long time, and data mining algorithms have also been packaged with databases for many years. It is only a matter of time before GPUs become an integral part of databases.

Another famous "commercial application" that happens to be massively parallel is the good old spreadsheet. Excel 2007 now supports up to a million rows in each sheet and is already multi-threaded. It's a no-brainer that Excel should be accelerated with GPUs.

Also keep in mind that GPUs are really just devices designed for doing huge mathematical computations in real time. The impact that such accelerated mathematics could have on software should not be underestimated. The massive and inexpensive power of GPUs could lead to a great empowerment of computationally intensive statistical methods that were previously unusable in many places. Perhaps the web browser of the future could be more intelligent, with a machine learning algorithm accelerated on a GPU with thousands of streaming processors?

So I think Intel should be very worried. CPUs have already effectively been demoted to "boring" problem domains like procedural control applications, operating systems, etc.

"2. OS scheduling for GPUs is nonexistent. If you want an application to get scheduled on a GPU, you need to not only write it with that GPU's API (e.g. CUDA or OpenCL), but you also have to schedule it yourself in another application I.e., multitasking isn't possible in the same sense as normal applications."

Multiple programs can and do run on the same GPU. For example, the operating system needs to draw windows on the screen, a game might be rendering some images, and at the same time an OpenCL application might be crunching some numbers, all on the same GPU and sharing the same GPU memory. The device driver takes care of this scheduling. Secondly, the new CUDA Fermi architecture from NVIDIA supports "concurrent kernel execution", which allows a single process to launch multiple GPU programs that are optimally scheduled by the hardware itself in order to maximize performance.

Anonymous said...

@ Anonymous

"GPUs lose an order of magnitude or more when they need to go across the bus to slow DDR instead of fast, expensive GDDR. Many enterprise workloads require working sets in the tens or hundreds of gigabytes already. If the next GPUs plug directly into HT/QP (and memory speeds come up enough) they might get more interesting."

Any application that processes a lot of data in parallel would work better on GPUs than CPUs. Hundreds of gigabytes is a trivial amount of data for a GPU to crunch through. GPUs were designed for high-bandwidth memory access. They have much wider memory buses and much higher memory clock speeds than CPUs do. This is because GPUs are optimized for moving huge amounts of data in parallel, while CPUs are optimized for handling small amounts of sequential and random data access (small enough to fit into the L1/L2 caches).

Now it is true that the GPU could be slowed down by the need to transfer data from system RAM to GPU memory across the x16 PCI-E bus, which has a maximum bandwidth of 8GB/s. But the CPU also faces similar constraints, with a peak memory bandwidth of about 50GB/s. In comparison, the NVIDIA Fermi GPU will have a peak bandwidth of 250GB/s. If you really have as much data to process as you say you do, then you hide the 8GB/s PCI-E transfer latency by using, for example, the CUDA stream API, which allows for asynchronous host memory access and GPU execution. If you are dealing with terabytes or petabytes of data then you have to store it on disk, so you are trying to hide disk access latency and do things in parallel anyway. Processing that amount of data without using GPUs doesn't make much sense.
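
For reference, a rough sketch of the double-buffering pattern being described, using the CUDA runtime's C API. launch_process_chunk() is a hypothetical wrapper around a kernel launch (it would live in a .cu file), and the chunk size is arbitrary.

```c
#include <cuda_runtime.h>

/* Hypothetical wrapper around a kernel launch that processes one chunk of
   floats on the given stream; it would be defined in a .cu file. */
extern void launch_process_chunk(float *d_buf, size_t n, cudaStream_t stream);

#define CHUNK (8 * 1024 * 1024)   /* floats per chunk, arbitrary */

void process_stream(const float *host_data, size_t total, float *d_buf[2],
                    cudaStream_t stream[2])
{
    /* Double-buffer: while stream 0 is copying and crunching chunk i,
       stream 1 copies and crunches chunk i+1, hiding PCI-E transfer time
       behind GPU execution. host_data must be pinned (cudaMallocHost) for
       cudaMemcpyAsync to actually overlap with kernel work. Operations
       within a stream are ordered, so reusing a device buffer on the same
       stream two iterations later is safe. */
    for (size_t off = 0, i = 0; off < total; off += CHUNK, i++) {
        int s = i & 1;
        size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;
        cudaMemcpyAsync(d_buf[s], host_data + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        launch_process_chunk(d_buf[s], n, stream[s]);
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
}
```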

Anonymous said...

"If you really have as much data to process as you say you do then you hide the PCI-e 8GB/s memory latency by using for example the CUDA stream API which allows for asynchronous host memory access and GPU execution. If you are dealing with terabytes or petabytes of data then you have to store on disk so you are trying to hide disk access latency and do things in parallel anyway."

I think there's some confusion here.

I can currently buy Nehalem EX systems with a terabyte of RAM for 4 (*8) CPU sockets in 4U. In many shops (pharma, finance, etc.) we've been buying hideously expensive memory solutions like TMS for years because we want flat access to terabyte-sized datasets. We were already bottlenecked by PCI-E, and going from QDR InfiniBand to local RAM already gives large gains, so this should give you some idea of the performance characteristics.

WH said...

Hey,

Loved your detailed post. I am a GPU fan haha here's my blog:

http://programminglinuxgames.blogspot.com/
