Comments on fhtr: Gazing at the CPU-GPU crystal ball
by Ilmari Heikkinen (http://www.blogger.com/profile/10857385258792531336)

WH (2010-04-16 08:53):

Hey,

Loved your detailed post. I am a GPU fan, haha. Here's my blog: http://programminglinuxgames.blogspot.com/

Anonymous (2010-04-06 19:09):

"If you really have as much data to process as you say you do then you hide the PCI-e 8GB/s memory latency by using for example the CUDA stream API which allows for asynchronous host memory access and GPU execution. If you are dealing with terabytes or petabytes of data then you have to store on disk so you are trying to hide disk access latency and do things in parallel anyway."

I think there's some confusion here.

I can currently buy Nehalem EX systems with a terabyte of RAM for 4 (*8) CPU sockets in 4U. In many shops (pharma, finance, etc.) we've been buying hideously expensive memory solutions like TMS for years because we want flat access to terabyte-sized datasets. We were already bottlenecked by PCI-e, and going from QDR InfiniBand to local RAM already gives large gains, so this should give you some idea of the performance characteristics.
Anonymous (2010-04-01 06:02):

@Anonymous:

"GPUs lose an order of magnitude or more when they need to go across the bus to slow DDR instead of fast, expensive GDDR. Many enterprise workloads require working sets in the tens or hundreds of gigabytes already. If the next GPUs plug directly into HT/QP (and memory speeds come up enough) they might get more interesting."

Any application that processes a lot of data in parallel would work better on a GPU than on a CPU. Hundreds of gigabytes is a trivial amount of data for a GPU to crunch through. GPUs were designed for high-bandwidth memory access: they have much wider memory buses and much higher memory clock speeds than CPUs do. This is because GPUs are optimized for moving huge amounts of data in parallel, while CPUs are optimized for handling small amounts of sequential or random data access (small enough to fit into the L1/L2 caches).

Now, it is true that the GPU can be slowed down by the need to transfer data from system RAM to GPU memory across the x16 PCI-E bus, which has a maximum bandwidth of 8GB/s. But the CPU also faces similar constraints, with a peak memory bandwidth of about 50GB/s. In comparison, the NVIDIA Fermi GPU will have a peak bandwidth of 250GB/s. If you really have as much data to process as you say you do, then you hide the PCI-e 8GB/s memory latency by using, for example, the CUDA stream API, which allows for asynchronous host memory access and GPU execution. If you are dealing with terabytes or petabytes of data, then you have to store it on disk, so you are trying to hide disk access latency and do things in parallel anyway. Processing that amount of data without using GPUs doesn't make much sense.
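[Editor's note: a minimal sketch of the stream-API overlap described above. The two-stream ping-pong, chunk size, and the trivial `scale` kernel are illustrative assumptions, not anything from this thread; it only uses standard CUDA runtime calls (`cudaMallocHost`, `cudaMemcpyAsync`, stream-qualified kernel launches).]

```
// Sketch: hide host<->device transfer latency by overlapping async copies
// with kernel execution on two CUDA streams. Async copies require pinned
// host memory (cudaMallocHost). Chunking and kernel are made up here.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // stand-in for real per-element work
}

int main() {
    const int N = 1 << 20;     // elements per chunk (arbitrary)
    const int CHUNKS = 8;
    float *h;
    cudaMallocHost((void**)&h, (size_t)N * CHUNKS * sizeof(float));
    float *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc((void**)&d[i], N * sizeof(float));
        cudaStreamCreate(&s[i]);
    }
    for (int c = 0; c < CHUNKS; c++) {
        int j = c % 2;         // ping-pong between the two streams
        cudaMemcpyAsync(d[j], h + (size_t)c * N, N * sizeof(float),
                        cudaMemcpyHostToDevice, s[j]);
        scale<<<(N + 255) / 256, 256, 0, s[j]>>>(d[j], N);
        cudaMemcpyAsync(h + (size_t)c * N, d[j], N * sizeof(float),
                        cudaMemcpyDeviceToHost, s[j]);
        // while one stream is copying, the other can be executing its kernel
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Within each stream the copy-kernel-copy sequence runs in order; across the two streams, transfers and compute can overlap, which is the latency-hiding the comment refers to.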
Anonymous (2010-04-01 05:37):

@Chris:

"Intel doesn't have to worry (yet) about these issues because of two factors: 1. It is still hard to write parallel apps that can take advantage of the GPU speedup (and very few commercial apps do this now)"

CUDA makes it very easy to write parallel apps, as long as your app has sufficient parallelism. If you can't find enough parallelism in your app, then there's no hope to begin with.

However, data parallelism is actually very common. Of course, the best-known commercial software that successfully uses GPUs is graphics; the graphics market alone is enough to sustain GPU development. And research is showing that CUDA can have a dramatic performance impact on common commercial applications such as databases and data analysis algorithms (e.g. network packet analysis, virus detection, etc.).

Databases are the first thing that comes to mind with the term "commercial enterprise software". SQL databases are actually massively parallel by design, and many types of queries can be offloaded to GPUs. Data mining and OLAP algorithms can also see huge performance increases on GPUs. OLAP support in databases has been around for a long time, and data mining algorithms have been packaged with databases for many years. It is only a matter of time before GPUs become an integral part of databases.

Another famous "commercial application" that happens to be massively parallel is the good old spreadsheet. Excel 2007 supports up to a million rows per sheet and is already multi-threaded. It's a no-brainer that Excel should be accelerated with GPUs.

Also keep in mind that GPUs are really just devices designed for doing huge mathematical computations in real time. The impact that such accelerated mathematics could have on software should not be underestimated. The massive and inexpensive power of GPUs could lead to a great empowerment of computationally intensive statistical methods that were previously unusable in many places. Perhaps the web browser of the future will be more intelligent, with a machine learning algorithm accelerated on a GPU with thousands of streaming processors?

So I think Intel should be very worried. CPUs have already effectively been demoted to "boring" problem domains like procedural control applications, operating systems, etc.

"2. OS scheduling for GPUs is nonexistent. If you want an application to get scheduled on a GPU, you need to not only write it with that GPU's API (e.g. CUDA or OpenCL), but you also have to schedule it yourself in another application. I.e., multitasking isn't possible in the same sense as normal applications."

Multiple programs can and do run on the same GPU. For example, the operating system needs to draw windows on the screen, a game might be rendering some images, and at the same time an OpenCL application might be crunching numbers, all on the same GPU and sharing the same GPU memory. The device driver takes care of this scheduling. Secondly, the new Fermi architecture from NVIDIA supports "concurrent kernel execution", which allows a single process to launch multiple GPU programs that are scheduled by the hardware itself to maximize performance.

Anonymous (2010-03-31 23:33):

GPUs lose an order of magnitude or more when they need to go across the bus to slow DDR instead of fast, expensive GDDR. Many enterprise workloads require working sets in the tens or hundreds of gigabytes already.
If the next GPUs plug directly into HT/QP (and memory speeds come up enough) they might get more interesting.

Ilmari Heikkinen (2010-03-31 21:36):

Or rather, my prediction is that the CPU is either turning into a CPU-GPU hybrid or getting integrated into the motherboard, much like today's audio chips. The pure-CPU speed race is probably going to die off as performance plateaus.

I might be spectacularly wrong here, of course. But how the heck are CPUs going to catch up with that 30x-and-increasing performance gap?

Ilmari Heikkinen (2010-03-31 21:17):

@Anonymous: Yes, if your application doesn't do a lot of computation, it doesn't need a lot of computational power.

Bear in mind that a CPU isn't going to be fast on those kinds of apps either. Or getting faster, for that matter. CPUs seem to be dropping clock speeds and increasing core counts; see e.g. the Opteron 6174 with 12 cores @ 2.2GHz. Similar single-threaded performance to a P4 from five years ago, but around 12x the multi-threaded performance.

Fast code is parallelized, vectorized, has flat streaming memory access patterns, and does lots of computation per byte of memory accessed. Both on the CPU and the GPU.

If you do single-threaded random-access pointer-chasing, it's going to be slow on any modern architecture.
And it will keep getting slower in comparison to peak performance.

Anonymous (2010-03-31 19:55):

The spectacular difference between the architectures of CPUs and GPUs means that the CPU will probably remain a vital part of running most software in the future. Comparing GFLOPS between the two architectures serves only to motivate the use of GPUs where they can be well utilised, i.e. graphics processing, matrix multiplication, etc. For the majority of software applications, farming tasks out to GPUs through CUDA means a massive performance loss and an awkward development environment.

If your application is not spectacularly data-parallel, it is unlikely that you will get much out of a GPU.

Ilmari Heikkinen (2010-03-31 19:43):

@Chris: Thanks, and good points. As you said, the software side of GPGPU isn't really quite there yet. Especially the scheduling; I'd hoped we'd left co-operative multitasking in the 90s.

The software-dev dream scenario would be an OpenCL/CUDA/DirectCompute-like cross-platform, cross-GPU, cross-CPU API that would produce fast CPU code (parallelized and vectorized) and take advantage of GPUs. That way you could write code using that API even if you're targeting only CPUs; on GPUs it'd just go faster.

If Intel is doing its strategy based on a worst-case analysis -- "assume that they execute flawlessly" -- the five-year strategy would probably be against something like the above.
"Microsoft DirectCompute 3.0" or such.

If Intel fails at a home-built GPU, they'd at least try to secure some hold on the profits from shipped PCs. I wouldn't be surprised if they started integrating 30GB SSDs on motherboards in a few years. Put it on PCI-E, market it as QuickDrive, eat market share from other storage vendors.

Chris (2010-03-31 17:41):

Nice article, and you've summed up the performance benefits of the GPU over the CPU nicely. However, I'm guessing Intel doesn't have to worry (yet) about these issues because of two factors:

1. It is still hard to write parallel apps that can take advantage of the GPU speedups (and very few commercial apps do this now).

2. OS scheduling for GPUs is nonexistent. If you want an application to get scheduled on a GPU, you not only need to write it with that GPU's API (e.g. CUDA or OpenCL), you also have to schedule it yourself in another application. I.e., multitasking isn't possible in the same sense as for normal applications.

Anonymous (2010-03-19 09:49):

In reality, the Ion can't play games either. It's very, very, VERY slow, and being coupled with a brain-dead CPU doesn't help. What it can do beyond "plain Atom" is play 720p video, if all the planets align and you actually hit the hardware-accelerated path.