Multicore Eroding Moore's Law

Author Dean Dauger
Dauger Research, Inc

Do you seek ever increasing speed from your computers? Do you think multicore chips are a sign of a healthy chip industry? In major trade journals, most articles on the subject seem to uncritically accept multicore as the processor solution moving forward, without suggesting any viable alternative. Meanwhile, the chipmaker giants, Intel, IBM, and AMD, increasingly emphasize multicore chips. But I believe the hardware problems that were serious enough to prompt such fierce competitors to agree on multicore herald a major transition in the computer industry whose full consequences, especially for software writers, are not yet appreciated.

Intel Core

The reign of the single-processor computer is over

The popularized version of Moore's law, which expects performance per processing element to double every couple of years, has ended. While Dr. Gordon Moore's original observation of doubling transistor count every 18 to 24 months still holds true, as of 2006 the popular expectation of doubling performance per processing core does not. Intel, as well as IBM, AMD, and others, could not produce faster processors because the chips ran too hot. Therefore Intel saw it necessary to produce a chip with two cores to increase total performance. But, read another way, that decision says Intel ran out of ideas to improve performance per chip save one: the technology of "copy and paste".

Processor makers today offer chips with multiple "Core"s for a good reason. In 2004, the microprocessor manufacturers industry-wide encountered a wall in the form of heat issues preventing higher clock speeds, with Intel delaying, then canceling a 4 GHz chip. In 2005, Apple surprised the world when it decided to switch from PowerPC to Intel because it saw a growth path in Intel processors that IBM did not pursue with PowerPC. In 2006, Intel introduced their Core Duo processor, a chip that, as far as a software writer is concerned, has two processors. At the same time, they announced that a 4-core chip was forthcoming, and both AMD and Intel slated 8- and 16-core chips for 2007 and beyond. Intel CEO Paul Otellini even previewed an 80-core chip as a way of pointing to the future. Others have since prototyped 64-core chips.
Intel CEO Otellini with 80 cores

As before (for example, with the introduction of SIMD units like AltiVec and SSE), chipmakers have recast their problems in a way that transfers them to software writers. This time software parallelism must carry on where hardware parallelism cannot. As far as a software programmer is concerned, a core is a processor. Software writers, in order to make the best use of multiple-core hardware, must choose an efficient parallel programming paradigm and use it carefully.

Is shared-memory multithreading the answer to multicore?

By far the most commonly discussed programming method for exploiting multicore is multithreading. This parallel computing paradigm assumes many concurrent processing threads that all have shared access to all memory. The problems with this approach are two-fold:

  • Software

    Because memory is shared, threads may step on each other's work, potentially giving erroneous results randomly. Determinism, formerly a defining feature of a computer, is easily obliterated, requiring the programmer to track down and eliminate such non-determinism. E. A. Lee of UC Berkeley writes in "The Problem with Threads": "... we in fact require that programmers of multithreaded systems be insane." I highly recommend to anyone the thorough and thoughtful analysis in Lee's article.

    Creators of typical shared memory implementations recognize that such machines are inherently nondeterministic, so their solution is to have the software programmer apply mechanisms to prune away this nondeterminism. Specifically, the shared memory with threads approach would use either a locking or a semaphore mechanism which can too easily negate parallel performance.

    When a race condition occurs unforeseen, the output is random and not repeatable, so diagnosing a problem whose symptom is never the same twice is very frustrating. There could even be circumstances that cause the right answer to result on one system, but random answers to appear on another, further frustrating the writer. Such shared memory issues are very difficult to isolate and solve.
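    The lost-update problem and its lock-based remedy described above can be sketched in a few lines. This is a minimal illustration using Python's standard threading module, not code from any system discussed in this article; the Counter class, its method names, and the thread and increment counts are all hypothetical. The unlocked version may or may not lose updates on a given run, which is exactly the non-repeatability described above, while the locked version is always correct but serializes every increment.

```python
import threading


class Counter:
    """A shared counter touched by several threads (hypothetical example)."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add_unsafe(self, n):
        # Read-modify-write with no protection: another thread may update
        # self.value between the read and the write, and that update is lost.
        for _ in range(n):
            current = self.value
            self.value = current + 1

    def add_locked(self, n):
        # The lock makes each increment atomic, at the cost of forcing
        # the threads to take turns.
        for _ in range(n):
            with self._lock:
                self.value += 1


def run(method_name, threads=4, increments=50_000):
    """Run the named Counter method on several threads; return the final count."""
    counter = Counter()
    workers = [
        threading.Thread(target=getattr(counter, method_name), args=(increments,))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


expected = 4 * 50_000
locked_total = run("add_locked")   # always equals expected
unsafe_total = run("add_unsafe")   # never more than expected, sometimes less
print(expected, locked_total, unsafe_total)
```

    Whether the unsafe total actually falls short depends on the interpreter, the machine, and timing, so a run that happens to print the right answer proves nothing about correctness.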

    Proponents of shared memory would argue that writing code for such systems is easier than using distributed memory with message passing. While for trivially parallelizable examples the code would appear easier, most of the interesting problems have inherent internal dependency that cannot be eliminated. Fundamental issues occur when applying the shared memory paradigm to such problems. Most language-based multithreaded solutions obfuscate these data dependencies, resulting in code with cryptic directives and hidden effects, all the more confusing for software writers.
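    The difference between a trivially parallelizable loop and one with an inherent dependency can be made concrete with a small sketch. The function names below are hypothetical, and the thread pool only stands in for "several workers"; the point is the data dependency, not the speed. Squaring each element needs nothing from its neighbors, so splitting the work into chunks is safe. A running total needs the result of the previous element, so the same naive split silently produces wrong answers.

```python
from concurrent.futures import ThreadPoolExecutor


def square_all(values):
    # Each output depends only on its own input: no dependency between elements.
    return [v * v for v in values]


def running_total(values):
    # Each output depends on the previous output: a loop-carried dependency.
    totals = []
    acc = 0
    for v in values:
        acc += v
        totals.append(acc)
    return totals


def split(values, parts):
    # Cut the list into at most `parts` contiguous chunks.
    size = max(1, (len(values) + parts - 1) // parts)
    return [values[i:i + size] for i in range(0, len(values), size)]


def chunked(func, values, parts=4):
    # Naive parallelization: apply func to each chunk independently,
    # then concatenate the per-chunk results in order.
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(func, split(values, parts)))
    return [x for chunk in results for x in chunk]


data = list(range(1, 13))
independent_ok = chunked(square_all, data) == square_all(data)
dependent_ok = chunked(running_total, data) == running_total(data)
print(independent_ok, dependent_ok)
```

    The running total can still be parallelized, but only by restructuring the algorithm so that each chunk is told what came before it; the dependency has to be handled explicitly rather than hidden.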

    OpenMP Programming Directives

  • Hardware

    Data is commonly served from memory using a shared bus that easily can be overwhelmed by transaction requests of the processing cores. Beyond 16 cores, this memory bus is so taxed that hardware makers must design much more expensive, complex technology to compensate. Another reference I recommend is Chapter 6 of In Search of Clusters by Gregory F. Pfister, who provides an excellent description of the complications a hardware designer encounters to maintain cache coherence and data rates between processors and memory of such hardware.

    Except for the most data-independent problems, memory contention and data congestion were already issues on the most recent single-processor personal computers. Isn't that why Apple invested so much in making the system bus, memory, and I/O of the first Power Mac G5 faster than any Mac that came before? Now we have a Mac Pro whose 8 cores can easily overwhelm an even-faster system bus.

    This problem occurs not only for scientific problems but also for Apple's latest H.264 compression algorithms for HD video. Benchmarks show that the Apple-supplied H.264 QuickTime compressor flatlines beyond 4 cores and is no faster when using 8 cores. The benchmarks show only 20% parallelism when using 4 cores versus 2 cores. Clearly some sort of data bottleneck is holding performance back, despite Apple software writers' advanced skills. (It turns out, except for game developers, few apply multithreading well. And yet 16-core chips are to come?)
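    A very simple model shows how a shared bus could produce this kind of flatline. It is an idealized sketch, not a measurement: the function name, the bus capacity, and the per-core demand below are hypothetical numbers chosen only to make the shape visible, and a real machine with caches and memory controllers behaves less cleanly. The assumption is that every core needs a fixed data rate to stay busy and the bus can deliver only so much in total.

```python
def modeled_speedup(cores, bus_capacity, demand_per_core):
    """Speedup over one core when a shared bus caps total data traffic.

    Idealized model: each busy core needs `demand_per_core` units of data per
    second, and the bus delivers at most `bus_capacity` units per second.
    """
    cores_the_bus_can_feed = bus_capacity / demand_per_core
    return min(cores, cores_the_bus_can_feed)


# Hypothetical figures, not taken from any real Mac Pro or benchmark.
BUS_CAPACITY_GB_S = 8.0
DEMAND_PER_CORE_GB_S = 2.0

speedups = {
    n: modeled_speedup(n, BUS_CAPACITY_GB_S, DEMAND_PER_CORE_GB_S)
    for n in (1, 2, 4, 8, 16)
}
for n, s in speedups.items():
    print(f"{n:2d} cores -> modeled speedup {s:g}")
```

    With these made-up numbers the bus can feed four cores, so the modeled speedup stops growing at four no matter how many more cores are added.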

    "Those who do not listen to history are doomed to repeat it."

    A central debate of high-performance computing (HPC) in the 1990s was between two camps: shared memory with threads versus distributed memory with message passing. Silicon Graphics, Inc. (SGI) became the prime corporate advocate of the shared memory approach. While other companies like Intel, Cray, IBM, and Fujitsu abandoned shared memory in favor of distributed memory in their HPC offerings, SGI, at its peak, built impressive boxes with 256 processors all sharing memory, at great expense. SGI helped build the technologies we know today as OpenMP and OpenGL; its approach worked well for graphics because such applications are often easy to parallelize. But, when asked to build 1024-processor systems, even SGI would have to build 4 nodes of 256 processors each, connected via a network. Software and hardware layers would recreate the illusion of a shared memory system, but its speed would at best be limited by the network, just as for message passing. While today distributed memory systems operate with several thousand processors, SGI itself encountered a practical limit to the pure shared memory approach it advocated.

    SGI Origin 2000 cluster

    So in the 21st century, what was SGI's fate? SGI was unable to produce technology sufficiently more powerful and more economical than its competitors in the HPC arena. Likewise, the corporate fortunes of SGI fared poorly, with stock prices under 50 cents per share after 2001, sinking below NYSE minimums. In May 2006, SGI filed for Chapter 11 bankruptcy and announced major layoffs, leaving its stock worthless. Only in the following October did the company find new life after a complete financial overhaul.

    Today, Intel is advocating threads, but that also means Intel is tacitly advocating the shared memory approach championed by SGI from the 1990s until it declared bankruptcy. The essential comparison I see is this: Intel is following the path SGI has already trodden, only this time, because of Intel's microprocessor dominance, the personal computer industry is following the path to doom blazed by SGI. And we already know where the old SGI's path ended: technological and financial ruin.
    Does this make sense?

    The End of Single-Processor Computing is Near

    Meanwhile, all new Macs and nearly all new PCs are multicore, and the most discussed approach to apply multicore is that of yesteryear's doomed SGI.

    What is HPC using?

    For a decade, clusters and supercomputers, well-represented at major supercomputing centers around the world, have adopted the MPI standard on distributed memory hardware. Today, distributed memory MPI is the de facto standard at the San Diego Supercomputer Center, National Center for Supercomputing Applications, National Energy Research Scientific Computing Center, Lawrence Livermore National Laboratory and many more. In their annual reports, these organizations highlight what they accomplish with their hardware, and it goes without saying that these applications use the distributed-memory MPI approach. Even the new SGI's product list includes the distributed-memory message-passing hardware design and MPI support. While novel alternatives exist on the horizon, HPC practice implies that distributed memory with message passing is the best multiple-processor programming paradigm yet.

    My prediction is that, sooner or later, the entire industry will evolve to use some sort of message passing, perhaps something MPI-like, to wrangle all those processors. (IBM Cell processors have interconnects between SPEs, suggesting the possibility.) So why not get a head start and use multiple processors the way HPC does today, bypassing the forthcoming Multithreading Meltdown? With the prospect of more systems with four, eight, or more processors in a machine, it is left to the software programmers to use these systems efficiently. Although the personal computer industry is trending towards shared memory with threads, the HPC industry has already shown that path is a mistake. The need for programming parallel computers has arrived at our desktops, and lessons learned from scientists who apply computing can show how to use multicore, and beyond, if heeded.

  • Comments

    Great article

    Thanks for providing such a good perspective on multicore.

    A bit oversimplistic?

    An interesting take on multithreading, but I find it a bit oversimplified. There are a number of assertions that I don't fully agree with:

    There are probably many reasons for SGI's demise, but in my experience, poor quality hardware or parallelism was not one of them. I've been using an SGI Origin with around 1000 processors in my research for about 5 years, with both OpenMP and MPI software. I found large scale OpenMP applications scale reasonably up to around 64 processors. MPI scales a bit better, but not drastically so.

    The SGI in question has just been replaced by a new IBM Power5+ system. That too has shared-memory nodes of 16 processors. So shared memory is not just a dumb SGI idea that Intel is blindly following; it has been used by all vendors, including big blue, for many years, and with great success.

    It's true that for certain types of applications, MPI is a better choice. If your software does not require much communication between processes, a loosely coupled, distributed memory machine is a good choice. In this case, a cell-based system could be appropriate.

    But there are many other applications that are better suited by tight-coupled, threaded systems. If there is a lot of data communication between threads/processes, a shared memory system can achieve much higher transfer rates, because there is no network involved. (You are very critical of memory bottlenecks in shared memory systems, but completely gloss over the network bottlenecks of distributed memory systems.)

    Even if I were to agree that MPI was a better choice for most scientific software, and maybe even video conversion, it doesn't mean it is a good choice for most desktop applications, or as the basis of a desktop computer like the Mac. I regularly develop scientific software with MPI and OpenMP, but I don't look forward to the day when I am forced to use MPI in a Cocoa app.

    Earlier today, my colleague at MacResearch, Gaurav Khanna, summarized my feelings quite well. He said he doubted either of the approaches in use today would be the ultimate solution. I agree.

    If I had to guess what type of programming model we might be using in 10 years' time to deal with the parallelism available, I would guess it might be something like distributed objects in Cocoa: a high-level MPI. Where MPI is just low-level passing of data, distributed objects is much more intelligent, and transparent to the developer. And it is much less prone to the threading problems you discussed. (For an interesting discussion of distributed objects, check out the latest Late Night Cocoa podcast.)

    Drew

    ---------------------------
    Drew McCormack
    http://www.maccoremac.com
    http://www.macanics.net
    http://www.macresearch.org

    Nice article

    I agree with many of your comments, Dean. Multi-core doesn't work very well for my research work because of the memory bandwidth limitations. I mostly use MPI on large supercomputers for my work.

    But, I'm curious about some possibilities. One plus of these multi-core chips is that they have an ample amount of cache available. Are there any upcoming technologies that utilize that more effectively? I know there are cache-aware algorithms, but I was wondering about some serious infrastructure to utilize this fast cache.

    Also, I'm curious to hear what people think about the alternate GPU/Cell architectures. They appear to have more control over what to pull from main memory and compute on. Used wisely, that can help overcome these memory bandwidth issues. To me, it looks very promising.

    Overall, just as Drew remarked, my hope is that in the future there is a better way to approach the whole parallelization paradigm ..

    I agree

    I just have to echo some of Drew's feelings on the article. I also find it interesting that your website is selling clustering tools that rely on MPI. Hardly seems unbiased.

    Like Drew, I work with code on various multi-processor computers (ranging from a 4-core MP, to a 1024 node BlueGene/L) every day. OpenMP and MPI are two very different tools, for two very different purposes. To say that one is better than the other, is like saying a hammer is better than a screwdriver. You wouldn't use a screwdriver to put in a nail, but you also wouldn't use a hammer to drive a screw.

    Many of your complaints about shared memory systems are that you can write buggy/inaccurate code. I have seen (and, I confess, have written ;) ) my fair share of buggy/inaccurate MPI code as well. It all comes down to 1) understanding your problem, 2) picking the best tool for the job and 3) understanding the tool you are using.

    Agreed

    I'll chime in with my agreement too - this article is unfairly harsh on OpenMP and shared memory machines. It's correct to say that threads can accidentally stamp on each other's variables. But I've never found that to be a problem, and it's the only problem which OpenMP has which MPI doesn't. MPI programs can deadlock and be non-deterministic just as well as OpenMP ones.

    Furthermore, as pointed out above, the memory timing problems in OpenMP are present in MPI too. In fact, they are writ so large that the programmer is forced to consider them: if you want data from another process, you have to request it explicitly, rather than relying on the memory architecture to "make it happen." This means that when writing an MPI program, you have to be very careful about what gets put where, and when you request it. Ultimately this is why MPI programs can scale better than OpenMP ones - MPI relies on the programmer being far smarter. Similarly, a human will always be able to write faster assembly code than any compiler.

    Interesting Points

    Dean raises an interesting point. I agree with him on one thing: the current trend (!) puts more responsibility on software developers to attack large-scale computations, rather than relying on a more involved hardware solution.

    I have been developing analysis software (finite element analysis) for the last five years, and I am really interested in what OpenMP and MPI are, and I wonder how I can benefit from these technologies. One thing is clear to me: developers (including me) need to "think parallel" to exploit these tools, which is a very new concept to me.

    Like many people, I have been trying to understand all of these technologies.

    bulent

    P.S. Recently, I came across an example of solving over 1 million equations. While I was looking into how we could use OpenMP and MPI to gain more computational power, I found a super-fast solver, which completed the task in less than 8 minutes. Obviously, it is hard to draw any solid conclusion from this example, but I believe good algorithms still have great value and deserve further investigation.

    Re: Interesting Points

    "P.S. Recently, I came across an example of solving over 1 million equations. While I was looking into how we could use OpenMP and MPI to gain more computational power, I found a super-fast solver, which completed the task in less than 8 minutes. Obviously, it is hard to draw any solid conclusion from this example, but I believe good algorithms still have great value and deserve further investigation."

    This brings up a very good point. I spend much of my time working with engineers/academics who are trying to parallelize their code. My first question is: have you exhausted all other possibilities? From the scientific computing view, one needs to take into account a whole lot of overhead when converting an existing serial code to parallel. Overhead not only in the processing, but also in the time it will take to convert your code! If it'll take 3 weeks to run your sims on your desktop using your existing code, and 4 weeks to convert it to a validated parallel code that will run in 2 hours on a cluster, and it is a one-off run, there is no reason to go through the conversion.
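    The arithmetic behind that example is worth writing out. The sketch below uses the figures from the paragraph above (3 weeks serial, 4 weeks to convert, 2 hours per parallel run) and assumes, purely for illustration, that a "week" means seven full 24-hour days of elapsed time; the function names are hypothetical.

```python
import math

HOURS_PER_WEEK = 7 * 24  # assumption: weeks counted as elapsed wall-clock time


def total_hours_serial(runs, serial_run_hours):
    # Keep the existing code and simply run it `runs` times.
    return runs * serial_run_hours


def total_hours_parallel(runs, conversion_hours, parallel_run_hours):
    # Pay the conversion cost once, then run the parallel code `runs` times.
    return conversion_hours + runs * parallel_run_hours


def break_even_runs(serial_run_hours, conversion_hours, parallel_run_hours):
    """Smallest number of runs for which converting takes less total time."""
    saved_per_run = serial_run_hours - parallel_run_hours
    if saved_per_run <= 0:
        return None  # the parallel code is no faster, so converting never pays off
    return math.floor(conversion_hours / saved_per_run) + 1


serial = 3 * HOURS_PER_WEEK      # 504 hours per serial run
conversion = 4 * HOURS_PER_WEEK  # 672 hours to convert and validate
parallel = 2                     # hours per run on the cluster

one_off_serial = total_hours_serial(1, serial)
one_off_parallel = total_hours_parallel(1, conversion, parallel)
runs_needed = break_even_runs(serial, conversion, parallel)
print(one_off_serial, one_off_parallel, runs_needed)
```

    For a single run the serial code finishes first (504 hours against 674), but under these assumptions the conversion already pays for itself on the second run.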

    Another example is all of the Matlab code that I see come across my desk. By simply switching from Matlab to a compiled code using some fast math libraries (e.g., MKL), many times codes have seen orders of magnitude increases in efficiency, and are suddenly "fast enough"(tm). Get your serial code running as efficiently as it can, and if it isn't fast enough, THEN look to parallelizing.

    Re: multithreading

    I appreciate and respect your comments. It's a relief to me to see this intelligently discussed. Perhaps in contrast, most articles I see in the trade press today (IEEE Computer, ACM Communications, and others) seem to presume that shared-memory threading like OpenMP is the be-all-end-all of parallel processing. At the same time I see Intel and others producing seminar after seminar on threading. Questioning that presumption is a purpose of this article.

    Desktop application writers have had years to develop multithreading, but their results are poor so far:

    http://www.wired.com/techbiz/it/news/2007/08/multicore

    I think that evidence dims multithreading's prospect, beyond a few processors, for desktop applications.

    I point out H.264 because Apple's media strategy relies on that technology, so they should be putting significant resources into accelerating its encoding. That makes sense because, especially as HD increasingly dominates the video content industry, most new Apple devices (iPhone, iPod, Apple TV) favor some form of it (not to mention iLife '08's new Media Server). But the fruits of their encoding efforts are disappointing.

    You can generate that evidence yourself with this experiment: On an 8-core Mac Pro, encode several minutes of HD video into multipass H.264. Time how long that takes when you set the Processor System Preference to 8, 4, and 2. Those are the benchmarks I mention.

    Threading as a model allows a novice programmer to assume memory access is always fast because the language makes no distinction between a processor accessing data meant only for itself and a processor accessing data from another processor. Chapter 6 of the Pfister reference goes into far greater depth on what the hardware has to do to maintain that illusion or else compromise performance (or price).

    MPI encourages (you could say "forces") the beginning programmer to know when interaction between processors occurs. That activity is clearly distinct from memory access. I think that explicit distinction is a good thing, a "best practice" if you will, because then that programmer can begin thinking about how much bandwidth is needed between processors and what it means to partition a problem for parallel computing.
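    To make that distinction concrete, here is a small sketch of the message-passing style. It is not MPI and uses none of MPI's API: it borrows Python's standard queue and threading modules as a stand-in, and the worker function, the rank numbering, and the sum-of-squares task are hypothetical. What it preserves is the discipline: each worker computes only on data it was explicitly sent, and the only interaction between workers and the coordinator is an explicit send (put) or receive (get).

```python
import queue
import threading


def worker(rank, inbox, outbox):
    chunk = inbox.get()                      # explicit receive of this worker's data
    partial = sum(x * x for x in chunk)      # purely local computation
    outbox.put((rank, partial))              # explicit send of the result


def sum_of_squares(values, workers=4):
    # Partition the problem: each worker gets one contiguous chunk.
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]

    results = queue.Queue()
    threads = []
    for rank, chunk in enumerate(chunks):
        inbox = queue.Queue()
        t = threading.Thread(target=worker, args=(rank, inbox, results))
        t.start()
        inbox.put(list(chunk))               # send a private copy, not a shared reference
        threads.append(t)

    for t in threads:
        t.join()

    # Collect one message per worker, then combine in rank order so the
    # result does not depend on which worker happened to finish first.
    partials = dict(results.get() for _ in threads)
    return sum(partials[rank] for rank in sorted(partials))


total = sum_of_squares(list(range(1, 101)))
print(total)
```

    Every byte that crosses between workers is visible in the code as a put or a get, which is the point at which a programmer can start asking how much bandwidth the partitioning really needs.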

    I think the old SGI had good hardware, in some ways the best in their class, but I think their faith in shared memory contributed greatly to their demise. If, as you say, OpenMP there scaled well up to 64 processors, then the Mac Pro is only 3 doublings away from that (and one doubling away from the IBM Power5+ you cited). After 64, then what would you do? I work with colleagues at UCLA who use MPI on 1000+ processors to run tightly-coupled plasma physics codes.

    Most multicore hardware today relies on a bus (called "North Bridge" in the Mac Pro) to relay data from the cores to memory, forcing all data transmission that cannot be cached down the same bottleneck. Here the North Bridge is serving as a network. The Mac Pro's architecture is diagrammed here:

    http://developer.apple.com/documentation/HardwareDrivers/Conceptual/Mac_Pro_0608/Articles/M43_0906_arch.html

    Distributed memory allows each processor to have its own pipe to its own memory, a pipe distinct from the network, so the two modes of communication are unlikely to interfere with each other. I think MPI benefits from that separation of labor. I'd like to see the Mac Pro designed with a switched network between each core and memory module, but that's expensive, so at the moment it's more like an Ethernet hub. Who today would design a network as a hub?

    Yes, MPI can be used to add nondeterminism. But I'd rather start with a deterministic system and, only if required, add nondeterminism, instead of starting with nondeterminism. Still, I am keeping my eye out for a third choice that has long-term viability.

    But if you think I'm harsh, have you read Lee's "The Problem with Threads"? (For the record, Lee has no connection to me.)

    http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html

    It also appeared in the May 2006 IEEE Computer issue, with "From a fundamental perspective, threads are seriously flawed as a computation model." and "... we must discard threads as a programming model."

    Sounds harsh to me. But what if it's true?

    Multiprocessing code

    "You can generate that evidence yourself with this experiment: On an 8-core Mac Pro, encode several minutes of HD video into multipass H.264. Time how long that takes when you set the Processor System Preference to 8, 4, and 2. Those are the benchmarks I mention."

    So.... how should they be doing it? If it's bottlenecking on bandwidth, then it's going to bottleneck on bandwidth whether the programming model is MPI or OpenMP. Run it on eight single core machines, and you've still got to shovel the raw HD video across some sort of network (not to mention read it from some hard drive in the first place), and that's still going to take time. A better comparison might be an nbody force evaluation, or even a travelling salesman problem. That would eliminate the bottleneck of reading vast amounts of data off a slow hard drive.

    "After 64, then what would you do? I work with colleagues at UCLA who use MPI on 1000+ processors to run tightly-coupled plasma physics codes."

    What's their scaling efficiency?

    "Yes, MPI can be used to add nondeterminism. But I'd rather start with a deterministic system and, only if required, add nondeterminism, instead of starting with nondeterminism."

    Well, whenever I used OpenMP, I started out with a working serial code which is about as deterministic as you can get in this context.

    Related article at American Scientist

    American Scientist, the magazine of Sigma Xi had a related, recent article on Computing in a Parallel Universe.

    I think shared memory and distributed message passing have advantages and disadvantages as mentioned by many others here. I think in the future, we'll also look for strategies from compilers/libraries for facilitating multi-core algorithms. I look at techniques like:

    I'm sure I've forgotten some, but these techniques make it easier to write software in a parallel world -- often regardless of whether it's shared-memory OpenMP or message-passing MPI doing the work.

    While computer scientists have done research on parallel processing, the real "crunch" didn't hit most developers until the recent advent of multicore chips. Now you can't easily buy a computer without a multicore CPU!

    A somewhat tangential comment on cluster computing

    I have to admit that while I found Dean's article interesting, I also found its negative view of SMP somewhat mysterious. I went and read the suggested reference about deterministic computing and I was not convinced that the distributed memory approach was any more deterministic than the shared memory approach. The authors of the journal article were attempting to get people to consider algorithms from a more fundamental point of view - I think. Certainly, in a draconian way, distributed memory is a way to get people to do that. However, I will agree that writing truly parallel algorithms is in reality a lot more of a paradigm shift than many of us probably have time for. What we do have time for (maybe) is parallelizing serial algorithms by using directives, either OpenMP or MPI. While I believe that a lot more computing is going to be using 1000-10,000 processors in ten years, I agree with Drew and others here that it is unclear what approaches are going to raise the ceiling for all types of algorithms, including those that have relatively equal CPU to memory access like CFD. Fortunately it seems clear there is still a great deal of good work to be done using 1 to 50 threads for a lot of interesting problems.

    But what I really wanted to say was that, from our experience here at the National Severe Storms Lab, there are other equally significant issues with cluster computing: running and maintaining the system itself. Since this comment is somewhat off topic, I will keep it short.

    We learned the hard way (again, duh!) several years ago that "you get what you pay for". Yes, cluster systems are a lot cheaper than an SGI Altix or IBM's SMP box - but what was missing then (and I bet still is) was good AND inexpensive management software. If you have a 256-processor machine, generally, you are maintaining 256 systems. And since they are made of commodity parts, when you buy one, about 1% of the parts (of every kind: disks, power supplies, bad memory) break and have to be replaced, which crashes the jobs, brings down your system, forces someone (you?) to figure out what happened, etc. The burn-in time is months, in my opinion. And when you need to upgrade, patch, etc., you are doing that to 256 separate systems. There are some management systems out there, but again, you get what you pay for. Most decent ones end up adding enough cost to the machine to make it as expensive as: you guessed it - your SMP box (at least in the 32-128 processor class).

    So the biggest shift in our thinking here in Norman, after originally having SMP boxes (Origins in the 1990s), then trying 2 different cluster systems, and then returning to SGI to purchase several types of Altix systems, is this: we don't buy SMP boxes because they are SMP (for programming), we buy them because there is ONLY ONE system image to maintain. One OS, one kernel, one set of disk interfaces, power supplies, etc. to fix, upgrade, check, etc. One of our clusters essentially never really worked, so the downtime for that was pretty bad. Our total cost of ownership is much lower with the SMP machines. The reliability you get from the SMP systems (and their maintenance programs) PLUS the hugely lower maintenance workload for our IT folks makes it at least as, if not more, financially and computationally efficient, even though you pay for it up front.

    And we use MPI on it for our big jobs, like the Weather Research and Forecasting (WRF) model.

    To me, what is missing in the commodity offerings is a way forward where the CPU, memory, and I/O are all balanced and scale equally. We don't have that now, and maybe we never have (or it lasted for 3-6 months around 1998 or something...)

    Anyway, the value of my comments, as always, is quite possibly proportional to what you paid to see them (e.g., zero, zip, nada, etc.)!