OpenMP fast on Linux, slow on OS X

Hi there,
Recently I have been compiling some code on a Mac Pro (2x 3.0GHz Quad-Core Intel Xeon) with icc. However, I have been seeing some strange differences between platforms.
For example, I compiled some test code on the same machine, once on openSUSE and once on Mac OS X 10.5.4, both with the Intel compiler and both with OMP_NUM_THREADS=8.

The test problem on OS X not only runs slower (18s wall clock, 120s summed CPU time), but also runs faster with fewer cores.

However, as described above, on the same machine with the same compiler etc., just on Linux, the test problem runs with a wall clock time of 1.2s and a summed CPU time of ~6s!

Has anyone come across a similar problem before? Does anyone know whether the problem is a platform issue or an issue with the platform's version of icc? Does anyone know of any fixes?
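For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a timing test. It is not the original test code: the kernel (a partial sum of 1/i^2) and the names kernel, wall_seconds and time_kernel are made up for illustration. It reports wall clock time next to process CPU time, which on Linux and OS X is summed over all threads when measured with clock().

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical CPU-bound test kernel: partial sum of 1/i^2. */
double kernel(long n)
{
    double sum = 0.0;
    long i;
    assert(n > 0);
    /* Each thread accumulates a private sum; the sums are combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++)
        sum += 1.0 / ((double)i * (double)i);
    return sum;
}

/* Wall-clock seconds; falls back to clock() when built without OpenMP. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Runs the kernel once and reports wall time and process CPU time. */
double time_kernel(long n, double *wall, double *cpu)
{
    double w0 = wall_seconds();
    clock_t c0 = clock();
    double result = kernel(n);
    *cpu = (double)(clock() - c0) / CLOCKS_PER_SEC;
    *wall = wall_seconds() - w0;
    printf("wall %.3fs, cpu %.3fs, result %.6f\n", *wall, *cpu, result);
    return result;
}
```

Built with OpenMP enabled (for example gcc -fopenmp, or the equivalent icc option) and run with OMP_NUM_THREADS set to 1, 2, 4 and 8 in turn, a CPU time that grows much faster than the wall time shrinks would suggest threads spending their time spinning or being rescheduled rather than doing useful work.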

Any help would be most appreciated.

Thanks heaps,

Nic

Are you running icc with

Are you running icc with the shell variables for 64 bits?

The only idea I have

No it's definitely 32bit.

No it's definitely 32bit. I'm pretty sure it's something to do with the thread scheduling within the OS. Anyone with any ideas on how to control this?

Re:No it's definitely 32bit.

This is almost certainly a thread scheduling issue. Unfortunately, there isn't much you as a user can do about this at the moment. There have been a lot of discussions about this issue, and Apple is addressing it for Snow Leopard.

Is there a reason to use 32-bit? If not, you may want to try 64-bit as well. Depending on the constraints of your code, you may see some performance boost, as the 64-bit ABI gets you two things:

1) Away from the stack-based 32-bit ABI (arguments are passed in registers, like on PPC)
2) Access to double the number of registers (for CPU-bound numerical code this can be a win)

Dave

OpenMPI

I am not very familiar with this topic, but I'm using an OpenMPI-enabled program (MEME), and compiling it under OS X (on my MacBook Pro Core 2 Duo) with gcc seems to work perfectly. I haven't compared running times on the same machine under both OSes, but it scales perfectly with the cores under OS X and runs faster (about as expected) than on an older Xeon server running Ubuntu.

OpenMPI is not OpenMP

Maybe you're mistaking OpenMPI for OpenMP. OpenMPI is an implementation of the Message Passing Interface, which connects distributed processes, typically over a network. OpenMP is an API to control multithreaded execution in shared memory on a single machine.
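To make the shared-memory part concrete, here is a minimal OpenMP sketch (the function name fill_squares is made up for illustration). All threads write directly into the same array inside one process; no messages are exchanged, which is what sets it apart from an MPI program.

```c
#include <assert.h>
#include <stddef.h>

/* Fills out[i] = i*i. The iterations are divided among the threads of a
   single process, and every thread writes into the same shared array. */
void fill_squares(long *out, int n)
{
    int i;
    assert(out != NULL && n >= 0);
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        out[i] = (long)i * i;
}
```

Without an OpenMP flag the pragma is ignored and the same code runs serially with the same result.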

Indeed, OpenMPI != OpenMP

Oops, I'm sorry, I was indeed mistaken. :-)

OpenMP gcc vs icc

Hi all

Well, I cannot comment regarding Linux, but I developed some code running on a quad-core Mac Pro at 2.66GHz. The code is written in Objective-C but calls some C/C++ maths routines. One particular routine performs an LU decomposition (no pivoting) of a complex double precision matrix. In my routine I called three successive OpenMP "parallel for" loops and got a noticeable improvement when using the icc compiler. I did not buy the icc compiler (wife drew the line), so I waited one year and recently installed Xcode 3.1 and compiled the same code using gcc 4.2. The same threaded version ran noticeably slower than the unthreaded version. On a matrix of size 1320x1320 it was something like an incredible 10 times slower. I am sorry I do not have the exact values, as I do not have icc installed anymore.

Anyway, I rewrote the code to perform only a single OpenMP "parallel for", but a nested one. It made a big difference. The threaded version now takes 3.08 seconds for a 1320x1320 matrix compared to 5.7 seconds for the unthreaded one. Considering the four cores I would have hoped for a factor of 3 improvement, but I only get about 2. It would seem the GNU OpenMP implementation has a lot of overhead associated with it.
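For readers following along, here is a rough sketch of what an LU decomposition without pivoting with one "parallel for" per elimination step can look like. This is not my original routine, just an illustration; the function name lu_nopivot and the row-major in-place layout are assumptions.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* In-place LU decomposition without pivoting (Doolittle form) of an n x n
   row-major matrix. On return the strict lower triangle holds L (unit
   diagonal implied) and the upper triangle holds U. Assumes no zero pivots. */
void lu_nopivot(double complex *a, int n)
{
    int k;
    assert(a != NULL && n > 0);
    for (k = 0; k < n - 1; k++) {
        const double complex pivot = a[(size_t)k * n + k];
        const double complex *row_k = a + (size_t)k * n;
        int i;
        /* Rows below k do not depend on each other, so one parallel loop
           covers both the multiplier and the row update. */
        #pragma omp parallel for schedule(static)
        for (i = k + 1; i < n; i++) {
            double complex *row_i = a + (size_t)i * n;
            const double complex m = row_i[k] / pivot;
            int j;
            row_i[k] = m;
            for (j = k + 1; j < n; j++)
                row_i[j] -= m * row_k[j];
        }
    }
}
```

Even in this form a team of threads is started and synchronised n-1 times, once per value of k, so the cost of entering and leaving a parallel loop in the OpenMP runtime shows up directly in the total time. Splitting each step into several separate parallel loops multiplies that cost.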

Regards

Andrew

For severely CPU-bound

For severely CPU-bound algorithms, it is not uncommon to see a speed-up that is linear in the number of cores. For example, a two-core system executes the algorithm twice as fast, and a four-core system executes the algorithm four times as fast. Memory-bound algorithms scale based on the memory bandwidth available to the cores. For example, memory-bound algorithms scale up to almost 1.5X on my four-core Opteron system due to its NUMA architecture. Some systems/CPUs are able to immediately context switch to another thread if the core would be blocked waiting for memory, allowing multiple memory accesses to be pending at once, and thereby improving throughput.
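As a small worked example of how far a measurement is from linear scaling, the sketch below computes speed-up and per-core efficiency. The helper names speedup and efficiency are made up for illustration. Applied to the gcc timings quoted in the earlier comment (5.7 seconds unthreaded, 3.08 seconds threaded on four cores), it gives a speed-up of about 1.85 and an efficiency of about 0.46, where linear scaling would give 4 and 1.

```c
#include <assert.h>

/* Speed-up: serial run time divided by parallel run time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: speed-up divided by the number of cores used.
   A value of 1.0 corresponds to linear scaling. */
double efficiency(double t_serial, double t_parallel, int cores)
{
    assert(cores > 0);
    return speedup(t_serial, t_parallel) / cores;
}
```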