Tips on common compilers

Hi Fellow MacResearchers,

I thought I'd share some tips on common compilers that are available on Mac OS X. I've picked these up over the years, and they are mostly related to their optimization capability. Please leave a comment or two if you have other relevant tips or corrections to report. Thanks!

For now, I'll be talking only about Fortran and C, which happen to be the most common languages in use for scientific computing. The goal here is to discuss how to use the compilers to create the fastest executable. Let me jump right in.

POWERPC MACS (G4/G5)

Common compilers are: gcc, gfortran, g77 (open source) and xlc, xlf (commercial)

gcc: This is part of the Apple-supplied Xcode Tools. The best flag I have found is the Apple-provided option "-fast". However, beware that this produces a G5 executable that may crash on a G4! If you plan to deploy your application on a G4, add one more flag, i.e. "gcc -fast -mcpu=G4", and the result should work on both a G4 and a G5. In addition, if you want the compiler to automatically optimize your code for the G4/G5's Velocity Engine (AltiVec), include the additional flag "-ftree-vectorize".
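
As a rough sketch of the advice above, the small shell helper below picks the gcc flags depending on whether the binary must also run on a G4. The function name gcc_flags and the file names myprog and myprog.c are made up for illustration; the flags are the ones discussed above, and "-fast" exists only in Apple's gcc, so the script prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the gcc flags discussed above.
# $1 = "g4" if the executable must also run on a G4; anything else means G5-only.
gcc_flags() {
    flags="-fast"
    if [ "$1" = "g4" ]; then
        # -fast alone targets the G5; -mcpu=G4 keeps the code G4-safe.
        flags="$flags -mcpu=G4"
    fi
    # Ask the compiler to try auto-vectorizing loops for AltiVec.
    flags="$flags -ftree-vectorize"
    echo "$flags"
}

# Print the command line rather than running it (myprog.c is a placeholder).
echo "gcc $(gcc_flags g4) -o myprog myprog.c"
```

Calling gcc_flags with any other argument leaves out "-mcpu=G4" and gives the G5-only set of flags.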

gfortran: Since Apple does not develop gfortran, "-fast" unfortunately doesn't work with it. You should experiment a little here, but for me the best combination appears to be "-O5 -funroll-loops". (As far as I know, GCC-based compilers treat any level above "-O3" as "-O3", so "-O3 -funroll-loops" should be equivalent.) As with gcc, you can also use the auto-vectorizing flag "-ftree-vectorize".
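
One way to do the experimenting is to build and time the same program with a few flag sets. The sketch below only prints the commands to try; the helper name candidate_builds and the source file myprog.f90 are placeholders, and the list of flag sets is just a starting point, not a recommendation.

```shell
#!/bin/sh
# Hypothetical helper: print one build-and-time command per flag set to try.
# myprog.f90 is a placeholder for your own source file.
candidate_builds() {
    for flags in "-O2" "-O3" "-O5 -funroll-loops" "-O5 -funroll-loops -ftree-vectorize"; do
        echo "gfortran $flags -o myprog myprog.f90 && time ./myprog"
    done
}

# Print the commands so they can be reviewed and pasted into a terminal.
candidate_builds
```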

g77: Just like with gfortran, "-fast" doesn't exist, but I've had some good luck with "-O5 -funroll-loops". Unfortunately, g77 does not support auto-vectorization.

xlc: This is IBM's C compiler for Mac OS X, and it does an extremely good job of optimizing code for the PowerPC. Unfortunately, this product is no longer being developed, although you can still purchase it. A good combination of flags for this compiler is "-O5 -qtune=auto -qarch=auto -qunroll=auto". This compiler does not auto-vectorize, but it can auto-parallelize, i.e. optimize your code automatically to take advantage of dual (or quad) processors. To enable auto-parallelization, use "xlc_r -qtune=auto -qarch=auto -qunroll=auto -qsmp". Then, to control the number of processors the executable runs on, set the OMP_NUM_THREADS environment variable. For example, to use two processors, type "export OMP_NUM_THREADS=2" before you run the executable.
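
Put together, a session might look roughly like the sketch below. The build line is left as a comment because it needs xlc_r installed, and myprog is a placeholder name; the child shell at the end only demonstrates that programs started afterwards inherit the setting.

```shell
#!/bin/sh
# Build step (comment only; requires IBM's xlc_r, and myprog.c is a placeholder):
#   xlc_r -qtune=auto -qarch=auto -qunroll=auto -qsmp -o myprog myprog.c

# Choose how many processors the auto-parallelized executable may use.
export OMP_NUM_THREADS=2

# Any program started from this shell now inherits the setting.
# A child shell stands in for ./myprog here.
sh -c 'echo "child process sees OMP_NUM_THREADS=$OMP_NUM_THREADS"'
```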

xlf: This is IBM's Fortran compiler, and it works almost exactly the same way as xlc. The same combination of flags mentioned above works very well here.

INTEL MACS

Common compilers are: gcc, gfortran, g77 (open source) and icc, ifort (commercial)

For gcc, gfortran and g77 use the same tips as mentioned above for the PowerPC case.

icc: This is Intel's C compiler, and it generates excellently optimized executables. It is also extremely fast at building even very large codes; compile time is negligible with this compiler! It has a "-fast" option too, which produces an executable tuned for the system it is compiled on, and I have found that to be the best option to use. This option automatically enables auto-vectorization, i.e. it optimizes for the SSE vector processing unit, and in doing so it really improves the code's performance. The vector units in Intel Macs can perform double-precision floating-point operations (as opposed to AltiVec, where only single-precision operations are possible). This has been a huge advantage in terms of speed, at least for my codes, because, like much of scientific computation, my research codes need double-precision accuracy. Furthermore, this compiler can also auto-parallelize, letting your plain old serial code take advantage of multiple processor cores. To enable that option, use "icc -fast -parallel". As before, use the environment variable OMP_NUM_THREADS to set the number of processors at runtime.
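
To see how an auto-parallelized executable scales, it can help to run it with several thread counts in a row. This is a minimal sketch: run_with_threads is a made-up helper, the build line is a comment because it needs icc, and the "sh -c" command is a stand-in for your own executable.

```shell
#!/bin/sh
# Build step (comment only; requires Intel's icc, and myprog.c is a placeholder):
#   icc -fast -parallel -o myprog myprog.c

# Hypothetical helper: run a command with a given thread count
# without changing OMP_NUM_THREADS in the calling shell.
run_with_threads() {
    n="$1"
    shift
    env OMP_NUM_THREADS="$n" "$@"
}

# Replace the sh -c stand-in with something like: time ./myprog
for n in 1 2 4; do
    run_with_threads "$n" sh -c 'echo "running with $OMP_NUM_THREADS thread(s)"'
done
```

Using env keeps the setting local to each run, so the loop does not leave a stale OMP_NUM_THREADS behind in your shell.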

ifort: This is Intel's Fortran compiler and its flags are very similar to those of icc, as mentioned above. I have had excellent performance using the options "-fast -parallel" here as well.

I hope you find some of this useful. Please leave comments, corrections or your experiences with compilers on OS X. Thanks!

Comments

mpicc and gcc optimizations together?

Thanks for your posted tips! I used them to compile mrbayes. Boy, does it run a bit faster.

Being new to this I just realized that I can use the '-fast' and '-ftree-vectorize' with mpicc. Which is really nice.

-Thanks again!
-Mike

I'm glad this helped you.

I'm glad this helped you. Yes, if your mpicc is using gcc at the back-end, you should be able to use all gcc options.

MrBayes

Hi,

Please feel free to share details of any performance gains, I'm sure other readers would be interested to hear.

Performance Gains

Well, I ran a couple of trials of MrBayes using MPI on my dual 1.42 GHz G4, using default and optimized compile settings. These are my results for these MrBayes settings:

begin mrbayes;
set autoclose=yes;
charset 1st_pos = 1-4170\3;
charset 2nd_pos = 2-4170\3;
charset 3rd_pos = 3-4170\3;
partition by_codon = 3:1st_pos,2nd_pos,3rd_pos;
set partition=by_codon;
prset ratepr=variable;

mcmcp
savebrlens=yes
ngen=300000;
mcmc;
sumt
burnin=3000
contype=halfcompat;
end;

Results for no optimization flags:
Run 1: 3 hours 22 minutes
Run 2: 3 hours 20 minutes

Results for optimization flags using: OPTFLAGS ?= -O3 -fast -mcpu=G4 -mtune=G4 -ftree-vectorize:
Run 1: 2 hours 47 minutes
Run 2: 2 hours 56 minutes

So, with the optimization settings I sped up this run by about 24 - 35 minutes, which is not so bad. That is about a 12 - 17% gain in performance with the above settings. I probably get this wide range due to the stochastic nature of the program? Or perhaps my computer was running something else in the background at the time (middle of the night)?
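
For anyone who wants to check the arithmetic, the percentages follow from the run times converted to minutes (3h22 = 202, 3h20 = 200, 2h47 = 167, 2h56 = 176). A quick sketch, where percent_saved is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: percentage of run time saved,
# given the baseline and optimized times in minutes.
percent_saved() {
    awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.1f\n", 100 * (base - opt) / base }'
}

# Smallest gain: fastest unoptimized run (3h20) vs. slowest optimized run (2h56).
percent_saved 200 176
# Largest gain: slowest unoptimized run (3h22) vs. fastest optimized run (2h47).
percent_saved 202 167
```

This prints 12.0 and 17.3, which matches the 12 - 17% range quoted above.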

I think these will help with my larger runs, which usually take 2-3 days on a MacPro w/ MPI (5 million generations on my larger data sets) w/o the compile optimizations. So, the vectorization options should cut that down by a large amount.

I will have to play around with RAxML, PHYML, and PHYLIP to see if I get any performance gains there.

Does anyone know how the '-fast' option really works on Intel-based Macs? I was going to use 'OPTFLAGS ?= -O3 -fast -ftree-vectorize' as my options on our quad MacPro machine when compiling software. So, if anyone thinks there are any other options that I can use, let me know.

-Cheers!
-Mike

g77 on intel

Is there a g77 that works on Apple Intel?

I've had to use g95 and gfortran

g77 for Mac/Intel

There's a link for an intel binary in http://hpc.sourceforge.net/

Re: MrBayes

Well, based on my preliminary testing on my PowerMac G4 (Dual 1.42 GHz), MrBayes ran about 20% faster using the following optimization flags:

" -O3 -fast -mcpu=G4 -mtune=G4 -ftree-vectorize "

as opposed to the default flag:

" -O3 "

This will definitely save time on my larger runs when using all 4 cores on the MacPro via mpich2!

-Cheers!
-Mike

MrBayes

Mikey,

We just got a new dual quad-core Xeon Mac Pro and have been running the standard Mac version of MrBayes, and it is incredibly slow. I am a noob to Macs and to compiling. We installed the Mac LAM package and have been trying to compile, but we are getting nowhere. I have edited the 'makefile' file as instructed on the MrBayes wiki page, but now what? We need step-by-step instructions for noobs.

I have tried opening a terminal shell, cd'ing into the UNIX MrBayes directory and typing 'makefile', and nothing happens. HELP!

Brian

mrbayes runs so slow on snow leopard

I have been running MrBayes on a PC, and I have a job that runs in 25 hours. Now, on my new top-of-the-line MacBook Pro (8 GB RAM, 3.06 GHz Core 2 Duo processor, etc.), I tried to test it by running the same job, and it will now take 833 hours. Something is wrong here, as I was expecting the Mac to rip through the analysis. Please help...

thank you,

Mark

MrBayes MPI

Hi Brian,
I am in the same boat. I have been trying to compile MrBayes on a new dual quad-core Xeon Mac Pro with the LAM package and am getting nowhere. Did you ever get a step-by-step guide to the installation? Or is there another way around the problem?
thanks

Pete

Long time

Sorry, I have not checked this site in a long time. But back before Leopard / Snow Leopard I always used MPICH2. I found it much easier to use and install than LAM. See:
http://www.mcs.anl.gov/research/projects/mpich2/

Now you do not need these, as Open MPI is built into Leopard / Snow Leopard. Just set the MPI option in the makefile and go. However, the sumt command always seems to cause a segmentation fault or crash the program, and then I have to take it to an older computer to finish the sumt.

However, there seems to be a better version of mrbayes out now, see:
http://wwwkramer.in.tum.de/exelixis/software.html

I do not have an Intel compiler (ICC), though. Has anyone used this version yet? Or could anyone mail me a compiled version so that I can run it on a 4 / 8 core MacPro?

I tend not to use MrBayes much anymore, as it is effectively useless on large data sets (I use FastTree, RAxML or PhyML instead), partly because it takes a week or more to run and nothing ever converges. I would like to see how well the ICC version that I linked to above works.

Cheers!