how to determine the optimal tiling

Hello all,
I cannot bear this any longer. My run is too slow, doing fewer than 9000 steps per wall-clock day. I read some previous posts in the forum but am still not clear on how to accelerate it. Could anybody with relevant experience help point out which of my configuration choices might be bad? It does not have to be optimal, just free of obvious defects...
model
Grid size 960*768*30
tiling 8*32
3D/4D variables in all files are single-precision (float) type
8 monthly external surface forcing variables; boundary (bry) variables change every 3 days
7-pool biological model + 97 point sources
time step 60 sec
CPU: four 64-core AMD Opteron
Cache: L3 6144K, L2 2048K
Memory: around 3800 MB per core
gfortran 4.4.6 + mvapich2 1.6, flags -O3 -ffast-math
Am I using too many CPUs? Does the compiler matter? Is my tiling bad? I did some random tests but couldn't see any visible performance improvement...
Thanks a lot!
PS: I have another model with a 576*528*35 grid using 128 cores and 2 GB of memory per core. The tiling is 4*32. That run is more than twice as fast as this one....
Update:
Using pgi 12.3 + mvapich2 1.8, the model speed increased to around 43200 steps per day.
I guess I had some optimization problems with gfortran.
Re: how to determine the optimal tiling
Let me guess: based on what you said above ("Grid size 960*768*30 ... tiling 8*32 ... Cpu: four 64-core AMD Opteron ... mvapich2 1.6"), you are running the code in MPI mode on
4 hardware nodes, each of which has 4 CPU sockets, and each CPU is a 16-core Opteron.
The hardware nodes are interconnected by Infiniband (otherwise there is no point in
using MVAPICH).
However, it is not clear exactly what kind of Opteron: that is too many cores for the
Magny-Cours generation, and an L3 cache of 6 MB seems too small for Interlagos. What
exactly is your hardware configuration?
Your 8x32 partitioning of the 960x768 grid results in an MPI subdomain size of 120x24
(960/8 = 120, 768/32 = 24), which does not seem too small, so you are not using too many CPUs.
Most likely the inefficiency is caused by suboptimal MPI-node placement, resulting in
excessive message traffic across the Infiniband interconnect, but again, to fix this one
needs to know your hardware geometry.
Are you specifying a machines.LINUX file or a hostfile to control MPI process placement
when executing mpiexec?
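Just to illustrate (the host names below are made up, so substitute your own): a machines.LINUX file that lists each physical node once with its core count, something like

node01:64
node02:64
node03:64
node04:64

passed to the launcher, e.g.

mpiexec -np 256 -machinefile machines.LINUX ${BIN} ${INPUT}

should pack 64 consecutive MPI ranks onto each physical node, so that neighboring subdomains exchange most of their halo messages through shared memory instead of over Infiniband. The exact file format and option spelling depend on which MVAPICH2 launcher you use (mpd-based mpiexec vs. mpirun_rsh), so check its documentation.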
Re: how to determine the optimal tiling
Hi Sasha
So glad to see you here; you are one of the people I was most hoping would reply. I have read your previous posts on i7 CPU performance and the "poor man's computing" presentation, and they are very helpful and instructive. There are still some points I am not clear about, such as how to estimate, from the input grid, a tile size that fits into the cache, etc.
Back to my current problem: you are right about the CPUs. It is 4 nodes, each with 4 CPUs of 16 cores, and yes, an Infiniband interconnect. The CPUs are AMD Opteron(TM) 6274 processors at 2200.072 MHz (I am a non-root user, so I couldn't get more info).
I use the Moab scheduler to submit the job:
# request 64 scheduler nodes with 4 processors each (256 MPI processes in total)
#PBS -l nodes=64:ppn=4
# request 3800 MB of memory per process
#PBS -l pmem=3800mb
mpiexec -np ${NPROCS} ${BIN} ${INPUT} >> $LOGFILE
Here BIN is the executable and INPUT is the input file.
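If it helps with the placement question: as far as I understand, the hosts the scheduler actually assigns can be listed from inside the job script with

sort $PBS_NODEFILE | uniq -c

which prints each assigned host name together with how many slots it received.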
Thank you very much for your help!
BTW, I tried pgf90 instead of gfortran this afternoon. The model is about 5 times faster than before (though it is still very slow). The flags I used with PGI are -O4 -fastsse -Mipa=fast,inline. I don't know much about optimization flags; I was just following the PGI manual and the FAQ on its website. Do you think I should add anything else? I am trying to turn on -Mconcur now...
Thanks...Yisen
Update: -Mconcur doesn't help. But I get similar performance to PGI using gfortran with -O3 -ffast-math -funroll-loops -ftree-vectorize.
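For the record, the flags above go into the build roughly like this (a simplified sketch; "roms" and the *.f90 list are just placeholders for the real makefile rules):

# gfortran build
mpif90 -O3 -ffast-math -funroll-loops -ftree-vectorize -o roms *.f90
# pgf90 build
mpif90 -O4 -fastsse -Mipa=fast,inline -o roms *.f90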