Hi,
I am running ROMS 2.0 on a 12-node Linux cluster. The model runs without problems on an 80x120x20 grid, but when I doubled the horizontal resolution to 160x240x20, the run failed during initialization. Here is the model output:
....
INITIAL: Configurating and initializing ...
Node # 0 (pid= 21711) is active.
Node # 2 (pid= 15312) is active.
Node # 3 (pid= 12698) is active.
Node # 1 (pid= 3404) is active.
.....
Centers of gravity and integrals (values must be 1, 1, approx 1/2, 1, 1):
1.000000000000 1.033944429488 0.516972214744 1.000000000000 1.000000000000
rank 3 in job 1697 master_4268 caused collective abort of all ranks
exit status of rank 3: killed by signal 11
rank 2 in job 1697 master_4268 caused collective abort of all ranks
exit status of rank 2: killed by signal 11
------
I tried increasing the stack size with "ulimit -s unlimited", but it did not help.
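One thing that may be worth checking: a limit set with ulimit in the login shell is not necessarily inherited by the processes that the MPI launcher starts on the other nodes. Below is a minimal sketch for printing the stack limit that a process actually sees. It assumes Python is available on the compute nodes; the function name stack_limit is only illustrative and is not part of ROMS or MPICH2.

```python
import resource
import socket


def stack_limit():
    """Return the (soft, hard) stack limits in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    def to_bytes(value):
        # RLIM_INFINITY is how the OS reports "unlimited"
        return None if value == resource.RLIM_INFINITY else value

    return to_bytes(soft), to_bytes(hard)


if __name__ == "__main__":
    soft, hard = stack_limit()
    # Print the host name so the output of different nodes can be told apart
    print(socket.gethostname(), "soft:", soft, "hard:", hard)
```

If this is saved under a hypothetical name such as check_stack.py and started with the same launcher as the model (for example "mpiexec -n 4 python check_stack.py"), each rank reports its own limit. A finite value on some ranks would suggest that the ulimit setting did not reach them.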
I also tried running the model with OpenMP. It did not work on the Linux cluster, but it did work on a dual-processor Linux workstation.
Can anyone offer a suggestion?
By the way, for the 80x160x20 grid, if I run only the hydrodynamic part I can use 12 processors, but if the bio-model is included I can use at most 4 processors. Does anyone know why?
I use ifort 9.0 and MPICH2.
Liejun
ROMS 2.0 with MPI
Shih-Nan, I did as you suggested, but it did not help.
I made a little more progress in debugging the code, but I still have no idea how to solve the problem.
Here is what I found. The program aborts when it either reads 3D variables (u, v, T, S) in nf_fread.F or writes 3D variables in nf_fwrite.F. Note that these two routines are also called for 2D variables (sea level, ubar, vbar) without any problem. Digging into these two routines, I found that the errors come from CALL mp_scatter (in nf_fread) and CALL mp_gather (in nf_fwrite).
mp_scatter and mp_gather are routines used by the master node to scatter/collect data to/from each tiled node. I have no experience in parallel coding and do not understand why they work for 2D variables but not for 3D variables.
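One difference between the 2D and 3D cases that can be quantified is the size of the global field being moved. The sketch below is only arithmetic on the grid sizes quoted in this thread. It assumes 8-byte reals and ignores ghost points, and it does not show how mp_scatter or mp_gather allocate their buffers.

```python
BYTES_PER_REAL = 8  # assumption: double precision values


def field_mb(nx, ny, nz=1):
    """Size in MiB of one global field with nx * ny * nz values."""
    return nx * ny * nz * BYTES_PER_REAL / 2**20


grids = {
    "80x120 (2D)": (80, 120, 1),
    "80x120x20 (3D)": (80, 120, 20),
    "160x240 (2D)": (160, 240, 1),
    "160x240x20 (3D)": (160, 240, 20),
}

for label, (nx, ny, nz) in grids.items():
    print(f"{label:>16}: {field_mb(nx, ny, nz):6.2f} MiB")
```

Under these assumptions a 3D field on the 160x240x20 grid is about 5.9 MiB, which is 20 times the 2D field on the same grid and 4 times the 3D field on the 80x120x20 grid. If a routine holds such a field in a temporary or automatic array, the 3D case could exceed a memory limit that the 2D case stays under.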
I appreciate any help.
Liejun
Unfortunately, the same error persists in ROMS 3.0. I did not debug the code, but the output messages seem to point to the same problem.
.....
INITIAL: Configurating and initializing forward nonlinear model ...
NLM: GET_STATE - Read state initial conditions, t = 0.0000
(Iter=0001, File: cpb_ini_1996_160x240.nc, Rec=0001, Index=1)
- free-surface
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- vertically integrated u-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
- vertically integrated v-momentum component
(Min = 0.00000000E+00 Max = 0.00000000E+00)
rank 3 in job 1778 master_4268 caused collective abort of all ranks
exit status of rank 3: killed by signal 11
rank 2 in job 1778 master_4268 caused collective abort of all ranks
exit status of rank 2: killed by signal 11
rank 1 in job 1778 master_4268 caused collective abort of all ranks
exit status of rank 1: killed by signal 11
I/O on large array?
Have you tried:
#define INLINE_2DIO
Perhaps there are some issues with reading in large 3D arrays all at once. Reading them in level by level might help.
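To illustrate the general idea of processing a 3D field level by level, here is a minimal sketch. It does not reproduce what INLINE_2DIO does inside ROMS. The function read_level is a hypothetical stand-in for reading one horizontal level from a file, and the min/max computation simply mimics the values reported in the log.

```python
def read_level(k, nx, ny):
    """Hypothetical stand-in for reading one horizontal level from a file."""
    return [float(k) + 0.001 * i for i in range(nx * ny)]


def minmax_all_at_once(nx, ny, nz):
    """Load the whole 3D field into one buffer, then reduce it."""
    buffer = []
    for k in range(nz):
        buffer.extend(read_level(k, nx, ny))
    return min(buffer), max(buffer), len(buffer)


def minmax_level_by_level(nx, ny, nz):
    """Reduce one level at a time; the buffer never holds more than one level."""
    lo, hi, peak = float("inf"), float("-inf"), 0
    for k in range(nz):
        buffer = read_level(k, nx, ny)
        lo = min(lo, min(buffer))
        hi = max(hi, max(buffer))
        peak = max(peak, len(buffer))
    return lo, hi, peak


whole = minmax_all_at_once(160, 240, 20)
by_level = minmax_level_by_level(160, 240, 20)
print("same min/max:", whole[:2] == by_level[:2])
print("peak buffer length:", whole[2], "vs", by_level[2])
```

Both versions give the same minimum and maximum, but the level-by-level version only ever holds 160 x 240 values at a time instead of 160 x 240 x 20.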