SWAN can't use more than one CPU

General scientific issues regarding ROMS

Moderators: arango, robertson

hbzong
Posts: 36
Joined: Thu Oct 04, 2007 4:14 am
Location: Fathom Science/NCSU

SWAN can't use more than one CPU

#1 Unread post by hbzong »

When I run the TESTHEAD example on a cluster with MPI, I can't assign more than one CPU to SWAN. ROMS can use more than one CPU, but SWAN can't: if I set more than one CPU for SWAN, the run fails with an error. Could you give me some suggestions?

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#2 Unread post by jcwarner »

To change the number of CPUs for SWAN, you need to change 2 things and check 1 thing:
1) coupling_test_head.in: set the NnodesWAV value to the number of processors to allocate to SWAN.
2) When you run the job, specify the total number of processors, such as

mpirun -np X ./oceanM ROMS/External/coupling_test_head.in

where X = NnodesOCN + NnodesWAV

3) Also check that NnodesOCN is equal to the number of partitions set in ocean_test_head.in (so NnodesOCN = NtileI * NtileJ)
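The bookkeeping in these three steps can be checked before submitting a job. The following is a minimal Python sketch, not part of ROMS or SWAN; the function name `total_mpi_processes` is hypothetical, and it assumes that the number of ROMS partitions is the product NtileI * NtileJ.

```python
def total_mpi_processes(nnodes_ocn, nnodes_wav, ntile_i, ntile_j):
    """Return the value X to pass to `mpirun -np X`.

    Raises ValueError when the ROMS tiling does not match NnodesOCN.
    """
    if nnodes_ocn < 1 or nnodes_wav < 1:
        raise ValueError("each model needs at least one process")
    if ntile_i * ntile_j != nnodes_ocn:
        raise ValueError(
            f"NtileI*NtileJ = {ntile_i * ntile_j} but NnodesOCN = {nnodes_ocn}"
        )
    # X = NnodesOCN + NnodesWAV
    return nnodes_ocn + nnodes_wav


if __name__ == "__main__":
    # 2 x 2 ocean tiles plus 2 wave processes
    print(total_mpi_processes(4, 2, 2, 2))  # prints 6
```

A mismatch between the tiling and NnodesOCN raises an error here rather than at run time, which is the only point of the check.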

hbzong
Posts: 36
Joined: Thu Oct 04, 2007 4:14 am
Location: Fathom Science/NCSU

#3 Unread post by hbzong »

I have done what you said. My X = NnodesOCN + NnodesWAV.
I can set X=3, NnodesOCN = 2, NnodesWAV = 1
or X=5, NnodesOCN = 4, NnodesWAV = 1
or any other combination with NnodesWAV = 1.
But I can't set X=4, NnodesOCN = 2, NnodesWAV = 2
or X=6, NnodesOCN = 4, NnodesWAV = 2
or any other combination with NnodesWAV > 1. There was an error.

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#4 Unread post by jcwarner »

Can you show me exactly what the error is?

hbzong
Posts: 36
Joined: Thu Oct 04, 2007 4:14 am
Location: Fathom Science/NCSU

#5 Unread post by hbzong »

X=6, NnodesOCN = 4, NnodesWAV = 2
The error messages are shown below:
-------------------------------------------------------------------------------------------------------------------------------
NL ROMS/TOMS: started time-stepping:( TimeSteps: 00000001 - 00001440)

== SWAN sent wave fields and Myerror= 0
== SWAN recvd ocean fields and Myerror= 0
+time 20030101.000200 , step 1; iteration 1; sweep 1

STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd

0 0.00000 0.000000E+00 9.619779E+01 9.619779E+01 1.952863E+10 0
DEF_HIS - creating history file: ocean_his.nc
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000001
1 0.00035 1.993010E-12 9.619779E+01 9.619779E+01 1.952863E+10 0
+time 20030101.000200 , step 1; iteration 1; sweep 2
2 0.00069 1.534867E-08 9.619779E+01 9.619779E+01 1.952863E+10 0
3 0.00104 3.498813E-09 9.619779E+01 9.619779E+01 1.952863E+10 0
4 0.00139 1.010260E-08 9.619779E+01 9.619779E+01 1.952863E+10 0
+time 20030101.000200 , step 1; iteration 1; sweep 3
5 0.00174 2.099356E-08 9.619779E+01 9.619779E+01 1.952863E+10 0
6 0.00208 4.158252E-08 9.619779E+01 9.619779E+01 1.952863E+10 0
7 0.00243 7.630712E-08 9.619778E+01 9.619778E+01 1.952863E+10 0
+time 20030101.000200 , step 1; iteration 1; sweep 4
p4_6891: p4_error: interrupt SIGSEGV: 11
p5_6905: p4_error: interrupt SIGSEGV: 11
rm_l_4_6902: (2.312500) net_send: could not write to fd=5, errno = 32
p1_6849: p4_error: net_recv read: probable EOF on socket: 1
p2_6863: p4_error: net_recv read: probable EOF on socket: 1
p3_6877: p4_error: net_recv read: probable EOF on socket: 1
rm_l_1_6860: (2.402344) net_send: could not write to fd=5, errno = 32
p4_6891: (2.312500) net_send: could not write to fd=5, errno = 32
rm_l_2_6874: (2.375000) net_send: could not write to fd=5, errno = 32
rm_l_3_6888: (2.343750) net_send: could not write to fd=5, errno = 32
rm_l_5_6916: (2.285156) net_send: could not write to fd=5, errno = 32
-- end MPICH run --
p2_6863: (6.375000) net_send: could not write to fd=5, errno = 32
p3_6877: (6.347656) net_send: could not write to fd=5, errno = 32
p1_6849: (6.406250) net_send: could not write to fd=5, errno = 32
p5_6905: (10.285156) net_send: could not write to fd=5, errno = 32
-----------------------------------------------------------------------------------------------------------------
If X=5 , NnodesOCN = 4, NnodesWAV = 1, it runs normally.

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#6 Unread post by jcwarner »

- What kind of system is this: Linux cluster, PC, ...?
- Are you using the latest version?

- I remember an issue like this, but thought I fixed it along the way.
It looks like there is a write error:
"rm_l_4_6902: (2.312500) net_send: could not write to fd=5, errno = 32 "
See what files have been created for SWAN.
Since you are trying to run SWAN with 2 processors, it should have 2 files for each type of output:
PRINT-001
PRINT-002
hsig.mat-001
hsig.mat-002
swan_restart.dat-001
swan_restart.dat-002
etc etc

There should also be a file called swaninit.
Remove swaninit and rerun it.
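The file check described above can be scripted. This is a minimal sketch in Python, not a SWAN utility: the function name `missing_swan_files` is hypothetical, and it assumes the per-process suffix is a three-digit, 1-based number, as in the file names listed in this post.

```python
from pathlib import Path


def missing_swan_files(run_dir, base_names, nnodes_wav):
    """List expected per-process SWAN files that are absent from run_dir."""
    missing = []
    for base in base_names:
        for rank in range(1, nnodes_wav + 1):
            name = f"{base}-{rank:03d}"  # e.g. PRINT-001, PRINT-002
            if not (Path(run_dir) / name).exists():
                missing.append(name)
    return missing


if __name__ == "__main__":
    bases = ["PRINT", "hsig.mat", "swan_restart.dat"]
    print(missing_swan_files(".", bases, 2))
```

An empty list means every output type has one file per SWAN process; any name that is printed points at a process that never opened its output.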

fyshi
Posts: 3
Joined: Thu Nov 06, 2003 4:35 pm
Location: Center for Applied Coastal Research

what's difference

#7 Unread post by fyshi »

I guess Haibo used the package checked out from the ROMS trunk rather than CSTM. The code checked out from the CSTM trunk should work with a multi-processor setup for SWAN. I just tried the CSTM version I checked out several months ago. The only error message was a null communicator at the final stage, which I guess was caused by double calls to mpi_finalize (in SWAN and the coupler). It won't affect results. So what's the difference between the CSTM trunk and the ROMS trunk right now?

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#8 Unread post by jcwarner »

Right now, the ROMS trunk is identical to the CSTM trunk.
I did submit a fix for that mpi_finalize issue. I also just found that we have a Fortran stop in the MCT router cleanup phase, and I want to remove that as well. But, as you said, that is all cleanup stuff that does not affect the run itself.
I just checked the release, and I can get test_head to run with multiple processors for SWAN. So I cannot recreate the problem.
Is he running it on a different system (a PC?)?

hbzong
Posts: 36
Joined: Thu Oct 04, 2007 4:14 am
Location: Fathom Science/NCSU

#9 Unread post by hbzong »

I'm running it on a Linux cluster. The ROMS is the latest version.
As the error messages show, it seems that an MPI node of the ocean model (p0; the ocean model uses p0, p1, p2, p3) died first and then caused the communication problem. But I don't know what made the p0 node die.
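One way to see which processes failed first is to pull the SIGSEGV lines out of the log posted in #5. The sketch below is only an illustration in Python: it assumes that the `pN_PID` prefix in the MPICH p4 messages means MPI rank N, and that ranks 0..NnodesOCN-1 run ROMS while the remaining ranks run SWAN (the layout described in this post). Both function names are hypothetical.

```python
import re

SIGSEGV_LINE = re.compile(
    r"^p(\d+)_\d+:\s+p4_error:\s+interrupt SIGSEGV", re.MULTILINE
)


def segfault_ranks(log_text):
    """Return the sorted ranks that reported SIGSEGV in the log text."""
    return sorted({int(m.group(1)) for m in SIGSEGV_LINE.finditer(log_text)})


def model_of_rank(rank, nnodes_ocn):
    """Assumed layout: ocean ranks come first, wave ranks follow."""
    return "ocean" if rank < nnodes_ocn else "wave"


if __name__ == "__main__":
    # Three lines copied from the log in post #5
    excerpt = (
        "p4_6891: p4_error: interrupt SIGSEGV: 11\n"
        "p5_6905: p4_error: interrupt SIGSEGV: 11\n"
        "p1_6849: p4_error: net_recv read: probable EOF on socket: 1\n"
    )
    for rank in segfault_ranks(excerpt):
        print(rank, model_of_rank(rank, nnodes_ocn=4))
```

In the excerpt from post #5, only p4 and p5 report SIGSEGV; the other processes report socket errors. Under the assumed layout with NnodesOCN = 4, ranks 4 and 5 are the wave-model processes.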

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#10 Unread post by jcwarner »

Can you check the number of files that it creates (2 for each node)?
Then delete all those *mat*, PRINT*, swaninit, etc. files and try to rerun it.

hbzong
Posts: 36
Joined: Thu Oct 04, 2007 4:14 am
Location: Fathom Science/NCSU

#11 Unread post by hbzong »

There are two output files of each type (*.mat-001, *.mat-002, PRINT-001, PRINT-002) when NnodesWAV = 2.
I removed all the *mat-001, *mat-002, PRINT*, swaninit, etc. files, then reran it. The same errors occurred.

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#12 Unread post by jcwarner »

Can you send the entire output that is written to stdout, not just a short section of it? Send it to my email so we don't fill up this whole screen with it:
jcwarner@usgs.gov

jacopo
Posts: 81
Joined: Fri Nov 21, 2003 9:30 pm
Location: CNR-ISMAR

#13 Unread post by jacopo »

Did you find out anything relevant?
I'm having the very same problem when running SWAN on more than 1 processor with ROMS version 116.

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#14 Unread post by jcwarner »

yes, here was the fix:

In the file Waves/SWAN/Src/swanparll.F
change line 1169 from
REAL IOPTR
to
REAL IOPTR(ILEN)

let me know if that works for you.

jacopo
Posts: 81
Joined: Fri Nov 21, 2003 9:30 pm
Location: CNR-ISMAR

#15 Unread post by jacopo »

yep, it worked!

tnx

J.-

yadusharma
Posts: 25
Joined: Tue Sep 22, 2015 3:09 pm
Location: Indian Institute of Technology Gandhinagar

Re: SWAN can't use more than one CPU

#16 Unread post by yadusharma »

Hi All,

I am also not able to provide more than one processor for the SWAN model. I am trying to run the SANDY_Coupled Test case and I am getting the following error.

Timing for main: time 2012-10-28_12:01:00 on domain 2: 2.76283 elapsed seconds
Timing for main: time 2012-10-28_12:01:00 on domain 2: 2.76283 elapsed seconds
WRT_HIS - wrote history fields (Index=1,1) in record = 0000001 01
DEF_AVG - creating average file, Grid 01: Sandy_ocean_avg.nc


At line 5036 of file swanparll.f90
Fortran runtime error: Bad unit number in statement


In coupling_sandy.in I have set the following:

NnodesATM = 4 ! atmospheric model
NnodesWAV = 4 ! wave model
NnodesOCN = 4 ! ocean model

and I am running as

mpirun -np 12 ./coawstM coupling_sandy.in
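The `-np` value can be cross-checked against the coupling file by summing its `Nnodes...` entries. This is a minimal Python sketch, assuming the entries look like the three lines quoted above (a name starting with `Nnodes`, an `=` sign, an integer, and an optional `!` comment); the function name `sum_nnodes` is hypothetical and the parser is not the one the coupled model uses.

```python
import re

NNODES_ENTRY = re.compile(r"\s*Nnodes\w+\s*=+\s*(\d+)")


def sum_nnodes(text):
    """Sum every `Nnodes... = N` assignment found in the given text."""
    total = 0
    for line in text.splitlines():
        code = line.split("!", 1)[0]  # drop the trailing comment
        match = NNODES_ENTRY.match(code)
        if match:
            total += int(match.group(1))
    return total


if __name__ == "__main__":
    settings = """
    NnodesATM = 4 ! atmospheric model
    NnodesWAV = 4 ! wave model
    NnodesOCN = 4 ! ocean model
    """
    print(sum_nnodes(settings))  # prints 12
```

For the settings quoted in this post the sum is 12, which matches the `mpirun -np 12` command, so the process count itself is consistent here.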

I am running on my workstation with Ubuntu. I am attaching the entire log. The line mentioned in Dr. Warner's suggestion (in the file Waves/SWAN/Src/swanparll.F, line 1169, changing REAL IOPTR to REAL IOPTR(ILEN)) is not present in the new version of swanparll.F.

I would be grateful for any suggestions to solve this error.

Thanking You,
Attachments
Sandy_test.log

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: SWAN can't use more than one CPU

#17 Unread post by jcwarner »

Not sure. Are you writing to multiple NetCDF files? This part is where SWAN opens files for each processor for writing. Comment out SWAN's writing of output just to see if that makes it work.
