perfect_restart problem
Hi,
When I define the PERFECT_RESTART option in my .h file, ROMS receives an exit signal that ends all processes, without any blow-up. I use the GLS_MIXING scheme in my test.
Can anybody give me some advice? Thanks!
Thank you for replying so quickly!
I used 28 processors and set NRST=60 so that ROMS writes the rst file every 60 time steps. Here is the output:
Signal received, cleaning up temporary files and exiting ...
Model Input Parameters: ROMS/TOMS version 3.0
Operating system : Linux
CPU/hardware : x86_64
Compiler system : ifort
Compiler command : /usr/local/intel/mpich-mx-1.2.7..1/linux86-64/9.1/bin/mpif90
Compiler flags : -ip -O3 -xW -free
SVN Root URL : https://www.myroms.org/svn/src/trunk
Resolution, Grid 01: 0438x0318x036, Parallel Nodes: 28, Tiling: 007x004
Physical Parameters, Grid: 01
=============================
324288 ntimes Number of timesteps for 3-D equations.
300.000 dt Timestep size (s) for 3-D equations.
30 ndtfast Number of timesteps for 2-D equations between
each 3D timestep.
1 ERstr Starting ensemble/perturbation run number.
1 ERend Ending ensemble/perturbation run number.
0 nrrec Number of restart records to read from disk.
T LcycleRST Switch to recycle time-records in restart file.
60 nRST Number of timesteps between the writing of data
into restart fields.
60 ninfo Number of timesteps between print of information
to standard output.
T ldefout Switch to create a new output NetCDF file(s).
288 nHIS Number of timesteps between the writing fields
into history file.
2880 ndefHIS Number of timesteps between creation of new
history files.
1 ntsAVG Starting timestep for the accumulation of output
time-averaged data.
8640 nAVG Number of timesteps between the writing of
time-averaged data into averages file.
103680 ndefAVG Number of timesteps between creation of new
time-averaged file.
5.0000E+01 visc2 Horizontal, harmonic mixing coefficient (m2/s)
for momentum.
1.0000E-06 Akt_bak(01) Background vertical mixing coefficient (m2/s)
for tracer 01: temp
1.0000E-06 Akt_bak(02) Background vertical mixing coefficient (m2/s)
for tracer 02: salt
1.0000E-05 Akv_bak Background vertical mixing coefficient (m2/s)
for momentum.
3.0000E-04 rdrg Linear bottom drag coefficient (m/s).
3.0000E-03 rdrg2 Quadratic bottom drag coefficient.
2.0000E-02 Zob Bottom roughness (m).
1.0000E+01 blk_ZQ Height (m) of surface air humidity measurement.
1.0000E+01 blk_ZT Height (m) of surface air temperature measurement.
1.0000E+01 blk_ZW Height (m) of surface winds measurement.
1 lmd_Jwt Jerlov water type.
5.0000E+00 theta_s S-coordinate surface control parameter.
4.0000E-01 theta_b S-coordinate bottom control parameter.
50.000 Tcline S-coordinate surface/bottom layer width (m) used
in vertical coordinate stretching.
1025.000 rho0 Mean density (kg/m3) for Boussinesq approximation.
52791.000 dstart Time-stamp assigned to model initialization (days).
18581117.00 time_ref Reference time for units attribute (yyyymmdd.dd)
3.0000E+01 Tnudg(01) Nudging/relaxation time scale (days)
for tracer 01: temp
3.0000E+01 Tnudg(02) Nudging/relaxation time scale (days)
for tracer 02: salt
3.0000E+01 Znudg Nudging/relaxation time scale (days)
for free-surface.
3.0000E+01 M2nudg Nudging/relaxation time scale (days)
for 2D momentum.
3.0000E+01 M3nudg Nudging/relaxation time scale (days)
for 3D momentum.
1.0000E+01 obcfac Factor between passive and active
open boundary conditions.
10.000 T0 Background potential temperature (C) constant.
35.000 S0 Background salinity (PSU) constant.
1.000 gamma2 Slipperiness variable: free-slip (1.0) or
no-slip (-1.0).
T Hout(idFsur) Write out free-surface.
T Hout(idUbar) Write out 2D U-momentum component.
T Hout(idVbar) Write out 2D V-momentum component.
T Hout(idUvel) Write out 3D U-momentum component.
T Hout(idVvel) Write out 3D V-momentum component.
T Hout(idWvel) Write out W-momentum component.
T Hout(idTvar) Write out tracer 01: temp
T Hout(idTvar) Write out tracer 02: salt
T Hout(idUsms) Write out surface U-momentum stress.
T Hout(idVsms) Write out surface V-momentum stress.
Tile partition information for Grid 01: 0438x0318x0036 tiling: 007x004
tile Istr Iend Jstr Jend Npts
0 1 62 1 79 176328
1 63 125 1 79 179172
2 126 188 1 79 179172
3 189 251 1 79 179172
4 252 314 1 79 179172
5 315 377 1 79 179172
6 378 438 1 79 173484
7 1 62 80 159 178560
8 63 125 80 159 181440
9 126 188 80 159 181440
10 189 251 80 159 181440
11 252 314 80 159 181440
12 315 377 80 159 181440
13 378 438 80 159 175680
14 1 62 160 239 178560
15 63 125 160 239 181440
16 126 188 160 239 181440
17 189 251 160 239 181440
18 252 314 160 239 181440
19 315 377 160 239 181440
20 378 438 160 239 175680
21 1 62 240 318 176328
22 63 125 240 318 179172
23 126 188 240 318 179172
24 189 251 240 318 179172
25 252 314 240 318 179172
26 315 377 240 318 179172
27 378 438 240 318 173484
Maximum halo size in XI and ETA directions:
HaloSizeI(1) = 146
HaloSizeJ(1) = 180
TileSide(1) = 84
TileSize(1) = 5628
Activated C-preprocessing Options:
SABGOM_H SABGOM 3-Year Hindcast
ANA_BSFLUX Analytical kinematic bottom salinity flux.
ANA_BTFLUX Analytical kinematic bottom temperature flux.
ANA_RAIN Analytical rain fall rate.
ANA_SSFLUX Analytical kinematic surface salinity flux.
ASSUMED_SHAPE Using assumed-shape arrays.
AVERAGES Writing out time-averaged fields.
BULK_FLUXES Surface bulk fluxes parametererization.
CURVGRID Orthogonal curvilinear grid.
DJ_GRADPS Parabolic Splines density Jacobian (Shchepetkin, 2002).
DOUBLE_PRECISION Double precision arithmetic.
EAST_FSCHAPMAN Eastern edge, free-surface, Chapman condition.
EAST_M2FLATHER Eastern edge, 2D momentum, Flather condition.
EAST_M3NUDGING Eastern edge, 3D momentum, passive/active outflow/inflow.
EAST_M3RADIATION Eastern edge, 3D momentum, radiation condition.
EAST_TNUDGING Eastern edge, tracers, passive/active outflow/inflow.
EAST_TRADIATION Eastern edge, tracers, radiation condition.
INLINE_2DIO Processing 3D IO level by level to reduce memory needs.
LMD_CONVEC LMD convective mixing due to shear instability.
LMD_MIXING Large/McWilliams/Doney interior mixing.
LMD_NONLOCAL LMD convective nonlocal transport.
LMD_RIMIX LMD diffusivity due to shear instability.
LMD_SKPP KPP surface boundary layer mixing.
LONGWAVE Compute net longwave radiation internally.
MASKING Land/Sea masking.
MIX_S_UV Mixing of momentum along constant S-surfaces.
MPI MPI distributed-memory configuration.
NONLINEAR Nonlinear Model.
NONLIN_EOS Nonlinear Equation of State for seawater.
NORTHERN_WALL Wall boundary at Northern edge.
PERFECT_RESTART Processing perfect restart variables.
POWER_LAW Power-law shape time-averaging barotropic filter.
PROFILE Time profiling activated .
RADIATION_2D Use tangential phase speed in radiation conditions.
!RST_SINGLE Double precision fields in restart NetCDF file.
SALINITY Using salinity.
SOLAR_SOURCE Solar Radiation Source Term.
SOLVE3D Solving 3D Primitive Equations.
SOUTH_FSCHAPMAN Southern edge, free-surface, Chapman condition.
SOUTH_M2FLATHER Southern edge, 2D momentum, Flather condition.
SOUTH_M3NUDGING Southern edge, 3D momentum, passive/active outflow/inflow.
SOUTH_M3RADIATION Southern edge, 3D momentum, radiation condition.
SOUTH_TNUDGING Southern edge, tracers, passive/active outflow/inflow.
SOUTH_TRADIATION Southern edge, tracers, radiation condition.
SPLINES Conservative parabolic spline reconstruction.
TCLIMATOLOGY Processing tracer climatology data.
TCLM_NUDGING Nudging toward tracer climatology.
TS_U3HADVECTION Third-order upstream bias horizontal advection of tracers.
TS_SVADVECTION Parabolic splines vertical advection of tracers.
TS_PSOURCE Tracers point sources and sinks.
UV_ADV Advection of momentum.
UV_COR Coriolis term.
UV_U3HADVECTION Third-order upstream bias advection of momentum.
UV_QDRAG Quadratic bottom stress.
UV_PSOURCE Mass point sources and sinks.
UV_VIS2 Harmonic mixing of momentum.
VAR_RHO_2D Variable density barotropic mode.
WESTERN_WALL Wall boundary at Western edge.
INITIAL: Configurating and initializing forward nonlinear model ...
Vertical S-coordinate System:
level S-coord Cs-curve at_hmin over_slope at_hmax
36 0.0000000 0.0000000 0.000 0.000 0.000
35 -0.0277778 -0.0019878 -0.139 -5.720 -11.300
34 -0.0555556 -0.0042675 -0.278 -12.258 -24.239
33 -0.0833333 -0.0069437 -0.417 -19.910 -39.404
32 -0.1111111 -0.0101452 -0.556 -29.037 -57.519
31 -0.1388889 -0.0140312 -0.694 -40.086 -79.477
30 -0.1666667 -0.0187972 -0.833 -53.605 -106.376
29 -0.1944444 -0.0246815 -0.972 -70.263 -139.554
28 -0.2222222 -0.0319695 -1.111 -90.862 -180.614
27 -0.2500000 -0.0409944 -1.250 -116.338 -231.426
26 -0.2777778 -0.0521319 -1.389 -147.744 -294.099
25 -0.3055556 -0.0657822 -1.528 -186.205 -370.882
24 -0.3333333 -0.0823381 -1.667 -232.823 -463.979
23 -0.3611111 -0.1021338 -1.806 -288.537 -575.268
22 -0.3888889 -0.1253769 -1.944 -353.928 -705.912
21 -0.4166667 -0.1520732 -2.083 -429.014 -855.945
20 -0.4444444 -0.1819651 -2.222 -513.072 -1023.921
19 -0.4722222 -0.2145100 -2.361 -604.577 -1206.794
18 -0.5000000 -0.2489214 -2.500 -701.323 -1400.146
17 -0.5277778 -0.2842780 -2.639 -800.722 -1598.805
16 -0.5555556 -0.3196767 -2.778 -900.240 -1797.701
15 -0.5833333 -0.3543864 -2.917 -997.823 -1992.728
14 -0.6111111 -0.3879574 -3.056 -1092.209 -2181.362
13 -0.6388889 -0.4202649 -3.194 -1183.048 -2362.901
12 -0.6666667 -0.4514899 -3.333 -1270.848 -2538.363
11 -0.6944444 -0.4820609 -3.472 -1356.812 -2710.152
10 -0.7222222 -0.5125827 -3.611 -1442.638 -2881.665
9 -0.7500000 -0.5437741 -3.750 -1530.344 -3056.937
8 -0.7777778 -0.5764230 -3.889 -1622.141 -3240.394
7 -0.8055556 -0.6113613 -4.028 -1720.366 -3436.704
6 -0.8333333 -0.6494565 -4.167 -1827.454 -3650.740
5 -0.8611111 -0.6916165 -4.306 -1945.952 -3887.599
4 -0.8888889 -0.7388018 -4.444 -2078.560 -4152.675
3 -0.9166667 -0.7920448 -4.583 -2228.173 -4451.763
2 -0.9444444 -0.8524713 -4.722 -2397.954 -4791.185
1 -0.9722222 -0.9213261 -4.861 -2591.396 -5177.930
0 -1.0000000 -1.0000000 -5.000 -2812.404 -5619.808
Time Splitting Weights: ndtfast = 30 nfast = 42
ndtfast, nfast = 30 42 nfast/ndtfast = 1.40000
Centers of gravity and integrals (values must be 1, 1, approx 1/2, 1, 1):
1.000000000000 1.047601458608 0.523800729304 1.000000000000 1.000000000000
Power filter parameters, Fgamma, gamma = 0.28400 0.18933
Minimum X-grid spacing, DXmin = 5.84518929E+00 km
Maximum X-grid spacing, DXmax = 7.09321009E+00 km
Minimum Y-grid spacing, DYmin = 4.99958718E+00 km
Maximum Y-grid spacing, DYmax = 5.27049616E+00 km
Minimum Z-grid spacing, DZmin = 1.38888595E-01 m
Maximum Z-grid spacing, DZmax = 4.41877891E+02 m
Minimum barotropic Courant Number = 1.67762600E-02
Maximum barotropic Courant Number = 5.85791511E-01
Maximum Coriolis Courant Number = 2.74759622E-02
NLM: GET_STATE - Read state initial conditions, t = 52791.0000
...
...
Maximum grid stiffness ratios: rx0 = 4.492189E-01 (Beckmann and Haidvogel)
rx1 = 1.550243E+01 (Haney)
Initial basin volumes: TotVolume = 7.08080639348850E+15 m3
MinVolume = 4.09865863294175E+06 m3
MaxVolume = 1.51778869573190E+10 m3
Max/Min = 3.70313517581856E+03
NL ROMS/TOMS: started time-stepping:( TimeSteps: 00000001 - 00324288)
GET_NGFLD - river runoff mass transport, t = 52792.0000
(File: SABGOM_H_tide_20030601.nc, Rec=1250, Index=1)
(Tmin= 51543.0000 Tmax= 54186.0000)
(Min = -1.96954702E+04 Max = 6.68277580E+02)
...
...
STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd
0 52791.00000 1.659717E-02 1.917084E+04 1.917086E+04 7.093434E+15 0
DEF_HIS - creating history file: /share3/lmkli/SABGOM/NARR_OUT/sabgom_his_0001.nc
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000001
DEF_AVG - creating average file: /share3/lmkli/SABGOM/NARR_OUT/sabgom_avg_0001.nc
...
...
(Tmin= 52640.0000 Tmax= 53005.0000)
(Min = 5.58790801E-01 Max = 9.98642064E-01)
60 52791.20833 1.638506E-02 1.917090E+04 1.917092E+04 7.093346E+15 0
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
==== ========== ================ ======================= ===================
0001 blade21-3 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0002 blade21-3 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0003 blade21-3 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0004 blade21-3 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0005 blade21-1 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0006 blade21-1 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0007 blade21-1 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0008 blade21-2 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0009 blade21-2 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0010 blade21-2 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0011 blade21-2 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0012 blade21-1 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0013 blade21-11 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0014 blade21-11 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0015 blade21-11 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0016 blade21-11 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0017 blade21-14 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0018 blade21-14 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0019 blade21-14 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0020 blade21-13 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0021 blade21-13 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0022 blade21-13 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0023 blade21-13 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0024 blade21-14 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0025 blade21-12 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0026 blade21-12 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0027 blade21-12 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0028 blade21-12 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:37
------------------------------------------------------------
Sender: LSF System <lsfadmin>
Subject: Job 30399: </bin> Exited
Job </bin> was submitted from host <login03> by user <lmingku>.
Job was executed on host(s) <4>, in queue <he>, as user <lmingku>.
<4>
<4>
<4>
<4>
<4>
<4>
</home> was used as the home directory.
</home> was used as the working directory.
Started at Wed Oct 31 21:18:46 2007
Results reported at Wed Oct 31 21:20:54 2007
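A small sanity check on the log above: the last step printed before the SIGSEGV is step 60, which is also the value of nRST, so the crash appears to coincide with the first restart write. The sketch below only reproduces the model time printed for that step from dstart, dt and the step number as they appear in the log; `model_time_days` is a hypothetical helper, not a ROMS routine.

```python
SECONDS_PER_DAY = 86400.0

def model_time_days(dstart_days, step, dt_seconds):
    """Model time in days after `step` baroclinic steps of length dt_seconds."""
    return dstart_days + step * dt_seconds / SECONDS_PER_DAY

# Values taken from the log: dstart = 52791.0 days, dt = 300 s, nRST = 60.
first_restart_time = model_time_days(52791.0, 60, 300.0)
print(f"{first_restart_time:.5f}")
```

This gives 52791.20833, the same time stamp as the last diagnostics line in the log.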
perfect_restart in upwelling case
The PERFECT_RESTART option works well in the upwelling case.
What does this indicate for my case?
Well, I am not sure. Here are some things to try.
1) Does your app work OK without perfect restart? (That is, does it work with just the normal restart?)
2) Did anything get written to the restart file? Was a restart file actually created?
3) Creating the rst file is one step, and writing to it is another. So it is important to check first whether the file was created. Then, if it was created, did anything get written? There may be some values that are OK and some that are bad. Can you dig into these issues?
4) Compare what is different between your setup and that of upwelling. You said upwelling worked. What is different? Maybe try your app with gls (just to check the restart issue; I am not talking physics here, just computer software issues).
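For points 2 and 3, one way to tell "file created" apart from "records written" is to look at the record count in the file header. The sketch below is a minimal, standard-library-only check that assumes the restart file is in the NetCDF classic or 64-bit-offset format (magic bytes `CDF` followed by version 1 or 2, then a 4-byte big-endian record count) and that the time dimension is the unlimited one. Both are assumptions about this setup, and `netcdf_classic_numrecs` is a hypothetical helper.

```python
import struct

def netcdf_classic_numrecs(path):
    """Return (version, numrecs) read from a NetCDF classic-format header.

    Only handles the classic (version 1) and 64-bit-offset (version 2)
    formats; anything else raises ValueError.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if len(magic) < 4 or magic[:3] != b"CDF":
            raise ValueError("not a NetCDF classic-format file")
        version = magic[3]
        if version not in (1, 2):
            raise ValueError(f"unsupported NetCDF version byte: {version}")
        raw = f.read(4)
        if len(raw) < 4:
            raise ValueError("truncated header")
        # Number of records along the unlimited dimension, big-endian.
        (numrecs,) = struct.unpack(">I", raw)
    return version, numrecs

# Usage (the file name is a placeholder):
# version, numrecs = netcdf_classic_numrecs("ocean_rst.nc")
# numrecs == 0 would mean the file was defined but no time record was written.
```

Running `ncdump -h` on the file shows the same information as the current length of the unlimited dimension, and is the more usual way to check.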
> 1) does your app work ok without perfect restart ? (so does it work ok with just the normal restart)
yes, it works without perfect restart.
> 2) did anything get written to the restart file? was a restart file actually created?
yes, restart file was created, and had something in it.
> 3) the create rst is a ... ... bad. Can you dig into these issues?
I am digging into it. If I find any new information, I will let you know.
> 4) compare what is different with your setup to that of upwelling. You said upwelling worked. What is different? Maybe try your app with gls.
I will look at how my setup differs from upwelling with respect to perfect restart, and will also try gls.
THANKS!
new results:
In perfect_restart case, the restart file was created, but no variable was written into it.
I also compared my .h file with the upwelling one, and I added the following lines to my .h file:
#ifdef PERFECT_RESTART
# undef AVERAGES
# undef DIAGNOSTICS_BIO
# undef DIAGNOSTICS_TS
# undef DIAGNOSTICS_UV
# define OUT_DOUBLE
#endif
It didn't work.
Finally, I changed the mixing scheme to gls; that didn't work either.
Similar issue - AVERAGES_DETIDE
Hi all,
I just updated to the newest release and tried PERFECT_RESTART with similar issues.
Here's the application info from the outfile:
Operating system : Linux
CPU/hardware : x86_64
Compiler system : ifort
Compiler command : /opt/roms/mpich-1.2.7p1/bin/mpif90
Compiler flags : -i-static -ip -O2 -ip -O3 -xW -free
Input Script : risc7a.in
SVN Root URL : https://www.myroms.org/svn/src/trunk
SVN Revision : 147M
and the CPPDEFS:
risc7 Narragansett Bay and RIS
ANA_BSFLUX Analytical kinematic bottom salinity flux.
ANA_BTFLUX Analytical kinematic bottom temperature flux.
ANA_SSFLUX Analytical kinematic surface salinity flux.
ASSUMED_SHAPE Using assumed-shape arrays.
AVERAGES Writing out time-averaged fields.
AVERAGES_DETIDE Writing out time-averaged detided fields.
BULK_FLUXES Surface bulk fluxes parametererization.
CURVGRID Orthogonal curvilinear grid.
DIAGNOSTICS_TS Computing and writing tracer diagnostic terms.
DIAGNOSTICS_UV Computing and writing momentum diagnostic terms.
DIFF_GRID Horizontal diffusion coefficient scaled by grid size.
DIURNAL_SRFLUX Modulate shortwave radiation by the local diurnal cycle.
DJ_GRADPS Parabolic Splines density Jacobian (Shchepetkin, 2002).
DOUBLE_PRECISION Double precision arithmetic.
EAST_FSCHAPMAN Eastern edge, free-surface, Chapman condition.
EAST_M2FLATHER Eastern edge, 2D momentum, Flather condition.
EAST_M3RADIATION Eastern edge, 3D momentum, radiation condition.
EAST_TRADIATION Eastern edge, tracers, radiation condition.
FLOATS Simulated Lagrangian drifters.
GLS_MIXING Generic Length-Scale turbulence closure.
MASKING Land/Sea masking.
MIX_GEO_TS Mixing of tracers along geopotential surfaces.
MIX_S_UV Mixing of momentum along constant S-surfaces.
MPI MPI distributed-memory configuration.
NONLINEAR Nonlinear Model.
NONLIN_EOS Nonlinear Equation of State for seawater.
NORTHERN_WALL Wall boundary at Northern edge.
PERFECT_RESTART Processing perfect restart variables.
POWER_LAW Power-law shape time-averaging barotropic filter.
PROFILE Time profiling activated .
K_GSCHEME Third-order upstream bias advection of TKE fields.
!RST_SINGLE Double precision fields in restart NetCDF file.
SALINITY Using salinity.
SOLAR_SOURCE Solar Radiation Source Term.
SOLVE3D Solving 3D Primitive Equations.
SOUTH_FSCHAPMAN Southern edge, free-surface, Chapman condition.
SOUTH_M2FLATHER Southern edge, 2D momentum, Flather condition.
SOUTH_M3NUDGING Southern edge, 3D momentum, passive/active outflow/inflow.
SOUTH_M3RADIATION Southern edge, 3D momentum, radiation condition.
SOUTH_TNUDGING Southern edge, tracers, passive/active outflow/inflow.
SOUTH_TRADIATION Southern edge, tracers, radiation condition.
SPLINES Conservative parabolic spline reconstruction.
SPONGE Enhanced horizontal mixing in the sponge areas.
SSH_TIDES Add tidal elevation to SSH climatology.
STATIONS Writing out station data.
TS_A4HADVECTION Fouth-order Akima horizontal advection of tracers.
TS_A4VADVECTION Fouth-order Akima vertical advection of tracers.
TS_DIF2 Harmonic mixing of tracers.
TS_PSOURCE Tracers point sources and sinks.
UV_ADV Advection of momentum.
UV_COR Coriolis term.
UV_U3HADVECTION Third-order upstream bias advection of momentum.
UV_LOGDRAG Logarithmic bottom stress.
UV_PSOURCE Mass point sources and sinks.
UV_TIDES Add tidal currents to 2D momentum climatologies.
UV_VIS2 Harmonic mixing of momentum.
VAR_RHO_2D Variable density barotropic mode.
VISC_GRID Horizontal viscosity coefficient scaled by grid size.
WEST_FSCHAPMAN Western edge, free-surface, Chapman condition.
WEST_M2FLATHER Western edge, 2D momentum, Flather condition.
WEST_M3NUDGING Western edge, 3D momentum, passive/active outflow/inflow.
WEST_M3RADIATION Western edge, 3D momentum, radiation condition.
WEST_TRADIATION Western edge, tracers, radiation condition.
Starting timestepping looks like this:
0 0.00001 6.290587E-13 1.300771E+02 1.300771E+02 3.213179E+10 0
DEF_HIS - creating history file: out/risc7a_his_0001.nc
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000001
DEF_AVG - creating average file: out/risc7a_avg.nc
DEF_DIAGS - creating diagnostics file: out/risc7a_dia.nc
DEF_STATION - creating stations file: out/risc7a_sta.nc
DEF_FLOATS - creating floats file: out/risc7a_flt.nc
1 0.00010 2.365654E-04 1.301077E+02 1.301079E+02 3.213680E+10 0
and the crash at the end looks like this:
5961 0.51746 1.307970E-02 1.328896E+02 1.329027E+02 3.322129E+10 0
5962 0.51755 1.306887E-02 1.328901E+02 1.329032E+02 3.322153E+10 0
p0_10096: p4_error: interrupt SIGSEGV: 11
p0_10096: (2296.429688) net_send: could not write to fd=4, errno = 32
which clearly pointed me to this point in the infile:
! Output history, average, diagnostic files parameters.
LDEFOUT == T
NHIS == 44710
NDEFHIS == 172800
NTSAVG == 1
NAVG == 5962
NDEFAVG == 0
NTSDIA == 1
NDIA == 57600
NDEFDIA == 0
So it crashes trying to write the average file. This is not a restart solution, NRREC=0, and the 3D fields in the average file are empty (null) according to MATLAB.
Did I miss anything in the CPPDEFS that AVERAGES_DETIDE needs?
Cheers,
Justin
- arango
- Site Admin
- Posts: 1364
- Joined: Wed Feb 26, 2003 4:41 pm
- Location: DMCS, Rutgers University
- Contact:
Well, you attached a lot of the standard output in this posting, but you are missing the most important information. Your problem is a parallel one and not an IO one. It seems that the mp_gather routine failed during MPI communications. What is your grid size and tile partition? The information that you included is pretty much irrelevant for this kind of problem.
Why is your baroclinic time-step so so small? Obviously, you are trying to average M2 tides. This has nothing to do with perfect restart. If you check the ROMS svn track ticket you will notice that the perfect restart ticket is still open because we still don't get perfect restart with sediment, biology, and other algorithms. I have postponed the debugging of this option. However, this is not your problem.
If you think that AVERAGES_DETIDE is the problem, turn off this switch and see what happens. All the applications are different. So this is a good way to determine what option has problems in your application.
Last edited by arango on Fri Feb 01, 2008 1:02 am, edited 1 time in total.
The baroclinic timestep is very small due to complex topography in a shallow region where rivers are coming in. I was running with a 15-20 second dt, but river inputs of ~100 m3/sec during the spring rains were causing blowups. To that end, I do a fair amount of restarting since I hate to run with a dt that small - perfect_restart will be handy. I am pushing ROMS into a shallow, complex estuary with high velocities, so stability is a persistent problem.
Here's the parallel line from the outfile.
Resolution, Grid 01: 0098x0198x015, Parallel Nodes: 24, Tiling: 006x004
It's a lot of nodes, but message passing is still only ~1% of time, if I'm reading that part of the outfile correctly.
I am running fine without averages_detide, just setting NAVG such that it outputs every 12.42 hours or so.
Thanks for the response,
Justin Rogers, M.S. Candidate
URI GSO
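As a quick check on the averaging window mentioned above, the number of baroclinic steps in 12.42 hours can be computed directly. The time step is not stated in this thread; the 7.5 s used below is an assumption, chosen only because it reproduces the NAVG == 5962 setting from the input file. `steps_per_window` is a hypothetical helper.

```python
def steps_per_window(window_hours, dt_seconds):
    """Number of time steps (rounded to nearest) spanning the given window."""
    return round(window_hours * 3600.0 / dt_seconds)

# 12.42 h is the window from the post; dt = 7.5 s is an assumed value.
navg = steps_per_window(12.42, 7.5)
print(navg)
```

With these inputs the result is 5962. A different dt would need a different NAVG to keep the window close to one M2 period.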
- arango
This is complete overkill. I would never use more than 8 CPUs on a problem of this size. This may be your problem during IO. Sometimes it depends on the architecture. Try either 1x6 or 2x6 to have balanced threads. You have more partitions on the smaller dimension. Do you have a small cache?
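To make the tiling point concrete, here is a rough sketch of how many interior points each tile gets for the 98x198 grid under the 6x4 tiling and under the suggested 1x6 and 2x6 alternatives. It uses a plain ceiling division, which is an assumption for illustration only and not ROMS's actual partitioning routine; `tile_extent` is a hypothetical helper.

```python
import math

def tile_extent(n_i, n_j, ntile_i, ntile_j):
    """Approximate (I, J) size in grid points of the largest tile."""
    return math.ceil(n_i / ntile_i), math.ceil(n_j / ntile_j)

# Grid size from "Resolution, Grid 01: 0098x0198x015".
for ntile_i, ntile_j in [(6, 4), (1, 6), (2, 6)]:
    size_i, size_j = tile_extent(98, 198, ntile_i, ntile_j)
    print(f"{ntile_i}x{ntile_j}: about {size_i} x {size_j} points per tile, "
          f"{ntile_i * ntile_j} tiles")
```

Under this estimate the 6x4 tiling gives tiles of roughly 17 x 50 points, while 2x6 gives roughly 49 x 33, which is much closer to square.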
The problem is that your application is hanging up in one of the 24 nodes and we don't know why.
If you have only one tidal component (M2), your averaging window makes sense. Otherwise, there is a lot of superposition between tidal components and nonlinear coupling. In that case, this averaging window will not filter the tides and a more robust scheme is required. AVERAGES_DETIDE uses a least-squares fit that will improve with time. We started to document how this is done in WikiROMS.
It does seem to be a parallelization problem. I reset NAVG to 100 for diagnostics and tried with a single processor: no crash. Then 2 (they're dual-CPU nodes): no crash. 4 nodes at 2x2 worked fine. 8 nodes at 2x4 or 1x8 crashed right at timestep 100. I'm rebooting nodes and doing a disk check for now, but my previous application works fine with lots of nodes, with no AVERAGES_DETIDE there.
I do use restarts to change timestep, so PERFECT_RESTART might be a lost cause for me at the moment. My rivers vary wildly in this coastal application and cause what look like CFL errors, even when smoothed a bit.
I detide station files the smart way, the M2-period NAVG is a rough look at what's going on I suppose.
CPU cache? They're Opteron 248's with a 1MB cache. 2GB of RAM per node, nowhere near fully utilized of course. Efficiency certainly goes down with lots of nodes, but I usually use 16 CPU's on 8 nodes to good effect.
-Justin