perfect_restart problem

lmkli
Posts: 24
Joined: Wed Aug 02, 2006 1:21 pm
Location: TAMU

#1 Unread post by lmkli »

Hi,

When I define the PERFECT_RESTART option in my .h file, ROMS receives an exit signal that ends all processes, without any blow-up. I use the GLS_MIXING scheme in my test.

Can anybody give me some advice? Thanks!

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#2 Unread post by jcwarner »

Can you provide more of the output so that we can see what is happening?
Do you get the same effect if you use a default case, like upwelling, and also define perfect restart?

-j

lmkli
Posts: 24
Joined: Wed Aug 02, 2006 1:21 pm
Location: TAMU

#3 Unread post by lmkli »

Thank you for replying so quickly!
I used 28 processors and set NRST=60 so that ROMS writes the restart file every 60 time steps. Here is the output:

Code:

Model Input Parameters:  ROMS/TOMS version 3.0

 Operating system : Linux
 CPU/hardware     : x86_64
 Compiler system  : ifort
 Compiler command : /usr/local/intel/mpich-mx-1.2.7..1/linux86-64/9.1/bin/mpif90
 Compiler flags   :  -ip -O3 -xW -free

 SVN Root URL  : https://www.myroms.org/svn/src/trunk

 Resolution, Grid 01: 0438x0318x036,  Parallel Nodes:  28,  Tiling: 007x004


 Physical Parameters, Grid: 01
 =============================

     324288  ntimes          Number of timesteps for 3-D equations.
    300.000  dt              Timestep size (s) for 3-D equations.
         30  ndtfast         Number of timesteps for 2-D equations between
                               each 3D timestep.
          1  ERstr           Starting ensemble/perturbation run number.
          1  ERend           Ending ensemble/perturbation run number.
          0  nrrec           Number of restart records to read from disk.
          T  LcycleRST       Switch to recycle time-records in restart file.
         60  nRST            Number of timesteps between the writing of data
                               into restart fields.
         60  ninfo           Number of timesteps between print of information
                               to standard output.
          T  ldefout         Switch to create a new output NetCDF file(s).
        288  nHIS            Number of timesteps between the writing fields
                               into history file.
       2880  ndefHIS         Number of timesteps between creation of new
                               history files.
          1  ntsAVG          Starting timestep for the accumulation of output
                               time-averaged data.
       8640  nAVG            Number of timesteps between the writing of
                               time-averaged data into averages file.
     103680  ndefAVG         Number of timesteps between creation of new
                               time-averaged file.
 5.0000E+01  visc2           Horizontal, harmonic mixing coefficient (m2/s)
                               for momentum.
 1.0000E-06  Akt_bak(01)     Background vertical mixing coefficient (m2/s)
                               for tracer 01: temp
 1.0000E-06  Akt_bak(02)     Background vertical mixing coefficient (m2/s)
                               for tracer 02: salt
 1.0000E-05  Akv_bak         Background vertical mixing coefficient (m2/s)
                               for momentum.
 3.0000E-04  rdrg            Linear bottom drag coefficient (m/s).
 3.0000E-03  rdrg2           Quadratic bottom drag coefficient.
 2.0000E-02  Zob             Bottom roughness (m).
 1.0000E+01  blk_ZQ          Height (m) of surface air humidity measurement.
 1.0000E+01  blk_ZT          Height (m) of surface air temperature measurement.
 1.0000E+01  blk_ZW          Height (m) of surface winds measurement.
          1  lmd_Jwt         Jerlov water type.
 5.0000E+00  theta_s         S-coordinate surface control parameter.
 4.0000E-01  theta_b         S-coordinate bottom  control parameter.
     50.000  Tcline          S-coordinate surface/bottom layer width (m) used
                               in vertical coordinate stretching.
   1025.000  rho0            Mean density (kg/m3) for Boussinesq approximation.
  52791.000  dstart          Time-stamp assigned to model initialization (days).
18581117.00  time_ref        Reference time for units attribute (yyyymmdd.dd)
 3.0000E+01  Tnudg(01)       Nudging/relaxation time scale (days)
                               for tracer 01: temp
 3.0000E+01  Tnudg(02)       Nudging/relaxation time scale (days)
                               for tracer 02: salt
 3.0000E+01  Znudg           Nudging/relaxation time scale (days)
                               for free-surface.
 3.0000E+01  M2nudg          Nudging/relaxation time scale (days)
                               for 2D momentum.
 3.0000E+01  M3nudg          Nudging/relaxation time scale (days)
                               for 3D momentum.
 1.0000E+01  obcfac          Factor between passive and active
                               open boundary conditions.
     10.000  T0              Background potential temperature (C) constant.
     35.000  S0              Background salinity (PSU) constant.
      1.000  gamma2          Slipperiness variable: free-slip (1.0) or
                                                    no-slip (-1.0).
          T  Hout(idFsur)    Write out free-surface.
          T  Hout(idUbar)    Write out 2D U-momentum component.
          T  Hout(idVbar)    Write out 2D V-momentum component.
          T  Hout(idUvel)    Write out 3D U-momentum component.
          T  Hout(idVvel)    Write out 3D V-momentum component.
          T  Hout(idWvel)    Write out W-momentum component.
          T  Hout(idTvar)    Write out tracer 01: temp
          T  Hout(idTvar)    Write out tracer 02: salt
          T  Hout(idUsms)    Write out surface U-momentum stress.
          T  Hout(idVsms)    Write out surface V-momentum stress.

 Tile partition information for Grid 01:  0438x0318x0036  tiling: 007x004

     tile     Istr     Iend     Jstr     Jend     Npts

        0        1       62        1       79   176328
        1       63      125        1       79   179172
        2      126      188        1       79   179172
        3      189      251        1       79   179172
        4      252      314        1       79   179172
        5      315      377        1       79   179172
        6      378      438        1       79   173484
        7        1       62       80      159   178560
        8       63      125       80      159   181440
        9      126      188       80      159   181440
       10      189      251       80      159   181440
       11      252      314       80      159   181440
       12      315      377       80      159   181440
       13      378      438       80      159   175680
       14        1       62      160      239   178560
       15       63      125      160      239   181440
       16      126      188      160      239   181440
       17      189      251      160      239   181440
       18      252      314      160      239   181440
       19      315      377      160      239   181440
       20      378      438      160      239   175680
       21        1       62      240      318   176328
       22       63      125      240      318   179172
       23      126      188      240      318   179172
       24      189      251      240      318   179172
       25      252      314      240      318   179172
       26      315      377      240      318   179172
       27      378      438      240      318   173484


 Maximum halo size in XI and ETA directions:

               HaloSizeI(1) =     146
               HaloSizeJ(1) =     180
                TileSide(1) =      84
                TileSize(1) =    5628

 Activated C-preprocessing Options:

  SABGOM_H           SABGOM 3-Year Hindcast
  ANA_BSFLUX         Analytical kinematic bottom salinity flux.
  ANA_BTFLUX         Analytical kinematic bottom temperature flux.
  ANA_RAIN           Analytical rain fall rate.
  ANA_SSFLUX         Analytical kinematic surface salinity flux.
  ASSUMED_SHAPE      Using assumed-shape arrays.
  AVERAGES           Writing out time-averaged fields.
  BULK_FLUXES        Surface bulk fluxes parametererization.
  CURVGRID           Orthogonal curvilinear grid.
  DJ_GRADPS          Parabolic Splines density Jacobian (Shchepetkin, 2002).
  DOUBLE_PRECISION   Double precision arithmetic.
  EAST_FSCHAPMAN     Eastern edge, free-surface, Chapman condition.
  EAST_M2FLATHER     Eastern edge, 2D momentum, Flather condition.
  EAST_M3NUDGING     Eastern edge, 3D momentum, passive/active outflow/inflow.
  EAST_M3RADIATION   Eastern edge, 3D momentum, radiation condition.
  EAST_TNUDGING      Eastern edge, tracers, passive/active outflow/inflow.
  EAST_TRADIATION    Eastern edge, tracers, radiation condition.
  INLINE_2DIO        Processing 3D IO level by level to reduce memory needs.
  LMD_CONVEC         LMD convective mixing due to shear instability.
  LMD_MIXING         Large/McWilliams/Doney interior mixing.
  LMD_NONLOCAL       LMD convective nonlocal transport.
  LMD_RIMIX          LMD diffusivity due to shear instability.
  LMD_SKPP           KPP surface boundary layer mixing.
  LONGWAVE           Compute net longwave radiation internally.
  MASKING            Land/Sea masking.
  MIX_S_UV           Mixing of momentum along constant S-surfaces.
  MPI                MPI distributed-memory configuration.
  NONLINEAR          Nonlinear Model.
  NONLIN_EOS         Nonlinear Equation of State for seawater.
  NORTHERN_WALL      Wall boundary at Northern edge.
  PERFECT_RESTART    Processing perfect restart variables.
  POWER_LAW          Power-law shape time-averaging barotropic filter.
  PROFILE            Time profiling activated .
  RADIATION_2D       Use tangential phase speed in radiation conditions.
  !RST_SINGLE        Double precision fields in restart NetCDF file.
  SALINITY           Using salinity.
  SOLAR_SOURCE       Solar Radiation Source Term.
  SOLVE3D            Solving 3D Primitive Equations.
  SOUTH_FSCHAPMAN    Southern edge, free-surface, Chapman condition.
  SOUTH_M2FLATHER    Southern edge, 2D momentum, Flather condition.
  SOUTH_M3NUDGING    Southern edge, 3D momentum, passive/active outflow/inflow.
  SOUTH_M3RADIATION  Southern edge, 3D momentum, radiation condition.
  SOUTH_TNUDGING     Southern edge, tracers, passive/active outflow/inflow.
  SOUTH_TRADIATION   Southern edge, tracers, radiation condition.
  SPLINES            Conservative parabolic spline reconstruction.
  TCLIMATOLOGY       Processing tracer climatology data.
  TCLM_NUDGING       Nudging toward tracer climatology.
  TS_U3HADVECTION    Third-order upstream bias horizontal advection of tracers.
  TS_SVADVECTION     Parabolic splines vertical advection of tracers.
  TS_PSOURCE         Tracers point sources and sinks.
  UV_ADV             Advection of momentum.
  UV_COR             Coriolis term.
  UV_U3HADVECTION    Third-order upstream bias advection of momentum.
  UV_QDRAG           Quadratic bottom stress.
  UV_PSOURCE         Mass point sources and sinks.
  UV_VIS2            Harmonic mixing of momentum.
  VAR_RHO_2D         Variable density barotropic mode.
  WESTERN_WALL       Wall boundary at Western edge.

 INITIAL: Configurating and initializing forward nonlinear model ...


 Vertical S-coordinate System:

 level   S-coord     Cs-curve          at_hmin  over_slope     at_hmax

    36   0.0000000   0.0000000           0.000       0.000       0.000
    35  -0.0277778  -0.0019878          -0.139      -5.720     -11.300
    34  -0.0555556  -0.0042675          -0.278     -12.258     -24.239
    33  -0.0833333  -0.0069437          -0.417     -19.910     -39.404
    32  -0.1111111  -0.0101452          -0.556     -29.037     -57.519
    31  -0.1388889  -0.0140312          -0.694     -40.086     -79.477
    30  -0.1666667  -0.0187972          -0.833     -53.605    -106.376
    29  -0.1944444  -0.0246815          -0.972     -70.263    -139.554
    28  -0.2222222  -0.0319695          -1.111     -90.862    -180.614
    27  -0.2500000  -0.0409944          -1.250    -116.338    -231.426
    26  -0.2777778  -0.0521319          -1.389    -147.744    -294.099
    25  -0.3055556  -0.0657822          -1.528    -186.205    -370.882
    24  -0.3333333  -0.0823381          -1.667    -232.823    -463.979
    23  -0.3611111  -0.1021338          -1.806    -288.537    -575.268
    22  -0.3888889  -0.1253769          -1.944    -353.928    -705.912
    21  -0.4166667  -0.1520732          -2.083    -429.014    -855.945
    20  -0.4444444  -0.1819651          -2.222    -513.072   -1023.921
    19  -0.4722222  -0.2145100          -2.361    -604.577   -1206.794
    18  -0.5000000  -0.2489214          -2.500    -701.323   -1400.146
    17  -0.5277778  -0.2842780          -2.639    -800.722   -1598.805
    16  -0.5555556  -0.3196767          -2.778    -900.240   -1797.701
    15  -0.5833333  -0.3543864          -2.917    -997.823   -1992.728
    14  -0.6111111  -0.3879574          -3.056   -1092.209   -2181.362
    13  -0.6388889  -0.4202649          -3.194   -1183.048   -2362.901
    12  -0.6666667  -0.4514899          -3.333   -1270.848   -2538.363
    11  -0.6944444  -0.4820609          -3.472   -1356.812   -2710.152
    10  -0.7222222  -0.5125827          -3.611   -1442.638   -2881.665
     9  -0.7500000  -0.5437741          -3.750   -1530.344   -3056.937
     8  -0.7777778  -0.5764230          -3.889   -1622.141   -3240.394
     7  -0.8055556  -0.6113613          -4.028   -1720.366   -3436.704
     6  -0.8333333  -0.6494565          -4.167   -1827.454   -3650.740
     5  -0.8611111  -0.6916165          -4.306   -1945.952   -3887.599
     4  -0.8888889  -0.7388018          -4.444   -2078.560   -4152.675
     3  -0.9166667  -0.7920448          -4.583   -2228.173   -4451.763
     2  -0.9444444  -0.8524713          -4.722   -2397.954   -4791.185
     1  -0.9722222  -0.9213261          -4.861   -2591.396   -5177.930
     0  -1.0000000  -1.0000000          -5.000   -2812.404   -5619.808

 Time Splitting Weights: ndtfast =  30    nfast =  42


 ndtfast, nfast =   30  42   nfast/ndtfast = 1.40000

 Centers of gravity and integrals (values must be 1, 1, approx 1/2, 1, 1):

    1.000000000000 1.047601458608 0.523800729304 1.000000000000 1.000000000000

 Power filter parameters, Fgamma, gamma =  0.28400   0.18933

 Minimum X-grid spacing, DXmin =  5.84518929E+00 km
 Maximum X-grid spacing, DXmax =  7.09321009E+00 km
 Minimum Y-grid spacing, DYmin =  4.99958718E+00 km
 Maximum Y-grid spacing, DYmax =  5.27049616E+00 km
 Minimum Z-grid spacing, DZmin =  1.38888595E-01 m
 Maximum Z-grid spacing, DZmax =  4.41877891E+02 m

 Minimum barotropic Courant Number =  1.67762600E-02
 Maximum barotropic Courant Number =  5.85791511E-01
 Maximum Coriolis   Courant Number =  2.74759622E-02


 NLM: GET_STATE - Read state initial conditions,               t =   52791.0000
 ...
 ... 

 Maximum grid stiffness ratios:  rx0 =   4.492189E-01 (Beckmann and Haidvogel)
                                 rx1 =   1.550243E+01 (Haney)


 Initial basin volumes: TotVolume =  7.08080639348850E+15 m3
                        MinVolume =  4.09865863294175E+06 m3
                        MaxVolume =  1.51778869573190E+10 m3
                          Max/Min =  3.70313517581856E+03


NL ROMS/TOMS: started time-stepping:( TimeSteps: 00000001 - 00324288)

    GET_NGFLD   - river runoff mass transport,                 t =   52792.0000
                   (File: SABGOM_H_tide_20030601.nc, Rec=1250, Index=1)
                   (Tmin=      51543.0000 Tmax=      54186.0000)
                   (Min = -1.96954702E+04 Max =  6.68277580E+02)
...
...
   STEP  time[DAYS]  KINETIC_ENRG    POTEN_ENRG    TOTAL_ENRG   NET_VOLUME  trd

      0 52791.00000  1.659717E-02  1.917084E+04  1.917086E+04  7.093434E+15   0
      DEF_HIS   - creating history file: /share3/lmkli/SABGOM/NARR_OUT/sabgom_his_0001.nc
      WRT_HIS   - wrote history  fields (Index=1,1) into time record = 0000001
      DEF_AVG   - creating average file: /share3/lmkli/SABGOM/NARR_OUT/sabgom_avg_0001.nc
...
...
                   (Tmin=      52640.0000 Tmax=      53005.0000)
                   (Min =  5.58790801E-01 Max =  9.98642064E-01)
     60 52791.20833  1.638506E-02  1.917090E+04  1.917092E+04  7.093346E+15   0
Signal received, cleaning up temporary files and exiting ...

TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
==== ========== ================ ======================= ===================
0001 blade21-3 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0002 blade21-3 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0003 blade21-3 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0004 blade21-3 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0005 blade21-1 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0006 blade21-1 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0007 blade21-1 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0008 blade21-2 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0009 blade21-2 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0010 blade21-2 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0011 blade21-2 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0012 blade21-1 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0013 blade21-11 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0014 blade21-11 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0015 blade21-11 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0016 blade21-11 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0017 blade21-14 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0018 blade21-14 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0019 blade21-14 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0020 blade21-13 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0021 blade21-13 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0022 blade21-13 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0023 blade21-13 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0024 blade21-14 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0025 blade21-12 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0026 blade21-12 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0027 blade21-12 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:54
0028 blade21-12 mpich_mx_wrapper Signaled (SIGSEGV) 10/31/2007 21:20:37


------------------------------------------------------------
Sender: LSF System <lsfadmin>
Subject: Job 30399: </bin> Exited

Job </bin> was submitted from host <login03> by user <lmingku>.
Job was executed on host(s) <4>, in queue <he>, as user <lmingku>.
<4>
<4>
<4>
<4>
<4>
<4>
</home> was used as the home directory.
</home> was used as the working directory.
Started at Wed Oct 31 21:18:46 2007
Results reported at Wed Oct 31 21:20:54 2007
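
A quick sanity check on the log above, as a minimal sketch: the last diagnostic line printed before the SIGSEGV is step 60 at t = 52791.20833 days, which is exactly nRST steps after dstart. That is consistent with the crash happening around the first restart write, though it does not prove it, since ninfo is also 60 and nothing is printed between steps 0 and 60. The helper name `model_time_days` is made up for this illustration.

```python
SECONDS_PER_DAY = 86400.0

def model_time_days(dstart_days, step, dt_seconds):
    """Model time in days after `step` baroclinic steps of length dt."""
    return dstart_days + step * dt_seconds / SECONDS_PER_DAY

# Values copied from the log above.
dstart = 52791.0   # dstart (days)
dt = 300.0         # dt (s)
nrst = 60          # nRST (steps between restart writes)

t_first_restart = model_time_days(dstart, nrst, dt)
print(f"{t_first_restart:.5f}")  # 52791.20833
```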

lmkli
Posts: 24
Joined: Wed Aug 02, 2006 1:21 pm
Location: TAMU

perfect_restart in upwelling case

#4 Unread post by lmkli »

The PERFECT_RESTART option works well in the upwelling case.
What does this indicate about my case?

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

#5 Unread post by jcwarner »

Well, I am not sure. Here are some things to try.
1) Does your app work OK without perfect restart? (That is, does it work with just the normal restart?)
2) Did anything get written to the restart file? Was a restart file actually created?
3) Creating the restart file is one step, and writing to it is another. So it is important to check first whether the file was created. Then, if it was created, did anything get written? Some values may be OK and some bad. Can you dig into these issues?
4) Compare what is different between your setup and upwelling. You said upwelling worked. What is different? Maybe try your app with GLS (just to check the restart issue; I am not talking physics here, just computer software issues).
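
The check in item 3 (file created versus records actually written) can be sketched without any netCDF library, assuming the restart file is in netCDF classic or 64-bit-offset format. In those formats the header starts with the bytes `CDF` plus a version byte, followed by a 4-byte big-endian record count. Whether a given ROMS build writes this format is an assumption here; HDF5-based netCDF-4 files are not handled. The function name is hypothetical.

```python
import struct

def netcdf_classic_numrecs(path):
    """Return the record count stored in a netCDF classic-format header.

    Returns None when the header marks the count as indeterminate
    (the "streaming" value 0xFFFFFFFF). Raises ValueError when the file
    is not in classic (CDF-1) or 64-bit-offset (CDF-2) format.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    # Magic number: b"CDF" followed by version byte 1 or 2.
    if len(header) < 8 or header[:3] != b"CDF" or header[3] not in (1, 2):
        raise ValueError("not a netCDF classic/64-bit-offset file")
    # numrecs: length of the unlimited (record) dimension, big-endian.
    (numrecs,) = struct.unpack(">I", header[4:8])
    return None if numrecs == 0xFFFFFFFF else numrecs
```

A result of 0 would mean the file was defined but no time record was ever written to it. Running `ncdump -h` on the file gives the same information, along with the list of defined variables.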

lmkli
Posts: 24
Joined: Wed Aug 02, 2006 1:21 pm
Location: TAMU

#6 Unread post by lmkli »

> 1) does your app work ok without perfect restart ? (so does it work ok with just the normal restart)

Yes, it works without perfect restart.

> 2) did anything get written to the restart file? was a restart file actually created?

Yes, the restart file was created, and it had something in it.

> 3) the create rst is a ... ... bad. Can you dig into these issues?

I am digging into it. If I find any new information, I will let you know.

> 4) compare what is different with your setup to that of upwelling. You said upwelling worked. What is different? Maybe try your app with gls.

I will look at how my setup differs from upwelling with regard to perfect restart, and I will also try GLS.

THANKS!

lmkli
Posts: 24
Joined: Wed Aug 02, 2006 1:21 pm
Location: TAMU

#7 Unread post by lmkli »

New results:

In the perfect_restart case, the restart file was created, but no variables were written into it.

I also compared my .h file with the upwelling one, and I put the following lines into my .h file:
#ifdef PERFECT_RESTART
# undef AVERAGES
# undef DIAGNOSTICS_BIO
# undef DIAGNOSTICS_TS
# undef DIAGNOSTICS_UV
# define OUT_DOUBLE
#endif
It didn't work.

Finally, I changed the mixing scheme to GLS; it still didn't work.

jmrogers
Posts: 9
Joined: Tue Jan 31, 2006 9:15 pm
Location: University of Rhode Island GSO

Similar issue - AVERAGES_DETIDE

#8 Unread post by jmrogers »

Hi all,

I just updated to the newest release and tried PERFECT_RESTART with similar issues.
Here's the application info from the outfile:

Code:

 Operating system : Linux
 CPU/hardware     : x86_64
 Compiler system  : ifort
 Compiler command : /opt/roms/mpich-1.2.7p1/bin/mpif90
 Compiler flags   : -i-static -ip -O2 -ip -O3 -xW -free

 Input Script  : risc7a.in

 SVN Root URL  : https://www.myroms.org/svn/src/trunk
 SVN Revision  : 147M

and the CPPDEFS:

  risc7              Narragansett Bay and RIS
  ANA_BSFLUX         Analytical kinematic bottom salinity flux.
  ANA_BTFLUX         Analytical kinematic bottom temperature flux.
  ANA_SSFLUX         Analytical kinematic surface salinity flux.
  ASSUMED_SHAPE      Using assumed-shape arrays.
  AVERAGES           Writing out time-averaged fields.
  AVERAGES_DETIDE    Writing out time-averaged detided fields.
  BULK_FLUXES        Surface bulk fluxes parametererization.
  CURVGRID           Orthogonal curvilinear grid.
  DIAGNOSTICS_TS     Computing and writing tracer diagnostic terms.
  DIAGNOSTICS_UV     Computing and writing momentum diagnostic terms.
  DIFF_GRID          Horizontal diffusion coefficient scaled by grid size.
  DIURNAL_SRFLUX     Modulate shortwave radiation by the local diurnal cycle.
  DJ_GRADPS          Parabolic Splines density Jacobian (Shchepetkin, 2002).
  DOUBLE_PRECISION   Double precision arithmetic.
  EAST_FSCHAPMAN     Eastern edge, free-surface, Chapman condition.
  EAST_M2FLATHER     Eastern edge, 2D momentum, Flather condition.
  EAST_M3RADIATION   Eastern edge, 3D momentum, radiation condition.
  EAST_TRADIATION    Eastern edge, tracers, radiation condition.
  FLOATS             Simulated Lagrangian drifters.
  GLS_MIXING         Generic Length-Scale turbulence closure.
  MASKING            Land/Sea masking.
  MIX_GEO_TS         Mixing of tracers along geopotential surfaces.
  MIX_S_UV           Mixing of momentum along constant S-surfaces.
  MPI                MPI distributed-memory configuration.
  NONLINEAR          Nonlinear Model.
  NONLIN_EOS         Nonlinear Equation of State for seawater.
  NORTHERN_WALL      Wall boundary at Northern edge.
  PERFECT_RESTART    Processing perfect restart variables.
  POWER_LAW          Power-law shape time-averaging barotropic filter.
  PROFILE            Time profiling activated .
  K_GSCHEME          Third-order upstream bias advection of TKE fields.
  !RST_SINGLE        Double precision fields in restart NetCDF file.
  SALINITY           Using salinity.
  SOLAR_SOURCE       Solar Radiation Source Term.
  SOLVE3D            Solving 3D Primitive Equations.
  SOUTH_FSCHAPMAN    Southern edge, free-surface, Chapman condition.
  SOUTH_M2FLATHER    Southern edge, 2D momentum, Flather condition.
  SOUTH_M3NUDGING    Southern edge, 3D momentum, passive/active outflow/inflow.
  SOUTH_M3RADIATION  Southern edge, 3D momentum, radiation condition.
  SOUTH_TNUDGING     Southern edge, tracers, passive/active outflow/inflow.
  SOUTH_TRADIATION   Southern edge, tracers, radiation condition.
  SPLINES            Conservative parabolic spline reconstruction.
  SPONGE             Enhanced horizontal mixing in the sponge areas.
  SSH_TIDES          Add tidal elevation to SSH climatology.
  STATIONS           Writing out station data.
  TS_A4HADVECTION    Fouth-order Akima horizontal advection of tracers.
  TS_A4VADVECTION    Fouth-order Akima vertical advection of tracers.
  TS_DIF2            Harmonic mixing of tracers.
  TS_PSOURCE         Tracers point sources and sinks.
  UV_ADV             Advection of momentum.
  UV_COR             Coriolis term.
  UV_U3HADVECTION    Third-order upstream bias advection of momentum.
  UV_LOGDRAG         Logarithmic bottom stress.
  UV_PSOURCE         Mass point sources and sinks.
  UV_TIDES           Add tidal currents to 2D momentum climatologies.
  UV_VIS2            Harmonic mixing of momentum.
  VAR_RHO_2D         Variable density barotropic mode.
  VISC_GRID          Horizontal viscosity coefficient scaled by grid size.
  WEST_FSCHAPMAN     Western edge, free-surface, Chapman condition.
  WEST_M2FLATHER     Western edge, 2D momentum, Flather condition.
  WEST_M3NUDGING     Western edge, 3D momentum, passive/active outflow/inflow.
  WEST_M3RADIATION   Western edge, 3D momentum, radiation condition.
  WEST_TRADIATION    Western edge, tracers, radiation condition.
Starting timestepping looks like this:

Code:

 
      0     0.00001  6.290587E-13  1.300771E+02  1.300771E+02  3.213179E+10   0
      DEF_HIS   - creating history file: out/risc7a_his_0001.nc
      WRT_HIS   - wrote history  fields (Index=1,1) into time record = 0000001
      DEF_AVG   - creating average file: out/risc7a_avg.nc
      DEF_DIAGS - creating diagnostics file: out/risc7a_dia.nc
      DEF_STATION - creating stations file: out/risc7a_sta.nc
      DEF_FLOATS  - creating floats file: out/risc7a_flt.nc
      1     0.00010  2.365654E-04  1.301077E+02  1.301079E+02  3.213680E+10   0
and the crash at the end looks like this:

Code:

   5961     0.51746  1.307970E-02  1.328896E+02  1.329027E+02  3.322129E+10   0
   5962     0.51755  1.306887E-02  1.328901E+02  1.329032E+02  3.322153E+10   0
p0_10096: p4_error: interrupt SIGSEGV: 11
p0_10096: (2296.429688) net_send: could not write to fd=4, errno = 32

which clearly pointed me to this point in the infile:
! Output history, average, diagnostic files parameters.

LDEFOUT == T
NHIS == 44710
NDEFHIS == 172800
NTSAVG == 1
NAVG == 5962
NDEFAVG == 0
NTSDIA == 1
NDIA == 57600
NDEFDIA == 0

So it crashes trying to write the average file. This is not a restarted run (NRREC=0), and the 3D fields in the average file are empty (null) according to MATLAB.
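
The reasoning above can be sketched as follows: given the last step printed before the SIGSEGV, see which output interval divides it. The sketch assumes an output fires whenever the step count is a positive multiple of its interval, which may differ in detail from ROMS's own bookkeeping, and `outputs_due` is a made-up helper name. The time step is only a rough estimate recovered from the rounded values in the log.

```python
def outputs_due(step, intervals):
    """Names of output streams whose interval divides `step` exactly.

    Assumes an output fires whenever the step count is a positive
    multiple of its interval; ROMS's own bookkeeping may differ.
    """
    return [name for name, n in intervals.items() if n > 0 and step % n == 0]

intervals = {"NHIS": 44710, "NAVG": 5962, "NDIA": 57600}  # from the input file above
last_step = 5962          # last step printed before the SIGSEGV
last_time_days = 0.51755  # model time printed on that line

print(outputs_due(last_step, intervals))  # ['NAVG']

# Rough time step implied by the (rounded) log values, and the
# averaging window that NAVG steps then span.
dt_seconds = last_time_days * 86400.0 / last_step
window_hours = intervals["NAVG"] * dt_seconds / 3600.0
print(round(dt_seconds, 1), round(window_hours, 2))  # 7.5 12.42
```

Only NAVG matches step 5962, and the implied averaging window of about 12.42 hours is close to one M2 tidal period.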

Did I miss anything in the CPPDEFS that AVERAGES_DETIDE needs?

Cheers,
Justin

arango
Site Admin
Posts: 1364
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

#9 Unread post by arango »

Well, you attached a lot of the standard output in this posting, but you are missing the most important information. Your problem is a parallel one and not an IO one. It seems that the mp_gather routine failed during MPI communications. What is your grid size and tile partition? The information that you included is pretty much irrelevant for this kind of problem.

Why is your baroclinic time-step so small? Obviously, you are trying to average M2 tides. This has nothing to do with perfect restart. If you check the ROMS svn trac tickets you will notice that the perfect restart ticket is still open because we still don't get perfect restart with sediment, biology, and other algorithms. I have postponed the debugging of this option. However, this is not your problem.

If you think that AVERAGES_DETIDE is the problem, turn off this switch and see what happens. All the applications are different. So this is a good way to determine what option has problems in your application.
Last edited by arango on Fri Feb 01, 2008 1:02 am, edited 1 time in total.

jmrogers
Posts: 9
Joined: Tue Jan 31, 2006 9:15 pm
Location: University of Rhode Island GSO

#10 Unread post by jmrogers »

The baroclinic timestep is very small due to complex topography in a shallow region where rivers are coming in. I was running with a 15-20 second dt, but river inputs of ~100 m3/sec during the spring rains were causing blowups. Because of that, I do a fair amount of restarting, since I hate to run with a dt that small, so perfect_restart will be handy. I am pushing ROMS into a shallow, complex estuary with high velocities, so stability is a persistent problem.

Here's the parallel line from the outfile.
Resolution, Grid 01: 0098x0198x015, Parallel Nodes: 24, Tiling: 006x004

It's a lot of nodes, but message passing is still only ~1% of time, if I'm reading that part of the outfile correctly.

I am running fine without averages_detide, just setting NAVG such that it outputs every 12.42 hours or so.

Thanks for the response,
Justin Rogers, M.S. Candidate
URI GSO

arango
Site Admin
Posts: 1364
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

#11 Unread post by arango »

This is complete overkill. I would never use more than 8 CPUs on a problem of this size. This may be your problem during IO. Sometimes it depends on the architecture. Try either 1x6 or 2x6 to have balanced threads. You have more partitions on the smaller dimension. Do you have a small cache?
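
To make the tiling comparison concrete, here is a minimal sketch for the 98x198 grid quoted earlier in the thread. It approximates the largest tile as the ceiling of points divided by tiles in each direction, which matches the largest tiles in the 438x318, 7x4 partition table of the first log (63 x 80) but is not ROMS's actual partitioning code. The perimeter-to-area ratio is only a crude proxy for communication cost per unit of computation, and both function names are invented for the illustration.

```python
import math

def tile_shape(lm, mm, ntile_i, ntile_j):
    """Approximate interior size of the largest tile (xi points, eta points)."""
    return math.ceil(lm / ntile_i), math.ceil(mm / ntile_j)

def halo_ratio(lm, mm, ntile_i, ntile_j):
    """Perimeter-to-area ratio of the largest tile: a crude proxy for
    how much halo exchange there is per unit of computation."""
    ni, nj = tile_shape(lm, mm, ntile_i, ntile_j)
    return 2 * (ni + nj) / (ni * nj)

lm, mm = 98, 198  # horizontal grid size quoted in the thread
for ntile_i, ntile_j in [(6, 4), (1, 6), (2, 6), (2, 4)]:
    ni, nj = tile_shape(lm, mm, ntile_i, ntile_j)
    ratio = halo_ratio(lm, mm, ntile_i, ntile_j)
    print(f"{ntile_i}x{ntile_j}: {ni} x {nj} points, ratio {ratio:.3f}")
```

With these assumptions the 6x4 tiling gives narrow 17 x 50 tiles and the highest ratio of the four, while 1x6 and 2x6 keep the tiles wider in the short direction.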

The problem is that your application is hanging up in one of the 24 nodes and we don't know why.

If you have only one tidal component (M2), your averaging window makes sense. Otherwise, there is a lot of superposition between tidal components and nonlinear coupling. In that case, this averaging window will not filter the tides and a more robust scheme is required. The AVERAGES_DETIDE option uses a least-squares fit that improves with time. We started to document how this is done in WikiROMS.
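
The point about the averaging window can be illustrated with a small worked example. For a sinusoid of period T averaged over a plain (boxcar) window W, the fraction of amplitude that survives, maximised over phase, is |sin(pi W / T)| / (pi W / T). The sketch below evaluates this for a 12.42-hour window. The constituent periods are standard tidal values, not numbers from this thread, and treating the NAVG average as a simple boxcar is an assumption. This is not the AVERAGES_DETIDE least-squares algorithm.

```python
import math

def boxcar_residual(period_hours, window_hours):
    """Largest fraction of a sinusoid's amplitude that survives a plain
    (boxcar) average over `window_hours`, maximised over phase."""
    half_phase = math.pi * window_hours / period_hours  # omega * W / 2
    return abs(math.sin(half_phase)) / half_phase

# Standard tidal constituent periods in hours (assumed, not from the thread).
periods = {"M2": 12.4206012, "S2": 12.0, "K1": 23.9344696, "O1": 25.8193417}
window = 12.42  # averaging window in hours

for name, period in periods.items():
    print(f"{name}: {boxcar_residual(period, window):.4f}")
```

M2 is removed almost completely, S2 leaks through at a few percent, and the diurnal constituents K1 and O1 keep more than half of their amplitude, which is why a single M2-length window is not enough once several constituents are present.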

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

#12 Unread post by kate »

Are you hoping to use PERFECT_RESTART while changing the timestep? That's what I think you were implying above. Note that the way PERFECT_RESTART works, it probably won't be perfect in the face of a variable timestep. I'm sure Hernan could chime in about that too.

jmrogers
Posts: 9
Joined: Tue Jan 31, 2006 9:15 pm
Location: University of Rhode Island GSO

#13 Unread post by jmrogers »

It does seem to be a parallelization problem. I reset NAVG to 100 for diagnostics and tried with a single processor: no crash. Then 2 (they're dual-CPU nodes): no crash. 4 nodes at 2x2 worked fine. 8 nodes at 2x4 or 1x8 crashed right at timestep 100. I'm rebooting nodes and doing a disk check for now, but my previous application, which has no AVERAGES_DETIDE, works fine with lots of nodes.

I do use restarts to change timestep, so PERFECT_RESTART might be a lost cause for me at the moment. My rivers vary wildly in this coastal application and cause what look like CFL errors, even when smoothed a bit.

I detide station files the smart way; the M2-period NAVG is just a rough look at what's going on, I suppose.

CPU cache? They're Opteron 248's with a 1MB cache. 2GB of RAM per node, nowhere near fully utilized of course. Efficiency certainly goes down with lots of nodes, but I usually use 16 CPU's on 8 nodes to good effect.

-Justin
