CPU time and MPI issues with nested grids

Tomasz · #1 · Fri Aug 01, 2014 12:42 pm

We have developed a nested (refined by a factor of 3) grid application within our existing operational model of the west coast of Ireland. The grid sizes are as follows:

Parent : 638 x 438 x 20
Child: 198 x 249 x 20
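
As a rough yardstick for what the nest might be expected to cost, the sketch below counts grid points. It assumes (this is not stated in the post) that the child takes three time steps per parent step, matching the refinement factor, and that cost scales with grid points times time steps.

```python
# Grid sizes from the post (xi x eta x levels).
parent = (638, 438, 20)
child = (198, 249, 20)

# ASSUMPTION: the child takes `refine` steps per parent step and cost is
# proportional to (grid points) x (time steps). Neither is stated in the post.
refine = 3

def points(shape):
    """Total number of grid points for an (nx, ny, nz) shape."""
    nx, ny, nz = shape
    return nx * ny * nz

parent_work = points(parent)           # work for one parent step
child_work = points(child) * refine    # child work during that parent step
ratio = (parent_work + child_work) / parent_work

print(f"parent points: {points(parent):,}")
print(f"child points:  {points(child):,}")
print(f"estimated nested / standalone cost: {ratio:.2f}x")
```

Under these assumptions the child adds roughly half of the parent's work, so any slowdown far beyond that would have to come from something other than the extra grid points.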

However, the nested configuration shows a very significant computational overhead, as well as an MPI/tiling problem, compared to the standalone parent. We tested the Rutgers release as well as COAWST, and also ROMS_AGRIF for comparison. The run times are reported below:

ROMS:

PARENT STANDALONE:
Tiling: 18 x 24 = 432 Timing: ~05 min / day
Tiling: 16 x 20 = 320 Timing: ~05 min / day
Tiling: 10 x 16 = 160 Timing: ~11 min / day
Tiling: 8 x 12 = 96 Timing: ~15 min / day
NESTED 1 WAY
Tiling: 18 x 24 = 432 Timing: ~1hr 20 min / day
Tiling: 16 x 20 = 320 Timing: ~1hr 20 min / day
Tiling: 10 x 16 = 160 Timing: ~1hr 02 min / day
Tiling: 8 x 12 = 96 Timing: ~1hr 06 min / day
NESTED 2 WAY
Tiling: 18 x 24 = 432 Timing: ~2hr 10 min / day
Tiling: 16 x 20 = 320 Timing: ~2hr 08 min / day
Tiling: 10 x 16 = 160 Timing: ~1hr 32 min / day
Tiling: 8 x 12 = 96 Timing: ~1hr 32 min / day

COAWST

PARENT STANDALONE
Tiling: 20 x 24 = 480 Timing: ~4 min / day
NESTED 1 WAY
Tiling: 20 x 24 = 480 Timing: ~2hr 48 min / day
NESTED 2 WAY
Tiling: 20 x 24 = 480 Timing: ~2hr 48 min / day

ROMS_AGRIF

PARENT STANDALONE
I do not have the exact figure, but very similar to ROMS and COAWST
NESTED 1 WAY
Tiling: 24 x 20 = 480 Timing: ~18 min / day
Tiling: 20 x 20 = 400 Timing: ~18 min / day
Tiling: 10 x 12 = 120 Timing: ~18 min / day
Tiling: 8 x 8 = 64 Timing: ~28 min / day
NESTED 2 WAY
Tiling: 24 x 20 = 480 Timing: ~23 min / day

I would greatly appreciate any comments on two issues arising from the above:

1) Why are the computational costs so massive in ROMS and COAWST, whereas they are reasonable in AGRIF?

2) What could the issue be with the MPI / tiling when running the nested configuration (runs faster on fewer nodes)?
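
To put numbers on both questions, the following sketch converts the ROMS timings listed above into slowdown factors relative to the standalone parent. The values are copied from the table (hours converted to minutes); the dictionary and function names are only illustrative.

```python
# Timings in minutes per simulated day, keyed by total tile count,
# copied from the ROMS table in this post.
standalone = {432: 5, 320: 5, 160: 11, 96: 15}
one_way = {432: 80, 320: 80, 160: 62, 96: 66}
two_way = {432: 130, 320: 128, 160: 92, 96: 92}

def slowdown(nested, base):
    """Return the nested/standalone time ratio for each tile count."""
    return {n: nested[n] / base[n] for n in base}

for label, table in (("1-way", one_way), ("2-way", two_way)):
    for tiles, factor in slowdown(table, standalone).items():
        print(f"{label} {tiles:4d} tiles: {factor:5.1f}x slower")
```

The ratio grows with the tile count: for one-way nesting it goes from about 4x at 96 tiles to 16x at 432 tiles.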

I am also pasting the time profile stats from the COAWST run, which show that the model spends a disproportionate amount of time on reading and processing input data (and it is not the initialization, as the integration commences in normal time):

Nonlinear model elapsed time profile:

Initialization ................................... 7414.111 ( 0.4980 %)
OI data assimilation ............................. 0.084 ( 0.0000 %)
Reading of input data ............................ 1129394.223 (75.8601 %)
Processing of input data ......................... 1136092.318 (76.3100 %)
Computation of vertical boundary conditions ...... 1500.082 ( 0.1008 %)
Computation of global information integrals ...... 229.418 ( 0.0154 %)
Writing of output data ........................... 1986.356 ( 0.1334 %)
Model 2D kernel .................................. 10940.908 ( 0.7349 %)
2D/3D coupling, vertical metrics ................. 12390.746 ( 0.8323 %)
Omega vertical velocity .......................... 1538.468 ( 0.1033 %)
Equation of state for seawater ................... 12300.072 ( 0.8262 %)
KPP vertical mixing parameterization ............. 6061.499 ( 0.4071 %)
3D equations right-side terms .................... 226.410 ( 0.0152 %)
3D equations predictor step ...................... 887.124 ( 0.0596 %)
Pressure gradient ................................ 182.883 ( 0.0123 %)
Harmonic mixing of tracers, geopotentials ........ 266.205 ( 0.0179 %)
Harmonic stress tensor, S-surfaces ............... 164.894 ( 0.0111 %)
Corrector time-step for 3D momentum .............. 1336.560 ( 0.0898 %)
Corrector time-step for tracers .................. 2696.120 ( 0.1811 %)
Total: 2325608.482 156.2084

Thanks,
Tomasz
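
For anyone who wants to rank such a profile automatically, here is a minimal sketch that parses a few of the lines posted above and sorts them by percentage. The regular expression is an assumption based only on the layout shown in this post, not on any documented log format.

```python
import re

# A few lines copied from the COAWST profile in this post.
profile = """\
Initialization ................................... 7414.111 ( 0.4980 %)
Reading of input data ............................ 1129394.223 (75.8601 %)
Processing of input data ......................... 1136092.318 (76.3100 %)
Writing of output data ........................... 1986.356 ( 0.1334 %)
Model 2D kernel .................................. 10940.908 ( 0.7349 %)
Corrector time-step for tracers .................. 2696.120 ( 0.1811 %)
"""

# Layout: name, run of dots, elapsed value, "(", percent, "%)".
LINE = re.compile(r"^\s*(.+?)\s*\.{2,}\s*([\d.]+)\s*\(\s*([\d.]+)\s*%\)")

def parse_profile(text):
    """Return a list of (name, elapsed, percent) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            rows.append((m.group(1), float(m.group(2)), float(m.group(3))))
    return rows

rows = sorted(parse_profile(profile), key=lambda r: r[2], reverse=True)
for name, elapsed, pct in rows[:3]:
    print(f"{pct:8.4f} %  {elapsed:14.3f}  {name}")
```

On the lines above, the two input-data entries come out on top, far ahead of the 2D kernel.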

kate · #2 · Fri Aug 01, 2014 3:27 pm

Fascinating - thanks for posting this. Do others have similar experience?

alanberry · Wed Aug 27, 2014 10:37 am

All,
Apologies for the blatant bump of the above post, but we would greatly appreciate any insights the ROMS community can offer regarding the MPI/tiling discrepancies identified between the various options for nesting in ROMS (in all its incarnations).
Regards,
Alan.

arango · #4 · Mon Sep 21, 2015 6:53 pm

We need to take into account that ROMS_AGRIF has a reduced and simpler barotropic stepping engine (step2d.F), so it is more efficient. Therefore, we are not comparing apples with apples. The Rutgers version of step2d.F includes additional terms, so it is slower. In ROMS_AGRIF, all of those terms are included in the forcing terms (rufrc and rvfrc) and persisted over all barotropic steps. Recall that the barotropic engine is the more expensive part of the ROMS kernel.

It is interesting that other users are trying to compare with ROMS_AGRIF but can only do it with 1:3 refinement ratios. They get unstable solutions in ROMS_AGRIF with a 1:5 ratio or higher. I haven't confirmed this, since I have never run my applications or tests with ROMS_AGRIF. We have been able to get stable solutions with 1:5 and 1:7 ratios. I think this is because our design includes a full stencil evaluation of the governing equations in the contact areas and contact points.

I think that once I improve the MPI communications in the two-way nesting, our code will be more efficient. Currently, I am gathering the full data array from the donor grid so that the fine-to-coarse averaging is easy to perform, and that is expensive. I know what to do, but it is very tricky. It is on my to-do list.
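
To make the fine-to-coarse step concrete, here is a minimal serial sketch of block averaging for a given refinement ratio. It is not the ROMS implementation: it assumes a plain unweighted mean over each block and ignores land masks, area weights, and any MPI decomposition.

```python
def fine_to_coarse(fine, ratio):
    """Average each ratio x ratio block of `fine` into one coarse value.

    ASSUMPTION: plain unweighted mean over the block; a real model may
    weight by cell area or skip masked points.
    """
    ny, nx = len(fine), len(fine[0])
    if ny % ratio or nx % ratio:
        raise ValueError("fine grid size must be a multiple of the ratio")
    coarse = []
    for j in range(0, ny, ratio):
        row = []
        for i in range(0, nx, ratio):
            block = [fine[jj][ii]
                     for jj in range(j, j + ratio)
                     for ii in range(i, i + ratio)]
            row.append(sum(block) / len(block))
        coarse.append(row)
    return coarse

# A 6 x 6 fine field with ratio 3 gives a 2 x 2 coarse field.
fine = [[float(i + 10 * j) for i in range(6)] for j in range(6)]
coarse = fine_to_coarse(fine, 3)
print(coarse)
```

In this simplified form each coarse value depends only on the fine values directly beneath it.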

shchepet · #5 · Wed Sep 30, 2015 6:07 pm

Tomasz, you must be more specific: Hernan points to the differences in
step2d which date back a long time, as reflected in a post from 2005,
http://www.myroms.org/forum/viewtopic.php?f=19&t=280
still relevant today (?) -- simply put, the AGRIF code was updated with respect
to this matter, but the Rutgers code was not. However, these differences should
equally affect both nested and non-nested runs. On the contrary, you observe
very little difference or none at all:

ROMS_AGRIF

PARENT STANDALONE
I do not have the exact figure, but very similar to ROMS and COAWST
NESTED 1 WAY
Tiling: 24 x 20 = 480 Timing: ~18 min / day

Besides, AGRIF has about 1.5 times as many operations in the 3D part of the code because
of predictor-corrector stepping for the momentum equations, so things like pressure
gradient, EOS, and rhs2d are called twice per time step (only once in Rutgers).
The extra computational cost may be offset by the ability to use a larger time
step in AGRIF (which makes it more efficient in the end), but if you run both
codes with the same time step, it is simply a waste. So what is the time step size dt?
Mode splitting ratio ndtfast?
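
The trade-off above can be sketched as simple arithmetic: cost per simulated day is work per step times steps per day. The factor 1.5 comes from the paragraph above; the time step values are hypothetical and chosen only for illustration, and only the 3D part of the code is considered.

```python
def cost_per_day(ops_per_step, dt_seconds):
    """Relative cost of one simulated day: work per step x steps per day."""
    steps_per_day = 86400.0 / dt_seconds
    return ops_per_step * steps_per_day

# Relative work per 3D step: 1.0 for the cheaper scheme, 1.5 for the
# predictor-corrector scheme (factor quoted in the post).
# HYPOTHETICAL time steps: 60 s and 90 s are illustration values only.
base = cost_per_day(1.0, dt_seconds=60.0)
same_dt = cost_per_day(1.5, dt_seconds=60.0)
larger_dt = cost_per_day(1.5, dt_seconds=90.0)

print(f"same dt:        {same_dt / base:.2f}x the baseline cost")
print(f"1.5x larger dt: {larger_dt / base:.2f}x the baseline cost")
```

With the same dt the more expensive scheme costs 1.5x as much; it breaks even only once its time step is 1.5x larger, which is why dt and ndtfast matter for the comparison.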

Secondly, your profiling report shows the dominance of I/O operations way
beyond everything else. This needs to be tracked down. It is possible that
some kind of ballast was added to the code. Do you have diagnostics activated
by CPP? Another possibility (which I have personally observed many times) is the
pathetically inefficient way community clusters are set up in universities
with respect to disk file systems. A common practice is that faculty members
who participate (donate grant money) in a centrally-managed cluster also want to
have an attached private storage system to be accessed exclusively by their own
research group. They just go ahead and buy a high-performance server with a RAID
system and ask the system administrators to NFS-mount it on the cluster nodes.
The outcome as felt by the users (i.e. when running a job on cluster nodes while
reading/writing from/to such a filesystem) is usually between pathetic and
catastrophically slow (even though the server itself is a high-performance,
expensive system featuring hardware RAID and tons of memory). It is possible
that you have stumbled into such a situation, and this has nothing to do with your code
at all. But in any case it needs to be tracked down.

patrickm · #6 · Thu Oct 01, 2015 5:34 am

Just a clarification: ROMS_AGRIF also uses a full stencil at the interface (nesting with 1:1 gives the same results as no nesting at all). The model runs stably with a 1:5 ratio, which gives excellent results in 2-way nesting (1:3 is preferable for 1-way nesting, because the solutions are naturally less smooth across the interface). We can run with a 1:7 ratio but generally avoid it. We can also run 1:2 or 1:4 ratios ... Patrick

arango · #7 · Fri Oct 02, 2015 4:19 pm

It's good that you guys use the full stencil at the interface to compute the mass fluxes entering the refined grid. I haven't observed smoothness problems in my testing applications with one-way nesting and 1:5 ratios. Anyway, we should always use two-way nesting, since the one-way interaction has no effect whatsoever on the coarser donor grid because there is no feedback of information. Due to its different spatial and temporal resolutions, the finer grid better resolves the physical phenomena at smaller scales. The averaging of the finer grid solution to update the coarse grid values (fine to coarse) in two-way nesting keeps both solutions in line with each other. Therefore, we can run nested solutions for a long time.

I understand that the AGRIF library is generic and can handle all kinds of refinement ratios. However, since ROMS is a C-grid model, we should use odd ratios. That is, 1:3, 1:5, 1:7, and so on. John Warner has successful applications with 1:7 ratios. It all depends on where the refined grids are located with respect to major currents.
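
One geometric property that distinguishes odd ratios can be sketched as follows. This is only an illustration, and the post does not say that it is the reason for the recommendation: when a coarse cell is split into an odd number of fine cells, the coarse cell center coincides with a fine cell center; with an even number it falls on a fine cell edge instead.

```python
from fractions import Fraction

def fine_centers(ratio):
    """Centers of the `ratio` fine cells that subdivide a unit coarse cell."""
    return [Fraction(2 * k + 1, 2 * ratio) for k in range(ratio)]

def center_is_shared(ratio):
    """True if the coarse cell center (1/2) is also a fine cell center."""
    return Fraction(1, 2) in fine_centers(ratio)

for ratio in (2, 3, 4, 5, 7):
    print(ratio, center_is_shared(ratio))
```

The check is one-dimensional; on a staggered grid the same argument applies separately in each horizontal direction.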

pmaccc · #8 · Fri Dec 01, 2017 9:07 pm

I'm trying 2-way nesting in an existing run. I made the refined grid and contacts files using the MATLAB tools. Then I recompiled the executable using the NESTING flag. When I try to run it with mpirun I get a segmentation fault right away. I'm running on 144 cores across 6 nodes. One question is about the tiling specified in the .in file: should I set the tiling so that both the coarse and fine grids use all the cores? Also, when calling mpirun, do I specify the nodes to run on in the usual way (a list), or do I call the nodes out separately for each grid?

I will also try some experiments with dogbone. Thanks for your help.

kate · #9 · Fri Dec 01, 2017 10:08 pm

The mpirun stuff should remain the same. I no longer have the job script for running my one and only nested domain, but both grids had the same 2x16 tiling on 32 cores. ROMS runs the domains one at a time, computing the hand-shaking between them on the master process.

wilkin · #10 · Sat Dec 02, 2017 1:58 pm

should I set the tiling so that both the coarse and fine grids use all the cores?

Yes, this is required. The product NtileI times NtileJ must be the same in every nest.
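
A small pre-flight check for this rule might look like the sketch below. The function name and the example tilings are hypothetical; only the rule itself (every grid's NtileI times NtileJ must match the process count) comes from this thread.

```python
def check_nest_tiling(ntile_i, ntile_j, n_procs):
    """Check that every grid's NtileI * NtileJ equals the MPI process count.

    ntile_i, ntile_j: one entry per nested grid, coarse grid first.
    Returns a list of problem descriptions (empty if the tiling is valid).
    """
    problems = []
    for grid, (ni, nj) in enumerate(zip(ntile_i, ntile_j), start=1):
        if ni * nj != n_procs:
            problems.append(
                f"grid {grid}: {ni} x {nj} = {ni * nj}, expected {n_procs}"
            )
    return problems

# HYPOTHETICAL tilings for a two-grid run on 144 processes.
print(check_nest_tiling([12, 12], [12, 12], 144))   # both grids use 144
print(check_nest_tiling([12, 6], [12, 12], 144))    # grid 2 uses only 72
```

Running a check like this before submitting the job is cheaper than finding out from a crash at start-up.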

pmaccc · #11 · Fri Dec 08, 2017 10:35 pm

When I create a new grid using coarse2fine.m what is the correct number of rows and columns to put in the .in file for the refined grid? For a regular grid I just look at the size of a field on the rho-grid and subtract 2 from both dimensions to get the right number. Is it the same for the refinement grid or do I have to subtract more because there is a larger boundary region associated with the contact area? I ask because I keep getting segmentation faults when I try to run with 2-way nesting.

jcwarner · Fri Dec 08, 2017 11:09 pm

the child grid should be listed with the -2 in each direction as well.
try to compile in debug=on and run oceanG to see if you get more info.
-j
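
The subtract-2 rule can be written down as a tiny helper. The sketch assumes the usual ROMS naming, where Lm pairs with the xi direction and Mm with the eta direction, and the grid shape used in the example is hypothetical.

```python
def interior_dims(eta_rho, xi_rho):
    """Return (Lm, Mm) for a grid whose rho-point fields have shape
    (eta_rho, xi_rho), using the subtract-2 rule from this thread.

    ASSUMPTION: Lm counts interior points in xi and Mm in eta.
    """
    if eta_rho < 3 or xi_rho < 3:
        raise ValueError("grid too small to have interior points")
    return xi_rho - 2, eta_rho - 2

# HYPOTHETICAL refined grid: rho-point fields of shape (eta, xi) = (302, 242).
Lm, Mm = interior_dims(eta_rho=302, xi_rho=242)
print(f"Lm == {Lm}   Mm == {Mm}")
```

The same helper applies to the parent and the child grid, since the rule is the same for both.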

pmaccc · #13 · Mon Dec 11, 2017 9:08 pm

Thanks to everyone for your help. I did succeed in getting my realistic run to work with 2-way nesting. The problem is that the performance is poor. I'm running on 144 cores using the Intel compiler and MPI. The parent grid alone runs faster than 1 minute per hour, and the child grid alone in 3 minutes per hour. But when combined using 2-way nesting they slow down to 26 minutes per hour, about 6x slower than I would expect.
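
The "about 6x" figure follows from simple arithmetic, sketched here under the assumption that the two grids run one after the other so that their standalone costs add. The parent value is an upper bound, since the post only says "faster than 1 minute per hour".

```python
# Wall-clock minutes per simulated hour, from the post.
parent_alone = 1.0   # upper bound: "faster than 1 minute per hour"
child_alone = 3.0
nested = 26.0

# ASSUMPTION: without nesting overhead the costs would simply add.
expected = parent_alone + child_alone
overhead = nested / expected

print(f"expected ~{expected:.0f} min/hour, observed {nested:.0f} min/hour")
print(f"nested run is {overhead:.1f}x slower than the simple sum")
```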

wilkin · #14 · Mon Dec 11, 2017 10:12 pm

It would be interesting to see the elapsed time profile and message passage profile reports at the end of the log file.

Can you share those?

arango · #15 · Tue Dec 12, 2017 1:24 am

Maybe your application is not big enough to justify 144 cores, and it is slowing down because of excessive MPI communications. This is a critical issue that some users always ignore. There is always an optimal parallel partition for each ROMS application.
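
A quick way to see when a partition becomes too fine is to compare the halo a tile must exchange with the interior it computes. The sketch below uses the child grid size and two tilings from the first post, treats those sizes as interior dimensions, and assumes a halo width of 2 points. These are all assumptions made for illustration.

```python
import math

def tile_stats(lm, mm, ntile_i, ntile_j, halo=2):
    """Approximate tile size and halo-to-interior point ratio.

    ASSUMPTION: tiles are as even as possible and each carries a halo of
    `halo` points on every side; the width 2 is an illustration value.
    """
    ti = math.ceil(lm / ntile_i)          # tile width in points
    tj = math.ceil(mm / ntile_j)          # tile height in points
    interior = ti * tj
    with_halo = (ti + 2 * halo) * (tj + 2 * halo)
    return ti, tj, (with_halo - interior) / interior

# Child grid 198 x 249 from the first post, with two of its tilings.
for ni, nj in ((18, 24), (8, 12)):
    ti, tj, ratio = tile_stats(198, 249, ni, nj)
    print(f"{ni:2d} x {nj:2d} tiles: {ti} x {tj} points, halo/interior = {ratio:.2f}")
```

With 432 tiles each tile is only about 11 x 11 points and its halo is almost as large as its interior, whereas with 96 tiles the halo is well under half of the interior.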

Now, if you are using a recent version of the code, you need to experiment to find the best CPP options for MPI communications on your architecture and MPI library version. The user now has options and decisions to make. For more information, check the trac ticket about the MPI communications update (5-Oct-2017).

pmaccc · #16 · Tue Dec 12, 2017 6:31 pm

John,

Here are the time reports from the log file. What stands out is that most of the total time is devoted to message passing between the two grids (I'm not sure why the numbers don't all add up to 100%).

Nonlinear model elapsed time profile, Grid: 01

  Allocation and array initialization ..............        72.738  ( 0.0015 %)
  Ocean state initialization .......................       747.678  ( 0.0150 %)
  Reading of input data ............................     11655.932  ( 0.2339 %)
  Processing of input data .........................     17108.063  ( 0.3432 %)
  Computation of vertical boundary conditions ......      1164.472  ( 0.0234 %)
  Computation of global information integrals ......       132.431  ( 0.0027 %)
  Writing of output data ...........................     52236.969  ( 1.0481 %)
  Model 2D kernel ..................................     36335.786  ( 0.7290 %)
  Tidal forcing ....................................     19276.110  ( 0.3867 %)
  2D/3D coupling, vertical metrics .................     15409.223  ( 0.3092 %)
  Omega vertical velocity ..........................      6782.946  ( 0.1361 %)
  Equation of state for seawater ...................      9548.830  ( 0.1916 %)
  Atmosphere-Ocean bulk flux parameterization ......      6201.497  ( 0.1244 %)
  GLS vertical mixing parameterization .............     67972.121  ( 1.3638 %)
  3D equations right-side terms ....................      4979.004  ( 0.0999 %)
  3D equations predictor step ......................     13907.891  ( 0.2790 %)
  Pressure gradient ................................      3432.519  ( 0.0689 %)
  Harmonic mixing of tracers, geopotentials ........      6084.001  ( 0.1221 %)
  Corrector time-step for 3D momentum ..............     14825.854  ( 0.2975 %)
  Corrector time-step for tracers ..................    144749.393  ( 2.9042 %)
                                              Total:    432623.458    8.6799

 Nonlinear model message Passage profile, Grid: 01

  Message Passage: 2D halo exchanges ...............     52855.011  ( 1.0605 %)
  Message Passage: 3D halo exchanges ...............     66809.430  ( 1.3404 %)
  Message Passage: 4D halo exchanges ...............     46588.617  ( 0.9347 %)
  Message Passage: data broadcast ..................     25652.818  ( 0.5147 %)
  Message Passage: data reduction ..................       112.406  ( 0.0023 %)
  Message Passage: data gathering ..................     27070.074  ( 0.5431 %)
  Message Passage: data scattering..................     13777.804  ( 0.2764 %)
  Message Passage: boundary data gathering .........     15120.260  ( 0.3034 %)
  Message Passage: point data gathering ............   1558668.617  (31.2722 %)
                                              Total:   1806655.038   36.2477

 All percentages are with respect to total time =      4984198.707


 Nonlinear model elapsed time profile, Grid: 02

  Allocation and array initialization ..............        72.738  ( 0.0015 %)
  Ocean state initialization .......................       746.848  ( 0.0150 %)
  Reading of input data ............................      2977.951  ( 0.0597 %)
  Processing of input data .........................     11720.352  ( 0.2352 %)
  Computation of vertical boundary conditions ......      2080.900  ( 0.0418 %)
  Computation of global information integrals ......       151.563  ( 0.0030 %)
  Writing of output data ...........................     69953.520  ( 1.4035 %)
  Model 2D kernel ..................................     51148.082  ( 1.0262 %)
  Tidal forcing ....................................         2.866  ( 0.0001 %)
  2D/3D coupling, vertical metrics .................     25443.594  ( 0.5105 %)
  Omega vertical velocity ..........................     19122.004  ( 0.3837 %)
  Equation of state for seawater ...................     26480.756  ( 0.5313 %)
  Atmosphere-Ocean bulk flux parameterization ......     12627.570  ( 0.2534 %)
  GLS vertical mixing parameterization .............    141120.453  ( 2.8314 %)
  3D equations right-side terms ....................      7677.668  ( 0.1540 %)
  3D equations predictor step ......................     27425.561  ( 0.5503 %)
  Pressure gradient ................................      6919.026  ( 0.1388 %)
  Harmonic mixing of tracers, geopotentials ........     12043.431  ( 0.2416 %)
  Corrector time-step for 3D momentum ..............     37494.648  ( 0.7523 %)
  Corrector time-step for tracers ..................    396066.929  ( 7.9466 %)
                                              Total:    851276.460   17.0798

 Nonlinear model message Passage profile, Grid: 02

  Message Passage: 2D halo exchanges ...............     74316.654  ( 1.4911 %)
  Message Passage: 3D halo exchanges ...............    133328.411  ( 2.6751 %)
  Message Passage: 4D halo exchanges ...............     79743.248  ( 1.5999 %)
  Message Passage: data broadcast ..................     60860.584  ( 1.2211 %)
  Message Passage: data reduction ..................       140.826  ( 0.0028 %)
  Message Passage: data gathering ..................   1107430.733  (22.2192 %)
  Message Passage: data scattering..................      4622.345  ( 0.0927 %)
  Message Passage: point data gathering ............   1194043.625  (23.9569 %)
                                              Total:   2654486.425   53.2588

arango · #17 · Tue Dec 12, 2017 6:35 pm

Yes, you need to select a better option for MPI communications, as I mentioned yesterday. Check the information on the svn trac ticket. If you are using too many processes for this application, the data exchanges due to nesting are a bottleneck. See the number for point data gathering. It explains the slowdown.
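
To see how the gathering entries compare with the rest, the sketch below ranks a few of the message passage percentages posted above and adds up the gathering ones. The values are copied from the two profiles. Since the reported percentages do not sum to 100%, the total should be read as indicative only.

```python
# Percent of total run time, copied from the message passage profiles
# for Grid 01 and Grid 02 posted in this thread (subset of entries).
message_pct = {
    ("Grid 01", "3D halo exchanges"): 1.3404,
    ("Grid 01", "data gathering"): 0.5431,
    ("Grid 01", "point data gathering"): 31.2722,
    ("Grid 02", "3D halo exchanges"): 2.6751,
    ("Grid 02", "data gathering"): 22.2192,
    ("Grid 02", "point data gathering"): 23.9569,
}

ranked = sorted(message_pct.items(), key=lambda kv: kv[1], reverse=True)
gather_total = sum(pct for (grid, name), pct in message_pct.items()
                   if "gathering" in name)

for (grid, name), pct in ranked[:3]:
    print(f"{pct:8.4f} %  {grid}  {name}")
print(f"sum of gathering entries: {gather_total:.4f} %")
```

The three largest entries are all gathering operations, each more than eight times larger than the biggest halo exchange listed.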

Dan_chan · #18 · Wed Mar 29, 2023 8:32 am

Hi,

I have a nesting error, and I am not sure if it is related to this topic.

I have two experiments, both with online nesting. One has two layers: the outer parent grid has '221*303' rho points (grdA), and its child grid (refinement factor = 3) has '362*254' (grdB). I tested it for one year and it runs. I set NtileI=13, NtileJ=8 for both grids.

The other has three layers: it adds a refined grid inside grdB (refinement factor = 3) with '505*392' rho points. But the three-layer case goes wrong. I tried 'NtileI=13, NtileJ=8' and 'NtileI=4, NtileJ=4', and the run blows up at different steps. What's more, the restart (rst) files differ. I plotted the density anomaly of the inner grid at the blow-up time for each.

NtileI=13, NtileJ=8: blow up after only 3 steps
NtileI=4,  NtileJ=4: blow up after 18 hours modelling

I have been stuck on this issue for a long time. I tried modifying the topography, but it does not seem to help. I don't know how to solve this problem. Could anyone help?

Many thanks in advance.

Dan_chan · Sat May 06, 2023 10:52 am

Hello everyone, I apologize for resurrecting this post. However, I believe I have identified where my previous issue occurred. It relates to the advection schemes: I realized that the new version of ROMS allows setting the advection scheme for each tracer with switches instead of a single CPP flag for all of them. I omitted the advection schemes of the inner grid in my roms.in, which was very careless. The absence of advection schemes for the inner grid caused the issue. Now the model runs when I supply the advection schemes for the inner grid. Although I am still unsure why it blew up at different steps previously, it is possible that the instability resulted from differences in server stability.
