MPI Tiling - Dumb Question

Bug reports, work arounds and fixes


winstns1
Posts: 7
Joined: Mon Feb 22, 2016 10:22 pm
Location: JHU/APL

MPI Tiling - Dumb Question

#1 Post by winstns1

I am trying to run ROMS compiled under Red Hat with the Intel compilers and MPI, and I am running into the following error:

Resolution, Grid 01: 898x762x90, Parallel Nodes: 1, Tiling: 16x16

ROMS/TOMS: Wrong choice of grid 01 partition or number of parallel nodes.
NtileI * NtileJ must be equal to the number of parallel nodes.
Change -np value to mpirun or
change domain partition in input script.
Found Error: 06 Line: 153 Source: ROMS/Utility/inp_par.F
Found Error: 06 Line: 111 Source: ROMS/Drivers/nl_ocean.h

So this would be 256 processors (16 x 16).
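For reference, this is roughly how I read that rule (the tiling lines below are paraphrased from my ocean_ET2S.in rather than copied verbatim):

# Tiling in ocean_ET2S.in:
#   NtileI == 16
#   NtileJ == 16
# NtileI * NtileJ = 256, so mpirun has to launch exactly 256 ranks:
mpirun -np 256 ./oceanM ocean_ET2S.in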

The submit script actually IS allocating 256 processors (8 nodes with 32 processors per node), so NtileI * NtileJ DOES equal the number of processors allocated:

This job runs on the following processors:
vn-064 (x32)  vn-063 (x32)  vn-062 (x32)  vn-061 (x32)  vn-060 (x32)  vn-059 (x32)  vn-058 (x32)  vn-057 (x32)
This job has allocated 32 processors per node.
This job has allocated 256 processors.
This job has allocated 8 nodes.


Has anyone else run into this issue? I am sure I am doing something silly on my end! Apologies if the answer is obvious. :)

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: MPI Tiling - Dumb Question

#2 Post by jcwarner

What does your mpirun command line look like? Maybe your cluster also needs some SLURM info, like
#SBATCH --ntasks=108 # Number of MPI ranks
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=36
or something ????
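For your case that would have to come out to 256 total tasks, e.g. something like (node/task counts are just a guess at what your cluster hands out):

#SBATCH --ntasks=256            # must equal NtileI * NtileJ (16 x 16)
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32
srun ./oceanM ocean_ET2S.in     # or: mpirun -np 256 ./oceanM ocean_ET2S.in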
-j

winstns1
Posts: 7
Joined: Mon Feb 22, 2016 10:22 pm
Location: JHU/APL

Re: MPI Tiling - Dumb Question

#3 Post by winstns1

Here is what gets called:

#//# Request 16 nodes with 16 processors per node (16 x 16 = 256 MPI ranks)
#PBS -l nodes=16:ppn=16

Now, our admins ignore this at the moment, and instead of giving me 16 nodes with 16 processors per node, they give me 8 nodes with 32 processors each.

Below is the relevant command. What I don't understand is whether I need to force the cluster to honor the 16 x 16 layout or whether 8 x 32 would also work; both give the correct total number of processors. I should add that when I run from the command line with mpirun -np 1 (and set NtileI and NtileJ both equal to 1), it works. If I change one of the tile values to 2 and then use -np 2, it does NOT work; it gives the tiling error.

So I am more than willing to believe it is something in my OpenMPI configuration. But I use this OpenMPI with many other codes (e.g. WRF) and it works fine.

Thanks again, in advance, for any help on this! I will figure it out eventually...

MPI_EXECUTABLE="./oceanM ocean_ET2S.in"

echo "========================================================"
echo "======= BEGIN RUN ========="
echo "========================================================"
#
#
# FOR RUNNING OUTSIDE OF TORQUE
#
if [ -z "${PBS_JOBID}" ]; then
    PBS_JOBID=$$
fi
#


# run the code
echo ""
echo "Running CODE"
# time mpirun ${MPI_EXECUTABLE} ${inputFile} > ${logFile} 2>&1
time mpirun ${MPI_EXECUTABLE} 2>&1
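One more thing I may try (not sure it is needed) is to spell out the rank count and the Torque host list explicitly, in case mpirun is not picking them up on its own:

# hypothetical variant of the launch line above; -np and -machinefile are
# standard OpenMPI/Torque options, and 256 = NtileI * NtileJ
time mpirun -np 256 -machinefile ${PBS_NODEFILE} ${MPI_EXECUTABLE} 2>&1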

winstns1
Posts: 7
Joined: Mon Feb 22, 2016 10:22 pm
Location: JHU/APL

Re: MPI Tiling - Dumb Question

#4 Post by winstns1

Quick Update:

After further testing, I am beginning to think there is an issue with my OpenMPI build. I will try recompiling OpenMPI and see if I can get this to work. I tested another application that I know works and got exactly the same error. Thanks for the initial reply; I will post again only if it turns out to be a ROMS issue rather than a local compilation problem on my end.
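In case it helps anyone later, the quick sanity checks I am using on the MPI install itself (independent of ROMS) are along these lines:

# should print one hostname per rank (4 lines here), not a single line
mpirun -np 4 hostname
# confirm which mpirun and which OpenMPI build are actually being picked up
which mpirun
ompi_info | head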

Best regards

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: MPI Tiling - Dumb Question

#5 Post by kate

Do your nodes indeed have 16 cores or 32 cores? Your admins might be trying to spawn 32 tasks on nodes with 16 cores, which can sometimes give good performance (I've heard), but when ROMS checks the numbers it can't figure out what to do.
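A quick way to check what a compute node really has is something like:

# logical CPUs, sockets, cores per socket, threads per core
lscpu | grep -E 'CPU\(s\)|Socket|Core|Thread'
# or simply count logical processors
grep -c ^processor /proc/cpuinfo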

winstns1
Posts: 7
Joined: Mon Feb 22, 2016 10:22 pm
Location: JHU/APL

Re: MPI Tiling - Dumb Question

#6 Post by winstns1

Hi Kate,

It turns out that my build of OpenMPI got messed up somehow; I don't know how. Once I recompiled it, everything worked fine. So all is good in ROMS world again! :)
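For anyone who hits the same thing, a generic OpenMPI rebuild against the Intel compilers looks roughly like this (a sketch only; the install prefix and any site-specific configure options are placeholders, not necessarily what I used):

# configure OpenMPI to use the Intel compilers; the prefix is just an example
./configure CC=icc CXX=icpc FC=ifort --prefix=$HOME/opt/openmpi
make -j 8
make install
# make sure this mpirun/mpif90 come first in PATH when rebuilding ROMS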
