I've successfully compiled and run ROMS many times, but have never encountered a problem like this. I hadn't recompiled in the last several releases; when I finally needed to, it took some troubleshooting: I had to revert Linux-ifort.mk to an older version that had my netCDF paths set up properly (when I tried putting them into the newest version, ROMS couldn't find netCDF and wouldn't compile). Once that was done, compilation proceeded as normal.
However, when I actually executed oceanM, I got error messages I'd never seen before, not in the ROMS logfile but in the job's error output. There were many more of these seg faults than I'll paste here; a sample follows:
[n0000:11863] *** Process received signal ***
[n0000:11863] Signal: Segmentation fault (11)
[n0000:11863] Signal code: Address not mapped (1)
[n0000:11863] Failing at address: 0x150
[n0000:11859] *** Process received signal ***
[n0000:11859] Signal: Segmentation fault (11)
[n0000:11859] Signal code: Address not mapped (1)
[n0000:11859] Failing at address: 0x150
[n0000:11863] [ 0] /lib64/libc.so.6 [0x2b06180932d0]
[n0000:11863] [ 1] oceanM(wclock_on_+0x97) [0x483baf]
[n0000:11863] [ 2] oceanM(distribute_mod_mp_mp_bcasti_0d_+0x24) [0x49c2de]
[n0000:11863] [ 3] oceanM(inp_par_+0x284) [0x42acf2]
[n0000:11863] [ 4] oceanM(ocean_control_mod_mp_roms_initialize_+0xbb) [0x423e9f]
[n0000:11863] [ 5] oceanM(MAIN__+0xfd) [0x423991]
[n0000:11863] [ 6] oceanM(main+0x2a) [0x423882]
[n0000:11863] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b0618080994]
[n0000:11863] [ 8] oceanM [0x4237a9]
[n0000:11863] *** End of error message ***
[n0000:11855] *** Process received signal ***
[n0000:11855] Signal: Segmentation fault (11)
[n0000:11855] Signal code: Address not mapped (1)
[n0000:11855] Failing at address: 0x150
[n0000:11855] [ 0] /lib64/libc.so.6 [0x2b640b8602d0]
[n0000:11855] [ 1] oceanM(wclock_on_+0x97) [0x483baf]
[n0000:11855] [ 2] oceanM(distribute_mod_mp_mp_bcasti_0d_+0x24) [0x49c2de]
[n0000:11855] [ 3] oceanM(inp_par_+0x284) [0x42acf2]
[n0000:11855] [ 4] oceanM(ocean_control_mod_mp_roms_initialize_+0xbb) [0x423e9f]
[n0000:11855] [ 5] oceanM(MAIN__+0xfd) [0x423991]
[n0000:11855] [ 6] oceanM(main+0x2a) [0x423882]
[n0000:11855] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b640b84d994]
[n0000:11855] [ 8] oceanM [0x4237a9]
[n0000:11855] *** End of error message ***
[n0000:11857] *** Process received signal ***
[n0000:11857] Signal: Segmentation fault (11)
[n0000:11857] Signal code: Address not mapped (1)
[n0000:11857] Failing at address: 0x150
[n0000:11857] [ 0] /lib64/libc.so.6 [0x2b9f2187d2d0]
[n0000:11857] [ 1] oceanM(wclock_on_+0x97) [0x483baf]
[n0000:11857] [ 2] oceanM(distribute_mod_mp_mp_bcasti_0d_+0x24) [0x49c2de]
[n0000:11857] [ 3] oceanM(inp_par_+0x284) [0x42acf2]
[n0000:11857] [ 4] oceanM(ocean_control_mod_mp_roms_initialize_+0xbb) [0x423e9f]
[n0000:11857] [ 5] oceanM(MAIN__+0xfd) [0x423991]
[n0000:11857] [ 6] oceanM(main+0x2a) [0x423882]
[n0000:11857] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b9f2186a994]
[n0000:11857] [ 8] oceanM [0x4237a9]
[n0000:11857] *** End of error message ***
[n0000:11865] *** Process received signal ***
[n0000:11865] Signal: Segmentation fault (11)
[n0000:11865] Signal code: Address not mapped (1)
[n0000:11865] Failing at address: 0x150
[n0000:11865] [ 0] /lib64/libc.so.6 [0x2b049d2342d0]
[n0000:11865] [ 1] oceanM(wclock_on_+0x97) [0x483baf]
[n0000:11865] [ 2] oceanM(distribute_mod_mp_mp_bcasti_0d_+0x24) [0x49c2de]
[n0000:11865] [ 3] oceanM(inp_par_+0x284) [0x42acf2]
[n0000:11865] [ 4] oceanM(ocean_control_mod_mp_roms_initialize_+0xbb) [0x423e9f]
[n0000:11865] [ 5] oceanM(MAIN__+0xfd) [0x423991]
[n0000:11865] [ 6] oceanM(main+0x2a) [0x423882]
[n0000:11865] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b049d221994]
[n0000:11865] [ 8] oceanM [0x4237a9]
[n0000:11865] *** End of error message ***
[n0000:11862] *** Process received signal ***
[n0000:11862] Signal: Segmentation fault (11)
[n0000:11862] Signal code: Address not mapped (1)
[n0000:11862] Failing at address: 0x150
[n0000:11862] [ 0] /lib64/libc.so.6 [0x2aecd73df2d0]
[n0000:11862] [ 1] oceanM(wclock_on_+0x97) [0x483baf]
[n0000:11862] [ 2] oceanM(distribute_mod_mp_mp_bcasti_0d_+0x24) [0x49c2de]
[n0000:11862] [ 3] oceanM(inp_par_+0x284) [0x42acf2]
[n0000:11862] [ 4] oceanM(ocean_control_mod_mp_roms_initialize_+0xbb) [0x423e9f]
[n0000:11862] [ 5] oceanM(MAIN__+0xfd) [0x423991]
[n0000:11862] [ 6] oceanM(main+0x2a) [0x423882]
[n0000:11862] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aecd73cc994]
[n0000:11862] [ 8] oceanM [0x4237a9]
[n0000:11862] *** End of error message ***
After the seg faults, the job's error output ended with:
[n0000.hadley:11853] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[n0000.hadley:11853] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_tm_module.c at line 572
[n0000.hadley:11853] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
mpirun noticed that job rank 0 with PID 11855 on node n0000.hadley exited on signal 11 (Segmentation fault).
[n0000.hadley:11853] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[n0000.hadley:11853] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_tm_module.c at line 603
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
[n0000.hadley:11854] OOB: Connection to HNP lost
I've never had things go so wrong that the ROMS logfile itself couldn't elucidate why things might be blowing up, but clearly something isn't right. (For the record, the ROMS logfile gets about two lines in and gives me a READ_PHYPAR error (can't find Ngrids), but that's presumably a product of oceanM's failure to communicate.)
I went ahead and repeated this process (compiling and attempting to execute) with several old, previously perfectly functional build scripts and input files, and all of them seemed to compile, and then came back with similar oceanM executable segmentation faults when run.
Please, does anyone have any ideas what's going wrong? Is the difference in formatting in the Linux-ifort.mk file in the new version that critical?
The new one looks like:
ifdef USE_NETCDF4
NC_CONFIG ?= nc-config
NETCDF_INCDIR ?= $(shell $(NC_CONFIG) --prefix)/include
LIBS := $(shell $(NC_CONFIG) --flibs)
else
NETCDF_INCDIR ?= /usr/local/include
NETCDF_LIBDIR ?= /usr/local/lib
LIBS := -L$(NETCDF_LIBDIR) -lnetcdf
endif
The older one I reverted to looks like:
ifdef USE_NETCDF4
NETCDF_INCDIR ?= /opt/intelsoft/netcdf4/include
NETCDF_LIBDIR ?= /opt/intelsoft/netcdf4/lib
HDF5_LIBDIR ?= /opt/intelsoft/hdf5/lib
else
NETCDF_INCDIR ?= /opt/intelsoft/netcdf/include
NETCDF_LIBDIR ?= /opt/intelsoft/netcdf/lib
endif
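A side note on the `?=` assignments that both versions use: in GNU make, `?=` only assigns when the variable is not already defined, and variables exported in the environment count as defined. So, as far as I understand it, paths exported before the build should take precedence over the defaults written in the file. Below is a minimal stand-alone sketch of that behaviour; `demo.mk` is a hypothetical file name used only for illustration, and nothing here is taken from the ROMS makefiles beyond the variable names and default paths quoted above.

```make
# demo.mk: hypothetical stand-alone file for illustration, not part of ROMS.
# "?=" assigns only if the variable is not already defined, and variables
# coming from the environment count as defined.
NETCDF_INCDIR ?= /usr/local/include
NETCDF_LIBDIR ?= /usr/local/lib

# ":=" expands immediately, so LIBS uses whichever NETCDF_LIBDIR won.
LIBS := -L$(NETCDF_LIBDIR) -lnetcdf

# Print the resolved values while the file is being read.
$(info NETCDF_INCDIR = $(NETCDF_INCDIR))
$(info NETCDF_LIBDIR = $(NETCDF_LIBDIR))
$(info LIBS = $(LIBS))

# Empty default target so make has something to build after printing.
all: ;
```

With the variables unset, `make -f demo.mk` should report the `/usr/local` defaults, while `NETCDF_LIBDIR=/opt/intelsoft/netcdf/lib make -f demo.mk` should report the `/opt/intelsoft` path in `LIBS`. Note that in the `USE_NETCDF4` branch of the new file, `LIBS` comes from `nc-config` rather than from `NETCDF_LIBDIR`, so there it would be `NC_CONFIG` (or whichever `nc-config` is first on the PATH) that decides which netCDF gets linked.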
Best,
Liz