
Test MAPL v2.46.3 in UFS weather model #2346

Open · Tracked by #2984
junwang-noaa opened this issue Jun 28, 2024 · 58 comments
Labels: enhancement (New feature or request)

Comments

@junwang-noaa
Collaborator

junwang-noaa commented Jun 28, 2024

Description

MAPL 2.46.2 has fixes for issue #2162. The UFS weather model needs to be tested with and updated to this version.

Update 2024-08-30: MAPL 2.46.2 has a bug. MAPL 2.46.3 should be installed and tested in the UFS weather model instead. The issue title has been updated.


@junwang-noaa
Collaborator Author

@lipan-NOAA Can you confirm this version of MAPL works in GOCART for GEFSv13? Thanks

@BrianCurtis-NOAA
Collaborator

@Hang-Lei-NOAA is MAPL 2.46.2 installed on Acorn/WCOSS2?

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Jul 1, 2024 via email

@lipan-NOAA
Collaborator

@Hang-Lei-NOAA Can you tell me where you installed it?

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Jul 2, 2024 via email

@ulmononian
Collaborator

Do you want this installed in spack-stack/1.6.0, and on which machine?

@jkbk2004
Collaborator

@DusanJovic-NOAA @junwang-noaa New MAPL and ESMF versions are available on Hercules and Orion for test and debug activities. @RatkoVasic-NOAA thanks for the installation!

Hercules: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/ue-esmf-8.6.1-mapl-2.46.3/install/modulefiles/Core
Orion: /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/ue-esmf-8.6.1-mapl-2.46.3/install/modulefiles/Core

@mathomp4

As an FYI, the MAPL 2.46.3 fix applies when using externally initialized MPI (which is something you all do, but we don't do internally).

We also had this notice to users:

Notice to users: External code should initialize MPI with MPI_THREAD_MULTIPLE as that is what MAPL expects. Also, code might need to call:

call ESMF_InitializePreMPI()

for all features of MAPL to be supported, namely if ESMF Single System Image (SSI) code is enabled. If users do not use SSI (or do not know what it is), this call is most likely not needed.

My guess is you do not need to call ESMF_InitializePreMPI() as I don't think even we internally quite use all the SSI code yet.
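
For reference, a minimal sketch of the ordering described above, assuming a host model that initializes MPI itself before handing control to ESMF/MAPL (the program and comments are illustrative, not the actual UFS driver):

program external_mpi_init
  use mpi
  use ESMF
  implicit none
  integer :: provided, ierr, rc

  ! Only needed if ESMF Single System Image (SSI) features are used; it must be
  ! called before MPI is initialized. Per the comment above, UFS most likely
  ! does not need it.
  call ESMF_InitializePreMPI(rc=rc)

  ! MAPL expects externally initialized MPI to request MPI_THREAD_MULTIPLE.
  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)

  ! Hand the existing communicator to ESMF.
  call ESMF_Initialize(mpiCommunicator=MPI_COMM_WORLD, rc=rc)

  ! ... create and run components ...

  call ESMF_Finalize(endflag=ESMF_END_KEEPMPI, rc=rc)
  call MPI_Finalize(ierr)
end program external_mpi_init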

@junwang-noaa
Collaborator Author

@weiyuan-jiang may I ask what compiler/mpi versions you want to test? Thanks

@mathomp4

mathomp4 commented Dec 2, 2024

@weiyuan-jiang may I ask what compiler/mpi versions you want to test? Thanks

@junwang-noaa Well for that version of MAPL, we would have been testing with:

  • Intel ifort 2021.6
  • Intel ifort 2021.12 (or .13)
  • GCC 13.2.0

We never had Intel ifort 2021.9 on any of our machines. Are any of these possible?

@junwang-noaa
Collaborator Author

@AlexanderRichert-NOAA do we have these Intel and GCC versions on Hera, Hercules, or Gaea?

@AlexanderRichert-NOAA
Collaborator

I can't get onto Hera at the moment but for the others:

  • Gaea (C5 & C6):
    • ifort 2021.6 (module load intel-classic/2022.1.0)
    • GCC 13.2 (module load gcc-native/13.2)
  • Hercules & Orion:
    • ifort 2021.12.0 (module load intel-oneapi-compilers/2024.1.0)

@mathomp4

mathomp4 commented Dec 3, 2024

As we have access to Hercules/Orion, I guess ifort 2021.12 is our best bet at the moment (though I'd have to imagine there must be a recent-ish gcc on there).

I'm fairly certain we can run with 2021.12. There were some bug fixes needed in parts of GEOS (I don't think in MAPL, but I can find out), so by the time we got them all in, 2021.13 was out, but I think 2021.12 works.

Plus, we have 2021.12 on discover, so we can do some matching if need be.

@junwang-noaa
Collaborator Author

@weiyuan-jiang @mathomp4 Which MAPL version would you like us to test in UFS? I assume the ESMF version is 8.6.1

@weiyuan-jiang
Collaborator

@junwang-noaa It doesn't matter. I can always replace MAPL as long as it can be built and run under the new compiler on Hercules.

@weiyuan-jiang
Collaborator

I am not sure how it was upgraded. It still shows 2021.9 for me.

git pull
module use modulefiles

module load ufs_hercules.intel
icc --version
icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
icc (ICC) 2021.9.0 20230302

ifort --version
ifort (IFORT) 2021.9.0 20230302

@DusanJovic-NOAA
Collaborator

@weiyuan-jiang I repeated the 'cpld_control_p8 intel' test and confirmed that version 2021.12.0 is used (see: /work2/noaa/stmp/djovic/stmp/djovic/FV3_RT/rt_1593997/cpld_control_p8_intel/out). The ifort command still shows 2021.9.0, but in the module file we set I_MPI_F90:

https://github.com/DusanJovic-NOAA/ufs-weather-model/blob/9296c85645444541558f052d1a48717ead654475/modulefiles/ufs_hercules.intel.lua#L24

so when the mpiifort wrapper is used, the compiler version is:

$ mpiifort --version
ifort: remark #10448: Intel(R) Fortran Compiler Classic (ifort) is now deprecated and will be discontinued late 2024. Intel recommends that customers transition now to using the LLVM-based Intel(R) Fortran Compiler (ifx) for continued Windows* and Linux* support, new language support, new language features, and optimizations. Use '-diag-disable=10448' to disable this message.
ifort (IFORT) 2021.12.0 20240222
Copyright (C) 1985-2024 Intel Corporation.  All rights reserved.

Alex was not able to use both the C and Fortran compilers from 2021.12.0 to build the libraries; that's why this hack was necessary (see the comments above).

Also, take a look at the err file from my last run: /work2/noaa/stmp/djovic/stmp/djovic/FV3_RT/rt_1593997/cpld_control_p8_intel/err

@weiyuan-jiang
Collaborator

weiyuan-jiang commented Dec 19, 2024

@DusanJovic-NOAA Thanks. Is the ESMF library also built with the same compiler? I think ESMF should be built with 2021.12 because it is ESMF's call that crashes.

@DusanJovic-NOAA
Collaborator

I think it is. @AlexanderRichert-NOAA can you confirm (check?) that ESMF is compiled with Fortran version 2021.12.0? Thanks.

@weiyuan-jiang
Collaborator

It seems to me that icc and icpc are still 2021.9. ESMF is written mostly in C and C++.

@DusanJovic-NOAA
Collaborator

According to this comment #2346 (comment), 2021.12.0 has only icx/icpx and ifort, and there were some issues using icx, if I understand correctly.

@weiyuan-jiang
Collaborator

If it is mixed, can it go back to the older icc and icpc, like 2021.6? We don't have 2021.9 on our system, so we cannot reproduce the issue.

@DavidHuber-NOAA
Collaborator

FWIW the only NOAA system I see with 2021.6 is Gaea (C5 and C6) via the intel/2022.1.0 module.

@DusanJovic-NOAA
Collaborator

I see this ESMF error in the PET* log files when I run the atmaero_control_p8 test:

20241223 080207.562 INFO             PET000 UFS Aerosols: Advancing from 2021-03-22T06:00:00 to 2021-03-22T06:12:00
20241223 080219.317 ERROR            PET000 ESMCI_Info.C:668 Info::erase() Not found  - [json.exception.out_of_range.403] key 'GridCornerLons:' not found
20241223 080219.317 ERROR            PET000 ESMCI_Info.C:688 Info::erase() Not found  - Internal subroutine call returned Error
20241223 080219.318 ERROR            PET000 ESMC_InfoCDef.C:243 ESMC_InfoErase() Not found  - Internal subroutine call returned Error
20241223 080219.318 ERROR            PET000 ESMF_Info.F90:2661 ESMF_InfoRemove() Not found  - Internal subroutine call returned Error
20241223 080219.318 ERROR            PET000 src/Superstructure/AttributeAPI/interface/ESMF_Attribute.F90:46022 ESMF_AttributeRemoveAttPackGrid( Not found  - Internal subroutine call returned Error
20241223 080219.318 ERROR            PET000 ESMCI_Info.C:668 Info::erase() Not found  - [json.exception.out_of_range.403] key 'GridCornerLats:' not found
20241223 080219.318 ERROR            PET000 ESMCI_Info.C:688 Info::erase() Not found  - Internal subroutine call returned Error
20241223 080219.318 ERROR            PET000 ESMC_InfoCDef.C:243 ESMC_InfoErase() Not found  - Internal subroutine call returned Error
20241223 080219.318 ERROR            PET000 ESMF_Info.F90:2661 ESMF_InfoRemove() Not found  - Internal subroutine call returned Error
20241223 080219.318 ERROR            PET000 src/Superstructure/AttributeAPI/interface/ESMF_Attribute.F90:46022 ESMF_AttributeRemoveAttPackGrid( Not found  - Internal subroutine call returned Error

Could this be the reason for the error we see in MAPL where the grid is not recognized as a cubed-sphere grid?

@DeniseWorthen
Collaborator

This looks to be the same message I saw in #1888

@natalie-perlin
Collaborator

The message "json exception" sounds a bit familiar... (#2371 (comment))

What helped resolve this issue on macOS (clang for C/C++ and gfortran for Fortran) was using the CXX linker instead of the Fortran one. The issue could arise when mixing C/C++ compilers and Fortran compilers from different vendors.
#2371 (comment)
The change is to use CXX instead of Fortran in CMakeLists.txt

set_target_properties(ufs_model PROPERTIES LINKER_LANGUAGE CXX)

This may not be the solution for this particular case, but it may be worth testing as an option.
(my 2c)

@DusanJovic-NOAA
Collaborator

The message "json exception" sounds a bit familiar... (#2371 (comment))

What helped resolve this issue on macOS (clang for C/C++ and gfortran for Fortran) was using the CXX linker instead of the Fortran one. The issue could arise when mixing C/C++ compilers and Fortran compilers from different vendors. #2371 (comment) The change is to use CXX instead of Fortran in CMakeLists.txt

set_target_properties(ufs_model PROPERTIES LINKER_LANGUAGE CXX)

This may not be the solution for this particular case, but it may be worth testing as an option. (my 2c)

I tried setting LINKER_LANGUAGE to CXX but compilation fails:

of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
icpc: command line warning #10006: ignoring unknown option '-threads'
ld: /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crt1.o: in function `_start':
(.text+0x1b): undefined reference to `main'
make[2]: *** [CMakeFiles/ufs_model.dir/build.make:153: ufs_model] Error 1
make[1]: *** [CMakeFiles/Makefile2:642: CMakeFiles/ufs_model.dir/all] Error 2
make: *** [Makefile:136: all] Error 2                                                                                                  

@natalie-perlin
Collaborator

The message "json exception" sounds a bit familiar... (#2371 (comment))
What helped to resolve this issue on MacOS (clang for C, CXX, and gfortran for Fortran) was using CXX linker instead of Fortran. The issue could arise when mixing C/CXX compilers and Fortran compilers from different vendors. #2371 (comment) The change is to use CXX instead of Fortran in CMakeLists.txt
set_target_properties(ufs_model PROPERTIES LINKER_LANGUAGE CXX)
This may not be the solution for this particular case, but it may worth testing as an option. (my 2c)

I tried setting LINKER_LANGUAGE to CXX but compilation fails:

of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
icpc: command line warning #10006: ignoring unknown option '-threads'
ld: /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crt1.o: in function `_start':
(.text+0x1b): undefined reference to `main'
make[2]: *** [CMakeFiles/ufs_model.dir/build.make:153: ufs_model] Error 1
make[1]: *** [CMakeFiles/Makefile2:642: CMakeFiles/ufs_model.dir/all] Error 2
make: *** [Makefile:136: all] Error 2                                                                                                  

Thank you very much for testing this option; good to know that this potential quick fix has other implications. I will look more into the "json.exception" error. It has something to do with the nlohmann/json library in C++, which may need help to be located or linked.

@weiyuan-jiang
Collaborator

When the error says "undefined reference to `main'", it usually means the link language should be Fortran, since the main program here is Fortran and the Fortran compiler driver supplies the program entry point.

@DusanJovic-NOAA
Collaborator

I think I found the reason why the model reports the error message "It only works for cubed-sphere grid" and ultimately fails.

122: pe=00122 FAIL at line=02772    Base_Base_implementation.F90             <It only works for cubed-sphere grid>
122: pe=00122 FAIL at line=02634    Base_Base_implementation.F90             <status=1>
122: pe=00122 FAIL at line=00695    SU2G_GridCompMod.F90                     <status=1>

The above error message is first printed in the "MAPL_GetGlobalHorzIJIndex" subroutine, which is called from "MAPL_GetHorzIJIndex", which in turn is called from the "Run1" routine in SU2G_GridCompMod. The error is printed because IM_WORLD*6 != JM_WORLD. IM_WORLD and JM_WORLD are returned from MAPL_GridGet as globalCellCountPerDim, where JM_WORLD is multiplied by 6 only if tileCount is 6, as it should be for cubed-sphere grids. The grid is passed as a subroutine argument from Run1 in SU2G_GridCompMod. I printed the value of tileCount in Run1, and it is 1, not 6. When I print tileCount of the grid in the 'Initialize' phase of the SU2G grid component, it is 6, which is consistent with the grid created in fv3atm and used in GOCART. In the "Initialize" phase the grid is retrieved from the grid component:

   call ESMF_GridCompGet (GC, grid=grid, name=COMP_NAME, config=universal_cfg, __RC__)

while in the "Run1" phase the grid is retrieved from the mapl object:

    call ESMF_GridCompGet (GC, NAME=COMP_NAME, __RC__)
    call MAPL_GetObjectFromGC (GC, mapl, __RC__)
    call MAPL_Get(mapl, grid=grid, __RC__)

The grids in "Initialize" and "Run1" are different because the grid from the MAPL object in 'Run1' is actually a subgrid computed from the original cubed-sphere grid passed from fv3atm when GOCART/MAPL run with OMP threading. The subgrids created for each of the OMP threads are not cubed-sphere grids (with tileCount == 6); instead they are just regular single-tile rectangular grids. This is done in 'make_subgrids_from_bounds' in generic/OpenMP_Support.F90 here.

I think the code that creates the subgrids should be updated to correctly create multi-tiled grids if the original (primary_grid) is multi-tiled. Or maybe the logic in MAPL_GridGet should be updated to not rely on the value of tileCount to compute globalCellCountPerDim.

As a quick fix, if I run the test without OMP threads in GOCART by setting "use_threads: .FALSE.", it does not print those "It only works for cubed-sphere grid" FAIL messages.

Unfortunately, it still fails with a floating point exception later in the forecast.
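
A small diagnostic sketch of the condition described above, assuming the MAPL_GridGet/ESMF_GridGet interfaces named in this comment (the helper routine itself is illustrative, not existing MAPL or GOCART code):

subroutine check_cubed_sphere(grid, is_cubed_sphere, rc)
  use ESMF
  use MAPL
  implicit none
  type(ESMF_Grid), intent(in)  :: grid
  logical,         intent(out) :: is_cubed_sphere
  integer,         intent(out) :: rc

  integer :: tileCount, dims(3)

  call ESMF_GridGet(grid, tileCount=tileCount, rc=rc)
  call MAPL_GridGet(grid, globalCellCountPerDim=dims, rc=rc)

  ! The fv3atm cubed-sphere grid: tileCount == 6 and dims(2) == 6*dims(1).
  ! The per-thread OpenMP subgrids: tileCount == 1, so this test fails and
  ! MAPL_GetGlobalHorzIJIndex reports "It only works for cubed-sphere grid".
  is_cubed_sphere = (tileCount == 6) .and. (dims(2) == 6*dims(1))
end subroutine check_cubed_sphere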

@weiyuan-jiang
Collaborator

@DusanJovic-NOAA Thanks for finding this. At this point I can confirm that multi-threading in MAPL (except in older versions like 2.40) does not work with GOCART.

@DusanJovic-NOAA
Collaborator

@DusanJovic-NOAA Thanks for finding this. At this point I can confirm that multi-threading in MAPL (except in older versions like 2.40) does not work with GOCART.

Thanks for confirming.

@DusanJovic-NOAA
Collaborator

The model is failing with the floating point exception because of grid mask inconsistencies between fv3atm/GOCART and what MAPL expects. MAPL computes remapping route handles in "create_route_handle" in base/MAPL_EsmfRegridder.F90 and sets 'dstMaskValues' to MAPL_MASK_OUT (which is 0):

          if (has_mask) dstMaskValues = [MAPL_MASK_OUT] ! otherwise unallocated

where "has_mask" is true if grid has ESMF_GRIDITEM_MASK item. The fv3grid has ESMF_GRIDITEM_MASK set, and the mask values are either 0=ocean or 1=land. Which means the grid points with mask = 0 are valid destination points and should not be masked out. Currently, by using hard-coded value MAPL_MASK_OUT constant it is assumed that all mask points with 0 are masked out, which is currently not the case in fv3atm and gocart when used with fv3atm. Maybe value can be passed via grid attributes or something like that. Or maybe in fv3atm we can change the mask values to something other than 0, for example, 1=land, 2=ocean, but I do not know if and how that will impact other coupled components and CMEPS.

Anyway, as a test I simply commented out the above line that sets the value of dstMaskValues, and the model finally finished the test without any error.

@DeniseWorthen
Collaborator

@DusanJovic-NOAA I do not think we want to change the current (0,1) mask values for ATM because that will have downstream impacts in CMEPS.

I can't even find the MAPL code you reference, but if you don't want to mask the destination anywhere, then why not provide a special value? CMEPS uses ispval_mask = -987987 ! spval for RH mask values
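
To illustrate the idea (a hedged sketch only, not a tested patch; has_mask and dstMaskValues are the names from the MAPL line quoted above): using a sentinel that no grid point ever carries means no destination point gets masked out, in the spirit of CMEPS's ispval_mask.

! Illustrative replacement for the hard-coded MAPL_MASK_OUT (= 0):
integer, parameter :: ispval_mask = -987987        ! sentinel value quoted above from CMEPS
if (has_mask) dstMaskValues = [ispval_mask]        ! no point carries this mask value, so nothing is masked out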

@DusanJovic-NOAA
Collaborator

@DusanJovic-NOAA I do not think we want to change the current (0,1) mask values for ATM because that will have downstream impacts in CMEPS.

I can't even find the MAPL code you reference, but if you don't want to mask the destination anywhere, then why not provide a special value? CMEPS uses ispval_mask = -987987 ! spval for RH mask values

The MAPL code I reference is here:
https://github.com/GEOS-ESM/MAPL/blob/1c55973c244834ed0d78c39964d2738c3d8f8a8b/base/MAPL_EsmfRegridder.F90#L1519

I do not understand where that special value should be provided, or how MAPL would use it.

@DeniseWorthen
Collaborator

I didn't mean that MAPL should somehow reach into CMEPS for the value. I was just pointing out that by setting a special value (one that is never encountered), you can map to all destination points.

@DusanJovic-NOAA
Collaborator

DusanJovic-NOAA commented Jan 2, 2025

I didn't mean that MAPL should somehow reach into CMEPS for the value. I was just pointing out that by setting a special value (one that is never encountered), you can map to all destination points.

Sorry, I still do not understand what should be set to a special value, a grid mask? The problem is that some of the fv3atm grid points have grid mask set to 0 and MAPL considers them as destination grid points that should be masked out.

The fv3atm grid mask is defined in this routine https://github.com/NOAA-EMC/fv3atm/blob/a7d46eee01a78f0915373ebc58c9b20ba14a6c36/atmos_model.F90#L3654

@DeniseWorthen
Collaborator

The dstMaskValue can be set to a special value.

I'm probably misunderstanding, since I've never looked at MAPL. But the issue is that when the destination is ATM, you want all destination points mapped, right? Or is the destination in this case the aerosol "grid/mesh" ?

@DusanJovic-NOAA
Collaborator

The dstMaskValue can be set to a special value.

Yes, it can be. But it is currently set to 0 (MAPL_MASK_OUT is 0). Here. That means any fv3atm destination point with grid mask 0 (all ocean points) will be masked out.

I'm probably misunderstanding, since I've never looked at MAPL. But the issue is that when the destination is ATM, you want all destination points mapped, right? Or is the destination in this case the aerosol "grid/mesh" ?

Yes. All fv3atm destination points should be mapped, i.e., none of them should have the grid mask set to 0.

If we cannot redefine the fv3atm grid mask to avoid using 0 for any grid point, which it seems we cannot do, then either we pass information to MAPL telling it not to use 0 to mask out destination points, or somewhere in fv3atm (or maybe in GOCART) we redefine the grid mask that is going to be passed to MAPL.

@DeniseWorthen
Collaborator

But why does the dstMaskValue have to be set using the values that ATM uses to define its mask? I realize that may be how the code is working, but I don't think it should be constrained like that.

@junwang-noaa
Collaborator Author

@weiyuan-jiang @tclune UFS uses a land-sea mask on the FV3 cubed-sphere grid (0 for ocean and 1 for land) for coupling purposes. May I ask if the following line (MAPL_MASK_OUT) can be updated so that mask value 0 in UFS won't be considered undefined?

          if (has_mask) dstMaskValues = [MAPL_MASK_OUT] ! otherwise unallocated
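
One hedged possibility, following the "pass it via grid attributes" idea mentioned earlier in this thread; the attribute name and the default handling here are illustrative, not existing MAPL behavior (has_mask, dstMaskValues, and status are the names from the surrounding MAPL routine):

! Sketch: let the incoming grid override the destination mask-out value, falling
! back to the current MAPL_MASK_OUT (= 0) when no attribute is set.
integer :: dst_mask_out
call ESMF_AttributeGet(grid, name='DestinationMaskOutValue', value=dst_mask_out, &
                       defaultValue=MAPL_MASK_OUT, rc=status)
if (has_mask) dstMaskValues = [dst_mask_out]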

@weiyuan-jiang
Collaborator

I think the constant value of MAPL_MASK_OUT should be updated

@DusanJovic-NOAA
Collaborator

With this change in GOCART I was able to make a successful (no floating point exception) run.

diff --git a/ESMF/UFS/Aerosol_Cap.F90 b/ESMF/UFS/Aerosol_Cap.F90
index b753afa..e21ed11 100644
--- a/ESMF/UFS/Aerosol_Cap.F90
+++ b/ESMF/UFS/Aerosol_Cap.F90
@@ -243,6 +243,7 @@ contains
     type(MAPL_CapOptions)     :: maplCapOptions
     type(Aerosol_InternalState_T) :: is
     type(Aerosol_Tracer_T), pointer :: trp
+    integer(kind=ESMF_KIND_I4), pointer  :: maskPtr(:,:)
 
     ! begin
     rc = ESMF_SUCCESS
@@ -338,6 +339,10 @@ contains
       file=__FILE__)) &
       return  ! bail out
 
+    call ESMF_GridGetItem(grid, itemflag=ESMF_GRIDITEM_MASK,   &
+                          staggerloc=ESMF_STAGGERLOC_CENTER, farrayPtr=maskPtr, _RC)
+    maskPtr = 1
+
     ! provide model grid to MAPL
     call cap % cap_gc % set_grid(grid, lm=nlev, _RC)

This change explicitly sets all grid mask points to 1 before passing the grid to MAPL.
