PSTRID process striding broken in EAMxx gpu cases #3019

Open
amametjanov opened this issue Sep 27, 2024 · 2 comments
@amametjanov
Member

I'm getting run-time property-check errors with a non-default PSTRID and am hoping someone can take a look.

./cime/scripts/create_test SMS_D_P32x1.ne4_ne4.F2000-SCREAMv1-AQP1.pm-gpu_gnugpu.scream-output-preset-2

runs fine by default on 8 nodes at 4 tasks/node.
If I set the process stride to PSTRID=16 (still 4 tasks/node on 8 nodes):

./preview_run && ./pelayout
./xmlchange MAX_MPITASKS_PER_NODE=64
./xmlchange PSTRID=16
./case.setup -r
./preview_run && ./pelayout

I get the errors below.
A similar case works fine on CPUs:

/pscratch/sd/a/azamat/e3sm_scratch/pm-gpu/SMS_D_P32x1.ne4pg2_oQU480.WCYCLXX2010.pm-gpu_gnu.20240905/run-02-8x4x1-pstrid16-ok-2.624sypd/

Error:

  0: Using memory pool. Initial size: 4.92383GB ;  Grow size: 4.92383GB.
  0: NVIDIA A100-SXM4-40GB
  0: INFORM: Automatically inserting fence() after every parallel_for
  0: bfbhash>              0 8d32ee02e0000000 (Hommexx)
  0:
  0:  FAIL:
  0: false
  0: /global/u2/a/azamat/saul/scream/components/eamxx/src/share/atm_process/atmosphere_process.cpp:455
  0: Error! Failed post-condition property check (cannot be repaired).
  0:   - Atmosphere process name: p3
  0:   - Property check name: T_mid within interval [100, 500]
  0:   - Atmosphere process MPI Rank: 0
  0:   - Message: Check failed.
  0:   - check name: T_mid within interval [100, 500]
  0:   - field id: T_mid[Physics GLL] <double:ncol,lev>(30,72) [K]
  0:   - minimum:
  0:     - value: 1.46505e-09
  0:     - indices (w/ global column index): (106,16)
  0:     - lat/lon: (6.21885, 0)
  0:     - additional data (w/ local column index):
  0:
  0:      phis<ncol>(30)
  0:
  0:   phis(2)
  0:     0,
  0:
  0:      landfrac<ncol>(30)
  0:
  0:   landfrac(2)
  0:     0,
  0:
  0:     END OF ADDITIONAL DATA
  0:
  0:   - maximum:
  0:     - value: 0.017285
  0:     - indices (w/ global column index): (106,71)
  0:     - lat/lon: (6.21885, 0)
  0:     - additional data (w/ local column index):
  0:
  0:      phis<ncol>(30)
  0:
  0:   phis(2)
  0:     0,
  0:
  0:      landfrac<ncol>(30)
  0:
  0:   landfrac(2)
  0:     0,
  0:
  0:     END OF ADDITIONAL DATA

Path to that run-dir:

/pscratch/sd/a/azamat/e3sm_scratch/pm-gpu/SMS_D_P32x1.ne4_ne4.F2000-SCREAMv1-AQP1.pm-gpu_gnugpu.scream-output-preset-2.20240923/run-02-err-8x4x1-pstrid16/

This is with the Sep-4 version of master (42ab514).
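
For reference, the task placement implied by the two configurations above can be sketched as follows. This is only an illustration under stated assumptions: that task i lands on global slot ROOTPE + i * PSTRID, that slots are packed onto nodes MAX_MPITASKS_PER_NODE at a time, and that the default MAX_MPITASKS_PER_NODE on pm-gpu is 4 (inferred from the "4 tasks/node" figure above). The helper names `task_layout` and `tasks_per_node` are hypothetical and not part of CIME.

```python
from collections import Counter


def task_layout(ntasks, pstrid, max_tasks_per_node, rootpe=0):
    """Return a (global_slot, node_index) pair for each MPI task.

    Assumption: task i occupies slot rootpe + i * pstrid, and slots are
    assigned to nodes in blocks of max_tasks_per_node.
    """
    slots = [rootpe + i * pstrid for i in range(ntasks)]
    return [(slot, slot // max_tasks_per_node) for slot in slots]


def tasks_per_node(layout):
    """Count how many tasks end up on each node."""
    return Counter(node for _, node in layout)


# Default case: 32 tasks, stride 1, assumed 4 slots per node.
default = task_layout(ntasks=32, pstrid=1, max_tasks_per_node=4)
# Failing case from this issue: PSTRID=16, MAX_MPITASKS_PER_NODE=64.
strided = task_layout(ntasks=32, pstrid=16, max_tasks_per_node=64)

for name, layout in (("default", default), ("pstrid16", strided)):
    per_node = tasks_per_node(layout)
    print(name, "nodes:", len(per_node),
          "tasks/node:", sorted(set(per_node.values())))
```

Under these assumptions both configurations come out to 8 nodes with 4 tasks each, matching the counts reported above; the strided layout simply leaves 15 unused slots after each task.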

@PeterCaldwell
Contributor

Context - we need to change PSTRID to interleave atm and ocn processes on the same nodes, which would allow us to do coupled k-scale runs almost as quickly as we can do atm-only F cases right now. Thus I see this as a moderately high priority task.

@bartgol
Contributor

bartgol commented Oct 18, 2024

Since Sept 4th was a while ago, can we first confirm that the error still happens with current master?

Other thoughts:

  • does this happen regardless of I/O? That is, does the crash happen without any output stream?
  • during which timestep does the error show up? Any chance we can infer which subcycle iter of p3 this was? You may have to increase the log level (in driver options) to get a bit more info in atm.log.
  • does this happen for every non-default value of pstrid?
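
To make runs easier to compare while answering these questions (for example across several PSTRID values), a small script could pull the key fields out of the failure block. The sketch below is only a suggestion: it assumes the log keeps the exact layout shown in the error output above, and `parse_property_check_failure` is a hypothetical helper, not an EAMxx tool.

```python
import re

# Strips the "  0: " MPI-rank prefix seen on each line of the log excerpt.
RANK_PREFIX = re.compile(r"^\s*\d+:\s?")


def parse_property_check_failure(log_text):
    """Extract process, check name, bounds, and extrema from one failure block."""
    info = {}
    section = None  # tracks whether a "- value:" line belongs to min or max
    for raw in log_text.splitlines():
        line = RANK_PREFIX.sub("", raw).strip()
        if line.startswith("- Atmosphere process name:"):
            info["process"] = line.split(":", 1)[1].strip()
        elif line.startswith("- Property check name:"):
            info["check"] = line.split(":", 1)[1].strip()
        elif line.startswith("- minimum:"):
            section = "minimum"
        elif line.startswith("- maximum:"):
            section = "maximum"
        elif line.startswith("- value:") and section:
            info[section] = float(line.split(":", 1)[1])
    # Bounds are read from the "[lo, hi]" part of the check name.
    m = re.search(r"\[\s*([^,\]]+),\s*([^\]]+)\]", info.get("check", ""))
    if m:
        info["bounds"] = (float(m.group(1)), float(m.group(2)))
    return info


# Excerpt copied from the error output in this issue.
SAMPLE = """\
  0: Error! Failed post-condition property check (cannot be repaired).
  0:   - Atmosphere process name: p3
  0:   - Property check name: T_mid within interval [100, 500]
  0:   - minimum:
  0:     - value: 1.46505e-09
  0:   - maximum:
  0:     - value: 0.017285
"""

result = parse_property_check_failure(SAMPLE)
lo, hi = result["bounds"]
for key in ("minimum", "maximum"):
    status = "ok" if lo <= result[key] <= hi else "out of range"
    print(f'{result["process"]} {key}: {result[key]:g} ({status})')
```

For the excerpt above, both the minimum and the maximum of T_mid fall below the lower bound of 100 K, so the whole reported range is out of bounds rather than a single outlier.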

@rljacob rljacob changed the title PSTRID process striding PSTRID process striding broken in EAMxx Oct 18, 2024
@rljacob rljacob changed the title PSTRID process striding broken in EAMxx PSTRID process striding broken in EAMxx gpu cases Oct 18, 2024