Orion compile fails intermittently with "cannot move" error for placing executable into tests directory #2121

Open
DeniseWorthen opened this issue Jan 29, 2024 · 14 comments
@DeniseWorthen
Collaborator

DeniseWorthen commented Jan 29, 2024

Description

Running the RTs from a clean environment sometimes fails at the final step of moving the executable into place. The error message generated is, for example:

mv: cannot move '/work/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_138020/compile_datm_cdeps_debug_intel/build_fv3_datm_cdeps_debug_intel/ufs_model' to a subdirectory of itself, '/work/noaa/marine/dworthen/ufs_hafsmom6/tests/fv3_datm_cdeps_debug_intel.exe'

To Reproduce:

The issue appears intermittent and I've only seen it for the datm_cdeps tests. For the case quoted above, the file fv3_datm_cdeps_debug_intel.exe exists in the tests directory so either it is being moved into place somewhere else or ???

Additional context

Output

DeniseWorthen added the bug label on Jan 29, 2024
@BinLiu-NOAA
Contributor

@DeniseWorthen I also encountered the same issue on Orion. Same here: after retrying 1-2 times, it goes through. I'm not sure whether this is a ufs-weather-model issue or an Orion file system issue, though.
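
For anyone who wants to script the "retry 1-2 times" workaround rather than rerun by hand, here is a minimal sketch. The function name `mv_with_retry` and its default try count and delay are my own assumptions for illustration, not anything that exists in compile.sh, and it only works around the failure without explaining it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of compile.sh): retry a move a few times.
# Usage: mv_with_retry SRC DST [MAX_TRIES] [DELAY_SECONDS]
mv_with_retry() {
  local src=$1 dst=$2 max_tries=${3:-3} delay=${4:-5}
  local attempt
  for (( attempt = 1; attempt <= max_tries; attempt++ )); do
    if mv -- "${src}" "${dst}"; then
      return 0
    fi
    echo "mv attempt ${attempt}/${max_tries} failed for ${src}" >&2
    # Do not sleep after the final attempt.
    if (( attempt < max_tries )); then
      sleep "${delay}"
    fi
  done
  return 1
}
```

It would be called in place of the plain mv, e.g. `mv_with_retry "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe" || exit 1`.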

@DeniseWorthen
Collaborator Author

@BinLiu-NOAA Yes, Ufuk also mentioned he'd run into it, so I created an issue in order to track it.

@climbfuji
Collaborator

I've seen this as well

@zach1221
Collaborator

zach1221 commented May 7, 2024

I've been looking out for this issue for a few weeks now on Orion, through regular testing of WM PRs, but have not been able to reproduce the error. Has anyone noticed if this is still occurring?

@zach1221
Collaborator

Closing for now. If the issue is reported or experienced again we can re-open; however, I haven't been able to replicate it.

@DusanJovic-NOAA
Collaborator

The crash described in this issue still happens occasionally. #2183 (comment)

@DusanJovic-NOAA
Collaborator

I ran all compile jobs (./rt.sh -e -o) on Hera a few times and did not find any issues. I'll try today to repeat the same tests on Hercules.

@DusanJovic-NOAA
Collaborator

After two tries, all compile steps on Hercules finished successfully. So instead of continuing to try to reproduce the error, my suggestion is to add explicit error checking around the mv command in the compile.sh script, something like:

diff --git a/tests/compile.sh b/tests/compile.sh
index 458d985a..7386b06c 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -121,7 +121,14 @@ export CMAKE_FLAGS

 bash -x "${PATHTR}/build.sh"

-mv "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe"
+if ! mv "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe"; then
+  stat "${BUILD_DIR}/"
+  stat "${BUILD_DIR}/ufs_model"
+  stat "${PATHTR}/tests/"
+  ls -l "${PATHTR}/tests/"
+  exit 1
+fi
+
 if [[ ${MACHINE_ID} == linux ]]; then
   cp "${PATHTR}/modulefiles/ufs_${MACHINE_ID}.${RT_COMPILER}" "${PATHTR}/tests/modules.${BUILD_NAME}"
 else

This should at least print the status of the source directory and file, and of the target directory, after a failed mv command but before the script exits with an error. Hopefully the next time this error occurs we'll get some more information. Are there any other ideas on how to trace the problem?

@DeniseWorthen
Collaborator Author

@DusanJovic-NOAA I tried adding your error catch. I'm not sure it tells me much

+ mv /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/ufs_model /scratch1/NCEPDEV/nems/Denise.Worthen/WORK/ufs-weather-model/tests/fv3_s2s_intel.exe
mv: cannot move '/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/ufs_model' to a subdirectory of itself, '/scratch1/NCEPDEV/nems/Denise.Worthen/WORK/ufs-weather-model/tests/fv3_s2s_intel.exe'
+ stat /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/
+ stat /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/ufs_model
+ stat /scratch1/NCEPDEV/nems/Denise.Worthen/WORK/ufs-weather-model/tests/
+ ls -l /scratch1/NCEPDEV/nems/Denise.Worthen/WORK/ufs-weather-model/tests/
+ exit 1

@DusanJovic-NOAA
Collaborator

Well, it at least catches the mv error and executes the stat and ls commands, which was the goal. I do not see how:

/scratch1/NCEPDEV/nems/Denise.Worthen/WORK/ufs-weather-model/tests/

is a subdirectory of:

/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/ufs_model

Can you try to find the output of these commands in the stdout file?
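
One way to rule out a genuine containment (for example through a symlink) is to resolve both paths and compare them. This is only a sketch: `is_inside` is a hypothetical name, and it assumes GNU `realpath` with the `-m` option, which resolves symlinks and tolerates path components that do not exist yet.

```bash
#!/usr/bin/env bash
# Hypothetical diagnostic: does DST resolve to SRC itself or to a path beneath it?
# Usage: is_inside SRC DST   (exit 0 = inside, 1 = not inside, 2 = resolve error)
is_inside() {
  local src dst
  src=$(realpath -m -- "$1") || return 2
  dst=$(realpath -m -- "$2") || return 2
  # Trailing slashes keep /a/src from matching /a/srcX.
  case "${dst}/" in
    "${src}/"*) return 0 ;;
    *)          return 1 ;;
  esac
}
```

If `is_inside "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe"` returns non-zero at the time of failure, the "subdirectory of itself" message is not describing the real path layout.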

@BrianCurtis-NOAA
Collaborator

My guess is that, since ufswm/tests/run_dir is symbolically linked to FV3_RT/rt_, Lustre might somehow think the file is being moved into a subdirectory of itself. I don't know why or how, but I've seen more mind-bending things happen.

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Aug 12, 2024

I'm not sure this is the stdout you're asking about, but in the stdout from the build (i.e., this file: /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/out)

I see

make[1]: Leaving directory '/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel'
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/cmake-3.23.1-qjplcak/bin/cmake -E cmake_progress_start /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/CMakeFiles 0
  File: /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: 8f127eeh/150022126d     Inode: 144123247414257494  Links: 10
Access: (2755/drwxr-sr-x)  Uid: (20099/Denise.Worthen)   Gid: (11830/    stmp)
Access: 2024-08-12 15:43:51.000000000 +0000
Modify: 2024-08-12 15:52:25.000000000 +0000
Change: 2024-08-12 15:52:25.000000000 +0000
 Birth: 2024-08-12 15:41:02.000000000 +0000
  File: /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/ufs_model
  Size: 222433576       Blocks: 294928     IO Block: 4194304 regular file
Device: 8f127eeh/150022126d     Inode: 144123247414262861  Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (20099/Denise.Worthen)   Gid: (11830/    stmp)
Access: 2024-08-12 15:52:32.000000000 +0000
Modify: 2024-08-12 15:52:32.000000000 +0000
Change: 2024-08-12 15:52:32.000000000 +0000
 Birth: 2024-08-12 15:52:25.000000000 +0000
  File: /scratch1/NCEPDEV/nems/Denise.Worthen/WORK/ufs-weather-model/tests/
  Size: 20480           Blocks: 40         IO Block: 4096   directory
Device: 8f127eeh/150022126d     Inode: 144123077528220077  Links: 16
Access: (2755/drwxr-sr-x)  Uid: (20099/Denise.Worthen)   Gid: (11833/    nems)
Access: 2024-08-12 15:52:18.000000000 +0000
Modify: 2024-08-12 15:52:23.000000000 +0000
Change: 2024-08-12 15:52:23.000000000 +0000
 Birth: 2024-06-26 21:11:56.000000000 +0000
total 17971333
-rwxr-xr-x  1 Denise.Worthen nems       2041 Jun 26 21:11 abort_dep_tasks.py
-rwxr-xr-x  1 Denise.Worthen nems       5086 Jun 26 21:11 atparse.bash
drwxr-sr-x  3 Denise.Worthen nems       4096 Jun 26 21:11 auto
drwxr-sr-x  3 Denise.Worthen nems       4096 Jun 26 21:11 auto-jenkins
-rw-r--r--  1 Denise.Worthen nems         24 Aug 12 12:58 bl_date.conf
drwxr-sr-x  9 Denise.Worthen nems       4096 Jul  3 13:24 build_fv3_datm
....
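
A side note on the stat output above: all three entries report the same Device value (8f127eeh/150022126d), so the build directory and tests/ appear to be on the same file system. In that case GNU mv would normally attempt a plain rename first and only fall back to copying when the rename crosses devices. A small sketch for checking this from a script follows; `same_device` is a hypothetical name and it assumes GNU stat's `-c '%d'` format.

```bash
#!/usr/bin/env bash
# Hypothetical check: do two existing paths live on the same device?
# Usage: same_device PATH_A PATH_B   (exit 0 = same, 1 = different, 2 = stat error)
same_device() {
  local dev_a dev_b
  # %d prints the device number in decimal (GNU coreutils stat).
  dev_a=$(stat -c '%d' -- "$1") || return 2
  dev_b=$(stat -c '%d' -- "$2") || return 2
  [[ "${dev_a}" == "${dev_b}" ]]
}
```

For example, `same_device "${BUILD_DIR}" "${PATHTR}/tests"` could be logged alongside the existing stat calls.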

@DusanJovic-NOAA
Collaborator

I do not understand why this error happens only sometimes, or only for some users. If the operating system or file system 'thinks' that the target directory is a subdirectory of a file (the ufs_model executable), then why does it not fail every time?
Another option to try is to copy a file instead of moving it.
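
A rough sketch of the copy-instead-of-move idea, assuming it is acceptable to hold two copies of the executable briefly. The function name `copy_then_remove` is hypothetical, and the `cmp` verification step is my own addition rather than something proposed in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical replacement for the mv step: copy, verify, then delete the source.
# Usage: copy_then_remove SRC DST
copy_then_remove() {
  local src=$1 dst=$2
  # -p preserves mode and timestamps, so the executable bit is kept.
  cp -p -- "${src}" "${dst}" || return 1
  # Only remove the source once the copy matches byte for byte.
  cmp -s -- "${src}" "${dst}" || return 1
  rm -f -- "${src}"
}
```

In compile.sh this would stand in for the mv line, e.g. `copy_then_remove "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe" || exit 1`.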

@NickSzapiro-NOAA
Collaborator

I get this error more often when running regression tests multiple times (e.g., make a baseline then check against baseline). It may be related to https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/1822064

rsync --remove-source-files instead of mv has worked for me consistently
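
For reference, a sketch of how that substitution might look as a helper. The name `move_with_rsync` and the cp/rm fallback for machines without rsync are assumptions for illustration; only the `rsync --remove-source-files` call itself comes from what I actually ran.

```bash
#!/usr/bin/env bash
# Hypothetical helper: move a single file with rsync instead of mv.
# Usage: move_with_rsync SRC DST
move_with_rsync() {
  local src=$1 dst=$2
  if command -v rsync >/dev/null 2>&1; then
    # rsync copies the file, then deletes the source after a successful transfer.
    rsync --remove-source-files "${src}" "${dst}"
  else
    # Assumed fallback when rsync is not installed: copy, then remove.
    cp -p -- "${src}" "${dst}" && rm -f -- "${src}"
  fi
}
```

Usage would be `move_with_rsync "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe" || exit 1`.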
