Orion compile fails intermittently with "cannot move" error for placing executable into tests directory #2121
Comments
@DeniseWorthen I also encountered the same issue on Orion. Same here: after retrying 1-2 times, it goes through. I'm not sure whether this is an issue on the ufs-weather-model side or on the Orion file system side, though.
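Since a retry or two reportedly gets past the failure, a bounded retry wrapper is one possible stopgap while the root cause is unknown. This is only a minimal sketch: the `retry` helper and the `RETRY_DELAY` variable are hypothetical names and are not part of the ufs-weather-model scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in ufs-weather-model): run a command up to N times.
# Usage: retry MAX_ATTEMPTS command [args...]
retry() {
  local max_attempts=$1
  shift
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt ${attempt} failed, trying again: $*" >&2
    attempt=$(( attempt + 1 ))
    # RETRY_DELAY is an assumed knob, in seconds; defaults to 1.
    sleep "${RETRY_DELAY:-1}"
  done
}
```

A retry hides the symptom rather than explaining it, so it is probably best combined with some logging of each failed attempt.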
@BinLiu-NOAA Yes, Ufuk also mentioned he'd run into it, so I created an issue in order to track it.
I've seen this as well.
I've been looking out for this issue for a few weeks now on Orion, through regular testing of WM PRs, but have not been able to reproduce the error. Has anyone noticed if this is still occurring?
Closing for now. If the issue is reported or experienced again we can re-open; however, I haven't been able to replicate it.
The crash described in this issue still happens occasionally. #2183 (comment)
I ran all compile jobs (./rt.sh -e -o) on Hera a few times and did not find any issues. Today I'll try to repeat the same tests on Hercules.
After two tries, all compile steps on Hercules finished successfully, so instead of retrying to reproduce the error, my suggestion is to add explicit error checking around the mv command in the compile.sh script, something like:
diff --git a/tests/compile.sh b/tests/compile.sh
index 458d985a..7386b06c 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -121,7 +121,14 @@ export CMAKE_FLAGS
bash -x "${PATHTR}/build.sh"
-mv "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe"
+if ! mv "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe"; then
+ stat "${BUILD_DIR}/"
+ stat "${BUILD_DIR}/ufs_model"
+ stat "${PATHTR}/tests/"
+ ls -l "${PATHTR}/tests/"
+ exit 1
+fi
+
if [[ ${MACHINE_ID} == linux ]]; then
cp "${PATHTR}/modulefiles/ufs_${MACHINE_ID}.${RT_COMPILER}" "${PATHTR}/tests/modules.${BUILD_NAME}"
else
This should at least print the status of the source directory and file, and of the target directory, after a failed mv command but before the script exits with an error. Hopefully, the next time this error occurs we'll get some more information. Are there any other ideas on how to trace the problem?
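The same idea can also be sketched as a standalone function, which makes it easy to try outside compile.sh. This is an illustration under assumptions, not the change that was applied: `move_with_diagnostics` is a hypothetical name, and the diagnostics are sent to stderr here rather than stdout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move SRC to DEST and, if mv fails, print the state of
# the source file, its directory, and the destination directory.
# Usage: move_with_diagnostics SRC DEST
move_with_diagnostics() {
  local src=$1 dest=$2
  local rc=0
  # Capture the exit status of mv without tripping "set -e" in the caller.
  mv -- "${src}" "${dest}" || rc=$?
  if (( rc != 0 )); then
    {
      echo "mv failed with status ${rc}: ${src} -> ${dest}"
      stat -- "$(dirname -- "${src}")" "${src}" "$(dirname -- "${dest}")"
      ls -l -- "$(dirname -- "${dest}")"
    } >&2 || true   # the diagnostics themselves must never abort the script
  fi
  return "${rc}"
}
```

In compile.sh this would be called as `move_with_diagnostics "${BUILD_DIR}/ufs_model" "${PATHTR}/tests/${BUILD_NAME}.exe" || exit 1`, using the variables from the diff above.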
@DusanJovic-NOAA I tried adding your error catch. I'm not sure it tells me much
Well, it at least catches the mv error and executes the stat and ls commands, which was the goal. I do not see how /scratch1/NCEPDEV/nems/Denise.Worthen/WORK/ufs-weather-model/tests/ is a subdirectory of /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_3426748/compile_s2s_intel/build_fv3_s2s_intel/ufs_model. Can you try to find the output of these commands in the stdout file?
My guess is that, since ufswm/tests/run_dir is symbolically linked to FV3_RT/rt_, that might be the problem: Lustre may think it's moving into a subdirectory. I don't know why or how, but I've seen worse mind-bending things happen.
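One way to test the symlink guess would be to resolve both paths and see whether the destination really lands inside the source once links are followed. A minimal sketch, assuming GNU `readlink -f` is available (as on most Linux systems); `is_path_inside` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if INNER, after resolving symlinks, is the same
# as OUTER or located somewhere beneath it.
# Usage: is_path_inside INNER OUTER
is_path_inside() {
  local inner outer
  inner=$(readlink -f -- "$1") || return 2
  outer=$(readlink -f -- "$2") || return 2
  [[ "${inner}" == "${outer}" || "${inner}" == "${outer}"/* ]]
}
```

Calling it with the tests directory as INNER and the ufs_model path as OUTER just before the mv would show whether the resolved paths actually nest, or whether the message comes from somewhere else.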
I'm not sure this is the stdout you're asking about, but in the stdout from the build (i.e., this file) I see
I do not understand why this error happens only sometimes or for some users. If the operating system or file system 'thinks' that the target directory is a subdirectory of a file (the ufs_model executable), then why does it not fail every time?
I get this error more often when running regression tests multiple times (e.g., make a baseline then check against baseline). It may be related to https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/1822064
Description
Running the RTs from a clean environment sometimes fails at the final step of moving the executable into place. The error message generated is, for example
To Reproduce:
The issue appears intermittent and I've only seen it for the datm_cdeps tests. For the case quoted above, the file
fv3_datm_cdeps_debug_intel.exe
exists in the tests directory, so either it is being moved into place somewhere else or ???
Additional context
Output