TestCardiacSimulationX tests fail when run in parallel #353

Open
MichaelClerx opened this issue Jan 9, 2025 · 5 comments

MichaelClerx (Contributor) commented Jan 9, 2025

Describe the bug
The tests below fail when ctest is run with -jN and N>1, and the number of failures increases as N gets larger.

To Reproduce

Start the Chaste Docker container with

docker run --init -it --rm -v chaste_data:/home/chaste chaste/release

then

chaste@a5db1df53490:~/build$ ctest -R TestCardiacSimulation -E TestCardiacSimulationN -j8
Test project /home/chaste/build
    Start 445: TestCardiacSimulation
    Start 533: TestCardiacSimulationCodegen
    Start 534: TestCardiacSimulationArchiverCodegen
    Start 446: TestCardiacSimulationArchiver
    Start 502: TestCardiacSimulationArchiverParallel
1/5 Test #534: TestCardiacSimulationArchiverCodegen ....***Failed    1.85 sec
2/5 Test #446: TestCardiacSimulationArchiver ...........***Failed    1.86 sec
3/5 Test #445: TestCardiacSimulation ...................Subprocess aborted***Exception:   4.87 sec
4/5 Test #502: TestCardiacSimulationArchiverParallel ...   Passed    6.69 sec
5/5 Test #533: TestCardiacSimulationCodegen ............***Failed   21.06 sec

20% tests passed, 4 tests failed out of 5

Label Time Summary:
Codegen_heart       =  22.91 sec*proc (2 tests)
Continuous_heart    =   6.72 sec*proc (2 tests)
Parallel_heart      =  13.37 sec*proc (1 test)

Total Test time (real) =  21.07 sec

The following tests FAILED:
	445 - TestCardiacSimulation (Subprocess aborted)
	446 - TestCardiacSimulationArchiver (Failed)
	533 - TestCardiacSimulationCodegen (Failed)
	534 - TestCardiacSimulationArchiverCodegen (Failed)
Errors while running CTest
Output from these tests are in: /home/chaste/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

but

chaste@a5db1df53490:~/build$ ctest -R TestCardiacSimulation -E TestCardiacSimulationN -j1
Test project /home/chaste/build
    Start 445: TestCardiacSimulation
1/5 Test #445: TestCardiacSimulation ...................   Passed   21.00 sec
    Start 446: TestCardiacSimulationArchiver
2/5 Test #446: TestCardiacSimulationArchiver ...........   Passed    8.67 sec
    Start 502: TestCardiacSimulationArchiverParallel
3/5 Test #502: TestCardiacSimulationArchiverParallel ...   Passed    5.31 sec
    Start 533: TestCardiacSimulationCodegen
4/5 Test #533: TestCardiacSimulationCodegen ............   Passed   19.89 sec
    Start 534: TestCardiacSimulationArchiverCodegen
5/5 Test #534: TestCardiacSimulationArchiverCodegen ....   Passed    8.27 sec

100% tests passed, 0 tests failed out of 5

Label Time Summary:
Codegen_heart       =  28.16 sec*proc (2 tests)
Continuous_heart    =  29.67 sec*proc (2 tests)
Parallel_heart      =  10.61 sec*proc (1 test)

Total Test time (real) =  63.15 sec

With repeated attempts different tests pass or fail. One time ctest hung for minutes, after which I terminated it, suspecting a race condition.

Perhaps this happens because the tests are all started at more or less the same time. In normal testing, either the other N-1 cores are busy, so that these tests effectively run sequentially, or they are simply started a bit further apart in time?

Expected behavior
Tests should pass regardless of parallelisation.

System

  • Whatever the docker sets up
  • Running docker itself on Fedora 41

mirams (Member) commented Jan 10, 2025

Can you run it with the ctest --output-on-failure flag so we can see what the tests say? It is probably something Docker-specific, because all of these are run in parallel in the standard test pack. It may well be about file permissions, since archiving and code generation involve more file access than other tests.

[Edit: it isn't docker specific, and these aren't being run 'in parallel' in the MPI sense! See below.]

mirams (Member) commented Jan 10, 2025

(By the way, unless you've configured it with cmake -DChaste_NUM_CPUS_TEST=2 or similar (https://chaste.github.io/docs/dev-guides/cmake-build-guide/#chaste-configuration-options), these aren't being run in parallel in the parallel computing (MPI) sense. You're just running all 5 at once, each one sequentially, on different cores.)

This might then lead to trouble if some of these tests try to open, create, or wipe folders with the same names because they've been copied and pasted. We should check that the folders being used are unique for each test.
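
One way to carry out that check is to list the output folders each test uses and flag any folder that appears under more than one test. The sketch below is a minimal illustration in Python and is not part of Chaste: the function name `find_shared_output_dirs`, the shape of the input mapping, and the folder names in the example are all hypothetical, and gathering the real folder names from the test sources is left out.

```python
from collections import defaultdict


def find_shared_output_dirs(test_dirs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each output folder used by more than one test to the tests that use it."""
    users = defaultdict(set)
    for test, dirs in test_dirs.items():
        for d in dirs:
            # Treat "folder" and "folder/" as the same location.
            users[d.rstrip("/")].add(test)
    return {d: sorted(tests) for d, tests in users.items() if len(tests) > 1}


# Hypothetical folder names, for illustration only.
example = {
    "TestCardiacSimulation": ["shared_output"],
    "TestCardiacSimulationArchiver": ["shared_output/", "archive_output"],
    "TestCardiacSimulationArchiverParallel": ["archive_output_parallel"],
}
print(find_shared_output_dirs(example))
# {'shared_output': ['TestCardiacSimulation', 'TestCardiacSimulationArchiver']}
```

Any folder reported here would be a candidate for renaming so that each test owns its own location.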

Actually, I'm not even sure what TestCardiacSimulation and TestCardiacSimulationCodegen (etc.) actually are. It could be that they are identical and were never supposed to be in the same test pack!

mirams (Member) commented Jan 10, 2025

OK, they are identical tests, the same executable. There's some slightly unholy CMake mess here in that someone has hijacked the idea of adding "Codegen" onto the end of a ctest target to make it part of a Codegen test pack, by copying what happens to some of the "Parallel" tests.

So what you're essentially trying to do here is run ctest -R Test -j inf (run all the tests in all the different test packs at exactly the same time) and I'm not sure that was ever 'supposed to work'. It might be tidiest if it did, but currently they are in some different test packs that conflict on the file paths if run at exactly the same time.
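
The kind of file-path conflict described here can be reproduced in miniature. The Python sketch below assumes, without confirmation from the test logs, that each run wipes and recreates its output folder on start-up. The helper names and the folder name are hypothetical, and the two "runs" are interleaved by hand rather than executed concurrently, so that the outcome is deterministic.

```python
import shutil
import tempfile
from pathlib import Path


def start_run(output_dir: Path) -> None:
    """Mimic a test that wipes and recreates its output folder on start-up."""
    shutil.rmtree(output_dir, ignore_errors=True)
    output_dir.mkdir(parents=True)


def write_result(output_dir: Path, name: str) -> Path:
    """Write a small result file into the output folder and return its path."""
    path = output_dir / name
    path.write_text("result\n")
    return path


def interleaved_runs(dir_a: Path, dir_b: Path) -> bool:
    """Run A starts and writes a file, then run B starts.

    Returns True if A's file still exists after B's start-up.
    """
    start_run(dir_a)
    result = write_result(dir_a, "a.dat")
    start_run(dir_b)  # wipes A's file when dir_b is the same folder as dir_a
    return result.exists()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shared = root / "shared_output"  # hypothetical folder name
    print(interleaved_runs(shared, shared))  # False: B wiped A's output
    print(interleaved_runs(root / "run_a", root / "run_b"))  # True
```

With a shared folder, the second run destroys the first run's output; with one folder per run, both survive. If the real tests behave this way, giving each ctest target its own output folder, or preventing the conflicting targets from running at the same time, would avoid the clash.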

MichaelClerx (Contributor, Author) commented Jan 10, 2025 via email

MichaelClerx (Contributor, Author) commented

Here's another one:

chaste@3e382f587495:~/build$ ctest -R "TetrahedralMesh" -j32 --output-on-failure
Test project /home/chaste/build
    Start 128: TestDistributedTetrahedralMeshParallel
Could not find executable /usr/bin/mpiexec --oversubscribe
Looked in the following places:
/usr/bin/mpiexec --oversubscribe
/usr/bin/mpiexec --oversubscribe
/usr/bin/Release/mpiexec --oversubscribe
/usr/bin/Release/mpiexec --oversubscribe
/usr/bin/Debug/mpiexec --oversubscribe
/usr/bin/Debug/mpiexec --oversubscribe
/usr/bin/MinSizeRel/mpiexec --oversubscribe
/usr/bin/MinSizeRel/mpiexec --oversubscribe
/usr/bin/RelWithDebInfo/mpiexec --oversubscribe
/usr/bin/RelWithDebInfo/mpiexec --oversubscribe
/usr/bin/Deployment/mpiexec --oversubscribe
/usr/bin/Deployment/mpiexec --oversubscribe
/usr/bin/Development/mpiexec --oversubscribe
/usr/bin/Development/mpiexec --oversubscribe
usr/bin/mpiexec --oversubscribe
usr/bin/mpiexec --oversubscribe
usr/bin/Release/mpiexec --oversubscribe
usr/bin/Release/mpiexec --oversubscribe
usr/bin/Debug/mpiexec --oversubscribe
usr/bin/Debug/mpiexec --oversubscribe
usr/bin/MinSizeRel/mpiexec --oversubscribe
usr/bin/MinSizeRel/mpiexec --oversubscribe
usr/bin/RelWithDebInfo/mpiexec --oversubscribe
usr/bin/RelWithDebInfo/mpiexec --oversubscribe
usr/bin/Deployment/mpiexec --oversubscribe
usr/bin/Deployment/mpiexec --oversubscribe
usr/bin/Development/mpiexec --oversubscribe
usr/bin/Development/mpiexec --oversubscribe
Unable to find executable: /usr/bin/mpiexec --oversubscribe
1/4 Test #128: TestDistributedTetrahedralMeshParallel ...***Not Run   0.00 sec
    Start  78: TestNonCachedTetrahedralMesh
    Start  80: TestTetrahedralMesh
    Start  69: TestDistributedTetrahedralMesh
2/4 Test  #69: TestDistributedTetrahedralMesh ...........   Passed    2.51 sec
3/4 Test  #80: TestTetrahedralMesh ......................   Passed    2.57 sec
4/4 Test  #78: TestNonCachedTetrahedralMesh .............   Passed    9.73 sec

75% tests passed, 1 tests failed out of 4

Label Time Summary:
Continuous_mesh    =  14.82 sec*proc (3 tests)
Parallel_mesh      =   0.00 sec*proc (1 test)
Production_mesh    =   2.57 sec*proc (1 test)

Total Test time (real) =   9.74 sec

The following tests FAILED:
	128 - TestDistributedTetrahedralMeshParallel (Not Run)
Errors while running CTest
