Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP #285
Comments
Hi, *MTD 12 terminated with early ... 1 min, 10.969 sec Again, it works perfectly with CREST 2.12. Mostafa |
Is this a new build from PR #288 by any chance? What's the build information displayed at the end of a 2.12 and 3.0 work fundamentally different, errors in one are most likely unrelated to errors in the other. |
The reported issues are both related to the CREST 3.0 version. I just wanted to emphasize that these issues were not encountered in the previous 2.12 version. Here's the build information displayed at the end of a --dry run using CREST 3.0: Dry run was requested. Input file : ../../opt_xtb/mol_7/xtbopt.xyz Job type :
Job settings CRE settings General MD/MTD settings Calculation settings Technical settings CREST binary info normal dry run termination. |
Yes, I got that. What I'm saying is that 3.0 is a complete code revision; even if 2.12 works, that does not tell us much about the error. As for the output you attached, 120 threads seems a bit extreme for OpenMP parallelization, and it might break down at that count. Overhead from thread handling could also be an issue. |
Following your suggestion, I tried the calculations with 8 CPUs and encountered the same problem. However, when I repeated the calculations in the "gas phase", everything seemed to work fine. Attached is the input structure in case you'd like to try. |
I tried running this structure and don't have any issues with it in the gas phase (Ctrl+C'd after the MDs; output attached as a dropdown).
However, upon trying the same thing with ALPB implicit solvation, I encounter the same Intel MKL error and early terminations in the MD. |
Thank you Philipp. I'm looking forward to hearing back from you. |
I've some new insight. It seems to be a memory issue for large molecules related to either the MKL libraries, or the Intel compilers (or both). I can reproduce it for other large molecules with ease, and with different levels of theory even (GFN2, GFNFF). GNU (gfortran) builds don't seem to suffer from it and large molecules can be calculated, although the builds in turn don't seem able to handle nested parallelism (as mentioned in #284). But then again, this is only relevant for the MD part. I'll try and see if it is possible to circumvent the MKL issue somehow via the code or the build. |
Hi Philipp, Thanks, |
I have the same error message as @moabe84, though my situation is somewhat different. My system works well with the CREST 2.12 version, but if I use multiple threads via the --T option, one of the MTDs does not get started and the run gets stuck in an infinite loop. Nothing is written to the trajectory file; it is just empty. |
Hi, I spent a while looking into this issue, unfortunately with not much actual success. A few comments:
At the moment I seem to be unable to find a workaround for the affected source code parts, although I will continue looking into it. After all, running multi-layered processes like the MDs and optimizations is abusing the OpenMP functionalities a bit. These would be much better suited for MPI parallelization, which I will address at some point in the future. As for now, the advice I can give is:
I will upload the 3.0.1 hotfix this week, which addresses some other recent problems. I've included a warning in that version's README.md regarding the CMake/ifort build, referencing this issue. |
Thank you so much, Philipp, for the updates. I've started using the continuous release builds and so far everything's working fine. |
I just wanted to provide an update on this issue. I faced the same problem when running calculations for a different new structure. The issue seems to be selective. I do have a question: will using version 3 yield better results in the QCG calculations compared to version 2.12? It seems I'll have to use version 2.12 for the time being, and I just want to make sure that it's fine. |
It is system-size dependent as far as I can tell, yes. The QCG implementation is still the same in both versions, so it won't matter. |
An update on the parallelization issue, in particular the I'm continuing to investigate, although I have not found a conclusive answer. However, I may have found another possible explanation within the parallel processing of calculations. It seems that reinitialization of the calculator/wavefunction helps to avoid the error. The MKL error occurs somewhere within the BLAS and LAPACK implementation, as DLASWP is not called directly in CREST or its subprojects. Any part that processes larger matrices, like the wavefunction, could be the origin here. My thinking is that if tblite calculations are not reinitialized between different molecules/conformers, there might be a mismatch in some algorithm/dimension within the linear algebra part which causes the error, at least if the two molecules/conformers are quite different. Within the optimization loops, the motivation for not reinitializing was to save some compute time, assuming the wavefunctions of different conformers are similar enough to provide a nice SCF starting guess. But maybe that can actually cause problems. |
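The hypothesis above (cached calculator state reused across systems of different size) can be illustrated with a toy sketch. This is not tblite's or CREST's API: `ToyCalculator`, `run`, `reset`, and `run_all` are hypothetical names invented purely for illustration, and the "wavefunction" is just a list whose length stands in for a matrix dimension.

```python
class ToyCalculator:
    """Hypothetical stand-in for a calculator that caches a starting guess."""

    def __init__(self):
        self.guess = None  # cached "wavefunction" from the previous system

    def reset(self):
        """Drop the cached guess so the next run starts from scratch."""
        self.guess = None

    def run(self, n_basis):
        # A stale guess with the wrong dimension is the kind of mismatch
        # the comment above speculates about (an assumption, not a diagnosis).
        if self.guess is not None and len(self.guess) != n_basis:
            raise ValueError(
                f"cached guess has dimension {len(self.guess)}, expected {n_basis}"
            )
        self.guess = [0.0] * n_basis
        return n_basis


def run_all(sizes, reinitialize):
    """Run one toy calculation per system size, optionally resetting first."""
    calc = ToyCalculator()
    results = []
    for n in sizes:
        if reinitialize:
            calc.reset()
        results.append(calc.run(n))
    return results
```

With `reinitialize=True` every system starts from a clean state at the cost of losing the previous guess; with `reinitialize=False` the toy fails as soon as two consecutive systems differ in size.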
After some back-and-forth, the issue seems to result from nested parallelism and a mismatch between OpenMP and MKL after all. I was not able to get rid of it entirely, but some changes in #331 made it much more robust against this error. I had a reliably reproducible example to test on. Unfortunately, the omp nested settings do not fully affect the MKL implementation and vice versa. It's a really complicated problem. |
Hi @pprcht. When we get this error (here's mine for reference):
*MTD 13 completed successfully ... 3 min, 38.531 sec
*MTD 4 completed successfully ... 3 min, 58.353 sec
*MTD 11 completed successfully ... 4 min, 10.991 sec
*MTD 3 completed successfully ... 4 min, 11.186 sec
*MTD 12 completed successfully ... 4 min, 19.571 sec
*MTD 6 terminated with early ... 4 min, 23.957 sec
*MTD 9 completed successfully ... 4 min, 25.710 sec
*MTD 7 completed successfully ... 4 min, 28.931 sec
*MTD 10 completed successfully ... 4 min, 32.039 sec
*MTD 8 completed successfully ... 4 min, 32.993 sec
*MTD 2 completed successfully ... 4 min, 37.151 sec
*MTD 5 completed successfully ... 4 min, 38.266 sec
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
*MTD 14 completed successfully ... 4 min, 42.447 sec
What does this mean about the calculation? Are the results invalid, or are they just fine? My calculation throws this error but then does not exit; it seems to hang for a very long time, so maybe the results will never get calculated anyway. A single thread continues to run at 100% but the calculation does not advance. How should we interpret this error coming up and is there a way to prevent it (set
My googling/chatgpt-ing around on this issue suggests this is an issue with the pivot indices for an LU decomposition inside of the
To Reproduce: XYZ file attached as
threads = 16
input = "structure.xyz"
runtype = "imtd-gc"
[calculation]
level = [
{ method = "gfn2", charge = -1, uhf = 0 },
]
Additional Details
This happens for me when doing a conformer search on a superstructure of 3 molecules (all part of a reaction complex). Perhaps this has something to do with having multiple molecules in the structure? When doing a conformer search, I can see how this would make any initial wavefunction very inappropriate as an initial guess, because the atom centers may have moved dramatically from a starting frame. Just throwing that out there as a possible cause based on your comments above. |
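For context on the error message: LAPACK's DLASWP performs a series of row interchanges on a matrix, and its argument list is (N, A, LDA, K1, K2, IPIV, INCX), so "Parameter 6" is the pivot index array IPIV. The sketch below is a pure-Python illustration of that row-swap operation with 0-based indices and a unit increment. The range check is an assumption added for illustration; the thread does not establish which validation MKL actually performs, and `apply_row_swaps` is a hypothetical name.

```python
def apply_row_swaps(a, ipiv, k1, k2):
    """Swap row i with row ipiv[i] for i = k1..k2 (0-based, inclusive).

    Mirrors what LAPACK's xLASWP does for a forward pass with unit
    increment. The bounds check is illustrative only.
    """
    n_rows = len(a)
    for i in range(k1, k2 + 1):
        p = ipiv[i]
        if not 0 <= p < n_rows:
            raise ValueError(
                f"pivot index {p} at position {i} is outside 0..{n_rows - 1}"
            )
        if p != i:
            a[i], a[p] = a[p], a[i]
    return a


# Valid pivots: row 0 is exchanged with row 2, the others stay in place.
matrix = [[1.0], [2.0], [3.0]]
swapped = apply_row_swaps(matrix, ipiv=[2, 1, 2], k1=0, k2=2)
```

A pivot array that was produced for a matrix of a different size, or overwritten by another thread, would contain indices like the out-of-range case rejected here.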
As explained above, this is most likely an issue with nested parallelism in OpenMP and MKL, and so far I have not found a definitive way of preventing it, except not using an MKL-based build at all, or running the thing in serial. Approaching the problem via |
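As a rough sketch of the "run it in serial" fallback mentioned above: the helper below prepares a command line and environment that pin both OpenMP and MKL to one thread. `OMP_NUM_THREADS` and `MKL_NUM_THREADS` are standard environment variables and `--T` is the thread option referred to earlier in this thread; the binary name `crest` on `PATH`, the helper name `serial_run_setup`, and the idea that setting both variables is sufficient are assumptions, not something confirmed here.

```python
import os


def serial_run_setup(xyz_path, base_env=None):
    """Return (argv, env) for a single-threaded CREST run.

    Assumes the executable is called `crest` and is on PATH.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"  # no OpenMP parallelism
    env["MKL_NUM_THREADS"] = "1"  # no MKL-internal threading
    argv = ["crest", xyz_path, "--T", "1"]
    return argv, env


argv, env = serial_run_setup("structure.xyz", base_env={"PATH": "/usr/bin"})
# To actually launch the job (requires CREST to be installed):
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```

A serial run gives up all parallel speed-up, so this is only a fallback for systems that reliably trigger the error.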
I ran the exact same input as above a second time and did not get the error. This makes me think it's a race condition of some sort, which would explain why it happens sporadically. Just adding that context. Although I did not get the error printed out at the same point in the conformer search, the run did ultimately end up frozen at some point (threads running for 24+ hours without making any visible progress in the stdout printout). So perhaps the issue still occurred but without the printout; however, if so, it happened at a very different point in the algorithm, which might still suggest a race condition. |
Hi _ERROR STOP error while reading input coordinates Error termination. Backtrace: |
@moabe84 I would prefer to keep this separate from the MKL issue. STOP and ERROR STOP statements in the GNU builds will attach the backtrace, so this is really the "error while reading input coordinates" stop. Grepping through the code reveals that it triggers here because the number of atoms read in from "best.xyz" does not match the expected number of atoms. Is the file best.xyz present in your run? Does the content make sense? CREST doesn't seem to write this file, so it must be written by either xtb or xtbiff (xtb here, according to the code). |
Here's the command line: |
Again, the file best.xyz should be written by xtb. This means it is likely a feature of aISS, which, in turn, was not interfaced to the 2.12 code, so if the command goes through with the old version, it is because the algorithm differs. Your best option is either to contact Christoph Plett directly and ask under what conditions best.xyz is written in aISS, or to try switching to the xtbiff version. |
Hi,
I've been trying to run QCG calculations for a small organic molecule using CREST 3.0, but I keep encountering the trial MTD convergence error (after the "quantum cluster growth: GROW" part), regardless of the timestep or SHAKE option I select. Interestingly, with CREST 2.12 everything is fine and it works even with the default settings. Could it be a bug of some sort?
Thanks,
Mostafa