Arkane Pdep Functional Test Fails Often #1682
Comments
That's incorrect; the change in the multi-D torsions PR was made because we switched from calculating 1D HRs semi-classically to quantum mechanically. This impacts pressure-dependent rate calculations because they depend, through TST and thermochemistry, on the 1D HR partition functions.
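For context, this is the standard canonical TST expression (written generically here, not quoted from Arkane's source); each partition function Q contains a factor for every 1D hindered rotor, so changing the HR treatment from semi-classical to quantum shifts these rates and, through the densities of states, the pressure-dependent results as well:

```math
k_\infty(T) = \kappa(T)\,\frac{k_\mathrm{B} T}{h}\,
\frac{Q^{\ddagger}(T)}{Q_\mathrm{A}(T)\,Q_\mathrm{B}(T)}\,
\exp\!\left(-\frac{E_0}{k_\mathrm{B} T}\right)
```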
I see, that makes sense. Any ideas what might be going on here? Should we rethink this line in this test, i.e. should this test only check that the number is within the right ballpark (one or two decimal places? 1%?) if it will fail whenever we make improvements to Arkane?
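As a concrete version of the "right ballpark" idea, a relative-tolerance check could look like the sketch below; the test name, expected value, and tolerance are placeholders, not the actual RMG-Py test:

```python
import math

def test_pdep_rate_coefficient_ballpark():
    # Placeholder values: a real test would compute k from the pdep network and
    # compare against a stored reference value.
    k_computed = 88.8819863725482
    k_expected = 88.88

    # Check to within 1% instead of a fixed number of decimal places, so small
    # numerical drift in upstream libraries does not fail the test.
    assert math.isclose(k_computed, k_expected, rel_tol=0.01)
```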
As far as I can tell, the test passes consistently on Travis (or at least it did while I was testing the nDTorsions stuff), and I really don't think we should be getting rid of or changing failing tests just because we don't know why they fail (the blanket-over-the-head approach to bugs). The way I see it, some change in your code is causing Arkane to estimate pdep rate coefficients differently, so you should test at different commits, figure out which specific commit causes the estimation change, and use that to figure out why this is happening.
Update: the commit in question is this one here, as this commit fails with this error, but the one before it does not. Again, I think this is interesting because I do not change any Arkane files except the two of my own that I created in this PR, and at this point in the PR I had not yet integrated these files with the rest of Arkane. In this commit I do add a few new dependencies, though, so maybe that is the problem? It would seem strange; I will keep looking into this.
Break the commit up, find the smallest change that causes it... don't worry about test failures that are not in the pdep test.
I am trying a few things like that; I'll let you know what happens. My thought with the dependency angle was: what if the new dependencies require other versions of packages that pdep needs, and those versions are not compatible with pdep?
Okay, now we are getting somewhere: I created a branch that just adds the new dependencies but otherwise has no new code additions, and I get this failure (as a side note, I also get the error with the unit test that is plaguing the Chlorine DB builds). You can check out the branch here. Here is the diff of the conda list between master and this branch:
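The diff itself is not preserved above; as a rough illustration of how such a comparison can be generated (the file names are hypothetical), one can dump `conda list` from each environment to a file and diff the two:

```python
import difflib

# Hypothetical file names: each holds the output of `conda list` from one
# environment (e.g. master vs. the dependencies-only branch).
with open("conda_list_master.txt") as f_master, open("conda_list_branch.txt") as f_branch:
    master_packages = f_master.read().splitlines()
    branch_packages = f_branch.read().splitlines()

# Print only the packages whose version, build, or channel differ.
for line in difflib.unified_diff(master_packages, branch_packages,
                                 fromfile="master", tofile="branch", lineterm=""):
    print(line)
```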
I'm getting this error with the py3 testing branch and new py3 environment.
@mliu49 are you getting 88.8819863725482 exactly as well, or a different number? Interestingly, we first thought that this might be a BLAS issue, but the version hasn't changed in your environment.
No, I'm getting a larger value...
As evidenced by the references in PRs, this error is appearing again, though we never figured out what was causing it the first time around. In the case of #1840, the push build failed while the PR build passed. In comparing the two logs, the conda environments were identical, and the main difference (besides push vs. PR) seemed to be the Travis worker version. 🤦‍♂️ In the failing build, the worker version is 6.2.6, while in the passing build it's 6.2.5. The failing push build for ReactionMechanismGenerator/RMG-database#367 is also on version 6.2.6. If this is an accurate correlation, then the ongoing build for #1829 (using version 6.2.6) should fail. Differences in Travis worker versions affecting build success are not unprecedented, but the mechanism by which this particular test could be affected is not clear.
I am running just this test locally, using pytest in the PyCharm debugger, on RMG-Py b691b45. It fails as follows:
This should help us debug by comparison.
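For anyone reproducing this, a sketch of invoking a single test programmatically so it can be stepped through in a debugger is shown below; the file path and keyword are placeholders, not the actual RMG-Py test module.

```python
import pytest

# -x stops at the first failure; -k selects tests whose name matches "pdep".
pytest.main(["-x", "-k", "pdep", "path/to/rmg_functional_tests.py"])
```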
As an update, I am working on comparing the failing branch above with the branch. With help from @mliu49, we have determined that
Update: The issue goes further back than the call to. Inside this initialize call, though, we make a call to. I made the following change to the code (because debugging Cython is hard) to see if these calls differed. Here is the modification to line 340 of configuration.pyx on current master:
The exception is just so that it only prints the results from one loop. Here are the results: Therefore, it appears that
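The modification itself is not shown above; an illustrative reconstruction of the approach described (dump the eigensolver's input and output once, then stop, so the two branches can be diffed) might look like this in plain Python rather than Cython:

```python
import numpy as np
from scipy import linalg

def solve_and_dump(H_band):
    """Save the banded Hamiltonian and its eigenvalues once, then stop."""
    np.save("H_band.npy", H_band)                 # input, in lower band storage
    energies = linalg.eig_banded(H_band, lower=True, eigvals_only=True)
    np.save("energies.npy", energies)             # output eigenvalues
    # Raising here mirrors the trick described above: it halts the run after the
    # first call so only one loop's results are recorded for comparison.
    raise RuntimeError("dumped first eig_banded call; stopping")
```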
Does setting overwrite_a_band=False impact the results?
No, at least for the PyCharm cases I got the same answers as before for each branch.
This is the FORTRAN code that is ultimately being called. Given your past experience with FORTRAN, do you see anything interesting? Note that I am not sure which version of the code we use, so you might need to look elsewhere for the source of the correct version number. http://www.netlib.org/lapack/explore-html/d5/db8/chbevd_8f_source.html
I mean there is this bit:
But I would think these would be the same for all modern machines.
Wait, Mark, can you check the condition number of the Hamiltonian?
That would indeed be the problem: np.linalg.cond(H) = 7.025714299685296e+20. Something to talk about in 10.34 :)
XD... well, now we know...
Never mind. As we discussed offline, I forgot that H was formatted as a banded matrix. I created a sparse matrix from these diags (thanks @mjohnson541 for the suggestion) and got a condition number of 32239.525070193977, which is not too high. To get this, I used the following code (please verify that it makes sense).
Note that H is the Hamiltonian loaded from the YAML file, which is a 6x401 matrix formatted as a lower banded matrix.
You should use numpy.linalg.cond instead. If the condition number is high, the matrix inverse cannot be calculated accurately, so its norm will be inaccurate. Check the eigenvalues it gives you after you construct it (I think you can calculate them without it taking too long, but I might be wrong) and compare with the ones you expect based on your past experiments; that should tell us whether we have the same matrix, at least.
I tried numpy.linalg.cond, but it did not like dealing with sparse matrices. My computer may be able to handle a 401x401 matrix; let me see.
Converting to a dense matrix and running np.linalg.cond yielded 1417.03.
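Pulling the last few comments together, a sketch of the whole check (assuming eig_banded's lower band-storage convention, i.e. row i of the 6x401 array holds the i-th subdiagonal; the random values are only there so the sketch runs standalone) could be:

```python
import numpy as np
from scipy import sparse

def banded_to_symmetric(H_band):
    """Rebuild the full symmetric matrix from LAPACK-style lower band storage,
    where row i holds the i-th subdiagonal (trailing entries unused)."""
    n_bands, n = H_band.shape
    diagonals, offsets = [], []
    for i in range(n_bands):
        d = H_band[i, : n - i]
        diagonals.append(d)
        offsets.append(-i)
        if i > 0:                     # mirror onto the upper triangle
            diagonals.append(d)
            offsets.append(i)
    return sparse.diags(diagonals, offsets, shape=(n, n), format="csc")

# Stand-in for the 6x401 banded Hamiltonian loaded from the YAML file.
H_band = np.random.default_rng(0).normal(size=(6, 401))
H_sparse = banded_to_symmetric(H_band)

# np.linalg.cond needs a dense array, hence the .toarray() conversion.
print(np.linalg.cond(H_sparse.toarray()))
```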
When testing the HinderedRotor.get_enthalpy() method with a Fourier series potential in the quantum limit, get_enthalpy calls solve_schrodinger_equation, which in turn calls scipy.linalg.eig_banded(H, lower=True, eigvals_only=True, overwrite_a_band=True). There's a lengthy discussion in #1682 trying to track down the possibility of a bug; in the end the tolerance on a test was increased. Since nothing has changed recently except some C libraries etc. on conda-forge, I propose taking the same approach and relaxing the tolerance.
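For reference, a minimal standalone version of the call named above, with a random banded matrix standing in for the Hamiltonian Arkane builds, looks like this:

```python
import numpy as np
from scipy import linalg

# Random stand-in for the 6x401 banded Hamiltonian built by
# solve_schrodinger_equation (lower band storage: row i = i-th subdiagonal).
H = np.random.default_rng(0).normal(size=(6, 401))

# eigvals_only=True returns only the energy levels; overwrite_a_band=True lets
# LAPACK reuse H as scratch space (these are the kwargs quoted above).
energies = linalg.eig_banded(H, lower=True, eigvals_only=True, overwrite_a_band=True)
print(energies[:5])
```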
Bug Description
The latest build of #1561 failed with the following error:
This PR does not change any part of Arkane or RMG's pdep code, though, and in fact this error only showed up when I recently rebased this branch onto current master. I notice that this test fails often when I run the functional tests locally, and the number for the rate coefficient seems to drift slightly for reasons unknown (it appears to happen whenever we add a new Arkane example or functional test; for example, I believe this is why the most recent change was made to this unit test here).
I also wonder if it is somehow related to the error with the most recent Chlorine DB branch build here, where the order of the reactants has changed when the functional tests are run in this build.
A working theory of mine is that this is related to Arkane's use of global and module-level variables, which might cause problems when we run all of the Arkane examples at once, but I am not sure.
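As a toy illustration of the kind of leak this theory describes (hypothetical code, not Arkane's), module-level state set by one example can silently survive into the next one run in the same process:

```python
_settings = {}          # module-level dict shared by everything that imports it

def configure(key, value):
    _settings[key] = value

def get(key, default=None):
    return _settings.get(key, default)

# One Arkane example run in the process sets an option...
configure("Tlist", [300.0, 1000.0, 2000.0])

# ...and a later example or test that forgets to reset it silently inherits the
# value, so its results depend on which examples ran before it.
print(get("Tlist"))
```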
In the meantime, we might want to consider re-writing this test. Thoughts?
How To Reproduce
Run the RMG functional tests locally. Check out the errorCancelingReactions branch if need be.