SC_Softmx_JitAot_Linux hang #16353
I ran a 20x Grinder job and got no failures.
Additional 30x Grinder job: no failures.
@knn-k can you tell anything from the collected core files? There is another one: https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_extended.system_aarch64_linux_Nightly_testList_0/299 - ub20-aarch64-2
PRs #16199 and #16119 are the recent JIT changes related to monitors. I cannot tell where the waiting Java threads are blocked, and I cannot find thread information in the core files using gdb.
I can see the list of threads and their call stacks using jdmpview, but I don't know how to get their contexts, in particular the register values when the threads are running JITed code. Call stack of the load-0 thread:
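For anyone else opening the same dump, a minimal jdmpview session for listing the threads and monitor state might look like this (the core file name is a placeholder):

```
jdmpview -core core.dmp
> info thread *   # native and Java stacks for every thread
> info lock       # monitors, their owners, and the threads waiting on them
```

As noted above, this does not obviously expose the register values for threads that are running JITed code.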
For gdb, does it help to have the system libraries as collected by jpackcore? These are from the first failure on ub20-aarch64-4, although both machines should be the same: core.20221122.011528.2115359.0001.dmp.zip. gdb can find these and the JVM libraries by setting solib-search-path.
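A sketch of the gdb invocation, assuming the core and the collected libraries have been unpacked into local directories (all paths are placeholders):

```
# Point gdb at the matching java launcher and the core, then tell it where
# to find the collected system and JVM shared libraries.
gdb /path/to/jdk/bin/java core.20221122.011528.2115359.0001.dmp
(gdb) set solib-search-path /path/to/collected/libs:/path/to/jdk/lib/default
(gdb) info threads
(gdb) thread apply all bt
```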
The |
Getting proper stacks (native and java using |
Findings from the failure in #16353 (comment):
From javacore.20221125.023330.932995.0002.txt:
Another failure with SC_Softmx_JitAot_Linux_1 in https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/330/ on ub20-aarch64-5. Threads
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/330 - ub20-aarch64-5
I've set this as a blocker for the 0.36 release.
If this is repeatable, it would be worth grinding for the first build where it fails.
2 failures in a 30x Grinder job: https://openj9-jenkins.osuosl.org/job/Grinder/1567/tapResults/
This issue was opened 9 days ago, but a nightly binary from two weeks ago seems to fail.
The test environment might be the key.
My assumption above was not correct.
Is this observed only on aarch64? That would essentially imply that this is a JIT issue.
Yes, I saw that. Have there been changes to compilation control recently? @mpirvu
Again, we really need the detailed stack information to see where/why the threads are waiting. We might need to reproduce with a full-symbols build; the javacore information isn't enough.
We should have all the symbols and full debug info by overlaying the debug image that matches the JVM. For Linux these are .debuginfo files that gdb will find and understand. For example, for the last failure the JVM is https://openj9-jenkins.osuosl.org/job/Build_JDK17_aarch64_linux_Nightly/344/, and that page links to both the JVM and the debug image.
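A sketch of how the overlay might be used, assuming the debug image unpacks into a directory of .debuginfo files that gdb can be pointed at (archive and path names are placeholders):

```
# Unpack the matching debug image, then tell gdb where the .debuginfo files live.
tar -xzf debug-image.tar.gz -C /path/to/debuginfo
gdb /path/to/jdk/bin/java core.dmp
(gdb) set debug-file-directory /path/to/debuginfo
(gdb) thread apply all bt
```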
No, there were no changes to compilation control aside from JITServer-related changes.
I found that the core and javacore files from those failures have a thread waiting at Thread
When a method body uses an optimization called "pre-existence" and the assumptions made during compilation are invalidated, the JIT performs a synchronous compilation so that an application thread cannot execute the invalidated body while the recompilation is in progress.
So it may just be that a compilation is taking longer than the test timeout.
Is it possible to determine whether the compilation threads make progress while app threads wait on QueueSlotMonitor?
A JIT compilation thread is active and running while all other threads are waiting; I confirmed this by attaching gdb to the java process (see #16353 (comment)). I will try extending the test timeout later.
This test is excluded on Power and Z Linux platforms. Is there any reason for that?
I extended the timeout value of "Step 16 - Run Jvm4 workload process" in SC_Softmx_JitAot_Linux_1 from 30 minutes to 1 hour. From ITERATION 12 of https://openj9-jenkins.osuosl.org/job/Grinder/1632/consoleText:
It took 38 minutes in ITERATION 18 of the same Grinder job.
It is the escapeAnalysis optimization that takes a long time in compiling the method. I inserted timestamps in seconds for each optimization phase in the JIT trace file. The escapeAnalysis phase (id=47) took 233 seconds (~4 minutes) in the sample below, while optimizations from id=48 to 217 took only 2 seconds in total. The optimization level is "hot".
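For reference, a per-method JIT trace log like the one sampled above can be requested with the -Xjit log suboptions; the method pattern and file name below are placeholders:

```
# Compile log with full tracing for one (hypothetical) method pattern.
java -Xjit:'{com/example/Workload.run*}(traceFull,log=jit-trace.log)' ...
```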
Option https://openj9-jenkins.osuosl.org/job/Grinder/1637/ (30 runs in 1 hour 18 minutes)
The question is why the compilation is so much slower on aarch64. These are modern processors that should compete with the other architectures, aren't they?
I think the reason we are seeing this failure only on aarch64 is that this test is disabled on P and Z (#16353 (comment)); see openj9/runtime/compiler/optimizer/EscapeAnalysis.cpp lines 738 to 741 in e42a04b.
Note that the flush elimination part of escape analysis should be gated by whether the platform wants to optimize the fences after allocation. Since those fences only exist on Power and AArch64, I am guessing that logic probably runs here, in which case disabling just that part could be an option in the short term (as opposed to all of escape analysis, which cannot be considered) while the underlying compile-time issue is sorted out. This could be an option if we did not want to hold up a release for this.
There is a little history of excluding SC_Softmx_JitAot_Linux: see adoptium/aqa-tests#1772. SC_Softmx_JitAot_Linux is excluded on Power and Z Linux due to adoptium/aqa-systemtest#79.
Sorry, ignore that; I was confusing these with the OSU machines, which are slow. The Equinix machines are fast.
There was a change to the Equinix machines on Nov 17.
I wrote a small Java program for reproducing the problem. Run it on AArch64 Linux or macOS with Java 17 as follows. It takes minutes to complete.
It finishes immediately when you disable escape analysis.
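For reference, and assuming the usual -Xjit suboption name, escape analysis can be disabled from the command line like this (Reproducer is a placeholder class name):

```
# disableEscapeAnalysis turns the optimization off entirely.
java -Xjit:disableEscapeAnalysis Reproducer
```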
Can anybody try this program on Power? OpenJ9 v0.35.0 is free from this problem.
@0xdaryl guessed it right (#16353 (comment)). I reverted PR #16119 and related PRs (#16199 for AArch64 and eclipse-omr/omr#6772) locally.
@klangman FYI (since I mentioned this in passing to you yesterday).
@knn-k, I verified that we see similar behaviour on Power with ShowTZ.
I think the primary problem lies with a particular call. I've tested out a fix on Power, and it appears to resolve the problem. I'm in the process of running personal builds with it.
Reopening to pull-request the fix for the v0.36.0-release branch.
Grinder of SC_Softmx_JitAot_Linux_1 on AArch64 Linux after the fix was merged: https://openj9-jenkins.osuosl.org/job/Grinder/1693/
#16465 is merged for 0.36.
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/324 - ub20-aarch64-4
SC_Softmx_JitAot_Linux_1
-Xcompressedrefs -Xjit -Xgcpolicy:gencon
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/324/system_test_output.tar.gz
There are core/javacore/etc. diagnostic files. The javacore files show this state: