Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SC_Softmx_JitAot_Linux hang #16353

Closed
pshipton opened this issue Nov 22, 2022 · 62 comments · Fixed by #16455
Closed

SC_Softmx_JitAot_Linux hang #16353

pshipton opened this issue Nov 22, 2022 · 62 comments · Fixed by #16455

Comments

@pshipton
Copy link
Member

pshipton commented Nov 22, 2022

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/324 - ub20-aarch64-4
SC_Softmx_JitAot_Linux_1 -Xcompressedrefs -Xjit -Xgcpolicy:gencon

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/324/system_test_output.tar.gz

STF 01:15:28.337 - **FAILED** Process jvm4 has timed out
STF 01:15:28.337 - Collecting dumps for: jvm4

There are core/javacore/etc. diagnostic files. The javacore files show this state:

2LKREGMON          JIT-QueueSlotMonitor-1 lock (0x0000FFFF94003300): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-1" (J9VMThread:0x000000000057C200)
3LKWAITNOTIFY            "load-2" (J9VMThread:0x000000000057E700)
3LKWAITNOTIFY            "load-3" (J9VMThread:0x0000000000580B00)
3LKWAITNOTIFY            "load-4" (J9VMThread:0x0000000000583000)
3LKWAITNOTIFY            "load-5" (J9VMThread:0x0000000000585500)
3LKWAITNOTIFY            "load-6" (J9VMThread:0x0000000000587A00)

2LKREGMON          JIT-QueueSlotMonitor-4 lock (0x0000FFFF60002610): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "main" (J9VMThread:0x0000000000011200)
2LKREGMON          JIT-QueueSlotMonitor-5 lock (0x0000FFFF600026C8): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-0" (J9VMThread:0x00000000003E6E00)
@pshipton
Copy link
Member Author

@0xdaryl @knn-k can someone pls take a look.

@knn-k
Copy link
Contributor

knn-k commented Nov 23, 2022

I ran a 20x Ginder job, and got no failures.
https://openj9-jenkins.osuosl.org/job/Grinder/1512/

@knn-k
Copy link
Contributor

knn-k commented Nov 23, 2022

Additional 30x Grinder job: No failures.
https://openj9-jenkins.osuosl.org/job/Grinder/1513/

@pshipton
Copy link
Member Author

pshipton commented Nov 25, 2022

@knn-k can you tell anything from the collected core files? There is another one.

https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_extended.system_aarch64_linux_Nightly_testList_0/299 - ub20-aarch64-2
SC_Softmx_JitAot_Linux_1

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk8_j9_extended.system_aarch64_linux_Nightly_testList_0/299/system_test_output.tar.gz

STF 02:33:30.500 - **FAILED** Process jvm4 has timed out
STF 02:33:30.500 - Collecting dumps for: jvm4
3LKMONOBJECT       java/lang/J9VMInternals$ClassInitializationLock@0x00000000FFE449C8: <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-0" (J9VMThread:0x0000000000304000)
3LKWAITNOTIFY            "load-1" (J9VMThread:0x0000000000469D00)
3LKWAITNOTIFY            "load-5" (J9VMThread:0x0000000000473000)
3LKWAITNOTIFY            "load-6" (J9VMThread:0x0000000000475500)
...
2LKREGMON          JIT-QueueSlotMonitor-3 lock (0x0000FFFF3C002558): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-4" (J9VMThread:0x0000000000470C00)
2LKREGMON          JIT-QueueSlotMonitor-4 lock (0x0000FFFF3C002838): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-3" (J9VMThread:0x000000000046E700)
2LKREGMON          JIT-QueueSlotMonitor-5 lock (0x0000FFFF3C0028F0): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-2" (J9VMThread:0x000000000046C200)

@knn-k
Copy link
Contributor

knn-k commented Nov 28, 2022

PR #16199 and #16119 are the recent JIT changes related to monitors.

I cannot tell where the waiting Java threads are blocked. I cannot find thread information from the core files using gdb.

#0  0x0000ffff93a2fd34 in ?? ()
(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 934649        0x0000ffff93a2fd34 in ?? ()
(gdb) bt
#0  0x0000ffff93a2fd34 in ?? ()
#1  0x0000ffff9295dd18 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further

I can see the list of threads and their call stack using jdmpview, but I don't know how I can get their contexts, especially the register values if the threads are running JITed code.

Call stack of the load-0 thread:

> !stack 0x00304000
<304000>                                known but unhandled frame type com.ibm.j9ddr.vm29.pointer.U8Pointer @ 0x00000005

 FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT  FAULT

<304000>        !j9method 0x00000000004B6D60   org/junit/runner/Computer.getSuite(Lorg/junit/runners/model/RunnerBuilder;[Ljava/lang/Class;)Lorg/junit/runner/Runner;
<304000>        !j9method 0x00000000004ACD60   org/junit/runner/Request.classes(Lorg/junit/runner/Computer;[Ljava/lang/Class;)Lorg/junit/runner/Request;
<304000>        !j9method 0x00000000004ACD80   org/junit/runner/Request.classes([Ljava/lang/Class;)Lorg/junit/runner/Request;
<304000>        !j9method 0x000000000042E338   net/adoptopenjdk/loadTest/adaptors/JUnitAdaptor.executeTest()Lnet/adoptopenjdk/loadTest/adaptors/AdaptorInterface$ResultStatus;
<304000>        !j9method 0x0000000000450E18   net/adoptopenjdk/loadTest/LoadTestRunner$2.run()V
<304000>        !j9method 0x000000000039BE68   java/util/concurrent/ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
<304000>        !j9method 0x0000000000462A00   java/util/concurrent/ThreadPoolExecutor$Worker.run()V
<304000>        !j9method 0x000000000004B790   java/lang/Thread.run()V
<304000>                                JNI call-in frame
<304000>                                Native method frame

@pshipton
Copy link
Member Author

pshipton commented Nov 28, 2022

For gdb, does it help to have the system libraries as collected by jpackcore? These are from the first failure on ub20-aarch64-4 although both machines should be the same.

core.20221122.011528.2115359.0001.dmp.zip

gdb can find these and the JVM libraries by setting solib-search-path

@pshipton
Copy link
Member Author

The known but unhandled frame type com.ibm.j9ddr.vm29.pointer.U8Pointer @ 0x00000005 is the unresolved #16271

@pshipton
Copy link
Member Author

@0xdaryl @gacholio do you have any tips to diagnose this further?

@gacholio
Copy link
Contributor

Getting proper stacks (native and java using !stackslots) would be a good start to see why threads are waiting.

@knn-k
Copy link
Contributor

knn-k commented Nov 29, 2022

Findings from the failure in #16353 (comment) :

  • Threads load-0/1/5/6 are waiting in org/junit/runner/Computer.getSuite() on J9VMInternals$ClassInitializationLock, whose initializing thread is load-2.
  • Thread load-2 is in org/junit/validator/AnnotationsValidator$ClassValidator.<init>(), waiting on JIT-QueueSlotMonitor-5.
  • Threads main and load-3/4 are waiting on other JIT-QueueSlotMonitors.

From javacore.20221125.023330.932995.0002.txt:

1LKMONPOOLDUMP Monitor Pool Dump (flat & inflated object-monitors):
2LKMONINUSE      sys_mon_t:0x0000FFFF3C003248 infl_mon_t: 0x0000FFFF3C0032C8:
3LKMONOBJECT       java/lang/J9VMInternals$ClassInitializationLock@0x00000000FFE449C8: <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-0" (J9VMThread:0x0000000000304000)
3LKWAITNOTIFY            "load-1" (J9VMThread:0x0000000000469D00)
3LKWAITNOTIFY            "load-5" (J9VMThread:0x0000000000473000)
3LKWAITNOTIFY            "load-6" (J9VMThread:0x0000000000475500)

3XMTHREADINFO      "load-0" J9VMThread:0x0000000000304000, omrthread_t:0x0000FFFF8C1C7DA0, java/lang/Thread:0x000000008426EA70, state:CW, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1B, isDaemon:false)
(... snip ...)
3XMTHREADBLOCK     Waiting on: java/lang/J9VMInternals$ClassInitializationLock@0x00000000FFE449C8 Owned by: <unowned> Initializing thread: "load-2"

2LKREGMON          JIT-QueueSlotMonitor-3 lock (0x0000FFFF3C002558): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-4" (J9VMThread:0x0000000000470C00)
2LKREGMON          JIT-QueueSlotMonitor-4 lock (0x0000FFFF3C002838): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-3" (J9VMThread:0x000000000046E700)
2LKREGMON          JIT-QueueSlotMonitor-5 lock (0x0000FFFF3C0028F0): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-2" (J9VMThread:0x000000000046C200)
> !j9object 0xffe449c8
!J9Object 0x00000000FFE449C8 {
        struct J9Class* clazz = !j9class 0x2B000 // java/lang/J9VMInternals$ClassInitializationLock
        Object flags = 0x00000008;
        I lockword = 0x0000B0F9 (offset = 0) (java/lang/Object) <hidden>
        Ljava/lang/Class; theClass = !fj9object 0x843799d0 (offset = 4) (java/lang/J9VMInternals$ClassInitializationLock)
}

@knn-k
Copy link
Contributor

knn-k commented Nov 30, 2022

Another failure with SC_Softmx_JitAot_Linux_1 in https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/330/ on ub20-aarch64-5.
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/330/system_test_output.tar.gz

Threads load-2/4 and main are waiting on JIT-QueueSlotMonitor-x.
Other load-x threads are waiting on java/lang/J9VMInternals$ClassInitializationLock, but they are in a method that is different from the one in #16353 (comment).

2LKMONINUSE      sys_mon_t:0x0000FFFF58003528 infl_mon_t: 0x0000FFFF580035A8:
3LKMONOBJECT       java/lang/J9VMInternals$ClassInitializationLock@0x00000000FFDE8680: <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-0" (J9VMThread:0x00000000003E6E00)
3LKWAITNOTIFY            "load-1" (J9VMThread:0x000000000057C200)
3LKWAITNOTIFY            "load-3" (J9VMThread:0x0000000000580B00)
3LKWAITNOTIFY            "load-5" (J9VMThread:0x0000000000585500)
3LKWAITNOTIFY            "load-6" (J9VMThread:0x0000000000587A00)

2LKREGMON          JIT-QueueSlotMonitor-1 lock (0x0000FFFF88003300): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-2" (J9VMThread:0x000000000057E700)

2LKREGMON          JIT-QueueSlotMonitor-3 lock (0x0000FFFF580023E8): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-4" (J9VMThread:0x0000000000583000)

2LKREGMON          JIT-QueueSlotMonitor-6 lock (0x0000FFFF58002780): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "main" (J9VMThread:0x0000000000011200)

3XMTHREADINFO      "load-0" J9VMThread:0x00000000003E6E00, omrthread_t:0x0000FFFFB8402590, java/lang/Thread:0x000000008427D148, state:CW, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1B, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x000000008408BBF8)
3XMTHREADINFO1            (native thread ID:0x1A081D, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000181)
3XMTHREADINFO2            (native stack address range from:0x0000FFFF958E2000, to:0x0000FFFF95922000, size:0x40000)
3XMCPUTIME               CPU usage total: 0.529787360 secs, current category="Application"
3XMTHREADBLOCK     Waiting on: java/lang/J9VMInternals$ClassInitializationLock@0x00000000FFDE8680 Owned by: <unowned> Initializing thread: "load-2"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=41352 (0xA188)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at org/junit/runners/ParentRunner.createTestClass(ParentRunner.java:88(Compiled Code))
4XESTACKTRACE                at org/junit/runners/ParentRunner.<init>(ParentRunner.java:83(Compiled Code))
4XESTACKTRACE                at org/junit/runners/Suite.<init>(Suite.java:112(Compiled Code))

@pshipton
Copy link
Member Author

pshipton commented Nov 30, 2022

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/330 - ub20-aarch64-5
SC_Softmx_JitAot_Linux_1 -Xcompressedrefs -Xjit -Xgcpolicy:gencon

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/330/system_test_output.tar.gz

2LKREGMON          JIT-QueueSlotMonitor-1 lock (0x0000FFFF88003300): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "load-2" (J9VMThread:0x000000000057E700)

@pshipton
Copy link
Member Author

I've set this as a blocker for the 0.36 release.

@gacholio
Copy link
Contributor

If this is repeatable, it would be worth grinding for the first build where it fails.

@knn-k
Copy link
Contributor

knn-k commented Nov 30, 2022

2 failures in a 30x Grinder job: https://openj9-jenkins.osuosl.org/job/Grinder/1567/tapResults/
This job used yesterday's nightly build: OpenJ9-JDK17-aarch64_linux-20221130-000732.tar.gz
ub20-aarch64-3

@knn-k
Copy link
Contributor

knn-k commented Dec 1, 2022

This issue was opened 9 days ago, but a nightly binary from two weeks ago seems to fail.

@knn-k
Copy link
Contributor

knn-k commented Dec 1, 2022

The test environment might be the key.
All the successful Grinder jobs above were on cent8-aarch64-x. Failures were on ub20-aarch64-x.

@knn-k
Copy link
Contributor

knn-k commented Dec 1, 2022

My assumption above was not correct.
I see failures in the Grinder job below on cent8-aarch64-3. I used the oldest nightly build available, OpenJ9-JDK17-aarch64_linux-20221108-000811.tar.gz in it.
https://openj9-jenkins.osuosl.org/job/Grinder/1573/

@gacholio
Copy link
Contributor

gacholio commented Dec 1, 2022

Is this only observed only on aarch64? That would essentially imply that this is a JIT issue.

@pshipton
Copy link
Member Author

pshipton commented Dec 1, 2022

@gacholio There was one failure on vmfarm that looks similar
#16388

@gacholio
Copy link
Contributor

gacholio commented Dec 1, 2022

Yes, I saw that - have there been changes to compilation control recently? @mpirvu

@gacholio
Copy link
Contributor

gacholio commented Dec 1, 2022

Again, we really need the detailed stack information to see where/why the threads are waiting. Might need to reproduce with a full symbols build - the javacore information isn't enough.

@pshipton
Copy link
Member Author

pshipton commented Dec 1, 2022

We should have all the symbols and full debug info by overlaying the debug image that matches the JVM. For Linux these are .debuginfo files that gdb will find and understand.

Example, for the last failure the JVM is https://openj9-jenkins.osuosl.org/job/Build_JDK17_aarch64_linux_Nightly/344/ and there is a link to the JVM and the debug-image.

@mpirvu
Copy link
Contributor

mpirvu commented Dec 1, 2022

have there been changes to compilation control recently?

No, there were no changes to compilation control aside from JITServer related changes

@knn-k
Copy link
Contributor

knn-k commented Dec 2, 2022

I found that the core and javacore files from those failures have a thread waiting at sun/util/resources/OpenListResourceBundle.loadLookup() in common. The Java methods in other threads are different from a core to another.

Thread load-4 in https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_linux_Nightly_testList_2/330/ :

3XMTHREADINFO      "load-4" J9VMThread:0x0000000000583000, omrthread_t:0x0000FFFFB84105D0, java/lang/Thread:0x00000000843CFF50, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1F, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x000000008408BBF8)
3XMTHREADINFO1            (native thread ID:0x1A0821, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000081)
3XMTHREADINFO2            (native stack address range from:0x0000FFFF95459000, to:0x0000FFFF95499000, size:0x40000)
3XMCPUTIME               CPU usage total: 0.514063000 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=41184 (0xA0E0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.loadLookup(OpenListResourceBundle.java:137(Compiled Code))
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.loadLookupTablesIfNecessary(OpenListResourceBundle.java:128(Compiled Code))
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.handleKeySet(OpenListResourceBundle.java:96(Compiled Code))
4XESTACKTRACE                at java/util/ResourceBundle.containsKey(ResourceBundle.java:2294(Compiled Code))

Thread load-3 in https://openj9-jenkins.osuosl.org/job/Grinder/1567/ :

3XMTHREADINFO      "load-3" J9VMThread:0x0000000000580B00, omrthread_t:0x0000FFFF7C420410, java/lang/Thread:0x0000000084357508, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1E, isDaemon:false)
3XMJAVALTHRCCL            jdk/internal/loader/ClassLoaders$AppClassLoader(0x0000000084082060)
3XMTHREADINFO1            (native thread ID:0x20556F, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000081)
3XMTHREADINFO2            (native stack address range from:0x0000FFFF42A7E000, to:0x0000FFFF42ABE000, size:0x40000)
3XMCPUTIME               CPU usage total: 0.191878080 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=28672 (0x7000)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.loadLookup(OpenListResourceBundle.java:137(Compiled Code))
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.loadLookupTablesIfNecessary(OpenListResourceBundle.java:128(Compiled Code))
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.handleKeySet(OpenListResourceBundle.java:96(Compiled Code))
4XESTACKTRACE                at java/util/ResourceBundle.containsKey(ResourceBundle.java:2294(Compiled Code))

Thread load-3 in https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_extended.system_aarch64_linux_Nightly_testList_0/299/ :

3XMTHREADINFO      "load-3" J9VMThread:0x000000000046E700, omrthread_t:0x0000FFFF8C2E3650, java/lang/Thread:0x000000008426F738, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1E, isDaemon:false)
3XMJAVALTHRCCL            sun/misc/Launcher$AppClassLoader(0x000000008405CFE0)
3XMTHREADINFO1            (native thread ID:0xE3CB6, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000081)
3XMTHREADINFO2            (native stack address range from:0x0000FFFF787C7000, to:0x0000FFFF78807000, size:0x40000)
3XMCPUTIME               CPU usage total: 0.142945440 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=17040 (0x4290)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.loadLookup(OpenListResourceBundle.java:137(Compiled Code))
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.loadLookupTablesIfNecessary(OpenListResourceBundle.java:128(Compiled Code))
4XESTACKTRACE                at sun/util/resources/OpenListResourceBundle.handleKeySet(OpenListResourceBundle.java:96(Compiled Code))
4XESTACKTRACE                at java/util/ResourceBundle.containsKey(ResourceBundle.java:1824(Compiled Code))

@mpirvu
Copy link
Contributor

mpirvu commented Dec 7, 2022

why would the app thread be waiting for the comp thread?

When a method body uses an optimization called "pre-existence" and the assumptions made during compilation are invalidated, the JIT will perform a synchronous compilation so that an application thread cannot execute the invalidated body, while the recompilation of the invalid body is in progress.
The number of such invalidations is typically very small (single digits)

@gacholio
Copy link
Contributor

gacholio commented Dec 7, 2022

So it may just be that a compilation is taking longer than the test timeout.

@mpirvu
Copy link
Contributor

mpirvu commented Dec 7, 2022

Is it possible to determine if the compilation threads make progress while app threads wait on QueueSlotMonitor?
If they do, then yes compilation times may be a factor here (we can also increase the test timeout to validate this hyposthesis).
If not, we have to determine what monitor the compilation threads are blocked on.

@knn-k
Copy link
Contributor

knn-k commented Dec 8, 2022

A JIT compilation thread is active and running when all other threads are waiting, as I attached gdb to the java process (See #16353 (comment)).
I also ran the top command, and the CPU usage of the java process is around 100% (such as 99.7% or 100.2%) on a multicore system for a couple of minutes. I think it shows the single-thread activity of JIT compilation.

I will try extending the test timeout later.

@knn-k
Copy link
Contributor

knn-k commented Dec 8, 2022

This test is excluded on Power and Z Linux platforms. Is there any reason for that?

https://github.com/adoptium/aqa-tests/blob/acaba28464d993731d9b484599ac14a0854ef1f9/system/sharedClasses/playlist.xml#L155

@knn-k
Copy link
Contributor

knn-k commented Dec 8, 2022

I extended the timeout value of "Step 16 - Run Jvm4 workload process" in SC_Softmx_JitAot_Linux_1 from 30 minutes to 1 hour.
See the output below. It finished in 34 minutes. I think it is not a deadlock but compilation time is sometimes too long for unknown reason.

From ITERATION 12 of https://openj9-jenkins.osuosl.org/job/Grinder/1632/consoleText :

[2022-12-08T05:17:13.107Z] STF 05:17:10.838 - Monitoring processes: jvm4
[2022-12-08T05:21:57.423Z] STF 05:21:54.451 - Heartbeat: Process jvm4 is still running
[2022-12-08T05:26:57.118Z] STF 05:26:54.198 - Heartbeat: Process jvm4 is still running
[2022-12-08T05:31:56.847Z] STF 05:31:54.463 - Heartbeat: Process jvm4 is still running
[2022-12-08T05:36:56.780Z] STF 05:36:54.079 - Heartbeat: Process jvm4 is still running
[2022-12-08T05:41:56.886Z] STF 05:41:54.234 - Heartbeat: Process jvm4 is still running
[2022-12-08T05:46:56.963Z] STF 05:46:54.067 - Heartbeat: Process jvm4 is still running
[2022-12-08T05:51:26.937Z] STF 05:51:14.819 - Monitoring Report Summary:
[2022-12-08T05:51:26.937Z] STF 05:51:14.819 -   o Process jvm4 ended with the expected exit code (1)

It took 38 minutes in ITERATION 18 of the same Grinder job.

[2022-12-08T07:27:08.709Z] STF 07:27:06.260 - Monitoring processes: jvm4
[2022-12-08T07:31:53.187Z] STF 07:31:50.195 - Heartbeat: Process jvm4 is still running
[2022-12-08T07:36:53.011Z] STF 07:36:50.351 - Heartbeat: Process jvm4 is still running
[2022-12-08T07:41:52.667Z] STF 07:41:50.485 - Heartbeat: Process jvm4 is still running
[2022-12-08T07:46:55.804Z] STF 07:46:50.294 - Heartbeat: Process jvm4 is still running
[2022-12-08T07:51:55.715Z] STF 07:51:50.473 - Heartbeat: Process jvm4 is still running
[2022-12-08T07:56:55.543Z] STF 07:56:50.072 - Heartbeat: Process jvm4 is still running
[2022-12-08T08:01:56.012Z] STF 08:01:50.198 - Heartbeat: Process jvm4 is still running
[2022-12-08T08:05:39.637Z] STF 08:05:34.238 - Monitoring Report Summary:
[2022-12-08T08:05:39.637Z] STF 08:05:34.238 -   o Process jvm4 ended with the expected exit code (1)

@knn-k
Copy link
Contributor

knn-k commented Dec 8, 2022

It is the escapeAnalysis optimization that takes long time in compiling the method TimeZoneNames.getContents().

https://github.com/ibmruntimes/openj9-openjdk-jdk17/blob/openj9/src/java.base/share/classes/sun/util/resources/TimeZoneNames.java#L45

I inserted timestamps in seconds for each optimization phase in the JIT trace file. The escapeAnalysis phase (id=47) took 233 seconds (~ 4 minutes) in the sample below, while optimizations from id=48 to 217 took only 2 seconds in total. The optimization level is "hot".

<optimization id=1 name=osrLiveRangeAnalysis method=sun/util/resources/TimeZoneNames.getContents()[[Ljava/lang/Object; time=1670502712>
(... snip ...)
<optimization id=46 name=preEscapeAnalysis method=sun/util/resources/TimeZoneNames.getContents()[[Ljava/lang/Object; time=1670502713>
<optimization id=47 name=escapeAnalysis method=sun/util/resources/TimeZoneNames.getContents()[[Ljava/lang/Object; time=1670502713>
<optimization id=48 name=postEscapeAnalysis method=sun/util/resources/TimeZoneNames.getContents()[[Ljava/lang/Object; time=1670502946>
(... snip ...)
<optimization id=218 name=hotFieldMarking method=sun/util/resources/TimeZoneNames.getContents()[[Ljava/lang/Object; time=1670502948>

@knn-k
Copy link
Contributor

knn-k commented Dec 8, 2022

Option -Xjit:{sun/util/resources/*.getContents*}(disableEscapeAnalysis) makes it much faster.

https://openj9-jenkins.osuosl.org/job/Grinder/1637/ (30 runs in 1 hour 18 minutes)

@gacholio
Copy link
Contributor

gacholio commented Dec 8, 2022

The question is why is the compilation so much slower on aarch64? These are modern processors that should compete with the other architectures, aren't they?

@Akira1Saitoh
Copy link
Contributor

I think the reason we are seeing this failure only on aarch64 is that this test is disabled on P and Z (#16353 (comment)) and TR_LocalFlushElimination is not performed on x86 because it is guarded by getEnforceStoreOrder().

if (cg()->getEnforceStoreOrder())
{
TR_FlowSensitiveEscapeAnalysis flowSensitiveAnalysis(comp(), optimizer(), comp()->getFlowGraph()->getStructure(), this);
}

@mpirvu
Copy link
Contributor

mpirvu commented Dec 8, 2022

It is the escapeAnalysis optimization that takes long time in compiling the method TimeZoneNames.getContents().

@hzongaro see #16353 (comment)

@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 8, 2022

@hzongaro could you take a look at this please. Possibly related to the recently merged #16119 (just speculation).

@vijaysun-omr
Copy link
Contributor

Note that the flush elimination part of escape analysis should be gated by whether the platform wants to optimize the fences after allocation. Since those fences only exist on Power and AArch64, I am guessing that logic probably runs here in which case disabling just that part could be an option in the short term (as opposed to all of escape analysis, which cannot be considered) while the underlying compile time issue is sorted out. This could be an option if we did not want to hold up a release for this.

@pshipton
Copy link
Member Author

pshipton commented Dec 8, 2022

why is the compilation so much slower on aarch64?

The alinux failures all seem to all be on equinix machines, which are quite a bit slower and likely virtualized.

There is a little history of excluding SC_Softmx_JitAot_Linux from adoptium/aqa-tests#1772

SC_Softmx_JitAot_Linux is excluded on Power and Z Linux due to adoptium/aqa-systemtest#79

@pshipton
Copy link
Member Author

pshipton commented Dec 8, 2022

The alinux failures all seem to all be on equinix machines, which are quite a bit slower and likely virtualized.

Sorry ignore that, I was confusing these with the OSU machines, which are slow. The equinix machines are fast.

@pshipton
Copy link
Member Author

pshipton commented Dec 8, 2022

There was a change to the equinix machines, on Nov 17.
https://openj9.slack.com/archives/CDS7QE9HB/p1668209052404539
"The Linux on AArch64 machines provided by Equinix will be unavailable in the farm most of next week. They are being physically moved to a new site."

@knn-k
Copy link
Contributor

knn-k commented Dec 8, 2022

I wrote a small Java program for reproducing the problem with getContents() as attached.
ShowTZ.java.txt

Run it on AArch64 Linux or macOS with Java 17 as follows. It takes minutes to complete.

$ jdk/bin/java "-Xjit:limit={sun/util/resources/TimeZoneNames.getContents*},count=0" -Xnoaot ShowTZ
Coordinated Universal Time

It finishes immediately when you disable escape analysis.

$ jdk/bin/java "-Xjit:limit={sun/util/resources/TimeZoneNames.getContents*},count=0,disableEscapeAnalysis" -Xnoaot ShowTZ

Can anybody try this program on Power?

OpenJ9 v0.35.0 is free from this problem.

@knn-k
Copy link
Contributor

knn-k commented Dec 9, 2022

@0xdaryl guessed it right (#16353 (comment)).

I reverted PR #16119 and related PRs (#16199 for AArch64 and eclipse-omr/omr#6772) locally.
Compilation of TimeZoneNames.getContents() finishes immediately with the build.

@vijaysun-omr
Copy link
Contributor

@klangman fyi (since I mentioned in passing to you yesterday)

@hzongaro
Copy link
Member

hzongaro commented Dec 9, 2022

Can anybody try this program on Power?

@knn-k, I verified that we see similar behaviour on Power with ShowTZ

@hzongaro
Copy link
Member

I think the primary problem lies with a call to TR::TreeTop::getEnclosingBlock in TR_LocalFlushElimination::examineNode. The affected method has a block with 10000+ TreeTops, so every call to getEnclosingBlock iterates from that point in the block back to the BBStart.

I've tested out a fix on Power, and it appears to resolve the problem. In the process of running personal builds on it.

@hzongaro
Copy link
Member

Reopening to pull request fix for v0.36.0-release branch

@hzongaro hzongaro reopened this Dec 13, 2022
@knn-k
Copy link
Contributor

knn-k commented Dec 13, 2022

Grinder of SC_Softmx_JitAot_Linux_1 on AArch64 Linux after the fix was merged: https://openj9-jenkins.osuosl.org/job/Grinder/1693/
30x finished in 1 hour and 22 minutes. Fast enough.

@pshipton
Copy link
Member Author

#16465 is merged for 0.36

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants