SC_Softmx_JitAot_Linux Invalid class pointer assert MarkingSchemeRootMarker.cpp:48 #17240

Closed
pshipton opened this issue Apr 21, 2023 · 27 comments · Fixed by eclipse-omr/omr#7014

Comments

@pshipton
Member

https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_extended.system_x86-64_linux_Nightly_testList_0/499
SC_Softmx_JitAot_Linux_1 -Xcompressedrefs -Xjit -Xgcpolicy:gencon

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk8_j9_extended.system_x86-64_linux_Nightly_testList_0/499/system_test_output.tar.gz

STF 21:25:17.023 - +------ Step 16 - Run Jvm4 workload process
STF 21:26:10.239 - **FAILED** Process jvm4 ended with exit code (255) and not the expected exit code/s (0,1)

16.jvm4.stderr

0000000000549500: Invalid class pointer in thread Thread-13
0000000000549500:	O-Slot=000000000052EBF8
0000000000549500:	O-Slot value=00000000FFF7A3D8
0000000000549500:	PC=00007F6E76096F07
0000000000549500:	framesWalked=1
0000000000549500:	arg0EA=000000000052EC60
0000000000549500:	walkSP=000000000052EB88
0000000000549500:	literals=0000000000000010
0000000000549500:	jitInfo=00007F6E55BD8878
0000000000549500:	method=00000000004F25E0 (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.run()V) (JIT)
0000000000549500:	stack=000000000052BCF8-000000000052ECE0
01:26:08.002 0x12b800    j9mm.479    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK8_x86-64_linux_Nightly/openj9/runtime/gc_glue_java/MarkingSchemeRootMarker.cpp:48: ((MM_StackSlotValidator(0, object, stackLocation, walkState).validate(_env)))

Same assert as in #17052

@pshipton
Member Author

@dmitripivkine fyi

@dmitripivkine
Contributor

Heap address O-Slot value=00000000FFF7A3D8 contains a Forwarded Pointer to 0xffca1230:

0xFFF7A3D0 :  00000001 00000000 ffca1234 00000000 [ ........4....... ] <--- FP to 0xFFCA1230
0xFFF7A3E0 :  fff4f5e8 00000000 00000002 00000000 [ ................ ]
0xFFF7A3F0 :  00000000 00000000 00000000 00000006 [ ................ ]
0xFFF7A400 :  00000004 00000001 00000000 00000000 [ ................ ]
0xFFF7A410 :  c25b0da2 006b0064 ffca0b44 00000000 [ ..[.d.k.D....... ] <-- hash code

Pointer 0xffca1230, however, is mid-object: it falls inside !J9Object 0x00000000FFCA1228.
It seems the real location of the forwarded object is !J9Object 0xFFCA12F0, and the FP should be 0xFFCA12F4:

0xFFCA1200 :  75db540d 00007f6e 75b290b4 00007f6e [ .T.un......un... ]
0xFFCA1210 :  000cf420 00000000 ffca3f90 83012f30 [  ........?..0/.. ]
0xFFCA1220 :  00000000 00000000 000b4720 00000000 [ ........ G...... ]
0xFFCA1230 :  ffca3fa0 ffca1270 ffca3f90 ffca3fb8 [ .?..p....?...?.. ] <-- mid object
0xFFCA1240 :  00000000 00000000 00000000 00000000 [ ................ ]
0xFFCA1250 :  000dd620 00000000 ffca3fc8 ffca3f90 [  ........?...?.. ]
0xFFCA1260 :  ffca12b0 ffca3fe0 00000000 ffca1228 [ .....?......(... ]
0xFFCA1270 :  00419520 00000000 ffca3ff0 00000000 [  .A......?...... ]
0xFFCA1280 :  00000000 00000001 ffca4000 ffca3fa0 [ .........@...?.. ]
0xFFCA1290 :  ffca1228 ffca3f90 ffca4010 ffca4028 [ (....?...@..(@.. ]
0xFFCA12A0 :  00000000 00000001 00000000 00000000 [ ................ ]
0xFFCA12B0 :  00419520 00000000 ffca4038 00000000 [  .A.....8@...... ]
0xFFCA12C0 :  00000000 00000001 ffca4048 ffca3fc8 [ ........H@...?.. ]
0xFFCA12D0 :  ffca1250 ffca3f90 ffca4058 ffca4070 [ P....?..X@..p@.. ]
0xFFCA12E0 :  00000000 00000000 00000001 00000000 [ ................ ]
0xFFCA12F0 :  0004712a 0000000a fff4f5e8 00000000 [ *q.............. ] <-- real forwarded object location
0xFFCA1300 :  00000002 00000000 00000000 00000000 [ ................ ]
0xFFCA1310 :  00000000 00000006 00000004 00000001 [ ................ ]
0xFFCA1320 :  00000000 00000000 c25b0da2 006b0064 [ ..........[.d.k. ] <-- hash code
0xFFCA1330 :  00047122 0000000a fff4f658 00000000 [ "q......X....... ]
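
The decoding used in the dumps above can be sketched in a few lines. This is only an illustration, assuming (as the dump annotations suggest) that a header slot with the 0x4 bit set holds a Forwarded Pointer; the names `kForwardedTag`, `isForwarded` and `forwardedTarget` are hypothetical and are not OMR API.

```cpp
#include <cassert>
#include <cstdint>

// Assumption taken from the dumps in this issue: a header slot whose 0x4 bit
// is set holds a Forwarded Pointer (FP) instead of a class pointer.
constexpr std::uint64_t kForwardedTag = 0x4;

inline bool isForwarded(std::uint64_t headerSlot) {
    return (headerSlot & kForwardedTag) != 0;
}

// Strip the tag bit to get the address the FP refers to.
inline std::uint64_t forwardedTarget(std::uint64_t headerSlot) {
    assert(isForwarded(headerSlot));
    return headerSlot & ~kForwardedTag;
}
```

With these helpers, the slot value ffca1234 decodes to 0xffca1230 (the mid-object address), while the expected value ffca12f4 would decode to 0xffca12f0.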

@dmitripivkine
Contributor

trying grinder 25 jobs https://openj9-jenkins.osuosl.org/job/Grinder/2310/

@dmitripivkine
Contributor

The 25-job Grinder run (https://openj9-jenkins.osuosl.org/job/Grinder/2310/) passed.

@dmitripivkine
Contributor

The assertion occurs in the STW Global GC Marking phase, but I think the Forwarded Pointer was corrupted during the previous Scavenge Copy phase. DDR gccheck shows a slot in the Monitor Table (a Clearable root) that has been fixed up using the bad pointer. This means the pointer was corrupted even before Clearable processing started:

Checking MONITOR TABLE...
  <gc check (1): from debugger: MONITOR TABLE: slot 7f6e900624e0(7f6e38016508) -> ffca1230: class pointer not in a class segment>

This implies the Forwarded Pointer was most likely corrupted by the Scavenger itself.

@dmitripivkine
Contributor

The STW Global GC was started under the same Exclusive Access umbrella because object allocation could not succeed after the Local GC. This means there was no mutator thread activity in between. This fact has nothing to do with the problem itself, but it explains the observed behaviour.

I cannot explain why the stack of Java thread 0x549500 contains references to the problematic FP location in O-slots, as if they had not been fixed up. From the Snap traces I can see that this thread was scanned during the last Local GC.

@dmitripivkine
Contributor

Working theory:
The pointer 0xffca12F0 was corrupted to 0xffca1230 by mistake by the Currently Referenced optimization in this function: https://github.com/eclipse/omr/blob/779c51b9568b18dbc6650c2ca284580e65aa4792/gc/base/standard/Scavenger.cpp#L3243...#L3276
This code assumes the Forwarded object is Tenured and installs the 0x30 remembered state in it (so the correct 0xF0 has been replaced with 0x30).
Please note that the isOld() ("Object in Tenure") check is not precise: it actually checks "Object is not in Nursery Survivor". Logically, the object cannot be located in Nursery Evacuate at this point, so "Object is not in Nursery Survivor" effectively means "Object in Tenure". The corrupted pointer is located in Nursery Evacuate, so it could potentially have been corrupted by this function:

+----------------+----------------+----------------+----------------+--------+----------------+----------------------
|    region      |     start      |      end       |    subspace    | flags  |      size      |      region type
+----------------+----------------+----------------+----------------+--------+----------------+----------------------
 00007f6e9008c9b0 0000000082fd0000 0000000083980000 00007f6e9007aac0 00000009           9b0000 ADDRESS_ORDERED
 00007f6e9008c560 00000000ffb80000 00000000ffdc0000 00007f6e90086380 0000000a           240000 ADDRESS_ORDERED Survivour
 00007f6e9008c110 00000000ffdc0000 0000000100000000 00007f6e900808e0 0000000a           240000 ADDRESS_ORDERED Evacuate
+----------------+----------------+----------------+----------------+--------+----------------+----------------------

0xFFF7A3C0 :  fff7bbc0 fff7bbd8 00000000 00000000 [ ................ ]
0xFFF7A3D0 :  00000001 00000000 ffca1234 00000000 [ ........4....... ] <--- FP to 0xFFCA1230
0xFFF7A3E0 :  fff4f5e8 00000000 00000002 00000000 [ ................ ]
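
The arithmetic behind this theory can be checked with a small sketch. It assumes the remembered-state field occupies the 0xF0 bits of the low header byte and that the installed state is 0x30, as inferred from the "0xF0 replaced with 0x30" observation above; the names below are hypothetical and are not the Scavenger's own.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kRememberedMask = 0xF0;       // assumed width of the remembered-state field
constexpr std::uint32_t kCurrentlyReferenced = 0x30;  // state value quoted in the comment above

// Overwrite the remembered-state field of a header word, leaving other bits alone.
// Applied to a real object header this is harmless; applied to a slot that
// holds a Forwarded Pointer it rewrites address bits.
inline std::uint32_t setRememberedState(std::uint32_t headerWord, std::uint32_t state) {
    assert((state & ~kRememberedMask) == 0);
    return (headerWord & ~kRememberedMask) | state;
}
```

Applying `setRememberedState` to the expected FP word 0xffca12f4 gives 0xffca1234, which is exactly the value seen at 0xFFF7A3D8.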

To prove this theory is correct (we know these flags are set on the destination object), we should be able to find a Forwarded Pointer to it. So we search for 0xFFF7A3D8 plus the Forwarded bit 0x4, that is, for 0xFFF7A3DC:

> !findall u64 0xFFF7A3Dc
Scanning memory for dc a3 f7 ff 00 00 00 00 aligned to 8 starting from 0x0
Match found at 0xfff4f5d8 <---- found!!!!!
No more matches

0xFFF4F580 :  fff7bbc4 00000000 fff4f598 00000000 [ ................ ]
0xFFF4F590 :  00000000 00000000 fff7c36c 00000000 [ ........l....... ]
0xFFF4F5A0 :  fff4f5a8 00000000 00000000 00000000 [ ................ ]
0xFFF4F5B0 :  00000000 00000000 fff7bbdc 00000000 [ ................ ]
0xFFF4F5C0 :  00000000 00000000 fff7bbb4 00000000 [ ................ ]
0xFFF4F5D0 :  00000000 00000000 fff7a3dc 00000000 [ ................ ] <---- FP to corrupted slot
0xFFF4F5E0 :  fff4f5e8 00000000 00000000 00000000 [ ................ ]
0xFFF4F5F0 :  00000000 00000000 00000000 00000000 [ ................ ]
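
The search value and the byte pattern printed by !findall can be reproduced as follows. This is a sketch under the same assumption as above (0x4 is the Forwarded bit) and with a little-endian target; `asForwardedPointer` and `littleEndianBytes` are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint64_t kForwardedTag = 0x4;  // assumed Forwarded bit

// Value a header slot would hold if it forwarded to objectAddress.
inline std::uint64_t asForwardedPointer(std::uint64_t objectAddress) {
    assert((objectAddress & kForwardedTag) == 0);
    return objectAddress | kForwardedTag;
}

// Bytes of a 64-bit value in the order they appear in x86-64 memory.
inline std::array<std::uint8_t, 8> littleEndianBytes(std::uint64_t value) {
    std::array<std::uint8_t, 8> bytes{};
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        bytes[i] = static_cast<std::uint8_t>(value >> (8 * i));
    }
    return bytes;
}
```

For 0xFFF7A3D8 this yields 0xFFF7A3DC and the pattern dc a3 f7 ff 00 00 00 00, matching the scan line above.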

We should also find Java stack references to 0xFFF4F5D8:

> !findall u32 0xFFF4F5D8
Scanning memory for d8 f5 f4 ff aligned to 4 starting from 0x0
Match found at 0x52e758 <-- java stack
Match found at 0x52e7c8

Match found at 0x532570 <-- java stack
Match found at 0x532630
Match found at 0x532868
Match found at 0x5328a0

Match found at 0xfff48d80 <-- heap

Match found at 0x7f6e559058c8 <--- C-stack
No more matches

The theory holds up so far. The open question is how we ended up with this mess of Forwarded Pointers from Evacuate to Evacuate. One possible clue is that the Nursery has just been expanded.
I am going to continue the investigation.

@dmitripivkine
Contributor

I believe the problem is an object that was not reported and scanned in a JIT method (GC map?). The pointer to an [I array of 0xA elements was not seen by the Local GC. The current bad O-slot is in net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.run()V !j9method 0x00000000004F25E0

The post-failure scenario happens to be unusual, which causes confusion, but there is no problem in the GC code, just an exotic reaction to a stale O-slot value:

An object pointer was not fixed up by the previous Local GC (0xFFF4F5D8 should have been fixed up to 0xFFF7A3D8 but was not). The Nursery was expanded before the next Local GC cycle, which placed both pointers in the same half of the Nursery. In the next Local GC cycle the object was forwarded from 0xFFF7A3D8 to 0xFFCA12F0. As a result, the slot in the thread 0x549500 stack that was never fixed up became a pointer to a Forwarded Pointer that points to another Forwarded Pointer (an illegal situation). isOld() nevertheless returned 'true' (see the comment above; this is a side effect of handling an invalid situation), so the Scavenger mistakenly recognized 0xFFF7A3D8 as a Currently Referenced object in Tenure and replaced 0xF0 with 0x30 in it, corrupting the FP while attempting to set the Remembered bits. Later, this corrupted FP was used to fix up a slot in the Monitor Table. Pointers to 0xFFF4F5D8 from the thread 0x549500 stack were fixed up to 0xFFF7A3D8. But the real location of the object is now:

> !j9object 0xFFCA12F0
!J9IndexableObject 0x00000000FFCA12F0 {
    struct J9Class* clazz = !j9arrayclass 0x47100   // [I
    Object flags = 0x0000002A;
    U_32 size = 0x0000000A;
	[0] =          2, 0x00000002, 0.00000000F
	[1] =          0, 0x00000000, 0.00000000F
	[2] =          0, 0x00000000, 0.00000000F
	[3] =          0, 0x00000000, 0.00000000F
	[4] =          0, 0x00000000, 0.00000000F
	[5] =          6, 0x00000006, 0.00000000F
	[6] =          4, 0x00000004, 0.00000000F
	[7] =          1, 0x00000001, 0.00000000F
	[8] =          0, 0x00000000, 0.00000000F
	[9] =          0, 0x00000000, 0.00000000F
}

This problem was detected by the follow-up Global GC.

This is the problematic JIT frame:

<549500> JIT frame: bp = 0x000000000052EC58, pc = 0x00007F6E76096F07, unwindSP = 0x000000000052EBC0, cp = 0x00000000004F17C0, arg0EA = 0x000000000052EC60, jitInfo = 0x00007F6E55BD8878
<549500> 	Method: net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.run()V !j9method 0x00000000004F25E0
<549500> 	Bytecode index = 831, inlineDepth = 0, PC offset = 0x0000000000000387
<549500> 	stackMap=0x00007F6E55BD8C74, slots=I16(0x0001) parmBaseOffset=I16(0x0008), parmSlots=U16(0x0001), localBaseOffset=I16(0xFFA0)
<549500> 	Described JIT args starting at 0x000000000052EC60 for U16(0x0001) slots
<549500> 		O-Slot: : a0[0x000000000052EC60] = 0x00000000FFCB3570
<549500> 	Described JIT temps starting at 0x000000000052EBF8 for IDATA(0x000000000000000C) slots
<549500> 		O-Slot: : t11[0x000000000052EBF8] = 0x00000000FFF7A3D8 <------ fixed up wrong way
<549500> 		I-Slot: : t10[0x000000000052EC00] = 0x00000000FFE57A98
<549500> 		I-Slot: : t9[0x000000000052EC08] = 0x0000000083011D70
<549500> 		I-Slot: : t8[0x000000000052EC10] = 0x00000000FFE70350
<549500> 		I-Slot: : t7[0x000000000052EC18] = 0x00000000004EF100
<549500> 		I-Slot: : t6[0x000000000052EC20] = 0x00007F6E55987B20
<549500> 		I-Slot: : t5[0x000000000052EC28] = 0x00007F6E90015170
<549500> 		I-Slot: : t4[0x000000000052EC30] = 0x000000008333B530
<549500> 		I-Slot: : t3[0x000000000052EC38] = 0x00000000004F25E0
<549500> 		I-Slot: : t2[0x000000000052EC40] = 0x00000000FFF59418
<549500> 		I-Slot: : t1[0x000000000052EC48] = 0x00007F6E75B290DD
<549500> 		I-Slot: : t0[0x000000000052EC50] = 0x0000000000000206
<549500> 	JIT-RegisterMap = UDATA(0x0000000000001002)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E55905960] = UDATA(0x00000000FFF59418) (jit_rax)
<549500> 		JIT-RegisterMap-O-Slot[0x00007F6E55905968] = 0x00000000FFCB3570 (jit_rbx)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E55905970] = UDATA(0x0000000A00047112) (jit_rcx)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E55905978] = UDATA(0x00000000FFF48D60) (jit_rdx)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E55905980] = UDATA(0x0000000000000000) (jit_rdi)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E55905988] = UDATA(0x000000008322D9D8) (jit_rsi)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E55905990] = UDATA(0x0000000000549500) (jit_rbp)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E55905998] = UDATA(0x0000000000000000) (jit_rsp)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E559059A0] = UDATA(0x000000008320A8A0) (jit_r8)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E559059A8] = UDATA(0x000000008320A8A0) (jit_r9)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E559059B0] = UDATA(0x000000008322D9D8) (jit_r10)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E559059B8] = UDATA(0x0000000000002000) (jit_r11)
<549500> 		JIT-RegisterMap-O-Slot[0x00007F6E559059C0] = 0x00000000FFF7A3D8 (jit_r12) <----- fixed up wrong way
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E559059C8] = UDATA(0x0000000000000004) (jit_r13)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E559059D0] = UDATA(0x000000008322D9D8) (jit_r14)
<549500> 		JIT-RegisterMap-I-Slot[0x00007F6E559059D8] = UDATA(0x0000000000000010) (jit_r15)
<549500> 	JIT-Frame-RegisterMap[0x000000000052EBE0] = UDATA(0x00000000004F2400) (jit_rbx)
<549500> 	JIT-Frame-RegisterMap[0x000000000052EBE8] = UDATA(0x0000000000000000) (jit_r9)
<549500> 	JIT-Frame-RegisterMap[0x00007F6E559059B0] = UDATA(0x000000008322D9D8) (jit_r10)
<549500> 	JIT-Frame-RegisterMap[0x00007F6E559059B8] = UDATA(0x0000000000002000) (jit_r11)
<549500> 	JIT-Frame-RegisterMap[0x00007F6E559059C0] = UDATA(0x00000000FFF7A3D8) (jit_r12)
<549500> 	JIT-Frame-RegisterMap[0x00007F6E559059C8] = UDATA(0x0000000000000004) (jit_r13)
<549500> 	JIT-Frame-RegisterMap[0x00007F6E559059D0] = UDATA(0x000000008322D9D8) (jit_r14)
<549500> 	JIT-Frame-RegisterMap[0x00007F6E559059D8] = UDATA(0x0000000000000010) (jit_r15)
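
The JIT-RegisterMap value in this frame can be decoded by hand; the sketch below does the same thing in code. It assumes that bit i of the low 16 bits corresponds to the i-th register in the order the dump prints them, which is consistent with 0x1002 marking jit_rbx and jit_r12 as O-slots. `kJitRegisters` and `objectRegisters` are hypothetical names, and any higher bits of the real map are ignored here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Register order as printed in the JIT-RegisterMap dump above (bit 0 first).
const std::array<const char*, 16> kJitRegisters = {
    "jit_rax", "jit_rbx", "jit_rcx", "jit_rdx", "jit_rdi", "jit_rsi",
    "jit_rbp", "jit_rsp", "jit_r8",  "jit_r9",  "jit_r10", "jit_r11",
    "jit_r12", "jit_r13", "jit_r14", "jit_r15"};

// Names of the registers whose bit is set, i.e. those reported as O-slots.
inline std::vector<std::string> objectRegisters(std::uint32_t registerMap) {
    std::vector<std::string> names;
    for (std::size_t bit = 0; bit < kJitRegisters.size(); ++bit) {
        if (registerMap & (1u << bit)) {
            names.emplace_back(kJitRegisters[bit]);
        }
    }
    return names;
}
```

`objectRegisters(0x1002)` returns jit_rbx and jit_r12, the two registers shown as JIT-RegisterMap-O-Slot in the frame.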

I have preserved the test results; please ask me if you need them. I can also provide more triage details.

@dmitripivkine
Contributor

@0xdaryl FYI. It looks like the root of the problem is a stale O-slot value (the object was not reported to GC).

@0xdaryl
Contributor

0xdaryl commented May 16, 2023

@a7ehuo : please investigate

@a7ehuo
Contributor

a7ehuo commented May 17, 2023

@dmitripivkine I have some questions to help myself understand the failure better.

  1. In the comment

"Seems real location of forwarded object is !J9Object 0xFFCA12F0 and FP should be 0xFFCA12F4"

How is it concluded that the corrupted 0xffca1230 should be 0xFFCA12F0? I'm trying to understand how to trace things like this.

  2. In the comment

"The location of corrupted pointer is Nursery Evacuate"

My understanding is that the range for Evacuate is 00000000ffdc0000 0000000100000000 and the range for Survivour is 00000000ffb80000 00000000ffdc0000, so wouldn't the corrupted pointer 0xffca1230 fall into the Survivour range?

  3. In the comment

"(0xFFF4F5D8 should be fixed up to 0xFFF7A3D8 but it has not)"

0xFFF4F5D8 points to 0xFFF7A3DC. Wouldn't 0xFFF4F5D8 be considered fixed up to 0xFFF7A3D8, since 0xFFF7A3DC - 0x4 is 0xFFF7A3D8?

"next Local GC cycle object has been forwarded from 0xFFF7A3D8 to 0xFFCA12F0"

What trace shows that 0xFFF7A3D8 is forwarded to 0xFFCA12F0? I guess this question is related to the first one.

@dmitripivkine
Contributor

"How is it concluded that the corrupted 0xffca1230 should be 0xFFCA12F0? I'm trying to understand how to trace things like this."

Usually it is very hard or impossible to find which correct pointer should replace a corrupted one. However, we are lucky this time:

  • the correct object is located very close to where the corrupted reference points, so it is on the same screen
  • the values of the array elements are identical
  • the dataAddr slot in the object header was initialized by GC but not updated, so it is identical as well. This was a very temporary transitional state (Dual Header was implemented on the GC side but disabled because the JIT change was not ready at the time). It no longer exists, since the JIT now enables the Dual Header Indexable object format (the dataAddr slot is now used for Balanced GC only and does not exist for Gencon)
  • and, most importantly, this Indexable object has been hashed. The hash code for an Indexable Object is stored at the end (an extra inflated slot), and both hash codes are identical.

In Evacuate:

0xFFF7A3D0 :  00000001 00000000 ffca1234 00000000 [ ........4....... ] <- FP to 0xFFCA1230 (replaces class slot)
0xFFF7A3E0 :  fff4f5e8 00000000 00000002 00000000 [ ................ ]  <- 0xfff4f5e8 (dataAddr slot)
0xFFF7A3F0 :  00000000 00000000 00000000 00000006 [ ................ ]
0xFFF7A400 :  00000004 00000001 00000000 00000000 [ ................ ]
0xFFF7A410 :  c25b0da2 006b0064 ffca0b44 00000000 [ ..[.d.k.D....... ] <-- hash code 0xc25b0da2

In Survivor:

0xFFCA12F0 :  0004712a 0000000a fff4f5e8 00000000 [ *q.............. ] <- class 0x0004712a, size 0xA, dataAddr 0xfff4f5e8
0xFFCA1300 :  00000002 00000000 00000000 00000000 [ ................ ]
0xFFCA1310 :  00000000 00000006 00000004 00000001 [ ................ ]
0xFFCA1320 :  00000000 00000000 c25b0da2 006b0064 [ ..........[.d.k. ] <-- hash code 0xc25b0da2
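
The comparison described above can be done mechanically. The sketch below transcribes the 32-bit words of both copies from the two dumps (starting at 0xFFF7A3D8 and 0xFFCA12F0) and compares everything after the first two header words, which differ because the Evacuate copy has been overwritten by the FP. `ObjectWords` and `samePayload` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 4 header words (class, size, dataAddr low/high), 10 elements, 1 hash word.
using ObjectWords = std::array<std::uint32_t, 15>;

// Words read from the dump starting at 0xFFF7A3D8 (Evacuate).
const ObjectWords kEvacuateCopy = {
    0xffca1234, 0x00000000, 0xfff4f5e8, 0x00000000,
    2, 0, 0, 0, 0, 6, 4, 1, 0, 0,
    0xc25b0da2};

// Words read from the dump starting at 0xFFCA12F0 (Survivor).
const ObjectWords kSurvivorCopy = {
    0x0004712a, 0x0000000a, 0xfff4f5e8, 0x00000000,
    2, 0, 0, 0, 0, 6, 4, 1, 0, 0,
    0xc25b0da2};

// Compare two copies while ignoring the first skipWords header words.
inline bool samePayload(const ObjectWords& a, const ObjectWords& b, std::size_t skipWords) {
    assert(skipWords <= a.size());
    return std::equal(a.begin() + skipWords, a.end(), b.begin() + skipWords);
}
```

`samePayload(kEvacuateCopy, kSurvivorCopy, 2)` holds: the dataAddr slot, all ten elements and the hash code match, which is what ties the two locations together.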

@dmitripivkine
Contributor

"My understanding is that if the range for Evacuate is 00000000ffdc0000 0000000100000000 and the range for Survivour is 00000000ffb80000 00000000ffdc0000, wouldn't the corrupted pointer 0xffca1230 fall into the Survivour range?"

Yes, sure. Both 0xffca1230 and 0xffca12F0 belong to Survivor. The corruption, however, occurred at 0xFFF7A3D8 in Evacuate.
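
For reference, the addresses in this thread can be classified against the region table posted earlier. This is a sketch that hard-codes the three rows of that table, treating start as inclusive and end as exclusive (an assumption); the first row carries no Survivor/Evacuate label in the table, so it is named `FirstRow` here rather than guessing its role.

```cpp
#include <cassert>
#include <cstdint>

enum class Region { FirstRow, Survivor, Evacuate, Outside };

struct Range {
    std::uint64_t start;  // inclusive (assumed)
    std::uint64_t end;    // exclusive (assumed)
    Region region;
};

// Rows copied from the region table earlier in this issue.
const Range kRegions[] = {
    {0x82fd0000ULL, 0x83980000ULL, Region::FirstRow},
    {0xffb80000ULL, 0xffdc0000ULL, Region::Survivor},
    {0xffdc0000ULL, 0x100000000ULL, Region::Evacuate},
};

inline Region regionOf(std::uint64_t address) {
    for (const Range& r : kRegions) {
        assert(r.start < r.end);
        if (address >= r.start && address < r.end) {
            return r.region;
        }
    }
    return Region::Outside;
}
```

With this table, 0xffca1230 and 0xffca12f0 both map to Survivor, while 0xFFF7A3D8 and 0xFFF4F5D8 map to Evacuate.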

@dmitripivkine
Contributor

"(0xFFF4F5D8 should be fixed up to 0xFFF7A3D8 but it has not)
0xFFF4F5D8 points to 0xFFF7A3DC. Wouldn't it be considered that 0xFFF4F5D8 is fixed up to 0xFFF7A3D8 since 0xFFF7A3DC - 0x4 is 0xFFF7A3D8"

OK, this is a more complicated question. Please consider that we are dealing with two consecutive Scavenges:

  • the first (previous) Scavenge copied the object from 0xFFF4F5D8 to 0xFFF7A3D8. It should have fixed up the O-slot by replacing 0xFFF4F5D8 with 0xFFF7A3D8, but this was not done because the JIT frame iterator missed the slot.
  • the second (last) Scavenge copied the object from 0xFFF7A3D8 to 0xFFCA12F0 and installed an FP at 0xFFF7A3D8. When it scanned the problematic thread stack, it discovered an O-slot pointing to 0xFFF4F5D8 (a Forwarded Pointer that now points to another Forwarded Pointer at 0xFFF7A3D8, both in Evacuate). The 0xFFF7A3D8 (Evacuate) address was treated as a Tenure address (a misinterpretation, because an Evacuate address cannot be a "destination"; this is the reaction to an illegal situation). So, treating it as a "Tenure" object address, the code tried to install the special Remembered state and set the flags 0x30, which happened to replace 0xf0 at memory location 0xFFF7A3D8 (so the FP to 0xffca12f0 became an FP to 0xffca1230). After that, the Scavenge "fixed up" the O-slot by replacing 0xFFF4F5D8 with 0xFFF7A3D8 (by following the FP). This is obviously not correct: the fixup should have been done by the previous Scavenge but was missed.
  • the Global GC that followed discovered 0xFFF7A3D8 in the O-slot and asserted, because this is a pointer to an FP, not an actual object location.
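
The sequence above can be replayed with a toy model. This is a deliberately simplified sketch, not Scavenger code: the forwarding state is reduced to a map from old location to new location, and `fixupSlot` follows at most one Forwarded Pointer, as a fixup of a single slot would. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using Address = std::uint64_t;
// Old location -> new location, one entry per installed Forwarded Pointer.
using ForwardingTable = std::map<Address, Address>;

// Model of fixing up one slot: follow a single FP if the slot points at one.
inline Address fixupSlot(const ForwardingTable& fps, Address slot) {
    const auto it = fps.find(slot);
    return it == fps.end() ? slot : it->second;
}

// A correctly fixed-up slot must refer to an object, never to another FP.
inline bool pointsToForwardedPointer(const ForwardingTable& fps, Address slot) {
    return fps.count(slot) != 0;
}
```

Filling the table with 0xFFF4F5D8 -> 0xFFF7A3D8 (first Scavenge) and 0xFFF7A3D8 -> 0xFFCA12F0 (second Scavenge), a stale slot holding 0xFFF4F5D8 is fixed up to 0xFFF7A3D8, which is itself an FP. A slot that had been fixed up correctly the first time would hold 0xFFF7A3D8 and end at 0xFFCA12F0.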

@dmitripivkine
Contributor

dmitripivkine commented May 17, 2023

BTW, in case it helps the investigation: the problematic array is referenced from the typeOfBuffer slot of the net/adoptopenjdk/test/nio/NioBuffersTest$Buffers object:

> !j9object 0xffca0468
!J9Object 0x00000000FFCA0468 {
	struct J9Class* clazz = !j9class 0x4F1600 // net/adoptopenjdk/test/nio/NioBuffersTest$Buffers
	Object flags = 0x00000020;
	I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
	Lnet/adoptopenjdk/test/nio/NioBuffersTest; bufferTest = !fj9object 0xffca0520 (offset = 4) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	Ljava/io/File; BufferFile = !fj9object 0xffca1210 (offset = 8) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	Ljava/io/FileOutputStream; fos = !fj9object 0xffca1228 (offset = 12) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	Ljava/io/FileInputStream; fis = !fj9object 0xffca1250 (offset = 16) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	Ljava/nio/channels/FileChannel; readIntoFileFromBuffers = !fj9object 0xffca1270 (offset = 20) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	Ljava/nio/channels/FileChannel; readFromFileIntoBuffers = !fj9object 0xffca12b0 (offset = 24) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	[I typeOfBuffer = !fj9object 0xffca12f0 (offset = 28) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers) <-------
	[I numberOfBytes = !fj9object 0xffca0b40 (offset = 32) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	[I sizeOfBufferInBytes = !fj9object 0xffca1330 (offset = 36) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	[I numberOfRandomNumbers = !fj9object 0xffca1370 (offset = 40) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	[Ljava/nio/ByteBuffer; readIntoFile = !fj9object 0xffca13b0 (offset = 44) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	[Ljava/nio/ByteBuffer; readFromFile = !fj9object 0xffca13e8 (offset = 48) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	[Ljava/util/List; inputArrayOfLinkedLists = !fj9object 0xffca1420 (offset = 52) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	[Ljava/util/List; outputArrayOfLinkedLists = !fj9object 0xffca1458 (offset = 56) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
	Lnet/adoptopenjdk/test/nio/NioBuffersTest; this$0 = !fj9object 0xffca0520 (offset = 60) (net/adoptopenjdk/test/nio/NioBuffersTest$Buffers)
}

@a7ehuo
Contributor

a7ehuo commented May 18, 2023

@dmitripivkine Thank you very much for the clarification!

@a7ehuo
Contributor

a7ehuo commented May 18, 2023

Ran SC_Softmx_JitAot_Linux_1 in internal Grinder (ID 32988): x200, all passed.

I'll continue to look at the core while rerunning it in the Grinder.

@a7ehuo
Contributor

a7ehuo commented May 18, 2023

Another 400x run in Grinder (Internal ID 33015) has passed.

@a7ehuo
Contributor

a7ehuo commented May 26, 2023

I looked at the logs from multiple passing runs and found an issue with loading net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.this$1. Sometimes BufferThread.this$1 (n5019n) is evaluated into a register that is not a collected reference register: GPR_0321 (without &) [1].

Since BufferThread.this$1 (n5019n) is not in a collected reference register, it is not added to the GC map for byteCodeIndex=826 [2]. BufferThread.this$1 is used as a base address to load Buffers.typeOfBuffer (n5022n) which is unresolved.

The reason that BufferThread.this$1 (n5019n) is not in a collected reference register is that evaluation of its grandparent node n412n l2a is skipped [1]. Evaluating l2a (OMR::X86::TreeEvaluator::l2aEvaluator) would make sure that the target register GPR_0321 is in a collected reference register.

The reason that n412n l2a is skipped is as follows. When n5022n (iloadi Buffers.typeOfBuffer) is evaluated, it invokes generateX86MemoryReference, which eventually invokes populateMemoryReference on the base n412n l2a. OMR::X86::MemoryReference::populateMemoryReference skips l2a if compressed refs are in use, the node has a reference count of 1, and its register is NULL [3].

I found that a similar issue was fixed on arm64 in OMR::ARM64::MemoryReference::populateMemoryReference for #14663 [4].

This issue in X86 is similar: The memory reference for n5022n (iloadi Buffers.typeOfBuffer) has UnresolvedDataSnippet (instruction 0x7fa3205a8fd0)[5].

I tried a similar fix on X86 (check !self()->hasUnresolvedDataSnippet() before skipping l2a). With the fix, l2a is now evaluated and BufferThread.this$1 is in &GPR_0321 (edx). edx is now added to the GC map for byteCodeIndex=826 [6].
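
The shape of the changed condition can be modelled in isolation. The sketch below is not the OMR code: `NodeModel` and `maySkipL2a` are hypothetical stand-ins for the properties that populateMemoryReference inspects in [3], with the unresolved-snippet check added as the extra condition described in this comment.

```cpp
#include <cassert>

// Hypothetical stand-in for the node properties the real check reads.
struct NodeModel {
    bool isL2a;          // opcode is TR::l2a
    int referenceCount;  // node reference count
    bool hasRegister;    // a register has already been assigned
};

// Decide whether the l2a node may be bypassed when building the memory reference.
inline bool maySkipL2a(const NodeModel& node,
                       bool useCompressedPointers,
                       bool hasUnresolvedDataSnippet) {
    return useCompressedPointers
        && node.isL2a
        && node.referenceCount == 1
        && !node.hasRegister
        && !hasUnresolvedDataSnippet;  // extra condition tried in this comment
}
```

For a node like n412n (l2a, reference count 1, no register), the model returns false when the memory reference has an unresolved data snippet, so l2a would be evaluated and the base register would become a collected reference register.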

[1]

------------------------------
 n416n    (  0)  ResolveAndNULLCHK on n5023n [#32]                                                    [0x7fa31f8e41c0] bci=[-1,826,660] rc=0 vc=3551 vn=- li=35 udi=- nc=1
 n415n    (  4)    l2a (in &GPR_0323)                                                                 [0x7fa31f8e4170] bci=[-1,826,660] rc=4 vc=3551 vn=- li=35 udi=36656 nc=1
 n5023n   (  0)      iu2l (in &GPR_0323)                                                              [0x7fa32023e1d0] bci=[-1,826,660] rc=0 vc=3551 vn=- li=- udi=36656 nc=1
 n5022n   (  0)        iloadi  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers.typeOfBuffer [I[#442  unresolved notAccessed volatile Shadow] [flags 0x2607 0x0 ] (in &GPR_0323)  [0x7fa32023e180] bci=[-1,826,660] rc=0 vc=3551 vn=- li=- udi=36656 nc=1
 n412n    (  0)          l2a (X!=0 X>=0 )    /*<-- skipped */                                                         [0x7fa31f8e4080] bci=[-1,823,660] rc=0 vc=3551 vn=- li=35 udi=- nc=1 flg=0x104
 n5020n   (  0)            iu2l (in GPR_0321) (X!=0 )                                                 [0x7fa32023e0e0] bci=[-1,823,660] rc=0 vc=3551 vn=- li=35 udi=36224 nc=1 flg=0x4
 n5019n   (  0)              iloadi  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.this$1 Lnet/adoptopenjdk/test/nio/NioBuffersTest$Buffers;[#441  final Shadow +104] [flags 0x20607 0x0 ] (in GPR_0321) (X!=0 )  [0x7fa32023e090] bci=[-1,823,660] rc=0 vc=3551 vn=- li=35 udi=36224 nc=1 flg=0x4
 n4297n   ( 10)                ==>aRegLoad (in &GPR_0320) (X!=0 X>=0 SeenRealReference )
------------------------------

 [0x7fa3205a8e20]	mov	GPR_0321, dword ptr [&GPR_0320+0x68]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.this$1 Lnet/adoptopenjdk/test/nio/NioBuffersTest$Buffers;[#669  final Shadow +104] [flags 0x20607 0x0 ] 
 [0x7fa3205a9060]	nop			# Avoid boundary @8 [0x0:8]
 [0x7fa3205a8fd0]	mov	&GPR_0323, dword ptr [GPR_0321-0x0]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers.typeOfBuffer [I[#668  unresolved volatile Shadow] [flags 0x2607 0x0 ]
0x7fa30fa1d2a6 00000372 [0x7fa3205a8e20] 8b 53 68                           mov	edx, dword ptr [rbx+0x68]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.this$1 Lnet/adoptopenjdk/test/nio/NioBuffersTest$Buffers;[#669  final Shadow +104] [flags 0x20607 0x0 ]
0x7fa30fa1d2a9 00000375 [0x7fa3205a9060] 0f 1f 80 00 00 00 00               nop (7 bytes)		# Avoid boundary @8 [0x0:8]
0x7fa30fa1d2b0 0000037c [0x7fa3205a8fd0] e8 ef 81 00 00 00 00               mov	r12d, dword ptr [rdx-0x0]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers.typeOfBuffer [I[#668  unresolved volatile Shadow] [flags 0x2607 0x0 ]

[2]

    stackmap location: 00007FA30C66B242
    map range: starting at [00007FA30FA1D288]
      lowOffset: 00000354
      byteCodeInfo: <_callerIndex=-1, byteCodeIndex=102>, _isSameReceiver=0, _doNotProfile=0
      registerSaveDescription: starting at [0C66B248] { 00000000 }
      registers: 00000002	{ 1:ebx }
      stack map: 1000000000000	{ 160 }

    stackmap location: 00007FA30C66B252
    map range: starting at [00007FA30FA1D2B0]
      lowOffset: 0000037C
      byteCodeInfo: <_callerIndex=-1, byteCodeIndex=826>, _isSameReceiver=0, _doNotProfile=1
      ByteCodeInfo Map //<----- missing  

[3]

MemoryReference: base n412n 00007FA31F8E4080 calling populateMemoryReference
populateMemoryReference: subTree n412n 00007FA31F8E4080 hasUnresolvedDataSnippet 1
populateMemoryReference: origSubTree n412n 00007FA31F8E4080 is skipped and changed to n5020n 00007FA32023E0E0
// OMR::X86::MemoryReference::populateMemoryReference

   if (comp->useCompressedPointers())
       {
       if ((subTree->getOpCodeValue() == TR::l2a) && (subTree->getReferenceCount() == 1) &&
             (subTree->getRegister() == NULL))
          {
          cg->decReferenceCount(subTree);
          subTree = subTree->getFirstChild();
          if (subTree->getRegister() == NULL)
             nodeToBeAdjusted = subTree;
          }
       }

[4]

// OMR::ARM64::MemoryReference::populateMemoryReference

      if (subTree->getOpCodeValue() == TR::l2a && subTree->getReferenceCount() == 1 && subTree->getRegister() == NULL &&
          self()->getUnresolvedSnippet() == NULL)
         {
         /*
          * We need to avoid skipping l2a node when the memory reference has a UnresolvedDataSnippet because skipping l2a
          * makes the base register non-collected reference register.
          * When a UnresolvedDataSnippet exists, the memory reference will generate multiple instructions.
          * The first instruction is the branch to the UnresolvedDataSnippet which will be patched when the resolution
          * finishes, and the last instruction is the actual load using the base register and resolved offset.
          * The branch to the resolution helper can trigger GC, but if the base register is not a collected reference
          * register, the valule of the base register will not be updated by GC.
          * This is a tactical solution for OpenJ9 issue 14663.
          */

[5]

0x7fa30fa1d2b0 0000037c [0x7fa3205a8fd0] e8 ef 81 00 00 00 00               mov	r12d, dword ptr [rdx-0x0]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers.typeOfBuffer [I[#668  unresolved volatile Shadow] [flags 0x2607 0x0 ]
00007FA30FA254A4 00008570                                                 Snippet Label L0208:		# Unresolved Data Snippet for instr [0x7fa3205a8fd0]
0x7fa30fa254a4 00008570                      e8 87 28 7c 13                 call	interpreterUnresolvedFieldGlue
0x7fa30fa254a9 00008575                      10 28 55 00 00 00 00 00        .quad	0000000000552810	# address of constant pool for this method
0x7fa30fa254b1 0000857d                      49 00 00 00                    .int	0x00000049				# constant pool index
0x7fa30fa254b5 00008581                      73                             .byte	73							# instruction descriptor: length=7, disp32 offset=3
0x7fa30fa254b6 00008582                      44 8b a2 00 00 00 00 66        .byte	(8)						# patch instruction bytes

[6]

// With the fix

------------------------------
 n416n    (  0)  ResolveAndNULLCHK on n5023n [#32]                                                    [0x7fd842c9e1c0] bci=[-1,826,660] rc=0 vc=3551 vn=- li=35 udi=- nc=1
 n415n    (  4)    l2a (in &GPR_0323)                                                                 [0x7fd842c9e170] bci=[-1,826,660] rc=4 vc=3551 vn=- li=35 udi=12144 nc=1
 n5023n   (  0)      iu2l (in &GPR_0323)                                                              [0x7fd8435f81d0] bci=[-1,826,660] rc=0 vc=3551 vn=- li=- udi=12144 nc=1
 n5022n   (  0)        iloadi  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers.typeOfBuffer [I[#442  unresolved notAccessed volatile Shadow] [flags 0x2607 0x0 ] (in &GPR_0323)  [0x7fd8435f8180] bci=[-1,826,660] rc=0 vc=3551 vn=- li=- udi=12144 nc=1
 n412n    (  0)          l2a (in &GPR_0321) (X!=0 X>=0 )                                              [0x7fd842c9e080] bci=[-1,823,660] rc=0 vc=3551 vn=- li=35 udi=11648 nc=1 flg=0x104
 n5020n   (  0)            iu2l (in &GPR_0321) (X!=0 )                                                [0x7fd8435f80e0] bci=[-1,823,660] rc=0 vc=3551 vn=- li=35 udi=11648 nc=1 flg=0x4
 n5019n   (  0)              iloadi  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.this$1 Lnet/adoptopenjdk/test/nio/NioBuffersTest$Buffers;[#441  final Shadow +104] [flags 0x20607 0x0 ] (in &GPR_0321) (X!=0 )  [0x7fd8435f8090] bci=[-1,823,660] rc=0 vc=3551 vn=- li=35 udi=11648 nc=1 flg=0x4
 n4297n   ( 10)                ==>aRegLoad (in &GPR_0320) (X!=0 X>=0 SeenRealReference )
------------------------------

 [0x7fd843962e20]	mov	&GPR_0321, dword ptr [&GPR_0320+0x68]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.this$1 Lnet/adoptopenjdk/test/nio/NioBuffersTest$Buffers;[#669  final Shadow +104] [flags 0x20607 0x0 ]
 [0x7fd8439630a0]	nop			# Avoid boundary @8 [0x0:8]
 [0x7fd843963010]	mov	&GPR_0323, dword ptr [&GPR_0321-0x0]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers.typeOfBuffer [I[#668  unresolved volatile Shadow] [flags 0x2607 0x0 ]
 [0x7fd843963140]	nop			# Padding (2 bytes)
0x7fd8445351e6 00000372 [0x7fd843962e20] 8b 53 68                           mov	edx, dword ptr [rbx+0x68]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers$BufferThread.this$1 Lnet/adoptopenjdk/test/nio/NioBuffersTest$Buffers;[#669  final Shadow +104] [flags 0x20607 0x0 ]
0x7fd8445351e9 00000375 [0x7fd8439630a0] 0f 1f 80 00 00 00 00               nop (7 bytes)		# Avoid boundary @8 [0x0:8]
0x7fd8445351f0 0000037c [0x7fd843963010] e8 ef 81 00 00 00 00               mov	r12d, dword ptr [rdx-0x0]		# L4RegMem, SymRef  net/adoptopenjdk/test/nio/NioBuffersTest$Buffers.typeOfBuffer [I[#668  unresolved volatile Shadow] [flags 0x2607 0x0 ]
    stackmap location: 00007FD7EBA01492
    map range: starting at [00007FD8445351F0]
      lowOffset: 0000037C
      byteCodeInfo: <_callerIndex=-1, byteCodeIndex=826>, _isSameReceiver=0, _doNotProfile=0
      registerSaveDescription: starting at [EBA01498] { 00000000 }
      registers: 0000000A	{ 1:ebx 3:edx } //<---- edx is added on the GC map
      stack map: 1000000000000	{ 160 }
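
For readers unfamiliar with the `registers:` line in the stack map, here is a minimal sketch of how the hex mask lines up with the listed registers. It is only an illustration: the helper `decode_register_mask` and the table `KNOWN_BITS` are hypothetical names, and the only bit-to-register pairs used are the two shown in the trace above (`1:ebx` and `3:edx`). Any other bit is printed as `?` rather than guessed.

```python
# Bit positions 1 and 3 come from the trace line
#   registers: 0000000A  { 1:ebx 3:edx }
# No other bit-to-register mapping is assumed.
KNOWN_BITS = {1: "ebx", 3: "edx"}


def decode_register_mask(mask):
    """Return one 'bit:name' label per set bit in a GC-map register mask."""
    labels = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            labels.append("{}:{}".format(bit, KNOWN_BITS.get(bit, "?")))
        bit += 1
    return labels


# Mask from the trace with the fix: bits 1 and 3 are set.
print(decode_register_mask(0x0000000A))  # ['1:ebx', '3:edx']

# A hypothetical mask with bit 3 cleared, for comparison only.
print(decode_register_mask(0x00000002))  # ['1:ebx']
```

Read this way, the `edx is added on the GC map` annotation corresponds to bit 3 of the mask being set.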

@0xdaryl
Contributor

0xdaryl commented May 26, 2023

Good analysis. There is similar l2a avoidance code in P and Z memory references as well. Can you check (or ask Julian/Rahil) if such a fix needs to apply there?

@a7ehuo
Contributor

a7ehuo commented May 29, 2023

Since this particular test, SC_Softmx_JitAot_Linux, is disabled on P and Z, I'm not sure whether a similar issue exists there, although the code looks the same as on X86 and AArch64.

Skipped due to jvm options ( -Xcompressedrefs -Xjit -Xgcpolicy:gencon ) and/or platform requirements ([os.linux,^arch.ppc,^arch.390]) => SC_Softmx_JitAot_Linux_1_SKIPPED

Just to test the fix, I made a similar change for P and Z as well in eclipse-omr/omr@master...a7ehuo:omr:fix-missing-l2a-evaluation-2.

@r30shah @zl-wang Could you shed some light on whether this fix is required for P and Z?
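
The skip message quoted above lists platform requirements as `[os.linux,^arch.ppc,^arch.390]`. A rough sketch of how such a list might be evaluated follows, assuming that a plain entry must be present and that a `^` prefix means the entry must be absent. That reading is inferred from the skip message, not taken from the test framework's code. The function `platform_matches` and the tag `arch.x86` are hypothetical names used only for illustration.

```python
def platform_matches(requirements, platform_tags):
    """Check a comma-separated requirement list against a set of platform tags.

    Assumed semantics (inferred from the skip message, not from the
    framework source): 'tag' must be present, '^tag' must be absent.
    """
    for req in requirements.split(","):
        req = req.strip()
        if not req:
            continue
        if req.startswith("^"):
            if req[1:] in platform_tags:
                return False
        elif req not in platform_tags:
            return False
    return True


REQS = "os.linux,^arch.ppc,^arch.390"

# Hypothetical tag sets for three Linux machines.
print(platform_matches(REQS, {"os.linux", "arch.x86"}))  # True
print(platform_matches(REQS, {"os.linux", "arch.ppc"}))  # False
print(platform_matches(REQS, {"os.linux", "arch.390"}))  # False
```

Under that assumption, the test runs on Linux x86 and is skipped on P and Z, which matches the `_SKIPPED` result shown above.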

@zl-wang
Contributor

zl-wang commented May 29, 2023

@a7ehuo I do think the fix is needed for P as well. I am wondering, though: if decompression needs a shift operation, does the specific tree (above) look the same? (It might not be a problem: when it is in a different tree, the reference count is not 1 anymore.) Also, should we open up the test in question? I don't know why it was skipped.
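
As background for the shift question, here is a minimal sketch of compressed-reference decompression, assuming the usual scheme in which the 32-bit slot value is zero-extended and shifted left by a fixed amount. The function name `decompress` and the sample values are hypothetical. With a shift of 0 the operation is just a widening, which would be consistent with the tree above having only `iu2l` and `l2a` above the load and no shift node.

```python
def decompress(compressed, shift):
    """Zero-extend a 32-bit compressed reference and apply the shift.

    Sketch only: assumes decompression is (uint32 value) << shift.
    """
    return (compressed & 0xFFFFFFFF) << shift


sample = 0x00F7A3D8  # hypothetical compressed slot value

# With shift 0 the address equals the slot value (pure widening).
assert decompress(sample, 0) == sample

# With a non-zero shift an explicit shift step is needed.
assert decompress(sample, 3) == sample * 8
```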

@a7ehuo
Contributor

a7ehuo commented May 29, 2023

> should we open up the test in question?

Looks like the test is skipped on P and Z because of the issue "SharedClassesWorkloadTest_Softmx_Increase_JitAot test fails on ppc64le linux & s390x linux due to not enough AOT being generated" (#79).

	<!-- Exclude the following test on Linux ppc64le & s390x. Reason: AdoptOpenJDK/openjdk-systemtest/issues/79 -->
 	<test>
 		<testCaseName>SC_Softmx_JitAot</testCaseName>

I'll run this test in Grinder on P and Z and see what the latest result is.

@a7ehuo
Contributor

a7ehuo commented May 30, 2023

I'm seeing a failure in the openjdk jdk_util_0 test on s390x_linux.

Ran jdk_util_0 100x on s390x_linux with my change in Grinder 33103; all passed.


> if decompression needs a shift operation, the specific tree (above) looks the same?

@zl-wang I guess the tree would look different, but I'm not sure. Regardless, whether or not l2a is skipped doesn't depend on its children.


Ran 50x SC_Softmx_JitAot_Linux_0/1 on P and Z with Java 8 and Java 17 with the fix:

| Version | Platform | Grinder | Results |
| --- | --- | --- | --- |
| Java8 | P | Grinder ID 33113 | All passed |
| Java8 | Z | Grinder ID 33112 | All passed |
| Java17 | P | Grinder ID 33114 | All passed |
| Java17 | Z | Grinder ID 33115 | All passed |

@zl-wang @r30shah If it is okay with you, I can make the following changes. What do you think?

  1. Make the same change for P and Z along with X86 in one PR in OMR
  2. After the change is merged/promoted, I'll open another PR in aqa-tests to enable SC_Softmx_JitAot_Linux_0/1 for P and Z

a7ehuo added a commit to a7ehuo/omr that referenced this issue May 30, 2023
@zl-wang
Contributor

zl-wang commented May 30, 2023

@a7ehuo sounds good.

@r30shah
Contributor

r30shah commented May 30, 2023

@a7ehuo Sorry for the late response. The changes you are proposing look good to me. I am looking into that odd failure you saw, which I do not think is caused by this change.

a7ehuo added a commit to a7ehuo/aqa-tests that referenced this issue Jun 5, 2023
…penj9/openj9#17240

`SC_Softmx_JitAot_Linux` was disabled in adoptium/aqa-systemtest/adoptium#79
for P and Z. Tested recently with the fix for eclipse-openj9/openj9#17240
on Java 8 and Java 17. The test passes for P and Z.

Signed-off-by: Annabelle Huo <[email protected]>
a7ehuo added a commit to a7ehuo/aqa-tests that referenced this issue Jun 5, 2023
…penj9/openj9#17240

`SC_Softmx_JitAot_Linux` was disabled for P and Z due to
adoptium/aqa-systemtest/adoptium#79. Tested recently with the fix
for eclipse-openj9/openj9#17240 on Java 8 and Java 17.
The test passes for P and Z.

Signed-off-by: Annabelle Huo <[email protected]>
@a7ehuo
Contributor

a7ehuo commented Jun 5, 2023

Created adoptium/aqa-tests#4606 to enable SC_Softmx_JitAot_Linux for P and Z

Mesbah-Alam pushed a commit to adoptium/aqa-tests that referenced this issue Jun 5, 2023
…penj9/openj9#17240 (#4606)

`SC_Softmx_JitAot_Linux` was disabled for P and Z due to
adoptium/aqa-systemtest/#79. Tested recently with the fix
for eclipse-openj9/openj9#17240 on Java 8 and Java 17.
The test passes for P and Z.

Signed-off-by: Annabelle Huo <[email protected]>
rmnattas pushed a commit to rmnattas/omr that referenced this issue Nov 7, 2023