SC_Softmx_JitAot_Linux Invalid class pointer assert MarkingSchemeRootMarker.cpp:48 #17240
@dmitripivkine fyi |
Heap address
Pointer
|
trying grinder 25 jobs https://openj9-jenkins.osuosl.org/job/Grinder/2310/ |
passed |
The assertion occurs in the STW Global GC Marking phase, but I think the Forwarded Pointer was corrupted during the previous Scavenge Copy phase. DDR gccheck shows a slot in the Monitor Table (a Clearable root) that has been fixed up using a bad pointer. This means the pointer was corrupted even before Clearable processing started:
This implies the Forwarded Pointer was most likely corrupted by the Scavenger itself. |
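To make the reasoning above concrete, here is a minimal sketch of how a copying collector uses forwarding pointers to fix up root slots. It is a toy model, not OpenJ9 GC code: `ToyObject`, `ToyNursery`, `copyOrForward`, and `fixupRoots` are hypothetical names, and objects are addressed by vector index rather than by heap address.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy object with a header that can hold a forwarding index once copied.
// All names here are hypothetical and only illustrate the idea.
struct ToyObject {
    bool forwarded = false;     // set once the object has been copied
    std::size_t forwardTo = 0;  // survivor index, valid only if forwarded
    int payload = 0;
};

struct ToyNursery {
    std::vector<ToyObject> evacuate;  // space being emptied
    std::vector<ToyObject> survivor;  // space receiving live objects
};

// Copy evacuate[index] on the first visit; on later visits return the
// location recorded in its forwarding pointer.
std::size_t copyOrForward(ToyNursery &n, std::size_t index) {
    assert(index < n.evacuate.size());
    ToyObject &old = n.evacuate[index];
    if (!old.forwarded) {
        ToyObject copy;
        copy.payload = old.payload;
        n.survivor.push_back(copy);
        old.forwarded = true;
        old.forwardTo = n.survivor.size() - 1;
    }
    return old.forwardTo;
}

// Fix up every root slot so that it refers to the survivor copy.
void fixupRoots(ToyNursery &n, std::vector<std::size_t> &rootSlots) {
    for (std::size_t &slot : rootSlots) {
        slot = copyOrForward(n, slot);
    }
}
```

In this model every slot that refers to the same object is fixed up through the same forwarding pointer, so a single bad forwarding value would be copied into each of those slots. That is consistent with a Clearable root already holding a bad pointer by the time it is inspected.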
The STW Global GC was started under the same Exclusive Access umbrella because object allocation could not succeed after the Local GC. This means there was no mutator thread activity in between. This fact has nothing to do with the problem itself but explains the observed behaviour. I can not explain why stack of java thread |
Working theory:
To prove this theory is correct (we know these flags are set on the destination object) we should be able to find a Forwarded Pointer to it. So, searching for
We should also find a Java stack reference to
This theory holds up so far. The open question is how we ended up with this mess of Forwarded Pointers from Evacuate to Evacuate. One possible clue: the Nursery had just been expanded. |
I believe the problem is an object that was not reported and scanned in a JIT method (GC map?). The object pointer to array There is a post-failure scenario that happens to be unusual, so it causes confusion, but there is no problem in the GC code, just an exotic reaction to a stale O-slot value: the object pointer was not fixed up by the previous Local GC (
This problem was detected by the follow-up Global GC. The problematic JIT frame is:
I have preserved the test results; please ask if you need them. I can also provide more triage details. |
@0xdaryl FYI. Looks like the root of the problem is a stale O-slot value (an object was not reported to the GC) |
@a7ehuo : please investigate |
@dmitripivkine I have some questions to help myself understand the failure better.
How is it concluded that the corrupted
My understanding is that if the range for Evacuate is
What trace shows |
Usually it is very hard or impossible to tell which correct pointer should have been used in place of the corrupted one. However, we are lucky this time:
In Survivor:
|
Yes, sure. Both |
OK, this is a more complicated question. Please keep in mind that we are dealing with two consecutive Scavenges:
|
BTW, maybe this helps the investigation: the problematic array is referenced from
|
@dmitripivkine Thank you very much for the clarification! |
Ran I'll continue to look at the core while rerunning it in the Grinder. |
Another 400x run in Grinder (Internal ID |
I looked at the logs from multiple runs in the passing cases and found an issue on loading Since The reason that The reason that I found a similar issue is fixed in arm64 in OMR::ARM64::MemoryReference::populateMemoryReference for #14663 [4]. This issue in X86 is similar: The memory reference for I tried the similar fix (check
[1]
[2]
[3]
// OMR::X86::MemoryReference::populateMemoryReference
if (comp->useCompressedPointers())
{
if ((subTree->getOpCodeValue() == TR::l2a) && (subTree->getReferenceCount() == 1) &&
(subTree->getRegister() == NULL))
{
cg->decReferenceCount(subTree);
subTree = subTree->getFirstChild();
if (subTree->getRegister() == NULL)
nodeToBeAdjusted = subTree;
}
}
[4]
// OMR::ARM64::MemoryReference::populateMemoryReference
if (subTree->getOpCodeValue() == TR::l2a && subTree->getReferenceCount() == 1 && subTree->getRegister() == NULL &&
self()->getUnresolvedSnippet() == NULL)
{
/*
* We need to avoid skipping l2a node when the memory reference has a UnresolvedDataSnippet because skipping l2a
* makes the base register non-collected reference register.
* When a UnresolvedDataSnippet exists, the memory reference will generate multiple instructions.
* The first instruction is the branch to the UnresolvedDataSnippet which will be patched when the resolution
* finishes, and the last instruction is the actual load using the base register and resolved offset.
* The branch to the resolution helper can trigger GC, but if the base register is not a collected reference
* register, the value of the base register will not be updated by GC.
* This is a tactical solution for OpenJ9 issue 14663.
*/
[5]
[6]
|
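The hazard described in the ARM64 comment above can be illustrated with a small simulation. This is a sketch under stated assumptions, not OMR or OpenJ9 code: `ToyRegister` and `relocate` are hypothetical names, and the `collected` flag stands in for whether the register is reported to the GC as holding an object reference.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a machine register as the GC sees it.
struct ToyRegister {
    std::uintptr_t value;  // raw contents of the register
    bool collected;        // true if reported to the GC as an object reference
};

// Simulate a moving GC that relocates one object from oldAddr to newAddr.
// Only registers reported as collected references are updated; any other
// register that happens to hold oldAddr keeps the stale value.
int relocate(std::vector<ToyRegister> &regs,
             std::uintptr_t oldAddr,
             std::uintptr_t newAddr) {
    int updated = 0;
    for (ToyRegister &r : regs) {
        if (r.collected && r.value == oldAddr) {
            r.value = newAddr;
            ++updated;
        }
    }
    return updated;
}
```

If two registers hold the same object address and only one is marked as collected, a call to `relocate` updates that one and leaves the other pointing at the old location. In this model, skipping the `l2a` node corresponds to the base register being the unmarked one while a GC can run before the load.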
Good analysis. There is similar l2a avoidance code in P and Z memory references as well. Can you check (or ask Julian/Rahil) if such a fix needs to apply there? |
Since this particular test
Just to test the fix, I made a similar change for P and Z as well on eclipse-omr/omr@master...a7ehuo:omr:fix-missing-l2a-evaluation-2.
@r30shah @zl-wang Could you shed some light on whether this fix is required for P and Z? |
@a7ehuo i do think the fix for p is needed as well. i am wondering though, if decompression needs a shift operation, the specific tree (above) looks the same? (it might not be a problem. when it is in a different tree, the reference count is not 1 anymore) also, should we open up the test in question? i don't know the reason it was skipped. |
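For readers unfamiliar with the shift mentioned in the question above, the following sketch shows what compressing and decompressing a reference with a shift amounts to arithmetically. It is an illustration only: `compress`, `decompress`, and the `shift` parameter are hypothetical names, and the sketch assumes a zero heap base, which may not match every configuration.

```cpp
#include <cassert>
#include <cstdint>

// Expand a 32-bit compressed reference into a 64-bit address.
// With shift == 0 this is a plain zero-extension; with a non-zero shift an
// extra shift operation is needed, which is why the tree may look different.
std::uint64_t decompress(std::uint32_t compressed, unsigned shift) {
    assert(shift < 32);
    return static_cast<std::uint64_t>(compressed) << shift;
}

// Inverse operation: the address must be aligned to (1 << shift) bytes and
// must fit in 32 bits after shifting.
std::uint32_t compress(std::uint64_t address, unsigned shift) {
    assert(shift < 32);
    assert((address & ((std::uint64_t{1} << shift) - 1)) == 0);
    assert((address >> shift) <= UINT32_MAX);
    return static_cast<std::uint32_t>(address >> shift);
}
```

With a shift of 3, for example, a 32-bit compressed value can address objects aligned to 8 bytes across a 32 GB range, at the cost of one extra shift per decompression.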
Looks like the test is skipped on P and Z because of SharedClassesWorkloadTest_Softmx_Increase_JitAot test fails on ppc64le linux & s390x linux due to not enough AOT being generated #79.
I'll run this test in Grinder on P and Z and see what the latest result is |
Ran 100x
@zl-wang I guess the tree would look different, but I'm not sure. Regardless, whether or not Ran
@zl-wang @r30shah If it is okay with you, I can make the following changes. What do you think?
|
Fixes: eclipse-openj9/openj9#17240 Signed-off-by: Annabelle Huo <[email protected]>
@a7ehuo sounds good. |
@a7ehuo Sorry for the late response. The changes you are proposing look good to me. I am looking into that odd failure you saw, which I do not think is caused by this change. |
…penj9/openj9#17240 `SC_Softmx_JitAot_Linux` was disabled in adoptium/aqa-systemtest/adoptium#79 for P and Z. Tested recently with the fix for eclipse-openj9/openj9#17240 on Java 8 and Java 17. The test passes for P and Z. Signed-off-by: Annabelle Huo <[email protected]>
…penj9/openj9#17240 `SC_Softmx_JitAot_Linux` was disabled for P and Z due to adoptium/aqa-systemtest/adoptium#79. Tested recently with the fix for eclipse-openj9/openj9#17240 on Java 8 and Java 17. The test passes for P and Z. Signed-off-by: Annabelle Huo <[email protected]>
Created adoptium/aqa-tests#4606 to enable |
…penj9/openj9#17240 (#4606) `SC_Softmx_JitAot_Linux` was disabled for P and Z due to adoptium/aqa-systemtest/#79. Tested recently with the fix for eclipse-openj9/openj9#17240 on Java 8 and Java 17. The test passes for P and Z. Signed-off-by: Annabelle Huo <[email protected]>
https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_extended.system_x86-64_linux_Nightly_testList_0/499
SC_Softmx_JitAot_Linux_1
-Xcompressedrefs -Xjit -Xgcpolicy:gencon
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk8_j9_extended.system_x86-64_linux_Nightly_testList_0/499/system_test_output.tar.gz
16.jvm4.stderr
Same assert as in #17052