Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only reset unlog bits of objects in modbuf for nursery GCs #1169

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

k-sareen
Copy link
Collaborator

@k-sareen k-sareen commented Jul 17, 2024

This fixes a bug wherein we were erroneously resetting the unlog bits
for objects in the modbuf for full-heap GCs. If we reset the unlog bits
in a full-heap GC, we may end up setting the unlog bit for an object
that actually died which may cause issues for future allocations.

We now set the unlog bits for mature objects when tracing for VM
space (if compiled with set_unlog_bits_vm_space), immortal space, and
large object space.

This PR also adds debug assertions checking that any object that has
been added to the modbuf is considered "live" by MMTk, or in other
words, it is a mature object.

@k-sareen k-sareen requested a review from wks July 17, 2024 12:44
@k-sareen k-sareen force-pushed the fix/modbuf-global-gc-bug branch from 470787e to ec6ba72 Compare July 17, 2024 12:48
// Scan objects in the modbuf and forward pointers
let modbuf = std::mem::take(&mut self.modbuf);
GCWork::do_work(
&mut ScanObjects::<E>::new(modbuf, false, WorkBucketStage::Closure),
worker,
mmtk,
)
} else {
// Flip the per-object unlogged bits to "unlogged" state for objects inside the bootimage
#[cfg(feature = "vm_space")]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know what was better: to go through the list or bulk set the unlog bits in vm_space.release(). I thought this was better since it is more general than bulk setting the unlog bits (in case the VM binding is not using side-metadata for the unlog bits).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why the VM space needs to be specially dealt here. Why do we need to set the unlog bits in mature GCs for the VM space? What about other spaces like immortal or LOS, how is the VM space different from those spaces?

Copy link
Collaborator Author

@k-sareen k-sareen Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. So what we want to do is only reset the unlog bit for objects that are considered live. All ImmixSpace and CopySpace mature objects will set the unlog bit when you copy them over because of the CopySemantics::Mature. But objects in the VM space (or even immortal and LOS like you mentioned) don't perform copies and so we need to reset their bits iff those objects are live. VM space and immortal space objects are always live so we need to unconditionally reset them, but for LOS we need to reset only when the object is live. So really we're doing this reset at the wrong place and time.

One way to fix it would be to reset the unlog bit for mature LOS objects inside its trace_object. And then we can reset the unlog bits for VM space and immortal space objects in the ProcessModBuf work packet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But objects in the VM space (or even immortal and LOS like you mentioned) don't perform copies and so we need to reset their bits iff those objects are live.

Why not do that in the trace_object method for the policy where we know the object is reached? For ImmixSpace and CopySpace, the log bit is set during copying in trace_object.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right that's what I suggested for LOS objects. But it's not clear to me if ART, for example, will actually end up tracing through all objects in the bootimage or immortal space for a full heap GC given their optimization(s). Maybe that's an ART-specific issue though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative could be to create a new work packet to be executed after the transitive closure (or maybe release?) for major GCs. In that work packet we now have all the correct information about if a current object address is marked or not, and if it is, then we can reset the unlog bit for that object.

Copy link
Member

@qinsoon qinsoon Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to unconditionally set the log bit in ProcessModBuf (no matter nursery or mature GCs), and bulk clear log bits when we release memory for dead objects.

In the current code base, for Immix space, we will set log bits during tracing if unlog_object_when_traced is true, and will clear log bits in prepare if reset_log_bit_in_major_gc is true. Only in Sticky Immix, both flags are set to true. The former is needed, as we 'promote' nursery objects in immix space when we trace it (and we do not have to copy it to 'promote'). But the latter may not be necessary -- we can bulk clear log bits for the dead objects after a GC. For other spaces, we can always bulk clear log bits after a GC for reclaimed memory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this warrants a group discussion. We can discuss this in the next group meeting. I'll do a hack for ART in the meantime since this current PR is probably still incorrect because of LOS objects' unlog bits not being reset.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about cases where we recycle mature LOS objects? I think that is broken as well. If the unlog bit was set for it in a previous GC and it has died since then, if we allocate a new LOS object at the same address, we have a young object with the unlog bit set which is also wrong. We need to clear the unlog bit for dead LOS objects in the sweep here

@k-sareen
Copy link
Collaborator Author

@wks Looks like OpenJDK actually triggers the assertion 😆 . I think the assertion is sound so this is actually an OpenJDK bug. The assertion is not triggered for ART.

@@ -231,6 +231,7 @@ impl<S: BarrierSemantics> Barrier<S::VM> for ObjectBarrier<S> {
target: Option<ObjectReference>,
) {
if self.log_object(src) {
debug_assert!(src.is_live::<S::VM>(), "{} was logged but is not live", src);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_live doesn't make sense during mutator time. The liveness of an object is only determined after computing transitive closure. During mutator time, all objects that can be accessed are, by definition, live.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except the unlog bit should only be set for mature objects, so they have to be considered alive. Or at the very least have to be marked. I previously had is_marked there but objects in the bootimage may not actually have mark bits set so it was triggering the assertion incorrectly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought I actually agree with you and will remove the debug assert from the write barrier code. I think the debug assert inside the ProcessModBuf work packet is correct though

@k-sareen k-sareen force-pushed the fix/modbuf-global-gc-bug branch from 0282ba5 to 4a29049 Compare July 28, 2024 02:41
@k-sareen k-sareen changed the title Only reset unlog bits of objects in modbuf for nursery GCs or if in bootimage Only reset unlog bits of objects in modbuf for nursery GCs Jul 28, 2024
@k-sareen k-sareen requested a review from wks July 29, 2024 04:11
@k-sareen
Copy link
Collaborator Author

Just fixed it like we discussed in a previous meeting

@k-sareen k-sareen force-pushed the fix/modbuf-global-gc-bug branch from e26f37c to 0753a49 Compare July 29, 2024 05:16
@@ -275,6 +275,17 @@ impl<VM: VMBinding> VMSpace<VM> {
);
debug_assert!(self.in_space(object));
if self.mark_state.test_and_mark::<VM>(object) {
// Flip the per-object unlogged bits to "unlogged" state for objects inside the
// bootimage
#[cfg(feature = "set_unlog_bits_vm_space")]
Copy link
Collaborator Author

@k-sareen k-sareen Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can remove the conditional compile feature here since it is likely objects in the VM space will want to have the unlog bit set and the feature was more for initialization

@k-sareen
Copy link
Collaborator Author

@wks Can you re-review this?

@wks
Copy link
Collaborator

wks commented Aug 23, 2024

If we reset the unlog bits
in a full-heap GC, we may end up setting the unlog bit for an object
that actually died which may cause issues for future allocations.

I assume the "issues for future allocations" means newly allocated (young) objects may have unlog bit set, and may be logged when written into, and we will retain more objects than necessary.

But I don't think it is sufficient to ensure this by skipping setting unlog bits in mature GCs. There may be dead mature objects that are not written to (i.e. unlogged). When those objects die, their unlog bits will remain set. If future allocations reuse the memory space of those objects, some young objects will also have unlog bits set.

If you want to ensure nursery objects never get logged, we may need to (bulk) clear unlog bits for reclaimed regions, like how we clear VO bits. I think this only affects StickyImmix because we never set the unlog bits for objects in the GenCopy or GenImmix nursery in the first place.

@k-sareen
Copy link
Collaborator Author

Hence why we do a bulk clear?

@k-sareen
Copy link
Collaborator Author

Oh. How odd. This PR doesn't do the bulk clear. I must have messed up when cherry-picking from my branch

@wks
Copy link
Collaborator

wks commented Aug 23, 2024

Instead of bulk clearing, it is also possible to copy the mark bits over to the unlog bits (and remove ImmixSpaceArgs::unlog_object_when_traced and ImmixSpaceArgs::reset_log_bit_in_major_gc). I think that's a bit like VO bits where there are two different strategies, and we found that it is faster to just copy over mark bits. We may refactor the code to unify the handling of unlog bits and VO bits.

p.s. I think the so-called "unlog bit" should really be named "is candidate for logging". Obviously nursery objects are not such candidates.

@k-sareen
Copy link
Collaborator Author

What that suggests is we need to be more systematic with how we deal with stale metadata values

This fixes a bug wherein we were erroneously resetting the unlog bits
for objects in the modbuf for full-heap GCs. If we reset the unlog bits
in a full-heap GC, we may end up setting the unlog bit for an object
that actually died which may cause issues for future allocations.

We now set the unlog bits for mature objects when tracing for VM
space (if compiled with `set_unlog_bits_vm_space`), immortal space, and
large object space.

This PR also adds debug assertions checking that any object that has
been added to the modbuf is considered "live" by MMTk, or in other
words, it is a mature object.
@k-sareen k-sareen force-pushed the fix/modbuf-global-gc-bug branch from 5e37a4b to 2b32380 Compare November 29, 2024 00:43
@k-sareen
Copy link
Collaborator Author

k-sareen commented Nov 29, 2024

@wks What needs to be done for this PR to get it to merge? I think it's very important to merge bug fixes ASAP since it leads to duplication of debugging efforts otherwise. This is a very very subtle bug.

EDIT: I can remove the object.is_live() debug assertions if you want, although I still think they're right.

@qinsoon
Copy link
Member

qinsoon commented Nov 29, 2024

If I understand correct, the issue is that for dead objects, we do not clear their unlog bits, and new objects allocated will have random unlog bits. The obvious fix would be to just clear the unlog bits before new allocation.

The PR tries to maintain the unlog bits before the objects become dead: if an object dies, its unlog bit is 0, and if an object lives, its unlog bit is 1. This doesn't sound like a straight solution to the problem. What is the reason for using this solution rather than just clearing the unlog bits?

@k-sareen
Copy link
Collaborator Author

k-sareen commented Nov 29, 2024

The PR tries to maintain the unlog bits before the objects become dead: if an object dies, its unlog bit is 0, and if an object lives, its unlog bit is 1. This doesn't sound like a straight solution to the problem. What is the reason for using this solution rather than just clearing the unlog bits?

Fast-path allocation.

EDIT: It seems like I haven't cherry-picked a further change from here: k-sareen@8f6fcd8

The above commit bulk clears the unlog bits for a global GC and then sets them again while tracing.

@k-sareen
Copy link
Collaborator Author

k-sareen commented Nov 29, 2024

Really what we want is a meta-module that takes care of all relevant metadata when an object dies/a space is released/a TLAB is allocated. What we have currently is a hack since each metadata (like VO-bit or side forwarding bits) has been special cased. But I feel like that is out of scope of this PR.

We now:
 - Clear the unlog bits for dead LOS objects
 - Bulk clear the unlog bits for MarkSweepSpace for full heap GC
 - Set the unlog bits for LOS and MarkSweepSpace objects during trace
   object

This is because we want to ensure the unlog bits are only set for
objects that are alive. We do not reset the unlogs bits inside the
`ProcessModBuf` work packet because we do not know what objects are
alive at that moment in time.
if mmtk
.get_plan()
.generational()
.unwrap()
.is_current_gc_nursery()
{
// Flip the per-object unlogged bits to "unlogged" state.
for obj in &self.modbuf {
debug_assert!((*obj).is_live(), "{} was logged but is not live", *obj);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion fails for GenCopy in the first nursery GC after the first full-heap GC. It is because it is in the from space of the semispace mature space, and is_live() returns false for all objects in the from space.

@wks
Copy link
Collaborator

wks commented Nov 29, 2024

@wks What needs to be done for this PR to get it to merge? I think it's very important to merge bug fixes ASAP since it leads to duplication of debugging efforts otherwise. This is a very very subtle bug.

You still need to implement bulk clearing of unlog bits for ImmixSpace. It is sufficient to do the clearing in Block::sweep in the two places where vo_bit::helper::on_region_swept is called. You only need to do this for StickyImmix because we only allocate nursery objects in ImmixSpace in StickyImmix. Unlike VO bits, stale unlog bits are benign in places where nursery objects cannot be allocated. That includes non-empty lines (or blocks if BLOCK_ONLY). By doing this, we maintain an invariant that newly allocated young objects must not have unlog bits set.

And I suggest adding a debug assertion that young objects cannot have unlog bits set. We can do this while tracing or while allocating. Possible places include CommonGenPlan::trace_object_nursery and StickyImmix::trace_object_nursery.

EDIT: I can remove the object.is_live() debug assertions if you want, although I still think they're right.

Probably remove it for now because it fails in GenCopy. We can introduce a similar assertion later.

I think the actual assertion you want to make is that the object is a mature object. For StickyImmix, we use the mark bit to distinguish between old and young objects. So is_live happens to work. But for GenCopy, it is different.

@k-sareen
Copy link
Collaborator Author

k-sareen commented Dec 4, 2024

I think the actual assertion you want to make is that the object is a mature object. For StickyImmix, we use the mark bit to distinguish between old and young objects. So is_live happens to work. But for GenCopy, it is different.

Yes exactly. My best approximation was is_live. Maybe negation of is_object_in_nursery is better?

@wks
Copy link
Collaborator

wks commented Dec 4, 2024

I think the actual assertion you want to make is that the object is a mature object. For StickyImmix, we use the mark bit to distinguish between old and young objects. So is_live happens to work. But for GenCopy, it is different.

Yes exactly. My best approximation was is_live. Maybe negation of is_object_in_nursery is better?

Yes. It should work in general. It is implemented as self.immix.immix_space.in_space(object) && !self.immix.immix_space.is_marked(object) for StickyImmix. But for other generational plans, it is just self.gen.nursery.in_space(object).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants