Expose alloc_slow. Add a section in user guide about allocation optimization (#967)

This PR exposes `alloc_slow()` to the bindings, adds a few public methods that allow bindings to implement allocation efficiently without duplicating mmtk-core code, and adds a section in the user guide to discuss allocation optimization.

The changes in this PR include:
1. Expose `alloc_slow()` in `memory_manager`.
2. Add `Mutator::allocator()` to allow bindings to get a specific allocator from an allocator selector. Add `Mutator::allocator_impl()` to allow bindings to get a typed allocator from a selector.
3. Add `Mutator::get_allocator_base_offset()` to allow bindings to use a specific allocator without a selector (for performance).
4. Add a section in the user guide about allocation optimization. Remove some unused `SUMMARY.md` files in the user guide.
5. Add `Address::as_mut_ref()`.
6. Expose the field for the fastpath bump pointer in some allocators.

Related discussion on Zulip: https://mmtk.zulipchat.com/#narrow/stream/262679-General/topic/Refilling.20BumpPointer.20using.20AllocatorInfo/near/394142997
mmtk · Oct 11, 2023 · 0328b05
1 parent 9a676e6
commit 0328b05
Showing 15 changed files with 534 additions and 46 deletions.
diff --git a/docs/userguide/src/SUMMARY.md b/docs/userguide/src/SUMMARY.md
@@ -31,6 +31,9 @@
     - [How to Undertake a Port](portingguide/howto/prefix.md)
         - [NoGC](portingguide/howto/nogc.md)
         - [Next Steps](portingguide/howto/next_steps.md)
+    - [Performance Tuning](portingguide/perf_tuning/prefix.md)
+        - [Link Time Optimization](portingguide/perf_tuning/lto.md)
+        - [Optimizing Allocation](portingguide/perf_tuning/alloc.md)
 
 -----------
 

diff --git a/docs/userguide/src/portingguide/SUMMARY.md b/docs/userguide/src/portingguide/SUMMARY.md
diff --git a/docs/userguide/src/portingguide/perf_tuning/alloc.md b/docs/userguide/src/portingguide/perf_tuning/alloc.md
@@ -0,0 +1,120 @@
+# Optimizing Allocation
+
+MMTk provides [`alloc()`](https://docs.mmtk.io/api/mmtk/memory_manager/fn.alloc.html)
+and [`post_alloc()`](https://docs.mmtk.io/api/mmtk/memory_manager/fn.post_alloc.html) to allocate a piece of memory and
+finalize the memory as an object. Calling them is sufficient for a functional implementation, and we recommend doing
+so in the early development of an MMTk integration. However, because allocation is performance critical, runtimes generally
+optimize allocation to be as fast as possible, at which point simply invoking `alloc()` and `post_alloc()` becomes inadequate.
+
+The following discusses a few design decisions and optimizations related to allocation. The discussion mainly focuses on `alloc()`.
+`post_alloc()` works in a similar way, and the discussion can also be applied to `post_alloc()`.
+For concrete examples, refer to the allocation implementation in any of our supported bindings.
+
+Note that some of the optimizations make assumptions about MMTk's internal implementation and may make the code less maintainable.
+We recommend adding assertions in the binding code to make sure the assumptions are not broken across versions.
+
+## Efficient access to MMTk mutators
+
+An MMTk mutator context (created by [`bind_mutator()`](https://docs.mmtk.io/api/mmtk/memory_manager/fn.bind_mutator.html)) is a thread local data structure
+of type [`Mutator`](https://docs.mmtk.io/api/mmtk/plan/struct.Mutator.html).
+MMTk expects the binding to provide efficient access to the mutator structure in their thread local storage (TLS).
+Usually one of the following approaches is used to store MMTk mutators.
+
+### Option 1: Storing the pointer
+
+The `Box<Mutator<VM>>` returned from `mmtk::memory_manager::bind_mutator` is actually a pointer to
+a `Mutator<VM>` instance allocated in the Rust heap. It is simple to store it in the TLS.
+This approach does not make any assumption about the internals of an MMTk `Mutator`. However, it requires an extra pointer dereference
+when accessing a value in the mutator. This may not sound that bad, but it degrades the performance of
+a carefully implemented inlined fastpath allocation sequence, which is normally just a few instructions.
+This approach can be a simple start in early development, but we do not recommend it for an efficient implementation.
+
+If the VM is not implemented in Rust,
+the binding needs to turn the boxed pointer into a raw pointer before storing it.
+
+```rust
+{{#include ../../../../../vmbindings/dummyvm/src/tests/doc_mutator_storage.rs:mutator_storage_boxed_pointer}}
+```
+
+### Option 2: Embed the `Mutator` struct
+
+To remove the extra pointer dereference, the binding can embed the `Mutator` type directly into their TLS type.
+
+If the implementation language is not Rust, the developer needs to create a type that has the same layout as `Mutator`. It is recommended to
+have an assertion to ensure that the native type has the exact same layout as the Rust type `Mutator`.
+
+```rust
+{{#include ../../../../../vmbindings/dummyvm/src/tests/doc_mutator_storage.rs:mutator_storage_embed_mutator_struct}}
+```
+
+### Option 3: Embed the fastpath struct
+
+The size of `Mutator` is a few hundred bytes, which could be considered too large for TLS in some language implementations.
+Embedding `Mutator` also requires duplicating the `Mutator` struct as a native type if the implementation language is not Rust.
+When embedding the `Mutator` type is undesirable, one can choose to embed only the fastpath struct that is in use.
+
+Unlike the `Mutator` type, the fastpath struct has a C-compatible layout, and it is simple and primitive enough
+so it is unlikely to change. For example, MMTk provides [`BumpPointer`](https://docs.mmtk.io/api/mmtk/util/alloc/struct.BumpPointer.html),
+which simply includes a `cursor` and a `limit`.
+
+In the following example, we embed one `BumpPointer` struct in the TLS.
+The `BumpPointer` is used in the fast path, and carefully synchronized with the allocator in the `Mutator` struct in the slow path.
+Note that the `allocate_default` closure in the example below assumes the allocation semantics is `AllocationSemantics::Default`
+and its selected allocator uses bump-pointer allocation.
+Real-world fast-path implementations for high-performance VMs are usually JIT-compiled, inlined, and specialized for the current plan
+and allocation site, so that the allocation semantics of the concrete allocation site (and therefore the selected allocator) is known to the JIT compiler.
+
+For the sake of simplicity, we only store _one_ `BumpPointer` in the TLS in the example.
+In MMTk, each plan has multiple allocators, and the allocation semantics are mapped
+to those allocators by the GC plan you choose. So a plan uses multiple allocators, and
+depending on how many allocation semantics are used by a binding, the binding may use multiple allocators as well.
+In practice, a binding may embed multiple fastpath structs, one per allocator as in the example, if it wants
+more efficient allocation.
+
+Also for simplicity, the example assumes the default allocator for the plan in use is a bump pointer allocator.
+Many plans in MMTk use a bump pointer allocator for their default allocation semantics (`AllocationSemantics::Default`),
+including (but not limited to) `NoGC`, `SemiSpace`, `Immix`, and the generational plans.
+If a plan does not do bump-pointer allocation, we may still implement fast paths, but we need to embed different data structures instead of `BumpPointer`.
+
+```rust
+{{#include ../../../../../vmbindings/dummyvm/src/tests/doc_mutator_storage.rs:mutator_storage_embed_fastpath_struct}}
+```
+
+## Avoid resolving the allocator at run time
+
+For a simple and general API of `alloc()`, MMTk requires `AllocationSemantics` as an argument in an allocation request, and resolves it at run-time.
+The following is roughly what `alloc()` does internally.
+
+1. Resolving the allocator
+    1. Find the `Allocator` for the required `AllocationSemantics`. It is defined by the plan in use.
+    2. Dynamically dispatch the call to [`Allocator::alloc()`](https://docs.mmtk.io/api/mmtk/util/alloc/trait.Allocator.html#tymethod.alloc).
+2. `Allocator::alloc()` executes the allocation fast path.
+3. If the fastpath fails, it executes the allocation slow path [`Allocator::alloc_slow()`](https://docs.mmtk.io/api/mmtk/util/alloc/trait.Allocator.html#method.alloc_slow).
+4. The slow path will further attempt to allocate memory, and may trigger a GC.
+
+Resolving to a specific allocator and doing dynamic dispatch is expensive for an allocation.
+With build-time or JIT-time knowledge about the object that will be allocated, an MMTk binding can possibly skip the first step at run time.
+
+If you implement an efficient fastpath allocation on the binding side (like Option 3 above, or by generating allocation code in a JIT, which is discussed next),
+that naturally avoids this problem. If you do not want to implement the fastpath allocation, the following is another example of how to avoid resolving the allocator.
+
+Once MMTk is initialized, a binding can get the memory offset for the default allocator, and save it somewhere. When we know an object should be allocated
+with the default allocation semantics, we can use the offset to get a reference to the actual allocator (with unsafe code), and allocate with the allocator.
+
+```rust
+{{#include ../../../../../vmbindings/dummyvm/src/tests/doc_avoid_resolving_allocator.rs:avoid_resolving_allocator}}
+```
+
+## Emitting Allocation Sequence in a JIT Compiler
+
+If the language has a JIT compiler, it is generally desirable to generate the code sequence for the allocation fast path, rather
+than simply emitting a call instruction to the allocation function. The optimizations discussed above are relevant as well: (1)
+the compiler needs to be able to access the mutator, and (2) the compiler needs to be able to resolve to a specific allocator at
+JIT time. The actual implementation depends heavily on the compiler implementation.
+
+The following are some examples from our bindings (at the time of writing):
+* OpenJDK:
+  * <https://github.com/mmtk/mmtk-openjdk/blob/9ab13ae3ac9c68c5f694cdd527a63ca909e27b15/openjdk/mmtkBarrierSetAssembler_x86.cpp#L38>
+  * <https://github.com/mmtk/mmtk-openjdk/blob/9ab13ae3ac9c68c5f694cdd527a63ca909e27b15/openjdk/mmtkBarrierSetC2.cpp#L45>
+* JikesRVM: <https://github.com/mmtk/mmtk-jikesrvm/blob/fbfb91adafd9e9b3f45bd6a4b32c845a5d48d20b/jikesrvm/rvm/src/org/jikesrvm/mm/mminterface/MMTkMutatorContext.java#L377>
+* Julia: <https://github.com/mmtk/julia/blob/5c406d9bb20d76e2298a6101f171cfac491f651c/src/llvm-final-gc-lowering.cpp#L267>
diff --git a/docs/userguide/src/portingguide/perf_tuning/prefix.md b/docs/userguide/src/portingguide/perf_tuning/prefix.md
@@ -0,0 +1,5 @@
+# Performance Tuning for Bindings
+
+In this section, we discuss how to achieve the best performance with MMTk in a binding implementation.
+MMTk is a high performance GC library, but a binding needs to get a few key points right
+to achieve optimal performance.
diff --git a/docs/userguide/src/tutorial/SUMMARY.md b/docs/userguide/src/tutorial/SUMMARY.md
diff --git a/src/memory_manager.rs b/src/memory_manager.rs
@@ -175,6 +175,27 @@ pub fn alloc<VM: VMBinding>(
     mutator.alloc(size, align, offset, semantics)
 }
 
+/// Invoke the allocation slow path. This is only intended for use when a binding implements the fastpath on
+/// the binding side. When the binding handles fast path allocation and the fast path fails, it can use this
+/// method for slow path allocation. Calling it before exhausting the fast path allocation buffer will lead to bad
+/// performance.
+///
+/// Arguments:
+/// * `mutator`: The mutator to perform this allocation request.
+/// * `size`: The number of bytes required for the object.
+/// * `align`: Required alignment for the object.
+/// * `offset`: Offset associated with the alignment.
+/// * `semantics`: The allocation semantic required for the allocation.
+pub fn alloc_slow<VM: VMBinding>(
+    mutator: &mut Mutator<VM>,
+    size: usize,
+    align: usize,
+    offset: usize,
+    semantics: AllocationSemantics,
+) -> Address {
+    mutator.alloc_slow(size, align, offset, semantics)
+}
+
 /// Perform post-allocation actions, usually initializing object metadata. For many allocators none are
 /// required. For performance reasons, a VM should implement the post alloc fast-path on their side
 /// rather than just calling this function.
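
The fast-path/slow-path split described in the `alloc_slow()` doc comment can be sketched without mmtk-core. The following is a minimal, self-contained simulation and not the binding code from this commit: `BumpPointer` here is a local stand-in with a `cursor` and a `limit`, and `Tls`, `CHUNK_SIZE`, and the mock slow path are hypothetical names introduced only for illustration. Addresses are plain integers. In a real binding, the slow path would presumably call `memory_manager::alloc_slow()` and then re-synchronize the local cursor and limit with the allocator in the `Mutator`.

```rust
// Illustrative stand-ins only: none of these are mmtk-core types.

/// Hypothetical fast-path struct: a bump cursor and the end of the current buffer.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BumpPointer {
    pub cursor: usize,
    pub limit: usize,
}

/// Size of the buffer handed out by the mock slow path (an assumption for this sketch).
pub const CHUNK_SIZE: usize = 4096;

/// Hypothetical thread-local state with the fast-path struct embedded directly.
pub struct Tls {
    pub bump: BumpPointer,
    /// Counts how often the fast path had to give up.
    pub slow_path_calls: usize,
    /// Start address of the next chunk the mock slow path will hand out.
    pub next_chunk: usize,
}

impl Tls {
    pub fn new(heap_start: usize) -> Self {
        Tls {
            // An empty buffer forces the first allocation into the slow path.
            bump: BumpPointer { cursor: 0, limit: 0 },
            slow_path_calls: 0,
            next_chunk: heap_start,
        }
    }

    /// Fast path: align the cursor, bump it, and fall back only when the buffer is exhausted.
    #[inline(always)]
    pub fn alloc(&mut self, size: usize, align: usize) -> usize {
        debug_assert!(align.is_power_of_two());
        let start = (self.bump.cursor + align - 1) & !(align - 1);
        let end = start + size;
        if end > self.bump.limit {
            return self.alloc_slow(size, align);
        }
        self.bump.cursor = end;
        start
    }

    /// Mock slow path: take a fresh chunk and reset the local bump pointer to it.
    #[inline(never)]
    fn alloc_slow(&mut self, size: usize, align: usize) -> usize {
        self.slow_path_calls += 1;
        assert!(size + align <= CHUNK_SIZE, "request too large for this sketch");
        let chunk = self.next_chunk;
        self.next_chunk += CHUNK_SIZE;
        let start = (chunk + align - 1) & !(align - 1);
        self.bump = BumpPointer {
            cursor: start + size,
            limit: chunk + CHUNK_SIZE,
        };
        start
    }
}
```

Starting from `Tls::new(0x10000)`, the first `alloc(16, 8)` takes the slow path once and later small requests are served by the bump alone, which is the behaviour the doc comment asks for: the slow path should only run after the fast-path buffer is exhausted.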

diff --git a/src/plan/mutator_context.rs b/src/plan/mutator_context.rs
@@ -5,6 +5,7 @@ use crate::plan::global::Plan;
 use crate::plan::AllocationSemantics;
 use crate::policy::space::Space;
 use crate::util::alloc::allocators::{AllocatorSelector, Allocators};
+use crate::util::alloc::Allocator;
 use crate::util::{Address, ObjectReference};
 use crate::util::{VMMutatorThread, VMWorkerThread};
 use crate::vm::VMBinding;
@@ -118,6 +119,20 @@ impl<VM: VMBinding> MutatorContext<VM> for Mutator<VM> {
         .alloc(size, align, offset)
     }
 
+    fn alloc_slow(
+        &mut self,
+        size: usize,
+        align: usize,
+        offset: usize,
+        allocator: AllocationSemantics,
+    ) -> Address {
+        unsafe {
+            self.allocators
+                .get_allocator_mut(self.config.allocator_mapping[allocator])
+        }
+        .alloc_slow(size, align, offset)
+    }
+
     // Note that this method is slow, and we expect VM bindings that care about performance to implement allocation fastpath sequence in their bindings.
     fn post_alloc(
         &mut self,
@@ -169,6 +184,80 @@ impl<VM: VMBinding> Mutator<VM> {
             unsafe { self.allocators.get_allocator_mut(selector) }.on_mutator_destroy();
         }
     }
+
+    /// Get the allocator for the selector.
+    ///
+    /// # Safety
+    /// The selector needs to be valid and point to an allocator that has been initialized.
+    /// [`crate::memory_manager::get_allocator_mapping`] can be used to get a selector.
+    pub unsafe fn allocator(&self, selector: AllocatorSelector) -> &dyn Allocator<VM> {
+        self.allocators.get_allocator(selector)
+    }
+
+    /// Get the mutable allocator for the selector.
+    ///
+    /// # Safety
+    /// The selector needs to be valid and point to an allocator that has been initialized.
+    /// [`crate::memory_manager::get_allocator_mapping`] can be used to get a selector.
+    pub unsafe fn allocator_mut(&mut self, selector: AllocatorSelector) -> &mut dyn Allocator<VM> {
+        self.allocators.get_allocator_mut(selector)
+    }
+
+    /// Get the allocator of a concrete type for the selector.
+    ///
+    /// # Safety
+    /// The selector needs to be valid and point to an allocator that has been initialized.
+    /// [`crate::memory_manager::get_allocator_mapping`] can be used to get a selector.
+    pub unsafe fn allocator_impl<T: Allocator<VM>>(&self, selector: AllocatorSelector) -> &T {
+        self.allocators.get_typed_allocator(selector)
+    }
+
+    /// Get the mutable allocator of a concrete type for the selector.
+    ///
+    /// # Safety
+    /// The selector needs to be valid and point to an allocator that has been initialized.
+    /// [`crate::memory_manager::get_allocator_mapping`] can be used to get a selector.
+    pub unsafe fn allocator_impl_mut<T: Allocator<VM>>(
+        &mut self,
+        selector: AllocatorSelector,
+    ) -> &mut T {
+        self.allocators.get_typed_allocator_mut(selector)
+    }
+
+    /// Return the base offset from a mutator pointer to the allocator specified by the selector.
+    pub fn get_allocator_base_offset(selector: AllocatorSelector) -> usize {
+        use crate::util::alloc::*;
+        use memoffset::offset_of;
+        use std::mem::size_of;
+        offset_of!(Mutator<VM>, allocators)
+            + match selector {
+                AllocatorSelector::BumpPointer(index) => {
+                    offset_of!(Allocators<VM>, bump_pointer)
+                        + size_of::<BumpAllocator<VM>>() * index as usize
+                }
+                AllocatorSelector::FreeList(index) => {
+                    offset_of!(Allocators<VM>, free_list)
+                        + size_of::<FreeListAllocator<VM>>() * index as usize
+                }
+                AllocatorSelector::Immix(index) => {
+                    offset_of!(Allocators<VM>, immix)
+                        + size_of::<ImmixAllocator<VM>>() * index as usize
+                }
+                AllocatorSelector::LargeObject(index) => {
+                    offset_of!(Allocators<VM>, large_object)
+                        + size_of::<LargeObjectAllocator<VM>>() * index as usize
+                }
+                AllocatorSelector::Malloc(index) => {
+                    offset_of!(Allocators<VM>, malloc)
+                        + size_of::<MallocAllocator<VM>>() * index as usize
+                }
+                AllocatorSelector::MarkCompact(index) => {
+                    offset_of!(Allocators<VM>, markcompact)
+                        + size_of::<MarkCompactAllocator<VM>>() * index as usize
+                }
+                AllocatorSelector::None => panic!("Expect a valid AllocatorSelector, found None"),
+            }
+    }
 }
 
 /// Each GC plan should provide their implementation of a MutatorContext. *Note that this trait is no longer needed as we removed
@@ -186,6 +275,13 @@ pub trait MutatorContext<VM: VMBinding>: Send + 'static {
         offset: usize,
         allocator: AllocationSemantics,
     ) -> Address;
+    fn alloc_slow(
+        &mut self,
+        size: usize,
+        align: usize,
+        offset: usize,
+        allocator: AllocationSemantics,
+    ) -> Address;
     fn post_alloc(&mut self, refer: ObjectReference, bytes: usize, allocator: AllocationSemantics);
     fn flush_remembered_sets(&mut self) {
         self.barrier().flush();
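
The arithmetic in `get_allocator_base_offset()` is: the offset of the `allocators` field, plus the offset of the per-kind array, plus the index times the element size. The sketch below reproduces that arithmetic on toy types so it can be run in isolation. `ToyMutator`, `ToyAllocators`, `ToyAllocator`, `ToySelector`, and `allocator_at` are hypothetical names and not part of mmtk-core, and the sketch uses `std::mem::offset_of!` (assuming Rust 1.77 or later) where the diff uses the `memoffset` crate.

```rust
use std::mem::{offset_of, size_of};

// Toy stand-ins: these only mimic the nesting of a mutator that holds
// fixed-size arrays of allocators. They are not mmtk-core types.
#[repr(C)]
pub struct ToyAllocator {
    pub cursor: usize,
    pub limit: usize,
    pub id: u32,
}

#[repr(C)]
pub struct ToyAllocators {
    pub bump: [ToyAllocator; 2],
    pub large: [ToyAllocator; 1],
}

#[repr(C)]
pub struct ToyMutator {
    pub thread_id: usize,
    pub allocators: ToyAllocators,
}

/// Hypothetical selector: an allocator kind plus an index into that kind's array.
pub enum ToySelector {
    Bump(u8),
    Large(u8),
}

/// Byte offset from the start of a `ToyMutator` to the selected allocator.
pub fn base_offset(selector: ToySelector) -> usize {
    offset_of!(ToyMutator, allocators)
        + match selector {
            ToySelector::Bump(index) => {
                offset_of!(ToyAllocators, bump) + size_of::<ToyAllocator>() * index as usize
            }
            ToySelector::Large(index) => {
                offset_of!(ToyAllocators, large) + size_of::<ToyAllocator>() * index as usize
            }
        }
}

/// Resolve an allocator from a previously saved offset, skipping the selector match.
///
/// # Safety
/// `offset` must come from `base_offset` with an index that is in bounds for its array.
pub unsafe fn allocator_at(mutator: &ToyMutator, offset: usize) -> &ToyAllocator {
    let base = mutator as *const ToyMutator as *const u8;
    unsafe { &*(base.add(offset) as *const ToyAllocator) }
}
```

A caller would compute the offset once, for example `base_offset(ToySelector::Bump(1))`, save it, and use `allocator_at` on the hot path. The trade-off is that the index is no longer bounds-checked, which is why the lookup is marked `unsafe` in this sketch.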

diff --git a/src/util/address.rs b/src/util/address.rs
@@ -300,6 +300,14 @@ impl Address {
         &*self.to_mut_ptr()
     }
 
+    /// converts the Address to a mutable Rust reference
+    ///
+    /// # Safety
+    /// The caller must guarantee the address actually points to a Rust object.
+    pub unsafe fn as_mut_ref<'a, T>(self) -> &'a mut T {
+        &mut *self.to_mut_ptr()
+    }
+
     /// converts the Address to a pointer-sized integer
     pub const fn as_usize(self) -> usize {
         self.0