Skip to content

Latest commit

 

History

History
806 lines (706 loc) · 27.6 KB

05-memory-returns.org

File metadata and controls

806 lines (706 loc) · 27.6 KB

Memory returns

In this section we’re going to go back and improve memory management. The current frame allocator has no way to return frames, so eventually all the frames will be used. We need to be able to free frames when programs exit, to make them available to other programs. We’ll look at some optimisation strategies, and how to use the RDTSC instruction to perform timing.

Frame allocation

The current frame allocator is a “watermark allocator” (described here) that just allocates frames but never frees them. Now we’re going to implement an allocator which can free frames too.

There are many variations on page frame allocation, with trade-offs in performance so that the best solution depends on the workload. Some desirable features of an allocator appear to be:

  • Fast allocation of single frames, as programs’ memory use gradually grows.
  • Fast deallocation of lots of frames, mostly not contiguous, e.g when programs exit.
  • There are ISA / DMA applications which need memory in contiguous blocks, and may have other constraints like not crossing 64k boundaries.
  • A nice to have is the ability to allocate huge pages (e.g 2Mb pages), though fragmentation makes this increasingly difficult as a system runs.

For now we will try a multi-level bitmap approach, keeping track of frame availability in a tree structure.

  • A level-1 bitmap with one bit per frame, in 32-bit chunks
  • A level 2 bitmap with one bit per 32 bits of level-1 bitmap. A ‘1’ bit means one or more frames are available in the corresponding level-1 chunk.

If we use 32 bits for the level-2 bitmap, then we can keep track of 32 x 32 = 1024 frames, or 4Mb. Later we’ll generalise this to more than two levels, and the number of levels needed will increase as the log (base 32) of the total number of frames. The process of finding an available frame in a 3-level bitmap is illustrated in figure fig-multi_level_bitmap.

./img/multi_level_bitmap.svg

Finding an available frame means quickly finding a ‘1’ bit somewhere in a 32-bit chunk of the bitmap. To scan bitmaps efficiently we can use the Bit Scan Forward instruction, which on modern hardware finds the first non-zero bit in a 16, 32 or 64-bit register in a constant (small) number of clock cycles. Let’s wrap this into a function:

fn nonzero_bit_index(bitmap: u32) -> u32 {
    let index: u32;
    unsafe {
        asm!("bsf eax, ecx",
             in("ecx") bitmap,
             lateout("eax") index,
             options(pure, nomem, nostack));
    }
    index
}

To find available frames we need the virtual addresses of the level-2 and level-1 bitmaps; To translate the frame number into an address we need the physical address of the first frame:

pub struct MultilevelBitmapFrameAllocator {
    level_2_virt_addr: VirtAddr,
    level_1_virt_addr: VirtAddr,
    frame_phys_addr: PhysAddr,
}

To initialise the new allocator we can adapt the code from the current frame allocator, which first finds a usable region, its start and end address:

impl MultilevelBitmapFrameAllocator {
    pub unsafe fn init(memory_map: &'static MemoryMap,
                       physical_memory_offset: VirtAddr) -> Self {
      let mut usable_regions = memory_map
          .iter()
          .filter(|r| r.region_type == MemoryRegionType::Usable);

      _ = usable_regions.next(); // Skip first region
      let region = usable_regions.next().unwrap(); // Second region
      let start_addr = region.range.start_addr();
      let end_addr = region.range.end_addr();
      ...

Note that the way that the memory is laid out in my tests means that the majority of the available memory is in a second region.

Next we can create some pointers to bitmaps that we’re going to store in the first frame:

let level_2_virt_addr = physical_memory_offset + start_addr;
let level_1_virt_addr = level_2_virt_addr + 4u64;

let level_2_ptr = level_2_virt_addr.as_mut_ptr() as *mut u32;
let level_1_ptr = level_1_virt_addr.as_mut_ptr() as *mut u32;

Now we set the bitmap values. All frames are available except the first one, which we’re using to store the bitmaps. The 32 bits of level 2 should therefore all be 1:

*level_2_ptr = 0xFFFF_FFFF;

and the first level-1 bitmap chunk should have just the first bit cleared:

*level_1_ptr = 0xFFFF_FFFE;

All the other level-1 bitmaps should be marked available:

for i in 1..32 {
    *(level_1_ptr.offset(i)) = 0xFFFF_FFFF;
}

Finally we store values in the struct:

MultilevelBitmapFrameAllocator {
    level_2_virt_addr,
    level_1_virt_addr,
    frame_phys_addr: PhysAddr::new(start_addr),
}

To find a frame we can define a method fetch_frame:

impl MultilevelBitmapFrameAllocator {
    fn fetch_frame(&mut self) -> u64 {
      ...
    }
}

In this function we first get the level 2 bitmap, and find a non-zero bit:

let l2_ptr = self.level_2_virt_addr.as_mut_ptr() as *mut u32;
let mut l2_bitmap = unsafe{*l2_ptr};
let l2_index = nonzero_bit_index(l2_bitmap);

and then use this index to get the level 1 bitmap and find a non-zero bit:

let l1_ptr = unsafe{(self.level_1_virt_addr.as_mut_ptr() as *mut u32)
                    .offset(l2_index as isize)};
let mut l1_bitmap = unsafe{*l1_ptr};
let l1_index = nonzero_bit_index(self.cache);

The frame number is a combination of these indices, giving the index of the non-zero bit:

let frame_number =
    (l2_index as u64) * 32u64
    + (l1_index as u64);

We then need to mark this frame as used, by modifying the level 1 bitmap:

l1_bitmap ^= 1 << l1_index;
unsafe{*l1_ptr = l1_bitmap;}

If this level 1 chunk is now empty, clear the bit in the level 2 bitmap:

if l1_bitmap == 0 {
    l2_bitmap ^= 1 << l2_index;
    unsafe{*l2_ptr = l2_bitmap;}
}

Optimisation: Timing

Allocating and freeing memory is going to happen quite frequently, so its one of the areas that are probably worth optimising. The problem is that the result is probably very dependent on the pattern of memory use. We can however try some things to see what happens, and learn how to time parts of the kernel.

An easy way to get a high resolution counter is to use the RDTSC instruction (ReaD Time Stamp Counter). This reads a Model Specific Register containing a count of the number of clock ticks since the CPU was reset. It’s a 64-bit counter, but the 32 high bits are put into EDX, and the low 32 bits into EAX registers. We can define a function to put these two pieces back together:

fn time_stamp_counter() -> u64 {
    let counter: u64;
    unsafe{
        asm!("rdtsc",
             "shl rdx, 32", // High bits in EDX
             "or rdx, rax", // Low bits in EAX
             out("rdx") counter,
             out("rax") _, // Clobbers RAX
             options(pure, nomem, nostack)
        );
    }
    counter
}

With this we can try timing allocating frames and then freeing them again:

let memory_info = unsafe {MEMORY_INFO.as_mut().unwrap()};
let mut alloc = &mut memory_info.frame_allocator;

const N: usize = 800;
let count1 = time_stamp_counter();

for i in 0..10 {
    // Allocate frames
    let frames = [0; N].map(|_| alloc.fetch_frame());
    // Free them all again
    for f in frames {
        alloc.return_frame(f);
    }
}
let count2 = time_stamp_counter();
println!("Clock ticks: {} M", (count2 - count1) / 1000000);

Allocating and freeing 800 frames 10 times over took between 56 and 62 million clock cycles, or around 7500 cycles for each frame allocation and free.

Optimisation: Bitmap caches and stacks

Once we find a level-1 bitmap chunk (32 bits) with some non-zero bits (available frames), it’s possible that there are more than one. Rather than have to look for it again, we can cache the chunk and start looking there next time.

Adding the cache and the index of the level-1 bitmap to the MultilevelBitmapFrameAllocator struct:

cache: u32,
cache_index: u64,

and initialising to zero in init:

cache: 0,
cache_index: 0

Allocating and freeing 800 frames 10 times over took between 50 and 58 million clock cycles, or about 7000 cycles for each frame allocation and free.

Optimisation 2: Frame stack

An alternative optimisation, which I think is probably better on balance, is to add a stack of frames as a cache. I think this is better than caching parts of the bitmap because the size of the stack can be changed independently of the bitmap layout, and it doesn’t store the same piece of information (bitmap chunk) in two places.

The idea is that when frames are free’d they can be put on a stack, so when a frame is requested it can just be taken directly from the stack. Hopefully this saves time looking for available frames, though it does add a little extra complexity. The only time we’ll now need to look for available frames in the bitmap is if the stack is empty. It therefore makes sense to make the stack at least as big as the number of bits in a bitmap chunk (32 frames): We can find a chunk with available frames, and move them all into the stack.

Our MultilevelBitmapFrameAllocator struct becomes:

pub struct MultilevelBitmapFrameAllocator {
    level_2_virt_addr: VirtAddr,
    level_1_virt_addr: VirtAddr,
    frame_phys_addr: PhysAddr,
    frame_stack: [u64; 32], // new
    frame_stack_number: usize, // new
}

which we initialise (at the end of init()) as

MultilevelBitmapFrameAllocator {
    level_2_virt_addr,
    level_1_virt_addr,
    frame_phys_addr: PhysAddr::new(start_addr),
    frame_stack: [0; 32],
    frame_stack_number: 0
}

When we fetch a frame we now check if the stack is empty, in which case we find a bitmap chunk with some non-zero entries and fill the stack with them. Since the stack is definitely not empty (unless we’re out of memory), take a frame from the stack and return it. Our fetch_frame function becomes:

fn fetch_frame(&mut self) -> Option<u64> {
    if self.frame_stack_number == 0 {
      // Find more frames
      // Put frames onto stack
    }
    self.frame_stack_number -= 1; // Take a frame
    Some(self.frame_stack[self.frame_stack_number])
}

Finding frames is the same as before: We get the level 2 bitmap, check that there are still available frames, and find the index of one of the non-empty level 1 chunks:

let l2_ptr = self.level_2_virt_addr.as_mut_ptr() as *mut u32;
let l2_bitmap = unsafe{*l2_ptr};
if l2_bitmap == 0 {
    return None; // Out of memory
}
let l2_index = nonzero_bit_index(l2_bitmap);

Then get the level 1 chunk:

let l1_ptr = unsafe{(self.level_1_virt_addr.as_mut_ptr() as *mut u32)
                    .offset(l2_index as isize)};
let mut l1_bitmap = unsafe{*l1_ptr};

Now rather than taking one frame, or putting this bitmap in cache, we find the available frames and put them all on the frame stack:

while l1_bitmap != 0 {
    let l1_index = nonzero_bit_index(l1_bitmap);
    let frame_number =
        (l2_index as u64) * 32u64 +
        (l1_index as u64);
    l1_bitmap ^= 1 << l1_index;
    self.frame_stack[self.frame_stack_number] = frame_number;
    self.frame_stack_number += 1;
}

Returning a frame now also has two possibilities: If the stack isn’t full then put the frame on the stack, otherwise put into the bitmap as before.

fn return_frame(&mut self, frame_number: u64) {
    if self.frame_stack_number < FRAME_ALLOCATOR_STACK_SIZE {
        self.frame_stack[self.frame_stack_number] = frame_number;
        self.frame_stack_number += 1;
        return;
    }
    // Put into bitmap
}

Allocating and freeing 800 frames 10 times over now took between 54 and 59 million clock cycles, or between 7000 and 7500 cycles for each frame allocation and free.

As a baseline for comparison, we can compare the time needed when all frames are on the stack and looking up bitmaps isn’t needed. Changing FRAME_ALLOCATOR_STACK_SIZE to the same number of frames that are being allocated and freed (800) gives a timing of about 6000-6500 cycles for each allocation and free.

Managing more memory

To manage more than 4Mb of memory we can keep adding more levels, so we end up with a tree-like structure. We can work out how many levels we need: If we have 32 (or fewer) pages then one level is enough; 32x32 (1024) pages fit in two levels, 32x32x32 (32768) fit in three, and so on. We can wrap this into a num_levels_needed function:

fn num_levels_needed(num_frames: u64) -> usize {
    let mut max_frames = 32;
    let mut levels = 1;

    while num_frames > max_frames {
        levels += 1;
        max_frames *= 32;
    }
    levels
}

The MultilevelBitmapFrameAllocator now contains an array of bitmap addresses and stores the number of levels used:

pub struct MultilevelBitmapFrameAllocator {
    bitmap_virt_addr: [VirtAddr; FRAME_ALLOCATOR_MAX_LEVELS], // new
    nlevels: usize, // new
    frame_phys_addr: PhysAddr,
    frame_stack: [u64; FRAME_ALLOCATOR_STACK_SIZE],
    frame_stack_number: usize,
}

The maximum number of frames is 32^levels, so 6 levels can keep track of 4Tb of memory in 4k frames:

const FRAME_ALLOCATOR_MAX_LEVELS: usize = 6;

Initialising

Fetching a frame

To get a frame from the allocator now looks like:

fn fetch_frame(&mut self) -> Option<u64> {
    if self.frame_stack_number == 0 {
        // Loop from highest to lowest level, following '1' bits

        // Remove all frames from chunk and put on stack

        // Loop from lowest to highest level, marking chunks as empty
    }
    // Take and return a frame from the stack
}

Returning a frame

Putting a frame back into the allocator looks like:

fn return_frame(&mut self, frame_number: u64) {
    if self.frame_stack_number < FRAME_ALLOCATOR_STACK_SIZE {
      // Put frame onto stack and return
    }

    let mut chunk_number = frame_number;
    for level in 0..self.nlevels {
        // Set the bit `chunk_number` to 1

        // If the chunk was not empty then we're done
        // If it was empty then we need to set a bit at the next level

        // Calculate chunk_number at next level
    }
}

Timing results

With 31589 frames (123 Mb) this frame allocator needs 3 levels, and takes about 8000-8500 cycles per allocation and free, compared to 6000-6500 when frames are on the stack.

Optimisation: Allocate-on-Write tables

The fastest frame allocation is one we don’t do, so one way to optimise memory allocation (and so thread startup etc) is to only allocate frames when they are actually needed.

Map multiple pages to one frame, and mark the unallocated entries read only. When a thread tries to write to one of those pages it will trigger a page fault. At that point a frame can be allocated and the thread resumed. This is used by Linux to implement copy-on-write e.g. during a fork(), to avoid copying pages it doesn’t need to.

When we create the stack for a new thread, we currently allocate 7 frames (28kb memory). We can reduce this to just 4kb without affecting the user process, by only allocating one frame and then allocating more when they’re needed. In memory.rs the allocate_user_stack function looks for an empty stack slot of 8 pages, leaves one of them empty as a guard page to catch stack overflows, and allocates frames for the other 7.

To test our stack allocation we can modify our user program, hello.rs, changing _start so that it tries to write a large array to the stack:

#[no_mangle]
pub unsafe extern "sysv64" fn _start() -> ! {
    let arr = [0; 1000];
    println!("{}", arr[10]);
    loop {}
}

(note that we have to use the array or the compiler removes it).

In memory.rs the allocate_user_stack function can now be changed to allocate only one frame:

if table[n * 8 + 1].is_unused() {
    let frame = memory_info.frame_allocator.allocate_frame()
        .ok_or("Failed to allocate frame")?;

    for j in 1..7 {
        // These pages are read-only
        let entry = &mut table[n * 8 + j];
        entry.set_addr(frame.start_address(),
                       PageTableFlags::PRESENT |
                             PageTableFlags::USER_ACCESSIBLE);
    }
    let entry = &mut table[n * 8 + 7];
    entry.set_addr(frame.start_address(),
                   PageTableFlags::PRESENT |
                   PageTableFlags::WRITABLE | // Note!
                   PageTableFlags::USER_ACCESSIBLE);
    ...
}

(Note that it is the highest memory page which is writable, because stacks start at the top and move downwards).

Now we get an output like:

Thread stack: 0x00028000013000 - 0x00028000018000 EXCEPTION: PAGE FAULT Accessed Address: VirtAddr(0x28000016ff8) Error Code: PROTECTION_VIOLATION | CAUSED_BY_WRITE | USER_MODE

which shows that an error occurred due to user mode code trying to write to address 0x28000016ff8 which is in the thread stack range but below the one page which has write permissions.

To fix this we can define a function in memory.rs to find the level 1 page table from an address:

fn active_level_1_table_containing(
    addr: VirtAddr
) -> &'static mut PageTable {
    let memory_info = unsafe {MEMORY_INFO.as_mut().unwrap()};
    let mut table = unsafe{&mut (*active_pagetable_ptr())};

    for index in [addr.p4_index(),
                  addr.p3_index(),
                  addr.p2_index()] {

        let entry = &mut table[index];
        table = unsafe {&mut *(memory_info.physical_memory_offset
                               + entry.addr().as_u64()).as_mut_ptr()};
    }
    table
}

and use it to add the missing frame:

pub fn allocate_missing_stack_frame(
    addr: VirtAddr
) -> Result<(), &'static str> {
    let memory_info = unsafe {MEMORY_INFO.as_mut().unwrap()};

    let table = active_level_1_table_containing(addr);
    let entry = &mut table[addr.p1_index()];
    if entry.flags() != (PageTableFlags::PRESENT |
                         PageTableFlags::USER_ACCESSIBLE) {
        return Err("Error: Unexpected table flags");
    }

    let frame = memory_info.frame_allocator.allocate_frame()
        .ok_or("Could not allocate frame")?;

    entry.set_addr(frame.start_address(),
                   PageTableFlags::PRESENT |
                   PageTableFlags::WRITABLE |
                   PageTableFlags::USER_ACCESSIBLE);
    Ok(())
}

and then in interrupt.rs the page fault handler can call this function:

use crate::memory;

extern "x86-interrupt" fn page_fault_handler(
    stack_frame: InterruptStackFrame,
    error_code: PageFaultErrorCode,
) {
    use x86_64::registers::control::Cr2;
    let accessed_virtaddr = Cr2::read();

    if error_code == (PageFaultErrorCode::PROTECTION_VIOLATION |
                      PageFaultErrorCode::CAUSED_BY_WRITE |
                      PageFaultErrorCode::USER_MODE) {
        if let Err(msg) = memory::allocate_missing_stack_frame(accessed_virtaddr) {
            println!("Page fault error: {}", msg);
            hlt_loop();
        }
    } else {
        println!("EXCEPTION: PAGE FAULT");
        println!("Accessed Address: {:?}", accessed_virtaddr);
        println!("Error Code: {:?}", error_code);
        println!("{:#?}", stack_frame);

        hlt_loop();
    }
}

If all goes well then when the page fault is triggered a new frame is allocated. When the page fault handler returns it jumps back to the instruction that caused the page fault. The instruction tries again, this time succeeding because the page table entry now points to a writable frame.

Freeing thread stacks

When a thread exits we need to be able to free its stack. In memory.rs define a new function, reusing the active_level_1_table_containing function:

pub fn free_user_stack(
    stack_end: VirtAddr
) -> Result<(), &'static str> {
    let addr = stack_end - 1u64; // Address in last page
    let table = active_level_1_table_containing(addr);

    let memory_info = unsafe {MEMORY_INFO.as_mut().unwrap()};

    let iend = usize::from(addr.p1_index());
    for index in ((iend - 6)..=iend).rev() {
        let entry = &mut table[index];

        // Only writable pages have unique frames
        if entry.flags().contains(PageTableFlags::WRITABLE) {
            // Free this frame
            memory_info.frame_allocator.deallocate_frame(
                entry.frame().unwrap());
        }
        entry.set_flags(PageTableFlags::empty());
    }

    Ok(())
}

Then in process.rs we can implement a drop function for Thread, to free the user stack frames:

impl Drop for Thread {
    fn drop(&mut self) {
        memory::free_user_stack(
            VirtAddr::new(self.user_stack_end));
    }
}

Freeing user pages

When all threads in a process are finished we need to free all frames used by that process. To do that we need to keep track of the shared state. For now that means the page table but later threads will share other resources like file handles or environment variables. Threads might be created and destroyed while the program is running, but as long as one thread is still running we want the process to stay around. To do that we’ll use Rust’s Arc thread-safe reference counting pointer to hold the Process, shared between Threads. We can’t use the faster Rc reference counting pointer in this case because it can’t safely be copied between threads.

In process.rs define a Process struct, for now storing only the physical address of the level 4 page table:

struct Process {
    /// Page table physical address
    page_table_physaddr: u64
}

then add a reference counted pointer to the Thread struct:

use alloc::sync::Arc;

struct Thread {
  tid: u64,
  process: Arc<Process>, // new
  ...
}

In new_user_thread we initialise this:

Box::new(Thread {
    tid: unique_id(),
    process: Arc::new(Process {
        page_table_physaddr: user_page_table_physaddr
    }), // new
  ...
}

and in new_kernel_thread:

Box::new(Thread {
    tid: unique_id(),
    process: Arc::new(Process {
        page_table_physaddr: 0
    }),
    ...
}

We can now implement a drop method for Process, to free user frames. First test this by just printing a message:

impl Drop for Process {
    fn drop(&mut self) {
        println!("Dropping Process");
    }
}

and modify the hello.rs user program so that all threads exit. At the end of _start(), replace the infinite loop with an exit syscall:

#[no_mangle]
pub unsafe extern "sysv64" fn _start() -> ! {
    ...
    asm!("mov rax, 1", // exit_current_thread syscall
         "syscall",
    options(noreturn)); // needed for ! return
}

Note that the noreturn option tells the compiler that this assembly code will never return (because the thread will be dropped and control passed to another thread).

Running this you should now see the “Dropping Process” message.

Now we just need a function to recursively free all user-accessible pages and page tables, because we made copies of all of these when creating a new user program:

pub fn free_user_pagetables(level_4_physaddr: u64) {
    let memory_info = unsafe {MEMORY_INFO.as_mut().unwrap()};

    fn free_pages_rec(physical_memory_offset: VirtAddr,
                      frame_allocator: &mut MultilevelBitmapFrameAllocator,
                      table_physaddr: PhysAddr,
                      level: u16) {
        let table = unsafe{&mut *(physical_memory_offset
                                  + table_physaddr.as_u64())
                           .as_mut_ptr() as &mut PageTable};
        for entry in table.iter() {
            if !entry.is_unused() {
                if (level == 1) || entry.flags().contains(PageTableFlags::HUGE_PAGE) {
                    // Maps a frame, not a page table
                    if entry.flags().contains(PageTableFlags::USER_ACCESSIBLE) {
                        // A user frame => deallocate
                        frame_allocator.deallocate_frame(
                            entry.frame().unwrap());
                    }
                } else {
                    // A page table
                    free_pages_rec(physical_memory_offset,
                                   frame_allocator,
                                   entry.addr(),
                                   level - 1);
                }
            }
        }
        // Free page table
        frame_allocator.deallocate_frame(
            PhysFrame::from_start_address(table_physaddr).unwrap());
    }

    free_pages_rec(memory_info.physical_memory_offset,
                   &mut memory_info.frame_allocator,
                   PhysAddr::new(level_4_physaddr),
                   4);
}

We have to be a bit careful with freeing page tables: When a program exits its page table will be active. Before freeing page tables we should switch to the kernel page table, even though the function above doesn’t actually modify any page tables (it just marks them as available in the frame allocator, so another process might request and then modify them).

pub fn switch_to_kernel_pagetable() {
    let memory_info = unsafe {MEMORY_INFO.as_mut().unwrap()};
    let phys_addr = (memory_info.kernel_l4_table as *mut PageTable as u64)
        - memory_info.physical_memory_offset.as_u64();
    switch_to_pagetable(phys_addr);
}

Drop is now quite straightforward for Process:

impl Drop for Process {
    fn drop(&mut self) {
        if self.page_table_physaddr == memory::active_pagetable_physaddr() {
            memory::switch_to_kernel_pagetable();
        }
        memory::free_user_pagetables(self.page_table_physaddr);
    }
}

Now finally threads and processes return frames to the allocator when they exit! In the next section we’ll implement user memory management, so user programs will be able to allocate and free memory.

Appendix: Subdirectories in src/

The memory handling code is going to be a significant part of the kernel, so breaking up this 1000+ lines of code into separate files might help keep it manageable. According to this Stackoverflow answer one way to do this is to have a file src/memory.rs, and sub-modules as files in src/memory/. We can move all frame allocator-related code into src/memory/frame_allocator.rs and then add near the top of memory.rs:

mod frame_allocator; // In memory/frame_allocator.rs
use frame_allocator::MultilevelBitmapFrameAllocator;