
Streamline some memory handling #419

Merged
24 commits merged into tracel-ai:main on Jan 15, 2025

Conversation

ArthurBrussee
Contributor

@ArthurBrussee commented Jan 13, 2025

This PR addresses some memory allocation problems, particularly when using ExclusivePages.

WGPU create() mapping

Firstly, and applicable to any memory allocator, is a different approach to using create() efficiently. Using queue.write_buffer_with is nice because we can keep one single ComputePass for a batch of operations, which has some benefits. However, the implicit ordering of queue.write_buffer_with has been a bit of a nightmare (#83, #156, #405). Additionally, it's still not optimal: write_buffer_with still needs to allocate staging buffers on the fly and doesn't re-use them. Lastly, wgpu ideally wants global uniform data marked as uniform instead of storage. That helps uniformity analysis, and can be faster on some integrated GPUs with special fast memory for uniforms.

This PR instead splits allocations into small uniform allocations and large create calls. For small uniforms we use a StagingBelt, which efficiently re-uses staging buffers, and a separate encoder to manually insert these copies before the compute work. Because we now only have a tiny subset of buffers to worry about, we can simplify the memory locking to just keeping a reference to them.

For this to work, some more of the memory management had to be pushed to the wgpu stream. The good news is that this removes the brittle connection between stream.flush() and server.on_flushed().
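
As a rough illustration of this pattern (a minimal sketch, not the PR's actual code; the function names and the plumbing around the wgpu calls are assumptions), small uniform uploads go through a reusable StagingBelt into a dedicated copy encoder, which is submitted ahead of the compute encoder:

use wgpu::util::StagingBelt;

// Sketch: record a small uniform upload via the staging belt into a dedicated
// "copy" encoder. The belt hands back a mapped view of a reused staging buffer
// and records the buffer-to-buffer copy into `copy_encoder`.
fn upload_small_uniform(
    device: &wgpu::Device,
    belt: &mut StagingBelt,
    copy_encoder: &mut wgpu::CommandEncoder,
    target: &wgpu::Buffer, // created with BufferUsages::UNIFORM | COPY_DST
    offset: u64,
    data: &[u8],
) {
    let size = wgpu::BufferSize::new(data.len() as u64).expect("non-empty upload");
    belt.write_buffer(copy_encoder, target, offset, size, device)
        .copy_from_slice(data);
}

// Sketch: submit the copy encoder before the compute encoder so every upload
// lands before the kernels that read it, with no reliance on the implicit
// ordering of queue.write_buffer_with.
fn submit_work(
    queue: &wgpu::Queue,
    belt: &mut StagingBelt,
    copy_encoder: wgpu::CommandEncoder,
    compute_encoder: wgpu::CommandEncoder,
) {
    belt.finish(); // unmap the staging buffers before submission
    queue.submit([copy_encoder.finish(), compute_encoder.finish()]);
    belt.recall(); // mark the staging buffers as reusable once the GPU is done
}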

ExclusivePages allocator

Secondly, this PR reworks the ExclusivePages allocator:

  • No more ring buffer. This caused the allocator to cycle between free pages, which in turn meant they weren't deallocated properly.
  • Similarly, don't keep pages in a hashmap, as its random iteration order causes the same deallocation problem.
  • Use a more standard exponential size distribution with far fewer pools. The fragmentation between pools meant memory would often go unused, whereas a greedier strategy seems to re-use memory better (see the sketch after this list).
  • Allow allocating from N neighbouring pools, if they happen to have a free buffer.
  • Allow allocating pages smaller than the page size in the pools.
  • Tweak the deallocation curve. Deallocating larger buffers more quickly doesn't really make sense: since they are used less frequently, it takes more allocations before we know they really are free!
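
To make the new pool layout concrete, here is a rough sketch (not the actual cubecl implementation; all names and constants are illustrative) of picking a pool by exponential size bucket, falling back to a few larger neighbour pools, and letting a free page serve an allocation smaller than the page itself:

const MIN_PAGE_SIZE: u64 = 1 << 20; // page size of the smallest pool (illustrative)
const NEIGHBOUR_POOLS: usize = 2; // how many larger pools to also check (illustrative "N")

struct Page {
    size: u64,
    free: bool,
}

struct Pool {
    pages: Vec<Page>, // kept in a Vec with stable order, not a hashmap
}

// Index of the smallest pool whose page size fits `size`: pool i holds pages of
// MIN_PAGE_SIZE * 2^i bytes, i.e. an exponential size distribution.
fn pool_index(size: u64) -> usize {
    let mut index = 0;
    let mut page_size = MIN_PAGE_SIZE;
    while page_size < size {
        page_size *= 2;
        index += 1;
    }
    index
}

// Try the target pool first, then up to NEIGHBOUR_POOLS larger pools that happen
// to have a free page; a page larger than the requested size may still serve it.
fn find_free_page(pools: &mut [Pool], size: u64) -> Option<&mut Page> {
    let start = pool_index(size);
    let end = (start + NEIGHBOUR_POOLS + 1).min(pools.len());
    pools
        .get_mut(start..end)?
        .iter_mut()
        .flat_map(|pool| pool.pages.iter_mut())
        .find(|page| page.free && page.size >= size)
}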

Results

Some results from training my model (Gaussian splats) to 5k steps. This has lots of dynamic memory usage patterns, dynamic shapes, and spiky workloads. Note that the blue "memory used" line is much lower than the reserved memory, but that's expected - I'm measuring memory after a training step, when lots of buffers have become free again.

Before:
image

time: 4m29s

After:
image

time: 4m17s

Interestingly, this workload (dynamic shapes and spiky memory usage) does really badly with sliced memory; that might have to be investigated in the future.

Slices (Same before and after this PR)
image

time: 4m34s

crates/cubecl-wgpu/src/compute/stream.rs (outdated)
// in self.locked_copy_handles which means the allocation won't try to use them.
// - For bigger buffers, we do break the compute stream.

let allocator = if aligned_len < MAX_UNIFORM_BUFFER_SIZE {
Member

I feel a bit stupid right now, since I kind of thought that all buffers were actually uniform by default when called with a single allocation! Obviously, we want that for performance! I'm wondering if it's possible to 'defragment' the 'real' GPU memory allocated for all memory pages. When utilization gets quite stable, we could perform one huge defragmentation of all our pages to speed up the following work.

Contributor Author

I'm not 100% sure what I you mean! Maybe we're using different terminology. In wgpu, a "uniform" is something that is guaranteed to be:

  • Read only
  • The same value for every thread in a threadgroup

The compiler can largely infer this anyway, but in WGSL you can explicitly mark a binding as

@group(0) @binding(0) var<uniform> example_meta_data: array;

instead of

@group(0) @binding(0) var<storage, read_write> example_meta_data: array;

I haven't actually changed the WGSL compiler yet to take advantage of this, but this should be allowed now, as we know uniforms use an exclusive page and are allocated with BufferUsages::UNIFORM. I'm not sure it'll make a ton of difference, but yeah, I think some drivers can take this as a hint to place that memory in some frequently accessed cache.
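
For context, a minimal sketch (illustrative function name and label, not cubecl's code) of creating a buffer that could back these small uniform allocations in wgpu:

// UNIFORM lets bindings into this buffer be declared as var<uniform> in WGSL;
// COPY_DST is required because the data is written by the staging-belt copies
// described earlier.
fn create_uniform_backing(device: &wgpu::Device, size: u64) -> wgpu::Buffer {
    device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("uniform page"), // illustrative label
        size,
        usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
        mapped_at_creation: false,
    })
}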

So, not sure how that relates to a single allocation or fragmentation!

Member

Oh, I had contiguous memory in mind: so if you allocate 100 MiB, that's all contiguous in memory.

Contributor Author

Ahhh yes, ok, that makes sense. I think, like you said, all allocations are contiguous by default.

};
use wgpu::{util::StagingBelt, BufferDescriptor, BufferUsages, ComputePipeline};

const MAX_UNIFORM_BUFFER_SIZE: u64 = 8192;
Member

Is it really the max?

Contributor Author

Maybe max is a bad name - renamed and added a comment.

fn flush_if_needed(&mut self) {
    // For now we only consider the number of handles locked, but we may consider the amount in
    // bytes at some point.
    if self.tasks_count >= self.tasks_max || self.locked_copy_handles.len() >= 32 {
Member

I would use a multiple of tasks_max for the maximum number of handles locked before we flush. Since it doesn't consider small handles, we could approximate with a factor of 2. tasks_max is the setting that should be used to reduce memory usage in general. Maybe we could introduce another setting and derive its default from tasks_max.

Contributor Author

Ah yes, basing it on tasks_max makes sense! Went with a factor of 8, as this should really be an emergency brake, not a cause of more flushing.
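
As a sketch of the condition being discussed, on a reduced stand-in for the stream state (field names follow the quoted diff; everything else is assumed):

struct StreamSketch {
    tasks_count: usize,
    tasks_max: usize,
    locked_copy_handles: Vec<u64>, // stand-in for the real handle type
}

impl StreamSketch {
    // The hard-coded 32 is replaced by a limit derived from tasks_max, so the
    // lock count acts as an emergency brake rather than a regular flush trigger.
    fn flush_if_needed(&mut self) {
        let max_locked_handles = self.tasks_max * 8;
        if self.tasks_count >= self.tasks_max
            || self.locked_copy_handles.len() >= max_locked_handles
        {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // Submit pending work and release the locked copy handles.
        self.tasks_count = 0;
        self.locked_copy_handles.clear();
    }
}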

Comment on lines 504 to 505
let copy_encoder = std::mem::replace(&mut self.copy_encoder, create_encoder(&self.device));
let encoder = std::mem::replace(&mut self.encoder, create_encoder(&self.device));
Member

I was really wondering why we needed two encoders in the state. The name copy_encoder reads like a copy of the main encoder rather than the encoder used for copies. I would rename it to encoder_memcopy.

Contributor Author

Renamed this & added some comments

Comment on lines 518 to 524
self.memory_management.cleanup();
self.memory_management.storage().perform_deallocations();

self.memory_management_uniforms.cleanup();
self.memory_management_uniforms
    .storage()
    .perform_deallocations();
Member

I think a few more comments are necessary; it's hard to know why we need two memory management pools and a staging belt.

From what I see:

  • StagingBelt => Serves as a pool of preallocated buffers to be used as staging buffers for small copies, such as kernel metadata.
  • Memory Management Uniforms => Contains the buffers used in the kernels, copied from the StagingBelt.
  • Memory Management => Everything else.

I'm still wondering if we can use uniform memory for big buffers.

Contributor Author
@ArthurBrussee, Jan 13, 2025

Your assessment is correct I think, but I have added some more comments! Hopefully the other comment clarifies that, for big buffers, BufferUsages::UNIFORM doesn't make sense per se.
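
Restating that three-tier split as a commented sketch (placeholder type and field names, not the actual cubecl types):

// Placeholder pool types so the sketch stands alone; the real types live in
// cubecl's memory management.
struct UniformPool;
struct StoragePool;

struct MemoryTiersSketch {
    /// Reusable staging buffers for small host-to-GPU copies (e.g. kernel metadata).
    staging_belt: wgpu::util::StagingBelt,
    /// Small allocations bound as uniforms inside kernels; filled from the staging
    /// belt before the compute pass.
    memory_management_uniforms: UniformPool,
    /// Everything else: the regular storage-buffer allocator.
    memory_management: StoragePool,
}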

@nathanielsimard merged commit 7706210 into tracel-ai:main on Jan 15, 2025
5 checks passed