Streamline some memory handling #419
Conversation
// in self.locked_copy_handles which means the allocation won't try to use them.
// - For bigger buffers, we do break the compute stream.

let allocator = if aligned_len < MAX_UNIFORM_BUFFER_SIZE {
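For context, a plausible completion of this branch, using the pool names that appear elsewhere in the diff (the exact body is an assumption, not the PR's code):

// Small allocations go to the uniform pool (exclusive pages created with
// BufferUsages::UNIFORM); everything else goes to the regular storage pool.
let allocator = if aligned_len < MAX_UNIFORM_BUFFER_SIZE {
    &mut self.memory_management_uniforms
} else {
    &mut self.memory_management
};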
I feel a bit stupid right now, since I kinda thought that all buffers were actually uniform by default when called with a single allocation! Obviously, we want that for performance! I'm wondering if it's not possible to 'defragment' the 'real' GPU memory allocated for all memory pages. When utilization gets quite stable, we could perform one huge defragmentation of all our pages to speed up the following work.
I'm not 100% sure what you mean! Maybe some different terminology. In wgpu a "uniform" is something that is guaranteed to be:
- Read only
- The same value for every thread in a threadgroup
The compiler can largely infer this anyway, but in wgsl you can explicitly mark a binding as
@group(0) @binding(0) var<uniform> example_meta_data: array;
instead of
@group(0) @binding(0) var<storage, read_write> example_meta_data: array;
I haven't actually changed the wgsl compiler yet to take advantage of this, but this should be allowed now, as we know uniforms use an exclusive page and are allocated with BufferUsages::UNIFORM. I'm not sure it'll make a ton of difference but yeah, I think some drivers can take this as a hint to place that memory in some frequently accessed cache.
So, not sure how that relates to a single allocation or fragmentation!
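To make the host side of this concrete, here is a minimal sketch of allocating a page with wgpu's uniform usage so a binding like the one above can be declared as var<uniform>. The label, size, and variable names are illustrative, not the PR's code:

// Allocate a small page usable both as a uniform binding and as a copy
// target for staged metadata uploads.
let page = device.create_buffer(&wgpu::BufferDescriptor {
    label: Some("uniform-metadata-page"),
    size: 8192,
    usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
    mapped_at_creation: false,
});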
Oh, I had in mind contiguous, so if you allocate 100 MiB, that's all contiguous in memory.
Ahhh yes ok, makes sense. No, I think like you said all allocations are contiguous by default.
};
use wgpu::{util::StagingBelt, BufferDescriptor, BufferUsages, ComputePipeline};

const MAX_UNIFORM_BUFFER_SIZE: u64 = 8192;
Is it really the max?
Maybe max is a bad name - renamed and added a comment.
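Something along these lines, where the new name and comment are hypothetical since the final rename isn't shown here:

// Allocations smaller than this threshold are routed to the uniform pool and
// uploaded through the staging belt. It is a routing cutoff, not the device's
// maximum uniform buffer size.
const UNIFORM_POOL_CUTOFF_SIZE: u64 = 8192;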
fn flush_if_needed(&mut self) {
    // For now we only consider the number of handles locked, but we may consider the amount in
    // bytes at some point.
    if self.tasks_count >= self.tasks_max || self.locked_copy_handles.len() >= 32 {
I would use a multiple of tasks_max for the maximum handles locked before we flush. Since it doesn't consider small handles, we could approximate with a factor of 2. tasks_max is the setting that should be used to reduce memory usage in general. Maybe we could introduce another setting and derive a default from tasks_max.
Ah yes, basing it on tasks_max makes sense! Went with a factor of 8, as this should really be an emergency brake, not a cause of more flushing.
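A sketch of what the revised check could look like with that factor; the exact threshold placement and the flush call are assumed:

fn flush_if_needed(&mut self) {
    // Flush when the task budget is reached, or, as an emergency brake, when
    // far more copy handles are locked than tasks we would normally batch.
    if self.tasks_count >= self.tasks_max
        || self.locked_copy_handles.len() >= self.tasks_max * 8
    {
        self.flush();
    }
}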
let copy_encoder = std::mem::replace(&mut self.copy_encoder, create_encoder(&self.device));
let encoder = std::mem::replace(&mut self.encoder, create_encoder(&self.device));
I was really wondering why we needed two encoders in the state. copy_encoder seems like a copy of the main encoder instead of the encoder used for copies. I would rename it to encoder_memcopy.
Renamed this & added some comments
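Roughly, the intended submission order looks like this (names follow the snippet above; the queue call around it is an assumption): the encoder carrying the staging-belt copies is finished and submitted ahead of the encoder carrying the compute passes, so uniforms are in place before kernels read them.

let copy_encoder = std::mem::replace(&mut self.copy_encoder, create_encoder(&self.device));
let encoder = std::mem::replace(&mut self.encoder, create_encoder(&self.device));
// Copies first, compute second, in a single submission.
self.queue.submit([copy_encoder.finish(), encoder.finish()]);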
self.memory_management.cleanup();
self.memory_management.storage().perform_deallocations();

self.memory_management_uniforms.cleanup();
self.memory_management_uniforms
    .storage()
    .perform_deallocations();
I think a bit more comments are necessary; it's hard to know why we need two memory management pools and a staging belt.
From what I see:
- StagingBelt => Serves as a pool of preallocated buffers to be used as staging buffers for small copies, such as kernel metadata.
- Memory Management Uniforms => Contains the buffers used in the kernels, copied from StagingBelt.
- Memory Management => Everything else.
I'm still wondering if we can use uniform memory for big buffers.
Your assessment is correct I think, but I have added some more comments! Hopefully the other comment clarifies that for big buffers BufferUsages::UNIFORM doesn't make sense per se.
…remove neighbour buckets (force-pushed from 820b3ec to 9a8b59e)
This PR addresses some memory allocation problems, particularly when using ExclusivePages.
WGPU create() mapping
Firstly, applicable to any memory allocator, is a different approach to efficiently use create(). Using queue.write_buffer_with is nice because we can keep one single ComputePass for a batch of operations, which has some benefits. However, the implicit ordering of queue.write_buffer_with has been a bit of a nightmare (#83, #156, #405). Additionally, it's still not optimal: write_buffer_with still needs to allocate staging buffers on the fly, and doesn't re-use them. Lastly, wgpu ideally wants global uniform data marked as uniform instead of storage. That helps uniformity analysis, and can be faster on some integrated GPUs with special fast memory for uniforms.

This PR instead splits out small uniform allocations and large create calls. For small uniforms we use a StagingBelt, which efficiently re-uses staging buffers, and a separate encoder to manually insert these copies before the compute work. Because we now only have a tiny subset of buffers to worry about, we can simplify the memory locking to just keeping a reference to them.
For this to work, some more of the memory management had to be pushed to the wgpu stream. The good news is that this removes the brittle connection between stream.flush() and server.on_flushed().
ExclusivePages allocator
Secondly, this PR reworks the ExclusivePages allocator.
Results
Some results on a test of training my model (gaussian splats) to 5k steps. This has lots of dynamic memory usage patterns, dynamic shapes, and spike-y workloads. Note the blue "memory used" line is much lower than reserved, but that's expected: I'm measuring memory after a training step, when lots of buffers become free again.
Before:
time: 4m29s
After:
time: 4m17s
Interestingly, this workload (dynamic shapes and spike-y memory usage) does really badly with sliced memory; that might have to be investigated in the future.
Slices (Same before and after this PR)
time: 4m34s