
Streamline some memory handling #419

Merged
24 commits merged into tracel-ai:main on Jan 15, 2025

Conversation

ArthurBrussee
Contributor

@ArthurBrussee commented Jan 13, 2025

This PR addresses some memory allocation problems, particularly when using ExclusivePages.

WGPU create() mapping

Firstly, and applicable to any memory allocator, is a different approach to using create() efficiently. Using queue.write_buffer_with is nice because we can keep one single ComputePass for a batch of operations, which has some benefits. However, the implicit ordering of queue.write_buffer_with has been a bit of a nightmare (#83, #156, #405). Additionally, it's still not optimal: write_buffer_with still needs to allocate staging buffers on the fly and doesn't re-use them. Lastly, wgpu ideally wants global uniform data marked as uniform instead of storage. That helps uniformity analysis, and can be faster on some integrated GPUs with special fast memory for uniforms.

This PR instead splits allocations into small uniform allocations and large create calls. For small uniforms we use a StagingBelt, which efficiently re-uses staging buffers, and a separate encoder to manually insert these copies before the compute work. Because we now only have a tiny subset of buffers to worry about, we can simplify the memory locking to just keeping a reference to them.

For this to work, some more of the memory management had to be pushed to the wgpu stream. The good news is that this removes the brittle connection between stream.flush() and server.on_flushed().
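
As a rough illustration of this pattern (a minimal sketch, not the PR's actual code; the function names and the plumbing around the wgpu calls are assumptions), small uniform uploads go through a reusable StagingBelt into a dedicated copy encoder, which is submitted ahead of the compute encoder:

use wgpu::util::StagingBelt;

// Sketch: record a small uniform upload via the staging belt into a dedicated
// "copy" encoder. The belt hands back a mapped view of a reused staging buffer
// and records the buffer-to-buffer copy into `copy_encoder`.
fn upload_small_uniform(
    device: &wgpu::Device,
    belt: &mut StagingBelt,
    copy_encoder: &mut wgpu::CommandEncoder,
    target: &wgpu::Buffer, // created with BufferUsages::UNIFORM | COPY_DST
    offset: u64,
    data: &[u8],
) {
    let size = wgpu::BufferSize::new(data.len() as u64).expect("non-empty upload");
    belt.write_buffer(copy_encoder, target, offset, size, device)
        .copy_from_slice(data);
}

// Sketch: submit the copy encoder before the compute encoder so every upload
// lands before the kernels that read it, with no reliance on the implicit
// ordering of queue.write_buffer_with.
fn submit_work(
    queue: &wgpu::Queue,
    belt: &mut StagingBelt,
    copy_encoder: wgpu::CommandEncoder,
    compute_encoder: wgpu::CommandEncoder,
) {
    belt.finish(); // unmap the staging buffers before submission
    queue.submit([copy_encoder.finish(), compute_encoder.finish()]);
    belt.recall(); // mark the staging buffers as reusable once the GPU is done
}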

ExclusivePages allocator

Secondly, this PR reworks the ExclusivePages allocator:

  • No more ring buffer. This caused the allocator to cycle between free pages, which in turn meant they weren't deallocated properly.
  • Similarly, don't keep pages in a hashmap, as its random iteration order causes the same deallocation problem.
  • Use a more standard exponential size distribution with far fewer pools. The fragmentation between pools meant memory would often go unused, whereas a greedier strategy seems to re-use memory better (see the sketch after this list).
  • Allow allocating from N neighbouring pools, if they happen to have a free buffer.
  • Allow allocating pages smaller than the page size in the pools.
  • Tweak the deallocation curve. Deallocating larger buffers more quickly doesn't really make sense: since they are used less frequently, it takes more allocations before we know they really are free!
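
To make the new pool layout concrete, here is a rough sketch (not the actual cubecl implementation; all names and constants are illustrative) of picking a pool by exponential size bucket, falling back to a few larger neighbour pools, and letting a free page serve an allocation smaller than the page itself:

const MIN_PAGE_SIZE: u64 = 1 << 20; // page size of the smallest pool (illustrative)
const NEIGHBOUR_POOLS: usize = 2; // how many larger pools to also check (illustrative "N")

struct Page {
    size: u64,
    free: bool,
}

struct Pool {
    pages: Vec<Page>, // kept in a Vec with stable order, not a hashmap
}

// Index of the smallest pool whose page size fits `size`: pool i holds pages of
// MIN_PAGE_SIZE * 2^i bytes, i.e. an exponential size distribution.
fn pool_index(size: u64) -> usize {
    let mut index = 0;
    let mut page_size = MIN_PAGE_SIZE;
    while page_size < size {
        page_size *= 2;
        index += 1;
    }
    index
}

// Try the target pool first, then up to NEIGHBOUR_POOLS larger pools that happen
// to have a free page; a page larger than the requested size may still serve it.
fn find_free_page(pools: &mut [Pool], size: u64) -> Option<&mut Page> {
    let start = pool_index(size);
    let end = (start + NEIGHBOUR_POOLS + 1).min(pools.len());
    pools
        .get_mut(start..end)?
        .iter_mut()
        .flat_map(|pool| pool.pages.iter_mut())
        .find(|page| page.free && page.size >= size)
}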

Results

Some results from training my model (Gaussian splats) to 5k steps. This has lots of dynamic memory usage patterns, dynamic shapes, and spiky workloads. Note that the blue "memory used" line is much lower than the reserved memory, but that's expected - I'm measuring memory after a training step, when lots of buffers have become free again.

Before:
image

time: 4m29s

After:
image

time: 4m17s

Interestingly, this workload (dynamic shapes and spiky memory usage) does really badly with sliced memory; that might have to be investigated in the future.

Slices (Same before and after this PR)
image

time: 4m34s

crates/cubecl-wgpu/src/compute/stream.rs (outdated)
// in self.locked_copy_handles which means the allocation won't try to use them.
// - For bigger buffers, we do break the compute stream.

let allocator = if aligned_len < MAX_UNIFORM_BUFFER_SIZE {
Member

I feel a bit stupid right now, since I kind of thought that all buffers were actually uniform by default when called with a single allocation! Obviously, we want that for performance! I'm wondering if it's possible to 'defragment' the 'real' GPU memory allocated for all memory pages. When utilization gets quite stable, we could perform one huge defragmentation of all our pages to speed up the following work.

Contributor Author

I'm not 100% sure what I you mean! Maybe we're using different terminology. In wgpu, a "uniform" is something that is guaranteed to be:

  • Read only
  • The same value for every thread in a threadgroup

The compiler can largely infer this anyway, but in WGSL you can explicitly mark a binding as

@group(0) @binding(0) var<uniform> example_meta_data: array;

instead of

@group(0) @binding(0) var<storage, read_write> example_meta_data: array;

I haven't actually changed the WGSL compiler yet to take advantage of this, but this should be allowed now, as we know uniforms use an exclusive page and are allocated with BufferUsages::UNIFORM. I'm not sure it'll make a ton of difference, but yeah, I think some drivers can take this as a hint to place that memory in some frequently accessed cache.
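
For context, a minimal sketch (illustrative function name and label, not cubecl's code) of creating a buffer that could back these small uniform allocations in wgpu:

// UNIFORM lets bindings into this buffer be declared as var<uniform> in WGSL;
// COPY_DST is required because the data is written by the staging-belt copies
// described earlier.
fn create_uniform_backing(device: &wgpu::Device, size: u64) -> wgpu::Buffer {
    device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("uniform page"), // illustrative label
        size,
        usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
        mapped_at_creation: false,
    })
}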

So, not sure how that relates to a single allocation or fragmentation!

Member

Oh, I had contiguous memory in mind: so if you allocate 100 MiB, that's all contiguous in memory.

Contributor Author

Ahhh yes, ok, that makes sense. I think, like you said, all allocations are contiguous by default.

};
use wgpu::{util::StagingBelt, BufferDescriptor, BufferUsages, ComputePipeline};

const MAX_UNIFORM_BUFFER_SIZE: u64 = 8192;
Member

Is it really the max?

Contributor Author

Maybe max is a bad name - renamed and added a comment.

fn flush_if_needed(&mut self) {
    // For now we only consider the number of handles locked, but we may consider the amount in
    // bytes at some point.
    if self.tasks_count >= self.tasks_max || self.locked_copy_handles.len() >= 32 {
Member

I would use a multiple of tasks_max for the maximum number of handles locked before we flush. Since it doesn't consider small handles, we could approximate with a factor of 2. tasks_max is the setting that should be used to reduce memory usage in general. Maybe we could introduce another setting and derive its default from tasks_max.

Contributor Author

Ah yes, basing it on tasks_max makes sense! Went with a factor of 8, as this should really be an emergency brake, not a cause of more flushing.
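
As a sketch of the condition being discussed, on a reduced stand-in for the stream state (field names follow the quoted diff; everything else is assumed):

struct StreamSketch {
    tasks_count: usize,
    tasks_max: usize,
    locked_copy_handles: Vec<u64>, // stand-in for the real handle type
}

impl StreamSketch {
    // The hard-coded 32 is replaced by a limit derived from tasks_max, so the
    // lock count acts as an emergency brake rather than a regular flush trigger.
    fn flush_if_needed(&mut self) {
        let max_locked_handles = self.tasks_max * 8;
        if self.tasks_count >= self.tasks_max
            || self.locked_copy_handles.len() >= max_locked_handles
        {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // Submit pending work and release the locked copy handles.
        self.tasks_count = 0;
        self.locked_copy_handles.clear();
    }
}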

Comment on lines 504 to 505
let copy_encoder = std::mem::replace(&mut self.copy_encoder, create_encoder(&self.device));
let encoder = std::mem::replace(&mut self.encoder, create_encoder(&self.device));
Member

I was really wondering why we needed two encoders in the state. The name copy_encoder reads like a copy of the main encoder rather than the encoder used for copies. I would rename it to encoder_memcopy.

Contributor Author

Renamed this & added some comments

Comment on lines 518 to 524
self.memory_management.cleanup();
self.memory_management.storage().perform_deallocations();

self.memory_management_uniforms.cleanup();
self.memory_management_uniforms
    .storage()
    .perform_deallocations();
Member

I think a few more comments are necessary; it's hard to know why we need two memory management pools and a staging belt.

From what I see:

  • StagingBelt => Serves as a pool of preallocated buffers to be used as staging buffers for small copies, such as kernel metadata.
  • Memory Management Uniforms => Contains the buffers used in the kernels, copied from the StagingBelt.
  • Memory Management => Everything else.

I'm still wondering if we can use uniform memory for big buffers.

Contributor Author
@ArthurBrussee, Jan 13, 2025

Your assessment is correct I think, but I have added some more comments! Hopefully the other comment clarifies that, for big buffers, BufferUsages::UNIFORM doesn't make sense per se.
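
Restating that three-tier split as a commented sketch (placeholder type and field names, not the actual cubecl types):

// Placeholder pool types so the sketch stands alone; the real types live in
// cubecl's memory management.
struct UniformPool;
struct StoragePool;

struct MemoryTiersSketch {
    /// Reusable staging buffers for small host-to-GPU copies (e.g. kernel metadata).
    staging_belt: wgpu::util::StagingBelt,
    /// Small allocations bound as uniforms inside kernels; filled from the staging
    /// belt before the compute pass.
    memory_management_uniforms: UniformPool,
    /// Everything else: the regular storage-buffer allocator.
    memory_management: StoragePool,
}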

@nathanielsimard merged commit 7706210 into tracel-ai:main on Jan 15, 2025
5 checks passed