Make the pjrt gpu allocator configurable #5759

Merged: 3 commits into pytorch:master on Nov 13, 2023

Conversation

@anw90 (Contributor) commented on Nov 2, 2023:

This PR makes the PJRT GPU allocator configurable.
The default memory allocator fraction is 0.75, which is too small.

// The allocator knobs are read from environment variables. (The assignment
// target of the first getter is inferred from its env var name.)
allocator_config.preallocate =
    sys_util::GetEnvBool(env::kEnvPjrtAllocatorPreallocate, true);
allocator_config.memory_fraction =
    sys_util::GetEnvDouble(env::kEnvPjrtAllocatorFraction, 0.9);
return allocator_config;
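For context, a minimal sketch of what the full helper could look like with all three knobs wired up; the kEnvPjrtAllocatorCudaAsync handling and the exact getter defaults are assumptions inferred from the hunk above, not the PR's verbatim code:

// Sketch (assumed, not verbatim from the PR): build the allocator config
// from environment variables.
xla::GpuAllocatorConfig GetGpuAllocatorConfig() {
  xla::GpuAllocatorConfig allocator_config;
  // kCudaAsync selects the cudaMallocAsync-based allocator in xla:gpu.
  if (sys_util::GetEnvBool(env::kEnvPjrtAllocatorCudaAsync, false)) {
    allocator_config.kind = xla::GpuAllocatorConfig::Kind::kCudaAsync;
  }
  allocator_config.preallocate =
      sys_util::GetEnvBool(env::kEnvPjrtAllocatorPreallocate, true);
  allocator_config.memory_fraction =
      sys_util::GetEnvDouble(env::kEnvPjrtAllocatorFraction, 0.9);
  return allocator_config;
}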
@vanbasten23 (Collaborator) reviewed:

This seems to change the default behavior. If none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, or kEnvPjrtAllocatorFraction is set, could you just return xla::GpuAllocatorConfig{} as before?

@anw90 (Contributor, Author) replied:

If kEnvPjrtAllocatorFraction is set to 0.75, it is the same as xla::GpuAllocatorConfig{}. Should the default value be changed from 0.9 to 0.75?

@anw90 (Contributor, Author) added:

In our previous experience with GPUs, 0.9 is considered a reasonable value for the memory fraction.

@vanbasten23 (Collaborator) replied:

Yeah, let's use the default values from xla:gpu: https://github.com/openxla/xla/blob/7ab5df624ff1d98804999b03b21abecd14ec57a6/xla/pjrt/gpu/gpu_helpers.h#L41-L60.
With the new flags, you still have the flexibility to choose the configuration that best suits your needs.
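For reference, the linked header defines the config roughly as follows (paraphrased from memory of that revision; treat the field names, comments, and defaults as approximate and check the link for the authoritative version):

// Paraphrased from xla/pjrt/gpu/gpu_helpers.h at the linked revision.
struct GpuAllocatorConfig {
  enum class Kind {
    kDefault,    // client-chosen default (BFC)
    kPlatform,   // StreamExecutor's platform allocator
    kBFC,        // BFC allocator
    kCudaAsync,  // cudaMallocAsync-based allocator
  };
  Kind kind = Kind::kDefault;
  double memory_fraction = 0.75;  // fraction of device memory to use
  bool preallocate = true;        // reserve memory_fraction up front
};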

@anw90 (Contributor, Author) replied:

done

@vanbasten23 (Collaborator) followed up:

Actually, if none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, or kEnvPjrtAllocatorFraction is set, could you just return xla::GpuAllocatorConfig{} as before?

The reason is that, with the current implementation, if xla:GPU changes its default values, then we have to update ours, and we may forget to do so or may not know that they changed. By just returning xla::GpuAllocatorConfig{}, we pick up whatever default values xla:GPU has set.

@anw90 (Contributor, Author) replied:

OK. When none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, or kEnvPjrtAllocatorFraction is set, it now just returns xla::GpuAllocatorConfig{}.
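A minimal sketch of that fallback, using std::getenv for the presence check; the literal PJRT_ALLOCATOR_* variable names are an assumption here about what the env::kEnvPjrtAllocator* constants expand to:

#include <cstdlib>

xla::GpuAllocatorConfig GetGpuAllocatorConfig() {
  // If the user set none of the knobs, defer entirely to
  // xla::GpuAllocatorConfig{} so upstream default changes propagate.
  // (Variable names assumed from the env::kEnvPjrtAllocator* constants.)
  bool any_set = std::getenv("PJRT_ALLOCATOR_CUDA_ASYNC") != nullptr ||
                 std::getenv("PJRT_ALLOCATOR_PREALLOCATE") != nullptr ||
                 std::getenv("PJRT_ALLOCATOR_FRACTION") != nullptr;
  if (!any_set) {
    return xla::GpuAllocatorConfig{};
  }
  xla::GpuAllocatorConfig allocator_config;
  allocator_config.preallocate =
      sys_util::GetEnvBool(env::kEnvPjrtAllocatorPreallocate, true);
  allocator_config.memory_fraction =
      sys_util::GetEnvDouble(env::kEnvPjrtAllocatorFraction, 0.9);
  return allocator_config;
}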

@vanbasten23 (Collaborator) commented:
I wonder what problem you ran into when the memory allocator fraction is set to 0.75.

@anw90 (Contributor, Author) commented on Nov 3, 2023:

> I wonder what problem you ran into when the memory allocator fraction is set to 0.75.

On an 80 GB H100/A100, setting the memory allocator fraction to 0.75 wastes roughly 20 GB (0.75 × 80 GB = 60 GB usable, versus 0.9 × 80 GB = 72 GB). For distributed training jobs, NCCL does not require that much headroom.

GPU memory is particularly constrained during LLM training, so setting the memory allocator fraction to 0.9 uses it more efficiently.
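As a usage sketch (the literal variable name is an assumption based on the env::kEnvPjrtAllocatorFraction constant), a training process could raise the fraction before the PJRT GPU client is created:

#include <cstdlib>

int main() {
  // Hypothetical variable name: raise the usable fraction from the
  // 0.75 default to 0.9 before the GPU allocator is initialized.
  setenv("PJRT_ALLOCATOR_FRACTION", "0.9", /*overwrite=*/1);
  // ... initialize torch_xla / the PJRT GPU client and run training ...
  return 0;
}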

@vanbasten23 (Collaborator) approved:

LGTM

@vanbasten23 merged commit 56733fb into pytorch:master on Nov 13, 2023. 17 checks passed.
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request on Nov 16, 2023, with the message:
* Make the pjrt gpu allocator configurable

* the default value changed from 0.9 to 0.75

* return default GpuAllocatorConfig

---------

Co-authored-by: wangang.wa <[email protected]>
Commits with the same message later referenced this pull request from zpcore (Nov 21, 2023), lsy323/xla (Nov 28, 2023), ManfeiBai (twice on Nov 29, 2023), chunnienc/xla (Dec 14, 2023), golechwierowicz (Jan 12, 2024), and bhavya01 (Apr 22, 2024).