-
Notifications
You must be signed in to change notification settings - Fork 487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make the pjrt gpu allocator configurable #5759
Conversation
sys_util::GetEnvBool(env::kEnvPjrtAllocatorPreallocate, true); | ||
allocator_config.memory_fraction = | ||
sys_util::GetEnvDouble(env::kEnvPjrtAllocatorFraction, 0.9); | ||
return allocator_config; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to change the default behavior. If none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, kEnvPjrtAllocatorFraction is set, could you just return xla::GpuAllocatorConfig{} as before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If kEnvPjrtAllocatorFraction is set to 0.75, it is the same as xla::GpuAllocatorConfig{}. Should the default value be changed from 0.9 to 0.75?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our previous experience with GPUs, 0.9 is considered a reasonable value for the memory fraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's use the default value from xla:gpu https://github.com/openxla/xla/blob/7ab5df624ff1d98804999b03b21abecd14ec57a6/xla/pjrt/gpu/gpu_helpers.h#L41-L60.
With the new flags, it gives you the flexibility to choose the best configuration that suits your needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, if none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, kEnvPjrtAllocatorFraction is set, could you just return xla::GpuAllocatorConfig{} as before?
The reason is that with the current implementation if xla:GPU change their default value, then we have to update ours and we may forget to do so or don't know they have changed the values. By just returning xla::GpuAllocatorConfig{}, we can use whatever default value xla:GPU has set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. When none of kEnvPjrtAllocatorCudaAsync, kEnvPjrtAllocatorPreallocate, kEnvPjrtAllocatorFraction is set, just return xla::GpuAllocatorConfig{}.
I wonder what problem you run into when memory allocator fraction is set to 0.75. |
For 80G H100/A100, there is 20G memory waste if the memory allocator fraction is set to 0.75. For distributed training job, NCCL does not requires as much memory. GPU memory is particularly constrained, especially during LLM training. Therefore, setting memory allocator fraction to 0.9 can improve the memory usage more efficiency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* Make the pjrt gpu allocator configurable * the default value changed from 0.9 to 0.75 * return default GpuAllocatorConfig --------- Co-authored-by: wangang.wa <[email protected]>
* Make the pjrt gpu allocator configurable * the default value changed from 0.9 to 0.75 * return default GpuAllocatorConfig --------- Co-authored-by: wangang.wa <[email protected]>
* Make the pjrt gpu allocator configurable * the default value changed from 0.9 to 0.75 * return default GpuAllocatorConfig --------- Co-authored-by: wangang.wa <[email protected]>
* Make the pjrt gpu allocator configurable * the default value changed from 0.9 to 0.75 * return default GpuAllocatorConfig --------- Co-authored-by: wangang.wa <[email protected]>
* Make the pjrt gpu allocator configurable * the default value changed from 0.9 to 0.75 * return default GpuAllocatorConfig --------- Co-authored-by: wangang.wa <[email protected]>
* Make the pjrt gpu allocator configurable * the default value changed from 0.9 to 0.75 * return default GpuAllocatorConfig --------- Co-authored-by: wangang.wa <[email protected]>
* Make the pjrt gpu allocator configurable * the default value changed from 0.9 to 0.75 * return default GpuAllocatorConfig --------- Co-authored-by: wangang.wa <[email protected]>
* Make the pjrt gpu allocator configurable * the default value changed from 0.9 to 0.75 * return default GpuAllocatorConfig --------- Co-authored-by: wangang.wa <[email protected]>
This PR aims to make the pjrt gpu allocator configurable.
The default value of memory allocator fraction is 0.75, which is too small.