This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

[FEATURE] A CPU Swapping Runtime #694

Open · 3 tasks
merrymercy opened this issue Sep 9, 2022 · 6 comments
Labels: enhancement (New feature), good first issue (Good for newcomers)
@merrymercy (Member)

Background

To train or serve large models with limited GPU memory resources, we can utilize the huge amount of available CPU memory by swapping tensors between CPU and GPU. In this project, we are going to implement a swapping runtime for Alpa. We can start with the easiest case: swapping between 1 CPU and 1 GPU for serving. We can then move to more complicated cases: swapping between distributed CPUs and GPUs for training.

Todo

  • A Local Swapping Runtime
    • Implement swapping on top of this local runtime. To see how this runtime works, you can run this test case. Currently, all tensors in this env are stored as GPU tensors; to implement swapping, we just need to move some of them to the CPU (a minimal sketch of this idea follows the list).
    • Implement necessary optimizations, such as overlapping swapping with computation (e.g., prefetching).
    • Swap to disk if CPU memory is not enough.
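
As a starting point, here is a minimal sketch of the swap-out/swap-in flow. It uses plain JAX (`jax.device_put` with an explicit target device) rather than Alpa's actual runtime API, and the tensor names and sizes are made up for illustration:

```python
import jax
import jax.numpy as jnp

cpu = jax.devices("cpu")[0]
gpu = jax.devices("gpu")[0]        # assumes at least one GPU backend is visible

# A "weight" that normally lives on the GPU.
w = jax.device_put(jnp.ones((4096, 4096)), gpu)

# Swap out: copy the buffer to host memory and drop the GPU copy.
w_host = jax.device_put(w, cpu)
del w

# Swap in: bring the buffer back right before the computation that needs it.
w_gpu = jax.device_put(w_host, gpu)
y = jnp.dot(w_gpu, jnp.ones((4096, 1)))
print(y.shape)   # (4096, 1)
```

A real swapping runtime would additionally track which buffers in the env are cold, evict them to CPU (or disk) under memory pressure, and prefetch them back ahead of the instructions that use them.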

References

  • SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
  • Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
  • DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

merrymercy added the good first issue (Good for newcomers) label on Sep 9, 2022
@ZYHowell (Collaborator) commented Sep 9, 2022

The key constraint for swapping in XLA is that all parameters must already be on the GPU when an XlaExecutable is launched. To address this:

  • When the model is not very large, we can split it into more stages so that the parameters of each stage can be prepared before that stage starts.
  • When the model is so large that even the parameters of a single transformer layer (or a similar layer) cannot fit in GPU memory at once, we could still make every operator its own stage, but the auto-sharding pass would then be inefficient. Instead, we can
    • split each operator into its own stage but run auto-sharding across multiple stages. To avoid missing optimization opportunities such as fusion, we can split stages not at the JAX level but at the optimized HLO level; or
    • modify the HloModule: use custom calls to swap parameters inside the HloComputation and replace each parameter with the output of such a custom call.
  • When the model is so large that even a single GeMM cannot fit in GPU memory, we need a hand-optimized GeMM kernel that computes the GeMM for one sub-matrix while swapping in the next; this kernel would replace the corresponding HloInstruction. Such a kernel also helps in cases that are not extremely memory-intensive, because it overlaps swapping with computation (a rough sketch of the blocking idea follows this list).
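
To make the last bullet concrete, below is a rough JAX-level sketch of the blocking idea (not the hand-optimized HLO/CUDA kernel described above): the weight stays in host memory as row blocks, and the copy of the next block is issued before the matmul on the current block, so the transfer can overlap with compute through JAX's asynchronous dispatch. `blocked_matmul` and the block layout are illustrative assumptions, not Alpa code:

```python
import jax
import jax.numpy as jnp
import numpy as np

cpu = jax.devices("cpu")[0]
gpu = jax.devices("gpu")[0]   # assumes a GPU backend is visible

def blocked_matmul(x_gpu, w_host_blocks):
    """x_gpu: (m, k) array on the GPU. w_host_blocks: list of (k_i, n) blocks
    kept in host memory whose k_i sum to k. Returns x_gpu @ vstack(blocks)."""
    next_blk = jax.device_put(w_host_blocks[0], gpu)      # prefetch block 0
    y, offset = None, 0
    for i, blk_host in enumerate(w_host_blocks):
        w_blk, k_i = next_blk, blk_host.shape[0]
        if i + 1 < len(w_host_blocks):
            # Start copying the next block before computing with this one.
            next_blk = jax.device_put(w_host_blocks[i + 1], gpu)
        partial = x_gpu[:, offset:offset + k_i] @ w_blk   # partial GeMM
        y = partial if y is None else y + partial
        offset += k_i
    return y

# Example: a (1024, 8192) @ (8192, 1024) GeMM streamed in four blocks.
x = jax.device_put(jnp.ones((1024, 8192)), gpu)
blocks = [np.ones((2048, 1024), dtype=np.float32) for _ in range(4)]
print(blocked_matmul(x, blocks).shape)   # (1024, 1024)
```

A real kernel would do this double buffering with explicit CUDA streams inside a single fused kernel, but the arithmetic is the same: the full GeMM is the sum of the per-block partial products.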

@ff7250 commented Oct 11, 2022 via email

@merrymercy (Member, Author) commented Oct 11, 2022

CPU Compute Runtime

@ZYHowell (Collaborator) commented Oct 11, 2022

CPU Compute Runtime

I have some similar code in the tpu-support branch

@merrymercy (Member, Author) commented Oct 11, 2022

@ff7250 Sounds good! Could you give us some pointers to the code and usage?

@jon-chuang commented Apr 4, 2023

Hello, I am implementing CPU distributed collectives support in XLA via gloo. Is there any overlap with this project?
