This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

[FEATURE] A CPU Swapping Runtime #694

Open · 3 tasks
merrymercy opened this issue Sep 9, 2022 · 6 comments
Labels: enhancement (New feature), good first issue (Good for newcomers)
@merrymercy (Member)

Background

To train or serve large models with limited GPU memory resources, we can utilize the huge amount of available CPU memory by swapping tensors between CPU and GPU. In this project, we are going to implement a swapping runtime for Alpa. We can start with the easiest case: swapping between 1 CPU and 1 GPU for serving. We can then move to more complicated cases: swapping between distributed CPUs and GPUs for training.

Todo

  • A Local Swapping Runtime
    • Implement swapping on top of this local runtime. To see how this runtime works, you can run this test case. Currently, all tensors in this env are stored as GPU tensors; to implement swapping, we just need to move some of them to the CPU (a minimal sketch of this idea follows the list).
    • Implement necessary optimizations, such as overlapping swapping with computation (e.g., prefetching).
    • Swap to disk if CPU memory is not enough.
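
As a starting point, here is a minimal sketch of the swap-out/swap-in flow. It uses plain JAX (`jax.device_put` with an explicit target device) rather than Alpa's actual runtime API, and the tensor names and sizes are made up for illustration:

```python
import jax
import jax.numpy as jnp

cpu = jax.devices("cpu")[0]
gpu = jax.devices("gpu")[0]        # assumes at least one GPU backend is visible

# A "weight" that normally lives on the GPU.
w = jax.device_put(jnp.ones((4096, 4096)), gpu)

# Swap out: copy the buffer to host memory and drop the GPU copy.
w_host = jax.device_put(w, cpu)
del w

# Swap in: bring the buffer back right before the computation that needs it.
w_gpu = jax.device_put(w_host, gpu)
y = jnp.dot(w_gpu, jnp.ones((4096, 1)))
print(y.shape)   # (4096, 1)
```

A real swapping runtime would additionally track which buffers in the env are cold, evict them to CPU (or disk) under memory pressure, and prefetch them back ahead of the instructions that use them.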

References

  • SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
  • Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
  • DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

merrymercy added the good first issue (Good for newcomers) label on Sep 9, 2022
@ZYHowell (Collaborator) commented Sep 9, 2022

The key constraint for swapping in XLA is that all parameters must already be on the GPU when an XlaExecutable is launched. To address this:

  • When the model is not very large, we can split it into more stages so that the parameters of each stage can be prepared before that stage starts.
  • When the model is so large that even the parameters of a single transformer layer (or a similar layer) cannot fit in GPU memory at once, we could still make every operator its own stage, but the auto-sharding pass would then be inefficient. Instead, we can
    • split each operator into its own stage but run auto-sharding across multiple stages. To avoid missing optimization opportunities such as fusion, we can split stages not at the JAX level but at the optimized HLO level; or
    • modify the HloModule: use custom calls to swap parameters inside the HloComputation and replace each parameter with the output of such a custom call.
  • When the model is so large that even a single GeMM cannot fit in GPU memory, we need a hand-optimized GeMM kernel that computes the GeMM for one sub-matrix while swapping in the next; this kernel would replace the corresponding HloInstruction. Such a kernel also helps in cases that are not extremely memory-intensive, because it overlaps swapping with computation (a rough sketch of the blocking idea follows this list).
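
To make the last bullet concrete, below is a rough JAX-level sketch of the blocking idea (not the hand-optimized HLO/CUDA kernel described above): the weight stays in host memory as row blocks, and the copy of the next block is issued before the matmul on the current block, so the transfer can overlap with compute through JAX's asynchronous dispatch. `blocked_matmul` and the block layout are illustrative assumptions, not Alpa code:

```python
import jax
import jax.numpy as jnp
import numpy as np

cpu = jax.devices("cpu")[0]
gpu = jax.devices("gpu")[0]   # assumes a GPU backend is visible

def blocked_matmul(x_gpu, w_host_blocks):
    """x_gpu: (m, k) array on the GPU. w_host_blocks: list of (k_i, n) blocks
    kept in host memory whose k_i sum to k. Returns x_gpu @ vstack(blocks)."""
    next_blk = jax.device_put(w_host_blocks[0], gpu)      # prefetch block 0
    y, offset = None, 0
    for i, blk_host in enumerate(w_host_blocks):
        w_blk, k_i = next_blk, blk_host.shape[0]
        if i + 1 < len(w_host_blocks):
            # Start copying the next block before computing with this one.
            next_blk = jax.device_put(w_host_blocks[i + 1], gpu)
        partial = x_gpu[:, offset:offset + k_i] @ w_blk   # partial GeMM
        y = partial if y is None else y + partial
        offset += k_i
    return y

# Example: a (1024, 8192) @ (8192, 1024) GeMM streamed in four blocks.
x = jax.device_put(jnp.ones((1024, 8192)), gpu)
blocks = [np.ones((2048, 1024), dtype=np.float32) for _ in range(4)]
print(blocked_matmul(x, blocks).shape)   # (1024, 1024)
```

A real kernel would do this double buffering with explicit CUDA streams inside a single fused kernel, but the arithmetic is the same: the full GeMM is the sum of the per-block partial products.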

@ff7250 commented Oct 11, 2022 via email

@merrymercy (Member, Author) commented Oct 11, 2022

CPU Compute Runtime

@ZYHowell (Collaborator) commented Oct 11, 2022

CPU Compute Runtime

I have some similar code in the tpu-support branch

@merrymercy (Member, Author) commented Oct 11, 2022

@ff7250 Sounds good! Could you give us some pointers to the code and usage?

@jon-chuang commented Apr 4, 2023

Hello, I am implementing CPU distributed collectives support in XLA via gloo. Is there any overlap with this project?
