Philip's blog #34

Open
p208p2002 opened this issue Feb 15, 2024 · 0 comments

https://blog.philip-huang.tech/?page=az-multi-node-training

- tags: gpu-cluster multi-node-training LLM model-training
- date: 2024/02/15

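LLM training is extremely resource-hungry; even a single-node, multi-GPU setup often runs short of compute or memory.

Using a GPU cluster involves a lot of extra configuration, and it is usually paired with a job scheduler and container technology.

This post briefly records the setup and workflow for multi-node training on the Azure platform.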

Key environment:

  • ubuntu: 20.04

  • cuda: 12.2.2

  • python: 3.8

  • torch: 2.1.0

  • lightning: 2.1.4

  • deepspeed: 0.13.2
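As a quick sanity check before submitting a job, the pinned versions above can be compared with what is actually installed on a node. This is only a minimal sketch: the helper name `installed_versions` is made up for illustration, and it assumes the pip distribution names are `torch`, `lightning`, and `deepspeed`. The Ubuntu and CUDA versions are not pip packages, so they are not covered here.

```python
import sys
from importlib import metadata

# Versions copied from the list above; the pip distribution names are assumed.
EXPECTED = {"torch": "2.1.0", "lightning": "2.1.4", "deepspeed": "0.13.2"}


def installed_versions(expected):
    """Map each package name to (expected, installed); installed is None if missing."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        report[name] = (want, have)
    return report


if __name__ == "__main__":
    print("python", "%d.%d" % sys.version_info[:2], "(expected 3.8)")
    for name, (want, have) in installed_versions(EXPECTED).items():
        # A local build tag such as "2.1.0+cu121" still counts as a match.
        ok = have is not None and have.split("+")[0] == want
        print(name, "expected", want, "installed", have, "OK" if ok else "MISMATCH")
```

Running the same script on every node is a cheap way to catch version drift between nodes before a multi-node job fails halfway through startup.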

Prerequisites

Create a GPU Cluster

Location: ML Studio > Manage > Compute > Compute clusters

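You can set the maximum number of nodes and the minimum number of idle nodes (the minimum can be 0); you are not charged while the cluster is idle.

The GPU cluster scales out automatically according to what the submitted jobs need.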
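The post creates the cluster through the ML Studio UI. For readers who prefer scripting, the same min/max node settings can probably be expressed with the Azure ML Python SDK v2 (`azure-ai-ml`), roughly as sketched below. This is an assumption-laden illustration, not the author's method: the helper names, the cluster name, the VM size, and the idle timeout are all placeholders, and `create_cluster` needs the `azure-ai-ml` and `azure-identity` packages plus valid Azure credentials.

```python
def cluster_settings(name, vm_size, max_nodes, min_nodes=0, idle_seconds=120):
    """Build keyword arguments for azure.ai.ml.entities.AmlCompute.

    min_nodes=0 lets the cluster scale down to zero nodes when idle.
    """
    if not 0 <= min_nodes <= max_nodes:
        raise ValueError("need 0 <= min_nodes <= max_nodes")
    return {
        "name": name,
        "size": vm_size,
        "min_instances": min_nodes,
        "max_instances": max_nodes,
        # Seconds a node may sit idle before it is released (placeholder value).
        "idle_time_before_scale_down": idle_seconds,
    }


def create_cluster(subscription_id, resource_group, workspace, settings):
    """Create or update the compute cluster; imports are lazy so the helper above has no dependencies."""
    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import AmlCompute
    from azure.identity import DefaultAzureCredential

    client = MLClient(
        DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace,
    )
    return client.compute.begin_create_or_update(AmlCompute(**settings)).result()


if __name__ == "__main__":
    # Placeholder name and VM size; pick a GPU size available in your region and quota.
    print(cluster_settings("gpu-cluster", "Standard_ND96asr_v4", max_nodes=2))
```

Keeping the settings in a plain dictionary makes it easy to review the scaling limits before anything is sent to Azure.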

![](https://media.githubusercontent.com/m
