
[Bug] Extremely low TFLOPS when training with MoE #264

Open
Cerberous opened this issue Jun 28, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Cerberous

Describe the bug

When training an MoE model, throughput is only a few dozen TFLOPS; normal (dense) training reaches the expected throughput.

Environment information

Official image and code

Other information

No response

Cerberous added the bug label Jun 28, 2024
@blankde
Collaborator

blankde commented Jun 28, 2024

@Cerberous Could you share your config? Are you using the config/7b_Moe4_sft configuration? The MoE gate computation involves many small operators; without fusing them, and with the all2all communication overhead on top, MoE's MFU is currently only about half that of a dense model.

@Cerberous
Author

@Cerberous Could you share your config? Are you using the config/7b_Moe4_sft configuration? The MoE gate computation involves many small operators; without fusing them, and with the all2all communication overhead on top, MoE's MFU is currently only about half that of a dense model.

model = dict(
    num_chunks=1,  # if num_chunks > 1, interleaved pipeline scheduler is used.
    checkpoint=False,  # The proportion of layers for activation checkpointing; the optional values are True/False/[0-1]
    dtype="torch.bfloat16",  # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
    max_position_embeddings=SEQ_LEN,
    embed_split_hidden=True,
    num_layers=NUM_LAYER,
    hidden_size=HIDDEN_SIZE,
    vocab_size=VOCAB_SIZE,
    embed_grad_scale=1,
    parallel_output=True,
    num_attention_heads=NUM_ATTENTION_HEAD,
    # num_kv_attention_heads=NUM_ATTENTION_HEAD // 2,
    mlp_ratio=MLP_RATIO,
    multiple_of=MULTIPLE_OF,
    norm_type="rmsnorm",
    # adapt_hf=True,
    apply_post_layer_norm=False,
    # no_bias=True,
    layer_norm_epsilon=1e-5,
    # rope_base=10000,
    # norm_head=False,
    use_flash_attn=True,
    mlp_layer_fusion=True,
    qk_interleaved=False,
    num_experts=8,
    moe_type="GShard",  # Support: "GShard", "MegaBlock", "MegaBlock-D"
    moe_use_residual=False,
)

moe = dict(
    top_k=8,
    capacity_factor=1.0,
    eval_capacity_factor=1.0,
    min_capacity=4,
    noisy_gate_policy=None,
    drop_tokens=True,
    use_rts=True,
    use_fused_gating=False,
)

parallel = dict(
    zero1=dict(size=-1),
    tensor=dict(size=1, mode="mtp"),
    pipeline=dict(size=1, interleaved_overlap=True),
    weight=dict(size=1, overlap=True, memory_pool=True),
)
This is my config; it is basically copied verbatim from the 7B_MoE one.
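
If I understand the comment above correctly, the fusion being referred to would correspond to flipping use_fused_gating in the moe dict. A minimal sketch of that single change, untested on my side and assuming this flag is indeed the fusion switch; everything else is left as in my config:

# Sketch only: assuming use_fused_gating is the gate-fusion switch mentioned
# above, this is the one value changed relative to the config posted here.
moe = dict(
    top_k=8,
    capacity_factor=1.0,
    eval_capacity_factor=1.0,
    min_capacity=4,
    noisy_gate_policy=None,
    drop_tokens=True,
    use_rts=True,
    use_fused_gating=True,  # was False in the config above
)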

@blankde
Collaborator

blankde commented Jul 1, 2024

OK, I will try to reproduce it. How many GPUs are you running on?

@Cerberous
Author

OK, I will try to reproduce it. How many GPUs are you running on?

I'm running on a single node with 8× H800 GPUs.
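
For reference, a rough back-of-the-envelope MFU check for this setup. This is only a sketch: the H800 peak BF16 figure is an assumed vendor number and should be verified, and the achieved TFLOPS is whatever the training log reports per GPU.

# Rough per-GPU MFU estimate; not InternEvo's own accounting.
# Assumption: peak dense BF16 throughput of an H800 is taken as ~989 TFLOPS
# (commonly quoted vendor figure; verify for your SKU).
def mfu(achieved_tflops_per_gpu: float, peak_tflops_per_gpu: float = 989.0) -> float:
    """Model FLOPS utilization as a fraction of hardware peak."""
    return achieved_tflops_per_gpu / peak_tflops_per_gpu

# "A few dozen" TFLOPS, e.g. 40 per GPU, is roughly 4% MFU.
print(f"{mfu(40.0):.1%}")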
