
[Bug] Extremely low TFLOPS when training with MoE #264

Open
Cerberous opened this issue Jun 28, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Cerberous

Describe the bug

When training an MoE model, throughput is only a few dozen TFLOPS; normal (dense) training reaches the expected throughput.

Environment information

Official image and code

Other information

No response

Cerberous added the bug label Jun 28, 2024
@blankde
Collaborator

blankde commented Jun 28, 2024

@Cerberous Could you share your config? Are you using the config/7b_Moe4_sft configuration? The MoE gate computation involves many small operators; without fusing them, and with the all2all communication overhead on top, MoE's MFU is currently only about half that of a dense model.

@Cerberous
Author

@Cerberous Could you share your config? Are you using the config/7b_Moe4_sft configuration? The MoE gate computation involves many small operators; without fusing them, and with the all2all communication overhead on top, MoE's MFU is currently only about half that of a dense model.

model = dict(
    num_chunks=1,  # if num_chunks > 1, interleaved pipeline scheduler is used.
    checkpoint=False,  # The proportion of layers for activation checkpointing; the optional values are True/False/[0-1]
    dtype="torch.bfloat16",  # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
    max_position_embeddings=SEQ_LEN,
    embed_split_hidden=True,
    num_layers=NUM_LAYER,
    hidden_size=HIDDEN_SIZE,
    vocab_size=VOCAB_SIZE,
    embed_grad_scale=1,
    parallel_output=True,
    num_attention_heads=NUM_ATTENTION_HEAD,
    # num_kv_attention_heads=NUM_ATTENTION_HEAD // 2,
    mlp_ratio=MLP_RATIO,
    multiple_of=MULTIPLE_OF,
    norm_type="rmsnorm",
    # adapt_hf=True,
    apply_post_layer_norm=False,
    # no_bias=True,
    layer_norm_epsilon=1e-5,
    # rope_base=10000,
    # norm_head=False,
    use_flash_attn=True,
    mlp_layer_fusion=True,
    qk_interleaved=False,
    num_experts=8,
    moe_type="GShard",  # Support: "GShard", "MegaBlock", "MegaBlock-D"
    moe_use_residual=False,
)

moe = dict(
    top_k=8,
    capacity_factor=1.0,
    eval_capacity_factor=1.0,
    min_capacity=4,
    noisy_gate_policy=None,
    drop_tokens=True,
    use_rts=True,
    use_fused_gating=False,
)

parallel = dict(
    zero1=dict(size=-1),
    tensor=dict(size=1, mode="mtp"),
    pipeline=dict(size=1, interleaved_overlap=True),
    weight=dict(size=1, overlap=True, memory_pool=True),
)
This is my config; it is basically copied verbatim from the 7B_MoE one.
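
If I understand the comment above correctly, the fusion being referred to would correspond to flipping use_fused_gating in the moe dict. A minimal sketch of that single change, untested on my side and assuming this flag is indeed the fusion switch; everything else is left as in my config:

# Sketch only: assuming use_fused_gating is the gate-fusion switch mentioned
# above, this is the one value changed relative to the config posted here.
moe = dict(
    top_k=8,
    capacity_factor=1.0,
    eval_capacity_factor=1.0,
    min_capacity=4,
    noisy_gate_policy=None,
    drop_tokens=True,
    use_rts=True,
    use_fused_gating=True,  # was False in the config above
)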

@blankde
Collaborator

blankde commented Jul 1, 2024

OK, I will try to reproduce it. How many GPUs are you running on?

@Cerberous
Author

OK, I will try to reproduce it. How many GPUs are you running on?

I'm running on a single node with 8× H800 GPUs.
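
For reference, a rough back-of-the-envelope MFU check for this setup. This is only a sketch: the H800 peak BF16 figure is an assumed vendor number and should be verified, and the achieved TFLOPS is whatever the training log reports per GPU.

# Rough per-GPU MFU estimate; not InternEvo's own accounting.
# Assumption: peak dense BF16 throughput of an H800 is taken as ~989 TFLOPS
# (commonly quoted vendor figure; verify for your SKU).
def mfu(achieved_tflops_per_gpu: float, peak_tflops_per_gpu: float = 989.0) -> float:
    """Model FLOPS utilization as a fraction of hardware peak."""
    return achieved_tflops_per_gpu / peak_tflops_per_gpu

# "A few dozen" TFLOPS, e.g. 40 per GPU, is roughly 4% MFU.
print(f"{mfu(40.0):.1%}")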
