@ouyang11111
Are you familiar with PyTorch's `torchrun` scripts? You should fill in your training settings (how many nodes, the rank, ...) in `--nnodes=... --node_rank=... --master_addr=... --master_port=...`:
- `--nnodes` is the number of nodes you use
- `--node_rank` is the rank of the current node (0-indexed)
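These launcher flags end up as environment variables inside every worker process that `torchrun` spawns, which is how the training script (via `torch.distributed`) learns its rank and the rendezvous address. A minimal sketch of reading them from a worker — the helper name `get_dist_info` is my own illustration, not part of the repo:

```python
import os

def get_dist_info():
    """Read the distributed settings that torchrun injects into each worker.

    Defaults correspond to a plain single-process run without torchrun.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),            # global rank across all nodes
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # rank within this node (picks the GPU)
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # nnodes * nproc_per_node
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": os.environ.get("MASTER_PORT", "29500"),
    }
```

For example, with `--nnodes=2 --nproc_per_node=4`, the workers see `WORLD_SIZE=8` and global ranks 0 through 7.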
How do I set `master_addr` and `master_port` when using only one server with 4 RTX 3090 cards?
When I run `torchrun --nproc_per_node=4 --nnodes=1 --node_rank=3 --master_addr=192.168.1.1 --master_port=12345 train.py --depth=16 --bs=768 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1`, it gets stuck for a long time.
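The hang is consistent with passing `--node_rank=3` to a single-node run: with `--nnodes=1` the only valid node rank is 0, and the rendezvous waits forever for nodes that don't exist. A hedged sketch of the launch for one server with 4 GPUs, reusing the `train.py` flags from the command above:

```shell
# Single-node, 4-GPU launch: node_rank must be 0 when nnodes=1,
# and master_addr can be the local machine since there are no other nodes.
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 \
  --master_addr=127.0.0.1 --master_port=12345 \
  train.py --depth=16 --bs=768 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1
```

On a recent PyTorch you can also drop `--master_addr`/`--master_port` entirely, or pass `--standalone`, and `torchrun` will rendezvous locally on its own.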
`torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py --depth=16 --bs=768 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1` fails to run. Please provide complete instructions.