subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.93.146 '(export DGL_IP_CONFIG=/ip_list.txt DGL_NUM_SERVER=1 PYTHONPATH=/graphstorm/python/:/root/dgl/tools/: RANK=16 MASTER_ADDR=172.31.95.143 MASTER_PORT=12345; /opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py --world-size 20 --partitions-dir /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/partition_assignment --input-dir /mount/gsprocessing/spear-local-graph-04-23 --graph-name spear-global-graph-0423-features --schema updated_row_counts_metadata.json --num-parts 20 --output /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/dist_graph --process-group-timeout 1800 --log-level INFO --save-orig-nids --save-orig-eids )'' returned non-zero exit status 1.
cleanupu process runs
/opt/gs-venv/bin/python3 /root/dgl/tools/distgraphlaunch.py --ssh_port 2222 --num_proc_per_machine 1 --ip_config /ip_list.txt --master_port 12345 "/opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py --world-size 20 --partitions-dir /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/partition_assignment --input-dir /mount/gsprocessing/spear-local-graph-04-23 --graph-name spear-global-graph-0423-features --schema updated_row_counts_metadata.json --num-parts 20 --output /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/dist_graph --process-group-timeout 1800 --log-level INFO --save-orig-nids --save-orig-eids "
INFO:root:DGL graph building took -3983.498513 sec
INFO:root:Copying raw_id_mappings to dist_graph
INFO:root:Partition assignment and DGL graph creation took 4722.130789 seconds
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable. Valid options are: pytorch, mxnet, tensorflow (all lowercase)
Observed today: in the log above, the `/opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py` process (the RANK=16 command) failed with exit status 1, but the failure was not detected. The parent process exited with a zero exit code, and the job reported success instead of failure.
I recall that we have had trouble detecting process failures in general, so I am opening this issue to track the problem. Related failures can be added here.
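For reference, here is a minimal sketch of the kind of check that would surface this failure. It is not the actual `distgraphlaunch.py` or GraphStorm code; `run_all`, `exit_code_for`, and `main` are hypothetical names, and it assumes each per-rank command is launched as a child process whose return code can be collected. The idea is only that the parent inspects every child's exit status and exits non-zero if any child failed.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands):
    """Run every command concurrently and return their exit codes in order."""
    def run_one(cmd):
        # subprocess.run does not raise on a non-zero exit unless check=True,
        # so the return code is collected and inspected explicitly.
        return subprocess.run(cmd).returncode

    # One worker thread per command; max(1, ...) avoids max_workers=0.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(run_one, commands))


def exit_code_for(return_codes):
    """Return 0 only if every child exited with 0, otherwise 1."""
    return 0 if all(rc == 0 for rc in return_codes) else 1


def main(commands):
    # Hypothetical entry point: propagate any child failure to the caller,
    # so the surrounding job is marked as failed rather than successful.
    sys.exit(exit_code_for(run_all(commands)))
```

With something along these lines, a single rank returning exit status 1 (as RANK=16 did above) would make the parent exit with a non-zero code instead of 0.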