[Gpartition] Process failures not recognized between process launches. #874

Open
thvasilo opened this issue Jun 13, 2024 · 0 comments

Observed the following today:

subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.93.146 '(export DGL_IP_CONFIG=/ip_list.txt DGL_NUM_SERVER=1 PYTHONPATH=/graphstorm/python/:/root/dgl/tools/:  RANK=16 MASTER_ADDR=172.31.95.143 MASTER_PORT=12345; /opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py --world-size 20 --partitions-dir /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/partition_assignment --input-dir /mount/gsprocessing/spear-local-graph-04-23 --graph-name spear-global-graph-0423-features --schema updated_row_counts_metadata.json --num-parts 20 --output /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/dist_graph --process-group-timeout 1800 --log-level INFO --save-orig-nids --save-orig-eids )'' returned non-zero exit status 1.
cleanupu process runs
/opt/gs-venv/bin/python3 /root/dgl/tools/distgraphlaunch.py --ssh_port 2222  --num_proc_per_machine 1  --ip_config /ip_list.txt  --master_port 12345 "/opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py --world-size 20 --partitions-dir /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/partition_assignment --input-dir /mount/gsprocessing/spear-local-graph-04-23 --graph-name spear-global-graph-0423-features --schema updated_row_counts_metadata.json --num-parts 20 --output /mount/gpartition/spear-local-graph-04-23/dgl-2.3a240609/range/20-parts/dist_graph --process-group-timeout 1800 --log-level INFO --save-orig-nids --save-orig-eids "
INFO:root:DGL graph building took -3983.498513 sec
INFO:root:Copying raw_id_mappings to dist_graph
INFO:root:Partition assignment and DGL graph creation took 4722.130789 seconds
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)

Here the /opt/gs-venv/bin/python /root/dgl/tools/distpartitioning/data_proc_pipeline.py process (RANK=16 on 172.31.93.146) failed with exit status 1, but the failure was not detected: the parent process exited with a zero exit code, and the job reported success instead of failure.
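
For reference, a minimal sketch of the behavior we would want from a launcher: run every per-rank command, collect each exit code, and return non-zero if any of them failed. This is not the code in distgraphlaunch.py, and I have not checked how that script currently handles child exit codes. The names run_command and run_all are hypothetical, and the sketch assumes each remote command (for example an ssh invocation) is passed in as a plain shell string.

```python
from __future__ import annotations

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(cmd: str) -> int:
    """Run one shell command and return its exit code instead of raising."""
    # check=False: we want the return code back, not a CalledProcessError.
    completed = subprocess.run(cmd, shell=True, check=False)
    return completed.returncode


def run_all(cmds: list[str]) -> int:
    """Run all commands concurrently; return 0 only if every one succeeded."""
    if not cmds:
        return 0
    # One worker thread per command, mirroring one launched process per rank.
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        exit_codes = list(pool.map(run_command, cmds))
    failed = [(i, code) for i, code in enumerate(exit_codes) if code != 0]
    for i, code in failed:
        print(f"command {i} exited with status {code}", file=sys.stderr)
    # Return the first non-zero status so the caller can hand it to sys.exit().
    return failed[0][1] if failed else 0
```

A caller would end with sys.exit(run_all(cmds)), so that a single failing rank makes the whole job exit non-zero rather than report success.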

I recall that we have had trouble detecting process failures in general, so I am adding this issue to track them. We can add related failures here.

@classicsong classicsong added this to the 0.4 release milestone Aug 22, 2024
@classicsong classicsong removed the 0.4 label Oct 3, 2024
@classicsong classicsong removed this from the 0.4 release milestone Oct 3, 2024