-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sampling Performance Testing #3584
Sampling Performance Testing #3584
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested minor changes, mostly looks good.
the number of training epochs here. These are followed by the `REPLICATION_FACTOR` argument, which | ||
can be used to create replications of the dataset for scale testing purposes. | ||
|
||
The final two arguments are `FRAMEWORK` which can be either "cuGraphPyG" or "PyG", and `GPUS_PER_NODE` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we shall include "cuGraphDGL" here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the next PR
benchmarks/cugraph/standalone/bulk_sampling/bench_cugraph_training.py
Outdated
Show resolved
Hide resolved
SCRIPTS_DIR=$4 | ||
NUM_EPOCHS=$5 | ||
|
||
SAMPLES_DIR=/samples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming we are mounting this to the most performant path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is up to us to set LOGS_DIR
, SAMPLES_DIR
, and DATASETS_DIR
in run_train_job.sh
correctly. In the srun
command, those are mounted to /logs
, /samples
, and /datasets
in the container that this script runs in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. I don't feel too strongly, but I noticed several places in the shell scripts that assume things about the file system (/datasets
, etc.). We usually put those scripts in another repo (the repo containing our machine-specific nightly scripts, etc.) and not the open cugraph repo.
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why we need to update the copyright on a file that wasn't otherwise updated... but OK.
/merge |
…formance Improvements (#4081) Large-scale cuGraph-DGL performance testing scripts. Also changes the DGL and PyG scripts to evaluate on all ranks and reuse the test samples, and adds support for benchmarking cuGraph-DGL/cuGraph-PyG with WholeGraph. Updates `cuGraph.gnn.FeatureStore` and `cuGraph-PyG` for increased performance: * Supporting passing in a WG embedding directly to cugraph.gnn.FeatureStore * Simplifying how cuGraph-PyG handles filtering and using a cache to prevent repeatedly copying data between the device and host * Fix bug in cugraph.gnn.FeatureStore where indexing with a gpu tensor would raise an exception, especially with WG * Add a function to cugraph.gnn.FeatureStore to check where data is stored, which is used by cuGraph-PyG to prevent unnecessary d2h and h2d copies Merge after #3584 Authors: - Alex Barghi (https://github.com/alexbarghi-nv) - Seunghwa Kang (https://github.com/seunghwak) - Vibhu Jawa (https://github.com/VibhuJawa) - Brad Rees (https://github.com/BradReesWork) Approvers: - Vibhu Jawa (https://github.com/VibhuJawa) - Don Acosta (https://github.com/acostadon) - Brad Rees (https://github.com/BradReesWork) - Naim (https://github.com/naimnv) - Joseph Nke (https://github.com/jnke2016) URL: #4081
Adds performance benchmarking scripts for testing MNMG cuGraph GNN workflows.
This branch is the head branch for the cuGraph benchmarking effort. All work supporting the benchmarks should be merged into this branch. It will be merged into branch-24.02 once all features are ready.
Includes patches to cuGraph-PyG required for the latest DLFW container.
To-Do:
Add WholeGraph training portionDeferred to future PR (see Add WholeGraph Support alexbarghi-nv/cugraph#6)Add WholeGraph generatorsIncluded in aboveSupport DGLDeferred to future PRUse appropriate docker containersDeferred, waiting on DLFW releaseCloses #3839