Sampling Performance Testing #3584

Merged
merged 490 commits into from
Jan 12, 2024
Changes from 250 commits
Commits
c09bb25
bug fix
seunghwak Aug 3, 2023
4edb9ae
Merge branch 'branch-23.08' of github.com:rapidsai/cugraph into bug_mfg
seunghwak Aug 3, 2023
57fb8e5
Merge branch 'bug_mfg' of https://github.com/seunghwak/cugraph into p…
alexbarghi-nv Aug 3, 2023
3b95106
add latest updates
alexbarghi-nv Aug 3, 2023
3269a4f
Merge branch 'perf-testing-v2' of https://github.com/alexbarghi-nv/cu…
alexbarghi-nv Aug 3, 2023
3e009cd
bug fix (when edge list is empty)
seunghwak Aug 3, 2023
622a17a
Merge branch 'branch-23.08' of https://github.com/rapidsai/cugraph in…
alexbarghi-nv Aug 3, 2023
e4d7796
add latest updates
alexbarghi-nv Aug 9, 2023
a226a4e
revert cpp changes
alexbarghi-nv Aug 9, 2023
5d3843f
revert plc changes
alexbarghi-nv Aug 9, 2023
36464a9
revert notebook changes
alexbarghi-nv Aug 9, 2023
c5a81c2
Revert logging change
alexbarghi-nv Aug 9, 2023
95a72ab
correction for dataset name
alexbarghi-nv Aug 9, 2023
aebe742
fix for empty batch issue
alexbarghi-nv Aug 14, 2023
449984d
do merge
alexbarghi-nv Aug 14, 2023
bdaa22f
bring in changes
alexbarghi-nv Aug 15, 2023
223dee3
remove redundant filter function
alexbarghi-nv Aug 15, 2023
0c904ae
construct cugraph graph in CSC format
alexbarghi-nv Aug 16, 2023
399976d
fixes for csc, update tests
alexbarghi-nv Aug 16, 2023
6b1169e
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Aug 16, 2023
3c9afc9
style fix, add comment explaining function
alexbarghi-nv Aug 17, 2023
88831a8
Merge branch 'cugraph-pyg-loader-improvements' of https://github.com/…
alexbarghi-nv Aug 17, 2023
246ac33
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Aug 17, 2023
2fe3fe0
improve docstring
alexbarghi-nv Aug 17, 2023
f89a3fb
Merge branch 'cugraph-pyg-loader-improvements' of https://github.com/…
alexbarghi-nv Aug 17, 2023
072b1ff
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Aug 21, 2023
85a9c88
cleanup ahead of conversion to mg
alexbarghi-nv Aug 21, 2023
53b334b
mg work
alexbarghi-nv Aug 21, 2023
f0e9f1f
move sampling relatd functions in graph_functions.hpp to sampling_fun…
seunghwak Aug 22, 2023
3b1fd23
draft sampling post processing function APIs
seunghwak Aug 22, 2023
5e99823
mg
alexbarghi-nv Aug 23, 2023
7e4d041
resolve merge conflict
alexbarghi-nv Aug 23, 2023
d62f4f0
update to fix hop numbering issue
alexbarghi-nv Aug 24, 2023
67f4d7b
API updates
seunghwak Aug 24, 2023
5a8194e
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Aug 24, 2023
19f66d0
Persist on host memory
alexbarghi-nv Aug 25, 2023
8f521d2
API updates
seunghwak Aug 25, 2023
da3da9b
deprecate the existing renumber_sampeld_edgelist function
seunghwak Aug 25, 2023
0b87ee1
combine renumber & compression/sorting functions
seunghwak Aug 25, 2023
9b5950b
minor documentation updates
seunghwak Aug 25, 2023
5fbb177
mionr documentation updates
seunghwak Aug 25, 2023
b9611ab
deprecate the existing sampling output renumber function
seunghwak Aug 27, 2023
d1c1440
improvements
alexbarghi-nv Aug 29, 2023
846d3fd
Merge branch 'cugraph-pyg-loader-improvements' of https://github.com/…
alexbarghi-nv Aug 29, 2023
e52c614
split homogeneous/heterogeneous for better performance
alexbarghi-nv Aug 29, 2023
2e5479d
Merge branch 'cugraph-pyg-loader-improvements' of https://github.com/…
alexbarghi-nv Aug 29, 2023
6463445
add e2e test, fix a lot of bugs found by test
alexbarghi-nv Aug 29, 2023
c291110
style fix
alexbarghi-nv Aug 30, 2023
e9d1fcc
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Aug 30, 2023
8f95c79
Merge branch 'cugraph-pyg-loader-improvements' of https://github.com/…
alexbarghi-nv Aug 30, 2023
29aa194
correct docstrings
alexbarghi-nv Aug 30, 2023
99b6f48
Merge branch 'cugraph-pyg-loader-improvements' of https://github.com/…
alexbarghi-nv Aug 30, 2023
ebf0d9c
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Aug 30, 2023
26d48dd
rename sampling convert function
alexbarghi-nv Aug 30, 2023
0069d9d
Merge branch 'cugraph-pyg-loader-improvements' of https://github.com/…
alexbarghi-nv Aug 30, 2023
34d6bdc
update loader with new name
alexbarghi-nv Aug 30, 2023
baa8ea8
add comments to renumbering, clarify deprecation, add warning
alexbarghi-nv Aug 30, 2023
c3ee02b
initial implementation of sampling post processing
seunghwak Aug 31, 2023
04c9105
cuda::std::atomic=>cuda::atomic
seunghwak Aug 31, 2023
bdc840c
update API documentation
seunghwak Aug 31, 2023
8c304b3
add additional input testing
seunghwak Aug 31, 2023
b16a071
replace testing for sampling output post processing
seunghwak Aug 31, 2023
09a38d7
cosmetic updates
seunghwak Aug 31, 2023
82ad8e4
bug fixes
seunghwak Aug 31, 2023
e9b39e4
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Sep 1, 2023
d99b512
Merge branch 'fea_mfg' of https://github.com/seunghwak/cugraph into c…
alexbarghi-nv Sep 1, 2023
c15d580
the c api
alexbarghi-nv Sep 1, 2023
2ac8b86
work
alexbarghi-nv Sep 1, 2023
9135629
fix compile errors
alexbarghi-nv Sep 1, 2023
dfd1cb7
reformat
alexbarghi-nv Sep 1, 2023
6dfd4fe
rename test file from .cu to .cpp
seunghwak Sep 5, 2023
f600520
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Sep 6, 2023
7d5821f
bug fixes
seunghwak Sep 6, 2023
58189ed
add fill wrapper
seunghwak Sep 6, 2023
39db98a
undo adding fill wrapper
seunghwak Sep 6, 2023
98c8e0a
sampling test from .cpp to .cu
seunghwak Sep 6, 2023
687d191
latest perf testing
alexbarghi-nv Sep 7, 2023
c151f95
fix a typo
seunghwak Sep 7, 2023
fc5a4f0
Merge branch 'branch-23.10' of github.com:rapidsai/cugraph into fea_mfg
seunghwak Sep 7, 2023
a7d1804
merge
alexbarghi-nv Sep 7, 2023
3cda233
Merge branch 'branch-23.10' of https://github.com/rapidsai/cugraph in…
alexbarghi-nv Sep 7, 2023
0a18cde
do merge
alexbarghi-nv Sep 7, 2023
094aaf9
do not return valid nzd vertices if doubly_compress is false
seunghwak Sep 7, 2023
cf57a6d
bug fix
seunghwak Sep 8, 2023
2b48b7e
test code
seunghwak Sep 8, 2023
79acc8e
Merge branch 'branch-23.10' of github.com:rapidsai/cugraph into fea_mfg
seunghwak Sep 8, 2023
11009c6
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Sep 8, 2023
0481bfb
Merge branch 'branch-23.10' into cugraph-sample-convert
alexbarghi-nv Sep 8, 2023
2af9333
Merge branch 'fea_mfg' of https://github.com/seunghwak/cugraph into c…
alexbarghi-nv Sep 8, 2023
23cd2c2
bug fix
seunghwak Sep 8, 2023
6eaf67e
update documentation
seunghwak Sep 8, 2023
4dc0a92
fix c api issues
alexbarghi-nv Sep 11, 2023
2947b33
Merge branch 'branch-23.10' of https://github.com/rapidsai/cugraph in…
alexbarghi-nv Sep 11, 2023
0a2b2b7
C API fixes, Python/PLC API work
alexbarghi-nv Sep 11, 2023
db35940
adjust hop offsets when there is a jump in major vertex IDs between hops
seunghwak Sep 11, 2023
b8b72be
add sort only function
seunghwak Sep 12, 2023
38dd11e
Merge branch 'branch-23.10' of github.com:rapidsai/cugraph into fea_mfg
seunghwak Sep 12, 2023
2a799a6
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Sep 12, 2023
c86ceac
various improvements
alexbarghi-nv Sep 12, 2023
37a37bf
Merge branch 'fea_mfg' of https://github.com/seunghwak/cugraph into c…
alexbarghi-nv Sep 12, 2023
002fe93
fix merge conflict
alexbarghi-nv Sep 19, 2023
5051dfc
fix bad merge
alexbarghi-nv Sep 19, 2023
6cdf92b
asdf
alexbarghi-nv Sep 19, 2023
6682cb4
clarifying comments
alexbarghi-nv Sep 19, 2023
0d12a28
t
alexbarghi-nv Sep 19, 2023
f5733f2
latest code
alexbarghi-nv Sep 19, 2023
52e2f57
bug fix
seunghwak Sep 19, 2023
befeb25
Merge branch 'branch-23.10' of github.com:rapidsai/cugraph into bug_o…
seunghwak Sep 19, 2023
8781612
additional bug fix
seunghwak Sep 19, 2023
f92b5f5
add additional checking to detect the previously neglected bugs
seunghwak Sep 19, 2023
2bd93d9
Merge branch 'bug_offsets' of https://github.com/seunghwak/cugraph in…
alexbarghi-nv Sep 19, 2023
3195298
wrap up sg API
alexbarghi-nv Sep 20, 2023
74195cb
test fix, cleanup
alexbarghi-nv Sep 20, 2023
374b103
refactor code into new shared utility
alexbarghi-nv Sep 20, 2023
bd625e3
get mg api working
alexbarghi-nv Sep 20, 2023
b2a4ed1
add offset mg test
alexbarghi-nv Sep 20, 2023
9fb7438
fix renumber map issue in C++
alexbarghi-nv Sep 20, 2023
c770a17
verify new compression formats for sg
alexbarghi-nv Sep 20, 2023
b569563
complete csr/csc tests for both sg/mg
alexbarghi-nv Sep 20, 2023
ab2a185
get the bulk sampler working again
alexbarghi-nv Sep 20, 2023
89a1b33
remove unwanted file
alexbarghi-nv Sep 20, 2023
a9d46ef
fix wrong dataframe issue
alexbarghi-nv Sep 21, 2023
17e9013
update sg bulk sampler tests
alexbarghi-nv Sep 21, 2023
c5543b2
fix mg bulk sampler tests
alexbarghi-nv Sep 21, 2023
6581f47
Merge branch 'branch-23.10' into cugraph-pyg-loader-improvements
alexbarghi-nv Sep 21, 2023
16e83bc
write draft of csr bulk sampler
alexbarghi-nv Sep 21, 2023
1e7098d
overhaul the writer methods
alexbarghi-nv Sep 22, 2023
ae94c35
remove unused method
alexbarghi-nv Sep 22, 2023
7beba4b
style
alexbarghi-nv Sep 22, 2023
16ed5ef
Merge branch 'branch-23.10' of https://github.com/rapidsai/cugraph in…
alexbarghi-nv Sep 22, 2023
79e3cef
remove notebook
alexbarghi-nv Sep 22, 2023
fd5cceb
add clarifying comment to c++
alexbarghi-nv Sep 22, 2023
a47691d
add future warnings
alexbarghi-nv Sep 22, 2023
195d063
cleanup
alexbarghi-nv Sep 22, 2023
0af1750
remove print statements
alexbarghi-nv Sep 22, 2023
d65632c
fix c api bug
alexbarghi-nv Sep 22, 2023
247d8d2
revert dataloader change
alexbarghi-nv Sep 22, 2023
72bebc2
fix empty df bug
alexbarghi-nv Sep 22, 2023
4d51751
style
alexbarghi-nv Sep 22, 2023
9dfa3fa
io
alexbarghi-nv Sep 22, 2023
10c8c1f
fix test failures, remove c++ compression enum
alexbarghi-nv Sep 23, 2023
08cf3e1
remove removed api from mg tests
alexbarghi-nv Sep 23, 2023
897e6d6
change to future warning
alexbarghi-nv Sep 23, 2023
bb5e621
resolve checking issues
alexbarghi-nv Sep 23, 2023
d20e593
Merge branch 'cugraph-pyg-loader-improvements' into cugraph-pyg-mfg
alexbarghi-nv Sep 23, 2023
eb3aadc
fix wrong index + off by 1 error, add check in test
alexbarghi-nv Sep 25, 2023
a124964
Merge branch 'branch-23.10' into cugraph-sample-convert
alexbarghi-nv Sep 25, 2023
6990c23
add annotations
alexbarghi-nv Sep 25, 2023
920bed7
docstring correction
alexbarghi-nv Sep 25, 2023
f8df56f
remove empty batch check
alexbarghi-nv Sep 25, 2023
ef2ec5b
fix capi sg test
alexbarghi-nv Sep 25, 2023
8e22ab9
disable broken tests, they are too expensive to fix and redundant
alexbarghi-nv Sep 25, 2023
13bdd43
Merge branch 'cugraph-sample-convert' of https://github.com/alexbargh…
alexbarghi-nv Sep 25, 2023
c48a14b
Merge branch 'branch-23.10' of https://github.com/rapidsai/cugraph in…
alexbarghi-nv Sep 25, 2023
cf612c7
update c code
alexbarghi-nv Sep 25, 2023
09a3bd8
Merge branch 'branch-23.10' into cugraph-pyg-mfg
alexbarghi-nv Sep 26, 2023
140b6e4
Merge branch 'branch-23.10' of https://github.com/rapidsai/cugraph in…
alexbarghi-nv Sep 27, 2023
e4544b6
Merge branch 'branch-23.10' into cugraph-sample-convert
alexbarghi-nv Sep 27, 2023
0ee3798
Resolve merge conflict
alexbarghi-nv Sep 27, 2023
6212869
fix bad merge
alexbarghi-nv Sep 27, 2023
0f1a144
initial rewrite
alexbarghi-nv Sep 27, 2023
b369e97
fixes, more testing
alexbarghi-nv Sep 27, 2023
13be49c
fix issue with num nodes and edges
alexbarghi-nv Sep 27, 2023
185143c
e2e smoke test
alexbarghi-nv Sep 28, 2023
99efb9c
Merge branch 'branch-23.10' into cugraph-pyg-mfg
alexbarghi-nv Sep 28, 2023
bc1f30b
Merge branch 'cugraph-sample-convert' into perf-testing-v2
alexbarghi-nv Sep 28, 2023
9ea6c6b
Merge branch 'branch-23.10' of https://github.com/rapidsai/cugraph in…
alexbarghi-nv Sep 28, 2023
a127643
Merge branch 'cugraph-pyg-mfg' of https://github.com/alexbarghi-nv/cu…
alexbarghi-nv Sep 28, 2023
262d1da
fix test column name issues
alexbarghi-nv Sep 29, 2023
7a05c10
Merge branch 'branch-23.10' into cugraph-pyg-mfg
alexbarghi-nv Sep 29, 2023
c440f64
resolve merge conflicts
alexbarghi-nv Sep 29, 2023
d0d0cb2
copyright
alexbarghi-nv Sep 29, 2023
b4e6d06
testing
alexbarghi-nv Sep 29, 2023
20f138c
Merge branch 'perf-testing-v2' of https://github.com/alexbarghi-nv/cu…
alexbarghi-nv Sep 29, 2023
7e770ad
debugging
alexbarghi-nv Sep 29, 2023
4ac962d
perf testing
alexbarghi-nv Oct 2, 2023
55b4e84
regex
alexbarghi-nv Nov 15, 2023
0fd367a
Merge branch 'perf-testing-v2' of https://github.com/alexbarghi-nv/cu…
alexbarghi-nv Nov 15, 2023
894831e
update to latest
alexbarghi-nv Nov 15, 2023
3cad3f2
fixes
alexbarghi-nv Nov 15, 2023
912d6ca
node loader
alexbarghi-nv Nov 29, 2023
ea60f94
Merge branch 'branch-23.12' of https://github.com/rapidsai/cugraph in…
alexbarghi-nv Nov 29, 2023
9972619
finish patch
alexbarghi-nv Nov 29, 2023
1c401d1
merge latest
alexbarghi-nv Dec 1, 2023
02c7210
bulk sampling
alexbarghi-nv Dec 1, 2023
b67d5ed
perf testing
alexbarghi-nv Dec 5, 2023
da389e0
minor fixes
alexbarghi-nv Dec 6, 2023
e29b4e8
get the native workflow working
alexbarghi-nv Dec 6, 2023
d358257
wrap up first version of cugraph trainer
alexbarghi-nv Dec 7, 2023
e08c46c
remove stats file
alexbarghi-nv Dec 7, 2023
a9fc5af
Fixes
alexbarghi-nv Dec 8, 2023
49094db
x
alexbarghi-nv Dec 12, 2023
b8e2354
output multiple epochs, train/test/val
alexbarghi-nv Dec 12, 2023
0fd156b
remove unwanted file
alexbarghi-nv Dec 12, 2023
663febe
Merge branch 'perf-testing-v2' of https://github.com/alexbarghi-nv/cu…
alexbarghi-nv Dec 12, 2023
2a3ee5a
revert file
alexbarghi-nv Dec 12, 2023
b424e7c
remove unwanted file
alexbarghi-nv Dec 12, 2023
b727fcb
remove cmake files
alexbarghi-nv Dec 12, 2023
d37f0d7
train/test
alexbarghi-nv Dec 12, 2023
d0ca16b
reformat
alexbarghi-nv Dec 12, 2023
06dc14d
add scripts
alexbarghi-nv Dec 13, 2023
a5f1b67
Merge branch 'perf-testing-v2' of https://github.com/alexbarghi-nv/cu…
alexbarghi-nv Dec 13, 2023
ad83725
reorganize, add scripts
alexbarghi-nv Dec 13, 2023
e3d28a6
init
alexbarghi-nv Dec 13, 2023
d15a4d4
update
alexbarghi-nv Dec 14, 2023
70a509a
Merge branch 'pyg-nightly-input-nodes-fix' of https://github.com/alex…
alexbarghi-nv Dec 14, 2023
ecc2db1
cugraph
alexbarghi-nv Dec 26, 2023
726c81d
loader debug
alexbarghi-nv Dec 26, 2023
c095769
fix small bugs in cugraph-pyg
alexbarghi-nv Dec 26, 2023
4be1875
c
alexbarghi-nv Dec 26, 2023
59f030d
fix fanout issues
alexbarghi-nv Dec 26, 2023
4bc7f90
remove experimental warnings
alexbarghi-nv Dec 27, 2023
a58d358
remove test files
alexbarghi-nv Dec 27, 2023
318212d
data preprocessing
alexbarghi-nv Dec 27, 2023
68ca511
commit
alexbarghi-nv Dec 27, 2023
dbbd791
Merge branch 'dlfw-patch-24.01' of https://github.com/alexbarghi-nv/c…
alexbarghi-nv Dec 27, 2023
d47c3ba
comment
alexbarghi-nv Dec 27, 2023
367c79c
fixing issues impacting accuracy
alexbarghi-nv Dec 29, 2023
ac1cfbd
add readme
alexbarghi-nv Dec 29, 2023
cc2635b
refactor
alexbarghi-nv Dec 29, 2023
f1ce3e1
Fix mixed experimental import
alexbarghi-nv Dec 29, 2023
e38fe66
update readme
alexbarghi-nv Dec 29, 2023
f3f68bd
update readme
alexbarghi-nv Dec 29, 2023
d2734c4
fix environment variables
alexbarghi-nv Dec 29, 2023
7222cba
remove unwanted file
alexbarghi-nv Dec 29, 2023
c2e8520
minor change to avoid timeout
alexbarghi-nv Dec 29, 2023
a4dad32
remove stats file
alexbarghi-nv Jan 3, 2024
2109bfb
Merge branch 'perf-testing-v2' of https://github.com/alexbarghi-nv/cu…
alexbarghi-nv Jan 3, 2024
6358f9b
switch versions of simple distributed graph for 24.02
alexbarghi-nv Jan 3, 2024
3898cb2
remove test python file
alexbarghi-nv Jan 3, 2024
3f266f5
remove mg utils dir
alexbarghi-nv Jan 3, 2024
864e55e
wait for workers
alexbarghi-nv Jan 3, 2024
67d6aa0
reformat
alexbarghi-nv Jan 3, 2024
78fc260
add copyrights
alexbarghi-nv Jan 3, 2024
d81a9a8
fix wrong file
alexbarghi-nv Jan 3, 2024
16f225a
remove stats file
alexbarghi-nv Jan 3, 2024
259ec47
Merge branch 'branch-24.02' into perf-testing-v2
alexbarghi-nv Jan 5, 2024
18571fe
fix copyright
alexbarghi-nv Jan 5, 2024
40502de
split off feature transfer time
alexbarghi-nv Jan 5, 2024
ea46748
style
alexbarghi-nv Jan 5, 2024
61f30a2
Merge branch 'branch-24.02' into perf-testing-v2
alexbarghi-nv Jan 5, 2024
89ac530
fixes to scripts
alexbarghi-nv Jan 8, 2024
77b0788
compatibility issues
alexbarghi-nv Jan 8, 2024
4e2a706
reset file
alexbarghi-nv Jan 8, 2024
18e43de
c
alexbarghi-nv Jan 8, 2024
c4c45db
copyright
alexbarghi-nv Jan 8, 2024
8ea5c92
whitespace
alexbarghi-nv Jan 8, 2024
441810c
set nthreads to 8
alexbarghi-nv Jan 9, 2024
c053ed0
Merge branch 'branch-24.02' into perf-testing-v2
alexbarghi-nv Jan 9, 2024
3039843
Merge branch 'branch-24.02' into perf-testing-v2
alexbarghi-nv Jan 11, 2024
1 change: 1 addition & 0 deletions benchmarks/cugraph/standalone/bulk_sampling/.gitignore
@@ -0,0 +1 @@
mg_utils/
70 changes: 58 additions & 12 deletions benchmarks/cugraph/standalone/bulk_sampling/README.md
@@ -1,11 +1,13 @@
# cuGraph Bulk Sampling
# cuGraph Sampling Benchmarks

## Overview
## cuGraph Bulk Sampling

### Overview
The `cugraph_bulk_sampling.py` script runs the bulk sampler for a variety of datasets, including
both generated (rmat) datasets and disk (ogbn_papers100M, etc.) datasets. It can also load
replicas of these datasets to create a larger benchmark (i.e. ogbn_papers100M x2).

## Arguments
### Arguments
The script takes a variety of arguments to control sampling behavior.
Required:
--output_root
@@ -51,14 +53,8 @@ Optional:
Seed for random number generation.
Defaults to '62'

--persist
Whether to aggressively use persist() in dask to make the ETL steps (NOT PART OF SAMPLING) faster.
Will probably make this script finish sooner at the expense of memory usage, but won't affect
sampling time.
Changing this is not recommended unless you know what you are doing.
Defaults to False.

## Input Format
### Input Format
The script expects its input data in the following format:
```
<top level directory>
@@ -103,14 +99,64 @@ the parquet files. It must have the following format:
}
```

## Output Meta
### Output Meta
The script, in addition to the samples, will also output a file named `output_meta.json`.
This file contains various statistics about the sampling run, including the runtime,
as well as information about the dataset and system that the samples were produced from.

This metadata file can be used to gather the results from the sampling and training stages
together.
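
As a minimal sketch of how this file might be consumed, the snippet below loads `output_meta.json` and lists its top-level keys. The location of the file (directly under the sampling output directory) and the helper names `load_output_meta` and `list_meta_fields` are assumptions made for illustration; no specific field names are assumed.

```python
import json
from pathlib import Path


def load_output_meta(sample_dir):
    """Load output_meta.json from a sampling output directory.

    Assumption: the file sits directly under `sample_dir`.
    """
    meta_path = Path(sample_dir) / "output_meta.json"
    with open(meta_path) as f:
        return json.load(f)


def list_meta_fields(meta):
    """Return the top-level keys in sorted order, without assuming any field."""
    return sorted(meta.keys())
```

Inspecting the keys first is a cautious way to discover which statistics a given run recorded before joining them with training results.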

## Other Notes
### Other Notes
For rmat datasets, you will need to generate your own bogus features in the training stage.
Since that is trivial, that is not done in this sampling script.

## cuGraph MNMG Training

### Overview
The script `run_train_job.sh` runs with the `sbatch` command to launch a series of slurm jobs.
First, for a given number of epochs, the script will produce samples for a given graph.
Then, the training process starts where samples are loaded and training iterations are
processed.

### Important Notes
Downloading the dataset files before running the slurm jobs is highly recommended. Even though
the script will attempt to download the files if they are not available, this can often
lead to a timeout which will crash the scripts. This applies regardless of whether you are training
with native PyG or cuGraph-PyG. You can download data as follows:

```
from ogb.nodeproppred import NodePropPredDataset
dataset = NodePropPredDataset('ogbn-papers100M', root='/home/username/datasets')
```

For datasets other than ogbn-papers100M, follow the same process and change only the dataset name.
The dataset will be preprocessed when you run training. On a slow system, you can instead run
preprocessing ahead of time by running the training script on a single worker, which avoids the
timeout that would otherwise crash the script.

The multi-GPU utilities are in `mg_utils` in the top level of the cuGraph repository. You should either
copy them to this directory or symlink to them before running the scripts.
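
One way to create the symlink from Python is sketched below. The function name `link_mg_utils` and both path arguments are hypothetical; adjust them to match your checkout.

```python
import os
from pathlib import Path


def link_mg_utils(repo_root, bench_dir):
    """Symlink <repo_root>/mg_utils into <bench_dir>/mg_utils.

    Both paths are supplied by the caller; nothing here is specific to cuGraph.
    Returns the path of the link (or of an existing mg_utils entry).
    """
    src = (Path(repo_root) / "mg_utils").resolve()
    dst = Path(bench_dir) / "mg_utils"

    if not src.is_dir():
        raise FileNotFoundError(f"mg_utils not found at {src}")

    # Leave an existing copy or link untouched.
    if dst.exists() or dst.is_symlink():
        return dst

    os.symlink(src, dst, target_is_directory=True)
    return dst
```

A symlink keeps the utilities in sync with the repository, whereas a copy must be refreshed by hand after each update.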

### Arguments
You will need to modify the bash scripts to run appropriately for your environment and
desired training workflow. The standard sbatch arguments are at the top of the script, such as
job name, queue, etc. These will need to be modified for your SLURM cluster.

Next are arguments for the container image (required),
and directories where the data and outputs are stored. The directories default to subdirectories
of the current working directory. But if there is a high-throughput storage system available,
using that storage for the samples and datasets is highly recommended.

Next are standard GNN training arguments such as `FANOUT`, `BATCH_SIZE`, etc. You can also set
the number of training epochs here. These are followed by the `REPLICATION_FACTOR` argument, which
can be used to create replications of the dataset for scale testing purposes.
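
For reference, `bench_cugraph_training.py` in this change receives the fanout as an underscore-separated string (its `--fanout` default is `"10_10_10"`) and splits it into one integer per hop. The sketch below mirrors that parsing, assuming `FANOUT` is passed through in the same format; the function name `parse_fanout` is introduced here for illustration only.

```python
from typing import List


def parse_fanout(fanout: str) -> List[int]:
    """Split an underscore-separated fanout string into one integer per hop."""
    return [int(f) for f in fanout.split("_")]


# Three hops, sampling up to 10 neighbors at each hop.
three_hop = parse_fanout("10_10_10")
```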

The final two arguments are `FRAMEWORK` which can be either "cuGraphPyG" or "PyG", and `GPUS_PER_NODE`

Member: I assume we shall include "cuGraphDGL" here too.

Member Author: Yes

Member Author: in the next PR

which must be set to the correct value, even if this is provided by a SLURM argument. If `GPUS_PER_NODE`
is not set to the correct number of GPUs, the script will hang indefinitely until it times out. Mismatched
GPUs per node is currently unsupported by this script but should be possible in practice.
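
The training script in this change derives the world size as the number of SLURM nodes multiplied by `gpus_per_node`, so a wrong value presumably leaves the job waiting for ranks that never start. The sketch below shows that arithmetic and a hypothetical pre-flight check; `check_gpus_per_node` and its `visible_gpus` argument are not part of the script.

```python
def expected_world_size(num_nodes: int, gpus_per_node: int) -> int:
    """World size as the benchmark script computes it: nodes times GPUs per node."""
    return num_nodes * gpus_per_node


def check_gpus_per_node(gpus_per_node: int, visible_gpus: int) -> None:
    """Fail fast if the configured value differs from the GPUs actually visible.

    Hypothetical helper: the caller supplies `visible_gpus`.
    """
    if gpus_per_node != visible_gpus:
        raise ValueError(
            f"GPUS_PER_NODE={gpus_per_node} but {visible_gpus} GPUs are visible"
        )
```

On a node with PyTorch installed, `visible_gpus` could be taken from `torch.cuda.device_count()`, which turns a silent hang into an immediate error.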

### Output
The results of training are written to the logs directory, with an `output.txt` file for each worker.
These files are overwritten on each run. Accuracy is only reported on rank 0.
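
Separately from the logs, `bench_cugraph_training.py` in this change dumps each rank's training stats as JSON to a file named `<output_file>[<rank>]`. A minimal sketch for collecting those files follows; the helper name `load_rank_stats` is hypothetical, and the contents of each stats object are treated as opaque because their structure is not described here.

```python
import json


def load_rank_stats(output_file: str, world_size: int) -> dict:
    """Read the per-rank JSON stats files written as '<output_file>[<rank>]'."""
    stats = {}
    for rank in range(world_size):
        # Open the path directly; the literal brackets would confuse glob patterns.
        with open(f"{output_file}[{rank}]") as f:
            stats[rank] = json.load(f)
    return stats
```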
251 changes: 251 additions & 0 deletions benchmarks/cugraph/standalone/bulk_sampling/bench_cugraph_training.py
@@ -0,0 +1,251 @@
# Copyright (c) 2023-2024, NVIDIA CORPORATION.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

os.environ["RAPIDS_NO_INITIALIZE"] = "1"
os.environ["CUDF_SPILL"] = "1"
os.environ["LIBCUDF_CUFILE_POLICY"] = "KVIKIO"
os.environ["KVIKIO_NTHREADS"] = "64"

import argparse
import json
import warnings

import torch
import numpy as np
import pandas

import torch.distributed as dist

from datasets import OGBNPapers100MDataset

from cugraph.testing.mg_utils import enable_spilling


def init_pytorch_worker(rank: int, use_rmm_torch_allocator: bool = False) -> None:
    import cupy
    import rmm
    from pynvml.smi import nvidia_smi

    smi = nvidia_smi.getInstance()
    pool_size = 16e9  # FIXME calculate this

    rmm.reinitialize(
        devices=[rank],
        pool_allocator=True,
        initial_pool_size=pool_size,
    )

    if use_rmm_torch_allocator:
        warnings.warn(
            "Using the rmm pytorch allocator is currently unsupported."
            " The default allocator will be used instead."
        )
        # FIXME somehow get the pytorch allocator to work
        # from rmm.allocators.torch import rmm_torch_allocator
        # torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

    from rmm.allocators.cupy import rmm_cupy_allocator

    cupy.cuda.set_allocator(rmm_cupy_allocator)

    cupy.cuda.Device(rank).use()
    torch.cuda.set_device(rank)

    # Pytorch training worker initialization
    torch.distributed.init_process_group(backend="nccl")


def parse_args():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--gpus_per_node",
        type=int,
        default=8,
        help="# GPUs per node",
        required=False,
    )

    parser.add_argument(
        "--num_epochs",
        type=int,
        default=1,
        help="Number of training epochs",
        required=False,
    )

    parser.add_argument(
        "--batch_size",
        type=int,
        default=512,
        help="Batch size",
        required=False,
    )

    parser.add_argument(
        "--fanout",
        type=str,
        default="10_10_10",
        help="Fanout",
        required=False,
    )

    parser.add_argument(
        "--sample_dir",
        type=str,
        help="Directory with stored bulk samples (required for cuGraph run)",
        required=False,
    )

    parser.add_argument(
        "--output_file",
        type=str,
        help="File to store results",
        required=True,
    )

    parser.add_argument(
        "--framework",
        type=str,
        help="The framework to test (PyG, cuGraphPyG)",
        required=True,
    )

    parser.add_argument(
        "--model",
        type=str,
        default="GraphSAGE",
        help="The model to use (currently only GraphSAGE supported)",
        required=False,
    )

    parser.add_argument(
        "--replication_factor",
        type=int,
        default=1,
        help="The replication factor for the dataset",
        required=False,
    )

    parser.add_argument(
        "--dataset_dir",
        type=str,
        help="The directory where datasets are stored",
        required=True,
    )

    parser.add_argument(
        "--train_split",
        type=float,
        help="The percentage of the labeled data to use for training. The remainder is used for testing/validation.",
        default=0.8,
        required=False,
    )

    parser.add_argument(
        "--val_split",
        type=float,
        help="The percentage of the testing/validation data to allocate for validation.",
        default=0.5,
        required=False,
    )

    return parser.parse_args()


def main(args):
    import logging

    logging.basicConfig(
        level=logging.INFO,
    )
    logger = logging.getLogger("bench_cugraph_training")
    logger.setLevel(logging.INFO)

    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])

    init_pytorch_worker(
        local_rank, use_rmm_torch_allocator=(args.framework == "cuGraph")
    )
    enable_spilling()
    print("worker initialized")
    dist.barrier()

    world_size = int(os.environ["SLURM_JOB_NUM_NODES"]) * args.gpus_per_node

    dataset = OGBNPapers100MDataset(
        replication_factor=args.replication_factor,
        dataset_dir=args.dataset_dir,
        train_split=args.train_split,
        val_split=args.val_split,
        load_edge_index=(args.framework == "PyG"),
    )

    if global_rank == 0:
        dataset.download()
    dist.barrier()

    fanout = [int(f) for f in args.fanout.split("_")]

    if args.framework == "PyG":
        from trainers.pyg import PyGNativeTrainer

        trainer = PyGNativeTrainer(
            model=args.model,
            dataset=dataset,
            device=local_rank,
            rank=global_rank,
            world_size=world_size,
            num_epochs=args.num_epochs,
            shuffle=True,
            replace=False,
            num_neighbors=fanout,
            batch_size=args.batch_size,
        )
    elif args.framework == "cuGraphPyG":
        sample_dir = os.path.join(
            args.sample_dir,
            f"ogbn_papers100M[{args.replication_factor}]_b{args.batch_size}_f{fanout}",
        )
        from trainers.pyg import PyGCuGraphTrainer

        trainer = PyGCuGraphTrainer(
            model=args.model,
            dataset=dataset,
            sample_dir=sample_dir,
            device=local_rank,
            rank=global_rank,
            world_size=world_size,
            num_epochs=args.num_epochs,
            shuffle=True,
            replace=False,
            num_neighbors=fanout,
            batch_size=args.batch_size,
        )
    else:
        raise ValueError("unsupported framework")

    logger.info(f"Trainer ready on rank {global_rank}")
    stats = trainer.train()
    logger.info(stats)

    with open(f"{args.output_file}[{global_rank}]", "w") as f:
        json.dump(stats, f)


if __name__ == "__main__":
    args = parse_args()
    main(args)
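
The help text for `--train_split` and `--val_split` above suggests a two-stage split: a fraction of the labeled data goes to training, and the remainder is divided between validation and testing. The sketch below works through that arithmetic under this reading; how `OGBNPapers100MDataset` actually applies the two values is not shown in this file, and `split_fractions` is a hypothetical helper.

```python
def split_fractions(train_split: float = 0.8, val_split: float = 0.5):
    """Return (train, val, test) fractions of the labeled data.

    Assumes `val_split` is applied to whatever `train_split` leaves over.
    """
    remainder = 1.0 - train_split
    val = remainder * val_split
    test = remainder - val
    return train_split, val, test
```

With the defaults, this reading gives roughly an 80/10/10 train/validation/test split.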