Add GPU architectures to the build-info file #1840
Conversation
Signed-off-by: Partho Sarthi <[email protected]>
build/build-info
Outdated
@@ -32,6 +32,7 @@ echo_build_properties() {
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cd "$git_path" && find . -name libcudfjni.a -o -name libcudf.a | xargs -n 1 bash -c 'cuobjdump $1 || exit 0' _ | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u)
Explanation: This command searches the current directory for libcudfjni.a or libcudf.a files, runs cuobjdump on each match to extract the architecture information, and sorts the unique results numerically.
Result format: 70 75 80 86 90
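For illustration, a minimal sketch of the extraction stage run on made-up cuobjdump output (the arch = sm_XX lines below are hypothetical, not from a real library):

```
# cuobjdump prints one "arch = sm_XX" line per embedded architecture;
# splitting on "_" makes the numeric suffix field 2, and sort -n -u
# de-duplicates while ordering numerically.
printf 'arch = sm_90\narch = sm_70\narch = sm_70\n' \
  | grep 'arch = ' \
  | awk -F_ '{print $2}' \
  | sort -n -u
# prints:
# 70
# 90
```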
Why is this looking at both libcudfjni.a and libcudf.a? It should only be looking at the libcudf artifact for the cudf properties, IMO, as that's where almost all the kernels are. If libcudfjni.a has an arch that libcudf.a does not, we do not want to advertise it in the properties.
I don't see where this is handling the spark-rapids-jni kernels. Those are not in either libcudfjni.a or libcudf.a but rather in libcudf.so. The way this is written, the cudf and spark-rapids-jni properties will always have the same values. I'm OK with that, but it begs the question why the plugin would be checking both of them if they're known to be identical.
This should not be using find; it should be examining the known location where the artifact is placed before it is put into the jar artifact. find invites accidental errors where someone has a rogue binary somewhere in their repo (e.g., a temporary directory) and it accidentally adds an arch that is not in the final jar artifact.
@jlowe find is from my POC NVIDIA/spark-rapids#10540 (comment)
We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going into a build (including the nvcomp ones) and then compute the set intersection for the final device support check, which could be done at build or run time.
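As a rough sketch of that idea (illustrative only: the libnvcomp.so path is hypothetical, and comm(1) needs both inputs sorted the same way):

```
# Collect each library's arch set into a sorted temp file, then print
# only the architectures present in both (comm -12 = lines common to
# both inputs); this is the set intersection mentioned above.
a=$(mktemp); b=$(mktemp)
cuobjdump libcudf.so 2>/dev/null | awk -F_ '/arch =/ {print $2}' | sort -u > "$a"
cuobjdump libnvcomp.so 2>/dev/null | awk -F_ '/arch =/ {print $2}' | sort -u > "$b"
comm -12 "$a" "$b"
rm -f "$a" "$b"
```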
This should not be using find, it should be examining the known location where the artifact is placed before it is put into the jar artifact.

For cudf supported architectures, we should check ./thirdparty/cudf/cpp/build/libcudf.a.
For spark-rapids-jni, what is the difference between ./target/cmake-build/libcudf.so and ./target/classes/amd64/Linux/libcudf.so?

We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going into a build

We should probably use the set intersection of each lib. Can we ever have a case where spark-rapids-jni has different architectures supported from cudf?
For spark-rapids-jni, what is the difference between ./target/cmake-build/libcudf.so and ./target/classes/amd64/Linux/libcudf.so?

They are the same; you can pick either path. The former is where the library is originally built, the latter is where it is copied so it can be pulled into the resulting jar.

Can we ever have a case where spark-rapids-jni has different architectures supported from cudf?

Theoretically yes, but not easily the way it builds today. IMO we could easily overengineer this thing to solve a problem that doesn't need to be solved in practice. We could simply probe for the architectures in libcudf.so and call it a day. Is it 100% bulletproof? No; theoretically someone could go out of their way to build libcudf.a, libcudfjni.a, and libcudf.so differently, but I've never seen this in practice and arguably YAGNI applies. That avoids the "where are all the places to look" problem and the need to do set intersections.
Can we ever have a case where spark-rapids-jni has different architectures supported from cudf?

Yes: an inconsistent build environment after upmerges, branch switches, or other mistakes during a custom dev build; bugs in a prod build; plus things like nvcomp kernels built externally.
Changed the script delimiter to a semicolon and explicitly passed 'libcudf.a' and 'libcudf.so' as the library files for cudf and spark-rapids-jni to be processed by the build-info script.
My perception from internal chats is that devs routinely run into build situations that would be YAGNI, but we do not need to optimize for this 😄
build/build-info
Outdated
I am not in favor of the space-separated arch list. The Java properties file format allows a space to serve as the separator between the key and the value portion of a line:
jshell> var props = new java.util.Properties()
props ==> {}
jshell> props.load(new java.io.StringReader("gpu_architectures 70 75 80 86 90"))
jshell> props
props ==> {gpu_architectures=70 75 80 86 90}
I prefer a semicolon-separated value format without any whitespace in the value, 70;75;80;86;90, similar to the first example in https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES, and this is what we typically use for -DGPU_ARCHS passed to Maven.
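For context, a hypothetical Maven invocation using that convention; the quotes keep the shell from treating each ; as a command separator:

```
# Illustrative only: pass the semicolon-separated arch list to the build.
mvn package -DGPU_ARCHS="70;75;80;86;90"
```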
That makes sense. This would keep it consistent with current usage.
… build info script Signed-off-by: Partho Sarthi <[email protected]>
LGTM
build/build-info
Outdated
echo version=$version
echo user=$USER
echo revision=$(cd "$git_path" && git rev-parse HEAD)
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cd "$git_path" && cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
nit: not a git op and no need to cd?
Suggested change:
- echo gpu_architectures=$(cd "$git_path" && cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
+ echo gpu_architectures=$(cuobjdump "$library_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
I kept it as a safety measure, but the build paths are absolute. Refactored as suggested.
Signed-off-by: Partho Sarthi <[email protected]>
LGTM
build
Minor nit, not a must-fix.
build/build-info
Outdated
echo version=$version
echo user=$USER
echo revision=$(cd "$git_path" && git rev-parse HEAD)
echo branch=$(cd "$git_path" && git rev-parse --abbrev-ref HEAD)
echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo url=$(cd "$git_path" && git config --get remote.origin.url)
echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
Nit: awk can be leveraged to do the grep directly.
Suggested change:
- echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | grep 'arch = ' | awk -F_ '{print $2}' | sort -n -u | tr '\n' ';')
+ echo gpu_architectures=$(cuobjdump "$libcudf_path" 2>/dev/null | awk -F_ '/arch =/ {print $2}' | sort -n -u | tr '\n' ';')
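A quick sanity check of the combined form on made-up input (the arch = sm_XX lines are hypothetical):

```
# The /arch =/ pattern replaces the separate grep; output is the same
# numerically sorted, de-duplicated, semicolon-joined list.
printf 'arch = sm_75\nsome other line\narch = sm_70\narch = sm_75\n' \
  | awk -F_ '/arch =/ {print $2}' | sort -n -u | tr '\n' ';'
# prints: 70;75;
```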
48eb19a
build
LGTM
Fixes #10430. This PR ensures that Spark RAPIDS jobs are executed on supported GPU architectures without relying on manual configuration.

### Changes:
1. Processes the `gpu_architectures` property from the `*version-info.properties` file generated by the native builds.
2. Verifies that the user is running the job on an architecture supported by the cuDF and JNI libraries, and throws an exception if the architecture is unsupported.

### Testing
Tested on a Dataproc VM running on an NVIDIA P4 (GPU architecture 6.1):

```
24/03/06 17:44:58 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
24/03/06 17:45:10 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down!
java.lang.RuntimeException: Device architecture 61 is unsupported. Minimum supported architecture: 75.
	at com.nvidia.spark.rapids.RapidsPluginUtils$.checkGpuArchitectureInternal(Plugin.scala:366)
	at com.nvidia.spark.rapids.RapidsPluginUtils$.checkGpuArchitecture(Plugin.scala:375)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:461)
```

### Related PR
* NVIDIA/spark-rapids-jni#1840

Squashed commits:
* Add conf for minimum supported CUDA and error handling
* Revert "Add conf for minimum supported CUDA and error handling" (reverts commit 7b8eaea)
* Verify the GPU architecture is supported by the plugin libraries
* Use semi-colon as delimiter and use intersection of supported gpu architectures
* Allow for compatibility with major architectures
* Check for version as integers
* Modify compatibility check for same major version and same or higher minor version
* Add a config to skip verification and refactor checking
* Update RapidsConf.scala (Co-authored-by: Jason Lowe <[email protected]>)
* Update verification logic
* Update warning message
* Add unit tests and update warning message
* Update exception class
* Address review comments

Signed-off-by: Partho Sarthi <[email protected]>
Co-authored-by: Jason Lowe <[email protected]>
Related Issue: NVIDIA/spark-rapids#10430

Based on the discussion thread here, a new property named gpu_architectures has been added to the version-info.properties file. This property stores a semicolon-separated list of the GPU architectures supported by the cuDF and JNI plugin. The plugin will parse this property and verify that the user is running the Spark RAPIDS job on a supported architecture.
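For illustration, a rough command-line analogue of that verification (a sketch only: the real check lives in the plugin's Scala code and, per the commit log above, also allows same-major/higher-minor compatibility rather than the exact match shown here; the nvidia-smi compute_cap query field is assumed to be available):

```
# Read the semicolon-separated list from the generated properties file
# (file name taken from the output section below) and compare it with
# the device's compute capability ("8.6" -> "86"). Exact match only.
supported=$(grep '^gpu_architectures=' cudf-java-version-info.properties | cut -d= -f2)
device=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1 | tr -d '. ')
case ";$supported;" in
  *";$device;"*) echo "Device architecture $device is supported" ;;
  *) echo "Device architecture $device is unsupported. Supported: $supported" >&2; exit 1 ;;
esac
```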
Output
File: spark-rapids-jni-version-info.properties
File: cudf-java-version-info.properties