Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GPU architectures to the build-info file #1840
Add GPU architectures to the build-info file #1840
Changes from 1 commit
4daf3f8
82cfc38
176c5d6
48eb19a
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explanation:
This command searches for files
libcudfjni.a
orlibcudf.a
in the current directory, then runscuobjdump
on each file to extract architecture information and sorts the unique results numerically.Result Format:
70 75 80 86 90
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this looking at both libcudfjni.a and libcudf.a? It should only be looking only at the libcudf artifact for the cudf properties, IMO, as that's where almost all the kernels are. If libcudfjni.a has an arch libcudf.a does not, we do not want to advertise it in the properties.
I don't see where this is handling the spark-rapids-jni kernels. Those are not in either libcudfjni.a or libcudf.a but rather in libcudf.so. The way this is written, cudf and spark-rapids-jni properties will always have the same property values. I'm OK with that, but it begs the question why the plugin would be checking both of them if they're known to be identical.
This should not be using
find
, it should be examining the known location where the artifact is placed before it is put into the jar artifact.find
invites accidental errors where someone has a rogue binary somewhere in their repo (e.g.: temporary directory, whatever) and it accidentally adds an arch that is not in the final jar artifact.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jlowe find is from my POC NVIDIA/spark-rapids#10540 (comment)
We probably don't need to go this far, but I was thinking of an idea to record the arch sets for each lib going into a build (including the nvcomp ones), and then compute the set intersection for the final device support check, can be done at build or run time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For cudf supported architectures, we should check
./thirdparty/cudf/cpp/build/libcudf.a
.For spark-rapids-jni, what is the difference between
./target/cmake-build/libcudf.so
and./target/classes/amd64/Linux/libcudf.so
We should probably use set intersection of each lib. Can we ever have a case when
spark-rapids-jni
has different architectures supported fromcudf
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are the same, you can pick either path. The former is where the library is orginally built, the latter is where it's copied to pull it into the resulting jar.
Theoretically yes but not easily the way it builds today. IMO we could easily overengineer this thing to solve a problem that doesn't need to be solved in practice. We could simply probe for the architectures on libcudf.so and call it a day. Is it 100% bulletproof? No, theoretically someone could do a very crazy thing to get that to not be correct because they went out of their way to somehow build libcudf.a, libcudfjni.a, and libcudf.so differently. But I've never seen this in practice and arguably YAGNI applies. That avoids the "where are all the places to look" problem and the need to do set intersections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inconsistent build environment after upmerges, branch switches, other mistakes during custom dev build, bugs in a prod build plus things like nvcomp kernels built externally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed script delimiter to semi-colon and explicitly passed 'libcudf.a' and 'libcudf.so' as library files for cudf and spark-rapids-jni for processing by the build info script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My perception from internal chats is that devs routinely runs into build situations that would be YAGNI, but we do not need to optimize for this 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not in favor of the space-separated arch list. Java properties file format allows for the space being the separator between the key and the value portion of the line
I prefer a semicolon-separated value format without any white space in the value
70;75;80;86;90
similar to the first example https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html#prop_tgt:CUDA_ARCHITECTURES and this is what we typically use for-DGPU_ARCHS
passed to MavenThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. This would keep it consistent with current usage.