Contributions to RAPIDS Accelerator for Apache Spark fall into the following three categories.
- To report a bug, request a new feature, or report a problem with documentation, please file an issue describing in detail the problem or new feature. The project team evaluates and triages issues, and schedules them for a release. If you believe the issue needs priority attention, please comment on the issue to notify the team.
- To propose and implement a new feature, please file a new feature request issue. Describe the intended feature and discuss the design and implementation with the team and community. Once the team agrees that the plan looks good, go ahead and implement it using the code contributions guide below.
- To implement a feature or bug-fix for an existing outstanding issue, please follow the code contributions guide below. If you need more context on a particular issue, please ask in a comment.
There are two types of branches in this repository:
- `branch-[version]`: development branches which can change often. Note that we merge into the branch with the greatest version number, as that is our default branch.
- `main`: the branch with the latest released code, and the version tag (e.g. `v0.1.0`) is held here. `main` will change with new releases, but otherwise it should not change with every pull request merged, making it a more stable branch.
This repository uses git submodules. The submodules may need to be updated after the repository
is cloned or after moving to a new commit via `git submodule update --init`. See the
Git documentation on submodules for more information.
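For example, a minimal sketch of getting a working clone with the submodules initialized (assuming the standard GitHub repository URL):

```
git clone --recurse-submodules https://github.com/NVIDIA/spark-rapids.git
cd spark-rapids
# or, if the repository was already cloned without submodules:
git submodule update --init
```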
We use Maven for most aspects of the build. We test the build with the latest patch versions of Maven 3.6.x, 3.8.x and 3.9.x. Maven version 3.6.0 or more recent is enforced.
Some important parts of the build execute in the `verify` phase of the Maven build lifecycle. We recommend when
building at least running to the `verify` phase, e.g.:

```
mvn verify
```
After a successful build, the RAPIDS Accelerator jar will be in the `dist/target/` directory.

This will build the plugin for a single version of Spark. By default, this is Apache Spark
3.1.1. To build against other versions of Spark you use the `-Dbuildver=XXX` command line option
to Maven. For instance, to build Spark 3.1.1 you would use:

```
mvn -Dbuildver=311 verify
```

You can find all available build versions in the top level pom.xml file. If you are building
for Databricks then you should use the `jenkins/databricks/build.sh` script and modify it for
the version you want.
Note that we build against both Scala 2.12 and 2.13. Any contribution you make to the codebase should compile with both Scala 2.12 and 2.13 for Apache Spark versions 3.3.0 and higher.
Also, if you make changes in the parent `pom.xml` or any other of the module `pom.xml`
files, you must run the following command to sync the changes between the Scala 2.12 and
2.13 pom files:

```
./build/make-scala-version-build-files.sh 2.13
```

That way any new dependencies or other changes will also be picked up in the Scala 2.13 build.
See the scala2.13 directory for more information on how to build against Scala 2.13.
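A hedged sketch of a Scala 2.13 build, assuming the synced poms under `scala2.13` are used in place of the root poms (the scala2.13 directory docs are authoritative; the exact invocation may differ):

```
# assumption: build from the generated scala2.13 poms after running the sync script
cd scala2.13
mvn verify -Dbuildver=330
```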
To get an uber jar with more than 1 version you have to `mvn package` each version
and then use one of the defined profiles in the dist module, or a comma-separated list of
build versions. See the next section for more details.
You might see a warning during the scala-maven-plugin compile goal invocation.

```
[INFO] Compiling 94 Scala sources and 1 Java source to /home/user/gits/NVIDIA/spark-rapids/tests/target/spark3XY/test-classes ...
OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
CodeCache: size=245760Kb used=236139Kb max_used=243799Kb free=9620Kb
 bounds [0x00007f9681000000, 0x00007f9690000000, 0x00007f9690000000]
 total_blobs=60202 nmethods=59597 adapters=504
compilation: disabled (not enough contiguous free space left)
```
It can be mitigated by increasing `ReservedCodeCacheSize` passed in the `MAVEN_OPTS` environment variable.
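For example (the 512m value below is only an illustration; choose a size that fits your machine):

```
export MAVEN_OPTS="-XX:ReservedCodeCacheSize=512m"
mvn verify
```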
By default, the distribution jar only includes code for a single version of Spark, although the jar file layout is such that it can be accessed only using the Shim loading logic for multiple Spark versions. See below for how to create a dist jar without the need for a special shim class loader.
If you want to create a jar with multiple versions we have the following options.

- Build for all Apache Spark versions and CDH with no SNAPSHOT versions of Spark, only released. Use `-PnoSnapshots`.
- Build for all Apache Spark versions and CDH including SNAPSHOT versions of Spark we have supported for. Use `-Psnapshots`.
- Build for all Apache Spark versions, CDH and Databricks with no SNAPSHOT versions of Spark, only released. Use `-PnoSnapshotsWithDatabricks`.
- Build for all Apache Spark versions, CDH and Databricks including SNAPSHOT versions of Spark we have supported for. Use `-PsnapshotsWithDatabricks`.
- Build for an arbitrary combination of comma-separated build versions using `-Dincluded_buildvers=<CSV list of build versions>`. E.g., `-Dincluded_buildvers=312,330`.
You must first build each of the versions of Spark and then build one final time using the profile for the option you want.
You can also install some manually and build a combined jar. For instance to build non-snapshot versions:

```
mvn clean
mvn -Dbuildver=311 install -Drat.skip=true -DskipTests
mvn -Dbuildver=312 install -Drat.skip=true -DskipTests
mvn -Dbuildver=313 install -Drat.skip=true -DskipTests
mvn -Dbuildver=320 install -Drat.skip=true -DskipTests
mvn -Dbuildver=321 install -Drat.skip=true -DskipTests
mvn -Dbuildver=321cdh install -Drat.skip=true -DskipTests
mvn -pl dist -PnoSnapshots package -DskipTests
```
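Similarly, once the shims you need are installed in the local Maven repo, a comma-separated subset can be packaged with `-Dincluded_buildvers` (the version list below is just an example):

```
mvn -pl dist package -Dincluded_buildvers=311,320 -DskipTests
```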
Verify that shim-specific classes are hidden from a conventional classloader:

```
$ javap -cp dist/target/rapids-4-spark_2.12-24.08.0-SNAPSHOT-cuda11.jar com.nvidia.spark.rapids.shims.SparkShimImpl
Error: class not found: com.nvidia.spark.rapids.shims.SparkShimImpl
```

However, its bytecode can be loaded if the path is prefixed with a `spark3XY` directory that is not part of the package name:

```
$ javap -cp dist/target/rapids-4-spark_2.12-24.08.0-SNAPSHOT-cuda11.jar spark320.com.nvidia.spark.rapids.shims.SparkShimImpl | head -2
Warning: File dist/target/rapids-4-spark_2.12-24.08.0-SNAPSHOT-cuda11.jar(/spark320/com/nvidia/spark/rapids/shims/SparkShimImpl.class) does not contain class spark320.com.nvidia.spark.rapids.shims.SparkShimImpl
Compiled from "SparkShims.scala"
public final class com.nvidia.spark.rapids.shims.SparkShimImpl {
```
There is a build script `build/buildall` that automates the local build process. Use
`./build/buildall --help` for up-to-date use information.
By default, it builds everything that is needed to create a distribution jar for all released (noSnapshots) Spark versions except for Databricks. Other profiles that you can pass using `--profile=<distribution profile>` include:

- `snapshots`, which includes all released (noSnapshots) and snapshot Spark versions except for Databricks
- `minimumFeatureVersionMix`, which currently includes 321cdh, 312, 320, 330 and is recommended for catching incompatibilities already in the local development cycle
For initial quick iterations we can use `--profile=<buildver>` to build a single-shim version, e.g. `--profile=311` for Spark 3.1.1.
The option `--module=<module>` allows limiting the number of build steps. When iterating, we often don't need the entire build. We may be interested in building everything necessary just to run integration tests (`--module=integration_tests`), or we may want to just rebuild the distribution jar (`--module=dist`), as in the sketch below.
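For example, a quick iteration that builds a single shim and stops after what is needed for integration tests (the version is arbitrary):

```
./build/buildall --profile=320 --module=integration_tests
```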
By default, `buildall` builds up to 4 shims in parallel using `xargs -P <n>`. This can be adjusted by
specifying the environment variable `BUILD_PARALLEL=<n>`.
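For instance, to reduce the parallelism on a smaller machine (the value 2 is just an example):

```
BUILD_PARALLEL=2 ./build/buildall
```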
You can build against different versions of the CUDA Toolkit by modifying the variable `cuda.version` (see the example after this list):

- `-Dcuda.version=cuda11` (CUDA 11.x, default)
- `-Dcuda.version=cuda12` (CUDA 12.x)
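For example, a sketch of building a single shim against CUDA 12.x:

```
mvn verify -Dbuildver=330 -Dcuda.version=cuda12
```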
In many situations the user knows that the Plugin jar will be deployed for a single specific Spark
release. It is most commonly the case when a container image for a cloud or local deployment includes
Spark binaries as well. In such a case it is advantageous to create a jar with
a conventional class directory structure, avoiding complications such as
#3704. To this end add `-DallowConventionalDistJar=true` when invoking Maven.

```
mvn package -pl dist -am -Dbuildver=340 -DallowConventionalDistJar=true
```
Verify `com.nvidia.spark.rapids.shims.SparkShimImpl` is conventionally loadable:

```
$ javap -cp dist/target/rapids-4-spark_2.12-24.08.0-SNAPSHOT-cuda11.jar com.nvidia.spark.rapids.shims.SparkShimImpl | head -2
Compiled from "SparkShims.scala"
public final class com.nvidia.spark.rapids.shims.SparkShimImpl {
```
We support JDK8 as our main JDK version, and test JDK8, JDK11 and JDK17. It is possible to build and run
with more modern JDK versions, however these are untested. The first step is to set `JAVA_HOME` in
the environment to your JDK root directory. NOTE: for JDK17, we only support building against Spark 3.3.0+.

If you need to build with a JDK version that we do not test internally, add
`-Denforcer.skipRules=requireJavaVersion` to the Maven invocation, as in the sketch below.
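A hedged sketch for an untested JDK (the JDK path below is an assumption for illustration only):

```
# assumption: JAVA_HOME points at an untested JDK, e.g. JDK 21
export JAVA_HOME=/usr/lib/jvm/jdk-21
mvn verify -Dbuildver=340 -Denforcer.skipRules=requireJavaVersion
```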
To build our project on the ARM platform, please add `-Parm64` to your Maven commands.

NOTE: The build process does not require an ARM machine, so if you want to build the artifacts only
on an X86 machine, please also add `-DskipTests` to the commands.

```
mvn clean verify -Dbuildver=311 -Parm64
```
When iterating on changes impacting the `dist` module artifact directly or via
dependencies you might find the jar creation step unacceptably slow. Due to the
current size of the artifact `rapids-4-spark_2.12`, the Maven Jar Plugin spends the
bulk of the time compressing the artifact content.

Since the JAR file specification focuses on the file entry layout in a ZIP
archive without requiring file entries to be compressed, it is possible to skip
compression and increase the speed of creating the `rapids-4-spark_2.12` jar ~3x
for a single Spark version Shim alone.

To this end, in a pre-production build you can set the Boolean property
`dist.jar.compress` to `false`; its default value is `true`.
Furthermore, after the first build execution on the clean repository the `spark-rapids-jni`
SNAPSHOT dependency typically does not change until the next nightly CI build, or the next install
to the local Maven repo if you are working on a change to the native code. So you can save
significant time spent on repeatedly unpacking these dependencies by adding `-Drapids.jni.unpack.skip`
to the `dist` build command.
The time saved is more significant if you are merely changing
the `aggregator` module, or the `dist` module, or just incorporating changes from
`spark-rapids-jni`.

For example, to quickly repackage `rapids-4-spark` after the
initial `./build/buildall` you can iterate by invoking

```
mvn package -pl dist -PnoSnapshots -Ddist.jar.compress=false -Drapids.jni.unpack.skip
```

or similarly

```
./build/buildall --rebuild-dist-only --option="-Ddist.jar.compress=false -Drapids.jni.unpack.skip"
```
Conventional code locations in Maven modules are found under `src/main/<language>`. In addition to
that, and in order to support multiple versions of Apache Spark with the minimum amount of source
code, we maintain Spark-version-specific locations within non-shim modules if necessary. This allows
us to switch between incompatible parent classes without copying the shared code to
dedicated shim modules.

Thus, the conventional source code root directories `src/main/<language>` contain the files that
are source-compatible with all supported Spark releases, both upstream and vendor-specific.
The following acronyms may appear in directory names:
| Acronym | Definition | Example | Example Explanation |
|---|---|---|---|
| db | Databricks | 332db | Databricks Spark based on Spark 3.3.2 |
| cdh | Cloudera CDH | 321cdh | Cloudera CDH Spark based on Apache Spark 3.2.1 |
The version-specific directory names have one of the following forms / use cases:

- `src/main/spark${buildver}`, example: `src/main/spark332db`
- `src/test/spark${buildver}`, example: `src/test/spark340`

These directories contain sources with a special shim descriptor as a Scala/Java comment. See shimplify.md for more information.
Our project currently uses `build-helper-maven-plugin` for shimming against conflicting definitions of superclasses
in upstream versions that cannot otherwise be resolved without significant code duplication. To this end, different
source directories with differently implemented same-named classes are added
for compilation depending on the targeted Spark version.

This may require some modifications to IDEs' standard Maven import functionality.
Last tested with IntelliJ IDEA 2023.1.2 (Community Edition)
Before proceeding with importing spark-rapids into IDEA or switching to a different Spark release
profile, execute the install phase with the corresponding `buildver`, e.g. for Spark 3.4.0:

```
mvn clean install -Dbuildver=340 -Dmaven.scaladoc.skip -DskipTests
```
Our build relies on symlink generation for Spark release-specific sources. It is recommended to install
the IDEA Resolve Symlinks plugin via the Marketplace tab.
Starting to work with the project in IDEA is as easy as opening the top-level (parent) pom.xml.
In IDEA 2022.3.1 unselect "Import using the new IntelliJ Workspace Model API (experimental)". After 2023.1.2 the default "Enable fast import" can be used.
In order to make sure that IDEA handles profile-specific source code roots within a single Maven module correctly, unselect "Keep source and test folders on reimport".
If you develop a feature that has to interact with the Shim layer or simply need to test the Plugin with a different
Spark version, open the Maven tool window and
select one of the `release3xx` profiles (e.g., `release320` for Apache Spark 3.2.0).
Make sure the manual Maven install for that profile (see above) has been executed.
Go to `File | Settings | Build, Execution, Deployment | Build Tools | Maven | Importing` and make sure
that `Generated sources folders` is set to `Detect automatically` and `Phase to be used for folders update`
is changed to `process-test-resources`.
In the Maven tool window hit `Reload all projects` and `Generate Sources and Update Folders For all Projects`.
Known Issues:

- With the old "slow" Maven importer it might be necessary to bump up the maximum Java Heap size
  `-Xmx` via `File | Settings | Build, Execution, Deployment | Build Tools | Maven | Importing | VM options for importer`.
- When IDEA is upgraded, it might be necessary to remove `.idea` from the local git repository root.
- There is a known issue that the test sources added via the `build-helper-maven-plugin`
  are not handled properly. The workaround is to mark the affected folders such as `tests/src/test/spark3*`
  manually as `Test Sources Root`.
- There is a known issue where, even after selecting a different Maven profile in the Maven submenu, the source folders
  from a previously selected profile may remain active. As a workaround, when switching to a different profile, go to
  `File | Project Structure ... | Modules`, select the `rapids-4-spark-sql_2.12` module, click `Sources`,
  and delete all the shim source roots from the `Source Folders` list. Make sure the right test source folders are in
  `rapids-4-spark-sql_2.12` and `rapids-4-spark-tests_2.12`. Re-execute the steps above: reload, and
  `Generate Sources ...`.
If you see Scala symbols unresolved (highlighted red) in IDEA please try the following steps to resolve it:
- Make sure there are no relevant poms in
  `File | Settings | Build, Execution, Deployment | Build Tools | Maven | Ignored Files`
- Restart IDEA and click `Reload All Maven Projects` again
Bloop is a build server and a set of tools around Build Server Protocol (BSP) for Scala providing an integration path with IDEs that support it. In fact, you can generate a Bloop project from Maven just for the Maven modules and profiles you are interested in. For example, to generate the Bloop projects for the Spark 3.2.0 dependency just for the production code run:
```
mvn -B clean install \
    -DbloopInstall \
    -DdownloadSources=true \
    -Dbuildver=320
```
With `--generate-bloop` we integrated Bloop project generation into `buildall`. It makes it easier
to generate projects for multiple Spark dependencies using the same profiles as our regular build.
It makes sure that the project files belonging to different Spark dependencies are
not clobbered by repeated `bloopInstall` Maven plugin invocations, and it uses
jq to post-process JSON-formatted project files such that they
compile project classes into non-overlapping sets of output directories.

To activate the Spark dependency version 3XY you are currently working with, update
the symlink `.bloop` to point to the corresponding directory `.bloop-spark3XY`.
Example usage:
```
./build/buildall --generate-bloop --profile=311,330
rm -vf .bloop
ln -s .bloop-spark330 .bloop
```
You can now open spark-rapids as a BSP project in IDEA.
Read on for VS Code Scala Metals instructions.
It is known that Bloop's SemanticDB generation with JDK 8 is broken for spark-rapids. Please use JDK 11 or later for Bloop builds.
Last tested with 1.63.0-insider (Universal) Commit: bedf867b5b02c1c800fbaf4d6ce09cefba
Another, and arguably more popular, use of Bloop arises in connection with Scala Metals and VS Code. Scala Metals implements the Language Server Protocol (LSP) for Scala, and enables features such as context-aware autocomplete, and code browsing between Scala symbol definitions, references and vice versa. LSP is supported by many editors including Vim and Emacs.
Here we document the integration with VS code. It makes development on a remote node almost as easy as local development, which comes very handy when working in Cloud environments.
Run `./build/buildall --generate-bloop --profile=<profile>` to generate Bloop projects
for required Spark dependencies, e.g. `--profile=320` for Spark 3.2.0. When developing
remotely this is done on the remote node.
Install the Scala Metals extension in VS Code,
either locally or into a Remote-SSH extension destination depending on your target environment.

When your project folder is open in VS Code, it may prompt you to import the Maven project.
IMPORTANT: always decline with "Don't ask again", otherwise it will overwrite the Bloop projects
generated with the default `311` profile. If you need to use a different profile, always rerun the
command above manually. When regenerating projects it's recommended to proceed to the Metals
"Build commands" View, and click:
- "Restart build server"
- "Clean compile workspace" to avoid stale class files.
Now you should be able to see Scala class members in the Explorer's Outline view and in the Breadcrumbs view at the top of the Editor with a Scala file open.
Check Metals logs, "Run Doctor" etc. if something is not working as expected. You can also verify
that the Bloop build server and the Metals language server are running by executing `jps` in the
Terminal window:

```
jps -l
72960 sun.tools.jps.Jps
72356 bloop.Server
72349 scala.meta.metals.Main
```
If the Metals background compilation process status appears to reset to 0% after reaching 99%
and you see a peculiar error message `java.lang.RuntimeException: boom`, this is a known issue
when running Metals/Bloop on Java 8. To work around it, ensure Metals and Bloop are both running on
Java 11+:
- The `-DbloopInstall` profile will enforce Java 11+ compliance.
- Add `metals.javaHome` to VS Code preferences to point to Java 11+.
We welcome pull requests with tips on how to setup your favorite IDE!
- Read the Developer Overview to understand how the RAPIDS Accelerator plugin works.
- Find an issue to work on. The best way is to look for the good first issue or help wanted labels.
- Comment on the issue stating that you are going to work on it.
- Code! Make sure to update unit tests and integration tests if needed! Refer to the test section.
- When done, create your pull request.
- Verify that CI passes all status checks. Fix if needed.
- Wait for other developers to review your code and update code as needed.
- Once reviewed and approved, a project committer will merge your pull request.
Remember, if you are unsure about anything, don't hesitate to comment on issues and ask for clarifications!
RAPIDS Accelerator for Apache Spark follows the same coding style guidelines as the Apache Spark
project. For IntelliJ IDEA users, an example code style settings file is available in the
`docs/dev/` directory.
This project follows the official Scala style guide and the Databricks Scala guide, preferring the latter.
This project follows the Oracle Java code conventions and the Scala conventions detailed above, preferring the latter.
We require that all contributors sign off on their commits. This certifies that the contribution is your original work, or you have the rights to submit it under the same license, or a compatible license.
Any contribution which contains commits that are not signed off will not be accepted.
To sign off on a commit use the `--signoff` (or `-s`) option when committing your changes:

```
git commit -s -m "Add cool feature."
```
This will append the following to your commit message:

```
Signed-off-by: Your Name <[email protected]>
```
The sign-off is a simple line at the end of the explanation for the patch. Your signature certifies that you wrote the patch or otherwise have the right to pass it on as an open-source patch. Use your real name, no pseudonyms or anonymous contributions. If you set your `user.name` and `user.email` git configs, you can sign your commit automatically with `git commit -s`.
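For example (the name and email below are placeholders):

```
git config user.name "Your Name"
git config user.email "[email protected]"
git commit -s -m "Add cool feature."
```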
The sign-off means you certify the below (from developercertificate.org):
```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```
Please visit the testing doc for details about how to run tests.
We provide a basic config `.pre-commit-config.yaml` for pre-commit to
automate some aspects of the development process. As a convenience you can enable automatic
copyright year updates by following the installation instructions on the
pre-commit homepage.

To this end, first install `pre-commit` itself using the method most suitable for your development
environment. Then you will need to run `pre-commit install` to enable it in your local git
repository. Using `--allow-missing-config` will make it easy to work with older branches
that do not have `.pre-commit-config.yaml`.
```
pre-commit install --allow-missing-config
```

and setting the environment variable:

```
export SPARK_RAPIDS_AUTO_COPYRIGHTER=ON
```

The default value of `SPARK_RAPIDS_AUTO_COPYRIGHTER` is `OFF`.
When the automatic copyright updater is enabled and you modify a file with a prior
year in the copyright header, it will be updated on `git commit` to the current year automatically.
However, this will abort the commit process with the following error message:

```
Update copyright year....................................................Failed
- hook id: auto-copyrighter
- duration: 0.01s
- files were modified by this hook
```
You can confirm that the update has actually happened by either inspecting its effect with
`git diff` first or simply re-executing `git commit` right away. The second time no file
modification should be triggered by the copyright year update hook and the commit should succeed.
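A minimal sketch of that flow (the commit message is just an example):

```
git commit -s -m "Update copyright headers"   # first attempt fails while the hook updates the year
git diff                                      # optional: review the updated copyright header
git commit -s -m "Update copyright headers"   # second attempt succeeds
```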
There is a known issue for macOS users if they use the default version of `sed`. The copyright update
script may fail and generate an unexpected file named `source-file-E`. As a workaround, please
install GNU sed:

```
brew install gnu-sed
# and add to PATH to make it the default sed for your shell
export PATH="/usr/local/opt/gnu-sed/libexec/gnubin:$PATH"
```
A pull request should pass all status checks before being merged.
Please follow the steps in the Sign your work section, and make sure at least one commit in your pull request is signed-off.
The check runs on an NVIDIA self-hosted runner; a project committer can
manually trigger it by commenting `build`. It includes the following steps:

- Mergeable check
- Blackduck vulnerability scan
- Fetch merged code (merge the pull request HEAD into the BASE branch, e.g. fea-001 into branch-x)
- Run `mvn verify` and unit tests for multiple Spark versions in parallel. Ref: spark-premerge-build.sh

If it fails, you can click the `Details` link of this check, and go to
`Upload log -> Jenkins log for pull request xxx (click here)` to find the uploaded log.
Options:

- Skip the tests run by adding `[skip ci]` to the title; this should only be used for doc-only changes
- Run build and tests in Databricks runtimes by adding `[databricks]` to the title; this adds around 30-40 minutes
Portions adopted from https://github.com/rapidsai/cudf/blob/main/CONTRIBUTING.md, https://github.com/NVIDIA/nvidia-docker/blob/master/CONTRIBUTING.md, and https://github.com/NVIDIA/DALI/blob/main/CONTRIBUTING.md