Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic Shim Detection for build Process [databricks] #11308

Merged
merged 50 commits into from
Sep 6, 2024
Merged
Show file tree
Hide file tree
Changes from 39 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
f0cc81b
Removed irrelevant profiles for Scala 2.13
razajafri Aug 2, 2024
c23d29c
Added an ant target to replace profiles
razajafri Aug 2, 2024
0ffdc8b
Modified buildall script to use the new way
razajafri Aug 4, 2024
d558425
Added function to consume the antrun plugin for version-def
razajafri Aug 4, 2024
a3ab365
Added all.buildvers
razajafri Aug 4, 2024
80c8470
Added changes to version-def and shimplify
razajafri Aug 6, 2024
6f4f567
Clean up
razajafri Aug 8, 2024
347c50c
Merge remote-tracking branch 'origin/branch-24.10' into dynamic-profiles
razajafri Aug 8, 2024
c9a15d2
Merge remote-tracking branch 'origin/branch-24.10' into dynamic-profiles
razajafri Aug 8, 2024
368cafc
Refactored python script to an external file
razajafri Aug 8, 2024
6933a7a
Changed the location of release.properties, added it to a property an…
razajafri Aug 8, 2024
95dd876
Use exec tag to execute bash
razajafri Aug 9, 2024
5245c86
Merge remote-tracking branch 'private/dynamic-shim-detection' into dy…
razajafri Aug 9, 2024
6fc5a40
clean up
razajafri Aug 16, 2024
6d8e776
few more changes
razajafri Aug 16, 2024
3a542e8
Added a way to remove shims from releases
razajafri Aug 16, 2024
bfcadcd
included_buildvers changes
razajafri Aug 16, 2024
9f4eeb3
undo version-def echos
razajafri Aug 16, 2024
2e4175a
removed unnecessary comment
razajafri Aug 16, 2024
aa1d862
removed method for mapping json
razajafri Aug 16, 2024
b9542b9
Added an xpath alternative to speed up the build
razajafri Aug 16, 2024
caaf058
Added message to inform user about the missing package for speeding u…
razajafri Aug 16, 2024
507c70a
Removed bash script and used jython to get releases
razajafri Aug 17, 2024
d9b4d76
Merge remote-tracking branch 'origin/branch-24.10' into dynamic-profiles
razajafri Aug 17, 2024
d0796ea
undo .gitignore change
razajafri Aug 17, 2024
3b716da
removed the unused param overwrite_properties
razajafri Aug 17, 2024
b529a49
cleanup
razajafri Aug 17, 2024
e7634cb
removed the call to create properties file
razajafri Aug 17, 2024
8cf07b4
addressed review comments
razajafri Aug 20, 2024
a41cf63
Merge remote-tracking branch 'origin/branch-24.10' into dynamic-profiles
razajafri Aug 20, 2024
e9d2290
removed unnecessary property
razajafri Aug 20, 2024
edc7d76
Merge remote-tracking branch 'origin/branch-24.10' into dynamic-profiles
razajafri Aug 20, 2024
ae657ba
regenerated scala2.13
razajafri Aug 20, 2024
a6b8b89
Removed antrun execution from pom and refactored the python script so…
razajafri Aug 22, 2024
a4112ab
Merge remote-tracking branch 'origin/branch-24.10' into dynamic-profiles
razajafri Aug 23, 2024
a04150c
Addressed adding comments to Python script
razajafri Aug 28, 2024
375aad3
added quotes to project so it's not mistaken for the word project but…
razajafri Aug 28, 2024
c058193
moved the comment down a line
razajafri Aug 28, 2024
8347892
Call python script to get buildvers
razajafri Aug 28, 2024
420e88e
Addressed review comments
razajafri Aug 29, 2024
33e3315
Removed some of the references of snapshots and noSnapshots
razajafri Aug 29, 2024
3af97ca
addressed review comments and other minor changes
razajafri Aug 29, 2024
3e487ed
replaced expression with buildvers
razajafri Aug 30, 2024
b90302c
Added dist profiles
razajafri Aug 30, 2024
deb96d5
Changed the phase so it runs after initialize
razajafri Aug 30, 2024
61ff81f
Added databricks profile
razajafri Sep 4, 2024
06ee292
Regenerated 2.13 pom
razajafri Sep 4, 2024
99f759a
Merge remote-tracking branch 'origin/branch-24.10' into dynamic-profiles
razajafri Sep 4, 2024
74bbf02
Addressed review comments
razajafri Sep 4, 2024
8131aa6
missed comma in databricks
razajafri Sep 4, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 11 additions & 7 deletions build/buildall
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,11 @@ function versionsFromDistProfile() {
echo -n $versionStr
}

function versionsFromReleaseProfiles() {
versionStr=$(python build/get_buildvers.py $1 $2)
echo -n $versionStr
}

FINAL_OP="package"

while [[ "$1" != "" ]] ; do
Expand Down Expand Up @@ -173,22 +178,23 @@ fi

[[ "$MODULE" != "" ]] && MODULE_OPT="--projects $MODULE --also-make" || MODULE_OPT=""

echo "Collecting Spark versions..."
case $DIST_PROFILE in

snapshotsScala213)
SPARK_SHIM_VERSIONS=($(versionsFromDistProfile "snapshotsScala213"))
SPARK_SHIM_VERSIONS=($(versionsFromReleaseProfiles "snapshots.buildvers" "scala2.13/pom.xml"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously snapshotXXX means snapshots+non-snapshots, I think currently it should be snap_and_no_snap here?

;;

noSnapshotsScala213)
SPARK_SHIM_VERSIONS=($(versionsFromDistProfile "noSnapshotsScala213"))
SPARK_SHIM_VERSIONS=($(versionsFromReleaseProfiles "no_snapshots.buildvers" "scala2.13/pom.xml"))
;;

snapshots?(WithDatabricks))
SPARK_SHIM_VERSIONS=($(versionsFromDistProfile "snapshots"))
SPARK_SHIM_VERSIONS=($(versionsFromReleaseProfiles "snapshots.buildvers" "pom.xml"))
;;

noSnapshots?(WithDatabricks))
SPARK_SHIM_VERSIONS=($(versionsFromDistProfile "noSnapshots"))
SPARK_SHIM_VERSIONS=($(versionsFromReleaseProfiles "no_snapshots.buildvers" "pom.xml"))
;;

minimumFeatureVersionMix)
Expand All @@ -215,8 +221,6 @@ if [[ "$GEN_BLOOP" == "true" ]]; then
exit 0
fi

[[ "$DIST_PROFILE" != "" ]] && MVN_PROFILE_OPT="-P$DIST_PROFILE" || MVN_PROFILE_OPT=""

# First element in SPARK_SHIM_VERSIONS to do most of the checks
export BASE_VER=${SPARK_SHIM_VERSIONS[0]}
export NUM_SHIMS=${#SPARK_SHIM_VERSIONS[@]}
Expand Down Expand Up @@ -300,7 +304,7 @@ time (
# a negligible increase of the build time by ~2 seconds.
joinShimBuildFrom="aggregator"
echo "Resuming from $joinShimBuildFrom build only using $BASE_VER"
$MVN $FINAL_OP -rf $joinShimBuildFrom $MODULE_OPT $MVN_PROFILE_OPT $INCLUDED_BUILDVERS_OPT \
$MVN $FINAL_OP -rf $joinShimBuildFrom $MODULE_OPT $INCLUDED_BUILDVERS_OPT \
-Dbuildver="$BASE_VER" \
-DskipTests -Dmaven.scaladoc.skip
)
37 changes: 37 additions & 0 deletions build/dyn_shim_detection.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Copyright (c) 2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import os
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

import xml.etree.ElementTree as ET
import sys

_log = logging.getLogger("dyn-shim-detection")
# This script is called by maven's antrun plugin. The `project` variable is set by antrun which contains all the
# properties that are set in the pom.xml. For more details checkout the documentation of the `script` task
# https://ant.apache.org/manual/Tasks/script.html
show_version_info = project.getProperty("dyn.shim.trace")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have a comment explaining where project is coming from, and a block comment in general explaining what this script is doing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks OK, but I'd like to see this addressed. If you look at this Python script on its own, it's confusing how it runs since there are magic variables that just appear vs. how normal Python programs work. A comment explaining the context in which this is invoked and describing the magic variables it is relying on would be very helpful.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing. I have added a comment explaining, I hope it helps.

_log.setLevel(logging.DEBUG if show_version_info else logging.INFO)
# Same as shimplify.py
ch = logging.StreamHandler()
ch.setFormatter(logging.Formatter('%(name)s - %(levelname)s - %(message)s'))
_log.addHandler(ch)
spark_rapids_source_basedir = project.getProperty("spark.rapids.source.basedir")
multi_module_project_dir = project.getProperty("maven.multiModuleProjectDirectory")
gerashegalov marked this conversation as resolved.
Show resolved Hide resolved
expression = project.getProperty("expression")

sys.path.append("{}/build/".format(spark_rapids_source_basedir))
from get_buildvers import _get_expression
value = _get_expression("{}/pom.xml".format(multi_module_project_dir), expression, _log)
project.setProperty(expression, value)
Binary file added build/get_buildvers$py.class
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to upload the bytecode file? should be in .gitignore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this was a mistake and yes, I will add it to .gitignore.

Binary file not shown.
69 changes: 69 additions & 0 deletions build/get_buildvers.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# Copyright (c) 2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys
import xml.etree.ElementTree as ET

def _get_expression(pomFile, expression, logger = None):
pom = ET.parse(pomFile)
ns = {"pom": "http://maven.apache.org/POM/4.0.0"}
releases = []
release_prefix = "release"
for profile in pom.findall(".//pom:profile/pom:id", ns):
if profile.text.startswith(release_prefix):
releases.append(profile.text[len(release_prefix):])
snapshots = []
no_snapshots = []

for release in releases:
spark_version = pom.find(".//pom:spark{}.version".format(release), ns)
if (spark_version.text.endswith("SNAPSHOT")):
snapshots.append(release)
else:
no_snapshots.append(release)
excluded_shims = pom.find(".//pom:dyn.shim.excluded.releases", ns).text
if (excluded_shims):
for removed_shim in [x.strip() for x in excluded_shims.split(",")]:
if (removed_shim not in snapshots and removed_shim not in no_snapshots):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this looks insufficient as snapshots and no_snapshots should not have common items, can we try

if removed_shim in snapshots:
    snapshots.remove(removed_shim)
elif removed_shim in no_snapshots:
    no_snapshots.remove(removed_shim)
else:
    raise Exception('...')

Copy link
Collaborator Author

@razajafri razajafri Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to guard against a fictitious shim that doesn't exist in either snap or no_snaps but is listed in the excluded_shims. Yours look cleaner. I will update

raise Exception("Shim {} listed in dyn.shim.excluded.releases in pom.xml not present in releases".format(removed_shim))
try:
snapshots.remove(removed_shim)
except ValueError:
pass
try:
no_snapshots.remove(removed_shim)
except ValueError:
pass


if "scala2.13" in pomFile:
no_snapshots = list(filter(lambda x: not x.endswith("cdh"), no_snapshots))
db_release = list(filter(lambda x: x.endswith("db"), no_snapshots))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line. try reformat the code for better readability 👍

no_snapshots = list(filter(lambda x: not x.endswith("db"), no_snapshots))
snap_and_no_snap = no_snapshots + snapshots
all = snap_and_no_snap + db_release
release_dict={}
release_dict["databricks.buildvers"]=" ".join(db_release)
release_dict["snapshots.buildvers"]=" ".join(snapshots)
release_dict["no_snapshots.buildvers"]=" ".join(no_snapshots)
release_dict["snap_and_no_snap.buildvers"]=" ".join(snap_and_no_snap)
release_dict["all.buildvers"]=" ".join(all)
if (logger):
logger.debug("release_dict: {}".format(release_dict))
if (expression):
return release_dict[expression]


if __name__ == "__main__":
print(_get_expression(sys.argv[2], sys.argv[1]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it would be nice to include a clear usage message with a if len(sys.argv) != 3 check, as this will be used directly in some scripts

13 changes: 3 additions & 10 deletions build/shimplify.py
Original file line number Diff line number Diff line change
Expand Up @@ -187,14 +187,13 @@ def __csv_as_arr(str_val):
__shims_arr = sorted(__csv_ant_prop_as_arr('shimplify.shims'))
__dirs_to_derive_shims = sorted(__csv_ant_prop_as_arr('shimplify.dirs'))

__all_shims_arr = sorted(__csv_ant_prop_as_arr('all.buildvers'))
__allScala213_shims_arr = sorted(__csv_ant_prop_as_arr('allScala213.buildvers'))

__log = logging.getLogger('shimplify')
__log.setLevel(logging.DEBUG if __should_trace else logging.INFO)
__ch = logging.StreamHandler()
__ch.setFormatter(logging.Formatter('%(name)s - %(levelname)s - %(message)s'))
__log.addHandler(__ch)
__all_shims_arr = sorted(__ant_proj_prop("all.buildvers").split(" "))
__shims_arr = __all_shims_arr if not __shims_arr else __shims_arr

__shim_dir_pattern = re.compile(r'spark\d{3}')
__shim_comment_pattern = re.compile(re.escape(__opening_shim_tag) +
Expand Down Expand Up @@ -371,12 +370,6 @@ def __generate_symlinks():
path,
build_ver_arr))

def __map_version_array(shim_json_string):
shim_ver = str(json.loads(shim_json_string).get('spark'))
assert shim_ver in __all_shims_arr or shim_ver in __allScala213_shims_arr, "all.buildvers or " \
"allScala213.buildvers in pom.xml does not contain %s" % shim_ver
return shim_ver

def __traverse_source_tree_of_all_shims(src_type, func):
"""Walks src/<src_type>/sparkXYZ"""
base_dir = __src_basedir
Expand All @@ -396,7 +389,7 @@ def __traverse_source_tree_of_all_shims(src_type, func):
shim_arr = shim_match.group(1).split(os.linesep)
assert len(shim_arr) > 0, "invalid empty shim comment,"\
"orphan shim files should be deleted"
build_ver_arr = map(__map_version_array, shim_arr)
build_ver_arr = map(lambda x: str(json.loads(x).get('spark')), shim_arr)
__log.debug("extracted shims %s", build_ver_arr)
assert build_ver_arr == sorted(build_ver_arr),\
"%s shim list is not properly sorted" % shim_file_path
Expand Down
82 changes: 0 additions & 82 deletions dist/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -49,28 +49,6 @@
<rapids.source.jar.phase>none</rapids.source.jar.phase>
</properties>
<profiles>
<profile>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NvTimLiu can you please double-check if there are any other references to these profiles? There might be some outside of this repository, thanks

<id>noSnapshotsWithDatabricks</id>
<properties>
<included_buildvers>
${noSnapshot.buildvers},
${databricks.buildvers}
</included_buildvers>
</properties>
</profile>
<profile>
<id>noSnapshots</id>
<properties>
<included_buildvers>
<!-- #if scala-2.12 -->
${noSnapshot.buildvers}
<!-- #endif scala-2.12 -->
<!-- #if scala-2.13 --><!--
${noSnapshotScala213.buildvers}
--><!-- #endif scala-2.13 -->
</included_buildvers>
</properties>
</profile>
<profile>
<id>premergeUT1</id>
<properties>
Expand Down Expand Up @@ -127,14 +105,6 @@
</included_buildvers>
</properties>
</profile>
<profile>
<id>databricks</id>
<properties>
<included_buildvers>
${databricks.buildvers}
</included_buildvers>
</properties>
</profile>
<profile>
<!--
https://spark.apache.org/versioning-policy.html
Expand All @@ -161,58 +131,6 @@
</included_buildvers>
</properties>
</profile>
<profile>
<id>snapshots</id>
<properties>
<included_buildvers>
${noSnapshot.buildvers},
${snapshot.buildvers}
</included_buildvers>
</properties>
</profile>
<profile>
<id>snapshotOnly</id>
<properties>
<included_buildvers>
${snapshot.buildvers}
</included_buildvers>
</properties>
</profile>
<profile>
<id>snapshotsWithDatabricks</id>
<properties>
<included_buildvers>
${noSnapshot.buildvers},
${snapshot.buildvers},
${databricks.buildvers}
</included_buildvers>
</properties>
</profile>
<profile>
<id>noSnapshotsScala213</id>
<properties>
<included_buildvers>
${noSnapshotScala213.buildvers}
</included_buildvers>
</properties>
</profile>
<profile>
<id>snapshotsScala213</id>
<properties>
<included_buildvers>
${noSnapshotScala213.buildvers},
${snapshotScala213.buildvers}
</included_buildvers>
</properties>
</profile>
<profile>
<id>snapshotScala213Only</id>
<properties>
<included_buildvers>
${snapshotScala213.buildvers}
</included_buildvers>
</properties>
</profile>
<profile>
<id>individual</id>
<activation>
Expand Down
41 changes: 20 additions & 21 deletions jenkins/version-def.sh
Original file line number Diff line number Diff line change
Expand Up @@ -58,28 +58,27 @@ function set_env_var_SPARK_SHIM_VERSIONS_ARR() {
IFS=", " <<< $SPARK_SHIM_VERSIONS_STR read -r -a SPARK_SHIM_VERSIONS_ARR
}

function set_env_var_SPARK_SHIM_VERSIONS_ARR_FROM_PROFILES() {
versionStr=$(python build/get_buildvers.py $1 $2)
SPARK_SHIM_VERSIONS_STR=$(echo -n $versionStr)
<<< $SPARK_SHIM_VERSIONS_STR read -r -a SPARK_SHIM_VERSIONS_ARR
}

pom="pom.xml"
if [[ $SCALA_BINARY_VER == "2.13" ]]; then
# Psnapshots: snapshots + noSnapshots
set_env_var_SPARK_SHIM_VERSIONS_ARR -PsnapshotsScala213
SPARK_SHIM_VERSIONS_SNAPSHOTS=("${SPARK_SHIM_VERSIONS_ARR[@]}")
# PnoSnapshots: noSnapshots only
set_env_var_SPARK_SHIM_VERSIONS_ARR -PnoSnapshotsScala213
SPARK_SHIM_VERSIONS_NOSNAPSHOTS=("${SPARK_SHIM_VERSIONS_ARR[@]}")
# PsnapshotOnly : snapshots only
set_env_var_SPARK_SHIM_VERSIONS_ARR -PsnapshotScala213Only
SPARK_SHIM_VERSIONS_SNAPSHOTS_ONLY=("${SPARK_SHIM_VERSIONS_ARR[@]}")
else
# Psnapshots: snapshots + noSnapshots
set_env_var_SPARK_SHIM_VERSIONS_ARR -Psnapshots
SPARK_SHIM_VERSIONS_SNAPSHOTS=("${SPARK_SHIM_VERSIONS_ARR[@]}")
# PnoSnapshots: noSnapshots only
set_env_var_SPARK_SHIM_VERSIONS_ARR -PnoSnapshots
SPARK_SHIM_VERSIONS_NOSNAPSHOTS=("${SPARK_SHIM_VERSIONS_ARR[@]}")
# PsnapshotOnly : snapshots only
set_env_var_SPARK_SHIM_VERSIONS_ARR -PsnapshotOnly
SPARK_SHIM_VERSIONS_SNAPSHOTS_ONLY=("${SPARK_SHIM_VERSIONS_ARR[@]}")
pom="scala2.13/pom.xml"
fi

# Psnapshots: snapshots + noSnapshots
set_env_var_SPARK_SHIM_VERSIONS_ARR_FROM_PROFILES "snap_and_no_snap.buildvers" "$pom"
SPARK_SHIM_VERSIONS_SNAPSHOTS=("${SPARK_SHIM_VERSIONS_ARR[@]}")
# PnoSnapshots: noSnapshots only
set_env_var_SPARK_SHIM_VERSIONS_ARR_FROM_PROFILES "no_snapshots.buildvers" "$pom"
SPARK_SHIM_VERSIONS_NOSNAPSHOTS=("${SPARK_SHIM_VERSIONS_ARR[@]}")
# PsnapshotOnly : snapshots only
set_env_var_SPARK_SHIM_VERSIONS_ARR_FROM_PROFILES "snapshots.buildvers" "$pom"
SPARK_SHIM_VERSIONS_SNAPSHOTS_ONLY=("${SPARK_SHIM_VERSIONS_ARR[@]}")

# PHASE_TYPE: CICD phase at which the script is called, to specify Spark shim versions.
# regular: noSnapshots + snapshots
# pre-release: noSnapshots only
Expand Down Expand Up @@ -127,7 +126,7 @@ SPARK_SHIM_VERSIONS_JDK17=("${SPARK_SHIM_VERSIONS_ARR[@]}")
set_env_var_SPARK_SHIM_VERSIONS_ARR -Pjdk17-scala213-test
SPARK_SHIM_VERSIONS_JDK17_SCALA213=("${SPARK_SHIM_VERSIONS_ARR[@]}")
# databricks shims
set_env_var_SPARK_SHIM_VERSIONS_ARR -Pdatabricks
set_env_var_SPARK_SHIM_VERSIONS_ARR_FROM_PROFILES "databricks.buildvers" "$pom"
SPARK_SHIM_VERSIONS_DATABRICKS=("${SPARK_SHIM_VERSIONS_ARR[@]}")

echo "SPARK_BASE_SHIM_VERSION: $SPARK_BASE_SHIM_VERSION"
echo "SPARK_BASE_SHIM_VERSION: $SPARK_BASE_SHIM_VERSION"
Loading