Add in the ability to fingerprint JSON columns [databricks] #11060

Merged
revans2 merged 8 commits into NVIDIA:branch-24.08 from json_datagen
Jun 13, 2024

Conversation

revans2
Collaborator

@revans2 revans2 commented Jun 13, 2024

This is the same as #11002, which was reverted because it broke the Databricks build.

@revans2
Collaborator Author

revans2 commented Jun 13, 2024

@gerashegalov and @tgravescs can you help me out here with the build and let me know if I am seeing things incorrectly?

I am getting an error trying to use classes in the jackson-core dependency. I have verified that the classes I need are in jackson-core-.jar, but even though it should be on the classpath, it appears not to be. The pom.xml files are complicated, though, and I am not 100% sure that it is on the classpath, because the build logs do not show the classpath.

When I look at the logs of the Databricks dependency install, I see that the jackson-databind jar is being installed as jackson-core.

[2024-06-13T10:37:08.109Z] Generating an execution for com.fasterxml.jackson.core jackson-core ----ws_3_3--mvn--hadoop3--com.fasterxml.jackson.core--jackson-databind--com.fasterxml.jackson.core__jackson-databind__*.jar
...
[2024-06-13T10:37:11.329Z] [INFO] --- maven-install-plugin:2.4:install-file (install-db-jar-27) @ rapids-4-spark-databricks-deps-installer ---
[2024-06-13T10:37:11.329Z] [INFO] Installing /databricks/jars/----ws_3_3--mvn--hadoop3--com.fasterxml.jackson.core--jackson-databind--com.fasterxml.jackson.core__jackson-databind__2.13.4.jar to /home/ubuntu/.m2/repository/com/fasterxml/jackson/core/jackson-core/3.3.0-databricks/jackson-core-3.3.0-databricks.jar
[2024-06-13T10:37:11.329Z] [INFO] Installing /tmp/mvninstall8760060792013558718.pom to /home/ubuntu/.m2/repository/com/fasterxml/jackson/core/jackson-core/3.3.0-databricks/jackson-core-3.3.0-databricks.pom

But jackson-databind is not jackson-core. If you run mvn dependency:tree on the datagen pom.xml, you get:

...
[INFO] |  +- com.fasterxml.jackson.core:jackson-databind:jar:2.10.0:provided
[INFO] |  |  +- com.fasterxml.jackson.core:jackson-annotations:jar:2.10.0:provided
[INFO] |  |  \- com.fasterxml.jackson.core:jackson-core:jar:2.10.0:provided
...

and the contents of the two jars are different. Is this a bug in our script, or has Databricks actually combined the jars together?
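One way to check the "are the classes really in this jar" question directly is to look inside the jar, since a jar is a zip archive. The following is only a minimal sketch: the helper name `jar_contains_class` is made up for illustration, and `com.fasterxml.jackson.core.JsonFactory` is used only as an example of a class that ships in jackson-core, not necessarily the class this PR needs.

```python
import zipfile


def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the jar (a zip archive) holds the given fully qualified class.

    Hypothetical helper for illustration; it is not part of the build scripts.
    """
    # A class a.b.C is stored in the jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Running this against the jar that was installed as jackson-core-3.3.0-databricks.jar would show whether it really holds jackson-core classes or only jackson-databind ones.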

@revans2
Collaborator Author

revans2 commented Jun 13, 2024

Artifact('com.fasterxml.jackson.core', 'jackson-core',
f'{prefix_ws_sp_mvn_hadoop}--com.fasterxml.jackson.core--jackson-databind--com.fasterxml.jackson.core__jackson-databind__*.jar'),
is the line that I think is a problem.
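A mismatch like this could in principle be caught mechanically by checking that each jar pattern names the same group and artifact it is installed as. The sketch below is an assumption-heavy illustration: the `Artifact` tuple here is a stand-in with guessed field names, not the installer script's real type, and `jar_matches_artifact` is a hypothetical helper. The prefix value is the one visible in the build log above.

```python
from typing import NamedTuple


class Artifact(NamedTuple):
    # Hypothetical stand-in for the installer script's Artifact type;
    # the field names are assumptions made for this sketch.
    group_id: str
    artifact_id: str
    jar: str


def jar_matches_artifact(artifact: Artifact) -> bool:
    """True when the jar glob names the same group and artifact it is installed as."""
    # Databricks jar names in the log end in <group>__<artifact>__<version>.jar
    expected = f"{artifact.group_id}__{artifact.artifact_id}__"
    return expected in artifact.jar


prefix = "----ws_3_3--mvn--hadoop3"  # prefix as it appears in the build log
suspect = Artifact(
    "com.fasterxml.jackson.core",
    "jackson-core",
    f"{prefix}--com.fasterxml.jackson.core--jackson-databind"
    "--com.fasterxml.jackson.core__jackson-databind__*.jar",
)
print(jar_matches_artifact(suspect))  # False: the glob points at jackson-databind
```

The check is deliberately simple string matching; it would flag the entry quoted above, but it assumes every jar name follows the same `group__artifact__version` layout.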

… as jackson-core for databricks

Signed-off-by: Robert (Bobby) Evans <[email protected]>
@revans2
Collaborator Author

revans2 commented Jun 13, 2024

build

gerashegalov
gerashegalov previously approved these changes Jun 13, 2024
Collaborator

@gerashegalov gerashegalov left a comment

LGTM

@revans2
Collaborator Author

revans2 commented Jun 13, 2024

build

jlowe
jlowe previously approved these changes Jun 13, 2024
@revans2
Collaborator Author

revans2 commented Jun 13, 2024

It got past the build on Databricks, so that was the problem :)

@revans2
Collaborator Author

revans2 commented Jun 13, 2024

build

@sameerz
Collaborator

sameerz commented Jun 13, 2024

Can this be marked as closing issue #11053 ?

@revans2
Collaborator Author

revans2 commented Jun 13, 2024

> Can this be marked as closing issue #11053 ?

How you want to track it is up to you. I reverted the original patch, so that may already have fixed it. This PR puts the functionality back in, but with the Databricks dependency fixed, so closing the issue with this PR is fine too.

@revans2 revans2 linked an issue Jun 13, 2024 that may be closed by this pull request
@revans2 revans2 merged commit 531a9f5 into NVIDIA:branch-24.08 Jun 13, 2024
45 checks passed
@revans2 revans2 deleted the json_datagen branch June 13, 2024 19:22
SurajAralihalli pushed a commit to SurajAralihalli/spark-rapids that referenced this pull request Jul 12, 2024
…1060)

Also fixed issue with databricks dependency not being what we said it was.

Signed-off-by: Robert (Bobby) Evans <[email protected]>
Successfully merging this pull request may close these issues.

[BUG] Build on Databricks 330 fails
4 participants