
Sync plugin support as of 2024-12-31 #1478

Draft · wants to merge 1 commit into base: dev

Conversation

cindyyuanjiang (Collaborator):

WIP.
Fixes #1452

amahussein (Collaborator) left a comment:


Thanks @cindyyuanjiang for working on this.
It is important to stay up-to-date with the plugin, since many of the new operators affect our customers.
I updated the issue description with the latest sync-plugin job, so it would be nice to include the remaining execs/expressions if they are easy to knock off.

@@ -1,2 +1,2 @@
App Name,App ID,SQL DF Duration,SQL Dataframe Task Duration,App Duration,GPU Opportunity,Executor CPU Time Percent,SQL Ids with Failures,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Nested Complex Types,Potential Problems,Longest SQL Duration,SQL Stage Durations Sum,NONSQL Task Duration Plus Overhead,Unsupported Task Duration,Supported SQL DF Task Duration,App Duration Estimated,Unsupported Execs,Unsupported Expressions,Estimated Job Frequency (monthly),Total Core Seconds
"Spark shell","local-1624371544219",4575,20421,175293,1523,72.15,"","JSON[string:double:date:int:bigint];Text[*]","JSON","","","",1859,5372,176916,13622,6799,false,"CollectLimit;Scan text;Execute InsertIntoHadoopFsRelationCommand json;Scan json","",30,2096
"Spark shell","local-1624371544219",4575,20421,175293,4365,72.15,"","Text[*]","JSON","","","",1859,5372,176916,938,19483,false,"CollectLimit;Scan text;Execute InsertIntoHadoopFsRelationCommand json","",30,2096
Collaborator:


QQ: the legacy behavior identified JSON[string:double:date:int:bigint] as unsupported.
If we are checking that the data types are supported, what about BigInt? I don't see it in the datasource columns.

cindyyuanjiang (Collaborator, PR author):


I need to investigate this. These are the results from the Qual tool after we picked up the plugin changes. I am not certain why bigint was removed.

Collaborator:


Perhaps bigInt has never been one of the columns to begin with.
There are two main questions to answer here:

  1. I expected this readFormat to be unsupported based on the current columns of the CSV files. Why does the new output indicate that this is a supported read, although its schema has a field (BigInt) that is not supported in the CSV files? @nartal1 can help confirm what the correct behavior is.
  2. How should we handle BigInt? We can reach out to Bobby to confirm whether:
    • It needs to be added to the columns, in which case the fix needs to be done on the plugin side; or
    • It should be mapped to another datatype such as Int.

Collaborator:


There is probably a bug here: if BigInt is not a valid column, then the readSchema should have been considered unsupported.
For the second part: the correct way to handle this is to map the datatypes to their aliases. I filed a new issue #1492 for that.
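As a hedged illustration of the alias-mapping idea behind #1492 (a sketch, not the tool's actual code): Spark SQL treats `bigint` as an alias of `long` (and `smallint`/`tinyint` as aliases of `short`/`byte`), so normalizing the alias before the supported-datatype lookup would let a schema field like `bigint` resolve against the existing CSV columns. The object and method names below are hypothetical.

```scala
// Hypothetical sketch of the alias-mapping fix proposed in #1492;
// names here are illustrative, not the tool's actual API.
object TypeAliasNormalizer {
  // Spark SQL type-name aliases: BIGINT is LongType, SMALLINT is
  // ShortType, TINYINT is ByteType.
  private val aliases: Map[String, String] = Map(
    "bigint"   -> "long",
    "smallint" -> "short",
    "tinyint"  -> "byte"
  )

  // Normalize a schema field's type name before checking it against
  // the supported-datatype columns of the CSV files.
  def normalize(typeName: String): String = {
    val lower = typeName.toLowerCase
    aliases.getOrElse(lower, lower)
  }
}
```

With this in place, the lookup for `JSON[string:double:date:int:bigint]` would check `long` instead of the unknown name `bigint`, instead of silently treating the schema as supported.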

@@ -1169,6 +1169,103 @@ class SQLPlanParserSuite extends BasePlanParserSuite {
}
}

test("MonthsBetween is supported in ProjectExec") {
Collaborator:


The following 3 tests are nearly identical.
Is it possible to create a parameterized test that covers all of them?

TrampolineUtil.withTempDir { parquetoutputLoc =>
TrampolineUtil.withTempDir { eventLogDir =>
val (eventLog, _) = ToolTestUtils.generateEventLog(eventLogDir,
"ProjectExprsSupported") { spark =>
Collaborator:


ProjectExprsSupported -> if we end up having 3 unit tests, it would be nice to give each a different app name. Distinct names per unit test help when we debug and extract the event logs for further testing: the app name can then be used to link an event log back to the unit test that created it.
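Both review suggestions can be sketched together: drive the near-identical tests from a table of cases and derive a distinct app name per case. This is a sketch assuming ScalaTest-style `test(...)` registration as in `SQLPlanParserSuite`; the expression list (other than MonthsBetween) and the naming helper are hypothetical.

```scala
// Hypothetical sketch: fold the three near-identical tests into one
// table-driven loop, giving each case its own app name so an event
// log can be traced back to the unit test that produced it.
object ProjectExprTestCases {
  // Assumed expression list; only MonthsBetween appears in this PR's
  // quoted diff, the others are placeholders.
  val exprs: Seq[String] = Seq("MonthsBetween", "ExprCaseTwo", "ExprCaseThree")

  // Distinct app name per case, e.g. "ProjectExprsSupported_MonthsBetween".
  def appNameFor(expr: String): String = s"ProjectExprsSupported_$expr"
}

// In the suite this would register one ScalaTest test per case:
//   ProjectExprTestCases.exprs.foreach { expr =>
//     test(s"$expr is supported in ProjectExec") {
//       // ... ToolTestUtils.generateEventLog(eventLogDir,
//       //       ProjectExprTestCases.appNameFor(expr)) { spark => ... }
//     }
//   }
```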

amahussein changed the title from "Add support for new expressions" to "Sync plugin support as of 2024-12-31" on Dec 31, 2024.
Successfully merging this pull request may close these issues.

[FEA] Sync supported ops with RAPIDS plugin Dec 2024