initial commit: modified scala etls to accept Fannie Mae data #191
Conversation
Signed-off-by: Suraj Aralihalli <[email protected]>
Please ignore the markdown link checker warnings; I filed an issue for this.
@SurajAralihalli @nvliyuan Could you also check why the markdown link check failed:
The link is not dead; I believe it is a bug in the link checker.
Signed-off-by: Suraj Aralihalli <[email protected]>
@viadea the link checker validates only http/https links and marks any references (links) to files within the repo as failed.
examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb
Signed-off-by: Suraj Aralihalli <[email protected]>
@@ -667,7 +780,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 60,
+   "execution_count": 57,
    "metadata": {},
    "outputs": [],
    "source": [
Since spark.rapids.sql.incompatibleOps.enabled is true by default now, maybe we can remove the config.
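For context, the now-redundant line in the notebooks would look something like this (a sketch, not the exact cell contents):

```scala
// Redundant now that the plugin enables this by default:
spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", "true")
```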
Sure, I'll update it in the next commit
Yes, the key is to keep only the non-default settings for Spark RAPIDS based on the current GA version, say 22.06.
I have updated the latest commit to resolve the config issues in both the Scala and Python ETL notebooks.
Since the output of mortgage-ETL.ipynb can be the input of mortgage-gpu.ipynb, it would be nice to update the data loading part of mortgage-gpu.ipynb. It would be something like:
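A minimal sketch of the adjusted loading cell, assuming the ETL notebook writes "train" and "eval" parquet directories under a path the XGBoost notebook can read (dataPath is a placeholder, and spark is the notebook's SparkSession):

```scala
import org.apache.hadoop.fs.Path

// Sketch only: dataPath stands in for wherever mortgage-ETL.ipynb wrote its
// output; "train" and "eval" match the directories that notebook saves.
val trainDF = spark.read.parquet(new Path(dataPath, "train").toString)
val evalDF  = spark.read.parquet(new Path(dataPath, "eval").toString)
```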
@viadea shall we keep a sample mortgage dataset and add a declaration for convenience, so that customers do not need to get the dataset from the Fannie Mae website?
I think we should not, and that is the same reason we have this PR: to use the raw data.
Thanks, I have updated both the Scala and Python ETL notebooks to save the train and test datasets based on a boolean saveTrainTestDataset parameter.
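A minimal sketch of how such a flag could gate the writes; only the saveTrainTestDataset name comes from this PR, while the function name and outPath are placeholders:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

// Sketch: write the 80/20 split only when the caller asks for it.
def maybeSaveTrainTest(rawDF: DataFrame, outPath: String,
                       saveTrainTestDataset: Boolean): Unit = {
  if (saveTrainTestDataset) {
    val Array(train, eval) = rawDF.randomSplit(Array(0.8, 0.2))
    train.write.mode("overwrite").parquet(new Path(outPath, "train").toString)
    eval.write.mode("overwrite").parquet(new Path(outPath, "eval").toString)
  }
}
```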
Signed-off-by: Suraj Aralihalli <[email protected]>
Some issues:
We need to make sure that reading the raw CSV uses header=false.
Actually, this optionsMap is never used; if so, we should remove it.
Make sure the ETL reads the CSV without a header, and writes parquet files, in all notebooks (Scala/Python).
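A minimal sketch of both sides, assuming the raw Fannie Mae files are headerless, pipe-delimited CSV; rawSchema, rawPath, and outPath are placeholders for names defined elsewhere in the notebooks:

```scala
import org.apache.hadoop.fs.Path

// Read: the raw files carry no header row, so supply the schema explicitly.
val rawDF = spark.read
  .option("header", "false")
  .option("sep", "|")        // assumption: the raw files are pipe-delimited
  .schema(rawSchema)         // placeholder: the mortgage schema defined elsewhere
  .csv(rawPath)

// Write: persist the ETL output as parquet for the XGBoost notebooks.
rawDF.write.mode("overwrite").parquet(new Path(outPath, "data").toString)
```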
examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb
examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL+XGBoost.ipynb
Signed-off-by: Suraj Aralihalli <[email protected]>
" val sets = rawDF.randomSplit(Array[Double](0.8, 0.2))\n", | ||
" val train = sets(0)\n", | ||
" val eval = sets(1)\n", | ||
" train.write.mode(\"overwrite\").parquet(new Path(outPath, \"train\").toString)\n", |
Here we save rawDF to parquet files, and then recompute rawDF to do the 80/20 split.
Is this the best logic for performance?
Should we instead save rawDF to parquet files, and then read the parquet back to do the 80/20 split, so we avoid recomputing it?
Could you test this new approach's performance to see whether it is the better logic?
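A minimal sketch of the suggested alternative (paths and names are placeholders): write rawDF once, read the materialized parquet back, and split that instead of recomputing the whole ETL lineage:

```scala
import org.apache.hadoop.fs.Path

// Materialize the ETL result once...
rawDF.write.mode("overwrite").parquet(new Path(outPath, "data").toString)

// ...then split the cheap-to-read parquet instead of recomputing rawDF.
val materialized = spark.read.parquet(new Path(outPath, "data").toString)
val Array(train, eval) = materialized.randomSplit(Array(0.8, 0.2))
train.write.mode("overwrite").parquet(new Path(outPath, "train").toString)
eval.write.mode("overwrite").parquet(new Path(outPath, "eval").toString)
```

Caching rawDF would be another option, but for datasets larger than cluster memory the write-then-read-back approach avoids recomputation without the memory pressure.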
Signed-off-by: Suraj Aralihalli <[email protected]>