Apache Spark - testing on DC2 Run #249
@wmwv I opened this issue to branch out the discussion from #234. Speaking of …
Yes, providing …
Great! Thanks.
@wmwv I would like to start benchmarking Apache Spark performance on the catalog data. Thanks!
@JulienPeloton Welcome back. My apologies I didn't have this finished before you came back. I've taken a stab at it, as you see, but I need to re-architect a few things to make the Parquet files. In particular, I will finally implement @yymao's long-standing request to provide dummy columns for missing filters so that the schema is the same for all tract+patches.
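For illustration, here is a minimal pyspark sketch of that dummy-column idea: pad a per-patch table with typed null columns for any missing filter so every tract+patch ends up with the same schema. The column names and file paths are hypothetical, not the repository's actual code:

```python
# Hypothetical sketch: unify the schema across tract+patch Parquet files
# by adding null "dummy" columns for filters absent from a given patch.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("object_catalog_tract_patch.parquet")  # placeholder path

# Full set of per-filter columns the unified schema should contain.
expected = ["mag_u", "mag_g", "mag_r", "mag_i", "mag_z", "mag_y"]

# Add any missing filter column as a typed null so all files share one schema.
for name in expected:
    if name not in df.columns:
        df = df.withColumn(name, F.lit(None).cast("double"))

df.write.mode("overwrite").parquet("object_catalog_tract_patch_padded.parquet")
```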
OK thanks @wmwv! Let me know when it's done, and if in the meantime I can be of any help. Not related to this specific data set, but here is a benchmark looking at Apache Spark performance when loading, decoding and distributing the same data set stored in different file formats (CSV, FITS, and Parquet), repeated 100 times.

[Figure: Per-iteration running time to load, decode and distribute the same data set (370 million galaxies) stored in different file formats: FITS (blue), Parquet (green) and CSV (orange). Only the timing once the data reside in memory is shown (that is, for 1 < iteration < 100). In the legend, the numbers in parentheses are the means of each distribution.]

There is a Jupyter notebook detailing the benchmark and what has been done under the hood, and I performed the same exercise in Scala as well.
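As a rough illustration (not the linked notebook itself), a benchmark of that shape could look like the following in pyspark. The file names are placeholders, and the FITS branch assumes the spark-fits connector is on the classpath:

```python
# Sketch of the benchmark pattern: load one catalog stored in three
# formats, cache it, then time repeated passes over the in-memory data.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

readers = {
    "parquet": lambda: spark.read.parquet("catalog.parquet"),
    "csv": lambda: spark.read.csv("catalog.csv", header=True, inferSchema=True),
    "fits": lambda: spark.read.format("fits").option("hdu", 1).load("catalog.fits"),
}

for fmt, load in readers.items():
    df = load().cache()
    df.count()  # iteration 1: triggers the load and fills the cache
    timings = []
    for _ in range(99):  # iterations 2..100 touch only in-memory data
        start = time.time()
        df.count()
        timings.append(time.time() - start)
    print(f"{fmt}: mean per-iteration time {sum(timings) / len(timings):.3f} s")
    df.unpersist()
```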
For reference, a preliminary Spark notebook on DC2 is available here.
@JulienPeloton Hi Julien, with all your great work on this, is this issue now concluded and ready to be closed? Are you planning to write a DESC Note on this? Or maybe the notebook is sufficient? Thanks!
@katrinheitmann Hi Katrin, thanks for following up on this! For future reference, let me write the conclusion (or rather where we stand) here before closing. I can also definitely write a DESC Note on this (is there a template somewhere?).
DESC Note Google Doc template: https://docs.google.com/document/d/1ERz_S02Uvc0QkapVx145PrYZT0CRJbkPMmY5T95uMkk/edit
Or if you want to use TeX: https://github.com/LSSTDESC/start_paper
Perfect, thanks @yymao!
Here is a summary of where we stand regarding the use of Apache Spark in the context of the DATF and DC2:
Future work will include:
@katrinheitmann I think we can close this issue.
The idea would be to try out Apache Spark on the DC2 run.

Random thoughts:

- hdf5 files: as far as I know there is no serious pyspark connector on the market to read hdf5 into DataFrames. While writing a custom one belongs to the domain of the possible, I would rather advocate focusing first on existing tools which have been validated/tested.
- FITS would be more appropriate, as it has a Spark connector usable from all current APIs (see e.g. spark-fits).
- Apache Parquet would work as well, since it has a built-in connector packaged and shipped with Apache Spark, and usable from all APIs (a minimal read sketch for both connectors follows this list).

Targeted deadline for this work: mid-September (I'll be back to work only in early September).
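As a hedged illustration of those two connector routes, the sketch below reads the same catalog both ways; the file paths are placeholders, and the spark-fits launch details are an assumption rather than verified instructions for this data set:

```python
# Sketch of reading the same catalog via the two connectors mentioned above.
# Assumes Spark was launched with the spark-fits package on the classpath,
# e.g. spark-submit --packages <spark-fits maven coordinates> ...
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# FITS via the spark-fits connector; the "hdu" option selects which HDU to read.
df_fits = spark.read.format("fits").option("hdu", 1).load("catalog.fits")

# Parquet via the connector shipped with Apache Spark itself; no extra package needed.
df_parquet = spark.read.parquet("catalog.parquet")

df_fits.printSchema()
df_parquet.printSchema()
```

Either route yields a standard Spark DataFrame, so downstream analysis code is identical regardless of the on-disk format.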