
Apache Spark - testing on DC2 Run #249

Closed
JulienPeloton opened this issue Aug 10, 2018 · 12 comments


@JulienPeloton
Contributor

JulienPeloton commented Aug 10, 2018

The idea would be to try out Apache Spark on the DC2 run.

Random thoughts:

  • Language
    • Spark exposes its functionality through Scala/Python/Java/R APIs (Scala is the native one).
    • As far as DESC is concerned, I would advocate using the Python API (pyspark) for obvious reasons. But feel free to try your hand at Scala; it's worth it.
  • Data Format
    • The current catalogs used in Get Dask working at NERSC to analyze DC2 #237 are stored in HDF5 files. As far as I know there is no mature pyspark connector for reading HDF5 into DataFrames. Writing a custom one is certainly possible, but I would rather focus first on existing tools that have already been validated and tested.
    • FITS would be more appropriate, as it has a Spark connector usable from all the current APIs (see e.g. spark-fits).
    • Apache Parquet would work as well, since its connector is built in, packaged and shipped with Apache Spark, and usable from all APIs (a minimal read sketch for both formats is given at the end of this comment).
  • Infrastructure
    • CC IN2P3 does not support Spark yet.
    • NERSC does however (via Shifter), and I have already run jobs on it (it works surprisingly well).
    • We have a (small) dedicated Apache Spark cluster at LAL (France) for R&D.

Targeted deadline for this work: mid-September (I won't be back at work until early September).
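
To make the Data Format point concrete, here is a minimal pyspark read sketch for the two candidate formats. The paths are placeholders, and the FITS part assumes the spark-fits package has been added to the Spark session (e.g. via --packages):

```python
# Minimal pyspark sketch: load the two candidate formats into DataFrames.
# Paths are placeholders; spark-fits must be on the classpath for the FITS read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dc2-format-test").getOrCreate()

# Parquet: the connector ships with Spark itself.
df_parquet = spark.read.parquet("/path/to/object_catalog.parquet")

# FITS: provided by the spark-fits connector; read the first data HDU.
df_fits = (spark.read.format("fits")
           .option("hdu", 1)
           .load("/path/to/object_catalog.fits"))

df_parquet.printSchema()
```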

@JulienPeloton
Contributor Author

@wmwv I opened this issue to branch out the discussion from #234.

Speaking of FITS or Parquet, do you think it would be possible to get catalogs in those formats ready for September? If instructions for creating the catalogs are available, I can have a look and do it myself if you are short on manpower.

@wmwv
Contributor

wmwv commented Aug 16, 2018

Yes, providing parquet and FITS can be done for September.
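
Purely for illustration (not necessarily how the production conversion will be done), one possible per-patch HDF5 → Parquet path goes through pandas; the file names and the HDF5 key below are hypothetical:

```python
# Hypothetical per-patch conversion: HDF5 -> Parquet via pandas.
# Needs pytables for read_hdf and pyarrow for to_parquet.
import pandas as pd

df = pd.read_hdf("object_catalog_patch.hdf5", key="df")  # key is a guess
df.to_parquet("object_catalog_patch.parquet", engine="pyarrow", index=False)
```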

@JulienPeloton
Contributor Author

Great! Thanks.

@JulienPeloton
Contributor Author

@wmwv I would like to start benchmarking Apache Spark performance on the catalog data.
I can see parquet files in /global/projecta/projectdirs/lsst/global/in2p3/Run1.1/object_catalog/, but this is only 100 MB total. Do you know when the full data set will be available?

Thanks!
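
For what it is worth, this is roughly how I would inspect what is already there (a sketch, assuming the directory contains plain Parquet files):

```python
# Quick look at the Run 1.1 object-catalog parquet files already at NERSC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/global/projecta/projectdirs/lsst/global/in2p3/Run1.1/object_catalog/"

df = spark.read.parquet(path + "*.parquet")
print(df.count(), "rows x", len(df.columns), "columns")
```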

@wmwv
Contributor

wmwv commented Sep 4, 2018

@JulienPeloton Welcome back. My apologies that I didn't have this finished before you came back. I've taken a stab at it, as you can see, but I need to re-architect a few things to produce the Parquet files.

In particular, I will finally implement @yymao's long-standing request to provide dummy columns for missing filters, so that the schema is the same for all tract+patches.
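
For the record, a minimal sketch of that schema-unification step, assuming pandas tables and hypothetical per-filter column names (the real catalog columns may differ):

```python
# Sketch: pad a per-patch table with NaN "dummy" columns for filters that
# were not observed, so every patch shares the same schema.
# Column names (e.g. "mag_u") are hypothetical.
import numpy as np
import pandas as pd

ALL_FILTERS = ["u", "g", "r", "i", "z", "y"]
PER_FILTER_COLUMNS = ["mag_{}", "magerr_{}"]  # hypothetical naming scheme

def pad_missing_filters(df):
    """Add NaN columns for any filter missing from this patch."""
    for band in ALL_FILTERS:
        for template in PER_FILTER_COLUMNS:
            col = template.format(band)
            if col not in df.columns:
                df[col] = np.nan
    return df
```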

@JulienPeloton
Contributor Author

OK, thanks @wmwv! Let me know when it's done, and whether in the meantime I can be of any help.

Not related to this specific data set, but here is a benchmark of Apache Spark performance when loading, decoding, and distributing the same data set stored in different file formats (CSV, FITS, and Parquet), repeated 100 times.

[Figure: timing distributions] Per-iteration running time to load, decode and distribute the same data set (370 million galaxies) stored in different file formats: FITS (blue), Parquet (green) and CSV (orange). Only the timings once the data reside in memory are shown (i.e. for 1 < iteration < 100). In the legend, the numbers in parentheses are the means of each distribution.

There is a Jupyter notebook detailing the benchmark and what is done under the hood, and I performed the same exercise in Scala as well.
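
In outline, the benchmark loop boils down to the following sketch (not the exact notebook code; the path and format are placeholders):

```python
# Sketch of the benchmark: read the catalog, cache it, then repeatedly force a
# full pass over the data (count) and record the per-iteration wall-clock time.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/catalog.parquet").cache()  # or FITS / CSV readers

timings = []
for _ in range(100):
    start = time.time()
    df.count()                      # triggers load + decode + distribution
    timings.append(time.time() - start)

# Iteration 0 includes reading from disk; iterations 1-99 run on in-memory data.
print("in-memory mean [s]:", sum(timings[1:]) / len(timings[1:]))
```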

@JulienPeloton
Contributor Author

For reference - preliminary Spark Notebook on DC2: here

@katrinheitmann
Contributor

@JulienPeloton Hi Julien, with all your great work on this, I wonder whether this issue is now concluded and ready to be closed. Are you planning to write a DESC Note on this? Or maybe the notebook is sufficient? Thanks!

@JulienPeloton
Contributor Author

@katrinheitmann Hi Katrin - thanks for following up on this! For future reference, let me write the conclusion (or rather where we stand) here before closing. I can also definitely write a DESC note on this (is there any template somewhere?).

@yymao
Member

yymao commented Apr 6, 2019

@JulienPeloton
Contributor Author

Perfect, thanks @yymao!

@JulienPeloton
Contributor Author

JulienPeloton commented Apr 8, 2019

Here is a summary of where we stand regarding the use of Apache Spark in the context of the DATF and DC2:

Future work will include:

  • summarize the work in a DESC note, and keep developing tools and exposing results;
  • focus on performance with larger catalogs (so far the catalogs used were limited in size);
  • add cosmology-oriented features (e.g. cross-match/FoF is something that can be tested in the context of Spark; a crude sketch follows this list);
  • include a DESC-oriented interface (e.g. automatically manage paths to catalog data at NERSC, as is done in GCR).
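
As a flavour of the cross-match item above, a crude prototype could join two catalogs on a HEALPix pixel index. This is a sketch only: healpy must be available on the executors, the ra/dec column names and paths are assumptions, and a real cross-match also needs neighbouring pixels and an angular-separation cut.

```python
# Crude cross-match sketch: pixelise both catalogs and join on the pixel index.
import healpy as hp
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
NSIDE = 4096  # pixel scale sets the effective match radius

# HEALPix index from (ra, dec) in degrees.
ang2pix = udf(lambda ra, dec: int(hp.ang2pix(NSIDE, ra, dec, lonlat=True)), LongType())

cat_a = spark.read.parquet("/path/to/catalog_a.parquet").withColumn("hpix", ang2pix("ra", "dec"))
cat_b = spark.read.parquet("/path/to/catalog_b.parquet").withColumn("hpix", ang2pix("ra", "dec"))

# Candidate pairs: objects falling in the same pixel (to be refined by separation).
pairs = cat_a.join(cat_b, on="hpix")
print(pairs.count())
```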

@katrinheitmann I think we can close this issue.
