
Apache Spark - testing on DC2 Run #249

Closed
JulienPeloton opened this issue Aug 10, 2018 · 12 comments


@JulienPeloton
Contributor

JulienPeloton commented Aug 10, 2018

The idea would be to try out Apache Spark on the DC2 run.

Random thoughts:

  • Language
    • Spark exposes its functionality through Scala/Python/Java/R APIs (Scala is the native one).
    • As far as DESC is concerned, I would advocate using the Python API (pyspark) for obvious reasons. But feel free to try your hand at Scala; it's worth it.
  • Data Format
    • The current catalogs used in Get Dask working at NERSC to analyze DC2 #237 are stored in HDF5 files. As far as I know there is no mature pyspark connector for reading HDF5 into DataFrames. Writing a custom one is certainly possible, but I would rather focus first on existing tools that have already been validated and tested.
    • FITS would be more appropriate, as it has a Spark connector usable from all the current APIs (see e.g. spark-fits).
    • Apache Parquet would work as well, since its connector is built in, packaged and shipped with Apache Spark, and usable from all APIs (a minimal read sketch for both formats is given at the end of this comment).
  • Infrastructure
    • CC IN2P3 does not support Spark yet.
    • NERSC does however (via Shifter), and I have already run jobs on it (it works surprisingly well).
    • We have a (small) dedicated Apache Spark cluster at LAL (France) for R&D.

Targeted deadline for this work: mid-September (I won't be back at work until early September).
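
To make the Data Format point concrete, here is a minimal pyspark read sketch for the two candidate formats. The paths are placeholders, and the FITS part assumes the spark-fits package has been added to the Spark session (e.g. via --packages):

```python
# Minimal pyspark sketch: load the two candidate formats into DataFrames.
# Paths are placeholders; spark-fits must be on the classpath for the FITS read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dc2-format-test").getOrCreate()

# Parquet: the connector ships with Spark itself.
df_parquet = spark.read.parquet("/path/to/object_catalog.parquet")

# FITS: provided by the spark-fits connector; read the first data HDU.
df_fits = (spark.read.format("fits")
           .option("hdu", 1)
           .load("/path/to/object_catalog.fits"))

df_parquet.printSchema()
```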

@JulienPeloton
Contributor Author

@wmwv I opened this issue to branch out the discussion from #234.

Speaking of FITS or Parquet, do you think it would be possible to get catalogs in those formats ready for September? If instructions for creating the catalogs are available, I can have a look and do it myself if you are short on manpower.

@wmwv
Contributor

wmwv commented Aug 16, 2018

Yes, providing parquet and FITS can be done for September.
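
Purely for illustration (not necessarily how the production conversion will be done), one possible per-patch HDF5 → Parquet path goes through pandas; the file names and the HDF5 key below are hypothetical:

```python
# Hypothetical per-patch conversion: HDF5 -> Parquet via pandas.
# Needs pytables for read_hdf and pyarrow for to_parquet.
import pandas as pd

df = pd.read_hdf("object_catalog_patch.hdf5", key="df")  # key is a guess
df.to_parquet("object_catalog_patch.parquet", engine="pyarrow", index=False)
```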

@JulienPeloton
Contributor Author

Great! Thanks.

@JulienPeloton
Contributor Author

@wmwv I would like to start benchmarking Apache Spark performance on the catalog data.
I can see parquet files in /global/projecta/projectdirs/lsst/global/in2p3/Run1.1/object_catalog/, but this is only 100 MB total. Do you know when the full data set will be available?

Thanks!
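
For what it is worth, this is roughly how I would inspect what is already there (a sketch, assuming the directory contains plain Parquet files):

```python
# Quick look at the Run 1.1 object-catalog parquet files already at NERSC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/global/projecta/projectdirs/lsst/global/in2p3/Run1.1/object_catalog/"

df = spark.read.parquet(path + "*.parquet")
print(df.count(), "rows x", len(df.columns), "columns")
```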

@wmwv
Contributor

wmwv commented Sep 4, 2018

@JulienPeloton Welcome back. My apologies that I didn't have this finished before you came back. I've taken a stab at it, as you can see, but I need to re-architect a few things to produce the Parquet files.

In particular, I will finally implement @yymao's long-standing request to provide dummy columns for missing filters, so that the schema is the same for all tract+patches.
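
For the record, a minimal sketch of that schema-unification step, assuming pandas tables and hypothetical per-filter column names (the real catalog columns may differ):

```python
# Sketch: pad a per-patch table with NaN "dummy" columns for filters that
# were not observed, so every patch shares the same schema.
# Column names (e.g. "mag_u") are hypothetical.
import numpy as np
import pandas as pd

ALL_FILTERS = ["u", "g", "r", "i", "z", "y"]
PER_FILTER_COLUMNS = ["mag_{}", "magerr_{}"]  # hypothetical naming scheme

def pad_missing_filters(df):
    """Add NaN columns for any filter missing from this patch."""
    for band in ALL_FILTERS:
        for template in PER_FILTER_COLUMNS:
            col = template.format(band)
            if col not in df.columns:
                df[col] = np.nan
    return df
```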

@JulienPeloton
Contributor Author

OK, thanks @wmwv! Let me know when it's done, and whether in the meantime I can be of any help.

Not related to this specific data set, but here is a benchmark of Apache Spark performance when loading, decoding, and distributing the same data set stored in different file formats (CSV, FITS, and Parquet), repeated 100 times.

[Figure: timing distributions] Per-iteration running time to load, decode and distribute the same data set (370 million galaxies) stored in different file formats: FITS (blue), Parquet (green) and CSV (orange). Only the timings once the data reside in memory are shown (i.e. for 1 < iteration < 100). In the legend, the numbers in parentheses are the means of each distribution.

There is a Jupyter notebook detailing the benchmark and what is done under the hood, and I performed the same exercise in Scala as well.
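
In outline, the benchmark loop boils down to the following sketch (not the exact notebook code; the path and format are placeholders):

```python
# Sketch of the benchmark: read the catalog, cache it, then repeatedly force a
# full pass over the data (count) and record the per-iteration wall-clock time.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/catalog.parquet").cache()  # or FITS / CSV readers

timings = []
for _ in range(100):
    start = time.time()
    df.count()                      # triggers load + decode + distribution
    timings.append(time.time() - start)

# Iteration 0 includes reading from disk; iterations 1-99 run on in-memory data.
print("in-memory mean [s]:", sum(timings[1:]) / len(timings[1:]))
```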

@JulienPeloton
Contributor Author

For reference - preliminary Spark Notebook on DC2: here

@katrinheitmann
Contributor

@JulienPeloton Hi Julien, with all your great work on this, I wonder whether this issue is now concluded and ready to be closed. Are you planning to write a DESC Note on this? Or maybe the notebook is sufficient? Thanks!

@JulienPeloton
Contributor Author

@katrinheitmann Hi Katrin - thanks for following up on this! For future reference, let me write the conclusion (or rather where we stand) here before closing. I can also definitely write a DESC note on this (is there any template somewhere?).

@yymao
Member

yymao commented Apr 6, 2019

@JulienPeloton
Contributor Author

Perfect, thanks @yymao!

@JulienPeloton
Contributor Author

JulienPeloton commented Apr 8, 2019

Here is a summary of where we stand regarding the use of Apache Spark in the context of the DATF and DC2:

Future work will include:

  • summarize the work in a DESC note, and keep developing tools and exposing results;
  • focus on performance with larger catalogs (so far the catalogs used were limited in size);
  • add cosmology-oriented features (e.g. cross-match/FoF is something that can be tested in the context of Spark; a crude sketch follows this list);
  • include a DESC-oriented interface (e.g. automatically manage paths to catalog data at NERSC, as is done in GCR).
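
As a flavour of the cross-match item above, a crude prototype could join two catalogs on a HEALPix pixel index. This is a sketch only: healpy must be available on the executors, the ra/dec column names and paths are assumptions, and a real cross-match also needs neighbouring pixels and an angular-separation cut.

```python
# Crude cross-match sketch: pixelise both catalogs and join on the pixel index.
import healpy as hp
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
NSIDE = 4096  # pixel scale sets the effective match radius

# HEALPix index from (ra, dec) in degrees.
ang2pix = udf(lambda ra, dec: int(hp.ang2pix(NSIDE, ra, dec, lonlat=True)), LongType())

cat_a = spark.read.parquet("/path/to/catalog_a.parquet").withColumn("hpix", ang2pix("ra", "dec"))
cat_b = spark.read.parquet("/path/to/catalog_b.parquet").withColumn("hpix", ang2pix("ra", "dec"))

# Candidate pairs: objects falling in the same pixel (to be refined by separation).
pairs = cat_a.join(cat_b, on="hpix")
print(pairs.count())
```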

@katrinheitmann I think we can close this issue.
