CSV to Parquet recipe #94
Comments
I'm sitting here with @einatlev-ldeo at the EarthCube Annual Meeting in La Jolla. We are discussing if/how we may be able to provide cloud-optimized access to (at least some subset of) the data provided on the project's file server via Pangeo Forge. Based on our discussions, this seems like a great use case for a Parquet recipe. It strikes me that once we complete the work scoped in #376, writing a Parquet recipe may become quite approachable (really just a few additional PTransforms). While we're waiting for the first phase of the Beam work to complete, perhaps we can start brainstorming what data objects would make sense to assemble from these raw data. For example, is there a set of variables with the same time resolution that could all fit together in a single large table? If so, what are those variables and their access paths on the file server? Can we assemble a demonstration CSV from them using a simple standalone Python script? If so, that would be a very useful basis for building a larger table with Pangeo Forge. Side note: there's some awesome webcam data available through the same project. I wonder what ARCO format might be suitable for webcam time series data?
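As a rough illustration of the "simple standalone Python script" idea above, here is a minimal sketch that joins a few same-resolution variables into one table and writes it to Parquet. The file paths, variable names, and the shared `time` column are placeholder assumptions, not the project's actual data layout.

```python
# Hypothetical sketch only: paths, variable names, and the shared "time"
# column are placeholders, not the project's actual file layout.
import pandas as pd

variable_files = {
    "temperature": "temperature.csv",  # placeholder path on the file server
    "pressure": "pressure.csv",        # placeholder path on the file server
}

frames = []
for name, path in variable_files.items():
    # Assume each CSV has a "time" column at the same resolution.
    df = pd.read_csv(path, parse_dates=["time"]).set_index("time")
    frames.append(df.add_prefix(f"{name}_"))

# Inner-join on the shared time index so every variable lands in one wide table.
table = pd.concat(frames, axis=1, join="inner").reset_index()

# pandas writes Parquet via pyarrow (or fastparquet).
table.to_parquet("demo.parquet", index=False)
```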
Just FYI, I have some notes on how we think about tabular data for the Planetary Computer: https://gist.github.com/TomAugspurger/457a2288f6ef7490ab87546faf665e14
Thanks Tom, this is great!
Thank you!
So far we basically only have NetCDF-to-Zarr recipes (or other formats that Xarray can read, e.g. GRIB).
Some recipes will want to work with tabular data, e.g. transforming a collection of CSVs to Parquet. (Example: pangeo-forge/staged-recipes#3)
This will require an entirely new recipe class. Creating this class will force us to refactor the recipe module significantly. This will be laborious but hopefully relatively straightforward.
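Purely as a rough sketch (not an existing pangeo-forge-recipes API), a CSV-to-Parquet pipeline expressed as Beam transforms might look something like the following; the URLs, target directory, and helper function are all placeholders.

```python
# Rough sketch, not an existing pangeo-forge-recipes API: the URLs, target
# directory, and helper function below are placeholders.
import apache_beam as beam
import pandas as pd


def csv_to_parquet(indexed_url, target_root):
    """Fetch one CSV and rewrite it as a Parquet part file."""
    index, url = indexed_url
    df = pd.read_csv(url)
    out_path = f"{target_root}/part-{index:05d}.parquet"
    df.to_parquet(out_path)
    return out_path


urls = [  # placeholder source files
    "https://example.org/data-000.csv",
    "https://example.org/data-001.csv",
]

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "List inputs" >> beam.Create(list(enumerate(urls)))
        | "CSV to Parquet" >> beam.Map(csv_to_parquet, target_root="./parquet-parts")
    )
```

A real recipe would presumably also need transforms for combining many small CSVs into appropriately sized Parquet row groups, which is part of the design discussion above.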