CSV to Parquet recipe #94
Comments
I'm sitting here with @einatlev-ldeo at the EarthCube Annual Meeting in La Jolla. We are discussing if/how we may be able to provide cloud-optimized access to (at least some subset of) the data provided on the project's file server via Pangeo Forge. Based on our discussions, this seems like a great use case for a Parquet recipe. It strikes me that once we complete the work scoped in #376, writing a Parquet recipe may become quite approachable (really just a few additional PTransforms). While we're waiting for the first phase of the Beam work to complete, perhaps we can start brainstorming what data objects would make sense to assemble from these raw data. For example, is there a set of variables with the same time resolution that could all fit together in a single large table? If so, what are those variables and their access paths on the file server? Can we assemble a demonstration CSV from them using a simple standalone Python script? If so, that would be a very useful basis for building a larger table with Pangeo Forge. Side note: there's some awesome webcam data available through the same project. I wonder what ARCO format might be suitable for webcam time series data?
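As a rough illustration of the "simple standalone Python script" idea above, here is a minimal sketch that joins a few same-resolution variables into one table and writes it to Parquet. The file paths, variable names, and the shared `time` column are placeholder assumptions, not the project's actual data layout.

```python
# Hypothetical sketch only: paths, variable names, and the shared "time"
# column are placeholders, not the project's actual file layout.
import pandas as pd

variable_files = {
    "temperature": "temperature.csv",  # placeholder path on the file server
    "pressure": "pressure.csv",        # placeholder path on the file server
}

frames = []
for name, path in variable_files.items():
    # Assume each CSV has a "time" column at the same resolution.
    df = pd.read_csv(path, parse_dates=["time"]).set_index("time")
    frames.append(df.add_prefix(f"{name}_"))

# Inner-join on the shared time index so every variable lands in one wide table.
table = pd.concat(frames, axis=1, join="inner").reset_index()

# pandas writes Parquet via pyarrow (or fastparquet).
table.to_parquet("demo.parquet", index=False)
```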
Just FYI, I have some notes on how we think about tabular data for the Planetary Computer: https://gist.github.com/TomAugspurger/457a2288f6ef7490ab87546faf665e14
Thanks Tom, this is great!
Thank you!
So far we basically only have NetCDF-to-Zarr recipes (or other formats that Xarray can read, e.g. GRIB).
Some recipes will want to work with tabular data, e.g. transforming a collection of CSVs to Parquet. (Example: pangeo-forge/staged-recipes#3)
This will require an entirely new recipe class. Creating this class will force us to refactor the recipe module significantly. This will be laborious but hopefully relatively straightforward.
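Purely as a rough sketch (not an existing pangeo-forge-recipes API), a CSV-to-Parquet pipeline expressed as Beam transforms might look something like the following; the URLs, target directory, and helper function are all placeholders.

```python
# Rough sketch, not an existing pangeo-forge-recipes API: the URLs, target
# directory, and helper function below are placeholders.
import apache_beam as beam
import pandas as pd


def csv_to_parquet(indexed_url, target_root):
    """Fetch one CSV and rewrite it as a Parquet part file."""
    index, url = indexed_url
    df = pd.read_csv(url)
    out_path = f"{target_root}/part-{index:05d}.parquet"
    df.to_parquet(out_path)
    return out_path


urls = [  # placeholder source files
    "https://example.org/data-000.csv",
    "https://example.org/data-001.csv",
]

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "List inputs" >> beam.Create(list(enumerate(urls)))
        | "CSV to Parquet" >> beam.Map(csv_to_parquet, target_root="./parquet-parts")
    )
```

A real recipe would presumably also need transforms for combining many small CSVs into appropriately sized Parquet row groups, which is part of the design discussion above.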