Workflow(s) discussion #61
My workflow looks very similar. The first part of my workflow I call
When you go to form your 1x1 degree data cubes, do you find it takes a while to read in all of the individual files? Read speed is my main motivation for building single file
I think I'm going to move away from Arrow files in favor of JLD files so that I can take advantage of `Fill` array compression. I'm also tempted to adopt your single file approach if it's not too costly at read time.
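For reference, the `Fill` trick only pays off if the serializer stores the struct rather than the elements; a minimal sketch assuming JLD2.jl and FillArrays.jl, with a hypothetical constant `height` column:

```julia
using JLD2, FillArrays

# A constant column stored as a Fill serializes as (value, size), not element-wise.
height = Fill(0.0f0, 1_000_000)
jldsave("tile.jld2"; height)
load("tile.jld2", "height")  # round-trips as a lazy Fill, ~O(1) on disk
```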
Note that you're using Arrow, and I'm using (Geo)Parquet, which are related but different formats; JLD2 is yet another format. I expect Arrow to be faster than Parquet, but I didn't run any benchmarks. Without benchmarks it's all gut feeling, which tells me that reading a directory full of Parquet files is super fast (and probably I/O limited), especially when compared to the rest of the pipeline afterwards, let alone the preprocessing done to get those files.
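For what it's worth, the gut feeling is easy to turn into a number; a minimal sketch assuming Parquet2.jl and DataFrames.jl, with a hypothetical `tiles/` directory of `.pq` files:

```julia
using Parquet2, DataFrames

# Read every Parquet tile in a directory into one DataFrame.
function read_tile_dir(dir)
    files = filter(endswith(".pq"), readdir(dir; join=true))
    reduce(vcat, [DataFrame(Parquet2.Dataset(f)) for f in files])
end

@time read_tile_dir("tiles/")  # compare against a single-file read of the same data
```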
I wish there were a way to build the 1x1 degree tiles without so much duplication of the data, i.e. raw -> parameter subset -> spatial subset.
What's keeping you from going from raw h5 to spatial subset? |
I guess it's file management and offset calculations |
I do something like this (with some extra processing, and checking whether files already exist before reading the points):

```julia
using SpaceLiDAR, DataFrames  # `subset` comes from DataFrames

# Enumerate the 1x1 degree tiles intersecting a (min_x, min_y, max_x, max_y) box.
function tiles_from_box(box)
    min_x, min_y, max_x, max_y = box
    lons = round(Int, min_x, RoundDown):1:round(Int, max_x, RoundUp)
    lats = round(Int, min_y, RoundDown):1:round(Int, max_y, RoundUp)
    collect(Iterators.product(lons, lats))  # collected so `vec` below works
end

function to_tile(granule)
    box = SpaceLiDAR.bounds(granule)
    tiles = vec(tiles_from_box(box))
    filter!(Base.Fix2(in, tile_selection), tiles)  # if you don't want all tiles
    table = SpaceLiDAR.points(granule)
    for tile in tiles
        t = subset(table, #= something spatial, see sketch below =#; view=true)
        # process subset of table
        # save to tile.pq
    end
end
```
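For the spatial placeholder above, one possible predicate on coordinate columns; a minimal sketch assuming DataFrames.jl and that the points table has `:longitude` and `:latitude` columns (the column names are an assumption), plugging into the loop above:

```julia
using DataFrames

# True when a point falls in the 1x1 degree tile with lower-left corner `tile`.
in_tile(tile) = ByRow((lon, lat) ->
    tile[1] <= lon < tile[1] + 1 && tile[2] <= lat < tile[2] + 1)

t = subset(DataFrame(table), [:longitude, :latitude] => in_tile(tile); view=true)
```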
@evetion all of my global workflows tend to converge to a geographic (lat, lon) tiling approach. I find that I have some tooling to handle table data that has been separated into tiles. Some of the tools that I have are:
I'm going to split this off as a package for my own work, but before I do, I was wondering if you might be interested in a package like this as well? I figure it could serve as an API for tiled approaches to global data processing and the respective tooling. I suspect I will add DTables under the hood at some point. Here's the start; let me know if you have any opinions before I go too far: https://github.com/alex-s-gardner/GeoTiles.jl The only requirements for a GeoTile are that the data is saved in a DataFrame-compatible format, that the data contains latitude and longitude columns, and that the suffix of the file name follows the GeoTile convention.
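To make the filename requirement concrete, here's a hypothetical illustration of a GeoTile-style suffix for a 1x1 degree tile; the actual GeoTiles.jl convention may differ:

```julia
using Printf

# Hypothetical tile id built from the lower-left (min) corner of a 1x1 degree tile.
geotile_id(min_lat::Int, min_lon::Int) = @sprintf("lat[%+03d]lon[%+04d]", min_lat, min_lon)

geotile_id(46, -122)  # "lat[+46]lon[-122]"
```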
Let's gather the workflows that we use here, to have a better shared understanding of the tools and the impact of code changes on them.
I download all ATL08/L2A .h5 files, using the `search` function, then `write_urls`, and download them to disk (using the external aria2c tooling). I use the before/after search functionality to include/exclude date ranges. `granules_from_folder` is used to read all existing filenames locally and exclude them from the list to download.

After having all these files locally, I process them using `points`/`classify` and save them to individual GeoParquet files. I also write empty files, so I can again keep track of progress.

Then there's the special case of tiling these GeoParquet files into 1x1 degree tiles, in which I save a (again possibly empty) GeoParquet file to a 1x1 degree directory if the bounding box of the data intersects with the 1x1 tile.
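A rough sketch of that filename-based bookkeeping; `search`, `write_urls`, `granules_from_folder`, and `points` are the SpaceLiDAR functions named above, but their exact signatures, the granule `id` field, and the Parquet writer call are assumptions here:

```julia
using SpaceLiDAR, DataFrames, Parquet2

# Skip granules already on disk, then write the remaining URLs for aria2c.
granules = search(:ICESat2, :ATL08)                      # assumed signature
have = Set(g.id for g in granules_from_folder("h5/"))
SpaceLiDAR.write_urls("urls.txt", filter(g -> g.id ∉ have, granules))
# external: aria2c -i urls.txt -d h5/

# One file per granule; empty files mark finished granules.
for g in granules_from_folder("h5/")
    out = joinpath("pq", replace(g.id, r"\.h5$" => ".parquet"))
    isfile(out) && continue                              # already processed
    df = DataFrame(points(g))                            # assumed table-like
    isempty(df) ? touch(out) : Parquet2.writefile(out, df)  # stand-in for a GeoParquet writer
end
```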
I have tried to keep an external index up to date with all h5 files, but it never really worked out, so now everything is just filename based. Bit ugly, but it works for now. That might change if I want to process all files in one go, streaming the world let's say, which would require a spatial index on all points.