
Workflow(s) discussion #61

Open · evetion opened this issue Sep 5, 2023 · 9 comments

@evetion (Owner) commented Sep 5, 2023
Let's gather the workflows that we use here, to have a better shared understanding of the tools and the impact of code changes on them.

I download all ATL08/L2A .h5 files using the search function, then write_urls, and download them to disk (using the external aria2c tooling). I use the before/after search functionality to include/exclude date ranges. granules_from_folder is used to read all existing filenames locally and exclude them from the list to download.
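As a rough sketch, that step looks something like the code below. The function names are the ones mentioned above, but the exact signatures (and the aria2c flags) are assumptions rather than the verified SpaceLiDAR.jl API:

using SpaceLiDAR
using Dates

# Find all ATL08 granules in a date range (before/after keywords assumed)
granules = search(:ICESat2, :ATL08; after = DateTime(2022, 1, 1))

# Exclude granules whose .h5 files are already on disk
have = Set(g.id for g in granules_from_folder("data/"))
todo = filter(g -> g.id ∉ have, granules)

# Write the remote URLs to a file and hand them to the external aria2c tool
write_urls("urls.txt", todo)
run(`aria2c -i urls.txt -d data/ -c -j 4`)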

After having all these files locally I process them using points/classify and save them to individual GeoParquet files. I also write empty files, so I can again keep track of progress.
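A minimal sketch of that processing loop, assuming classify returns a Tables.jl-compatible table and using Parquet2.writefile as a stand-in for the GeoParquet writer:

using SpaceLiDAR, DataFrames, Parquet2

for granule in granules_from_folder("data/")
    name = first(splitext(granule.id))  # strip .h5 if the id includes it
    out = joinpath("parquet", name * ".parquet")
    isfile(out) && continue  # filename-based progress tracking

    df = DataFrame(classify(granule))
    # An empty DataFrame is still written, so finished granules are skipped next run
    Parquet2.writefile(out, df)
end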

Then there's the special case of tiling these GeoParquet files into 1x1 degree tiles, in which I save a GeoParquet file (again, also empty ones) to a 1x1 degree directory if the bounding box of the data intersects the 1x1 tile.

I have tried to keep an external index up to date with all h5 files, but it never really worked out, so now everything is just filename based. A bit ugly, but it works for now. That might change if I want to process all files in one go, streaming the world, let's say, which would require a spatial index on all points.

@alex-s-gardner (Collaborator)

My workflow looks very similar. The first part of my workflow, which I call build_archive, does the following steps for each 2x2 degree tile (see the sketch after this list):
[1] geotile_search_granules: find all granules [remote] that intersect the geotile and save them as an Arrow table
[2] granules_load: load the list of granules [remote]
[3] geotile_download_granules!: download all granules in the list using aria2c or Downloads.jl [skips files already on disk] and save the local granule list to an Arrow file
[4] granules_load: load the local granule list
[5] geotile_build: read, crop, and append local granules into a single DataFrame and save it as a single Arrow file. When run a second time, the function identifies only the missing granules and appends them to the existing Arrow file.
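A hypothetical driver tying these steps together (all geotile_* names and signatures are assumptions based on the list above, not a published API):

# Sketch of the build_archive pipeline; product and directory names are placeholders
function build_archive(geotiles; dir = "archive/")
    for tile in geotiles
        remote_list = geotile_search_granules(tile, :ICESat2, :ATL06, dir)  # [1]
        granules = granules_load(remote_list)                               # [2]
        local_list = geotile_download_granules!(granules, dir)              # [3]
        locals = granules_load(local_list)                                  # [4]
        geotile_build(locals, tile, dir)                                    # [5]
    end
end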

@alex-s-gardner (Collaborator)

When you go to form your 1x1 degree data cubes, do you find it takes a while to read in all of the individual files? Read speed is my main motivation for building single-file geotile Arrow files... they are a pain to maintain though, since appending is slow, and each time they are updated there are several downstream steps that need to be re-run, like Bool mask creation and DEM extraction.
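For what it's worth, the comparison is easy to time; a sketch, assuming a directory of per-granule Parquet files next to one consolidated Arrow file (paths are placeholders):

using Arrow, DataFrames, Parquet2

# Many small Parquet files -> one DataFrame
@time df_many = reduce(vcat,
    [DataFrame(Parquet2.Dataset(joinpath("tiles", f)))
     for f in filter(endswith(".parquet"), readdir("tiles"))])

# One consolidated Arrow file (memory-mapped, so typically much faster to open)
@time df_one = DataFrame(Arrow.Table("geotile.arrow"))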

@alex-s-gardner (Collaborator)

I think I'm going to move away from Arrow files in favor of JLD2 files so that I can take advantage of Fill array compression. I'm also tempted to adopt your single-file approach if it's not too costly at read time.
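For reference, the JLD2 round trip looks roughly like this; the appeal is that a FillArrays.Fill serializes as its value plus a size rather than as a materialized array (compression support may additionally require CodecZlib.jl):

using JLD2, FillArrays

mask = Fill(false, 10_000_000)  # stored as one value + a size, not 10M booleans
jldsave("geotile.jld2", true; mask)  # second argument enables compression

mask2 = load("geotile.jld2", "mask")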

@evetion (Owner) commented Sep 6, 2023

Note that you're using Arrow, and I'm using (Geo)Parquet, which are (related but) different formats. JLD2 is yet another format. I expect Arrow to be faster (vs Parquet), but I didn't run any benchmarks.

Without benchmarks it's all gut feeling, which tells me that reading a directory full of Parquet files is super fast (and probably I/O limited), especially when compared to the rest of the pipeline afterwards, let alone the preprocessing done to get those files.

@alex-s-gardner (Collaborator)

I wish there were a way to build the 1x1 degree tiles without so much duplication of the data, i.e. raw -> parameter subset -> spatial subset.

@evetion (Owner) commented Sep 7, 2023

What's keeping you from going from raw h5 to spatial subset?

@alex-s-gardner (Collaborator)

> What's keeping you from going from raw h5 to spatial subset?

I guess it's file management and offset calculations

@evetion (Owner) commented Sep 7, 2023

I do something like this (with some extra processing and checking whether files already exist before reading the points).

using DataFrames
import SpaceLiDAR

function tiles_from_box(box)
    min_x, min_y, max_x, max_y = box
    lons = round(Int, min_x, RoundDown):1:round(Int, max_x, RoundUp)
    lats = round(Int, min_y, RoundDown):1:round(Int, max_y, RoundUp)
    Iterators.product(lons, lats)
end

function to_tile(granule)
    box = SpaceLiDAR.bounds(granule)
    tiles = vec(collect(tiles_from_box(box)))  # collect first: vec needs an array
    filter!(Base.Fix2(in, tile_selection), tiles)  # if you don't want all tiles

    table = SpaceLiDAR.points(granule)
    for tile in tiles
        sub = subset(table, #= something spatial =#; view = true)
        # process subset of table
        # save to tile.pq
    end
end
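Called per granule it could look like this (tile_selection and the threading are just an example):

tile_selection = Set([(5, 52), (6, 52)])  # e.g. two 1x1 tiles over the Netherlands
Threads.@threads for granule in granules_from_folder("data/")
    to_tile(granule)
end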

@alex-s-gardner (Collaborator) commented Oct 23, 2023

@evetion all of my global workflows tend to converge on a geographic (lat, lon) tiling approach. I find that I have accumulated some tooling to handle table data that has been separated into tiles. Some tools that I have are (see the sketch after the list):

  1. systematic naming of tiles and building of extents
  2. loading data given an Extent, accounting for cross-tile aggregation
  3. switching between local UTM and lat/lon
  4. splitting data into tiles
  5. getting extents from a filename
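As an illustration of items 1 and 5, a name/extent round trip could look like the sketch below; the lat[...]lon[...] naming scheme here is just an example, not necessarily what GeoTiles.jl settles on, and Extents.jl provides the Extent type:

using Extents, Printf

# Hypothetical naming: "lat[+52+54]lon[+004+006]" for a 2x2 degree tile
tile_name(min_lon, min_lat; width = 2) =
    @sprintf("lat[%+03d%+03d]lon[%+04d%+04d]",
             min_lat, min_lat + width, min_lon, min_lon + width)

function tile_extent(name)
    m = match(r"lat\[([+-]\d+)([+-]\d+)\]lon\[([+-]\d+)([+-]\d+)\]", name)
    lat1, lat2, lon1, lon2 = parse.(Int, m.captures)
    Extent(X = (lon1, lon2), Y = (lat1, lat2))
end

name = tile_name(4, 52)   # "lat[+52+54]lon[+004+006]"
ext = tile_extent(name)   # Extent(X = (4, 6), Y = (52, 54))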

I'm going to split this off as a package for my own work, but before I do I was wondering if you might be interested in a package like this as well? I figure it could serve as an API for tiled approaches to global data processing and the respective tooling. I suspect I will add DTables under the hood at some point.

Here's the start, let me know if you have any opinions before I go too far: https://github.com/alex-s-gardner/GeoTiles.jl

The only requirements for a GeoTile are that the data is saved in a DataFrame-compatible format, that it contains latitude and longitude columns, and that the suffix of the file name follows the GeoTile convention.
