# Initial georeference ingests #1
I spent some time looking at loading the Microsoft building footprint dataset. For a few reasons, I think it is an interesting test case here.
Some working notes:

### Loading the data

The data is distributed as zipped GeoJSON. This is not ideal: geoparquet or gpkg would likely be superior for a dataset of this size (indeed, the MS Planetary Computer version of the dataset is geoparquet). But GDAL/ogr2ogr can handle that no problem, with something like:

```bash
ogr2ogr -makevalid -f Parquet California.parquet /vsizip/California.geojson.zip
```

This can then be loaded with the Python BigQuery client libraries; a fairly straightforward implementation lives in data-orchestration/dags/common/geo.py (lines 12 to 63 at 680be6f).
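The geo.py snippet itself isn't reproduced here; a minimal sketch of that kind of loader, assuming geopandas and the table name used in the queries below, might look like:

```python
import geopandas as gpd
from google.cloud import bigquery

client = bigquery.Client()

# Read the parquet produced by ogr2ogr; to_wkt() converts the geometry
# column to WKT strings, which BigQuery can load into a GEOGRAPHY column.
gdf = gpd.read_parquet("California.parquet")
df = gdf.to_wkt()

job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("geometry", "GEOGRAPHY")],
    write_disposition="WRITE_TRUNCATE",
)
# Note: clustering/partitioning is not configured here, which is the
# limitation discussed below.
client.load_table_from_dataframe(
    df, "caldata-sandbox.geo_reference.california", job_config=job_config
).result()  # block until the load job completes
```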
But that doesn't provide a way to set up clustering or partitioning, so I need to drop to a bit of a lower level to get that working (code WIP, will open a PR later). BigQuery allows you to load geospatial data using line-delimited GeoJSON. So, again with ogr2ogr the following would work:
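The exact commands aren't shown here; a sketch of that pipeline (the specific flags are my reconstruction, not confirmed by this issue):

```bash
# Convert the zipped GeoJSON to line-delimited GeoJSON (GeoJSONSeq)
ogr2ogr -makevalid -f GeoJSONSeq California.geojsonl /vsizip/California.geojson.zip

# Load with an autodetected schema, clustering on the geography column;
# --json_extension=GEOJSON tells BigQuery to parse GeoJSON geometries
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --json_extension=GEOJSON \
  --autodetect \
  --clustering_fields=geometry \
  geo_reference.california \
  California.geojsonl
```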
This might be a good solution longer-term, especially if we can get parquet working instead of line-delimited GeoJSON. However, I suspect that dealing with idempotency, error handling, etc. would make a bash-driven approach pretty annoying, and we might fall back to using the Python clients. Additionally, the line-delimited GeoJSON approach took about twice as long as the Python+Parquet approach (possibly this is just network bandwidth, but I'm not certain).

### Using the data

I primarily wanted to test that geography clustering works as expected (allowing for faster and cheaper data scans). In so doing, I learned something about how BQ (and Snowflake) optimize their queries! The following query is able to take advantage of clustering and scan less data:

```sql
SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  ST_BUFFER(ST_GEOGPOINT(-122.297327, 37.867691), 1000)
)
```

But the following query scans the whole table:

```sql
SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  (SELECT geometry FROM `caldata-sandbox.geo_reference.incorporated_cities` WHERE CITY = 'Berkeley')
)
```

This is because the query optimizer only does a single pass: at planning time it cannot know that the scalar subquery returns a small area, so it cannot use the clustering to prune the scan. The solution is to break the query into two parts, so that the second statement is planned with a known geography and can take advantage of clustering:

```sql
DECLARE berkeley GEOGRAPHY;
SET berkeley = (SELECT geometry FROM `caldata-sandbox.geo_reference.incorporated_cities` WHERE CITY = 'Berkeley');
SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  berkeley
)
```

BigQuery has some general guidance here which addresses the issue for partitioned tables (and it seems to apply to clustering as well). Snowflake has a nice community article describing more-or-less the same thing.

### Orchestration bake-off

For some of the reasons outlined in cagov/data-infrastructure#4 (comment), I think this might be a good target for a data-orchestration tool bake-off. Even though we are probably only interested in the California subset of the data, loading the whole dataset would allow us to test some things around idempotency, control flow, concurrency, etc. So I think my initial PR here will be focused on California, but will keep the door open for a larger ingest to test some tooling.
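As an aside on the clustering test above, here is a sketch (mine, not from this issue) of how one might measure what each query actually scans. Dry-run estimates are only an upper bound for clustered tables, so it runs the queries and reads the final byte counts:

```python
from google.cloud import bigquery

client = bigquery.Client()

queries = {
    "buffer literal (cluster-friendly)": """
        SELECT geometry
        FROM `caldata-sandbox.geo_reference.california`
        WHERE ST_INTERSECTS(
          geometry,
          ST_BUFFER(ST_GEOGPOINT(-122.297327, 37.867691), 1000)
        )""",
    "scalar subquery (full scan)": """
        SELECT geometry
        FROM `caldata-sandbox.geo_reference.california`
        WHERE ST_INTERSECTS(
          geometry,
          (SELECT geometry
           FROM `caldata-sandbox.geo_reference.incorporated_cities`
           WHERE CITY = 'Berkeley')
        )""",
}

# Disable the cache so the byte counts reflect a real scan
config = bigquery.QueryJobConfig(use_query_cache=False)
for name, sql in queries.items():
    job = client.query(sql, job_config=config)
    job.result()  # wait for completion so the byte count is final
    print(f"{name}: {job.total_bytes_processed:,} bytes scanned")
```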
Selecting a few datasets from the state ArcGIS Online to work out a repeatable pattern.
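One possible shape for that pattern (the service URL below is a placeholder, not a real endpoint from this issue): ArcGIS Online layers can be queried as GeoJSON through the FeatureServer REST API, which GDAL/geopandas can read directly.

```python
import geopandas as gpd

# Placeholder endpoint: a real ingest would point at a specific layer in
# the state's ArcGIS Online organization.
URL = (
    "https://services.arcgis.com/<org-id>/arcgis/rest/services"
    "/<layer>/FeatureServer/0/query"
    "?where=1%3D1&outFields=*&f=geojson"
)

# GDAL's GeoJSON driver (via geopandas) reads straight from the URL;
# from here the frame can be loaded with the same BigQuery code as above.
gdf = gpd.read_file(URL)
```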