
Initial georeference ingests #1

Closed
ian-r-rose opened this issue Dec 13, 2022 · 2 comments

ian-r-rose commented Dec 13, 2022

Selecting a few from the state ArcGIS Online to work out a repeatable pattern.

@ian-r-rose ian-r-rose self-assigned this Dec 13, 2022
@ian-r-rose ian-r-rose transferred this issue from cagov/data-infrastructure Dec 15, 2022

ian-r-rose commented Jan 4, 2023

I spent some time looking at loading the Microsoft building footprint dataset. There are a few reasons I think it is an interesting test case here:

  1. It is relatively large, to the point where ingest requires larger instance types, and probably several of them if we want to load the whole dataset.
  2. It is usefully partitioned across states, but due to the vast differences in state size and urbanization the partitions are imbalanced, so naive assumptions about evenly sized partitions would not hold here.
  3. It is semantically relatively simple.
  4. It's large enough that clustering is useful.

Some working notes:

Loading the data

The data is distributed as zipped GeoJSON. This is not ideal: GeoParquet or GeoPackage would likely be superior for a dataset of this size (indeed, the Microsoft Planetary Computer version of the dataset is GeoParquet). But GDAL/ogr2ogr can handle it without a problem, with something like:

ogr2ogr -makevalid -f Parquet California.parquet /vsizip/California.geojson.zip

This can then be loaded with either the Python BigQuery client libraries or the bq command line tool. The main difficulty here is that getting ogr2ogr + GDAL installed on the orchestration workers is a bit of a lift. Not impossible, just not something I have set up today.
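
A minimal sketch of the Python-client side of that load might look something like the following. The project and table names are hypothetical, and depending on how ogr2ogr encodes the geometry column, an explicit GEOGRAPHY schema may be needed rather than relying on BigQuery's handling of the Parquet metadata:

from google.cloud import bigquery

client = bigquery.Client(project="caldata-sandbox")  # hypothetical project

# Load the Parquet file produced by ogr2ogr above into a (hypothetical) table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
with open("California.parquet", "rb") as f:
    job = client.load_table_from_file(
        f, "caldata-sandbox.geo_reference.california", job_config=job_config
    )
job.result()  # block until the load job finishes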

A fairly straightforward gdf.to_gbq does work, following the approach here:

from typing import Literal
import warnings

import geopandas
import shapely.ops


def gdf_to_bigquery(
    gdf: geopandas.GeoDataFrame,
    destination_table: str,
    project_id: str | None = None,
    chunksize: int | None = None,
    if_exists: Literal["fail", "replace", "append"] = "fail",
    table_schema: list[dict] | None = None,
    location: str | None = None,
) -> None:
    """
    A wrapper around `gdf.to_gbq` which does some additional validation and ensures
    that the geometry objects are loaded correctly. For documentation of parameters,
    see `gdf.to_gbq`.
    """
    assert isinstance(gdf, geopandas.GeoDataFrame)
    if gdf.crs.srs != "epsg:4326":
        warnings.warn(
            f"GeoDataFrame has a projected coordinate system {gdf.crs.srs},"
            " but BigQuery only supports WGS84. Converting to that."
        )
        gdf = gdf.to_crs(epsg=4326)  # type: ignore
    assert gdf.crs.srs == "epsg:4326"
    # Identify the geometry columns and create a table schema that marks them
    # as GEOGRAPHY.
    if table_schema is None:
        table_schema = [
            {"name": name, "type": "GEOGRAPHY"}
            for name, dtype in gdf.dtypes.items()
            if isinstance(dtype, geopandas.array.GeometryDtype)
        ]
    # Ensure that the geometry columns are properly oriented and valid geometries.
    # GBQ is fairly unforgiving, so this step is often required.
    gdf = gdf.assign(
        **{
            name: gdf[name].make_valid().apply(shapely.ops.orient, args=(1,))
            for name, dtype in gdf.dtypes.items()
            if isinstance(dtype, geopandas.array.GeometryDtype)
        }
    )  # type: ignore
    # Write to BigQuery!
    gdf.to_gbq(
        destination_table=destination_table,
        project_id=project_id,
        if_exists=if_exists,
        table_schema=table_schema,
        chunksize=chunksize,
        location=location,
    )
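
As a usage example, loading the California partition with a made-up table name would look something like this (geopandas reads the zipped GeoJSON directly via GDAL's /vsizip/ handler):

gdf = geopandas.read_file("/vsizip/California.geojson.zip")
gdf_to_bigquery(
    gdf,
    destination_table="geo_reference.california",  # hypothetical dataset.table
    project_id="caldata-sandbox",
    if_exists="replace",
)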

But that doesn't provide a way to set up clustering or partitioning, so I need to drop to a bit of a lower level to get that working (code WIP, will open a PR later).
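
Not the WIP code itself, but a rough sketch of that lower level might look like the following, using the google-cloud-bigquery client directly so that clustering_fields can be set on the load job. The table name is hypothetical, and it assumes a WKT-encoded geometry column is accepted for a GEOGRAPHY field:

import geopandas
from google.cloud import bigquery

client = bigquery.Client(project="caldata-sandbox")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("geometry", "GEOGRAPHY")],
    clustering_fields=["geometry"],  # cluster the table on the geography column
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Read the zipped GeoJSON and serialize the geometry column to WKT so it can
# round-trip through the client's dataframe loader.
gdf = geopandas.read_file("/vsizip/California.geojson.zip")
df = gdf.assign(geometry=gdf.geometry.to_wkt())

job = client.load_table_from_dataframe(
    df, "caldata-sandbox.geo_reference.california", job_config=job_config
)
job.result()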

BigQuery allows you to load geospatial data using line-delimited GeoJSON. So, again with ogr2ogr the following would work:

ogr2ogr -makevalid -f GeoJSONSeq California.gjsq /vsizip/California.geojson.zip
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --json_extension=GEOJSON \
  --autodetect \
  --project_id=<project-id> \
  geo_reference.california_gjsq \
  California.gjsq

This might be a good solution longer-term, especially if we can get Parquet working instead of line-delimited GeoJSON. However, I suspect that dealing with idempotency, error handling, etc. would make a bash-driven approach pretty annoying, and we might fall back to using the Python clients. Additionally, the line-delimited GeoJSON approach took about twice as long as the Python+Parquet approach (possibly this is just network bandwidth, but I'm not certain).
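
If we do fall back to the Python clients for this path, the bq invocation above can presumably be expressed along these lines (assuming a recent google-cloud-bigquery release that exposes the jsonExtension load option; names are again illustrative):

from google.cloud import bigquery

client = bigquery.Client(project="caldata-sandbox")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    json_extension="GEOJSON",  # interpret each line as a GeoJSON feature
    autodetect=True,
)
with open("California.gjsq", "rb") as f:
    job = client.load_table_from_file(
        f, "caldata-sandbox.geo_reference.california_gjsq", job_config=job_config
    )
job.result()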

Using the data

I primarily wanted to test that geography clustering works as expected (allowing for faster and cheaper data scans). In so doing, I learned something about how BQ (and Snowflake) optimize their queries!

So the following query is able to take advantage of clustering and scan less data:

SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  ST_BUFFER(ST_GEOGPOINT(-122.297327, 37.867691), 1000)
) 

But the following query scans the whole table:

SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  (SELECT geometry FROM `caldata-sandbox.geo_reference.incorporated_cities` WHERE CITY = 'Berkeley')
) 

This is because the query optimizer only does a single pass: it doesn't know ahead of time that the scalar subquery returns a small area, so it can't take advantage of clustering.

The solution is to break the query into two parts, allowing the second part to be planned with the clustering taken into account:

DECLARE berkeley GEOGRAPHY;
SET berkeley = (SELECT geometry FROM `caldata-sandbox.geo_reference.incorporated_cities` WHERE CITY = 'Berkeley');
SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  berkeley
) 

BigQuery has some general guidance here which addresses the issue for partitioned tables (and seems to apply to clustering as well). Snowflake has a nice community article describing more-or-less the same thing.

Orchestration bake-off

For some of the reasons outlined in cagov/data-infrastructure#4 (comment), I think this might be a good target for a data orchestration tool bake-off.

Even though we are probably only interested in the California subset of the data, loading the whole dataset would allow us to test some things around idempotency, control flow, concurrency, etc. So I think my initial PR here will be focused on California, but will keep the door open for a larger ingest to test some tooling.

This was referenced Jan 24, 2023
@ian-r-rose (Member Author)

Fixed by #16 and #17.
