
Initial georeference ingests #1

Closed
ian-r-rose opened this issue Dec 13, 2022 · 2 comments

ian-r-rose commented Dec 13, 2022

Selecting a few from the state ArcGIS Online to work out a repeatable pattern.

@ian-r-rose ian-r-rose self-assigned this Dec 13, 2022
@ian-r-rose ian-r-rose transferred this issue from cagov/data-infrastructure Dec 15, 2022

ian-r-rose commented Jan 4, 2023

I spent some time looking at loading the Microsoft building footprint dataset. There are a few reasons I think it is an interesting test case here:

  1. It is relatively large, to the point where ingest requires larger instance types, and probably several of them if we want to load the whole dataset.
  2. It is usefully partitioned across states, but due to the vast differences in state size and urbanization the partitions are imbalanced, so naive assumptions about evenly sized partitions would not hold here.
  3. It is semantically relatively simple.
  4. It's large enough that clustering is useful.

Some working notes:

Loading the data

The data is distributed as zipped GeoJSON. This is not ideal: GeoParquet or GeoPackage would likely be superior for a dataset of this size (indeed, the Microsoft Planetary Computer version of the dataset is GeoParquet). But GDAL/ogr2ogr can handle it without a problem, with something like:

ogr2ogr -makevalid -f Parquet California.parquet /vsizip/California.geojson.zip

This can then be loaded with either the Python BigQuery client libraries or the bq command line tool. The main difficulty here is that getting ogr2ogr + GDAL installed on the orchestration workers is a bit of a lift. Not impossible, just not something I have set up today.
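
A minimal sketch of the Python-client side of that load might look something like the following. The project and table names are hypothetical, and depending on how ogr2ogr encodes the geometry column, an explicit GEOGRAPHY schema may be needed rather than relying on BigQuery's handling of the Parquet metadata:

from google.cloud import bigquery

client = bigquery.Client(project="caldata-sandbox")  # hypothetical project

# Load the Parquet file produced by ogr2ogr above into a (hypothetical) table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
with open("California.parquet", "rb") as f:
    job = client.load_table_from_file(
        f, "caldata-sandbox.geo_reference.california", job_config=job_config
    )
job.result()  # block until the load job finishes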

A fairly straightforward gdf.to_gbq does work, following the approach here:

from typing import Literal
import warnings

import geopandas
import shapely.ops


def gdf_to_bigquery(
    gdf: geopandas.GeoDataFrame,
    destination_table: str,
    project_id: str | None = None,
    chunksize: int | None = None,
    if_exists: Literal["fail", "replace", "append"] = "fail",
    table_schema: list[dict] | None = None,
    location: str | None = None,
) -> None:
    """
    A wrapper around `gdf.to_gbq` which does some additional validation and ensures
    that the geometry objects are loaded correctly. For documentation of parameters,
    see `gdf.to_gbq`.
    """
    assert isinstance(gdf, geopandas.GeoDataFrame)
    if gdf.crs.srs != "epsg:4326":
        warnings.warn(
            f"GeoDataFrame has a projected coordinate system {gdf.crs.srs},"
            " but BigQuery only supports WGS84. Converting to that."
        )
        gdf = gdf.to_crs(epsg=4326)  # type: ignore
    assert gdf.crs.srs == "epsg:4326"
    # Identify the geometry columns and create a table schema that marks them
    # as GEOGRAPHY.
    if table_schema is None:
        table_schema = [
            {"name": name, "type": "GEOGRAPHY"}
            for name, dtype in gdf.dtypes.items()
            if isinstance(dtype, geopandas.array.GeometryDtype)
        ]
    # Ensure that the geometry columns are properly oriented and valid geometries.
    # GBQ is fairly unforgiving, so this step is often required.
    gdf = gdf.assign(
        **{
            name: gdf[name].make_valid().apply(shapely.ops.orient, args=(1,))
            for name, dtype in gdf.dtypes.items()
            if isinstance(dtype, geopandas.array.GeometryDtype)
        }
    )  # type: ignore
    # Write to BigQuery!
    gdf.to_gbq(
        destination_table=destination_table,
        project_id=project_id,
        if_exists=if_exists,
        table_schema=table_schema,
        chunksize=chunksize,
        location=location,
    )
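
As a usage example, loading the California partition with a made-up table name would look something like this (geopandas reads the zipped GeoJSON directly via GDAL's /vsizip/ handler):

gdf = geopandas.read_file("/vsizip/California.geojson.zip")
gdf_to_bigquery(
    gdf,
    destination_table="geo_reference.california",  # hypothetical dataset.table
    project_id="caldata-sandbox",
    if_exists="replace",
)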

But that doesn't provide a way to set up clustering or partitioning, so I need to drop to a bit of a lower level to get that working (code WIP, will open a PR later).
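
Not the WIP code itself, but a rough sketch of that lower level might look like the following, using the google-cloud-bigquery client directly so that clustering_fields can be set on the load job. The table name is hypothetical, and it assumes a WKT-encoded geometry column is accepted for a GEOGRAPHY field:

import geopandas
from google.cloud import bigquery

client = bigquery.Client(project="caldata-sandbox")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("geometry", "GEOGRAPHY")],
    clustering_fields=["geometry"],  # cluster the table on the geography column
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Read the zipped GeoJSON and serialize the geometry column to WKT so it can
# round-trip through the client's dataframe loader.
gdf = geopandas.read_file("/vsizip/California.geojson.zip")
df = gdf.assign(geometry=gdf.geometry.to_wkt())

job = client.load_table_from_dataframe(
    df, "caldata-sandbox.geo_reference.california", job_config=job_config
)
job.result()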

BigQuery allows you to load geospatial data using line-delimited GeoJSON. So, again with ogr2ogr the following would work:

ogr2ogr -makevalid -f GeoJSONSeq California.gjsq /vsizip/California.geojson.zip
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --json_extension=GEOJSON \
  --autodetect \
  --project_id=<project-id> \
  geo_reference.california_gjsq \
  California.gjsq

This might be a good solution longer-term, especially if we can get Parquet working instead of line-delimited GeoJSON. However, I suspect that dealing with idempotency, error handling, etc. would make a bash-driven approach pretty annoying, and we might fall back to using the Python clients. Additionally, the line-delimited GeoJSON approach took about twice as long as the Python+Parquet approach (possibly this is just network bandwidth, but I'm not certain).
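
If we do fall back to the Python clients for this path, the bq invocation above can presumably be expressed along these lines (assuming a recent google-cloud-bigquery release that exposes the jsonExtension load option; names are again illustrative):

from google.cloud import bigquery

client = bigquery.Client(project="caldata-sandbox")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    json_extension="GEOJSON",  # interpret each line as a GeoJSON feature
    autodetect=True,
)
with open("California.gjsq", "rb") as f:
    job = client.load_table_from_file(
        f, "caldata-sandbox.geo_reference.california_gjsq", job_config=job_config
    )
job.result()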

Using the data

I primarily wanted to test that geography clustering works as expected (allowing for faster and cheaper data scans). In so doing, I learned something about how BQ (and Snowflake) optimize their queries!

So the following query is able to take advantage of clustering and scan less data:

SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  ST_BUFFER(ST_GEOGPOINT(-122.297327, 37.867691), 1000)
) 

But the following query scans the whole table:

SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  (SELECT geometry FROM `caldata-sandbox.geo_reference.incorporated_cities` WHERE CITY = 'Berkeley')
) 

This is because the query optimizer only does a single pass: it doesn't know ahead of time that the scalar subquery returns a small area, so it can't take advantage of clustering.

The solution is to break the query into two parts, allowing the second part to be planned with the clustering taken into account:

DECLARE berkeley GEOGRAPHY;
SET berkeley = (SELECT geometry FROM `caldata-sandbox.geo_reference.incorporated_cities` WHERE CITY = 'Berkeley');
SELECT geometry
FROM `caldata-sandbox.geo_reference.california`
WHERE ST_INTERSECTS(
  geometry,
  berkeley
) 

BigQuery has some general guidance here which addresses the issue for partitioned tables (and seems to apply to clustering as well). Snowflake has a nice community article describing more-or-less the same thing.

Orchestration bake-off

For some of the reasons outlined in cagov/data-infrastructure#4 (comment), I think this might be a good target for a data orchestration tool bake-off.

Even though we are probably only interested in the California subset of the data, loading the whole dataset would allow us to test some things around idempotency, control flow, concurrency, etc. So I think my initial PR here will be focused on California, but will keep the door open for a larger ingest to test some tooling.

This was referenced Jan 24, 2023
@ian-r-rose (Member Author)

Fixed by #16 and #17.
