feat: add new custom_sql_filter parameter (#180)
* feat: add new custom_sql_filter parameter

* fix: add None check for custom filters

* fix: change file hash generation

* chore: add tests for custom sql filtering

* fix: add missing custom sql filters to a prefiltering step

* chore: add new test scenario

* chore: update progress bar logic during multiprocessing startup

* feat: add custom sql filter example notebook

* chore: add changelog entry
RaczeQ authored Nov 3, 2024
1 parent 8b25fbc commit e291608
Showing 8 changed files with 363 additions and 24 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Option to pass custom SQL filters with `custom_sql_filter` (and `--custom-sql-filter`) parameter [#67](https://github.com/kraina-ai/quackosm/issues/67)

### Fixed

- Delayed progress bar appearing during nodes intersection step

## [0.11.4] - 2024-10-28

### Changed
202 changes: 202 additions & 0 deletions examples/advanced_examples/custom_sql_filter.ipynb
@@ -0,0 +1,202 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Custom SQL filter\n",
"\n",
"**QuackOSM** enables advanced users to filter data using SQL filters that will be used by DuckDB during processing.\n",
"\n",
"The filter is applied alongside [OSM tags filters](../osm_tags_filter/) and feature ID filters.\n",
"\n",
"The SQL filter clause can be passed both in the Python API (as the `custom_sql_filter` parameter) and in the CLI (as the `--custom-sql-filter` option).\n",
"\n",
"Two columns are available to users: `id` (type `BIGINT`) and `tags` (type `MAP(VARCHAR, VARCHAR)`).\n",
"\n",
"You can look up available functions in the [DuckDB documentation](https://duckdb.org/docs/sql/functions/overview).\n",
"\n",
"Below are a few examples of how to use custom SQL filters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Features with exactly 10 tags\n",
"\n",
"Here we will use the `cardinality` function dedicated to the `MAP` type.\n",
"\n",
"More `MAP` functions are available [here](https://duckdb.org/docs/sql/functions/map)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import quackosm as qosm\n",
"\n",
"data = qosm.convert_geometry_to_geodataframe(\n",
" geometry_filter=qosm.geocode_to_geometry(\"Greater London\"),\n",
" osm_extract_source=\"Geofabrik\",\n",
" custom_sql_filter=\"cardinality(tags) = 10\",\n",
")\n",
"data[\"tags\"].head(10).values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"All have exactly 10 tags:\", (data[\"tags\"].str.len() == 10).all())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Features with ID divisible by 13 and starting with the digit 6\n",
"\n",
"Here we will operate on the `id` column.\n",
"\n",
"More `NUMERIC` functions are available [here](https://duckdb.org/docs/sql/functions/numeric).\n",
"\n",
"More `STRING` functions are available [here](https://duckdb.org/docs/sql/functions/char)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = qosm.convert_geometry_to_geodataframe(\n",
" geometry_filter=qosm.geocode_to_geometry(\"Greater London\"),\n",
" osm_extract_source=\"Geofabrik\",\n",
" custom_sql_filter=\"id % 13 = 0 AND starts_with(id::STRING, '6')\",\n",
")\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"All starting with digit 6:\", data.index.map(lambda x: x.split(\"/\")[1].startswith(\"6\")).all())\n",
"print(\"All divisible by 13:\", data.index.map(lambda x: (int(x.split(\"/\")[1]) % 13) == 0).all())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Find features that have all selected tags present\n",
"\n",
"When using `osm_tags_filter` with value `{ \"building\": True, \"historic\": True, \"name\": True }`, the result will contain every feature that has at least one of those tags.\n",
"\n",
"Positive tags filters are combined using an `OR` operator. You can read more about it [here](../osm_tags_filter/).\n",
"\n",
"To require all of the tags at once (an `AND` combination), the `custom_sql_filter` parameter has to be used.\n",
"\n",
"To match a list of keys against given values we have to use list-related functions.\n",
"\n",
"More `LIST` functions are available [here](https://duckdb.org/docs/sql/functions/list)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = qosm.convert_geometry_to_geodataframe(\n",
" geometry_filter=qosm.geocode_to_geometry(\"Greater London\"),\n",
" osm_extract_source=\"Geofabrik\",\n",
" custom_sql_filter=\"list_has_all(map_keys(tags), ['building', 'historic', 'name'])\",\n",
")\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tags_names = [\"name\", \"building\", \"historic\"]\n",
"for tag_name in tags_names:\n",
" data[tag_name] = data[\"tags\"].apply(lambda x, tag_name=tag_name: x.get(tag_name))\n",
"data[[*tags_names, \"geometry\"]].explore(tiles=\"CartoDB DarkMatter\", color=\"orange\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regex search to find streets starting with the word New or Old\n",
"\n",
"*(If you really need to)* you can use regular expressions on a tag value (or key) to find specific features.\n",
"\n",
"More `REGEX` functions are available [here](https://duckdb.org/docs/sql/functions/regular_expressions)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = qosm.convert_geometry_to_geodataframe(\n",
" geometry_filter=qosm.geocode_to_geometry(\"Greater London\"),\n",
" osm_extract_source=\"Geofabrik\",\n",
" custom_sql_filter=\"\"\"\n",
" list_has_all(map_keys(tags), ['highway', 'name'])\n",
" AND regexp_matches(tags['name'][1], '^(New|Old)\\s\\w+')\n",
" \"\"\",\n",
")\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ways_only = data[data.index.str.startswith(\"way/\")].copy()\n",
"ways_only[\"name\"] = ways_only[\"tags\"].apply(lambda x: x[\"name\"])\n",
"ways_only[\"prefix\"] = ways_only[\"name\"].apply(lambda x: x.split()[0])\n",
"ways_only[[\"name\", \"prefix\", \"geometry\"]].explore(\n",
" tiles=\"CartoDB DarkMatter\", column=\"prefix\", cmap=[\"orange\", \"royalblue\"]\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
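
For readers less familiar with DuckDB, the four filters used in the notebook can be approximated in plain Python. This is only a sketch of what each predicate selects, not how QuackOSM evaluates it: the real filters run inside DuckDB, the `Feature` class and the sample data are hypothetical, and Python's `\w` is Unicode-aware, which may differ from DuckDB's regex engine.

```python
import re
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical stand-in for one OSM feature row (`id` BIGINT, `tags` MAP)."""

    id: int
    tags: dict[str, str]


def has_exactly_n_tags(feature: Feature, n: int = 10) -> bool:
    # Analogue of: cardinality(tags) = 10
    return len(feature.tags) == n


def id_divisible_by_13_and_starts_with_6(feature: Feature) -> bool:
    # Analogue of: id % 13 = 0 AND starts_with(id::STRING, '6')
    return feature.id % 13 == 0 and str(feature.id).startswith("6")


def has_all_keys(feature: Feature, keys: list[str]) -> bool:
    # Analogue of: list_has_all(map_keys(tags), [...])
    return set(keys).issubset(feature.tags)


def is_new_or_old_street(feature: Feature) -> bool:
    # Analogue of the regex example: both keys must exist and the name
    # must start with "New" or "Old" followed by whitespace and a word.
    if not has_all_keys(feature, ["highway", "name"]):
        return False
    return re.search(r"^(New|Old)\s\w+", feature.tags["name"]) is not None


# Made-up sample features, for illustration only.
features = [
    Feature(65, {"highway": "residential", "name": "New Oxford Street"}),
    Feature(26, {"building": "yes", "historic": "castle", "name": "Old Keep"}),
    Feature(7, {"highway": "primary", "name": "Newgate Street"}),
]

print([f.id for f in features if id_divisible_by_13_and_starts_with_6(f)])
print([f.id for f in features if has_all_keys(f, ["building", "historic", "name"])])
print([f.id for f in features if is_new_or_old_street(f)])
```

Note that "Newgate Street" is not selected by the last predicate, because the pattern requires whitespace directly after `New` or `Old`.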
16 changes: 9 additions & 7 deletions quackosm/_parquet_multiprocessing.py
@@ -100,15 +100,20 @@ def map_parquet_dataset(
         progress_bar (Optional[TaskProgressBar]): Progress bar to show task status.
             Defaults to `None`.
     """
-    queue: Queue[tuple[str, int]] = ctx.Manager().Queue()
-
     dataset = pq.ParquetDataset(dataset_path)
 
+    tuples_to_queue = []
     for pq_file in dataset.files:
         for row_group in range(pq.ParquetFile(pq_file).num_row_groups):
-            queue.put((pq_file, row_group))
+            tuples_to_queue.append((pq_file, row_group))
 
-    total = queue.qsize()
+    total = len(tuples_to_queue)
+    if progress_bar:  # pragma: no cover
+        progress_bar.create_manual_bar(total=total)
+
+    queue: Queue[tuple[str, int]] = ctx.Manager().Queue()
+    for queue_tuple in tuples_to_queue:
+        queue.put(queue_tuple)
 
     destination_path.mkdir(parents=True, exist_ok=True)

@@ -137,9 +142,6 @@ def _run_processes(
             break
         p.start()
 
-    if progress_bar:  # pragma: no cover
-        progress_bar.create_manual_bar(total=total)
-
     sleep_time = 0.1
     while any(process.is_alive() for process in processes):
         if any(p.exception for p in processes):  # pragma: no cover
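
The reordering in `map_parquet_dataset` can be sketched in isolation: collect the work items in a plain list, report the total right away, and only then create and fill the queue. This is a minimal sketch and not the project's code; `build_work_queue`, `on_total`, and `queue_factory` are hypothetical names, and the default `queue.Queue` merely stands in for the `ctx.Manager().Queue()` used in the commit.

```python
import queue
from typing import Any, Callable, Optional


def build_work_queue(
    row_groups_per_file: dict[str, int],
    on_total: Optional[Callable[[int], None]] = None,
    queue_factory: Callable[[], Any] = queue.Queue,
) -> tuple[Any, int]:
    """Return a filled work queue and the number of (file, row group) items."""
    # Step 1: enumerate the work in a plain list; no queue is needed yet.
    work_items = [
        (path, row_group)
        for path, num_row_groups in row_groups_per_file.items()
        for row_group in range(num_row_groups)
    ]

    # Step 2: the total is known immediately, so progress reporting can
    # start before the queue is set up.
    total = len(work_items)
    if on_total is not None:
        on_total(total)

    # Step 3: create and fill the queue (a manager-backed queue in the commit).
    work_queue = queue_factory()
    for item in work_items:
        work_queue.put(item)

    return work_queue, total


# Hypothetical input: file name -> number of row groups.
reported: list[int] = []
work_queue, total = build_work_queue(
    {"a.parquet": 2, "b.parquet": 1}, on_total=reported.append
)
print(total, reported)
```

Taking the total from `len()` of the list also means it no longer depends on the queue's `qsize()`.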
18 changes: 18 additions & 0 deletions quackosm/cli.py
@@ -457,6 +457,18 @@ def main(
show_default=False,
),
] = None,
custom_sql_filter: Annotated[
Optional[str],
typer.Option(
help=(
"Allows users to pass custom SQL conditions used to filter OSM features. "
"It will be embedded into predefined queries and requires DuckDB syntax to operate "
"on tags map object."
),
case_sensitive=False,
show_default=False,
),
] = None,
osm_extract_query: Annotated[
Optional[str],
typer.Option(
@@ -750,6 +762,7 @@ def main(
else None
),
filter_osm_ids=filter_osm_ids, # type: ignore
custom_sql_filter=custom_sql_filter,
save_as_wkt=wkt_result,
verbosity_mode=verbosity_mode,
)
@@ -771,6 +784,7 @@ def main(
else None
),
filter_osm_ids=filter_osm_ids, # type: ignore
custom_sql_filter=custom_sql_filter,
duckdb_table_name=duckdb_table_name or "quackosm",
verbosity_mode=verbosity_mode,
)
@@ -795,6 +809,7 @@ def main(
else None
),
filter_osm_ids=filter_osm_ids, # type: ignore
custom_sql_filter=custom_sql_filter,
save_as_wkt=wkt_result,
verbosity_mode=verbosity_mode,
)
@@ -825,6 +840,7 @@ def main(
else None
),
filter_osm_ids=filter_osm_ids, # type: ignore
custom_sql_filter=custom_sql_filter,
duckdb_table_name=duckdb_table_name or "quackosm",
save_as_wkt=wkt_result,
verbosity_mode=verbosity_mode,
@@ -853,6 +869,7 @@ def main(
else None
),
filter_osm_ids=filter_osm_ids, # type: ignore
custom_sql_filter=custom_sql_filter,
save_as_wkt=wkt_result,
verbosity_mode=verbosity_mode,
geometry_coverage_iou_threshold=geometry_coverage_iou_threshold,
@@ -876,6 +893,7 @@ def main(
else None
),
filter_osm_ids=filter_osm_ids, # type: ignore
custom_sql_filter=custom_sql_filter,
duckdb_table_name=duckdb_table_name or "quackosm",
save_as_wkt=wkt_result,
verbosity_mode=verbosity_mode,
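
The option's help text says the condition is embedded into predefined queries. One plausible way to do that, shown purely as an illustration, is to wrap each optional condition in parentheses and join the ones that are set with `AND`, skipping any that are `None`. The helper `combine_where_clauses` and the sample conditions are hypothetical and are not taken from QuackOSM's source.

```python
from typing import Optional


def combine_where_clauses(*clauses: Optional[str]) -> str:
    """Join the non-empty SQL conditions with AND; return 'TRUE' if none are set."""
    # Skip filters that were not provided (None) or are blank.
    active = [clause.strip() for clause in clauses if clause and clause.strip()]
    if not active:
        return "TRUE"
    # Parenthesize each condition so an inner OR cannot change the overall logic.
    return " AND ".join(f"({clause})" for clause in active)


# Hypothetical conditions: one from a tags filter, one passed by the user.
tags_condition = "list_contains(map_keys(tags), 'building')"
custom_sql_filter = "cardinality(tags) = 10 OR id % 13 = 0"

print(combine_where_clauses(tags_condition, custom_sql_filter))
print(combine_where_clauses(None, None))
```

Without the parentheses, a user-supplied `OR` would bind more loosely than the surrounding `AND` and could widen the result beyond the other filters.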