v0.5.0 (#96)

* Version 0.5.0
* pypgstac loader improvements
* rework pypgstac, linting, bug fixes
* remove srid lookup
* get tests working with temp database
* update poetry lock
* psycopg copy error on ci
* more tests, update readme
* bug fixes
* change typing for file input to allow iterator
* update poetry lock for black issue, run tests with pytest
* add code to change partitions and migrate data
* fix for assets in includes
* add incremental migration
* add migration, fix atexit
stac-utils · Apr 14, 2022 · 6a29c8f
1 parent 9fc5027
commit 6a29c8f

Showing 41 changed files with 8,756 additions and 2,280 deletions.
diff --git a/.gitignore b/.gitignore
@@ -3,4 +3,5 @@ pypgstac/dist
 *.pyc
 *.egg-info
 *.eggs
-venv
+venv
+.direnv
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,4 +1,32 @@
 # Changelog
+## [v0.5.0]
+Version 0.5.0 is a major refactor of how data is stored. It is recommended to start a new database from scratch and to move data over rather than to use the inbuilt migration which will be very slow for larger amounts of data.
+
+### Fixed
+
+### Changed
+ - The partition layout has been changed from being hardcoded to one partition per week to using nested partitions. The first level is by collection. Each collection has an attribute partition_trunc, which can be set to NULL (no temporal partitions), month, or year.
+
+ - CQL1 and Query Code have been refactored to translate to CQL2 to reduce duplicated code in query parsing.
+
+ - Unused functions have been stripped from the project.
+
+ - Pypgstac has been changed to use Fire rather than Typer.
+
+ - Pypgstac has been changed to use Psycopg3 rather than Asyncpg to enable easier use as both sync and async.
+
+ - Indexing has been reworked to eliminate indexes that the logs showed were not being used. The global json index on properties has been removed. Indexes on individual properties can be added either globally or per collection using the new queryables table.
+
+ - Triggers for maintaining partitions have been updated to reduce lock contention and to reflect the new data layout.
+
+ - The data pager which optimizes "order by datetime" searches has been updated to get time periods from the new partition layout and partition metadata.
+
+ - Tests have been updated to reflect the many changes.
+
+### Added
+
+ - On ingest, the content in an item is compared to the metadata available at the collection level and duplicate information is stripped out (this is primarily data in the item_assets property). Logic is added in to merge this data back in on data usage.
+
 ## [v0.4.5]
 
 ### Fixed

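Reviewer note on the partition_trunc change described in the changelog above: the sketch below is one possible way to pick a value for a collection before loading data. It is not part of this commit. The function name `suggest_partition_trunc` and the 300,000 row threshold are assumptions made for illustration (the README in this commit only says "a few hundred thousand rows").

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional

# Assumed threshold; the README only says "a few hundred thousand rows".
MAX_ROWS_PER_PARTITION = 300_000


def suggest_partition_trunc(
    datetimes: Iterable[datetime],
    max_rows: int = MAX_ROWS_PER_PARTITION,
) -> Optional[str]:
    """Suggest None, 'year', or 'month' for a collection's partition_trunc.

    Picks the coarsest level whose largest partition stays within max_rows.
    'month' is the finest level available, so it is the fallback.
    """
    per_year: Counter = Counter()
    total = 0
    for dt in datetimes:
        per_year[dt.year] += 1
        total += 1

    # The whole collection fits in one partition: no temporal partitioning.
    if total <= max_rows:
        return None
    # Every yearly partition stays under the limit.
    if max(per_year.values()) <= max_rows:
        return "year"
    return "month"
```

A return value of `None` corresponds to leaving partition_trunc as NULL. Preferring the coarsest level follows the README's warning that partitioning more finely than needed can be detrimental.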
diff --git a/Dockerfile b/Dockerfile
@@ -33,8 +33,9 @@ RUN \
         python3-setuptools \
     && pip3 install -U pip setuptools packaging \
     && pip3 install -U psycopg2-binary \
+    && pip3 install -U psycopg[binary] \
     && pip3 install -U migra[pg] \
-    && pip3 install poetry==1.1.12 \
+    && pip3 install poetry==1.1.13 \
     && apt-get remove -y apt-transport-https \
     && apt-get -y autoremove \
     && rm -rf /var/lib/apt/lists/*

diff --git a/Dockerfile.dev b/Dockerfile.dev
@@ -12,7 +12,9 @@ ENV \
 
 RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 
-RUN pip install poetry==1.1.7
+RUN pip install --upgrade pip && \
+    pip install --upgrade poetry==1.1.13 && \
+    pip install --upgrade psycopg[binary]
 
 RUN mkdir -p /opt/src/pypgstac
 

diff --git a/README.md b/README.md
@@ -23,13 +23,53 @@ STAC Client that uses PGStac available in [STAC-FastAPI](https://github.com/stac
 PGStac requires **Postgresql>=13** and **PostGIS>=3**. Best performance will be had using PostGIS>=3.1.
 
 ### PGStac Settings
-PGStac installs everything into the pgstac schema in the database. You will need to make sure that this schema is set up in the search_path for the database.
+PGStac installs everything into the pgstac schema in the database. This schema must be in the search_path in the postgresql session while using pgstac.
 
+
+#### PGStac Users
+The pgstac_admin role is the owner of all the objects within pgstac and should be used when running things such as migrations.
+
+The pgstac_ingest role has read/write privileges on all tables and should be used for data ingest or if using the transactions extension with stac-fastapi-pgstac.
+
+The pgstac_read role has read only access to the items and collections, but will still be able to write to the logging tables.
+
+You can use the roles either directly, by adding a password to them, or by granting them to a role you are already using.
+
+To use directly:
+```sql
+ALTER ROLE pgstac_read LOGIN PASSWORD '<password>';
+```
+
+To grant pgstac permissions to a current postgresql user:
+```sql
+GRANT pgstac_read TO <user>;
+```
+
+#### PGStac Search Path
+The search_path can be set at the database level or role level or by setting within the current session. The search_path is already set if you are directly using one of the pgstac users. If you are not logging in directly as one of the pgstac users, you will need to set the search_path by adding it to the search_path of the user you are using:
+```sql
+ALTER ROLE <user> SET SEARCH_PATH TO pgstac, public;
+```
+Or set the search_path on the database:
+```sql
+ALTER DATABASE <database> set search_path to pgstac, public;
+```
+
+In psycopg the search_path can be set by passing it as a configuration when creating your connection:
+```python
+kwargs={
+    "options": "-c search_path=pgstac,public"
+}
+```
+
+#### PGStac Settings Variables
 There are additional variables that control the settings used for calculating and displaying context (total row count) for a search, as well as a variable to set the filter language (cql-json or cql2-json).
 The context is "off" by default, and the default filter language is set to "cql2-json".
 
 Variables can be set either by passing them in via the connection options using your connection library, setting them in the pgstac_settings table or by setting them on the Role that is used to log in to the database.
 
+Turning "context" on can be **very** expensive on larger databases. Much of what PGStac does is to optimize the search of items sorted by time where fewer than 10,000 records are returned at a time. It does this by searching for the data in chunks and is able to "short circuit" and return as soon as it has the number of records requested. Calculating the context (the total count for a query) requires a scan of all records that match the query parameters and can take a very long time. Setting "context" to auto will use database statistics to estimate the number of rows much more quickly, but for some queries, the estimate may be quite a bit off.
+
 Example for updating the pgstac_settings table with a new value:
 ```sql
 INSERT INTO pgstac_settings (name, value)
@@ -41,14 +81,36 @@ ON CONFLICT ON CONSTRAINT pgstac_settings_pkey DO UPDATE SET value = excluded.va
 ```
 
 Alternatively, update the role:
-```
+```sql
 ALTER ROLE <username> SET SEARCH_PATH to pgstac, public;
 ALTER ROLE <username> SET pgstac.context TO <'on','off','auto'>;
 ALTER ROLE <username> SET pgstac.context_estimated_count TO '<number of estimated rows when in auto mode that when an estimated count is less than will trigger a full count>';
 ALTER ROLE <username> SET pgstac.context_estimated_cost TO '<estimated query cost from explain when in auto mode that when an estimated cost is less than will trigger a full count>';
 ALTER ROLE <username> SET pgstac.context_stats_ttl TO '<an interval string ie "1 day" after which pgstac search will force recalculation of its estimates>';
 ```
 
+#### PGStac Partitioning
+By default PGStac partitions data by collection (note: this is a change starting with version 0.5.0). Each collection can further be partitioned by either year or month. **Partitioning must be set up prior to loading any data!** Partitioning can be configured by setting the partition_trunc flag on a collection in the database.
+```sql
+UPDATE collections set partition_trunc='month' WHERE id='<collection id>';
+```
+
+In general, you should aim to keep each partition less than a few hundred thousand rows. Further partitioning (ie setting everything to 'month' when not needed to keep the partitions below a few hundred thousand rows) can be detrimental.
+
+#### PGStac Indexes / Queryables
+By default, PGStac includes indexes on the id, datetime, collection, geometry, and the eo:cloud_cover property. Further indexing can be added for additional properties globally or only on particular collections by modifications to the queryables table.
+
+Currently indexing is the only place the queryables table is used, but in future versions, it will be extended to provide a queryables backend api.
+
+To add a new global index across all partitions:
+```sql
+INSERT INTO pgstac.queryables (name, property_wrapper, property_index_type)
+VALUES (<property name>, <property wrapper>, <index type>);
+```
+Property wrapper should be one of to_int, to_float, to_tstz, or to_text. The index type should almost always be 'BTREE', but can be any PostgreSQL index type valid for the data type.
+
+**More indexes are not necessarily better.** You should only index the primary fields that are actively being used to search. Adding too many indexes can be very detrimental to performance and ingest speed. If your primary use case is delivering items sorted by datetime and you do not use the context extension, you likely will not need any further indexes.
+
 ## PyPGStac
 PGStac includes a Python utility for bulk data loading and managing migrations.
 

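Reviewer note on the README changes above: the README shows the search_path being passed through the connection `options`, and says settings variables can be passed the same way. A minimal sketch of combining the two is below. The helper name `pgstac_connect_kwargs` is hypothetical and not part of pypgstac. It assumes the libpq rule that spaces inside `options` are escaped with a backslash.

```python
from typing import Dict, Optional


def pgstac_connect_kwargs(
    settings: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build connection kwargs that set the search_path and pgstac.* settings.

    `settings` maps names without the "pgstac." prefix (for example
    "context") to their string values.
    """
    opts = ["-c search_path=pgstac,public"]
    for name, value in (settings or {}).items():
        # libpq splits `options` on spaces, so escape backslashes and spaces.
        escaped = value.replace("\\", "\\\\").replace(" ", "\\ ")
        opts.append(f"-c pgstac.{name}={escaped}")
    return {"options": " ".join(opts)}


# Intended use with psycopg 3 (not executed here; the dbname is a placeholder):
#   import psycopg
#   conn = psycopg.connect("dbname=<database>", **pgstac_connect_kwargs({"context": "auto"}))
```

Settings passed this way apply only to that session, which can be convenient for trying "context" values without altering the role or the pgstac_settings table.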
diff --git a/pgstac.sql b/pgstac.sql
@@ -1,12 +1,14 @@
 BEGIN;
 \i sql/001_core.sql
 \i sql/001a_jsonutils.sql
-\i sql/001b_cursorutils.sql
 \i sql/001s_stacutils.sql
 \i sql/002_collections.sql
+\i sql/002a_queryables.sql
+\i sql/002b_cql.sql
 \i sql/003_items.sql
 \i sql/004_search.sql
 \i sql/005_tileutils.sql
 \i sql/006_tilesearch.sql
+\i sql/998_permissions.sql
 \i sql/999_version.sql
 COMMIT;