Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] Adding daily partitions #307

Open
j-musial opened this issue Sep 20, 2024 · 6 comments
Open

[Feature request] Adding daily partitions #307

j-musial opened this issue Sep 20, 2024 · 6 comments
Assignees

Comments

@j-musial
Copy link

j-musial commented Sep 20, 2024

Next generation of optical satellites e.g. Sentinel-2 generates several thousands product per day at global scale and consequently the monthly PGSTAC partitions are too big. It would be advantageous to have daily partitions.

@bitner
Copy link
Collaborator

bitner commented Sep 25, 2024

My general recommendations from my testing is that the ideal balance for partition size is targeting for each partition to be roughly in the single digit millions. Even at a 100,000 products per day (3 million / month) that should still be reasonable with monthly partitions.

There is definitely a balance here -- as you increase the number of partitions, the query planning process in Postgres gets slower and any searches that cover multiple partitions will also get much slower.

It is certainly possible for us to add a daily option for partition, but I have certainly never seen anything that I would expect to perform better with that level of partitioning.

@j-musial
Copy link
Author

The use case for daily partitions are catalogues that are frequently queried for the most recent products. In our case the 70% queries are for the last week and 40% for the last day. Furthermore, if you have 3 million/month complex geometries/footprints then your queries will be significantly slower then for 100,000 products. Regarding the multi-temporal queries indeed number of partitions may decrease performance. However, once the 1000/10000 products are found (and this usually requires only few days and only few partitions scans) the query should stop.

Overall daily partitions would be useful option to have in PGSTAC so the user may choose what configuration works the best for him.

@jonas-eberle
Copy link

Please consider, as far as I know, if a request queries one specific STAC item (e.g., when calling https://pgstac.demo.cloudferro.com/collections/sentinel-2-l2a/items/S2B_MSIL2A_20240110T100309_N0510_R122_T33UVR_20240110T113053) then all partitions are queried until the scene has been found. I guess this is because postgres does not know in which partition this item is. At least I have seen such SQL queries when requesting a single item. With daily partitions this could lead to a lot of queries in the background.

@bitner Please correct, if this is wrong.

@j-musial
Copy link
Author

@jonas-eberle in case of a single product indeed all partitions have to be queried until it is found but this is a simple scan. More troublesome are spatio-temporal queries over large AOI with complex footprints. Then you really would take advantage of daily partitions because only few of them will be queried until the max elements (1000 in our case) are found.

@bitner
Copy link
Collaborator

bitner commented Sep 26, 2024

yep @jonas-eberle, you are exactly correct. Any query that is not bound by collection and/or time will need to scan through all the partitions. @j-musial this also means that an endpoint like /items/ is going to be very slow.

@j-musial Yes, it is a simple scan, but even a simple scan done 30X more often with the included additional planning penalty can make a huge difference. The way that pgstac works, if you are ordering your return data by datetime, we have a short cut that actually breaks the passed in query down into smaller chunks of time ranges that we scan serially. This allows the btree index to work very well even on larger tables with complex footprints. To be clear, I'm not opposed to adding a "day" option, I just want to explain that smaller partitions are not a magic bullet and may well likely make some things considerably worse for performance. I'd be happy to be show wrong though and would certainly accept a PR for the change, it's just not going to be a priority for me.

@j-musial
Copy link
Author

@bitner thank you for the comprehensive explanation. I guess the optimal partitioning depends on the usage scenario but if PGSTAC does the serial scans indeed the performance gain may not be huge at the cost of additional time required for query planning. The best way to verify this would be to run different types of queries on monthly and daily partitions, but I do understand that the daily partitions may not be the most urgent feature to be added to PGSTAC. Anyway thanks for your great work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants