Calculating signatures is slow #13
Hi,
Thanks for your inquiry. I forwarded your message to the original developer who will be in a better position to answer your question.
Best,
Kurt
Dr. Kurt E. Fendt
Senior Lecturer
Director, Active Archives Initiative
Comparative Media Studies/Writing
Massachusetts Institute of Technology
Room 14N-421
77 Massachusetts Avenue
Cambridge, MA 02139, USA
Phone: (617) 253-4312
https://aai.mit.edu
On Aug 24, 2021, at 04:02, thijslemmens wrote:
I'm trying out the PG extension to figure out if I can use the signature strategy as an alternative to GROUP BY to give insights into result sets. My aim is to have facets on very big result sets within a second. I'm talking about 5M rows to begin with, but some of the cases we want to tackle might be a lot larger.
From my current experience, the "&" operator and the facet.count() function work reasonably fast, but calculating a signature takes too much time (the facet.signature aggregate). I understand that that aggregate has to handle all the rows, but it is also slow compared to other aggregates over the same result set.
Do you have an idea what the reason could be? I'm looking at the sig_set function, but I'm not yet familiar with C code, so it takes some time. I suspect the memcpy is copying data for every row, and that might take most of the time.
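For context, the pattern described above — an aggregate transition step that allocates and memcpys a fresh copy of the whole signature for every input row — looks roughly like the following sketch. This is a generic illustration in PostgreSQL C, not the extension's actual sig_set code; the function name, the bytea state type, and the bit layout are all assumptions here.

```c
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;            /* once per loadable module */

/* Hypothetical copy-per-row transition step: the signature is a
 * variable-length byte string and one bit is set for each input row.
 * The whole state is palloc'd and memcpy'd again on every call, so the
 * cost grows with signature size times row count. */
PG_FUNCTION_INFO_V1(sig_set_copying);

Datum
sig_set_copying(PG_FUNCTION_ARGS)
{
    bytea      *sig = PG_GETARG_BYTEA_P(0);   /* current signature state */
    int32       pos = PG_GETARG_INT32(1);     /* bit to set for this row */
    Size        len = VARSIZE(sig);
    bytea      *result;

    /* The suspected bottleneck: a full copy of the signature per row. */
    result = (bytea *) palloc(len);
    memcpy(result, sig, len);

    ((unsigned char *) VARDATA(result))[pos / 8] |= (unsigned char) (1 << (pos % 8));

    PG_RETURN_BYTEA_P(result);
}
```

With a signature wide enough to distinguish millions of rows, that one copy per row can easily cost more than the table scan itself.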
Dear thijslemmens,

Thanks for your message. Based on the sig_set source code, your suspicion that the per-row memcpy is slowing things down may be correct.

Does an aggregate that uses fixed space but still touches every row run much faster? If memcpy is the limiting factor rather than the complete table scan, you should see a large difference there.

One option would be to write a custom version of sig_set that avoids the per-row copy. If you wish to open a PR that does this, I would be happy to review and potentially merge it.

Best,
Christopher
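The usual way to remove that copy in a PostgreSQL C extension is to let the transition function detect, via AggCheckCallContext, that it is running inside an aggregate and then modify its state argument in place, which PostgreSQL explicitly permits in that situation. The sketch below illustrates the technique under the same assumed names, layout, and module boilerplate as the earlier sketch; it is not the extension's actual code, and it skips the NULL/first-call handling a real transition function would need.

```c
#include "postgres.h"
#include "fmgr.h"

/* Hypothetical in-place variant: inside an aggregate, PostgreSQL allows
 * the transition function to scribble on its state value directly, so
 * the per-row palloc + memcpy disappears. */
PG_FUNCTION_INFO_V1(sig_set_inplace);

Datum
sig_set_inplace(PG_FUNCTION_ARGS)
{
    int32       pos = PG_GETARG_INT32(1);
    bytea      *sig;

    if (AggCheckCallContext(fcinfo, NULL))
    {
        /* Aggregate context: safe to update the existing state in place. */
        sig = PG_GETARG_BYTEA_P(0);
    }
    else
    {
        /* Plain function call: keep copy-on-write behaviour so the
         * caller's datum is never clobbered. */
        bytea  *orig = PG_GETARG_BYTEA_P(0);

        sig = (bytea *) palloc(VARSIZE(orig));
        memcpy(sig, orig, VARSIZE(orig));
    }

    ((unsigned char *) VARDATA(sig))[pos / 8] |= (unsigned char) (1 << (pos % 8));

    PG_RETURN_BYTEA_P(sig);
}
```

Comparing a fixed-space aggregate such as count(*) against facet.signature over the same 5M-row result set, as suggested above, should show how much of the remaining time is the scan itself versus the per-row copying.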
Hello,

We've been working with a partner to further explore faceting for PostgreSQL. They have published a first version of an extension on GitHub: