Garbage Collector is very aggressive and kills some queries without waiting. #3977
Some metrics from puppetdb:
EU (local):
What Postgres version are you using?
15.6
Ok, that's a new enough version that you are getting the best behavior we have (PG 14 or greater). Some background: GC of reports is very fast now that we have partitioned tables because we can drop whole tables at a time, but the database locks it acquires to do that are more blocking than deleting a row. The result of this is that once PuppetDB's GC has started to wait for a lock, we can block both commands and queries that come in afterwards until both the in-flight queries and the GC are complete. Since we don't know how long all in-flight user queries will take, we could be blocking future commands and queries indefinitely. To solve that we have chosen to rely on this "bulldozer" that allows us to get the heavily locking GC work done as quickly as possible. Both queries and command processing automatically retry failed PG operations, so just because the error shows up in the PG log does not mean that an attempt to query actually failed. If the user query fails, we would consider that a bug, and you should also see the error logged in PuppetDB's logs then.
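To make the locking behaviour described above concrete, here is a minimal sketch (Python with psycopg2 rather than PuppetDB's actual Clojure code; the partition name, DSN, and timings are made up for illustration) of how a pending partition drop queues behind an in-flight query and then blocks queries that arrive after it:

```python
import threading
import time
import psycopg2

DSN = "dbname=puppetdb"          # illustrative; adjust for your environment
PARTITION = "reports_20240101z"  # hypothetical partition name

def in_flight_query():
    # Session A: an in-flight read holds ACCESS SHARE on the partition until
    # its transaction ends (kept open here with pg_sleep).
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute(f"SELECT count(*) FROM {PARTITION}")
    cur.execute("SELECT pg_sleep(15)")
    conn.commit()
    conn.close()

def gc_drop():
    # Session B: the GC drops the whole partition; DROP TABLE needs ACCESS
    # EXCLUSIVE, so it queues behind session A.
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute(f"DROP TABLE IF EXISTS {PARTITION}")
    conn.commit()
    conn.close()

def later_query():
    # Session C: arrives after B. Its ACCESS SHARE request queues behind the
    # pending ACCESS EXCLUSIVE request, so even a plain SELECT blocks here.
    conn = psycopg2.connect(DSN)
    started = time.monotonic()
    try:
        cur = conn.cursor()
        cur.execute(f"SELECT count(*) FROM {PARTITION}")
    except psycopg2.Error as e:
        # Once the drop wins, this query fails because the relation is gone.
        print(f"later query blocked {time.monotonic() - started:.1f}s, then: {e}")
    finally:
        conn.close()

threads = [threading.Thread(target=f)
           for f in (in_flight_query, gc_drop, later_query)]
for t in threads:
    t.start()
    time.sleep(1)  # stagger so the lock queue forms in A -> B -> C order
for t in threads:
    t.join()
```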
Yes, on the Singapore puppetdb I see the following:
We also have an issue when a lot of resources are uploaded to puppetdb (discussed on Slack), but for now I have worked around it by reducing the frequency of GC (once a day instead of once an hour) and pointing the puppet agent at the puppetserver in Luxembourg (catalog apply time went from 90 seconds to 110, but it solves the issue with uploading the catalog into puppetdb). I can provide more logs from postgres or puppetdb for this case. In the DB we saw some locks:
Ah sorry, forgot about that log spew from HikariCP. Those are expected warnings from our connection pool because to terminate the query consistently we have to close the entire connection. The error that would indicate a command failed would be
I would need to double check our retry semantics to be 100% sure, but I think it would also be problematic if you saw
because that would mean our "fast retry" failed as well.
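For illustration only, here is a minimal sketch of the retry idea discussed above, assuming psycopg2; the function name, DSN handling, and backoff are assumptions, and this is not PuppetDB's actual (Clojure) retry code. A backend terminated by the GC surfaces client-side as an OperationalError, and only if every attempt fails would the caller see a real query failure:

```python
import time
import psycopg2

def query_with_retry(dsn, sql, params=(), attempts=3):
    """Run a read-only query, reconnecting and retrying if the backend is
    terminated (e.g. "terminating connection due to administrator command")."""
    last_error = None
    for attempt in range(attempts):
        conn = None
        try:
            conn = psycopg2.connect(dsn)
            cur = conn.cursor()
            cur.execute(sql, params)
            rows = cur.fetchall()
            conn.commit()
            return rows
        except psycopg2.OperationalError as e:
            # Connection-level failure, e.g. the GC "bulldozer" killed us.
            last_error = e
            time.sleep(0.5 * (attempt + 1))  # brief backoff before retrying
        finally:
            if conn is not None and not conn.closed:
                conn.close()
    # Every attempt failed: this is the case that would indicate an actual
    # failed query rather than just noise in the PG log.
    raise last_error
```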
Ah, so if I have a host with a large amount of resources in its catalog, it can't finish updating in time and GC kills this query every hour.
Describe the Bug
I observe in the puppetdb logs that GC kills some connections to postgres.
We have the main node located in the EU with these settings (currently it runs GC once a day; it was once an hour, and to reproduce the issue more frequently I can set it to a lower value):
and more puppetdb nodes, the farthest of which is located in Singapore with an RTT of 183 ms, configured with the following settings against the same database in the EU:
So the issue: I observe that GC drops connections to postgres without any meaningful waiting during the phase
Starting sweep of stale reports and resource events
So it kills 3 connections in less than half a second (300 ms according to the log).
These connections were actively sending something from the remote location (grep by pid is below):
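The grep output itself is not reproduced here, but for anyone trying to correlate those pids with what the backends were doing and waiting on at the time, a hedged helper follows. It only uses standard PostgreSQL facilities (pg_stat_activity and pg_blocking_pids(), available since 9.6); the DSN and pid are placeholders, and the function name is made up:

```python
import psycopg2

def show_backend(dsn, pid):
    """Print state, wait event, blockers, and current query for one backend pid."""
    query = """
        SELECT pid,
               state,
               wait_event_type,
               wait_event,
               pg_blocking_pids(pid) AS blocked_by,
               now() - query_start   AS running_for,
               left(query, 120)      AS query
        FROM pg_stat_activity
        WHERE pid = %s
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query, (pid,))
        row = cur.fetchone()
        print(row if row else f"no backend with pid {pid}")

# Example usage with a pid taken from the Postgres log (values are placeholders):
# show_backend("dbname=puppetdb host=db.example.internal", 12345)
```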
Expected Behavior
I would expect GC to wait for some time and drop queries only if something really terrible happens.
Maybe it should be more patient, or the logs should explain why these connections to postgres have to be terminated during GC. From my side, GC shouldn't be destructive.
Steps to Reproduce
I can reproduce it on every GC run with 2 puppetdb instances that are located quite far apart (for example, add some delay between the nodes with tc).
Environment