Pass client error policy down to snapshot and output plugin #115

Merged · 7 commits · Aug 31, 2016

Conversation

samstokes (Contributor)

This adds an additional error_policy parameter to the snapshot function bottledwater_export and to the logical decoding output plugin (specified via START_REPLICATION), with similar semantics to the --on-error flag added to the client in PR #85. The default policy is "exit", which preserves the existing behaviour: any error encoding a row for output to the client terminates the snapshot or replication stream. Under the "log" policy, the error is instead logged and the offending row skipped.

This also ensures that the Bottled Water client passes the policy specified by --on-error down to the snapshot and output plugin. Previously this could lead to a confusing situation where you specified --on-error=log but some errors would still terminate the stream (since the error policy did not apply to the extension): e.g. #36 and #113.

It also adds tests for handling of large values that exceed the extension's internal maximum buffer size, to illustrate and exercise the difference between the two error policies.

This does not actually guarantee that all errors will be skipped in "log" mode - currently it skips only Avro encoding errors and errors writing the buffer. (e.g. it will probably not catch the "missing chunk for toast value" error described in #113.) This is because the extension was originally coded with the assumption that any error would exit early, so it may not be safe to continue after every possible error - an unanticipated error might leave the extension in a bad state. We should be able to add handling for additional errors on a case-by-case basis.
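To make the semantics concrete, here is a minimal sketch of the policy dispatch, assuming a hypothetical error_policy_t enum and the error_policy_handle helper mentioned later in this thread; the real extension code is more involved and the names here are assumptions for illustration:

    /* Sketch only - not the extension's actual code. The enum values and the
     * helper's signature are illustrative assumptions. */
    #include "postgres.h"

    typedef enum {
        ERROR_POLICY_EXIT,   /* default: any encoding error aborts the snapshot/stream */
        ERROR_POLICY_LOG     /* log the error and skip the offending row */
    } error_policy_t;

    static void error_policy_handle(error_policy_t policy, const char *msg) {
        if (policy == ERROR_POLICY_LOG)
            elog(WARNING, "skipping row: %s", msg);   /* keep the stream going */
        else
            elog(ERROR, "%s", msg);                   /* aborts the current command */
    }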

samstokes (Contributor, Author)

@badboyd if you have a spare moment, do you think you could review this change? It touches similar areas to the --tables/--schemas pull request #101, so I'd really appreciate any thoughts you have on it.

(I know it will merge-conflict with PR #101, and I'll resolve the conflict when merging this.)

 PQExpBuffer query = createPQExpBuffer();
-appendPQExpBuffer(query, "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X",
+appendPQExpBuffer(query, "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X (error_policy '%s')",
badboyd (Contributor) commented Aug 27, 2016

I think the query should look like this:
appendPQExpBuffer(query, "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X (\"error_policy\" \'%s\')",

samstokes (Contributor, Author)

I think the double quotes around the param name are optional (this version worked when I tried it), but they're probably good practice, and it'll be more consistent with #101, so I've added them: 5d87428.

The single quotes don't need escaping though - a string literal can contain unescaped single quotes. Unless I misunderstood?
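For reference, with the double quotes added the full call presumably looks something like this (the argument names are illustrative, since the surrounding client code isn't shown in this thread):

    appendPQExpBuffer(query,
            "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X (\"error_policy\" '%s')",
            slot_name,                                        /* illustrative variable names */
            (uint32) (start_lsn >> 32), (uint32) start_lsn,   /* standard %X/%X LSN split */
            policy_name);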

badboyd (Contributor) commented Aug 27, 2016

I think this patch is very cool. But there's an edge case like #113; in that case I think we should recreate the replication stream if error_policy = log, or exit in the case of error_policy = exit. What do you think about it?

samstokes (Contributor, Author)

Thanks for the review @badboyd! Re:

But there's an edge case like #113; in that case I think we should recreate the replication stream if error_policy = log, or exit in the case of error_policy = exit.

Assuming the error is persistent, recreating the replication stream wouldn't help: the client would reconnect, but it would just reprocess the same row that made it fail in the first place (since the replication slot still exists) and get nowhere.

The approach taken by this PR is to just skip over the data that caused the error, without terminating the replication stream, if error_policy=log. But it's hard to know if I'm catching all possible errors that way, since apparently there are errors like #113 that I'd never heard of and don't know how to reproduce :)

The behaviour of ereport(ERROR, ...) is similar to throwing an exception in a higher-level language - it will terminate the current SQL query or replication stream regardless of how deep in the C call stack it was. The error you're seeing in #113 is not reported explicitly by any code in Bottled Water, so it must be coming from somewhere in PG internals.
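For illustration only (the message and variables here are made up, not taken from Postgres or Bottled Water), an error raised deep inside the server looks like this, and control never returns to the caller:

    ereport(ERROR,
            (errcode(ERRCODE_INTERNAL_ERROR),
             errmsg("unexpected chunk number %d for toast value %u",
                    chunk_no, toast_oid)));   /* longjmps to the nearest error handler */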

There is a PG_CATCH macro which I think we could use to catch any errors in the region of code that calls into Postgres to fetch the data for encoding. However, I can't find any documentation for it outside the Postgres source code, and I'm not familiar with its behaviour. Given that this part of the code is essential to Bottled Water's functioning, I'm apprehensive about putting a "catch-all" in there, at least without the ability to reproduce the error in question.

If you can reproduce the "missing chunk for toast" error, then I'd be interested to know if PG_CATCH works for intercepting it - in which case you could call error_policy_handle in the catch block and skip over it (assuming that skipping over it is safe!). This PR at least puts the plumbing in place.
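A minimal sketch of that plumbing, assuming the hypothetical names from the earlier sketch (encode_row stands in for whatever call can raise the error; whether skipping is actually safe here is exactly the open question):

    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        encode_row(ctx, relation, tuple);      /* hypothetical call that may ereport(ERROR) */
    }
    PG_CATCH();
    {
        ErrorData *edata;

        if (policy != ERROR_POLICY_LOG)
            PG_RE_THROW();                     /* "exit": keep the original behaviour */

        /* "log": capture the error, clear it, report it, and skip the row */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();
        elog(WARNING, "skipping row: %s", edata->message);
        FreeErrorData(edata);
    }
    PG_END_TRY();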

badboyd (Contributor) commented Aug 30, 2016

The behaviour of ereport(ERROR, ...) is similar to throwing an exception in a higher-level language - it will terminate the current SQL query or replication stream regardless of how deep in the C call stack it was. The error you're seeing in #113 is not reported explicitly by any code in Bottled Water, so it must be coming from somewhere in PG internals.

You are right about this. I've just commented in #113, please check it.
About restarting BW, my idea is: when the Postgres server is restarted (which happens a lot in real life), BW should wait a number of seconds and then try to reconnect; this keeps the replication stream running smoothly and reduces downtime.

samstokes (Contributor, Author)

@badboyd I agree it would be useful to have Bottled Water reconnect after Postgres is restarted, although I think that's unrelated to this pull request. That said, if you're running Bottled Water under some kind of process supervisor like runit or supervisord - which is a good idea anyway - you can get the same effect by configuring the supervisor to restart BW if it exits with a nonzero error code.
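For example, a supervisord stanza along these lines would give you that behaviour (the command line is illustrative; only --on-error appears in this PR):

    [program:bottledwater]
    ; command line is illustrative - adjust to your actual invocation
    command=/usr/local/bin/bottledwater --on-error=log
    ; restart unless the exit code is listed in exitcodes (only 0 counts as expected)
    autorestart=unexpected
    exitcodes=0
    ; the process must stay up this long to count as successfully started
    startsecs=5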
