Pass client error policy down to snapshot and output plugin #115

Merged · 7 commits · Aug 31, 2016

Conversation

samstokes (Contributor)

This adds an additional error_policy parameter to the snapshot function bottledwater_export and to the logical decoding output plugin (specified via START_REPLICATION), with similar semantics to the --on-error flag added to the client in PR #85. The default policy is "exit", which preserves the existing behaviour: any error encoding a row for output to the client terminates the snapshot or replication stream. Under the "log" policy, the error is instead logged and the offending row skipped.

This also ensures that the Bottled Water client passes the policy specified by --on-error down to the snapshot and output plugin. Previously this could lead to a confusing situation where you specified --on-error=log but some errors would still terminate the stream (since the error policy did not apply to the extension): e.g. #36 and #113.

It also adds tests for handling of large values that exceed the extension's internal maximum buffer size, to illustrate and exercise the difference between the two error policies.

This does not actually guarantee that all errors will be skipped in "log" mode - currently it skips only Avro encoding errors and errors writing the buffer. (e.g. it will probably not catch the "missing chunk for toast value" error described in #113.) This is because the extension was originally coded with the assumption that any error would exit early, so it may not be safe to continue after every possible error - an unanticipated error might leave the extension in a bad state. We should be able to add handling for additional errors on a case-by-case basis.
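To make the semantics concrete, here is a minimal sketch of the policy dispatch, assuming a hypothetical error_policy_t enum and the error_policy_handle helper mentioned later in this thread; the real extension code is more involved and the names here are assumptions for illustration:

    /* Sketch only - not the extension's actual code. The enum values and the
     * helper's signature are illustrative assumptions. */
    #include "postgres.h"

    typedef enum {
        ERROR_POLICY_EXIT,   /* default: any encoding error aborts the snapshot/stream */
        ERROR_POLICY_LOG     /* log the error and skip the offending row */
    } error_policy_t;

    static void error_policy_handle(error_policy_t policy, const char *msg) {
        if (policy == ERROR_POLICY_LOG)
            elog(WARNING, "skipping row: %s", msg);   /* keep the stream going */
        else
            elog(ERROR, "%s", msg);                   /* aborts the current command */
    }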

samstokes (Contributor, Author)

@badboyd if you have a spare moment, do you think you could review this change? It touches similar areas to the --tables/--schemas pull request #101, so I'd really appreciate any thoughts you have on it.

(I know it will merge-conflict with PR #101, and I'll resolve the conflict when merging this.)

 PQExpBuffer query = createPQExpBuffer();
-appendPQExpBuffer(query, "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X",
+appendPQExpBuffer(query, "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X (error_policy '%s')",
badboyd (Contributor) commented Aug 27, 2016

I think the query should look like this:
appendPQExpBuffer(query, "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X (\"error_policy\" \'%s\')",

samstokes (Contributor, Author)

I think the double quotes around the param name are optional (this version worked when I tried it), but they're probably good practice, and it'll be more consistent with #101, so I've added them: 5d87428.

The single quotes don't need escaping though - a string literal can contain unescaped single quotes. Unless I misunderstood?
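For reference, with the double quotes added the full call presumably looks something like this (the argument names are illustrative, since the surrounding client code isn't shown in this thread):

    appendPQExpBuffer(query,
            "START_REPLICATION SLOT \"%s\" LOGICAL %X/%X (\"error_policy\" '%s')",
            slot_name,                                        /* illustrative variable names */
            (uint32) (start_lsn >> 32), (uint32) start_lsn,   /* standard %X/%X LSN split */
            policy_name);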

badboyd (Contributor) commented Aug 27, 2016

I think this patch is very cool. But there's an edge case like #113; in that case I think we should recreate the replication stream if error_policy = log, or exit in the case of error_policy = exit. What do you think about it?

samstokes (Contributor, Author)

Thanks for the review @badboyd! Re:

But there's an edge case like #113; in that case I think we should recreate the replication stream if error_policy = log, or exit in the case of error_policy = exit.

Assuming the error is persistent, recreating the replication stream wouldn't help: the client would reconnect, but it would just reprocess the same row that made it fail in the first place (since the replication slot still exists) and get nowhere.

The approach taken by this PR is to just skip over the data that caused the error, without terminating the replication stream, if error_policy=log. But it's hard to know if I'm catching all possible errors that way, since apparently there are errors like #113 that I'd never heard of and don't know how to reproduce :)

The behaviour of ereport(ERROR, ...) is similar to throwing an exception in a higher-level language - it will terminate the current SQL query or replication stream regardless of how deep in the C call stack it was. The error you're seeing in #113 is not reported explicitly by any code in Bottled Water, so it must be coming from somewhere in PG internals.
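For illustration only (the message and variables here are made up, not taken from Postgres or Bottled Water), an error raised deep inside the server looks like this, and control never returns to the caller:

    ereport(ERROR,
            (errcode(ERRCODE_INTERNAL_ERROR),
             errmsg("unexpected chunk number %d for toast value %u",
                    chunk_no, toast_oid)));   /* longjmps to the nearest error handler */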

There is a PG_CATCH macro which I think we could use to catch any errors in the region of code that calls into Postgres to fetch the data for encoding. However, I can't find any documentation for it outside the Postgres source code, and I'm not familiar with its behaviour. Given that this part of the code is essential to Bottled Water's functioning, I'm apprehensive about putting a "catch-all" in there, at least without the ability to reproduce the error in question.

If you can reproduce the "missing chunk for toast" error, then I'd be interested to know if PG_CATCH works for intercepting it - in which case you could call error_policy_handle in the catch block and skip over it (assuming that skipping over it is safe!). This PR at least puts the plumbing in place.
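A minimal sketch of that plumbing, assuming the hypothetical names from the earlier sketch (encode_row stands in for whatever call can raise the error; whether skipping is actually safe here is exactly the open question):

    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        encode_row(ctx, relation, tuple);      /* hypothetical call that may ereport(ERROR) */
    }
    PG_CATCH();
    {
        ErrorData *edata;

        if (policy != ERROR_POLICY_LOG)
            PG_RE_THROW();                     /* "exit": keep the original behaviour */

        /* "log": capture the error, clear it, report it, and skip the row */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();
        elog(WARNING, "skipping row: %s", edata->message);
        FreeErrorData(edata);
    }
    PG_END_TRY();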

badboyd (Contributor) commented Aug 30, 2016

The behaviour of ereport(ERROR, ...) is similar to throwing an exception in a higher-level language - it will terminate the current SQL query or replication stream regardless of how deep in the C call stack it was. The error you're seeing in #113 is not reported explicitly by any code in Bottled Water, so it must be coming from somewhere in PG internals.

You are right about this. I've just commented in #113, please check it.
About restarting BW, my idea is: when the Postgres server is restarted (which happens a lot in real life), BW should wait a number of seconds and then try to reconnect; this keeps the replication stream running smoothly and reduces downtime.

samstokes (Contributor, Author)

@badboyd I agree it would be useful to have Bottled Water reconnect after Postgres is restarted, although I think that's unrelated to this pull request. That said, if you're running Bottled Water under some kind of process supervisor like runit or supervisord - which is a good idea anyway - you can get the same effect by configuring the supervisor to restart BW if it exits with a nonzero error code.
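For example, a supervisord stanza along these lines would give you that behaviour (the command line is illustrative; only --on-error appears in this PR):

    [program:bottledwater]
    ; command line is illustrative - adjust to your actual invocation
    command=/usr/local/bin/bottledwater --on-error=log
    ; restart unless the exit code is listed in exitcodes (only 0 counts as expected)
    autorestart=unexpected
    exitcodes=0
    ; the process must stay up this long to count as successfully started
    startsecs=5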
