
fix(rootfs): refactor database init process #112

Closed

Conversation

@bacongobbler (Member) commented Jun 6, 2016

This refactor includes:

  • removing PGCTLTIMEOUT; now you only need to modify the readinessProbe timeout
  • cleaning up database boot script

Note that this is not required for 2.0.

Manual testing procedure:

  • boot up database
  • create a few apps
  • delete the database via kubectl delete po
  • ensure the database recovers correctly

closes #55
closes #108
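For reference, a rough sketch of the manual testing procedure above as commands (the namespace, pod name, and app name here are assumptions for illustration, not values taken from this PR):

# wait for the database pod to come up
kubectl --namespace=deis get pods
# create a few apps (assumes the Workflow CLI is configured)
deis create my-test-app
# delete the database pod, forcing a recovery
kubectl --namespace=deis delete po deis-database-<pod-id>
# watch the replacement pod and confirm it recovers with the apps intact
kubectl --namespace=deis logs -f deis-database-<new-pod-id>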

@mention-bot

By analyzing the blame information on this pull request, we identified @kmala to be a potential reviewer

@monaka (Contributor) commented Jul 17, 2016

I rebased this patch in my repository and ran the tests on Travis CI.
(Strictly speaking, my WAL-E is patched, but that should have few side effects on this PR.)

The tests failed randomly; sometimes they finish successfully. Below is an example of the log on failure.

LOG:  restored log file "00000001000000000000000D" from archive
LOG:  redo starts at 0/D000090
LOG:  consistent recovery state reached at 0/D0000B8
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
-----> checking if postgres is running
make: *** [test-functional-swift] Error 2

@bacongobbler (Member, Author)

Yeah, this is still a WIP; it's not expected to work yet.

@@ -0,0 +1,14 @@
#!/usr/bin/env bash

until psql -l -t >/dev/null 2>&1; do sleep 1; done
@monaka (Contributor) Jul 23, 2016

Should this line be inside an "if ... then ... fi" block? (e.g. line 7)

(refs #123 (comment))

@bacongobbler (Member, Author)

How come? We want to wait here until the database is ready to accept connections. When I get back to this PR, I'd like to rewrite this as

until is_running; do sleep 1; done

so it's clearer what the intent of this line is.
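For what it's worth, a minimal sketch of what such an is_running helper could look like (my assumption, not the script from this repo): report success only once postgres accepts connections and has finished replaying WAL.

#!/usr/bin/env bash
# hypothetical is_running sketch: exit 0 only when postgres accepts
# connections and is no longer in recovery
pg_isready -q && [ "$(psql -At -c 'SELECT pg_is_in_recovery()')" = "f" ]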

@monaka (Contributor) commented Jul 25, 2016

Travis shows errors. I guess they may be caused by a timeout. Should the test driver be patched?

@bacongobbler (Member, Author)
I think it's because I'm out of date with master. Trying again.

@bacongobbler (Member, Author)
OK, got Travis to pass; now we just need e2e to pass.

@bacongobbler bacongobbler added this to the v2.3 milestone Jul 25, 2016
@@ -35,6 +30,9 @@ archive_mode = on
archive_command = 'envdir "${WALE_ENVDIR}" wal-e wal-push %p'
archive_timeout = 60
listen_addresses = '*'
archive_mode = on
archive_command = 'envdir "${WALE_ENVDIR}" wal-e wal-push %p'
archive_timeout = 60
@monaka (Contributor)

Duplicate of L31? (L33 and L34 are similar.)

@bacongobbler (Member, Author)

Good catch, must've missed it in the rebase.
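For the record, after dropping the duplicated lines the block would presumably read:

archive_mode = on
archive_command = 'envdir "${WALE_ENVDIR}" wal-e wal-push %p'
archive_timeout = 60
listen_addresses = '*'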

@bacongobbler bacongobbler force-pushed the refactor-database branch 4 times, most recently from 3396707 to 72390e8 Compare July 26, 2016 04:43
@bacongobbler bacongobbler removed this from the v2.3 milestone Jul 26, 2016
@bacongobbler (Member, Author)
Re-labeling as in progress until I identify the Travis failures.

@monaka (Contributor) commented Jul 28, 2016

I guess the failures on Travis are caused by timing issues in the tests only. I also got random failures on my own Travis builds.

@bacongobbler (Member, Author) commented Jul 28, 2016

I don't think these are random this time, but I'll eventually look into it.

@bacongobbler (Member, Author)
FYI, I managed to get #131 in, so half of the work here has already been done.

When you complete a recovery of the database for the first time, a new log timeline
is started with an ID of 2. When you restore again, the timeline in use when the last
backup occurred is replayed. Because of this, if you restored the database and did not
perform a backup, all data from that successful recovery would be lost, because only
WAL logs from the first timeline (the timeline on which the database was last backed
up) would be restored.

To fix this, after completing a database recovery we create a fresh backup to
establish a new recovery baseline. That way we can replay from timeline 2.

Other fixes that this refactor includes:

 - database attempting to start halfway through recovery
 - /bin/is_running identifying the database as running during recovery
 - cleaning up database boot script
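In shell terms, the post-recovery step described above would presumably boil down to pushing a fresh base backup once recovery finishes (a sketch only; ${PGDATA} and where this is invoked are assumptions, though backup-push is WAL-E's command for base backups):

# after recovery completes, push a fresh base backup so future restores
# replay WAL from the new timeline rather than the pre-recovery one
envdir "${WALE_ENVDIR}" wal-e backup-push "${PGDATA}"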
@bacongobbler bacongobbler force-pushed the refactor-database branch 2 times, most recently from 917e795 to 37a75b5 Compare November 2, 2016 17:03
@mboersma (Member)
@bacongobbler is this PR still worth pursuing? Should we prioritize it for the next release, or maybe mark it "help wanted"?

@bacongobbler (Member, Author)
I think it's still worth pursuing eventually to get rid of PGCTLTIMEOUT on long recovery times; otherwise there's no real end benefit. It's mostly cleanup, so it's not a high priority.

@mboersma (Member) commented Jul 5, 2017

I'm going to close this PR since it's old and rusty, but @bacongobbler please re-open if this code gets resurrected.

@mboersma mboersma closed this Jul 5, 2017