rebooting loop after recovered by WAL-E. #123

monaka · 2016-06-27T21:15:34Z

The deis-database pod on my cluster got stuck in a reboot loop.
I'm not sure of the cause yet, because it seems to have started while I was asleep.

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-06-27T21:09:38.279511-00 pid=4212
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2016-06-27T21:09:40.429363-00 pid=4212 action=wal-fetch key=wabs://database/wal_005/0000000500000004000000EB.lzo prefix= seg=0000000500000004000000EB
LOG:  restored log file "0000000500000004000000EB" from archive
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-push".
        STRUCTURED: time=2016-06-27T21:09:45.837498-00 pid=4223
psql: FATAL:  the database system is starting up
FATAL:  the database system is starting up
wal_e.main   ERROR    MSG: could not run one or more external programs WAL-E depends upon
        DETAIL: Could not get a connection to the database: note that superuser access
        is required
        Could not run the following programs, are they installed? psql
        STRUCTURED: time=2016-06-27T21:09:47.631257-00 pid=4223

bacongobbler · 2016-06-27T21:28:11Z

Could not run the following programs, are they installed? psql

Are you running v2.0.0? That error indicates psql is not installed in your container.

monaka · 2016-06-27T21:34:14Z

@bacongobbler Yes, v2.0.0.

It was fixed after I deleted the pod with kubectl delete po deis-database ....
So the container image does have psql.
It's not impossible that someone deleted psql inside the running instance, but that is hard to imagine.

I thought it was caused by memory exhaustion, but the output of free on the node looks healthy:

             total       used       free     shared    buffers     cached
Mem:       7139592    7067840      71752       8384     148344    4530472
-/+ buffers/cache:    2389024    4750568
Swap:      8388604     367704    8020900
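
For readers checking the "healthy" reading of this output: the `-/+ buffers/cache` row is derived from the `Mem:` row, since buffers and page cache can be reclaimed when applications need memory. The sketch below redoes that arithmetic on the numbers above. The helper names (`parse_free`, `reclaimable_free`) are made up for illustration, and the parser assumes the older procps layout shown here, with values in KiB.

```python
FREE_OUTPUT = """
             total       used       free     shared    buffers     cached
Mem:       7139592    7067840      71752       8384     148344    4530472
-/+ buffers/cache:    2389024    4750568
Swap:      8388604     367704    8020900
"""


def parse_free(text):
    """Return the Mem: row of old-layout `free` output as a dict of KiB values."""
    lines = text.strip().splitlines()
    header = lines[0].split()  # total, used, free, shared, buffers, cached
    mem_row = next(line for line in lines if line.startswith("Mem:"))
    values = [int(v) for v in mem_row.split()[1:]]
    return dict(zip(header, values))


def reclaimable_free(mem):
    """Memory usable by applications: free plus reclaimable buffers and cache."""
    return mem["free"] + mem["buffers"] + mem["cached"]


def used_by_applications(mem):
    """Memory actually held by applications: used minus buffers and cache."""
    return mem["used"] - mem["buffers"] - mem["cached"]


mem = parse_free(FREE_OUTPUT)
print(reclaimable_free(mem))      # 4750568
print(used_by_applications(mem))  # 2389024
```

Both results match the `-/+ buffers/cache` row, so roughly 4.5 GiB of the 7 GB node was still available even though the `free` column alone shows only about 70 MiB.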

bacongobbler · 2016-06-27T21:39:18Z

okay so if you're on v2.0.0 then the other reason this error would pop up is if wal-e could not get a connection to the database, as it says in the logs. Since the previous logs say

psql: FATAL: the database system is starting up

I would assume that is your issue, and that the database took an abnormally long time to boot. Once it was restarted it restored faster (likely connection issues to Azure?).

I've got a work-in-progress that removes the wait timeout, which is the likely cause for this issue. #112
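
To illustrate the idea of waiting without a fixed timeout, here is a minimal sketch. It is not the code from #112. `wait_until_ready` and `postgres_accepts_connections` are hypothetical names, and the readiness check assumes the PostgreSQL `pg_isready` client tool is available in the container.

```python
import subprocess
import time


def wait_until_ready(is_ready, interval=1.0, timeout=None,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True.

    timeout=None waits forever, so a slow restore from the archive never
    turns into a failed boot. Returns False only if a timeout was given
    and has expired.
    """
    start = clock()
    while not is_ready():
        if timeout is not None and clock() - start >= timeout:
            return False
        sleep(interval)
    return True


def postgres_accepts_connections():
    """Assumption: pg_isready is installed; it exits 0 only when the
    server accepts connections (non-zero while it is starting up)."""
    return subprocess.run(["pg_isready", "-q"]).returncode == 0
```

With something like this, a step that needs a live connection, such as `backup-push`, would only run after `wait_until_ready(postgres_accepts_connections)` returns, instead of failing with "the database system is starting up".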

monaka · 2016-06-27T21:45:52Z

I see. I'll try the canary build after #112 is merged. I'll also try to collect more information when this issue reproduces.

monaka · 2016-06-27T22:55:00Z

BTW, some people using Kube on Azure may have DNS-related issues.

It would make sense if the issues I posted recently were specific to DNS on Azure.
Tests (lightweight usage) may pass while production usage fails.

monaka · 2016-07-18T08:30:58Z

I tried builds based on #112 and they seem to resolve this issue.
Even though the CI tests fail randomly, it works well in my Workflow.

monaka · 2016-07-22T23:49:11Z

(I know PR #112 is WIP and that it is going to fix this issue in the near future.)

I was confused about my canary images, so let me retract this comment.
I'm still having trouble in this area and am investigating...

monaka · 2016-07-23T12:05:34Z

Although I have no solid evidence, I guess it happens when WAL-E and psql are executed at the same time. (Maybe psql inside WAL-E and psql outside WAL-E.)

The deis/database container runs psql periodically.
I think there is no need to run psql in recovery mode. Is my guess reasonable?
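
One way to picture this suggestion is the sketch below, which skips a periodic task while the server is still in archive recovery. It is an illustration only, not how deis/database is implemented. The function names are hypothetical, and the check assumes a PostgreSQL version older than 12, where a `recovery.conf` file in the data directory requests recovery and is renamed to `recovery.done` once recovery finishes.

```python
import os


def in_archive_recovery(pgdata):
    """Assumption: PostgreSQL < 12, where recovery.conf in the data
    directory means recovery is pending or still running."""
    return os.path.exists(os.path.join(pgdata, "recovery.conf"))


def run_periodic_task(pgdata, task):
    """Run task() unless the server is in archive recovery.

    Returns True if the task ran, False if it was skipped.
    """
    if in_archive_recovery(pgdata):
        return False
    task()
    return True
```

Here `task` would be whatever wraps the periodic psql call. Skipping it during recovery avoids a second psql connection attempt competing with the one WAL-E makes.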

monaka · 2016-07-23T21:41:40Z

Additional information:

Recovery failures with SIGQUIT may be reduced by upgrading the spec of the node running SkyDNS (not the node running deis/database).
In my case, I moved from Azure D2_V2 (2 cores / 7 GB RAM) to F4 (4 cores / 8 GB RAM).

That said, upgrading the spec is no silver bullet, because I still get random terminations by signal 3 from WAL-E.
But it is a step forward.

(BTW, I'm curious. According to the official documentation, 2 cores * 2 nodes is enough to run Deis Workflow, but my cluster needs more. My nodes are not in production, with just a few sample apps running. Why does the cluster require more power...)

monaka · 2016-08-12T03:33:55Z

At least on my Deis Workflow, this issue was resolved by #112.
I guess this can be closed after #112 is merged.

bacongobbler · 2016-10-31T18:57:21Z

This should be resolved via #137. If it isn't, please re-open the issue at wal-e/wal-e. Thanks!

mboersma added the bug label Jun 28, 2016

monaka mentioned this issue Jul 22, 2016

Infinite loop by "Could not run the following programs, are they installed? psql" deis/wal-e#16

Closed

monaka mentioned this issue Jul 23, 2016

fix(rootfs): refactor database init process #112

Closed

This was referenced Jul 24, 2016

Falied to spawn WAL-E in the recovery mode. #127

Closed

Refactor database (some additions to @bacongobbler's PR) #128

Closed

bacongobbler closed this as completed Oct 31, 2016

miracle2k mentioned this issue Aug 1, 2017

Backup restore loop: "Could not get a connection to the database:" #200

Open
