
Failed to spawn WAL-E in recovery mode. #127

Closed
monaka opened this issue Jul 24, 2016 · 9 comments

@monaka
Contributor

monaka commented Jul 24, 2016

It is similar to #123, but I suspect they are different issues.

Sometimes the recovery process failed with logs like this:

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-07-24T07:23:32.631374-00 pid=296
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2016-07-24T07:23:33.115881-00 pid=296 action=wal-fetch key=wabs://database/wal_005/0000000500000002000000FB.lzo prefix= seg=0000000500000002000000FB
LOG:  restored log file "0000000500000002000000FB" from archive
LOG:  server process (PID 171) exited with exit code 2
LOG:  terminating any other active server processes
FATAL:  could not restore file "0000000500000002000000FC" from archive: child process was terminated by signal 3: Quit
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted while in recovery at log time 2016-07-20 23:15:18 UTC
HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG:  starting archive recovery

In this case, 0000000500000002000000FB was restored and 0000000500000002000000FC failed to restore.

There is no wal_e.main INFO MSG: starting WAL-E message for ...FC.
Looking at cmd.py in deis/wal-e, there is no code path that exits without printing the message above (except the version command).
So it might be that Postgres failed to spawn WAL-E at all.
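
For reference, Postgres spawns WAL-E through restore_command (typically something like restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'; the exact wrapper here is an assumption, not taken from this setup). The control flow of cmd.py is roughly the sketch below, a hypothetical simplification for illustration rather than the actual source: the "starting WAL-E" line precedes any subcommand work, so its absence implies main() was never reached.

# Hypothetical reconstruction of cmd.py's control flow, not the real code.
import argparse
import logging
import sys

logger = logging.getLogger('wal_e.main')

def main(argv=None):
    parser = argparse.ArgumentParser(prog='wal-e')
    parser.add_argument('--version', action='store_true')
    parser.add_argument('subcommand', nargs='?')
    args = parser.parse_args(argv)

    if args.version:
        # The only exit path that skips the "starting WAL-E" message.
        print('wal-e x.y.z')
        sys.exit(0)

    # Every other path logs this before doing anything else, so a missing
    # message means the process died (or was never spawned) before this point.
    logger.info('MSG: starting WAL-E')
    # ... dispatch to wal-fetch, backup-push, etc. ...

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    main()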

In #123, the failure occurred inside WAL-E itself; that is the difference.

@monaka
Contributor Author

monaka commented Jul 24, 2016

As I mentioned in a comment on #56, this issue is similar to wal-e/wal-e#247.
@xcompass reported that one to WAL-E, and I also suspected WAL-E earlier.

@monaka
Contributor Author

monaka commented Jul 26, 2016

And here may be another pattern.

It looks like WAL-E works well,
but the startup process was suddenly killed by signal 9 after the message LOG:  restored log file "000000090000000600000047" from archive.

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-07-26T00:21:01.242533-00 pid=10695
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-07-26T00:21:01.289450-00 pid=10695 action=wal-fetch key=wabs://database/wal_005/000000090000000600000047.lzo prefix= seg=000000090000000600000047 state=begin
requests.packages.urllib3.connectionpool INFO     Starting new HTTPS connection (1): monadeis2.blob.core.windows.net
wal_e.blobstore.wabs.wabs_util INFO     MSG: completed download and decompression
        DETAIL: Downloaded and decompressed "wabs://database/wal_005/000000090000000600000047.lzo" to "pg_xlog/RECOVERYXLOG"
        STRUCTURED: time=2016-07-26T00:21:02.100878-00 pid=10695
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2016-07-26T00:21:02.101141-00 pid=10695 action=wal-fetch key=wabs://database/wal_005/000000090000000600000047.lzo prefix= seg=000000090000000600000047 state=complete
LOG:  restored log file "000000090000000600000047" from archive
LOG:  startup process (PID 10639) was terminated by signal 9: Killed
LOG:  aborting startup due to startup process failure

The output of kubectl describe po is shown below:

$ kubectl describe po deis-database-pqp3l --namespace deis
Name:           deis-database-pqp3l
Namespace:      deis
Node:           10.0.1.4/10.0.1.4
Start Time:     Mon, 25 Jul 2016 23:30:52 +0000
Labels:         app=deis-database
Status:         Running
IP:             172.16.12.21
Controllers:    ReplicationController/deis-database
Containers:
  deis-database:
    Container ID:       docker://562f44a31355fd3a361471fbe5647c41b424c93e983d4a5f98aa669e6c1d1c78
    Image:              monami0ya/postgres:canary
    Image ID:           docker://sha256:9ea726e4c23908edca409ac5608daca1755a8c81e7a826448fa99ff6b22f5ad6
    Port:               5432/TCP
    QoS Tier:
      cpu:              BestEffort
      memory:           BestEffort
    State:              Running
      Started:          Tue, 26 Jul 2016 00:21:19 +0000
    Last State:         Terminated
      Reason:           Error
      Exit Code:        1
      Started:          Tue, 26 Jul 2016 00:11:16 +0000
      Finished:         Tue, 26 Jul 2016 00:21:02 +0000
    Ready:              False
    Restart Count:      7
    Readiness:          exec [is_running] delay=30s timeout=5s period=10s #success=1 #failure=3
    Environment Variables:
      DATABASE_STORAGE: azure
Conditions:
  Type          Status
  Ready         False
Volumes:
  database-creds:
    Type:       Secret (a volume populated by a Secret)
    SecretName: database-creds
  objectstore-creds:
    Type:       Secret (a volume populated by a Secret)
    SecretName: objectstorage-keyfile
  deis-database-token-xwak6:
    Type:       Secret (a volume populated by a Secret)
    SecretName: deis-database-token-xwak6
Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath                   Type            Reason          Message
  ---------     --------        -----   ----                    -------------                   --------        ------          -------
  55m           55m             1       {default-scheduler }                                    Normal          Scheduled       Successfully assigned deis-database-pqp3l to 10.0.1.4
  54m           54m             1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal          Created         Created container with docker id ae26f6176ace
  54m           54m             1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal          Started         Started container with docker id ae26f6176ace
  44m           44m             1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal          Started         Started container with docker id 3732beaec148
  44m           44m             1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal          Created         Created container with docker id 3732beaec148
  37m           37m             1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal          Started         Started container with docker id 2127c5d56f84
  37m           37m             1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal          Created         Created container with docker id 2127c5d56f84
  26m           26m             1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal          Created         Created container with docker id c74766caf80e
  26m           26m             1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal          Started         Started container with docker id c74766caf80e
  37m           25m             2       {kubelet 10.0.1.4}                                      Warning         FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "deis-database" with CrashLoopBackOff: "Back-off 10s restarting failed container=deis-database pod=deis-database-pqp3l_deis(d32c1afb-52bf-11e6-a43b-000d3a50375f)"

  25m   25m     1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Created         Created container with docker id 4bf9bd630ae2
  25m   25m     1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Started         Started container with docker id 4bf9bd630ae2
  18m   17m     2       {kubelet 10.0.1.4}                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "deis-database" with CrashLoopBackOff: "Back-off 20s restarting failed container=deis-database pod=deis-database-pqp3l_deis(d32c1afb-52bf-11e6-a43b-000d3a50375f)"

  17m   17m     1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Created         Created container with docker id f4a449522b29
  17m   17m     1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Started         Started container with docker id f4a449522b29
  16m   15m     4       {kubelet 10.0.1.4}                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "deis-database" with CrashLoopBackOff: "Back-off 40s restarting failed container=deis-database pod=deis-database-pqp3l_deis(d32c1afb-52bf-11e6-a43b-000d3a50375f)"

  15m   15m     1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Started         Started container with docker id 2c3a2ba80ee2
  15m   15m     1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Created         Created container with docker id 2c3a2ba80ee2
  37m   5m      9       {kubelet 10.0.1.4}      spec.containers{deis-database}  Warning BackOff         Back-off restarting failed docker container
  5m    5m      1       {kubelet 10.0.1.4}                                      Warning FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "deis-database" with CrashLoopBackOff: "Back-off 1m20s restarting failed container=deis-database pod=deis-database-pqp3l_deis(d32c1afb-52bf-11e6-a43b-000d3a50375f)"

  55m   5m      8       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Pulling         pulling image "monami0ya/postgres:canary"
  54m   5m      8       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Pulled          Successfully pulled image "monami0ya/postgres:canary"
  5m    5m      1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Started         Started container with docker id 562f44a31355
  5m    5m      1       {kubelet 10.0.1.4}      spec.containers{deis-database}  Normal  Created         Created container with docker id 562f44a31355
  54m   3s      291     {kubelet 10.0.1.4}      spec.containers{deis-database}  Warning Unhealthy       Readiness probe failed:
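
For what it's worth, signal 9 here is unlikely to come from Postgres itself; on Linux it is often the kernel OOM killer, and this pod runs with BestEffort QoS (no memory request or limit), which makes it the first candidate to be killed under node memory pressure. A quick check on the node (assuming shell access; this command is a suggestion, not taken from the thread) would be:

$ dmesg | grep -iE 'out of memory|oom|killed process'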

@monaka
Contributor Author

monaka commented Aug 11, 2016

Recent revisions of K8s use kubedns+dnsmasq instead of SkyDNS, and it is known that the internal DNS in the Azure cloud has poor performance, so I tried dnsmasq with SkyDNS.
Dnsmasq did indeed improve DNS query responses (more than 900% faster!),
but it does not seem to fix this issue.

@monaka
Contributor Author

monaka commented Aug 12, 2016

It seems that "exit code 2" is caused by an exception (or exceptions) in WAL-E (cmd.py), not in Postgres.
This may also be reproducible with the official WAL-E, so I'll move this issue there.

@fdr

fdr commented Aug 13, 2016

Is it possible something is wrong with 0000000500000002000000FB, or its restore? In this case WAL-E is getting killed with SIGQUIT by Postgres, which has apparently decided it wants to shut down.

@monaka
Contributor Author

monaka commented Aug 14, 2016

Indeed, WAL-E may be killed by SIGQUIT, but I saw a different message in those cases.

In this case, the log messages changed to "exit code 5" after I changed sys.exit(2) to sys.exit(5), but without the message that logging.critical() was expected to print. I wonder why...

@fdr

fdr commented Aug 14, 2016

Try dumping the error being caught to stderr and/or stdout?

@monaka
Contributor Author

monaka commented Aug 15, 2016

I added more info to wal-e/wal-e#268.
I think WAL-E exits before the log message is flushed.
I'll also try dumping to stderr (since stdout will be buffered).
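
A minimal sketch of that idea (the wrapper name and placement are illustrative assumptions, not the actual cmd.py code): write the caught exception to stderr, which is not block-buffered the way a piped stdout is, and flush explicitly so the diagnostic survives even if the process exits immediately afterwards.

import sys
import traceback

def run_with_diagnostics(entry_point):
    try:
        entry_point()
    except SystemExit:
        # Let deliberate exits pass through untouched.
        raise
    except BaseException:
        # Print the traceback to stderr and flush before exiting, so the
        # message is not lost in a buffered stdout stream.
        traceback.print_exc(file=sys.stderr)
        sys.stderr.flush()
        sys.exit(2)

if __name__ == '__main__':
    run_with_diagnostics(lambda: 1 / 0)  # demo: prints a traceback, exits 2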

@bacongobbler
Member

This should be resolved via #137. If it isn't, please re-open the issue at wal-e/wal-e. Thanks!
