Liveness probe kills pod while upgrading #333

remram44 · 2023-01-19T20:18:23Z

It is really unclear how you mean for NextCloud to ever upgrade. There are no instructions #221 and automatic upgrade probably can't work since all replicas would race against each other, and pods get killed while doing it because of failing liveness checks.

How can this work?

jessebot · 2023-01-24T15:04:05Z

I haven't tried a manual upgrade yet, but I'm interested in your findings, because we can then create a PR to guide others as well. If not, I'll update this issue when I do get around to testing an upgrade and share my findings as well.

In case it's helpful: You can disable probes, but I'm not sure that's what you want. It took me a few tries to find settings that worked for my infra for at least the initial deployment, but I currently update these values in my values.yaml:

## Liveness and readiness probe values
## Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
##
livenessProbe:
  enabled: true
  initialDelaySeconds: 45
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

readinessProbe:
  enabled: true
  initialDelaySeconds: 45
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

startupProbe:
  enabled: false
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30
  successThreshold: 1

And if you're deploying directly to your cluster, you can edit the probe values, however, the time it takes will depend on your infrastructure, and so I'm not sure what values would be best for you, but you could do a helm upgrade to apply the new values, or you could modify the deployment directly.

provokateurin · 2023-01-24T15:24:20Z

Replicas can't race against each other because there is a locking mechanism implemented in the nextcloud docker image (See https://github.com/nextcloud/docker/blob/master/25/apache/entrypoint.sh#L127-L240).
I never had problems with the probes on my infra, but the default values should definitely work on most hardware, so we should change them accordingly. @remram44 If you could experiment with the values until they work we could compare them with the ones from @jessebot and take the most forgiving ones of each of you.

remram44 · 2023-01-24T15:28:11Z

I can change replicas to 1 and remove the probes before I upgrade then reset it, but it would probably be better if it worked out-of-the-box. I don't know the consequences of an upgrade getting killed half-way, it probably shouldn't be something that happens in normal operations.

It would probably be better to either:

Have the probes pass while the upgrade is happening. This is a little tricky because the upgrade happens in the Docker image's entrypoint, and no server is running at that point to reply to GET requests
Use a Helm hook to do the upgrade outside of the deployment proper. This way you don't need probes, and you can fail the helm upgrade if the upgrade fail
Use an init container to do the upgrade, which is not subject to probes

About the number of replicas, there is a locking mechanism in the image (primitive -- not sure why it doesn't use flock?), but it needs to be enabled by setting NEXTCLOUD_INIT_LOCK. This should probably be set in the Helm chart.

remram44 · 2023-01-24T15:29:59Z

@provokateurin NEXTCLOUD_INIT_LOCK is NOT set by default as per the README:

NEXTCLOUD_INIT_LOCK (not set by default) Set it to true to enable initialization locking. Other containers will wait for the current process to finish updating the html volume to continue.

jessebot · 2023-01-26T23:31:48Z

About the number of replicas, there is a locking mechanism in the image (primitive -- not sure why it doesn't use flock?), but it needs to be enabled by setting NEXTCLOUD_INIT_LOCK. This should probably be set in the Helm chart.

@remram44 would you like to create a PR for this change?

Env vars are created here:

helm/charts/nextcloud/templates/_helpers.tpl

Lines 65 to 72 in a491977

    
           {{/* 
        
           Create environment variables used to configure the nextcloud container as well as the cron sidecar container. 
        
           */}} 
        
           {{- define "nextcloud.env" -}} 
        
           {{- if .Values.phpClientHttpsFix.enabled }} 
        
           - name: OVERWRITEPROTOCOL 
        
             value: {{ .Values.phpClientHttpsFix.protocol | quote }} 
        
           {{- end }}

And then you'd need to update the config parameter table in our README.md and also the values.yaml, likely around here with the env var and a comment about it:

helm/charts/nextcloud/values.yaml

Lines 72 to 85 in a491977

    
           nextcloud: 
        
             host: nextcloud.kube.home 
        
             username: admin 
        
             password: changeme 
        
             ## Use an existing secret 
        
             existingSecret: 
        
               enabled: false 
        
               # secretName: nameofsecret 
        
               # usernameKey: username 
        
               # passwordKey: password 
        
               # tokenKey: serverinfo_token 
        
               # smtpUsernameKey: smtp_username 
        
               # smtpPasswordKey: smtp_password 
        
             update: 0

remram44 · 2023-01-27T03:11:31Z

Sure, opened #343. Note that this doesn't fix this issue, we should still do something about the probe.

jessebot · 2023-01-27T08:21:30Z

That makes sense, but that should probably be second PR. I haven't worked with helm hooks before, so I'd need to do some research, but if you or anyone else knows the fix off the top of your head, you can submit a PR and we will review it.

Note: I work on this in my spare time, and I've been on vacation, so I've been pretty busy in this repo, but I will be going back to work next week, so my response time may vary 🙏 I really appreciate your patience though. It feels great to see this repo improving :)

remram44 · 2023-01-27T16:11:20Z

I think the easiest way to do this is to use an init container. The Docker image looks like it supports doing the upgrade and then exiting, you just need to pass "true" as the command. I can work on a pull request.

jessebot · 2023-01-27T16:29:39Z

Awesome, thank you for working on this :) Please just link back to the relevant docker lines you're referencing in the PR, so others can quickly follow along and get up to speed.

davidfrickert · 2024-06-06T23:13:49Z

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

jessebot · 2024-06-09T07:48:14Z

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

For now, disable the probes or increase the timeout on them. In the meantime, I've added follow up comments to #344 to potentially solve this in the future.

davidfrickert · 2024-06-09T13:37:25Z

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

For now, disable the probes or increase the timeout on them. In the meantime, I've added follow up comments to #344 to potentially solve this in the future.

I think a startup probe works well here, at least it seems to have worked on the latest upgrade.

startupProbe: 
    enabled: true
    periodSeconds: 5 
    timeoutSeconds: 5 
    failureThreshold: 100

Edit: never mind, was assuming latest helm chart changes had a nextcloud version change but not really. Will wait for next upgrade to see if it actually works

remram44 changed the title ~~Liveness checks kills pod while upgrading~~ Liveness probe kills pod while upgrading Jan 19, 2023

remram44 mentioned this issue Jan 27, 2023

Add nextcloud.initLock to set NEXTCLOUD_INIT_LOCK #343

Closed

3 tasks

jessebot added the upgrades label Jan 27, 2023

jessebot added the Probes Anything related to startupProbe, readinessProbe, or livenessProbes label Jan 27, 2023

remram44 linked a pull request Jan 27, 2023 that will close this issue

Avoid upgrade being killed by failed liveness probes #344

Open

2 tasks

provokateurin mentioned this issue Dec 6, 2023

Can't upgrade Nextcloud from chart version 4.5.3 to 4.5.5 #493

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Liveness probe kills pod while upgrading #333

Liveness probe kills pod while upgrading #333

remram44 commented Jan 19, 2023

jessebot commented Jan 24, 2023

provokateurin commented Jan 24, 2023

remram44 commented Jan 24, 2023 •

edited

Loading

remram44 commented Jan 24, 2023

jessebot commented Jan 26, 2023

remram44 commented Jan 27, 2023

jessebot commented Jan 27, 2023

remram44 commented Jan 27, 2023

jessebot commented Jan 27, 2023

davidfrickert commented Jun 6, 2024

jessebot commented Jun 9, 2024

davidfrickert commented Jun 9, 2024 •

edited by jessebot

Loading

Liveness probe kills pod while upgrading #333

Liveness probe kills pod while upgrading #333

Comments

remram44 commented Jan 19, 2023

jessebot commented Jan 24, 2023

provokateurin commented Jan 24, 2023

remram44 commented Jan 24, 2023 • edited Loading

remram44 commented Jan 24, 2023

jessebot commented Jan 26, 2023

remram44 commented Jan 27, 2023

jessebot commented Jan 27, 2023

remram44 commented Jan 27, 2023

jessebot commented Jan 27, 2023

davidfrickert commented Jun 6, 2024

jessebot commented Jun 9, 2024

davidfrickert commented Jun 9, 2024 • edited by jessebot Loading

remram44 commented Jan 24, 2023 •

edited

Loading

davidfrickert commented Jun 9, 2024 •

edited by jessebot

Loading