Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Liveness probe kills pod while upgrading #333

Open
remram44 opened this issue Jan 19, 2023 · 12 comments · May be fixed by #344
Open

Liveness probe kills pod while upgrading #333

remram44 opened this issue Jan 19, 2023 · 12 comments · May be fixed by #344
Labels
Probes Anything related to startupProbe, readinessProbe, or livenessProbes upgrades

Comments

@remram44
Copy link

It is really unclear how you mean for NextCloud to ever upgrade. There are no instructions #221 and automatic upgrade probably can't work since all replicas would race against each other, and pods get killed while doing it because of failing liveness checks.

How can this work?

@remram44 remram44 changed the title Liveness checks kills pod while upgrading Liveness probe kills pod while upgrading Jan 19, 2023
@jessebot
Copy link
Collaborator

I haven't tried a manual upgrade yet, but I'm interested in your findings, because we can then create a PR to guide others as well. If not, I'll update this issue when I do get around to testing an upgrade and share my findings as well.

In case it's helpful: You can disable probes, but I'm not sure that's what you want. It took me a few tries to find settings that worked for my infra for at least the initial deployment, but I currently update these values in my values.yaml:

## Liveness and readiness probe values
## Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
##
livenessProbe:
  enabled: true
  initialDelaySeconds: 45
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

readinessProbe:
  enabled: true
  initialDelaySeconds: 45
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

startupProbe:
  enabled: false
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30
  successThreshold: 1

And if you're deploying directly to your cluster, you can edit the probe values, however, the time it takes will depend on your infrastructure, and so I'm not sure what values would be best for you, but you could do a helm upgrade to apply the new values, or you could modify the deployment directly.

@provokateurin
Copy link
Member

Replicas can't race against each other because there is a locking mechanism implemented in the nextcloud docker image (See https://github.com/nextcloud/docker/blob/master/25/apache/entrypoint.sh#L127-L240).
I never had problems with the probes on my infra, but the default values should definitely work on most hardware, so we should change them accordingly. @remram44 If you could experiment with the values until they work we could compare them with the ones from @jessebot and take the most forgiving ones of each of you.

@remram44
Copy link
Author

remram44 commented Jan 24, 2023

I can change replicas to 1 and remove the probes before I upgrade then reset it, but it would probably be better if it worked out-of-the-box. I don't know the consequences of an upgrade getting killed half-way, it probably shouldn't be something that happens in normal operations.

It would probably be better to either:

  • Have the probes pass while the upgrade is happening. This is a little tricky because the upgrade happens in the Docker image's entrypoint, and no server is running at that point to reply to GET requests
  • Use a Helm hook to do the upgrade outside of the deployment proper. This way you don't need probes, and you can fail the helm upgrade if the upgrade fail
  • Use an init container to do the upgrade, which is not subject to probes

About the number of replicas, there is a locking mechanism in the image (primitive -- not sure why it doesn't use flock?), but it needs to be enabled by setting NEXTCLOUD_INIT_LOCK. This should probably be set in the Helm chart.

@remram44
Copy link
Author

@provokateurin NEXTCLOUD_INIT_LOCK is NOT set by default as per the README:

NEXTCLOUD_INIT_LOCK (not set by default) Set it to true to enable initialization locking. Other containers will wait for the current process to finish updating the html volume to continue.

@jessebot
Copy link
Collaborator

About the number of replicas, there is a locking mechanism in the image (primitive -- not sure why it doesn't use flock?), but it needs to be enabled by setting NEXTCLOUD_INIT_LOCK. This should probably be set in the Helm chart.

@remram44 would you like to create a PR for this change?

Env vars are created here:

{{/*
Create environment variables used to configure the nextcloud container as well as the cron sidecar container.
*/}}
{{- define "nextcloud.env" -}}
{{- if .Values.phpClientHttpsFix.enabled }}
- name: OVERWRITEPROTOCOL
value: {{ .Values.phpClientHttpsFix.protocol | quote }}
{{- end }}

And then you'd need to update the config parameter table in our README.md and also the values.yaml, likely around here with the env var and a comment about it:

nextcloud:
host: nextcloud.kube.home
username: admin
password: changeme
## Use an existing secret
existingSecret:
enabled: false
# secretName: nameofsecret
# usernameKey: username
# passwordKey: password
# tokenKey: serverinfo_token
# smtpUsernameKey: smtp_username
# smtpPasswordKey: smtp_password
update: 0

@remram44
Copy link
Author

Sure, opened #343. Note that this doesn't fix this issue, we should still do something about the probe.

@jessebot
Copy link
Collaborator

That makes sense, but that should probably be second PR. I haven't worked with helm hooks before, so I'd need to do some research, but if you or anyone else knows the fix off the top of your head, you can submit a PR and we will review it.

Note: I work on this in my spare time, and I've been on vacation, so I've been pretty busy in this repo, but I will be going back to work next week, so my response time may vary 🙏 I really appreciate your patience though. It feels great to see this repo improving :)

@remram44
Copy link
Author

I think the easiest way to do this is to use an init container. The Docker image looks like it supports doing the upgrade and then exiting, you just need to pass "true" as the command. I can work on a pull request.

@jessebot
Copy link
Collaborator

Awesome, thank you for working on this :) Please just link back to the relevant docker lines you're referencing in the PR, so others can quickly follow along and get up to speed.

@jessebot jessebot added the Probes Anything related to startupProbe, readinessProbe, or livenessProbes label Jan 27, 2023
@remram44 remram44 linked a pull request Jan 27, 2023 that will close this issue
2 tasks
@davidfrickert
Copy link

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

@jessebot
Copy link
Collaborator

jessebot commented Jun 9, 2024

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

For now, disable the probes or increase the timeout on them. In the meantime, I've added follow up comments to #344 to potentially solve this in the future.

@davidfrickert
Copy link

davidfrickert commented Jun 9, 2024

Any update on this work? I also run into this issue in every docker image upgrade, I guess I need to increase probe timeouts

For now, disable the probes or increase the timeout on them. In the meantime, I've added follow up comments to #344 to potentially solve this in the future.

I think a startup probe works well here, at least it seems to have worked on the latest upgrade.

startupProbe: 
    enabled: true
    periodSeconds: 5 
    timeoutSeconds: 5 
    failureThreshold: 100

Edit: never mind, was assuming latest helm chart changes had a nextcloud version change but not really. Will wait for next upgrade to see if it actually works

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Probes Anything related to startupProbe, readinessProbe, or livenessProbes upgrades
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants