Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ability to define deployment strategy #38

Merged
merged 1 commit into from
Feb 7, 2024

Conversation

ericgraf
Copy link
Contributor

@ericgraf ericgraf commented Jan 31, 2024

What type of PR is this?

Feature

Which issue does this PR fix:

What does this PR do / Why do we need it:

This PR introduces the ability to specify the upgrade strategy of the Zot deployment.

Strategy recreate needs to be set when trying to upgrade a single replica deployment with a PV defined.

In ReadWriteOnce accessMode pvcs the new pod will be stuck in creating until the pod with the pvc is manually deleted.
In ReadWriteMany accessMode pvcs the new pod will go into a crash loop with the error operation timeout: boltdb file is already in use.

If an issue # is not available please add repro steps and logs showing the issue:

Run the below commands to repoduce the failure.

#!/usr/bin/env bash

kind create cluster
helm repo add project-zot http://zotregistry.dev/helm-charts
helm repo update

# Install with an older zot image so it can be upgraded, also create a pvc.
helm install zot project-zot/zot \
 --set persistence=true \
 --set pvc.create=true \
 --set image.tag=v2.0.1-rc2

#Wait for pod/deployment to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#Upgrade zot image to a new image to roll the k8s deployment.
helm upgrade zot project-zot/zot \
    --reuse-values \
    --set image.tag=v2.0.1

#New pod will get stuck in `CrashLoopBackOff ` state.
kubectl get pod

#Wait for pod to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#cleanup
#kind delete cluster

New Pod is in a bad state

[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get pods
kind-kind
NAME                   READY   STATUS             RESTARTS      AGE
zot-64f666cfcb-d4w7k   1/1     Running            0             5m30s
zot-7977df978b-hjqdw   0/1     CrashLoopBackOff   3 (27s ago)   4m58s
[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get rs
kind-kind
NAME             DESIRED   CURRENT   READY   AGE
zot-64f666cfcb   1         1         1       5m33s
zot-7977df978b   1         1         0       5m1s

Crash looping pod error logs:

egraf@gs-vm-1:~$ k logs zot-7977df978b-thgfr
kind-kind
{"level":"info","url":"ghcr.io/aquasecurity/trivy-db","component":"config","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:585","time":"2024-01-30T20:10:42.450916242Z","message":"using default trivy-db download URL."}
{"level":"info","url":"ghcr.io/aquasecurity/trivy-java-db","component":"config","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:592","time":"2024-01-30T20:10:42.450951586Z","message":"using default trivy-java-db download URL."}
{"level":"warn","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:318","time":"2024-01-30T20:10:42.450977361Z","message":"mgmt extensions configuration option has been made redundant and will be ignored."}
{"level":"info","params":{"distSpecVersion":"1.1.0-dev","GoVersion":"go1.20.13","Commit":"v2.0.1-9def35f3b8b38f5a438146a09087edcec00537c7","ReleaseTag":"v2.0.1","BinaryType":"-imagetrust-lint-metrics-mgmt-profile-scrub-search-sync-ui-userprefs","Storage":{"RootDirectory":"/var/lib/registry","Dedupe":true,"RemoteCache":false,"GC":true,"Commit":false,"GCDelay":3600000000000,"GCInterval":3600000000000,"Retention":{"DryRun":false,"Delay":0,"Policies":null},"StorageDriver":null,"CacheDriver":null,"SubPaths":null},"HTTP":{"Address":"0.0.0.0","ExternalURL":"","Port":"5000","AllowOrigin":"","TLS":null,"Auth":{"FailDelay":0,"HTPasswd":{"Path":""},"LDAP":null,"Bearer":null,"OpenID":null,"APIKey":false},"AccessControl":null,"Realm":"","Ratelimit":null},"Log":{"Level":"debug","Output":"","Audit":""},"Extensions":{"Search":{"Enable":true,"CVE":{"UpdateInterval":7200000000000,"Trivy":{"DBRepository":"ghcr.io/aquasecurity/trivy-db","JavaDBRepository":"ghcr.io/aquasecurity/trivy-java-db"}}},"Sync":null,"Metrics":null,"Scrub":null,"Lint":null,"UI":{"Enable":true},"Mgmt":{"Enable":true},"APIKey":null,"Trust":null},"scheduler":null},"goroutine":1,"caller":"zotregistry.io/zot/pkg/api/controller.go:221","time":"2024-01-30T20:10:42.451353936Z","message":"configuration settings"}
{"level":"info","cpus":16,"max. open files":1048576,"listen backlog":"4096","max. inotify watches":"524288","goroutine":1,"caller":"zotregistry.io/zot/pkg/api/controller.go:90","time":"2024-01-30T20:10:42.451463363Z","message":"runtime params"}
{"level":"error","error":"operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'","dbPath":"/var/lib/registry/cache.db","goroutine":1,"caller":"zotregistry.io/zot/pkg/storage/cache/boltdb.go:58","time":"2024-01-30T20:10:52.404668952Z","message":"failed to create cache db"}
{"level":"error","error":"operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:74","time":"2024-01-30T20:10:52.404773952Z","message":"failed to init controller"}
Error: operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'
Usage:
  zot serve <config> [flags]

Aliases:
  serve, serve

Flags:
  -h, --help   help for serve

Testing done on this change:

The below commands were run with code from this PR.
The deployment strategy is set to recreate in the commands which allowed the deployment to successfully scale down the old replicaset, then spins up the new replicaset.

#!/usr/bin/env bash

kind create cluster

# Install with an older zot image , pvc create , and recreate strategy set.
# this is run from the charts/zot directory
 helm install zot ./ \
 --set persistence=true \
 --set pvc.create=true \
 --set strategy.type=Recreate \
 --set strategy.rollingUpdate=null \
 --set image.tag=v2.0.1-rc2

#Wait for pod to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#Upgrade zot image to a new image to roll the k8s deployment.
helm upgrade zot ./ \
    --reuse-values \
    --set image.tag=v2.0.1

kubectl get pod

#Wait for pods to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

# cleanup
#kind delete cluster

Final state after doing the upgrade test.

[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get pods
kind-kind
NAME                   READY   STATUS    RESTARTS   AGE
zot-7977df978b-nhltl   1/1     Running   0          16s
[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get rs
kind-kind
NAME             DESIRED   CURRENT   READY   AGE
zot-64f666cfcb   0         0         0       4m36s
zot-7977df978b   1         1         1       19s

Automation added to e2e:

Two new tests were added:

  1. Upgrade tests of the helm chart this was added in the github actions workflow.
  2. Deployment strategy test using the rolling option.

Will this break upgrades or downgrades?
No.
This change sets the values in values.yaml to the deployment strategy values in the values,yaml file.

Does this PR introduce any user-facing change?:

Add ability to specify the deployment upgrade strategy. 

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ericgraf
Copy link
Contributor Author

ericgraf commented Jan 31, 2024

For this change there are still two things I'm still not sure about.

  1. Where is it best to documentation about this change and how to use it?
  2. I didn't add any e2e helm unittest. Do you have an example of how I could that for this change?

@ericgraf ericgraf marked this pull request as ready for review January 31, 2024 18:00
@ericgraf ericgraf force-pushed the add-recreate-type branch 2 times, most recently from 0320382 to 99d7c07 Compare January 31, 2024 18:33
@rchincha
Copy link
Contributor

rchincha commented Feb 1, 2024

@ericgraf Thanks for the PR.

Where is it best to documentation about this change and how to use it?

Our documentation is generated from here:
https://github.com/project-zot/project-zot.github.io
Perhaps best to add an article here.

I didn't add any e2e helm unittest. Do you have an example of how I could that for this change?

Maybe take two versions, start with the first one and upgrade to the next one.
However, note that upgrades are enforced to happen within the same major version. For example, 2.0.1 -> 3.0.5 is risky!

@ericgraf ericgraf marked this pull request as draft February 1, 2024 18:14
@ericgraf
Copy link
Contributor Author

ericgraf commented Feb 1, 2024

Thank you @rchincha . I moved this PR back to draft until I have those two things done.

@ericgraf ericgraf force-pushed the add-recreate-type branch 4 times, most recently from 0b406e6 to a372b59 Compare February 6, 2024 17:56
@ericgraf ericgraf marked this pull request as ready for review February 6, 2024 18:02
@ericgraf
Copy link
Contributor Author

ericgraf commented Feb 6, 2024

@rchincha I added two tests in this PR.
One that follows the current ct install workflow and another one that introduces a ct install --upgrade step in the github actions workflow.
The second one is a bit different, so let me know what you think about doing it this way.

@rchincha
Copy link
Contributor

rchincha commented Feb 6, 2024

@Andreea-Lupu pls review this as well.

@ericgraf thanks for updating the documentation as well.
project-zot/project-zot.github.io#158

Copy link
Collaborator

@Andreea-Lupu Andreea-Lupu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@rchincha rchincha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

A word of caution though that we are relying on k8s' rollback if something goes wrong and not a version path/plan check.

@rchincha rchincha merged commit ea975c5 into project-zot:main Feb 7, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants