feat: add ability to define deployment strategy #38

ericgraf · 2024-01-31T17:34:40Z

What type of PR is this?

Feature

Which issue does this PR fix:

What does this PR do / Why do we need it:

This PR introduces the ability to specify the upgrade strategy of the Zot deployment.

Strategy recreate needs to be set when trying to upgrade a single replica deployment with a PV defined.

In ReadWriteOnce accessMode pvcs the new pod will be stuck in creating until the pod with the pvc is manually deleted.
In ReadWriteMany accessMode pvcs the new pod will go into a crash loop with the error operation timeout: boltdb file is already in use.

If an issue # is not available please add repro steps and logs showing the issue:

Run the below commands to repoduce the failure.

#!/usr/bin/env bash

kind create cluster
helm repo add project-zot http://zotregistry.dev/helm-charts
helm repo update

# Install with an older zot image so it can be upgraded, also create a pvc.
helm install zot project-zot/zot \
 --set persistence=true \
 --set pvc.create=true \
 --set image.tag=v2.0.1-rc2

#Wait for pod/deployment to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#Upgrade zot image to a new image to roll the k8s deployment.
helm upgrade zot project-zot/zot \
    --reuse-values \
    --set image.tag=v2.0.1

#New pod will get stuck in `CrashLoopBackOff ` state.
kubectl get pod

#Wait for pod to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#cleanup
#kind delete cluster

New Pod is in a bad state

[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get pods
kind-kind
NAME                   READY   STATUS             RESTARTS      AGE
zot-64f666cfcb-d4w7k   1/1     Running            0             5m30s
zot-7977df978b-hjqdw   0/1     CrashLoopBackOff   3 (27s ago)   4m58s
[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get rs
kind-kind
NAME             DESIRED   CURRENT   READY   AGE
zot-64f666cfcb   1         1         1       5m33s
zot-7977df978b   1         1         0       5m1s

Crash looping pod error logs:

egraf@gs-vm-1:~$ k logs zot-7977df978b-thgfr
kind-kind
{"level":"info","url":"ghcr.io/aquasecurity/trivy-db","component":"config","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:585","time":"2024-01-30T20:10:42.450916242Z","message":"using default trivy-db download URL."}
{"level":"info","url":"ghcr.io/aquasecurity/trivy-java-db","component":"config","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:592","time":"2024-01-30T20:10:42.450951586Z","message":"using default trivy-java-db download URL."}
{"level":"warn","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:318","time":"2024-01-30T20:10:42.450977361Z","message":"mgmt extensions configuration option has been made redundant and will be ignored."}
{"level":"info","params":{"distSpecVersion":"1.1.0-dev","GoVersion":"go1.20.13","Commit":"v2.0.1-9def35f3b8b38f5a438146a09087edcec00537c7","ReleaseTag":"v2.0.1","BinaryType":"-imagetrust-lint-metrics-mgmt-profile-scrub-search-sync-ui-userprefs","Storage":{"RootDirectory":"/var/lib/registry","Dedupe":true,"RemoteCache":false,"GC":true,"Commit":false,"GCDelay":3600000000000,"GCInterval":3600000000000,"Retention":{"DryRun":false,"Delay":0,"Policies":null},"StorageDriver":null,"CacheDriver":null,"SubPaths":null},"HTTP":{"Address":"0.0.0.0","ExternalURL":"","Port":"5000","AllowOrigin":"","TLS":null,"Auth":{"FailDelay":0,"HTPasswd":{"Path":""},"LDAP":null,"Bearer":null,"OpenID":null,"APIKey":false},"AccessControl":null,"Realm":"","Ratelimit":null},"Log":{"Level":"debug","Output":"","Audit":""},"Extensions":{"Search":{"Enable":true,"CVE":{"UpdateInterval":7200000000000,"Trivy":{"DBRepository":"ghcr.io/aquasecurity/trivy-db","JavaDBRepository":"ghcr.io/aquasecurity/trivy-java-db"}}},"Sync":null,"Metrics":null,"Scrub":null,"Lint":null,"UI":{"Enable":true},"Mgmt":{"Enable":true},"APIKey":null,"Trust":null},"scheduler":null},"goroutine":1,"caller":"zotregistry.io/zot/pkg/api/controller.go:221","time":"2024-01-30T20:10:42.451353936Z","message":"configuration settings"}
{"level":"info","cpus":16,"max. open files":1048576,"listen backlog":"4096","max. inotify watches":"524288","goroutine":1,"caller":"zotregistry.io/zot/pkg/api/controller.go:90","time":"2024-01-30T20:10:42.451463363Z","message":"runtime params"}
{"level":"error","error":"operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'","dbPath":"/var/lib/registry/cache.db","goroutine":1,"caller":"zotregistry.io/zot/pkg/storage/cache/boltdb.go:58","time":"2024-01-30T20:10:52.404668952Z","message":"failed to create cache db"}
{"level":"error","error":"operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:74","time":"2024-01-30T20:10:52.404773952Z","message":"failed to init controller"}
Error: operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'
Usage:
  zot serve <config> [flags]

Aliases:
  serve, serve

Flags:
  -h, --help   help for serve

Testing done on this change:

The below commands were run with code from this PR.
The deployment strategy is set to recreate in the commands which allowed the deployment to successfully scale down the old replicaset, then spins up the new replicaset.

#!/usr/bin/env bash

kind create cluster

# Install with an older zot image , pvc create , and recreate strategy set.
# this is run from the charts/zot directory
 helm install zot ./ \
 --set persistence=true \
 --set pvc.create=true \
 --set strategy.type=Recreate \
 --set strategy.rollingUpdate=null \
 --set image.tag=v2.0.1-rc2

#Wait for pod to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#Upgrade zot image to a new image to roll the k8s deployment.
helm upgrade zot ./ \
    --reuse-values \
    --set image.tag=v2.0.1

kubectl get pod

#Wait for pods to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

# cleanup
#kind delete cluster

Final state after doing the upgrade test.

[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get pods
kind-kind
NAME                   READY   STATUS    RESTARTS   AGE
zot-7977df978b-nhltl   1/1     Running   0          16s
[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get rs
kind-kind
NAME             DESIRED   CURRENT   READY   AGE
zot-64f666cfcb   0         0         0       4m36s
zot-7977df978b   1         1         1       19s

Automation added to e2e:

Two new tests were added:

Upgrade tests of the helm chart this was added in the github actions workflow.
Deployment strategy test using the rolling option.

Will this break upgrades or downgrades?
No.
This change sets the values in values.yaml to the deployment strategy values in the values,yaml file.

Does this PR introduce any user-facing change?:

Add ability to specify the deployment upgrade strategy.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ericgraf · 2024-01-31T17:56:06Z

For this change there are still two things I'm still not sure about.

Where is it best to documentation about this change and how to use it?
I didn't add any e2e helm unittest. Do you have an example of how I could that for this change?

rchincha · 2024-02-01T04:42:52Z

@ericgraf Thanks for the PR.

Where is it best to documentation about this change and how to use it?

Our documentation is generated from here:
https://github.com/project-zot/project-zot.github.io
Perhaps best to add an article here.

I didn't add any e2e helm unittest. Do you have an example of how I could that for this change?

Maybe take two versions, start with the first one and upgrade to the next one.
However, note that upgrades are enforced to happen within the same major version. For example, 2.0.1 -> 3.0.5 is risky!

ericgraf · 2024-02-01T18:15:22Z

Thank you @rchincha . I moved this PR back to draft until I have those two things done.

Signed-off-by: ericgraf <[email protected]>

ericgraf · 2024-02-06T18:07:04Z

@rchincha I added two tests in this PR.
One that follows the current ct install workflow and another one that introduces a ct install --upgrade step in the github actions workflow.
The second one is a bit different, so let me know what you think about doing it this way.

rchincha · 2024-02-06T22:26:40Z

@Andreea-Lupu pls review this as well.

@ericgraf thanks for updating the documentation as well.
project-zot/project-zot.github.io#158

Andreea-Lupu

lgtm

rchincha

lgtm

A word of caution though that we are relying on k8s' rollback if something goes wrong and not a version path/plan check.

ericgraf force-pushed the add-recreate-type branch from 4cc0f93 to d0169e3 Compare January 31, 2024 17:57

ericgraf marked this pull request as ready for review January 31, 2024 18:00

ericgraf requested review from rchincha, andaaron, eusebiu-constantin-petu-dbk and Andreea-Lupu as code owners January 31, 2024 18:00

ericgraf force-pushed the add-recreate-type branch 2 times, most recently from 0320382 to 99d7c07 Compare January 31, 2024 18:33

ericgraf marked this pull request as draft February 1, 2024 18:14

ericgraf force-pushed the add-recreate-type branch 4 times, most recently from 0b406e6 to a372b59 Compare February 6, 2024 17:56

feat: add ability to define deployment strategy; add ci upgrade tests

a41ba63

Signed-off-by: ericgraf <[email protected]>

ericgraf force-pushed the add-recreate-type branch from a372b59 to a41ba63 Compare February 6, 2024 17:59

ericgraf marked this pull request as ready for review February 6, 2024 18:02

This was referenced Feb 6, 2024

docs: Add Helm deployment strategy type info into documentation project-zot/project-zot.github.io#157

Closed

docs: Add info about helm value deployment strategy type. project-zot/project-zot.github.io#158

Merged

Andreea-Lupu approved these changes Feb 7, 2024

View reviewed changes

rchincha approved these changes Feb 7, 2024

View reviewed changes

rchincha merged commit ea975c5 into project-zot:main Feb 7, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add ability to define deployment strategy #38

feat: add ability to define deployment strategy #38

ericgraf commented Jan 31, 2024 •

edited

Loading

ericgraf commented Jan 31, 2024 •

edited

Loading

rchincha commented Feb 1, 2024

ericgraf commented Feb 1, 2024

ericgraf commented Feb 6, 2024

rchincha commented Feb 6, 2024

Andreea-Lupu left a comment

rchincha left a comment

feat: add ability to define deployment strategy #38

feat: add ability to define deployment strategy #38

Conversation

ericgraf commented Jan 31, 2024 • edited Loading

ericgraf commented Jan 31, 2024 • edited Loading

rchincha commented Feb 1, 2024

ericgraf commented Feb 1, 2024

ericgraf commented Feb 6, 2024

rchincha commented Feb 6, 2024

Andreea-Lupu left a comment

Choose a reason for hiding this comment

rchincha left a comment

Choose a reason for hiding this comment

ericgraf commented Jan 31, 2024 •

edited

Loading

ericgraf commented Jan 31, 2024 •

edited

Loading