[main] Add new method to track originating cluster of Backup #613

Open · wants to merge 11 commits into base: main
6 changes: 6 additions & 0 deletions charts/rancher-backup-crd/templates/backup.yaml
@@ -29,6 +29,9 @@ spec:
- jsonPath: .status.conditions[?(@.type=="Ready")].message
name: Status
type: string
- jsonPath: .status.conditions[?(@.type=="InPlaceRestore")].status
name: In-Place
type: string
name: v1
schema:
openAPIV3Schema:
@@ -127,6 +130,9 @@ spec:
type: string
observedGeneration:
type: integer
originCluster:
nullable: true
type: string
storageLocation:
nullable: true
type: string
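With these printer columns in place, the new condition shows up directly in kubectl get output. An illustrative sketch (resource names and values assumed; columns abridged to the ones defined in this PR and in pkg/crds/crd.go below):

$ kubectl get backups.resources.cattle.io
NAME      LATEST-BACKUP        RESOURCESET            AGE   STATUS      IN-PLACE
nightly   nightly-2025-01-06   rancher-resource-set   2d    Completed   True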
17 changes: 10 additions & 7 deletions pkg/apis/resources.cattle.io/v1/types.go
@@ -6,13 +6,15 @@ import (
)

var (
BackupConditionReady = "Ready"
BackupConditionUploaded = "Uploaded"
BackupConditionReconciling = "Reconciling"
BackupConditionStalled = "Stalled"
RestoreConditionReconciling = "Reconciling"
RestoreConditionStalled = "Stalled"
RestoreConditionReady = "Ready"
BackupConditionReady = "Ready"
BackupConditionUploaded = "Uploaded"
BackupConditionReconciling = "Reconciling"
BackupConditionClusterOrigin = "HasClusterOrigin"
BackupConditionInPlaceRestore = "InPlaceRestore"
BackupConditionStalled = "Stalled"
RestoreConditionReconciling = "Reconciling"
RestoreConditionStalled = "Stalled"
RestoreConditionReady = "Ready"
)

// +genclient
@@ -37,6 +39,7 @@ type BackupSpec struct {

type BackupStatus struct {
Conditions []genericcondition.GenericCondition `json:"conditions"`
OriginCluster string `json:"originCluster,omitempty"`
Review comment (Contributor):

I'm not convinced the origin cluster metadata should be stored in the Status field.

I can see from the original CRD definition that the semantic difference between spec (the desired state of the object) and the status (what is the current state of the object) is a little bit muddied in this operator.

IMO, the origin cluster (and additional metadata) could be set in the Spec and be fully managed by the operator.

In any case, the backup CRD's status should definitely be what surfaces whether or not a backup is viable 👍
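For illustration, a minimal sketch of the Spec-side variant this comment describes; the field name and placement are assumptions that mirror the Status field added in this PR:

type BackupSpec struct {
	// ...existing fields...

	// OriginCluster would be written once by the operator on first reconcile
	// and treated as immutable afterwards; omitempty leaves it unset for
	// backups taken by older operator versions.
	OriginCluster string `json:"originCluster,omitempty"`
}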

LastSnapshotTS string `json:"lastSnapshotTs"`
NextSnapshotAt string `json:"nextSnapshotAt"`
ObservedGeneration int64 `json:"observedGeneration"`
109 changes: 109 additions & 0 deletions pkg/controllers/backup/cluster_origin.go
@@ -0,0 +1,109 @@
package backup

import (
v1 "github.com/rancher/backup-restore-operator/pkg/apis/resources.cattle.io/v1"
"github.com/rancher/wrangler/v3/pkg/condition"
)

type backupClusterOriginConditionMeta struct {
backupName string
hasClusterOriginID bool
clusterOriginID string
hasCurrentOriginCondition bool
currentOriginCondition bool
canInPlaceRestore bool
hasInPlaceRestoreCondition bool
currentInPlaceRestoreCondition bool
Review comment on lines +9 to +16 (@alexandreLamarre, Contributor, Jan 6, 2025):

Instead of having boolean fields injected during construction, could encapsulate

type backupClusterMdChecker struct{
   backupName string
   clusterOrigin string
   b *backupv1.Backup
   // Any controllers or client runtime needed for methods, if any
}

func (b backupClusterMdChecker) hasClusterOriginID() bool {
  // ...
}

func (b backupClusterMdChecker) hasCurrentOriginCondition() bool {
  // ...
}

This lets us isolate the more complicated logic, and makes it testable if we make these methods or the struct public - which we probably should, since we should be focused on stability.

If making them public, we could consider moving them to a sub-package (pkg/controllers/backup/origin or similar) to keep heavy k8s dependencies out of unit-test imports.

It also greatly simplifies the if statements below, IMO.
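A possible completion of those method bodies, as a sketch; the logic is an assumption derived from the constructor below:

func (b backupClusterMdChecker) hasClusterOriginID() bool {
	return b.b.Status.OriginCluster != ""
}

// canInPlaceRestore reports whether the backup originated from the cluster
// this controller is currently running in.
func (b backupClusterMdChecker) canInPlaceRestore(controllerClusterID string) bool {
	return b.hasClusterOriginID() && b.b.Status.OriginCluster == controllerClusterID
}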

}

func newBackupClusterOriginConditionMeta(controllerClusterID string, backup *v1.Backup) backupClusterOriginConditionMeta {
conditionMeta := backupClusterOriginConditionMeta{
backupName: backup.Name,
hasClusterOriginID: false,
hasCurrentOriginCondition: false,
currentOriginCondition: false,
canInPlaceRestore: false,
hasInPlaceRestoreCondition: false,
currentInPlaceRestoreCondition: false,
}

originalValue := backup.Status.OriginCluster
conditionMeta.hasClusterOriginID = originalValue != ""
if conditionMeta.hasClusterOriginID {
conditionMeta.clusterOriginID = originalValue
}

currentOriginConditionString := condition.Cond(v1.BackupConditionClusterOrigin).GetStatus(backup)
conditionMeta.hasCurrentOriginCondition = currentOriginConditionString != ""
if conditionMeta.hasCurrentOriginCondition {
conditionMeta.currentOriginCondition = currentOriginConditionString == "True"
}

if conditionMeta.hasClusterOriginID {
conditionMeta.canInPlaceRestore = conditionMeta.clusterOriginID == controllerClusterID
}

currentInPlaceRestoreString := condition.Cond(v1.BackupConditionInPlaceRestore).GetStatus(backup)
conditionMeta.hasInPlaceRestoreCondition = currentInPlaceRestoreString != ""
if conditionMeta.hasInPlaceRestoreCondition {
conditionMeta.currentInPlaceRestoreCondition = currentInPlaceRestoreString == "True"
}

return conditionMeta
}

// prepareClusterOriginConditions helps set the cluster origin conditions and reports if anything changed in this part of status.
func (h *handler) prepareClusterOriginConditions(backup *v1.Backup) bool {
Review comment (Contributor):

I think a lot of this logic in this function can also be simplified/more readable if we encapsulate most of it in struct methods

conditionChanged := false
if !h.canUseClusterOriginStatus {
currentOriginConditionString := condition.Cond(v1.BackupConditionClusterOrigin).GetStatus(backup)
if currentOriginConditionString != "False" {
condition.Cond(v1.BackupConditionClusterOrigin).SetStatusBool(backup, false)
condition.Cond(v1.BackupConditionClusterOrigin).Message(backup, "CRD not updated to include cluster UID yet.")
Review comment (Contributor):

When would this realistically happen?

Reply (Member, Author):

Yeah, at least I think so. Granted, it was during dev workflows, but I did run into it myself.

Maybe it's not as big a concern in prod, but I think it could still happen with a chart like BRO if users don't also update the CRDs, i.e. when installing via the helm CLI directly.
Or even in the edge case where a user manually targets a newer image than the chart. However, if we opt not to change the CRD to add this to status, this wouldn't be needed anyway.

conditionChanged = true
}
currentInPlaceRestoreString := condition.Cond(v1.BackupConditionInPlaceRestore).GetStatus(backup)
if currentInPlaceRestoreString != "False" {
condition.Cond(v1.BackupConditionInPlaceRestore).SetStatusBool(backup, false)
condition.Cond(v1.BackupConditionInPlaceRestore).Message(backup, "Cannot determine if in-place Restore is viable.")
conditionChanged = true
}

return conditionChanged
}

// TODO: We could add a fallback mode that uses filenames (and/or the annotation) when the CRD is not updated
conditionMeta := newBackupClusterOriginConditionMeta(h.kubeSystemNS, backup)

// First pass: we only set BackupConditionClusterOrigin based on whether the origin cluster ID is present
if !conditionMeta.hasCurrentOriginCondition || conditionMeta.currentOriginCondition != conditionMeta.hasClusterOriginID {
conditionChanged = true
condition.Cond(v1.BackupConditionClusterOrigin).SetStatusBool(backup, conditionMeta.hasClusterOriginID)

if conditionMeta.hasClusterOriginID {
condition.Cond(v1.BackupConditionClusterOrigin).Message(backup, "Backup has cluster UID attached.")
} else {
condition.Cond(v1.BackupConditionClusterOrigin).Message(backup, "No cluster UID attached to backup.")
}
}

// Second pass: we care about the specifics of the ClusterOrigin to set the InPlaceRestore condition
if !conditionMeta.hasClusterOriginID {
// When the origin cluster ID is missing, mark in-place restore viability as undeterminable
condition.Cond(v1.BackupConditionInPlaceRestore).SetStatusBool(backup, false)
condition.Cond(v1.BackupConditionInPlaceRestore).Message(backup, "Unable to determine if in-place Restore is viable.")
}

if !conditionMeta.hasInPlaceRestoreCondition || conditionMeta.canInPlaceRestore != conditionMeta.currentInPlaceRestoreCondition {
conditionChanged = true
condition.Cond(v1.BackupConditionInPlaceRestore).SetStatusBool(backup, conditionMeta.canInPlaceRestore)
if conditionMeta.canInPlaceRestore {
condition.Cond(v1.BackupConditionInPlaceRestore).Message(backup, "In-place Restore appears viable.")
} else {
condition.Cond(v1.BackupConditionInPlaceRestore).Message(backup, "In-place Restore does not appear viable.")
}
}

// Report whether any origin-related condition changed
return conditionChanged
}
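Given the testability notes in the threads above, a minimal package-local test sketch for the constructor (expected values assumed; needs only the standard testing import):

func TestNewBackupClusterOriginConditionMeta(t *testing.T) {
	backup := &v1.Backup{}
	backup.Name = "test-backup"
	backup.Status.OriginCluster = "uid-123"

	meta := newBackupClusterOriginConditionMeta("uid-123", backup)

	if !meta.hasClusterOriginID {
		t.Fatal("expected the origin cluster ID to be detected")
	}
	if !meta.canInPlaceRestore {
		t.Fatal("expected in-place restore to be viable when cluster IDs match")
	}
}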
50 changes: 34 additions & 16 deletions pkg/controllers/backup/controller.go
@@ -39,7 +39,9 @@ type handler struct {
dynamicClient dynamic.Interface
defaultBackupMountPath string
defaultS3BackupLocation *v1.S3ObjectStore
kubeSystemNS string
// TODO: rename to kubeSystemNamespaceUID; nit to improve clarity - this holds the namespace UID, not the namespace resource itself
kubeSystemNS string
canUseClusterOriginStatus bool
Review comment (Contributor):

Stylistically, we should avoid storing additional boolean fields in favor of a method.

I think this field might be useless?

If we opt into using the backup CRD spec for storing cluster metadata, and older backup-restore versions don't set this new nullable field, then we cannot determine where the backup came from.

}

const DefaultRetentionCount = 10
@@ -56,15 +58,16 @@ func Register(
defaultS3 *v1.S3ObjectStore) {

controller := &handler{
ctx: ctx,
backups: backups,
resourceSets: resourceSets,
secrets: secrets,
namespaces: namespaces,
discoveryClient: clientSet.Discovery(),
dynamicClient: dynamicInterface,
defaultBackupMountPath: defaultLocalBackupLocation,
defaultS3BackupLocation: defaultS3,
ctx: ctx,
backups: backups,
resourceSets: resourceSets,
secrets: secrets,
namespaces: namespaces,
discoveryClient: clientSet.Discovery(),
dynamicClient: dynamicInterface,
defaultBackupMountPath: defaultLocalBackupLocation,
defaultS3BackupLocation: defaultS3,
canUseClusterOriginStatus: util.VerifyBackupCrdHasClusterStatus(clientSet.ApiextensionsV1()),
Review comment (Contributor):

I would avoid passing the handler a value produced by a function that makes API calls. Instead, I would run that logic in the setup itself, before the call to Register; see the sketch below.

See comment below as well.
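A sketch of that wiring; it assumes Register gains a trailing bool parameter, and the call-site variable names are taken from this diff:

// In the operator setup, before registering the backup controller:
canUseClusterOriginStatus := util.VerifyBackupCrdHasClusterStatus(clientSet.ApiextensionsV1())

backup.Register(ctx, backups, resourceSets, secrets, namespaces,
	clientSet, dynamicInterface, defaultLocalBackupLocation, defaultS3,
	canUseClusterOriginStatus) // plain value; Register itself makes no extra API call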

}
if controller.defaultBackupMountPath != "" {
logrus.Infof("Default location for storing backups is %v", controller.defaultBackupMountPath)
@@ -74,12 +77,13 @@
}

// Use the kube-system NS.UID as the unique ID for a cluster
kubeSystemNS, err := controller.namespaces.Get("kube-system", k8sv1.GetOptions{})
kubeSystemNamespaceUID, err := util.FetchClusterUID(namespaces)
if err != nil {
// fatal log here, because we need the kube-system ns UID while creating any backup file
logrus.Fatalf("Error getting namespace kube-system %v", err)
}
controller.kubeSystemNS = string(kubeSystemNS.UID)
// TODO: rename to kubeSystemNamespaceUID
controller.kubeSystemNS = kubeSystemNamespaceUID
// Register handlers
backups.OnChange(ctx, "backups", controller.OnBackupChange)
}
@@ -94,27 +98,36 @@ func (h *handler) OnBackupChange(_ string, backup *v1.Backup) (*v1.Backup, error
return h.setReconcilingCondition(backup, err)
}

// Handle updates made on Backup CRs with existing backup files
if backup.Status.LastSnapshotTS != "" {
if backup.Spec.Schedule == "" {
// Backup CR was meant for one-time backup, and the backup has been completed. Probably here from UpdateStatus call
logrus.Infof("Backup CR %v has been processed for one-time backup, returning", backup.Name)
// This could also mean backup CR was updated from recurring to one-time, in which case observedGeneration needs to be updated
updBackupStatus := false
shouldUpdateStatus := false
if backup.Generation != backup.Status.ObservedGeneration {
backup.Status.ObservedGeneration = backup.Generation
updBackupStatus = true
shouldUpdateStatus = true
}
// check if the backup-type needs to be changed too
if backup.Status.BackupType != "One-time" {
backup.Status.BackupType = "One-time"
updBackupStatus = true
shouldUpdateStatus = true
}
if updBackupStatus {
// check if the origin cluster status needs updating
clusterOriginChanged := h.prepareClusterOriginConditions(backup)
if clusterOriginChanged {
shouldUpdateStatus = true
}
Review comment on lines +118 to +121 (@alexandreLamarre, Contributor, Jan 6, 2025):

nit: can simplify using

if originChanged := h.prepareClusterOriginConditions(backup); originChanged{
   shouldUpdateStatus = true
}

if shouldUpdateStatus {
return h.backups.UpdateStatus(backup)
}
return backup, nil
}
if backup.Status.NextSnapshotAt != "" {
// TODO: Verify how recurring backups work after a migration today
// Then decide how/where to call prepareClusterOriginConditions for that.

currTime := time.Now().Format(time.RFC3339)
logrus.Infof("Next snapshot is scheduled for: %v, current time: %v", backup.Status.NextSnapshotAt, currTime)

@@ -173,20 +186,25 @@ func (h *handler) OnBackupChange(_ string, backup *v1.Backup) (*v1.Backup, error
return h.setReconcilingCondition(backup, err)
}
}

storageLocationType := backup.Status.StorageLocation
updateErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
var err error
backup, err = h.backups.Get(backup.Name, k8sv1.GetOptions{})
if err != nil {
return err
}
// Set the Cluster origin reference on backup
backup.Status.OriginCluster = h.kubeSystemNS
// reset conditions to remove the reconciling condition, because as per kstatus lib its presence is considered an error
backup.Status.Conditions = []genericcondition.GenericCondition{}

condition.Cond(v1.BackupConditionReady).SetStatusBool(backup, true)
condition.Cond(v1.BackupConditionReady).Message(backup, "Completed")
condition.Cond(v1.BackupConditionUploaded).SetStatusBool(backup, true)

h.prepareClusterOriginConditions(backup)

backup.Status.LastSnapshotTS = time.Now().Format(time.RFC3339)
if cronSchedule != nil {
nextBackupAt := cronSchedule.Next(time.Now())
3 changes: 2 additions & 1 deletion pkg/crds/crd.go
@@ -59,7 +59,8 @@ func List() []crd.CRD {
WithColumn("Latest-Backup", ".status.filename").
WithColumn("ResourceSet", ".spec.resourceSetName").
WithCustomColumn(apiext.CustomResourceColumnDefinition{Name: "Age", Type: "date", JSONPath: ".metadata.creationTimestamp"}).
WithColumn("Status", ".status.conditions[?(@.type==\"Ready\")].message")
WithColumn("Status", ".status.conditions[?(@.type==\"Ready\")].message").
WithColumn("In-Place", ".status.conditions[?(@.type==\"InPlaceRestore\")].status")
}),
newCRD(&resources.Restore{}, func(c crd.CRD) crd.CRD {
return c.
48 changes: 48 additions & 0 deletions pkg/util/util.go
@@ -6,9 +6,13 @@ import (
"os"
"reflect"

apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
apiextensionsClientSetv1 "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset/typed/apiextensions/v1"

v1core "github.com/rancher/wrangler/v3/pkg/generated/controllers/core/v1"
"github.com/sirupsen/logrus"
k8sv1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/apiserver/pkg/server/options/encryptionconfig"
"k8s.io/apiserver/pkg/storage/value"
@@ -74,3 +78,47 @@ func ErrList(e []error) error {
}
return nil
}

func FetchClusterUID(namespaces v1core.NamespaceController) (string, error) {
kubesystemNamespace, err := namespaces.Get("kube-system", k8sv1.GetOptions{})
if err != nil {
return "", err
}

return string(kubesystemNamespace.UID), nil
}

// crdGVR is the GroupVersionResource for CRDs; the helpers below use the typed clientset and do not reference it
var crdGVR = schema.GroupVersionResource{
Group: "apiextensions.k8s.io",
Version: "v1",
Resource: "customresourcedefinitions",
}

func getCRDDefinition(client apiextensionsClientSetv1.ApiextensionsV1Interface, crdName string) (*apiextensionsv1.CustomResourceDefinition, error) {
crd, err := client.CustomResourceDefinitions().Get(context.TODO(), crdName, k8sv1.GetOptions{})
if err != nil {
return nil, err
}
return crd, nil
}

func VerifyBackupCrdHasClusterStatus(client apiextensionsClientSetv1.ApiextensionsV1Interface) bool {
crdName := "backups.resources.cattle.io"

crd, err := getCRDDefinition(client, crdName)
if err != nil {
logrus.Warnf("Error fetching CRD: %v", err)
return false
}
Review comment on lines +110 to +113 (Contributor):

In a more production-ready version of this, I'd expect errors like this to be surfaced in the setup step, and potentially to exit, since such an error signifies either a more serious problem or a transient one (in the transient case, a backoff mechanism could be considered).
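A sketch of that setup-time check with a simple backoff; the wiring is an assumption, using wait from k8s.io/apimachinery/pkg/util/wait and the typed apiextensions client from this diff (nil checks on the schema chain omitted for brevity):

var canUseClusterOriginStatus bool
pollErr := wait.PollImmediate(2*time.Second, 30*time.Second, func() (bool, error) {
	crd, err := client.CustomResourceDefinitions().Get(context.TODO(), "backups.resources.cattle.io", k8sv1.GetOptions{})
	if err != nil {
		logrus.Warnf("Retrying backup CRD fetch: %v", err)
		return false, nil // treat the error as transient and retry
	}
	_, canUseClusterOriginStatus = crd.Spec.Versions[0].Schema.OpenAPIV3Schema.Properties["status"].Properties["originCluster"]
	return true, nil
})
if pollErr != nil {
	// exit early: an error that survives the backoff window signifies a more serious problem
	logrus.Fatalf("Could not verify backup CRD schema: %v", pollErr)
}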


// Inspect the status schema for the originCluster property
_, found := crd.Spec.Versions[0].Schema.OpenAPIV3Schema.Properties["status"].Properties["originCluster"]
if found {
logrus.Debugf("Status schema contains `originCluster` on CRD `%s`.", crdName)
return true
}
Review comment on lines +109 to +120 (Contributor):

Going back to the discussion above: storing the cluster metadata in the backup CRD spec as a nullable object shouldn't be a breaking change, and we can instead check whether it is nil rather than doing dynamic CRD reflection.

IMO it is easier to make this testable if we avoid inspecting CRD definitions and instead focus on the presence or absence of the new field.
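A sketch of that nil check; the spec-side type and field are hypothetical, following the earlier thread about moving the metadata into the Spec:

// Hypothetical spec-side shape; not part of this PR.
type ClusterOriginMetadata struct {
	ClusterUID string `json:"clusterUID"`
}

// With BackupSpec.OriginCluster declared as *ClusterOriginMetadata, the
// CRD-reflection check collapses to a nil check: older operator versions
// never set the field, so nil simply means the origin is unknown.
func backupHasOriginMetadata(backup *v1.Backup) bool {
	return backup.Spec.OriginCluster != nil
}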


logrus.Debugf("`originCluster` not found on status schema for CRD `%s`.", crdName)
return false
}