-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document ReconcileResult type (fixes #1419) #1463
base: main
Are you sure you want to change the base?
Changes from 3 commits
3ac27d0
cfd7174
5838166
5bbd6af
36a47f9
ef4a864
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,7 +18,8 @@ type Reconcileable[T any] interface { | |
*T | ||
} | ||
|
||
// Try with U, a type of any whose POINTER still fulfils Reoncilable... | ||
// ReconcileObject ensures that desiredObject exists in the given state, either by creating it, or updating it if it | ||
// already exists. | ||
func ReconcileObject[U any, T Reconcileable[U]](ctx context.Context, kClient client.Client, requeueDelay time.Duration, desiredObject U) result.ReconcileResult { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [thought] This is off-topic, but I don't understand why this function schedules a requeue after a successful creation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because that's the pattern followed throughout the codebase. Object creation is asynchronous, acceptance by the API server does not necessarily mean that something won't go wrong downstream of that. As a result, we check again for the success of the creation operation before continuing, which indeed does mean a requeue. There is also stuff to do with optimistic locking and the way the caches work on the controller which make this necessary - try removing it if you're curious as to what breaks. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Is that documented anywhere? The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's just how asynchronous HTTP APIs work, they return a status code but sometimes object creation might fail out elsewhere. But I think the more critical thing here is that (from memory) due to some of the caching behaviour of these clients, the object will not appear if you try to go do a Get in some circumstances. I encourage you to try to remove the requeues and see if things break - I believe they do. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know, this example in the controller-runtime codebase doesn't requeue. They reference There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. By "extra" requeues I meant the one after a successful Create, and the one after a successful Update. I pushed the change so you can check the diff. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This depends on what is read after the object was written. If it's something that the Kubernetes API returns, then it's consistent right away. For example, Olivier's example of using creationTimestamp, probably resourceVersion and uid etc are directly accessible in the object you just posted. However, for many use cases, what happens is that there's other controllers doing some extra processing and then modifying the object on the Kubernetes side (as a lot of Kubernetes processing itself is async with controllers). Can't access or Update .Status for example for some objects, like Pods. Or in our runtimes, if k8ssandra-operator creates CassandraDatacenter, it probably can't Update the Status field of CassandraDatacenter even instantly after, because cass-operator has modified it and the resourceVersions no longer match. The other thing that should be understood is that Read after Write is cached, but of course Write itself updates the object that was sent. So doing: obj = Create(obj) Is valid. However, doing: obj = Create(obj) Is not valid. The Get() call is coming from cache and would probably not return an object at all. Using Reader interface from manager will allow doing this as it isn't cached, but we don't really use that interface a lot (I think only ClientConfigs does that). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What we're saying here is that a requeue may or may not be needed, depending on the case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This thread has turned into a generic conversation about requeues, but we are in fact looking at a very specific case. I've only made two changes related to requeues in this PR:
(it's all in Don't requeue successful writes in ReconcileObject if anyone wants to look at the diff) If we accept the assertion that writes are synchronous then there's absolutely no need to requeue. ReconcileObject wanted to apply changes, those changes are applied, we're done. Other controllers and cached reads are valid concerns that we encounter in other places, but not here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Writes being synchronous, there doesn't seem to be a valid reason to force a requeue for each object creation. |
||
objectKey := types.NamespacedName{ | ||
Name: T(&desiredObject).GetName(), | ||
|
@@ -53,5 +54,5 @@ func ReconcileObject[U any, T Reconcileable[U]](ctx context.Context, kClient cli | |
} | ||
return result.RequeueSoon(requeueDelay) | ||
} | ||
return result.Done() | ||
return result.Continue() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know this might be controversial, but given my current understanding of the API that's the right thing to do. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Issue: I don't think it has to be controversial, but I'd actually like to see all of this in a design doc so that I can understand what you're trying to achieve with these changes. To me, they have a very wide blast radius, and there are some questions and themes in here which suggest that we probably need to align on certain topics (e.g. how caches work, how optimistic locking works, how the API server is asynchronous) before moving forward. One of the main changes I think we need here is more descriptive naming for these implementations; and therefore alignment on what each of them is intended to signify. Once we have that it should be easy to figure out what the answer should be to this return type. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As said in the previous comment, if we all agree that writes are synchronous, there's no need to force a requeue at this level. Creating a configmap for example won't require a requeue. |
||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,20 +1,54 @@ | ||
package result | ||
|
||
import ( | ||
"sigs.k8s.io/controller-runtime/pkg/reconcile" | ||
"time" | ||
|
||
ctrl "sigs.k8s.io/controller-runtime" | ||
) | ||
|
||
// Copyright DataStax, Inc. | ||
// Please see the included license file for details. | ||
|
||
// This is just so that we can reference TerminalError in the Godoc of [Error] | ||
var _ error = reconcile.TerminalError(nil) | ||
|
||
// ReconcileResult represents the result of a step in the reconciliation process. | ||
// | ||
// We typically split the top-level Reconcile() method of a controller into multiple sub-functions. Each of these | ||
// functions uses ReconcileResult to communicate to its caller how the current iteration should proceed. There are 4 | ||
// possible implementations: [Continue], [Done], [RequeueSoon], and [Error]. | ||
// | ||
// The idiomatic way to handle a ReconcileResult in an intermediary sub-function is: | ||
// | ||
// if recResult := callStep1(); recResult.Completed() { | ||
// // Global success, requeue or error: stop what we're doing and propagate up | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Issue: based on the logic you've described here it sounded like a requeue (which is not a success) will also be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, a requeue is func (c callBackSoon) Completed() bool {
return true
} But I don't see any issue here: if a step wants a requeue, we stop what we're doing and propagate up. The top-level Reconcile() will eventually return There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let me put my concern in another way. Right now, we have a struct implementation of this interface instantiated with To me, having a getter function If you want to continue having something which means "this reconciliation run has progressed as far as possible, and needs to jump back to the top level of the reconciliation to be handled appropriately", we need a different word to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is kind of why I think this needs a design doc too - because at the moment it isn't clear to me what behaviour or states we are trying to model, and we need to discuss and agree a target state rather than just say that we're "documenting" existing behaviour but then making quite impactful changes to the core of the reconciliation logic. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, the goal of this PR is to have that discussion. I proposed an initial version, let's iterate on it with reviews until we reach a common understanding.
Are the go-docs not self-sufficient? Not that I mind having a separate document, but I wouldn't know what else to put in there.
I've deliberately avoided going down the naming rabbithole so far... May I suggest that we keep that for a second step? I feel like it will get too messy if we rename as we go. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This isn't about documentation, it is about the development process. We do deep discussions in design docs, not on PRs. If we choose to do it differently here, I'm concerned that we will end up in the same situation that we ended up in when we started making changes to the name overrides in labels and downstream resources, where the blast radius of the change was not adequately considered and customers suffered issues as a result. I strongly advocate for the creation of a design doc for this piece of work to avoid customer impacts. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What is the English language definition difference of while and while do then? Yet, semantics for those two are different and in every programming language. One has while executed after first execution, while one has while executed before execution. This isn't English. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with Miles, and let's not forget many of the people participating in these projects are not necessarily english native speakers. This API brings a level of confusion that is high enough to require having that conversation. There's overlap between There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Ugly Java POJO, but this isn't a Java project. We don't need to indicate it's a boolean when a function has return value of boolean. Same as with "Get" when fetching an object.
Why can't the users view what's defined in the godoc? It would be publicly available and generated already, like currently: https://pkg.go.dev/github.com/k8ssandra/[email protected]/pkg/reconciliation What Olivier wrote here would appear there. We can add examples, descriptions etc. Adding markdown doc somewhere assumes users know to search for it (and hope it's not outdated) instead of the "standard" place. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In an effort to move forward again, let's just use these in code comments as docs on how to use this interface. |
||
// return recResult | ||
// } | ||
// // Otherwise, proceed with the next step(s) | ||
// if recResult := callStep2(); recResult.Completed() { | ||
// // etc... | ||
// | ||
// The idiomatic way to handle a ReconcileResult in the top-level Reconcile() method is: | ||
// | ||
// recResult := callSomeSubmethod() | ||
// // Possibly inspect the result (e.g. to set an error field in the status) | ||
// return recResult.Output() | ||
type ReconcileResult interface { | ||
// Completed indicates that this result is *NOT* a [Continue]. | ||
olim7t marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// | ||
// In other words, it means that the current iteration of the reconciliation loop is complete, and the top-level | ||
// Reconcile() method should return to the controller runtime (either with a success or terminal error that will | ||
// stop the entire reconciliation loop, or a requeue or regular error that will trigger a retry). | ||
Completed() bool | ||
// Output converts this result into a format that the main Reconcile() method can return to the controller runtime. | ||
// | ||
// Calling this method on a [Continue] will panic. | ||
Output() (ctrl.Result, error) | ||
// IsError indicates whether this result is an [Error]. | ||
IsError() bool | ||
// IsRequeue indicates whether this result is a [RequeueSoon]. | ||
IsRequeue() bool | ||
// IsDone indicates whether this result is a [Done]. | ||
IsDone() bool | ||
// GetError returns the wrapped error if the result is an [Error], otherwise it returns nil. | ||
GetError() error | ||
} | ||
|
||
|
@@ -121,18 +155,31 @@ func (r errorOut) GetError() error { | |
return r.err | ||
} | ||
|
||
// Continue indicates that the current step in the reconciliation is done. The caller should proceed with the next step. | ||
func Continue() ReconcileResult { | ||
return continueReconcile{} | ||
} | ||
|
||
// Done indicates that the entire reconciliation loop was successful. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Issue: from a design perspective, this seems tricky, since only the last step should ever call And what is the effect of calling I think if we're getting into this topic, we should make We need to be explicit about WHAT is done - that's the issue we're currently struggling with. Naming these implementations correctly might be a good starting point. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This is exactly what it is, the reconciling process is done. It's not even really a necessary process didn't end in a pkg, but in the Reconcile() function. Just return reconcile.Result{}, nil.
Are we having issues? This API didn't even begin here and the original usage had no issues. I'm not convinced we need this ReconcileResult API at all, but then that's not what we're discussing here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Maybe this should be something we're discussing. We are definitely having issues because there is no consistency in the usage of this API in the codebase and there are regular disagreements on what Done, Continue, and Completed mean, as there have been for several years. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
No, there hasn't been (and even in this discussion, it doesn't seem like there's that much issue for rest of the team). This API was copied over from cass-operator and it had a clear definition there how things worked. This isn't new API that needs design documents (which are utterly pointless at this point of using this API), just couple of fixes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
At the very least there's been some ambiguity, as evidenced by the changes we are doing in this PR to align different parts of the code.
This might be your longer experience of cass-operator speaking, but personally I wouldn't call this API clear, it took way more effort to understand than it should. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's focus on whether or not we agree with the comments that @olim7t added on each function. Let's not change names for now and just build a common understanding. |
||
// The caller should skip the next steps (if any), and propagate the result up the stack. This will eventually reach the | ||
// top-level Reconcile() method, which should stop the reconciliation. | ||
func Done() ReconcileResult { | ||
return done{} | ||
} | ||
|
||
// RequeueSoon indicates that the current step in the reconciliation requires a requeue after a certain amount of time. | ||
// The caller should skip the next steps (if any), and propagate the result up the stack. This will eventually reach the | ||
// top-level Reconcile() method, which should schedule a requeue with the given delay. | ||
func RequeueSoon(after time.Duration) ReconcileResult { | ||
return callBackSoon{after: after} | ||
} | ||
|
||
// Error indicates that the current step in the reconciliation has failed. | ||
// The caller should skip the next steps (if any), and propagate the result up the stack. This will eventually reach the | ||
// top-level Reconcile() method, which should return the error to the controller runtime. | ||
// | ||
// If the argument is wrapped with [reconcile.TerminalError], the reconciliation loop will stop; otherwise, it will be | ||
// retried with exponential backoff. | ||
func Error(e error) ReconcileResult { | ||
return errorOut{err: e} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -842,7 +842,7 @@ func createSingleDatacenterCluster(t *testing.T, ctx context.Context, namespace | |
require.NoError(err, "failed to patch K8ssandraCluster in namespace %s", namespace) | ||
checkStargateReady(t, f, ctx, stargateKey) | ||
checkStargateK8cStatusReady(t, f, ctx, kcKey, dcKey) | ||
checkContainerPresence(t, ctx, f, stargateDeploymentKey, k8ssandra, getPodTemplateSpec, stargate.VectorContainerName) | ||
checkContainerPresenceEventually(t, ctx, f, stargateDeploymentKey, k8ssandra, getPodTemplateSpec, stargate.VectorContainerName) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same issue as in unit tests: because the requeue is not ignored anymore, this is taking longer than before and subject to timing issues. |
||
checkVectorAgentConfigMapPresence(t, ctx, f, dcKey, stargate.VectorAgentConfigMapName) | ||
|
||
t.Log("check that if Stargate is deleted directly it gets re-created") | ||
|
@@ -853,7 +853,7 @@ func createSingleDatacenterCluster(t *testing.T, ctx context.Context, namespace | |
require.NoError(err, "failed to delete Stargate in namespace %s", namespace) | ||
checkStargateReady(t, f, ctx, stargateKey) | ||
|
||
checkContainerPresence(t, ctx, f, stargateDeploymentKey, k8ssandra, getPodTemplateSpec, stargate.VectorContainerName) | ||
checkContainerPresenceEventually(t, ctx, f, stargateDeploymentKey, k8ssandra, getPodTemplateSpec, stargate.VectorContainerName) | ||
checkVectorAgentConfigMapPresence(t, ctx, f, dcKey, stargate.VectorAgentConfigMapName) | ||
|
||
t.Log("delete Stargate in k8ssandracluster resource") | ||
|
@@ -2324,6 +2324,15 @@ func checkContainerPresence(t *testing.T, ctx context.Context, f *framework.E2eF | |
require.True(t, containerFound, "cannot find Container in pod template spec") | ||
} | ||
|
||
func checkContainerPresenceEventually(t *testing.T, ctx context.Context, f *framework.E2eFramework, podKey framework.ClusterKey, kc *api.K8ssandraCluster, specFunction func(t *testing.T, ctx context.Context, f *framework.E2eFramework, dcKey framework.ClusterKey, kc *api.K8ssandraCluster) *corev1.PodTemplateSpec, containerName string) { | ||
t.Logf("check that %s contains Container named %s", podKey.Name, containerName) | ||
require.Eventually(t, func() bool { | ||
podTempSpec := specFunction(t, ctx, f, podKey, kc) | ||
_, containerFound := cassandra.FindContainer(podTempSpec, containerName) | ||
return containerFound | ||
}, polling.stargateReady.timeout, polling.stargateReady.interval, "cannot find Container in pod template spec") | ||
} | ||
|
||
func checkContainerDeleted(t *testing.T, ctx context.Context, f *framework.E2eFramework, podKey framework.ClusterKey, kc *api.K8ssandraCluster, specFunction func(t *testing.T, ctx context.Context, f *framework.E2eFramework, dcKey framework.ClusterKey, kc *api.K8ssandraCluster) *corev1.PodTemplateSpec, containerName string) { | ||
t.Logf("check that %s does not have a Container named %s", podKey.Name, containerName) | ||
podTempSpec := specFunction(t, ctx, f, podKey, kc) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to keep the exact same behavior, but the overall function has issues: the block
if requeues > 0 { ... }
at line 58 is never reached (either with this version or the original ones), so retries are not behaving as expected.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right, we should drop the continue statements and move the check of requeues outside the loop for this to make sense.