-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generic async controller and client reconciler interface #389
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I walked through the code very closely, seems like a good architecture, but it may change as more complex operators are ported. I feel like this will remain WIP until at least 3 operators are ported, that said - I tested created resource groups and it worked splendidly, so I will approve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N/A
agreed, we will need to ensure our tests are a bit more stable if we are going to confidently tweak this controller after other resources start to rely on it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the consolidation and simplification of the main reconcile loop! But do have a few questions that I want to discuss.
// ResourceGroup is the Schema for the resourcegroups API | ||
// +kubebuilder:resource:shortName=rg,path=resourcegroups | ||
// +kubebuilder:printcolumn:name="Provisioned",type="string",JSONPath=".status.provisioned" | ||
// +kubebuilder:printcolumn:name="Provisioning",type="string",JSONPath=".status.provisioning" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is handy!
if convertErr != nil { | ||
log.Info("accessor fail") | ||
return ctrl.Result{}, convertErr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do? Might be good to add some comments on what this is doing for easier understanding. Also can the log.info have something more non-technical to express what went wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, need some comments, I will add those
The controller doesn't know much about the objects being passed in as local
. One thing it does know is that every object passed represents a kubernetes custom resource which means it has a Metadata section.
So Accessor returns a struct that implements the metav1.Object interface...ie an object that knows about the contents of the Metadata section of kube resources. This is why you only see the returned value res
being used for things that involve Metadata...checking the delete timestamp, adding/removing finalizers.
r := resources.Group{ | ||
Response: helpers.GetRestResponse(201), | ||
Location: to.StringPtr(location), | ||
Name: to.StringPtr(groupName), | ||
} | ||
manager.resourceGroups = append(manager.resourceGroups, r) | ||
|
||
if index == -1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Qn: What does this check check for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is fairly standard for a find method to return -1 as the index of something it couldn't find....
In this case, if the group being added doesn't already exist, we add it
} | ||
|
||
instance.Status.Provisioning = true | ||
instance.Status.Provisioned = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We set both at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically the test only cares that one is true
if instance.Status.Provisioning { | ||
instance.Status.Provisioned = true | ||
instance.Status.Provisioning = false | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will this else block be true? I see we set provisioning to true at the start of the function, so wondering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the call returns an error I guess
return true, nil | ||
} | ||
|
||
func (g *AzureResourceGroupManager) ForSubscription(context.Context, runtime.Object) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this function for?
var client = getGroupsClient() | ||
|
||
future, err := client.Delete(ctx, groupName) | ||
if err != nil { | ||
log.Fatalf("got error: %s", err) | ||
return autorest.Response{}, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldnt this be future.response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally when a function returns an error you can't expect there to be anything in the other returned value so I didn't try to call future.Response....might be worth looking into
add cli on top of same logic. add vnet, tm, nsg, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a good look at this PR:
It's definitely a major improvement on hand crafted reconcile loops.
My concern with this overall architecture, or maybe it's not such a big problem on the scheme of things, is that individual controller implementations, i.e. the implementations of AsyncClient
are passed in a runtime.Object
, and have the responsibility for updating the state of this object.
Some points on this:
- There is no way for the the generic reconcile loop to have any assuredness that this is being done correctly.
It has to trust that the updates are going to leave the reconcile loop in the right state.
The interactions with Azure are completely bespoke (i.e. per resource type/kind), but the interations with Kubernetes are actually quite standard across all operators.
For this reason, I've been more in favour of subsuming the interactions with Kubernetes into the
generic reconcile loop. - I've been seeing frequent update failures (writing this updated
runtime.Object
back to Kubernetes).
The typical error I've seen is "the object has been modified; please apply your changes to the latest version and try again".
I've found if you do exactly what it says, i.e. you refetch and reapply, it generally succeeds.
The problem is here, you don't know what updates have been made to the object, so there is not way of reapplying it. Maybe there is something about my particular setup (running with a Kind cluster) where I get this error that others have not experienced.
I haven't dug in much further since last week, but I am reasonably familiar with the code 🙂
This is generally true in all cases, and it's the developers job to test. This is true of every Kubernetes core controller -- there's no guarantee the Deployment controller does the right thing, except that it does and people have tested the behavior they expect. Internally, all of these reconcile requests will be fed to the controller by a workqueue of comingled objects anyway, so this aligns reasonably well with how the Kubernetes/controller-runtime internals function. Controllers should never store any state in memory they cannot reconstruct on a fresh reconcile run. Trying to create a state machine inside the controller is going to cause problems. This has bitten the Kubernetes community repeatedly when dealing with status fields in particular, but has been a thorn generally. There is no concept of an "operation" in the Kubernetes API, everything is simply a level triggered response. I'd argue in the scenarios you hit, the correct decision is always to requeue. Reconciliation loop itself should check for whatever preconditions are necessary. See discussion at kubernetes/kubernetes#34363 (comment) among others as context for this stance.
The controller shouldn't need to care whether it's a re-reconcile of a previous attempt, or the first run of a fresh object on new boot. If it does or needs to store this as state, it's an anti-pattern. |
I think one point @szoio's PR captures thoroughly that this does not touch on at all is dependencies. There's somewhat of a blurry line between service/resource, but I would imagine we could have service implementations for all RPs and CRD/controller for each resource type, and e.g. requiring something as input/output would be a matter of, e.g.: if err := ensureA(ctx, obj); err != nil {
return ctrl.Result{}, err
}
if err := ensureB(ctx, obj); err != nil {
return ctrl.Result{}, err
}
if err := ensureC(ctx, obj); err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{}, nil I think this starts to mirror the dependency information captured in #397 so maybe one path forward is starting with this, and building up the interface (e.g. pre/post reconcile resources) as we have more concrete use cases for some of those needs and find out what works before trying to abstract too much. |
Thanks @alexeldeib for the comments:
Totally agree that the controller itself should not store any state. However the controller does update state of kubernetes objects, and this new state gets fed to the reconcile loop next time round. So the controller itself is stateless, though I don't think this is exactly what you mean. I also like the idea that "The controller shouldn't need to care whether it's a re-reconcile of a previous attempt, or the first run of a fresh object on new boot" wherever possible. There are however quite a few edge cases that one may encounter where some state can be somewhat helpful. For example, say you have a service with an asynchronous deletion, and the GET operation for that service only returns a status code, say 200 if it exists, and 404 if it doesn't. I have found that we can't always rely on the status fields returned by the Azure management SDK. Then let's say the manifest is updated and reapplied, such that the resource can't be patched, it needs to be deleted and recreated. The first time it reconciles the manifest it deletes the object. Because it is an asynchronous delete, it requeues. The next time it reconciles, it attempts to verify the object. In this case suppose it gets a 200. Then it somehow needs to know that the object is busy being deleted. If we don't, we'll just take the 200 to mean that we have got to the end and the resource has finished reconciling successfully. Having a richer set of possible states can be very helpful in resolving these edge cases. If we push the onus of responsibility onto the implementer to update these states, the implementations will become increasingly more confusing and tricky, especially if we want to refine the operators and provide better support for full lifecycle workflows such as update vs. delete+recreate. For that reason I have been suggesting moving the this into the generic loop code, and only expect the specific operator implementation i.e. the |
I think i've misunderstood your motivation (between your comment and perusing the updates to your PR). I think I actually agree entirely with capturing the interaction more cleanly, but without more operators, it's tough for me to tell what the right level of abstraction/separation is (both technically and organizationally). I'd be curious to see how you envision things like multiple service clients or composition of clients for higher level reconcilers (i'll drop a comment on the PR). |
closes #247
Bring in AsyncController from Ace's repo. Implement for ResourceGroup.
This converts the RG client from blocking on RG delete to being more async.
To test: