generic async controller and client reconciler interface #389

frodopwns · 2019-10-18T16:05:18Z

closes #247

Bring in AsyncController from Ace's repo. Implement for ResourceGroup.

This converts the RG client from blocking on RG delete to being more async.

To test:

run the normal tests
manually create a resource group
read the changes and consider repercussions

WilliamMortlMicrosoft

I walked through the code very closely. It seems like a good architecture, but it may change as more complex operators are ported. I feel like this will remain WIP until at least 3 operators are ported. That said, I tested creating resource groups and it worked splendidly, so I will approve.

WilliamMortlMicrosoft

N/A

frodopwns · 2019-10-28T22:27:44Z

I walked through the code very closely. It seems like a good architecture, but it may change as more complex operators are ported. I feel like this will remain WIP until at least 3 operators are ported. That said, I tested creating resource groups and it worked splendidly, so I will approve.

agreed, we will need to ensure our tests are a bit more stable if we are going to confidently tweak this controller after other resources start to rely on it

jananivMS

I love the consolidation and simplification of the main reconcile loop! But do have a few questions that I want to discuss.

jananivMS · 2019-10-28T17:07:54Z

api/v1alpha1/resourcegroup_types.go

 // ResourceGroup is the Schema for the resourcegroups API
 // +kubebuilder:resource:shortName=rg,path=resourcegroups
+// +kubebuilder:printcolumn:name="Provisioned",type="string",JSONPath=".status.provisioned"
+// +kubebuilder:printcolumn:name="Provisioning",type="string",JSONPath=".status.provisioning"


This is handy!

controllers/async_controller.go

jananivMS · 2019-10-28T20:50:32Z

controllers/async_controller.go

+	if convertErr != nil {
+		log.Info("accessor fail")
+		return ctrl.Result{}, convertErr
+	}


What does this do? It might be good to add some comments on what this is doing for easier understanding. Also, can the log.Info message say something less technical about what went wrong?

yes, need some comments, I will add those

The controller doesn't know much about the objects being passed in as local. One thing it does know is that every object passed represents a kubernetes custom resource which means it has a Metadata section.

So Accessor returns a struct that implements the metav1.Object interface, i.e. an object that knows about the contents of the Metadata section of kube resources. This is why you only see the returned value res being used for things that involve Metadata: checking the delete timestamp, adding/removing finalizers.
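
The explanation above can be sketched with the standard library alone. This is only an illustration: `Object` and `Accessor` below are simplified stand-ins for `metav1.Object` and `meta.Accessor`, not the real apimachinery definitions, and `ResourceGroup`, `ensureFinalizer`, and the finalizer name are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Object is a simplified stand-in for metav1.Object: it exposes only
// the metadata that every Kubernetes resource carries.
type Object interface {
	GetDeletionTimestamp() *time.Time
	GetFinalizers() []string
	SetFinalizers([]string)
}

// ObjectMeta is a cut-down metadata section.
type ObjectMeta struct {
	DeletionTimestamp *time.Time
	Finalizers        []string
}

func (m *ObjectMeta) GetDeletionTimestamp() *time.Time { return m.DeletionTimestamp }
func (m *ObjectMeta) GetFinalizers() []string          { return m.Finalizers }
func (m *ObjectMeta) SetFinalizers(f []string)         { m.Finalizers = f }

// ResourceGroup is a hypothetical custom resource. Embedding ObjectMeta
// is what lets the generic controller treat it as an Object.
type ResourceGroup struct {
	ObjectMeta
	Location string
}

// Accessor mimics the role of meta.Accessor: it returns a metadata view
// of obj, or an error if obj carries no metadata.
func Accessor(obj interface{}) (Object, error) {
	if o, ok := obj.(Object); ok {
		return o, nil
	}
	return nil, errors.New("object does not expose metadata")
}

// ensureFinalizer adds name if it is missing and reports whether the
// finalizer list changed.
func ensureFinalizer(o Object, name string) bool {
	for _, f := range o.GetFinalizers() {
		if f == name {
			return false
		}
	}
	o.SetFinalizers(append(o.GetFinalizers(), name))
	return true
}

func main() {
	rg := &ResourceGroup{Location: "westus"}
	res, err := Accessor(rg)
	if err != nil {
		// A friendlier message than "accessor fail".
		fmt.Println("could not read object metadata:", err)
		return
	}
	fmt.Println("being deleted:", res.GetDeletionTimestamp() != nil)
	fmt.Println("finalizer added:", ensureFinalizer(res, "example.finalizer"))
}
```

The controller never touches `Location` here; everything it does goes through the metadata view, which is why it can stay generic across resource kinds.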

controllers/async_controller.go

jananivMS · 2019-10-28T22:21:33Z

pkg/resourcemanager/mock/resourcegroups/resourcegroup.go

 	r := resources.Group{
 		Response: helpers.GetRestResponse(201),
 		Location: to.StringPtr(location),
 		Name:     to.StringPtr(groupName),
 	}
-	manager.resourceGroups = append(manager.resourceGroups, r)
+
+	if index == -1 {


Qn: What does this check check for?

it is fairly standard for a find method to return -1 as the index of something it couldn't find....

In this case, if the group being added doesn't already exist, we add it
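
A minimal sketch of the find-then-append pattern described in this reply. The `mockManager`, `Group`, `findGroup`, and `createGroup` names are hypothetical and much simpler than the mock in the PR; what happens when the group already exists is an assumption here (the call is treated as a no-op).

```go
package main

import "fmt"

// Group is a hypothetical, cut-down resource group record.
type Group struct {
	Name     string
	Location string
}

// mockManager is a hypothetical in-memory stand-in for the mock manager.
type mockManager struct {
	groups []Group
}

// findGroup returns the index of the named group, or -1 if it is absent.
func (m *mockManager) findGroup(name string) int {
	for i, g := range m.groups {
		if g.Name == name {
			return i
		}
	}
	return -1
}

// createGroup appends the group only when it is not already stored and
// reports whether anything was added.
func (m *mockManager) createGroup(name, location string) bool {
	if m.findGroup(name) != -1 {
		return false
	}
	m.groups = append(m.groups, Group{Name: name, Location: location})
	return true
}

func main() {
	m := &mockManager{}
	fmt.Println("added:", m.createGroup("rg-demo", "westus"))
	fmt.Println("added again:", m.createGroup("rg-demo", "westus"))
	fmt.Println("stored groups:", len(m.groups))
}
```

Guarding the append this way keeps repeated reconciles of the same resource from piling up duplicate entries in the mock.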

jananivMS · 2019-10-28T22:22:11Z

pkg/resourcemanager/mock/resourcegroups/resourcegroup.go

+	}
+
+	instance.Status.Provisioning = true
+	instance.Status.Provisioned = true


We set both at the same time?

technically the test only cares that one is true

jananivMS · 2019-10-28T22:24:00Z

pkg/resourcemanager/resourcegroups/reconcile.go

+	if instance.Status.Provisioning {
+		instance.Status.Provisioned = true
+		instance.Status.Provisioning = false
+	} else {


When will this else block be true? I see we set provisioning to true at the start of the function, so wondering.

if the call returns an error I guess

jananivMS · 2019-10-28T22:25:08Z

pkg/resourcemanager/resourcegroups/reconcile.go

+	return true, nil
+}
+
+func (g *AzureResourceGroupManager) ForSubscription(context.Context, runtime.Object) error {


What is this function for?

jananivMS · 2019-10-28T22:26:03Z

pkg/resourcemanager/resourcegroups/resourcegroup.go

 	var client = getGroupsClient()

 	future, err := client.Delete(ctx, groupName)
 	if err != nil {
-		log.Fatalf("got error: %s", err)
+		return autorest.Response{}, err


Shouldn't this be future.Response?

generally when a function returns an error you can't expect there to be anything in the other returned value, so I didn't try to call future.Response. It might be worth looking into.
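
The convention mentioned in this reply, sketched under assumptions: `Response` is a simplified stand-in for `autorest.Response`, and `deleteFunc`, `deleteGroup`, and `errUnavailable` are hypothetical names, not part of the PR. On error the function returns the zero value plus a wrapped error, and the caller checks the error before looking at the response.

```go
package main

import (
	"errors"
	"fmt"
)

// Response is a simplified stand-in for autorest.Response.
type Response struct {
	StatusCode int
}

// deleteFunc is a hypothetical hook standing in for the SDK delete call.
type deleteFunc func(name string) (Response, error)

// errUnavailable is a hypothetical failure used for the demonstration.
var errUnavailable = errors.New("service unavailable")

// deleteGroup returns the zero Response together with the error instead
// of terminating the process, so the reconcile loop can requeue.
func deleteGroup(del deleteFunc, name string) (Response, error) {
	resp, err := del(name)
	if err != nil {
		// Whatever del returned alongside the error is discarded.
		return Response{}, fmt.Errorf("deleting resource group %q: %w", name, err)
	}
	return resp, nil
}

func main() {
	failing := func(string) (Response, error) {
		return Response{StatusCode: 500}, errUnavailable
	}
	if _, err := deleteGroup(failing, "rg-demo"); err != nil {
		fmt.Println("will retry on the next reconcile:", err)
	}
}
```

Compared with `log.Fatalf`, returning the error keeps the operator process alive and leaves the retry decision to the controller.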


szoio

I've had a good look at this PR:

It's definitely a major improvement on hand crafted reconcile loops.

My concern with this overall architecture, though maybe it's not such a big problem in the scheme of things, is that individual controller implementations, i.e. the implementations of AsyncClient, are passed a runtime.Object and have the responsibility for updating the state of this object.

Some points on this:

There is no way for the generic reconcile loop to have any assuredness that this is being done correctly. It has to trust that the updates are going to leave the reconcile loop in the right state.
The interactions with Azure are completely bespoke (i.e. per resource type/kind), but the interactions with Kubernetes are actually quite standard across all operators. For this reason, I've been more in favour of subsuming the interactions with Kubernetes into the generic reconcile loop.
I've been seeing frequent update failures (writing this updated runtime.Object back to Kubernetes). The typical error I've seen is "the object has been modified; please apply your changes to the latest version and try again". I've found if you do exactly what it says, i.e. you refetch and reapply, it generally succeeds. The problem here is, you don't know what updates have been made to the object, so there is no way of reapplying it. Maybe there is something about my particular setup (running with a Kind cluster) where I get this error that others have not experienced.
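
The refetch-and-reapply approach from the last point can be sketched as follows, assuming the change is expressed as a function that can be replayed against a fresh copy. Everything here (`store`, `object`, `updateWithRetry`, the `Note` field) is a hypothetical in-memory model of optimistic concurrency, not the Kubernetes client; for the real API, client-go provides a `retry.RetryOnConflict` helper that plays a similar role.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("the object has been modified")

// object is a hypothetical resource with a version counter.
type object struct {
	ResourceVersion int
	Provisioned     bool
	Note            string
}

// store is a hypothetical in-memory API server holding one object.
type store struct{ current object }

func (s *store) Get() object { return s.current }

// Update rejects writes that were based on a stale version.
func (s *store) Update(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// updateWithRetry refetches the latest copy and reapplies mutate until
// the write succeeds, fails for another reason, or attempts run out.
func updateWithRetry(s *store, attempts int, mutate func(*object)) error {
	err := errors.New("no update attempts made")
	for i := 0; i < attempts; i++ {
		latest := s.Get()
		mutate(&latest)
		if err = s.Update(latest); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	s := &store{}
	calls := 0
	err := updateWithRetry(s, 3, func(o *object) {
		calls++
		if calls == 1 {
			// Simulate another writer changing the object mid-flight.
			other := s.Get()
			other.Note = "touched by someone else"
			_ = s.Update(other)
		}
		o.Provisioned = true
	})
	fmt.Println("error:", err, "attempts:", calls, "final:", s.Get())
}
```

The sketch only works because the change is a replayable function. When a client mutates the passed-in object directly, the loop has nothing to replay after a conflict, which is the difficulty raised above.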

alexeldeib · 2019-10-29T02:08:29Z

I haven't dug in much further since last week, but I am reasonably familiar with the code 🙂

There is no way for the generic reconcile loop to have any assuredness that this is being done correctly

This is generally true in all cases, and it's the developer's job to test. This is true of every Kubernetes core controller: there's no guarantee the Deployment controller does the right thing, except that it does and people have tested the behavior they expect. Internally, all of these reconcile requests will be fed to the controller by a workqueue of commingled objects anyway, so this aligns reasonably well with how the Kubernetes/controller-runtime internals function.

Controllers should never store any state in memory they cannot reconstruct on a fresh reconcile run. Trying to create a state machine inside the controller is going to cause problems. This has bitten the Kubernetes community repeatedly when dealing with status fields in particular, but has been a thorn generally. There is no concept of an "operation" in the Kubernetes API; everything is simply a level-triggered response. I'd argue in the scenarios you hit, the correct decision is always to requeue. The reconciliation loop itself should check for whatever preconditions are necessary.

See discussion at kubernetes/kubernetes#34363 (comment) among others as context for this stance.

The problem here is, you don't know what updates have been made to the object, so there is no way of reapplying it.

The controller shouldn't need to care whether it's a re-reconcile of a previous attempt, or the first run of a fresh object on new boot. If it does or needs to store this as state, it's an anti-pattern.
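
A minimal sketch of the level-triggered style argued for in this comment: each pass looks only at what currently exists and what is wanted, so a repeated run and a first run take the same path. The `cloud`, `result`, and `reconcile` names are hypothetical, and the map is a fake standing in for Azure.

```go
package main

import "fmt"

// cloud is a hypothetical fake of the remote service: the set of
// resource groups that currently exist.
type cloud struct {
	groups map[string]bool
}

// result mirrors the idea of ctrl.Result in a reduced form.
type result struct {
	Requeue bool
}

// reconcile decides what to do from observed state alone. It keeps no
// memory of earlier passes, so it is safe to run any number of times.
func reconcile(c *cloud, name string, deleting bool) result {
	exists := c.groups[name]
	switch {
	case deleting && exists:
		delete(c.groups, name)
		return result{Requeue: true} // come back and confirm it is gone
	case deleting:
		return result{} // already gone, nothing left to do
	case !exists:
		c.groups[name] = true
		return result{Requeue: true} // come back and confirm it exists
	default:
		return result{} // desired and observed state agree
	}
}

func main() {
	c := &cloud{groups: map[string]bool{}}
	fmt.Println(reconcile(c, "rg-demo", false))
	fmt.Println(reconcile(c, "rg-demo", false))
	fmt.Println(reconcile(c, "rg-demo", true))
	fmt.Println(reconcile(c, "rg-demo", true))
}
```

Restarting the process between any two calls changes nothing, because the only inputs are the desired flag and the observed state.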

alexeldeib · 2019-10-29T02:15:53Z

I think one point @szoio's PR captures thoroughly that this does not touch on at all is dependencies. There's somewhat of a blurry line between service/resource, but I would imagine we could have service implementations for all RPs and a CRD/controller for each resource type, and requiring something as input/output would be a matter of, e.g.:

if err := ensureA(ctx, obj); err != nil {
    return ctrl.Result{}, err
}

if err := ensureB(ctx, obj); err != nil {
    return ctrl.Result{}, err
}

if err := ensureC(ctx, obj); err != nil {
    return ctrl.Result{}, err
}

return ctrl.Result{}, nil

I think this starts to mirror the dependency information captured in #397 so maybe one path forward is starting with this, and building up the interface (e.g. pre/post reconcile resources) as we have more concrete use cases for some of those needs and find out what works before trying to abstract too much.

szoio · 2019-10-29T03:34:43Z

Thanks @alexeldeib for the comments:

Controllers should never store any state in memory they cannot reconstruct on a fresh reconcile run. Trying to create a state machine inside the controller is going to cause problems. This has bitten the Kubernetes community repeatedly when dealing with status fields in particular, but has been a thorn generally. There is no concept of an "operation" in the Kubernetes API; everything is simply a level-triggered response. I'd argue in the scenarios you hit, the correct decision is always to requeue. The reconciliation loop itself should check for whatever preconditions are necessary.

Totally agree that the controller itself should not store any state. However the controller does update state of kubernetes objects, and this new state gets fed to the reconcile loop next time round. So the controller itself is stateless, though I don't think this is exactly what you mean.

I also like the idea that "The controller shouldn't need to care whether it's a re-reconcile of a previous attempt, or the first run of a fresh object on new boot" wherever possible.

There are however quite a few edge cases that one may encounter where some state can be somewhat helpful.

For example, say you have a service with an asynchronous deletion, and the GET operation for that service only returns a status code, say 200 if it exists, and 404 if it doesn't. I have found that we can't always rely on the status fields returned by the Azure management SDK.

Then let's say the manifest is updated and reapplied, such that the resource can't be patched, it needs to be deleted and recreated.

The first time it reconciles the manifest it deletes the object. Because it is an asynchronous delete, it requeues. The next time it reconciles, it attempts to verify the object. In this case suppose it gets a 200. Then it somehow needs to know that the object is busy being deleted. If it doesn't, it will just take the 200 to mean that it has got to the end and the resource has finished reconciling successfully.

Having a richer set of possible states can be very helpful in resolving these edge cases.
There are a number of these kinds of edge cases, and when these are combined, they make the implementation quite tricky, as each AsyncClient implementation will have to think about all these edge cases, and know how to deal with them, and set appropriate status for every possible scenario.

If we push the onus of responsibility onto the implementer to update these states, the implementations will become increasingly confusing and tricky, especially if we want to refine the operators and provide better support for full lifecycle workflows such as update vs. delete+recreate.

For that reason I have been suggesting moving this into the generic loop code, and only expecting the specific operator implementation, i.e. the AsyncClient (or ResourceManagerClient in #397), to give very concrete known facts about the interaction with Azure back to the caller, and not dictate how the reconcile loop should interpret them (by setting properties on the runtime.Object, which will determine how the loop behaves on the next reconcile).
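
One way to picture this proposal, as a sketch under assumptions rather than the design in either PR: the resource client reports only a fact (does the resource exist right now), and the generic loop owns the transition between a richer set of states. The `Fact`, `State`, and `next` names and the particular set of states are hypothetical.

```go
package main

import "fmt"

// Fact is the only thing a hypothetical resource client reports back.
type Fact int

const (
	Missing Fact = iota // e.g. GET returned 404
	Exists              // e.g. GET returned 200
)

// State is recorded on the Kubernetes object by the generic loop only.
type State string

const (
	Pending   State = "Pending"
	Creating  State = "Creating"
	Succeeded State = "Succeeded"
	Deleting  State = "Deleting"
)

// next is owned by the generic loop. It combines the recorded state with
// the reported fact and returns the new state and whether to requeue.
func next(current State, fact Fact) (State, bool) {
	switch current {
	case Deleting:
		if fact == Exists {
			// A 200 during deletion means "not gone yet", not success.
			return Deleting, true
		}
		return Pending, true // gone, so the recreate can start
	case Creating:
		if fact == Exists {
			return Succeeded, false
		}
		return Creating, true
	case Succeeded:
		if fact == Exists {
			return Succeeded, false
		}
		return Pending, true // resource disappeared, start over
	default: // Pending
		if fact == Exists {
			return Succeeded, false
		}
		return Creating, true // the loop would issue the create here
	}
}

func main() {
	state := Deleting
	for _, fact := range []Fact{Exists, Missing, Missing, Exists} {
		var requeue bool
		state, requeue = next(state, fact)
		fmt.Println(state, "requeue:", requeue)
	}
}
```

With the transition table in one place, a new resource client only has to answer the factual question, and the async-delete edge case is handled once rather than in every implementation.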

alexeldeib · 2019-10-29T04:46:22Z

I think I've misunderstood your motivation (between your comment and perusing the updates to your PR). I think I actually agree entirely with capturing the interaction more cleanly, but without more operators, it's tough for me to tell what the right level of abstraction/separation is (both technically and organizationally).

I'd be curious to see how you envision things like multiple service clients or composition of clients for higher level reconcilers (I'll drop a comment on the PR).

alexeldeib mentioned this pull request Oct 22, 2019

refactor: centralize finalizer name, helpers #324

Closed

2 tasks

frodopwns changed the title ~~WIP - generic async controller and client reconciler interface~~ generic async controller and client reconciler interface Oct 25, 2019

frodopwns requested review from jananivMS, alexeldeib and WilliamMortlMicrosoft October 25, 2019 18:00

WilliamMortlMicrosoft approved these changes Oct 28, 2019


WilliamMortlMicrosoft suggested changes Oct 28, 2019


WilliamMortlMicrosoft approved these changes Oct 28, 2019


jananivMS reviewed Oct 28, 2019


alexeldeib and others added 18 commits October 28, 2019 17:48

chore: secrets work

79f96da

feat: implement keyvault secret, bundle, additional controllers

ff27783

add cli on top of same logic. add vnet, tm, nsg, etc.

feat: semi-generic async azure controller + CLI improvements

e8dc899

working on implementing a more generic interface/reconcile stragegy

229acaa

working on robustness

05c7acf

add logs to resourcegroup client, remove wiat from rg delete

f289aa3

leftover debug code

a0636fb

whoops

ad0ad78

update test clients to match interface

a9935b7

working on getting tests to pass

12aeb5a

working on improcing status

087672f

flesh out mock to work with async operator

3cebf19

fix issues in rg pkg tests resulting from async delete change

41a7e5b

rename azure client 'Az' to 'AzureClient'

2e434b8

remove commented code

3fad756

use constants for event type

19d5064

comment and log message cleanup

72a3498

cleaning up comments

b56231e

szoio reviewed Oct 29, 2019


frodopwns added 5 commits October 29, 2019 11:04

Merge branch 'master' into client-interface

490e0ee

removing unused fields

98b19db

merge master

bfab9cc

Merge branch 'master' into client-interface

efe31b4

Merge branch 'master' into client-interface

ce090bd

frodopwns merged commit 5f8b17e into Azure:master Oct 31, 2019
