Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generic async controller and client reconciler interface #389

Merged
merged 23 commits into from
Oct 31, 2019
Merged

generic async controller and client reconciler interface #389

merged 23 commits into from
Oct 31, 2019

Conversation

frodopwns
Copy link
Contributor

@frodopwns frodopwns commented Oct 18, 2019

closes #247

Bring in AsyncController from Ace's repo. Implement for ResourceGroup.

This converts the RG client from blocking on RG delete to being more async.

To test:

  • run the normal tests
  • manually create a resource group
  • read the changes and consider repercussions

@frodopwns frodopwns changed the title WIP - generic async controller and client reconciler interface generic async controller and client reconciler interface Oct 25, 2019
Copy link
Contributor

@WilliamMortlMicrosoft WilliamMortlMicrosoft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I walked through the code very closely, seems like a good architecture, but it may change as more complex operators are ported. I feel like this will remain WIP until at least 3 operators are ported, that said - I tested created resource groups and it worked splendidly, so I will approve

Copy link
Contributor

@WilliamMortlMicrosoft WilliamMortlMicrosoft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N/A

@frodopwns
Copy link
Contributor Author

I walked through the code very closely, seems like a good architecture, but it may change as more complex operators are ported. I feel like this will remain WIP until at least 3 operators are ported, that said - I tested created resource groups and it worked splendidly, so I will approve

agreed, we will need to ensure our tests are a bit more stable if we are going to confidently tweak this controller after other resources start to rely on it

Copy link
Contributor

@jananivMS jananivMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the consolidation and simplification of the main reconcile loop! But do have a few questions that I want to discuss.

// ResourceGroup is the Schema for the resourcegroups API
// +kubebuilder:resource:shortName=rg,path=resourcegroups
// +kubebuilder:printcolumn:name="Provisioned",type="string",JSONPath=".status.provisioned"
// +kubebuilder:printcolumn:name="Provisioning",type="string",JSONPath=".status.provisioning"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is handy!

controllers/async_controller.go Outdated Show resolved Hide resolved
controllers/async_controller.go Outdated Show resolved Hide resolved
if convertErr != nil {
log.Info("accessor fail")
return ctrl.Result{}, convertErr
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do? Might be good to add some comments on what this is doing for easier understanding. Also can the log.info have something more non-technical to express what went wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, need some comments, I will add those

The controller doesn't know much about the objects being passed in as local. One thing it does know is that every object passed represents a kubernetes custom resource which means it has a Metadata section.

So Accessor returns a struct that implements the metav1.Object interface...ie an object that knows about the contents of the Metadata section of kube resources. This is why you only see the returned value res being used for things that involve Metadata...checking the delete timestamp, adding/removing finalizers.

controllers/async_controller.go Outdated Show resolved Hide resolved
r := resources.Group{
Response: helpers.GetRestResponse(201),
Location: to.StringPtr(location),
Name: to.StringPtr(groupName),
}
manager.resourceGroups = append(manager.resourceGroups, r)

if index == -1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Qn: What does this check check for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is fairly standard for a find method to return -1 as the index of something it couldn't find....

In this case, if the group being added doesn't already exist, we add it

}

instance.Status.Provisioning = true
instance.Status.Provisioned = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We set both at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically the test only cares that one is true

if instance.Status.Provisioning {
instance.Status.Provisioned = true
instance.Status.Provisioning = false
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will this else block be true? I see we set provisioning to true at the start of the function, so wondering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the call returns an error I guess

return true, nil
}

func (g *AzureResourceGroupManager) ForSubscription(context.Context, runtime.Object) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this function for?

var client = getGroupsClient()

future, err := client.Delete(ctx, groupName)
if err != nil {
log.Fatalf("got error: %s", err)
return autorest.Response{}, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldnt this be future.response?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally when a function returns an error you can't expect there to be anything in the other returned value so I didn't try to call future.Response....might be worth looking into

Copy link
Contributor

@szoio szoio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a good look at this PR:

It's definitely a major improvement on hand crafted reconcile loops.

My concern with this overall architecture, or maybe it's not such a big problem on the scheme of things, is that individual controller implementations, i.e. the implementations of AsyncClient
are passed in a runtime.Object, and have the responsibility for updating the state of this object.

Some points on this:

  • There is no way for the the generic reconcile loop to have any assuredness that this is being done correctly.
    It has to trust that the updates are going to leave the reconcile loop in the right state.
    The interactions with Azure are completely bespoke (i.e. per resource type/kind), but the interations with Kubernetes are actually quite standard across all operators.
    For this reason, I've been more in favour of subsuming the interactions with Kubernetes into the
    generic reconcile loop.
  • I've been seeing frequent update failures (writing this updated runtime.Object back to Kubernetes).
    The typical error I've seen is "the object has been modified; please apply your changes to the latest version and try again".
    I've found if you do exactly what it says, i.e. you refetch and reapply, it generally succeeds.
    The problem is here, you don't know what updates have been made to the object, so there is not way of reapplying it. Maybe there is something about my particular setup (running with a Kind cluster) where I get this error that others have not experienced.

@alexeldeib
Copy link
Contributor

I haven't dug in much further since last week, but I am reasonably familiar with the code 🙂

There is no way for the the generic reconcile loop to have any assuredness that this is being done correctly

This is generally true in all cases, and it's the developers job to test. This is true of every Kubernetes core controller -- there's no guarantee the Deployment controller does the right thing, except that it does and people have tested the behavior they expect. Internally, all of these reconcile requests will be fed to the controller by a workqueue of comingled objects anyway, so this aligns reasonably well with how the Kubernetes/controller-runtime internals function.

Controllers should never store any state in memory they cannot reconstruct on a fresh reconcile run. Trying to create a state machine inside the controller is going to cause problems. This has bitten the Kubernetes community repeatedly when dealing with status fields in particular, but has been a thorn generally. There is no concept of an "operation" in the Kubernetes API, everything is simply a level triggered response. I'd argue in the scenarios you hit, the correct decision is always to requeue. Reconciliation loop itself should check for whatever preconditions are necessary.

See discussion at kubernetes/kubernetes#34363 (comment) among others as context for this stance.

The problem is here, you don't know what updates have been made to the object, so there is not way of reapplying it.

The controller shouldn't need to care whether it's a re-reconcile of a previous attempt, or the first run of a fresh object on new boot. If it does or needs to store this as state, it's an anti-pattern.

@alexeldeib
Copy link
Contributor

alexeldeib commented Oct 29, 2019

I think one point @szoio's PR captures thoroughly that this does not touch on at all is dependencies. There's somewhat of a blurry line between service/resource, but I would imagine we could have service implementations for all RPs and CRD/controller for each resource type, and e.g. requiring something as input/output would be a matter of, e.g.:

if err := ensureA(ctx, obj); err != nil {
    return ctrl.Result{}, err
}

if err := ensureB(ctx, obj); err != nil {
    return ctrl.Result{}, err
}

if err := ensureC(ctx, obj); err != nil {
    return ctrl.Result{}, err
}

return ctrl.Result{}, nil

I think this starts to mirror the dependency information captured in #397 so maybe one path forward is starting with this, and building up the interface (e.g. pre/post reconcile resources) as we have more concrete use cases for some of those needs and find out what works before trying to abstract too much.

@szoio
Copy link
Contributor

szoio commented Oct 29, 2019

Thanks @alexeldeib for the comments:

Controllers should never store any state in memory they cannot reconstruct on a fresh reconcile run. Trying to create a state machine inside the controller is going to cause problems. This has bitten the Kubernetes community repeatedly when dealing with status fields in particular, but has been a thorn generally. There is no concept of an "operation" in the Kubernetes API, everything is simply a level triggered response. I'd argue in the scenarios you hit, the correct decision is always to requeue. Reconciliation loop itself should check for whatever preconditions are necessary.

Totally agree that the controller itself should not store any state. However the controller does update state of kubernetes objects, and this new state gets fed to the reconcile loop next time round. So the controller itself is stateless, though I don't think this is exactly what you mean.

I also like the idea that "The controller shouldn't need to care whether it's a re-reconcile of a previous attempt, or the first run of a fresh object on new boot" wherever possible.

There are however quite a few edge cases that one may encounter where some state can be somewhat helpful.

For example, say you have a service with an asynchronous deletion, and the GET operation for that service only returns a status code, say 200 if it exists, and 404 if it doesn't. I have found that we can't always rely on the status fields returned by the Azure management SDK.

Then let's say the manifest is updated and reapplied, such that the resource can't be patched, it needs to be deleted and recreated.

The first time it reconciles the manifest it deletes the object. Because it is an asynchronous delete, it requeues. The next time it reconciles, it attempts to verify the object. In this case suppose it gets a 200. Then it somehow needs to know that the object is busy being deleted. If we don't, we'll just take the 200 to mean that we have got to the end and the resource has finished reconciling successfully.

Having a richer set of possible states can be very helpful in resolving these edge cases.
There are a number of these kind of edge cases, and when these are combined, they make the implementation quite tricky, as each AsyncClient implementation will have to think about all these edge cases, and know how to deal with them, and set appropriate status for every possible scenario.

If we push the onus of responsibility onto the implementer to update these states, the implementations will become increasingly more confusing and tricky, especially if we want to refine the operators and provide better support for full lifecycle workflows such as update vs. delete+recreate.

For that reason I have been suggesting moving the this into the generic loop code, and only expect the specific operator implementation i.e. the AsyncClient (or ResourceManagerClient in #397) to give very concrete known facts about the interaction with Azure back to the caller, and not dictate how the reconcile loop should interpret it (by setting properties on the runtime.Object - which will determine how the loop behaves on the next reconcile).

@alexeldeib
Copy link
Contributor

I think i've misunderstood your motivation (between your comment and perusing the updates to your PR). I think I actually agree entirely with capturing the interaction more cleanly, but without more operators, it's tough for me to tell what the right level of abstraction/separation is (both technically and organizationally).

I'd be curious to see how you envision things like multiple service clients or composition of clients for higher level reconcilers (i'll drop a comment on the PR).

@frodopwns frodopwns merged commit 5f8b17e into Azure:master Oct 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Deleting a Resource Group from Kube that does not exist in Azure
5 participants