Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add service events for azure cloud provider #68212

Merged
merged 1 commit into from
Sep 16, 2018

Conversation

feiskyer
Copy link
Member

@feiskyer feiskyer commented Sep 4, 2018

What this PR does / why we need it:

azure-cloud-provider events are now added for easily identify the underground errors of Azure API.

Before this PR, only timed out waiting for the condition on the service events. Users couldn't know what's wrong unless login to master and check kube-controller-manager logs:

$ kubectl describe svc nginx
...
Events:
  Type     Reason                         Age               From                  Message
  ----     ------                         ----              ----                  -------
  Normal   EnsuringLoadBalancer           5s (x3 over 28s)  service-controller    Ensuring load balancer
  Warning  CreatingLoadBalancerFailed     0s (x3 over 24s)  service-controller    Error creating load balancer (will retry): failed to ensure load balancer for service default/nginx: timed out waiting for the condition

After this PR, the details errors would also show in service's events:

$ kubectl describe svc nginx
...
Events:
  Type     Reason                         Age               From                  Message
  ----     ------                         ----              ----                  -------
  Normal   EnsuringLoadBalancer           5s (x3 over 28s)  service-controller    Ensuring load balancer
  Warning  CreateOrUpdatePublicIPAddress  0s (x6 over 26s)  azure-cloud-provider  network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="DnsRecordInUse" Message="DNS record bar.southeastasia.cloudapp.azure.com is already used by another public IP." Details=[]
  Warning  CreatingLoadBalancerFailed     0s (x3 over 24s)  service-controller    Error creating load balancer (will retry): failed to ensure load balancer for service default/nginx: timed out waiting for the condition

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63155

Special notes for your reviewer:

Release note:

Service events are now added in azure-cloud-provider for easily identify the underground errors of Azure API.

Action required: The following clusterrole and clusterrolebinding should be applied:

    kind: List
    apiVersion: v1
    items:
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        labels:
          kubernetes.io/cluster-service: "true"
        name: system:azure-cloud-provider
      rules:
      - apiGroups: [""]
        resources: ["events"]
        verbs:
        - create
        - patch
        - update
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        labels:
          kubernetes.io/cluster-service: "true"
        name: system:azure-cloud-provider
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:azure-cloud-provider
      subjects:
      - kind: ServiceAccount
        name: azure-cloud-provider
        namespace: kube-system

If the clusterrole with same has already been provisioned (e.g. for accessing azurefile secrets), then the above yaml should be merged togather, e.g.

    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRole
    metadata:
      labels:
        kubernetes.io/cluster-service: "true"
      name: system:azure-cloud-provider
    rules:
    - apiGroups: [""]
      resources: ["events"]
      verbs:
      - create
      - patch
      - update
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRoleBinding
    metadata:
      labels:
        kubernetes.io/cluster-service: "true"
      name: system:azure-cloud-provider
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:azure-cloud-provider
    subjects:
    - kind: ServiceAccount
      name: azure-cloud-provider
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRole
    metadata:
      name: system:azure-persistent-volume-binder 
      labels:
        kubernetes.io/cluster-service: "true"
    rules:
    - apiGroups: ['']
      resources: ['secrets']
      verbs:     ['get','create']
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRoleBinding
    metadata:
      name: system:azure-persistent-volume-binder
      labels:
        kubernetes.io/cluster-service: "true"
    roleRef:
      kind: ClusterRole
      apiGroup: rbac.authorization.k8s.io
      name: system:azure-persistent-volume-binder 
    subjects:
    - kind: ServiceAccount
      name: persistent-volume-binder
      namespace: kube-system

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Sep 4, 2018
@feiskyer feiskyer changed the title Az events Add service events for azure cloud provider Sep 4, 2018
@feiskyer
Copy link
Member Author

feiskyer commented Sep 4, 2018

/sig azure
/milestone v1.12

@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Sep 4, 2018
@feiskyer
Copy link
Member Author

feiskyer commented Sep 4, 2018

/cc @brendandburns @khenidak

/assign @andyzhangx

@andyzhangx
Copy link
Member

@feiskyer It's conflict now

@feiskyer
Copy link
Member Author

feiskyer commented Sep 4, 2018

@andyzhangx Solved now. PTAL

Copy link
Member

@andyzhangx andyzhangx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2018
@feiskyer
Copy link
Member Author

feiskyer commented Sep 5, 2018

/assign @liggitt

@liggitt Could you help to take a look at the PR?

@@ -383,7 +391,12 @@ func parseConfig(configReader io.Reader) (*Config, error) {
}

// Initialize passes a Kubernetes clientBuilder interface to the cloud provider
func (az *Cloud) Initialize(clientBuilder controller.ControllerClientBuilder) {}
func (az *Cloud) Initialize(clientBuilder controller.ControllerClientBuilder) {
az.kubeClient = clientBuilder.ClientOrDie("azure-cloud-provider")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @tallclair @deads2k
for more cloud provider clients

I know we don't want roles for these getting added to bootstrap policy. what client should they use to speak to the API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt Didn't get the response yet from @tallclair @deads2k. Any suggestion of candidate role should be used for such cases?

@liggitt
Copy link
Member

liggitt commented Sep 5, 2018

cloud-specific roles should not be part of the bootstrap policy installed on all clusters (xref #66628 for clearing out the one that slipped in accidentally)

I'm actually out of the office, redirecting to @tallclair / @mikedanese / @deads2k for guidance on what client to use for reporting events instead (or alternately how to set up the role to give the cloudprovider-specific client event permissions)

@liggitt liggitt removed their assignment Sep 5, 2018
@feiskyer
Copy link
Member Author

feiskyer commented Sep 6, 2018

@liggitt Thanks for the comments.

I'm actually out of the office, redirecting to @tallclair / @mikedanese / @deads2k for guidance on what client to use for reporting events instead (or alternately how to set up the role to give the cloudprovider-specific client event permissions)

@tallclair @deads2k @mikedanese Could you please take a look at this?

@feiskyer
Copy link
Member Author

/retest

@feiskyer
Copy link
Member Author

kindly ping @tallclair @deads2k @mikedanese

@@ -489,6 +489,13 @@ func ClusterRoles() []rbacv1.ClusterRole {
rbacv1helpers.NewRule("create").Groups(certificatesGroup).Resources("certificatesigningrequests/selfnodeclient").RuleOrDie(),
},
},
{
// a role making azure cloud provider to send events
ObjectMeta: metav1.ObjectMeta{Name: "system:azure-cloud-provider"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the role that @liggitt noted shouldn't go into core.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deads2k Thanks. Then let me remove this and add a release note on it. This is a user-fencing change.

@deads2k
Copy link
Contributor

deads2k commented Sep 13, 2018

While there is still discussion over exactly which org cloud providers should be in, they all seem to agree that k/k isn't the right spot. Given that view, I don't think it makes sense to use a service account from the kube-system namespace. I'd suggest choosing a namespace where you'd eventually want the azure cloud provider to live and create a client from there. For the clusterrole and clusterrolebinding, I'd suggest providing a manifest that can be applied.

When errors occur on azure cloud provider, events now are added so that users
could easily find the underground errors on Azure API.
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Sep 14, 2018
@andyzhangx
Copy link
Member

andyzhangx commented Sep 14, 2018

@brendandburns @khenidak do you have any opinion about this PR? We are going to make this PR into v1.12, it's very useful for debugging, while clusterrole and rolebinding needs to be added into acs-engine and AKS otherwise there will be error event in service description.

Update:
I will wait for one day for your comfirmation, if no response, I will tag lgtm since v1.12 is very close.

@feiskyer
Copy link
Member Author

/retest

Copy link
Member

@andyzhangx andyzhangx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 15, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Azure DNS label mismatch leads to endless failed attempts to ensure load balancer
5 participants