Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

service topologies: add install flag for EndpointSlices #4740

Merged
merged 1 commit into from
Jul 15, 2020
Merged

service topologies: add install flag for EndpointSlices #4740

merged 1 commit into from
Jul 15, 2020

Conversation

mateiidavid
Copy link
Member

Following the conclusion of EndpointSlice PR

How


  • What: Make EndpointSlices opt-in through helm values/cli install values.

  • Helm: update the values file to include a global enableEndpointSliceMode value, along with comments explaining both what the value is for, and that slices are still "experimental". I have also added a conditional in the destination rbac template, so if slices are not enabled rbac is not rendered -- my rationale here is that we shouldn't have any slice specific resources created if they are not going to be in use.

  • CLI: first instance was to add the flag and the values so it can be parsed. Another thing that I did for the cli was to add the boolean in the config map (more reasoning for this in a later section). The enableEndpointSliceMode flag takes after cniEnabled and smiMetrics.Enabled.

  • Destination Service: the approach I went with is storing the flag value in the configmap, each component mounts the config in a volume -- when the destination service is started up, we pass the flag value to NewEndpointsWatcher. This way, only the EndpointsWatcher part of the destination service will know of the flag and condition existence. Unfortunately, this also meant that I had to put the value in the struct, and add it as an argument to a few functions so it can be passed down the call stack.

  • Tests: I have not added any unit tests yet for the flag. One more important point of consideration is the changes in tests for the destination service since EndpointsWatcher now needs the EndpointSlice boolean when created, I had to change the tests to reflect this. Tests that had nothing to do with slices had the watcher initialised with a false, while tests that make use of slices have it set to true. Note that setting just the endpointsSliceEnabled to true will not mean endpoint slices can be used, slices, in order to be used, have to:

    • be enabled;
    • be registered on the cluster as a resource;
    • exist in the cluster.

Approaches considered


  • There are a few ways to do this. I came up with three approaches and then asked @alpeb to give me his opinion, given his extensive experience with flags and the cli.

  • Approach 1: my initial idea was to use the linkerd-configmap in EndpointSliceAccess, it looks a lot like the approach I ended up using. Initially, I considered to pass down the controller namespace, in the EndpointSliceAccess function I would have fetched the configmap, check the value and return early in case EndpointSlices are disabled. This is an infeasible solution though; I found getting the controller namespace passed to the function is quite hard -- ultimately the solution would have looked like what it does now.

  • Approach 2: the second approach I thought of was to not use a configmap and instead rely on rbac. In EndpointSliceAccess I would pass a (hardcoded) component name, since rbac is not namespaced and we know the name ahead of time we can hard code the values and get the rbac for our component through client-go. If the rbac does not include permissions for our component to list/get slices, then we can't use them. The downside to this is that it looks a bit hacky in nature, also, it would mean every time we want to use slices somewhere we have to change that components rbac meaning more PRs and a bit of tech debt (?).

  • Approach 3: use a global value, read the boolean from the configmap, set the global value to the boolean in the EndpointsWatcher. Passed on this because global state is not exciting.

Feedback and notes:


  • If you can think of any other (and better) ways to do this or of any architectural decisions that seem more sound, please let me know and I'll change it around. Given the possibilities, I think this solution is the most feasible, but I could have easily missed something.
  • To keep the diff smaller, I'll enable everything once we sort out the flag, that's why functions are still commented out or hardcoded to return false.

Signed-off-by: Matei David matei.david.35@gmail.com

@mateiidavid mateiidavid requested a review from a team as a code owner July 9, 2020 16:39
Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far!

# enables the use of EndpointSlice informers for the destination service;
# EndpointSliceMode should be set to true only if EndpointSlice feature gate is on;
# the feature is still experimental.
enableEndpointSliceMode: false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Mode" feels a bit extraneous. what do you think about just "enableEndpointSlices"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it reads much better! :)

@@ -49,6 +49,7 @@ type (
disableH2Upgrade bool
disableHeartbeat bool
cniEnabled bool
endpointSliceModeEnabled bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a validation that causes linkerd install to error out if endpointSlices are enabled but the resource type doesn't exist on the cluster?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I have added it in. I ended up creating a different function to test it, since we have to initialise a k8s client to check if the resource type exists. The only point of concern is if I make the validation in the right place -- let me know if it should be done in a different place/phase.

@@ -479,7 +481,7 @@ func (sp *servicePublisher) subscribe(srcPort Port, hostname string, listener En
}
port, ok := sp.ports[key]
if !ok {
port = sp.newPortPublisher(srcPort, hostname)
port = sp.newPortPublisher(srcPort, hostname, enableEndpointSlices)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I'd structure this is that I'd make enableEndpointsSlices a field on servicePublisher and on portPublisher. That way we don't need to pass this boolean through the subscribe methods. Instead, when calling the new methods, we can pass in the enableEndpointsSlices from the parent struct.

@@ -75,11 +67,30 @@ func Main(args []string) {
}
}

enableEndpointSlices := global.GetEndpointSliceEnabled()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should call EndpointSliceAccess and error out eagerly if enableEndpointsSlices is set but we don't have access.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've resolved this, after we initialise the K8s.API if enableEndpointSlices=true but the resource is not registered we have a call to log.Fatal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want to move the EndpointSliceAccess check up to before initializing the k8s API. If EndpointSlices are not present on the cluster then initializing the k8s API will hang at "Waiting for caches to sync" instead of getting to the fatal log line.

pkg/k8s/authz.go Outdated
@@ -109,7 +109,11 @@ func ServiceProfilesAccess(k8sClient kubernetes.Interface) error {
// access to EndpointSlice resources.
//TODO: Uncomment function and change return type once EndpointSlices
// are supported and made opt-in through install flag
func EndpointSliceAccess(k8sClient kubernetes.Interface) bool {
func EndpointSliceAccess(k8sClient kubernetes.Interface, enableEndpointSlice bool) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave the second parameter off of this method. This should just check if we have access.

@@ -149,7 +151,7 @@ func NewEndpointsWatcher(k8sAPI *k8s.API, log *logging.Entry) *EndpointsWatcher
UpdateFunc: func(_, obj interface{}) { ew.addService(obj) },
})

if !pkgK8s.EndpointSliceAccess(k8sAPI.Client) {
if !pkgK8s.EndpointSliceAccess(k8sAPI.Client, ew.enableEndpointSlices) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we check for access in the main method as I suggest below, we can remove all of these checks in this file and instead treat the ew.enableEndpointSlices as authoritative.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing suggestion, it reads much better and makes matters less complicated. It also simplifies writing tests; I have uncommented and split up the unit test cases for EndpointSlices since they are no longer coupled to EndpointSliceAccess.

Copy link
Contributor

@Pothulapati Pothulapati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good! but I was wondering why not have a cli flag for destination service to enable this feature? and we can have a helm condition to toggle the same in the charts.

Was thinking that this would make it easier to make the feature enabled by default at destination, once the feature is enabled by default in k8s. WDYT?

@adleong

@mateiidavid
Copy link
Member Author

I like @Pothulapati's approach, and I think he is right; it will be easier to ensure backwards compatibility when EndpointSlices will be the default if we don't have any values in the cm.

The only downside that I can think of at this point in time is having to wire up each component to use the --enable-endpoint-slice program argument if they want to make use of EndpointSlices, but that doesn't seem nearly as much of a hassle as deprecating the value in the protocol buffer (as Tharun pointed out to me!).

@adleong
Copy link
Member

adleong commented Jul 10, 2020

A flag to the destination controller was my first thought too! But I think the current approach where the destination controller reads the value out of the configmap has some advantages:

  • Even if we were to make this a destination controller flag, we would still need to put it in the Linkerd configmap so that the destination manifest would get generated correctly by linkerd upgrade.
  • Since it has to be in the configmap and the destination controller already reads the configmap, to get the trust domain and cluster domain, it just makes sense to read the endpoint slices value at the same way.

Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good 👍
Please add the enableEndpointSlices entry into charts/linkerd2/templates/_config.tpl so that it gets picked up when installing through Helm 😉

Comment on lines 439 to 440
if err = validateEndpointSlicesFeature(options.enableEndpointSlices); err != nil {
return nil, nil, errors.New("--enableEndpointSlice enabled but resource non-existent in cluster")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hiding the error returned by validateEndpointSlicesFeature(), which might not necessarily be that the resource is non-existent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alpeb thanks for the comments! I haven't used helm much so the charts/linkerd2/templates/_config.tpl file escaped me! Will run some manual tests with helm installations to make sure everything works as intended 👍

For validateEndpointSlicesFeature() I have changed the returned error by chaining --enableEndpointSlice true not supported: with whatever error validateEndpointSlicesFeature returns. I could also take another crack at validateEndpointSlicesFeature() and return some more customised errors instead of chaining them if you think that'd be a better approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Matei207 chaining errors like that works for me 👍

@mateiidavid
Copy link
Member Author

mateiidavid commented Jul 13, 2020

Update: after a few failed integration tests, @alpeb kindly pointed out to me that there was a missing comma in the configmap which made installing/updating through helm error out.

Also, as a result of of the EndpointSlice k8s controller being a bit flaky (both in the integration test and another case with multicluster) it was suggested I add a known warning to the CI so we can ignore the error in the event that installation fails due to the k8s slice controller.

Update 2: added another event message to the list of exclusions; seems to be a known issue with EndpointSlices kubernetes#92928

Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

I haven't tested this manually yet but the code looks great (one small style suggestion). Since this flag is defaulted to false I think this is safe to merge. I'm looking forward to doing some end-to-end testing of this on main.

@@ -690,6 +699,19 @@ func (options *installOptions) validate() error {
return nil
}

func validateEndpointSlicesFeature(enableEndpointSlices bool) error {
if !enableEndpointSlices {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally I prefer to let the calling code control things like directly rather than passing instructions into the function. In this case, I'd remove the enableEndpointsSlices argument and instead call this function like:

if enableEndpointSlices {
  if err := validateEndpointsSlicesFeature(); err != nil {
    return ....
  }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why that's a better choice :) Thanks for signalling this, I've changed the code, squashed the commits and pushed.

adleong added a commit that referenced this pull request Jul 14, 2020
…ndpointslices. Fix some helm template issues

Signed-off-by: Alex Leong <alex@buoyant.io>
Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Matei207 , this looks good to me 👍
You told me in slack that you still have to correct something. If that's the case, could you also add a space after the colon when concatenating the error, to avoid Error: --enableEndpointSlice=true not supported:EndpointSlice not supported?

EndpointSlices have been made opt-in due to their experimental nature. This PR
introduces a new install flag 'enableEndpointSlices' that will allow adopters to
specify in their cli install or helm install step whether they would like to
use endpointslices as a resource in the destination service, instead of the
endpoints k8s resource.

Signed-off-by: Matei David <matei.david.35@gmail.com>
@adleong adleong merged commit 8b85716 into linkerd:main Jul 15, 2020
han-so1omon pushed a commit to han-so1omon/linkerd2 that referenced this pull request Jul 15, 2020
EndpointSlices have been made opt-in due to their experimental nature. This PR
introduces a new install flag 'enableEndpointSlices' that will allow adopters to
specify in their cli install or helm install step whether they would like to
use endpointslices as a resource in the destination service, instead of the
endpoints k8s resource.

Signed-off-by: Matei David <matei.david.35@gmail.com>
Signed-off-by: Eric Solomon <errcsool@engineer.com>
adleong pushed a commit that referenced this pull request Jul 20, 2020
* Small PR that uncomments the `EndpointSliceAcess` method and cleans up left over todos in the destination service.
* Based on the past three PRs related to `EndpointSlices` (#4663 #4696 #4740); they should now be functional (albeit prone to bugs) and ready to use.

Signed-off-by: Matei David <matei.david.35@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants