
Enable caching successful token authentication #50258

Merged
merged 4 commits into kubernetes:master from liggitt:token-cache on Aug 11, 2017

Conversation

liggitt
Member

@liggitt liggitt commented Aug 7, 2017

Resolves #50472

To support revocation of service account tokens, an etcd lookup of the token and service account is done by the token authenticator. Controllers that make dozens or hundreds of API calls per second (like the endpoints controller) cause this lookup to be done very frequently on the same objects.

This PR:

  • Implements a cached token authenticator that conforms to the authenticator.Token interface
  • Implements a union token authenticator (same approach as the union request authenticator, conforming to the authenticator.Token interface)
  • Cleans up the auth chain construction to group all token authenticators (means we only do bearer and websocket header parsing once)
  • Adds a 10-second TTL cache to successful token authentication
API server authentication now caches successful bearer token authentication results for a few seconds.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 7, 2017
@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. release-note-label-needed labels Aug 7, 2017
@liggitt
Member Author

liggitt commented Aug 7, 2017

cc @kubernetes/sig-auth-pr-reviews

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Aug 7, 2017
}
if len(config.WebhookTokenAuthnConfigFile) > 0 {
webhookTokenAuth, err := newWebhookTokenAuthenticator(config.WebhookTokenAuthnConfigFile, config.WebhookTokenAuthnCacheTTL)
if err != nil {
return nil, nil, err
}
authenticators = append(authenticators, bearertoken.New(webhookTokenAuth), websocket.NewProtocolAuthenticator(webhookTokenAuth))
hasTokenAuth = true
tokenAuthenticators = append(tokenAuthenticators, webhookTokenAuth)
Contributor

This authenticator already does caching. Can we not double up?

r.Status = entry.(authentication.TokenReviewStatus)

Member Author

If we wanted to rework that cache, I'd rather do it in a follow up. I wouldn't special-case the webhook token authenticator in the chain here. I also think the config knob for the TTL on the external authenticator makes more sense to expose.

@liggitt liggitt added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Aug 7, 2017
@liggitt
Member Author

liggitt commented Aug 7, 2017

fixed bazel build file; unit test flaked on #50262

Contributor

@mattmoyer mattmoyer left a comment

Left a few minor comments, but this looks good to me.

The only other interaction I'm worried about is if a user has set --authentication-token-webhook-cache-ttl 0 (or any TTL < 10s), this would potentially break their expectations about how often their authentication webhook is called. This probably isn't a great configuration to start with, but maybe there's something we can do to detect this and make sure we don't cache longer if the user has set a short TTL?

import "sync"

// simple key/value cache with read/write locking
type simpleCache struct {
Contributor

@mattmoyer mattmoyer Aug 7, 2017

It's not that important but we could add a TODO to switch this to the new builtin sync.Map in Go 1.9.

Edit: once we are on 1.9, I mean.

Member Author

replaced with existing lru cache

value, ok := c.data[key]
return value, ok
}
func (c *simpleCache) set(key string, value *cacheRecord) *cacheRecord {
Contributor

@mattmoyer mattmoyer Aug 7, 2017

If failureTTL is non-zero, this will be caching failed attempts. This would allow DoS by filling all available memory with the underlying map here. Maybe it should have a maximum size? Maybe even separate bounds for successful and unsuccessful records?

Member Author

valid points, but I'd probably defer those features until we need them

Member Author

replaced with existing lru cache with bounded maxsize


// New returns a token authenticator that validates credentials using a chain of authenticator.Token objects.
// The entire chain is tried until one succeeds. If all fail, an aggregate error is returned.
func New(authTokenHandlers ...authenticator.Token) authenticator.Token {
Contributor

Should it be an error if len(authTokenHandlers) == 0 (just in the interest of failing fast)? Feel free to ignore because you did add a test that makes sure it fails safe with zero handlers.

Member Author

I'd rather keep the signature error-free (it does the right thing if there are no authenticators)


// NewFailOnError returns a request authenticator that validates credentials using a chain of authenticator.Request objects.
// The first error short-circuits the chain.
func NewFailOnError(authTokenHandlers ...authenticator.Token) authenticator.Token {
Contributor

Same here (error if len(authTokenHandlers) == 0).

return &unionAuthTokenHandler{Handlers: authTokenHandlers, FailOnError: true}
}

// AuthenticateRequest authenticates the request using a chain of authenticator.Request objects.
Contributor

Nit: godoc is wrong here.

Member Author

fixed

}

// NewTokenGroupAdder wraps a token authenticator, and adds the specified groups to the returned user when authentication succeeds
func NewTokenGroupAdder(auth authenticator.Token, groups []string) authenticator.Token {
Contributor

Wasn't expecting this. What is it for?

Member Author

@liggitt liggitt Aug 8, 2017

mirrors the request group adder for assembling an auth chain and adding groups to one type of token auth (we'll use it downstream, but keeping the union/cache/adder implementations together made sense to me)

import "sync"

// simple key/value cache with read/write locking
type simpleCache struct {
Contributor

wasn't expecting this. Aren't there already ttl and LRU caches? I thought someone used one in the admission chain.

Member Author

I don't actually want an LRU cache... just the TTL. An LRU cache turns every read into a write.

Contributor

I don't actually want an LRU cache... just the TTL. An LRU cache turns every read into a write.

That means I'm going to find later on in this pull a spot where you're pruning old entries? You really think that's better? Seems like it would be pretty trivial to explode memory size.

Member Author

ok, will look at swapping to using an existing lru/ttl cache... with the striping lock, that's probably fine.

Member Author

updated

func (c *simpleCache) set(key string, value *cacheRecord) *cacheRecord {
c.lock.Lock()
defer c.lock.Unlock()
if existing, exists := c.data[key]; exists && existing.expires.After(value.expires) {
Contributor

it seems weird to sometimes manage the expiry and sometimes not. If you check the expiry here, why not also check it during a get and avoid returning stale data entirely? If you don't do it there and leave it up to the caller, why check here?

I'm ok doing it either way (if we can't re-use the existing thread-safe LRU stuff), but the asymmetry bothers me.


// If our object was the one stored, queue the removal
if cachedValue == value {
// TODO: batch removals to avoid a goroutine per cache item
Contributor

uhh.... this seems worse than the disease. Have you measured against the next best solution: the existing LRU cache?

Member

@tallclair tallclair left a comment

Agree with the comments around reusing the existing cache implementations. Everything else lgtm.

func fnvKeyFunc(key string) int64 {
f := fnv.New32()
f.Write([]byte(key))
return int64(f.Sum32())
Member

nit: why not use the 64 bit function?

@liggitt liggitt force-pushed the token-cache branch 4 times, most recently from 215054a to b92efc3 Compare August 8, 2017 04:52
@liggitt liggitt added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Aug 8, 2017
@liggitt
Member Author

liggitt commented Aug 8, 2017

comments addressed, PTAL

Contributor

@ericchiang ericchiang left a comment

Just one comment on the bearer token logic. Probably not the best person to review the striped cache code. Going to defer to @deads2k and @tallclair on that.

return record.user, record.ok, record.err
}

user, ok, err := a.authenticator.AuthenticateToken(token)
Contributor

I'd expect this to check the err and fail fast before evaluating ok. It seems odd that we'd cache errors. Maybe that's just a bad distinction between ok == false and err != nil

Member Author

we cache the complete results. failure cases are only cached if the failure ttl is set (which this PR does not do... it only caches success). If we wanted to in the future, we could set different ttls on error and non-error cases.

type keyFunc func(string) uint32
type newCacheFunc func() cache

func newStripedCache(stripeCount int, keyFunc keyFunc, newCacheFunc newCacheFunc) cache {
Contributor

Now that the underlying cache is using k8s.io/apimachinery/pkg/util/cache, it feels like this striping code belongs there.

Member Author

you need to be able to hash the key... which you can't assume for interface{}

Contributor

Ah, that makes sense. Thanks.

@@ -83,7 +86,10 @@ type WebHookAuthenticationOptions struct {
}

func NewBuiltInAuthenticationOptions() *BuiltInAuthenticationOptions {
return &BuiltInAuthenticationOptions{}
return &BuiltInAuthenticationOptions{
TokenSuccessCacheTTL: 10 * time.Second,
Contributor

I still think the way this default interacts with the webhook TTL configuration could be confusing.

Member Author

I suppose I can add a warning if webhook TTL is less than the overall success TTL, but we already have distributed use of token authentication that involves secondary caches (extension api servers and kubelets do tokenreviews with short ttl caches)

Contributor

Thanks for clarifying. A warning works for me, or just something in the docs for --authentication-token-webhook-cache-ttl.

Another option might be to exclude webhookTokenAuth from this new cache (leaving them with their own cache), or even to drop unionAuthTokenHandler and use a separate cache for each token authorizer (where --authentication-token-webhook-cache-ttl would become a flag that set the TTL on that particular cache).

Member Author

@liggitt liggitt Aug 10, 2017

just added the warning for now, TTL under 10s for webhook isn't recommended anyway

Member

@tallclair tallclair left a comment

LGTM, just some nits.


type cacheRecord struct {
user user.Info
ok bool
Member

nit: document what these fields are

"k8s.io/apiserver/pkg/authentication/token/tokenfile"
tokenunion "k8s.io/apiserver/pkg/authentication/token/union"
Member

nit: I think your commits are out of order (not sure if you plan on squashing)

Member Author

@liggitt liggitt Aug 10, 2017

the order is correct... github actually displays them out of order... I was planning on leaving the components distinct

@liggitt
Member Author

liggitt commented Aug 10, 2017

/retest

@liggitt
Member Author

liggitt commented Aug 10, 2017

/retest

@tallclair
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 10, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liggitt, tallclair
We suggest the following additional approver: lavalamp

Assign the PR to them by writing /assign @lavalamp in a comment when ready.

No associated issue. Update pull-request body to add a reference to an issue, or get approval with /approve no-issue

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@liggitt liggitt added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 10, 2017
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 49488, 50407, 46105, 50456, 50258)

@k8s-github-robot k8s-github-robot merged commit 42adb9e into kubernetes:master Aug 11, 2017
@liggitt liggitt deleted the token-cache branch August 12, 2017 01:50
@@ -166,7 +166,9 @@ func TestAddFlagsFlag(t *testing.T) {
ServiceAccounts: &kubeoptions.ServiceAccountAuthenticationOptions{
Lookup: true,
},
TokenFile: &kubeoptions.TokenFileAuthenticationOptions{},
TokenFile: &kubeoptions.TokenFileAuthenticationOptions{},
TokenSuccessCacheTTL: 10 * time.Second,
Contributor

It doesn't look like there are any command-line flags on the API server to tune these two TTL parameters. Was that an anticipated feature to come? Was their omission deliberate?

Member Author

There's already control via flags over the TTLs for "expensive" webhook token validation operations. There's no current plan to expose these as tunable settings.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/security cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.