
Make etcd cache size configurable #23914

Merged: 1 commit merged into kubernetes:master on Apr 17, 2016

Conversation

@jsravn (Contributor) commented Apr 6, 2016

Instead of the prior 50K limit, allow users to specify a more sensible size for their cluster.

I'm not sure what a sensible default is here; I'm still experimenting on my own clusters. A size of 50 gives me a 270MB max footprint. 50K caused my apiserver to run out of memory, exceeding 2GB. I believe that number is far too large for most people's use cases.

There are some other fundamental issues that I'm not addressing here:

  • Old etcd entries are cached and potentially never removed (the cache is keyed by modifiedIndex, so the stale entry isn't dropped when the object gets updated) - a rough sketch of this behavior follows below
  • The cache isn't LRU, so there's no guarantee the cache stays hot. That makes its performance hard to predict; it's more of a problem with a smaller cache size.
  • In 1.2, etcd entries seem to have a larger memory footprint (I never had an issue in 1.1, even though this cache existed there). I suspect that's due to image lists in the node status.

This is provided as a fix for #23323
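
For illustration, here is a minimal sketch (not the actual Kubernetes code) of the kind of cache described above, assuming a fixed capacity, keys of etcd modifiedIndex, and arbitrary eviction; it shows how stale revisions of an updated object linger until they happen to be evicted:

// Hypothetical sketch of a fixed-capacity cache keyed by modifiedIndex
// with pseudo-random eviction; not the k8s implementation.
package main

import "fmt"

type deserializationCache struct {
	capacity int
	entries  map[uint64]interface{} // modifiedIndex -> decoded object
}

func newCache(capacity int) *deserializationCache {
	return &deserializationCache{capacity: capacity, entries: map[uint64]interface{}{}}
}

func (c *deserializationCache) Add(modifiedIndex uint64, obj interface{}) {
	if len(c.entries) >= c.capacity {
		// Evict an arbitrary entry; Go map iteration order is unspecified,
		// which stands in for the random eviction discussed in this thread.
		for k := range c.entries {
			delete(c.entries, k)
			break
		}
	}
	c.entries[modifiedIndex] = obj
}

func (c *deserializationCache) Get(modifiedIndex uint64) (interface{}, bool) {
	obj, ok := c.entries[modifiedIndex]
	return obj, ok
}

func main() {
	c := newCache(3)
	// The same object updated three times occupies three slots: the two
	// stale revisions are never read again but still consume memory until
	// they happen to be evicted.
	c.Add(100, "pod-a@100")
	c.Add(101, "pod-a@101")
	c.Add(102, "pod-a@102")
	fmt.Println(c.Get(102)) // pod-a@102 true
}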

@k8s-bot commented Apr 6, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Apr 6, 2016
@xiang90 (Contributor) commented Apr 6, 2016

/cc @hongchaodeng

@jsravn (Contributor, Author) commented Apr 6, 2016

I think that this change requires making the cache LRU, actually. Having a small randomly evicting cache is almost useless imho...

@@ -21,6 +21,8 @@ import (
"path"
)

const CacheSize int = 50

@lavalamp (Member) commented Apr 6, 2016

I think converting this to an LRU cache, and basing the size off of the current knob for the watch window (instead of introducing an additional knob), may be more useful than this. Does this change actually affect the apiserver's behavior? I would expect a bigger cache size to use more memory without really giving any benefit.

@lavalamp (Member) commented Apr 7, 2016

Ah, this is the cache we use to save time deserializing. So yes, this needs to eventually be made into an LRU cache. But actually small + randomly evicting is not that bad for this cache.
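
If the cache were made LRU as suggested, a stdlib-only sketch could look like the following (hypothetical, not the Kubernetes code): a map for O(1) lookup plus a doubly linked list to track recency, so the least recently used entry is evicted first.

// Hypothetical LRU cache sketch using only the standard library.
package main

import (
	"container/list"
	"fmt"
)

type lruEntry struct {
	key uint64
	obj interface{}
}

type lruCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[uint64]*list.Element // key -> element in order
}

func newLRU(capacity int) *lruCache {
	return &lruCache{capacity: capacity, order: list.New(), items: map[uint64]*list.Element{}}
}

func (c *lruCache) Get(key uint64) (interface{}, bool) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el) // refresh recency on every hit
		return el.Value.(*lruEntry).obj, true
	}
	return nil, false
}

func (c *lruCache) Add(key uint64, obj interface{}) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruEntry).obj = obj
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&lruEntry{key: key, obj: obj})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back() // least recently used
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruEntry).key)
	}
}

func main() {
	c := newLRU(2)
	c.Add(1, "a")
	c.Add(2, "b")
	c.Get(1)      // touch 1 so it becomes most recently used
	c.Add(3, "c") // evicts 2, the least recently used
	_, ok := c.Get(2)
	fmt.Println(ok) // false
}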

@@ -230,6 +231,7 @@ func (s *APIServer) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.EtcdConfig.CertFile, "etcd-certfile", s.EtcdConfig.CertFile, "SSL certification file used to secure etcd communication")
fs.StringVar(&s.EtcdConfig.CAFile, "etcd-cafile", s.EtcdConfig.CAFile, "SSL Certificate Authority file used to secure etcd communication")
fs.BoolVar(&s.EtcdConfig.Quorum, "etcd-quorum-read", s.EtcdConfig.Quorum, "If true, enable quorum read")
fs.IntVar(&s.EtcdConfig.CacheSize, "etcd-cache-size", s.EtcdConfig.CacheSize, "Number of etcd entries to cache in memory")
Member

This cache is actually for deserialized objects; it doesn't have much to do with etcd. "deserialization-cache-size"?

Contributor Author

Ok, I see your point. Maybe 'resource-cache-size' or 'storage-cache-size' as other options.

edit: I misunderstood the purpose of the cache. I see now - it just avoids the deserialization specifically (the object itself is still loaded from etcd or a separate cache).

@lavalamp (Member) commented Apr 7, 2016

OK, I'm actually pretty OK with this change if you can make the flag name/description more accurate.

@jsravn (Contributor, Author) commented Apr 7, 2016

Ah, this is the cache we use to save time deserializing. So yes, this needs to eventually be made into an LRU cache. But actually small + randomly evicting is not that bad for this cache.

Yeah the LRU cache can be done in another PR. I'm curious why you think it won't be so bad as-is, though.

Also, what do you think about the default value I've used? 150 vs the original of 50k.

@lavalamp (Member) commented Apr 7, 2016

Yeah the LRU cache can be done in another PR. I'm curious why you think it won't be so bad as-is, though.

I said that just by reasoning about the behavior (often-used things will get quickly replaced if evicted, but rarely used things won't), but the first hit on a search seems to confirm it: http://danluu.com/2choices-eviction/. Random looks even better than I expected.

Also, what do you think about the default value I've used? 150 vs the original of 50k.

Without doing any actual reasoning, my intuition is to go for something a bit bigger than that, maybe something in the range 1k - 5k?

We should really count the number of hits/misses this cache gets and export it as a Prometheus metric, then we'd have actual numbers to go off of. My guess is each entry gets like 2-7 hits and that's it?
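
A hedged sketch of that instrumentation idea: count hits and misses with a Prometheus counter so the hit rate can actually be measured. The metric name and label below are made up for illustration, not an existing apiserver metric.

// Hypothetical hit/miss counter exposed via Prometheus.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var cacheLookups = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "deserialization_cache_lookups_total", // hypothetical name
		Help: "Deserialization cache lookups, partitioned by result.",
	},
	[]string{"result"}, // "hit" or "miss"
)

func recordLookup(hit bool) {
	result := "miss"
	if hit {
		result = "hit"
	}
	cacheLookups.WithLabelValues(result).Inc()
}

func main() {
	prometheus.MustRegister(cacheLookups)
	recordLookup(true)
	recordLookup(false)
	// Expose /metrics so the hit/miss ratio can be scraped and graphed.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}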

 	DefaultEtcdPathPrefix = "/registry"
 	globalTimeout = time.Minute
+	DefaultDeserializationCacheSize = 150
Member

Maybe let's start at 1k and see if that breaks our performance tests?

@jsravn (Contributor, Author) commented Apr 8, 2016

http://danluu.com/2choices-eviction/ Random looks even better than I expected.

It also shows that random eviction performs worse the smaller the cache is, although it doesn't show sizes below 64k. And in our case etcd objects get updated frequently, so the majority of cached entries are old objects that will never be used again (not sure what his workload is exactly; he doesn't say). Given a large enough cache size it should behave okay, just inefficiently. But yeah - all conjecture without actual stats :).

@lavalamp (Member)

LGTM - please squash!

@jsravn (Contributor, Author) commented Apr 12, 2016

@lavalamp squashed

 	DefaultEtcdPathPrefix = "/registry"
 	globalTimeout = time.Minute
+	DefaultDeserializationCacheSize = 1000
Member

Can you please leave the default as it was before (i.e. 50,000)?

@jsravn (Contributor, Author) commented Apr 12, 2016

@wojtek-t Ok, I changed it back to 50k.

But I'm not happy about it :). In my cluster, that will result in a ~10GB heap when fully populated (based on seeing a 1GB heap at cache-size=5000), with 99% of it old, useless entries (since my active set of etcd objects is <1% of 50k on a 4-node cluster). At least I can configure it now.

I hope you will consider opening a new issue to investigate replacing it with a more efficient, smaller LRU cache. Or point me to the load tests so I can have a look.

Instead of the default 50K entries, allow users to specify more sensible
sizes for their cluster.
@wojtek-t (Member)

@jsravn - what we should do is set this based on the initial cluster size in our salt scripts (similarly to how we set, e.g., limits for heapster). That way, the default value can be small for small clusters and big for big clusters.
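
A hypothetical illustration of that idea: derive the cache size from the expected node count instead of hard-coding one default. The per-node factor and the bounds are made-up numbers, not values from the Kubernetes salt scripts.

// Hypothetical sizing helper; the constants are illustrative assumptions.
package main

import "fmt"

func cacheSizeForNodes(nodes int) int {
	size := nodes * 50 // assume ~50 frequently-read objects per node
	if size < 1000 {
		size = 1000 // floor for small clusters
	}
	if size > 50000 {
		size = 50000 // keep the historical default as the ceiling
	}
	return size
}

func main() {
	fmt.Println(cacheSizeForNodes(4))    // small cluster -> 1000
	fmt.Println(cacheSizeForNodes(2000)) // large cluster -> 50000
}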

@wojtek-t (Member)

BTW - I hope that as soon as we migrate to using protobufs, we will be able to get rid of this cache completely. [So in the 1.3 timeframe, hopefully]

@lavalamp (Member)

@wojtek-t Was the 50k a result of measuring something?


@wojtek-t (Member)

@lavalamp - it was (although it was more than half a year ago, and I don't remember exactly what experiments I was doing :))

@lavalamp lavalamp added lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-merge release-note-none Denotes a PR that doesn't merit a release note. labels Apr 14, 2016
@lavalamp (Member)

LGTM

@k8s-bot commented Apr 14, 2016

GCE e2e build/test passed for commit 5bb0595.

@ncdc (Member) commented Apr 15, 2016

@lavalamp I'd like to suggest this for 1.2.x, WDYT?

@lavalamp (Member)

@ncdc why?

@lavalamp lavalamp added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 15, 2016
@ncdc (Member) commented Apr 15, 2016

To allow customers to avoid running out of memory in their apiservers


@lavalamp (Member)

Is this a problem lots of people are experiencing? I'm in favor of only cherrypicking absolutely essential changes. Anyway, you can add the cherrypick-candidate label if you want, and see what happens :)


@k8s-bot commented Apr 16, 2016

GCE e2e build/test passed for commit 5bb0595.

@k8s-bot commented Apr 17, 2016

GCE e2e build/test passed for commit 5bb0595.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented Apr 17, 2016

GCE e2e build/test passed for commit 5bb0595.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit a275a04 into kubernetes:master Apr 17, 2016
@ncdc (Member) commented Apr 18, 2016

@lavalamp never mind on the cherry-pick unless others really want it (OpenShift will pull it into our vendored copy of Kube)

@hongchaodeng (Contributor) commented Apr 20, 2016

@lavalamp @wojtek-t
I found that federated/ also has an API server and it didn't get any change from this. Specifically, it doesn't have a default cache size.
Should we add it?

@wojtek-t (Member)

@hongchaodeng - at least not now, in my opinion. To be honest, I would like to get rid of this cache completely; I hope that will be possible as soon as we have protobufs.

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Oct 3, 2019
[3.11] BUG 1758267: UPSTREAM: 61330: fix workgroup panic bug

Origin-commit: 7618f0a8e440ff860081bd0904b6ef8a7d385362
Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.