
Make etcd cache size configurable #23914

Merged: 1 commit merged into kubernetes:master on Apr 17, 2016

Conversation

@jsravn (Contributor) commented Apr 6, 2016

Instead of the prior 50K limit, allow users to specify a more sensible size for their cluster.

I'm not sure what a sensible default is here; I'm still experimenting on my own clusters. A size of 50 gives me a 270MB max footprint. 50K caused my apiserver to run out of memory, exceeding 2GB. I believe that number is far too large for most people's use cases.

There are some other fundamental issues that I'm not addressing here:

  • Old etcd entries are cached and potentially never removed (the cache is keyed by modifiedIndex, so the stale entry isn't dropped when the object gets updated) - a rough sketch of this behavior follows below
  • The cache isn't LRU, so there's no guarantee the cache stays hot. That makes its performance hard to predict; it's more of a problem with a smaller cache size.
  • In 1.2, etcd entries seem to have a larger memory footprint (I never had an issue in 1.1, even though this cache existed there). I suspect that's due to image lists in the node status.

This is provided as a fix for #23323
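
For illustration, here is a minimal sketch (not the actual Kubernetes code) of the kind of cache described above, assuming a fixed capacity, keys of etcd modifiedIndex, and arbitrary eviction; it shows how stale revisions of an updated object linger until they happen to be evicted:

// Hypothetical sketch of a fixed-capacity cache keyed by modifiedIndex
// with pseudo-random eviction; not the k8s implementation.
package main

import "fmt"

type deserializationCache struct {
	capacity int
	entries  map[uint64]interface{} // modifiedIndex -> decoded object
}

func newCache(capacity int) *deserializationCache {
	return &deserializationCache{capacity: capacity, entries: map[uint64]interface{}{}}
}

func (c *deserializationCache) Add(modifiedIndex uint64, obj interface{}) {
	if len(c.entries) >= c.capacity {
		// Evict an arbitrary entry; Go map iteration order is unspecified,
		// which stands in for the random eviction discussed in this thread.
		for k := range c.entries {
			delete(c.entries, k)
			break
		}
	}
	c.entries[modifiedIndex] = obj
}

func (c *deserializationCache) Get(modifiedIndex uint64) (interface{}, bool) {
	obj, ok := c.entries[modifiedIndex]
	return obj, ok
}

func main() {
	c := newCache(3)
	// The same object updated three times occupies three slots: the two
	// stale revisions are never read again but still consume memory until
	// they happen to be evicted.
	c.Add(100, "pod-a@100")
	c.Add(101, "pod-a@101")
	c.Add(102, "pod-a@102")
	fmt.Println(c.Get(102)) // pod-a@102 true
}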

@k8s-bot commented Apr 6, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Apr 6, 2016
@xiang90 (Contributor) commented Apr 6, 2016

/cc @hongchaodeng

@jsravn (Contributor, Author) commented Apr 6, 2016

I think that this change requires making the cache LRU, actually. Having a small randomly evicting cache is almost useless imho...

@@ -21,6 +21,8 @@ import (
"path"
)

const CacheSize int = 50

@lavalamp (Member) commented Apr 6, 2016

I think converting this to an LRU cache, and basing the size off of the current knob for the watch window (instead of introducing an additional knob), may be more useful than this. Does this change actually affect the apiserver's behavior? I would expect a bigger cache size to use more memory without really giving any benefit.

@lavalamp (Member) commented Apr 7, 2016

Ah, this is the cache we use to save time deserializing. So yes, this needs to eventually be made into an LRU cache. But actually small + randomly evicting is not that bad for this cache.
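
If the cache were made LRU as suggested, a stdlib-only sketch could look like the following (hypothetical, not the Kubernetes code): a map for O(1) lookup plus a doubly linked list to track recency, so the least recently used entry is evicted first.

// Hypothetical LRU cache sketch using only the standard library.
package main

import (
	"container/list"
	"fmt"
)

type lruEntry struct {
	key uint64
	obj interface{}
}

type lruCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[uint64]*list.Element // key -> element in order
}

func newLRU(capacity int) *lruCache {
	return &lruCache{capacity: capacity, order: list.New(), items: map[uint64]*list.Element{}}
}

func (c *lruCache) Get(key uint64) (interface{}, bool) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el) // refresh recency on every hit
		return el.Value.(*lruEntry).obj, true
	}
	return nil, false
}

func (c *lruCache) Add(key uint64, obj interface{}) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruEntry).obj = obj
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&lruEntry{key: key, obj: obj})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back() // least recently used
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruEntry).key)
	}
}

func main() {
	c := newLRU(2)
	c.Add(1, "a")
	c.Add(2, "b")
	c.Get(1)      // touch 1 so it becomes most recently used
	c.Add(3, "c") // evicts 2, the least recently used
	_, ok := c.Get(2)
	fmt.Println(ok) // false
}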

@@ -230,6 +231,7 @@ func (s *APIServer) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.EtcdConfig.CertFile, "etcd-certfile", s.EtcdConfig.CertFile, "SSL certification file used to secure etcd communication")
fs.StringVar(&s.EtcdConfig.CAFile, "etcd-cafile", s.EtcdConfig.CAFile, "SSL Certificate Authority file used to secure etcd communication")
fs.BoolVar(&s.EtcdConfig.Quorum, "etcd-quorum-read", s.EtcdConfig.Quorum, "If true, enable quorum read")
fs.IntVar(&s.EtcdConfig.CacheSize, "etcd-cache-size", s.EtcdConfig.CacheSize, "Number of etcd entries to cache in memory")
Member

This cache is actually for deserialized objects; it doesn't have much to do with etcd. "deserialization-cache-size"?

Contributor Author

Ok, I see your point. Maybe 'resource-cache-size' or 'storage-cache-size' as other options.

edit: I misunderstood the purpose of the cache. I see now - it just avoids the deserialization specifically (the object itself is still loaded from etcd or a separate cache).

@lavalamp (Member) commented Apr 7, 2016

OK, I'm actually pretty OK with this change if you can make the flag name/description more accurate.

@jsravn (Contributor, Author) commented Apr 7, 2016

Ah, this is the cache we use to save time deserializing. So yes, this needs to eventually be made into an LRU cache. But actually small + randomly evicting is not that bad for this cache.

Yeah the LRU cache can be done in another PR. I'm curious why you think it won't be so bad as-is, though.

Also, what do you think about the default value I've used? 150 vs the original of 50k.

@lavalamp (Member) commented Apr 7, 2016

Yeah the LRU cache can be done in another PR. I'm curious why you think it won't be so bad as-is, though.

I said that just by reasoning about the behavior (often-used things will get quickly replaced if evicted, but rarely used things won't), but the first hit on a search seems to confirm it: http://danluu.com/2choices-eviction/. Random looks even better than I expected.

Also, what do you think about the default value I've used? 150 vs the original of 50k.

Without doing any actual reasoning, my intuition is to go for something a bit bigger than that, maybe something in the range 1k - 5k?

We should really count the number of hits/misses this cache gets and export it as a Prometheus metric, then we'd have actual numbers to go off of. My guess is each entry gets like 2-7 hits and that's it?
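
A hedged sketch of that instrumentation idea: count hits and misses with a Prometheus counter so the hit rate can actually be measured. The metric name and label below are made up for illustration, not an existing apiserver metric.

// Hypothetical hit/miss counter exposed via Prometheus.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var cacheLookups = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "deserialization_cache_lookups_total", // hypothetical name
		Help: "Deserialization cache lookups, partitioned by result.",
	},
	[]string{"result"}, // "hit" or "miss"
)

func recordLookup(hit bool) {
	result := "miss"
	if hit {
		result = "hit"
	}
	cacheLookups.WithLabelValues(result).Inc()
}

func main() {
	prometheus.MustRegister(cacheLookups)
	recordLookup(true)
	recordLookup(false)
	// Expose /metrics so the hit/miss ratio can be scraped and graphed.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}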

 	DefaultEtcdPathPrefix = "/registry"
 	globalTimeout = time.Minute
+	DefaultDeserializationCacheSize = 150
Member

Maybe let's start at 1k and see if that breaks our performance tests?

@jsravn (Contributor, Author) commented Apr 8, 2016

http://danluu.com/2choices-eviction/ Random looks even better than I expected.

It also shows that random eviction performs worse the smaller the cache is, although it doesn't show sizes below 64k. And in our case etcd objects get updated frequently, so the majority of cached entries are old objects that will never be used again (not sure what his workload is exactly; he doesn't say). Given a large enough cache size it should behave okay, just inefficiently. But yeah - all conjecture without actual stats :).

@lavalamp (Member)

LGTM - please squash!

@jsravn (Contributor, Author) commented Apr 12, 2016

@lavalamp squashed

 	DefaultEtcdPathPrefix = "/registry"
 	globalTimeout = time.Minute
+	DefaultDeserializationCacheSize = 1000
Member

Can you please leave the default as it was before (i.e. 50,000)?

@jsravn (Contributor, Author) commented Apr 12, 2016

@wojtek-t Ok, I changed it back to 50k.

But I'm not happy about it :). In my cluster, that will result in a ~10GB heap when fully populated (based on seeing a 1GB heap at cache-size=5000), with 99% of it old, useless entries (since my active set of etcd objects is <1% of 50k on a 4-node cluster). At least I can configure it now.

I hope you will consider opening a new issue to investigate replacing it with a more efficient, smaller LRU cache. Or point me to the load tests so I can have a look.

Instead of the default 50K entries, allow users to specify more sensible
sizes for their cluster.
@wojtek-t (Member)

@jsravn - what we should do is set this based on the initial cluster size in our salt scripts (similarly to how we set, e.g., limits for heapster). That way, the default value can be small for small clusters and big for big clusters.
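
A hypothetical illustration of that idea: derive the cache size from the expected node count instead of hard-coding one default. The per-node factor and the bounds are made-up numbers, not values from the Kubernetes salt scripts.

// Hypothetical sizing helper; the constants are illustrative assumptions.
package main

import "fmt"

func cacheSizeForNodes(nodes int) int {
	size := nodes * 50 // assume ~50 frequently-read objects per node
	if size < 1000 {
		size = 1000 // floor for small clusters
	}
	if size > 50000 {
		size = 50000 // keep the historical default as the ceiling
	}
	return size
}

func main() {
	fmt.Println(cacheSizeForNodes(4))    // small cluster -> 1000
	fmt.Println(cacheSizeForNodes(2000)) // large cluster -> 50000
}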

@wojtek-t (Member)

BTW - I hope that as soon as we migrate to using protobufs, we will be able to get rid of this cache completely. [So in the 1.3 timeframe, hopefully]

@lavalamp (Member)

@wojtek-t Was the 50k a result of measuring something?


@wojtek-t (Member)

@lavalamp - it was (although it was more than half a year ago, and I don't remember exactly what experiments I was doing :))

@lavalamp lavalamp added lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-merge release-note-none Denotes a PR that doesn't merit a release note. labels Apr 14, 2016
@lavalamp (Member)

LGTM

@k8s-bot commented Apr 14, 2016

GCE e2e build/test passed for commit 5bb0595.

@ncdc (Member) commented Apr 15, 2016

@lavalamp I'd like to suggest this for 1.2.x, WDYT?

@lavalamp (Member)

@ncdc why?

@lavalamp lavalamp added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 15, 2016
@ncdc (Member) commented Apr 15, 2016

To allow customers to avoid running out of memory in their apiservers


@lavalamp (Member)

Is this a problem lots of people are experiencing? I'm in favor of only cherrypicking absolutely essential changes. Anyway, you can add the cherrypick-candidate label if you want, and see what happens :)


@k8s-bot commented Apr 16, 2016

GCE e2e build/test passed for commit 5bb0595.

@k8s-bot commented Apr 17, 2016

GCE e2e build/test passed for commit 5bb0595.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented Apr 17, 2016

GCE e2e build/test passed for commit 5bb0595.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit a275a04 into kubernetes:master Apr 17, 2016
@ncdc (Member) commented Apr 18, 2016

@lavalamp never mind on the cherry-pick unless others really want it (OpenShift will pull it into our vendored copy of Kube)

@hongchaodeng (Contributor) commented Apr 20, 2016

@lavalamp @wojtek-t
I found that federated/ also has an API server and it didn't get any change from this. Specifically, it doesn't have a default cache size.
Should we add it?

@wojtek-t (Member)

@hongchaodeng - at least not now, in my opinion. To be honest, I would like to get rid of this cache completely; I hope that will be possible as soon as we have protobufs.

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Oct 3, 2019
[3.11] BUG 1758267: UPSTREAM: 61330: fix workgroup panic bug

Origin-commit: 7618f0a8e440ff860081bd0904b6ef8a7d385362
Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.