Make etcd cache size configurable #23914
Conversation
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy.
/cc @hongchaodeng
I think that this change requires making the cache LRU, actually. Having a small randomly evicting cache is almost useless imho...
@@ -21,6 +21,8 @@ import (
	"path"
)

const CacheSize int = 50
I think converting this to an LRU cache, and basing the size off of the current knob for the watch window (instead of introducing an additional knob), may be more useful than this. Does this change actually affect apiserver's behavior? I would expect a bigger cache size to use more memory without really giving any benefit?
Ah, this is the cache we use to save time deserializing. So yes, this needs to eventually be made into an LRU cache. But actually small + randomly evicting is not that bad for this cache.
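For readers following along, here is a minimal sketch of the kind of fixed-size, randomly evicting deserialization cache being discussed, assuming a generic value type; it is illustrative only and not the actual apiserver code.

```go
package deserializationcache

import (
	"math/rand"
	"sync"
)

// Object stands in for the deserialized value type; in the apiserver this
// would be a runtime.Object decoded from the raw etcd response.
type Object interface{}

// randomEvictCache is a fixed-size cache that evicts an arbitrary entry when
// full. Hot keys that get evicted are quickly re-inserted on the next read,
// which is why random eviction is "not that bad" for this use case.
type randomEvictCache struct {
	mu      sync.Mutex
	maxSize int
	items   map[string]Object
	keys    []string // tracked so we can pick a random victim
}

func newRandomEvictCache(maxSize int) *randomEvictCache {
	return &randomEvictCache{
		maxSize: maxSize,
		items:   make(map[string]Object, maxSize),
	}
}

func (c *randomEvictCache) Get(key string) (Object, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	obj, ok := c.items[key]
	return obj, ok
}

func (c *randomEvictCache) Add(key string, obj Object) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, exists := c.items[key]; !exists {
		if len(c.items) >= c.maxSize {
			// Evict a randomly chosen entry to make room.
			i := rand.Intn(len(c.keys))
			delete(c.items, c.keys[i])
			c.keys[i] = c.keys[len(c.keys)-1]
			c.keys = c.keys[:len(c.keys)-1]
		}
		c.keys = append(c.keys, key)
	}
	c.items[key] = obj
}
```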
@@ -230,6 +231,7 @@ func (s *APIServer) AddFlags(fs *pflag.FlagSet) {
	fs.StringVar(&s.EtcdConfig.CertFile, "etcd-certfile", s.EtcdConfig.CertFile, "SSL certification file used to secure etcd communication")
	fs.StringVar(&s.EtcdConfig.CAFile, "etcd-cafile", s.EtcdConfig.CAFile, "SSL Certificate Authority file used to secure etcd communication")
	fs.BoolVar(&s.EtcdConfig.Quorum, "etcd-quorum-read", s.EtcdConfig.Quorum, "If true, enable quorum read")
	fs.IntVar(&s.EtcdConfig.CacheSize, "etcd-cache-size", s.EtcdConfig.CacheSize, "Number of etcd entries to cache in memory")
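As a standalone illustration of the flag wiring shown in this hunk (the struct and field names below are simplified assumptions, not the exact Kubernetes types, and the flag name is revised later in the review):

```go
package main

import (
	"fmt"

	"github.com/spf13/pflag"
)

// etcdConfig is a pared-down stand-in for the apiserver's etcd options struct.
type etcdConfig struct {
	Quorum    bool
	CacheSize int
}

func main() {
	cfg := etcdConfig{CacheSize: 50000} // the pre-existing default discussed below
	fs := pflag.NewFlagSet("apiserver", pflag.ExitOnError)
	fs.BoolVar(&cfg.Quorum, "etcd-quorum-read", cfg.Quorum, "If true, enable quorum read")
	fs.IntVar(&cfg.CacheSize, "etcd-cache-size", cfg.CacheSize, "Number of deserialized etcd objects to cache in memory")
	fs.Parse([]string{"--etcd-cache-size=1000"})
	fmt.Println("cache size:", cfg.CacheSize) // prints: cache size: 1000
}
```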
This cache is actually for deserialized objects; it doesn't have much to do with etcd. "deserialization-cache-size"?
Ok, I see your point. ~~Maybe 'resource-cache-size' or 'storage-cache-size' as other options.~~
edit: I misunderstood the purpose of the cache. I see now - it just avoids the deserialization specifically (the object itself is still loaded from etcd or a separate cache).
OK, I'm actually pretty OK with this change if you can make the flag name/description more accurate.
Yeah, the LRU cache can be done in another PR. I'm curious why you think it won't be so bad as-is, though. Also, what do you think about the default value I've used? 150 vs the original of 50k.
I said that just by reasoning about the behavior--often used things will get quickly replaced if evicted but rarely used things won't--but the first hit on a search seems to confirm: http://danluu.com/2choices-eviction/ Random looks even better than I expected.
Without doing any actual reasoning, my intuition is to go for something a bit bigger than that, maybe something in the range 1k - 5k? We should really count the number of hits/misses this cache gets and export it as a Prometheus metric, then we'd have actual numbers to go off of. My guess is each entry gets like 2-7 hits and that's it?
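A minimal sketch of the hit/miss counting suggested here, using the Prometheus Go client; the metric names are made up for illustration and are not existing apiserver metrics.

```go
package deserializationcache

import "github.com/prometheus/client_golang/prometheus"

var (
	cacheHits = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "apiserver_deserialization_cache_hits_total",
		Help: "Number of lookups served from the deserialization cache.",
	})
	cacheMisses = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "apiserver_deserialization_cache_misses_total",
		Help: "Number of lookups that had to deserialize the raw etcd response.",
	})
)

func init() {
	prometheus.MustRegister(cacheHits, cacheMisses)
}

// recordLookup would be called from the cache's Get path.
func recordLookup(hit bool) {
	if hit {
		cacheHits.Inc()
	} else {
		cacheMisses.Inc()
	}
}
```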
	DefaultEtcdPathPrefix = "/registry"
	globalTimeout = time.Minute
	DefaultDeserializationCacheSize = 150
Maybe let's start at 1k and see if that breaks our performance tests?
It also shows how random eviction performs worse the smaller the cache is, although it doesn't show sizes below 64k. And in our case we have etcd objects getting frequently updated, so the majority are old objects which will never be used again (not sure what his workload is exactly; it doesn't say). Given a large enough cache size, though, it should behave okay, just inefficiently. But yeah - all conjecture without actual stats :).
LGTM - please squash!
@lavalamp squashed
	DefaultEtcdPathPrefix = "/registry"
	globalTimeout = time.Minute
	DefaultDeserializationCacheSize = 1000
Can you please leave the default as it was before (i.e. 50,000)?
@wojtek-t Ok, I changed it back to 50k, but I am not happy about it :). In my cluster, that will result in a ~10GB heap when fully populated (based on my seeing a 1GB heap at cache-size=5000), with 99% of it old, useless entries (since my active set of etcd objects is <1% of 50k on a 4 node cluster). At least I can configure it now. I hope you will consider opening a new issue to investigate replacing it with a more efficient, smaller LRU cache. Or point me to the load tests so I can have a look.
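(Rough arithmetic behind that estimate, using the numbers quoted above: 1 GB / 5,000 entries is roughly 200 KB per cached object, so 50,000 entries works out to roughly 50,000 × 200 KB ≈ 10 GB.)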
Instead of the default 50K entries, allow users to specify more sensible sizes for their cluster.
@jsravn - what we should do is set it based on the initial cluster size in our salt scripts (similar to how we set limits for heapster, for example). That way, the default value can be small for small clusters and big for big clusters.
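A toy sketch of what sizing by cluster size could look like if expressed in code rather than in the salt scripts; the heuristic and constants here are invented purely for illustration.

```go
package main

import "fmt"

// defaultCacheSizeFor is a purely illustrative heuristic for deriving a
// deserialization cache size from the cluster's node count, in the spirit of
// the per-cluster-size defaults suggested above; the constants are made up.
func defaultCacheSizeFor(nodes int) int {
	size := 50 * nodes // rough guess: ~50 frequently read objects per node
	if size < 1000 {
		size = 1000 // floor so small clusters still get a useful cache
	}
	return size
}

func main() {
	for _, n := range []int{4, 100, 1000} {
		fmt.Printf("%d nodes -> cache size %d\n", n, defaultCacheSizeFor(n))
	}
}
```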
BTW - I hope that as soon as we migrate to using protobufs, we will be able to get rid of this cache completely. [So in the 1.3 timeframe, hopefully]
@WojtekT Was the 50k a result of measuring something?
@lavalamp - it was (although it was more than half a year ago, and I don't remember exactly what experiments I was doing :))
LGTM
GCE e2e build/test passed for commit 5bb0595.
@lavalamp I'd like to suggest this for 1.2.x, WDYT?
@ncdc why?
To allow customers to avoid running out of memory in their apiservers.
Is this a problem lots of people are experiencing? I'm in favor of only ...
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 5bb0595.
Automatic merge from submit-queue
@lavalamp never mind on the cherry-pick unless others really want it (OpenShift will pull it into our vendored copy of Kube)
@hongchaodeng - at least not now, in my opinion. To be honest, I would like to get rid of this cache completely. I hope that as soon as we have protobufs, this will be possible.
Instead of the prior 50K limit, allow users to specify a more sensible size for their cluster.
I'm not sure what a sensible default is here. I'm still experimenting on my own clusters. 50 gives me a 270MB max footprint; 50K caused my apiserver to run out of memory as it exceeded 2GB. I believe that number is far too large for most people's use cases.
There are some other fundamental issues that I'm not addressing here:
This is provided as a fix for #23323