Canarying mechanism for ConfigMap #20200

Closed
therc opened this issue Jan 27, 2016 · 7 comments
Labels
area/app-lifecycle · kind/support · priority/awaiting-more-evidence

Comments

@therc
Member

therc commented Jan 27, 2016

A bad configuration file can take down a whole service. Is the expectation that problematic ConfigMaps will get caught by the dev/staging/production promotion process? They're namespaced, I think. I can point to a very spectacular postmortem from 2007 where even that was not enough.

Would there be merit in a canarying mechanism? I'm assuming that, currently, configurations are pulled rather than pushed, driven by watches. That's less than ideal, but I think one could hold an election where all the potential "victims" pick the lucky one to try the new settings and later report on success. Even if this doesn't become a built-in feature, the documentation should at least point out the problem, best practices, or possible options.

@thockin
Member

thockin commented Jan 27, 2016

I asked for something here too, but it was deferred at the time. I proposed that the ConfigMap hold 2 maps, an indicator of which map is current and which one is "new", and an int field with values 0-100000, representing milli-percents (or maybe use a real power of 2 for easier clients). Consumers who want to canary can generate a random number in this range. If their number is < the milli-percentage, they use the "new" map, else they use the "current" map. To canary, you push the new map and slowly ramp the percentage. When you hit 100% you flip the pointer and reset the percentage.

This could be baked into the ConfigMap volume. Obviously this is pushing the bounds of ConfigMap, and maybe should be done in a different way.
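
A minimal sketch of the consumer-side selection described above, assuming purely hypothetical field names (CurrentData, NewData, CanaryMilliPercent) that are not part of the actual ConfigMap API:

```go
package main

// Hypothetical sketch only: CurrentData, NewData and CanaryMilliPercent are
// illustrative names, not fields of the real ConfigMap type.

import (
	"fmt"
	"math/rand"
)

type canaryConfig struct {
	CurrentData        map[string]string // map most consumers use
	NewData            map[string]string // map being canaried
	CanaryMilliPercent int               // 0..100000: fraction of consumers on NewData
}

// pick draws a random number in [0, 100000); draws below the threshold get the
// "new" map, everyone else stays on the "current" map.
func pick(c canaryConfig) map[string]string {
	if rand.Intn(100000) < c.CanaryMilliPercent {
		return c.NewData
	}
	return c.CurrentData
}

func main() {
	c := canaryConfig{
		CurrentData:        map[string]string{"feature": "off"},
		NewData:            map[string]string{"feature": "on"},
		CanaryMilliPercent: 5000, // 5% of consumers try the new map
	}
	fmt.Println(pick(c))
}
```

Ramping CanaryMilliPercent to 100000, then flipping the "current" pointer and resetting the counter, completes the rollout as described above.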

@therc
Member Author

therc commented Jan 27, 2016

While it kinda solves the other need I was going to mention (rate limiting how fast new values are propagated to pods), that approach has the drawback that it is not very deterministic. It might take a while before any instance canaries the new setup. It's hard to tell who is using it and who isn't. Service health needs to be correlated with the milli-percent. What's worse, if a new configuration needs to be pushed in the middle of a large outage to try to get the service back on its feet, it'd be hard to evaluate if it's working.

Maybe a config controller could add metadata to the ConfigMap. It could be set up to look at a service or set of labels to identify the candidate pool. Then it writes a "canary" field with the name(s) of the X lucky pod(s), as many as the user requested. On the other side, the kubelets for the chosen ones reload the configuration (all others just wait... the question then becomes what happens if new pods go live during this canarying phase. Maybe we can't escape having two different configurations.) If the canaries are still healthy after Y minutes, the controller updates the "canary" field with new pods, with up to Z new ones in flight at any time (e.g. 5% or 10% of total pods), keeping track of health checks during the whole process. Kubelets might also report which version of the ConfigMap is active for a pod. This would be opt-in, of course.
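
Purely as an illustration of that idea, the opt-in metadata might look something like the following; none of these annotation keys, nor the controller that would maintain them, exist today:

```yaml
# Hypothetical sketch only: illustrative annotation keys for a config canary
# controller; not part of Kubernetes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: frontend-config
  annotations:
    config-canary/pod-selector: "app=frontend"   # candidate pool (service or labels)
    config-canary/canary-pods: "frontend-abc12"  # the X lucky pod(s) currently testing
    config-canary/max-in-flight: "5%"            # Z: pods allowed to canary at once
    config-canary/soak-minutes: "10"             # Y: healthy time before widening
data:
  settings.conf: |
    retries = 3
```

The controller would advance canary-pods as each soak period passes cleanly, while kubelets report back which version of the ConfigMap each pod has actually loaded.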

@thockin
Member

thockin commented Jan 28, 2016

Alternatively, make a new ConfigMap and do a controlled rolling update of your Pods to a new pod template.

@bgrant0607 added the priority/backlog, area/app-lifecycle, and team/ux labels on Jan 30, 2016
@bgrant0607
Member

bgrant0607 commented Jan 30, 2016

I agree with @thockin's last proposal. The right thing to do here is create a new ConfigMap and do a rolling update to switch to it, using the new Deployment API.
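
A minimal sketch of that pattern, with illustrative names and shown against today's apps/v1 API rather than the beta Deployment API of 2016: create a versioned copy of the ConfigMap, then point the Deployment's pod template at the new name so the switch is governed by the Deployment's rolling-update strategy.

```yaml
# Illustrative names only; app-config-v1 keeps serving the old pods while the
# rollout progresses.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v2            # new, versioned ConfigMap
data:
  settings.conf: |
    feature = on
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 10
  selector:
    matchLabels:
      app: frontend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # pods move to the new config a few at a time
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: example.com/frontend:1.0   # placeholder image
        volumeMounts:
        - name: config
          mountPath: /etc/app
      volumes:
      - name: config
        configMap:
          name: app-config-v2    # bumping v1 -> v2 here triggers the rolling update
```

Because the pod template changed, the Deployment rolls pods over gradually and can be paused or rolled back if the first updated pods misbehave; pods still on the old template keep reading app-config-v1 untouched.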

@bgrant0607 added the priority/awaiting-more-evidence label and removed the priority/backlog label on Jan 30, 2016
@bgrant0607
Member

bgrant0607 commented Jan 30, 2016

To be more clear: That's the recommended solution for the foreseeable future. If it won't work for you, please explain why and reopen the issue.

@bgrant0607 added the kind/support label on Jan 30, 2016
@therc
Member Author

therc commented Jan 30, 2016

But the rolling update would restart all pods, correct? Unless we get smarter updates that, with the cooperation of pods, can just reload data.

@bgrant0607
Member

Ref #9043 re. in-place rolling updates.
