Canarying mechanism for ConfigMap #20200

Closed
therc opened this issue Jan 27, 2016 · 7 comments
Labels
area/app-lifecycle · kind/support · priority/awaiting-more-evidence

Comments

@therc
Member

therc commented Jan 27, 2016

A bad configuration file can take down a whole service. Is the expectation that problematic ConfigMaps will get caught by the dev/staging/production promotion process? They're namespaced, I think. I can point to a very spectacular postmortem from 2007 where even that was not enough.

Would there be merit in a canarying mechanism? I'm assuming that, currently, configurations are pulled rather than pushed, driven by watches. That's less than ideal, but I think one could hold an election where all the potential "victims" pick the lucky one to try the new settings and later report on success. Even if this doesn't become a built-in feature, the documentation should at least point out the problem, best practices, or possible options.

@thockin
Member

thockin commented Jan 27, 2016

I asked for something here too, but it was deferred at the time. I proposed that the ConfigMap hold 2 maps, an indicator of which map is current and which one is "new", and an int field with values 0-100000, representing milli-percents (or maybe use a real power of 2 for easier clients). Consumers who want to canary can generate a random number in this range. If their number is < the milli-percentage, they use the "new" map, else they use the "current" map. To canary, you push the new map and slowly ramp the percentage. When you hit 100% you flip the pointer and reset the percentage.

This could be baked into the ConfigMap volume. Obviously this is pushing the bounds of ConfigMap, and maybe should be done in a different way.
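
A minimal sketch of the consumer-side selection described above, assuming purely hypothetical field names (CurrentData, NewData, CanaryMilliPercent) that are not part of the actual ConfigMap API:

```go
package main

// Hypothetical sketch only: CurrentData, NewData and CanaryMilliPercent are
// illustrative names, not fields of the real ConfigMap type.

import (
	"fmt"
	"math/rand"
)

type canaryConfig struct {
	CurrentData        map[string]string // map most consumers use
	NewData            map[string]string // map being canaried
	CanaryMilliPercent int               // 0..100000: fraction of consumers on NewData
}

// pick draws a random number in [0, 100000); draws below the threshold get the
// "new" map, everyone else stays on the "current" map.
func pick(c canaryConfig) map[string]string {
	if rand.Intn(100000) < c.CanaryMilliPercent {
		return c.NewData
	}
	return c.CurrentData
}

func main() {
	c := canaryConfig{
		CurrentData:        map[string]string{"feature": "off"},
		NewData:            map[string]string{"feature": "on"},
		CanaryMilliPercent: 5000, // 5% of consumers try the new map
	}
	fmt.Println(pick(c))
}
```

Ramping CanaryMilliPercent to 100000, then flipping the "current" pointer and resetting the counter, completes the rollout as described above.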

@therc
Member Author

therc commented Jan 27, 2016

While it kinda solves the other need I was going to mention (rate limiting how fast new values are propagated to pods), that approach has the drawback that it is not very deterministic. It might take a while before any instance canaries the new setup. It's hard to tell who is using it and who isn't. Service health needs to be correlated with the milli-percent. What's worse, if a new configuration needs to be pushed in the middle of a large outage to try to get the service back on its feet, it'd be hard to evaluate if it's working.

Maybe a config controller could add metadata to the ConfigMap. It could be set up to look at a service or set of labels to identify the candidate pool. Then it writes a "canary" field with the name(s) of the X lucky pod(s), as many as the user requested. On the other side, the kubelets for the chosen ones reload the configuration (all others just wait... the question then becomes what happens if new pods go live during this canarying phase. Maybe we can't escape having two different configurations.) If the canaries are still healthy after Y minutes, the controller updates the "canary" field with new pods, with up to Z new ones in flight at any time (e.g. 5% or 10% of total pods), keeping track of health checks during the whole process. Kubelets might also report which version of the ConfigMap is active for a pod. This would be opt-in, of course.
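
Purely as an illustration of that idea, the opt-in metadata might look something like the following; none of these annotation keys, nor the controller that would maintain them, exist today:

```yaml
# Hypothetical sketch only: illustrative annotation keys for a config canary
# controller; not part of Kubernetes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: frontend-config
  annotations:
    config-canary/pod-selector: "app=frontend"   # candidate pool (service or labels)
    config-canary/canary-pods: "frontend-abc12"  # the X lucky pod(s) currently testing
    config-canary/max-in-flight: "5%"            # Z: pods allowed to canary at once
    config-canary/soak-minutes: "10"             # Y: healthy time before widening
data:
  settings.conf: |
    retries = 3
```

The controller would advance canary-pods as each soak period passes cleanly, while kubelets report back which version of the ConfigMap each pod has actually loaded.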

@thockin
Member

thockin commented Jan 28, 2016

Alternatively, make a new ConfigMap and do a controlled rolling update of your Pods to a new pod template.

@bgrant0607 added the priority/backlog, area/app-lifecycle, and team/ux labels on Jan 30, 2016
@bgrant0607
Member

bgrant0607 commented Jan 30, 2016

I agree with @thockin's last proposal. The right thing to do here is create a new ConfigMap and do a rolling update to switch to it, using the new Deployment API.
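
A minimal sketch of that pattern, with illustrative names and shown against today's apps/v1 API rather than the beta Deployment API of 2016: create a versioned copy of the ConfigMap, then point the Deployment's pod template at the new name so the switch is governed by the Deployment's rolling-update strategy.

```yaml
# Illustrative names only; app-config-v1 keeps serving the old pods while the
# rollout progresses.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v2            # new, versioned ConfigMap
data:
  settings.conf: |
    feature = on
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 10
  selector:
    matchLabels:
      app: frontend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # pods move to the new config a few at a time
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: example.com/frontend:1.0   # placeholder image
        volumeMounts:
        - name: config
          mountPath: /etc/app
      volumes:
      - name: config
        configMap:
          name: app-config-v2    # bumping v1 -> v2 here triggers the rolling update
```

Because the pod template changed, the Deployment rolls pods over gradually and can be paused or rolled back if the first updated pods misbehave; pods still on the old template keep reading app-config-v1 untouched.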

@bgrant0607 added the priority/awaiting-more-evidence label and removed the priority/backlog label on Jan 30, 2016
@bgrant0607
Member

bgrant0607 commented Jan 30, 2016

To be more clear: That's the recommended solution for the foreseeable future. If it won't work for you, please explain why and reopen the issue.

@bgrant0607 added the kind/support label on Jan 30, 2016
@therc
Member Author

therc commented Jan 30, 2016

But the rolling update would restart all pods, correct? Unless we get smarter updates that, with the cooperation of pods, can just reload data.

@bgrant0607
Member

Ref #9043 re. in-place rolling updates.
