
Add support for kube-up.sh to deploy Calico network policy to GCI masters #29037

Merged

Conversation

matthewdupre
Contributor

@matthewdupre commented Jul 15, 2016

Also removes the requirement for calicoctl on Debian / salt-installed nodes and cleans things up a little by deploying calico-node with a manifest rather than via calicoctl. This also makes it more reliable, since the manifest retries properly.

How to use:

make quick-release
NETWORK_POLICY_PROVIDER=calico cluster/kube-up.sh
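
After kube-up completes, a quick way to check that the Calico pieces came up is to look for them in kube-system (this is only an illustrative check; the exact pod names depend on the manifests in this PR):

# The policy controller and calico-node pods land in the kube-system namespace.
kubectl --namespace=kube-system get pods | grep calico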

One place where I was uncertain:

  • CPU allocations (on the master particularly, where there's very little spare capacity). I took some from etcd, but if there's a better way to decide this, I'm happy to change it.

This change is Reviewable

@k8s-github-robot added the size/L and release-note-label-needed labels Jul 15, 2016
@matthewdupre
Contributor Author

@thockin: Could you take a look at this please? It builds on the original Calico salt PR you reviewed (#26017).

@@ -520,10 +520,21 @@ function start-etcd-servers {
     rm -f /etc/init.d/etcd
   fi
   prepare-log-file /var/log/etcd.log
-  prepare-etcd-manifest "" "4001" "2380" "200m" "etcd.manifest"
+  prepare-etcd-manifest "" "4001" "2380" "150m" "etcd.manifest"
Member

What is this change?

@matthewdupre
Contributor Author

Thanks for taking a look.

Those two changes reduce the CPU resources reserved for the existing etcd clusters to provide room for Calico. There's virtually nothing spare (~40m IIRC) to be able to run more pods on the master, so I had to reduce some existing allocations to make room.

There are a few other options:

  1. Take different amounts of CPU from different places.
  2. Don't reserve resources for the Calico containers.
  3. Increase the master instance size so there's more to go around.

I'm open to any of these (or something else); thoughts?

@eparis removed their assignment Jul 21, 2016
@matthewdupre
Contributor Author

Anything I can do to help you review this @zmerlynn?

@zmerlynn
Member

Sorry, I was out when originally assigned. I don't think we want to unilaterally drop the CPU assignments on the master here: even if there's spare CPU in those containers right now, we need to make sure we leave room for growth in the future. We could possibly do this as a conditional based on the network provider.

cc @mikedanese @fabioy
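
For illustration, the conditional suggested above might look something like this inside start-etcd-servers (a rough sketch only: the prepare-etcd-manifest call mirrors the diff above, but the conditional and the etcd_cpu variable are hypothetical and not part of this PR):

# Keep the original etcd CPU request unless Calico needs room on the master.
local etcd_cpu="200m"
if [[ "${NETWORK_POLICY_PROVIDER:-}" == "calico" ]]; then
  etcd_cpu="150m"
fi
prepare-etcd-manifest "" "4001" "2380" "${etcd_cpu}" "etcd.manifest"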

@matthewdupre
Contributor Author

Was there some systematic process for determining the CPU allocations in the first place?

I'd be perfectly happy to go with one of the alternative approaches as well. Please let me know what you think.

@k8s-github-robot added the do-not-merge and retest-not-required-docs-only labels and removed the retest-not-required-docs-only label Jul 28, 2016
@matthewdupre
Contributor Author

Sorry to ping you all again @thockin, @mikedanese, @fabioy - I'd really like to get this closed down and in :).

It seems like the main point of contention is the CPU resource allocations. To get things in for now, how about we go with option 2 above (remove CPU requests for the Calico components altogether)?

  • This will have no effect on existing (no NETWORK_POLICY_PROVIDER) function.
  • They're only requests, so in practice Calico will still be able to use CPU.
  • We can easily add requests in later if a concrete problem arises.

Would this be acceptable?

function start-calico-policy-controller {
  local -r temp_file="/tmp/calico-policy-controller.manifest"
  cp "${KUBE_HOME}/kube-manifests/kubernetes/gci-trusty/calico-policy-controller.manifest" "${temp_file}"
  PUBLIC_IP=$(ifconfig eth0 | grep 'inet ' | awk '{print $2}')
Member

When I run this locally I get "addr:172.17.128.66", but when I run it on GCI I get a bare IP. This feels fragile. How about hostname -i? That is fragile in a different way...
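
For what it's worth, a slightly more robust sketch that sidesteps the ifconfig output differences (assuming the iproute2 ip tool is available; the hostname -i fallback is the alternative mentioned above):

# Take the first IPv4 address on eth0 from ip(8), which prints a bare
# address/prefix field regardless of distribution, then strip the prefix.
PUBLIC_IP=$(ip -4 -o addr show dev eth0 2>/dev/null | awk '{print $4}' | cut -d/ -f1 | head -n1)
if [[ -z "${PUBLIC_IP}" ]]; then
  PUBLIC_IP=$(hostname -i | awk '{print $1}')   # fallback, fragile in its own way
fi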

@thockin
Member

thockin commented Jul 28, 2016

wrt CPU allocations on the master, I don't think it was particularly scientific, but this does raise some fun problems. I can't actually guarantee that all combinations of addons will "fit" on the master. Can this run as a "regular" pod in the cluster?

Running with very minimal CPU needs means it might starve. Is that OK? What is the failure mode?

@mikedanese too since this touches on turnup

@matthewdupre
Contributor Author

Thanks for the feedback @thockin :). I'll make the minor adjustments you suggest (other than the Jinja syntax question, which I think I've addressed in reply: shout if you disagree). Now onto the CPU allocations / node choice:

Running calico-policy-controller on a regular node:

  • Running our etcd is problematic: it needs persistent storage, and we'd also have to plumb the IP around (maybe a service)? I don't know of a good solution for the storage.
  • Bootstrap is also tricky: if calico-nodes start before calico-policy-controller, then all pods will have their networking locked down until the latter is running. It ought to retry, but it's still a bit unpleasant.
  • We'd also have to make some adjustments to have calico-policy-controller authenticate with the apiserver: this is pretty straightforward though and should happen some time anyway.

Overall it's quite difficult; it might be possible, but it's by no means a trivial move. It's a central cluster service, so I don't think it's too inappropriate to run on the master in that respect.

As for failure modes, if calico-policy-controller receives no CPU time, the calico-nodes will keep everything as it is until they can get new information from the Calico etcd. That is: existing pods will be allowed to talk to the things they were already allowed to, and new pods won't be able to reach anything.

In practice, since the master components all have requests rather than limits, I'm not even sure if omitting requests for the calico-policy-controller would make any real difference?

@thockin
Member

thockin commented Jul 29, 2016

> Running calico-policy-controller on a regular node:
>
> Running our etcd is problematic: it needs persistent storage, and we'd also have to plumb the IP around (maybe a service)? I don't know of a good solution for the storage.

Service IP should work. For storage we have persistent volumes and even dynamic provisioning, so that should work.

> Bootstrap is also tricky: if calico-nodes start before calico-policy-controller, then all pods will have their networking locked down until the latter is running. It ought to retry, but it's still a bit unpleasant.

We're going to have this problem in general with network infrastructure. We don't have a way to guarantee ordering, so we really need to be sure that pods which need calico IPs will fail and retry until calico is actually up. Meanwhile, pods which use host networking will run, including calico.

> We'd also have to make some adjustments to have calico-policy-controller authenticate with the apiserver: this is pretty straightforward though and should happen some time anyway.

When you run as a pod we provide the secrets you need. All you have to do is present the certificate.

> Overall it's quite difficult; it might be possible, but it's by no means a trivial move. It's a central cluster service, so I don't think it's too inappropriate to run on the master in that respect.

I don't think it's INappropriate, but as you're experiencing, the master is sort of jammed, and the more we put there the less likely it is to actually, you know, WORK. I also don't think this idea is impossibly hard. It might expose some bugs, but I can't see why it shouldn't work eventually. Can we make an agreement to try it and see what doesn't work and at least TRY to make it work? @mikedanese mr. self-hosting.

> As for failure modes, if calico-policy-controller receives no CPU time, the calico-nodes will keep everything as it is until they can get new information from the Calico etcd. That is: existing pods will be allowed to talk to the things they were already allowed to, and new pods won't be able to reach anything.

@davidopp for ideas. What do we do when the master is full? Do we have an answer to this?

> In practice, since the master components all have requests rather than limits, I'm not even sure if omitting requests for the calico-policy-controller would make any real difference?

If you specify a request with no limit, limit == request. That is "guaranteed" QoS.

value: "false"
resources:
requests:
cpu: 40m
Member

Is this the right number?

@mikedanese
Member

mikedanese commented Aug 3, 2016

> Overall it's quite difficult; it might be possible, but it's by no means a trivial move. It's a central cluster service, so I don't think it's too inappropriate to run on the master in that respect.
>
> I don't think it's INappropriate, but as you're experiencing, the master is sort of jammed, and the more we put there the less likely it is to actually, you know, WORK. I also don't think this idea is impossibly hard. It might expose some bugs, but I can't see why it shouldn't work eventually. Can we make an agreement to try it and see what doesn't work and at least TRY to make it work? @mikedanese mr. self-hosting.

The optimal deployment would be a daemonset for the node agent that uses the kubernetes API to store state. It's really the only way to reduce deployment of calico across all kubernetes clusters to a single command, and I think that's what we should strive for. Last time I talked with you guys, you mentioned that this was planned? How far off is that? kube-flannel does this, and it's being merged into flannel proper as an alternative subnet manager. It removes the dependency on etcd or a flannel server.

> Running our etcd is problematic: it needs persistent storage, and we'd also have to plumb the IP around (maybe a service)? I don't know of a good solution for the storage.

Use the apiserver :)

> Bootstrap is also tricky: if calico-nodes start before calico-policy-controller, then all pods will have their networking locked down until the latter is running. It ought to retry, but it's still a bit unpleasant.

This primarily happens during initial deployment, which doesn't sound like too big of an issue.

> We'd also have to make some adjustments to have calico-policy-controller authenticate with the apiserver: this is pretty straightforward though and should happen some time anyway.

You can use a service account. Credentials to the apiserver are injected into all pods by default. If you are already using the Go client (?) then it's a couple of lines of code change to switch to InClusterConfig.

@matthewdupre
Contributor Author

@thockin @mikedanese OK, so I had a go at moving the policy controller to regular nodes. There's one thing I haven't got working yet, and I wanted some input on the right approach.

  • We need to do a kubectl create -f calico-policy-controller.yaml or equivalent, but where? It looks like the established method is to use the addon manager (which would involve some work adding replication controllers, PVs, and PVCs to replace the pet set). Alternatively, if there's another place I can call the create from, that would be easier.

With a delayed kubectl create, this commit does work, but it's fairly apparent that there are still unpleasant things here.

  • All our pods need to be host networked, otherwise they'll never be able to start up. Unfortunately host-networked ports don't play nicely with services, hence the hardcoded 10.0.0.17 IP and the similar CONFIGURE_ETC_HOSTS hack to access the kubernetes API (the pod doesn't get the right DNS).
  • Etcd isn't clustered, and clustering it looks deeply unpleasant. It does restart on failure nicely though.
  • Moving to calico-node using the kubernetes API directly isn't something that can happen quickly (months not weeks).

I don't really mind what approach is taken here. I do care about getting this fixed sooner rather than later - the GCI master change broke NETWORK_POLICY_PROVIDER=calico cluster/kube-up.sh, and I'm trying to get that working again. Moving to a single create -f .yaml install is a thing we'd like to do (and we're pretty close), but for the time being it will still suffer the limitations above.

Which of these approaches (first commit / second commit) should I go with? Using the kubernetes API directly isn't possible any time soon.

@caseydavenport
Member

@mikedanese

> The optimal deployment would be a daemonset for the node agent that uses the kubernetes API to store state. It's really the only way to reduce deployment of calico across all kubernetes clusters to a single command, and I think that's what we should strive for. Last time I talked with you guys, you mentioned that this was planned? How far off is that? kube-flannel does this, and it's being merged into flannel proper as an alternative subnet manager. It removes the dependency on etcd or a flannel server.

Yes! This is something I've been looking into recently. Actually installing the Calico pieces using a DaemonSet is relatively straightforward. I've got a prototype working but it's still a little bit fragile. The problem of course is etcd.

> Use the apiserver :)

We've looked into modifying Calico to use the k8s API as a replacement for etcd, and it may be something we do, but it's going to take a bit of effort to make that happen, so it probably won't be available in the near future. As a pragmatic short-term solution, I think running etcd on top of k8s can get us that "one-click install" behavior.

> This primarily happens during initial deployment, which doesn't sound like too big of an issue.

Yep, I agree. I'm not too worried about transient startup conditions.

> You can use a service account. Credentials to the apiserver are injected into all pods by default. If you are already using the Go client (?) then it's a couple of lines of code change to switch to InClusterConfig.

Yep, the policy controller can already do this.

@caseydavenport
Member

To add to my previous comment:

We want to keep moving down this "one-click install" path, but I don't think this PR is the place to do it all. The GCI master PR broke the NETWORK_POLICY_PROVIDER=calico option, and first we want to make that work for people again.

We'll definitely want to do subsequent changes to make this "self-hosted" as we make that option more stable.

@lxpollitt

+1 to Casey's comments. We should continue with the self-hosting one-click install work as a high priority (the number one priority for us after this PR), but I don't think we should delay getting this fix merged in. If the main immediate concern is the extra load on the master, then we can just make sure the docs are super clear that there is a scale impact to enabling policy, and if someone has an at-scale deployment (e.g. 1000 hosts) then they should either use a larger instance for the master or follow the manual instructions in the Calico repo. I'm imagining we'll have the self-hosting one-click install ready to go in time for 1.4, so this is a short-term impact anyway.

@zmerlynn removed their assignment Aug 8, 2016
@matthewdupre
Contributor Author

(I'm going to make the addon-manager change; I'll post back when that's done.)

@@ -0,0 +1,6 @@
# Maintainers
Member

Please use an OWNERS file like elsewhere in the repo.

Contributor Author

I based this on the dashboard and node-problem-detector folders adjacent to this. I'm happy to change it: just pointing out that the closest equivalents currently look like this.

Which would you prefer?

Member

Maintainers is fine

@lxpollitt

I'm concerned that we are going through feature creep on this PR. It was originally intended as a bug fix: if you pass NETWORK_POLICY_PROVIDER in, then kube-up doesn't work in 1.3. We want to get this bug fix into the next 1.3 patch release. But it seems we've started adding increments towards the future vision of self-hosting into the PR, which is delaying getting the fix into 1.3. We are already working on self-hosting and it is our highest-priority task once this PR gets merged, but I don't think we should delay getting this PR merged on any of that work.

@@ -878,6 +878,9 @@ function start-kube-addons {
   if echo "${ADMISSION_CONTROL:-}" | grep -q "LimitRanger"; then
     setup-addon-manifests "admission-controls" "limit-range"
   fi
+  if [[ "${NETWORK_POLICY_PROVIDER}" == "calico" ]]; then
Member

Please add a :- to the variable expansion like the other checks.
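
For reference, the requested form with the default expansion, so the check doesn't trip set -u when the variable is unset (the setup-addon-manifests arguments shown are illustrative, mirroring the limit-range call above):

if [[ "${NETWORK_POLICY_PROVIDER:-}" == "calico" ]]; then
  setup-addon-manifests "addons" "calico-policy-controller"
fi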

@k8s-bot

k8s-bot commented Aug 11, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

  name: calico-etcd
  namespace: kube-system
spec:
  clusterIP: 10.0.0.17
Member

Sorry for not catching this. We parametrize the same field in the kube-dns service because it needs to fall within the service subnet. This may not work if people are not using the default service subnet. I think we can defer the fix to a follow-up though.
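
One possible follow-up, sketched here only as an illustration: derive the IP from the configured service CIDR and substitute it into the manifest at addon setup time. The variable names, the derivation, the default range, and the file name below are all assumptions, not part of this PR:

# Derive an address inside the service subnet and patch it into the calico-etcd Service manifest.
SERVICE_CLUSTER_IP_RANGE="${SERVICE_CLUSTER_IP_RANGE:-10.0.0.0/16}"
CALICO_ETCD_IP="$(echo "${SERVICE_CLUSTER_IP_RANGE}" | sed -e 's|/.*||' -e 's/0$/17/')"
sed -i -e "s/10\.0\.0\.17/${CALICO_ETCD_IP}/g" calico-etcd-service.yaml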

@mikedanese
Member

Thanks, LGTM here. The last thing I'm unsure of is @caseydavenport's review comment about the resource request. Can you address it?

@matthewdupre
Contributor Author

Thanks @mikedanese - please hold off for a little longer: I'm tracking down an issue with API connectivity.

I spoke with @caseydavenport - we're going to remove the request for the calico-nodes.

@matthewdupre
Contributor Author

OK @mikedanese this is ready to go from my end.

We'll continue to enhance the integration (moving the calico-nodes off static pods is fairly close), but this should be enough to cover the issue. That's an enhancement, so we'll do it in a separate PR.

Let me know if you have any further comments; thanks :).

@caseydavenport
Member

Looking good to me!

@apelisse removed the do-not-merge label Aug 11, 2016
@k8s-bot

k8s-bot commented Aug 12, 2016

GCE e2e build/test passed for commit 568fb74.

@mikedanese added the lgtm and release-note labels and removed the release-note-label-needed label Aug 12, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Aug 12, 2016

GCE e2e build/test passed for commit 568fb74.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot merged commit 9fe15e7 into kubernetes:master Aug 12, 2016
@matthewdupre
Contributor Author

Thanks @mikedanese :).
