StatefulSets with the same selector interfere with each other #36859

Closed
foxish opened this issue Nov 16, 2016 · 11 comments · Fixed by #42080

@foxish
Contributor

foxish commented Nov 16, 2016

What happened:
When two StatefulSets are created with the same labels, only one manages to come up.
This is caused by the same issue as #36855 (comment).

What you expected to happen:
Both StatefulSets should come up; instead, only one (ss2) did, while the other's first pod (ss1-0) was repeatedly created and terminated.

```
~ kubectl get pods --watch
NAME      READY     STATUS    RESTARTS   AGE
ss2-0     1/1       Running   0          7s
ss2-1     1/1       Running   0          6s
ss2-2     1/1       Running   0          4s
NAME      READY     STATUS    RESTARTS   AGE
ss1-0     0/1       Pending   0          1s
ss1-0     0/1       Pending   0         1s
ss1-0     0/1       ContainerCreating   0         1s
ss1-0     1/1       Running   0         2s
ss1-0     1/1       Terminating   0         2s
ss1-0     1/1       Terminating   0         2s
ss1-0     0/1       Pending   0         0s
ss1-0     0/1       Pending   0         0s
ss1-0     0/1       ContainerCreating   0         1s
ss1-0     1/1       Running   0         2s
ss1-0     1/1       Terminating   0         2s
ss1-0     1/1       Terminating   0         2s
```

How to reproduce it (as minimally and precisely as possible):

```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: ss1
spec:
  serviceName: "ss1"
  replicas: 3
  template:
    metadata:
      labels:
        app: ss
    spec:
      containers:
      - name: alpine
        image: alpine
        command:
        - /bin/sleep
        - "1000000"
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: ss2
spec:
  serviceName: "ss2"
  replicas: 3
  template:
    metadata:
      labels:
        app: ss
    spec:
      containers:
      - name: alpine
        image: alpine
        command:
        - /bin/sleep
        - "1000000"
```

What we should be doing:

  • Do not filter on the label selector to find the pods associated with a PetSet.

cc @kubernetes/sig-apps

@foxish foxish added area/stateful-apps kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Nov 16, 2016
@foxish foxish added this to the v1.5 milestone Nov 16, 2016
@foxish foxish mentioned this issue Nov 16, 2016
@foxish foxish added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 16, 2016
@janetkuo
Member

Filtering on the label selector AND the created-by annotation should be safe
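
A minimal sketch of such a filter, assuming the legacy `kubernetes.io/created-by` annotation (whose value is a JSON-serialized reference to the creating object). This is illustrative only, not controller code; `filterByCreator` is a hypothetical helper, and a crude substring check stands in for decoding the serialized reference:

```go
package petset

import (
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// createdByAnnotation is the legacy annotation that controllers of this era
// stamped on the pods they created.
const createdByAnnotation = "kubernetes.io/created-by"

// filterByCreator keeps only the pods whose created-by annotation references
// the given controller UID, so two controllers with overlapping selectors
// never act on each other's pods.
func filterByCreator(pods []*v1.Pod, controllerUID types.UID) []*v1.Pod {
	var owned []*v1.Pod
	for _, pod := range pods {
		if ref, ok := pod.Annotations[createdByAnnotation]; ok &&
			strings.Contains(ref, string(controllerUID)) {
			owned = append(owned, pod)
		}
	}
	return owned
}
```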

@smarterclayton
Contributor

This comes up a lot. Seems like we want some machinery to make this easier. @Kargakis

@janetkuo
Member

Per discussion with @erictune, we should fix this in 1.6, presumably with Controller Reference instead

@bprashanth
Contributor

This is the good old overlapping-selectors bug, right? (Fixable with the good old overlapping-selectors solution, as Janet mentioned.) We basically punted on this for a bunch of releases for RC/RS etc., then put in ControllerRef, and now everything that doesn't use ControllerRef still suffers from it.

Do not filter on the label selector to find the pods associated with a PetSet.

This is not the solution we adopted everywhere else, though I don't deny that it might work for this specific case.

An easy short-term fix would be to filter the pets returned by the selector list (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/petset/pet_set.go#L223) by an additional petset-name filter, because we can. IIRC the selector list is used to figure out who to delete, not who to create; we know who to create based on the expected identities of the set.
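
Concretely, such a name filter could look like the following (a sketch under the assumption that pet names always take the form `<set-name>-<ordinal>`; `isMemberOf` is a hypothetical helper, not the linked code):

```go
package petset

import "regexp"

// statefulPodRegex matches pod names of the form "<set-name>-<ordinal>".
var statefulPodRegex = regexp.MustCompile(`^(.+)-([0-9]+)$`)

// isMemberOf reports whether podName parses as "<setName>-<ordinal>", i.e.
// whether the pod's identity belongs to the named set, regardless of which
// other sets' selectors happen to match its labels.
func isMemberOf(setName, podName string) bool {
	m := statefulPodRegex.FindStringSubmatch(podName)
	return m != nil && m[1] == setName
}
```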

@soltysh
Contributor

soltysh commented Nov 19, 2016

Then there's the approach implemented in Jobs, where we have auto-generated selectors, which can still be combined with user-created ones. That, with controller ref, should make the pods distinguishable enough.
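
For reference, a rough sketch of the Jobs defaulting being described, using current apimachinery types (illustrative; `defaultJobSelector` is a hypothetical name, not the API server's actual defaulting function). The generated `controller-uid` label keeps each Job's pods distinguishable, while user-supplied labels on the template are preserved:

```go
package petset

import (
	batch "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultJobSelector mimics how Job defaulting makes pods distinguishable:
// a unique controller-uid label backs the selector and is added to the pod
// template alongside any user-provided labels.
func defaultJobSelector(job *batch.Job) {
	uid := string(job.UID)
	if job.Spec.Selector == nil {
		job.Spec.Selector = &metav1.LabelSelector{
			MatchLabels: map[string]string{"controller-uid": uid},
		}
	}
	if job.Spec.Template.Labels == nil {
		job.Spec.Template.Labels = map[string]string{}
	}
	job.Spec.Template.Labels["controller-uid"] = uid
	job.Spec.Template.Labels["job-name"] = job.Name
}
```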

@0xmichalis
Contributor

Then there's the approach implemented in Jobs, where we have auto-generated selectors, which can still be combined with user-created ones. That, with controller ref, should make the pods distinguishable enough.

@bgrant0607 argued somewhere that the autogenerated selector is more appropriate for Jobs than for long-running controllers (users will care more about labels on the latter vs the former) and I agree with that statement.

@soltysh
Contributor

soltysh commented Nov 21, 2016

That makes sense, thx for bringing that up.

@bgrant0607
Member

The original discussion started in #2210.

We should document this (hint hint).

This is a tradeoff between ease of creation (1st operation) and ease of management (subsequent operations).

Labels are like names in most respects.

It is desirable for a collection of labels to be unique for a given set so that there is a way to unambiguously refer to that set, in management operations (e.g., kubectl delete all --selector=app=ss, monitoring/logging queries). Indeed, the most common reason to want to change labels and selectors (#26202) is lack of planning ahead about what labels one would need: controller, release/version, canary, service, environment, etc. etc.

We encourage use of a unique collection of labels rather than a single unique label value since the additional attributes are generally needed, anyway (e.g., due to rolling update), and otherwise the user would need to combine multiple attributes in a single label value or generate the unique label by some means that would likely result in an unpredictable and/or not-human-friendly value.

It is desirable that the collection of labels be predictable and human-friendly in order to facilitate references from other resources (e.g., services, pod affinity/anti-affinity policies, disruption budget, pod modification policies), and from controllers in adoption cases, even for StatefulSet, where a user might want to delete their previous PetSet or StatefulSet and create a new one to adopt the pre-existing pods.

Unlike many other REST APIs, we don't expect clients to need to query most resources first in order to figure out how to refer to them in subsequent operations. Jobs are different than most other resources because we do typically expect them to be managed by higher-level APIs and/or other orchestrators, and to exist concurrently with subsequently created Jobs. In those respects, they are similar to Pods and the ReplicaSets generated by Deployment, so generated names and labels make sense.

Controller references are intended to prevent bad behavior due to overlapping selectors, especially across different types of controllers, but are not intended as a replacement for label uniqueness. See also:
#14961 (comment)

StatefulSets are different than every other workload controller in that the instances have predictable names derived from the set resource name. Could we eliminate the use of labels for identifying the set's pods? In addition to making management of the pods harder, as discussed above, my main concern about that is that it would be inconsistent with all other workload controllers. Therefore, I'd like to think about a strategy that would work for other workload controllers, as well.

The main problem with Job's approach is that it creates a uid label, which is unpredictable and not human-friendly. The current approach of defaulting the label selector from all labels in the template has problems, as well (#34292, #17902). Defaulting the template labels and selector from just the resource name (and not the uid) would address the problem with Job's approach and would be neither better nor worse with respect to the other problems. We'd probably use name as the label key, which would enable users to not think as much about what --selector=name=ss1 means, which, again, mostly defers confusion to later.
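
A sketch of that name-based defaulting (hypothetical; `defaultSelectorFromName` does not exist in the codebase and this was only a proposal):

```go
package petset

import (
	apps "k8s.io/api/apps/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultSelectorFromName illustrates the proposal above: when the user sets
// neither a selector nor template labels, derive both from the resource name,
// which is predictable and human-friendly, rather than from a generated uid.
func defaultSelectorFromName(set *apps.StatefulSet) {
	if set.Spec.Selector == nil && len(set.Spec.Template.Labels) == 0 {
		nameLabels := map[string]string{"name": set.Name}
		set.Spec.Selector = &metav1.LabelSelector{MatchLabels: nameLabels}
		set.Spec.Template.Labels = nameLabels
	}
}
```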

It appears that we can't expect new users to understand what labels they need ahead of time, though our tooling (e.g., kubectl run) and documentation don't help them. We probably need to expect operations like expose and rolling-update to add new labels for their own use (#36897).

cc @erictune @pwittrock @Kargakis

@bgrant0607
Member

Also note that the /scale subresource requires a selector.

I'd be on board with preventing changes that would break the StatefulSet. For instance, for now we could prevent the user from setting the selector and from removing the name label.

@0xmichalis
Contributor

I'd be on board with preventing changes that would break the StatefulSet. For instance, for now we could prevent the user from setting the selector and from removing the name label.

Same thing should apply for the podtemplate-hash label in ReplicaSets owned by Deployments.

@enisoc
Member

enisoc commented Jan 7, 2017

I'm pursuing a fix for this via controllerRef (#24946).

kow3ns added a commit to kow3ns/kubernetes that referenced this issue Feb 8, 2017
1. pcb and pcb controller are removed and their functionality is
encapsulated in StatefulPodControlInterface.
2. IdentityMappers has been removed to clarify what properties of a Pod are
mutated by the controller. All mutations are performed in the
UpdateStatefulPod method of the StatefulPodControlInterface.
3. The current control loop, which implements the same logic,
is in stateful_set_control.go. The Pods are now sorted and considered by
their ordinal indices, as is outlined in the documentation.
4. StatefulSetController now checks to see if the Pods matching a
StatefulSet's Selector also match the Name of the StatefulSet.
k8s-github-robot pushed a commit that referenced this issue Feb 10, 2017
Automatic merge from submit-queue (batch tested with PRs 40796, 40878, 36033, 40838, 41210)

StatefulSet hardening

**What this PR does / why we need it**:

This PR contains the following changes to StatefulSet. Only one change affects the semantics of how the controller operates (this is described in #38418), and this change only brings the controller into conformance with its documented behavior.

1. pcb and pcb controller are removed and their functionality is encapsulated in StatefulPodControlInterface. This class models the design of controller.PodControlInterface and provides an abstraction over clientset.Interface which is useful for testing purposes.
2. IdentityMappers has been removed to clarify what properties of a Pod are mutated by the controller. All mutations are performed in the UpdateStatefulPod method of the StatefulPodControlInterface.
3. The statefulSetIterator and petQueue classes are removed. These classes sorted Pods by CreationTimestamp. This is brittle and not resilient to clock skew. The current control loop, which implements the same logic, is in stateful_set_control.go. The Pods are now sorted and considered by their ordinal indices, as is outlined in the documentation.
4. StatefulSetController now checks to see if the Pods matching a StatefulSet's Selector also match the Name of the StatefulSet. This will make the controller resilient to overlapping selectors, and will be enhanced by the addition of ControllerRefs (see the sketch after this list).
5. The total lines of production code have been reduced, and the total number of unit tests has been increased. All new code has 100% unit coverage giving the module 83% coverage. Tests for StatefulSetController have been added, but it is not practical to achieve greater coverage in unit testing for this code (the e2e tests for StatefulSet cover these areas).
6. Issue #38418 is fixed in that StatefulSet will ensure that all Pods that are predecessors of another Pod are Running and Ready prior to launching a new Pod. This removes the potential for deadlock when a Pod needs to be rescheduled while its predecessor is hung in Pending or Initializing.
7. All reference to pet have been removed from the code and comments.
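
A rough sketch of the ordinal handling described in items 3 and 4 (illustrative only, not the PR's actual code; `parseIdentity` and `sortByOrdinal` are hypothetical names). Item 4's check compares the parsed parent name against the set's Name; item 3 replaces CreationTimestamp ordering with ordinal ordering:

```go
package statefulset

import (
	"regexp"
	"sort"
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// statefulPodRegex matches pod names of the form "<set-name>-<ordinal>".
var statefulPodRegex = regexp.MustCompile(`^(.+)-([0-9]+)$`)

// parseIdentity returns the parent set name and ordinal encoded in a pod's
// name, or ("", -1) when the name carries no StatefulSet identity.
func parseIdentity(pod *v1.Pod) (parent string, ordinal int) {
	m := statefulPodRegex.FindStringSubmatch(pod.Name)
	if m == nil {
		return "", -1
	}
	ordinal, _ = strconv.Atoi(m[2])
	return m[1], ordinal
}

// sortByOrdinal orders pods by ordinal index rather than CreationTimestamp,
// so the ordering is stable under clock skew.
func sortByOrdinal(pods []*v1.Pod) {
	sort.Slice(pods, func(i, j int) bool {
		_, a := parseIdentity(pods[i])
		_, b := parseIdentity(pods[j])
		return a < b
	})
}
```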

**Which issue this PR fixes**
 fixes #38418,#36859
**Special notes for your reviewer**:

**Release note**:

```release-note
Fixes issue #38418 which, under certain circumstances, could cause StatefulSet to deadlock.
Mediates issue #36859. StatefulSet only acts on Pods whose identity matches the StatefulSet, providing a partial mediation for overlapping controllers.
```
k8s-github-robot pushed a commit that referenced this issue Mar 7, 2017
Automatic merge from submit-queue (batch tested with PRs 42080, 41653, 42598, 42555)

StatefulSet: Respect ControllerRef

**What this PR does / why we need it**:

This is part of the completion of the [ControllerRef](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md) proposal. It brings StatefulSet into full compliance with ControllerRef. See the individual commit messages for details.
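
The ownership check at the heart of "respecting ControllerRef" can be sketched with the apimachinery helper metav1.GetControllerOf (illustrative, not the PR's exact code; `isOwnedBy` is a hypothetical name):

```go
package statefulset

import (
	apps "k8s.io/api/apps/v1beta1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isOwnedBy reports whether the pod's controller owner reference points at
// this StatefulSet. With this check, two sets with overlapping selectors no
// longer fight: each acts only on the pods it actually owns.
func isOwnedBy(set *apps.StatefulSet, pod *v1.Pod) bool {
	ref := metav1.GetControllerOf(pod)
	return ref != nil && ref.UID == set.UID
}
```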

**Which issue this PR fixes**:

Fixes #36859

**Special notes for your reviewer**:

**Release note**:

```release-note
StatefulSet now respects ControllerRef to avoid fighting over Pods. At the time of upgrade, **you must not have StatefulSets with selectors that overlap** with any other controllers (such as ReplicaSets), or else [ownership of Pods may change](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md#upgrading).
```
cc @erictune @kubernetes/sig-apps-pr-reviews