Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add liveness and readness probe for jupyter notebook #5292

Closed
wants to merge 1 commit into from
Closed

Add liveness and readness probe for jupyter notebook #5292

wants to merge 1 commit into from

Conversation

MartinForReal
Copy link
Member

Add liveness and readness probe for jupyter notebook
Fixed: #5282
Signed-off-by: MartinForReal fanshangxiang@gmail.com

@google-cla
Copy link

google-cla bot commented Sep 9, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added the cla: no label Sep 9, 2020
@kubeflow-bot
Copy link
Contributor

This change is Reviewable

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign stefanofioravanzo
You can assign the PR to them by writing /assign @stefanofioravanzo in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-cla
Copy link

google-cla bot commented Sep 9, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@MartinForReal
Copy link
Member Author

@googlebot I fixed it.

@google-cla
Copy link

google-cla bot commented Sep 9, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@thesuperzapper
Copy link
Member

@MartinForReal Is this disable-able/configurable for non-Jupyter docker images?

Or even just JupyterHub ones?

Signed-off-by: MartinForReal <fanshangxiang@gmail.com>
@google-cla google-cla bot added cla: yes and removed cla: no labels Sep 9, 2020
@MartinForReal
Copy link
Member Author

/assign @StefanoFioravanzo

@MartinForReal
Copy link
Member Author

@thesuperzapper "/notebook/" + instance.Namespace + "/" + instance.Name + "/api/status" is the default URI generated by notebook controller. No matter what the underlying image is, the container should launch a Jupyter notebook instance. /api/status is a standard API provided by Jupiter notebook. So I think it should work well with all kinds of Jupyter instance. The notebook culling feature in the notebook controller is a good example.

I can add one extra env flag for this, just like what ENABLE_CULLING did.

How do you think?

@thesuperzapper
Copy link
Member

@MartinForReal, I am talking about when the image launched by the notebook controller is NOT a Jupyter image, for example: RStudio or Theia.

It would be good if we can at least prepare for that, as this change would cause those images to perpetually restart.

@MartinForReal
Copy link
Member Author

Why don't we create another controller for RStudio or Theia? @thesuperzapper

@MartinForReal
Copy link
Member Author

/test kubeflow-presubmit

@thesuperzapper
Copy link
Member

@MartinForReal, I am fairly sure we can just generalise the existing notebook-controller to support any docker image which exposes a web socket.

Currently you have to hack it a bit for non-jupyter, but it does work as long as you:

  1. use port 8888
  2. embed some HTTP path rewriting for ./notebook/{namespace}/{notebook-name}/
  3. ensure the HTTP header has: Access-Control-Allow-Origin: *

We can make this way easier if we:

  1. make the target port configurable
  2. manage the HTTP path rewriting OUTSIDE the docker image:
    • like with the istio gateway OR a basic sidecar which just forwards the traffic to the docker image
  3. inject the Access-Control-Allow-Origin: * header
  4. remove the hardcoding to use /home/jovyan if no mount point is specified (probably just fail instead)
  5. don't implement things like this PR, without making them general enough to handle all types of docker image:
    • not all docker images expose something at /api/status

@MartinForReal
Copy link
Member Author

Well, it is interesting but I'm afraid I have different ideas.

I think there are a lot of solutions available which help us create three objects ( one statefulset, one service, and one virtualservice) such as helm, oam and etc. And they are generic enough so that we can apply them without any modification to our apps. To be more specific: what are the pros and cons when this solution is compared with helm operator + customized helm charts?

By implementing the controller, we can customize the resource lifecycle to support Jupiter operations, especially day2 operations.

@thesuperzapper
Copy link
Member

@MartinForReal when I say "support any docker image which exposes a web socket" I am actually meaning any "web-based IDE"

I think it would be crazy for us not to consolidate this into a single controller, as the code duplication would be crazy.

For example, I believe these are the IDE's we need to support:

@MartinForReal
Copy link
Member Author

But I don't think we can find a general solution that provides monitoring, error handling, authentication without specific customization. And we can extract common libraries from these implementations.

I don't think code duplication issue does exist.

@thesuperzapper
Copy link
Member

@MartinForReal why not put a notebookType field in the Notebook CRD, with values like:

  • jupyter
  • rstudio
  • theia
  • generic

The setting of notebookType would control all the IDE specific features, like idle detection and liveness checks.

Why do you need a new controller, and therefore CRD for each notebook type?

@gilbeckers
Copy link
Contributor

I'm also trying to fix this problem, see #5282

Instead of adding the readiness probe to the notebook container via the notebook controller, we were thinking to do this using a new custom admission controller. The idea is that at the creation of a notebook pod, the admission control would jump in and mutate the pod as desired.

The admission controller is configured via a MutatingWebhookConfiguration to act on the creation of Pods. Then, at create, the admission controller filters the right pod and only allows notebook pods to pass. The next step is to find the right container, as there are two (isto-proxy and the container running the nb image): this is done by looking at the environment variables. The nb container has a var named "NB_PREFIX", The final step is to add the readiness probe to the container.

@MartinForReal
Copy link
Member Author

@thesuperzapper I don't see any improvement when it is compared with the new controller. because it makes validation logic and control logic much more complicated and we also need to add extra logic to query different kinds of instance. Cluster-API is a good example of this kind of design. But you can find that It is difficult to validate cr and implement providers. And there is no point in making it general because we can not share any code except object creation.

The setting of notebookType would control all the IDE specific features, like idle detection and liveness checks.

@thesuperzapper
Copy link
Member

@MartinForReal but if you think about it from a user perspective, having a new resource type for each new type of IDE is cumbersome, as
all the resource configs are the same, like volumes, name, etc. (Only the UI itself is different)

Having multiple Resource types also causes more pain in the frontend UI design, and just pushes all the complexity into the JavaScript/Python app.

Under your design (multiple controllers) how would you propopose supporting a "generic" IDE type which could be used by Kubeflow admins for whatever internal tool they might have?

Finally, I believe there is some code which is shared, especially complicated logic like:

  • "is there currently someone using a pod"
  • "when was the last time someone accessed the UI of the pod"

@MartinForReal
Copy link
Member Author

@thesuperzapper well, I can hardly agree with you because daemonset, statefulset, and deployment all have a field which is the type of corev1.podSpec But it doesn't mean they should be managed by the same controller. and I think it is a common practise to apply Backend for FrontEnd pattern here. And we have graphql and other tools and technology which can simplify works in front end.

And I highly doubled that requirement is realistic because the user does care what kind of ide they are using. And kubeflow admin always manage objects using labeling strategies.

@thesuperzapper
Copy link
Member

@MartinForReal regardless of how we manage the Notebook, please make this PR not break non-jupyter docker images.

@thesuperzapper
Copy link
Member

@MartinForReal, after thinking about it more, you are partially right but I think I have an idea which will work for both of us.

We can create an annotation for the Notebook resource called kubeflow-resource-type:

  • initially kubeflow-resource-type should support two values: jupyter and generic
  • when it is jupyter
    • we should enable features like injecting liveness probes (for example, like this PR)
  • when it is generic:
    • we disable injection of jupyter specific features
  • NOTE: this field has an implied default value of jupyter when missing for backwards compatibility

Note there is already precedent for annotations like this, see kubeflow-resource-disabled.


If you do this, I can implement a user friendly way for Kubeflow administrators to provide non-jupyter images in the jupyter-web-app spawner UI.

@thesuperzapper
Copy link
Member

@kimwnasptd you might want to chip in

@kimwnasptd
Copy link
Member

First off, supporting more types of Notebooks is something we want to put effort in for Kubeflow 1.2 and forward.

FYI we are moving towards abstracting the way we deploy our apps in Kubeflow. In the near future we'll be discussing on what technical details to abstract in order to deploy a stateful application on Kubeflow, such as a Notebook, an IDE, Tensorboard etc, via a more generic CutomResource.

But until then, for this PR I also aggre with @thesuperzapper to not introduce extra functionality that only works for Jupyter. What I propose is:

  • lets use a notebook-type: {jupyter,theia,rstudio} annotation in a Notebook CR to express the type of the Notebook. In case it is missing it will default to jupyter to maintain backward compatibility
  • if it's jupyter then we'll use the path @MartinForReal has already implemented in this PR
  • if it's not jupyter then the controller will log that it doesn't support setting readiness/liveness probes for the provided type of Notebook.

This way we move forward with @MartinForReal's implementation, but we also allow for other people to add support for other types of Notebooks in the future.

@MartinForReal could you update the code to align with the proposed plan?

@yanniszark
Copy link
Contributor

yanniszark commented Sep 14, 2020

Hi all, this is a very interesting issue!
I have some questions regarding the liveness probe. Namely:

  • Will the liveness probe interfere with culling? AFAIK, probing is performed by the kubelet and traffic passes through the Istio sidecar container, so it probably counts as traffic. Maybe an exec liveness probe would be better here, as it would go through localhost?
  • Right now, if the notebook server process crashes, tini will terminate and the Pod will be restarted. Do we catch any more cases if we add a liveness probe?

@@ -323,6 +322,36 @@ func generateStatefulSet(instance *v1beta1.Notebook) *appsv1.StatefulSet {
},
}
}
if len(container.Ports) == 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you assume there is only one container running in the pod, while in fact there are two; istio-proxy and notebook container. As result, the liveness- and readiness probe will only be added to the istio-proxy container and the notebook container will stay untouched.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

istio-proxy is added by istio sidecar injector. And only when pod is created, the istio sidecar injector can add new containers.
The controller watches the change of notebook resource and creates the correlated statefulset. The controller can only find one container when it generate statefulset definitions. FYI

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also target the port based off the name notebook-port

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry I did't know that 😊. Thanks for the information

@vinayan3
Copy link

@yanniszark Is there still an appetite for landing this? I just encoutered a case where the notebook's server crashed on startup and never came alive. It'd be great to have a liveness probe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Notebook controller is only checking the status of one (the first) container
9 participants