Add liveness and readness probe for jupyter notebook #5292

MartinForReal · 2020-09-09T04:09:57Z

Add liveness and readness probe for jupyter notebook
Fixed: #5282
Signed-off-by: MartinForReal fanshangxiang@gmail.com

google-cla · 2020-09-09T04:10:00Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

kubeflow-bot · 2020-09-09T04:10:03Z

This change is

k8s-ci-robot · 2020-09-09T04:10:15Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign stefanofioravanzo
You can assign the PR to them by writing /assign @stefanofioravanzo in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

components/notebook-controller/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-cla · 2020-09-09T04:11:27Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

MartinForReal · 2020-09-09T04:12:01Z

@googlebot I fixed it.

google-cla · 2020-09-09T04:12:04Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

thesuperzapper · 2020-09-09T04:48:10Z

@MartinForReal Is this disable-able/configurable for non-Jupyter docker images?

Or even just JupyterHub ones?

Signed-off-by: MartinForReal <fanshangxiang@gmail.com>

MartinForReal · 2020-09-09T04:49:30Z

/assign @StefanoFioravanzo

MartinForReal · 2020-09-09T05:06:12Z

@thesuperzapper "/notebook/" + instance.Namespace + "/" + instance.Name + "/api/status" is the default URI generated by notebook controller. No matter what the underlying image is, the container should launch a Jupyter notebook instance. /api/status is a standard API provided by Jupiter notebook. So I think it should work well with all kinds of Jupyter instance. The notebook culling feature in the notebook controller is a good example.

I can add one extra env flag for this, just like what ENABLE_CULLING did.

How do you think?

thesuperzapper · 2020-09-09T05:20:40Z

@MartinForReal, I am talking about when the image launched by the notebook controller is NOT a Jupyter image, for example: RStudio or Theia.

It would be good if we can at least prepare for that, as this change would cause those images to perpetually restart.

MartinForReal · 2020-09-09T06:31:08Z

Why don't we create another controller for RStudio or Theia? @thesuperzapper

MartinForReal · 2020-09-09T07:05:47Z

/test kubeflow-presubmit

thesuperzapper · 2020-09-09T07:23:16Z

@MartinForReal, I am fairly sure we can just generalise the existing notebook-controller to support any docker image which exposes a web socket.

Currently you have to hack it a bit for non-jupyter, but it does work as long as you:

use port 8888
embed some HTTP path rewriting for ./notebook/{namespace}/{notebook-name}/
ensure the HTTP header has: Access-Control-Allow-Origin: *

We can make this way easier if we:

make the target port configurable
manage the HTTP path rewriting OUTSIDE the docker image:
- like with the istio gateway OR a basic sidecar which just forwards the traffic to the docker image
inject the Access-Control-Allow-Origin: * header
remove the hardcoding to use /home/jovyan if no mount point is specified (probably just fail instead)
don't implement things like this PR, without making them general enough to handle all types of docker image:
- not all docker images expose something at /api/status

MartinForReal · 2020-09-09T08:10:59Z

Well, it is interesting but I'm afraid I have different ideas.

I think there are a lot of solutions available which help us create three objects ( one statefulset, one service, and one virtualservice) such as helm, oam and etc. And they are generic enough so that we can apply them without any modification to our apps. To be more specific: what are the pros and cons when this solution is compared with helm operator + customized helm charts?

By implementing the controller, we can customize the resource lifecycle to support Jupiter operations, especially day2 operations.

thesuperzapper · 2020-09-09T08:18:47Z

@MartinForReal when I say "support any docker image which exposes a web socket" I am actually meaning any "web-based IDE"

I think it would be crazy for us not to consolidate this into a single controller, as the code duplication would be crazy.

For example, I believe these are the IDE's we need to support:

MartinForReal · 2020-09-09T08:29:58Z

But I don't think we can find a general solution that provides monitoring, error handling, authentication without specific customization. And we can extract common libraries from these implementations.

I don't think code duplication issue does exist.

thesuperzapper · 2020-09-09T08:49:10Z

@MartinForReal why not put a notebookType field in the Notebook CRD, with values like:

jupyter
rstudio
theia
generic

The setting of notebookType would control all the IDE specific features, like idle detection and liveness checks.

Why do you need a new controller, and therefore CRD for each notebook type?

gilbeckers · 2020-09-09T09:39:24Z

I'm also trying to fix this problem, see #5282

Instead of adding the readiness probe to the notebook container via the notebook controller, we were thinking to do this using a new custom admission controller. The idea is that at the creation of a notebook pod, the admission control would jump in and mutate the pod as desired.

The admission controller is configured via a MutatingWebhookConfiguration to act on the creation of Pods. Then, at create, the admission controller filters the right pod and only allows notebook pods to pass. The next step is to find the right container, as there are two (isto-proxy and the container running the nb image): this is done by looking at the environment variables. The nb container has a var named "NB_PREFIX", The final step is to add the readiness probe to the container.

MartinForReal · 2020-09-09T09:50:06Z

@thesuperzapper I don't see any improvement when it is compared with the new controller. because it makes validation logic and control logic much more complicated and we also need to add extra logic to query different kinds of instance. Cluster-API is a good example of this kind of design. But you can find that It is difficult to validate cr and implement providers. And there is no point in making it general because we can not share any code except object creation.

The setting of notebookType would control all the IDE specific features, like idle detection and liveness checks.

thesuperzapper · 2020-09-09T10:17:05Z

@MartinForReal but if you think about it from a user perspective, having a new resource type for each new type of IDE is cumbersome, as
all the resource configs are the same, like volumes, name, etc. (Only the UI itself is different)

Having multiple Resource types also causes more pain in the frontend UI design, and just pushes all the complexity into the JavaScript/Python app.

Under your design (multiple controllers) how would you propopose supporting a "generic" IDE type which could be used by Kubeflow admins for whatever internal tool they might have?

Finally, I believe there is some code which is shared, especially complicated logic like:

"is there currently someone using a pod"
"when was the last time someone accessed the UI of the pod"

MartinForReal · 2020-09-09T11:36:21Z

@thesuperzapper well, I can hardly agree with you because daemonset, statefulset, and deployment all have a field which is the type of corev1.podSpec But it doesn't mean they should be managed by the same controller. and I think it is a common practise to apply Backend for FrontEnd pattern here. And we have graphql and other tools and technology which can simplify works in front end.

And I highly doubled that requirement is realistic because the user does care what kind of ide they are using. And kubeflow admin always manage objects using labeling strategies.

thesuperzapper · 2020-09-09T12:44:36Z

@MartinForReal regardless of how we manage the Notebook, please make this PR not break non-jupyter docker images.

thesuperzapper · 2020-09-14T07:54:58Z

@MartinForReal, after thinking about it more, you are partially right but I think I have an idea which will work for both of us.

We can create an annotation for the Notebook resource called kubeflow-resource-type:

initially kubeflow-resource-type should support two values: jupyter and generic
when it is jupyter
- we should enable features like injecting liveness probes (for example, like this PR)
when it is generic:
- we disable injection of jupyter specific features
NOTE: this field has an implied default value of jupyter when missing for backwards compatibility

Note there is already precedent for annotations like this, see kubeflow-resource-disabled.

If you do this, I can implement a user friendly way for Kubeflow administrators to provide non-jupyter images in the jupyter-web-app spawner UI.

thesuperzapper · 2020-09-14T07:56:16Z

@kimwnasptd you might want to chip in

kimwnasptd · 2020-09-14T15:11:11Z

First off, supporting more types of Notebooks is something we want to put effort in for Kubeflow 1.2 and forward.

FYI we are moving towards abstracting the way we deploy our apps in Kubeflow. In the near future we'll be discussing on what technical details to abstract in order to deploy a stateful application on Kubeflow, such as a Notebook, an IDE, Tensorboard etc, via a more generic CutomResource.

But until then, for this PR I also aggre with @thesuperzapper to not introduce extra functionality that only works for Jupyter. What I propose is:

lets use a notebook-type: {jupyter,theia,rstudio} annotation in a Notebook CR to express the type of the Notebook. In case it is missing it will default to jupyter to maintain backward compatibility
if it's jupyter then we'll use the path @MartinForReal has already implemented in this PR
if it's not jupyter then the controller will log that it doesn't support setting readiness/liveness probes for the provided type of Notebook.

This way we move forward with @MartinForReal's implementation, but we also allow for other people to add support for other types of Notebooks in the future.

@MartinForReal could you update the code to align with the proposed plan?

yanniszark · 2020-09-14T18:19:07Z

Hi all, this is a very interesting issue!
I have some questions regarding the liveness probe. Namely:

Will the liveness probe interfere with culling? AFAIK, probing is performed by the kubelet and traffic passes through the Istio sidecar container, so it probably counts as traffic. Maybe an exec liveness probe would be better here, as it would go through localhost?
Right now, if the notebook server process crashes, tini will terminate and the Pod will be restarted. Do we catch any more cases if we add a liveness probe?

gilbeckers · 2020-09-16T08:30:09Z

components/notebook-controller/controllers/notebook_controller.go

@@ -323,6 +322,36 @@ func generateStatefulSet(instance *v1beta1.Notebook) *appsv1.StatefulSet {
 			},
 		}
 	}
+	if len(container.Ports) == 1 {


Here you assume there is only one container running in the pod, while in fact there are two; istio-proxy and notebook container. As result, the liveness- and readiness probe will only be added to the istio-proxy container and the notebook container will stay untouched.

istio-proxy is added by istio sidecar injector. And only when pod is created, the istio sidecar injector can add new containers.
The controller watches the change of notebook resource and creates the correlated statefulset. The controller can only find one container when it generate statefulset definitions. FYI

We could also target the port based off the name notebook-port

Oops, sorry I did't know that 😊. Thanks for the information

vinayan3 · 2022-01-31T22:35:16Z

@yanniszark Is there still an appetite for landing this? I just encoutered a case where the notebook's server crashed on startup and never came alive. It'd be great to have a liveness probe.

google-cla bot added the cla: no label Sep 9, 2020

k8s-ci-robot added the size/L label Sep 9, 2020

k8s-ci-robot requested review from StefanoFioravanzo and yanniszark September 9, 2020 04:10

add liveness and readness probe for notebook instance

1b6c573

Signed-off-by: MartinForReal <fanshangxiang@gmail.com>

google-cla bot added cla: yes and removed cla: no labels Sep 9, 2020

k8s-ci-robot assigned StefanoFioravanzo Sep 9, 2020

gilbeckers reviewed Sep 16, 2020

View reviewed changes

This was referenced Sep 17, 2020

Notebook controller is only checking the status of one (the first) container #5282

Closed

Correct ContainerStatus of Notebook CR #5314

Merged

Status icon of newly created notebook is not shown correct in Kubeflow dashboard #5317

Closed

MartinForReal closed this Dec 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add liveness and readness probe for jupyter notebook #5292

Add liveness and readness probe for jupyter notebook #5292

MartinForReal commented Sep 9, 2020

google-cla bot commented Sep 9, 2020

kubeflow-bot commented Sep 9, 2020

k8s-ci-robot commented Sep 9, 2020

google-cla bot commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

google-cla bot commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

gilbeckers commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

thesuperzapper commented Sep 14, 2020

thesuperzapper commented Sep 14, 2020

kimwnasptd commented Sep 14, 2020

yanniszark commented Sep 14, 2020 •

edited

Loading

gilbeckers Sep 16, 2020

MartinForReal Sep 17, 2020

thesuperzapper Sep 17, 2020

gilbeckers Sep 17, 2020

vinayan3 commented Jan 31, 2022

Add liveness and readness probe for jupyter notebook #5292

Add liveness and readness probe for jupyter notebook #5292

Conversation

MartinForReal commented Sep 9, 2020

google-cla bot commented Sep 9, 2020

kubeflow-bot commented Sep 9, 2020

k8s-ci-robot commented Sep 9, 2020

google-cla bot commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

google-cla bot commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

gilbeckers commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

MartinForReal commented Sep 9, 2020

thesuperzapper commented Sep 9, 2020

thesuperzapper commented Sep 14, 2020

thesuperzapper commented Sep 14, 2020

kimwnasptd commented Sep 14, 2020

yanniszark commented Sep 14, 2020 • edited Loading

gilbeckers Sep 16, 2020

Choose a reason for hiding this comment

MartinForReal Sep 17, 2020

Choose a reason for hiding this comment

thesuperzapper Sep 17, 2020

Choose a reason for hiding this comment

gilbeckers Sep 17, 2020

Choose a reason for hiding this comment

vinayan3 commented Jan 31, 2022

yanniszark commented Sep 14, 2020 •

edited

Loading