fix operator: correctly set head pod service account #276

Merged
1 commit merged into ray-project:master from davidxia:patch2 on Jun 8, 2022

Conversation

davidxia
Contributor

Why are these changes needed?

Currently the patch in #259 isn't working because a later line overwrites it with
the original, incorrect value. This change sets the head pod's service account
to the one configured in the RayCluster.
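
For illustration, a minimal sketch of the intended behavior under assumed names (the operator's actual pod-building code is more involved):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// applyHeadServiceAccount illustrates the intent of the fix: whatever service
// account is configured on the RayCluster's head group template must end up on
// the head pod and must not be clobbered by a later default further down in
// the pod-building code. Function and parameter names are illustrative only.
func applyHeadServiceAccount(podSpec *corev1.PodSpec, configuredServiceAccount string) {
	if configuredServiceAccount != "" {
		podSpec.ServiceAccountName = configuredServiceAccount
	}
}
```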

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Comment on lines 262 to 263
// If no service account is specified in the RayCluster,
// the head pod's service account should default to the RayCluster name.
Contributor Author

@Jeffwan lmk if this is the correct behavior
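
The defaulting rule the quoted comment describes would read roughly like this (a sketch only, not the actual controller code):

```go
package sketch

// headServiceAccountName sketches the defaulting rule under discussion: prefer
// the service account set in the RayCluster spec; otherwise fall back to the
// RayCluster's own name. Names are illustrative, not the actual kuberay code.
func headServiceAccountName(configuredServiceAccount, rayClusterName string) string {
	if configuredServiceAccount != "" {
		return configuredServiceAccount
	}
	return rayClusterName
}
```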

@davidxia
Contributor Author

Do you know why the compat tests are failing?

@davidxia
Contributor Author

I deployed this change to my K8s cluster and see this in the operator logs. I think the operator's inability to update the RayCluster instance's status field is unrelated to my change; I get the same error with kuberay/operator:e0de2bd.

2022-05-25T20:22:26.916Z	INFO	raycluster-controller	Read request instance not found error!
2022-05-25T20:22:33.275Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "hyperkube"}
2022-05-25T20:22:33.289Z	INFO	raycluster-controller	Role created successfully	{"role name": "hyperkube"}
2022-05-25T20:22:33.289Z	DEBUG	events	Normal	{"object": {"kind":"RayCluster","namespace":"hyperkube","name":"hyperkube","uid":"5a1f1b9f-a867-461d-a66b-c0ab2ed17c3d","apiVersion":"ray.io/v1alpha1","resourceVersion":"38203870"}, "reason": "Created", "message": "Created role hyperkube"}
2022-05-25T20:22:33.302Z	INFO	raycluster-controller	RoleBinding created successfully	{"role binding name": "hyperkube"}
2022-05-25T20:22:33.302Z	DEBUG	events	Normal	{"object": {"kind":"RayCluster","namespace":"hyperkube","name":"hyperkube","uid":"5a1f1b9f-a867-461d-a66b-c0ab2ed17c3d","apiVersion":"ray.io/v1alpha1","resourceVersion":"38203870"}, "reason": "Created", "message": "Created role binding hyperkube"}
2022-05-25T20:22:33.313Z	INFO	raycluster-controller	Pod Service created successfully	{"service name": "hyperkube-head-svc"}
2022-05-25T20:22:33.314Z	INFO	raycluster-controller	reconcilePods 	{"creating head pod for cluster": "hyperkube"}
2022-05-25T20:22:33.316Z	INFO	RayCluster-Controller	Setting pod namespaces	{"namespace": "hyperkube"}
2022-05-25T20:22:33.316Z	INFO	raycluster-controller	createHeadPod	{"head pod with name": "hyperkube-head-"}
2022-05-25T20:22:33.319Z	DEBUG	events	Normal	{"object": {"kind":"RayCluster","namespace":"hyperkube","name":"hyperkube","uid":"5a1f1b9f-a867-461d-a66b-c0ab2ed17c3d","apiVersion":"ray.io/v1alpha1","resourceVersion":"38203870"}, "reason": "Created", "message": "Created service hyperkube-head-svc"}
2022-05-25T20:22:33.350Z	INFO	raycluster-controller	reconcilePods	{"add workers for group": "cpu-worker"}
2022-05-25T20:22:33.350Z	INFO	raycluster-controller	reconcilePods	{"creating worker for group": "cpu-worker", "index 0": "in total 1"}
2022-05-25T20:22:33.350Z	INFO	RayCluster-Controller	Setting pod namespaces	{"namespace": "hyperkube"}
2022-05-25T20:22:33.351Z	DEBUG	events	Normal	{"object": {"kind":"RayCluster","namespace":"hyperkube","name":"hyperkube","uid":"5a1f1b9f-a867-461d-a66b-c0ab2ed17c3d","apiVersion":"ray.io/v1alpha1","resourceVersion":"38203870"}, "reason": "Created", "message": "Created head pod hyperkube-head-ltwjz"}
2022-05-25T20:22:33.377Z	INFO	raycluster-controller	Created pod	{"Pod ": "hyperkube-worker-cpu-worker-"}
2022-05-25T20:22:33.377Z	DEBUG	events	Normal	{"object": {"kind":"RayCluster","namespace":"hyperkube","name":"hyperkube","uid":"5a1f1b9f-a867-461d-a66b-c0ab2ed17c3d","apiVersion":"ray.io/v1alpha1","resourceVersion":"38203870"}, "reason": "Created", "message": "Created worker pod "}
2022-05-25T20:22:33.403Z	INFO	raycluster-controller	reconciling RayCluster	{"cluster name": "hyperkube"}
2022-05-25T20:22:33.403Z	INFO	controllers.RayCluster	reconcileServices 	{"head service found": "hyperkube-head-svc"}
2022-05-25T20:22:33.403Z	INFO	raycluster-controller	reconcilePods 	{"head pod found": "hyperkube-head-ltwjz"}
2022-05-25T20:22:33.403Z	INFO	raycluster-controller	reconcilePods	{"head pod is up and running... checking workers": "hyperkube-head-ltwjz"}
2022-05-25T20:22:33.403Z	INFO	raycluster-controller	reconcilePods	{"all workers already exist for group": "cpu-worker"}
2022-05-25T20:22:33.475Z	ERROR	raycluster-controller	Update status error	{"cluster name": "hyperkube", "error": "Operation cannot be fulfilled on rayclusters.ray.io \"hyperkube\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/ray-project/kuberay/ray-operator/controllers/raycluster.(*RayClusterReconciler).Reconcile
	/workspace/controllers/raycluster/raycluster_controller.go:124
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
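
The `Update status error` at the end is the API server's standard optimistic-concurrency conflict. Independent of this PR, a common way controllers handle it is client-go's conflict-retry helper; a minimal sketch, with hypothetical function names:

```go
package sketch

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateStatusWithRetry shows the usual remedy for that conflict: re-fetch the
// latest version of the object and retry the status update. This is a generic
// client-go/controller-runtime pattern, not code from this PR.
func updateStatusWithRetry(ctx context.Context, c client.Client, key client.ObjectKey, obj client.Object, mutate func()) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the object so the update carries the latest resourceVersion.
		if err := c.Get(ctx, key, obj); err != nil {
			return err
		}
		mutate()
		return c.Status().Update(ctx, obj)
	})
}
```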

@Jeffwan
Member

Jeffwan commented May 26, 2022

From the CI logs, it looks like the operator failed to create the resources, so the assert command failed (it couldn't find them).

The logs you shared look good to me. It seems the workers have been created, but I'm not sure whether they're in the expected status. If you can't figure it out, I'll dump it to my local env and give it a try.

@davidxia
Contributor Author

If you can't figure it out, I'll dump it to my local env and give it a try.

Yes, that would be great! I retriggered the tests to see if it was a fluke or if they still fail with the same error.

@davidxia
Contributor Author

The failure is caused by my change. Here's a no-op change that passes. I'm not sure why this line breaks the tests, though.

@Jeffwan
Member

Jeffwan commented Jun 1, 2022

Sorry for the delay, I didn't get a chance to check this in the past few days. @davidxia could you rebase the changes and I'll check it again?

@Jeffwan
Member

Jeffwan commented Jun 1, 2022

I manually applied your changes to master and noticed the following error. I think that's why it can't launch the cluster and pass the test. We should limit the scope to autoscaler-enabled cases only.

2022-06-01T14:49:25.702+0800	ERROR	controller.raycluster-controller	Reconciler error	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "name": "raycluster-complete", "namespace": "ray-system", "error": "pods \"raycluster-complete-head-\" is forbidden: error looking up service account ray-system/raycluster-complete: serviceaccount \"raycluster-complete\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
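
The scoping suggested here would look roughly like the sketch below; the helper and parameter names are assumptions, not the actual kuberay code:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// defaultHeadServiceAccount sketches the suggested scope control: only default
// the head pod's service account to the RayCluster name when in-tree
// autoscaling is enabled, since that is the case where the operator creates a
// matching ServiceAccount/Role/RoleBinding for the autoscaler.
func defaultHeadServiceAccount(podSpec *corev1.PodSpec, rayClusterName string, autoscalingEnabled bool) {
	if podSpec.ServiceAccountName != "" {
		// Respect an explicitly configured service account.
		return
	}
	if autoscalingEnabled {
		podSpec.ServiceAccountName = rayClusterName
	}
}
```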

@Jeffwan
Member

Jeffwan commented Jun 1, 2022

To test your changes end to end locally, you can start Docker Desktop or kind, then follow these steps:

  1. make build
  2. ./bin/manager -kubeconfig ~/.kube/config --enable-leader-election=false
  3. create a ray cluster

I will close #277 then

@Jeffwan Jeffwan mentioned this pull request Jun 1, 2022
Currently the patch in ray-project#259 isn't working because a later line overwrites with
the original bad value. This change sets the head pod's service account
to the one configured in the RayCluster.
@davidxia
Contributor Author

davidxia commented Jun 6, 2022

@Jeffwan thanks! All tests pass now.

@Jeffwan Jeffwan merged commit 7a6f903 into ray-project:master Jun 8, 2022
@davidxia davidxia deleted the patch2 branch June 19, 2022 13:39
davidxia added a commit to davidxia/kuberay that referenced this pull request Jun 19, 2022
instead of always using RayCluster name. Default to RayCluster
name if the ServiceAccount doesn't exist.

This is a follow up fix after
ray-project#259 and
ray-project#276.

The use case is when using the RayCluster autoscaler. The autoscaler
will use the head group's K8s ServiceAccount to make K8s API requests.
Always using the RayCluster name as the RoleBinding's subject name won't
be correct and will result in the autoscaler sidecar container not being
authorized by the K8s API.
DmitriGekhtman pushed a commit that referenced this pull request Jun 21, 2022
instead of always using RayCluster name. Default to RayCluster
name if the ServiceAccount doesn't exist.

This is a follow up fix after
#259 and
#276.

The use case is when using the RayCluster autoscaler. The autoscaler
will use the head group's K8s ServiceAccount to make K8s API requests.
Always using the RayCluster name as the RoleBinding's subject name won't
be correct and will result in the autoscaler sidecar container not being
authorized by the K8s API.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Currently the patch in ray-project#259 isn't working because a later line overwrites with
the original bad value. This change sets the head pod's service account
to the one configured in the RayCluster.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…roject#315)

instead of always using RayCluster name. Default to RayCluster
name if the ServiceAccount doesn't exist.

This is a follow up fix after
ray-project#259 and
ray-project#276.

The use case is when using the RayCluster autoscaler. The autoscaler
will use the head group's K8s ServiceAccount to make K8s API requests.
Always using the RayCluster name as the RoleBinding's subject name won't
be correct and will result in the autoscaler sidecar container not being
authorized by the K8s API.
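
Taken together, the follow-up change described in these referenced commits amounts to roughly the following lookup (a sketch under assumed names and shape, not the actual operator code):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// roleBindingSubjectName sketches the follow-up fix: bind the Role to the head
// group's configured ServiceAccount so the autoscaler sidecar is authorized,
// falling back to the RayCluster name only when no service account is set or
// the named ServiceAccount cannot be found.
func roleBindingSubjectName(ctx context.Context, c client.Client, namespace, rayClusterName, headServiceAccount string) string {
	if headServiceAccount == "" {
		return rayClusterName
	}
	var sa corev1.ServiceAccount
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: headServiceAccount}, &sa); err != nil {
		// ServiceAccount missing or unreadable: default to the cluster name.
		return rayClusterName
	}
	return headServiceAccount
}
```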