
Update Tinkerbell Plugin to Support Kubernetes CRDs #1815

Closed
wants to merge 2 commits into from

Conversation

mohamed-rafraf
Member

What this PR does / why we need it:
This PR updates the Machine Controller's integration with Tinkerbell, moving from gRPC to Kubernetes CRDs following Tinkerbell's recent architectural changes. The Machine Controller can now manage Hardware resources directly in the Tinkerbell cluster by syncing the Hardware references specified in MachineDeployments and by automating the corresponding Workflows for machine provisioning.
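To illustrate the direction (not the exact code in this PR), here is a minimal sketch of the CRD-based flow, assuming a controller-runtime client pointed at the Tinkerbell cluster; the Tinkerbell import path and the naming scheme are assumptions:

package tinkerbell

import (
	"context"
	"fmt"

	tinkv1alpha1 "github.com/tinkerbell/tink/api/v1alpha1" // import path is an assumption
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrlruntimeclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// provisionSketch reads the Hardware object referenced by a MachineDeployment
// and creates a Workflow for it, instead of pushing data over the old gRPC API.
func provisionSketch(ctx context.Context, c ctrlruntimeclient.Client, hwRef types.NamespacedName, templateName string) error {
	hw := &tinkv1alpha1.Hardware{}
	if err := c.Get(ctx, hwRef, hw); err != nil {
		return fmt.Errorf("getting hardware %s: %w", hwRef.String(), err)
	}

	wf := &tinkv1alpha1.Workflow{
		ObjectMeta: metav1.ObjectMeta{
			Name:      hw.Name + "-provisioning", // placeholder naming scheme
			Namespace: hw.Namespace,
		},
		Spec: tinkv1alpha1.WorkflowSpec{
			TemplateRef: templateName,
			HardwareRef: hw.Name,
		},
	}
	return c.Create(ctx, wf)
}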

Which issue(s) this PR fixes:

Fixes #

What type of PR is this?

Special notes for your reviewer:

Does this PR introduce a user-facing change? Then add your Release Note here:

Tinkerbell plugin now supports Kubernetes CRDs

Documentation:

NONE

@kubermatic-bot kubermatic-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. docs/none Denotes a PR that doesn't need documentation (changes). dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. labels Jun 12, 2024
@kubermatic-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kron4eg for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubermatic-bot kubermatic-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 12, 2024
@mohamed-rafraf mohamed-rafraf changed the title Update baremetal provider and Tinkerbell plugin Update Tinkerbell Plugin to Support Kubernetes CRDs Jun 12, 2024
Member

@moadqassem moadqassem left a comment


@mohamed-rafraf Thanks a lot for your PR; it addresses quite a lot of the details we have discussed. However, I found a few points we can improve, and some of them are major points that need a more comprehensive implementation. I looked into the implementation closely this week and authored another PR that addresses these points, since we need this logic ASAP. Once you are back we can walk through these points one by one and address them with you. Thanks a lot for the fantastic job. You can find the other PR here: #1830

return errors.New("hardware should not be nil")
// SelectAvailableHardware selects an available hardware from the given list of hardware references
// that has an empty ID.
func (h *HardwareClient) SelectAvailableHardware(ctx context.Context, hardwareRefs []types.NamespacedName) (*tinkv1alpha1.Hardware, error) {
Member

Since machine deployments are 1:1 with hardware, we don't need this method: every machine deployment must have a reference to its hardware object. The idea is valid, but it is currently not needed.
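For illustration, a hedged sketch of the direct lookup suggested here (the function name and client wiring are placeholders; it assumes the same controller-runtime client and tinkv1alpha1 types used elsewhere in this PR, plus the usual context/fmt imports):

// getReferencedHardware fetches the one Hardware object the machine deployment
// references, rather than scanning a list for an unclaimed entry.
func getReferencedHardware(ctx context.Context, c ctrlruntimeclient.Client, ref types.NamespacedName) (*tinkv1alpha1.Hardware, error) {
	hw := &tinkv1alpha1.Hardware{}
	if err := c.Get(ctx, ref, hw); err != nil {
		return nil, fmt.Errorf("failed to get hardware %s/%s: %w", ref.Namespace, ref.Name, err)
	}
	return hw, nil
}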

if _, err := t.client.Push(ctx, &hardware.PushRequest{Data: h}); err != nil {
return fmt.Errorf("updating template in Tinkerbell: %w", err)
// CreateHardwareOnTinkCluster creates a hardware object on the Tinkerbell cluster.
func (h *HardwareClient) CreateHardwareOnTinkCluster(ctx context.Context, hardware *tinkv1alpha1.Hardware) error {
Member

The machine controller should not create any hardware objects in the Tinkerbell infra cluster (Tinkerbell cluster); we assume the hardware already exists.

Member Author

Yes, this function was used by the old approach.

if _, err := t.client.Push(ctx, &hardware.PushRequest{Data: h}); err != nil {
return fmt.Errorf("creating hardware in Tinkerbell: %w", err)
// Check if the ID is empty and return the hardware if it is
if hardware.Spec.Metadata.Instance.ID == "" {
Member

Not quite sure I get this. Why do we need to fetch the hardware by the instance ID? As discussed, machine deployments are 1:1 with the hardware based on the hardware reference and namespace, so there is no need for this.

Member Author

Yes, machine deployments are 1:1 with hardware objects. I am using the instance ID to ensure that the hardware object is claimed by its machine deployment and that no other machine deployment can use it.

In other words, a single hardware object cannot be used by two machine deployments.
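For clarity, a sketch of the claiming logic described above, assuming the tinkv1alpha1 field paths already used in this PR (function name is a placeholder; imports for context, errors, fmt, and the controller-runtime client are omitted):

// claimHardware writes the machine UID into the hardware's instance ID so that
// no other machine deployment can claim the same object.
func claimHardware(ctx context.Context, c ctrlruntimeclient.Client, hw *tinkv1alpha1.Hardware, machineUID string) error {
	if hw.Spec.Metadata == nil || hw.Spec.Metadata.Instance == nil {
		return errors.New("hardware metadata/instance should not be nil")
	}
	if hw.Spec.Metadata.Instance.ID != "" && hw.Spec.Metadata.Instance.ID != machineUID {
		return fmt.Errorf("hardware %s is already claimed by %s", hw.Name, hw.Spec.Metadata.Instance.ID)
	}
	hw.Spec.Metadata.Instance.ID = machineUID
	return c.Update(ctx, hw)
}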

Member

Sure. In Tinkerbell there is a field called state, similar to NetBox, which tells you whether a machine is reserved and running. If you want, we can schedule a call to discuss these points.
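A minimal sketch of what that check could look like, assuming the state field on the tinkv1alpha1 Hardware metadata; the exact state values are placeholders, not confirmed Tinkerbell constants:

// isHardwareAvailable inspects the Tinkerbell state field instead of the
// instance ID. "in_use" is a placeholder for a reserved/running state.
func isHardwareAvailable(hw *tinkv1alpha1.Hardware) bool {
	if hw.Spec.Metadata == nil {
		return false
	}
	return hw.Spec.Metadata.State != "in_use"
}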

method = t.client.ByIP
default:
return nil, errors.New("need to specify either id, ip, or mac")
func (h *HardwareClient) GetHardwareWithID(ctx context.Context, uid string) (*tinkv1alpha1.Hardware, error) {
Member

Again this method is not needed as we fetch the hardware based on the name and namespace!

}

template := &tinkv1alpha1.Template{}
// Check if template exists. Each machine deployment will have its own tinkerbell template.
Member

There should actually be only one template; anything that needs to be customized must be done via the workflow, not the template. Here is an example:

apiVersion: tinkerbell.org/v1alpha1
kind: Template
spec:
  data: |
    name: ubuntu
    tasks:
      - name: "os installation"
        worker: "{{.device_1}}"

And the workflow has the value for this placeholder:

apiVersion: "tinkerbell.org/v1alpha1"
kind: Workflow
spec:
  hardwareMap:
    device_1: 00:00:00:00:00:01

return nil, fmt.Errorf("failed to create workflow template: %w", err)
}
// Set the HardwareID with machine UID. The hardware object is claimed by the machine.
if err = d.HardwareClient.SetHardwareID(ctx, hardware, string(meta.UID)); err != nil {
Member

Before starting the provisioning process we need to check whether the hardware object already allows running workflows and enables iPXE booting; the reason is that we don't want to provision a machine by mistake. After the workflow runs, we would need to set these values on the hardware, either from the machine controller or via a mutating webhook in the tink stack that observes the machine deployment object and updates the hardware object based on its status (whether the machine has a node attached to it).
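A hedged sketch of that pre-provisioning guard, assuming the netboot fields on the tinkv1alpha1 Hardware interfaces (treat the field names as assumptions to be checked against the Tinkerbell API):

// hardwareReadyForProvisioning returns true only if at least one interface
// already permits iPXE booting and running workflows.
func hardwareReadyForProvisioning(hw *tinkv1alpha1.Hardware) bool {
	for _, iface := range hw.Spec.Interfaces {
		if iface.Netboot == nil || iface.Netboot.AllowPXE == nil || iface.Netboot.AllowWorkflow == nil {
			continue
		}
		if *iface.Netboot.AllowPXE && *iface.Netboot.AllowWorkflow {
			return true
		}
	}
	return false
}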

}

// Set the Hardware UserData to execute the userdata generated by OSM.
if err = d.HardwareClient.SetHardwareUserData(ctx, hardware, userdata); err != nil {
Member

Since the user data will be part of the workflow and gets baked into the template programmatically, this is not needed anymore.

if hw.Hardware.Metadata == "" {
return fmt.Errorf("tinkerbell hardware metadata can not be empty")
// Reset the hardware cloud-init userdata
if err := d.HardwareClient.SetHardwareUserData(ctx, targetHardware, ""); err != nil {
Member

Since the user data will be part of the workflow and gets baked into the template programmatically, this is not needed anymore.

},
Spec: tinkv1alpha1.WorkflowSpec{
TemplateRef: templateRef,
HardwareRef: hardware.GetName(),
Member

So we need to add more keys here, such as the destination path on the machine and the cloud-init script.
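For example, a sketch of how those extra keys could be passed through the workflow's hardware map (key names like dest_path and cloud_init are placeholders, and macAddress/userdata stand in for values computed elsewhere):

wf := &tinkv1alpha1.Workflow{
	Spec: tinkv1alpha1.WorkflowSpec{
		TemplateRef: templateRef,
		HardwareRef: hardware.GetName(),
		HardwareMap: map[string]string{
			"device_1":   macAddress, // as in the earlier Template example
			"dest_path":  "/dev/sda", // destination path on the machine (placeholder)
			"cloud_init": userdata,   // cloud-init script generated by OSM
		},
	},
}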

allowworkflow: false
clusterName: "<< CLUSTER_NAME >>"
osImageUrl: "<< OS_IMAGE_URL >>"
hegelUrl: "<< HEGEL_URL >>"
Member

This won't be needed anymore, since we are going to create and pass the cloud-init as part of the workflow.
