
Update Tinkerbell Plugin to Support Kubernetes CRDs #1815

Closed
wants to merge 2 commits into from

Conversation

mohamed-rafraf
Member

What this PR does / why we need it:
This PR updates the Machine Controller's integration with Tinkerbell, moving from gRPC to Kubernetes CRDs following Tinkerbell's recent architectural changes. The Machine Controller can now manage Hardware resources directly in the Tinkerbell cluster by syncing the Hardware references specified in MachineDeployments and by automating the corresponding Workflows for machine provisioning.
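To illustrate the direction (not the exact code in this PR), here is a minimal sketch of the CRD-based flow, assuming a controller-runtime client pointed at the Tinkerbell cluster; the Tinkerbell import path and the naming scheme are assumptions:

package tinkerbell

import (
	"context"
	"fmt"

	tinkv1alpha1 "github.com/tinkerbell/tink/api/v1alpha1" // import path is an assumption
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrlruntimeclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// provisionSketch reads the Hardware object referenced by a MachineDeployment
// and creates a Workflow for it, instead of pushing data over the old gRPC API.
func provisionSketch(ctx context.Context, c ctrlruntimeclient.Client, hwRef types.NamespacedName, templateName string) error {
	hw := &tinkv1alpha1.Hardware{}
	if err := c.Get(ctx, hwRef, hw); err != nil {
		return fmt.Errorf("getting hardware %s: %w", hwRef.String(), err)
	}

	wf := &tinkv1alpha1.Workflow{
		ObjectMeta: metav1.ObjectMeta{
			Name:      hw.Name + "-provisioning", // placeholder naming scheme
			Namespace: hw.Namespace,
		},
		Spec: tinkv1alpha1.WorkflowSpec{
			TemplateRef: templateName,
			HardwareRef: hw.Name,
		},
	}
	return c.Create(ctx, wf)
}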

Which issue(s) this PR fixes:

Fixes #

What type of PR is this?

Special notes for your reviewer:

Does this PR introduce a user-facing change? Then add your Release Note here:

Tinkerbell plugin now supports Kubernetes CRDs

Documentation:

NONE

@kubermatic-bot kubermatic-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. docs/none Denotes a PR that doesn't need documentation (changes). dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. labels Jun 12, 2024
@kubermatic-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kron4eg for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubermatic-bot kubermatic-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 12, 2024
@mohamed-rafraf mohamed-rafraf changed the title Update baremetal provider and Tinkerbell plugin Update Tinkerbell Plugin to Support Kubernetes CRDs Jun 12, 2024
Member

@moadqassem moadqassem left a comment


@mohamed-rafraf Thanks a lot for your PR; it addresses quite a lot of the details we have discussed. However, I found a few points we can improve, and some of them are major points that need a more comprehensive implementation. I looked into the implementation closely this week and authored another PR that addresses these points, since we need this logic ASAP. Once you are back we can walk through these points one by one and address them with you. Thanks a lot for the fantastic job. You can find the other PR here: #1830

return errors.New("hardware should not be nil")
// SelectAvailableHardware selects an available hardware from the given list of hardware references
// that has an empty ID.
func (h *HardwareClient) SelectAvailableHardware(ctx context.Context, hardwareRefs []types.NamespacedName) (*tinkv1alpha1.Hardware, error) {
Member

Since machine deployments are 1:1 with hardware, we don't need this method: every machine deployment must have a reference to its hardware object. The idea is valid, but it is currently not needed.
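For illustration, a hedged sketch of the direct lookup suggested here (the function name and client wiring are placeholders; it assumes the same controller-runtime client and tinkv1alpha1 types used elsewhere in this PR, plus the usual context/fmt imports):

// getReferencedHardware fetches the one Hardware object the machine deployment
// references, rather than scanning a list for an unclaimed entry.
func getReferencedHardware(ctx context.Context, c ctrlruntimeclient.Client, ref types.NamespacedName) (*tinkv1alpha1.Hardware, error) {
	hw := &tinkv1alpha1.Hardware{}
	if err := c.Get(ctx, ref, hw); err != nil {
		return nil, fmt.Errorf("failed to get hardware %s/%s: %w", ref.Namespace, ref.Name, err)
	}
	return hw, nil
}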

if _, err := t.client.Push(ctx, &hardware.PushRequest{Data: h}); err != nil {
return fmt.Errorf("updating template in Tinkerbell: %w", err)
// CreateHardwareOnTinkCluster creates a hardware object on the Tinkerbell cluster.
func (h *HardwareClient) CreateHardwareOnTinkCluster(ctx context.Context, hardware *tinkv1alpha1.Hardware) error {
Member

The machine controller should not create any hardware objects in the Tinkerbell infra cluster (Tinkerbell cluster); we assume the hardware already exists.

Member Author

Yes, this function was used by the old approach.

if _, err := t.client.Push(ctx, &hardware.PushRequest{Data: h}); err != nil {
return fmt.Errorf("creating hardware in Tinkerbell: %w", err)
// Check if the ID is empty and return the hardware if it is
if hardware.Spec.Metadata.Instance.ID == "" {
Member

Not quite sure I get this. Why do we need to fetch the hardware by the instance ID? As discussed, machine deployments are 1:1 with the hardware based on the hardware reference and namespace, so there is no need for this.

Member Author

Yes, machine deployments are 1:1 with hardware objects. I am using the instance ID to ensure that the hardware object is claimed by its machine deployment and that no other machine deployment can use it.

In other words, a single hardware object cannot be used by two machine deployments.
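For clarity, a sketch of the claiming logic described above, assuming the tinkv1alpha1 field paths already used in this PR (function name is a placeholder; imports for context, errors, fmt, and the controller-runtime client are omitted):

// claimHardware writes the machine UID into the hardware's instance ID so that
// no other machine deployment can claim the same object.
func claimHardware(ctx context.Context, c ctrlruntimeclient.Client, hw *tinkv1alpha1.Hardware, machineUID string) error {
	if hw.Spec.Metadata == nil || hw.Spec.Metadata.Instance == nil {
		return errors.New("hardware metadata/instance should not be nil")
	}
	if hw.Spec.Metadata.Instance.ID != "" && hw.Spec.Metadata.Instance.ID != machineUID {
		return fmt.Errorf("hardware %s is already claimed by %s", hw.Name, hw.Spec.Metadata.Instance.ID)
	}
	hw.Spec.Metadata.Instance.ID = machineUID
	return c.Update(ctx, hw)
}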

Member

Sure. In Tinkerbell there is a field called state, similar to NetBox, which tells you whether a machine is reserved and running. If you want, we can schedule a call to discuss these points.
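A minimal sketch of what that check could look like, assuming the state field on the tinkv1alpha1 Hardware metadata; the exact state values are placeholders, not confirmed Tinkerbell constants:

// isHardwareAvailable inspects the Tinkerbell state field instead of the
// instance ID. "in_use" is a placeholder for a reserved/running state.
func isHardwareAvailable(hw *tinkv1alpha1.Hardware) bool {
	if hw.Spec.Metadata == nil {
		return false
	}
	return hw.Spec.Metadata.State != "in_use"
}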

method = t.client.ByIP
default:
return nil, errors.New("need to specify either id, ip, or mac")
func (h *HardwareClient) GetHardwareWithID(ctx context.Context, uid string) (*tinkv1alpha1.Hardware, error) {
Member

Again this method is not needed as we fetch the hardware based on the name and namespace!

}

template := &tinkv1alpha1.Template{}
// Check if template exists. Each machine deployment will have its own tinkerbell template.
Member

There should actually be only one template; anything that needs to be customized must be done via the workflow, not the template. Here is an example:

apiVersion: tinkerbell.org/v1alpha1
kind: Template
spec:
  data: |
    name: ubuntu
    tasks:
      - name: "os installation"
        worker: "{{.device_1}}"

And the workflow has the value for this placeholder:

apiVersion: "tinkerbell.org/v1alpha1"
kind: Workflow
spec:
  hardwareMap:
    device_1: 00:00:00:00:00:01

return nil, fmt.Errorf("failed to create workflow template: %w", err)
}
// Set the HardwareID with machine UID. The hardware object is claimed by the machine.
if err = d.HardwareClient.SetHardwareID(ctx, hardware, string(meta.UID)); err != nil {
Member

Before starting the provisioning process we need to check whether the hardware object already allows running workflows and enables iPXE booting; the reason is that we don't want to provision a machine by mistake. After the workflow runs, we would need to set these values on the hardware, either from the machine controller or via a mutating webhook in the tink stack that observes the machine deployment object and updates the hardware object based on its status (whether the machine has a node attached to it).
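A hedged sketch of that pre-provisioning guard, assuming the netboot fields on the tinkv1alpha1 Hardware interfaces (treat the field names as assumptions to be checked against the Tinkerbell API):

// hardwareReadyForProvisioning returns true only if at least one interface
// already permits iPXE booting and running workflows.
func hardwareReadyForProvisioning(hw *tinkv1alpha1.Hardware) bool {
	for _, iface := range hw.Spec.Interfaces {
		if iface.Netboot == nil || iface.Netboot.AllowPXE == nil || iface.Netboot.AllowWorkflow == nil {
			continue
		}
		if *iface.Netboot.AllowPXE && *iface.Netboot.AllowWorkflow {
			return true
		}
	}
	return false
}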

}

// Set the Hardware UserData to execute the userdata generated by OSM.
if err = d.HardwareClient.SetHardwareUserData(ctx, hardware, userdata); err != nil {
Member

Since the user data will be part of the workflow and gets baked into the template programmatically, this is not needed anymore.

if hw.Hardware.Metadata == "" {
return fmt.Errorf("tinkerbell hardware metadata can not be empty")
// Reset the hardware cloud-init userdata
if err := d.HardwareClient.SetHardwareUserData(ctx, targetHardware, ""); err != nil {
Member

Since the user data will be part of the workflow and gets baked into the template programmatically, this is not needed anymore.

},
Spec: tinkv1alpha1.WorkflowSpec{
TemplateRef: templateRef,
HardwareRef: hardware.GetName(),
Member

So we need to add more keys here, such as the destination path on the machine and the cloud-init script.
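For example, a sketch of how those extra keys could be passed through the workflow's hardware map (key names like dest_path and cloud_init are placeholders, and macAddress/userdata stand in for values computed elsewhere):

wf := &tinkv1alpha1.Workflow{
	Spec: tinkv1alpha1.WorkflowSpec{
		TemplateRef: templateRef,
		HardwareRef: hardware.GetName(),
		HardwareMap: map[string]string{
			"device_1":   macAddress, // as in the earlier Template example
			"dest_path":  "/dev/sda", // destination path on the machine (placeholder)
			"cloud_init": userdata,   // cloud-init script generated by OSM
		},
	},
}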

allowworkflow: false
clusterName: "<< CLUSTER_NAME >>"
osImageUrl: "<< OS_IMAGE_URL >>"
hegelUrl: "<< HEGEL_URL >>"
Member

This won't be needed anymore, since we are going to create and pass the cloud-init as part of the workflow.
