add Equinix Metal metadata service #680

Closed
wants to merge 8 commits into from

Conversation

displague

@displague displague commented Nov 19, 2020

Proposed Commit Message

The Equinix Metal (formerly Packet.com) metadata service offers data-rich metadata that is partially compatible with the AWS metadata service, though less so than some other providers. I based this branch on the AliYun provider, and some things are still definitely wrong for Equinix Metal use.

Equinix Metal does have other metadata values that could be pieced together here. By opening this draft PR, I'm hoping to sync the cloud-init expectations with the features of the available metadata service.

Additional Context

The Equinix Metal metadata service is the basis for the Tinkerbell metadata service, so support for Tinkerbell will benefit from this integration.

Equinix Metal sample metadata: (c3.small.x86 / "Layer3" network configuration):
https://gist.github.com/displague/dff585c6ef57510c37f475b6c68e6427

Tinkerbell's metadata service (2009-04-04 AWS compatibility):
https://github.com/tinkerbell/hegel/blob/a3138d417536903dcdedd674d634e13ebc19fc79/http_handlers.go#L23-L45

Equinix Metal devices create this cloud.cfg on OS disks following image-based provisioning: https://github.com/tinkerbell/osie/blob/71da59e10bf66cc352366615060171939412d397/docker/scripts/osie.sh#L363-L369

Test Steps

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

@displague displague changed the title from "add equinix metal metadata service" to "add Equinix Metal metadata service" Nov 23, 2020
@TheRealFalcon
Member

Hi @displague , thanks for the contribution. Are you looking for review of this yet? I noticed this is still in draft state and tests don't all pass. @mitechie notified me you were looking for reviewers, but I just want to make sure it's not premature.

@displague
Author

@TheRealFalcon I'm looking for some direction, actually. I've provided examples of the available metadata. I'm not sure how best to take advantage of what is offered.

What I include in the PR is the lowest-common-denominator implementation, based on what fits from the Alibaba provider. I've not attempted to map any of the Equinix Metal-specific features to available cloud-init features.

Collaborator

@smoser smoser left a comment

In order to make this actually work on distros that use the cloud-init generator (both CentOS 8 and current Ubuntu do), you'll need to change tools/ds-identify also.

You should hopefully be able to follow how to make those changes. And then also you'll want to add tests to tests/unittests/test_ds_identify.py

And then you should also add something to doc/rtd/topics/datasources/

@displague
Author

@smoser Thanks for the hints. datasources.rst provides some guidance and I was able to add a few more missing files based on that.

I've left some TODOs where I'm not sure how this provider should behave.

For example, some but not all models will return a Product or Vendor Id in DMI that reports 'Packet' (or some variation). This is not common, and even if it were, this value would eventually report 'Equinix Metal' (or some variation) in newer models and reprovisioned devices.

I cannot rely on DMI to detect that an instance is running in Equinix Metal. The only factor that I am confident of is that the metadata service, which is available at metadata.platformequinix.com, will be accessible (on a public IPv4 address, different between facilities). I can also be certain of the iqn pattern within that metadata. A user may choose to provision Equinix Metal without a public address; in these cases I do not have a way to identify that the instance is running on Equinix Metal.

I'd really like to take better advantage of the additional metadata provided outside of the /2009-04-04/ compatibility layer, to configure networking devices or other features.

@displague
Author

When an EquinixMetal DS is well-known to cloud-init, and for distributions that include a sufficient version of cloud-init, the Equinix Metal pre-installed cloud.cfg will offer an EquinixMetal DS, rather than EC2. (This can be seen in the tinkerbell/osie link included in the PR description).

},
'EquinixMetal': {
'ds': 'EquinixMetal',
'mocks': [], # TODO how do I mock metadata responses?
Collaborator

You don't, at least not for ds-identify. It only identifies via locally available data.

Author

I could add DS support for the models that do report "Packet" back somewhere in the DMI, but I would have to survey the available models to identify any variations in the fields and values.

Should I (can I?) remove DS support and only rely on metadata service detection?

Collaborator

I'm not sure what you mean by "DS support".

Maybe some background would help you to understand what ds-identify does here.

  1. We want vendors to be able to make OS images (like https://cloud-images.ubuntu.com/) that "just work" wherever you try to run them, and an OS with cloud-init installed but not on a cloud would not do anything differently. datasource_list can configure which datasources will be searched, but ideally there would be no need for such a thing. cloud-init would just "do the right thing".

  2. Previously, cloud-init (in Python) would walk through each datasource in datasource_list and try to get data. That meant that boot was always impacted (cloud-init always ran). On EC2, that meant doing a DHCP request and checking to see if the metadata service was there. That is obviously less than ideal: it was slow, and it meant that if you booted such an image elsewhere and there happened to be a http://169.254.169.254/latest/user-data, then it would execute that code.

Now, with ds-identify, if it determines that the system is not on a cloud platform, cloud-init does not run at all. From systemd's perspective, cloud-init.target is not even enabled. But in order to do that, we only look at local data. We want those checks to be very fast, and thus far they are. When ds-identify finds that it is on Equinix, as told to it by DMI data, it knows that it will find an Equinix metadata service (or, if someone is lying to it, then failure is somewhat expected. As an example, if a CPU identifies itself as x86_64 but didn't implement some of the interfaces, you'd expect that a program might fail).

Author

@displague displague Dec 8, 2020

Are there examples of providers that have no local representation (no guarantee of identifiable DMI ids)?

The presence of packet.net (or equinix) in metadata.platformequinix.com/2009-04-04/iqn is the only approach that will work for a majority of our infrastructure (that I am aware of). Very few devices report Packet somewhere in their DMI (that I am aware of).

/cc @mmlb @dustinmiller1337 @pereztr5

@smoser Is there a possible alternative to DMI data? We can likely go that route for our own machines, but that should not be an expectation for making use of Tinkerbell. We can control kernel boot args very easily; can cloud-init also check there?

oh wait this would apply at runtime not install-time 🤦‍♂️ so we don't have as much control over kernel args :/

Author

@displague displague Dec 9, 2020

@smoser Tinkerbell has a metadata service and deployment architecture descended from EM's, but in that environment users bring their own hardware and DMI stamping may not be possible.

Related issue: tinkerbell/cluster-api-provider-tinkerbell#6

Author

@displague displague Dec 9, 2020

I'm leaving Tinkerbell concerns out of this PR, but I was hopeful that we could leverage this PR in some way in support of https://tinkerbell.org/ environments later.

Collaborator

Are there examples of providers that have no local representation (no guarantee of identifiable DMI ids)?

MAAS. It sounds like MAAS is something very similar to what you're working on.
The way MAAS works is:

  • network booted environment sends cmdline with 'ci.ds=MAAS'

    ds-identify generically reads the ci.ds kernel parameter as declaring the datasource to use.

  • installed system declares the datasource_list to have only MAAS in it.

    During the install, MAAS writes a cloud-init config file to the target system. That file declares 'datasource_list' to have just MAAS, and ds-identify respects that.
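
The first bullet can be illustrated with a small sketch. This is not ds-identify's code (ds-identify is a shell script); it is a hypothetical Python helper, `datasource_from_cmdline`, that only shows the idea of picking a `ci.ds=<name>` token out of a kernel command line. The function name and its `key` parameter are assumptions made for this illustration.

```python
from typing import Optional


def datasource_from_cmdline(cmdline: str, key: str = "ci.ds") -> Optional[str]:
    """Return the datasource named by `key=<name>` on a kernel command line.

    Hypothetical helper for illustration only; returns None when the key
    is absent or has an empty value.
    """
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name == key and value:
            return value
    return None
```

On a running system the input would typically be the contents of /proc/cmdline, for example `datasource_from_cmdline(open("/proc/cmdline").read())`. Because the check touches only local data, it fits the "no network polling" constraint described in this thread.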

@smoser
Collaborator

smoser commented Dec 4, 2020

For example, some but not all models will return a Product or Vendor Id in DMI that reports 'Packet' (or some variation). This is not common, and even if it were, this value would eventually report 'Equinix Metal' (or some variation) in newer models and reprovisioned devices.

I cannot rely on DMI to detect that an instance is running in Equinix Metal. The only factor that I am confident of is that the metadata service, which is available at metadata.platformequinix.com, will be accessible (on a public IPv4 address, different between facilities). I can also be certain of the iqn pattern within that metadata. A user may choose to provision Equinix Metal without a public address; in these cases I do not have a way to identify that the instance is running on Equinix Metal.

Well... perhaps we can just ignore it and wait until all nodes report
something identifiable in the DMI. (It's OK by me to consider either 'Packet' or the Equinix value.)

Remember that, generally speaking, this change mostly affects new instances, which then perhaps are provisioned on new nodes that have the new values? I.e., upgrade isn't terribly important here. An old node that didn't realize it was on Equinix will just continue to not know that after a cloud-init upgrade.

I'd really like to take better advantage of the additional metadata provided outside of the /2009-04-04/ compatibility layer, to configure networking devices or other features.

Yes please. Such a change might end up causing you to do your own DS rather than riding on Amazon, but that is fine.

@displague
Author

Well... perhaps we can just ignore it and wait until all nodes report
something identifiable in the DMI. (It's OK by me to consider either 'Packet' or the Equinix value.)

@smoser Is this to say that local datasource detection is the only option?
Can this PR proceed with network URL based detection?

@smoser
Collaborator

smoser commented Dec 15, 2020

Well... perhaps we can just ignore it and wait until all nodes report
something identifiable in the DMI. (It's OK by me to consider either 'Packet' or the Equinix value.)

@smoser Is this to say that local datasource detection is the only option?
Can this PR proceed with network URL based detection?

Polling "arbitrary" network sources for information without definitive data suggesting its presence is something cloud-init has not done since sometime in 2018. So the options are:

  1. datasource is not enabled by default in cloud-init. That means it doesn't "just work" 😢
  2. datasource is enabled by default but requires either the dmi information present, or some other local indication.

Again, I'd like to point out that the MAAS datasource relies on option 2, and for a platform that is doing an install, that just requires putting down that local indication during the install.

@displague
Author

Support may be in more device models than I originally thought. Which fields are populated may not be entirely consistent, but there may be some common ground.

I'll survey the platform plans to confirm.

For example, on a c3.small.x86:

/sys/class/dmi/id/product_name:	c3.small.x86
/sys/class/dmi/id/sys_vendor:	Packet

The t1.small.x86 has no packet or equinix strings in dmidecode or /sys/class/dmi files.
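
A local check along these lines could be sketched as follows. This is only an illustration, not the ds-identify implementation: the helper name `dmi_suggests_equinix_metal`, the choice of fields, and the substrings "packet" and "equinix" are assumptions drawn from the values mentioned in this thread, and the survey of plans may show that other fields or spellings are needed.

```python
from pathlib import Path

# Assumed substrings, based only on the values discussed in this thread.
VENDOR_HINTS = ("packet", "equinix")


def dmi_suggests_equinix_metal(
    dmi_dir: str = "/sys/class/dmi/id",
    fields: tuple = ("sys_vendor", "product_name"),
) -> bool:
    """Return True if any listed DMI field contains one of VENDOR_HINTS.

    Hypothetical helper: reads only local files, never the network.
    Missing or unreadable fields are skipped rather than treated as errors.
    """
    base = Path(dmi_dir)
    for field in fields:
        try:
            value = (base / field).read_text().strip().lower()
        except OSError:
            continue
        if any(hint in value for hint in VENDOR_HINTS):
            return True
    return False
```

With the c3.small.x86 values shown above, sys_vendor would match; on a plan like the t1.small.x86, where no such string is present, the function returns False and another local indication would be required.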

@github-actions

github-actions bot commented Jan 5, 2021

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed, please do feel free to reopen it if you wish to continue working on it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Jan 5, 2021
@github-actions github-actions bot closed this Jan 12, 2021
@displague
Author

displague commented Jan 12, 2021

Welp. I'll reopen this when the internal discoveries are available and some form of DS identify based on hardware is available.

@displague
Author

@smoser would it be acceptable to do DS detection based on /proc/cmdline rather than DMI?
(@mmlb proposed this in the Apr 13, 2021 Tinkerbell Community meeting)

@smoser
Collaborator

smoser commented Apr 13, 2021

It is acceptable... it's just less than ideal. There are other examples of it; see #827.
Also, on the cmdline you can specify the datasource directly with 'ci.ds=Equinix'.

@mitechie mitechie removed the stale-pr Pull request is stale; will be auto-closed soon label Jun 2, 2021
@mitechie mitechie reopened this Jun 2, 2021
@TheRealFalcon
Member

Hey there. This branch was reopened because we saw over on https://bugs.launchpad.net/cloud-init/+bug/1745920 that you're looking for work on this to be continued. Since it has been closed for a while, I'm guessing there's still work to do on your end before this PR would be considered ready to review or merge. Is that correct?

@displague
Author

Something that Equinix Metal support should add is the customdata field from the EM metadata, which holds arbitrary JSON.

This would be consumed as:

POST /devices
{
  "operating_system": "ubuntu_20_04",
  "customdata": {"foo": "bar"},
  "userdata": "## template: jinja\n#cloud-config\nruncmd:\n  - echo {{ ds.meta_data.customdata.foo }}"
}

cloud-init query --all does not currently produce this field. I'm not sure why this doesn't work today or how or where this is redacted today (EM uses the 2009-04-04 EC2 metadata format). When included, I'm not sure if this should be treated as sensitive data or not.

@displague
Author

@TheRealFalcon I think a first step would be to add support for an EquinixMetal datasource, even though it will not be a detectable environment (ds= will need to be supplied).

@TheRealFalcon
Member

cloud-init query --all does not currently produce this field. I'm not sure why this doesn't work today or how or where this is redacted today (EM uses the 2009-04-04 EC2 metadata format).

@displague , I'm not entirely sure of the context here, but in order for customdata to be queried, it would need to be included as part of the instance metadata returned from the datasource. Since your datasource is inheriting from the EC2 datasource, it would need to be returned during the crawl_metadata call (or attached separately in _get_data).
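
To make "attached separately" concrete, here is a minimal sketch that deliberately avoids cloud-init's classes. It assumes the datasource already holds the crawled EC2-style metadata as a dict and can obtain the provider's full metadata as a JSON string; the function name `attach_customdata` and both parameter names are hypothetical.

```python
import json


def attach_customdata(ec2_meta: dict, metadata_document: str,
                      key: str = "customdata") -> dict:
    """Return a copy of ec2_meta with `key` copied from the full metadata JSON.

    Hypothetical helper: `metadata_document` is assumed to be the provider's
    full metadata serialized as a JSON object. Nothing is added if the key
    is absent or the document is not a JSON object.
    """
    merged = dict(ec2_meta)
    full = json.loads(metadata_document)
    if isinstance(full, dict) and key in full:
        merged[key] = full[key]
    return merged
```

Whether such a merge belongs in a crawl step or in a later step is a design decision for the datasource; the sketch only shows that the extra field has to be copied into the metadata explicitly before it can be queried.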

Signed-off-by: Marques Johansson <mjohansson@equinix.com>
Signed-off-by: Marques Johansson <mjohansson@equinix.com>
A relative of this code is used to deploy cloud-config data on Equinix
Metal nodes:
https://github.com/tinkerbell/osie/blob/71da59e10bf66cc352366615060171939412d397/docker/scripts/osie.sh#L363-L369

This change makes the EM datasource match that timeout setting.

Signed-off-by: Marques Johansson <mjohansson@equinix.com>
Signed-off-by: Marques Johansson <mjohansson@equinix.com>
Signed-off-by: Marques Johansson <mjohansson@equinix.com>
Signed-off-by: Marques Johansson <mjohansson@equinix.com>
@displague
Author

displague commented Jun 28, 2021

@TheRealFalcon "customdata" is one field that is returned in the /metadata response, it is not a separate API path.
This field has an arbitrary structure. Whatever the user puts into the Equinix Metal API's device instance "customdata" field, it will appear in the metadata.

POST /metal/v1/devices HTTP/1.1
Host: api.equinix.com
Content-Type: application/json

{
  "customdata": {"foo": "bar"},
  ...
}

The output of curl https://metadata.platformequinix.com/metadata | jq -r .customdata could be a string, a structure, an array, an integer, a boolean, or null. With the API POST above, you would expect this jq command to output {"foo": "bar"}.

I expected that this field would be available using the current EC2 2009-04-04 support (pointed at the metadata URL above). EM devices today include an /etc/cloud/cloud.cfg with:

datasource_list: [Ec2]
datasource:
  Ec2:
    timeout: 60
    max_wait: 120
    metadata_urls: [ 'https://metadata.platformequinix.com/metadata' ]
    dsmode: net

However, when I tried to use {{ ds.meta_data.customdata.foo }} in a jinja cloud-init, I found that customdata was not present.

The output of cat /run/cloud-init/instance-data.json | jq .customdata does not match the curl statement above: customdata does not appear in the document. I'm wondering how this file is being filtered and what I should change in this PR to accommodate this field.
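
One thing worth ruling out is the lookup path itself. The Jinja expression reads `ds.meta_data.customdata`, while the jq filter reads a top-level `.customdata`. The sketch below assumes instance-data.json nests datasource metadata under `ds` then `meta_data`, as the Jinja path suggests; the helper name `lookup` is hypothetical.

```python
def lookup(document: dict, dotted_path: str):
    """Walk a dotted path such as 'ds.meta_data.customdata'.

    Returns (found, value). Hypothetical helper for checking where a key
    lives in a parsed instance-data.json document.
    """
    node = document
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False, None
        node = node[part]
    return True, node
```

Loading the file with `json.load` and calling `lookup(data, "ds.meta_data.customdata")` distinguishes "the key is missing from the datasource metadata" from "the key exists but was queried at the wrong level".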

@TheRealFalcon
Member

Is everything else at that address getting attached to your metadata as you'd expect? Does the cloud-init.log tell you anything about attempts or failures in fetching the correct URL? EC2 checks a <address>/meta-data endpoint in order to retrieve the metadata. Does this exist for you?

@displague
Author

displague commented Jun 28, 2021

root@vcf-edge:~# cloud-init query -a

https://gist.github.com/displague/b9f360125e0b8c04d3c53a0f8b3b68b2


root@example:~# cloud-init query -f '{{ ds.meta_data.iqn }}'
iqn.2021-01.net.packet:device.83af6225
root@example:~# cloud-init query -f '{{ ds.meta_data.customdata }}'
WARNING: Could not render jinja template variables in file 'query commandline': 'customdata'
CI_MISSING_JINJA_VAR/customdata
root@example:~# curl -s http://metadata.platformequinix.com/metadata | jq -r .customdata
{}

(In this example I don't have customdata values set, but I get the same error when there are keys in the customdata)

@displague
Author

I see what's happening. The ds.meta_data is bound by what is captured here:

$ curl http://metadata.platformequinix.com/2009-04-04/meta-data
instance-id
hostname
iqn
plan
facility
tags
operating-system
public-keys
public-ipv4
public-ipv6
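
A simplified sketch of why the listing bounds the result: an index-driven crawl fetches the listing, then fetches one value per listed name, so a key that is not in the listing is never requested. This is not cloud-init's crawler (which also handles nested paths and special cases such as public-keys); `crawl_index` and its `fetch` callable are hypothetical names used only for illustration.

```python
from typing import Callable, Dict


def crawl_index(base_url: str, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Build a flat metadata dict from an index listing.

    `fetch` is any callable that returns the body for a URL. Only names
    that appear in the listing at `base_url` are fetched, so anything the
    listing omits cannot appear in the result.
    """
    result: Dict[str, str] = {}
    for line in fetch(base_url).splitlines():
        name = line.strip()
        if not name:
            continue
        result[name] = fetch(f"{base_url}/{name}")
    return result
```

Under that model, a field such as customdata becomes visible only if the listing includes it or if the datasource adds it by other means.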

@displague
Author

I presume if customdata were returned here, {{ ds.meta_data.customdata }} would "just work"?

@displague
Author

I may want to pursue getting that field added to the metadata response while this PR remains unpackaged.

@github-actions

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Jul 13, 2021
Labels
stale-pr Pull request is stale; will be auto-closed soon
5 participants