
DataSourceOracle: refactor to use only OPC v1 endpoint #493

Merged · 3 commits · Aug 10, 2020

Conversation

@OddBloke (Collaborator) commented Jul 15, 2020

The /opc/v1/ metadata endpoints[0] are universally available in Oracle
Cloud Infrastructure and the OpenStack endpoints are considered
deprecated, so we can refactor the data source to use the OPC endpoints
exclusively. This simplifies the datasource code substantially, and
enables use of OPC-specific attributes in future.

[0] https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/Tasks/gettingmetadata.htm
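(For context, a minimal sketch of what fetching the OPC v1 instance document looks like; the endpoint URL comes from the Oracle docs above, while the helper name and the use of urllib rather than cloud-init's url_helper are purely illustrative.)

```python
import json
from urllib.request import urlopen

# OPC v1 instance metadata endpoint, per the Oracle docs linked above.
OPC_V1_INSTANCE = "http://169.254.169.254/opc/v1/instance/"


def fetch_opc_instance_metadata(timeout=10):
    """Fetch and JSON-decode the OPC v1 instance document (sketch only)."""
    with urlopen(OPC_V1_INSTANCE, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```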

@OddBloke (Collaborator, Author)

(I have built an Ubuntu package from this and tested it on both iSCSI and non-iSCSI roots; it successfully boots and provisions SSH keys from a captured image.)

@TheRealFalcon (Member) left a comment

Looks good so far! The metadata parsing code is much simpler.

My main question about the implementation is that certain fields that were being parsed before no longer are (i.e., vendordata_pure and launch-index). Is that still to do, or has something changed such that we no longer need them?

On the test front, my only question is around how you're implementing marks and fixtures, but since that conversation is happening in another channel, I'll wait until there's some closure there before mentioning anything in particular here.

@OddBloke (Collaborator, Author)

Looks good so far! The metadata parsing code is much simpler.

Thanks!

My main question about the implementation is that it looks like certain fields were being parsed before and no longer are (i.e., vendordata_pure

The new API has no equivalent to vendor-data. (vendor-data is an OpenStack concept originally, so it's not all that surprising that it was included in their OpenStack-a-like IMDS but is dropped now.)

and launch-index). Is that still to do, or has something changed that we no longer need them?

From Oracle:

- launch_index is hard-coded to 0

I don't know that anything will be relying on that, but we may as well include it in the metadata we produce just in case. Good catch!
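A rough sketch of what carrying that hard-coded value into the produced metadata might look like (the ociAdName and id fields appear in the hunks quoted later in this review; everything else here is illustrative rather than the exact merged code):

```python
def build_metadata(data):
    """Map OPC v1 instance fields onto cloud-init metadata keys (sketch only)."""
    return {
        "availability-zone": data["ociAdName"],
        "instance-id": data["id"],
        # Oracle hard-codes launch_index to 0, so mirror that just in case
        # anything downstream still reads it.
        "launch-index": 0,
        # ...other keys (local-hostname, name, ...) omitted from this sketch.
    }
```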

@OddBloke force-pushed the oracle branch 5 times, most recently from 7200f02 to 2956c75, on July 17, 2020 20:34
@TheRealFalcon (Member)

Regarding the TODO, if it's up to us, I don't see much benefit in adding it. It's essentially the same as curl'ing the metadata endpoint plus adding is_platform_viable and a system uuid to the output.

@OddBloke force-pushed the oracle branch 3 times, most recently from 41da6d9 to 3920878, on July 24, 2020 16:29
@OddBloke changed the title from "WIP: DataSourceOracle: refactor to use only OPC v1 endpoint" to "DataSourceOracle: refactor to use only OPC v1 endpoint" on Jul 24, 2020
@OddBloke (Collaborator, Author)

(We have confirmed with Oracle that the instance ID provided by the OpenStack endpoint will always be the same as the one provided by the OPC v1 endpoint, so upgrades should be smooth.)

@OddBloke (Collaborator, Author)

I realised that the docstring in DataSourceOracle.py needs updating; I'll also go through all the other documentation/docstrings to make sure there aren't any other stale references.

:return:
The JSON-decoded value of the /opc/v1/instance/ endpoint on the IMDS.
"""
return json.loads(readurl(METADATA_ENDPOINT, retries=2)._response.text)
Collaborator

2? Just because 2 is better than one?

I think we have a pretty convincing argument that we're running on OracleCloud at this point ('OracleCloud.com'); any reason not to bump that up? Or turn it to no retries? 2 just seems an odd number here. Either we're dependent on it (no retries) or we want to be resilient (10 or more?).

I don't have strong feelings on this though.

Collaborator

I knew that two is an impossible number, but this interpretation is new to me ;)

Collaborator (Author)

retries=2 was requested by Oracle in a code review to handle a potential race condition where the IMDS might not be ready. Re-reading their point, I think they really wanted 2 tries, not 2 retries (and the race condition window is reportedly milliseconds, so a single retry will comfortably address it). I'll modify this to 1.

Collaborator

Hmm... well, thanks for pointing out that it is retries, rather than tries. The same argument applies though, just with N-1.

handle a potential race condition where the IMDS might not be ready.

In other places (EC2) we have similar behavior: we wait_for_metadata_service only on the first request, and then for subsequent calls the IMDS must be there, so readurl gets the default retries=0.

and the race condition window is reportedly milliseconds, so a single retry will comfortably address it

"The IMDS is not yet ready" seems unlikely to reliably fix itself milliseconds later. The cloud platform prepared gigabytes of disk, booted a virtual (or physical) machine, loaded a kernel, initramfs, switchroot, init, dhcp... all of this happened, but the system still came up milliseconds before the IMDS? Just seems unlikely.

I won't fight that; please at least document it in a code comment, though.
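A rough sketch of the EC2-style pattern referenced above, waiting once for the IMDS and then relying on it for subsequent reads (wait_for_url and readurl exist in cloudinit.url_helper, but the exact arguments and call site here are illustrative):

```python
from cloudinit import url_helper

METADATA_ENDPOINT = "http://169.254.169.254/opc/v1/instance/"


def crawl_metadata():
    # Wait for the IMDS once, up front (the EC2-style wait_for_metadata_service step).
    url_helper.wait_for_url([METADATA_ENDPOINT], max_wait=30, timeout=5)
    # Subsequent reads assume the IMDS is up, so readurl keeps its default retries=0.
    return url_helper.readurl(METADATA_ENDPOINT).contents
```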

* Both bare-metal and vms use iscsi root
* Both bare-metal and vms provide chassis-asset-tag of OracleCloud.com
* Bare metal instances use iSCSI root, virtual machines do not
* Both bare-metal and VMs provide chassis-asset-tag of OracleCloud.com
Collaborator

The edits here are just kind of odd. You seem to prefer 'Bare metal' to 'bare-metal' on line 11, but then go back to 'bare-metal' on line 12. And then you replaced 'vms' with 'virtual machines' on line 14 -> 11, but capitalized VMs on line 12.

Only really mentioning this because I had other comments.

Collaborator (Author)

Yep, no idea why I did that, will fix.

@OddBloke force-pushed the oracle branch 3 times, most recently from fa9942f to 1de77af, on August 4, 2020 19:11
@OddBloke (Collaborator, Author) commented Aug 4, 2020

@smoser I believe I've addressed your comments now, could you take another look?

@blackboxsw self-assigned this Aug 6, 2020
This would return:
{version: {'user_data': b'blob', 'meta_data': json.loads(blob.decode())
'system_uuid': '3b54f2e0-3ab2-458d-b770-af9926eee3b2'}}
def read_opc_metadata():
Collaborator

There was some effort previously to standardize on, and expose, a commonly named read_metadata method in each datasource module, so that at some point we could potentially use it from a cloud-init query --no-cache call or from hotplug logic to crawl fresh metadata for the datasource without updating the existing cached datasource metadata. Specializing this function name seems to move away from that interest.

Collaborator (Author)

def read_metadata( only appears in 4 places in the codebase:

cloudinit/sources/DataSourceOracle.py
313:def read_metadata(endpoint_base=METADATA_ENDPOINT, sys_uuid=None,

cloudinit/sources/DataSourceExoscale.py
186:def read_metadata(metadata_url=METADATA_URL,

cloudinit/sources/helpers/hetzner.py
16:def read_metadata(url, timeout=2, sec_between=2, retries=30):

cloudinit/sources/helpers/digitalocean.py
185:def read_metadata(url, timeout=2, sec_between=2, retries=30):

These don't have a consistent interface (some require a parameter, others don't), and 2 of them aren't in datasource modules. So I'm not sure we lose much by changing the name of this one.

Collaborator

+1. Since we still have a lot of discussion ahead about making datasources consistent and deciding how that should be done, this certainly doesn't make things any harder, so let's move forward with your changeset.

print(
util.json_dumps(
{
"read_opc_metadata": read_opc_metadata(),
Collaborator

Along the same lines of generalizing the main output, is it worth just calling this key "metadata"? Also in the same vein, the is_platform_viable function was a candidate for generalizing across datasources; is it worth just surfacing an "is_platform_viable" key for this?

Admittedly, if our upcoming hotplug work requires us to standardize on a common function name for all datasources, and on common behavior for the JSON output of main in each datasource, that would be a flag-day type PR that could address all datasources in one go, so maybe we don't have to quibble about it for this PR.

Collaborator (Author)

Yep, similarly to my above response, there are only 3 datasources which implement this at all, and they are not consistent in how they do it:

cloudinit/sources/DataSourceOracle.py
184:    def _is_platform_viable(self):
286:def _is_platform_viable():

cloudinit/sources/DataSourceExoscale.py
137:    def _is_platform_viable(self):

cloudinit/sources/DataSourceAzure.py
502:    def _is_platform_viable(self):
1497:def _is_platform_viable(seed_dir):

AIUI, this output is only really intended for debugging and is already inconsistent across datasources, so I went with key == function name so it's clear, within the file, where the data is coming from.
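Stitching together the hunks quoted in this thread, the debug entry point under discussion has roughly this shape (a sketch, not the exact merged code):

```python
if __name__ == "__main__":
    print(
        util.json_dumps(
            {
                # key == function name, so it is obvious where each value comes from
                "read_opc_metadata": read_opc_metadata(),
                "_is_platform_viable": _is_platform_viable(),
            }
        )
    )
```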

Collaborator

Works for me. I think this could be a topic for us to discuss and refine if, or when, we decide to go this route.

return marker.args
return default
class _FixtureUtils:
"""A namespace for fixture helper functions, used by fixture_utils.
Collaborator

Thanks for this class namespacing, it does make things more clear.
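For readers who haven't seen the pattern, the idea is to group fixture helpers as static methods on a class and expose that class through a single fixture; a minimal sketch (only the _FixtureUtils and fixture_utils names come from the quoted hunk; the helper name and body below are hypothetical):

```python
import pytest


class _FixtureUtils:
    """A namespace for fixture helper functions, used by fixture_utils (sketch)."""

    @staticmethod
    def closest_marker_args_or(request, marker_name, default):
        # Hypothetical helper: return the args of the closest marker with the
        # given name, falling back to a default if the test isn't marked.
        marker = request.node.get_closest_marker(marker_name)
        if marker is not None:
            return marker.args
        return default


@pytest.fixture
def fixture_utils():
    """Expose the helper namespace to tests as a single fixture."""
    return _FixtureUtils
```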

@blackboxsw (Collaborator) left a comment

Looks really good, especially the new fixtures and test rework. I have a couple of comments/concerns for discussion on this pass, and I want to think a bit about the upgrade path from cloud-init-20.2.45 on Oracle systems to this functionality, to work out what could break. I realize we don't have to support upgrades from the OpenStack datasource -> Oracle, but current Oracle -> new Oracle makes me want to think a bit.

def test_metadata_keys_set_correctly(
self, keyname, expected_value, parameterized_oracle_ds,
):
parameterized_oracle_ds._get_data()
Collaborator

I think it'd be nice to add an assert True here, just to confirm we are actually returning the expected bool value, even though we are validating that the metadata is set up properly.
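A sketch of the suggested assertion (the test and fixture names come from the quoted hunk; the second assertion is a guess at the existing parametrized check):

```python
def test_metadata_keys_set_correctly(
    self, keyname, expected_value, parameterized_oracle_ds,
):
    # Assert on the return value too, not just on the metadata side effects.
    assert parameterized_oracle_ds._get_data() is True
    assert expected_value == parameterized_oracle_ds.metadata[keyname]
```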



@mock.patch(DS_PATH + "._is_iscsi_root", lambda: False)
class TestNonIscsiRoot_GetDataBehaviour:
Collaborator

Wow, impressive setup to test this scoping.

Collaborator (Author)

Yeah, the fact that the body of the context manager doesn't use a value returned by the context manager makes it much harder to assert on!
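One generic way to assert that a context manager was entered even when its yielded value goes unused is to patch it with a MagicMock and check __enter__; a minimal sketch (the dhcp.EphemeralDHCPv4 patch target and the oracle_ds fixture are assumptions here, not necessarily what this PR's tests do):

```python
from unittest import mock


def test_dhcp_context_manager_is_entered(oracle_ds):
    # Patch the context manager so we can inspect __enter__, even though the
    # body of the `with` block never uses the value it yields.
    with mock.patch(
        DS_PATH + ".dhcp.EphemeralDHCPv4"  # assumed patch target
    ) as m_dhcp:
        oracle_ds._get_data()
    assert 1 == m_dhcp.return_value.__enter__.call_count
```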

The JSON-decoded value of the /opc/v1/instance/ endpoint on the IMDS.
"""
# retries=1 as requested by Oracle to address a potential race condition
return json.loads(readurl(METADATA_ENDPOINT, retries=1)._response.text)
Collaborator

So we are no longer handling UrlErrors and raising BrokenMetadata with more specific failure information instead of UrlErrors. I realize cloud-init doesn't actually handle BrokenMetadata uniquely, but what are your thoughts about dropping BrokenMetadata as an exception specialization from other/all datasources in the future?

Collaborator (Author)

Looking at the way BrokenMetadata is used elsewhere, it breaks down into two cases: (a) the content of the metadata is incorrect (e.g. random_seed is not base64 as we expect), or (b) there are multiple different "transports" being used, and BrokenMetadata is used to abstract away the differences between the exceptions that those "transports" would raise on failure. (The only example of (b) is OpenStack: ConfigDrive will raise filesystem-related errors whereas the IMDS will raise HTTP errors.)

I don't think BrokenMetadata gains us much here because it doesn't match either of these cases: the UrlError indicates a "transport" issue (rather than a content issue), and we only have one "transport" to worry about.
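For reference, a sketch of the wrapping pattern described in (b), where transport-specific failures are normalized into BrokenMetadata (the import path for BrokenMetadata is assumed here, and this is not this PR's code, which deliberately drops the wrapping):

```python
from cloudinit.sources import BrokenMetadata  # import path assumed for this sketch
from cloudinit.url_helper import UrlError, readurl


def read_metadata(url):
    """Normalize transport-specific failures into BrokenMetadata (sketch only)."""
    try:
        response = readurl(url)
    except UrlError as e:
        # Case (b): hide the transport-specific exception behind BrokenMetadata.
        raise BrokenMetadata("Failed to read %s: %s" % (url, e)) from e
    return response.contents
```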



@pytest.yield_fixture
def httpretty():
Collaborator

Again an excellent fixture addition. Thanks!
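For readers who haven't seen the fixture, an httpretty-enabling yield fixture generally looks something like this (a sketch, not the exact conftest.py code; pytest.yield_fixture was the spelling at the time, while a plain pytest.fixture with a yield behaves the same):

```python
import httpretty as _httpretty
import pytest


@pytest.fixture
def httpretty():
    """Enable httpretty around a test, and clean up afterwards."""
    _httpretty.reset()
    _httpretty.enable(allow_net_connect=False)  # fail fast on unexpected real requests
    yield _httpretty
    _httpretty.disable()
    _httpretty.reset()
```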

{
"read_opc_metadata": read_opc_metadata(),
"_is_platform_viable": _is_platform_viable(),
}
Collaborator

Is it worth exposing the VNIC network metadata here somehow, from _add_network_config_from_opc_imds too, to aid introspection of the content exposed to the VM when calling main?

Collaborator (Author)

Potentially, but @TheRealFalcon will be changing this codepath in his follow-up PR anyway (because read_opc_metadata will evolve); instead of writing more code that will be replaced, let's remember to get him to do this.

return True
self.metadata = {
"availability-zone": data["ociAdName"],
"instance-id": data["id"],
Collaborator

Maybe I'm misreading this. It appears you are trying to limit the metadata preserved as a cache on the datasource by only instrumenting a few keys from the provided data.

I'm not certain why we would not want to cache all metadata here on the datasource. Why not just update the existing data with additional key aliases for the known availability-zone, instance-id, launch-index, local-hostname and name keys? At least from a triage point of view, we would be able to inspect the system and the "raw" values it provides more easily via cloud-init query if we get bugs, or have to develop more features, etc.

An unwritten goal of cloud-init query is, at some point, to accurately represent any instance-data in as close to the original format as possible to potentially allow folks to leverage cloud-init's inherent knowledge of the 'best practices' to talk to a given datasource API (including retries and failure handling etc). This would allow customers on cloud-init driven platforms to simplify their scripts which rely on instance-data/meta-data to just cloud-init query blah instead of knowing the proper routes to curl etc and handle failures.

Collaborator (Author)

Maybe I'm misreading this. It appears you are trying to limit the metadata preserved as a cache on the datasource by only instrumenting a few keys from the provided data.

This may be the effect, but it is not the intent: I'm following the prior implementation of the datasource which did not include the entire metadata in self.metadata. Looking at a few other datasources, it's inconsistent as to whether they include all metadata in self.metadata or use something like self.metadata_full.

Why not just update the existing data with additional key aliases for the known availability-zone, instance-id, launch-index, local-hostname and name keys?

These seem like two different namespaces to me: if we modify the dict that we get from the cloud, then it's hard for users to know which keys are set by cloud-init (and therefore they can rely on across platforms) and which are cloud-specific; all the keys appear similar. (This is further complicated by the fact that cloud-init uses OpenStack/EC2 names for metadata keys internally; on OpenStack/EC2 there wouldn't even be an obvious pattern to allow you to guess which keys are cross-substrate and which are cloud-specific.)

All that said, I think we already have this distinction between namespaces in the codebase: we have accessors like get_public_ssh_keys or availability_zone for the things which are required for cloud-init to function. So the right move here is likely to store the metadata unmodified, but implement all the accessors whose default implementations assume the presence of OpenStack-named keys in the metadata. This should obviate the need for aliasing.
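A minimal sketch of that accessor approach, storing the crawled data unmodified and overriding the accessors cloud-init needs (availability_zone and launch_index are among the accessors on the datasource base class mentioned above; the ociAdName field comes from the hunk quoted earlier, while the rest is illustrative and, as the follow-up below notes, was ultimately not needed):

```python
class DataSourceOracle(sources.DataSource):

    def _get_data(self):
        data = read_opc_metadata()
        # Keep the crawled data unmodified rather than renaming keys in place.
        self._crawled_metadata = data
        self.metadata = data
        return True

    @property
    def availability_zone(self):
        # Override the OpenStack-named default accessor with the OPC field.
        return self.metadata["ociAdName"]

    @property
    def launch_index(self):
        # Oracle hard-codes this to 0.
        return 0
```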

An unwritten goal of cloud-init query is, at some point, to accurately represent any instance-data in as close to the original format as possible

Performing the aliasing required for cloud-init's core functionality in the same data structure as returned by the metadata service would run counter to this goal, I think.

Collaborator (Author)

Upon further inspection, self._crawled_metadata is used in cloud-init query output if it's set (which it is in this PR). This also explains why the prior implementation of the DS was this way.

So I don't think we actually need to change anything here after all?

Collaborator

Upon further inspection, self._crawled_metadata is used in cloud-init query output if it's set (which it is in this PR). This also explains why the prior implementation of the DS was this way.

So I don't think we actually need to change anything here after all?

Correct. Thanks for the cloud-init query dump in IRC. I also launched an instance before and after and determined (surprise) that I did misread the metadata setup: the original metadata content is properly preserved and accessible under cloud-init query ds.meta_data, so this PR didn't unintentionally redact or remove that information. Oops.

@blackboxsw (Collaborator) left a comment

LGTM. Validated behavior before and after upgrade.

@smoser (Collaborator) left a comment

I don't have any objections to this. My comments/requests have been addressed. I'll let @blackboxsw ultimately approve it as he's done more work reviewing recently.

The JSON-decoded value of the /opc/v1/instance/ endpoint on the IMDS.
"""
# retries=1 as requested by Oracle to address a potential race condition
return json.loads(readurl(METADATA_ENDPOINT, retries=1)._response.text)
Collaborator

retries=1 is still just weird, and my argument against it stands. I don't think "the cloud platform requests us to do random nonsense things" is a generally good argument. Either they have good reasons for doing this, or they do not.

But... as I said, I won't fight this.

@OddBloke (Collaborator, Author)

I have performed an upgrade test between the current cloud-init in the Ubuntu archive and a deb built from this branch on the cross-product of {xenial, bionic, focal} x {standard image, minimal image}, and everything behaved as expected. I also performed a release upgrade from bionic->focal, and confirmed that the OpenStack datasource continues to be used, as we would expect.

I believe this is ready to land.

@OddBloke merged commit a6bb375 into canonical:master on Aug 10, 2020
@OddBloke deleted the oracle branch on August 14, 2020 14:41