
DataSourceOracle: refactor to use only OPC v1 endpoint #493

Merged · 3 commits · Aug 10, 2020

Conversation

@OddBloke (Collaborator) commented Jul 15, 2020

The /opc/v1/ metadata endpoints[0] are universally available in Oracle
Cloud Infrastructure and the OpenStack endpoints are considered
deprecated, so we can refactor the data source to use the OPC endpoints
exclusively. This simplifies the datasource code substantially, and
enables use of OPC-specific attributes in future.

[0] https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/Tasks/gettingmetadata.htm
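(For context, a minimal sketch of what fetching the OPC v1 instance document looks like; the endpoint URL comes from the Oracle docs above, while the helper name and the use of urllib rather than cloud-init's url_helper are purely illustrative.)

```python
import json
from urllib.request import urlopen

# OPC v1 instance metadata endpoint, per the Oracle docs linked above.
OPC_V1_INSTANCE = "http://169.254.169.254/opc/v1/instance/"


def fetch_opc_instance_metadata(timeout=10):
    """Fetch and JSON-decode the OPC v1 instance document (sketch only)."""
    with urlopen(OPC_V1_INSTANCE, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```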

@OddBloke (Collaborator, Author)

(I have built an Ubuntu package from this and tested it on both iSCSI and non-iSCSI roots; it successfully boots and provisions SSH keys from a captured image.)

@TheRealFalcon (Member) left a comment

Looks good so far! The metadata parsing code is much simpler.

My main question about the implementation is that certain fields that were being parsed before no longer are (i.e., vendordata_pure and launch-index). Is that still to do, or has something changed such that we no longer need them?

On the test front, my only question is around how you're implementing marks and fixtures, but since that conversation is happening in another channel, I'll wait until there's some closure there before mentioning anything in particular here.

@OddBloke (Collaborator, Author)

Looks good so far! The metadata parsing code is much simpler.

Thanks!

My main question about the implementation is that it looks like certain fields were being parsed before and no longer are (i.e., vendordata_pure

The new API has no equivalent to vendor-data. (vendor-data is an OpenStack concept originally, so it's not all that surprising that it was included in their OpenStack-a-like IMDS but is dropped now.)

and launch-index). Is that still to do, or has something changed that we no longer need them?

From Oracle:

- launch_index is hard-coded to 0

I don't know that anything will be relying on that, but we may as well include it in the metadata we produce just in case. Good catch!
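A rough sketch of what carrying that hard-coded value into the produced metadata might look like (the ociAdName and id fields appear in the hunks quoted later in this review; everything else here is illustrative rather than the exact merged code):

```python
def build_metadata(data):
    """Map OPC v1 instance fields onto cloud-init metadata keys (sketch only)."""
    return {
        "availability-zone": data["ociAdName"],
        "instance-id": data["id"],
        # Oracle hard-codes launch_index to 0, so mirror that just in case
        # anything downstream still reads it.
        "launch-index": 0,
        # ...other keys (local-hostname, name, ...) omitted from this sketch.
    }
```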

@OddBloke force-pushed the oracle branch 5 times, most recently from 7200f02 to 2956c75, on July 17, 2020 20:34
@TheRealFalcon (Member)

Regarding the TODO, if it's up to us, I don't see much benefit in adding it. It's essentially the same as curl'ing the metadata endpoint plus adding is_platform_viable and a system uuid to the output.

@OddBloke force-pushed the oracle branch 3 times, most recently from 41da6d9 to 3920878, on July 24, 2020 16:29
@OddBloke changed the title from "WIP: DataSourceOracle: refactor to use only OPC v1 endpoint" to "DataSourceOracle: refactor to use only OPC v1 endpoint" on Jul 24, 2020
@OddBloke (Collaborator, Author)

(We have confirmed with Oracle that the instance ID provided by the OpenStack endpoint will always be the same as the one provided by the OPC v1 endpoint, so upgrades should be smooth.)

@OddBloke (Collaborator, Author)

I realised that the docstring in DataSourceOracle.py needs updating; I'll also go through all the other documentation/docstrings to make sure there aren't any other stale references.

:return:
The JSON-decoded value of the /opc/v1/instance/ endpoint on the IMDS.
"""
return json.loads(readurl(METADATA_ENDPOINT, retries=2)._response.text)
Collaborator

2? Just because 2 is better than one?

I think we have a pretty convincing argument that we're running on OracleCloud at this point ('OracleCloud.com'); any reason not to bump that up? Or turn it to no retries? 2 just seems an odd number here. Either we're dependent on it (no retries) or we want to be resilient (10 or more?).

I don't have strong feelings on this though.

Collaborator

I knew that two is an impossible number, but this interpretation is new to me ;)

Collaborator (Author)

retries=2 was requested by Oracle in a code review to handle a potential race condition where the IMDS might not be ready. Re-reading their point, I think they really wanted 2 tries, not 2 retries (and the race condition window is reportedly milliseconds, so a single retry will comfortably address it). I'll modify this to 1.

Collaborator

Hmm... well, thanks for pointing out that it is retries, rather than tries. The same argument applies though, just with N-1.

handle a potential race condition where the IMDS might not be ready.

In other places (EC2) we have similar behavior: we wait_for_metadata_service only on the first request, and then for subsequent calls the IMDS must be there, so readurl gets the default retries=0.

and the race condition window is reportedly milliseconds, so a single retry will comfortably address it

"The IMDS is not yet ready" seems unlikely to reliably fix itself milliseconds later. The cloud platform prepared gigabytes of disk, booted a virtual (or physical) machine, loaded a kernel, initramfs, switchroot, init, dhcp... all of this happened, but the system still came up milliseconds before the IMDS? Just seems unlikely.

I won't fight that; please at least document it in a code comment, though.
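A rough sketch of the EC2-style pattern referenced above, waiting once for the IMDS and then relying on it for subsequent reads (wait_for_url and readurl exist in cloudinit.url_helper, but the exact arguments and call site here are illustrative):

```python
from cloudinit import url_helper

METADATA_ENDPOINT = "http://169.254.169.254/opc/v1/instance/"


def crawl_metadata():
    # Wait for the IMDS once, up front (the EC2-style wait_for_metadata_service step).
    url_helper.wait_for_url([METADATA_ENDPOINT], max_wait=30, timeout=5)
    # Subsequent reads assume the IMDS is up, so readurl keeps its default retries=0.
    return url_helper.readurl(METADATA_ENDPOINT).contents
```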

* Both bare-metal and vms use iscsi root
* Both bare-metal and vms provide chassis-asset-tag of OracleCloud.com
* Bare metal instances use iSCSI root, virtual machines do not
* Both bare-metal and VMs provide chassis-asset-tag of OracleCloud.com
Collaborator

The edits here are just kind of odd. You seem to prefer 'Bare metal' to 'bare-metal' on line 11, but then go back to 'bare-metal' on line 12. And then you replaced 'vms' with 'virtual machines' on line 14 -> 11, but capitalized VMs on line 12.

Only really mentioning this because I had other comments.

Collaborator (Author)

Yep, no idea why I did that, will fix.

@OddBloke force-pushed the oracle branch 3 times, most recently from fa9942f to 1de77af, on August 4, 2020 19:11
@OddBloke (Collaborator, Author) commented Aug 4, 2020

@smoser I believe I've addressed your comments now, could you take another look?

@blackboxsw self-assigned this Aug 6, 2020
This would return:
{version: {'user_data': b'blob', 'meta_data': json.loads(blob.decode())
'system_uuid': '3b54f2e0-3ab2-458d-b770-af9926eee3b2'}}
def read_opc_metadata():
Collaborator

There was some effort previously to standardize on, and expose, a commonly named read_metadata method in each datasource module, so that at some point we could potentially use it from a cloud-init query --no-cache call or from hotplug logic to crawl fresh metadata for the datasource without updating the existing cached datasource metadata. Specializing this function name seems to move away from that interest.

Collaborator (Author)

def read_metadata( only appears in 4 places in the codebase:

cloudinit/sources/DataSourceOracle.py
313:def read_metadata(endpoint_base=METADATA_ENDPOINT, sys_uuid=None,

cloudinit/sources/DataSourceExoscale.py
186:def read_metadata(metadata_url=METADATA_URL,

cloudinit/sources/helpers/hetzner.py
16:def read_metadata(url, timeout=2, sec_between=2, retries=30):

cloudinit/sources/helpers/digitalocean.py
185:def read_metadata(url, timeout=2, sec_between=2, retries=30):

These don't have a consistent interface (some require a parameter, others don't), and 2 of them aren't in datasource modules. So I'm not sure we lose much by changing the name of this one.

Collaborator

+1. Since we still have a lot of discussion ahead about making datasources consistent and deciding how that should be done, this certainly doesn't make things any harder, so let's move forward with your changeset.

print(
util.json_dumps(
{
"read_opc_metadata": read_opc_metadata(),
Collaborator

Along the same lines of generalizing the main output, is it worth just calling this key "metadata"? Also in the same vein, the is_platform_viable function was a candidate for generalizing across datasources; is it worth just surfacing an "is_platform_viable" key for this?

Admittedly, if our upcoming hotplug work requires us to standardize on a common function name for all datasources, and on common behavior for the JSON output of main in each datasource, that would be a flag-day type PR that could address all datasources in one go, so maybe we don't have to quibble about it for this PR.

Collaborator (Author)

Yep, similarly to my above response, there are only 3 datasources which implement this at all, and they are not consistent in how they do it:

cloudinit/sources/DataSourceOracle.py
184:    def _is_platform_viable(self):
286:def _is_platform_viable():

cloudinit/sources/DataSourceExoscale.py
137:    def _is_platform_viable(self):

cloudinit/sources/DataSourceAzure.py
502:    def _is_platform_viable(self):
1497:def _is_platform_viable(seed_dir):

AIUI, this output is only really intended for debugging and is already inconsistent across datasources, so I went with key == function name so it's clear, within the file, where the data is coming from.
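Stitching together the hunks quoted in this thread, the debug entry point under discussion has roughly this shape (a sketch, not the exact merged code):

```python
if __name__ == "__main__":
    print(
        util.json_dumps(
            {
                # key == function name, so it is obvious where each value comes from
                "read_opc_metadata": read_opc_metadata(),
                "_is_platform_viable": _is_platform_viable(),
            }
        )
    )
```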

Collaborator

Works for me. I think this could be a topic for us to discuss and refine if, or when, we decide to go this route.

return marker.args
return default
class _FixtureUtils:
"""A namespace for fixture helper functions, used by fixture_utils.
Collaborator

Thanks for this class namespacing, it does make things more clear.
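For readers who haven't seen the pattern, the idea is to group fixture helpers as static methods on a class and expose that class through a single fixture; a minimal sketch (only the _FixtureUtils and fixture_utils names come from the quoted hunk; the helper name and body below are hypothetical):

```python
import pytest


class _FixtureUtils:
    """A namespace for fixture helper functions, used by fixture_utils (sketch)."""

    @staticmethod
    def closest_marker_args_or(request, marker_name, default):
        # Hypothetical helper: return the args of the closest marker with the
        # given name, falling back to a default if the test isn't marked.
        marker = request.node.get_closest_marker(marker_name)
        if marker is not None:
            return marker.args
        return default


@pytest.fixture
def fixture_utils():
    """Expose the helper namespace to tests as a single fixture."""
    return _FixtureUtils
```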

@blackboxsw (Collaborator) left a comment

Looks really good, especially the new fixtures and test rework. I have a couple of comments/concerns for discussion on this pass, and I want to think a bit about the upgrade path from cloud-init-20.2.45 on Oracle systems to this functionality, to work out what could break. I realize we don't have to support upgrades from the OpenStack datasource -> Oracle, but current Oracle -> new Oracle makes me want to think a bit.

def test_metadata_keys_set_correctly(
self, keyname, expected_value, parameterized_oracle_ds,
):
parameterized_oracle_ds._get_data()
Collaborator

I think it'd be nice to add an assert True here, just to confirm we are actually returning the expected bool value, even though we are validating that the metadata is set up properly.
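A sketch of the suggested assertion (the test and fixture names come from the quoted hunk; the second assertion is a guess at the existing parametrized check):

```python
def test_metadata_keys_set_correctly(
    self, keyname, expected_value, parameterized_oracle_ds,
):
    # Assert on the return value too, not just on the metadata side effects.
    assert parameterized_oracle_ds._get_data() is True
    assert expected_value == parameterized_oracle_ds.metadata[keyname]
```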



@mock.patch(DS_PATH + "._is_iscsi_root", lambda: False)
class TestNonIscsiRoot_GetDataBehaviour:
Collaborator

Wow, impressive setup to test this scoping.

Collaborator (Author)

Yeah, the fact that the body of the context manager doesn't use a value returned by the context manager makes it much harder to assert on!
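One generic way to assert that a context manager was entered even when its yielded value goes unused is to patch it with a MagicMock and check __enter__; a minimal sketch (the dhcp.EphemeralDHCPv4 patch target and the oracle_ds fixture are assumptions here, not necessarily what this PR's tests do):

```python
from unittest import mock


def test_dhcp_context_manager_is_entered(oracle_ds):
    # Patch the context manager so we can inspect __enter__, even though the
    # body of the `with` block never uses the value it yields.
    with mock.patch(
        DS_PATH + ".dhcp.EphemeralDHCPv4"  # assumed patch target
    ) as m_dhcp:
        oracle_ds._get_data()
    assert 1 == m_dhcp.return_value.__enter__.call_count
```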

The JSON-decoded value of the /opc/v1/instance/ endpoint on the IMDS.
"""
# retries=1 as requested by Oracle to address a potential race condition
return json.loads(readurl(METADATA_ENDPOINT, retries=1)._response.text)
Collaborator

So we are no longer handling UrlErrors and raising BrokenMetadata with more specific failure information instead of UrlErrors. I realize cloud-init doesn't actually handle BrokenMetadata uniquely, but what are your thoughts about dropping BrokenMetadata as an exception specialization from other/all datasources in the future?

Collaborator (Author)

Looking at the way BrokenMetadata is used elsewhere, it breaks down into two cases: (a) the content of the metadata is incorrect (e.g. random_seed is not base64 as we expect), or (b) there are multiple different "transports" being used, and BrokenMetadata is used to abstract away the differences between the exceptions that those "transports" would raise on failure. (The only example of (b) is OpenStack: ConfigDrive will raise filesystem-related errors whereas the IMDS will raise HTTP errors.)

I don't think BrokenMetadata gains us much here because it doesn't match either of these cases: the UrlError indicates a "transport" issue (rather than a content issue), and we only have one "transport" to worry about.
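For reference, a sketch of the wrapping pattern described in (b), where transport-specific failures are normalized into BrokenMetadata (the import path for BrokenMetadata is assumed here, and this is not this PR's code, which deliberately drops the wrapping):

```python
from cloudinit.sources import BrokenMetadata  # import path assumed for this sketch
from cloudinit.url_helper import UrlError, readurl


def read_metadata(url):
    """Normalize transport-specific failures into BrokenMetadata (sketch only)."""
    try:
        response = readurl(url)
    except UrlError as e:
        # Case (b): hide the transport-specific exception behind BrokenMetadata.
        raise BrokenMetadata("Failed to read %s: %s" % (url, e)) from e
    return response.contents
```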



@pytest.yield_fixture
def httpretty():
Collaborator

Again an excellent fixture addition. Thanks!
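For readers who haven't seen the fixture, an httpretty-enabling yield fixture generally looks something like this (a sketch, not the exact conftest.py code; pytest.yield_fixture was the spelling at the time, while a plain pytest.fixture with a yield behaves the same):

```python
import httpretty as _httpretty
import pytest


@pytest.fixture
def httpretty():
    """Enable httpretty around a test, and clean up afterwards."""
    _httpretty.reset()
    _httpretty.enable(allow_net_connect=False)  # fail fast on unexpected real requests
    yield _httpretty
    _httpretty.disable()
    _httpretty.reset()
```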

{
"read_opc_metadata": read_opc_metadata(),
"_is_platform_viable": _is_platform_viable(),
}
Collaborator

Is it worth exposing the VNIC network metadata here somehow, from _add_network_config_from_opc_imds too, to aid introspection of the content exposed to the VM when calling main?

Collaborator (Author)

Potentially, but @TheRealFalcon will be changing this codepath in his follow-up PR anyway (because read_opc_metadata will evolve); instead of writing more code that will be replaced, let's remember to get him to do this.

return True
self.metadata = {
"availability-zone": data["ociAdName"],
"instance-id": data["id"],
Collaborator

Maybe I'm misreading this. It appears you are trying to limit the metadata preserved as a cache on the datasource by only instrumenting a few keys from the provided data.

I'm not certain why we would not want to cache all metadata here on the datasource. Why not just update the existing data with additional key aliases for the known availability-zone, instance-id, launch-index, local-hostname and name keys? At least from a triage point of view, we would be able to inspect the system and the "raw" values it provides more easily via cloud-init query if we get bugs, or have to develop more features, etc.

An unwritten goal of cloud-init query is, at some point, to accurately represent any instance-data in as close to the original format as possible to potentially allow folks to leverage cloud-init's inherent knowledge of the 'best practices' to talk to a given datasource API (including retries and failure handling etc). This would allow customers on cloud-init driven platforms to simplify their scripts which rely on instance-data/meta-data to just cloud-init query blah instead of knowing the proper routes to curl etc and handle failures.

Collaborator (Author)

Maybe I'm misreading this. It appears you are trying to limit the metadata preserved as a cache on the datasource by only instrumenting a few keys from the provided data.

This may be the effect, but it is not the intent: I'm following the prior implementation of the datasource which did not include the entire metadata in self.metadata. Looking at a few other datasources, it's inconsistent as to whether they include all metadata in self.metadata or use something like self.metadata_full.

Why not just update the existing data with additional key aliases for the known availability-zone, instance-id, launch-index, local-hostname and name keys?

These seem like two different namespaces to me: if we modify the dict that we get from the cloud, then it's hard for users to know which keys are set by cloud-init (and therefore they can rely on across platforms) and which are cloud-specific; all the keys appear similar. (This is further complicated by the fact that cloud-init uses OpenStack/EC2 names for metadata keys internally; on OpenStack/EC2 there wouldn't even be an obvious pattern to allow you to guess which keys are cross-substrate and which are cloud-specific.)

All that said, I think we already have this distinction between namespaces in the codebase: we have accessors like get_public_ssh_keys or availability_zone for the things which are required for cloud-init to function. So the right move here is likely to store the metadata unmodified, but implement all the accessors whose default implementations assume the presence of OpenStack-named keys in the metadata. This should obviate the need for aliasing.
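A minimal sketch of that accessor approach, storing the crawled data unmodified and overriding the accessors cloud-init needs (availability_zone and launch_index are among the accessors on the datasource base class mentioned above; the ociAdName field comes from the hunk quoted earlier, while the rest is illustrative and, as the follow-up below notes, was ultimately not needed):

```python
class DataSourceOracle(sources.DataSource):

    def _get_data(self):
        data = read_opc_metadata()
        # Keep the crawled data unmodified rather than renaming keys in place.
        self._crawled_metadata = data
        self.metadata = data
        return True

    @property
    def availability_zone(self):
        # Override the OpenStack-named default accessor with the OPC field.
        return self.metadata["ociAdName"]

    @property
    def launch_index(self):
        # Oracle hard-codes this to 0.
        return 0
```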

An unwritten goal of cloud-init query is, at some point, to accurately represent any instance-data in as close to the original format as possible

Performing the aliasing required for cloud-init's core functionality in the same data structure as returned by the metadata service would run counter to this goal, I think.

Collaborator (Author)

Upon further inspection, self._crawled_metadata is used in cloud-init query output if it's set (which it is in this PR). This also explains why the prior implementation of the DS was this way.

So I don't think we actually need to change anything here after all?

Collaborator

Upon further inspection, self._crawled_metadata is used in cloud-init query output if it's set (which it is in this PR). This also explains why the prior implementation of the DS was this way.

So I don't think we actually need to change anything here after all?

Correct. Thanks for the cloud-init query dump in IRC. I also launched an instance before and after and determined (surprise) that I did misread the metadata setup: the original metadata content is properly preserved and accessible under cloud-init query ds.meta_data, so this PR didn't unintentionally redact or remove that information. Oops.

@blackboxsw (Collaborator) left a comment

LGTM. Validated behavior before and after upgrade.

@smoser (Collaborator) left a comment

I don't have any objections to this. My comments/requests have been addressed. I'll let @blackboxsw ultimately approve it as he's done more work reviewing recently.

The JSON-decoded value of the /opc/v1/instance/ endpoint on the IMDS.
"""
# retries=1 as requested by Oracle to address a potential race condition
return json.loads(readurl(METADATA_ENDPOINT, retries=1)._response.text)
Collaborator

retries=1 is still just weird, and my argument against it stands. I don't think "the cloud platform requests us to do random nonsense things" is a generally good argument. Either they have good reasons for doing this, or they do not.

But... as I said, I won't fight this.

@OddBloke (Collaborator, Author)

I have performed an upgrade test between the current cloud-init in the Ubuntu archive and a deb built from this branch on the cross-product of {xenial, bionic, focal} x {standard image, minimal image}, and everything behaved as expected. I also performed a release upgrade from bionic->focal, and confirmed that the OpenStack datasource continues to be used, as we would expect.

I believe this is ready to land.

@OddBloke merged commit a6bb375 into canonical:master on Aug 10, 2020
@OddBloke deleted the oracle branch on August 14, 2020 14:41