Azure Storage is Serializing an Environment Variable During Registration #5701

kvnkho · 2022-04-21T19:17:16Z

Initial Problem

The initial Slack report is that AzureResult has a code snippet like this (link to code here):

class AzureResult(Result):
    def __init__(
        self,
        container: str,
        connection_string: str = None,
        connection_string_secret: str = None,
        **kwargs: Any
    ) -> None:
        self.container = container
        self.connection_string = connection_string or os.getenv(
            "AZURE_STORAGE_CONNECTION_STRING"
        )
        self.connection_string_secret = connection_string_secret
        super().__init__(**kwargs)

So the __init__ method is actually loading in an environment variable when a result is defined, when the Flow is serialized, this environment variable is actually stored leading into issues (and security concerns) when accessing the Result during Flow execution.

Initial Look

Does it actually get serialized? After digging with @peytonrunyan, it seems that it actually does, but we have to explain a bit where. First, doing

@task(result=AzureResult(...))
def abc():
    return 1

does not actually include the result when doing flow.serialize() so we are safe when it's defined on the task level. We are more concerned with it defined on the flow level.

On the flow level, doing

with Flow(..., result = AzureResult(...)) as flow:

also is not part of the serialized Flow. This can be confirmed by looking at the FlowSchema here. Results are indeed not serialized.

But Then!

However, the initial Slack report by @zyzil mentions that AzureStorage is using the AzureResult here. There is code that looks like:

class Azure(Storage):
    def __init__(
        self,
        ...
        **kwargs: Any
    ) -> None:
       .....
        self.connection_string_secret = connection_string_secret
        result = AzureResult(
            connection_string_secret=self.connection_string_secret,
            container=container,
        )
        super().__init__(result=result, stored_as_script=stored_as_script, **kwargs)

Notice that AzureResult is attached to the Storage object. When AzureResult is created, aconnection_string_secret is passed and connection_string is left None. Because it is None, the environment variable is pulled in the Result, and then the Result is added to storage.

Thoughts on Fixes

The least invasive non-breaking change should just be editing the AzureResult class in the __init__ method to avoid the environment variable pulling here and then defer it somewhere inside the initialize_service method. On first glance, this should be non-breaking and will be more consistent with S3 and GCS.

It seems a bit weird we even accept a connection_string that is sensitive on the __init__. We might want to remove that, but @peytonrunyan correctly points out it would be a breaking change so we should just leave it as is.

The text was updated successfully, but these errors were encountered:

zyzil · 2022-04-21T21:02:25Z

@kvnkho Based on what I saw, the flow's result is getting serialized even though you said it wasn't a part of the FlowSchema.

I originally discovered this when I started setting my flow's result to AzureResult at the flow level and using Docker storage. I've not used Azure storage but noticed it might be affected as well.

with Flow("test_flow", storage=Docker(), result=AzureResult("results")) as flow:
   # ...

Is enough to reproduce the issue. If you register a flow defined like this and AZURE_STORAGE_CONNECTION_STRING is available at registration, it will be picked up and serialized.

I've put together a small example to produce this due to the complexity of different things involved here. Hopefully it will help in demonstrating the behavior: https://github.com/zyzil/prefect-azure-result-flow-repro

kvnkho · 2022-04-21T21:10:47Z

Thanks for adding more detail! I am pretty confused why this would be the case, but I still preliminarily think that the AzureResult change should suffice. But this reproducible example!

zyzil · 2022-04-21T21:45:55Z

I've also just found that this appears to also occur if you set the result at the task level like you had mentioned:

@task(result=AzureResult("results"))
def get_word() -> str:
    return "World"

I've added an example demonstrating it as well.

cicdw · 2025-01-09T18:29:36Z

Closing as related to a now unsupported version of the SDK

zanieb added bug Something isn't working needs:contributor labels Apr 22, 2022

cicdw removed needs:contributor labels Jul 19, 2024

cicdw added the v1 Related to Prefect 1.x label Jan 9, 2025

cicdw closed this as completed Jan 9, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Azure Storage is Serializing an Environment Variable During Registration #5701

Azure Storage is Serializing an Environment Variable During Registration #5701

kvnkho commented Apr 21, 2022 •

edited

Loading

zyzil commented Apr 21, 2022

kvnkho commented Apr 21, 2022

zyzil commented Apr 21, 2022

cicdw commented Jan 9, 2025

Azure Storage is Serializing an Environment Variable During Registration #5701

Azure Storage is Serializing an Environment Variable During Registration #5701

Comments

kvnkho commented Apr 21, 2022 • edited Loading

Initial Problem

Initial Look

But Then!

Thoughts on Fixes

zyzil commented Apr 21, 2022

kvnkho commented Apr 21, 2022

zyzil commented Apr 21, 2022

cicdw commented Jan 9, 2025

kvnkho commented Apr 21, 2022 •

edited

Loading