Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure Storage is Serializing an Environment Variable During Registration #5701

Closed
kvnkho opened this issue Apr 21, 2022 · 4 comments
Closed
Labels
bug Something isn't working v1 Related to Prefect 1.x

Comments

@kvnkho
Copy link
Contributor

kvnkho commented Apr 21, 2022

Initial Problem

The initial Slack report is that AzureResult has a code snippet like this (link to code here):

class AzureResult(Result):
    def __init__(
        self,
        container: str,
        connection_string: str = None,
        connection_string_secret: str = None,
        **kwargs: Any
    ) -> None:
        self.container = container
        self.connection_string = connection_string or os.getenv(
            "AZURE_STORAGE_CONNECTION_STRING"
        )
        self.connection_string_secret = connection_string_secret
        super().__init__(**kwargs)

So the __init__ method is actually loading in an environment variable when a result is defined, when the Flow is serialized, this environment variable is actually stored leading into issues (and security concerns) when accessing the Result during Flow execution.

Initial Look

Does it actually get serialized? After digging with @peytonrunyan, it seems that it actually does, but we have to explain a bit where. First, doing

@task(result=AzureResult(...))
def abc():
    return 1

does not actually include the result when doing flow.serialize() so we are safe when it's defined on the task level. We are more concerned with it defined on the flow level.

On the flow level, doing

with Flow(..., result = AzureResult(...)) as flow:

also is not part of the serialized Flow. This can be confirmed by looking at the FlowSchema here. Results are indeed not serialized.

But Then!

However, the initial Slack report by @zyzil mentions that AzureStorage is using the AzureResult here. There is code that looks like:

class Azure(Storage):
    def __init__(
        self,
        ...
        **kwargs: Any
    ) -> None:
       .....
        self.connection_string_secret = connection_string_secret
        result = AzureResult(
            connection_string_secret=self.connection_string_secret,
            container=container,
        )
        super().__init__(result=result, stored_as_script=stored_as_script, **kwargs)

Notice that AzureResult is attached to the Storage object. When AzureResult is created, aconnection_string_secret is passed and connection_string is left None. Because it is None, the environment variable is pulled in the Result, and then the Result is added to storage.

Thoughts on Fixes

The least invasive non-breaking change should just be editing the AzureResult class in the __init__ method to avoid the environment variable pulling here and then defer it somewhere inside the initialize_service method. On first glance, this should be non-breaking and will be more consistent with S3 and GCS.

It seems a bit weird we even accept a connection_string that is sensitive on the __init__. We might want to remove that, but @peytonrunyan correctly points out it would be a breaking change so we should just leave it as is.

@zyzil
Copy link

zyzil commented Apr 21, 2022

@kvnkho Based on what I saw, the flow's result is getting serialized even though you said it wasn't a part of the FlowSchema.

I originally discovered this when I started setting my flow's result to AzureResult at the flow level and using Docker storage. I've not used Azure storage but noticed it might be affected as well.

with Flow("test_flow", storage=Docker(), result=AzureResult("results")) as flow:
   # ...

Is enough to reproduce the issue. If you register a flow defined like this and AZURE_STORAGE_CONNECTION_STRING is available at registration, it will be picked up and serialized.

I've put together a small example to produce this due to the complexity of different things involved here. Hopefully it will help in demonstrating the behavior: https://github.com/zyzil/prefect-azure-result-flow-repro

@kvnkho
Copy link
Contributor Author

kvnkho commented Apr 21, 2022

Thanks for adding more detail! I am pretty confused why this would be the case, but I still preliminarily think that the AzureResult change should suffice. But this reproducible example!

@zyzil
Copy link

zyzil commented Apr 21, 2022

I've also just found that this appears to also occur if you set the result at the task level like you had mentioned:

@task(result=AzureResult("results"))
def get_word() -> str:
    return "World"

I've added an example demonstrating it as well.

@zanieb zanieb added bug Something isn't working needs:contributor labels Apr 22, 2022
@cicdw cicdw added the v1 Related to Prefect 1.x label Jan 9, 2025
@cicdw
Copy link
Member

cicdw commented Jan 9, 2025

Closing as related to a now unsupported version of the SDK

@cicdw cicdw closed this as completed Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working v1 Related to Prefect 1.x
Projects
None yet
Development

No branches or pull requests

4 participants