You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
So the __init__ method is actually loading in an environment variable when a result is defined, when the Flow is serialized, this environment variable is actually stored leading into issues (and security concerns) when accessing the Result during Flow execution.
Initial Look
Does it actually get serialized? After digging with @peytonrunyan, it seems that it actually does, but we have to explain a bit where. First, doing
@task(result=AzureResult(...))defabc():
return1
does not actually include the result when doing flow.serialize() so we are safe when it's defined on the task level. We are more concerned with it defined on the flow level.
On the flow level, doing
withFlow(..., result=AzureResult(...)) asflow:
also is not part of the serialized Flow. This can be confirmed by looking at the FlowSchemahere. Results are indeed not serialized.
But Then!
However, the initial Slack report by @zyzil mentions that AzureStorage is using the AzureResulthere. There is code that looks like:
Notice that AzureResult is attached to the Storage object. When AzureResult is created, aconnection_string_secret is passed and connection_string is left None. Because it is None, the environment variable is pulled in the Result, and then the Result is added to storage.
Thoughts on Fixes
The least invasive non-breaking change should just be editing the AzureResult class in the __init__ method to avoid the environment variable pulling here and then defer it somewhere inside the initialize_service method. On first glance, this should be non-breaking and will be more consistent with S3 and GCS.
It seems a bit weird we even accept a connection_string that is sensitive on the __init__. We might want to remove that, but @peytonrunyan correctly points out it would be a breaking change so we should just leave it as is.
The text was updated successfully, but these errors were encountered:
@kvnkho Based on what I saw, the flow's result is getting serialized even though you said it wasn't a part of the FlowSchema.
I originally discovered this when I started setting my flow's result to AzureResult at the flow level and using Docker storage. I've not used Azure storage but noticed it might be affected as well.
Is enough to reproduce the issue. If you register a flow defined like this and AZURE_STORAGE_CONNECTION_STRING is available at registration, it will be picked up and serialized.
Thanks for adding more detail! I am pretty confused why this would be the case, but I still preliminarily think that the AzureResult change should suffice. But this reproducible example!
Initial Problem
The initial Slack report is that
AzureResult
has a code snippet like this (link to code here):So the
__init__
method is actually loading in an environment variable when a result is defined, when the Flow is serialized, this environment variable is actually stored leading into issues (and security concerns) when accessing the Result during Flow execution.Initial Look
Does it actually get serialized? After digging with @peytonrunyan, it seems that it actually does, but we have to explain a bit where. First, doing
does not actually include the result when doing
flow.serialize()
so we are safe when it's defined on the task level. We are more concerned with it defined on the flow level.On the flow level, doing
also is not part of the serialized Flow. This can be confirmed by looking at the
FlowSchema
here. Results are indeed not serialized.But Then!
However, the initial Slack report by @zyzil mentions that
AzureStorage
is using theAzureResult
here. There is code that looks like:Notice that
AzureResult
is attached to theStorage
object. WhenAzureResult
is created, aconnection_string_secret
is passed andconnection_string
is leftNone
. Because it isNone
, the environment variable is pulled in theResult
, and then theResult
is added to storage.Thoughts on Fixes
The least invasive non-breaking change should just be editing the
AzureResult
class in the__init__
method to avoid the environment variable pulling here and then defer it somewhere inside the initialize_service method. On first glance, this should be non-breaking and will be more consistent with S3 and GCS.It seems a bit weird we even accept a
connection_string
that is sensitive on the__init__
. We might want to remove that, but @peytonrunyan correctly points out it would be a breaking change so we should just leave it as is.The text was updated successfully, but these errors were encountered: