-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add columns to empty table in load_as_pandas and fix issues for complex types #37
Conversation
…the correct names and dtypes based on schema_string Signed-off-by: FX196 <yuhong.chen@databricks.com>
Signed-off-by: FX196 <yuhong.chen@databricks.com>
Signed-off-by: FX196 <yuhong.chen@databricks.com>
…umns-to-empty-table
Signed-off-by: FX196 <yuhong.chen@databricks.com>
Signed-off-by: FX196 <yuhong.chen@databricks.com>
…ort all types Signed-off-by: FX196 <yuhong.chen@databricks.com>
…umns-to-empty-table
@ueshin could you take another look? |
python/delta_sharing/converter.py
Outdated
elif schema_type == "date": | ||
return pd.Series([pd.Timestamp(0).date()]) | ||
elif schema_type == "timestamp": | ||
return pd.Series([pd.Timestamp(0).date()], dtype=np.dtype("datetime64[ns]")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think we need .date()
here.
elif schema_type == "binary": | ||
return None # partition on binary column not supported | ||
elif isinstance(schema_type, dict) and schema_type["type"] in ("array", "struct", "map"): | ||
return None # partition on complex column not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, Good catch!
python/delta_sharing/converter.py
Outdated
elif schema_type == "binary": | ||
return None # partition on binary column not supported | ||
elif isinstance(schema_type, dict) and schema_type["type"] in ("array", "struct", "map"): | ||
return None # partition on complex column not supported | ||
|
||
# TODO: binary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can remove this now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing these will actually break a lot of other stuff since other parts of the code use the keys of converters
to iterate over the columns of the table. We change things to use schema_json
, but it'll be a lot of work and Ryan said it wasn't worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we can remove the comment # TODO: binary
.
reader = DeltaSharingReader( | ||
Table("table_name", "share_name", "schema_name"), RestClientMock() # type: ignore | ||
) | ||
pdf = reader.to_pandas() | ||
|
||
reader = DeltaSharingReader(Table(name="table7", share="share1", schema="default"), rest_client) | ||
expected = reader.to_pandas().iloc[0:0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the expected
be created in local? Otherwise, this is trying to test DeltaSharingReader
using DeltaSharingReader
. I think it's weird.
The integration test should be at test_load
in test_delta_sharing.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The objective was to make sure the empty table schema is consistent with the non-empty ones, so I added a reference table to compare with. If we make our own reference schema, we can't be sure our reference schema matches the real one when a table is actually read, hence we're using DeltaSharingReader
.
Although we're using the class to test itself, we're using the non-empty table behavior to test the empty table behavior, so it's a different logical path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make a local copy of the parquet file for the table so that this test isn't an integration test, but it's difficult to be sure we have the correct schema without actually reading a table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I'm still feeling weird, though.
Anyway the behavior looks fine if the table7
has a proper schema.
I'd leave it to @zsxwing for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with this in the short term. Eventually we should make the server support local file system and use all of these tables locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, LGTM.
reader = DeltaSharingReader( | ||
Table("table_name", "share_name", "schema_name"), RestClientMock() # type: ignore | ||
) | ||
pdf = reader.to_pandas() | ||
|
||
reader = DeltaSharingReader(Table(name="table7", share="share1", schema="default"), rest_client) | ||
expected = reader.to_pandas().iloc[0:0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I'm still feeling weird, though.
Anyway the behavior looks fine if the table7
has a proper schema.
I'd leave it to @zsxwing for this.
This PR fixes two issues:
load_as_pandas
should return an empty Pandas DataFrame with the table schema. Fixes load_as_pandas should return a DataFrame with correct schema for an empty table #35load_as_pandas
should not fail. In a Delta table, only partition columns are not allowed to use complex types.