Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add columns to empty table in load_as_pandas and fix issues for complex types #37

Merged
merged 14 commits into from
Jun 22, 2021

Conversation

FX196
Copy link
Collaborator

@FX196 FX196 commented Jun 7, 2021

This PR fixes two issues:

FX196 added 3 commits June 7, 2021 14:06
…the correct names and dtypes based on schema_string

Signed-off-by: FX196 <yuhong.chen@databricks.com>
Signed-off-by: FX196 <yuhong.chen@databricks.com>
Signed-off-by: FX196 <yuhong.chen@databricks.com>
@zsxwing zsxwing requested a review from ueshin June 7, 2021 21:36
python/delta_sharing/converter.py Outdated Show resolved Hide resolved
python/delta_sharing/converter.py Outdated Show resolved Hide resolved
python/delta_sharing/converter.py Outdated Show resolved Hide resolved
python/delta_sharing/converter.py Show resolved Hide resolved
FX196 added 3 commits June 8, 2021 10:17
Signed-off-by: FX196 <yuhong.chen@databricks.com>
Signed-off-by: FX196 <yuhong.chen@databricks.com>
@FX196 FX196 requested a review from ueshin June 8, 2021 18:14
@zsxwing zsxwing changed the title Add columns to empty table in load_as_pandas Add columns to empty table in load_as_pandas and fix issues for complex types Jun 22, 2021
@zsxwing
Copy link
Member

zsxwing commented Jun 22, 2021

@ueshin could you take another look?

elif schema_type == "date":
return pd.Series([pd.Timestamp(0).date()])
elif schema_type == "timestamp":
return pd.Series([pd.Timestamp(0).date()], dtype=np.dtype("datetime64[ns]"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think we need .date() here.

Comment on lines +112 to +115
elif schema_type == "binary":
return None # partition on binary column not supported
elif isinstance(schema_type, dict) and schema_type["type"] in ("array", "struct", "map"):
return None # partition on complex column not supported
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, Good catch!

elif schema_type == "binary":
return None # partition on binary column not supported
elif isinstance(schema_type, dict) and schema_type["type"] in ("array", "struct", "map"):
return None # partition on complex column not supported

# TODO: binary
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can remove this now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing these will actually break a lot of other stuff since other parts of the code use the keys of converters to iterate over the columns of the table. We change things to use schema_json, but it'll be a lot of work and Ryan said it wasn't worth it.

Copy link
Collaborator

@ueshin ueshin Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we can remove the comment # TODO: binary.

Comment on lines +231 to +237
reader = DeltaSharingReader(
Table("table_name", "share_name", "schema_name"), RestClientMock() # type: ignore
)
pdf = reader.to_pandas()

reader = DeltaSharingReader(Table(name="table7", share="share1", schema="default"), rest_client)
expected = reader.to_pandas().iloc[0:0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the expected be created in local? Otherwise, this is trying to test DeltaSharingReader using DeltaSharingReader. I think it's weird.

The integration test should be at test_load in test_delta_sharing.py.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The objective was to make sure the empty table schema is consistent with the non-empty ones, so I added a reference table to compare with. If we make our own reference schema, we can't be sure our reference schema matches the real one when a table is actually read, hence we're using DeltaSharingReader.

Although we're using the class to test itself, we're using the non-empty table behavior to test the empty table behavior, so it's a different logical path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make a local copy of the parquet file for the table so that this test isn't an integration test, but it's difficult to be sure we have the correct schema without actually reading a table.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I'm still feeling weird, though.
Anyway the behavior looks fine if the table7 has a proper schema.
I'd leave it to @zsxwing for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with this in the short term. Eventually we should make the server support local file system and use all of these tables locally.

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM.

Comment on lines +231 to +237
reader = DeltaSharingReader(
Table("table_name", "share_name", "schema_name"), RestClientMock() # type: ignore
)
pdf = reader.to_pandas()

reader = DeltaSharingReader(Table(name="table7", share="share1", schema="default"), rest_client)
expected = reader.to_pandas().iloc[0:0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I'm still feeling weird, though.
Anyway the behavior looks fine if the table7 has a proper schema.
I'd leave it to @zsxwing for this.

@zsxwing zsxwing merged commit ede72a1 into delta-io:main Jun 22, 2021
@zsxwing zsxwing added this to the 0.2.0 milestone Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

load_as_pandas should return a DataFrame with correct schema for an empty table
3 participants