Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sk] Improved PostgreSQL Data Loader #651

Merged
merged 17 commits into from
Jul 14, 2022
Merged

Conversation

skunichetty
Copy link
Contributor

@skunichetty skunichetty commented Jul 14, 2022

Summary

This PR refactors the PostgreSQL IO handler to avoid using SQLAlchemy. There are a few benefits to this change:

  • Reduces the number of project dependencies, especially since SQLAlchemy was only used in a single location in code
  • Unifies interface between PostgreSQL IO Handler and handlers for Snowflake and Redshift (SQLAlchemy used a slightly different architecture than regular DB connectors)
  • Faster export times. Testing on a data frame with 100,000 rows and columns of varying types (integers, dates, times, datetimes, strings, floats, categories) was reduced from (averaged over 30 trials):
    • 9.414 seconds → 1.228 seconds when the index of the data frame is included
    • 10.109 seconds → 1.219 seconds when the index of the data frame is excluded

The change made was to use PostgreSQLs COPY TO query to improve export times in place of pandas.DataFrame.to_sql(). To do this, a new framework was added to manage type conversion (mapping Pandas datatypes to PostgreSQL data types). This framework involves:

  1. Inferring the pandas datatype of each column of the input data frame
  2. Mapping the pandas datatype to a database relative data type (in this case, Postgres data types)
  3. Cleaning the pandas data frame to ensure correct type mapping (e.g. np.timedelta64 has no appropriate type in PostgreSQL, so it is mapped to an integer before exporting)

Tests

Changes tested locally with PostgreSQL database alongside unit tests.

cc

@wangxiaoyou1993 @dy46

@skunichetty skunichetty marked this pull request as ready for review July 14, 2022 19:02
@skunichetty skunichetty changed the title [sk] [WIP] Improved PostgreSQL Data Loader [sk] Improved PostgreSQL Data Loader Jul 14, 2022
df: DataFrame,
table_name: str,
index: bool = False,
if_exists: str = 'replace',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's create constants/enum for the options

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

from mage_ai.io.export_utils import (
BadConversionError,
clean_df_for_export,
PandasTypes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: order

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

self, df: DataFrame, name: str, index: bool = False, if_exists: str = 'replace', **kwargs
self,
df: DataFrame,
table_name: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you also need to specify schema_name to identify a table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot about schemas, added schema_name parameter + auto creation of schema if the given schema doesn't exist.

)
self.conn.commit()

def __table_exists(self, table_name: str) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. you need both schema_name and table_name to identify a table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added schema name to table existence check.

@skunichetty skunichetty merged commit 2920b4d into master Jul 14, 2022
@skunichetty skunichetty deleted the sk--improved_psql_handler branch July 14, 2022 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants