AWS Data Wrangler

The missing link between AWS services and the most popular Python data libraries

AWS Data Wrangler aims to fill a gap between AWS Big Data Services (Glue, Athena, EMR, Redshift) and the most popular Python libraries for lightweight workloads.


Contents: Installation | Usage | Known Limitations | Contributing | Dependencies | License


Installation

pip install awswrangler

AWS Data Wrangler runs on Python 2 and 3.

Usage

Writing a Pandas DataFrame to the Data Lake:

import awswrangler

# Write the DataFrame to S3 as partitioned Parquet and register the metadata in the Glue Catalog
awswrangler.s3.write(
    df=df,
    database="database",
    path="s3://...",
    file_format="parquet",
    preserve_index=True,
    mode="overwrite",
    partition_cols=["col"],
)

If a Glue Database name is passed, all the metadata is created in the Glue Catalog. If not, only the S3 data write is done.
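
For example, a plain S3 write without a Glue Database looks like the sketch below (no catalog metadata is created; the S3 path is a placeholder, and CSV is used to illustrate the other supported format):

import awswrangler

# Write only to S3; no Glue Catalog metadata is created because no database is passed
awswrangler.s3.write(
    df=df,
    path="s3://...",
    file_format="csv",
    preserve_index=True,
    mode="overwrite",
)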

Reading from the Data Lake to a Pandas DataFrame:

df = awswrangler.athena.read("database", "select * from table")
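
The result is a regular Pandas DataFrame, so the usual Pandas API applies and the frame can be written back with the call shown above. A minimal round-trip sketch (database, table, and path are placeholders):

import awswrangler

# Run the query on Athena and load the result into a Pandas DataFrame
df = awswrangler.athena.read("database", "select * from table")

# Standard Pandas operations work on the result
print(df.head())

# Write the (possibly transformed) frame back to the Data Lake
awswrangler.s3.write(
    df=df,
    database="database",
    path="s3://...",
    file_format="parquet",
    mode="overwrite",
)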

Dependencies

The AWS Data Wrangler project relies on other great initiatives:

  • Boto3
  • Pandas
  • Apache Arrow
  • Dask s3fs

Known Limitations

  • For now, it only writes in the Parquet and CSV file formats
  • For now, it only reads through AWS Athena
  • For now, there is no compression support
  • For now, there is no nested type support

Contributing

Almost all features rely on AWS services that do not yet have community mock tools (AWS Glue, AWS Athena), so we focus on integration tests instead of unit tests.

You will therefore need to provide an S3 bucket and a Glue/Athena database through environment variables:

export AWSWRANGLER_TEST_BUCKET=...

export AWSWRANGLER_TEST_DATABASE=...

CAUTION: This may incur costs in your AWS account.

make init

Make your changes...

make format

make lint

make test

License

This library is licensed under the Apache 2.0 License.