Data pipeline that extracts a user's song listening history from the Spotify API using Python, PostgreSQL, dbt, Metabase, Airflow, and Docker
Deep dive into a user's song listening history to retrieve information about top artists, top tracks, top genres, and more. This is a personal side project for fun to recreate Spotify Wrapped, but at a more frequent cadence for quicker and more detailed insights. The pipeline calls the Spotify API every hour from hours 0-6 and 14-23 UTC (basically whenever I'm awake) to extract a user's song listening history, loads the responses into a database, applies transformations, and visualizes the metrics in a dashboard. Since the dataset is small and this doesn't need to run 24/7, everything is built with open-source tools and hosted locally to avoid any cost.
- Containerization - Docker, Docker Compose
- Orchestration - Airflow
- Database - PostgreSQL
- Transformation - dbt
- Data Visualization - Metabase
- Language - Python
- The main.py script is triggered every hour (hours 0-6 and 14-23 UTC) via Airflow to refresh the access token, connect to the Postgres database to check for the latest listened-at time, and call the Spotify API to retrieve the most recently played songs and their corresponding genres (a minimal sketch of this flow is included after this list).
- Responses are appended to CSV files named 'YYYY-MM-DD.csv' on the local file system. These act as the replayable source, since the Spotify API only returns the 50 most recently played songs and no historical data; each day's file keeps accumulating the most recently played songs for that date.
- Data is copied from the CSV files into the respective Postgres tables, spotify_songs and spotify_genres (see the DAG sketch after this list).
- A dbt run task is triggered to run transformations on top of the staging data and produce analytical and reporting tables/views.
- dbt test runs after dbt run completes successfully to ensure all tests pass.
- Tables/views are fed into Metabase and the metrics are visualized through a dashboard.
- A Slack subscription is set up in Metabase to send a weekly summary every Monday.
Throughout this entire process, if any Airflow task fails, an automatic Slack alert is sent to a custom Slack channel.
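Below is a minimal sketch of the extraction step. It assumes environment variables named SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET, SPOTIFY_REFRESH_TOKEN, and POSTGRES_DSN, a spotify_songs table with a played_at column stored as a naive UTC timestamp, and a local ./data directory; all of these names are placeholders rather than the project's actual layout.

```python
import csv
import os
from datetime import datetime, timezone

import psycopg2
import requests

TOKEN_URL = "https://accounts.spotify.com/api/token"
RECENTLY_PLAYED_URL = "https://api.spotify.com/v1/me/player/recently-played"


def refresh_access_token() -> str:
    """Exchange the stored refresh token for a fresh access token."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "refresh_token",
            "refresh_token": os.environ["SPOTIFY_REFRESH_TOKEN"],
        },
        auth=(os.environ["SPOTIFY_CLIENT_ID"], os.environ["SPOTIFY_CLIENT_SECRET"]),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def get_latest_listened_ms(conn) -> int:
    """Read the most recent played_at already in Postgres, as epoch milliseconds."""
    with conn.cursor() as cur:
        cur.execute("SELECT MAX(played_at) FROM spotify_songs;")
        latest = cur.fetchone()[0]
    if latest is None:
        return 0
    # Assumes played_at is stored as a naive UTC timestamp.
    return int(latest.replace(tzinfo=timezone.utc).timestamp() * 1000)


def fetch_recently_played(token: str, after_ms: int) -> list:
    """Fetch up to 50 tracks played after the given cursor."""
    resp = requests.get(
        RECENTLY_PLAYED_URL,
        headers={"Authorization": f"Bearer {token}"},
        params={"limit": 50, "after": after_ms},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])


def append_to_daily_csv(items: list, data_dir: str = "./data") -> None:
    """Append raw plays to today's YYYY-MM-DD.csv, the replayable source."""
    path = os.path.join(data_dir, f"{datetime.now(timezone.utc):%Y-%m-%d}.csv")
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["played_at", "track_name", "artist_name"])
        for item in items:
            writer.writerow([
                item["played_at"],
                item["track"]["name"],
                item["track"]["artists"][0]["name"],
            ])


if __name__ == "__main__":
    token = refresh_access_token()
    with psycopg2.connect(os.environ["POSTGRES_DSN"]) as conn:
        after_ms = get_latest_listened_ms(conn)
    append_to_daily_csv(fetch_recently_played(token, after_ms))
```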
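And a rough sketch of how the Airflow DAG could wire together the Postgres load, dbt run, dbt test, and the Slack failure alert. The schedule matches the hours described above; the connection ID, file paths, dbt project directory, and SLACK_WEBHOOK_URL variable are assumptions, not the project's actual configuration.

```python
import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def notify_slack_failure(context):
    """on_failure_callback: post the failed task to a Slack incoming webhook."""
    ti = context["task_instance"]
    text = f":red_circle: Task {ti.task_id} in DAG {ti.dag_id} failed."
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)


def copy_csv_to_postgres():
    """Load today's CSV into the staging table with COPY."""
    hook = PostgresHook(postgres_conn_id="postgres_default")
    path = f"/data/{datetime.utcnow():%Y-%m-%d}.csv"
    hook.copy_expert(
        "COPY spotify_songs (played_at, track_name, artist_name) "
        "FROM STDIN WITH CSV HEADER",
        path,
    )


with DAG(
    dag_id="spotify_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0-6,14-23 * * *",  # hourly during hours 0-6 and 14-23 UTC
    catchup=False,
    default_args={"on_failure_callback": notify_slack_failure},
) as dag:
    load_songs = PythonOperator(
        task_id="copy_csv_to_postgres",
        python_callable=copy_csv_to_postgres,
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt && dbt run --profiles-dir .",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt && dbt test --profiles-dir .",
    )

    load_songs >> dbt_run >> dbt_test
```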
- Get Spotify API Access
- Build Docker Containers for Airflow
- Set Up Airflow Connection to Postgres
- Install dbt Core
- Enable Airflow Slack Notifications
- Install Metabase
- Create a BranchPythonOperator to first check if the API payload is empty. If it is empty, proceed directly to the end task; otherwise, continue to the downstream tasks (see the sketch at the end of this list).
- Implement data quality checks to catch any potential errors in the dataset
- Create unit tests to ensure the pipeline is running as intended
- Include CI/CD
- Create more visualizations to uncover further insights once Spotify sends back my entire song listening history from 10+ years ago to the current date (this needed to be requested separately since the current API only allows requesting the 50 most recently played tracks)
- If and when Spotify allows requesting historical data through the API, implement backfill capability
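A rough sketch of how the planned BranchPythonOperator could work, assuming the extract task pushes the count of new plays to XCom and using EmptyOperator placeholders for the real downstream tasks; the task IDs and XCom key are illustrative, not existing code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_branch(ti):
    """Skip downstream work when the extract task pulled no new plays."""
    # Assumes the extract task pushes the number of new plays to XCom under "n_items".
    n_items = ti.xcom_pull(task_ids="extract_spotify_data", key="n_items")
    return "end" if not n_items else "copy_csv_to_postgres"


with DAG(
    dag_id="spotify_pipeline_branch_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    check_payload = BranchPythonOperator(
        task_id="check_payload",
        python_callable=choose_branch,
    )
    # Placeholders standing in for the real load/transform tasks.
    copy_csv_to_postgres = EmptyOperator(task_id="copy_csv_to_postgres")
    # "none_failed" lets the end task run even when the load branch is skipped.
    end = EmptyOperator(task_id="end", trigger_rule="none_failed")

    check_payload >> [copy_csv_to_postgres, end]
    copy_csv_to_postgres >> end
```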