This project focuses on collecting, processing, and analyzing COVID-19 data using various data engineering tools and technologies. The project employs Terraform for infrastructure setup, dbt for analytical engineering, Mage.ai for workflow orchestration and data transformation, Google Cloud Platform (GCP) BigQuery for data warehousing, PySpark for batch processing, and Confluent Kafka for real-time data processing.
- Introduction
- Technologies Used
- Project Structure
- Setup Instructions
- Usage
- Dashboard
- Contributing
- License
The COVID-19 pandemic has generated massive amounts of data related to infection rates, testing, hospitalizations, and more. This project aims to centralize, process, and analyze this data to provide valuable insights for healthcare professionals, policymakers, and the general public.
- Terraform: Infrastructure as code tool used to provision and manage the project's infrastructure on cloud platforms.
- dbt (Data Build Tool): Analytics engineering tool used for transforming and modeling data in the data warehouse.
- Mage.ai: Workflow orchestration and data transformation platform used to streamline data processing tasks.
- Google Cloud Platform (GCP) BigQuery: Fully managed, serverless data warehouse used for storing and querying large datasets.
- PySpark: Python API for Apache Spark, used for large-scale batch processing of data (see the sketch after this list).
- Confluent Kafka: Distributed streaming platform used for real-time data processing and event streaming.
- Docker Compose: Tool for defining and running multi-container Docker applications. Used to run Mage.ai and Confluent Kafka services.
- Looker: Business intelligence and data visualization platform used to create dashboards and reports.
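To illustrate the batch-processing layer, the snippet below sketches a minimal PySpark job that aggregates daily new cases per country. The input path and column names (`date`, `country`, `new_cases`) are placeholders and depend on the dataset actually used.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical input path and column names -- adjust to the dataset you load.
INPUT_PATH = "data/covid19_cases.csv"

spark = SparkSession.builder.appName("covid19-batch").getOrCreate()

# Read the raw CSV and aggregate new cases per country and day.
cases = spark.read.csv(INPUT_PATH, header=True, inferSchema=True)
daily_totals = (
    cases.groupBy("country", "date")
    .agg(F.sum("new_cases").alias("total_new_cases"))
    .orderBy("date")
)

daily_totals.show(10)
spark.stop()
```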
The project is structured as follows:
```
covid19/
│
├── analytics/
│   ├── dbt/
│   │   ├── analyses/
│   │   ├── macros/
│   │   └── ...
│   └── ...
│
├── containerization/
│   ├── docker/
│   │   ├── docker-compose.yml
│   │   └── ...
│   └── ...
│
├── workflows/
│   ├── mage/
│   │   ├── export_data/
│   │   │   └── export_to_gcp.py
│   │   ├── load_data/
│   │   │   └── load_data_to_gcp.py
│   │   └── ...
│
├── kafka/
│   ├── consumer.py
│   └── producer.py
└── README.md
```
- Infrastructure Setup: Use the Terraform scripts in the `infrastructure/terraform/` directory to provision the required cloud resources. Make sure to configure your cloud provider credentials and settings.
- Analytical Engineering: Use the dbt models in the `analytics/dbt/models/` directory to transform and model data in the data warehouse.
- Workflow Orchestration: Define and manage data processing workflows using Mage.ai pipelines in the `workflows/mage/workflows/` directory (a loader sketch follows this list).
- Data Warehousing: Load data into the BigQuery data warehouse with the export pipelines in the `workflows/mage/` directory (an exporter sketch follows this list).
- Real-time Processing: Develop real-time data processing pipelines using the Confluent Kafka consumer and producer scripts in the `kafka/` directory (a producer/consumer sketch follows this list).
- Docker Compose Setup: Use the provided `docker-compose.yml` file to run the Mage.ai and Confluent Kafka services. Make sure Docker is installed on your system.
- Looker Dashboards: Use Looker to import and customize the dashboards.
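As an illustration, a Mage.ai data loader block for the `load_data/` step might look like the sketch below. It follows Mage's standard data-loader template; the source URL is hypothetical, and the exact decorator imports can vary between Mage versions.

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_covid_data(*args, **kwargs) -> pd.DataFrame:
    """Fetch a COVID-19 CSV extract and return it as a DataFrame."""
    url = 'https://example.com/covid19_cases.csv'  # hypothetical source URL
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```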
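The export step can be sketched the same way. The block below is modeled on Mage's standard BigQuery exporter template; the table ID is a placeholder, and it assumes an `io_config.yaml` with a `default` profile holding your GCP credentials.

```python
from os import path

from pandas import DataFrame

from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from mage_ai.settings.repo import get_repo_path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_big_query(df: DataFrame, **kwargs) -> None:
    """Write the transformed DataFrame to a BigQuery table."""
    table_id = 'your-project.your_dataset.covid19_cases'  # placeholder table ID
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # replace the table on each run
    )
```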
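For the real-time pipeline, the producer and consumer scripts typically follow the standard Confluent pattern. The sketch below uses the `confluent-kafka` Python client; the broker address and topic name are assumptions that should match your `docker-compose.yml` and project configuration.

```python
import json

from confluent_kafka import Consumer, Producer

BOOTSTRAP_SERVERS = 'localhost:9092'  # assumed broker address from docker-compose.yml
TOPIC = 'covid19_cases'               # hypothetical topic name


def produce_record(record: dict) -> None:
    """Publish a single COVID-19 record to the Kafka topic as JSON."""
    producer = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
    producer.produce(TOPIC, value=json.dumps(record).encode('utf-8'))
    producer.flush()  # block until delivery is confirmed


def consume_records() -> None:
    """Continuously read records from the topic and print them."""
    consumer = Consumer({
        'bootstrap.servers': BOOTSTRAP_SERVERS,
        'group.id': 'covid19-consumer',
        'auto.offset.reset': 'earliest',
    })
    consumer.subscribe([TOPIC])
    try:
        while True:
            msg = consumer.poll(1.0)  # wait up to one second for a message
            if msg is None:
                continue
            if msg.error():
                print(f'Consumer error: {msg.error()}')
                continue
            print(json.loads(msg.value().decode('utf-8')))
    finally:
        consumer.close()


if __name__ == '__main__':
    produce_record({'date': '2021-01-01', 'country': 'US', 'new_cases': 1000})
    consume_records()
```

Run the producer and consumer in separate terminals once the Kafka service from `docker-compose.yml` is up.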
- Currently in progress.
- Modify and extend the provided scripts and configurations to suit your specific data processing requirements.
- Run Docker Compose to start the Mage.ai and Confluent Kafka services.
- Use Looker to visualize and explore data through the imported dashboards.
- Refer to individual tool documentation for detailed usage instructions and best practices.
Contributions to improve and expand this project are welcome! Feel free to fork the repository, make your changes, and submit a pull request.
This project is licensed under the MIT License.