- docker
- docker-compose
- bash (optional)
- build the project
$ make start
- initialize the database
$ make init_db
- add calendar years
$ make calendar start=2018 end=2023
-
customer.csv
$ make customer FILE=/app/resources/customer.csv
-
transaction.csv
$ make transaction FILE=/app/resources/transaction.csv
-
exchange_rate.csv
$ make exchange_rate FILE=/app/resources/exchange_rate.csv
Design a Data Flow architecture
Which technologies do you propose? What are the advantages and disadvantages of those?
- API - to receive request from other platforms and enable a point of receiving the trigger. The API code has access to meta data tables, which can show the progress of the data journey - ingestion, validation, transformation, and extraction.
- Meta data storage - to store the data of the request received to ingest/export data. It can also capture some key attributes from each request received to derive insights at later stages. It will enable logging of the data received and overrides (versioning) if requested.
- SQS/RabbitMQ - Any queue mechanism, where API will register the request to be picked up based upon the availability of the processor.
- Processor - Key engine which will listen to traffic being received in the SQS/RabbitMQ and will spawn a necessary process to execute it.
- Python script
- Loader - this will be request specific and will expand to incorporate more data source types like Salesforce, Kafka, S3, direct upload, other databases, etc.
- Staging - this layer is responsible to hold the raw data in as is form.
- Pipeline - basically set of queries - written to validate, clean, and transform data to ingest it into facts, dimensions and lookup tables.
- Views - these are virtual tables which abstract the necessary joins, and aggregations for the access of data via tableau or other BI tools.
Design a relational data model to represent the given datasets. It should be optimized for analytical purposes.
Explain the modeling approach taken. Is there any alternative to the model chosen (trade-offs, limitations, advantages)?
I took approach of snowflake schema - where I introduced 2 fact tables to store exchange rate and transaction data. Also, dimensions are introduced:
- dim_country - to store distinct country records
- dim_currency - to store distinct currency
- dim_calendar - indexed table with date, month, and year columns
- dim_customer - to store distinct customers and reference it in fact tables.
A lookup table (map_country_currency) is also introduced to store the dependency between country and currency using the outgoing transaction data (per customer).
- Tradeoffs:
- query complexity - since data is normalized, so would need some joins to fetch data
- join overhead - due to which, proper indexing and query optimization is essential.
- Limitations:
- transformations required for ingestion of data.
- validations/updates to be in place of dimension table - to be able to ingest the data.
- Advantages:
- reduced data redundancy - created dimension data for the redundant data.
- improved data consistency - better control of updates on the dimension data.
Propose a primary key implementation for each table, elaborating on the solution adopted.
- dim_country - country_id
- dim_currency - currency_id
- map_country_currency - map_country_currency_id
- dim_calendar - date_id
- dim_customer - customer_id
- fact_transaction - transaction_id
- fact_exchange_rate - exchange_rate_id
Explain why your code is performant or how it can be improved.
I am using a staging table to ingest data as is, and then validating the data in the database. This allows me to scale the queries in the database, because the validation is done in the staging table, and then only the validated data is loaded into the production schema.
This helps in making the code performant, by allowing me to perform validation and transformation on the staging table (ELT approach).
-
Create a view which shows the total amount in EUR, number of transactions and number of unique users per month and country.
view name:transaction_summary_view
file name:/sql/db_setup.sql
-
Based on a date, show the evolution of the exchange rate for each currency (based on GBP) in percentage for the next 15 days
view name:exchange_rate_view
file name:/sql/db_setup.sql