This project uses a Python script to simulate an ingestion/ELT (Extract, Load, Transform) pipeline that processes JSON event data through MinIO (S3-compatible storage), Dremio, and Apache Iceberg with Nessie version control. Credit to Dremio: https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/
The pipeline consists of the following components:
- MinIO: S3-compatible object storage for raw data
- Dremio: SQL query engine and data lake engine
- Apache Iceberg: Table format for large analytic datasets
- Nessie: Git-like version control for your data lake
Prerequisites:
- Docker and Docker Compose
- Python 3.8+
Clone the Repository
git clone https://github.com/chris-guecia/mini_lake_iceberg.git
Follow this guide provided by Dremio: https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/
Access the Dremio UI at http://localhost:9047. At the step in that guide where you create the Dremio account, use the credentials below (the name and email can be anything, e.g. admin@admin.com). These credentials are used by the simulate_ingestion_elt.py script:
Username: admin
Password: dremio_password123
Everything else in that guide is the same.
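For reference, this is roughly how a script can authenticate against Dremio with those credentials. It is a minimal sketch using Dremio's REST login endpoint, not necessarily how simulate_ingestion_elt.py does it:

```python
import requests

DREMIO_URL = "http://localhost:9047"

# Log in with the credentials created above; Dremio returns a token
# that later REST calls pass in the Authorization header.
resp = requests.post(
    f"{DREMIO_URL}/apiv2/login",
    json={"userName": "admin", "password": "dremio_password123"},
)
resp.raise_for_status()
token = resp.json()["token"]

# Dremio expects the token prefixed with "_dremio" on subsequent requests.
headers = {"Authorization": f"_dremio{token}"}
print("Logged in, token acquired")
```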
Start the services:
docker-compose up dremio
docker-compose up minio
docker-compose up nessie
To run the simulate_ingestion_elt.py script:
docker-compose up --build simulate-ingestion-elt
docker exec simulate-ingestion-elt python simulate_ingestion_elt.py
Configure Dremio source for incoming raw events
Configure MinIO source in Dremio:
- Click "Add Source" and select "S3"
- Configure the following settings:
General Settings:
- Name: incoming
- Credentials: AWS Access Key
- Access Key: admin
- Secret Key: password
- Encrypt Connection: false
Advanced Options:
- Enable Compatibility Mode: true
- Root Path: /incoming
Connection Properties:
- fs.s3a.path.style.access = true
- fs.s3a.endpoint = minio:9000
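Before (or after) wiring up Dremio, it can be handy to confirm that MinIO is reachable with the same credentials and path-style addressing. A quick sketch with boto3 (the localhost endpoint assumes you are connecting from the host; from inside the Docker network the endpoint is minio:9000):

```python
import boto3
from botocore.client import Config

# MinIO credentials from the docker-compose setup; path-style addressing
# mirrors the fs.s3a.path.style.access = true setting above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # minio:9000 from inside the Docker network
    aws_access_key_id="admin",
    aws_secret_access_key="password",
    config=Config(signature_version="s3v4", s3={"addressing_style": "path"}),
)

# The bucket backing the "incoming" source should show up here.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```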
Initialize Data Warehouse Schema
- In Dremio UI, navigate to the SQL Editor
- Open sql/DDL-sample-dw.sql, paste its contents into the editor, and run it
- This will create the necessary tables and schema for the data warehouse
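If you prefer not to paste the file by hand, the same DDL can also be submitted through Dremio's SQL REST endpoint. This is a rough sketch that assumes the /api/v3/sql route and naively splits the file on semicolons (one statement per request); it is not part of the project's scripts:

```python
import requests

DREMIO_URL = "http://localhost:9047"

# Authenticate with the credentials created earlier.
token = requests.post(
    f"{DREMIO_URL}/apiv2/login",
    json={"userName": "admin", "password": "dremio_password123"},
).json()["token"]
headers = {"Authorization": f"_dremio{token}"}

# Read the DDL and submit each statement as a SQL job.
with open("sql/DDL-sample-dw.sql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

for stmt in statements:
    job = requests.post(f"{DREMIO_URL}/api/v3/sql", headers=headers, json={"sql": stmt})
    job.raise_for_status()
    print("Submitted job:", job.json()["id"])
```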
The project includes a Python container for simulating data ingestion:
simulate-ingestion-elt:
  build:
    context: .
    dockerfile: Dockerfile
  container_name: simulate-ingestion-elt
  networks:
    - iceberg
  volumes:
    - ./data:/app/data
  environment:
    - DREMIO_USER=admin
    - DREMIO_PASSWORD=dremio_password123
  depends_on:
    - minio
    - dremio
    - nessie
This container:
- Mounts the local ./data directory to /app/data in the container
- Uses predefined Dremio credentials
- Runs after the MinIO, Dremio, and Nessie services are started
- Connects to the iceberg network for communication with the other services
The script is idempotent, meaning re-running it won't create duplicates in the warehouse. It follows the W.A.P. (Write -> Audit -> Publish) pattern, using the Nessie catalog with Apache Iceberg and Dremio as the compute/SQL engine.
The pipeline performs the following steps:
- Reads JSON event data
- Flattens and normalizes the data using Polars (see the sketch after this list)
- Writes partitioned Parquet files to MinIO
- Creates a new Nessie branch
- Loads data into an Iceberg table
- Performs validation
- Merges changes to main branch
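As a rough illustration of the first three steps (reading the JSON, flattening with Polars, and writing partitioned Parquet), the sketch below uses made-up paths, column layout, and batch_id; it is not the actual code in simulate_ingestion_elt.py:

```python
import os
import polars as pl

BATCH_ID = "2023-01-01"  # hypothetical batch identifier used as the partition key

# Read newline-delimited JSON events (file path is an assumption).
events = pl.read_ndjson("data/events.json")

# Flatten nested struct columns into top-level columns.
struct_cols = [name for name, dtype in events.schema.items() if isinstance(dtype, pl.Struct)]
flat = events.unnest(struct_cols) if struct_cols else events

# Tag rows with the batch they belong to and write a batch_id=... partition;
# the real script then uploads these Parquet files to the MinIO "incoming" bucket.
flat = flat.with_columns(pl.lit(BATCH_ID).alias("batch_id"))
out_dir = f"staging/events/batch_id={BATCH_ID}"
os.makedirs(out_dir, exist_ok=True)
flat.write_parquet(os.path.join(out_dir, "part-00000.parquet"))
```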
To run the pipeline:
docker exec simulate-ingestion-elt python app/main.py
In Dremio, you should see this under Sources at http://localhost:9047/
The warehouse should look like this:
The different Nessie branches are visible in the Dremio UI:
And MinIO should look like this (object counts in the warehouse will differ): http://localhost:9001/browser
Because the sample JSON contains only one day's worth of events (2023-01-01), queries that look back a week from CURRENT_TIMESTAMP won't return results. I made some changes to the queries so they show results based on the sample event timestamps.
Additional tables and schema details can be found in sql/DDL-sample-dw.sql.
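For example, a count over a fixed window anchored to the sample date instead of a rolling window from CURRENT_TIMESTAMP (table and column names here are placeholders, not necessarily those defined in the DDL):

```python
# Rolling window: returns nothing, because the newest sample event is from 2023-01-01.
ROLLING_WINDOW_SQL = """
SELECT COUNT(*) FROM fact_events
WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
"""

# Fixed window around the sample data's timestamps: returns results.
SAMPLE_WINDOW_SQL = """
SELECT COUNT(*) FROM fact_events
WHERE event_timestamp >= DATE '2023-01-01'
  AND event_timestamp < DATE '2023-01-02'
"""
```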
Key features:
- Data Versioning: Uses Nessie branches for safe data loading and validation
- Partitioned Storage: Data is partitioned by batch_id for efficient querying
- Data Quality Checks: Includes row count validation before merging changes
- Idempotent Operations: Supports multiple runs without duplicating data
- Error Handling: Includes branch cleanup on failure (see the sketch below)
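The branch-per-run Write -> Audit -> Publish flow, including cleanup on failure, can be sketched roughly as follows. The run_sql() helper, the nessie source name, the table names, and the Nessie branch SQL (CREATE BRANCH / AT BRANCH / MERGE BRANCH / DROP BRANCH) are assumptions based on Dremio's documented syntax, not a copy of the script:

```python
import uuid

import requests

DREMIO_URL = "http://localhost:9047"


def run_sql(sql: str, token: str) -> str:
    """Submit one SQL statement to Dremio and return the job id (results not polled here)."""
    resp = requests.post(
        f"{DREMIO_URL}/api/v3/sql",
        headers={"Authorization": f"_dremio{token}"},
        json={"sql": sql},
    )
    resp.raise_for_status()
    return resp.json()["id"]


def write_audit_publish(token: str) -> None:
    branch = f"etl_{uuid.uuid4().hex[:8]}"  # throwaway branch for this run
    try:
        # Write: load the new batch into the Iceberg table on a side branch.
        run_sql(f"CREATE BRANCH {branch} IN nessie", token)
        run_sql(
            f"INSERT INTO nessie.warehouse.fact_events AT BRANCH {branch} "
            "SELECT * FROM incoming.events",
            token,
        )
        # Audit: run validation queries (e.g. row counts) against the branch;
        # a real implementation would poll the job and compare the counts.
        run_sql(
            f"SELECT COUNT(*) FROM nessie.warehouse.fact_events AT BRANCH {branch}",
            token,
        )
        # Publish: merge the audited branch into main so consumers see the data.
        run_sql(f"MERGE BRANCH {branch} INTO main IN nessie", token)
    except Exception:
        # Cleanup on failure: drop the working branch so main is never polluted.
        run_sql(f"DROP BRANCH {branch} FORCE IN nessie", token)
        raise
```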
Future improvements:
- Add transformations that add dimension surrogate_keys to the fact_events table for better joins