
Event Data Pipeline with Dremio, Nessie, and Apache Iceberg

This project uses a Python script to simulate an ELT (Extract, Load, Transform) ingestion pipeline that processes JSON event data through MinIO (S3-compatible storage), Dremio, and Apache Iceberg with Nessie version control. Credit to Dremio: https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/

Architecture Overview

The pipeline consists of the following components:

  • MinIO: S3-compatible object storage for raw data
  • Dremio: SQL query engine and data lake engine
  • Apache Iceberg: Table format for large analytic datasets
  • Nessie: Git-like version control for your data lake

Prerequisites

  • Docker and Docker Compose
  • Python 3.8+

Setup Instructions

  1. Clone the Repository

    git clone https://github.com/chris-guecia/mini_lake_iceberg.git
  2. Follow this guide provided by Dremio
    https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/

    When the guide reaches the step where you create the Dremio account, access the Dremio UI at http://localhost:9047 and use the credentials below. The name and email can be anything (for example admin@admin.com). These credentials are used by the simulate_ingestion_elt.py script:

    Username: admin
    Password: dremio_password123
    

    Everything else in that guide is the same.

docker-compose up dremio
docker-compose up minio
docker-compose up nessie

# To run the simulate_ingestion_elt.py script
docker-compose up --build simulate-ingestion-elt 
docker exec simulate-ingestion-elt python simulate_ingestion_elt.py 
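
Under the hood, simulate_ingestion_elt.py authenticates to Dremio with the credentials from step 2. A minimal sketch of that login, assuming Dremio's standard REST login endpoint (the dremio_login helper name is illustrative and not taken from the repo):

import os

import requests

DREMIO_URL = "http://localhost:9047"  # Dremio UI / REST endpoint used in this guide


def dremio_login(user: str, password: str) -> str:
    """Log in to Dremio and return an auth token (illustrative helper)."""
    resp = requests.post(
        f"{DREMIO_URL}/apiv2/login",
        json={"userName": user, "password": password},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["token"]


if __name__ == "__main__":
    # The docker-compose service passes these same values as environment variables.
    token = dremio_login(
        os.getenv("DREMIO_USER", "admin"),
        os.getenv("DREMIO_PASSWORD", "dremio_password123"),
    )
    # Subsequent REST calls send the token as: Authorization: _dremio<token>
    print("Logged in to Dremio")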
  3. Configure Dremio source for incoming raw events

    Configure MinIO source in Dremio:

    • Click "Add Source" and select "S3"
    • Configure the following settings:

    General Settings:

    Name: incoming
    Credentials: AWS Access Key
    Access Key: admin
    Secret Key: password
    Encrypt Connection: false
    

    Advanced Options:

    Enable Compatibility Mode: true
    Root Path: /incoming
    

    Connection Properties:

    fs.s3a.path.style.access = true
    fs.s3a.endpoint = minio:9000
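
    The source above points at the incoming bucket/path in MinIO. How raw event files land there isn't shown in this README; here is a minimal sketch of uploading a JSON file with boto3, reusing the MinIO credentials and endpoint from this guide (the local file and object key are illustrative):

    import boto3

    # S3-compatible client against MinIO (use http://minio:9000 from inside the Docker network).
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="admin",
        aws_secret_access_key="password",
    )

    # Drop a raw events file into the "incoming" bucket that the Dremio source reads from.
    s3.upload_file("data/events.json", "incoming", "raw/events.json")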
    

Initialize Data Warehouse Schema

  1. In Dremio UI, navigate to the SQL Editor
  2. Open sql/DDL-sample-dw.sql, paste its contents into the SQL Editor, and run it
  3. This will create the necessary tables and schema for the data warehouse

Docker Container Details

The project includes a Python container for simulating data ingestion:

simulate-ingestion-elt:
  build:
    context: .
    dockerfile: Dockerfile
  container_name: simulate-ingestion-elt
  networks:
    - iceberg
  volumes:
    - ./data:/app/data
  environment:
    - DREMIO_USER=admin
    - DREMIO_PASSWORD=dremio_password123
  depends_on:
    - minio
    - dremio
    - nessie

This container:

  • Mounts the local ./data directory to /app/data in the container
  • Uses predefined Dremio credentials
  • Runs after MinIO, Dremio, and Nessie services are started
  • Connects to the iceberg network for communication with other services

Running the Pipeline

The pipeline performs the following steps. The script is idempotent, meaning re-running it won't create duplicates in the warehouse.
It follows W.A.P. (Write -> Audit -> Publish) using the Nessie catalog with Apache Iceberg and
Dremio as the compute/SQL engine. A sketch of the read/flatten/write steps follows the list below.

  1. Reads JSON event data
  2. Flattens and normalizes the data using Polars
  3. Writes partitioned Parquet files to MinIO
  4. Creates a new Nessie branch
  5. Loads data into an Iceberg table
  6. Performs validation
  7. Merges changes to main branch
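
Below is a minimal sketch of steps 1-3 (read, flatten with Polars, write partitioned Parquet). The file paths, the nested properties column, and the batch id are assumptions for illustration; the real script's flattening logic and MinIO output path may differ.

from pathlib import Path

import polars as pl

# 1. Read the raw JSON events (path is illustrative).
df = pl.read_json("data/events.json")

# 2. Flatten/normalize: unnest an assumed nested struct column and tag the batch.
df = df.unnest("properties")
df = df.with_columns(pl.lit("batch_001").alias("batch_id"))

# 3. Write one Parquet file per batch_id partition (shown locally; the script
#    targets MinIO instead).
for part in df.partition_by("batch_id"):
    out_dir = Path(f"data/out/batch_id={part['batch_id'][0]}")
    out_dir.mkdir(parents=True, exist_ok=True)
    part.write_parquet(out_dir / "events.parquet")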

To run the pipeline:

docker exec simulate-ingestion-elt python app/main.py

In Dremio, you should see the following under Sources at http://localhost:9047/
(screenshot: dremio_datasets)

The warehouse should look like this: (screenshot: dremio_6)

The different Nessie branches can be seen in the Dremio UI:
(screenshot: dremio_branches)

And MinIO should look like this (object counts in the warehouse will differ) at http://localhost:9001/browser
(screenshot: minIO_1)

Analytical SQL query results

Because the sample JSON contains only one day's worth of events (2023-01-01), queries that look back a week from CURRENT_TIMESTAMP
won't return results. I made some changes to the queries to show results based on the sample event timestamps.
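
For illustration, the adjustment amounts to anchoring the one-week window to the newest event timestamp in the sample data instead of CURRENT_TIMESTAMP. The table path and column name below are assumptions, not taken from sql/DDL-sample-dw.sql:

# Hypothetical adjusted query: look back one week from the newest event in the
# sample data rather than from CURRENT_TIMESTAMP.
ADJUSTED_WEEKLY_QUERY = """
SELECT COUNT(*) AS events_last_week
FROM nessie.warehouse.fact_events
WHERE event_timestamp >= TIMESTAMPADD(
    DAY, -7,
    (SELECT MAX(event_timestamp) FROM nessie.warehouse.fact_events)
)
"""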

(screenshots: dremio_sql_result_4, dremio_sql_result_3)

ERD (diagram: amplify_erd)

Additional tables and schema details can be found in sql/DDL-sample-dw.sql.

Key Features

  • Data Versioning: Uses Nessie branches for safe data loading and validation
  • Partitioned Storage: Data is partitioned by batch_id for efficient querying
  • Data Quality Checks: Includes row count validation before merging changes
  • Idempotent Operations: Supports multiple runs without duplicating data
  • Error Handling: Includes branch cleanup on failure
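
As a rough illustration of the versioning, audit, and publish features above, here is a minimal sketch of the branch-per-load flow driven through Dremio's REST SQL API (the token comes from the login sketch in the setup section). The run_sql helper, the nessie source name, the table path, the staged source, and the branch name are all assumptions; the real script also polls job status and reads the validation results before merging.

import requests

DREMIO_URL = "http://localhost:9047"


def run_sql(sql: str, token: str) -> str:
    """Submit a SQL statement to Dremio's REST API and return the job id (illustrative)."""
    resp = requests.post(
        f"{DREMIO_URL}/api/v3/sql",
        headers={"Authorization": f"_dremio{token}"},
        json={"sql": sql},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]


def write_audit_publish(token: str, branch: str = "etl_2023_01_01") -> None:
    try:
        # Write: load the new batch onto an isolated Nessie branch.
        run_sql(f"CREATE BRANCH {branch} IN nessie", token)
        run_sql(
            f"INSERT INTO nessie.warehouse.fact_events AT BRANCH {branch} "
            'SELECT * FROM incoming."raw_events"',  # hypothetical staged source
            token,
        )
        # Audit: validation (e.g. row counts) runs against the branch before publishing.
        run_sql(
            f"SELECT COUNT(*) FROM nessie.warehouse.fact_events AT BRANCH {branch}",
            token,
        )
        # Publish: merge the audited branch back into main.
        run_sql(f"MERGE BRANCH {branch} INTO main IN nessie", token)
    except Exception:
        # Branch cleanup on failure, mirroring the error handling described above.
        run_sql(f"DROP BRANCH {branch} FORCE IN nessie", token)
        raise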

Next Steps

  • Add transformations that add dimension surrogate keys to the fact_events table for better joins
