Duckpond is a very small data lake. :-)
Duckpond is built by the Caspar Water System.
Duckpond writes timeseries data into multi-Parquet file databases and assembles them for export using DuckDB. A sibling project Noyo Blue Economy shows how to combine this output within Observable Framework markdown.
Warning! Work-in-progress. This works and needs testing! :-)
A receiver for www.hydrovu.com environmental monitoring data. To instantiate one of these, for example:
apiVersion: www.hydrovu.com/v1
kind: HydroVu
name: noyo-harbor
desc: Noyo Harbor Blue Economy
spec:
key: ...
secret: ...
This downloads the complete set of data for all locations and instruments. Configure key and secret fields using values supplied by the HydroVu system.
An exporter for backing up all files in the Pond to a storage bucket.
apiVersion: github.com/jmacd/duckpond/v1
kind: Backup
name: cloudflare
desc: Backup to Cloudflare R2
spec:
bucket: noyoharbor
region: ...
key: ...
secret: ...
endpoint: ...
Configure region, key, secret, and endpoint fields using values supplied by your service provider.
Loads from a Backup storage bucket.
apiVersion: github.com/jmacd/duckpond/v1
kind: Copy
name: cloudflare
desc: Backup from storage
spec:
bucket: noyoharbor
region: ...
key: ...
secret: ...
endpoint: ...
backup_uuid: 6fe1e0e1-b262-4d99-ae6f-3cae39e1196a
Configure region, key, secret, and endpoint the same as you would for the corresponding Backup.
Ingest files from a local directory on the host file system.
apiVersion: github.com/jmacd/duckpond/v1
kind: Inbox
name: csvin
desc: CSV inbox
spec:
pattern: /home/user/inbox/csv/**
The files will be placed in a directory named /Inbox/{UUID}
.
Register a SQL query to transform arbitrary data (e.g., from the Inbox) into a desired representation. The query will be executed once per target file matching the pattern in the Pond.
apiVersion: github.com/jmacd/duckpond/v1
kind: Derive
name: csvextract
desc: Extract from CSV
spec:
collections:
- pattern: {{ resource }}/*surface*.csv
name: SurfaceMeasurements
query: >
WITH INPUT as ...
SELECT ...
FROM read_csv('$1')
...
The placeholder $1
is replaced by the real path of each file
matching the configured pattern, and the resulting derived files are
populated with the results of the query
in a synthetic Pond
directory named /Derive/{UUID}/{spec.name}
.
Use the Combine resource to merge a set of files with different names and identical schemas. This is accomplished by an automatically generated UNION statement executed by DuckDB.
apiVersion: www.hydrovu.com/v1
kind: Combine
name: noyo-data
desc: Noyo harbor instrument collections
spec:
scopes:
- name: FieldStation-Surface
series:
- pattern: {{ derive }}/SurfaceMeasurements/*
attrs:
source: laketech
device: at500
where: surface
Use the Template resource to synthesize content generated through the
Rust Tera
template engine. Each named collection applies the
supplied pattern, which must capture a single variable. For each
match, the captured value becomes the basename of a file with the
contents of the expanded template.
apiVersion: github.com/jmacd/duckpond/v1
kind: Template
name: website
desc: observable
spec:
collections:
- name: details
in_pattern: "{{ combine }}/*/combine"
out_pattern: "$0.md"
template: |-
---
title: Detail with combined data
---
{% for field in schema.fields %}
{{ field.name }}
{% endfor %}
The template context includes the schema of the matching file in a variable named "schema". The schema is an array of fields:
pub struct Schema {
fields: Vec<Field>,
}
pub struct Field {
name: String,
instrument: String,
unit: String,
}
TODO: The schema is structured according to the HydroVu data model, which is not general purpose.
The group(by=..., in=...)
built-in groups any array of objects,
returning a map of arrays of objects. It can be used to group fields
by instrument name, for example.
{% for key, fields in group(by="name",in=schema.fields) %}
Name is {{ key }}
{% for field in fields %}
Instrument is {{ field.instrument }}
{% endfor %}
{% endfor %}
Synthetic data generator for testing.
apiVersion: github.com/jmacd/duckpond/v1
kind: Scribble
name: first-scribble
desc: Test data generator
spec:
count_min: 5
count_max: 10
probs:
tree: 0.05
table: 0.4
To initialize a new pond in the current working directory or $POND
,
if set. The pond directory is named ".pond". To create a new pond:
duckpond init
To apply a resource definition, such as to create a new temporal data set. For example:
duckpond apply -f noyo.yaml
Fetches new data from registered resources. For example:
duckpond run
Determines whether expected files are present in the pond and various other consistency checks. WIP: Simply prints the unexpected files.
duckpond check
Produces a directory listing. Accepts shell wildcards.
duckpond list PATTERN
Writes a file to the standard output.
duckpond cat PATH