The ingest is a service of the Superb Data Kraken Platform (SDK). It is designed for managing data-ingestion to the SDK.
For a more detailed understanding of the broader context of the platform this project is used in, refer to the architecture documentation.
For instructions on how to deploy the ingest on an instance of the SDK, refer to the installation instructions.
The workers that are part of the ingest are explained in more detail below.
Certain organizations may not need their data validated (it should be ingested as is). This worker provides the functionality to configure for which organizations no validation is required. The configuration is done via SKIP_VALIDATE_ORGANIZATIONS.
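As a minimal sketch, assuming the worker receives the organization name and reads that list from the environment (the helper below is illustrative, not the actual implementation):

```python
import os
from typing import Set


def skipped_organizations() -> Set[str]:
    """Parse SKIP_VALIDATE_ORGANIZATIONS, e.g. "orga1,orga2", into a set of names."""
    raw = os.environ.get("SKIP_VALIDATE_ORGANIZATIONS", "")
    return {org.strip() for org in raw.split(",") if org.strip()}


def validation_required(organization: str) -> bool:
    """Validation is skipped for every organization listed in SKIP_VALIDATE_ORGANIZATIONS."""
    return organization not in skipped_organizations()
```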
In case no "qualified" metadata is provided, this worker generates a basic metadata-set and stores it in the cloud storage (loadingzone).
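Purely as an illustration - the actual schema is defined by the metadata-service, and the field names below are assumptions - such a basic metadata-set could be built roughly like this:

```python
import json
from datetime import datetime, timezone


def generate_basic_metadata(dataset_name: str) -> str:
    """Build an illustrative, minimal metadata-set; the field names are hypothetical."""
    meta = {
        "name": dataset_name,
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }
    # The worker would store this as meta.json in the loadingzone of the dataset.
    return json.dumps(meta, indent=2)
```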
This worker will provide functionality to anonymize metadata. However, it is not implemented yet.
This worker will provide functionality to enrich metadata. However, it is not implemented yet.
This worker will provide functionality to validate the dataset. However, it is not implemented yet.
Indexes meta.json to the dedicated <orga>_<space>_measurements-index via metadata-service. For this, the worker accesses the cloud storage to read the meta.json and passes the content to the service. The document-id is stored in a dedicated file ingest.json - this prevents indexing the same dataset multiple times. CAUTION: Only users with the role <orga>_<space>_trustee may update documents - executing the ingest multiple times as a user without trustee-permission will lead to errors!
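A condensed sketch of that flow, assuming a client-credentials token request and a hypothetical indexing endpoint on the metadata-backend (the paths, payloads and ingest.json handling shown here are assumptions, not the real API):

```python
import os
from typing import Optional

import requests


def index_metadata(meta_json: dict, ingest_json: Optional[dict]) -> dict:
    """Index meta.json once and remember the resulting document-id in ingest.json."""
    if ingest_json and "documentId" in ingest_json:
        # Already indexed - updating would require the <orga>_<space>_trustee role.
        return ingest_json

    # Client-credentials flow against the configured token endpoint.
    token = requests.post(
        os.environ["ACCESS_TOKEN_URI"],
        data={
            "grant_type": "client_credentials",
            "client_id": os.environ["CLIENT_ID"],
            "client_secret": os.environ["CLIENT_SECRET"],
        },
        timeout=30,
    ).json()["access_token"]

    # Hypothetical indexing call - the real path and payload are defined by the metadata-backend.
    response = requests.post(
        f"{os.environ['INDEXER_URL']}/index",
        json=meta_json,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()

    # Persisting this result as ingest.json prevents indexing the dataset a second time.
    return {"documentId": response.json()["id"]}
```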
The following environment variables are required:
name | description |
---|---|
CLIENT_ID | client-id of confidential OAuth-Client |
CLIENT_SECRET | client-secret of confidential OAuth-Client |
ACCESS_TOKEN_URI | URI of the token-endpoint |
INDEXER_URL | URL of the metadata-backend |
STORAGE_TYPE | storage-type - one of azure or s3 (default: azure; s3 currently not supported) |
ACCESSMANAGER_URL | URL of the accessmanager (only required for azure-storage) |
STORAGE_DOMAIN | domain of the storage-implementation (only required for s3-storage, which is currently not supported) |
BUCKET | storage-bucket (only required for s3-storage, which is currently not supported) |
Finally, the data is moved from the loadingzone to the main-storage; a sketch of this step follows the pipeline variables below.
The following pipeline variables are required:
name | description |
---|---|
CLIENT_ID | client-id of confidential OAuth-Client |
CLIENT_SECRET | client-secret of confidential OAuth-Client |
ACCESS_TOKEN_URI | URI of the token-endpoint |
STORAGE_TYPE | storage-type - one of azure or s3 (default: azure; s3 currently not supported) |
READ_ENDPOINT | endpoint for generating a SAS-Token in read-scope (only required for azure-storage) |
UPLOAD_ENDPOINT | endpoint for generating a SAS-Token in upload-scope (only required for azure-storage) |
DELETE_ENDPOINT | endpoint for generating a SAS-Token in delete-scope (only required for azure-storage) |
BLACKLIST | comma-separated list of wildcarded blob names that should not be moved to main-storage but deleted directly |
<ORGA>.WHITELIST | comma-separated list of wildcarded blob names that should be moved to main-storage |
STORAGE_DOMAIN | domain of the storage-implementation (only required for s3-storage, which is currently not supported) |
BUCKET | storage-bucket (only required for s3-storage, which is currently not supported) |
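For azure-storage, the move of a single blob could look roughly like the sketch below, assuming the endpoints above return ready-to-use SAS blob URLs (that response format is an assumption); it uses the azure-storage-blob package:

```python
from azure.storage.blob import BlobClient


def move_blob(read_sas_url: str, upload_sas_url: str, delete_sas_url: str) -> None:
    """Illustrative move of one blob from the loadingzone to the main-storage."""
    # Server-side copy of the loadingzone blob into the main-storage blob.
    target = BlobClient.from_blob_url(upload_sas_url)
    target.start_copy_from_url(read_sas_url)
    # A real implementation would poll the copy status before continuing.

    # Remove the source blob from the loadingzone once the copy has finished.
    source = BlobClient.from_blob_url(delete_sas_url)
    source.delete_blob()
```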
NOTE on black- and whitelist: The blacklist applies globally. It can be used to define files that could potentially cause damage to the system (*.exe, *.bat). If your organization only works with certain file-extensions, you can use the organization-scoped whitelist to prevent other extensions from being uploaded. The blacklist restricts each whitelist.
If you have the following configuration, a bat- or a png-file would not be moved to main-storage:
blacklist = "*.exe,*.bat"
whitelist = "*.csv,*.json,*.bat"
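Read as fnmatch-style wildcards, the interaction can be sketched like this (treating an empty whitelist as "allow everything" is an assumption made for illustration):

```python
from fnmatch import fnmatch


def should_move(blob_name: str, blacklist: str, whitelist: str) -> bool:
    """A blob is moved only if it is whitelisted and matches no blacklist pattern."""
    black = [p.strip() for p in blacklist.split(",") if p.strip()]
    white = [p.strip() for p in whitelist.split(",") if p.strip()]
    whitelisted = not white or any(fnmatch(blob_name, p) for p in white)
    blacklisted = any(fnmatch(blob_name, p) for p in black)
    return whitelisted and not blacklisted


# With blacklist="*.exe,*.bat" and whitelist="*.csv,*.json,*.bat":
#   should_move("data.csv", "*.exe,*.bat", "*.csv,*.json,*.bat")  -> True
#   should_move("run.bat", "*.exe,*.bat", "*.csv,*.json,*.bat")   -> False (blacklist wins)
#   should_move("image.png", "*.exe,*.bat", "*.csv,*.json,*.bat") -> False (not whitelisted)
```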
Follow the instructions below to set up a local copy of the project for development and testing.
- python >= 3.9
- A running OIDC/OAuth2 provider instance
- A running kafka-instance
- A running argo workflows instance
- A running argo events instance with an EventSource listening on the kafka-event accessmanager-commit (which is sent by accessmanager/commit)
- Cloud-Storage in expected storage-structure - currently only azure supported
- accessmanager
- organizationmanager
- metadata-service
You may provide a secret auth-secret (as referenced by the ingest-sensor), with the following setup:
apiVersion: v1
data:
  ACCESS_TOKEN_URI: <ACCESS_TOKEN_URI_BASE64>
  CLIENT_ID: <CLIENT_ID_BASE64>
  CLIENT_SECRET: <CLIENT_SECRET_BASE64>
kind: Secret
metadata:
  name: auth-secret
  namespace: argo-mgmt
type: Opaque
This secret is referred to from metadata_index and move_data.
The configuration of the ingest takes place in argo/config-map.yml.
As already mentioned in skip_validation, the property SKIP_VALIDATE_ORGANIZATIONS is a comma-separated list of organizations that should not be validated.
Every other configuration within this file refers to your cluster-internal domain. Aside from a possible postfix, nothing else needs to be configured.
The ingest is an argo events sensor with an event-source for the accessmanager-commit-event. So the ingest is triggered every time a dataset is committed via accessmanager.
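For local testing it can be convenient to emit such an event manually; the sketch below uses kafka-python, and the payload (as well as the local broker address) is purely hypothetical - in production the event is produced by accessmanager/commit:

```python
import json

from kafka import KafkaProducer

# Hypothetical payload - the actual event schema is defined by the accessmanager.
event = {"organization": "example-orga", "space": "example-space", "dataset": "example-dataset"}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumption: the local kafka-instance from the prerequisites
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("accessmanager-commit", value=event)
producer.flush()
```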
See the Contribution Guide.
See the Changelog.