Install VDK Control Service with custom SDK
In this tutorial, we will install the Versatile Data Kit (VDK) Control Service with a custom-built SDK.
All Data Jobs deployed to the Control Service use this SDK automatically, and any change to the SDK is applied to every deployed data job starting from its next run.
These are the minimum prerequisites for installing the VDK Control Service with a custom SDK:
- 1. Git and Docker repository
- 2. Python (PyPI) repository
- 3. Kubernetes and Helm
- Optional integrations
Below are more details and one example of how each of them can be set up.
This tutorial assumes GitHub will be used. GitHub provides both a Docker (container) registry and a git repository. Any other Docker registry and git repository would work.
Go to https://github.com/new and create a repository. For this example, we have created "github.com/tozka/demo-vdk.git"
You will need a GitHub personal access token later, so make sure to save it in a known place.
Make sure the token has permissions for both repo and packages (we use GitHub as both the git and the Docker repository).
See example:
This is where we will release (upload) our custom SDK. For POC purposes we will use https://test.pypi.org
- Create an account using https://test.pypi.org/account/register/
- Go to https://test.pypi.org/manage/account/
- Click Add API Token and generate a new API token (you will need it later, so save it for now)
We need Kubernetes to run the Control Service, and Helm to install it.
In production, you may want to use some cloud provider like GKE, TKG, EKS or other 3 letter abbreviation ...
In this example though, we will use kind and set up things locally.
- First, install kind
- Create a demo cluster using:
kind create cluster --name demo
VDK comes with some optional integrations with third-party systems that provide more value and can be enabled with configuration only.
They are not covered in this tutorial. Start a new discussion or contact us on Slack to find out how to integrate them, since the options are not as clearly documented as we'd like.
All job logs can be forwarded to a centralized logging system.
Prerequisites: SysLog or Fluentd
An SMTP server is used for mail notifications. It is configured in both the SDK and the Control Service.
Prerequisites: SMTP Server
See the list of supported metrics and the monitoring configuration for more details.
Prerequisites: Prometheus or Wavefront or similar
You can define more advanced monitoring rules. The Helm chart comes with prepared PrometheusRules (e.g. Job Delay alerting) that can be used with AlertManager and Prometheus.
Prerequisites: The out of the box rules require AlertManager
The Control Service supports OAuth2-based authorization of all operations, which makes it easy to integrate with a company SSO. Authorization using claims is also supported.
See more in the security section of the Control Service Helm chart.
Prerequisites: OAuth2
Access Control Webhooks make it possible to create more complex rules for who is allowed to perform which operations in the Control Service (for cases where OAuth2 is not enough).
Prerequisites: Webhook endpoint
Here we will install the Versatile Data Kit.
First, we will create our custom SDK. This is a very simple process. If you are familiar with Python packaging using setuptools, you will find these steps trivial.
NOTE: You can skip this step if you do not want to create a custom SDK. Quickstart VDK is one such custom SDK and can be used to get started quickly.
mkdir my-org-vdk
cd my-org-vdk
Note that you should change the my-org-vdk name to something appropriate to your organisation.
Open setup.py in your favorite IDE.
We want to create an SDK that will support:
- Database queries to both Postgres and Snowflake
- Ingesting data into Postgres and Snowflake, as well as over HTTP and into a file
- Control Service operations, such as deploying data jobs
In install_requires we specify the plugins we need to achieve that:
import setuptools

setuptools.setup(
    name="my-org-vdk",
    version="1.0",
    install_requires=[
        "vdk-core",
        "vdk-plugin-control-cli",
        "vdk-postgres",
        "vdk-snowflake",
        "vdk-ingest-http",
        "vdk-ingest-file",
    ],
)
Note that you should change the package name to something appropriate to your organisation, and amend subsequent commands to refer to that name instead of my-org-vdk.
In order for our Python SDK to be installable and usable, we need to release it.
- First, we build and package it:
python setup.py sdist --formats=gztar
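- Then we upload it to test.pypi.org. Set PIP_REPO_UPLOAD_USER_NAME and PIP_REPO_UPLOAD_USER_PASSWORD using the API token from step 2 of the Prerequisites section. When authenticating with an API token, the user name is __token__ and the password is the token itself.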
twine upload --repository-url https://test.pypi.org/legacy/ -u "$PIP_REPO_UPLOAD_USER_NAME" -p "$PIP_REPO_UPLOAD_USER_PASSWORD" dist/my-org-vdk-1.0.tar.gz
We need to create a simple docker image with our SDK installed which will be used by all jobs managed by VDK Control Service.
Create an empty file named Dockerfile-vdk-base and open it with a text editor or IDE.
The content of the Dockerfile is simply this:
FROM python:3.7-slim
WORKDIR /vdk
ENV VDK_VERSION $vdk_version
#Install VDK
RUN pip install --extra-index-url https://test.pypi.org/simple my-org-vdk
As you can see it's pretty basic. We just want to install VDK.
First, we need to log in to the GitHub Container Registry. Export the following environment variable:
export CR_PAT=*Github Personal Access Token*
and replace *Github Personal Access Token* with the token you created earlier.
Then, run the following command, replacing USERNAME with your GitHub account name:
echo $CR_PAT | docker login ghcr.io -u USERNAME --password-stdin
Next, build the image and push it. Make sure to tag it with both the version of the SDK and the tag "release".
For example (replace with your own GitHub repo created in the prerequisites):
docker build -t ghcr.io/tozka/my-org-vdk:1.0 -t ghcr.io/tozka/my-org-vdk:release -f Dockerfile-vdk-base .
docker push ghcr.io/tozka/my-org-vdk:release
docker push ghcr.io/tozka/my-org-vdk:1.0
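The image is pushed under two tags because the Control Service configuration later in this tutorial refers to the "release" tag, while the version tag keeps each SDK release addressable. A minimal sketch of how the two references and the matching build command could be generated for a new SDK version; the helper names `image_references` and `docker_build_command` are hypothetical and not part of VDK or Docker:

```python
def image_references(registry, owner, image, version, moving_tag="release"):
    """Return the two references the image should be pushed under:
    one pinned to the SDK version and one moving tag."""
    base = f"{registry}/{owner}/{image}"
    return [f"{base}:{version}", f"{base}:{moving_tag}"]


def docker_build_command(references, dockerfile="Dockerfile-vdk-base", context="."):
    """Build the argument list for `docker build` with one -t per reference."""
    command = ["docker", "build"]
    for reference in references:
        command += ["-t", reference]
    return command + ["-f", dockerfile, context]
```

The resulting list can be passed to `subprocess.run(..., check=True)`, followed by one `docker push` per reference.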
Now it is time to put everything together.
We will use the GitHub token, account name, and repo created in step 1 of the Prerequisites.
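We need to export the following variables (GITHUB_TOKEN is the personal access token created earlier, which the values below also reference):
export GITHUB_ACCOUNT_NAME=*your account name*
export GITHUB_URL=*URL of the repo you created earlier*
export GITHUB_TOKEN=*Github Personal Access Token*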
The content of the values.yaml is:
resources:
  limits:
    memory: 0
  requests:
    memory: 0
cockroachdb:
  statefulset:
    resources:
      limits:
        memory: 0
      requests:
        memory: 0
  init:
    resources:
      limits:
        cpu: 0
        memory: 0
      requests:
        cpu: 0
        memory: 0
deploymentGitUrl: "${GITHUB_URL}"
deploymentGitUsername: "${GITHUB_ACCOUNT_NAME}"
deploymentGitPassword: "${GITHUB_TOKEN}"
uploadGitReadWriteUsername: "${GITHUB_ACCOUNT_NAME}"
uploadGitReadWritePassword: "${GITHUB_TOKEN}"
deploymentDockerRegistryType: generic
deploymentDockerRegistryUsernameReadOnly: "${GITHUB_ACCOUNT_NAME}"
deploymentDockerRegistryPasswordReadOnly: "${GITHUB_TOKEN}"
deploymentDockerRegistryUsername: "${GITHUB_ACCOUNT_NAME}"
deploymentDockerRegistryPassword: "${GITHUB_TOKEN}"
deploymentDockerRepository: "ghcr.io/${GITHUB_ACCOUNT_NAME}/data-jobs/demo-vdk"
proxyRepositoryURL: "ghcr.io/${GITHUB_ACCOUNT_NAME}/data-jobs/demo-vdk"
deploymentVdkDistributionImage:
  registryUsernameReadOnly: "${GITHUB_ACCOUNT_NAME}"
  registryPasswordReadOnly: "${GITHUB_TOKEN}"
  registry: ghcr.io/${GITHUB_ACCOUNT_NAME}
  repository: "my-org-vdk"
  tag: "release"
security:
  enabled: False
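Helm does not expand shell-style ${...} placeholders when it reads a values file, so the variables in the content above have to be substituted before the file is passed to helm install. A minimal sketch of one way to do that with Python's standard library, assuming the content above is kept as a template; the helper names `render_values` and `render_file` and any file names passed to them are hypothetical and not part of VDK or Helm:

```python
import os
from string import Template


def render_values(template_text, env):
    """Replace ${NAME} placeholders with values taken from env.

    Raises KeyError when a placeholder has no value, so a missing
    variable is caught before Helm ever sees the file.
    """
    return Template(template_text).substitute(env)


def render_file(template_path, output_path, env=None):
    """Read a template file, substitute the placeholders, and write the result.

    By default the values come from the current process environment.
    """
    env = os.environ if env is None else env
    with open(template_path) as source:
        rendered = render_values(source.read(), env)
    with open(output_path, "w") as target:
        target.write(rendered)
```

For example, after exporting the variables, `render_file("values.yaml.tpl", "values.yaml")` would write the file that the helm install command below consumes. Because `substitute` is used rather than `safe_substitute`, a variable that was never exported raises an error instead of leaving a literal placeholder in the credentials.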
helm repo add vdk-gitlab https://gitlab.com/api/v4/projects/28814611/packages/helm/stable
helm repo update
helm install my-vdk-runtime vdk-gitlab/pipelines-control-service -f values.yaml
In order to access the application from our browser, we need to expose it using the kubectl port-forward command:
kubectl port-forward service/my-vdk-runtime-svc 8092:8092
Note that this command does not return, and you will need to open a new terminal window to proceed.
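Before moving on, it can help to confirm from the new terminal that the forwarded port actually accepts connections. A minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and a successful TCP connection only shows that something is listening, not that the Control Service is healthy:

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

With the port-forward from above running, `is_port_open("localhost", 8092)` is the call to make.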
Then let's see how data or analytics engineers would use it in our organization to create, develop and deploy jobs:
pip install --extra-index-url https://test.pypi.org/simple/ my-org-vdk
export VDK_CONTROL_SERVICE_REST_API_URL=http://localhost:8092
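To confirm that the installed SDK declares the plugins listed in setup.py, the package metadata can be inspected. A minimal sketch using `importlib.metadata` from the standard library (available from Python 3.8); the helper name `declared_requirements` is hypothetical:

```python
from importlib import metadata


def declared_requirements(dist_name):
    """Return the requirement strings declared by an installed distribution.

    Returns an empty list if the distribution declares none, and None if
    the distribution is not installed in the current environment.
    """
    try:
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None
```

Calling `declared_requirements("my-org-vdk")` in the environment where the SDK was installed should list the entries from install_requires; a result of None means the package is not installed in that environment.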
This will create a data job and register it in the Control Service. Locally it will create a directory with sample files of a data job:
vdk create --name example --team my-team --path .
Browse the files in the example directory
Deploying the data job is a single "click" (or CLI command). Behind the scenes, VDK will package and install all dependencies, create the Docker image and container, release and version it, and finally schedule it (if configured) for execution.
vdk deploy --job-path example --reason "reason"
vdk show --name example --team my-team
Note how there is both a VDK version and a Job version. They are deployed independently. The VDK version is taken from the Control Service configuration and managed centrally, while the Job version is separate and controlled by the data engineer developing the job.
Both the VDK version and the Job version can be changed if needed with the vdk deploy --update command.
➡️ Next Section: Properties and Secrets