Instructions for deploying airflow on Kubernetes with vineyard as backend #736

Merged 1 commit on May 19, 2022
50 changes: 47 additions & 3 deletions python/vineyard/contrib/airflow/README.md
@@ -10,15 +10,23 @@ database backend without involving external storage systems like HDFS. The
Vineyard XCom backend also handles object migration when the required inputs
are not located where the task is scheduled to execute.

Requirements
Table of Contents
-----------------

- [Requirements](#requirements)
- [Configuration and Usage](#configuration-and-usage)
- [Run the tests](#run-tests)
- [Deploy on Kubernetes](#deploy-on-kubernetes)

Requirements <a name="requirements"/>
------------

The following packages are needed to run Airflow on Vineyard:

- airflow >= 2.1.0
- vineyard >= 0.2.12
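
A quick way to confirm that an environment meets these minimums is to compare the installed versions against them. The sketch below is only illustrative: it assumes the distributions are published as `apache-airflow` and `vineyard`, the helper names are hypothetical, and only the leading numeric components of each version are compared.

```python
import re
from importlib import metadata

# Minimum versions taken from the list above; the distribution names
# ("apache-airflow", "vineyard") are assumptions about how they are published.
MINIMUMS = {"apache-airflow": "2.1.0", "vineyard": "0.2.12"}


def version_tuple(version):
    """Turn the leading dotted digits of a version string into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum`."""
    have, want = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(want))
    # Pad with zeros so that "2.1" and "2.1.0" compare as equal.
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want


def check_requirements(minimums=MINIMUMS):
    """Map each distribution to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Comparing integer tuples rather than raw strings matters here: as strings, `"0.2.9"` would sort after `"0.2.12"`.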

Configuration and Usage
Configuration and Usage <a name="configuration-and-usage"/>
-----------------------

1. Install required packages:
@@ -92,7 +100,7 @@ table in backend databases of Airflow.
The example is adapted from the documentation of Airflow, see also
[Tutorial on the Taskflow API][2].

Run the tests
Run the tests <a name="run-tests"/>
-------------

1. Start your vineyardd with the following command,
@@ -110,5 +118,41 @@ Run the tests
The pandas test suite cannot be run with the default XCom backend; vineyard
enables Airflow to exchange **complex** and **big** data without modifying the DAG and tasks!

Deploy on Kubernetes <a name="deploy-on-kubernetes"/>
--------------------

We provide reference settings (see [values.yaml](./values.yaml)) for deploying
Airflow with vineyard as the XCom backend on Kubernetes, based on [the official helm charts][3].

Vineyard requires etcd, so to ease the deployment process you first need to
set up a standalone etcd cluster. A _test_ etcd cluster with only one instance can
be deployed with:

```bash
$ kubectl create -f etcd.yaml
```
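
Before moving on, it can help to confirm that the test etcd instance answers on its client port. The following is a minimal sketch, assuming it runs from a pod in the same namespace so that the Service name `etcd-for-vineyard` (defined in `etcd.yaml`) resolves, and assuming etcd's `/health` HTTP endpoint returns a small JSON document with a `health` field. The function names are hypothetical.

```python
import json
import urllib.request

# Service name and port come from etcd.yaml; this URL only resolves from
# inside the cluster (same namespace), which is an assumption of this sketch.
ETCD_ENDPOINT = "http://etcd-for-vineyard:2379"


def parse_health(payload):
    """Interpret the body of etcd's /health response.

    Assumes a JSON object whose "health" field is the string "true"
    (or the boolean true) when the member is healthy.
    """
    try:
        document = json.loads(payload)
    except ValueError:
        return False
    if not isinstance(document, dict):
        return False
    return str(document.get("health")).lower() == "true"


def etcd_is_healthy(endpoint=ETCD_ENDPOINT, timeout=5.0):
    """Query <endpoint>/health and report whether etcd says it is healthy."""
    try:
        with urllib.request.urlopen(f"{endpoint}/health", timeout=timeout) as resp:
            return parse_health(resp.read())
    except OSError:
        # Covers connection failures, DNS errors and timeouts.
        return False
```

Calling `etcd_is_healthy()` from a pod in the cluster should return `False` until the etcd pod is up and reachable through the Service.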

The [values.yaml](./values.yaml) mainly tweaks the following settings:

- Installing the vineyard dependencies into the containers with pip before starting the workers
- Adding a vineyardd container to the Airflow pods
- Mounting vineyardd's UNIX-domain socket and shared memory into the Airflow worker pods
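
The last point is easy to get wrong, because the same volume is mounted at different paths in the two containers. As a rough illustration (the helper below is hypothetical and not part of vineyard or the chart), the path at which the Airflow container sees the socket can be derived from the sidecar's `--socket` argument and the two mount points used in `values.yaml`:

```python
import posixpath


def socket_path_in_client(server_socket, server_mount, client_mount):
    """Translate a socket path from the vineyardd container's view to the
    view of another container that mounts the same volume elsewhere."""
    relative = posixpath.relpath(server_socket, start=server_mount)
    if relative == ".." or relative.startswith("../"):
        raise ValueError(f"{server_socket} is not under {server_mount}")
    return posixpath.join(client_mount, relative)


# Values as they appear in the reference values.yaml.
expected = "/var/run/vineyard/vineyard.sock"  # VINEYARD_IPC_SOCKET
resolved = socket_path_in_client(
    server_socket="/var/run/vineyard.sock",  # vineyardd --socket
    server_mount="/var/run",                 # sidecar volumeMounts mountPath
    client_mount="/var/run/vineyard",        # extraVolumeMounts mountPath
)
print(resolved == expected)
```

If either mount path or the `--socket` argument is changed, `VINEYARD_IPC_SOCKET` and `AIRFLOW__VINEYARD__IPC_SOCKET` need to be updated to the newly resolved path.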

Note that **the `values.yaml` may not work in your environment**, as Airflow requires
other settings such as a PostgreSQL database, persistent volumes, etc. You can combine
the reference `values.yaml` with your own specific Airflow settings.

The [values.yaml](./values.yaml) for Airflow's helm chart can be used as follows:

```bash
# add airflow helm stable repo
$ helm repo add apache-airflow https://airflow.apache.org
$ helm repo update

# deploy airflow
$ helm install -f values.yaml $RELEASE_NAME apache-airflow/airflow --namespace $NAMESPACE
```
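
When the deployment is scripted, the same command can be assembled programmatically. The sketch below is one possible wrapper, with a hypothetical function name and placeholder release and namespace values; it only builds the argument list shown above and leaves execution to the caller.

```python
import shlex


def helm_install_command(release, namespace, values_file="values.yaml",
                         chart="apache-airflow/airflow"):
    """Build the argument list for the `helm install` call shown above."""
    for label, value in (("release", release), ("namespace", namespace)):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return ["helm", "install", "-f", values_file, release, chart,
            "--namespace", namespace]


# "my-airflow" and "airflow" are placeholders for $RELEASE_NAME and
# $NAMESPACE, not names required by the chart.
command = helm_install_command("my-airflow", "airflow")
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)`. Keeping it as a list avoids shell quoting problems if a release name or path contains unusual characters.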

[1]: https://v6d.io/notes/getting-started.html#starting-vineyard-server
[2]: https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html
[3]: https://github.com/apache/airflow/tree/main/chart
52 changes: 52 additions & 0 deletions python/vineyard/contrib/airflow/etcd.yaml
@@ -0,0 +1,52 @@
# referred from https://github.com/etcd-io/etcd/blob/master/hack/kubernetes-deploy/etcd.yml

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: etcd
    etcd_node: etcd0
  name: etcd0
spec:
  containers:
  - command:
    - /usr/local/bin/etcd
    - --name
    - etcd0
    - --initial-advertise-peer-urls
    - http://etcd0:2380
    - --listen-peer-urls
    - http://0.0.0.0:2380
    - --listen-client-urls
    - http://0.0.0.0:2379
    - --advertise-client-urls
    - http://etcd0:2379
    - --initial-cluster
    - etcd0=http://etcd0:2380
    - --initial-cluster-state
    - new
    image: quay.io/coreos/etcd:v3.4.16
    name: etcd0
    ports:
    - containerPort: 2379
      name: client
      protocol: TCP
    - containerPort: 2380
      name: server
      protocol: TCP
  restartPolicy: Always


---
apiVersion: v1
kind: Service
metadata:
  name: etcd-for-vineyard
spec:
  ports:
  - name: etcd-for-vineyard-port
    port: 2379
    protocol: TCP
    targetPort: 2379
  selector:
    app: etcd
193 changes: 193 additions & 0 deletions python/vineyard/contrib/airflow/values.yaml
@@ -0,0 +1,193 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
---

# Airflow executor
# One of: LocalExecutor, LocalKubernetesExecutor, CeleryExecutor, KubernetesExecutor, CeleryKubernetesExecutor
executor: "CeleryExecutor"

# Environment variables for all airflow containers
env:
  - name: "VINEYARD_IPC_SOCKET"
    value: "/var/run/vineyard/vineyard.sock"
  - name: "AIRFLOW__VINEYARD__IPC_SOCKET"
    value: "/var/run/vineyard/vineyard.sock"

# Airflow scheduler settings
scheduler:
  replicas: 1

  # Command to use when running the Airflow scheduler (templated).
  command: ~
  # Args to use when running the Airflow scheduler (templated).
  args:
    - "bash"
    - "-c"
    - |
      export AIRFLOW__CORE__XCOM_BACKEND=vineyard.contrib.airflow.xcom.VineyardXCom; \
      python3 -m pip install vineyard vineyard-migrate airflow-provider-vineyard; \
      exec airflow scheduler

  # Launch additional containers into scheduler.
  extraContainers:
    - name: vineyard
      image: libvineyard/vineyardd:v0.4.1
      command:
        - /bin/bash
        - "-c"
        - |
          /usr/local/bin/vineyardd \
            --size 4Gi \
            --etcd_endpoint http://etcd-for-vineyard:2379 \
            --etcd_prefix airflow-my-airflow-release \
            --rpc_socket_port 9600 \
            --socket /var/run/vineyard.sock
      securityContext:
        runAsUser: 0
      env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: MY_HOST_NAME
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
      livenessProbe:
        tcpSocket:
          port: 9600
        periodSeconds: 60
      readinessProbe:
        exec:
          command:
            - ls
            - /var/run/vineyard.sock
      volumeMounts:
        - name: vineyard-socket
          mountPath: /var/run
        - name: shm
          mountPath: /dev/shm

  # Mount additional volumes into scheduler.
  extraVolumeMounts:
    - name: vineyard-socket
      mountPath: /var/run/vineyard
    - name: shm
      mountPath: /dev/shm

  extraVolumes:
    - name: vineyard-socket
      emptyDir: {}
    - name: shm
      emptyDir:
        medium: Memory

# Airflow Worker Config
workers:
  # Number of airflow celery workers in StatefulSet
  replicas: 1

  # Command to use when running Airflow workers (templated).
  command: ~
  # Args to use when running Airflow workers (templated).
  args:
    - "bash"
    - "-c"
    # The format below is necessary to get `helm lint` happy
    - |-
      export AIRFLOW__CORE__XCOM_BACKEND=vineyard.contrib.airflow.xcom.VineyardXCom; \
      python3 -m pip install vineyard vineyard-migrate airflow-provider-vineyard; \
      exec \
      airflow {{ semverCompare ">=2.0.0" .Values.airflowVersion | ternary "celery worker" "worker" }}

  extraContainers:
    - name: vineyard
      image: libvineyard/vineyardd:v0.4.1
      command:
        - /bin/bash
        - "-c"
        - |
          id; \
          /usr/local/bin/vineyardd \
            --size 4Gi \
            --etcd_endpoint http://etcd-for-vineyard:2379 \
            --etcd_prefix airflow-my-airflow-release \
            --rpc_socket_port 9600 \
            --socket /var/run/vineyard.sock
      securityContext:
        runAsUser: 0
      env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: MY_HOST_NAME
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
      livenessProbe:
        tcpSocket:
          port: 9600
        periodSeconds: 60
      readinessProbe:
        exec:
          command:
            - ls
            - /var/run/vineyard.sock
      volumeMounts:
        - name: vineyard-socket
          mountPath: /var/run
        - name: shm
          mountPath: /dev/shm

  # Mount additional volumes into worker.
  extraVolumeMounts:
    - name: vineyard-socket
      mountPath: /var/run/vineyard
    - name: shm
      mountPath: /dev/shm

  extraVolumes:
    - name: vineyard-socket
      emptyDir: {}
    - name: shm
      emptyDir:
        medium: Memory
3 changes: 3 additions & 0 deletions setup_airflow.py
@@ -74,6 +74,9 @@ def find_airflow_packages(root):
    long_description_content_type='text/markdown',
    url='https://v6d.io',
    package_dir={'vineyard.contrib.airflow': 'python/vineyard/contrib/airflow'},
    package_data={
        'vineyard.contrib.airflow': ['*.yaml', '*.README'],
    },
    packages=find_airflow_packages('python'),
    cmdclass={'bdist_wheel': bdist_wheel_plat, "install": install_plat},
    zip_safe=False,