Proposal for a XGBOOST operator  #247

Closed
@merlintang

Description

Motivation

XGBoost is a state-of-the-art approach for machine learning. In addition to deploying XGBoost over YARN or Spark, it is necessary to give Kubernetes the ability to handle distributed XGBoost training and prediction. The Kubernetes Operator for XGBoost closes the gap in building distributed XGBoost on Kubernetes, and allows XGBoost applications to be specified, run, and monitored idiomatically on Kubernetes.

The operator allows ML applications based on XGBoost to be specified in a declarative manner (e.g., in a YAML file) and run without the need to deal with the XGBoost submission process. It also enables the status of an XGBoost job to be tracked and presented idiomatically, like other types of workloads on Kubernetes. This document discusses the design and architecture of the operator.

Goals

  1. Provide a common custom resource definition (CRD) for defining a single-node or multi-node XGBoost training and prediction job.

  2. Implement a custom controller to manage the CRD, create dependency resources, and reconcile the desired states.

More details:
  - An XGBoost operator
  - A way to deploy the operator
  - A single-pod XGBoost example
  - A distributed XGBoost example

Non-Goals

Issues or changes not addressed by this proposal:
  - UI or API

Custom Resource Definition
The custom resource submitted to the Kubernetes API would look something like this:

apiVersion: "kubeflow.org/v1alpha1"
kind: "XGBoostJob"
metadata:
  name: "xgboost-example-job"
spec:
  backend: "rabit"
  masterPort: "23456"
  replicaSpecs:
    - replicas: 1
      ReplicaType: MASTER
      template:
        spec:
          containers:
            - image: xgboost/xgboost:latest
              name: master
              imagePullPolicy: IfNotPresent
              command: ["xgboost"]
              args: [
                "-bind-to", "none",
                "-distributed", "yes",
                "-job-type", "train",
                "python", "scripts/xgboost_test/xgboost_test.py",
                "--model", "modelname"
              ]
          restartPolicy: OnFailure
    - replicas: 2
      ReplicaType: WORKER
      template:
        spec:
          containers:
            - image: xgboost/xgboost:latest
              name: worker
          restartPolicy: OnFailure

This XGBoostJob resembles the existing TFJob for the tf-operator. backend defines the protocol the XGBoost workers use to communicate with each other when initializing the worker group. masterPort defines the port the group uses to communicate with the master's Kubernetes service.
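As a rough sketch of how a controller could interpret this resource (the helper names below are illustrative, not the operator's actual API), the world size advertised to every pod is the sum of replicas across all replicaSpecs, and workers reach the master through a per-job Service:

```python
# Hypothetical controller-side helpers (illustrative names only) showing
# how an XGBoostJob spec maps to cluster-level quantities.

def world_size(spec):
    """Total pods in the Rabit group: sum of replicas over all replicaSpecs."""
    return sum(r["replicas"] for r in spec["replicaSpecs"])

def master_service_name(job_id):
    """Name of the Kubernetes Service fronting the master pod."""
    return f"xgboost-master-{job_id}"
```

With the example spec above (1 master + 2 workers), the world size is 3, which matches the WORLD_SIZE environment variable in the resulting pod specs below.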

Resulting Master

kind: Service
apiVersion: v1
metadata:
  name: xgboost-master-${job_id}
spec:
  selector:
    app: xgboost-master-${job_id}
  ports:
  - port: 23456
    targetPort: 23456

Details

apiVersion: v1
kind: Pod
metadata:
  name: xgboost-master-${job_id}
  labels:
    app: xgboost-master-${job_id}
spec:
  containers:
  - image: xgboost/xgboost:latest
    imagePullPolicy: IfNotPresent
    name: master
    env:
      - name: MASTER_PORT
        value: "23456"
      - name: MASTER_ADDR
        value: "localhost"
      - name: WORLD_SIZE
        value: "3"
        # Rank 0 is the master
      - name: RANK
        value: "0"
    ports:
      - name: master-port
        containerPort: 23456
  restartPolicy: OnFailure

Resulting Worker

apiVersion: v1
kind: Pod
metadata:
  name: xgboost-worker-${job_id}
spec:
  containers:
  - image: xgboost/xgboost:latest
    imagePullPolicy: IfNotPresent
    name: worker
    env:
    - name: MASTER_PORT
      value: "23456"
    - name: MASTER_ADDR
      value: xgboost-master-${job_id}
    - name: WORLD_SIZE
      value: "3"
    - name: RANK
      value: "1"
  restartPolicy: OnFailure

Each worker spec generates a pod. The workers communicate with the master through the master's service name.
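The environment in the master and worker pod examples can be derived mechanically from the job: rank 0 is the master and talks to itself over localhost, while each worker resolves the master's Service name. A sketch of that derivation (a hypothetical helper, not taken from the operator's code):

```python
def pod_env(job_id, master_port, world_size, rank):
    """Environment for one pod in the Rabit group; rank 0 is the master."""
    is_master = rank == 0
    return {
        # The master binds locally; workers resolve the master's Service.
        "MASTER_ADDR": "localhost" if is_master else f"xgboost-master-{job_id}",
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }
```

For the 3-pod example job, pod_env(job_id, 23456, 3, 0) reproduces the master's environment and pod_env(job_id, 23456, 3, 1) the first worker's.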

Design

The design of distributed XGBoost follows the Rabit protocol of XGBoost; the Rabit design can be found here. The XGBoost operator thus provides the framework for starting the Rabit master node and slave nodes in the following way.

The Rabit master node is initialized, and each slave node connects to the master node via the provided port and IP. Each worker pod reads its data locally and maps the input data into XGBoost's DMatrix format.
a. For a training job: one of the workers is selected as the host, and the other workers use the host's IP and port to build the Rabit network for training, as shown in Figure 1.
b. For a prediction job: the trained model is propagated to each worker node, which uses its local validation data for prediction.
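On the worker side, the pod environment shown earlier would be translated into the DMLC argument strings that Rabit's initialization consumes before training starts. A hedged sketch of that translation (the variable names follow the pod specs above; the actual call into xgboost's Rabit initialization is omitted so the sketch stays self-contained):

```python
import os

def rabit_args_from_env(env=os.environ):
    """Build the DMLC argument strings a worker would hand to Rabit init."""
    return [
        f"DMLC_TRACKER_URI={env['MASTER_ADDR']}",   # where the tracker/master listens
        f"DMLC_TRACKER_PORT={env['MASTER_PORT']}",
        f"DMLC_NUM_WORKER={env['WORLD_SIZE']}",
        f"DMLC_TASK_ID={env['RANK']}",
    ]

# In the real worker these strings would be passed to xgboost's Rabit
# initialization, which connects to the tracker and builds the
# allreduce network used for training.
```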

[Figure 1: the Rabit network for distributed XGBoost training]

Alternatives Considered

Description of possible alternative solutions and the reasons they were not chosen.
