[CELEBORN-1458][DOC] Introduce decommissioning document

### What changes were proposed in this pull request?

Introduce a decommissioning document that gives users an introduction to worker decommissioning.

### Why are the changes needed?

Users should know how to operate worker decommissioning for maintenance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes apache#2554 from AngersZhuuuu/CELEBORN-1458.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
SteNicholas · Jun 14, 2024 · d7e1510
1 parent e177a20
commit d7e1510
Showing 2 changed files with 86 additions and 0 deletions.
diff --git a/docs/decommissioning.md b/docs/decommissioning.md
@@ -0,0 +1,85 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+    
+      https://www.apache.org/licenses/LICENSE-2.0
+    
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+Decommissioning
+===
+
+## Worker Decommission
+
+Celeborn provides support for decommissioning workers via a REST API, which enables administrators to
+efficiently manage cluster resizing and the removal of unhealthy worker nodes without disrupting ongoing jobs.
+
+## Decommission Process
+
+Here's a detailed breakdown of how the decommissioning process works:
+
+- Decommissioning Request: Administrators can send a decommission request through the REST API
+to initiate the process for one or more worker nodes.
+
+- Handling New Requests: Once the decommissioning process starts, the affected worker nodes will no longer
+accept new shuffle slot requests or new data. This ensures that no new tasks are assigned to
+the workers that are set to be decommissioned.
+
+- Existing Data Handling: The worker nodes will continue to handle their existing shuffle partitions
+until all the partitions have expired. This mechanism ensures that current jobs running on these nodes
+can complete their data shuffle operations without interruption.
+
+- Worker Exit: After all existing shuffle partitions on the worker nodes have expired,
+the workers will gracefully exit. This ensures that the node is safely removed from the cluster
+without causing data loss or job failures.
+
+This decommissioning process is essential for maintaining cluster health and efficiency,
+as it allows for the smooth removal of unhealthy nodes and enables dynamic resizing of the cluster
+to meet varying workload demands.
+
+## Decommission Configuration
+
+| Key                                               | Default |
+|---------------------------------------------------|---------|
+| celeborn.worker.decommission.forceExitTimeout     | 6h      |
+| celeborn.worker.decommission.checkInterval        | 30s     |
+
+
+## Perform Decommissioning
+
+Administrators can perform decommissioning in two ways:
+
+1. Via Celeborn Worker REST API endpoint:
+  ```shell
+  curl --request POST --url 'ip:port/exit' --data '{"type":"Decommission"}'
+  ```
+2. Via Celeborn Master (Leader) REST API endpoint:
+  ```shell
+  curl --request POST --url 'ip:port/sendWorkerEvent' --data '{"type":"Decommission", "workers":"ip_1,ip_2"}'
+  curl --request POST --url 'ip:port/sendWorkerEvent' --data '{"type":"DecommissionThenIdle", "workers":"ip_1,ip_2"}'
+  ```
+
+For details of the decommissioning interface, refer to [REST API](../monitoring/#rest-api).
+
+## Decommission Monitoring
+
+Administrators can monitor the status of the workers to ensure they are gracefully exiting
+after all tasks are complete.
+
+The status of workers under decommission is available through the worker REST API [ip:port/isDecommissioning](../monitoring/#worker_1)
+or the worker metric [IsDecommissioningWorker](../monitoring/#worker).
+Administrators can also monitor the number of decommissioned workers through the master metric [DecommissionWorkerCount](../monitoring/#master).
+
+By providing a REST API and metrics for decommissioning workers,
+Celeborn ensures that cluster administrators have a robust and flexible tool
+to manage cluster resources effectively, enhancing overall system stability and performance.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -71,6 +71,7 @@ nav:
       - Security: security.md
       - Quota Management: quota_management.md
       - Upgrading: upgrading.md
+      - Decommissioning: decommissioning.md
       - Ratis Shell: celeborn_ratis_shell.md
       - Cluster Planning: cluster_planning.md
   - Configuration: configuration/index.md
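
The master-side requests documented in `docs/decommissioning.md` could be wrapped in a small helper, sketched below. This is not part of the commit: the function names `build_worker_event_payload` and `send_worker_event` and the `DRY_RUN` switch are hypothetical, and `ip:port` and `ip_1,ip_2` are the same placeholders the document uses. The sketch only reuses the `/sendWorkerEvent` path, the payload shape, and the two event types shown in the diff. By default it prints the `curl` command rather than sending it.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are assumptions, not Celeborn tooling.

# Print the JSON payload for a worker event.
# $1: event type (Decommission or DecommissionThenIdle)
# $2: comma-separated worker list, e.g. "ip_1,ip_2"
build_worker_event_payload() {
  local event_type="$1" workers="$2"
  case "$event_type" in
    Decommission|DecommissionThenIdle) ;;
    *) echo "unsupported event type: $event_type" >&2; return 1 ;;
  esac
  printf '{"type":"%s", "workers":"%s"}' "$event_type" "$workers"
}

# Send the event to the master (leader) endpoint.
# With DRY_RUN=1 (the default here) the curl command is printed, not executed.
# $1: master address ("ip:port"), $2: event type, $3: worker list
send_worker_event() {
  local master="$1" payload
  payload="$(build_worker_event_payload "$2" "$3")" || return 1
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf "curl --request POST --url '%s/sendWorkerEvent' --data '%s'\n" "$master" "$payload"
  else
    curl --request POST --url "$master/sendWorkerEvent" --data "$payload"
  fi
}
```

For example, `send_worker_event 'ip:port' Decommission 'ip_1,ip_2'` prints the first master command from the document, and setting `DRY_RUN=0` would actually issue the request. Rejecting unknown event types up front avoids sending a malformed request to the leader.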