docs(lep): add "V2 Data Engine Live Upgrade" #9807

Open

derekbit wants to merge 1 commit into master from v2-control-upgrade-lep

Conversation

derekbit (Member):

Which issue(s) this PR fixes:

Issue #

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

@derekbit derekbit self-assigned this Nov 14, 2024
derekbit (Member Author):

@yangchiu Found an error when linking the backport issue.

@derekbit derekbit force-pushed the v2-control-upgrade-lep branch 3 times, most recently from 6643586 to 89091f5, on November 22, 2024 08:01
@derekbit derekbit force-pushed the v2-control-upgrade-lep branch 6 times, most recently from b47a19b to 420f124, on December 3, 2024 06:08
@derekbit derekbit marked this pull request as ready for review December 3, 2024 06:11
@derekbit derekbit requested a review from a team as a code owner December 3, 2024 06:11
@derekbit derekbit requested a review from PhanLe1010 December 3, 2024 06:14
@derekbit derekbit force-pushed the v2-control-upgrade-lep branch from 420f124 to b97a3b0 on December 9, 2024 12:11
innobead (Member) left a comment:

In general, LGTM for this initial version, but I believe some additional details need to be discovered or clarified eventually.

There is also some minor feedback, mostly related to naming and clarifying the behavior.

```
dataEngine: v2
```

4. User can observe the nodes in the cluster being upgraded one by one. During the upgrade of a node’s v2 data engine, a `NodeDataEngineUpgrade` resource for the upgrading node is created. The old instance manager and its pod are deleted, causing the replicas to enter an `error` state. The default instance manager pod then starts and transitions to a `running` state, after which the replicas in the `error` state are automatically rebuilt and become `running`. If the upgrade process is stalled, users can check the status of the `NodeDataEngineUpgrade` resource to troubleshoot issues and resume the upgrade process.
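
To make the per-node flow more concrete, here is a rough sketch of what a `NodeDataEngineUpgrade` resource might look like while a node is being upgraded. The kind and the status fields follow the proposal text; the metadata, the spec fields, and all values are illustrative assumptions only.

```
apiVersion: longhorn.io/v1beta2
kind: NodeDataEngineUpgrade
metadata:
  name: node-data-engine-upgrade-node-1   # hypothetical name
  namespace: longhorn-system
spec:
  nodeID: node-1                           # assumed field: the node being upgraded
  dataEngine: v2
status:
  state: upgrading                         # moves to completed once the replicas are rebuilt and running
  message: ""
```
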
Member:

It might be helpful to briefly explain that volume I/O won’t be affected during this transition, even though it has already been covered elsewhere.


- `undefined`
1. Update `status.instanceManagerImage` from the `defaultInstanceManagerImage` setting
2. Check whether the default instance manager image has been pulled on each Longhorn node. If not, update `status.State` to `error` and set `status.ErrorMessage`
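
As an illustration only (the field names come from the excerpt above; the image tag, node name, and message are made up), the manager status could end up like this when the image is missing on a node:

```
status:
  instanceManagerImage: longhornio/longhorn-instance-manager:vX.Y.Z   # copied from the defaultInstanceManagerImage setting
  state: error
  errorMessage: "default instance manager image is not pulled on node worker-1"
```
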
Member:

In general, in addition to the error message at runtime via status.ErrorMessage, any errors should also be logged as system events or conditions, if possible, as these provide users with a clearer understanding of the system's status. This is more related to implementation though.

derekbit (Member Author):

We can change the status to

```
status:
  state:
  conditions:
    - <condition name>:
      status:
      reason:
      message:
```

However, I tried using a condition and could not find a good name for it. Any suggestions? In addition, for the upgrade status, a condition seems redundant because the message is already set in `status.Message`. That's why I don't use a condition here.

Member:

Conditions can be used to represent mandatory requirements for the upgrade, rather than representing upgrade status.

How about using DefaultImageOnAllNodesReady as the condition for indicating the readiness of the default image?
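
For reference, a condition along those lines might look roughly like the sketch below; the type is the name suggested above, while the state value, reason, and message are placeholders:

```
status:
  state: initializing
  conditions:
    - type: DefaultImageOnAllNodesReady
      status: "False"
      reason: ImageNotReady                 # hypothetical reason
      message: "instance manager image is not pulled on node worker-2"
```
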

- `initializing`
1. Update `status.upgradeNodes`
- If `spec.nodes` is empty, list all nodes and add them to `status.upgradeNodes`
- If `spec.nodes` is not empty, list the nodes in `spec.nodes` and add them to `status.upgradeNodes` (see the sketch below)
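
As a small illustration of the two branches above (node names are hypothetical, and the exact per-node entry shape is not fixed by this excerpt):

```
spec:
  nodes: ["node-1", "node-2"]   # an empty list would select every node in the cluster
status:
  upgradeNodes:
    node-1: {}                  # per-node upgrade progress is tracked here
    node-2: {}
```
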
Member:

Does this mean the unknown nodes will be silently filtered? Throwing errors is recommended, as it informs users that the nodes should match those in the cluster.

derekbit (Member Author):

What do you mean by "unknown nodes"?

Member:

What I’m asking is that users might add nodes to spec.nodes that don’t actually exist in the cluster. In this case, the upgrade should not be allowed to start.

Member:

However, you mentioned in another comment that spec.nodes is immutable, so we should decide its content up front. Then, why do we need these two conditions?

derekbit (Member Author):

> Then, why do we need these two conditions?

Ideally, we should upgrade all the nodes in the cluster, but the design allows users to upgrade only some of them according to their own upgrade plan.

Member:

Then, we might need to ensure the following items:

- All nodes should eventually be upgraded to the same IM version before the next upgrade to prevent IM version drift.
- If some nodes added by users in spec.nodes are not recognized, the upgrade should not start until spec.nodes is corrected by the users.

derekbit (Member Author):

> All nodes should eventually be upgraded to the same IM version before the next upgrade to prevent IM version drift.

We need to add a pre-upgrade check for this.

2. Then, update the `status.currentState` to `upgrading`

- `upgrading`
1. Iterate `status.upgradeNodes`
innobead (Member) commented on Dec 14, 2024:

Should spec.nodes be immutable once the node upgrade has started, or should we allow mutations in case we support upgrade cancellation? It’s suggested to keep the scenario simple initially.

derekbit (Member Author):

spec.nodes is immutable, and upgrade cancellation is not explicitly supported. For now, if an upgrade is stuck and really needs to be cancelled, the user can directly delete the DataEngineUpgradeManager resource.

Member:

Then, this flow should be explained, as this is a valid case for us.


- `completed`

3. According to `NodeDataEngineUpgrade.status.state`, `NodeDataEngineUpgrade controller` does
Member:

Since the upgrade is node-based, the process for recovering from an error during the status transition (before the node upgrade reaches 'completed') needs to be clarified.

derekbit (Member Author):

If a volume on the upgrading node is somehow stuck, nodeDataEngineUpgrade.status.state will remain in the upgrading state rather than entering an error state, so the user needs to resolve the issue before the node upgrade can continue. Currently, the error message is set in status.Message, so the status looks like:

```
status:
  state: upgrading
  message: an error is found.....
```

If nodeDataEngineUpgrade.status.state is in the error state, it won't recover from the error. The user needs to recreate a DataEngineUpgradeManager to start a new upgrade.
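
For example, the unrecoverable case would look roughly like the following (the message is illustrative); at that point, the only path forward is creating a new DataEngineUpgradeManager:

```
status:
  state: error
  message: failed to upgrade the instance manager on node-1: ...
```
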

Member:

Sounds good. Let's also briefly explain this flow.

Longhorn 9104

Signed-off-by: Derek Su <derek.su@suse.com>
@derekbit derekbit force-pushed the v2-control-upgrade-lep branch from b97a3b0 to 722fe84 on December 16, 2024 01:20