docs(lep): add "V2 Data Engine Live Upgrade" #9807
base: master
Conversation
@yangchiu
Force-pushed from 6643586 to 89091f5
Force-pushed from b47a19b to 420f124
Force-pushed from 420f124 to b97a3b0
In general, LGTM for this initial version, but I believe some additional details need to be discovered or clarified eventually.
There is also some minor feedback, mostly related to naming and clarifying the behavior.
```yaml
  dataEngine: v2
```
4. User can observe the nodes in the cluster being upgraded one by one. During the upgrade of a node’s v2 data engine, a `NodeDataEngineUpgrade` resource for the upgrading node is created. The old instance manager and its pod are deleted, causing the replicas to enter an `error` state. The default instance manager pod then starts and transitions to a `running` state, after which the replicas in the `error` state are automatically rebuilt and become `running`. If the upgrade process is stalled, users can check the status of the `NodeDataEngineUpgrade` resource to troubleshoot issues and resume the upgrade process.
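For readers of this step, a rough sketch of what a per-node `NodeDataEngineUpgrade` resource might look like is shown below. The API group/version, naming scheme, and spec fields are assumptions for illustration, not the final schema.

```yaml
# Hypothetical sketch only; the real CRD schema is defined by the LEP/implementation.
apiVersion: longhorn.io/v1beta2      # assumed API group/version
kind: NodeDataEngineUpgrade
metadata:
  name: node-1-data-engine-upgrade   # assumed naming scheme
  namespace: longhorn-system
spec:
  nodeID: node-1                     # the node whose v2 instance manager is being upgraded
  dataEngine: v2
status:
  state: upgrading                   # per-node state referenced elsewhere in the LEP
  message: ""
```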
It might be helpful to briefly explain that volume I/O won’t be affected during this transition, even though it has already been covered elsewhere.
- `undefined`
  1. Update `status.instanceManagerImage` from the `defaultInstanceManagerImage` setting
  2. Check whether the default instance manager image has been pulled on each Longhorn node. If not, update `status.State` to `error` and set `status.ErrorMessage`
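As a concrete illustration of the error branch in step 2, the manager's status might end up looking roughly like the sketch below. Field casing, the image tag placeholder, and the message text are assumptions based on the fields named above.

```yaml
# Hypothetical sketch of DataEngineUpgradeManager status when the default
# instance manager image has not been pulled on some node.
status:
  instanceManagerImage: longhornio/longhorn-instance-manager:vX.Y.Z  # from the defaultInstanceManagerImage setting
  state: error
  errorMessage: "default instance manager image is not pulled on node node-2"
```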
In general, in addition to the error message at runtime via `status.ErrorMessage`, any errors should also be logged as system events or conditions, if possible, as these provide users with a clearer understanding of the system's status. This is more related to implementation though.
We can change the status to

    status:
      state:
      conditions:
        - <condition name>:
            status:
            reason:
            message:

However, I tried to use a condition and could not find a better name for it. Any suggestions? In addition, for the upgrade status, a condition seems redundant because the message is already set in `status.Message`. That's why I don't use a condition here.
Conditions can be used to represent mandatory requirements for the upgrade, rather than representing upgrade status.
How about using `DefaultImageOnAllNodesReady` as the condition for indicating the readiness of the default image?
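If the condition approach were adopted, a populated status could look roughly like the sketch below; the reason string and the surrounding fields are illustrative assumptions, not part of the proposal.

```yaml
# Hypothetical sketch of a condition-based DataEngineUpgradeManager status.
status:
  state: initializing
  conditions:
    - type: DefaultImageOnAllNodesReady
      status: "False"
      reason: ImageNotPulled                # assumed reason name
      message: "default instance manager image is not pulled on node node-2"
```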
- `initializing`
  1. Update `status.upgradeNodes`
     - If `spec.nodes` is empty, list all nodes and add them to `status.upgradeNodes`
     - If `spec.nodes` is not empty, list the nodes in `spec.nodes` and add them to `status.upgradeNodes`
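To illustrate the two branches above, a sketch of how `spec.nodes` and `status.upgradeNodes` might relate during `initializing` is shown below; the map shape and per-node values are assumptions for illustration only.

```yaml
# Hypothetical sketch: spec.nodes selects a subset of nodes, and the controller
# copies the effective node list into status.upgradeNodes while initializing.
spec:
  nodes:
    - node-1
    - node-2
status:
  currentState: initializing
  upgradeNodes:
    node-1: pending        # per-node state values are assumed for illustration
    node-2: pending
```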
Does this mean unknown nodes will be silently filtered out? Throwing an error is recommended, as it informs users that the nodes should match those in the cluster.
What does `unknown nodes` mean?
What I’m asking is that users might add nodes to `spec.nodes` that don’t actually exist in the cluster. In this case, the upgrade should not be allowed to start.
However, you mentioned in another comment that `spec.nodes` is immutable, so we need to decide its content up front. Then, why do we need these two `if` conditions?
> Then, why do we need these two `if` conditions?

Ideally, we should upgrade all nodes in the cluster, but the design allows users to upgrade only some of the nodes according to their upgrade plan.
Then, we might need to ensure the following:
- All nodes should eventually be upgraded to the same IM version before the next upgrade, to prevent IM version drift.
- If some nodes added by users in `spec.nodes` are not recognized, the upgrade should be avoided until `spec.nodes` is corrected by the users.
> All nodes should eventually be upgraded to the same IM version before the next upgrade, to prevent IM version drift.

We need to add the pre-upgrade check.
  2. Then, update `status.currentState` to `upgrading`

- `upgrading`
  1. Iterate `status.upgradeNodes`
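For example, while the manager iterates `status.upgradeNodes`, its status might progress roughly as in the sketch below, one node at a time; the field shapes and per-node values are assumptions for illustration.

```yaml
# Hypothetical sketch of DataEngineUpgradeManager status mid-upgrade.
status:
  currentState: upgrading
  upgradeNodes:
    node-1: completed      # already upgraded
    node-2: upgrading      # a NodeDataEngineUpgrade resource currently exists for this node
    node-3: pending        # not started yet
```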
Should `spec.nodes` be immutable for the node once the node upgrade has started, or should we allow mutations in case we support upgrade cancellation? It’s suggested to keep the scenario simple initially.
`spec.nodes` is immutable, and upgrade cancellation is not explicitly supported. For now, if an upgrade is stuck and really needs to be cancelled, the user can directly delete the `dataEngineUpgradeManager` resource.
Then, this flow should be explained, as this is a valid case for us.
- `completed`

3. According to `NodeDataEngineUpgrade.status.state`, the `NodeDataEngineUpgrade controller` does
Since the upgrade is node-based, the process for recovering from an error during the status transition (before the node upgrade reaches 'completed') needs to be clarified.
If a volume on the upgrading node is somehow stuck, `nodeDataEngineUpgrade.status.state` will be stuck in the `upgrading` state and won't move to an `error` state, so the user needs to resolve the issue before the node upgrade can continue. Currently, the error message will be set in `status.Message`, so the status is

    status:
      state: upgrading
      message: an error is found.....

If `nodeDataEngineUpgrade.status.state` is in the `error` state, it won't be recovered from the error. The user needs to recreate a `DataEngineUpgradeManager` for a new upgrade.
Sounds good. Let's also briefly explain this flow.
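For reference, the recovery flow agreed on above (recreating the manager after a terminal `error`) would amount to creating a fresh resource, roughly as in this sketch; the API group/version, resource name, and spec fields are assumptions for illustration.

```yaml
# Hypothetical sketch: a new DataEngineUpgradeManager created after deleting the
# failed one; spec.nodes could be narrowed to the nodes that still need upgrading.
apiVersion: longhorn.io/v1beta2      # assumed API group/version
kind: DataEngineUpgradeManager
metadata:
  name: v2-upgrade-retry
  namespace: longhorn-system
spec:
  dataEngine: v2
  nodes:
    - node-2
```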
Longhorn 9104

Signed-off-by: Derek Su <derek.su@suse.com>
Force-pushed from b97a3b0 to 722fe84
Which issue(s) this PR fixes:
Issue #
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context