Rearrange troubleshooting content
Signed-off-by: Bryan Cox <brcox@redhat.com>
bryan-cox committed Jan 22, 2024
1 parent 6987bd6 commit 200ab63
Showing 6 changed files with 51 additions and 52 deletions.
2 changes: 1 addition & 1 deletion docs/content/how-to/aws/disaster-recovery.md
@@ -41,7 +41,7 @@ Regarding the Workers nodes assigned to the cluster, during the migration they w
These next arguments depend on how the Hypershift Operator has been deployed and how a Hosted Cluster has been created. E.g., if we want to go ahead with the procedure and our cluster is **private**, we need to make sure that our **Hypershift Operator** has been deployed with the arguments set in the **Private** tab for **Hypershift Operator Deployment access endpoints arguments**, and that our **Hosted Cluster** has been created using the arguments in the **Private** tab of the **Arguments of the CLI when creating a HostedCluster** section down below.

!!! warning
Since this is a disaster recovery procedure, unexpected things could happen because of all the moving components involved. To assist, [troubleshooting section](./troubleshooting.md#disaster-recovery---hostedcluster-migration) with the most common issues identified is provided.
Since this is a disaster recovery procedure, unexpected things could happen because of all the moving components involved. To assist, see this [troubleshooting section](./troubleshooting/troubleshooting-disaster-recovery.md) for the most common issues identified.

- Hypershift Operator Deployment endpoint access arguments

File renamed without changes.
5 changes: 5 additions & 0 deletions docs/content/how-to/aws/troubleshooting/index.md
@@ -0,0 +1,5 @@
# Troubleshooting HyperShift on AWS
This section of the HyperShift documentation contains pages related to troubleshooting specific issues when using the AWS cloud provider.

- [Debug Missing Nodes](debug-nodes.md)
- [Debug Disaster Recovery Issues](troubleshooting-disaster-recovery.md)
@@ -1,15 +1,7 @@
---
title: Troubleshooting
---

# Troubleshooting

## Disaster Recovery - Hosted Cluster Migration

These are issues related with disaster recovery that we've identified and you could face during a Hosted Cluster migration.

### The new workloads do not got scheduled in the new migrated cluster
# Debug Disaster Recovery - Hosted Cluster Migration
These are disaster-recovery issues that we've identified and that you could face during a Hosted Cluster migration.

## New workloads do not get scheduled in the new migrated cluster
Everything looks normal in the destination Management and Hosted Cluster and in the old Management and Hosted Cluster, but your new workloads do not get scheduled in your migrated Hosted Cluster (your old ones should keep working properly).

Eventually your pods begin to fail and the cluster status becomes degraded.
@@ -35,7 +27,7 @@ Eventually the Hosted Cluster will start self healing and the ClusterOperator wi

**Cause:** After the Hosted Cluster migration, the KAS (Kube API Server) uses the same DNS name, but it points to a different load balancer on the AWS platform. Sometimes OVN does not behave correctly when facing this situation.
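
A quick way to confirm the DNS side (a sketch; the hostname below is a placeholder for your Hosted Cluster's API endpoint) is to check what the KAS name currently resolves to and compare it against the new load balancer:

```
# Placeholder hostname; substitute your Hosted Cluster's KAS DNS name
dig +short api.my-hosted-cluster.example.com
```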

### The migration gets blocked in ETCD recovery
## The migration gets blocked in ETCD recovery

The context is basically: "I've edited the Hosted Cluster adding the `ETCDSnapshotURL`, but the modification disappears and the migration does not continue."

@@ -54,11 +46,11 @@ oc delete pod -n hypershift -lapp=operator

**Cause:** This issue happens when the Hypershift operator is down and the Hosted Cluster controller cannot handle modifications to the objects which belong to it.
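
Before retrying the edit, it's worth verifying that the operator pods are back up (a sketch using the same label selector as the delete command above):

```
# The HyperShift operator runs in the hypershift namespace with the app=operator label
oc get pods -n hypershift -l app=operator
```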

### The nodes cannot join the new Hosted Cluster and stay in the older one
## The nodes cannot join the new Hosted Cluster and stay in the older one

We have 2 paths to follow and depends if [this code](https://github.com/openshift/hypershift/pull/2265) is in your Hypershift Operator.
We have 2 paths to follow, and which one to take depends on whether [this code](https://github.com/openshift/hypershift/pull/2265) is running in your Hypershift Operator.

#### The PR is merged and my Hypershift Operator has that code running
### The PR is merged and my Hypershift Operator has that code running

If that's the case, you need to make sure your Hosted Cluster is paused:
```
@@ -69,15 +61,15 @@ If this command does not give you any output, make sure you've followed properly

Even if it's paused and the cluster is still in that situation, please **continue to the next section**, because it's highly probable that you don't have the code which manages this situation properly.
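
For a direct check (a sketch; the variables are placeholders for your HostedCluster's name and namespace), the `pausedUntil` field can be read straight from the resource:

```
# Prints nothing if the HostedCluster is not paused
oc get hostedcluster ${CLUSTERNAME} -n ${CLUSTERNS} -o jsonpath='{.spec.pausedUntil}'
```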

#### The PR is not merged or my Hypershift Operator does not have that code running
### The PR is not merged or my Hypershift Operator does not have that code running

If that's not the case, the only way to solve it is to execute the teardown of the old Hosted Cluster prior to the full restoration in the new Management cluster. Make sure you already have all the Manifests and the ETCD backed up.
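
As a sketch, the teardown on AWS would look something like the following (the exact flags depend on how the cluster was created, and the variables are placeholders):

```
hypershift destroy cluster aws \
  --name ${CLUSTERNAME} \
  --namespace ${CLUSTERNS} \
  --aws-creds ~/.aws/credentials
```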

Once you have followed the teardown procedure for the old Hosted Cluster, you will see the migrated Hosted Cluster begin to self-recover.

**Cause:** This issue occurs when the old Hosted Cluster has a conflict with the AWSPrivateLink object. The old one is still running and the new one cannot handle it because the `hypershift.local` AWS internal DNS entry still points to the old LoadBalancer.
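
One way to inspect that conflict (a sketch; `AWSEndpointService` is the resource behind the AWSPrivateLink integration, and the namespaces are placeholders) is to compare the endpoint service objects of both control planes:

```
# Compare the endpoint services of the old and the new HostedControlPlane namespaces
oc get awsendpointservice -n ${OLD_HCP_NAMESPACE}
oc get awsendpointservice -n ${NEW_HCP_NAMESPACE}
```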

### Dependant resources block the old Hosted Cluster teardown
## Dependent resources block the old Hosted Cluster teardown

To solve this issue, you need to check all the objects in the HostedControlPlane namespace and make sure all of them are being terminated. To do that, we recommend using an external tool called [ketall](https://github.com/corneliusweig/ketall), which gives you a complete overview of all the resources in a Kubernetes cluster; see the sketch below.
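
This sketch assumes `ketall` is installed and uses a placeholder namespace:

```
# List every resource still present in the HostedControlPlane namespace
ketall -n ${HCP_NAMESPACE}
```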

@@ -110,7 +102,7 @@ Eventually, the namespace will be successfully terminated and also the Hosted Cl

**Cause:** This is a pretty common issue in the Kubernetes/OpenShift world. You are trying to delete a resource that has other dependent objects. The finalizer is still trying to delete them but cannot make progress.
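
As a last resort (a generic Kubernetes pattern, not HyperShift-specific, and only safe once you're certain no controller will ever process the finalizer), a stuck finalizer can be cleared by hand:

```
# Resource kind, name, and namespace are placeholders
oc patch <resource> <name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'
```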

### The Storage ClusterOperator keeps reporting "Waiting for Deployment"
## The Storage ClusterOperator keeps reporting "Waiting for Deployment"

To solve this issue, you need to check that all the pods from the **HostedCluster** and the **HostedControlPlane** are running and not blocked, and that there are no issues in the `cluster-storage-operator` pod. After that, you need to delete the **AWS EBS CSI Drivers** from the HCP namespace in the destination management cluster:
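
The exact commands are collapsed in this hunk; as a sketch (the deployment names below are assumptions based on the usual AWS EBS CSI component naming, so verify them first):

```
# Verify the actual deployment names before deleting anything
oc get deployments -n ${HCP_NAMESPACE} | grep ebs
oc delete deployment -n ${HCP_NAMESPACE} aws-ebs-csi-driver-controller aws-ebs-csi-driver-operator
```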

@@ -124,7 +116,7 @@ The operator will take a while to raise up again and eventually the driver contr
**Cause:** This issue probably comes from objects that are deployed by the operator (in this case, `cluster-storage-operator`) but that the controller or the operator does not reconcile. If you delete the deployments, you ensure they are recreated from scratch.


### The image-registry ClusterOperator keeps reporting a degraded status
## The image-registry ClusterOperator keeps reporting a degraded status

When a migration is done and the image-registry ClusterOperator is marked as degraded, you will need to figure out how it reached that status. The message will look like `ImagePrunerDegraded: Job has reached the specified backoff limit`.
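
A quick way to pull the degraded condition out of the ClusterOperator (run against the guest cluster) could be:

```
# Print the reason and message of the Degraded condition
oc get clusteroperator image-registry \
  -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.reason}{": "}{.message}{"\n"}{end}'
```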

@@ -1,8 +1,8 @@
# Troubleshooting

## How to dump the HostedCluster resources from a management cluster

In order to dump the relevant HostedCluster objects we will need some prerequisites:
## General
### Dump HostedCluster resources from a management cluster
To dump the relevant HostedCluster objects, we will need some prerequisites:

- `cluster-admin` access to the management cluster
- The HostedCluster `name` and the `namespace` where the CR is deployed
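
The dump command itself is collapsed in this diff; a sketch of the invocation (the variables are placeholders) would be:

```
hypershift dump cluster \
  --name ${CLUSTERNAME} \
  --namespace ${CLUSTERNS} \
  --dump-guest-cluster \
  --artifact-dir clusterDump-${CLUSTERNS}-${CLUSTERNAME}
```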
@@ -32,7 +32,25 @@ After some time, the output will show something like this:

This dump contains artifacts that aid in troubleshooting issues with hosted control plane clusters.

### Impersonation
#### Contents from Dump Command
The Management Cluster's dump content:

- **Cluster scoped resources**: Basically the node definitions of the management cluster.
- **The dump compressed file**: This is useful if you need to share the dump with other people.
- **Namespaced resources**: This includes all the objects from all the relevant namespaces, like configmaps, services, events, logs, etc.
- **Network logs**: Includes the OVN northbound and southbound DBs and the statuses for each one.
- **HostedClusters**: Another level of dump that involves all the resources inside the guest cluster.

The Guest Cluster dump content:

- **Cluster scoped resources**: It contains all the cluster-wide objects, things like nodes, CRDs, etc.
- **Namespaced resources**: This includes all the objects from all the relevant namespaces, like configmaps, services, events, logs, etc.

!!! note

**The dump will not contain any Secret object** from the cluster, only references to the secret's names.

#### Impersonation as a user/service account

The dump command can be used with the `--as` flag, which works in the same way as in the `oc` client. If you execute the command with this flag, the CLI will impersonate all the queries against the management cluster using that username or service account.

@@ -75,27 +93,9 @@ hypershift dump cluster \
--artifact-dir clusterDump-${CLUSTERNS}-${CLUSTERNAME}
```

### Dump Content

The Management's Cluster's dump content:

- **Cluster scoped resources**: Basically nodes definitions of the management cluster.
- **The dump compressed file**: This is useful if you need to share the dump with other people
- **Namespaced resources**: This includes all the objects from all the relevant namespaces, like configmaps, services, events, logs, etc...
- **Network logs**: Includes the OVN northbound and southbound DBs and the statuses for each one.
- **HostedClusters**: Another level of dump, involves all the resources inside of the guest cluster.

The Guest Cluster dump content:

- **Cluster scoped resources**: It contains al the cluster-wide objects, things like nodes, CRDs, etc...
- **Namespaced resources**: This includes all the objects from all the relevant namespaces, like configmaps, services, events, logs, etc...

!!! note

**The dump will not contain any Secret object** from the cluster, only references to the secret's names.

## Troubleshooting sections by provider

If you have some provider scoped questions, please take a look the troubleshooting section in the provider list down below. We will keep adding more and more troubleshooting sections and updating the existent ones.
## Troubleshoot By Provider
If you have provider-scoped questions, please take a look at the troubleshooting section for the provider in the list below.
We will keep adding more troubleshooting sections and updating the existing ones.

- [Hypershift in AWS troubleshooting](./aws/troubleshooting.md)
- [AWS](./aws/troubleshooting/index.md)
- [Azure](./azure/troubleshooting)
12 changes: 7 additions & 5 deletions docs/mkdocs.yml
@@ -54,25 +54,27 @@ nav:
- how-to/restart-control-plane-components.md
- how-to/pause-reconciliation.md
- how-to/per-hostedcluster-dashboard.md
- how-to/debug-nodes.md
- how-to/metrics-sets.md
- how-to/troubleshooting.md
- how-to/troubleshooting-general.md
- how-to/etcd-recovery.md
- 'Automated Machine Management':
- how-to/automated-machine-management/index.md
- how-to/automated-machine-management/scale-to-zero-dataplane.md
- how-to/automated-machine-management/nodepool-lifecycle.md
- how-to/automated-machine-management/node-tuning.md
- how-to/automated-machine-management/configure-machines.md
- "AWS":
- 'AWS':
- how-to/aws/create-aws-hosted-cluster-arm-workers.md
- how-to/aws/create-infra-iam-separately.md
- how-to/aws/create-aws-hosted-cluster-multiple-zones.md
- how-to/aws/deploy-aws-private-clusters.md
- how-to/aws/external-dns.md
- how-to/aws/etc-backup-restore.md
- how-to/aws/disaster-recovery.md
- how-to/aws/troubleshooting.md
- how-to/aws/create-aws-hosted-cluster-arm-workers.md
- 'Troubleshooting':
- how-to/aws/troubleshooting/index.md
- how-to/aws/troubleshooting/debug-nodes.md
- how-to/aws/troubleshooting/troubleshooting-disaster-recovery.md
- 'Azure':
- how-to/azure/create-azure-cluster.md
- how-to/azure/create-azure-cluster-with-options.md