
Slowness Observed in OpenBao When Using Raft Storage Backend in Amazon EKS Environment #573

Open
sspirate24 opened this issue Oct 1, 2024 · 7 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@sspirate24

We are experiencing significant slowness in OpenBao when using Raft as the storage backend in the Amazon EKS environment, particularly with the gp3 storage class. In contrast, a similar setup in Azure Kubernetes Service (AKS) operates normally, utilizing the STANDARD_SSD storage class, with consistent execution times for the same operations.

Steps to Reproduce the Behavior

  1. Run the testing script (a rough sketch follows this list), which includes:
    • Authentication: Retrieves a Vault token using ROLE_ID and SECRET_ID.
    • Decryption: Decrypts ciphertext using the Vault token.
    • Token Revocation: Revokes the Vault token after decryption.
  2. Observe the progressively increasing time taken for these operations in the EKS environment.
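
Roughly, the script does something like the following against the OpenBao HTTP API (a simplified sketch; the address, AppRole mount, transit key name, and ciphertext are placeholders, not our real values):

#!/usr/bin/env bash
ADDR="http://openbao.openbao-internal:8200"   # placeholder address

# 1. Authentication: exchange ROLE_ID/SECRET_ID for a token.
TOKEN=$(curl -s -X POST "$ADDR/v1/auth/approle/login" \
  -d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
  | jq -r '.auth.client_token')

# 2. Decryption: decrypt ciphertext via the transit engine ("my-key" is a placeholder).
curl -s -X POST "$ADDR/v1/transit/decrypt/my-key" \
  -H "X-Vault-Token: $TOKEN" \
  -d "{\"ciphertext\":\"$CIPHERTEXT\"}" \
  | jq -r '.data.plaintext' | base64 -d

# 3. Token revocation: revoke the token after decryption.
curl -s -X POST "$ADDR/v1/auth/token/revoke-self" -H "X-Vault-Token: $TOKEN"

The time taken by each of these three calls is what grows progressively in the EKS environment.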

Expected Behavior

The authentication, decryption, and token revocation operations should execute quickly and consistently, similar to the performance observed in the AKS environment.

Environment

  • OpenBao Server Version:
    bao status
    Key                     Value
    ---                     -----
    Seal Type               shamir
    Initialized             true
    Sealed                  false
    Total Shares            3
    Threshold               3
    Version                 2.0.0
    Build Date              2024-07-17T22:05:43Z
    Storage Type            raft
    Cluster Name            bao-cluster-90a81699
    Cluster ID              f6492ef5-82d3-3994-efed-fc5e26c36a4a
    HA Enabled              true
    HA Cluster              https://openbao-0.openbao-internal:8201
    HA Mode                 active
    Active Since            2024-10-01T13:39:41.627572828Z
    Raft Committed Index    6916298
    Raft Applied Index      6916298
    
    
  • OpenBao CLI Version: OpenBao v2.0.0
  • Server Operating System/Architecture: Ubuntu x86_64

OpenBao server configuration file(s):

apiVersion: v1
data:
  extraconfig-from-values.hcl: |2-

    disable_mlock = true
    ui = true

    listener "tcp" {
      tls_disable = 1
      address = "[::]:8200"
      cluster_address = "[::]:8201"
    }

    storage "raft" {
      path = "/openbao/data"
      retry_join {
        leader_api_addr = "http://openbao-0.openbao-internal:8200"
      }
      retry_join {
        leader_api_addr = "http://openbao-1.openbao-internal:8200"
      }
      retry_join {
        leader_api_addr = "http://openbao-2.openbao-internal:8200"
      }
    }

    service_registration "kubernetes" {}
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: openbao
    meta.helm.sh/release-namespace: dummy
  creationTimestamp: "2024-09-30T11:17:07Z"
  labels:
    app.kubernetes.io/instance: openbao
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: openbao
    helm.sh/chart: openbao-0.4.0
  name: openbao-config
  namespace: dummy

Additional context

The only notable log entry observed continuously during testing:

 2024-10-01T13:07:29.609Z [INFO] expiration: revoked lease: lease_id=auth/approle/login/h66187244acd884b39ff9ba11f230e101b2f9baae45d793ebb4f1e4e9f001da6e
sspirate24 added the bug label on Oct 1, 2024
@cipherboy
Member

@sspirate24 I'm not an AWS expert but it looks like their docs mention that you can choose your IOPS: https://aws.amazon.com/ebs/general-purpose/

What are your limits configured at and are they similar to Azure's?
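
For reference, you should be able to read the provisioned values off the EBS volume backing the PVC with something like this (the volume ID is a placeholder):

aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].{Type:VolumeType,Iops:Iops,Throughput:Throughput}' \
  --output table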

@sspirate24
Author

@cipherboy
GP3 is far better than Azure Standard SSD. The gp3 volumes provide a baseline of 3,000 IOPS and 125 MiBps, regardless of the volume size. As for Azure, their standard SSD offers 500 IOPS with the ability to burst up to 3,000 IOPS.

What details and logs should I collect from my end to troubleshoot this issue in EKS?

@cipherboy
Member

@sspirate24 A few things:

  1. Running with log_level=trace would be great (see the sketch after this list).
  2. It'd be interesting to see something like iostat -x on both, if you can -- is the disk saturated? Is there increased latency?
  3. It'd also be interesting to get network and CPU performance stats for both systems as well if you can.
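
For example, something along these lines (generic sketches, adjust to your setup):

# 1. In the OpenBao server config (HCL), raise log verbosity:
log_level = "trace"

# 2. On each node, sample extended disk stats every 5 seconds;
#    watch %util for saturation and the await columns for latency:
iostat -x 5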

To my knowledge, there is nothing different in the behavior on any host type, other than the performance_multiplier option, which I don't believe should affect general (maximum?) operation speed other than when leadership changes occur.

It'd be interesting to see if there are leadership elections taking place, i.e., if OpenBao thinks the system/network/... is slow to the point of it timing out holding leader and wanting to re-elect a new one.

Some thoughts!

@sspirate24
Author

@cipherboy
I have collected the logs and metrics you asked for from both AKS and EKS so we can compare the two environments and troubleshoot the issue.

Point to Note: We don't use OpenBao to store many secrets; our primary use case is leveraging the transit engine for encryption and decryption.

Log files:
Logs.zip

AKS vs EKS Visualizations

Screenshots are attached comparing AKS and EKS side by side for each of the following:

  1. CPU Usage
  2. Network Traffic
  3. Memory Usage
  4. Disk Throughput Over Time
  5. Cumulative Disk Usage Over Time
  6. CPU Usage Over Time
  7. I/O Wait vs Disk Throughput
  8. Transactions Per Second Over Time

Azure Metrics

Screenshots of the Azure VM and volume metrics are attached.

AWS Metrics

Screenshots of the AWS VM metrics are attached.

cipherboy added the help wanted label on Nov 13, 2024
@cipherboy
Member

cipherboy commented Nov 25, 2024

@sspirate24 Sorry about the delay here. Honestly, nothing in the stats you shared really jumps out at me as an explanation for why it would be slower.

Point to Note: We don't use OpenBao to store many secrets; our primary use case is leveraging the transit engine for encryption and decryption.

I'm surprised Transit is much slower, tbh. Most keys should be cached (unless you've turned that off or have a large number of distinct keys), so it shouldn't be hitting disk.

Are there other environmental factors perhaps? Different CPU features (e.g., no AES-NI acceleration) or network speed?
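
For a quick comparison across the two node types, something like this (generic Linux commands):

# Does the CPU advertise AES-NI?
grep -m1 -o aes /proc/cpuinfo

# Rough single-core AES-GCM throughput benchmark:
openssl speed -evp aes-256-gcm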


As an aside, I don't think Transit has its own metrics, but OpenBao in general does; I'd be curious whether you've hooked those up to a monitoring solution and what they say. We'd also probably take a PR if you wanted to add metrics to Transit!

@sspirate24
Author

@cipherboy
Regarding metrics, could you please clarify which specific metrics would be most beneficial to track in this context? Any pointers or documentation links on what to monitor for the transit engine or OpenBao in general would be helpful.

@cipherboy
Member

@sspirate24 Raft should emit the following metrics: https://openbao.org/docs/internals/telemetry/metrics/raft/

You'd have to configure a telemetry provider in the config file: https://openbao.org/docs/configuration/telemetry/

The core request metrics might also be interesting, along with the vault.route.* too: https://openbao.org/docs/internals/telemetry/metrics/all/
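
As an illustration, a minimal Prometheus-style stanza would look roughly like this (values are placeholders; see the config docs above for the full option list):

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
}

The metrics should then be readable from the /v1/sys/metrics?format=prometheus endpoint (authenticated by default).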
